Teaching TDM on Education and Skill Development WG Case Statement

20 Oct 2016

In the healthcare sector, 1.3 million new pieces of research related to biomedical science alone are published each year.[2] A typical database search returns about 80,000 hits, and only 4,000 of those are likely to be very relevant to a researcher’s work. Text and Data Mining (TDM) techniques can already be used to zoom in on the top 25% of papers which are most relevant to any given search query. Researchers believe that, with a little more work, it will be possible to use TDM to identify the top 10% of search results. In a similar vein, the quantity of data being created has also grown exponentially, making it difficult to handle and analyse. Data mining techniques are needed to help researchers to spot patterns in large batches of data.

TDM was initially defined as “the discovery by computer of new, previously unknown information, by automatically extracting and relating information from different (…) resources, to reveal otherwise hidden meanings.” Its applicability in all fields of research is growing in this age of information overload.[3]

Recent studies show that the uptake of TDM is lacking.[4] One of the reasons is the lack of awareness and skill amongst researchers, librarians, and industry practitioners.[5] A key conclusion from the Publishing Research Consortium survey on TDM was that ‘Awareness of text mining techniques is still relatively low.’[6] Moreover, the European Communication Monitor identified a ‘gap between training offered and development needs’.[7] Both industry and academia have confirmed a need for education on TDM. Our focus will be on providing basic skills so as to reach the widest audience. We have therefore decided to establish a Working Group with a clear focus and purpose: to develop a course within the 18-month time frame.


Purpose of Initiative

This Working Group aims to address the current skills gap identified with respect to Text and Data Mining (TDM) and help improve the adoption of these practices in a range of research disciplines.

TDM is a cross-cutting skill of value to a wide range of researchers. This Working Group aims to develop a short module that can plug into existing courses (e.g. the CODATA-RDA School of Research Data Science and existing university research skills courses) to equip researchers and practitioners with basic TDM skills and increase their use.


Scope of Initiative

The Working Group aims to develop a short introductory programme and related content (presentations, exercises and case studies) to introduce researchers[8] to TDM and provide practical experience in applying open source tools to use these skills in their field of research.[9]
The design of the course will be based on research and feedback from the stakeholder communities in the upcoming months (see timeline and workplan). More specifically, the content and the proposed duration of the course will be determined after these consultations. For now we envision a 1-2 day modular course for people with no prior knowledge that includes stand-alone modules, lessons and elements that can be selected independently depending on the focus and level of knowledge of the participants. The course can be spread out over several days or weeks to fit within existing courses and training programmes.

The introductory course will not be discipline-specific, though later iterations could be tailored to particular disciplines if needed, for example by going into more detail on discipline-related fields of interest and expertise. Although the 1-2 day course aims to address the skills gap for researchers with no prior knowledge, we anticipate that we may need to extend the course to 4-5 days if we find that more basic introductory material is needed, for example on the more technical aspects of TDM.

The course and course materials will be made available online in an easy-to-use, modular digital format, accessible to anyone who wishes to use the course or adapt it to suit their specific level or audience.[10]


Background to Initiative

The European projects FutureTDM, FOSTER and EDISON confirm that there is a growing demand for researchers who understand and are able to use TDM and that current education is falling behind in providing people with the skills and knowledge needed both in academia and industry.[11]

At RDA Plenary 8, a discussion on TDM in the IG on Education session confirmed community interest in developing training materials to address the skills gap. This working group therefore aims to look at how education and in-work training can help fill the gap and create enough expert data scientists.[12]


Relevance of the Initiative

Given the many benefits of TDM for research and society, this topic is relevant for RDA. By designing a course to cover TDM skills, developing course materials, and making them available to the community, we can contribute to bridging this gap. This will include learning outcomes (essential and desirable) and course content (specific readings, lecture and discussion content, class activities, practical assignments, and graded assignments).


Proposed Outcome

The aim is to develop a generic, adaptable course or training module on TDM skills and knowledge that can then be used by different disciplines.


Timeline and Workplan: Term: 12-18 months

Quarter one - 2017: Requirements gathering phase

This will include identifying survey participants (such as existing course providers, the research community, industry partners, librarians and RDA members) and undertaking a questionnaire to understand what skills need to be covered in an introductory TDM course.

Analysing survey outputs and drafting a course design, learning outcomes and programme for consultation at the RDA plenary in Barcelona.

This work will be conducted via virtual meetings and desk-based research.

Deliverable: Survey and results

Milestone: Preliminary course outline for discussion in Barcelona

Quarter two - 2017: Course development

Development of course content, including specific readings, lecture and discussion content, class activities, practical exercises and graded assignments. For this we will look at existing courses and tutorials and build upon those with input from the TDM community, such as users and tool developers. For example, we will work together with Contentmine, industry partners such as SAS, and at least two universities that have expressed interest in adopting a course.

Establishing an international network of experts and potential TDM trainers. This will build on the initial survey work and contacts developed through the WG and will support roll-out and reuse of the materials.

The majority of this work will be conducted virtually, with OKFN leading. At least one face-to-face meeting will be scheduled to help define the structure of the course and/or develop key components.

Deliverable: A draft set of training materials and user guides ready for testing

Quarter three - 2017: Testing

Liaising with contacts to establish one or two potential opportunities to trial the course. These could be aligned with existing events from partners such as DCC, FutureTDM or institutions who have expressed an interest in hosting events for researchers.

A train-the-trainers style session could be run at RDA Montreal to walk members through the course content and how it should be delivered, and to receive feedback from potential adopters.

This work will require at least two face-to-face sessions to deliver courses in different contexts.

Milestone: Have tested the course and gathered feedback from trainers and pilot participants

Quarter four - 2017: Evaluation and review

Here we will take stock of feedback received during the trial. Particular attention will be paid to which sessions were most effective in addressing the learning outcomes and engaging participants. The time taken to deliver the sessions, any technical issues encountered by trainers and ideas for reworking content or improving flow will also be addressed.

The course materials will be refined based on the feedback. Supporting materials that help others reuse the content, such as speaker notes, will also be improved.

The work will be conducted remotely with regular virtual meetings to support the analysis and review.

Deliverable: A revised set of openly-licensed training materials available online for reuse

Quarter five - 2018: Adoption

The complete course materials will be made available online (GitHub, SlideShare, Zenodo) together with documentation on how to implement the course module, FAQs and contact details for support. Further events like the train-the-trainers at Montreal could help others to understand and adopt the resources.

Through the DCC, European training initiatives (e.g. Swafs-07) and e-infrastructure projects like OpenAIRE, we will raise awareness of the module and promote adoption in academia.

In addition, the IEA has a number of industrial partners (including Microsoft, Airbus, environmental consultancies and civil engineering companies) and can serve as a route to gaining contacts in industry.

This work will involve promoting the outputs at events, as well as specific meetings with key targets (e.g. training departments and Doctoral Training Centres) to promote adoption.


WG Communication


Bi-weekly calls for the Chairs or others engaged in specific activities currently underway

Monthly calls to update all members of the Working Group on progress

WG Email list for discussion and sharing of relevant information

Google Drive/GitHub for collaboration on course materials





  • Freyja van den Boom (EU)
  • Sarah Jones (EU)
  • Devan Ray Donaldson (US)
  • Clement E. Onime (TBC)


  • Steve Brewer
  • Vicky Lucas
  • Simon Hodson
  • Amy Nurnberger        
  • Puneet Kishor
  • Baden Appleyard
  • Christoph Bruch
  • Alex Fenlon
  • Jez Cope
  • Hugh Shanahan
  • Małgorzata Krakowian
  • Bridget Almas

Group Email: tdm@rda-groups.org

Secretariat Liaison: Fotis Karayannis

TAB Liaison: Devika Madalli

Engagement with existing work in the area:

Collaborations and opportunities for further engagement include:

http://www.futuretdm.eu/ The FutureTDM project seeks to improve uptake of text and data mining (TDM) in the EU. FutureTDM actively engages with stakeholders such as researchers, developers, publishers and SMEs and looks in depth at the TDM landscape in the EU to help pinpoint why uptake is lower, to raise awareness of TDM and to develop solutions.

EDISON is a 2-year project (started September 2015) with the purpose of accelerating the creation of the Data Science profession.


The forthcoming Swafs-07 ‘Training on Open Science in the European Research Area’ project.

CODATA-RDA School of Research Data Science


The Institute for Environmental Analytics (IEA) is part of the University of Reading, providing training on analytics and producing proof-of-concept software either by using environmental data or big data for environmental applications. The IEA is funded until 2019 by the Higher Education Funding Council for England. The IEA recognises that TDM is a growing field for environmental analysis and applications. The IEA currently has projects using TDM on tweets and text messages and is moving into larger document analysis, specifically environmental impact assessments.


The Belmont Forum is a group of national science funders, including NSF (US) and NERC (UK).  The e-infrastructure group is exploring training requirements for research data scientists, including developing a relevant curriculum in 2017.



The UK Digital Curation Centre has delivered training on Research Data Management for several years and is involved in training activities for a number of European projects such as FOSTER, OpenAIRE, EUDAT and the European Open Science Cloud. Through these and participation in the CODATA summer schools, the DCC will help to embed the module in existing courses and encourage broad adoption.

Other possible collaborations:

Academia: Several universities have expressed interest

Try-outs may be organised alongside the Trieste School (10-21 July, at ICTP in Trieste), followed by São Paulo, Brazil (4-15 December).

School of Data works to empower civil society organizations, journalists and citizens with the skills they need to use data effectively.

Industry and organisations: Contentmine, SAS


[1] Developed during the Plenary in Denver IG session Education and Training on handling of research data

[2] FutureTDM project report D4.3 Compendium of Best Practices and Methodologies available online at http://www.futuretdm.eu/knowledge-library/

[3] For an overview of use examples in the US, see Shaw, J., “Why ‘Big Data’ Is a Big Deal: Information science promises to change the world”, Harvard Magazine, available online at http://harvardmag.com/pdf/2014/03-pdfs/0314-30.pdf

[4] The EU expert report on Text and Data Mining states that Europe is falling behind the US and China with respect to the uptake of TDM; available at http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_ex...

[5] FutureTDM consortium D4.3 Compendium of Best Practices and Methodologies report shows the need for more TDM practitioners in industry as well as a lack of awareness and skill amongst students and researchers in different disciplines.

[6] Key finding from the publishers' community on this issue, available at http://publishingresearchconsortium.com/index.php/prc-projects/text-mining-of-journal-literature-2016?platform=hootsuite

[7] As identified in Europe. See European Communication Monitor 2016 http://www.communicationmonitor.eu/

[8] We will initially aim this course at (student) researchers with little or no prior knowledge of TDM. For a second iteration of the course we will also look at industry, librarians and other interested parties to see how the course can be tailored more to specific needs.

[9] The course will be made available under an open access license using open source tools and materials to make sure the course can be adopted by a wide audience.

[10] The content of the course, the course materials and the best platform for making them available will be looked at in this working group. See the timeline for more detailed information.

[11] FutureTDM Deliverable 2.4 and 4.3 available at http://www.futuretdm.eu/

[12] The UK Royal Society is holding a special conference on this topic; see https://royalsociety.org/science-events-and-lectures/2016/11/data-skills-workshop/


Review period start: 
Friday, 9 December, 2016 to Monday, 9 January, 2017
  • Author: Malcolm Wolski

    Date: 17 Jan, 2017

    As described, these outputs/resources will be very useful in complementing our existing activities in Software Carpentry, Resbaz and our communities of practice drop-in sessions. They will also provide a very useful referral resource. While the demand for this has not been high at this University, we suspect it is a case of researchers not knowing what they don't know, so these resources and working group activities will be of interest for awareness raising as well. There is also potential interest from the Library and Graduate Schools.

    It is not quite clear how the WG will keep their activities non-discipline specific. 

    There were positive comments around the train-the-trainer approach (similar to the ANDS 23 Things approach), which resonated with people who read the case statement. Overall there was general support for the WG, to the point that potential initial target groups have already been discussed.

    We would be interested in participating in working group activities where possible.

  • Author: Lynn Yarmey

    Date: 19 Jan, 2017

    Many thanks, Malcolm. Your comments will be considered in the TAB review report to the group.
    We appreciate your perspective!


    TAB Liaison from Secretariat
