Advancing Digital Frontiers of Crystallography, Chemistry and Research Data in Botswana
On the morning of Sunday November 4th, 2018, I arrived from London at the Gaborone International Conference Centre in Botswana. I was there to participate in activities associated with International Data Week 2018, including the 12th Research Data Alliance (RDA) Plenary. This was my ninth time attending an RDA Plenary, my second time participating in International Data Week, but my first time in Africa. My journey here was generously supported by a grant from the RDA Europe 4.0 Experts Programme.
If I were to summarise International Data Week 2018, I would describe it as a top-notch, highly intensive seven days of activities aimed at advancing practices and policies associated with research data at both a local and a global scale. Over 800 people with diverse backgrounds and training from across 66 countries converged to share their knowledge and experiences through sessions at SciDataCon, a conference organised by CODATA and the World Data System (WDS), and to advance the activities of RDA’s many Working and Interest Groups.
As well as carrying the accolade of being a designated RDA Europe Expert, I was also representing the Cambridge Crystallographic Data Centre (CCDC), and channelling experiences drawn from the wider communities of crystallography and chemistry. For those who may not be aware, the CCDC has more than 50 years of experience making small molecule crystallography data and knowledge available in metadata-rich, standard and curated forms to researchers across domains. Crystallography as a discipline has been committed to rigorous standards of semantic data reporting for even longer and is often considered to be a great advertisement for what can be achieved through data sharing if you have the right components in place. Attending RDA events is a great opportunity for the CCDC to share these experiences, learn from others and contribute to activities that can have a positive impact on the effective sharing of research data across communities.
One of the first RDA activities that the CCDC got involved in was a working group looking to establish more effective mechanisms for linking between data and articles. The outcome of this has been a set of guidelines and recommendations for data-article linking known as Scholix. The current RDA/WDS Scholarly Link Exchange Working Group has been focussing on adoption of Scholix and the CCDC has been participating in this. We have recently been working with Elsevier to transition existing data-article linking mechanisms to use Scholix-based workflows and at the Working Group session in Botswana I shared some of our journey thus far. A key challenge currently is making sure that the various systems needed to implement Scholix effectively are all updated in a timely manner, so that there is minimal delay between an article being published and the link to associated data becoming available. For us, a key step in the workflow is the publisher notifying the repository when an article associated with a dataset has been published. This isn’t specifically a Scholix problem, but it is definitely a component of data publishing workflows where there is opportunity for improvement, particularly to remove the need we still have to manually scan some journals in order to identify articles that depend on data deposited at the CCDC.
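To make the notification step a little more concrete, here is a minimal sketch of the kind of link record a repository might create once a publisher reports that an article is out. It is loosely inspired by the idea of a Scholix link (a source, a target and a relationship between them), but the class names, field names, DOIs and provider name are all hypothetical placeholders of mine; this is not the Scholix schema itself, nor the CCDC or Elsevier implementation.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class ResearchObject:
    identifier: str   # e.g. a DOI
    id_scheme: str    # e.g. "doi"
    object_type: str  # "dataset" or "literature"


@dataclass
class DataArticleLink:
    source: ResearchObject
    target: ResearchObject
    relationship: str           # illustrative relationship name
    link_provider: str          # who asserts the link
    link_publication_date: str  # ISO date on which the link was published


def link_on_publication(dataset_doi, article_doi, provider, published):
    """Build a link record when a publisher reports an article as published."""
    return DataArticleLink(
        source=ResearchObject(dataset_doi, "doi", "dataset"),
        target=ResearchObject(article_doi, "doi", "literature"),
        relationship="IsSupplementTo",
        link_provider=provider,
        link_publication_date=published.isoformat(),
    )


link = link_on_publication(
    "10.1234/example-dataset",  # placeholder DOI, not a real deposit
    "10.1234/example-article",  # placeholder DOI, not a real article
    "Example Repository",       # hypothetical link provider
    date(2018, 11, 4),
)
print(json.dumps(asdict(link), indent=2))
```

The point of the sketch is simply that the link can only be built once the article identifier is known, which is why a timely notification from the publisher matters so much.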
Once datasets have been published, there is further benefit for the community in understanding what level of usage these are getting. This can help demonstrate to various stakeholders the value and impact of investing time and effort in making data publicly available. The RDA Data Usage Metrics Working Group is looking to establish standardised recommendations for metrics that can help demonstrate this impact. A starting point for this group is the Make Data Count project which has helped establish a COUNTER Code of Practice for Research Data Metrics and is developing services that enable views and downloads to be reported according to the code. That we can now reliably count views and downloads allows us to start considering what these metrics can tell us about community re-use of data and what else might be needed to fully reflect the value of publishing a dataset. To fuel discussion of this at the Working Group session, I shared observations from the CCDC on factors that can influence such metrics and noted ways in which data can be impactfully reused that won’t necessarily register a view or a download. One of the main suggestions to come out of discussion during the session was to make sure that whatever metrics this group recommends, they are accompanied by an honest assessment of what they can tell us and what their limitations are.
Another area where metrics featured at this RDA meeting was in the context of assessing how well a dataset conforms to the FAIR Data Principles. This relates to a broader concept of “Fitness for Use” and how to signal what aspects of a dataset might need improvement to enable it to be effectively reused. This can be a difficult concept to grasp, particularly at a level that is domain agnostic, but members of the RDA/WDS Assessment of Data Fitness for Use Working Group have done an excellent job in coming up with a set of core criteria that offer a handle on this. These reflect both the FAIR Data Principles and some of the criteria of the CoreTrustSeal requirements for certifying a repository as trustworthy. The complementarity of FAIR and the CoreTrustSeal is something that came up in several conversations over the week and will, I suspect, be an issue that is revisited in the months ahead.
Aspects of FAIR from a domain perspective were drilled into as part of a Chemistry Research Data Interest Group session that I co-chaired with Leah McEwen, Chemistry Librarian at Cornell University. Having reviewed how FAIR currently does and doesn’t work in chemistry, participants engaged in discussions aimed at gathering input from other domains on their chemistry data needs as well as identifying common challenges that could perhaps be advanced through the RDA. Themes relating to interdisciplinary interoperability as they pertain to chemistry as well as crystallography and biology were explored further in a SciDataCon session jointly organised by the International Union of Pure and Applied Chemistry (IUPAC) and the International Union of Crystallography (IUCr). Ideas to emerge from discussions across the week gravitated towards the challenge of meaningfully conveying domain-specific metadata across disciplines – perhaps mediated by a FAIR metadata extension to domain-specific file formats for example – and the need to understand how different disciplines conceptualise “a chemical”. There was also a call for a machine API to the Periodic Table of the Elements; working towards this could be a fitting activity for 2019, which is both the 100th anniversary of IUPAC and the International Year of the Periodic Table.
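As a purely illustrative aside, the sketch below shows roughly what a machine-readable lookup over the Periodic Table could look like to a program. The `Element` class, the `lookup` function and the three-element table are hypothetical names and sample data I have chosen for illustration; nothing here reflects an actual IUPAC service or any design discussed during the week.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Element:
    atomic_number: int
    symbol: str
    name: str


# A tiny sample only; a real service would cover every element.
_ELEMENTS = [
    Element(1, "H", "Hydrogen"),
    Element(6, "C", "Carbon"),
    Element(8, "O", "Oxygen"),
]
_BY_SYMBOL = {e.symbol.lower(): e for e in _ELEMENTS}
_BY_NUMBER = {e.atomic_number: e for e in _ELEMENTS}


def lookup(key: Union[int, str]) -> Optional[Element]:
    """Return an element by atomic number or by case-insensitive symbol."""
    if isinstance(key, int):
        return _BY_NUMBER.get(key)
    return _BY_SYMBOL.get(str(key).strip().lower())


print(lookup("c"))
print(lookup(8))
print(lookup("Xx"))  # not in the sample table, so None
```

Even a toy like this hints at the harder question raised in the sessions: agreeing across disciplines on which identifiers and properties such an interface should expose.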
In this post I have primarily focussed on my engagement in RDA activities throughout the week but I experienced so much more. I got to hear perspectives on Democratising Data Publishing and Upstream Data vs Downstream Innovation as part of SciDataCon. I learnt about the Nagoya Protocol and the challenges this potentially presents for biomedical research. I was inspired by some great keynotes and found myself tempted to go read some T. S. Eliot. I got to experience some of the infrastructure limitations that many across the world experience on a daily basis. I heard the drums, echoing in the night. Most of all, I had the privilege of engaging once again with a community that is friendly, knowledgeable, open and passionate about making research data widely accessible and meaningfully reusable across borders of all kinds. I thank RDA Europe 4.0 for supporting my attendance at this event and for continuing to facilitate and nurture opportunities for cross-community innovation and collaboration as part of the global RDA enterprise.