Principles and best practices in data versioning for all data sets big and small

Output Type: Working Group Supporting Output
Output Status: Endorsed
Review Period End: 2020-02-28
Group: Data Versioning IG

Data Versioning WG

Group co-chairs:

Jens Klump, Lesley Wyborn, Ari Asmi, Robert Downs

Supporting Output title: Principles and best practices in data versioning for all data sets big and small

Authors: Jens Klump, Lesley Wyborn, Robert Downs, Ari Asmi, Mingfang Wu, Gerry Ryder, Julia Martin

Impact: Provides recommendations for standard practices in the versioning of research data, adding a central element to the systematic management of research data at any scale. This in turn enhances reproducibility and enables attribution to any person or organisation that contributed to the development or funding of any version of a dataset.

DOI: 10.15497/RDA00042

Citation: Klump, J., Wyborn, L., Downs, R., Asmi, A., Wu, M., Ryder, G., & Martin, J. (2020). Principles and best practices in data versioning for all data sets big and small. Version 1.1. Research Data Alliance. DOI: 10.15497/RDA00042.

Abstract:

The demand for better reproducibility of research results is growing. More and more data is becoming available online. In some cases, the datasets have become so large that downloading the data is no longer feasible. Data can also be offered through web services and accessed on demand, meaning that parts of the data are retrieved from a remote source only when needed. In this scenario, it will become increasingly important for a researcher to be able to cite the exact extract of the dataset that was used to underpin their research publication. However, while the means to identify datasets using persistent identifiers have been in place for more than a decade, systematic data versioning practices are currently not available.

Versioning procedures and best practices are well established for scientific software. The related Wikipedia article gives an overview of software versioning practices. The codebase of a large software project bears some resemblance to a large dynamic dataset. Are versioning practices for code therefore also suitable for datasets, or do we need a separate suite of practices for data versioning? How can we apply our knowledge of versioning code to improve data versioning practices? This Working Group investigated the extent to which these practices can be used to enhance the reproducibility of scientific results.

The Research Data Alliance (RDA) Data Versioning Working Group produced this white paper to document use cases and practices, and to make recommendations for the versioning of research data. To further adoption of the outcomes, the Working Group contributed selected use cases and recommended data versioning practices to other groups in RDA and W3C. The outcomes of the RDA Data Versioning Working Group add a central element to the systematic management of research data at any scale by providing recommendations for standard practices in the versioning of research data. These practice guidelines are illustrated by a collection of use cases.

Please note that the previous version (v1.0) underwent community review. The current version (v1.1) was updated following the community review.

Primary Field or Expertise: Mathematics
Output

Report of the RDA Data Versioning Working Group_V1.1.pdf

Primary Domain: Natural Sciences
Group Technology focus: Data (Output) Management Planning
