Approaches to Research Data Packaging - RDA 11th Plenary BoF meeting

BoF Meeting title

Approaches to Research Data Packaging (Remote Access Instructions)

Collaborative session notes (with links to presentations): 

https://docs.google.com/document/d/1ZlfaDAgjyI1iLf5zHtMIdE4pUBpUvmEC11C6...

Short introduction describing the scope of the group and any previous activities

A number of projects, initiatives, and infrastructure providers have identified the value of bundling research data along with associated metadata into a simple folder structure, variously referred to as a data “package”, “bundle”, “crate”, or “object”. The approach of bundling data and minimal metadata into a “self-describing” package draws on similar conventions that have been shown to scale exceptionally well in other domains, for example: the conventions used in software packaging (apt, yum, RubyGems, npm, R packages, etc.); the conventions used in software development to place metadata in well-known files such as README.md/.txt or LICENCE.txt; the success of EPUB and similar file formats in bundling content and metadata in a simple zipped folder structure; and the core idea behind Docker containers as portable, interoperable, reproducible units.

In the context of research data and the FAIR principles, the approach has huge potential for scaling out a minimal notion of interoperability (conventions for packaging minimal semantics with data) and reusability (the potential to develop a wide array of tools that can interpret a small number of common packaging formats). It also works well with existing approaches to findability (it is trivial to associate a DOI with a research data package) and accessibility (a simple file/folder approach allows for fine-grained access permissions across different protocols, e.g. POSIX, HTTP, etc.).
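As a rough illustration of the “self-describing” idea, the following Python sketch writes a data file and a small metadata file into one folder. The file name `metadata.json`, the field names, and the helpers `build_package` and `demo_build` are assumptions made for this example; they do not follow any of the packaging specifications discussed in this session.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_package(root: Path, title: str, licence: str, files: dict[str, bytes]) -> Path:
    """Write data files plus a metadata file into a single folder.

    The layout (data/ plus metadata.json) is hypothetical and is not
    taken from any packaging specification.
    """
    data_dir = root / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    resources = []
    for name, content in sorted(files.items()):
        (data_dir / name).write_bytes(content)
        # Record enough about each file for a tool to locate and verify it.
        resources.append({
            "path": f"data/{name}",
            "bytes": len(content),
            "sha256": hashlib.sha256(content).hexdigest(),
        })

    metadata = {"title": title, "licence": licence, "resources": resources}
    (root / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return root


def demo_build() -> dict:
    """Build a throwaway package and return its parsed metadata."""
    with tempfile.TemporaryDirectory() as tmp:
        package = build_package(
            Path(tmp) / "example-package",
            title="Example dataset",
            licence="CC-BY-4.0",
            files={"readings.csv": b"site,value\nA,1\n"},
        )
        return json.loads((package / "metadata.json").read_text(encoding="utf-8"))
```

Because the metadata travels inside the folder, any tool that understands the convention can interpret the package without consulting an external catalogue.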

A data packaging approach works across the research lifecycle. There is a need to integrate minimal data stewardship practices and conventions into the daily practices of researchers, who are already familiar with thinking about research outputs in terms of files and folders on disk, in a way that complements rather than interferes with existing practices and tooling. In the context of emerging content-addressed, peer-to-peer storage protocols (e.g. IPFS, Dat, Ethereum), data packaging allows rich semantics to be associated with globally synced datasets and immutable references. Similarly, data packages work well in existing cloud environments and with workloads that require highly efficient access and computational I/O (e.g. http://bd2k.ini.usc.edu/tools/bdbag/). Later in the data management cycle, a data packaging approach is a good fit for existing repository workflows (most repositories conceptually draw on the idea of Submission/Archival/Dissemination packages from the OAIS model) and for existing long-term preservation environments such as DPN or LOCKSS.
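The link between packaging and content addressing can be sketched in a few lines of Python: an identifier derived from a package's contents changes whenever any file changes, which is what makes such a reference immutable. The function `package_digest` and its hashing scheme are assumptions for illustration only; IPFS, Dat, and similar protocols each define their own addressing schemes.

```python
import hashlib
import tempfile
from pathlib import Path


def package_digest(root: Path) -> str:
    """Return a SHA-256 digest over the relative paths and contents of all files in root.

    Hypothetical scheme for illustration; not the addressing format of any real protocol.
    """
    outer = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
        # Feed "hash  path" lines into the outer hash, manifest-style.
        outer.update(f"{file_hash}  {rel}\n".encode("utf-8"))
    return outer.hexdigest()


def demo_digest() -> tuple[str, str, str]:
    """Digest a small package twice unchanged, then once after an edit."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "data.csv").write_bytes(b"a,b\n1,2\n")
        (root / "README.txt").write_text("Example package\n", encoding="utf-8")
        first = package_digest(root)
        second = package_digest(root)  # contents unchanged
        (root / "data.csv").write_bytes(b"a,b\n1,3\n")
        third = package_digest(root)  # one value changed
        return first, second, third
```

In this sketch the first two digests are identical and the third differs, so a reference to the first digest can only ever denote the original contents.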

The scope of this group is to explore the state of the art in terms of the application of data packaging in the context of research data management, and to establish what additional work could be carried out by the Research Data Alliance to further support existing approaches.

Additional links to informative material related to the group

A list (compiled with input from various groups and individuals) of existing/candidate data packaging specifications used in the context of research:
https://docs.google.com/document/d/155lA2BcixTl-zwJHGfLkxsmg7WmQbBK00QWy...

A more selective version of the list above, presented in a spreadsheet with a comparison matrix:
https://docs.google.com/spreadsheets/d/1Tg-oYGPdBDs5LORt0olD5t4X1R_YliUr...

Notable projects working in this area:
--- Frictionless Data (https://frictionlessdata.io/data-packages/)
--- Research Objects (http://www.researchobject.org/)
--- RDA Repository Interoperability Working Group packaging recommendation (https://docs.google.com/document/d/1VmmhNMl4ie5zqbCKkf3NDNRHtgdb2SgYF_cE...
--- DataCrate (https://github.com/UTS-eResearch/datacrate/tree/master/spec/0.1)

Two related blog posts by Cameron Neylon, highlighting the pain points that individual researchers face in managing “long tail” research data, and how a simple packaging format and good tooling can alleviate those challenges:
https://cameronneylon.net/blog/as-a-researcher-im-a-bit-bloody-fed-up-wi...
https://cameronneylon.net/blog/packaging-data-the-core-problem-in-genera...

Meeting objectives

The objectives of the meeting are to:
a) identify and capture research-based use-cases for data packaging;
b) compare and contrast the strengths of existing data packaging formats and the extent to which the variety of approaches complement each other;
c) establish whether there is a need for further RDA work in this area, either through an Interest Group or a Working Group, for example to develop an RDA recommendation relating to specific research data packaging formats.

Meeting agenda

1) Overview of data packaging initiatives (6 x ~10 minutes):

  1. Intro - Eoghan Ó Carragáin, University College Cork
  2. DataOne Packages - Dave Vieglais, DataOne
  3. RDA Research Data Interoperability Working Group - Thomas Jejkal, Karlsruhe Institute of Technology
  4. researchobject.org - Stian Soiland-Reyes, University of Manchester (remote)
  5. DataCrate v 0.2 - Peter Sefton, University of Technology Sydney (remote)
  6. Frictionless DataPackages - Vitor Baptista, Open Knowledge International (remote)

2) Questions, discussion and evaluation of need for further RDA work in this area, e.g. formation of an Interest or Working Group

Target audience:

Ideally the meeting will have representatives from existing data packaging related projects (see the list above). Some of these groups have already expressed interest and others will be invited to attend. 

Other than that, it would be very useful to get additional feedback from RDA delegates who may have particular use-cases for which data packaging would be a good fit.

In particular, input is sought from those working on infrastructure for long-tail research, on strategies for long-term curation of long-tail research data, and on tools for active data management (e.g. electronic notebook providers), as well as from those with expertise in metadata modelling and an interest in the UX (user experience) considerations of how to capture rich semantics in daily workflows.

Note: There was a “Data Packages BoF” at RDA Plenary 6 focusing specifically on the packaging specification from the Frictionless Data project of Open Knowledge International. Similarly, the Frictionless Data specifications were presented during a joint session of MDIIIG and PaNSIG at Plenary 8. The Frictionless Data team are aware of and interested in the current proposal, which takes a broader perspective on the variety of approaches to packaging of research data objects.

Group chair serving as contact person: Eoghan Ó Carragáin

Type of meeting: Informative meeting

Remote Access Instructions:

Please join my meeting from your computer, tablet or smartphone.
https://global.gotomeeting.com/join/634210829

You can also dial in using your phone.

Access Code: 634-210-829

Australia: +61 2 9087 3604
Austria: +43 7 2081 5427
Belgium: +32 28 93 7018
Canada: +1 (647) 497-9410
Denmark: +45 32 72 03 82
Finland: +358 923 17 0568
France: +33 170 950 594
Germany: +49 692 5736 7317
Ireland: +353 15 360 728
Italy: +39 0 230 57 81 42
Netherlands: +31 207 941 377
New Zealand: +64 9 280 6302
Norway: +47 21 93 37 51
Spain: +34 932 75 2004
Sweden: +46 853 527 836
Switzerland: +41 225 4599 78
United Kingdom: +44 330 221 0088
United States: +1 (224) 501-3216

First GoToMeeting? Let's do a quick system check: https://link.gotomeeting.com/system-check