Interoperable Data Archiving and Migration Using the RDRI Working Group Recommendations
The RDA’s Research Data Repository Interoperability (RDRI) Working Group has proposed recommendations (https://www.rd-alliance.org/group/research-data-repository-interoperability-wg/outcomes/research-data-repository-0), based on existing standards, for packaging richly annotated datasets for archiving and exchange. The communities around Dataverse and Clowder, two independent data-focused software systems in use today across a broad range of disciplines and in countries around the world, have recently received support to upgrade their existing archiving mechanisms to conform to these recommendations and to assess their benefits and limitations in achieving practical interoperability. This poster outlines the data/metadata architecture of the two systems, describes the in-progress implementations of export/import functionality that leverage the recommendations and underlying standards, and discusses some preliminary interoperability findings.
As noted in the abstract, this is work to implement the RDA working group's recommendations.
Author: Sarah Jones
Date: 02 Apr, 2020
This looks really useful. Can you say where the work is up to, please? Have you managed to extend Dataverse and Clowder so you can export data packages, or is it still at the development stage?
It would be great to hear how you found adopting the RDRI recommendations too. Did this run smoothly, or are there changes you would suggest? And did it allow exchange across repositories as expected?
Author: James Myers
Date: 02 Apr, 2020
Thanks for the comment! The project itself is just starting - I think we officially got funds just in time to create this poster. That said, both Dataverse and Clowder have significant capabilities to start with, building on work during the NSF DataNet program to use OAI-ORE (in a JSON-LD serialization) and BagIt. Clowder can export pre-RDA Bags and, via a separate program (now called DVUploader), re-import them. Dataverse is currently limited to export, but it is already RDA-conformant. I'm currently working on import to Dataverse, and Max is getting started in Clowder.
I expect we'll be publishing a paper in a few months that will cover how things went and what we learned, but I can mention a few 'early impressions' here:
I've been a champion of BagIt and OAI-ORE, particularly in JSON-LD serialization, for many years. To me, their key benefit is that they standardize the things nobody wants to argue about - regardless of what fancy features a repository has, most can be described as having datasets that contain ~file-like parts, with both the dataset and the parts having metadata. And, with the implementation we've developed, they scale to Bags containing 120K+ files and 600GB+ of content, without having to switch to using 'holey' Bags (which reference files by URL rather than including them in what's zipped/transferred).
BagIt and OAI-ORE standardize a way to serialize that without limiting what the parts are, how they're arranged, or what metadata you have about them. The RDA recommendations add ~common-sense additions on top, such as where in the Bag you will find the ORE file. RDA also adds a requirement for a DataCite XML metadata file that essentially specifies a minimal metadata set. The combination, however, retains the ability to let repositories add whatever domain-specific metadata, provenance, etc. they use within this common recommended structure.
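To make the layout concrete, here is a minimal sketch in Python of a Bag following this pattern. The data/ payload, bagit.txt declaration, and checksum manifest come from the BagIt spec; the metadata/oai-ore.jsonld and metadata/datacite.xml tag-file names are placeholders modeled on Dataverse's export, not quoted from the recommendation.

```python
# Sketch only: builds a minimal BagIt-style bag with the ORE map and
# DataCite file as tag files. File names under metadata/ are assumptions.
import hashlib
import json
import os
import tempfile

def make_rda_style_bag(root, files, ore_map, datacite_xml):
    """Write a minimal bag; `files` maps flat payload names to bytes."""
    os.makedirs(os.path.join(root, "data"))
    os.makedirs(os.path.join(root, "metadata"))
    # Required bag declaration (BagIt spec).
    with open(os.path.join(root, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    # Payload files plus a sha256 manifest over them.
    manifest = []
    for name, content in files.items():
        with open(os.path.join(root, "data", name), "wb") as f:
            f.write(content)
        manifest.append(f"{hashlib.sha256(content).hexdigest()}  data/{name}")
    with open(os.path.join(root, "manifest-sha256.txt"), "w") as f:
        f.write("\n".join(manifest) + "\n")
    # RDA additions: the ORE map and DataCite metadata travel in the Bag.
    with open(os.path.join(root, "metadata", "oai-ore.jsonld"), "w") as f:
        json.dump(ore_map, f, indent=2)
    with open(os.path.join(root, "metadata", "datacite.xml"), "w") as f:
        f.write(datacite_xml)
    written = []
    for dirpath, _, names in os.walk(root):
        for n in names:
            written.append(os.path.relpath(os.path.join(dirpath, n), root))
    return sorted(written)

bag_dir = tempfile.mkdtemp()
contents = make_rda_style_bag(
    bag_dir,
    {"results.csv": b"a,b\n1,2\n"},
    {"@context": {}, "ore:describes": {"title": "Example dataset"}},
    "<resource/>",  # placeholder, not real DataCite XML
)
```

The point of the sketch is how little is fixed: only the bag scaffolding and the two metadata files are standardized, while the payload and the contents of the ORE map are left to the repository.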
Overall, creating an export using the RDA recommendations should be straightforward for most repositories (I think it was for ours), enabling, for example, standards-based archiving (which was the initial goal in Dataverse). Import is where the fun is. Our expectation is that the recommendations are sufficient to let us recreate a dataset and its parts, but we still need to investigate how to handle 'arbitrary' metadata that could appear in the ORE file. For Clowder, which allows metadata to be defined dynamically, this could be relatively straightforward. Dataverse is also customizable, but not dynamically (defining a new term in Dataverse involves specifying how it can be edited and displayed, etc. - information that isn't available just from seeing an example of that term in an ORE file), so we're contemplating ways to address that. This contrast between Clowder and Dataverse is part of why we think the project is interesting w.r.t. making generalizations about interoperability.
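As a sketch of the import problem just described, the snippet below walks a made-up ORE aggregation and separates terms the importing repository already defines from 'arbitrary' ones it would have to handle some other way. The KNOWN_TERMS set and the sample map are illustrative assumptions, not taken from Dataverse or Clowder.

```python
# Hypothetical import planner: split each aggregated part's metadata into
# terms the repository knows vs. arbitrary terms needing special handling.
KNOWN_TERMS = {"title", "description", "name", "contentUrl"}

# Illustrative ORE resource map (the ore:describes/ore:aggregates keys
# follow the OAI-ORE vocabulary; the values are invented).
sample_map = {
    "@context": {"InstrumentId": "http://myproject.com/vocab/instrumentid"},
    "ore:describes": {
        "title": "Example dataset",
        "ore:aggregates": [
            {"name": "results.csv",
             "contentUrl": "https://example.org/files/1",
             "InstrumentId": "1234"},
        ],
    },
}

def plan_import(resource_map):
    """Return, per file, the known and the arbitrary metadata terms."""
    plan = []
    for part in resource_map["ore:describes"].get("ore:aggregates", []):
        plan.append({
            "file": part["name"],
            "known": {k: v for k, v in part.items() if k in KNOWN_TERMS},
            "arbitrary": {k: v for k, v in part.items() if k not in KNOWN_TERMS},
        })
    return plan

plan = plan_import(sample_map)
```

In this framing, Clowder could accept the "arbitrary" bucket dynamically, while Dataverse would need the extra display/edit configuration for each such term before it could do the same.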
The one aspect of the RDA recommendation that I (personally, not as a representative of the project) am not convinced is useful, as currently defined, is the inclusion of a Bag profile. The idea of a profile is that it gives you an idea of what metadata you can expect in a Bag, which is potentially useful, but a) it seems most useful as metadata about a repository rather than about a particular Bag (i.e. I'd like to know if I can harvest datasets from a repository, so I'd like to know what BagIt versions it might produce, but a given Bag conforms to only one BagIt version), b) profiles only cover the Bag structure/metadata and not the ORE metadata, whereas understanding the dataset requires knowing both, and c) it's not clear how repositories like Dataverse and Clowder, which can both be customized by admins/end users, can develop useful standard profiles. We can create generic ones for the software where lots of elements are optional, or we could try to generate them per instance so that they'll be more informative, but for elements like Contact-Phone, which relies on optional metadata (a dataset creator might include it or not), you'll still have to inspect a given Bag to know whether it's there. The recommendation is that Bags SHOULD have a profile rather than MUST, so we'll be seeking input as the project goes forward on how to address this.
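To illustrate point (c), a generic per-software profile might look like the sketch below (a Python dict using key names from the BagIt Profiles specification; the values are hypothetical, not an official Dataverse or Clowder profile). Note that the most it can honestly say about an element like Contact-Phone is that it may or may not be present:

```python
# Hypothetical generic profile for a customizable repository. Key names
# follow the BagIt Profiles spec; identifiers and values are invented.
profile = {
    "BagIt-Profile-Info": {
        "BagIt-Profile-Identifier": "https://example.org/profiles/generic.json",
        "Source-Organization": "Example Repository",
        "Version": "0.1",
    },
    # A repository may produce several versions; a given Bag uses only one.
    "Accept-BagIt-Version": ["0.97", "1.0"],
    "Manifests-Required": ["sha256"],
    "Bag-Info": {
        # Optional, because any given dataset may or may not include it -
        # the profile can't tell you whether a specific Bag has it.
        "Contact-Phone": {"required": False},
    },
}
```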
If anyone's interested, there are ways you can view/try the current capabilities. All the datasets published in the SEAD repository, which uses Clowder, are in the pre-RDA ORE/Bag format. For Dataverse, the OAI-ORE metadata map is available for any published dataset from most of the installations around the world (see https://iqss.github.io/dataverse-installations/). Dataverse datasets can contain restricted-access files, so the full RDA-conformant Bags are only available to administrators, but we'll be posting examples with unrestricted data as we go.
Author: Asahiko Matsuda
Date: 08 Apr, 2020
I'm particularly interested in how domain-specific metadata works here. If I have a dataset with domain-specific metadata that's not in any existing JSON-LD vocabulary, would I start by just writing it down in the part you labeled "Information about each DataFile"? How would individual applications be expected to deal with that, or would it just live as a file? I would presume it would be a good idea to use a consistent rule for that - and if a group of people can agree on that consistent rule (RDA would be an ideal place to discuss it), that would be a standard.
(I think our interests overlap. Come see our poster, "Materials metadata: as a custom schema, as directories, or in a data package".)
Author: James Myers
Date: 08 Apr, 2020
I'm not sure I fully understand your comment, but I would agree that RO-Crate overlaps with the RDA recommendation we're implementing. My sense of the two is that the OAI-ORE/BagIt combination just defines packaging, whereas RO-Crate defines its own packaging and picks specific vocabularies to use. (It's not quite that simple, in that the RDA recommendation does require DataCite metadata, to create the datacite.xml file.)
If you want to use other community vocabularies in our work, you simply need to define the term with a label and URI so it can be represented in JSON-LD. So, for example, to associate an instrument ID with a file, you could just add "InstrumentId":"http://myproject.com/vocab/instrumentid" to the @context and add an "InstrumentId":"1234" to a file or dataset's metadata. The OAI-ORE/BagIt combo and RDA recommendations do not limit what metadata you provide. Clowder can accept, store, display, and re-export such metadata (in an archival package or via API) already and we're considering how to do that with Dataverse. (Dataverse can be customized to allow new metadata terms associated with external vocabularies to be added through the web interface and API, and to be exported, but such customization currently has to be done by an admin prior to entering such metadata.)
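The InstrumentId example above, written out explicitly (the vocabulary URI is the hypothetical one from the discussion): define the term in the @context, then use it in a file's metadata. Even a consumer that doesn't recognize the term can still resolve it to its URI.

```python
# The hypothetical InstrumentId term from the discussion, as JSON-LD.
import json

file_metadata = {
    "@context": {
        # Maps the short term name to its (hypothetical) vocabulary URI.
        "InstrumentId": "http://myproject.com/vocab/instrumentid",
    },
    "name": "scan-001.tif",
    "InstrumentId": "1234",
}

# A consumer that doesn't know the term can still resolve it to its URI
# via the context, and can round-trip it unchanged.
term_uri = file_metadata["@context"]["InstrumentId"]
serialized = json.dumps(file_metadata)
```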
Doing exactly what I just said would not really be good practice - no one but you would understand what "http://myproject.com/vocab/instrumentid" is intended to mean, so it would be much better if experimentalists standardized that and other terms, perhaps through RDA, and used those instead.
To me, having standards/recommendations like this one that just address the packaging of data and metadata, without constraining what vocabularies you use, is valuable - the code to export and import such packages doesn't change when communities, or the tools they use, decide to add or change the metadata terms/vocabularies/ontologies they want to use. (In the past I've made analogies to HTML in terms of separating how you package content from what's in the package - browsers don't have to change if you decide to add new information to your webpage.) When a new vocabulary becomes available, people can start using it immediately without waiting for repositories to update their import/export capabilities to handle it.
If I understand correctly, RO-Crate can be extended, but the spec standardizes a lot of metadata terms already, which increases interoperability at the cost of requiring more work to implement. I know that, outside the effort described in this poster, the Dataverse team is also considering whether/how to support RO-Crate (Dataverse already exports metadata in multiple formats). That could lead to interesting hybrids where we're also able to provide RO-Crate metadata within the RDA-recommended OAI-ORE/BagIt structure.
Author: Asahiko Matsuda
Date: 09 Apr, 2020
Thank you for describing the process in detail. I'm glad to know that it works quite like what I had in mind. I wasn't aware of the nuances among the specs (I need to read them more!), but, like you said, I imagine this family of formats to be hybridizable, so I'm excited about what can be done with them. Thank you for your very informative poster and discussion.