Enhancing Generic Data Descriptors With Discipline Specific Metadata

By Vaidas Morkevičius, 12 December 2022

Summary: This project created a recommended framework for harvesting rich metadata of Social Science Data (SSD) objects and delivering them to the EOSC Portal Service Catalogue for discovery.

This work received funding from the EOSC Future project as part of the RDA Open Calls Programme aimed at strengthening the work of RDA Communities in EOSC.

This proposal builds on recommendations in the RDA recommendation Guidelines for publishing structured metadata on the Web and in the RDA Supporting Outcome Data Discovery Paradigms: User Requirements and Recommendations for Data Repositories. It addresses the objectives of the “Research Data Architectures in Research Institutions” Interest Group (IG), which are 1) to explore how diverse tools, technologies, and services can be integrated to meet the evolving needs of researchers in research institutions and 2) to consider interoperability between institutional research data infrastructures and (inter)national or discipline-based infrastructures.

Generic data catalogs and repositories, such as Harvard Dataverse Repository, Dryad Digital Repository, or Figshare Repository, commonly use generic metadata schemes for curating their data. Even where descriptions in more specific metadata standards are available (for example, the DDI Codebook metadata format in the Harvard Dataverse Repository), they are just transformations from and/or additions to more generic formats. This practice is largely predetermined by the nature of these repositories: they aim to store very different data sets from a variety of disciplines. Therefore, they need to keep metadata as simple as possible so that tabular, textual, coordinate, or visual data can be described uniformly.

On the other hand, the FAIR Guiding Principles require that data be described with rich metadata (see F2), as digital resources and objects that are “not well-described cannot be accurately discovered” (Jacobsen et al., 2020). Moreover, metadata should meet domain-relevant community standards (see R1.3), which implies that data should be described according to domain-tailored standards. Therefore, we may question whether generic data repositories could ever comply with the FAIR Guiding Principles if they continue to use generic data descriptors as their metadata standards. This question is very important for the recently developed EOSC Portal Catalogue and Marketplace, which is envisioned as a Web portal that facilitates searching, discovering and ordering of services from various providers across domains in European countries (EOSC Executive Board, 2021: 7).

To investigate how rich (discipline specific) data descriptions could be included in generic metadata schemes, so that generic data descriptors become compatible with the FAIR Guiding Principles, members of the Lithuanian National RDA Node submitted the project Framework for Increased Discoverability of Social Science Data Objects in the EOSC Portal Service Catalogue to the RDA Open Call mechanism of the EOSC Future project. The project received support and produced several products relevant to the RDA community. These products extend the RDA recommendation Guidelines for publishing structured metadata on the Web with a more detailed specification of how generic descriptors could incorporate discipline specific metadata when publishing structured metadata. In addition, they address Recommendation 3 of the RDA Supporting Outcome Data Discovery Paradigms: User Requirements and Recommendations for Data Repositories by showcasing how discipline specific information could be added to generic descriptors in order to make data easier for researchers to discover. Finally, they address one of the main objectives of the RDA Interest Group Research Data Architectures in Research Institutions by providing examples of how to ensure interoperability between institutional research data infrastructures and (inter)national infrastructures.

The project explored the three most commonly used generic metadata schemes: Dublin Core terms, DataCite Metadata Schema, and OpenAIRE Guidelines. Analysis showed that all of them contain two elements (terms) that could be employed for enriching generic data descriptors with discipline specific metadata:

1. Type (Resource Type). Suitable for differentiating the types of datasets. If properly specified, this element would make it possible to build harvesting algorithms that collect discipline and dataset specific metadata and display them correctly in generic repositories. Among the three standards, the OpenAIRE Guidelines contain the most advanced specification of resource types. They require the general type of the resource to be identified in the attribute resourceTypeGeneral, and they also require the COAR Resource Type Vocabulary (with a uri attribute linking to a vocabulary term) to be used for describing the data set more precisely. This approach may also be employed in other generic data descriptors.

2. Description. Suitable for including detailed information about important aspects of the context, conditions and process of data collection. The DataCite Metadata Schema already allows a more detailed specification of description types with the sub-property descriptionType. This sub-property is essential for including different descriptors from discipline specific metadata schemes and needs to be specified, preferably, as a controlled vocabulary. Following the logic of the element Type (Resource Type), the element Description in generic data descriptors should be specified as having an attribute descriptionType, with recommendations for controlled vocabulary use where relevant (with a uri attribute linking to a vocabulary term).
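The two elements above can be sketched in a few lines of Python. This is only an illustration of the idea, not the project's implementation: the element names are written without XML namespaces for brevity, the COAR URI is assumed to be the vocabulary's "dataset" concept, and the description-type URI is a hypothetical placeholder.

```python
import xml.etree.ElementTree as ET

# Assumption: URI of the "dataset" concept in the COAR Resource Type Vocabulary.
COAR_DATASET_URI = "http://purl.org/coar/resource_type/c_ddb1"
# Hypothetical URI for a discipline specific description type (not a real term).
HYPOTHETICAL_MODE_URI = "https://example.org/vocab/description-type/collection-mode"


def build_record(type_label, type_uri, descriptions):
    """Build a minimal record with a typed resource and typed descriptions.

    descriptions is a list of (description_type, uri_or_None, text) tuples.
    """
    record = ET.Element("record")
    # Type: general type in an attribute, precise type via a vocabulary URI.
    resource_type = ET.SubElement(
        record,
        "resourceType",
        {"resourceTypeGeneral": "dataset", "uri": type_uri},
    )
    resource_type.text = type_label
    # Description: each entry carries a descriptionType and, optionally, a URI.
    for description_type, uri, text in descriptions:
        attrs = {"descriptionType": description_type}
        if uri is not None:
            attrs["uri"] = uri
        description = ET.SubElement(record, "description", attrs)
        description.text = text
    return record


record = build_record(
    "dataset",
    COAR_DATASET_URI,
    [
        ("Abstract", None, "Placeholder abstract text."),
        ("Other", HYPOTHETICAL_MODE_URI, "Placeholder data collection mode."),
    ],
)
print(ET.tostring(record, encoding="unicode"))
```

A harvester could then branch on the uri of resourceType to decide which discipline specific description types to expect and how to display them.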

As a result of these investigations, two products relevant to the rich description of social science data objects were developed:

1. Recommendation of the vocabulary for standardized description of types of social science data objects (https://doi.org/10.5281/zenodo.7152218). This recommendation could be integrated into the existing COAR Resource Type Vocabulary.

2. Recommendation of metadata fields for detailed description of the two most common types of social science data objects: survey data and aggregated data (https://doi.org/10.5281/zenodo.7125596). These recommendations reflect the CESSDA Metadata Model (Akdeniz et al., 2021), one of the most authoritative guides for creating metadata in the social science domain, and the DDI Codebook and DDI Lifecycle standards, which were created specifically for describing social science data objects.

Based on these findings, a use case (prototype) was developed that serves metadata of social science data objects from the Lithuanian Data Archive for Humanities and Social Sciences (LiDA) Dataverse repository (https://lida.dataverse.lt) to the portal of the Lithuanian Academic Electronic Library (LVB, http://www.lvb.lt/en), which performs metadata harvesting, indexing and publishing (https://doi.org/10.5281/zenodo.7125728). The LVB portal, which runs on the Ex Libris Primo library discovery service, employs the OpenAIRE metadata standard for describing its records. Therefore, a transformation of fields from the Dataverse native JSON descriptor into an OpenAIRE Guidelines compliant metadata scheme had to be developed (see Annex 1).
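As a rough illustration of what such a transformation involves, the sketch below pulls a few fields out of the social science metadata block of a Dataverse native JSON record and turns them into typed descriptions. It is not the mapping from Annex 1: the descriptionType labels are hypothetical, the selection of fields is an assumption, and the sample record contains placeholder values only.

```python
# Illustrative mapping only; the project's actual mapping is given in its Annex 1.
# Keys are assumed field names from the Dataverse "socialscience" metadata block;
# values are hypothetical descriptionType labels.
FIELD_TO_DESCRIPTION_TYPE = {
    "unitOfAnalysis": "UnitOfAnalysis",
    "universe": "Universe",
    "samplingProcedure": "SamplingProcedure",
    "collectionMode": "CollectionMode",
}


def extract_typed_descriptions(dataset_json):
    """Return a list of {"descriptionType", "text"} dicts for mapped fields."""
    blocks = dataset_json.get("datasetVersion", {}).get("metadataBlocks", {})
    fields = blocks.get("socialscience", {}).get("fields", [])
    result = []
    for field in fields:
        description_type = FIELD_TO_DESCRIPTION_TYPE.get(field.get("typeName"))
        if description_type is None:
            continue  # field is not part of this illustrative mapping
        value = field.get("value")
        # A field may hold a single string or a list of strings.
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, str) and item.strip():
                result.append(
                    {"descriptionType": description_type, "text": item.strip()}
                )
    return result


# Placeholder record shaped like Dataverse native JSON (values are invented).
sample = {
    "datasetVersion": {
        "metadataBlocks": {
            "socialscience": {
                "fields": [
                    {"typeName": "unitOfAnalysis", "multiple": True,
                     "typeClass": "primitive", "value": ["Individual"]},
                    {"typeName": "samplingProcedure", "multiple": False,
                     "typeClass": "primitive", "value": "Placeholder sampling text"},
                    {"typeName": "responseRate", "multiple": False,
                     "typeClass": "primitive", "value": "Placeholder"},
                ]
            }
        }
    }
}

for entry in extract_typed_descriptions(sample):
    print(entry["descriptionType"], "->", entry["text"])
```

Each resulting entry could then be written out as a Description element with a descriptionType attribute in the target metadata scheme.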

Readers of this blog entry may explore the implemented use case (prototype) as follows:

1. In a web browser, open the LVB English interface: https://www.lvb.lt/primo-explore/search?vid=ELABA&lang=en_US.

2. Select the resource Lithuanian Data Archive for Social Sciences and Humanities (LIDA) in the list of search resources.

3. Enter lida in the search field.

4. From the list of short records in the results, follow a link to a full record (for example, the third record is Emigracija iš Lietuvos pagal šalis, 1919-1940 m., or use the permalink: https://www.lvb.lt/permalink/f/11u90ov/LIDAOOA21.12137/C0WSSS).

5. Explore the metadata fields of the detailed record (especially the Description field).

6. Compare metadata fields with those available in the original data object (https://hdl.handle.net/21.12137/C0WSSS).

Finally, a blueprint of a framework for harvesting rich metadata of social science data objects and delivering them for discovery to the EOSC Portal Catalogue and Marketplace was developed (see below). It describes the implemented use case and presents a generalized version of the suggested implementation for the EOSC Portal Catalogue and Marketplace (or any other generic data repository).