DF Configuration "PID Centric Data Management and Access"

Editorial note: This document sets the scene and describes possible investigation paths that ultimately led to the founding of the PID Kernel Information WG. This spin-off group dedicated itself to addressing the challenges sketched below.

This is the place to develop, step by step, the plans for a specific configuration of core components called "PID Centric Data Management and Access". As the Data Fabric IG indicates in its documents, there will be many different configurations of this kind. Only those RDA experts who share the basic concepts explained in this document will be involved in building this configuration.

PID Profiles: summary of P8 session (Beth Plale)

DF-IG held a session at P8 on PID profiles, also known as the minimal metadata associated with PIDs. The motivation for the work is to enable new forms of discovery and management of data objects. Two specific examples were given: provenance, and rapid, internet-speed filtering and routing of data objects. Data provenance, despite having existed since 2005 and having standard description languages, remains siloed in the systems that create it. Attaching a minimal object-oriented provenance record directly to a handle opens the opportunity for new tools that remove the silos, thus fully realizing the capability of data provenance. Rapid decision making on PIDs: suppose a client tool is handed a list of 100,000,000 PIDs and needs to act on the items in the list quickly. The only action it can perform quickly enough is to consult the PID minimal metadata. How does this capability enable a new ecosystem of tools?
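The rapid decision-making scenario can be sketched as follows. This is only an illustration under stated assumptions, not a prescribed implementation: the record fields (`data_type`, `size_bytes`), the example PIDs, and the `fetch_minimal_record` lookup are all hypothetical stand-ins for whatever a profile would actually define. A real client would resolve each PID against a Handle service rather than an in-memory table.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional

Record = Dict[str, str]

# Hypothetical minimal-metadata records keyed by PID. In practice these
# would be obtained by resolving each Handle, not from a local dict.
SAMPLE_RECORDS: Dict[str, Record] = {
    "example/0001": {"data_type": "image/tiff", "size_bytes": "1048576"},
    "example/0002": {"data_type": "text/csv", "size_bytes": "2048"},
    "example/0003": {"data_type": "text/csv", "size_bytes": "512000"},
}


def fetch_minimal_record(pid: str) -> Optional[Record]:
    """Stand-in for PID resolution; returns None for unknown PIDs."""
    return SAMPLE_RECORDS.get(pid)


def filter_pids(
    pids: Iterable[str],
    accept: Callable[[Record], bool],
    fetch: Callable[[str], Optional[Record]] = fetch_minimal_record,
) -> Iterator[str]:
    """Yield only the PIDs whose minimal metadata passes the accept test.

    Works lazily, so a very long PID list is never held in memory twice.
    """
    for pid in pids:
        record = fetch(pid)
        if record is not None and accept(record):
            yield pid


def is_small_csv(record: Record) -> bool:
    """Example accept/reject rule based only on the minimal metadata."""
    return (
        record.get("data_type") == "text/csv"
        and int(record.get("size_bytes", "0")) < 100_000
    )


if __name__ == "__main__":
    candidates = ["example/0001", "example/0002", "example/0003", "example/9999"]
    for pid in filter_pids(candidates, is_small_csv):
        print(pid)
```

The point of the sketch is that the accept/reject rule never touches the data object or its full metadata; it reads only the small record attached to the PID.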

The meeting accomplished two things: 1) we agreed to restrict our focus to Handles in this activity; other PID types may be follow-on work. 2) we formed four small subgroups, each of which will define a profile between now and P9. The groups and their members are given below. Slides from the session are attached.

Data provider - Digital Humanities

1. Bridget Almas, Tufts

2. Ulrich Schwarzmann, GWDG, Germany

3. Beth Plale, Data To Insight Center, Indiana University

 

Data consumer - Digital Humanities

1. Daan Broeder, MPI

2. Mike Jones, Mendeley

3. Beth Plale, Data To Insight Center, Indiana University

 

Data provider - natural/physical science

1. Stuart Chalk, Univ North Florida

2. Alex Thompson, iDigBio

3. Yumiang Zhu, Institute for Geographic Sciences and Natural Resources, CAS, China

4. Cyndy Chandler, Woods Hole

5. Stuart Rhea, AgConnections

6. Mario Silva, Institute for Systems and Computer Engineering, Portugal

7. Beth Plale, Data To Insight Center, Indiana University

8. Tobias Weigel, DKRZ, Germany

 

Data consumer - natural/physical science

1. Stuart Chalk, Univ North Florida

2. Alex Thompson, iDigBio

3. Kei Kurakawa, National Institute of Informatics, Japan

4. Sharef Youssef, NIST

5. Jim Duncan, Vermont Monitoring Cooperative

6. Stuart Rhea, Ag Connections

7. Beth Plale, Data To Insight Center, Indiana University

8. Tobias Weigel, DKRZ, Germany

RDA Data Fabric Configurations

Beth, Larry, Peter

March 9, 2016

At the RDA P7 plenary, the Data Fabric Interest Group initiated the next logical step in data fabric configurations by adopting an inductive strategy: identifying components, drawing heavily on RDA Recommendations that have reference software or are machine actionable (schemas, ontologies, etc.); building compositions of components that work together; and checking compliance with other RDA Recommendations where possible. By leveraging the testbed efforts emerging in national data services, these compositions may serve as shareable services on which machine-actionable RDA Recommendations can be evaluated by projects and groups that lack the resources to evaluate the services in their own environment. Additionally, the early compositions can be extended stepwise over time with additional components, demonstrating that the RDA Recommendations are of greater value when coupled than when separate. Moreover, this emerging configuration will serve as another excellent environment for identifying actual gaps in the set of RDA Recommendations.

Background

The Data Fabric Interest Group (DFIG) has worked on optimising the conditions for those who carry out data-intensive work in the various labs, following a continuous cycle of creating and consuming data to gain new insights. This is indicated in the diagram below. The task of DFIG was and is to investigate this cycle, identify essential core components, place RDA results into the cycle, identify gaps, determine interoperability requirements, etc. In the meantime a large number of potential components have been listed (https://rd-alliance.org/group/data-fabric-ig.html), and the experts are checking their relevance and priority. Essential components from other initiatives, such as OAI-PMH (OAI) and PROV (W3C), also need to be integrated into data fabrics.

The outcome of all previous discussions was the agreement that it is important to identify Core Components (CoCos) that form the basis of efficiently functioning “data fabrics” and that are shared across fabrics. Concrete instantiations combine such CoCos in different ways, so we do not speak of a specific “architecture” in the narrow IT sense, but of “configurations” of interoperating components that depend on the task, as shown in the second diagram. Of course, such configurations can only be instantiated if the CoCos have clearly specified and interoperable interfaces. Different sets of CoCos can be used in different configurations depending on requirements. The key is that a standard set of CoCos is agreed upon, and that the components are well developed, documented, and readily available for data-intensive work.

Evaluative Testing

It is important to support the evaluation of the CoCos, individually and in combination, because evaluation is the precursor to adoption. Several evaluation projects have therefore been started, in particular in the US and EU. Since only this type of real-world testing will indicate whether the specifications of current and future CoCos are sufficient or need to be changed, many such evaluations across communities will need to take place.

PID Centric Data Management and Access

A group has formed to build up, step by step, a fabric composition around a “PID Centric Data Management and Access” configuration, based on the reference software of the Persistent Identifier Types and Data Type Registry RDA recommendations and on overall compliance with the model suggested by DFT. Numerous data experts see PIDs as central to proper data management and access. Experts committed to this model will collaborate on two strands of this configuration:

  • work out a model for minimal information types associated with PIDs that works with the reference software of the Persistent Identifier Types and Data Type Registry RDA recommendations and do a concrete implementation
  • work out more details of the proposed configuration

While the first strand focuses on using RDA outcomes with existing software, the second strand looks for easy-to-add functionality in order to specify and add further components of this PID Centric Data Management and Access configuration.

Why undertake this? Minimal information types associated with a PID allow rapid and lightweight accept/reject decisions to be made about the suitability of an object. PIDs resolve extremely quickly, so the minimal PID information could be used, for instance, to make decisions about routing an object. But the PID record should also contain the pointer to the full metadata description (see diagram below) so that richer decisions can be made. Pointers to the metadata embedded with the PID also ensure that the metadata record is not decoupled from the data record: the two stay bound through the PID record. To enable this kind of functionality, the information in the PID record needs to carry type information, and the definition of each type needs to be both registered and described in a way that lets humans and machines interpret the typed data.

The PID thus serves, in essence, as a map or guide both to locating a given data object (also known as a digital object) and to interpreting and potentially reusing it. As indicated in the diagram, a PID record contains minimal information about the Digital Object it references. This can be implemented as links that are “call by reference” or “call by value” in computer science terms. In the former case, the PID record points to external, reliable information sources; in the latter, the information is held in the record itself. In either case the PID has a binding role. All of the pieces that together make a data object available and understandable over time have to be sustained, so they have to be chosen carefully. If this binding model proves useful to many, it can serve as a basis on which to add further components.
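A typed PID record of the kind described above might be modelled as follows. This is a hedged sketch, not the data model of any RDA recommendation or of the Handle System: the class names, the type identifiers (`types/checksum`, `types/metadata-location`), the in-memory `TYPE_REGISTRY`, and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TypedEntry:
    """One piece of minimal information in a PID record."""
    type_id: str                 # identifier of a registered type (hypothetical)
    value: str                   # the value itself, or a pointer to it
    by_reference: bool = False   # True: value points to an external source


@dataclass
class PidRecord:
    """A PID together with its typed minimal information."""
    pid: str
    entries: List[TypedEntry] = field(default_factory=list)

    def get(self, type_id: str) -> Optional[TypedEntry]:
        """Return the first entry of the given type, or None if absent."""
        for entry in self.entries:
            if entry.type_id == type_id:
                return entry
        return None


# Hypothetical stand-in for a type registry: type identifier -> description.
TYPE_REGISTRY: Dict[str, str] = {
    "types/checksum": "Checksum of the object's bit sequence",
    "types/metadata-location": "Pointer to the full metadata description",
}


def metadata_pointer(record: PidRecord) -> Optional[str]:
    """Return the pointer to the full metadata description, if present."""
    entry = record.get("types/metadata-location")
    return entry.value if entry is not None else None


def unregistered_types(
    record: PidRecord, registry: Dict[str, str] = TYPE_REGISTRY
) -> List[str]:
    """List type identifiers in the record that the registry cannot explain."""
    return [e.type_id for e in record.entries if e.type_id not in registry]


# Example record: the checksum is held by value, the metadata by reference.
EXAMPLE = PidRecord(
    pid="example/0001",
    entries=[
        TypedEntry("types/checksum", "abc123"),
        TypedEntry(
            "types/metadata-location",
            "https://example.org/metadata/0001",
            by_reference=True,
        ),
    ],
)
```

A client could make a lightweight decision from the by-value entries alone and follow `metadata_pointer(EXAMPLE)` only when a richer decision is needed; `unregistered_types` illustrates why each type has to be registered before a machine can interpret the record.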

A new phase in Data Fabric IG activity has therefore begun: one characterised by the will of some experts to go beyond a fabric consisting of the two RDA reference software components and to include RDA-agreed schemas and other components (such as a handle service) that more fully enable their use.

1. Inductive Data Fabric and Minimal PID Information Types Effort

Component composition in RDA is the marrying of RDA reference software emerging from RDA recommendations with RDA recommendations that are actionable by software (schemas, ontologies, etc.). In an inductive style of thinking, data fabrics are built up by assembling RDA reference software with machine-actionable RDA recommendations and other software that is compliant with the underlying models. Since at the culmination of P7 only a small number of RDA recommendations had reference software, we start with the two that do (Persistent Identifier Types and Data Type Registry) and grow a fabric from there. With this data model in place and published as an RDA recommendation, three components now exist to form a working data fabric. The group will further examine issues of access so that other groups can use these components for experimentation.

This activity is expected to have a 12-month duration and is led by Beth Plale and Larry Lannom. 

2. Binding Role of PIDs

The work within the first strand focuses on the software that has been worked out by two RDA groups, enriches it with the definition of a minimal set of information types, and develops a reference implementation. The second strand will build on top of this: it will identify additional functionality that fits the binding role of PIDs, look for existing software, and work out possible extensions of the basic configuration, as indicated above in the description of the third diagram. Additional software will be required and will have to be developed.

This activity is expected to have an initial 12-month duration and is led by Tobias Weigel and Peter Wittenburg.

Planning

The plan is to bring together, at a workshop in July 2016, the group of experts committed to taking concrete steps towards a “PID Centric Data Management and Access” configuration, and to present progress in the DFIG sessions at Plenary 8 in Denver.

Attached are a few slides which Beth presented in Tokyo.