WGDC Pilot VAMDC

April 18, 2016 at 6:23 am #137944
Stefan Proell
The Virtual Atomic and Molecular Data Centre

Introduction

VAMDC is a worldwide e-infrastructure that federates 41 heterogeneous and interoperable atomic and molecular databases. In VAMDC jargon, each federated database is a Data-Node. Each VAMDC partner is in charge of curating its own node and decides independently on its growth rate, its ingestion system, and the corrections to apply to data already stored. The VAMDC infrastructure can therefore grow in two ways: each node can grow independently, and new nodes can join the federated infrastructure.

Each data-node, regardless of the technology used for storing data (SQL, NoSQL, ASCII files), implements the VAMDC access/query protocols and returns results formatted in a standardized XML format called XSAMS (http://standards.vamdc.eu).

Users can access the data directly, node by node, or through the VAMDC portal, which relays the user request to each node.

Overview
- Pilot Name: The VAMDC consortium, http://www.vamdc.org (project led at the Paris Observatory)
- Contact Person: Carlo Maria Zwölf
- Type: e-infrastructure pilot
- Status: active
- Type of data: all the atomic and molecular (A+M) data shared through the VAMDC infrastructure are part of the use case. Regardless of the technology each federated database uses for storing data (SQL-like databases, NoSQL stores or even text files), each node implements the VAMDC access/query protocols and returns results formatted in a standardized XML format called XSAMS. All the standards that the nodes have to satisfy are specified at http://www.vamdc.eu/standards. The nodes are therefore accessible in a single, unified way. A web interface is available (http://portal.vamdc.eu/vamdc_portal_test/home.seam) and the infrastructure is also accessible using standalone software. Ad hoc libraries (Java and Python) are provided for integrating access to VAMDC into third-party software.
- Dynamics: each partner is in charge of curating its own node and decides independently on its growth rate, its ingestion system, and the corrections to apply to data already stored. The VAMDC infrastructure can therefore grow in two ways: each node can grow independently, and new nodes can join the federated infrastructure.
- Domains: astrophysics, atmospheric physics, fusion, plasma and lightning technologies, environmental sciences, health and clinical science (e.g. radiotherapy)
- Short Description: Make dynamic data extracted from the VAMDC infrastructure citable
- Timeline: 2016-2017.
- Supplementary material: C.M. Zwölf, N. Moreau, M.-L. Dubernet, New Model for dataset citation and extraction reproducibility in VAMDC, Journal of Molecular Spectroscopy, doi:10.1016/j.jms.2016.04.009 (arXiv version at http://arxiv.org/abs/1606.00405).

Implementing the RDA Data Citation Recommendations in VAMDC

Motivation

Assume that a scientist extracts a “dataset” from VAMDC at a given time, composed of an ensemble of “data”, and wishes to use this “dataset” to produce results that will be published in a scientific paper: how can he/she cite this “dataset” and the individual “data”? Since the database content may evolve, for the consistency of the scientific publication the citation should refer to “datasets” that are well defined in space (where the “dataset” physically comes from) and time (when the “dataset” was produced and extracted). In addition, the citations should contain pointers to the authors who originally measured, calculated and/or fitted the individual data. Moreover, for the reproducibility of the scientific process described in the paper referencing the “dataset”, anyone wishing to verify step by step the procedures described in the paper should be able to easily recover the original “dataset” and replay the data-production workflow.

Data Versioning

We have designed a two-layer mechanism for versioning the data:
- The first layer is fine-grained and comes with a major evolution of the VAMDC XML output standard (XSAMS). The technical details are discussed in the paper cited in the last item of the Overview section (doi:10.1016/j.jms.2016.04.009). The proposed evolutions should be officially endorsed by the VAMDC Consortium Board during the next annual meeting (January 2017).
- The second layer is coarse-grained. Each data-node is characterised by a timestamped version tag. At each data modification, the version of the node changes.
These two mechanisms are autonomous, but may work in synergy: suppose that a given data-node, due to some modification, passes from version1 (timestamp1) to version2 (timestamp2). With the second mechanism we know that something changed: in other words, we know that the result of an identical query may differ from one version to the other. The detail of which data changed is accessible using the first mechanism.

Conversely, as long as the version of a node does not change, the results of the same query submitted at different times will be the same.
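The coarse-grained layer can be illustrated with a minimal sketch. The class and function names (`NodeVersion`, `DataNode`, `results_guaranteed_identical`) and the `v1`, `v2` tag format are hypothetical and chosen only for illustration; they are not part of the VAMDC standards.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NodeVersion:
    """Coarse-grained, timestamped version tag of a data-node (hypothetical)."""
    tag: str
    timestamp: datetime


class DataNode:
    """Toy data-node whose version changes at each data modification."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self._counter = 1
        self.version = NodeVersion("v1", datetime.now(timezone.utc))

    def modify_data(self) -> NodeVersion:
        # Any modification of the stored data produces a new version tag.
        self._counter += 1
        self.version = NodeVersion(f"v{self._counter}", datetime.now(timezone.utc))
        return self.version


def results_guaranteed_identical(then: NodeVersion, now: NodeVersion) -> bool:
    """Same version tag: the same query returns the same result.

    A different tag only means the result *may* differ; finding out which
    data changed is the job of the fine-grained layer.
    """
    return then.tag == now.tag
```

A client that recorded the node version at extraction time can later compare it with the current one to decide whether a re-run of the query is guaranteed to reproduce the original dataset.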

Query Store

Given the distributed architecture of the VAMDC infrastructure, implementing a query store is not straightforward. We have designed (and are implementing) a query store that may be seen as an elaborate kind of log service:

When a VAMDC node receives a query from a user, it notifies the log service of the following information:
- The IP of the user (optional)
- The identifier of the user, typically his/her e-mail (optional)
- The client software used for interacting with the node
- The identifier of the Node receiving the query
- The version (with the related timestamp) of the Node receiving the query
- The version of the output standard used for formatting the result
- The query submitted by the user
- The link to the data resulting from processing the user query
For each received query, the service checks whether an entry already exists that:
- has the same query,
- was submitted to the same node,
- has the same Node version,
- has the same version of the output standard.
If no such entry exists:
- we assign the query a unique identifier and a timestamp;
- following the link to the data, we retrieve the output data and process it to extract the relevant metadata (typically a BibTeX file containing the references of all the papers used for compiling the output file);
- we store all the relevant metadata associated with the uniquely identified, timestamped query;
- if provided, we associate the user identifier with the query identifier.
If such an entry already exists:
- we retrieve the existing unique query identifier and (incrementally) associate the new request timestamp (and, if provided, the identifier of the user) with it.
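The lookup-or-create logic described above can be sketched as follows. This is an in-memory illustration under stated assumptions, not the actual VAMDC service: the names `Notification`, `Entry` and `QueryStore`, the use of UUIDs as identifiers, and the `extract_metadata` callback (standing in for downloading the output file and building the BibTeX) are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Notification:
    """What a node sends to the log service for each received query."""
    query: str
    node_id: str
    node_version: str
    standard_version: str
    data_link: str
    client_software: str
    user_id: Optional[str] = None  # optional, typically an e-mail
    user_ip: Optional[str] = None  # optional


@dataclass
class Entry:
    query_id: str
    first_seen: datetime
    metadata: str  # e.g. BibTeX extracted from the output file
    requests: List[Tuple[datetime, Optional[str]]] = field(default_factory=list)


class QueryStore:
    def __init__(self, extract_metadata: Callable[[str], str]) -> None:
        # extract_metadata(data_link) is an assumed helper that follows the
        # link to the data and returns the relevant metadata.
        self._extract = extract_metadata
        self._entries: Dict[Tuple[str, str, str, str], Entry] = {}

    def notify(self, n: Notification) -> str:
        now = datetime.now(timezone.utc)
        # An entry is identified by query, node, node version and standard version.
        key = (n.query, n.node_id, n.node_version, n.standard_version)
        entry = self._entries.get(key)
        if entry is None:
            # New query: assign identifier and timestamp, extract metadata once.
            entry = Entry(str(uuid.uuid4()), now, self._extract(n.data_link))
            self._entries[key] = entry
        # New or existing: record the request timestamp and, if provided, the user.
        entry.requests.append((now, n.user_id))
        return entry.query_id
```

In this sketch the metadata extraction runs only the first time a given combination is seen; repeated submissions just add a (timestamp, user) pair to the existing entry.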
REMARKS:
- Query uniqueness: The query language supported by the VAMDC infrastructure is VSS2 (VAMDC SQL Subset 2, http://vamdc.eu/documents/standards/queryLanguage/vss2.html). We are working on a specific VSS2 parser (based on ANTLR, http://www.antlr.org) which should recognise queries that are expressed in different ways but are semantically identical.
- Resolving unique identifiers: The unique identifier associated with a query is resolvable, and the corresponding landing page returns the metadata associated with the original query in an anonymised way (i.e. without user IPs or e-mails). The metadata include: the original query, the timestamp at which the query was first processed by the system, the accessed node, the version of the node, the version of the standards used by the node for computing the result data file, and the BibTeX file containing the references of all the papers used for compiling the output file.
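To illustrate the query-uniqueness remark, here is a deliberately simplified sketch of mapping differently written queries to one canonical form. It is not the ANTLR-based VSS2 parser mentioned above: it uses regular expressions and assumes a flat `WHERE` clause made only of `AND`-ed comparisons, with no `OR`, parentheses or `BETWEEN`, and no literals containing operators or the word "and". The function name `canonical_form` and the example restrictables are illustrative.

```python
import re


def canonical_form(query: str) -> str:
    """Return a canonical spelling of a flat 'SELECT ... WHERE a AND b' query."""
    text = " ".join(query.split())  # collapse runs of whitespace
    parts = re.split(r"\s+where\s+", text, maxsplit=1, flags=re.IGNORECASE)
    head = parts[0].upper()  # assumes the SELECT part is case-insensitive
    if len(parts) == 1:
        return head
    conditions = re.split(r"\s+and\s+", parts[1], flags=re.IGNORECASE)
    # Normalise spacing around comparison operators, then sort the conditions,
    # since the order of AND-ed conditions does not change the meaning.
    normalised = sorted(
        re.sub(r"\s*(<=|>=|!=|<>|=|<|>)\s*", r" \1 ", c).strip()
        for c in conditions
    )
    return head + " WHERE " + " AND ".join(normalised)
```

Two spellings of the same request, such as `SELECT ALL WHERE AtomSymbol = 'Fe' AND RadTransWavelength > 4000` and `select all  where RadTransWavelength>4000 and AtomSymbol='Fe'`, map to the same string, which could then serve as the query part of the lookup key in the query store. A real implementation needs a proper grammar to handle the cases excluded here.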