
WGDC Pilot VAMDC


    The Virtual Atomic and Molecular Data Centre

    Introduction

    VAMDC is a worldwide e-infrastructure that federates 41 heterogeneous and interoperable atomic and molecular databases. In VAMDC jargon, each federated database is a data-node. Each VAMDC partner is in charge of curating its own node and decides independently on its growth rate, its ingestion system, and the corrections to apply to data already stored. The VAMDC infrastructure can therefore grow in two ways: each node can grow independently, and new nodes can join the federated infrastructure.

    Each data-node, regardless of the technology used for storing data (SQL, NoSQL, ASCII files), implements the VAMDC access/query protocols and returns results formatted in a standardized XML format called XSAMS (http://standards.vamdc.eu).

    The user can access the data directly, node by node, or can use the VAMDC portal, which relays the user request to each node.

     

    Overview

    • Pilot Name: The VAMDC consortium, http://www.vamdc.org (project led at the Paris Observatory)
    • Contact Person: Carlo Maria Zwölf
    • Type: e-infrastructure pilot
    • Status: active
    • Type of data: all the atomic and molecular (A+M) data shared through the VAMDC infrastructure are part of the use case. Regardless of the technology each federated database uses for storing data (SQL-like databases, NoSQL, or even text files), each node implements the VAMDC access/query protocols and returns results formatted in a standardized XML format called XSAMS. All the standards that the nodes have to satisfy are specified at http://www.vamdc.eu/standards. The nodes are therefore accessible in a single, unified way. A web interface is available (http://portal.vamdc.eu/vamdc_portal_test/home.seam), and the infrastructure is also accessible using standalone software. Ad-hoc libraries (Java and Python) are provided for integrating access to VAMDC into third-party software.
    • Dynamics: each partner is in charge of curating its own node and decides independently on its growth rate, its ingestion system, and the corrections to apply to data already stored. The VAMDC infrastructure can therefore grow in two ways: each node can grow independently, and new nodes can join the federated infrastructure.
    • Domains: astrophysics, atmospheric physics, fusion, plasma and lightning technologies, environmental sciences, health and clinical science (e.g. radiotherapy)
    • Short Description: Make dynamic data extracted from the VAMDC infrastructure citable
    • Timeline: 2016-2017.

     

     

    Implementing the RDA Data Citation Recommendations in VAMDC

    Motivation

    Assume that a scientist extracts a “dataset”, composed of an ensemble of “data”, from VAMDC at a given time and wishes to use it to produce results that will be published in a scientific paper: how can he/she cite this “dataset” and the individual “data”? Since the database content may evolve, for the consistency of the scientific publication the citation should refer to “datasets” that are well defined in space (where the “dataset” physically comes from) and in time (when the “dataset” was produced and extracted). In addition, the citation should contain pointers to the authors who originally measured, calculated and/or fitted the individual data. Moreover, for the reproducibility of the scientific process described in the paper referencing the “dataset”, anyone wishing to verify step by step the procedures described in the paper should be able to easily recover the original “dataset” and replay the data-production workflow.

     

    Data Versioning

    We have designed a two-layer mechanism for versioning the data:

    • The first layer is fine-grained and comes with a major evolution of the VAMDC XML output standard (XSAMS). The technical details are discussed in the paper cited in the last item of the Overview section (doi:10.1016/j.jms.2016.04.009). The proposed evolutions should be officially endorsed by the VAMDC Consortium Board during the next annual meeting (January 2017).
    • The second layer is coarse-grained. Each data-node is characterised by a timestamped version tag. At each data modification, the version of the node changes.

    These two mechanisms are autonomous but can work in synergy: suppose that a given data-node, because of some modification, passes from version1 (timestamp1) to version2 (timestamp2). The second mechanism tells us that something changed: in other words, the result of an identical query may differ from one version to the other. The details of which data changed are accessible through the first mechanism.

    Conversely, as long as the version of a node does not change, the results of the same query submitted at different times will be the same.
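
    The coarse-grained layer can be illustrated with a minimal Python sketch. It is not the VAMDC implementation: the `DataNode` and `NodeVersion` classes, the integer version tag, and the in-memory record store are all assumptions made for illustration. The sketch only shows the stated invariant: every data modification changes the node version, and an unchanged version means an identical query returns identical results.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NodeVersion:
    """Coarse-grained version of a data-node: a tag plus its timestamp."""
    tag: int            # hypothetical: the source does not specify the tag format
    timestamp: datetime


class DataNode:
    """Hypothetical in-memory node; real nodes sit on SQL, NoSQL or text files."""

    def __init__(self, identifier: str) -> None:
        self.identifier = identifier
        self.records: dict[str, float] = {}
        self.version = NodeVersion(1, datetime.now(timezone.utc))

    def modify(self, key: str, value: float) -> None:
        # Any data modification changes the version of the node.
        self.records[key] = value
        self.version = NodeVersion(self.version.tag + 1, datetime.now(timezone.utc))

    def query(self, prefix: str) -> dict[str, float]:
        # Stand-in for a real query: select records whose key starts with prefix.
        return {k: v for k, v in sorted(self.records.items()) if k.startswith(prefix)}


def may_differ(before: NodeVersion, after: NodeVersion) -> bool:
    """An identical query may return different results only if the tag changed."""
    return before.tag != after.tag
```

    Comparing two version tags says only that something changed; finding out which data changed is the job of the fine-grained layer, which this sketch does not model.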

    Query Store

    Considering the distributed architecture of the VAMDC infrastructure, the implementation of a query store is not straightforward. We have designed (and are implementing) a query store which may be seen as a complex kind of log service: 

    When a VAMDC node receives a query from a user, it sends the following information to the log service:

    • The IP of the user (optional)
    • The identifier of the user, typically his/her e-mail (optional)
    • The software client used to interact with the node
    • The identifier of the Node receiving the query
    • The version (with the related timestamp) of the Node receiving the query
    • The version of the output standard used to format the result
    • The query submitted by the user
    • The link to the data resulting from processing the user query

    For each received query, the service checks whether an entry already exists:

    • having the same query 
    • submitted to the same node
      • having the same Node version
      • having the same version of output standards.

    If such an entry does not exist:

    • we provide the query with a unique identifier and a timestamp;
    • following the link to the data, we get the output data and process it to extract the relevant metadata (typically a BibTeX file containing the references of all the papers used for compiling the output file);
    • we store all the relevant metadata associated with the uniquely identified, timestamped query;
    • if provided, we associate the user identifier with the query identifier.

    If such an entry already exists:

    • we get the existing unique query identifier and (incrementally) associate the new request timestamp (and, if provided, the identifier of the user) with that query identifier.
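
    The lookup-or-create logic described above can be sketched as follows. This is an illustration under stated assumptions, not the service being implemented: the `QueryStore` and `QueryEntry` names, the use of a UUID as the unique identifier, exact string comparison for "same query", and the in-memory dictionary are all hypothetical. Metadata extraction is passed in as a function, because the real step (following the data link and building the BibTeX file) is outside the scope of the sketch.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class QueryEntry:
    query_id: str                     # unique identifier (a UUID is an assumption)
    first_timestamp: datetime         # when the query was first processed
    metadata: dict                    # e.g. the extracted BibTeX references
    request_timestamps: list = field(default_factory=list)
    user_ids: set = field(default_factory=set)


class QueryStore:
    """Hypothetical in-memory log service keyed on the four matching criteria."""

    def __init__(self, extract_metadata: Callable[[str], dict]) -> None:
        self._extract_metadata = extract_metadata
        self._entries: dict[tuple, QueryEntry] = {}

    def notify(self, query: str, node_id: str, node_version: str,
               standard_version: str, data_link: str,
               user_id: Optional[str] = None) -> str:
        # Same query, same node, same node version, same output-standard version.
        key = (query, node_id, node_version, standard_version)
        now = datetime.now(timezone.utc)
        entry = self._entries.get(key)
        if entry is None:
            # New entry: assign identifier and timestamp, then extract metadata
            # by following the link to the output data.
            entry = QueryEntry(
                query_id=str(uuid.uuid4()),
                first_timestamp=now,
                metadata=self._extract_metadata(data_link),
            )
            self._entries[key] = entry
        # In both cases, record this request against the query identifier.
        entry.request_timestamps.append(now)
        if user_id is not None:
            entry.user_ids.add(user_id)
        return entry.query_id
```

    Because the node version is part of the key, resubmitting the same query after a node modification yields a new identifier, while repeated submissions against an unchanged node share one identifier and trigger metadata extraction only once.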

     

    REMARKS:

     

    • Resolving unique identifiers: the unique identifier associated with a query is resolvable, and the corresponding landing page returns the metadata associated with the original query in an anonymised way (i.e. without user IPs or e-mails). The metadata include: the original query, the timestamp at which the query was first processed by the system, the accessed node, the version of the node, the version of the standards used by the node for computing the result data file, and the BibTeX file containing the references of all the papers used for compiling the output file.
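
    The anonymisation step for the landing page might look like the sketch below. The field names and all values in `example_entry` are placeholders chosen for illustration; the source lists which pieces of information are stored but does not define a schema.

```python
# Hypothetical field names; the source lists the information but not a schema.
PRIVATE_FIELDS = frozenset({"user_ip", "user_id"})


def landing_page_metadata(stored_entry: dict) -> dict:
    """Return a copy of a stored query entry without user-identifying fields."""
    return {k: v for k, v in stored_entry.items() if k not in PRIVATE_FIELDS}


# Placeholder record: every value here is invented for the example.
example_entry = {
    "query_id": "example-id",
    "query": "example query text",
    "first_timestamp": "2016-11-01T12:00:00Z",
    "node_id": "example-node",
    "node_version": "version1",
    "standard_version": "example-standard-version",
    "bibtex": "@article{placeholder}",
    "user_ip": "192.0.2.1",
    "user_id": "user@example.org",
}
public_view = landing_page_metadata(example_entry)
```

    Filtering at read time keeps the user associations available to the service itself while ensuring they never reach the public landing page.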

     

     
