Scalable Dynamic Data Citation Methodology

02 Nov 2015


By Andreas Rauber


Data Citation Working Group

Group Co-Chairs: Andreas Rauber (Vienna University of Technology), Dieter Van Uytvanck (CLARIN), Ari Asmi (University of Helsinki), Stefan Pröll (SBA Research) (Secretary)

Recommendation Title: Scalable Dynamic Data Citation Methodology

Authors: Andreas Rauber; Ari Asmi; Dieter van Uytvanck; Stefan Proell

Impact: Supports accurate citation of data subject to change, enabling efficient processing of data and linking from publications.

Recommendation package DOI: http://dx.doi.org/10.15497/RDA00016
Citation: Andreas Rauber; Ari Asmi; Dieter van Uytvanck; Stefan Proell (2015): Data Citation of Evolving Data: Recommendations of the Working Group on Data Citation (WGDC). DOI: 10.15497/RDA00016

 

Digitally driven research depends on quickly evolving technology. As a result, many existing tools and collections of data were not developed with a focus on long-term sustainability. Researchers strive for fast results and the promotion of those results, but without a consistent, long-term record of their data, the evaluation and verification of research experiments and business processes is not possible.

There is a strong need for data identification and citation mechanisms that identify arbitrary subsets of large data sets with precision in a machine-actionable way. These mechanisms need to be user-friendly, transparent, machine-actionable, scalable and applicable to various static and dynamic data types.

 

The aim of the Dynamic Data Citation Working Group was to devise a simple, scalable mechanism that allows the precise, machine-actionable identification of arbitrary subsets of data at a given point in time, irrespective of any subsequent addition, deletion or modification. The principles must be applicable regardless of the underlying database management system (DBMS), working across technological changes. The mechanism shall enable efficient resolution of the identified data, allowing it to be used both in human-readable citations and in machine-processable links to data as part of analysis processes.

 

The approach recommended by the Working Group relies on dynamic resolution of a data citation via a time-stamped query, also known as dynamic data citation. It is based on time-stamped, versioned source data and time-stamped queries that retrieve the desired dataset in the appropriate version as it existed at the specified time.

 

The solution comprises the following core recommendations (a minimal sketch of how they fit together follows the list):

» Data Versioning: To retrieve earlier states of a dataset, the data needs to be versioned. Markers shall indicate inserts, updates and deletes of data in the database.

» Data Timestamping: Ensure that operations on the data are time-stamped, i.e. any additions, deletions or modifications are marked with a timestamp.

» Data Identification: The data used shall be identified via a PID pointing to a time-stamped query, resolving to a landing page.
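The following minimal sketch (Python with SQLite; the schema and names such as measurements, valid_from, valid_to and query_store are illustrative assumptions, not prescribed by the recommendations) shows one way the three building blocks could be combined: versioned, timestamped records plus a store of timestamped queries, each identified by a PID and accompanied by a result hash.

```python
import hashlib
import sqlite3
import uuid
from itertools import count

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A strictly increasing logical timestamp keeps the example deterministic;
# a production system would record wall-clock time as well.
_clock = count(1)
def now():
    return next(_clock)

# Versioning + timestamping: a row is current while valid_to IS NULL;
# an update closes the old version and inserts a new one instead of overwriting it.
cur.execute("""CREATE TABLE measurements (
    id INTEGER, value REAL, valid_from INTEGER NOT NULL, valid_to INTEGER)""")

def insert(id_, value):
    cur.execute("INSERT INTO measurements VALUES (?, ?, ?, NULL)", (id_, value, now()))

def update(id_, value):
    ts = now()
    cur.execute("UPDATE measurements SET valid_to = ? WHERE id = ? AND valid_to IS NULL",
                (ts, id_))
    cur.execute("INSERT INTO measurements VALUES (?, ?, ?, NULL)", (id_, value, ts))

# Identification: the cited subset is a stored, timestamped query plus a result hash,
# to which a PID (here a UUID standing in for e.g. a DOI) is assigned.
cur.execute("""CREATE TABLE query_store (
    pid TEXT PRIMARY KEY, query TEXT, executed_at INTEGER, result_hash TEXT)""")

def execute_as_of(predicate, params, as_of):
    """Re-execute the subset selection against the data as it was at `as_of`."""
    sql = ("SELECT id, value FROM measurements WHERE valid_from <= ? "
           "AND (valid_to IS NULL OR valid_to > ?) AND " + predicate + " ORDER BY id")
    return cur.execute(sql, (as_of, as_of) + params).fetchall()

def cite_subset(predicate, params):
    ts = now()
    rows = execute_as_of(predicate, params, ts)
    digest = hashlib.sha256(repr(rows).encode()).hexdigest()
    pid = str(uuid.uuid4())                       # placeholder for a minted PID/DOI
    cur.execute("INSERT INTO query_store VALUES (?, ?, ?, ?)",
                (pid, predicate + " | " + repr(params), ts, digest))
    return pid, rows

insert(1, 4.2)
insert(2, 7.1)
pid, cited = cite_subset("value > ?", (5.0,))     # cite the subset "value > 5.0"
update(2, 9.9)                                    # the data keeps evolving afterwards ...
ts, = cur.execute("SELECT executed_at FROM query_store WHERE pid = ?", (pid,)).fetchone()
assert execute_as_of("value > ?", (5.0,), ts) == cited  # ... but the cited subset is stable
```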

 

Instead of providing static data exports or textual descriptions of data subsets, we support a dynamic, query-centric view of data sets. The proposed solution enables precise identification of the exact subset and version of data used, supporting reproducibility of processes as well as sharing and reuse of data.


The attached recommendation gives a set of 14 clear rules that, if followed, make your dynamic data citable.

Please use the comment function below for questions and suggestions.

Output Status: RDA Endorsed Recommendations

Primary WG Focus / Output focus: Domain Agnostic

  • Author: James Reid

    Date: 05 Nov, 2015

    This is generally a good document having just read it cold. Well done.

    However, my concern is that it says little about the query language used - be it SQL or a custom API. Without some documentation/provenance on that, rerunning a query at some indeterminate future point seems fragile, as the query language itself will evolve or be replaced over time. Perhaps R3 is intended to address that? If so, I think some expansion is required, as in today's RESTful data access world, query languages are both ephemeral and often poorly documented (would OData provide a partial solution here, I wonder?).

    One need only think of how different RDBMSs vary in their SQL implementations (to say nothing of the multifarious NoSQL stores) to realise that *query languages and their expression are neither persistent nor portable*. That strikes me as a major deficiency in any concrete implementation. It is possible that I missed something obvious, though, so I am happy to be educated.

     

  • Author: Andreas Rauber

    Date: 05 Jan, 2016

    The recommendations are, on purpose, neutral with respect to the query language (or, in fact, any technology) used. The reason is that the recommendations need to work across multiple technologies and platforms, both for different data providers and across time.

    Having said that, the responsibility for keeping a query re-executable does not rest with the data user (and is thus not necessarily part of an external API), but with the data provider! For the user, the fact that a query is being stored and pointed to will, in most cases, be entirely transparent.

    As long as a data provider stays with a given technology (say, a specific SQL dialect or a shell script cutting rows and columns from CSV files), queries can be re-executed. Upon migration of the data to a new technology (which is a major project in any case, as the entire data representation might change, or all internal access APIs need to be adapted), the queries need to be migrated as well, as addressed in Recommendations R13 and R14. Assuming that any new data representation will need to be as granular/powerful as the preceding one, supporting the same type of data selection methods, such a migration will be possible.

    There is no need for queries to work across different SQL dialects, as they are always local to the system processing the queries in the first place.
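    To illustrate this point (a hypothetical sketch, not part of the recommendations; names such as SubsetQuery, render_sql and render_csv_filter are invented), a data provider could persist the subset-defining parameters in a technology-neutral form and render them into the concrete query language of whatever backend is currently in use, so that migrating the infrastructure means adding a new renderer rather than rewriting the stored citations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubsetQuery:
    dataset: str
    columns: tuple
    filters: dict          # e.g. {"station": "A-17"}
    as_of: str             # citation timestamp recorded by the data centre

def render_sql(q: SubsetQuery) -> str:
    """Renderer for the original relational backend (named-parameter SQL)."""
    where = " AND ".join(f"{k} = :{k}" for k in q.filters)
    return (f"SELECT {', '.join(q.columns)} FROM {q.dataset}_versions "
            f"WHERE {where} AND valid_from <= :as_of "
            f"AND (valid_to IS NULL OR valid_to > :as_of)")

def render_csv_filter(q: SubsetQuery) -> dict:
    """Renderer for a hypothetical file-based backend used after a migration."""
    return {"file": f"{q.dataset}.csv", "columns": list(q.columns),
            "match": dict(q.filters), "snapshot_not_after": q.as_of}

# The PID keeps pointing to the same stored SubsetQuery; only the renderer the
# data centre invokes changes when the underlying technology changes.
q = SubsetQuery("temperatures", ("station", "value"), {"station": "A-17"},
                "2015-11-02T00:00:00Z")
print(render_sql(q))
print(render_csv_filter(q))
```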

     

  • Author: Fran Lightsom

    Date: 24 Nov, 2015

    It seems like a good approach. But I don't know how hard it would be to implement with some of our data systems that badly need a solution. Would it be possible to talk with somebody about arranging a pilot implementation?

  • Author: Andreas Rauber

    Date: 05 Jan, 2016

    Sure - we'll be happy to help! We are also curious to learn from the feedback of new pilots in different settings, to see how easy or difficult the recommendations are to adopt; this holds for any new pilot setting. Just contact us (any of the chairs; you can reach me at rauber@ifs.tuwien.ac.at), or feel free to also discuss any issues discovered here in the forum.

     

    Andreas

  • Author: Dirk Roorda

    Date: 09 Dec, 2015

    I second the intention to make database usage replicable/reproducible.

    However, I think the current recommendations are specific to a certain type of use case.
     
    I see at least two types of database usage scenarios that could lead to at least two strategies of preserving queries:
     
    scenario 1) flat data model, huge amount of data (typically from sensors or instruments), many users needing different kinds of slices of the data.
     
    scenario 2) complex data, not in truly large amounts, modeled according to prevailing but changing insights (the data model itself changes over time).
     
    Preservation method 1): timestamp every grain of data. It makes applications a bit more complex, but in scenario 1) this is doable and leads to huge savings of space compared to the next method.
     
    Preservation method 2): store query results together with the queries. Make periodic snapshots of the database as a whole. It takes more space, but it is doable in scenario 2), and it keeps applications simpler compared to the previous method where every grain of data must be made time sensitive.
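    A tiny sketch of preservation method 2 (purely illustrative; the names snapshots, published and publish_query are invented) could look like this: the query body and its result are frozen against a named snapshot of the whole database rather than against per-record timestamps.

```python
import hashlib
import json

snapshots = {}     # snapshot name -> full copy of the (small) database
published = {}     # query id -> frozen query body, snapshot name, result and hash

def take_snapshot(name, database):
    snapshots[name] = json.loads(json.dumps(database))   # deep copy of the whole database

def publish_query(query_id, body, snapshot_name, run_query):
    result = run_query(snapshots[snapshot_name])
    digest = hashlib.sha256(json.dumps(result, sort_keys=True).encode()).hexdigest()
    published[query_id] = {"body": body, "snapshot": snapshot_name,
                           "result": result, "hash": digest}

# Usage: the live database keeps changing, but the published query stays frozen.
db = {"verses": [{"book": "Genesis", "chapter": 1, "words": 11}]}
take_snapshot("2015.4", db)
publish_query("q42", "words > 10", "2015.4",
              lambda snap: [v for v in snap["verses"] if v["words"] > 10])
db["verses"].append({"book": "Exodus", "chapter": 1, "words": 9})   # does not affect q42
```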
     
    At the Eep Talstra Centre for Bible and Computer, we are dealing with a linguistic text database of the Hebrew Bible. It has all the hallmarks of scenario 2). Users develop queries not to define a slice, but to detect special (and rare) patterns. The web application SHEBANQ (https://shebanq.ancient-data.org) acts as a query saver. It provides access to multiple snapshots of the database. Users can share a query (without the promise of unchangeability), but they can also publish a query against one of the snapshots. After the act of publishing, the user gets one week to unpublish, and after that the query body and its results (against one particular snapshot of the data) get frozen. The user can still add other query bodies for other snapshots and publish them separately. Using another body for another snapshot accommodates changes in the data model between snapshots. Of course, there is no guarantee that the query bodies of one query share the same intention.

  • Author: Andreas Rauber

    Date: 05 Jan, 2016

    The second option is definitely a valid method if the data set and the accumulated query results stay small enough to be stored repeatedly and redundantly. It is basically a trade-off between managing the complexity of a potentially large number of stored result sets and managing the complexity of frequently migrating the data schema plus the queries.

    In many (most?) cases I have personally come across so far in different research infrastructures, the data schema tends to evolve rather slowly, as any change to the data schema will usually require massive changes down the processing pipeline: to external APIs, internal APIs, sometimes GUIs, as well as training the users or guiding them through the changes smoothly.

    In settings such as you describe it may, however, prove better to simply store the queries and result sets redundantly (which isn't all that different from the recommendations in general, as you still keep the query to have provenance information on the data). The main difference I see in your scenario is how the versioning is done, i.e. whether it happens at the record level or at the snapshot level. This has to be decided based on the usage scenario, i.e. how frequent updates to the database are and when users will be able to see those updates in their queries. In many cases, if repeatability is desired, versioning the data rather than keeping redundant copies is more efficient.

    Andreas

     

  • Author: Fred Merceur

    Date: 31 Dec, 2015

    Some databases of datasets are composed of several billion records. Several million operations may be performed on the datasets each day. Saving the history of each single operation may indeed be considered a best practice, but on huge databases it can be very expensive in terms of storage, processing and development. The benefits of systematically saving each operation therefore have to be weighed against its cost. In some cases, this systematic saving may be so expensive that it is not an option. The current recommendations may therefore not be applicable in all cases.

  • Author: Andreas Rauber

    Date: 05 Jan, 2016

    True: cost will always be an issue that has to be matched against the requirements.

    IF subsets of dynamic data need to be persisted THEN the only way to ensure this is by having the respective version available. An aspect that is not discussed in detail in the 2-page flyer, but that has been addressed during the WG meetings and will also be described in a bit more detail in the more comprehensive summary document, is the granularity of versioning. The main thoughts, briefly summarized, are the following:

    If a specific version of a data item has not been accessed (read) before being updated again, then there may be no need to keep that specific version. Thus, for repeatability requirements, versions of values only need to be persisted if they have been used; otherwise they can, theoretically, simply be overwritten. (This will differ, obviously, in settings where the change history of a record is of interest.) This trades storage space against versioning complexity (and may lead to a performance bottleneck).

    But, in any case, there is no need to version any database at all costs: it has to be matched against the requirements. If costs are too high, then (versions of) data may also be deleted. In such cases, a specific data set cannot be reproduced anymore. The PID, query and landing page, however, will still exist, providing evidence and provenance information on the data, and they may allow re-execution of the query against whichever suitable snapshot version of the database is available (the closest timestamp to the query timestamp, or simply the current version) to get a dataset according to the same concept (query), even though it may not be identical anymore. (Identity can still be determined via the result hash keys.)
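    As a small illustration of that last point (not prescribed by the recommendations; the function names are invented), the stored result hash lets a repository check whether a re-executed query still yields the originally cited subset:

```python
import hashlib

def result_hash(rows):
    # rows: tuples of values; sorted to make the hash independent of retrieval order
    return hashlib.sha256(repr(sorted(rows)).encode()).hexdigest()

def verify_citation(stored_hash, rows):
    """True if a re-executed query returns exactly the originally cited subset."""
    return result_hash(rows) == stored_hash

stored = result_hash([(1, 4.2), (2, 7.1)])            # hash recorded at citation time
print(verify_citation(stored, [(2, 7.1), (1, 4.2)]))  # True: identical subset
print(verify_citation(stored, [(1, 4.2)]))            # False: cited version no longer reproducible
```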

     

    Andreas

  • Author: Emmanouil Chaniotakis

    Date: 21 Jan, 2016

    I have a few comments (I have only recently become acquainted with RDA, so please excuse me in case some of my comments have already been answered).

    1.

    In many cases researchers who have developed a dataset propose recommendations on how to cite the data. This is most often done by citing a published article that describes the work performed to acquire the data and the results of its analysis.

    My question is: who is the "author" of the data? The owner (organizations like universities) or the person responsible for the dataset? This is rather important, in the sense that datasets receive quite a lot of citations, as some of them are very expensive to acquire and scarce.

    A suggestion would be to have two fields for that: the organization that stores the data and the people responsible for collecting the data.

    2.

    My second comment is that collaboration is required with the people involved in citation styles in order to ensure that data citations are included in those styles. I believe that the RDA suggestions are aligned in this direction. Including queries could be tricky. A suggestion I have is to build a website that would provide a DOI-like unique URL for each query stored by researchers, which could be used in citations (given its short length). This might also include other required metadata. My main concern with this suggestion is that big datasets are stored in one database table at first and are then moved to another when the first is close to capacity (e.g. count(3)). I do not know how this could be prevented, but a suggestion would be to ask for the storage procedure in the automated citation text generator.

  • Author: Andreas Rauber

    Date: 29 Feb, 2016

    Your comments are absolutely right:

    (1) The attribution question is a critical issue, and the solution will likely depend on the actual setting. Within our solution we do foresee something like the dual-field concept you propose, namely a PID (e.g. a DOI) being minted for the query, and the landing page and citation providing a pointer (PID) to the actual data set, upwards to the data center. Technically, it is a chain of pointers forming a citation graph. Which of these are listed as text in a traditional paper-style citation is something that needs to be decided separately. We have had discussions around this in our WG (although it is outside the scope of the planned WG activity) which pointed in the direction of something similar to the end credits of a movie, with that information being provided via the landing page - from the subset all the way up to the entire data set.

    (2) Your second comment covers exactly what the recommendations of the WG specify. PIDs are assigned to the queries (managed by the respective data center), include additional metadata, and provide a corresponding landing page and re-execution capability (i.e. the fact that the DOI points to a query is a transparent, technical solution that is not necessarily visible to the user). The recommendations also address the aspect of migrating the data to a new representation/system. The key point is that the responsibility for providing the data, i.e. storing the queries, rests with the data center, not the researcher.
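    As a purely illustrative sketch (the field names and values are assumptions, not something the recommendations prescribe), a data centre could render a human-readable citation from such a stored, PID-identified query record, with the landing page exposing the full chain of pointers:

```python
record = {
    "pid": "10.15497/EXAMPLE-QUERY-001",          # hypothetical DOI minted for the query
    "creators": "Doe, J.; Smith, A.",              # people responsible for the data
    "publisher": "Example Data Centre",            # organisation hosting the data
    "dataset_pid": "10.15497/EXAMPLE-DATASET",     # PID of the whole, evolving data set
    "query": "station = 'A-17' AND year = 2014",
    "executed_at": "2015-11-02T10:15:00Z",
}

def citation_text(r):
    # A traditional, human-readable citation; the landing page behind r["pid"]
    # would expose the full chain of pointers (subset -> data set -> data centre).
    return (f"{r['creators']} ({r['executed_at'][:4]}): Subset of data set "
            f"doi:{r['dataset_pid']}, {r['publisher']}. doi:{r['pid']} "
            f"(query executed {r['executed_at']}).")

print(citation_text(record))
```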

     

  • Author: Andreas Rauber

    Date: 29 Feb, 2016

    We have prepared an extended description of the recommendations to complement the condensed 2-page flyer summarizing the recommendations.

    This extended report is available at

    https://www.rd-alliance.org/rda-wgdc-recommendations-extended-description-tcdl-draft.html

    for comments, which will be integrated into an updated version.

  • Author: Andreas Rauber

    Date: 08 Jun, 2016

    Dear all,

     

    A somewhat longer paper describing the 14 recommendations listed on the 2-page flyer as official recommendations of the Working Group on Dynamic Data Citation has been published in the Bulletin of the IEEE Technical Committee on Digital Libraries (TCDL), available at

     

    Identification of Reproducible Subsets for Data Citation, Sharing and Re-Use
    By Andreas Rauber, Ari Asmi, Dieter van Uytvanck and Stefan Pröll

    Bulletin of the IEEE Technical Committee on Digital Libraries (TCDL), Vol. 12, Issue 1, May 2016

    http://www.ieee-tcdl.org/Bulletin/v12n1/papers/IEEE-TCDL-DC-2016_paper_1...

     

    It has also been available in the Repository of the Working Group in different draft versions, the final one being available at

    https://rd-alliance.org/rda-wgdc-recommendations-extended-description-tc...

     

    Andi

     

  • Author: Simon Cox

    Date: 14 Sep, 2016

    I understand that a DataCite implementation of the recommendations is being prepared, using a syntax that postfixes the 'query parameters' following a # fragment separator. That may satisfy DOI/DataCite syntax requirements; however, when using HTTP-based resolvers, the part of the identifier following the # is not sent to the server, but is used on the client side to find the fragment within the complete resource returned from the server. (On the other hand, content following a "?" separator is sent to the server.)
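    A quick illustration of this point using Python's standard library (the resolver host doi.example.org and the parameter string are made up): the fragment after '#' stays on the client, whereas the part after '?' is included in what is sent to the server.

```python
from urllib.parse import urlsplit
from urllib.request import Request

pid_fragment = "https://doi.example.org/10.15497/XYZ#rows=1-100&ts=2015-11-02"
pid_query    = "https://doi.example.org/10.15497/XYZ?rows=1-100&ts=2015-11-02"

print(urlsplit(pid_fragment).fragment)  # 'rows=1-100&ts=2015-11-02' -- kept on the client
print(Request(pid_fragment).selector)   # '/10.15497/XYZ'            -- what goes on the wire
print(Request(pid_query).selector)      # '/10.15497/XYZ?rows=1-100&ts=2015-11-02'
```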

    This interaction between the HTTP protocol and DataCite syntax would have the surely unintended and probably undesirable consequence of shifting processing from the server over to the client, which may not be capable of interpreting the fragment properly, and may also lead to unnecessarily large data transfers.

    I may have misunderstood my informant here (Justin Buck, BODC) but wanted to put this issue on the table in case. 

  • Author: Andreas Rauber

    Date: 13 Jun, 2018

    We discussed this quite intensively in the WG and came to the conclusion that putting the query string, in whichever form, into an identifier is not a viable way forward. Apart from violating the mantra of not embedding semantics in an identifier, any such approach would definitely lead to problems as soon as the data model or data infrastructure changes. In such a case, the recommendations recommend also migrating the queries so that they can operate on the current data model. Keeping the original queries (as one would be forced to do if they were part of an identifier) would mean that query semantics would need to be migrated on demand. Other disadvantages include that this might not work for all types of data and queries (e.g. drawing a boundary line on an image) and that it might pose security risks by exposing the data structure, etc.

    We thus came to the conclusion that there is no advantage in having the query parameters exposed as part of an identifier, but plenty of disadvantages, and would thus discourage such an approach.

     

  • Author: John Graybeal

    Date: 22 Feb, 2021

    Hello, having finally had the need to read the details, thank you for the thought and time invested in producing this. I have a few comments which, having shared them with a different audience, I thought I should share with you as well.

    I have a minor quibble with its framing. It adds a few detailed directives that are not required if the core system is implemented rigorously, and in at least one place it conflates 'queries' with 'data sets'. (A FAQ question was about queries and the response was only about data sets, but they aren't the same thing.)

    For example, "assigning a query PID" is not essential if the query actually is an API request that embeds the timestamp: the request is the PID. And a new PID will never be needed for the same query, because the system should always produce the same response to that query if it is designed for reproducibility. In that case, the system does not need to store responses; it can simply re-issue them (the classic space-time tradeoff applies, of course).

    (Note that many real-time tools like Google Docs manage revisions on this basis: when I go to a given IRI that includes a version identifier, the system has a way to 'play the changes' (possibly from an intermediate cached copy) to get to that stage. If you want to access all the changes, it is more efficient to track each change than to issue a new "saved copy" for every change.)

    Along these lines, I notice that in previous responses to comments you refer to a query as being dependent on the service and subject to change:

    "putting the query string in whichever form as part of an identifier is not a viable way forward. Apart from violating the mantra of not embedding semantics in an identifier, any such approach would definitely lead to problems as soon as the data model/data infrastructure changes."

    These claims are not always true, as the service in some cases is explicitly designed to respond to a language definition that must be persistent. For example, say an IRI specifically points to a plain-text document. The servicing system should always serve that document, no matter what services underpin the response and regardless of whether there is semantic content in that IRI. If I define my IRI formats as persistent, and I commit to honoring that format, it will definitely not lead to problems when the data model/data infrastructure changes, because any such change is obliged to honor my commitment. Semantic services such as semantic repositories are in this role of providing meaningful responses to specific IRI patterns, and a failure to do so would be an explicit violation of those services' contracts.
