Comments on the Recommendations from an Australian group dealing with large volume (Petascale) raster data sets.
Dear all
We welcome the efforts of this group and we have been following them with interest (although the timing of many of your webinars makes participation a little difficult!).
Now that the final recommendations are available, we would like to comment that the approach seems tied to data that are held in databases (particularly relational databases) and are relatively small in volume. There also seems to be a reliance on time-stamped snapshots of the databases that can be retrieved over time.
Our own experience is with large-volume raster arrays that can exceed a petabyte, and with multi-petabyte climate models. Storing multiple time-stamped snapshots of these is not feasible, fundamentally because of the cost of the infrastructure.
There are at least two use cases where new data are dynamically added to an existing data set:
· Use case 1: new data are regularly and systematically appended to an existing data set over time, e.g., outputs from a satellite sensor such as Landsat or MODIS. No changes are made to the existing data set.
· Use case 2: pre-existing data in a large data set are modified or updated. This use case is common where errors are found in pre-existing data, or where changes in analytical and/or processing techniques affect some attributes of the existing data set.
For use case 1, time stamping the new additions when they are added is the simplest solution.
For use case 2, the data sets have to go through a release process, similar to software, in which the exact changes to the data set are documented.
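The two use cases could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the class names (`Granule`, `Collection`, `Release`), the scene names, and the dates are all hypothetical and are not taken from any existing system. For use case 1, each appended item carries the time it was added, so a citation can refer to "the collection as of time T" without storing a snapshot. For use case 2, a release record carries a version and the documented changes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Granule:
    """One appended unit of data, e.g. a single satellite scene (hypothetical)."""
    name: str
    added_at: datetime  # time the granule was appended to the collection


@dataclass
class Collection:
    """Append-only collection: existing granules are never modified (use case 1)."""
    granules: list = field(default_factory=list)

    def append(self, name, added_at):
        self.granules.append(Granule(name, added_at))

    def as_of(self, when):
        """Return the names of all granules that had been added by time `when`."""
        return [g.name for g in self.granules if g.added_at <= when]


@dataclass(frozen=True)
class Release:
    """A documented release of a modified data set (use case 2)."""
    version: str
    released_at: datetime
    changes: tuple  # human-readable descriptions of what changed


def utc(year, month, day):
    """Helper: a timezone-aware UTC timestamp at midnight."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Use case 1: three scenes appended over time (invented example data).
collection = Collection()
collection.append("scene_0001", utc(2015, 1, 10))
collection.append("scene_0002", utc(2015, 2, 10))
collection.append("scene_0003", utc(2015, 3, 10))

# The state of the collection at a given time is recovered from the
# time stamps alone, without keeping a copy of the data.
cited = collection.as_of(utc(2015, 2, 28))

# Use case 2: a modification is published as a new, documented release.
release = Release("1.1", utc(2015, 4, 1), ("corrected values in scene_0001",))
```

Here `cited` contains only the first two scenes, because the third was appended after the cited time. Nothing is copied; the time stamp on each addition is enough to reconstruct what the collection contained.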
Reading the recommendation and the associated literature, we got the impression that the recommendation had a certain use case in mind that is based on relational databases with tabular, alphanumerical data. Several recommendations are not applicable to data that are not alphanumerical tables. The usefulness of query uniqueness for addressing web services (R4) is questionable and stable sorting (R5) does not make sense with, say, raster data.
To also accommodate other forms of data, we think that this recommendation should be more general. For these large-scale, non-numerical data sets we would recommend using provenance workflow engines that automatically capture the version of the data set that was used, the version of the software, the infrastructure used to process the data, and the exact time the process was run. The provenance workflow itself would have a persistent identifier, as would all components of the workflow.
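As a rough illustration of what such a provenance record might hold, the sketch below collects the four items listed above (data set version, software version, infrastructure, run time) and attaches an identifier. It is an assumption-laden example, not the interface of any particular workflow engine: the function `make_provenance_record`, the field names, and the example values are hypothetical, and the `urn:uuid:` value only stands in for a persistent identifier that a real system would mint through a registration service.

```python
import json
import platform
import uuid
from datetime import datetime, timezone


def make_provenance_record(dataset_id, dataset_version, software_name, software_version):
    """Build a provenance record for one processing run (hypothetical layout)."""
    record = {
        # Which data set, and which version of it, was used.
        "dataset": {"id": dataset_id, "version": dataset_version},
        # Which software, and which version of it, processed the data.
        "software": {"name": software_name, "version": software_version},
        # Basic description of the machine the process ran on.
        "infrastructure": {
            "system": platform.system(),
            "machine": platform.machine(),
            "python": platform.python_version(),
        },
        # Exact time the process was run, in UTC.
        "run_at": datetime.now(timezone.utc).isoformat(),
    }
    # Placeholder only: a random UUID is unique but is not a registered
    # persistent identifier such as a DOI or handle.
    record["id"] = "urn:uuid:" + str(uuid.uuid4())
    return record


# Example values are invented for illustration.
rec = make_provenance_record("example-dataset", "2.0", "example-tool", "0.3")
print(json.dumps(rec, indent=2))
```

In a fuller design each component (data set, software, infrastructure description) would carry its own persistent identifier, and the record would reference those identifiers rather than free-text names.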
Take care
Lesley
Lesley Wyborn (Associate Fellow, National Computational Infrastructure, ANU Canberra)
Jens Klump (OCE Science Leader, CSIRO, Perth)
Ben Evans (Associate Director, National Computational Infrastructure, ANU Canberra)
Jingbo Wang (Data Collections Manager, National Computational Infrastructure, ANU Canberra)
Fabiana Santana (HPC Innovations Project Manager, National Computational Infrastructure, ANU Canberra)
From: on behalf of rauber
Date: Thursday, 25 June 2015 3:24 am
To: Data Citation WG
Subject: [rda-datacitation-wg] Webinar Presentation and Recommendations uploaded
Dear all,
thanks to all of you who attended today’s webinar, and also for the feedback you provided.
As promised, I have uploaded the slides of today’s presentation as well as the latest version of the 2-page flyer of the recommendations into the file depot of the working group. They are available at the following URLs:
Slides:
https://rd-alliance.org/filedepot/folder/262?fid=668
2-page Recommendations Flyer:
https://rd-alliance.org/filedepot/folder/262?fid=667
If you have any further feedback on the wording, comments, etc. please let me know.
Best regards,
Andreas