Comments on the Recommendations from an Australian group dealing with large volume (Petascale) raster data sets.

    Lesley Wyborn
    Participant

    Dear all
    We welcome the efforts of this group and we have been following them with interest (although the timing of many of your webinars makes participation a little difficult!).
    Now that the final recommendations are available, we would like to comment that the approach seems tied to data that are held in databases (particularly relational databases) and are relatively small in volume. There also seems to be a preference for time-stamped snapshots of the databases that can be retrieved over time.
    Our personal experience is with large-volume raster arrays that can be over a petabyte in volume, and with multi-petabyte climate models. Storing multiple time-stamped snapshots of these is not feasible, fundamentally because of the cost of the infrastructure.
    There are at least two use cases where new data are dynamically added to an existing data set:
    · Use case 1: new data are regularly and systematically appended to an existing data set over time, e.g., with outputs from a satellite sensor such as Landsat or MODIS: no changes are made to the existing dataset.
    · Use case 2: pre-existing data in a large data set are modified or updated. This use case is common where errors are found in pre-existing data, or where changes to analytical and/or processing techniques affect some attributes of the existing data set.
    For use case 1, time stamping of when the new additions are added is the simplest solution.
    For use case 2, the data sets have to go through a release process, similar to software, in which the exact changes to the data set are documented.
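    As an illustration only, the two use cases could be sketched as follows. This is a minimal sketch, not a description of any existing system: the names `Granule`, `Release`, `granules_as_of` and `release_as_of`, as well as the sample scene identifiers and dates, are hypothetical. It assumes each appended unit carries the time it was added (use case 1) and each modification is published as a documented release (use case 2), so that a citation made at a given time can be resolved without storing snapshots.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Granule:
    """One appended unit of data (hypothetical name), e.g. a single satellite scene."""
    granule_id: str
    added_at: datetime  # when the granule was added to the collection


@dataclass(frozen=True)
class Release:
    """A documented release of a modified data set (use case 2)."""
    version: str
    released_at: datetime
    changes: tuple  # human-readable descriptions of what changed


def granules_as_of(granules, cutoff):
    """Use case 1: the collection as cited at `cutoff` is every granule added by then."""
    return [g for g in granules if g.added_at <= cutoff]


def release_as_of(releases, cutoff):
    """Use case 2: the release that was current at `cutoff`, or None if none existed."""
    current = [r for r in releases if r.released_at <= cutoff]
    return max(current, key=lambda r: r.released_at) if current else None


def utc(year, month, day):
    """Small helper for building timezone-aware sample dates."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Made-up sample data, for illustration only.
granules = [
    Granule("scene-001", utc(2015, 1, 10)),
    Granule("scene-002", utc(2015, 3, 5)),
    Granule("scene-003", utc(2015, 6, 20)),
]
releases = [
    Release("1.0", utc(2015, 1, 1), ("initial release",)),
    Release("1.1", utc(2015, 5, 1), ("reprocessed with corrected calibration",)),
]

cited_at = utc(2015, 4, 1)
print([g.granule_id for g in granules_as_of(granules, cited_at)])  # ['scene-001', 'scene-002']
print(release_as_of(releases, cited_at).version)  # 1.0
```

    In this sketch nothing is copied: the time stamp (or the release version) recorded in a citation is enough to identify which parts of the data set were in use.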
    Reading the recommendation and the associated literature, we got the impression that the recommendation had a certain use case in mind that is based on relational databases with tabular, alphanumerical data. Several recommendations are not applicable to data that are not alphanumerical tables. The usefulness of query uniqueness for addressing web services (R4) is questionable and stable sorting (R5) does not make sense with, say, raster data.
    To also accommodate other forms of data, we think that this recommendation should be more general. For these large-scale, non-numerical data sets we would recommend using provenance workflow engines that automatically capture the version of the data set that was used, the version of the software and the infrastructure used to process the data, and the exact time the process was run. The provenance workflow itself would have a persistent identifier, as would all components of the workflow.
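    To make the idea concrete, here is a minimal sketch of the kind of record such an engine might capture for each run. It is an assumption-laden illustration, not the API of any particular workflow engine: `ProvenanceRecord` and `capture_provenance` are hypothetical names, the example identifiers are made up, and a random `urn:uuid` stands in for the persistent identifier that a real system would mint through a registration service.

```python
import platform
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """What was run, on which data, where, and when (hypothetical structure)."""
    record_id: str         # identifier of the workflow run itself
    dataset_id: str        # identifier of the data set that was used
    dataset_version: str   # release or time stamp of that data set
    software_id: str       # identifier of the processing software
    software_version: str  # version of the processing software
    infrastructure: str    # description of the machine the run executed on
    run_at: str            # exact time of the run, ISO 8601, UTC


def capture_provenance(dataset_id, dataset_version, software_id, software_version):
    """Build a provenance record for a run that is starting now."""
    return ProvenanceRecord(
        # Stand-in only: a real system would mint a resolvable persistent identifier.
        record_id=f"urn:uuid:{uuid.uuid4()}",
        dataset_id=dataset_id,
        dataset_version=dataset_version,
        software_id=software_id,
        software_version=software_version,
        infrastructure=f"{platform.system()} {platform.machine()}",
        run_at=datetime.now(timezone.utc).isoformat(),
    )


# Made-up identifiers, for illustration only.
record = capture_provenance(
    dataset_id="example:raster-collection",
    dataset_version="1.1",
    software_id="example:processing-tool",
    software_version="2.3.0",
)
print(asdict(record))
```

    Citing the record identifier then points to the data set version, software version, infrastructure and run time together, rather than to a stored copy of the data.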
    Take care
    Lesley
    Lesley Wyborn (Associate Fellow, National Computational Infrastructure, ANU Canberra)
    Jens Klump (OCE Science Leader, CSIRO, Perth)
    Ben Evans (Associate Director, National Computational Infrastructure, ANU Canberra)
    Jingbo Wang (Data Collections Manager, National Computational Infrastructure, ANU Canberra)
    Fabiana Santana (HPC Innovations Project Manager, National Computational Infrastructure, ANU Canberra)
    From: on behalf of rauber
    Date: Thursday, 25 June 2015 3:24 am
    To: Data Citation WG
    Subject: [rda-datacitation-wg] Webinar Presentation and Recommendations uploaded
    Dear all,
    thanks to all of you who attended today’s webinar, and also for the feedback you provided.
    As promised, I have uploaded the slides of today’s presentation as well as the latest version of the 2-page flyer of the recommendations into the file depot of the working group. They are available at the following URLs:
    Slides:
    https://rd-alliance.org/filedepot/folder/262?fid=668
    2-page Recommendations Flyer:
    https://rd-alliance.org/filedepot/folder/262?fid=667
    If you have any further feedback on the wording, comments, etc. please let me know.
    Best regards,
    Andreas
