Comments on the Recommendations from an Australian group dealing with large-volume (petascale) raster data sets.

02 Jul 2015

Dear all
We welcome the efforts of this group and we have been following them with interest (although the timing of many of your webinars makes participation a little difficult!).
Now that the final recommendations are available, we would like to comment that the approach seems tied to data that are held in databases (particularly relational databases) and are relatively small in volume. There also seems to be an expectation of time-stamped snapshots of the databases that can be retrieved over time.
Our own experience is with large raster arrays that can exceed a petabyte in volume, and with multi-petabyte climate models. Storing multiple time-stamped snapshots of these is not feasible, fundamentally because of the cost of the infrastructure.
There are at least two use cases where new data are dynamically added to an existing data set:
· Use case 1: new data are regularly and systematically appended to an existing data set over time, e.g. outputs from a satellite sensor such as Landsat or MODIS; no changes are made to the existing data set.
· Use case 2: pre-existing data in a large data set are modified or updated. This use case is common where errors are found in pre-existing data, or where analytical or processing techniques affect some attributes of the existing data set.
For use case 1, time-stamping when the new data are added is the simplest solution.
For use case 2, the data sets have to go through a release process, similar to software releases, and the exact changes to the data set are documented.
Reading the recommendation and the associated literature, we got the impression that the recommendations have a particular use case in mind, one based on relational databases with tabular, alphanumeric data. Several recommendations are not applicable to data that are not alphanumeric tables. The usefulness of query uniqueness for addressing web services (R4) is questionable, and stable sorting (R5) does not make sense for, say, raster data.
To accommodate other forms of data as well, we think this recommendation should be more general. For these large-scale data sets that are not alphanumeric tables, we would recommend using provenance workflow engines that automatically capture the version of the data set that was used, the version of the software and of the infrastructure used to process the data, and the exact time the process was run. The provenance workflow itself would have a persistent identifier, as would all components of the workflow.
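As a minimal sketch of what we mean (in Python; every identifier, version string and parameter below is a placeholder, not a reference to an existing system), such an engine could record something like the following for each processing run:

    import hashlib
    import json
    from datetime import datetime, timezone

    def capture_provenance(dataset_pid, dataset_version, software_pid,
                           software_version, infrastructure_pid, parameters):
        """Assemble a provenance record for one processing run.

        All *_pid arguments are persistent identifiers (e.g. handles or
        DOIs); the values used below are placeholders.
        """
        record = {
            "dataset": {"pid": dataset_pid, "version": dataset_version},
            "software": {"pid": software_pid, "version": software_version},
            "infrastructure": {"pid": infrastructure_pid},
            "parameters": parameters,
            "run_time": datetime.now(timezone.utc).isoformat(),
        }
        # A checksum over the serialised record gives the workflow run
        # itself a stable identifier that could be registered as a PID.
        serialised = json.dumps(record, sort_keys=True)
        record["run_id"] = hashlib.sha256(serialised.encode()).hexdigest()
        return record

    provenance = capture_provenance(
        dataset_pid="hdl:0000/example-dataset", dataset_version="v2.3",
        software_pid="hdl:0000/example-code", software_version="1.4.1",
        infrastructure_pid="hdl:0000/example-hpc",
        parameters={"tile": "h29v12", "band": "NDVI"},
    )
    print(json.dumps(provenance, indent=2))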
Take care
Lesley
Lesley Wyborn (Associate Fellow, National Computational Infrastructure, ANU Canberra)
Jens Klump (OCE Science Leader, CSIRO, Perth)
Ben Evans (Associate Director, National Computational Infrastructure, ANU Canberra)
Jingbo Wang (Data Collections Manager, National Computational Infrastructure, ANU Canberra)
Fabiana Santana (HPC Innovations Project Manager, National Computational Infrastructure, ANU Canberra)
From: <***@***.***-groups.org> on behalf of rauber <***@***.***>
Date: Thursday, 25 June 2015 3:24 am
To: Data Citation WG <***@***.***-groups.org>
Subject: [rda-datacitation-wg] Webinar Presentation and Recommendations uploaded
Dear all,
thanks to all of you who attended today's webinar, and also for the feedback you provided.
As promised, I have uploaded the slides of today's presentation as well as the latest version of the 2-page flyer of the recommendations into the file depot of the working group. They are available at the following URLs:
Slides:
https://rd-alliance.org/filedepot/folder/262?fid=668
2-page Recommendations Flyer:
https://rd-alliance.org/filedepot/folder/262?fid=667
If you have any further feedback on the wording, comments, etc. please let me know.
Best regards,
Andreas
--

    Author: Peter Baumann

    Date: 02 Jul, 2015

    Hi Lesley, all,
    what I have learnt from you climate data folks is that we need two timestamps: the
    time when the update is done (which you are referring to, and which is the
    versioning information), and the "phenomenon time"; in a timeseries datacube,
    the latter is the one used for accessing and subsetting, the former
    is for provenance.
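    A tiny sketch of that distinction (Python; the field names are invented
    for illustration only):

        from dataclasses import dataclass
        from datetime import datetime, timezone

        @dataclass
        class DatacubeSlice:
            # "phenomenon time": what the observation refers to; this is the
            # axis used for accessing and subsetting the timeseries datacube.
            phenomenon_time: datetime
            # versioning timestamp: when the slice was inserted or updated;
            # used for provenance, not for subsetting.
            inserted_at: datetime
            # the raster payload itself (placeholder)
            values: bytes

        slice_ = DatacubeSlice(
            phenomenon_time=datetime(2015, 6, 1, tzinfo=timezone.utc),
            inserted_at=datetime.now(timezone.utc),
            values=b"",
        )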
    So far common knowledge. But it raises an interesting issue: provenance info may
    refer to part of a dataset. Hence, for mere documentation purposes, we need a
    standardized way to describe subsets (such as WCS Core for coverages).
    Stable sorting (R5) is interesting as well. I looked up the definition in the
    document referenced, and this looks like a very wise definition allowing
    multi-dimensional (orthogonal) sorting criteria like space, time, and height.
    Along each axis, sorting is unambiguous (i.e. a total ordering). Kudos!
    Curiously looking around, I find R4 (query uniqueness). Let me observe that this
    is also possible for queries on multi-dimensional arrays or coverages, for
    example using the OGC Web Coverage Processing Service (WCPS) language or the
    forthcoming ISO SQL/MDA (Multi-Dimensional Arrays). I'd just not talk about a
    checksum for identifying queries (I guess the document wants to state
    requirements, not solutions).
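    To illustrate the checksum idea under discussion, a deliberately
    simplistic sketch (Python; the example queries are made up, and a real
    system would normalize the parsed query rather than the raw text):

        import hashlib
        import re

        def normalize_query(query: str) -> str:
            # Deliberately naive normalization: collapse whitespace and
            # lower-case the text.
            return re.sub(r"\s+", " ", query).strip().lower()

        def query_fingerprint(query: str) -> str:
            # Lets a repository recognise that the same (normalized) query
            # has been issued before and reuse its identifier, regardless of
            # whether the query language is SQL, WCPS or something else.
            return hashlib.sha256(normalize_query(query).encode()).hexdigest()

        q1 = 'for c in (cube) return encode(c[Lat(-35:-30)], "tiff")'
        q2 = 'FOR c IN (cube)  RETURN encode(c[Lat(-35:-30)], "tiff")'
        assert query_fingerprint(q1) == query_fingerprint(q2)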
    my 2 cents,
    Peter

    Author: Andreas Rauber

    Date: 02 Jul, 2015

    Dear Lesley,
    Thank you very much for your comments! I highly appreciate this
    discussion, as you raise some very important issues! I would suggest
    scheduling a dedicated telephone conference or webinar to discuss these
    in detail. (We have tried to schedule these webinars at different
    timeslots to enable participation from around the globe at more or less
    convenient times, but this time zone issue will always be a challenge,
    it seems.)
    I will be travelling/lecturing for the next two weeks with somewhat
    unpredictable internet access, so I would suggest arranging a call for
    the second half of July if that suits you. I can prepare a Doodle poll
    to allow more people to join if the second half of July is in general
    ok for you.
    Until then, we can also discuss a few issues by email - I just wanted to
    provide some quick feedback on your comments inline below. Some of the
    issues you raise have been discussed during the various WG meetings,
    some probably need further clarification that will happen in the more
    extended report to complement the somewhat minimalistic 2-page flyer,
    and some definitely will merit further in-depth discussion.
    > We welcome the efforts of this group and we have been following them
    > with interest (although the timing of many of your webinars makes
    > participation a little difficult!).
    >
    > Now that the final recommendations are available, we would like to
    > comment that the approach seems tied to data that are held in
    > databases (particularly relational databases) and are relatively small
    > in volume. There also seems to be an expectation of time-stamped
    > snapshots of the databases that can be retrieved over time.
    While it is true that many of the pilots we used as a basis for our
    discussion are based on relational databases, we did have a few other
    settings, ranging from (small-scale) CSV files via XML to some pretty
    large RDBMS. There will be a workshop sponsored by EUDAT to discuss
    large-scale time-series data in October or November. (I know, that's
    after the official end of the WG, but hey, 18 months is incredibly short
    if we want to go beyond generic discussions and recommendations to
    elaborating and implementing actual solutions. And we were told that
    there is nothing that prevents us from continuing to work on these
    questions even if the official period of the WG is coming to an end :)
    > Our own experience is with large raster arrays that can exceed a
    > petabyte in volume, and with multi-petabyte climate models. Storing
    > multiple time-stamped snapshots of these is not feasible, fundamentally
    > because of the cost of the infrastructure.
    There seem to be two misunderstandings resulting from the compact
    phrasing of the recommendations:
    (1) It is not entire time-stamped snapshots of data representations
    that we recommend keeping. Rather, we state that IF one wants to
    (needs to?) support the option of going back to earlier versions of the
    data, then these earlier versions must be kept available somehow. If
    there is no requirement to enable repeatability, support re-use of
    older data to compare models, etc., then there is no need to keep
    earlier versions. Similarly, if it is economically infeasible to do so,
    there is nothing that prevents data from being deleted - however, this
    should be a clear policy decision, not just happen by chance.
    (2) We do not recommend keeping time-stamped snapshots of the entire
    data. While this is an implementation issue (and thus beyond the level
    of the recommendations/WG), we claim that far more efficient (and also
    more flexible, i.e. down to arbitrary granularity of time-stamping and
    versioning as opposed to fixed snapshots) means of time-stamping and
    versioning data are available for most data representations.
    For an RDBMS, for example, we do not recommend snapshot dumps of the DB,
    but integrating versioning at the record level: either integrating it in
    the master tables, which requires changes to all APIs; or having
    dedicated history tables, which avoids changes to existing APIs and
    keeps the performance of the master system unchanged, but wastes
    significant amounts of space; or hybrid solutions, which are more
    space-efficient but less performant when processing historic queries.
    None of this is new technology - it is even state of the art in many
    data centers, sometimes to the surprise of researchers when they talk
    to the ICT staff about this versioning issue.
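    To sketch the history-table variant (Python using the standard-library
    sqlite3 module; the table and column names are invented purely for
    illustration, not taken from any of the pilots):

        import sqlite3
        from datetime import datetime, timezone

        def utcnow():
            return datetime.now(timezone.utc).isoformat()

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            -- master table: current state only; existing APIs keep using it
            CREATE TABLE observations (
                id INTEGER PRIMARY KEY, value REAL, modified_at TEXT);
            -- history table: superseded versions with their validity interval
            CREATE TABLE observations_history (
                id INTEGER, value REAL, valid_from TEXT, valid_to TEXT);
        """)

        def insert_observation(obs_id, value):
            conn.execute("INSERT INTO observations VALUES (?, ?, ?)",
                         (obs_id, value, utcnow()))

        def update_observation(obs_id, new_value):
            # Copy the superseded row into the history table, then overwrite.
            now = utcnow()
            conn.execute(
                "INSERT INTO observations_history "
                "SELECT id, value, modified_at, ? FROM observations WHERE id = ?",
                (now, obs_id))
            conn.execute(
                "UPDATE observations SET value = ?, modified_at = ? WHERE id = ?",
                (new_value, now, obs_id))

        def value_as_of(obs_id, timestamp):
            # Historic query: the value that was current at `timestamp`.
            row = conn.execute(
                "SELECT value FROM observations_history "
                "WHERE id = ? AND valid_from <= ? AND ? < valid_to",
                (obs_id, timestamp, timestamp)).fetchone()
            if row is None:  # not superseded since then: current row applies
                row = conn.execute(
                    "SELECT value FROM observations "
                    "WHERE id = ? AND modified_at <= ?",
                    (obs_id, timestamp)).fetchone()
            return row[0] if row else None

        insert_observation(1, 42.0)
        t0 = utcnow()
        update_observation(1, 43.5)   # e.g. a correction to an earlier value
        print(value_as_of(1, t0))     # -> 42.0, the value before the correction

    Only the changed records are duplicated, which is why this stays far
    cheaper than keeping full snapshots.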
    For CSV files there is a prototype that demonstrates this functionality
    via transparent migration to an RDBMS, or by applying a versioning
    system in the backend (e.g. Git, SVN). A similar approach seems feasible
    in settings using file-based repositories. Repositories, in general,
    usually consist of these two components, i.e. some form of file-system-based
    storage (which can be versioned, even at the bit level, transparently to
    the system) and a form of database storing metadata of the individual
    files.
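    For the versioning-system-in-the-backend route, a rough sketch (Python
    driving the git command line; this is not the prototype mentioned above,
    and the repository layout is a placeholder):

        import subprocess
        from datetime import datetime, timezone

        def git(repo_dir, *args):
            return subprocess.run(["git", "-C", repo_dir, *args], check=True,
                                  capture_output=True, text=True)

        def commit_dataset_version(repo_dir, csv_path, message):
            # Record the current state of the CSV file as a new version;
            # the commit hash can serve as the version identifier stored
            # alongside the query and its execution timestamp.
            git(repo_dir, "add", csv_path)
            stamp = datetime.now(timezone.utc).isoformat()
            git(repo_dir, "commit", "-m", f"{message} ({stamp})")
            return git(repo_dir, "rev-parse", "HEAD").stdout.strip()

        def checkout_version(repo_dir, csv_path, commit_hash):
            # Retrieve the file exactly as it was in the cited version
            # (csv_path is relative to the repository root).
            return git(repo_dir, "show", f"{commit_hash}:{csv_path}").stdout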
    > There are at least two use cases where new data are dynamically added
    > to an existing data set:
    >
    > · Use case 1: new data are regularly and systematically appended to an
    > existing data set over time, e.g. outputs from a satellite sensor such
    > as Landsat or MODIS; no changes are made to the existing data set.
    Correct. In fact, MODIS was one of the pilots we discussed during the
    ESIP workshop in Washington in January. It turned out to be the most
    complex use case discussed during that meeting - but not because of the
    time-stamping and versioning, all of which can be relatively easily
    handled using standard time-stamping technology (i.e. timestamping the
    newly appended records at the record level, with no waste of space
    beyond the storage space of the timestamp, which is minute).
    The challenge was the tracing and storage of the queries, as MODIS
    allows multiple forms of access: the query interfaces offered were felt
    to be straightforward to modify according to the recommendations. The
    problem was the FTP access mode, which does not support this kind of
    query concept and tracing. In this case, a dedicated tool would have to
    be written that monitors the FTP session and aggregates the downloaded
    files if this form of traceable data identification / citation is to be
    supported for this access mechanism as well.
    In any case, the timestamping and versioning seemed to be easily doable
    - at the record level, definitely not relying on redundant snapshots!
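    To illustrate what such a tool might produce (a hypothetical sketch, not
    an existing MODIS service; file names and checksums are made up), the
    files retrieved in one FTP session could be aggregated into a manifest
    whose fingerprint identifies that ad-hoc subset:

        import hashlib
        import json
        from datetime import datetime, timezone

        def build_download_manifest(session_user, downloaded_files):
            # downloaded_files: (path, sha256) pairs collected by whatever
            # monitors the FTP session. Sorting makes the manifest, and hence
            # its fingerprint, independent of the download order.
            entries = sorted(downloaded_files)
            manifest = {
                "user": session_user,
                "retrieved_at": datetime.now(timezone.utc).isoformat(),
                "files": [{"path": p, "sha256": h} for p, h in entries],
            }
            canonical = json.dumps(manifest["files"], sort_keys=True)
            manifest["subset_id"] = hashlib.sha256(canonical.encode()).hexdigest()
            return manifest

        manifest = build_download_manifest(
            "anonymous",
            [("2015.06.10/tile_a.hdf", "ab12"),    # checksums shortened
             ("2015.06.26/tile_b.hdf", "cd34")])
        print(manifest["subset_id"])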
    > · Use case 2: pre-existing data in a large data set are modified or
    > updated. This use case is common where errors are found in pre-existing
    > data, or where analytical or processing techniques affect some
    > attributes of the existing data set.
    correct!
    > For use case 1, time-stamping when the new data are added is the
    > simplest solution.
    correct - this is what the recommendation is meant to state.
    > For use case 2, the data sets have to go through a release process,
    > similar to software releases, and the exact changes to the data set
    > are documented.
    correct as well - and again, this will happen at the individual record
    level (depending on how the data is being stored, i.e. a line in a CSV
    file, a row in an RDBMS table, a bit sequence at a certain position when
    versioning files using a versioning system, a triple in a linked-data
    setting, etc.). There is usually a correspondingly space-efficient way
    to perform this versioning (and the associated timestamping) for each
    data representation; sometimes several options exist with different
    trade-offs.
    > Reading the recommendation and the associated literature, we got the
    > impression that the recommendations have a particular use case in
    > mind, one based on relational databases with tabular, alphanumeric
    > data. Several recommendations are not applicable to data that are not
    > alphanumeric tables.
    If this impression is created, then we definitely need to address this
    issue, as this is not what we intend to communicate!
    While most pilots relied on that kind of data (and it is really
    unfortunate that few other pilots were presented by interested
    stakeholders early in the process so we could address them right from
    the onset - I am not sure why this was the case), the principles should
    be applicable to virtually any form of data - also because this is a
    requirement for the principles to be technology-independent and thus
    applicable beyond the current generation of data representations.
    Dear Lesley,
    Thank you very much for your comments! I highly appreciate this
    discussion, as you raise some very important issues! I would suggest to
    schedule a dedicated telephone conference or webinar to discuss these in
    detail. (We have tried to schedule these webinars at different timeslots
    to enable participation from around the globe at more or less convenient
    times, but this time zone issue will always be a challenge, it seems.)
    I will be travelling/lecturing for the next two weeks with somewhat
    unpredictable internet access availability, so I would suggest to
    arrange a call for the second half of July if that suits you. I can
    prepare a Doodle poll to allow more people to join if second half of
    July in general is ok to you.)
    Until the, we can also discuss a few issues by email - I just wanted to
    provide some quick feedback on your comments inline below. Some of the
    issues you raise have been discussed during the various WG meetings,
    some probably need further clarification that will happen in the more
    extended report to complement the somewhat minimalistic 2-page flyer,
    and some definitely will merit further in-depth discussion.
    > We welcome the efforts of this group and we have been following them
    > with interest (although the timing of many of your webinars makes
    > participation a little difficult!).
    >
    >
    > Now that the final recommendations are available we would like to make
    > the comment that the approach is seems tied to data that are in
    > databases (particularly relational data bases) and are relatively small
    > in volume. There also seems to be a feeling for time-stamped snap shots
    > of the data bases that can be retrieved over time.
    While it is true that many of the pilots we used as a basis for our
    discussion are based on relational databases, we did have a few other
    settings, ranging from (small-scale) CSV files, via XML to some pretty
    large RDBMS. There will be a workshop sponsored by EUDAT to discuss
    large-scale time-series data in October or November (I know, that's
    after the official end of the WG, but hey, 18 months is incredibly short
    if we want to go beyond generic discussions and recommendations,
    elaborating and implementing actual solutions. And we were told that
    there is nothing that prevents us from keeping working on these
    questions even if the official period of the WG is coming to an end :)
    > Our personal experience is with large volume raster arrays that can be
    > over a Petabyte in volume and in multi-petabyte climate models. Storing
    > multiple time stamped snap shots of these is not feasible, fundamentally
    > due to cost of the infrastructure.
    There seem to be two misundestandings resulting from the compact
    phrasing of the recommendations:
    (1) it is not entire time-stamped snap-shots of data representations
    that we recommend to be kept. Rather we state that IF one wants to
    (needs to?) support the option of going back to earlier versions of the
    data, then these earlier versions must be kept available somehow. If
    there is no requirement to enable repeatability, supporting re-use of
    older data to compare models, etc. then there is no need to keep earlier
    versions. Similarly, if it is economically infeasible to do so, there is
    nothing that prevents data from being deleted - however, this should be
    a clear policy decision, not just happen by chance.
    (2) We do not recommend to keep time-stamped snap-shots of the entire
    data. While this is an implementational issue (and thus beyond the level
    of the recommendations/WG) we claim that way more efficient (and also
    more flexible, i.e. down to arbitrary granularity of time-stamping and
    versioning as opposed to fixed snapshots) means of time-stamping and
    versioning data are available for most data representations.
    For an RDBMS, for example, we do not recommend snapshot dumps of the
    DB, but integrating versioning at the record level: either by
    integrating it into the master tables, which requires changes to all
    APIs; or by having dedicated history tables, which avoids changes to
    existing APIs and keeps the performance of the master system unchanged,
    but wastes significant amounts of space; or by hybrid solutions, which
    are more space-efficient but less performant when processing historic
    queries. None of this is new technology - it is even state of the art
    in many data centres, sometimes to the surprise of researchers when
    they talk to the ICT staff about this versioning issue.
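    To make the history-table variant concrete, here is a minimal sketch in
    Python/SQLite (table and column names are invented for illustration;
    this is not a reference implementation of the recommendations, just the
    general idea of record-level timestamping and versioning):

        import sqlite3
        from datetime import datetime, timezone

        con = sqlite3.connect(":memory:")
        con.executescript("""
            -- master table: existing APIs keep reading and writing this one
            CREATE TABLE obs (id INTEGER PRIMARY KEY, value REAL);
            -- history table: every change together with its validity interval
            CREATE TABLE obs_history (
                id INTEGER, value REAL,
                valid_from TEXT, valid_to TEXT
            );
        """)

        def upsert(rec_id, value):
            """Write to the master table and log the change in the history table."""
            now = datetime.now(timezone.utc).isoformat()
            # close the currently valid history entry for this record, if any
            con.execute("UPDATE obs_history SET valid_to = ? "
                        "WHERE id = ? AND valid_to IS NULL", (now, rec_id))
            con.execute("INSERT OR REPLACE INTO obs (id, value) VALUES (?, ?)",
                        (rec_id, value))
            con.execute("INSERT INTO obs_history (id, value, valid_from, valid_to) "
                        "VALUES (?, ?, ?, NULL)", (rec_id, value, now))
            con.commit()

        def as_of(timestamp):
            """Return the records as they were valid at the given ISO timestamp."""
            return con.execute(
                "SELECT id, value FROM obs_history "
                "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
                "ORDER BY id", (timestamp, timestamp)).fetchall()

    The history table grows with the number of changes rather than with the
    number of snapshots, which is what makes this feasible for data that is
    updated frequently.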
    For CSV files there is a prototype that demonstrates this functionality
    via transparent migration to an RDBMS, or by applying a versioning
    system in the backend (e.g. Git, SVN). A similar approach seems
    feasible in settings using file-based repositories. Repositories, in
    general, usually consist of these two components, i.e. some form of
    file-system-based storage (which can be versioned, even at the bit
    level, transparently to the system) and a form of database storing
    metadata about the individual files.
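    A rough sketch of the Git-backend idea (this is not the actual
    prototype; the repository path and helper names are made up, and the
    directory is assumed to already be a Git repository containing the CSV
    file):

        import subprocess
        from pathlib import Path

        REPO = Path("data_repo")   # assumed to be an existing Git repository

        def commit_new_version(csv_name, message):
            """Commit the current state of the CSV file; the returned commit id
            is the versioned state a query PID can later point back to."""
            subprocess.run(["git", "-C", str(REPO), "add", csv_name], check=True)
            subprocess.run(["git", "-C", str(REPO), "commit", "-m", message], check=True)
            out = subprocess.run(["git", "-C", str(REPO), "rev-parse", "HEAD"],
                                 check=True, capture_output=True, text=True)
            return out.stdout.strip()

        def materialise_version(commit_id, csv_name, target):
            """Restore an earlier version of the file so a cited query can be re-run."""
            out = subprocess.run(["git", "-C", str(REPO), "show", f"{commit_id}:{csv_name}"],
                                 check=True, capture_output=True, text=True)
            Path(target).write_text(out.stdout)

    The versioning itself is delegated entirely to the backend; the data
    citation layer only needs to record which commit a given query was
    executed against.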
    > There are at least two user cases where new data are dynamically added
    > to an existing data set:
    >
    > ·Use case 1: new data are regularly and systematically appended to an
    > existing data set over time, e.g., with outputs from a satellite sensor
    > such as Landsat or MODIS: no changes are made to the existing dataset.
    Correct. In fact, MODIS was one of the pilots we discussed during the
    ESIP workshop in Washington in January. It turned out to be the most
    complex use case discussed during that meeting - but not because of the
    time-stamping and versioning, all of which can be handled relatively
    easily using standard time-stamping technology (i.e. timestamping the
    newly appended records at the record level, with no waste of space
    beyond the storage space of the timestamp, which is minute).
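    In code, the append-only case really is that simple - a toy sketch (all
    names are invented here, and the list merely stands in for whatever
    storage the data centre actually uses): every newly ingested record
    carries its ingest timestamp, and citing the data set at a point in time
    just means selecting everything ingested up to that timestamp.

        from datetime import datetime, timezone

        archive = []   # stand-in for the real append-only store

        def append_granule(granule_id, payload):
            """Append a newly arrived record together with its ingest timestamp."""
            archive.append({
                "id": granule_id,
                "ingested_at": datetime.now(timezone.utc).isoformat(),
                "payload": payload,
            })

        def select_as_of(cutoff_iso):
            """Re-create the data set as it looked when a citation was minted:
            everything ingested up to the cited timestamp, nothing later."""
            return [g for g in archive if g["ingested_at"] <= cutoff_iso]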
    The challenge was the tracing and storage of the queries, as MODIS
    allows multiple forms of access: the query interfaces offered were felt
    to be straightforward to modify according to the recommendations. The
    problem was the FTP access mode, which does not support this kind of
    query concept and tracing. In this case, a dedicated tool would have to
    be written that monitors the FTP session and aggregates the downloaded
    files, if this form of traceable data identification / citation should
    be supported for that access mechanism as well.
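    If such a tool were written, the aggregation step itself could be as
    simple as the following sketch (entirely hypothetical: it assumes some
    session-monitoring hook has already recorded which files were
    downloaded, which is the genuinely hard part):

        import hashlib, json
        from datetime import datetime, timezone
        from pathlib import Path

        def session_manifest(downloaded_files, session_id):
            """Aggregate the files fetched in one FTP session into a manifest
            that a PID could then be attached to."""
            entries = []
            for f in downloaded_files:
                digest = hashlib.sha256(Path(f).read_bytes()).hexdigest()
                entries.append({"path": f, "sha256": digest})
            manifest = {
                "session": session_id,
                "retrieved_at": datetime.now(timezone.utc).isoformat(),
                "files": sorted(entries, key=lambda e: e["path"]),  # stable order, cf. R5
            }
            return json.dumps(manifest, indent=2)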
    In any case, the timestamping and versioning seemed to be easily doable
    - at the record level, definitely not relying on redundant snapshots!
    > ·Use case 2: pre-existing data in a large data set is modified or
    > updated. This use case is common where errors are found in pre-existing
    > data, or analytical and or processing techniques affect some attributes
    > of the existing data set.
    Correct!
    > For use case 1, time stamping of when the new additions are added is the
    > simplest solution.
    Correct - this is what the recommendation is meant to state.
    > For use case 2, the data sets have to go through a release process,
    > similar to software and the exact changes to the data set are documented.
    Correct as well - and again, this will happen at the individual record
    level (depending on how the data is being stored, i.e. a line in a CSV
    file, a row in an RDBMS table, a bit sequence at a certain position
    when versioning files using a versioning system, a triple in a
    linked-data setting, etc.). There usually is a correspondingly
    space-efficient way to perform this versioning (and the associated
    timestamping) for each data representation; sometimes several options
    exist, with different trade-offs.
    > Reading the recommendation and the associated literature, we got the
    > impression that the recommendation had a certain use case in mind that
    > is based on relational databases with tabular, alphanumerical data.
    > Several recommendations are not applicable to data that are not
    > alphanumerical tables.
    If this impression is created, then we definitely need to address the
    issue, as it is not what we intend to communicate!
    While most pilots relied on that kind of data (and it is really
    unfortunate that few other pilots were presented by interested
    stakeholders early in the process, so that we could have addressed them
    right from the outset - I am not sure why this was the case), the
    principles should be applicable to virtually any form of data - also
    because this is a requirement for the principles to be
    technology-independent and thus applicable beyond the current
    generation of data representations.
    > The usefulness of query uniqueness for addressing
    > web services (R4) is questionable and stable sorting (R5) does not make
    > sense with, say, raster data.
    We absolutely agree - and we should probably add the rationale for
    these recommendations, however briefly, already in the abridged form.
    We discussed this during several meetings. The rationale is:
    (R5): This is only required if the sorting of the data has a potential
    impact on subsequent processing. This is an issue in many process
    chains (to clarify the goal: we not only want to support citation for
    the purpose of giving credit, but also want precise data identification
    to support repeatability/verifiability/re-use in a processing chain)
    where the order in which individual data elements are fed into a
    process may impact the result, e.g. in machine learning tools.
    In such cases, unique sorting is highly desirable. In other settings,
    as mentioned by you, it either does not matter or does not make sense,
    as the order in which the individual bits are returned is
    pre-determined anyway (in which case the "unique sorting" of the bit
    sequence would even be fulfilled automatically).
    We will add a clarification on this to R5!
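    To illustrate why ordering can matter downstream (a toy example, not
    tied to any particular pilot - the incremental update below merely
    stands in for any order-sensitive processing step such as online
    learning):

        def order_sensitive_fit(values, lr=0.5):
            """A toy incremental update whose result depends on input order."""
            estimate = 0.0
            for v in values:
                estimate = (1 - lr) * estimate + lr * v
            return estimate

        records = [("b", 10.0), ("a", 2.0), ("c", 6.0)]

        # The same records in two different orders give two different results:
        print(order_sensitive_fit([v for _, v in records]))          # 4.75
        print(order_sensitive_fit([v for _, v in sorted(records)]))  # 5.75

        # R5: sort on a unique key before feeding data into the process, so that
        # re-executing the cited query always reproduces the same result.
        stable = [v for _, v in sorted(records, key=lambda r: r[0])]
        print(order_sensitive_fit(stable))                           # always 5.75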
    R4: We again should probably add the rationale for this even in the
    short form of the flyer. The reason stems from communities where there
    is a strong desire to ensure that the (semantically) same subset of data
    gets the same PID - i.e. they do not want to have different PIDs for
    what is actually the same intellectual thing.
    Thus we need to identify whether two "queries" are semantically
    identical - which usually requires some form of normalization. It is
    impossible to guarantee this in the generic case, not even for
    languages like SQL. But given that most data sources are accessed by
    researchers not via completely open programming APIs but via more
    dedicated workbenches or limited APIs, this is viable - at least
    according to the famous 80:20 rule.
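    A crude sketch of what such a normalization could look like for a
    constrained query interface (purely illustrative - the parameter names
    are invented, and real normalization rules depend entirely on the query
    language and interface at hand):

        import hashlib

        def normalize(query_params):
            """Bring a constrained query into a canonical form so that
            semantically identical requests hash to the same value."""
            canonical = {
                "dataset": query_params["dataset"].strip().lower(),
                # the order of filter terms must not matter, so sort them
                "filters": sorted((k.lower(), str(v))
                                  for k, v in query_params.get("filters", {}).items()),
                "as_of": query_params["as_of"],  # timestamp pinning the data version
            }
            return repr(canonical)

        def query_fingerprint(query_params):
            return hashlib.sha256(normalize(query_params).encode()).hexdigest()

        # Two requests differing only in filter order and capitalisation get the
        # same fingerprint and can therefore share one PID:
        q1 = {"dataset": "Sensor-A", "filters": {"band": 4, "region": "AU"},
              "as_of": "2015-06-25T00:00:00Z"}
        q2 = {"dataset": "sensor-a", "filters": {"region": "AU", "band": 4},
              "as_of": "2015-06-25T00:00:00Z"}
        assert query_fingerprint(q1) == query_fingerprint(q2)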
    I would be very interested in discussing the need for (or irrelevance
    of) this requirement in a raster query setting, and what the concept of
    query uniqueness would resolve to if the query consisted of a free-form
    area drawn on an image. Maybe anything but pixel-identical boundaries
    is different by definition; maybe boundaries can be "normalized" to a
    certain granularity or to the boundary conditions used in the
    subsequent region retrieval algorithm. This is, of course, an optional
    aspect, but it was considered important enough in many use cases to be
    included in the recommendations, while not harming those settings where
    it is not needed. But again, we maybe should stress this even in the
    short version.
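    For the free-form-area case, one conceivable normalization (just a
    thought experiment, assuming a regular pixel grid and a grid size chosen
    by the region retrieval algorithm) would be to snap the drawn boundary
    to that grid, so that visually identical selections collapse onto the
    same canonical query:

        def snap_region(vertices, grid_size):
            """Snap a free-form polygon, given as (x, y) pixel coordinates,
            to a coarser grid so near-identical selections become identical."""
            snapped = [(round(x / grid_size) * grid_size,
                        round(y / grid_size) * grid_size) for x, y in vertices]
            # drop consecutive duplicates created by the snapping
            canonical = [p for i, p in enumerate(snapped)
                         if i == 0 or p != snapped[i - 1]]
            return tuple(canonical)

        # Two hand-drawn outlines differing by a few pixels map to the same region:
        a = [(101, 198), (402, 203), (399, 601)]
        b = [(99, 202), (398, 197), (401, 598)]
        assert snap_region(a, grid_size=10) == snap_region(b, grid_size=10)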
    I am not entirely sure what your reference to web services relates to
    in this context (apart from the fact that, if a web service is being
    used for processing the queries, this normalization and the assignment
    of a PID would likely happen within that service).
    > To accommodate also other forms of data, we think that this
    > recommendation should be more general. For these large scale,
    > non-numerical data sets we would recommend that provenance workflow
    > engines are used, that automatically capture the version of the data set
    > that was used, the version of the software as well as the infrastructure
    > to process the data, and the exact time the process was run. The
    > Provenance workflow itself would have a persistent identifier, as would
    > all components of the workflow.
    This is exactly the direction in which we would like to see this
    evolve: not just versioning the data and tracking its provenance, but
    applying the same principles to entire workflows, documenting their
    provenance and versions, and going even much deeper (as well-known
    examples published e.g. in PLOS ONE have shown, documenting the source
    code is not sufficient to guarantee identical re-execution or proper
    re-use: we need to go much deeper into documenting the underlying
    operating system, the versions of the individual libraries being used,
    even down to the hardware level).
    We have implemented such a system, starting from the execution of a
    process instance (either ad hoc, i.e. executing individual steps on a
    command line, or via a well-defined workflow modelled in a workflow
    language), monitoring all data items, files, I/O, libraries, ports, web
    services, user IDs, etc. touched or used, to create a context model.
    This can subsequently be refined (aggregated, manually enhanced) to
    represent a process context. Additionally, verification data can be
    captured, together with according metrics for comparison, to verify
    correctness upon re-execution.
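    This is not that system, but a minimal sketch of the kind of
    information such a context model captures (library names and file paths
    are placeholders; the real capture is of course far more comprehensive
    and happens automatically at execution time):

        import hashlib, platform, sys
        from datetime import datetime, timezone
        from importlib import metadata
        from pathlib import Path

        def capture_context(input_files, packages):
            """Record a minimal execution context: environment, library versions,
            input fingerprints and the time the process was run."""
            return {
                "run_at": datetime.now(timezone.utc).isoformat(),
                "python": sys.version,
                "platform": platform.platform(),
                "libraries": {p: metadata.version(p) for p in packages},
                "inputs": {f: hashlib.sha256(Path(f).read_bytes()).hexdigest()
                           for f in input_files},
            }

        # e.g. capture_context(["obs.csv"], ["numpy"]) - the resulting document
        # is what a persistent identifier could be attached to, alongside the
        # PIDs of the data queries and workflow components it references.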
    The data used in such a process is then represented via a PID pointing
    to the respective "query" identifying that data - where the data
    management system has the responsibility of tracking the provenance of
    the data. It all becomes networked into what you may refer to as a
    Research Object, Process Management Plan, Process Context Model,
    Verification Plan, or whichever other terminology finally emerges to
    capture this complex network.
    However, this goes way beyond what we were aiming to address within the
    scope of this 18-month WG :-) But I would be very happy to discuss this
    - the suitability of the prototypes we developed, their limitations,
    their applicability in your setting, etc. - potentially even within the
    context of a dedicated RDA WG under the umbrella of the Repeatability
    IG. Would you be interested?
    In any case, within the RDA WGDC we wanted to keep things focused (as a
    short-term solution that can be applied with reasonable effort - the
    "low-hanging fruit" credo for WGs), concentrating on how the provenance
    of a specific data subset could be made trackable efficiently. The
    solution to that seems to be the timestamping and versioning of the
    data (to track its provenance) and storing the query (to efficiently
    identify arbitrary subsets and record their semantic characteristics).
    So far we have not come across a data setting where these principles
    could not be applied. The effort of implementing and rolling out this
    service will naturally differ, depending on the amount of data, APIs,
    tapes and number of users, etc. - but it seems to be a viable solution
    in all cases discussed so far.
    If you have a specific use case (raster data?) where there is a desire
    to support such precise data identification/citation, I'd be very happy
    to learn more about it and to work through it together on a concrete
    case, to see whether any changes to the recommendations are necessary.
    We will definitely consider revising/rephrasing the recommendations
    where misunderstandings seem to emerge (do you have any suggestions in
    terms of wording for clarifying these concisely - assuming that I
    managed to clarify them at all in this rather lengthy text?).
    We will definitely also prepare a more comprehensive document detailing
    all the thinking behind the individual recommendations, their
    applicability, and how to potentially implement them.
    Sorry for the long email, and thanks for reading all the way down to
    here (assuming you did) - I hope it helped as a first clarification
    ahead of a potential follow-up call, if there is interest from your
    side.
    best regards, Andreas
