Metadata for collections

  • Creator
    Discussion
  • #117982

    Rolf Krahl
    Participant

    Dear all,
    I got a question for all the metadata experts out there:
    If I understand DataCite right, you’ll always have one metadata record
    for one single resource and as a consequence one separate metadata
    file for each resource. So if you have, for instance, a collection of
    n datasets and you want to describe the collection as a whole and also
    every single dataset, you will need n+1 metadata files. These
    metadata records should refer to each other via the RelatedIdentifier
    property with relationType IsPartOf and HasPart respectively.
    Is that understanding correct? Or is there any way with DataCite to
    put the metadata for the whole collection and for the individual
    datasets in one single metadata file?
    If the above is correct, we might need to amend our recommendations
    document to allow for multiple metadata files using one single schema.
    I’d assume that a collection of datasets is a rather common use case
    that we need to take into account. Maybe allow something like:
    datacite.xml
    datacite-ds001.xml
    datacite-ds002.xml
    datacite-ds003.xml
    datacite-ds004.xml
    in the META-INF folder, where datacite.xml describes the collection
    and datacite-<id>.xml describes dataset <id>.
    Best regards,
    Rolf

    Rolf Krahl
    Helmholtz-Zentrum Berlin für Materialien und Energie (HZB)
    Albert-Einstein-Str. 15, 12489 Berlin
    Tel.: +49 30 8062 12122
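
The n+1 layout described in the post above can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the recommendations document: the function names `related_identifiers` and `metadata_plan` are hypothetical, the identifiers are assumed to be DOIs, and only the `relatedIdentifiers` fragment of each record is built, not a complete, schema-valid DataCite file.

```python
import xml.etree.ElementTree as ET


def related_identifiers(links):
    """Build a DataCite-style <relatedIdentifiers> fragment.

    links: list of (relation_type, doi) tuples, e.g. ("HasPart", "10.5072/x").
    Assumption: all identifiers are DOIs.
    """
    root = ET.Element("relatedIdentifiers")
    for relation_type, doi in links:
        element = ET.SubElement(
            root,
            "relatedIdentifier",
            relatedIdentifierType="DOI",
            relationType=relation_type,
        )
        element.text = doi
    return root


def metadata_plan(collection_doi, dataset_dois):
    """Map each proposed metadata file name to its relatedIdentifiers fragment.

    dataset_dois: dict of dataset id (e.g. "ds001") -> DOI.
    The file naming follows the scheme proposed in the post
    (datacite.xml plus one datacite-<id>.xml per dataset).
    """
    # The collection record points at every member with HasPart.
    plan = {
        "datacite.xml": related_identifiers(
            [("HasPart", doi) for doi in dataset_dois.values()]
        )
    }
    # Each dataset record points back at the collection with IsPartOf.
    for dataset_id in dataset_dois:
        plan[f"datacite-{dataset_id}.xml"] = related_identifiers(
            [("IsPartOf", collection_doi)]
        )
    return plan
```

For a collection of n datasets the returned dictionary has n+1 entries, one per metadata file; each fragment can be turned into text with `ET.tostring(fragment, encoding="unicode")`.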

  • Author
    Replies
  • #131498

    Dear Rolf, dear all,
    that is indeed a good question. If we define our bags to be
    “single-object” bags, which is how BagIt bags will be used in the
    DARIAH-DE Repository, we have no problem: we have one content file and a
    bunch of metadata files (plus one DataCite metadata file).
    How are all the other repositories using BagIt bags? Are they all
    single-object, too? If so, we would only need one more metadata file,
    the DataCite one.
    All the best,
    Stefan

    Stefan E. Funk
    Abteilung Forschung & Entwicklung
    Georg-August-Universität Göttingen
    Niedersächsische Staats- und Universitätsbibliothek Göttingen
    D-37070 Göttingen
    Papendiek 14 (Historisches Gebäude, Raum 2.409)
    +49 551 39-7700 (Tel)
    +49 551 39-3856 (Fax)
    ***@***.***-goettingen.de
    http://www.sub.uni-goettingen.de
    http://www.rdd.sub.uni-goettingen.de

  • #131497

    Dear all,
    that’s a really good point, and I see the same ‘limitations’ when using DataCite. In our case statement and in the primer document we talk about ‘Migration/Replication of a Digital Object[…]’, which (for me) implies a single resource. Of course, this depends on the definition of ‘Digital Object’. If we define a collection to be a Digital Object/resource, too, we are fine as long as datacite.xml describes only the collection itself as the bag content, which should be possible.
    If we also want to describe the individual resources of the collection, we’ll need some hierarchical approach for providing generic metadata, e.g. in the form suggested by Rolf. The problem here is that the individual part IDs (ds001 – ds004) must be mapped to the individual collection resources as well as to the payload located in the ‘data’ folder. Thus, adopters of our recommendations may have to change the structure or naming of their payload, which could break existing implementations. However, we should continue this discussion in a couple of minutes.
    Regards,
    Thomas.

    Karlsruhe Institute of Technology (KIT)
    Institute for Data Processing and Electronics
    Hermann-von-Helmholtz-Platz 1
    76344 Eggenstein-Leopoldshafen
    Germany
    fon : +49 721 608-24042
    fax : +49 721 608-23560
    ORCID: http://orcid.org/0000-0003-2804-688X
    ———————————————————
    Does it really, cosmically speaking, matter if I don’t get up and go to work?
    -Douglas Adams-
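
The mapping problem raised in the post above (each part ID must correspond both to a metadata file and to payload in the ‘data’ folder) can be illustrated with a small consistency check. This is only a sketch under assumptions that the thread does not settle: the META-INF folder is assumed to sit directly under the package root, a payload entry is assumed to belong to dataset <id> when its file or directory name without the last extension equals <id>, and the function name `check_part_mapping` is hypothetical.

```python
from pathlib import Path

PREFIX = "datacite-"  # per-dataset metadata files: datacite-<id>.xml


def check_part_mapping(package_root):
    """Compare part IDs found in META-INF with payload entries in data/.

    Returns a dict listing part IDs that have metadata but no payload,
    and payload entries that have no per-dataset metadata file.
    """
    root = Path(package_root)
    # datacite.xml (the collection record) does not match this pattern,
    # so only per-dataset records are collected here.
    metadata_ids = {
        path.stem[len(PREFIX):]
        for path in (root / "META-INF").glob(f"{PREFIX}*.xml")
    }
    # Assumption: data/ds001.zip or data/ds001/ is the payload of ds001.
    payload_ids = {path.stem for path in (root / "data").iterdir()}
    return {
        "missing_payload": sorted(metadata_ids - payload_ids),
        "missing_metadata": sorted(payload_ids - metadata_ids),
    }
```

A check like this makes the cost mentioned in the post visible: any repository whose payload is not already named after the part IDs would show up in both lists and would have to rename its payload or maintain an explicit mapping instead.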
    In reply to Stefan Funk’s message of 13.09.17, 14:09 (#131498 above).

  • #131494

    What about using the definition from the Research Data Collections WG as
    the definition(s) for collection, to stay in line with the other WGs?
    https://www.rd-alliance.org/group/research-data-collections-wg/wiki/coll
    Claire

  • #131492

    fwiw, you could also look into our draft RDA recommendation:
    https://github.com/RDACollectionsWG/specification
    Collections are just a specific kind of DO, and the recommendation
    reflects that. So you can indeed hide much of the hierarchy complexity.
    Best, Tobias

  • #131491

    Thank you for this comment, Tobias. I checked your definition of ‘Collection’ this morning and added it to our recommendations document. Following this definition, we also support collections, since they are defined as digital objects.
    However, the discussion we started yesterday concerned the use case of packaging a ‘local’ collection of datasets, stored in one repository instance, as multiple zip files in one package, with the possibility of adding (DataCite) metadata for the entire collection as well as for the individual datasets at a defined location in the package, following a defined naming scheme. Thus, we must be able to reflect, identify and address the individual elements of the collection within the package.
    Regards,
    Thomas

    Karlsruhe Institute of Technology (KIT)
    Institute for Data Processing and Electronics
    Hermann-von-Helmholtz-Platz 1
    76344 Eggenstein-Leopoldshafen
    Germany
    fon : +49 721 608-24042
    fax : +49 721 608-23560
    ORCID: http://orcid.org/0000-0003-2804-688X
    ———————————————————
    Does it really, cosmically speaking, matter if I don’t get up and go to work?
    -Douglas Adams-
    In reply to message #131492 above, sent on 14.09.17, 09:11 by:

    Dr. Tobias Weigel
    Abteilung Datenmanagement
    Deutsches Klimarechenzentrum GmbH (DKRZ)
    Bundesstraße 45 a • 20146 Hamburg • Germany
    Phone: +49 40 460094-104
    Email: ***@***.***
    URL: http://www.dkrz.de
    ORCID: orcid.org/0000-0002-4040-0215
    Managing Director: Prof. Dr. Thomas Ludwig
    Registered office: Hamburg
    Commercial register: Amtsgericht Hamburg, HRB 39784

  • #131490

    Rolf Krahl
    Member

    Dear Claire, Tobias & all,
    Thank you for the pointers! Both definitions match the particular use
    case I had in mind very well. However, this is rather orthogonal
    to what we are doing in our recommendations document. This document
    considers best practices on how to package digital objects for
    transport from one repository to another and how to add the metadata to
    these packages such that the receiving end will be able to find it.
    After all, we are pretty agnostic about what these objects are.
    Best regards,
    Rolf

    Rolf Krahl
    Helmholtz-Zentrum Berlin für Materialien und Energie (HZB)
    Albert-Einstein-Str. 15, 12489 Berlin
    Tel.: +49 30 8062 12122

  • #131489

    Dear Rolf,
    even if I don’t know exactly what your current recommendations say, I
    would assume that the collection definition is not really orthogonal.
    Because a collection is a DO, you can handle it like the others: you
    ingest it into the new repository with the recommended packaging. In the
    new environment it gets a new PID and is a new collection, referring to
    all its old members. The new collection with its new PID has just the
    same tree structure of other collections as before, ultimately with
    non-collection DOs as its leaves, as before.
    Things become more expensive if you want to actually transfer the whole
    tree. This would be an iterative ingest into the new repository by
    backtracking through the collection tree. After each ingest, one replaces
    the old member PID in the collection structure with the new one created
    by the ingest. In the end, for each member in the tree you would have a
    new DO in the new repository, and all references in the new collection
    would be set properly.
    However, if you are looking for a one-step ingest for this whole process,
    you need a relatively complicated packaging schema, and you have to
    set up the PID structure for all contained DOs after the ingest of that
    package anyway. So this seems to me rather expensive and without a major
    advantage.
    If you see the orthogonality of these approaches here, then you might be
    right. The collection definition we have in the Collections WG is more
    intended for getting an overview but for step by step processing.
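
The iterative ingest described in the post above amounts to a post-order walk of the collection tree: members are ingested first, and each collection is then ingested with its member PIDs already replaced. The following Python sketch illustrates that idea only. It assumes the source repository can be modelled as a dictionary from PID to a list of member PIDs (empty for non-collection DOs), that `mint_pid` is a hypothetical stand-in for whatever PID service the new repository uses, and that the tree contains no cycles.

```python
def ingest_tree(pid, source, target, mint_pid, pid_map):
    """Ingest the object `pid` and everything below it into `target`.

    source:   dict old PID -> list of old member PIDs
    target:   dict new PID -> list of new member PIDs (filled in here)
    mint_pid: callable returning a fresh PID (hypothetical PID service)
    pid_map:  dict old PID -> new PID, shared across the whole walk
    Returns the new PID of the ingested object.
    """
    if pid in pid_map:
        # Already ingested, e.g. a DO that is a member of two collections.
        return pid_map[pid]
    # Ingest all members first, collecting their new PIDs ...
    new_members = [
        ingest_tree(member, source, target, mint_pid, pid_map)
        for member in source[pid]
    ]
    # ... then ingest this object with the old member PIDs replaced.
    new_pid = mint_pid()
    target[new_pid] = new_members
    pid_map[pid] = new_pid
    return new_pid
```

Calling `ingest_tree` on the root collection yields one new DO per member of the tree, with all references in the new collections pointing at new PIDs. The sketch says nothing about packaging; each recursive step would still use the recommended single-object package for the actual transfer.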
