Collection requirements, streaming

March 16, 2016 at 12:41 pm #122856

Tobias Weigel
Member

Dear all, Frederik,
attached is a new version of the draft requirements document with
updates from during and after the Tokyo plenary session. Not complete in
any sense, but a good start 🙂 There are still several things to think
about and discuss.
One item I find very interesting is Frederik’s notion of a “stream view”
of collections (one possible model, in the terms of our case statement).
So far, I understand this as putting a PID on a live broadcast video:
The collection grows over time, but only at the head, is strictly
ordered, and a typical access pattern is to receive contiguous parts of
it. To fit into our current general model for collections, the items in
the stream may also have to be discrete and should not overlap; but we
may also think about items that are not discretely defined, which will
be a bit more complex. Is this what you had in mind? There are probably
several more aspects of this model that we have to work out.
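As a rough illustration of the stream view described above, here is a minimal Python sketch of a collection that grows only at the head, stays strictly ordered, holds discrete non-overlapping items, and is read in contiguous parts. All names (`StreamItem`, `StreamCollection`, the `example/...` PIDs) and the use of numeric intervals are assumptions made for the example; nothing here comes from the requirements document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class StreamItem:
    pid: str      # hypothetical identifier of the item
    start: float  # inclusive start of the interval the item covers
    end: float    # exclusive end of the interval the item covers


@dataclass
class StreamCollection:
    pid: str  # hypothetical PID of the stream as a whole
    items: List[StreamItem] = field(default_factory=list)

    def append(self, item: StreamItem) -> None:
        """Grow the stream at the head only; reject overlap or reordering."""
        if item.end <= item.start:
            raise ValueError("item must cover a non-empty interval")
        if self.items and item.start < self.items[-1].end:
            raise ValueError("item overlaps or precedes the current head")
        self.items.append(item)

    def read(self, start: float, end: float) -> List[StreamItem]:
        """Return the ordered run of items that intersect [start, end)."""
        return [i for i in self.items if i.start < end and i.end > start]


stream = StreamCollection(pid="example/stream-1")
stream.append(StreamItem("example/item-1", 0, 10))
stream.append(StreamItem("example/item-2", 10, 20))
stream.append(StreamItem("example/item-3", 20, 30))
print([i.pid for i in stream.read(5, 15)])
```

Items that are not discretely defined would not fit this sketch, since `append` relies on each item having fixed boundaries.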
Best, Tobias
—
Tobias Weigel
Abteilung Datenmanagement
Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45 a • 20146 Hamburg • Germany
Phone: +49 40 460094-104
Email: ***@***.***
URL: http://www.dkrz.de
ORCID: orcid.org/0000-0002-4040-0215
Geschäftsführer: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784

Collection_WG_requirements-20160316.docx
March 16, 2016 at 12:56 pm #133296

Thomas Zastrow
Member

From what I read in the doc, this “stream” concept seems to be provenance data?
OwnCloud has an “activities” module which tracks all the
things/changes you make (see attached screenshot).
March 16, 2016 at 2:55 pm #133294

Frederik Baumgardt
Member

Yes to Tobias’ question. And I think it might end up as provenance data, Thomas. The issue here is the definition of a PID.
In the models that I’m familiar with, PIDs can only reference immutable objects, and a stream would really be a set of relations between a series of static PID-referenced collections, e.g. realized as ‘parent’ properties.
However, I wonder how the PID model for e.g. ORCID deals with the mutability of the personal information it references. That is, does it solve the issue with some sort of indirection, or are there two different concepts of identity at work here, semantic and structural, where semantic identity is preserved over structural changes and structural identity is not? In that case semantic identity would actually be an indirection, and I would be curious how that’s implemented (some constant properties in the referenced object?) and how it affects citability.
Sorry if my lack of familiarity with the previous work shows here; I’m actually working through a couple of different PID specs at the moment.
March 17, 2016 at 10:59 am #133293

Tobias Weigel
Member

Hello Frederik,
to be precise: is a stream a set of relations between a series of static
collections or static objects? I can imagine how both ways may be
useful, but for me, they point to different models:
A) Each object gets a PID. Each new object is related to its predecessor
object through a relation. The relations together form a collection
(with a dedicated PID).
B) Each object gets a PID. Each state of the whole thing at a specific
point in time gets a PID (forms a collection with objects as parents).
Whenever a new state is introduced (for example, by adding a new object
– but there can also be other changes!), a new static collection is
formed and the new collection is related to the old with a relation,
thereby creating a second hierarchical level on top that looks like (A)
again.
Obviously, B is more complex than A, but might be required for some use
cases where there is a need to reference specific states. Regarding
identity: Model A preserves semantic identity, but not structural
identity. Model B preserves both (thus, the increased costs). Does this
sound correct?
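To make the difference between the two models concrete, here is a minimal Python sketch. It is one possible reading only: the class names, the `predecessor` field, and the `state-N` PID scheme are assumptions invented for the example. Model A is the chain of `ObjectRecord` entries under a single collection PID; Model B additionally freezes the membership after every change as a `StateRecord` with its own PID, linked to the previous state.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ObjectRecord:
    pid: str
    predecessor: Optional[str]  # Model A: relation to the previous object


@dataclass(frozen=True)
class StateRecord:
    pid: str
    members: Tuple[str, ...]    # Model B: frozen membership of one state
    predecessor: Optional[str]  # Model B: relation to the previous state


class Stream:
    def __init__(self, collection_pid: str) -> None:
        self.collection_pid = collection_pid  # Model A: one PID for the whole
        self.objects: List[ObjectRecord] = []
        self.states: List[StateRecord] = []

    def add(self, object_pid: str) -> StateRecord:
        # Model A: link the new object to its predecessor object.
        prev_obj = self.objects[-1].pid if self.objects else None
        self.objects.append(ObjectRecord(object_pid, prev_obj))
        # Model B: mint a new static collection for the resulting state.
        prev_state = self.states[-1] if self.states else None
        state = StateRecord(
            pid=f"{self.collection_pid}/state-{len(self.states) + 1}",
            members=(prev_state.members if prev_state else ()) + (object_pid,),
            predecessor=prev_state.pid if prev_state else None,
        )
        self.states.append(state)
        return state


stream = Stream("example/coll")
first = stream.add("example/obj-1")
second = stream.add("example/obj-2")
print(stream.collection_pid, first.pid, second.pid, second.members)
```

In this reading, `collection_pid` keeps referring to "the stream" while its content changes (semantic identity), whereas each `StateRecord` never changes once created (structural identity). The extra record per change is where the added cost of B shows up.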
ORCID: I would assume that upon a change of personal information in
ORCID, no new PID is formed – at least not an ORCID, as this would
defeat the whole purpose. But ORCID might be a candidate for
model A if there is enough end-user value.
Best, Tobias
March 17, 2016 at 12:07 pm #133292

Bridget Almas
Member

Hi Tobias,
Both would seem to be important to support. Model A represents a view on
collections that I hadn’t really thought about before. If I understand
correctly, under this approach, the entire history of changes for an
object is itself a collection? So a URI
like orcid.org/0000-0001-7556-1572/history becomes a PID for the
collection of records that make up the history of changes to the ORCID
record?
Best
Bridget
March 17, 2016 at 3:32 pm #133291

Tobias Weigel
Member

Hi Bridget,
yes, that is entirely possible. I think we can describe this through two
further variations on A:
A1) The objects do not have a shared identity, but together, they make
up a whole (a bag of several apples).
A2) Each object is a new iteration on the previous one, taking over some
aspects of its identity (the ORCID history example or an SVN trunk history).
There are probably better ways to express this – we will need more
precise model descriptions at some point.
Best, Tobias
March 17, 2016 at 6:50 pm #133290

Frederik Baumgardt
Member

@Tobias: Do the diagrams I put up on the Wiki reflect your mental models? I wouldn’t expect them to, so feel free to replace them or scribble on them. I’d also save some of this discussion there, but I’m not yet sure I have a full grasp of everybody’s conceptions.
@Bridget: It’s my understanding that the history pointer in your example does not meet the criteria of a PID, because its content is mutable. The same goes for the ‘latest’ pointer in a versioning system, or ‘HEAD’ in git, whereas the commit IDs would meet them.
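The distinction between a mutable pointer and immutable identifiers can be sketched in a few lines of Python. This is a toy model and not how git actually computes its object IDs: the `commit_id` function, the JSON payload, and the `History` class are assumptions made for the illustration.

```python
import hashlib
import json
from typing import Dict, Optional, Tuple


def commit_id(content: str, parent: Optional[str]) -> str:
    """Derive an identifier from the content and the parent identifier."""
    payload = json.dumps({"content": content, "parent": parent}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class History:
    def __init__(self) -> None:
        # Immutable part: each ID always maps to the same (content, parent).
        self.commits: Dict[str, Tuple[str, Optional[str]]] = {}
        # Mutable part: a 'latest' pointer that moves with every commit.
        self.head: Optional[str] = None

    def commit(self, content: str) -> str:
        cid = commit_id(content, self.head)
        self.commits[cid] = (content, self.head)
        self.head = cid
        return cid


history = History()
first = history.commit("record, version 1")
second = history.commit("record, version 2")
print(history.head == second, history.commits[first][0])
```

Here `first` and `second` would qualify as persistent references because what they resolve to never changes, while `head` resolves to something different after each commit.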
I have this vague idea that we can interface persistent and mutable data spaces with persistent traits: a typed PID on a typed object requires that certain properties are immutable, while others that are not part of the persistent data type can be mutable. Citing a PID-referenced datum means citing the immutable properties only. E.g. a person has a stable SSN and DOB, but their name and address could change. The person’s PID would reference the SSN and DOB, but not the name and address. You can access those dynamically once you’ve got access to the properties of the persistent trait. Does that make sense? I do think it’s outside the scope of the WG, though. Have other WGs addressed this issue?
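A minimal Python sketch of the persistent-trait idea, under the assumption that a record simply splits its properties into a read-only part and a freely editable part. `TraitRecord`, its methods, and the sample values are hypothetical and do not correspond to any existing PID system.

```python
from types import MappingProxyType
from typing import Any, Dict, Mapping, Optional


class TraitRecord:
    def __init__(self, pid: str, persistent: Mapping[str, Any],
                 mutable: Optional[Mapping[str, Any]] = None) -> None:
        self.pid = pid
        # Read-only view: the properties the PID vouches for.
        self.persistent = MappingProxyType(dict(persistent))
        # Everything outside the persistent trait may change freely.
        self.mutable: Dict[str, Any] = dict(mutable or {})

    def update(self, **changes: Any) -> None:
        """Change mutable properties; refuse to touch persistent ones."""
        clash = sorted(set(changes) & set(self.persistent))
        if clash:
            raise ValueError(f"persistent properties cannot change: {clash}")
        self.mutable.update(changes)

    def cite(self) -> Dict[str, Any]:
        """A citation covers the PID and the immutable properties only."""
        return {"pid": self.pid, **self.persistent}


person = TraitRecord(
    "example/person-1",
    persistent={"dob": "1970-01-01"},  # placeholder value
    mutable={"name": "A. Example", "address": "Old Street 1"},
)
citation = person.cite()
person.update(name="B. Example", address="New Street 2")
print(person.cite() == citation, person.mutable["name"])
```

The citation stays the same across the name and address change, and the current mutable values are reached through the record once it has been resolved.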
Best,
Frederik