Notes from the RDA Data Foundation and Terminology (DFT) WG session 2 at Plenary 3

Co-chairs: Gary Berg-Cross, Raphael Ritz, Peter Wittenburg (15 people attended)

Edited Notes by Gary Berg-Cross (SOCoP)

Co-Chair Gary Berg-Cross kicked off the meeting with a signup sheet and the first 3 categories of a 10-part table of Core Terms, with some initial definitions and graphic term relations.

Discussion started with the topic of Digital Object, with the idea that we should “get some issues on the table and move on.” As context, it was also suggested that the definitions attempted here are designed to play an educational role.

The notes below reflect some of the issues and arguments.

In some cases we noted alternative labels for what seemed like the same thing. Bob liked the use of entity as a “uniquely identified thing.” Bob thought that object has too many connotations and that we should keep it out so as not to confuse it with concepts from object-oriented programming.

A note on the alternate names using object or entity: “digital object” gets about 2,520,000 hits on Google, while “digital entity” gets 35,900 results. So there is already a big difference in how things are named out there. We can build a definition that is precise enough for our needs and avoids confusion.

There should not be two different words for the same concept (i.e. there should be only Digital Object or Digital Entity, not both) because it is confusing. But vocabularies often have synonyms. What is our policy here?
 

One point of discussion/clarification: Are only digital data considered when we talk about data? Peter indicated that we are defining from our model perspective and, if so, we are essentially talking about digital data. Gary takes a broader perspective, which is reflected in Hans’ early discussion of Research Data Objects that may include traditional journals as data not in digital form or “raw” data. There is consensus that the emphasis is on digital things and that we are usually talking about or implying this type of representation.
If we take the broader view “Data and digital data are not inter-changeable.”

One question was should we use identifier instead of name in the definition of digital object? 

Whatever the definition, people will have to start using it. If a digital object needs a persistent identifier (and metadata), it should be noted that not everything has one yet. Raphael thought that 80% of people would not buy into this idea of requiring a PID. Bob argued for the identifier, while Reagan thought a name was useful and proposed the following distinctions:

- Data corresponds to sets of bits.
- Digital objects are sets of data that have been given a name. They need not have a persistent ID, but they may persist.
  - Bob thought that we should call this something other than name.
- A persistent digital object is a set of data that has been given a name (ID) by a persistent resolver. (These points are implied in his use case scenario.)
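
Reagan's three distinctions can be read as a small data model. The sketch below is one hypothetical rendering in Python; the class names, the `resolver` field and the example identifier strings are illustrative assumptions, not something the group agreed on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Data:
    """Data corresponds to a set of bits."""
    bits: bytes


@dataclass(frozen=True)
class DigitalObject:
    """A set of data that has been given a name (hypothetical model)."""
    data: Data
    name: str
    # Name of the persistent resolver that issued `name`, if any (assumed field).
    resolver: Optional[str] = None

    @property
    def is_persistent(self) -> bool:
        # Persistent only when the name was issued by a persistent resolver.
        return self.resolver is not None


raw = Data(b"\x00\x01\x02")
named = DigitalObject(raw, name="sensor-run-7")
persistent = DigitalObject(raw, name="hdl:example/123", resolver="example-handle-service")
```

Here `named` is a digital object without a persistent ID, while `persistent` wraps the same bits under a name issued by a (made-up) resolver.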

Natasha added that we need a unique ID within a context (such as discussed in Reagan’s use case). Also, one data object can be a bunch of other data objects as seen in Peter’s scenario.  Repositories collect data into such objects. As Natasha put it, “when you harvest records, you bring them into a collection and give them additional metadata.” This may include an additional ID.

Metadata
- One proposition was that: "Metadata is digital object that contains blahblahblah" (information about other data or something similar?). 
The most direct idea is that data becomes metadata in a context when it is associated with other data. This is simple and clear, but we need to make it rich enough to be useful. That is, metadata is defined by the role that it plays. Note also that metadata, as data, can itself have metadata. An example provided by Gary was that metadata is stored in repositories and as such has metadata about when it was stored, by whom, and for what purposes.

Note also that there is nothing circular about this. It is a matter of role perspective and no more confusing than a person being both a father and a son and a teacher at the same time.
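
The role view of metadata can be sketched in a few lines. This is only an illustration under assumed names (`Item`, `described_by`, `plays_metadata_role` are all hypothetical): nothing is metadata by nature, and an item counts as metadata only relative to the item it describes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Any piece of data; hypothetical name used for illustration."""
    label: str
    # Other items that describe this one. Each is itself an Item,
    # so metadata can carry its own metadata.
    described_by: List["Item"] = field(default_factory=list)


def plays_metadata_role(candidate: Item, target: Item) -> bool:
    """True when `candidate` is associated with `target` as its description."""
    return any(candidate is m for m in target.described_by)


dataset = Item("temperature readings")
record = Item("catalogue record for the readings")
deposit_info = Item("when and by whom the record was stored")

dataset.described_by.append(record)       # record is metadata for dataset
record.described_by.append(deposit_info)  # and has metadata of its own
```

The same `record` is metadata with respect to `dataset` and plain data with respect to `deposit_info`, much like the father/son/teacher analogy.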

Hans built on his Use Case for research data to note that relevant metadata points to people and expeditions.  The metadata in context IG is discussing the idea of a metadata profile and this idea should be added as they come up with a definition.

 

Note several types of Metadata were listed including:

Discovery, Access, Selection, Licensing, authorization, Quality, suitability and Provenance, reproducibility.

Data Object discussion:
Bob argued that we should get rid of this term, but one response is that we need a way to explain a file being sent from sensors. People start referring to it, as in the gappy data example. It is data, and then a person adds metadata, etc.

 

Should it be called only "Data" because of the use of "object" in other sciences?  This may be handled in definition discussion since the idea of object cuts out data as a thing that can be operated on.

There was a long discussion of the use of the term “type” in the definition.  Bob thought that this is an overused term and suggested that we could use the mathematical term of set.  Gary noted that type comes from the biological realm and describes hierarchies.

Peter Fox commented that we need to discuss something like type to distinguish how something is handled.  Different types are handled differently.  Both squirrels and tigers are mammals but get handled differently.

Note, a sub-type of data object is a Research Data Object, which Hans talked about and which may need an ID as policy.

Representation Object (from Reagan and OAIS)
Some thought that we don't need this term: it should be removed and we should use type instead. But as Peter noted, some people use it, so we should say what it means.

The distinction of internal vs. external property came up and this was defined in earlier versions of the vocabulary, and is discussed under Properties.  We should connect these.

An internal property of a Digital Object characterizes an aspect of its content by making statements about technical encoding, syntax and the semantics of its informational content.  This is a type of metadata.

It was noted that we are entering the metadata group's area as well as that of the Type Registry. Natasha suggested that each group has a different approach for things like metadata, so maybe we can make some small distinctions and not go too deep into differences.

Some thought that the definition may be recursive: you need a representation object to describe another representation object. Recursion has to end somewhere.
Another question: is this not the same as metadata? It can be a particular role that metadata plays; that is, some metadata is a PID and other metadata is information about representation. We can modify the definition to state that an RO is a kind of MD.

Proposition for a revised definition: "Representation object is a metadata object that provides some context for a data object."

Workflow object (from Reagan)
"Workflow object is an executable object." That is, executable object is another name for this. Our current definition is too specific and tied to Reagan’s iRODS infrastructure.

Reagan provided the following updated explanation and definition:

 

A workflow can be represented multiple ways.  In practice the workflow is composed by chaining together basic operations, and expressing the composition in a workflow or rule language.  Each existing workflow system implements a different workflow language.  However, all of the workflow languages are expressed in a form of a scripting language, making it possible to list all of the steps in the workflow in a text file.

 

It is important to differentiate between the workflow, and its representation.

 

Def: A workflow is a set of chained operations that collectively specify a procedure that can be executed.

 

Def: A workflow object is a representation of the workflow written in a rule or workflow language, and stored as a text file.
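
The difference between a workflow and a workflow object can be illustrated with a toy example. The sketch below assumes a made-up one-operation-per-line workflow language and two made-up basic operations; it is not iRODS syntax or any existing workflow system.

```python
from typing import Callable, Dict, List

# Basic operations available for chaining (hypothetical examples).
OPERATIONS: Dict[str, Callable[[float], float]] = {
    "double": lambda x: x * 2,
    "increment": lambda x: x + 1,
}


def parse_workflow_object(text: str) -> List[str]:
    """Read a workflow object: one operation name per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_workflow(steps: List[str]) -> Callable[[float], float]:
    """Chain the named operations into an executable procedure."""
    def run(value: float) -> float:
        for step in steps:
            value = OPERATIONS[step](value)
        return value
    return run


# The workflow object: a textual representation that could be stored in a file.
workflow_object = """
double
increment
"""

# The workflow: the executable chain of operations that the text stands for.
workflow = build_workflow(parse_workflow_object(workflow_object))
result = workflow(3)  # (3 * 2) + 1
```

The string `workflow_object` can be stored, named and given an identifier like any other digital object; `workflow` is what actually runs.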

Identifier Discussion

Reagan provided a rationale for persistent identifiers based on reproducible research.

 

He is trying to access a known data set, and wants a persistent identifier for the data set.

 

If someone is trying to execute a known procedure, they want a persistent identifier for the procedure.

 

We unfortunately have additional indeterminate requests that may not map to reproducible data.

PID

The current description is a bit ambiguous. "Unique" here means that there is, inside a particular reference system, one (or more) identifier that is used only for that particular object and for nothing else. Reagan notes that in his real world IDs aren’t always persistent and he can use 2 for the same “object”.

Some of the discussion of ID is cultural and not technical. What do we mean by an identifier?  Tim likes an identifier to mean that it resolves to 1 thing.  Then it is OK to have 2 IDs as long as they resolve to 1 thing.
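
The reading of "unique" discussed here (an identifier is used for one object only, but one object may have several identifiers) can be sketched as a toy resolver. The `Resolver` class and the identifier strings are hypothetical and do not represent any real PID system.

```python
from typing import Dict


class Resolver:
    """A toy reference system mapping identifiers to objects (hypothetical)."""

    def __init__(self) -> None:
        self._targets: Dict[str, str] = {}

    def register(self, identifier: str, target: str) -> None:
        # An identifier may be used for one object and for nothing else.
        existing = self._targets.get(identifier)
        if existing is not None and existing != target:
            raise ValueError(f"{identifier!r} already identifies {existing!r}")
        self._targets[identifier] = target

    def resolve(self, identifier: str) -> str:
        # Each identifier resolves to exactly one thing.
        return self._targets[identifier]


resolver = Resolver()
resolver.register("id-A", "dataset-1")
resolver.register("id-B", "dataset-1")  # a second ID for the same object is allowed
```

Both `id-A` and `id-B` resolve to the same object, while an attempt to reuse `id-A` for a different object is rejected.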

Reagan also associates identity with audit trails and with things other than PIDs.

PID record

Do we need this concept? Yes; as Natasha put it, “It contains the properties.” Some thought that we shouldn’t associate all PID properties with a record. Some thought we might call it a PID object, since record hasn’t been used before.

PID Attribute

This might be confused with metadata and so it might be simplified, but is perhaps understandable in context. A PID is a type of metadata and so its attributes are too.

Other Topics

 

Yann mentioned that it might be good to look into ontologies of data/digital data. One is the Information Artifact Ontology (IAO), which may contain some interesting definitions for the DFT (http://www.ontobee.org/browser/index.php?o=IAO).

There is also the Software Ontology (SWO, http://www.ontobee.org/browser/index.php?o=SWO), which is being developed and is related to EDAM, an ontology for bioinformatics operations, data and other related information (http://edamontology.org/page). It could be interesting to investigate these two ontologies a little as well. Gary will research this with Yann’s help.

 
