new paper including components

09 May 2015

Dear Data Fabric colleagues,
a group of people engaged in RDA got together over the last few weeks and wrote a paper that describes trends in data management, refers to data principles such as those established by the G8 ministers and, based on these, discusses consequences and components that are seen as important. The components part is largely based on the Use Case descriptions presented to the RDA Data Fabric IG and on the authors' expertise. We uploaded this document to the DFIG wiki to open discussions on it. Today we will also upload all the use cases we have received so far, and we will continue to encourage people to come up with additional Use Case descriptions.
This paper is NOT meant as a FINAL statement; it is intended to motivate broad discussions about what needs to be done next. We chose to present lists of points without setting priorities, knowing that there will be debates about the points mentioned and that there will be gaps. This document will be presented in various meetings with different stakeholders with the aim of gathering comments. Embedding it in the DFIG wiki will ensure that the discussion process is kept within RDA, which we consider important. In a few months it might be necessary to come up with a new snapshot document that summarizes the state of agreement and disagreement. Of course, credit must be given to all who take the time to comment and contribute.
You can find the document at this URL, where the discussion should also take place:
https://rd-alliance.org/groups/data-fabric-ig/wiki/data-fabric-ig-compon...
In case you want to cite the document, I uploaded it to a permanent store, where it received a Handle:
http://hdl.handle.net/11304/f638f422-f619-11e4-ac7e-860aa0063d1f
Apologies for this period of silence which was due to meetings and the preparation of a couple of documents.
Best
Peter

File Attachment: paris-doc-v6-1.docx (1.07 MB)
  • Keith Jeffery's picture

    Author: Keith Jeffery

    Date: 10 May, 2015

    Congratulations on a large amount of very good work. My comments are in the attached Word document, with track changes on. Keith

    ATTACHMENT: paris-doc-v6-1.docx (1.08 MB)

  • Larry Lannom's picture

    Author: Larry Lannom

    Date: 10 May, 2015

    Thanks Keith,
    One comment on your comment on the ‘hourglass’ figure. You point out
    > IP addresses need not be unique over time and may not be persistent
    perhaps suggesting that the analogy needs to be made more clearly.
    My understanding of the ‘narrow neck’ metaphor of IP addresses is that they allow many different kinds of network services to be made available across many different kinds of networks. The analogy for PIDs is that they allow many different kinds of data management services to be made available across many different kinds of data sources.
    Larry

  • Keith Jeffery's picture

    Author: Keith Jeffery

    Date: 11 May, 2015

    Larry -
    Thanks for taking the time and expanding on the analogy - of course this interpretation makes sense.
    Perhaps such wording could be included to avoid others picking up the discrepancy that I did? I believe it is important because the characteristics/properties of PIDs (i.e. the intrinsic properties associated with the character string) are different from those of IP addresses.
    Best
    Keith
    ------------------------------------------------------------------------------------------------------------------
    Keith G Jeffery Consultants
    Prof Keith G Jeffery
    E: ***@***.***
    T: +44 7768 446088
    S: keithgjeffery
    Past President ERCIM www.ercim.eu (***@***.***)
    Past President euroCRIS www.eurocris.org
    Past Vice President VLDB www.vldb.org
    Fellow (CITP, CEng) BCS www.bcs.org
    Co-chair RDA MIG https://rd-alliance.org/internal-groups/metadata-ig.html
    Co-chair RDA MSDWG https://rd-alliance.org/working-groups/metadata-standards-directory-work...
    Co-chair RDA DICIG https://rd-alliance.org/internal-groups/data-context-ig.html

  • Larry Lannom's picture

    Author: Larry Lannom

    Date: 11 May, 2015

    Keith,
    Sounds like a good idea. Thanks again.
    Best,
    Larry

  • Ralph Müller-Pfefferkorn's picture

    Author: Ralph Müller-Pf...

    Date: 02 Jun, 2015

    Hi there,

    thanks for the good and concise paper. I added some comments in the technical components section.

    Regards,
    Ralph

    ATTACHMENT: paris-doc-v6-1_RMP.docx (1.06 MB)

  • Andrew Maffei's picture

    Author: Andrew Maffei

    Date: 03 Jun, 2015

    I agree. Very good and concise paper. Thanks all. I posted my suggestions to the RDA website here —
    https://rd-alliance.org/comment/3556#comment-3556
    Thanks again,
    Andrew Maffei

  • Leonardo Candela's picture

    Author: Leonardo Candela

    Date: 12 Jun, 2015

    This “RFC” document aims at identifying a number of “components” that have to be put in place to support proper data practices. These components are actually described in Sec. 5, after a very long discussion of trends, principles and the consequences (!) of these principles.

    Overall comments:

    • I assume the paper is discussing “research data”; this should be clear from the beginning (e.g. in the title);
      • I would suggest using the term “dataset” rather than the generic term “data” to refer to a unit of information (e.g. http://dx.doi.org/10.1002/meet.14504701240)
      • Terms like “Digital Object” might be misleading since they go well beyond datasets, i.e. any “item” can be represented by a DO. The focus of the paper should remain on “data”.
    • It is not clear how the “trends” in Section 2 have been identified or how they relate to each other. In reality they are quite heterogeneous, ranging from aspects related to “novel characteristics” of data (e.g. 2.1) to “approaches” (2.8). I do not see anything about
      • comparing “big science” vs “long tail science”;
      • “data publishing”, intended as the release of data to be used by others;
      • “policies” promoting dataset availability, including policies promoted by publishers and funding agencies. Is this not a trend? Policies are an effective tool for obtaining data to manage;
      • “fitness for purpose”, a very important and pervasive aspect worth discussing;
    • Re Sec. 3 and Sec. 4
      • it is not straightforward to match the consequences with the principles; in fact there are 6 principles and 5 consequences;
      • as you might expect, Sec. 4 leads to a number of questions. The placement of certain bullets under a specific section is unclear. They should be better related to the objective of the section.
    • Re Sec. 5
      • this is a long list (19 services) of components “worth having”. Since it is expected to be the real contribution of this paper, I would suggest that you
        • (a) have a sort of “reference architecture” organising these services into a coherent whole (e.g. have “functional” areas);
        • (b) present the components by using such an organisation;
        • (c) describe the exploitation model(s);
      • Some titles are misleading, e.g. what kind of service/component is “Certification and Trusted Repositories”?
      • Solutions/services have potential limitations … a “one size fits all” solution is not always possible. It will be very important to describe the “technical components” by highlighting their potential pros and cons, e.g. are PIDs ok for every data management need/context (dynamic data, granularity issues)?
      • Some of the service descriptions are very generic; the characterisation of the proposed services, e.g. Metadata System, does not give the reader much information;
      • The large majority of components are Registries. Although registries have a key role, I’m wondering whether some things could be implemented by using “mediators” while others could be implemented by using “standards”;
      • It is not clear whether these components are expected to be integrated into the “ICT facilities” currently used by the communities or to be offered “as-a-Service” by a third-party entity;
        • Who are the target users of the selected/suggested components?
      • Examples of existing systems and services should be added per section;
    • Please consider adding a statement to clarify whether this document is “the RDA” position or the authors’ position

    Section-specific comments are reported below.

    Re Introduction:

    • please rephrase “four factors have been identified” to explain what these factors are about;
      • the factor on “stability” is questionable: are funders and initiatives not reconsidering their practices?
      • the factor on “common trends” is quite fuzzy, please add examples; actually, examples and references should be there for every factor;

    Re Section 2:

    • The text in 2.1 uses a number of the “V”s usually used to characterise Big Data (Veracity is missing) without acknowledging this “reuse”;
    • Is Fig. 1 actually needed? There is no real data behind it; I suggest dropping it;
    • Fig. 2 seems strange,
      • I do not see any “layer of enabling technology” as the caption suggests. Are you alluding to the fact that there is a relationship between the expected functionalities (as suggested by FAIR), i.e. Access requires Discovery plus something; Interpretation requires Access plus something?
      • Re “users”, the roles envisaged here are neither a partition nor a complete set; what about just highlighting that “users” can be either human or “machine”?
    • The text in 2.3 is quite challenging. Although it is true that all digital objects are made of bits, the management you can perform by relying on this characteristic is quite limited. The same holds for emails: you can use them across communities, yet this does not guarantee that the communication is effective. Moreover, I do not know how to determine whether something is an internal or external characteristic of a digital object.
      • I do not believe there is any single solution/service that applies to every context; no “one size fits all” approach will ever exist. Existing and forthcoming solutions will always be characterised by a set of features making them more or less appropriate for a given case; “fitness for purpose” is key.
    • Re the text in 2.4, the similarity between PID and IP is questionable; the two are different. To be precise you should have used “IP address”; an IP address identifies devices rather than units of information.
      • What about using the concepts underlying the “Web Architecture”? It seems a bit closer to the data domain;
    • In 2.5, rather than speaking in absolute terms I would suggest being more flexible, e.g. a repository might be trusted with respect to certain characteristics and untrusted wrt other characteristics. The same applies to data (dataset?): all the data a user manages to have are somehow registered. This does not imply that they are “appropriate” for any purpose.
    • In 2.7, I’m a bit confused by the last sentence declaring the approach rarely observed. Section 2 is expected to describe “common trends”. Moreover, the terminology should be revised, e.g. does “self-documenting” mean that there is a specification of a workflow described using a proper “language”?
    • In 2.8, it seems you are alluding to the “one size fits all” approach again. This is very ambitious;

    Re Sec. 3

    • It would be great if you could assign “names” to the principles;
    • “Make data manageable” seems a sort of overall concept, manageable includes almost everything (e.g. discoverable, accessible, understandable);
    • Do these principles bring any added value with respect to those reported in the other documents? Which one? If there is no real added value, it would be better to borrow/endorse (and cite) existing principles than to propose “proprietary ones”;

    Re Sec. 5

    • Which are the “other systems” alluded to in 5.2? Who is “we” in “we are using …”?
    • “Trusted repository” is a very community-specific concept, i.e. trust holds with respect to a given purpose;
    • What is the expectation from having a very generic “Metadata System” in this list?
    • Is the Schema in Schema Registry associated with the DO “format”? The description gives this impression;
      • What is the difference between 5.5 Schema Registry System and 5.7 Registry System for Data Types?
    • Are the “protocols” in 5.12 mandatory? They are certainly very widespread, yet I do not see how you expect to offer them as “components”;
    • On 5.15 “Big Data Analytics”, there is a rich array of different facilities falling under this umbrella; some of them are oriented to serving scientists and are very flexible, open, and easy to use;
    • To what extent is a “Repository API” a component on a par with a “Repository System”?
    • To what extent are “training modules” a component on a par with the others?

    Re Sec. 6

    • I’m not convinced about the roles you assign to Institutions; some of the services can be offered by National/Regional organisations. E.g. it is not necessary for every Institution to set up its own infrastructure when its researchers are requested to use “regional” facilities;
      • I would suggest to relax a bit this text and try to link it with the services / components described in 5;
      • It will be useful to discuss the different “exploitation models” that can be used;

    App. A

    • Are these definitions just presented here? Is there no suitable ontology to use to grab some of them? Are these roles somehow connected? The same comment applies to tasks;

    App. B

    • Please, clarify the goal of this Appendix. Is this to compare existing solutions to be used when implementing the planned components?

    Please add a References section rather than using footnotes. This will provide more readable information than the often “anonymous” links.

    Minor typos:
    - page 1 last paragraph “the the type”
    - page 6 "diagram 6” should be Figure 7;

  • Leonardo Candela's picture

    Author: Leonardo Candela

    Date: 12 Jun, 2015

     

    I'm wondering whether there is any plan to have a virtual space to collect "all" the comments and discussions about this document.

  • Donatella Castelli's picture

    Author: Donatella Castelli

    Date: 12 Jun, 2015

    - The title is very ambitious and refers to a topic that has been, and still is, widely studied. I am wondering whether the document indeed aims at addressing “Data Management” in its most general terms or whether its objective is more confined, for example to the management of data in scientific/research contexts and/or in an infrastructural framework (still wide, but closer to the RDA aims), etc.

    - The introduction section states:

        “RDA aims to be a neutral place where experts from different scientific fields come together to determine common ground in a domain which is fragmented and, by agreeing on "common data solutions", liberate resources to focus on scientific aspects.”

    The RDA Mission reported on the RDA website is:

        “The Research Data Alliance (RDA) builds the social and technical bridges that enable open sharing of data.”

    If my interpretation is correct, then this mission does not imply that RDA is looking for “common” solutions for data management. It is well known that, given the heterogeneity of the Data Universe, different solutions are unavoidable. The RDA mission refers instead to “bridges” (some of which may also be based on common resources, e.g. registries) whose implementation requires a more articulated approach than identifying “common solutions”.

    -  The document refers to data sharing and re-use as major objectives to be achieved. These aims cannot be achieved without also bringing into the picture the actors producing and consuming data. If you introduce these elements too, then aspects like usage policies, controlled access, access monitoring, credits, quality of service and collaborative management enter the picture, and these largely influence the vision, the list of required technical components and their characteristics. The last part includes one component (e.g. authentication system) that may be related to the “actors” aspect, but it is not clear why it has been introduced there.

    -  I have the impression that the “layers of enabling technologies” derive from a very high-level conceptualization of a data-centric research process. If this is the case, you should not forget to also include “Publishing”. This step, like the others, requires suitable technologies & components.

    -  The link between the content of the different main sections should be improved. Currently it is very difficult to understand how they are related and how the final list  of technical components is a logical consequence of trends and principles.

  • Franco Zoppi's picture

    Author: Franco Zoppi

    Date: 12 Jun, 2015

    Overall comments on how to improve the document message

    Used terminology

    The document seems to suffer from a problem with its terminology. Terms are sometimes unclear (in many cases definitions would help) or even wrong or misused. I guess that most of these problems could be avoided by correctly using well-established and consolidated Computer Science/ICT terminology.

    This is particularly evident in Sections 2.2, 2.3 and 2.6.

    Document perspective

    The document adopts a single perspective: the “user” perspective (from a Computer Science/ICT point of view).

    Themes are addressed more according to a “User perception of the problem and requirements specification” approach than a comprehensive and multi-faceted one that tries to identify the general scope, the different views of different stakeholders, different levels of abstraction, etc.

    This impression is reinforced when reading Appendix A (Roles), where just one of the roles seems to refer to CS/ICT figures - and even low-level (!) ones.

    To sum up, I feel that adding a sound “CS/ICT perspective” could be an added value for the whole document and could reinforce its message.

    Common Trends

    They are very heterogeneous, ranging from simple observations (e.g. Sect. 2.1), to a sort of “historical overview” (e.g. Sect.2.6), to “visions” (e.g. Sect. 2.3). Homogenizing the description and putting them on the appropriate abstraction/description level would clarify the message.

    I guess that an overview picture of the general scope of the RFC, highlighting the positioning of each of these trends, might help. It would be great to have a “fil rouge” starting from that, going through Sections 3 and 4 and leading to the components in Sect. 5. This could reinforce the rationale of the whole document.

    Principles

    This section should be improved. Principles are introduced just by covering them via some references and a list of quite common postulates, which are correct indeed, but do not adequately “match” with the rest of the document (partially apart from Sect. 4).

    Technical Components

    This section suffers largely because the components are not properly positioned in a general picture (call it “model” or “architecture” or whatever else).

    I guess this should be the core part of the document, hence it’s fundamental to have a clear perception of “what, why and how” you are proposing this solution.

    Here again, proper abstraction levels should be identified and links/relationships to current best practices, standards, technologies, etc. should be highlighted.
