FAIR Digital Object Fabric IG Activity Overview RE: [rda-datafabric-ig] A Data Fabric Position Paper to Broaden Discussion (Berg…

RE: [rda-datafabric-ig] A Data Fabric Position Paper to Broaden Discussion (Berg…

Creator

Discussion
April 10, 2015 at 2:27 pm #126367

Keith Jeffery
Participant

Hermann –
Good, we converge. Concerning your last comment I believe metadata with formal syntax and declared semantics (i.e. working with DTR, MDR principles) populating catalogs will enable the problem to be tractable; first by manual methods (search the catalog and compose your own workflow) then later autonomically (user neither knows nor cares how it is done, where the information and services are located as long as the quality requirements are met).
Best
Keith
Keith G Jeffery Consultants
Prof Keith G Jeffery
E: ***@***.***
T: +44 7768 446088
S: keithgjeffery
Past President ERCIM http://www.ercim.eu (***@***.***)
Past President euroCRIS http://www.eurocris.org
Past Vice President VLDB http://www.vldb.org
Fellow (CITP, CEng) BCS http://www.bcs.org
Co-chair RDA MIG https://rd-alliance.org/internal-groups/metadata-ig.html
Co-chair RDA MSDWG https://rd-alliance.org/working-groups/metadata-standards-directory-work…
Co-chair RDA DICIG https://rd-alliance.org/internal-groups/data-context-ig.html
———————————————————————————————————————————-
The contents of this email are sent in confidence for the use of the
intended recipient only. If you are not one of the intended
recipients do not take action on it or show it to anyone else, but
return this email to the sender and delete your copy of it.
———————————————————————————————————————————-
– Show quoted text -From: Herman Stehouwer [mailto:***@***.***]
Sent: 10 April 2015 15:20
To: Keith Jeffery; Gary; Data Fabric IG
Subject: Re: [rda-datafabric-ig] A Data Fabric Position Paper to Broaden Discussion (Berg…
Hi Keith,
Exactly, I do not think that the end-user (or most users) would know where to get the exact services/data they want unless they are following a citation of data or come via some catalogue system.*
I would even go as far as to say that the typical scientist doesn’t particularly care which repository is used as long as it is clear that it is trustworthy.
I am (still) assuming that the DF group is aiming towards a loosely coupled, flexible, system for lifecycles.
With interoperability being provided automatically for as far as normally possible.
(It is easy to see how it would be technically possible to do this seamlessly, but in practise, considering the large amount of (non-)standards out there, it isn’t very feasible. Perhaps DTR-type and MD-registry type efforts will reduce this gap in time).
Cheers,
Herman
*) I admit it is a bit odd saying “they don’t know unless they do”, but oh well.
On 10/04/15 16:13, Keith Jeffery wrote:
Herman –
It was not clear to me (and maybe also Gary and Reagan) that the repository registry would cover these aspects and especially a formal description of the repository to be used for interoperability, including services for autonomic access/retrieval within a dynamic (re-)composed workflow. Similarly, I (we?) felt that the current ‘lifecycle within silo’ comparison approach was in danger of leading to divergence rather than convergence without some overall abstract model to which – one would hope – implementations could converge to assure not only interoperability but also full VRE (Virtual Research Environment) capability with virtualisation and dynamic (re-)composition.
Put simply, the present DF white paper seems to assume the end-user knows to which repository(ies) to go and which services to use. The proposal we have generated does not make that assumption – but does assume that catalogs and middleware can make the connection from the user request to the .specific repository services and content.
Best
Keith
Keith G Jeffery Consultants
Prof Keith G Jeffery
E: ***@***.***
T: +44 7768 446088
S: keithgjeffery
Past President ERCIM http://www.ercim.eu (***@***.***)
Past President euroCRIS http://www.eurocris.org
Past Vice President VLDB http://www.vldb.org
Fellow (CITP, CEng) BCS http://www.bcs.org
Co-chair RDA MIG https://rd-alliance.org/internal-groups/metadata-ig.html
Co-chair RDA MSDWG https://rd-alliance.org/working-groups/metadata-standards-directory-work…
Co-chair RDA DICIG https://rd-alliance.org/internal-groups/data-context-ig.html
———————————————————————————————————————————-
The contents of this email are sent in confidence for the use of the
intended recipient only. If you are not one of the intended
recipients do not take action on it or show it to anyone else, but
return this email to the sender and delete your copy of it.
———————————————————————————————————————————-
From: herman.stehouwer=***@***.***-groups.org [mailto:***@***.***-groups.org] On Behalf Of HermanStehouwer
Sent: 10 April 2015 11:28
To: Gary; Data Fabric IG
Subject: Re: [rda-datafabric-ig] A Data Fabric Position Paper to Broaden Discussion (Berg…
Dear Gary, Keith, Reagan,
I fully agree, but I am a bit confused.
I was under the impression that the interoperability machinery (esp. how it is automatically usable within the loosely coupled infrastructure (cf. DTR)) was already part of our discussions.
I agree that the interoperability angle is not explicitly dealt with currently in our first offshoot (the repository registry effort), but it is at least in the back of my mind. We need the (accepted) mechanism to formally describe services offered by repositories. Some of those services will almost certainly deal with interoperability issues.
Cheers,
Herman
On 09/04/15 18:25, Gary wrote:
:The Data Fabric Interest Group (DFIG) has gotten off to a rapid start with a draft white paper, acquisition of use cases, and presentations at Plenaries. Alternate views of the Data Fabric scope and purpose are now starting to emerge. Three of us (Berg-Cross, Keith Jeffery, and Reagan Moore) thought that some alternate views might be outlined and offered for discussion. The intent of this paper is present some ideas on these views and to promote discussion of them within the community. So comments are welcome.
Data Fabric Background and Motivation: Until now DFIG has concentrated mainly on (a) the data workflow within an e-research environment applicable in any domain and (b) how this relates to existing Research Data Alliance groups. The objective is to understand, through use cases, how research data is acquired, managed, manipulated, analyzed, simulated, and displayed in each of the domains and let emerge commonalities of process and (best) practice that may be re-used. From the DFIG web page:
“The loose concept of the Data Fabric is a more nuanced view of data, data management and data relations along with the supporting set of software and hardware infrastructure components that are used to manage this data, along with associated information, and knowledge. This vision offers the possibility to formulate more coherent visions for integrating some RDA work and results.”
Some members of DFIG wish to explore an additional, particular aspect of fundamental importance to RDA: interoperability. Put simply, we wish to bring interoperability (within and cross-domain) aspects into the DF discussions as a ‘first class citizen’ alongside all the other aspects of the research data lifecycle in a domain. Modern research and data management environments are varied across disciplinary projects with multiple types of data and processing resources. As a result in contrast with earlier silos of research that led to discipline specific data management solutions for Interoperability between the existing silos as well as between new emerging technologies, a higher degree of integration is essential for the construction of a data fabric for interdisciplinary research.
Interoperability Vision for DF: Multiple projects are exploring construction of data management environments. The technologies they develop can be viewed as components of a data fabric, provided the systems are able to interoperate. The projects need to be expanded and generalized from specific domains into the larger research arena in part by engaging with RDA in order to further the advancement of data interoperability. Two use cases that illustrate these requirements are EUDAT and the DataNet Federation Consortium.
Two major extensions to the data fabric vision are emerging from these use cases:
1. Interoperability mechanisms required for sharing data, information, and knowledge. This can be quantified as the sharing of digital objects that comprise the basic elements of research (observational data, experimental data, simulations); the sharing of provenance, descriptive, and structural information (stored as metadata attributes about the digital objects, users, and data fabric environment); and the sharing of knowledge procedures (stored as workflows, web services, and applications). It is not sufficient to make digital objects discoverable and retrievable. Researchers also need the ability to manipulate the digital objects, parse the data formats, and share the procedures.
The ability to share procedures has multiple implications:
* The procedure that generates information/metadata about a digital object can be saved. The procedure can be re-executed to generate the associated information, enabling the verification of the information associated with a digital object. This is essential for establishing trust in the sources that provide the digital objects. If a researcher can re-execute a procedure and obtain the same result, reproducible data-driven research is possible.
* In a data fabric, the procedures should be executable at both the repository where the digital objects are stored, and the research environment local to the researcher. Procedures should be transportable, enabling either the movement of data to a remote service, or the encapsulation of the service for local execution.
1. Federation mechanisms needed to assemble collaborations that span institutions, data management environments, and continents. If interoperability mechanisms exist across versions of data management technology, a collaboration environment can be created that federates each of the participating environments into a unified system. There are multiple versions of federated systems:
* Shared name spaces for users, files, and services. A shared name space for users provides a single sign-on environment for use of the federated services. A shared name space for files enables formation of virtual collections that span administrative domains. A shared name space for services enables re-use of procedures across researcher resources.
* Shared services for manipulating digital objects. A simple approach is to move digital objects to the shared service through a broker, accessing the service through its access protocol. A more sophisticated approach is to encapsulate the service in a virtual machine environment, for movement to the local research resources for execution.
* Third-party access. It is possible to implement shared services by posting requests to a third party, such as a message queue, and eliminate direct communication between the federated system components. The posted message can request the application of a service, which is processed by the remote system independently of the original request.
Both of these aspects imply (and need) virtualization: that is the hiding of complexity from the end-user. Here the use of Virtual machines in a CLOUD or GRID environment is required to manage dynamic resource allocation, scalability, distributed parallelism, energy efficiency and other aspects. Most importantly, it is necessary to work at the appropriate level of abstraction in order to allow the computing environment/middleware to behave optimally. Too low or prescriptive a level constrains the environment, too high or abstract a level does not indicate clearly the requirement of the user. Recent work from an Expert Group in Europe has come up with Triple-I Computing as a concept (Information-Intention-Incentive model proposed by [Schubert and Jeffery, 2014][1]) and already research projects are addressing the challenges therein. One of these research directions (named BigMESH) is exploring mathematical signatures of all the components in such an environment within a framework of deploying optimally non-Turing / von Neumann architectures (i.e. distributed, parallel, non-sequential, communication not CPU intensive applications) over existing von-Neumann provision. For all these reasons virtualization, like interoperability, is an important topic for DF discussion.
Analysis and Conclusions: We have made a useful start but the DF vision needs to be expanded but also focused. We need to consider framing DF and its services broadly as data use and applications taking into account the available context or environment. Good data management practices and services are necessary and support this, but are not sufficient to handle interoperability. Improved metadata for data in context with enhanced semantics are one aspect to consider for enhanced interoperability.
Another aspect to consider is the emergence of a mathematical foundation for federation of data management systems (e.g. work by Hao Xu). In both the BigMESH and DataNet Federation Consortium projects mathematical signatures are being explored. A signature of a data management system defines the types of entities and the operations that are supported. The signature can be expressed as an algebra that defines operations on entities. Interoperability between data management systems can then be defined in terms of operations on the algebras. Entity types can be converted for interaction between the repositories, and equivalent operations can be mapped. What is interesting, is that assertions about operations within a data management system can also be mapped to assertions within other data management systems. The mathematical foundation defines the minimal properties needed for federation between systems. The DF IG can provide an integrated forum for defining what needs to change to enable Interoperability – a hard but tractable problem and something for DF to keep its eye on.
________________________________
[1] http://ec.europa.eu/digital-agenda/en/news/complete-computing-toward-inf…
Attached files:
DF-position-paper-final-20150409.docx
—
Full post: https://rd-alliance.org/group/data-fabric-ig/post/data-fabric-position-p…
Manage my subscriptions: https://rd-alliance.org/mailinglist
Stop emails for this post: https://rd-alliance.org/mailinglist/unsubscribe/48085
—
Dr. ir. Herman Stehouwer
Rechenzentrum Garching @ Max Planck for Plasmaphysics
RDA Secretariat
***@***.*** 0031-619258815
Skype: herman.stehouwer.mpi
—
Dr. ir. Herman Stehouwer
Rechenzentrum Garching @ Max Planck for Plasmaphysics
RDA Secretariat
***@***.*** 0031-619258815
Skype: herman.stehouwer.mpi
Hermann –
Good, we converge. Concerning your last comment I believe metadata with formal syntax and declared semantics (i.e. working with DTR, MDR principles) populating catalogs will enable the problem to be tractable; first by manual methods (search the catalog and compose your own workflow) then later autonomically (user neither knows nor cares how it is done, where the information and services are located as long as the quality requirements are met).
Best
Keith
Keith G Jeffery Consultants
Prof Keith G Jeffery
E: ***@***.***
T: +44 7768 446088
S: keithgjeffery
Past President ERCIM http://www.ercim.eu (***@***.***)
Past President euroCRIS http://www.eurocris.org
Past Vice President VLDB http://www.vldb.org
Fellow (CITP, CEng) BCS http://www.bcs.org
Co-chair RDA MIG https://rd-alliance.org/internal-groups/metadata-ig.html
Co-chair RDA MSDWG https://rd-alliance.org/working-groups/metadata-standards-directory-work…
Co-chair RDA DICIG https://rd-alliance.org/internal-groups/data-context-ig.html
———————————————————————————————————————————-
The contents of this email are sent in confidence for the use of the
intended recipient only. If you are not one of the intended
recipients do not take action on it or show it to anyone else, but
return this email to the sender and delete your copy of it.
———————————————————————————————————————————-
From: Herman Stehouwer [mailto:***@***.***]
Sent: 10 April 2015 15:20
To: Keith Jeffery; Gary; Data Fabric IG
Subject: Re: [rda-datafabric-ig] A Data Fabric Position Paper to Broaden Discussion (Berg…
Hi Keith,
Exactly, I do not think that the end-user (or most users) would know where to get the exact services/data they want unless they are following a citation of data or come via some catalogue system.*
I would even go as far as to say that the typical scientist doesn’t particularly care which repository is used as long as it is clear that it is trustworthy.
I am (still) assuming that the DF group is aiming towards a loosely coupled, flexible, system for lifecycles.
With interoperability being provided automatically for as far as normally possible.
(It is easy to see how it would be technically possible to do this seamlessly, but in practise, considering the large amount of (non-)standards out there, it isn’t very feasible. Perhaps DTR-type and MD-registry type efforts will reduce this gap in time).
Cheers,
Herman
*) I admit it is a bit odd saying “they don’t know unless they do”, but oh well.
On 10/04/15 16:13, Keith Jeffery wrote:
Herman –
It was not clear to me (and maybe also Gary and Reagan) that the repository registry would cover these aspects and especially a formal description of the repository to be used for interoperability, including services for autonomic access/retrieval within a dynamic (re-)composed workflow. Similarly, I (we?) felt that the current ‘lifecycle within silo’ comparison approach was in danger of leading to divergence rather than convergence without some overall abstract model to which – one would hope – implementations could converge to assure not only interoperability but also full VRE (Virtual Research Environment) capability with virtualisation and dynamic (re-)composition.
Put simply, the present DF white paper seems to assume the end-user knows to which repository(ies) to go and which services to use. The proposal we have generated does not make that assumption – but does assume that catalogs and middleware can make the connection from the user request to the .specific repository services and content.
Best
Keith
Keith G Jeffery Consultants
Prof Keith G Jeffery
E: ***@***.***
T: +44 7768 446088
S: keithgjeffery
Past President ERCIM http://www.ercim.eu (***@***.***)
Past President euroCRIS http://www.eurocris.org
Past Vice President VLDB http://www.vldb.org
Fellow (CITP, CEng) BCS http://www.bcs.org
Co-chair RDA MIG https://rd-alliance.org/internal-groups/metadata-ig.html
Co-chair RDA MSDWG https://rd-alliance.org/working-groups/metadata-standards-directory-work…
Co-chair RDA DICIG https://rd-alliance.org/internal-groups/data-context-ig.html
———————————————————————————————————————————-
The contents of this email are sent in confidence for the use of the
intended recipient only. If you are not one of the intended
recipients do not take action on it or show it to anyone else, but
return this email to the sender and delete your copy of it.
———————————————————————————————————————————-
– Show quoted text -From: herman.stehouwer=***@***.***-groups.org [mailto:***@***.***-groups.org] On Behalf Of HermanStehouwer
Sent: 10 April 2015 11:28
To: Gary; Data Fabric IG
Subject: Re: [rda-datafabric-ig] A Data Fabric Position Paper to Broaden Discussion (Berg…
Dear Gary, Keith, Reagan,
I fully agree, but I am a bit confused.
I was under the impression that the interoperability machinery (esp. how it is automatically usable within the loosely coupled infrastructure (cf. DTR)) was already part of our discussions.
I agree that the interoperability angle is not explicitly dealt with currently in our first offshoot (the repository registry effort), but it is at least in the back of my mind. We need the (accepted) mechanism to formally describe services offered by repositories. Some of those services will almost certainly deal with interoperability issues.
Cheers,
Herman
On 09/04/15 18:25, Gary wrote:
:The Data Fabric Interest Group (DFIG) has gotten off to a rapid start with a draft white paper, acquisition of use cases, and presentations at Plenaries. Alternate views of the Data Fabric scope and purpose are now starting to emerge. Three of us (Berg-Cross, Keith Jeffery, and Reagan Moore) thought that some alternate views might be outlined and offered for discussion. The intent of this paper is present some ideas on these views and to promote discussion of them within the community. So comments are welcome.
Data Fabric Background and Motivation: Until now DFIG has concentrated mainly on (a) the data workflow within an e-research environment applicable in any domain and (b) how this relates to existing Research Data Alliance groups. The objective is to understand, through use cases, how research data is acquired, managed, manipulated, analyzed, simulated, and displayed in each of the domains and let emerge commonalities of process and (best) practice that may be re-used. From the DFIG web page:
“The loose concept of the Data Fabric is a more nuanced view of data, data management and data relations along with the supporting set of software and hardware infrastructure components that are used to manage this data, along with associated information, and knowledge. This vision offers the possibility to formulate more coherent visions for integrating some RDA work and results.”
Some members of DFIG wish to explore an additional, particular aspect of fundamental importance to RDA: interoperability. Put simply, we wish to bring interoperability (within and cross-domain) aspects into the DF discussions as a ‘first class citizen’ alongside all the other aspects of the research data lifecycle in a domain. Modern research and data management environments are varied across disciplinary projects with multiple types of data and processing resources. As a result in contrast with earlier silos of research that led to discipline specific data management solutions for Interoperability between the existing silos as well as between new emerging technologies, a higher degree of integration is essential for the construction of a data fabric for interdisciplinary research.
Interoperability Vision for DF: Multiple projects are exploring construction of data management environments. The technologies they develop can be viewed as components of a data fabric, provided the systems are able to interoperate. The projects need to be expanded and generalized from specific domains into the larger research arena in part by engaging with RDA in order to further the advancement of data interoperability. Two use cases that illustrate these requirements are EUDAT and the DataNet Federation Consortium.
Two major extensions to the data fabric vision are emerging from these use cases:
1. Interoperability mechanisms required for sharing data, information, and knowledge. This can be quantified as the sharing of digital objects that comprise the basic elements of research (observational data, experimental data, simulations); the sharing of provenance, descriptive, and structural information (stored as metadata attributes about the digital objects, users, and data fabric environment); and the sharing of knowledge procedures (stored as workflows, web services, and applications). It is not sufficient to make digital objects discoverable and retrievable. Researchers also need the ability to manipulate the digital objects, parse the data formats, and share the procedures.
The ability to share procedures has multiple implications:
* The procedure that generates information/metadata about a digital object can be saved. The procedure can be re-executed to generate the associated information, enabling the verification of the information associated with a digital object. This is essential for establishing trust in the sources that provide the digital objects. If a researcher can re-execute a procedure and obtain the same result, reproducible data-driven research is possible.
* In a data fabric, the procedures should be executable at both the repository where the digital objects are stored, and the research environment local to the researcher. Procedures should be transportable, enabling either the movement of data to a remote service, or the encapsulation of the service for local execution.
1. Federation mechanisms needed to assemble collaborations that span institutions, data management environments, and continents. If interoperability mechanisms exist across versions of data management technology, a collaboration environment can be created that federates each of the participating environments into a unified system. There are multiple versions of federated systems:
* Shared name spaces for users, files, and services. A shared name space for users provides a single sign-on environment for use of the federated services. A shared name space for files enables formation of virtual collections that span administrative domains. A shared name space for services enables re-use of procedures across researcher resources.
* Shared services for manipulating digital objects. A simple approach is to move digital objects to the shared service through a broker, accessing the service through its access protocol. A more sophisticated approach is to encapsulate the service in a virtual machine environment, for movement to the local research resources for execution.
* Third-party access. It is possible to implement shared services by posting requests to a third party, such as a message queue, and eliminate direct communication between the federated system components. The posted message can request the application of a service, which is processed by the remote system independently of the original request.
Both of these aspects imply (and need) virtualization: that is the hiding of complexity from the end-user. Here the use of Virtual machines in a CLOUD or GRID environment is required to manage dynamic resource allocation, scalability, distributed parallelism, energy efficiency and other aspects. Most importantly, it is necessary to work at the appropriate level of abstraction in order to allow the computing environment/middleware to behave optimally. Too low or prescriptive a level constrains the environment, too high or abstract a level does not indicate clearly the requirement of the user. Recent work from an Expert Group in Europe has come up with Triple-I Computing as a concept (Information-Intention-Incentive model proposed by [Schubert and Jeffery, 2014][1]) and already research projects are addressing the challenges therein. One of these research directions (named BigMESH) is exploring mathematical signatures of all the components in such an environment within a framework of deploying optimally non-Turing / von Neumann architectures (i.e. distributed, parallel, non-sequential, communication not CPU intensive applications) over existing von-Neumann provision. For all these reasons virtualization, like interoperability, is an important topic for DF discussion.
Analysis and Conclusions: We have made a useful start but the DF vision needs to be expanded but also focused. We need to consider framing DF and its services broadly as data use and applications taking into account the available context or environment. Good data management practices and services are necessary and support this, but are not sufficient to handle interoperability. Improved metadata for data in context with enhanced semantics are one aspect to consider for enhanced interoperability.
Another aspect to consider is the emergence of a mathematical foundation for federation of data management systems (e.g. work by Hao Xu). In both the BigMESH and DataNet Federation Consortium projects mathematical signatures are being explored. A signature of a data management system defines the types of entities and the operations that are supported. The signature can be expressed as an algebra that defines operations on entities. Interoperability between data management systems can then be defined in terms of operations on the algebras. Entity types can be converted for interaction between the repositories, and equivalent operations can be mapped. What is interesting, is that assertions about operations within a data management system can also be mapped to assertions within other data management systems. The mathematical foundation defines the minimal properties needed for federation between systems. The DF IG can provide an integrated forum for defining what needs to change to enable Interoperability – a hard but tractable problem and something for DF to keep its eye on.
________________________________
[1] http://ec.europa.eu/digital-agenda/en/news/complete-computing-toward-inf…
Attached files:
DF-position-paper-final-20150409.docx
—
Full post: https://rd-alliance.org/group/data-fabric-ig/post/data-fabric-position-p…
Manage my subscriptions: https://rd-alliance.org/mailinglist
Stop emails for this post: https://rd-alliance.org/mailinglist/unsubscribe/48085
—
Dr. ir. Herman Stehouwer
Rechenzentrum Garching @ Max Planck for Plasmaphysics
RDA Secretariat
***@***.*** 0031-619258815
Skype: herman.stehouwer.mpi
—
Dr. ir. Herman Stehouwer
Rechenzentrum Garching @ Max Planck for Plasmaphysics
RDA Secretariat
***@***.*** 0031-619258815
Skype: herman.stehouwer.mpi
Creator

Discussion

FAIR Digital Object Fabric IG

Group Organizers

RE: [rda-datafabric-ig] A Data Fabric Position Paper to Broaden Discussion (Berg…