TAB Review: https://rd-alliance.org/group/wiki/data-fabric-ig-tab-review-final.html

Data Fabric (DF) IG

Several RDA groups are working on core components that form the basis of a data infrastructure for reproducible science. These include the PID IG, PIT WG, PP WG, MD WG and IG, Provenance IG, Brokering IG, DFT WG and others. Some of these groups are Working Groups and will finish in 2014. Yet we lack an overarching concept and discussion framework that lets most of these groups relate their components to the overall data lifecycle landscape and identify gaps, obstacles and possible incompatibilities. The loose concept of the Data Fabric is a more nuanced view of data, data management and data relations, together with the set of software and hardware infrastructure components used to manage this data and its associated information and knowledge. This view makes it possible to formulate more coherent visions for integrating RDA work and results. The Data Fabric discussions will also contribute to discussions in TAB, where aspects of the Data Fabric can be utilized and integrated into the broader context of RDA.

Background

Excellent, leading research in many areas, such as human brain analysis, health care and climate change, is based on smart computations on large aggregated data collections. These collections are highly complex and heterogeneous in a number of dimensions, such as spatial and temporal resolution. The multidisciplinary data ranges from linear time series to array data, and its knowledge and information context is expressed as relations between descriptive metadata, which may be formalized as ontologies. Barrier-breaking results that give new insights into how to deal with the big societal challenges rely on more efficient and cost-effective ways to use and combine all relevant data and software services that have been created. It is becoming increasingly obvious that, to cope with these volumes and this complexity while guaranteeing reproducible and traceable science, we need to turn to automatic workflows that adhere to defined criteria. These workflows are guided by what we call “practical policies”, i.e. self-documenting procedural statements at an abstract level that are specified by data managers, data scientists and other persons involved.

 

Figure: Possible actions within a Data Fabric. Primary data stored in repositories is continuously enriched and analyzed, creating new derived data. Some of this data (e.g. from testing) will be temporary, but most of it will be part of some workflow and thus should be referable. In some cases, when the results have been used and quality checked for a publication, they may become citable. Publications are also part of this Data Fabric, since they are often used for data mining and other analysis.

Characteristics of a Data Fabric

We characterize the Data Fabric concept by the following observations:

·       The Data Fabric covers a domain of registered software components (workflows, services) that are in fact a special class of digital objects (DOs).

·       Data in the fabric includes a means of registering DOs using an authorized registration site.

·       DOs are stored and managed in persistent and accessible repositories.

·       DOs have metadata describing their creation, context and history (provenance).

·       DOs are registered with a PID at authorized registration sites.

·       Actions on DOs may be guided by abstract policies that are explicit and thus auditable.
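
The characteristics above can be illustrated with a minimal sketch. It is not an RDA specification: the class names, the PID format, the "hypothetical-prefix" value and the policy name are all assumptions made for illustration. It shows a DO that carries a PID, a repository location, metadata and provenance, and an explicit policy whose application is recorded so that it can be audited.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A digital object (DO): data or a software component such as a workflow."""
    pid: str                    # persistent identifier issued at registration
    location: str               # where the repository stores the object
    metadata: dict              # creation and context description
    provenance: list = field(default_factory=list)  # history of actions


class Registry:
    """Hypothetical stand-in for an authorized registration site."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._objects = {}

    def register(self, location, metadata):
        # The PID format here is invented; real PID systems define their own.
        pid = f"{self.prefix}/{len(self._objects) + 1}"
        obj = DigitalObject(pid, location, dict(metadata))
        self._objects[pid] = obj
        return obj

    def resolve(self, pid):
        return self._objects[pid]


def apply_policy(obj, policy_name, action):
    """Run an action on a DO and record it in the provenance for auditing."""
    result = action(obj)
    obj.provenance.append({"policy": policy_name, "result": result})
    return result
```

For example, registering an object with `Registry("hypothetical-prefix").register("repo://example/file", {"creator": "A. Researcher"})` and then calling `apply_policy(obj, "require-creator", lambda o: "creator" in o.metadata)` leaves one provenance entry that names the policy and its outcome.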

There can be multiple data fabric implementations, and to advance work these should be highly interoperable. These data fabrics are a critical component of infrastructures paving the way to reproducible science. In a data fabric we see how components that were developed separately can be made to work together; this means that for different sets of components the data fabric will be different. We note, strongly, that it is meant as a descriptive way to deal with the interrelation between many components, rather than a prescriptive one (as an architecture would be). We note also that a data fabric is in a constant state of evolution and adaptation. The technologies in use for any component today will be replaced with cheaper and higher-performing technologies in the future.

The suggested Data Fabric IG is therefore planned as a forum to discuss:

·       Alternate views, components and aspects of the DF concept.

·       How the outputs from the RDA working groups fit in the DF concept and how they relate to each other and to various related WGs and IGs within the RDA.

·       Which further activities are required to advance the data fabric concept.

·       Continuation and initialization of working group activities related to the DF.

·       Improving the uptake of various WG outputs by connecting and relating them as a coherent whole within the DF concept.

 

The goal of the discussions is to:

·       Write a concept paper that defines a vision of data fabrics, including within RDA IGs and WGs, to drive the discussions and work program.

·       Identify the need for new WGs that remove concrete barriers on the way towards a functioning DF.

·       Define and initiate these WGs.

·       Promote DF WG results & demonstrations as they are produced, integrate the work of the WGs into an overall DF concept and maintain them in order to evaluate their effectiveness.

·       Promote the adoption of DF WG outputs within other RDA efforts.

We believe that this will provide a useful and productive addition to the discussion in RDA.

Intentions

·       The first action of this IG will be to agree on a description of what is meant by the concept of DF. We will propose to organize a BoF session at P4 in Amsterdam. Various people/groups are asked to work out short concept notes beforehand and exchange them. The group of initiators will work on these notes to structure the presentation part and the discussion.

·       At the P4 meeting the chairs of this IG will be identified. Currently Peter, Gary, Reagan and Keith offer to push the discussion and chair the interactions.

·       The second action is to formulate a White Paper based on the notes and the discussion results and offer it for open discussions in the RDA forums. This White Paper should be ready in December 2014.

·       The IG will give the current collaboration among WG chairs a formal RDA basis as an IG that is treated the same way as all other IGs. The DF topics described above will be addressed at the regular IG meetings.

Initiating Members

The idea for this IG emerged from the discussions amongst the chairs of the WGs who are the initiating members. Some more experts who showed interest joined this group.

Rebecca Koskela, Keith Jefferey, Jane Greenberg, Reagan Moore, Rainer Stotzka, Tim Delauro, Tobias Weigel, Raphael Ritz, Gary Berg-Cross, Peter Wittenburg, Daan Broeder, Larry Lannom, Juan Bicarregui, Herman Stehouwer

 

 

 


Life sciences are becoming increasingly data intensive, especially as a result of the huge improvements in large-scale gene sequencing and other “omics” techniques. There is a need for large-scale sustainable data storage methods allowing secure and easy access to these highly complex data. Simultaneously, since life science research projects increasingly depend on more than one type of measurement, there is a widely felt need for integrating data of different types. Very similar issues and similar data sets come up in the different sectors of life sciences, such as health, agriculture, bioindustry and marine life, which also calls for interoperability between these sectors. ELIXIR (http://www.elixir-europe.org) has been established to build a sustainable pan-European research infrastructure for biological information, providing support to life science research including medicine, agriculture, bioindustry and society.

 

The ELIXIR Bridging Force Interest Group is formed to serve as a bridge between ELIXIR and relevant RDA Interest Groups, e.g. those on agricultural data, big data analysis, federated identity management, marine data, structural biology, toxicogenomics, and data publishing. Furthermore, ELIXIR will interact with relevant infrastructures outside Europe, e.g. NCBI.

 

A number of Working Groups dedicated to defined tasks will be formed over the next years. Examples include:

 

•          Interoperability of different kinds of large-scale (“omics”) data

•          Sustainability of life science data identifiers

•          Mechanisms for secure access to sensitive data, including authorisation and authentication issues

•          Strategies for data storage allowing for computationally intensive analyses

 

These tasks will be performed partly in collaboration with other RDA Interest Groups and with the many emerging data-related initiatives in the life sciences field.

Contact people (chairs):
Bengt Persson, Sweden
Carole Goble, UK
Rob Hooft, the Netherlands

Initial group members:
Tommi Nyrönen, Finland (liaison with Federated Identity IG)
Susanna Sansone (liaison with Metabolomics IG)
Mikael Borg, Sweden

Review period start:
Custom text:
Body:

WG Charter
The case statement outlines our work and sets the focus and boundaries of our research.
We need to integrate all stakeholders and reflect their views accordingly. So far we have identified four stakeholder groups that will actually use our contributions:

  • Data providers – data will be reused
  • Solution providers – machine readable data citations
  • Researchers – receive citable results
  • Community – gains trust and transparency

The beneficiaries will be able to reuse data, reproduce experiments, provide machine readable and machine actionable data citations for complex data sets and trace their data and its usage.
Being able to reliably and efficiently cite entire or subsets of data in large and dynamically growing or changing datasets constitutes a significant challenge for a range of research domains. Several approaches for assigning PIDs to support data citation at different levels in the process have been proposed. These may range from individual PIDs being assigned to individual data elements to PIDs assigned to queries executed on time-stamped and versioned databases.
Based on the discussions at the First Plenary Meeting in Gothenburg, the formation of a Working Group on Data Citation (WG-DC) was initiated. The WG-DC aims to bring together a group of experts to discuss the issues, requirements, advantages and shortcomings of existing approaches for efficiently citing subsets of data. The WG-DC focuses on a narrow field where we can contribute significantly and provide prototypes and reference implementations. Several data citation initiatives already exist, each with its own advantages and special purposes. An overview of these standards and their best practices was published by the CODATA Task Group on Digital Data Curation [1]. Strong cooperation with existing initiatives and the related standards is required: CODATA, OpenAire, DataCite, W3C and the Open Annotation Coalition.

Our concept includes machine actionable data citation that is efficient and can be applied transparently. We will be looking at different types of data and database management systems, including:
- SQL-style databases
- XML databases / semi-structured databases
- Graph-based databases
- NetCDF files
- HDF5 files
- …
The goal is to assure that subsections of data can be uniquely identified in the face of data being added, deleted or otherwise modified in a database, across longer periods of time, even when data is being migrated from one DBMS to another. We want to discuss and evaluate different existing approaches to this challenge, evaluate their advantages and shortcomings and identify obstacles to their deployment in different settings, as well as concrete recommendations for the deployment of prototypes within existing data centers. Amongst others these should subsequently form a solid basis for citing data, linking to it from publications in an actionable manner.

Dynamic data citation tackles challenges of versioning and the proper definition of subsets of data in different domains. Potential issues concern the relations between data sets, which need to be captured as well. Other challenges are scalability, costs and benefits (trade off) of ownership and operations that are potentially not reversible. This WG concentrates on the technical aspects of data citation solutions, focusing on proof of concept and prototype implementations. It will collaborate with other RDA working groups on PIDs and other topics under the umbrella of the Interest Group on Data Publication.

The principle currently proposed includes the following aspects:
- Ensuring that data items added to a data collection are added in a manner that is time-stamped
- Ensuring that the data collection is versioned, i.e. changes/deletions to the data are marked as changed with validity timestamps
- PIDs are assigned to the query/expression identifying a certain subset of the data that one wishes to cite, with the query being time-stamped as well
- Hash keys are computed for the selection result to allow subsequent verification of identity
- Issues such as unique sorting of results need to be considered when the operation returns data as sets and subsequent processes depend on the sequence in which the data is provided
These should be working across all settings where we have a combination of data sources and operations identifying subsets at specific points in time.

We propose a three-stage plan consisting of solutions (short-term), plans (mid-term) and the future perspective (long-term).




UPDATE of 28 October can be found here: https://www.rd-alliance.org/wg-proposal-resubmission.html

Working Group Charter
The goal of the WG is to set up the framework to run a series of Summer Schools in Data Science and Cloud Computing in the Developing World. The series will run at a variety of locations but will have a base in the developing world and the UK. It would be endorsed by both the RDA and CODATA and would make use of the wide variety of expertise within these organisations. During its 18-month period the WG will:
•    Arrange funding for an initial period for the school to run (five years).
•    Organise partnerships with Developing World institutions.
•    Determine the best curriculum for the school in collaboration with others.
•    Arrange how the materials can also be delivered online.
 

Value Proposition
Research in the Developing World is hampered by a variety of infrastructural issues. Access to research data in the public domain, and to the computational resources to analyse them, would give researchers in the Developing World the chance to do world-class research that is relevant to their society. Distributed computing techniques, for example cloud computing, present the possibility to sidestep the infrastructural difficulties. In order for this to happen, researchers need to understand the techniques of Data Science and the use of cloud computing. By training individuals who will ultimately pass on their training, the level of understanding of this field can be built up over a comparatively short period of time and can have an impact on research. Creating a cadre of individuals who can analyse, maintain and curate data sets will build an important skill base, which will also positively impact the local society as enterprises based on data analysis expand.

Engagement with existing work in the area:
There exist a number of initiatives that provide training in these fields; for example, there are a number of Summer Schools in Europe on cloud computing (see attached document of links). Microsoft Research provides training sessions on its Azure cloud. There are now two MOOCs in Data Science, and the number of masters programmes in this area is expanding rapidly. Analysis platforms such as R have an ever-expanding set of teaching resources. The Software Carpentry movement, which has many of the correct pedagogical approaches for this school, has events across the developed world (and has had sessions in South Africa). AidData has provided training in the analysis of social data in the developing world. The Wellcome Trust has provided funding for a successful series of schools in Bioinformatics in the Developing World. The Accelerated Data Programme of the International Household Panel Survey, and the IHPS in general, provide tools and training to many developing country institutions to increase the availability and quality of survey data.
 

Within the RDA there is a proposal for an IG in training in Data Science.  CODATA has longstanding activities to build capacity in countries with developing and emerging economies.  These include:
•    The PASTD Task Group (Preservation of and Access to Scientific and Technical Data in/for/with Developing Countries) has run a series of workshops in developing countries.  The next will be in Nairobi, Kenya in August 2014.
•    CODATA China, with support from the Chinese Academy of Science, is running a training workshop in Big Data for Science for Researchers from Countries with Developing and Emerging Economies.
We have already made contact with several of the organisations above. CODATA is very happy to participate and to be a co-sponsor of the group. AidData has expressed an interest in the proposal. A new IG is being formed on the teaching and training of Data Science, and it can provide guidance on what the curriculum should be. The Wellcome Trust has expressed a willingness to provide guidance based on its experience.

_________________________________

 


The aim is to address the governance of the brokering framework middleware and to interconnect existing international e-infrastructures. The Working Group will address the following:
1. Brokering configuration and strategies;
2. Brokering governance and agreements;
3. Publications and transparency;
4. Community adoption and sustainability.
Value Proposition
Middleware significantly simplifies distributed system construction and provides a much more efficient means of integrating legacy systems with new technology. Brokering middleware provides mediation and transformation services to simplify data discovery, evaluation, access, and use. Brokering was conceived to work as a third-party tier in a three-tier architecture (extending the Client-Server paradigm). This introduces a clear need to govern and manage the new brokering tier. This is particularly important when brokering middleware is used to interconnect existing large e-infrastructures in a way that is sustainable. Given the importance of brokering middleware in the integration of disparate information and data systems, continued access to and availability of these middleware components is vital to supporting long-term development and continued use of integrated data systems. Governance also matters because brokering should be transparent to most users, so its impacts lack high-level visibility.
Effective middleware governance has the potential to support longer-term development under a variety of funding models, to simplify and standardize access models, and establish a basis for the continued value of brokered systems. It is not, however, clear what the best practices for this governance are, and how those practices shift in response to different funding and property models, under different architectures or as standards change. To ensure sustainable, stable development and effectiveness in an operational environment of brokering systems reliant on middleware service architectures, an effective model for the governance and reuse of that middleware must be agreed upon.
We propose to consider and recommend a set of best practices for governing and managing brokering middleware. These practices will work to ensure future interoperability of, access to, and use of brokering middleware, independent of or in light of various development and funding models, to support long-term planning of brokered, integrated systems. These will be of value not only to interoperability architects and to developers (who can plan integrated systems assuming the continued use and support of brokering middleware) but also to system managers and end users. The potential for scaling and expansion of integrated data resources and systems in brokering middleware is of value to increasingly interdisciplinary research work as well as in managing growing big data sets.
The expected outcomes of the Brokering Governance WG will be:

  • A Position Paper including guidelines, best practices and recommendations for the management and support of an international brokering capability to interconnect existing and future multidisciplinary data systems and infrastructures. This will include a consensus recommendation of a path for adoption of this capability at the international level.
  • A set of use cases to assess options and best practices for governance recommendations in three diverse areas:

o Global Changes: GEO-BON;
o Environmental sciences: European Commission Danube SDI;
o International repositories: ICSU WDS.
Engagement with existing work in the area
The working group will engage with significant international programmes, including:

  • NSF BCUBE project;
  • GEO GEOSS and in particular its Infrastructure Implementation Board (IIB);
  • European Commission GEOWOW project;
  • Transatlantic ODIP project;
  • ICSU WDS.

These projects are part of the brokering technology development, which has been going on for the last decade. They have addressed the areas of: discovery, access, semantic harmonization, and use for facilitating multidisciplinary and organizational science research (see the Useful References section).
An operational capability for discovery and access brokering, the GEO DAB (Discovery and Access Brokering framework), has been implemented by GEO (Group on Earth Observations) in the GEOSS (Global Earth Observation System of Systems). Presently, GEO DAB interconnects more than 20 diverse research infrastructures sharing more than 65 million resources. This experience showed the technical scalability of brokering in terms of resources and service performance.
Operational capability in GEOSS relies on a governance model unique to GEO, based on country contributions. While this works, the growth and sustainability of an international broker service require a broader and open governance modality that offers a strategic and practical path to sustainability and operations.

_______________________________


 

TAB Review: https://rd-alliance.org/groups/rda-technical-advisory-board-tab/wiki/bro...
