Life sciences are becoming increasingly data-intensive, driven in particular by major advances in large-scale gene sequencing and other “omics” techniques. There is a need for sustainable, large-scale data storage methods allowing secure and easy access to these highly complex data. Simultaneously, since life science research projects increasingly depend on more than one type of measurement, there is a widely felt need for integrating data of different types. Very similar issues and similar data sets arise in the different sectors of the life sciences, such as health, agriculture, bioindustry and marine life, also calling for interoperability between these sectors. ELIXIR has been established to build a sustainable pan-European research infrastructure for biological information, providing support to life science research including medicine, agriculture, bioindustry and society.


The ELIXIR Bridging Force Interest Group has been formed to serve as a bridge between ELIXIR and relevant RDA Interest Groups, e.g. those on agricultural data, big data analysis, federated identity management, marine data, structural biology, toxicogenomics, and data publishing. Furthermore, ELIXIR will interact with relevant infrastructures outside Europe, e.g. NCBI.


A number of Working Groups dedicated to defined tasks will be formed over the coming years. Examples include:


• Interoperability of different kinds of large-scale (“omics”) data

• Sustainability of life science data identifiers

• Mechanisms for secure access to sensitive data, including authorisation and authentication issues

• Strategies for data storage allowing for computationally intensive analyses


These tasks will be performed partly in collaboration with other RDA Interest Groups and with the many emerging data-related initiatives in the life sciences field.

Contact people (chairs):
Bengt Persson, Sweden
Carole Goble, UK
Rob Hooft, the Netherlands

Initial group members:
Tommi Nyrönen, Finland (liaison with Federated Identity IG)
Susanna Sansone (liaison with Metabolomics IG)
Mikael Borg, Sweden


WG Charter
The case statement outlines our work and defines the focus and boundaries of our research.
We need to integrate all stakeholders and reflect their views accordingly. So far we have identified four stakeholder groups that will use our contributions:

  • Data providers – their data will be reused
  • Solution providers – machine-readable data citations
  • Researchers – receive citable results
  • Community – gains trust and transparency

The beneficiaries will be able to reuse data, reproduce experiments, provide machine-readable and machine-actionable data citations for complex data sets, and trace their data and its usage.
Being able to reliably and efficiently cite entire datasets, or subsets of large and dynamically growing or changing datasets, constitutes a significant challenge for a range of research domains. Several approaches for assigning PIDs to support data citation at different levels in the process have been proposed. These range from individual PIDs assigned to individual data elements to PIDs assigned to queries executed on time-stamped and versioned databases.
Based on the discussions at the First Plenary Meeting in Gothenburg, the formation of a Working Group on Data Citation (WG-DC) was initiated. The RDA Working Group on Data Citation (WG-DC) aims to bring together a group of experts to discuss the issues, requirements, advantages and shortcomings of existing approaches for efficiently citing subsets of data. The WG-DC focuses on a narrow field where we can contribute significantly and provide prototypes and reference implementations. Several data citation initiatives already exist, each with its own advantages and special purposes. An overview of these standards and their best practices was published by the CODATA Task Group on Digital Data Curation [1]. Strong cooperation with existing initiatives is required: CODATA, OpenAIRE, DataCite, W3C, the Open Annotation Coalition and the related standards.

Our concept includes machine actionable data citation that is efficient and can be applied transparently. We will be looking at different types of data and database management systems, including:
- SQL-style databases
- XML databases / semi-structured databases
- Graph-based databases
- NetCDF files
- HDF5 files
- …
The goal is to ensure that subsets of data can be uniquely identified even as data is added, deleted or otherwise modified in a database over longer periods of time, and even when data is migrated from one DBMS to another. We want to discuss and evaluate existing approaches to this challenge, assess their advantages and shortcomings, identify obstacles to their deployment in different settings, and produce concrete recommendations for the deployment of prototypes within existing data centers. Among other things, these should subsequently form a solid basis for citing data and linking to it from publications in an actionable manner.

Dynamic data citation tackles the challenges of versioning and the proper definition of subsets of data in different domains. Potential issues concern the relations between data sets, which need to be captured as well. Other challenges are scalability, the costs and benefits (trade-offs) of ownership, and operations that are potentially not reversible. This WG concentrates on the technical aspects of data citation solutions, focusing on proof-of-concept and prototype implementations. It will collaborate with other RDA Working Groups on PIDs and other topics under the umbrella of the Interest Group on Data Publication.

The principle currently proposed includes the following aspects:
- Data items added to a data collection are time-stamped
- The data collection is versioned, i.e. changes and deletions are marked with validity timestamps rather than overwriting data
- PIDs are assigned to the query/expression identifying the subset of the data that one wishes to cite, with the query being time-stamped as well
- Hash keys are computed for the selection result to allow subsequent verification of identity
- Issues such as unique sorting of results need to be considered when the operation returns data as sets and subsequent processes depend on the order in which the data is provided
These principles should work across all settings where we have a combination of data sources and operations identifying subsets at specific points in time.

We propose a three-stage plan consisting of solutions (short-term), plans (mid-term) and the future perspective (long-term).




Working Group Charter
The goal of the WG is to set up the framework to run a series of Summer Schools in Data Science and Cloud Computing in the Developing World. The schools will run at a variety of locations but will have a base in the developing world and the UK. The series would be endorsed by both the RDA and CODATA and would make use of the wide variety of expertise within these organisations. During its 18-month lifetime the WG will:
•    Arrange funding for an initial five-year period of the school.
•    Organise partnerships with Developing World institutions.
•    Determine the best curriculum for the school in collaboration with others.
•    Arrange how the materials can also be delivered online.

Value Proposition
Research in the Developing World is hampered by a variety of infrastructural issues. Access to research data in the public domain, and to the computational resources to analyse them, would give researchers in the Developing World the chance to do world-class research that is relevant to their society. Distributed computing techniques, for example cloud computing, offer the possibility of sidestepping these infrastructural difficulties. For this to happen, researchers need to understand the techniques of Data Science and the use of cloud computing. By training individuals who will in turn pass on their training, the level of understanding of this field can be built up over a comparatively short period and begin to impact research. The creation of a cadre of individuals who can analyse, maintain and curate data sets will also positively impact local society, as enterprises based on data analysis expand.

Engagement with existing work in the area:
A number of initiatives already provide training in these fields: for example, there are several Summer Schools in Europe on cloud computing (see attached document of links), and Microsoft Research provides training sessions on its Azure cloud. There are now two MOOCs in Data Science, and the number of masters programmes in this area is expanding rapidly. Analysis platforms such as R have an ever-expanding set of teaching resources. The Software Carpentry movement, which has many of the correct pedagogical approaches for this school, has run events across the developed world (and has had sessions in South Africa). AidData have provided training in the analysis of social data in the developing world. The Wellcome Trust has funded a successful series of schools in Bioinformatics in the Developing World. The Accelerated Data Programme of the International Household Panel Survey, and the IHPS in general, provide tools and training to many developing-country institutions to increase the availability and quality of survey data.

Within the RDA there is a proposal for an IG on training in Data Science. CODATA has longstanding activities to build capacity in countries with developing and emerging economies. These include:
•    The PASTD Task Group (Preservation of and Access to Scientific and Technical Data in/for/with Developing Countries) has run a series of workshops in developing countries.  The next will be in Nairobi, Kenya in August 2014.
•    CODATA China, with support from the Chinese Academy of Science, is running a training workshop in Big Data for Science for Researchers from Countries with Developing and Emerging Economies.
We have already made contact with several of the organisations above. CODATA is very happy to participate and to co-sponsor the group. AidData has expressed interest in the proposal. A new IG is being formed on the teaching and training of Data Science, and it can provide guidance on what the curriculum should be. The Wellcome Trust has expressed a willingness to provide guidance based on its experience.




The goal of this Working Group is to address the governance of the brokering framework middleware used to interconnect existing international e-infrastructures. The Working Group will address the following:
1. Brokering configuration and strategies;
2. Brokering governance and agreements;
3. Publications and transparency;
4. Community adoption and sustainability.
Value Proposition
Middleware significantly simplifies distributed system construction and provides a much more efficient means of integrating legacy systems with new technology. Brokering middleware provides mediation and transformation services to simplify data discovery, evaluation, access, and use. Brokering was conceived to work as a third-party tier in a three-tier architecture (extending the client-server paradigm). This introduces a clear need to govern and manage the new brokering tier, which is particularly important when brokering middleware is used to interconnect existing large e-infrastructures in a sustainable way. Given the brokering middleware’s importance in the integration of disparate information and data systems, continued access to and availability of these middleware components is vital to supporting long-term development and continued use of integrated data systems. This is also important because brokering should be transparent to most users, so its impact lacks high-level visibility.
Effective middleware governance has the potential to support longer-term development under a variety of funding models, to simplify and standardize access models, and establish a basis for the continued value of brokered systems. It is not, however, clear what the best practices for this governance are, and how those practices shift in response to different funding and property models, under different architectures or as standards change. To ensure sustainable, stable development and effectiveness in an operational environment of brokering systems reliant on middleware service architectures, an effective model for the governance and reuse of that middleware must be agreed upon.
We propose to consider and recommend a set of best practices for governing and managing brokering middleware. These practices will work to ensure future interoperability with, access to, and use of brokering middleware, independent of or in light of various development and funding models, to support long-term planning of brokered, integrated systems. They will be of value not only to interoperability architects and developers (who can plan integrated systems assuming the continued use and support of brokering middleware) but also to system managers and end users. The potential for scaling and expanding integrated data resources and systems through brokering middleware is of value to increasingly interdisciplinary research as well as to managing growing big data sets.
The expected outcomes of the Brokering Governance WG will be:

  • A Position Paper including guidelines, best practices and recommendations for the management and support of an international brokering capability to interconnect existing and future multidisciplinary data systems and infrastructures. This will include a consensus recommendation of a path for adoption of this capability at the international level.
  • A set of use cases to assess options and best practices for governance recommendations in three diverse areas:

o Global Changes: GEO-BON;
o Environmental sciences: European Commission Danube SDI;
o International repositories: ICSU WDS.
Engagement with existing work in the area
The working group will engage with significant international programmes, including:

  • NSF BCUBE project;
  • GEO GEOSS and in particular its Infrastructure Implementation Board (IIB);
  • European Commission GEOWOW project;
  • Transatlantic ODIP project.

These projects are part of the brokering technology development that has been going on for the last decade. They have addressed discovery, access, semantic harmonization, and use, facilitating multidisciplinary and cross-organizational science research (see the Useful References section).
An operational capability for discovery and access brokering, the GEO DAB (Discovery and Access Brokering framework), has been implemented by GEO (Group on Earth Observation) in GEOSS (the Global Earth Observation System of Systems). At present, the GEO DAB interconnects more than 20 diverse research infrastructures sharing more than 65 million resources. This experience has demonstrated the technical scalability of brokering, in terms of both resources and service performance.
The operational capability in GEOSS relies on GEO’s unique governance model of country contributions. While this works, the growth and sustainability of an international broker service require a broader and more open governance model, one offering a strategic and practical path to sustainable operations.



