The Life Cycle of Structural Biology Data
RDA Structural Biology Interest Group
Supporting Output Title: The Life Cycle of Structural Biology Data
Corresponding author: Chris Morris, STFC, Daresbury Laboratory, WA4 4AD
Contributors: Claudia Alen, Lucia Banci, Alexandre Bonvin, Pablo Conesa, Alfonso Duarte, John Helliwell, Yogesh Gupta, Rob Hooft, John Markley, Brian Matthews, Gaetano Montelione, Antonio Rosato, Sameer Velankar, Matthew Viljoen, Geerten Vuister, John Westbrook, Martyn Winn, and Christine Zardecki.
Research data is acquired, interpreted, published, reused, and sometimes eventually discarded. This document reports how structural biologists perform these tasks, and recommends improvements to the infrastructure available to them.
Understanding the life cycle of research data better will help the development of appropriate infrastructural services that make it easier for researchers to preserve, share, and find data.
Structural biology is a discipline within the life sciences that investigates the molecular basis of life by discovering and interpreting the shapes of macromolecules. It has a strong tradition of data sharing, expressed by the founding of the Protein Data Bank (PDB) in 1971 (PDB, 1971). In the early years, data submissions to the archive were made by mailing decks of punched cards. The culture of structural biology is therefore already in line with the perspective of the European Commission that data from publicly funded research projects are public data (COM(2011) 882 final).
This report is based on the data life cycle as defined by the UK Data Archive. This is the most clearly defined workflow that the authors are aware of. It identifies six stages: creating data, processing data, analysing data, preserving data, giving access to data, and re-using data. Each is discussed below. However, the data infrastructure for structural biology is not a perfect match for this workflow. For clarity, ‘preserving data’ and ‘giving access to data’ are discussed together. We also add a final stage to the life cycle, ‘discarding data’.
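As an illustration only, the way this report regroups the UK Data Archive stages can be sketched in a few lines of Python. The names `UK_DATA_ARCHIVE_STAGES`, `MERGED`, `ADDED_FINAL_STAGE`, and `report_stages` are hypothetical and are not taken from the report or from the UK Data Archive; the stage labels themselves are those given above.

```python
# The six stages named in the text, in order.
UK_DATA_ARCHIVE_STAGES = [
    "creating data",
    "processing data",
    "analysing data",
    "preserving data",
    "giving access to data",
    "re-using data",
]

# Assumption for illustration: the two stages this report discusses together,
# and the final stage it adds to the life cycle.
MERGED = ("preserving data", "giving access to data")
ADDED_FINAL_STAGE = "discarding data"


def report_stages(stages=UK_DATA_ARCHIVE_STAGES):
    """Return the stages grouped as this report discusses them."""
    grouped = []
    for stage in stages:
        if stage == MERGED[1]:
            # Already covered by the merged group added for MERGED[0].
            continue
        if stage == MERGED[0]:
            grouped.append(MERGED)
        else:
            grouped.append((stage,))
    # The report appends one stage of its own at the end.
    grouped.append((ADDED_FINAL_STAGE,))
    return grouped


for number, group in enumerate(report_stages(), start=1):
    print(number, " and ".join(group))
```

Under these assumptions the regrouped life cycle still has six entries: merging two stages removes one, and adding ‘discarding data’ restores the count.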
Changes in research goals and methods have led to some changes in the requirements for IT infrastructure. A common data infrastructure is required, giving a simple user interface and simple programmatic access to scattered data. Progress on these tasks will support the development of workflows that facilitate the use of datasets from different facilities and techniques. The automatic acquisition of metadata can help. Large experimental centres already provide a highly professional data infrastructure. For smaller centres, providing such an infrastructure is onerous; it is desirable that a standard package be provided, enabling them to use the European e-infrastructure resources in a way that integrates with other structural biology resources.