FAIRification of Genomic Annotations WG

Primary Domain: Natural Sciences
Group Focus: Identify, Store, and Reserve, Disseminate, Link, and Find
Group Technology Focus: Dissemination, Search & Discovery, Re-Use
RDA Pathways: Semantics, Ontology, Standardisation, Discipline Focused Data Issues

Group Description
The FAIRification of Genomic Annotations Working Group (WG) will focus on the challenges of harmonising metadata and software solutions to improve the discovery and reuse of publicly available “genomic annotation” data.

Motivation: Since the completion of the Human Genome Project, we have witnessed an explosion of datasets annotating particular locations along reference DNA sequences with e.g. aspects of functional processes. Tremendous amounts of research funding have been provided to large and small projects that have generated genomic annotation datasets. Many of these datasets relate to human genomes, but an increasing amount of such data is also generated for other model organisms – and with the recent surge of biodiversity projects – for a range of species spanning the whole tree of life. Unfortunately, researchers who want to make use of such data face practical challenges in discovering and reusing datasets relevant to their research. While many repositories and data distribution solutions exist, the metadata is often poorly aligned with best practices for Findable, Accessible, Interoperable and Reusable (FAIR) research data. Even in cases where proper metadata exists, relevant solutions typically provide metadata according to different metadata models and APIs.

Figure 1: Example of genomic annotations (tracks) from the ENCODE consortium imported into the UCSC Genome Browser.
From Rosenbloom et. al. (2011). “ENCODE whole-genome data in the UCSC genome browser: update 2012.” Nucleic acids
research. 40. D912-7. Licence: CC BY-NC 3.0

Genomic annotation data: The FAIRification of Genomic Annotations Working Group (FGA-WG) focuses on the challenges of harmonising metadata and software solutions to improve the discovery and reuse of publicly available “genomic annotation” data (see Figure 1). Genomic annotations, sometimes referred to as “genomic tracks”, refer to data files that annotate reference sequence positions and can be visualised in genome browsers, such as the UCSC Genome Browser and Ensembl, or analysed often non-visually using computational tools that span the domains of life science (e.g., statistical colocalization analyses). Genomic annotations are often routinely generated by computational workflows in larger datasets; in the context of software- or method-oriented publications; or as the result of manual curation processes. The types of data include condensed summaries of experiment outputs as well as predictive and descriptive annotation of DNA reference sequence positions.

Deliverables: The FGA-WG establishes a minimal metadata schema based on the harmonisation of existing schemas and recommendations, working to gradually improve this schema based on the interrogation of a set of use cases. To foster the availability of metadata that adheres to the schema, the FGA-WG will develop practical recommendations to build scalable and maintainable metadata transformation pipelines to “FAIRify” and unify metadata from multiple data sources. Recommendations will be developed for publishing and registering the harmonised metadata – with persistent and globally unique identifiers – such that the metadata can be easily harvested by search engines and other services that improve the discovery and reuse of the metadata (and associated datasets). Lastly, the FGA-WG will develop a harmonised API for the use of search and discovery services by downstream users and tools.

The three main use cases to be considered are the following:

1. Biomedical analysis: Discovery of genomic annotations through harmonised metadata to improve the availability and scope of datasets for analytical applications, including methodology-oriented software tools and AI/ML efforts. The focus area will be human biomedical research, but this use case could also include support for research on other species insofar as this aligns with the interests of the FGA-WG contributors.

2. Biodiversity genome annotation: Establish genome annotations for biodiversity assemblies as FAIR objects with improved metadata and integration with relevant infrastructure and tools, including both the manual and automated annotation of genome assemblies.

3. Track Hub infrastructure: Enhance genome browser-related infrastructure for track hubs with harmonised metadata and related services, to simplify the process of generating, hosting, and registering metadata-enriched track hubs; improve discovery of track hub-hosted data at the level of individual tracks; and facilitate the integration of metadata-enriched track hubs with genome browsers and other analysis tools.

Community building: The FGA-WG aims to build a broad community of individuals and groups with an interest in data integration related to genomic annotations, encompassing data producers, domain experts, tool/service developers, FAIR/RDM specialists, ethics and ELSI expertise, and analytical end users. The long-term goal is to build a sustainable infrastructure that improves the FAIRness of genomic annotations, with a particular focus on improving the end-user experience.
Rationale Summary
TAB Liason(s)
Anupama Gururaj
Secretariat Liason(s)
Bridget Walker
Group Visibility
Public
Group Creation Date
Endorsement Date
05/06/2024
Estimated End Date
05/12/2025
Actual End Date
Group Email
annotations@rda-groups.org

Group Type: Working Group
Group Status: recognised-and-endorsed
Co-Chair(s): Sveinung Gundersen, Adam Wright, Anna Bernasconi, Nathan Sheffield

FAIRification of Genomic Annotations WG

Group Description

Rationale Summary

TAB Liason(s)

Secretariat Liason(s)

Group Visibility

Group Creation Date

Endorsement Date

Estimated End Date

Actual End Date

Group Email

Leave a Reply Cancel reply