
Body:

The Charter for the Big Data IG, which was recognized and endorsed by Council on 22 May 2015, can be found here: https://rd-alliance.org/groups/big-data-analytics-ig/wiki/rda-big-data-analytics-interest-group-charter.html

Review period start:
Thursday, 18 July, 2019
Custom text:
Body:

The attached file is the Charter for the Agricultural Data Interest Group (IGAD), as submitted on 21 May 2013.

Review period start:
Custom text:
Body:

Installation of DataVerse

 

Review period start:
Thursday, 4 July, 2019
Custom text:
Body:

 

This group has been spun off from the RDA Data Discovery Paradigms IG.

 

WG charter

 

The widespread use of schema.org[1], DCAT[2] and other vocabularies in web pages to add structured metadata describing research data has brought new opportunities for making these outputs FAIRer. The opportunities include, but are not limited to:

  1. Leveraging robust commercial search engines like Google, Yahoo, Bing etc. to facilitate broader discovery of, and access to, research data;
  2. Providing a common set of vocabularies to describe research resources, enabling improved metadata interoperability across data repositories, increasing re-use and sharing of captured metadata;
  3. Providing a potentially new method for metadata/content syndication among data catalogues and registries, enabling federated search across resources of a specific domain, or related domains relevant to a research need.
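As a minimal sketch of what such structured metadata can look like, the following Python snippet builds a schema.org Dataset description and wraps it in the JSON-LD script element commonly embedded in a landing page. The dataset name, URL, identifier and keywords are hypothetical placeholders, and the choice of properties is an assumption for illustration rather than a recommendation of this group.

```python
import json


def dataset_jsonld(name, description, url, identifier, keywords):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": keywords,
    }


def as_script_tag(record):
    """Wrap a JSON-LD record in the script element used on HTML landing pages."""
    body = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# All values below are hypothetical and only illustrate the shape of the markup.
example = dataset_jsonld(
    name="Example sea surface temperature observations",
    description="Illustrative record for a research dataset landing page.",
    url="https://repository.example.org/datasets/42",
    identifier="https://doi.org/10.9999/example.42",
    keywords=["sea surface temperature", "oceanography"],
)

print(as_script_tag(example))
```

A search engine or another catalogue that parses the page can then read the description without scraping the human-readable HTML.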

However, these opportunities also come with new challenges. Schema.org provides a core, minimalistic vocabulary for describing the kinds of entities that most common web applications need. By design, schema.org expects, and has enabled, domains of practice to extend this core (Guha et al., 2016). Like other domains of practice, research data communities have their own needs for extending this core to describe research data and its relationships to other resources. These extensions include specific data types and the properties they possess, domain-relevant and type-specific persistent identifiers, etc. Some communities are addressing these issues and have planned extensions to the core of schema.org to meet their own needs, for example bioschemas.org[3] for life science and science-on-schema.org[4] for earth and environmental science. According to a recent survey[5] carried out by the Data Discovery Paradigms IG, more data repositories are following a similar route, either by implementing structured markup in metadata landing pages or by planning extensions of schema.org in one way or another.

This proposed working group will provide a platform to complement, build on and extend efforts from bioschemas.org, science-on-schema.org and similar communities in applying and extending the core schema.org vocabulary for describing research datasets and related resources (e.g. workflows, software, researchers, etc.). The objectives of this working group are twofold:

  1. to identify and bridge gaps in existing schemas commonly used for research data, by bringing together communities who are working with such vocabularies to document research data and related resources;
  2. to provide guidelines for those communities whose needs are not addressed by existing metadata schemas such as schema.org, including guidance on proposing extensions.

The planned outputs will include:

  1. A generic ‘conceptual data model’ with essential types and properties for research data discovery over the web. The model will be built on bioschemas.org, science-on-schema.org, schema.org, DCAT, DDI-DISCO[6] and SSN[7] schemas from some representative research domains, and on data discovery use cases. A research domain can map its schema to the conceptual model when it publishes data to the web or exchanges metadata between data portals/repositories.
  2. A guideline, illustrated with common patterns, for publishing metadata landing pages with structured data markup; and a guideline, with examples, on how to customise the research schemas for target domains.
  3. Tooling to make implementation easier, if resources are available. This could include collecting and cataloguing tools that generate, validate and parse schema.org and DCAT markup.

 

Value proposition

It is expected the proposed work will benefit a range of data stakeholders as follows:
 

Data providers and data catalogue managers:

  • The conceptual data model and the guidelines will help data providers and data catalogue managers to implement structured metadata markups and have their data more findable by data seekers/consumers.
  • Being able to adopt or map to a common research schema will make it possible for metadata from one catalogue to be more interoperable with, and reusable by, other data catalogues.

 

Data seekers:

  • Data seekers/consumers will benefit from more effective and efficient data search via faceted search and filtering, and from other services, offered through either human interfaces or machine APIs, that combine structured data search with keyword search.
  • It will also make it easier for people to publish information about datasets and thus increase the range of datasets that are discoverable.
  • Since all research data is expected to have some common properties, this work will make it possible to describe these common properties and to define standard means through which they can be exposed for discovery.

 

Data technologies:

  • A common way to describe metadata across data catalogues provides an opportunity to develop applications such as federated search, either vertically within a discipline or across disciplines depending on research needs, and applications that can support a spectrum of data search needs from free-text search to SPARQL queries.

 

 

Engagement with existing work in the area

The proposed work will build on the existing work of bioschemas.org and science-on-schema.org, and on a number of mappings to and from schema.org identified in our recent survey[8] (e.g. DCAT to schema.org[9], DCAT-AP to schema.org[10], ISO 19115 to schema.org[11]). We will also reference data models and schemas from ISO and W3C recommended standards such as DCAT, which is currently being updated. By exploring existing work, we will identify both elements common across research domains and domain-specific elements.
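To illustrate how such a mapping can separate common elements from domain-specific ones, here is a minimal Python sketch of a crosswalk from DCAT-style property names to schema.org properties. The table is a small illustrative subset chosen for this example, not the mappings cited above, and the `ex:instrument` property in the usage note is hypothetical.

```python
# Illustrative subset of a DCAT-to-schema.org crosswalk (an assumption for
# this sketch; consult the published mappings for authoritative coverage).
DCAT_TO_SCHEMA_ORG = {
    "dct:title": "name",
    "dct:description": "description",
    "dcat:keyword": "keywords",
    "dct:identifier": "identifier",
    "dct:license": "license",
    "dct:issued": "datePublished",
}


def dcat_to_schema_org(dcat_record):
    """Map a flat DCAT-style record to schema.org.

    Returns the mapped record and the sorted list of source properties
    that the crosswalk does not cover (candidate domain-specific elements).
    """
    mapped = {"@context": "https://schema.org/", "@type": "Dataset"}
    unmapped = []
    for key, value in dcat_record.items():
        target = DCAT_TO_SCHEMA_ORG.get(key)
        if target is None:
            unmapped.append(key)
        else:
            mapped[target] = value
    return mapped, sorted(unmapped)
```

Calling `dcat_to_schema_org({"dct:title": "T", "ex:instrument": "CTD"})` maps the title to `name` and reports `ex:instrument` as uncovered, which is the kind of gap an extension proposal would need to address.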

 

The group will work with and encourage collaborations with other RDA WGs/IGs. In particular, we will examine closely the outputs from previous RDA WGs and IGs, including:

  • Data Description Registry Interoperability (DDRI) WG
  • Research Data Collections WG
  • Research Data Repository Interoperability WG
  • Metadata Interest Group (MIG)
  • Preservation Tools, Techniques and Policies
  • (more may be identified …)

 

The data models and types proposed from these WGs and IGs may directly benefit from the proposed work of this group, which is to extend core schema.org vocabulary to include more essential research resources for discovery and re-use of data.

 

The group will engage with existing RDA WGs/IGs for clear definition of data types and terms, including:

  • Data Type Registries WG & #2
  • Data in Context IG
  • Domain Repositories IG
  • (more may be identified …)

 

The group will seek to collaborate with domain-specific RDA WGs/IGs, for example the International Materials Resource Registries WG and the Marine Data Harmonisation IG; these groups could be potential adopters of recommendations from this group.

 

We will also liaise with the Schema.org W3C Community Group[12] to recommend the proposed research schemas.

 

Work plan

| Timing | Duration | Action | Main participants |
| --- | --- | --- | --- |
| Oct. 2019 | 0 | RDA P14: Official start of the WG | Session participants |
| Oct. 2019 - March 2020 | | Identify common elements across research domains and domain specific elements based on existing work and the survey | Co-chairs and group members |
| | | Draft guidelines for publishing metadata landing pages with structured data markups with the latest version of schema.org | |
| March/April 2020 | 6 months | RDA P15: Progress report and seeking feedback | Session participants |
| April - Oct. 2020 | | Propose data model and data types for research schemas | Co-chairs, technical advisory group, group members |
| | | Extend guidelines for including research schemas | |
| | | Identify existing tooling that helps to map, compile and validate structured markups; collaborate and work with these groups to make tools work for the recommendations | |
| Oct. 2020 | 12 months | RDA P16: Report recommendation draft, early adoption use case(s), seek more adopters | Session participants |
| Oct. 2020 - March 2021 | | Revise research schemas | Co-chairs, technical advisory group, group members |
| | | Revise guidelines | |
| | | List/catalogue of tooling | |
| March/April 2021 | 18 months | RDA P17: Recommendation output with adoption use cases | Co-chairs and group members |
| Oct. 2021 | 24 months | RDA P18: More adoption use cases | Co-chairs and group members |

 

Working group operation, communication and engagement

The group has set up a regular meeting every four weeks to communicate, discuss and get feedback from group members. Advance notice of each meeting will be sent to the group’s mailing list, and meeting notes and relevant documents will be made available on the group’s wiki page on the RDA website.

 

We plan to organise group sessions at RDA plenaries and will communicate and promote the work to communities outside RDA. Most importantly, we will take feedback and seek consensus to ensure the outputs are in line with community needs.

 

Apart from having four co-chairs, the group would also like to have a technical advisory group with members representing different standards bodies and disciplines from within and outside RDA. Chairs and the advisory group will meet regularly (every 2-3 months) to review work in progress and resolve any technical and engagement issues as they arise.

 

We will have terms of reference for co-chairs, so that each co-chair is aware of her/his responsibilities and of the RDA principles of openness, diversity and inclusiveness. When there are disagreements and conflicts among co-chairs and group members, co-chairs will ensure different viewpoints are discussed and presented, and work with members and communities to achieve consensus.

 

Co-chairs of this WG include representatives from the life science and the earth and environmental science domains, which have already adopted and extended schema.org. Their participation will ensure that lessons learnt and outputs produced by the two communities are carried over to the WG, and that the two communities are consulted on and engaged with the latest developments. Co-chairs also include a representative from a potential adopter, the Research Data Australia portal (run by the Australian Research Data Commons (ARDC)). ARDC will likely not only adopt the outputs from the WG but also promote them to its Australian data providers and international partners such as the Korea Institute of Science and Technology Information.

 

Adoption plan

ELIXIR and the ESIP semantics technologies cluster have been working with the life science community and the earth and environmental science community on adoption of bioschemas.org and science-on-schema.org respectively. The effort of the two communities on extending the schema, and on guidelines and training for their respective adoption processes, has laid the foundation for this WG. Having a representative from each of the two communities as co-chair of this WG shows that the two communities will support, be engaged with, and very likely adopt the conceptual model and guidelines from the WG.

 

Australian Research Data Commons (ARDC) runs a national data catalogue. The catalogue (Research Data Australia) harvests metadata from 101 research organisations around Australia. ARDC is exploring how to improve global data discovery by providing an optimised national aggregation point for syndication to global information systems (e.g. search engines, Scholix, and vertical discipline portals). It is likely that the outputs from the working group will be evaluated and adopted, as the outcome from the working group aligns with the direction ARDC is exploring.

 

The Arctic Data Committee (ADC) is an international body whose members come from data centers that hold polar data of any kind. Its purpose is to “promote and facilitate international collaboration towards the goal of free, ethically open, sustained and timely access to Arctic data through useful, usable, and interoperable systems”. The ADC is comprised of members of the International Arctic Science Committee (IASC), the Sustaining Arctic Observing Networks program (SAON), and the Standing Committee on Antarctic Data Management (SCADM). During a meeting in Geneva last fall, the ADC community unanimously agreed that adopting structured metadata à la schema.org was in the community’s best interest. As a result, they are awaiting the results of this WG to guide development. To that end, they have appointed a liaison to this community, both to provide input and to take outputs back for implementation.

 

The WG chairs and members of the technical advisory group will actively engage communities inside and outside RDA to promote the outputs and encourage further adoption.

 

The working group chairs

Leyla Garcia (ZBMED Information Centre for Life Sciences, Bioschemas, Germany)

Sarala Dissanayake (DataCite, FREYA, Germany)

Adam Shepherd (Biological and Chemical Oceanography Office (BCO-DMO), US)

Mingfang Wu (Australian Research Data Commons, Australia)

 

Technical advisory group members

Simon Cox (CSIRO, Australia)

Ruth Duerr (Ronin Institute, US)

Doug Fils (Consortium for Ocean Leadership, US)

Rafael C. Jimenez (Research Informatics at Alzheimer's Research, UK)

Nick Juty (ELIXIR, UK)

Siri-Jodha Khalsa (National Snow and Ice Data Center, University of Colorado, US)

Andrea Perego (European Commission, Joint Research Centre (JRC))

 

 

Acknowledgement:

We would like to thank the following group members who have contributed to the writing up of the case statement:

Tim Clark - Massachusetts General Hospital / Harvard Medical School, US

Simon Cox - CSIRO, Australia

Anusuriya Devaraja - PANGAEA, Germany

Ruth Duerr - Ronin Institute for Independent Scholarship, US

Doug Fils - EarthCube Science Support Office, US

Leyla Garcia - Bioschemas, UK

Nick Juty - ELIXIR-UK

Fidan Limani - German National Library of Economics, Germany

Stefanie Kethers - Australian Research Data Commons, Australia

Siri Jodha Khalsa - NSIDC, US

Andrea Perego - European Commission, Joint Research Centre, Italy

Adam Shepherd - Biological and Chemical Oceanography Office (BCO-DMO), US

Andrew Treloar - Australian Research Data Commons, Australia

Mingfang Wu - Australian Research Data Commons, Australia

 

 

Reference

Guha, R. V., Brickley, D., & Macbeth, S. (2016). Schema.org: Evolution of Structured Data on the Web. Communications of the ACM, 59(2), 44–51. doi:10.1145/2857274.2857276

 

Review period start:
Thursday, 11 July, 2019 to Sunday, 11 August, 2019
Custom text:
Body:

The I-ADOPT WG will focus on creating a community-agreed framework for representing observable properties by bringing together groups that have been working on developing terminologies to accurately encode what was measured, observed, derived, or computed. The consensus building will be informed by reviewing current practices and by a set of use cases, which will be used to define the requirements and to test and refine the common framework iteratively. Much like a generic blueprint, this framework will be a basis upon which terminology developers can formulate local, but globally aligned, design patterns. With these, they may leverage their local “materials” in a multi-pronged attempt to represent complex properties observed across the environmental sciences (from marine to terrestrial ecosystems, as well as biodiversity, atmospheric, and Earth sciences). The WG will then seek to synthesize these approaches into global best practice recommendations. Furthermore, it will help mediate between generic observation standards (SSNO, SensorML, etc.) and current community-led resources, fostering harmonized implementations. Through this effort, FAIRer observable property terminologies will be created, the global effectiveness of tools operating upon them will be improved and their impact increased. The WG will thus strengthen existing collaborations and build new connections between terminology developers and providers, disciplinary experts, and representatives of scientific data user groups.

Please note that a revised statement has been defined and can be downloaded below.

Review period start:
Tuesday, 16 July, 2019 to Friday, 16 August, 2019
Custom text:
Body:

Open Science Graphs for FAIR Data Interest Group

Case statement

NB: this case statement is the revised version from October 2019. The initial and revised case statements are attached below for download.

Co-chairs:

  • Amir Aryani (Research Graph Foundation)
  • Martin Fenner (DataCite)
  • Wouter Haak (Elsevier, Mendeley Data)
  • Paolo Manghi (OpenAIRE Infrastructure - Institute of Information Science and Technologies, CNR, IT)

 

Mission

The goal of the Open Science Graphs Interest Group (OSG IG) is to build on the outcomes and broaden the challenges of the Data Description Registry Interoperability (DDRI) and Scholarly Link Exchange (Scholix) RDA Working Groups to investigate the open issues and identify solutions towards achieving interoperability between services and information models of Open Science Graph initiatives. The aim is to improve FAIRness of research data, and more generally FAIR*-ness of science, by enabling the smooth exchange of the interlinked metadata overlay required to access research data at the meta-level of the discovery-for-citation/monitoring and at the thematic level of the discovery-for-reuse. Such “FAIR-ness” and “interlinked-ness” provide strong support for research integrity and research innovation, which in turn underpin significant social, environmental and economic benefits.

Objectives

Open Science is urging scientists, communities, institutions, and policymakers to define and adopt methodologies, practices, and tools for publishing research products beyond the scientific article, including research data, software, digital experiments, etc. The ultimate goal is to achieve transparency and reproducibility of science. As a consequence of this trend, researchers are depositing into scholarly communication data sources the metadata and files relating to all these products, together with semantic links between them and towards other relevant entities, such as those kept in registries for authors, organizations, and data repositories (e.g. ORCID, ROR, re3data.org). De facto, Open Science publishing practices materialize a distributed/federated/de-centralized and global Open Science Graph, where by “graph” we mean a collection of objects (i.e. scientific product metadata) interlinked by semantic relationships (i.e. claims of object-to-object relationships together with their meaning). There is great interest in contributing to and/or consuming this Graph for sharing, discovering, and monitoring Open Science. To address this, several initiatives are aggregating targeted subsets of such sources to build specialized Open Science Graphs, subsets of the global Open Science Graph, capable of serving specific user needs: Google Scholar, Microsoft Academic, Scopus, FREYA PID Graph, Research Graph Foundation, OpenAIRE Research Graph, Open Knowledge Graph, Human Brain Project Knowledge Graph, as well as the CERIF graphs built via CRIS systems, are just a few of the real-world graphs being built and consumed today.
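The notion of "graph" used here, objects interlinked by typed semantic relationships, can be sketched in a few lines of Python. This is a toy in-memory structure for illustration only; the class name, the relation labels and the identifiers in the usage note are hypothetical and do not reflect the information model of any initiative named above.

```python
from collections import defaultdict


class ScienceGraph:
    """Toy graph: nodes are object metadata, links are typed relationships."""

    def __init__(self):
        self.nodes = {}                  # identifier -> metadata dictionary
        self.links = defaultdict(list)   # source id -> [(relation, target id)]

    def add_node(self, pid, kind, title):
        """Register an object (dataset, article, person, ...) under its identifier."""
        self.nodes[pid] = {"type": kind, "title": title}

    def add_link(self, source, relation, target):
        """Record a claim that `source` stands in `relation` to `target`."""
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends of a link must be known objects")
        self.links[source].append((relation, target))

    def related(self, pid, relation=None):
        """Return identifiers linked from `pid`, optionally filtered by relation."""
        return [t for r, t in self.links.get(pid, []) if relation is None or r == relation]
```

For example, after adding a hypothetical article `doi:10.9999/article` and dataset `doi:10.9999/dataset` and linking them with `add_link("doi:10.9999/article", "references", "doi:10.9999/dataset")`, the call `related("doi:10.9999/article", "references")` returns the dataset identifier. Interoperability between graphs then amounts to agreeing on the identifiers, object types and relation labels being exchanged.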

Clearly, FAIRness of research data strongly relies on the success and diffusion of such graphs, both at the cross-discipline level and at the thematic level. Research data (as well as software, or any research object) can be contextualized, thus maximizing its value and reusability, and be reachable via navigation from other related objects. Moreover, research data value and related scientific reward may be derived from its constantly updated context, relying on a network of citations and usage statistics. The general architecture and use cases have been studied by the RDA Data Description Registry Interoperability (DDRI) Working Group. In addition, the RDA/WDS Scholarly Link Exchange (Scholix) Working Group has added significantly to our understanding of the subset of the research data graph connecting data and literature. Both working groups have done extensive work on the implementation of services and community adoption and are today in maintenance mode. However, other projects continue to work on these outcomes, including the OpenAIRE Research Graph, the Scholexplorer graph, Research Graph, and the FREYA PID Graph. Driven by such motivations, the co-chairs of the aforementioned WGs organized a BoF on “Research Data Graphs” at the RDA Plenary Conference in Philadelphia to gauge general interest in, and possible commitment to, this topic, which resulted in this IG case statement and proposal.

The Open Science Graphs Interest Group (OSG IG) will investigate the challenges and identify solutions towards achieving interoperability between services and information models of Open Science Graph initiatives. The aim is to improve FAIRness of research data, and more generally FAIR*-ness of science, by enabling the smooth exchange of the interlinked metadata overlay required to access research data at the meta-level of the discovery-for-citation and at the thematic level of the discovery-for-reuse. The following main challenges have been identified as worthy of investigation:

  1. Build a community of Open Science Graph initiatives working together in the context of RDA with a focus on FAIR data;
  2. Build on and provide input to the outcomes of RDA IGs and WGs such as DDRI, Scholix, PID, Metadata, Go-FAIR, FAIR Data Maturity Model, Data Usage Statistics, and Data Citation; this will be achieved thanks to the current membership of co-chairs to such groups and by inviting co-chairs of other groups to build active synergies;
  3. Analyse the state of the art in this domain, by building synergies with the tens of initiatives today building Open Science Graphs, and provide an overview of current research data graph activities to frame a definition and classification of such graphs;
  4. Study the foundations of an information model, a lingua franca, that would enable the realization of an interoperability layer facilitating the exchange of information between graphs;
  5. Discuss the ideal services, protocols, and APIs required to exchange graphs, query graphs, navigate graphs in both aggregation scenarios and federated access scenarios.
  6. Identify one or more dedicated RDA Working Groups to tackle/address relevant challenges.

Participation
We envisage participation from the following kinds of actors:

  • Open Science Graphs information models (e.g. Scholix.org, CERIF, Research Graph, OpenAIRE and PID Graph information models)
  • Open Science Graphs providers targeting specific end-users:
    • Thematic graphs for scientists
    • Citation graphs for scientists, communities, institutions, policymakers, funders
    • Monitoring graphs for institutions, policymakers, funders
  • Open Science Graphs consumers: scientists, communities, institutions, policymakers, funders

Already Committed

  • RDA Data Description Registry Interoperability (DDRI) Working Group (Benjamin Zapilko)
  • RDA Scholarly Link Exchange Working Group (Wouter Haak)
  • OpenAIRE Research Graph (Paolo Manghi)
  • FREYA PID Graph (Martin Fenner)
  • Research Graph Foundation (Amir Aryani)
  • Thematic Graph (Mark Parsons)
  • Open Research Knowledge Graph (Markus Stocker)
  • CERIF/EuroCris (Jan Dvorak)
  • DBPedia/Wikipedia (Daniel Mietchen)

Interaction Mechanisms

OSG IG members will interact by means of a dedicated D4Science Virtual Research Environment (web site, mailing lists, online file system) and publish documents in Zenodo.org, in a dedicated Collection. The co-chairs will meet virtually three times a year (between plenaries), while members will regularly meet at the RDA plenaries twice a year. Other virtual meetings will be organized to address specific topics by sub-groups of interested members.

 

Outcome

The Open Science Graphs for FAIR Data IG will be considered successful if:

  • The main Open Science Graphs initiatives have started to form a community and have identified common goals and challenges.
  • The challenges identified above will translate into recommendations and possibly into RDA Working Groups where standards and protocols can be defined and implemented by the committed members of the IG.
  • Adequate visibility, participation, and uptake of the recommendations will be demonstrated within RDA and beyond.
Review period start:
Tuesday, 18 June, 2019 to Thursday, 18 July, 2019
Custom text:
Body:

Preserving Scientific Annotation Working Group (PSA-WG) Case Statement

A WG of the RDA IG Preservation Techniques, Tools and Policies

1. WG Charter

The Preserving Scientific Annotation RDA Working Group (PSA-WG) will precipitate adoption of reliable standards-based preservation solutions for both newly-created research contributions which employ annotation of data and documents, and also those redelivered from existing research investments. Annotation of digital resources has emerged as a new research paradigm, extending across scientific domains and offering significant opportunities for improving discovery and preservation of research investment compared with existing workflows and conventional publication processes.

However, realizing long-term benefits from annotation methodologies has remained elusive, due to continuing evolution of both the instruments[1] which create annotations and infrastructures to store them. This has prevented development and operation of stable annotation workflows, and poor awareness of preservation vulnerabilities has led to loss of research investment or the need for costly redelivery to overcome cyclic technology obsolescence. However, this situation can now be remedied, in particular through use of PID strategies[2] for reliably connecting annotation lists, contributing scientists and digital resources in the long-term.

 

1.1 Multiple Annotation Techniques

Annotation of born-digital texts and datasets, as well as annotations on digitized physical and media objects, can be traced from the ideas of Vannevar Bush in the 1940s and demonstrations at the SRI Augmentation Laboratory[3] founded by Douglas Engelbart in the 1960s. Widespread use of Bill Atkinson’s HyperCard and its later development by Apple Computer in the late 1980s led to annotation applications linking digital media[4], and with rapid uptake of the World Wide Web after 1990 these began using HTML. However, it wasn't until 2009 that accessible instruments for creation of application-independent 'stand-off' annotations (represented, for example, in XML) for digitized objects appeared, creating a new mode of research which has entered mainstream practice. New research infrastructures precipitated by developments such as the International Image Interoperability Framework[5] (IIIF) in late 2011 and the emergence in 2012 of nascent standards for representing annotations in JSON and potentially other serializations, have significantly accelerated adoption of annotation in fields such as medical imaging and the social sciences and humanities. IIIF breaches 'silos' of digital resources previously confined by different, incompatible technologies and instead enables delivery of assets from multiple organizations into research workflows using a single consistent interface. Although initially a framework for image interchange, IIIF specifically promotes stand-off annotation, partly because of its close organizational links with the Web Annotation Data Model[6] (WADM) editors. IIIF is expanding to address other media types and multiple projects are already using WADM for annotating 3D objects and time-based media such as movies[7].

The stand-off style of annotation has advanced, in contrast to the alternative of 'embedding' annotations with digital assets, which requires integration of annotation data with the specific digital representation employed to describe the object. Modification of different digital media formats in this way, and subsequent versioning of assets is generally impractical: for example multiple annotations created by different contributors lead to scalability problems with methods to retrieve and maintain embedded annotations. A particular limitation of embedded annotation approaches is the inability to effectively annotate across multiple objects or distributed corpora, since they are necessarily tied to a particular location. This seriously limits the ability to use annotation to construct narratives or flows through data.
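A minimal sketch of a stand-off annotation, expressed with the Web Annotation Data Model and built here in Python, shows how the annotation lives apart from the asset and only points at a region of it. The annotation identifier, image URL, creator identifier and comment in the usage note are hypothetical; the structure (body, target, fragment selector) follows the W3C model, but this is an illustration rather than a profile endorsed by the WG.

```python
import json


def standoff_annotation(anno_id, image_url, region_xywh, comment, creator):
    """Build a Web Annotation that comments on a rectangular image region.

    The image itself is never modified: the annotation refers to it by URL
    and selects the region with a media-fragment selector.
    """
    x, y, w, h = region_xywh
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "motivation": "commenting",
        "creator": creator,
        "body": {"type": "TextualBody", "value": comment, "format": "text/plain"},
        "target": {
            "source": image_url,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={x},{y},{w},{h}",
            },
        },
    }
```

For instance, `json.dumps(standoff_annotation("https://annotations.example.org/anno/1", "https://images.example.org/scan-001.jpg", (10, 20, 300, 400), "Marginal note in a second hand", "https://orcid.org/0000-0000-0000-0000"), indent=2)` serializes one such record. Because the record is independent of the image file, several contributors can annotate the same asset, or one annotation list can span assets held by different organizations.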


[1] We use the term ‘instrument’ throughout this document in the sense of Virtual Research Instrument, rather than implying any specific physical apparatus

[2] https://www.clarin.eu/content/comparison-pid-systems, https://office.clarin.eu/pp/D2R-2b.pdf “... each individual resource (even an annotation) needs to be referenced, so that we can expect a huge number of PIDs.”

[3] https://www.sri.com/blog/future-augmentation

[4] http://www.uni-lueneburg.de/hyperimage/hyperimage/ebsKart.htm

 

 

1.2 Working Group Focus

PSA-WG will focus initially on stand-off annotation which can be represented using the digital asset-independent WADM from W3C. It will not pursue preservation of embedded annotations except through conversion to stand-off. Although the WADM specification is not currently supported directly by most research instruments, those already in use do generate annotations represented in JSON using various versions of the predecessor to WADM, the Open Annotation Data Model[8] (OADM), and this can be converted into WADM. In addition, the need to transform such existing annotations as the WADM specification evolves is also recognized, and can be automated. Moreover, through WADM's incorporation of a linked data approach to classifying annotations, the groundwork has been established for effective re-use of such research. As a result, millions of OADM annotations have been generated since 2016 and there is currently an explosion in the volume of new annotations being created. These developments have provided some important components for infrastructures permitting long-term protection of research investment employing annotation. However, other key building blocks, particularly discovery and persistent storage for annotation, remain to be addressed before effective solutions can be provided to guarantee the preservation of both annotations and the resources they target. Consequently, research investment that relies on annotation is currently vulnerable.
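The kind of automated OADM-to-WADM conversion mentioned above can be sketched as a key-renaming pass over the JSON. The sketch below is deliberately simplified and rests on assumptions: it presumes the input uses the key names of the Open Annotation community JSON-LD context (`hasBody`, `hasTarget`, `motivatedBy`, `annotatedBy`, `annotatedAt`), handles only top-level keys, and does not translate nested bodies or selectors. Real instruments emit several OADM variants, so a production converter would need per-instrument handling.

```python
# Assumed correspondence between OADM-style keys and WADM keys (top level only).
KEY_MAP = {
    "@id": "id",
    "hasBody": "body",
    "hasTarget": "target",
    "motivatedBy": "motivation",
    "annotatedBy": "creator",
    "annotatedAt": "created",
}


def strip_oa_prefix(value):
    """Turn a prefixed term such as 'oa:commenting' into 'commenting'."""
    if isinstance(value, str) and value.startswith("oa:"):
        return value[len("oa:"):]
    return value


def oadm_to_wadm(oadm):
    """Convert a flat OADM-style annotation dictionary to a WADM-style one."""
    wadm = {"@context": "http://www.w3.org/ns/anno.jsonld", "type": "Annotation"}
    for key, value in oadm.items():
        if key in ("@context", "@type"):
            continue  # replaced by the WADM context and type set above
        wadm[KEY_MAP.get(key, key)] = strip_oa_prefix(value)
    return wadm
```

Keys that the map does not recognise are copied through unchanged, so nothing is silently dropped during migration and unconverted fields remain visible for later review.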

There are also stand-off annotation techniques and activities which PSA-WG will not address at this time. One example is hypothes.is[9], an open source project enabling annotation of web resources, which continues an approach advanced by Google before 2010 (Sidewiki[10]). hypothes.is addresses annotation of text on webpages rather than elements of independent digital assets (for example, features depicted in digital images themselves, on solid models or in movie frames), and as soon as the webpage is altered the annotations are lost. Because such tools do not select targets on digital resources which can be maintained independently of the evolving internet environment, we do not address this family of technologies here, although it could be an important future activity of the WG to do so. hypothes.is currently supports Google Chrome and has announced plans for a similar Firefox development. Separately, Apache Annotator[11] is an ASF Incubator project supporting ‘Web annotation in Web browsers, Web publication readers, and the servers that serve them’, which has set out a broader roadmap of annotator.js-based projects and plugins.


[5] https://iiif.io/

[6] https://www.w3.org/TR/annotation-model/

[7] DOI: 10.1109/VSMM.2017.8346274 https://www.researchgate.net/publication/324785534_I-media-cities_a_searchable_platform_on_moving_images_with_automatic_and_manual_annotations/download

[8] http://www.openannotation.org/spec/core/

 

 

1.3 Summary of Outcomes

Anticipated impacts of the WG are discussed in Section 2, and the deliverables are described in Section 4. The task of the PSA-WG is to tackle preservation of annotation in a substantive manner, and its anticipated outcomes can be summarized as follows:

  • communicating vulnerabilities: a campaign to raise awareness of preservation risks and halt the continuing loss of investment in research which employs annotation

  • overcoming roadblocks to creation of end-to-end solutions for preserving stand-off annotation: identifying essential but currently incomplete preservation mechanisms and precipitating the implementation of new functionality by repositories and annotation instruments to fill these vacua

  • developing preservation use-cases in collaboration with partner organizations in multiple domains of research activity, which demonstrate exemplary strategies for preserving annotation and communicating these effectively to the broader community

  • influencing on-going standards developments to ensure robust and efficient preservation solutions for the long term and delivering new benefits from stand-off annotation such as annotation store-based discovery services, which cannot be realized until preservation issues have been resolved[12]

  • Short-term priorities will include, for example, PSA-WG working with ORCID to implement an Annotation Work Type marker for contributor attribution and working with CERN to implement a Zenodo[13] Annotation Collection data resource type. These activities will produce a core identifier scheme (ORCID/URN/Zenodo-DOI), based on an implementation proposal that emerged from discussions originating at PIDapalooza 2019, which will be evaluated in pilot projects and incorporated in a Technical Report within the planned 18-month activities of the WG. The WG will later pursue the preservation of relationships between annotations and the digital resources they target using multiple identifier schemes. It will also later consider an instrument-agnostic OADM implementation guideline to promote display and maintenance of annotations across multiple instruments and improve migration to WADM.


[9] https://web.hypothes.is/
[10] https://chrome.googleblog.com/2009/10/bringing-google-sidewiki-goodness-...
[11] https://annotator.apache.org/
[12] Activity commenced at PIDapalooza 2019
[13] https://www.openaire.eu/zenodo-is-launched

 

2. Value Proposition

 

2.1 Overcoming Roadblocks to Preservation of WADM Annotations

Stand-off scientific annotations using WADM are packaged as discrete digital entities (residing in databases, cloud services, etc.) implicitly related to a file in a digital resource: the annotated asset. Annotations comprise 'bodies' of information associated with 'targets' defined on the asset, which itself remains unmodified. The target definition is stored in the annotation, and uses ‘selectors’ appropriate to the asset type to identify the feature to which the annotation body refers. For example, a movie clip or just one frame can be defined by a 'time code' measured from the end of the leader and start of discrete exposed frames (the ‘in point’), plus an ‘out point’. Media types determine how such assets can be targeted: a PDF produced by a scanner without embedded OCR cannot be targeted with a text selector in the same way that a born-digital PDF or Word file might be. However, the frames of a movie or the page boundaries of a document containing a snippet of text nevertheless ‘anchor’ annotations to the respective assets in one-way relationships constituted by ‘target selectors’.
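As a minimal sketch of the body/target/selector structure described above, the following builds a WADM annotation whose target is a time range of a movie, expressed as a FragmentSelector carrying a W3C Media Fragments time code. The function name, identifiers and URLs are hypothetical and chosen only for illustration; in-point and out-point values are assumed to be given in seconds.

```python
import json


def media_fragment_annotation(anno_id, source, in_point, out_point, comment):
    """Build a WADM annotation targeting the interval [in_point, out_point) of a movie."""
    if out_point <= in_point:
        raise ValueError("out_point must be later than in_point")
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        # The body carries the information being asserted about the target.
        "body": {"type": "TextualBody", "value": comment, "format": "text/plain"},
        # The target names the asset (left unmodified) and selects part of it.
        "target": {
            "source": source,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"t={in_point},{out_point}",
            },
        },
    }


# Hypothetical example: a comment on seconds 30 to 60 of a digitized film.
annotation = media_fragment_annotation(
    "http://example.org/anno/film-1",
    "http://example.org/assets/film.mp4",
    30,
    60,
    "Title card with handwritten corrections",
)
serialized = json.dumps(annotation, indent=2)
```

The one-way relationship is visible in the result: the annotation refers to the asset through `target.source`, while nothing in the asset refers back to the annotation.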

With stand-off annotation these relationships extend no further than the contents of the annotations themselves, and unless precautions are taken to ensure that the digital assets which they target can be located in the long term, the annotations become vulnerable. Management of research data constituting annotations is subject to local IT infrastructure policies, which invariably differ from the policies and patterns of investment affecting sustainability of the various digital resources which they target. Consequently, any changes affecting either infrastructure which compromise unique identification of digital assets using annotations’ target selectors alone render investment in those annotations useless. This situation is complicated further by versioning and the existence of multiple representation formats for digital assets; for example, images may be compressed using JPEG for internet use, in contrast to the lossless representations produced at the point of digitization. Annotation target selectors will, in general, be valid for only one version of a digital asset file, even if variants have the same resolution, so vulnerabilities arise when certainty of identifying the correct asset file using only the information in an annotation target selector cannot be maintained over time.

Moreover, annotations created using one instrument cannot currently be presented using another, without conversion and potential loss of information.

Lastly, although evolving annotation standards make provision for identification of creating agencies (whether directly by a researcher or indirectly by software algorithm), none of the existing annotation instruments yet supports ascription mechanisms, such as ORCID[14] contributor identifiers.


[14] https://orcid.org/

 

2.2 Benefits for the Research Community

In summary, PSA-WG will address principal roadblocks to preservation of WADM annotation through the following activities:

  • evaluating and producing recommendations on the use of persistent identifiers for annotated resources, to ensure long-term resolvability of annotation targets and, additionally, to address evolving resources so that annotations continue to reference the correct version

  • producing recommendations for the use of WADM (and OADM in the short term) so that annotation tools can interoperate and share annotations without conversion

  • developing attribution and credit mechanisms to ensure that annotations can be properly treated as scholarly activity and, additionally, providing aggregation mechanisms so that scholarly contributions can be managed at effective levels of granularity

Success with these developments would also enable the WG to contribute to formalizing ‘research activity’ identification[15], and to contribute significantly to better discovery of research investment employing annotation. Research activities could be connected definitively to Annotation Collections[16], themselves comprising annotation lists containing both identifiers of the resource targeted and institutional or original contributor identifiers. Such connectivity would allow contributors as well as annotated digital resources to be identified by research activities with fine granularity, as well as permitting automated verification of continued accessibility of digital resources in the long term.
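The connectivity described above can be sketched as follows, assuming annotations are gathered in a WADM AnnotationCollection with a single embedded first page. The helper names, the example identifiers and the `resolves` callable are hypothetical; in practice `resolves` would be supplied by the caller (for instance, a function that queries a PID resolver), and the charter does not prescribe any particular mechanism.

```python
def build_collection(collection_id, label, annotations):
    """Wrap annotations in a WADM AnnotationCollection with one embedded page."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": collection_id,
        "type": "AnnotationCollection",
        "label": label,
        "total": len(annotations),
        "first": {
            "id": f"{collection_id}/page1",
            "type": "AnnotationPage",
            "startIndex": 0,
            "items": list(annotations),
        },
    }


def referenced_identifiers(collection):
    """Return (target identifiers, creator identifiers) found in the first page."""
    targets, creators = set(), set()
    for anno in collection["first"]["items"]:
        target = anno["target"]
        # A target is either a bare identifier or an object with a 'source'.
        targets.add(target["source"] if isinstance(target, dict) else target)
        creator = anno.get("creator")
        if creator:
            creators.add(creator if isinstance(creator, str) else creator["id"])
    return targets, creators


def unresolved_targets(collection, resolves):
    """List target identifiers for which the caller-supplied check fails."""
    targets, _ = referenced_identifiers(collection)
    return sorted(t for t in targets if not resolves(t))


# Hypothetical annotations sharing one contributor and targeting two assets.
items = [
    {"id": "http://example.org/anno/1", "type": "Annotation",
     "creator": "https://orcid.org/0000-0002-1825-0097",
     "target": "urn:example:asset:1"},
    {"id": "http://example.org/anno/2", "type": "Annotation",
     "creator": {"id": "https://orcid.org/0000-0002-1825-0097", "type": "Person"},
     "target": {"source": "urn:example:asset:2",
                "selector": {"type": "FragmentSelector", "value": "xywh=0,0,50,50"}}},
]
collection = build_collection("http://example.org/collection/1", "Pilot corpus", items)

# Stand-in for a real resolver: only the first asset is still accessible.
still_accessible = {"urn:example:asset:1"}
missing = unresolved_targets(collection, lambda pid: pid in still_accessible)
```

Keeping the resolver as a parameter lets the same check run against different identifier authorities without changing the collection-handling code.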

     

 

2.3 Key Impacts

The volume of existing research investment incorporating annotation which is vulnerable, and strategies for its protection, are assessed in Section 2.4. However, it is evident that the lack of robust standards-based infrastructure for preservation of annotation, as of the date of this document, is the most pressing concern. Without this, the probability of saving historic investment and mitigating the significant costs of redelivering research which is currently in progress or being planned is low. The roadmap set out in Section 4 to deliver annotation preservation solutions to the research community within the 18-month program of PSA-WG is realistic, since new technologies will not have to be developed: effective persistent identifier and long-term repository components have been in use for several years. To provide preservation infrastructures for annotation, these components have to be assembled into practical solutions, and this could be achieved by the WG through selection from available components, publication of recommendations and detailed use cases. However, effective communication of vulnerabilities as well as recommendations and development of early adoption projects in specific sectors of the research community will be a WG priority. Key impacts of these activities can be summarized as follows:

  • sharp reduction of loss in research investment using annotation currently being planned, through effective communication of vulnerabilities to the research community
  • overcoming roadblocks to creating workable solutions for long-term preservation of annotation using available technologies where possible

  • creation and communication of guidelines and use cases leading to uptake of preservation solutions for annotation and improved rates of recovery of historic material at risk of loss

  • elimination of cyclic costs of redelivery or loss of investment in the future arising from lack or failure of preservation strategies for annotation

  • creation of new potential for FAIR digital resources incorporating annotation


[15] e.g. https://pub.uni-bielefeld.de/record/1972842, https://www.raid.org.au/, http://www.researchobject.org/
[16] see section 5.0 of W3C Recommendation 23 February 2017, https://www.w3.org/TR/annotation-model/

 

 

2.4 Scale of Research Investment using Annotation

Stand-off annotation is already a mainstream mode of research: increasingly, scholarly information is being made available as annotations and comments created as part of the discursive process, rather than via conventional publication. Many of the assertions so made are individually not significant enough to warrant publication unless combined with many others and rewritten to form an overarching narrative. This may not happen immediately, and annotations often remain the only instances of such contribution for long periods. However, in the long run, the information content of accumulated annotations often constitutes a significant online resource. For example, Early Modern Letters Online[17] is primarily a catalogue resource but contains over 50,000 comments made by historians and literary scholars that greatly enhance its utility.

Individual research endeavors organized around annotation are also multiplying, as standards for defining annotations emerge and new instruments for creating them precipitate innovative methods and inflect research agendas. This evolution is distributed across research domains, from the natural and life sciences to the social sciences and humanities. For example, one of the long-standing editors of the current WADM annotations standards group is based at Massachusetts General Hospital[18]; the Digital Imaging and Communications in Medicine (DICOM[19]) group represents hundreds of institutions and medical equipment manufacturers, some of which are in the process of adopting WADM. Heidelberg’s Excellence Cluster for Transcultural Studies[20] produced fewer than 100,000 annotations between 2010 and 2015 across multiple projects, whereas a single contemporary research activity at Europa Institute Basel has already produced more than 900,000.

Annotation investment by the Heidelberg Cluster would have been lost in 2017 when personnel supporting its software infrastructure were reassigned. Urgent data forensic work on several projects led to creation of Invenio repositories and IIIF services and redelivery of tens of thousands of complex annotations as OADM targeting these new resources. However, mechanisms to ensure connection between annotations and these resources remain to be finalized. The current Europa Institute project has developed its own standards-based repository infrastructure to be able to guarantee long-term accessibility of its outputs. These are high-cost activities, for which support through conventional applications for research funding would be unlikely.


[17] emlo.bodleian.ox.ac.uk
[18] Paolo Ciccarese, Massachusetts General Hospital https://www.w3.org/TR/annotation-model/
[19] https://www.nema.org/Standards/Pages/Digital-Imaging-and-Communications-...
[20] http://www.asia-europe.uni-heidelberg.de/en/hcts.html

 

 

3. Engagement with Existing Work and Adoption Plan

PSA-WG will engage with existing data preservation work and with the research community in three distinct activities. First it will work with identifier management authorities and repository services to develop identifier schemes for preservation of annotation data, including organizations developing ‘research activity’ identification mechanisms (as already discussed in Section 2.2). Secondly, it will develop partnerships with multiple research communities to evaluate effective identifier schemes for the specific digital resources and annotation instruments which they employ, leading to publication of use-cases and recommendations. Thirdly, PSA-WG will also work with groups engaged in on-going development of standards relating to preservation of annotation, including:

  • developers of instruments supporting display, maintenance and creation of annotation
  • standards developers contributing both to convergence of existing usage of annotation standards and their future evolution

Additionally, PSA-WG will engage with RDA ESIP/RDA Earth, Space, and Environmental Sciences IG, RDA PID Kernel Information WG and W3C WADM early in its activities in order to consult over development of pilot projects, use-cases and standards.

 

 

3.1 Collaboration with Identifier Authority and Repository Organizations

Section 4 addresses outreach activity work-packages, and describes the WG’s strategy of creating and testing a core identifier scheme for preservation of annotation as the basis for development of schemes tailored to the needs of specific research communities.

This approach builds on preliminary work after RDA Plenary-11 on preservation of annotation pilot projects conducted by The Bodleian Libraries, CERN Repositories Section and Data Futures LBG. These activities redelivered vulnerable research in the humanities employing complex annotation, transforming annotation in legacy formats into OADM targeting new IIIF digital resources. However, they did not implement a persistent identification scheme to connect the annotations and targeted resources. Since RDA Plenary-12, PSA-WG has worked with Zenodo and ORCID, as summarized in Section 1.3, to implement support for annotation-based research activities, leading to implementation of an Annotation Work Type marker for contributor attribution and a Zenodo Annotation Collection data resource type.

In the first months of its work-plan after being formally established, PSA-WG will implement an ORCID/URN/Zenodo-DOI annotation identifier scheme for the previously redelivered humanities corpora and additionally implement this scheme in a new humanities research project (see WP2). However, the WG recognizes that IIIF digital resources (and consistent implementation of URN) and the Mirador[21]-flavored OADM employed in these projects are not representative of preservation requirements of other research communities. PSA-WG will work with partners in the life and physical science communities (current discussions summarized in Section 3.2) who are already planning development of annotation workflows, in order to develop effective identifier schemes for those domains. This will require collaboration with multiple digital resource PID authorities, research activity developers, such as Research Object Consortium, and with repository services and developers. Accordingly, Bodleian, CERN and Data Futures will pursue discussions with a new tier of organizations, such as California Digital Library and Duraspace, from Plenary-13 onwards.

 

 

3.2 Engagement with Research Activities in the Life and Physical Sciences

Multiple attendees of the PSA-WG Birds-of-a-Feather meeting at Plenary-12 expressed interest in developing preservation of annotation pilot projects in disciplines other than the humanities. As a result, discussions are now in progress with the Earth sciences, bioinformatics and medical imaging communities. Preliminary evaluations using GeoTIFF and DICOM datasets were implemented in early 2019; correspondence between PSA-WG and ELIXIR led to a meeting at PIDapalooza and project meetings scheduled at Plenary-13; multiple internet meetings have led to ESIP joining PSA-WG and discussions are now scheduled at Plenary-13 to develop collaborative project roadmaps. Additionally, pilot projects in science and technology have commenced before Plenary-13 (see Section 5) and will deploy the same ORCID/URN/Zenodo-DOI annotation identifier scheme already developed in PSA-WG projects for Plenary-12, forming the first tier of adopters.

 

 

3.3 Input to Standards Activities

Preliminary PSA-WG partners The Bodleian Libraries, CERN Repositories Section and Data Futures LBG have been developing discussions with the standards activities of multiple communities including IIIF and W3C’s WADM since 2017. The Bodleian Libraries was instrumental in establishment of IIIF and Data Futures was a founder member. CERN is a founding OpenAIRE partner. Separately, Data Futures established some of the first workflows employing IIIF-Mirador to be deployed in high volume, and has already worked with both community groups as well as WADM to resolve roadblocks and contribute to new functionality. Significantly, the Chair of IIIF’s 3D Community Group has joined PSA-WG membership and has commenced a pilot annotation project using the WG’s pilot infrastructure. The Chair of the Universal Viewer (UV) Community Group has joined PSA-WG, and preliminary discussions about display of existing annotations using UV and development of a roadmap for creation and maintenance of annotations in UV will continue after Plenary-13.

Early drafts of PSA-WG’s Technical Report will be provided to the IIIF and WADM editors and also circulated to the Mirador and UV developers to precipitate consultation before publication. Separately, existing collaboration between PSA-WG and ORCID and Zenodo will continue, leading to development of preservation of annotation documentation for those platforms. Finally, it is expected that consultation with another tier of persistent identifier authorities and repository developers will lead to support for annotation work and data types mirroring what has already been achieved with ORCID and Zenodo through implementation of extensions to those technologies.


[21] http://projectmirador.org/

 

 

3.4 Plan for Adoption

Two phases of adoption of PSA-WG recommendations by the community are envisaged. It has been demonstrated that collaboration with mainstream identifier and repository organizations has already led to delivery of one robust identifier scheme for preservation of annotation in the humanities and science and technology communities. PSA-WG will commence by applying that scheme to historic research investment redelivered by the preliminary partners and then apply it to a new research activity. Following evaluation of these solutions the WG will support an already-identified first tier of early adopter research projects. Publication of use cases from these activities, together with the guidelines identified in WP1 and WP7 of the work-plan in Section 4, will provide an effective blueprint for preservation of annotation-based research data in both redelivery and new projects. Funding applications are also being developed for a second tier of projects by the PSA-WG implementation partners (see Section 5). It is envisaged that these two tiers of adoption will create sufficient critical mass to support an international conference on preservation of scientific annotation in the humanities and science and technology communities during the 12 months following the completion of the WG work-plan.

In a second phase of adoption the WG will work with its collaboration partners in the Earth sciences, bio-technology and medical imaging communities to establish a tier of adoption projects building on the pilot annotation use-cases in those communities. These projects cannot be identified at the outset, but there are clear strategic differences compared with the phase one adoption plan set out above. In the humanities and science and technology, research activities are more corpus-specific in comparison with the selected phase two communities, where relatively homogeneous data supports a multiplicity of research activities. For example, data at a given level of detail might be produced by different sensors: tomography can represent a cross section through a human body using X-rays or ultrasound. This data can support a wide range of expert interpretation but all medical data is about the human body. PSA-WG collaboration partners such as ESIP and ELIXIR and DICOM users have broad constituencies and will enable the WG to develop annotation preservation solutions more generally than the corpus-specific workflows already encountered in the phase one communities.

 

4. Work Plan

4.1 Activities

The PSA-WG work plan is organized into work-packages (WPs) representing three activities:

  • outreach to ensure that the data preservation vulnerabilities arising from incorporating stand-off annotation are understood by the research community; subsequently, effective communication of the Technical Report of PSA-WG as well as its use-case outputs will be central to success of the WG (WP1 and WP7)

  • evaluating and selecting techniques and technologies to plug the gaps in current data preservation planning relating to annotation, and in so doing overcome roadblocks to creating workable end-to-end solutions for long-term preservation of annotation (WP2, WP3, WP4 and WP6)

  • consultation with multiple research communities already employing a range of workflows for creating and maintaining annotation-based research data, in order to create effective preservation solutions which are tailored for those communities and encapsulate them in use-cases; such pilot project activities with PSA-WG partners will occur in WP4 and WP5

These activities will require that PSA-WG produce multiple documents:

  • an Annotation Guideline communicating preservation vulnerabilities, plus an Annotation Primer describing techniques and effective use of annotation, initially posted as PDF documents on the RDA website and subsequently conveyed via summary presentations and posters at key conferences

  • a Technical Report will be produced and posted as a PDF document on the RDA website after evaluating preservation of annotation using the core implementation scheme (see below), and implementing pilot use-cases together with evaluation/reassessment

  • Use-Case publications will serve two purposes—firstly in preliminary form as evaluations supporting the PSA Technical Report, and subsequently as Recommendations tailored for specific communities

PSA-WG’s technical evaluation and selection activities are expected eventually to address multiple identifier implementation schemes, reflecting varying PID adoption and annotation instruments in use by different communities. However, the WG must initially be able to demonstrate concrete preservation solutions within specific fields, in order to articulate vulnerabilities and demonstrate credible remedies to the wider community. Accordingly, the WG will first evaluate a ‘core’ identifier scheme using already-accessible technical components. Discussions among the initial PSA-WG membership (see Section 5), including CERN and ORCID, between P12 and P13 have explored developing extensions of existing services so that robust preservation of stand-off annotation is available at the time of launch of the WG. Specifically, these include the creation of an Annotation WorkType by ORCID and the support of an Annotation Collection data resource type by Zenodo, developments which have now been confirmed to PTTP by those organizations. Together with identification of annotated digital resources using identifiers including URN, this provides PSA-WG with necessary and sufficient components for its core identifier scheme. The development and testing of these mechanisms in partnership with CERN and ORCID early in its timeline will enable PSA-WG to turn its focus rapidly towards development of pilot projects, leading to use cases and consultation with other stakeholders. Proceeding beyond the core identifier scheme through such consultation is addressed in Section 3.
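To make the three components of the core identifier scheme concrete, the following sketch performs syntactic checks on an ORCID iD, a URN and a Zenodo DOI. It is an illustration only: the record field names are hypothetical, the URN pattern is a deliberately loose approximation rather than a full RFC 8141 parser, and the DOI pattern assumes the usual Zenodo form `10.5281/zenodo.<number>`. The ORCID check digit follows the ISO 7064 MOD 11-2 procedure that ORCID documents for its identifiers.

```python
import re

ORCID_RE = re.compile(r"(\d{4})-(\d{4})-(\d{4})-(\d{3})([\dX])")
# Loose URN shape: 'urn:', a namespace identifier, ':', then a non-empty string.
URN_RE = re.compile(r"urn:[a-z0-9][a-z0-9-]{0,31}:\S+", re.IGNORECASE)
ZENODO_DOI_RE = re.compile(r"10\.5281/zenodo\.\d+")


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the hyphenated 16-character form and its check digit."""
    match = ORCID_RE.fullmatch(orcid)
    if not match:
        return False
    base = "".join(match.group(1, 2, 3, 4))
    return orcid_check_digit(base) == match.group(5)


def is_plausible_urn(urn):
    """Check only the overall 'urn:<nid>:<nss>' shape."""
    return URN_RE.fullmatch(urn) is not None


def is_zenodo_doi(doi):
    """Check for a DOI under the Zenodo prefix."""
    return ZENODO_DOI_RE.fullmatch(doi) is not None


def check_core_identifiers(record):
    """Report which of the three identifiers in a (hypothetical) record look valid."""
    return {
        "creator_orcid": is_valid_orcid(record.get("creator_orcid", "")),
        "target_urn": is_plausible_urn(record.get("target_urn", "")),
        "collection_doi": is_zenodo_doi(record.get("collection_doi", "")),
    }


# Hypothetical record linking a contributor, a targeted asset and a collection.
report = check_core_identifiers({
    "creator_orcid": "0000-0002-1825-0097",
    "target_urn": "urn:example:asset:1",
    "collection_doi": "10.5281/zenodo.1234567",
})
```

Checks of this kind only confirm that identifiers are well formed; whether they still resolve to the intended contributor, asset or collection has to be verified against the respective services.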

A potential follow-on activity of PSA-WG which is under consideration is convergence of different instrument-dependent representations for annotation (including variants of OADM) which are currently in use, and which lead to lack of interoperability: annotations created using one instrument cannot be discovered, viewed or maintained using other instruments. Discussion with other stakeholders has commenced, with proposed development of a further Recommendation document to achieve this, but it is not addressed further here.

 

4.2 Work-Packages

In summary, PSA-WG work-packages will include the following communication and technical activities:

WP1: produce ‘Annotation Preservation Vulnerabilities’ guideline

WP2: implement and evaluate ORCID/URN/Zenodo core identifier scheme using ‘Twinger’ corpus[22] (recently-commenced research project)

WP3: redeliver pre-WADM humanities research projects (Bodleian Libraries and Heidelberg University) using core identifier scheme; produce use-cases

WP4: plan and implement pilot preservation projects, leading to production of use-cases in partnership with ESIP and European Bioinformatics Institute (EMBL-EBI), as well as either developing further an existing University College Hospital, London, DICOM pilot (which was commenced after P12) or establishing another medical imaging annotation project

WP5: consultation with California Digital Library, IIIF 3D User Community, Research Activity Identifier, Research Object Consortium to extend the core identifier implementation scheme to additional research activity and digital resource PIDs and potentially to CoolURIs[23]

WP6: production of PSA-WG Technical Report on Preserving Scientific Annotation

WP7: launch Technical Report and Recommendations based on use-cases

 

4.3 Timeline

(see PDF for Table)

4.4 Milestones

It is hoped that endorsement of PSA-WG by TAB will be forthcoming before the beginning of June 2019. Accordingly, a four-month window is available for completion of the WG’s first Guideline before Plenary-14, delivered as an RDA website document and for preparation of presentation and poster materials. PTTP will schedule a mid-term review of PSA-WG activities at Plenary-15, which will also provide an opportunity to update the wider community. All key activities identified in this work-plan will be complete in time for Plenary-16 with the exception of communication activities, which the WG anticipates extending beyond WP7. The following summarizes PSA-WG milestones:


[22] https://indico.hasdai.org/event/26/
[23] https://www.w3.org/Provider/Style/URI

 

PSA-WG work-packages, by month

(see PDF for Table)

 

 

M1: publish Annotation Primer
M2: release vulnerability Guideline to coincide with RDA Plenary-14

M3: complete new annotation and redelivery preservation (Twinger/Basel, Curiosities/Bodleian Libraries and Hachiman/Heidelberg University corpora) using core identifier scheme

M4: commence use-case implementations with ESIP, EMBL-EBI & DICOM communities

M5: release preliminary use-case reports for use in Technical Report and development of community-specific Recommendations for preservation of annotation

M6: complete identifier scheme consultation, extending initial PSA-WG core identifier scheme

M7: release Technical Report and Recommendations to coincide with RDA Plenary-16

 

 

5. Initial Membership

Initial membership of PSA-WG dates from discussions at RDA Plenary-11, between PTTP-IG chairs, Bodleian Libraries, CERN, and Data Futures. Between P-11 and P-12 four legacy humanities research projects from Heidelberg University, plus a Bodleian Libraries project, all with complex annotation, were redelivered using IIIF image services and Invenio, together with conversion of annotations into OADM in a collaboration between CERN and Data Futures. A ‘Preserving Scientific Annotation’ Birds-of-a-Feather meeting took place at P-12, at which this work was presented by Data Futures, Bodleian and CERN, and the following members attended and expressed interest in joining PSA-WG:

  • Adachi, Sumiko: Japan Science and Technology Agency
  • Downs, Robert R.: CIESIN, Columbia University
  • Garcia, Leyla: ELIXIR HUB
  • Hienola, Anca: Finnish Meteorological Institute
  • Jejkal, Thomas: Karlsruhe Institute of Technology (KIT)
  • Jenkyns, Reyna: Ocean Networks Canada
  • Juty, Nick: University of Manchester and Identifiers lead for ELIXIR-UK
  • Kenyon, Jeremy: University of Idaho
  • Lambert, Simon: UKRI-STFC
  • Li, Shih-Chieh Llya: CEO, Xtrea.io
  • Martin, Jose: KAUST
  • Meyers, Natalie: Research Librarian, Notre Dame University
  • Morrison, Monica: Stellenbosch University
  • Stockhause, Martina: DKRZ
  • Weber, Tobias: Leibniz-Rechenzentrum (LRZ)

Since Plenary-12 a number of additional RDA members have expressed interest in joining the WG:

  • Ó Carragáin, Eoghan: University College, Cork Library and Research Object community
  • Decker, Eric: Research Navigator, Europa Institute, Basel
  • Lamberty, Tom: Publisher, Merve Verlag
  • Narock, Tom: Department of Mathematics, Notre Dame of Maryland University, ESIP
  • Weale, Sara: The National Library of Wales, and Chair, UV Community Group
  • Stotzka, Rainer: Karlsruhe Institute of Technology (KIT)

 

Several new proposals for PSA-WG preservation of annotation pilot projects are currently in progress, including with CERN’s Digital Memory Project; with Earth Science Information Partners (ESIP) via Jet Propulsion Laboratory, Caltech; with ELIXIR via The European Bioinformatics Institute; and with ORCID. Additionally, projects with The Bodleian Libraries, Maison de l'Orient et de la Méditerranée Jean Pouilloux, Lyon, Mnemoscene and with Notre Dame's Reilly Center for Science and Technology have commenced before Plenary-13. As a result, a new tier of pilot implementation partners is also in place to develop use-cases. These partners, together with the WG chairs, are represented by the following RDA members:

 

  • Cornwell, Peter (co-chair): University of Westminster and Director, Data Futures LBG
  • Haak, Laure: Executive Director, ORCID
  • Jefferies, Neil: Head of Innovation, The Bodleian Libraries, Oxford University
  • Juty, Nick: Identifiers lead for ELIXIR-UK interoperability platform
  • Gonzalez, Jose: Section Leader, Digital Repositories, CERN
  • Meyers, Natalie (co-chair): Head of Digital Scholarship, Notre Dame University
  • McGibbney, Lewis: JPL and Chair, ESIP Semantic Technology Committee
  • Morandiere, Bruno: Head of Digital Infrastructure for Overseas Laboratories, CNRS
  • Serif, Ina (co-chair): Department of History, University of Basel
  • Silverton, Edward: Director, Mnemoscene Ltd and Chair, IIIF 3D Community Group

 

6. Copyright Notice, License and Disclaimers

Copyright © 2019 PTTP editors and contributors. Published by the RDA Preservation Techniques, Tools and Policies Interest Group (PTTP-IG) under the CC-BY license; see disclaimer.

 

6.1 License

Specifications published by PTTP-IG are made available using the Creative Commons Attribution Required (CC-BY) license.

Please note that this license forbids the assertion, implied or explicit, that PTTP-IG, RDA or any of its members endorses or is in any way associated with uses of the specifications or implementations thereof.

 

 

6.2 Disclaimers

Specifications published by PTTP-IG are made available with the following disclaimer of liability:

This document is provided “as-is”, and copyright holders make no representations or warranties, express or implied, including, but not limited to, warranties of merchantability, fitness for a particular purpose, non-infringement, or title; that the contents of this document are applicable for any purpose; nor that the implementation of such contents will not infringe any third-party patents, copyrights, trademarks or other rights.

Copyright holders will not be liable for any direct, indirect, special or consequential damages arising out of any use of PTTP documents or the performance or implementation of the contents thereof.

This disclaimer is based on that employed by W3C specifications.

Review period start:
Thursday, 18 April, 2019 to Saturday, 18 May, 2019
Custom text:
Body:

RDA Node Slovenia is a national RDA node established to act as a long-term central contact point between the Research Data Alliance and data practitioners, funding organizations, research agencies and other relevant stakeholders in Slovenia. RDA Node Slovenia is coordinated by the Slovenian Social Science Data Archives (ADP). The RDA Node Slovenia data community was initially composed of representatives of the Humanities (DARIAH-SI) and Linguistics (CLARIN.SI) research data infrastructures and the University of Ljubljana. The node is open to additional research data infrastructures, researchers and other interested stakeholders.

The aims of the Node are general, with specific emphasis on coordinating infrastructure development based on internationally recognised standards, e.g. CoreTrustSeal (CTS), and on developing journal policies, one of the points in the National Action Plan that can foster a data sharing culture:

  1. To foster a wide range of stakeholders across scientific domains in a diverse Slovenian data community and encourage their active engagement in this community;
  2. To represent Slovenian interests in the RDA Interest/Working groups and governing structures;
  3. To raise awareness of RDA activities, events and funding calls while encouraging active involvement of new Slovenian members in RDA activities;
  4. To promote RDA’s nationally relevant outputs, recommendations, and ICT Technical specifications in order to stimulate their adoption in the Slovenian research environment;
  5. To contribute to the implementation of the policy of open access to research data at a national level;
  6. To interact with national research funding bodies, ministries, and other relevant government officials to influence the implementation of an effective and durable open research data policy and digital research agendas;
  7. To promote and support data management best practices, standards and solutions in the Slovenian research area;
  8. To promote data citation, sharing rewards and crediting best practices amongst Slovenian publishers, and to stimulate the development of journal policies on research data deposit in connection with paper publication.

We invite you to join the RDA Node Slovenia group by clicking the Join Group button on the right. By doing so, you will be automatically subscribed to the group mailing list and will receive all updates about the activities of RDA Node Slovenia.

Review period start:
Thursday, 28 February, 2019
Custom text:
Body:

RDA Reproducible Health Data Services Working Group

Purpose: Revised Case Statement for application as RDA Working Group: https://www.rd-alliance.org/sites/default/files/case_statement/Reproduci...

 

*This revised charter is dated 29 March 2020. Previous and revised charters can be found attached below.*

 

Reproducible Health Data Services WG Charter

The goal of the working group is to enhance the reuse of health data for research and improve the FAIRness levels of aggregated and curated data sets for secondary use by providing recommendations to enhance the reproducibility of data curation services.

Processes of health data curation are often conducted by data service providers within centers of health informatics, health data brokerage, or health statistics; we aim to include all such centers within the ambit of "health data services".
Examples of health data service stakeholders include: health data curation centers, medical data services, clinical data integration centers, biostatistics and systems medicine institutes, and other data centers that assimilate, manage, and distribute health data for various primary and secondary uses such as research, innovation, quality assurance and improvement, and efficiency monitoring. Health data services facilitate the use and reuse of data in different contexts surrounding health care and health research. The data span biomedical domains, including clinical, genomic, and patient-generated health data repositories.

The actors involved in data services perform many tasks such as data curation, mapping, integration, and publishing. These interdependent tasks build upon each other to create workflows that transform siloed data into new, curated datasets, requiring the navigation of data interoperability, data quality, and data security. Thus, understanding these health services processes is vital to support reproducibility and ensure FAIR data practices.

The case statement outlines our work and provides the focus and the boundaries for the working group activities.

The following stakeholders will potentially benefit from our contribution:

  • Data curators/brokers in their daily activities
  • Data consumers (e.g., clinical researchers, application developers, innovators)
  • Health research data repositories or archivists
  • Health research funders

The benefits may include the ability to reuse processes, gain credit for work, provide transparency, and facilitate machine readable workflows pertaining to the collection, cleaning, and curation of health data for analysis and sharing. 

The RDA Reproducible Health Data Services Working Group (i) will provide recommendations to identify, capture, and store metadata documenting workflows for collecting and curating health data for secondary reuse, and (ii) will develop an adoption and training guide to improve the uptake of our outputs.

 

Review period start:
Tuesday, 5 February, 2019 to Tuesday, 5 March, 2019
Custom text:
Body:

**NOTE: The following text has been revised. The authoritative Charter Statement is here.**

 

Research Data Management in Engineering

 

Introduction

Research in Engineering comprises a vast span of sub-disciplines including, for example, chemical, civil, electrical, and mechanical engineering. Traditionally, engineering disciplines belong to the applied sciences, which cooperate closely with industry and look to create commercial advantage from the research. Therefore, open data and data sharing are rarely considered, even when the research has been completed and the economic interests are secured.

 

The Interest Group on “Research Data Management in Engineering” (IG RDM4Eng) will identify, collect and compare industrial and institutional workflows, services and tools regarding ‘engineering research data’.

 

The Interest Group presents an opportunity to highlight emerging FAIR (Findable, Accessible, Interoperable, Re-usable) data approaches from scientific and industrial engineering disciplines and to explore how data tools can be used ‘as a service’ to break up existing community-specific 'data silos'. The proposal’s background is based on projects and initiatives in the area of engineering sciences, and especially relates to the challenges posed by contract or mission-oriented research which is performed together with industrial stakeholders.

 

User scenario(s) or use case(s) the IG wishes to address

The engineering community is highly fragmented in terms of its RDM organization. Data and the descriptive documentation and software basis are crucial components of sustainable engineering research. Within the engineering science community at universities, initiatives such as the American Association of Engineering Societies (AAES) and CESAER (Conference of European Schools for Advanced Engineering Education and Research) with the Task Force Open Science (TFOS) have shown that the approach to RDM is often bottom-up. This is in contrast to other disciplines where RDM is strongly driven by professional associations (e.g. DARIAH in the humanities and ELIXIR in life sciences).

 

A major challenge is the heterogeneity of data generated by research groups, even if they investigate the same phenomenon. The data obtained in the primary data analysis are usually kept by doctoral students and scientists on local storage or on backup storage. Many scientists pursue their own strategy for file naming and documentation of their research data. Because scientists use different systems, there are few systematic records within the discipline. Standardized metadata records are not yet widespread, which results in data that are difficult to retrieve and access. Data publication is far from being the norm, and the FAIR principles are far from being properly implemented.

 

Within industry, on the other hand, particularly Industry 4.0, data management comprises a strategic business approach that applies a consistent set of business processes in support of the collaborative creation, management, dissemination, and use of a product and/or a service. IT systems supporting these industrial data management processes are known under different names:

 

  • Engineering data management (EDM system)
  • Product data management (PDM system)
  • Product life cycle management (PLM system)
  • Collaborative product development

 

Thanks to these management systems, the industrial workflows of product and service engineering are usually well documented. What is missing, however, are common interfaces and protocols for managing, accessing and re-using research data from industry and academia. Each data provider has its own service offering and returns data in different (proprietary) formats with different licenses and costs. Additionally, commercial data providers are often constrained to particular business sectors in specific geographical areas and keep their data locked within isolated data sets. The combination of these factors hinders interoperability, further uptake in FAIR data platforms (such as the proposed European Open Science Cloud) and a better data value chain around corporate information.

 

During the last year, several workshops and interviews were conducted in the CESAER context with participating engineers in the Netherlands, Ireland and Germany, representing major engineering communities such as computational engineering, mechanical engineering, construction and thermodynamics.

 

Major outcomes are needs for

  • an introduction to tailored, existing methods and tools regarding RDM, which are adaptable to the needs of the specific sectors (also in combination with educational resources, e.g. persons who speak the ‘language’ and can demonstrate & adapt existing RDM tools)
  • curated standards for metadata and data modelling that are in line with the FAIR data principles
  • a harmonized software coding platform for engineers (source code presents ‘the most important’ research data within many engineering communities)
  • a basic data documentation guideline (or maybe even a standard), especially concerning contract and mission oriented research with industrial partners as well as journal guidelines
  • better, coordinated access to HPC facilities.

 

We see the introduction of an IG in RDA as an opportunity to seek solutions in a broader international context, activating engineering scientists from all over the world. We also expect that an IG will provide stronger leverage when it comes to engaging the industrial sector. Industry members are usually not present at university-based or scientific-community-based workshops, but RDA provides a framework which is nowadays widely recognized.

 

 

 

Objectives

The proposed “Research Data Management in Engineering” Interest Group (IG RDM4Eng) seeks to bring together scientific and industrial stakeholders from all relevant sectors. The IG RDM4Eng will provide its scientific and industrial members with the opportunity to discuss and address the legal and technological challenges to the adoption of FAIR data and software management in Engineering, to exchange knowledge, opinions and experiences, and to form or participate in Working Groups that address these challenges.

 

This includes in particular contract research and its associated privacy and security concerns (such as the conditions included in non-disclosure agreements), and the important role of software and source code as ‘research data types’ within the engineering sector:

 

1) Use case: coding base & project data: Analyzing the use of software source code repositories such as GitHub and GitLab by the engineering sector and identifying common needs and best practices, such as a centralized GitLab/GitHub framework with similar best practices and standards for computational engineering and software management. In addition to software management, this use case will also identify workflows and services to facilitate, standardize and harmonize the transfer of engineering project results (e.g. data.DURAARK.eu, an architectural engineering example, or 4TU.ResearchData) into a broader FAIR knowledge base. Currently, the outcomes of such projects (data results and accompanying repositories) are usually listed on discovery platforms such as re3data.org. In principle, however, they should also be distributed in emerging FAIR assessment platforms and tools such as FAIRsharing.org, where they can be linked to metadata and data standards.

 

2) Use case: privacy and security in engineering data management: Given the lack of national and international legal harmonisation of scientific and industrial data protection and access, different approaches, contracts (e.g. non-disclosure agreements) and protocols will have to be assessed and compared in order to improve the FAIRness of engineering research data. One example is the use of research data as support for intellectual property and patent claims, and the role of research data in institutional technology transfer offices, where public data sharing is often seen as giving away assets and know-how. Thus, scientists may have to be pointed to ways of obtaining legal advice, as well as to appropriate contract templates for planning and executing FAIR best practices in research data management in the case of mission-oriented research.

 

Based on the use cases outlined above, the preliminary focus of the IG RDM4Eng will cover the following areas:

 

  • Engineering Data and Code landscape
    • defining a list of tools dealing with Engineering Data Management in academia and industry
    • defining and evaluating existing and developing engineering data platforms
    • disseminating the IG results within other relevant engineering organisations on a global, European and national scale

 

  • Privacy and Security in Engineering Data
    • sharing best practice on non-disclosure agreements with industrial stakeholders and differential privacy
    • developing models for dynamic consent that protect industrial and institutional interests while enabling data sharing ‘as open as possible, as closed as necessary’
    • providing a forum for discussing, explaining and responding to data regulation issues on a national and international level

 

As a result, the IG RDM4Eng will build a knowledge base in order to share technical practices, identify common data and service requirements, and facilitate search and analysis of existing FAIR data solutions for interoperability challenges that are shared among engineering research infrastructures, universities and companies. The IG will seek collaboration with those RDA groups that have affinity to the objectives mentioned above, as well as with external organisations (such as AAES, CESAER, NIST), past and ongoing engineering projects (Big Data Europe, BOOST 4.0, DURAARK), and industrial stakeholders from different engineering sectors:

  • Automotive sector
  • Construction sector
  • Computational engineering sector
  • Mechanics
  • Architectural engineering
  • Chemical engineering
  • Coastal engineering
  • and others

 

Participation

This IG will be open to all RDA members from all countries and scientific disciplines. Particularly, but not exclusively, the IG will welcome members from the following backgrounds:

  • Scientists involved in contract research, to share their experience in dealing with RDM questions and non-disclosure agreements
  • Industrial representatives from major and minor companies representing engineering science and the industry (particularly Industry 4.0) sector
  • Practitioners of software engineering for the industry sector
  • Policy-makers for non-disclosure agreements & legal experts
  • Data Stewards and related research data experts
  • HPC and distributed computing experts

 

Outcomes

Major/Preliminary outcomes of the IG RDM4Eng will include the following:

  • Strengthen the connection between the industrial and academic sector
  • Bring light to the issue of contract and mission oriented engineering research from global and national points of view
  • Establish an exchange & information knowledge base for engineering data types and software products
  • Display funder guidelines and best practices
  • Solve particular problems, like those identified in the CESAER context, by spawning RDA Working Groups

 

 

Mechanism

Outputs and recommendations will be produced based on consensus of the participating RDA group members. All topics will be openly discussed via the RDA communication platform providing a CMS, document store, and Wiki.

 

At the RDA plenaries, the IG will organize group sessions and will interact with other RDA groups, e.g. by organizing joint sessions. In between plenaries, regular virtual conferences will guarantee the continuity of activities and encourage the continuous exchange of information.

 

The initial co-chairs will accompany the group’s creation and establish the activities. It is intended to conduct a co-chair election every two years.

 

The proposed IG has identified overlap with regard to contents with the following RDA groups:

  • IG Chemistry Research Data
  • IG Data Fabric
  • IG Data Foundations and Terminology
  • WG Data Type Registries
  • WG Data Versioning
  • IG Disciplinary Collaboration Framework
  • IG Domain Repositories
  • IG From Observational Data to Information
  • IG Health Data
  • WG Blockchain Applications in Health
  • WG International Materials Resource Registries
  • WG Metadata Standards Catalog
  • IG RDA/CODATA Legal Interoperability
  • IG RDA/CODATA Materials Data, Infrastructure & Interoperability
  • IG RDA/NISO Privacy Implications of Research Data Sets
  • IG Reproducibility
  • WG Research Data Collections
  • IG Software Source Code
  • IG Vocabulary Services

 

While the IG RDA/CODATA Materials Data, Infrastructure & Interoperability and the IG RDA/NISO Privacy Implications of Research Data Sets in particular have conceptual similarities with the IG RDM4Eng, to our knowledge none of the above groups focuses on including both industrial and scientific stakeholders from the engineering sector and bringing them together on both a European and an international scale.

 

 

 

Timeline

We are looking forward to having the IG established at the 13th RDA Plenary Meeting in April 2019 in Philadelphia, USA. The first outcomes of this IG are planned to be presented in a timely fashion using the RDA platform and file repository structure, with a formal presentation and discussion no later than 12 months after the establishment of the IG.

 

 

List of initial members

 

Name | Affiliation | Country | Chair
Marta Teperek | TU Delft | Netherlands |
Susanna-Assunta Sansone | Uni of Oxford, Dep of Engineering Science (and RDA FAIRsharing WG) | UK |
Alastair Dunning | TU Delft | Netherlands |
Daniela Hausen | RWTH Aachen University | Germany | Chair
Angelina Kraft | Technische Informationsbibliothek (TIB) German National Library of Science and Technology | Germany | Co-Chair
Markus Stocker | Technische Informationsbibliothek (TIB) German National Library of Science and Technology | Germany |
Gerald Jagusch | ULB Darmstadt | Germany |
Nanette Rißler-Pipka | Karlsruhe Institute of Technology (KIT) | Germany |
David Wallom | University of Oxford | UK |
Kyong-Ha Lee | Korea Institute of Science and Technology Information | South Korea |
Jonathan Petters | Virginia Tech | USA |
Gretchen Greene | National Institute of Standards and Technology (NIST) | USA |

 

Review period start:
Tuesday, 29 January, 2019 to Friday, 1 March, 2019
Custom text:
