Use Case for Provenance of Linguistic Annotations

Problem Statement:

We have complex linguistic annotations that are represented in XML. Traditionally, provenance information has been carried directly within the XML structure of the annotation, but the current structures are idiosyncratic and cannot convey the full chain of provenance that might occur for a given annotation.

Example of current XML structure:





   <name>John Smith</name>
   <address>University of XYZ</address>

   <name>Mary Smith</name>
   <address>University of ABC</address>

   <name>Jane Doe</name>
   <address>University of ABC</address>











The schema supports one or more dates specified at the level of the entire file (/treebank/date), a full list of all contributing annotators, also specified at the level of the entire file (/treebank/annotator), and pointers to primary and secondary annotators specified at the level of the sentence (/treebank/sentence/primary|secondary), which reference an annotator by short name.
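To make the file-level and sentence-level structure concrete, the following is a minimal sketch of how the annotator pointers could be resolved. It assumes a hypothetical sample file: the paths come from the description above, but the <short> element name, the id attribute, the short names (jsmith, msmith, jdoe) and the date value are illustrative assumptions, not taken from a real treebank file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Paths follow the description in the text; the <short>
# element, the id attribute, the short names and the date are assumptions.
SAMPLE = """\
<treebank>
  <date>2012-01-01</date>
  <annotator>
    <short>jsmith</short>
    <name>John Smith</name>
    <address>University of XYZ</address>
  </annotator>
  <annotator>
    <short>msmith</short>
    <name>Mary Smith</name>
    <address>University of ABC</address>
  </annotator>
  <annotator>
    <short>jdoe</short>
    <name>Jane Doe</name>
    <address>University of ABC</address>
  </annotator>
  <sentence id="1">
    <primary>jsmith</primary>
    <primary>msmith</primary>
    <secondary>jdoe</secondary>
  </sentence>
</treebank>
"""


def annotators_by_short_name(root):
    """Map each file-level annotator's short name to the full name."""
    return {a.findtext("short"): a.findtext("name")
            for a in root.findall("annotator")}


def sentence_annotators(root):
    """Resolve the primary/secondary pointers of every sentence to full names."""
    names = annotators_by_short_name(root)
    roles = {}
    for sentence in root.findall("sentence"):
        roles[sentence.get("id")] = {
            role: [names.get(p.text, p.text) for p in sentence.findall(role)]
            for role in ("primary", "secondary")
        }
    return roles


root = ET.fromstring(SAMPLE)
roles = sentence_annotators(root)
print(roles)
```

Under these assumptions, a name and a role per sentence is all that can be recovered. Nothing in the structure says which annotator touched which word, when, or whether an agent was a person or a program.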

This schema has many deficiencies: it records only dates and an annotator list for the file as a whole, and annotator roles per sentence. We can leverage existing provenance standards (e.g. W3C PROV) to provide finer-grained specification of provenance.

Some of the provenance scenarios that we must support include:

  • specification of pre-processing steps that impact the nature of the annotation (primarily tokenization and segmentation algorithms)

  • distinguishing between machines and persons as agents producing and reviewing the annotation

  • identifying any automated service as an agent that inserted data into an annotation, per annotated item

  • identifying when a person agent actively corrected an annotation supplied by a software agent

  • more complex edit/review workflows (the example shown has two primary annotators for a sentence and a third, secondary annotator responsible for reconciling differences between the two primary annotations). Some of the use cases we want to support for these annotations include:

    • partial annotation by a machine (e.g. automatic parsing of morphology) curated by a human

    • classroom collaboration on an annotation corrected by one or more instructors or expert reviewers

    • collaboration on an annotation by two or more ‘expert’ individuals

  • annotations which are created over a long period of time (a single text made up of thousands of sentences may take months to fully annotate)

  • provenance data specified at the level of the sentence or possibly even the individual word

  • individual contributions of micro-publications (such as annotation of a single sentence, or subset of words in a sentence) which can be combined to make up a single annotation of an entire text (or subset of a text)

  • specification of the annotation ontologies and algorithms for presenting those ontologies to the annotator(s)

    • different versions of a single ontology, or multiple ontologies, may also have been used through the course of annotation. Ideally, we should be able to provide access to the specific version of an annotation ontology used for any single annotation within the larger document.
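Several of the scenarios above (machine versus person agents, human correction of machine output, word-level provenance, ontology versions) can be sketched with a handful of W3C PROV terms. The following is a simplified illustration, not a conforming PROV serialization: the ProvGraph class, its methods, and every ex: identifier are hypothetical, and only the prov: type and relation names are borrowed from the standard.

```python
from dataclasses import dataclass, field


@dataclass
class ProvGraph:
    """Toy store of typed nodes and (subject, relation, object) statements."""
    types: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add(self, ident, prov_type):
        self.types[ident] = prov_type
        return ident

    def relate(self, subject, relation, obj):
        self.relations.append((subject, relation, obj))

    def objects(self, subject, relation):
        return [o for s, r, o in self.relations
                if s == subject and r == relation]

    def attributed_to_type(self, entity, prov_type):
        return any(self.types[a] == prov_type
                   for a in self.objects(entity, "prov:wasAttributedTo"))

    def human_corrected(self, entity):
        """True if a person's entity revises an entity from a software agent."""
        if not self.attributed_to_type(entity, "prov:Person"):
            return False
        return any(self.attributed_to_type(earlier, "prov:SoftwareAgent")
                   for earlier in self.objects(entity, "prov:wasRevisionOf"))


g = ProvGraph()
# Agents: a program and a person (all identifiers are hypothetical).
parser = g.add("ex:morph-parser", "prov:SoftwareAgent")
jsmith = g.add("ex:jsmith", "prov:Person")
# Entities: tokenizer output, one ontology version, two versions of one word.
tokens = g.add("ex:sentence1-tokens", "prov:Entity")
ontology = g.add("ex:tagset-v1.2", "prov:Entity")
draft = g.add("ex:sentence1-word3-v1", "prov:Entity")
fixed = g.add("ex:sentence1-word3-v2", "prov:Entity")
# Activity: the automatic parse that produced the draft annotation.
parse = g.add("ex:auto-parse", "prov:Activity")

g.relate(parse, "prov:wasAssociatedWith", parser)
g.relate(parse, "prov:used", tokens)
g.relate(parse, "prov:used", ontology)
g.relate(draft, "prov:wasGeneratedBy", parse)
g.relate(draft, "prov:wasAttributedTo", parser)
g.relate(fixed, "prov:wasRevisionOf", draft)
g.relate(fixed, "prov:wasAttributedTo", jsmith)

print(g.human_corrected(fixed))
print(g.human_corrected(draft))
```

Because each statement attaches to an individual entity, the same pattern could apply at the level of a word, a sentence, or a whole text, and the prov:used link to a versioned ontology entity records which version was in effect for that particular annotation.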