Use Case for Provenance of Linguistic Annotations

    Problem Statement:

    We have complex linguistic annotations that are represented in XML. Traditionally, provenance information has been carried directly within the XML structure of the annotation, but the current structures are idiosyncratic and not sufficient for conveying the full chain of provenance that might occur for a given annotation.

    Example of current XML structure:

     <treebank>
       <date>2014-01-01:00:00:00Z</date>

       <annotator>
         <short>john</short>
         <name>John Smith</name>
         <address>University of XYZ</address>
       </annotator>

       <annotator>
         <short>mary</short>
         <name>Mary Smith</name>
         <address>University of ABC</address>
       </annotator>

       <annotator>
         <short>jane</short>
         <name>Jane Doe</name>
         <address>University of ABC</address>
       </annotator>

       <sentence>
         <primary>john</primary>
         <primary>jane</primary>
         <secondary>mary</secondary>
         …
         …
       </sentence>
       …
     </treebank>

    Here the schema supports one or more dates specified at the level of the entire file (/treebank/date), a full list of all contributing annotators, also specified at the level of the entire file (/treebank/annotator), and pointers to the primary and secondary annotators for each sentence (/treebank/sentence/primary|secondary), which reference the annotators by short name.

    There are many deficiencies in this schema, and we can leverage existing provenance standards (e.g. W3C PROV) to provide a finer-grained specification of provenance.
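
    As a rough sketch of the direction this could take, the provenance currently implied by the example above (a date, a human annotator, and an annotated sentence) might be restated in the PROV-XML serialization along these lines. The ex: identifiers, the single annotation activity, and the mapping of annotators to prov:Person agents are assumptions made purely for illustration, not a finished model:

     <prov:document
         xmlns:prov="http://www.w3.org/ns/prov#"
         xmlns:xsd="http://www.w3.org/2001/XMLSchema"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:ex="http://example.org/">

       <!-- the annotated sentence as a PROV entity (hypothetical identifier) -->
       <prov:entity prov:id="ex:sentence-1-annotation"/>

       <!-- a human annotator as a PROV agent -->
       <prov:agent prov:id="ex:john">
         <prov:type xsi:type="xsd:QName">prov:Person</prov:type>
         <prov:label>John Smith</prov:label>
       </prov:agent>

       <!-- the annotation activity, carrying the date now stored at /treebank/date -->
       <prov:activity prov:id="ex:annotate-sentence-1">
         <prov:startTime>2014-01-01T00:00:00Z</prov:startTime>
       </prov:activity>

       <!-- the annotation was generated by that activity, which the annotator carried out -->
       <prov:wasGeneratedBy>
         <prov:entity prov:ref="ex:sentence-1-annotation"/>
         <prov:activity prov:ref="ex:annotate-sentence-1"/>
       </prov:wasGeneratedBy>
       <prov:wasAssociatedWith>
         <prov:activity prov:ref="ex:annotate-sentence-1"/>
         <prov:agent prov:ref="ex:john"/>
       </prov:wasAssociatedWith>
     </prov:document>

    Nothing in this pattern is tied to the file level, so the same statements could be minted per sentence or per word as needed.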

    Some of the provenance scenarios that we must support include:

    • specification of pre-processing steps that impact the nature of the annotation (primarily tokenization and segmentation algorithms)

    • distinguishing between machines and persons as agents producing and reviewing the annotation

    • identifying, per annotated item, any automated service as an agent that inserted data into an annotation

    • identifying when a person agent actively corrected an annotation supplied by a software agent (see the software-agent sketch following this list)

    • more complex edit/review workflows (the example shown has two primary annotators for a sentence and a third, secondary annotator responsible for reconciling differences between the two primary annotations; see the reconciliation sketch following this list). Some of the use cases we want to support for these annotations include:

      • partial annotation by a machine (e.g. automatic parsing of morphology) curated by a human

      • classroom collaboration on an annotation corrected by one or more instructors or expert reviewers

      • collaboration on an annotation by two or more ‘expert’ individuals

    • annotations which are created over a long period of time (a single text made up of thousands of sentences may take months to fully annotate)

    • provenance data specified at the level of the sentence or maybe even the individual word

    • individual contributions in the form of micro-publications (such as the annotation of a single sentence, or of a subset of words in a sentence) that can be combined to make up a single annotation of an entire text (or subset of a text)

    • specification of the annotation ontologies and algorithms for presenting those ontologies to the annotator(s)

      • different versions of a single ontology, or multiple ontologies, may also have been used over the course of annotation. Ideally we should be able to provide access to the specific version of an annotation ontology used for any single annotation within the larger document (see the ontology-version sketch following this list).
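
    The software-agent and correction scenarios above could be captured by typing agents explicitly and deriving the corrected annotation from the machine-produced one. This is again only a sketch: the parser name, the word-level identifiers, and the choice of prov:wasDerivedFrom are illustrative assumptions:

     <prov:document
         xmlns:prov="http://www.w3.org/ns/prov#"
         xmlns:xsd="http://www.w3.org/2001/XMLSchema"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:ex="http://example.org/">

       <!-- a morphological parser as a software agent, a reviewer as a person -->
       <prov:agent prov:id="ex:morph-parser-v2">
         <prov:type xsi:type="xsd:QName">prov:SoftwareAgent</prov:type>
       </prov:agent>
       <prov:agent prov:id="ex:jane">
         <prov:type xsi:type="xsd:QName">prov:Person</prov:type>
       </prov:agent>

       <!-- the machine-produced annotation of one word -->
       <prov:entity prov:id="ex:word-3-annotation-auto"/>
       <prov:wasAttributedTo>
         <prov:entity prov:ref="ex:word-3-annotation-auto"/>
         <prov:agent prov:ref="ex:morph-parser-v2"/>
       </prov:wasAttributedTo>

       <!-- the human correction, derived from the machine output and attributed to the person -->
       <prov:entity prov:id="ex:word-3-annotation-corrected"/>
       <prov:wasDerivedFrom>
         <prov:generatedEntity prov:ref="ex:word-3-annotation-corrected"/>
         <prov:usedEntity prov:ref="ex:word-3-annotation-auto"/>
       </prov:wasDerivedFrom>
       <prov:wasAttributedTo>
         <prov:entity prov:ref="ex:word-3-annotation-corrected"/>
         <prov:agent prov:ref="ex:jane"/>
       </prov:wasAttributedTo>
     </prov:document>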
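
    The two-primary-plus-reconciler workflow from the example might be modeled by having the secondary annotator's reconciliation activity use both primary annotations and generate the final one (identifiers again hypothetical):

     <prov:document
         xmlns:prov="http://www.w3.org/ns/prov#"
         xmlns:ex="http://example.org/">

       <!-- two independent primary annotations of the same sentence -->
       <prov:entity prov:id="ex:sentence-1-annotation-john"/>
       <prov:entity prov:id="ex:sentence-1-annotation-jane"/>

       <!-- the reconciled annotation and the secondary annotator who produced it -->
       <prov:entity prov:id="ex:sentence-1-annotation-final"/>
       <prov:agent prov:id="ex:mary"/>

       <!-- the reconciliation activity used both primary annotations ... -->
       <prov:activity prov:id="ex:reconcile-sentence-1"/>
       <prov:used>
         <prov:activity prov:ref="ex:reconcile-sentence-1"/>
         <prov:entity prov:ref="ex:sentence-1-annotation-john"/>
       </prov:used>
       <prov:used>
         <prov:activity prov:ref="ex:reconcile-sentence-1"/>
         <prov:entity prov:ref="ex:sentence-1-annotation-jane"/>
       </prov:used>

       <!-- ... and generated the final annotation, with the secondary annotator responsible -->
       <prov:wasGeneratedBy>
         <prov:entity prov:ref="ex:sentence-1-annotation-final"/>
         <prov:activity prov:ref="ex:reconcile-sentence-1"/>
       </prov:wasGeneratedBy>
       <prov:wasAssociatedWith>
         <prov:activity prov:ref="ex:reconcile-sentence-1"/>
         <prov:agent prov:ref="ex:mary"/>
       </prov:wasAssociatedWith>
     </prov:document>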
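
    Finally, recording which ontology version informed an annotation could be done by treating each versioned ontology release as an entity that the annotation activity used. The ontology name and version number in the identifier below are invented for illustration:

     <prov:document
         xmlns:prov="http://www.w3.org/ns/prov#"
         xmlns:ex="http://example.org/">

       <!-- a specific release of an annotation ontology, modeled as an entity -->
       <prov:entity prov:id="ex:morphology-ontology-1.2"/>

       <!-- the annotation activity consulted exactly that version -->
       <prov:activity prov:id="ex:annotate-sentence-1"/>
       <prov:used>
         <prov:activity prov:ref="ex:annotate-sentence-1"/>
         <prov:entity prov:ref="ex:morphology-ontology-1.2"/>
       </prov:used>
     </prov:document>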
