Gold standard corpus generation comprises steps such as document collection/selection,
manual document annotation, annotation result collection, and statistical
analysis.
The first
and last steps can be computationally assisted and partially automated. However,
this is not the case for the manual document annotation. Also called “curation”,
manual document annotation involves manually scanning the document
text to identify environment-descriptive terms and mapping them to unique
identifiers from a community resource (the Environment Ontology (EnvO)
in this case).
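To make the mapping concrete, the sketch below shows what a single annotation record might look like. It is a minimal illustration only: the document ID, character offsets, EnvO identifier, and the tab-separated output format are assumptions for the example, not the project's actual corpus format.

```python
# Hypothetical annotation record: a text span mapped to an EnvO identifier.
# The document ID, offsets, EnvO ID, and output format below are illustrative
# assumptions, not the corpus' actual format.
annotation = {
    "document_id": "EOL:12345",   # hypothetical EOL species page identifier
    "start": 102,                 # character offset where the term begins
    "end": 112,                   # character offset where the term ends
    "term": "coral reef",         # environment-descriptive term as it appears in the text
    "envo_id": "ENVO:00000150",   # EnvO identifier used here for illustration
}

# Serialise as one tab-separated line, one annotation per line.
line = "\t".join(
    str(annotation[k]) for k in ("document_id", "start", "end", "term", "envo_id")
)
print(line)
```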
The tediousness
and time demands of such a process call for a collaborative effort. An international
group of six researchers: Lucia Fanini, Sarah Faulwetter, Evangelos Pafilis, Christina Pavloudi, Julia Schnetzer, Katerina Vasileiadou (in alphabetical order) has
undertaken this task.
Coming from a diverse range of scientific backgrounds (such
as ecology, computational biology, molecular biology, and systematics), they
bring different mindsets to scanning pieces of text, in a way
representing different EOL readers.
Such
pluralism is a desired feature for the corpus curation; however, a common
understanding among team members has to be established.
This was
one of the main aims of the test curation (“dry run”) that took place during
June 2013. A small set of documents (text sections from EOL species pages, see post) was delivered to all curators. While manually annotating these
documents, the curators collected as many
questions as possible about unclear and/or problematic annotation cases. Examples
of the latter include: terms and/or synonyms missing from EnvO, words that
could be mapped to multiple EnvO terms, location names, and nested environment-descriptive
terms.
A strategy employing
a set of flags to indicate such cases is now in place. The previously generated
curation guideline document (see post) has been updated
accordingly, and the production-level curation may now start.
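As an illustration of how such flags could be recorded alongside the annotations, the sketch below attaches a flag and an optional note to a problematic case. The flag names are hypothetical and only mirror the problem categories listed above; the actual flag set is the one defined in the curation guideline document.

```python
# Hypothetical flag names mirroring the problematic cases described above;
# the real flag set is defined in the curation guidelines.
FLAGS = {
    "MISSING_TERM",   # term or synonym not (yet) present in EnvO
    "AMBIGUOUS",      # word could map to more than one EnvO term
    "LOCATION",       # location name rather than an environment type
    "NESTED",         # environment-descriptive term nested within another
}

def flag_annotation(annotation: dict, flag: str, note: str = "") -> dict:
    """Return a copy of an annotation with a curation flag and optional note attached."""
    if flag not in FLAGS:
        raise ValueError(f"Unknown flag: {flag}")
    flagged = dict(annotation)
    flagged["flag"] = flag
    flagged["note"] = note
    return flagged

# Example: a word that could map to several EnvO terms.
flagged = flag_annotation(
    {"term": "bank", "envo_id": None},
    "AMBIGUOUS",
    note="could refer to a marine bank or a river bank",
)
print(flagged)
```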