Amid high July and August temperatures for some of the team members, visits to
associated labs for others, and as a side activity to normal lab/office
work for the rest, the most tedious and time-consuming part of this
project has now been completed.
Environments-600 (E600), a corpus comprising 600 EOL Taxa pages, was evenly and randomly
distributed among the 6 curators (4 graduate students, 2 postdocs; see June’s post).
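An even, random distribution of this kind can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used: the function name `split_evenly`, the integer document IDs and the fixed seed are all assumptions made for the example.

```python
import random


def split_evenly(doc_ids, n_curators, seed=0):
    """Shuffle the documents and cut them into equally sized parts.

    Hypothetical helper: the document IDs and the seed are placeholders.
    """
    docs = list(doc_ids)
    if len(docs) % n_curators != 0:
        raise ValueError("documents cannot be divided evenly among curators")
    # A seeded generator keeps the assignment reproducible.
    random.Random(seed).shuffle(docs)
    size = len(docs) // n_curators
    return [docs[i * size:(i + 1) * size] for i in range(n_curators)]


# 600 documents, 6 curators: 100 documents each.
parts = split_evenly(range(600), 6)
```

Each of the six parts then holds 100 distinct documents, and together they cover the whole corpus exactly once.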
To maximize environment type coverage, the 600 EOL documents were species pages randomly
picked from the following eight taxa: Actinopterygii, Annelida,
Arthropoda, Aves, Chlorophyta, Mammalia, Mollusca, Streptophyta. These taxa are
either associated with environments that differ from one another, or known to occur
in a diverse range of environments.
Each curator had 45 days to annotate 120 documents, i.e. their part of the corpus
(600/6 = 100 documents each) plus 20 documents (i.e. 20% of 100) shared
with other curators. The ‘20% overlap’ is an important part of the curation process:
it supports the calculation of the inter-annotator agreement (IAA), based on
pairwise calculations of Cohen's kappa coefficient.
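For readers unfamiliar with the measure, Cohen's kappa for one pair of curators is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below is a minimal illustration under stated assumptions, not the project's IAA code: it assumes the two curators' decisions on the shared documents have already been aligned into two equally long label lists, and the function name and toy labels are invented for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumes labels_a[i] and labels_b[i] refer to the same item.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy labels (hypothetical), one per aligned mention.
curator_1 = ["forest", "forest", "none", "sand", "none", "forest"]
curator_2 = ["forest", "none", "none", "sand", "none", "forest"]
kappa = cohen_kappa(curator_1, curator_2)  # 17/23, about 0.74
```

Repeating this calculation for every pair of curators that share documents gives the pairwise values on which an overall IAA can be based.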
Each curator had access only to his/her own documents. No information on which
documents were shared was disclosed.
All curators were instructed to evaluate all document substrings and map the
recognized environment descriptors to the corresponding EnvO terms.
Based on EnvO (envo-basic.obo, version date: 14 June 2013),
such environment descriptors included habitats, biomes, environmental
features, conditions and materials (EnvO high-level terms: 00002036, 00000428, 00002297, 01000203, 00010483
respectively).
All recognized mentions were to be listed (including repetitions)
in their order of appearance in the text. OBO-Edit was used to facilitate EnvO term search
and ontology browsing.
When an environment descriptor could refer to more than one EnvO term, multiple
mappings were allowed (e.g. mapping “forest” to ENVO:00000111, “forest”
(environmental feature), and to ENVO:01000174,
“forest biome”).
In the case of “nested” environment descriptors, a “leftmost-longest”-like
matching approach was applied. If, for example, “sandy sediment” is met in the text, it
is mapped to ENVO:01000118, “sandy sediment” (and not to the nested terms sand and
sediment).
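The two mapping rules above (multiple mappings for one descriptor, and preferring the longest descriptor over nested ones) can be illustrated with a small sketch. The curation itself was done manually, so this is only an assumption-laden illustration: the function name, the whitespace tokenisation and the tiny lexicon are invented, and the IDs for “sand” and “sediment” are deliberately left as placeholders rather than real EnvO identifiers.

```python
def longest_match(text, lexicon):
    """Scan left to right and keep the longest lexicon phrase at each position."""
    tokens = text.lower().split()  # naive tokenisation, punctuation not handled
    max_len = max((len(key.split()) for key in lexicon), default=0)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                matches.append((phrase, lexicon[phrase]))
                i += n  # skip the nested terms inside the match
                break
        else:
            i += 1  # no descriptor starts at this token
    return matches


# Toy lexicon: descriptor -> list of EnvO IDs (several IDs = multiple mappings).
lexicon = {
    "sandy sediment": ["ENVO:01000118"],
    "sand": ["<sand ID placeholder>"],
    "sediment": ["<sediment ID placeholder>"],
    "forest": ["ENVO:00000111", "ENVO:01000174"],
}
hits = longest_match("found in sandy sediment near forest", lexicon)
```

Here “sandy sediment” is reported once with ENVO:01000118, the nested “sediment” is not reported separately, and “forest” carries both of its allowed mappings.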
During the curation, a range of special cases was encountered. Cases such as
misspellings, synonyms missing from EnvO terms, and enumerations were indicated as
such. Environment descriptors that did not correspond to an existing EnvO term
were also marked as such.
Finally, environment descriptive terms that were part of geographical locations and/or
common taxon names (e.g. Steppe Eagle, Aquila nipalensis, shown in
the Figure) were flagged as such to allow for downstream analysis.
Calculating the IAA and merging the annotated documents into a single corpus are now ongoing,
paving the way for the ENVIRONMENT’s accuracy benchmark. Stay tuned!
Steppe Eagle, Aquila nipalensis, a common species name that includes a reference to an
environment. Such cases encountered during the curation have been flagged for
follow-up analysis (Image License: CC BY NC SA, © Tarique Sani, Source: Flickr: EOL Images).