March 2013 has
largely been a preparatory month, laying the groundwork for follow-up tasks.
The main task of
ENVIRONMENTS-EOL is the identification of environment descriptive terms, such
as terrestrial, aquatic, lagoon, coral reef, in EOL Pages.
Achieving this aim requires two things: on the one hand, the
text segments of the EOL Taxon Pages that contain environmental context
information that could be mined, and on the other hand, a piece of software
capable of identifying environment descriptors in these text segments.
For the former,
scripts have been written that use the EOL API to
retrieve sections (“subjects” in EOL terminology) of
every taxon, such as TaxonBiology, Description, Biology, Distribution and Habitat, among others. EOL’s adherence to
standards (e.g. the Species Profile Model)
has significantly assisted this procedure. The text retrieval will be optimized
further in active collaboration with the EOL Developer Team.
For the latter, a
prototype tagger, ENVIRONMENTS, has been compiled. ENVIRONMENTS is based on
SPECIES,
a tagger capable of identifying organism names in text using a dictionary-based
approach (main developers: Lars, Sune).
ENVIRONMENTS
identifies environment descriptive terms by looking up words in the
text against a dictionary of environment descriptors. A prototype dictionary
has been created from the naming information available in the Environment Ontology (EnvO).
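To illustrate the general idea of dictionary-based tagging, here is a minimal Python sketch. It is not the ENVIRONMENTS or SPECIES code: the function name `tag`, the longest-match strategy, the `max_words` limit and the placeholder identifiers (`EX:1` etc., which are not real EnvO accessions) are all assumptions made for this example.

```python
import re

# Hypothetical mini-dictionary: lowercase term -> placeholder identifier.
# The identifiers are made up for illustration; they are NOT EnvO accessions.
DICTIONARY = {
    "coral reef": "EX:1",
    "lagoon": "EX:2",
    "reef": "EX:3",
}

TOKEN = re.compile(r"\w+")


def tag(text, dictionary, max_words=3):
    """Return (start, end, matched_text, identifier) tuples.

    At each token position the longest dictionary entry (up to
    max_words tokens) is preferred, so "coral reef" wins over "reef".
    """
    tokens = [(m.start(), m.end()) for m in TOKEN.finditer(text)]
    matches = []
    i = 0
    while i < len(tokens):
        hit = None
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            candidate = " ".join(text[s:e].lower() for s, e in tokens[i:i + n])
            if candidate in dictionary:
                hit = (start, end, text[start:end], dictionary[candidate])
                i += n  # skip past the matched phrase
                break
        if hit:
            matches.append(hit)
        else:
            i += 1
    return matches
```

Called as `tag("Found in a Coral reef near the lagoon.", DICTIONARY)`, the sketch reports "Coral reef" and "lagoon" together with their character offsets, which is the kind of output that can later be compared with manual annotations.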
EnvO is a
community resource offering a controlled, structured vocabulary for ecosystem
types (“biomes”), environmental materials, and environmental features (e.g.
habitats).
The different
types and sources of EnvO term names and synonyms have been explored and the more precise
ones have been selected.
Further steps
will improve the match between the way terms are written
in the text and the way they appear in EnvO, e.g. by automatically adding the plural
forms of the terms to the dictionary.
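As a rough illustration of the plural-form step, the following Python sketch adds naive English plurals to a term dictionary. The helper names (`plural_variant`, `expand_with_plurals`), the pluralisation rules and the placeholder identifier are assumptions for this example; irregular plurals would need a proper lexicon, and the actual dictionary-building code may work differently.

```python
def plural_variant(term):
    """Naive English plural of a term (only the ending of the last word changes)."""
    if term.endswith(("s", "x", "z", "ch", "sh")):
        return term + "es"
    if term.endswith("y") and len(term) > 1 and term[-2] not in "aeiou":
        return term[:-1] + "ies"
    return term + "s"


def expand_with_plurals(dictionary):
    """Return a copy of the dictionary with plural forms added.

    Existing entries are never overwritten, so a plural that is
    already a term in its own right keeps its identifier.
    """
    expanded = dict(dictionary)
    for term, identifier in dictionary.items():
        expanded.setdefault(plural_variant(term), identifier)
    return expanded
```

For example, `expand_with_plurals({"lagoon": "EX:1"})` yields a dictionary that maps both "lagoon" and "lagoons" to the same placeholder identifier.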
Another important
aspect of the ENVIRONMENTS-EOL project is the evaluation of the accuracy of the
environment descriptive term identification. To this end, the creation of a
manually annotated corpus (gold standard) is necessary.
Such a corpus
comprises a set of documents in which environment descriptors have been
manually identified and mapped to unique identifiers in community resources
(e.g. the Environment Ontology terms).
Once such a gold
standard corpus is in place, its manually annotated tags can be compared with
those predicted by named entity recognition software. In this way the accuracy
of the latter can be calculated.
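One common way to make such a comparison concrete is to compute precision, recall and F1 over the two sets of tags. The Python sketch below assumes that each tag is represented as a (document, start, end, identifier) tuple and that only exact matches count; both the representation and the choice of metrics are assumptions for this example, not a description of the project's evaluation protocol.

```python
def evaluate(gold, predicted):
    """Compare predicted tags against gold-standard tags.

    Both arguments are iterables of hashable tags, here assumed to be
    (document, start, end, identifier) tuples. Only exact matches count.
    Returns (precision, recall, f1).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Precision is the fraction of predicted tags that are also in the gold standard, and recall is the fraction of gold-standard tags that the software found.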
A guideline
document is now in place. It draws on the experience gained from creating a manually annotated corpus of
taxonomic mentions (the S800 corpus) and from
the pilot annotation of environment descriptive terms in PubMed abstracts (thanks to Christina for her support).
This document
gives the curator team (Aikaterini, Christina, Evangelos, Julia,
Lucia, Sarah) a guide with examples of documents in which environment
descriptors have been manually identified and mapped to the corresponding EnvO terms.
Additionally, the guide elaborates on the main categories employed by
EnvO, presents web-search tools
dedicated to EnvO and text editors that facilitate the annotation task, discusses
issues already spotted (e.g. how to handle environmental descriptors currently
missing from EnvO), and lists hints and tips that could ease the tedious
task of manual annotation.