Nice days during the visit to Copenhagen (1-13 April 2013); while working on ENVIRONMENTS, balloons showed up in the "air" (ENVO:00002005) (image taken at Frederiksberg, Copenhagen; CC BY-NC-SA)
April 2013 has been mainly a travelling month. Besides the presentation of ENVIRONMENTS at GSC 15 (see previous post), a two-week visit (made possible through the EOL Rubenstein Fellowship support) to the premises of Dr. Lars Juhl Jensen and Dr. Sune Frankild (NNF-CPR, Denmark) has resulted in the implementation of a series of critical tasks.
ENVIRONMENTS Software Development: Dictionary generation
The name and synonym information in the Environment Ontology (EnvO) resource has been assessed. Based on this information, a dictionary has been generated that maps environment-descriptive terms to EnvO identifiers.
Where necessary, extra synonyms were generated to capture the variable ways EnvO terms may be written in text. As an extension to previous work (see post), the generation of adjectives (e.g. coast – coastal) and plural forms (e.g. brackish water – brackish waters) has been included.
Species names and anatomy terms present in EnvO, which are also described in other taxonomies/ontologies, were not included in the dictionary. Moreover, food names were excluded as they may give rise to out-of-context text mining results.
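To give an idea of the dictionary-building step, here is a minimal sketch in Python; the term list, the variant rules, and all EnvO identifiers except ENVO:00002005 ("air") are illustrative assumptions, not the actual EnvO export or the ENVIRONMENTS code.

```python
def expand_variants(term):
    """Generate simple written variants of a term (toy plural and adjectival rules)."""
    variants = {term}
    # plural form, e.g. "brackish water" -> "brackish waters"
    if not term.endswith("s"):
        variants.add(term + "s")
    # crude adjectival form, e.g. "coast" -> "coastal"
    if term.endswith("t"):
        variants.add(term + "al")
    return variants

def build_dictionary(envo_terms, excluded):
    """Map each surface form to its EnvO identifier, skipping excluded entries
    (e.g. species names, anatomy terms, food names)."""
    dictionary = {}
    for envo_id, names in envo_terms.items():
        for name in names:
            if name in excluded:
                continue
            for variant in expand_variants(name):
                dictionary[variant.lower()] = envo_id
    return dictionary

# Toy input: EnvO identifier -> name and synonyms.
# Identifiers other than ENVO:00002005 ("air") are placeholders.
envo_terms = {
    "ENVO:00002005": ["air"],
    "ENVO:0000000A": ["coast"],
    "ENVO:0000000B": ["brackish water"],
}
print(build_dictionary(envo_terms, excluded=set()))
```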
Encyclopedia of Life textual component retrieval and processing
The EOL API has been used to retrieve sections ("subjects" in the EOL terminology) for every taxon, such as TaxonBiology, Description, Biology, Distribution, Habitat, and more.
The EOL taxa text components have been downloaded in JSON format. A parser has been developed that collects the selected sections only and converts the text into an ENVIRONMENTS-compatible format (e.g. removes HTML tags and converts UTF-8 characters to ASCII).
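A rough sketch of what such a parser might look like is given below; the JSON field names ("dataObjects", "subject", "description") are assumptions about the downloaded EOL records, not a verified schema, and this is not the actual ENVIRONMENTS parser.

```python
import json
import re
import unicodedata

# Sections ("subjects") to keep; names follow those listed above.
SELECTED_SUBJECTS = {"TaxonBiology", "Description", "Biology", "Distribution", "Habitat"}

def clean_text(text):
    """Strip HTML tags and convert UTF-8 characters to their closest ASCII form."""
    text = re.sub(r"<[^>]+>", " ", text)               # remove HTML tags
    text = unicodedata.normalize("NFKD", text)         # decompose accented characters
    text = text.encode("ascii", "ignore").decode()     # drop what cannot be mapped
    return re.sub(r"\s+", " ", text).strip()

def extract_sections(eol_json_path):
    """Collect only the selected sections from a downloaded EOL taxon JSON file.
    Field names here are assumptions for illustration."""
    with open(eol_json_path) as handle:
        page = json.load(handle)
    sections = []
    for obj in page.get("dataObjects", []):
        # subjects may appear as URIs, e.g. ".../SPMInfoItems#Habitat" -> "Habitat"
        subject = (obj.get("subject") or "").split("#")[-1]
        if subject in SELECTED_SUBJECTS:
            sections.append(clean_text(obj.get("description", "")))
    return sections
```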
ENVIRONMENTS Software Development: Stopword curation
Local text repositories of PubMed and EOL were processed with ENVIRONMENTS (using an early-version dictionary). The most frequently tagged terms were inspected manually in their textual context. Those that were found, most of the time, in a context other than describing an environment were added to a "stopword" list; "well", "spring", and "range" are a few such examples. Such terms would have caused a high number of false positive matches. The "stopword" list is a mechanism against this phenomenon; its terms will be excluded from the analysis.
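A small sketch of how such a list could be curated and then applied is shown below; the tagger output format (term, EnvO identifier pairs) and the identifiers other than ENVO:00002005 are assumptions for illustration.

```python
from collections import Counter

# Terms found to match mostly out of context during manual inspection
# (examples taken from the curation described above).
STOPWORDS = {"well", "spring", "range"}

def most_frequent_tags(matches, top=50):
    """Rank tagged terms by frequency so the most common ones can be
    inspected manually in their textual context."""
    return Counter(term.lower() for term, envo_id in matches).most_common(top)

def filter_matches(matches):
    """Exclude stopword terms from the final set of environment matches."""
    return [(term, envo_id) for term, envo_id in matches
            if term.lower() not in STOPWORDS]

# Toy example: (matched term, EnvO identifier) pairs produced by a tagger.
matches = [("coast", "ENVO:0000000A"), ("well", "ENVO:0000000C"), ("air", "ENVO:00002005")]
print(most_frequent_tags(matches))
print(filter_matches(matches))
```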
ENVIRONMENTS corpus preparation
A major component of this project is the creation of a manually annotated corpus (gold standard). Such a corpus comprises a set of documents in which environment descriptors have been manually identified and mapped to unique identifiers in community resources (e.g. Environment Ontology terms). By comparing the manual annotations with software-predicted tags, the accuracy of environment-descriptive-term identification software can be evaluated.
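For instance, agreement between manual and predicted annotations can be summarised with standard precision and recall. The sketch below assumes annotations are represented as (document id, term, EnvO id) tuples; this format and the identifiers other than ENVO:00002005 are illustrative, not part of the corpus specification.

```python
def precision_recall(gold, predicted):
    """Compare manually annotated mentions with software-predicted tags."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

gold = {("doc1", "coastal", "ENVO:0000000A"), ("doc1", "air", "ENVO:00002005")}
predicted = {("doc1", "air", "ENVO:00002005"), ("doc1", "well", "ENVO:0000000C")}
print(precision_recall(gold, predicted))  # -> (0.5, 0.5)
```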
Two basic requirements that such a corpus must meet are:
a. to be comprehensive, i.e. to contain texts that refer to diverse types of environments
b. to contain a minimum number of terms per document that would make the manual annotation feasible in a pragmatic time frame