
Nice days during the visit to Copenhagen (1-13 April 2013); while working on ENVIRONMENTS, balloons showed up in the "air" (ENVO:00002005) (image taken at Frederiksberg, Copenhagen; CC BY-NC-SA)
 
ENVIRONMENTS Software Development: Dictionary generation
The name and synonym
information in the Environment Ontology (EnvO) resource has been assessed.
Based on this information, a dictionary has been generated that maps
environment-descriptive terms to EnvO identifiers.
Where necessary, extra synonyms were generated to capture the variable ways EnvO
terms may be written in text. As an extension to previous work (see post), the generation
of adjectives (e.g. coast – coastal) and plural forms (e.g. brackish water – brackish
waters) has been included.
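The variant generation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the plural rules are deliberately naive, the adjective table `ADJECTIVE_FORMS` is a hypothetical hand-curated lookup, and the identifiers used in the usage lines are placeholders rather than real EnvO identifiers.

```python
# Hypothetical hand-curated noun -> adjective table (assumption for illustration).
ADJECTIVE_FORMS = {"coast": "coastal"}


def pluralize(term):
    """Return a naive English plural of a term (illustrative rules only)."""
    if term.endswith(("s", "x", "z", "ch", "sh")):
        return term + "es"
    if term.endswith("y") and len(term) > 1 and term[-2] not in "aeiou":
        return term[:-1] + "ies"
    return term + "s"


def expand_synonyms(names_by_id):
    """Map every name, plural and known adjective form to its identifier(s)."""
    dictionary = {}
    for identifier, names in names_by_id.items():
        for name in names:
            variants = {name, pluralize(name)}
            if name in ADJECTIVE_FORMS:
                variants.add(ADJECTIVE_FORMS[name])
            for variant in variants:
                dictionary.setdefault(variant.lower(), set()).add(identifier)
    return dictionary


# Placeholder identifiers, not real EnvO IDs.
example = expand_synonyms({"ENVO:PLACEHOLDER_1": ["coast"],
                           "ENVO:PLACEHOLDER_2": ["brackish water"]})
print(sorted(example))
```

A lookup table is used for adjectives because a blanket suffix rule would generate many forms that are not real words; plurals are more regular, so a few rules cover most cases.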
Species names and
anatomy terms present in EnvO, which are also described in other taxonomies/ontologies,
were not included in the dictionary. Moreover, food names were excluded, as they
may give rise to out-of-context text mining results.
Encyclopedia of Life textual component retrieval and processing
The EOL API has been
used to retrieve sections (“subjects” in the EOL terminology) for every taxon,
such as TaxonBiology, Description, Biology, Distribution,
Habitat and more.
The EOL taxa text
components have been downloaded in JSON format. A parser has been
developed that collects only the selected sections and converts the text into
an ENVIRONMENTS-compatible format (e.g. removes HTML tags and converts UTF-8
characters to ASCII).
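A parser of this kind might look roughly like the sketch below, which uses only the Python standard library. The JSON layout (a `dataObjects` list whose entries carry `subject` and `description` fields) is an assumption made for illustration and should be checked against the actual EOL API response; the function names are hypothetical as well.

```python
import json
import re
import unicodedata
from html.parser import HTMLParser

# Sections ("subjects") to keep; the list follows the examples named in the text.
WANTED_SUBJECTS = {"TaxonBiology", "Description", "Biology", "Distribution", "Habitat"}


class _TextExtractor(HTMLParser):
    """Collects text content and drops all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Remove HTML tags, decode entities and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


def to_ascii(text):
    """Fold accented characters to base letters; drop what cannot be mapped."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def extract_sections(json_text):
    """Return (subject, cleaned text) pairs for the wanted sections only."""
    record = json.loads(json_text)
    sections = []
    # "dataObjects", "subject" and "description" are assumed field names.
    for obj in record.get("dataObjects", []):
        if obj.get("subject") in WANTED_SUBJECTS:
            cleaned = to_ascii(strip_html(obj.get("description", "")))
            sections.append((obj["subject"], cleaned))
    return sections
```

Note that the ASCII step here silently discards characters with no ASCII decomposition, which may or may not match what the real parser does.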
ENVIRONMENTS Software Development: Stopword curation
Local text
repositories of PubMed and EOL were processed with ENVIRONMENTS (using an
early-version dictionary). The most frequently tagged terms were inspected
manually in context. Those that were found, most of the time, in a context other
than describing an environment were added to a “stopword” list; “well”,
“spring” and “range” are a few such examples. Such terms would have caused a high number
of false positive matches. The “stopword” list guards against this:
its terms are excluded from the analysis.
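The two mechanical parts of this curation step, ranking tagged terms by frequency for manual inspection and then filtering matches against the curated list, can be sketched as follows. The data shapes (a flat list of tagged strings, and matches as `(term, identifier)` pairs) are assumptions for illustration; the decision about which frequent terms become stopwords remains manual.

```python
from collections import Counter


def most_frequent_terms(tagged_terms, top_n=50):
    """Rank tagged terms by frequency so the top ones can be inspected by hand."""
    counts = Counter(term.lower() for term in tagged_terms)
    return counts.most_common(top_n)


def filter_matches(matches, stopwords):
    """Drop (term, identifier) matches whose term is on the stopword list."""
    blocked = {word.lower() for word in stopwords}
    return [match for match in matches if match[0].lower() not in blocked]


# Hypothetical tagger output; identifiers are placeholders.
tagged = ["well", "forest", "well", "range", "well", "range"]
print(most_frequent_terms(tagged, top_n=2))
print(filter_matches([("Well", "ID:1"), ("forest", "ID:2")], ["well", "range"]))
```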
ENVIRONMENTS corpus preparation
A major component of
this project is the creation of a manually annotated corpus (gold standard).
Such a corpus
comprises a set of documents in which environment descriptors have been
manually identified and mapped to unique identifiers in community resources
(e.g. the Environment Ontology terms). By comparing the manual annotations with
software-predicted tags, the accuracy of software that identifies
environment-descriptive terms can be evaluated.
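One common way to make such a comparison concrete is to compute precision, recall and F1 over exactly matching annotations. The sketch below assumes that each annotation is a `(document id, start offset, end offset, identifier)` tuple and that only exact matches count; the text does not specify the metric or the matching criterion, so both are assumptions.

```python
def evaluate(gold, predicted):
    """Return (precision, recall, F1) for exact-match annotation tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (document id, start, end, identifier placeholder).
gold = [("doc1", 0, 5, "ID:1"), ("doc1", 10, 16, "ID:2")]
predicted = [("doc1", 0, 5, "ID:1"), ("doc1", 20, 25, "ID:3")]
print(evaluate(gold, predicted))
```

Exact matching is the strictest option; a more lenient scheme (for example, counting overlapping spans) would give higher scores for the same output.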
Two basic requirements
that such a corpus must meet are:
a.    to be comprehensive, i.e. to contain text that refers to diverse types of environments
b.    to contain a minimum number of terms per document that would make the manual annotation feasible in a
pragmatic time frame
Having these criteria in mind, several in silico experiments were
conducted to collect documents from EOL. The basic components of a pipeline
have been implemented to randomly select EOL pages of species belonging to
certain higher-level taxa for the corpus generation. The next step is to
further define such higher-level taxa. Could randomly picking bird species
EOL pages (members of class Aves) result in a set of documents mentioning
different types of environments?
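The random selection step could be sketched as below. This is only an illustration under assumptions: pages are given as `(page id, number of tagged terms)` pairs grouped by higher-level taxon, and the term-count bounds `min_terms` and `max_terms` are hypothetical parameters standing in for the feasibility criterion; the actual pipeline may work differently.

```python
import random


def sample_pages(pages_by_taxon, per_taxon, min_terms, max_terms, seed=0):
    """Randomly pick up to `per_taxon` eligible page ids for each taxon."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = {}
    for taxon, pages in pages_by_taxon.items():
        # Keep only pages whose tagged-term count falls inside the bounds.
        eligible = [page_id for page_id, n_terms in pages
                    if min_terms <= n_terms <= max_terms]
        count = min(per_taxon, len(eligible))
        selected[taxon] = rng.sample(eligible, count)
    return selected


# Hypothetical input: page ids with their tagged-term counts.
pages = {"Aves": [("page_a", 3), ("page_b", 50), ("page_c", 5)]}
print(sample_pages(pages, per_taxon=2, min_terms=1, max_terms=10))
```

Counting the distinct identifiers tagged across the sampled pages of each taxon would be one way to check whether a given taxon yields sufficiently diverse environments.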