Friday, May 3, 2013

April 2013: Visit to NNFCPR, Copenhagen, Denmark. Implementing main ENVIRONMENTS components

Nice days during the visit to Copenhagen (1-13 April 2013); while working on ENVIRONMENTS, balloons showed up in the "air" (ENVO:00002005) (image taken at Frederiksberg, Copenhagen; CC BY-NC-SA)

April 2013 has been mainly a travelling month. Besides the presentation of ENVIRONMENTS at GSC 15 (see previous post), a 2-week visit to the premises of Dr. Lars Juhl Jensen and Dr. Sune Frankild (NNF-CPR, Denmark), made possible by the EOL Rubenstein Fellowship support, has resulted in the implementation of a series of critical tasks.

ENVIRONMENTS Software Development: Dictionary generation
The name and synonym information in the Environment Ontology (EnvO) resource has been assessed. Based on this information, a dictionary has been generated that maps environment descriptive terms to EnvO identifiers.

Where necessary, extra synonyms were generated to capture the variable ways EnvO terms may be written in text. As an extension to previous work (see post), the generation of adjectives (e.g. coast – coastal) and plural forms (e.g. brackish water – brackish waters) has been included.
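
To make the synonym expansion a bit more concrete, here is a minimal sketch of how such variants could be generated. It is not the actual ENVIRONMENTS code: the function names, the naive plural rules, the hand-made `ADJECTIVE_FORMS` table and the placeholder identifiers are all assumptions made for illustration.

```python
# Hypothetical hand-curated table of noun -> adjective forms (assumption).
ADJECTIVE_FORMS = {"coast": "coastal"}


def pluralize(term):
    """Return a naive English plural of a term (rule of thumb only)."""
    if term.endswith(("s", "x", "z", "ch", "sh")):
        return term + "es"
    if term.endswith("y") and len(term) > 1 and term[-2] not in "aeiou":
        return term[:-1] + "ies"
    return term + "s"


def build_dictionary(names_by_id):
    """Map every name, plural and known adjective form to its identifiers.

    names_by_id: dict of identifier -> list of names and synonyms.
    Returns a dict of lower-cased term -> set of identifiers.
    """
    dictionary = {}
    for identifier, names in names_by_id.items():
        for name in names:
            key = name.lower()
            variants = {key, pluralize(key)}
            if key in ADJECTIVE_FORMS:
                variants.add(ADJECTIVE_FORMS[key])
            for variant in variants:
                dictionary.setdefault(variant, set()).add(identifier)
    return dictionary


# Placeholder identifiers, not real EnvO accessions.
example = build_dictionary({
    "ENVO:PLACEHOLDER_1": ["coast"],
    "ENVO:PLACEHOLDER_2": ["brackish water"],
})
```

In this sketch both "coastal" and "coasts" end up pointing to the same identifier as "coast", and "brackish waters" to the same identifier as "brackish water". Rule-based plurals will produce some wrong forms, so the generated variants would still need a manual check.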

Species names and anatomy terms present in EnvO, which are also described in other taxonomies/ontologies, were not included in the dictionary. Moreover, food names were excluded as they may give rise to out-of-context text mining results.

Encyclopedia of Life textual component retrieval and processing
The EOL API has been used to retrieve sections (“subjects” in the EOL terminology) for every taxon, such as TaxonBiology, Description, Biology, Distribution, Habitat and more.

The EOL Taxa text components have been downloaded in JSON format. A parser has been developed that collects the selected sections only and converts the text into an ENVIRONMENTS-compatible format (e.g. removes HTML tags and converts UTF-8 characters to ASCII).
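
As an illustration of this step, the sketch below keeps only the selected sections of one downloaded record, strips the HTML tags and reduces the text to ASCII. The JSON layout (a `dataObjects` list whose items carry a `subject` URI and a `description`) is an assumption about the shape of the downloaded files, and all function names are hypothetical; the real parser may work differently.

```python
import json
import unicodedata
from html.parser import HTMLParser

# Sections to keep; the names come from the list above.
WANTED_SUBJECTS = {"TaxonBiology", "Description", "Biology", "Distribution", "Habitat"}


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Remove tags, decode entities and collapse whitespace."""
    extractor = _TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return " ".join("".join(extractor.parts).split())


def to_ascii(text):
    """Drop accents via NFKD decomposition, then discard non-ASCII characters.

    Characters without a decomposition (e.g. the Danish letter o-slash)
    are silently dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def extract_sections(raw_json):
    """Return (subject, plain ASCII text) pairs for the wanted sections."""
    page = json.loads(raw_json)
    sections = []
    for obj in page.get("dataObjects", []):  # assumed record layout
        subject = obj.get("subject", "").rsplit("#", 1)[-1]
        description = obj.get("description")
        if subject in WANTED_SUBJECTS and description:
            sections.append((subject, to_ascii(strip_html(description))))
    return sections
```

Dropping characters that cannot be mapped to ASCII is the simplest option; a transliteration table would preserve more of the original text if that turns out to matter for the tagging.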

ENVIRONMENTS Software Development: Stopword curation
Local text repositories of PubMed and EOL were processed with ENVIRONMENTS (using an early-version dictionary). The most frequently tagged terms were inspected manually in context. Those that were found, most of the time, in a context other than describing an environment were added to a “stopword” list. “well”, “spring” and “range” are a few such examples. Such terms would have caused a high number of false positive matches. The “stopword” list is a safeguard against this; its terms will be excluded from the analysis.
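
A minimal sketch of the two halves of this procedure, ranking the most frequently tagged terms for manual inspection and then filtering matches against the curated list, could look as follows. The representation of a match as a `(matched_text, identifier)` pair and the function names are assumptions for illustration, not the ENVIRONMENTS internals.

```python
from collections import Counter


def most_frequent_terms(matches, top_n=50):
    """Rank tagged terms by frequency so the top ones can be inspected by hand.

    matches: iterable of (matched_text, identifier) pairs.
    """
    counts = Counter(text.lower() for text, _ in matches)
    return counts.most_common(top_n)


def apply_stopwords(matches, stopwords):
    """Drop every match whose text is on the curated stopword list."""
    blocked = {word.lower() for word in stopwords}
    return [(text, ident) for text, ident in matches if text.lower() not in blocked]


# Placeholder identifiers, for illustration only.
tagged = [("well", "ID:1"), ("Well", "ID:1"), ("lake", "ID:2")]
candidates = most_frequent_terms(tagged, top_n=1)
kept = apply_stopwords(tagged, ["well", "spring", "range"])
```

The decision about which frequent terms go on the list stays manual; the code only produces the ranking and applies the result.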

ENVIRONMENTS corpus preparation
A major component of this project is the creation of a manually annotated corpus (gold standard).
Such a corpus comprises a set of documents in which environment descriptors have been manually identified and mapped to unique identifiers in community resources (e.g. the Environment Ontology terms). By comparing the manual annotations with the software-predicted tags, the accuracy of environment descriptive term identification software can be evaluated.
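
One common way to do such a comparison is to compute precision, recall and F1 over exact matches. The sketch below assumes each annotation is a `(document_id, start, end, identifier)` tuple; that representation, the exact-match criterion and the choice of metrics are assumptions here, not a description of how ENVIRONMENTS will actually be scored.

```python
def evaluate(gold, predicted):
    """Compare predicted tags with gold-standard annotations.

    Both arguments are iterables of (document_id, start, end, identifier)
    tuples; a prediction counts as correct only on an exact match.
    Returns (precision, recall, f1).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder annotations, for illustration only.
gold_tags = {("doc1", 0, 5, "ID:A"), ("doc1", 10, 14, "ID:B")}
predicted_tags = {("doc1", 0, 5, "ID:A"), ("doc1", 20, 25, "ID:C")}
scores = evaluate(gold_tags, predicted_tags)
```

Exact matching is strict: a tag with the right identifier but slightly different boundaries counts as both a false positive and a false negative, so a more lenient overlap criterion may be worth considering.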

Two basic requirements that such a corpus must meet are:
a. to be comprehensive, i.e. to contain texts that refer to diverse types of environments
b. to contain a minimum number of terms per document that would make the manual annotation feasible in a pragmatic time frame

With these criteria in mind, several in silico experiments were conducted to collect documents from EOL. The basic components of a pipeline have been implemented to randomly select EOL pages of species belonging to certain higher-level taxa for the corpus generation. The next step is to further define such higher-level taxa. Could randomly picking EOL pages of bird species (members of the class Aves) result in a set of documents mentioning different types of environments?
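
For the random selection step, a minimal sketch is given below. It assumes the candidate page identifiers have already been grouped by higher-level taxon; the function name, the input layout and the page identifiers are hypothetical. A fixed seed is used so that the same selection can be reproduced later.

```python
import random


def sample_pages(pages_by_taxon, per_taxon, seed=0):
    """Randomly pick up to `per_taxon` page identifiers for each taxon.

    pages_by_taxon: dict of taxon name -> list of page identifiers.
    The seed makes the selection reproducible.
    """
    rng = random.Random(seed)
    selected = {}
    for taxon in sorted(pages_by_taxon):
        # Sort first so the result does not depend on input order.
        pages = sorted(pages_by_taxon[taxon])
        selected[taxon] = rng.sample(pages, min(per_taxon, len(pages)))
    return selected


# Hypothetical page identifiers, for illustration only.
candidate_pages = {"Aves": [1, 2, 3, 4, 5], "Mammalia": [6, 7]}
selection = sample_pages(candidate_pages, per_taxon=3, seed=1)
```

Sampling the same number of pages from each taxon is only one option; whether it yields enough environmental diversity is exactly the open question above.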
