Friday, May 3, 2013

April 2013: Visit to NNFCPR, Copenhagen, Denmark. Implementing main ENVIRONMENTS components

Nice days during the visit to Copenhagen (1-13 April 2013); while working on ENVIRONMENTS, balloons showed up in the "air" (ENVO:00002005) (image taken at Frederiksberg, Copenhagen; CC BY-NC-SA)







April 2013 has been mainly a travelling month. Besides the presentation of ENVIRONMENTS at GSC 15 (see previous post), a 2-week visit to the premises of Dr. Lars Juhl Jensen and Dr. Sune Frankild (NNF-CPR, Denmark), made possible by the EOL Rubenstein Fellowship support, has resulted in the implementation of a series of critical tasks.

ENVIRONMENTS Software Development: Dictionary generation
The name and synonym information in the Environment Ontology (EnvO) resource has been assessed. Based on this information, a dictionary has been generated that maps environment descriptive terms to EnvO identifiers.

Where necessary, extra synonyms were generated capturing the variable ways EnvO terms may be written in text. As an extension to previous work (see post) the generation of adjectives (e.g. coast – coastal) or plural forms (e.g. brackish water – brackish waters) has been included.
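The plural generation described above could look roughly like the sketch below. The pluralisation rules and the function names are illustrative assumptions, not the actual ENVIRONMENTS code. Adjective forms (coast, coastal) are too irregular for rules this simple and would probably need a curated list.

```python
def plural_variants(term):
    """Return the term plus a naive English plural of its last word."""
    words = term.split()
    if not words:
        return set()
    last = words[-1]
    if last.endswith(("s", "x", "z", "ch", "sh")):
        plural = last + "es"
    elif last.endswith("y") and len(last) > 1 and last[-2] not in "aeiou":
        plural = last[:-1] + "ies"
    else:
        plural = last + "s"
    return {term, " ".join(words[:-1] + [plural])}


def expand_dictionary(names_to_ids):
    """Map every generated variant to the identifier of its source term."""
    expanded = {}
    for name, envo_id in names_to_ids.items():
        for variant in plural_variants(name.lower()):
            # Keep the first identifier seen if two terms yield the same variant.
            expanded.setdefault(variant, envo_id)
    return expanded
```

Calling `expand_dictionary({"River bank": "ENVO:00000143"})` would yield entries for both "river bank" and "river banks". Irregular plurals are not handled by these rules.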

Species names and anatomy terms present in EnvO, which are also described in other taxonomies/ontologies, were not included in the dictionary. Moreover, food names were excluded as they may give rise to out-of-context text mining results.

Encyclopedia of Life textual component retrieval and processing
The EOL API has been used to retrieve sections (“subjects” in the EOL terminology) for every taxon, such as TaxonBiology, Description, Biology, Distribution, Habitat and more.

The EOL taxa text components have been downloaded in JSON format. A parser has been developed that collects the selected sections only and converts the text to an ENVIRONMENTS-compatible format (e.g. removes HTML tags and converts UTF-8 characters to ASCII).
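The cleaning step could be sketched as follows, using only the Python standard library. The record layout (a list of objects with "subject" and "text" keys) and all function names are assumptions made for illustration; the real EOL JSON structure and the actual parser may differ.

```python
import unicodedata
from html.parser import HTMLParser

# Subjects named in the post; the real selection may be longer.
WANTED_SUBJECTS = {"TaxonBiology", "Description", "Biology", "Distribution", "Habitat"}


class _TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_text(html_text):
    """Strip HTML tags, reduce the text to ASCII and collapse whitespace."""
    extractor = _TextExtractor()
    extractor.feed(html_text)
    extractor.close()
    text = "".join(extractor.chunks)
    # Decompose accented characters, then drop everything outside ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_text.split())


def select_sections(data_objects):
    """Keep only wanted subjects (hypothetical 'subject'/'text' keys)."""
    return [
        clean_text(obj["text"])
        for obj in data_objects
        if obj.get("subject") in WANTED_SUBJECTS
    ]
```

Dropping non-ASCII characters after decomposition is lossy: accented letters keep their base letter, but characters with no ASCII equivalent disappear entirely.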

ENVIRONMENTS Software Development: Stopword curation
Local text repositories of PubMed and EOL were processed with ENVIRONMENTS (using an early-version dictionary). The most frequently tagged terms were inspected manually in-text. Those that were found, most of the time, in a context other than describing an environment were added to a “stopword” list. “well”, “spring” and “range” are a few such examples. Such terms would have caused a high number of false positive matches. The “stopword” list guards against this; its terms will be excluded from the analysis.
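A small sketch of the two mechanical parts of this curation: ranking tagged terms by frequency so that the top ones can be inspected, and excluding the curated stopwords afterwards. The function names and the flat list-of-strings input are assumptions; deciding which terms become stopwords remains a manual step.

```python
from collections import Counter


def most_frequent_tags(tagged_terms, top_n=10):
    """Return the top_n (term, count) pairs, ignoring case."""
    counts = Counter(term.lower() for term in tagged_terms)
    return counts.most_common(top_n)


def drop_stopwords(tagged_terms, stopwords):
    """Remove every tagged term that appears in the stopword list."""
    blocked = {word.lower() for word in stopwords}
    return [term for term in tagged_terms if term.lower() not in blocked]
```

For example, `drop_stopwords(["well", "lagoon", "range"], {"well", "range"})` keeps only "lagoon".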

ENVIRONMENTS corpus preparation
A major component of this project is the creation of a manually annotated corpus (gold standard).
Such a corpus comprises a set of documents in which environment descriptors have been manually identified and mapped to unique identifiers in community resources (e.g. the Environment Ontology terms). By comparing the manual annotations with software-predicted tags, the accuracy of environment descriptive term identification software can be evaluated.
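One common way to carry out such a comparison is with precision, recall and F1 over exact matches. The sketch below assumes each annotation is a (document, start, end, identifier) tuple; that representation and the choice of measures are illustrative assumptions, not a description of the project's evaluation.

```python
def evaluate(gold, predicted):
    """Return (precision, recall, f1) for exact-match annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative annotations: the spans are made up, the identifiers are the
# two EnvO terms quoted in the image captions of this blog.
gold = {("doc1", 13, 23, "ENVO:00000143"), ("doc1", 44, 47, "ENVO:00002005")}
predicted = {("doc1", 13, 23, "ENVO:00000143"), ("doc1", 60, 63, "ENVO:00002005")}
print(evaluate(gold, predicted))
```

Exact matching is strict: a tag with the right identifier but slightly different boundaries counts as both a false positive and a false negative.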

Two basic requirements that such a corpus must meet are:
a.     to be comprehensive, i.e. to contain text that refers to diverse types of environments
b.    to contain a minimum number of terms per document that would make the manual annotation feasible in a pragmatic time frame

Having such criteria in mind, several in silico experiments were conducted to collect documents from EOL. The basic components of a pipeline have been implemented to randomly select EOL pages of species belonging to certain higher-level taxa for the corpus generation. The next step is to further define such higher-level taxa. Could randomly picking EOL pages of bird species (members of class Aves) result in a set of documents mentioning different types of environments?
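The random selection step might be sketched like this. The mapping from higher-level taxon to page identifiers, the function name and the fixed seed are assumptions for illustration; how the candidate pages per taxon are obtained is outside this sketch.

```python
import random


def sample_pages(pages_by_taxon, per_taxon, seed=0):
    """Pick up to per_taxon page ids at random from each higher-level taxon."""
    rng = random.Random(seed)  # fixed seed so that a selection can be repeated
    selection = {}
    for taxon, page_ids in pages_by_taxon.items():
        ids = sorted(page_ids)  # stable order before sampling
        count = min(per_taxon, len(ids))
        selection[taxon] = rng.sample(ids, count)
    return selection


# Made-up page ids standing in for real EOL page identifiers.
candidates = {"Aves": range(1000, 1100)}
print(sample_pages(candidates, per_taxon=5, seed=42))
```

Sampling a fixed number of pages per taxon, rather than from the pooled set, keeps species-rich groups from dominating the corpus.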


Thursday, May 2, 2013

ENVIRONMENTS@GSC15, April 22-24, NIH, Washington DC

ENVIRONMENTS was presented (talk, poster) last week (April 22-24) at the 15th Genomic Standards Consortium meeting (GSC15) (NIH, Washington DC). Very good and creative feedback was received. A PDF version of the poster is available here (15MB). A special edition of the Standards in Genomic Sciences (SIGS) Journal with all the accepted abstracts of the meeting can be found at this link.

"River bank" (ENVO:00000143) of the Potomac River. Picture taken from a location close to the Lincoln Memorial (CC BY-NC-SA).

Saturday, March 30, 2013

March 2013: paving the ground on several project aspects


March 2013 has largely been a preparatory month paving the ground for follow-up tasks.

The main task of ENVIRONMENTS-EOL is the identification of environment descriptive terms, such as terrestrial, aquatic, lagoon, coral reef, in EOL Pages.

To achieve this aim one needs, on the one hand, to collect the text of the EOL Taxon Pages containing environmental context information that could be mined and, on the other hand, a piece of software capable of identifying environment descriptors in this text.

For the former, scripts have been written employing the EOL API to retrieve sections (“subjects” in the EOL terminology) of every taxon, such as TaxonBiology, Description, Biology, Distribution, Habitat and more. EOL’s adherence to standards (e.g. the Species Profile Model) has significantly assisted this procedure. In active collaboration with the EOL Developer Team, the text retrieval will be optimized further.

For the latter a prototype tagger, ENVIRONMENTS, has been compiled. ENVIRONMENTS is based on SPECIES, a tagger capable of identifying organism names in text using a dictionary-based approach (Main developers: Lars, Sune).  

ENVIRONMENTS is capable of identifying environment descriptive terms by looking up words in the text against a dictionary of environment descriptors. A prototype dictionary has been created according to the naming information available in the Environment Ontology (EnvO).
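To illustrate the general idea of dictionary lookup, here is a minimal longest-match sketch. It is not the ENVIRONMENTS or SPECIES implementation; the tokenisation, the `max_words` limit and the function name are assumptions made for this example.

```python
import re


def tag_text(text, dictionary, max_words=4):
    """Return (start, end, matched_text, identifier) tuples, longest match first."""
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at token i, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            # Tokens are joined with single spaces, so punctuation between
            # words is ignored when looking up the candidate.
            candidate = " ".join(text[s:e].lower() for s, e in tokens[i:i + n])
            if candidate in dictionary:
                matches.append((start, end, text[start:end], dictionary[candidate]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return matches


dictionary = {"river bank": "ENVO:00000143", "air": "ENVO:00002005"}
print(tag_text("Nests on the river bank, rarely seen in the air.", dictionary))
```

Preferring the longest match means a multi-word entry such as "river bank" is not also tagged under a shorter entry it contains.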

EnvO is a community resource offering a controlled, structured vocabulary for ecosystems types (“biomes”), environmental materials, and environmental features (e.g. habitats).

The different types and sources of EnvO term names and synonyms have been explored and the more precise ones have been selected.

Further steps include actions that will improve the match between the way terms are written in the text and the way they exist in EnvO e.g. by automatically adding the plural form of the terms in the dictionary.

Another important aspect of the ENVIRONMENTS-EOL project is the evaluation of the accuracy of the environment descriptive term identification. To this end, the creation of a manually annotated corpus (gold standard) is necessary.

Such a corpus comprises a set of documents in which environment descriptors have been manually identified and mapped to unique identifiers in community resources (e.g. the Environment Ontology terms).

Once such a gold standard corpus is in place, its manually annotated tags can be compared with those predicted by named entity recognition software. In this way the accuracy of the latter can be calculated.

Reflecting on the experience gained from the creation of a manually annotated corpus of taxonomic mentions (S800 corpus) and on the pilot annotation of environment descriptive terms in PubMed abstracts (thanks to Christina for her support), a guideline document is now in place.

Such a document will provide the curator team (Aikaterini, Christina, Evangelos, Julia, Lucia, Sarah) with a guide containing examples of documents in which environment descriptors have been manually identified and mapped to the corresponding EnvO terms.

Additionally, this guide elaborates on the main categories employed by EnvO, presents web-search tools dedicated to EnvO and text editors to facilitate the annotation task, discusses issues already spotted (e.g. how to handle environmental descriptors currently missing from EnvO), and lists hints and tips that could assist the tedious task of manual annotation.

Welcome to ENVIRONMENTS-EOL, a few words on the project


Large-scale biological questions such as retrieving all species belonging to a specific group (e.g. Invertebrates), associated with a particular environment (e.g. coral reefs) and occurring in a specific region (e.g. Indo-Pacific Ocean) require the combinatorial analysis of information available in a diverse range of resources.

Taxonomy information along with species occurrence data (stored in centralized biodiversity resources) can be combined to this end. To fill-in, however, the missing pieces of the puzzle, input based on knowledge existing in the scientific literature is required.

The Encyclopedia of Life (http://eol.org), by collecting the available information about a given taxon, is a one-stop shop that greatly facilitates answering such questions.

The identification of environment descriptive terms, such as terrestrial, aquatic, lagoon, coral reef, in EOL text can drive the mining of species environmental context information.

ENVIRONMENTS is an open source tool supporting such identification. It does so by looking up words in the text against a dictionary of environment descriptors.

The Environment Ontology (http://environmentontology.org/), a community resource offering a controlled, structured vocabulary for ecosystems types (“biomes”), environmental materials, and environmental features (e.g. habitats), serves as the source of names and synonyms for the creation of such a dictionary.

While the environment descriptive term identification is the core of the project, tasks such as:
  • the evaluation of the accuracy of the method (via the creation of a manually annotated, gold standard, corpus)
  • the assessment of the contribution of the different EOL page sections to the environmental context mining
  • the consideration of taxonomy and species occurrence information
  • the generation of summarizing visualizations supporting comparisons and biological inferences

are equally important in answering large-scale biological questions like the one at the beginning of this post.

What lies ahead is a challenging project comprising a diverse range of tasks. In response, a team of researchers with diverse backgrounds (molecular biology, microbial ecology, data analysis, text/literature mining, bioinformatics, statistics and more) has been put together.

Through the posts in this blog, it will, hopefully, be possible to keep you up-to-date with the project developments, provide you with more information on the tasks involved, present to you and bring you in contact with the people contributing to the different tasks.

Stay tuned!

Wednesday, January 30, 2013

The ENVIRONMENTS - EOL project is about to take off

Please find below an excerpt from the Encyclopedia of Life press release for the 2013 EOL Rubenstein Fellows. It announces ENVIRONMENTS as well as six other projects, all aiming at employing EOL contents to answer large-scale biological questions. The complete press release can be found here: http://eol.org/info/485.

The Encyclopedia of Life Announces 2013 Rubenstein Fellows


Seven teams of biologists, information scientists and software developers to collaborate on Big Data research through EOL

Washington, D.C. - January 28, 2013 - The Encyclopedia of Life (EOL) is pleased to announce the winners of the 2013 EOL Rubenstein Research Fellowship awards.  The seven awardees will lead research teams seeking to answer novel research questions not readily addressable without the extensive data resources served by EOL.  EOL Rubenstein Research Fellow awards are made possible by a generous gift to EOL by David M. Rubenstein through the Smithsonian Institution’s National Museum of Natural History. .......... (Read more:   http://eol.org/info/485 )