Saturday, March 30, 2013

March 2013: laying the groundwork on several project aspects

March 2013 has largely been a preparatory month, laying the groundwork for follow-up tasks.

The main task of ENVIRONMENTS-EOL is the identification of environment descriptive terms, such as terrestrial, aquatic, lagoon, and coral reef, in EOL pages.

Achieving this aim requires two things: on the one hand, a collection of the text snippets from EOL taxon pages that contain environmental context information suitable for mining; on the other hand, a piece of software capable of identifying environment descriptors in those snippets.

For the former, scripts have been written that use the EOL API to retrieve sections (“subjects” in EOL terminology) of every taxon page, such as TaxonBiology, Description, Biology, Distribution, Habitat, and more. EOL’s adherence to standards (e.g. the Species Profile Model) has significantly eased this procedure. The text retrieval will be optimized further in active collaboration with the EOL Developer Team.

For the latter, a prototype tagger, ENVIRONMENTS, has been put together. ENVIRONMENTS is based on SPECIES, a tagger that identifies organism names in text using a dictionary-based approach (main developers: Lars, Sune).

ENVIRONMENTS is capable of identifying environment descriptive terms by looking up words in the text against a dictionary of environment descriptors. A prototype dictionary has been created according to the naming information available in the Environment Ontology (EnvO).
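
To illustrate the general idea of dictionary-based tagging, here is a minimal Python sketch. It is not the ENVIRONMENTS code: the function name tag_text, the toy dictionary, and its identifiers are all made up for illustration, and the longest-match-first strategy is an assumption about how such a lookup could be done.

```python
import re

# Hypothetical toy dictionary: lower-cased term -> identifier.
# The identifiers are placeholders, not real EnvO accessions.
TOY_DICTIONARY = {
    "coral reef": "TOY:0001",
    "lagoon": "TOY:0002",
    "reef": "TOY:0003",
}


def tag_text(text, dictionary, max_words=3):
    """Return (start, end, matched_text, identifier) tuples for dictionary hits."""
    # Character offsets of every word-like token in the text.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    matches = []
    i = 0
    while i < len(tokens):
        found = False
        # Try the longest candidate phrase first, so that
        # "coral reef" is preferred over the shorter "reef".
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            # Collapse whitespace and lower-case before the lookup.
            candidate = re.sub(r"\s+", " ", text[start:end]).lower()
            if candidate in dictionary:
                matches.append((start, end, text[start:end], dictionary[candidate]))
                i += n
                found = True
                break
        if not found:
            i += 1
    return matches


if __name__ == "__main__":
    for hit in tag_text("Found in the Coral reef and a nearby lagoon.", TOY_DICTIONARY):
        print(hit)
```

Keeping character offsets with each match makes it possible to compare the tagger output with manual annotations later on.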

EnvO is a community resource offering a controlled, structured vocabulary for ecosystem types (“biomes”), environmental materials, and environmental features (e.g. habitats).

The different types and sources of EnvO term names and synonyms have been explored and the more precise ones have been selected.

Further steps will improve the match between the way terms are written in text and the way they appear in EnvO, e.g. by automatically adding the plural forms of the terms to the dictionary.

Another important aspect of the ENVIRONMENTS-EOL project is the evaluation of the accuracy of the environment descriptive term identification. To this end, the creation of a manually annotated corpus (gold standard) is necessary.

Such a corpus comprises a set of documents in which environment descriptors have been manually identified and mapped to unique identifiers in community resources (e.g. the Environment Ontology terms).

Once such a gold standard corpus is in place, its manually annotated tags can be compared with those predicted by named entity recognition software. In this way the accuracy of the latter can be calculated.
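
The comparison can be sketched in Python as follows. This is only an illustration: representing each annotation as a (document, start, end, identifier) tuple, requiring exact matches, and reporting precision, recall and F1 are assumptions of this example, since the measures to be used are not specified above.

```python
def evaluate(gold, predicted):
    """Compare predicted annotations with gold standard ones.

    Both arguments are iterables of hashable annotations, here assumed to be
    (document_id, start, end, identifier) tuples. Only exact matches count.
    Returns (precision, recall, f1).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up annotations with placeholder identifiers.
    gold = [("doc1", 13, 23, "TOY:0001"), ("doc1", 35, 41, "TOY:0002")]
    predicted = [("doc1", 13, 23, "TOY:0001"), ("doc1", 50, 54, "TOY:0003")]
    print(evaluate(gold, predicted))
```

Exact matching is the strictest choice; a more lenient scheme could also accept overlapping spans that map to the same identifier.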

Based on the experience gained from the creation of a manually annotated corpus of taxonomic mentions (the S800 corpus) and from the pilot annotation of environment descriptive terms in PubMed abstracts (thanks to Christina for her support), a guideline document is now in place.

This document will provide the curator team (Aikaterini, Christina, Evangelos, Julia, Lucia, Sarah) with examples of documents in which environment descriptors have been manually identified and mapped to the corresponding EnvO terms.

Additionally, this guide elaborates on the main categories employed by EnvO, presents web-search tools dedicated to EnvO and text editors that facilitate the annotation task, discusses issues already spotted (e.g. how to handle environmental descriptors currently missing from EnvO), and lists hints and tips that could ease the tedious task of manual annotation.
