Monday, June 3, 2013

May 2013: gearing up the corpus curation, mining names and synonyms from EOL content, closely following EnvO developments

A “beach sand” (ENVO:00002138) picture taken on the island of Chrisi, a Natura 2000 site south-east of Crete, Greece. “Coarse beach sand” (ENVO:00002148) can be observed along with shells forming a “biogenous sediment” (ENVO:01000082), a unique feature of this island. Besides “coarse beach sand”, are all types of sand included in the Environment Ontology? Can the Environments-EOL project assist in proposing terms, names and synonyms? (Image: CC BY-NC-SA)

The Environments-EOL project is nearing its main stages (corpus creation, tagger benchmarking, EOL annotation and taxa characterization, to take place in Summer 2013). To this end, a range of preparatory tasks has been completed or is under way.

May 2013 has seen a “dry-run” curation being set up. A small set of documents is being used for a trial curation (ongoing). Because generating a corpus is manual and lengthy, tests need to take place before the main body of work commences. Through this “dry-run”, curators are getting familiar with the Environment Ontology as well as with the relevant browsing and searching tools. Additionally, questions are being raised and discussions started on the exact context of the terms to be annotated by the Environments-EOL project.

In parallel, early tests showed that manually adding synonyms to the dictionary (see “Dictionary Generation” in the previous post) could improve the tagger's performance. To facilitate this task, specialized EOL sections (e.g. Habitat) have been analyzed by counting word frequencies in the text segments that the tagger left untagged. From these counts a priority list of terms to be considered was derived. After manual inspection, environment-related words have been mapped to EnvO terms and can now be added to the dictionary. The EOL records involved in this training step have been excluded from the corpus generation (and, consequently, from the software evaluation).
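The frequency count over untagged text can be illustrated with a minimal sketch. It is not the project's actual script: the function names, the tiny stop-word list and the representation of tagger output as character-offset spans are all assumptions made for this illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOPWORDS = {"the", "a", "an", "in", "on", "of", "and", "or", "is", "are", "to", "with"}


def words(segment):
    """Lower-case a text segment and return its alphabetic words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", segment.lower()) if w not in STOPWORDS]


def untagged_word_counts(text, tagged_spans):
    """Count words in the parts of `text` not covered by any tagged span.

    `tagged_spans` is a list of (start, end) character offsets (end exclusive)
    marking where a tagger already recognized a term. This span format is an
    assumption, not the output format of any particular tagger.
    """
    counts = Counter()
    pos = 0
    for start, end in sorted(tagged_spans):
        if start > pos:
            # Text between the previous tagged span and this one is untagged.
            counts.update(words(text[pos:start]))
        pos = max(pos, end)
    # Remaining text after the last tagged span.
    counts.update(words(text[pos:]))
    return counts


# Made-up Habitat sentence in which only "coral reefs" was tagged.
text = "Lives in coral reefs and tidal pools; juveniles shelter in tidal pools."
start = text.index("coral reefs")
counts = untagged_word_counts(text, [(start, start + len("coral reefs"))])

# Most frequent untagged words are the candidates for manual inspection.
priority_list = counts.most_common(5)
print(priority_list)
```

Words that rank high in such a list (here “tidal” and “pools”) would still need the manual inspection described above before being mapped to an EnvO term and added to the dictionary as a name or synonym.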

Last but not least, Environments-EOL is a project tightly bound to the Environment Ontology community resource. To highlight this connection, as well as the project's dynamic nature: a thank you to the EnvO team for their prompt response in updating the "terrestrial biome" hierarchy, which now comprises more compact and fine-grained terms (see EnvO News Post).