Friday, October 3, 2014

September 22-25, 2014: "SEQenvIII: From Signals to Environmentally Tagged Sequences III" Hackathon, HCMR, Crete, Greece

Participants: From left to right: Christina Pavloudi,
Anastasis Oulas, Lex Overmars, Conor Meehan, Lars Juhl
Jensen, Tomas Flouri, Tomas Vetrovsky, Chris Quince, Lucas
Sinclair. Not shown: Umer Ijaz (via Skype), Evangelos Pafilis.

SEQenv is a sister project of ENVIRONMENTS-EOL, addressing the microbial realm and bringing together the worlds of text mining, sequence analysis, statistics and visualization through the prism of microbial ecology.

In particular, SEQenv is a pipeline aiming at annotating 16S rRNA and metagenomics microbial sequences based on environment descriptive terms.

Sequence similarity searches against public databases and the recognition of terms such as “glacier, pelagic, forest, lagoon” (i.e. Environment Ontology terms identified by the ENVIRONMENTS tagger) within GenBank records (e.g. the “isolation source” field) and/or in the relevant literature (PubMed abstracts) are being employed to characterize novel microbial sequences. Subsequently, a range of visualizations, such as tag clouds and heatmaps, are generated to describe OTUs and samples.

SEQenv has been built incrementally over three hackathons since September 2012, with several features added along the way. For example, having started from 16S rRNA sequences, the pipeline may now be invoked either (a) for DNA sequences or (b) for protein sequences.

This year, it was time to clean up the code and package it in a language that would be easy to distribute. Moreover, speed was optimized and novel forms of visualization were explored. The core modules were rewritten in Python; speed-ups were implemented, e.g. by optimizing the sequence similarity searches; and interactive HTML/JavaScript visualizations started replacing R-generated static diagrams. In addition, the integration of SEQenv-derived annotations with phylogenetic information and extensions to the text-mining module were investigated.

The effort and devotion of all participants (see image above) were the driving force in overcoming the long hours and any technical difficulties that arose. A big thank you from the organizers (without forgetting the SEQenvI and SEQenvII participants).

All three SEQenv hackathons have been funded by the EU COST ES1103 Action on "Microbial ecology & the earth system: collaborating for insight and success with the new generation of sequencing tools".

Besides the EU COST ES1103 Action, the organizers would like to thank the LifeWatchGreece project for additional local support.
Any software produced during the hackathon will be made available as open source.

Friday, March 7, 2014

November 2013 - February 2014: EOL/Traitbank Integration, On-going Benchmark, Exploratory Visualizations

In contrast to the silence of this blog, progress has been made on all fronts of the ENVIRONMENTS-EOL project, many sub-parts of which are now either finished or nearing completion.

Have you seen ENVIRONMENTS-EOL predictions in the Encyclopedia of Life?

Early in 2014 the Encyclopedia of Life (EOL) released its new version, along with Traitbank, its novel data search facility. Among other innovative biodiversity research features, Jennifer Hammock, Patrick Leary, Katja Schulz, Cyndy Parr (and anyone else I may be omitting) have incorporated ENVIRONMENTS-EOL predictions into the EOL system.

As shown on the left, Environment Ontology (ENVO) descriptive terms associated with a taxon (in this example: Hexanchus griseus, the Bluntnose Six-gill Shark) can be seen both in the Overview, under Quick Facts (short list, upper-right part of the figure), and under the Data Tab (extended list, lower-right part of the figure).

Such features render the ENVIRONMENTS-EOL predictions accessible to all EOL users, regardless of their Information Technology skills.

Due to natural language intricacies, such as the multiple meanings a word may have, erroneous predictions will occur. As described in previous posts (April, July 2013), ENVIRONMENTS-EOL has been developed in iterative cycles aiming to identify and handle the most prominent of such errors.

An improved version of the ENVIRONMENTS tagger is now ready and the release of a new ENVIRONMENTS-EOL annotation dataset is in preparation.

Named Entity Recognition (NER) Evaluation
From a text-mining point of view, the work on the ENVIRONMENTS-600 corpus has been concluded, the Inter-Annotator Agreement having been the last step of the process.

The tagger evaluation in terms of precision and recall is nearly complete. Points of particular interest were:
  • the handling of multiple EnvO identifiers having been mapped to a term by the curators, and/or being returned by the tagger
  • the hierarchical relationship of curated/predicted terms according to EnvO
  • the NER evaluation for distinct EnvO sub-graphs only (e.g. only for environmental features, or habitats, or environmental materials)
  • the NER evaluation for the different EOL Species page sections (e.g. only for "Habitat", or "Distribution", or "Taxon Biology")
The analysis of the NER performance is on-going.
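
As a rough illustration of the evaluation described above, the sketch below computes precision and recall over sets of (document, EnvO identifier) pairs. It is a minimal sketch under stated assumptions, not the project's benchmark code: only exact identifier matches count, repeated mentions are collapsed, and the EnvO hierarchy and multiple mappings (the points of interest listed above) are ignored. The function names and the document identifiers are hypothetical.

```python
def precision_recall(gold, predicted):
    """Precision and recall over sets of (document, EnvO id) pairs.

    Assumption: exact identifier matches only; no credit is given for
    parent/child terms in the EnvO hierarchy.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def restrict_to_terms(pairs, allowed_ids):
    """Keep only pairs whose EnvO id is in `allowed_ids`.

    This mimics evaluating a single EnvO sub-graph; how the sub-graph
    members are collected is left out of this sketch.
    """
    return {(doc, term) for doc, term in pairs if term in allowed_ids}


# Hypothetical curated and predicted annotations for two documents.
gold = {("doc1", "ENVO:00000111"), ("doc1", "ENVO:01000118"),
        ("doc2", "ENVO:00000080")}
predicted = {("doc1", "ENVO:00000111"), ("doc2", "ENVO:00000080"),
             ("doc2", "ENVO:01000174")}
precision, recall = precision_recall(gold, predicted)
```

The same function can be applied per EOL page section by filtering the pairs to documents from that section before calling it.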

North and South America bird habitat associations and visualisations
Early in February 2014, the NESCent EOL-BHL Research Sprint (Durham, North Carolina) event gave ENVIRONMENTS-EOL a unique opportunity to explore concrete biological questions based on its machinery.

Interdisciplinary collaboration was at the very center of the event; biologists were teamed with Information Technology researchers to tackle open biodiversity research questions based on EOL/TraitBank and Biodiversity Heritage Library (BHL) data.

NoPlaceLikeHome, a project initiated and driven by Prof. Rob Stevenson, U Mass, Boston, aimed at exploring species - habitat associations. A significant contribution was received from the local collaborator Dr. Carl Nordman, NatureServe.

In this context ENVIRONMENTS was used to annotate in-house North and South America Bird (Aves) information such as ecology, habitat, migration descriptions and others. 

Heatmaps and tagclouds (see below) were generated to visualise the text mining results: species - EnvO term associations based on simple term counts.
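
The "simple term counts" behind such a heatmap can be sketched as a species-by-term count table. This is an illustrative sketch only, not the SEQenv visualisation scripts; the function name and the second species name are hypothetical.

```python
from collections import Counter, defaultdict


def term_count_matrix(annotations):
    """Build a species x term count table from (species, term) pairs.

    Returns the sorted species names, the sorted terms, and a matrix
    whose cell [i][j] is how often term j was tagged for species i.
    """
    counts = defaultdict(Counter)
    for species, term in annotations:
        counts[species][term] += 1
    species_names = sorted(counts)
    terms = sorted({term for c in counts.values() for term in c})
    # A Counter returns 0 for a term never seen with that species.
    matrix = [[counts[s][t] for t in terms] for s in species_names]
    return species_names, terms, matrix


# Hypothetical tagger output: one pair per recognised mention.
annotations = [
    ("Aquila nipalensis", "steppe"),
    ("Aquila nipalensis", "steppe"),
    ("Aquila nipalensis", "forest"),
    ("Species B", "forest"),
]
species_names, terms, matrix = term_count_matrix(annotations)
```

Each cell value would drive the colour intensity of one heatmap cell; summing a column gives a term's overall frequency, which is the kind of quantity a tag cloud scales by.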

The visualisation scripts have been developed as part of the SEQenv sister project to characterize microbial sample sequences, according to the environment from which they derive. 

Employed in a higher eukaryote context, the same tools can still support knowledge exploration e.g. by highlighting rare/frequent habitats, species habitat breadth, and intra-taxon environment association differences.

An even more user-friendly, interactive version of the graphics is under way in collaboration with Dr. Umer Ijaz, Uni. of Glasgow.

The image shown below is a compilation of the project outputs by Cyndy Parr, as found in the EOL Blog (among other reports from the NESCent EOL-BHL Research Sprint and EOL News). Quoting Cyndy's to-the-point legend: "Species are on the X axis and the Environment Ontology (EnvO) habitat term associations on the Y axis, with the redness (or size in the inset Wordle) based on simple term counts."  

Tuesday, October 29, 2013

September – October 2013: ENVIRONMENTS-EOL Outreach (BioCreative IV, TDWG 2013), E600 Housekeeping

Outreach activities have been the main focal point so far in Autumn 2013.

ENVIRONMENTS, ENVIRONMENTS-EOL, and the sister project SPECIES were presented in an invited talk at the BioCreative IV workshop (7 - 9 October, Washington DC, US) as part of a DOE-funded Discussion Panel on Metagenomics.

Bridging the metagenomics and text mining communities, e.g. by employing text mining techniques to support standards-compliant sequence metadata annotation, was one of the main discussion points.
The BioCreative IV workshop proceedings, including opinions on the previous point, are available here (see Volume 1, pages 279-291).

On behalf of the ENVIRONMENTS-EOL team a big thank you to the BioCreative organizers.

At the time of writing, the Biodiversity Information Standards Conference (TDWG 2013, 28 Oct - 1 Nov, Firenze), is on-going.

ENVIRONMENTS-EOL will be presented this Friday (1st Nov, 11:20) in the "Interoperability with genomic and ecological semantics" session of the Semantics for Biodiversity Symposium of TDWG2013 (travel made possible thanks to the EOL Rubenstein Fellows Program's funding).

In parallel, and while the benchmarking algorithms are being prepared, the ENVIRONMENTS-600 (E600) corpus returned by the curators (see August's post) underwent housekeeping, e.g. the removal of errors introduced during the manual curation, such as missing tabs in the annotation items, flag misspellings and others.

A mountain range (ENVO:00000080) as seen on board a flight from Munich, Germany to Florence, Italy to attend TDWG2013. Could it be the Dolomite mountain range?

Saturday, September 14, 2013

August 2013: The E600 curation month

Amid high July – August temperatures for some of the team members, visits to associate labs for some others, and as a side-activity to normal lab/office work for the rest, the most tedious and time-consuming part of this project has now been completed.

Environments-600 (E600), a corpus comprising 600 EOL Taxa pages, was evenly and randomly distributed among the 6 curators (4 graduate students, 2 postdocs; see June’s post).

To maximize environment type coverage, the 600 EOL documents were species pages randomly picked from the following eight taxa: Actinopterygii, Annelida, Arthropoda, Aves, Chlorophyta, Mammalia, Mollusca, Streptophyta. These are taxa either associated with different environments from each other, or known to exist in a diverse range of environments.

Each curator had 45 days to annotate 120 documents, i.e. their part of the corpus (600/6 = 100 documents each) plus 20 documents (i.e. 20% of 100) that are shared with other curators. The ‘20% overlap’ is an important part of the curation process: it supports the calculation of the Inter-Annotator Agreement (IAA), based on pairwise calculations of Cohen's kappa coefficient.
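
For readers unfamiliar with the statistic, here is a minimal sketch of pairwise Cohen's kappa. It assumes that, for the shared documents, each curator's annotations have already been aligned into one label per item (for example an EnvO identifier or "none"); the post does not describe how that alignment was done, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def pairwise_kappa(annotations):
    """Kappa for every pair of curators.

    `annotations` maps a curator name to that curator's label list;
    all lists must refer to the same items in the same order.
    """
    return {
        (a, b): cohen_kappa(annotations[a], annotations[b])
        for a, b in combinations(sorted(annotations), 2)
    }
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, which is why it is preferred over the raw fraction of matching labels.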

Each curator had access to his/her own documents only. No information on the shared documents had been disclosed.

All curators were instructed to evaluate all document substrings and map the recognized environment descriptors to the corresponding EnvO terms.

Based on EnvO (envo-basic.obo, version date: 14 June 2013), such environment descriptors included: habitats, biomes, environmental features, conditions and materials (EnvO high-level terms: 00002036, 00000428, 00002297, 01000203, 00010483 respectively).

All recognized mentions should be listed (including repetitions) in the order of appearance in text. To facilitate EnvO term search and ontology browsing OBO-Edit has been employed.

When an environment descriptor could refer to more than one EnvO term, multiple mappings were allowed (e.g. mapping “forest” to ENVO:00000111, “forest” (environmental feature), and ENVO:01000174, “forest biome”).

In the case of “nested” environment descriptors, a “left-most longest”-like matching approach was applied. If, for example, “sandy sediment” is met in text, it will be mapped to ENVO:01000118, “sandy sediment” (and not to the nested terms: sand, sediment).
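
The nested-term rule can be illustrated with a small left-most longest dictionary matcher. This is a sketch of the rule only, not the curators' procedure (which was manual) nor the ENVIRONMENTS tagger's implementation; it assumes whitespace tokenisation and ignores punctuation and multiple mappings. The identifiers for “sand” and “sediment” are placeholders, not real EnvO ids.

```python
def leftmost_longest_matches(text, dictionary):
    """Scan left to right; at each token take the longest dictionary phrase."""
    tokens = text.lower().split()
    max_len = max((len(name.split()) for name in dictionary), default=0)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + size])
            if phrase in dictionary:
                matches.append((phrase, dictionary[phrase]))
                i += size  # skip past the match, so nested terms are not reported
                break
        else:
            i += 1  # no dictionary term starts at this token
    return matches


# Toy dictionary; the two placeholder values are NOT real EnvO identifiers.
toy_dictionary = {
    "sandy sediment": "ENVO:01000118",
    "forest": "ENVO:00000111",
    "sand": "<sand id placeholder>",
    "sediment": "<sediment id placeholder>",
}
example = leftmost_longest_matches("found in sandy sediment near forest",
                                   toy_dictionary)
```

Here “sandy sediment” is reported once as a whole, and the nested “sediment” is never emitted separately.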

During the curation, a range of special cases were encountered. Cases such as misspellings, missing EnvO term synonyms and enumerations were indicated as such. Environment descriptors that did not correspond to an existing EnvO term were also marked as such.

Finally, when environment descriptive terms were part of geographical locations and/or common taxon names (e.g. Steppe Eagle, Aquila nipalensis, shown in the Figure), they were flagged as such to allow for downstream analysis.

Calculating the IAA and merging the annotated documents into a single corpus are now ongoing, paving the way for the ENVIRONMENTS accuracy benchmark. Stay tuned!

Steppe Eagle, Aquila nipalensis, a common species name including a reference to an environment. Such cases encountered during the curation have been flagged for follow-up analysis (Image License: CC BY NC SA, © Tarique Sani, Source: Flickr: EOL Images)

Friday, August 9, 2013

July 2013: First Deliverables: Tagger, Dictionary, Stopword-list: v1.0 Ready!

July 2013 has been a highly active month. 

A visit by Dr. Lars Juhl Jensen to HCMR (Hellenic Center for Marine Research), Crete followed up on last April’s ENVIRONMENTS software developments (see post).

The main focus was on updating the dictionary and the stopword-list according to the information contained in a recent Environment Ontology version (envo-basic.obo, date: 14 June 2013).

The Environment Ontology updates, including an improved coverage of terrestrial biomes (see EnvO News post), were the main reason for such an update.

As a result, the v1.0 ENVIRONMENTS tagger is now ready and has been delivered to EOL (including the latest dictionary of environment descriptive terms and the relevant stopword-list). All these software components are open source and will be made available in due course.

An annotation of all EOL taxon pages using the v1.0 tagger, along with a precision analysis of the annotations of the different EOL page sections, has been completed.

The gold standard corpus curation and the analysis of ENVIRONMENTS’ accuracy based on that corpus are now the main focus. 600 EOL species pages (from eight taxa: Actinopterygii, Annelida, Arthropoda, Aves, Chlorophyta, Mammalia, Mollusca, Streptophyta – to maximize environment diversity) have now been shared among the curators and the manual annotation is ongoing.

In the meantime, brief holiday opportunities arise :) (Picture taken at Ancient Falasarna, Chania, Crete, Early August 2013, CC BY-NC-SA)

Thursday, July 4, 2013

June 2013: The “dry-run” curation month

Generating a gold standard corpus comprises steps such as: document collection/selection, manual document annotation, annotation result collection and statistical analysis.

The first and last steps can be computationally assisted and partially automated. However, this is not the case for the manual document annotation. Also called “curation”, the manual document annotation comprises the manual scanning of the document text to identify environment descriptive terms and map them to unique identifiers according to a community resource (the Environment Ontology (EnvO) in this case).

The tediousness and time demands of such a process call for collaborative effort. An international group of six researchers: Lucia Fanini, Sarah Faulwetter, Evangelos Pafilis, Christina Pavloudi, Julia Schnetzer, Katerina Vasileiadou (in alphabetical order) have undertaken this task.

Coming from a diverse range of scientific backgrounds (such as ecology, computational biology, molecular biology, and systematics), they represent different mindsets upon scanning pieces of text, in a way representing different EOL readers.

Such pluralism is a desired feature for the corpus curation; however, a common understanding among team members has to be established.

This was one of the main aims of the test curation (“dry run”) that took place during June 2013. A small set of documents (text sections from EOL species pages, see post) were delivered to all curators. Upon manually annotating these documents, curators collected as many questions as possible around unclear and/or problematic annotation cases. Some examples of the latter are: terms and/or synonyms missing from EnvO, words that could be mapped to multiple EnvO terms, location names, nested environment descriptive terms.

A strategy employing a set of flags to indicate such cases is now in place. The previously generated curation guideline document (see post) has been updated accordingly and the production-level curation may now start.

The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text @ PLOS ONE

The sister projects SPECIES and ORGANISMS are now published in PLOS ONE, as part of the PLOS Text Mining Collection.

The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. Pafilis E, Frankild SP, Fanini L, Faulwetter S, Pavloudi C, et al. (2013) PLoS ONE 8(6): e65390. doi:10.1371/journal.pone.0065390

The knowledge, skills and know-how gained through this work paved the way for ENVIRONMENTS.

A big thank you to the team, Evangelos