A project aiming at processing the Encyclopedia of Life's (EOL) Taxon pages to extract descriptions of their environmental context. Such input will subsequently be employed to answer integrative large-scale biological questions (Funded by the EOL Rubenstein Fellows Program).
Tuesday, October 29, 2013
September – October 2013: ENVIRONMENTS-EOL Outreach (BioCreative IV, TDWG 2013), E600 Housekeeping
Outreach activities have been the main focal point so far in Autumn 2013.
ENVIRONMENTS, ENVIRONMENTS-EOL, and the sister project SPECIES were presented in an invited talk at the BioCreative IV workshop (7-9 October, Washington DC, US), as part of a DOE-funded Discussion Panel on Metagenomics.
Bridging the metagenomics and text mining communities, e.g. by employing text mining techniques to support standards-compliant sequence metadata annotation, was one of the main discussion points.
The BioCreative IV workshop proceedings, including opinions on the previous point, are available here (see Volume 1, pages 279-291).
On behalf of the ENVIRONMENTS-EOL team a big thank you to the BioCreative organizers.
At the time of writing, the Biodiversity Information Standards Conference (TDWG 2013, 28 Oct - 1 Nov, Firenze) is ongoing.
ENVIRONMENTS-EOL will be presented this Friday (1st Nov, 11:20) in the "Interoperability with genomic and ecological semantics" session of the Semantics for Biodiversity Symposium of TDWG 2013 (travel made possible thanks to the EOL Rubenstein Fellows Program's funding).
In parallel, and while the benchmarking algorithms are being prepared, the ENVIRONMENTS-600 (E600) corpus returned by the curators (see August's post) underwent housekeeping processing, i.e. the removal of errors that had been introduced during the manual curation, such as missing tabs in the annotation items, flag misspellings, and others.
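For illustration, a minimal sketch of the kind of consistency checks applied during this housekeeping step; the file layout (tab-separated: document ID, annotated text span, EnvO identifier, optional flags) and the flag vocabulary below are assumptions, not the project's actual format:

```python
import csv
import re
import sys

# Assumed layout of a curated annotation file (tab-separated):
# document_id <TAB> text_span <TAB> EnvO_id <TAB> flags (optional)
EXPECTED_COLUMNS = (3, 4)
ENVO_ID = re.compile(r"^ENVO:\d{8}$")
KNOWN_FLAGS = {"misspelling", "missing_synonym", "enumeration",
               "no_envo_term", "location", "taxon_name"}  # assumed flag vocabulary

def check_annotation_file(path):
    """Report lines with missing tabs, malformed EnvO IDs, or misspelled flags."""
    problems = []
    with open(path, newline="", encoding="utf-8") as handle:
        for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if len(row) not in EXPECTED_COLUMNS:
                problems.append((line_no, "unexpected number of tab-separated fields"))
                continue
            if not ENVO_ID.match(row[2].strip()):
                problems.append((line_no, f"malformed EnvO identifier: {row[2]!r}"))
            if len(row) == 4:
                for flag in filter(None, row[3].split(",")):
                    if flag.strip() not in KNOWN_FLAGS:
                        problems.append((line_no, f"unknown flag: {flag.strip()!r}"))
    return problems

if __name__ == "__main__":
    for path in sys.argv[1:]:
        for line_no, message in check_annotation_file(path):
            print(f"{path}:{line_no}: {message}")
```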
A mountain range (ENVO:00000080) as seen on board a flight from Munich, Germany to Florence, Italy to attend TDWG 2013. Could it be the Dolomite mountain range?
Saturday, September 14, 2013
August 2013: The E600 curation month
Amid July – August high temperatures for some of the team members, visits to associate labs for some others, and as a side activity to normal lab/office work for the rest, the most tedious and time-consuming part of this project has now been completed.
Environments-600 (E600), a corpus comprising 600 EOL Taxon pages, was evenly and randomly distributed among the 6 curators (4 graduate students, 2 postdocs; see June's post).
To maximize environment type coverage, the 600 EOL documents were species pages randomly picked from the following eight taxa: Actinopterygii, Annelida, Arthropoda, Aves, Chlorophyta, Mammalia, Mollusca, Streptophyta. These are taxa that are either associated with environments different from each other, or known to exist in a diverse range of environments.
Each curator had 45 days to annotate 120 documents, i.e. their part of the corpus: 600/6 = 100 documents each, plus 20 documents (i.e. 20% of 100) shared with other curators. The ‘20% overlap’ is an important part of the curation process: it supports the calculation of the inter-annotator agreement (IAA), based on pairwise calculations of Cohen's kappa coefficient.
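As an illustration of the IAA computation, here is a minimal sketch of pairwise Cohen's kappa, assuming (hypothetically) that each curator's annotations of the shared documents have been reduced to per-token binary labels (1 = part of an environment annotation, 0 = not); the toy data are made up:

```python
from itertools import combinations

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def pairwise_iaa(annotations):
    """annotations: curator name -> list of labels over the shared documents."""
    return {(a, b): cohens_kappa(annotations[a], annotations[b])
            for a, b in combinations(sorted(annotations), 2)}

# Toy example: three curators labelling the same ten tokens.
example = {
    "curator1": [1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    "curator2": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    "curator3": [1, 0, 1, 1, 1, 0, 0, 1, 0, 0],
}
for pair, kappa in sorted(pairwise_iaa(example).items()):
    print(pair, round(kappa, 3))
```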
Each curator had access to his/her own documents only. No information on the shared documents had been disclosed.
All curators were instructed to evaluate all document substrings and map the recognized environment descriptors to the corresponding EnvO terms. Reflecting the EnvO release used (envo-basic.obo, version date: 14 June 2013), such environment descriptors included habitats, biomes, environmental features, conditions, and materials (EnvO high-level terms 00002036, 00000428, 00002297, 01000203, and 00010483, respectively). All recognized mentions were to be listed (including repetitions) in their order of appearance in the text. To facilitate EnvO term search and ontology browsing, OBO-Edit was employed.
When an environment descriptor could refer to more than one EnvO term, multiple mappings were allowed (e.g. mapping “forest” to both ENVO:00000111, “forest” (environmental feature), and ENVO:01000174, “forest biome”).
In the case of “nested” environment descriptors, a “leftmost-longest”-like matching approach was applied. If, for example, “sandy sediment” occurs in the text, it will be mapped to ENVO:01000118, “sandy sediment” (and not to the nested terms “sand” and “sediment”).
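A minimal sketch of such a leftmost-longest dictionary lookup; the dictionary entries below are illustrative only (the identifiers for the nested terms are placeholders), not the actual ENVIRONMENTS dictionary or matching code:

```python
ENVO_DICTIONARY = {
    # illustrative entries; IDs for "sand" and "sediment" are placeholders
    "sand": "ENVO:placeholder-sand",
    "sediment": "ENVO:placeholder-sediment",
    "sandy sediment": "ENVO:01000118",
    "forest": "ENVO:00000111",
    "forest biome": "ENVO:01000174",
}
MAX_TERM_WORDS = max(len(term.split()) for term in ENVO_DICTIONARY)

def leftmost_longest_matches(text):
    """Scan the text left to right, always preferring the longest dictionary match."""
    words = text.lower().split()
    matches, i = [], 0
    while i < len(words):
        for length in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + length])
            if candidate in ENVO_DICTIONARY:
                matches.append((candidate, ENVO_DICTIONARY[candidate]))
                i += length
                break
        else:
            i += 1
    return matches

print(leftmost_longest_matches("Found in sandy sediment near the forest edge"))
# [('sandy sediment', 'ENVO:01000118'), ('forest', 'ENVO:00000111')]
```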
During the curation a range of special cases were encountered. Cases such as misspellings, missing EnvO synonyms, and enumerations were flagged as such. Environment descriptors that did not correspond to an existing EnvO term were also marked accordingly.
Finally, environment descriptive terms that were part of geographical locations and/or common taxon names (e.g. Steppe Eagle, Aquila nipalensis, shown in the Figure) were flagged as such to allow for downstream analysis.
Calculating the IAA and merging the annotated documents into a single corpus are now ongoing, paving the ground for ENVIRONMENTS' accuracy benchmark. Stay tuned!
Steppe Eagle, Aquila nipalensis, a common species name including a reference to an environment. Such cases that occurred during the curation have been flagged for follow-up analysis (Image License: CC BY-NC-SA, © Tarique Sani, Source: Flickr: EOL Images).
Friday, August 9, 2013
July 2013: First Deliverables: Tagger, Dictionary, Stopword-list: v1.0 Ready!
July 2013 has been a highly active month.
A visit by Dr. Lars Juhl Jensen to HCMR (Hellenic Centre for Marine Research), Crete, followed up on last April's ENVIRONMENTS software developments (see post).
The main focus was on updating the dictionary and the stopword list according to the information contained in a recent Environment Ontology version (envo-basic.obo, date: 14 June 2013). The Environment Ontology updates, including an improved coverage of terrestrial biomes (see EnvO News post), were the main reason for this update.
As a result, the v1.0 ENVIRONMENTS tagger is now ready and has been delivered to EOL (including the latest dictionary of environment descriptive terms and the relevant stopword list). All these software components are open source and will be made available in due time.
An annotation of all EOL Taxon pages using the v1.0 tagger, along with a precision analysis of the annotations of the different EOL page sections, has been completed.
The gold standard corpus curation and the analysis of ENVIRONMENTS' accuracy based on that corpus are now the main focus. 600 EOL species pages (from eight taxa: Actinopterygii, Annelida, Arthropoda, Aves, Chlorophyta, Mammalia, Mollusca, Streptophyta, chosen to maximize environment diversity) have now been shared among the curators and the manual annotation is ongoing.
Thursday, July 4, 2013
June 2013: The “dry-run” curation month
The generation of a gold standard corpus comprises steps such as document collection/selection, manual document annotation, annotation result collection, and statistical analysis.
The first and last steps can be computationally assisted and partially automated. However, this is not the case for the manual document annotation. Also called “curation”, the manual document annotation comprises the manual scanning of the document text to identify environment descriptive terms and map them to unique identifiers according to a community resource (the Environment Ontology (EnvO) in this case).
The tediousness and time demands of such a process call for a collaborative effort. An international group of six researchers: Lucia Fanini, Sarah Faulwetter, Evangelos Pafilis, Christina Pavloudi, Julia Schnetzer, Katerina Vasileiadou (in alphabetical order) has undertaken this task.
Coming from a diverse range of scientific backgrounds (such as ecology, computational biology, molecular biology, and systematics), they represent different mindsets when scanning pieces of text, in a way representing different EOL readers.
Such pluralism is a desired feature for the corpus curation; however, a common understanding among team members has to be established.
This was one of the main aims of the test curation (“dry run”) that took place during June 2013. A small set of documents (text sections from EOL species pages, see post) was delivered to all curators. While manually annotating these documents, the curators collected as many questions as possible about unclear and/or problematic annotation cases. Some examples of the latter are: terms and/or synonyms missing from EnvO, words that could be mapped to multiple EnvO terms, location names, and nested environment descriptive terms.
A strategy employing a set of flags to indicate such cases is now in place. The previously generated curation guideline document (see post) has been updated accordingly, and the production-level curation may now start.
The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text @ PLOS ONE
The sister projects SPECIES and ORGANISMS have now been published in PLOS ONE, as part of the PLOS Text Mining Collection.
Pafilis E, Frankild SP, Fanini L, Faulwetter S, Pavloudi C, et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390. doi:10.1371/journal.pone.0065390
http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0065390
The knowledge, skills and know-how gained through this work paved the ground for ENVIRONMENTS.
A big thank you to the team, Evangelos
Monday, June 3, 2013
May 2013: gearing up the corpus curation, mining names and synonyms from EOL content, closely following EnvO developments
A “beach sand” (ENVO:00002138) picture taken at the island of Chrisi, a Natura 2000 site south-east of Crete, Greece. “Coarse beach sand” (ENVO:00002148) can be observed along with shells forming a “biogenous sediment” (ENVO:01000082); a unique feature of this island. Besides the “coarse beach sand”, are all types of sand included in the Environment Ontology? Can the Environments-EOL project assist in proposing terms, names and synonyms? (Image: CC BY-NC-SA)
The Environments-EOL project is nearing its main stages (corpus creation, tagger benchmarking, EOL annotation and taxa characterization, to take place in Summer 2013). To this end, a range of preparatory tasks have been or are being conducted.
May 2013 has seen a “dry-run” curation being set up. A small set of documents is being used for a trial curation (ongoing). The manual and lengthy nature of corpus generation dictates that tests take place before the main body of work commences. Via such a “dry run”, the curators are getting familiarized with the Environment Ontology as well as with relevant browsing and searching tools. Additionally, questions are being raised and discussions invoked on the exact context of terms to be annotated by the Environments-EOL project.
In parallel, early tests showed that the manual addition of synonyms to the dictionary (see “Dictionary Generation” in the previous post) could improve the tagger performance. To facilitate this task, specialized EOL sections (e.g. Habitat) have been analyzed by counting word frequencies in non-tagged text segments. A priority list of terms to be considered was derived. After manual inspection, environment-related words have been mapped to EnvO terms and can now be added to the dictionary. The EOL records involved in this training step have been excluded from the corpus generation (and subsequently from the software evaluation).
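A rough sketch of such a frequency analysis, assuming the non-tagged text segments have already been collected into a list of strings; the example segments and the list of ignored common words are made up for illustration:

```python
import re
from collections import Counter

# Small, illustrative set of uninformative words to ignore when counting.
COMMON_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "with", "is", "are"}

def candidate_terms(untagged_segments, top_n=50):
    """Rank words by frequency to prioritize them for manual EnvO mapping."""
    counts = Counter()
    for segment in untagged_segments:
        for word in re.findall(r"[a-z]+", segment.lower()):
            if word not in COMMON_WORDS and len(word) > 2:
                counts[word] += 1
    return counts.most_common(top_n)

# Hypothetical untagged snippets from EOL "Habitat" sections.
segments = [
    "Lives in fast-flowing upland streams and riffles.",
    "Prefers shaded streams with gravel or cobble bottoms.",
]
print(candidate_terms(segments, top_n=5))
```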
Last but not least: Environments-EOL is a project tightly bound to the Environment Ontology community resource. Highlighting this connection, as well as the project's dynamic nature: a thank you to the EnvO team for their prompt and timely response in updating the "terrestrial biome" hierarchy, which now comprises more compact and fine-grained terms (see EnvO News post).
Friday, May 3, 2013
April 2013: Visit to NNFCPR, Copenhagen, Denmark. Implementing main ENVIRONMENTS components
Nice days during the visit in Copenhagen (1-13 April 2013); while working on ENVIRONMENTS, balloons showed up in the "air" (ENVO:00002005) (image taken at Frederiksberg, Copenhagen; CC BY-NC-SA).
April 2013 has been mainly a travelling month. Besides the presentation of ENVIRONMENTS at GSC 15 (see previous post), a 2-week visit (made possible via the EOL Rubenstein Fellowship support) to the premises of Dr. Lars Juhl Jensen and Dr. Sune Frankild (NNF-CPR, Denmark) has resulted in the implementation of a series of critical tasks.
ENVIRONMENTS Software Development: Dictionary generation
The name and synonym information in the Environment Ontology (EnvO) resource has been assessed. Based on this information, a dictionary has been generated that maps environment descriptive terms to EnvO identifiers.
Where necessary, extra synonyms were generated capturing the variable ways EnvO terms may be written in text. As an extension to previous work (see post) the generation of adjectives (e.g. coast – coastal) or plural forms (e.g. brackish water – brackish waters) has been included.
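A simplified sketch of how such orthographic variants could be generated; the pluralization rules and the adjective lookup table below are illustrative assumptions, not the actual ENVIRONMENTS dictionary-generation code:

```python
def plural_variants(term):
    """Naive pluralization of the last word of a (possibly multi-word) term."""
    words = term.split()
    last = words[-1]
    if last.endswith(("s", "x", "ch", "sh")):
        plural = last + "es"
    elif last.endswith("y") and last[-2] not in "aeiou":
        plural = last[:-1] + "ies"
    else:
        plural = last + "s"
    return [" ".join(words[:-1] + [plural])]

# Irregular adjective forms kept in a small lookup table (illustrative only).
ADJECTIVE_OVERRIDES = {"coast": ["coastal"]}

def expand_term(term):
    """Return the term together with its generated plural/adjective variants."""
    variants = {term}
    variants.update(plural_variants(term))
    variants.update(ADJECTIVE_OVERRIDES.get(term, []))
    return sorted(variants)

print(expand_term("brackish water"))  # ['brackish water', 'brackish waters']
print(expand_term("coast"))           # ['coast', 'coastal', 'coasts']
```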
Species names and anatomy terms present in EnvO, also described in other taxonomies/ontologies, were not included in the dictionary. Moreover, food names were excluded as they may give rise to out-of-context text mining results.
Encyclopedia of Life textual component retrieval and processing
The EOL API has been used to retrieve sections (“subjects” in the EOL terminology) for every taxon, such as TaxonBiology, Description, Biology, Distribution, Habitat, and more.
The EOL taxa text components have been downloaded in the JSON format. A parser has been developed that collects only the selected sections and converts the text into an ENVIRONMENTS-compatible format (e.g. removes HTML tags and converts UTF-8 characters to ASCII).
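A minimal sketch of such a parser, assuming a 2013-era EOL pages API response layout in which text data objects carry a Species Profile Model subject URI in a "subject" field and the text itself in a "description" field; these field names and the input file name are assumptions and may differ from the actual API:

```python
import json
import re
import unicodedata

# Sections ("subjects") of interest, identified by the last part of the SPM URI.
SELECTED_SUBJECTS = {"TaxonBiology", "Description", "Biology", "Distribution", "Habitat"}
TAG = re.compile(r"<[^>]+>")

def extract_sections(json_path):
    """Collect the selected text sections of one EOL taxon page and clean them."""
    with open(json_path, encoding="utf-8") as handle:
        page = json.load(handle)
    sections = []
    for obj in page.get("dataObjects", []):
        subject = obj.get("subject", "")
        name = subject.rsplit("#", 1)[-1]
        if name not in SELECTED_SUBJECTS:
            continue
        text = TAG.sub(" ", obj.get("description", ""))        # strip HTML tags
        text = unicodedata.normalize("NFKD", text)              # decompose accented characters
        text = text.encode("ascii", "ignore").decode("ascii")   # keep ASCII only
        sections.append((name, " ".join(text.split())))
    return sections

# Hypothetical downloaded page; the file name is illustrative.
for name, text in extract_sections("eol_page_example.json"):
    print(name, ":", text[:80])
```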
ENVIRONMENTS Software Development: Stopword curation
Local text repositories of PubMed and EOL were processed with ENVIRONMENTS (using an early-version dictionary). The most frequently tagged terms were inspected manually in text. Those that were found, most of the time, in a context other than describing an environment were added to a “stopword” list; “well”, “spring”, and “range” are a few such examples. Such terms would have caused a high number of false positive matches. The “stopword” list is a mechanism against this phenomenon; its terms will be excluded from the analysis.
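A minimal sketch of how such a stopword list could be applied to the tagger output; the match format and the placeholder identifier for "well" are assumptions:

```python
# Matches whose (lower-cased) text is on the stopword list are simply dropped.
STOPWORDS = {"well", "spring", "range"}

def filter_matches(matches):
    """matches: iterable of (matched_text, envo_id) pairs produced by the tagger."""
    return [(text, envo_id) for text, envo_id in matches
            if text.lower() not in STOPWORDS]

raw = [("beach sand", "ENVO:00002138"),
       ("well", "ENVO:placeholder"),   # spurious match to be removed
       ("forest", "ENVO:00000111")]
print(filter_matches(raw))
```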
ENVIRONMENTS corpus preparation
A major component of this project is the creation of a manually annotated corpus (gold standard). Such a corpus comprises a set of documents in which environment descriptors have been manually identified and mapped to unique identifiers in community resources (e.g. the Environment Ontology terms). By comparing the manual annotations with software-predicted tags, the accuracy of environment descriptive term identification software can be evaluated.
Two basic requirements that such a corpus must meet are:
a. to be comprehensive, i.e. to contain text that refers to diverse types of environments
b. to contain a minimum number of terms per document that would make the manual annotation feasible in a pragmatic time frame
Thursday, May 2, 2013
ENVIRONMENTS@GSC15, April 22-24, NIH, Washington DC
ENVIRONMENTS was presented (talk, poster) last week (April 22-24) at the 15th Genomic Standards Consortium meeting (GSC15) (NIH, Washington DC). Very good and creative feedback was received. A PDF version of the poster is available here (15MB). A special edition of the Standards in Genomic Sciences (SIGS) journal with all the accepted abstracts of the meeting can be found at this link.
Saturday, March 30, 2013
March 2013: paving the ground on several project aspects
March 2013 has largely been a preparatory month paving the ground for follow-up tasks.
The main task of ENVIRONMENTS-EOL is the identification of environment descriptive terms, such as terrestrial, aquatic, lagoon, coral reef, in EOL Pages.
To materialize this aim one needs, on the one hand, to collect the text bits of the EOL Taxon pages containing environmental context information that could be mined, and, on the other hand, a piece of software capable of identifying environment descriptors in these text bits.
For the former, scripts have been written employing the EOL API to retrieve sections (“subjects” in the EOL terminology) of every taxon, such as TaxonBiology, Description, Biology, Distribution, Habitat, and more. EOL's adherence to standards (e.g. to the Species Profile Model) has significantly assisted this procedure. In active collaboration with the EOL Developer Team, the text retrieval will be optimized further.
For the latter, a prototype tagger, ENVIRONMENTS, has been compiled. ENVIRONMENTS is based on SPECIES, a tagger capable of identifying organism names in text using a dictionary-based approach (main developers: Lars, Sune).
ENVIRONMENTS is capable of identifying environment descriptive terms by looking up words in the text against a dictionary of environment descriptors. A prototype dictionary has been created according to the naming information available in the Environment Ontology (EnvO).
EnvO is a community resource offering a controlled, structured vocabulary for ecosystem types (“biomes”), environmental materials, and environmental features (e.g. habitats).
The different types and sources of EnvO term names and synonyms have been explored and the most precise ones have been selected.
Further steps include actions that will improve the match between the way terms are written in the text and the way they appear in EnvO, e.g. by automatically adding the plural forms of the terms to the dictionary.
Another important aspect of the ENVIRONMENTS-EOL project is the evaluation of the accuracy of the environment descriptive term identification. To this end, the creation of a manually annotated corpus (gold standard) is necessary.
Such a corpus comprises a set of documents in which environment descriptors have been manually identified and mapped to unique identifiers in community resources (e.g. the Environment Ontology terms).
Once such a gold standard corpus is in place, its manually annotated tags can be compared with those predicted by named entity recognition software. In this way the accuracy of the latter can be calculated.
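As a sketch of this comparison, assuming both the gold standard and the tagger output have been reduced to sets of (document ID, start offset, end offset, EnvO identifier) tuples; the exact matching criteria used in the project may be stricter or looser, and the example annotations are made up:

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 between gold and predicted annotation sets."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy annotation sets: (document_id, start, end, envo_id)
gold = {("doc1", 10, 16, "ENVO:00000111"), ("doc1", 30, 44, "ENVO:01000118")}
pred = {("doc1", 10, 16, "ENVO:00000111"), ("doc1", 50, 53, "ENVO:00002005")}
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```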
Reflecting on the experience gained from the creation of a manually annotated corpus of taxonomic mentions (the S800 corpus) and on the pilot annotation of environment descriptive terms in PubMed abstracts (thanks to Christina for her support), a guideline document is now in place.
Such a document provides the curator team (Aikaterini, Christina, Evangelos, Julia, Lucia, Sarah) with a guide containing examples of documents in which environment descriptors have been manually identified and mapped to the corresponding EnvO terms.
Additionally, this guide elaborates on the main categories employed by EnvO, presents web-search tools dedicated to EnvO and text editors to facilitate the annotation task, discusses issues already spotted (e.g. how to handle environmental descriptors currently missing from EnvO), and lists hints and tips that could assist the tedious task of manual annotation.