Application and evaluation of automated semantic annotation of gene expression experiments

Supplement to Application and evaluation of automated semantic annotation of gene expression experiments.
French L, Lane S, Law T, Xu L, Pavlidis P

Abstract

Motivation:
Many microarray datasets are available online with formalized standards describing the probe sequences and expression values. Unfortunately, the description, conditions and parameters of the experiments are less commonly formalized and often occur as natural language text. This hinders searching, high-throughput analysis, organization and integration of the datasets.

Results:
We use the lexical resources and software tools from the Unified Medical Language System (UMLS) to extract concepts from text. We then link the UMLS concepts to classes in open biomedical ontologies. The result is accessible and clear semantic annotations of gene expression experiments. We applied the method to 595 expression experiments from Gemma, a resource for re-use and meta-analysis of gene expression profiling data. We evaluated and corrected all stages of the annotation process. The majority of missed annotations were due to a lack of cross-references. The most error prone stage was the extraction of concepts from phrases. Final review of the annotations in context of the experiments revealed 89% precision. We have integrated the annotation pipeline into Gemma.

Publication

Full article is available at Oxford Bioinformatics.

Citation:

Leon French, Suzanne Lane, Tamryn Law, Lydia Xu, and Paul Pavlidis (2009) Application and evaluation of automated semantic annotation of gene expression experiments. Bioinformatics 25(12):1543-1549; doi:10.1093/bioinformatics/btp259

Contact:

paul@msl.ubc.ca

Example:

A detailed guide to how an example text fragment is annotated by the system.

Annotations:

The complete set of predicted annotations is available as a machine-readable RDF graph below.

Both manual and predicted annotations can be viewed and searched via the Gemma website.

Evaluations:

Mapping from Phrase to CUI
Mapping from CUI to Ontologies
Uninformative URLs
High-level review of 100 experiments

Results:

From the original paper:
Machine-readable RDF graph of predicted annotations and a sample SPARQL query.

Please contact us if you have problems with the RDF or advice on it; we hope to improve the RDF generation in the future.

Example RDF generated by ExampleAnnotator on the phrase “brain cancer”. Its format differs slightly from the graph above because it does not reference an experiment (Tabulator is recommended for viewing).

Binary:

Available as a zip file that includes a main jar file and a few others:
GEOMMTX.zip

Requires:

Make sure the Annotator.properties file points to the required resources from UMLS and the evaluations.
* Copy mmtxProjectJS.jar from your MMTx installation folder to the project root folder
* Edit Annotator.properties to point to cui_source_loc from your MMTx installation folder (nls/mmtx/data/2006/mmtx)
* Edit Annotator.properties to point to a MRCONSO.RRF file from a UMLS installation (META/)
* Edit Annotator.properties to point to the evaluation files
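
The configuration steps above can be sanity-checked before running the annotator. The following is a minimal sketch, not part of GEOMMTx: it loads Annotator.properties with the standard java.util.Properties class and reports keys whose values do not point to an existing file or directory. It assumes that cui_source_loc is the literal property key; the keys for MRCONSO.RRF and the evaluation files are not named here, so they would need to be added to the list from your own copy of the file. The class name AnnotatorPropertiesCheck is hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of the GEOMMTx distribution.
public class AnnotatorPropertiesCheck {

    // Returns the keys (in the order given) that are absent, empty,
    // or whose value is not an existing file or directory.
    static List<String> missingPaths(Properties props, List<String> keys) {
        List<String> missing = new ArrayList<>();
        for (String key : keys) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()
                    || !Files.exists(Paths.get(value.trim()))) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "Annotator.properties");
        if (!Files.exists(file)) {
            System.out.println("Properties file not found: " + file);
            return;
        }
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(file)) {
            props.load(in);
        }
        // Assumption: cui_source_loc is the key name. Add the keys for
        // MRCONSO.RRF and the evaluation files as they appear in your file.
        List<String> keys = Arrays.asList("cui_source_loc");
        for (String key : missingPaths(props, keys)) {
            System.out.println("Not found: " + key + " = " + props.getProperty(key));
        }
    }
}
```

Run it from the project root folder, optionally passing the path to the properties file as the first argument.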

 

Example execution:

java -Xmx3524m -Xms2250m -classpath icu4j-3.4.jar:xercesImpl-2.8.1.jar:mmtxProjectJS.jar:GEOMMTx.jar:. ubic.GEOMMTx.ExampleAnnotator "Text to be processed, like brain cancer"

The result is written to example.rdf. The first time it is executed, it may take a while to download and cache the ontology mappings.
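
The example command separates classpath entries with a colon, which is the convention on Unix-like systems; on Windows the separator is a semicolon. The sketch below shows one way to assemble the same command portably using File.pathSeparator. The jar names, JVM options, and main class are taken from the command above; the class name GeommtxCommand and its methods are hypothetical and not part of GEOMMTx.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the GEOMMTx distribution.
public class GeommtxCommand {

    // Classpath entries from the example command, in the same order.
    static final List<String> ENTRIES = Arrays.asList(
            "icu4j-3.4.jar", "xercesImpl-2.8.1.jar",
            "mmtxProjectJS.jar", "GEOMMTx.jar", ".");

    // Joins the classpath entries with the given separator.
    static String classpath(String separator) {
        return String.join(separator, ENTRIES);
    }

    // Builds the argument list for the example execution.
    static List<String> command(String text) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx3524m");
        cmd.add("-Xms2250m");
        cmd.add("-classpath");
        cmd.add(classpath(File.pathSeparator)); // ":" on Unix, ";" on Windows
        cmd.add("ubic.GEOMMTx.ExampleAnnotator");
        cmd.add(text); // passed as a single argument, so no shell quoting is needed
        return cmd;
    }

    public static void main(String[] args) {
        String text = args.length > 0 ? args[0] : "Text to be processed, like brain cancer";
        // Print one argument per line so the text argument stays unambiguous.
        for (String arg : command(text)) {
            System.out.println(arg);
        }
        // To launch the annotator instead of printing, the list could be passed to
        // new ProcessBuilder(command(text)).inheritIO().start()
    }
}
```

Because the argument list is handed over without going through a shell, the text to annotate needs no quotation marks of its own.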

Source:

Source archive

To build you need:

The instructions are found in readme.txt:

Installation:

* Copy mmtxProjectJS.jar from your MMTx installation folder to the project root folder
* Edit Annotator.properties to point to cui_source_loc from your MMTx installation folder (nls/mmtx/data/2006/mmtx)
* Edit Annotator.properties to point to a MRCONSO.RRF file from a UMLS installation (META/)
* Edit Annotator.properties to point to the evaluation files
* Set up your Maven repository and settings
* Run mvn install to download libraries and compile

Execution:

ubic.GEOMMTx.ExampleAnnotator.java provides a simple class for executing the pipeline. Its input is the first command-line argument, and its output is written to example.rdf.

Note: the first run may take some time because it builds the mappings for BIRNLex and the Disease Ontology (after the first run they are stored locally).

Recommended Resources:

Gemma
The Open Biomedical Ontologies
NCBO Bioportal
Jena
Tabulator Extension