Automated recognition of brain region mentions in neuroscience literature

Supplement to Automated recognition of brain region mentions in neuroscience literature.

French L, Lane S, Xu L, Pavlidis P


The ability to computationally extract mentions of neuroanatomical regions from the literature would assist linking to other entities within and outside of an article. Examples include extracting reports of connectivity or region-specific gene expression. To facilitate text mining of neuroscience literature, we have created a corpus of manually annotated brain region mentions. The corpus contains 1,377 abstracts with 18,242 brain region annotations. Interannotator agreement was evaluated for a subset of the documents, and was 90.7% and 96.7% for strict and lenient matching respectively. We observed a large vocabulary of over 6,000 unique brain region terms and 17,000 words. For automatic extraction of brain region mentions, we evaluated simple dictionary methods and complex natural language processing techniques. The dictionary methods based on neuroanatomical lexicons recalled 36% of the mentions with 57% precision. The best performance was achieved using a conditional random field (CRF) with a rich feature set. Features were based on morphological, lexical, syntactic and contextual information. The CRF recalled 76% of mentions at 81% precision; when partial matches are counted, recall and precision increase to 86% and 92% respectively. We suspect a large amount of error is due to coordinating conjunctions, previously unseen words and brain regions of less commonly studied organisms. We found context windows, lemmatization and abbreviation expansion to be the most informative techniques. The corpus is freely available below.
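To make the difference between strict and lenient (partial-match) scoring concrete, the following is a minimal Python sketch of a dictionary tagger scored both ways. It is not the code used in the paper. The function names, the three-term lexicon, and the example sentence are hypothetical, and it assumes that a lenient match means any character overlap between a predicted span and a gold span, which may differ from the exact definition used for the reported figures.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def dictionary_spans(text: str, lexicon: List[str]) -> List[Span]:
    """Case-insensitive whole-word lookup, preferring longer lexicon terms."""
    spans: List[Span] = []
    for term in sorted(lexicon, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            candidate = (m.start(), m.end())
            # Skip matches that overlap an already accepted (longer) match.
            if not any(overlaps(candidate, s) for s in spans):
                spans.append(candidate)
    return sorted(spans)


def precision_recall(gold: List[Span], predicted: List[Span],
                     lenient: bool = False) -> Tuple[float, float]:
    """Strict: boundaries must be identical. Lenient: any overlap counts (assumption)."""
    if lenient:
        correct_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
        found_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    else:
        exact = set(gold) & set(predicted)
        correct_pred = found_gold = len(exact)
    precision = correct_pred / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    return precision, recall


def span_of(text: str, phrase: str) -> Span:
    """Offsets of the first occurrence of phrase (helper for the toy example)."""
    start = text.index(phrase)
    return (start, start + len(phrase))


# Hypothetical sentence, gold annotations, and lexicon for illustration only.
text = ("Projections from the medial prefrontal cortex reach the amygdala "
        "and the dorsal raphe nucleus.")
gold = [span_of(text, p) for p in
        ("medial prefrontal cortex", "amygdala", "dorsal raphe nucleus")]
lexicon = ["prefrontal cortex", "amygdala", "raphe nucleus"]

predicted = dictionary_spans(text, lexicon)
strict = precision_recall(gold, predicted, lenient=False)
lenient = precision_recall(gold, predicted, lenient=True)
print("strict  P/R:", strict)
print("lenient P/R:", lenient)
```

In this toy case the lexicon lacks the modifiers "medial" and "dorsal", so strict scoring credits only "amygdala" while lenient scoring credits all three predictions. This illustrates why counting partial matches raises both precision and recall.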



Normalization datasets are available.

We will continue to maintain and correct the corpus. Please notify us of any problems or errors in the datastore.


WhiteText GATE Datastore version 1.0


WhiteText GATE Datastore version 1.1 Changes: Removed duplicate AbbrevShort tags and added an XML version. Note that the XML version contains only the final merged annotation set and lacks information on expanded abbreviations (they were causing tag crossovers).


WhiteText GATE Datastore version 1.2 Changes: Edited 11 annotations that had a “The” prefix. Removed 731 annotations that mentioned a neuroanatomical tract. This was done to make sure the annotations matched our guidelines. This corpus has 17,612 brain region mentions.


WhiteText GATE Datastore version 1.3 Changes: Edited 34 annotations that had a “midbrain” prefix. Removed 34 annotations that mentioned a neuroanatomical tract or nerve. This was done to make sure the annotations matched our guidelines. This corpus has 17,585 brain region mentions (UnionMerge set).


Currently the corpus is provided in the GATE format and a simple XML format. Upon request we can help convert it to other common formats. If you convert the corpus to another format, we request a link or copy so we can point others to it.


We encourage you to test new methods and applications of the dataset. If you do, please contact us; we would like to hear about and link to your work.


The abstracts are from PubMed/Medline, specifically The Journal of Comparative Neurology.

