Supplement to Using text mining to link journal articles to neuroanatomical databases.
The electronic linking of neuroscience information, including data embedded in the primary literature, would permit powerful queries and analyses driven by structured databases. This task would be facilitated by automated procedures which can identify biological concepts in journals. Here we apply an approach for automatically mapping formal identifiers of neuroanatomical regions to text found in journal abstracts, and apply it to a large body of abstracts from the Journal of Comparative Neurology. The analyses yield over one hundred thousand brain region mentions which we map to 8,225 brain region concepts in multiple organisms. Based on the analysis of a manually annotated corpus, we estimate mentions are mapped at 95% precision and 63% recall. Our results provide insights into the patterns of publication on brain regions and species of study in the Journal, but also point to important challenges in the standardization of neuroanatomical nomenclatures. We find that many terms in the formal terminologies never appear in a JCN abstract, while conversely, many terms authors use are not reflected in the terminologies. To improve the terminologies we deposited 136 unrecognized brain regions into the Neuroscience Lexicon (NeuroLex). The training corpus, lexicons, normalizations, evaluations and annotated journal abstracts are freely available at http://www.chibi.ubc.ca/WhiteText/.
Table S1: Extracted and filtered species
The SpeciesText column lists the terms recognized in the corpus as representing the species. The Species column provides the NCBI taxonomy identifier. The Filter column records the manual annotation of filtered terms; blank entries were evaluated and not filtered.
Table S2: Manually evaluated concept mappings
Mentions are grouped, with groups separated by blank rows. Term-to-concept mappings that involve the same string labels are grouped into one row, with the ShortNames column listing the concept identifiers. The PubMed Links column provides hypertext links to the abstracts from which the mentions were extracted. Predicates and PredicatesShort list the resolver used to make the match.
Table S3: Manually created mention to concept links
Manually created mappings between top-ranked unmatched mentions and concepts. We attempted to map some mentions to more than one lexicon; these are represented by the multiple Manual Mapping columns. Mentions not matching the curation guidelines because they were tracts or systems were marked for correction in the corpus.
Table S4: Complete evaluations for all extracted species
The SpeciesText column lists all recognized terms for a given species. The Species column provides the NCBI taxonomy identifier. The evaluation results (accept, reject, specToGen) are divided by the total number of mention to concept links to create percentages.
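The percentage calculation described for Table S4 can be sketched as follows. This is a minimal illustration, not the code used to build the table: the function name and the example counts are hypothetical, and it assumes that the total number of mention to concept links for a species equals the sum of the three evaluation counts unless a total is passed in explicitly.

```python
def evaluation_percentages(counts, total_links=None):
    """Convert per-species evaluation counts into percentages.

    counts: dict mapping an evaluation label (accept, reject, specToGen)
            to the number of mention to concept links with that label.
    total_links: total number of mention to concept links; assumed here
                 to be the sum of the counts when not supplied.
    """
    if total_links is None:
        total_links = sum(counts.values())
    if total_links == 0:
        # No links for this species, so no meaningful percentage.
        return {label: 0.0 for label in counts}
    return {label: 100.0 * n / total_links for label, n in counts.items()}


# Hypothetical counts for one species, chosen only for illustration.
example_counts = {"accept": 80, "reject": 15, "specToGen": 5}
percentages = evaluation_percentages(example_counts)
```

With these made-up counts the total is 100 links, so each percentage equals its count.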
Data: RDF dataset
Irrelevant species (Table S1) are already filtered out.
Manual (Table S2) and automatic evaluations from above are encoded in the RDF files.
Merging the two RDF files will result in incorrect counts of mention occurrences.