Application and evaluation of automated methods to extract connectivity statements from free text



Supplement to Application and evaluation of automated methods to extract neuroanatomical connectivity statements from free text.

French L, Lane S, Xu L, Siu C, Kwok C, Chen Y, Krebs C, Pavlidis P


Motivation: Automated annotation of neuroanatomical connectivity statements from the neuroscience literature would enable accessible and large scale connectivity resources. Unfortunately, the connectivity findings are not formally encoded and occur as natural language text. This hinders aggregation, indexing, searching, and integration of the reports. We annotated a set of 1,377 abstracts for connectivity relations to facilitate automated extraction of connectivity relationships from neuroscience literature. We tested several baseline measures based on co-occurrence and lexical rules. We compare results from nine machine learning methods adapted from the protein interaction extraction domain that employ part-of-speech, dependency and syntax features.
Results: Co-occurrence based methods provided high recall with weak precision. The shallow linguistic kernel recalled 70% of the sentence level connectivity statements at 50% precision. Due to its speed and simplicity we applied the shallow linguistic kernel to a large set of new abstracts. To evaluate the results we compared 2,688 extracted connections to the Brain Architecture Management System (BAMS; an existing database of rat connectivity). The extracted connections were connected in BAMS at a rate of 63.5%, compared to 51.1% for co-occurring brain region pairs. We found that precision increases with the recency and frequency of the extracted relationships.
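
The trade-off described in the abstract can be illustrated with a minimal sketch of a sentence-level co-occurrence baseline and its precision and recall. This is not the evaluation code from the paper: the function names, the sentence identifiers, and the toy region mentions and gold annotations are all made up for illustration, and the sketch assumes connections are scored as unordered region pairs within a sentence.

```python
from itertools import combinations


def cooccurrence_pairs(sentences):
    """Predict a connection for every pair of distinct regions in a sentence.

    `sentences` maps a sentence id to the list of region names mentioned in it.
    """
    predicted = set()
    for sent_id, regions in sentences.items():
        # Sorting makes each unordered pair appear in one canonical order.
        for a, b in combinations(sorted(set(regions)), 2):
            predicted.add((sent_id, a, b))
    return predicted


def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted pairs against gold pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy data (hypothetical): two sentences and their annotated connections.
sentences = {
    "s1": ["amygdala", "hippocampus", "thalamus"],
    "s2": ["striatum", "substantia nigra"],
}
gold = {
    ("s1", "amygdala", "hippocampus"),
    ("s2", "striatum", "substantia nigra"),
}

predicted = cooccurrence_pairs(sentences)
p, r = precision_recall(predicted, gold)
print(p, r)  # 0.5 1.0

# Harmonic mean of the 50% precision and 70% recall quoted above.
print(round(f1_score(0.50, 0.70), 3))  # 0.583
```

On the toy data the baseline recovers every annotated connection but only half of its predictions are correct, which is the "high recall with weak precision" pattern in miniature.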


Supplemental Kernel Reference

Software for kernel evaluations was collected and implemented by Domonkos Tikk, Philippe Thomas, Peter Palaga, Jörg Hakenberg, and Ulf Leser. It is available at their online appendix: “A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature.” The kernel and parser versions were left unchanged.


Name | Feature source | Kernel description | Reference | URL
Subtree Kernel | syntax tree | identical subtrees | Vishwanathan SVN, Smola AJ (2002) Fast kernels for string and tree matching. Proc. of Neural Information Processing Systems (NIPS’02). Vancouver, BC, Canada: pp. 569–576. | PDF
Subset Tree Kernel | syntax tree | shared subset trees | Collins M, Duffy N (2001) Convolution kernels for natural language. Proc. of Neural Information Processing Systems (NIPS’01). Vancouver, BC, Canada: pp. 625–632. | PDF
Partial Tree Kernel | syntax tree | shared partial trees | Moschitti A (2006) Efficient convolution kernels for dependency and constituent syntactic trees. Proc. of The 17th European Conf. on Machine Learning. Berlin, Germany: pp. 318–329. | PDF
Spectrum Tree Kernel | syntax tree | vertex-walks of the tree | Kuboyama T, Hirata K, Kashima H, Aoki-Kinoshita KF, Yasuda H (2007) A spectrum tree kernel. Information and Media Technologies 2: 292–299. | HTML
k-band Shortest Path Spectrum | dependency tree | vertex-walks of the tree, allowing mismatches | Tikk D, Thomas P, Palaga P, Hakenberg J, Leser U (2010) A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature. PLoS Computational Biology 6: e1000837. | HTML
Shallow Linguistic Kernel | POS-tag, lemma, bag-of-words | global and local word based features | Giuliano C, Lavelli A, Romano L (2006) Exploiting shallow linguistic information for relation extraction from biomedical literature. EACL’06. | PDF
All-paths Graph Kernel | dependency tree, word sequence | weighted shared paths plus surface words | Airola A, Pyysalo S, Björne J, Pahikkala T, Ginter F, et al. (2008) All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics 9: S2. | HTML


Supplementary Figure S1: ROC curve from the Shallow Linguistic (SL) kernel classification run.


Source code and instructions.


GATE data store that includes both the annotated corpus and the set of 12,557 abstracts. Use the GATE plugin below to view connections in the annotated corpus (or access them programmatically via the source code).


GATE plugin for viewing the corpus and connections. To install, unzip it into the GATE plugins folder, then in GATE go to File -> Manage CREOLE Plug-ins and tick the first checkbox for the Connections plugin.


Annotation and evaluation guidelines.



Evaluation of 723 region pairs that have been resolved to 899 BAMS region pairs that are not listed as connected in BAMS.


Airola XML for the annotated set, for use with ppi-benchmark by Tikk et al. The Airola XML versions do not contain all of the abstracts, sentences and entities that are in the above GATE versions. They only contain sentences with two or more brain region mentions.


Airola XML produced by the automated pipeline on the 12,557 abstract corpus. All interactions are set to false in this set, as it is to be used with ppi-benchmark in cross-corpus mode.
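
For readers who want to inspect these files outside ppi-benchmark, the following is a minimal sketch of reading sentence-level pairs with the Python standard library. It assumes the files follow the unified corpus/document/sentence/entity/pair layout used by the Airola et al. PPI corpora, with `e1`, `e2` and `interaction` attributes on each pair; the sample snippet, its ids and the `BrainRegion` type are invented for illustration and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet; element and attribute names are an assumption
# based on the unified PPI corpus format, not copied from the real files.
SAMPLE = """<corpus source="toy">
  <document id="toy.d0">
    <sentence id="toy.d0.s0" text="The amygdala projects to the hippocampus.">
      <entity id="toy.d0.s0.e0" charOffset="4-11" text="amygdala" type="BrainRegion"/>
      <entity id="toy.d0.s0.e1" charOffset="29-39" text="hippocampus" type="BrainRegion"/>
      <pair id="toy.d0.s0.p0" e1="toy.d0.s0.e0" e2="toy.d0.s0.e1" interaction="True"/>
    </sentence>
  </document>
</corpus>"""


def read_pairs(xml_text):
    """Return (sentence id, region 1, region 2, is_connected) for each pair."""
    root = ET.fromstring(xml_text)
    rows = []
    for sentence in root.iter("sentence"):
        # Map entity ids to their surface text within this sentence.
        names = {e.get("id"): e.get("text") for e in sentence.iter("entity")}
        for pair in sentence.iter("pair"):
            connected = pair.get("interaction", "False").lower() == "true"
            rows.append(
                (sentence.get("id"), names[pair.get("e1")], names[pair.get("e2")], connected)
            )
    return rows


for row in read_pairs(SAMPLE):
    print(row)
```

In the automatically produced set every pair would come back with `is_connected` set to `False`, since all interactions are set to false there.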


Shallow Linguistic Kernel positive predictions encoded in RDF, for the annotated set of abstracts. See the example SPARQL query below. Connections are normalized to NIFSTD terms; other normalizations are possible on request by email.


Shallow Linguistic Kernel positive predictions encoded in RDF, for the full set of abstracts. See the example SPARQL query below. Connections are normalized to NIFSTD terms.


Example SPARQL query. Twinkle is recommended for querying and examining the data.


Connection matrix created by normalizing connections to BAMS (Swanson98) brain region names. Each cell gives the number of extracted connections in the 12,557 abstract corpus. More matrices are available upon request.
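
As a rough illustration of how such a matrix can be assembled, the sketch below counts extracted connections per region pair. It is not the pipeline used to build the downloadable matrix: the function name and the region names are hypothetical, and it assumes connections are undirected, so the matrix is symmetric.

```python
def connection_matrix(connections):
    """Build a symmetric count matrix from (region, region) pairs.

    Returns the sorted region names and a list-of-lists matrix whose
    cell [i][j] is the number of extracted connections between
    regions[i] and regions[j].
    """
    regions = sorted({name for pair in connections for name in pair})
    index = {name: i for i, name in enumerate(regions)}
    matrix = [[0] * len(regions) for _ in regions]
    for a, b in connections:
        i, j = index[a], index[b]
        matrix[i][j] += 1
        if i != j:
            # Assumption: direction is ignored, so mirror the count.
            matrix[j][i] += 1
    return regions, matrix


# Toy input (hypothetical): already normalized region names.
extracted = [
    ("amygdala", "hippocampus"),
    ("hippocampus", "amygdala"),
    ("amygdala", "thalamus"),
]

regions, matrix = connection_matrix(extracted)
print(regions)
for name, row in zip(regions, matrix):
    print(name, row)
```

If direction were tracked, the mirrored increment would be dropped and rows and columns would stand for source and target regions respectively.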