Supplement to Application and evaluation of automated methods to extract neuroanatomical connectivity statements from free text.
Motivation: Automated annotation of neuroanatomical connectivity statements from the neuroscience literature would enable accessible and large-scale connectivity resources. Unfortunately, the connectivity findings are not formally encoded and occur as natural language text. This hinders aggregation, indexing, searching, and integration of the reports. We annotated a set of 1,377 abstracts for connectivity relations to facilitate automated extraction of connectivity relationships from neuroscience literature. We tested several baseline measures based on co-occurrence and lexical rules. We compared results from nine machine learning methods adapted from the protein interaction extraction domain that employ part-of-speech, dependency and syntax features.
Results: Co-occurrence-based methods provided high recall with weak precision. The shallow linguistic kernel recalled 70% of the sentence-level connectivity statements at 50% precision. Due to its speed and simplicity we applied the shallow linguistic kernel to a large set of new abstracts. To evaluate the results we compared 2,688 extracted connections to the Brain Architecture Management System (BAMS; an existing database of rat connectivity). The extracted connections were connected in BAMS at a rate of 63.5%, compared to 51.1% for co-occurring brain region pairs. We found that precision increases with the recency and frequency of the extracted relationships.
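To make the precision and recall trade-off of a co-occurrence baseline concrete, the following is a minimal sketch and not the code used in the study. It assumes a sentence-level baseline that predicts a connection for every pair of brain regions mentioned in the same sentence. The function names, sentence identifiers, and toy data are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def cooccurrence_pairs(sentences):
    """Predict a connection for every unordered pair of distinct
    brain regions mentioned in the same sentence (assumed baseline)."""
    predicted = set()
    for sent_id, regions in sentences.items():
        for a, b in combinations(sorted(set(regions)), 2):
            predicted.add((sent_id, a, b))
    return predicted


def precision_recall(predicted, gold):
    """Compute precision and recall of predicted pairs against gold pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy, made-up data: sentence id -> brain region mentions in that sentence.
sentences = {
    "s1": ["amygdala", "hippocampus", "thalamus"],
    "s2": ["striatum", "substantia nigra"],
    "s3": ["cerebellum"],  # a single mention yields no candidate pair
}

# Toy gold annotations: pairs a curator marked as connected.
gold = {
    ("s1", "amygdala", "hippocampus"),
    ("s2", "striatum", "substantia nigra"),
}

predicted = cooccurrence_pairs(sentences)
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy example every annotated connection is recovered, but half of the predicted pairs are spurious, which mirrors the high-recall, weak-precision behaviour reported for co-occurrence.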
Supplemental Kernel Reference
Software for kernel evaluations was collected and implemented by Domonkos Tikk, Philippe Thomas, Peter Palaga, Jörg Hakenberg, and Ulf Leser. It is available at their online appendix: “A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature.” The kernel and parser versions were left unchanged.
|Name||Feature source||Kernel Description||Reference||URL|
|Subtree Kernel||syntax tree||identical subtrees||Vishwanathan SVN, Smola AJ (2002) Fast kernels for string and tree matching. Proc. of Neural Information Processing Systems (NIPS’02). Vancouver, BC, Canada: pp. 569–576.|
|Subset Tree Kernel||syntax tree||shared subset trees||Collins M, Duffy N (2001) Convolution kernels for natural language. Proc. of Neural Information Processing Systems (NIPS’01). Vancouver, BC, Canada: pp. 625–632.|
|Partial Tree Kernel||syntax tree||shared partial trees||Moschitti A (2006) Efficient convolution kernels for dependency and constituent syntactic trees. Proc. of The 17th European Conf. on Machine Learning. Berlin, Germany: pp. 318–329.|
|Spectrum Tree Kernel||syntax tree||vertex-walks of the tree||Kuboyama T, Hirata K, Kashima H, Aoki-Kinoshita KF, Yasuda H (2007) A spectrum tree kernel. Information and Media Technologies 2: 292–299.||HTML|
|k-band Shortest Path Spectrum||dependency tree||vertex-walks of the tree, allowing mismatches||Tikk D, Thomas P, Palaga P, Hakenberg J, Leser U (2010) A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature. PLoS Computational Biology 6: e1000837.||HTML|
|Shallow Linguistic Kernel||POS-tag, lemma, bag-of-words||global and local word based features||Giuliano C, Lavelli A, Romano L (2006) Exploiting shallow linguistic information for relation extraction from biomedical literature. EACL’06.|
|All-paths Graph Kernel||dependency tree, word sequence||weighted shared paths plus surface words||Airola A, Pyysalo S, Björne J, Pahikkala T, Ginter F, et al. (2008) All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics 9: S2.||HTML|
Supplement Figure S1, ROC curve from SL classification run
GATE data store that includes both the annotated corpus and the set of 12,557 abstracts. Use the GATE plugin below to view connections in the annotated corpus (or access them programmatically via the source code).
GATE plugin for viewing the corpus and connections. To install, unzip it into the GATE plugins folder, then in GATE go to File -> Manage CREOLE Plugins and select the first checkbox for the Connections plugin.
Airola XML for the annotated set, for use with ppi-benchmark by Tikk et al. The Airola XML versions do not contain all of the abstracts, sentences and entities that are in the above GATE versions. They only contain sentences with two or more brain region mentions.
Airola XML produced by the automated pipeline on the 12,557 abstract corpus. All interactions are set to false in this set, as it is to be used with ppi-benchmark in cross-corpus mode.
Shallow Linguistic Kernel positive predictions encoded in RDF for the annotated set of abstracts; use the SPARQL query below as an example. Normalized to NIFSTD terms. Other normalizations are possible, just email email@example.com.
Shallow Linguistic Kernel positive predictions encoded in RDF for the full set of abstracts; use the SPARQL query below as an example. Normalized to NIFSTD terms.
Connection matrix created by normalizing connections to BAMS (Swanson98) brain region names. Each cell gives the number of extracted connections between the corresponding pair of regions in the 12,557 abstract corpus. More matrices are available upon request.
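As an illustration of how a count matrix of this kind can be assembled, here is a minimal sketch under stated assumptions. It is not the pipeline that produced the distributed matrix. It assumes that connections have already been normalized to region names, that connections are undirected (so the matrix is symmetric), and that pairs involving a name outside the region list are skipped. The function name and the toy region names are hypothetical.

```python
def connection_matrix(connections, regions):
    """Count extracted connections per region pair.

    Assumptions (not confirmed by the source): connections are treated
    as undirected, and pairs with an unmapped region name are skipped.
    """
    index = {name: i for i, name in enumerate(regions)}
    matrix = [[0] * len(regions) for _ in regions]
    for a, b in connections:
        if a not in index or b not in index:
            continue  # region could not be normalized to the name list
        i, j = index[a], index[b]
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # mirror the count to keep the matrix symmetric
    return matrix


# Toy, made-up input: one tuple per extracted connection.
regions = ["amygdala", "hippocampus", "thalamus"]
connections = [
    ("amygdala", "hippocampus"),
    ("hippocampus", "amygdala"),
    ("thalamus", "amygdala"),
    ("amygdala", "unmapped region"),
]

matrix = connection_matrix(connections, regions)
for name, row in zip(regions, matrix):
    print(name, row)
```

If direction matters for a given analysis, dropping the mirrored increment yields a source-by-target matrix instead.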