Motivation: The Gene Ontology (GO) is heavily used in systems biology but the potential for redundancy, confounds with other data sources and problems with stability over time have been little explored.
Results: We report that GO annotations are stable over short periods, with 3% of genes not being most semantically similar to themselves between monthly GO editions. However, we find that genes can alter their “functional identity” over time, with 20% of genes not matching to themselves (by semantic similarity) after two years. We further find that annotation bias in GO, in which some genes are more characterized than others, has declined in yeast, but generally risen in humans. Finally, we discovered that many entries in protein interaction databases are due to the same published reports that are used for GO annotations, with 66% of assessed GO groups exhibiting this confound. We provide a case study to illustrate how this information can be used in analyses of gene sets and networks.
Contact: paul[at]chibi.ubc.ca or JGillis[at]cshl.edu for assistance with the data.
The following files for human genes are intended to assist researchers who wish to check their own data for the types of effects we report in the paper. The files are tab-delimited. Genes are referenced by NCBI IDs or official symbols, and publications by PubMed IDs.
- HIPPIE PPIN – The protein interaction data used in sections 3.3 and 3.4.
- frac_confound_aved.txt – The connection-level data plotted in Figure 5A.
- frac_confound_go_aved – The GO-term-level data plotted in Figure 5A.
- frac_confound_go_103 – Each GO group’s confoundedness for our final data point for GO. These data are plotted in Figure 3A. “NaN” occurs where there was division by zero.
- frac_confound_con_103.txt – Number of functions shared by gene pairs from the PPIN, and the number of functions confounded for our final data point for GO (edition 103). These data are plotted in Figure 3B.
- frac_confound_con_aved.txt – The connection-level data plotted in Figure 5A.
- frac_confound_GO_aved.txt – The GO-term-level data plotted in Figure 5A.
- Confound table – List of GO IDs and the PubMed IDs of the papers contributing the most confound edges for those functions.
- Semantic stability table – List of genes and the number of GO editions since each gene changed its functional identity (measured as having the highest semantic similarity with itself).
- Semantic similarity table – Similarity ranking for each gene back through each edition of GO. A value of “1” means the gene was “most similar to itself” or tied for first.
- Multifunctionality rankings table – List of gene multifunctionality rankings over time. Useful for reducing the effect of annotation bias in GO.
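As a rough illustration of how the tab-delimited files above might be read, the following Python sketch parses two small in-memory tables and summarizes them. The column layouts (an identifier in the first column, a numeric value in the second), the function names and the toy identifiers are all assumptions made for illustration; check the actual files before relying on any of them. The sketch skips “NaN” entries when summarizing confoundedness and counts a similarity rank of “1” as a gene matching itself.

```python
import csv
import io
import math


def read_tab_delimited(handle):
    """Return the non-empty rows of a tab-delimited file as lists of strings."""
    return [row for row in csv.reader(handle, delimiter="\t") if row]


def fraction_confounded(rows, value_col=1):
    """Fraction of groups with confoundedness > 0, ignoring NaN entries.

    Assumes (hypothetically) that column `value_col` holds the confounded
    fraction for each GO group; "NaN" marks a division by zero.
    """
    values = [float(row[value_col]) for row in rows]  # float("NaN") gives nan
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        return float("nan")
    return sum(v > 0 for v in valid) / len(valid)


def fraction_not_self_matching(rows, rank_col=1):
    """Fraction of genes whose similarity rank is not 1.

    Assumes (hypothetically) one row per gene, with the gene's rank against
    itself in a single earlier GO edition stored in column `rank_col`.
    """
    ranks = [float(row[rank_col]) for row in rows]
    return sum(rank != 1 for rank in ranks) / len(ranks)


# Toy data standing in for the real files; identifiers and values are made up.
toy_confound = io.StringIO("GO_a\t0.5\nGO_b\tNaN\nGO_c\t0.0\nGO_d\t0.25\n")
toy_ranks = io.StringIO("g1\t1\ng2\t1\ng3\t3\ng4\t1\ng5\t2\n")

confound_rows = read_tab_delimited(toy_confound)
rank_rows = read_tab_delimited(toy_ranks)

confounded = fraction_confounded(confound_rows)
changed = fraction_not_self_matching(rank_rows)
print(f"GO groups with any confound: {confounded:.2f}")
print(f"Genes not most similar to themselves: {changed:.2f}")
```

To use this with a downloaded file, replace the `io.StringIO` objects with `open(path, newline="")` and adjust the column indices to match the file's real layout, skipping a header row if one is present.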
Use case: The postsynaptic proteome
The following two data files were used in the analysis described in section 3.4 of the manuscript.