Hello, you’ve reached the “multifunctionality bias” project page. Helpful lists and code follow below. If there is a different metric, performance test, dataset or analysis type you’d like to see tried, or if you need help implementing one, please contact Paul (email@example.com) and we’ll be happy to help. The findings of the paper are very consistent: performance tests that seem ideally suited to the data (a common choice) typically show correspondingly high prevalence bias, and the control cases often allow the same, very specific, conclusions as the real data.
Many previous studies have shown that by using variants of “guilt-by-association”, gene function predictions can be made with very high statistical confidence. In these studies, it is assumed that the “associations” in the data (e.g., protein interaction partners) of a gene are necessary in establishing “guilt”. In this paper we show that multifunctionality, rather than association, is a primary driver of gene function prediction. We first show that knowledge of the degree of multifunctionality alone can produce astonishingly strong performance when used as a predictor of gene function. We then demonstrate how the multifunctionality is encoded in gene interaction data (such as protein interactions and coexpression networks) and how this can feed forward into gene function prediction algorithms. We find that high-quality gene function predictions can be made using data that possesses no information on which gene interacts with which. By examining a wide range of networks from mouse, human and yeast, as well as multiple prediction methods and evaluation metrics, we provide evidence that this problem is pervasive and does not reflect the failings of any particular algorithm or data type. We propose computational controls that can be used to provide more meaningful control when estimating gene function prediction performance. We suggest that this source of bias due to multifunctionality is important to control for, with widespread implications for the interpretation of genomics studies.
Text files of optimal (over GO) gene lists are provided (best at top, then descending). Some caveats: I’m keeping the lists up to date so they differ (very slightly) from the ranking used in the paper. For result replication, use the programs provided outside of the “Helpful Tools” category – these files are primarily for seeing where particular genes of interest rank, or comparing the output to another specific list of genes. Similarly, for detailed computational analyses the programs provided in the immediately following “prevalence programs” should be used to generate scores so that ties can be taken into account. Some of the programs may use common Matlab toolbox functions, but if you’re missing these and are having trouble, please contact me for work-arounds.
Prevalence programs – these perform many of the analyses used in the paper in a convenient way (in Matlab). I have updated the GO associations. The programs are described below:
- calc_opt_list.m – calculates the scores which give best ROC performance given a list of genes.
Scores are returned in the same order as the original list; higher is better, and NaN appears where a gene did not match the list of known genes (from humanlist.mat in the downloadable zip). The function also uses the GO data in the zip (this can be replaced with any gene categorization scheme). A call to this function would be opt_scores=calc_opt_list(gene_list,species).
- ROC_opt.m calculates ROC for a subset of genes out of another list of genes.
It uses calc_opt_list to calculate the way to order the larger set of genes. So, this function tells you if your subset would be predicted to arise from the larger set using nothing more than prevalence. This means it is circular to use this function on groups picked specifically from GO (since the ordering of the list is derived from GO). [ROC,fp,tp]=ROC_opt(genes_of_interest,universe_of_genes,species) would be a call to this function. human_list (from humanlist.mat) would often be reasonable to use for the universe_of_genes. One can plot (fp,tp) to obtain the ROC curve (by untied rank, whereas the ROC uses tied rank).
- GOvectorscorer.m – given a score for each gene in a list, calculates the ROC value for each GO group over that list of genes.
NaNs will appear where the GO group no longer has genes within it. A call to this function would be output=GOvectorscorer(scores,genes). For example, temp=GOvectorscorer(sum(ppinkernel),gene_list(TF570),'human'); generates the associability scores for ppin. The number of genes required in a GO group is easy to set in the code (at “if sum(TF)>=20”; the default range is 20-1000), and the GO data can be replaced with a different gene categorization matrix. One is typically interested in the distribution of the output.
It is also often of interest to look at correlations, since even weak correlations with the optimal list induce strong effects (because “optimal” is strong), e.g.: corr(sum(ppinkernel)',opt_scores,'type','spearman')
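As a rough illustration of what a per-group score and the rank correlation above involve, here is a minimal Matlab sketch using base functions only. It is not the code of GOvectorscorer.m or ROC_opt.m; the toy scores, the membership matrix member, the vector node_degree (standing in for something like sum(ppinkernel)') and the helper tied_rank are all hypothetical. It computes the ROC of each group from tied ranks, sets groups outside a size range to NaN, builds the (fp,tp) curve for one group by untied rank, and computes a Spearman correlation as the Pearson correlation of tied ranks.

```matlab
% Tied (average) ranks with base functions only (hypothetical helper).
% It builds an n-by-n comparison, so it only suits small examples.
tied_rank = @(x) sum(bsxfun(@gt, x(:), x(:)'), 2) + ...
                 (sum(bsxfun(@eq, x(:), x(:)'), 2) + 1) / 2;

% Toy data (hypothetical): 6 genes, 3 groups, logical membership matrix.
scores = [5 3 3 1 4 2]';               % one score per gene, higher is better
member = logical([1 0 0; 0 1 0; 0 0 0; 0 1 0; 1 0 0; 0 0 0]);

% Size limits for a group to be scored. The text above mentions 20-1000;
% the lower limit is reduced here so that the toy groups qualify.
min_size = 1;
max_size = 1000;

% ROC (area under the curve) per group from the rank-sum of its members.
r     = tied_rank(scores);
n     = numel(scores);
n_in  = sum(member, 1);                % genes inside each group
n_out = n - n_in;                      % genes outside each group
auroc = (r' * double(member) - n_in .* (n_in + 1) / 2) ./ (n_in .* n_out);
auroc(n_in < min_size | n_in > max_size) = NaN;   % excluded or empty groups

% (fp,tp) curve for group 1 by untied rank: walk down the sorted list.
[~, order] = sort(scores, 'descend');
hits = member(order, 1);
tp = cumsum(hits)  / sum(hits);
fp = cumsum(~hits) / sum(~hits);
% plot(fp, tp) would draw the curve.

% Spearman correlation = Pearson correlation of the tied ranks.
node_degree = [10 2 7 1 9 3]';         % hypothetical second gene vector
c   = corrcoef(tied_rank(scores), tied_rank(node_degree));
rho = c(1, 2);
```

In this toy case the first group holds the two top-scoring genes and gets a ROC of 1, the second gets 0.1875, and the empty third group gets NaN. For full gene lists, a rank function that does not form an n-by-n matrix would be preferable.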
Lists (of genes, experiment ids, etc)
Gene Ontology (GO) numbers
GO slim numbers
Gene Ontology numbers – Used in human.
all human experiments
human experiments used
human genes (out of full list) included (logical)
mouse gene lists
mouse genes id numbers used
mouse experiments used (common array design) – Plus Su et al. (id=166).
Gene Ontology numbers – Used in mouse.
Gene list – Used in yeast.
Yeast GO data – Including membership matrix, names, etc.
Expression data and coexpression (etc) can be accessed at:
where “###” is the experiment number given, e.g., 149.
Yeast networks were downloaded directly from their respective websites.
Real association matrix scores
KEGG coexpression (top overlap) GeneMANIA ROC
GO protein-interaction GeneMANIA ROC
KEGG protein-interaction SVM CCR
GO protein-interaction SVM CCR
GO coexpression (top overlap) SVM CCR
GO coexpression (threshold) GeneMANIA ROC
GO coexpression (top overlap) GeneMANIA ROC
IPN Interaction Matrices
IPN matrix scores
Gene Categorization Schemes
Code (and associability scores)
The associability scores were generated as needed from the code prefixed with “assess_vec”. These functions calculate (for example) GO group prediction performance over all groups using a particular vector of gene scores. The scores are assumed to be in the same order as gene_list (for human data). The second argument is a logical vector indicating which genes (in gene_list) are being used. For all of the analysis in the paper, TF570 (given in the lists section) was used. The mouse program (prefix “fig5”) takes input in a slightly different format: it uses only the scores over the list of genes used for the mouse analysis, and the GO group data is in a cell array rather than a sparse matrix. The variables used are likewise in the format downloadable above. In essence, these are less convenient versions of the function described above, GOvectorscorer.m. The functions with the suffix “CC” after “assess_vec” calculate correct classification rates in the same way as for SVM validation (that is, piecewise, over the genome). Other than that, the functions differ only in the gene association matrix used (given in the function name; GO where none is mentioned).
In general, paths set in the code will need to be altered in straightforward ways (likely to be changed), using the data in the lists and associations given above. The only code needed to generate the results of the paper that is not provided is the base code for GeneMANIA.
NCer.txt – This function generates an IPN given an association matrix.
GMer3.txt – Cross-validation for GeneMANIA.
SVMer.m – Cross-validation for SVM.
topover.txt – Sparsifies a gene-gene matrix by the top-overlap method, given a sparsity.
noderand.txt – Generates a random matrix given a vector of node degrees. In general, this matrix is less desirable than the “assess_vec”s or IPNs for method validation, since it can be “too random” for a given method; it is thus conservative when claiming prevalence bias (the paper’s goal) but optimistic when measuring significance.
nodedegree.m – this function gives the average internal coexpression for the GO groups used in the full human analysis. In figures 2-4, only GO groups for which this value was greater than 1 (its lower limit) were plotted. This has a strong effect only on the threshold data, but as suggested by figure 4 (and the supplementary discussion on absolute performance), choosing more stringent thresholds has a strong effect on average performance. Note that this has (apparently) nothing to do with the absolute sparsity of the GO groups, only their relative sparsity. This is expected from the effect of prevalence bias, where average coexpression, and not just internal coexpression, is what is being used to predict performance (and, consistent with that, higher performance obtained in this way increases prevalence bias).
nodenull5216.mat – Null distribution figure 5c, for 5216th GO group in GO descriptions.
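To make the idea of a degree-based random matrix (as in noderand.txt above) more concrete, here is a minimal Matlab sketch. It is an assumption about one reasonable way to do this, not the contents of noderand.txt: each pair of genes is connected with probability proportional to the product of their degrees, so the node degrees are matched only approximately and on average. The degree vector deg is hypothetical.

```matlab
% Hypothetical target node degrees for 5 genes.
deg = [3 2 2 1 0]';
n   = numel(deg);

% Connection probability for each pair, proportional to the product of
% the two degrees and capped at 1 (an assumed scheme, see lead-in).
p = min(1, (deg * deg') / sum(deg));

% Sample each unordered pair once (strict upper triangle), then mirror
% it so the matrix is symmetric with an empty diagonal.
upper_part = triu(rand(n) < p, 1);
A = upper_part | upper_part';

% Degrees actually realized in this random draw.
realized = sum(A, 2);
```

Because every pair is drawn independently, the result keeps no information about which gene interacts with which beyond what the degrees imply; a gene with degree 0 never receives a connection.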
(likely future changes: increased usability/descriptions of materials above, additional data, possibly scripts to auto-generate figures in paper from materials above)