Extended Guilt by Association

Supplement to The role of indirect connections in gene networks in predicting function.
Gillis J, Pavlidis P pubmed reprint


Except where indicated, files are in MATLAB format (.mat or .m files); other formats are available on request.
Accessing some files requires decompressing a zip archive after download.


MOTIVATION: Gene networks have been used widely in gene function prediction algorithms, many based on complex extensions of the “guilt by association” principle. We sought to provide a unified explanation for the performance of gene function prediction algorithms in exploiting network structure and thereby simplify future analysis.

RESULTS: We use coexpression networks to show that most exploited network structure simply reconstructs the original correlation matrices from which the coexpression network was obtained. We show the same principle works in predicting gene function in protein interaction networks and that these methods perform comparably to much more sophisticated gene function prediction algorithms.

Additional comment about the effect of outliers in node degree

A central message of this paper is that a broad range of reasonable weightings for indirect connections gives ‘quite good’ performance, so the exact choice needs little tuning. One of the heuristics discussed in the paper for selecting optimal weights for indirect associations does, however, depend somewhat on the network itself being “reasonable”. It is possible to concoct a network in which the approaches described will fail.

In particular, if a single gene is connected to all other genes, then any heuristic relying on path length is unworkable. Some real biological data we have seen (almost) has this problem: a single gene has thousands of connections to other genes, creating very short paths of close to uniform length among nearly all genes. In this case, even though guilt by association (GBA) is indifferent to a huge range of possible weightings, the heuristic of counting up paths of different lengths will fail. That is, you will move outside the broad plateau of sane values described in the paper.
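The effect of such a hub can be seen on a toy example. The sketch below is not the code used in the paper; the helper names (`shortest_path_lengths`, `path_length_histogram`) and the eight-gene chain are hypothetical and chosen only for illustration. It counts gene pairs at each shortest-path length before and after adding one gene connected to every other gene.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def path_length_histogram(adj):
    """Count unordered gene pairs at each shortest-path length."""
    counts = {}
    nodes = sorted(adj)
    for i, u in enumerate(nodes):
        dist = shortest_path_lengths(adj, u)
        for v in nodes[i + 1:]:
            if v in dist:
                counts[dist[v]] = counts.get(dist[v], 0) + 1
    return counts


# Toy network (an assumption for illustration): a chain of 8 genes, 0-1-2-...-7.
n_genes = 8
chain = {i: set() for i in range(n_genes)}
for i in range(n_genes - 1):
    chain[i].add(i + 1)
    chain[i + 1].add(i)

# Same network plus one hub gene (index 8) connected to every other gene.
hubbed = {i: set(neighbours) for i, neighbours in chain.items()}
hubbed[n_genes] = set(range(n_genes))
for i in range(n_genes):
    hubbed[i].add(n_genes)

chain_hist = path_length_histogram(chain)
hub_hist = path_length_histogram(hubbed)
print("chain:", chain_hist)
print("with hub:", hub_hist)
```

In the chain, path lengths range from 1 to 7; once the hub is added, no pair is more than two steps apart, so path length no longer distinguishes near from distant genes.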




Figure data
These files contain the data used in the figures. The Figure 1 data is a little bulky. Please contact us if you’re interested in any other specific underlying data.


Supplementary Figures in a Word doc

Figure 1 data
Figure 2 data
Figure 3 data
Supplementary Figure 1 data
Supplementary Figure 2 data


Gene lists
These are the gene lists (UCSC Golden Path for human and mouse, NCBI for yeast) used in the paper, with gene symbols, NCBI IDs, and Gemma gene IDs (useful for accessing Gemma web services).


Human list

Mouse list
Yeast list


Gene Ontology matrices (with descriptions, IDs, etc)


Yeast data

Mouse data
Human data


Network data


protein-protein interaction network (PPIN)

Yeast PPIN


The coexpression network is very large and can be regenerated more easily than transferred (or saved, for individual expression experiments).


The expression data  can be accessed at:
where “###” is the experiment number given, e.g., 149


with experiment IDs in Bias exps ID.


The networks can be generated using:
Read raw expression data
Coexpression matrix calculator


These files call Gemma but first check whether the expression data has already been saved, so the data can also be downloaded directly from the links above, after a quick check that the formatting is OK (and with appropriate directory changes, etc.). The expression web pages also have sparse coexpression links for most experiments, from which an aggregated sparse network could be constructed; as discussed in the paper, this is very similar to aggregating un-thresholded data. However, the paper used the underlying data directly to construct the aggregate sparse network.
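For readers who want a feel for what building an aggregate sparse network involves, here is a minimal sketch. It is not the linked scripts, and several choices are assumptions made for illustration: Pearson correlation as the coexpression measure, a fixed number of top-ranked gene pairs (`top_k`) as the threshold, and simple counting across experiments as the aggregation. The function names and toy expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def sparse_coexpression(expr, top_k):
    """Keep the `top_k` most positively correlated gene pairs of one experiment.

    `expr` is a list of per-gene expression profiles (one list per gene).
    """
    n = len(expr)
    scored = [
        (pearson(expr[i], expr[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    ]
    scored.sort(reverse=True)
    return {(i, j) for _, i, j in scored[:top_k]}


def aggregate(experiments, top_k):
    """Count, for each gene pair, how many experiments retain it as a link."""
    counts = {}
    for expr in experiments:
        for pair in sparse_coexpression(expr, top_k):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


# Two made-up experiments with four genes and four samples each.
exp_a = [[1, 2, 3, 4], [2, 4, 6, 8.1], [4, 3, 2, 1], [1, 3, 2, 4]]
exp_b = [[5, 1, 4, 2], [10, 2, 8, 4.2], [1, 2, 3, 4], [2, 5, 1, 4]]

print(aggregate([exp_a, exp_b], top_k=2))
```

Genes 0 and 1 are strongly correlated in both toy experiments, so that pair receives the highest aggregate count.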


Cross-validation calculation


Neighbour voting cross-validation (ROC)

Neighbour voting cross-validation (Precision-recall style b)
Neighbour voting cross-validation (Precision-recall style c)
ROC curve overlay
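As a rough guide to what the cross-validation files above compute, the following sketch shows neighbour voting with cross-validation scored by the area under the ROC curve. It is a simplified illustration rather than the linked MATLAB code: the function names, the fold assignment, and the decision to evaluate each fold on all genes outside the training positives are assumptions.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count 0.5."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def neighbour_vote(adj, train_pos):
    """Score each gene by the fraction of its connection weight that goes
    to genes known (in training) to have the function."""
    scores = []
    for row in adj:
        degree = sum(row)
        votes = sum(row[j] for j in train_pos)
        scores.append(votes / degree if degree else 0.0)
    return scores


def cross_validated_auroc(adj, positives, n_folds=3):
    """Hide one fold of positive genes at a time and test whether
    neighbour voting ranks them above the unlabelled genes."""
    positives = sorted(positives)
    folds = [positives[k::n_folds] for k in range(n_folds)]
    aurocs = []
    for held_out in folds:
        train_pos = [g for g in positives if g not in held_out]
        scores = neighbour_vote(adj, train_pos)
        # Evaluate only on genes whose labels were not used for voting.
        test_genes = [g for g in range(len(adj)) if g not in train_pos]
        labels = [g in held_out for g in test_genes]
        aurocs.append(auroc([scores[g] for g in test_genes], labels))
    return sum(aurocs) / len(aurocs)


# Toy network: two cliques of four genes (0-3 and 4-7) joined by edge 3-4.
n = 8
adj = [[0] * n for _ in range(n)]
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                adj[i][j] = 1
adj[3][4] = adj[4][3] = 1

# Suppose the function of interest is annotated to the first clique.
print(cross_validated_auroc(adj, positives=[0, 1, 2, 3], n_folds=2))
```

Dividing votes by node degree keeps highly connected genes from scoring well merely because they have many neighbours.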


Extension algorithm
In practice, because networks rarely need to be extended very far, it’s typically easiest (in MATLAB, anyway) to write the one or two lines of code that extend the network directly each time. This is also preferable because one can examine network properties while doing so, and it’s a fast, one-time activity that is also (we would argue) fundamental. Nonetheless, here is some code to automate the process: Extending network
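To make the idea concrete, here is a minimal sketch of one common way to extend a network: adding weighted powers of the adjacency matrix, so that genes joined by a path of length two receive a (down-weighted) connection. This is an illustration under stated assumptions, not the linked file; the function names and the example weights `[1.0, 0.5]` are hypothetical, and the paper should be consulted for how weights were actually chosen.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
        for row in a
    ]


def extend_network(adj, weights):
    """Return sum over k of weights[k] * adj^(k+1), with a zeroed diagonal.

    weights[0] scales direct connections, weights[1] scales paths of
    length two, and so on.
    """
    n = len(adj)
    extended = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # starts as adj^1
    for w in weights:
        for i in range(n):
            for j in range(n):
                extended[i][j] += w * power[i][j]
        power = matmul(power, adj)  # next power of the adjacency matrix
    for i in range(n):
        extended[i][i] = 0.0  # ignore paths that return to the same gene
    return extended


# Toy chain of four genes: 0-1-2-3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

extended = extend_network(adj, weights=[1.0, 0.5])
for row in extended:
    print(row)
```

Genes 0 and 2 share neighbour 1, so they gain a connection of weight 0.5, while genes 0 and 3 (three steps apart) remain unconnected at this depth.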


As discussed in the paper, it’s possible to improve performance further by ranking among ties, at some cost in time. In practice, it was not worth fine-tuning to this degree (judging by ROC scores), since most findings of the paper are very robust to any sort of variation: weighting indirect connections incredibly “badly” has virtually no effect, as long as the weights stay within a “sane” range.