Comparison of threshold selection methods for microarray gene co-expression matrices

Background Network and clustering analyses of microarray co-expression correlation data often require application of a threshold to discard small correlations, thus reducing computational demands and decreasing the number of uninformative correlations. This study investigated threshold selection in the context of combinatorial network analysis of transcriptome data. Findings Six conceptually diverse methods - based on number of maximal cliques, correlation of control spots with expressed genes, top 1% of correlations, spectral graph clustering, Bonferroni correction of p-values, and statistical power - were used to estimate a correlation threshold for three time-series microarray datasets. The validity of thresholds was tested by comparison to thresholds derived from Gene Ontology information. Stability and reliability of the best methods were evaluated with block bootstrapping. Two threshold methods, number of maximal cliques and spectral graph, used information in the correlation matrix structure and performed well in terms of stability. Comparison to Gene Ontology found thresholds from number of maximal cliques extracted from a co-expression matrix were the most biologically valid. Approaches to improve both methods were suggested. Conclusion Threshold selection approaches based on network structure of gene relationships gave thresholds with greater relevance to curated biological relationships than approaches based on statistical pair-wise relationships.


Introduction
To extract gene networks from microarray data, correlations are often used as a measure of gene co-expression. A typical microarray with 20,000 gene probes will produce 200 million correlations. Correlations below a threshold value, closer to zero, will be less meaningful. Hard and soft threshold approaches have been applied to biological data. Hard thresholds discard gene pairs with correlation below the threshold, while soft thresholds use the correlation value to weight gene network relationships. Zhang and Horvath [1] concluded that soft thresholds based on aggregate, modular relationships between genes gave more robust results, but data reduction by a hard threshold is often essential for computational tractability of graph algorithms.
We focus on relevance networks, created by applying a hard threshold to the gene expression correlation matrix [2], then extracting gene networks. The resulting networks have been well documented in recent literature to yield sets of co-expressed genes [3][4][5]. Relevance networks are easily converted to graphs, with genes as vertices, only connected by an edge if their correlation is above the threshold. A clique is a sub-graph in which all nodes are connected to each other [6]. A disadvantage of using cliques is the computational requirements, which grow exponentially with number of genes. Thus hard threshold selection is required when performing clique extraction on microarray data. Current approaches to threshold selection are typically statistically based, and do not fully reflect the connectivity of the data [7]. Methods based on statistical arguments may not necessarily yield biologically significant relationships [3,8].
Some studies used an arbitrary threshold correlation such as 0.80 [9]. Moriyama et al. [10] obtained random correlation distributions for gene pairs by permuting their expression values and defended their choice of threshold based on statistical significance. Lee et al. [11] used the top 1% of correlations (absolute value) to build a coexpression network. Voy et al. [3] used distribution of correlations of genes with buffer spots on the arrays to select a threshold correlation value of 0.875.
However, using connectivity of the data to derive thresholds has been suggested. Langston et al. [12] recommended use of ontological distance, statistical significance and various graph structural attributes to arrive at a correlation threshold. Palla et al. [13] found that a threshold based on clique size was effective at separating networks.
Here two threshold selection methods based on correlation graph structure are compared with common statistically based methods. The graph based methods used spectral properties [14] or number of cliques to select a threshold. Objectives were to compare the various hard threshold methods for validity (retention of biological information), stability, and reliability.

Datasets
Three yeast S. cerevisiae time-series datasets were chosen for this study: 31 arrays for Anoxia state [15], 21 arrays for Reoxygenation state [15] and 18 arrays from yeast cultures synchronized using Alpha-factor arrest [16]. Data are available on Gene Expression Omnibus under GSE2246, GSE2267 and GSE22. Extensive GO annotation for S. cerevisiae genes influenced the selection. Exploratory data analyses within each dataset using PCA, box plots and pair-wise correlations between arrays found no outlier arrays. Quantile plots showed data were normally distributed, and distribution of correlations among gene expression profiles had the expected bell-shaped curve, so all data were used.

Software
Software written by Langston and colleagues (University of Tennessee) was used, including Datagen version 1.4a for computing correlations, maximal clique enumeration code version 2.0.1 [17], spectral analysis code [14], and GO Pairwise Similarity analysis code version 1.0. Matrix calculations for spectral graph analysis were carried out in MATLAB 7.0. P-values were calculated in SAS version 9.1 (Cary; NC). Statistical power was calculated using PASS statistical software http://www.ncss.com/pass.html.

Threshold Estimation
Six conceptually different approaches were evaluated: 1) Numbers of maximal cliques were calculated at each potential correlation threshold, starting at r = 0.99. The threshold was lowered, in steps of 0.01, and number of maximal cliques increased due to greater connections among genes. When clique number increased two times (Maximal Clique-2) or three times (Maximal Clique-3) the previous value, that correlation was chosen as the threshold.
2) For each potential threshold correlation value, spectral graph theory [18] was used to decompose the resulting graph into eigenvalues and eigenvectors, which were used to enumerate spectral clusters [19]. As the potential threshold was incrementally lowered in steps of 0.01, a peak in the number of clusters occurs, and the threshold is chosen to maximize cluster number. Details are in [14].
3) Correlations of control spots with all other genes on the array were calculated, creating a null distribution. The 99th percentile correlation value (absolute value) of this distribution gave the threshold.
4) The top 1% of all correlations (absolute value) among genes was used to estimate a threshold [11]. Correlations were ranked, and the correlation at the 99th percentile was the threshold estimate. Note that the control spot method uses a different subset of correlations (only with control spots), whereas this method uses all correlations among genes.

5) A p-value for every correlation was computed, testing if the correlation was zero (Fisher's z-transformation).
Threshold estimate was the correlation value corresponding to the critical Bonferroni p-value, 0.05/number of correlations. This threshold will remove any correlations that are statistically equal to zero. 6) Statistical power calculations were used to find the correlation value that gave an 80% chance of rejecting the null hypothesis, Ho: correlation = 0. Type I error rate in these calculations was Bonferroni-adjusted to correct for multiple testing.
Further details on computing these threshold estimation methods are in the Additional file 1.

Performance Evaluation
Performance of the threshold estimation methods was evaluated by comparison to a biologically based Gene Ontology threshold. GO data used was gene_ontology_edit.obo.2008-05-01.gz. The biological meaning for each correlation bin (in 0.01 increments) was the average of functional similarity scores for all gene pairs within that correlation bin. Functional similarity for a pair of genes was defined as log(n/N)/log(2/N), where n is the number of genes in the lowest GO category that contained both genes, and N is the total number of genes annotated for the organism. The formula normalizes Functional similarity to a 0 to 1 range, and a value of 1 means the GO category contained only the two genes being considered (perfect similarity). GO threshold estimate was defined as the correlation at which change in average functional similarity exceeded median change plus half its standard deviation, thus identifying where biological information begins to accumulate.
To study stability of the methods, 10,000 block bootstrap samples were created by sampling arrays with replacement from each block. Blocks were defined to be 2 or 3 adjacent time periods, such that each block contained 3 or 4 arrays. Block bootstrapping was necessary to preserve as much as possible the time-course dependency structure of the experiments [20]. For each of the 10,000 samples, a threshold estimate was calculated by each method, and the distribution of these thresholds was used to compare threshold methods for stability.

Results
Functional similarity scores for the three datasets are displayed in Figure 1. Changes in scores across correlation values were similar for all datasets, and the lack of GO term relationship for negative correlations is striking. Because of this, the GO threshold was defined by the curve for positive correlations. Biological relationship begins to increase sharply above a correlation value of 0.80, and this produced the GO thresholds in Table 1.
Estimated thresholds obtained by each method are listed in Table 1 for the three datasets. If estimated threshold is higher than the biological threshold, false negatives will occur, because data reduction by the higher threshold will remove real relationships. Conversely, using a threshold below the biological threshold will create false positives, and relationships that are not real would be included in the network. In discovery-based settings, false positives are more acceptable, as they can be removed with further validation. Thus methods that estimate a lower threshold are preferred. Maximal Clique-2 and Spectral Clustering performed better than the other methods, based on summed absolute deviations from GO threshold ( Table  1). Maximal Clique-2 was further from the GO threshold, but might be preferred since it never exceeded that threshold.
The estimated threshold derived for selected methods for each dataset is compared to bootstrap distributions in Table 2. The best methods from above, Maximal Clique-2 and Spectral Clustering, and two other methods for comparative purposes were chosen for this analysis. The bootstrap mean was never less than the estimated threshold, and occasionally was two standard deviations above. This upward bias in correlation is expected, as each time period had a limited number of arrays, making it likely that the identical array would be resampled. However, Maximal Clique and Spectral Clustering methods showed more resistance to this bias. The bootstrap standard deviation measures ability of the methods to produce similar threshold estimates from randomized arrays. Again the network-based methods showed the lowest standard deviations, and highest stability. All methods showed poorest performance with the Alpha dataset, possibly due to its unreplicated design. This makes it less likely that all time levels would be represented in the bootstrap samples, whereas the other datasets had glucose and galactose biological replicates.

Discussion
The two network-based methods, Maximal Clique-2 and Spectral Clustering, performed very well in terms of boot-strap stability and biological validity. Though Maximal Clique-2 method gave thresholds close to the biological threshold, and always below, the method had slightly higher bootstrap standard deviations. The robustness of the Maximal Clique-2 algorithm could be enhanced by exclusion of smaller cliques in the graph, for example cliques of size 3. Spectral Clustering thresholds were on average closer to biological thresholds, but too often exceeded it. However, if all thresholds for Spectral Clustering were lowered by 0.05, it would have been clearly the best method. Further fine-tuning of the parameters in the algorithm (size of sliding window, different tolerance lev-  els for cluster formation) may improve the method's validity. In a recent paper, Almendral and Díaz-Guilera [21] documented the sensitivity of the non-zero eigenvalue to network changes. All methods had subjective settings, and further work on many more species and experiments would be needed to establish best choices.
The results from this study complement the work of Zhang and Horvath [1] which concluded that thresholds based on the scale-free topology -the formation of hubs and densely-connected sub-graphs -produced more robust results. The statistically-based methods studied here are directly dependent on the correlation distribution and thus were unable to capture biological relationships.
Although the Control-Spot method is based on logical reasoning, the high correlation of control spots with other genes on the arrays weakened the method's validity. The Top 1% Correlations method is arbitrary, and failed to capture biological relationships. Statistical considerations used for the Power and Bonferroni methods were also not able to identify biological relationships, reflecting the well-known discrepancy between biological and statistical significance. Experiments that are small will produce thresholds that are too high, while large experiments will give excessively low thresholds, even though the biological relationships are the same.
The GO similarity measure of biological validity we have used, however, is by no means perfect and is just one way of quantifying biological information. Khatri and Draghici [22] have listed limitations of GO in detail. We also found low GO scores at high negative correlations as compared to the high GO score associated with high positive correlations for all three datasets. The drop in GO score at high negative correlations could be due to several reasons, for example experimental and analytical limitations to detect biologically negative correlations among genes, and limited gene annotations [11]. As the quantification of biological information in data gets more precise, the selection of thresholds should become easier. In fact, note that a method like the GO threshold used here would be a logical choice if GO information were complete and accurate.