Finding gene clusters for a replicated time course study
© Qin et al.; licensee BioMed Central Ltd. 2014
Received: 25 July 2013
Accepted: 15 January 2014
Published: 24 January 2014
Finding genes that share similar expression patterns across samples is an important question that is frequently asked in high-throughput microarray studies. Traditional clustering algorithms such as K-means clustering and hierarchical clustering base gene clustering directly on the observed measurements and do not take into account the specific experimental design under which the microarray data were collected. A new model-based clustering method, the clustering of regression models method, takes into account the specific design of the microarray study and bases the clustering on how genes are related to sample covariates. It can find useful gene clusters for studies from complicated study designs such as replicated time course studies.
In this paper, we applied the clustering of regression models method to data from a time course study of yeast on two genotypes, wild type and YOX1 mutant, each with two technical replicates, and compared the clustering results with K-means clustering. We identified gene clusters that have similar expression patterns in wild type yeast, two of which were missed by K-means clustering. We further identified gene clusters whose expression patterns were changed in YOX1 mutant yeast compared to wild type yeast.
The clustering of regression models method can be a valuable tool for identifying genes that are coordinately transcribed by a common mechanism.
Clustering is a useful tool to look for unknown groupings of objects . It has become an important part of the analysis of gene expression data, owing to the pioneering work of Eisen et al. . By looking for clusters of genes that have similar expression levels across samples, researchers hope to better understand gene functions, genetic pathways, and regulatory circuits. Cluster analysis can also be used to cluster samples; we will focus on the problem of gene clustering in this paper.
A number of analytic methods have been applied to the problem of gene clustering. They can largely be classified to two categories: (1) algorithmic clustering methods, such as K-means clustering and hierarchical clustering [2, 3]; and (2) model-based clustering methods, such as the multivariate normal mixture model [4, 5]. These methods generally do not take into account the experimental design, such as cross-sectional (CS) design, longitudinal with no replication (LNR) design, and longitudinal with replications (LWR) design.
We previously developed a model-based clustering method, the clustering of regression models (CORM) method, which employs regression to model gene expression and clusters genes based on their relationship between expression levels and sample covariates . Different from previous clustering methods, CORM partitions systematic variation from non-systematic variation and bases the clustering on systematic variation only. CORM is applicable to data collected under various designs for microarray experiments and takes into account the specific experimental design in the modeling. We have previously implemented the Clustering of Linear Models (CLM) method and the Clustering of Linear Mixed Models (CLMM) method, and applied them to analyze data collected under the CS design and the LNR design, respectively.
Our contributions in this paper are as follows: (1) we illustrate the methodologic advantages of the CORM method over K-means clustering, (2) we demonstrate the application of the CLMM method to gene expression data collected under the LWR design, using a yeast time course dataset measured for two yeast cell lines each with two technical replicates, and (3) we show empirical evidence of CLMM’s benefits compared to K-means through a comparison of the clustering results for the yeast data – two clusters were uniquely identified by CLMM but missed by K-means and a spurious cluster was picked up by K-means and spared by CLMM.
Given a set of objects, K-means clustering seeks a partition of all objects into K groups to minimize the total within group sum of squared Euclidean distance . The minimum could, in theory, be found by searching over all possible clusterings; however, this approach is computationally prohibitive when the number of objects is large. An iterative procedure is instead adopted to search for the minimum. Specifically, K-means starts with an initial value for the cluster centers, then iterates between the cluster-assigning step (each object is assigned to the closest cluster center) and the cluster-center-recalculating step (each cluster center is updated as the average of objects assigned to that cluster), until convergence.
where ϵg is the vector of measurement errors, I is an identity matrix, and ug is a random variable on (1, 2, . . . , K) with probabilities πk = 1/K. Cluster memberships are considered as missing data in the EM algorithm: cluster-assigning step corresponds to the E-step and cluster-center-recalculating step to the M-step.
The CORM method
For the problem of differential expression analysis, the regression modeling framework has been employed to characterize systematic variation in the expression profile of each gene and distinguish it from random variation. Differential expression is identified by contrasting expression levels measured under different experimental conditions or by identifying dependencies on concomitantly measured covariates. The resulting estimated regression models can provide an accurate and precise description of expression profiles. Similarly, the regression model framework can be used for the problem of gene clustering: systematic variation is separated from random variation and gene clustering is based solely on the systematic part of the variation. We call this the clustering of regression models method (CORM) .
where ug is a random variable on (1, 2, . . . ,K) with probabilities (π1, π2, . . . , πK). Complete specification of the CORM modeling framework requires identification of the error structure (parameterized by ξ), which depends on the form of the regression model. The specific form of the regression model used for CORM is flexible. For example, it can be the linear model, the linear mixed model, the nonlinear model, and the nonparametric regression model. Its choice should depend on the experimental design and the scientific question. The EM algorithm can be used to fit the CORM model [11, 12]. Implementation details can be found in  for the clustering of linear models (CLM) method and the clustering of linear mixed models (CLMM) method.
Comparing K-means and CORM
The different gene features considered by K-means and CORM also have implications on other important issues of cluster analysis. We will comment on three such issues here, one with impact before gene clustering and two after gene clustering.
(a) Selection of genes. A microarray provides measurements on thousands of genes, but it is common to select a small subset (tens to hundreds) of genes to cluster, especially for partitional clustering. One reason to select a subset is to keep the computation manageable. Another reason is to try to exclude uninformative genes to prevent them from deteriorating the clustering. For K-means, however, ‘uninformative’ is not well defined. One might select the most variable genes. However, that strategy does not distinguish genes with large signal and genes with large noise when including genes; nor does it distinguish genes with small signal and genes with small noise when excluding genes. With CORM, informative genes can be selected using a per-gene regression model and a significance cutoff appropriately adjusted for multiplicity , and genes with significant systematic variation are clustered to find those that are similarly associated to the covariates. CORM and regression-based differential expression analysis can thus form an integrated framework for the analysis of microarray data.
(b) Characterization and interpretation of clusters. After clustering genes, it is useful to determine the cluster signatures for the identified clusters. Often they are set to be the cluster centers. CORM clusters can be identified by their regression coefficients and have a specific interpretation depending on the experimental design. For example, we can tell whether a gene cluster tends to be up-regulated or down-regulated comparing diseased samples to normal samples. The interpretability of CORM clusters allows a more interpretable comparison of gene clusters identified in different data sets with similar experimental designs – not only the clustering of genes can be compared but also the characteristics of the clusters.
(c) Application of clusters. The average of genes in the same cluster has been proposed to act as predictors for sample classification . CORM tends to find clusters that are more stable across samples, as we will show later. In addition, CORM provides an explicit prediction rule for new genes that are measured on a new set of samples.
CORM provides an alternative clustering method for scenarios when K-means has limitations. For example, while applicable to both CS data and LNR data, K-means does not distinguish the two experimental designs. K-means cannot naturally handle LWR data – profiles of a gene need to be averaged or connected first. K-means might not use all information in the data; for example, in a longitudinal study, it considers time points to be exchangeable and ignores their ordering and correlation. Unlike K-means, CORM can naturally deal with missing values for any gene or sample (under the assumption of missing at random) as well as imbalanced experimental design (for example, different sampling times for different samples in a longitudinal study). Moreover, CORM can easily incorporate technical replicates together with biological replicates in a hierarchical manner.
The gains of CORM depend on the truth of the regression model and its robustness to model misspecification. Ideally, the design of an experiment determines the gene-related feature available for clustering and hence informs parameterization of the regression model for CORM. Experimental design should be chosen to produce the feature that most likely reflects biological clusters of interest. For example, a longitudinal design can be used to find clusters of genes that behave similarly across time, while a cross-sectional design can be used to find clusters of genes that behave similarly across different levels of covariate (for example, disease stage).
To study the regulation of the cell cycle in yeast, we studied gene expression across the cell cycle for both wild type (WT) yeast and single mutant (SM) yeast with the YOX1 gene knocked out . Alpha factor was used for cell synchronization and 6,227 ORFs were measured at 5-min intervals for 120 min. cDNA microarrays were used with a common reference mRNA obtained from logarithmically growing wild type yeast cells and log ratios were used to measure expression levels. Replicate measurements were obtained for both WT yeast and SM yeast. Our goal was to identify co-expression behavior of cell cycle dependent genes. Using three microarray data sets across the yeast cell cycle published by , Zhao et al. (2001) identified a set of 256 genes whose transcription is cell cycle dependent in at least two out of the three data sets using a per-gene regression modeling approach . We focused on these 256 periodic genes in our analysis.
The primary goal of our analysis is to cluster genes that have similar expression patterns among WT yeast. As a secondary goal, we clustered genes using both WT yeast and SM yeast to identify genes whose expression patterns are changed by the mutation. Unlike K-means clustering, CLMM can explicitly accommodate both the replication and the sample covariate (mutation status). In addition, CLMM can naturally deal with the imbalanced experimental design: WT had one bad time point at 105 min and SM had three at 25 min, 40 min, and 55 min, due to technical problems (for example, sample loss, poor hybridization). These bad time points were removed from the cluster analysis. There was also missing data: 41 measures belonging to 17 genes for WT data and 17 measures belonging to 17 genes for SM data were clearly outliers based on signal strength compared to other time points, which arise due to technical failures of the measurement procedure rather than reflecting true biological variation that should be modeled. See Additional file 1.
Cluster WT data
Compare the CLMM-based clustering (rows) versus the K-means (KM) based clustering (columns) for the WT data
CLMM WT 1
CLMM WT 2
CLMM WT 3
CLMM WT 4
CLMM WT 5
CLMM WT 6
CLMM WT 7
CLMM WT 8
Cluster both WT data and SM data
Compare the CLMM clustering using the WT data (rows) and that using both the WT and SM data (columns)
CLMM WT 1
CLMM WT 2
CLMM WT 3
CLMM WT 4
CLMM WT 5
CLMM WT 6
CLMM WT 7
CLMM WT 8
To summarize, both K-means and CORM are useful tools for clustering genes using expression data. K-means makes no assumption about the relationship between expression levels and sample covariates. It is intuitive and has produced reasonable results in applications . K-means is especially useful to explore the data when no prior knowledge is available on genes’ relationship to covariates. CORM assumes a regression relationship between gene expression and covariate. When the assumption holds, CORM is able to provide more precise clustering and cluster center estimates. Moreover, CORM is capable of naturally handling data with complicated experimental design, for example, longitudinal with replications design, unbalanced time points, and missing data.
Gene clustering for time course data has been under active research over the past decade . A number of other researcher groups have also independently developed gene clustering methods using the mixture of random effects models as the backbone with some variations in the specific modeling and implifications [19–23]. Our results offer further evidence of benefits of the CORM method for time course data over the traditional K-means clustering method. The CORM method is now available as an R package named CORM at the R CRAN.
This work is supported in part by NIH grants CA151947 and CA140146 to LXQ, GM41073 to LB, and NIH grants U01 AI46703, R01 AG014358, and R37 AI29168 to SGS.
- Kaufman L, Rousseeuw PJ: Finding groups in data: an introduction to cluster analysis. 1990, New York: John Wiley & Sons, IncView ArticleGoogle Scholar
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95 (25): 14863-14868. 10.1073/pnas.95.25.14863.PubMedPubMed CentralView ArticleGoogle Scholar
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet. 1999, 22 (3): 281-285. 10.1038/10343.PubMedView ArticleGoogle Scholar
- Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL: Model-based clustering and data transformations for gene expression data. Bioinformatics (Oxford, England). 2001, 17 (10): 977-987. 10.1093/bioinformatics/17.10.977.View ArticleGoogle Scholar
- Ghosh D, Chinnaiyan AM: Mixture modelling of gene expression data from microarray experiments. Bioinformatics (Oxford, England). 2002, 18 (2): 275-286. 10.1093/bioinformatics/18.2.275.View ArticleGoogle Scholar
- Qin LX, Self SG: The clustering of regression models method with applications in gene expression data. Biometrics. 2006, 62 (2): 526-533. 10.1111/j.1541-0420.2005.00498.x.PubMedView ArticleGoogle Scholar
- MacQueen J: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. Some methods for classification and analysis of multivariate observations. 1967, Berkeley, Calif: University of California Press, 281-297.http://projecteuclid.org/euclid.bsmsp/1200512992,Google Scholar
- Celeux G, Govaert G: A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data An. 1992, 14 (3): 315-332. 10.1016/0167-9473(92)90042-E.View ArticleGoogle Scholar
- McLachlan G: The classification and mixture maximum likelihood approaches to cluster analysis. Handbook of Statistics. 1982, 2 (1982): 199-208.View ArticleGoogle Scholar
- McLachlan G, Basford K: Mixture models: inference and applications to clustering. 1988, New York: Marcel DekkerGoogle Scholar
- Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B Methodol. 1977, 39 (1): 1-38.Google Scholar
- Fraley C, Raftery AE: How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J. 1998, 41 (8): 578-588. 10.1093/comjnl/41.8.578.View ArticleGoogle Scholar
- Storey JD: A direct approach to false discovery rates. J R Stat Soc Ser B Stat Methodol. 2002, 64 (3): 479-498. 10.1111/1467-9868.00346.View ArticleGoogle Scholar
- Park MY, Hastie T, Tibshirani R: Averaged gene expressions for regression. Biostatistics (Oxford, England). 2007, 8 (2): 212-227. 10.1093/biostatistics/kxl002.View ArticleGoogle Scholar
- Pramila T, Miles S, GuhaThakurta D, Jemiolo D, Breeden LL: Conserved homeodomain proteins interact with MADS box protein Mcm1 to restrict ECB-dependent transcription to the M/G1 phase of the cell cycle. Gene Dev. 2002, 16 (23): 3034-3045. 10.1101/gad.1034302.PubMedPubMed CentralView ArticleGoogle Scholar
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9 (12): 3273-3297. 10.1091/mbc.9.12.3273.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhao LP, Prentice R, Breeden L: Statistical modeling of large microarray data sets to identify stimulus-response profiles. Proc Natl Acad Sci USA. 2001, 98 (10): 5631-5636. 10.1073/pnas.101013198.PubMedPubMed CentralView ArticleGoogle Scholar
- Li L, Lu Y, Qin LX, Bar-Joseph Z, Werner-Washburne M, Breeden LL: Budding yeast SSD1-V regulates transcript levels of many longevity genes and extends chronological life span in purified quiescent cells. Mol Biol Cell. 2009, 20 (17): 3851-3864. 10.1091/mbc.E09-04-0347.PubMedPubMed CentralView ArticleGoogle Scholar
- Luan Y, Li H: Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics (Oxford, England). 2003, 19 (4): 474-482. 10.1093/bioinformatics/btg014.View ArticleGoogle Scholar
- Ma P, Castillo-Davis CI, Zhong W, Liu JS: A data-driven clustering method for time course gene expression data. Nucleic Acids Res. 2006, 34 (4): 1261-1269. 10.1093/nar/gkl013.PubMedPubMed CentralView ArticleGoogle Scholar
- Ng SK, McLachlan GJ, Wang K, Ben-Tovim Jones L, Ng SW: A mixture model with random-effects components for clustering correlated gene-expression profiles. Bioinformatics (Oxford, England). 2006, 22 (14): 1745-1752. 10.1093/bioinformatics/btl165.View ArticleGoogle Scholar
- Joo Y, Casella G, Hobert J: Bayesian model-based tight clustering for time course data. Computation Stat. 2010, 25 (1): 17-38. 10.1007/s00180-009-0159-7.View ArticleGoogle Scholar
- Wang K, Ng SK, McLachlan GJ: Clustering of time-course gene expression profiles using normal mixture models with autoregressive random effects. BMC Bioinformatics. 2012, 13: 300-10.1186/1471-2105-13-300.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.