BiGGEsTS: integrated environment for biclustering analysis of time series gene expression data
© Gonçalves et al; licensee BioMed Central Ltd. 2009
Received: 24 March 2009
Accepted: 07 July 2009
Published: 07 July 2009
The ability to monitor changes in expression patterns over time, and to observe the emergence of coherent temporal responses using expression time series, is critical to advance our understanding of complex biological processes. Biclustering has been recognized as an effective method for discovering local temporal expression patterns and unraveling potential regulatory mechanisms. The general biclustering problem is NP-hard. In the case of time series this problem is tractable, and efficient algorithms can be used. However, there is still a need for specialized applications able to take advantage of the temporal properties inherent to expression time series, both from a computational and a biological perspective.
BiGGEsTS makes available state-of-the-art biclustering algorithms for analyzing expression time series. Gene Ontology (GO) annotations are used to assess the biological relevance of the biclusters. Methods for preprocessing expression time series and post-processing results are also included. The analysis is additionally supported by a visualization module capable of displaying informative representations of the data, including heatmaps, dendrograms, expression charts and graphs of enriched GO terms.
BiGGEsTS is a free open source graphical software tool for revealing local coexpression of genes in specific intervals of time, while integrating meaningful information on gene annotations. It is freely available at: http://kdbio.inesc-id.pt/software/biggests. We present a case study on the discovery of transcriptional regulatory modules in the response of Saccharomyces cerevisiae to heat stress.
Extracting relevant biological information from expression data provides important insights into the relations between genes participating in biological processes. This information can be used to identify co-regulated genes corresponding to transcriptional regulatory modules, thus contributing to the challenging goal of regulatory network inference.
Processing expression data is time- and resource-consuming. In this context, the development of novel computational algorithms and tools for expression data analysis is primarily focused on efficiency and robustness. Clustering techniques have been extensively applied to both dimensions of expression matrices separately, focusing on either gene or sample expression patterns. However, many patterns are common to a subset of genes only in a specific subset of experimental conditions. In fact, our general understanding of cellular processes leads us to expect subsets of genes to be coexpressed only in certain experimental conditions, but to behave almost independently in others. These local expression patterns can only be discovered using biclustering techniques [1, 2], which may be the key to uncovering many regulatory mechanisms that are not apparent otherwise. Although the majority of the biclustering formulations are NP-hard, the biclustering problem becomes tractable when expression levels are measured over time, restricting the analysis to biclusters with consecutive time points [4–7].
BiGGEsTS (BiclusterinG Gene Expression Time Series) is a free and open source graphical application using state-of-the-art biclustering algorithms specifically developed for analyzing gene expression time series. The current version integrates the methods proposed by Zhang et al. and Madeira and Oliveira [6, 7]. An alternative approach from Ji and Tan was not included due to complexity issues. The integration of additional algorithms pursuing similar goals is straightforward. In addition, BiGGEsTS offers well-known preprocessing techniques to filter genes, treat missing values and to smooth, normalize, and discretize expression data. A visualization module supports the analysis of both data and results. Graphical representations include colored matrices (heatmaps), expression and pattern charts, and dendrograms. Biclusters can be studied using Gene Ontology (GO) annotations. BiGGEsTS is also able to generate ontology graphs representing enriched GO terms, and to filter and/or sort biclusters according to several numerical and statistical criteria.
Several applications are available for the analysis of gene expression data using clustering [8–15], biclustering [16, 17] or both [18, 19] approaches. For clustering we highlight Genesis, which implements hierarchical clustering, k-means, self-organizing maps (SOMs), principal component analysis and support vector machines, together with filtering and normalization methods.
Expander and BicAT offer both clustering (hierarchical and k-means) and biclustering techniques. Expander also performs clustering using SOMs and CLICK. Regarding biclustering, Expander uses SAMBA, and BicAT integrates the Cheng and Church approach, the Iterative Signature Algorithm (ISA), the Order-preserving Submatrix method (OPSM), xMotif and BiMax. Both tools include preprocessing methods such as filtering, normalization, log transformation and discretization. Expander further evaluates the biological relevance of clusters/biclusters by computing the functional enrichment of GO terms and retrieves information on promoter signals.
Few applications actually address the problem of analyzing time series and they typically apply clustering [11–14]. CAGED and GQL model expression profiles using Markov chains. CAGED applies agglomerative clustering to group genes with similar expression profiles, while GQL combines the individual Markov models into a mixture model. STEM uses a greedy clustering algorithm.
TimeClust offers hierarchical clustering and SOMs, together with Bayesian and temporal abstraction approaches. To our knowledge, only PAGE provides a biclustering algorithm specifically designed for expression time series, which is a modified version of the approach of Ji and Tan. Most of these applications lack essential preprocessing steps useful to clean and prepare data for analysis.
Input and preprocessing of time series gene expression data
The input of expression time series is straightforward (Figure 1(a)) and is usually followed by a set of preprocessing steps (Figure 1(b)). These handle occasional and systematic errors, reduce noise, and prepare data to be analyzed.
Occasional errors may occur when measuring the abundance of mRNA in cells, leading to missing values, which biclustering algorithms do not always support. This can be addressed by filtering all genes with missing values, thus eliminating all rows with at least one missing value, and may be regarded as a good strategy to reduce noise. However, when analyzing a small number of genes, removing some of them can lead to a significant reduction in the dimension of the dataset, potentially compromising further analysis. The tradeoff between the elimination of missing values and the dimension of the dataset is usually mitigated by establishing an upper bound for the percentage of missing values allowed per gene. Genes with percentages higher than a user-defined threshold are filtered. The remaining missing values must be filled.
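The filtering and filling steps described above can be illustrated with a minimal Python sketch. It is not the BiGGEsTS implementation: the function names are hypothetical, missing values are assumed to be encoded as NaN, and filling with the gene's mean is only one possible strategy, chosen here for brevity.

```python
import math

def filter_genes_by_missing(matrix, max_missing_fraction):
    """Keep only the rows (genes) whose fraction of missing values
    (encoded as NaN, an assumption of this sketch) is within the bound."""
    kept = []
    for row in matrix:
        missing = sum(1 for v in row if math.isnan(v))
        if missing / len(row) <= max_missing_fraction:
            kept.append(row)
    return kept

def fill_missing_with_row_mean(row):
    """Replace each NaN with the mean of the observed values of the gene.
    Row-mean filling is an illustrative choice, not a prescribed method."""
    observed = [v for v in row if not math.isnan(v)]
    if not observed:
        raise ValueError("cannot fill a row with no observed values")
    row_mean = sum(observed) / len(observed)
    return [row_mean if math.isnan(v) else v for v in row]

nan = float("nan")
expression = [
    [1.0, 2.0, 3.0, 4.0],   # complete gene
    [1.0, nan, nan, 4.0],   # 50% missing, filtered with a 25% bound
    [nan, 2.0, 3.0, 4.0],   # 25% missing, kept and then filled
]
kept = filter_genes_by_missing(expression, max_missing_fraction=0.25)
filled = [fill_missing_with_row_mean(row) for row in kept]
```

A stricter bound of 0 reproduces the strategy of removing every gene with at least one missing value.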
Systematic errors, on the other hand, affect every measurement and are associated with the differences between the experimental settings of each trial. Sources of this kind of error include the different incorporation efficiency of dyes, and the different scanning and processing parameters of distinct experiments. Normalization is a widely used technique that attempts to compensate for these systematic differences between time points and to highlight the similarities and differences in the expression profiles. Additionally, a smoothing algorithm acts as a low-pass filter, attenuating the effect of outliers. Depending on the biclustering algorithm, it may be necessary to discretize data, reducing the range of expression values to an adequate set of discrete values.
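As a rough illustration of these three steps, the following Python sketch normalizes a gene profile, smooths it, and discretizes it. All choices here are assumptions made for the example rather than the methods offered by the tool: z-score normalization per gene, a centered moving average as the low-pass filter, and a three-symbol alphabet (D, N, U for down, no change, up) derived from the variation between consecutive time points.

```python
from statistics import mean, pstdev

def normalize_gene(profile):
    """Z-score normalization: zero mean and unit standard deviation."""
    mu = mean(profile)
    sigma = pstdev(profile)
    if sigma == 0:
        return [0.0 for _ in profile]  # flat profile, nothing to scale
    return [(v - mu) / sigma for v in profile]

def smooth(profile, window=3):
    """Centered moving average; the window shrinks at the boundaries."""
    half = window // 2
    smoothed = []
    for i in range(len(profile)):
        lo = max(0, i - half)
        hi = min(len(profile), i + half + 1)
        smoothed.append(mean(profile[lo:hi]))
    return smoothed

def discretize_transitions(profile, threshold):
    """Map each consecutive pair of time points to U, D or N, depending on
    whether the variation exceeds +threshold, -threshold or neither."""
    symbols = []
    for prev, curr in zip(profile, profile[1:]):
        delta = curr - prev
        if delta > threshold:
            symbols.append("U")
        elif delta < -threshold:
            symbols.append("D")
        else:
            symbols.append("N")
    return "".join(symbols)

profile = [0.2, 1.4, 1.5, 0.1, 0.0]
pattern = discretize_transitions(smooth(normalize_gene(profile)), threshold=0.5)
```

Note that this discretization yields one symbol fewer than the number of time points, because it encodes transitions rather than absolute levels.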
e-CCC-Biclustering extracts all maximal CCC-Biclusters with approximate expression patterns in time polynomial in the size of the expression matrix. The expression patterns in an e-CCC-Bicluster may vary from one gene to another, as long as the number of errors between each pattern and the pattern profile does not exceed a predefined value. Two kinds of errors are supported: general and restricted. General errors identify measurement errors and allow symbols to be substituted by any other symbol in the discretization alphabet. Restricted errors identify discretization errors and only consider as valid the substitutions of symbols by a predefined number of neighbors in the discretization alphabet.
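The difference between general and restricted errors can be made concrete with a small Python sketch. It only checks whether one discretized pattern lies within the allowed errors of a pattern profile; it does not enumerate biclusters and is not the e-CCC-Biclustering algorithm itself. The function name, the ordered alphabet "DNU" and the `neighbors` parameter are assumptions of this example.

```python
def within_errors(pattern, profile, max_errors, alphabet="DNU", neighbors=None):
    """Return True if `pattern` differs from `profile` in at most `max_errors`
    positions. With neighbors=None any substitution counts as a valid error
    (general errors). With an integer, a substitution is valid only if the
    two symbols are at most `neighbors` positions apart in the ordered
    alphabet (restricted errors)."""
    if len(pattern) != len(profile):
        raise ValueError("pattern and profile must have the same length")
    errors = 0
    for p, q in zip(pattern, profile):
        if p == q:
            continue
        if neighbors is not None:
            if abs(alphabet.index(p) - alphabet.index(q)) > neighbors:
                return False  # substitution not allowed under restricted errors
        errors += 1
    return errors <= max_errors

# "D" instead of "U" is one general error, but not a valid restricted error
# with one neighbor, since D and U are two positions apart in "DNU".
general_ok = within_errors("UDN", "UUN", max_errors=1)
restricted_ok = within_errors("UDN", "UUN", max_errors=1, neighbors=1)
```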
Both CCC-Biclustering and e-CCC-Biclustering are provided with three additional extensions that identify biclusters with shifted/scaled, anti-correlated and time-lagged patterns. Sometimes, distinct genes exhibit similar expression evolutions at different expression levels, thus not reflecting a similar pattern after discretization. This problem is addressed by identifying biclusters with shifted patterns. Anti-correlation allows genes with opposite expression patterns, in a set of consecutive time points, to be included in the same bicluster. The time-lagged approach identifies genes that exhibit similar expression patterns starting at different time points, enabling the identification of activation/inhibition delays.
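To give an intuition for the anti-correlated and time-lagged extensions, the sketch below operates on discretized patterns over an assumed alphabet {D, N, U}. It is a simplified illustration with hypothetical helper names, not the way the extensions are implemented in the algorithms: an anti-correlated pattern is obtained by swapping U and D, and a time lag is found by sliding one pattern along another.

```python
def invert_pattern(pattern):
    """Opposite pattern: up becomes down and vice versa, no change stays."""
    swap = {"U": "D", "D": "U", "N": "N"}
    return "".join(swap[symbol] for symbol in pattern)

def find_lag(reference, other, length, max_lag):
    """Smallest lag (in time points) at which the first `length` symbols of
    `reference` reappear in `other`, or None if there is no such lag."""
    target = reference[:length]
    if len(target) < length:
        return None
    for lag in range(max_lag + 1):
        if other[lag:lag + length] == target:
            return lag
    return None

gene_a = "UDUN"
gene_b = "NNUDU"   # same initial pattern as gene_a, two time points later
gene_c = "DUDN"    # opposite pattern to gene_a

lag = find_lag(gene_a, gene_b, length=3, max_lag=2)
anti_correlated = invert_pattern(gene_c) == gene_a
```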
CC-TSB is an adaptation of the biclustering algorithm proposed by Cheng and Church. This heuristic approach uses the mean squared residue (MSR) as merit function and iteratively alternates the addition/removal of genes/time points, forcing the MSR to decrease. The addition/removal of time points is restricted to discover only biclusters with contiguous columns.
Applying biclustering to expression data often yields a large number of biclusters. Since analyzing all of them is usually prohibitive, post-processing techniques are applied in order to rank biclusters according to their relevance. Several methods are available to filter and sort biclusters based on numerical and statistical criteria (Figure 2(b)).
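The merit function can be sketched in a few lines of Python. The code below computes the mean squared residue of Cheng and Church for a set of genes and a contiguous range of time points; the function name and the half-open column range are choices of this example, and the iterative addition/removal heuristic itself is not shown.

```python
def mean_squared_residue(matrix, rows, col_start, col_end):
    """Mean squared residue of the submatrix defined by the given row
    indices and the contiguous columns col_start..col_end-1. Each residue is
    a_ij - (row mean) - (column mean) + (submatrix mean)."""
    sub = [matrix[i][col_start:col_end] for i in rows]
    n_rows, n_cols = len(sub), col_end - col_start
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(row[j] for row in sub) / n_rows for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)

expression = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [5.0, 6.0, 7.0, 5.0],
]
# The first three time points follow a purely additive pattern, so their
# residue is (numerically) zero; adding the fourth time point increases it.
coherent = mean_squared_residue(expression, rows=[0, 1, 2], col_start=0, col_end=3)
noisy = mean_squared_residue(expression, rows=[0, 1, 2], col_start=0, col_end=4)
```

Restricting the columns to a range rather than an arbitrary subset is what keeps the time points contiguous.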
Biclustering groups genes and conditions based on relations inferred from data, relying strictly on computational methods. Researchers are usually interested in analyzing the results looking for statistically significant biological phenomena. This significance can be assessed using Ontologizer's term-for-term analysis, which computes the functional enrichment of the genes in the biclusters by identifying the overrepresented GO terms. In a first step, GO annotations are extracted from two distinct files (downloadable from the GO repository if not available) containing the complete ontology and the organism-dependent annotations. Term-for-term analysis is then applied to compute a p-value for each GO term. This p-value is calculated with respect to the null hypothesis that the inclusion of genes follows a hypergeometric distribution, and measures the statistical significance of each term by comparing the frequencies of annotated genes in the bicluster and in the complete dataset. The Bonferroni correction for multiple testing is applied. The lower the p-value, the more significant the term is. According to standard statistical practice, terms with a corrected p-value lower than 0.05 and 0.01 are considered significant and highly significant, respectively.
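The computation of an enrichment p-value can be sketched as follows. This is an illustration of the hypergeometric test and of the Bonferroni correction in plain Python, not a reproduction of Ontologizer's code: the function names are hypothetical, the p-value is assumed to be the upper tail of the hypergeometric distribution, and the number of tests is assumed to be the number of terms evaluated.

```python
from math import comb

def enrichment_p_value(pop_size, pop_annotated, study_size, study_annotated):
    """Probability of observing at least `study_annotated` annotated genes
    when drawing `study_size` genes without replacement from a population of
    `pop_size` genes, of which `pop_annotated` carry the term."""
    total = comb(pop_size, study_size)
    upper = min(pop_annotated, study_size)
    tail = sum(
        comb(pop_annotated, k) * comb(pop_size - pop_annotated, study_size - k)
        for k in range(study_annotated, upper + 1)
    )
    return tail / total

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

# Toy example: 6000 genes in the dataset, 60 annotated with a term, and a
# bicluster of 50 genes of which 8 carry that term.
p_term = enrichment_p_value(6000, 60, 50, 8)
corrected = bonferroni([p_term, 0.2, 0.004])
significant = [p < 0.05 for p in corrected]
```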
A visualization module provides different graphical representations of expression time series, enhancing their most important features: tables of values, colors and symbols; dendrograms; expression charts; pattern charts; tables of GO terms and functional enrichment; and graphs of enriched terms.
Expression matrices and heatmaps
Dendrograms are branching tree-like diagrams used for representing similarity relationships between the genes/time points in the expression data (Figure 3(e)). The similarity hierarchy is obtained using agglomerative hierarchical clustering to group genes and/or time points. At each step, the cluster pairwise similarity is used to decide which clusters to merge. For single element clusters, this similarity is computed using a distance measure (Euclidean or cityblock) or a correlation coefficient (uncentered, Pearson's, absolute uncentered, absolute Pearson's, Spearman's or Kendall's correlation). Otherwise, a single-linkage, complete-linkage, average-linkage or centroid-linkage approach is used. Java TreeView is used to interpret hierarchical clustering results and display the dendrograms.
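The merging procedure behind a dendrogram can be illustrated with a naive Python sketch. It covers only the Euclidean and cityblock distances and the single, complete and average linkages, uses hypothetical function names, and favors readability over efficiency; it is not the clustering code used by the tool.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cityblock(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cluster_distance(c1, c2, profiles, metric, linkage):
    """Distance between two clusters (tuples of row indices)."""
    pairwise = [metric(profiles[i], profiles[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unsupported linkage: " + linkage)

def agglomerate(profiles, metric=euclidean, linkage="average"):
    """Repeatedly merge the two closest clusters and return the merges in
    order; this ordered list is what a dendrogram depicts."""
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], profiles, metric, linkage),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges

profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
merges = agglomerate(profiles, metric=cityblock, linkage="single")
```

Swapping the metric for a correlation-based dissimilarity would follow the same structure.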
Expression and pattern charts
Expression charts can be displayed using either the subset of time points in the bicluster, or all the time points in the dataset. The latter are particularly suitable for highlighting the coherent behavior of the genes in the bicluster time points as opposed to the uncorrelated behavior in the remaining time points. Both expression and pattern charts provide a context menu with extra functionalities, including displaying and modifying chart properties, saving the chart to an image file, printing, and zooming in or out.
GO terms and functional enrichment
Graphs of enriched terms
Term-for-term results can be used to generate tree structured graphs highlighting the enriched terms in each of the three GO ontologies (Figure 5(b)). Graphs of enriched terms are generated using Ontologizer, which outputs the structure of the graph into a text file. Graphviz is used to convert the text file into an SVG file describing the same graphical structure using the XML standard. Finally, the Batik SVG Toolkit is used to interpret the SVG file and display the corresponding image. The graph with the enriched terms can be zoomed in or out and saved as a raster (PNG) or vector (SVG) image file.
BiGGEsTS is a software tool for analyzing gene expression time series using biclustering. It was designed to comply with the broad specifications of a software tool, essentially focused on user-friendliness, platform independence, modularity, reusability and efficiency.
Additional material includes a case study describing how to use the software to discover transcriptional regulatory modules in a dataset containing the response of Saccharomyces cerevisiae to heat stress, reproducing the results published in [see Additional file 4] [see Additional file 5] [see Additional file 6].
Availability and requirements
Project name: BiGGEsTS – BiclusterinG Gene Expression Time Series
Project home page: http://kdbio.inesc-id.pt/software/biggests/
Operating systems: Platform independent
Programming language: Java
Other requirements: Java 1.5 or higher, 1024 MB of RAM, Graphviz (in OSs other than Windows and Mac OS)
License: GNU GPL version 3 or higher
This work was partially supported by projects ARN – Algorithms for the Identification of Genetic Regulatory Networks, PTDC/EIA/67722/2006, and Dyablo – Models for the Dynamic Behavior of Biological Networks, PTDC/EIA/71587/2006, funded by FCT, Fundação para a Ciência e Tecnologia. The work of JPG was partially supported by FCT grant SFRH/BD/36586/2007.
- Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform. 2004, 1 (1): 24-45. 10.1109/TCBB.2004.2.
- Cheng Y, Church GM: Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol. 2000, 8: 93-103.
- Ben-Dor A, Chor B, Karp R, Yakhini Z: Discovering local structure in gene expression data: The Order-Preserving Submatrix Problem. J Comput Biol. 2003, 10 (3-4): 373-384. 10.1145/565196.565203.
- Ji L, Tan K: Identifying time-lagged gene clusters using gene expression data. Bioinformatics. 2005, 21 (4): 509-516.
- Zhang Y, Zha H, Chu CH: A time-series biclustering algorithm for revealing co-regulated genes. Proc of the 5th IEEE International Conference on Information Technology: Coding and Computing. 2005, Las Vegas, Nevada, USA: IEEE Computer Society, 32-37.
- Madeira SC, Teixeira MC, Sá-Correia I, Oliveira AL: Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2008, [http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.34]
- Madeira SC, Oliveira AL: A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series. Algorithms Mol Biol. 2009, 4: 8. 10.1186/1748-7188-4-8.
- Sturn A, Quackenbush J, Trajanoski Z: Genesis: cluster analysis of microarray data. Bioinformatics. 2002, 18 (1): 207-208.
- Yoshida R, Higuchi T, Miyano S: ArrayCluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles. Bioinformatics. 2006, 22 (12): 1538-1539.
- Pan F, Kamath K, Zhang K, Pulapura S, Achar A, Nunez-Iglesias J, Huang Y, Yan X, Han J, Hu H, Xu M, Hu J, Zhou X: Integrative Array Analyzer: a software package for analysis of cross-platform and cross-species microarray data. Bioinformatics. 2006, 22 (13): 1665-1667.
- Ramoni MF, Sebastiani P, Kohane IS: Cluster analysis of gene expression dynamics. Proc Natl Acad Sci U S A. 2002, 99 (14): 9121-9126. 10.1073/pnas.132656399.
- Costa I, Schönhuth A, Schliep A: The Graphical Query Language: a tool for analysis of gene expression time-courses. Bioinformatics. 2005, 21 (10): 2544-2545.
- Ernst J, Bar-Joseph Z: STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics. 2006, 7: 191.
- Magni P, Ferrazzi F, Sacchi L, Bellazzi R: TimeClust: a clustering tool for gene expression time series. Bioinformatics. 2008, 24 (3): 430-432.
- Dietzsch J, Gehlenborg N, Nieselt K: Mayday – a microarray data analysis workbench. Bioinformatics. 2006, 22 (8): 1010-1012.
- Cheng KO, Law NF, Siu WC, Lau TH: BiVisu: software tool for bicluster detection and visualization. Bioinformatics. 2007, 23 (17): 2342-2344.
- Leung E, Bushel P: PAGE: phase-shifted analysis of gene expression. Bioinformatics. 2006, 22 (3): 367-368.
- Sharan R, Maron-Katz A, Shamir R: CLICK and EXPANDER: a system for clustering and visualizing gene expression data. Bioinformatics. 2003, 19 (14): 1787-1799.
- Barkow S, Bleuler S, Prelić A, Zimmermann P, Zitzler E: BicAT: a biclustering analysis tool. Bioinformatics. 2006, 22 (10): 1282-1283.
- Tanay A, Sharan R, Kupiec M, Shamir R: Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci U S A. 2004, 101 (9): 2981-2986. 10.1073/pnas.0308661100.
- Ihmels J, Bergmann S, Barkai N: Defining transcription modules using large-scale gene expression data. Bioinformatics. 2004, 20 (13): 1993-2003.
- Murali TM, Kasif S: Extracting conserved gene expression motifs from gene expression data. Pac Symp Biocomput. 2003, 77-88.
- Madeira SC: Efficient Biclustering Algorithms for Time Series Gene Expression Data Analysis. PhD thesis. 2008, Instituto Superior Técnico, Technical University of Lisbon, Lisbon, Portugal.
- Robinson PN, Wollstein A, Bohme U, Beattie B: Ontologizing gene-expression microarray data: characterizing clusters with Gene Ontology. Bioinformatics. 2004, 20 (6): 979-981.
- Saldanha AJ: Java Treeview: extensible visualization of microarray data. Bioinformatics. 2004, 20 (17): 3246-3248.
- Ellson J, Gansner E, Koutsofios L, North S, Woodhull G: Graphviz and Dynagraph – static and dynamic graph drawing tools. Graph Drawing Software. 2003, Springer-Verlag, 127-148.
- Batik – Java SVG Toolkit. [http://xmlgraphics.apache.org/batik/]
- Gasch A, Spellman P, Kao C, Carmel-Harel O, Eisen M, Storz G, Botstein D, Brown P: Genomic expression program in the response of yeast cells to environmental changes. Molecular Biology of the Cell. 2000, 11 (12): 4241-4257.