Analysing time course microarray data using Bioconductor: a case study using yeast2 Affymetrix arrays

Background Large scale microarray experiments are becoming increasingly routine, particularly those which track a number of different cell lines through time. This time-course information provides valuable insight into the dynamic mechanisms underlying the biological processes being observed. However, proper statistical analysis of time-course data requires the use of more sophisticated tools and complex statistical models. Findings Using the open source CRAN and Bioconductor repositories for R, we provide example analysis and protocol which illustrate a variety of methods that can be used to analyse time-course microarray data. In particular, we highlight how to construct appropriate contrasts to detect differentially expressed genes and how to generate plausible pathways from the data. A maintained version of the R commands can be found at http://www.mas.ncl.ac.uk/~ncsg3/microarray/. Conclusions CRAN and Bioconductor are stable repositories that provide a wide variety of appropriate statistical tools to analyse time course microarray data.


Introduction
As experimental costs decrease, large scale microarray experiments are becoming increasingly routine, particularly those which track a number of different cell lines through time. This is because time-course information provides valuable insight into the dynamic mechanisms underlying the biological processes being observed. However, a proper statistical analysis of time-course data requires the use of more sophisticated tools and complex statistical models. For example, problems due to multiple comparisons are increased by catering for changing effects over time. In this case study, we demonstrate how to analyse time-course microarray data by investigating a data set on yeast. We discuss issues related to normalisation, extraction of probesets for specific species, chip quality and differential expression. We also discuss network inference in the Additional file 1. The freely available software system R (see [1,2]) has many benefits for analysing data of this type and so throughout the analysis we give the R commands that produce the numerical/graphical output shown in this paper. A maintained version of the R commands can be found at http://www.mas.ncl.ac.uk/~ncsg3/microarray/.

Description of the data
The data were collected according to the experimental protocol described in [3]. Briefly, three biological replicates were studied on each of a wild-type (WT) yeast strain and a strain carrying the cdc13-1 temperature sensitive mutation (in which telomere uncapping is induced by growth at temperatures above around 27°C). These replicates were sampled initially at 23°C (at which cdc13-1 has essentially WT telomeres) and then at 1, 2, 3 and 4 hours after a shift to 30°C to induce telomere uncapping. The thirty resulting RNA samples were hybridised to Affymetrix yeast2 arrays. The microarray data are available in the ArrayExpress database (see [4]) under accession number E-MEXP-1551 .
A list of packages used in this paper is given in the Additional file 1.

Entering data into Bioconductor
The data used in this paper can be downloaded from ArrayExpress into R using the commands >library(ArrayExpress) > yeast.raw = ArrayExpress('E-MEXP-1551') Unfortunately due to changes in the ArrayExpress website, the ArrayExpress package for Bioconductor 2.4 (the default version for R 2.9) produces an error and so we must use the package in Bioconductor 2.5 (the default version for R 2.10). Details for downloading the latest ArrayExpress package can be found in the Additional file 1.
A brief description of the yeast.raw object can be obtained by using the print (yeast.raw) command: AffyBatch The data frame exp_fac stores all the necessary information, such as strain, time and replicate, which are necessary for the statistical analysis.
Note that there are two yeast species on this chip, S. pombe and S. cerevisiae. Also, amongst the 10,928 probesets (with each probeset having 11 probe pairs), there are 5,900 S. cerevisiae probesets.

Extraction of S. cerevisiae probesets
As these microarrays contain probesets for both S. cerevisiae and S. pombe, we first need to extract the S. cerevisiae data before normalisation. This can be done by filtering out the S. pombe data using the s_cerevisiae.msk file from the Affymetrix website (see [9]). Note that users first need to register with the Affymetrix website before downloading this file. Also note that in our analysis, the transcript id i.e. the systematic orf name (obtained from [10]) is used for genes with no name.
We obtain a data frame containing lists of S. cerevisiae genes, probes and transcripts (using the function Extrac-tIDs() in the Additional file 1) as follows >#Read in the mask file > s_cer = read. We also need to restrict the view of yeast.raw to the xand y-coordinates of the S. cerevisiae probesets in the cdf environment by using >#Get the raw dataset for S. cerevisiae only >library(affy) >library(yeast2probe) >source('RemoveProbes.R') > cleancdf = cleancdfname(yeast.raw@cdfName) > RemoveProbes(probe_filter, cleancdf, 'yeast2probe') Note that the commands in RemoveProbes.R are listed in the Additional file 1. Thus the attributes of yeast.raw, obtained via print(yeast.raw), are now AffyBatch object size of arrays = 496 × 496 features(3167 kb) cdf = Yeast_2(5900 affyids) number of samples = 30 number of genes = 5900 annotation = yeast2 and the number of genes (actually probesets here) is 5,900 now that the S. pombe probesets have been removed.

Data Quality Assessment
Before any formal statistical analysis, it is important to check for data quality. Initially, we might examine the perfect and mismatch probe-level data to detect anomalies. Images of the first five arrays can be obtained using > op = par(mfrow = c(3, 2)) >for ( Figure S2. Data quality can be assessed by examining such images for anything that appears non-random such as rings, shadows, lines and strong variations in shade. The images for our data set do not appear to have any non-random structure and so data quality is probably high. Another useful quality assessment tool is to examine density plots of the probe intensities. The command > d = exp_fac$data_order [1:5] >hist(yeast.raw[, d], lwd = 2, ylab = 'Density', xlab = 'Log (base 2) intensities') produces the image shown in the Additional file 1: Figure  S3. Typically, differences in spread and position are corrected by normalisation. However, the appearance of significant multi-modality in the distribution or many outlying observations are indicative of poor data quality.
Other exploratory data analysis techniques that should be carried include MAplots, where two microarrays are compared and their log intensity difference for each probe on each gene are plotted against their average. Also of interest is to examine RNA degradation (see [6]), although [11] cast some doubt over the validity of this method. For details on how to carry out both of these methods in R, see [12,13] for detailed instructions.

Normalising Microarray Data
There are number of methods for normalising microarray data. Two of the most popular methods are GeneChip RMA (GCRMA) and Robust Multiple-array Average (RMA); see [14,15]. Essentially, GCRMA and RMA differ in how they deal with background noise, with GCRMA using a more sophisticated correction algorithm. However, the approach adopted by GCRMA means that it can be time-consuming to use with large data sets in contrast to RMA. A potential drawback of using RMA is that it assumes that the overall levels of expression are similar for each array. However this assumption may be invalid if, for example, mutant cells have a radically different level of transcriptional activity than the WT. For further information regarding normalising microarray data sets, see for example [16,17].
Since we have thirty microarray data sets and believe that the levels of transcriptional activity are similar across strains, we will use the RMA normalisation method. This technique normalises across the set of hybridizations at the probe level. The data can be normalised via > yeast.rma = rma(yeast.raw) > yeast.matrix = exprs(yeast.rma) [, exp_fac$data_order] > cnames = paste(exp_fac$strain, exp_fac$tps, sep = ' ') >colnames(yeast.matrix) = cnames > exp_fac$data_order = 1:30 The normalisation procedure consists of three steps: model-based background correction, quantile normalisation and robust averaging. The aim of the quantile normalisation is to make the distribution of probe intensities for each array in a set of arrays the same. We illustrate its effect by studying boxplots of the raw S. cerevisiae data against their normalised counterparts values, shown in the Additional file 1: Figure S4. Boxplots provide a useful graphical view of data distributions and contain their median, quartiles, maximum and minimum values. The boxplot command is in the affyPLM package and so the figure is produced by using >library(affyPLM) >par(mfrow = c(1, 2)) >#Raw data intensities >boxplot(yeast.raw, col = 'red', main="") >#Normalised intensities >boxplot(yeast.rma, col ='blue')

Principal Component Analysis
Principal component analysis (PCA) is useful in exploratory data analysis as it can reduce the number of variables to consider whilst still retaining much of the variability in the data. In particular, PCA is useful for identifying patterns in the data. Essentially, principal components partition the data into orthogonal linear components which explain different contributions to the variability in the data. The first component explains the largest contribution to variability in the original dataset, that is, retains most information, with the second component explaining the next largest contribution to variability, and so on. The following commands calculate the principal components > yeast.PC = prcomp(t(yeast.matrix)) > yeast.scores = predict(yeast.PC) which we can then plot using >#Plot of the first two principal components >plot(yeast. Identifying differentially expressed genes In this experiment, interest lies in differences in gene expression over time between the wild-type and mutant yeast strains. It is expected that the wild-type expression level is independent of time. Also we anticipate that the mutant expressions at time t = 0 are the same as the wildtype expression level. This hypothesis is supported by the PCA plot in Figure 1.
There are currently two main packages available to detect differentially expressed genes using this kind of data: the timecourse package and the limma package. We illustrate how to analyse these data using both packages.

Using the timecourse package
This package assesses treatment differences by comparing time-course mean profiles allowing for variability both within and between time points. It uses the multivariate empirical Bayes model proposed by [18].
Further details of the timecourse package can be found in [19]. After installing the timecourse library, we construct a size matrix describing the replication structure using >library(timecourse) > size = matrix ( The expression profiles can also be easily obtained. The profile for the top ranked expression is found using > plotProfile(MB.2D, ranking = 1, gnames = rownames(yeast.matrix)) and is shown in the Additional file 1: Figure S5.

Using the limma package
The limma package uses the moderated t-statistic described by [7,20]. The function lmFit within the limma library fits a linear model for each gene for a given series of arrays, where the coefficients of the fitted models describe the differences between the RNA sources hybridised to the arrays. Precisely, we fit the model E [y g ] = Xα g , where y g = (y g,1 , ..., y g, n ) T contains the expression values for gene g across the n arrays, X is a design matrix which describes key features of the experimental design used and α g is the coefficient vector for gene g. In the analysis studied here, the yeast data consists of data from n = 30 arrays. The entries in the columns of X depend on the experimental design used: there are two yeast strains (mutant and wild type), each measured at five separate time points, and we are interested in comparing the gene expressions between mutant and wild type strains over time. Thus we seek a lin- ear model describing the ten strain × time combinations by determining values for the ten coefficients in the coefficient vector α g . We will label these ten coefficients as ('m0', 'm60, 'm120', 'm180', 'm240', 'w0', 'w60', 'w120', 'w180', 'w240'), where the first five coefficients represent the levels of the mutant strain at time points t = 0, 1, 2, 3, 4 and the remaining five coefficients are the equivalent versions for the wild type strain. Statistically speaking, the model has a single factor with ten levels. The design matrix X links these factors to the data in the arrays by having zero entries except when an array contributes an observation to a particular strain × time combination. For example, array 26 measures the expression of the first wild type microarray at time t = 0 and so contributes an observation to level 'w0', the sixth strain × time combination. Thus the entry in row 26, column 6 of the design matrix X(26, 6) = 1. Further, the arrays are arranged in groups of three replicates. Thus the overall experimental structure (expt_structure below) has three arrays on level 'm0', then three arrays on 'm60', and so on. Setting up the factor levels and the design matrix is done in R by using >library(limma) > expt_structure = factor(colnames(yeast.matrix)) >#Construct the design matrix > X = model.matrix(~0 + expt_structure) >colnames(X) = c('m0', 'm60', 'm120', 'm180', 'm240', 'w0', 'w60', 'w120', 'w180', 'w240') and then the coefficient vector α g is estimated via the command >lm.fit = lmFit(yeast.matrix, X) Determining the differentially expressed genes amounts to studying contrasts of the various strain × time levels, as described by a contrast matrix C. For these data, we are mainly interested in differences at the later time points, and so a possible set of contrasts to investigate is that of differences between the mutant and wild type strains at each time point, that is, ('m60-w60', 'm120-w120', 'm180-w180', 'm240-w240'). The limma package allows complete flexibility over the choice of contrasts, however this necessarily includes an additional level of complexity. The values in the coefficient vector of contrasts, β g = C T α g for gene g, describe the size of the difference between strains at each time point. The relevant R commands are > mc = makeContrasts('m60-w60', 'm120-w120', 'm180-w180', 'm240-w240', levels = X) > c.fit = contrasts.fit(lm.fit, mc) > eb = eBayes(c.fit) The final command uses the eBayes function to produce moderated t-statistics which assess whether individual contrast values β gj are plausibly zero, corresponding to no signifficant evidence of a difference between strains at time point j. The moderated t-statistic is constructed using a shrinkage approach and so is not as sensitive as the standard t-statistic to small sample sizes. It also gives a moderated Fstatistic which can be used to test whether all contrasts are zero simultaneously, that is, whether there is no difference between strains at all time points.

Ranking differentially expressed genes
There are a number of ways to rank the differentially expressed genes. For example, they can be ranked according to their log-fold change >#see help(toptable) for more options > toptable(eb, sort.by = 'logFC') or by using F-statistics > topTableF(eb) The advantage of using F-statistics over the log fold change is that the F-statistic takes into account variability and reproducibility, in addition to fold-change.
Our analysis is based on a large number of statistical tests, and so we must correct for this multiple testing. In our example we use the (very) conservative Bonferroni correction since we have a large number of differentially expressed genes and the resulting corrected list is still long. Another common method of correcting for multiple testing is to use the false discovery rate (fdr) (use the command ?p.adjust to obtain further details). The following commands rank genes according to their (corrected) F-statistic p-value and annotates the output by indicating the direction of the change for each contrast for each gene: +1 for up-regulated expression (mutant type having higher expression than wild type at a particular time point), -1 for down-regulated expression and 0 for no significant change.
> The following code (adapted from lecture material found at [13]) plots the time course expression for the top one hundred differentially expressed genes according to their Fstatistic (see Figure 2).  0, 60, 120, 180, 240), exprs.row [(5 * j-4):(5 * j)], type = 'b', pch = pch_value) + } + } When interpreting rank orderings based on statistical significance, it is important to bear in mind that a statistically significant differential expression is not always biologically meaningful. For example, Figure 2 contains RNR2. This gene is highly significant because of low variation in its time course. However the actual difference in expression levels between wild-type and mutant stains is relatively small. We address this problem in the next section.

Comparison of the timecourse and limma packages
Both packages have different strengths. One advantage of the timecourse package over the limma package is that it allows for correlation between repeated measurements on the same experimental unit, thereby reducing false positives and false negatives; these false positives/negatives are a significant problem when the variance-covariance matrix is The result is a moderately large overlap of fifty-three probesets. We note that changing the ranking method in the limma package also yields similar results as those given by the timecourse library.

Two fold-change list
When looking for "interesting" genes it can be helpful to restrict attention to those differential expressed that are both statistically significant and of biological interest. This objective can be achieved by considering only significant genes which show, say, at least a two-fold change in their expression level. This gene list is obtained using the following code (adapted from [12]

Cluster Analysis
Biological insight can be gained by determining groups of differentially expressed genes, that is, groups of genes which increase or decrease simultaneously. This can be achieved by using cluster analysis.

Traditional cluster analysis
In this section, we separate the top fifty differentially expressed genes into groups of similar pattern (clusters   Figure 4 shows the relative expression levels for the mutant strain at each time point ('0', '60', '120', '180', '240'). As expected, the relative expression levels at time t = 0 are very similar. However, as time progresses, groupings of genes appear whose levels are up-regulated (red) or downregulated (green). Note that the intensity of the colour corresponds to the magnitude of the relative expression. Gene names appear on the right side of the figure and on the left side, the cluster dendrogram shows which genes have similar expression. The dendrogram suggests that there are perhaps six to ten clusters.

Soft clustering
Soft clustering methods have the advantage that a probe can be assigned to more than one cluster. Furthermore, it is pos-sible to grade cluster membership within particular groupings. Soft clustering is considered more robust when dealing with noisy data; for more details see [21,22]. The Mfuzz package implements soft clustering using a fuzzy cmeans algorithm. Analysing the data for c = 8 clusters is achieved by using >library(Mfuzz) > tmp_expr = new('ExpressionSet', exprs = m) > cl = mfuzz(tmp_expr, c = 8, m = 1.25) > mfuzz.plot(tmp_expr, cl = cl, mfrow = c (2, 4), new.window = FALSE) Of course, it is usually not clear how many clusters there are (or should be) within a dataset and so the sensitivity of conclusions to the choice of number of clusters (c) should always be investigated. For example, if c is chosen to be too large then some clusters will appear sparse and this might suggest choosing a smaller value of c. Figure 5 shows the profiles of the eight clusters obtained from the Mfuzz package. The probes present within each cluster can be found by using > cluster = 1 > cl [ [4]][, cluster]

Conclusion
The response to telomere uncapping in cdc13-1 strains was expected to share features in common with responses to cell cycle progression, environmental stress, DNA damage and other types of telomere damage. The statistical analysis determined lists of probesets associated with genes involved in all of these processes. The techniques used focussed on making best use of the temporal information in time-course data. The use of cdc13-1 strains, which uncap telomeres quickly and synchronously, also allowed the identification of genes involved in the acute response to telomere damage. This case study has demonstrated the power of R/Bioconductor to analyse time-course microarray data. Whilst the statistical analysis of such data is still an active research area, this paper presents some of the cutting-edge tools that are available to the life science community. All software discussed in this article is free, with many of the packages being open-source and subject to on-going development.