Gene expression microarrays allow messenger RNA (mRNA) abundance to be quantified quickly and cost-effectively on a genome-wide scale. The production of mRNA is a key step in the process that leads from the information contained within DNA to the formation of the proteins that act within a cell. Quantifying the abundance of mRNA is therefore of interest because it provides much information regarding the state of the cell [1].
Microarrays for measuring mRNA expression make use of probes that hybridize to fluorescently-labelled sample material, where the measured level of fluorescence is used to infer the expression level of each interrogated gene. Traditionally they are constructed by attaching probes directly to a specific point on the array's surface. By contrast, the BeadArray expression platform developed by Illumina makes use of probes attached to beads that are subsequently randomly arranged on the array surface [2]. There are approximately 30 beads for each type of probe (a high degree of replication for a microarray), providing robustness against systematic spatial influences on the array.
For BeadArrays, the raw (bead-level) intensity information is stored in a proprietary format. Until recently, only summarized output (averaged values over the replicate beads on a given array) was available from Illumina's analysis software (BeadScan and BeadStudio). As such, most published studies make use of summarized Illumina data, a state that leaves the low-level (but vital) steps (e.g. image analysis, background correction, summarization) beyond the control of the data analyst.
The data from microarray experiments generally require transformation in order to facilitate simple analyses such as the confident fitting of basic linear models. Variance-stabilizing transformations are applied to microarray data in order to remove the mean-variance relationship in intensities. A log2 transformation is the simplest variance-stabilizing transformation commonly applied to microarray data. Other, more sophisticated approaches have been developed, such as the variance-stabilizing normalisation (VSN) method of Huber et al. [3] and that of Durbin et al. [4].
The VST method [5] is an adaptation of the VSN methodology for Illumina data, exploiting the replicate beads on the array and is defined for intensity x as
where c3 is defined as the variance of bead types that estimate background noise and c2 and c1 respectively represent additive and multiplicative levels of error in the intensity.
Using previously published data [6], VST is found to outperform the approach of log2 transformation, based on the results of a mixture experiment where each sample was a pool of blood and placenta at various ratios. However, the authors commented on the then lack of a publicly available spike-in experiment, a data set that would have provided an ideal test for their method.
Coinciding with the publication of VST, Dunning et al. [7] published an independent account of such a spike-in experiment using customized Mouse WG-6 BeadArrays. In addition to the approximately 48,000 probes (bead types) included as standard, the content of these chips was modified to include 33 probes targeting bacterial and viral genes absent from the mouse genome. These "spikes" were added at specific concentrations on each array, and hence the relative change in expression level of a particular spike between arrays is known a priori. The expression levels of the remaining probes ("non-spikes") should not change between arrays. Twelve different concentrations of spike were used (1000 pM, 300 pM, 100 pM, 30 pM, 10 pM, 3 pM, 1 pM, 0.3 pM, 0.1 pM 0.03 pM, 0.01 pM and 0 pM) and each was replicated four times. Control experiments such as this have proven useful for comparing low-level analysis methods in other microarray platforms, such as Affymetrix [8].
Access to the raw bead-level data allows some of the low-level analysis steps to be explored in greater detail [7]. In particular, it was shown that the local background correction and summarization steps carried out by BeadScan and BeadStudio reduce bias and produce robust summary measurements.
The "Background normalisation" method (BGN) available in BeadStudio adjusts the intensities on each array by subtracting the average expression level of the negative controls (probes that have no targets in the genome being studied) in order that arrays might have comparable baselines. In the analysis of the spike-in experiment, it was shown that BGN resulted in many negative values, and also in increased variability of intensity at low expression levels when combined with the standard log2 transformation. Concordant with previously published observations [6], it was concluded that BGN is not desirable.
It has also been shown that, by using the variances of each bead type as inverse weights, the performance of linear models intended to detect differentially expressed (DE) genes can be improved [7]. This approach is generally only possible if bead-level data are available and a log2 transformation applied prior to calculating bead type averages and variances. We shall refer to this approach as a weighted log2 analysis. Other advantages of having access to data at the bead-level were also shown. Naturally such data allow for detailed quality control and also for greater flexibility in the choice of statistical model.
In this paper, we apply VST to data from the spike-in experiment. This offers further validation of the VST method, not only because the estimation of differential expression can be objectively assessed, but also because the microarray used is different: the mixture data used to validate VST was from a HumanRef-8 BeadArray with some 22,000 probes, rather than the 48,000 MouseWG-6 BeadArray used in the spike experiment. By design, BeadArrays with 48,000 probes tend to have many more probes at low intensity [7] than the HumanRef-8 BeadArray that only contains probes taken from a curated database. Since 48,000 probe BeadArrays are more widely used, it is important to confirm that VST can be applied to these higher density arrays with no impairment due to the different distribution of intensities. Additionally, we will investigate whether VST can reduce some of the problems encountered when applying a standard log2 transformation after BGN.