Assessment of a novel multi-array normalization method based on spike-in control probes suitable for microRNA datasets with global decreases in expression

Background High-quality expression data are required to investigate the biological effects of microRNAs (miRNAs). The goal of this study was, first, to assess the quality of miRNA expression data based on microarray technologies and, second, to consolidate it by applying a novel normalization method. Because platform designs differ significantly, miRNA raw data cannot be normalized blindly with standard methods developed for gene expression. This fundamental observation motivated the development of a novel multi-array normalization method based on controllable assumptions, which uses the spike-in control probes to adjust the measured intensities across arrays.
Results Raw expression data were obtained with the Exiqon dual-channel miRCURY LNA™ platform in the “common reference design” and processed as “pseudo-single-channel”. They were used to apply several quality metrics based on the coefficient of variation and to test the novel spike-in controls based normalization method. Most of the considerations presented here could be applied to raw data obtained with other platforms. To assess the normalization method, it was compared with 13 other available approaches from both data quality and biological outcome perspectives. The results showed that the novel multi-array normalization method reduced the data variability in the most consistent way. Further, the reliability of the obtained differential expression values was confirmed by a quantitative reverse transcription–polymerase chain reaction experiment performed for a subset of miRNAs. The results reported here support the applicability of the novel normalization method, in particular to datasets that display global decreases in miRNA expression, as does the cigarette smoke-exposed mouse lung dataset considered in this study.
Conclusions Quality metrics to assess between-array variability were used to confirm that the novel spike-in controls based normalization method provided high-quality miRNA expression data suitable for reliable downstream analysis. The multi-array miRNA raw data normalization method was implemented in an R software package called ExiMiR and deposited in the Bioconductor repository.
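As a rough illustration of a between-array quality metric based on the coefficient of variation (CV), the following Python sketch computes the CV of each probe set across arrays and summarizes it by its median. The function names, the toy intensities and the choice of the median as a summary are assumptions made for this example; they are not taken from ExiMiR and do not reproduce the exact metrics used in the study.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean of the intensities."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("mean intensity is zero; CV is undefined")
    return statistics.stdev(values) / mean


def median_between_array_cv(intensity_by_probe):
    """Median over probe sets of the CV computed across arrays."""
    cvs = [coefficient_of_variation(v) for v in intensity_by_probe.values()]
    return statistics.median(cvs)


# Hypothetical intensities of two spike-in probe sets on three arrays,
# before and after some normalization (toy numbers, not study data).
raw = {"spike_1": [100.0, 200.0, 400.0], "spike_2": [50.0, 100.0, 200.0]}
normalized = {"spike_1": [190.0, 200.0, 210.0], "spike_2": [95.0, 100.0, 105.0]}

cv_raw = median_between_array_cv(raw)
cv_normalized = median_between_array_cv(normalized)
print(f"median CV raw: {cv_raw:.3f}, normalized: {cv_normalized:.3f}")
```

In this toy case a lower median CV after normalization indicates reduced between-array variability for probe sets whose intensities are expected to be constant.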

Figures S1A and S1B show the distributions of the log2-transformed raw signal for the lung dataset from the Hy3 Exiqon miRCURY LNA™ and Affymetrix GeneChip® miRNA arrays, respectively. In both cases, the distributions cover a range of approximately nine log2 units and peak strongly at low values: 75% of the Exiqon values lie in the lower ~22% of the intensity range, and 75% of the Affymetrix values lie in the lower ~11% of the intensity range. This suggests that a majority of mouse miRNAs were weakly expressed in these samples and that the raw signal distribution was broader on the Exiqon platform than on the Affymetrix platform, indicating a higher sensitivity of the Exiqon dual-channel hybridization. The miRNA probe set detection calls measure the difference between the detected probe intensities and the background noise. Although the detection calls are shown together with normalized expression values in Figures S1C and S1D, they depend only on the raw data (see "Methods"). Therefore, only one representative normalization method was needed for each platform: spike-in controls based normalization (SCN) for Exiqon and quantile normalization (AQN) for Affymetrix (see Table 1).
Figure S1C shows the overall distributions of the miRNA probe set normalized intensities, split according to the associated "absent" and "present" detection call values. The intensity distributions shared similar features in the data from the two arrays. One feature is a clear peak at the lower end of the detection range, which contained the "absent" miRNA probe sets. A second feature is that the distribution of the "present" probe sets was close to a straight line on the log-scaled intensity histogram, corresponding to a power law dependence on the absolute intensity scale. The slope of the Exiqon intensity data was slightly less steep than that of the Affymetrix data. Figure S1D shows an overall MA-plot between the SCN and AQN normalized datasets. Each common mouse miRNA probe set is colored according to its detection call values on each platform. The asymmetry between the two sides of the horizontal black line indicates that the two normalized datasets did not differ symmetrically: when the SCN intensity value was higher (upper part of the plot), the difference was greater than when the AQN intensity value was higher (lower part of the plot).
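The MA-plot comparison described above can be sketched in Python as follows. The sketch assumes that M is defined as the SCN log2 intensity minus the AQN log2 intensity, so that positive M values correspond to the upper part of the plot, and that A is the mean of the two log2 intensities. The probe names, the toy values and the helper functions are hypothetical.

```python
def ma_values(log2_scn, log2_aqn):
    """Return {probe: (M, A)} for probe sets present in both datasets.

    M = SCN - AQN (log2 difference), A = mean of the two log2 intensities.
    """
    common = sorted(set(log2_scn) & set(log2_aqn))
    return {
        p: (log2_scn[p] - log2_aqn[p], (log2_scn[p] + log2_aqn[p]) / 2.0)
        for p in common
    }


def mean_abs_m_by_side(ma):
    """Mean |M| above (SCN higher) and below (AQN higher) the M = 0 line."""
    upper = [m for m, _ in ma.values() if m > 0]
    lower = [-m for m, _ in ma.values() if m < 0]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(upper), mean(lower)


# Hypothetical normalized log2 intensities (toy numbers, not study data).
scn = {"probe_a": 10.0, "probe_b": 8.0, "probe_c": 6.0, "probe_d": 7.0}
aqn = {"probe_a": 7.0, "probe_b": 6.0, "probe_c": 6.5, "probe_e": 5.0}

ma = ma_values(scn, aqn)
upper_mean, lower_mean = mean_abs_m_by_side(ma)
print(f"mean |M| when SCN is higher: {upper_mean}, when AQN is higher: {lower_mean}")
```

A larger mean |M| on the upper side than on the lower side is one simple way to quantify the kind of asymmetry seen around the horizontal line of the plot.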
This feature also appeared in the detection call comparison: the extension of the blue points (SCN:present-AQN:absent, 10% of the cases) was greater than that of the red points (SCN:absent-AQN:present, 16% of the cases). This finding suggests that, on average, the "pseudo-single-channel" hybridization strategy used on the Exiqon platform allowed more signal to be detected than was detected on the single-channel Affymetrix platform.
The Pearson correlation between the spike-in control probe sets was too low (0.51). Furthermore, both the intensity specificity and coverage of the spike-in probe sets were insufficient (assumptions A2 and A3, respectively). This led to the conclusion that the Affymetrix lung raw data were not suitable for the application of the spike-in controls based normalization method.
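A correlation check of this kind can be sketched in Python as below. The sketch assumes that the reported value summarizes the pairwise Pearson correlations between spike-in probe set profiles across arrays by their mean; this summary, the function names and the toy profiles are assumptions for illustration, and no acceptance threshold is implied.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("constant sequence; correlation is undefined")
    return sxy / math.sqrt(sxx * syy)


def mean_pairwise_pearson(profiles):
    """Mean Pearson correlation over all pairs of probe set profiles."""
    pairs = list(combinations(sorted(profiles), 2))
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


# Hypothetical log2 intensities of three spike-in probe sets on four arrays.
spike_profiles = {
    "spike_1": [1.0, 2.0, 3.0, 4.0],
    "spike_2": [2.0, 4.0, 6.0, 8.0],
    "spike_3": [4.0, 3.0, 2.0, 1.0],
}

mean_r = mean_pairwise_pearson(spike_profiles)
print(f"mean pairwise Pearson correlation: {mean_r:.3f}")
```

Spike-in probe sets that track the same array-to-array effects should give a mean correlation close to 1; a low value, as in the toy data here, argues against using them to adjust intensities across arrays.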
Panel C: Relative CV for spike-in control probe sets
Figure S5. Differential miRNA expression for the lung samples and all preprocessing pipelines.
(A) Heat map for t-statistics obtained from the linear model for the treatment response of the expression values of each miRNA. The dendrogram is based on the Euclidean distance between the t-values obtained from the various preprocessing pipelines (described in Table 1). (B) Bar chart of the Spearman correlation coefficients between the differential expressions of the selected miRNAs obtained by RT-qPCR and those obtained from the preprocessing pipelines. The error bars are the 2.5th to 97.5th percentiles of the values obtained from a simple leave-one-out re-sampling approach.
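The quantities in panel (B) can be approximated with the following Python sketch, which computes a Spearman correlation and a leave-one-out percentile interval. It assumes that one miRNA is left out at a time and that percentiles are obtained by linear interpolation; the caption does not specify either detail. All function names and values are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def percentile(sorted_values, q):
    """Percentile q (0 to 100) by linear interpolation between order statistics."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def leave_one_out_interval(x, y, lower=2.5, upper=97.5):
    """Percentile interval of Spearman values with one item removed at a time."""
    estimates = sorted(
        spearman(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))
    )
    return percentile(estimates, lower), percentile(estimates, upper)


# Hypothetical log2 fold changes for five miRNAs (toy numbers, not study data).
qpcr_lfc = [-1.2, -0.8, -0.5, 0.1, 0.4]
array_lfc = [-1.0, -0.9, -0.2, -0.3, 0.6]

rho = spearman(qpcr_lfc, array_lfc)
interval = leave_one_out_interval(qpcr_lfc, array_lfc)
print(f"Spearman rho: {rho:.2f}, leave-one-out interval: {interval}")
```

With only a handful of miRNAs, the leave-one-out interval mainly shows how strongly the correlation depends on any single miRNA.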