RNA samples
We used two independent microRNA gene expression data sets. The first (dat1) comprises 8 samples obtained in our lab, and the second (dat2) contains 31 samples obtained from the GEO database [9]. The dat1 set comprises the microRNA gene expression profiles of bone marrow-derived human mesenchymal stem cells (hMSCs) and human dermal fibroblasts, obtained from 4 independent donors for each tissue. Total RNA was isolated using the miRNeasy kit (Qiagen). Each RNA sample (100 ng) was hybridized to an Agilent Human microRNA Microarray v2.0 (G4470B, Agilent Technologies). MicroRNA labeling, hybridization and washing were carried out following Agilent's instructions. Images of hybridized microarrays were acquired with a DNA microarray scanner (Agilent G2565BA), and features were extracted using the Agilent Feature Extraction (AFE) image analysis tool version A.9.5.3.1 with default protocols and settings [10]. The hMSC and dermal fibroblast data used in our study have been deposited in the GEO database [9] (accession number GSE19232), and the corresponding raw data can be retrieved from the supplementary file (GSM476577.txt.gz): MSC_rep1, Fib_rep1, MSC_rep2, Fib_rep2, MSC_rep3, Fib_rep3, MSC_rep4 and Fib_rep4.
Since the number of replicates in dat1 might be too low to provide compelling evidence, we also analyzed a larger data set (dat2), likewise hybridized to the Agilent Human microRNA Microarray v2.0 (G4470B, Agilent Technologies). The dat2 set was selected from the raw data deposited in the supplementary file of the GEO GSE16444 series. It is made up of 31 samples from stage 4 neuroblastoma patients: 17 from long survivors and 14 from short survivors. As with the hMSC and dermal fibroblast data, the slides used in the GSE16444 series were scanned with an Agilent G2565BA scanner according to the microRNA Microarray System protocol, and the raw data were obtained with the Agilent Feature Extraction software v. 9.5.3.1 (Agilent Technologies).
Agilent microRNA microarray
Agilent microRNA assays integrate eight individual microarrays on a single glass slide. Each microarray includes approximately 15,000 features containing probes sourced from the miRBase public database [11]. The probes are 60-mer oligonucleotides synthesized directly on the array. In this study we used the Human microRNA Microarray v2.0, which contains 723 human and 76 human viral microRNAs, each replicated 16 times. Of these, 362 microRNAs are interrogated by 2 different oligonucleotides, 45 by 3, and 390 by 4; only 2 microRNAs are interrogated by a single oligonucleotide. The array also contains a set of positive and negative controls that are replicated a varying number of times. Some of the positive control probes target non-microRNA human RNAs; each of these targets is interrogated with 4 different probes, each repeated 5 times. The signals from these positive controls can be bright or dim depending on the sample, and according to Agilent they do not behave consistently enough to be used for normalization.
Agilent total gene signal
The AFE algorithms estimate a single intensity measure for each microRNA, referred to as the total gene signal (TGS). The AFE-TGS is estimated by multiplying the total probe signal by the number of probes per gene. The total probe signal is the robust average of the background-subtracted signals of all replicates of a probe, multiplied by the total number of probe replicates. The background signal is usually the sum of the median local background signal and the spatial detrending surface value computed by AFE, which estimates the noise due to a systematic gradient on the array.
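As an illustration of the total probe signal described above, the following Python sketch computes it for one probe from its replicate intensities. It is not the AFE implementation: the text does not specify which robust average AFE uses, so the median is used here as a stand-in, and the function names and example values are hypothetical.

```python
from statistics import median


def background_signal(median_local_background, detrend_surface_value):
    """Background as described in the text: local background plus detrending surface."""
    return median_local_background + detrend_surface_value


def total_probe_signal(raw_signals, backgrounds):
    """Robust average of background-subtracted replicates times the replicate count.

    ASSUMPTION: the median stands in for AFE's (unspecified) robust average.
    """
    subtracted = [s - b for s, b in zip(raw_signals, backgrounds)]
    robust_average = median(subtracted)
    return robust_average * len(subtracted)


# Hypothetical replicate intensities for a single probe; one replicate is an outlier.
bg = background_signal(8.0, 2.0)
signal = total_probe_signal([110.0, 120.0, 130.0, 500.0], [bg] * 4)
```

Because a robust average is used, the single outlying replicate in this example has little effect on the result.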
Signal processing
All the methods used in the study were implemented in R [12] using functions and packages from the Bioconductor project [13] as well as custom-written routines. Agilent microRNA microarrays interrogate each microRNA with multiple probes. Statistical inference requires a processed signal, that is, an estimate of the expression measure for every microRNA that can be normalized between arrays. We considered 4 processed signals: a) the AFE-TGS normalized to the 75th percentile (nor75); b) the AFE-TGS normalized by the quantile method (norQ); c) the adapted RMA algorithm using a background-corrected signal based on the exponential-normal convolution model [6] (norRMAbg); and d) the RMA method without background correction (norRMA).
Negative values in the AFE-TGS were converted into positive signals by adding the quantity |min(AFE-TGS)| + 2 before log transformation. The processed nor75 signal was obtained for every array by dividing the AFE-TGS by the 75th percentile of the signal for that array, which guarantees that the adjusted signals all have a 75th percentile equal to 1. The 75th percentile was used rather than other statistics such as the mean in order to diminish the possible influence of outliers. The median could be used instead, but if we assume that about half of the genes show no significant expression, the 75th percentile represents the median of the remaining 50% that are expressed. The norQ signal was obtained with the normalizeBetweenArrays function from the Bioconductor limma package [14].
The norRMAbg and norRMA signals estimate the expression of a given microRNA from all the probe measures for that microRNA. The RMA algorithm was applied in the following sequential steps. For norRMAbg only, the raw mean signal was first background corrected with the exponential-normal convolution model [6], using the rma.background.correct function of the Bioconductor preprocessCore package [15].
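The two AFE-TGS normalizations described above (nor75 and norQ) can be sketched in plain Python as follows. This is an illustration only, not the R code used in the study. Several details are assumptions: the function names are hypothetical, the offset is applied per array and only when a negative value is present, the 75th percentile uses linear interpolation, and ties in the quantile normalization are broken by position rather than handled as limma does.

```python
from statistics import quantiles


def shift_positive(values):
    """Add |min| + 2 so that all values are positive before log transformation.

    ASSUMPTION: applied per array, and only if the array contains a negative value.
    """
    lowest = min(values)
    if lowest >= 0:
        return list(values)
    offset = abs(lowest) + 2
    return [v + offset for v in values]


def percentile75(values):
    """75th percentile with linear interpolation (ASSUMPTION about the definition)."""
    return quantiles(values, n=4, method="inclusive")[2]


def normalize_to_75th(values):
    """nor75-style scaling: divide one array's signals by its own 75th percentile."""
    q75 = percentile75(values)
    return [v / q75 for v in values]


def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays.

    ASSUMPTION: all arrays have equal length; ties are broken by position.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(col) / len(arrays) for col in zip(*sorted_arrays)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices from lowest to highest
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result.append(normalized)
    return result


# Hypothetical toy data.
scaled = normalize_to_75th([1.0, 2.0, 3.0, 4.0, 5.0])
qn = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After scaling, the 75th percentile of each array is 1. After quantile normalization, every array contains the same set of values, arranged according to its own ranks.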
The norRMAbg and norRMA signals were then normalized between arrays by quantile normalization using the normalizeBetweenArrays function [13]. The signals were log2-transformed, and the median of the replicated probes was obtained, normally yielding 2, 3 or 4 different measures (probe-level data) for each microRNA; these measures were summarized into a single microRNA measure with the rma_c_complete_copy function of the affy package [16]. For each microRNA, RMA estimates a single signal by fitting a linear model that takes the probe effect into account. The estimates in the linear model are obtained using the median polish algorithm.
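The summarization step can be illustrated with a minimal Python sketch of Tukey's median polish applied to the log2 probe-level data of one microRNA (rows are probes, columns are arrays). It follows the textbook form of the algorithm and is not the code behind rma_c_complete_copy. The function names, the iteration limit and the tolerance are assumptions made for this example.

```python
from statistics import median


def median_polish(matrix, max_iter=10, eps=0.01):
    """Fit value = overall + probe effect + array effect + residual by median polish."""
    z = [row[:] for row in matrix]            # residuals, updated in place of a copy
    overall = 0.0
    row_eff = [0.0] * len(z)                  # probe effects
    col_eff = [0.0] * len(z[0])               # array effects
    old_sum = 0.0
    for _ in range(max_iter):
        # Sweep row medians out of the residuals.
        row_delta = [median(r) for r in z]
        z = [[v - d for v in r] for r, d in zip(z, row_delta)]
        row_eff = [e + d for e, d in zip(row_eff, row_delta)]
        delta = median(col_eff)
        col_eff = [c - delta for c in col_eff]
        overall += delta
        # Sweep column medians out of the residuals.
        col_delta = [median(c) for c in zip(*z)]
        z = [[v - d for v, d in zip(r, col_delta)] for r in z]
        col_eff = [e + d for e, d in zip(col_eff, col_delta)]
        delta = median(row_eff)
        row_eff = [r - delta for r in row_eff]
        overall += delta
        # Stop when the sum of absolute residuals no longer changes appreciably.
        new_sum = sum(abs(v) for r in z for v in r)
        if new_sum == 0 or abs(new_sum - old_sum) < eps * new_sum:
            break
        old_sum = new_sum
    return overall, row_eff, col_eff, z


def summarize_mirna(probe_matrix):
    """One expression value per array: overall level plus the array effect."""
    overall, _, col_eff, _ = median_polish(probe_matrix)
    return [overall + c for c in col_eff]


# Hypothetical log2 data: 3 probes (rows) measured on 3 arrays (columns).
probes = [
    [6.5, 7.0, 7.5],
    [7.5, 8.0, 8.5],
    [8.5, 9.0, 9.5],
]
expression = summarize_mirna(probes)
```

In this toy example the data are exactly additive, so the probe effects are removed completely and the summarized values reflect only the differences between arrays.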