The Discordant model is adapted from Lai et al algorithm to determine concordance between microarrays [20, 21]. The model has been previously published [5].
The R package for the Discordant method was designed for user flexibility. It is separated into two main functions: createVectors and discordantRun. The flow of these functions are illustrated in Fig. 1 and described further below.
–Omics data as mandatory input for each function
Each function requires either one or two –omics dataset(s) as argument(s). If only one –omics dataset is used as an argument, the functions undergo a within –omics analysis, where pairs of the same feature are evaluated (e.g. gene–gene, metabolite–metabolite). If two –omics datasets are used, the functions undergo a paired –omics analysis, where pairs of different features are evaluated (e.g., gene-miRNA, gene-metabolite).
createVectors
The purpose of createVectors is to determine correlation coefficients. The inputs are the –omics dataset(s) and a group vector that indicates samples for groups 1 and 2 (Fig. 1a). An optional parameter called cor.method is available where the user can change the type of correlation (Fig. 1b). The options are Pearson, Spearman, BWMC and SparCC (described in detail below). The default value for cor.method is Pearson. The output are two correlation vectors corresponding to the biological groups (Fig. 1c).
Correlation metrics
Four correlation metrics are applied to sequencing data: Pearson, Spearman, Biweight midcorrelation (BWMC) and SparCC. The commonly used Pearson correlation assumes that both data vectors are normally distributed, and is optimal for identifying linear relationships. Spearman correlation is rank-based and a non-parametric alternative that can handle non-normal data and capture monotonic and linear relationships. BWMC is much like Pearson’s correlation, except it is median-based rather than mean-based. It is considered more robust than Pearson’s correlation method since it minimizes the effect of outliers on the final correlation [22]. SparCC (Sparse Compositional Correlation) correlations are approximated based on the dispersion of the data and the assumption that most feature pairs will have no correlation [15].
discordantRun
The arguments necessary for discordantRun are the correlation vectors and –omics dataset(s) (Fig. 1c). The correlation vectors are input for the Discordant algorithm, and the –omics dataset(s) are used to communicate whether a single –omics or double –omics analysis is requested. The default value for transform is TRUE, which transforms correlation vectors into z scores using Fisher’s transformation. This option was inserted into discordantRun instead of createVectors so users could generate correlation vectors independent of the R package.
Optional arguments are available in order to use the 5-component mixture model or subsampling (Fig. 1d). The outputs are a posterior probability matrix and class matrix (Fig. 1e). For single –omics analyses, results in the output matrices are below the diagonal of the matrix and NAs are above the diagonal in order to avoid duplicating information. The posterior probability output is the differential correlation posterior probabilities (the sum of the off-diagonal in the class matrix in Additional file 1: Figure S1d).
There are several outputs for discordantRun: discordPPMatrix, discordPPVector, classMatrix, classVector, probMatrix, and loglik. The outputs discordPPMatrix and discordPPVector contain the posterior probability of differential correlation. The rows of discordPPMatrix are the features in –omics x and the columns are the features in –omics y (or x if within –omics analysis is being used). The class that had the highest probability for each feature pair are contained in classMatrix and classVector. The formatting of classMatrix is similar to discordPPMatrix. All complete information is in probMatrix, where each row represents a feature pair and nine columns represent the class within the class matrix. The log likelihood of the data fitting the model is in the argument loglik.
Comparison to DiffCorr
The analysis of sequencing data by Discordant was compared to DiffCorr, an R package that uses Fisher’s method to determine differential correlation [7]. Correlation vectors were determined using createVectors() and then DiffCorr’s function compcorr() was used to calculate p values. Simulations and biological validations were used to assess performance.
5-Component normal mixture model
In the simplest model [5], a three component mixture model is used to define whether correlations are not present (0), are positive (+) or are negative (−). We offer an option which increases the number of components to 5, adding correlations that are very positive (++) or very negative (−). This increases the parameter size from 21 to 35 and the number of classes from 9 to 25 (Additional file 1: Figure S2). To run the Discordant algorithm with 5 components instead of 3, set parameter components to 5. The default value for components is 3.
Subsampling
Like other methods [6], the Discordant model makes a false assumption that molecular feature pairs are independent of each other, but features are in multiple pairs which violates the independence assumption. A subsampling option is included to address the assumption and also cut down run-time. Subsampling will run if the argument subsampling is set to TRUE. By default, the EM algorithm updates parameter estimates across all molecular feature pairs until the EM algorithm converges [23]. With the subsampling option, a subsample of correlation coefficients independent of each other are sampled to run the EM algorithm. This is repeated for a number of iterations (default is 100), and the parameters of each mixture component from each iteration are summarized by their mean (Additional file 1: Figure S3 a–e). Once the summarized parameters of the mixture components are determined, the posterior probabilities of all molecular features are determined (Additional file 1: Figure S3 f).
TCGA breast cancer data
From the Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov), we accessed miRNA-Seq and RNA-Seq breast cancer data with matched subjects. Of these, there are 15 samples with normal (or control) tissue and 42 samples with tumor tissue. This dataset was selected because it had the largest sample size of control and tumor groups in TCGA that had matched samples with miRNA-Seq and RNA-Seq data. Both datasets were pre-processed and normalized using HTSeq filtering and TMM normalization [24, 25]. For Peason’s correlation and BWMC [9] the data was also transformed using voom from the limma R package to create continuous values [26], otherwise the methods (Spearman and SparCC) were applied to normalized count data. The number of features remaining were 212 miRNA and 19414 mRNA.
Features were further filtered by the presence of outliers using the median absolute deviation (MAD) outlier method [27]. Even after pre-processing and normalization, the distribution of sequencing data still is asymmetrical, where there is large density around zero and long tails to the right. To determine outliers, the values for each feature are split by being greater or less than the median. The two sets of features are tested for outliers by the difference they have with their respective MAD [28]. The maximum distance of all features from the upper or lower MAD is used to determine if the feature has an outlier. The standard threshold is two or three times the MAD outside the median [27], but since we found that the variation in the sequencing data was more extreme we used larger thresholds. For the transformed sequencing data, a threshold of 7 is used to retain features but still filter out those that were most problematic. Non-transformed data has even larger dispersions, so a threshold of 20 was used. The number of features after filtering for outliers is 16,656 for RNA-Seq data and 200 for miRNA-Seq data for the voom-transformed data and 17,972 for RNA-Seq and 200 for miRNA-Seq for non-transformed data
. The code for this outlier method is included as a function called madOutlier in the R package with an option to change the threshold.
Experimentally validated breast cancer miRNA not involved in other well-researched cancers (prostate cancer, melanoma, glioblastoma multiforme) were used as a biological validation [29, 30]. A total of 8 unique breast cancer miRNAs were found to be in the TCGA breast cancer data. Since the result are in the form of molecular feature pairs, breast cancer miRNAs occur in more than one pair. Therefore, for comparison purposes we report the first occurrence of the miRNA in the pairs ranked by posterior probability.