Background
Mass spectrometry (MS) data are often generated from various biological or chemical experiments. Such vast data is usually analyzed automatically in a computer process consisting of preprocessing, significance test, classification, and clustering. Elaborate preprocessing is essential for successful analysis with reliable results. One preprocessing step is required to detect outliers, which which are extreme due to technical reasons. The plausible outlying observations detected can be examined carefully, and then corrected or eliminated if necessary. However, as the manual examination of all observations for outlier detection is timeconsuming, plausible outlying observations must be detected automatically.
Identification of statistical outliers is the subject of some controversy in statistics[1]. Several outlier detection algorithms have been proposed for univariate data, including Grubbs’ test[2] and Dixon’s Q test[3]. These tests were designed to analyze data under the normality assumption, so that they may produce unreliable outcomes in the case of few replicates. Furthermore, they are not applicable for duplicated samples. Another naive approach to detect outliers statistically constructs lower and upper fences of differences between two samples, Q_{1}  1.5 IQR and Q_{3} + 1.5 IQR, where Q_{1} is the lower 25% quantile, Q_{3} is the upper 25% quantile, and IQR = Q_{3}  Q_{1}. They are claimed to be outliers if they are smaller than the lower fence or larger than the upper fence. However, this may generate a spurious result because variability is heterogeneous in highthroughput data even generated from MS experiments.
Figure1 shows the logscale scatter plot of the technically duplicated samples under the same biological condition from a MS experiment. The variability differs according to the intensity levels in the plot, so that the naive outlier detection method, ignoring the heterogeneity of variability, may often miss true outliers at high levels and select false outliers at low levels. If a number of technical replicates for each peptide under the same biological condition can be obtained in MS experiments, the examination of outliers can be conducted for each peptide. However, a small number of replicates is usually conducted for MS experiments due to the high cost of experiments and the limited supply of biological samples.
Cho et al.[4] proposed a more elaborate approach for detecting outliers with low false positive and negative rates in MS data to solve the problem when the number of technical replicates is two. The algorithm was developed by utilizing quantile regression for duplicate MS experiments. The R package (called OutlierD) that was also developed can only be used for duplicate experiments. Therefore, we here propose a new outlier detection algorithm for multiple highthroughput experiments, particularly those with few, but more than two replicates.
Classical Approaches
Suppose that there are n replicated samples and p peptides in MS data. Then let x_{
ij
} be the i th replicated sample from experiments under the same biological or experimental condition, where i = 1,…,n and j = 1,…,p. For convenience, let{y}_{\mathit{\text{ij}}}=\underset{2}{\text{log}}\left({x}_{\mathit{\text{ij}}}\right). Typically, n is small and p is very large in highthroughput data, i.e., p >> n. In addition, let y_{(1)j} ≤ y_{(2)j}≤⋯≤ y_{(n)j} be ordered samples for peptide j, where{y}_{\left(1\right)j}=\underset{1\le i\le n}{\text{min}}{y}_{\mathit{\text{ij}}} and{y}_{\left(n\right)j}=\underset{1\le i\le n}{\text{max}}{y}_{\mathit{\text{ij}}}, the smallest and the largest observations, respectively.
Outliers are often detected by the classical approaches such as Dixon’s Range Test and Grubbs test. Dixon’s Range Test, also known as Dixon’s Qtest[3], utilizes order statistics as follows.
{Q}_{j}=\frac{({y}_{\left(2\right)j}{y}_{\left(1\right)j})}{({y}_{\left(n\right)j}{y}_{\left(1\right)j})}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\text{or}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\frac{({y}_{\left(n\right)j}{y}_{(n1)j})}{({y}_{\left(n\right)j}{y}_{\left(1\right)j})}.
(1)
The denominator is the difference between the largest and smallest observations and the numerator is the difference between the smallest two values or the largest two values. If the test statistic Q_{
j
} is smaller than the critical value given by Rorabacher[5], peptide j is flagged as an outlier. If n = 2, the statistic is always 1; thus, this test is applicable for n ≥ 3.
Grubbs’ test[2, 6] also utilizes order statistics and its test statistic is defined as follows.
{T}_{\mathit{\text{nj}}}=\frac{({y}_{\left(n\right)j}{\stackrel{\u0304}{y}}_{\xb7j})}{{s}_{j}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\text{and}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{T}_{1j}=\frac{({\stackrel{\u0304}{y}}_{\xb7j}{y}_{\left(1\right)j})}{{s}_{j}},
(2)
where{\stackrel{\u0304}{y}}_{\xb7j} is the sample mean and s_{
j
} the standard deviation for peptide j. The denominator is the standard deviation and the numerator is the difference between the smallest (or largest) value and the sample mean. If T_{
nj
} or T_{1j} is smaller than the critical value, peptide j is flagged as an outlier. If n = 2, the statistic is always1/\sqrt{2}; thus, this test is also applicable for n ≥ 3.
Proposed Methods
In duplicated experiments (n = 2), two observed values, x_{1j} and x_{2j} for each j, should be theoretically identical, but are not identical in practice due to their variability. Even though they are not identical, they should not differ substantially. The tolerance of the difference between the two observed values from the same condition is not constant because their variability is heterogeneous. The variability of highthroughput data depends on intensity levels.
Cho et al.[4] proposed the construction of lower and upper fences using quantile regression in an MA plot with M and A values in vertical and horizontal axes, respectively, where M_{
j
} is the difference between replicated samples for j and A_{
j
} is the average, i.e.{M}_{j}={y}_{1j}{y}_{2j}=\underset{2}{\text{log}}({x}_{1j}/{x}_{2j}) and{A}_{j}=({y}_{1j}+{y}_{2j})/2=(1/2)\underset{2}{\text{log}}\left({x}_{1j}{x}_{2j}\right) to detect the outliers accounting for the heterogeneity of variability.
In multiple experiments (n ≥ 2), it is natural to investigate outliers based on all observed values in a highdimensional space. An outlier will be a very large distance from the center of the distribution of a peptide. The cutoffs of distances for classification of outliers depend on the degree of variability from the center. The degree of variability is dependent on intensity levels and the center can be defined as the 45° line from the origin. More flexibly, the center can be obtained by principal component analysis (PCA), as seen in Figure2. The first principal component (PC) becomes the center of each intensity level, i.e., a new axis for intensity levels. The experiments are replicated under the same biological and technical condition; hence, most variation can be explained by the first PC. It implies that it is enough to use the first PC practically. An outlier will have a large distance from its projection. Following the notations for applying quantile regression, we can define the distance of peptide j to the projection as M_{
j
} and the length of the projection on the new axis as A_{
j
}. Then the first and third quantiles can be obtained by applying quantile regression on an MA plot with M and A in the vertical and horizontal axes, repectively; hence, the upper and lower fences can be constructed to classify the outliers.
Describing this projection approach in more detail, we first subtract the sample mean of each sample from each observation to shift the sample mean to the origin because the PC go through the sample means. The first PC vector v can be found on the new sample space from{y}_{1}^{\ast},\dots ,{y}_{n}^{\ast} and the projection of each peptide on the vector v can be obtained. Then, we can calculate the length of the projection,\left{{\mathbf{y}}_{\mathbf{j}}^{\ast}}^{\prime}\mathbf{v}\right/\sqrt{{\mathbf{v}}^{\prime}\mathbf{v}}, and the length of the difference between a vector of peptide j and the projection,{\mathbf{y}}_{{\mathbf{j}}^{\ast}}({{y}_{j}^{\ast}}^{\prime}\mathbf{v}/{\mathbf{v}}^{\prime}\mathbf{v}\left)\mathbf{v}\right. The length of the projection is multiplied by the sign of{{y}_{j}^{\ast}}^{\prime}\mathbf{v} to distinguish the positive and negative directions. The signed length of the project and the length of the difference are defined as A_{
j
} and M_{
j
} of peptide j, respectively. Outlying peptides will have unduly large M values. Judging whether it is undue or not depends on A_{
j
} because the variability of M values is heterogeneous. Like OutlierD, we obtain first and third quantiles, Q_{1} and Q_{1}, depending on intensity levels, and then construct the upper and lower fences to classify outliers from normal observations. Quantile regression[7] is utilized on an MA plot to obtain the first and third quantile estimates, Q_{1}(A) and Q_{3}(A), respectively, depending on the intensity levels A. The qquantile linear quantile regression with {(A_{
j
}M_{
j
}),j = 1,…,p} is used to find the parameters minimizing
\begin{array}{l}\sum _{\{j:{M}_{j}\ge g({A}_{j};{\theta}_{0},{\theta}_{1}\left)\right\}}q{M}_{j}g({A}_{j};{\theta}_{0},{\theta}_{1}\left)\right\phantom{\rule{1em}{0ex}}+\phantom{\rule{2em}{0ex}}\\ \sum _{\{j:{M}_{j}<g({A}_{j};{\theta}_{0},{\theta}_{1}\left)\right\}}(1q){M}_{j}g({A}_{j};{\theta}_{0},{\theta}_{1}\left)\right\phantom{\rule{2em}{0ex}}\end{array}
(3)
where 0 < q < 1, and g(A_{
j
};θ_{0}θ_{1}) = θ_{0} + θ_{1}A_{
j
}. Using Equation (3), the 0.25 and 0.75 quantile estimates, Q_{1}(A) and Q_{3}(A), are calculated depending on the levels A. Then, the lower and upper fences are constructed: Q_{1}(A)  kIQR(A) and Q_{3}(A) + kIQR(A), where IQR(A) = Q_{3}(A)  Q_{1}(A) and k is a tuning parameter. We set k to 1.5 as the default value in our algorithm and software program because the value is practically often used. A larger k value selects fewer peptides, while a smaller k selects more outliers. The value can be adjusted empirically according to the magnitude of the variation of the data.
We can obtain more flexible quantile estimates by nonlinear and nonparametric quantile regression approaches[8]. For nonlinear quantile regression, the asymptotic function[9] can be employed:
g({A}_{j};{\theta}_{1},{\theta}_{2},{\theta}_{3})={\theta}_{1}\{1\text{exp}[\text{exp}\left({\theta}_{2}\right)\times ({A}_{j}{\theta}_{3})\left]\right\},
(4)
where θ_{1} is the asymptote, θ_{2} is the log rate, and θ_{3} is the value of A at which the response becomes zero. In addition, Selfstarting, Frank, Asymptotic with Offset and Copula functions can be employed. For nonparametric quantile regression, we utilize smoothing spline with the total variation regularization for univariate data to our algorithm[10]. A smoothing parameter plays a role in adjusting the degree of smoothness. We set it to 1 as the default, but it can be changed by users. The algorithm using projection can be summarized as follows.
Proposed Algorithm

1.
Shift the sample means ({\stackrel{\u0304}{y}}_{1},\dots ,{\stackrel{\u0304}{y}}_{n}) to the origin (0,…,0), i.e., {y}_{\mathit{\text{ij}}}^{\ast}={y}_{\mathit{\text{ij}}}{\stackrel{\u0304}{y}}_{i}.

2.
Find the first PC vector v using PCA on the space of {\mathbf{y}}_{\mathbf{1}}^{\ast},\dots ,{\mathbf{y}}_{\mathbf{n}}^{\ast}.

3.
Obtain the projection of a vector {\mathbf{y}}_{\mathbf{j}}^{\mathbf{\ast}}=({y}_{1j}^{\ast},\dots ,{y}_{\mathit{\text{nj}}}^{\ast}) of each peptide j on v, where j = 1,…,p.

4.
Compute the signed length of the projection {A}_{j}=\text{sign}\left({{\mathbf{y}}_{\mathbf{j}}^{\mathbf{\ast}}}^{\mathbf{\prime}}\mathbf{v}\right)\left{{\mathbf{y}}_{\mathbf{j}}^{\ast}}^{\mathbf{\prime}}\mathbf{v}\right/\sqrt{{\mathbf{v}}^{\prime}\mathbf{v}} and the length of the difference between a vector of peptide j and the projection {M}_{j}={\mathbf{y}}_{\mathbf{j}}^{\mathbf{\ast}}({{\mathbf{y}}_{\mathbf{j}}^{\ast}}^{\prime}\mathbf{v}/{\mathbf{v}}^{\prime}\mathbf{v}\left)\mathbf{v}\right, where j = 1,2,…,p.

5.
Obtain the first and third quantile values Q _{1}(A) and Q _{3}(A), on an MA plot using a quantile regression approach. Then calculate IQR(A) = Q _{3}(A)−Q _{1}(A).

6.
Construct the lower and upper fences, LB(A) = Q _{1}(A)  k IQR(A) and UB(A) = Q _{3}(A) + k IQR(A), where k is a tuning parameter.

7.
Declare peptide j as an outlier if it is located above the upper fence or under the lower fence.
This projection approach utilizes all the replicates simultaneously, and a highdimensional problem reduces to twodimensional one that can easily be solved. Shifts from biased experiments can be ignored due to the use of PCA.