Algorithm - Finding Change-Points
Similar to CBS, the newly proposed algorithm, eCBS, detects change-points relying on a sequence of maximal-t tests. Before getting into the details of the maximal-t test, we first introduce the maximal-t statistic as follows. Let r1,..., r
N
be log2 ratios of signal intensities indexed by N ordered genomic markers. Let S
i
= r1 + ... + r
i
, 1 ≤ i ≤ N, be the partial sums, and let . Statistics T
ij
are given by [11]
(1)
where 1 ≤ i < j ≤ N, s refers to the standard deviation of r1,..., r
N
, and k = j - i. Among all values of T
ij
derived from all possible permutes of i and j under consideration, the maximal statistic, T
max
, is referred to as the maximal-t statistic, and the corresponding locations, i
c
and j
c
, are termed candidate change-points, which are given by
Based on the maximal-t statistic, a maximal-t test with two hypotheses (H0: there is no change-point, H1: there are change-points locating at i
c
and j
c
) is formulated. We reject the null hypothesis H0 and declare locations i
c
and j
c
as change-points if T
max
exceeds a significance threshold.
For a chromosome under consideration, similar to CBS, eCBS detects the regions with equal DNA copy numbers by recursively invoking a change-point function, named "finding change-points", in which two ends of a sequential data are first connected and then ternary splits are determined using the maximal-t test. The function of finding change-points contains three steps: 1) apply maximal-t statistic to locate candidate change-points i
c
and j
c
; 2) determine whether the candidate change-points are significant or not; and 3) if significant change-points occur, a t-test is applied to remove errors near the edges. We refer to these three steps as candidate location, significance evaluation, and edge effect correction.
The implementation of the process for finding change-points in eCBS is basically the same as CBS except the method for significance evaluation. More precisely, in the original CBS, maximal-t distribution (distribution of the maximal-t statistics under null hypothesis) and the significance of change-points are evaluated using permutations; in the hybrid CBS, the significance of change-points is approximated using a mixed procedure, constituted of permutations, a mathematical approximation, and early stopping rules. In this study, eCBS approximates the significance of change-points using the eXtreme model through a table lookup process. Indexed by the measures of skewness and kurtosis estimated from the array and the number of probes in the sequential data under consideration, the eXtreme model provides an approximation of maximal-t distribution in the form of GEV parameters. We will introduce the GEV distribution and the GEV parameters later in the next subsection. We herein demonstrate the implementation of the function - finding change-points - in eCBS as Algorithm 1. Please note that the input variables, gevParams, represent the approximated maximal-t distribution provided by the eXtreme model. Additionally, gevcdf(T
max
, gevParams) is a function returning the cumulative distribution function of GEV distribution with parameters gevParams at the value T
max
. See Figure 1 for the overall concept.
Algorithm 1 - finding change-points
change points = finding change-points (data, gevParams)
Input: data: aCGH data to be segmented, 1 × N vector
gevParams: GEV parameters, γ, σ, μ.
Output: change-points: a list of change-points
Step:
1. Compute statistics, T
ij
, for all possible locations i and j by Eq.(1);
2. Find candidate change-points i
c
and j
c
with maximal statistic, T
max
;
3. Evaluate the significance of change-points, p-value, using
p-value = 1-gevcdf(T
max
, gevParams);
-
4.
Edge effect correction;
-
5.
If change-points are detected, list change-points into change-points; also, cut and define new subsegments.
The Table Lookup Process
Accurate approximations of maximal-t distribution using the eXtreme model depend on robust estimators of skewness and kurtosis; incorrect estimates of skewness and kurtosis from aCGH data render the table lookup process incapable of finding correct values. Since the estimation of skewness and kurtosis could be biased due to extremely large values [18], a pre-segmentation process - scanning obvious change-points quickly before segmentation - is selected (see Additional File 1: Supplementary Materials). The pre-segmentation process (see Algorithm 2) is quite similar to formal segmentation, but with lower resolution and no edge effect correction. After subtracting the changes of mean values from copy number amplifications or deletions, skewness and kurtosis can be estimated without bias due to CNAs. Since no permutations are involved, the pre-segmentation process is done in a short time.
Algorithm 2 - parameter estimation
[sk, ku] = parameter_estimation(data)
Input: data: aCGH data to be segmented, 1 × N vector
Output: sk, ku: estimates of skewness and kurtosis.
Step:
1. Execute pre-segmentation process without edge effect correction;
2. Discard small segments;
3. Subtract all segments' mean values;
4. Derive measures of skewness and kurtosis after removing segments' mean values.
The table lookup process for deriving approximations of maximal-t distribution can be done by sending three indexes, namely, estimates of skewness and kurtosis and the number of probes under consideration, to the eXtreme model. If these indexes do not fall exactly on the table grid, linear interpolation is applied to accomplish the approximation. We apply linear interpolation because the GEV parameters change smoothly with increasing or decreasing values of skewness, kurtosis, and number of probes (shown later). Through the table lookup process, the eXtreme model can provide accurate approximations of maximal-t distribution when the number of probes in the sequential data under consideration is not too small (described later). We empirically set the default minimum as 100 probes; when number of probes is less than the minimum, the mixed procedure applied in the hybrid CBS kicks in. Additional improvement in computation time may be achieved using a smaller minimum at the expense of accuracy.
Simulating aCGH Data Using the Pearson System
The basic idea to create the eXtreme model, which associates characteristics of aCGH data with maximal-t distribution, was to simulate a large number of aCGH data and then model the distribution of maximal-t statistics under null hypothesis using the GEV distribution. While a fundamental assumption of aCGH data is Gaussian normality, practical microarray data can be non-normally distributed [19]. Additionally, as we observed in Figure 2, maximal-t distribution is quite sensitive to outliers and heavy tails of aCGH data.
Thus, in order to provide accurate modeling results for most aCGH data, we need to consider non-normal properties, namely, the skewness and kurtosis, when generating synthetic datasets.
The Pearson system [20] provides a way to construct probability distributions in which skewness and kurtosis can be adjusted. Without knowing the probability distributions from which aCGH data arose, we hypothesized that, by providing up to 4 th moments (mean, variance, skewness and kurtosis), the Pearson system is sufficient to simulate a wide range of aCGH data under null condition (no change-points). This hypothesis was clearly hold under our simulated condition by the Kolmogorov-Smirnov Test (KS-test) (see Additional File 1: Supplementary Materials for details of the testing results). Including a set of probability density functions (PDFs), the Pearson system is the solution to the differential equation,
(2)
where p (x) is a density function and a, b0, b1, and b2 are parameters of the distribution. Correspondences between the distribution parameters and the central moments (θ1, θ2, θ3, θ4) [21] are denoted as
where denotes the skewness, and denotes the kurtosis, and .
A MATLAB function,pearsrnd(), was applied to generate data with different parameters. Again, since these data were generated under null hypothesis, they contain no change-points. Therefore, the maximal-t statistics generated from these data satisfy the null hypothesis. To be specific, we generated tables of data for,
-
1.
The number of probes, N, which varies from 10 to 10,000 with intervals of 10, 100, and 1000 for the number of probes within 100, 1000 and 10000, respectively;
-
2.
Skewness, sk, selected from -1 to 1 with an interval of 0.1; and
-
3.
Kurtosis, ku, selected from 2.6 to 5.6 with an interval of 0.2.
Please note that skewness is 0, and kurtosis is 3 for a normal distribution. The ranges and intervals of the simulation parameters were carefully chosen based on typical estimates of skewness and kurtosis from real aCGH data (see Additional File 1: Supplementary Figure. One can always refine the table resolution when improved computation resources are accessible. Note that the mean and standard deviation of the simulated data were irrelevant to GEV modeling due to the normalization process in Eq.(1), we thus set the mean and standard deviation for simulating aCGH data to be 0 and 1, respectively.
Modeling Maximal-t Distribution Using the GEV Distribution
After simulating aCGH datasets and the subsequent maximal-t statistics, we modeled maximal-t distribution by the GEV distribution with parameters (described later), γ, σ, and μ, using a maximal likelihood method [22]. Please note that the GEV parameters were derived from the maximal-t statistics, not directly from the simulated datasets. The modeling process was done using a MATLAB function (statistical toolbox), gevfit(), a maximum likelihood estimator of the GEV parameters. As a result, tables of modeling information (one table for each GEV parameter), indexed by skewness, kurtosis and the number of probes, were generated and saved in the eXtreme model on which eCBS was based.
We briefly go through the GEV distribution and the GEV parameters as follows. For a sequential independent and identically distributed (i.i.d.) random variables, t1,..., t
n
, the Fisher-Tippet theorem states that after proper normalization, the maximum of these random variables, t
max
= max (t1,..., t
n
), converges in distribution to one of three possible distributions: the Gumbel distribution, the Fréchet distribution, or the Weibull distribution. The three distributions listed above are unified under the Generalized Extreme Value (GEV) distribution [23, 24]. The density function of the GEV distribution with parameters (μ, σ, γ) is given by
where w = (1 + γz), z = (t
max
- μ)/σ, γ is the shape parameter, σ is the scale parameter (σ > 0), and μ is the location parameter.
In our setting, the statistics T
ij
derived from the quotient in Eq.(1), instead of being a sequential i.i.d. random variables, form a random field spatially correlated in the i - j plane. However, it is not difficult to show that the covariance between any two random variables in the i - j plane, defined by T
ij
, is small when the distance between them or the number of probes, N, is large. Furthermore, as shown in [25, 26], under certain conditions (independence when sufficient apart and non-clustering), Extreme Value Theory (EVT) can be established for a dependent process by constructing an independent process with the same distribution function. The rigorous mathematical derivation of the distribution of T
max
(the maximum among all T
ij
) could be considerably difficult due to the complex dependency and beyond the scope of this paper, we thus simply assume that the distribution of T
max
(i.e., maximal-t distribution) can be modeled by the GEV distribution when N is properly large, taking on different sets of parameters than that of GEV distribution for i.i.d. random variables.
Simulation for Performance Validation
We validated the performance of eCBS via simulation. Two simulation models, similar to previous models used by Olshen et al. (2004) [11], were applied to eCBS. The first model contains 150 probes (N = 150) and four change-points located at 50, 70, 80 and 100. Log2 ratios within each segment were randomly generated by normal distribution or X ~ N(m
i
, v2), where v is the standard deviation and m
i
is the mean value of the i-th segment, which was set to be 0, cv, -cv, cv, and 0, respectively. Parameter c controlled the alteration amplitudes, and cases with c = 2, 3, and 4 were tested. The second model contains 1,500 probes (N = 1,500) and one change-point near the edges or two change-points in the center of the chromosomes. Data were generated exactly the same as in the first model, but the change-point locations and amplitudes were controlled by m
i
= cvI, where I is an indicator function, which equals 1 for segments between l < × < (l + k) and 0 otherwise. Parameter k refers to the width of the variation, and l refers to the location of the variation.
ROC curves were further used to evaluate the power of detecting change-points in the simulated data drawn from the Pearson System. A segment of copy number variations, 15 probes in width (k = 15), was embedded near the edges (l = 0) of every chromosome with 1,500 probes (N = 1,500). Successful detection of change-points from these data was defined as sensitivity. Cases without copy number variations were also tested for deriving specificity. Different settings of variation amplitudes, skewness, and kurtosis were tested, and the performance of the hybrid CBS and eCBS were compared using ROC curves.
Real aCGH Data
Two real aCGH datasets were employed to test the hybrid CBS and eCBS on computation time and segmentation accuracy. Ten unpublished breast cancer aCGH arrays using the Agilent Human Genome CGH 105A platform with 105,072 probes, were analyzed. Probes with unknown positions or small signal-to-noise ratios (SNR < 1) were filtered out, and more than 93,000 probes in each array were left for data segmentation. Another aCGH dataset from GSE9177 (NCBI/GEO, 11 aCGH profiles of human glioblastoma GBM using the Agilent 244A human CGH arrays) were also downloaded and processed for segmentation and performance comparison.