Algorithm: Finding Change-Points
Similar to CBS, the newly proposed algorithm, eCBS, detects change-points by relying on a sequence of maximal-t tests. Before going into the details of the maximal-t test, we first introduce the maximal-t statistic as follows. Let r_{1},..., r_{N} be log_{2} ratios of signal intensities indexed by N ordered genomic markers. Let S_{i} = r_{1} + ... + r_{i}, 1 ≤ i ≤ N, be the partial sums, and let {S}_{ij}={S}_{j}-{S}_{i}={\sum}_{l=i+1}^{j}{r}_{l}. The statistics T_{ij} are given by [11]
{T}_{ij}=\left(\frac{{S}_{ij}}{k}-\frac{{S}_{N}-{S}_{ij}}{N-k}\right)/\left(s\sqrt{\frac{1}{k}+\frac{1}{N-k}}\right),
(1)
where 1 ≤ i < j ≤ N, s refers to the standard deviation of r_{1},..., r_{N}, and k = j - i. Among all values of T_{ij} derived from all possible pairs of i and j under consideration, the maximal statistic, T_{max}, is referred to as the maximal-t statistic, and the corresponding locations, i_{c} and j_{c}, are termed the candidate change-points, which are given by
\left({i}_{c},{j}_{c}\right)=\underset{1\le i<j\le N}{\text{arg max}}\mid {T}_{ij}\mid,\quad 1\le {i}_{c}<{j}_{c}\le N.
Based on the maximal-t statistic, a maximal-t test with two hypotheses (H_{0}: there is no change-point; H_{1}: there are change-points located at i_{c} and j_{c}) is formulated. We reject the null hypothesis H_{0} and declare locations i_{c} and j_{c} to be change-points if T_{max} exceeds a significance threshold.
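As an illustrative sketch (not part of the original implementation), the maximal-t statistic and the candidate change-points can be computed by brute force directly from Eq.(1); this quadratic-time reference version enumerates all pairs rather than using the optimized scan of CBS software:

```python
import numpy as np

def maximal_t(r):
    """Brute-force maximal-t statistic of Eq.(1) over all 1 <= i < j <= N.
    Returns (T_max, i_c, j_c) with 1-based marker indices. An O(N^2)
    illustrative sketch, not an optimized CBS implementation."""
    r = np.asarray(r, dtype=float)
    N = len(r)
    s = r.std()                                # standard deviation of r_1..r_N
    S = np.concatenate(([0.0], np.cumsum(r)))  # S[i] = r_1 + ... + r_i
    T_max, i_c, j_c = 0.0, 0, 0
    for i in range(1, N):
        for j in range(i + 1, N + 1):
            k = j - i
            S_ij = S[j] - S[i]
            # Eq.(1): contrast of the mean inside (i, j] against the rest
            T = (S_ij / k - (S[N] - S_ij) / (N - k)) / (
                s * np.sqrt(1.0 / k + 1.0 / (N - k)))
            if abs(T) > abs(T_max):
                T_max, i_c, j_c = T, i, j
    return T_max, i_c, j_c
```

For a sequence with an obvious elevated segment at markers 5 to 7, the candidate change-points returned are i_c = 4 and j_c = 7, i.e., the segment boundary just before marker 5 and the boundary at marker 7.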
For a chromosome under consideration, similar to CBS, eCBS detects the regions with equal DNA copy numbers by recursively invoking a change-point function, named "finding changepoints", in which the two ends of the sequential data are first connected and ternary splits are then determined using the maximal-t test. The finding-changepoints function contains three steps: 1) apply the maximal-t statistic to locate candidate change-points i_{c} and j_{c}; 2) determine whether the candidate change-points are significant; and 3) if significant change-points occur, apply a t-test to remove errors near the edges. We refer to these three steps as candidate location, significance evaluation, and edge effect correction.
The implementation of the process for finding change-points in eCBS is basically the same as in CBS, except for the method of significance evaluation. More precisely, in the original CBS, the maximal-t distribution (the distribution of the maximal-t statistic under the null hypothesis) and the significance of change-points are evaluated using permutations; in the hybrid CBS, the significance of change-points is approximated using a mixed procedure consisting of permutations, a mathematical approximation, and early stopping rules. In this study, eCBS approximates the significance of change-points using the eXtreme model through a table lookup process. Indexed by the measures of skewness and kurtosis estimated from the array and by the number of probes in the sequential data under consideration, the eXtreme model provides an approximation of the maximal-t distribution in the form of GEV parameters. We introduce the GEV distribution and the GEV parameters in the next subsection. We present the implementation of the finding-changepoints function in eCBS as Algorithm 1. Please note that the input variable, gevParams, represents the approximated maximal-t distribution provided by the eXtreme model. Additionally, gevcdf(T_{max}, gevParams) is a function returning the cumulative distribution function of the GEV distribution with parameters gevParams evaluated at T_{max}. See Figure 1 for the overall concept.
Algorithm 1: finding change-points
changepoints = finding_changepoints(data, gevParams)
Input: data: aCGH data to be segmented, 1 × N vector
gevParams: GEV parameters, γ, σ, μ.
Output: changepoints: a list of change-points
Steps:
1. Compute the statistics, T_{ij}, for all possible locations i and j by Eq.(1);
2. Find the candidate change-points i_{c} and j_{c} with the maximal statistic, T_{max};
3. Evaluate the significance of the change-points, pvalue, using pvalue = 1 - gevcdf(T_{max}, gevParams);
4. Edge effect correction;
5. If change-points are detected, append them to changepoints; also, cut and define new subsegments.
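Step 3 of Algorithm 1 might be sketched in Python as follows; this is a hypothetical helper, not the authors' code, using scipy's GEV distribution in place of MATLAB's gevcdf(). Note that scipy's shape parameter c equals -γ under the (γ, σ, μ) parameterization used in the text:

```python
from scipy.stats import genextreme

def gev_pvalue(t_max, gamma, sigma, mu):
    """p-value of T_max under an approximated GEV null (step 3 of
    Algorithm 1): pvalue = 1 - gevcdf(T_max, gevParams). scipy's shape
    parameter c is the negative of the text's gamma."""
    return 1.0 - genextreme.cdf(t_max, c=-gamma, loc=mu, scale=sigma)
```

In the Gumbel case (γ = 0), the cdf at the location parameter μ equals exp(-1), so gev_pvalue(μ, 0, σ, μ) returns 1 - exp(-1).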
The Table Lookup Process
Accurate approximations of the maximal-t distribution using the eXtreme model depend on robust estimators of skewness and kurtosis; incorrect estimates of skewness and kurtosis from aCGH data render the table lookup process incapable of finding correct values. Since the estimation of skewness and kurtosis can be biased by extremely large values [18], a presegmentation process, which quickly scans for obvious change-points before formal segmentation, is applied (see Additional File 1: Supplementary Materials). The presegmentation process (see Algorithm 2) is quite similar to the formal segmentation, but runs at lower resolution and without edge effect correction. After subtracting the changes in mean values caused by copy number amplifications or deletions, skewness and kurtosis can be estimated without bias due to CNAs. Since no permutations are involved, the presegmentation process completes in a short time.
Algorithm 2: parameter estimation
[sk, ku] = parameter_estimation(data)
Input: data: aCGH data to be segmented, 1 × N vector
Output: sk, ku: estimates of skewness and kurtosis.
Step:
1. Execute presegmentation process without edge effect correction;
2. Discard small segments;
3. Subtract all segments' mean values;
4. Derive measures of skewness and kurtosis after removing segments' mean values.
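Steps 3 and 4 of Algorithm 2 can be sketched as follows; the segments argument is a hypothetical list of (start, end) index pairs assumed to come from the presegmentation pass, and scipy's moment estimators stand in for whatever estimators the original implementation used:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def estimate_sk_ku(data, segments):
    """Steps 3-4 of Algorithm 2: subtract each segment's mean value, then
    estimate skewness and (non-excess) kurtosis from the residuals.
    'segments' is a hypothetical list of (start, end) index pairs."""
    resid = np.asarray(data, dtype=float).copy()
    for a, b in segments:
        resid[a:b] -= resid[a:b].mean()      # remove the segment's mean level
    # fisher=False returns Pearson kurtosis, which is 3 for a normal sample
    return skew(resid), kurtosis(resid, fisher=False)
```

After mean removal, a large copy-number shift no longer inflates the estimates: normal data with an embedded level change yields skewness near 0 and kurtosis near 3.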
The table lookup process for deriving approximations of the maximal-t distribution is carried out by sending three indexes, namely, the estimates of skewness and kurtosis and the number of probes under consideration, to the eXtreme model. If these indexes do not fall exactly on the table grid, linear interpolation is applied to complete the approximation. We apply linear interpolation because the GEV parameters change smoothly with increasing or decreasing values of skewness, kurtosis, and number of probes (shown later). Through the table lookup process, the eXtreme model provides accurate approximations of the maximal-t distribution when the number of probes in the sequential data under consideration is not too small (described later). We empirically set the default minimum to 100 probes; when the number of probes is less than this minimum, the mixed procedure applied in the hybrid CBS is used instead. Additional improvement in computation time may be achieved with a smaller minimum, at the expense of accuracy.
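The interpolated lookup can be sketched with a regular-grid interpolator; the grid axes below mirror the simulation design (skewness, kurtosis, probe count), but the table values are a synthetic placeholder, whereas the real eXtreme model stores fitted GEV parameters:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical 3-D table of one GEV parameter, indexed by skewness,
# kurtosis, and number of probes. Placeholder values only: the entries are
# a smooth function of the indexes, standing in for fitted GEV parameters.
sk_grid = np.round(np.arange(-1.0, 1.01, 0.1), 1)
ku_grid = np.round(np.arange(2.6, 5.61, 0.2), 1)
n_grid = np.array([100.0, 200.0, 500.0, 1000.0, 5000.0, 10000.0])
table = (sk_grid[:, None, None] + ku_grid[None, :, None]
         + np.log10(n_grid)[None, None, :])
# Linear interpolation between grid points, as described in the text
lookup = RegularGridInterpolator((sk_grid, ku_grid, n_grid), table)
```

Querying lookup([(0.05, 3.0, 150.0)]) then returns a value linearly interpolated between the surrounding grid entries.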
Simulating aCGH Data Using the Pearson System
The basic idea behind the eXtreme model, which associates characteristics of aCGH data with the maximal-t distribution, was to simulate a large number of aCGH datasets and then model the distribution of the maximal-t statistics under the null hypothesis using the GEV distribution. While a fundamental assumption about aCGH data is Gaussian normality, practical microarray data can be non-normally distributed [19]. Additionally, as observed in Figure 2, the maximal-t distribution is quite sensitive to outliers and heavy tails in aCGH data. Thus, in order to provide accurate modeling results for most aCGH data, we need to consider non-normal properties, namely skewness and kurtosis, when generating synthetic datasets.
The Pearson system [20] provides a way to construct probability distributions in which skewness and kurtosis can be adjusted. Without knowing the probability distributions from which aCGH data arose, we hypothesized that, by matching moments up to the 4th order (mean, variance, skewness, and kurtosis), the Pearson system is sufficient to simulate a wide range of aCGH data under the null condition (no change-points). This hypothesis clearly held under our simulated conditions according to the Kolmogorov-Smirnov test (KS-test) (see Additional File 1: Supplementary Materials for details of the testing results). Comprising a family of probability density functions (PDFs), the Pearson system is the solution to the differential equation
\frac{{p}^{\prime}\left(x\right)}{p\left(x\right)}=-\frac{x-a}{{b}_{2}{x}^{2}+{b}_{1}x+{b}_{0}},
(2)
where p(x) is a density function and a, b_{0}, b_{1}, and b_{2} are parameters of the distribution. Correspondences between the distribution parameters and the central moments (θ_{1}, θ_{2}, θ_{3}, θ_{4}) [21] are given by
\begin{array}{c}{b}_{0}=\left({\theta}_{2}\left(4{\beta}_{2}-3{\beta}_{1}^{2}\right)\right)/A,\\ {b}_{1}=\left(\sqrt{{\theta}_{2}}{\beta}_{1}\left({\beta}_{2}+3\right)\right)/A,\\ {b}_{2}=\left(2{\beta}_{2}-3{\beta}_{1}^{2}-6\right)/A,\end{array}
where {\beta}_{1}={\theta}_{3}/\sqrt{{\theta}_{2}^{3}} denotes the skewness, {\beta}_{2}={\theta}_{4}/{\theta}_{2}^{2} denotes the kurtosis, and A=10{\beta}_{2}-12{\beta}_{1}^{2}-18.
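The correspondences above translate directly into code; as a sanity check, for a normal distribution (skewness 0, kurtosis 3) they reduce to b_{0} = θ_{2} and b_{1} = b_{2} = 0, recovering the linear score function of the Gaussian:

```python
import math

def pearson_coeffs(theta2, beta1, beta2):
    """Pearson-system parameters b0, b1, b2 from the variance theta2,
    skewness beta1, and kurtosis beta2, per the correspondences above."""
    A = 10.0 * beta2 - 12.0 * beta1 ** 2 - 18.0
    b0 = theta2 * (4.0 * beta2 - 3.0 * beta1 ** 2) / A
    b1 = math.sqrt(theta2) * beta1 * (beta2 + 3.0) / A
    b2 = (2.0 * beta2 - 3.0 * beta1 ** 2 - 6.0) / A
    return b0, b1, b2
```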
A MATLAB function, pearsrnd(), was applied to generate data with different parameters. Again, since these data were generated under the null hypothesis, they contain no change-points. Therefore, the maximal-t statistics generated from these data satisfy the null hypothesis. To be specific, we generated tables of data for:
1. The number of probes, N, which varies from 10 to 10,000, with intervals of 10, 100, and 1,000 for numbers of probes up to 100, 1,000, and 10,000, respectively;
2. Skewness, sk, selected from -1 to 1 with an interval of 0.1; and
3. Kurtosis, ku, selected from 2.6 to 5.6 with an interval of 0.2.
Please note that skewness is 0 and kurtosis is 3 for a normal distribution. The ranges and intervals of the simulation parameters were carefully chosen based on typical estimates of skewness and kurtosis from real aCGH data (see Additional File 1: Supplementary Figure). One can always refine the table resolution when improved computation resources are accessible. Since the mean and standard deviation of the simulated data are irrelevant to GEV modeling due to the normalization in Eq.(1), we set them to 0 and 1, respectively.
Modeling the Maximal-t Distribution Using the GEV Distribution
After simulating aCGH datasets and the resulting maximal-t statistics, we modeled the maximal-t distribution by the GEV distribution with parameters γ, σ, and μ (described later), using a maximum likelihood method [22]. Please note that the GEV parameters were derived from the maximal-t statistics, not directly from the simulated datasets. The modeling process was done using a MATLAB function (statistical toolbox), gevfit(), a maximum likelihood estimator of the GEV parameters. As a result, tables of modeling information (one table for each GEV parameter), indexed by skewness, kurtosis, and the number of probes, were generated and saved in the eXtreme model on which eCBS is based.
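The fitting step can be sketched with scipy's maximum likelihood estimator, which mirrors MATLAB's gevfit(). The sample below is synthetic Gumbel noise standing in for simulated maximal-t statistics (an assumption for illustration only), and scipy's shape parameter c corresponds to -γ in the text's parameterization:

```python
import numpy as np
from scipy.stats import genextreme, gumbel_r

# Synthetic stand-in for a sample of maximal-t statistics: Gumbel(0, 1)
# draws, i.e., the gamma = 0 member of the GEV family.
t_stats = gumbel_r.rvs(size=5000, random_state=1)

# Maximum likelihood fit of the GEV parameters, analogous to gevfit()
c_hat, mu_hat, sigma_hat = genextreme.fit(t_stats)
gamma_hat = -c_hat  # convert scipy's shape back to the text's gamma
```

For this sample the fitted parameters land close to the true values γ = 0, μ = 0, σ = 1, as expected for a maximum likelihood estimator on 5,000 draws.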
We briefly review the GEV distribution and the GEV parameters as follows. For a sequence of independent and identically distributed (i.i.d.) random variables, t_{1},..., t_{n}, the Fisher-Tippett theorem states that, after proper normalization, the maximum of these random variables, t_{max} = max(t_{1},..., t_{n}), converges in distribution to one of three possible distributions: the Gumbel distribution, the Fréchet distribution, or the Weibull distribution. These three distributions are unified under the Generalized Extreme Value (GEV) distribution [23, 24]. The density function of the GEV distribution with parameters (μ, σ, γ) is given by
f\left({t}_{max}\right)=\left\{\begin{array}{cc}{\sigma}^{-1}{w}^{-\frac{1}{\gamma}-1}{e}^{-{w}^{-\frac{1}{\gamma}}}, & w>0,\ \text{for}\ \gamma \ne 0\\ {\sigma}^{-1}{e}^{-z}{e}^{-{e}^{-z}}, & \forall t,\ \text{for}\ \gamma =0\end{array}\right.,
where w = (1 + γz), z = (t_{max} - μ)/σ, γ is the shape parameter, σ is the scale parameter (σ > 0), and μ is the location parameter.
In our setting, the statistics T_{ij} derived from the quotient in Eq.(1), instead of being a sequence of i.i.d. random variables, form a random field spatially correlated in the i-j plane. However, it is not difficult to show that the covariance between any two random variables in the i-j plane, defined by T_{ij}, is small when the distance between them or the number of probes, N, is large. Furthermore, as shown in [25, 26], under certain conditions (independence when sufficiently far apart and non-clustering), Extreme Value Theory (EVT) can be established for a dependent process by constructing an independent process with the same distribution function. A rigorous mathematical derivation of the distribution of T_{max} (the maximum among all T_{ij}) would be considerably difficult due to the complex dependency and is beyond the scope of this paper; we thus simply assume that the distribution of T_{max} (i.e., the maximal-t distribution) can be modeled by the GEV distribution when N is sufficiently large, taking on a different set of parameters than the GEV distribution for i.i.d. random variables.
Simulation for Performance Validation
We validated the performance of eCBS via simulation. Two simulation models, similar to those used by Olshen et al. (2004) [11], were applied to eCBS. The first model contains 150 probes (N = 150) and four change-points located at 50, 70, 80, and 100. Log_{2} ratios within each segment were randomly generated from a normal distribution, X ~ N(m_{i}, v^{2}), where v is the standard deviation and m_{i} is the mean value of the i-th segment, set to 0, cv, -cv, cv, and 0, respectively. Parameter c controlled the alteration amplitudes, and cases with c = 2, 3, and 4 were tested. The second model contains 1,500 probes (N = 1,500) and one change-point near the edges or two change-points in the center of the chromosomes. Data were generated exactly as in the first model, but the change-point locations and amplitudes were controlled by m_{i} = cvI, where I is an indicator function that equals 1 for segments with l < x < (l + k) and 0 otherwise. Parameter k refers to the width of the variation, and l refers to the location of the variation.
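The first simulation model can be sketched as follows; the alternating sign pattern of the segment means is an assumption made so that adjacent segments differ at all four change-points, and the seeded generator is purely illustrative:

```python
import numpy as np

def simulate_model1(c, v=1.0, rng=None):
    """First simulation model: N = 150 probes, change-points at 50, 70, 80,
    and 100. Segment means 0, cv, -cv, cv, 0 (sign pattern assumed so that
    every boundary is a genuine change-point); noise is N(0, v^2)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    bounds = [0, 50, 70, 80, 100, 150]
    means = [0.0, c * v, -c * v, c * v, 0.0]
    data = np.empty(150)
    for (a, b), m in zip(zip(bounds[:-1], bounds[1:]), means):
        data[a:b] = rng.normal(m, v, b - a)
    return data
```

With c = 4 the embedded segments are far above the noise level, which matches the easiest of the tested amplitude settings (c = 2, 3, 4).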
ROC curves were further used to evaluate the power of detecting change-points in simulated data drawn from the Pearson system. A segment of copy number variation, 15 probes in width (k = 15), was embedded near the edges (l = 0) of every chromosome with 1,500 probes (N = 1,500). The rate of successful detection of change-points from these data defined the sensitivity. Cases without copy number variations were also tested to derive the specificity. Different settings of variation amplitude, skewness, and kurtosis were tested, and the performance of the hybrid CBS and eCBS was compared using ROC curves.
Real aCGH Data
Two real aCGH datasets were employed to test the hybrid CBS and eCBS on computation time and segmentation accuracy. Ten unpublished breast cancer aCGH arrays using the Agilent Human Genome CGH 105A platform with 105,072 probes were analyzed. Probes with unknown positions or small signal-to-noise ratios (SNR < 1) were filtered out, leaving more than 93,000 probes in each array for data segmentation. Another aCGH dataset, GSE9177 (NCBI/GEO; 11 aCGH profiles of human glioblastoma (GBM) using Agilent 244A human CGH arrays), was also downloaded and processed for segmentation and performance comparison.