# A model-based circular binary segmentation algorithm for the analysis of array CGH data

- Fang-Han Hsu
^{1}, - Hung-I H Chen
^{2}, - Mong-Hsun Tsai
^{4}, - Liang-Chuan Lai
^{5}, - Chi-Cheng Huang
^{1, 6}, - Shih-Hsin Tu
^{6}, - Eric Y Chuang
^{1}Email author and - Yidong Chen
^{2, 3}Email author

**4**:394

**DOI: **10.1186/1756-0500-4-394

© Hsu et al; licensee BioMed Central Ltd. 2011

**Received: **2 June 2011

**Accepted: **10 October 2011

**Published: **10 October 2011

## Abstract

### Background

Circular Binary Segmentation (CBS) is a permutation-based algorithm for array Comparative Genomic Hybridization (aCGH) data analysis. CBS accurately segments data by detecting change-points using a maximal-*t* test; but extensive computational burden is involved for evaluating the significance of change-points using permutations. A recent implementation utilizing a hybrid method and early stopping rules (hybrid CBS) to improve the performance in speed was subsequently proposed. However, a time analysis revealed that a major portion of computation time of the hybrid CBS was still spent on permutation. In addition, what the hybrid method provides is an approximation of the significance upper bound or lower bound, not an approximation of the significance of change-points itself.

### Results

We developed a novel model-based algorithm, extreme-value based CBS (eCBS), which limits permutations and provides robust results without loss of accuracy. Thousands of aCGH data under null hypothesis were simulated in advance based on a variety of non-normal assumptions, and the corresponding maximal-*t* distribution was modeled by the Generalized Extreme Value (GEV) distribution. The modeling results, which associate characteristics of aCGH data to the GEV parameters, constitute lookup tables (eXtreme model). Using the eXtreme model, the significance of change-points could be evaluated in a constant time complexity through a table lookup process.

### Conclusions

A novel algorithm, eCBS, was developed in this study. The current implementation of eCBS consistently outperforms the hybrid CBS 4× to 20× in computation time without loss of accuracy. Source codes, supplementary materials, supplementary figures, and supplementary tables can be found at http://ntumaps.cgm.ntu.edu.tw/eCBSsupplementary.

## Background

Copy number alterations (CNAs) are genomic disorders that closely correlate with many human diseases [1]. For instance, 17q23 was found to be a common region of amplification associated with breast cancers with poor prognosis [2], and copy number losses at 13q and gains at 1q and 5p were frequently observed in a prostate cancer study [3]. While some CNAs are well studied, most CNAs and their relation to genetic disorders remain largely unknown. Identifying regions of DNA copy number gains or losses is thus a critical step for studying the pathogenesis of cancer and many other diseases.

Array comparative genomic hybridization (aCGH) is a high throughput and high-resolution technique for measuring CNAs [4, 5], and the main purpose of aCGH data segmentation is to detect CNAs precisely and efficiently by utilizing neighboring probes' characteristics. Many algorithms have been proposed for this purpose, such as the Clustering Method Along Chromosomes (CLAC) [6], Hidden Markov Model methods [7, 8], and Bayesian segmentation approaches [9, 10]. Among these algorithms, Circular Binary Segmentation (CBS) [11] has the best operational characteristics in terms of its sensitivity and false discovery rate (FDR) for change-point detection [12, 13].

CBS performs consistently [13] and has been widely used by researchers [14, 15], but the major weakness of heavy computational cost prohibits CBS for high-density aCGH microarrays with millions of probes. The original CBS relies fully on a permutation-based maximal-*t* test to detect change-points. Through a recursive cut and test process, reliable results can be derived but require extensive computational burden. Realizing the deficiency, a hybrid version of CBS that mixes the permutations and a mathematical approximation was recently proposed [16]. Along with additional early stopping rules for approximating the significance of change-points, the hybrid CBS can detect change-points in a linear time and provides a substantial gain in speed.

The percentage of time consumed on each step of segmentation using the hybrid CBS

Consumed Time | |||
---|---|---|---|

Array | Candidate Location (%) | Significance Evaluation (%) | Edge Effect Correction (%) |

Array #10 | 1.21 | 88.24 | 10.56 |

Array #19 | 2.32 | 91.53 | 6.14 |

Array #22 | 1.63 | 96.21 | 2.16 |

Array #28 | 1.01 | 95.95 | 3.05 |

Array #42 | 0.87 | 97.54 | 1.59 |

Array #45 | 0.99 | 96.43 | 2.58 |

Array #48 | 1.48 | 98.48 | 0.04 |

Array #65 | 1.11 | 97.63 | 1.26 |

Array #72 | 1.89 | 83.92 | 14.19 |

Array #78 | 2.01 | 91.51 | 6.48 |

In this study, we proposed a model-based version of CBS, termed extreme-value based CBS (eCBS). Instead of evaluating an significance upper bound or a lower bound, eCBS approximates the significance of change-points using a Generalized Extreme Value (GEV) distribution model (eXtreme model). The eXtreme model consists of a set of lookup tables built in advance through simulation. Considering a variety of non-normal aCGH data, we simulated thousands of data without change-points using the Pearson system. The corresponding maximal-*t* distribution was then modeled by the GEV distribution, and the modeling results in the form of the GEV parameters constituted the eXtreme model. Using the eXtreme model, the significance of change-points can be approximated through a table lookup process in a constant time complexity. As a result, permutations are limited and computation time can be significantly reduced. The performance of segmentation in speed and segmentation results using both the hybrid CBS and eCBS were compared via simulation and real data analysis.

## Methods

### Algorithm - Finding Change-Points

*t*tests. Before getting into the details of the maximal-

*t*test, we first introduce the maximal-

*t*statistic as follows. Let

*r*

_{1},...,

*r*

_{ N }be log

_{2}ratios of signal intensities indexed by

*N*ordered genomic markers. Let

*S*

_{ i }=

*r*

_{1}+ ... +

*r*

_{ i }, 1 ≤

*i*≤

*N*, be the partial sums, and let ${S}_{ij}={S}_{j}-{S}_{i}={\sum}_{l=i+1}^{j}{r}_{l}$. Statistics

*T*

_{ ij }are given by [11]

*i < j*≤

*N*,

*s*refers to the standard deviation of

*r*

_{1},...,

*r*

_{ N }, and

*k*=

*j - i*. Among all values of

*T*

_{ ij }derived from all possible permutes of

*i*and

*j*under consideration, the maximal statistic,

*T*

_{ max }, is referred to as the maximal-

*t*statistic, and the corresponding locations,

*i*

_{ c }and

*j*

_{ c }, are termed candidate change-points, which are given by

Based on the maximal-*t* statistic, a maximal-*t* test with two hypotheses (*H*_{0}: there is no change-point, *H*_{1}: there are change-points locating at *i*_{
c
} and *j*_{
c
} ) is formulated. We reject the null hypothesis *H*_{0} and declare locations *i*_{
c
} and *j*_{
c
} as change-points if *T*_{
max
} exceeds a significance threshold.

For a chromosome under consideration, similar to CBS, eCBS detects the regions with equal DNA copy numbers by recursively invoking a change-point function, named "*finding change-points*", in which two ends of a sequential data are first connected and then ternary splits are determined using the maximal-*t* test. The function of finding change-points contains three steps: 1) apply maximal-*t* statistic to locate candidate change-points *i*_{
c
} and *j*_{
c
} ; 2) determine whether the candidate change-points are significant or not; and 3) if significant change-points occur, a *t*-test is applied to remove errors near the edges. We refer to these three steps as candidate location, significance evaluation, and edge effect correction.

*t*distribution (distribution of the maximal-

*t*statistics under null hypothesis) and the significance of change-points are evaluated using permutations; in the hybrid CBS, the significance of change-points is approximated using a mixed procedure, constituted of permutations, a mathematical approximation, and early stopping rules. In this study, eCBS approximates the significance of change-points using the eXtreme model through a table lookup process. Indexed by the measures of skewness and kurtosis estimated from the array and the number of probes in the sequential data under consideration, the eXtreme model provides an approximation of maximal-

*t*distribution in the form of GEV parameters. We will introduce the GEV distribution and the GEV parameters later in the next subsection. We herein demonstrate the implementation of the function -

*finding change-points*- in eCBS as Algorithm 1. Please note that the input variables,

*gevParams*, represent the approximated maximal-

*t*distribution provided by the eXtreme model. Additionally,

*gevcdf(T*

_{ max }

*, gevParams)*is a function returning the cumulative distribution function of GEV distribution with parameters

*gevParams*at the value

*T*

_{ max }. See Figure 1 for the overall concept.

**Algorithm 1** - finding change-points

change points = *finding change-points (data, gevParams)*

**Input:** *data*: aCGH data to be segmented, 1 × *N* vector

*gevParams*: GEV parameters, *γ*, *σ*, *μ*.

**Output:** *change-points*: a list of change-points

**Step:**

1. Compute statistics, *T*_{
ij
} , for all possible locations *i* and *j* by Eq.(1);

2. Find candidate change-points *i*_{
c
} and *j*_{
c
} with maximal statistic, *T*_{
max
} ;

3. Evaluate the significance of change-points, *p*-value, using

*p-value = 1-gevcdf(T*

_{ max }

*, gevParams)*;

- 4.
Edge effect correction;

- 5.
If change-points are detected, list change-points into

*change-points*; also, cut and define new subsegments.

### The Table Lookup Process

Accurate approximations of maximal-*t* distribution using the eXtreme model depend on robust estimators of skewness and kurtosis; incorrect estimates of skewness and kurtosis from aCGH data render the table lookup process incapable of finding correct values. Since the estimation of skewness and kurtosis could be biased due to extremely large values [18], a pre-segmentation process - scanning obvious change-points quickly before segmentation - is selected (see Additional File 1: Supplementary Materials). The pre-segmentation process (see Algorithm 2) is quite similar to formal segmentation, but with lower resolution and no edge effect correction. After subtracting the changes of mean values from copy number amplifications or deletions, skewness and kurtosis can be estimated without bias due to CNAs. Since no permutations are involved, the pre-segmentation process is done in a short time.

**Algorithm 2** - parameter estimation

*[sk, ku] = parameter_estimation(data)*

**Input:** *data:* aCGH data to be segmented, 1 × *N* vector

**Output:** *sk*, *ku*: estimates of skewness and kurtosis.

**Step:**

1. Execute pre-segmentation process without edge effect correction;

2. Discard small segments;

3. Subtract all segments' mean values;

4. Derive measures of skewness and kurtosis after removing segments' mean values.

The table lookup process for deriving approximations of maximal-*t* distribution can be done by sending three indexes, namely, estimates of skewness and kurtosis and the number of probes under consideration, to the eXtreme model. If these indexes do not fall exactly on the table grid, linear interpolation is applied to accomplish the approximation. We apply linear interpolation because the GEV parameters change smoothly with increasing or decreasing values of skewness, kurtosis, and number of probes (shown later). Through the table lookup process, the eXtreme model can provide accurate approximations of maximal-*t* distribution when the number of probes in the sequential data under consideration is not too small (described later). We empirically set the default minimum as 100 probes; when number of probes is less than the minimum, the mixed procedure applied in the hybrid CBS kicks in. Additional improvement in computation time may be achieved using a smaller minimum at the expense of accuracy.

### Simulating aCGH Data Using the Pearson System

*t*distribution, was to simulate a large number of aCGH data and then model the distribution of maximal-

*t*statistics under null hypothesis using the GEV distribution. While a fundamental assumption of aCGH data is Gaussian normality, practical microarray data can be non-normally distributed [19]. Additionally, as we observed in Figure 2, maximal-

*t*distribution is quite sensitive to outliers and heavy tails of aCGH data.

Thus, in order to provide accurate modeling results for most aCGH data, we need to consider non-normal properties, namely, the skewness and kurtosis, when generating synthetic datasets.

^{ th }moments (mean, variance, skewness and kurtosis), the Pearson system is sufficient to simulate a wide range of aCGH data under null condition (no change-points). This hypothesis was clearly hold under our simulated condition by the Kolmogorov-Smirnov Test (KS-test) (see Additional File 1: Supplementary Materials for details of the testing results). Including a set of probability density functions (PDFs), the Pearson system is the solution to the differential equation,

*p*(

*x*) is a density function and

*a*,

*b*

_{0},

*b*

_{1}, and

*b*

_{2}are parameters of the distribution. Correspondences between the distribution parameters and the central moments (

*θ*

_{1},

*θ*

_{2},

*θ*

_{3},

*θ*

_{4}) [21] are denoted as

where ${\beta}_{1}={\theta}_{3}\mathsf{\text{/}}\sqrt{{\theta}_{2}^{3}}$ denotes the skewness, and ${\beta}_{2}={\theta}_{4}\mathsf{\text{/}}{\theta}_{2}^{2}$ denotes the kurtosis, and $A=10{\beta}_{2}-12{\beta}_{1}^{2}-18$.

*pearsrnd()*, was applied to generate data with different parameters. Again, since these data were generated under null hypothesis, they contain no change-points. Therefore, the maximal-

*t*statistics generated from these data satisfy the null hypothesis. To be specific, we generated tables of data for,

- 1.
The number of probes,

*N*, which varies from 10 to 10,000 with intervals of 10, 100, and 1000 for the number of probes within 100, 1000 and 10000, respectively; - 2.
Skewness,

*sk*, selected from -1 to 1 with an interval of 0.1; and - 3.
Kurtosis,

*ku*, selected from 2.6 to 5.6 with an interval of 0.2.

Please note that skewness is 0, and kurtosis is 3 for a normal distribution. The ranges and intervals of the simulation parameters were carefully chosen based on typical estimates of skewness and kurtosis from real aCGH data (see Additional File 1: Supplementary Figure. One can always refine the table resolution when improved computation resources are accessible. Note that the mean and standard deviation of the simulated data were irrelevant to GEV modeling due to the normalization process in Eq.(1), we thus set the mean and standard deviation for simulating aCGH data to be 0 and 1, respectively.

### Modeling Maximal-*t* Distribution Using the GEV Distribution

After simulating aCGH datasets and the subsequent maximal-*t* statistics, we modeled maximal-*t* distribution by the GEV distribution with parameters (described later), *γ*, *σ*, and *μ*, using a maximal likelihood method [22]. Please note that the GEV parameters were derived from the maximal-*t* statistics, not directly from the simulated datasets. The modeling process was done using a MATLAB function (statistical toolbox), *gevfit()*, a maximum likelihood estimator of the GEV parameters. As a result, tables of modeling information (one table for each GEV parameter), indexed by skewness, kurtosis and the number of probes, were generated and saved in the eXtreme model on which eCBS was based.

*i.i.d*.) random variables,

*t*

_{1},...,

*t*

_{ n }, the Fisher-Tippet theorem states that after proper normalization, the maximum of these random variables,

*t*

_{ max }= max (

*t*

_{1},...,

*t*

_{ n }), converges in distribution to one of three possible distributions: the Gumbel distribution, the Fréchet distribution, or the Weibull distribution. The three distributions listed above are unified under the Generalized Extreme Value (GEV) distribution [23, 24]. The density function of the GEV distribution with parameters (

*μ*,

*σ*,

*γ*) is given by

where *w* = (1 + *γz*), *z* = (*t*_{
max
} *- μ*)*/σ*, *γ* is the shape parameter, *σ* is the scale parameter (*σ >* 0), and *μ* is the location parameter.

In our setting, the statistics *T*_{
ij
} derived from the quotient in Eq.(1), instead of being a sequential *i.i.d*. random variables, form a random field spatially correlated in the *i* - *j* plane. However, it is not difficult to show that the covariance between any two random variables in the *i* - *j* plane, defined by *T*_{
ij
} , is small when the distance between them or the number of probes, *N*, is large. Furthermore, as shown in [25, 26], under certain conditions (independence when sufficient apart and non-clustering), Extreme Value Theory (EVT) can be established for a dependent process by constructing an independent process with the same distribution function. The rigorous mathematical derivation of the distribution of *T*_{
max
} (the maximum among all *T*_{
ij
} ) could be considerably difficult due to the complex dependency and beyond the scope of this paper, we thus simply assume that the distribution of *T*_{
max
} (i.e., maximal-*t* distribution) can be modeled by the GEV distribution when *N* is properly large, taking on different sets of parameters than that of GEV distribution for *i.i.d*. random variables.

### Simulation for Performance Validation

We validated the performance of eCBS via simulation. Two simulation models, similar to previous models used by Olshen *et al*. (2004) [11], were applied to eCBS. The first model contains 150 probes (*N* = 150) and four change-points located at 50, 70, 80 and 100. Log_{2} ratios within each segment were randomly generated by normal distribution or *X* ~ *N*(*m*_{
i
} , *v*^{2}), where *v* is the standard deviation and *m*_{
i
} is the mean value of the *i*-th segment, which was set to be 0, *cv*, *-cv*, *cv*, and 0, respectively. Parameter *c* controlled the alteration amplitudes, and cases with *c* = 2, 3, and 4 were tested. The second model contains 1,500 probes (*N* = 1,500) and one change-point near the edges or two change-points in the center of the chromosomes. Data were generated exactly the same as in the first model, but the change-point locations and amplitudes were controlled by *m*_{
i
} = *cvI*, where *I* is an indicator function, which equals 1 for segments between *l < × <* (*l* + *k*) and 0 otherwise. Parameter *k* refers to the width of the variation, and *l* refers to the location of the variation.

ROC curves were further used to evaluate the power of detecting change-points in the simulated data drawn from the Pearson System. A segment of copy number variations, 15 probes in width (*k* = 15), was embedded near the edges (*l* = 0) of every chromosome with 1,500 probes (*N* = 1,500). Successful detection of change-points from these data was defined as sensitivity. Cases without copy number variations were also tested for deriving specificity. Different settings of variation amplitudes, skewness, and kurtosis were tested, and the performance of the hybrid CBS and eCBS were compared using ROC curves.

### Real aCGH Data

Two real aCGH datasets were employed to test the hybrid CBS and eCBS on computation time and segmentation accuracy. Ten unpublished breast cancer aCGH arrays using the Agilent Human Genome CGH 105A platform with 105,072 probes, were analyzed. Probes with unknown positions or small signal-to-noise ratios (SNR *<* 1) were filtered out, and more than 93,000 probes in each array were left for data segmentation. Another aCGH dataset from GSE9177 (NCBI/GEO, 11 aCGH profiles of human glioblastoma GBM using the Agilent 244A human CGH arrays) were also downloaded and processed for segmentation and performance comparison.

## Results

### Computation Time of the Hybrid CBS

In the breast cancer study with ten aCGH experiments, the percentage of time consumed by each critical step: candidate location, significance evaluation, and edge effect correction, are listed in Table 1. The results reveal that significance evaluation took at least 83% (93.7% on average) of the total time required to complete the segmentation process.

### The GEV Distribution Models Maximal-*t* Distribution Adequately

*t*distribution well, we generated 10,000 random datasets under different distributions controlled by skewness and kurtosis. The generated maximal-

*t*distribution was then fitted by maximum likelihood method with the GEV distribution. Figure 3 shows examples of maximal-

*t*distribution and the fitted GEV distribution under three different conditions: (1) normally distributed random data with 250 probes; (2) slightly skewed and heavy-tailed data with 400 probes; and (3) severely skewed and heavy-tailed data with 550 probes. The solid lines refer to the modeling GEV distribution while the dashed lines refer to the maximal-

*t*distribution. The results clearly demonstrate that the GEV distribution can adequately model maximal-

*t*distribution.

### The eXtreme Model Content Changes Smoothly

*N*, the eXtreme model provides three 2-dimensional tables, indexed by skewness and kurtosis, for the GEV parameters,

*γ*,

*σ*, and

*μ*, respectively. Figure 4 shows an example of the tables in the eXtreme model with

*N*= 200. With two parameters, namely, the skewnesss and kurtosis (estimated from aCGH data), an approximation of maximal-

*t*distribution in the form of the GEV parameters,

*γ*,

*σ*, and

*μ*, can be quickly derived through the table lookup process. Additionally, as shown in the figure, since the eXtreme model content changes smoothly, linear interpretation can be properly applied to provide an approximation when the query parameters (

*N*,

*sk*,

*ku*) do not fall on the grid.

### eCBS Performs Equivalently Comparing to the Hybrid CBS

The performance of the eCBS algorithm was tested according to the simulation models described early. Using the second simulation model, it is revealed that eCBS has a negligible effect on change-points detection from data with normal noise. In Additional File 1: Supplementary Table, the "Exact" column accounts for the cases (among 1,000 simulations) that the segmentation results exactly match the desired number (1 for edge and 2 for center) and locations of change-points. As shown in the table, eCBS performed as good as the hybrid CBS wherever change-points were located. In addition, eCBS outperformed when the aberration width was small (*k* = 2).

### eCBS Performs Adequately under Severely Skewed/Heavy-Tailed Conditions

### eCBS Performs 4× to 20× Faster than the Hybrid CBS

Comparison of performance in speed using the hybrid CBS and eCBS - 1

Time(sec.) | # of change-points | ||||
---|---|---|---|---|---|

Array | eCBS | hybrid CBS | eCBS | hybrid CBS | # of same detection |

Array #10 | 11.30 | 82.56 | 66 | 55 | 53 |

Array #19 | 10.16 | 53.89 | 169 | 163 | 160 |

Array #22 | 6.45 | 77.94 | 101 | 110 | 97 |

Array #28 | 4.13 | 82.64 | 53 | 74 | 49 |

Array #42 | 4.02 | 74.44 | 31 | 51 | 31 |

Array #45 | 3.91 | 81.59 | 30 | 35 | 30 |

Array #48 | 2.16 | 29.18 | 1 | 1 | 1 |

Array #65 | 5.98 | 78.64 | 109 | 130 | 105 |

Array #72 | 12.90 | 56.83 | 106 | 103 | 95 |

Array #78 | 9.18 | 69.06 | 99 | 108 | 99 |

Comparison of performance in speed using the hybrid CBS and eCBS - 2

Time(sec.) | # of change-points | ||||
---|---|---|---|---|---|

Array | eCBS | hybrid CBS | eCBS | hybrid CBS | # of same detection |

GSM231848 | 52.26 | 226.7 | 557 | 607 | 531 |

GSM231849 | 62.27 | 227.5 | 1208 | 1217 | 1118 |

GSM231850 | 78.48 | 203.8 | 2383 | 2577 | 2296 |

GSM231851 | 43.87 | 166.8 | 123 | 117 | 109 |

GSM231852 | 57.34 | 181.0 | 101 | 142 | 100 |

GSM231853 | 41.62 | 259.2 | 455 | 520 | 420 |

GSM231854 | 53.16 | 245.0 | 968 | 1010 | 920 |

GSM231855 | 49.91 | 249.1 | 1091 | 1174 | 1017 |

*GSM231856 | 32.09 | 297.8 | 291 | 625 | 291 |

GSM231857 | 91.86 | 223.2 | 2163 | 2116 | 2041 |

GSM231858 | 94.39 | 203.1 | 2476 | 2483 | 2376 |

## Discussion

Circular Binary Segmentation (CBS) performs consistently in detecting change-points and thus provides us a good framework for further improvements. The framework of CBS is mainly constituted of three steps: candidate location, significance evaluation, and edge effect correction. The first step, candidate location, locates candidate change-points by a maximal-*t* statistic. The second step, significance evaluation, approximates the significance of change-points by permutations or a hybrid method; the last step, edge effect correction, removes errors near the edges due to a circling process. Of these steps, significance evaluation was the major component that made the algorithm time-consuming. A time consumption study (shown in Table 1) on the hybrid CBS that involved analyzing ten breast cancer microarrays supported the above statement: among the ten experiments, up to 98% of time consumed was attributable to significance evaluation, in which permutations took the majority of the time for evaluating *p*-values, even when early stopping rules were applied. To improve the performance of the hybrid CBS in speed, we significantly reduce computation time using the eXtreme model.

The eXtreme model contains lookup tables which associates characteristics of aCGH data with the parameters of GEV distribution. The simulation, which applied the Pearson system to generate synthetic aCGH data, was done in advance rather than invoked on demand, such as the permutations for the hybrid CBS. Since maximal-*t* distribution is sensitive to heavy tails and outliers (shown in Figure 2), non-normal aCGH data distribution was considered for the simulation. Thousands of non-normal aCGH data under null hypothesis (without change-points) were simulated. The corresponding maximal-*t* distribution was then modeled by the GEV distribution, and the modeling results in the form of the GEV parameters, *γ*, *σ*, and *μ* , were saved in the eXtreme model (lookup tables).

Using the eXtreme model, maximal-*t* distribution can be approximated given the estimates of skewness (*sk*) and kurtosis (*ku*) from the aCGH data and the number of probes (*N*) under consideration. If the input parameters (*sk*, *ku*, *N*) fall right on the table grid, the output GEV parameters (*γ*, *σ*, *μ*) are directly derived through a table lookup process. Otherwise, since the table content changes smoothly (shown in Figure 4), linear interpolation from eight closest points in the 3-dimensional tables was applied. Through the table lookup process, maximal-*t* distribution and the subsequent approximation of the significance of change-points can be evaluated in *O*(1).

Deriving robust estimates of skewness and kurtosis from aCGH data is a critical step before using the eXtreme model, since estimating bias and variations may lead to incorrect approximations of maximal-*t* distribution and increase false positive or negative. In addition to the noise from microarray experiments, CNAs greatly increase the difficulties in estimating these parameters. Recognized what we needed here were estimates of skewness and kurtosis under null condition (the condition rarely exists because there are always CNAs within tumor samples), we selected a pre-segmentation step (see Additional File 1: Supplementary Materials) to rapidly pre-cut large regions of amplification or deletion. After removing the mean values of gain and loss segments, skewness and kurtosis can be accurately estimated.

Based on the eXtreme model, a novel algorithm, eCBS, has been developed. First, eCBS pre-segments data and estimates the skewness and kurtosis from the aCGH data. These estimates will be utilized later by the eXtreme model to provide approximations of maximal-*t* distribution. Once the estimates are provided, eCBS detects the maximal statistic, *T*_{
max
} , and locates candidate change-points in a similar way operated by CBS. Please note that the maximal-*t* statistic is derived from the original aCGH data, not from the data after pre-segmentation. After candidate change-points are located, the *p*-value, or the significance of change-points, is evaluated based on the maximal-*t* distribution approximated using the eXtreme model. After edge effect correction further removes errors due to the circling process, change-points and the consequent subsegments are found. The process of finding change-points repeats iteratively until no more change-points can be detected. Rather than providing an upper bound or a lower bound of *p*-values, as implemented in the hybrid CBS, eCBS outperforms the hybrid CBS by approximating *p*-values for *T*_{
max
} directly and reducing required permutations for significance evaluation.

## Conclusions

A novel algorithm, eCBS, was developed in this study, in which the significance of change-points is evaluated using the eXtreme model. The eXtreme model provides approximations of maximal-*t* distribution for hypothesis testing. With limited utilization of permutations, eCBS evaluates the significance of change-points through a table lookup process and achieves the best performance in speed. Via real aCGH data analysis and simulations, we showed that eCBS can perform as robustly as CBS, but with much less computation time. The eCBS algorithm with eXtreme model was implemented in an R package and is available from the supplementary website.

## Availability and Requirements

The eCBS algorithm was developed using Linux (version 2.6.11-1.1369 FC4smp) and R (version 2.8.0), and the source code is freely available at http://ntumaps.cgm.ntu.edu.tw/eCBSsupplementary. In order to execute properly, a compatible version of Linux operating system with R environment is required. Installation instructions are included in the manual (\eCBS\inst\doc\eCBS.pdf).

## Declarations

### Acknowledgements and Funding

The authors would like to thank the members in the Bioinformatics and Biostatistics Core Laboratory, Center for Genomic Medicine, National Taiwan University, for helpful discussions. This research was supported in part by NCI/NIH Cancer Center Grant P30 CA054174-17 (Y.C.), NIH/NCRR 1UL1RR025767-01 (Y.C.), 95HM0033195R0066-BM01-01 (E.Y.C.) and 98R0065 (E.Y.C.) from National Taiwan University, and DOH98-TD-G-111-003 (E.Y.C.) from the Department of Health, Executive Yuan, ROC (Taiwan). The funding agencies had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## Authors’ Affiliations

## References

- Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics. 2007, 8 (8): 639-646. 10.1038/nrg2149.PubMedView Article
- Monni O, Barlund M, Mousses S, Kononen J, Sauter G, Heiskanen M, Paavola P, Avela K, Chen Y, Bittner ML, Kallioniemi A: Comprehensive copy number and gene expression profiling of the 17q23 amplicon in human breast cancer. PNAS. 2001, 98 (10): 5711.-10.1073/pnas.091582298.PubMedPubMed CentralView Article
- Wolf M, Mousses S, Hautaniemi S, Karhu R, Huusko P, Allinen M, Elkahloun A, Monni O, Chen Y, Kallioniemi A, Kallioniemi OP: High-resolution analysis of gene copy number alterations in human prostate cancer using CGH on cDNA microarrays: impact of copy number on gene expression. Neoplasia (New York, NY). 2004, 6 (3): 240-10.1593/neo.03439.View Article
- Pinkel D, Albertson DG: Array comparative genomic hybridization and its applications in cancer. Nature Genetics. 2005, 37: S11-S17. 10.1038/ng1569.PubMedView Article
- Davies JJ, Wilson IM, Lam WL: Array CGH technologies and their applications to cancer genomes. Chromosome research. 2005, 13 (3): 237-248. 10.1007/s10577-005-2168-x.PubMedView Article
- Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics. 2005, 645-
- Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis. 2004, 90: 132-153. 10.1016/j.jmva.2004.02.008.View Article
- Marioni JC, Thorne NP, Tavare S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics. 2006, 22 (9): 1144-10.1093/bioinformatics/btl089.PubMedView Article
- Pique-Regi R, Monso-Varona J, Ortega A, Seeger RC, Triche TJ, Asgharzadeh S: Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics. 2008, 24 (3): 309-10.1093/bioinformatics/btm601.PubMedPubMed CentralView Article
- Wu LY, Chipman HA, Bull SB, Briollais L, Wang K: A Bayesian segmentation approach to ascertain copy number variations at the population level. Bioinformatics. 2009, 25 (13): 1669-10.1093/bioinformatics/btp270.PubMedView Article
- Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004, 5 (4): 557-10.1093/biostatistics/kxh008.PubMedView Article
- Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005, 21 (22): 4084-10.1093/bioinformatics/bti677.PubMedView Article
- Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005, 21 (19): 3763-10.1093/bioinformatics/bti611.PubMedPubMed CentralView Article
- Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, Teague JW, Menzies A, Goodhead I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C, Durbin R, Hurles ME, Edwards PAW, Bignell GR, Stratton MR, Futreal PA: Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature genetics. 2008, 40 (6): 722-729. 10.1038/ng.128.PubMedPubMed CentralView Article
- Deshmukh H, Yeh TH, Yu J, Sharma MK, Perry A, Leonard JR, Watson MA, Gutmann DH, Nagarajan R: High-resolution, dual-platform aCGH analysis reveals frequent HIPK2 amplification and increased expression in pilocytic astrocytomas. Oncogene. 2008, 27 (34): 4745-4751. 10.1038/onc.2008.110.PubMedView Article
- Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23 (6): 657-10.1093/bioinformatics/btl646.PubMedView Article
- Network TCGA: Integrated genomic analyses of ovarian carcinoma. Nature. 2011, 474: 609-10.1038/nature10166.View Article
- Kim TH, White H: On more robust estimation of skewness and kurtosis. Finance Research Letters. 2004, 156-73.
- Hardin J, Wilson J: A note on oligonucleotide expression values not being normally distributed. Biostatistics. 2009, 10 (3): 446-10.1093/biostatistics/kxp003.PubMedView Article
- Andreev A, Kanto A, Malo P: Simple approach for distribution selection in the Pearson system. Helinski School of Economics-Electronic Working Papers. 2005, 388: 22-
- Stuart A, Ord JK: Kendall's advanced theory of statistics. Vol. 1 Distribution theory. 1994, Hodder Arnold
- Embrechts P, Schmidli H: Modelling of extremal events in insurance and finance. Mathematical Methods of Operations Research. 1994, 39: 1-34. 10.1007/BF01440733.View Article
- Jenkinson AF: The frequency distribution of the annual maximum (or minimum) values of meteorological elements. Quarterly Journal of the Royal Meteorological Society. 1955, 81 (348): 158-171. 10.1002/qj.49708134804.View Article
- Von Mises R: La distribution de la plus grande de n valeurs. Reprinted in Selected Papers Volumen II. American Mathematical Society, Providence, RI. 1954, 271-294.
- Gilleland E, Katz RW: Tutorial for The 'Extremes Toolkit: Weather and Climate Applications of Extreme Value Statistics. [http://www.assessment.ucar.edu/toolkit,2005]
- Gupta C: Statistical properties of chaotic dynamical systems: extreme value theory and Borel-Cantelli Lemmas. PhD thesis. 2010, University of Houston

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.