Bayesian estimation of genomic copy number with single nucleotide polymorphism genotyping arrays
 Beibei Guo^{1},
 Alejandro Villagran^{2},
 Marina Vannucci^{1},
 Jian Wang^{3, 4},
 Caleb Davis^{3, 4},
 TszKwong Man^{3},
 Ching Lau^{3, 4} and
 Rudy Guerra^{1}Email author
DOI: 10.1186/175605003350
© Guerra et al; licensee BioMed Central Ltd. 2010
Received: 29 March 2010
Accepted: 30 December 2010
Published: 30 December 2010
Abstract
Background
The identification of copy number aberration in the human genome is an important area in cancer research. We develop a model for determining genomic copy numbers using highdensity single nucleotide polymorphism genotyping microarrays. The method is based on a Bayesian spatial normal mixture model with an unknown number of components corresponding to true copy numbers. A reversible jump Markov chain Monte Carlo algorithm is used to implement the model and perform posterior inference.
Results
The performance of the algorithm is examined on both simulated and real cancer data, and it is compared with the popular CNAG algorithm for copy number detection.
Conclusions
We demonstrate that our Bayesian mixture model performs at least as well as the hidden Markov model based CNAG algorithm and in certain cases does better. One of the added advantages of our method is the flexibility of modeling normal cell contamination in tumor samples.
Background
Gene dosage variations occur in many diseases, as well as in normal populations (e.g., [1, 2]). In cancer, copy number losses and gains are known to contribute to alterations in the expression of tumoursuppressor genes and oncogenes, respectively, see for example [3, 4]. Developmental abnormalities, such as Down, Prader Willi, Angelman and Cri du Chat syndromes, result from gain or loss of one copy of a chromosome or chromosomal region. Thus, detection and mapping of copy number abnormalities provide an approach for associating aberrations with disease phenotype and for identifying critical diseasecausing genes. As an example, Rendon et al. [5] constructed a firstgeneration copy number variation (CNV) map of the human genome through the study of 270 HAPMAP individuals from four populations with ancestry in Europe, Africa or Asia, [6]. A total of 1, 447 copy number variable regions (CNVRs), covering 360 megabases (i.e., 12% of the genome), were identified in this study. These CNVRs contained genes, disease loci, functional elements and segmental duplications.
DNA from the individuals in the study of [5] was analyzed for CNV using two technologies: singlenucleotide polymorphism (SNP) genotyping arrays, and comparative genomic hybridization (CGH). Arraybased Comparative Genomic Hybridization (aCGH) is a molecularcytogenetic method for the analysis of DNA copy number changes [1]. The method is based on hybridization of fluorescently labeled tumor DNA and reference DNA on a microarray platform containing Bacterial Artificial Chromosome (BAC) clones or spotted DNA. As a gold standard, it is robust in identifying long segments of chromosomal alterations. However, although the resolution of aCGH has been improved, it is still not high enough to detect amplifications or deletions of relatively short segments, [7] and [5]. The highdensity SNP array, which can accommodate hundreds of thousands of SNP probe sets simultaneously, is an alternative approach to detect genome wide copy number aberrations which has much higher resolution than CGH, see [8]. Compared to CGH, SNP array based experiments are newer and are becoming more popular for copy number analysis.
A number of statistical methods have been proposed to estimate copy numbers from various platforms. Two of the most popular methods for SNP arrays are dchip and Copy Number Anlyser for GeneChip (CNAG). Zhao et al. [9] proposed dChip, an algorithm that derives modelbased estimates of SNP copy numbers that incorporate probe effects and a hidden Markov model (HMM) to infer integervalued copy numbers. Although the current version of the dChip software can accommodate the newer SNP arrays, such as the Affymetrix 250K array, it is not optimized for it. Nannya et al. [10] developed the CNAG algorithm, which accounts for the length and GC content of the PCR products. Accounting for the length and content of GC elements seems to improve copy number inference [10]. Another source of variation that can affect a copy number analysis is the socalled "genome wave" [11, 12], a genomewide spatial autocorrelation pattern in signal intensity. Since the genome wave may be confounded with the copy number profile across a chromosome, investigators should examine their intensity data for its presence and adjust the data accordingly. Since the genomic wave [11] is thought to be in large part due to GC content, the CNAG algorithm can also be thought of as an adjustment for wave effects possibly present in SNP array data. Again, an HMM is used to infer integer copy numbers. The HMM approach can also be found in the algorithms underlying QuantiSNP [13] and PennCNV [14], both of which use the logRratio and Ballele frequency to infer the copy number state of each SNP. These two methods consider a sixstate Markov model which distinguishes copyneutral loss of heterozygosity from the normal state. Most HMM based algorithms use the Viterbi algorithm [15] to infer integer copy numbers.
To date, there are a handful of Bayesian methods for copy number inference. Most are for CGH data, but a few exist for SNP data. Rueda and DiazUriarte [16] proposed RJaCGH, a nonhomogeneous HMM in a Bayesian context for CGH data. Instead of prespecifying the number of states as a conventional HMM, a reversible jump Markov Chain Monte Carlo (MCMC) method is used to allow for varying numbers of hidden states. Bayesian model averaging is used to obtain final estimates. PiqueRegi et al. [17] developed a method called Genome Alteration Detection Algorithm (GADA) that is based on sparse Bayesian learning [18]. The approach takes advantage of the a priori assumption that the number of copy number alterations (break points) is sparse with respect to the number of probes. As with several other methods, advantage is also taken of the fact that the copy number pattern across a chromosome can be modeled as a piecewise constant function or vector. The GADA output gives copy number results in the form of a segmentation, viz., a collection of ordered segments defined by their breakpoints and amplitudes. To obtain integervalued copy numbers or alteration status (loss, normal, gain), the estimated segments must be analyzed by a thresholding procedure, such as Huang et al. [19]. GADA can be applied to both CGH and SNP based data. Rancoita et al. [20] also make use of piecewise constant modeling in their algorithm, mBPCR, which is a modification of the original Bayesian Piecewise Constant Regression (BPCR) method developed by Hutter [21]. This method is general for data that take the form of a piecewise constant function with unknown segment numbers, boundaries, and levels. Rancoita et al. illustrate the mBPCR method using SNP data, but it appears that logratios based on CGH data can also be analyzed.
In addition to those described above, several other statistical methods have been developed for copy number analysis. They vary in their assumptions, inference (segmentation, alteration status, integer copy number), platform (CGH, SNP), input data (e.g., CEL files or generic normalized logratio), and software implementation (e.g., commercial, webbased, customized academic program). Winchester et al. [22] describe and compare a number of methods. No method stands out as uniformly best and Winchester et al. suggest analyzing copy number data with at least two different methods to assess consistency and robustness of results. In this paper, we could not consider many of those methods for performance comparison because most of these algorithms do not estimate integer copy numbers.
Most of the copy number methods assume normalized logratios as input. Relatively few include adjustments for known factors affecting inference. GC content and fragment length have been mentioned as factors affecting copy number inference. Another factor from tumor samples is normal cell contamination. Indeed, most tumor samples are heterogeneous and include both cancer cells (with copy number aberrations) and normal cells. The larger percentage of normal cells present, the more difficult it is to infer copy number aberrations in the tumor cells; the logratios tend to shrink to the null value of zero. None of the above methods implement an adjustment for normal cell contamination. Below we show how our proposed method can account for this factor.
Here we propose a Bayesian spatial normal mixture model for inferring SNPbased integer copy number. Bayesian mixture models were used by [23] for CGHbased copy number estimation. There the authors considered a threestate (loss/normal/gain) mixture and introduced a spatial structure to reflect correlated segments (e.g. BACs). Spatial correlation was induced through the weights of the mixture via Markov random fields. In our approach, instead of considering three states, we allow for an unknown number of mixture components and achieve inference using a reversible jump Markov chain Monte Carlo method. As in [23] we use Markov random fields to account for correlated neighboring SNPs. In contrast to models that incorporate HMMs to infer integer copy numbers, our modeling approach uses information (neighboring SNPs) on both sides of a SNP. In addition, we account for cell contamination by shrinking the theoretical copy number logratios towards zero. The implementation only requires ordered (normalized) logratios and, therefore, may be applied to data from any platform suitable for copy number estimation. In Section 2 we present the model and method of inference. Section 3 reports on a simulation study and application to real data. The real data study includes cases where cytogenetics has shown large regions of gain or loss and we also show novel smaller regions detected by our algorithm. The new aberrations are validated by CGH and/or PCR. A discussion is given in Section 4 and an Appendix provides details on the MCMC algorithm.
Methods
Model
where ϕ is a scaling factor specified by the user. In the simulation study of Section 3 we investigate robustness of the results to different values ϕ of and varying number of neighbors.
Prior distributions
In this section we discuss the prior distributions for the model parameters, including the number k of mixture components, the normal mixture means and variances, and the smoothing parameter h.
1. Number of mixture components, k
with k_{ max } a prespecified large integer. We take k_{ max } = 7 for illustration purposes, corresponding to copy numbers 0, 1, 2, 3, 4, 5, and > 5. Here 7 is arbitrary, and we can use any positive value that makes sense for the data under consideration.
2. Normal mixture means
We deviate from the approach of [24] by constructing k_{ max } uniform distributions, {ν_{ j } = U (a_{ j }, b_{ j }), j = 1,...,k_{ max }}, and assuming that each component mean μ_{ j } follows one of these uniform distributions independently. The uniform interval boundaries are very important. We choose the intervals to be nonoverlapping and to contain the theoretical copy number values. According to [10], the observed mean values for the 7 components without contamination are approximately 1.24, .49, 0, .365, .657, .899 and 1.106 for copy numbers 0, 1, 2, 3, 4, 5 and > 5, respectively. In this paper results were obtained using the following intervals: (2, .8), (.6, .25), (.05, .05), (.15, .4), (.45, .66), (.75, .9), (.95, 1.3), corresponding to copy numbers 0, 1, 2, 3, 4, 5, > 5, respectively. These intervals are the default values we used in the application and have worked well in most cases. Our results did not show sensitivity to the actual values we used for the extremes of the intervals, i.e., other disjoint sets of intervals worked well too.
Remark 1: Due to normal cell contamination, the true logratios tend to shrink toward zero, and in practice some degree of normal cell contamination tends to be present. We thus decided to center the uniform distributions closer toward the null value of zero rather than at the theoretical means given above, except for CN = 0 and CN > 5. These exceptions are largely due to where we wanted to locate the respective uniform support; see Remark 3 below.
Remark 2: Moving the uniform intervals closer to zero resulted in some of the theoretical means being located close to a uniform boundary. For example, for CN = 5, the theoretical mean of .899 is just inside the right boundary of .9. This does not cause a problem of misclassification since normal cell contamination brings the mean closer to the left boundary.
Remark 3: We also varied the length of the uniform intervals since the log scale makes the consecutive theoretical values become increasingly closer to each other; the consecutive pairwise distances between the theoretical means from 1.24 to 1.106 are .75, .49, .365, .292, .242, .207. If the uniform intervals were forced to be of equal length we would have either relatively short nonoverlapping intervals or overlapping long intervals. Since the uniform intervals are not of equal length, the gaps between the intervals are unequal, as well.
for any copy number j, with background factor b, see [10], and then choosing the length of the intervals so that the k_{ max } intervals are nonoverlapping.
3. Normal mixture variances
We assign an inverse gamma prior distribution to ${\sigma}_{j}^{2}$. In the application we center this distribution on 0.2 and induce a vague specification by letting the variance be large.
4. Smoothing parameter
We assign h a uniform distribution with a wide range, h ~ U (0, h_{ max }), with h_{ max } = 1000, 000, to induce smooth realizations.
We provide further discussion of these prior selections below in Section 3 in the context of the simulations and real data applications.
Posterior inference

Updating k: This step causes creation or deletion of components, therefore requiring the sampler to jump between subspaces with different dimensions. To implement the sampler, we use reversible jump MCMC (RJMCMC), see [25] and [26]. We update k' = k + 1 with probability b_{ k }, and k' = k  1 with probability $1{b}_{k}({b}_{1}=1,\text{}{b}_{{k}_{max}}=0,\text{}{b}_{k}=.5\text{}\text{for}\text{}k=2,\dots ,{k}_{max}1)$. If k'= k + 1, we draw a new component from the remaining k_{ max }  k components with equal probability, and draw μ_{*} from the corresponding uniform distribution. We also draw ${\sigma}_{*}^{2}$ and x_{*} from the prior distributions. We then increase the dimensions of the vector parameters μ'= (μ, μ_{*}), ${\sigma}^{{2}^{\prime}}=({\sigma}^{2},{\sigma}_{*}^{2})$, and x'= (x, x_{*}) and accept the new component with probability:$\begin{array}{l}min(1,\frac{(1{b}_{k+1})\frac{1}{k+1}p(k+1)}{{b}_{k}\frac{1}{{k}_{max}k}p(k)}\\ \text{}\times {\displaystyle \prod _{i=1}^{n}\frac{{\displaystyle {\sum}_{j=1}^{k+1}{{\omega}^{\prime}}_{ij}}N({y}_{i}\text{}\text{}{{\mu}^{\prime}}_{j},{{\sigma}^{\prime}}_{j}^{2})}{{\displaystyle {\sum}_{j=1}^{k}{\omega}_{ij}}N({y}_{i}\text{}\text{}{\mu}_{j},{\sigma}_{j}^{2})}})\end{array}$(6)If k' = k  1, we instead randomly pick a component from the discrete uniform distribution on {1,...,k} and remove ${\mu}_{*},{\sigma}_{*}^{2},{\mathbf{x}}_{*}$ from μ, σ^{2}, x. Similarly, the acceptance probability is$\begin{array}{l}min(1,\frac{{b}_{k1}\frac{1}{{k}_{max}(k1)}p(k1)}{(1{b}_{k})\frac{1}{k}p(k)}\\ \text{}\times {\displaystyle \prod _{i=1}^{n}\frac{{\displaystyle {\sum}_{j=1}^{k1}{{\omega}^{\prime}}_{ij}}N({y}_{i}\text{}\text{}{{\mu}^{\prime}}_{j},{{\sigma}^{\prime}}_{j}^{2})}{{\displaystyle {\sum}_{j=1}^{k}{\omega}_{ij}}N({y}_{i}\text{}\text{}{\mu}_{j},{\sigma}_{j}^{2})}})\end{array}$(7)

Updating x: We update each location using a MetropolisHastings step, see [27] and [28]. We perform these n updates sequentially, i.e., we update (x_{11},...,x_{1k}) first, then update (x_{21},...,x_{2k}), etc.. For each location i, we use a proposal distribution of the type$\prod _{j=1}^{k}N\left({{x}^{\prime}}_{ij}\frac{h{\displaystyle {\sum}_{{i}^{\prime}\sim i}{x}_{{i}^{\prime}j}}}{1+h{n}_{i}},\frac{1}{h{n}_{i}}\right)$where n_{ i } is the number of neighbors at location i. The acceptance probability is$min\left(1,\frac{{\displaystyle {\sum}_{j=1}^{k}{{\omega}^{\prime}}_{ij}}N({y}_{i}{\mu}_{j},{\sigma}_{j}^{2})}{{\displaystyle {\sum}_{j=1}^{k}{\omega}_{ij}}N({y}_{i}{\mu}_{j},{\sigma}_{j}^{2})}\right)$
where ω'are the weights associated to the proposed x.

Updating h: We use a MetropolisHastings random walk with a proposal defined by a truncated normal distribution, ${h}^{\prime}~TN(h,{\sigma}_{h}^{2})I(0\le {h}^{\prime}\le {h}_{max})$. In applications we chose σ_{ h }to have acceptance ratios between 40% and 70%.

Updating allocations: Using a Gibbs step, we draw the n allocations independently from$\begin{array}{r}p({z}_{i}=jy,k,\mathit{\mu},{\mathit{\sigma}}^{2},\mathbf{x},h,\varphi )\\ \propto {\omega}_{ij}N({y}_{i}{\mu}_{j},{\sigma}_{j}^{2})I[j\in \left\{1,\dots k\right\}].\end{array}$(8)

Updating μ, σ^{2},: For each j, we consider the k_{ max }intervals and select that one with largest posterior probability, then sample μ_{ j }from a normal distribution truncated at this interval. In the iterations it may happen that two or more ${{\mu}^{\prime}}_{j}s$ are sampled to the same interval. In this case, we combine these components and update k. The new μ_{ j }, σ_{ j }, x_{ j }, for the new formed component are taken to be the weighted sum of the original ones by the sample size. We then redefine z and calculate ω. We draw ${\sigma}_{j}^{2}$ from its full conditional. See Appendix for the forms of the full conditionals.
For posterior inference, the primary parameters of interest are the weights ω ' s. We propose an allocation rule as follows: at each iteration we record the probability of each SNP to belong to each of the k_{ max } components (we assign zero if a component is empty). After the MCMC is done, we average all the ω ' s and assign a SNP to the component that has the largest probability. We check reproducibility of the clustering with different starting values.
The runtimes of the various copy number algorithms can range from less than a minute to days depending on the algorithm and the probe density of the array platforms. When applied to newer highdensity arrays almost all methods have relatively high runtimes [17]. Reversible jump MCMC methods such as ours and RJaCGH [16] tend to be computationally expensive. Our current implementation may require several hours to > 1 day per chip. However, our current version is implemented in MatLab and we have not attempted to optimize the code. Programming in some version of C and parallel computing by chromosome and/or chromosome arm will likely significantly reduce the time.
Results and Discussion
Simulation Study
We first investigate the performance of our model through simulation experiments. In the next Section we compare our method with two alternative methods in the context of actual tumor samples from leukemia and ependymoma cancers.
We conducted two sets of simulations studies. The first set was designed to examine the influence of hyperparameters in the prior specifications: the scaling parameter, ϕ, of the logistic transformation for the GMRF and the number of smoothing neighbors, nb. Based on the results of the first set of experiments we then conducted a second set of experiments by setting these two parameters at fixed (default) values in order to assess performance of our algorithm. For scenarios with no contamination the logratios corresponding to copy number j were independently drawn from a normal distribution with the corresponding theoretical mean for copy number j and a standard deviation chosen to achieve a given SNR. For scenarios with contamination the logratios corresponding to copy number j were independently drawn from a normal distribution with mean ${\mathrm{log}}_{2}\frac{j(1p)+2p+b}{2+b}$, where p is the percentage of contamination and b is the background factor, see [10].
In the first set of simulation studies we found that a small range of ϕ was suitable over different configurations. In particular, we investigated sensitivity by choosing different values in the range {.005, .01, .5, .1}. For the number of neighboring SNPs (on either side) over which to smooth in the GMRF the two values, 1 and 4 for of total of 2 or 8 neighbors for each SNP. Boundary SNPs at the ends of the chromosomes simply used fewer SNPs. Results and discussion from these sensitivity studies are reported in [Additional file 1]. Based on the results of the first set of experiments we then conducted a second set of experiments by setting these two parameters at fixed values, ϕ = 0.01 and nb = 4, and varying the signaltonoise ratio and location of the copy number breakpoints. We also varied the number of SNPs constituting the aberration regions. In all simulations, the standard deviation (σ_{ h }) of the proposal distribution to update the smoothing parameter, h, was chosen so that acceptance ratios would be between 40% and 70%. For all cases reported we used 50, 000 sampling draws for inference after a 50, 000 iteration burnin period.
Misclassification rates from simulation study.
CN  # SNP  Scenario 1 .05/7.3/0  Scenario 2 .15/2.4/0 

3  5  0  16 
1  10  0  2 
3  20  0  5 
3  40  0  1 
CN  # SNP  Scenario 3 .2/1.8/0  Scenario 4 .2/1.5/20 
3  5  51  50 
1  10  11  37 
3  20  9  11 
3  40  6  6 
CN  # SNP  Scenario 5 .05/7.3/0  Scenario 6 .15/2.4/0 
4  5  6  87 
3  10  0  14 
0  20  0  0 
3  40  0  9 
CN  # SNP  Scenario 7 .2/1.8/0  Scenario 8 .2/1.5/20 
4  5  77  98 
3  10  31  31 
0  20  0  0 
3  40  17  4 
False negative rates from simulation study.
CN  # SNP  Scenario 1 .05/7.3/0  Scenario 2 .15/2.4/0 

3  5  0  16 
1  10  0  2 
3  20  0  5 
3  40  0  1 
CN  # SNP  Scenario 3 .2/1.8/0  Scenario 4 .2/1.5/20 
3  5  47  50 
1  10  11  37 
3  20  5  11 
3  40  3  6 
CN  # SNP  Scenario 5 .05/7.3/0  Scenario 6 .15/2.4/0 
4  5  0  0 
3  10  0  7 
0  20  0  0 
3  40  0  1 
CN  # SNP  Scenario 7 .2/1.8/0  Scenario 8 .2/1.5/20 
4  5  4  6 
3  10  20  31 
0  20  0  0 
3  40  4  4 
False positive rates from simulation study.
CN  # SNP  Scenario 1/5 .05/7.3/0  Scenario 2/6 .15/2.4/0 

2  10  0  3 
2  50  0  1 
CN  # SNP  Scenario 3/7 .2/1.8/0  Scenario 4/8 .2/1.5/20 
2  10  7  9 
2  50  2  2 
The misclassification rate reported is defined as P (CN ≠ j  true CN = j), for j ≠ 2. For the special case j = 2 we obtain the falsepositive rate, FP = P (CN ≠ 2  true CN = 2). The falsenegative rate is defined as the chance of a true loss or gain classified as a normal copy number, FN = P (CN = 2  true CN ≠ 2).
We do not find it very useful to cite global rates since each depends on several factors, including the true CN, signaltonoise ratio (SNR), normal cell contamination, and number of SNPs within the CNA region. We therefore report misclassification, falsenegative and false positive rates given various combinations of these parameters. Other authors (e.g., [17]) define performance accuracy by breakpoint detection. This results in slightly different definitions of falsepositive and falsenegative rates than we do here. Since our model is based on mixture components corresponding to integer copy numbers it makes more sense for us to consider more specific falsenegative and falsepositive rates. As shown below, these rates also depend on factors other than true copy number.
A number of authors have used the simulation data of Willenbrock and Fridlyand [29] to assess their proposed copy number algorithms for aCGH data. Willenbrock and Fridlyand simulated CGH data using real breast cancer data. Their simulation parameters were deduced from the profiles of 145 breast tumor array CGH samples estimated with DNAcopy. For each sample, both the logratios (which emulate the aGCH data) and true copy number data are provided. One concern with these data is that they can be less noisy than real SNP data, since they emulate aCGH data. Since we are specifically interested in how well SNP data performs, we therefore chose to generate our own simulation data. We also note that our simulations generated simple text files of logratios. Therefore, we were unable to compare our method to those whose software implementation requires special data files, such as Affymetrix CEL files. Certain methods for CGH data, such as CBS, DNAcopy and GLAD, only require normalized data. However, these methods are for the threestate (gain, loss, normal) inference. In our simulation study we have regions of discrete copy numbers (0, 1, 2, 3, 4) so the results would not be comparable. The real data studies, however, did allow for such comparisons as we were able to obtain logratios from their analysis. In short, the simulations were for assessing our own method and the real data with validation were for performance assessment under real conditions and comparative purposes.
Table 1 shows misclassification rates (%) for eight different scenarios. Tables 2 and 3 show falsenegative and falsepositive rates, respectively. We first discuss the misclassification (MC) rates in Table 1.
Table 2 shows falsenegative rates. Except for minor differences, the falsenegative rates for Scenarios 14 are the same as the broader misclassification rates (Table 1). This shows that most of the misclassifications in Scenarios 14 were losses and gains that were called normal. Where there are differences between Tables 1 and 2, we see that misclassification rates are at least as large as the falsenegative rates as we would expect. It is worth noting that the aberrations studied in Scenarios 14 are neighbors of normal copy number, viz., CN = 3 is one additional copy and CN = 1 is one less copy. As such, it is not too surprising that the misclassification rates agreed with the falsenegative rates. Especially in the presence of normal cell contamination we expect the logratios to regress toward the mean value of 0. This is contrast to Scenarios 58, which include more extreme aberrations of CN = 0 and CN = 4. Comparing the misclassification rates (Table 1) with the corresponding FN rates (Table 2), we see that the latter can be much smaller than the former. Large differences of MC vs FN rates are seen for CN = 4 in Scenarios 6 (87% vs 0%), 7 (77% vs 4%), and 8 (98% vs 6%). Taken together this implies that almost all of the misclassifications for CN = 4 were called as CN = 3 and very few as CN = 2. A manual calculation of the calls confirms this conclusion. Smaller differences between MC and FN rates occur in Scenario 6 with CN = 3 and 10 SNPs (line 2, Tables 1 and 2); the MC rate is 14% and the FN rate is 7%. Here, half of the 14% is due to normal calls and the other half to calls of CN = 4. In Scenario 7 with true CN = 3 and 10 SNPs the MC rate of 31% is 20% CN = 2 (falsenegative) and 11% CN = 4. Similarly, the MC rates of 9% and 17% for Scenarios 6 and 7 with CN = 3 and 40 SNPs (line 4, Tables 1 and 2), respectively, are only due to false calls of CN = 2 and CN = 4. It is, therefore, seen that when a true copy number of 3 is misclassified, it tends to be called a CN = 4 with a smaller percentage of normal calls, CN = 2. And, as discussed above, a true CN = 4 tends to be called a 3 when misclassified. In this sense, if an investigator is only calling loss/normal/gain, even though misclassifications occur under true copy numbers of 3 and 4, they would both be correctly called as gains with a small percentage of CN = 2 (falsenegative) calls. This is at least the behavior of the Bayes mixture model; other methods may apportion the misclassifications differently. In all scenarios (18) we observe a misclassification rate and a falsenegative rate of 0% for CN = 0 and 20 SNPs. No matter the signaltonoise ratio, the distribution of logratios for CN = 0 is well separated from the other copy number distributions and its call is constantly correct. For CN = 1, the misclassification rates and corresponding falsenegative rates are equal, showing that when misclassified this copy number is called a normal (falsenegative). Table 3 shows falsepositive (FP) rates defined as a true normal copy number being classified as a gain or loss: P(CN ≠ 2  CN = 2). Since the two patterns of copy number structure differed only in their gain and loss patterns we combined the data for the normal copy number segments. Thus the FP rates are based on 100 replicates instead of 50 as with the MC and FN rates in Table 1. As with the FN rate, for a fixed number of SNPs defining the normal segment, the FP positive rate increases with decreasing SNR. And, for a given combination of SNR and normal cell contamination, the FP rate decreases with an increasing number of SNPs in the segment. Under the most difficult configuration considered, 10 SNPs with a SNR of 1.5 and 20% contamination, the falsepositive rate was only 9%.
Rancoita et al. [20] compared their mBPCR method with six other methods and found that in general no method, including their own, was able to detect aberrations of width 510 probes. Lai et al. [30] reached similar conclusions. Use of alternative estimators for a certain covariance parameter led to the detection of these smaller segments, but this was accompanied by dividing larger segments into subsegments. Our method, too, had trouble with regions defined by only 5 probes, although regions with 10 probes were fairly well identified unless the signaltonoise ratio was on the order of 1.5.
Real Data Application
To further assess the Bayes mixture model, we analyzed Affymetrix 250K array data from cancer patients who had either leukemia or ependymoma. Data were obtained from Texas Children's Hospital, Houston, TX. In addition to comparing our results with the popular CNAG software, we also provide biological validation. Some of these cases, in fact, had karyotyping and FISH data for validation. Others were validated using quantitative PCR on selected regions and aCGH data. We also briefly comment on comparisons with some results from the PennCNV algorithm.
One important feature of the CNAG software used to estimate copy number is the fact that it adjusts the observed logratios for variation in GC content across the probes. Integer copy numbers are subsequently inferred from the GC adjusted logratios using a hidden Markov model. To make comparable comparisons between the Bayes and the CNAG methods we applied the Bayes model to the GC adjusted logratios from CNAG. One relatively recent issue arising in the analysis copy number aberration detection is the socalled "genome wave" [11, 12], a genomewide spatial autocorrelation pattern in signal intensity data that may be confounded with the copy number profile across a chromosome. As a result the genome wave may lead to inflated falsepositive rates in copy number calls. The genome wave has been consistently detected in both CGH and SNP based platforms. Diskin et al. [12] and the references therein describe possible genomic features underlying the wave effect and preprocessing methods to remove the wave effect prior to the analysis of copy number. It has been fairly well established that an adjustment for GC content largely removes the wave effect from the signal intensities [12]. Since we are using GC adjusted logratios from CNAG for the real data application we did not expect to observe a wave effect in our data and indeed none was present as shown in [Additional file 1].
We performed additional comparisons with the PennCNV algorithm. For the chromosome 6 region from case 688, the region from about 99 Mb to 118 Mb was identified as copy number 1 by our mixture model, while PennCNV only detected four very short regions inside the region we identified. For the chromosome 12 region from case 1065, the region 45120 Mb is identified as a loss by our method while PennCNV gives 2 very noisy results with most part of this region identified as CN = 2. Also, as we validated, the tail region of this chromosome is a gain, while PennCNV does not detect it. Finally, for the chromosome 9 region from case 688, our mixture model detected the region from about 20.5 Mb to 22.5 Mb as copy number 1, while PennCNV only detected a small part of it (from 20.521 Mb). We report plots from the PennCNV algorithm in the [Additional file 1].
Conclusions
The arraybased comparative genomic hybridization microarray is a widely accepted method for estimating genomic copy number. As the CGH BACs are relatively large segments, the CGH estimates tend to be robust. On the other hand, the large segments do not allow detection of small CNVs. The SNP genotyping array provides an alternative to CGH, which is expected to identify genomic alterations with a higher resolution. Most SNP array algorithms use a hidden Markov model to infer integer copy numbers, and the component means tend to be set at the theoretical values. However, due to normal cell contamination, which occurs in most tumor samples, logratios can be shrunk toward zero, indicating a normal copy number. Consequently, in the presence of a high percentage of contamination, losses or gains may not be detectable. As of this writing, we are not aware of existing algorithms that account for this problem.
We have developed a Bayesian spatial normal mixture model to estimate copy number for SNP array platforms where the means of the components accommodate cell contamination. By using neighboring copy number information on either side of each SNP locus we can generate smoother maps than those based on HMMs. We have shown with a simulation study that our algorithm can detect both long and short segments quite precisely. Our results do not show sensitivity to different values of the scaling factor ϕ in the prior distribution and to the number of neighbors as long as ϕ is chosen to be small enough. By applying our method to real cancer data, we have demonstrated that our algorithm can do as well as CNAG, a very popular and accurate algorithm used with SNP arrays, and in certain cases performs better. In addition, our algorithm provides smoother realizations than CNAG. The Bayesian mixture model could be extended in a few ways. To more precisely smooth over neighboring probes, it would be helpful to account for interprobe distance perhaps as a weighting factor when averaging neighboring information. The logratio copy number means could also be included as parameters with priors reflecting knowledge of normal cell contamination.
Appendix
We report here full details of the derivations for the MCMC algorithm.
which gives (6). Similar derivations hold for (7).
with $A=\frac{1}{2}{N}_{j}+{\alpha}_{{\sigma}^{2}}$ and $B=\frac{1}{2}{\displaystyle {\sum}_{i:{z}_{i}=j}{({y}_{i}{\mu}_{j})}^{2}+{\beta}_{{\sigma}^{2}}}$.
Declarations
Acknowledgements
The authors wish to acknowledge the following funding sources: National Institutes of Health/National Cancer Institute (R21/R33 CA97874 and R01CA109467 to C.C.L.); National Library of Medicine Training Grant (5T15LM0709317 to C.D.); National Institutes of Health/National Human Genome Research Institute (R01HG003319 partially to M.V.); NSF award (DMS1007871 to M.V.); Wendy Will Case Cancer Fund (partially to T.K.M.).
Authors’ Affiliations
References
 Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo W, Chen C, Zhai Y, Dairkee S, Ljung B, Gray J, Albertson D: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet. 1998, 20: 207211. 10.1038/2524.PubMedView ArticleGoogle Scholar
 Wang J, Man T, Wong K, Rao P, Leung H, Guerra R, Lau C: Genomewide analysis of copy number variations in normal population identified by SNP arrays. The Open Biology Journal. 2009, 2: 5465. 10.2174/1874196700902010054.View ArticleGoogle Scholar
 Knuutila S, Bjorkqvist AM, Autio K, Tarkkanen M, Wolf M, Monni O, Szymanska J, Larramendy ML, Tapper J, Pere H, ElRifai W, Hemmer S, Wasenius VM, Vidgren V, Zhu Y: DNA copy number amplifications in human neoplasms: review of comparative genomic hybridization studies. Am J Pathol. 1998, 152: 110723.PubMed CentralPubMedGoogle Scholar
 Knuutila S, Aalto Y, Autio K, Bjorkqvist AM, ElRifai W, Hemmer S, Huhta T, Kettunen E, KiuruKuhlefelt S, Larramendy ML, Lushnikova T, Monni O, Pere H, Tapper J, Tarkkanen M, Varis A, Wasenius VM, Wolf M, Zhu Y: DNA copy number losses in human neoplasms. Am J Pathol. 1999, 155: 68394.PubMed CentralPubMedView ArticleGoogle Scholar
 Redon R, Ishikawa S, Fitch K, Feuk L, Perry G, Andrews T, Fiegler H, Shapero M, Carson A, Chen W, Cho E, Dallaire S, Freeman J, Gonzalez J, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald J, Marshall C, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville M, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad D, Estivill X, TylerSmith C, Carter N, Aburatani H, Lee C, Jones K, Scherer S, Hurles M: Global variation in copy number in the human genome. Nature. 2006, 444: 444454. 10.1038/nature05329.PubMed CentralPubMedView ArticleGoogle Scholar
 Consortium IH: The International HapMap Project. Nature. 2003, 426: 78996. 10.1038/nature02168.View ArticleGoogle Scholar
 Toruner G, Streck D, Schwalb M, Dermody J: An oligonucleotide based arrayCGH system for detection of genome wide copy number changes including subtelomeric regions for genetic evaluation of mental retardation. Am J Med Genet A. 2007, 143: 8249.View ArticleGoogle Scholar
 Bignell GR, Huang J, Greshock J, Watt S, Butler A, West S, Grigorova M, Jones KW, Wei W, Stratton MR, Futreal PA, Weber B, Shapero MH, Wooster R: Highresolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res. 2004, 14: 28795. 10.1101/gr.2012304.PubMed CentralPubMedView ArticleGoogle Scholar
 Zhao X, Li C, Paez J, Chin K, Janne P, Chen T, Girard L, Minna J, Christiani D, Leo C, Gray J, Sellers W, Meyerson M: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res. 2004, 64: 30603071. 10.1158/00085472.CAN033308.PubMedView ArticleGoogle Scholar
 Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using highdensity oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res. 2005, 65: 60716079. 10.1158/00085472.CAN050465.PubMedView ArticleGoogle Scholar
 Marioni JC, Thorne NP, Valsesia A, Fitzgerald T, Redon R, Fiegler H, Andrews TD, Stranger BE, Lynch AG, Dermitzakis ET, Carter NP, Tavare S, Hurles ME: Breaking the waves: improved detection of copy number variation from microarraybased comparative genomic hybridization. Genome Biol. 2007, 8: R22810.1186/gb2007810r228.PubMed CentralPubMedView ArticleGoogle Scholar
 Diskin SJ, Li M, Hou C, Yang S, Glessner J, Hakonarson H, Bucan M, Maris JM, Wang K: Adjustment of genomic waves in signal intensities from wholegenome SNP genotyping platforms. Nuc Acids Res. 2007, 36: e12610.1093/nar/gkn556.View ArticleGoogle Scholar
 Colella S, Yau C, Taylor JM, Mirza G, Butler H, Clouston P, Bassett AS, Seller A, Holmes CC, Ragoussis J: QuantiSNP: an Objective Bayes HiddenMarkov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Research. 2007, 35: 20132025. 10.1093/nar/gkm076.PubMed CentralPubMedView ArticleGoogle Scholar
 Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M: PennCNV: An integrated hidden Markov model designed for highresolution copy number variation detection in wholegenome SNP genotyping data. Genome Res. 2007, 17: 16651674. 10.1101/gr.6861907.PubMed CentralPubMedView ArticleGoogle Scholar
 Rabiner L: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. 1989, 77: 257286. 10.1109/5.18626.View ArticleGoogle Scholar
 Rueda OM, DiazUriarte R: Flexible and accurate detection of genomic copynumber changes from aCGH. PLOS Comput Biol. 2007, 3: 11151122. 10.1371/journal.pcbi.0030122.View ArticleGoogle Scholar
 PiqueRegi R, MonsoVarona J, Ortega A, Seeger RC, Triche TJ, Asgharzadeh S: Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics. 2008, 24: 309318. 10.1093/bioinformatics/btm601.PubMed CentralPubMedView ArticleGoogle Scholar
 Tipping M: Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res. 2001, 1: 211244. 10.1162/15324430152748236.Google Scholar
 Huang J, Wei W, Zhang J, Liu G, Bignell G, Stratton MR, Futreal PA, Wooster R, Jones KW, Shapero MH: Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Hum Genomics. 2004, 1: 287299.PubMed CentralPubMedView ArticleGoogle Scholar
 Rancoita PM, Hutter M, Bertoni F, Kwee I: Bayesian DNA copy number analysis. BMC Bioinformatics. 2009, 1: 119.Google Scholar
 Hutter M: Exact Bayesian regression of piecewise constant functions. Bayesian Anal. 2007, 2: 1635664.View ArticleGoogle Scholar
 Winchester L, Yau C, Ragoussis J: Comparing CNV detection methods for SNP arrays. Brief Funct Genomics. 2009, 8: 353366. 10.1093/bfgp/elp017.View ArticleGoogle Scholar
 Broet P, Richardson S: Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model. Bioinformatics. 2006, 22: 911918. 10.1093/bioinformatics/btl035.PubMedView ArticleGoogle Scholar
 Fernandez C, Green P: Modelling spatially correlated data via mixtures: a Bayesian approach. J R Stat Soc Series B Stat. 2002, 64: 805826. 10.1111/14679868.00362.View ArticleGoogle Scholar
 Green P: ReversibleJump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995, 82: 711732. 10.1093/biomet/82.4.711.View ArticleGoogle Scholar
 Richardson S, Green P: On Bayesian analysis of mixtures with an unknown number of components. J R Stat Soc Series B Stat. 1997, 59: 731792. 10.1111/14679868.00095.View ArticleGoogle Scholar
 Metropolis N, Rosenbluth AW, Rosenbluth M, Teller A, Teller E: Equations of state calculations by fast computing machine. J Chem Phys. 1953, 21: 10871091. 10.1063/1.1699114.View ArticleGoogle Scholar
 Hastings W: Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970, 57: 97109. 10.1093/biomet/57.1.97.View ArticleGoogle Scholar
 Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005, 21: 40844091. 10.1093/bioinformatics/bti677.PubMedView ArticleGoogle Scholar
 Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005, 19: 37633770. 10.1093/bioinformatics/bti611.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.