KC-SMARTR: An R package for detection of statistically significant aberrations in multi-experiment aCGH data
© de Ronde et al; licensee BioMed Central Ltd. 2009
Received: 20 August 2010
Accepted: 11 November 2010
Published: 11 November 2010
Most approaches used to find recurrent or differential DNA Copy Number Alterations (CNA) in array Comparative Genomic Hybridization (aCGH) data from groups of tumour samples depend on the discretization of the aCGH data to gain, loss or no-change states. This causes loss of valuable biological information in tumour samples, which are frequently heterogeneous. We have previously developed an algorithm, KC-SMART, that bases its estimate of the magnitude of the CNA at a given genomic location on kernel convolution (Klijn et al., 2008). This accounts for the intensity of the probe signal, its local genomic environment and the signal distribution across multiple samples.
Here we extend the approach to allow comparative analyses of two groups of samples and introduce the R implementation of these two approaches. The comparative module allows for a supervised analysis to be performed, to enable the identification of regions that are differentially aberrated between two user-defined classes.
We analyzed data from a series of B- and T-cell lymphomas and were able to retrieve all positive control regions (VDJ regions) in addition to a number of new regions. A t-test on segmented data, which we also implemented, located all the positive control regions and a number of new regions as well, but these regions were highly fragmented.
KC-SMARTR offers recurrent CNA and class specific CNA detection, at different genomic scales, in a single package without the need for additional segmentation. It is memory efficient and runs on a wide range of machines. Most importantly, it does not rely on data discretization and therefore maximally exploits the biological information in the aCGH data.
The program is freely available from the Bioconductor website http://www.bioconductor.org/ under the terms of the GNU General Public License.
Background and motivation
DNA copy number alterations (CNAs) in tumours are an important mechanism of deregulation of cancer genes. CNAs are a consequence of genomic instability, which is common in human cancers. Genome-wide analysis of CNAs by array-based Comparative Genomic Hybridization (aCGH) is possible on many different microarray platforms, including platforms based on bacterial artificial chromosome (BAC) clones, cDNA clones, SNPs and long oligonucleotides. Most of these platforms feature measurement points (probes) at specific positions on the genome, with a certain distance between consecutive probes.
Array CGH data generally consist of (log-transformed) ratios of the intensities of fluorescently labeled DNA from case (disease) versus normal diploid (2n) control samples, as measured by the probes on the array. Although single cell aCGH analysis is possible, most aCGH analyses are performed on samples derived from tissue that contains sub-populations of different cells. An aCGH measurement therefore reflects the average of the CNAs of the different sub-populations within the sample, and discretization of the data may lead to the loss of valuable biological information. KC-SMARTR does not discretize the data and uses the continuous signal to preserve all the information contained in the data. The software package allows unsupervised analysis to identify recurrent aberrations across samples as well as supervised analysis to identify regions that are differentially aberrated between user-defined classes of samples. These are two of the most commonly performed analyses on aCGH data, and KC-SMARTR combines them in one flexible, easy-to-use program.
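The averaging argument can be made concrete with a small calculation. The sketch below is illustrative only and is not part of KC-SMARTR: the function names and the symmetric calling threshold of 0.3 are assumptions chosen for the example, and Python is used purely for illustration. It computes the log2 ratio expected when only a fraction of the cells in a sample carries a single-copy gain, and then applies a hypothetical three-state discretization.

```python
import math


def expected_log2_ratio(fraction_aberrant, copies_aberrant, copies_normal=2):
    """Expected log2 ratio when only part of the sample carries the aberration."""
    mean_copies = ((1 - fraction_aberrant) * copies_normal
                   + fraction_aberrant * copies_aberrant)
    return math.log2(mean_copies / copies_normal)


def discretize(log2_ratio, threshold=0.3):
    """Hypothetical three-state call with a symmetric threshold (an assumption)."""
    if log2_ratio > threshold:
        return "gain"
    if log2_ratio < -threshold:
        return "loss"
    return "no-change"


pure = expected_log2_ratio(1.0, 3)    # every cell carries a single-copy gain
mixed = expected_log2_ratio(0.3, 3)   # only 30% of the cells carry it

print(round(pure, 3), discretize(pure))
print(round(mixed, 3), discretize(mixed))
```

Under these assumptions the subclonal gain produces a log2 ratio of roughly 0.2, which the hypothetical threshold would call "no-change", whereas a method working on the continuous signal still sees a positive value.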
To identify regions that are significantly aberrated, the KC-SMART method takes into account 1) the non-discretized signal intensity of a probe; 2) the strength of neighboring probes; and 3) the strength of the probe across multiple samples. These steps are performed separately for gains and losses. First, the probe intensities are summed across all samples. Next, kernel convolution is performed across the genome, along with locally weighted regression to account for unequally distributed probes. This results in a kernel-smoothed estimate of probe intensities, the 'KC score'. The size of the kernel determines the type of aberration that will be detected by the algorithm (see next section). Finally, the significance threshold is determined using a permutation-based approach, and significant aberrations are defined as the set of probes for which the KC score exceeds this threshold. The set of genomic scales ranging from the smallest to the largest kernel width is defined as the 'scale space'. The KC-SMART analysis is repeated for a selection of kernel widths from the scale space to reveal the aberrations that are significant at different genomic scales.
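The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the KC-SMARTR implementation: it uses a Gaussian kernel, replaces the locally weighted regression with a plain kernel-weighted average, and takes the 95th percentile of the permuted maxima as the threshold. All function names, the toy data and the parameter values are hypothetical. Only gains are shown; losses would be handled the same way using the negative part of the signal.

```python
import math
import random


def gaussian_kernel(distance, width):
    return math.exp(-0.5 * (distance / width) ** 2)


def summed_gains(samples):
    """Sum the positive part of every probe's log-ratio across samples."""
    return [sum(max(v, 0.0) for v in probe) for probe in zip(*samples)]


def kc_score(positions, values, grid, width):
    """Kernel-weighted average of the summed probe values at each grid point."""
    scores = []
    for g in grid:
        weights = [gaussian_kernel(g - x, width) for x in positions]
        total = sum(weights)
        weighted = sum(w * v for w, v in zip(weights, values))
        scores.append(weighted / total if total > 0 else 0.0)
    return scores


def permutation_threshold(samples, positions, grid, width,
                          n_perm=200, quantile=0.95, seed=0):
    """Quantile of the maximum score seen after shuffling probes within samples."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_perm):
        shuffled = [rng.sample(s, len(s)) for s in samples]
        maxima.append(max(kc_score(positions, summed_gains(shuffled), grid, width)))
    maxima.sort()
    return maxima[min(int(quantile * n_perm), n_perm - 1)]


# Toy data: 5 samples, 40 probes at unit spacing, shared gain at probes 18-22.
rng = random.Random(1)
positions = list(range(40))
samples = [[rng.uniform(-0.05, 0.05) + (1.0 if 18 <= p <= 22 else 0.0)
            for p in positions] for _ in range(5)]

width = 2.0  # kernel width, in the same (arbitrary) units as the positions
scores = kc_score(positions, summed_gains(samples), positions, width)
threshold = permutation_threshold(samples, positions, positions, width)
significant = [p for p, s in zip(positions, scores) if s > threshold]
print(significant)
```

Repeating the last four lines for several values of `width` gives a toy version of the scale space analysis: narrow kernels favour short, high-amplitude aberrations, while wide kernels favour broad ones.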
where μKC1(i) and μKC2(i) are the averages of the KC scores at position i over all samples in Groups 1 and 2, respectively; σKC1,2(i) is the pooled variance over all samples of the KC scores at position i, and f is a regularization factor equal to the 95th percentile of the pooled class standard deviation across all genomic positions. This factor prevents small variances from dominating the SNR statistic. To identify significantly differential CNAs, a class-label-based permutation scheme is employed to determine the SNR threshold that satisfies the user-specified false discovery rate (FDR). In the second approach, the smoothed tumor profiles are employed as input to the SAM package to identify differentially aberrated loci at a given FDR.
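As an illustration of the first approach, the sketch below assumes that the statistic has the regularized form SNR(i) = (μKC1(i) − μKC2(i)) / (σKC1,2(i) + f), which is consistent with the definitions above but is an assumption rather than a quotation of the method. It also assumes that σKC1,2(i) is used as a pooled standard deviation, and it uses a simple permutation estimate of the FDR. The function names, toy data and parameter values are hypothetical, and none of this is the KC-SMARTR code.

```python
import math
import random
from statistics import mean


def pooled_sd(a, b):
    """Pooled standard deviation of two groups of scores at one position."""
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    return math.sqrt(ss / (len(a) + len(b) - 2))


def snr_profile(group1, group2):
    """SNR per position; each group is a list of smoothed per-sample profiles."""
    cols1, cols2 = list(zip(*group1)), list(zip(*group2))
    sds = [pooled_sd(a, b) for a, b in zip(cols1, cols2)]
    ordered = sorted(sds)
    f = ordered[min(int(0.95 * len(ordered)), len(ordered) - 1)]  # regularizer
    return [(mean(a) - mean(b)) / (sd + f) if sd + f > 0 else 0.0
            for a, b, sd in zip(cols1, cols2, sds)]


def fdr_threshold(group1, group2, fdr=0.05, n_perm=200, seed=0):
    """Smallest |SNR| cutoff whose permutation-estimated FDR is at most `fdr`."""
    rng = random.Random(seed)
    observed = [abs(x) for x in snr_profile(group1, group2)]
    pooled, n1 = group1 + group2, len(group1)
    null = []
    for _ in range(n_perm):
        shuffled = rng.sample(pooled, len(pooled))  # permute class labels
        null.extend(abs(x) for x in snr_profile(shuffled[:n1], shuffled[n1:]))
    for t in sorted(observed):
        called = sum(x >= t for x in observed)
        expected_false = sum(x >= t for x in null) / n_perm
        if expected_false / called <= fdr:
            return t
    return float("inf")


# Toy smoothed profiles: 30 positions, group 1 is shifted at positions 10-14.
rng = random.Random(1)


def profile(shift):
    return [rng.gauss(0.0, 0.1) + (shift if 10 <= i <= 14 else 0.0)
            for i in range(30)]


group1 = [profile(1.0) for _ in range(6)]
group2 = [profile(0.0) for _ in range(6)]
snr = snr_profile(group1, group2)
cutoff = fdr_threshold(group1, group2)
differential = [i for i, x in enumerate(snr) if abs(x) >= cutoff]
print(differential)
```

Adding f to the denominator means that a position with an unusually small pooled deviation cannot reach a large SNR from a negligible difference in means, which is the role the text assigns to the regularization factor.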
This table shows the regions that were identified by KC-SMARTR as being significantly aberrated in the B- and T-cell lymphoma dataset.
Region (in kb)       Known VDJ loci in region
51300 - 51300
168900 - 170100
171600 - 171900
172800 - 176700
187500 - 188100
88800 - 89400        Ig* Kappa light chain
1800 - 4200
5400 - 6300
11100 - 11100
13200 - 17100
19800 - 23100
24300 - 24300
38100 - 38700        T-cell receptor Gamma
141900 - 142200      T-cell receptor Beta
21300 - 22200        T-cell receptor Alpha
105300 - 105900      Ig heavy chain
21300 - 21600        Ig Lambda light chain
To the best of our knowledge, no other software package allows for a supervised aCGH analysis, and as such we believe our method delivers an important contribution to this field. Moreover, because the method does not use discretized data, the software gives the user the flexibility to look for recurrent gains and losses across different genomic scales. Given ever-increasing data set sizes, it is also important to note that our algorithm scales linearly with the number of probes and the number of samples. To give an indication, on our 2.7 GHz Opteron machine a comparative analysis of a fairly large Affymetrix SNP 6 dataset (1.78 million probes, 61 samples) took about five and a half hours.
In the future we would like to implement a parallelized algorithm to make use of the additional CPU cores that are frequently available in current machines. Since most calculations can be performed in parallel, this should speed up the analysis considerably.
KC-SMARTR is a flexible, fast and user-friendly aCGH tool to determine significantly recurrent CNAs as well as regions showing significantly differential aberrations between two groups of samples. On a set of B- and T-cell lymphomas we were able to locate all positive control regions (VDJ recombination sites) and a number of new regions as significantly aberrated. A t-test run on segmented data was also able to find the positive control regions but resulted in highly fragmented regions. In contrast, KC-SMARTR allows the user to set the kernel width and thereby control the size of the aberrations that are detected. It features output in both visual and tabular format, including a scale space analysis, which allows a visual overview of the aberrations at different scales. KC-SMARTR offers recurrent CNA and class specific CNA detection, at different genomic scales, in a single package without the need for additional segmentation. It is memory efficient and runs on a wide range of machines. Most importantly, it does not rely on data discretization and therefore maximally exploits the biological information in the aCGH data.
Availability and requirements
Project name: KC-SMART
Project home page: http://bioconductor.org/packages/2.5/bioc/html/KCsmart.html
Operating system(s): Platform independent
Programming language: R
License: GNU General Public License
Installation note: To always get the most up-to-date version of KC-SMARTR, update to the latest R and Bioconductor version and type the following at the R prompt:
source("http://bioconductor.org/biocLite.R")
biocLite("KCsmart")
JdR was supported by the Netherlands Genomics Initiative (NGI) through the Cancer Genomics Centre (CGC).
CK was supported by grants from the Netherlands Organization for Scientific Research (ZonMw Vidi 917.036.347) and the Dutch Cancer Society (NKI 2006-3486).
- Hanahan D, Weinberg RA: The hallmarks of cancer. Cell. 2000, 100: 57-70. doi:10.1016/S0092-8674(00)81683-9.
- Fiegler H, Geigl JB, Langer S, Rigler D, Porter K, Unger K, Carter NP, Speicher MR: High resolution array-CGH analysis of single cells. Nucleic Acids Res. 2007, 35: e15. doi:10.1093/nar/gkl1030.
- Klijn C, Holstege H, de Ridder J, Liu X, Reinders M, Jonkers J, Wessels L: Identification of cancer genes using a statistical framework for multiexperiment analysis of nondiscretized array CGH data. Nucleic Acids Res. 2008, 36: e13. doi:10.1093/nar/gkm1143.
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98 (9): 5116-21. doi:10.1073/pnas.091062498.
- Chin K, DeVries S, Fridlyand J, Spellman PT, Roydasgupta R, Kuo WL, Lapuk A, Neve RM: Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell. 2006, 10 (6): 529-41. doi:10.1016/j.ccr.2006.10.009.
- Holstege H, van Beers E, Velds A, Liu X, Joosse SA, Klarenbeek S, Schut E, Kerkhoven R, Klijn C, et al: Cross-species comparison of aCGH data from mouse and human BRCA1- and BRCA2-mutated breast cancers. BMC Cancer. 2010, 10: 455. doi:10.1186/1471-2407-10-455.
- Klijn C, Bot J, Adams DJ, Reinders M, Wessels L, Jonkers J: Identification of networks of co-occurring, tumor-related DNA copy number changes using a genome-wide scoring approach. PLoS Comput Biol. 2010, e1000631. doi:10.1371/journal.pcbi.1000631.
- Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23 (6): 657-63. doi:10.1093/bioinformatics/btl646.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.