Evidences showing wide presence of small genomic aberrations in chronic lymphocytic leukemia

Background Chronic lymphocytic leukemia (CLL) is the most common adult leukemia in the western population. Although genetic factors are considered to contribute to CLL etiology, at present genomic aberrations identified in CLL are limited compared with those identified in other types of leukemia, which raises the question of the degree of genetic influence on CLL. We performed a high-resolution genome scanning study to address this issue. Findings Using the restriction paired-end-based Ditag Genome Scanning technique, we analyzed three primary CLL samples at a kilobase resolution, and further validated the results in eight primary CLL samples including the two used for ditag collection. From 51,632 paired-end tags commonly detected in the three CLL samples representing 5% of the HindIII restriction fragments in the genomes, we identified 230 paired-end tags that were present in all three CLL genomes but not in multiple normal human genome reference sequences. Mapping the full-length sequences of the fragments detected by these unmapped tags in seven additional CLL samples confirmed that these are the genomic aberrations caused by small insertions and deletions, and base changes spreading across coding and non-coding regions. Conclusions Our study identified hundreds of loci with insertion, deletion, base change, and restriction site polymorphism present in both coding and non-coding regions in CLL genomes, indicating the wide presence of small genomic aberrations in chronic lymphocytic leukemia. Our study supports the use of a whole genome sequencing approach for comprehensively decoding the CLL genome for better understanding of the genetic defects in CLL.

Findings Chronic lymphocytic leukemia (CLL) is an incurable disease mainly affecting the B cell lineage in the western population, with a median age at diagnosis of 72 years [1]. Determining the cause of CLL is crucial for understanding how the disease arises and for its clinical diagnosis, treatment and prognosis. Genetic factors have been linked to the etiology of CLL. Cytogenetic analyses identified chromosomal abnormalities including del11q23 affecting the ATM gene, tri12, del13q14, and del17p13 affecting the TP53 gene [2]. In addition, CGH studies found gains and losses in Xp11.2-p21 and Xq21-qter [3]. Molecular studies identified three genes, IgVH, CD38 and ZAP-70, that correlate with CLL prognosis [4][5][6]. A CLL-specific microRNA signature was also identified, suggesting that microRNA deletion could be involved in CLL [7]. SNP array studies identified 2q21.2, 6p22.1 and 18q21.1 abnormalities that follow a Mendelian inheritance pattern [8]. Whole genome association studies also identified multiple loci at 2q37.3, 8q24.21, 15q21.3 and 16q24.1 that appear to be associated with genetic susceptibility to CLL [9].
Although evidence supports the involvement of genetic factors in CLL, the frequency of genomic aberrations identified in CLL is lower than that observed in leukemias affecting other hematopoietic lineages [10]. This suggests that the CLL genome is relatively intact, with fewer aberrations than other types of leukemia. Alternatively, more genomic aberrations may exist in CLL, but these could mainly be small lesions that are difficult to detect using conventional technologies because of their limited resolution. With the rapid progress of genome sequencing technologies, enthusiasm is increasing for comprehensive detection of genomic aberrations in cancer by sequencing cancer genomes. In the case of CLL, a critical issue is to know the degree of genomic aberration in order to justify the use of a whole genome sequencing approach to analyze the CLL genome. We reasoned that if we could scan several CLL genomes at sufficiently high resolution and at reasonable genome coverage, we would gain first-hand information to estimate the degree of genomic aberrations in CLL.
We recently developed the Ditag Genome Scanning (DGS) technique, which uses next-generation DNA sequencing technologies to collect paired-end sequences from restriction DNA fragments across a genome [11]. Using this technique, we analyzed CLL genomes. Nine samples of peripheral blood from untreated CLL patients diagnosed at Northwestern University Lurie Cancer Center and University of Chicago Medical Center were used in this study, of which three were used for paired-end tag collection, and eight, including two used in paired-end tag collection, were used for full-length sequencing analysis (Additional file 1: Supplemental Table S1). Informed consent was obtained from the patients, and the use of clinical CLL samples was approved by the institutional review boards of University of Chicago and Northwestern University following institutional guidelines. The detailed experimental process followed the published protocol [11] and is outlined in Figure 1. Briefly, mononuclear cells were isolated from each CLL peripheral blood or bone marrow sample using NycoPrep™ A solution (Axis-Shield). Human genomic DNA was extracted from mononuclear cells using the QIAamp DNA Blood Kit (QIAGEN) following the manufacturer's protocol. To generate the DGS library, genomic DNA was fractionated by HindIII restriction digestion. The restriction fragments were dephosphorylated by CIP and cloned into the pDGS-HindIII vector, which contains two MmeI sites next to the HindIII cloning site. The genomic library was digested by MmeI to release two tags from the cloned DNA fragments. The tag-vector-tag fragments were then gel-purified and re-ligated to form a ditag library. Ditags were released from the vectors by HindIII digestion, gel-purified, and concatemerized using T4 DNA ligase (Promega). Concatemers of 200 to 500 bp were agarose-gel-purified and sequenced on a 454 GS20 sequencer (454 Life Sciences). Ditags were extracted from the resulting sequences based on the HindIII sites. 
Identical ditags were combined into a single unique ditag with its corresponding copy number.
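The ditag extraction and collapsing step can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function names, the splitting of each concatemer read at HindIII sites, and the 18 to 22 base window allowed for the sequence between two sites are all assumptions made for illustration. The sketch also ignores read orientation, which a real pipeline would need to handle.

```python
from collections import Counter

HINDIII = "AAGCTT"  # HindIII recognition site flanking every ditag


def extract_ditags(read, min_insert=18, max_insert=22):
    """Return the ditags found between consecutive HindIII sites in one read.

    The insert length window is an illustrative assumption, not a value
    taken from the study.
    """
    pieces = read.upper().split(HINDIII)
    ditags = []
    # The first and last pieces are not flanked by a site on both sides.
    for insert in pieces[1:-1]:
        if min_insert <= len(insert) <= max_insert:
            ditags.append(HINDIII + insert + HINDIII)
    return ditags


def collapse_ditags(reads):
    """Combine identical ditags and record the copy number of each."""
    counts = Counter()
    for read in reads:
        counts.update(extract_ditags(read))
    return counts
```

Calling collapse_ditags on a list of concatemer reads returns a mapping from each unique ditag to the number of times it was observed.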
To generate the reference ditag database, virtual HindIII restriction fragments were generated from known human genomic sequences. Two 16-bp virtual tags were extracted from the 5' and the 3' ends of each virtual fragment and joined to form a reference ditag representing the virtual DNA fragment. The following sequences were used to extract the reference ditags: 1
Initial ditag mapping was performed by requiring a perfect match between experimental ditags and hg18 reference ditags. For the unmapped experimental ditags, a single-base mismatch in each single tag of the ditag was then allowed, to compensate for possible sequencing errors or SNPs. To identify unmapped ditags arising from homopolymer errors of the 454 sequencing chemistry, runs of more than two identical bases in the unmapped ditags were stretched (e.g. AAA -> AAAA) or shortened (e.g. AAA -> AA), and the ditags were mapped to the reference ditags again. Ditags that still remained unmapped were mapped to the reference ditags from the other sequence sources in the ditag reference database. The ditags remaining unmapped after all of these steps were defined as the unmapped ditags.
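The reference ditag construction and the tiered mapping described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the software used in the study: all function names are hypothetical, fragments too short to yield two non-overlapping 16-bp tags are simply skipped, only one homopolymer run is altered at a time, and the reference set is scanned linearly, which would be too slow at genome scale.

```python
import re

SITE = "AAGCTT"  # HindIII recognition site (palindromic)
TAG_LEN = 16     # length of each virtual tag, as stated in the text


def reference_ditags(sequence):
    """Virtually digest a sequence at HindIII sites and join the end tags."""
    starts = [m.start() for m in re.finditer(SITE, sequence)]
    ditags = set()
    for left, right in zip(starts, starts[1:]):
        fragment = sequence[left:right + len(SITE)]  # keep both flanking sites
        if len(fragment) >= 2 * TAG_LEN:  # assumption: skip very short fragments
            ditags.add(fragment[:TAG_LEN] + fragment[-TAG_LEN:])
    return ditags


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def within_one_mismatch_per_tag(ditag, ref):
    """Allow at most one mismatch in each of the two tags of a ditag."""
    if len(ditag) != 2 * TAG_LEN or len(ref) != 2 * TAG_LEN:
        return False
    return (mismatches(ditag[:TAG_LEN], ref[:TAG_LEN]) <= 1
            and mismatches(ditag[TAG_LEN:], ref[TAG_LEN:]) <= 1)


def homopolymer_variants(ditag):
    """Stretch or shorten each run of three or more identical bases by one."""
    variants = set()
    for run in re.finditer(r"(.)\1{2,}", ditag):
        i = run.start()
        variants.add(ditag[:i] + ditag[i] + ditag[i:])  # stretched by one base
        variants.add(ditag[:i] + ditag[i + 1:])         # shortened by one base
    return variants


def map_ditag(ditag, refs):
    """Classify a ditag by the first mapping tier that succeeds."""
    if ditag in refs:
        return "perfect"
    if any(within_one_mismatch_per_tag(ditag, ref) for ref in refs):
        return "one-mismatch"
    if any(variant in refs for variant in homopolymer_variants(ditag)):
        return "homopolymer"
    return "unmapped"
```

In this sketch a ditag labeled "unmapped" against the hg18 reference set would next be tested against the reference sets built from the other sequence sources, using the same function.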
Unmapped ditag sequences were used to design sense primers and antisense (reverse-complementary) primers, with four extra bases, CAGC, added to the 5' end of the sense primer and CGCC to the 5' end of the antisense primer. Genomic DNA digested by HindIII was used as the template for PCR amplification. PCR was performed with 35 cycles of 95°C for 30 sec, 57°C for 60 sec, and 72°C for 3 min, followed by extension at 72°C for 10 min. The amplified products in each reaction were cloned into pGEM-T vector (Promega), transformed into E. coli TOP10 (Invitrogen), and plated in a single well of the 48-well Qtrays (Genetix). Four clones from each transformation were amplified by colony-PCR using M13F and M13R primers, and sequenced with the Big-Dye Terminator v3.1 Cycle Sequencing Kit (ABI) using the M13F primer. For the sequences that did not reach full length, second sequencing reactions were performed using the M13R primer. To determine the genomic aberrations, each full-length sequence was mapped to hg18 using BLAT with a minimum of 90% identity as the cut-off.
Figure 1 Outline of the experimental process. Genomic DNA samples were digested by restriction enzymes. Ditags (paired-end tags) were collected from both ends of restriction fragments and sequenced. The ditag sequences were compared to known human reference genome sequences. The unmapped ditags were used as sense and antisense PCR primers to amplify their original DNA fragments to generate full-length sequences. The sequences were mapped to reference genome sequences to determine the type of genomic aberrations. Flowchart steps: collect paired-end ditags from three CLL genomes; identify the ditags detected in all three CLL genomes; map common ditags to multiple human genome sequences; identify unmapped ditags; generate full-length sequences represented by the unmapped ditags; map full-length sequences to reference human genomes to determine the type of genomic aberrations (inversion, deletion, insertion).
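The primer design described above can be illustrated with a short Python sketch. It assumes that a 32-base ditag is split into two 16-base tags, that the sense primer is the first tag and that the antisense primer is the reverse complement of the second tag; the function names are hypothetical, and only the CAGC and CGCC 5' extensions are taken from the text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def ditag_primers(ditag, tag_len=16):
    """Derive a sense and an antisense primer from one unmapped ditag.

    Assumes the ditag is tag 1 followed by tag 2, each tag_len bases long.
    """
    tag1, tag2 = ditag[:tag_len], ditag[-tag_len:]
    sense = "CAGC" + tag1                          # 5' extension on the sense primer
    antisense = "CGCC" + reverse_complement(tag2)  # 5' extension on the antisense primer
    return sense, antisense
```

Under these assumptions, both primers derived from a ditag carry a HindIII site (AAGCTT) directly after the four-base extension, because each tag is anchored at a HindIII end of the fragment.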
The paired-end ditags were collected from three CLL samples. Genomic DNA from each sample was fractionated by HindIII digestion, which provides 3,561-bp resolution on average across the genome based on hg18 sequences [11]. A total of 272,193, 320,283, and 307,547 unique paired-end ditags were collected from the three CLL samples, covering 32%, 34% and 38% of the HindIII fragments in each CLL genome, respectively. Comparing the three ditag sets shows that between 87,968 and 108,579 ditags are shared between any two CLL samples, and 51,632 ditags are present in all three CLL samples (Table 1A). The ditags present in only one CLL sample could represent individual genomic differences, could originate from experimental artifacts, or could have been detected in one sample but not in the others because ditag collection was not saturated at this sequencing scale. The 51,632 ditags detected in all three CLL samples represent 5% of the HindIII restriction fragments in the genome. To provide high confidence for downstream studies, we focused on the 51,632 common ditags for further mapping analysis. We compared the 51,632 common ditags with multiple known human genome sequences, including the human genome reference sequence hg18, human SNPs, human GM15510 genome sequences, chimpanzee genome sequences, which are highly homologous to human sequences, Watson genome sequences, and Venter genome sequences. Of the 51,632 ditags used for the mapping, 98.3% (50,799) map to hg18 and represent normal genomic fragments in the CLL genomes, 0.4% (230) are unmapped ditags that represent potential genomic aberrations present in all three CLL genomes, and the remaining ditags map to other genomes and represent normal genome variations (Table 1B).
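The comparison of the three ditag sets amounts to pairwise and three-way set intersections. The following Python sketch shows the idea on placeholder data; the function name and the sample labels are hypothetical and the sketch is not the code used in the study.

```python
from itertools import combinations


def overlap_summary(samples):
    """Count ditags shared by each pair of samples and find those in all samples.

    samples maps a sample label to the set of unique ditags collected from it.
    """
    pairwise = {
        (a, b): len(samples[a] & samples[b])
        for a, b in combinations(sorted(samples), 2)
    }
    common = set.intersection(*samples.values())
    return pairwise, common
```

Applied to the three CLL ditag sets, the pairwise counts would correspond to the values summarized in Table 1A, and the common set to the ditags carried forward for mapping.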
To determine the types of genomic aberrations for the unmapped ditags, we generated full-length sequences for the restriction DNA fragments detected by the unmapped ditags using the "ditag-PCR" method, in which the ditag sequences were used as PCR sense and antisense primers to amplify the original DNA fragment from which the unmapped ditag was derived. We performed 192 reactions in eight CLL samples, including two used in ditag collection and six additional CLL samples. Under the conditions that a full-length sequence must be longer than 50 bases and be detected at least in a CLL sample used in ditag collection or in at least two additional CLL samples, 220 full-length sequences were generated from 100 unmapped ditags. Mapping the full-length sequences to hg18 identified different types of genomic aberrations caused by insertion, deletion and base change. Many of these aberrations created new HindIII restriction sites, leading to the release of an unmapped ditag, or changed the ditag sequence composition, preventing ditag mapping. These aberrations were observed in both coding and non-coding regions of the CLL genome. For example, aberrations were detected in exons of the NEK8, RUNX1 and MUC2 genes, and in introns of 20 other genes (Table 2A, Additional file 2: Supplementary table S2). NEK8 encodes a member of the serine/threonine protein kinase family, which plays a role in cell cycle progression from G2 to M phase and is over-expressed in breast cancer [12]. A 353-base sequence converted from the unmapped ditag AAGCTTACCCTCTGGACGCCTGTATGAAGCTT maps to the last exon (exon 15) of NEK8, which encodes the 3' UTR. The sequence contains two inserted HindIII restriction sites that are not present in the wild-type NEK8 gene. RUNX1 is a gene involved in AML through its involvement in the t(8;21) [13]. A 434-base full-length sequence from the ditag AAGCTTCGGCCTATAG/ACAACCTAACAAGCTT was detected in all eight CLL samples, and maps to intron 3 and exon 4 of RUNX1.
Analysis of the mapped region shows a T to C single-base change between the sequence and exon 4 of the RUNX1 gene. Searching dbSNP reveals that this is a SNP (rs1235270). Because of the uncertainty of the RUNX1 protein coding sequence itself, it is not certain whether this germline SNP causes an amino acid change. Several bases are also changed in the mapped intron 3 of the RUNX1 gene. These base changes raise the interesting question of whether RUNX1 could be involved in CLL. MUC2 is a member of the MUCIN family, which codes for high molecular weight glycoproteins. Abnormalities of MUC2 are linked with colorectal and pancreatic cancer [14]. A 410-base sequence derived from the unmapped ditag AAGCTTCCGGTCGGCTTCGCAGTAGAAAGCTT covers intron 29, exon 30 and intron 30 of the MUC2 gene. This sequence also contains two HindIII restriction sites (AAGCTT), inserted at both of its ends, that do not exist in the wild-type MUC2 gene. Aberrations were detected in the exons of only three known genes. This could be attributed to the limited genome coverage of the study and the low percentage of exon sequences in the genome. With increased genome coverage, it should be possible to identify aberrations affecting more exons. Aberrations also affect the introns of multiple genes. FHIT encodes diadenosine 5',5'''-P1,P3-triphosphate hydrolase, which is involved in purine metabolism [15]. It is located in the common fragile site FRA3B on chromosome 3, where carcinogen-induced damage can lead to translocations in several cancers. A 283-base sequence maps to intron 8 of the FHIT gene, but its tag 1 contains a GA to TG change. HYDIN encodes an axonemal protein; mutation of HYDIN is related to congenital hydrocephalus [16]. Two full-length sequences of 605 bp and 614 bp from two different unmapped ditags were obtained from seven CLL samples. Both sequences map to intron 21 of HYDIN. 
The 605-bp sequence contains CCTACGGCG in its tag 2, converted from wild-type gCcACaGCa (lowercase indicates the changed bases), and the 614-bp sequence contains CGCC in its tag 1, converted from wild-type tGCt, as well as an internal insertion. NCOR2 is a transcriptional regulator that recruits histone deacetylases to promoters [17]. A 582-base sequence maps to intron 1 of NCOR2, but its tag 1 contains an AAGC insertion, and its tag 2 contains a C to T change, an AG deletion, and a T insertion. TYK2 is a member of the JAK family involved in IFN-g, IL-6, IL-10 and IL-12 signaling. Mutation in this gene is associated with hyperimmunoglobulin E syndrome [18]. A 268-base sequence maps to intron 14 of TYK2, but its tag 1 contains an AAGCTTA insertion and its tag 2 contains a TGAAGCTT insertion. Both insertions create HindIII restriction sites that lead to the generation of the unmapped ditag. A 197-base sequence was detected in seven CLL samples, and two different sequences of 112 bases and 170 bases were generated from the CLL used in ditag collection. All three sequences map to UBAP2, located at 9p13.3, a gene involved in the ubiquitination pathway [19]. For the 197-base sequence, 178 bases map to intron 6 of the UBAP2 gene and the remaining 18 bases do not map, whereas the 112-base and 170-base sequences contain different insertions. Although aberrations in many of these genes have been correlated with different types of cancer, most have not been linked with CLL.
Non-coding regions make up the majority of the genome and contain important functional elements involved in DNA replication, genome stability, regulation of gene expression, and the encoding of non-coding transcripts. Extensive characterization of non-coding regions could provide rich candidate markers for clinical applications and identify hotspots of genomic aberrations involved in cancer development. A total of 37 sequences generated from 30 unmapped ditags mapped to the non-coding regions of the genome with various types of abnormalities (Table 3, Additional file 3: Supplemental Table S3). Although these loci are not located directly in coding regions, many genes are located near the mapped locations. Of the 26 loci specifically mapped by the sequences, 15 have genes located upstream, downstream or both within 100 kb. For example, a 614-base sequence maps to 5q35.1 between 169443856 and 169444467, where DOCK2 is located 27,836 bases upstream and FOXI1 is located 21,028 bases downstream. A 398-base sequence maps to 15q26.1 between 88110782 and 88111168, where two homologous transcription factor genes, MESP1 and MESP2, are located 16,678 bases upstream and 9,425 bases downstream, respectively. The microRNA gene MIR663 is located 20,580 bases upstream of the region at 20p11.1 between 26157494 and 26158252, to which a 920-base sequence detected in seven CLL samples maps. Another microRNA gene, MIR663B, is located 10,964 bases upstream of the region at 2q21.2 between 132742087 and 132742356, to which a 290-base sequence maps; the non-coding RNA gene NCRNA00164 is located in between. The aberrations could affect the nearby genes by influencing the regulation of gene expression.
Table 4 Aberrations in the centromere region
One hundred and forty-seven full-length sequences converted from 57 unmapped tags map to highly repetitive sequences in the non-coding regions. Of these, 110 sequences map to the ALR/Alpha satellite sequences of the centromere, most frequently on chromosomes 2, 10, and 17 (Table 4, Additional file 4: Supplemental Table S4): 23 sequences converted from 13 unmapped tags map to the centromere of chromosome 2 at 2p11.1, 41 sequences converted from 16 ditags map to the centromere of chromosome 10 at 10q11.1, and 22 sequences converted from 6 unmapped ditags map to the centromere of chromosome 17 at 17p11.1. The high frequency of aberrations in the ALR/Alpha satellite sequences of these three chromosomes suggests that these could be hot spots of genomic aberrations in CLL. Aberrations in repetitive sequences have been shown to contribute to cancer development [20]. However, it is difficult to analyze the aberrations in these highly repetitive regions using hybridization-based approaches because of the difficulty of designing specific probes. Our results show that a restriction sequencing-based approach provides a useful tool to study the aberrations in these regions.
Ten full-length sequences generated from eight unmapped ditags did not map to known human genome sequences (Table 2B, Additional file 5: Supplementary table S5). For example, a 107-base full-length sequence converted from the unmapped ditag AAGCTTAGATAGAGCGCAGTCAACTGAAGCTT was detected in all eight CLL samples. However, it does not map to the reference genome sequences. These sequences represent DNA content present in CLL genomes but not in normal genomes.
Through high-resolution scanning of three CLL genomes and verification of the results using full-length sequences and additional CLL genomes, our study provides evidence for the wide presence of genomic aberrations in CLL, most of which are small lesions. Studies with an increased number of CLL samples and at higher genome coverage will be required to better understand the genetic aberrations in CLL. Although the study used multiple genomic databases to eliminate changes arising from normal genomic polymorphism, further studies with normal DNA from the same patients will be required to fully distinguish somatic mutations from germline variations in CLL.