A new approach to in silico SNP detection and some new SNPs in the Bacillus anthracis genome
© Brodzik et al; licensee BioMed Central Ltd. 2011
Received: 10 December 2010
Accepted: 8 April 2011
Published: 8 April 2011
Bacillus anthracis is one of the most monomorphic pathogens known. Identification of polymorphisms in its genome is essential for taxonomic classification, for determination of recent evolutionary changes, and for evaluation of pathogenic potency.
In this work three strains of the Bacillus anthracis genome are compared and previously unpublished single nucleotide polymorphisms (SNPs) are revealed. Moreover, it is shown that, despite the highly monomorphic nature of Bacillus anthracis, the SNPs are (1) abundant in the genome and (2) distributed relatively uniformly across the sequence.
The findings support the proposition that SNPs, together with indels and variable number tandem repeats (VNTRs), can be used effectively not only for the differentiation of perfect strain data, but also for the comparison of moderately incomplete, noisy and, in some cases, unknown Bacillus anthracis strains. When the data are of still lower quality, a new DNA sequence fingerprinting approach can be used; it relies on recently introduced markers that are derived from combinatorial-analytic concepts and called cyclic difference sets.
I have deeply regretted that I did not proceed far enough at least to understand something of the great leading principles of mathematics; for men thus endowed seem to have an extra sense.
This research is part of an effort to develop novel techniques for the interrogation of pathogenic genomes. In this domain the task of Bacillus anthracis strain differentiation poses a particularly difficult challenge [1–4]. Since most B. anthracis strains are highly monomorphic, sequence typing must rely on subtle differences between genomes, sampled at multiple loci . The complexity of the problem will increase in cases where only partial sequence data is available, or sequences contain errors, and as design of engineered bacterial genomes becomes possible .
The principal genomic markers used in sequence typing are VNTRs, indels and SNPs. The occurrence of VNTRs and indels in the B. anthracis genome in the three strains considered here was recently investigated in . Here, we undertake the analysis of SNPs. The use of SNPs in both human and microbial DNA investigations has a long tradition . The advantages of SNPs include high concentration in coding regions, fixed length, and lower susceptibility to short read sequencing errors than VNTRs. In applications these advantages must be balanced against SNPs' relatively slow mutation rates and relatively low resolving power. In cases when sequence typing by SNPs is not sufficient, the use of SNPs in combination with other markers should be considered .
In this work the occurrence of SNPs is investigated in the three main strains of the B. anthracis genome: Ames Ancestor, Ames and Sterne. It is shown that SNPs are abundant in the B. anthracis genome and that they are distributed relatively uniformly throughout the sequence. These findings demonstrate that the B. anthracis SNPs can be used effectively as part of an increased resolution, multi-tier strain differentiation scheme for the analysis of moderately incomplete, noisy or uncertain data. The SNP detection approach used here is based on an advanced design theory construction known as the cyclic difference set . In this approach the comparison of DNA sequences is replaced by the comparison of cyclic difference set distributions associated with these sequences. The similarity of these distributions is used first to assess DNA sequence homology and subsequently to identify indels and SNPs. The cyclic difference set approach has many advantages ; the primary one, which is particularly relevant to this work, is that it permits a high degree of flexibility in selecting an appropriate sequence variation resolution that can be adapted to a given application.
The work described here intersects several application domains. Prior work on B. anthracis includes [7, 1, 5, 11, 2, 3], and [12–14]. Prior work on bacterial genome structure includes [15–18]. Prior work on SNP taxonomy and detection includes [8, 19, 1], and . Prior work on cyclic difference sets includes  and [21–23].
The B. anthracis genome is made up of chromosomal DNA and two plasmids, pXO1 and pXO2. We analyzed the chromosomal sequences of Ames Ancestor GenBank: NC_007530.2, Ames GenBank: NC_003997.3, and Sterne GenBank: NC_005945.1, the pXO1 plasmid sequences of Ames Ancestor GenBank: NC_003980 and Sterne GenBank: NC_001496, and the pXO2 plasmid sequences of Ames Ancestor GenBank: NC_003981.1 and Pasteur GenBank: NC_012659.1. For brevity, we refer to Ames Ancestor, Ames, Sterne, and Pasteur as AA, A, S, and P.
SNP definition and taxonomy
There is no standard, mathematically consistent definition of the term SNP . We consider it essential to establish such a definition, so that confusion can be avoided in analysis, in comparison of results and in discussions. In this work a SNP is defined as a single letter difference between two sequences flanked on the left and on the right by at least one letter that is identical in both sequences. For example, in the strings
A C G T A C G T
A A G G A T T T
the second and fourth letters are SNPs but the sixth and seventh letters are indels, as the letter differences are adjacent. This convention is different from general practice, which sometimes permits adjacent letter differences to be regarded as SNPs . We insert the non-adjacency constraint into the SNP definition because: (1) such modification permits mathematically unambiguous separation of SNPs and indels, and (2) such separation is biologically meaningful as adjacent and closely spaced SNPs often coincide with large indels.
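The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `classify_mismatches` is hypothetical, the two strings are assumed to be already aligned and of equal length, and a mismatch at the very first or last position is assumed not to qualify as a SNP because it lacks one flanking letter.

```python
def classify_mismatches(a, b):
    """Split mismatching positions of two aligned strings into SNPs and non-SNPs.

    A mismatch counts as a SNP only if the letters immediately to its left and
    right are identical in both strings (the non-adjacency constraint).
    Positions are reported 1-based, as in the text.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    snps, others = [], []
    n = len(a)
    for i in range(n):
        if a[i] == b[i]:
            continue
        # Assumption: a mismatch at either end has no flank and is not a SNP.
        left_ok = i > 0 and a[i - 1] == b[i - 1]
        right_ok = i < n - 1 and a[i + 1] == b[i + 1]
        (snps if left_ok and right_ok else others).append(i + 1)
    return snps, others


# The example strings from the text, written without spaces.
snps, others = classify_mismatches("ACGTACGT", "AAGGATTT")
print(snps, others)
```

For the example strings, the second and fourth letters are classified as SNPs, while the adjacent mismatches at the sixth and seventh letters fall into the second list.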
The definition of SNP must be further disambiguated when more than two sequences are considered. In this case two or more distinct letters might appear at a putative SNP position, raising the possibility of counting each pair-wise mismatch as a separate SNP. We will ignore this multiplicity. For example, both triples A-C-T and A-C-C will be considered instances of a single SNP. We will distinguish between coding and non-coding SNPs, and between synonymous and non-synonymous SNPs (the latter referred to as nsSNPs). In a three-way comparison a coding SNP is considered non-synonymous when at least one of the pair-wise SNPs is non-synonymous. For example, there are two pair-wise SNPs in letters A-C-C in the three-way comparison of AA-A-S, one for the pair of strains AA-A and one for the pair of strains AA-S. If either of these pair-wise SNPs is non-synonymous then the three-way SNP is declared an nsSNP.
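The three-way counting rule can be illustrated with a short Python sketch. The function name `three_way_snp`, the strain labels used as dictionary keys, and the example codons are hypothetical and chosen only for illustration; the sketch also assumes that the codons of the three strains differ at a single position and that the standard codon-to-amino-acid assignments apply.

```python
from itertools import combinations, product

# Standard genetic code, bases ordered T, C, A, G (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def three_way_snp(codons, offset):
    """Classify one putative SNP column in a three-way comparison.

    codons: mapping strain name -> codon containing the column (hypothetical input)
    offset: position of the column within the codon (0, 1 or 2)
    Returns None if the letters agree; otherwise the column is counted as ONE
    SNP, and it is non-synonymous if any pair-wise SNP is non-synonymous.
    """
    letters = {strain: codon[offset] for strain, codon in codons.items()}
    if len(set(letters.values())) == 1:
        return None
    pairs = [(s, t) for s, t in combinations(codons, 2) if letters[s] != letters[t]]
    non_syn = any(CODON_TABLE[codons[s]] != CODON_TABLE[codons[t]] for s, t in pairs)
    return {"pairwise": pairs, "non_synonymous": non_syn}


# Letters A-C-C at the third codon position: two pair-wise SNPs, one three-way SNP.
print(three_way_snp({"AA": "GAA", "A": "GAC", "S": "GAC"}, 2))
```

In this hypothetical A-C-C column the pairs AA-A and AA-S differ (Glu versus Asp), so the single three-way SNP is declared an nsSNP.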
In each of the two DNA sequences being compared identify the consecutive occurrences of a selected DS. For example, choosing the DS, 1101000, the DNA sequences
give rise to the DS sequences associated with the nucleotide C,
Convert the above DS sequences to shorter sequences of inter-DS gaps,
Align the gap sequences and identify the mismatching strings of gaps, 7 and 5, or (CAC)GGGG and (CAC)GG.
The rationale for using DSs as sequence markers is that when DNA sequences are highly homologous, so are the sequences of DS locations. Conversely, in regions where DNA sequences differ, so do the DS sequences. This is convenient as the analysis of DNA sequences can then be replaced by the analysis of much sparser, and therefore easier to compute, DS sequences. Since a difference in DS sequences marks the occurrence of an indel, mismatching segments are removed from the DS sequences.
Point-wise comparison of these sequences reveals a SNP T/A at the 6th bp.
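The procedure can be sketched as follows. This is one possible reading of the steps, under stated assumptions, and not the authors' code: a DS occurrence is taken to be a window in which the indicator string of the chosen nucleotide equals 1101000 exactly; gap sequences are aligned with Python's `difflib.SequenceMatcher` as a stand-in for whatever alignment the original method uses; and the two test sequences `SEQ_A` and `SEQ_B` are invented for illustration.

```python
from difflib import SequenceMatcher

DS = "1101000"  # binary string of the (7, 3, 1) cyclic difference set {0, 1, 3}


def ds_positions(seq, base, ds=DS):
    """Start indices where the indicator string of `base` equals `ds`."""
    indicator = "".join("1" if ch == base else "0" for ch in seq)
    return [i for i in range(len(indicator) - len(ds) + 1)
            if indicator.startswith(ds, i)]


def gaps(positions):
    """Distances between consecutive DS occurrences."""
    return [b - a for a, b in zip(positions, positions[1:])]


def compare(seq_a, seq_b, base="C"):
    """Return (mismatching gap strings, point-wise differences in matching gaps)."""
    pos_a, pos_b = ds_positions(seq_a, base), ds_positions(seq_b, base)
    gaps_a, gaps_b = gaps(pos_a), gaps(pos_b)
    matcher = SequenceMatcher(None, gaps_a, gaps_b, autojunk=False)
    indel_gaps, snps = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Differing gaps mark an indel; these segments are set aside.
            indel_gaps.append((gaps_a[i1:i2], gaps_b[j1:j2]))
            continue
        for k in range(i2 - i1):
            a0, b0 = pos_a[i1 + k], pos_b[j1 + k]
            for off in range(gaps_a[i1 + k]):
                x, y = seq_a[a0 + off], seq_b[b0 + off]
                if x != y:
                    snps.append((a0 + off, x, y))  # 0-based index in seq_a
    return indel_gaps, snps


# Hypothetical sequences: SEQ_B carries one substitution and a 2-base insertion.
SEQ_A = "CCACAGT" + "TTGA" + "CCACAGT" + "GGATTAG" + "CCACAGT"
SEQ_B = "CCACAGT" + "TAGA" + "CCACAGT" + "GGATTAGAA" + "CCACAGT"
print(compare(SEQ_A, SEQ_B))
```

In this toy case the gap sequences are 11, 14 and 11, 16; the mismatching gaps 14 and 16 flag the indel, and point-wise comparison of the segment with the matching gap of 11 reveals the T/A substitution.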
Several comments are necessary here to make statements precise. First, while a more natural acronym for a cyclic difference set would be CDS, to avoid potential confusion with a coding sequence we settle for DS. Second, DSs are combinatorial designs that are associated with, not identical to, the special binary strings considered here. However, for convenience and by abuse of language in this text we will refer to the relevant strings as DSs. Third, in motivating the technical approach we mention, for brevity, only the computational complexity argument for the utility of DSs.
Specifically, the computational advantage of the method as compared to a direct approach not relying on DSs is proportional to the abundance of DSs in genomes (1 in 500 nucleotides in the B. anthracis genome). This advantage is further enhanced by the suitability of the method for implementation using the Fast Fourier Transform algorithm, which requires only n log₂ n complex operations. For a more extensive discussion of the role of DSs in DNA sequence analysis the reader is directed to .
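One way in which the FFT could enter the computation is sketched below. This is an assumption about a possible implementation, not a description of the authors' method: DS occurrences are located by cross-correlating the nucleotide indicator sequence with a ±1 version of the DS string, and the correlation is computed with a small radix-2 FFT written in pure Python. The function names `fft`, `cross_correlate` and `ds_positions_fft` are hypothetical.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def cross_correlate(x, p):
    """c[i] = sum_j x[i + j] * p[j] for every full window, computed via FFT."""
    size = 1
    while size < len(x) + len(p):
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    # Correlation equals convolution with the reversed pattern.
    fp = fft([complex(v) for v in reversed(p)] + [0j] * (size - len(p)))
    conv = fft([u * v for u, v in zip(fx, fp)], invert=True)
    m = len(p)
    return [round(conv[i + m - 1].real / size) for i in range(len(x) - m + 1)]


def ds_positions_fft(seq, base, ds="1101000"):
    """Start indices where the indicator string of `base` equals `ds`."""
    if len(seq) < len(ds):
        return []
    x = [1 if ch == base else 0 for ch in seq]
    q = [1 if bit == "1" else -1 for bit in ds]
    # The score reaches the number of ones in `ds` only for an exact match:
    # every 1 of the DS is covered and no 0 of the DS is hit.
    weight = ds.count("1")
    return [i for i, c in enumerate(cross_correlate(x, q)) if c == weight]


print(ds_positions_fft("CCACAGTTTGACCACAGTGGATTAGCCACAGT", "C"))
```

The ±1 encoding lets a single correlation test both conditions of an exact match; for long sequences a library FFT would normally replace the pure-Python routine used here.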
Table 1. Abundance and taxonomy of SNPs in Ames Ancestor, Ames and Sterne genomes reported in  and computed using the DS approach.
Table 2. Distribution of SNPs in Ames Ancestor, Ames, and Sterne genomes. Rows: SNP spacing (average); SNP spacing (adjusted for indels).
The chromosomal analysis included the three pair-wise comparisons of AA-S, AA-A and A-S. These comparisons revealed 131, 19 and 150 SNPs, respectively (Table 1). The SNPs found in the AA-S and AA-A strain comparisons partition the SNPs found in the A-S strain comparison. This suggests that Ames and Sterne are both descendants of Ames Ancestor. The relatively large number of SNPs in AA-S confirms that AA is evolutionarily more distant from S than from A . About 70% of chromosomal SNPs are coding and about 80% of coding SNPs are non-synonymous. The ratio of all coding SNPs to all SNPs is 67%. This ratio is only modestly lower than the ratio of the coding DNA length to the entire genome sequence length, 78% in the AA strain. This result suggests that there is a similar degree of sequence conservation in the two sequence types. Both SNPs and nsSNPs are relatively uniformly distributed along the chromosome (Figures 1 and 2). The minimum, average and maximum distances between consecutive A-S SNPs are 2, 34499 and 163349 bp, respectively, although many SNPs are less than 2000 bp apart (Figure 3, Table 2). Interestingly, despite the close proximity of several pairs of SNPs, only SNPs 93 and 94 occur within the same gene. The distributions of SNPs are only negligibly affected by the occurrence of indels. This is so because chromosomal sequences are highly homologous: the AA-A comparison yields only two multi-base indels, a 123-base-long indel at 1151242 bp and a 10-base-long indel at 2612043 bp; the AA-S comparison yields a single 100-base-long indel at 4147353 bp (all locations are given in the AA coordinates) .
The plasmid analysis included pair-wise comparisons of strains AA-S for pXO1 and AA-P for pXO2. Given their relatively short sequence lengths, the pXO1 and pXO2 plasmids are polymorphism-rich, containing 14 and 21 SNPs, respectively. Of these SNPs, 7 and 16 are coding. Of the coding SNPs, 6 and 9 are nsSNPs. The minimum, average and maximum distances between consecutive SNPs in the pXO1 plasmid are 3, 12977 and 84568 bp. The minimum, average and maximum distances between consecutive SNPs in the pXO2 plasmid are 94, 4516 and 13884 bp. The average spacing between SNPs in the pXO1 and pXO2 plasmids decreases when indels are removed from the sequences (Table 2). The effect is most pronounced in the pXO1 sequence, due to the occurrence of two large indels at 43348-48589 and 117228-162050 bp.
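The effect of indel removal on SNP spacing can be illustrated with a small Python sketch. The SNP positions below are invented for illustration; only the two pXO1 indel intervals are taken from the text, and they are assumed to be inclusive coordinate ranges. Spacing is computed here as the distance between consecutive SNPs, which is a simplifying assumption about how the reported statistics were obtained. The function names `spacing_stats` and `adjust_for_indels` are hypothetical.

```python
def spacing_stats(positions):
    """Minimum, average and maximum distance between consecutive SNP positions."""
    pos = sorted(positions)
    dists = [b - a for a, b in zip(pos, pos[1:])]
    return min(dists), sum(dists) / len(dists), max(dists)


def adjust_for_indels(positions, indels):
    """Map SNP positions to coordinates in which indel intervals are removed.

    indels: list of (start, end) intervals, assumed inclusive.
    SNPs falling inside an indel interval are dropped.
    """
    adjusted = []
    for p in sorted(positions):
        if any(start <= p <= end for start, end in indels):
            continue
        shift = sum(end - start + 1 for start, end in indels if end < p)
        adjusted.append(p - shift)
    return adjusted


PXO1_INDELS = [(43348, 48589), (117228, 162050)]  # intervals quoted in the text
SNP_POSITIONS = [1000, 40000, 50000, 170000]      # hypothetical positions

print(spacing_stats(SNP_POSITIONS))
print(spacing_stats(adjust_for_indels(SNP_POSITIONS, PXO1_INDELS)))
```

Because removing an indel can only shorten the distance between the SNPs that flank it, the adjusted spacing is never larger than the unadjusted one.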
Overall, when adjusted for indels, SNPs are distributed, rather surprisingly, in a relatively uniform fashion across the entire B. anthracis genome, but with varying inter-SNP spacing in each of the three sequences.
This work describes the structure of B. anthracis SNPs arising from in silico comparison of the Ames Ancestor, Ames and Sterne strains. This result complements the characterization of B. anthracis indels given in  and extends the analysis given in  in both the number of SNPs identified and the information provided about their type and distribution. While a later work, , slightly extends the results of , it does so only with respect to the 12 so-called canonical SNPs.
Indels and SNPs, together with VNTRs, capture all sequence differences in pan-genomes. (The distinction between indels and VNTRs is made for historical reasons; mathematically, a VNTR is a special case of an indel. A pan-genome is a superset of all the genes in all the strains of a species ; more generally, a pan-genome can be defined as a reference genome for a species plus the superset of all the genomic variants occurring in all the strains.) Knowledge of these differences can be used either to address basic biological research problems, e.g., investigation of genomic function and evolutionary processes , or in applications such as strain fingerprinting  and monitoring of DNA sequence synthesis orders . In each of these problems, selecting the appropriate granularity of analysis is one of the main decisions that must be made in experiment design.
While it was previously suggested that many B. anthracis strains, including the ones considered here, can be identified using certain minimal sets of markers, such as the so-called canonical SNPs  or special sets of VNTRs , such approaches are certain to be effective only when the strain is known and the data is perfect. This might not always be the case. Indeed, in many practical sequence analysis scenarios the data can be Large (whole genome), Uncertain (a new strain), Noisy (contaminated at the source, corrupted in the process of data collection, sequencing or sequence assembly, or purposefully engineered), or Incomplete (LUNI). In these cases a minimum set of markers will not, in general, suffice to identify all strains, and higher resolution approaches, relying on sequence over-sampling, must be employed.
Table 3. DNA sequence fingerprinting scheme choices for three strains of the B. anthracis chromosomal sequence, ordered in terms of increasing sequence resolution. Column: # of markers. Rows include: Minimal set of SNPs; All SNPs + VNTRs.
The authors would like to thank Julie DelVecchio Savage and Alan Moore for support of this work, and Alfred Steinberg for discussion of pathogenic polymorphisms. The DS approach was inspired, in part, by ideas expressed in Antoine Danchin's book The Delphic Boat.
1. Keim P, Van Ert MN, Pearson T, Vogler AJ, Huynh LY, Wagner DM: Anthrax molecular epidemiology and forensics: using the appropriate marker for different evolutionary scales. Infection genetics and evolution. 2004, 4: 205-213. 10.1016/j.meegid.2004.02.005.
2. Lista F, Faggioni G, Valjevac A, Ciammaruconi A, Vaissaire J, le Doujet C, Gorgé O, De Santis R, Carattoli A, Ciervo A, Fasanella A, Orsini F, D'Amelio R, Pourcel C, Cassone A, Vergnaud G: Genotyping of Bacillus anthracis strains based on automated capillary 25-loci multiple locus variable number tandem repeats analysis. BMC Microbiology. 2006, 6: 1-15. 10.1186/1471-2180-6-33.
3. Marston CK, Gee JE, Popovic T, Hoffmaster AR: Molecular approaches to identify and differentiate Bacillus anthracis from phenotypically similar Bacillus species isolates. BMC Microbiology. 2006, 6: 22-28. 10.1186/1471-2180-6-22.
4. Pallen MJ, Nelson KE, Preston GM: Bacterial pathogenomics. 2007, Washington DC: ASM Press
5. Keim P, Pearson T, Okinaka R: Microbial forensics: DNA fingerprinting of Bacillus anthracis. Anal Chem. 2008, 4: 4791-4799. 10.1021/ac086131g.
6. Gibson DG, Glass JL, Lartigue C, Noskov VN, Chuang RY, Algire MA, Benders GA, Montague MG, Ma L, Moodie MM, Merryman C, Vashee S, Krishnakumar R, Assad-Garcia N, Andrews-Pfannkoch C, Denisova EA, Young L, Qi ZQ, Segall-Shapiro TH, Calvey CH, Parmar PP, Hutchison CA, Smith HO, Venter JC: Creation of a bacterial cell controlled by a chemically synthesized genome. Science. 2010, 329: 52-56. 10.1126/science.1190719.
7. Brodzik AK: Rapid Sequence Homology Assessment by Subsampling the Genome Space Using Difference Sets. IEEE Transactions on Information Theory, Special Issue on Molecular Biology and Neuroscience. 2010, 56 (2): 756-770.
8. Brookes AJ: The essence of SNPs. Gene. 1999, 234: 177-186. 10.1016/S0378-1119(99)00219-X.
9. Brodzik AK: Quaternionic periodicity transform: an algebraic solution to the tandem repeat detection problem. Bioinformatics. 2007, 23: 694-700. 10.1093/bioinformatics/btl674.
10. Baumert LD: Cyclic difference sets. 1971, Berlin: Springer
11. Keim P, Grundike JM, Klevytska AM, Schupp JM, Challacombe J, Okinaka R: The genome and variation of Bacillus anthracis. Molecular Aspects of Medicine. 2009, 30: 397-405. 10.1016/j.mam.2009.08.005.
12. Pilo P, Perreton V, Frey J: Molecular epidemiology of Bacillus anthracis: determining the correct origin. Appl and Environ Microbiol. 2008, 74: 2928-31. 10.1128/AEM.02574-07.
13. Read TD, Salzberg SL, Pop M, Shumway M, Umayam L, Jiang L, Holtzapple E, Busch JD, Smith KL, Schupp JM, Solomon D, Keim P, Fraser CM: Comparative genome sequencing for discovery of novel polymorphisms in Bacillus anthracis. Science. 2002, 296: 2028-2033. 10.1126/science.1071837.
14. Kolsto A-B, Tourasse NJ, Okstad OA: What sets Bacillus anthracis apart from other Bacillus species?. Annual Rev Microbiol. 2009, 63: 451-476. 10.1146/annurev.micro.091208.073255.
15. Cummings CA, Relman DA: Microbial forensics - cross-examining pathogens. Science. 2002, 296: 1976-1979. 10.1126/science.1073125.
16. Konstantinidis KT, Ramette A, Tiedje JM: The bacterial species definition in the genomic era. Philosophical Transactions of The Royal Society B. 2006, 361: 1929-40. 10.1098/rstb.2006.1920.
17. Fraser C, Alm EJ, Polz MF, Spratt BG, Hanage WP: The bacterial species challenge: making sense of genetic and ecological diversity. Science. 2009, 323: 741-6. 10.1126/science.1159388.
18. Freeman JM, Plasterer TN, Smith TF, Mohr SC: Patterns of genome organization in bacteria. Science. 1998, 279: 1827a. 10.1126/science.279.5358.1827a.
19. Mooney S: Bioinformatics approaches and resources for single nucleotide polymorphism functional analysis. Briefings in Bioinformatics. 2005, 6: 44-56. 10.1093/bib/6.1.44.
20. Xu Y, Gogarten JP: Computational methods for understanding bacterial and archeal genomes. 2008, Singapore: Imperial College Press
21. Colbourn CJ, Dinitz JH: Handbook of combinatorial designs. 2006, New York: Chapman and Hall/CRC
22. Erdos P, Turan P: On a problem of Sidon in additive number theory. J London Math Soc. 1941, 3: 212-215. 10.1112/jlms/s1-16.4.212.
23. Sidon S: Ein Satz uber trigonometrische Polynome und seine Anwendung in der Theorie der Fourier-Reihen. Math Ann. 1932, 106: 536-539. 10.1007/BF01455900.
24. Van Ert MN, Easterday WR, Huynh LY, Okinaka RT, Hugh-Jones ME, Ravel J: Global genetic population structure of Bacillus anthracis. PLoS ONE. 2007, 5: 1-10.
25. Carlson R: The changing economics of DNA synthesis. Nature Biotechnology. 2009, 27: 1091-4. 10.1038/nbt1209-1091.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.