PeakSeeker: a program for interpreting genotypes of mononucleotide repeats
© Thompson and Salipante et al; licensee BioMed Central Ltd. 2009
Received: 15 December 2008
Accepted: 03 February 2009
Published: 03 February 2009
Mononucleotide repeat microsatellites are abundant, highly polymorphic DNA sequences with the potential to serve as valuable genetic markers. Their use has been limited by their tendency to produce "stutter": confounding signals from insertions and deletions within the mononucleotide tract that occur during PCR. Stutter complicates interpretation of genotypes by masking the true position of alleles. Consequently, microsatellites with larger repeating subunits (dinucleotide and trinucleotide motifs) are used instead, which produce less stutter but are less genetically heterogeneous and less informative. A method to interpret the genotypes of mononucleotide repeats would permit the widespread use of those highly informative microsatellites in genetic research.
We have developed an approach to interpret genotypes of mononucleotide repeats using a software program named PeakSeeker. PeakSeeker interprets experimental electropherograms as the most likely product of signals from individual alleles. Because mononucleotide tracts demonstrate locus-specific patterns of stutter peaks, this approach requires that the genotype pattern from a single allele be defined for each marker, which can be approximated by genotyping single DNA molecules or homozygotes. We have evaluated the program's ability to discriminate various types of homozygous and heterozygous mononucleotide loci using simulated and experimental data.
Mononucleotide tracts offer significant advantages over di- and tri-nucleotide microsatellite markers traditionally employed in genetic research. The PeakSeeker algorithm provides a high-throughput means to type mononucleotide tracts using conventional and widely implemented fragment length polymorphism genotyping. Furthermore, the PeakSeeker algorithm could potentially be adapted to improve, and perhaps to standardize, the analysis of conventional microsatellite genotypes.
Microsatellites are short (1- to 5-bp), tandemly repeated DNA motifs that are useful as genetic markers because they display a high degree of polymorphism within populations [1–3]. Polymorphisms consist of differences in the number of repeat units contained in a microsatellite and arise from mutations during DNA replication, when subunits are inserted or deleted. Although the mutation rates of microsatellites are influenced by a variety of factors [5, 6], they tend to be inversely proportional to the length of the repeat unit [7, 8]. Accordingly, mononucleotide repeats, uninterrupted tracts of A/T or G/C, are the most susceptible to mutation and are the most polymorphic class of microsatellite. Polymorphisms at those sites are detectable even among somatic cells from the same individual [9, 10].
Stutter artifacts also complicate determining tract lengths by DNA sequencing, and even next-generation genomic sequencing technologies experience problems at mononucleotide runs. Although dedicated methods have been developed to detect single-base length differences in mononucleotide repeats, including mass spectrometry and primer-extension PCR [1, 23], none has come into common use.
The accessibility of highly informative mononucleotide microsatellites could be improved by a high-throughput means to detect single-base length polymorphisms using fragment length polymorphism genotyping, which is already in widespread use. Here we describe an approach to interpret the genotypes of mononucleotide repeats with the aid of an analysis algorithm, which we have named PeakSeeker.
PeakSeeker [Additional File 1: PeakSeeker_V1.zip] operates by interpreting the genotype of an autosomal mononucleotide repeat as the additive product of two homozygous genotypes, each corresponding to one of the two alleles. The program considers each homozygous and heterozygous combination of the two alleles' genotypes over the range where experimental data are present, and the relative amplification at each position is varied such that the additive product of the two allelic genotypes best fits the interrogated data. PeakSeeker scores each potential interpretation of the experimental genotype according to how well the additive product fits the experimental data and how realistic the required degree of unequal allelic amplification is. The interpretation with the best score is reported as the most probable one. To reduce "noise" from stochastic variability between genotypes, the program can average data from replicate genotypes to produce a "consensus" that is used for the analysis.
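The idea of fitting a genotype as the sum of two single-allele patterns can be illustrated with a minimal Python sketch. It is not the PeakSeeker implementation (which is written in Perl and R and described in Additional File 2); the function names, the least-squares fit, and the log-ratio penalty for unequal amplification are all assumptions chosen for illustration.

```python
import math
from itertools import combinations_with_replacement

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def shift(template, offset, length):
    """Place a single-allele peak pattern at `offset` in a trace of `length` positions."""
    out = [0.0] * length
    for i, height in enumerate(template):
        if 0 <= i + offset < length:
            out[i + offset] = height
    return out

def fit_weights(a, b, observed):
    """Least-squares amplitudes (wa, wb) for wa*a + wb*b; None unless both are positive."""
    aa, bb, ab = dot(a, a), dot(b, b), dot(a, b)
    ao, bo = dot(a, observed), dot(b, observed)
    det = aa * bb - ab * ab
    if det <= 1e-12:
        return None
    wa = (ao * bb - bo * ab) / det
    wb = (bo * aa - ao * ab) / det
    return (wa, wb) if wa > 0 and wb > 0 else None

def call_genotype(template, observed, max_offset, imbalance_penalty=0.05):
    """Return the allele offsets (i, j) whose summed patterns score best.

    The score is the residual sum of squares plus a hypothetical penalty
    on the log ratio of the two fitted allele amplitudes.
    """
    n = len(observed)
    best_score, best_call = math.inf, None
    for i, j in combinations_with_replacement(range(max_offset + 1), 2):
        a, b = shift(template, i, n), shift(template, j, n)
        if i == j:  # homozygous candidate: one pattern, one amplitude
            if dot(a, a) == 0:
                continue
            w = max(0.0, dot(a, observed) / dot(a, a))
            model = [w * x for x in a]
            penalty = 0.0
        else:       # heterozygous candidate: two patterns, two amplitudes
            weights = fit_weights(a, b, observed)
            if weights is None:
                continue
            wa, wb = weights
            model = [wa * x + wb * y for x, y in zip(a, b)]
            penalty = imbalance_penalty * abs(math.log(wa / wb))
        score = sum((m - o) ** 2 for m, o in zip(model, observed)) + penalty
        if score < best_score:
            best_score, best_call = score, (i, j)
    return best_call

# Usage: a made-up stutter pattern and a heterozygote whose alleles differ by one base.
template = [0.2, 0.6, 1.0, 0.5, 0.1]
trace = [x + 0.8 * y for x, y in zip(shift(template, 1, 9), shift(template, 2, 9))]
print(call_genotype(template, trace, max_offset=4))
```

In this sketch the penalty weight sets how strongly a heterozygous call with very unequal peak heights is disfavored relative to a homozygous call; the actual program derives that trade-off from an unequal allelic amplification prior.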
For full description of the program's workflow and scoring mechanism see [Additional File 2: MethodsSupplement.pdf].
(b) Simulated Genotypes
To simulate various polymorphisms, we combined peak patterns of single-molecule genotypes from mononucleotide microsatellites, which themselves closely approximate the genotypes of single alleles. For four experimental loci, we superimposed genotypes of two simulated alleles differing in length by 0 to 4 bases, and simulated electropherograms representing the additive product of the alleles. Unequal amplification of alleles was modeled by varying the maximum height of each allele according to likelihoods from the unequal allelic amplification prior. To model the effects of novel insertion and deletion mutations occurring during PCR amplification, individual peaks in the genotypes of simulated alleles were allowed to vary in height from single-molecule genotypes by 2×(± 0.0143 (δ = 0.00679)) of the maximal peak's height, corresponding to the distribution of signal intensities expected from mutated PCR products. To represent variability introduced by the electrophoresis process, peaks were then modified by 2×(± 0.0052 (δ = 0.00744)) of the maximal peak's height. One hundred simulations were produced for each of the four experimental loci, and the fraction of correct calls was calculated by the Maximum Likelihood Estimate, using the exact method for calculation of 95% confidence intervals.
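The simulation procedure can be sketched roughly as follows. This is an illustrative reading, not the authors' code: the perturbation is interpreted as a randomly signed draw from a normal distribution with the stated mean and spread, doubled and scaled by the tallest peak, and the uniform amplification ratio stands in for the unequal allelic amplification prior, which is not specified here. The function names and the example stutter pattern are hypothetical.

```python
import random

def perturb(trace, mean, spread, rng):
    """Jitter each non-zero peak by 2 * (+/- m) * tallest peak, with m ~ Normal(mean, spread)."""
    top = max(trace)
    out = []
    for height in trace:
        if height > 0:
            delta = 2 * rng.choice((-1, 1)) * rng.gauss(mean, spread) * top
            height = max(0.0, height + delta)
        out.append(height)
    return out

def simulate_genotype(template, offset_a, offset_b, length, rng):
    """Superimpose two noisy copies of a single-allele pattern.

    Assumes offset + len(template) <= length for both alleles.
    """
    ratio = rng.uniform(0.5, 1.0)  # assumed stand-in for the amplification prior
    alleles = []
    for offset, scale in ((offset_a, 1.0), (offset_b, ratio)):
        trace = [0.0] * length
        for i, height in enumerate(template):
            trace[offset + i] = scale * height
        # PCR-derived variation, applied to each allele separately
        alleles.append(perturb(trace, 0.0143, 0.00679, rng))
    combined = [x + y for x, y in zip(*alleles)]
    # Electrophoresis-derived variation, applied to the summed trace
    return perturb(combined, 0.0052, 0.00744, rng)

# Usage: 100 simulated heterozygotes whose alleles differ by two bases.
rng = random.Random(2009)
template = [0.2, 0.6, 1.0, 0.5, 0.1]  # made-up single-molecule pattern
simulations = [simulate_genotype(template, 0, 2, 9, rng) for _ in range(100)]
```

Seeding the generator makes a simulated data set reproducible, which is convenient when comparing calling accuracy across parameter settings.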
(c) Experimental Data
Genomic DNA from ten passaged subclones of the NIH 3T3 (ATCC) cell line, previously reported, was genotyped [see Additional File 2: MethodsSupplement.pdf] at four tracts (Loci 188, 321, 502, and 1292) [see Additional File 3: Table 1.xls]. Six independent genotypes were produced for each locus/subclone pair. The correct genotype interpretation was established by manual analysis of genotypes using the D-value metric, an arithmetic method of determining mononucleotide repeat heterozygosity or homozygosity. Subsets of one to six of the replicates were randomly sampled and used as the basis for genotype interpretation by PeakSeeker. Summary statistics were calculated as before.
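The replicate sampling and summary statistics can be sketched as below, assuming that the "exact method" refers to the Clopper-Pearson binomial interval and that a consensus is a position-by-position mean of replicate traces. Both are assumptions, and the function names are hypothetical.

```python
import random
from math import comb

def consensus(replicates):
    """Average replicate traces position by position."""
    return [sum(column) / len(column) for column in zip(*replicates)]

def sample_consensus(replicates, k, rng):
    """Consensus of k replicates drawn at random without replacement."""
    return consensus(rng.sample(replicates, k))

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def solve_decreasing(f, target):
    """Bisection for f(p) = target on [0, 1], where f is decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(correct, trials, alpha=0.05):
    """Maximum likelihood estimate and Clopper-Pearson interval for a proportion."""
    lower = 0.0 if correct == 0 else solve_decreasing(
        lambda p: binom_cdf(correct - 1, trials, p), 1 - alpha / 2)
    upper = 1.0 if correct == trials else solve_decreasing(
        lambda p: binom_cdf(correct, trials, p), alpha / 2)
    return correct / trials, lower, upper

# Usage: consensus of three of six replicates, then the interval for 59 correct calls in 60.
rng = random.Random(1)
replicates = [[0.1 * r, 1.0, 0.5] for r in range(6)]  # made-up traces
print(sample_consensus(replicates, 3, rng))
print(exact_interval(59, 60))
```

The exact interval is wider than a normal approximation when the fraction of correct calls is close to 1, which is the regime of interest for a genotype caller.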
Results and discussion
As a functional test, we genotyped ten passaged subclones from a diploid mouse cell line. To establish the proper genotype interpretation for each sample, we interrogated genotypes manually using an arithmetic method unrelated to the PeakSeeker approach. Manual analysis revealed that four of the five microsatellites were polymorphic for multiple isolates, and that homozygous alleles and heterozygous alleles separated by differences of one base were represented. We then evaluated how frequently PeakSeeker correctly interpreted the electropherograms (Figure 4B). Again, the accuracy of PeakSeeker increased with the number of replicate genotypes used as the basis for data interpretation, although accuracies were lower than those obtained with simulated data because three sample/locus pairs showed high rates of PCR error and were frequently called incorrectly. As before, locus 1292 yielded the highest accuracy, with 98.3% correct calls when only one replicate was provided.
In two instances the results of PeakSeeker analysis and manual data analysis did not agree. In both cases, the PeakSeeker interpretation demonstrated that the electropherograms were significantly different from those expected under the manual calls, and the PeakSeeker call was therefore accepted as correct.
For the typical panel of mononucleotide microsatellites we examined here, PeakSeeker has proven well suited to interpreting genotypes in which the overlap of alleles is greatest, which are also the most difficult cases to call by eye. Thus, the program serves as a valuable augmentation to manual analysis, and can substantially increase throughput. However, if markers are selected for either limited stutter or asymmetric stutter peaks, the program should autonomously achieve perfect accuracy with only a limited number of replicates. The PeakSeeker algorithm could potentially be adapted to improve, and perhaps to standardize, the analysis of conventional microsatellites.
Availability and requirements
Project name: PeakSeeker v1.0
Project home page: None, program attached as supplementary information.
Operating system: Platform independent (tested on Linux and OS X)
Programming language: Perl, R Package for Statistics
Other requirements: GeneMapper v4.0 software (ABI) or equivalent, the R Project for Statistical Computing
License: Source code and executables are freely available for academic users.
Any restrictions to use by non-academics: License required
We thank Brian Schultz for his contributions to earlier drafts of the program, and Marshall Horwitz for his suggestions for the manuscript. JT was supported by NIH T32GM07735; SJS was supported by NIH F30AG030316, NIH DP1OD003278, NIH T32GM007266, and ARCS Fellowship grants to the University of Washington Medical Scientist Training Program.
- Cohen H, Danin-Poleg Y, Cohen CJ, Sprecher E, Darvasi A, Kashi Y: Mono-nucleotide repeats (MNRs): a neglected polymorphism for generating high density genetic maps in silico. Hum Genet. 2004, 115 (3): 213-220. 10.1007/s00439-004-1135-5.
- Mukherjee M, Minal V, Mittal RD, Mittal B: Allelic variation of BAT-26 and BAT-40 poly-adenine repeat loci in North Indians. Int J Mol Med. 2002, 9 (1): 91-94.
- Hughes CR, Queller DC: Detection of highly polymorphic microsatellite loci in a species with little allozyme polymorphism. Mol Ecol. 1993, 2 (3): 131-137. 10.1111/j.1365-294X.1993.tb00102.x.
- Streisinger G, Okada Y, Emrich J, Newton J, Tsugita A, Terzaghi E, Inouye M: Frameshift mutations and the genetic code. This paper is dedicated to Professor Theodosius Dobzhansky on the occasion of his 66th birthday. Cold Spring Harb Symp Quant Biol. 1966, 31: 77-84.
- Wells RD, Dere R, Hebert ML, Napierala M, Son LS: Advances in mechanisms of genetic instability related to hereditary neurological diseases. Nucleic Acids Res. 2005, 33 (12): 3785-3798. 10.1093/nar/gki697.
- Zhang L, Yu J, Willson JK, Markowitz SD, Kinzler KW, Vogelstein B: Short mononucleotide repeat sequence variability in mismatch repair-deficient cancers. Cancer Res. 2001, 61 (9): 3801-3805.
- Lee JS, Hanford MG, Genova JL, Farber RA: Relative stabilities of dinucleotide and tetranucleotide repeats in cultured mammalian cells. Hum Mol Genet. 1999, 8 (13): 2567-2572. 10.1093/hmg/8.13.2567.
- Boyer JC, Yamada NA, Roques CN, Hatch SB, Riess K, Farber RA: Sequence dependent instability of mononucleotide microsatellites in cultured mismatch repair proficient and deficient mammalian cells. Hum Mol Genet. 2002, 11 (6): 707-713. 10.1093/hmg/11.6.707.
- Salipante SJ, Horwitz MS: Phylogenetic fate mapping. Proc Natl Acad Sci USA. 2006, 103 (14): 5448-5453. 10.1073/pnas.0601265103.
- Salipante SJ, Thompson JM, Horwitz MS: Phylogenetic fate mapping: theoretical and experimental studies applied to the development of mouse fibroblasts. Genetics. 2008, 178 (2): 967-977. 10.1534/genetics.107.081018.
- Clarke LA, Rebelo CS, Goncalves J, Boavida MG, Jordan P: PCR amplification introduces errors into mononucleotide and dinucleotide repeat sequences. Mol Pathol. 2001, 54 (5): 351-353. 10.1136/mp.54.5.351.
- Palsson B, Palsson F, Perlin M, Gudbjartsson H, Stefansson K, Gulcher J: Using quality measures to facilitate allele calling in high-throughput genotyping. Genome Res. 1999, 9 (10): 1002-1012. 10.1101/gr.9.10.1002.
- Kuligina ES, Grigoriev MY, Suspitsin EN, Buslov KG, Zaitseva OA, Yatsuk OS, Lazareva YR, Togo AV, Imyanitov EN: Microsatellite instability analysis of bilateral breast tumors suggests treatment-related origin of some contralateral malignancies. J Cancer Res Clin Oncol. 2007, 133 (1): 57-64. 10.1007/s00432-006-0146-0.
- Sammalkorpi H, Alhopuro P, Lehtonen R, Tuimala J, Mecklin JP, Jarvinen HJ, Jiricny J, Karhu A, Aaltonen LA: Background mutation frequency in microsatellite-unstable colorectal cancer. Cancer Res. 2007, 67 (12): 5691-5698. 10.1158/0008-5472.CAN-06-4314.
- Bacher JW, Abdel Megid WM, Kent-First MG, Halberg RB: Use of mononucleotide repeat markers for detection of microsatellite instability in mouse tumors. Mol Carcinog. 2005, 44 (4): 285-292. 10.1002/mc.20146.
- Hoang JM, Cottu PH, Thuille B, Salmon RJ, Thomas G, Hamelin R: BAT-26, an indicator of the replication error phenotype in colorectal cancers and cell lines. Cancer Res. 1997, 57 (2): 300-303.
- Bacher JW, Flanagan LA, Smalley RL, Nassif NA, Burgart LJ, Halberg RB, Megid WM, Thibodeau SN: Development of a fluorescent multiplex assay for detection of MSI-High tumors. Dis Markers. 2004, 20 (4–5): 237-250.
- Murphy KM, Zhang S, Geiger T, Hafez MJ, Bacher J, Berg KD, Eshleman JR: Comparison of the microsatellite instability analysis system and the Bethesda panel for the determination of microsatellite instability in colorectal cancers. J Mol Diagn. 2006, 8 (3): 305-311. 10.2353/jmoldx.2006.050092.
- Vilkki S, Launonen V, Karhu A, Sistonen P, Vastrik I, Aaltonen LA: Screening for microsatellite instability target genes in colorectal cancers. J Med Genet. 2002, 39 (11): 785-789. 10.1136/jmg.39.11.785.
- Wong YF, Cheung TH, Lo KW, Yim SF, Chan LK, Buhard O, Duval A, Chung TK, Hamelin R: Detection of microsatellite instability in endometrial cancer: advantages of a panel of five mononucleotide repeats over the National Cancer Institute panel of markers. Carcinogenesis. 2006, 27 (5): 951-955. 10.1093/carcin/bgi333.
- Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 2007, 8 (7): R143. 10.1186/gb-2007-8-7-r143.
- Bonk T, Humeny A, Gebert J, Sutter C, von Knebel Doeberitz M, Becker CM: Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry-based detection of microsatellite instabilities in coding DNA sequences: a novel approach to identify DNA-mismatch repair-deficient cancer cells. Clin Chem. 2003, 49 (4): 552-561. 10.1373/49.4.552.
- Sun X, Liu Y, Lutterbaugh J, Chen WD, Markowitz SD, Guo B: Detection of mononucleotide repeat sequence alterations in a large background of normal DNA for screening high-frequency microsatellite instability cancers. Clin Cancer Res. 2006, 12 (2): 454-459. 10.1158/1078-0432.CCR-05-0919.