Optimization of sequence alignment for simple sequence repeat regions
© Jighly et al; licensee BioMed Central Ltd. 2011
Received: 18 February 2011
Accepted: 20 July 2011
Published: 20 July 2011
Microsatellites, or simple sequence repeats (SSRs), are tandemly repeated DNA sequences, made up of tandem copies of specific motifs no longer than six bases, that are distributed throughout the genome. SSRs are used as molecular markers because they are easy to detect and serve a range of applications, including genetic diversity studies, genome mapping, and marker-assisted selection. They are also highly mutable because of DNA polymerase slippage during replication. This type of mutation raises the frequency of insertion/deletion (INDEL) mutations in SSRs above that seen with other types of molecular markers, such as single nucleotide polymorphisms (SNPs).
In general, SNPs are more frequent than INDELs. Therefore, existing sequence alignment algorithms are designed to fit the vast majority of genomic sequence and do not treat microsatellite regions as unique sequences that require special consideration. Applied to SSRs, these conventional algorithms are of limited use because they produce many overlaps between different repeat units, which results in false evolutionary relationships.
To overcome this limitation of alignment algorithms when dealing with SSR loci, a new algorithm was developed and implemented as a PERL script with a Tk graphical interface. The program aligns sequences after first determining the repeated units and the positions of the first and last SSR nucleotides. The repeats are then shifted according to the type of the inserted repeated unit.
When the phylogenetic relationships were studied before and after applying the new algorithm, many differences in the trees were obtained as SSR length and complexity increased. However, less distance between different lineages was observed after applying the new algorithm.
The new algorithm produces better estimates when aligning SSR loci because it reflects more reliable evolutionary relationships between different lineages. It reduces overlapping during SSR alignment, which results in more realistic phylogenetic relationships.
Microsatellites, or simple sequence repeats (SSRs), are tandemly repeated DNA sequences with a period of 1 to 6 base pairs. They are sometimes referred to as variable number tandem repeats (VNTRs). An SSR that contains one type of repeat is called a simple SSR (e.g. (CA)15), and one that contains more than one type is called a compound SSR (e.g. (CA)8(CG)12). The repeat units are generally di-, tri-, tetra- or pentanucleotides. SSRs are commonly found in non-coding regions of the genome.
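The motif notation used above, such as (CA)15 or (CA)8(CG)12, can be made concrete with a short Python sketch. The function names are hypothetical and are not part of the published tool. The sketch assumes that motifs are written in parentheses, use only the letters A, C, G and T, and are followed by an integer copy number.

```python
import re


def parse_ssr_notation(notation):
    """Turn a string such as '(CA)8(CG)12' into [('CA', 8), ('CG', 12)]."""
    blocks = re.findall(r"\(([ACGT]+)\)(\d+)", notation.upper())
    if not blocks:
        raise ValueError(f"no repeat blocks found in {notation!r}")
    return [(unit, int(count)) for unit, count in blocks]


def expand_ssr(notation):
    """Write out the full nucleotide sequence described by the notation."""
    return "".join(unit * count for unit, count in parse_ssr_notation(notation))


def classify_ssr(notation):
    """Return 'simple' for one repeat type and 'compound' for more than one."""
    units = {unit for unit, _ in parse_ssr_notation(notation)}
    return "simple" if len(units) == 1 else "compound"


print(classify_ssr("(CA)15"), len(expand_ssr("(CA)15")))
print(classify_ssr("(CA)8(CG)12"), len(expand_ssr("(CA)8(CG)12")))
```

Under these assumptions, (CA)15 expands to a 30 bp simple SSR and (CA)8(CG)12 to a 40 bp compound SSR.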
SSRs are highly mutable loci. In animals, observed SSR mutation rates have been of the order of 10^-3 to 10^-4 for autosomal repeat loci [4, 5] (Wiessenbach et al. 1992; Weber and Wong 1993). However, the average mutation rate at SSR loci is 10^-2 per generation.
Chistiakov et al. suggested that two mechanisms are responsible for the high mutability of SSRs: first, motif repetition makes SSRs prone to DNA polymerase slippage during replication because of the multiple complementary sequences; second, unequal crossing over or related processes [8–11]. The slippage rate is correlated with SSR length, which makes longer SSRs more variable than shorter ones [12, 13]. However, there is no threshold length for slippage mutations. Mutations caused by polymerase slippage can be considered a special type of insertion/deletion (INDEL) mutation, in which sequence is added or removed without any substitution. Substitutions are another kind of mutation, known as single nucleotide polymorphisms (SNPs). In general, SNPs occur much more frequently than INDELs. However, SSR replication slippage generates more genetic change in eukaryotes per generation than all base substitutions combined, so it increases the frequency of INDELs. In addition, perfect SSR motifs have been reported to be significantly more variable than imperfect repeat motifs [17, 18].
The power of SSR regions relies on their high abundance in the genome, codominant nature, extensive genome coverage, and high polymorphism. The polymorphism of an SSR depends on differences in the number of repeated units between alleles at a single locus. SSRs are used as molecular markers in a wide range of applications, such as genome mapping, marker-assisted selection, gene tagging, and evolutionary and diversity studies. The main feature that makes SSRs amenable to use as molecular markers is that their flanking regions are highly conserved, allowing specific PCR primers to amplify the same SSR even across different taxa [21, 22].
Sequence alignment involves identifying the correct locations of the INDELs that have occurred since the sequences diverged from a common ancestor. A true alignment accurately reflects the evolutionary relationships between the sequences. In a compound SSR region, however, a general alignment shows many overlaps between the different repeat units, which seems biologically incorrect given the rate of replication slippage mutations. This suggests a need to re-evaluate the general alignment methods and their parameters. In this paper, we propose that a correct alignment should place the repeats separately, without overlaps between them and without changing the alignment parameters. We suggest incorporating a simple algorithm that shifts SSR loci after the usual alignment has been produced by regular software.
1- Inputs:
   - Data set file
   - SSR length (first and last nucleotide)
2- Identify the sequences that do not match the first repeated unit from the beginning of the selected SSR region
3- For each repeated unit:
   - Put the tandem repeat in a temporary array
   - Check if the next nucleotides match the next repeated unit
   - If not, put the unmatched nucleotides in another temporary array
   - Fill the gaps to the longest sequence of the repeats in the same array
   - Merge the temporary arrays
4- Replace the SSR region with the results.
See Additional file 1 (SALT.swf) for an animation that describes the algorithm.
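The steps listed above can be sketched in Python as follows. This is a minimal illustration written for this text, not the authors' PERL implementation. The function names, the dictionary-based alignment, the 1-based inclusive `first`/`last` positions, the ordered `units` list and the toy sequences are all assumptions. The choice to pad repeat runs on the right and unmatched nucleotides on the left follows the placement of imperfect nucleotides after the gap, as discussed under Testing and Implementation.

```python
def split_ssr(region, units, gap="-"):
    """Split one SSR region into [prefix, run_1, rest_1, run_2, rest_2, ...].

    `units` is the ordered, non-empty list of repeated units (an assumed input).
    """
    seq = region.replace(gap, "").upper()
    # Step 2: nucleotides that precede the first copy of the first unit.
    start = seq.find(units[0])
    if start == -1:
        start = len(seq)
    blocks = [seq[:start]]
    pos = start
    for i, unit in enumerate(units):
        # Step 3: collect the tandem copies of this unit.
        run_start = pos
        while seq.startswith(unit, pos):
            pos += len(unit)
        blocks.append(seq[run_start:pos])
        # Nucleotides that do not match the next unit go into their own block.
        if i + 1 < len(units):
            nxt = seq.find(units[i + 1], pos)
            rest_end = len(seq) if nxt == -1 else nxt
        else:
            rest_end = len(seq)
        blocks.append(seq[pos:rest_end])
        pos = rest_end
    return blocks


def shift_ssr(alignment, first, last, units, gap="-"):
    """Re-align the SSR region of every sequence block by block."""
    split = {name: split_ssr(seq[first - 1:last], units, gap)
             for name, seq in alignment.items()}
    n_blocks = 1 + 2 * len(units)
    # Fill gaps up to the longest block found in any sequence.
    widths = [max(len(blocks[i]) for blocks in split.values())
              for i in range(n_blocks)]
    shifted = {}
    for name, seq in alignment.items():
        padded = []
        for i, block in enumerate(split[name]):
            if i % 2:  # odd index: a repeat run, gaps go after it
                padded.append(block.ljust(widths[i], gap))
            else:      # even index: unmatched nucleotides, gaps go before
                padded.append(block.rjust(widths[i], gap))
        # Step 4: put the merged blocks in place of the SSR region.
        shifted[name] = seq[:first - 1] + "".join(padded) + seq[last:]
    return shifted


# Hypothetical toy alignment: flank + (CA)n(CG)m + flank, equal total lengths.
alignment = {
    "allele1": "TTG" + "CA" * 4 + "CG" * 3 + "AAT",
    "allele2": "TTG" + "CA" * 2 + "CG" * 5 + "AAT",
}
shifted = shift_ssr(alignment, first=4, last=17, units=["CA", "CG"])
for name, seq in shifted.items():
    print(name, seq)
# allele1 TTGCACACACACGCGCG----AAT
# allele2 TTGCACA----CGCGCGCGCGAAT
```

In the toy input the two alleles have the same length, so an ungapped alignment places CA units of one allele against CG units of the other. After shifting, each repeat type occupies its own columns and the length difference appears as gaps.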
Testing and Implementation
The main limitation of the new algorithm lies in determining the gap position when it is applied to an imperfect SSR. According to Kruglyak and Brandström, imperfect repeats within an SSR region reduce the occurrence of slippage: the imperfection disrupts the tandem nature of the SSR and stabilizes the region by inhibiting replication slippage, because the bases do not find their complementary bases during replication. Nevertheless, the best place for the imperfect nucleotides within a compound SSR is after the slippage site (the gap) and before the sequence that follows the SSR or the next repeated unit (Figure 7).
We can deduce from the last examples that (1) the new algorithm could be a powerful tool for compound SSRs, but less so for simple SSRs, (2) it increases the similarity between sequences during alignment by minimizing the overlaps between different repeated units, and (3) it might be necessary to apply it to sequences containing long and complicated SSRs.
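Point (2) can be illustrated with a toy calculation of the proportion of differing sites (p-distance) between two aligned sequences. The sequences below are hypothetical and are not taken from the data set used in this work. Skipping columns that contain a gap is an assumption made here, since the text does not state how gaps were treated when distances were computed.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites, ignoring columns where either sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumed treatment: gapped columns are not compared
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


# Hypothetical alleles: flank + (CA)4(CG)3 + flank and flank + (CA)2(CG)5 + flank.
# Before shifting: equal lengths, so CA units sit opposite CG units.
before_1 = "TTG" + "CA" * 4 + "CG" * 3 + "AAT"
before_2 = "TTG" + "CA" * 2 + "CG" * 5 + "AAT"

# After shifting: each repeat type has its own columns, padded with gaps.
after_1 = "TTG" + "CA" * 4 + "CG" * 3 + "-" * 4 + "AAT"
after_2 = "TTG" + "CA" * 2 + "-" * 4 + "CG" * 5 + "AAT"

print(p_distance(before_1, before_2))  # 2 mismatches in 20 compared sites
print(p_distance(after_1, after_2))    # 0 mismatches in 16 compared sites
```

In this toy case the overlapping alignment reports two apparent substitutions, whereas the shifted alignment reports none and represents the difference purely as repeat-length INDELs.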
SSR alignment tool (SALT)
The first line contains the number of samples, followed by any kind of separator (a space, a tab, etc.) and then the number of nucleotides.
Each subsequent line contains the name of an allele, followed by any kind of separator and then the sequence; a line break (the Enter key) starts the next allele.
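A reader for the input layout described above might look like the following Python sketch. It is not the parser used inside SALT. The function name is hypothetical, and the sketch assumes that the separator is whitespace, that allele names contain no whitespace, and that the declared number of nucleotides equals the length of every sequence, gaps included.

```python
def parse_salt_input(text):
    """Parse a data set: a header line 'n_samples n_nucleotides', then 'name sequence' lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    n_samples, n_sites = int(header[0]), int(header[1])

    alleles = {}
    for line in lines[1:]:
        name, sequence = line.split(None, 1)
        # Drop any whitespace inside the sequence field.
        alleles[name] = "".join(sequence.split())

    # Check the data against the counts declared in the header.
    if len(alleles) != n_samples:
        raise ValueError(f"expected {n_samples} alleles, found {len(alleles)}")
    for name, sequence in alleles.items():
        if len(sequence) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, found {len(sequence)}")
    return alleles


# Hypothetical two-allele data set, one tab-separated and one space-separated line.
example = "2 8\nalleleA\tCACACGCG\nalleleB CACA--CG\n"
print(parse_salt_input(example))
```

Validating the header counts early makes a truncated or mis-pasted data set fail with a clear message before any alignment step runs.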
See Additional file 2 (SALT.rar), a compressed file containing the program and the sample data used in this research.
SALT is a new tool, based on the shifting algorithm proposed in this paper, for overcoming the limitations of aligning SSR loci. It is essential when aligning compound or imperfect SSRs, which show many overlaps between repeated units when aligned using the usual methods. The tool gives a better alignment estimate for such regions.
Materials and methods
[Table: five microsatellite motifs that vary in type and length, representing most SSR types in genome sequences; columns include SSR length (bp).]
- Tautz D: Hypervariability of simple sequences as a general source for polymorphic DNA markers. Nucl Acids Res. 1989, 17: 6563-6571.
- Peakall R, Gilmore S, Keys W, Morgante M, Rafalski A: Cross-species amplification of soybean (Glycine max) simple sequence repeats (SSRs) within the genus and other legume genera: implications for the transferability of SSRs in plants. Mol Biol Evol. 1998, 15: 1275-1287.
- Gow C, Noble JL, Rollinson D, Jones C: A high incidence of clustered microsatellite mutations revealed by parent-offspring analysis in an African freshwater snail, Bulinus forskalii (Gastropoda, Pulmonata). Genetica. 2005, 124: 77-83. 10.1007/s10709-005-0204-6.
- Wiessenbach J, Gyapay G, Dib C, Vignal A, Moresette J: A second generation map of the human genome. Nature. 1992, 359: 794-801. 10.1038/359794a0.
- Weber J, Wong C: Mutation of human short tandem repeats. Hum Mol Genet. 1993, 2: 1123-1128. 10.1093/hmg/2.8.1123.
- Ellegren H: Microsatellite mutations in the germline: implications for evolutionary inference. Trends Genet. 2000, 16: 551-558. 10.1016/S0168-9525(00)02139-9.
- Chistiakov DA, Hellemans B, Haley CS, Law AS, Tsigenopoulos CS, Kotoulas G, Bertotto D, Libertini A, Volckaert FA: A microsatellite linkage map of the European sea bass Dicentrarchus labrax L. Genetics. 2006, 170: 1821-1826.
- Jakupciak JP, Wells RD: Genetic instabilities of triplet repeat sequences by recombination. IUBMB Life. 2000, 50: 355-359.
- Ellegren H: Microsatellites: simple sequences with complex evolution. Nat Rev Genet. 2004, 5: 435-445.
- Armour JAL, Alegre SA, Miles S, Williams LJ, Badge RM: Minisatellites and mutation processes in tandemly repetitive DNA. Microsatellites Evolution and Applications. Edited by: Goldstein DB, Schlötterer C. 1999, Oxford: Oxford University Press, 24-33.
- Kelkar YD, Tyekucheva S, Chiaromonte F, Makova KD: The genome-wide determinants of human and chimpanzee microsatellite evolution. Genome Res. 2008, 18 (1): 30-38.
- Whittaker JC, Harbord RM, Boxall N, Mackay I, Dawson G, Sibly RM: Likelihood-based estimation of microsatellite mutation rates. Genetics. 2003, 164 (2): 781-787.
- Sainudiin R, Durrett RT, Aquadro CF, Nielsen R: Microsatellite mutation models: insights from a comparison of humans and chimpanzees. Genetics. 2004, 168 (1): 383-395. 10.1534/genetics.103.022665.
- Leclercq S, Rivals E, Jarne P: DNA slippage occurs at microsatellite loci without minimal threshold length in humans: a comparative genomic approach. Genome Biol Evol. 2010, 2: 325-335. 10.1093/gbe/evq023.
- Zhang Z, Gerstein M: Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic Acids Res. 2003, 31: 5338-5348. 10.1093/nar/gkg745.
- Bell GI: Evolution of simple sequence repeats. Comput Chem. 1996, 20: 41-48. 10.1016/S0097-8485(96)80006-4.
- Kruglyak S, Durrett R, Schug D, Aquadro C: Equilibrium distributions of microsatellite repeat length resulting from a balance between slippage events and point mutations. Proc Natl Acad Sci USA. 1998, 95: 10774-10778. 10.1073/pnas.95.18.10774.
- Brandström M, Ellegren H: Genome-wide analysis of microsatellite polymorphism in chicken circumventing the ascertainment bias. Genome Res. 2008, 18: 881-887. 10.1101/gr.075242.107.
- Powell W, Machray GC, Provan J: Polymorphism revealed by simple sequence repeats. Trends Plant Sci. 1996, 1: 215-222.
- Kantety RV, Rota ML, Matthews DE, Sorrells ME: Data mining for simple sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and wheat. Plant Molecular Biol. 2002, 48: 501-510. 10.1023/A:1014875206165.
- Santibanez-Koref MF, Gangeswaran R, Hancock JM: A relationship between lengths of microsatellites and nearby substitution rates in mammalian genomes. Mol Biol Evol. 2001, 18 (11): 2119-2123.
- Sekar M, Suresh E, Kumar NS, Mayak SK, Balakrishna C: Microsatellite DNA markers, a fisheries perspective. Aquaculture Asia Magazine. 2009, 27-29.
- Tamura K, Dudley J, Nei M, Kumar S: MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol Biol Evol. 2007, 24: 1596-1599. 10.1093/molbev/msm092.
- PERL, v5.8.8. Copyright 1987-2006, Larry Wall. Binary build provided by ActiveState [http://www.ActiveState.com]. The Perl Home Page [http://www.perl.org/]
- Tk, the extension that makes GUI programming in PERL possible. University of California, Berkeley, Tcl (Tool Command Language) and Tk (ToolKit) was created by Professor John Ousterhout, [http://www.tcl.tk/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/2.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.