Development and characterization of highly polymorphic long TC repeat microsatellite markers for genetic analysis of peanut

Background: Peanut (Arachis hypogaea L.) is a crop of economic and social importance, mainly in tropical areas and developing countries. Its molecular breeding has been hindered by a shortage of polymorphic genetic markers due to a very narrow genetic base. Microsatellites (SSRs) are markers of choice in peanut because they are co-dominant, highly transferable between species and easily applicable in the allotetraploid genome. In spite of substantial effort over the last few years by a number of research groups, the number of SSRs that are polymorphic for A. hypogaea is still limiting for routine application, creating the demand for the discovery of more markers polymorphic within cultivated germplasm.

Findings: A plasmid genomic library enriched for TC/AG repeats was constructed and 1401 clones were sequenced. From the sequences obtained, 146 primer pairs flanking mostly TC microsatellites were developed. The average number of repeat motifs amplified was 23. These 146 markers were characterized on 22 genotypes of cultivated peanut. In total, 78 of the markers were polymorphic within cultivated germplasm. Most of those 78 markers were highly informative, with an average of 5.4 alleles per locus being amplified. Average gene diversity index (GD) was 0.6, and 66 markers showed a GD of more than 0.5. Genetic relationship analysis was performed and corroborated the current taxonomical classification of A. hypogaea subspecies and varieties.

Conclusions: The microsatellite markers described here are a useful resource for genetics and genomics in Arachis. In particular, the 66 markers that are highly polymorphic in cultivated peanut are a significant step towards routine genetic mapping and marker-assisted selection for the crop.


Background
Peanut (Arachis hypogaea L.) is an oil crop of great importance in the tropics: in Africa, its production is comparable to that of all other grain legumes put together, and in Asia it provides about the same number of calories as soya (FAO, 2009). Its genetic base is narrow because of its recent origin through tetraploidization [1,2], and this has hindered the application of molecular breeding in this crop.
Microsatellites, or simple sequence repeats (SSRs), are useful molecular markers: they are abundant, highly dispersed throughout the genomes of eukaryotes, and locus specific. In addition, they are ideal markers for genotyping allotetraploid species such as peanut, since they are usually co-dominant and multi-allelic. They are considered suitable tools for genetic diversity studies, genetic linkage mapping, and plant breeding programs [3].
Over the past years several research groups have put considerable effort into developing SSR markers for the genus Arachis in general and cultivated peanut in particular. About 5,000 SSR markers have now been published [4-21]. These markers have been used mainly for diversity studies of germplasm and for genetic mapping [10,11,22-28]. However, in spite of the number of markers available, the very low polymorphism observed within cultivated germplasm means that large-scale marker screening is required to identify sufficient polymorphic markers even for low density genetic maps in populations derived from cultivated × cultivated crosses. For example, in spite of extensive marker screening, the published SSR-based maps of cultivated peanut have only 131, 135 and 175 SSR markers [22,23,28]. In a previous study, we observed that AG/TC microsatellites were more polymorphic than AC/TG ones and that, for cultivated germplasm, the highest polymorphism was observed for microsatellites with 21-25 motif repetitions [10]. In this context, we isolated and characterized long repeat AG/TC SSRs in an effort to develop markers with high polymorphism levels for cultivated peanut.

Findings
Sequencing
Sequences were obtained for 1401 cloned genomic fragments. Most fragments were sequenced in both forward and reverse orientations. Of these 1401 cloned fragments, 65 harbored sequences very similar to already published markers and so were excluded from further analysis (≥ 50% of the sequence with BLAST-detected similarity at an E-value ≤ 1e-40). Of the remaining sequences, 193 harbored microsatellite repeats. As expected, most were TC/AG repeats. The 143 unique SSR sequences were deposited in GenBank (accession numbers JN887491 to JN887636).
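The redundancy criterion above can be illustrated with a short Python sketch. It is not the pipeline used in the study: the `Hit` record, its field names and the example clones are hypothetical, and the sketch assumes that "≥ 50% of sequence" means the fraction of the cloned sequence covered by the alignment.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Best BLAST hit of one clone against published markers (hypothetical record)."""
    query_len: int    # length of the cloned sequence, in bp
    aligned_len: int  # number of query bases covered by the alignment, in bp
    evalue: float     # BLAST E-value of the hit


def is_redundant(hit, min_fraction=0.5, max_evalue=1e-40):
    """Return True when the clone would be excluded under the stated thresholds."""
    covered = hit.aligned_len / hit.query_len
    return covered >= min_fraction and hit.evalue <= max_evalue


# Made-up example clones, for illustration only.
hits = {
    "clone_A": Hit(query_len=400, aligned_len=320, evalue=1e-80),  # long, strong hit
    "clone_B": Hit(query_len=400, aligned_len=120, evalue=1e-60),  # strong but short
    "clone_C": Hit(query_len=400, aligned_len=300, evalue=1e-10),  # long but weak
}
kept = [name for name, hit in hits.items() if not is_redundant(hit)]
print(kept)
```

Only clones that satisfy both thresholds are dropped, so a short but highly significant hit (clone_B) is retained.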

Design of flanking primer pairs
Of the 193 selected sequences, 135 were appropriate for primer design. Some sequences contained multiple microsatellite repeats that could not be flanked by a single primer pair. Therefore, in total 146 primer pairs were designed. The microsatellites amplified were generally long, the average number of motif repeats being 23.

Polymorphism levels
All 146 primer pairs amplified PCR products of the expected size. Of these, 85 were polymorphic within the tetraploid samples (cultivated peanut, a synthetic allotetraploid and an accession of the tetraploid wild species A. monticola; Table 1), and 78 were polymorphic within cultivated germplasm (Table 2).
The average number of alleles amplified per locus was 5.5. Values of gene diversity (GD) ranged from 0.080 to 0.885, with an average of 0.614. Sixty-six markers were highly polymorphic, with a GD of 0.5 or more.
Within cultivated peanut, markers with 21-25 motif repetitions were the most polymorphic (69%), followed by markers that amplified more than 30 motif repetitions (60%, most of the markers being composite or imperfect) (Figure 1). The lowest polymorphism was observed for short microsatellites, with 6-10 motif repetitions.
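As an illustration of how a per-locus gene diversity value can be obtained, the Python sketch below computes GD as one minus the sum of squared allele frequencies. This is the common uncorrected estimator and is assumed here; the exact estimator and any sample-size correction applied by the software used in the study are not stated in the text. The allele sizes are invented.

```python
from collections import Counter


def gene_diversity(alleles):
    """Gene diversity at one locus: 1 - sum(p_i ** 2) over allele frequencies p_i."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# Hypothetical allele calls (fragment sizes in bp) for one marker across genotypes.
calls = [210, 210, 214, 214, 218, 222, 222, 222]
gd = gene_diversity(calls)
print(len(set(calls)), round(gd, 3))
```

A monomorphic marker gives GD = 0, and GD approaches 1 as the number of equally frequent alleles grows, which is why a threshold such as GD ≥ 0.5 separates the more informative markers.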

Genic content
Thirty-six of the 135 marker sequences encoded putative proteins that had significant BLAST similarities to known predicted proteins of Arabidopsis and/or legumes (E-value < 1e-07; Additional File 1: Table S1). Of the highly polymorphic markers (GD ≥ 0.5), 23% showed a significant BLAST similarity, compared with 35% of the markers with GD < 0.5.

Genetic relationships
Genetic similarities were estimated by the band-sharing coefficient [29] in pairwise comparisons of the 24 genotypes (Table 1), using 78 microsatellite loci. Among the 22 A. hypogaea genotypes, genetic similarity values ranged from 0.42 to 0.77, so all the genotypes could be differentiated. A dendrogram based on UPGMA was constructed for the 24 genotypes (Figure 2). Cluster analysis showed two main groups corresponding to the two subspecies. Within these groups, genotypes of the same botanical variety tended to group together.
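For readers unfamiliar with the band-sharing coefficient, the following Python sketch shows the usual form of the index, S = 2·n_xy / (n_x + n_y), where n_xy is the number of bands shared by two genotypes and n_x and n_y are the numbers of bands scored in each. The representation of a band as a (marker, fragment size) pair and the example genotypes are assumptions made for illustration.

```python
def band_sharing(bands_x, bands_y):
    """Band-sharing similarity between two genotypes: 2 * shared / (n_x + n_y)."""
    x, y = set(bands_x), set(bands_y)
    if not x and not y:
        raise ValueError("at least one genotype must have scored bands")
    return 2 * len(x & y) / (len(x) + len(y))


# Hypothetical band profiles: each band is (marker name, fragment size in bp).
genotype_1 = [("m1", 210), ("m1", 214), ("m2", 180)]
genotype_2 = [("m1", 210), ("m2", 180), ("m2", 186)]
similarity = band_sharing(genotype_1, genotype_2)
print(round(similarity, 3))
```

The index is 1 for identical profiles and 0 when no band is shared, so a range of 0.42 to 0.77 means that no two cultivated genotypes had identical profiles.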

Discussion
In spite of the considerable effort made by several research groups to develop molecular markers for cultivated peanut, the number of polymorphic markers available for this important crop is still limiting. One of the main challenges in the construction of linkage maps using populations derived from cultivated × cultivated crosses is the need to screen thousands of markers to obtain sufficient markers even for the construction of low density maps.
In this study we focused on the class of microsatellites that was shown to be the most highly polymorphic for cultivated peanut in a previous study: long TC repeats [10]. For this, sequences were obtained from an enriched genomic library. For processing the sequences, the Staden software was used together with a module for the detection of microsatellites. Starting from a relatively large dataset of unassembled sequences, it was possible to quickly eliminate sequences that were similar to previously described markers, and to assemble a compact database of microsatellite-containing sequences. Using a naming convention for plasmid clones, it was possible to correctly assemble microsatellite-containing reads even when the only overlap between forward and reverse sequences was the microsatellite repeat itself. This was particularly important for obtaining complete sequences when the repeats were long. For the design of primer pairs, the program used took into account the quality values of consensus bases. This was reflected in the 100% amplification success rate of the primer pairs.
Markers with 21-25 motif repetitions were the most polymorphic, while markers with shorter repeats tended to be less polymorphic. This general tendency agrees with previous studies and reinforces the view that long (21-25 motif repetitions) or composite TC microsatellites are probably the most polymorphic marker class for cultivated peanut. A slightly higher proportion of markers that were not polymorphic or were less informative (GD < 0.5) showed significant similarities to protein-encoding regions, probably reflecting a tendency for non-coding regions to be more polymorphic than coding regions. Overall, 78 of the markers were polymorphic for the cultivated accessions and 66 of these had a GD value of 0.5 or above.
Cluster analysis showed two main groups separating the two subspecies of A. hypogaea. Some tendency for genotypes to group according to their botanical varieties was also evident. The main exceptions were three accessions, Mf2517, Mf2352, and Mf2534, whose clustering had no apparent explanation. The upper group contained the five hypogaea/hypogaea genotypes and two of the three hypogaea/hirsuta genotypes. Arachis monticola and the two genotypes collected in the Xingu Indigenous Park also clustered in this group. The Xingu material has some morphological traits, especially in the pods, that exceed the variation previously described in cultivated peanut [30], but it seems to be closely related to the hypogaea/hypogaea and hypogaea/hirsuta varieties. Our results also showed the great genetic similarity of the varieties fastigiata and vulgaris, which formed a subgroup, and of peruviana and aequatoriana, which formed a separate subgroup. Some studies have shown that genotypes of the varieties peruviana and aequatoriana were more closely related to genotypes of the subspecies hypogaea than to the other two varieties (fastigiata and vulgaris) of subspecies fastigiata [8,17,31,32]. Our results, in contrast, corroborated the current taxonomical classification, despite the small number of genotypes included.

Conclusion
In this study 146 new microsatellite markers were developed for Arachis. All of these markers are new and useful tools for genetics and genomics in Arachis, but in particular the 66 markers that are highly polymorphic in cultivated peanut are a significant step towards routine genetic mapping and marker-assisted selection for the crop.

Methods
Total genomic DNA was isolated from young leaves using the CTAB-based protocol described by Grattapaglia and Sederoff [33], modified by the inclusion of an additional precipitation step with 1.2 M NaCl. DNA quality and concentration were estimated by agarose gel electrophoresis and by spectrophotometry (Genesys 4, Spectronic, Unitech, USA).

Construction of SSR-enriched library
A genomic DNA library enriched for the dinucleotide repeats TC/AG was constructed as described by Moretzsohn [10]. About nine micrograms of DNA were digested with Sau3AI (Amersham Biosciences, UK) and electrophoresed in 0.8% low-melting agarose gels to select fragments ranging from 200 to 600 bp. The selected fragments were purified from the agarose gels using phenol/chloroform, and ligated to Sau3AI-specific adaptors (5'-cagcctagagccgaattcacc-3' and 5'-gatcggtgaaatcggctcaggctg-3'). The ligated fragments were hybridized to biotinylated (AG)15 oligonucleotides and isolated using streptavidin-coated magnetic beads (Dynabeads Streptavidin, Dynal Biotech, Norway). The eluted fragments were amplified using one adaptor-specific primer, cloned into the pGEM-T Easy vector (Promega, WI, USA) and transformed into XL1-Blue E. coli cells with blue/white selection (Invitrogen, CA, USA). Plasmid DNAs of the positive clones were isolated by the alkaline lysis method. Sequencing reactions were performed with T7 and SP6 primers and the Big-Dye Terminator Cycle Sequencing Kit, version 3.1 (Applied Biosystems, CA, USA) using the ABI Prism 377 automated DNA sequencer.

SSR marker development and validation
Sequences were processed and assembled using the Staden package [34] with the repeat-finding module TROLL [35] and Primer3 for primer design [36], using a module developed by Martins et al. [37]. Sequences with more than ten motif repeats were chosen for primer design. Some sequences with BLASTX hits to genes of interest were also included in spite of having fewer than ten motif repeats. The parameters for primer design were: (1) primer size ranging from 18 bp to 25 bp with an optimal length of 20 bp; (2) primer Tm (melting temperature) ranging from 57°C to 63°C with an optimal temperature of 60°C; and (3) GC content ranging from 40% to 60%. Default values were used for the other parameters.
Amplifications were carried out in a PTC 100 thermocycler (MJ Research Inc., MA, USA). PCR conditions were: 96°C for 5 min, followed by 30 cycles of 94°C for 1 min, 48-62°C (annealing temperature depending on primer pair, see Additional file 1) for 1 min and 72°C for 1 min, with a final extension for 10 min at 72°C. PCR products were separated by electrophoresis on denaturing polyacrylamide gels (6% acrylamide:bisacrylamide 29:1, 5 M urea in TBE pH 8.3) and stained with silver nitrate [38].
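The selection rule and primer design constraints described in this section can be sketched in Python as follows. This is not the Staden/TROLL module used in the study: the regular expression, the function names and the example sequence are hypothetical. The tag names are standard Primer3 Boulder-IO tags, and the target coordinates assume the zero-based indexing that is the default of the command-line primer3_core.

```python
import re

# Primer design constraints stated in the text, expressed as Primer3 tags.
PRIMER3_SETTINGS = {
    "PRIMER_MIN_SIZE": 18, "PRIMER_OPT_SIZE": 20, "PRIMER_MAX_SIZE": 25,
    "PRIMER_MIN_TM": 57.0, "PRIMER_OPT_TM": 60.0, "PRIMER_MAX_TM": 63.0,
    "PRIMER_MIN_GC": 40.0, "PRIMER_MAX_GC": 60.0,
}


def find_repeat(seq, min_units=11):
    """Locate the first TC or AG run of at least min_units dinucleotide units.

    "More than ten motif repeats" is read here as 11 or more units.
    Returns (start, length) in zero-based coordinates, or None.
    """
    pattern = re.compile(r"(?:TC){%d,}|(?:AG){%d,}" % (min_units, min_units))
    match = pattern.search(seq.upper())
    if match is None:
        return None
    return match.start(), match.end() - match.start()


def boulder_record(seq_id, seq):
    """Build a Primer3 Boulder-IO record that asks for primers flanking the repeat."""
    target = find_repeat(seq)
    if target is None:
        return None
    lines = [
        f"SEQUENCE_ID={seq_id}",
        f"SEQUENCE_TEMPLATE={seq}",
        f"SEQUENCE_TARGET={target[0]},{target[1]}",
    ]
    lines += [f"{tag}={value}" for tag, value in PRIMER3_SETTINGS.items()]
    lines.append("=")  # record terminator in Boulder-IO
    return "\n".join(lines)


# Invented sequence: 20 bp flank, (TC)23, 20 bp flank. Real inputs need longer flanks.
seq = "GATTACAGGCATCGATGCAA" + "TC" * 23 + "TTGACCGTAAGCTAGGCTAA"
record = boulder_record("clone_demo", seq)
print(record)
```

The resulting record could be piped to primer3_core; all parameters not listed are left at their defaults, as in the text.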

Data analyses
The number of alleles per locus, the range of fragment lengths and the gene diversity (GD) were estimated for the polymorphic markers using the program PowerMarker 3.25 [39]. Pairwise genetic similarities were estimated from the allelic data using the band-sharing coefficient of Lynch [29]. The resulting similarity matrix was then submitted to cluster analysis using UPGMA (unweighted pair-group method with arithmetic mean). In order to verify the consistency of the resulting dendrogram, the cophenetic correlation coefficient (r) [40] was calculated. All these analyses were performed using the software NTSYS 2.21 [41].
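The clustering step can be illustrated with a minimal pure-Python UPGMA sketch. It is not a substitute for the software used in the study. It assumes that similarities are converted to distances as 1 - S, reports the average distance at which each pair of clusters is joined (not halved branch heights), and uses invented genotype names and values.

```python
from itertools import combinations


def upgma(similarity):
    """Cluster genotypes by UPGMA.

    similarity maps (name_a, name_b) tuples to a similarity in [0, 1].
    Returns a list of merges (cluster_a, cluster_b, joining distance).
    """
    dist = {frozenset(pair): 1.0 - s for pair, s in similarity.items()}
    size = {name: 1 for pair in similarity for name in pair}
    merges = []
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(size), 2), key=lambda p: dist[frozenset(p)])
        joined_at = dist[frozenset((a, b))]
        new = f"({a},{b})"
        # Distance to every other cluster is the size-weighted mean of the
        # distances from the two merged clusters (the "arithmetic mean" rule).
        for other in size:
            if other in (a, b):
                continue
            dist[frozenset((new, other))] = (
                size[a] * dist[frozenset((a, other))]
                + size[b] * dist[frozenset((b, other))]
            ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        del size[a], size[b]
        merges.append((a, b, joined_at))
    return merges


# Hypothetical pairwise band-sharing similarities among four genotypes.
sims = {
    ("A", "B"): 0.80, ("A", "C"): 0.50, ("B", "C"): 0.60,
    ("A", "D"): 0.30, ("B", "D"): 0.40, ("C", "D"): 0.45,
}
merges = upgma(sims)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

The cophenetic correlation mentioned above compares the original pairwise distances with the joining distances implied by such a tree; it is not computed in this sketch.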