In silico search, characterization and validation of new EST-SSR markers in the genus Prunus

Background Simple sequence repeats (SSRs) are tandem repeats of 1 to 6 bp units that are abundant in eukaryotic genomes, occur in both coding and non-coding regions, and may affect the expression of genes. In this study, expressed sequence tags (ESTs) of eight Prunus species were analyzed for in silico mining of EST-SSRs, protein annotation, prediction of open reading frames (ORFs), and identification of codon repetitions. Results A total of 316 SSRs were identified using MISA software. Dinucleotide SSR motifs (26.31 %) were the most abundant type of repeat, followed by tri- (14.58 %), tetra- (0.53 %), and penta- (0.27 %) nucleotide motifs. Primer design was attempted for all 316 identified SSRs but succeeded for only 175 SSR sequences. The positions of SSRs with respect to ORFs were detected, and annotation of sequences containing SSRs was performed to assign a function to each sequence. SSRs were also characterized (in terms of position in the reference genome and associated gene) using the two available Prunus reference genomes (mei and peach). Finally, 38 SSR markers were validated across peach, almond, plum, and apricot genotypes. This validation showed a higher transferability level of EST-SSRs developed in P. mume (mei) in comparison with those from the other species analyzed. Conclusions Findings will aid analysis of functionally important molecular markers and facilitate the analysis of genetic diversity. Electronic supplementary material The online version of this article (doi:10.1186/s13104-016-2143-y) contains supplementary material, which is available to authorized users.

Simple sequence repeats (SSRs), also known as microsatellites, are short repeat motifs present in both protein-coding and non-coding regions of DNA sequences. SSRs show a high level of length polymorphism due to mutations of one or more repeats. The use of SSRs as molecular markers is favorable due to their multi-allelic nature, reproducibility, high abundance, and extensive genome coverage [2]. Expressed sequence tags (ESTs), in turn, are single-pass sequences of cDNA clones that provide direct information on gene expression and also serve as sources of microsatellites [3]. The traditional methods of developing SSR markers from ESTs are usually time-consuming and labor-intensive. Generally, they involve genomic library construction, hybridization with the repeated units of nucleotides, and sequencing of the clones. These traditional methods have been applied in Prunus species to develop EST-SSRs in peach [4,5], apricot [6,7], almond [8,9] and mei [10,11]. The computational approach for developing SSR markers from ESTs provides a better platform than the conventional approach. EST databases store expressed sequences that are redundant, so they contain repetitive units [12]. Such computational approaches have recently been applied in Prunus species, albeit only in the reference peach genome [13,14].
EST sequences can be obtained from databases and assembled to develop potential SSR markers in different species even without the availability of a fully sequenced genome. Numerous tools (both standalone and web-based) are available for the mining of EST data to design EST-SSR markers on a large scale [15]. Free software and the large availability of EST data on the web allow researchers to easily perform rapid and low-cost data mining from their local systems. Tools such as cross_match and trimest provide non-redundant high-quality EST sequences that do not contain vector contamination or poly-A and poly-T tails. CAP3 can be used to assemble EST sequences with overlapping regions and produce contigs by joining sequences [16].
However, to the best of the authors' knowledge, no assays have been performed in Prunus species.
Several reasons account for the high popularity of EST-derived microsatellite markers (EST-SSRs). First, marker development from existing sequence data is fast, easy and economical. An appropriate search program can detect any type of SSR, whereas enrichment cloning captures only SSRs with predefined motifs. Second, given the preferential association of SSRs with the non-repetitive portion of plant genomes, they are a common component of ESTs [33]. Third, EST-SSRs are physically linked to expressed genes and therefore represent so-called "functional markers" that are of particular interest for marker-assisted selection [34]. Finally, primer target sequences residing in expressed DNA regions are expected to be relatively well conserved, thus enhancing the chance of marker transferability across taxonomic boundaries [35].
The objectives of this work were the in silico identification of EST-SSR markers, the functional domain analysis of these markers, their characterization using the mei and peach reference genomes, and their validation across different Prunus species, including an analysis of the level of synteny among them.

Assembly of EST sequences and frequency and distribution of EST-SSR motifs
A total of 111,788 ESTs were detected in different tissues (leaf, stem, root, etc.) of Prunus species. ESTs retrieved from NCBI (http://www.ncbi.nlm.nih.gov/) were mined for simple sequence repeats (SSRs), which were characterized, and a subset was used for marker design. In addition, all SSR-containing sequences were annotated as far as possible.
The percentage of ESTs forming contigs was 98.8 %, indicating that the majority of ESTs had overlapping sequences with other ESTs, whereas only 1.2 % of sequences were unique and had no corresponding overlapping sequence. Following assembly, a non-redundant group of ESTs was obtained consisting of contigs and singletons, hereafter referred to as "assembled EST sequences." A 68.75 % reduction in redundancy was observed, i.e., the number of ESTs was reduced by this proportion prior to SSR analysis. These data demonstrate the extensive overlap that exists among EST sequences belonging to the same genome.
Analysis of EST-SSRs revealed dinucleotide SSRs to be the most common, at 26.31 %, with trinucleotide SSRs accounting for 14.58 % of all data. A large difference was apparent between the number of tri- and tetranucleotide SSRs. Nona- and decanucleotide SSRs made up less than 1 % of all data (Table 1). The frequency of occurrence of SSRs varied with the number of repeats for each type of SSR from di- to decanucleotides. In this analysis, repeat numbers from 5-mer to 10-mer and a separate class of >10-mer were assessed. For trinucleotide SSRs, 5-mer was the highest repeat number apparent. For repeat sizes of 6-mer to >10-mer, the frequency of dinucleotides was the highest (Table 1).

Distribution of SSRs in putative coding regions and UTRs
Analysis revealed a strong bias in the distribution of SSRs between coding regions and UTRs, with the increased frequency of SSRs in UTRs reflecting their roles as binding sites for proteins and regulatory elements. Further, the relative distribution of SSRs in coding regions revealed that trinucleotide SSRs were the most frequent (26.31 %), whereas octanucleotide, nonanucleotide, and decanucleotide SSRs were the least frequent. Tetra-, penta-, hexa- and heptanucleotide SSRs demonstrated intermediate frequencies of 0.53, 0.27, 0.14, and 0.03 %, respectively. In contrast, dinucleotide SSRs were the most frequent in UTRs (86.5 %). Penta- and hexanucleotide SSRs were not present in UTRs.
Each trinucleotide motif encodes an amino acid that has a putative role in the biological activity of protein molecules. Of the 6270 trinucleotide SSRs identified during the present study, 27.30 % encoded histidine, 14.69 % glutamine, 10.05 % threonine, and 6.40 % serine. However, the distribution of putative encoded amino acids differed according to the Prunus species assayed (Fig. 1).
Grouping of the putative encoded amino acids by polarity revealed 80.66 % of amino acids to be polar and 19.33 % nonpolar. This trend was consistent across all Prunus species assayed (Additional file 1: Fig. S1).
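The mapping from trinucleotide motifs to encoded amino acids and polarity classes can be illustrated with a minimal sketch. It assumes that each motif is read in frame on the coding strand and uses the standard genetic code; the polar/nonpolar grouping below is one common textbook convention and is an assumption, not the classification used by the authors. The function names and the toy input are hypothetical.

```python
from collections import Counter

# Standard genetic code written compactly: codons are enumerated with each
# base cycling through T, C, A, G (first base changes slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Assumed grouping (not taken from the paper): residues listed here are
# treated as nonpolar, every other residue as polar.
NONPOLAR = set("AVLIMFWPG")


def encoded_amino_acid(motif):
    """One-letter amino acid ('*' for stop) for a 3-bp motif read in frame."""
    return CODON_TABLE[motif.upper()]


def polarity(amino_acid):
    """Classify a one-letter amino acid as 'polar', 'nonpolar' or 'stop'."""
    if amino_acid == "*":
        return "stop"
    return "nonpolar" if amino_acid in NONPOLAR else "polar"


def percentages(labels):
    """Percentage of items carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical toy input, for illustration only.
motifs = ["CAT", "CAT", "CTT", "CAG"]
amino_acids = [encoded_amino_acid(m) for m in motifs]
print(percentages(amino_acids))
print(percentages(polarity(a) for a in amino_acids))
```

Because the encoded residue depends on the reading frame, a real analysis would first need the ORF coordinates of each SSR-containing sequence.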
To determine the function of SSR-containing sequences, the 316 sequences from which SSRs were mined were annotated against the non-redundant (nr) protein database available at http://www.ncbi.nlm.nih.gov. Of these, annotations were available for 165 (52.21 %) sequences.
For functional annotation, EST-SSRs with significant matches were assigned gene ontology terms in the Swiss-Prot database. A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions. Among the biological processes corresponding to EST-SSRs, the most frequent was 'Response to stress' (10 EST-SSRs) followed by 'Response to cadmium ion' and 'Oxidation reduction homeostasis' (15 EST-SSRs) (Additional file 3: Fig. S3). Additional file 3: Fig. S3 shows all biological processes identified for EST-SSRs across the Prunus species assayed.
Finally, a cellular component represents a component of a cell that is part of some larger object, e.g., an anatomical structure or a gene product group. Among the cellular components housing putative proteins, the most frequent was 'Plasma membrane' (18 EST-SSRs, 19.
An attempt was made to predict ORFs in SSR-containing sequences using ORF Finder. Of the 316 SSRs identified, the positions of 302 SSRs with respect to an ORF were determined, whereas no ORF was predicted for the remaining 14 SSR-containing sequences. Of these 302 SSRs, 164 (54.30 %) were present in the 5′ UTR, 118 (39.07 %) in ORFs, and the remaining 20 (6.62 %) occurred in the 3′ UTR.
In addition, 38 SSR markers were validated across peach, almond, plum, pollizo plum and apricot genotypes (Table 2). Results showed a higher transferability level of EST-SSRs developed in P. mume (PruMrest SSRs) in comparison with those from the other species analyzed. On average, 83.3 % of PruMrest SSR markers amplified in the assayed Prunus species, followed by 62.7 % (PruArest SSRs), 55 % (PruCest SSRs), 40 % (PruPest SSRs), 37.3 % (PruDest SSRs) and 16.6 % (PruAvest SSRs). Differences among the assayed species in the overall amplification success of the developed EST-SSRs were smaller, ranging from 43.6 % in pollizo plum to 58.3 % in almond. Additional file 5: Table S2 shows the sizes of the 38 SSRs obtained in the Prunus samples assayed. All Prunus genotypes presented different fingerprints for six of the tested SSRs. No amplification was observed for 14 SSRs assayed during this study. In addition, two SSRs (PruCest-3 and PruPest-73) amplified only in certain species, and the EST-SSRs PruArest-1, PruArest-12, PruArest-13, PruArest-15 and PruMest-6 amplified only in some genotypes within each species. Finally, the level of polymorphism observed ranged from three to ten alleles.

Discussion
The frequency of SSRs was 8.32 % in assembled sequences, suggesting that Prunus species' ESTs contain relatively high numbers of SSRs. The frequency of SSRs in EST datasets has previously been reported as 2.4 % for Arabidopsis, 4.1 % for almond and peach, and 4.8 % for rose [36]. The combined raspberry unigene dataset has 418 contigs and 1671 singletons, from a total of 2089 unigenes [37].
The percentage of SSRs in tissue-specific ESTs of some medicinal plants responsible for secondary metabolite production is 4.5 % in Papaver somniferum, 10 % in Phaseolus vulgaris, 10.8 % in Coptis japonica, 12.9 % in Catharanthus roseus, and 12.31 % in Mentha piperita [38]. The results of the present study are thus in agreement with the previous findings for Citrus sinensis (Rutaceae) [19], Arabidopsis ESTs [39], and exons of genomic DNA sequences in all eukaryotes studied [40].
Total numbers of SSRs identified in the genomes ranged from 0 to 13,514, with the density of microsatellites ranging from 0 to 7.51 SSRs per Kb. The P. domestica genome contained no SSRs, whereas P. persica had the most abundant SSRs (13,514) [29]. In contrast, the average frequency of SSRs identified by the present study from Prunus species was lower than observed in loblolly pine (42.9 SSRs per Kb) [28], some cereal species (6 SSRs per Kb) [17], and palms (4.4 SSRs per Kb) [43]. Differences in the frequencies of SSRs between this and previous studies may have been due to differences in the quantity of data analyzed, although it is generally recognized that the abundance of different repeats can vary broadly depending upon the species examined [40]. A study of five different plant species genomes (A. thaliana, rice, soybean, maize and bread wheat) revealed that the densities of SSRs in transcribed regions were generally higher than those in genomic DNA [33]. In view of this, future studies should examine the significance of intraspecific variation in the densities of SSRs from different genome regions and interspecific variability across the entire genomes of different plant species [44].
The abundance of different repeat motifs (1-6 bp) in SSRs detected from Prunus species during the present study was variable, such that SSRs with different repeat motifs were not evenly distributed. SSRs with dinucleotide repeats (26.31 %) were most abundant, in agreement with the results of earlier studies on Arabidopsis [38]. Similarities may reflect the inclusion of SSRs in non-coding regions of Arabidopsis as well. Smaller repeat motifs were found to be dominant among SSRs identified during this study, with the occurrence of motifs decreasing with increasing repeat lengths. This is consistent with earlier studies [45]. Trinucleotide repeats have previously been found to be abundant in crops [15,39,46,47], as well as citrus [12]. The abundance of trinucleotide SSRs may be attributed to the absence of frameshift mutations when the number of trinucleotide repeats varies [48]. In raspberry, trimers, i.e. 3-bp repeats, are more common in gene-coding regions [37].
It was possible to design primers for just over half (175, 55.37 %) of the 316 SSRs during the present study (Additional file 4: Table S1). However, it was not possible to design primers for the remaining SSRs (141, 44.62 %), as the length of sequences flanking both ends of the SSRs was inadequate for primer design. The numerous primer pairs designed during this study can be utilized for a variety of purposes, e.g., gene tagging, genetic mapping, and population studies [37].
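The flank-length constraint mentioned above can be expressed as a simple pre-filter. The following is a minimal sketch, not the procedure used by the authors: the function name is hypothetical, and the default threshold of 18 nt is an assumption taken from the minimum primer size listed in the primer design settings. Sufficient flank length is only a necessary condition, since primer design also depends on melting temperature and GC content.

```python
def flanks_allow_primers(seq_len, ssr_start, ssr_end, min_flank=18):
    """Check that both flanks of an SSR are long enough to hold a primer.

    ssr_start is 0-based and ssr_end is exclusive. min_flank is an assumed
    threshold (hypothetical default), not a value reported for this step.
    """
    left_flank = ssr_start            # bases upstream of the SSR
    right_flank = seq_len - ssr_end   # bases downstream of the SSR
    return left_flank >= min_flank and right_flank >= min_flank


# Hypothetical coordinates, for illustration only.
print(flanks_allow_primers(100, 40, 54))  # both flanks long enough
print(flanks_allow_primers(100, 10, 24))  # left flank too short
```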
In the present study, homologs were sought for the 316 SSR-containing sequences identified, of which 165 were annotated and categorized into functional classes of proteins. In Arabidopsis, functions have been assigned for only 57 % of gene sequences, which represents relatively good annotation of sequences, but is still inadequate. Most of the SSR-containing sequences that were assigned functions during the present study represented housekeeping genes.
In a previous study, the unigene dataset was aligned to the Gene Ontology (GO) database and classified according to three basic categories: biological process, molecular function, and cellular component. The most abundant GO category was biological process, with a total of 708 sequences associated with metabolic processes, cellular processes, and single organism processes. GO assignments for the molecular function category totaled 323 sequences, with functions for catalytic activity (148), binding (128), and structural molecule activity (47) identified in the raspberry [37]. Additionally, BLAST comparison of the 2089 unigenes to the non-redundant (nr) protein database of NCBI yielded 1664 matches (80 %) [37].
The new EST-SSRs identified during the present study enlarge the number of EST-SSRs identified in Prunus species, including the 256 identified in peach [4,5,14], the 34 identified in apricot [6,7], the 29 identified in almond [8,9], and the 24 identified previously in mei [10,11]. Only in peach had 52 of these EST-SSRs been previously identified [13]. Using an in silico search, those authors identified around 15,000 EST-SSRs in the peach reference genome [13].
The characterization of these EST-SSRs using the available peach and mei reference genomes showed a higher level of synteny and marker positioning in the mei reference genome. In agreement with these results, EST-SSR validation also showed a higher transferability level of EST-SSRs developed in P. mume (mei) in comparison with those from the other species analyzed, indicating a higher level of synteny. This result also suggests that the mei reference genome is better suited than the peach genome for wide use in Prunus species. Acceptable PCR primers were designed for 175 of the 316 identified SSRs using default settings in the Primer3 software. However, this success rate for PCR primer design in the different Prunus species assayed is only moderate (about 55 %). An alternative for developing more SSR markers would therefore be to design PCR primers with less stringent parameter settings in Primer3 or to use other PCR primer design software. Transferability rates, however, are in accordance with the described phylogenetic characterization [1] of the assayed species: peach and almond from the subgenus Amygdalus, sweet and sour cherry from the subgenus Cerasus, plum and pollizo plum from the subgenus Cerasus section Prunus, and apricot and mei from the subgenus Cerasus section Armeniaca (Additional file 6: Fig. S4).
Cross amplification of the SSRs developed from Prunus species offers new functional genomic opportunities given the well-known synteny among Prunus genomes [36] and transcriptomes [49]. However, no amplification was observed for some SSRs assayed during this study, indicating that not all EST-SSR markers are transferable across the Prunus genus. In addition, the low polymorphism observed is probably due to the reduced number of genotypes assayed in each species.
Our results confirm the suitability of EST-SSR markers for cultivar discrimination and assessment of genetic diversity and clustering in apricot, as has been previously demonstrated for apricot, peach, and cherry. In addition, we have demonstrated that the EST-SSR markers developed are of great utility in the taxonomic characterization of different species.
The use of coding DNA regions for SSR development represents an additional advantage in association genetics [50] and linkage analysis, as gene functions are often known [51]. Recently, three EST-SSRs developed from flavonoid pathway transcription factors have been assayed as markers for fruit color selection in Japanese plum breeding programs [52].

Conclusions
Development and application of molecular markers is of immense importance in the examination of the genetic composition, inter-species variability, and evolutionary relationships of Prunus species. EST-SSRs developed by the present study provide significant insight into these areas. This study demonstrates an approach to develop computationally mined SSRs from ESTs. Derived SSRs can be used in related species for which less sequence data is available, given the high interspecific transferability of EST-SSRs, thus enhancing cross-species attempts to develop conserved orthologous marker sets. The use of coding DNA regions for SSR development represents an additional advantage as gene functions are often known. Findings will aid analysis of functionally important molecular markers and facilitate the analysis of genetic diversity. In addition, the SSRs developed here can be used as molecular markers linked to genes of agronomic interest in association genetic studies and quantitative trait locus (QTL) analysis.

Processing and assembly of EST sequences, and SSR identification and characterization
All EST sequences of Prunus species, namely peach (P. persica), apricot (P. armeniaca), sweet cherry (P. avium), mei (P. mume), almond (P. dulcis), sour cherry (P. cerasus) and prune (P. domestica), were downloaded from GenBank (ftp://ncbi.nlm.nih.gov/genbank/genomes/). To construct longer and less redundant sequences, publicly available ESTs were assembled with CAP3 [16]. CAP3 is a commonly used program [53,54] that identifies overlapping sequences and generates contigs with consensus sequences. The objective was to eliminate redundancy in EST sequences and obtain contiguous sequences (contigs) that can be used for analysis of SSRs. For the purpose of SSR identification, CAP3 contig and singleton outputs were combined to form non-redundant sequence data. Genomic SSRs were detected using GMATo (http://sourceforge.net/p/GMATo) (Additional file 7: Fig. S5). The minimum length of SSRs was fixed at 14 bp in accordance with the criteria used in [14]. SSRs were defined as ≥14 bp mononucleotide or dinucleotide repeats; ≥15 bp trinucleotide repeats; ≥16 bp tetranucleotide repeats; ≥20 bp pentanucleotide repeats; and ≥18 bp hexanucleotide repeats.
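The length thresholds above can be illustrated with a minimal regular-expression sketch. This is not the algorithm implemented in GMATo or MISA; it is a simplified stand-in that finds perfect (uninterrupted) repeats only, and the function name and example sequence are hypothetical.

```python
import re

# Minimum tract length in bp for each motif size, as stated in the text.
MIN_LENGTH_BP = {1: 14, 2: 14, 3: 15, 4: 16, 5: 20, 6: 18}


def find_ssrs(sequence):
    """Return sorted (start, end, motif, repeats) tuples for perfect SSRs.

    start is 0-based and end is exclusive.
    """
    seq = sequence.upper()
    hits = []
    for unit, min_bp in MIN_LENGTH_BP.items():
        min_repeats = -(-min_bp // unit)  # ceiling division
        # One copy of the motif followed by at least min_repeats - 1 more.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit, e.g. "AGAG"
            # is reported as the dinucleotide "AG", not as a tetranucleotide.
            if any(motif == motif[:k] * (unit // k)
                   for k in range(1, unit) if unit % k == 0):
                continue
            repeats = (match.end() - match.start()) // unit
            hits.append((match.start(), match.end(), motif, repeats))
    return sorted(hits)


# Hypothetical example sequence: an (AG)8 tract and a (CAT)5 tract.
example = "GGC" + "AG" * 8 + "TTCA" + "CAT" * 5 + "GG"
print(find_ssrs(example))  # [(3, 19, 'AG', 8), (23, 38, 'CAT', 5)]
```

Compound or imperfect SSRs, which dedicated tools can report, are not handled by this sketch.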

SSR primer design, prediction of open reading frames and characterization using reference genomes
Primer design for EST-SSR sequences was performed using Primer3 with default parameters: optimum primer size = 20.0 (range of 18-27), optimum annealing temperature = 60.0 (range of 57.0-63.0), and GC content of 20-80 %. Open reading frames (ORFs) were predicted for all SSR-containing sequences using the ORF Finder available at NCBI with the standard genetic code. Sequence fragments corresponding to the maximum length uninterrupted by a stop codon were taken as the primary encoding segment (ORF) of query sequences. In all predicted ORFs, the relative position of SSRs was detected, i.e., whether the SSR was present within the ORF or in the 5′ or 3′ untranslated region (UTR) [19]. Using Primer-BLAST, SSRs were also characterized (in terms of position in the reference genome and associated gene) using the two available Prunus reference genomes for mei (http://prunusmumegenome.bjfu.edu.cn/) [55] and peach (https://www.rosaceae.org/) [56].
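The ORF criterion and the positional classification described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a reimplementation of NCBI ORF Finder: it scans only the three forward frames, does not require a start codon, and the function names and test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_open_segment(sequence):
    """Longest stop-free run of codons in the three forward frames.

    Returns (start, end), 0-based with end exclusive. Simplification:
    reverse-strand frames and start codons are ignored.
    """
    seq = sequence.upper()
    best = (0, 0)
    for frame in range(3):
        seg_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos - seg_start > best[1] - best[0]:
                    best = (seg_start, pos)
                seg_start = pos + 3  # next segment begins after the stop
            pos += 3
        # Close the segment that runs to the end of the frame.
        if pos - seg_start > best[1] - best[0]:
            best = (seg_start, pos)
    return best


def ssr_position(ssr_start, ssr_end, orf_start, orf_end):
    """Classify an SSR relative to an ORF (all coordinates end-exclusive).

    An SSR that overlaps the ORF boundary is reported as "ORF".
    """
    if ssr_end <= orf_start:
        return "5' UTR"
    if ssr_start >= orf_end:
        return "3' UTR"
    return "ORF"


# Hypothetical sequence: a (CT)7 tract upstream of an open segment.
seq = ("CT" * 7 + "G" + "TAACTAACTAA" + "C" + "ATG" + "GCC" * 12
       + "CTAACTAACTAA")
orf_start, orf_end = longest_open_segment(seq)
print(orf_start, orf_end, ssr_position(0, 14, orf_start, orf_end))
```

On short sequences the longest stop-free segment may fall in an unexpected frame, which is one reason dedicated ORF predictors apply additional criteria.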