Characterization of variable EST SSR markers for Norway spruce (Picea abies L.)

Background Norway spruce is widely distributed across Europe and the predominant tree of the Alpine region. Fast growth and the fact that timber can be harvested cost-effectively in relatively young populations define its status as one of the economically most important tree species of Northern Europe. In this study, EST derived simple sequence repeat (SSR) markers were developed for the assessment of putative functional diversity in Austrian Norway spruce stands. Results SSR sequences were identified by analyzing 14,022 publicly available EST sequences. Tri-nucleotide repeat motifs were most abundant in the data set followed by penta- and hexa-nucleotide repeats. Specific primer pairs were designed for sixty loci. Among these, 27 displayed polymorphism in a testing population of 16 P. abies individuals sampled across Austria and in an additional screening population of 96 P. abies individuals from two geographically distinct Austrian populations. Allele numbers per locus ranged from two to 17 with observed heterozygosity ranging from 0.075 to 0.99. Conclusions We have characterized variable EST SSR markers for Norway spruce detected in expressed genes. Due to their moderate to high degree of variability in the two tested screening populations, these newly developed SSR markers are well suited for the analysis of stress related functional variation present in Norway spruce populations.


Background
Natural populations of Picea abies L. (Norway spruce) are found from north-western Europe outside permafrost areas down to northern Greece, westwards to the Massif Central (France) and east to the Ural Mountains. Picea abies is growing above 400-500 m and ascends close to 2000 m in the Alps. Studies on genetic variation based on allozymes have shown that Picea abies genetic differentiation among populations is rather low over its whole distribution range [1,2]. Previous studies on the genetic structure of P. abies using organelle markers showed pronounced differentiation between north-east boreal origins and areas in the central European mountains [3,4], supporting the hypothesis of two distinct main glacial refugia as postulated from pollen data [5].
Initial reports on the occurrence of SSRs in conifers such as Pinus radiata [6], Pinus sylvestris [7] or Picea abies [8] have shown that marker development for such complex genomes is difficult. Frequently several DNA fragments in addition to the expected ones are amplified when using primers flanking putative SSR regions. This can be attributed to the very large size of the conifer genomes, with Norway spruce having 39.140 Mb in the diploid genome (2n = 24) as well as the high proportion of repetitive DNA in the conifer genomes [9]. These repetitive elements as well as pseudogenes frequently produce complex amplification products from multiple loci [10,11]. To overcome this problem, Scotti et al. [12] showed that by isolating di-nucleotide SSRs from cDNA libraries of conifer species this problem can be overcome relying on the fact that expressed genes are less likely to be of repetitive nature in the genome.
Following this line of exploiting the advantages of publicly available data on expressed sequences (ESTs) as source for marker development [13] as described also by Rungis et al. [14] for conifers, we used Norway spruce EST sequences from NCBI for in silico identification of SSR regions which could serve for marker development.

Results
Screening of 14,022 Norway Spruce EST sequences from the NCBI dbEST revealed 158 sequences containing various repeat motifs. Clustering of these sequences produced 92 SSR containing 'unigenes' which fell into 36 clusters and 56 singletons. In these 92 unigenes, 48 different repeat motifs were identified, with tri-nucleotide (50%), in particular (CGG) n and CAA) n , repeats being the most abundant, followed by penta-(23%) and hexanucleotide repeats (10.4%) respectively. Di-and tetranucleotide repeats were the least abundant with 8.3% each. Twenty-seven out of 60 planned primer pairs amplified polymorphic products in the testing population of 16 individuals, while two did not generate a fragment and 31 proved to be monomorphic [see Additional file 1]. Four of the 27 polymorphic SSR regions showing more than 2 alleles per locus (Pa_16, Pa_24, Pa_34 and Pa_40) in the testing population were excluded from further analysis. In a total of 96 individuals 135 alleles were detected in the 23 remaining loci (Table 1) ranging from 2 to 17 alleles per locus (Table 2) with a mean value of 5.6. Total number of effective alleles was and 50.645 with a minimum of 1.123, a maximum of 4.115 and a mean value of 2.202. 87% of the variable regions were based on trinucleotide repeats, the rest were pentanucleotide repeats. Di-, tetra and hexanucleotide repeats proofed to be not polymorphic. Loci Pa_41 and Pa_53 were not polymorphic in the Mayrhofen population. Values for expected heterozygosity ranged from 0.021 to 0.780 in the Gusswerk population (mean 0.367) and from 0.123 to 0.769 in the Mayrhofen population (mean 0.385). Observed heterozygosity was found to be between 0.021 and 1.000 in the Gusswerk population (mean 0.461) and between 0.042 and 1.000 in the Mayrhofen population (mean 0.515). Differences in observed heterozygosities between the two populations ranged from zero (Pa_44) to 0.894 (Pa_49). The minimum difference of expected heterozygosities between the two populations was found at locus Pa_12 (0.002) and the maximum at locus Pa_49 (0.382). Regarding the Gusswerk population, in eleven loci the observed heterozygosity was significantly higher and in three loci significantly lower than the expected value. In the Mayrhofen population we found thirteen loci with the observed heterozygosity significantly higher than the expected and in four loci it was significantly lower. Eleven loci showed significant departure from Hardy-Weinberg expectations (HWE, Guo and Thompson's exact test [P < 0.05]) within the 48 investigated individuals from the Gusswerk population. The number of loci with significant departure from HWE was higher in the Mayerhofen population (16 loci). Frequency of null alleles was variable from zero to 99.0% in the Gusswerk population and from zero to 85.4% in the Mayerhofen population. In both populations, null alleles were present with a high frequency (above 5%) in more than half of the loci deviating from HWE. Tests for linkage disequilibrium (P < 0.01) revealed no disequilibrium among pair wise compared loci. F st ranged between 0 and 0.288 with a mean value of 0.033 (after application of the ENA correction described in Chapuis and Estoup (2007), Table 3). Repeat numbers of all polymorphic diand pentanucleotide repeats added up to a multiple of three and therefore did not cause a frameshift. All polymorphic SSRs are lying within open reading frame sequences (ORF) as detected by GetORF [15]. Fourteen sequences hit functionally annotated genes when compared to the NCBI nucleotide database [see Additional file 2]. The test for outlier loci revealed that Pa_51 (P = 1.000) and Pa_42 (P = 0.981) are under positive selection while the other loci are supposed to behave neutral (P < 0.95).

Discussion
Several studies describe the development of SSR markers for Norway spruce [8,12]. In the present work we used EST mining of sequences available in databases instead of classical approaches like screening of genomic libraries or fragment enrichment and subsequent cloning. The newly developed markers can be used to complete unsaturated maps, and might be useful in marker assisted breeding and population genetics.
The high number of loci deviating from Hardy-Weinberg equilibrium could partly be explained by the presence of null alleles (Table 3), partly be a result of sampling or selective pressure on the coding regions which is supported by the fact that loci deviating from HWE are not consistent between the two populations. Null alleles can occur due to mutations in primer binding sites and lead to the overestimation of homozygosity as shown by Callen [16]. The presence of such null alleles as estimated in this study may further be confirmed by synthesis of alternative oligonucleotide primer pairs. Increased appearance of homozygotes in some loci supports the hypothesis of selective pressure which might be caused by advantages of recessive or dominant homozygotes known as directional selection. Positive selection was confirmed by outlier detection for two loci which show a high frequency of homozygotes. Both of them are located in coding regions -Pa_42 shows homology to a cysteine protease and Pa_51 is related to a cadmium transporting ATPase. These genes play important roles in various physiological processes, including response to biotic and abiotic stress [17,18]. Therefore, additional data of both populations (e.g. environmental growth) conditions would be of interest.
Especially locus Pa_51 may be an interesting target for diversity studies because departure from HWE was only detected in the Mayrhofen population. But although these two loci show considerable degrees of population differentiation, the low F st values present in the other loci support previous findings [3,4] that there is no The new SSR markers characterized in this study are a good resource for assessing diversity in Alpine spruce populations and might be of further use for applications in forestry. Additionally, 31 monomorphic SSR loci were identified in this study. It is still possible to derive polymorphic data from them when screening Norway spruce populations from different origin, as it has already been shown that flanking regions of monomorphic SSRs show a moderate to high degree of variability [19]. Possible variation within these flanking regions in Norway spruce could serve as alternative source for diversity studies applying different techniques for the identification of variability within these regions.
All variable SSRs are located within ORF sequences and 60% of them gave significant hits to functional genes present in GenBank. A frameshift caused by length variation within an ORF might lead to distortion in the coding region and thus to the abortion of the coded protein. In all of the newly identified EST-SSRs no frameshift is occurring because either, as in 50% of the cases, the SSRs are trinucleotide repeats or the repeat number of polymorphic di-and pentanucleotide repeats alters in such a way that the sum of the basepair variation is a multiple of three.

Conclusions
We have identified 92 new SSR regions with 48 different repeat motifs in the expressed part of the Norway spruce genome. In two screening populations of 48 individuals each, 23 SSR regions showed a moderate to high degree of polymorphism. Although a high number of SSR markers were already developed for Norway spruce, most of the available SSRs consist of dinucleotide repeats and do not include our tri-and pentanucleotide repeats. We have demonstrated that on the one hand genetic differentiation between distinct stands is low but on the other hand selective forces are likely to have   been acting on several genes because deviations from HWE could not only be explained due to null alleles and test for outliers revealed two loci under positive selection. The results gave an insight into the abundant repeat classes and will be of use in analysis of variation linked to expressed genes in Alpine Norway spruce populations as well as evolutionary forces acting at these loci.

Methods
14,022 Norway Spruce EST sequences were extracted from the NCBI dbEST and screened for SSR sequences with the SciRoKo software program [20]. Only perfect SSRs were searched in two iterations, one for monoand dinucleotides (at least 4 repeats), and another one for tri-, tetra-, penta-and hexanucleotides (at least 3 repeats). SSR motif containing sequences were subjected to clustering and annotation using the EST2uni pipeline [21]. For clustering, the default settings (30 bp minimum overlap with at least 94% identity for pairwise alignment using BLAST, 93% overlap identity cutoff for assembly using CAP3) were applied. The resulting Unigene Set (contigs and singletons) was compared against three protein databases with BLASTX, in particular NCBI NR, Arabidopsis TAIR7, and Uniprot/Swissprot, using an e-value of 1e-10 as cutoff. The descriptions of the BLAST hits obtained with the different BLAST runs were parsed and merged to yield a descriptive annotation for each unigene. The annotations were attributed with modifiers like "Similar to" or "Highly similar to", depending on the e-value of the alignment with the corresponding BLAST hit. 60 primer pairs were designed using Primer3 web interface [22] with default settings. PCR reactions were performed in 25 μl volumes containing 50 ng genomic DNA, 1X PCR Buffer (QIAGEN), 1 mM MgCl 2 , 0.2 mM each dNTP, 0.3 μM FAM-labelled forward primer, 0.3 μM reverse primer and 0.625 U HotStar Taq polymerase (QIAGEN). The PCR thermal profile consisted of 15 min. initial denaturation at 95°C followed by 35 cycles denaturation at 95°C for 50 sec., a primer-specific annealing-temperature (Table 1) for 50 sec., extension at 72°C for 105 sec., and a final extension at 72°C for 10 minutes. Fragments amplified in 16 individuals from all over Austria (testing population) and 96 individuals from 2 distinct locations in Austria (screening population) were analyzed on an ABI 3100 Genetic Analyzer. Fragment sizes were extracted using PeakScanner 1.0 (Applied Biosystems, Foster City, California, USA) and allele calls were manually performed in Excel. The testing population was only used to test for polymorphic loci and was not included in further calculations. Number of alleles, effective number of alleles, observed and expected heterozygosities and deviation from Hardy-Weinberg equilibrium were calculated using Genepop 4.0 [23]. Fixation indices and fixation indices excluding null alleles (ENA), as well as estimates of null alleles were calculated with FreeNA [24] and SSR presence in coding regions was analyzed with the GetORF software using default settings. Test for outlier loci were performed with LOSITAN [25] using an infinite allele model with 100.000 simulations, a confidence interval of 0.95 and a subsample size of 75.