Context
Massively parallel sequencing and re-sequencing technologies have already provided thousands or even tens of thousands of DNA markers for a number of species, while the genotyping of such numbers of markers is becoming routine thanks to microarray-based genotyping technologies. The possibilities offered by these developments have already been exploited to identify loci under natural selection through genome-wide scans [1]. Some studies have focused on selection due to domestication of livestock (e.g. [2]) and plant species (reviewed in [3]). Only a very limited number of studies have targeted signatures of selection in the context of modern breeding programmes [4]. Such studies could, however, increase our understanding of the locus-level consequences of modern artificial selection. To what extent, for example, does artificial selection lead to significant changes in allele frequency at individual loci, and (implicitly) how likely is it that functional genetic variation is lost due to artificial selection? The existence of several nearly isolated breeding populations sharing the same breeding goals provides opportunities for identifying parallel changes between populations. For aquaculture species, only a few generations have passed since selective breeding began, possibly limiting the statistical power to detect selection at single loci. In this study, we wanted to estimate the power to detect selection at single loci as a function of effective population size, number of parallel populations, number of generations since the onset of selection, selection intensity, and the initial genetic distance between populations.
Computational methods
Two different methods for detecting selection were considered: 1) detection of loci with lower-than-expected values of FST between selectively bred aquaculture populations (hereafter referred to as farmed populations), and 2) detection of loci with higher-than-expected values of FST between a pool of farmed populations and a pool of wild populations. For both methods, the power to detect selection was estimated by simulating a single, bi-allelic locus both in the absence and in the presence of selection. The simulation program was written in Python (v2.6), utilising simuPOP, a library for general-purpose, individual-based, forward-time population genetics simulation [5]. The code may be found in Additional files: low_fst.txt and high_fst.txt. The parameter values were chosen to be relevant for populations of Atlantic salmon in Norway, the focus of our own research, but should match a wide range of aquaculture species. With some exceptions, the Atlantic salmon breeding programmes share the following features: 1) they have been running for 10 or fewer generations, 2) each breeding programme has four parallel year classes, 3) the populations are more or less isolated, with little or no gene flow between year classes, and 4) effective population sizes typically lie in the range of 30-50 ([6], Karlsson et al., unpublished data). The breeding programmes were originally established from different sets of Norwegian rivers, with some overlap between the sets [7]. FST values between wild Norwegian populations have been found to lie around 0.05 (allozymes [8, 9], microsatellites [10–12]). These results were backed up by our own data on four wild populations genotyped for 12 microsatellite loci and 13 wild populations genotyped for 4514 SNP loci (unpublished).
Against this background, our default simulated data set consisted of 10 closed farmed populations (low-FST outlier approach) or 10 closed farmed populations and 10 wild populations (high-FST outlier approach), each population having an effective population size of 50. Specifically, we assumed that (directional) selection occurs only in the breeding programmes and that this selection leads to convergent evolution among the different breeding strains. In an evolutionary context we are thus interested in detecting loci that appear as low-FST outliers when only the different breeding strains are studied, but as high-FST outliers when a pool of breeding strains is compared with a pool of wild populations (where no selection is occurring). From now on these two approaches will be referred to as the low-FST and high-FST outlier approaches, respectively. The base populations of the farmed populations were assumed to be drawn from different rivers, so that FST between farmed populations at generation 0 (base population) would be similar to FST between wild populations (default = 0.05). Parameter values (Ne, number of populations, and initial FST) were altered one at a time in order to assess the impact of each parameter on experimental power.
Algorithm
Two different approaches for the detection of outlier loci were investigated. The first approach was based on the detection of loci displaying lower-than-expected (under a null hypothesis of no selection) FST values between farmed strains. The second approach was based on the detection of loci displaying higher-than-expected values of FST between a pool of farmed populations and a pool of wild populations. For both approaches, a single bi-allelic locus was simulated with and without selection.
Low-FST outlier approach
In each of 1000 iterations, a single overall allele frequency was first drawn randomly from a uniform distribution between 0 and 1. Npop populations, each consisting of Ne animals with a single diploid locus, were then formed. Half of the individuals were designated as males, the other half as females. Genotypes were assigned randomly to individual animals, given the overall allele frequency. Next, random mating was simulated in each population for a number of generations, until the FST value between populations reached the desired initial level (FST(0)). Following this initial phase, random mating with (alternative hypothesis) or without (null hypothesis) selection was applied for Ngen generations; selection was applied by defining different fitness values for the different genotypes (assuming no dominance). At the end of each iteration, FST between populations [13] was calculated. This process was iterated 1000 times without selection in order to generate a distribution of FST under the null hypothesis, and 1000 times with selection in order to generate a distribution of FST under the alternative hypothesis. Finally, the power to detect outlier loci was calculated. The power was defined as the fraction of the FST distribution generated under the alternative hypothesis (i.e. under selection) that was lower than the 5th percentile of the FST distribution generated under the null hypothesis (i.e. without selection). The Python code can be found in Additional file 1.
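The procedure can be illustrated with a minimal sketch in plain Python. It is not the simuPOP program from Additional file 1: it tracks allele frequencies under a Wright-Fisher approximation (binomial sampling of 2Ne gametes) instead of simulating individuals and sexes, and it computes FST as the variance of allele frequency across populations divided by p̄(1 − p̄), which may differ from the estimator of [13]. All function names, the additive fitness scheme (1 + s, 1 + s/2, 1) and the `max_burn_in` cap are assumptions made for illustration.

```python
import random

def fst(freqs):
    """FST for one bi-allelic locus as Var(p) / (pbar * (1 - pbar)).
    Assumed estimator; returns 0.0 when the locus is fixed overall."""
    n = len(freqs)
    pbar = sum(freqs) / n
    if pbar <= 0.0 or pbar >= 1.0:
        return 0.0
    var = sum((p - pbar) ** 2 for p in freqs) / n
    return var / (pbar * (1.0 - pbar))

def select(p, s):
    """Allele frequency after selection with no dominance
    (assumed genotype fitnesses: 1 + s, 1 + s/2, 1)."""
    q = 1.0 - p
    w_mean = p * p * (1.0 + s) + 2.0 * p * q * (1.0 + s / 2.0) + q * q
    return (p * p * (1.0 + s) + p * q * (1.0 + s / 2.0)) / w_mean

def drift(p, ne, rng):
    """One generation of random mating: binomial sampling of 2*Ne gametes."""
    draws = 2 * ne
    return sum(rng.random() < p for _ in range(draws)) / draws

def simulate_low_fst(n_iter, n_pop, ne, n_gen, s, fst0, rng, max_burn_in=200):
    """Return one FST value per iteration; s = 0 gives the null distribution."""
    results = []
    for _ in range(n_iter):
        p_all = rng.random()  # overall allele frequency ~ U(0, 1)
        freqs = [drift(p_all, ne, rng) for _ in range(n_pop)]
        burn = 0
        # Initial phase: drift until FST reaches FST(0). The cap is an
        # assumption that avoids an endless loop when the locus is fixed.
        while fst(freqs) < fst0 and burn < max_burn_in:
            freqs = [drift(p, ne, rng) for p in freqs]
            burn += 1
        # Main phase: n_gen generations with selection followed by drift.
        for _ in range(n_gen):
            freqs = [drift(select(p, s), ne, rng) for p in freqs]
        results.append(fst(freqs))
    return results

def power_low(null, alt, alpha=0.05):
    """Fraction of FST values under selection that fall below the
    5th percentile of the null distribution."""
    threshold = sorted(null)[int(alpha * len(null))]
    return sum(f < threshold for f in alt) / len(alt)
```

With a seeded generator such as `rng = random.Random(1)`, the null and alternative distributions would be obtained from two calls to `simulate_low_fst` (one with `s=0.0`, one with `s` greater than 0) and passed to `power_low`. The default setting described above corresponds to `n_iter=1000`, `n_pop=10`, `ne=50` and `fst0=0.05`.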
High-FST outlier approach
In each of 1000 iterations, a single overall allele frequency was first drawn randomly from a uniform distribution between 0 and 1. Npop * 2 populations, each consisting of Ne animals with a single diploid locus, were then formed. Half of the individuals were designated as males, the other half as females. Genotypes were assigned randomly to individual animals, given the overall allele frequency. Random mating was simulated in each population for a number of generations, until the FST value between populations reached the desired initial level (FST(0)). The populations were then split into two sets of equal size, representing farmed and wild populations. For the farmed populations, random mating with (alternative hypothesis) or without (null hypothesis) selection was simulated for Ngen generations. For the wild populations, random mating without selection was simulated for Ngen generations, but the size of each population was first increased to 500 in order to minimise the effect of drift in wild populations. At the end of each iteration, the populations were merged into one farmed and one wild 'metapopulation', and FST between these metapopulations was calculated. This process was iterated 1000 times without selection in order to generate a distribution of FST under the null hypothesis, and 1000 times with selection in order to generate a distribution of FST under the alternative hypothesis. Finally, the power to detect outlier loci was calculated. The power was defined as the fraction of the FST distribution generated under the alternative hypothesis (i.e. under selection) that was higher than the 95th percentile of the FST distribution generated under the null hypothesis (i.e. without selection). The Python code can be found in Additional file 2.
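The farmed-versus-wild comparison can be sketched in the same spirit. Again, this is a simplified illustration and not the simuPOP program from Additional file 2: populations are represented only by their allele frequencies, drift is modelled as binomial sampling of gametes, and each metapopulation is summarised by the unweighted mean allele frequency of its member populations. The function names, the additive fitness scheme, the FST formula and the `max_burn_in` cap are all assumptions.

```python
import random

def fst(freqs):
    """FST as Var(p) / (pbar * (1 - pbar)); assumed estimator.
    Returns 0.0 when the locus is fixed overall."""
    n = len(freqs)
    pbar = sum(freqs) / n
    if pbar <= 0.0 or pbar >= 1.0:
        return 0.0
    var = sum((p - pbar) ** 2 for p in freqs) / n
    return var / (pbar * (1.0 - pbar))

def select(p, s):
    """Allele frequency after selection with no dominance
    (assumed genotype fitnesses: 1 + s, 1 + s/2, 1)."""
    q = 1.0 - p
    w_mean = p * p * (1.0 + s) + 2.0 * p * q * (1.0 + s / 2.0) + q * q
    return (p * p * (1.0 + s) + p * q * (1.0 + s / 2.0)) / w_mean

def drift(p, size, rng):
    """One generation of random mating: binomial sampling of 2*size gametes."""
    draws = 2 * size
    return sum(rng.random() < p for _ in range(draws)) / draws

def simulate_high_fst(n_iter, n_pop, ne, n_gen, s, fst0, rng,
                      wild_size=500, max_burn_in=200):
    """Return one farmed-vs-wild FST value per iteration (s = 0 for the null)."""
    results = []
    for _ in range(n_iter):
        p_all = rng.random()  # overall allele frequency ~ U(0, 1)
        freqs = [drift(p_all, ne, rng) for _ in range(2 * n_pop)]
        burn = 0
        # Initial phase: drift until FST among all populations reaches FST(0).
        while fst(freqs) < fst0 and burn < max_burn_in:
            freqs = [drift(p, ne, rng) for p in freqs]
            burn += 1
        # Split into two equally sized sets: farmed and wild.
        farmed, wild = freqs[:n_pop], freqs[n_pop:]
        for _ in range(n_gen):
            # Farmed: selection plus drift at Ne; wild: drift only, at size 500.
            farmed = [drift(select(p, s), ne, rng) for p in farmed]
            wild = [drift(p, wild_size, rng) for p in wild]
        # Merge each set into one metapopulation (unweighted mean frequency).
        p_farm = sum(farmed) / n_pop
        p_wild = sum(wild) / n_pop
        results.append(fst([p_farm, p_wild]))
    return results

def power_high(null, alt, level=0.95):
    """Fraction of FST values under selection that exceed the
    95th percentile of the null distribution."""
    threshold = sorted(null)[int(level * len(null))]
    return sum(f > threshold for f in alt) / len(alt)
```

As before, power would be estimated by calling `simulate_high_fst` once with `s=0.0` and once with a positive `s`, using a seeded `random.Random` instance, and passing both result lists to `power_high`. Running the full 1000 iterations in pure Python is slow because of the larger wild populations, so a smaller `n_iter` is convenient for a first check.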
Testing
With default parameter values, the power to detect non-neutral loci among breeding populations (low-FST outliers) was found to be very low, except for extremely large selection coefficients. In contrast, relatively small or moderate selection coefficients were sufficient for detecting non-neutral loci when comparing farmed and wild populations (high-FST outliers) (Figure 1).
The power to detect high-FST outliers increased rapidly, and was large for moderate and large selection coefficients, once the effective population size, the number of populations and the number of generations passed reached 40, 5, and 10, respectively. Power and initial FST were negatively correlated, with power declining rapidly as the initial FST increased. The power to detect weak selection (s = 0.05) was close to zero regardless of effective population size, number of populations, and initial FST, but increased with the number of generations since the establishment of the breeding populations (Figure 2).
The power to detect low-FST outliers was not affected by increasing the effective population size, or by the initial FST. The largest effects on power came from increasing the number of populations and the number of generations (Figure 3).