A new measure of population structure using multiple single nucleotide polymorphisms and its relationship with FST
© Xu et al; licensee BioMed Central Ltd. 2009
Received: 31 December 2008
Accepted: 06 February 2009
Published: 06 February 2009
Large-scale genome-wide association studies are promising for unraveling the genetic basis of complex diseases. Population structure is a potential problem, and its effects on genetic association studies are controversial. The first step in systematically quantifying the effects of population structure is to choose an appropriate measure of population structure for human data. The commonly used measure is Wright's FST. For a set of subpopulations, a single value of FST is generally assumed. However, the estimates can differ across loci. Since population structure is a concept at the population level, a measure of population structure that uses information across loci would be desirable.
In this study we propose a C parameter that is adjusted for the sample size from each sub-population. The new measure C is based on the c parameter proposed for SNP data, which is assumed to be subpopulation-specific and common to all loci. We performed extensive simulations of samples with varying levels of population structure to investigate the properties and relationships of both measures. We found that the two measures generally agree well.
The new measure simultaneously uses the marker information across the genome. It has the advantage of easy interpretation as one measure of population structure and yet can also assess population differentiation.
Large-scale genome-wide association studies are promising for unraveling the genetic basis of complex diseases in humans, and many such studies are currently being carried out. However, the size of the data raises several challenges in analysis and interpretation. One potential problem is hidden population structure in the samples. It can cause spurious associations when cases and controls differ in ancestry and is thus a confounding factor. However, the effects of population structure in real large-scale association studies are very controversial. Therefore, a systematic study is needed to quantify the levels of population structure and its effects on genetic association studies.
The first step in quantifying the effects of population structure is to choose an appropriate measure of population structure for human data. The commonly used measure is Wright's FST. For a set of subpopulations, a single value of FST is generally assumed. However, the estimates can differ across loci. This could be a problem if population structure is adjusted for using local estimates in genome-wide association studies, because doing so could mask real associations and lead to loss of power. With the availability of genomic data, we would like a measure that uses information across markers. Therefore, we propose a new measure C for SNP data. The new measure is the same for all loci and uses information across loci. It is based on the subpopulation-specific c parameters, which measure the divergence of each subpopulation from the common ancestor. We then performed extensive simulations to investigate the performance of the new measure and compared it to the traditional FST statistic.
where L is the number of loci, P is the number of populations, nij is the number of chromosomes genotyped at the ith SNP in the jth population, πi is the ancestral allele frequency for the ith SNP, and the variance parameter cj specifies how far the jth subpopulation's allele frequencies tend to be from the ancestral allele frequency. In our simulations, we sample cj and πi from a uniform distribution on (0, 1). The simulations were performed using the simMD program in the Popgen package.
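The simulation setup described above can be sketched in Python. This is a minimal illustration and not the simMD program. It assumes the normal-binomial form of the model on which the c parameter is based: the subpopulation allele frequency αij is drawn from a normal distribution with mean πi and variance cj πi (1 − πi), and the allele count is binomial with nij trials. Clipping αij to [0, 1], the use of a single nij per subpopulation, and the name `simulate_counts` are assumptions made for this sketch.

```python
import random


def simulate_counts(num_loci, chromosomes_per_pop, seed=None):
    """Simulate allele counts for num_loci SNPs in several subpopulations.

    chromosomes_per_pop[j] plays the role of n_ij (assumed equal across loci).
    Returns (c, pi, counts), where counts[i][j] is the allele count at
    SNP i in subpopulation j.
    """
    rng = random.Random(seed)
    num_pops = len(chromosomes_per_pop)

    # c_j: one variance parameter per subpopulation, drawn uniformly from [0, 1).
    c = [rng.random() for _ in range(num_pops)]
    # pi_i: one ancestral allele frequency per SNP, drawn uniformly from [0, 1).
    pi = [rng.random() for _ in range(num_loci)]

    counts = []
    for i in range(num_loci):
        row = []
        for j in range(num_pops):
            # Assumed drift model: alpha_ij ~ Normal(pi_i, c_j * pi_i * (1 - pi_i)).
            sd = (c[j] * pi[i] * (1.0 - pi[i])) ** 0.5
            alpha = rng.gauss(pi[i], sd)
            # Assumption: frequencies outside [0, 1] are clipped to the boundary.
            alpha = min(1.0, max(0.0, alpha))
            # Binomial draw: count chromosomes carrying the allele.
            n_ij = chromosomes_per_pop[j]
            x_ij = sum(1 for _ in range(n_ij) if rng.random() < alpha)
            row.append(x_ij)
        counts.append(row)
    return c, pi, counts
```

For example, `simulate_counts(100, [60, 60, 60], seed=1)` would produce counts for 100 SNPs in three subpopulations of 60 chromosomes each.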
Estimating the cj parameter
For each sample in the simulated data sets, we estimated the c parameter for each subpopulation using a Bayesian approach. We assumed uniform priors on both the c and π parameters and used Markov chain Monte Carlo (MCMC) methods (a Gibbs sampler) to sample from the posterior distribution. The Markov chain was run for 20,000 iterations, and the first 10,000 iterations were discarded as burn-in. We estimated the c parameter as the mean of the retained posterior samples.
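The final summarization step, discarding the burn-in and averaging the remaining draws, can be sketched as follows. This shows only the post-processing of a chain and not the Gibbs sampler itself. The function name `posterior_mean` and the representation of the chain as a plain list of draws are assumptions for illustration.

```python
def posterior_mean(chain, burn_in):
    """Return the mean of the MCMC draws that remain after discarding burn-in.

    chain   -- list of sampled values for one parameter (e.g. one c_j)
    burn_in -- number of initial iterations to discard
    """
    if not 0 <= burn_in < len(chain):
        raise ValueError("burn_in must be non-negative and smaller than the chain length")
    kept = chain[burn_in:]
    return sum(kept) / len(kept)
```

With the settings described in the text, a chain of 20,000 draws would be summarized as `posterior_mean(chain, burn_in=10_000)`.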
The new measure C is the weighted mean of the subpopulation-specific parameters, $C = \sum_{j=1}^{P} w_j c_j$, where wj is the weight for the jth subpopulation. There are many possible weighting schemes. Here we propose to use the sample size from each subpopulation as the weight. That is, $w_j = n_j / \sum_{k=1}^{P} n_k$, where nj is the number of individuals from subpopulation j in our sample. In our implementation, we used the posterior estimate of cj and took the weighted mean as an estimate of C.
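The sample-size weighting can be illustrated with a short Python sketch. It assumes that each weight is the subpopulation's share of the total sample, so that the weights sum to one. The function name `weighted_c` is hypothetical.

```python
def weighted_c(c_estimates, sample_sizes):
    """Combine per-subpopulation c estimates into one sample-size-weighted C.

    c_estimates  -- posterior estimates of c_j, one per subpopulation
    sample_sizes -- number of individuals n_j sampled from each subpopulation
    """
    if len(c_estimates) != len(sample_sizes):
        raise ValueError("need one sample size per c estimate")
    total = sum(sample_sizes)
    # w_j = n_j / sum of all n_k, so the weights sum to one.
    weights = [n / total for n in sample_sizes]
    return sum(w * c for w, c in zip(weights, c_estimates))
```

With equal sample sizes the result reduces to the simple average of the cj estimates. With unequal sizes, larger subpopulations contribute proportionally more.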
From the simulated samples, FST was estimated using an unbiased estimator for bi-allelic SNPs.
Equal sample size
[Table: Correlation coefficients between estimates of FST and C from simulations with varying sample sizes.]
[Table: Correlation coefficients between estimates of FST and C from simulations with αij from a mixture distribution. Rows: a = 0.05, a = 0.10, a = 0.20, a = 0.30.]
Unequal sample size
[Table: Correlation between the estimates of FST and C under several weighting schemes. Rows: sample size (30, 40, 90); sample size (30, 60, 70).]
Varying number of SNPs
Varying number of sub-populations
[Table: Correlation between the estimates of FST and C with varying number of sub-populations. Rows: (60, 60, 60); (45, 45, 45, 45); (36, 36, 36, 36, 36).]
Natural populations of the same species from different geographic regions tend to differ genetically, and human populations are no exception. Previous research has shown that ignoring the genetic differences among sub-populations is a potential problem for genetic association studies of human diseases, especially for genome-wide association studies. The problem could be severe for large multi-center studies and for studies in admixed populations, such as African Americans.
The explosion of SNP data in human populations provides an unprecedented opportunity to further characterize population structure and relationships. In this paper, we proposed a new measure of population structure specifically for SNPs. It is based on the c parameter which is population specific and measures the differentiation of the population from the common ancestor population. In contrast, the new measure C is an index of the overall levels of population structure across populations. Through extensive simulations, we showed that the new measure C has very high correlations with the traditional Wright's FST. The correlation increases as we have more information (more SNPs and/or more sub-populations in the samples).
While the new measure is different from the c parameter, it inherits some advantages from it. First, it is specific to SNPs and takes account of the ascertainment bias in the process of SNP discovery. Since SNP discovery is generally conducted in small samples, SNPs with high minor allele frequencies are more likely to be discovered than SNPs with low minor allele frequencies, which creates a possible ascertainment bias. It has been shown that this ascertainment bias can affect the estimation of population parameters in genetic analysis. The bias is explicitly accounted for in the model for estimating the individual c parameter for each sub-population. It is assumed that a large number of potential loci are examined in small samples from each of the sub-populations, and a locus is chosen if it is not fixed for the same allele in all sub-populations.
Second, the new measure is based on inferences from a Bayesian framework. Therefore, it is very flexible in modeling and can incorporate prior information on the parameters. In our simulation studies, we used uninformative prior distributions for the c and π parameters. If we have any prior knowledge regarding the distribution, we could easily incorporate it in the estimation, which can lead to more accurate estimates than the moment-based estimates of FST.
In summary, we proposed a new measure of population structure based on a Bayesian hierarchical model for SNPs. It uses the information at multiple markers and has high correlations with the traditional measure FST. We recommend reporting the new measure along with the individual c parameters for sub-populations, so that both the overall level of population structure and the divergence of each sub-population can be assessed.
We thank the two anonymous reviewers for their constructive comments. This work is supported by R21NS057506 from the US National Institutes of Health and a Scientist Training Grant from the Medical College of Georgia to HX.
- Wright S: Isolation by Distance. Genetics. 1943, 28: 114-138.
- Nicholson G, Smith AV, Jónsson F, Gústafsson Ó, Stefánsson K, Donnelly P: Assessing population differentiation and isolation from single-nucleotide polymorphism data. J R Stat Soc Ser B Stat Methodol. 2002, 64 (4): 695-715. 10.1111/1467-9868.00357.
- Balding DJ, Nichols RA: A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995, 96 (1–2): 3-12. 10.1007/BF01441146.
- Marchini's Homepage. [http://www.stats.ox.ac.uk/~marchini/software.html]
- Weir BS, Cockerham CC: Estimating F-statistics for the analysis of population structure. Evolution. 1984, 38: 1358-1370. 10.2307/2408641.
- Xu H, Shete S: Effects of population structure on genetic association studies. BMC Genet. 2005, 6 (suppl 1): S109. 10.1186/1471-2156-6-S1-S109.
- Wakeley J, Nielsen R, Liu-Cordero SN, Ardlie K: The discovery of single-nucleotide polymorphisms – and inferences about human demographic history. Am J Hum Genet. 2001, 69 (6): 1332-1347. 10.1086/324521.