A Monte Carlo test of linkage disequilibrium for single nucleotide polymorphisms
© Xu et al; licensee BioMed Central Ltd. 2011
Received: 10 March 2011
Accepted: 14 April 2011
Published: 14 April 2011
Genetic association studies, especially genome-wide studies, make use of linkage disequilibrium(LD) information between single nucleotide polymorphisms (SNPs). LD is also used for studying genome structure and has been valuable for evolutionary studies. The strength of LD is commonly measured by r2, a statistic closely related to the Pearson's χ2 statistic. However, the computation and testing of linkage disequilibrium using r2 requires known haplotype counts of the SNP pair, which can be a problem for most population-based studies where the haplotype phase is unknown. Most statistical genetic packages use likelihood-based methods to infer haplotypes. However, the variability of haplotype estimation needs to be accounted for in the test for linkage disequilibrium.
We develop a Monte Carlo based test for LD based on the null distribution of the r2 statistic. Our test is based on r2 and can be reported together with r2. Simulation studies show that it offers slightly better power than existing methods.
Our approach provides an alternative test for LD and has been implemented as a R program for ease of use. It also provides a general framework to account for other haplotype inference methods in LD testing.
Genetic association studies, especially large-scale genome-wide association studies have become very popular in recent years due to the rapid advancement of genotyping technologies and the completion of the Human Genome Project [1, 2]. More than 400 susceptibility regions have been identified through genome-wide association approach. This approach relies on the linkage disequilibrium information between genetic markers, mostly single-nucleotide polymorphisms (SNPs), hence been termed linkage disequilibrium mapping. Linkage disequilibrium (LD) refers to the nonrandom association of alleles at different loci on the same haplotype. The underlying assumption of genetic association studies is that there are some disease causing loci in the genome, and if the SNPs under investigation (i.e. markers) and the disease-causing loci are in close proximity, the marker alleles will be associated with the alleles at the disease-causing loci. In other words, those markers are in LD with the disease causing loci if they are in close proximity. Since markers in high LD are highly correlated, testing the significance of LD between alleles of markers is also useful in finding LD blocks and tag-SNPs. This could reduce the number of markers required in genome-wide studies. In addition to gene mapping, LD information also proves to be useful in evolutionary studies of gene dynamics, tracing human origin and history, and studies of genome structure and forensic science.
Consider two bi-allelic SNPs, marker A and marker B. The two alleles at marker A are denoted as A1 and A2 with frequencies p1 and p2, respectively, and the two alleles at marker B are denoted as B1 and B2 with frequencies q1 and q2 respectively. The non-random association of the alleles at the two loci can be measured as the difference between the haplotype frequency of A1B1 in the population and the expected frequency under the null hypothesis of independence i.e., , where is the frequency of haplotype A1B1. If we replace the population haplotype frequency of A1B1, , by the observed frequency, in the sample, we get an estimator of δ, given by . The statistic D depends on marker allele frequency, which makes it harder to compare across different markers and populations. As a result, many measures have been proposed to standardize D. Two such common measures of LD are D' and r2. D' is bounded between 0 and 1. The bound of r2 depends on allele frequency and is given in .
In general, r2 is used to measure the statistical association between marker pairs and is related to the power of LD mapping. In a case-control study, if r2 is the level of LD between a marker and a causative polymorphism and the sample size required to detect the association of the disease with the causative polymorphism is n, then the sample size required to detect the association of the disease with the marker at the same power level is approximately equal to n/r2[5–7]. Because of this convenient relationship, r2 is used extensively in association mapping as a measure of LD.
where N is the number of chromosomes in the sample, or twice the number of individuals for humans. Nr2 is then compared to a -distribution as a test of LD. This works fine when the haplotypes can be directly observed. However, problems arises when we use this approach in the analysis of population-based data, where haplotypes are usually not observed so the cell counts of the contingency table are not known. As a result, an estimation procedure, such as maximum-likelihood approach, has to be used to estimate the haplotype counts. This introduces additional variability and in turn the test statistic Nr2 will not follow a -distribution. For example, in the R package, "genetics", the estimated haplotype counts are used to compute Nr2, which is then compared to a -distribution as a test of LD, although there is a warning in the documentation noting that this may not be a valid test.
An approach that allows for unknown haplotypes in testing LD has been proposed by Weir , based on a composite LD measure. This approach has been extended to markers with multiple alleles by Schaid  and Zaykin et al. . A test of LD based on the common measure r2 has been developed on the asymptotic distribution derived from the δ-method . In this report, we first show that the additional variability from haplotype estimation has to be accounted for in a test of LD when haplotype frequencies are not available. We then propose a test that accounts for this variability and present its properties in terms of type I error rate and power. Finally we compare the our test with that based on the composite LD and the test based on the asymptotic distribution.
Effects of haplotype estimation
As mentioned above, in most population-based studies where haplotypes are not directly observable, the haplotype counts have to be estimated. Most of the estimation procedures are based on maximum-likelihood approach as implemented in the R package "genetics", which is freely availably from the CRAN website http://www.cran.org. This estimation procedure adds additional variability which could make the distribution of the test statistic Nr2 deviate from the -distribution. In order to study the effects of the additional variability on the distribution of the test statistic, we perform simulations under the null hypothesis of no LD. The empirical distribution is then compared to the -distribution. Specifically, we consider 2 bi-allelic SNPs, A and B. The alleles at marker A are denoted as A1 and A2 with frequencies p1 and p2 respectively and those at marker B are B1 and B2 with respective frequencies q1 and q2. When an individual is heterozygous at both markers, the underlying haplotypes cannot be identified with certainty from the genotype. In the first set of simulations, we assume the two markers are in Hardy-Weinberg equilibrium (HWE). Under HWE, the genotype frequencies at SNP A are , 2p1p2, and for A1A1, A1A 2 and A2A2, respectively. Similarly, we can write the genotype frequencies at SNP B. Under the null hypothesis of no LD, the joint distribution of the two-locus genotype follows a multinomial distribution with cell probabilities equal to the product of the corresponding genotype frequencies at the two SNPs because genotypes at the two SNPs are independent. For example, the two-marker genotype frequency of A1A1B1B1 is . We simulate the genotypes at the two SNPs in 1000 individuals by sampling from this multinomial distribution. The haplotype counts are then estimated from the simulated genotype data using the maximum-likelihood approach implemented in the R package "genetics". We then compute the test statistic Nr2 based on the estimates of haplotype counts. We generate 10,000 replications for the simulation. The empirical distribution of Nr2 from the 10,000 replicates is then compared with the -distribution. To examine the effect of ignoring the variation in haplotype estimation, we use the upper 0.05 quartile from the -distribution as cutoff and compute the proportion of simulated replicates with the test statistic exceeding the cutoff point. Similar analyses are performed at 0.01 and 0.001 significance level.
This represents a simple parameterization of genotype frequencies similar to those in  and . Under HWD, there are less heterozygotes compared to the case under HWE when D H < 0, and more heterozygotes when D H > 0. Similar expressions can be written for the genotype frequencies at SNP B. Under the null hypothesis of no LD, the joint distribution of the two-marker genotypes is a multinomial distribution with each cell probability equal to the product of the corresponding genotype frequencies at the two SNPs. Two-marker genotypes for 1,000 individuals are simulated by sampling from this multinomial distribution. Similar to the case under HWE, haplotype counts are estimated using the maximum-likelihood approach implemented in the R package "genetics" and the values of the proposed test statistic Nr2 are computed and compared with the -distribution.
As shown in the results section, with unknown haplotypes, the empirical distribution of the test statistic Nr2 deviates drastically from the -distribution, and type I error is greatly inflated. Therefore, we propose an Monte-Carlo approach for LD testing based on the distribution of the test statistic Nr2 under the null hypothesis of no LD. We use the quartiles from the empirical distribution under the null hypothesis as the critical values in order to give the correct type I error rate. The distribution under null hypothesis is generated using a bootstrap approach . Specifically, under the null hypothesis of no LD, the genotypes at the two SNPs are independent. Therefore, given the observed genotypes from a sample of N individuals, we first generate a bootstrap sample of N individuals by sampling with replacement from the genotypes at SNP A. This process is repeated for SNP B. The bootstrapped genotypes from SNP A are then randomly paired up with those from SNP B to form two-locus genotypes for the N individuals. This constitutes one bootstrap sample. We then apply the likelihood-based method in the R package "genetics" and calculate the test statistic for each bootstrap sample. This is replicated 10,000 times to generate the distribution under the null hypothesis.
Similarly, we have P(B2|A1) = q2 - D/p1,P(B1|A2) = q1 - D /p2, and P(B2|A2) = q2 + D/p2. With the simulated genotypes, haplotype estimation and computation of the LD measure, r2, are performed using the R package "genetics". We carried out 10,000 simulation replications under this scenario, and the proportion of replications with Nr2 greater than the cutoffs from the empirical distribution from the simulations under null hypothesis of no LD is taken as an estimate of empirical power. We compare our approach with two previous methods for testing LD, allowing for unknown haplotype, namely, the method by Weir  and the method based the asymptotic distribution . We apply their tests to the same simulated samples to get the corresponding estimate of power. The cutoffs are based on the empirical distribution of the Nr2 under the null hypothesis of D' = 0. We compute the empirical power for various values of D' ranging from 0 to 0.25 at significance level α = 0.05, 0.01 and 0.001.
Type-I error rate using test with unknown haplotypes
It is evident from Table 1 that the type-I error rate is inflated if we use the χ2 test ignoring the uncertainty in haplotype estimation. At 0.05 level, type-I error rate is inflated by 5.72 times. It is inflated even further as the level of the test decreases. At 0.001 level, it is inflated by 166.2 times. This suggests that for samples with unknown haplotypes, the actual distribution of the test statistic differs drastically from the -distribution, especially under the tail, and therefore, using the usual χ2 test will result in grossly erroneous conclusions.
Table 1 also gives the empirical type-I error rate under HWD. Similar to the case of HWE, type-I error rate is inflated. It is inflated further as the level of test decreases. At 0.05 level, it is inflated by 6.23 times and at 0.001 level, it is inflated by as much as 192.3 times. Compared to the result under HWE, type-I error rate is inflated further. This is probably due to the fact that HWD could bias the haplotype estimate. Therefore, our results suggest that the additional variability brought by the haplotype estimation makes the distribution of the test statistic differs drastically from the expected -distribution regardless of whether the SNPs are in HWE or not. Since we could generate the empirical distribution of the LD measure r2 under the null hypothesis, a direct test of LD could be based the empirical distribution rather than relying on erroneous assumptions.
Power comparison of our test and two previous tests from simulations under HWE
We change the level of LD by varying D' from 0 to 0.25. As shown in Table 2, the power of the Monte Carlo-based test increases quickly as D' increases. It reaches the perfect power of 1.0 at D' = 0.2 for α = 0.05 and α = 0.01. We apply the test based on composite LD to the same simulated data set for the purpose of power comparison. Table 2 also gives the power estimates for the test based on composite LD (labeled as "comp-LD" in the table) and the test based on asymptotic distribution (labeled as "asym-LD" in the table). It is obvious from Table 2 that the power of our test is comparable to the test based on composite LD, though our proposed method has a slight advantage. The test based on asymptotic distribution has the lowest power among the there tests.
Power comparison of our test and two previous tests from simulations under HWD
We have implemented the Monte Carlo-based test in R. The program can be downloaded from the author's website at http://www.biostat.mcg.edu/~hxu/software/ldtest.zip.
Application of our test to the NARAC data
Testing the significance of LD between SNPs is of fundamental importance for genetic association studies. One popular measure of LD is r2. However, as shown in our simulations, for most population-based samples when the haplotypes are not known, the additional variability of haplotype estimation makes the traditional χ2 test inapplicable. The departure form the assumed -distribution is more severe in extreme tails. This makes the χ2 test even more problematic as extremely low significance levels are usually used to account for the effect of multiple testing in genome-wide studies. In this report, we propose a simple LD test based on the null distribution of the test statistic Nr2 from simulations, taking advantage of the increasingly available computing powers. Unlike the test based on a composite LD measure, the Monte Carlo test is directly based on the distribution of the popular LD measure r2 and can be report together with r2. As shown in the results section, our test has similar or slightly increased power compared to the test based on composite LD. The test is easily implemented in R. It works well with existing R packages and suitable for automation in large-scale genome-wide studies. A likelihood ratio test of LD using genotype data with unknown haplotypes has been developed by Slatkin et al. . Similar to our approach, the null distribution of their test statistic is generated using computer-based permutations. However, the likelihood ratio test assumes HWE, while our test works well under either HWE or HWD. Nonetheless, similar to other permutation or bootstrap-based approach, the payoff of our approach is the computer running time, which is generally not a major concern as computing power increases.
Using simulations, we have considered the effect of haplotype estimation using the maximum-likelihood approach implemented in the R package "genetics" and showed that the additional variability brought by the haplotype estimation process cannot be safely ignored. This is an example of single imputation in statistics literature. Similarly, haplotype phase uncertainly can lead to problems in haplotype-phenotype association studies. In these studies, it is tempting to estimate haplotypes from genotype data using the existing haplotype estimation methods and assign the individuals with the most likely haplotype pair (or the pair with the highest posterior probability if a Bayesian method is used). The assigned haplotype pairs are then treated as true haplotypes in downstream association analyses. This two-stage approach, though simple, can lead to erroneous inference about the haplotype-phenotype association. Simulation studies have shown that this approach can lead to substantial bias in the estimated genetic effects, poor coverage of confidence intervals, and significant inflation of type I error [15–17]. For further discussions, please see  and . Several methods have been developed to account for the uncertainty in haplotype estimation in the haplotype-phenotype association setting, including the expectation-substitution method  and the likelihood-based approach [21–23]. The latter involves the calculation of the variance-covariance matrix of the estimates based on the observed information matrix and has been implemented in the haplo.glm() function in the R package "haplo.stats"  and the program "HAPSTAT" .
Besides the maximum-likelihood method examined in the study, there are other more sophisticated methods for haplotype estimation that utilized high-density marker information, e.g. . In humans, one can also utilize the information from large international collaborative efforts such as HapMap  and 1000 Genome Projects  for better haplotype estimation. It should be noted that our test is not novel but based on standard re-sampling procedure. However, the general simulation framework can be used to study the effect of other haplotype estimation methods because this is a two-step procedure. In the first step, the sample genotypes are simulated under the null hypothesis of no LD. The samples are then analyzed in the second step for haplotype estimation and computation of the final test statistic. Notice that we can use whatever method for haplotype estimation that are applicable in the second step. Therefore, the general simulation framework is rather flexible and can easily be extended to study the effect of other haplotype estimation methods. For example, in our study, we considered the haplotypes at 2 bi-allelic loci. It is straightforward to extend it to the cases with multiple SNPs. In the first step, genotypes at multiple SNPs can be generated using the standard bootstrap approach. In the second step, haplotypes at multiple SNPs can then be estimated using haplotype estimation methods for high-density markers. This approach could potentially offer some advantages over the likelihood approach because it relies on the empirical distribution of the final test statistic rather than the normal distribution. Indeed, simulation studies have shown that the likelihood based approach has strong bias away from the null hypothesis when haplotype diversity is high .
We develop and implement a test of LD for population data when the haplotypes are unknown. It is directly based on the empirical distribution of r2, the measure of LD, and uses a Monte-Carlo approach. The test is easy to use and provides an alternative way to testing for LD for SNP data. It also provides a framework to study the effects of other haplotype estimation approaches.
- McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008, 9 (5): 356-369. 10.1038/nrg2344.PubMedView ArticleGoogle Scholar
- Altshuler D, Daly MJ, Lander ES: Genetic mapping in human disease. Science. 2008, 322 (5903): 881-888. 10.1126/science.1156409.PubMedPubMed CentralView ArticleGoogle Scholar
- Lewontin RC: The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models. Genetics. 1964, 49: 49-67.PubMedPubMed CentralGoogle Scholar
- Hill WG, Robertson A: Linkage diseqilibrium in finite populations. Theor Appl Genet. 1968, 38 (6): 226-231. 10.1007/BF01245622.PubMedView ArticleGoogle Scholar
- Kruglyak L: Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat Genet. 1999, 22 (2): 139-144. 10.1038/9642.PubMedView ArticleGoogle Scholar
- Pritchard JK, Przeworski M: Linkage disequilibrium in humans: models and data. Am J Hum Genet. 2001, 69: 1-14. 10.1086/321275.PubMedPubMed CentralView ArticleGoogle Scholar
- Teare MD, Dunning AM, Durocher F, Rennart G, Easton DF: Sampling distribution of summary linkage disequilibrium measures. Ann Hum Genet. 2002, 66 (Pt 3): 223-233. 10.1046/j.1469-1809.2002.00108.x.PubMedView ArticleGoogle Scholar
- Weir BS: Inferences about linkage disequilibrium. Biometrics. 1979, 35: 235-254. 10.2307/2529947.PubMedView ArticleGoogle Scholar
- Schaid DJ: Linkage disequilibrium testing when linkage phase is unknown. Genetics. 2004, 166: 505-512. 10.1534/genetics.166.1.505.PubMedPubMed CentralView ArticleGoogle Scholar
- Zaykin DV, Pudovkin A, Weir BS: Correlation-based inference for linkage disequilibrium with multiple alleles. Genetics. 2008, 180: 533-545. 10.1534/genetics.108.089409.PubMedPubMed CentralView ArticleGoogle Scholar
- Wellek S, Ziegler A: A genotype-based approach to assessing the association between single nucleotide polymorphisms. Hum Hered. 2009, 67 (2): 128-139. 10.1159/000179560.PubMedView ArticleGoogle Scholar
- Fallin D, Schork NJ: Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet. 2000, 67 (4): 947-959. 10.1086/303069.PubMedPubMed CentralView ArticleGoogle Scholar
- Efron B, Tibshirani RJ: An Introduction to the Bootstrap. 1994, Chapman and Hall/CRCGoogle Scholar
- Slatkin M, Excoffier L: Testing for linkage disequilibrium in genotypic data using the Expectation-Maximization algorithm. Heredity. 1996, 76 (Pt 4): 377-383. 10.1038/hdy.1996.55.PubMedView ArticleGoogle Scholar
- Thomas D, Stram D, Dwyer J: Exposure measurement error: influence on exposure-disease. Relationships and methods of correction. Annu Rev Public Health. 1993, 14: 69-93. 10.1146/annurev.pu.14.050193.000441.PubMedView ArticleGoogle Scholar
- Haiman CA, Stram DO, Pike MC, Kolonel LN, Burtt NP, Altshuler D, Hirschhorn J, Henderson BE: A comprehensive haplotype analysis of CYP19 and breast cancer risk: the Multiethnic Cohort. Hum Mol Genet. 2003, 12 (20): 2679-2692. 10.1093/hmg/ddg294.PubMedView ArticleGoogle Scholar
- Cox DG, Kraft P, Hankinson SE, Hunter DJ: Haplotype analysis of common variants in the BRCA1 gene and risk of sporadic breast cancer. Breast Cancer Res. 2005, 7 (2): R171-R175. 10.1186/bcr973.PubMedPubMed CentralView ArticleGoogle Scholar
- Lin DY, Huang BE: The use of inferred haplotypes in downstream analyses. Am J Hum Genet. 2007, 80 (3): 577-579. 10.1086/512201.PubMedPubMed CentralView ArticleGoogle Scholar
- Kraft P, Stram DO: Re: the use of inferred haplotypes in downstream analysis. Am J Hum Genet. 2007, 81 (4): 863-5. 10.1086/521371. author reply 865-6PubMedPubMed CentralView ArticleGoogle Scholar
- Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG: Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum Hered. 2002, 53 (2): 79-91. 10.1159/000057986.PubMedView ArticleGoogle Scholar
- Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 2002, 70 (2): 425-434. 10.1086/338688.PubMedPubMed CentralView ArticleGoogle Scholar
- Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ: Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum Hered. 2003, 55: 56-65. 10.1159/000071811.PubMedView ArticleGoogle Scholar
- Lin DY, Zeng D: Likelihood-based inference on haplotype effects in genetic association studies. Journal of the American Statistical Association. 2006, 101: 89-104. 10.1198/016214505000000808.View ArticleGoogle Scholar
- Schemed P, Stephens M: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006, 78 (4): 629-644. 10.1086/502802.View ArticleGoogle Scholar
- Consortium IH, Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Peltonen L, Dermitzakis E, Bonnen PE, Altshuler DM, Gibbs RA, de Bakker PIW, Deloukas P, Gabriel SB, Gwilliam R, Hunt S, Inouye M, Jia X, Palotie A, Parkin M, Whittaker P, Yu F, Chang K, Hawes A, Lewis LR, Ren Y, Wheeler D, Gibbs RA, Muzny DM, Barnes C, Darvishi K, Hurles M, Korn JM, Kristiansson K, Lee C, McCarrol SA, Nemesh J, Dermitzakis E, Keinan A, Montgomery SB, Pollack S, Price AL, Soranzo N, Bonnen PE, Gibbs RA, Gonzaga-Jauregui C, Keinan A, Price AL, Yu F, Anttila V, Brodeur W, Daly MJ, Leslie S, McVean G, Moutsianas L, Nguyen H, Schaffner SF, Zhang Q, Ghori MJR, McGinnis R, McLaren W, Pollack S, Price AL, Schaffner SF, Takeuchi F, Grossman SR, Shlyakhter I, Hostetter EB, Sabeti PC, Adebamowo CA, Foster MW, Gordon DR, Licinio J, Manca MC, Marshall PA, Matsuda I, Ngare D, Wang VO, Reddy D, Rotimi CN, Royal CD, Sharp RR, Zeng C, Brooks LD, McEwen JE: Integrating common and rare genetic variation in diverse human populations. Nature. 2010, 467 (7311): 52-58. 10.1038/nature09298.View ArticleGoogle Scholar
- Consortium GP, Durbin RM, Abecasis GR, Altshuler DL, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.