- Technical Note
- Open Access
MLEP: an R package for exploring the maximum likelihood estimates of penetrance parameters
BMC Research Notesvolume 5, Article number: 465 (2012)
Linkage analysis is a useful tool for detecting genetic variants that regulate a trait of interest, especially genes associated with a given disease. Although penetrance parameters play an important role in determining gene location, they are assigned arbitrary values according to the researcher’s intuition or as estimated by the maximum likelihood principle. Several methods exist by which to evaluate the maximum likelihood estimates of penetrance, although not all of these are supported by software packages and some are biased by marker genotype information, even when disease development is due solely to the genotype of a single allele.
Programs for exploring the maximum likelihood estimates of penetrance parameters were developed using the R statistical programming language supplemented by external C functions. The software returns a vector of polynomial coefficients of penetrance parameters, representing the likelihood of pedigree data. From the likelihood polynomial supplied by the proposed method, the likelihood value and its gradient can be precisely computed. To reduce the effect of the supplied dataset on the likelihood function, feasible parameter constraints can be introduced into maximum likelihood estimates, thus enabling flexible exploration of the penetrance estimates. An auxiliary program generates a perspective plot allowing visual validation of the model’s convergence. The functions are collectively available as the MLEP R package.
Linkage analysis using penetrance parameters estimated by the MLEP package enables feasible localization of a disease locus. This is shown through a simulation study and by demonstrating how the package is used to explore maximum likelihood estimates. Although the input dataset tends to bias the likelihood estimates, the method yields accurate results superior to the analysis using intuitive penetrance values for disease with low allele frequencies. MLEP is part of the Comprehensive R Archive Network and is freely available athttp://cran.r-project.org/web/packages/MLEP/index.html.
Linkage analysis remains a useful tool for detecting genetic variants that regulate a trait of interest, especially genes associated with a given disease. The likelihood of pedigree data plays an important role in this analysis; however, the entire likelihood function embodies functions of recombination fractions, penetrance parameters, and disease and marker allele frequencies. Such complexity can be reduced by varying only the recombination fraction, assigning fixed values to the other parameters. This allows the multivariate function to be expressed in terms of a single variable (the recombination fraction). The ratio of the univariate likelihood, the so-called LOD score, is then computed to map disease loci, rather than maximizing the likelihood itself. For a small recombination fraction, a high LOD score (greater than 3) implies that the disease locus is located near the markers employed in the analysis. Assuming the conditional parameters to achieve such high LOD scores depends largely on the researcher’s intuition.
Linkage analysis is hindered by the lack of useful tools and programs for the parameter estimation, although several methods for maximum likelihood estimate (MLE) of penetrance have been proposed. Two types of penetrance estimation are described in the literature. The first is based on pedigree data with marker genotype information; for which the likelihood is p θ (a V ,m V ), where a V and m V represent observations of affected status and marker genotypes respectively, for a set of pedigree members V. The parameter vector θ contains penetrance parameters, disease allele frequency, and recombination fraction, so that maximum likelihood estimates are obtained for all parameters simultaneously. The evaluated maximum likelihood estimate of penetrance parameters is therefore affected by the estimates of both recombination fraction and disease allele frequency. Because penetrance and marker genotype observations are independent (unless the marker and disease loci are extremely close), the method is not suitable for penetrance estimation in which a single disease allele determines whether the disease will manifest. This method has been shown in GENEHUNTER-MODSCORE[2–5] in which the ratio of the likelihood, or mod score function, can be maximized in practice. The second approach considers the likelihood of affected status; that is, the likelihood is expressed as. Maximized likelihood is a function of the penetrance parameters alone,, for a particular case. Wang et al. formalized the likelihood using Bayes theorem and proposed a method for estimating the penetrance parameter, but which functions only when a carrier of a disease allele develops the disease. The estimate was applied to a dominant disease; therefore, it is not applicable to diseases which manifest via other modes of inheritance. Swartz et al. developed three different kinds of estimators and compared their efficiencies by a ratio of asymptotic variance of one estimator to another. The efficiency was computed and illustrated by a perspective plot using the program Maple[9, 10], but these methods are limited to sib pairs. The methods have not been packaged as freely available software, hence are not immediately applicable to wider data analysis. A known problem with penetrance estimation methods is that their estimates depend largely on the collected pedigrees. However, these second approaches ensure robust estimates by using multiple sets of previously-recorded pedigree data for the same disease to estimate parameters, which are available from the literature; the marker genotype information is not available.
Our proposed method belongs to the second class of penetrance estimation. The likelihood can be explicitly expressed as a polynomial of penetrance parameters, and the evaluation measure is a vector of likelihood polynomial coefficients. More precisely, the coefficient vectors for each pedigree are evaluated by their independence property. Once the likelihood vectors have been computed, the problem reduces to one of optimization. The maximum likelihood estimates of penetrance parameters are then readily obtained by maximizing the likelihood using standard statistical software. We have developed programs to explore the maximum likelihood estimates of penetrance parameters using R statistical programming language supplemented with external C functions. The main function of the package evaluates the list of likelihood coefficient vectors. The software enables flexible exploration of the penetrance estimates by calculating the likelihood value and its gradient precisely, and passing them to an optimization function. The exploration is rendered more powerful if feasible parameter constraints can be incorporated in the maximum likelihood evaluation. Another advantage of our method is that it provides visual validation of convergence, in the form of perspective plots of the likelihood surface. All of the functions are available in the MLEP R package. Although the MLEP estimates are biased by the collected disease pedigree data, we show that they can more accurately identify a disease locus than can analysis based on intuitive penetrance values provided that both true and assumed disease allele frequency are small. The MLEP package is part of the Comprehensive R Archive Network (CRAN) and is freely downloadable fromhttp://cran.r-project.org/web/packages/MLEP/index.html.
First we introduce penetrance parameters as conditional probabilities given the genotype of the disease locus, such that α = P(affected|A/A), β = P(affected|A/a), and γ = P(affected|a/a), where A and a represent the disease and normal alleles respectively, and the genotype of the disease locus is expressed as two alleles separated by a slash. For simplicity, we suppose that a single locus contributes to disease development and that the disease allele frequency is known a priori. Using the above notations, the likelihood function for a pedigree is explicitly expressed as a polynomial of the parameters as follows:
where i, j, and k run over from 0 to N subject to the constraint max(i + j + k) = N and c ijk is a polynomial coefficient. Note that N, the number of individuals whose disease status (affected or unaffected) is known and the parameter constraint 0 ≤ γ ≤ β ≤ α ≤ 1 is reasonable. An individual of A/a genotype (where A is a dominant disease allele) has equal or higher chance of developing disease than an individual of a/a genotype, while the A/A genotype incurs the highest risk of disease development. Details of the likelihood evaluation algorithm are provided as Additional file1. The polynomial form of a likelihood evaluation for the recombination fraction has been previously proposed and its usefulness has been demonstrated. Here we present a modified form of the evaluation. Because the coefficient c ijk is inherited through the generations from founder to descendant, the likelihood coefficients of penetrance parameters can similarly be evaluated.
Evaluation of maximum likelihood estimate by MLEP
Here we use the MLEP package to explore the maximum likelihood estimate of penetrance parameters through a simulation study.
We used the SLINK package[13, 14] to generate simulation pedigree data. The pedigree structure followed that of Kamatani et al. and the same phenotypes of the top two founders were used in the simulation, while the others were generated. All marker genotypes were also simulated by the program, assuming five marker alleles with equal frequencies. Allele frequency for the disease allele was assumed to be 0.0001. Three penetrance parameters (α,β,γ), with values 0.95, 0.7 and 0, were assigned, and the recombination frequency was assumed to be 0. By repeating the simulation under these conditions, 50 pedigree datasets were generated. All of the diseased individuals for each pedigree are assumed to have acquired their disease from a single locus.
Evaluation of likelihood polynomial
The main function of the MLEP package, mlep, evaluates the list of likelihood coefficient vectors of penetrance parameters. This function accepts a pedigree matrix consisting of the first six columns in a pedfile (the sixth column supplies affected status information), and returns a list of likelihood coefficients. To conserve computational memory, powers of penetrance parameters, i, j, and k, for a coefficient c ijk are converted into a single value, i + j × max _power + k × max _powe r2; the converted values are then combined into a vector and assigned to the evaluated coefficients vector as a “powers” attribute, where “max _power" = N + 1. An additional attribute, “max _power”, maps the converted value to the original powers. After installing the MLEP package from the CRAN website and importing the simulation pedigree data into R as pedigree object, the package can be loaded and the mlep function can be executed by assuming the disease allele frequency to be 0.0001 as follows:
> polynomial = mlep(pedigree, 0.0001)
Evaluation of maximum likelihood estimate
Once the likelihood polynomial has been evaluated by the mlep function, maximum likelihood estimates may be computed using any optimization program. Derivatives (and the Hessian) of the function are readily computed from the evaluated likelihood coefficient, hence the method presents as a powerful tool for seeking maximum likelihood. To this end, the functions fr and grr are provided for evaluation of the log likelihood value and its gradient. The likelihood polynomial is maximized by the pre-included R function constrOptim, which minimizes or maximizes a function subject to linear inequality constraints using an adaptive barrier algorithm. Assuming the initial values of the parameters to be 0.9, 0.8, and 0.1, the constrOptim function can be executed to maximize the likelihood polynomial function, subject to 0 ≤ γ ≤ β ≤ α ≤ 1 as follows:
> constrOptim(c(0.9,0.8,0.1), fr, grr,
ci=c(rep(0,3), rep(-1,3), rep(0,2)), poly=polynomial, control=list
By executing the above command, the maximum likelihood estimates were obtained as 0.970, 0.943, and 0.188 with log likelihood value -1640.339. It is important to check whether the iteration converges to the global maximum. To this end, we use the PerspPenetrance function, which draws a perspective plot of the log likelihood surface fixed on one of the parameters. A plot of the evaluated likelihood for fixed γ = 0.188 is shown in Figure1. The plot is generated from the following command:
> PerspPenetrance(polynomial, "gamma",
0.188, theta=-60, phi=20)
Although the maximum is found near those of the other two maximum likelihood estimates, α and β, evaluated in this study, the estimates appear to be biased. This bias is not inherent in the model, but is introduced by the input data. The moderate probability of disease development for individuals of A/a disease genotype, 0.7, has likely produced the bias. If strong evidence exists that the probability of disease development for individuals of a/a genotype is low (,that is, phenocopy rate is low), a second command can be executed by changing one of the linear inequality constraints and the initial value of γ to 0 ≤ γ ≤ 0.01 and 0.001 as follows:
> constrOptim(c(0.9,0.8,0.001), fr, grr,
ci=c(rep(0,3), rep(-1,2), -0.01,
Following this adjustment, the obtained maximum likelihood estimates were 0.875, 0.759, and 0.003 with log likelihood value -1644.840 less that of global maximum. The perspective plot for fixed γ = 0.003 is shown in Figure2 and the maximum is found near those of the other two estimates although the surface is flat around the maximum.
To assess linkage analysis efficacy for different penetrance values, the expected LOD score was evaluated by the SLINK package given the following four penetrance models: true penetrance model (0.950, 0.700, and 0.000), dominant model (0.999, 0.999, and 0.000), MLE with the γ constraint model (0.875, 0.759, and 0.003), and unrestrained MLE model (0.970, 0.943, and 0.188). The results are summarized in Table1. In the true penetrance model, the highest LOD score (exceeding 3) is correctly obtained at θ = 0. The dominant model, which is often used for simplicity in practical analysis, yields the highest LOD score at θ = 0.1. Therefore, researchers may conclude that the disease locus exists relatively close, but not extremely close to, the employed marker. This example highlights the potential of intuitive approaches to yield an incorrect result. Analysis via MLE constrained by the γ parameter yields results that match the true model, but unconstrained MLE leads to the uncertain conclusion that the disease locus may not exist near the employed marker, since the maximum LOD score at θ = 0.05 in this case is less than 3.
Effect of misspecifying disease allele frequency
In the above simulation study, we assigned the disease allele frequency to the value used to simulate the pedigree data (referred to as the “true” value in this paper) to evaluate the likelihood polynomial, but the extent to which this value matches reality is unknown. To evaluate the effect of misspecification of disease allele frequency on maximum likelihood estimates of penetrance parameters and on the subsequent linkage analysis, we altered both true and assumed disease allele frequencies in the following simulation study.
The pedigree structure, number of marker alleles, marker allele frequencies, penetrance parameters, and recombination fractions were identical to those used in the previous simulation, while disease allele frequency was altered to 0.0001, 0.001, 0.01, 0.1, or 0.2. 50 pedigree datasets were generated for each disease allele frequency. For the simulated pedigree datasets, disease allele frequency was then assumed to be 0.0001, 0.001, 0.01, 0.1, 0.25, or 0.5, and the likelihood polynomials were evaluated in each case.
Maximum likelihood estimate
For each of the evaluated likelihood polynomials, we also evaluated their maximum likelihood estimates subject to the constraint 0 ≤ γ ≤ 0.01. The unconstrained estimates for disease allele frequency 0.0001 displayed similar bias to the assumed frequency in the previous simulation study. Hence we employed the same parameter constraints as previously applied for all cases in the present simulation, although the constraint exerted little effect on the likelihood estimates at disease allele frequencies greater than 0.0001. The evaluated maximum likelihood estimates, mean absolute bias, and mean squared error (MSE) of the estimates are summarized in Table2. Note that γ is restricted to small values. When the true frequency is 0.0001, 0.001, or 0.01, and the likelihood is also evaluated at one of these frequencies, we find that α is underestimated and β is overestimated. If the likelihood is evaluated at frequencies exceeding 0.01, the estimates become unstable as both the mean bias and MSE become large. Assuming the true frequency to be 0.1 or 0.2 but evaluating the likelihood at frequencies less than 0.1 results in overestimation of both α and β. If the likelihood is evaluated at frequency 0.1, the estimates remain biased, but likelihood estimates evaluated at higher allele frequencies become unstable. Perspective plots of the resulting log likelihood surfaces at fixed γ are provided as Additional file2. Although the estimates are biased or unstable, the maxima are correctly located near those of the other two estimates for all cases studied.
Power of detecting linkage
Based on the evaluated maximum likelihood estimates, the expected LOD scores for the following three penetrance models were evaluated by the SLINK packages; true penetrance model (0.950, 0.700, and 0.000), dominant model (0.999, 0.999, and 0.000), and MLE model (the estimates are listed in Table2). The results are shown in Table3. The true and MLE models yield maximum LOD scores greater than 3 at θ = 0 when both true and assumed disease allele frequencies are 0.0001, 0.001, or 0.01, or when the true frequency is one of these values and the assumed frequency is 0.1. These models performed with similar correctness at true frequency 0.1 and assumed frequency 0.01 or 0.1. The dominant model, on the other hand, yielded poor or misplaced LOD scores for all cases.
These results indicate that genetic diseases with low allele frequencies (< 0.01) are suitable for analysis by linkage analysis; that is, the linkage between disease locus and marker locus can be correctly identified. At disease allele frequencies exceeding 0.1, the correct conclusion may not be reached because the peak of the significant LOD score (greater than 3) is obtained at different recombination fractions or because the linkage-detecting efficacy is insufficient (i.e., no LOD score higher than 3 is obtained) even when true penetrance parameters are employed. The MLEP package will almost certainly yield correct results for genetic diseases with small allele frequency, provided that the frequency is also assumed to be 0.1 or less. In this case, incorrectly specifying the disease allele frequency will have little effect on the results. When the true frequency is 0.1, assigning a small frequency (of order 0.01) to likelihood evaluations will again yield the correct result. In all other cases, the effect of allele frequency misspecification cannot be neglected. These results were validated by another simulation study, in which true penetrance parameters were assumed to be 0.990, 0.900, and 0.000. Maximum likelihood estimates, LOD scores for the three penetrance models, and perspective plots are provided as Additional files3,4, and5, respectively.
We have developed the MLEP R package to explore maximum likelihood estimates of penetrance parameters. The polynomial form of the likelihood evaluation enables flexible exploration of penetrance estimates by evaluating both the likelihood value and its gradient precisely, and by introducing parameter constraints if these can be assumed. Introduction of a low phenocopy rate (γ < 0.01) especially improved the quality of the analysis result in the presented simulation study. Convergence may be verified visually by auxiliary perspective plots. The input pedigree datasets bias the evaluated maximum likelihood estimates; however, the likelihood surface may be flat around the maximum and the performance of the estimates at correctly identifying linkages is superior to that of intuitive penetrance values.
Currently, common linkage analysis employs dense marker data. The proposed method is applicable to such vast datasets because it treats only the likelihood that the disease status of a pedigree member is “affected”, and is independent of dataset size, which is derived from the marker genotypes. Penetrance parameters are estimated separately from the marker information, and linkage analysis employing the estimates is conducted as the next step. The method is applicable only to diseases with large effect size, that is, when a single disease locus greatly or wholly contributes to disease development. Diseases resulting from other mechanisms, such as multifactorial diseases, are not suitable for analysis by this approach. Disease allele frequency is another unknown important parameter in linkage analysis, but several previous studies have reported that misspecifying the disease allele frequency does not significantly influence to the detection of the linkage[6, 17]. Our simulation study supports these results, provided that both true and assumed disease allele frequencies are small. It has also been mentioned that linkage analysis would be successful if a disease allele frequency is rare in the population; therefore, assigning the frequency to a small value for such the disease produces a feasible result. Here, we have demonstrated that linkage analysis using maximum likelihood estimates by the MLEP package correctly localizes a disease locus, if the frequency of the disease allele is made small (< 0.1).
Availability and requirements
Project name: MLEP
Project home page: http://cran.r-project.org/web/packages/MLEP/index.html
Operating systems: Linux/Mac/Windows
Programming language: R and C
Other requirements: R version≥2.14
Any restrictions to use by non-academics: None
Availability of supporting data
The data set supporting the results of this article is available athttp://www.stat.math.keio.ac.jp/∼sugaya/PIA/MLEP/index.html.
YS implemented the software and prepared the manuscript.
Morton NE: Sequential tests for the detection of linkage. Am J Hum Genet. 1955, 7: 277-318.
Mattheisen M, Dietter J, Knapp M, Baur MP, Strauch K: Inferential testing for linkage with GENEHUNTER-MODSCORE: the impact of the pedigree structure on the null distribution of multipoint MOD scores. Genet Epidemiol. 2008, 32: 73-83. 10.1002/gepi.20264.
Dietter J, Mattheisen M, Fürst R, Rüschendorf F, Wienker TF, Strauch K: Linkage analysis using sex-specific recombination fractions with GENEHUNTER-MODSCORE. Bioinformatics. 2007, 23: 64-70. 10.1093/bioinformatics/btl539.
Strauch K, Fürst R, Rüschendorf F, Windemuth C, Dietter J, Flaquer A, Baur MP, Wienker TF: Linkage analysis of alcohol dependence using MOD scores. BMC Genetics. 2005, 6 (Suppl1): S162-
Strauch K: Parametric linkage analysis with automatic optimization of the disease model parameters. Am J Hum Genet. 2003, 73 (Suppl1): A2624-
Clerget-Darpoux F, Bonaïti-Pelliè C, Hochez J: Effects of misspecifying genetic parameters in lod score analysis. Biometrics. 1986, 42 (2): 393-399. 10.2307/2531059.
Wang Y, Ottman R, Rabinowitz D: A method for estimating penetrance from families sampled for linkage analysis. Biometrics. 2006, 62: 1081-1088. 10.1111/j.1541-0420.2006.00614.x.
Swartz MD, Minard CG, Amos CI: The relative efficiency of penetrance estimators for sib pairs. Hum Hered. 2005, 59: 61-66. 10.1159/000084737.
Maplesoft, a division of Waterloo Maple Inc.: Maple V. Version release 5.1. 1999, Ontario, Canada
Maplesoft, a division of Waterloo Maple Inc.: Maple 6. 2000, Ontario, Canada
R Development Core Team: R: A Language and Environment for Statistical Computing. 2011, Vienna, Austria, [http://www.R-project.org/] [ISBN 3-900051-07-0],
Sugaya Y, Shibata R: Exploration of the disease locus by a careful evaluation of the likelihood polynomial for pedigree data. J Hum Genet. 2011, 56: 383-389. 10.1038/jhg.2011.24.
Ott J: Computer-simulation methods in human linkage analysis. Proc Natl Acad Sci USA. 1989, 86: 4175-4178. 10.1073/pnas.86.11.4175.
Weeks DE, Ott J, Lathrop GM: SLINK: a general simulation program for linkage analysis. Am J Hum Genet. 1990, 47: A204-
Kamatani N, Moritani M, Yamanaka H, Takeuchi F, Hosoya T, Itakura M: Localization of a gene for familial juvenile hyperuricemic nephropathy causing underexcretion-type gout to 16p12 by genome-wide linkage analysis of a large family. Arthritis Rheum. 2000, 43: 925-929. 10.1002/1529-0131(200004)43:4<925::AID-ANR26>3.0.CO;2-B.
Lange K: Numerical Analysis for Statisticians. 1999, New York: Springer
Pal DK, Durner M, Greenberg DA: Effect of misspecification of gene frequency on the two-point LOD score. Eur J Hum Genet. 2001, 9: 855-859. 10.1038/sj.ejhg.5200724.
Bailey-Wilson JE, Wilson AF: Linkage analysis in the next-generation sequencing era. Hum Hered. 2011, 72 (4): 228-236. 10.1159/000334381.
The author is grateful to Professor Ritei Shibata at Keio University for his helpful suggestions. The author thanks the anonymous referees for their careful reading of the manuscript and their valuable comments and suggestions.
The author declares no competing interests.