A comparison of internal model validation methods for multifactor dimensionality reduction in the case of genetic heterogeneity
 Jeffrey J Gory†^{1},
 Holly C Sweeney†^{1},
 David M Reif^{1} and
 Alison A Motsinger-Reif^{1}
DOI: 10.1186/1756-0500-5-623
© Gory et al; licensee BioMed Central Ltd. 2012
Received: 7 May 2012
Accepted: 29 October 2012
Published: 5 November 2012
Abstract
Background
Determining the genes responsible for certain human traits can be challenging when the underlying genetic model takes a complicated form such as heterogeneity (in which different genetic models can result in the same trait) or epistasis (in which genes interact with other genes and the environment). Multifactor Dimensionality Reduction (MDR) is a widely used method that effectively detects epistasis; however, it does not perform well in the presence of heterogeneity, partly due to its reliance on cross-validation for internal model validation. Cross-validation allows for only one “best” model and is therefore inadequate when more than one model could cause the same trait. We hypothesize that another internal model validation method, known as a three-way split, will be better at detecting heterogeneity models.
Results
In this study, we test this hypothesis by performing a simulation study to compare the ability of MDR to detect models of heterogeneity under the two internal model validation techniques. We simulated a range of disease models with both main effects and gene-gene interactions with a range of effect sizes. We assessed the performance of each method using a range of definitions of power.
Conclusions
Overall, the power of MDR to detect heterogeneity models was relatively poor, especially under more conservative (strict) definitions of power. While the overall power was low, our results show that the cross-validation approach greatly outperformed the three-way split approach in detecting heterogeneity. This would motivate using cross-validation with MDR in studies where heterogeneity might be present. These results also emphasize the challenge of detecting heterogeneity models and the need for further methods development.
Background
An important problem in human genetics is the challenge of identifying polymorphisms that are associated with high disease risk. This task can be difficult because many common human diseases, such as heart disease and Type II diabetes, have a complex genetic etiology [1]. For instance, there can be gene-gene interactions (known as epistasis) or multiple genotypes that result in the same phenotype (known as genetic heterogeneity) [2]. Epistasis creates a challenge for traditional analytical approaches, and the resulting difficulties in feature selection and parameter estimation for epistasis models have been discussed previously in the literature [2–4].
To address these problems, a number of new approaches have been developed to try to detect interactions [5, 6]. Recent methods draw on a broad range of computational strategies to detect and characterize epistasis, including exhaustive search techniques [7, 8], two-stage screening approaches [9], Bayesian approaches [10], evolutionary algorithms [11], and tree-based approaches [12]. Each of these approaches has advantages and disadvantages for a range of genetic etiologies and dataset sizes [13–15]. Recently, a hand-curated database of all reported interactions in human genetics documented the methods used to discover these interactions [16]. Of the reported interactions, about 37% were detected using new machine-learning methods (as opposed to traditional statistical techniques such as regression, analysis of variance, etc.). Of those, Multifactor Dimensionality Reduction (MDR) [7] was used the most (in 35% of the studies using new methods, representing a total use in 13% of the studies reporting interactions). This widespread use motivates further investigation of practical implementation issues with this method.
MDR is a non-parametric procedure that reduces the dimensionality of the data by classifying each genotype as either high-risk or low-risk and then uses internal model validation, typically either five-fold or ten-fold cross-validation (CV), to select the best model [17]. MDR with CV has become common in genetic epidemiology and has successfully found interactions in both simulated and real data related to such diseases as schizophrenia, sporadic breast cancer, multiple sclerosis, and atrial fibrillation. A recent review of the MDR approach and its extensions and applications can be found in [18].
One drawback of MDR with CV is that it is computationally intensive because it performs an exhaustive search of all possible combinations of factors. Further, the use of m-fold CV for internal model validation requires that the MDR algorithm be executed m times for each possible combination, which adds to the computation time. To help reduce the required computation, an alternative internal model validation method, the three-way split (3WS), has been incorporated into the MDR algorithm [19]. MDR with 3WS has been shown to be significantly faster than MDR with CV, and it does not result in a significant loss in the ability to detect interactions [19]. MDR with 3WS does tend to fit a larger model than MDR with CV, so false positives are more common with 3WS and a pruning procedure may need to be employed if Type I error is to be avoided [19].
Another drawback of MDR is that it performs poorly in the presence of genetic heterogeneity [20, 21]. Genetic heterogeneity (where more than one model underlies disease risk) is a problem for a number of machine learning methods [22]. There are several potential reasons that MDR performs poorly in the presence of heterogeneity, as discussed in these previous studies [20, 21]. The use of cross-validation is one potential reason: the usual application of MDR is to pick a single best model, so if there are two competing models, as is the case under heterogeneity, no single model may emerge as consistently chosen, resulting in a low cross-validation consistency for all models. It is possible that MDR with 3WS could perform better than MDR with CV in such situations because the 3WS algorithm uses a different approach to screen potential models (allowing multiple models to be passed along at each stage) and tends to fit a larger model. To our knowledge, no study has been done to investigate the power of MDR with 3WS in the presence of genetic heterogeneity.
The purpose of the present study is to compare the effectiveness of MDR with CV to that of MDR with 3WS in situations wherein genetic heterogeneity is present. This is accomplished through simulating genetic data exhibiting heterogeneity and evaluating the success of the two internal model validation methods at identifying the correct underlying models. It is necessary to use simulated data because we must know the true underlying model in order to assess the accuracy of the predicted model and such information is not known with real data.
Methods
Multifactor Dimensionality Reduction (MDR)
MDR is a widely used data mining technique that performs an exhaustive search of all possible genes and combinations of genes to find the best model for a certain genetic trait [23]. It is able to accommodate more complex genetic traits that involve gene-gene and gene-environment interactions [7]. MDR uses combinatorial data reduction techniques to collapse the high dimensions of complex genetic data into just one dimension with two levels (high-risk and low-risk) [7]. MDR is non-parametric, as no assumptions about the underlying statistical distribution or genetic models are made. For the following description, consider a set of genetic data of sample size N (with n_{1} cases and n_{0} controls) for which the genotypes at K loci are known and it is believed that the largest interaction involves k terms.
The first step in the MDR algorithm is to enumerate all possible combinations of k loci. For each combination of loci, the number of cases and controls are counted for every possible combination of genotypes. For genes with two possible alleles, each locus has three possible genotypes, so the data can be classified into 3^{k} genotypic combinations. We will refer to each such combination as a multifactor class. The ratio of cases to controls is calculated for each multifactor class using the sample data, and this value is used to classify each multifactor class as either high-risk or low-risk. In the case of balanced data, meaning data with an equal number of cases and controls, the multifactor classes with a case-to-control ratio exceeding one are considered high-risk while those with a ratio below one are considered low-risk. In general the threshold is n_{1}/n_{0}. This high-risk/low-risk parameterization serves to reduce the high dimensionality of the data.
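The classification step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the R package used in this study: the function names, the 0/1/2 genotype coding, and the handling of ties (a ratio exactly equal to the threshold is labeled low-risk) are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def enumerate_combinations(K, k):
    """All combinations of k loci out of K, as tuples of column indices."""
    return list(combinations(range(K), k))


def classify_multifactor_classes(genotypes, status, loci):
    """Label each observed multifactor class for `loci` as high- or low-risk.

    genotypes: list of per-individual genotype tuples (assumed coded 0, 1, 2)
    status:    list of 1 (case) / 0 (control), same length as genotypes
    loci:      tuple of column indices forming one candidate combination
    Assumes the sample contains at least one case and one control.
    """
    n1 = sum(status)
    n0 = len(status) - n1
    threshold = n1 / n0  # equals 1 for balanced data

    cases, controls = Counter(), Counter()
    for row, s in zip(genotypes, status):
        cell = tuple(row[j] for j in loci)
        if s == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1

    labels = {}
    for cell in set(cases) | set(controls):
        if controls[cell] == 0:
            ratio = float("inf")  # cases but no controls
        else:
            ratio = cases[cell] / controls[cell]
        # Assumption: ties (ratio == threshold) are labeled low-risk.
        labels[cell] = "high" if ratio > threshold else "low"
    return labels


# Toy balanced sample: three cases, three controls, two loci.
toy_genotypes = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 1), (1, 1)]
toy_status = [1, 1, 1, 0, 0, 0]
toy_labels = classify_multifactor_classes(toy_genotypes, toy_status, (0, 1))
```

In the toy sample, the class (0, 0) contains only cases and is labeled high-risk, while the other two observed classes are labeled low-risk.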
Cross-validation (CV)
In m-fold CV, the data are divided into m equal subsets; MDR is fit on m-1 of them, the resulting best model is evaluated on the remaining subset, and the procedure is repeated so that each subset serves once as the testing set. The number of times that a particular model is identified as the best model across the m subsets of the data is known as the cross-validation consistency. The model chosen as the best overall model is the one that has both the highest prediction accuracy and the highest cross-validation consistency. If the model that maximizes prediction accuracy is different from the model that maximizes cross-validation consistency, then the more parsimonious model is chosen [21]. CV is most commonly employed with either five or ten equal splits of the data. It has been shown that five and ten splits yield similar results [17], so this study utilizes five splits of the data to lessen computing time.
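The selection rule above can be illustrated with a short Python sketch. It assumes that the per-fold winning models and their mean prediction accuracies have already been computed; the function names and the tie-breaking fallback when both candidate models have the same size are hypothetical choices made for the example, not part of any published MDR implementation.

```python
from collections import Counter


def cv_consistency(fold_winners):
    """Count how many of the m folds selected each model (tuple of loci)."""
    return Counter(fold_winners)


def choose_final_model(summary):
    """Pick the final model from {model: (mean prediction accuracy, CV consistency)}.

    If one model maximizes both criteria it is returned; otherwise the more
    parsimonious of the two candidates (fewer loci) is returned.
    """
    best_acc = max(summary, key=lambda m: summary[m][0])
    best_cvc = max(summary, key=lambda m: summary[m][1])
    if best_acc == best_cvc:
        return best_acc
    # Assumption: if both candidates have the same size, min() keeps the
    # first argument, i.e. the model with the higher prediction accuracy.
    return min((best_acc, best_cvc), key=len)


# Five folds in which two competing models split the wins.
winners = [(1,), (1,), (1, 2), (1,), (3,)]
cvc = cv_consistency(winners)

summary = {(1,): (0.60, cvc[(1,)]), (1, 2): (0.62, cvc[(1, 2)])}
final = choose_final_model(summary)
```

Here the two-locus model has slightly higher accuracy but the single-locus model has higher consistency, so the more parsimonious single-locus model is returned. When wins are spread across several models, as can happen under heterogeneity, every model ends up with a low consistency count.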
Three-way split (3WS)
3WS is an internal model validation method that has only recently been implemented with MDR. For this procedure, the full dataset is randomly split into three parts: a training set to build initial models, a testing set to narrow the list of potential models, and a validation set to choose the best model and assess its predictive capability. It has been shown that the proportion of the data included in each split does not make a major difference in the resulting model, but the optimal split, and the one we use, is a 2:2:1 ratio [19]. MDR is run using each of these three sets with every possible combination of up to k loci considered with the training set, a subset of these possible combinations considered with the testing set, and only the top few models considered with the validation set. The three splits of the data can be considered independent of one another and balanced accuracy can be calculated for each combination of loci to determine the best model. This method is much more computationally efficient than CV because the MDR algorithm is carried out fewer times and fewer models are considered each time.
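As a rough illustration of the splitting step and of the balanced accuracy measure, the following Python sketch partitions sample indices in a 2:2:1 ratio. The function names, the seeding, and the rounding of set sizes are assumptions for the example; balanced accuracy is taken here as the mean of sensitivity and specificity, following its usual definition [24].

```python
import random


def three_way_split(n, ratio=(2, 2, 1), seed=0):
    """Randomly split indices 0..n-1 into training, testing, and validation sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    total = sum(ratio)
    n_train = n * ratio[0] // total
    n_test = n * ratio[1] // total
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    validate = idx[n_train + n_test:]  # remainder goes to validation
    return train, test, validate


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity for a high-risk/low-risk classifier."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


train, test, validate = three_way_split(1000)
```

With 1000 individuals this gives 400 training, 400 testing, and 200 validation samples. Each candidate combination of loci would then be scored by balanced accuracy within the relevant set, with progressively fewer candidates carried forward at each stage.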
Data simulation
Summary of genetic models simulated
Simulation  Disease loci  Model type  Level of heterogeneity  First model odds ratio  First model contribution  Second model odds ratio  Second model contribution 
1  2  additive  25/75  1.5  25%  1.5  75% 
2  2  additive  25/75  2  25%  2  75% 
3  2  additive  25/75  1.5  25%  2  75% 
4  2  additive  25/75  2  25%  1.5  75% 
5  2  additive  50/50  1.5  50%  1.5  50% 
6  2  additive  50/50  2  50%  2  50% 
7  2  additive  50/50  1.5  50%  2  50% 
8  2  recessive  25/75  1.5  25%  1.5  75% 
9  2  recessive  25/75  2  25%  2  75% 
10  2  recessive  25/75  1.5  25%  2  75% 
11  2  recessive  25/75  2  25%  1.5  75% 
12  2  recessive  50/50  1.5  50%  1.5  50% 
13  2  recessive  50/50  2  50%  2  50% 
14  2  recessive  50/50  1.5  50%  2  50% 
15  4  XOR  25/75  1.5  25%  1.5  75% 
16  4  XOR  25/75  2  25%  2  75% 
17  4  XOR  25/75  1.5  25%  2  75% 
18  4  XOR  25/75  2  25%  1.5  75% 
19  4  XOR  50/50  1.5  50%  1.5  50% 
20  4  XOR  50/50  2  50%  2  50% 
21  4  XOR  50/50  1.5  50%  2  50% 
Analysis
All 100 datasets for each of the 21 simulations were analyzed using MDR with five-fold CV and MDR with 3WS. This was done using the MDR package available for the statistical software R [28, 29]. For MDR with 3WS we used the default split of 2:2:1 (train:test:validate) and a value of x = 25 (the total number of loci in each dataset) to allow 25 models to pass from the training set to the testing set. For both methods, MDR considered models of size k = 1, 2 for the two-locus models and k = 1, 2, 3, 4 for the four-locus models.
We collected the output from these MDR procedures to assess the accuracy of the final models. Power was calculated as the percentage of times out of the 100 datasets for each simulation that the final model met some specified criterion. We initially computed a conservative estimate of power for which this criterion was that the final predicted model included all of the true disease loci and no false positive loci. It was immediately apparent that both methods did a poor job finding the entire correct model. We therefore defined several more liberal types of power to assess how often each method found at least one of the two models included in the heterogeneity model. For the power labeled mod1 a trial was considered a success if at least the locus or loci of the first of the two models contributing to the heterogeneity model was included in the final predicted model. For the power labeled onlymod1 the requirement was that the final predicted model be exactly the first of the two simulated models contributing to the overall model with no additional loci included. The power definitions mod2 and onlymod2 are analogous to mod1 and onlymod1, but for the second of the two models. We also defined a power, labeled nofalse, that considered a trial a success if the predicted model included any number of correct loci and no false positive loci.
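The power definitions above can be expressed as set comparisons. The sketch below is a hypothetical Python rendering, not the code used in this study: loci are represented by arbitrary integer labels, and "any number of correct loci" in the nofalse definition is read as at least one, which is an assumption.

```python
def power_outcomes(predicted, model1, model2):
    """Evaluate one final model against each power criterion."""
    predicted, model1, model2 = set(predicted), set(model1), set(model2)
    true_loci = model1 | model2
    return {
        "conservative": predicted == true_loci,  # all true loci, no false positives
        "mod1": model1 <= predicted,             # first model included
        "onlymod1": predicted == model1,         # exactly the first model
        "mod2": model2 <= predicted,             # second model included
        "onlymod2": predicted == model2,         # exactly the second model
        # Assumption: at least one correct locus is required.
        "nofalse": len(predicted) > 0 and predicted <= true_loci,
    }


def power(final_models, model1, model2):
    """Percentage of datasets whose final model meets each criterion."""
    outcomes = [power_outcomes(p, model1, model2) for p in final_models]
    return {
        name: 100.0 * sum(o[name] for o in outcomes) / len(outcomes)
        for name in outcomes[0]
    }


# Four hypothetical final models; locus 0 is the first model, locus 1 the second.
example = power([{1}, {1}, {0, 1}, {1, 5}], model1={0}, model2={1})
```

For these four hypothetical final models the sketch gives 25% conservative power, 25% for mod1, 0% for onlymod1, 100% for mod2, 50% for onlymod2, and 75% for nofalse.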
Differences between the performances of the two internal model validation methods were tested using an analysis of variance (ANOVA), implemented in SAS v9.2 [30].
Results and discussion
For both the two-locus and four-locus heterogeneity models, MDR implemented with CV tended to outperform MDR implemented with 3WS based on the more liberal definitions of power. Statistical significance (at α = .05) was achieved for mod2 (p-value = .0056), onlymod1 (p-value = .0012), onlymod2 (p-value < .0001), and nofalse (p-value < .0001). The greatest differences in performance were seen with onlymod2 and nofalse, where CV had extremely high power while 3WS had minimal power. The only liberal definition of power that did not see a significant difference was mod1. This lack of significance resulted more from the poor performance of MDR implemented with CV than from the strong performance of MDR implemented with 3WS. Many of the models that needed to be identified to be considered a success for this type of power contributed only 25% to the overall heterogeneity model, so they were extremely hard to detect. While the performance of MDR implemented with CV was about the same for mod1 as for onlymod1, there was a significant difference between CV and 3WS based on onlymod1 because MDR implemented with 3WS almost never identified the first model without including any additional loci.
P-values from the ANOVA of the simulation results
Effect  Conservative  mod1  onlymod1  mod2  onlymod2  nofalse 

internal model validation method  0.1637  0.5136  0.0012  0.0056  < .0001  < .0001 
level of heterogeneity  0.2482  < .0001  0.0003  0.0005  0.001  0.0733 
model type  0.0006  0.0147  0.1672  0.0004  0.0109  0.155 
odds ratio (OR)  0.0444  0.2025  0.7075  0.18  0.3708  0.0003 
In terms of computing time, MDR implemented with 3WS was approximately five times faster than MDR implemented with CV. This is consistent with results published by Winham et al. [19]. The majority of the computation time is spent classifying all possible combinations of loci as either high-risk or low-risk and calculating a balanced accuracy estimate for all these combinations in the training set. This process is done only once with 3WS and five times for five-fold CV, so 3WS is theoretically five times faster than five-fold CV. The difference in efficiency also depends on many other factors such as sample size and the total number of loci [19].
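The factor of five can be illustrated by counting training-set model evaluations. The following Python sketch uses the 25 loci and maximum model sizes from this study; it deliberately ignores differences in training-set size and the cost of the later testing and validation stages, so it is a back-of-the-envelope count rather than a timing model, and the function names are hypothetical.

```python
from math import comb


def n_models(K, max_k):
    """Number of locus combinations of size 1..max_k among K loci."""
    return sum(comb(K, k) for k in range(1, max_k + 1))


def training_evaluations(K, max_k, passes):
    """Training-set evaluations when the exhaustive search is repeated `passes` times."""
    return passes * n_models(K, max_k)


two_locus_cv = training_evaluations(25, 2, passes=5)    # five-fold CV
two_locus_3ws = training_evaluations(25, 2, passes=1)   # single training set
four_locus_cv = training_evaluations(25, 4, passes=5)
four_locus_3ws = training_evaluations(25, 4, passes=1)
```

With 25 loci there are 325 candidate models of size one or two and 15275 of size one to four, so five-fold CV performs 1625 and 76375 training evaluations respectively, against 325 and 15275 for 3WS.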
Conclusion
While MDR implemented with CV has been effective at detecting disease models exhibiting epistasis, it has been shown to have a dramatic decrease in power in the presence of genetic heterogeneity [20, 21]. Recently, an alternative internal model validation method, the 3WS, has been shown to have roughly the same power as CV for detecting standard epistatic models when implemented with MDR [19]. The main conclusion to draw from this study is that MDR implemented with 3WS not only fails to detect disease models exhibiting genetic heterogeneity better than MDR implemented with CV, but by some measures it performs significantly worse. While we recognize that the current study does not provide solutions for improving detection of heterogeneity, we do hope this study provides important practical guidance when choosing an internal model validation approach.
Both 3WS and CV perform extremely poorly in terms of detecting the full heterogeneity model. Neither method did significantly better than the other in this respect, but neither performed well enough to have any practical utility. Looking at more liberal definitions of power, for which it was considered a success if MDR detected one of the two models contributing to the overall genetic heterogeneity model, differences in performance arise. In particular, MDR implemented with CV is significantly better at detecting models that contribute at least 50% to the overall genetic heterogeneity model. There is not, however, a significant difference in the ability of the two methods to detect models that contribute at most 50% to the overall model. This can be attributed primarily to the extremely poor performance of both methods in regard to detecting the less prevalent model.
When the inclusion of false positives into the model predicted by MDR was considered, it was found that MDR implemented with CV is far better than MDR implemented with 3WS at finding exactly one of the two models contributing to the overall genetic heterogeneity model without including any additional loci. The average final model size for MDR implemented with 3WS was about twice that of MDR implemented with CV. This was expected based on previous findings [19] and was one of the main reasons we initially hypothesized that MDR implemented with 3WS would better detect heterogeneity models. Unfortunately, the additional loci included in the final model by MDR implemented with 3WS were not the hard-to-detect disease loci contributing to the heterogeneity model but were instead false positives.
Ultimately, MDR does not appear to be able to effectively detect models exhibiting genetic heterogeneity regardless of the internal model validation method used. Therefore, some other approach must be developed to find this type of model. Ritchie et al. [21] suggested using either cluster analysis or recursive partitioning to confront the challenge presented by genetic heterogeneity. The cluster analysis approach is based on the idea that genetic heterogeneity results from groups of individuals within a population who have different genetic backgrounds. If these groups could be identified prior to looking for associations, then MDR could be run on the groups separately under the assumption that within each group there is only one underlying disease model (and consequently no heterogeneity). Whether using classification trees or cluster analysis, grouping individuals based on a shared genetic background before attempting to identify gene associations seems to be a reasonable direction for further research into finding genetic heterogeneity models. These results highlight the importance of continued development to improve the performance of MDR in the case of heterogeneity, and motivate the use of other approaches if genetic heterogeneity is expected to play a role in the disease etiology.
Notes
Abbreviations
MDR: Multifactor dimensionality reduction
CV: Cross-validation
3WS: Three-way split
Declarations
Acknowledgements
This project was supported by NSFCSUMS project DMS0703392 (PI: Sujit Ghosh).
References
1. Moore JH, Asselbergs FW, Williams SM: Bioinformatics challenges for genome-wide association studies. Bioinformatics. 2010, 26: 445-455. 10.1093/bioinformatics/btp713.
2. Moore JH: The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered. 2003, 56: 73-82. 10.1159/000073735.
3. Moore JH, Williams SM: New strategies for identifying gene-gene interactions in hypertension. Ann Med. 2002, 34: 88-95. 10.1080/07853890252953473.
4. Culverhouse R, Suarez BK, Lin J, Reich T: A perspective on epistasis: limits of models displaying no main effect. Am J Hum Genet. 2002, 70: 461-471. 10.1086/338759.
5. Cordell HJ: Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet. 2009, 10: 392-404.
6. Cantor RM, Lange K, Sinsheimer JS: Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet. 2010, 86: 6-22. 10.1016/j.ajhg.2009.11.017.
7. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69: 138-147. 10.1086/321276.
8. Kam-Thong T, Putz B, Karbalai N, Muller-Myhsok B, Borgwardt K: Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs. Bioinformatics. 2011, 27: i214-i221. 10.1093/bioinformatics/btr218.
9. Marchini J, Donnelly P, Cardon LR: Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005, 37: 413-417. 10.1038/ng1537.
10. Zhang Y, Liu JS: Bayesian inference of epistatic interactions in case-control studies. Nat Genet. 2007, 39: 1167-1173. 10.1038/ng2110.
11. Motsinger-Reif AA, Dudek SM, Hahn LW, Ritchie MD: Comparison of approaches for machine-learning optimization of neural networks for detecting gene-gene interactions in genetic epidemiology. Genet Epidemiol. 2008, 32: 325-340. 10.1002/gepi.20307.
12. Winham SJ, Colby CL, Freimuth RR, Wang X, de Andrade M, Huebner M, Biernacka JM: SNP interaction detection with random forests in high-dimensional genetic data. BMC Bioinforma. 2012, 13: 164. 10.1186/1471-2105-13-164.
13. Motsinger-Reif AA, Reif DM, Fanelli TJ, Ritchie MD: A comparison of analytical methods for genetic association studies. Genet Epidemiol. 2008, 32: 767-778. 10.1002/gepi.20345.
14. Shang J, Zhang J, Sun Y, Liu D, Ye D, Yin Y: Performance analysis of novel methods for detecting epistasis. BMC Bioinforma. 2011, 12: 475. 10.1186/1471-2105-12-475.
15. Wang Y, Liu G, Feng M, Wong L: An empirical comparison of several recent epistatic interaction detection methods. Bioinformatics. 2011, 27: 2936-2943. 10.1093/bioinformatics/btr512.
16. Motsinger-Reif AA, Wood SJ, Oberoi S, Reif DM: EpistasisList.org: A Curated Database of Gene-Gene and Gene-Environment Interactions in Human Epidemiology. 2008, Philadelphia, PA: American Society of Human Genetics.
17. Motsinger AA, Ritchie MD: The effect of reduction in cross-validation intervals on the performance of multifactor dimensionality reduction. Genet Epidemiol. 2006, 30: 546-555. 10.1002/gepi.20166.
18. Moore JH: Detecting, characterizing, and interpreting nonlinear gene-gene interactions using multifactor dimensionality reduction. Adv Genet. 2010, 72: 101-116.
19. Winham SJ, Slater AJ, Motsinger-Reif AA: A comparison of internal validation techniques for multifactor dimensionality reduction. BMC Bioinforma. 2010, 11: 394. 10.1186/1471-2105-11-394.
20. Ritchie MD, Edwards TL, Fanelli TJ, Motsinger AA: Genetic heterogeneity is not as threatening as you might think. Genet Epidemiol. 2007, 31: 797-800. 10.1002/gepi.20256.
21. Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003, 24: 150-157. 10.1002/gepi.10218.
22. Motsinger-Reif AA, Fanelli TJ, Davis AC, Ritchie MD: Power of grammatical evolution neural networks to detect gene-gene interactions in the presence of error. BMC Res Notes. 2008, 1: 65. 10.1186/1756-0500-1-65.
23. Motsinger AA, Ritchie MD: Multifactor dimensionality reduction: an analysis strategy for modelling and detecting gene-gene interactions in human genetics and pharmacogenomics studies. Hum Genomics. 2006, 2: 318-328.
24. Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, Moore JH: A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol. 2007, 31: 306-315. 10.1002/gepi.20211.
25. Li W, Reich J: A complete enumeration and classification of two-locus disease models. Hum Hered. 2000, 50: 334-349. 10.1159/000022939.
26. Dudek SM, Motsinger AA, Velez DR, Williams SM, Ritchie MD: Data simulation software for whole-genome association and other studies in human genetics. Pac Symp Biocomput. 2006, 1: 499-510.
27. Edwards TL, Lewis K, Velez DR, Dudek S, Ritchie MD: Exploring the performance of Multifactor Dimensionality Reduction in large scale SNP studies and in the presence of genetic heterogeneity among epistatic disease models. Hum Hered. 2009, 67: 183-192. 10.1159/000181157.
28. R Development Core Team: R: A language and environment for statistical computing. 2005, Vienna, Austria: R Foundation for Statistical Computing, URL http://www.R-project.org
29. Winham SJ, Motsinger-Reif AA: An R package implementation of multifactor dimensionality reduction. BioData Min. 2011, 4: 24. 10.1186/1756-0381-4-24.
30. SAS Institute Inc: 2004, Cary, NC, http://www.sas.org
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.