AntEpiSeeker: detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm
© Rekaya et al; licensee BioMed Central Ltd. 2010
Received: 1 October 2009
Accepted: 28 April 2010
Published: 28 April 2010
Epistatic interactions of multiple single nucleotide polymorphisms (SNPs) are now believed to affect individual susceptibility to common diseases. The detection of such interactions, however, is a challenging task in large scale association studies. Ant colony optimization (ACO) algorithms have been shown to be useful in detecting epistatic interactions.
AntEpiSeeker, a new two-stage ant colony optimization algorithm, has been developed for detecting epistasis in a case-control design. Based on some practical epistatic models, AntEpiSeeker has performed very well.
AntEpiSeeker is a powerful and efficient tool for large-scale association studies and can be downloaded from http://nce.ads.uga.edu/~romdhane/AntEpiSeeker/index.html.
Genetic association studies, which aim at detecting association between one or more genetic polymorphisms and a trait of interest such as a quantitative characteristic, discrete attribute or disease, have gained a lot of popularity in the past decade . Although great progress in mapping genes responsible for Mendelian traits has been made, the genetic basis underlying many complex diseases remain unknown. It is widely accepted that these diseases may be caused by the joint effects of multiple genetic variations, which may show little effect individually but strong interactions. Such interactive effects of multiple genetic variations are often referred to as epistasis or epistatic interactions . Recently, increasing numbers of studies have suggested the presence of epistatic interactions in complex diseases, e.g. breast cancer , type-2 diabetes  and atrial fibrillation .
A number of multi-locus approaches have been proposed to detect epistatic interactions, such as the combinatorial partitioning method (CPM) , restricted partitioning method (RPM) , the multifactor-dimensionality reduction (MDR) , the focused interaction testing framework (FITF)  and the backward genotype-trait association (BGTA) . Although these methods were tested and showed promising performance on small data sets, the computational burden prohibits their application on large scale datasets.
Typically, a large scale dataset for association studies may have several tens to hundreds of thousands of markers. For example, the genome-wide case-control data set for Age-related Macular Degeneration (AMD) contains more than 100 thousand SNPs genotyped on 96 cases and 50 controls . An exhaustive search of two-locus interactions needs to evaluate at least 5.00 × 109 locus combinations, and this number increases to 1.67 × 1014 when three-locus interactions are considered. Although this process is computationally hard it could be enhanced by two recent approaches: the Bayesian epistasis association mapping (BEAM)  and SNPharvester , which were shown to be able to handle large scale datasets. However, more efficient and accurate methods are still desired.
The solution to this difficult search problem could be achieved using an optimization technique called ant colony optimization (ACO) algorithm. Ant colony algorithms, proposed first by Dorigio and Gambardella , are tools to solve difficult optimization problems such as the traveling salesman problem. ACO simulates how real ant colonies find the shortest route to a food source. Real ant colonies communicate through chemicals called pheromones, which are deposited along the path an ant travels. Ants that choose a shorter path will transverse the distance at a faster rate, resulting in more pheromones deposited along that path. Subsequent ants will then choose the path with more pheromones, thus creating a positive feedback. In ACO, artificial ants work as parallel units that communicate through a probability distribution function (PDF), which is updated by weights or pheromones. The change in pheromones is determined by some type of expert knowledge. As the PDF is updated, "paths" that perform better will be sampled at higher rates by subsequent artificial ants, and in turn deposit more pheromones. Thus, a positive feedback similar to real ant colonies is simulated.
Two recent studies showed the possibility of applying ant colony optimization to association studies [14, 15]. However, the use of MDR for detecting epistatic interactions in these studies dramatically increased the computational burden. Besides, these studies did not test performance using the more practical epistatic models such as the ones proposed by Marchini et al. .
In this study, a new tool named AntEpiSeeker has been developed to search for epistatic interactions in large-scale association studies. The use of χ2 values as score function to measure the association between an SNP set and the phenotype is computationally efficient. The two-stage design of ant colony optimization and the idea of searching bigger SNP sets harboring epistatic interactions enhance the power of ACO algorithms. AntEpiSeeker showed improved performance based on some practical epistatic models and large scale datasets.
The generic ant colony optimization
The ACO has been proven to be a successful technique for numerous non-deterministic polynomial-time hard (NP-hard) combinatorial optimization problems such as the traveling salesman problem, the graph coloring problem, the frequency assignment problem, the quadratic assignment problem, feature selection for microarray classification and the vehicle routing problem [17–22]. ACO has the advantages of a positive feedback, and it lends itself to parallel computing, among other advantages.
where ρ is a constant between 0 and 1 that represents the pheromone evaporation rate; Δτ k (i) is the change in pheromone level for locus k at iteration i, which equals 0.1 χ2 of S m in this study, and is set to zero if locus k ∉ S m . This process is repeated for all artificial ants.
Minimizing false positives
AntEpiSeeker may report all detected epistatic interactions at a p-value threshold. In addition, AntEpiSeeker incorporates a procedure for minimizing false positives, which can be described as:
1) The set of all detected epistatic interactions is denoted by EI all and another null set, holding the epistatic interactions with minimized false positives, is denoted by EI m .
2) Each reported epistatic interaction I i in EI all is attempted to be added into EI m sequentially. If I i does not have any locus overlapping with those of each epistatic interaction in EI m , I i is added to EI m . Otherwise, assuming that the epistatic interaction J j in EI m has at least one locus overlapping with those of I i , determine if the p-value of I i is smaller than that of J j . If so, J j in EI m is replaced by I i . If not, I i is not reported in EI m .
The AntEpiseeker package was written in C++. Before compiling, the GNU Scientific Library (GSL) needs to be installed on the user's computer. A separate parameter file named "parameters.txt" specifies the parameters needed to run the program. The SNP data file should be comma-delimited, with the first row specifying the SNP names. All subsequent rows should contain SNP data for each sample. The SNP data should be coded by 0, 1 and 2. The last column indicates the sample status (0 indicates control and 1 indicates case). There are three output files. "AntEpiSeeker.log" records some intermediate results, "results_maximized.txt" reports all detected epistatic interactions, and the user-specified output file shows the epistatic interactions with minimized false positives. The user specified output file includes the locus name, χ2 value and p-value. The software and its source code are available for download at http://nce.ads.uga.edu/~romdhane/AntEpiSeeker/index.html.
The parameters needed to run AntEpiseeker include iAntCount, iItCountLarge, iItCountSmall, α , iTopModel, iTopLoci, ρ , τ0, largesetsize, smallsetsize, iEpiModel, pvalue, INPFILE, OUTFILE. The parameter "iEpiModel" specifies the number of SNPs in an epistatic interaction. The parameters "largesetsize", "smallsetsize" must be greater than "iEpiModel". For a two-locus interaction model, we suggest largesetsize = 6, smallsetsize = 3, iEpiMode = 2; For a three-locus interaction model, we suggest largesetsize = 6, smallsetsize = 4, iEpiModel = 3. The parameters "iItCountLarge", "iItCountSmall" should be chosen according to the number of SNPs genotyped in the data (Denoted by L). Typically, we suggest iItCountSmall ≥ 0.1 × L and iItCountLarge = 0.5 × iItCountSmall. iAntCount may vary from 500 to 5,000, where larger iAntCount should correspond to larger L. ρ should range from 0.01 to 0.1 for better performance, where smaller L should use larger ρ. The default parameters in the AntEpiSeeker package, used in our most simulation studies, were an optimal setting balanced between ρ and iAntCount, which should work well on medium size datasets (2 × 103 ≤ L ≤ 2 × 104).
Power and computational time evaluation on a simulated data set
In addition, AntEpiSeeker is computationally efficient. In the above simulation study, the average running time of AntEpiSeeker, SNPHarvester and BEAM were 27, 54 and 133 minutes respectively, using a Linux system based on Dual Core AMD Opteron(tm) Processor 275.
False positive rate evaluation on a null simulation
False positive rate of different methods on a null simulation.
False positive rate
P value threshold
Before minimizing false positives
After minimizing false positives
5.5 × 10-6
No positives reported
1.4 × 10-2
3.5 × 10-6
3.0 × 10-6
5.3 × 10-5
No positives reported
1.6 × 10-2
3.0 × 10-5
1.1 × 10-5
6.9 × 10-4
No positives reported
2.0 × 10-2
2.9 × 10-4
3.7 × 10-5
8.4 × 10-3
No positives reported
2.4 × 10-2
2.0 × 10-3
6.6 × 10-5
Evaluation of AntEpiSeeker on a simulated large scale dataset
Performance comparison of different methods on a simulated large-scale dataset.
True positive rate
False discovery rate
AntEpiSeeker with minimized false positives
Results on WTCCC RA data
Some epistatic interactions identified by AntEpiSeeker on WTCCC RA data.
In this paper, we proposed a novel tool (AntEpiSeeker) for the discovery of epistatic interactions in large scale case-control studies. AntEpiSeeker was assessed through comparison with two recent approaches on both simulated and real datasets. AntEpiSeeker, which adopts a two-stage optimization procedure, is a modified algorithm derived from the generic ACO. To demonstrate the advantages of the two-stage optimization, we also compared the performance of AntEpiSeeker with that of the generic ACO. AntEpiSeeker is a continuous research project and may be upgraded in the future.
Availability and requirements
Project name: AntEpiSeeker
Project home page: http://nce.ads.uga.edu/~romdhane/AntEpiSeeker/index.html
Operating system(s): Windows, Linux
Programming language: C++
Other requirements: GNU Scientific Library (GSL) is needed for recompile
License: None for usage
Any restrictions to use by non-academics: None
This study was supported in part by resources and technical expertise from the University of Georgia Research Computing Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer.
- Cordell HJ, Clayton DG: Genetic association studies. Lancet. 2005, 366: 1121-1131. 10.1016/S0140-6736(05)67424-7.PubMedView ArticleGoogle Scholar
- Cordell HJ: Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Genet. 2002, 11: 2463-2468. 10.1093/hmg/11.20.2463.PubMedView ArticleGoogle Scholar
- Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69: 138-147. 10.1086/321276.PubMed CentralPubMedView ArticleGoogle Scholar
- Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS: Multifactor-dimensionality reduction shows a two-locus interaction associated with type 2 diabetes mellitus. Diabetologia. 2004, 47: 549-554. 10.1007/s00125-003-1321-3.PubMedView ArticleGoogle Scholar
- Tsai CT, Lai LP, Lin JL, Chiang FT, Hwang JJ, Ritchie MD, Moore JH, Hsu KL, Tseng CD, Liau CS, Tseng YZ: Renin-angiotensin system gene polymorphisms and atrial fibrillation. Circulation. 2004, 109: 1640-1646. 10.1161/01.CIR.0000124487.36586.26.PubMedView ArticleGoogle Scholar
- Nelson MR, Kardia SL, Ferrell RE, Sing CF: A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001, 11: 458-470. 10.1101/gr.172901.PubMed CentralPubMedView ArticleGoogle Scholar
- Culverhouse R, Klein T, Shannon W: Detecting epistatic interactions contributing to quantitative traits. Genet Epidemiol. 2004, 27: 141-152. 10.1002/gepi.20006.PubMedView ArticleGoogle Scholar
- Millstein J, Conti DV, Gilliland FD, Gauderman WJ: A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet. 2006, 78: 15-27. 10.1086/498850.PubMed CentralPubMedView ArticleGoogle Scholar
- Zeng T, Wang H, Lo SH: Backward genotype-trait association (BGTA) - based dissection of complex traits in case-control design. Hum Hered. 2006, 62: 196-212. 10.1159/000096995.View ArticleGoogle Scholar
- Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST: Complement factor H polymorphism in age-related macular degeneration. Science. 2005, 308: 385-389. 10.1126/science.1109557.PubMed CentralPubMedView ArticleGoogle Scholar
- Zhang Y, Liu JS: Bayesian inference of epistatic interactions in case-control studies. Nat Genet. 2007, 39: 1167-1173. 10.1038/ng2110.PubMedView ArticleGoogle Scholar
- Yang C, He Z, Wan X, Yang Q, Xue H, Yu W: SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics. 2009, 25: 504-511. 10.1093/bioinformatics/btn652.PubMedView ArticleGoogle Scholar
- Dorigio M, Gambardella LM: Ant colonies for the travelling salesman problem. Biosystems. 1997, 43: 73-81. 10.1016/S0303-2647(97)01708-5.View ArticleGoogle Scholar
- Greene CS, White BC, Moore JH: Ant colony optimization for genome-wide genetic analysis. Lect Notes Comput Sci. 2008, 5217/2008: 37-47. full_text.View ArticleGoogle Scholar
- Greene CS, Gilmore JM, Kiralis J, Andrews PC, Moore JH: Optimal use of expert knowledge in ant colony optimization for the analysis of epistasis in human disease. Lect Notes Comput Sci. 2009, 5483/2009: 92-103. full_text.View ArticleGoogle Scholar
- Marchini J, Donnelly P, Cardon LR: Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005, 37: 413-417. 10.1038/ng1537.PubMedView ArticleGoogle Scholar
- Talbi EG, Roux O, Fonlupt C, Robillard D: Parallel Ant Colonies for the quadratic assignment problem. Future Generation Computer System. 2001, 17: 441-449. 10.1016/S0167-739X(99)00124-7.View ArticleGoogle Scholar
- Maniezzo V, Carbonaro A: An ANTS heuristic for the frequency assignment problem. Future Generation Computer Systems. 2000, 16: 927-935. 10.1016/S0167-739X(00)00046-7.View ArticleGoogle Scholar
- McGraw-Hill, Dorigo M, Di Caro G: New Ideas in Optimization. New Ideas in Optimization. Edited by: Corne D, Dorigo M, Glover F. 1999, McGraw-HillGoogle Scholar
- Dorigo M, Di Caro G, Gambardella LM: Ant Algorithms for Discrete Optimization. Artificial Life. 1999, 5: 137-172. 10.1162/106454699568728.PubMedView ArticleGoogle Scholar
- Robbins KR, Zhang W, Bertrand JK, Rekaya R: The ant colony algorithm for feature selection in high-dimension gene expression data for disease classification. Math Med Bio. 2007, 24: 413-26. 10.1093/imammb/dqn001.View ArticleGoogle Scholar
- The MIT Press, Dorigo M, Stützle T: Ant Colony Optimization. 2004, The MIT PressView ArticleGoogle Scholar
- Jiang R, Tang W, Wu X, Fu W: A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinformatics. 2009, 10 (Suppl 1): S65-10.1186/1471-2105-10-S1-S65.PubMed CentralPubMedView ArticleGoogle Scholar
- Wan X, Yang C, Yang Q, Xue H, Tang NL, Yu W: MegaSNPHunter: a learning approach to detect disease predisposition SNPs and high level interactions in genome wide association study. BMC Bioinformatics. 2009, 10: 13-10.1186/1471-2105-10-13.PubMed CentralPubMedView ArticleGoogle Scholar
- The International HapMap Consortium: The International HapMap Project. Nature. 2003, 426: 789-796. 10.1038/nature02168.View ArticleGoogle Scholar
- Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661-678. 10.1038/nature05911.View ArticleGoogle Scholar
- Plenge RM, Cotsapas C, Davies L, Price AL, de Bakker PI, Maller J, Pe'er I, Burtt NP, Blumenstiel B, DeFelice M, Parkin M, Barry R, Winslow W, Healy C, Graham RR, Neale BM, Izmailova E, Roubenoff R, Parker AN, Glass R, Karlson EW, Maher N, Hafler DA, Lee DM, Seldin MF, Remmers EF, Lee AT, Padyukov L, Alfredsson L, Coblyn J, Weinblatt ME, Gabriel SB, Purcell S, Klareskog L, Gregersen PK, Shadick NA, Daly MJ, Altshuler D: Two independent alleles at 6q23 associated with risk of rheumatoid arthritis. Nat Genet. 2007, 39: 1477-1482. 10.1038/ng.2007.27.PubMed CentralPubMedView ArticleGoogle Scholar
- Raychaudhuri S, Remmers EF, Lee AT, Hackett R, Guiducci C, Burtt NP, Gianniny L, Korman BD, Padyukov L, Kurreeman FA, Chang M, Catanese JJ, Ding B, Wong S, Helm-van Mil van der AH, Neale BM, Coblyn J, Cui J, Tak PP, Wolbink GJ, Crusius JB, Horst-Bruinsma van der IE, Criswell LA, Amos CI, Seldin MF, Kastner DL, Ardlie KG, Alfredsson L, Costenbader KH, Altshuler D, Huizinga TW, Shadick NA, Weinblatt ME, de Vries N, Worthington J, Seielstad M, Toes RE, Karlson EW, Begovich AB, Klareskog L, Gregersen PK, Daly MJ, Plenge RM: Common variants at CD40 and other loci confer risk of rheumatoid arthritis. Nat Genet. 2008, 40: 1216-1223. 10.1038/ng.233.PubMed CentralPubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.