Background
The field of human genetics is currently experiencing an explosion of genetic data as genotyping technology becomes less expensive and more accessible. This creates an analytical challenge for genetic epidemiologists, and the challenge is exacerbated in the case of complex diseases, where the phenotype is likely the result of many genetic and environmental factors[1, 2]. The limitations of traditional parametric statistical tools in searching for such interactions motivate the development of novel computational methods[2–4].
Recently, our group proposed the Grammatical Evolution Neural Network (GENN), a machine-learning approach designed to detect gene-gene interactions in the presence or absence of marginal main effects[5, 6]. GENN performs both variable selection and statistical modelling without the computational burden of exhaustively searching all possible variable combinations. GENN uses an evolutionary computation algorithm (grammatical evolution) to build neural networks (NN). NN are inspired by the parallel processing of the human brain and are used as non-linear statistical data modelling tools to model complex relationships between inputs and outputs or to find patterns in data[7].
GENN has been shown to have high power to detect interactions in a range of empirical studies with both real and simulated data[5, 6, 8, 9]. Compared to other NN strategies, GENN has demonstrated significantly improved performance on large datasets, and it has been shown to scale efficiently as datasets grow[6].
In the current study, we assess the robustness of GENN to several common types of noise. Data originally simulated for Ritchie et al., 2003[10] were used to examine the impact of genotyping error (GE), missing data (MS), phenocopy (PC), and genetic heterogeneity (GH) in six different gene-gene interaction models. Additionally, we examine the impact of all possible combinations of these types of noise to detect potentially synergistic effects. Finally, we compare the performance of GENN to that of another method designed to detect interactions, Multifactor Dimensionality Reduction (MDR)[11].
Grammatical Evolution Neural Networks (GENN)
GENN methodology and software have been previously described[5, 6]. The steps of GENN are shown in Figure 1. Grammatical evolution is a variation on genetic programming that uses a Backus-Naur Form grammar to create a computer program by means of a genetic algorithm[12]. Each chromosome in the genetic algorithm is an array of bits that encodes definitions from the grammar (a set of rules used to construct computer programs, NN in this case). Each chromosome is mapped to a program through the grammar, the program is executed, and its fitness is recorded. The genetic algorithm evolves chromosomes until an optimal solution is found, using balanced classification error as the fitness function (lower error represents higher fitness). GENN automatically selects inputs from the pool of variables, optimizes synaptic weights, and evolves the architecture of the network, selecting an appropriate network architecture for each dataset.
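To make the chromosome-to-program mapping concrete, the sketch below decodes an integer codon array through a toy Backus-Naur Form grammar. The grammar, codon values, and wrap-around rule are illustrative assumptions; the actual GENN grammar builds full neural networks rather than the simple expressions shown here.

```python
from itertools import cycle

# Toy BNF grammar: each non-terminal maps to a list of candidate productions.
# This is an illustrative stand-in for the (much larger) GENN grammar.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>":   [["+"], ["*"]],
    "<var>":  [["x1"], ["x2"], ["x3"]],
}

def map_codons(codons, symbol="<expr>", depth=8):
    """Expand `symbol` by consuming codons: choice = codon % #productions."""
    if symbol not in GRAMMAR:                  # terminal symbol: emit as-is
        return [symbol]
    productions = GRAMMAR[symbol]
    if depth == 0:                             # depth guard: force the
        choice = productions[-1]               # non-recursive production
    else:
        choice = productions[next(codons) % len(productions)]
    out = []
    for s in choice:
        out.extend(map_codons(codons, s, depth - 1))
    return out

# Wrapping the codon list (standard in grammatical evolution) guarantees
# the mapping never runs out of codons mid-expansion.
expression = map_codons(cycle([2, 7, 0, 4, 1, 3, 5, 9]))
print(" ".join(expression))                    # e.g. "x1 + x1"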
When a genotype is missing, the algorithm excludes that observation from the calculation of classification error. Only the particular missing instance is ignored, not all data for the entire individual or the entire locus. Configuration parameters used in the current analyses were: 10 demes, migration every 25 generations, population size of 200 per deme, maximum of 200 generations, crossover rate of 0.9, standard two-point crossover, tournament selection, and a reproduction rate of 0.1[8].
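As an illustration of this fitness calculation, the following sketch computes balanced classification error while skipping observations whose prediction is unavailable because of a missing genotype. The encoding of such observations as None is an assumption made for the example, not a detail of the GENN software.

```python
def balanced_error(predictions, labels):
    """Balanced classification error: mean of case and control error rates."""
    errs = {0: [0, 0], 1: [0, 0]}            # label -> [misclassified, total]
    for pred, label in zip(predictions, labels):
        if pred is None:                     # missing genotype: skip only
            continue                         # this observation
        errs[label][0] += int(pred != label)
        errs[label][1] += 1
    rates = [miss / total for miss, total in errs.values() if total > 0]
    return sum(rates) / len(rates)

# Example: the third observation has a missing genotype and is skipped.
preds  = [1, 0, None, 1, 0, 1]
labels = [1, 0, 0,    0, 0, 1]
print(balanced_error(preds, labels))         # (1/3 + 0/2) / 2 = 0.1667
```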
Multifactor Dimensionality Reduction (MDR)
The steps of MDR, and details of the MDR analyses presented here, have been previously described[11]. Briefly, in the first step, the data are divided into a training set and an independent testing set for cross-validation. Second, a set of n genetic and/or environmental factors is selected, and these factors and their possible multifactor classes are represented in n-dimensional space. Then the ratio of cases to controls is calculated within each multifactor class. Each multifactor class is then labelled "high risk" or "low risk" based on this ratio, reducing the n-dimensional space to one dimension with two levels. The collection of these multifactor classes composes the MDR model for a particular combination of factors. For each possible model size (one-locus, two-locus, etc.), a single MDR model is chosen that has the lowest number of misclassified individuals. To evaluate the predictive ability of the model, prediction error is calculated using 10-fold cross-validation. The result is a set of models, one for each model size considered. From these models, a final model is chosen based on minimization of prediction error and maximization of cross-validation consistency (the number of times a particular set of factors is identified across the cross-validation subsets).
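The core dimensionality-reduction step, labelling each multifactor cell by its case:control ratio, can be sketched as follows. The 0/1/2 genotype coding and the threshold of 1.0 (the overall case:control ratio in a balanced design) are illustrative assumptions.

```python
from collections import defaultdict

def mdr_labels(genotypes, status, loci, threshold=1.0):
    """Label each multifactor genotype cell 'high risk' or 'low risk'."""
    counts = defaultdict(lambda: [0, 0])      # cell -> [cases, controls]
    for row, affected in zip(genotypes, status):
        cell = tuple(row[l] for l in loci)    # n-locus genotype combination
        counts[cell][0 if affected else 1] += 1
    labels = {}
    for cell, (cases, controls) in counts.items():
        ratio = cases / controls if controls else float("inf")
        labels[cell] = "high risk" if ratio > threshold else "low risk"
    return labels

# Example: evaluating a two-locus model over SNPs 0 and 3.
genos  = [[0, 1, 2, 1], [2, 0, 1, 1], [0, 1, 2, 1], [2, 0, 1, 0]]
status = [1, 0, 1, 0]                         # 1 = case, 0 = control
print(mdr_labels(genos, status, loci=(0, 3)))
```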
Data Simulations
Datasets were originally generated and described for Ritchie et al., 2003[10]. Briefly, case-control data were generated under six different two-locus epistatic models, where the functional variables are single-nucleotide polymorphisms (SNPs). Data were generated using penetrance functions (shown in Figure 2), which specify a risk of disease for each genotype combination, using software described in[13]. Each dataset comprises 200 cases and 200 controls, with a total of 10 biallelic SNPs each.
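To illustrate how a penetrance function drives the simulation, a minimal rejection-sampling sketch is given below. The penetrance table, allele frequency, and Hardy-Weinberg genotype draw are placeholders, not the actual models of Figure 2 or the software of [13].

```python
import random

# Placeholder 3x3 penetrance table: PENETRANCE[g1][g2] is the disease risk
# for the genotype combination at the two functional SNPs (an XOR-like
# epistatic pattern, purely for illustration).
PENETRANCE = [[0.0, 0.1, 0.0],
              [0.1, 0.0, 0.1],
              [0.0, 0.1, 0.0]]

def draw_genotype(p=0.5):
    """Number of minor alleles (0/1/2) under Hardy-Weinberg proportions."""
    return sum(random.random() < p for _ in range(2))

def simulate(n_cases=200, n_controls=200, n_snps=10):
    """Sample individuals until both case and control quotas are filled."""
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        snps = [draw_genotype() for _ in range(n_snps)]
        affected = random.random() < PENETRANCE[snps[0]][snps[1]]
        group, quota = (cases, n_cases) if affected else (controls, n_controls)
        if len(group) < quota:
            group.append(snps)
    return cases, controls

cases, controls = simulate()
print(len(cases), len(controls))              # 200 200
```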
Each model was simulated in the presence and absence of common sources of noise: 5% genotyping error (GE), 5% missing data (MS), 50% phenocopy (PC), and 50% genetic heterogeneity (GH). For each model, 100 datasets were generated in the absence of any type of noise, 100 datasets were generated for each single type of noise, and 100 datasets were generated for each 2-, 3-, or 4-way combination of the sources of noise. With four noise sources each present or absent, this yields 2^4 = 16 noise conditions per model, and thus 96 sets of 100 datasets across the six models.
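The count of 96 dataset sets can be verified directly, as in the brief enumeration below.

```python
from itertools import combinations

noise_sources = ["GE", "MS", "PC", "GH"]
# All subsets of the four noise sources: no noise, singles, and 2/3/4-way.
conditions = [c for r in range(5) for c in combinations(noise_sources, r)]
print(len(conditions))          # 16 noise conditions per model
print(len(conditions) * 6)      # 96 sets of 100 datasets across six models
```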
As previously described[10], GE was simulated using a directed-error model[14], so that 5% of genotypes were biased towards one allele. MS was simulated by randomly removing 5% of genotype information. PC was simulated so that 50% of cases had genotypes corresponding to low-risk profiles according to the penetrance function of the corresponding epistasis model; this represents case status caused by a random environmental effect. GH was simulated by using a second penetrance function to define 50% of the cases, so that two different two-locus models predicted disease risk.
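A sketch of how these four noise sources could be layered onto a simulated dataset follows. The directed-error rule (shifting a genotype one step towards one allele), the None encoding for missing calls, and the pre-built low-risk and second-model genotype pools are simplifying assumptions, not the behaviour of the simulation software cited above.

```python
import random

def add_genotyping_error(genotypes, rate=0.05):
    """Directed error: bias ~5% of genotype calls towards one allele (0)."""
    for row in genotypes:
        for i in range(len(row)):
            if random.random() < rate:
                row[i] = max(row[i] - 1, 0)   # shift one step toward allele 0

def add_missing_data(genotypes, rate=0.05):
    """Randomly remove ~5% of genotype calls (encoded here as None)."""
    for row in genotypes:
        for i in range(len(row)):
            if random.random() < rate:
                row[i] = None

def add_phenocopies(cases, low_risk_pool, fraction=0.5):
    """Replace half the cases with low-risk profiles (environmental cases)."""
    k = int(len(cases) * fraction)
    cases[:k] = random.sample(low_risk_pool, k)

def add_heterogeneity(cases, second_model_cases, fraction=0.5):
    """Draw half the cases from a second two-locus disease model."""
    k = int(len(cases) * fraction)
    cases[:k] = random.sample(second_model_cases, k)
```

In a combined-noise condition these functions would simply be applied in sequence, with genotyping error applied before missing data so that error is never applied to a removed call.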