# Data mining methods in the prediction of Dementia: A real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests

João Maroco^{1}, Dina Silva^{2}, Ana Rodrigues^{3}, Manuela Guerreiro^{2}, Isabel Santana^{3} and Alexandre de Mendonça^{2}

**4**:299

**DOI: **10.1186/1756-0500-4-299

© Maroco et al; licensee BioMed Central Ltd. 2011

**Received: **19 March 2011

**Accepted: **17 August 2011

**Published: **17 August 2011

## Abstract

### Background

Dementia and cognitive impairment associated with aging are a major medical and social concern. Neuropsychological testing is a key element in the diagnostic procedures of Mild Cognitive Impairment (MCI), but currently has limited value in predicting progression to dementia. We advance the hypothesis that newer statistical classification methods derived from data mining and machine learning, such as Neural Networks, Support Vector Machines and Random Forests, can improve the accuracy, sensitivity and specificity of predictions obtained from neuropsychological testing. Seven nonparametric classifiers derived from data mining methods (Multilayer Perceptron Neural Networks, Radial Basis Function Neural Networks, Support Vector Machines, CART, CHAID and QUEST Classification Trees and Random Forests) were compared to three traditional classifiers (Linear Discriminant Analysis, Quadratic Discriminant Analysis and Logistic Regression) in terms of overall classification accuracy, specificity, sensitivity, area under the ROC curve and Press' Q. Model predictors were 10 neuropsychological tests currently used in the diagnosis of dementia. Statistical distributions of classification parameters obtained from a 5-fold cross-validation were compared using Friedman's nonparametric test.

### Results

Press' Q test showed that all classifiers performed better than chance alone (p < 0.05). Support Vector Machines showed the largest overall classification accuracy (Median (Me) = 0.76) and area under the ROC (Me = 0.90). However, this method showed high specificity (Me = 1.0) but low sensitivity (Me = 0.3). Random Forests ranked second in overall accuracy (Me = 0.73), with high area under the ROC (Me = 0.73), specificity (Me = 0.73) and sensitivity (Me = 0.64). Linear Discriminant Analysis also showed acceptable overall accuracy (Me = 0.66), with acceptable area under the ROC (Me = 0.72), specificity (Me = 0.66) and sensitivity (Me = 0.64). The remaining classifiers showed overall classification accuracy above a median value of 0.63, but for most of them sensitivity was around or even below a median value of 0.5.

### Conclusions

When sensitivity, specificity and overall classification accuracy are taken into account, Random Forests and Linear Discriminant Analysis rank first among all the classifiers tested in the prediction of dementia using several neuropsychological tests. These methods may be used to improve the accuracy, sensitivity and specificity of dementia predictions from neuropsychological testing.

## Background

It is estimated that about 25 million people suffer from dementia nowadays and, as a consequence of the population aging, the number of people affected is expected to double every 20 years [1]. The presence of cognitive complaints is very common in aged people and may be the first sign of an on-going dementing disorder like Alzheimer's disease. It is possible to identify people with cognitive complaints who are at risk for the progression to dementia, that is to say, who have Mild Cognitive Impairment (MCI) [2, 3]. Since the establishment of MCI requires the demonstration of cognitive decline greater than expected for an individual's age and education level, neuropsychological testing is a key element in the diagnostic procedures [4].

Recently, it has become possible to identify the traces, or biomarkers, of Alzheimer's disease in patients with MCI, through Magnetic Resonance Imaging (MRI) volumetric studies, neurochemical analysis of the cerebrospinal fluid, and Positron Emission Tomography (PET) scans [5]. These studies, however, are expensive, technically challenging, in some cases invasive, and not widely available. Longitudinal studies assessing the predictive value of neuropsychological tests in the progression of MCI patients to dementia have shown an area under the receiver operating characteristic curve of 61-94% (higher for tests assessing verbal episodic memory) but with lower accuracy and sensitivity values [6–11]. It would be important to improve the value of neuropsychological tests in predicting the progression of MCI patients to dementia. This can be achieved at a clinical level by increasing the number of patients with longer clinical follow-ups. The predictive power of these tests may also be enhanced through innovative statistical classification and data mining techniques. Traditional statistical classification methods (e.g., Fisher's Linear Discriminant Analysis (LDA) and Logistic Regression (LR)) have been extensively used in medical classification problems for which the criterion variable is dichotomous [12–18]. More recently, research has been steadily building on the accuracy and efficiency of data mining, with classifiers like Neural Networks (NN), Support Vector Machines (SVM), Classification Trees (CT) and Random Forests (RF) used for medical prediction and classification tasks [13, 14, 19–27]. Research on the comparative accuracy of traditional classifiers (LDA and LR) vs. newer, computer-intensive data mining methods, which require large computing power, innovative iterative algorithms and user intervention, has been growing steadily. 
Several authors propose that data mining classifiers have higher accuracy and lower error rates than the traditional classification methods [22, 25, 28, 29]. However, this superiority is not apparent with all data sets, especially with real data [12, 13, 30–32]. Results regarding the superiority of classification accuracy of newer classification methods as compared to traditional, less computer demanding methods, as well as the stability of the findings are still controversial [31, 33–35]. Most comparisons between methods are based only on total classification accuracy and/or error rates; they involve human intervention for training and optimization of the data mining classifiers vs. out-of-the-box results for the traditional classifiers. Furthermore, in medical contexts, sensitivity (the ability to predict the condition when the condition is present), specificity (the ability to predict the absence of the condition when the condition is not present) as well as the classifier discriminant power (as estimated from the area under the Receiver Operating Characteristic (ROC) curve) are key features that must be considered when comparing classifiers and diagnostic methods.

In this paper we evaluated the sensitivity, specificity, overall classification accuracy, area under the ROC and Press' Q of data mining classifiers like Neural Networks (Multilayer Perceptrons and Radial Basis Networks), Support Vector Machines, Classification Trees and Random Forests as compared to the traditional Linear, Quadratic Discriminant Analysis and Logistic Regression in the prediction of the evolution into dementia of 400 elderly people with Mild Cognitive Impairment.

## Methods

### Classifiers

#### Discriminant Analysis

Linear Discriminant Analysis (LDA) builds *j* = min(*k*-1, *p*) discriminant functions that estimate discriminant scores (*D*_{ji}) for each of *i* = 1,..., *n* subjects classified into *k* groups, as weighted linear combinations of *p* linearly independent predictor variables (**X**). The discriminant weights (*w*_{ij}) are estimated by ordinary least squares so that the ratio of the variance within the *k* groups to the variance between the *k* groups is minimal. Linear classification functions for each of the *j* = 1,..., *k* groups can therefore be constructed from the discriminant scores. The coefficients of the classification function for the *j*th group are estimated from the within sum of squares matrices (**W**) of the discriminant scores for each group and from the vector of the *p* discriminant predictors' means in each of the classifying groups (**M**) as $\mathbf{C}_{j} = \mathbf{W}^{-1}\mathbf{M}$, with constant ${c}_{jo} = \log p - \frac{1}{2}{C}_{j}{M}_{j}$. Quadratic Discriminant Analysis (QDA) uses the same within vs. between sum of squares optimization, but on a quadratic rather than a linear discriminant function.

In both LDA and QDA, a subject is then classified into the group for which its classification function score is highest [for a detailed description of LDA and QDA see [37]].
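As an illustration of the classification rule described above, the following Python sketch computes **C**_{j} = **W**^{-1}**M**_{j} and c_{jo} = log p - ½ **C**_{j}**M**_{j} for two groups and two predictors, and assigns a subject to the group with the highest classification score. It is only a sketch under stated assumptions: the matrix `W`, the group means and the two test subjects are hypothetical values (not data from this study), `W` is treated as a pooled within-group covariance matrix, and *p* is read as the a priori probability of each group.

```python
import math

def invert_2x2(matrix):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def classification_function(w_inv, mean, prior):
    """Return (constant, coefficients) of one group's classification function."""
    # C_j = W^{-1} M_j
    coef = [sum(w_inv[r][k] * mean[k] for k in range(len(mean)))
            for r in range(len(mean))]
    # c_j0 = log p - 1/2 * C_j M_j
    const = math.log(prior) - 0.5 * sum(c * m for c, m in zip(coef, mean))
    return const, coef

def classify(x, functions):
    """Assign x to the group with the highest classification score."""
    scores = {
        group: const + sum(c * v for c, v in zip(coef, x))
        for group, (const, coef) in functions.items()
    }
    return max(scores, key=scores.get)

# Hypothetical values, for illustration only.
W = [[2.0, 0.5], [0.5, 1.0]]
means = {"MCI": [10.0, 6.0], "Dementia": [7.0, 4.0]}

w_inv = invert_2x2(W)
# Equal a priori probabilities, as used for LDA in this study.
functions = {g: classification_function(w_inv, m, 0.5) for g, m in means.items()}

print(classify([9.5, 6.2], functions))
print(classify([7.2, 3.9], functions))
```

With equal priors the rule reduces to assigning each subject to the group whose mean is closest in the metric defined by `W`, so the first hypothetical subject falls in the "MCI" group and the second in the "Dementia" group.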

#### Logistic Regression

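Logistic Regression (LR) models the probability of success (*π*_{i}) of each subject through the logit, ln[*π*_{i}/(1 - *π*_{i})], expressed as a linear function of the predictors. From the estimated regression coefficients, the probability of success for each subject is then obtained with the logistic function.

If the estimated probability is greater than 0.5 (or another user pre-defined threshold value), the subject is classified into the success group; otherwise, it is classified into the failure group [for a detailed description see [38]].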
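The thresholding rule can be made concrete with a short Python sketch. It assumes the usual logistic form, in which the probability of success is 1/(1 + exp(-(b0 + b1·x1 + ... + bp·xp))); the names `intercept` and `coefficients` and the numeric values are hypothetical and are not estimates from this study.

```python
import math

def success_probability(intercept, coefficients, predictors):
    """Logistic function applied to a linear combination of the predictors."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-linear))

def classify(probability, threshold=0.5):
    """Classify into 'success' only when the probability exceeds the threshold."""
    return "success" if probability > threshold else "failure"

# Hypothetical coefficients and predictor values, for illustration only.
p_hat = success_probability(-1.0, [0.8, -0.3], [2.0, 1.0])
print(round(p_hat, 3), classify(p_hat))
```

Lowering `threshold` below 0.5 trades specificity for sensitivity, which is one reason the choice of threshold matters when the group of interest is the smaller one.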

#### Neural Networks

For a criterion *y*_{k} with *k* classes, the NN can be described by a general model in which **x** is the vector of *p* predictors, **w** is the vector of input weights, **o** is the vector of hidden weights for the hidden layer, and **x**_{0} and **o**_{0k} are bias (memory) constants. The functions *g*(.) and *f*(.) are the processing activation functions for the hidden layer and output layer, respectively. Activation functions belong to one of the general linear, logistic, exponential or Gaussian function families. Several topologies of Neural Networks (NN) can be used in binary classification problems. Two of the most used NN are the Multilayer Perceptron (MLP) and the Radial Basis Function (RBF). The main differences between these two NN reside in the activation functions of the hidden layer: for the MLP the activation function belongs, generally, to a linear

The vector of weights (**w**) of the NN is updated in each iteration so as to maximize the correct classification rate and/or minimize a function of the classification errors, for example a function of the sum of squares of the errors for a continuous criterion [for a detailed description of NN see [40]].
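A forward pass through a network of this kind can be sketched in a few lines of Python. The sketch assumes a single hidden layer, a hyperbolic tangent for the hidden activation *g*(.) and a Softmax for the output activation *f*(.), which matches the MLP settings reported later in this paper; all weights and inputs below are hypothetical, and training (the iterative update of the weights) is not shown.

```python
import math

def softmax(values):
    """Convert output scores into class probabilities that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, input_weights, hidden_bias, hidden_weights, output_bias):
    """One forward pass: predictors -> hidden layer g(.) -> output layer f(.)."""
    # Hidden layer: g(bias + w'x) with g = tanh
    hidden = [
        math.tanh(bias + sum(w * v for w, v in zip(row, x)))
        for row, bias in zip(input_weights, hidden_bias)
    ]
    # Output layer: f(bias + o'h) with f = softmax over the classes
    scores = [
        bias + sum(o * h for o, h in zip(row, hidden))
        for row, bias in zip(hidden_weights, output_bias)
    ]
    return softmax(scores)

# Hypothetical network: 2 predictors, 3 hidden neurons, 2 classes.
x = [0.5, -1.0]
input_weights = [[0.2, -0.4], [0.7, 0.1], [-0.3, 0.8]]
hidden_bias = [0.0, 0.1, -0.1]
hidden_weights = [[0.5, -0.2, 0.3], [-0.5, 0.2, -0.3]]
output_bias = [0.0, 0.0]

probabilities = forward(x, input_weights, hidden_bias, hidden_weights, output_bias)
print(probabilities)
```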

#### Support Vector Machines

Support Vector Machines (SVM) build a separating hyperplane from a vector **x** of predictors mapped into a higher-dimensional feature space by a nonlinear feature function *ϕ*, a vector **w** of weights and a bias offset *b*, which classifies each observation *y*_{i} into one of the two groups {-1; +1} [41]. The classification function is then the sign of **w**'*ϕ*(**x**) + *b*, with two support planes defined by **w**'*ϕ*(**x**) + *b* ≥ +1 for the {+1} group and **w**'*ϕ*(**x**) + *b* ≤ -1 for the {-1} group. These support planes are pushed apart until they bump into a small number of observations or training patterns that respect the above constraints and thus are called support vectors. Figure 2 illustrates this concept. The classification goal can be achieved by maximizing the distance or margin of separation *r* between the two planes **w**'*ϕ*(**x**) + *b* = +1 and **w**'*ϕ*(**x**) + *b* = -1, given by *r* = 2/||**w**||. This is equivalent to minimizing the cost function

$$\frac{1}{2}\|\mathbf{w}\|^{2} + c\sum_{i=1}^{n}\xi_{i} \quad \text{subject to} \quad y_{i}\left(\mathbf{w}'\phi(\mathbf{x}_{i}) + b\right) \ge 1 - \xi_{i},\quad \xi_{i} \ge 0$$

where *c* > 0 is a penalty parameter that balances classification errors vs. the complexity of the model, which is controlled by the margin of separation, and *ξ*_{i} is the so-called slack variable. This variable is the penalty of a misclassified observation that controls how far on the wrong side of the hyperplane a point can lie when the training data cannot be classified without error, that is, when the objects are not linearly separable and a soft separating nonlinear margin is required [41, 42]. Because the feature space can be infinite, the nonlinear mapping by the feature function *ϕ* is computed through special nonlinear positive semi-definite functions K called kernels (Ivanciuc, 2007). The classification function can then be written as

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{n}\alpha_{i}\,y_{i}\,K(\mathbf{x}_{i},\mathbf{x}) + b\right)$$

where *α*_{i} (*i* = 1,..., *n*) are nonnegative Lagrange multipliers and K(.) is a kernel function. In classification problems (c-SVM) the usual kernel functions are the linear kernel K(**x**_{i}, **x**_{j}) = **x**_{i}'**x**_{j} or the Gaussian kernel K(**x**_{i}, **x**_{j}) = exp(-*γ*||**x**_{i} - **x**_{j}||^{2}), where γ is the kernel parameter. The use of kernel functions has the advantage of operating in the original input variables, where the solution of the classification problem is a weighted sum of kernels evaluated at the support vectors [for a complete description of SVM see [28, 41, 43]].
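To make the idea of a "weighted sum of kernels evaluated at the support vectors" concrete, the Python sketch below evaluates a Gaussian kernel and the sign of the resulting kernel expansion. The support vectors, labels, multipliers (`alphas`), `bias` and `gamma` are hypothetical values chosen by hand; in practice they would come from solving the SVM optimization problem, which is not shown here.

```python
import math

def gaussian_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    squared_distance = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-gamma * squared_distance)

def decision(x, support_vectors, labels, alphas, bias, gamma):
    """Sign of the weighted sum of kernels evaluated at the support vectors."""
    total = sum(
        alpha * label * gaussian_kernel(sv, x, gamma)
        for sv, label, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if total + bias >= 0 else -1

# Hypothetical support vectors and multipliers, for illustration only.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
labels = [-1, +1]
alphas = [1.0, 1.0]

print(decision([1.8, 2.1], support_vectors, labels, alphas, bias=0.0, gamma=0.5))
print(decision([0.1, -0.2], support_vectors, labels, alphas, bias=0.0, gamma=0.5))
```

A larger `gamma` makes each support vector's influence more local, which is why γ is tuned together with the cost parameter *c*.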

#### Classification Trees

Classification trees split the data recursively, at each *t* branch of the tree, until all data points are classified into *C* mutually exclusive classes. The impurity measure of choice in CART is the Gini impurity index, defined as

$$g(t) = 1 - \sum_{c=1}^{C} P(c|t)^{2}$$

where *P*(*c*|*t*) is the conditional probability of a class *c* given the node *t*. This probability is estimated from π(*c*), the probability of observing the group *c*, and *n*_{c}(*t*), the number of elements in group *c* at a given node *t*. The tree is grown until no further predictors can be used or the impurity of each group at a final branch of the tree cannot be reduced further. Non-significant predictors (branches) can be pruned from the final tree and removed from the analysis.
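The Gini criterion can be illustrated with a small Python sketch. It assumes the common form of the index, one minus the sum of squared class proportions at a node, and estimates those proportions directly from the class counts (that is, it assumes priors equal to the observed frequencies). The root-node counts are the group sizes reported for this sample; the candidate split is hypothetical.

```python
def gini_impurity(class_counts):
    """Gini index of a node: 1 - sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in class_counts)

def impurity_decrease(parent, left, right):
    """Reduction in Gini impurity obtained by splitting parent into two children."""
    n_parent = sum(parent)
    weighted_children = (
        sum(left) / n_parent * gini_impurity(left)
        + sum(right) / n_parent * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children

# Root node: 275 MCI and 125 Dementia cases (group sizes of this sample).
root = [275, 125]
# Hypothetical split of the root into two child nodes.
left, right = [200, 40], [75, 85]

print(gini_impurity(root))
print(impurity_decrease(root, left, right))
```

A split is attractive when the decrease is large; a decrease of zero means the children have the same class mix as the parent.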

In CHAID trees, splits are selected from the *p-value* obtained from the chi-square statistic applied to two-way classification tables with *C* classes and *K* splits for each tree node:

$$X^{2} = \sum_{c=1}^{C}\sum_{k=1}^{K}\frac{\left(n_{ck} - \hat{n}_{ck}\right)^{2}}{\hat{n}_{ck}}$$

where *n*_{ck} stands for the observed frequencies of cell *ck* and $\hat{n}_{ck}$ stands for the expected frequencies under the null hypothesis of two-way homogeneity.
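The chi-square computation behind this criterion can be sketched in Python. The sketch assumes the standard Pearson form, the sum over cells of (observed - expected)² / expected, with expected frequencies obtained from the row and column totals. For a concrete two-way table it uses the sex-by-group counts of this sample, purely to illustrate the arithmetic; the `p_value_df1` helper is only valid for 1 degree of freedom (a 2 × 2 table).

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency under two-way homogeneity
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic

def p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Rows: MCI, Dementia; columns: women, men (counts from the sample table).
sex_by_group = [[165, 110], [78, 47]]
statistic = chi_square(sex_by_group)
print(statistic, p_value_df1(statistic))
```

Assuming no continuity correction, the resulting *p*-value is close to the 0.649 reported for sex in the sample description. In a CHAID tree the same statistic would be computed for each candidate predictor at each node.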

In QUEST trees, the split predictor is selected, for continuous predictors, with the ANOVA *F* statistic, which compares $\bar{x}_{c}(t)$, the average of predictor *X* in the *c* group at node *t*, with $\bar{x}(t)$, the average of predictor *X* at node *t* for all groups. For categorical predictors, a chi-square-like statistic similar to the one defined for CHAID is used.
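A minimal Python sketch of a one-way ANOVA *F* statistic follows. It assumes the textbook form, the between-group mean square divided by the within-group mean square, and may differ in detail from the exact expression used by the QUEST implementation; the predictor values are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    n_groups = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    between = sum(
        len(g) * (mean - grand_mean) ** 2 for g, mean in zip(groups, group_means)
    )
    within = sum(
        (x - mean) ** 2 for g, mean in zip(groups, group_means) for x in g
    )
    return (between / (n_groups - 1)) / (within / (n_total - n_groups))

# Hypothetical values of one predictor in two groups at a node.
print(anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
```

The predictor with the largest *F* (smallest *p*-value) at a node would be the natural candidate for the split.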

#### Random Forests

Random Forests (RF) were proposed by Leo Breiman [47]. This "ensemble learning" classification method constructs a series of CART using random bootstrap samples of the original data sample. Each of these trees is built from a further random subset of the total predictors, from which the predictor that maximizes the classification criterion is chosen at each node. An estimate of the classification error rate can be obtained by using each CART to predict the data not in the bootstrap sample used to grow that tree (the "out-of-bag" data), and then averaging the out-of-bag predictions over the grown set of trees (forest). These out-of-bag estimates of the error rate can be quite accurate if enough trees have been grown [48]. Object classification is then performed from the majority of predictions given by the trees in the random forest. Although the advantage of this classification strategy over a single CT may not be immediately apparent, according to its creator (Leo Breiman) it has unexcelled accuracy among current algorithms, performing very well when compared to many classifiers including LDA, NN and SVM [for a detailed description of RF see [47]]. Furthermore, this method is quite user-friendly since it has only two parameters that the user needs to define: the number of random trees in the forest, and the number of predictor variables in the random subset used at each node. These parameters can be easily optimized, although random forests are not very sensitive to their values [48].
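Two of the mechanics described above, bootstrap sampling with out-of-bag cases and majority voting, can be sketched in Python without growing any trees. The function names are hypothetical and the tree predictions are made up for illustration; this is not the *randomForest* implementation.

```python
import random
from collections import Counter

def bootstrap_split(n_cases, rng):
    """Draw a bootstrap sample (with replacement) and list the out-of-bag cases."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag

def majority_vote(predictions):
    """Class predicted by most trees (ties go to the class seen first)."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
in_bag, out_of_bag = bootstrap_split(400, rng)
# Each tree would be grown on in_bag and evaluated on its own out_of_bag cases.
print(len(in_bag), len(out_of_bag))

# Hypothetical predictions of five trees for one subject.
print(majority_vote(["MCI", "Dementia", "MCI", "MCI", "Dementia"]))
```

On average roughly one third of the cases are left out of each bootstrap sample, which is what makes the out-of-bag error estimate possible without a separate test set.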

### Case study application

#### Sample

Sample demographics. The two groups in the criterion were "MCI" (patients with Mild Cognitive Impairment) and "Dementia" patients.

| | MCI | Dementia | *p* |
|---|---|---|---|
| Group size (%) | 275 (69%) | 125 (31%) | <0.001 |
| Age (M ± SD) | 67.8 ± 8.8 | 71.6 ± 8.4 | <0.001 |
| Sex (♀/♂) | 165/110 | 78/47 | 0.649 |
| Schooling years (M ± SD) | 8.1 ± 4.7 | 8.64 ± 4.9 | 0.469 |
| Time between assessments (years) (M ± SD) | 2.3 ± 1.6 | 2.2 ± 1.4 | 0.517 |

### Criterion and Predictors

### Data mining settings and classifiers evaluation

To prevent overfitting and artificial accuracy improvement due to the use of the same data for training and testing of classifiers, a 5-fold cross-validation strategy was followed to train and evaluate the 10 classifiers. The total sample was divided into 5 proportional sub-samples. In each of the 5 steps, 4/5 of the sample was used for training and 1/5 for testing. Test results for the 5 runs, gathered from the 5 test samples, were then considered for further comparisons. The performances (total accuracy, sensitivity, specificity, AUC and Press' Q) of the different classifiers were compared with Friedman's test followed by Dunn's post-hoc multiple comparisons of mean ranks for paired samples. Statistical significance was assumed for *p* < 0.05. To avoid biases from the data sets, equal a priori classification probabilities were used for Linear Discriminant Analysis, Quadratic Discriminant Analysis and Logistic Regression. Neural Networks, Support Vector Machines, Classification Trees and Random Forests used the settings most frequently employed in practical data mining applications, as follows. The Multilayer Perceptron was trained with 11 inputs (one for each predictor) in the input layer, 1 hidden layer with 4-7 neurons and a hyperbolic tangent activation function. The number of neurons in the hidden layer was iteratively adjusted by the software to minimize classification errors in the training data set. The activation function for the output layer was the Softmax with a cross-entropy error function. Synaptic weights were obtained from an 80%:20% train:test setup. The Radial Basis Function Neural Network had 11 inputs, one hidden layer with 2-8 neurons and a Softmax activation function. The activation function for the output layer was the identity function with a sum of squares error function. The Gaussian function was the kernel used in the SVM. 
Cost (*c*) and γ parameters were optimized by a linear grid search in the intervals [2^{-3}; 2^{15}] for *c* and [2^{-15}; 2^{3}] for γ, followed by cross-validation of each of the SVM obtained in the 5 training sets. The classification function was the sign of the optimum margin of separation. CHAID, CART and QUEST classification trees used α to split and α to merge of 0.05, with 10 intervals. Tree growth and pruning of CART were set with a minimum parent size of 5 and minimum child size of 1. Classification priors for both trees were fixed at 0.5:0.5. Random Forests were composed of 500 CART trees with 2-9 predictors per tree, optimized by cross-validation. The Predictive Analytics Software (PASW) Statistics (v. 18, SPSS Inc., Chicago, IL) was used for Discriminant Analysis, Logistic Regression, Neural Networks and Classification Trees. Support Vector Machines and Random Forests were performed with R (v. 2.8, CRAN) with the *e1071* [56] and *randomForest* [48] packages, respectively.
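The cross-validation design and the SVM parameter grid described above can be sketched in Python. Two assumptions are made: "proportional sub-samples" is read as folds that preserve the MCI:Dementia ratio (stratified folds), and the grid steps the exponent by 2, which the text does not specify. The function name `stratified_folds` is hypothetical, and no classifier is fitted here.

```python
import random

def stratified_folds(labels, n_folds=5, seed=0):
    """Split case indices into n_folds folds that keep the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled cases of each class to the folds in turn
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

# Group sizes of the sample: 275 MCI and 125 Dementia cases.
labels = ["MCI"] * 275 + ["Dementia"] * 125
folds = stratified_folds(labels)

# In each of the 5 steps, 4/5 of the sample trains and 1/5 tests.
splits = []
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    splits.append((train, test_fold))

# Candidate SVM parameters on a power-of-two grid (exponent step of 2 is assumed).
cost_grid = [2.0 ** e for e in range(-3, 16, 2)]
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]

print([len(fold) for fold in folds])
print(len(cost_grid) * len(gamma_grid), "(c, gamma) combinations")
```

With these group sizes each test fold holds 80 cases (55 MCI and 25 Dementia), so every performance measure is estimated on the same class mix as the full sample.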

## Results

Classification accuracy, sensitivity, specificity, area under the ROC and Press' Q statistic were evaluated in the 5 test sets resulting from the 5-fold cross validation strategy as described before. Data gathered is illustrated in box-plots for the different classifiers.
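For reference, the three count-based measures can be computed from the confusion counts of one test fold as in the Python sketch below. It takes "Dementia" as the condition-present (positive) class, consistent with the definitions of sensitivity and specificity given in the Background; the counts are hypothetical and do not correspond to any classifier in this study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion counts.

    tp/fn: Dementia cases predicted as Dementia / as MCI.
    tn/fp: MCI cases predicted as MCI / as Dementia.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical test fold of 80 cases: 25 Dementia and 55 MCI.
metrics = classification_metrics(tp=16, fn=9, tn=40, fp=15)
print(metrics)
```

Because the MCI group is more than twice the size of the Dementia group, a classifier can reach a high accuracy while its sensitivity stays low, which is why the three measures are reported separately.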

### Total Accuracy

There were statistically significant differences in total accuracy between the classifiers (X^{2}_{Fr}(9) = 22.211; *p* = 0.008). Post-hoc multiple mean rank comparisons for paired samples revealed that SVM and RF had higher mean ranks than the other classifiers, which did not differ significantly in mean rank accuracy (*p* > 0.05).

### Specificity

There were statistically significant differences in specificity between the classifiers (X^{2}_{Fr}(9) = 37.292; *p* < 0.001). SVM scored the highest in specificity, followed by a second group composed of MLP, LR and RBF, with significant differences from a third group composed of LDA, QDA, classification trees and RF.

### Sensitivity

There were statistically significant differences in sensitivity between the classifiers (X^{2}_{Fr}(9) = 29.0; *p* = 0.001). LDA, CART, QUEST and RF had the highest sensitivity values. It is worth mentioning that LR, MLP, RBF and CHAID had median sensitivity values close to or lower than 0.5, and that SVM was the classifier with the significantly lowest sensitivity.

### Area under the ROC

There were statistically significant differences in the area under the ROC (AUC) between the classifiers (X^{2}_{Fr}(9) = 23.745; *p* = 0.005). SVM shows the highest AUC; however, an extremely low value removes the significance of the differences from the AUC distributions of the other classifiers. LDA, LR, MLP, RBF and RF form a homogeneous group, statistically different from the group composed of QDA, CART and CHAID. QUEST had the significantly lowest AUC.

### Classification by chance alone

The performance of each classifier relative to chance alone was evaluated with Press' Q statistic,

$$Q = \frac{\left[N - (n \cdot k)\right]^{2}}{N(k - 1)}$$

where *N* is the total sample size, *n* is the number of observations correctly classified and *k* is the number of groups. Under the null hypothesis that the classifier is no better than chance alone, Press' Q has a chi-square distribution with 1 degree of freedom. Thus, classifiers with Q ≥ 3.84 classify significantly better than chance alone at a 0.05 significance level. The Q distributions in the 5 test samples are shown in Figure 8. There were statistically significant differences between the Q distributions (X^{2}_{Fr}(9) = 21.582; *p* = 0.01). Dunn's multiple mean rank comparisons revealed that SVM had the highest mean rank, followed by RF, MLP, CHAID and LR. The smallest mean ranks were observed for LDA, QDA, RBF, CART and QUEST. All classifiers, with the exception of QUEST, had 1^{st} quartiles higher than 3.84 (*p* < 0.05).
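The comparison against the 3.84 critical value can be illustrated with a short Python sketch. It assumes the usual form of Press' Q, Q = [N - (n·k)]² / [N(k - 1)], and uses a hypothetical test fold of 80 cases with two groups.

```python
def press_q(n_total, n_correct, n_groups):
    """Press' Q: [N - (n * k)]^2 / [N * (k - 1)]."""
    return (n_total - n_correct * n_groups) ** 2 / (n_total * (n_groups - 1))

CRITICAL_VALUE = 3.84  # chi-square with 1 degree of freedom, 0.05 level

# Hypothetical test fold: 80 cases, 2 groups.
for n_correct in (40, 48, 49, 61):
    q = press_q(80, n_correct, 2)
    print(n_correct, q, q >= CRITICAL_VALUE)
```

In a fold of 80 cases, at least 49 correct classifications (about 61%) are needed for Q to reach the critical value, which puts the reported median accuracies of 0.63 and above in context.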

## Discussion

All classifiers evaluated showed better median (Me) classification than chance alone in the prediction of evolution into dementia of elderly people with Mild Cognitive Impairment. The median Press' Q statistic was greater than or equal to 5 for all classifiers, although for QUEST the 1^{st} quartile was below the critical level for this statistic. The discriminant power of the classifiers, as judged by the AUC, was appropriate for most classifiers (greater than 0.7), with the exception of classification trees (median AUC of 0.6). No statistically significant differences were found in the total accuracy of 8 of the 10 evaluated classifiers (medians between 0.63 and 0.73), but RF (Me = 0.74) and SVM (Me = 0.76) obtained statistically significantly higher classification accuracy. Median specificity ranged from a minimum of 0.64 (CART and LDA) to a maximum of 1 (SVM). With the exception of LDA, CART and QUEST, all the other classifiers were quite efficient in predicting group membership in the group with the larger number of elements (the MCI group, corresponding to 69% of the sample) (median specificity larger than 0.6). Judging from total accuracy, SVM and RF rank highest amongst the classifiers tested, as has been suggested elsewhere [47, 48, 57, 58]. However, a quite different picture emerges from the analysis of the sensitivity of the classifiers. Prediction for the group with lower frequency (the Dementia group, 31% of the sample) was quite poor for several of the tested classifiers, including the ones with some of the highest specificity values. Minimum median sensitivity was 0.30 (SVM) and maximum median sensitivity was 0.66 (QUEST, followed by 0.64 for LDA and RF). Only six of the ten classifiers tested showed median sensitivity larger than 0.5 (and only five had 1^{st} quartile sensitivity larger than 0.5). 
Considering that conversion into dementia is the key prediction in this biomedical application, and thus higher sensitivity of classifiers is required, classifiers like Logistic Regression, Neural Networks, Support Vector Machines and CHAID trees are inappropriate for this type of binary classification task. Similar findings were observed in studies comparing different classifiers in other biomedical conditions [24, 34, 58]. Total accuracy of classifiers is misleading, since some classifiers are good only at predicting the larger group membership (high specificity) but quite insufficient at predicting the smaller group membership (low sensitivity). Some of the classifiers with the highest specificity (Neural Networks (MLP and RBF) and SVM) are also the classifiers with the lowest sensitivity. Imbalance of classification efficiency for small frequency vs. large frequency groups has been found in other real-data studies for Logistic Regression and Neural Networks [30, 34, 59, 60]. To our knowledge, such imbalance of SVM in the prediction of the lowest frequency group has not been published elsewhere. David Meyer (personal communication) has also observed that SVM predict low frequency groups poorly. Taking into account total accuracy, specificity and sensitivity, the oldest method, Fisher's Linear Discriminant Analysis, does not rank much lower than Multilayer Perceptrons or Random Forests, the newest member of the binary classification family. The relatively small sample size, although in the range of most biomedical experimental studies with dementia and cognitive impairment, may limit the performance of some data mining methods assessed in this study. Sample size has been known to play an important role in the accuracy of Neural Networks [61, 62]. In our study, the number of cases in the training and testing sets is at the lower limit of the recommended data set dimensions for Neural Networks applications (several hundred) [61–63]. 
Large data set requirements are also found in LR, but less so in LDA if the model assumptions are met. The present sample size was not, apparently, limiting for the achievement of an acceptable accuracy, specificity and sensitivity of both Random Forests and LDA, as reported elsewhere [18, 63]. Furthermore, there are studies with relatively small samples where data mining techniques, like SVM and Neural Networks, have been used with high accuracy in classification problems [see e.g. [58, 64–66]]. Equivalent or even superior performances have been reported for Linear Discriminant Analysis and Random Forests when compared with Neural Networks, Classification Trees and Support Vector Machines [see e.g. [34, 47, 58, 67, 68]]. However, controversy still prevails regarding the effects on classifiers' performance of different combinations of predictors, data assumptions, sample sizes and parameter tuning [16, 17, 31, 58, 69, 70]. Different applications with different data sets (both real and simulated) have failed to produce a classifier that ranks best in all applications, as shown in the studies by Michie et al. [71] (STATLOG project, with 23 different classifiers evaluated in 22 real data sets); Lim et al. [72] (33 classifiers evaluated on 16 real data sets) and Meyer et al. [34] (24 classifiers, available in the R software, evaluated on 21 data sets).

It must be pointed out that the results gathered in our study are based on a specific data set and a single set of tuning parameters. It is well known that for Neural Networks and Support Vector Machines the performance of these classifiers and the properties of the resulting predictions are heavily dependent on the chosen values for the tuning parameters [33, 34, 72, 73]. Although we used the settings most commonly used in data mining applications, and tuning parameters that were optimally determined by grid search methods that minimize total error rates, it may well be that the performance of the data mining methods is just a reflection of the tuning parameters chosen. Discussing Neural Networks versus traditional classifiers, Duin [73] takes this argument one step further when he states that "(...) a straight forward fair comparison demands automatic classifiers with no user interaction. As this conflicts with one of the main characteristics of neural networks, their flexibility, the question whether they are better or worse than traditional techniques might be undecidable".

Observations similar to the ones reported in this study have been made by other authors when classifiers were compared on more than total accuracy or total error rates. For example, Breiman et al. (1984) state that "LDA does as well as other classifiers in most applications". Meyer et al. [34] point out in their comparison study of data mining classifiers, including Neural Networks and SVM, that LDA is a very competitive classifier, producing good results "*out-of-the-box* without the inconvenience of delicate and computationally expensive hyperparameter tuning". In a similar application of Random Forests, SVM, Neural Networks and Linear Discriminant Analysis for recognition of Alzheimer's disease based on electrical brain activity, Lehmann et al. [58] state that "even though modern computer-intensive classification algorithms such as Random Forest, SVM and Neural Networks show a slight superiority, more classical classification algorithms performed nearly equally well".

## Conclusions

For binary classification problems, like the prediction of dementia, where classes can be linearly separated and sample size may compromise the training and testing of popular data mining and machine learning methods, Random Forests and Linear Discriminant Analysis proved to have high accuracy, sensitivity, specificity and discriminant power. In contrast, data mining classifiers like Support Vector Machines, Neural Networks and Classification Trees showed low sensitivity, which recommends against their use in classification problems where the class of interest is less represented. Since for some data mining techniques the final result and the classifier performance depend on the skill of the analyst who applies them and his "special art for tuning the parameters", the question raised by Dunn [33], "A data mining method can outperform the traditional classifiers?", may well never be decidable. However, it is noteworthy that Fisher's Linear Discriminant Analysis, a classifier devised almost a century ago, stands up against computer-intensive classifiers as a simple, efficient, user- and time-proof classifier.

## Declarations

### Acknowledgements

Supported by grants from Fundação Calouste Gulbenkian and Fundação para a Ciência e Tecnologia (PIC/IC/82796/2007). The authors acknowledge the facilities provided by Memoclínica.

## Authors’ Affiliations

## References

- Ferri CPM, Brayne C: Global prevalence of dementia: a Delphi consensus study. Lancet Neurology. 2005, 366: 2112-2117.View ArticleGoogle Scholar
- Petersen RC, Stevens JC, Ganguli M, Tangalos EG, Cummings JL, DeKosky ST: Practice parameter: Early detection of dementia: Mild cognitive impairment (an evidence-based review) - Report of the Quality Standards Subcommittee of the American Academy of Neurology. Neurology. 2001, 56: 1133-1142.PubMedView ArticleGoogle Scholar
- Portet F, Ousset PJ, Visser PJ, Frisoni GB, Nobili F, Scheltens P, Vellas B, Touchon J: Mild cognitive impairment (MCI) in medical practice: a critical review of the concept and new diagnostic procedure. Report of the MCI Working Group of the European Consortium on Alzheimer's Disease. J Neurol Neurosurg Psychiatry. 2006, 77: 714-718. 10.1136/jnnp.2005.085332.PubMedPubMed CentralView ArticleGoogle Scholar
- de Mendonca A, Guerreiro M, Ribeiro F, Mendes T, Garcia C: Mild cognitive impairment - Focus on diagnosis. Journal of Molecular Neuroscience. 2004, 23: 143-147. 10.1385/JMN:23:1-2:143.PubMedView ArticleGoogle Scholar
- Dubois B, Feldman HH, Jacova C, Dekosky ST, Barberger-Gateau P, Cummings J, Delocourte A, Galasko D, Gauthier S, Jicha G, et al: Research criteria for the diagnosis of Alzheimer"s disease: revising the NINCDS-ADRDA criteria. Lancet Neurology. 2007, 6: 734-746. 10.1016/S1474-4422(07)70178-3.PubMedView ArticleGoogle Scholar
- Chong MS, Sahadevan S: Preclinical Alzheimer's disease: diagnosis and prediction of progression. Lancet Neurology. 2005, 4: 576-579. 10.1016/S1474-4422(05)70168-X.
- Lehrner J, Gufler R, Guttmann G, Maly J, Gleiss A, Auff E, Dal-Bianco P: Annual conversion to Alzheimer disease among patients with memory complaints attending an outpatient memory clinic: The influence of amnestic mild cognitive impairment and the predictive value of neuropsychological testing. Wiener Klinische Wochenschrift. 2005, 117: 629-635. 10.1007/s00508-005-0428-6.
- Fleisher AS, Sowell BB, Taylor C, Gamst AC, Petersen RC, Thal LJ, Alzheimers Disease C: Clinical predictors of progression to Alzheimer disease in amnestic mild cognitive impairment. Neurology. 2007, 68: 1588-1595. 10.1212/01.wnl.0000258542.58725.4c.
- Fleisher AS, Sowell BB, Taylor C, Gamst AC, Petersen RC, Thal LJ: Alzheimer's Disease Cooperative Study. Clinical predictors of progression to Alzheimer disease in amnestic mild cognitive impairment. Neurology. 2007, 68: 1588-1595. 10.1212/01.wnl.0000258542.58725.4c.
- Perri R, Serra L, Carlesimo GA, Caltagirone C, Early Diag Grp Italian I: Preclinical dementia: an Italian multicentre study on amnestic mild cognitive impairment. Dementia and Geriatric Cognitive Disorders. 2007, 23: 289-300. 10.1159/000100871.
- Sarazin M, Berr C, De Rotrou J, Fabrigoule C, Pasquier F, Legrain S, Michel B, Puel M, Volteau M, Touchon J, et al: Amnestic syndrome of the medial temporal type identifies prodromal AD - A longitudinal study. Neurology. 2007, 69: 1859-1867. 10.1212/01.wnl.0000279336.36610.f7.
- Michael G, Jonas B, Jakob F, Ulf E, Lars E, Mattias O: Comparison between neural networks and multiple logistic regression to predict acute coronary syndrome in the emergency room. 2006.
- Peter CA: A comparison of regression trees, logistic regression, generalized additive models, and multivariate adaptive regression splines for predicting AMI mortality. Statistics in Medicine. 2007, 26: 2937-2957. 10.1002/sim.2770.
- Goss EP, Ramchandani H: Comparing classification accuracy of neural networks, binary logit regression and discriminant analysis for insolvency prediction of life insurers. Journal of Economics and Finance. 1995, 19: 1-18.
- Efron B: The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis. Journal of the American Statistical Association. 1975, 70: 892-898. 10.2307/2285453.
- Fan X, Wang L: Comparing linear discriminant function with logistic regression for the two-group classification problem. Journal of Experimental Education. 1999, 67: 265-286. 10.1080/00220979909598356.
- Lei PW, Koehly LM: Linear discriminant analysis versus logistic regression: a comparison of classification errors in the two-group case. The Journal of Experimental Education. 2003, 72: 25-49. 10.1080/00220970309600878.
- Pohar M, Blas M, Turk S: Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study. Metodološki zvezki. 2004, 1: 143-161.
- Pitarque A, Roy JF, Ruiz JC: Redes neurales vs modelos estadísticos: Simulaciones sobre tareas de predicción y clasificación. Psicológica. 1998, 19: 387-400.
- Nabney IT: Efficient training of RBF networks for classification. International Journal of Neural Systems. 2004, 14: 201-208. 10.1142/S0129065704001930.
- Poon TC, Chan AT, Zee B, Ho SK, Mok TS, Leung TW, Johnson PJ: Application of classification tree and neural network algorithms to the identification of serological liver marker profiles for the diagnosis of hepatocellular carcinoma. Oncology. 2001, 61: 275-283. 10.1159/000055334.
- Suka M, Oeda S, Ichimura T, Yoshida K, Takezawa J: Advantages and disadvantages of neural networks for predicting clinical outcomes. IMECS 2007: International Multiconference of engineers and computer scientists. 2007, I & II: 839-844.
- Kestler HA, Schwenker F: RBF network classification of ECGs as a potential marker for sudden cardiac death. Radial basis function networks 2: new advances in design archive. 2001, Heidelberg, Germany: Physica-Verlag GmbH, 162-214.
- Maglogiannis I, Sarimveis H, Kiranoudis CT, Chatziioanno AA, Oikonomou N, V A: Radial basis function neural networks classification for the recognition of idiopathic pulmonary fibrosis in microscopic images. IEEE Trans Inf Technol Biomed. 2008, 12: 42-54.
- Sut N, Senocak M: Assessment of the performances of multilayer perceptron neural networks in comparison with recurrent neural networks and two statistical methods for diagnosing coronary artery disease. Expert Systems. 2007, 24: 131-142. 10.1111/j.1468-0394.2007.00425.x.
- Sommer M, Olbrich A, Arendasy M: Improvements in Personnel Selection with Neural Nets: A Pilot Study in the field of Aviation Psychology. The International Journal of Aviation Psychology. 2004, 14: 103-115. 10.1207/s15327108ijap1401_6.
- Zollner FG, Emblem KE, Schad LR: Support vector machines in DSC-based glioma imaging: Suggestions for optimal characterization. Magn Reson Med. 2010.
- Ivanciuc O: Applications of Support Vector Machines in Chemistry. Reviews in Computational Chemistry. Edited by: Lipkowitz KB, Cundari TR. 2007, Weinheim: John Wiley & Sons, Inc, 23: 291-400.
- Kurt I, Ture M, Kurum AT: Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease. Expert Systems with Applications. 2008, 34: 366-374. 10.1016/j.eswa.2006.09.004.
- Finch H, Schneider MK: Misclassification Rates for Four Methods of Group Classification: Impact of Predictor Distribution, Covariance Inequality, Effect Size, Sample Size, and Group Size Ratio. Educational and Psychological Measurement. 2006, 66: 240-257. 10.1177/0013164405278579.
- Finch H, Schneider MK: Classification Accuracy of Neural Networks vs. Discriminant Analysis, Logistic Regression, and Classification and Regression Trees: Three- and Five-Group Cases. Methodology. 2007, 3: 47-57.
- Gelnarova E, Safarik L: Comparison of three statistical classifiers on a prostate cancer data. Neural Network World. 2005, 15: 311-318.
- Duin RPW: A note on comparing classifiers. Pattern Recognition Letters. 1996, 17: 529-536. 10.1016/0167-8655(95)00113-1.
- Meyer D, Leisch F, Hornik K: The support vector machine under test. Neurocomputing. 2003, 55: 169-186. 10.1016/S0925-2312(03)00431-4.
- Behrman M, Linder R, Assadi AH, Stacey BR, Backonja MM: Classification of patients with pain based on neuropathic pain symptoms: Comparison of an artificial neural network against an established scoring system. European Journal of Pain. 2007, 11: 370-376. 10.1016/j.ejpain.2006.03.001.
- Fisher R: The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics. 1936, 7: 179-188. 10.1111/j.1469-1809.1936.tb02137.x.
- McLachlan GJ: Discriminant Analysis and Statistical Pattern Recognition. 2004, London: Wiley Interscience.
- Hosmer DW, Lemeshow S: Applied Logistic Regression. 2000, New York: Chichester, Wiley, 2.
- Yang ZR: Neural networks. Methods Mol Biol. 2010, 609: 197-222. 10.1007/978-1-60327-241-4_12.
- Bishop C: Neural Networks for Pattern Recognition. 1995, Oxford: Oxford University Press.
- Cortes C, Vapnik V: Support-Vector Networks. Machine Learning. 1995, 20: 273-297.
- Karatzoglou A, Meyer D, Hornik K: Support Vector Machines in R. Journal of Statistical Software. 2006, 15: 1-28.
- Bennett KP, Campbell C: Support vector machines: Hype or hallelujah?. SIGKDD Explorations. 2000, 2:
- Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and regression trees. 1984, Monterey, Calif., USA: Wadsworth, Inc.
- Kass G: An exploratory technique for investigating large quantities of categorical data. Applied Statistics. 1980, 29: 119-127. 10.2307/2986296.
- Loh W-Y, Shih Y-S: Split selection methods for classification trees. Statistica Sinica. 1997, 7: 815-840.
- Breiman L: Random forests. Machine Learning. 2001, 45: 123-140. 10.1023/A:1010950718922.
- Liaw A, Wiener M: Classification and Regression by randomForest. R News. 2002, 2/3 (December): 18-22.
- APA: Diagnostic and statistical manual of mental disorders. 2000, Washington, DC: American Psychiatric Association, Text revision, 4.
- Garcia C: Doença de Alzheimer, problemas do diagnóstico clínico. Tese de Doutoramento. 1984, Universidade de Lisboa, Faculdade de Medicina de Lisboa.
- Benton AL, Hamsher K: Multilingual Aphasia Examination. 1976, Department of Neurology, University of Iowa Hospitals, Iowa City.
- Wechsler D: Manual for the Wechsler Adult Intelligence Scale--Revised. 1981, Psychological Corporation, New York.
- Freedman M, Leach L, Kaplan E, Winocur G, Shulman K, Delis DC: Clock-drawing: a neuropsychological analysis. 1994, New York, NY: Oxford University Press.
- Wechsler D, Stone CP: Wechsler memory scale. 1945, New York: Psychological Corporation.
- Ribeiro F, Guerreiro M, de Mendonça A: Verbal learning and memory deficits in Mild Cognitive Impairment. Journal of Clinical and Experimental Neuropsychology. 2007, 29: 187-197. 10.1080/13803390600629775.
- Meyer D: Support Vector Machines: The Interface to libsvm in package e1071. R News. 2001, 1/3: 23-26.
- Burges C: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery. 1998, 2: 121-167. 10.1023/A:1009715923555.
- Lehmann C, Koenig T, Jelic V, Prichep L, John RE, Wahlund LO, Dodge Y, Dierks T: Application and comparison of classification algorithms for recognition of Alzheimer's disease in electrical brain activity (EEG). Journal of Neuroscience Methods. 2007, 161: 342-350. 10.1016/j.jneumeth.2006.10.023.
- Orr RK: Use of a Probabilistic Neural Network to Estimate the Risk of Mortality after Cardiac Surgery. Medical Decision Making. 1997, 17: 178-185. 10.1177/0272989X9701700208.
- Schwarzer G, Vach W, Schumacher M: On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology. Statistics in Medicine. 2000, 19: 541-561. 10.1002/(SICI)1097-0258(20000229)19:4<541::AID-SIM355>3.0.CO;2-V.
- Fukunaga K, Hayes RR: Effects of Sample Size in Classifier Design. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1989, 11: 873-885. 10.1109/34.31448.
- Raudys SJ, Jain AK: Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1991, 13: 252-264. 10.1109/34.75512.
- Vach W, Roßner R, Schumacher M: Neural networks and logistic regression. Part II. Computational Statistics and Data Analysis. 1996, 21: 683-701. 10.1016/0167-9473(95)00033-X.
- Oliveira PP, Nitrini R, Busatto G, Buchpiguel C, Sato JR, Amaro E: Use of SVM methods with surface-based cortical and volumetric subcortical measurements to detect Alzheimer's disease. J Alzheimers Dis. 2010, 19: 1263-1272.
- Zhu Y, Tan Y, Hua Y, Wang M, Zhang G, Zhang J: Feature selection and performance evaluation of support vector machine (SVM)-based classifier for differentiating benign and malignant pulmonary nodules by computed tomography. J Digit Imaging. 2010, 23: 51-65. 10.1007/s10278-009-9185-9.
- Jahandideh S, Abdolmaleki P, Movahedi MM: Comparing performances of logistic regression and neural networks for predicting melatonin excretion patterns in the rat exposed to ELF magnetic fields. Bioelectromagnetics. 2010, 31: 164-171.
- Smith A, Sterba-Boatwright B, Mott J: Novel application of a statistical technique, Random Forests, in a bacterial source tracking study. Water Res. 2010, 44: 4067-4076. 10.1016/j.watres.2010.05.019.
- Statnikov A, Aliferis CF: Are random forests better than support vector machines for microarray-based cancer classification?. AMIA Annu Symp Proc. 2007, 686-690.
- Lemon SC, Roy J, Clark MA, Friedmann PD, Rakowski W: Classification and regression tree analysis in public health: Methodological review and comparison with logistic regression. Annals of Behavioral Medicine. 2003, 26: 172-181. 10.1207/S15324796ABM2603_02.
- Lisboa PJ: A review of evidence of health benefit from artificial neural networks in medical intervention. Neural Networks. 2002, 15: 11-39. 10.1016/S0893-6080(01)00111-3.
- Michie D, Spiegelhalter DJ, Taylor CC: Machine learning, neural and statistical classification. 1994, New York: Ellis Horwood.
- Lim T-S, Loh W-Y, Shih Y-S: A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms. Machine Learning. 2000, 40: 203-228. 10.1023/A:1007608224229.
- Duin RPW: A note on comparing classifiers. Pattern Recognition Letters. 1996, 17: 529-536. 10.1016/0167-8655(95)00113-1.

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.