Variable selection methods for developing a biomarker panel for prediction of dengue hemorrhagic fever
© Ju and Brasier; licensee BioMed Central Ltd. 2013
Received: 19 March 2013
Accepted: 10 September 2013
Published: 11 September 2013
The choice of selection methods to identify important variables for binary classification modeling is critical to produce stable models that are interpretable, that generate accurate predictions and have minimum bias. This work is motivated by data on clinical and laboratory features of severe dengue infections (dengue hemorrhagic fever, DHF) obtained from 51 individuals enrolled in a prospective observational study of acute human dengue infections.
We carry out a comprehensive performance comparison using several classification models for DHF over the dengue data set. We compared variable selection results by Multivariate Adaptive Regression Splines, Learning Ensemble, Random Forest, Bayesian Moving Averaging, Stochastic Search Variable Selection, and Generalized Regularized Logistics Regression. Model averaging methods (bagging, boosting and ensemble learners) have higher accuracy, but the generalized regularized regression model has the highest predictive power because the linearity assumptions of candidate predictors are strongly satisfied via deviance chi-square testing procedures. Bootstrapping applications for evaluating predictive regression coefficients in regularized regression model are performed.
Feature reduction methods introduce inherent biases and therefore are data-type dependent. We propose that these limitations can be overcome using an exhaustive approach for searching feature space. Using this approach, our results suggest that IL-10, platelet and lymphocyte counts are the major features for predicting dengue DHF on the basis of blood chemistries and cytokine measurements.
KeywordsVariable selection Classification Bootstrap sampling Data mining
Analysis of high-dimensional data, where the number of predictors exceeds the sample size, poses many challenges for statisticians and calls for new statistical methodologies in order to select relevant variables that are correlated with one-another. In these data, feature selection is used to overcome the curse of dimensionality by removing non-essential variables to achieve a model with predictive accuracy. Consequently, the choice of a variable selection procedure becomes very important for the ability to generate reproducible findings and generalizable conclusions. In high-dimensional data it is desirable to have parsimonious or sparse representations of prediction models. Since highly complex models will be penalized by increased total error, regularization can help control the complexity of a classification by minimizing over-fitting of the training data. This approach is evaluated by maximizing goodness-of-fit and simultaneously minimizing the number of variables selected. An alternative to regularization is to develop a classifier via exploring information from many models built on a perturbed version of a learning ensemble, such as a bagging and a boosting algorithm. We evaluated different models by randomly selecting and withholding the training data to be used later for testing. The receiver characteristic operating curve is a measure by which to compare prediction accuracy based on sensitivity and specificity for both training and test data.
This study is motivated by our previous work on the identification of biomarker panels associated with dengue hemorrhagic fever [1, 2]. Dengue is an acute viral infection hyperendemic to the tropics for which significant numbers are at risk. A subgroup of patients acutely infected with dengue is at risk for a later manifestation of capillary leak and hemorrhage, known as dengue hemorrhagic fever (DHF). In this study discriminative features were identified that associated with dengue hemorrhagic fever. Various blood chemistries and cytokine measurements can be associated with disease outcome. However, many of these variables are highly correlated and which of the factors will result in the most stable classifier is not known. Here we seek to extend this work by comparing the effects of various feature reduction methods to identify the most robust features for early detection of DHF.
Active surveillance for people with dengue infection was conducted in Maracay, Venezuela. Fifty one febrile subjects with signs and symptoms consistent with dengue infection who presented at participating clinics and hospitals, or who were identified by community-based active surveillance, were included in the study. On the day of presentation, a blood sample was collected for dengue virus reverse transcription-polymerase chain reaction (RT-PCR) confirmation and clinical testing. Using the 2009 World Health Organization (WHO) criteria, 13 subjects developed DHF (dengue with warning signs; of these three were classified as severe dengue caused by plasma leakage and severe bleeding). Predictive modeling was performed using laboratory values of 51 individuals (38 DF and 13 DHF) obtained on initial presentation via binary classification models.
Feature selection methods identify the subset of differentially-expressed predictors that are useful and relevant in distinguishing different classes of samples. Feature reduction using Significance Analysis of Microarray (SAM)
where L is the maximum likelihood for the model, and p is number of covariates estimated in the model .
where n is the samples size, and p are defined as those variables in AIC .
For the Learning Ensemble, we used a stochastic gradient boosted model (TreeNet). TreeNet is a generalized tree boosting algorithm that is an accurate and effective off-the-shelf procedure for data mining [5, 6]. Software implementation of the bagging ensemble model was from Salford Predictive Modeler version 7.0 from Salford Systems.
Bayesian model averaging
Bayesian model averaging (BMA) was used to consider all the possible 2p model configurations from the model space as the potentially correct mode. BMA was applied using the R library BMA (http://cran.r-project.org).
Stochastic Search Variable Selection (SSVS)
Generalized Path Seeker (GPS) modeling
Automated stepwise logistic regression model selection has the problem of instability and overfitting of subset selection in multivariate regression by retaining redundant information and noisy variables. The method sequentially drops and adds predictor variables by examining the mean squared error or a modified version of it. Recent techniques, such as shrinkage and regularized estimation, can account for and correct the overfitting. We used the optimized L1 and L2 regularization generalized linear model to select to most important features. Regularized and shrinkage estimation methods such as a lasso (least absolute shrinkage and selection operator) estimator help address variable selection and multicollinearity . The lasso utilizes the L1 penalty and does both continuous shrinkage and automated variable selection simultaneously. For a binary response variable and the logistic regression models, the lasso estimator is estimated by penalized the negative log-likelihood with the L1-norm. The penalty term is chosen by a cross-validation technique to evaluate the out-of-sample negative log-likelihood. The elastic net combines the L1 and L2 penalizing terms and possesses a grouping effect, i.e., in a set of variables which have high pairwise correlations, the elastic net groups the correlated variables together .
where pi = Pr(y = 1|x).
We used a soft loss function which is robust to the influence of outliers. Resampling techniques and cross-validations were evaluated to reduce variability in the prediction measures. Bootstrap methods are introduced when an averaged prediction is made using multiple models generated on random resamples of the observations with replacement. Bootstrap resampling started with fitting the logistic model in a bootstrap sample of n subjects, which was drawn with replacements from the original sample. Zou (2006) proposed the adaptive lasso, which permits different weights for different parameters, and is also shown to have the oracle property . The comparisons of penalized regression methods in binary response and logistic regression such as the ridge, lasso, and elastic net were conducted. To estimate the coefficient, we performed generalized path seeker (GPS), a high speed lasso-style regression from Friedman (2012) to regularize regression . GPS demonstrates the regularized regression based on the generalized elastic net family of penalties. Software implementation of GPS is available in Salford Predictive Modeler version 7.0 from Salford Systems.
Multivariate Adaptive Regression Splines: MARS
where yˆ is the response described by the model, a0 the coefficient of the constant basis function (intercept), p the total number of basis functions and ak the coefficient of the kth basis function Bk(x). MARS models use hockey stick basis functions of the form (x − t)+ and (t − x)+, with t being the knot. The basis functions in MARS are single truncated spline functions or a product of two or more spline functions for different predictors. The first order MARS model was built without interactions to over-fit the training data. A maximum number of basis functions equal to 30 was used as the stopping criterion. The model was pruned using a ten-fold and generalized cross validation. The optimal model was selected based on evaluation of the model complexity and its predictive quantities for the test sets. Software implementation of MARS model is available in Salford Predictive Modeler version 7.0 from Salford Systems.
Random Forest (RF)
RF modifies the classic CART (Classification and regression tree) algorithm by randomly selecting from a subset of descriptors, rather than choosing the best split among all samples. This procedure is repeated until a sufficiently large number of trees have been computed. RF performs a bootstrapping cross-validation procedure in parallel with the training step of its algorithm. This method allows some of the data to be left out at each step, then used later to estimate the accuracy of the classifier after each instance has been carried out. Software implementation of the RF model is available in Salford Predictive Modeler version 7.0 from Salford Systems.
Nonlinear testing procedures
Modeling was performed to assess the linear or non-linear association of binary responses variables of selected variables. For each part predictor, we also examined the log-likelihood ratio test p-values comparing the deviance between the full model and the model without that variable. We calculated the projection (hat) matrix, Cook’s distance, various residuals and the estimated probabilities versus each predictor to evaluate outliers and identify influential points in the models. We used both the change in residual deviance (as in parametric or nonparametric models), and the area under the receiver operating characteristic curve (ROC) to compare of the performance of the statistical models. The area under the curve measures indicate the ability of the model to discriminate between the outcome groups. A score of 0.5 indicates that a model has no discriminatory ability and a score of 1 indicates the two groups are perfectly discriminated.
Results and discussion
In this study, we measured various blood chemistries and cytokine measurements that are associated with disease outcome. We next sought to identify those variables that were most informative for outcome. To improve the model performance, we sought to first reduce the dimensionality. Feature reduction removes meaningless features which are not related to a studied disease, thus leading to overfitting of a classifier in high dimensional data. This happens because without prior removal of irrelevant genes, the classification problem is known as the small sample size problem (the number of features far exceeds the number of samples in data set). Dimensionality can be reduced by selecting an appropriate subset among existing variables. For this purpose, we applied two sample t-test or where appropriate, a two-sample t-test on log transformed data. We addressed the multiple testing using false discovery rate (FDR), a re-sampling method have an improved ability to reject a false hypothesis compared to the Bonferroni correction.
To accomplish this, we performed variable selection using a SAM test. SAM is a widely used permutation-based approach to identifying differentially expressed genes when assessing statistical significance using FDR adjustment in high dimensional datasets . Efron, Tibshirani, Storey et al. (2001) developed an empirical bayesian approach using non-informative priors and deriving the posterior probability difference for each predictors without having to run t tests or Wilcoxon tests to identify which were differentially expressed . In some cases, a heuristic approach is investigated for feature selection by integrating correlation, histogram, and other statistical measures. We used the AIC information criterion on the modeling (Methods) with the false discovery adjustment to select the top 25 variables with the most significant p-values, based on the training data for each fold of a 10-fold cross validation. Using these variables, the training data were split into two parts: one to fit penalized logistic models and one to select tuning parameters.
Model assessment of the predictive power
TreeNet Gradient Boosting
MARS Basis Functions
(IL10 − 5.17)+
(Lymphocytes − 8)+
(Platelets − 45)+
GPS- generalized lasso estimators
Overall, this comparison indicated that IL-10, platelet count and lymphocytes were the three most important variables that consistently shown high relative importance across all models. We also evaluated the nonlinearity of these important predictors
Analysis of the model performance on the training data set using ROC measures indicated that the GPS model had an AUC of 0.92, the MARS model had an AUC of 0.79, and the TreeNet model had an AUC of 0.70 (Table 1).
Based on this ROC comparison, we concluded that the GPS model has the best prediction accuracy of the models tested. Another comparison of the model performance can be based on whether it will generalize well. In this analysis, we estimated the change in AUC in the test set relative to the training set. Here, the GPS model had the smallest difference in AUC (Table 1), suggesting that it will generalize better than the other models.
In this article, we compared several statistical methods for feature reduction and constructing a predictive modeling of developing dengue hemorrhagic fever disease. Our attempts to identify the most informative predictors is based on the assumption that different variable reduction techniques have inherent biases, resulting in the omission of potentially informative predictors and inclusion of uninformative ones. For example, the learning ensemble did not identify lymphocytes as an important variable, yet this variable was identified by the BMA methods. Model comparison suggests that GPS method with IL10, Lymphocytes, and Platelets outperforms other models tested on the basis of AUC and predicted ability to generalize. We recognize that our study is limited because of the small data set. Further evaluation of this modeling procedure on a large independent data set is needed. It also will be a meaningful study to explore the effect of sample size on the performance of various methods.
Availability of data
Data are available on the NIAID Clinical Proteomics Center Web site at:
This research was supported by the NIAID Clinical Proteomics Center, HHSN272200800048C (ARB), NIH-NHLBI-HHSN26820 1000037C NHLBI Proteomics Center for Airway Inflammation (ARB), and the Clinical and Translational Science Award (UL1TR000071) from the National Center for Advancing Translational Sciences, NIH (ARB).
- Brasier AR, Garcia J, Wiktorowicz JE, et al: Discovery proteomics and nonparametric modeling pipeline in the development of a candidate biomarker panel for dengue hemorrhagic fever. Clin Transl Sci. 2012, 5 (1): 8-10. 10.1111/j.1752-8062.2011.00377.x.PubMedPubMed CentralView ArticleGoogle Scholar
- Brasier AR, Ju H, Garcia J, et al: A three-component biomarker panel for prediction of dengue hemorrhagic fever. The American Society of Tropical Medicine and Hygiene. 2012, 86 (2): 341-348. 10.4269/ajtmh.2012.11-0469.View ArticleGoogle Scholar
- Akaike H: A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974, 19 (6): 716-723. 10.1109/TAC.1974.1100705.View ArticleGoogle Scholar
- Schwarz G: Estimating the dimension of a model. Ann Stat. 1978, 6 (2): 461-464. 10.1214/aos/1176344136.View ArticleGoogle Scholar
- Friedman JH: Greedy function approximation: A gradient boosting machine. Ann Stat. 2001, 2: 1189-1232.View ArticleGoogle Scholar
- Friedman JH: Stochastic gradient boosting: Nonlinear methods and data mining. Computational Statistics and Data Analysis. 2002, 38: 367-378. 10.1016/S0167-9473(01)00065-2.View ArticleGoogle Scholar
- Tibshirani R: Regression shrinkage and selection via the lasso. J R Stat Soc. Series B. 1996, 58 (1): 267-288.Google Scholar
- Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Stat Soc. Series B. 2005, 67 (2): 301-320. 10.1111/j.1467-9868.2005.00503.x.View ArticleGoogle Scholar
- Zou H: The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006, 101 (476): 1418-1429. 10.1198/016214506000000735.View ArticleGoogle Scholar
- Friedman JH: Fast sparse regression and classification. Int J Forecasting. 2012, 28 (3): 722-738. 10.1016/j.ijforecast.2012.05.001.View ArticleGoogle Scholar
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci. 2001, 98 (9): 5116-5121. 10.1073/pnas.091062498.PubMedPubMed CentralView ArticleGoogle Scholar
- Efron B, Tibshirani R, Storey JD, et al: Empirical bayes analysis of a microarray experiment. J Am Stat Assoc. 2001, 96: 1151-1160. 10.1198/016214501753382129.View ArticleGoogle Scholar
- Breiman L: Bagging predictors. Mach Learn. 1996, 24 (2): 123-140.Google Scholar
- Fridley BL: Bayesian variable selection and model selection methods for genetic association studies. Genet Epidemiol. 2009, 33: 27-37. 10.1002/gepi.20353.PubMedView ArticleGoogle Scholar
- George E, McCulloch R: Variable selection via gibbs sampling. J Am Stat Assoc. 1993, 88 (423): 881-889. 10.1080/01621459.1993.10476353.View ArticleGoogle Scholar
- Albert J, Chib S: Bayesian analysis of binary and polychotomous response data. J Am Stat Assoc. 1993, 88 (422): 669-679. 10.1080/01621459.1993.10476321.View ArticleGoogle Scholar
- Friedman JH: Multivariate adaptive regression splines. Ann Stat. 1991, 19 (1): 1-67. 10.1214/aos/1176347963.View ArticleGoogle Scholar
- Breiman L: Random Forests. 2001, Berkeley, CA: University of CaliforniaGoogle Scholar
- Hastie T, Tibshirani R: Generalized additive models. Stat Sci. 1986, 1 (3): 297-318. 10.1214/ss/1177013604.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.