Performance of variable selection methods using stability-based selection
 Danny Lu^{1},
 Aalim Weljie^{2},
 Alexander R. de Leon^{3},
 Yarrow McConnell^{4},
 Oliver F. Bathe^{5, 6} and
 Karen Kopciuk^{2, 5, 7}
DOI: 10.1186/s13104-017-2461-8
© The Author(s) 2017
Received: 9 May 2016
Accepted: 17 March 2017
Published: 4 April 2017
Abstract
Background
Variable selection is frequently carried out during the analysis of many types of high-dimensional data, including those in metabolomics. This study compared the predictive performance of four variable selection methods using stability-based selection, a new secondary selection method that is implemented in the R package BioMark. Two of these methods were evaluated using the more well-known false discovery rate (FDR) as well.
Results
Simulation studies varied factors relevant to biological data studies, with results based on median values of the partial area under the receiver operating characteristic curve (pAUC) across 200 simulated datasets. No single method performed best across all factor settings, but the Student t test, based on stability selection or with FDR adjustment, and the variable importance in projection (VIP) scores from partial least squares regression models obtained using a stability-based approach tended to perform well in most settings. Similar results were found with a real spiked-in metabolomics dataset. Group sample size, group effect size, number of significant variables and correlation structure were the most important factors, whereas the percentage of significant variables was the least important.
Conclusions
Researchers can improve prediction scores for their study data by choosing VIP scores based on stability variable selection over the other approaches when the number of variables is small to modest, and by increasing the number of samples even moderately. When the number of variables is high and there is block correlation amongst the significant variables (i.e., true biomarkers), the FDR-adjusted Student t test performed best. The R package BioMark is an easy-to-use open-source program for variable selection that had excellent performance characteristics for the purposes of this study.
Keywords
Stability-based variable selection; False discovery rate (FDR); High-dimensional biological data; Partial area under the receiver-operating characteristic curve (pAUC); Variable importance in projection (VIP)
Background
Variable selection is an important first step in the analysis of diverse chemical data, where often the goal is to identify a subset of measured variables that can distinguish between two or more different groups. Including all measured variables is impractical and leads to reduced precision of model estimates and to overfitting in most analytical methods [1]. Many variable selection methods have been developed for high- and ultra-high-dimensional data settings and for a variety of modelling approaches and data types [2–4].
The R package BioMark [5, 6] includes these popular variable selection methods: the Student t test, Variable Importance in Projection (VIP) scores [7, 8] from Partial Least Squares-Discriminant Analysis (PLS-DA) models, the Least Absolute Shrinkage and Selection Operator (LASSO) [9], and the Elastic Net [10, 11]. Each method has different strengths and weaknesses for identifying significant variables often found in biological data, such as metabolomics data, and possibly for modelling them. Models for such data should be able to handle multicollinearity in the measured variables, small n, large p cases (i.e., more variables than samples), sparsity (i.e., few significant variables), and multiple variables in a regression context, and should be easy to interpret.
Resampling approaches used for prediction with variable selection methods tend to perform poorly when the numbers of samples within groups are small. Stability-based selection is a new general approach that can be used with several different analytical methods, from Student t tests to various regression techniques [12, 13]. It is similar to multiple-testing methods like the FDR [14] and q-values [15, 16], as these approaches all employ secondary selection based on an initial evaluation of the variables. Stability-based selection operates by repeatedly taking subsets of variables and samples from the full dataset, and then estimating and ranking the coefficients, scores or P values generated by the chosen analytical method in each perturbed dataset. If a variable falls within a fixed number of top variables in a high fraction of the perturbed datasets, it is deemed stable and, therefore, a potential significant variable (i.e., a biomarker). Variables appearing by chance in only a few perturbations will not be consistent indicators of class differences when the results are averaged, so they are not selected. Thus, stability-based selection, like the jackknife [17], perturbs the data to identify those variables that are consistently selected as group-difference indicators [5, 12, 13], thereby improving prediction.
Performance of the variable selection methods, based on their selected significant variables, can be evaluated using Receiver Operating Characteristic (ROC) curves when the real significant variables (i.e., true biomarkers) are known, as with simulated and spiked-in data. ROC curves are generated by starting with the first selected biomarker, sequentially including the remaining ones, and plotting the proportion of false positives (x-axis) against the proportion of true positives (y-axis) at each step. The Area Under the ROC Curve (AUC) summarizes, on a scale between zero and one, the performance of the set of significant variables selected by a method; that is, the AUC measures how well a random pair of samples, one from each group, is correctly classified. The higher the AUC, the better the biomarker classification method performs. Instead of calculating the AUC for the whole curve, the partial AUC (pAUC) is often calculated [18–20], since most of the true significant variables are usually selected first, without too many false ones. Restricting the calculation of the AUC to a smaller range of values of the false positive rate (i.e., higher specificity) is appropriate for diagnostic and other medical tests based on biomarkers for use in a clinical situation [21], and is the metric adopted here.
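As a concrete illustration, the pAUC of a ranked variable list can be computed directly when the true biomarkers are known. The function below is a generic sketch (not taken from BioMark or the paper's R code): it walks down the ranking, accumulates the ROC curve, and integrates it up to a false-positive-rate cutoff, rescaling by the cutoff so a perfect ranking scores 1.

```python
def partial_auc(ranking, truth, n_vars, max_fpr=0.2):
    """pAUC of a ranked list of variable indices, integrated over the ROC
    curve for false positive rates up to `max_fpr` and rescaled to [0, 1].

    ranking: all n_vars variable indices, best first
    truth:   indices of the true biomarkers"""
    truth = set(truth)
    n_pos, n_neg = len(truth), n_vars - len(truth)
    tp = fp = 0
    area, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for v in ranking:
        if v in truth:
            tp += 1
        else:
            fp += 1
        fpr, tpr = fp / n_neg, tp / n_pos
        if fpr >= max_fpr:
            # interpolate the curve at the cutoff, add the last slice, stop
            if fpr > prev_fpr:
                tpr = prev_tpr + (tpr - prev_tpr) * (max_fpr - prev_fpr) / (fpr - prev_fpr)
            fpr = max_fpr
            area += (fpr - prev_fpr) * (prev_tpr + tpr) / 2
            break
        area += (fpr - prev_fpr) * (prev_tpr + tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
    return area / max_fpr

# A ranking listing all 10 true biomarkers first is perfect; one that
# lists them last scores 0 at a false-positive-rate cutoff of 0.2.
best = list(range(10)) + list(range(10, 50))
worst = list(range(10, 50)) + list(range(10))
print(partial_auc(best, range(10), 50), partial_auc(worst, range(10), 50))
```

Only the order of selection matters here, which is why the pAUC rewards methods that place the true biomarkers ahead of the noise variables.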
The objective of this study was to evaluate the performance of four popular variable selection methods using the robust stability-based selection criterion, and of two of these methods (VIP and Student t test) with an FDR adjustment, to identify significant variables. Our evaluation metric for each method was the pAUC, which assesses predictive performance on model-based simulated biological data.
Methods
Simulation study design
To evaluate the variable selection methods, we used model-based simulated data that mimicked biological data that have undergone the preprocessing and pretreatment steps of a metabolomics analysis pipeline [22]. By varying several biological and experimental factors likely to impact variable selection, we could systematically evaluate their effects across the methods we considered, since we knew the identity of the true significant variables.
Metabolomics data are typically right-skewed and their ranges can vary substantially between individual metabolites. Logarithmic or other transformations are used, along with centering and scaling the data to unit variance, before analysis [22]. The simulated datasets were generated assuming these pre-analytical steps had been followed, resulting in standard multivariate normal distributions, i.e., the joint distribution of correlated univariate normal variables with zero means and unit variances (scaled data). The parameters that were varied included the combined study sample size N = 50, 100; the number P = 50, 200, 1000 of measured variables; the percentage Q = 10, 15, 20% of significant variables; the effect size (or mean abundance or signal) Δ = 0.2, 0.4, 0.8 in the treatment (or disease) group; and the correlation structure. Similar to Wehrens et al. [13], the following correlation structures were adopted: (1) independent (i.e., pairwise and higher-order correlations were zero), (2) block correlation (i.e., correlation between significant variables was 0.7, between non-significant variables was 0.1, and between blocks was zero), and (3) autoregressive of order 1 [AR(1), with correlation ρ^|i−j| between variables i and j, for ρ = 0.5].
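The three correlation structures and the group mean shift are straightforward to reproduce. The sketch below uses Python with NumPy rather than the authors' R code, and the function names are hypothetical: it generates one dataset of scaled multivariate normal variables with an independent, block, or AR(1) correlation matrix, then adds a mean shift of Δ to the significant variables in the treatment group.

```python
import numpy as np

def make_corr(p, n_sig, structure, rho=0.5):
    """Correlation matrices from the simulation design: independent
    (identity), block (0.7 among significant variables, 0.1 among
    non-significant ones, zero between blocks), and AR(1) with
    correlation rho**|i - j| between variables i and j."""
    if structure == "independent":
        return np.eye(p)
    if structure == "ar1":
        idx = np.arange(p)
        return rho ** np.abs(idx[:, None] - idx[None, :])
    if structure == "block":
        R = np.zeros((p, p))
        R[:n_sig, :n_sig] = 0.7      # significant-variable block
        R[n_sig:, n_sig:] = 0.1      # non-significant block
        np.fill_diagonal(R, 1.0)
        return R
    raise ValueError(f"unknown structure: {structure}")

def simulate_dataset(n_per_group, p, n_sig, delta, structure, rng):
    """Two equal-sized groups of scaled (zero-mean, unit-variance)
    variables; the first n_sig variables get a mean shift of delta
    in the treatment group."""
    R = make_corr(p, n_sig, structure)
    X = rng.multivariate_normal(np.zeros(p), R, size=2 * n_per_group)
    y = np.repeat([0, 1], n_per_group)
    X[y == 1, :n_sig] += delta
    return X, y

rng = np.random.default_rng(1)
X, y = simulate_dataset(25, 50, 5, 0.8, "block", rng)
```

Placing the significant variables first is only a labelling convenience; any selection method sees the columns in arbitrary order.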
Combinations of the various parameter values resulted in a total of 36 distinct parameter configurations for each correlation structure. For each configuration, 200 datasets were simulated, and results are based on medians across these datasets. For LASSO and Elastic Net, the mixing parameter α was set at 1 and 0.5, respectively, and the value of the regularization parameter λ was chosen as the first value at which the number P × Q of significant variables was selected, or the maximum number of variables when fewer were identified. Two components were adopted for the PLS-DA models, and the BioMark default values for the percentage of variables included in the subgroups (variable.fraction) and the percentage of samples removed per group (oob.size) were used. A default top fraction ntop = 10 and a stringent consistency threshold min.present = 0.5 (i.e., 50%) [13] were used, so that a selected variable had to be among the top 10 variables in at least half of the 200 resampled datasets. In every setting, two groups of equal sample sizes were used to compare the ability of the four stability-based and two FDR-adjusted selection methods to correctly identify the significant variables associated with the treatment group. The pAUC range was set at 0.2, which corresponds to a false positive fraction of 0.2 or, equivalently, a specificity of 80%. The R code used to generate these data and the results can be found here: https://people.ucalgary.ca/~kakopciu/Simulated Metabolomics Biomarkers R Code.docx.
Spiked-in metabolomics dataset
Reference and measured identification parameters for spiked metabolites, with the range of serum concentrations and injected amounts over the dilution series

| Metabolite species | Concentration range (μM) | Injection amount range (ng) | GOLM RI^a | GOLM select m/z ions | Measured mean RI | Measured m/z ions | RSD^a (%) |
|---|---|---|---|---|---|---|---|
| **Solution 1** | | | | | | | |
| Glycine (3TMS)^a | 200–300 | 0.6–3.7 | 1302.7 | 174, 248, 276, 100, 86 | 1305.7 | 174, 248, 147 | 101 |
| Serine (3TMS) | 100–150 | 0.4–2.6 | 1352.8 | 204, 218, 278, 306, 100 | 1357.9 | 204, 218, 147 | 106 |
| Threonine (3TMS) | 80–120 | 0.4–2.4 | 1377.2 | 219, 291, 218, 117, 320 | 1382.9 | 291, 218, 117 | 122 |
| Aspartic acid (3TMS) | 26.5–41 | 0.1–0.8 | 1511.2 | 232, 218, 306, 202, 334 | 1512.7 | 232, 218, 100 | 70 |
| **Solution 2** | | | | | | | |
| Alanine (2TMS) | 187 | 0.6–3.7 | 1108.6 | 116, 190, 218, 100, 233 | 1098.9 | 116, 204, 118 | 100 |
| Valine (2TMS) | 50–300 | 0.5–2.9 | 1207.1 | 144, 218, 156, 246, 100 | 1209.9 | 144, 218, 72 | 116 |
| Lysine (4TMS) | 33–200 | 0.4–2.4 | 1881.2 | 156, 174, 317, 230, 434 | 1915.1 | 156, 174, 317 | 94 |
| Pyroglutamic acid (2TMS) | 8–50 | 0.1–0.5 | 1650.4 | 156, 258, 230, 140, 273 | 1516.6 | 156, 258, 147 | 20 |
Findings
In the independent correlation setting (centre panels in all figures), all selection methods provided nearly identical results, except when the number of variables and the effect size were both high (P = 1000, Δ = 0.8). The study parameters with the greatest effect on the pAUC values were the combined sample size N (Figs. 1 vs. 2, 3 vs. 4) and the effect size Δ (Figs. 1 vs. 3, 2 vs. 4). Doubling the combined sample size from 50 (i.e., 25 per group) to 100 (i.e., 50 per group) increased the pAUC values by at least 0.15 when Δ was 0.4 or 0.8, but had no effect when Δ was 0.2 (results not shown). Doubling the effect size from 0.4 to 0.8 (Figs. 1 vs. 3, 2 vs. 4) increased the pAUC values by at least 0.2. The pAUC values increased the most when both N and Δ were increased, for all combinations of P and Q, compared to when Δ was 0.2 (not all results shown). In the setting with P = 1000 variables and a large effect size (Δ = 0.8), the FDR-adjusted Student t test and the Elastic Net had pAUC values that were 0.15 to 0.4 higher than those of the other four methods.
In the AR(1) correlation setting (left panels in all figures), all selection methods provided results quite similar to those for the independent correlation setting; however, the Elastic Net and LASSO had slightly lower pAUC values than in the independent correlation setting. For P = 50 or 200 variables, the FDR-adjusted Student t test, or the Student t test and VIP obtained using the stability-based approach, had pAUC values that were generally higher by at least 0.1 than those of the other methods. As in the independent correlation setting, when either the number of variables or the effect size was high, the FDR-adjusted Student t test and the Elastic Net had the highest pAUC values.
In the block correlation setting (right panels in all figures), the six selection methods provided very different results when the effect size was greater than 0.2. The Elastic Net and LASSO consistently ranked lowest in pAUC values for any P, any Q (i.e., percentage of significant variables), any N, and modest or high Δ. The VIP scores based on an FDR adjustment had higher pAUC values when Q was 0.1, but performed poorly as Q increased to 0.2 and 0.3. As Δ increased, the pAUC values for both the unadjusted and FDR-adjusted Student t tests and the stability-based VIP scores increased substantially: twofold to fourfold when Δ was doubled from 0.2 to 0.4 (results not shown) and from 0.4 to 0.8 when N was 50, but generally only when Q was low (0.1) to modest (0.2). When N was 100, a twofold increase was observed when Δ was doubled from 0.2 to 0.4 (results not shown), with less dramatic but still substantial increases (0.15 to 0.45) when it was doubled from 0.4 to 0.8. The only scenario where the VIP scores tended to perform worse in the block correlation setting was when Q was high (0.3) and P was at least 200. Both the unadjusted and FDR-adjusted Student t tests were less affected by increasing P and high Q, especially at the greatest effect size (Δ = 0.8). However, when P was high (1000) and Q was greater than 0.1, the FDR-adjusted Student t test outperformed all other methods.
Comparison of results between stability-based and FDR-adjusted methods for the spiked-in dataset for four selection methods (true biomarkers in italics)

| Method | Test statistic | Biomarkers selected | pAUC^a, sensitivity | 1 − Specificity |
|---|---|---|---|---|
| Stability (0.5) | VIP | *Gly Ser Thr Ala Val Lys* 20 36 24 13 42 23 41 44 57 19 *Asp* | 0.875 | 0.169 |
| | Student t test | *Gly Ser Thr Ala Val Lys* 20 36 13 23 42 41 *Asp* 44 57 24 48 | 0.875 | 0.169 |
| | Lasso | *Gly Ala Val* 20 23 *Thr* 21 | 0.5 | 0.034 |
| | Elastic net | *Gly Ala Val* 23 20 *Thr* | 0.5 | 0.051 |
| FDR | Student t test, P values <0.05 | *Gly Ser Thr Ala Val Lys* 25 | 0.75 | 0.034 |
| | VIP, adjusted P values <0.05 | – | 0 | 0 |
Conclusions
The results from this study add to the literature on variable selection methods in several ways. Multivariate approaches such as VIP scores, which incorporate correlations across the variables, should theoretically outperform simple univariate methods such as the Student t test when the effect size is low and the significant variables are correlated [25–27]. This was confirmed in this study for the stability-based version of the VIP method in the block correlation setting when the effect size was low (Δ = 0.4), but only when the number of measured variables (P = 50) and the percentage of significant variables (Q < 20%) were both low. The FDR-adjusted version of the VIP method also performed well in the block correlation setting when the effect size was low (Δ = 0.4), but when the number of measured variables was higher (P = 200, 1000) and the percentage of significant variables was very low (Q = 10%). In addition, variable selection based on the Student t test has previously been shown to perform well in high-dimensional data settings when variables are strongly associated with the class label [28]; this was also confirmed in this study when the effect sizes were large (Δ = 0.8) and the number of variables was larger than the sample size (≥100). Chong and Jun [8] also found that selection based on VIP scores performed better than LASSO using similar experimental factors but different performance metrics. The Elastic Net was expected to outperform LASSO when variables are highly correlated [11], which was the case in the simulated data with AR(1) and block correlation, and in the independent correlation setting only when the effect size and number of variables were both large.
If any correlation is present between the significant variables, using the stability-based VIP scores from a PLS-DA model resulted in better prediction when the effect size was low to modest (0.2–0.4). Chemometrics methods are increasingly being applied in biotechnological processes [30], and the results from this study support using the VIP scores from the projection-based PLS-DA models with the stability-based method when effect sizes are not large and the number of variables is low or modest (≤200).
Finally, the findings from this study provide new results for variable selection methods, especially for the stability-based variable selection approach, which has not previously been extensively evaluated for LASSO and Elastic Net [5, 12]. The stability-based Student t test and VIP scores outperformed the Elastic Net and LASSO in all parameter configurations except when the effect size and the number of variables were both large. In general, the stability-based VIP scores performed similarly to or better than the Student t test, while the Elastic Net performed the same as or slightly better than LASSO. When the number of variables was very large, the FDR-adjusted Student t test performed consistently well.
The objective of this paper was to provide guidance on popular variable selection methods already available in R and readily accessible to any researcher. The R package BioMark provides several additional variable selection methods, including principal components, for identifying important variables to classify new observations into one of two groups. The stability-based selection approach adopted in this study avoids overfitting and is robust even when the sample size in each group is small. Our results suggest using the VIP scores over the other three stability-based variable selection methods, as they generally provide the highest pAUC values. The VIP scores were the best performing method when the number of variables was low, and especially when the effect size was modest (≤0.4). When there is a large number of variables (P = 1000) and block correlation is present, the FDR-adjusted Student t test performed best, even in the very-small-sample-size setting (10 per group); thus, it should be the preferred approach in this high variable-to-sample ratio setting. Doubling the sample size from small (50 observations in total) to modest (100 observations in total) tended to increase the pAUC for any method and any correlation structure with at least some signal in the data (Δ ≥ 0.2). Future research should examine ultra-high-dimensional data settings to see if similar findings hold, and should explore the performance of the regularization methods Elastic Net and LASSO across a wider range of sparse data settings. Finally, extending the stability-based selection approach to other study designs (e.g., three or more groups) would increase its usefulness and widen its applicability.
Abbreviations
Ala: alanine
AR(1): autoregressive of order 1
Asp: aspartic acid
AUC: area under the ROC curve
GC–MS: gas chromatography–mass spectrometry
Gly: glycine
LASSO: least absolute shrinkage and selection operator
Lys: lysine
NIST: National Institute of Standards and Technology
pAUC: partial area under the receiver-operating characteristic curve
PLS-DA: partial least squares-discriminant analysis
Pyr: pyroglutamic acid
RI: retention index
ROC: receiver operating characteristic
RSD: relative standard deviation
Ser: serine
Thr: threonine
TMS: trimethylsilyl
Val: valine
VIP: variable importance in projection
Declarations
Authors’ contributions
KK conceived of the project idea, analysed the spiked-in data, and wrote the manuscript. DL carried out the simulation studies and prepared the figures. AW designed the spiked-in dataset study as part of YM's MSc thesis, co-supervised YM, who generated the GC–MS spiked-in dataset, proposed experimental parameters to evaluate in the simulation study, and revised the manuscript. AD contributed to the simulation study design and the methods to be evaluated, and revised the manuscript. OB contributed to the design of the spiked-in dataset as YM's thesis co-supervisor, contributed to the design of the simulation study, and revised the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The spiked-in dataset supporting the conclusions of this article is available here: https://people.ucalgary.ca/~kakopciu/Spikedin Data Set BMC Notes paper.csv. Additional details on our experimental protocol and the dataset used in this study are available here: https://people.ucalgary.ca/~kakopciu/Steps in designing and preparing the spikedin Data set.pdf. The R software code written explicitly to generate the simulated data, identify true biomarkers, and generate the AUCs to compare various parameter settings can be found here: https://people.ucalgary.ca/~kakopciu/Simulated Metabolomics Biomarkers R Code.docx. The program is written in the R programming language, which is licensed under the GNU GPL. R can be run on these operating systems: Unix (including FreeBSD and Linux), Windows, and MacOS. It requires the contributed R package BioMark, which will load the Matrix, foreach, pls, MASS, st, sda, entropy, corpcor, fdrtool and glmnet packages.
Ethics and consent to participate
Standardized pooled human serum was obtained from the National Institute of Standards and Technology for the spiked-in data experiment (http://www.nist.gov/srm/index.cfm). Thus, ethics approval and consent to participate are not applicable.
Funding
This work was supported by an Undergraduate Student Research Award by the Natural Sciences and Engineering Research Council (NSERC) of Canada to Danny Lu and an NSERC Discovery Grant to A. R. de Leon. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
 1. Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Stat Sin. 2010;20(1):101–48.
 2. Andersen CM, Bro R. Variable selection in regression: a tutorial. J Chemom. 2010;24(11–12):728–37.
 3. Kang Y, Billor N. Variable selection in the Chlamydia pneumoniae lung infection study. J Data Sci. 2013;11(2):371–87.
 4. Clarke R, Ressom HW, Wang A, Xuan J, Liu MC, Gehan EA, Wang Y. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer. 2008;8(1):37–49.
 5. Wehrens R, Franceschi P. Meta-statistics for variable selection: the R package BioMark. J Stat Softw. 2012;51(10):1–18.
 6. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2016. ISBN 3-900051-07-0. http://www.R-project.org.
 7. Wold S, Sjostrom M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemom Intell Lab Syst. 2001;58(2):109–30.
 8. Chong IG, Jun CH. Performance of some variable selection methods when multicollinearity is present. Chemom Intell Lab Syst. 2005;78(1–2):103–12.
 9. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B (Methodological). 1996;58(1):267–88.
 10. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Statistical Methodology). 2005;67(2):301–20.
 11. Cho S, Kim K, Kim YJ, Lee JK, Cho YS, Lee JY, Han BG, Kim H, Ott J, Park T. Joint identification of multiple genetic variants via elastic-net variable selection in a genome-wide association analysis. Ann Hum Genet. 2010;74:416–28.
 12. Meinshausen N, Buhlmann P. Stability selection. J R Stat Soc Ser B (Statistical Methodology). 2010;72:417–73.
 13. Wehrens R, Franceschi P, Vrhovsek U, Mattivi F. Stability-based biomarker selection. Anal Chim Acta. 2011;705(1):15–23.
 14. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodological). 1995;57(1):289–300.
 15. Storey JD. A direct approach to false discovery rates. J R Stat Soc Ser B (Statistical Methodology). 2002;64(3):479–98.
 16. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci. 2003;100(16):9440–5.
 17. Karaman İ, Qannari EM, Martens H, Hedemann MS, Knudsen KEB, Kohler A. Comparison of sparse and jackknife partial least squares regression methods for variable selection. Chemom Intell Lab Syst. 2013;122:65–77.
 18. Walter SD. The partial area under the summary ROC curve. Stat Med. 2005;24(13):2025–40.
 19. Ma H, Bandos AI, Rockette HE, Gur D. On use of partial area under the ROC curve for evaluation of diagnostic performance. Stat Med. 2013;32(20):3449–58.
 20. Hsu MJ, Chang YC, Hsueh HM. Biomarker selection for medical diagnosis using the partial area under the ROC curve. BMC Res Notes. 2014;7(1):1.
 21. Pepe M, Janes H. Methods for evaluating prediction performance of biomarkers and tests. In: Lee MLT, Gail M, Pfeiffer R, Satten G, Cai T, Gandy A, editors. Risk assessment and evaluation of predictions. Berlin: Springer; 2013. p. 107–42.
 22. Goodacre R, Broadhurst D, Smilde AK, Kristal BS, Baker JD, Beger R, Bessant C, Connor S, Capuani G, Craig A. Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics. 2007;3(3):231–41.
 23. McConnell Y. Serum metabolomics: development and validation of a new diagnostic test for pancreatic cancer. Calgary; 2012 (unpublished MSc thesis).
 24. Franceschi P, Masuero D, Vrhovsek U, Mattivi F, Wehrens R. A benchmark spike-in data set for biomarker identification in metabolomics. J Chemom. 2012;26(1–2):16–24.
 25. Fonville JM, Richards SE, Barton RH, Boulange CL, Ebbels T, Nicholson JK, Holmes E, Dumas ME. The evolution of partial least squares models and related chemometric approaches in metabonomics and metabolic phenotyping. J Chemom. 2010;24(11–12):636–49.
 26. Madsen R, Lundstedt T, Trygg J. Chemometrics in metabolomics: a review in human disease diagnosis. Anal Chim Acta. 2010;659(1):23–33.
 27. Saccenti E, Hoefsloot HC, Smilde AK, Westerhuis JA, Hendriks MM. Reflections on univariate and multivariate analysis of metabolomics data. Metabolomics. 2014;10(3).
 28. Hua J, Tembe WD, Dougherty ER. Performance of feature-selection methods in the classification of high-dimension data. Pattern Recogn. 2009;42(3):409–24.
 29. Chu C, Hsu AL, Chou KH, Bandettini P, Lin C. Does feature selection improve classification accuracy? Impact of sample size and feature selection on classification using anatomical magnetic resonance images. Neuroimage. 2012;60(1):59–70.
 30. Rathore AS, Bhushan N, Hadpe S. Chemometrics applications in biotech processes: a review. Biotechnol Prog. 2011;27(2):307–15.