
Estimating misclassification error: a closer look at cross-validation based methods



To estimate a classifier’s error in predicting future observations, bootstrap methods have been proposed as reduced-variation alternatives to traditional cross-validation (CV) methods based on sampling without replacement. Monte Carlo (MC) simulation studies aimed at estimating the true misclassification error conditional on the training set are commonly used to compare CV methods. We conducted an MC simulation study to compare a new method of bootstrap CV (BCV) to k-fold CV for estimating classification error.


For the low-dimensional conditions simulated, the modest positive bias of k-fold CV contrasted sharply with the substantial negative bias of the new BCV method. This behavior was corroborated using a real-world dataset of prognostic gene-expression profiles in breast cancer patients. Our simulation results demonstrate some extreme characteristics of variance and bias that can occur due to a fault in the design of CV exercises aimed at estimating the true conditional error of a classifier, and that appear not to have been fully appreciated in previous studies. Although CV is a sound practice for estimating a classifier’s generalization error, using CV to estimate the fixed misclassification error of a trained classifier conditional on the training set is problematic. While MC simulation of this estimation exercise can correctly represent the average bias of a classifier, it will overstate the between-run variance of the bias.


We recommend k-fold CV over the new BCV method for estimating a classifier’s generalization error. The extreme negative bias of BCV is too high a price to pay for its reduced variance.



Class prediction involves the use of statistical learning techniques to develop algorithms for classifying unknown samples through supervised learning on samples of known class. In assessing the performance of a classification algorithm, the goal is to estimate its ability to generalize, i.e., to predict the outcomes of samples not included in the data set used to train the classifier. The performance may be assessed on the basis of a number of different indices. For problems having a dichotomous outcome variable (e.g., positive or negative), the sensitivity, specificity, positive predictive value and negative predictive value are indices that may be of interest in addition to the overall prediction accuracy [1]. In this paper, attention is focused on the overall prediction accuracy, or equivalently, on its counterpart, the prediction error.

Cross-validation (CV) is a widely used method for performance assessment in class prediction [24]. With k-fold CV, a data set of n samples is randomly divided into k subsets each having (approximately) n/k samples. Each of these k subsets serves in turn as a test set. For each of these k test sets of size n/k, a classifier is trained on the remaining (k-1)×(n/k) observations (the training set). The trained classifier is then used to classify the n/k samples in the test set, and the prediction error (perhaps, along with other indices) is calculated. The combined value of the prediction error over the k test sets, which is based on the prediction of all n samples one time each, is the cross-validated estimate of that error. Generally, several replicates of k-fold cross-validation are performed based on different random permutations of the n samples in order to account for the random resampling variance, and the average and standard deviation of these replicates are used to assess the performance of the classifier [5, 6]. When k = n, the exercise is called leave-one-out cross-validation (LOOCV); there is only one unique way to do LOOCV and, hence, it cannot be replicated. A common choice of k is 10, and 10 to 30 replicates of 10-fold CV have been shown to be sufficient to achieve stable values of the prediction error [7].
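The k-fold procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names, the toy midpoint classifier and the simulated data are hypothetical and are not taken from the study, which used QDA in R.

```python
import random

def k_fold_indices(n, k, rng):
    """Randomly partition indices 0..n-1 into k folds of (approximately) n/k each."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[j::k] for j in range(k)]

def k_fold_cv_error(xs, ys, k, fit, predict, rng):
    """Pooled misclassification rate: every sample is predicted exactly once."""
    n = len(xs)
    wrong = 0
    for test in k_fold_indices(n, k, rng):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        wrong += sum(predict(model, xs[i]) != ys[i] for i in test)
    return wrong / n

# Hypothetical toy classifier: cut point halfway between the two class means.
def fit_midpoint(xs, ys):
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (m0 + m1) / 2, m1 >= m0

def predict_midpoint(model, x):
    cut, one_is_upper = model
    return int((x > cut) == one_is_upper)

# Made-up data: two univariate Gaussian classes, 25 samples each.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(25)] + [rng.gauss(3, 1) for _ in range(25)]
ys = [0] * 25 + [1] * 25

# Ten replicates of 10-fold CV, each on a fresh random partition.
replicates = [k_fold_cv_error(xs, ys, 10, fit_midpoint, predict_midpoint, rng)
              for _ in range(10)]
mean_error = sum(replicates) / len(replicates)
```

Averaging over replicates, as in the last two lines, is what accounts for the random resampling variance; with k = n the partition is unique and replication adds nothing.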

Before 10-fold CV became popular, efforts were directed toward reducing the variability of LOOCV, recognizing that it gave nearly unbiased estimates of the prediction error [8]. The .632 and .632+ bootstrap methods are well known alternatives to LOOCV [9]. Recently, Fu, Carroll and Wang [10] introduced a new bootstrap version of LOOCV (bootstrap cross-validation or BCV), which they compared to LOOCV and to the .632 bootstrap method (BT632) on problems with low-dimensional predictor spaces. Like Efron and Tibshirani [9], Fu et al. [10] used a mean squared error (MSE) represented by the mean squared bias (MSB) over N Monte Carlo simulations (discussed in Methods Section) as the primary criterion for evaluating estimators of the true conditional error, i.e., the true misclassification error of the trained classifier conditional on the training set [9]. These and similar investigations into estimating the true conditional error via cross-validation (e.g., see [7, 11]) have been interpreted as assessing a classifier’s error in predicting future observations, i.e., its generalization error [8, 9]. It is argued in this paper that while cross-validation is a sound, generally-accepted method for evaluating a classifier’s generalization error, it may be problematic to use cross-validation to assess this generalizability in terms of estimating a true conditional error defined as a single fixed quantity for a given set of data. With that approach the variance of cross-validation will tend to be overstated, even though its bias can still be appropriately characterized, as will be shown in this paper via Monte Carlo simulation and will be explained more fully in the Discussion.

While Efron and Tibshirani used the traditional absolute scale to calculate the MSB and its square root (the root mean square or RMS in their notation), Fu et al. focused on what they termed the mean squared ‘relative’ error (MSRE), stating that calculations on the absolute scale gave similar results. Here, the mean squared error and associated quantities calculated on the absolute scale are used.

The purpose of this paper is to report the results of a more extensive comparison of BCV to conventional CV done via a simulation study like that of Fu et al. [10], based on k-fold CV in addition to LOOCV, and to use those results to fuel a discussion of several issues related to cross-validation. Finally, the performance of BCV and k-fold CV is demonstrated for a real-world data set by classifying patients with breast cancer according to prognosis based on their gene-expression profiles.


Mean squared error

In order to facilitate the definition of terms, suppose for the moment that k-fold cross-validation (k CV) will be used to assess a classifier’s true conditional error rate based on the results of N Monte Carlo simulations. Let k < n, where n is the sample size, and assume that k CV is repeated R times. Then the MSE for the ith simulation is given by

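$$\mathrm{MSE}_i = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{e}_{ri} - e_i\right)^2 = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{e}_{ri} - \bar{e}_{Ri}\right)^2 + \left(\bar{e}_{Ri} - e_i\right)^2, \tag{1}$$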

where $e_i$ denotes the true conditional error for the ith simulation, $\hat{e}_{ri}$ denotes the rth k CV estimate of $e_i$ for the ith simulation, and $\bar{e}_{Ri} = (1/R)\sum_{r=1}^{R}\hat{e}_{ri}$ is the mean estimate of the ith true conditional error over R re-samples. The terms on the right hand side of (1) are the variance and squared-bias components of the MSE. The average MSE over N simulations is given by

$$\overline{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{R}\sum_{r=1}^{R}\left(\hat{e}_{ri} - e_i\right)^2, \tag{2}$$

which can be decomposed into average variance and average (squared) bias components,

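$$\overline{\mathrm{MSE}} = \overline{\mathrm{VAR}} + \overline{\mathrm{BIAS}^2} = \frac{1}{N}\frac{1}{R}\sum_{i=1}^{N}\sum_{r=1}^{R}\left(\hat{e}_{ri} - \bar{e}_{Ri}\right)^2 + \frac{1}{N}\sum_{i=1}^{N}\left(\bar{e}_{Ri} - e_i\right)^2. \tag{3}$$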

In (3), $\overline{\mathrm{VAR}}$ is the average variance and $\overline{\mathrm{BIAS}^2}$ is the mean squared bias (MSB) over N simulations. It is noted that the two components in (3) are analogous to the pooled variance and lack-of-fit components in linear regression where there are R observations at each of N values of an independent variable.
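The decomposition can be checked numerically. The sketch below uses made-up "true errors" and made-up repeated estimates (both are assumptions for illustration, not study data) and confirms that the average MSE equals the average variance plus the mean squared bias when the variance uses the divisor R.

```python
import random

def decompose(estimates, true_error):
    """Split the MSE of R repeated estimates of one true error into variance and bias."""
    R = len(estimates)
    mean_est = sum(estimates) / R
    mse = sum((e - true_error) ** 2 for e in estimates) / R
    var = sum((e - mean_est) ** 2 for e in estimates) / R   # divisor R, not R - 1
    bias = mean_est - true_error
    return mse, var, bias

rng = random.Random(1)
N, R = 200, 20
results = []
for _ in range(N):
    e_i = rng.uniform(0.1, 0.4)                                 # made-up true error
    est = [e_i + 0.02 + rng.gauss(0, 0.05) for _ in range(R)]   # made-up estimates
    results.append(decompose(est, e_i))

avg_mse = sum(r[0] for r in results) / N        # average MSE over N runs
avg_var = sum(r[1] for r in results) / N        # average variance
msb = sum(r[2] ** 2 for r in results) / N       # mean squared bias
```

The identity holds exactly within each run because the cross term, the sum of deviations of the R estimates from their own mean, is zero.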

With BCV, like k CV, it is possible to calculate the MSE in (1) for each value of the true conditional error (k < n). Each of the R bootstrap samples is drawn first, and then each of the n observations in the with-replacement sample is left out one at a time to get an estimate of the prediction error. With BT632, however, it is not possible to calculate the MSE in (1) because only one estimate of the true conditional error can be calculated from the R bootstrap samples in each of the N simulation runs. That is, with BT632 each of the n samples is left out one at a time and then R bootstrap samples are drawn from the remaining n-1 samples. These R bootstrap samples give an estimate of the prediction error for the left-out observation, and the average of these estimates over the n samples is the BT632 estimate. (Efron and Tibshirani [9] presented an efficient algorithm for computing BT632 that uses only R total bootstrap samples instead of R × n samples; the expected number of bootstrap samples used to estimate the prediction error for each left-out observation is (1–0.632) × R.) Hence, the decomposition in (3) can be achieved with both k CV and BCV, but not with BT632.

Of necessity, because of the construction of BT632 and associated estimators, Efron and Tibshirani [9] used only the MSB (second term in (3)) to evaluate the performance of cross-validation methods, where $\bar{e}_{Ri}$ for BT632 has a different connotation than for k CV and BCV, but is still an average calculated from R (or fewer) bootstrap samples per observation. Similarly, Molinaro et al. [7] employed the MSB in their investigation. Although it was not explicitly shown, in both papers the MSB was further decomposed as

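$$\mathrm{MSB} = \left[\mathrm{SD}(\mathrm{BIAS})\right]^2 + \left(\overline{\mathrm{BIAS}}\right)^2 = \frac{1}{N}\sum_{i=1}^{N}\left[\left(\bar{e}_{Ri} - e_i\right) - \left(\bar{e}_N - \bar{e}\right)\right]^2 + \left(\bar{e}_N - \bar{e}\right)^2 \tag{4}$$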

where $\bar{e}_N = \sum_{i=1}^{N}\bar{e}_{Ri}/N$ and $\bar{e} = \sum_{i=1}^{N}e_i/N$, with interpretations of results based on the standard deviation of the bias, SD(BIAS), and the average bias, $\overline{\mathrm{BIAS}}$, but using different notation. Although the simulation study conducted by Fu et al. [10] provided information on the variance and bias components of the $\overline{\mathrm{MSE}}$ in (3) with respect to the BCV estimator of the true conditional error rate, the information on variance was not used in the comparison with LOOCV and BT632, as it was not possible to obtain equivalent information with the latter two methods. Instead, Fu et al. [10] presented information on components comparable to those of the MSB in (4), but defined on a relative basis. In the simulation study reported here, in addition to the information provided by the squared-bias component and its sub-components in (4), the information that both BCV and k CV provide on the variance component of the $\overline{\mathrm{MSE}}$ in (3) has been compared. To make the BCV-k CV comparison as fair as possible, the number of recomputations, i.e., the number of retrainings of a classifier, was equalized for BCV and k CV. The purpose was to equalize information rather than to equalize computational effort (see [9, 11]). As with the study of Fu et al. [10], the present comparison was restricted to low-dimensional predictor spaces.
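The split of the MSB into a spread-of-bias term and a squared average bias can be illustrated with a small numerical check. The four pairs of values below are invented for illustration, and the function name is hypothetical; the spread term uses the divisor N, matching the form of (4).

```python
def msb_decomposition(mean_estimates, true_errors):
    """Return (MSB, spread of per-run biases with divisor N, average bias)."""
    N = len(true_errors)
    biases = [m - e for m, e in zip(mean_estimates, true_errors)]
    mean_bias = sum(biases) / N
    msb = sum(b ** 2 for b in biases) / N
    spread = sum((b - mean_bias) ** 2 for b in biases) / N
    return msb, spread, mean_bias

# Invented per-run mean estimates and true conditional errors (N = 4).
msb, spread, mean_bias = msb_decomposition(
    [0.22, 0.31, 0.18, 0.27],
    [0.20, 0.25, 0.21, 0.30],
)
```

A method can therefore have a small MSB either because its per-run biases scatter little or because they average close to zero, which is why the two components are worth reporting separately.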

Monte Carlo simulation study

It was assumed that there were two populations (classes) defined by p ≥ 1 predictors or features having underlying Gaussian distributions [10]. The first population was assumed to be distributed N(μ1, Σ1) with μ1 = 0(p)' and the second N(μ2, Σ2) with μ2 = (Δ(p)/√p)', where 0(p) is the p-dimensional zero vector and Δ(p) is a p-dimensional vector of non-zero constants, Δ. The structure of μ2 is a modified configuration of Friedman [12]. In addition to the equal variance case studied by Fu et al. [10], where Σ1 = Σ2 = Ι(p) (the p×p identity matrix), here the case of unequal population variances was also studied, where Σ1 = Ι(p) and Σ2 = 2Ι(p). Independence among predictors, as reflected by Σ1 = Ι(p) and Σ2 = 2Ι(p), was assumed in order to be consistent with Fu et al. [10]. Given the mean structures, any positive correlation among predictors would simply decrease the generalized Mahalanobis distance between the two populations while negative correlation would increase the distance.

Feature dimensions of p = 1 and 5 were simulated, along with Δ = 1 and 3. For p = 1, sample sizes of n = 20, 50 and 100 were simulated (n/2 in each class), while for p = 5, only n = 50 and 100 were considered. Whereas Fu et al. [10] used quadratic discriminant analysis (QDA) to classify samples for some comparisons and a k-nearest neighbor (k-NN) classifier for others, here QDA was used for all comparisons in light of the low-dimensionality. For higher dimensions, where p > n, a method like k-NN would be required.

As mentioned above, there is only one way to do LOOCV with a given sample. On the other hand, BCV as defined by Fu et al. [10] uses an average of the LOOCV prediction errors over B bootstrap (re)samples. Hence, BCV is based on B × n recomputations (retrainings of the classifier) while LOOCV is based on only n recomputations. For a more extensive comparison, three approaches were taken here in order to compare BCV to k-fold CV (henceforward k CV), as summarized in Table 1.

Table 1 Cross-validation methods to be compared

First, LOOCV was compared to BCV as was done by Fu and colleagues [10]. These methods are denoted by k CVn and BCVn, respectively, in Table 1. Second, n/2-fold CV (leave-two-out CV, denoted k CVn/2) was done in order to stay as close as possible to LOOCV (k CVn) while allowing multiple retrainings of the classifier. To keep the number of recomputations the same as for BCV, 2 × B repetitions of k CVn/2 were run (2×B×n/2 = B × n). Also, a version of BCV based on n/2-fold CV (BCVn/2) was implemented with 2 × B repetitions for a head-to-head comparison with k CVn/2 based on the same number, B×n, of total retrainings. Third, a version of BCV based on 10-fold CV (BCV10) was implemented and compared to traditional 10-fold CV (k CV10), where again the number of recomputations was the same. Here, both BCV10 and k CV10 were based on B×n/10 repetitions for a total of B×n retrainings. BCVn/2 and BCV10 were defined like k CVn/2 and k CV10, except that in each repetition, a bootstrap sample of size n was randomly divided into n/2 or 10 subsets, while with k CVn/2 and k CV10 the original n observations were randomly re-divided into n/2 or 10 subsets (these are the same when n = 20). In this study, B = 50 [9, 10].
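To make the difference between LOOCV and BCVn concrete, the sketch below runs LOOCV once on the original sample and then averages LOOCV errors over B bootstrap samples. It is a minimal sketch rather than the R code used in the study: the univariate QDA helper, the function names, the toy data and the redraw rule (at least three distinct observations per class, the requirement attributed to Fu et al. [10]) are assumptions made for illustration.

```python
import math
import random

def fit_qda_1d(xs, ys):
    """Univariate QDA: class-specific mean, variance and prior."""
    params = {}
    for c in (0, 1):
        v = [x for x, y in zip(xs, ys) if y == c]
        m = sum(v) / len(v)
        s2 = sum((x - m) ** 2 for x in v) / (len(v) - 1)
        params[c] = (m, s2, len(v) / len(xs))
    return params

def predict_qda_1d(params, x):
    def score(c):
        m, s2, prior = params[c]
        return math.log(prior) - 0.5 * math.log(s2) - (x - m) ** 2 / (2 * s2)
    return max((0, 1), key=score)

def loocv_error(xs, ys):
    """Leave each observation out once; n retrainings in total."""
    n = len(xs)
    wrong = 0
    for i in range(n):
        model = fit_qda_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        wrong += predict_qda_1d(model, xs[i]) != ys[i]
    return wrong / n

def bcv_error(xs, ys, B, rng, min_distinct=3):
    """Average LOOCV error over B bootstrap samples; B * n retrainings."""
    n = len(xs)
    errors = []
    while len(errors) < B:
        draw = [rng.randrange(n) for _ in range(n)]   # with replacement
        # Redraw when a class has too few distinct observations to fit QDA.
        if any(len({i for i in draw if ys[i] == c}) < min_distinct for c in (0, 1)):
            continue
        errors.append(loocv_error([xs[i] for i in draw], [ys[i] for i in draw]))
    return sum(errors) / B

rng = random.Random(7)
xs = [rng.gauss(0, 1) for _ in range(10)] + [rng.gauss(1, 1) for _ in range(10)]
ys = [0] * 10 + [1] * 10
loo = loocv_error(xs, ys)
bcv = bcv_error(xs, ys, B=50, rng=rng)
```

Note that inside each bootstrap sample a left-out observation may have duplicates that stay in the training data, which is the mechanism behind the optimism of BCV discussed later in the paper.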

The simulation study was implemented as follows. For each combination of p and Δ, a “super-population” of size 10000 was drawn, 5000 from N(μ1, Σ1) and 5000 from N(μ2, Σ2). Then, for each value of n, N = 1000 simulations were run. For each simulation run, a stratified random sample of size n was drawn without replacement, n/2 observations from the 5000 N(μ1, Σ1) population values and n/2 observations from the 5000 N(μ2, Σ2) population values. The QDA classifier was trained on the sample. Following Molinaro et al. [7], the true conditional error rate for each classifier was calculated as the proportion of times the trained classifier misclassified the remaining 10000-n members of the super-population. Then k CVn, BCVn, k CVn/2, BCVn/2, k CV10 and BCV10 were each conducted on the sample to estimate the true conditional error. Their MSE, variance and bias were calculated from expression (1).
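One simulation run of this design, for p = 1 with equal variances, might look like the following sketch. It assumes a hand-written univariate QDA and a fixed random seed; the names and the choice Δ = 1, n = 50 are illustrative assumptions, and the study itself was run in R.

```python
import math
import random

def fit_qda_1d(xs, ys):
    """Univariate QDA: class-specific mean, variance and prior."""
    params = {}
    for c in (0, 1):
        v = [x for x, y in zip(xs, ys) if y == c]
        m = sum(v) / len(v)
        s2 = sum((x - m) ** 2 for x in v) / (len(v) - 1)
        params[c] = (m, s2, len(v) / len(xs))
    return params

def predict_qda_1d(params, x):
    def score(c):
        m, s2, prior = params[c]
        return math.log(prior) - 0.5 * math.log(s2) - (x - m) ** 2 / (2 * s2)
    return max((0, 1), key=score)

rng = random.Random(42)
delta, n, half = 1.0, 50, 5000

# Super-population of 10000: 5000 values per class, stored as (x, label).
pop = ([(rng.gauss(0.0, 1.0), 0) for _ in range(half)]
       + [(rng.gauss(delta, 1.0), 1) for _ in range(half)])

# Stratified sample without replacement: n/2 indices from each class block.
chosen = (set(rng.sample(range(half), n // 2))
          | set(rng.sample(range(half, 2 * half), n // 2)))
train = [pop[i] for i in sorted(chosen)]
rest = [pop[i] for i in range(2 * half) if i not in chosen]

# True conditional error: error of this one trained classifier on the
# remaining 10000 - n members of the super-population.
model = fit_qda_1d([x for x, _ in train], [y for _, y in train])
true_conditional_error = sum(predict_qda_1d(model, x) != y for x, y in rest) / len(rest)
```

Repeating this N = 1000 times, and running each CV method on `train` within every run, would give the pairs of true and estimated errors that enter the summary statistics.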

For p=1 it was found that at least four distinct observations were needed in each class to avoid numerical problems in training the QDA classifier for the BCVn/2 and BCV10 methods. Hence, this requirement was imposed on all three BCV methods. (Fu et al., [10], required at least three distinct observations in each class for the original BCV method, BCVn.) In addition, for p=5, the BCV methods were implemented with stratified sampling, i.e., n/2 bootstrap samples from each class, along with a requirement of at least eight distinct observations in each class.

The mean and standard deviation of the MSE, variance, and bias, as well as the MSB over the N = 1000 simulations were calculated for BCV and k CV. With R representing the number of repetitions of each method (Table 1, column 2), the means are defined by

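$$\overline{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{R}\sum_{r=1}^{R}\left(\hat{e}_{ri} - e_i\right)^2, \tag{5}$$
$$\overline{\mathrm{VAR}} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{R}\sum_{r=1}^{R}\left(\hat{e}_{ri} - \bar{e}_{Ri}\right)^2, \tag{6}$$
$$\overline{\mathrm{BIAS}} = \frac{1}{N}\sum_{i=1}^{N}\left(\bar{e}_{Ri} - e_i\right), \tag{7}$$
$$\overline{\mathrm{BIAS}^2} = \frac{1}{N}\sum_{i=1}^{N}\left(\bar{e}_{Ri} - e_i\right)^2. \tag{8}$$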

The three standard deviations for each method are defined by

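$$\mathrm{SD}(\mathrm{MSE}) = \left[\frac{1}{N-1}\sum_{i=1}^{N}\left\{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{e}_{ri} - e_i\right)^2 - \overline{\mathrm{MSE}}\right\}^2\right]^{1/2} \tag{9}$$
$$\mathrm{SD}(\mathrm{VAR}) = \left[\frac{1}{N-1}\sum_{i=1}^{N}\left\{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{e}_{ri} - \bar{e}_{Ri}\right)^2 - \overline{\mathrm{VAR}}\right\}^2\right]^{1/2} \tag{10}$$
$$\mathrm{SD}(\mathrm{BIAS}) = \left[\frac{1}{N-1}\sum_{i=1}^{N}\left\{\left(\bar{e}_{Ri} - e_i\right) - \overline{\mathrm{BIAS}}\right\}^2\right]^{1/2} \tag{11}$$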

The means and standard deviations defined in (5) – (11) were used to compare the performance of the methods.
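The seven summary quantities can be computed from an N × R table of estimates and a list of N true errors, as in the sketch below. The function name, dictionary keys and the two-run toy input are hypothetical; per-run variances use the divisor R and the standard deviations across runs use N − 1, following the definitions above.

```python
import math

def summarize(estimates, true_errors):
    """estimates[i][r] is the r-th estimate in run i; true_errors[i] is e_i."""
    N = len(true_errors)
    mse_i, var_i, bias_i = [], [], []
    for est, e in zip(estimates, true_errors):
        R = len(est)
        mean_est = sum(est) / R
        mse_i.append(sum((x - e) ** 2 for x in est) / R)
        var_i.append(sum((x - mean_est) ** 2 for x in est) / R)
        bias_i.append(mean_est - e)

    def mean(v):
        return sum(v) / N

    def sd(v):
        m = mean(v)
        return math.sqrt(sum((x - m) ** 2 for x in v) / (N - 1))

    return {
        "MSE": mean(mse_i), "VAR": mean(var_i), "BIAS": mean(bias_i),
        "MSB": mean([b * b for b in bias_i]),
        "SD(MSE)": sd(mse_i), "SD(VAR)": sd(var_i), "SD(BIAS)": sd(bias_i),
    }

# Toy input: N = 2 runs, R = 2 estimates per run.
out = summarize([[0.2, 0.4], [0.1, 0.3]], [0.3, 0.1])
```

In this toy input the first run is unbiased and the second overestimates by 0.1, so the average bias is 0.05 while SD(BIAS) is about 0.071, showing how the two quantities capture different aspects of the per-run biases.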

R Version 2.6.0 was used to conduct the Monte Carlo simulation study, with an independently written SAS/IML program being used to verify the mean calculations for the equal-variance case with p=1 [13, 14].

Results and discussion

The results of the simulation study are summarized in Tables 2 and 3 and Figures 1, 2, 3, 4. Table 2 is the same case as covered in Table 1 of Fu and colleagues [10]. For brevity, all configuration results are discussed but only a limited portion of the results are displayed in Tables 2 and 3 (i.e., cases for LOOCV, BCVn, k CV10, BCV10 where Σ1 = Σ2 = Ι(p) ). The interested reader is referred to the supplementary material section for the tables in their entirety and for the cases where Σ1 = Ι(p) and Σ2 = 2Ι(p).

Table 2 Simulation results for p = 1, Σ1 = Σ2 = I(1), N = 1000
Table 3 Simulation results for p = 5, Σ1 = Σ2 = Ι(5), N = 1000
Figure 1

The individual values of $(\bar{e}_{Ri} - e_i)$ that contribute to $\overline{\mathrm{BIAS}}$ and SD(BIAS) for each of N = 1000 simulations with p = 1, n = 50, Δ = 1, and Σ1 = Σ2 = I.

Figure 2

The individual values of $(\bar{e}_{Ri} - e_i)$ that contribute to $\overline{\mathrm{BIAS}}$ and SD(BIAS) for each of N = 1000 simulations with p = 5, n = 50, Δ = 1, and Σ1 = Σ2 = I.

Figure 3

The mean relative bias for each of the twenty simulation configurations in Tables 2 and 3 and in Additional file 1: Table S1, Additional file 2: Table S2, Additional file 3: Table S3 and Additional file 4: Table S4, where each point is the average of N = 1000 values, $\overline{\mathrm{BIAS}}/\bar{e}$, like those plotted in Figures 1 and 2.

Figure 4

The mean relative bias expressed as $\overline{\mathrm{BIAS}}/\mathrm{SD}(\mathrm{BIAS})$ for each of the twenty simulation configurations in Tables 2 and 3 and in Additional file 1: Table S1, Additional file 2: Table S2, Additional file 3: Table S3 and Additional file 4: Table S4.

Beginning with the MSB (i.e., $\overline{\mathrm{BIAS}^2}$), which is the criterion used by Efron and Tibshirani [9], Molinaro et al. [7], Fu et al. [10] and Kim [11] to compare estimators of the true conditional classification error, it is shown in Table 2 that for p = 1 the MSB of k CV is always larger than that of BCV. In terms of the components of the MSB, this is due to a larger SD(BIAS) for k CV than BCV, although the $\overline{\mathrm{BIAS}}$ of BCV tends to be negative and is generally larger than that of k CV in absolute value for Δ = 3. These results are consistent for configurations with n = 20 (see supplementary material, Additional file 1: Table S1 and Additional file 2: Table S2). For p = 5, the same pattern is shown in Table 3 for the MSB for Δ = 3, but the reverse is shown for Δ = 1, i.e., the MSB is larger for BCV, where the negative $\overline{\mathrm{BIAS}}$ of BCV is very pronounced. Thus, the variation of BCV, as measured by SD(BIAS), is indeed reduced compared to k CV, although a price is paid in terms of increased $\overline{\mathrm{BIAS}}$. Again, the results for equal and unequal covariance matrices are consistent (see supplementary material, Additional file 3: Table S3 and Additional file 4: Table S4).

The individual values of $(\bar{e}_{Ri} - e_i)$ that contribute to $\overline{\mathrm{BIAS}}$ and SD(BIAS) for two different simulation conditions are plotted in Figures 1 and 2 for each of N = 1000 simulations. Figures 1a and 1b are for BCV and k CV, respectively, for p = 1, n = 50 and Δ = 1, from Table 2. Figures 2a and 2b represent corresponding plots for p = 5 from Table 3. Figure 1 represents one of the best configurations for BCV compared to k CV while Figure 2 represents one of the worst scenarios. As the figures show, the individual estimates of the true conditional error, $e_i$, are extremely variable across the 1000 simulations. The variance of BCV is indeed less than that of k CV, but the negative bias of BCV can be substantial as the dimensionality of the feature space increases.

Why there is large variation in general

The large variation shown in Figures 1 and 2, along with the correspondingly large values of SD(BIAS) in Tables 2 and 3 for both k CV and BCV, is consistent with results of Efron and Tibshirani ([9], Tables 3 to 8 on pages 554-556) and Molinaro et al. ([7], Tables 1 and 4 on pages 3304 and 3305), both of whom showed large standard deviations and, in some cases, large values of bias, for the classifiers and error estimation methods they studied. In fact, Efron and Tibshirani [9] noted that none of the methods correlates very well with the conditional error rate on a sample-by-sample basis. This lack of correlation in the present investigation, reflected by the large values of SD(BIAS), appears to be partly due to a problem with the way the true classification error is defined and estimated. As mentioned in the Introduction, the problem appears to be that the quantity purportedly being estimated, the true misclassification error of the trained classifier conditional on the training set, is defined as a single fixed quantity for a given set of data.

Possible alternative approach for estimation of SD(BIAS)

It does not seem logical to take the misclassification error as a fixed quantity and then use cross-validation to estimate it, because the true conditional error for any classifier trained using only part of the data within a cross-validation is not the same as the true conditional error of the classifier trained on the complete set of data, i.e., the quantity to be estimated. This leads to an inflated estimate of SD(BIAS). Although it might prove to be computationally prohibitive, it seems more logical to define and calculate a true conditional error for each training set within each of the k partitions of a cross-validation, say $e_{ijr}$ (i = 1,…,N; j = 1,…,k; r = 1,…,R), and then obtain a corresponding estimate, $\hat{e}_{ijr}$. Each difference, $\hat{e}_{ijr} - e_{ijr}$, would represent an estimate of the expected bias in estimating a true conditional error so defined. So, even though the conditional error itself would change from partition to partition, one could still obtain a sample of estimates of the bias in estimating such an error. The variation among these bias estimates would be expected to be less than that represented by SD(BIAS) in Tables 2 and 3 and reflected in Figures 1 and 2 with the customary method, because a source of variation heretofore not taken into account would be eliminated. This “more logical” approach provides insight into how the variation in the bias is artificially inflated when one attempts to use cross-validation to estimate a single, fixed “true conditional error” of a trained classifier. Attempting to estimate the elusive true conditional error is not recommended. Instead, a classifier’s generalization error in predicting future observations is the error that should be estimated, and is the error for which cross-validation is well-suited.

Average bias estimates are representative

On the other hand, even though individual-run biases are likely overstated because of the inflated variance when defined in terms of a fixed true conditional error, the average bias, $\overline{\mathrm{BIAS}}$, calculated in the customary way ought to be representative of the average bias that would be reflected if the “more logical” method described above were used. For this reason, plots like Figure 3 of Efron and Tibshirani [9] of the average relative bias in terms of the expected true error are useful for comparing error estimation methods, even though the individual true conditional errors defined the usual way may not be estimated with precision. Figures 3a to 3d mimic Figure 3 of Efron and Tibshirani [9]. The plotted points are values of $\overline{\mathrm{BIAS}}/\bar{e}$, which is equivalent to $(\bar{e}_N - \bar{e})/\bar{e}$, for each of the twenty simulation configurations in Tables 2 and 3 and supplementary material Additional file 1: Table S1, Additional file 2: Table S2, Additional file 3: Table S3 and Additional file 4: Table S4, where each point is the average of N = 1000 values like those plotted in Figures 1 and 2. For example, the open triangles plotted in Figures 3a and 3c correspond to Figures 1 and 2, respectively. These figures show a consistent, but modest, positive relative bias for k CV and a consistent, sometimes large, negative relative bias for BCV. In particular, as shown in Figure 3, as the sample size increases, differences in relative bias between k CV and BCV decrease for both p = 1 and p = 5. This agrees with the result of a numerical experiment by Davison and Hall [15] with p = 3 and n = (20, 40, 80) in a comparison of bootstrap and LOOCV estimates of discrimination error. Even so, the relative bias of BCV in Figure 3 for p = 5 is still substantially negative when n = 100. For p = 3, Davison and Hall [15] observed a similar decrease in disagreement between the methods as the distance between populations increased. For p = 5, Figure 3 also shows that effect going from Δ = 1 to Δ = 3.

Impact of BCV being negatively biased

The negative bias of BCV, i.e., underestimation of the true error, can be explained by the fact that the probability that a test sample appears in the training set is $1 - (1 - 1/n)^n \approx 0.632$. Borrowing the words of Efron and Tibshirani [9] to describe this phenomenon, BCV “uses training samples that are too close to the test points, leading to potential underestimation of the error rate.” It is important to note that the bootstrap methods of Efron and Tibshirani do not include test points in the training set.
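The 0.632 figure is the large-n limit of the inclusion probability, 1 − 1/e. A short calculation, using a hypothetical helper name, shows the value at the sample sizes simulated here:

```python
import math

def prob_in_bootstrap(n):
    """Probability that a given observation appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n

values = {n: prob_in_bootstrap(n) for n in (20, 50, 100)}
limit = 1.0 - math.exp(-1.0)   # large-n limit, about 0.632
```

For n = 20, 50 and 100 the probabilities are roughly 0.642, 0.636 and 0.634, so the overlap between test and training points is slightly higher than 0.632 at the smaller sample sizes.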

The substantial negative bias of BCV means that BCV tends to underestimate the classification error on average. While the direction and magnitude of the bias of a cross-validation method might not matter a great deal if the performances of several competitive classification procedures are being compared, it definitely matters if the error rate of a specific classification procedure is of interest. Substantial negative bias, translating to underestimation of the true misclassification error, would be a serious concern. To expound on the sizable negative bias of BCV, Figure 4 shows plots of $\overline{\mathrm{BIAS}}/\mathrm{SD}(\mathrm{BIAS})$ for the same simulations as Figure 3. Applying the rule of thumb of Efron and Tibshirani [16] to bias estimation, the horizontal reference lines at 0.25 in each panel represent thresholds of acceptable relative bias. For p = 1, both k CV and BCV satisfy the threshold, except for one instance where BCV exceeds the threshold slightly when n = 20. However, for p = 5, BCV always exceeds the threshold while k CV is always below the threshold. When Δ = 1, all four BCV relative biases exceed 1, i.e., they are more than four times the 0.25 threshold. Alternatively, a relative-bias plot could be constructed using averages of the components of the MSE by plotting the ratio $\overline{\mathrm{BIAS}}/\sqrt{\overline{\mathrm{VAR}}}$. This would show the same general result as Figure 4, but would be less pronounced because of the propensity for increased $\overline{\mathrm{VAR}}$ of BCV compared to k CV for estimating individual $e_i$ (i = 1,…,N) (Tables 2, 3) due to the positive covariance induced by with-replacement sampling with the BCV method.

Assessing reproducibility of error estimates

Because both BCV and k CV can be repeated multiple times, as they have been in the present simulation study, they can give information on the reproducibility among repeated cross-validations. The values of $\overline{\mathrm{MSE}}$, SD(MSE), $\overline{\mathrm{VAR}}$ and SD(VAR) in Tables 2 and 3 provide such information on the reproducibility of BCV and k CV from CV-run to CV-run. When k CV is used in practice, where there is only a single set of training data, either VAR or its square root is the commonly reported value along with the average error, $\bar{e}_R = \sum_{r=1}^{R}\hat{e}_r/R$, or its complement, the average accuracy [1, 5]. Because the purpose of cross-validation is to assess a classifier’s ability to generalize outside the training set, the variation from CV-run to CV-run is an important measure of performance. Note that even though the present problem may be ill-defined such that the average biases for individual simulation runs are exaggerated, the values of $\overline{\mathrm{VAR}}$ are unaffected and correctly reflect the degree of reproducibility of the generalization error estimate.

Fair comparison requires equalization of number of trainings

In this study, there were 100 to 500 repetitions of each method (1000 to 5000 retrainings) in order to put BCV and k CV on the same footing with respect to the number of retrainings of classifiers [9, 11]. This is many more repetitions than the ten or twenty repetitions normally done with k CV10. Although nowadays CPU time is relatively inexpensive, 100 to 500 repetitions may be excessive. On the other hand, although the BT632 method of Efron and Tibshirani [9] did not perform as well overall as BCV in the study of Fu et al. [10], it did show competitive behavior in some cases. It seems likely that, if the number of retrainings were equalized while employing the economical algorithm of Efron and Tibshirani [9], the competitiveness of BT632 evaluated in terms of average squared bias and its component parts would improve. As Kim [11] reported recently, the BT632+ method based on 50 bootstraps performed better than 5 repetitions of k CV10 in terms of average squared bias for a pruned tree classifier, although it, too, had a downward bias.

Microarray example

We evaluated the performance of k CV and BCV in predicting prognosis based on the gene expression profiles of breast cancer patients previously reported by van’t Veer and colleagues [17, 18]. The van de Vijver et al. [17] study consisted of 295 patients with stage I or II breast cancer, and patients’ prognosis and gene expression data are publicly available. While the dataset contained a 70-gene prognosis profile, we chose to perform our evaluation of k CV and BCV using only 5 genes based on a simple gene selection procedure using a t-statistic with adjusted p-values [19]. In the study of Fu et al. [10], the authors chose 5 genes that were most highly correlated with the patient’s prognosis. As noted in Fu et al. [10], such a gene selection procedure is prone to bias. However, the purpose of this evaluation is not gene selection. Furthermore, in practice only a small subset of genes is often of clinical interest.

Following the steps taken in Fu et al. [10] for comparison, the subsequent steps were carried out: (1) take a random sample S of size n = 50 with half of the patients having good prognosis and half having poor prognosis, (2) train a QDA classifier based on the random sample S and compute its true conditional error rate based on the proportion of times the trained classifier misclassified the remaining samples, (3) for each random sample S, estimate the true conditional error for LOOCV, CVn/2, CV10, BCV, BCVn/2, and BCV10, (4) calculate their MSE, variance and bias, (5) repeat over 1000 simulation runs and calculate the mean and standard deviation of the MSE, variance, bias and MSB.

The results presented in Table 4 and Figure 5 are consistent with our simulation results in Table 3 for Δ = 1. More specifically, the MSB is larger for BCV and the negative $\overline{\mathrm{BIAS}}$ of BCV is evident. Figure 5 certainly demonstrates that BCV is less variable, but as previously noted, this advantage is negated by the considerable bias and overall MSB. Furthermore, given this microarray example data, the $\overline{\mathrm{MSE}}$ and $\overline{\mathrm{VAR}}$ for the BCV methods are higher than the corresponding quantities for the k CV counterpart.

Table 4 Results for the microarray example
Figure 5

The individual values of $(\bar{e}_{Ri} - e_i)$ that contribute to $\overline{\mathrm{BIAS}}$ and SD(BIAS) for each of N = 1000 simulation runs using the breast cancer data.


Cross-validation is a widely accepted and sound practice for estimating the generalization error of a classifier. Of course, for small data sets with high-dimensional predictors, especially for p > n, the variation among cross-validated error estimates can be large. For methods like BCV and k CV that can be replicated, it is generally accepted that cross-validation should be repeated 10 to 30 times to account for variation. However, using cross-validation to estimate the fixed misclassification error of a trained classifier conditional on the training set is problematic and should not be attempted. Although Monte Carlo simulation of this estimation exercise can correctly represent the average bias, it will overstate the variance of the bias. For the low-dimensional conditions simulated in the present study, k CV showed a consistent, but modest, positive bias. Conversely, BCV showed a consistent, and sometimes substantial, negative bias, which was much more pronounced for p = 5 than for p = 1. Increasing the complexity of the simulation to incorporate higher dimensions would only magnify the effect. The bias of BCV is too high a price to pay for its reduced variance; k-fold CV is recommended.

Authors’ information

Songthip Ounpraseuth is an Associate Professor in the Department of Biostatistics at the University of Arkansas for Medical Sciences, Little Rock, AR. His research interests include prediction error estimation, computational statistics, dimension reduction and classification.

Ralph L Kodell is a Professor in the Department of Biostatistics at the University of Arkansas for Medical Sciences. His research interests include classification algorithms for biomedical decision making and statistical models and methods for toxicology and risk assessment.

Shelly Y. Lensing is a Research Associate biostatistician in the Department of Biostatistics at the University of Arkansas for Medical Sciences. Her research interests are the design and analysis of clinical trials and statistical computing.

Horace J. Spencer is a Research Associate biostatistician in the Department of Biostatistics at the University of Arkansas for Medical Sciences. His research interests are in all aspects of statistical computing and error estimation.

Financial disclosures

The authors have no financial relationships relevant to this article to disclose.





MSE: Mean squared error

MSB: Mean squared bias

BCV: Bootstrap cross-validation

LOOCV: Leave-one-out cross-validation

RMS: Root mean square

MSRE: Mean squared relative error

k CV: k-fold cross-validation

QDA: Quadratic discriminant analysis

k NN: k-nearest neighbor.


  1. Moon H, Ahn H, Kodell RL, Baek S, Lin C-J, Chen JJ: Ensemble methods for classification of patients for personalized medicine with high-dimensional data. Artif Intell Med. 2007, 41: 197-207. 10.1016/j.artmed.2007.07.003.


  2. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2001, Springer, New York


  3. Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci. 2002, 99: 6562-6566. 10.1073/pnas.102102699.


  4. Subramanian J, Simon R: An evaluation of resampling methods for assessment of survival risk prediction in high-dimensional settings. Stat Med. 2011, 30: 642-653. 10.1002/sim.4106.


  5. Liu Y, Yao X, Higuchi T: Evolutionary ensembles with negative correlation learning. IEEE Trans Evol Comput. 2000, 4: 380-387. 10.1109/4235.887237.


  6. Arena VC, Sussman NB, Mazumdar S, Yu S, Macina OT: The utility of structure-activity relationship (SAR) models for prediction and covariate selection in developmental toxicity: comparative analysis of logistic regression and decision tree models. SAR QSAR Environ Res. 2004, 15: 1-18. 10.1080/1062936032000169633.


  7. Molinaro AM, Simon R, Pfeiffer RM: Prediction error estimation: a comparison of resampling methods. Bioinformatics. 2005, 21: 3301-3307. 10.1093/bioinformatics/bti499.


  8. Efron B: Estimating the error rate of a prediction rule: improvement on cross-validation. J Amer Stat Assoc. 1983, 78: 316-331. 10.1080/01621459.1983.10477973.


  9. Efron B, Tibshirani R: Improvements on cross-validation: the .632+ Bootstrap method. J Amer Stat Assoc. 1997, 92: 548-560.


  10. Fu WJ, Carroll RJ, Wang S: Estimating misclassification error with small samples via bootstrap cross-validation. Bioinformatics. 2005, 21: 1979-1986. 10.1093/bioinformatics/bti294.


  11. Kim J-H: Estimating classification error rate: repeated cross-validation, repeated hold-out and bootstrap. Comput Stat Data Anal. 2009, 53: 3735-3745. 10.1016/j.csda.2009.04.009.


  12. Friedman J: Regularized discriminant analysis. J Amer Stat Assoc. 1989, 84: 165-175. 10.1080/01621459.1989.10478752.


  13. R Development Core Team: R: A Language and Environment for Statistical Computing. 2007, R Foundation for Statistical Computing, Vienna, Austria. Accessed November 2, 2007.


  14. SAS: SAS/IML 9.1 User’s Guide. 2004, SAS Institute, Inc, Cary, North Carolina


  15. Davison AC, Hall P: On the bias and variability of bootstrap and cross-validation estimates of error rate in discrimination problems. Biometrika. 1992, 79 (2): 279-284. 10.1093/biomet/79.2.279.


  16. Efron B, Tibshirani R: An Introduction to the Bootstrap. 1993, Chapman & Hall/CRC, Boca Raton, Florida


  17. Van’t Veer LJ, Dai H, Van de Vijver MJ, He YD, Hart AA, Mao M, et al: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415: 530-536. 10.1038/415530a.


  18. Van de Vijver MJ, He YD, et al: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med. 2002, 347: 1999-2009. 10.1056/NEJMoa021967.


  19. Nguyen DV, Rocke DM: Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002, 18: 39-50. 10.1093/bioinformatics/18.1.39.




The authors are appreciative of the referees for critically reading the manuscript and for their valuable suggestions and comments which have led to an improved presentation.

Author information



Corresponding author

Correspondence to Songthip Ounpraseuth.

Additional information

Competing interests

The authors declare that they have no competing interests relevant to this article to disclose.

Authors’ contributions

SO and RLK conceived the problem and designed the simulations for the manuscript. SYL and HJS were in charge of the computational coding. All authors were involved in drafting the manuscript. All authors read and approved the final manuscript.


Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Ounpraseuth, S., Lensing, S.Y., Spencer, H.J. et al. Estimating misclassification error: a closer look at cross-validation based methods. BMC Res Notes 5, 656 (2012).
