Estimating misclassification error: a closer look at cross-validation based methods

Background To estimate a classifier’s error in predicting future observations, bootstrap methods have been proposed as reduced-variation alternatives to traditional cross-validation (CV) methods based on sampling without replacement. Monte Carlo (MC) simulation studies aimed at estimating the true misclassification error conditional on the training set are commonly used to compare CV methods. We conducted an MC simulation study to compare a new method of bootstrap CV (BCV) to k-fold CV for estimating classification error. Findings For the low-dimensional conditions simulated, the modest positive bias of k-fold CV contrasted sharply with the substantial negative bias of the new BCV method. This behavior was corroborated using a real-world dataset of prognostic gene-expression profiles in breast cancer patients. Our simulation results demonstrate some extreme characteristics of variance and bias that can occur due to a fault in the design of CV exercises aimed at estimating the true conditional error of a classifier, and that appear not to have been fully appreciated in previous studies. Although CV is a sound practice for estimating a classifier’s generalization error, using CV to estimate the fixed misclassification error of a trained classifier conditional on the training set is problematic. While MC simulation of this estimation exercise can correctly represent the average bias of a classifier, it will overstate the between-run variance of the bias. Conclusions We recommend k-fold CV over the new BCV method for estimating a classifier’s generalization error. The extreme negative bias of BCV is too high a price to pay for its reduced variance.


Background
Class prediction involves the use of statistical learning techniques to develop algorithms for classifying unknown samples through supervised learning on samples of known class. In assessing the performance of a classification algorithm, the goal is to estimate its ability to generalize, i.e., to predict the outcomes of samples not included in the data set used to train the classifier. The performance may be assessed on the basis of a number of different indices. For problems having a dichotomous outcome variable (e.g., positive or negative), the sensitivity, specificity, positive predictive value and negative predictive value are indices that may be of interest in addition to the overall prediction accuracy [1]. In this paper, attention is focused on the overall prediction accuracy, or equivalently, on its counterpart, the prediction error.
Cross-validation (CV) is a widely used method for performance assessment in class prediction [2][3][4]. With k-fold CV, a data set of n samples is randomly divided into k subsets each having (approximately) n/k samples. Each of these k subsets serves in turn as a test set. For each of these k test sets of size n/k, a classifier is trained on the remaining (k-1)×(n/k) observations (the training set). The trained classifier is then used to classify the n/k samples in the test set, and the prediction error (perhaps along with other indices) is calculated. The combined value of the prediction error over the k test sets, which is based on the prediction of all n samples one time each, is the cross-validated estimate of that error. Generally, several replicates of k-fold cross-validation are performed based on different random permutations of the n samples in order to account for the random resampling variance, and the average and standard deviation of these replicates are used to assess the performance of the classifier [5,6]. When k = n, the exercise is called leave-one-out cross-validation (LOOCV); there is only one unique way to do LOOCV and, hence, it cannot be replicated. A common choice of k is 10, and 10 to 30 replicates of 10-fold CV have been shown to be sufficient to achieve stable values of the prediction error [7].
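The replicated k-fold procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nearest-class-mean classifier, the toy one-dimensional data, the seeds and all function names are hypothetical stand-ins, not the classifier or data used in this paper.

```python
import random
import statistics


def kfold_partition(n, k, rng):
    """Randomly split indices 0..n-1 into k folds of (approximately) n/k each."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[j::k] for j in range(k)]


def nearest_mean_fit(xs, ys):
    """Stand-in classifier (an assumption): one class mean per label, 1-D feature."""
    return {c: statistics.fmean([x for x, y in zip(xs, ys) if y == c])
            for c in set(ys)}


def nearest_mean_predict(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda c: abs(x - means[c]))


def kfold_cv_error(xs, ys, k, rng):
    """One replicate of k-fold CV: every sample is predicted exactly once."""
    n = len(xs)
    wrong = 0
    for fold in kfold_partition(n, k, rng):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        model = nearest_mean_fit([xs[i] for i in train], [ys[i] for i in train])
        wrong += sum(nearest_mean_predict(model, xs[i]) != ys[i] for i in fold)
    return wrong / n


def replicated_cv(xs, ys, k, replicates, seed=0):
    """Mean and standard deviation of the CV error over random re-partitions."""
    rng = random.Random(seed)
    errors = [kfold_cv_error(xs, ys, k, rng) for _ in range(replicates)]
    return statistics.fmean(errors), statistics.stdev(errors)


# Toy data (illustrative only): two 1-D Gaussian classes, 25 samples each.
data_rng = random.Random(1)
xs = [data_rng.gauss(0, 1) for _ in range(25)] + [data_rng.gauss(1, 1) for _ in range(25)]
ys = [0] * 25 + [1] * 25
mean_err, sd_err = replicated_cv(xs, ys, k=10, replicates=20)
print(f"10-fold CV error over 20 replicates: {mean_err:.3f} (SD {sd_err:.3f})")
```

Setting k equal to n in this sketch gives LOOCV, in which case every replicate returns the same value and the standard deviation over replicates is zero.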
Before 10-fold CV became popular, efforts were directed toward reducing the variability of LOOCV, recognizing that it gave nearly unbiased estimates of the prediction error [8]. The .632 and .632+ bootstrap methods are well-known alternatives to LOOCV [9]. Recently, Fu, Carroll and Wang [10] introduced a new bootstrap version of LOOCV (bootstrap cross-validation or BCV), which they compared to LOOCV and to the .632 bootstrap method (BT632) on problems with low-dimensional predictor spaces. Like Efron and Tibshirani [9], Fu et al. [10] used a mean squared error (MSE) represented by the mean squared bias (MSB) over N Monte Carlo simulations (discussed in the Methods section) as the primary criterion for evaluating estimators of the true conditional error, i.e., the true misclassification error of the trained classifier conditional on the training set [9]. These and similar investigations into estimating the true conditional error via cross-validation (e.g., see [7,11]) have been interpreted as assessing a classifier's error in predicting future observations, i.e., its generalization error [8,9]. It is argued in this paper that while cross-validation is a sound, generally accepted method for evaluating a classifier's generalization error, it may be problematic to use cross-validation to assess this generalizability in terms of estimating a true conditional error defined as a single fixed quantity for a given set of data. With that approach the variance of cross-validation will tend to be overstated, even though its bias can still be appropriately characterized, as will be shown in this paper via Monte Carlo simulation and will be explained more fully in the Discussion.
While Efron and Tibshirani used the traditional absolute scale to calculate the MSB and its square root (the root mean square or RMS in their notation), Fu et al. focused on what they termed the mean squared 'relative' error (MSRE), stating that calculations on the absolute scale gave similar results. Here, the mean squared error and associated quantities calculated on the absolute scale are used.
The purpose of this paper is to report the results of a more extensive comparison of BCV to conventional CV done via a simulation study like that of Fu et al. [10], based on k-fold CV in addition to LOOCV, and to use those results to fuel a discussion of several issues related to cross-validation. Finally, the performance of BCV and k-fold CV is demonstrated for a real-world data set by classifying patients with breast cancer according to prognosis based on their gene-expression profiles.

Mean squared error
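In order to facilitate the definition of terms, suppose for the moment that k-fold cross-validation (kCV) will be used to assess a classifier's true conditional error rate based on the results of N Monte Carlo simulations. Let k < n, where n is the sample size, and assume that kCV is repeated R times. Then the MSE for the i th simulation is given by

$$\mathrm{MSE}_i = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{e}_{ri} - e_i\right)^2 = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{e}_{ri} - \bar{e}_{Ri}\right)^2 + \left(\bar{e}_{Ri} - e_i\right)^2, \tag{1}$$

where $e_i$ denotes the true conditional error for the i th simulation, $\hat{e}_{ri}$ denotes the r th kCV estimate of $e_i$ for the i th simulation, and $\bar{e}_{Ri} = (1/R)\sum_{r=1}^{R}\hat{e}_{ri}$ is the mean estimate of the i th true conditional error over R re-samples. The two terms on the right-hand side of (1) are the variance and (squared) bias components of the MSE. The average MSE over N simulations is given by

$$\overline{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{MSE}_i, \tag{2}$$

which can be decomposed into average variance and average (squared) bias components,

$$\overline{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{R}\sum_{r=1}^{R}\left(\hat{e}_{ri} - \bar{e}_{Ri}\right)^2 + \frac{1}{N}\sum_{i=1}^{N}\left(\bar{e}_{Ri} - e_i\right)^2 = \overline{\mathrm{VAR}} + \overline{\mathrm{BIAS}^2}. \tag{3}$$

In (3), $\overline{\mathrm{VAR}}$ is the average variance and $\overline{\mathrm{BIAS}^2}$ is the mean squared bias (MSB) over N simulations. It is noted that the two components in (3) are analogous to the pooled variance and lack-of-fit components in linear regression where there are R observations at each of N values of an independent variable.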
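The decomposition of the MSE into variance and squared-bias components in (1) to (3) can be checked numerically. The sketch below assumes the per-simulation variance uses divisor R, which is what makes the identity exact; the true errors and the estimates are random placeholders, not simulation output from this study, and the function name is hypothetical.

```python
import random
import statistics


def mse_components(estimates, true_error):
    """Return (MSE_i, VAR_i, BIAS_i) for one simulation, using divisor R."""
    r = len(estimates)
    mean_est = sum(estimates) / r
    mse = sum((e - true_error) ** 2 for e in estimates) / r
    var = sum((e - mean_est) ** 2 for e in estimates) / r
    bias = mean_est - true_error
    return mse, var, bias


rng = random.Random(0)
N, R = 200, 50  # hypothetical numbers of simulations and repetitions
per_sim = []
for _ in range(N):
    e_i = rng.uniform(0.2, 0.4)  # placeholder true conditional error
    est = [e_i + rng.gauss(0.02, 0.05) for _ in range(R)]  # placeholder CV estimates
    per_sim.append(mse_components(est, e_i))

# Averages over the N simulations: mean MSE, mean variance, mean squared bias.
mean_mse = statistics.fmean([m for m, _, _ in per_sim])
mean_var = statistics.fmean([v for _, v, _ in per_sim])
msb = statistics.fmean([b ** 2 for _, _, b in per_sim])
print(mean_mse, mean_var + msb)  # the two values agree up to rounding
```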
With BCV, like kCV, it is possible to calculate the MSE in (1) for each value of the true conditional error (k < n). Each of the R bootstrap samples is drawn first, and then each of the n observations in the with-replacement sample is left out one at a time to get an estimate of the prediction error. With BT632, however, it is not possible to calculate the MSE in (1) because only one estimate of the true conditional error can be calculated from the R bootstrap samples in each of the N simulation runs. That is, with BT632 each of the n samples is left out one at a time and then R bootstrap samples are drawn from the remaining n-1 samples. These R bootstrap samples give an estimate of the prediction error for the left-out observation, and the average of these estimates over the n samples is the BT632 estimate. (Efron and Tibshirani [9] presented an efficient algorithm for computing BT632 that uses only R total bootstrap samples instead of R × n samples; the expected number of bootstrap samples used to estimate the prediction error for each left-out observation is (1-0.632) × R.) Hence, the decomposition in (3) can be achieved with both kCV and BCV, but not with BT632.
Of necessity, because of the construction of BT632 and associated estimators, Efron and Tibshirani [9] used only the MSB (second term in (3)) to evaluate the performance of cross-validation methods, where $\bar{e}_{Ri}$ for BT632 has a different connotation than for kCV and BCV, but is still an average calculated from R (or fewer) bootstrap samples per observation. Similarly, Molinaro et al. [7] employed the MSB in their investigation. Although it was not explicitly shown, in both papers the MSB was further decomposed as

$$\overline{\mathrm{BIAS}^2} = \frac{1}{N}\sum_{i=1}^{N}\left[\left(\bar{e}_{Ri} - e_i\right) - \left(\bar{e}_N - \bar{e}\right)\right]^2 + \left(\bar{e}_N - \bar{e}\right)^2, \tag{4}$$

where $\bar{e}_N = \sum_{i=1}^{N}\bar{e}_{Ri}/N$ and $\bar{e} = \sum_{i=1}^{N}e_i/N$, with interpretations of results based on the standard deviation of the bias, SD(BIAS), and the average bias, $\overline{\mathrm{BIAS}} = \bar{e}_N - \bar{e}$, but using different notation. Although the simulation study conducted by Fu et al. [10] provided information on the variance and bias components of the $\overline{\mathrm{MSE}}$ in (3) with respect to the BCV estimator of the true conditional error rate, the information on variance was not used in the comparison with LOOCV and BT632, as it was not possible to obtain equivalent information with the latter two methods. Instead, Fu et al. [10] presented information on components comparable to those of the MSB in (4), but defined on a relative basis. In the simulation study reported here, in addition to the information provided by the squared-bias component and its sub-components in (4), the information that both BCV and kCV provide on the variance component of the $\overline{\mathrm{MSE}}$ in (3) has been compared. To make the BCV-kCV comparison as fair as possible, the number of recomputations, i.e., the number of retrainings of a classifier, was equalized for BCV and kCV. The purpose was to equalize information rather than to equalize computational effort (see [9,11]). As with the study of Fu et al. [10], the present comparison was restricted to low-dimensional predictor spaces.
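The split of the MSB into a spread term and a squared average bias can be illustrated with a small numerical check. This is a sketch under the assumption that the spread term uses divisor N, so that the identity holds exactly; a standard deviation computed with divisor N-1 would differ by the factor (N-1)/N. The bias values are hypothetical.

```python
import statistics


def split_msb(biases):
    """Split the mean squared bias into (spread of the bias) + (average bias)^2."""
    mean_bias = statistics.fmean(biases)
    spread = statistics.pvariance(biases, mu=mean_bias)  # divisor N (assumption)
    msb = statistics.fmean([b * b for b in biases])
    return msb, spread, mean_bias


# Hypothetical per-simulation biases (mean CV estimate minus true conditional error).
biases = [-0.04, 0.01, 0.03, -0.02, 0.05]
msb, spread, mean_bias = split_msb(biases)
print(msb, spread + mean_bias ** 2)  # equal up to rounding
```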

Monte Carlo simulation study
It was assumed that there were two populations (classes) defined by p ≥ 1 predictors or features having underlying Gaussian distributions [10]. The first population was assumed to be distributed $N(\mu_1, \Sigma_1)$ with $\mu_1 = 0_{(p)}'$ and the second $N(\mu_2, \Sigma_2)$, where $0_{(p)}$ is the p-dimensional zero vector and $\Delta_{(p)}$ is a p-dimensional vector of non-zero constants, Δ. The structure of $\mu_2$ is a modified configuration of Friedman [12]. In addition to the equal variance case studied by Fu et al. [10], where $\Sigma_1 = \Sigma_2 = I_{(p)}$ (the p×p identity matrix), here the case of unequal population variances was also studied, where $\Sigma_1 = I_{(p)}$ and $\Sigma_2 = 2I_{(p)}$. Independence among predictors, as reflected by $\Sigma_1 = I_{(p)}$ and $\Sigma_2 = 2I_{(p)}$, was assumed in order to be consistent with Fu et al. [10]. Given the mean structures, any positive correlation among predictors would simply decrease the generalized Mahalanobis distance between the two populations while negative correlation would increase the distance.
Feature dimensions of p = 1 and 5 were simulated, along with Δ = 1 and 3. For p = 1, sample sizes of n = 20, 50 and 100 were simulated (n/2 in each class), while for p = 5, only n = 50 and 100 were considered. Whereas Fu et al. [10] used quadratic discriminant analysis (QDA) to classify samples for some comparisons and a k-nearest neighbor (k-NN) classifier for others, here QDA was used for all comparisons in light of the low dimensionality. For higher dimensions, where p > n, a method like k-NN would be required.
As mentioned above, there is only one way to do LOOCV with a given sample. On the other hand, BCV as defined by Fu et al. [10] uses an average of the LOOCV prediction errors over B bootstrap (re)samples. Hence, BCV is based on B × n recomputations (retrainings of the classifier) while LOOCV is based on only n recomputations. For a more extensive comparison, three approaches were taken here in order to compare BCV to k-fold CV (henceforward kCV), as summarized in Table 1.
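The contrast between LOOCV and BCV described above can be sketched as follows for a single feature. This is an illustration under stated assumptions, not the implementation of Fu et al. [10]: the one-dimensional QDA with equal priors, the rejection of bootstrap samples with fewer than three distinct observations per class, the toy data and all function names are choices made for this sketch.

```python
import math
import random
import statistics


def qda_fit(xs, ys):
    """1-D QDA with equal priors (assumption): a mean and a variance per class."""
    model = {}
    for c in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == c]
        model[c] = (statistics.fmean(vals), statistics.variance(vals))
    return model


def qda_predict(model, x):
    """Pick the class with the larger Gaussian log-density at x."""
    def log_density(c):
        mu, var = model[c]
        return -0.5 * math.log(var) - (x - mu) ** 2 / (2 * var)
    return max(model, key=log_density)


def loocv_error(xs, ys):
    """Leave each observation out once: n retrainings in total."""
    n = len(xs)
    wrong = 0
    for i in range(n):
        model = qda_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        wrong += qda_predict(model, xs[i]) != ys[i]
    return wrong / n


def enough_distinct(xs, ys, minimum=3):
    """Require a minimum number of distinct values in each of the two classes."""
    return all(len({x for x, y in zip(xs, ys) if y == c}) >= minimum for c in (0, 1))


def bcv_error(xs, ys, B, rng):
    """Average LOOCV error over B bootstrap samples: B x n retrainings."""
    n = len(xs)
    errors = []
    while len(errors) < B:
        pick = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        bx, by = [xs[i] for i in pick], [ys[i] for i in pick]
        if enough_distinct(bx, by):
            # Copies of a left-out point may remain in its own training set.
            errors.append(loocv_error(bx, by))
    return statistics.fmean(errors)


# Toy data (illustrative only): n = 20, ten samples per class.
data_rng = random.Random(2)
xs = [data_rng.gauss(0, 1) for _ in range(10)] + [data_rng.gauss(1, 1) for _ in range(10)]
ys = [0] * 10 + [1] * 10
loo = loocv_error(xs, ys)
bcv = bcv_error(xs, ys, B=50, rng=random.Random(3))
print(f"LOOCV error {loo:.3f}, BCV error {bcv:.3f}")
```

Because a bootstrap sample usually contains duplicates, leaving out one copy of an observation does not remove it from the training set, which is the mechanism behind the downward bias of BCV.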
First, LOOCV was compared to BCV as was done by Fu and colleagues [10]. These methods are denoted by kCVn and BCVn, respectively, in Table 1. Second, n/2-fold CV (leave-two-out CV, denoted kCVn/2) was done in order to stay as close as possible to LOOCV (kCVn) while allowing multiple retrainings of the classifier. To keep the number of recomputations the same as for BCV, 2 × B repetitions of kCVn/2 were run (2×B×n/2 = B × n). Also, a version of BCV based on n/2-fold CV (BCVn/2) was implemented with 2 × B repetitions for a head-to-head comparison with kCVn/2 based on the same number, B×n, of total retrainings. Third, a version of BCV based on 10-fold CV (BCV10) was implemented and compared to traditional 10-fold CV (kCV10), where again the number of recomputations was the same. Here, both BCV10 and kCV10 were based on B×n/10 repetitions for a total of B×n retrainings. BCVn/2 and BCV10 were defined like kCVn/2 and kCV10, except that in each repetition, a bootstrap sample of size n was randomly divided into n/2 or 10 subsets, while with kCVn/2 and kCV10 the original n observations were randomly re-divided into n/2 or 10 subsets (these are the same when n = 20). In this study, B = 50 [9,10].
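The equalization of retrainings described above is simple arithmetic, tallied in the sketch below for B = 50 and the three sample sizes. The repetition counts are taken from the text; the dictionary layout and function name are only for illustration.

```python
B = 50  # number of bootstrap samples used in this study


def repetitions_and_retrainings(n, B=B):
    """Map each method to (repetitions, retrainings per repetition)."""
    return {
        "kCVn":   (1, n),                # LOOCV: one unique partition
        "BCVn":   (B, n),                # LOOCV on each of B bootstrap samples
        "kCVn/2": (2 * B, n // 2),       # leave-two-out, 2B repetitions
        "BCVn/2": (2 * B, n // 2),
        "kCV10":  (B * n // 10, 10),     # 10-fold, B*n/10 repetitions
        "BCV10":  (B * n // 10, 10),
    }


for n in (20, 50, 100):
    for method, (reps, per_rep) in repetitions_and_retrainings(n).items():
        print(f"n={n:3d}  {method:7s} repetitions={reps:3d}  retrainings={reps * per_rep}")
```

Every method other than LOOCV ends up with B × n retrainings (1000, 2500 and 5000 for n = 20, 50 and 100), while LOOCV is fixed at n.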
The simulation study was implemented as follows. For each combination of p and Δ, a "super-population" of size 10000 was drawn, 5000 from $N(\mu_1, \Sigma_1)$ and 5000 from $N(\mu_2, \Sigma_2)$. Then, for each value of n, N = 1000 simulations were run. For each simulation run, a stratified random sample of size n was drawn without replacement, n/2 observations from the 5000 $N(\mu_1, \Sigma_1)$ population values and n/2 observations from the 5000 $N(\mu_2, \Sigma_2)$ population values. The QDA classifier was trained on the sample. Following Molinaro et al. [7], the true conditional error rate for each classifier was calculated as the proportion of times the trained classifier misclassified the remaining 10000-n members of the super-population. Then kCVn, BCVn, kCVn/2, BCVn/2, kCV10 and BCV10 were each conducted on the sample to estimate the true conditional error. Their MSE, variance and bias were calculated from expression (1).
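One simulation run of the kind described above, up to the calculation of the true conditional error, might look like the following sketch for p = 1 in the equal-variance case. Several details are assumptions made for illustration: the second class mean is taken to be Δ, the QDA uses equal priors, and the seeds and function names are arbitrary. The cross-validation step that follows in each run is omitted.

```python
import math
import random
import statistics


def qda_fit(xs, ys):
    """1-D QDA with equal priors (assumption): a mean and a variance per class."""
    model = {}
    for c in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == c]
        model[c] = (statistics.fmean(vals), statistics.variance(vals))
    return model


def qda_predict(model, x):
    """Pick the class with the larger Gaussian log-density at x."""
    def log_density(c):
        mu, var = model[c]
        return -0.5 * math.log(var) - (x - mu) ** 2 / (2 * var)
    return max(model, key=log_density)


def true_conditional_error(delta, n, pop_per_class=5000, seed=0):
    """Train on a stratified sample and test on the rest of the super-population."""
    rng = random.Random(seed)
    pop0 = [rng.gauss(0.0, 1.0) for _ in range(pop_per_class)]    # class 1: N(0, 1)
    pop1 = [rng.gauss(delta, 1.0) for _ in range(pop_per_class)]  # class 2: N(delta, 1), assumed
    idx0 = set(rng.sample(range(pop_per_class), n // 2))  # without replacement
    idx1 = set(rng.sample(range(pop_per_class), n // 2))
    train_x = [pop0[i] for i in idx0] + [pop1[i] for i in idx1]
    train_y = [0] * (n // 2) + [1] * (n // 2)
    model = qda_fit(train_x, train_y)
    wrong = sum(qda_predict(model, x) != 0 for i, x in enumerate(pop0) if i not in idx0)
    wrong += sum(qda_predict(model, x) != 1 for i, x in enumerate(pop1) if i not in idx1)
    return wrong / (2 * pop_per_class - n)


err_far = true_conditional_error(delta=3.0, n=50)
err_near = true_conditional_error(delta=1.0, n=50)
print(f"true conditional error: delta=3 -> {err_far:.3f}, delta=1 -> {err_near:.3f}")
```

The value returned depends on the particular training sample drawn, which is why the true conditional error differs from run to run even when p, Δ and n are held fixed.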
For p=1 it was found that at least four distinct observations were needed in each class to avoid numerical problems in training the QDA classifier for the BCVn/2 and BCV10 methods. Hence, this requirement was imposed on all three BCV methods. (Fu et al., [10], required at least three distinct observations in each class for the original BCV method, BCVn.) In addition, for p=5, the BCV methods were implemented with stratified sampling, i.e., n/2 bootstrap samples from each class, along with a requirement of at least eight distinct observations in each class.
The mean and standard deviation of the MSE, variance, and bias, as well as the MSB over the N = 1000 simulations were calculated for BCV and kCV. With R representing the number of repetitions of each method (Table 1, column 2), the means are the averages over the N simulations of the per-simulation MSE, variance, bias and squared bias from (1), and the three standard deviations for each method, SD(MSE), SD(VAR) and SD(BIAS), are the standard deviations of the per-simulation MSE, variance and bias across the N simulations. The means and standard deviations defined in (5)-(11) were used to compare the performance of the methods. R Version 2.6.0 was used to conduct the Monte Carlo simulation study, with an independently written SAS/IML program being used to verify the mean calculations for the equal-variance case with p=1 [13,14].

Results and discussion
The results of the simulation study are summarized in Tables 2 and 3 and Figures 1, 2, 3, 4. Table 2 is the same case as covered in Table 1 of Fu and colleagues [10]. For brevity, all configuration results are discussed but only a limited portion of the results is displayed in Tables 2 and 3 (i.e., cases for LOOCV, BCVn, kCV10, BCV10 where $\Sigma_1 = \Sigma_2 = I_{(p)}$). The interested reader is referred to the supplementary material section for the tables in their entirety and for the cases where $\Sigma_1 = I_{(p)}$ and $\Sigma_2 = 2I_{(p)}$.
Beginning with the MSB (i.e., $\overline{\mathrm{BIAS}^2}$), which is the criterion used by Efron and Tibshirani [9], Molinaro et al. [7], Fu et al. [10] and Kim [11] to compare estimators of the true conditional classification error, it is shown in Table 2 that for p = 1 the MSB of kCV is always larger than that of BCV. In terms of the components of the MSB, this is due to a larger SD(BIAS) for kCV than BCV, although the $\overline{\mathrm{BIAS}}$ of BCV tends to be negative and is generally larger than that of kCV in absolute value for Δ = 3. These results are consistent for configurations with n=20 (see supplementary material, Additional file 1: Table S1 and Additional file 2: Table S2). For p = 5, the same pattern is shown in Table 3 for the MSB for Δ = 3, but the reverse is shown for Δ = 1, i.e., the MSB is larger for BCV, where the negative $\overline{\mathrm{BIAS}}$ of BCV is very pronounced. Thus, the variation of BCV, as measured by SD(BIAS), is indeed reduced compared to kCV, although a price is paid in terms of increased $\overline{\mathrm{BIAS}}$. Again, the results for equal and unequal covariance matrices are consistent (see supplementary material, Additional file 3: Table S3 and Additional file 4: Table S4).
Table 2 Simulation results for p = 1, $\Sigma_1 = \Sigma_2 = I_{(1)}$
The individual values of $(\bar{e}_{Ri} - e_i)$ that contribute to $\overline{\mathrm{BIAS}}$ and SD(BIAS) for two different simulation conditions are plotted in Figures 1 and 2 for each of N = 1000 simulations. Figures 1a and 1b are for BCV and kCV, respectively, for p=1, n=50 and Δ = 1, from Table 2. Figures 2a and 2b represent corresponding plots for p=5 from Table 3. Figure 1 represents one of the best configurations for BCV compared to kCV while Figure 2 represents one of the worst scenarios. As the figures show, the individual estimates of the true conditional error, $e_i$, are extremely variable across the 1000 simulations. The variance of BCV is indeed less than that of kCV, but the negative bias of BCV can be substantial as the dimensionality of the feature space increases.

Why there is large variation in general
The large variation shown in Figures 1 and 2, along with the correspondingly large values of SD(BIAS) in Tables 2 and 3 for both kCV and BCV, is consistent with results of Efron and Tibshirani ([9], Tables 3 to 8 on pages 554-556) and Molinaro et al. ([7], Tables 1 and 4 on pages 3304 and 3305), both of whom showed large standard deviations and, in some cases, large values of bias, for the classifiers and error estimation methods they studied. In fact, Efron and Tibshirani [9] noted that none of the methods correlates very well with the conditional error rate on a sample-by-sample basis. This lack of correlation in the present investigation, reflected by the large values of SD(BIAS), appears to be partly due to a problem with the way the true classification error is defined and estimated. As mentioned in the Background, the problem appears to be that the quantity purportedly being estimated, the true misclassification error of the trained classifier conditional on the training set, is defined as a single fixed quantity for a given set of data.

Possible alternative approach for estimation of SD(BIAS)
It does not seem logical to take the misclassification error as a fixed quantity and then use cross-validation to estimate it, because the true conditional error for any classifier trained using only part of the data within a cross-validation is not the same as the true conditional error of the classifier trained on the complete set of data, i.e., the quantity to be estimated. This leads to an inflated estimate of SD(BIAS). Although it might prove to be computationally prohibitive, it seems more logical to define and calculate a true conditional error for each training set within each of the k partitions of a cross-validation, say $e_{ijr}$ (i = 1, ..., N; j = 1, ..., k; r = 1, ..., R), and then obtain a corresponding estimate, $\hat{e}_{ijr}$. Each difference, $\hat{e}_{ijr} - e_{ijr}$, would represent an estimate of the expected bias in estimating a true conditional error so defined. So, even though the conditional error itself would change from partition to partition, one could still obtain a sample of estimates of the bias in estimating such an error. The variation among these bias estimates would be expected to be less than that represented by SD(BIAS) in Tables 2 and 3 and reflected in Figures 1 and 2 with the customary method, because a source of variation heretofore not taken into account would be eliminated. This "more logical" approach provides insight into how the variation in the bias is artificially inflated when one attempts to use cross-validation to estimate a single, fixed "true conditional error" of a trained classifier. Attempting to estimate the elusive true conditional error is not recommended. Instead, a classifier's generalization error in predicting future observations is the error that should be estimated, and is the error for which cross-validation is well-suited.

Average bias estimates are representative
On the other hand, even though individual-run biases are likely overstated because of the inflated variance when defined in terms of a fixed true conditional error, the average bias, $\overline{\mathrm{BIAS}}$, calculated in the customary way ought to be representative of the average bias that would be reflected if the "more logical" method described above were used. For this reason, plots like Figure 3 of Efron and Tibshirani [9] of the average relative bias in terms of the expected true error are useful for comparing error estimation methods, even though the individual true conditional errors defined the usual way may not be estimated with precision. Figures 3a to 3d mimic Figure 3 of Efron and Tibshirani [9]. The plotted points are values of $\overline{\mathrm{BIAS}}/\bar{e}$, which is equivalent to $(\bar{e}_N - \bar{e})/\bar{e}$, for each of the twenty simulation configurations in Tables 2 and 3 and supplementary material Additional file 1: Table S1, Additional file 2: Table S2, Additional file 3: Table S3 and Additional file 4: Table S4, where each point is the average of N = 1000 values like those plotted in Figures 1 and 2. For example, the open triangles plotted in Figure 3a and 3c correspond to Figures 1 and 2, respectively. These figures show a consistent, but modest, positive relative bias for kCV and a consistent, sometimes large, negative relative bias for BCV. In particular, as shown in Figure 3, as the sample size increases, differences in relative bias between kCV and BCV decrease for both p=1 and p=5. This agrees with the result of a numerical experiment by Davison and Hall [15] with p=3 and n=(20, 40, 80) in a comparison of bootstrap and LOOCV estimates of discrimination error. Even so, the relative bias of BCV in Figure 3 for p=5 is still substantially negative when n=100. For p=3, Davison and Hall [15] observed a similar decrease in disagreement between the methods as the distance between populations increased. For p=5, Figure 3 also shows that effect going from Δ = 1 to Δ = 3.
Figure 3 (panels A to D) The mean relative bias for each of the twenty simulation configurations in Tables 2 and 3 and in Additional file 1: Table S1, Additional file 2: Table S2, Additional file 3: Table S3 and Additional file 4: Table S4, where each point is the average of N = 1000 values, $\overline{\mathrm{BIAS}}/\bar{e}$, like those plotted in Figures 1 and 2.

Impact of BCV being negatively biased
The negative bias of BCV, i.e., underestimation of the true error, can be explained by the fact that the probability that a test sample appears in the training set is $1-(1-1/n)^n \approx 0.632$. Borrowing the words of Efron and Tibshirani [9] to describe this phenomenon, BCV "uses training samples that are too close to the test points, leading to potential underestimation of the error rate." It is important to note that the bootstrap methods of Efron and Tibshirani do not include test points in the training set.
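The 0.632 figure quoted above is the probability that a given observation appears at least once in a with-replacement sample of size n, and it approaches 1 - 1/e as n grows. The short sketch below evaluates the formula for the sample sizes used in this study and checks it against a small Monte Carlo draw; the function names, number of draws and seed are arbitrary choices for illustration.

```python
import math
import random


def prob_in_bootstrap(n):
    """Probability that a given observation appears in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_fraction(n, draws, seed=0):
    """Fraction of bootstrap samples that contain observation 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        sample = {rng.randrange(n) for _ in range(n)}
        hits += 0 in sample
    return hits / draws


for n in (20, 50, 100):
    print(n, round(prob_in_bootstrap(n), 4))
print("limit:", round(1.0 - math.exp(-1.0), 4))
print("simulated, n=50:", simulated_fraction(50, 20000))
```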
The substantial negative bias of BCV means that BCV tends to underestimate the classification error on average. While the direction and magnitude of the bias of a cross-validation method might not matter a great deal if the performances of several competitive classification procedures are being compared, it definitely matters if the error rate of a specific classification procedure is of interest. Substantial negative bias, translating to underestimation of the true misclassification error, would be a serious concern. To expound on the sizable negative bias of BCV, Figure 4 shows plots of $\overline{\mathrm{BIAS}}/\mathrm{SD(BIAS)}$ for the same simulations as Figure 3. Applying the rule of thumb of Efron and Tibshirani [16] to bias estimation, the horizontal reference lines at 0.25 in each panel represent thresholds of acceptable relative bias. For p=1, both kCV and BCV satisfy the threshold, except for one instance where BCV exceeds the threshold slightly when n=20. However, for p=5, BCV always exceeds the threshold while kCV is always below the threshold. When Δ = 1, all four BCV relative biases exceed 1, i.e., they are more than four times the 0.25 threshold. Alternatively, a relative-bias plot could be constructed using averages of the components of the MSE by plotting the ratio $\overline{\mathrm{BIAS}}/\sqrt{\overline{\mathrm{VAR}}}$. This would show the same general result as Figure 4, but would be less pronounced because of the propensity for increased $\overline{\mathrm{VAR}}$ of BCV compared to kCV for estimating individual $e_i$ (i = 1, ..., N) (Tables 2, 3) due to the positive covariance induced by with-replacement sampling with the BCV method.
Figure 4 (panels A to D) The mean relative bias expressed as $\overline{\mathrm{BIAS}}/\mathrm{SD(BIAS)}$ for each of the twenty simulation configurations in Tables 2 and 3 and in Additional file 1: Table S1, Additional file 2: Table S2, Additional file 3: Table S3 and Additional file 4: Table S4.

Assessing reproducibility of error estimates
Because both BCV and kCV can be repeated multiple times, as they have been in the present simulation study, they can give information on the reproducibility among repeated cross-validations. The values of $\overline{\mathrm{MSE}}$, SD(MSE), $\overline{\mathrm{VAR}}$ and SD(VAR) in Tables 2 and 3 provide such information on the reproducibility of BCV and kCV from CV-run to CV-run. When kCV is used in practice, where there is only a single set of training data, either VAR or $\sqrt{\mathrm{VAR}}$ is the commonly reported value along with the average error, $\bar{e}_R = \sum_{r=1}^{R}\hat{e}_r/R$, or its complement, the average accuracy [1,5]. Because the purpose of cross-validation is to assess a classifier's ability to generalize outside the training set, the variation from CV-run to CV-run is an important measure of performance. Note that even though the present problem may be ill-defined such that the average biases for individual simulation runs are exaggerated, the values of $\overline{\mathrm{VAR}}$ are unaffected and correctly reflect the degree of reproducibility of the generalization error estimate.

Fair comparison requires equalization of number of trainings
In this study, there were 100 to 500 repetitions of each method (1000 to 5000 retrainings) in order to put BCV and kCV on the same footing with respect to the number of retrainings of classifiers [9,11]. This is many more repetitions than the ten or twenty repetitions normally done with kCV10. Although nowadays CPU time is relatively inexpensive, 100 to 500 repetitions may be excessive. On the other hand, although the BT632 method of Efron and Tibshirani [9] did not perform as well overall as BCV in the study of Fu et al. [10], it did show competitive behavior in some cases. It seems likely that, if the number of retrainings were equalized while employing the economical algorithm of Efron and Tibshirani [9], the competitiveness of BT632 evaluated in terms of average squared bias and its component parts would improve. As Kim [11] reported recently, the BT632+ method based on 50 bootstraps performed better than 5 repetitions of kCV10 in terms of average squared bias for a pruned tree classifier, although it, too, had a downward bias.

Microarray example
We evaluated the performance of kCV and BCV in predicting prognosis based on the gene expression profiles of breast cancer patients previously reported by van't Veer and colleagues [17,18]. The van de Vijver et al. [17] study consisted of 295 patients with stage I or II breast cancer, and patients' prognosis and gene expression data are publicly available at http://microarray-pubs.stanford.edu/would_NKI/explore.html. While the dataset contained a 70-gene prognosis profile, we chose to perform our evaluation of kCV and BCV using only 5 genes based on a simple gene selection procedure using a t-statistic with adjusted p-values [19]. In the study of Fu et al. [10], the authors chose 5 genes that were most highly correlated with the patient's prognosis. As noted in Fu et al. [10], such a gene selection procedure is prone to bias. However, the purpose of this evaluation is not gene selection. Furthermore, in practice only a small subset of genes is often of clinical interest.
Following the steps taken in Fu et al. [10] for comparison, the subsequent steps were carried out: (1) take a random sample S of size n = 50 with half of the patients having good prognosis and half having poor prognosis, (2) train a QDA classifier based on the random sample S and compute its true conditional error rate based on the proportion of times the trained classifier misclassified the remaining samples, (3) for each random sample S, estimate the true conditional error for LOOCV, CVn/2, CV10, BCV, BCVn/2, and BCV10, (4) calculate their MSE, variance and bias, (5) repeat over 1000 simulation runs and calculate the mean and standard deviation of the MSE, variance, bias and MSB.
The results presented in Table 4 and Figure 5 are consistent with our simulation results in Table 3 for Δ = 1. More specifically, the MSB is larger for BCV and the negative $\overline{\mathrm{BIAS}}$ of BCV is evident. Figure 5 certainly demonstrates that BCV is less variable, but as previously noted, this advantage is negated by the considerable bias and overall MSB. Furthermore, given this microarray example data, the $\overline{\mathrm{MSE}}$ and $\overline{\mathrm{VAR}}$ for the BCV methods are higher than the corresponding quantities for the kCV counterparts.

Conclusions
Cross-validation is a widely accepted and sound practice for estimating the generalization error of a classifier. Of course, for small data sets with high-dimensional predictors, especially for p > n, the variation among cross-validated error estimates can be large. For methods like BCV and kCV that can be replicated, it is generally accepted that cross-validation should be repeated 10 to 30 times to account for variation. However, using cross-validation to estimate the fixed misclassification error of a trained classifier conditional on the training set is problematic and should not be attempted. Although Monte Carlo simulation of this estimation exercise can correctly represent the average bias, it will overstate the variance of the bias. For the low-dimensional conditions simulated in the present study, kCV showed a consistent, but modest, positive bias. Conversely, BCV showed a consistent, and sometimes substantial, negative bias, which was much more pronounced for p=5 than for p=1. Increasing the complexity of the simulation to incorporate higher dimensions would only magnify the effect. The bias of BCV is too high a price to pay for its reduced variance; k-fold CV is recommended.

Competing interests
The authors declare that they have no competing interests relevant to this article to disclose.

Authors' contributions
SO and RLK conceived the problem and designed the simulations for the manuscript. SYL and HJS were in charge of the computational coding. All authors were involved in drafting the manuscript. All authors read and approved the final manuscript.

Figure 5 The individual values of $(\bar{e}_{Ri} - e_i)$ that contribute to $\overline{\mathrm{BIAS}}$ and SD(BIAS) for each of N = 1000 simulation runs using the breast cancer data.