Biomarker selection for medical diagnosis using the partial area under the ROC curve

Background: A biomarker is usually used as a diagnostic or assessment tool in medical research. Finding an ideal biomarker is not easy, and combining multiple biomarkers provides a promising alternative. Moreover, some biomarkers based on the optimal linear combination do not have enough discriminatory power. The aim of this study was therefore to find the significant biomarkers based on the optimal linear combination maximizing the pAUC for assessment of the biomarkers.

Methods: Under the binormality assumption, we obtain the optimal linear combination of biomarkers maximizing the partial area under the receiver operating characteristic curve (pAUC). Related statistical tests are developed for assessment of a biomarker set and of an individual biomarker. Stepwise biomarker selections are introduced to identify the biomarkers of statistical significance.

Results: The results of a simulation study and three real examples (Duchenne Muscular Dystrophy disease, heart disease, and breast tissue) show that our methods are most suitable for biomarker selection in data sets with a moderate number of biomarkers.

Conclusions: Our proposed biomarker selection approaches can be used to find the significant biomarkers based on hypothesis testing.

However, because ( ) has a unique maximum, it follows that On the other hand, by (S5), for sufficiently large 1 , Hence, for a given * * > 0, there exists an 1 such that which contradicts (S6). Hence, it implies that � . .

Proof of Lemma 1.
Since ( | ) 2 < ∞, by SLLN, as → ∞, Consequently, for any fixed ∈ , Hence, for any fixed ∈ , � ( ) → ( ) with probability 1. uniformly bounded in ∈ with probability 1. That is, for any ∈ , there exists , which is free of and converges as n goes to infinity, such that

Proof of Lemma 2.
, and (•) is the density function of the standard normal distribution. Then for any fixed ∈ , as → ∞, Since � 0 , � 1 both are symmetric positive definite matrices, and ‖ ‖ = 1, then where , 1 , and * , 1 * are the smallest and the largest eigenvalues of � 0 , � 1 , where is free of and converges as n goes to infinity. Hence, � uniformly bounded in with probability 1.

Simulations of Three and Four Biomarkers
Consider $p = 3, 4$ biomarkers. Assume the mean vector $\mu_0 = 0$ in the non-diseased group and $\mu_1 = \Delta = (\Delta_1, \ldots, \Delta_p)$ in the diseased group. The covariance matrices of both groups follow a common form; the population settings can be found in Table 5 of the article. Table S1 reports the true value of the best linear combination, together with the empirical mean and standard error of its estimate based on 1000 replications, denoted by True, AVE and SE. Tables S2, S3 and S4 present the proportions of outcomes from the two biomarker selection methods among the 1000 replicates: Table S2 reports the cases of $p = 3$, while Tables S3 and S4 give the cases of $p = 4$. In each scenario, the figure in boldface corresponds to the most likely outcome. Table S5 gives the results of the two biomarker selection methods for the DMD and atherosclerotic coronary heart disease examples using the raw data. In the DMD example, , are selected. In the heart disease example, only lutein is found to be a statistically significant biomarker.
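As a rough illustration of what a "True" best linear combination means under binormality, the following Python sketch evaluates the pAUC of a linear score a'X for multivariate normal groups and searches over unit-norm coefficient vectors. Everything specific here is an assumption made for illustration: the mean vectors, the covariance matrices, the limit t = 0.1, the convention that larger scores indicate disease, and the crude random search, which stands in for whatever optimizer the article actually uses. The values are hypothetical and are not the settings of Table 5.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def binormal_pauc(m0, s0, m1, s1, t=0.1, n=200):
    """Midpoint-rule pAUC over false positive rates in (0, t) for normal scores.

    Assumes larger scores indicate disease, so that
    ROC(u) = Phi((m1 - m0 + s0 * Phi^{-1}(u)) / s1).
    """
    h = t / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * h
        total += STD.cdf((m1 - m0 + s0 * STD.inv_cdf(u)) / s1)
    return h * total


def combo_pauc(a, mu0, cov0, mu1, cov1, t=0.1):
    """pAUC of the score a'X, where a'X is N(a'mu, a'Sigma a) in each group."""
    p = len(a)

    def mean(mu):
        return sum(a[i] * mu[i] for i in range(p))

    def sd(cov):
        return math.sqrt(sum(a[i] * cov[i][j] * a[j]
                             for i in range(p) for j in range(p)))

    return binormal_pauc(mean(mu0), sd(cov0), mean(mu1), sd(cov1), t)


def unit(v):
    """Rescale v to Euclidean norm 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Hypothetical three-biomarker population (NOT taken from Table 5).
mu0 = [0.0, 0.0, 0.0]
mu1 = [0.5, 1.0, 1.5]
cov0 = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 0.3, 1.0]]
cov1 = [[2.0, 0.5, 0.5], [0.5, 2.0, 0.5], [0.5, 0.5, 2.0]]

# Candidates: the three single-biomarker directions plus random unit vectors.
rng = random.Random(0)
axes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
candidates = axes + [unit([rng.gauss(0.0, 1.0) for _ in range(3)])
                     for _ in range(2000)]

best_a = max(candidates, key=lambda a: combo_pauc(a, mu0, cov0, mu1, cov1))
best_val = combo_pauc(best_a, mu0, cov0, mu1, cov1)
print("approximate best coefficients:", [round(x, 3) for x in best_a])
print("approximate maximal pAUC:", round(best_val, 4))
```

Because the single-biomarker directions are among the candidates, the reported combination is never worse than the best individual biomarker in this toy setting. A finer search or a proper numerical optimizer would be needed for an accurate value.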

Applications to Real Data Sets
The stepwise details for the standardized data of the breast tissue example are given in Table S6. To investigate the relationship between the pAUC and the marginal distributions of the individual biomarkers, Table S7 reports the sample mean and variance within the two groups, as well as the pAUC, in descending order of the absolute value of the coefficient in the optimal linear combination of the full data set. The corresponding density plots of each biomarker are given in Figures 1-9. In each plot, the vertical reference line x = c is determined by the given upper limit t = 0.1 of 1-specificity; given this cutoff, the pAUC integrates the tail probabilities of the diseased distribution. Moreover, Table S8 and Figures 10-11 present the characteristics of the optimal linear combinations of the reduced biomarker sets found by the two selection methods. From these results, we found that a biomarker with a homogeneous non-diseased population and a heterogeneous diseased population tends to have a higher pAUC value.

Figure 10. The distributions of the best linear combination by the Forward method for two groups.

Figure 11. The distributions of the best linear combination by the Backward method for two groups.
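The relationship between the cutoff, the tail probability of the diseased distribution, and the pAUC can be sketched for a single biomarker under normality. The Python snippet below is only an illustration: the group means and standard deviations are hypothetical, larger values are assumed to indicate disease, and the function names `cutoff`, `tail_prob` and `pauc` are introduced here and do not come from the article.

```python
from statistics import NormalDist


def cutoff(nondiseased, t=0.1):
    """Cutoff c such that the non-diseased group exceeds c with probability t."""
    return nondiseased.inv_cdf(1.0 - t)


def tail_prob(diseased, c):
    """Sensitivity at cutoff c: upper-tail probability of the diseased group."""
    return 1.0 - diseased.cdf(c)


def pauc(nondiseased, diseased, t=0.1, n=2000):
    """Midpoint-rule integral of the tail probabilities over 1-specificity in (0, t)."""
    h = t / n
    return h * sum(tail_prob(diseased, cutoff(nondiseased, (k + 0.5) * h))
                   for k in range(n))


# Hypothetical biomarkers, each given as (non-diseased, diseased) distributions.
baseline = (NormalDist(0.0, 1.0), NormalDist(1.0, 1.0))
spread_diseased = (NormalDist(0.0, 1.0), NormalDist(1.0, 3.0))    # heterogeneous diseased group
tight_nondiseased = (NormalDist(0.0, 0.5), NormalDist(1.0, 1.0))  # homogeneous non-diseased group

for name, (g0, g1) in [("baseline", baseline),
                       ("spread diseased", spread_diseased),
                       ("tight non-diseased", tight_nondiseased)]:
    print(name, "c =", round(cutoff(g0), 3), "pAUC =", round(pauc(g0, g1), 4))
```

In this toy setting, both a more variable diseased group and a less variable non-diseased group raise the pAUC relative to the baseline, which agrees with the pattern described above.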