- Research note
- Open Access
Disease prediction via Bayesian hyperparameter optimization and ensemble learning
BMC Research Notes volume 13, Article number: 205 (2020)
Early disease screening and diagnosis are important for improving patient survival. Thus, identifying early predictive features of disease is necessary. This paper presents a comprehensive comparative analysis of different Machine Learning (ML) systems and reports the standard deviation of the results obtained through sampling with replacement. The research focuses on (a) analyzing and comparing ML strategies used to predict Breast Cancer (BC) and Cardiovascular Disease (CVD) and (b) using feature importance ranking to identify early high-risk features.
The Bayesian hyperparameter optimization method was more stable than the grid search and random search methods. In a BC diagnosis dataset, the Extreme Gradient Boosting (XGBoost) model had an accuracy of 94.74% and a sensitivity of 93.69%. The mean radius of the cell nucleus in the Fine Needle Aspirate (FNA) digital image of a breast lump was identified as the most important predictive feature for BC. In a CVD dataset, the XGBoost model had an accuracy of 73.50% and a sensitivity of 69.54%. Systolic blood pressure was identified as the most important feature for CVD prediction.
Modern medical methods prevent disease through early intervention rather than treatment after diagnosis. Early screening and detection of diseases are major issues in the field of healthcare. BC and CVD are the most common diseases among women and elderly people, respectively [1,2,3]. Globally, approximately 1.3 million new cases of BC are reported each year. BC has the highest incidence in developed countries, but its incidence has also increased at an alarming rate in low- and middle-income countries. In addition, CVD accounts for approximately half of all deaths in most European countries. Early screening and diagnosis of BC and CVD are the most effective ways to detect early disease and reduce mortality [6, 7]. Logistic Regression (LR) with cross-validation has achieved a prediction accuracy of 96.2% for BC diagnosis, providing a basis for computer-aided diagnosis of breast cytology. Current machine learning approaches to BC and CVD prediction focus mainly on Support Vector Machine (SVM), Neural Network (NN), and Decision Tree (DT) models. In analyses of BC diagnosis datasets, Random Forest (RF) and SVM have achieved better prediction results than other algorithms. In particular, kernel-based SVM can achieve a classification accuracy of 83.68%. To avoid the problem of overfitting, a DT model with a Chi-square automatic interaction detector algorithm can be used for feature selection and classification, with an accuracy rate of 74.1%. The AUC value of a BC prediction model based on the fusion of the sequence forward selection algorithm and the SVM classifier can reach 0.9839. In a previous study, the LR model was used to predict BC using the same BC diagnostic dataset used in the present study, and an accuracy of 95.72% was reported.
Compared to the results of that study, our results have less risk of overfitting and greater generalization ability due to dimensionality reduction and the XGBoost algorithm. For the early diagnosis of CVD, statistical learning and intelligent algorithms provide good support; the accuracy of SVM classification can reach 90.5%. For the coronary heart disease dataset in the open database of the Framingham Heart Research Center, the AUC of the SVM algorithm can reach 0.75. Previously, XGBoost was used to predict the readmission rate for patients with ischemic stroke within 90 days after discharge and achieved a final AUC value of 0.782. Among several tested algorithms, XGBoost achieved the best classification performance on the dataset of the China Acute Myocardial Infarction Registry, yielding an AUC value of 0.899. Hyperparameters have a great impact on the classification performance of the XGBoost model. Therefore, in the present study, we compared hyperparameter optimization methods on two datasets, one for BC diagnosis and one for CVD diagnosis, that differ greatly in size. The logarithmic loss of fivefold cross-validation was used to measure the performance of the model under the corresponding parameters, and the prediction performances of XGBoost, Light Gradient Boosting Machine (LightGBM), Gradient Boosting Decision Tree (GBDT), LR, RF, Back Propagation Neural Network (BPNN), and DT models were compared. Repeated sampling was performed, and the standard deviation of the results was calculated.
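The model-selection criterion described above, the logarithmic loss averaged over fivefold cross-validation, can be illustrated with a minimal sketch. It uses only the Python standard library; the helper names (`log_loss`, `kfold_indices`, `cv_log_loss`), the toy labels, and the prior-rate baseline "model" are assumptions made for illustration and are not taken from the study.

```python
import math
import random

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cv_log_loss(y, fit_predict, k=5, seed=0):
    """Average validation log loss over k folds.

    fit_predict(train_idx, valid_idx) must return one probability per
    validation index; it stands in for training any classifier.
    """
    folds = kfold_indices(len(y), k, seed)
    losses = []
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        probs = fit_predict(train_idx, valid_idx)
        losses.append(log_loss([y[j] for j in valid_idx], probs))
    return sum(losses) / k

# Hypothetical toy labels and a baseline "model" that always predicts the
# positive rate seen in the training folds.
y = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

def prior_model(train_idx, valid_idx):
    rate = sum(y[j] for j in train_idx) / len(train_idx)
    return [rate] * len(valid_idx)

baseline_loss = cv_log_loss(y, prior_model)
```

A lower cross-validated log loss indicates better-calibrated probability predictions, which is why it can serve as the objective when comparing hyperparameter settings.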
The BC diagnosis dataset was obtained from the University of California, Irvine (UCI) Machine Learning Repository and contains a total of 569 data points. We used the mean values of the 10 characteristics of the cell nucleus. For malignant BC tumors, the target diagnosis in the dataset is encoded “M”; for benign tumors, it is encoded “B”. For our analyses, we converted “M” to “1” and “B” to “0”. The overall dataset diagnosis results are shown in Fig. 1a. The CVD dataset was derived from Kaggle’s public dataset, which includes 65,535 patient data records and 11 characteristics. The target class “cardio” is encoded as “1” if the patient has CVD and “0” if the patient is healthy. Additionally, the patient ID column, which does not contribute to the prediction, was deleted. The overall dataset diagnosis results are shown in Fig. 1b. Zero-mean normalization (Z score) was used to process the original data. The dataset was then divided into a training set (70% of the observations) and a test set (30% of the observations). The two datasets employed in this study were both used to evaluate classifier performance, but an unbalanced structure was observed in the BC dataset (Fig. 1a). Therefore, we compared multiple indicators, including F1 score, AUC, Kolmogorov-Smirnov (KS), Receiver Operating Characteristic (ROC) curve and Precision–Recall (PR) curve, among the different models. A key challenge in disease prediction is avoiding the misclassification of a diseased patient as disease-free. In addition to comparing the performance of the classifiers, we therefore also focused on the positive (sickness) judgment results. Because the ROC curve considers both positive and negative examples, it is suitable for evaluating the overall performance of the classifier. By comparison, the PR curve focuses on the positive examples.
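The preprocessing steps described above (label encoding, Z-score normalization, and a 70/30 split) can be sketched as follows. This is a standard-library illustration under stated assumptions: the function names, the toy feature values, and the random seed are hypothetical, and normalization is applied before splitting only because that is the order given in the text.

```python
import random
import statistics

def encode_diagnosis(labels):
    """Map 'M' (malignant) to 1 and 'B' (benign) to 0."""
    return [1 if label == "M" else 0 for label in labels]

def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Fall back to 1.0 so a constant column does not cause division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def train_test_split(rows, targets, test_fraction=0.3, seed=0):
    """Shuffle the sample indices and hold out test_fraction of them."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Hypothetical toy records: two numeric features per sample.
raw_rows = [[14.2, 0.10], [20.1, 0.25], [11.8, 0.07], [17.9, 0.19], [12.4, 0.08],
            [19.6, 0.22], [13.0, 0.09], [21.3, 0.28], [12.9, 0.11], [18.4, 0.20]]
raw_labels = ["B", "M", "B", "M", "B", "M", "B", "M", "B", "M"]

scaled = zscore_columns(raw_rows)
targets = encode_diagnosis(raw_labels)
x_train, x_test, y_train, y_test = train_test_split(scaled, targets)
```

Fitting the normalization statistics on the training portion only is a common alternative that avoids leaking test-set information; the sketch does not do this.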
The models, programming languages, and libraries used in this study are shown in Additional file 1. We trained all models using the Python programming language (version 3.7). A personal computer with an Intel(R) Core(TM) i5-7200U processor, 8 GB of RAM, and a Radeon(TM) R7 M445 GPU was used for the experiments. Training a model required approximately 1 to 120 min per experiment.
The purpose of feature selection is to reduce the dimensions, which may improve the generalization of our algorithm [20,21,22]. We selected the features by analyzing the correlations among features in the BC diagnosis dataset. The correlations among radius, perimeter, and area were high, and the three characteristics of compactness, concavity, and concave points were also correlated. Additional file 1: Table S1 shows the correlations among the features. Based on the correlation analysis, we retained radius and compactness as the representatives of these two groups. After feature selection, six features were retained in the BC diagnosis dataset. Because the correlations among the features of the CVD dataset are relatively small, recursive feature elimination (RFE) with fivefold cross-validation was used to reduce the dimensions of the CVD dataset. We used the number of features corresponding to the minimal logarithmic loss. After feature selection, nine features were retained in the CVD dataset. Additional file 1: Fig. S1 illustrates the feature selection process for the CVD dataset.
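The correlation-based selection described for the BC dataset can be sketched as a simple filter: a feature is kept only if it is not highly correlated with any feature already kept. The threshold of 0.9, the function names, and the toy values below are assumptions for illustration; the text does not state the cutoff that was used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def drop_correlated(features, threshold=0.9):
    """Return the names of features to keep.

    features maps a name to its list of values. Features are visited in
    insertion order, so the first member of a correlated group becomes
    its representative.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy values: perimeter and area are derived from radius, so
# they are strongly correlated with it, while texture is not.
radius = [10.0, 12.0, 14.0, 16.0, 18.0]
features = {
    "radius": radius,
    "perimeter": [2 * math.pi * r for r in radius],
    "area": [math.pi * r ** 2 for r in radius],
    "texture": [20.0, 15.0, 22.0, 17.0, 19.0],
}
selected = drop_correlated(features)
```

With these toy values the filter keeps `radius` as the representative of the size-related group and also keeps `texture`, which mirrors the idea of retaining one representative per correlated group.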
Performance comparison of different hyperparameter optimization methods
To evaluate the effectiveness of the Bayesian hyperparameter optimization algorithm, we used grid search and random search as comparison methods, tuning the hyperparameters of XGBoost with each method under fivefold cross-validation. In this study, four hyperparameters with high influence on the XGBoost algorithm were selected for adjustment. Additional file 1: Table S2 shows the hyperparameter space. All other parameters were left at their default settings, and the number of iterations was 5000. The horizontal axis of Fig. 1c, d represents the different hyperparameter optimization methods used for hyperparameter selection, and the vertical axis represents the AUC value of the XGBoost model. Fig. 1d shows that the Bayesian hyperparameter optimization method had better stability. Thus, we used the Bayesian optimization method for hyperparameter selection of all algorithms.
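The two baseline strategies compared above can be sketched as follows. This is an illustration only: `toy_cv_loss` is a hypothetical stand-in for the fivefold cross-validated log loss of a trained model, and the two hyperparameters and their ranges are assumptions, since the actual search space is given in Additional file 1: Table S2. A Bayesian optimizer would instead choose each new candidate using a surrogate model fitted to the earlier evaluations; that step is not shown here.

```python
import itertools
import math
import random

def toy_cv_loss(params):
    """Hypothetical stand-in for a cross-validated log loss (lower is better)."""
    lr, depth = params["learning_rate"], params["max_depth"]
    return (math.log10(lr) + 1.0) ** 2 + 0.05 * (depth - 5) ** 2 + 0.1

def grid_search(grid):
    """Evaluate every combination of the listed values and return the best one."""
    names = list(grid)
    candidates = [dict(zip(names, combo))
                  for combo in itertools.product(*grid.values())]
    return min(candidates, key=toy_cv_loss)

def random_search(space, n_iter, seed=0):
    """Draw n_iter random candidates from the ranges and return the best one."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        candidate = {
            # Sample the learning rate uniformly on a log10 scale.
            "learning_rate": 10 ** rng.uniform(*space["log10_learning_rate"]),
            "max_depth": rng.randint(*space["max_depth"]),
        }
        if best is None or toy_cv_loss(candidate) < toy_cv_loss(best):
            best = candidate
    return best

# Assumed search space, for illustration only.
grid = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [3, 5, 7]}
space = {"log10_learning_rate": (-3.0, 0.0), "max_depth": (2, 10)}

best_grid = grid_search(grid)
best_random = random_search(space, n_iter=50)
```

Grid search is limited to the listed values, whereas random search can land between them; neither uses the results of earlier evaluations to decide where to look next.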
Performance comparison of the different classifiers
The classification indicators of the different classifiers (LightGBM, GBDT, LR, RF, BPNN, and DT) acting on the two datasets were compared with those of the XGBoost classifier. The stability of the results was verified through 1000 repeated samplings. The mean and standard deviation of each indicator were calculated. In the small BC diagnostic dataset, XGBoost performed better than LightGBM, GBDT, LR, RF, BPNN and DT but was not as stable as GBDT. In the large CVD dataset, XGBoost’s classification performance was relatively stable (Table 1 and Fig. 2a–d).
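The repeated-sampling procedure can be illustrated with a bootstrap over test-set predictions, reporting the mean and standard deviation of each indicator. This is a sketch under assumptions: the text does not specify whether models were retrained on each resample, and here only the predictions are resampled with replacement. The labels and predictions are hypothetical toy values.

```python
import math
import random
import statistics

def accuracy_and_sensitivity(y_true, y_pred):
    """Return (accuracy, sensitivity); sensitivity is NaN if there are no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, sensitivity

def bootstrap_metrics(y_true, y_pred, n_rounds=1000, seed=0):
    """Resample (label, prediction) pairs with replacement n_rounds times."""
    rng = random.Random(seed)
    n = len(y_true)
    accuracies, sensitivities = [], []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        acc, sen = accuracy_and_sensitivity([y_true[i] for i in idx],
                                            [y_pred[i] for i in idx])
        accuracies.append(acc)
        if not math.isnan(sen):  # skip resamples that contain no positives
            sensitivities.append(sen)
    return {
        "accuracy": (statistics.mean(accuracies), statistics.stdev(accuracies)),
        "sensitivity": (statistics.mean(sensitivities), statistics.stdev(sensitivities)),
    }

# Hypothetical toy labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]
summary = bootstrap_metrics(y_true, y_pred)
```

A smaller standard deviation across resamples corresponds to what the text calls a more stable classifier.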
Feature importance ranking
In this experiment, the average gain of each feature in all the trees in which it appeared was used to rank the features in importance. Features with higher values of this metric can be considered more important for prediction than features with lower values. In the BC diagnosis dataset, radius mean was the most important feature for prediction (Fig. 2e). In the CVD dataset, the patient’s systolic blood pressure was the most important feature for predictions (Fig. 2f).
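The gain-based ranking can be sketched by averaging, for each feature, the gain of every split that uses it across all trees. The split records below are invented for illustration and are not results from this study; in practice, the XGBoost Python package reports a comparable quantity when feature importance is requested with `importance_type="gain"`, although the text does not state which call was used.

```python
from collections import defaultdict

def average_gain_importance(trees):
    """Rank features by their mean split gain.

    trees is a list of trees; each tree is a list of (feature_name, gain)
    records, one per split. Returns (feature, mean_gain) pairs sorted from
    most to least important.
    """
    gains = defaultdict(list)
    for tree in trees:
        for feature, gain in tree:
            gains[feature].append(gain)
    averages = {name: sum(values) / len(values) for name, values in gains.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Hypothetical split records for a two-tree ensemble (values are made up).
toy_trees = [
    [("systolic_bp", 120.0), ("age", 40.0)],
    [("systolic_bp", 80.0), ("cholesterol", 25.0), ("age", 20.0)],
]
ranking = average_gain_importance(toy_trees)
```

Because the gain is averaged rather than summed, a feature used in a few highly informative splits can rank above one used in many weak splits.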
With increasing attention being paid to computer-aided diagnosis, disease-assisted diagnosis requires reliable and interpretable classifiers. In addition to selecting a reliable classifier to achieve better prediction performance, the dataset needs to be preprocessed. To improve the performance of the classifier, we compared grid search, random search, and Bayesian hyperparameter optimization methods. Unlike traditional grid search and random search methods, Bayesian hyperparameter optimization algorithms based on Gaussian processes can find stable hyperparameters, and they are widely used in machine learning. Figure 1c, d shows that Bayesian hyperparameter optimization performed more stably on the larger dataset than on the smaller dataset. XGBoost was compared to LightGBM, GBDT, LR, RF, BPNN, and DT using two datasets of different sizes. Among the algorithms, XGBoost achieved the best accuracy (94.74%), precision (92.19%), sensitivity (93.69%), F1 score (92.91%), AUC (0.9875) and PR curve for the BC diagnosis dataset. However, the classification performance of GBDT was more stable over repeated sampling. For the CVD dataset, which has a large amount of data, the AUC (0.8044) and PR curve of the XGBoost algorithm were optimal, and the performance over repeated sampling was largely stable. Therefore, these findings indicate that for smaller datasets, although the XGBoost algorithm has advantages over the LightGBM, GBDT, LR, RF, BPNN, and DT algorithms, it is not as stable as GBDT. For larger datasets, the XGBoost algorithm has better classification performance than the other algorithms, and its classification performance is stable (Table 1 and Fig. 2a–d). Our experiments used the gain to rank the importance of the features. A higher value of this metric indicates that the feature has higher importance for prediction. The ranking results of the feature importance are shown in Fig. 2e, f.
In the BC dataset, the average value of the distance from the center of the nucleus to the edge (radius mean) of the lesion was the most important for prediction, indicating that it is important to calculate this feature of the nucleus in the digital images of FNA lesions. In the CVD dataset, the systolic blood pressure of the patient was the most important for prediction, suggesting that it is important to measure the patient’s systolic blood pressure during physical examination. Thus, the methods presented in this paper are interpretable and helpful for the predictive evaluation of disease and the identification of early, high-risk features.
Multiclass data could be used to further compare the performance of the models. The feature selection method needs further research. In addition to correlation analysis and RFE with cross-validation (RFECV), Gradient Boosted Feature Selection could also be tried to select relevant and non-redundant features for supervised classification problems. In future work, we would like to further improve the classification prediction performance.
Availability of data and materials
The datasets supporting the conclusions of this article are included within the article and its Additional file 2.
XGBoost: Extreme gradient boosting
LightGBM: Light gradient boosting machine
GBDT: Gradient boosting decision tree
BPNN: Back propagation neural network
RFE: Recursive feature elimination
ROC: Receiver operating characteristic curve
AUC: Area under curve
FNA: Fine needle aspirate
Madia F, Worth A, Whelan M, Corvi R. Carcinogenicity assessment: addressing the challenges of cancer and chemicals in the environment. Environ Int. 2019;128:417–29.
Nguyen T, Wang Z. Cardiovascular screening and early detection of heart disease in adults with chronic kidney disease. J Nurse Pract. 2019;15(1):34–40.
Zao A, Magalhaes S, Santos M. Frailty in cardiovascular disease: screening tools. Revista Portuguesa De Pneumologia. 2019;38(2):143–58.
Timmis A, Townsend N, Gale CP, Grobbee R, Maniadakis N, Flather M, Wilkins E, Wright L, Vos R, Bax JJ, et al. European society of cardiology: cardiovascular disease statistics 2017. Eur Heart J. 2018;39(7):508–79.
Panieri E. Breast cancer screening in developing countries. Best Pract Res Clin Obstet Gynaecol. 2012;26(2):1521–6934.
Otoole J, Gibson I, Flaherty GT. Young adults’ perception of cardiovascular disease risk. J Nurse Pract. 2019;15(10):e197–200.
Coleman C. Early detection and screening for breast cancer. Semin Oncol Nurs. 2017;33(2):141–55.
Wolberg WH, Street WN, Heisey DM, Mangasarian OL. Computer-derived nuclear features distinguish malignant from benign breast cytology. Hum Pathol. 1995;26(7):792–6.
Tseng Y, Huang C, Wen C, Lai P, Wu M, Sun Y, Wang H, Lu J. Predicting breast cancer metastasis by using serum biomarkers and clinicopathological data with machine learning technologies. Int J Med Inform. 2019;128:79–86.
Tapak L, Shirmohammadikhorram N, Amini P, Alafchi B, Hamidi O, Poorolajal J. Prediction of survival and metastasis in breast cancer patients using machine learning classifiers. Clin Epidemiol Glob Health. 2019;7(3):293–9.
Singh BK. Determining relevant biomarkers for prediction of breast cancer using anthropometric and clinical features: a comparative investigation in machine learning paradigm. Biocybern Biomed Eng. 2019;39(2):393–409.
Wu M, Zhong X, Peng Q, Xu M, Huang S, Yuan J, Ma J, Tan T. Prediction of molecular subtypes of breast cancer using bi-rads features based on a “white box” machine learning approach in a multi-modal imaging setting. Eur J Radiol. 2019;114:175–84.
Shengsheng L, Qiancheng L, Liling Y, Wenping L, Ruimeng Y, Haoyu J. Construction of breast cancer prediction model based on sfs-svm. Chin J Med Phys. 2019. https://doi.org/10.3969/j.issn.1005-202X.2019.07.015.
Liu L. Classification of breast cancer diagnosis data based on logistic regression algorithm. Softw Eng. 2018;21(2):21–2317.
Boursalie O, Samavi R, Doyle TE. M4cvd: mobile machine learning model for monitoring cardiovascular disease. Procedia Comput Sci. 2015;63:384–91.
Beunza J, Puertas E, Garciaovejero E, Villalba G, Condes E, Koleva G, Hurtado C, Landecho MF. Comparison of machine learning algorithms for clinical event prediction (risk of coronary heart disease). J Biomed Inform. 2019;97:103257.
Chen T, Guestrin C. XGBoost: a scalable tree boosting system. 2016. p. 785–94. https://doi.org/10.1145/2939672.2939785
Xu Y, Yang X, Huang H, Peng C, Ge Y, Wu H, Wang J, Xiong G, Yi Y. Extreme gradient boosting model has a better performance in predicting the risk of 90-day readmissions in patients with ischaemic stroke. J Stroke Cerebrovasc Dis. 2019;28(12):104441.
Yang J, Li Y, Li X, Chen T, Xie G, Yang Y. An explainable machine learning-based risk prediction model for in-hospital mortality for Chinese STEMI patients: findings from China Myocardial Infarction Registry. J Am Coll Cardiol. 2019;73(9):261.
Castellano G, Fanelli AM. Variable selection using neural-network models. Neurocomputing. 2000;31(14):1–13.
Wang T, Huang H, Tian S, Xu J. Feature selection for svm via optimization of kernel polarization with gaussian ard kernels. Expert Syst Appl. 2010;37(9):6663–8.
Wieslaw P. Tree-based generational feature selection in medical applications. Procedia Comput Sci. 2019;159:2172–8.
Niu X, Wang J. A combined model based on data preprocessing strategy and multi-objective optimization algorithm for short-term wind speed forecasting. Appl Energy. 2019;241:519–39.
Wu J, Chen X-Y, Zhang H, Xiong L-D, Lei H, Deng S-H. Hyperparameter optimization for machine learning models based on Bayesian optimization. J Electron Sci Technol. 2019;17(1):26–40.
Ke G, Meng Q, Finley TW, Wang T, Chen W, Ma W, Ye Q, Liu T. LightGBM: a highly efficient gradient boosting decision tree. 2017. p. 3149–57.
Gilani SZ, Shafait F, Mian A. Gradient based efficient feature selection. 2014. p. 191–7.
The authors thank Prof. N.Y. Shao for his constructive comments during the research process.
We did not receive funding for this study.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Additional file 1: Information on feature selection, hyperparameter space, and programming languages and libraries. The doc file contains two tables and one figure. The first table shows the correlation among the features of the breast cancer diagnosis dataset. The first figure illustrates the feature selection process for the cardiovascular disease dataset. The second table shows the hyperparameter space.
Additional file 2: The dataset used in the manuscript. The .csv file contains a breast cancer diagnosis dataset and a cardiovascular disease dataset.
Cite this article
Gao, L., Ding, Y. Disease prediction via Bayesian hyperparameter optimization and ensemble learning. BMC Res Notes 13, 205 (2020). https://doi.org/10.1186/s13104-020-05050-0
- Hyperparameter optimization
- Feature selection
- Ensemble learning