Objective of the experiments
The collection and use of biological big data has become increasingly important in recent years as medical care is computerized. Among health and medical information, heart rate variability big data is one of the most useful. ALLSTAR, a large-scale heart rate variability dataset, has collected more than 420,000 samples. We have already systematically performed various analyses based on this data and have obtained many results [13,14,15,16,17,18]. Meanwhile, in other fields such as the analysis of genetic information, it has been reported that information such as the area where an individual lives can be obtained from genetic information [19].
Previous studies
Holter ECG has been evolving since the 2000s and is now used to record many patients. For example, Jane et al. evaluated an automatic threshold-based detector [20]. Xuexiang et al. performed CNN (convolutional neural network) identification using Holter ECG data and reported that the detection of VEB (ventricular ectopic beats) and SVEB (supraventricular ectopic beats) achieved a high detection rate of 97.5% or more [21]. Agelink et al. and Yukishita et al. investigated gender differences in heart rate variability and reported that there was a slight gender difference [22, 23]. Gender determination from heart rate variability is therefore not impossible, but it is generally not very accurate.
Experimental method
ALLSTAR is a big dataset of 24-h Holter ECG recordings. It contains more than 420,000 heart rate variability samples, each of which contains a 24-h ECG record. The number of subjects was 429,308, including 861 subjects whose ECG was measured twice. In the experiments with 71,264 samples from subjects under the age of 50, the number of subjects was 71,126, including 138 subjects whose ECG was measured twice. No subject's ECG was measured more than twice.
Statistical features used in the analysis. HR is the 24-h mean value of the R-R interval of continuous sinus rhythm, SDNN is the standard deviation of the R-R intervals, and rMSSD is the root mean square (rms) of the differences between successive R-R intervals. The changes in the R-R interval are frequency-analyzed as a sample series, and the components are extracted for each of the ULF (ultra-low frequency, 0 to 0.0033 Hz), VLF (very low frequency, 0.0033 to 0.04 Hz), LF (low frequency, 0.04 to 0.15 Hz) and HF (high frequency, 0.15 to 0.4 Hz) bands. Furthermore, DFA1 (detrended fluctuation analysis 1) and DFA2 (detrended fluctuation analysis 2) are calculated by detrended fluctuation analysis. In this study, we conducted a gender identification experiment using these statistical indicators as 10-dimensional indicators.
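The time-domain and frequency-domain indicators described above can be illustrated with a minimal sketch. It is not the ALLSTAR processing pipeline: the function names, the sample standard deviation (`ddof=1`), the 4 Hz linear resampling, and the plain periodogram are all assumptions made for illustration, and only the band edges are taken from the text. DFA1 and DFA2 are not covered here.

```python
import numpy as np

# Band edges in Hz, as stated in the text.
BANDS = {
    "ULF": (0.0, 0.0033),
    "VLF": (0.0033, 0.04),
    "LF": (0.04, 0.15),
    "HF": (0.15, 0.4),
}


def time_domain_features(rr_ms):
    """Mean R-R interval, SDNN and rMSSD from R-R intervals in milliseconds."""
    rr = np.asarray(rr_ms, dtype=float)
    successive_diff = np.diff(rr)
    return {
        "mean_rr": rr.mean(),
        "sdnn": rr.std(ddof=1),  # assumption: sample standard deviation
        "rmssd": np.sqrt(np.mean(successive_diff ** 2)),
    }


def band_powers(rr_ms, fs=4.0):
    """Approximate power per band from a simple periodogram.

    Assumption: the R-R series is linearly resampled onto an even
    grid at `fs` Hz before the FFT; absolute scaling is approximate.
    """
    rr = np.asarray(rr_ms, dtype=float)
    beat_times = np.cumsum(rr) / 1000.0  # seconds
    grid = np.arange(beat_times[0], beat_times[-1], 1.0 / fs)
    x = np.interp(grid, beat_times, rr)
    x = x - x.mean()  # remove the DC component
    psd = np.abs(np.fft.rfft(x)) ** 2 / (fs * len(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    df = freqs[1] - freqs[0]
    return {
        name: float(psd[(freqs >= lo) & (freqs < hi)].sum() * df)
        for name, (lo, hi) in BANDS.items()
    }
```

For a short recording the ULF estimate is not meaningful, since resolving frequencies below 0.0033 Hz requires a record several minutes to hours long; the 24-h records described in the text are what make that band usable.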
Evaluation method used for comparison. In this study, we compared classification and identification methods. Three were verified: k-means as a classification method, and random forest and SVM as identification methods. Using all 428,302 records as samples, 60% was set aside as test data, and the results were obtained using the scikit-learn library. Fourfold cross validation was performed on four different divisions, and the average was calculated. We had already confirmed in our previous studies that this setting gives reliable results. K-fold cross validation is a commonly used method for increasing statistical precision when a limited dataset must serve as both training and test data. We simply used the widely used scikit-learn API (application programming interface) to perform it. We chose 60% test data to balance the ratio of training data to test data for our evaluation. As for the number of folds, we confirmed that fourfold gives the best result for our purpose, which means that a larger k would not give any major gain in statistical precision. For regression analysis, elastic net, lasso, linear regression, and SVR (support vector regression) were performed.
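One way to reproduce the shape of this evaluation with scikit-learn is sketched below. It rests on assumptions: the data are synthetic (not ALLSTAR), the "four different divisions with 60% test data" are read as four random splits via `ShuffleSplit`, and the hyperparameters and feature scaling for the SVM are illustrative choices that the text does not specify.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 10-dimensional indicators (assumption).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# Four random divisions, each holding out 60% as test data
# (assumed reading of the protocol described in the text).
splitter = ShuffleSplit(n_splits=4, test_size=0.6, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Scaling before the SVM is an illustrative choice, not from the text.
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=splitter, scoring="accuracy")
    mean_accuracy[name] = float(np.mean(scores))  # average over the 4 splits
    print(f"{name}: mean accuracy over 4 splits = {mean_accuracy[name]:.3f}")
```

If a strict k-fold partition were intended instead, `KFold(n_splits=4)` would hold out 25% per fold rather than 60%, so the two readings are not interchangeable.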