Can Pre-Trained Convolutional Neural Networks be used as Feature Extractors for Video-based Neonatal Sleep and Wake Classification?

Objective: In this paper, we evaluate the use of pre-trained convolutional neural networks (CNNs) as feature extractors, followed by Principal Component Analysis (PCA) to find the most discriminant features and a support vector machine (SVM) to classify neonatal sleep and wake states from Fluke® facial video frames. Using pre-trained CNNs as feature extractors would greatly reduce the effort of collecting new neonatal data for training a neural network, which can also be computationally very expensive. The features are extracted from the fully connected layers (FCLs), and we compare several pre-trained CNNs: VGG16, VGG19, InceptionV3, GoogLeNet, ResNet, and AlexNet.
Results: From around 2 h of Fluke® video recordings of seven neonates, we achieved a modest classification performance, with an accuracy, sensitivity, and specificity of 65.3%, 69.8%, and 61.0%, respectively, using AlexNet on Fluke® (RGB) video frames. This indicates that a pre-trained model used as a feature extractor does not fully suffice for highly reliable sleep and wake classification in neonates. Therefore, a dedicated neural network trained on neonatal data or a transfer learning approach will be required in the future.


Introduction
Sleep is an essential behavior for the development of the nervous system in neonates [1][2][3]. Normally, newborn babies sleep 16 to 18 hours per day. Continuous sleep tracking and assessment could potentially provide an indicator of brain development over time [4][5]. To achieve this, automatic sleep and wake analysis is required, which can offer valuable information on a neonate's mental and physical growth, not only for healthcare professionals but also for parents [6].
Currently, video electroencephalography (VEEG) is considered the gold standard for neonatal sleep monitoring. It requires a number of sensors and electrodes attached to the neonate's skin to collect multi-channel EEG signals [7][8][9]. In addition, VEEG is labor-intensive, since sleep states must be annotated manually [10]. A contact-free sleep monitoring system for neonates is therefore desirable. In recent years, unobtrusive or contact-free approaches to sleep monitoring have gained a lot of attention [11][12][13][14][15][16]. However, these methods have been more successful in adults [17] [18]. In contrast, video-based methods appear to be a promising approach, since they are more comfortable and convenient to use both at home and in hospitals [19] [20]. With the advancements in deep learning algorithms and clinical research on neonatal facial patterns [21] [22], a new, unobtrusive approach to monitoring sleep patterns has been proposed [23] [24]. However, deep learning models demand a large database to train the prediction model.
The main contributions of this work include: (a) extracting features from well-known CNNs, e.g., VGG16, VGG19, InceptionV3, GoogLeNet, ResNet, and AlexNet, (b) comparing different color palettes (amber, high contrast, red-blue, hot metal, and grayscale) from RGB and thermal video frames, and (c) evaluating the extracted features using PCA followed by SVM to classify neonatal sleep and wake states. As this was an explorative study to evaluate the feasibility of a pre-trained model as a feature extractor for classifying neonates' sleep and wake states from video frames, we started with a small pilot study population and adopted a robust and computationally less complex approach to classify sleep states.

Main Text
Subject Database
Video and VEEG data from seven neonates were collected retrospectively by a pediatrician at the Children's Hospital affiliated to Fudan University, Shanghai, China [25]. Detailed descriptions of the demographics and physical conditions of the neonates are given in Additional file 1: Table S1. Annotation of sleep and wake states was performed by a professional neurologist on each 30-s VEEG epoch and video frames, respectively, according to the American Academy of Sleep Medicine (AASM) [26].

Intensity-Based Detection
To identify sleep and wake states for neonates from video frames, precise face detection in the Fluke® video is required [27]. A detailed description of the intensity-based detection method is given in our previous paper [25]. Figure 1 shows an input video frame and the neonatal facial region detected using the intensity-based method. The detected RGB facial region is then mapped onto the video frames of the other (thermal) color palettes to extract the facial region.

Pre-trained CNN models
Our proposed method is to classify neonatal sleep and wake states using pre-trained CNNs. Usually, the initial layers of a CNN capture basic image features such as spots, boundaries, and color patterns, which the deeper hidden layers combine into complex higher-level feature patterns that give a richer representation of the image [28]. The output of each CNN layer acts as a set of activations for the input image. The literature shows that when pre-trained CNNs are used for feature extraction, the features are usually extracted from the FC layers right before the final output classification layer [29] [30]. With this motivation, we extracted the features from the fully-connected (FC) layers of a pre-trained network. Detailed descriptions of all the pre-trained models are given in Additional file 1: Table S2.
VGG16 and VGG19 Model: The VGG model [31] contains a stack of convolutional layers followed by three fully-connected layers (FCLs). In this work, we used both the pre-trained VGG16 and VGG19 models, and features were extracted from the last three FCLs.
AlexNet Architecture: AlexNet [32] contains a total of eight layers. In this work, we extracted features from the last two FCLs of the pre-trained AlexNet.
ResNet-18: The baseline structure of the residual network (ResNet) [33] is the same as that of the other CNNs, except that a shortcut link is added to each pair of 3×3 filters. To classify neonates' sleep and wake states, we extracted 1000 features from the last FCL of the pre-trained ResNet-18 model.
GoogLeNet: GoogLeNet [34] has unique features that helped it achieve state-of-the-art results and outperform previous networks; e.g., 1×1 convolution is used for dimension reduction to reduce computation. In this work, we used the pre-trained GoogLeNet network, and features were extracted from the last "FC1000" layer.
InceptionV3: Inception-v3, the third iteration of GoogLeNet, is built on the idea of factorization [35]. The last FCL of the pre-trained Inception-v3 model is used to extract the features for neonatal sleep and wake classification.
Principal Component Analysis (PCA): PCA is a method that reduces the dimensionality of a dataset by projecting it onto the directions of greatest variance and suppressing the remaining variation [36]. In this paper, once the features are extracted from the FCLs of the CNNs, we input them to PCA to find the most discriminant features, which helps the SVM classify neonatal sleep and wake states at the next stage.
Support Vector Machine (SVM): Based on the features extracted from the pre-trained CNNs, we employed an SVM classifier to classify neonatal sleep and wake states [37][38]. We used the "classificationLearner" app in Matlab R2018b with the default SVM settings (kernel function = 'linear', box constraint = 1) to perform the classification.
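The dimension-reduction role of the 1×1 convolution mentioned above can be illustrated with a minimal sketch. It assumes a channels-last feature map and treats the 1×1 convolution as a per-pixel matrix multiplication; the layer sizes (28×28×192 input, 16 reduced channels, 32 output channels) and the helper names `conv1x1` and `conv_mults` are illustrative choices, not values taken from this study.

```python
import numpy as np


def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution (no bias) as a per-pixel matrix multiply.

    feature_map: array of shape (H, W, C_in)
    weights:     array of shape (C_in, C_out)
    """
    return feature_map @ weights


def conv_mults(h, w, c_in, c_out, k):
    """Multiplications for a k x k convolution with 'same' padding, stride 1."""
    return h * w * c_in * c_out * k * k


rng = np.random.default_rng(0)
feature_map = rng.normal(size=(28, 28, 192))   # illustrative input
weights = rng.normal(size=(192, 16))           # reduce 192 -> 16 channels
reduced = conv1x1(feature_map, weights)

# Cost of a 5x5 convolution to 32 channels, with and without the 1x1 bottleneck.
direct = conv_mults(28, 28, 192, 32, 5)
bottleneck = conv_mults(28, 28, 192, 16, 1) + conv_mults(28, 28, 16, 32, 5)
print(reduced.shape, direct, bottleneck)
```

With these illustrative sizes, inserting the 1×1 layer before the 5×5 convolution cuts the multiplication count by roughly a factor of ten, which is the kind of saving the text refers to.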
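The PCA step described above can be sketched as follows. This is a minimal NumPy illustration, not the Matlab workflow used in the study: the function names `pca_fit` and `pca_transform`, the synthetic feature matrix, and the choice of 10 components are all assumptions made for the example, since the number of retained components is not stated here.

```python
import numpy as np


def pca_fit(features, n_components):
    """Fit PCA on an (n_samples, n_features) matrix using the SVD."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (features.shape[0] - 1)
    return mean, vt[:n_components], explained_variance[:n_components]


def pca_transform(features, mean, components):
    """Project features onto the fitted principal directions."""
    return (features - mean) @ components.T


rng = np.random.default_rng(0)
# Synthetic stand-in for CNN features: 200 frames x 50 features (not study data).
cnn_features = rng.normal(size=(200, 50))

mean, components, variance = pca_fit(cnn_features, n_components=10)
reduced = pca_transform(cnn_features, mean, components)
print(reduced.shape)
```

The reduced matrix would then be passed to the classifier. When PCA is combined with cross-validation, a common design choice is to fit the projection on the training folds only and apply it unchanged to the held-out fold.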

Results And Discussion
Twenty-two experiments were conducted on RGB and thermal videos, respectively. For evaluation purposes, all the results are expressed in terms of sensitivity (Se), specificity (Sp), precision (p), and accuracy (Ac), obtained using five-fold cross-validation. The results were validated against the VEEG annotations. Table 1 shows the sleep and wake classification results obtained by the SVM classifier after feature extraction using different pre-trained CNNs. We observed that the overall performance of FCL6-7-8 in VGG16 and VGG19, FCL8 in AlexNet, and the FCL in InceptionV3, ResNet-18, and GoogLeNet was low when used to classify neonatal sleep and wake states. The results obtained via SVM show an imbalanced pattern across the metrics. RGB-InceptionV3 (FCL) shows the best Se of 97.4%, but its Ac drops to 55.1%; similarly, RGB-VGG16 (FCL8) and RGB-VGG19 (FCL8) show an Se of 90.0%, but their overall accuracy is 66.2% and 65.2%, respectively. However, features extracted from AlexNet (FCL7) and classified via SVM show the most balanced results, with an Ac of 65.3%, Se of 69.8%, and Sp of 61.0%.
In contrast to the features extracted from the other pre-trained networks, the features extracted from AlexNet (FCL7) contain discriminant features that assist the SVM in classifying neonatal sleep and wake states. One of the main reasons for the better results with pre-trained AlexNet is that AlexNet was originally trained on just over a million images, whereas the other CNNs were trained on more than 15 million images and therefore represent more complex feature patterns at their fully connected layers [39] [31]. In AlexNet, the first layer has filters of size 11x11, the second layer has 5x5 filters, and so on; there is no standard for filter sizes and max pooling, and the convolutions for each layer were decided purely experimentally. In contrast, the other CNNs follow a standard protocol: in VGG-Net, all the convolution kernels are of size 3x3 and max-pooling is done after 2 or 3 convolutional layers, while GoogLeNet works on a parallel combination of 1x1, 3x3, and 5x5 convolutional filters. This difference in the complexity of the pre-trained CNNs may explain why AlexNet obtained better performance in classifying neonatal sleep and wake states.
Figure 2a shows the standard deviation of all the sleep and wake features extracted from AlexNet FCL7. Most of the sleep and wake features extracted from FCL7 lie in almost the same region. Nevertheless, AlexNet features perform slightly better with SVM than the features from the other networks; one of the main reasons is that the corresponding features are comparatively more separated from each other. Figure 2b depicts the standard deviation of the corresponding discriminant features obtained after PCA from pre-trained AlexNet (FCL7). These discriminant AlexNet (FCL7) features help to achieve better neonatal sleep and wake classification accuracy compared with the other pre-trained CNNs.
As a proof of concept, we also analyzed other neonatal facial color palettes extracted from Fluke® SmartView. Additional file 1: Table S3 shows the results achieved using multiple color palettes: amber, high contrast, red-blue, hot metal, and grayscale. Like the results shown in Table 1, the Fluke® color palettes give imbalanced results: High Contrast-AlexNet (FCL8), InceptionV3-Hot-metal (FCL), GoogLeNet-Grayscale (FCL), and VGG19-Red-Blue (FCL6) achieved the best Se values of 84.8%, 76.3%, 73.0%, and 81.1%, respectively. Similarly, VGG19-Amber (FCL) shows the best Sp of 87.8%. However, the overall Ac obtained from these color palettes is quite low; VGG19-High Contrast (FCL7) shows the best Ac of 65.6%. One of the main reasons is that the range of these Fluke® color palettes is quite narrow, as shown in Additional file 1: Figure S1.
In general, the results of using pre-trained CNN models as feature extractors to classify neonatal sleep and wake states are quite modest [20]. One of the main reasons for such modest accuracy is that all the existing pre-trained CNNs were trained on natural images such as animals, flowers, scenery, and automobiles. The feature patterns of these classes are quite different from those in our neonatal database, which makes it difficult for existing CNNs to classify neonatal sleep and wake states [40] [41]. The motivation for using pre-trained CNNs for feature extraction is that it does not demand much computational capacity and is quite robust, since the network does not need to be retrained; these attributes led us to start with the feature extraction approach to classifying neonatal sleep and wake states. However, the experimental analysis shows that this approach does not offer results promising enough to serve as an assistive tool for clinicians to classify neonates' sleep and wake states unobtrusively. Nevertheless, since no studies have been reported in the literature that analyze neonatal facial videos to classify sleep and wake states using CNNs as feature extractors, this research could help future studies adopt other techniques (e.g., transfer learning or dedicated CNNs) to classify neonatal sleep and wake states from video frames with better accuracy.
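The four metrics reported in this section (Se, Sp, p, and Ac) can be computed from the confusion-matrix counts, as in the following minimal sketch. The helper name `binary_metrics`, the toy label vectors, and the choice of label 1 as the positive class are assumptions for illustration; the text does not state which state was treated as positive, and the labels below are not study data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return sensitivity, specificity, precision, and accuracy for two classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "Se": ratio(tp, tp + fn),        # true positive rate
        "Sp": ratio(tn, tn + fp),        # true negative rate
        "p": ratio(tp, tp + fp),         # precision
        "Ac": ratio(tp + tn, len(pairs)),
    }


# Toy labels (hypothetical): 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)

# A classifier that labels every frame as positive: perfect Se, zero Sp.
all_positive = binary_metrics([1, 1, 0, 0], [1, 1, 1, 1])
print(metrics, all_positive)
```

The second call shows why a very high Se can coexist with a low Ac: a classifier biased toward the positive class gains sensitivity at the expense of specificity, so Se alone says little about overall performance.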

Conclusions
This work experimentally evaluates the feasibility of unobtrusive automatic classification of neonatal sleep and wake states using video frames from a Fluke® camera. Five-fold cross-validation yields a modest accuracy of 65.3% from pre-trained AlexNet at FCL7, compared with VEEG data annotated by a neurologist for sleep and wake states. In the future, a transfer learning approach or dedicated CNNs, together with the collection of more data from different ethnic groups, will be the next step of our research.

Limitations of the study
It is also important to note that this is a preliminary study, in which video data collection took place in a controlled environment with fixed camera placement, stable lighting conditions, and under the supervision of neonatal nurses and pediatricians. Furthermore, in this proof-of-concept study, we analyzed the variations in the facial patterns of neonates with no sleep-related issues; the accuracy for neonates with sleep disorders remains unclear. This article focuses only on two states (sleep and wake); the dedicated design of a neonatal deep learning architecture for sleep staging is the foremost next step of this research.