Mild cognitive impairment prediction based on multi-stream convolutional neural networks

Data collection

All participants gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Far Eastern Memorial Hospital Research Ethics Committee (105147-F) and the Institutional Review Board of the National Yang-Ming University (YM108110E). Forty-five participants were enrolled in this study: 32 are cognitively normal (median age 69 years, IQR 67–73 years, 9 males, 23 females) and 13 are diagnosed with MCI (median age 75 years, IQR 71–78 years, 6 males, 7 females). Table 1 and Fig. 1 show the gender and age distributions of the participants. To collect realistic data from participants without stress or embarrassment, participants were recorded on video while taking the Mini-Mental State Examination (MMSE).
Table 1 Gender distribution of participants

Fig. 1 Age distribution of participants

The effective data comprise 48 facial videos from the 45 participants: 35 videos from cognitively normal participants and 13 videos from MCI participants. The original videos use several resolutions, such as 1920 × 1080, 1280 × 720, and 640 × 480, at a frame rate of 29.97 frames per second (fps). Video lengths range from 3 to 30 min, with an average of 14.5 min. To reduce spatial and temporal redundancy before processing, all frames were resized to 640 × 480 and the frame rate was down-sampled to 5 fps.

MCI prediction model

We propose an MCI prediction model based on MCNNs that predicts whether a participant video shows MCI, as shown in Fig. 2. First, a participant video is divided into several segments. Then, for each segment, we generate spatial and motion data streams as input to the MCNN. The MCNN captures latent spatial and motion features from the data streams to extract facial representations during MMSE testing, and classifies each segment as MCI or normal based on these features. Finally, an aggregation stage produces the final detection result for the input video.

Fig. 2 Overview of the MCI prediction model

We randomly sample a frame from each segment to generate the spatial data stream. The frame is an RGB image containing the participant's face, which represents static facial spatial information. The participants' facial responses during the MMSE test are also important. To capture these facial dynamics, the motion data stream is generated from the segment frames using optical flow techniques [34]. Optical flow is used in computer vision to estimate the per-pixel motion field between two image frames, and it is widely used in biomedical applications for tracking changes over time [35, 36]. Stacked optical flow fields in the x and y directions are calculated to represent facial motion information. In this study, we use the TV-L1 optical flow algorithm [37] as implemented in OpenCV with CUDA.

Inspired by two-stream CNNs [14, 17], our MCNN consists mainly of three CNNs, a fusion mechanism, and a fully connected layer acting as a classifier, as shown in Fig. 3. The three CNNs are a spatial CNN, an x-motion CNN, and a y-motion CNN. The MCNN receives the spatial and x- and y-motion streams of a segment as inputs and uses the three CNNs to extract facial spatial and motion features. The spatial, x-motion, and y-motion features are then concatenated into a one-dimensional vector. Finally, the fused feature vector is classified as MCI or normal through a batch normalization (BN) layer and a fully connected (FC) layer.

Fig. 3 Architecture of the MCNN model

The MCNN acts as a segment classifier in the MCI prediction model. For each segment, the MCNN classifier produces a decision on the class of that segment. Finally, a majority voting scheme [38] aggregates the classifier decisions. When aggregating the decisions of the n MCNN classifiers, the input video is assigned to the MCI class when at least k MCNN classifiers agree, where

$$k = \begin{cases} \dfrac{n}{2} + 1 & \text{if } n \text{ is even} \\ \dfrac{n+1}{2} & \text{if } n \text{ is odd.} \end{cases}$$

(1)
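As a concrete illustration of Eq. (1), the following minimal Python sketch (a hypothetical helper, not the paper's code) aggregates per-segment decisions:

```python
from typing import Sequence

def majority_vote(decisions: Sequence[int]) -> int:
    """Aggregate per-segment MCNN decisions (1 = MCI, 0 = normal) per Eq. (1)."""
    n = len(decisions)
    # k from Eq. (1): n/2 + 1 for even n, (n + 1)/2 for odd n
    k = n // 2 + 1 if n % 2 == 0 else (n + 1) // 2
    return 1 if sum(decisions) >= k else 0

# e.g. 5 segment classifiers, 3 of which vote MCI -> video labeled MCI
assert majority_vote([1, 1, 1, 0, 0]) == 1
```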
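The motion-stream generation described above can be sketched as follows. This is a minimal CPU illustration using the TV-L1 implementation from opencv-contrib-python (the study uses OpenCV's CUDA variant); the helper name is an assumption for illustration:

```python
import cv2
import numpy as np

def motion_streams(frames: list) -> tuple:
    """Stack TV-L1 optical flow fields for one segment of video frames.

    Returns the x-direction and y-direction flow stacks, each of shape
    (T - 1, H, W) for T input frames.
    """
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    # one dense flow field per consecutive frame pair
    flows = np.stack([tvl1.calc(p, n, None) for p, n in zip(grays, grays[1:])])
    return flows[..., 0], flows[..., 1]  # x-motion stack, y-motion stack
```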
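A minimal PyTorch sketch of the three-stream fusion architecture in Fig. 3 follows. The ResNet-18 backbones and the 9 flow channels per motion stream (one field per consecutive pair in a 10-frame segment) are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision import models

class MCNN(nn.Module):
    """Three-stream MCNN sketch: spatial, x-motion, and y-motion ResNets
    whose pooled features are concatenated and classified by a BN + FC head."""

    def __init__(self, flow_channels: int = 9, num_classes: int = 2):
        super().__init__()
        self.spatial = self._backbone(3)
        self.x_motion = self._backbone(flow_channels)
        self.y_motion = self._backbone(flow_channels)
        fused_dim = 3 * 512  # 512-d pooled features per ResNet-18 stream
        self.head = nn.Sequential(nn.BatchNorm1d(fused_dim),
                                  nn.Linear(fused_dim, num_classes))

    @staticmethod
    def _backbone(in_channels: int) -> nn.Module:
        net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        if in_channels != 3:  # adapt the first conv layer to stacked flow fields
            net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7,
                                  stride=2, padding=3, bias=False)
        net.fc = nn.Identity()  # expose pooled features instead of logits
        return net

    def forward(self, rgb, flow_x, flow_y):
        fused = torch.cat([self.spatial(rgb),
                           self.x_motion(flow_x),
                           self.y_motion(flow_y)], dim=1)
        return self.head(fused)
```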
The MCNN is a general and flexible model at the segment level, and several modern CNN models can serve as its backbone. To make our MCNN perform optimally, we chose ResNet as the backbone after considering its balance between accuracy and efficiency. Most CNN models also provide weights pre-trained on the public ImageNet dataset [39]. Transfer learning allows a high-quality classification model to be built for new data from only a small amount of newly labeled data. We therefore used transfer learning to fine-tune the pre-trained CNNs, which expedites training and increases accuracy. Specifically, we unfreeze and train the last convolutional block of the pre-trained model together with the top-level classifier (the FC layer). In this way, we retain the generic features learned from the ImageNet dataset while learning domain knowledge from the facial video data (a minimal fine-tuning sketch follows the list below).

MCNN exploration

Although the MCNN captures and learns spatial and motion features to predict MCI from video segments, prediction accuracy also depends on the model architecture, so exploring different architectures is necessary to devise a robust solution. To further improve model accuracy, we explored and compared the following model settings and their combinations:

1. ResNets with different numbers of layers, namely ResNet-18, ResNet-34, and ResNet-50.
2. ReLU, Swish, and Mish activation functions [40, 41] in the ResNets.
3. SGD, Adam, and Ranger [42] optimizers in model training.
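Referring back to the transfer-learning scheme above, here is a minimal fine-tuning sketch for a single stream, assuming a torchvision ResNet in which `layer4` is the last convolutional block:

```python
import torch
from torchvision import models

# Load ImageNet-pre-trained weights and replace the classifier head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # MCI vs. normal

# Freeze everything, then unfreeze the last conv block and the FC head.
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():
    param.requires_grad = True
for param in model.fc.parameters():
    param.requires_grad = True

# Optimize only the trainable parameters (SGD shown; Adam/Ranger analogous).
optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                            lr=1e-3, momentum=0.9)
```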

The activation function plays an important role in neural network training. In the early era of neural networks, the sigmoid function was the most commonly used activation function. However, its small derivative can cause the vanishing gradient problem, so ReLU, whose derivative is one for every positive input, is more suitable and is widely used in deep learning. Nevertheless, if the weights in the network always produce negative inputs to a ReLU neuron, the neuron always outputs zero and is effectively dead; this is known as the dying ReLU problem. Several variants of ReLU have been proposed that perform as well as or better than ReLU, but none has matched ReLU's popularity, owing to its simplicity [43]. Swish is a smooth, non-monotonic activation function similar to ReLU, defined as follows [40]:

$$\text{Swish}(x) = \frac{x}{1 + e^{-x}}$$
(2)
Because Swish is simple and similar to ReLU, replacing ReLU anywhere in a network is a one-line code change. Despite this simplicity, empirical results show that Swish consistently outperforms ReLU and other activation functions. Mish is a newer activation function with a shape and properties similar to Swish, defined as follows [41]:

$$\text{Mish}(x) = x \tanh\left(\log\left(1 + e^{x}\right)\right)$$
(3)
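As a reference, Eqs. (2) and (3) translate directly into code; recent PyTorch releases also ship equivalent built-ins (`torch.nn.SiLU` and `torch.nn.Mish`):

```python
import torch
import torch.nn.functional as F

def swish(x: torch.Tensor) -> torch.Tensor:
    return x * torch.sigmoid(x)  # equivalent to x / (1 + e^{-x}), Eq. (2)

def mish(x: torch.Tensor) -> torch.Tensor:
    return x * torch.tanh(F.softplus(x))  # x * tanh(log(1 + e^x)), Eq. (3)
```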
The graphs of ReLU, Swish, and Mish are shown in Fig. 4; the main difference lies in the concave part of the function. Mish improves on ReLU and Swish at the cost of more computation. In this study, we compare the performance of ReLU, Swish, and Mish in ResNets to find the best model architecture.

Fig. 4 Graphs of the ReLU, Swish, and Mish activation functions

Optimizers are critical to the performance of neural networks. While many optimizers have been proposed, most publications provide only incremental improvements to existing algorithms. We adopted the state-of-the-art Ranger optimizer to improve model training. Ranger combines two recent ideas, RAdam and Lookahead, into a single optimizer. RAdam uses a dynamic rectifier to adjust Adam's adaptive momentum based on its variance, effectively providing an automatic warm-up mechanism. Lookahead provides strong and stable improvements throughout the training process. The inventors of Ranger therefore claim that combining the two achieves higher accuracy. This study also compares the performance of the SGD, Adam, and Ranger optimizers in model training.

Generating training and test segments

In the MCI prediction model, only the MCNN needs to be trained. We therefore divide each participant video into several segments to generate the training and test segments. Considering the video lengths, and because there are fewer MCI videos than normal videos, we evenly extract 200 segments from each MCI video and 100 segments from each normal video to balance the MCI and normal classes. In total, 5154 segments are extracted from the 48 videos; some videos are too short to yield the full number of segments. Each segment contains 10 frames and is the processing unit of the MCNN.

To generate the training and validation segments, we need to split all segments into training and validation sets. However, we cannot split at the segment level directly, because segments from the same participant could end up in both sets. Validation data must not be seen during training; if training and validation segments come from the same participant, participant-specific information leaks into training. Therefore, this study uses a two-stage approach to generate the training and validation segments.

First, all participants are randomly divided into training and validation groups in a ratio of approximately 8:2. We use the stratified K-fold cross-validation implemented in the scikit-learn library [44] to split the participants into groups with roughly the same class proportions as the original data. Then, after this participant-level grouping, all segments are assigned to the training or validation set according to whether their participant ID belongs to the training or validation group. Table 2 shows the sizes of the two sets: 4237 segments (36 participants, 39 videos) in the training set and 917 segments (9 participants, 9 videos) in the validation set. MCI segments are labeled positive and normal segments negative. Because the amount of video data is limited, we do not have a separate test set; the validation set is also used in the model testing phase to evaluate testing performance. The gender and age distributions of the training and validation sets are shown in Fig. 5.
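The two-stage split can be sketched as follows; the metadata arrays here are synthetic stand-ins for the study's participant and segment records, and a 5-fold split approximates the 8:2 ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins: 45 participants (13 MCI), 5154 segments.
rng = np.random.default_rng(0)
participant_ids = np.arange(45)
participant_labels = np.array([1] * 13 + [0] * 32)      # 1 = MCI, 0 = normal
segment_owner = rng.choice(participant_ids, size=5154)  # owner of each segment

# Stage 1: stratified participant-level split (~8:2 via 5 folds).
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
train_idx, val_idx = next(iter(skf.split(participant_ids, participant_labels)))

# Stage 2: every segment follows its participant, so no participant
# contributes segments to both sets (no identity leakage).
train_mask = np.isin(segment_owner, participant_ids[train_idx])
train_segments = np.where(train_mask)[0]
val_segments = np.where(~train_mask)[0]
```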
Table 2 Numbers of training and validation sets

Fig. 5 Gender and age distributions of the segmented training and validation sets. a Gender distribution of the segmented training set. b Age distribution of the segmented training set. c Gender distribution of the segmented validation set. d Age distribution of the segmented validation set
