Bio-inspired feature selection for early diagnosis of Parkinson’s disease through optimization of deep 3D nested learning

This section outlines the methodology and the algorithmic tools used in the proposed work, with a focus on dataset collection, the development of the deep neural network architectures, and the algorithmic changes made to realize the fusion and optimization techniques. The overall methodology is illustrated in the block diagram in Fig. 1, which depicts the following modules: (1) the proposed 3D-CNN architecture, (2) the improved ResNet model, and (3) canonical correlation-based fusion followed by whale optimization. Each module is discussed in detail in this section.

Figure 1 Overall architecture of the proposed method.

Data collection and pre-processing

The dataset used in the present work is collected from the PPMI database20, an open-source repository containing images of different modalities for control and PD subjects. In the proposed experimental investigation, a total of 303 MRI images were used, consisting of 110 control, 58 prodromal, and 135 PD images. The analysis accounted for age and sex effects on MRI image quality, demonstrating the model's ability to handle demographic variability. From the literature, it is identified that T2-weighted MRI data is extensively used for localizing the SN region. Preprocessing steps such as brain extraction, registration, normalization, and data augmentation were performed as described in references21,22. In23, the authors analyzed structural changes in subcortical regions to diagnose PD. Hence, among the T1-, T2-, and FLAIR-weighted MRI sequences, only the T2-weighted sequence is considered for the proposed method. To ensure the adaptability of the proposed framework within the current healthcare system, the output format and resolution of scans from various MRI machines were analyzed and included in the training and testing data. The dataset includes MRI scans from 1.5T and 3T machines, with 3T images generally providing higher spatial resolution and signal-to-noise ratio; this ensures that the proposed classification model maintains its accuracy regardless of these factors. Significant characteristics of the various MRI machines are listed in Table 2.

Table 2 A comparative overview of performance and image quality for different MRI machines.

3D-CNN

A 3D-CNN architecture is designed and analyzed to classify input MRI images into three classes: control (0), prodromal (1), and PD (2). In the literature, 2D-CNNs have been extensively explored6,12,13,14 for classifying MRI data as 2D slices. Since each subject has a variable number of slices per MRI scan, it is difficult to fix the input size of such an architecture, which limits accuracy. In a 3D-CNN architecture, three-dimensional MRI images in NIfTI format can be given directly as input. A novel 3D-CNN architecture is designed and constructed after performing extensive ablation studies. The proposed architecture comprises a total of twenty-four layers, including an input layer, 3D conv layers with Rectified Linear Unit (ReLU) activation functions, BN, pooling layers, two dense layers, and finally a SoftMax classification layer. Figure 2 illustrates the architectural representation of the proposed 3D-CNN model; a code sketch of an architecture of this kind is given below.

Figure 2 Architectural representation of the proposed 3D-CNN model.
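As an illustration of such a stack, the following minimal Keras sketch chains Conv3D, BN, ReLU, and 3D pooling stages ahead of two dense layers and a SoftMax output. The input shape, filter counts, and kernel sizes are illustrative assumptions, not the exact configuration of the proposed twenty-four-layer network.

```python
# Minimal sketch of a 3D-CNN for three-class MRI classification
# (control, prodromal, PD). Input shape, filter counts, and kernel
# sizes are illustrative assumptions only.
from tensorflow.keras import layers, models

def build_3d_cnn(input_shape=(128, 128, 64, 1), num_classes=3):
    model = models.Sequential([
        layers.Input(shape=input_shape),        # NIfTI volume as a 4D tensor
        layers.Conv3D(32, 3, padding="same"),   # 3D convolution
        layers.BatchNormalization(),            # BN
        layers.Activation("relu"),              # ReLU
        layers.MaxPooling3D(pool_size=2),       # 3D pooling
        layers.Conv3D(64, 3, padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.MaxPooling3D(pool_size=2),
        layers.GlobalAveragePooling3D(),
        layers.Dense(1000, activation="relu"),  # dense layers later used
        layers.Dense(1000, activation="relu"),  #   for feature extraction
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```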
Improved 3D-ResNet

In the literature, the improved ResNet model15 has been used to analyze and classify MRI images. In the proposed work, this model serves as one module for feature extraction, and its performance in classifying MRI data is further analyzed. Figure 3 illustrates the architecture of the improved 3D-ResNet model. This neural network is composed of 15 layers, including an input layer, two conv3D blocks, five residual blocks with max-pooling layers, a fully connected layer, and a SoftMax layer. Each conv3D block consists of a 3D conv layer, BN, and ReLU activation layers. The residual block contains three residual units, each comprising two 3D conv layers, BN, and ReLU activation layers. A vital feature of the residual units is the skip connection, which directly adds the input to the output of the last ReLU layer; a code sketch of one such unit follows Eq. (1). This study employs Adam optimization and the SoftMax classifier with a categorical cross-entropy loss function for output-layer optimization. The ReLU activation function, a non-linear function, is used to address the problem of vanishing or exploding gradients during training and is defined in Eq. (1):$$\begin{aligned} f(y) = {\left\{ \begin{array}{ll} 0 & \text {if } y < 0 \\ y & \text {if } y \ge 0 \end{array}\right. } \end{aligned}$$
(1)
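The skip connection described above can be sketched as follows: a minimal Keras function implementing one residual unit, with two Conv3D/BN/ReLU stages whose input is added to the output of the last ReLU. The filter count is assumed for illustration.

```python
# Sketch of one 3D residual unit: two Conv3D/BN/ReLU stages, with the
# input added directly to the output of the last ReLU layer via a skip
# connection. The filter count is an illustrative assumption.
from tensorflow.keras import layers

def residual_unit_3d(x, filters=64):
    shortcut = x                                     # skip-connection input
    y = layers.Conv3D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv3D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    return layers.Add()([shortcut, y])               # input + last ReLU output
```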
Figure 3 Architecture of the improved 3D-ResNet model.

Feature extraction

Feature extraction involves identifying relevant patterns in the input image data. In the 3D-CNN, the 21st and 22nd layers are fully connected layers, and 1000 features are extracted from these layers. In the 23rd layer, feature concatenation is performed using algorithmic improvisations, merging the two distinct sets of features into one vector by selecting the highest values. Additionally, 1000 features are extracted from the 14th layer of the improved 3D-ResNet. In total, 2000 features are extracted from the two architectures.

Feature fusion

Feature fusion is the process of combining or integrating features extracted from multiple deep neural network architectures (here, the 3D-CNN and the 3D-ResNet) to improve the performance or effectiveness of a machine learning model. In this work, feature fusion is performed using the CCA algorithm.

Feature concatenation

Feature concatenation merges two distinct sets of features into one vector by selecting the highest value24, as expressed in Eq. (2). Concatenating the features from the 21st and 22nd layers of the constructed 3D-CNN gives improved accuracy in PD diagnosis. The composite vector, denoted as \(\tilde{x}_3\), is obtained by taking the highest value from vectors \(x_1 \in FV1\) and \(x_2 \in FV2\). When \(x_1\) and \(x_2\) are equal, \(\tilde{x}_3\) takes only one of these elements, thereby avoiding duplication. The pseudo-code for this process is outlined in Algorithm 1, and a brief code sketch is given after Eq. (2). The feature vectors \(FV1\) and \(FV2\) each have dimension \(1 \times 1000\). The concatenated feature map is expressed as follows:$$\begin{aligned} \tilde{x}_3 = x_3 \leftarrow \max (x_1, x_2) \quad \text {and} \quad x_3 \in \text {Not Repeated} \end{aligned}$$
(2)
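A brief NumPy sketch of this max-based merge is shown below; since Eq. (2) keeps a single copy of equal values, an element-wise maximum over two equally sized vectors realizes it directly.

```python
# Sketch of the max-based feature merge of Eq. (2): each position keeps
# the larger of the two feature values, and equal values contribute a
# single copy, so no element is duplicated.
import numpy as np

def max_merge(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    assert x1.shape == x2.shape
    return np.maximum(x1, x2)

fv1 = np.random.rand(1000)   # features from the 21st (FC) layer
fv2 = np.random.rand(1000)   # features from the 22nd (FC) layer
x3 = max_merge(fv1, fv2)     # composite 1000-dimensional vector
```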
Algorithm 1 Pseudo-code for feature concatenation.

Canonical correlation analysis

In this study, CCA is used to analyze the correlations between two sets of features25,26; a total of 2000 features are extracted from the 3D-ResNet and 3D-CNN modules, and CCA is applied to analyze the associations between these two sets of variables. Suppose \(A \in \mathbb {R}^{p \times n}\) and \(B \in \mathbb {R}^{q \times n}\) are two matrices, each containing \(n\) training feature vectors from the two architectures; in other words, there are \(n\) samples, for each of which \((p+q)\) features have been extracted. Let \(S_{AA} \in \mathbb {R}^{p \times p}\) and \(S_{BB} \in \mathbb {R}^{q \times q}\) denote the within-set covariance matrices of \(A\) and \(B\), and let \(S_{AB} \in \mathbb {R}^{p \times q}\) denote the between-set covariance matrix, where \(S_{BA} = S_{AB}^T\). The comprehensive \((p+q) \times (p+q)\) covariance matrix \(S\) encapsulates all information on the associations between pairs of features, as shown in Eq. (3); a short code sketch of these covariance blocks follows the equation.$$\begin{aligned} S = \begin{bmatrix} \text {cov}(A) & \text {cov}(A, B) \\ \text {cov}(B, A) & \text {cov}(B) \end{bmatrix} = \begin{bmatrix} S_{AA} & S_{AB} \\ S_{BA} & S_{BB} \end{bmatrix} \end{aligned}$$
(3)
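As a sketch, the covariance blocks of Eq. (3) can be computed as follows, assuming A and B store one sample per column; the random matrices stand in for the actual deep features.

```python
# Sketch of the covariance blocks in Eq. (3). A (p x n) and B (q x n)
# hold the n training samples column-wise; random matrices stand in
# for the actual deep features here.
import numpy as np

p, q, n = 1000, 1000, 303                 # 1000 features per branch, 303 scans
A = np.random.rand(p, n)
B = np.random.rand(q, n)

Ac = A - A.mean(axis=1, keepdims=True)    # centre each feature
Bc = B - B.mean(axis=1, keepdims=True)

S_AA = Ac @ Ac.T / (n - 1)                # within-set covariance of A
S_BB = Bc @ Bc.T / (n - 1)                # within-set covariance of B
S_AB = Ac @ Bc.T / (n - 1)                # between-set covariance; S_BA = S_AB.T
```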
This study employs CCA for feature fusion because it computes optimal linear combinations, denoted as \(A^* = W_A^T A\) and \(B^* = W_B^T B\), that maximize the pairwise correlations across the two datasets, as depicted in Eq. (4).$$\begin{aligned} \text {corr}(A^*, B^*) = \frac{\text {cov}(A^*, B^*)}{\sqrt{\text {var}(A^*) \cdot \text {var}(B^*)}} \end{aligned}$$
(4)
where \(\text {cov}(A^*, B^*) = W_A^T S_{AB} W_B\), \(\text {var}(A^*) = W_A^T S_{AA} W_A\) and \(\text {var}(B^*) = W_B^T S_{BB} W_B\). The maximization uses Lagrange multipliers to maximize the covariance between \(A^*\) and \(B^*\) subject to the constraints \(\text {var}(A^*) = \text {var}(B^*) = 1\). The transformation matrices \(W_A\) and \(W_B\) are obtained from the eigenvalue problems in Eq. (5):$$\begin{aligned} \begin{aligned} S_{AA}^{-1} S_{AB} S_{BB}^{-1} S_{BA} \hat{w}_A&= \Lambda ^2 \hat{w}_A, \\ S_{BB}^{-1} S_{BA} S_{AA}^{-1} S_{AB} \hat{w}_B&= \Lambda ^2 \hat{w}_B. \end{aligned} \end{aligned}$$
(5)
where \(\Lambda ^2\) denotes the diagonal matrix of eigenvalues (the squares of the canonical correlations) and \(\hat{w}_A\) and \(\hat{w}_B\) are the eigenvectors. The eigenvectors computed for the non-zero eigenvalues form the transformation matrices. The transformed feature vectors are either concatenated or summed to perform feature-level fusion, yielding the fused vectors \(Z_1\) and \(Z_2\), the Canonical Correlation Discriminant Features (CCDFs) given in Eqs. (6, 7); a code sketch of this fusion follows Eq. (7). The fused vector \(Z\) is applied to the optimization-based feature elimination module to refine and select the important features that contribute significantly to the model's predictive accuracy.$$\begin{aligned} & Z_1 = (A^* \mid B^*) = \left( (W_A^T A) \mid (W_B^T B)\right) = \begin{bmatrix} W_A & 0 \\ 0 & W_B \end{bmatrix}^T \begin{pmatrix} A \\ B \end{pmatrix} \end{aligned}$$
(6)
$$\begin{aligned} & Z_2 = A^* + B^* = W_A^T A + W_B^T B = \begin{pmatrix} W_A \\ W_B \end{pmatrix}^T \begin{pmatrix} A \\ B \end{pmatrix} \end{aligned}$$
(7)
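Continuing the covariance sketch above, the transformation matrices of Eq. (5) and the fused CCDF vectors of Eqs. (6, 7) can be obtained roughly as follows. The small ridge term is an implementation assumption: with more features than samples the covariance blocks are singular, so the plain inverses of Eq. (5) would not exist.

```python
# Continuation of the covariance sketch: solve the eigenproblem of
# Eq. (5) for W_A, derive the paired W_B, and form the fused vectors
# Z1 (Eq. 6) and Z2 (Eq. 7). The ridge term eps and the number of
# retained canonical pairs d are implementation assumptions.
import numpy as np

eps = 1e-3
S_AA_r = S_AA + eps * np.eye(p)           # regularized within-set covariances
S_BB_r = S_BB + eps * np.eye(q)

M_A = np.linalg.solve(S_AA_r, S_AB) @ np.linalg.solve(S_BB_r, S_AB.T)
lam2, V = np.linalg.eig(M_A)              # eigenvalues = squared correlations

d = 10                                    # canonical pairs kept (assumed)
idx = np.argsort(-lam2.real)[:d]
W_A = V[:, idx].real
W_B = np.linalg.solve(S_BB_r, S_AB.T) @ W_A   # paired directions for B
W_B /= np.linalg.norm(W_B, axis=0)            # normalise each direction

A_star = W_A.T @ A                        # transformed feature sets
B_star = W_B.T @ B
Z1 = np.vstack([A_star, B_star])          # concatenation fusion, Eq. (6)
Z2 = A_star + B_star                      # summation fusion, Eq. (7)
```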
Feature optimization

The primary goal of feature optimization is to identify the significant features responsible for predictions while lowering the computational and data-handling load of the system. The reduced feature dimension after optimization streamlines data management and makes the training and prediction processes easier to handle for larger datasets.

Mathematical model of WOA

Mirjalili and Lewis developed the Whale Optimization Algorithm (WOA), inspired by the bubble-net feeding strategy of humpback whales27,28. This behaviour, in which whales trap their prey using bubbles blown in a circular pattern, is adapted in WOA to search for solutions to complex optimization problems.

Encircling prey: In their hunting strategy, humpback whales locate and encircle their prey. This behaviour is reflected in WOA, which assumes that the current best solution approximates the optimal solution29. Once the best search agent is identified, the other agents converge towards it, mimicking the whales' encircling behaviour. This process is represented mathematically in Eqs. (8, 9), where \(\vec {Z}(u)\) is the agent's position, \(u\) is the iteration, and \(\vec {Z}^*\) is the optimal solution. In Eqs. (10, 11), \(\vec {A}\) and \(\vec {D}\) are coefficient vectors, \(\vec {s}\) is a random vector in \([0,1]\), and \(\vec {a}\) is a vector that decreases linearly from 2 to 0 over the iterations; a brief code sketch of these coefficients follows Eq. (11).$$\begin{aligned} & \vec {Z}(u+1) = \vec {Z}^*(u) - \vec {A} \cdot \vec {F}, \end{aligned}$$
(8)
$$\begin{aligned} & \vec {F} = \left| \vec {D} \cdot (\vec {Z}^*(u) - \vec {Z}(u)) \right| , \end{aligned}$$
(9)
$$\begin{aligned} & \vec {A} = 2 \cdot \vec {a} \cdot \vec {s} - \vec {a}, \end{aligned}$$
(10)
$$\begin{aligned} & \vec {D} = 2 \cdot \vec {s} \end{aligned}$$
(11)
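As a small sketch of Eqs. (10, 11), the coefficient vectors can be generated per iteration as follows, with \(\vec{a}\) decaying linearly from 2 to 0; the use of NumPy's random Generator is an implementation choice.

```python
# Sketch of the WOA coefficient vectors of Eqs. (10, 11): s is a random
# vector in [0, 1] and a decays linearly from 2 to 0 over the iterations.
import numpy as np

def woa_coefficients(u, max_iter, dim, rng):
    a = 2.0 * (1.0 - u / max_iter)   # linear decay from 2 to 0
    s = rng.random(dim)              # random vector in [0, 1]
    A = 2.0 * a * s - a              # Eq. (10)
    D = 2.0 * s                      # Eq. (11)
    return A, D

rng = np.random.default_rng(0)
A, D = woa_coefficients(u=10, max_iter=100, dim=2000, rng=rng)
```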
The bubble-net attacking method employed by humpback whales involves shrinking and spiralling towards the prey to tighten the search area. In WOA, this is simulated by decreasing the value of \(\vec {a}\) (see Eq. 10), which narrows the search space and mimics the prey-catching behaviour. As \(\vec {A}\) depends on \(\vec {a}\), its range likewise shrinks as \(\vec {a}\) decreases linearly from 2 to 0. The spiral movement pattern of the whales, which is crucial for capturing their prey, is modelled mathematically in Eqs. (12, 13).$$\begin{aligned} & \vec {F}^* = \left| \vec {Z}^*(u) - \vec {Z}(u)\right| , \end{aligned}$$
(12)
$$\begin{aligned} & \vec {Z}(u+1) = e^{bk} \cos (2\pi k) \vec {F}^* + \vec {Z}^*(u) \end{aligned}$$
(13)
The distance between the whale and its target prey is denoted as \(\vec {F}^*\). The parameter \(b\) is the logarithmic spiral constant, and \(k\) is a random value in \([-1, 1]\). Humpback whales exhibit a 50% likelihood of choosing between a spiral and a direct narrowing approach toward their prey. The parameter \(p\) in Eq. (14), a random value in \([0, 1]\), determines which movement pattern is selected.$$\begin{aligned} \vec {Z}(u+1) = {\left\{ \begin{array}{ll} \vec {Z}^*(u) - \vec {A} \cdot \vec {F}, & \text {if } p < 0.5 \\ e^{bk} \cos (2\pi k) \vec {F}^* + \vec {Z}^*(u), & \text {if } p \ge 0.5 \end{array}\right. } \end{aligned}$$
(14)
In the exploration phase of WOA, \(\vec {A}\) takes values with \(|\vec {A}| \ge 1\), prompting the search agents to diverge from the reference whale. This promotes a global search, as the updated position of a search agent is determined by a randomly selected agent rather than the best one. This exploration mechanism is formulated in Eqs. (15, 16), where \(\vec {Z}_{\text {rand}}\) is the position of a whale chosen at random from the current population; a sketch combining all three update modes follows Eq. (16).$$\begin{aligned} & \vec {Z}(u+1) = \vec {Z}_{\text {rand}} - \vec {A} \cdot \vec {F}, \end{aligned}$$
(15)
$$\begin{aligned} & \vec {F} = |\vec {D} \cdot (\vec {Z}_{\text {rand}} - \vec {Z})| \end{aligned}$$
(16)
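Putting Eqs. (8)-(16) together, the update of a single whale per iteration can be sketched as below. The spiral constant \(b\) and the vector-wise \(|\vec{A}| < 1\) test are assumptions of this sketch.

```python
# Sketch of one whale's position update per iteration: spiral move
# (Eq. 13) with probability p >= 0.5, otherwise encircling (Eqs. 8, 9)
# when |A| < 1 or exploration around a random whale (Eqs. 15, 16)
# when |A| >= 1. b is the logarithmic spiral constant (value assumed).
import numpy as np

def update_position(Z, Z_best, Z_rand, A, D, rng, b=1.0):
    p = rng.random()                          # movement selector, Eq. (14)
    if p < 0.5:
        if np.all(np.abs(A) < 1):             # exploitation: encircle prey
            F = np.abs(D * (Z_best - Z))      # Eq. (9)
            return Z_best - A * F             # Eq. (8)
        F = np.abs(D * (Z_rand - Z))          # Eq. (16)
        return Z_rand - A * F                 # Eq. (15)
    k = rng.uniform(-1, 1)                    # spiral parameter in [-1, 1]
    F_star = np.abs(Z_best - Z)               # Eq. (12)
    return np.exp(b * k) * np.cos(2 * np.pi * k) * F_star + Z_best  # Eq. (13)
```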
WOA for feature selection

This study seeks to enhance feature selection for PD classification using the bio-inspired WOA. A total of 2000 features are extracted from the FC-3 layer of the 3D-CNN model and the FC-4 layer of the 3D-ResNet model. Effective feature selection is crucial for improving data representation and boosting performance against specific criteria. Optimal feature selection not only simplifies the model but also enhances the estimator's generalisation ability, learning speed, and overall performance. Common issues include stagnation at local optima and high computational cost, so efficient global search strategies are needed to tackle the feature selection problem. In WOA, each feature subset is analogous to the position of a whale within the search space, and each subset may include up to \(N\) features, where \(N\) is the number of features in the original dataset. The quality of each solution is judged on two criteria: the number of selected features and the classification accuracy; the fewer the features and the higher the accuracy, the better the solution. Solutions are evaluated using a fitness function that combines these two objectives: the accuracy achieved using a K-Nearest Neighbors (KNN) classifier and the number of features selected.

Solution representation is a critical challenge in designing metaheuristic algorithms. In this study, solutions are represented as one-dimensional binary vectors with \(N\) elements, where \(N\) matches the number of features in the dataset. A "1" indicates that the corresponding feature is selected, while a "0" denotes its exclusion. This binary encoding efficiently delineates the inclusion and exclusion of features in the solution set. The fitness function in the proposed method is crafted to balance the number of selected features (minimized) against the classification accuracy (maximized). It is defined in Eq. (17):$$\begin{aligned} f_\theta = \alpha \gamma _R (D) + (1-\alpha ) \frac{|R|}{|N|} \end{aligned}$$
(17)
Here, \(f_\theta\) is the fitness function and \(\gamma _R (D)\) is the classification error rate of the KNN classifier; \(|R|\) is the cardinality of the selected subset and \(|N|\) is the total number of features in the dataset. The parameter \(\alpha\) weights the classification accuracy, while \(\beta = (1 - \alpha )\) weights the subset size; their values are chosen to reflect the trade-off between these two objectives, ensuring an effective evaluation of candidate solutions30. A code sketch of this fitness evaluation is given below. The optimization process begins by initializing the coefficient vectors \(\vec {A}\) and \(\vec {D}\) and setting the initial best solution vector \(\vec {Z}^*\) to a randomly chosen solution. The process then continues through a predefined number of iterations, during which each whale's position is updated according to Eqs. (8, 9) and the fitness of each position is evaluated using the designated fitness function. If a better solution is identified during any iteration, the best solution vector \(\vec {Z}^*\) is updated accordingly. This optimization cycle is repeated for a specified number of independent runs and terminates once the designated number of iterations is reached, concluding the search for the optimal solution. The detailed flow of WOA for feature selection is shown in Fig. 4.

Figure 4 Feature selection process using WOA.
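A hedged sketch of the fitness evaluation of Eq. (17) is given below, using scikit-learn's KNN classifier; the \(\alpha\) value, the neighbourhood size, and the cross-validated accuracy estimate are illustrative assumptions. In a binary WOA, a whale's continuous position is typically thresholded (for example through a sigmoid transfer function) to obtain the 0/1 mask evaluated here.

```python
# Sketch of the fitness function of Eq. (17): a weighted sum of the KNN
# classification error rate gamma_R(D) and the selected-subset ratio
# |R|/|N|. alpha, the neighbourhood size, and 5-fold cross-validation
# are illustrative assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, alpha=0.99):
    """mask: binary vector of length N; 1 selects a feature, 0 excludes it."""
    selected = np.flatnonzero(mask)
    if selected.size == 0:
        return 1.0                          # worst fitness for an empty subset
    knn = KNeighborsClassifier(n_neighbors=5)
    acc = cross_val_score(knn, X[:, selected], y, cv=5).mean()
    error_rate = 1.0 - acc                  # gamma_R(D) in Eq. (17)
    return alpha * error_rate + (1 - alpha) * selected.size / mask.size
```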
