Joint hybrid recursive feature elimination-based channel selection and ResGCN for cross-session MI recognition

This paper proposes a method that integrates recursive feature elimination with a residual-based graph convolutional neural network [18] for classifying MI data. The channel selection method combines random forest, gradient boosting, and logistic regression, realizing recursive feature elimination based on feature importance; it automatically ranks channel importance while taking the distribution characteristics of the EEG data into account. The residual graph convolutional network then mines the EEG data in depth, making full use of the inter-electrode relationships reflected in the recordings: it extracts information in the spatial domain to perform classification, compensating for the low spatial resolution of EEG. Figure 1 shows the overall framework of the proposed method. First, channel selection is performed. The multi-channel EEG data then undergo correlation computation to obtain the graph Laplacian, yielding the spatial topology representation of the EEG. The data stream is additionally augmented, and finally the data are fed into the residual graph convolutional network for classification and recognition.

Fig. 1 Overall framework of the H-RFE combined with Res-GCN classification model.

Channel selection algorithm
Recursive feature elimination
Feature selection is crucial in deep learning: the aim is to find an optimal subset of features while preserving model performance as far as possible. It removes weakly correlated and redundant features from large datasets, thereby reducing the number of features, improving model accuracy, and shortening the time required for model fitting. Recursive feature elimination (RFE) is a wrapper method whose implementation performs a heuristic search through sequential selection and backward elimination. The process of the RFE algorithm is illustrated in Fig. 2.

Fig. 2 Flowchart of the recursive feature elimination algorithm.

RFE is a greedy algorithm that starts its search from the full feature set and selects subsets through a feature-ranking method [19]. The algorithm repeatedly constructs machine learning models to rank the importance of features: by fitting the data to an estimator, it eliminates the feature with the lowest weight at each iteration and, once the iterations are complete, has produced weights for all features. RFE ranks feature subsets according to an evaluation criterion, typically the predictive accuracy of a classifier; common choices include random forests, gradient boosting, and support vector machines. In this paper, we apply the RFE algorithm to compare the importance of EEG channels and eliminate minor factors, reducing computational resource consumption and improving model accuracy. The specific procedure is shown in Algorithm 1.

Algorithm 1: Recursive feature elimination algorithm.

Overall, RFE iteratively eliminates unimportant features and evaluates model performance through cross-validation to select the best-performing feature subset on the given data. This enhances the model's capacity for generalization while mitigating the likelihood of overfitting.

Hybrid recursive feature elimination (H-RFE) algorithm

In machine learning tasks, a fusion of several learning techniques can outperform any single method. This paper therefore proposes a fusion recursive feature elimination algorithm that combines RF, GBM, and LR. Random forest (RF) uses bootstrap sampling, resampling n examples from the original dataset to construct n different training sets, with each decision tree trained on a different set [20]. Gradient boosting machine (GBM) is an ensemble learning model based on the CART algorithm [21] that iteratively trains multiple decision trees and stacks them to improve prediction accuracy. Logistic regression (LR) is a statistical learning method for binary classification that weights the input features and applies a logistic function to map them to probabilities between 0 and 1 [22]. Channel selection is performed by integrating the channel weights and accuracies obtained from the three RFE models. A schematic of the algorithm is shown in Fig. 3.

Fig. 3 Schematic diagram of the fusion recursive feature elimination algorithm.

\(W_R\), \(W_G\), and \(W_L\) are the channel weights obtained from training the RFE-RF, RFE-GBM, and RFE-LR models, respectively.
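To make the channel-selection procedure concrete, the following Python sketch (using scikit-learn) outlines how the three RFE rankings could be obtained and then fused by the normalization and accuracy-weighted sum formalized in Eqs. (1) and (2) below. The variable names (X, y, n_keep), the train/test split, and the use of each final estimator's importance scores (or absolute coefficients) as channel weights are illustrative assumptions rather than the authors' exact implementation.

```python
# Illustrative sketch of H-RFE channel selection; not the authors' exact code.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def rfe_channel_weights(estimator, X, y, n_keep):
    """Run RFE over channel-level features; return per-channel weights and test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    selector = RFE(estimator, n_features_to_select=n_keep, step=1).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, selector.predict(X_te))
    # Use the fitted estimator's importance scores as channel weights (an assumption):
    # tree ensembles expose feature_importances_, LR exposes coefficients.
    est = selector.estimator_
    w_kept = getattr(est, "feature_importances_", None)
    if w_kept is None:
        w_kept = np.abs(est.coef_).mean(axis=0)
    weights = np.zeros(X.shape[1])
    weights[selector.support_] = w_kept          # eliminated channels keep weight 0
    return weights, acc

def minmax(w):
    # Eq. (1): min-max normalization of a weight vector
    return (w - w.min()) / (w.max() - w.min())

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}

def hybrid_rfe_ranking(X, y, n_keep):
    """X: (n_trials, n_channels) channel-level features, y: MI labels (assumed shapes)."""
    W_H = np.zeros(X.shape[1])
    for name, est in models.items():
        w, acc = rfe_channel_weights(est, X, y, n_keep)
        W_H += minmax(w) * acc                   # Eq. (2): accuracy-weighted fusion
    return np.argsort(W_H)[::-1]                 # channels ranked most to least important
```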
Because the channel weights obtained from the three models are on different scales, they must first be normalized:$$\bar {w}=\frac{{w - \min \left( W \right)}}{{\max \left( W \right) - \min \left( W \right)}}$$
(1)
Thus, \({\bar {W}_R}\left(X\right)\), \({\bar {W}_G}\left(X\right)\), and \({\bar {W}_L}\left(X\right)\) denote the normalized versions of \(W_R\), \(W_G\), and \(W_L\), respectively. To obtain the channel ranking, the H-RFE algorithm integrates the normalized channel weights with the model accuracies to rank the importance of all EEG channels:$${W_H}\left( X \right)={\bar {W}_R}\left( X \right) \times RF.acc+{\bar {W}_G}\left( X \right) \times GBM.acc+{\bar {W}_L}\left( X \right) \times LR.acc$$
(2)
where RF.acc, GBM.acc, and LR.acc are the model accuracies obtained by training the data with the RF, GBM, and LR algorithms, respectively, after the optimal channel combinations have been obtained from the three RFE models and the EEG data have been divided into training and testing sets.

Residual graph convolutional neural network

Graph convolutional networks can be divided into spectral-domain and spatial-domain graph convolutional networks [23]. Spectral methods use the convolution theorem to define graph convolution in the spectral domain, whereas spatial methods start from the node domain and aggregate each central node with its neighbouring nodes through a defined aggregation function. Spectral-domain graph convolutional networks define the convolution operation through the eigendecomposition of the graph Laplacian matrix: convolution is performed on the spectral representation of graph signals obtained via the graph Fourier transform. For EEG signals, spectral-domain graph convolutional networks can effectively capture the complex spatial dependencies and spectral features between brain regions, extracting spatial- and frequency-domain information, which makes them well suited to EEG data with complex spatial structure. This paper therefore adopts the spectral-domain graph convolution method.

Graph representation

For an undirected graph \(G=\left\{V,E,A\right\}\), \(\left|V\right|=n\) indicates that there are n nodes, \(E\) is the edge set, and \(A\) is the adjacency matrix defining the connections between nodes. Here \(A=\left|P\right|-I_{n}\), where \(\left|P\right|\) denotes the element-wise absolute Pearson correlation matrix between nodes and \(I_{n}\in\mathbb{R}^{n\times n}\) is the identity matrix. \(L=D-A\) is the graph Laplacian matrix, where \(D\) is a diagonal matrix, \(D_{i,i}\) is the degree of the i-th node, and \(D_{i,i}=\sum_{j}A_{i,j}\). The normalized Laplacian matrix [24] is defined as:$$L={I_n} - {D^{ - \frac{1}{2}}}A{D^{ - \frac{1}{2}}}$$
(3)
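For reference, a minimal NumPy sketch of this graph construction is given below; it assumes a single subject's recording is stored as an (n_channels, n_samples) array and that every channel has a non-zero degree.

```python
# Minimal sketch of the graph representation described above (NumPy); illustrative only.
import numpy as np

def graph_laplacians(eeg):
    """eeg: (n_channels, n_samples) array of one subject's EEG."""
    n = eeg.shape[0]
    P = np.corrcoef(eeg)                         # Pearson correlation between channels
    A = np.abs(P) - np.eye(n)                    # adjacency A = |P| - I_n (no self-loops)
    D = np.diag(A.sum(axis=1))                   # degree matrix, D_ii = sum_j A_ij
    L = D - A                                    # combinatorial Laplacian
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    L_norm = np.eye(n) - d_inv_sqrt @ A @ d_inv_sqrt   # normalized Laplacian, Eq. (3)
    return A, L, L_norm
```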
Because \(L\) is a real symmetric matrix, its eigendecomposition yields \(L=U\varLambda U^{T}\), where \(U=\left(u_{1},u_{2},\cdots,u_{n}\right)\in\mathbb{R}^{n\times n}\) is the matrix of n mutually orthogonal eigenvectors of L, \(\varLambda=\mathrm{diag}\left(\lambda_{1},\lambda_{2},\cdots,\lambda_{n}\right)\in\mathbb{R}^{n\times n}\) is the diagonal matrix of eigenvalues, and \(\lambda_{i}\) is the eigenvalue corresponding to \(u_{i}\). The correlation matrix, Pearson matrix, absolute Pearson coefficient matrix, adjacency matrix, degree matrix, and Laplacian matrix for subject 1 in the SHU dataset are shown in Fig. 4.

Fig. 4 Matrices for Subject 1 in the SHU dataset: (a) correlation matrix, (b) Pearson matrix, (c) absolute Pearson coefficient matrix, (d) adjacency matrix, (e) degree matrix, (f) Laplacian matrix.

Residual graph convolutional neural network architecture

As the number of layers and neurons increases, the non-linear fitting capacity of a deep neural network grows. However, merely stacking layers can lead to vanishing gradients, exploding gradients, and network degradation. We therefore introduce residual structures into the graph convolutional network. In the spectral domain, the graph convolution of two signals \(x_1\) and \(x_2\) is defined as:$${x_1}{ * _G}{x_2}=U\left( {\left( {{U^T}{x_1}} \right) \odot \left( {{U^T}{x_2}} \right)} \right)$$
(4)
\({*}_{G}\) denotes the graph convolution operator, and \(\odot\) denotes the Hadamard product. For an input signal \(x\), the graph convolution with a convolution kernel (filter) \(g\in\mathbb{R}^{n}\) is defined as:$$x{ * _G}g=U\left( {{U^T}x \odot {U^T}g} \right)$$
(5)
If \(g\) is parameterized as \(g_{\theta}=\mathrm{diag}\left(U^{T}g\right)\), the graph convolution of x can be simplified to:$$x{ * _G}{g_\theta }=U{g_\theta }{U^T}x$$
(6)
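To make the cost of Eq. (6) explicit, the short NumPy sketch below applies the spectral filter directly to a single graph signal; theta is assumed to be a length-n vector holding the diagonal of \(g_{\theta}\). The full eigendecomposition it requires is exactly what the Chebyshev parameterization introduced next avoids.

```python
# Naive spectral graph filtering per Eq. (6); illustrative only.
import numpy as np

def spectral_filter(x, L_norm, theta):
    """x: (n_nodes,) graph signal, L_norm: normalized Laplacian, theta: (n_nodes,) filter."""
    _, U = np.linalg.eigh(L_norm)      # full eigendecomposition L = U diag(lam) U^T
    x_hat = U.T @ x                    # graph Fourier transform
    return U @ (theta * x_hat)         # filter in the spectral domain, transform back
```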
The Chebyshev network (ChebyNet) [25] parameterizes the convolution kernel with Chebyshev polynomials instead of learning a free spectral-domain kernel: \(g_{\theta}=\sum_{i=0}^{K}\theta_{i}T_{i}\left(\tilde{\varLambda}\right)\), where \(\tilde{\varLambda}=\frac{2\varLambda}{\lambda_{max}}-I_{n}\), and the Chebyshev polynomials are defined by the recurrence \(T_{i}\left(x\right)=2xT_{i-1}\left(x\right)-T_{i-2}\left(x\right)\) with \(T_{0}\left(x\right)=1\) and \(T_{1}\left(x\right)=x\). The ChebyNet graph convolution is therefore:$$x{ * _G}{g_\theta }=U\left( {\sum\limits_{{i=0}}^{K} {{\theta _i}{T_i}\left( {\tilde {\varLambda }} \right)} } \right){U^T}x$$
(7)
Letting \(\tilde{L}=\frac{2L}{\lambda_{max}}-I_{n}\), we have \(T_{i}\left(\tilde{L}\right)=UT_{i}\left(\tilde{\varLambda}\right)U^{T}\), so the ChebyNet graph convolution simplifies to:$$x{ * _G}{g_\theta }=\sum\limits_{{i=0}}^{K} {{\theta _i}{T_i}\left( {\tilde {L}} \right)} x$$
(8)
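As a sketch of how Eq. (8) and the residual structure introduced earlier might be realized, the following PyTorch module implements the Chebyshev recurrence on the rescaled Laplacian and wraps two such layers with an identity shortcut. The layer widths, activation, and dense handling of \(\tilde{L}\) are illustrative choices, not the exact network configuration used in this work.

```python
# Illustrative Chebyshev graph convolution (Eq. (8)) with a residual connection.
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    def __init__(self, in_feats, out_feats, K=3):
        super().__init__()
        # one weight matrix theta_i per Chebyshev order i = 0..K
        self.theta = nn.Parameter(torch.randn(K + 1, in_feats, out_feats) * 0.01)

    def forward(self, x, L_tilde):
        # x: (n_nodes, in_feats); L_tilde = 2L/lambda_max - I_n (dense here for simplicity)
        Tx = [x, L_tilde @ x]                          # T_0(L~)x and T_1(L~)x
        for _ in range(2, self.theta.shape[0]):
            Tx.append(2 * L_tilde @ Tx[-1] - Tx[-2])   # Chebyshev recurrence
        return sum(t @ th for t, th in zip(Tx, self.theta))

class ResChebBlock(nn.Module):
    """Two ChebConv layers with an identity shortcut (residual connection)."""
    def __init__(self, feats, K=3):
        super().__init__()
        self.conv1 = ChebConv(feats, feats, K)
        self.conv2 = ChebConv(feats, feats, K)
        self.act = nn.ReLU()

    def forward(self, x, L_tilde):
        h = self.act(self.conv1(x, L_tilde))
        h = self.conv2(h, L_tilde)
        return self.act(h + x)                         # skip connection eases optimization
```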
ChebyNet graph convolution does not require an eigendecomposition of the Laplacian matrix, and the convolution kernel has only K + 1 learnable parameters; the parameter complexity is greatly reduced, which improves computational speed.

ChebyNet uses a complete binary tree for pooling. First, the input feature tensor is partitioned into several equally sized blocks by the coarsening stage of the Graclus multilevel clustering algorithm [24]. These blocks are then arranged as a complete binary tree in which every non-leaf node has exactly two children. At each non-leaf node, the most compatible feature blocks are pooled according to a greedy criterion, merging two nodes into one; at each leaf node, the corresponding feature block is passed directly to the next layer. If the number of feature blocks in the last layer is not a multiple of two, zero-padding blocks are appended on the right to make it so.

Figure 5 illustrates the residual graph convolutional neural network model. The model uses 12 graph convolution layers, with a graph max-pooling layer after every two layers to halve the dimension. Third-order Chebyshev polynomials are used to approximate the graph convolution filters in the experiments. Cross-entropy with L2 regularization (regularization parameter 0.001) is used as the loss function, and the residual learning framework facilitates the convergence of the deeper model.

Fig. 5 Residual graph convolutional neural network model.

We use the data at each sampling time point as an input sample, shuffling the EEG data collected during the motor imagery period of each subject to create the training data. The graph Laplacian matrix has dimension N × N, where N is the number of recording channels; for example, the PhysioNet dataset uses the international 10–10 system with 64 EEG electrodes, so the input dimension is 64 × 64. Convolution then proceeds through neighbour aggregation, weight learning, local feature extraction, and multi-layer stacking. For each node, graph convolution aggregates information from the node and its neighbours using learned weight matrices, achieving an operation analogous to conventional convolution. Through the learned weights, graph convolution extracts local features while preserving the local connectivity of nodes, and stacking multiple graph convolution layers gradually extracts higher-level feature representations. The residual mechanism also alleviates the gradient vanishing and explosion caused by deep stacking. Fully connected layers flatten the features extracted by graph convolution into one-dimensional vectors, with Dropout applied to reduce overfitting. Softmax converts the raw network outputs into a probability distribution: for a multi-class motor imagery problem, it maps the raw output for each class to a class probability, ensuring that the probabilities over all classes sum to 1.

Evaluation metrics

This paper evaluates model performance using accuracy and F1-score computed from the confusion matrix. The expressions for accuracy and F1-score are as follows:$${\text{Accuracy}}=\frac{{{\text{TP}}+{\text{TN}}}}{{{\text{TP}}+{\text{FP}}+{\text{FN}}+{\text{TN}}}}$$
(9)
$${\text{Precision}}=\frac{{{\text{TP}}}}{{{\text{TP}}+{\text{FP}}}}$$
(10)
$${\text{Recall}}=\frac{{{\text{TP}}}}{{{\text{TP}}+{\text{FN}}}}$$
(11)
$${\text{F1-score}}=2 \times \frac{{{\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}}+{\text{Recall}}}}$$
(12)
where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.
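For completeness, a short Python sketch of these metrics for the binary case is given below (multi-class results would average the per-class scores); it relies only on a confusion matrix computed with scikit-learn.

```python
# Sketch of Eqs. (9)-(12) from a binary confusion matrix; illustrative only.
from sklearn.metrics import confusion_matrix

def accuracy_and_f1(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # TN, FP, FN, TP
    accuracy = (tp + tn) / (tp + fp + fn + tn)                  # Eq. (9)
    precision = tp / (tp + fp)                                  # Eq. (10)
    recall = tp / (tp + fn)                                     # Eq. (11)
    f1 = 2 * precision * recall / (precision + recall)          # Eq. (12)
    return accuracy, f1
```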
