CTNet: a convolutional transformer network for EEG-based motor imagery classification

CNN-based methodologies have demonstrated efficacy in MI-EEG classification, chiefly owing to the CNN's robust capability for local feature extraction. Nonetheless, CNNs typically possess a limited receptive field, potentially impeding their ability to capture global feature dependencies. The Transformer's self-attention mechanism effectively captures long-distance dependencies within the data, facilitating a comprehensive understanding of the entire input sequence. This property is particularly critical in MI-EEG signal processing, where MI tasks involve complex cortical coordination that often spans extensive intervals in the time series; self-attention is therefore well suited to understanding and analyzing activity patterns that unfold across multiple time points. Additionally, the Transformer can dynamically adjust its focus, applying weighted attention to critical signal features within MI-EEG data, such as rhythm changes in specific frequency bands, thereby enhancing the model's sensitivity to key information and its decoding accuracy. Based on these insights, we introduce the CTNet model, which combines the CNN's capability for local feature extraction with the Transformer's ability to process global information, offering substantial advantages in decoding MI-EEG signals. The efficacy of CTNet has been validated through subject-specific and cross-subject classification experiments conducted on the BCI IV-2a and BCI IV-2b datasets.

Discussion on subject-specific classification

In this study, we compare CTNet for subject-specific MI-EEG decoding with leading algorithms based solely on CNN architectures (ShallowConvNet, DeepConvNet, EEGNet, TSF-STAN) and with those combining CNN and Transformer frameworks (Conformer and MI-CAT). We reimplemented ShallowConvNet, DeepConvNet, EEGNet, and Conformer from their open-source code to ensure a fair comparison under identical experimental conditions. To conduct a thorough comparison of these state-of-the-art algorithms, we examine four aspects: data preprocessing, data augmentation strategies, model architecture, and the number of trainable network parameters, with comparative results presented in Table 7.

In 2017, inspired by the FBCSP algorithm, Schirrmeister et al. introduced the notable ShallowConvNet and DeepConvNet models. ShallowConvNet, employing just two one-dimensional convolutions (temporal and spatial), achieved notable results, with average accuracies and Kappa values of 75.69% and 0.6759 on the BCI IV-2a dataset and 85.13% and 0.7026 on the BCI IV-2b dataset, respectively. DeepConvNet, which builds upon ShallowConvNet by adding three convolution-pooling blocks, improved performance on both datasets, in particular achieving a 2.09% and 0.08% increase in average accuracy and Kappa value on the BCI IV-2a dataset. The cost, however, was an increase of over 200,000 trainable parameters.

In 2018, Lawhern et al. proposed the EEGNet model, which introduced depthwise and separable convolutions, reducing the number of trainable parameters to 2.9k for the BCI IV-2a dataset and 2.1k for the BCI IV-2b dataset and thereby effectively mitigating overfitting. Among all compared models, EEGNet has the fewest parameters, which greatly aids in reducing training time and in deploying the model on memory-constrained devices.
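To make the parameter savings concrete, the following is a minimal PyTorch sketch of a depthwise convolution followed by a separable (depthwise plus pointwise) convolution, the two building blocks EEGNet popularized; the filter counts, kernel lengths, electrode number, and sample length used here are illustrative placeholders, not the published EEGNet or CTNet configuration.

```python
import torch
import torch.nn as nn

# Illustrative sizes only: input shape (batch, 1, n_electrodes, n_samples).
n_electrodes, F1, D, F2 = 22, 8, 2, 16

block = nn.Sequential(
    # temporal convolution over samples
    nn.Conv2d(1, F1, kernel_size=(1, 64), padding=(0, 32), bias=False),
    nn.BatchNorm2d(F1),
    # depthwise spatial convolution: groups=F1 gives each input map its own filters
    nn.Conv2d(F1, F1 * D, kernel_size=(n_electrodes, 1), groups=F1, bias=False),
    nn.BatchNorm2d(F1 * D),
    nn.ELU(),
    nn.AvgPool2d((1, 4)),
    # separable convolution = depthwise temporal conv followed by a 1x1 pointwise conv
    nn.Conv2d(F1 * D, F1 * D, kernel_size=(1, 16), padding=(0, 8),
              groups=F1 * D, bias=False),
    nn.Conv2d(F1 * D, F2, kernel_size=(1, 1), bias=False),
    nn.BatchNorm2d(F2),
    nn.ELU(),
    nn.AvgPool2d((1, 8)),
)

x = torch.randn(4, 1, n_electrodes, 1000)          # 4 trials, 22 electrodes, 1000 samples
print(block(x).shape)
print(sum(p.numel() for p in block.parameters()))  # stays small (roughly 1.5k here)
```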
EEGNet's average recognition accuracies on the BCI IV-2a and BCI IV-2b datasets were 1.70% and 2.58% higher than those of ShallowConvNet, with Kappa values improving by 0.0227 and 0.0516, respectively. In 2022, Jia et al. introduced the TSF-STAN model. This model first applies time-contained spatial filtering (TSF) to preprocess the data, increasing the inter-category differences of the EEG signals while preserving temporal features; it then uses a CNN-based spatial–temporal analysis network (STAN) to further exploit discriminative spatial and temporal features and classify the EEG categories in an end-to-end process. Even without data augmentation, TSF-STAN's average recognition accuracies on the BCI IV-2a and BCI IV-2b datasets were 5.61% and 0.29% higher than those of EEGNet with data augmentation; its performance might improve further if data augmentation were applied. Ablation studies by Jia et al. also revealed that removing the TSF preprocessing step and using only the STAN network decreased the average recognition accuracy and Kappa value on the BCI IV-2a dataset by 17.3% (p < 0.01) and 0.2260, underscoring the significant performance boost provided by the TSF preprocessing step. Compared with CTNet using data augmentation, TSF-STAN showed a 0.48% higher recognition accuracy on the BCI IV-2a dataset, while CTNet had a 0.49% higher accuracy on the BCI IV-2b dataset. Compared with the STAN-only model, CTNet without data augmentation achieved an average recognition accuracy and Kappa value that were higher by 9.61% and 0.1318, respectively.
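For readers less familiar with the kappa statistic reported alongside accuracy throughout this comparison, the snippet below sketches its computation, assuming the commonly used Cohen's kappa and scikit-learn's cohen_kappa_score; the label arrays and the four-class setup are placeholders.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Placeholder labels for a 4-class MI task (e.g., BCI IV-2a has four classes).
y_true = np.random.randint(0, 4, size=288)
y_pred = np.random.randint(0, 4, size=288)

# Library computation.
print(cohen_kappa_score(y_true, y_pred))

# Equivalent formula: kappa = (p_o - p_e) / (1 - p_e),
# where p_o is observed agreement and p_e is chance agreement.
conf = np.zeros((4, 4))
for t, p in zip(y_true, y_pred):
    conf[t, p] += 1
p_o = np.trace(conf) / conf.sum()
p_e = (conf.sum(axis=1) @ conf.sum(axis=0)) / conf.sum() ** 2
print((p_o - p_e) / (1 - p_e))
```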
Table 7 Comparative analysis of state-of-the-art algorithms for subject-specific classification.

Conformer and MI-CAT are exemplary models for decoding MI-EEG that adopt a hybrid architecture combining CNN and Transformer components. In 2023, combining the local feature extraction capabilities of ShallowConvNet with the global modeling strength of the Transformer, Song et al. proposed the Conformer model. On the BCI IV-2a and IV-2b datasets, the average recognition accuracies of Conformer were 1.97% and 0.74% higher than those of ShallowConvNet, demonstrating that using a Transformer to globally model the high-level features extracted by a CNN can enhance the recognition of MI-EEG signals. Correspondingly, the number of trainable parameters also increased by approximately 0.12 million. Our CTNet model, inspired by both Conformer and EEGNet, is designed to achieve high recognition accuracy while keeping the number of trainable parameters small, thus reducing overfitting and enhancing the model's generalization capability.

In 2023, Zhang and colleagues proposed the MI-CAT model to address the inter-subject variability of EEG signals. MI-CAT employs a temporal-spatial CNN to learn feature representations from paired EEG data, followed by two domain-related attention blocks that preserve domain-dependent information; it then uses the Transformer's self-attention and cross-attention mechanisms to facilitate feature interaction and resolve the differing distributions across domains. Additionally, MI-CAT uses bandpass filtering (BF) and exponential moving standardization (EMS) for data preprocessing. Without data augmentation, MI-CAT achieved remarkable average recognition accuracies of 76.81% and 85.28% on the BCI IV-2a and IV-2b datasets, respectively, with Kappa values of 0.692 and 0.706. Compared with the CTNet model without data augmentation, MI-CAT exhibited a 1.50% higher average recognition accuracy on the BCI IV-2a dataset, while CTNet performed 1.88% better on the BCI IV-2b dataset, indicating that the recognition accuracies of CTNet and MI-CAT are comparable. However, MI-CAT has over 55,000 more trainable parameters than CTNet.

In summary, compared with state-of-the-art methods such as ShallowConvNet, DeepConvNet, EEGNet, TSF-STAN, Conformer, and MI-CAT, the CTNet model is relatively small yet achieves decoding accuracy comparable to TSF-STAN on both the BCI IV-2a and IV-2b datasets. Specifically, CTNet's accuracy exceeds that of the other state-of-the-art methods by 4.74% to 6.83% on the BCI IV-2a dataset and by 0.78% to 3.36% on the BCI IV-2b dataset. Notably, while TSF-STAN relies on a complex TSF data preprocessing method, CTNet employs simple standardization, greatly simplifying the preprocessing pipeline. TSF-STAN's TSF preprocessing requires substantial computation, whereas CTNet's straightforward standardization reduces computational complexity and resource demands. Achieving this simplification without sacrificing accuracy highlights the practical advantages of our approach: in practical applications, reduced computational complexity can lead to shorter processing times, lower power consumption, and easier deployment, particularly in resource-constrained environments. Therefore, achieving accuracy comparable to the state-of-the-art TSF-STAN, combined with a simpler preprocessing pipeline, underscores the practical significance of our method.
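As a concrete contrast with TSF-style preprocessing, the snippet below sketches the kind of simple standardization referred to above, implemented here as a per-trial, per-channel z-score in NumPy; the normalization axis and the trial dimensions are assumptions for illustration, not a restatement of CTNet's exact pipeline.

```python
import numpy as np

def standardize_trials(x, eps=1e-8):
    """Z-score each channel of each trial.

    x: array of shape (n_trials, n_channels, n_samples).
    """
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

# Placeholder EEG batch: 288 trials, 22 channels, 1000 samples.
trials = np.random.randn(288, 22, 1000)
normalized = standardize_trials(trials)
print(normalized.mean(), normalized.std())  # approximately 0 and 1
```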
Discussion on cross-subject classification

For the comparative analysis of cross-subject MI-EEG decoding, we used the LOSO cross-validation method under consistent experimental conditions to compare CTNet with ShallowConvNet, DeepConvNet, EEGNet, and Conformer. On the BCI IV-2a dataset, the recognition accuracies of Conformer, ShallowConvNet, EEGNet, CTNet, and DeepConvNet increased in that order; DeepConvNet, the best-performing model, achieved a recognition accuracy 1.51% higher than the second-best, CTNet. On the BCI IV-2b dataset, accuracies increased in the order Conformer, ShallowConvNet, EEGNet, DeepConvNet, and CTNet, with the top-performing CTNet surpassing the second-best, DeepConvNet, by 1.09%. These results demonstrate CTNet's effectiveness in addressing cross-subject MI-EEG decoding challenges.

Additionally, Chowdhury et al. introduced EEGNet Fusion V2 [31], which integrates five distinct EEGNet branches with varying hyperparameters and achieved average recognition accuracies of 74.3% and 84.1% for cross-subject MI-EEG decoding on the BCI IV-2a and IV-2b datasets, respectively. The fusion strategy of EEGNet Fusion V2 potentially offers richer feature representations and decision boundaries, mitigating the risk of overfitting or underfitting through an ensemble learning approach. Unlike CTNet, which employs the LOSO cross-validation strategy widely used in BCI research, EEGNet Fusion V2 uses a session-based division strategy: one session for training and another for testing on the BCI IV-2a dataset, and three sessions for training with the remaining two for testing on the BCI IV-2b dataset. This strategy likely reduces the differences between training and testing data, as data from the same subject across different sessions tend to be more similar, and the session-based approach may better adapt to individual subject characteristics, thereby increasing accuracy during testing. Consequently, EEGNet Fusion V2 reports significantly higher accuracy than CTNet, drawing attention to the merits of using multiple branches with distinct hyperparameters for broader feature extraction from EEG data, given the considerable variability among subjects.

For cross-subject MI-EEG decoding, Keutayeva and Abibullaev proposed the HTCV model [55] and later the st-CViT model [56]. Both models employ the S&R data augmentation strategy and LOSO cross-validation. HTCV was tested on the BCI IV-2a and BCI IV-2b datasets for decoding left- and right-hand movements, while st-CViT was evaluated on the BCI IV-2a, BCI IV-2b, Weibo, and Physionet datasets. Reference [56] also tests the performance of st-CViT using a nested LOSO strategy, which enhances the reliability of the model evaluation.

From an architectural perspective, CTNet, HTCV, and st-CViT are all hybrid models combining a CNN with a Transformer. For the same classification task on the BCI IV-2b dataset, with the same data augmentation strategy and LOSO cross-validation evaluation method, the decoding accuracies are compared in Table 8.
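Since every cross-subject comparison in this subsection relies on LOSO evaluation, the sketch below shows one way such a split can be implemented, assuming scikit-learn's LeaveOneGroupOut; the arrays, trial counts, and channel dimensions are small placeholders rather than the actual datasets.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder data: 9 subjects (as in BCI IV-2a/2b), 20 trials each.
n_subjects, trials_per_subject = 9, 20
X = np.random.randn(n_subjects * trials_per_subject, 22, 1000)
y = np.random.randint(0, 4, size=n_subjects * trials_per_subject)
subject_ids = np.repeat(np.arange(n_subjects), trials_per_subject)

logo = LeaveOneGroupOut()
for fold, (train_idx, test_idx) in enumerate(logo.split(X, y, groups=subject_ids)):
    X_train, y_train = X[train_idx], y[train_idx]   # 8 subjects for training
    X_test, y_test = X[test_idx], y[test_idx]       # 1 held-out subject for testing
    # model = build_model(); model.fit(X_train, y_train); evaluate on X_test ...
    print(f"fold {fold}: held-out subject {subject_ids[test_idx][0]}")
```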
As shown in Table 8, CTNet outperforms HTCV and st-CViT in average classification accuracy for binary classification on the BCI IV-2b dataset by 5.27% and 1.54%, respectively, with the smallest standard deviation of 5.26%. This superior performance can be attributed to CTNet's EEGNet-like CNN structure, which offers stronger local spatiotemporal feature extraction. Furthermore, CTNet integrates the features extracted by the CNN with those encoded by the Transformer encoder through residual connections, thus leveraging both the local feature extraction capability of CNNs and the global feature encoding capacity of Transformers. Despite CTNet's higher overall accuracy, it must be acknowledged that HTCV outperforms CTNet for subjects B04, B08, and B09, and st-CViT achieves higher decoding accuracy for subjects B01, B04, B06, B08, and B09. This discrepancy may stem from the individual variability of EEG signals across subjects, which can call for different feature extraction and classification strategies that HTCV and st-CViT handle better in these specific cases.
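The residual integration described above can be illustrated with the following minimal PyTorch sketch; the abbreviated CNN stem, feature dimensions, and classifier are placeholders, while the two-head, depth-6 encoder settings echo the hyperparameter discussion later in this section, so this is an assumption-laden illustration rather than CTNet's published implementation.

```python
import torch
import torch.nn as nn

class HybridSketch(nn.Module):
    """Toy CNN + Transformer hybrid with a residual connection (illustrative sizes only)."""

    def __init__(self, d_model=32, n_heads=2, depth=6, n_classes=4):
        super().__init__()
        # Abbreviated CNN stem: collapses the electrode axis and pools time,
        # producing a (batch, d_model, n_tokens) feature map.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=(1, 64), padding=(0, 32), bias=False),
            nn.BatchNorm2d(d_model),
            nn.ELU(),
            nn.Conv2d(d_model, d_model, kernel_size=(22, 1), bias=False),
            nn.BatchNorm2d(d_model),
            nn.ELU(),
            nn.AvgPool2d((1, 32)),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=64, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classifier = nn.LazyLinear(n_classes)

    def forward(self, x):                       # x: (batch, 1, electrodes, samples)
        feats = self.cnn(x).squeeze(2)          # (batch, d_model, n_tokens)
        tokens = feats.permute(0, 2, 1)         # (batch, n_tokens, d_model)
        encoded = self.encoder(tokens)          # global modelling via self-attention
        fused = tokens + encoded                # residual fusion of CNN and Transformer features
        return self.classifier(fused.flatten(1))

model = HybridSketch()
logits = model(torch.randn(4, 1, 22, 1000))     # 4 trials, 22 electrodes, 1000 samples
print(logits.shape)                             # torch.Size([4, 4])
```

The key point of the sketch is the `tokens + encoded` line: the CNN features and the Transformer-encoded features are summed rather than the encoder output being used alone.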
Table 8 Comparison of decoding accuracy between CTNet, HTCV, and st-CViT models on the BCI IV-2b dataset.

Discussion on ablation study

Ablation studies demonstrate that, for subject-specific MI-EEG decoding, the S&R data augmentation method contributes to improved accuracy on both datasets, with a significant impact observed on the BCI IV-2a dataset. Furthermore, removing the Transformer module in the presence of data augmentation resulted in a notable decrease in recognition accuracy of 1.77% (p < 0.05) on the BCI IV-2a dataset and 1.79% (p < 0.05) on the BCI IV-2b dataset. This suggests that the Transformer module plays a notable role in harnessing the enriched data environment provided by augmentation. The Transformer's self-attention mechanism can process the entire input feature sequence simultaneously, allowing it to capture global dependencies that are vital for understanding complex EEG patterns; this yields a more nuanced understanding of EEG signals, which is particularly beneficial for tasks like MI where temporal dynamics are crucial. Such global processing is especially valuable for datasets enriched through augmentation, as it helps the model generalize across varied yet synthetically expanded data.

The effect size analysis provides further insight into the contributions of the Transformer module and data augmentation. For the BCI IV-2a and IV-2b datasets, when data augmentation was applied, adding the Transformer module resulted in effect sizes (Hedges' g) of 0.179 and 0.184, respectively. This suggests a small but positive impact of the Transformer module when data augmentation is used, highlighting its ability to enhance model performance by capturing global dependencies in the data. Conversely, when the model did not use data augmentation, adding the Transformer module resulted in effect sizes of -0.063 and -0.002 for the BCI IV-2a and BCI IV-2b datasets, respectively. These negative or near-zero effect sizes indicate that the Transformer module alone does not improve, and may even slightly detract from, model performance without data augmentation. These findings also indicate that combining a CNN with a Transformer in the absence of data augmentation can decrease recognition accuracy. This decline is likely due to the Transformer module increasing the model's trainable parameters more than fourfold, thereby exacerbating overfitting. Transformers have a large number of parameters and layers, which is advantageous for capturing complex patterns in extensive datasets but can lead to overfitting when training data are scarce: the model may start memorizing noise and specific details of the training set instead of generalizing from it. With insufficient data, the Transformer's advanced mechanisms, such as multi-head attention, are not fully leveraged, leaving a model that is overly complex for the available data volume and consequently underperforms. These findings align with the results of Keutayeva and Abibullaev [55,56].
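For reference, Hedges' g is Cohen's d with a small-sample bias correction; the NumPy sketch below shows the computation using the common approximate correction factor (the per-subject accuracy arrays are placeholders, not reported results).

```python
import numpy as np

def hedges_g(a, b):
    """Hedges' g between two groups of per-subject accuracies."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    # Pooled standard deviation (unbiased variances).
    s_pooled = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1))
                       / (n1 + n2 - 2))
    d = (a.mean() - b.mean()) / s_pooled              # Cohen's d
    correction = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)  # small-sample correction
    return d * correction

# Placeholder per-subject accuracies for two model variants (9 subjects each).
with_transformer = np.array([0.80, 0.74, 0.88, 0.71, 0.69, 0.65, 0.83, 0.82, 0.78])
without_transformer = np.array([0.78, 0.73, 0.86, 0.70, 0.68, 0.64, 0.81, 0.80, 0.77])
print(hedges_g(with_transformer, without_transformer))
```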
When the Transformer module was removed, applying data augmentation yielded effect sizes of 0.441 on the BCI IV-2a dataset and -0.049 on the BCI IV-2b dataset. This shows that data augmentation alone can have a substantial positive effect on the BCI IV-2a dataset but may have a slightly negative impact on the BCI IV-2b dataset without the Transformer. When the Transformer module was used, adding data augmentation resulted in effect sizes of 0.758 and 0.143, respectively. These results underscore the importance of the Transformer module in improving the model's performance: its ability to capture global dependencies in the data, especially when combined with data augmentation, significantly boosts the model's effectiveness, particularly on datasets where data augmentation alone may not be sufficient.

When the model uses both the Transformer and data augmentation, the effect sizes for the BCI IV-2a and IV-2b datasets are 0.595 and 0.139, respectively. These results indicate that the combined use of the Transformer and data augmentation substantially enhances model performance on the BCI IV-2a dataset, while the effect is more modest on the BCI IV-2b dataset.

Overall, these results underscore the importance of the Transformer module in enhancing model performance. While data augmentation provides substantial benefits, the Transformer's advanced mechanisms, such as self-attention, are essential for fully leveraging the enriched data environment and capturing complex temporal dependencies in EEG signals. The Transformer's effectiveness is particularly pronounced when combined with data augmentation, as it significantly boosts the model's ability to generalize from enriched data.

Discussion on the impact of hyperparameters on model performance

We investigated three critical hyperparameters of the CTNet model: the token size, the number of heads in the MHA, and the depth of the Transformer module; CTNet is sensitive to the settings of these parameters. Our findings suggest that smaller tokens effectively reduce local noise, which facilitates the learning of global features. When decoding EEG signals, capturing the spatial distribution of brainwaves is crucial, and each "head" in a Transformer can be viewed as an independent feature detector focusing on different dimensions of information. CTNet performs best with a two-head attention mechanism. Two heads represent an optimal balance, sufficient to capture the essential spectral characteristics of the \(\mu\) rhythm (8–13 Hz) and \(\beta\) rhythm (13–30 Hz), whereas a larger number of heads could exceed what the complexity of MI-EEG signals requires, potentially reducing overall efficiency and effectiveness. Furthermore, CTNet achieves optimal recognition performance with a Transformer encoder depth of 6, mirroring findings from the Conformer studies. Feature visualization further confirmed that the Transformer encoder facilitates learning more discriminative features than those extracted without it.
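To make the head-count trade-off concrete, the snippet below shows how multi-head attention partitions the embedding across heads, assuming PyTorch's nn.MultiheadAttention; the embedding size and token count are illustrative, not CTNet's published values. Each additional head further reduces the subspace available to every individual head, which is one way an excessive head count can dilute the representation.

```python
import torch
import torch.nn as nn

d_model, n_tokens, batch = 32, 31, 4            # illustrative sizes
tokens = torch.randn(batch, n_tokens, d_model)  # CNN feature map viewed as a token sequence

for n_heads in (1, 2, 4, 8):
    mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
    out, attn = mha(tokens, tokens, tokens)     # self-attention over the token sequence
    # Each head operates in a subspace of size d_model / n_heads.
    print(f"{n_heads} head(s): per-head dimension = {d_model // n_heads}, "
          f"output shape = {tuple(out.shape)}")
```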
Limitations and future work

Although CTNet has demonstrated superior performance in both subject-specific and cross-subject MI-EEG decoding across two datasets, outperforming several advanced methods in recognition accuracy, it still faces certain limitations. Firstly, there is significant room for improvement in CTNet's recognition accuracy, especially in cross-subject MI-EEG decoding tasks. Secondly, CTNet appears sensitive to specific hyperparameters such as the token size, the number of heads in the MHA, and the depth of the Transformer module; this sensitivity may necessitate extensive hyperparameter tuning to achieve optimal performance, which can be time-consuming and computationally demanding. Additionally, the S&R data augmentation strategy does not significantly improve the recognition accuracy of subject-specific MI-EEG decoding on the BCI IV-2b dataset. Moving forward, we plan to explore regularization strategies specifically aimed at addressing cross-subject variability, which may enhance the model's recognition performance in cross-subject MI-EEG decoding. To address hyperparameter sensitivity, we will attempt to automate the search for optimal hyperparameter combinations using reinforcement learning-based methods. Furthermore, we intend to explore the use of generative adversarial networks (GANs) to augment the training dataset.
