A hybrid deep learning network for automatic diagnosis of cardiac arrhythmia based on 12-lead ECG

In summary, the workflow of this study consists of three steps: preprocessing, feature extraction and classification, and results evaluation and analysis. An overview of the process is shown in Fig. 1.

Fig. 1 Overview of this study. The ECG signal classification process comprises preprocessing, feature extraction and classification, and results evaluation and analysis.

Preprocessing

Datasets

The MIT-BIH Arrhythmia Database was chosen because it is widely used in research on cardiac arrhythmia classification. It contains many types of electrocardiogram (ECG) signals and a substantial amount of data. The beats in each record were annotated independently by cardiology experts, and the beats considered here fall into five main types: normal beats (N), atrial premature beats (A), premature ventricular contractions (V), left bundle branch block beats (L), and right bundle branch block beats (R), as shown in Supplement Table S1. The database comprises 48 ECG records, each lasting 30 min, sampled at 360 Hz.

In addition, the PTB Diagnostic ECG Database was used to verify the generalization ability of the proposed CBGM model. This database contains 549 records from 290 subjects. Each record includes 15 simultaneously measured signals: the conventional 12 leads and 3 Frank leads. Each signal is digitized at 1000 samples per second with 16-bit resolution over a range of ± 16.384 mV.

Denoising

ECG signals are weak, low-amplitude, low-frequency, and stochastic, which makes them susceptible to many forms of noise interference. In this work, a combination of wavelet transform and Kalman filtering is used to preprocess the raw ECG signals. First, the original ECG signal is decomposed into multiscale wavelet coefficients. Next, these coefficients are dynamically estimated and filtered with a Kalman filter.
Finally, the filtered ECG signal is reconstructed through wavelet reconstruction. The detailed experimental steps are described in the Online Supplement.

R-peak detection

Detection of the QRS complex is a necessary prerequisite for segmenting and classifying the denoised ECG signal19. When the wavelet transform decomposes the ECG signal into sub-signals, the energy of the QRS complex becomes concentrated in the high-frequency sub-bands. We therefore locate QRS complexes by identifying regions of concentrated energy in these high-frequency sub-signals and applying a threshold to detect and label them. We set the threshold at 60% of the signal amplitude, which is intended to pick out R waves that rise well above the average signal amplitude. Because R-wave amplitudes are typically much higher than the average amplitude of an ECG signal, a threshold at this level detects most R waves. The preprocessing result is shown in Supplement Figure S1.

Feature extraction and classification

The proposed CBGM model extracts spatial features from the ECG signals with convolutional layers, uses BiGRU layers to maintain a dynamic memory of the extracted features, and finally applies a multi-head attention mechanism to enhance the extraction of crucial ECG features. Each layer is described below.

CNN

A CNN is a deep learning model inspired by the functioning of biological neural systems such as the human brain20. CNNs progressively learn the spatial hierarchy of the data, from low-level to high-level patterns.
Typically, a CNN comprises three types of layers: convolutional layers, pooling layers, and fully connected layers. The convolutional and pooling layers perform feature extraction and dimensionality reduction, respectively, while the fully connected layers map the extracted features to the final output21. Each layer feeds its output to the next, so the representations become increasingly abstract as information travels through the network. During training, the network parameters are adjusted to minimize the difference between the predictions and the ground-truth labels, using backpropagation to compute gradients and an optimization algorithm such as gradient descent to update the weights. The core operation of a convolutional layer is convolution, expressed as follows:$$y(t)=(x*w)(t)=\int_{-\infty}^{\infty} x(\tau)\,w(t-\tau)\,d\tau$$
(1)
Here, x represents the input, w represents the kernel (filter), and the result y is commonly referred to as the feature map. The convolutional layer is a crucial component of CNNs, generating feature maps by convolving its input with these filters.

BiGRU

BiGRU is a bidirectional recurrent neural network composed of two parts: a forward GRU and a backward GRU. It processes the input sequence in both directions and concatenates the outputs of the two directions to obtain a more comprehensive representation. Whereas a unidirectional GRU has access only to past inputs at each time step, the bidirectional structure of BiGRU lets the network consider past and future context simultaneously, improving its ability to capture context and patterns in time-series data such as electrocardiogram signals. The detailed BiGRU equations are given in the Online Supplement.

Multi-head attention mechanism

The attention mechanism computes a probability distribution of attention weights that highlights the influence of key inputs on the output, capturing the crucial information within a sequence and enabling more accurate judgments. The multi-head mechanism maps the query (Q), key (K), and value (V) into different subspaces, in each of which self-attention is computed independently22. The outputs of all subspaces are then concatenated, enabling the model to capture more contextual information within the sequence and enhancing its feature representation capacity. Because the heads behave somewhat like an ensemble, this can also help reduce overfitting. In the equations below, each attention function (head) is responsible for one subspace of the final output sequence.
Each subspace operates independently; the outputs of the subspaces are concatenated and fed into a fully connected layer to obtain the final feature output. The attention mechanism within each head is defined as follows:$$\mathrm{head}_i=\mathrm{Attention}(QW_i^Q,KW_i^K,VW_i^V)=\mathrm{Softmax}\left(\frac{QW_i^Q(KW_i^K)^T}{\sqrt{d_k}}\right)VW_i^V$$
(2)
where \(W_i^Q\), \(W_i^K\), and \(W_i^V\) are learnable projection matrices for head i.$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\mathrm{head}_2,\ldots,\mathrm{head}_h)W_o$$
(3)
where Q, K, and V are matrices derived from the input sequence (in the CBGM model, the output features of the BiGRU layer), \(W_o\) is a learnable output projection matrix, h is the number of subspaces (heads), and \(d_k\) is the dimension of the key vectors in each head. For large \(d_k\), the dot products in the numerator tend to grow in magnitude, which pushes the Softmax function into regions with extremely small gradients. The scaling factor \(1/\sqrt{d_k}\) counteracts this effect.

Proposed CBGM model

In this work, we propose a CNN-BiGRU model with multi-head attention (CBGM model) that extracts spatial features from ECG signals using convolutional layers, uses BiGRU layers to maintain a dynamic memory of the extracted features, and finally applies a multi-head attention mechanism to enhance the extraction of crucial ECG features, as shown in Fig. 2. The CBGM model takes ECG pulses as input and classifies them into five classes: N, A, V, L, and R. Specifically, we first detected the R peaks in all signals and then defined each pulse as the segment from 150 ms before to 150 ms after the R peak. In terms of architecture, the input layer is followed by 13 hidden layers and an output layer. The input layer receives one-dimensional ECG signals with a length of 300 samples. The signal first passes through a CNN section consisting of three sequential triplets, each comprising a convolutional layer, a batch normalization layer, and a max-pooling layer; these triplets extract features while the pooling layers progressively reduce the dimensionality as the input moves deeper into the network. The output is then fed into the BiGRU layer, which recognizes and memorizes long-term dependencies in the data.

Fig. 2 Schematic diagram of the proposed CBGM model, from multilayered convolution with CNN and BiGRU layers to multi-head-attention-based feature extraction and classification.

The BiGRU module comprises a BiGRU layer, a fully connected layer, and a dropout layer, and the output layer predicts one of the five classes for each input ECG pulse. The BiGRU network captures long-term dependencies in sequential data, which is crucial for time-series data such as ECG signals. We apply a multi-head attention mechanism to the output of the BiGRU layer. This mechanism dynamically adjusts the weight of each time step according to its importance, allowing the model to focus on the moments most relevant to the classification task. The attention mechanism enables the model to concentrate on the ECG features related to cardiac arrhythmias, making our model more robust than traditional models. After feature extraction, the features are flattened and fed into a fully connected layer, followed by a softmax layer that produces the final prediction. The detailed parameter settings of the proposed model are given in Supplement Table S2.

During hyperparameter tuning, we focused on the learning rate, the batch size, and the number of training epochs. We started with a learning rate of 0.001 and adjusted it based on observed performance. A batch size of 128 was chosen to balance processing time and model stability. The maximum number of training epochs was set to 100, with early stopping based on validation performance to prevent overfitting. Initial training with default hyperparameters provided a baseline, after which we used grid search to test various learning rates (0.01, 0.001, 0.0001), batch sizes (32, 64, 128), and model architectures, with cross-validation to ensure robust evaluation. The optimal configuration was a learning rate of 0.001 and a batch size of 128.
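For readers who want a concrete starting point, the PyTorch sketch below wires the described stages together: three convolution, batch-normalization, and max-pooling triplets, a BiGRU layer, multi-head self-attention over the BiGRU outputs, then flattening and fully connected layers that produce five class scores. It is a minimal sketch under stated assumptions, not the authors' implementation. The class name `CBGMSketch`, the channel counts, kernel size, ReLU activations, GRU hidden size, number of heads, dropout rate, the position of the dropout layer, and the Adam optimizer are all hypothetical choices, since the actual settings are in Supplement Table S2. Only the 300-sample input, the five output classes, and the 0.001 learning rate come from the text.

```python
import torch
import torch.nn as nn


class CBGMSketch(nn.Module):
    """Hypothetical CNN-BiGRU-multi-head-attention classifier for 300-sample ECG pulses.

    All layer sizes below are illustrative assumptions, not the paper's settings.
    """

    def __init__(self, n_classes: int = 5, seq_len: int = 300,
                 channels=(32, 64, 128), kernel_size: int = 5,
                 gru_hidden: int = 64, n_heads: int = 4, dropout: float = 0.3):
        super().__init__()
        layers, in_ch = [], 1
        # Three conv -> batch norm -> (assumed ReLU) -> max-pool triplets;
        # 'same' padding keeps the length, each pool halves it.
        for out_ch in channels:
            layers += [
                nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
                nn.BatchNorm1d(out_ch),
                nn.ReLU(),
                nn.MaxPool1d(2),
            ]
            in_ch = out_ch
        self.cnn = nn.Sequential(*layers)
        # Bidirectional GRU over the pooled feature sequence.
        self.bigru = nn.GRU(in_ch, gru_hidden, batch_first=True, bidirectional=True)
        embed_dim = 2 * gru_hidden  # forward and backward states are concatenated
        # Multi-head self-attention on the BiGRU outputs (Q = K = V).
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        reduced_len = seq_len // 2 ** len(channels)  # 300 -> 150 -> 75 -> 37
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(reduced_len * embed_dim, 128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, n_classes),  # logits; softmax is applied at inference
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, seq_len) single-lead ECG pulses
        feats = self.cnn(x)                      # (batch, channels, reduced_len)
        feats = feats.transpose(1, 2)            # (batch, reduced_len, channels)
        seq, _ = self.bigru(feats)               # (batch, reduced_len, 2 * gru_hidden)
        attended, _ = self.attn(seq, seq, seq)   # same shape as seq
        return self.head(attended)               # (batch, n_classes)


if __name__ == "__main__":
    model = CBGMSketch()
    pulses = torch.randn(8, 1, 300)              # random stand-ins for 8 ECG pulses
    labels = torch.randint(0, 5, (8,))           # random stand-in labels (N, A, V, L, R)
    logits = model(pulses)
    probs = torch.softmax(logits, dim=1)         # per-class probabilities
    loss = nn.CrossEntropyLoss()(logits, labels)
    # Optimizer choice is an assumption; the learning rate follows the text.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(logits.shape)                          # torch.Size([8, 5])
```

The model returns logits rather than probabilities because `nn.CrossEntropyLoss` applies log-softmax internally; the softmax layer described in the text corresponds to the `torch.softmax` call used at inference time.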
In the proposed CBGM model, the combination of convolutional layers, pooling layers, a BiGRU layer, and a multi-head attention mechanism significantly improved the capture of time-series structure, enhancing performance in processing ECG signals. All experiments were conducted in a Python 3.9 environment using PyTorch on a desktop equipped with an Intel Core i5-9600K 3.70 GHz CPU, 16 GB RAM, and an 8 GB NVIDIA GeForce RTX 2070 GPU.

Performance evaluation method

The performance of the proposed model was evaluated using precision, specificity, F1 score, sensitivity, and accuracy, defined as follows:$$precision=\frac{TP}{TP+FP}$$
(4)
$$Accuracy=\frac{{TP+TN}}{{TP+TN+FP+FN}}$$
(5)
$$specificity=\frac{{TN}}{{TN+FP}}$$
(6)
$$F1\text{-}score=2 \times \frac{recall \times precision}{recall+precision}$$
(7)
where FP, TP, FN, and TN represent "False Positive," "True Positive," "False Negative," and "True Negative," respectively. Sensitivity, also called recall and used in the F1 score, is defined as \(TP/(TP+FN)\).
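Eqs. (4)-(7) are stated for a single positive class. For the five-class problem, one common approach, assumed here because the text does not say how the counts were aggregated, is to compute TP, FP, FN, and TN for each class in a one-vs-rest fashion from the confusion matrix. The plain-Python sketch below does this; the function name `per_class_metrics` and the toy counts are hypothetical and are not results from this study.

```python
def per_class_metrics(confusion, labels):
    """Return one-vs-rest precision, sensitivity, specificity, F1, and accuracy per class.

    confusion[i][j] counts beats whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in confusion)
    results = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # true class k, predicted another class
        fp = sum(row[k] for row in confusion) - tp   # predicted class k, true class differs
        tn = total - tp - fn - fp                    # everything else
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0  # sensitivity
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * recall * precision / (recall + precision)
              if recall + precision else 0.0)
        accuracy = (tp + tn) / total
        results[label] = {"precision": precision, "sensitivity": recall,
                          "specificity": specificity, "f1": f1,
                          "accuracy": accuracy}
    return results


if __name__ == "__main__":
    labels = ["N", "A", "V", "L", "R"]
    # Illustrative counts only (rows: true class, columns: predicted class).
    cm = [
        [50, 2, 1, 0, 0],
        [3, 20, 0, 0, 0],
        [1, 0, 30, 0, 0],
        [0, 0, 0, 25, 0],
        [0, 0, 0, 0, 18],
    ]
    for label, m in per_class_metrics(cm, labels).items():
        print(label, {name: round(value, 3) for name, value in m.items()})
```

With this reduction, the per-class accuracy of Eq. (5) differs from the overall multi-class accuracy, which is the sum of the confusion-matrix diagonal divided by the total number of beats.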
