A multi-task deep learning approach for real-time view classification and quality assessment of echocardiographic images

An overview of the study is provided in Fig. 1. A multi-task deep learning model was trained on a dataset of 107,311 echocardiographic images to automatically generate view categories and quality scores for clinical quality control workflows. This study conformed to the principles outlined in the Declaration of Helsinki and was approved by the Ethics Board of our institution (No. 2023–407).

Fig. 1 Overview of the study design. (a) Seven types of echocardiographic standard views were collected, including apical 4-chamber (A4C), parasternal view of the pulmonary artery (PSPA), parasternal long axis (PLAX), parasternal short axis at the mitral valve level (PSAX-MV), parasternal short axis at the papillary muscle level (PSAX-PM), parasternal short axis at the apical level (PSAX-AP), and other views (Others). (b) Four quality attributes were summarized, including overall contour, key anatomical structural details, standard view display, and image display parameter adjustments. (c) Model development workflow covering data collection, data labeling, data preprocessing, and model training. (d) Two clinical application workflows: the left side shows real-time quality control during image acquisition, and the right side shows pre-stored image screening prior to AI-assisted diagnosis. Artwork attribution in (c) and (d): www.flaticon.com.

Data

This is a retrospective study. Echocardiographic studies were randomly extracted from the picture archiving and communication system (PACS) of the Sichuan Provincial People's Hospital between 2015 and 2022 to establish the experimental dataset; all subjects were aged 18 and above. Images showing severe cardiac malformations that prevented recognition of anatomical structures were excluded. The dataset consists of 107,311 echocardiographic images and includes six standard views commonly used in clinical practice: the A4C view, parasternal view of the pulmonary artery (PSPA), parasternal long axis (PLAX), parasternal short axis at the mitral valve level (PSAX-MV), parasternal short axis at the papillary muscle level (PSAX-PM), and parasternal short axis at the apical level (PSAX-AP). All remaining views were grouped into an "Others" category. For standard views with unevenly distributed quality levels, we performed undersampling to balance the data distribution. All images were acquired using ultrasound machines from different manufacturers, including Philips, GE, Siemens, and Mindray. The dataset was randomly divided into training (70%), validation (10%), and test (20%) sets through stratified sampling (Table 1). The distribution of quality scores for the three subsets is provided in Supplementary Figure S1 online.
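As a minimal sketch of the stratified 70%/10%/20% split described above, the following illustrates one way it could be implemented with scikit-learn; the DataFrame `df` and its `view` column are hypothetical names, not artifacts of this study.

```python
# Minimal sketch of a stratified 70/10/20 train/validation/test split.
# Assumes a pandas DataFrame `df` with one row per image and a `view`
# column holding the seven view labels; these names are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(df: pd.DataFrame, seed: int = 42):
    # First carve out the 20% test set, stratified by view label.
    trainval, test = train_test_split(
        df, test_size=0.20, stratify=df["view"], random_state=seed
    )
    # Then split the remaining 80% into 70% train / 10% validation
    # of the full dataset (i.e., 1/8 of the remaining images).
    train, val = train_test_split(
        trainval, test_size=0.125, stratify=trainval["view"], random_state=seed
    )
    return train, val, test
```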
Table 1 Distribution of the experimental dataset.

Quality scoring method

We established percentage-based scoring criteria for the different standard views based on four attributes: overall contour, key anatomical structural details, standard view display (see Supplementary Fig. S2 online for an example), and image display parameter adjustments. The four attributes contributed to the total score in a ratio of 3:4:2:1. Table 2 presents the scoring criteria for the PLAX view. Two accredited echocardiographers, each with at least five years of experience, independently annotated all images in the dataset, and the average of their annotations was used as the final expert score label. A third cardiology expert with over ten years of experience reviewed images whose two scores differed by more than 10 points. The "Others" view was assigned a score of zero for training purposes.
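To make the 3:4:2:1 weighting and the two-annotator averaging concrete, the following is a hedged sketch; the attribute sub-scores and helper names are illustrative, not the authors' exact rubric from Table 2.

```python
# Illustrative sketch of the percentage scoring scheme described above.
# The four attribute weights follow the stated 3:4:2:1 ratio, i.e.
# 30, 40, 20 and 10 points out of 100; sub-score values are hypothetical.
WEIGHTS = {
    "overall_contour": 30,
    "key_structural_details": 40,
    "standard_view_display": 20,
    "display_parameter_adjustment": 10,
}

def image_score(attribute_fractions: dict) -> float:
    """attribute_fractions maps each attribute to a value in [0, 1]
    expressing how well the image satisfies that attribute."""
    return sum(WEIGHTS[k] * attribute_fractions[k] for k in WEIGHTS)

def expert_label(score_a: float, score_b: float, review_threshold: float = 10.0):
    """Average the two annotators' scores; flag the image for third-expert
    review when the scores differ by more than the stated 10-point threshold."""
    needs_review = abs(score_a - score_b) > review_threshold
    return (score_a + score_b) / 2.0, needs_review
```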
Table 2 PLAX view scoring definition.

Model development

The model architecture is shown in Fig. 2 and consists mainly of a backbone network, a neck network, and two branch modules for view classification and quality assessment. The backbone network learns and extracts multi-scale image features. We take the output feature maps \(\left\{{S}_{2},{S}_{3},{S}_{4},{S}_{5}\right\}\) (with spatial sizes of 1/4, 1/8, 1/16, and 1/32 of the original resolution, respectively) from the last four stages of the backbone as the input to the neck network. To select the best backbone, we compared six deep CNN architectures, namely MobileNetV3 [32], DenseNet121 [33], VGG16 [34], EfficientNet [35], ResNet50 [36], and ConvNeXt [37], and chose VGG16.

Fig. 2 Proposed multi-task model architecture. The model consists of a backbone network, a neck network, a view classification branch, and a quality assessment branch. A single-frame image is input into the backbone network to extract features. The neck network then enhances and fuses these multi-scale features. The highest-level feature from the neck network is fed into the view classification branch to produce a view class, while the fused multi-scale feature is input into the quality assessment branch to generate a quality score.

The neck network serves as an intermediate feature layer that further processes and fuses the features extracted by the backbone for the two subsequent tasks. The highest-level feature, \({S}_{5}\), is a discriminative high-level semantic feature that reflects the network's understanding of the overall image context and is therefore well suited to the classification task. Lv et al. [38] proposed that applying the self-attention mechanism to high-level features with richer semantic concepts can capture the connections between conceptual entities in an image. Therefore, to further enhance feature expressiveness, we feed \({S}_{5}\) into a Vision Transformer Block (VTB) [39], which combines a multi-head attention layer and a feedforward layer to enable intra-scale feature interaction. The resulting feature map, denoted \({S}_{5}{\prime}\), is used for view classification. A feature pyramid network (FPN) is then applied to fuse the features at the four scales \(\left\{{S}_{5}{\prime},{S}_{4},{S}_{3},{S}_{2}\right\}\) layer by layer from the top down, enabling cross-scale feature interaction. We denote the set of feature maps output by the FPN as \(\left\{{P}_{5},{P}_{4},{P}_{3},{P}_{2}\right\}\); each carries strong semantic information. Next, we fuse the feature maps from all scales using an Adaptive Feature Fusion Block (AFFB) to better model image quality perception. As shown in Fig. 3, the AFFB first upsamples the feature maps at the different scales to the size of \({P}_{2}\) and concatenates them. Channel attention is then computed with a Squeeze-and-Excitation Block [40] to adaptively adjust the importance of each channel feature. Finally, element-wise addition is performed across the features from each scale to generate the final fused feature map \(\text{F}\), which is used for the quality assessment task.

Fig. 3 Adaptive Feature Fusion Block. The block integrates channel attention mechanisms to adaptively fuse the feature outputs of the feature pyramid network at four scales, generating the final quality-aware features for quality assessment.
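A minimal PyTorch sketch of the AFFB consistent with this description is given below; the FPN channel width (256) and the SE reduction ratio (16) are assumptions rather than values reported in this work.

```python
# Hedged PyTorch sketch of the Adaptive Feature Fusion Block (AFFB):
# upsample the FPN outputs to the resolution of P2, concatenate them,
# apply Squeeze-and-Excitation channel attention, then sum the
# re-weighted per-scale features element-wise to obtain F.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)  # squeeze, then excite
        return x * w

class AFFB(nn.Module):
    """Adaptive Feature Fusion Block (sketch). Takes the FPN outputs
    {P5, P4, P3, P2}, each with `channels` channels, and returns the
    fused quality-aware feature map F."""
    def __init__(self, channels: int = 256, num_scales: int = 4):
        super().__init__()
        self.num_scales = num_scales
        self.se = SEBlock(channels * num_scales)

    def forward(self, feats):  # feats = [P5, P4, P3, P2]
        target_size = feats[-1].shape[-2:]              # spatial size of P2
        ups = [F.interpolate(f, size=target_size, mode="bilinear",
                             align_corners=False) for f in feats]
        cat = self.se(torch.cat(ups, dim=1))            # concat + channel attention
        # element-wise addition across the re-weighted per-scale features
        return sum(torch.chunk(cat, self.num_scales, dim=1))
```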
For the view classification branch (VCB), a linear classifier is used to generate the view classification results. In parallel, a projection head maps the features to a fixed dimension for computing the supervised contrastive loss [41]. The goal of supervised contrastive learning is to pull features of the same class closer together in the feature space while pushing features of different classes apart; by applying this loss, we aimed to mitigate the small inter-class differences among echocardiographic views. For the quality assessment branch (QAB), global average pooling is applied to the fused feature map F to obtain a K-dimensional feature vector, which is then fed to a multilayer perceptron (MLP) to regress the final image quality score.

Model training

We trained the model jointly using the cross-entropy loss (for the view classification task), the supervised contrastive loss, and the mean squared error loss (for the quality assessment task). To address the imbalance problem in multi-task training, an auto-tuning strategy [42] was applied to learn the relative loss weight of each task. The model was implemented in Python v3.8.12 using PyTorch v1.12.0 and trained on two NVIDIA GeForce RTX 3090 GPUs, each with 24 GB of memory. The initial learning rate was set to 1e-5 and the batch size to 128, using the Adam optimizer with a weight decay of 1e-5. Input images were resized to 224 × 224, and pixel values were normalized to the range 0 to 1. No data augmentation was performed, to avoid altering image quality. An early stopping strategy was used to terminate training and reduce overfitting. The model that performed best on the validation set was applied to the test set to evaluate performance.

Evaluation metrics

Five metrics, accuracy (ACC), precision (PRE), sensitivity (SEN), specificity (SPE), and F1 score (F1), were used to evaluate view classification performance, and a confusion matrix was constructed to analyze performance on the individual views. For quality assessment, Pearson's linear correlation coefficient (PLCC), Spearman's rank-order correlation coefficient (SROCC), mean absolute error (MAE), and root mean square error (RMSE) were used as evaluation indices. Indicators such as the number of model parameters and the inference time were also considered to comprehensively evaluate model performance. The Kruskal-Wallis test was employed to assess differences among independent groups, with p < 0.05 considered statistically significant; Dunn-Bonferroni tests were applied for multiple comparisons. Bootstrap analysis was used to calculate 95% confidence intervals. Statistical analyses were conducted using SPSS v27.0 or Python v3.8.12.
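To make the joint objective described under "Model training" concrete, the following is a hedged sketch of a multi-task loss with learnable task weights. The exact auto-tuning strategy of ref. 42 is not reproduced here; the sketch assumes the widely used homoscedastic-uncertainty weighting, with the supervised contrastive term computed externally from the projection-head embeddings.

```python
# Hedged sketch of a joint multi-task objective combining cross-entropy
# (view classification), supervised contrastive, and MSE (quality
# regression) losses. Task weights are auto-tuned via one learnable
# log-variance per task (an assumption about the weighting scheme).
import torch
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.mse = nn.MSELoss()
        # one learnable log-variance per task: classification, contrastive, regression
        self.log_vars = nn.Parameter(torch.zeros(3))

    def forward(self, view_logits, view_labels, supcon_loss, pred_score, gt_score):
        # supcon_loss is assumed to be precomputed from the projection-head
        # embeddings (e.g., with an external SupCon implementation).
        losses = torch.stack([
            self.ce(view_logits, view_labels),
            supcon_loss,
            self.mse(pred_score, gt_score),
        ])
        precisions = torch.exp(-self.log_vars)
        # weighted sum of task losses plus a regularizer on the learned weights
        return (precisions * losses + self.log_vars).sum()
```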
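The quality-assessment metrics listed above can be computed as in the following minimal sketch using NumPy and SciPy; the array names `pred` and `gt` are illustrative.

```python
# Sketch of the quality-assessment evaluation metrics named above.
# `pred` and `gt` are arrays of predicted and expert quality scores.
import numpy as np
from scipy import stats

def quality_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    plcc, _ = stats.pearsonr(pred, gt)         # Pearson linear correlation
    srocc, _ = stats.spearmanr(pred, gt)       # Spearman rank-order correlation
    mae = np.mean(np.abs(pred - gt))           # mean absolute error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))  # root mean square error
    return {"PLCC": plcc, "SROCC": srocc, "MAE": mae, "RMSE": rmse}
```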
