Deep-learning model accurately classifies multi-label lung ultrasound findings, enhancing diagnostic accuracy and inter-reader agreement

Previous studies have primarily focused on multi-class classification [11,12] and have applied LUS deep learning models to specific diseases such as sepsis, dengue, and COVID-19 [10,11,14,16]. In contrast, this study deployed a multi-label deep learning (DL) model to classify LUS findings into four categories (normal, B-lines, consolidation, and pleural effusion) and assessed its performance across multiple test sets. The model identified LUS findings with AUCs of 0.80–0.93 for normal, 0.78–0.87 for B-lines, 0.77–0.87 for consolidation, and 0.77–0.94 for pleural effusion, though performance decreased in the external test sets. In the observer performance test, both readers demonstrated improved accuracy in distinguishing abnormal from normal with the DL model’s assistance (reader 1: from 87.5% to 95.6%, p = 0.004; reader 2: from 95.0% to 97.5%, p = 0.19). Additionally, the DL model improved inter-reader agreement for distinguishing abnormal from normal (k from 0.73 to 0.83, p = 0.01) and for detecting B-lines (k from 0.70 to 0.76, p = 0.22). Our results demonstrate the feasibility of using a DL model in LUS for patients with respiratory symptoms and support its potential for overcoming the limitations due to variability between readers.

Although several DL models based on plain radiographs or CT scans have been developed for classifying respiratory disease, research on DL models applicable to LUS remains scarce. Ebadi et al. [14] formulated a DL model using LUS data from COVID-19 patients to classify LUS findings into normal, B-lines, and consolidation/effusion, achieving an accuracy of 0.90. Our study aligns with their work in categorizing LUS findings into distinct classes (Fig. 1). However, unlike the prior study, our model uses four classes and distinguishes pleural effusion from consolidation.

Fig. 1 Adjusted gradient-weighted class activation maps overlaid on the corresponding activation map generated by a convolutional neural network. (a) Single-labeled findings for normal (A-line), B-lines, consolidation, and effusion, and (b) multi-labeled findings.

In this study, the model exhibited strong performance in classifying LUS into normal, B-line, consolidation, and effusion, with AUCs ranging from 0.82 (consolidation) to 0.94 (effusion) and sensitivities ranging from 73.9% (consolidation) to 93.3% (normal) in the temporally separated test set. Nevertheless, its performance decreased slightly in the external test sets, with AUCs ranging from 0.78 (B-line) to 0.89 (normal) in the first external test set and from 0.77 (effusion) to 0.81 (B-line) in the second external test set. This disparity may be due to the use of different machines, the acquisition of LUS by various examiners, and differences in the distribution of LUS abnormalities. Indeed, pleural effusion was the most common finding in the training and temporally separated test sets, whereas normal and consolidation were the most common findings in the first and second external test sets, respectively.

LUS is widely used in diagnosing and monitoring respiratory diseases in ICU patients [1]. Dave et al. [15] conducted a prospective study to ascertain a DL model’s ability to differentiate normal and abnormal patterns on bedside LUS in critically ill patients, demonstrating a model accuracy of 95% and advocating for its utility in ICU settings. In this study, LUS data from critically ill patients admitted to the ICU constituted the second external test set (Supplementary Table 1). Similar to the preceding research, the model exhibited an AUC of 0.82 (95% CI 0.79–0.94) for binary discrimination, affirming its applicability in immobilized patients.
However, in terms of multi-label classification, the model’s performance was relatively inferior when compared to recent advancements in the domain. This suboptimal performance might stem from several factors, including LUS examinations conducted on uncooperative patients with limited mobility, poor sonic windows, and a limited number of included cases. Consequently, further testing with a larger dataset appears essential to enhance the DL model’s efficacy in multi-label classification tasks.

Despite the numerous advantages of ultrasound imaging, its operator dependency presents a significant limitation. Previous research on inter-reader agreement indicated moderate to substantial concordance (k = 0.36–0.74), with a noted decrease in agreement when positive abnormalities such as consolidation or B-lines were encountered [16]. Consistent with these findings, our study observed moderate to substantial agreement for B-lines (k = 0.698), effusion (k = 0.710), and consolidation (k = 0.568). Meanwhile, the integration of DL techniques has been shown to enhance inter-reader agreement across various medical tasks and anatomical regions [17,18,19]. Our study also demonstrated that the use of the DL model significantly improved agreement for binary discrimination (k = 0.825, p = 0.01) and increased agreement for B-lines (k = 0.756, p = 0.22). Recognizing B-lines in LUS is important because they manifest in various pulmonary conditions, including pulmonary edema, interstitial fibrosis, and pneumonia [20]. Nhat et al. [11] also demonstrated improved performance in interpreting LUS findings among non-expert clinicians, although they did not evaluate inter-reader agreement. However, our study indicated a lower level of agreement in the interpretation of consolidation and effusion. This observation may be attributed to the image-based rather than video-based analysis of LUS, since the observer performance test employed captured LUS images.
Given LUS’s established accuracy in diagnosing pleural effusion, performing LUS directly, rather than reviewing captured images, would likely distinguish effusion from consolidation with greater precision.

This study has several limitations. First, it has a retrospective design with a limited number of validated cases, and the captured LUS images were acquired with heterogeneous probes and machines. However, extensive efforts were made to ensure comprehensive validation of the model through multiple test sets and observer performance tests. Second, the DL model developed for this study classified normal, B-line, consolidation, and pleural effusion based on representative captured frames rather than continuous video sequences. Consequently, evaluating the model’s performance on video data remains a necessary avenue for exploration. Third, the reference standard for the training set was based on the interpretations of expert radiologists rather than the outcomes of CT scans. In clinical practice, LUS results are typically based on expert opinion without reliance on CT scans. Nevertheless, in this study, the observer performance test was conducted using selected cases that had undergone CT scans on the same day, and the reference standard was established based on CT findings. Fourth, the observer performance test was conducted with only two radiologists, so generalizing the findings to other clinicians may be difficult. Further research is needed to develop models applicable to a wider range of scenarios, particularly by implementing optimization strategies for generalization and regularization or by deploying classifiers that use pre-trained weights from clinical images, such as RadImageNet [21] or RadFormer [22]. Additionally, the retrospective nature of our study presents inherent limitations, including potential biases in data selection and the inability to establish causal relationships.
Therefore, further studies should focus on refining the model to handle a broader range of scenarios, such as video clip analysis, and on improving its performance in multi-label classification tasks using larger test datasets.

In conclusion, we developed a multi-label DL model that classifies lung ultrasound findings into normal, B-line, consolidation, and effusion, and that enhanced readers’ diagnostic accuracy and inter-reader agreement.
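
The inter-reader agreement values (k) discussed above can be illustrated with a minimal sketch of Cohen's kappa for two readers. This is not the study's analysis code: the function names, the toy reader labels, and the use of the Landis and Koch descriptive bands are assumptions made for illustration only.

```python
from collections import Counter


def cohen_kappa(reader_a, reader_b):
    """Cohen's kappa for two readers who labeled the same cases."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("Both readers must label the same non-empty set of cases.")
    n = len(reader_a)
    # Observed agreement: fraction of cases where both readers gave the same label.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Chance agreement: product of each reader's marginal label frequencies.
    count_a = Counter(reader_a)
    count_b = Counter(reader_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both readers used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def agreement_band(kappa):
    """Descriptive band for kappa (Landis and Koch convention, assumed here)."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"


# Hypothetical toy labels (1 = abnormal, 0 = normal); not data from the study.
toy_reader_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
toy_reader_2 = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
toy_kappa = cohen_kappa(toy_reader_1, toy_reader_2)
print(f"toy kappa = {toy_kappa:.2f}")

# Bands for the kappa values reported in the text.
for name, k in [("consolidation", 0.568), ("B-lines", 0.698), ("binary with DL", 0.825)]:
    print(name, k, agreement_band(k))
```

Under these assumed bands, the reported values for consolidation (k = 0.568) and B-lines (k = 0.698) fall into the moderate and substantial ranges, respectively, which matches the wording used in the discussion.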
