Focal liver lesion diagnosis with deep learning and multistage CT imaging

Patient characteristics

Reporting of the study adhered to the STARD guidelines. Between June 2012 and December 2022, multiphase contrast-enhanced CT images, including arterial phase (AP) and portal venous phase (PVP), were collected from a total of 4039 patients across six hospitals. The method used for retrospective data collection is depicted in Fig. 1a. In addition, clinical testing was conducted on two real-world clinical evaluation cohorts (Fig. 1b): West China Tianfu Center and Sanya People’s Hospital. At the Tianfu Center we examined 184 cases, while at Sanya People’s Hospital 235 cases were assessed. Patient cohorts comprised an internal training set, an internal test set, four external validation sets, and two real-world clinical datasets. Demographic details, including age and sex distributions, varied across cohorts: for instance, the training set showed a female-to-male ratio of 155:548 in HCC cases with an average age of 53.06 years, while the internal test set had 196 females and 750 males for HCC, averaging 52.34 years. Additional specifics can be found in Table 1.

Fig. 1: The Flowchart of the Cohort Setup. a Patient recruitment process of the training, testing, and external validation cohorts. b The real-world clinical test datasets were obtained from two hospitals. HCC denotes Hepatocellular Carcinoma, ICC denotes Intrahepatic Cholangiocarcinoma, MET denotes Metastatic Cancer, FNH denotes Focal Nodular Hyperplasia, HEM denotes Hemangioma, and CYST denotes Cyst.

Table 1 Baseline characteristics

Performance of Lesion Detection

In the lesion detection task, we retained bounding boxes with a confidence score above 0.25 and compared them with the ground-truth boxes. Predicted boxes with an intersection over union (IoU) above the threshold are true positives, while boxes with a lower IoU, as well as duplicate detections of the same lesion, are false positives. Undetected ground-truth boxes are false negatives. As shown in Fig. 2a, we analyzed the F1 score, recall, and precision at different IoU thresholds.
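The counting rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the (x1, y1, x2, y2) box format and greedy one-to-one matching are assumptions:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def detection_metrics(preds, truths, iou_thr=0.1):
    """Greedy one-to-one matching: each ground-truth box may absorb at most
    one prediction; extra (duplicate) or low-IoU predictions count as FPs."""
    matched, tp = set(), 0
    for p in preds:
        best, best_j = 0.0, None
        for j, t in enumerate(truths):
            if j not in matched and iou(p, t) > best:
                best, best_j = iou(p, t), j
        if best_j is not None and best >= iou_thr:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp    # low-IoU or duplicate boxes
    fn = len(truths) - tp   # undetected lesions
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if truths else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With two lesions and three predicted boxes (one a duplicate), this yields a recall of 1.0 and a precision of 2/3, matching the definitions above.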
At an IoU of 0.1, we achieved an F1 of 94.2%, a recall of 95.1%, and a precision of 93.3%. At an IoU of 0.3, the F1 was 92.8%, the recall was 93.7%, and the precision was 91.3%. An IoU of 0.5 yielded an F1 of 87.4%, a recall of 88.3%, and a precision of 86.6%. These results demonstrate the robust performance of our model across different IoU thresholds. In the LiLNet system, we chose an IoU threshold of 0.1. Although the overlap between the detection box and the true bounding box is minimal at this IoU value, the subsequent classification images were extended to a 224 × 224 detection box, ensuring coverage of a portion of the lesion.

Fig. 2: Performance of the proposed model in the testing cohort. a The outcome of lesion detection at various IoU thresholds. b, c, and d ROC curves for distinguishing benign and malignant tumors, malignant tumors (HCC, ICC, and METs), and benign tumors (FNH, HEM, and cysts), respectively. e ACC, F1, recall, and precision for benign and malignant tumor classification. f ACC, F1, recall, and precision for malignant tumor classification. g The same metrics for benign tumor classification. (*) denotes the use of pretrained parameters from ResNet trained on ImageNet. Source data are provided as a Source Data file Source_data_Figure_2.xlsx.

Performance of LiLNet

We trained three variants of the LiLNet model on the training set: LiLNet_BM distinguishes benign from malignant liver lesions, LiLNet_M distinguishes the three types of malignant liver lesions, and LiLNet_B distinguishes the three types of benign liver lesions. On the test set, the LiLNet_BM model achieved an AUC of 97.2% (95% CI: 95.9–98.2), an ACC of 94.7% (95% CI: 93.5–95.9), an F1 of 94.9% (95% CI: 93.8–96.1), a recall of 94.7% (95% CI: 93.5–95.9), and a precision of 95.2% (95% CI: 94.2–96.3) (Fig. 2b, e).
The LiLNet_M model achieved an AUC of 95.6% (95% CI: 94.3–96.7), an ACC of 88.7% (95% CI: 86.8–90.5), an F1 of 89.7% (95% CI: 88.2–91.3), a recall of 88.7% (95% CI: 86.8–90.5), and a precision of 92.0% (95% CI: 90.6–93.4) (Fig. 2c, f). Finally, the LiLNet_B model achieved an AUC of 95.9% (95% CI: 92.8–98.0), an ACC of 88.6% (95% CI: 83.9–93.3), an F1 of 89.0% (95% CI: 83.9–93.5), a recall of 88.4% (95% CI: 83.1–93.3), and a precision of 89.9% (95% CI: 85.3–94.2) (Fig. 2d, g). As a reference, we also constructed two benchmark models: a naive ResNet50 model and a ResNet50 model loaded with pretrained parameters. In terms of AUC, the LiLNet model exhibited a 1–2% improvement over the two baseline models, demonstrating enhanced discrimination between positive and negative samples.

We evaluated our model’s performance using 1151 patients from four different centers. In the Henan Provincial People’s Hospital (HN Center), our model achieved an AUC of 94.9% (95% CI: 93.2–96.5) for distinguishing benign and malignant tumors, with an 89.9% (95% CI: 87.7–92.1) ACC, a 90.0% (95% CI: 87.8–92.2) F1, an 89.9% (95% CI: 87.7–92.1) recall, and a 90.1% (95% CI: 87.9–92.4) precision (Fig. 3a, d). For malignant tumor diagnosis, it achieved an AUC of 87.9% (95% CI: 84.6–91.0), with an 80.8% (95% CI: 77.2–84.4) ACC, an 81.6% (95% CI: 78.1–85.0) F1, an 80.8% (95% CI: 77.2–84.4) recall, and an 83.6% (95% CI: 80.4–86.8) precision (Fig. 3b, e). For benign tumor diagnosis, it achieved an AUC of 89.9% (95% CI: 85.7–93.3), with an 83.9% (95% CI: 78.8–89.1) ACC, an 83.7% (95% CI: 78.6–89.0) F1, an 83.9% (95% CI: 78.8–89.1) recall, and an 84.9% (95% CI: 80.3–89.8) precision (Fig. 3c, f). A visual comparison of t-distributed stochastic neighbor embedding (t-SNE) between LiLNet, ResNet50 loaded with pretrained parameters (*), and the standard ResNet50 on the HN validation set can be found in Supplementary Fig. 1.
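The 95% confidence intervals reported throughout can be estimated with, for example, a percentile bootstrap over patients. This is a generic sketch; the paper does not state its exact CI procedure, and the metric shown is plain accuracy:

```python
import random

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an arbitrary metric(y_true, y_pred)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def accuracy(t, p):
    """Fraction of matching labels."""
    return sum(a == b for a, b in zip(t, p)) / len(t)
```

Resampling at the patient level, as sketched here, keeps all images from one patient together and avoids overstating the effective sample size.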
In the First Affiliated Hospital of Chengdu Medical College (CD Center), an AUC of 94.2% and an ACC of 82.9% were achieved for diagnosing HCC (Fig. 3g). In the Leshan People’s Hospital (LS Center), the model achieved an AUC of 87.6% and an ACC of 82.1% for diagnosing HCC; similarly, it maintained performance with an AUC of 79.2% and an ACC of 79.6% for ICC (Fig. 3h). Finally, at the Guizhou Provincial People’s Hospital (GZ Center), the ACC was 84.4% for HCC and 74.4% for ICC, as shown in Fig. 3i.

Fig. 3: Generalization performance of the LiLNet model on the external validation set. a–c display ROC curves for differentiating benign and malignant tumors in the HN external validation set. d provides ACC, F1, recall, and precision for this distinction. e presents ACC, F1, recall, and precision for identifying malignant tumors, while f shows the same metrics for benign tumors. g The model’s ACC and AUC for HCC in the CD validation set. h The model’s ACC and AUC for HCC and ICC in the LS validation set. i The ACC and AUC for distinguishing HCC and ICC in the GZ validation set. Source data are provided as a Source Data file Source_data_Figure_3.xlsx.

LiLNet Performance for Tumor Size

Table 2 presents the ACC for different tumor sizes in both the test set and the HN external validation set. Each cell shows the accuracy percentage for a tumor type within its size range, along with the total sample number. For instance, in the test set, tumors smaller than 1 cm achieved a 100% ACC for the HCC type, with a total of 4 samples; in the HN validation set, however, there were no samples in this size range, so the ACC is reported as 0%. The ACC varies with size range, and specific tumor types show differing ACCs within these ranges; hence, tumor size is not the sole factor influencing classification ACC. The results show varying accuracy levels for different tumor sizes, with no consistent trend.
Some size ranges display high ACCs, while others show lower ACCs in both the test set and the HN validation set. The ACC also varies for specific tumor types within different size ranges, indicating that tumor size alone does not determine classification ACC; other factors, such as tumor type, likely contribute to these variations.

Table 2 Accuracy of classifying tumors with different sizes

LiLNet Performance for Liver Background

To assess the potential impact of background liver conditions, such as fibrosis or inflammation, on the performance of our proposed system in analyzing CT images, we collected data from West China Tianfu Hospital, including 3 cases of HCC without hepatitis or liver fibrosis, 21 cases of HCC with hepatitis and liver fibrosis, 5 cases of ICC with similar liver conditions, and 16 cases of MET without hepatitis or liver fibrosis. The system achieved an AUC of 88.1% and an ACC of 80.9% for HCC with liver fibrosis caused by hepatitis, while for ICC the AUC was 96.4%, with an ACC of 80%. Our results show that the background liver condition has minimal impact on lesion extraction and imaging. This is because our data originate from real clinical events in which liver lesions often coexist with conditions such as cirrhosis, hepatitis, and liver fibrosis; during data collection, we did not exclude background liver diseases. The imaging features of diffuse liver disease, such as cirrhosis, fibrosis, or inflammation, typically differ from those of focal liver lesions on CT, making it relatively straightforward for the model to differentiate between them.

LiLNet Performance for Different Phases

In clinical practice, lesions show different characteristics in each phase, and radiologists often use multiple phases for lesion diagnosis. Following this practice, our system simultaneously detects lesions in multiple phases, providing enhanced support for medical professionals.
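The paper does not specify how predictions from the two phases are combined; one common approach, shown here purely as a hedged illustration, is late fusion by averaging the per-phase class probabilities:

```python
def fuse_phases(prob_ap, prob_pvp, weights=(0.5, 0.5)):
    """Late fusion: weighted average of per-class probabilities from the
    arterial-phase (AP) and portal-venous-phase (PVP) classifiers."""
    w_ap, w_pvp = weights
    fused = [w_ap * a + w_pvp * p for a, p in zip(prob_ap, prob_pvp)]
    total = sum(fused)          # renormalize so the result sums to 1
    return [f / total for f in fused]

# Hypothetical softmax outputs over (HCC, ICC, MET):
ap = [0.70, 0.20, 0.10]
pvp = [0.40, 0.50, 0.10]
fused = fuse_phases(ap, pvp)
pred = max(range(len(fused)), key=fused.__getitem__)  # argmax class index
```

Under equal weights, the AP and PVP opinions are blended symmetrically; unequal weights would let one phase dominate, which may or may not match the authors' design.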
To evaluate the advantages of incorporating different phases, we conducted experiments on a dataset containing 1569 patients from both the test and HN external validation sets, covering data from multiple phases. The results are depicted in Fig. 4a–d. As shown in Fig. 4a, for malignant triple classification in the test set, the diagnostic performance of using both AP and PVP was superior to that of using either AP or PVP alone, while the results for AP alone and PVP alone were comparable. For benign triple classification, the AUC was optimal when utilizing both AP and PVP images simultaneously, followed by AP alone and then PVP alone; for the other performance indicators, AP alone outperformed the AP and PVP combination, which in turn outperformed PVP alone. As illustrated in Fig. 4c, in the validation set the diagnostic performance of AP and PVP combined surpassed that of AP or PVP alone, for both malignant and benign classification. Analysis of the confusion matrices of the test set and external validation set (Fig. 4b, d) showed that employing images from both the AP and PVP phases simultaneously yielded superior results compared to using a single phase. Although the diagnostic outcomes of the two phases align in approximately 90% of cases, there are still instances where lesions are better characterized in the AP phase than in the PVP phase, and vice versa; this discrepancy may be attributed to the inherent characteristics of the data. In summary, integrating information from multiphase contrast-enhanced CT images enables a comprehensive and accurate assessment of liver lesion characteristics, thereby offering a more reliable basis for clinical diagnosis and treatment.

Fig. 4: LiLNet performance under different conditions. a Comparison of the AUC, F1, recall, and precision for the classification of the three types of malignant lesions (HCC, ICC, and METs) and the three types of benign lesions (FNH, HEM, and cysts) using different phases in the test set.
“malignant AP&PVP” indicates the simultaneous use of AP and PVP for diagnosing malignant lesions, “malignant AP” the use of AP only, and “malignant PVP” the use of PVP only. Similarly, “benign AP&PVP” indicates the simultaneous use of AP and PVP for diagnosing benign lesions, “benign AP” the use of AP only, and “benign PVP” the use of PVP only. b Confusion matrices for the classification of the three types of malignant lesions and the three types of benign lesions using different phases in the test set. c Comparison of the AUC, F1, recall, and precision for the classification of the three types of malignant lesions and the three types of benign lesions using different phases in the HN external validation cohort. d Confusion matrices for the classification of the three types of malignant lesions and the three types of benign lesions using different phases in the validation set. e Confusion matrix categorizing patients into four groups based on the diagnoses provided by the AI system and the radiologists: ‘Radiologist Right, AI Right’ indicates instances where both the AI system and the radiologist correctly diagnosed the liver tumor; ‘Radiologist Right, AI Wrong’ refers to cases where the AI system was incorrect but the radiologist’s diagnosis was accurate; ‘Radiologist Wrong, AI Right’ pertains to situations in which the AI system was correct but the radiologist’s diagnosis was not; ‘Radiologist Wrong, AI Wrong’ represents instances where neither diagnosed the liver tumor correctly. f The results of clinical validation at West China Tianfu Hospital. g The results of clinical validation at Sanya People’s Hospital.
Source data are provided as a Source Data file Source_data_Figure_4.xlsx.

Comparison with radiologists

We used a test set of 6743 images from 221 patients at West China Hospital of Sichuan University to compare the diagnostic ability of LiLNet with that of radiologists. The evaluation involved three radiologists with varying levels of experience, who independently labeled the 221 patients based on multiphase contrast-enhanced CT images. LiLNet demonstrated a diagnostic accuracy of 91.0% for distinguishing benign from malignant tumors, 82.9% for distinguishing among malignant tumors, and 92.3% for distinguishing among benign tumors (Table 3). For benign versus malignant diagnosis, LiLNet’s accuracy was 4.6% higher than that of junior-level radiologists, 4.1% higher than that of middle-level radiologists, and 2.3% higher than that of senior-level radiologists. The radiologists’ accuracies for diagnosing malignant tumors were similar to one another; notably, LiLNet achieved a substantial 18% improvement over them. Additionally, in diagnosing benign tumors, LiLNet outperformed junior-level practitioners by 20%, middle-level practitioners by 10%, and senior-level practitioners by 6.7%. More information about the radiologists and their diagnostic results can be found in the supplementary information (Supplementary Tables 1 and 2).

Table 3 Comparison of diagnostic results between LiLNet and radiologists

We calculated the Fleiss kappa coefficient between LiLNet and the radiologists to assess consistency. The Fleiss kappa values are 0.806 for the benign-versus-malignant task and 0.848 for benign cases, both surpassing the 0.8 threshold and indicating a very high level of agreement among evaluators, and 0.771 for malignant cases, which falls within the range of 0.6 to 0.8.
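Fleiss' kappa, as used here, can be computed from a subjects × categories count table; a minimal pure-Python sketch (any example labels are hypothetical, not the study's data) is:

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table where table[i][j] is the number of raters
    assigning subject i to category j; every row must sum to the same n."""
    N = len(table)                 # number of subjects
    n = sum(table[0])              # raters per subject
    # Mean per-subject agreement P_i.
    p_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in table) / N
    # Chance agreement from overall category proportions p_j.
    totals = [sum(row[j] for row in table) for j in range(len(table[0]))]
    p_e = sum((t / (N * n)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

For instance, perfect agreement spread across several categories yields a kappa of exactly 1.0, and values above 0.8 are conventionally read as very high agreement, matching the interpretation in the text.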
The value of 0.771 still indicates a high level of agreement among the evaluators (details in Supplementary Table 3).

Figure 4e displays the comparison matrix of diagnoses between the AI system and radiologists (we selected the optimal diagnosis from the assessments provided by multiple radiologists) against pathological diagnostic labels. The matrix indicates that 4% of cases were misdiagnosed by both the AI system and radiologists, that radiologists were accurate in 8% of cases where the AI system erred, and that the AI system was correct in 16% of cases where radiologists made errors. Specifically, in the “benign” cases shown in Fig. 4e, the AI system and radiologists agreed on 92 cases; of these, 87 were confirmed correct by pathology and 5 were incorrect. There were also 12 cases of disagreement: the AI system was incorrect in 5 (false negatives) and correct in 7 (true positives), whereas the radiologists were incorrect in 7 and correct in 5. Consequently, when the AI system and radiologists differed, the AI system achieved a 58.34% true positive rate for benign diagnoses, while the radiologists achieved 41.67%. For the “malignant” cases, the AI system and radiologists agreed on 96 cases; of these, 95 were confirmed correct by pathology and 1 was incorrect. There were also 21 cases of disagreement: the AI system was incorrect in 9 (false negatives) and correct in 12 (true positives), whereas the radiologists were incorrect in 12 and correct in 9. Consequently, when the AI system and radiologists differed, the AI system achieved a 57.14% true positive rate for malignant diagnoses, while the radiologists achieved 42.85%. Additional diagnostic information for HCC, ICC, METs, FNH, HEM, and cysts can be found in Fig. 4e.

Figure 4e also shows that the AI and radiologists achieved congruent outcomes in cyst diagnosis, accurately identifying 32 cases while misdiagnosing 2. Upon analyzing the misdiagnosed cyst images, we found one patient with a mixed lesion showing characteristics of both HEM and a cyst; the cyst was positioned near a blood vessel, resulting in misdiagnosis as HEM. The other misdiagnosis was due to the lesion being smaller than 1 cm, which presents challenges for identification. In other categories, however, there are some differences in diagnosis between the AI and radiologists, indicating differences in diagnostic approach or focus. These findings highlight the potential for our AI-assisted software to collaborate with radiologists to enhance the diagnostic accuracy of liver lesions.

Real-world clinical evaluation

Our system (a simple web version is available in Supplementary Note 1) is currently suitable for routine clinical diagnoses, encompassing outpatient, emergency, and inpatient scenarios in which patients undergo AP and PVP sequences. To validate the actual clinical efficacy of the system, we integrated it into the established clinical infrastructure and workflow at West China Tianfu Hospital and Sanya People’s Hospital in China, where we conducted a real-world clinical trial.

At West China Tianfu Hospital, we assessed outpatient and inpatient data from February 29th to March 7th, comprising 117 cysts, 22 HEMs, and 16 METs. To improve the evaluation of the model’s ability to diagnose malignancies, we also included 24 HCC lesions and 5 ICC lesions from January 2022 to February 2024. All malignant tumors were pathologically confirmed, while benign tumors were diagnosed by three senior radiologists. As shown in Fig. 4f, at the Tianfu Center our system achieved an AUC of 96.6% and an ACC of 91.9% for the diagnosis of benign versus malignant lesions.
For HEMs, the AUC was 99.54%, with an ACC of 95.45%, while for cysts the AUC was 99.8%, with an ACC of 98.3%. For HCCs, the AUC was 87.1%, with an ACC of 79.2%; for ICCs, the AUC was 95.0%, with an ACC of 80%; and for METs, the AUC was 89.9%, with an ACC of 81.2%.

We assessed outpatient and inpatient data at Sanya People’s Hospital from March 15th to March 29th, comprising 45 cysts, 23 HEMs, 121 normal cases, 1 ICC, and 3 METs. Additionally, we retrospectively collected data for 34 HCCs, 3 ICCs, and 5 METs from April 2020 to February 2024. All malignant tumors were pathologically confirmed, while benign tumors were diagnosed by three senior radiologists. As shown in Fig. 4g, at the Sanya Center our system achieved an AUC of 95.4% and an ACC of 90.5% for the diagnosis of benign versus malignant lesions. For HEMs, the AUC was 90.8%, with an ACC of 95.6%, while for cysts, the AUC was 91.4%, with an ACC of 80.0%. For HCCs, the AUC was 89.5%, with an ACC of 85.3%; for ICCs, the AUC was 97.6%, with an ACC of 75%; and for METs, the AUC was 88.8%, with an ACC of 87.5%.

Deep learning analysis

To better explain the deep learning model, we conducted two experiments: an analysis of activation maps by professional radiologists, and a gradient-based analysis.

Class activation maps (CAMs) are generated by computing each pixel’s activation level in the model, revealing the areas of focus within the image. Figure 5a shows that the model attends more to lesion areas than to normal liver tissue when distinguishing between subtypes. HCC typically exhibits heterogeneity in internal structure and cellular composition, resulting in significant variation within the tumor. Rapid proliferation of tumor cells leads to increased cell density and richer vascularity in the central region, often manifested as arterial phase enhancement on imaging.
Conversely, the surrounding area may display lower density and vascularity due to compression of normal hepatic tissue or the arrangement of tumor cells in a nest-like pattern, presenting as low density on imaging. Consequently, in the CAM image the central region may exhibit deep activation, while the surrounding area shows secondary activation. Additionally, the irregular spiculated margins commonly observed in HCC are a critical feature, often encompassed within the activated regions. ICC is characterized by tumor cells primarily distributed in peripheral regions, with fewer tumor cells and immune-related lymphocytes in the central area. Imaging typically reveals higher density and vascularity in the tumor periphery, contrasting with lower density and vascularity in the central region; these imaging features are reflected in the CAM image. Metastatic tumors, arising from either intrahepatic primary tumors or extrahepatic malignancies, often exhibit necrosis and uneven vascularity in their tissue composition. This results in the characteristic imaging appearance of indistinct margins and multifocal lesions. CAM images frequently depict this by demonstrating diffuse and poorly defined activation, with uneven depth and distribution of activated regions.

Fig. 5: The visualization process of model decisions. a The class activation map generated by the last convolution layer. We present activation maps for liver lesions: the first row displays the original image, while the second row displays the corresponding activation map. Red denotes higher attention values, blue denotes lower values, and the red circle marks the tumor area. b SHAP plots revealing the influence of pixels on the model predictions for HCC, ICC, MET, FNH, HEM, and cyst lesions.

FNH typically arises from the abnormal arrangement of normal hepatic cells and contains abundant vascular tissue with high density.
On imaging, it typically presents as homogeneous enhancement of a focal lesion, while the surrounding normal hepatic tissue appears relatively hypoenhanced due to compression. In CAM images, the lesion often exhibits uniform overall activation, while the compressed normal hepatic parenchyma demonstrates relatively lower activation; because FNH is more richly vascularized than other lesions, its overall activation is greater. HEM lesions usually contain abundant vascular tissue and manifest as focal lesions with significant enhancement during the contrast-enhanced phase of imaging. In CAM images, they typically appear as locally activated areas, exhibiting greater activation than nonvascular lesions, with a more uniform distribution. Cysts typically consist of fluid or semisolid material, with uniform internal tissue distribution and clear borders. On imaging, they appear as circular or oval low-density areas with clear borders; in CAM images, cystic regions appear as circular areas with deep, usually uniform activation. More class activation maps can be found in Supplementary Fig. 2.

Model interpretability refers to explaining the outputs generated by a machine learning model, elucidating which features influence the actual output and how. In deep learning, particularly in computer vision classification tasks where the features are essentially pixels, interpretability helps identify pixels that have either positive or negative impacts on predicting a category. To this end, we employed the SHapley Additive exPlanations (SHAP) library to interpret the model. This process primarily involves analyzing the gradients within the model to gain a deeper understanding of how decisions are made; by inspecting gradients, we can determine which features contribute most significantly to the model’s predictions. In Fig. 5b, we present SHAP plots for HCC, ICC, MET, FNH, HEM, and CYST. Each plot comprises the original image alongside one grayscale image per output class predicted by the model, each showing the contribution of the input to that class. In these images, blue pixels indicate a negative effect and red pixels a positive effect, while white pixels denote areas whose input features the model ignores. Below the images, a color scale ranging from negative to positive illustrates the intensity of the SHAP value assigned to each pixel. For instance, in a correct HCC prediction, the SHAP plot for the HCC class reveals red activations predominantly concentrated in the lesion area; in the SHAP plots for other classes, such as ICC and MET, some red pixels are present but are not concentrated in the lesion area. This suggests that red activations outside the lesion area in other classes may indicate confusion by the model during prediction, while activation within the lesion area remains one of the key factors for accurate prediction.

Data Partitioning Strategy Experiments

We conducted a time-based data partitioning experiment to further validate the model’s generalization ability. We sorted the data used for model development chronologically, using early data for training and later data for testing (with the same test set size as in the random partitioning), and compared the results of the two partitioning schemes, as shown in Fig. 6. Using the time-based partitioning method, we achieved an AUC of 98.7% and an ACC of 93.5% (95% CI: 92.1–94.8) for benign versus malignant classification.
The diagnostic AUC for benign data was 98.0% (95% CI: 96.1–99.2), with an ACC of 91.3% (95% CI: 86.6–95.3), while for malignant diagnosis the AUC was 97.5% (95% CI: 96.7–98.1), with an ACC of 90.9% (95% CI: 89.2–92.5). In the HN external validation, the AUC for benign versus malignant diagnosis was 94.6% (95% CI: 92.6–96.2), with an ACC of 88.8% (95% CI: 86.5–91.2). The AUC for benign diagnosis was 88.3% (95% CI: 84.0–92.1), with an ACC of 82.4% (95% CI: 77.2–88.1), while for malignant diagnosis the AUC was 87.9% (95% CI: 84.6–91.0), with an ACC of 85.1% (95% CI: 81.9–88.3). In external validation at CD, the accuracy of malignant diagnosis was 90.4% (95% CI: 84.0–95.7); at GZ it was 80.1% (95% CI: 74.2–85.5), and at LS it was 82.8% (95% CI: 77.6–88.1).

Fig. 6: Comparison of Results between randomly and time-divided data. a displays ROC curves comparing the differentiation of benign and malignant tumors in the test and HN external validation sets. b shows ROC curves comparing the differentiation of benign tumors in the test and HN external validation sets. c presents ROC curves comparing the identification of malignant tumors. d displays ACC for distinguishing between benign and malignant tumors in the test and HN external validation sets. e demonstrates ACC for distinguishing benign tumors in the test and HN external validation sets. f provides ACC for identifying malignant tumors in the HN, CD, GZ, and LS validation sets. Source data are provided as a Source Data file Source_data_Figure_6.xlsx.

We conducted a statistical analysis of the accuracies of the random and time-based data partitioning methods using a two-sided t-test. For the binary classification of benign and malignant lesions, the p-value is 0.192 on the test set and 0.503 on the HN external validation set. For the ternary classification of benign lesions, the p-value is 0.408 on the test set and 0.695 on the HN external validation set.
For the ternary classification of malignant lesions, the p-values are 0.082 on the test set, 0.08 on the HN external validation set, 0.136 on the CD external validation set, 0.483 on the GZ external validation set, and 0.811 on the LS external validation set. The statistical results indicate that all p-values are greater than 0.05, suggesting no significant difference between the two methods.
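The chronological partitioning used above can be sketched as follows; the record structure and dates are hypothetical, and the only requirement is that every training scan precede every test scan:

```python
def time_based_split(records, test_size):
    """Sort patient records chronologically and hold out the most recent
    `test_size` patients for testing, training on all earlier ones."""
    ordered = sorted(records, key=lambda r: r["scan_date"])
    return ordered[:-test_size], ordered[-test_size:]

# Hypothetical records with ISO-format dates, which sort correctly as strings:
records = [
    {"id": 1, "scan_date": "2014-03-01"},
    {"id": 2, "scan_date": "2019-07-15"},
    {"id": 3, "scan_date": "2012-11-30"},
    {"id": 4, "scan_date": "2021-01-09"},
]
train, test = time_based_split(records, test_size=1)
```

Unlike a random split, this evaluation respects the arrow of time, so the test set mimics future patients and probes temporal distribution shift directly.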
