Efficient urinary stone type prediction: a novel approach based on self-distillation

Evaluation metrics

In this research, a set of key metrics was employed to thoroughly assess the performance of the proposed model: Accuracy, Precision, Recall, Specificity, and F1 Score. Together, these metrics constitute a holistic evaluation framework for the classification model.

Accuracy: the ratio of correctly predicted samples to the total sample count, and the primary indicator of a classification model's overall efficacy. Higher accuracy signifies a more robust classification capability across all categories.

Precision: the fraction of samples predicted as positive that are truly positive. Higher precision indicates greater reliability of positive-class predictions.

Recall: the proportion of actual positive samples correctly identified as positive by the model. Higher recall implies an enhanced ability to detect true positive samples.

Specificity: the proportion of actual negative samples accurately identified as negative by the model. Greater specificity indicates superior performance in excluding negative samples.

F1 Score: the harmonic mean of Precision and Recall, a crucial metric for gauging a model's comprehensive performance, particularly under class imbalance.

The formulae for these metrics are as follows:

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}\tag{19}$$

$$Precision=\frac{TP}{TP+FP}\tag{20}$$

$$Recall=\frac{TP}{TP+FN}\tag{21}$$

$$Specificity=\frac{TN}{TN+FP}\tag{22}$$

$$F1\ Score=\frac{2PR}{P+R}\tag{23}$$

Here, TP and TN denote the numbers of correctly classified positive and negative samples, respectively, while FP and FN denote the numbers of incorrectly classified positive and negative samples; P stands for Precision and R for Recall. Together, these metrics enable a comprehensive appraisal of the model's performance across multiple dimensions, ensuring objective and thorough evaluation outcomes.
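As a concrete illustration of Eqs. (19)–(23), the minimal Python sketch below computes the five metrics from raw confusion counts. The function name and the zero-division guards are our own illustrative choices, not code from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics of Eqs. (19)-(23) from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Eq. (19)
    precision = tp / (tp + fp) if (tp + fp) else 0.0      # Eq. (20)
    recall = tp / (tp + fn) if (tp + fn) else 0.0         # Eq. (21)
    specificity = tn / (tn + fp) if (tn + fp) else 0.0    # Eq. (22)
    f1 = (2 * precision * recall / (precision + recall)   # Eq. (23)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, specificity, f1

# Toy example: 75 TP, 90 TN, 10 FP, 25 FN
print(classification_metrics(75, 90, 10, 25))
```

For a multi-class task such as stone classification, these quantities would be computed per class in a one-vs-rest fashion and then averaged; the averaging scheme is not stated in the paper, so that choice is left open here.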
Results from comparative experiments

To meticulously evaluate the efficacy of the algorithm presented in this study, a series of comparative experiments was conducted. Our method was compared against a range of well-known classification networks: ResNet34 [23], ResNet50 [23], VGG11 [24], EfficientNetV1 [25], EfficientNetV2 [26], ShuffleNet [27], ConvNeXt [28], RegNet [29], DenseNet121 [30], Vision Transformer [31], and Swin Transformer [32]. Each of these networks was trained, validated, and tested on the dataset compiled for this research.

Employing the evaluation metrics delineated in the preceding section, we compared the performance and parameter counts of our proposed algorithm against the other models, providing a thorough picture of its behavior on the urinary stone classification task.

Table 3 Comparative results from various network experiments.

As Table 3 shows, the original ResNet-34 network attained an accuracy of 67.66%. With the enhanced self-distillation and the other methodologies proposed in this paper, our network achieved an accuracy of 74.96% in classifying urinary stones, an increase of 7.3 percentage points over the original network. The algorithm also improved the other key indicators, Precision, Recall, Specificity, and F1 score, by 6.29, 13.47, 4.67, and 9.79 percentage points, respectively, over the original ResNet-34.

To further explore the efficiency of our model, we compared parameter counts across models. Although our model has a slightly higher parameter count (21.7 M) than ResNet34 (21.3 M), the difference is only 0.4 M, and it significantly outperforms ResNet34 on all performance metrics. This demonstrates that the self-distillation technique and the other optimizations proposed in this paper enhance performance with minimal additional parameter overhead. Moreover, compared to models with similar parameter counts, such as ConvNeXt (27.8 M) and Swin Transformer (27.5 M), and even much larger models such as VGG11 (128.8 M) and Vision Transformer (86.6 M), our model maintains a lower parameter count while exhibiting superior performance, underscoring the advantages of our design in resource and computational efficiency.

For a more vivid and comprehensive evaluation, ROC (Receiver Operating Characteristic) curves were plotted for the improved classification network and for the other networks, and the area under the ROC curve (AUC) was calculated for each class. The AUC, a significant measure of a model's classification capability, further corroborates the efficacy and superiority of the method introduced in this paper. The ROC curves of the different networks and the per-class AUC values are illustrated in Fig. 5.

Fig. 5 ROC/AUC of different networks.
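As a sketch of how per-class ROC curves and AUC values such as those in Fig. 5 can be produced, the snippet below uses scikit-learn's documented roc_curve and auc functions in a one-vs-rest setup. The array names, the four-class configuration, and the random placeholder scores are illustrative assumptions standing in for the real labels and softmax outputs.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

n_classes = 4                                   # four stone categories
y_true = np.array([0, 2, 1, 3, 0, 1, 2, 3])     # placeholder labels
rng = np.random.default_rng(0)
y_score = rng.random((8, n_classes))            # placeholder softmax outputs
y_score /= y_score.sum(axis=1, keepdims=True)

# Binarize labels (one-vs-rest), then compute one ROC curve per class.
y_bin = label_binarize(y_true, classes=np.arange(n_classes))
for c in range(n_classes):
    fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
    print(f"class {c + 1}: AUC = {auc(fpr, tpr):.3f}")
```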
The experimental outcomes demonstrate that our model exhibits outstanding performance in classes 1, 2, and 4, leading all compared models in AUC (Area Under Curve) and surpassing the second-best model by 0.9, 1.9, and 0.7 percentage points, respectively, in these categories. This margin underscores the discriminative capacity of our model. In class 3, our model also showed strong classification proficiency, achieving an AUC of 94.6%, a result that reflects both high sensitivity and an exceptional ability to discriminate with high specificity. These findings robustly validate the effectiveness and dependability of our model in categorizing the various types of urinary system stone disease.

In addition, for a more holistic depiction of how the different networks behave in the classification task, we present their confusion matrices. These give a visual representation of each network's accuracy and error rates across categories, offering a crucial foundation for further model refinement and analysis. The confusion matrices for the different networks are depicted in Fig. 6.

Fig. 6 Confusion matrices of different networks.

A thorough examination of the confusion matrices reveals that our model has a pronounced advantage in accurately identifying class 1 and class 3, with accuracies of 0.83 and 0.77, respectively, markedly outperforming the other models in the comparison group. For class 4 and class 2, our model closely approached the top-performing models, trailing by only 0.05, which demonstrates strong generalization and consistent performance across categories. In average accuracy, our model attained 0.73, again surpassing the comparison group and indicating its comprehensive effectiveness. These results further underscore the model's potential in handling a wide array of stone types.

In conclusion, our model not only demonstrates high accuracy and robustness but also establishes an effective technical approach for identifying the components of urinary system stones, promising well for future clinical implementation.

Influence of distillation temperature

The distillation temperature parameter T modulates the model's training dynamics by reshaping its output probability distribution and thereby altering its predictive behavior. A higher temperature makes the model's predictions more "open" across categories, assigning non-negligible probability to each class; a lower temperature makes the model more "confident", with more definitive outcomes.

In our experiments, the selection of T was tailored to the specific task and requirements. Generally, a higher temperature is preferred during training to encourage exploratory learning, enabling the model to assimilate a broader range of information; during inference or testing, the temperature can be adjusted according to actual needs, striking a balance between conservatism and diversity in prediction.
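To make the role of T concrete, the sketch below shows a temperature-scaled softmax inside a standard Hinton-style soft-target distillation loss. This is a minimal baseline assuming the conventional KL formulation; the refined self-distillation loss proposed in the paper is not reproduced here, and the function name is our own.

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened distributions.

    Dividing logits by T > 1 flattens both distributions (the 'open'
    behavior described above); T -> 0 sharpens toward a one-hot,
    'confident' prediction. The T**2 factor keeps gradient magnitudes
    comparable across temperatures.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

# Toy check: the loss shrinks as the student's logits approach the teacher's.
teacher = torch.tensor([[4.0, 1.0, 0.5, 0.2]])
student = torch.tensor([[2.0, 1.5, 1.0, 0.8]])
print(soft_distillation_loss(student, teacher, T=4.0).item())
```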
To meticulously evaluate the impact of the distillation temperature T on the experimental outcomes, we conducted experiments at varying temperature levels. The results are detailed in Table 4.

Table 4 Comparative outcomes at different distillation temperatures T.

The experimental data indicate that the model performs optimally when T is set to 4, yielding an accuracy of 74.96%, precision of 71.72%, recall of 73.00%, specificity of 90.05%, and F1 score of 72.35%. Compared with the other temperature settings, this configuration improves accuracy, precision, recall, specificity, and F1 score by 1.8, 3.07, 5.5, 0.82, and 4.73 percentage points, respectively. Hence, T = 4 is identified as the most effective setting in this study, highlighting the critical role of the temperature parameter in the training process.

Results of ablation experiments

To ascertain the efficacy of the four enhancements introduced in this paper for ResNet-34, a series of ablation experiments was undertaken. The enhancements are: adoption of a self-distillation strategy (A), refinement of the self-distillation strategy (B), application of a feature fusion strategy (C), and integration of CAM (D). To methodically evaluate their influence on the original network, we incrementally incorporated these strategies into ResNet-34 and conducted training and testing on the dataset from this study. The outcomes are presented in Table 5.

Table 5 Comparative results from ablation experiments.

The results clearly indicate that each strategy contributes positively to the model's performance. Notably, with all enhancements (A + B + C + D) integrated into ResNet-34, the model attains accuracy, precision, recall, specificity, and F1 score of 74.96%, 71.72%, 73.00%, 90.05%, and 72.35%, respectively. These findings validate the effectiveness of each individual improvement and underscore their cumulative impact on the original network's performance.

Comparison of different feature fusion methods

To evaluate the performance of various feature fusion methods within our architecture, we conducted comparative experiments on four methods, the results of which are presented in Table 6. The methods, illustrated in the code sketch after the list below, are:

1. Concatenation: concatenation of feature layers.

2. Addition: element-wise addition of feature layers.

3. Multiplication: element-wise multiplication of feature layers.

4. Maximum: element-wise maximum of feature layers.
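As referenced above, the sketch below applies the four candidate fusion operations to two feature maps. The tensor shapes are illustrative assumptions; in the actual architecture, the fused layers come from different network depths and would first be brought to a common spatial and channel size.

```python
import torch

# Two feature maps, assumed already resized/projected to a common
# shape (batch, channels, height, width).
f1 = torch.randn(8, 64, 28, 28)
f2 = torch.randn(8, 64, 28, 28)

fused_concat = torch.cat([f1, f2], dim=1)  # 1. Concatenation -> 128 channels
fused_add = f1 + f2                        # 2. Addition (element-wise)
fused_mul = f1 * f2                        # 3. Multiplication (element-wise)
fused_max = torch.maximum(f1, f2)          # 4. Maximum (element-wise)

print(fused_concat.shape)  # torch.Size([8, 128, 28, 28])
print(fused_add.shape)     # torch.Size([8, 64, 28, 28])
```

Note that only Concatenation grows the channel dimension, preserving both inputs intact rather than collapsing them into a single map, which is consistent with the explanation of its advantage given below.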

Table 6 Results comparison of different feature fusion methods.

The results indicate that the Concatenation method outperforms the other methods on all evaluation metrics. This superiority is likely due to Concatenation retaining more complete information from both inputs, enabling the network to learn more effectively from the fused features. In contrast, Addition, Multiplication, and Maximum, while simplifying information processing, may restrict the model's ability to learn complex feature patterns.

This comparative analysis confirms the particular value of Concatenation and underscores the importance of selecting appropriate feature fusion strategies when designing deep learning models.

Results on public datasets

To substantiate the generalization capacity of our proposed model, we conducted tests on two public datasets: DermaMNIST [33] (part of the MedMNIST series) and the ChestXRay2017 dataset [34] of chest X-ray images.

DermaMNIST, a dermatoscopic image dataset, presents a 7-class classification challenge. It comprises 10,015 images in total: 7,007 in the training set, 1,003 in the validation set, and 2,005 in the test set. On DermaMNIST, our model outperformed the other models, as indicated in Table 7. Specifically, it achieved an accuracy of 80.16%, surpassing the next best model, ResNet34 (77.36%), by 2.8 percentage points. Our model led in precision with 71.63%, exceeding the second-highest, DenseNet121, by 9.56 percentage points. In recall, our model also topped the list with 60.90%, 10.13 percentage points above the next best, Vision Transformer (51.40%). For specificity and F1 score, our model recorded 94.15% and 65.83%, respectively, the highest among all models compared.

Table 7 Comparative results on the DermaMNIST dataset.

ChestXRay2017, a binary pneumonia-classification dataset, contains 5,857 images. We split part of the training set off for validation, yielding 4,187 images for training, 1,046 for validation, and 624 for testing. The results in Table 8 show superior performance of our model on this dataset as well. It achieved an accuracy of 92.63%, 1.6 percentage points higher than the second-ranked DenseNet121 (91.03%). In precision and F1 score, our model led with 93.55% and 92.18%, respectively, surpassing the second-best DenseNet121 (91.25%) and EfficientNetV1 by 2.3 and 1.25 percentage points, further confirming the model's superiority in such tasks.

Table 8 Comparative results on the ChestXRay2017 dataset.

These experimental findings highlight the strong generalization ability of our proposed model across diverse tasks such as skin lesion identification and pneumonia detection. On two distinct public datasets, the model demonstrated higher accuracy and reliability than other contemporary models, particularly excelling in precision and recall, which emphasizes its capability to identify positive-class samples accurately. These results reinforce the potential applicability of our model in clinical image analysis, paving the way for high-quality automated diagnostic tools.
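For readers wishing to reproduce the DermaMNIST evaluation, the dataset is distributed through the medmnist Python package; the sketch below shows a typical loading pattern using that package's publicly documented interface, not code from this paper. The transform and batch size are illustrative assumptions, and the paper's actual preprocessing may differ.

```python
import torchvision.transforms as T
from torch.utils.data import DataLoader
from medmnist import DermaMNIST  # pip install medmnist

transform = T.ToTensor()  # 28x28 RGB dermatoscopic images by default

train_set = DermaMNIST(split="train", transform=transform, download=True)
val_set = DermaMNIST(split="val", transform=transform, download=True)
test_set = DermaMNIST(split="test", transform=transform, download=True)

# Split sizes match those reported above: 7007 / 1003 / 2005
print(len(train_set), len(val_set), len(test_set))

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
images, labels = next(iter(train_loader))
print(images.shape, labels.shape)  # labels arrive with shape (batch, 1)
```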
