A Modified Deep Semantic Segmentation Model for Analysis of Whole Slide Skin Images

The proposed model was tested on skin biopsies labelled by pathologists.

Evaluation metrics

The final segmented whole slide images were compared to the ground-truth (GT) images. The testing results in terms of class-wise recall are shown in Table 1. To define these metrics, the individual pixels in both the GT images and the model outputs were categorized into four groups: True-Positive (TP), True-Negative (TN), False-Positive (FP), and False-Negative (FN), as follows:

TP: Pixels correctly segmented as the target class by the model.

TN: Pixels correctly identified as not belonging to the target class by the model.

FP: Pixels incorrectly identified as belonging to the target class by the model.

FN: Pixels incorrectly identified as not belonging to the target class by the model.

The class-wise accuracy is defined as the recall:$$\begin{aligned} \text {Recall} = \frac{TP}{TP + FN} \end{aligned}$$
(2)
Another useful metric is the Dice (F1) score, which measures the overlap between the predicted and ground-truth regions of a class relative to their combined area:$$\begin{aligned} \text {F1-Score} = \frac{TP}{TP+\frac{1}{2}(FP+FN)} \end{aligned}$$
(3)
Finally, the overall per-pixel accuracy is calculated, which is given by:$$\begin{aligned} \text {Accuracy} = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(4)
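These three metrics can be computed directly from per-pixel counts. The snippet below is a minimal sketch, assuming NumPy label maps of integer class indices for the ground truth and the prediction; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def per_class_metrics(gt, pred, num_classes):
    """Compute recall, F1 (Dice) and overall pixel accuracy from label maps.

    gt, pred: 2-D integer arrays of the same shape holding class indices.
    """
    recalls, f1s = {}, {}
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        recalls[c] = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
        f1s[c] = tp / (tp + 0.5 * (fp + fn)) if (tp + fp + fn) > 0 else float("nan")
    accuracy = np.mean(gt == pred)  # overall per-pixel accuracy
    return recalls, f1s, accuracy
```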
Ablation study

To evaluate the proposed segmentation model, we conducted detailed ablation studies. As shown in Fig. 4, our studies explored how different backbones performed on a U-Net. The predictions from these models were aggregated to form an ensemble, which also included the original U-Net architecture.

Fig. 4 Visualizing the accuracy: generated masks of skin cancer types from the Queensland Dataset compared to the ground truth. (a) Generated mask with predominant BCC cancer. (b) Generated mask with predominant IEC cancer. (c) Generated mask with predominant SCC cancer.

In our experiments, we employed an 80:10:10 data split for training, validation, and testing, respectively. This partitioning strategy ensures that the model is trained on a diverse set of examples, validated on a separate dataset to tune hyperparameters (as shown in Table 3), and finally tested on an independent set to evaluate its generalization performance. The rationale behind this split is to strike a balance between providing sufficient data for training and robustly assessing the model's performance on previously unseen instances. The EfficientNet-B3 model took 6 hours to train, and at the inference stage it takes only 30 seconds to generate results from a single slice.

Table 3 Hyperparameters for model training.

Quantitative results

Table 4 shows the results of the studies in the form of a performance evaluation of the proposed models, both individually and as an ensemble. The results show that the proposed model outperforms the existing approach by a significant margin in the cancerous classes, where an average increase of 6% is observed.

Table 4 Class-wise recall.

Based on the results, it is also important to mention that an ensemble of models did not improve the results significantly; an individual model with an EfficientNet-B3 backbone achieves essentially the same results as the ensemble approach. The results clearly highlight that the proposed modified EfficientNet-B3 backbone network shows improved results, specifically for the cancerous regions, i.e., BCC, SCC and IEC. Our model performed most poorly on the FOL class, with a recall of 0.67. This was primarily due to the class's under-representation in the dataset, and it also depends on the depth of the biopsy: in shave biopsies, for example, some hair follicles may be included in the specimen obtained. However, since shave biopsies only remove a superficial layer of skin, the hair follicles may not be fully intact.

Figure 5 shows the confusion matrix for the EfficientNet-B3 model. This confusion matrix visualizes the performance of our segmentation model across 12 classes, with a focus on the recall metric. Each row of the matrix corresponds to the actual class, while each column represents the predicted class. The main diagonal, normalized to highlight recall, shows the percentage of correct predictions for each class. Notably, the model exhibits lower performance in identifying the 'PAP' class, primarily due to its frequent confusion with the 'RET' class. This confusion arises from the similarity and proximity of these layers in the skin. Additionally, the limited data available for training the model on the 'PAP' class exacerbated this issue. Another class where the model underperforms is 'FOL', again largely due to insufficient data for effective training. Nevertheless, the results are still better than the existing ones due to a strong pre-trained backbone.

Fig. 5 Confusion matrix showing recall values for the EfficientNet-B3 model against each class.
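For reference, the backbone comparison and the ensemble described above can be set up with standard segmentation tooling. The sketch below is illustrative only: it assumes the segmentation_models_pytorch library, ImageNet-pretrained encoders, and simple probability averaging, none of which are prescribed by the paper.

```python
import torch
import segmentation_models_pytorch as smp

NUM_CLASSES = 12  # tissue classes used in this work

# U-Net variants with different pre-trained encoders (backbones).
# The encoder list here is illustrative; the paper's exact set may differ.
backbones = ["efficientnet-b3", "resnet50", "vgg16"]
models = [
    smp.Unet(encoder_name=name, encoder_weights="imagenet",
             in_channels=3, classes=NUM_CLASSES)
    for name in backbones
]

def ensemble_predict(models, image):
    """Average per-class softmax probabilities over the models and take argmax."""
    with torch.no_grad():
        probs = [torch.softmax(m.eval()(image), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0).argmax(dim=1)  # (N, H, W) label map
```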
Qualitative analysis

The visual results of the experiment support the findings of the quantitative analysis. We also see how an EfficientNet-B3 backbone greatly improves the results of the experiment. The model predicts the BCC class much better than the others, which again reflects the skewness of the dataset. The model confuses IEC and SCC, as the two classes have only minor differences. Figure 6 shows how the model predictions compare to the labelling done by a pathologist.

Fig. 6 Isolating BCC, SCC, and IEC masks in different models. (a) Compares the basal cell carcinoma labelling provided by a doctor versus that of the trained model. (b) Compares the intra-epidermal carcinoma labelling provided by a doctor versus that of the trained model. IEC was difficult to detect as there was considerable overlap between it and the epidermal class. (c) Compares the squamous cell carcinoma labelled by a doctor to the trained model's label.

For qualitative analysis in deep learning, uncertainty maps are commonly used to visualize the output of a model and identify regions where the model needs improvement or additional data. They provide valuable insight into how confident the model is about its predictions. To generate the maps shown in Fig. 7, the output of the last convolutional layer was transformed by the softmax function into a probability distribution over the 12 different classes. The resulting probability map can be visualized as a heat map in which the intensity of each pixel represents the confidence of the model's prediction for that pixel. The softmax function is defined as:$$\begin{aligned} \text {softmax}(x)_i=\frac{e^{x_i}}{\sum _j e^{x_j}} \end{aligned}$$
(5)
where x is the input to the softmax function, i.e., the vector of per-class scores produced by the last convolutional layer of the model for each pixel. The softmax function normalizes these scores into a probability distribution over the classes. From this distribution, the maximum probability at each pixel is selected and subtracted from 1, and the result is rendered with the nipy-spectral color map to show the uncertainty for each class:$$\begin{aligned} P(x)=1-\max (\text {softmax}(x)) \end{aligned}$$
(6)
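A minimal sketch of how such an uncertainty map can be produced from the raw per-pixel class scores is shown below. It assumes PyTorch logits of shape (classes, height, width) and Matplotlib's nipy_spectral colormap; the exact implementation in the paper is not given.

```python
import torch
import matplotlib.pyplot as plt

def uncertainty_map(logits):
    """Turn raw per-pixel class scores into an uncertainty heat map.

    logits: tensor of shape (num_classes, H, W) from the last conv layer.
    Returns an (H, W) array with values in [0, 1]; higher means less confident.
    """
    probs = torch.softmax(logits, dim=0)         # probability per class per pixel
    uncertainty = 1.0 - probs.max(dim=0).values  # Eq. (6): 1 - max class probability
    return uncertainty.cpu().numpy()

# Example usage (hypothetical forward pass for one image):
# logits = model(image)[0]
# plt.imshow(uncertainty_map(logits), cmap="nipy_spectral")
# plt.colorbar(); plt.show()
```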
The heat maps show that the model segments out the cancerous area with high precision and confidence, while the border layers show less confidence due to the merging of overlapping patches. The uncertainty heatmaps provide an interpretable aid to diagnosis: regions of low model confidence indicate where a physician should review the model's results and intervene for an improved diagnosis.

Fig. 7 The model predictions with their respective uncertainty heatmaps. Generated mask and uncertainty heatmap of (a) BCC cancer, (b) IEC cancer, and (c) SCC cancer.

In the context of surgical applications, this model is specifically tailored to enhance the evaluation of surgical margins in skin cancer surgeries, a critical aspect that significantly impacts patient outcomes. Surgical margin clearance is a vital consideration in oncology surgery. It refers to the distance between the tumor boundary and the nearest edge of the excised tissue. Ensuring adequate margin clearance is paramount to achieving complete tumor resection and minimizing the risk of recurrence. Historically, the assessment of these margins has been a manual and subjective process, often leading to variability in interpretations and surgical outcomes.

Fig. 8 The generated surgical margin clearance. The green line highlights the location of the cancerous region. The red line indicates the clearance margin, i.e., the region to which the cancer might have spread and where the doctor should most likely perform the cut. Input image, generated mask and surgical margin clearance of (a) BCC cancer, (b) IEC cancer, and (c) SCC cancer.

Figure 8 comprises three parts: the original whole slide skin image, its segmented mask as generated by the AI model, and a visual representation of the margin clearance. The segmented mask clearly delineates the cancerous tissues, while the margin clearance visualization aids in understanding the extent of the spread of the cancer and the necessary boundary for excision (a minimal sketch of one way to derive such a boundary from the mask is given at the end of this section).

Furthermore, the integration of our AI model into clinical practice promises to standardize the evaluation of surgical margins. By providing objective and quantifiable measurements, it reduces the reliance on subjective assessments, thereby potentially decreasing the variance in surgical outcomes. Moreover, this technology can serve as a valuable educational tool for pathologists and surgeons, offering insights into the complex patterns of tumor spread in skin cancer.

Comparison with literature

The proposed approach for segmentation of WSI skin images using deep learning was compared to the approach presented in Thomas et al.23, as shown in Table 4 and Table 5. The results showed that the proposed approach performed better in most cases, with an overall increase in accuracy and an improvement in average class accuracy. These results demonstrate the effectiveness of the proposed approach and its superiority over the existing approach in the literature. There is a 5% to 10% increase over23 in mean class accuracy and overall accuracy, respectively. The improvement in accuracy can be attributed to the use of an EfficientNet-B3 backbone, which extracts and learns better features than the existing approach, and to the handling of class imbalance through data augmentation. These findings contribute to ongoing research in the field and provide valuable insights for future studies.

Table 5 Comparison of models using different performance parameters.
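To make the margin-clearance step in Fig. 8 concrete, the sketch below shows one plausible way to derive a clearance boundary from a segmented label map by morphologically dilating the cancerous region. The class indices, the margin width, and the use of scikit-image are illustrative assumptions; the paper does not specify the exact procedure.

```python
from skimage.morphology import binary_dilation, disk
from skimage.segmentation import find_boundaries

# Hypothetical class indices for the cancerous classes.
CANCER_CLASSES = {"BCC": 1, "SCC": 2, "IEC": 3}

def clearance_margin(label_map, cancer_idx, margin_px=50):
    """Return the tumor boundary (green line) and clearance boundary (red line).

    label_map: (H, W) integer array of predicted class labels.
    margin_px: assumed clearance distance in pixels; the true value would
               depend on slide resolution and clinical guidelines.
    """
    tumor = label_map == cancer_idx
    expanded = binary_dilation(tumor, footprint=disk(margin_px))
    tumor_boundary = find_boundaries(tumor, mode="outer")         # green line in Fig. 8
    clearance_boundary = find_boundaries(expanded, mode="outer")  # red line in Fig. 8
    return tumor_boundary, clearance_boundary
```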
