Improving quality control of whole slide images by explicit artifact augmentation

Datasets

The datasets employed in this study offer a diverse and comprehensive representation of histopathological artifacts. The ACROBAT challenge dataset19 comprises digitized whole slide images (WSIs) of FFPE surgical resection specimens from female primary breast cancer patients. Captured at 40\(\times\) magnification (0.23 \(\upmu\)m per pixel) using Hamamatsu NanoZoomer XR or NanoZoomer S360 scanners, these images exhibit a rich variety of artifacts. The evaluation focused on the validation subset, consisting of 100 cases equally divided between H&E- and IHC-stained images.

Similarly, the ANHIR challenge dataset20 provides a wide-ranging collection encompassing various tissues and pathological conditions, including lesions, lung lobes, mammary glands, colon adenocarcinoma (COAD), mouse kidney tissue, gastric mucosa, gastric adenocarcinoma tissue, breast tissue, and kidney tissue. The dataset incorporates diverse staining techniques, employing stains such as Clara cell 10 protein, proSPC, H&E, Ki-67, PECAM-1, HER-2/neu, ER, PR, cytokeratin, and podocin. Acquired with various microscopy setups and scanners (Zeiss, Leica, 3DHISTECH, and NanoZoomer), the ANHIR dataset is highly heterogeneous, encompassing magnifications from 10\(\times\) to 40\(\times\) and pixel sizes from 0.174 \(\upmu\)m/pixel to 2.294 \(\upmu\)m/pixel.

Additionally, the Radboud University dataset, provided for evaluation purposes, is the largest in both artifact count and resolution. It features professionally annotated artifacts in WSIs stained with both H&E and IHC dyes; exemplary artifacts are shown in Fig. 3. Spanning various tissue types, including bone marrow, breast tissue, colon tissue, pancreas tissue, diffuse large B-cell lymphoma (DLBCL), and images from the CAMELYON dataset21, each tissue type is characterized by a different staining type, contributing to the dataset's richness and complexity. The Radboud University dataset thus offers a substantial resource for evaluating and validating the proposed quality control and segmentation methodologies in diverse histopathological contexts.

Figure 3. Examples of artifacts from the considered datasets: (a) focus, (b) tissue, (c) dust, (d) ink, (e) air, (f) marker.

The selected artifact classes are as follows:

(i) Air. Air bubbles not strictly connected to the tissue. Because only a portion of a bubble is frequently visible in the image, it assumes an open shape, often deviating from a complete circle.

(ii) Dust. Small particles or debris that can inadvertently appear on the slides during the preparation or scanning process; dust appears both in the foreground and the background of the WSI.

(iii) Tissue. Folded or creased tissue sections that can result from various factors such as handling, processing, or mounting of the tissue slides.

(iv) Ink. Irregularities in the distribution or application of ink or staining agents on tissue slides.

(v) Marker. Annotations, such as crosses or other symbols, typically located near the corners or edges of the slide.

(vi) Focus. This artifact occurs when the focal plane of the microscope is not precisely aligned with the tissue section being captured, resulting in blurred or out-of-focus areas.

We summarize the acquired data in Table 1.

Table 1. Datasets used in the study with their respective characteristics.

Experimental setup

The experimental setup utilized Nvidia Tesla A100 graphics cards (400 W TDP, 40 GB of memory) on the PLGrid HPC cluster Athena for model training. In our experiments, we employed deep learning models trained on different datasets, denoted by shorthand notations. After annotating a limited set of training artifacts, we used our framework to blend them into other images, which yields the augmented datasets; the test datasets were left unmodified. (A minimal sketch of this blending step is given below.) Models trained exclusively on annotated data from the ACROBAT dataset are referenced as \(\textbf{ACR}\), while models trained on the augmented version of the ACROBAT dataset are denoted \(\mathbf{ACR'}\); analogously, \(\textbf{ANH}\) and \(\mathbf{ANH'}\) for ANHIR, and \(\textbf{RB}\) and \(\mathbf{RB'}\) for Radboud. Additionally, we evaluated models trained on the ACROBAT datasets against the ANHIR dataset's annotations to analyze generalizability; those models are denoted \(\mathbf{ACR_{anh}}\) and \(\mathbf{ACR'_{anh}}\), respectively. When training on the Radboud University dataset, we present two approaches: (i) with the full model trainable, \(\textbf{RB}\), and (ii) with only the last layers unfrozen, \(\mathbf{RB_s}\).
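The section does not spell out how artifacts are composited into clean images, so the following is a minimal illustrative sketch rather than the authors' implementation: it alpha-blends an annotated artifact crop into a clean patch at a random position. All names (`paste_artifact`, the mask convention) are hypothetical.

```python
import numpy as np

def paste_artifact(clean_patch: np.ndarray,
                   artifact: np.ndarray,
                   mask: np.ndarray,
                   rng: np.random.Generator) -> tuple[np.ndarray, np.ndarray]:
    """Blend an annotated artifact crop into a clean patch.

    clean_patch: (H, W, 3) uint8 image without artifacts.
    artifact:    (h, w, 3) uint8 crop containing the annotated artifact.
    mask:        (h, w) float mask in [0, 1] outlining the artifact.
    Returns the augmented patch and its full-size artifact mask.
    """
    H, W, _ = clean_patch.shape
    h, w, _ = artifact.shape
    assert h <= H and w <= W, "artifact crop must fit inside the patch"

    # Choose a random placement so the artifact lies fully inside the patch.
    y = rng.integers(0, H - h + 1)
    x = rng.integers(0, W - w + 1)

    out = clean_patch.astype(np.float32)
    alpha = mask[..., None]  # (h, w, 1), broadcasts over the RGB channels
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = alpha * artifact + (1.0 - alpha) * region

    # Keep the pasted mask so the synthetic patch remains usable as a label.
    full_mask = np.zeros((H, W), dtype=np.float32)
    full_mask[y:y + h, x:x + w] = mask
    return out.astype(np.uint8), full_mask
```

Blending through a soft mask rather than hard pasting avoids sharp seams that a classifier could latch onto instead of the artifact itself.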
Classification

The classification study is presented through receiver operating characteristic (ROC) curves accompanied by their corresponding area under the curve (AUC) scores (Table 2), offering a comprehensive evaluation of the models' performance. Figure 4 illustrates the promising initial validation results, with improvements evident when employing the augmented dataset, particularly for previously weaker outcomes. Figure 5 details the loss on the validation dataset, highlighting the mitigation of overfitting with the augmented dataset during training. Subsequent testing on additional ACROBAT annotations reveals improvements for tissue and dust artifacts, alongside a performance decrease for ink artifacts and a slight drop for focus artifacts. Evaluation on ANHIR annotations demonstrates enhancements for air, tissue, dust, and focus artifacts, tempered by slight degradation for marker and ink artifacts.

Figure 4. ROC curves for classification, evaluated on additional ACROBAT annotations unseen during training. (left) Model trained on \(\textbf{ACR}\). (right) Model trained on the augmented \(\mathbf{ACR'}\).

Figure 5. Validation loss during training on the \(\textbf{ACR}\) and \(\mathbf{ACR'}\) datasets.

In Fig. 6, the models are evaluated on a diverse set of WSIs. The augmented dataset yields improvements for air and dust artifacts, with more significant enhancements for the dust, focus, and tissue types, but a slight degradation for the ink and marker types. The performance improvement on this dataset is the lowest overall, and further analysis in Table 3 reveals that the model does not generalize well to a new dataset; the statistical tests confirm the lack of significance. Evaluation on the Radboud University dataset (Fig. 7) demonstrates an overall improvement, most notably for the initially weakest artifact, air bubbles. Better results are also observed for tissue and focus, with marginal gains for dust and a slight regression for the ink class.

Figure 6. ROC curves for classification models, evaluated on ANHIR annotations unseen during training. (left) Model trained on \(\mathbf{ACR_{anh}}\). (right) Model trained on the augmented \(\mathbf{ACR'_{anh}}\).

Figure 7. ROC curves for classification models, evaluated on Radboud University test annotations consisting of an evenly sampled 70% of all dataset annotations. (left) Model trained on \(\textbf{RB}\). (right) Model trained on the augmented \(\mathbf{RB'}\).

Figure 8. ROC curves for classification models, evaluated on Radboud University test annotations consisting of an evenly sampled 70% of all dataset annotations. Model training was limited to only the last fully connected layer. (left) Model trained on \(\mathbf{RB_s}\). (right) Model trained on the augmented \(\mathbf{RB'_s}\).

Despite freezing all layers except the last fully connected layer (2048 input features, 6 output classes), some overfitting was observed, indicated by an initial increase in loss; the loss stabilized over time, however, and the final performance exhibited less degradation for the previously challenging artifacts. The concluding experiment with this mostly frozen model (Fig. 8) highlights the overfitting issue, with a noticeable performance drop when only the annotations are used as training data; notably, this regression is absent for the augmented dataset. Comparison between the two sets reveals that the model trained on the augmented dataset outperforms the model trained solely on annotations for all artifact types, indicating the effectiveness of the proposed augmentation approach in mitigating overfitting. (A sketch of this partial fine-tuning setup is given below.)

In Fig. 9 we show the confusion matrix after thresholding; the background class was raised when no other class met the required threshold (see the fallback sketch below). The high values along the diagonal indicate that our model correctly classified instances across multiple classes. Nonetheless, we observed patterns in the misclassifications, with specific classes exhibiting higher misclassification rates, e.g., air, dust, tissue, and background. For each of these classes, our augmentations led to a performance improvement.

Table 2. Summary of the final performance of the models on each artifact type, defined by the AUROC score.

Table 3. Summary of the improvements in AUROC made by our method on each dataset and for each artifact type. All differences are reported together with a Wilcoxon signed-rank test performed on an accumulated list of patch predictions.
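The 2048-feature head with six outputs is consistent with a ResNet-50-style backbone, but the section does not name the architecture, so the following PyTorch sketch of the \(\mathbf{RB_s}\) setup is an assumption for illustration only.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical sketch of the RB_s variant: freeze the whole backbone and
# train only the final fully connected layer (2048 inputs, 6 artifact classes).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False  # freeze every pretrained layer

model.fc = nn.Linear(2048, 6)  # fresh head; its parameters stay trainable

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
```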
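The thresholding rule behind Fig. 9 can be stated compactly. The sketch below assumes per-class scores in [0, 1] and a single global threshold of 0.5; neither the scoring function nor the threshold value is specified in the text.

```python
import numpy as np

def predict_with_background(probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Assign each patch its highest-scoring artifact class, falling back
    to a background label when no class reaches the threshold.

    probs: (N, 6) per-patch artifact scores.
    Returns (N,) integer labels, where index 6 denotes background.
    """
    labels = probs.argmax(axis=1)
    no_hit = probs.max(axis=1) < threshold  # no artifact class met the threshold
    labels[no_hit] = probs.shape[1]         # raise the background class
    return labels
```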
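For the comparisons in Tables 2 and 3, per-class AUROC and the Wilcoxon signed-rank test can be computed as sketched below. Pairing the baseline and augmented per-patch scores is our reading of "an accumulated list of patch predictions", not a published implementation; the function name and array layout are assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import roc_auc_score

def compare_models(y_true: np.ndarray, probs_base: np.ndarray,
                   probs_aug: np.ndarray, class_idx: int):
    """AUROC of the baseline and augmented models for one artifact class,
    plus a Wilcoxon signed-rank test on their paired per-patch scores.

    y_true: (N, 6) binary labels; probs_base, probs_aug: (N, 6) scores.
    """
    auc_base = roc_auc_score(y_true[:, class_idx], probs_base[:, class_idx])
    auc_aug = roc_auc_score(y_true[:, class_idx], probs_aug[:, class_idx])
    # Paired test over the accumulated list of patch predictions.
    _, p_value = wilcoxon(probs_base[:, class_idx], probs_aug[:, class_idx])
    return auc_base, auc_aug, p_value
```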
