Leveraging immuno-fluorescence data to reduce pathologist annotation requirements in lung tumor segmentation using deep learning

The image analysis approach depends on the objectives of the clinical study. If the goal is to identify whether tumor is present in a WSI, a classification analysis is sufficient. If the team is interested in localizing the tumor in the WSI, but the exact location of small structures (e.g., whether a cell is inside or outside the tumor) is not required, an object detection analysis, which combines classification with localization of structures, is sufficient (e.g., if the distance of tumor nests from lymphocyte clusters is of interest). However, if the analysis must specify which cells in the WSI belong to tumor so that more advanced investigations can be performed on these cells (e.g., how much of a specific marker is expressed in tumor cells and how much outside the tumor), image segmentation, which performs classification for each pixel in the WSI, is needed and is the objective of the current work.

NSCLC samples

Human NSCLC samples were procured from Indivumed, Capital Biosciences, Asterand, Discovery Life Science, Cureline, BioIVT, and Tristar under institutional review board (IRB) and ethics committee approvals of the respective vendors. All experiments were conducted in accordance with national and international guidelines. Informed consent was obtained from all subjects by the respective vendors, and all experiments were conducted in accordance with the obtained IRB approvals. The study was carried out in accordance with the guidelines and principles of the Declaration of Helsinki. NSCLC samples from four different studies were included, where the H&E and immuno-fluorescence (panCK) staining and scanning were performed on 5 µm thick formalin-fixed paraffin-embedded (FFPE) sections. A total of 112 samples were procured (61 LUAD and 51 LUSC). For 11 samples, only one section was available and was used for H&E imaging (Studies 1–2); for 21 samples, two sections were available and were used for H&E and panCK imaging (Study 3); and for 80 samples, four sections were available and were used for panCK and H&E imaging at three different centers (Study 4). The IF images were used to generate panCK-based tumor annotations, and the H&E images were used for generating manual tumor annotations by pathologists. Table 1 provides details of the samples included in this study.

Table 1 Details of the NSCLC samples, including the number of samples of each subtype from each center and study, the image type (H&E or IF), the scanner type, and the number of samples that were annotated by pathologists.

Immuno-fluorescence annotations

Tissue staining with the tumor marker panCK can be performed using either immunohistochemistry (IHC) or immuno-fluorescence (IF). IF staining was selected here because it allows H&E imaging to be performed after IF imaging on the same slide, which gives perfect alignment between the H&E image and its corresponding panCK-based annotations. If IHC staining were used, the H&E staining would have to be performed on a serial section, which would yield unaltered H&E images, but the alignment between the tumor and its annotation would be less accurate. Additionally, identifying tumor in IF-stained images is more accurate, as the only signal in these images is from panCK, whereas in IHC the marker signal is added to the nuclei and tissue structure signals and quantification is not as straightforward.
Each approach (creating the annotation on the same slide or on a serial slide) has its pros and cons; using IF staining, we were able to investigate both approaches and assess their impact on model performance. The NSCLC segmentation model relies on H&E images, and having unaltered H&E would be beneficial in the panCK-based annotation generation step. However, considering that H&E is a very stable stain while IF is less stable, and that H&E is intrinsically fluorescent in a way that cannot be removed by destaining the H&E (i.e., H&E staining cannot be performed prior to IF staining), in practice IF staining is always performed first, followed by epitope retrieval and H&E staining40.

A singleplex IF assay for Pan Cytokeratin (Millipore Sigma, USA, AE1/AE3, 1 µg/mL) was performed on the FFPE human NSCLC samples. AE1 recognizes CK10, 14, 15, 16, and 19, while AE3 recognizes CK1, 2, 3, 4, 5, 6, 7, and 8. The assay was developed using the Opal technology workflow (Akoya Biosciences) on the Bond RX autostainer (Leica), as previously described41. 5 µm thick sections were obtained from each sample, mounted on charged glass slides (StatLab), baked at 60 °C for 60 min, and loaded onto the autostainer. Heat-induced epitope retrieval was then performed using the autostainer's built-in "HIER 20 min with ER2" protocol, followed by the singleplex IF assay for panCK. Slides were then removed from the autostainer, coverslipped with ProLong Gold mounting media (ThermoFisher), and scanned on the Vectra Polaris scanner (Akoya Biosciences) at 20x magnification.

The IF scan was followed by H&E staining and scanning of the same section at 40x magnification using a Leica Aperio ScanScope (AT2) scanner. The images were downsampled to 20x magnification (using bi-cubic interpolation) to match the resolution of the panCK images. The IF staining involved heat-induced epitope retrieval, followed after IF scanning by decoverslipping and re-staining with H&E. Thus, the H&E images acquired after IF scanning (hereafter called H&E-after-IF) had a slightly different color space, and some morphological changes to the tissue occurred compared with a slide that had only been stained and scanned for H&E (hereafter called H&E-only), as shown in Fig. 2.

Figure 2 Sample images of panCK (a, b), H&E-after-IF (c, d), and the corresponding H&E-only images from a serial section (e, f), showing the differences in color space between H&E-after-IF and H&E-only images. The images also show the morphological changes in the tissue due to IF staining (particularly in non-tumor tissues) and re-coverslipping (tissue fold and tear), as well as co-registration issues (mismatch between H&E-only and panCK). Color normalization transforms these images to a similar color space: H&E-after-IF (g, h), H&E-only (i, j).

The panCK images were thresholded using Visiopharm software (Hoersholm, Denmark) to generate tumor annotations and, along with the H&E-after-IF WSI, were used as the image and label pairs for training. The 80 samples in Study 4 (40 LUSC and 40 LUAD), as well as the 21 samples in Study 3 (21 LUAD), were used to generate this dataset. After quality control of the panCK and H&E-after-IF images, 68 samples were selected, and panCK-based tumor labels were generated. A non-pathologist excluded normal lung from these images by drawing a rough boundary around the normal tissue. PanCK-positive tissue labels were generated using adaptive thresholding of the panCK signal in the Visiopharm software. This process involved detecting tissue areas using the DAPI, panCK, and autofluorescence channels. For adaptive thresholding of the panCK signal, the signal was first normalized over a 500 × 500 pixel area to manage signal gradients in the panCK images. Then, any pixel with a normalized value greater than 1.4 was considered positive (this threshold was determined empirically). The panCK-positive area was then smoothed to fill small holes (panCK is a membrane marker, and cell nuclei have no signal).
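This thresholding step was performed in Visiopharm, but its logic can be expressed in a few lines of Python. The following is a minimal sketch, assuming the 500 × 500 normalization divides the signal by a local mean (the text does not specify the normalization scheme) and using morphological closing as the smoothing step; numpy and scipy stand in for the Visiopharm operations.

```python
import numpy as np
from scipy import ndimage

def panck_tumor_mask(panck, window=500, threshold=1.4):
    """Illustrative re-implementation of the adaptive panCK thresholding.

    panck: 2D float array of panCK channel intensities.
    Normalizing by a local mean over a window x window area is an
    assumption; the text only states that the signal was normalized
    over a 500 x 500 pixel area before applying the empirical 1.4
    threshold.
    """
    panck = panck.astype(np.float64)
    # Local mean over the 500 x 500 neighborhood handles signal gradients.
    local_mean = ndimage.uniform_filter(panck, size=window)
    normalized = panck / np.maximum(local_mean, 1e-6)

    # Pixels whose normalized value exceeds 1.4 are called panCK positive.
    mask = normalized > threshold

    # Smooth the mask and fill the small holes left by nuclei, which
    # carry no membrane panCK signal.
    mask = ndimage.binary_closing(mask, structure=np.ones((5, 5)))
    mask = ndimage.binary_fill_holes(mask)
    return mask
```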
These masks were then exported from Visiopharm and used as the annotations for the H&E WSI. It is worth noting that, since panCK is usually not available in a typical clinical study, the input to the segmentation model is only an H&E image. PanCK is only used for generating tumor annotations and is only required for training the segmentation model; during inference, only an H&E image is required.

Pathologist annotations

Histological slides were stained with H&E and scanned by either Leica Aperio ScanScope (AT2) or 3DHISTECH Panoramic 250 (P250) digital scanners at 40x magnification. These data were downsampled to 20x magnification (as in the IF annotation step) to generate the H&E-only images. There is significant variation in H&E images due to differences in staining and imaging protocols, scanners, centers, operators, and tumor subtypes. Incorporating all of these variations into the training data is challenging, which makes training a robust CNN model difficult (i.e., it requires a large and diverse training dataset).

In order to incorporate variations in staining and scanning, sections 2–4 of the 80 samples in Study 4 were sent to three different centers (centers 1–3), where their respective scanners (AT2 and P250) and protocols (protocols 1–3) were used for H&E staining and imaging. Additionally, 11 LUSC (Studies 1–2) and 21 LUAD (Study 3) samples that were prepared and stained by different operators and scanned with different AT2 scanners were included to increase H&E data variability. Figure 1c shows example images from each dataset, demonstrating the significant variation that can be expected in H&E images of NSCLC.

Out of these 112 images, 80 H&E images were selected for manual annotation by pathologists. The selected samples included 40 LUAD and 40 LUSC samples. Centers 1 and 2 used the same scanner type; thus, 58 samples were scanned with AT2 scanners and 22 samples with the P250 scanner. For each of the 80 H&E WSI, three 1 mm² regions of interest (ROI) were selected for manual annotation by pathologists using QuPath software42. This selection, which included both LUAD (50%) and LUSC (50%) samples, was confirmed by an expert pathologist to ensure that a wide range of NSCLC tissue morphologies, normal tissues, and tumor-adjacent tissues were included in the training data. The pathologists drew contours around the tumor cells at high magnification (20x) and excluded any non-tumor cells and structures such as stroma, normal lung cells, normal epithelium, blood vessels, and lymphocytes. To incorporate variability in the annotations between pathologists (due to years of experience and differences in their judgment calls), the images were annotated at two different centers, and at each center, multiple pathologists annotated the slides.

Out of the 80 annotated H&E WSI, 48 were selected from Study 4 (16 from each center), and 32 were selected from Studies 1–3 (all performed at Center 1). CNN model training was performed using 58 WSI (29 LUAD and 29 LUSC), and the remaining 22 WSI (11 LUAD and 11 LUSC) were kept for testing (covering all subtypes, centers, protocols, and scanners).

The pathologists annotated 174 ROIs of size 1 mm² in the training/validation data. In addition, 72 non-tumor ROIs of size 1 mm² from these WSI were included in the training data. Out of the total 246 ROIs, 15% were used as the validation dataset (36 ROIs), and the remaining 210 ROIs were used as the training dataset (covering 210 mm² over 58 WSI). This dataset was randomly downsampled (at the ROI level) to 50% (105 ROIs), 30% (70 ROIs), 20% (42 ROIs), and 10% (21 ROIs), and for each fraction, five random subsets were created, resulting in 21 training sets (including the one dataset with 100% of the training data), as sketched below.
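The ROI-level subsampling (five random subsets per fraction, plus the full set, for 21 training sets) can be sketched as follows; roi_ids is a hypothetical list of identifiers for the 210 training ROIs, as the paper does not describe its data structures.

```python
import random

def make_training_subsets(roi_ids, fractions=(0.5, 0.3, 0.2, 0.1),
                          repeats=5, seed=0):
    """Subsample the training ROIs at the ROI level.

    roi_ids: identifiers for the training ROIs (hypothetical
    representation; any hashable ROI reference works).
    Returns a dict mapping (fraction, repeat) -> sampled ROI ids,
    with the full dataset stored under (1.0, 0).
    """
    rng = random.Random(seed)
    subsets = {(1.0, 0): list(roi_ids)}
    for frac in fractions:
        n = round(len(roi_ids) * frac)
        for rep in range(repeats):
            # Sampling whole ROIs (not pixels or patches) preserves the
            # heterogeneity of the smaller datasets.
            subsets[(frac, rep)] = rng.sample(list(roi_ids), n)
    return subsets
```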
The test dataset comprised 66 ROIs of size 1 mm² from the 22 test WSI, which were annotated by the pathologists. These ROIs were evenly distributed between the two subtypes (33 LUAD and 33 LUSC) and across the three centers.

CNN architecture

The CNN used for NSCLC tumor segmentation was based on the attention U-Net43,44, which is a modified version of the U-Net architecture45 with attention gates added to each resolution level of its decoder. Models with pre-trained weights have been shown to outperform those trained from scratch in handling out-of-distribution images in digital pathology46. Thus, pre-trained weights of the VGG16 network trained on the ImageNet dataset were used for the encoder half of the attention U-Net architecture (Fig. 3).

Figure 3 The attention U-Net architecture with VGG16 weights pre-trained on the ImageNet dataset (a). For a sample NSCLC image, the outputs of the attention gates at different levels of the network are shown. These maps show how the attention gates place emphasis on the tumor tissue while minimizing the weighting of background tissues (b).

Inputs to the model were H&E patches of size 512 × 512 × 3 at 20x magnification. The original H&E images acquired at 40x were downsampled using bi-cubic interpolation to obtain the 20x H&E images. A 5 × 5 convolutional kernel was used in the convolutional layers of the model decoder: the VGG16 architecture used in the encoder has 3 × 3 kernels in its convolutional layers, which results in a receptive field of 212 × 212 pixels47; since 512 × 512 images were used as input here, the kernel size of the decoder was increased to enlarge the receptive field of the full network. The rationale for selecting this larger input size (which is commonly used in the literature for WSI46) was that tumor segmentation requires some global context to differentiate structures that are similar at the cell level (e.g., tumor cells, which are epithelial, and normal epithelium).

Model details are shown in Fig. 3, and a sketch of the architecture is given below. A regular U-Net with the same model architecture and parameters as the attention U-Net (but without attention gates) was also implemented and used for comparison.
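The following is a minimal Keras sketch of this encoder/decoder arrangement. The choice of which VGG16 layers feed the skip connections and the exact gate formulation (a simplified additive attention gate in the style of Oktay et al.) are assumptions; the paper's precise configuration is that of Fig. 3.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def attention_gate(x, g, inter_channels):
    """Simplified additive attention gate.

    x: skip-connection features; g: gating signal from the coarser
    level, already upsampled to the spatial size of x.
    """
    theta_x = layers.Conv2D(inter_channels, 1)(x)
    phi_g = layers.Conv2D(inter_channels, 1)(g)
    att = layers.Activation("relu")(layers.Add()([theta_x, phi_g]))
    att = layers.Conv2D(1, 1, activation="sigmoid")(att)
    return layers.Multiply()([x, att])  # weight skip features by attention

def decoder_block(x, skip, filters):
    # 5 x 5 decoder kernels enlarge the receptive field of the network.
    g = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    skip = attention_gate(skip, g, filters // 2)
    x = layers.Concatenate()([g, skip])
    x = layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
    return x

def build_attention_unet(input_shape=(512, 512, 3)):
    # ImageNet-pre-trained VGG16 serves as the encoder half.
    vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                      input_shape=input_shape)
    skips = [vgg.get_layer(n).output for n in
             ("block1_conv2", "block2_conv2", "block3_conv3", "block4_conv3")]
    x = vgg.get_layer("block5_conv3").output
    for skip, filters in zip(reversed(skips), (512, 256, 128, 64)):
        x = decoder_block(x, skip, filters)
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)  # per-pixel tumor prob.
    return Model(vgg.input, out, name="attention_unet_vgg16")
```

With this configuration, a 512 × 512 × 3 H&E patch yields a 512 × 512 × 1 per-pixel tumor probability map; dropping the attention_gate call turns the sketch into the regular U-Net used for comparison.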
CNN training and transfer learning

The model was trained in two steps. Initially, the weights in the encoder half of the model were frozen, and only the decoder was trained, with an initial learning rate of LR = 0.001; the LR was halved if the validation loss did not decrease for 4 epochs. Training was stopped if the validation loss did not decrease for 15 epochs, and the model with the lowest validation loss was selected. In the second step, the encoder was also trained, with an initial learning rate of LR = 0.0001 and the same LR reduction schedule and early stopping criteria. Binary cross-entropy (BCE) loss was used, and the tensorflow package (http://www.tensorflow.org) was used for implementing and training the CNN. The model was first trained using the panCK-based tumor annotations and H&E-after-IF images. This model was then fine-tuned using the pathologist annotations on H&E-only images.
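This two-step schedule maps onto standard Keras callbacks. Below is a minimal sketch, assuming the Adam optimizer and a generous epoch ceiling (neither is stated in the text); the dataset arguments are hypothetical tf.data pipelines of (H&E patch, mask) pairs, and build_attention_unet refers to the architecture sketch above.

```python
import tensorflow as tf

def fit_stage(model, train_ds, val_ds, lr):
    """One training stage with the LR schedule and early stopping above."""
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="binary_crossentropy")
    model.fit(
        train_ds, validation_data=val_ds, epochs=500,
        callbacks=[
            # Halve the LR after 4 epochs without validation improvement.
            tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                 factor=0.5, patience=4),
            # Stop after 15 stagnant epochs; keep the best-validation
            # weights (the model with the lowest validation loss).
            tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=15,
                                             restore_best_weights=True),
        ],
    )

def pretrain_and_finetune(model, panck_train, panck_val,
                          path_train, path_val):
    # Step 1: freeze the pre-trained VGG16 encoder, train the decoder only.
    for layer in model.layers:
        if layer.name.startswith("block"):  # VGG16 encoder layer names
            layer.trainable = False
    fit_stage(model, panck_train, panck_val, lr=1e-3)

    # Step 2: unfreeze and train end-to-end at a lower learning rate.
    for layer in model.layers:
        layer.trainable = True
    fit_stage(model, panck_train, panck_val, lr=1e-4)

    # Fine-tune the panCK pre-trained model on the pathologist-annotated
    # H&E-only data (sketched here as a single additional stage).
    fit_stage(model, path_train, path_val, lr=1e-4)
```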
To increase model generalizability, image augmentation was applied to the image patches during training, including rotation (up to 45°), horizontal and vertical shifts (up to 20%), zooming into the image (up to 5%), shear deformation (up to 10%), horizontal and vertical flips, and random 90°, 180°, and 270° rotations. WSI have no natural orientation, and combining rotations of up to 45° with the random 90°, 180°, and 270° rotations and the horizontal and vertical flips ensured that the image patches were rotated through all possible rotation angles. Additionally, to increase variability in the color space, color augmentation was performed using the stain intensity (hematoxylin and eosin) perturbation technique presented by Tellez et al.48,49, which is based on the Macenko color normalization approach50.

Architecture backbone

In order to assess the impact of the backbone architecture (which represents the feature extraction phase of the model) on model performance, several pre-trained backbones were tested. ResNet5027, DenseNet12128, and EfficientNet-B429 pre-trained on the ImageNet dataset were used as backbones in addition to VGG16. Moreover, we trained the Swin-UNETR architecture30 to compare the performance of architectures that use CNNs vs. transformers (here, the Swin Transformer51) for feature extraction. The MONAI platform52 was used for the Swin-UNETR model implementation. All models were trained in the same way: first using the panCK annotations, followed by fine-tuning using the pathologist annotations.

Pre-training: panCK vs. foundation models

The current training approach used panCK annotations for pre-training the model. Foundation models are another approach to pre-training. Thus, the performance of foundation models was compared against the task- and domain-specific pre-training with panCK annotations. Two foundation models were used: KimiaNet24, which is a CNN-based model (DenseNet121), and CTransPath25, which is a transformer-based model. Both foundation models were trained on pathology images, and in both cases the model was trained for a classification task. Thus, the models had to be adapted to our segmentation task: only the backbone (feature extraction) part of the segmentation model was pre-trained in the foundation model, and the decoder part of the model required training. For KimiaNet, which uses DenseNet121, the same approach as for DenseNet121 pre-trained on ImageNet data was used, with the KimiaNet weights serving as the backbone; thus, the same attention U-Net architecture was used. For CTransPath, which uses Swin transformers, the model was adapted for segmentation using the mask2former approach53. These two models were trained using the pathologist annotations only.

Evaluation

Three pixel-based model performance metrics were calculated: (1) accuracy, the ratio of true positives (tumor pixels correctly classified as tumor) and true negatives (background pixels correctly classified as background) over all pixels in the image. Considering that the majority of pixels in a WSI are usually background, accuracy places higher weight on non-tumor regions, and if a small tumor region is missed, the metric does not properly penalize it; (2) mean intersection over union (mIoU), the ratio of the true positive pixels divided by the sum of true positives, false positives (background pixels incorrectly classified as tumor), and false negatives (tumor pixels incorrectly classified as background). This metric is preferred for image segmentation tasks, as it properly penalizes errors when only a small tumor region exists in the image; (3) the Dice coefficient, which is similar to mIoU but with a different formulation. We report both Dice and mIoU in this paper to facilitate comparison with prior studies (some studies in the literature report mIoU and others report Dice, which makes comparing different studies problematic). All three metrics were calculated for each 1 mm² ROI in the test dataset, which comprised 66 ROIs from 22 WSI; a sketch of these metrics is given at the end of this section. Overall performance was reported as the mean and standard deviation or the median and interquartile range over the entire test dataset, as well as separately for the different tumor subtypes and centers.

Minimum required pathologist annotations

Generating pathologist annotations is the main bottleneck in developing segmentation models in histopathology, as it is expensive and time-consuming. To evaluate the effects of pre-training the CNN with panCK-based tumor annotations, training was performed with and without panCK-based pre-training. Moreover, to determine the minimum amount of pathologist annotations required to train a model with acceptable accuracy, the pathologist annotation dataset was subsampled at 10%, 20%, 30%, and 50%, and the model was trained using these smaller datasets. To preserve the heterogeneity of these smaller datasets, sub-sampling was performed at the ROI level (the training data included 246 ROIs of size 1 mm² annotated on the 58 training WSI). For each sub-sampling fraction, five random datasets were generated. For the case in which 100% of the training data was used, the train/validation split was performed randomly five times. The model was trained with and without the panCK-based pre-training. The same approach was used for the models with backbones pre-trained from foundation model weights (KimiaNet and CTransPath) to compare the effects of panCK vs. foundation model weights for pre-training while reducing the size of the pathologist annotation dataset. The performance of each model was then tested on the same independent test dataset (66 ROIs from 22 WSI).
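For reference, the three pixel-based metrics used throughout these evaluations reduce to confusion-matrix arithmetic over the binary tumor masks. A minimal per-ROI sketch follows; mIoU here denotes the tumor-class IoU exactly as defined in the Evaluation subsection.

```python
import numpy as np

def segmentation_metrics(pred, truth, eps=1e-9):
    """Pixel-based accuracy, IoU, and Dice for one ROI.

    pred, truth: boolean arrays of the same shape, True = tumor pixel.
    eps guards against division by zero in ROIs with no tumor at all.
    """
    tp = np.sum(pred & truth)    # tumor pixels correctly classified as tumor
    tn = np.sum(~pred & ~truth)  # background correctly classified as background
    fp = np.sum(pred & ~truth)   # background pixels incorrectly called tumor
    fn = np.sum(~pred & truth)   # tumor pixels incorrectly called background

    accuracy = (tp + tn) / (tp + tn + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)           # intersection over union
    dice = 2 * tp / (2 * tp + fp + fn + eps)  # equals 2*IoU / (1 + IoU)
    return accuracy, iou, dice
```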
