Accelerating histopathology workflows with generative AI-based virtually multiplexed tumour profiling

VirtualMultiplexer architecture
The VirtualMultiplexer is a generative AI toolkit that performs unpaired H&E-to-IHC translation. An overview of the model’s architecture is shown in Fig. 2a. The VirtualMultiplexer is trained using two sets of images: source H&E images, denoted as \({X}_{{\rm{img}}}=\{x\in {\mathcal{X}}\}\), and target IHC images, denoted as \({Y}_{{\rm{img}}}=\{\;y\in {\mathcal{Y}}\}\). Ximg and Yimg originate from different sections of the same TMA core and thus belong to the same patient, but are pixel-wise unaligned and therefore unpaired. We train an independent one-to-one VirtualMultiplexer model for each IHC marker. To train the VirtualMultiplexer, we use patches Xp = {xp ∈ Ximg} and Yp = { yp ∈ Yimg} extracted from a pair of images Ximg and Yimg, respectively. The backbone of the VirtualMultiplexer is a GAN-based generator G, specifically a CUT34 model that consists of two sequential components: an encoder Genc and a decoder Gdec. Upon training, the generator takes as input a patch xp and generates a virtual patch \({y}_{\rm{p}}^{{\prime} }\): that is, \({y}_{\rm{p}}^{{\prime} }=G({x}_{\rm{p}})={G}_{{\rm{dec}}}({G}_{{\rm{enc}}}({x}_{\rm{p}}))\). The virtually generated patches are stitched together to produce a final virtual image \({Y}_{{\rm{img}}}^{{\prime} }=\{\;{y}^{{\prime} }\in {{\mathcal{Y}}}^{{\prime} }\}\). The VirtualMultiplexer is trained under the supervision of three levels of consistency objectives: local, neighbourhood and global consistency (Fig. 2b). The neighbourhood consistency enforces effective staining translation at a patch level, where a patch captures the neighbourhood of a cell. We introduce additional global and local consistency objectives, operating at an image level and cell level, respectively, to further constrain the unpaired S2S translation and alleviate stain-specific inconsistencies.

Neighbourhood consistency
The neighbourhood objective is a combination of an adversarial loss and a patch-wise multilayer contrastive loss, implemented as previously described in CUT34 (Fig. 2b, panel 1). Briefly, the adversarial loss drives the model to eliminate style differences between real and virtual patches, and the multilayer contrastive loss enforces content preservation at a patch level54. The adversarial loss is a standard GAN min–max loss35, where the discriminator D takes as input real IHC patches Yp and IHC patches \({Y}_{\rm{p}}^{{\prime} }\) virtually generated by generator G and attempts to classify them as either real or virtual (Fig. 2b, panel 1a). It is calculated as follows:$${{\mathcal{L}}}_{{\rm{adv}}}(G,D,{X}_{\rm{p}},{Y}_{\rm{p}})={{\mathbb{E}}}_{{y}_{\rm{p}} \sim {Y}_{\rm{p}}}\log D(\;{y}_{\rm{p}})+{{\mathbb{E}}}_{{x}_{\rm{p}} \sim {X}_{\rm{p}}}\log (1-D(G({x}_{\rm{p}}))).$$
(1)
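To make the objective concrete, a minimal PyTorch sketch of the adversarial term is given below; the generator G, discriminator D and patch tensors follow the definitions above, the function name is illustrative, and the vanilla binary cross-entropy form is shown here, whereas training actually uses the least-squares GAN variant noted in the implementation details.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(D, G, x_p, y_p):
    """Sketch of the adversarial term in equation (1).

    D scores IHC patches as real (label 1) or virtual (label 0) and G
    translates an H&E patch x_p into a virtual IHC patch. Written with the
    standard GAN objective via binary cross-entropy; the paper trains with
    a least-squares GAN variant of this loss.
    """
    y_virtual = G(x_p)                                  # G(x_p), virtual IHC patch
    d_real = D(y_p)                                     # logits for real IHC patches
    d_fake = D(y_virtual.detach())                      # logits for virtual patches (D update)

    # E[log D(y_p)] + E[log(1 - D(G(x_p)))], maximised by the discriminator
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

    # The generator is updated to fool D, i.e. to push D(G(x_p)) towards 1
    d_fake_for_g = D(y_virtual)
    loss_g = F.binary_cross_entropy_with_logits(d_fake_for_g, torch.ones_like(d_fake_for_g))
    return loss_d, loss_g
```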
The patch-wise multilayer contrastive loss follows an NCE concept as presented in refs. 54,55 and reused in refs. 29,34. Specifically, it aims to maximize the resemblance between an input H&E patch xp ∈ Xp and the corresponding virtually synthesized IHC patch \({y}_{\rm{p}}^{{\prime} }\in {Y}_{\rm{p}}^{{\prime} }\) (Fig. 2b, panel 1b). We first extract a query subpatch \({y}_{\rm{sp}}^{{\prime} }\) of size 64 × 64 from the target IHC domain patch \({y}_{\rm{p}}^{{\prime} }\) (purple square in Fig. 2b, panel 1b) and match it to the corresponding subpatch xsp: that is, a subpatch at the same spatial location as \({y}_{\rm{sp}}^{{\prime} }\) but from the H&E source domain patch xp (black square in Fig. 2b, panel 1b). Because both subpatches originate from the exact same tissue neighbourhood, we expect that xsp and \({y}_{\rm{sp}}^{{\prime} }\) form a positive pair. We also sample N subpatches \(\{{x}_{\rm{sp}}^{-}\}\) at different spatial locations from xp (red squares in Fig. 2b, panel 1b) and expect that they form dissimilar, negative pairs with \({y}_{\rm{sp}}^{{\prime} }\). In a standard contrastive learning scheme, we would map \({y}_{\rm{sp}}^{{\prime} }\), xsp and \(\{{x}_{\rm{sp}}^{-}\}\) to a d-dimensional embedding space \({{\mathbb{R}}}^{d}\) via Genc and project them to a unit sphere, resulting in v, v+ and \(\{{v}^{-}\}\in {{\mathbb{R}}}^{d}\), respectively, and then estimate the probability of the positive pair (v, v+) being selected over the negative pairs \((v,{v}_{n}^{-}),\forall n\in N\) as a cross-entropy loss with a temperature scaling parameter τ:$${\mathcal{L}}(v,{v}^{+},{v}^{-})=-\log \left[\frac{\exp (v{v}^{+}/\tau )}{\exp (v{v}^{+}/\tau )+\mathop{\sum }\nolimits_{n = 1}^{N}\exp (v{v}_{n}^{-}/\tau )}\right]$$
(2)
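Below is a minimal sketch of this contrastive term, assuming the embeddings have already been computed by Genc and the projection head and normalized onto the unit sphere; variable names and the default τ = 0.08 (the value reported in the implementation details) are illustrative.

```python
import torch

def nce_loss(v, v_pos, v_neg, tau=0.08):
    """Sketch of the contrastive loss in equation (2).

    v     : (d,)   embedding of the query subpatch y'_sp
    v_pos : (d,)   embedding of the positive subpatch x_sp
    v_neg : (N, d) embeddings of the N negative subpatches {x_sp^-}
    All vectors are assumed projected onto the unit sphere, so the dot
    products below are cosine similarities scaled by temperature tau.
    """
    pos = torch.exp(torch.dot(v, v_pos) / tau)        # exp(v . v+ / tau)
    neg = torch.exp(v_neg @ v / tau).sum()            # sum_n exp(v . v_n- / tau)
    return -torch.log(pos / (pos + neg))
```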
Here, we use a variation of the loss in equation (2), specifically a patch-wise multilayer contrastive loss that extends \({\mathcal{L}}(v,{v}^{+},{v}^{-})\) by computing it for feature maps extracted from L layers of Genc (refs. 29,34). This is achieved by passing the L feature maps of xp and \({y}_{\rm{p}}^{{\prime} }\) through a two-layer multilayer perceptron (MLP) Hl, resulting in stacks of features \({\{{z}_{l}\}}_{L}={\{{H}_{l}({G}_{{\rm{enc}}}^{l}({x}_{\rm{p}}))\}}_{L}\) and \({\{{z}_{l}^{{\prime} }\}}_{L}={\{{H}_{l}({G}_{{\rm{enc}}}^{l}(\;{y}_{\rm{p}}^{{\prime} }))\}}_{L}={\{{H}_{l}({G}_{{\rm{enc}}}^{l}(G({x}_{\rm{p}})))\}}_{L}\), ∀ l ∈ {1, 2, ⋯ , L}, respectively. We also iterate over each spatial location s ∈ {1, ⋯ , Sl} and leverage all remaining locations Sl∖s as negatives, ultimately resulting in \({z}_{l,s}^{{\prime} }\), zl,s and \({z}_{l,{S}_{l}\backslash s}\) for the query, positive and negative subpatches, respectively (purple, black and red boxes in Fig. 2b, panel 1b). The final patch-wise multilayer contrastive loss is computed as$${{\mathcal{L}}}_{{\rm{contrastive}}}(G,H,{X}_{\rm{p}})={{\mathbb{E}}}_{{x}_{\rm{p}} \sim {X}_{\rm{p}}}\mathop{\sum }\limits_{l=1}^{L}\mathop{\sum }\limits_{s=1}^{{S}_{l}}{\mathcal{L}}\left({z}_{l,s}^{{\prime} },{z}_{l,s},{z}_{l,{S}_{l}\backslash s}\right)$$
(3)
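A sketch of how this loss could be evaluated is shown below; it assumes the projected features of xp and \({y}_{\rm{p}}^{{\prime} }\) are already available per layer and exploits the fact that, with all spatial locations of a layer stacked into one matrix, equation (3) reduces to a row-wise cross-entropy with the matching location as the positive class. Names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def multilayer_contrastive_loss(feats_x, feats_y, tau=0.08):
    """Sketch of the patch-wise multilayer contrastive loss in equation (3).

    feats_x : list of L tensors z_l  of shape (S_l, d), projected features of
              the H&E patch x_p at the selected encoder layers.
    feats_y : list of L tensors z'_l of shape (S_l, d), projected features of
              the virtual IHC patch y'_p at the same layers and locations.
    For each layer l, the diagonal of the similarity matrix holds the positive
    pairs (same spatial location s) and the off-diagonal entries act as the
    negatives z_{l, S_l \\ s}.
    """
    total = 0.0
    for z_l, z_lp in zip(feats_x, feats_y):            # iterate over layers l
        sim = (z_lp @ z_l.T) / tau                     # (S_l, S_l) similarity matrix
        targets = torch.arange(sim.shape[0])           # positives lie on the diagonal
        total = total + F.cross_entropy(sim, targets, reduction='sum')
    return total
```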
We also employ the contrastive loss \({{\mathcal{L}}}_{{\rm{contrastive}}}(G,H,{Y}_{\rm{p}})\) on patches yp ∈ Yp, a domain-specific version of the identity loss56,57 that prevents the generator G from making unnecessary changes, as proposed in ref. 34. Finally, the overall neighbourhood consistency objective is computed as a weighted sum of the adversarial loss (equation (1)) and the multilayer contrastive loss (equation (3)) with regularization hyperparameter λNCE:$$\begin{array}{ll}{{\mathcal{L}}}_{{\rm{neighbourhood}}}={{\mathcal{L}}}_{{\rm{adv}}}(G,D,{X}_{\rm{p}},{Y}_{\rm{p}})+{\lambda }_{{\rm{NCE}}}\times \left({{\mathcal{L}}}_{{\rm{contrastive}}}(G,H,{X}_{\rm{p}})\right.\\\qquad\qquad\qquad\;\;\,\left.+{{\mathcal{L}}}_{{\rm{contrastive}}}(G,H,{Y}_{\rm{p}})\right)\end{array}$$
(4)
Global consistency
Inspired by seminal work in neural style transfer58, this objective consists of two loss functions: a content loss \({{\mathcal{L}}}_{{\rm{content}}}\) and a style loss \({{\mathcal{L}}}_{{\rm{style}}}\) that together enforce biological consistency in terms of both tissue composition and staining pattern at the image (tile) level (Fig. 2b, panel 2). Because the generated IHC images should be virtually paired to their corresponding input H&E image in terms of tissue composition, the content loss penalizes content discrepancies between H&E and virtual IHC images at the tile level. First, real patches Xp and synthesized patches \({Y}_{\rm{p}}^{{\prime} }\) are stitched to create images Ximg and \({Y}_{{\rm{img}}}^{{\prime} }\), respectively, and corresponding tiles of size 1,024 × 1,024 are extracted (boxes in Fig. 2b, panel 2), denoted as Xt = {xt ∈ Ximg} and \({Y}_{t}^{{\prime} }=\{\;{y}_{t}^{{\prime} }\in {Y}_{{\rm{img}}}^{{\prime} }\}\), respectively. Then the tiles are encoded by a pretrained feature extractor F, specifically VGG16 (ref. 59) pretrained on ImageNet60. The tile-level content loss at layer l of F is calculated as$${{\mathcal{L}}}_{{\rm{content}}}^{l}\left(F,{X}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} }\right)=\frac{\sum | | {F}^{l}({x}_{t})-{F}^{l}(\;{y}_{t}^{{\prime} })| {| }^{2}}{{h}^{l}\cdot {w}^{l}\cdot {c}^{l}}$$
(5)
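A minimal sketch of this content term is given below, assuming a frozen torchvision VGG16 as the feature extractor F and a single content layer; the truncation point and tensor shapes are illustrative.

```python
import torch
from torchvision.models import vgg16

# Frozen feature extractor F: VGG16 pretrained on ImageNet, truncated at an
# early ReLU (the slice features[:9] ends at relu2_2; an illustrative choice).
_vgg = vgg16(weights='IMAGENET1K_V1').features[:9].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def content_loss(x_t, y_t_prime):
    """Sketch of the tile-level content loss in equation (5): a normalised
    squared difference between the VGG16 feature maps of an H&E tile x_t and
    the corresponding virtual IHC tile y'_t (tensors of shape (B, 3, 1024, 1024))."""
    f_x = _vgg(x_t)                      # F^l(x_t),  shape (B, c_l, h_l, w_l)
    f_y = _vgg(y_t_prime)                # F^l(y'_t)
    _, c, h, w = f_x.shape
    return ((f_x - f_y) ** 2).sum() / (h * w * c)
```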
In equation (5), h, w and c are the height, width and channel dimensions of the feature map at the lth layer, respectively.

The style loss utilizes the synthesized image \({Y}_{{\rm{img}}}^{{\prime} }\) and the available real image Yimg to match the style, or overall staining distribution, between real and virtual IHC images. Because \({Y}_{{\rm{img}}}^{{\prime} }\) and Yimg do not have pixel-wise correspondence, large tiles \({Y}_{t}^{{\prime} }=\{\;{y}_{t}^{{\prime} }\in {Y}_{{\rm{img}}}^{{\prime} }\}\) and Yt = { yt ∈ Yimg} are extracted at random such that each tile captures a representative staining distribution. Next, \({Y}_{t}^{{\prime} }\) and Yt are processed by F to produce feature maps across multiple layers. The style loss is computed as$${{\mathcal{L}}}_{{\rm{style}}}^{l}\left(F,{Y}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} }\right)=\frac{\sum | | {\mathcal{G}}\circ {F}^{l}(\;{y}_{t})-{\mathcal{G}}\circ {F}^{l}(\;{y}_{t}^{{\prime} })| {| }^{2}}{| | {\mathcal{G}}\circ {F}^{l}(\;{y}_{t})| {| }^{2}+| | {\mathcal{G}}\circ {F}^{l}(\;{y}_{t}^{{\prime} })| {| }^{2}}$$
(6)
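The following sketch illustrates the Gram-matrix formulation of this style term for a single layer, assuming the VGG16 feature maps of the real and virtual tiles have already been extracted; names are illustrative.

```python
import torch

def gram_matrix(feat):
    """Gram matrix of a feature map of shape (B, C, H, W): channel-by-channel
    correlations that summarise the 'style' of the activations."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (h * w)

def style_loss(f_real, f_virtual):
    """Sketch of the normalised style loss in equation (6) for one layer l,
    given feature maps F^l(y_t) and F^l(y'_t) of a real and a virtual IHC tile."""
    g_real, g_virtual = gram_matrix(f_real), gram_matrix(f_virtual)
    num = ((g_real - g_virtual) ** 2).sum()
    den = (g_real ** 2).sum() + (g_virtual ** 2).sum()   # normalisation term
    return num / den
```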
In equation (6), \({\mathcal{G}}\) is the Gram matrix, which measures the correlations between all the styles in a feature map. The denominator is a normalization term that compensates for the under- or overstylization of the tiles in a batch61. The overall global consistency loss is computed as$${{\mathcal{L}}}_{{\rm{global}}}={\lambda }_{{\rm{content}}}\times \mathop{\sum }\limits_{l}^{{L}_{{\rm{content}}}}{{\mathcal{L}}}_{{\rm{content}}}^{l}\left(F,{X}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} }\right)+{\lambda }_{{\rm{style}}}\times \mathop{\sum }\limits_{l}^{{L}_{{\rm{style}}}}{{\mathcal{L}}}_{{\rm{style}}}^{l}\left(F,{Y}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} }\right)$$
(7)
where Lcontent and Lstyle are the lists of content and style layers of F, respectively, used to extract the feature matrices, and λcontent and λstyle are regularization hyperparameters for the respective loss terms.

Local consistency
The local consistency objective aims to enforce biological consistency at a local cell level and consists of two loss terms: a cell discriminator loss (\({{\mathcal{L}}}_{{\rm{cellDisc}}}\)) and a cell classification loss (\({{\mathcal{L}}}_{{\rm{cellClass}}}\)) (Fig. 2b, panel 3). The cell discriminator loss is inspired by ref. 26 and uses the cell discriminator Dcell to identify whether a cell is real or virtual, in the same way that the patch discriminator of equation (1) attempts to classify patches as real or virtual. \({{\mathcal{L}}}_{{\rm{cellDisc}}}\) takes as input a real (Yp) and a virtual (\({Y}_{\rm{p}}^{{\prime} }\)) target patch and their corresponding cell masks (\({M}_{{Y}_{\rm{p}}}\) and \({M}_{{Y}_{\rm{p}}^{{\prime} }}\), respectively), which contain bounding-box demarcations of the cells (Fig. 2b, panel 3). Dcell comprises a feature extractor followed by a RoIAlign layer62 and a final discriminator. The goal of Dcell is to output \({D}_{{\rm{cell}}}({Y}_{\rm{p}},{M}_{{Y}_{\rm{p}}})\to {1}\) and \({D}_{{\rm{cell}}}({Y}_{\rm{p}}^{{\prime} },{M}_{{Y}_{\rm{p}}^{{\prime} }})\to {0}\), where 1 and 0 denote real and virtual cells (shown in black and purple, respectively, in Fig. 2b, panel 3). The cell discriminator loss is defined as$$\begin{array}{l}{{\mathcal{L}}}_{{\rm{cellDisc}}}({D}_{{\rm{cell}}},{Y}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} },{M}_{{Y}_{\rm{p}}},{M}_{{Y}_{\rm{p}}^{{\prime} }})=\frac{1}{2}\,{{\mathbb{E}}}_{{y}_{\rm{p}}\in {Y}_{\rm{p}}}{\left({D}_{{\rm{cell}}}(\;{y}_{\rm{p}},{M}_{{y}_{\rm{p}}})-1\right)}^{2}\\\qquad\qquad\qquad\qquad\qquad\qquad\qquad\quad+\frac{1}{2}\,{{\mathbb{E}}}_{{y}_{\rm{p}}^{{\prime} }\in {Y}_{\rm{p}}^{{\prime} }}{\left({D}_{{\rm{cell}}}\left({y}_{\rm{p}}^{{\prime} },{M}_{{y}_{\rm{p}}^{{\prime} }}\right)\right)}^{2}\end{array}$$
(8)
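A minimal sketch of this least-squares term is given below, assuming Dcell returns one score per cell given a patch and its cell bounding boxes; names are illustrative.

```python
import torch

def cell_discriminator_loss(D_cell, y_p, y_p_virtual, M_real, M_virtual):
    """Sketch of the least-squares cell discriminator loss in equation (8).

    D_cell takes an IHC patch and its cell bounding boxes (the cell mask M)
    and returns one score per cell; real cells are pushed towards 1 and
    virtual cells towards 0.
    """
    d_real = D_cell(y_p, M_real)                 # scores for real IHC cells
    d_virtual = D_cell(y_p_virtual, M_virtual)   # scores for virtual IHC cells
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_virtual ** 2).mean()
```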
Although Dcell aims to enforce the generation of realistic-looking cells, it is agnostic to their marker expression, as it does not explicitly capture which cells have a positive or a negative staining status. To account for this, we introduce an additional loss via a classifier Fcell that is trained to explicitly predict the cell staining status. This is achieved with the help of cell labels \({C}_{{Y}_{\rm{p}}^{{\prime} }}\) and \({C}_{{Y}_{\rm{p}}}\): that is, binary variables depicting the positive or negative staining status of a cell (indicated as 1: yellow and 0: blue boxes in Fig. 2b, panel 3). The computation of cell masks and labels is described in detail in the section ‘Cell masking and labelling of IHC images’. The cell-level classification loss is computed as a cross-entropy loss:$$\begin{array}{l}{{\mathcal{L}}}_{{\rm{cellClass}}}\left({F}_{{\rm{cell}}},{Y}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} },{M}_{{Y}_{\rm{p}}},{M}_{{Y}_{\rm{p}}^{{\prime} }},{C}_{{Y}_{\rm{p}}},{C}_{{Y}_{\rm{p}}^{{\prime} }}\right)\\=\frac{-1}{| {C}_{{y}_{\rm{p}}}| }\displaystyle\mathop{\sum }\limits_{{j=1}\atop {i\in \{0,1\}}}^{| {C}_{{y}_{\rm{p}}}| }\!{{\mathbb{1}}}_{{C}_{{y}_{\rm{p}}}^{\,j} = i}\times \log \left(p\left({M}_{{y}_{\rm{p}}}^{\;j}\times {y}_{\rm{p}}\right)\right)\\\quad+\frac{-1}{| {C}_{{y}_{\rm{p}}^{{\prime} }}| }\displaystyle\mathop{\sum }\limits_{{j=1}\atop {i\in \{0,1\}}}^{| {C}_{{y}_{\rm{p}}^{{\prime} }}| }\!{{\mathbb{1}}}_{{C}_{{y}_{\rm{p}}^{{\prime} }}^{\,j} = i}\times \log \left(p\left({M}_{{y}_{\rm{p}}^{{\prime} }}^{\;j}\times {y}_{\rm{p}}^{{\prime} }\right)\right)\end{array}$$
(9)
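For illustration, one term of this cross-entropy loss could be computed as in the sketch below, assuming Fcell returns per-cell logits given a patch and its cell mask; names are illustrative.

```python
import torch
import torch.nn.functional as F

def cell_classification_loss(F_cell, y_p, M_yp, C_yp):
    """Sketch of one term of the cross-entropy loss in equation (9).

    F_cell returns per-cell staining-status logits for the cells defined by
    the bounding boxes in M_yp, and C_yp holds their binary labels
    (1: positive, 0: negative staining). The analogous term is evaluated on
    the virtual patch y'_p with M_{y'_p} and C_{y'_p}, and the two are summed.
    """
    logits = F_cell(y_p, M_yp)             # shape (|C_yp|, 2), per-cell class logits
    return F.cross_entropy(logits, C_yp)   # mean over the |C_yp| cells
```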
In equation (9), \(| {C}_{{y}_{\rm{p}}}|\) and \(| {C}_{{y}_{\rm{p}}^{{\prime} }}|\) are the numbers of cells in yp and \({y}_{\rm{p}}^{{\prime} }\), respectively, \({{\mathbb{1}}}_{(.)}\) is the indicator function and p(.) denotes the cell-level probabilities predicted by Fcell.

The overall local consistency loss is computed as$$\begin{array}{l}{{\mathcal{L}}}_{{\rm{local}}}\;=\;{\lambda }_{{\rm{cellDisc}}}\times {{\mathcal{L}}}_{{\rm{cellDisc}}}\left({D}_{{\rm{cell}}},{Y}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} },{M}_{{Y}_{\rm{p}}},{M}_{{Y}_{\rm{p}}^{{\prime} }}\right)\\\qquad\qquad\;+\;{\lambda }_{{\rm{cellClass}}}\times {{\mathcal{L}}}_{{\rm{cellClass}}}\left({F}_{{\rm{cell}}},{Y}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} },{M}_{{Y}_{\rm{p}}},{M}_{{Y}_{\rm{p}}^{{\prime} }},{C}_{{Y}_{\rm{p}}},{C}_{{Y}_{\rm{p}}^{{\prime} }}\right)\end{array}$$
(10)
where λcellDisc and λcellClass are the regularization hyperparameters for the cell discriminator and classification loss terms, respectively. Importantly, the local consistency loss can be easily generalized to any other cellular or tissue component (for example, nuclei, glands) that might be relevant to other S2S translation problems, provided that corresponding masks and labels are available.

The complete objective function for optimizing the VirtualMultiplexer is given as$${{\mathcal{L}}}_{{\rm{VirtualMultiplexer}}}={{\mathcal{L}}}_{{\rm{neighbourhood}}}+{{\mathcal{L}}}_{{\rm{global}}}+{{\mathcal{L}}}_{{\rm{local}}}$$
(11)
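A compact sketch of how the three consistency objectives could be assembled into this overall loss is given below; the dictionary keys and weight names are illustrative, and the content and style entries are assumed to be already summed over their respective VGG16 layers.

```python
def virtual_multiplexer_loss(losses, weights):
    """Sketch of the overall objective in equation (11), combining the
    neighbourhood (equation (4)), global (equation (7)) and local
    (equation (10)) consistency terms from their individual sub-losses.

    losses  : dict of scalar tensors with keys 'adv', 'nce_x', 'nce_y',
              'content', 'style', 'cell_disc' and 'cell_class'.
    weights : dict with the regularisation hyperparameters lambda_*.
    """
    neighbourhood = losses['adv'] + weights['nce'] * (losses['nce_x'] + losses['nce_y'])
    global_consistency = (weights['content'] * losses['content']
                          + weights['style'] * losses['style'])
    local = (weights['cell_disc'] * losses['cell_disc']
             + weights['cell_class'] * losses['cell_class'])
    return neighbourhood + global_consistency + local
```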
Cell masking and labelling of IHC images
As already discussed, the local consistency loss of equation (10) needs as input cell masks \({M}_{{X}_{\rm{p}}},{M}_{{Y}_{\rm{p}}}\) and cell labels \({C}_{{X}_{\rm{p}}},{C}_{{Y}_{\rm{p}}}\). However, acquiring these inputs manually for all patches across all antibodies is practically prohibitive, even for relatively small datasets. Automatic nuclei segmentation/detection using pretrained models (for example, HoVerNet63) is a standard task for H&E images, but no such model exists for IHC images. To circumvent this challenge, we use an attractive property of the VirtualMultiplexer: its ability to synthesize virtual images that are pixel-wise aligned with their input, in either translation direction between the source and target domains. Specifically, we train a separate instance of the VirtualMultiplexer that performs IHC → H&E translation. The VirtualMultiplexerIHC→H&E is trained using the neighbourhood and global consistency objectives, as previously described. Once trained, it is used to synthesize a virtual H&E image \({X}_{{\rm{img}}}^{{\prime} }\) from a real IHC image Yimg. At this point, we can leverage HoVerNet63 to detect cell nuclei on real and virtual H&E images (Ximg and \({X}_{{\rm{img}}}^{{\prime} }\)) and simply transfer the corresponding cell masks (\({M}_{{X}_{{\rm{img}}}}\) and \({M}_{{X}_{{\rm{img}}}^{{\prime} }}\)) to their pixel-wise aligned IHC counterparts (\({Y}_{{\rm{img}}}^{{\prime} }\) and Yimg, respectively) to acquire \({M}_{{Y}_{{\rm{img}}}^{{\prime} }}\) and \({M}_{{Y}_{{\rm{img}}}}\). This ‘trick’ eliminates the need to train individual cell detection models for each IHC antibody and fully automates the cell masking process in the IHC domain. To acquire cell labels \({C}_{{Y}_{{\rm{img}}}^{{\prime} }}\) and \({C}_{{Y}_{{\rm{img}}}}\), we use only region annotations in Yimg, where the experts partially annotated areas as positive or negative stainings in a few representative images. Because IHC stainings are specialized in delineating positive or negative staining status, the annotation was easy and fast, requiring approximately 2–3 minutes per image and per antibody marker. We also train cell detectors for the source and target domain: that is, \({D}_{{\rm{cell}}}^{{\rm{source}}}\) and \({D}_{{\rm{cell}}}^{{\rm{target}}}\), respectively. Provided with the annotations, \({D}_{{\rm{cell}}}^{{\rm{target}}}\) is trained as a CNN patch classifier. The classifier predictions on Yimg combined with \({M}_{{Y}_{\rm{p}}}\) result in \({C}_{{Y}_{p}}\). The above region predictions on Yimg are transferred onto \({X}_{{\rm{img}}}^{{\prime} }\). Afterwards, \({X}_{{\rm{img}}}^{{\prime} }\) and the transferred annotations are used to train \({D}_{{\rm{cell}}}^{{\rm{source}}}\) as a CNN patch classifier. The classifier predictions on Ximg combined with \({M}_{{X}_{\rm{p}}}\) result in \({C}_{{X}_{p}}\).

Implementation and training details
The architectural choices of the VirtualMultiplexer were set as follows: G is a ResNet64 with nine residual blocks, D is a PatchGAN discriminator12, Dcell includes four stride-2 feature convolutions followed by a RoIAlign layer and a discrimination layer, and Fcell includes four stride-2 feature convolutions and a two-layer MLP. We use Xavier weight initialization65, instance normalization66 and a batch size of one image. We use the least-squares GAN loss67 for \({{\mathcal{L}}}_{{\rm{adv}}}\).
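As an illustration of the Dcell design described above (four stride-2 convolutions, RoIAlign and a discrimination layer), a minimal PyTorch sketch follows; the channel widths, RoI output size and class name are assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class CellDiscriminator(nn.Module):
    """Sketch of the cell discriminator D_cell: four stride-2 convolutions,
    RoIAlign over the cell bounding boxes, and a final discrimination layer."""

    def __init__(self, in_ch=3, base=64, roi_size=8):
        super().__init__()
        chs = [in_ch, base, base * 2, base * 4, base * 8]
        self.features = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(chs[i], chs[i + 1], 4, stride=2, padding=1),
                          nn.LeakyReLU(0.2, inplace=True))
            for i in range(4)
        ])
        self.roi_size = roi_size
        self.head = nn.Linear(chs[-1] * roi_size * roi_size, 1)   # real vs virtual

    def forward(self, patch, boxes):
        # patch: (B, 3, H, W); boxes: list of (K_i, 4) cell bounding boxes per
        # image in (x1, y1, x2, y2) patch coordinates, i.e. the cell mask M.
        feat = self.features(patch)                                # downscaled 16x
        rois = roi_align(feat, boxes, output_size=self.roi_size,
                         spatial_scale=feat.shape[-1] / patch.shape[-1])
        return self.head(rois.flatten(1))                          # one score per cell
```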
The model hyperparameters for the loss terms of the VirtualMultiplexer are set as follows: λNCE is 1 with temperature τ equal to 0.08, λcontent ∈ {0.01, 0.1}, λstyle ∈ {5, 10}, λcellDisc ∈ {0.5, 1} and λcellClass ∈ {0.1, 0.5}. The VirtualMultiplexer is optimized for 125 epochs using the Adam optimizer68 with momentum parameters β1 = 0.5 and β2 = 0.999. Different learning rates (lr) are employed for the different consistency objectives: for neighbourhood consistency, lrG and lrD are set to 0.0002; for global consistency, lrG is chosen from {0.0001, 0.0002}; and for local consistency, \({\text{lr}}_{{D}_{{\rm{cell}}}}\) and \({\text{lr}}_{{F}_{{\rm{cell}}}}\) are chosen from {0.00001, 0.0001, 0.0002}. Among other hyperparameters, the number of tiles extracted per image to compute \({{\mathcal{L}}}_{{\rm{content}}}\) and \({{\mathcal{L}}}_{{\rm{style}}}\) is set to eight; the content layer in F is relu2_2; the style layers are relu1_2, relu2_2, relu3_3 and relu4_3; and the number of cells per patch to compute \({{\mathcal{L}}}_{{\rm{cellDisc}}}\) is set to eight.

GT architecture
The GT architecture, proposed by ref. 41, fuses a graph neural network and a vision transformer (ViT) to process histopathology images. The graph neural network operates on a graph-structured representation of a histopathology image, where the nodes and edges of the graph denote patches and interpatch spatial connectivity, respectively, and the nodes encode patch features extracted from a pretrained ResNet-50 network64. The graph representation undergoes graph convolutions to contextualize the node features within the local tissue neighbourhood. Specifically, the GT employs a graph convolution layer69 to learn contextualized node embeddings by propagating and aggregating neighbourhood node information. Subsequently, a ViT layer operates on the contextualized node features, leverages self-attention to weigh the importance of the nodes and aggregates the node information to render an image-level feature representation. Finally, an MLP maps the image-level features to a downstream image label. Note that histopathology images can have different spatial dimensions; therefore, their graph representations can have a varying number of nodes. Also, the number of nodes can be very high when operating on gigapixel-sized WSIs. These two factors can potentially hinder the integration of the graph convolution layer with the ViT layer. To address these challenges, the GT introduces a mincut pooling layer70, which reduces the number of nodes to a fixed number of tokens while preserving the local neighbourhood information of the nodes.

Implementation and training details
The architecture of the GT follows the official implementation on GitHub (https://github.com/vkola-lab/tmi2022). Each input image was cropped to create a bag of 256 × 256 non-overlapping patches at ×10 magnification, and background patches with a non-tissue area greater than 10% were discarded. The patches were encoded using a ResNet-50 (ref. 64) model pretrained on the ImageNet dataset60. A graph representation was constructed from the patches using an eight-node (8-connectivity) neighbourhood pattern, as sketched below. The GT network consisted of one graph convolutional layer, and the ViT layer configurations were set as follows: number of ViT blocks = 3, MLP size = 128, embedding dimension of each patch = 32 and number of attention heads = 8.
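A minimal sketch of this patch-graph construction with 8-connectivity is shown below; the function name and the (row, column) grid-coordinate input are illustrative assumptions, and the output follows the edge-index convention of PyTorch Geometric.

```python
import numpy as np

def build_patch_graph(patch_coords):
    """Sketch of the patch-graph construction used by the graph transformer:
    each node is a 256x256 patch and edges connect patches that are spatial
    neighbours under 8-connectivity. `patch_coords` is an (N, 2) array of
    (row, col) patch-grid positions; returns an edge list of node-index pairs.
    """
    index = {tuple(rc): i for i, rc in enumerate(map(tuple, patch_coords))}
    edges = []
    for (r, c), i in index.items():
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) == (0, 0):
                    continue                         # skip self-loops
                j = index.get((r + dr, c + dc))      # neighbouring patch, if present
                if j is not None:
                    edges.append((i, j))
    return np.asarray(edges).T                       # shape (2, num_edges)
```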
The model hyperparameters were set as follows: number of clusters in mincut pooling ∈ {50, 100}, Adam optimizer with an initial learning rate of {0.0001, 0.00001}, a cosine annealing scheme for learning rate scheduling and a mini-batch size of eight. The GT models were trained for 400 epochs with early stopping.

Datasets
The VirtualMultiplexer was trained using the EMPaCT TMA dataset; an independent subset of EMPaCT was used for internal testing. The VirtualMultiplexer was further evaluated in a zero-shot fashion—that is, without any retraining or fine-tuning—on three external prostate cancer datasets (prostate cancer WSIs, SICAP42 and PANDA43 needle biopsies), on an independent PDAC dataset (PDAC TMAs) and on TCGA data from breast and colorectal cancer. In all cases, independent GTs were trained and tested for individual datasets using both real and virtually stained samples to address various downstream classification tasks. Details on all datasets used follow.

EMPaCT
The dataset contains TMAs from 210 primary prostate tissues as part of EMPaCT and the Institute of Tissue Pathology in Bern. The study followed the guidelines of the World Medical Association Declaration of Helsinki 1964, updated in October 2013, and was conducted after approval by the Ethics Committees of Bern (CEC ID2015-00128). For each patient, four cores were selected, with two of them representing a low Gleason pattern and the other two a high Gleason pattern. Consecutive slices from each core were stained with H&E and IHC using multiple antibodies against nuclear markers NKX3.1 and AR, tumour markers p53 and ERG, and membrane markers CD44 and CD146. TMA FFPE sections of 4 μm were deparaffinized and used for heat-mediated antigen retrieval (citrate buffer, pH 6, Vector Labs; or Tris-HCl, pH 9). Sections were blocked for 10 min in 3% H2O2, followed by 30 min room temperature incubation in 1% bovine serum albumin in phosphate-buffered saline–0.1% Tween 20. The following antibodies were used: anti-AR (Dako Agilent, catalogue no. M3562, AR441, 1:100 dilution), anti-NKX3.1 (Athena Enzyme Systems, catalogue no. 314, lot 18025, 1:200), anti-p53 (Dako Agilent, catalogue no. M7001, DO-7, 1:800), anti-CD44 (Abcam, catalogue no. ab16728, 156-3C11, 1:2000), anti-ERG (Abcam, catalogue no. ab133264, EPR3864(2), 1:500) and anti-CD146 (Abcam, catalogue no. ab75769, EPR3208, 1:500). Images were acquired using a 3D Histech Panoramic Flash II 250 scanner at ×20 magnification (resolution 0.24 μm per pixel). The cores were annotated at patient level by expert uro-pathologists with binary labels for overall survival status (0, alive/censored; 1, prostate-cancer-related death) and disease progression status (0, no recurrence; 1, recurrence). Clinical follow-up was recorded on a per-patient basis, with a maximum follow-up time of 12 years. For both the survival and disease progression clinical endpoints, the available data were imbalanced in terms of class distributions. Access is possible upon request to the corresponding authors. The distribution of cores per clinical endpoint for the EMPaCT dataset is summarized in Supplementary Table 2.

Prostate cancer WSIs
Primary stage prostate cancer FFPE tissue sections (4 μm) were deparaffinized and used for heat-mediated antigen retrieval (citrate buffer, pH 6, Vector Labs). Sections were blocked for 10 min in 3% H2O2, followed by 30 min room temperature incubation in 1% bovine serum albumin in phosphate-buffered saline–0.1% Tween 20.
The following primary antibodies were used: anti-CD146 (Abcam, catalogue no. ab75769, EPR3208, 1:500), anti-AR (Abcam, catalogue no. ab133273, EPR1535, 1:100) and anti-NKX3.1 (Cell Signaling, catalogue no. 83700T, D2Y1A, 1:200). Secondary anti-rabbit antibody Envision horseradish peroxidase (DAKO, Agilent Technologies, catalogue no. K400311-2, undiluted) was incubated for 30 min, and signal detection was done using 3-amino-9-ethylcarbazole substrate (DAKO, Agilent Technologies). Sections were counterstained with hematoxylin and mounted with Aquatex. Images were acquired using a 3D Histech Panoramic Flash II 250 scanner at ×20 magnification (resolution 0.24 μm per pixel).

SICAP
The dataset contains 155 H&E-stained WSIs from needle biopsies taken from 95 patients, split into 18,783 patches of size 512 × 512 (ref. 42). The WSIs were reconstructed by stitching the patches. The WSIs were scanned at ×40 magnification by a Ventana iScan Coreo scanner and downsampled to ×10 magnification. The WSIs were annotated for Gleason grades by expert uro-pathologists at the Hospital Clínico of Valencia, Spain.

PANDA
The dataset includes 5,759 H&E-stained needle biopsies from 1,243 patients at the Radboud University Medical Center, Netherlands71 and 5,662 H&E-stained needle biopsies from 1,222 patients at various hospitals in Stockholm, Sweden72. The slides from Radboud were scanned with a 3D Histech Panoramic Flash II 250 scanner at ×20 magnification (resolution 0.24 μm per pixel) and were downsampled to ×10. The slides from Sweden were scanned with a Hamamatsu C9600-12 and an Aperio ScanScope AT2 scanner at ×10 magnification with pixel resolutions of 0.45202 μm and 0.5032 μm, respectively. The Gleason grades of the biopsies were annotated by expert uro-pathologists and were released as part of the PANDA challenge43. We removed the noisy and ambiguously labelled biopsies from the dataset, resulting in 4,564 and 4,988 biopsies from the Radboud and the Swedish cohorts, respectively (9,552 biopsies in total). The distribution of WSIs across Gleason grades for both the SICAP and PANDA datasets is shown in Supplementary Table 3.

PDAC
The PDAC TMA contained cancer tissue of 117 (50 female, 67 male) PDAC cases resected in a curative setting at the Department of Visceral Surgery of Inselspital Bern and diagnosed at the Institute of Tissue Medicine and Pathology (ITMP) of the University of Bern between the years 2014 and 2020. The study followed the guidelines of the World Medical Association Declaration of Helsinki 1964, updated in October 2013, and was conducted after approval by the Ethics Committees of Bern (CEC ID2020-00498). All participants provided written general consent. The TMA contained three spots from each case (tumour front, tumour centre, tumour stroma), leading to a total number of 351 tissue spots. Thirteen of these 117 cases were treated by neoadjuvant chemotherapy followed by surgical resection and adjuvant therapy, and the majority of the cases (104) were resected curatively and received adjuvant therapy. All cases were characterized comprehensively clinico-pathologically, including TNM stage, as part of the master’s thesis of Jessica Lisa Rohrbach at ITMP, supervised by Martin Wartenberg. All cases were Union for International Cancer Control (UICC) tumour stage I, stage II or stage III cases on pathologic examination, according to the UICC TNM Classification of Malignant Tumours, 8th edition73; the TMA did not include UICC tumour stage IV cases.
In all of our analyses, including the TNM prediction (Fig. 6d), we excluded the 13 neoadjuvant cases and considered only the 104 cases that received adjuvant therapy. The distribution of cores across the three TNM stages is reported in Supplementary Table 4.

TCGA
The dataset includes example H&E WSIs from breast cancer (BRCA) and colorectal cancer (CRC) from TCGA, available at the GDC data portal (https://portal.gdc.cancer.gov) as diagnostic slides under project IDs TCGA-BRCA and TCGA-CRC, respectively.

Data preprocessing
For all datasets used, we followed a tissue region detection and patch extraction preprocessing procedure. Specifically, the tissue region was segmented using the preprocessing tools in the HistoCartography library74. A binary tissue mask denoting the tissue and non-tissue regions was computed for each downsampled input image by iteratively applying Gaussian smoothing and Otsu thresholding until the mean of the non-tissue pixels was below a threshold. The estimated contours of the tissue regions and tissue cavities were then filtered by area to generate the final segmentation mask. Subsequently, non-overlapping patches of size 256 × 256 were extracted at ×10 magnification using the segmentation contours. The extracted H&E and IHC patches of the EMPaCT dataset were used for training and internal validation of the VirtualMultiplexer. For the unseen datasets (prostate cancer WSIs, SICAP, PANDA, PDAC, TCGA), the images were first stain-normalized to mitigate the staining appearance variability with respect to the EMPaCT TMAs, and then H&E patches were extracted. Specifically, for the SICAP, PANDA and PDAC datasets, we used the Vahadane stain normalization method75, from the HistoCartography library74, on the entire images. We masked out the blank regions by applying a threshold in the Lab colour space and computed the stain-density maps using only the tissue regions. Afterwards, the target stain-density maps were combined with the reference colour appearance matrix to produce normalized images, as proposed by the Vahadane method. Supplementary Fig. 1 presents a sample unnormalized WSI from the PANDA dataset and the corresponding stain-normalized WSI based on the reference EMPaCT TMA. For the prostate cancer and TCGA WSIs, we followed the same procedure but with stain-density maps extracted at a lower magnification (×2.5) for computational efficiency. Note that the VirtualMultiplexer is independent of the stain normalization method and can be trained using H&E images normalized by other advanced stain normalization algorithms: for example, deep learning-based methods76.

Method evaluation
Patch-level evaluation
We use the Fréchet inception distance (FID) score77 to compare the distribution of the virtual IHC patches with the distribution of the real IHC patches, as shown in Fig. 3. The computation begins with projecting the virtual and the real IHC patches to an embedding space using the InceptionV3 (ref. 77) model pretrained on ImageNet60. The extracted embeddings are used to estimate multivariate normal distributions \({\mathcal{N}}(\;{\mu }_{\rm{r}},{\Sigma }_{\rm{r}})\) for the real data and \({\mathcal{N}}(\;{\mu }_{\rm{v}},{\Sigma }_{\rm{v}})\) for the virtual data. Finally, the FID score is computed as$$\,\text{FID}\,=| | {\mu }_{\rm{r}}-{\mu }_{\rm{v}}| {| }^{2}+{\mathrm{Tr}}\left({\Sigma }_{\rm{r}}+{\Sigma }_{\rm{v}}-2{({\Sigma }_{\rm{r}}{\Sigma }_{\rm{v}})}^{\frac{1}{2}}\right)$$
(12)
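A minimal sketch of this FID computation is given below, assuming the InceptionV3 embeddings of the real and virtual patches are available as NumPy arrays; the function name is illustrative and the matrix square root uses SciPy.

```python
import numpy as np
from scipy import linalg

def fid_score(feats_real, feats_virtual):
    """Sketch of the FID computation in equation (12), given InceptionV3
    embeddings of real and virtual IHC patches as (N, d) arrays."""
    mu_r, mu_v = feats_real.mean(axis=0), feats_virtual.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_v = np.cov(feats_virtual, rowvar=False)

    # (Sigma_r Sigma_v)^(1/2); discard negligible imaginary parts from sqrtm
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_v, disp=False)
    covmean = covmean.real
    return float(((mu_r - mu_v) ** 2).sum()
                 + np.trace(sigma_r + sigma_v - 2.0 * covmean))
```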
In equation (12), μr and μv are the feature-wise means of the real and virtual patch embeddings, Σr and Σv are the corresponding covariance matrices and Tr is the trace function. A lower FID score indicates a lower disparity between the two distributions and thereby a higher staining efficacy of the VirtualMultiplexer. To ensure reproducibility, we ran each model three times with three independent initializations and computed the mean and standard deviation for each model (barplot height and error bar in Fig. 3). We used a 70%:30% ratio to split the data into train and test sets, respectively. Because a different number of IHC stainings was available per marker in the EMPaCT data, the exact number of cores used per marker is given in Supplementary Table 5.

Image-level evaluation
We used a number of downstream classification tasks to assess the discriminative ability of the virtually stained IHC images on the EMPaCT, SICAP, PANDA and PDAC datasets. We further used these tasks to demonstrate the utility of leveraging virtually multiplexed staining in comparison to standalone real H&E, real IHC and virtual IHC staining. Specifically, provided the aforementioned images, we constructed graph representations as described in the section ‘GT architecture’. Subsequently, GTs41 were trained under unimodal and multimodal settings using both real and virtually stained images and evaluated on a held-out independent test dataset. The final classification scores were reported using a weighted F1 metric, where a higher score indicates better classification performance and thereby a higher discriminative power of the utilized images. As before, we ran each model three times with three independent initializations and computed the mean and standard deviation for each model (barplot heights and error bars in Figs. 5 and 6). In all cases, we used a 60%:20%:20% ratio to split the data into train, validation and test sets, respectively. The exact numbers of train, validation and test samples used per task, marker and training setting in the EMPaCT dataset are given in Supplementary Table 6. For the SICAP, PANDA and PDAC datasets, the exact numbers of samples used in the train, validation and test splits coincide for all unimodal and multimodal models of Fig. 6 and are reported in Supplementary Table 7.

Computational hardware and software
The image datasets were preprocessed on POWER9 central processing units and one NVIDIA Tesla A100 graphics processing unit using the HistoCartography library74. The deep learning models were trained on NVIDIA Tesla P100 graphics processing units using PyTorch (v.1.13.1) (ref. 78) and PyTorch Geometric (v.2.3.0) (ref. 79). The entire pipeline was implemented in Python (v.3.9.1).

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
