Benchmarking robustness of deep neural networks in semantic segmentation of fluorescence microscopy images

We propose an assay for benchmarking corruption robustness and adversarial robustness of DNN models in semantic segmentation of FM images. A critical part of the assay is a new method for synthesizing realistic synthetic FM images with precisely controlled corruptions or adversarial attacks. We evaluate robustness of 10 representative segmentation models on both realistic synthetic FM images and real microscopy images of different modalities.

Fig. 1 Overall workflow of image synthesis

Generation of realistic synthetic images for benchmarking robustness

We have developed three datasets, referred to as ER-C, Mito-C and Nucleus-C, respectively, for benchmarking robustness of DNN models against corruptions and adversarial attacks in semantic segmentation of FM images [43]. Detailed statistics of the datasets are summarized in Supplementary Table S1. Degraded images in these three datasets are synthesized from raw images, along with their manually annotated segmentation labels, from the ER, Mito, and Nucleus datasets [44, 45], respectively.

We use realistic synthetic FM images to benchmark robustness of DNNs for four reasons. First, the ground truth of each synthetic image is known a priori, so no additional manual annotation is required. Second, using synthetic images enables more direct, flexible, and precise control of corruptions and adversarial attacks than using real images. Such control is hard to achieve with real images because imaging conditions in the real world are difficult or even infeasible to control. Third, synthesis of realistic images requires much less time and labor than generation of real images under controlled conditions. Finally, previous studies such as [46] have shown experimentally that models trained on realistic synthetic images perform equally well on real FM images.

The overall workflow of image synthesis consists of three steps (Fig. 1). First, segmentation labels are used as binary masks to guide synthesis of images using a generative adversarial network (GAN) [47,48,49,50], which is trained to learn the mapping from the masks to their corresponding FM images. The masks are used as the ground truth for the final output images. The segmentation labels, generated originally by manual annotation, are taken directly from the three datasets [44, 45]. For data augmentation, some objects in existing segmentation annotations are also randomly selected and combined to generate new masks. Furthermore, morphological operations including dilation and erosion are used to increase the shape variability of the masks. Second, denoising is performed on the synthesized images to remove their background noise using the method in [51]. This step is important because it enables precise control of signal-to-noise ratios (SNRs) in the next step. Third, different corruptions and adversarial attacks are applied to the denoised synthetic images to generate the final output images for benchmarking robustness of DNN models. A detailed description of each step of the workflow is given below.

Fig. 2 Comparison of real images versus images synthesized using two strategies. First row: an example from the Nucleus dataset. Second row: an example from the Mito dataset. Third row: an example from the ER dataset

Step 1 - Initial image synthesis using a GAN

In a previous study [40], the foreground and background of FM images of mitochondria were modeled using a Gamma distribution and a Gaussian distribution, respectively, to synthesize images from binary masks. Pixels in foreground regions defined by the binary masks were filled with random samples from the Gamma distribution [40]. This method, referred to as Random Fill in this study, cannot capture spatial patterns of pixel intensities or the diffusive boundaries of real image objects. Consequently, the synthesized images have low fidelity (Fig. 2). Moreover, because of the sharp boundaries of the synthetic images, DNN models trained on them tend to over-segment real images with diffusive boundaries [40].
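For illustration, the following is a minimal sketch of such a Random Fill baseline in Python/NumPy. The function name and distribution parameters are illustrative placeholders rather than values from [40], where the parameters would be fitted to real FM images.

```python
import numpy as np

def random_fill(mask, fg_shape=2.0, fg_scale=50.0, bg_mean=10.0, bg_std=3.0, seed=None):
    """Fill the mask foreground with Gamma-distributed signal and the background
    with Gaussian noise. The parameters here are illustrative placeholders; in
    practice they would be estimated from real FM images."""
    rng = np.random.default_rng(seed)
    mask = mask.astype(bool)
    img = rng.normal(bg_mean, bg_std, size=mask.shape)                # background: Gaussian
    img[mask] = rng.gamma(fg_shape, fg_scale, size=int(mask.sum()))   # foreground: Gamma
    return np.clip(img, 0.0, None)
```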
Fig. 3 Representative synthetic images with different types and levels of corruptions. First row: an example from the Nucleus-C dataset; Second row: an example from the Mito-C dataset; Third row: an example from the ER-C dataset

To generate realistic synthetic FM images, we use a customized GAN model based on Pix2Pix [52], which we refer to as P2P-SN [51]. It is trained to learn the mapping from binary masks to real images. Given a binary mask, it fills the foreground with synthetic signal and the background with synthetic noise. In this way, a large number of synthetic images can be generated from given masks. As can be observed qualitatively in Fig. 2, images synthesized by P2P-SN better reproduce the pixel intensity patterns and diffusive boundaries of real images. This observation can be quantified using fidelity metrics of the foreground signal, background noise, and blurring, respectively [46]. However, background noise in the synthetic images makes it difficult to precisely control their SNRs. To solve this problem, we remove the background noise of the synthetic images using the method developed in [51].

Step 2 - Denoising synthetic images

Background noise synthesized by P2P-SN hinders precise control of the signal-to-noise ratios (SNRs) of images and therefore is removed via denoising. Specifically, we use the two-stage denoising method named global noise modeling denoiser (GNMD) [51]. In the first stage, a series of independent and nearly all-background masks are fed into a trained P2P-SN to generate a series of synthetic global noise images denoted by N. Assuming that noise is additive, we then synthesize a noisy image \(\hat{I} = I + N\) from a synthetic noise-free image I. In the second stage, pairs of images \((\hat{I},I)\) generated in the first stage are used to train another Pix2Pix-based GAN model referred to as P2P-DN. When GNMD is applied to images synthesized by P2P-SN, their background noise is effectively removed [51]. Controlled degradations can now be applied to the noise-free synthetic images.
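As an illustration of the additive-noise assumption behind the first stage of GNMD, the sketch below builds one \((\hat{I}, I)\) training pair. Here `p2p_sn` is a hypothetical stand-in for the trained mask-to-image generator, not part of any released API.

```python
import numpy as np

def make_denoising_pair(clean_img, background_mask, p2p_sn):
    """Build one (noisy, clean) pair under the additive-noise assumption
    I_hat = I + N used to train P2P-DN. `p2p_sn` is a hypothetical callable:
    fed a nearly all-background mask, it is assumed to return a synthetic
    global-noise image N."""
    noise_img = np.asarray(p2p_sn(background_mask), dtype=np.float64)  # synthetic global noise N
    noisy_img = np.asarray(clean_img, dtype=np.float64) + noise_img    # I_hat = I + N
    return noisy_img, clean_img
```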
Step 3 - Synthesis of degraded images

FM images generally have lower SNRs than natural images. We find empirically that FM images with an SNR of 8 are sufficiently clean visually. Therefore, we take synthesized images with an SNR of 8 as our reference clean images. Different forms and levels of corruptions and adversarial attacks are applied to the clean images to generate two types of samples: corrupted samples and adversarial samples.

Generation of corrupted samples. Noise is a common type of corruption for FM images. We simulate different levels of noise quantified by different SNRs (see Fig. 3). Blurring is another common type of corruption for FM images. As in [40], we simulate two types of blurring, namely space-invariant blurring (SIB) and space-variant blurring (SVB), at different levels.

Previous studies have shown that Poisson and Gaussian noise are dominant in FM images [13, 14, 53]. Specifically, photon noise, or shot noise, arises from statistical fluctuations in the number of photons emitted at a given exposure level and follows a Poisson distribution. Photon noise is inherent in all optical signals that result from photon emission. Readout noise is mainly generated by signal amplification during the conversion of electrical charges into voltages and follows a Gaussian distribution. In view of these physical mechanisms, we first add Poisson noise to a noise-free synthetic image I generated by P2P-DN to simulate photon noise (see Eq. 1).

$$\begin{aligned} {\hat{I}}_{Poisson} = I + N_{Poisson} \end{aligned}$$
(1)
where \(N_{Poisson}\sim P\left( \lambda _{p} \right)\) is noise following a Poisson distribution and \(\lambda _{p}\) represents the average photon flux, which depends on signal strength. We then add pixel-wise independent Gaussian noise to \({\hat{I}}_{Poisson}\) to achieve desired SNRs, as formulated by the following equation:

$$\begin{aligned} {\hat{I}}_{SNR} = {\hat{I}}_{Poisson} + N\left( \mu _{noise},\sigma _{noise} \right) \end{aligned}$$
(2)
where \(\mu _{noise}\) and \(\sigma _{noise}\) denote the mean and standard deviation of the added Gaussian noise, respectively. \({\hat{I}}_{SNR}\) is the simulated noisy image of a certain SNR, which is defined in this study as:

$$\begin{aligned} SNR = \left( \mu _{signal} - \mu _{noise} \right) /\sigma _{noise} \end{aligned}$$

(3)

where \(\mu _{signal}\) denotes the mean of the signal. Based on this definition, we simulate six SNR levels (SNR = 1, 2, 3, 4, 5, 8) and take \({\hat{I}}_{SNR=8}\) as our clean image. Specifically, \(\sigma _{noise}\) and \(\mu _{signal}\) are estimated from the corresponding raw images. Then \(\mu _{noise}\) is calculated for a specific SNR based on this definition. To benchmark robustness against noise corruption of natural images, zero-mean Gaussian noise \(N\left( {0,\sigma } \right)\) is often adopted, with its level controlled by \(\sigma\) [8, 11, 37]. For FM images, however, we simulate different levels of noise corruption by adjusting the SNRs, because FM images have much wider dynamic ranges than natural images and the mean of their background noise is often nonzero.
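A minimal sketch of this noise model is given below, assuming NumPy and a foreground mask for estimating \(\mu _{signal}\). Drawing pixel values from a Poisson distribution whose rate is the clean intensity is a common way to realize Eq. 1, though the exact implementation used in this study may differ.

```python
import numpy as np

def add_noise_to_snr(clean_img, fg_mask, target_snr, sigma_noise, rng=None):
    """Degrade a noise-free synthetic image to a target SNR following Eqs. 1-3.
    `sigma_noise` is assumed to have been estimated from corresponding raw images;
    `fg_mask` marks foreground pixels used to estimate the mean signal level."""
    rng = rng or np.random.default_rng()
    clean_img = np.asarray(clean_img, dtype=np.float64)
    # Eq. (1): signal-dependent photon (shot) noise, Poisson-distributed
    img_poisson = rng.poisson(np.clip(clean_img, 0.0, None)).astype(np.float64)
    # Eq. (3) rearranged: mu_noise = mu_signal - SNR * sigma_noise
    mu_signal = clean_img[np.asarray(fg_mask, dtype=bool)].mean()
    mu_noise = mu_signal - target_snr * sigma_noise
    # Eq. (2): pixel-wise independent Gaussian readout noise
    return img_poisson + rng.normal(mu_noise, sigma_noise, size=clean_img.shape)

# Example: six SNR levels, with SNR = 8 taken as the clean reference
# noisy = {snr: add_noise_to_snr(img, mask, snr, sigma_noise=5.0) for snr in (1, 2, 3, 4, 5, 8)}
```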
For blurring: because of the limited depth of field under the high numerical aperture required for high-resolution imaging, FM images are often partially or completely out of focus and therefore blurred. Out-of-focus blur is often simulated through convolution with the point spread function (PSF) [54, 55]. Because a Gaussian kernel is often used to approximate the PSF in practice, simulation of out-of-focus blur is implemented by convolution with a Gaussian kernel. To simulate space-invariant blurring (SIB), Gaussian filtering is performed on the entire synthetic FM image as described in [40]. Specifically, a fixed Gaussian kernel is applied to the whole image to simulate globally uniform blurring, with its level controlled by the standard deviation \(\sigma\) of the kernel. In this study, six levels of SIB are simulated, with \(\sigma = 0, 1, 2, 3, 4, 5\).

Space-variant blurring (SVB) is designed to simulate spatially nonuniform blur. In this study, an image is empirically divided into 4 horizontal bands from top to bottom. Each band is filtered by a Gaussian kernel whose \(\sigma\) is randomly selected from 1, M/3, 2M/3 and M, with the highest level of blurring controlled by M, which is set to 1, 2, 3, 4 and 5. Representative samples of the three types of corruptions are shown in Fig. 3.
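The two blurring corruptions can be sketched as follows using SciPy's Gaussian filter. The handling of band boundaries is a simplification and may differ in detail from the implementation used in this study.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def space_invariant_blur(img, sigma):
    """SIB: convolve the whole image with one Gaussian kernel approximating the PSF."""
    img = np.asarray(img, dtype=np.float64)
    return gaussian_filter(img, sigma=sigma) if sigma > 0 else img

def space_variant_blur(img, M, rng=None):
    """SVB: divide the image into 4 horizontal bands (top to bottom) and blur each
    band with a Gaussian kernel whose sigma is drawn from {1, M/3, 2M/3, M}."""
    rng = rng or np.random.default_rng()
    img = np.asarray(img, dtype=np.float64)
    sigmas = [1.0, M / 3.0, 2.0 * M / 3.0, float(M)]
    out = np.empty_like(img)
    for rows in np.array_split(np.arange(img.shape[0]), 4):
        # blur the full image, then keep only this band's rows to avoid hard seams
        out[rows] = gaussian_filter(img, sigma=rng.choice(sigmas))[rows]
    return out
```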
Generation of adversarial samples. The fast gradient sign method (FGSM) [27] and the iterative fast gradient sign method (IFGSM) [28] are used to generate adversarial attack samples. Let the original image be denoted as \(x \in R^{N \times N}\). FGSM is a one-step attack method based on the gradient of the DNN model loss function: as x steps along the gradient of the loss function, the loss function increases at the fastest rate. An adversarial sample is thus generated according to the following equation:

$$\begin{aligned} x^{adv} = x + \varepsilon \cdot sign\left( {\nabla _{x}Loss\left( {f(x),gt} \right) } \right) \end{aligned}$$

(4)

where \(\varepsilon\) is the step size that controls the level (i.e., strength) of the attack, f(x) is the output of DNN model f, \(sign(\cdot )\) is the sign function, and gt denotes the ground truth of x. Different from FGSM, IFGSM is an iterative attack method based on the gradient of the DNN model loss function, formulated as follows:

$$\begin{aligned}&x_{0}^{adv} = x \\&x_{t + 1}^{adv} = x_{t}^{adv} + \alpha \cdot sign\left( {\nabla _{x}Loss\left( {f(x_{t}^{adv}),gt} \right) }\right) \\&x_{t + 1}^{adv} = clip\left( {x_{t + 1}^{adv},\;\varepsilon } \right) \end{aligned}$$

(5)

where \(\alpha\) is the step size for each iteration and the function \(clip(x,\varepsilon )\) ensures that each element \(x_{i}\) of x stays within the range \([x_{i}-\varepsilon ,x_{i}+\varepsilon ]\). In our experiments, \(\alpha =1\), and we set the number of iterations to \(min(\varepsilon +4,1.25\varepsilon )\) [28], where \(\varepsilon\) is a variable that controls the level of the attack. Figure 4 shows representative samples of IFGSM attacks.

Fig. 4 Examples of images with different levels of IFGSM attacks. From left to right, \(\varepsilon =0,2,8,16,32\), respectively
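A minimal PyTorch sketch of the two attacks is given below; setting the number of iterations to 1 and \(\alpha = \varepsilon\) recovers FGSM. The model, loss function, and the pixel scale of \(\varepsilon\) (assumed to match that of the input images) are placeholders for whatever segmentation model and loss are being attacked.

```python
import torch

def ifgsm_attack(model, x, gt, loss_fn, eps, alpha=1.0):
    """Generate an adversarial sample with IFGSM (Eq. 5); one iteration with
    alpha = eps corresponds to FGSM (Eq. 4)."""
    n_iter = max(int(min(eps + 4, 1.25 * eps)), 1)     # iteration count from [28]
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), gt)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()               # gradient-sign step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)      # clip to the eps-ball around x
    return x_adv.detach()
```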
Real microscopy image datasets for benchmarking robustness

In addition to datasets of realistic synthetic images, robustness of DNN segmentation models is also benchmarked on datasets of real microscopy images of different modalities, including fluorescence, brightfield, phase-contrast and differential interference contrast (DIC) microscopy. Representative images are illustrated in Fig. 13. For real fluorescence microscopy images, the datasets from [56] contain about 700 pairs of mitochondrial images, while the Nucleus datasets in [14] contain 1000 cell nucleus images acquired in three imaging modes: two-photon, confocal, and widefield. For these real fluorescence microscopy images, segmentation annotations were made manually and quality-controlled by local experimental biologists. Because the real FM images of these datasets were collected under a pair of low and high SNRs, they can be used to benchmark corruption robustness.

In addition, two phase-contrast microscopy datasets from [57] and [58] are selected. Specifically, the SH-SY5Y dataset, which is part of the LiveCell dataset [57], contains phase-contrast images of human neuroblastoma cells with long protrusions and dense populations. The Phc-Fib dataset, which comes from [58], contains phase-contrast images of overlapping fibroblasts. Two datasets of DIC images, named DIC_v1 and DIC_v2, are taken from [58]. Both contain images of normal elliptical cells, while DIC_v2 additionally contains images of dense cell populations. Finally, two brightfield microscopy datasets are taken from [58]. The Bright_stain dataset contains brightfield images of stained cells, and the Bright dataset contains images of cells without staining; cells in both datasets are normally elliptical in shape and not clustered. It should be noted that the brightfield, phase-contrast and DIC microscopy images are taken from datasets with instance cell segmentation, in which individual cells are differentiated and marked with different segmentation annotations. Because this study only considers semantic segmentation, the segmentation labels are simplified by setting the annotation to 1 for all objects.

Real microscopy images are used for benchmarking model robustness for two reasons. First, degradations in real microscopy images are more representative of actual imaging conditions than those in simulated images. Second, real microscopy images can be used to verify benchmarking results obtained on realistic synthetic images.

Segmentation models

To date, a large number of DNN models have been developed for semantic image segmentation. In this study, we examine 10 models: FCN, SegNet, UNet, UNet_3, Sim_UNet, DeepLab, PSPNet, ICNet, ViT-B_16 and R50-ViT-B_16. Among them, FCN, SegNet, UNet, DeepLab, PSPNet and ICNet are representative semantic segmentation models that have been validated extensively in the literature. UNet_3 and Sim_UNet are two simplified variants of UNet.
ViT-B_16 and R50-ViT-B_16 are Transformer-based models.

FCN [41], SegNet [59] and UNet [42] are classical convolutional neural network (CNN) models with an encoder-decoder architecture. An input image is fed into their multi-layer encoder to extract high-level features; their decoder then maps the high-level features back to the input domain and outputs dense segmentation results. FCN is one of the early models to successfully apply deep learning to semantic segmentation by replacing fully connected layers with convolutional layers to achieve end-to-end training. Its fully convolutional design not only removes the restriction on input size but also greatly reduces the number of parameters. It features a representative asymmetric encoder-decoder architecture, with simple up-sampling or deconvolution layers in its decoder. SegNet adds convolutional layers to its decoder to make it symmetric with its encoder, forming a symmetric encoder-decoder architecture; max-pooling indices are retained to provide high-frequency information for the up-sampling layers. UNet also takes a symmetric encoder-decoder structure, but its skip connections concatenate features from its encoder to its decoder at different layers. The concatenated features largely preserve the information of each encoding layer, making segmentation more accurate. Recently, it has been shown that simplifying the UNet by reducing its levels of down-sampling and up-sampling improves segmentation accuracy on FM images [16]. To check how such simplification may influence model robustness, we test two simplified variants of the UNet: UNet_3 retains the first three encode-decode layers of the original UNet and removes the last two layers, whereas Sim_UNet further reduces the parameters of each layer of UNet_3 to obtain a more simplified architecture [16].

DeepLab [60], PSPNet [61], and ICNet [62] are all ResNet-based [63] models that utilize modules to handle image objects at multiple scales. DeepLab, specifically DeepLabv3, effectively captures multi-scale information at different rates using atrous spatial pyramid pooling (ASPP). PSPNet utilizes a pyramid pooling module (PPM) to extract global context information. ICNet is a cascaded lightweight network capable of real-time semantic segmentation of natural images.

With its multi-head self-attention modules, the Transformer [64] has the capacity to handle both short-range and long-range information and has been widely used in natural language processing. The Vision Transformer (ViT) [65] was proposed to handle vision tasks with the Transformer architecture, in which input images are divided into a sequence of non-overlapping patches, followed by patch embedding and positional encoding. ViT-B_16 and R50-ViT-B_16 [66] use the classical Transformer as their encoder. Because segmentation is a dense computer vision task, the decoder can be a CNN as usual. ViT-B_16 uses a Transformer as its encoder, while R50-ViT-B_16 uses a CNN-Transformer hybrid model in which a CNN is first used as a feature extractor to generate a feature map for the input; the feature map is then fed into the Transformer modules.

We choose these models for several reasons. First, FCN, SegNet, UNet, DeepLab, PSPNet and ICNet are representative models that have been widely used and are known to perform well on natural images. Second, comparing robustness of UNet, UNet_3 and Sim_UNet allows us to examine how ablation of model architecture affects robustness.
Third, we choose SegNet, FCN, DeepLab, PSPNet and ICNet because their robustness has been characterized on natural images, e.g., in [12]. This allows us to compare robustness of the same models on FM images versus natural images. Fourth, Vision Transformers (ViTs) [65] have achieved remarkable performance in a broad range of computer vision tasks, but their performance in segmentation of FM images has not been examined.

Quantification of robustness

To quantify robustness, we largely follow the protocol used in [11, 12] so that we can compare model robustness on FM images versus natural images. Specifically, we use IoU (Intersection over Union) as our metric to characterize the semantic segmentation performance of DNN models. For a specific model, we use its IoU on the reference clean images to characterize its reference accuracy. We define its robustness as the ratio between its IoU on degraded images and its IoU on the clean images, namely:

$$\begin{aligned} R_{c,s}^{f} = \left( {IoU}_{c,s}^{f} \right) /\left( {IoU}_{clean}^{f} \right) \end{aligned}$$
(6)
where \({IoU}_{c,s}^{f}\) denotes the IoU of model f on degraded images. Subscript c denotes the type of corruption, which may be one of SNR (noise), SIB, and SVB. Subscript s denotes the level of degradation. For example, for a noise-corrupted image with an SNR of 4, c = 'SNR' and s = 4. \({IoU}_{clean}^{f}\) denotes the IoU of f when it is tested on clean images. When c denotes corruptions, \(R_{c,s}^{f}\) refers to corruption robustness. When c denotes adversarial attacks, \(R_{c,s}^{f}\) refers to adversarial robustness. In this study, this metric of robustness is first calculated on individual images and then averaged over all images.

Fig. 5 Segmentation accuracy (measured by mean IoU) of different models under different types and levels of corruptions
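A minimal sketch of this computation is given below, assuming binary NumPy masks for predictions and ground truth; it mirrors Eq. 6 by computing the per-image ratio and then averaging over images.

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union for a pair of binary masks."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

def robustness(ious_degraded, ious_clean):
    """Eq. (6): per-image ratio of degraded IoU to clean IoU, averaged over all images."""
    ratios = [d / c for d, c in zip(ious_degraded, ious_clean) if c > 0]
    return float(np.mean(ratios))
```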
