Retinal fundus image super-resolution based on generative adversarial network guided with vascular structure prior

Due to the superior image generation capability of generative adversarial networks, they can be applied to improve the quality of retinal fundus imaging and thus help doctors in retinal image analysis. Although Real-ESRGAN can generate super-resolution (SR) retinal fundus images, the SR images contain structural distortions. To overcome this problem, we propose an improved Real-ESRGAN. It contains three parts: an improved generator, an improved discriminator, and a newly designed loss function.

Improved generator

The improved generator is shown in Fig. 1. Compared with the original generator of Real-ESRGAN, we add a branch that contains a pre-trained network (a U-Net model), a condition network, and our improved RRDB (Residual in Residual Dense Block). The distribution of retinal vascular structures in fundus images is regular, relatively constant, and semantically simple and clear, which allows the vascular structures to be located using low-resolution (LR) information. At the same time, the boundaries between different tissues in the retinal fundus image are blurred and the fine structures are not obvious, so more high-resolution (HR) information is required to distinguish them. Therefore, in the U-Net model we first downsample the input image to obtain deep features. The deep features contain LR information that provides contextual semantic information about the segmentation target in the whole image. We then use upsampling operations to obtain shallow features, which contain more HR information and allow the segmentation target to be delineated finely. In addition, the skip connections used in the upsampling process combine the shallow features with the deep features of the corresponding layer to obtain a more accurate segmentation32. We train the U-Net model on the DRIVE (Digital Retinal Images for Vessel Extraction) dataset33, which is specifically designed for the segmentation of blood vessels in retinal fundus images, and use the trained U-Net model as the pre-trained network in Fig. 1.

Fig. 1 The architecture of the proposed generator.

To better incorporate the features extracted by the pre-trained network into the SR process, we feed the vascular structure segmentation map into the condition network, which consists of a 1\(\times\)1 convolutional layer, a ReLU activation function, and another 1\(\times\)1 convolutional layer. The condition network extracts feature maps from the vascular structure segmentation map without changing its size; its outputs are referred to as the prior conditions. Finally, we use a Spatial Feature Transform (SFT) layer to fuse the prior conditions with the feature maps obtained by the original generator network, thereby improving the super-resolution quality of retinal fundus images.

Compared with the original RRDB basic module in Real-ESRGAN, we add an SFT layer before each convolutional layer in the RRDB module, so that the prior conditions are adequately combined with each intermediate feature map obtained in the feature extraction phase. Our proposed RRDB module with SFT layers is shown in Fig. 2. Low-level vision tasks such as SR must take more spatial information of the image into account and apply different processing at different spatial locations of the image.
Therefore, we use an SFT layer to combine the feature maps obtained in the feature extraction phase with the prior conditions, rather than directly concatenating or summing them. The SFT layer learns a mapping function that outputs a modulation parameter pair based on the prior conditions. This learned parameter pair adaptively and spatially modulates each intermediate feature map in the SR network through an affine transformation, which is carried out by scaling and shifting the feature maps:$$\begin{aligned} SFT(F|\gamma , b) = \gamma \times F + b \end{aligned}$$
(1)
where F denotes the feature maps, whose dimensions are the same as those of \(\gamma\) and b, and \(\times\) represents element-wise multiplication. The SFT layer is shown in Fig. 3: it feeds the prior conditions into two separate stacks of two \(1\times 1\) convolutional layers to obtain \(\gamma\) and b, and then modulates the input feature maps with \(\gamma\) and b, as shown in (1).

Fig. 2 The RRDB module with spatial feature transform layer.

Fig. 3 The architecture of the spatial feature transform layer.

Improved discriminator

The discriminator determines whether the input image is the original high-resolution image or a super-resolution image. Real-ESRGAN uses a U-shaped network as the discriminative network. This architecture provides detailed per-pixel feedback to the generator while maintaining the global coherence of super-resolution images through global image feedback. It consists of three parts: a downsampling part, an upsampling part, and skip connections. However, the convolution operation in the downsampling part extracts informative features by blending cross-channel and spatial information, and therefore also produces some redundant feature information.

To reduce the interference of this redundant feature information, we introduce channel and spatial attention modules into each skip connection to emphasize meaningful features. The spatial and channel attention modules are shown in Figs. 4 and 5, respectively. The channel attention module obtains a weight for each channel: the more important the information contained in a channel, the greater its weight. The output feature maps of each skip connection with attention modules and of each cascade module in the upsampling part are fused by an element-wise sum operation, and the fused feature maps serve as the input of the next cascade module in the upsampling part.

Fig. 4 Spatial attention module.

The complete discriminator network is shown in Fig. 5. The left part is the downsampling part, the right part is the upsampling part, and the intermediate part is the attention part. In the downsampling part, we first use a 3\(\times\)3 convolution to extract features and then use three identical cascade modules to realize the downsampling. Each cascade module includes a 4\(\times\)4 convolution with a stride of 2, a spectral normalization layer, and a LeakyReLU activation function. The convolution reduces the scale of the feature maps and increases the receptive field, spectral normalization improves training stability, and the LeakyReLU activation function improves the fitting ability of the network. The upsampling part includes three identical cascade modules, each consisting of a bilinear upsampling operation, a 3\(\times\)3 convolution, a spectral normalization layer, and the LeakyReLU activation function. The bilinear upsampling operation increases the scale of the feature maps, while the convolution, spectral normalization layer, and LeakyReLU activation function extract high-frequency features. At the end of the discriminator, we use two convolutions with spectral normalization layers followed by a final convolution. This enhances important features and reduces the interference of generated noise, which improves the discriminatory capability of the discriminator.

Fig. 5 The architecture of the proposed discriminator.
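As a concrete illustration of the condition network and of the SFT layer in Eq. (1) and Fig. 3, the following minimal PyTorch sketch shows one possible implementation. The class names, the channel counts (in_channels, cond_channels, num_feat), and the absence of activations inside the two SFT branches are our assumptions rather than details confirmed by the original implementation.

# Minimal PyTorch sketch of the condition network and the SFT layer (Eq. (1)).
# Channel counts and module names are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionNetwork(nn.Module):
    """1x1 conv -> ReLU -> 1x1 conv: turns the vascular structure segmentation
    map into prior-condition feature maps without changing the spatial size."""
    def __init__(self, in_channels=1, cond_channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, cond_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(cond_channels, cond_channels, kernel_size=1),
        )

    def forward(self, seg_map):
        return self.body(seg_map)          # the prior conditions

class SFTLayer(nn.Module):
    """Two separate stacks of two 1x1 convolutions predict gamma and b from the
    prior conditions; the feature maps are then modulated as gamma * F + b."""
    def __init__(self, num_feat=64, cond_channels=32):
        super().__init__()
        self.gamma_branch = nn.Sequential(
            nn.Conv2d(cond_channels, num_feat, kernel_size=1),
            nn.Conv2d(num_feat, num_feat, kernel_size=1),
        )
        self.b_branch = nn.Sequential(
            nn.Conv2d(cond_channels, num_feat, kernel_size=1),
            nn.Conv2d(num_feat, num_feat, kernel_size=1),
        )

    def forward(self, feat, cond):
        gamma = self.gamma_branch(cond)    # scale map, same size as feat
        b = self.b_branch(cond)            # shift map, same size as feat
        return gamma * feat + b            # element-wise affine transform, Eq. (1)

In the improved RRDB module (Fig. 2), an SFT layer of this form would be inserted before each convolution, with the same prior conditions passed to every SFT layer.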
Construction of new loss function

Since the retinal vascular structure segmentation map is a binary image, we use an L1 loss function to measure the pixel-level differences between the retinal vascular structure segmentation maps of the high-resolution (HR) images and the super-resolution (SR) images. It is expressed as follows:$$\begin{aligned} L_{L1\_Seg} = \frac{1}{WH}\sum \limits _{x = 1}^{W} \sum \limits _{y = 1}^{H} ||Seg(I^{HR})_{x,y} - Seg(G(I^{LR}))_{x,y} ||_1 \end{aligned}$$
(2)
where H and W refer to the height and width of the retinal vascular structure segmentation map of the HR or SR image, respectively. \(I^{LR}\) and \(I^{HR}\) represent the low-resolution (LR) image and the HR image, respectively. Seg( ) represents image segmentation using the U-Net model pre-trained on the DRIVE dataset. This loss function acts as a second constraint on the SR images, complementing the structural prior introduced in the generator.

We combine the proposed loss function with the original loss functions to construct a new loss function for the improved generative adversarial network. The new loss functions of the generator and discriminator are shown in (3) and (4), respectively.$$\begin{aligned} Loss_G = \lambda _{adv} L_{adv\_G} + \lambda _{per} L_{per} + \lambda _{L1} L_{L1} + \lambda _{L1\_Seg} L_{L1\_Seg} \end{aligned}$$
(3)
$$\begin{aligned} Loss_D = \lambda _{adv} L_{adv\_D} \end{aligned}$$
(4)
where \(Loss_G\) and \(Loss_D\) are the loss functions of the generator and discriminator, respectively; they measure the difference between the generated data distribution and the real data distribution. The coefficients in Equations (3) and (4) are set as \(\lambda _{adv} = 0.1\), \(\lambda _{per} = 1\), \(\lambda _{L1} = 1\), and \(\lambda _{L1\_Seg} = 1\). \(L_{adv\_G}\) and \(L_{adv\_D}\) are the adversarial loss functions of the generator and discriminator, respectively. They are expressed as follows:$$\begin{aligned} L_{adv\_G} = \frac{1}{N}\sum \limits _{n = 1}^{N} - \log (D(G(I_n^{LR}))) \end{aligned}$$
(5)
$$\begin{aligned} L_{adv\_D} = \frac{1}{N}\sum \limits _{n = 1}^{N} \left[ - \log (D(I_n^{HR})) - \log (1 - D(G(I_n^{LR}))) \right] \end{aligned}$$
(6)
where G and D denote the generator and discriminator, respectively, N is the total number of samples in the training set, and \(I_n^{LR}\) and \(I_n^{HR}\) are the n-th LR and HR training images. The loss function \(L_{L1}\) in (7) measures the pixel-level differences between the HR and SR images.$$\begin{aligned} L_{L1} = \frac{1}{WH}\sum \limits _{x = 1}^{W} \sum \limits _{y = 1}^{H} ||I_{x,y}^{HR} - G(I^{LR})_{x,y} ||_1 \end{aligned}$$
(7)
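The following sketch shows one way the terms in Eqs. (2), (5), (6), and (7) could be combined into the training objectives of Eqs. (3) and (4) for a single batch. The module names (generator, discriminator, seg_unet), the assumption that the discriminator outputs raw logits, and the use of binary cross-entropy with logits to realize the -log terms are ours; the perceptual loss is left as an external function and is sketched after Eq. (9).

# Sketch of the objectives in Eqs. (2)-(7); generator, discriminator and
# seg_unet are assumed to be nn.Module instances defined elsewhere, and the
# discriminator is assumed to return raw (pre-sigmoid) logits.
import torch
import torch.nn.functional as F

lambda_adv, lambda_per, lambda_l1, lambda_l1_seg = 0.1, 1.0, 1.0, 1.0

def generator_loss(lr, hr, generator, discriminator, seg_unet, perceptual_loss):
    sr = generator(lr)

    # Eq. (7): L1 loss between the HR and SR images
    l1 = F.l1_loss(sr, hr)

    # Eq. (2): L1 loss between the vessel segmentation maps of HR and SR images;
    # the pre-trained U-Net is kept frozen, gradients reach the generator via sr
    with torch.no_grad():
        seg_hr = seg_unet(hr)
    l1_seg = F.l1_loss(seg_unet(sr), seg_hr)

    # Eq. (5): generator adversarial loss, -log D(G(I_LR))
    logits_fake = discriminator(sr)
    adv_g = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))

    # Eqs. (8)-(9): perceptual loss (see the VGG19 sketch after Eq. (9))
    per = perceptual_loss(sr, hr)

    # Eq. (3): weighted sum with the coefficients stated in the text
    return lambda_adv * adv_g + lambda_per * per + lambda_l1 * l1 + lambda_l1_seg * l1_seg

def discriminator_loss(lr, hr, generator, discriminator):
    # Eq. (6): -log D(I_HR) - log(1 - D(G(I_LR))), weighted by lambda_adv as in Eq. (4)
    sr = generator(lr).detach()
    logits_real = discriminator(hr)
    logits_fake = discriminator(sr)
    loss_real = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
    loss_fake = F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    return lambda_adv * (loss_real + loss_fake)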
The perceptual loss function uses a pre-trained VGG19 network to measure the gap between the HR and SR images in feature space; its introduction makes the SR images semantically closer to the HR images.$$\begin{aligned} L_{per_{(i,j)}} = \frac{1}{W_{(i,j)} H_{(i,j)}}\sum \limits _{x = 1}^{W_{(i,j)}} \sum \limits _{y = 1}^{H_{(i,j)}} ||\phi _{(i,j)} (I^{HR})_{x,y} - \phi _{(i,j)} (G(I^{LR}))_{x,y} ||_1 \end{aligned}$$
(8)
where \(\phi _{(i,j)}\) denotes the feature maps obtained from the j-th convolution layer (before activation) preceding the i-th max-pooling layer of the pre-trained VGG19 network, and \(W_{(i,j)}\) and \(H_{(i,j)}\) are the width and height of these feature maps. The values of (i, j) used are (1, 2), (2, 2), (3, 4), (4, 4), and (5, 4). The overall perceptual loss function is shown in Equation (9):$$\begin{aligned} L_{per} = \alpha _{(1,2)} L_{per_{(1,2)}} + \alpha _{(2,2)} L_{per_{(2,2)}} + \alpha _{(3,4)} L_{per_{(3,4)}} + \alpha _{(4,4)} L_{per_{(4,4)}} + \alpha _{(5,4)} L_{per_{(5,4)}} \end{aligned}$$
(9)
where the individual coefficients in (9) are: \(\alpha _{(1,2)} = 0.1\), \(\alpha _{(2,2)} = 0.1\), \(\alpha _{(3,4)} = 1\), \(\alpha _{(4,4)} = 1\), \(\alpha _{(5,4)} = 1\).
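To make the layer choice in Eqs. (8) and (9) concrete, the following sketch extracts VGG19 features before activation at the stated (i, j) positions using torchvision. The mapping of (i, j) to indices 2, 7, 16, 25, and 34 of torchvision's vgg19().features, and the omission of ImageNet input normalization, are our assumptions and should be checked against the implementation actually used.

# Sketch of the perceptual loss of Eqs. (8)-(9) using torchvision's VGG19.
# The index mapping of conv(i,j)-before-activation to features positions
# 2, 7, 16, 25, 34 is our assumption; ImageNet mean/std normalization of the
# inputs is omitted here for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class VGG19PerceptualLoss(nn.Module):
    LAYER_IDS = (2, 7, 16, 25, 34)          # conv(1,2), (2,2), (3,4), (4,4), (5,4)
    WEIGHTS = (0.1, 0.1, 1.0, 1.0, 1.0)     # alpha_(i,j) from Eq. (9)

    def __init__(self):
        super().__init__()
        features = vgg19(weights="IMAGENET1K_V1").features
        self.vgg = features[: max(self.LAYER_IDS) + 1].eval()
        for p in self.vgg.parameters():     # freeze the pre-trained network
            p.requires_grad = False

    def _extract(self, x):
        feats = []
        for idx, layer in enumerate(self.vgg):
            x = layer(x)
            if idx in self.LAYER_IDS:
                feats.append(x)
        return feats

    def forward(self, sr, hr):
        loss = 0.0
        for w, f_sr, f_hr in zip(self.WEIGHTS, self._extract(sr), self._extract(hr)):
            loss = loss + w * F.l1_loss(f_sr, f_hr)   # Eq. (8) per layer, summed as in Eq. (9)
        return loss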
