Integrating Kalman filter noise residue into U-Net for robust image denoising: the KU-Net model

In the field of image processing, the U-Net architecture has become indispensable, especially for tasks such as segmentation, denoising, and restoration. U-Net was initially developed for biomedical image segmentation and was designed to work well with a small amount of training data. The main objective was an architecture that could analyze images accurately by capturing fine local detail as well as the global context. In our work, we leverage U-Net's strengths in capturing multi-scale features and preserving detail through skip connections by incorporating gradient information and Kalman filter residues into a U-Net framework. U-Net's intrinsic capacity to fuse different levels of abstraction makes it a suitable choice for merging multiple sources of information (noisy image, noise residue, and gradient) into one framework that delivers both high-quality denoising and the preservation of minute details. The high performance of the KU-Net architecture comes at the expense of simplicity compared with traditional models.

The key components of the architecture are as follows. The U-Net is made up of an encoder that uses downsampling to capture context and a decoder that uses upsampling to reconstruct the image. Skip connections between corresponding encoder and decoder layers preserve spatial information, enabling the network to fuse low-level detail features with high-level semantic information. The Kalman filter is used to estimate and remove noise from the input image; the residue, i.e., the difference between the noisy image and the Kalman filter output, is fed to the U-Net. The residue provides information about the noise characteristics, allowing the network to adapt its denoising strategy. The primary innovation of this work is the incorporation of gradient estimation into the deep learning model: the gradient is computed from the noisy image to extract texture and edge information, and this gradient map is essential for preserving minute details during denoising.

Figure 1 shows the design flow of the KU-Net model: the three layers of information are supplied to the U-Net, the encoder and decoder operate on them, and a denoised image is produced at the output stage. Figure 2 shows the architectural layout of the proposed model.

Fig. 1 Design flow of the KU-Net model.

Fig. 2 Design of the proposed KU-Net architecture.

The input RGB image is converted to a grayscale image of size 400 × 400. Three layers of information are supplied to the U-Net architecture: the input gray image, the predicted noise residue, and the gradient estimate. The two estimated image priors are therefore concatenated with the input grayscale image. The encoding and decoding stages of the U-Net generate the denoised image from this input, as illustrated in Fig. 1. The encoders perform convolution and down-sampling, implemented with kernel filters and pooling layers, respectively. Decoders, in contrast, are constructed by concatenating encoder-stage features and applying up-samplers. Unlike classification models, the final output is a denoised image rather than a class label.

The encoder and decoder consist of four convolutional blocks named B1, B2, B3, and B4. As illustrated in Fig. 2, these convolution blocks are used to construct the encoder and decoder sections. Each convolution block internally contains two convolutional layers.
The number of filters in encoder blocks B1, B2, B3, and B4 is 64, 128, 256, and 512, respectively.

Fig. 3 Convolution block layers in encoder and decoder.

Figure 3 depicts the structure of the encoder used in this study. Every convolutional layer is activated with the Rectified Linear Unit (ReLU) activation function, except the final convolution layer, which uses the sigmoid function to produce the final denoised image. Drop-out layers were also incorporated into the convolutional layers of the encoder and decoder to improve denoising performance; a drop-out rate of 0.05 was used in both the encoder and decoder stages.

Let \(K_n(m,n)\) be the noisy image and \(K(m,n)\) the clean image, where \((m,n)\) is the pixel location, as represented in Eq. (1).

$$K_n(m,n) = K(m,n) + \mathrm{Noise}(0,\sigma)$$
(1)
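To make the block structure concrete, the following is a minimal Keras sketch of the encoder/decoder described above (two 3 × 3 convolutions per block, ReLU activations, 0.05 drop-out, 64–512 filters, skip connections, and a sigmoid output). The layer arrangement and the bottleneck width are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the KU-Net encoder/decoder (not the authors' exact code).
# Assumes a 400 x 400 input with three stacked channels:
# gray image, Kalman noise residue, and gradient magnitude.
from tensorflow.keras import layers, Model

def conv_block(x, filters, dropout=0.05):
    """Two 3x3 convolutions with ReLU, followed by drop-out."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Dropout(dropout)(x)

def build_ku_net(input_shape=(400, 400, 3)):
    inputs = layers.Input(shape=input_shape)

    # Encoder: blocks B1-B4 with 64, 128, 256, 512 filters.
    skips, x = [], inputs
    for filters in (64, 128, 256, 512):
        x = conv_block(x, filters)
        skips.append(x)                      # kept for the skip connections
        x = layers.MaxPooling2D(2)(x)        # down-sampling

    x = conv_block(x, 1024)                  # bottleneck width is an assumption

    # Decoder: up-sample and concatenate the matching encoder features.
    for filters, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, filters)

    # Single-channel denoised image, sigmoid-activated as described above.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs, name="ku_net_sketch")
```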
The Kalman filter predicts the current state of the system from the estimate of the previous state. \(R_k\) denotes the optimal noise-free pixel value and is modelled as a first-order autoregressive (AR) process34, which is a typical representation of pixel behaviour in an image. The process noise \(n_k\) is assumed to be white Gaussian with zero mean and variance \(\sigma_n^2\), and the constant \(a\) in the process model depends on the signal statistics:

$$R_{k+1} = aR_k + n_k$$
(2)
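The paper does not spell out the scan order or how \(a\), the process-noise variance, and the measurement-noise variance are estimated, so the following is only a sketch of a scalar predict/update filter built on the AR(1) model of Eq. (2), with assumed parameter values.

```python
# Sketch of a scalar Kalman filter over pixels with the AR(1) process model
# R_{k+1} = a*R_k + n_k (Eq. 2). The raster scan order and the values of a,
# q (process-noise variance), and r (measurement-noise variance) are
# assumptions; the paper derives them from the signal statistics.
import numpy as np

def kalman_denoise(noisy, a=0.95, q=1e-4, r=1e-2):
    """noisy: 2-D float image in [0, 1]."""
    out = np.empty_like(noisy)
    for i, row in enumerate(noisy):            # raster-order scan (assumption)
        x_est, p_est = row[0], 1.0             # initial state and error covariance
        for j, z in enumerate(row):
            # Predict step using the AR(1) model.
            x_pred = a * x_est
            p_pred = a * a * p_est + q
            # Update step with the noisy pixel as the measurement.
            k_gain = p_pred / (p_pred + r)
            x_est = x_pred + k_gain * (z - x_pred)
            p_est = (1.0 - k_gain) * p_pred
            out[i, j] = x_est
    return out
```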
A 3 × 3 window is used to filter the Gaussian noise. The approximate noise matrix is then obtained as shown in Eq. (3):

$$\mathrm{Noise}(m,n) = K_{noisy}(m,n) - K_{denoised}(m,n)$$
(3)
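Continuing the sketch above, the residue of Eq. (3) is simply the difference between the noisy input and the Kalman-filtered estimate (the array names carry over from the previous sketch):

```python
# Noise residue of Eq. (3): noisy image minus the Kalman-filtered estimate.
# This residue map becomes one of the three U-Net input channels.
kalman_out = kalman_denoise(noisy)
noise_residue = noisy - kalman_out
```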
Let \(K_d(m,n)\) denote the image obtained by applying the Kalman filter and \(M_g(m,n)\) the gradient magnitude image. The gradient magnitude is given in Eq. (4):

$$M_g(m,n) = \sqrt{M_{g_x}(m,n)^2 + M_{g_y}(m,n)^2}$$
(4)
The horizontal gradient \(M_{g_x}(m,n)\) in Eq. (4a) and the vertical gradient \(M_{g_y}(m,n)\) in Eq. (4b) are calculated using Sobel operators:

$$\begin{aligned} M_{g_x}(m,n) &= \left(K_d(m+1,n-1) + 2K_d(m+1,n) + K_d(m+1,n+1)\right) \\ &\quad - \left(K_d(m-1,n-1) + 2K_d(m-1,n) + K_d(m-1,n+1)\right) \end{aligned}$$
(4a)
$$\begin{aligned} M_{g_y}(m,n) &= \left(K_d(m-1,n-1) + 2K_d(m,n-1) + K_d(m+1,n-1)\right) \\ &\quad - \left(K_d(m-1,n+1) + 2K_d(m,n+1) + K_d(m+1,n+1)\right) \end{aligned}$$
(4b)
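Eqs. (4), (4a), and (4b) can be rendered directly with NumPy/SciPy; the use of scipy.ndimage.correlate and reflective border handling in the sketch below are assumptions.

```python
# Sobel gradients of Eqs. (4a)-(4b) and the gradient magnitude of Eq. (4),
# computed on the Kalman-filtered image K_d.
import numpy as np
from scipy.ndimage import correlate

# Kernels laid out so that correlate() reproduces the index pattern of the
# equations: rows correspond to m-1, m, m+1 and columns to n-1, n, n+1.
KX = np.array([[-1, -2, -1],
               [ 0,  0,  0],
               [ 1,  2,  1]], dtype=float)   # horizontal gradient, Eq. (4a)
KY = np.array([[ 1,  0, -1],
               [ 2,  0, -2],
               [ 1,  0, -1]], dtype=float)   # vertical gradient, Eq. (4b)

def gradient_magnitude(k_d):
    gx = correlate(k_d, KX, mode="reflect")  # border handling is an assumption
    gy = correlate(k_d, KY, mode="reflect")
    return np.sqrt(gx ** 2 + gy ** 2)        # Eq. (4)
```

Stacking this gradient map with the grayscale image and the Kalman residue along the channel axis yields the three-layer input described next.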
As a result, the model receives three layers of input information: the input gray image, the noise residue predicted by Eq. (3), and the gradient information obtained by Eq. (4). A denoised image is then obtained through U-Net encoding and decoding.

Metrics evaluation

Standard metrics are used to analyze the denoising performance of the KU-Net architecture. The peak signal-to-noise ratio (PSNR) measures the quality of the reconstructed image relative to the original image. The structural similarity index (SSIM) provides a quantitative, edge- and texture-aware comparison of two images. Together, these metrics characterize the model's denoising performance.

The Mean Squared Error (MSE) loss function was selected because the problem is the reconstruction of the noise-free image. During training, the loss function given in Eq. (5) was employed:

$$\mathrm{MSE\;loss} = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( \mathrm{predicted}_{i,j} - \mathrm{actual\;target}_{i,j} \right)^2$$
(5)
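A small NumPy rendering of Eq. (5), for reference (function and variable names are illustrative):

```python
# Eq. (5): mean squared error between the predicted (denoised) image and the
# noise-free target, averaged over all M x N pixels.
import numpy as np

def mse_loss(predicted, target):
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.mean((predicted - target) ** 2)
```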
Here, \(\mathrm{actual\;target}_{i,j}\) is the original noise-free image, \(\mathrm{predicted}_{i,j}\) is the denoised image during training, M is the number of rows, and N is the number of columns. Minimizing the MSE loss was the key performance indicator during training.

$$\mathrm{PSNR} = 10\log_{10}\frac{\mathrm{Max}^2}{\mathrm{MSE\;loss}}\;\mathrm{dB}$$
(6)
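Eq. (6) follows directly from the MSE above; a minimal sketch (the choice of max_val depends on the pixel scale):

```python
# Eq. (6): PSNR in dB, where max_val is the maximum pixel value of the
# noise-free image (1.0 for images scaled to [0, 1], 255 for 8-bit images).
import numpy as np

def psnr(predicted, target, max_val=1.0):
    mse = mse_loss(predicted, target)   # reuses mse_loss from the sketch above
    return 10.0 * np.log10((max_val ** 2) / mse) if mse > 0 else float("inf")
```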
The testing performance was derived from images that were not used for training or validation, where Max is the maximum pixel value of the original noise-free image. The average MSE and PSNR performance was determined at noise levels of 15, 25, and 50.

$$SSIM(x,y) = \frac{\left(2\mu_x\mu_y + c_1\right)\left(2\sigma_{xy} + c_2\right)}{\left(\mu_x^2 + \mu_y^2 + c_1\right)\left(\sigma_x^2 + \sigma_y^2 + c_2\right)}$$
(7)
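In practice, Eq. (7) is usually evaluated with a windowed implementation such as the one in scikit-image; a minimal usage sketch, assuming images scaled to [0, 1]:

```python
# SSIM of Eq. (7), evaluated with scikit-image's windowed implementation.
from skimage.metrics import structural_similarity

def ssim_score(predicted, target, data_range=1.0):
    return structural_similarity(predicted, target, data_range=data_range)
```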
The SSIM value indicates the similarity between the two images and ranges over [0, 1]; it is used here to assess image quality.

$$FOM = \frac{1}{\max\left(\left|G_t\right|, \left|D_c\right|\right)} \sum \frac{1}{1 + k \cdot d_{G_t}^2(p)}$$
(8)
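Eq. (8) is commonly known as Pratt's Figure of Merit; a sketch using a distance transform follows. The scaling constant k is often taken as 1/9, and the binary edge maps \(G_t\) (ground truth) and \(D_c\) (detected) are assumed inputs.

```python
# Eq. (8): Figure of Merit. detected and ground_truth are boolean edge maps;
# d is the distance from each detected edge pixel to the nearest ground-truth
# edge pixel. k = 1/9 is a common choice, not a value stated in the paper.
import numpy as np
from scipy.ndimage import distance_transform_edt

def figure_of_merit(detected, ground_truth, k=1.0 / 9.0):
    detected = detected.astype(bool)
    ground_truth = ground_truth.astype(bool)
    # Distance from every pixel to the nearest ground-truth edge pixel.
    dist = distance_transform_edt(~ground_truth)
    score = np.sum(1.0 / (1.0 + k * dist[detected] ** 2))
    return score / max(detected.sum(), ground_truth.sum())
```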
The quality of edge detection in an image is assessed using the Figure of Merit (FOM). It evaluates the position and magnitude of the detected edges, as well as their accuracy with respect to a ground truth.

Parameter settings

The model's initial learning rate was 1 × 10⁻³. This rate was adapted in response to the validation performance: it was lowered in non-linear steps to 1 × 10⁻⁵ and 1 × 10⁻⁶, down to a minimum of 1 × 10⁻⁷. The L2 regularisation value was 1 × 10⁻⁶. This adaptive variation was carried out by call-back functions after a patience of ten epochs without improvement. The Adam optimizer was used to optimize the model35. For the BSD300 database, the batch size was set to 4, and the model was trained for fifty epochs, iterating through the training loop at 4 images per batch so that all training images were covered. Image augmentation was not used, because the primary goal of this work is to enhance the learning capability of the traditional U-Net model.

Datasets

This paper uses several publicly available datasets; for training, the Berkeley Segmentation Dataset (BSD) was chosen. Its 300 RGB images, of dimensions 481 × 321 or 321 × 481, were split into 240 images for training and 30 images each for validation and testing. The images were resized to 400 × 400 using nearest-neighbour interpolation. The image names were shuffled randomly (random state 42) before the database was divided, and the pixel values of each image were rescaled from [0, 255] to [0, 1]. For a comprehensive experimental comparison, the model was further tested on the Set12, CBSD68, KODAK24, and McMaster datasets. Sample images from the BSD dataset are displayed in Fig. 4.

Fig. 4 Sample images from the BSD dataset.

Experimental setup

Our model was trained on an NVIDIA GeForce GTX 1650 graphics processing unit (GPU), and Colab was used to conduct all of the experiments on a PC equipped with an Intel Core i5-10300H processor and 8 GB of RAM.

Time complexity analysis

The time complexity of the proposed KU-Net is approximated by summing the number of floating-point operations (FLOPs) required by the convolution and deconvolution layers. For an input size of 256 × 256 × 1, a kernel size of 3 × 3, and the variable number of filters through the network, the totals are: encoder blocks, 891,550,720 operations; bottleneck layer, 3,220,758,528 operations; decoder blocks, 1,476,917,376 operations; and output layer, 4,194,304 operations. Summing these gives a total of 5,593,420,928, approximately 5.6 × 10⁹ operations. The throughput of our NVIDIA V100 GPU is rounded to 5.6 GFLOPS (gigaflops per second).
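For reference, per-layer operation counts of this kind can be estimated with the standard convolution FLOPs formula. The sketch below uses the common convention of counting one multiply and one add per kernel weight; exact totals depend on how biases, activations, and pooling are counted, so it will not necessarily reproduce the figures above.

```python
# Approximate FLOPs of a 2-D convolution layer: every output element needs
# k*k*c_in multiply-add pairs for each of the c_out filters. Counting a
# multiply-add as two operations is one convention among several.
def conv2d_flops(h_out, w_out, k, c_in, c_out):
    return 2 * h_out * w_out * k * k * c_in * c_out

# Example: first 3x3 convolution of encoder block B1 on a 256 x 256 x 1 input.
print(conv2d_flops(256, 256, 3, 1, 64))
```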
