Accelerating photoacoustic microscopy by reconstructing undersampled images using diffusion models

In this section, the development of the DiffPam algorithm is outlined, detailing the transition from theory to practice. An overview of DDPMs and their application in image generation is provided. Conditional generation using an unconditional diffusion model is then explained, with regularization techniques incorporated to enhance image reconstruction. The Come-Closer-Diffuse-Faster technique for accelerating the diffusion process is described. Additionally, the datasets used for training and testing are covered, along with the neural network employed as an approximate solution. Finally, the steps of the DiffPam reconstruction algorithm are presented, and the evaluation metrics for assessing image quality are detailed. This methodology aims to accelerate photoacoustic microscopy imaging by reconstructing undersampled images using diffusion models.

Denoising diffusion probabilistic models

DDPMs were first introduced by Sohl-Dickstein et al.23 and later popularized by Ho et al.24 as a parametrized Markov chain trained with variational inference. A Markov process is a stochastic process in which the future depends only on the present. Dhariwal and Nichol25 showed that DDPMs can achieve superior image quality and beat GANs.

The diffusion process in DDPMs is inspired by the thermodynamic process of diffusion. A DDPM comprises forward and backward diffusion processes: the forward process adds Gaussian noise to an image, and the backward process tries to recover the original image from the noisy one (Fig. 1).

Figure 1 Forward \(q(x_{t}|x_{t-1})\) and backward \(p_{\theta }(x_{t-1}|x_{t})\) diffusion processes. \(X_{0}\) is the original image, whereas \(X_{T}\) is pure Gaussian noise as \(T\rightarrow \infty\).

For image generation, the backward process starts from pure Gaussian noise at \(t=T\) and is repeated until \(X_{0}\) is obtained at \(t=0\). In the following sections, we use y to represent the measurement, \(X_{ref}\) for the reference (ground truth) image, and \(X_{0}\) for the generated output. This notation aligns with the conventions used in diffusion and score-based model research to ensure consistency and clarity.

Conditional generation from unconditional diffusion model

The sequential nature of diffusion models allows researchers to manipulate the generation process. This principle has been the focus of several studies that aim to alter the generation process so that an unconditional model produces conditional images. Simply put, a measured image is defined as a noisy version of the actual image transformed by a degradation operation:

$$\begin{aligned} y = Ax+n \end{aligned}$$
(1)
The degradation operation (A) mimics the relation between the measured value (y) and the real value (x), and n represents the measurement noise. The degradation depends on the inverse problem: for super-resolution, the operation is downsampling with Gaussian blurring; for inpainting, it is multiplication by a mask operator.

Choi et al.26 defined a process called Iterative Latent Variable Refinement (ILVR). The proposed algorithm generates \(x'_{t-1}\) from \(x_{t}\) using unconditional generation, then updates the prediction \(x'_{t-1}\) using the condition:

$$\begin{aligned} x_{t-1} = x'_{t-1} + A^{T}(y_{t-1}) - A^{T}A\left( x'_{t-1}\right) \end{aligned}$$
(2)
where \(A^{T}\) is the inverse operation to the degradation, e.g. upsampling for super-resolution. This process can be viewed as a projection onto the desired image manifold. The measurement is considered noiseless (\(n=0\)) in this setting.

Chung et al. (a)27 proposed an improvement to projection-based approaches named Manifold Constraint Gradient (MCG) correction. They devised a series of constraints to ensure that the gradient of the measurement term remains on the manifold.

Projection-based approaches may perform well in some cases, but they have downsides. Since the predicted image is projected onto the measurement, noise in the measurement is amplified at each iteration. Chung et al. (b)28 developed an alternative approach, named Diffusion Posterior Sampling (DPS), that applies a gradient-based correction to the predicted image. In this way, the noise in the measurement is handled during the gradient step, unlike in projection-based methods:

$$\begin{aligned} x_{t-1} = x'_{t-1} - \eta \nabla _{x_{t-1}}||y-A(x'_{0})||^{2}_{2} \end{aligned}$$
(3)
\(\eta\) is the scaling factor for DPS, which determines the strength of the measurement in the final image. Chung et al. (b)28 found experimentally that \(\eta\) should be between 0.1 and 1: smaller values lead to hallucination in the generated images, while larger values may produce artifacts.

The formulation in Eq. (4) illustrates our contribution to DPS conditioning, namely regularization with the \(l_{1}\) norm of the wavelet coefficients. In Eq. (4), \(\mathcal {F}\) denotes the wavelet decomposition and \(\lambda _{W}\) the regularization coefficient for the wavelet term. Experimentally, we set \(\lambda _{W}\) to 10: below this value we observe no regularization effect, while above it the regularization becomes excessive and the resulting smoothing diminishes the quality of the outcome. We selected the B-spline biorthogonal wavelet with 3 and 5 vanishing moments (bior3.5) to achieve sparse representations, and we perform a level-3 discrete wavelet transform using the \(pytorch\_wavelets\) package29 in Python.

$$\begin{aligned} x_{t-1} = x'_{t-1} - \eta \nabla _{x_{t-1}}\left( ||y-A(x'_{0})||^{2}_{2} + \lambda _{W}\Vert \mathcal {F}(x'_{0})\Vert _{1}\right) \end{aligned}$$
(4)
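To make the update in Eq. (4) concrete, the following is a minimal sketch of a single regularized DPS step in a PyTorch setting. The names model (the pre-trained noise predictor), alpha_bar (the cumulative noise schedule), ddpm_step (the unconditional reverse-diffusion step), A, and y are placeholder assumptions standing in for components described in the text, not any published API; only the \(pytorch\_wavelets\) usage follows the stated configuration (level-3 bior3.5).

```python
import torch
from pytorch_wavelets import DWTForward

# Level-3 discrete wavelet transform with the bior3.5 wavelet, as in the text.
dwt = DWTForward(J=3, wave='bior3.5')

def dps_wavelet_update(x_t, t, y, A, model, alpha_bar, ddpm_step, eta=0.5, lam_w=10.0):
    """One reverse step with data fidelity plus l1 wavelet regularization (Eq. 4)."""
    x_t = x_t.detach().requires_grad_(True)
    eps = model(x_t, t)                                   # predicted noise
    a_bar = alpha_bar[t]
    # Estimate of the clean image x'_0 from the current noisy sample.
    x0_hat = (x_t - torch.sqrt(1.0 - a_bar) * eps) / torch.sqrt(a_bar)
    yl, yh = dwt(x0_hat)                                  # low- and high-pass subbands
    l1_wav = yl.abs().sum() + sum(h.abs().sum() for h in yh)
    loss = ((y - A(x0_hat)) ** 2).sum() + lam_w * l1_wav  # fidelity + regularizer
    grad = torch.autograd.grad(loss, x_t)[0]
    x_prev = ddpm_step(x_t, t, eps)                       # unconditional x'_{t-1}
    return (x_prev - eta * grad).detach()
```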
In a similar vein, we utilize DPS conditioning with TV regularization in Eq. (5), where V denotes the TV of the predicted image \(x'_{0}\) and \(\lambda _{TV}\) serves as the regularization coefficient for TV. To calculate the TV norm, we use the \(total\_variation\) function from the torchmetrics library30. The value of \(\lambda _{TV}\) is experimentally set to \(10^{-4}\).

$$\begin{aligned} x_{t-1} = x'_{t-1} - \eta \nabla _{x_{t-1}}\left( ||y-A(x'_{0})||^{2}_{2} + \lambda _{TV}\Vert V(x'_{0})\Vert \right) \end{aligned}$$
(5)
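The TV-regularized variant in Eq. (5) only swaps the regularizer. Under the same assumptions as the sketch above, and assuming a torchmetrics version that exposes \(total\_variation\) functionally, the conditioning loss could read:

```python
from torchmetrics.functional import total_variation

def dps_tv_loss(x0_hat, y, A, lam_tv=1e-4):
    """Conditioning loss of Eq. (5): data fidelity plus a total-variation penalty."""
    return ((y - A(x0_hat)) ** 2).sum() + lam_tv * total_variation(x0_hat)
```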
Come-closer-diffuse-faster

Although pre-trained diffusion models are robust and domain-adaptable, they are computationally expensive: producing even a single image is a time-consuming process. Chung et al. (c)31 claim that for inverse problems, starting the generation from pure Gaussian noise is redundant. By utilizing an estimate of the initial image \(x_{0}\), the backward diffusion process can begin at a timestep \(t \ll T\). They refer to this process as Come-Closer-Diffuse-Faster (CCDF). An initial estimate can be as simple as an interpolation or as fine-detailed as the output of a fully trained neural network, and as the initial estimate improves, the diffusion process can be shortened31. This method allows us to use existing deep-learning models and go beyond their performance.

Dataset

For all experiments, we employ OpenAI's unconditional DDPM model32 trained on the ImageNet 256\(\times\)256 dataset33. Since we use a pre-trained diffusion model, training and validation of diffusion models are beyond the scope of this study.

For testing the proposed method, the Duke PAM34 dataset is used. Duke PAM is an open-source database acquired by an optical-resolution PAM system at a wavelength of 532 nm in the Photoacoustic Imaging Lab at Duke University. This PAM system has a lateral resolution of 5 \(\upmu\)m and an axial resolution of 15 \(\upmu\)m11. The dataset consists primarily of in vivo images of mouse brain microvasculature, from which 20 images are randomly selected as the test set.

The selected diffusion model is trained on 256\(\times\)256 images, so central cropping is applied to bring the input image dimensions to 256\(\times\)256. Images are then normalized by mapping the color range to \([-1,1]\) before being passed to the diffusion model. These normalized images are the ground truth images (\(X_{ref}\)).

At this step, our study is divided into three paths to generate the input images (y):

1.

For super-resolution, a 1/4 downsampling operation combined with low-pass filtering to prevent aliasing is performed using the Resizer Python package developed by Shocher et al.35. The resulting image therefore has 6.25% of the pixels of the original image.

2.

For inpainting with uniform undersampling, four rows out of every five are replaced with zeros, leaving 20% of the original pixels.

3.

For inpainting with random undersampling, 80% of the pixels are replaced with zeros, leaving 20% of the original pixels. (A code sketch of these three degradation paths is given after this list.)

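As referenced in the list above, the following is a minimal sketch of the three degradation paths applied to a \((1, 1, 256, 256)\) ground-truth tensor. The two masking operators follow directly from the descriptions; for the super-resolution path, antialiased bilinear interpolation is used here as a stand-in for the Resizer package35 cited in the text.

```python
import torch
import torch.nn.functional as F

def uniform_mask(x):
    """Keep one row in five (20% of pixels), zeroing the other four."""
    m = torch.zeros_like(x)
    m[..., ::5, :] = 1.0
    return x * m

def random_mask(x, keep=0.2, seed=0):
    """Zero a random 80% of the pixels, keeping 20%."""
    g = torch.Generator().manual_seed(seed)
    m = (torch.rand(x.shape, generator=g) < keep).to(x.dtype)
    return x * m

def downsample_4x(x):
    """1/4 downsampling with low-pass filtering (a stand-in for Resizer35),
    leaving 6.25% of the original pixels."""
    return F.interpolate(x, scale_factor=0.25, mode='bilinear', antialias=True)
```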
In PAM imaging, variations in the step sizes along the fast and slow scanning axes can result in differing resolutions along the x and y axes. Random undersampling maintains the same sparsity level without a fixed pattern, thus reducing potential biases from the original scanning step sizes. The differences in reconstruction quality between uniform and random undersampling can be attributed to inherent biases within the original images. For a fair comparison with random undersampling, uniform downsampling was also performed equally along both the x and y axes.

Each setup corresponds to a distinct real-world scenario, as elaborated in subsequent sections. Researchers are encouraged to adopt the pathway that best aligns with their objectives and experimental setups.

Neural network as an approximate solution

Previous works mainly focused on reconstructing uniformly undersampled PAM images11,12. They approached it as an inpainting problem and used U-Net-shaped neural networks as a solution. In this study, we demonstrate that existing solutions can be improved using our algorithm. To this end, we trained a U-Net model with residual connections, which was subsequently subjected to comparative evaluation and incorporated into our proposed algorithmic framework.

We used the 337 samples in the Duke PAM training set for model training. Each image is randomly cropped at 10 different locations during data loading. For regularization, data augmentation, namely random crops, horizontal and vertical flips with 30% probability, random rotations up to 20\(^\circ\), and random Gaussian blur, is applied to each image in the training set. Images are then normalized to the range \([-1,1]\) and used as the ground truth. Input images are constructed by applying uniform undersampling to the ground truth image, replacing four rows out of every five with zeros and leaving 20% of the original pixels. Empty rows are then filled with bilinear interpolation and used as input to the U-Net model. Figure 2 illustrates an example image with uniform masking and the resulting bilinear interpolation.

Figure 2 An example mouse brain microvasculature image from the Duke PAM dataset. (a) Original (ground truth) image, (b) uniformly undersampled along the x-axis (four rows out of every five are replaced with zeros), and (c) missing pixels filled via bilinear interpolation.

The U-Net model employed in this study comprises convolutional blocks (Conv Blocks) that consist of two-dimensional convolutional layers (Conv2D) followed by Mish activation functions36. The architecture includes three downsampling blocks and corresponding upsampling blocks, with a middle layer incorporating linear attention (Supp. Fig. 1). The network was optimized using the Adam optimizer37 from the PyTorch optimizer package and the \(l_{1}\) loss function with batch size 8. The parameters of the Adam optimizer were \(\beta _{1}=0.9\), \(\beta _{2}=0.999\), \(\epsilon =10^{-8}\), and weight decay \(=0\) (the PyTorch defaults). We used PyTorch version 1.11. Hyperparameter tuning was performed by splitting off 20% of the training data as a validation set. The initial learning rate is set to \(3\times 10^{-4}\), and step scheduling is employed, halving the learning rate every 2000 epochs. Total training runs for 5000 epochs, beyond which we did not observe any substantial performance increase.

The DiffPam reconstruction algorithm

The steps of the algorithm are the following (an end-to-end code sketch follows the list):

The operation (A) transforms the real value (X) into the measured value (y). For super-resolution, the operation is downsampling; for inpainting, it is multiplication by a mask operator.

$$\begin{aligned} y = AX_{ref} \end{aligned}$$
(6)

Optional: an approximate solution to the inverse problem can be supplied. The better we approximate the real value, the fewer diffusion steps are needed. The solution depends on the transform operator A, but in general, interpolation operations or a simple trained neural network can be utilized as an approximator. Depending on the approximate solution, the starting point (\(N<T\)) for the backward diffusion process can be set.

After obtaining \(X_{N}\), an unconditional diffusion model is employed to generate \(X_{N-1}\). Then, using regularized DPS conditioning, the image generation process is iterated in the direction of interest.

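As referenced above, the following sketch ties the steps together: the approximate solution is forward-noised to step N (the CCDF shortcut), and the regularized DPS loop then runs backward to \(t=0\). It reuses the hypothetical names and the dps_wavelet_update sketch introduced after Eq. (4), and is an illustration under those assumptions rather than a transcription of Algorithm 2.

```python
import torch

def diffpam(y, A, model, alpha_bar, ddpm_step, x_init, N=500, eta=0.5, lam_w=10.0):
    """DiffPam sketch with an approximate solution: forward-noise x_init to
    step N, then iterate regularized DPS updates down to t = 0."""
    a_bar_N = alpha_bar[N]
    noise = torch.randn_like(x_init)
    x_t = torch.sqrt(a_bar_N) * x_init + torch.sqrt(1.0 - a_bar_N) * noise
    for t in reversed(range(N)):
        x_t = dps_wavelet_update(x_t, t, y, A, model, alpha_bar, ddpm_step,
                                 eta=eta, lam_w=lam_w)
    return x_t
```

Per the text, N would be 500 with bilinear or bicubic interpolation as x_init and 200 with the pre-trained U-Net output; without an approximate solution, the loop would instead start from pure Gaussian noise at \(N=T\).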
Algorithms 1 and 2 illustrate the DiffPam algorithm without and with an approximate solution, respectively. Table 1 summarizes the experiments conducted in this study regarding the selection of the measurement operations (A), the approximate solutions, and the starting points of the diffusion processes (N).

Table 1 The experiments conducted in this study, regarding the selection of the measurement operations (A), approximate solutions, and the starting point for the diffusion processes (N).

The diffusion model used in this study is trained with \(T=1000\) diffusion steps. Depending on the quality of the approximate solution, we selected \(N=500\) for bilinear and bicubic interpolation and \(N=200\) for pre-trained U-Net outputs. Image production time is highly correlated with the number of sampling steps: selecting \(N=T/k\) reduces the computing time by a factor of k.

Algorithm 1 DiffPam reconstruction algorithm

Algorithm 2 DiffPam with approximate solution

Evaluation

The quality of reconstructed images is evaluated using three key metrics: PSNR, structural similarity index (SSIM), and learned perceptual image patch similarity (LPIPS). PSNR is the logarithm of the ratio between the peak signal value and the mean squared error (MSE) between the reference and the reconstructed images; PSNR goes to infinity when the MSE is zero.

$$\begin{aligned} {\text {MSE}}({\textbf {x}}, {\textbf {y}}) = \frac{1}{mn} \sum _{i=0}^{m} \sum _{j=0}^{n} (\textbf{x}(i,j)-\textbf{y}(i,j))^{2} \end{aligned}$$
(7)
$$\begin{aligned} {\text {PSNR}}({\textbf {x}}, {\textbf {y}}) = 10\log _{10}\frac{\max ^{2}(\textbf{x})}{{\text {MSE}}(\textbf{x}, \textbf{y})} \end{aligned}$$
(8)
On the other hand, SSIM is designed to model image distortion as a combination of three factors: loss of correlation, luminance distortion, and contrast distortion38. Equation (9) is the specific form of the SSIM index used in this work. \(\mu\) and \(\sigma\) correspond to the mean and standard deviation of the given image patches39, with \(C_{1} = (K_{1}L)^{2}\) and \(C_{2} = (K_{2}L)^{2}\), where K is a small parameter and L is the dynamic range of the image pixels. SSIM values range between zero and one: zero means there is no correlation between the images, and one refers to identical images.

$$\begin{aligned} {\text {SSIM}}(\textbf{x}, \textbf{y})=\frac{\left( 2 \mu _x \mu _y+C_1\right) \left( 2 \sigma _{x y}+C_2\right) }{\left( \mu _x^2+\mu _y^2+C_1\right) \left( \sigma _x^2+\sigma _y^2+C_2\right) } \end{aligned}$$
(9)
While PSNR and SSIM are widely used, they may occasionally fall short of human perceptual judgments. In response, LPIPS, developed by Zhang et al.40, leverages deep neural networks to offer a measure of image similarity closer to human perception. Contrary to PSNR and SSIM, lower LPIPS scores mean better image quality. In this study, the scikit-image framework41 is employed to calculate the PSNR and SSIM metrics, whereas the lpips Python package with AlexNet is used to compute LPIPS scores. For the SSIM calculation, image patches of size \(7 \times 7\) are used, with default parameters \(L=255\), \(K_{1} = 0.01\), and \(K_{2} = 0.03\).

In addition to these metrics, we also report the mean absolute error (MAE) to provide further insight into the performance of our method. MAE measures the average magnitude of the absolute differences between the predicted and true values, providing a straightforward interpretation of prediction accuracy. The MAE is defined as:

$$\begin{aligned} {\text {MAE}}({\textbf {x}}, {\textbf {y}}) = \frac{1}{mn} \sum _{i=0}^{m} \sum _{j=0}^{n} |\textbf{x}(i,j)-\textbf{y}(i,j)| \end{aligned}$$
(10)
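A minimal sketch of the evaluation, assuming 8-bit grayscale reference and reconstruction arrays; the scikit-image and lpips calls follow the tools named in the text, while the function and variable names are placeholders.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

loss_fn = lpips.LPIPS(net='alex')  # AlexNet backbone, as stated in the text

def evaluate(ref, rec):
    """ref, rec: uint8 arrays of shape (H, W) with values in [0, 255]."""
    psnr = peak_signal_noise_ratio(ref, rec, data_range=255)
    # win_size=7 with the default K1=0.01, K2=0.03, as in the text.
    ssim = structural_similarity(ref, rec, win_size=7, data_range=255)
    mae = np.abs(ref.astype(np.float64) - rec.astype(np.float64)).mean()
    # LPIPS expects 3-channel tensors scaled to [-1, 1].
    to_t = lambda a: (torch.from_numpy(a).float().div(127.5).sub(1.0)
                      .repeat(3, 1, 1).unsqueeze(0))
    lp = loss_fn(to_t(ref), to_t(rec)).item()
    return psnr, ssim, mae, lp
```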
Our sample size is relatively small (n = 20), and the data distributions do not satisfy the normality condition according to the Kolmogorov-Smirnov test. Therefore, hypothesis testing is performed with non-parametric Wilcoxon signed-rank tests. The Wilcoxon signed-rank test evaluates the null hypothesis that two paired samples come from the same distribution42. Paired tests are designed for situations where each data point in one sample is directly paired with a data point in the other sample, as is the case for our images. The alpha level for statistical significance is set to 0.05.
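The statistical procedure could be sketched with SciPy as follows; scores_a and scores_b are assumed to be paired per-image metric values (n = 20) from two competing methods.

```python
import numpy as np
from scipy.stats import kstest, wilcoxon

def compare(scores_a, scores_b, alpha=0.05):
    """Paired, non-parametric comparison of per-image metric values."""
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    # Normality check (Kolmogorov-Smirnov) that motivated the
    # non-parametric choice in the text.
    z = (diffs - diffs.mean()) / diffs.std(ddof=1)
    _, p_norm = kstest(z, 'norm')
    _, p = wilcoxon(scores_a, scores_b)  # Wilcoxon signed-rank test
    return {'ks_normality_p': p_norm, 'wilcoxon_p': p, 'significant': p < alpha}
```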
