Multi-stage cascade GAN for synthesis of contrast enhancement CT aorta images from non-contrast CT

Dataset

Data acquisition

The dataset used in this study consisted of 124 patients: 62 with aortic dissection and 62 without. The data were collected and manually annotated by specialized imaging physicians, and the study was approved by the institutional review board. Each subject has an NC-CT scan and a corresponding CE-CT scan. The paired scans were acquired at the end of respiration and share the same spatial resolution, ranging from 0.549 × 0.549 × 0.625 mm3 to 0.977 × 0.977 × 1.250 mm3. The tube voltage of the acquisition device was 100–120 kVp. After the original data were obtained, the personal information of each subject was de-identified.

Data preprocessing

Before the data are fed into the model, several preprocessing operations are required. The cross-sectional (axial) images of CE-CT and NC-CT are sharper than the sagittal and coronal images and have higher density resolution, so the following preprocessing procedure is performed on the paired axial sections, as shown in Fig. 4.

For each scanned CE-CT and NC-CT volume, the image was cropped to contain only the abdomen, removing a large amount of background and highlighting the abdominal region. In addition, because subjects differ in body size, the number of scan layers is not consistent; we crop all volumes so that the number of slices is the same and every slice contains the aorta. This step is shown in Fig. 4a to b, where Fig. 4b represents the paired cropped images.

To cope with misalignments caused by respiratory motion, rigid registration is performed on each NC-CT and CE-CT image pair, using the NC-CT image as the reference and mutual information as the similarity metric. This step is shown in Fig. 4b to c, where Fig. 4c represents the paired registered images.

The resolution also differs across subjects because of different acquisition equipment, scanners and parameter settings. For this reason, all aortic-region volumes are resampled to the most frequent resolution, 0.7031 × 0.7031 × 0.625 mm3, by trilinear interpolation. This step is shown in Fig. 4c to d, where Fig. 4d represents the paired resampled images.

Because of metal instruments present during scanning, the NC-CT and CE-CT slices contain metal outliers at the periphery. These outliers were eliminated by thresholding to avoid artifacts after image synthesis, and each image was then normalized to [0, 1]. All normalized images were histogram-matched so that the training images have similar contrast ranges. This step is shown in Fig. 4d to e, where Fig. 4e represents the paired images with outliers eliminated.

Fig. 4 Data preprocessing flow. NC-CT (first row) and corresponding CE-CT (second row).

Making data sets

During the synthesis of CE-CT images from NC-CT images for aortic dissection, the aorta constitutes a relatively small proportion of the extensive field of view encompassing the chest and abdomen. To mitigate the confounding effects of the background and other anatomical structures, we therefore extract the aortas from the paired NC-CT and CE-CT volumes before synthesis, as shown in Fig. 4f. First, professionals manually delineated the aorta in CE-CT images of healthy subjects using ITK-Snap. These manual annotations then served as labels for training a 3D nnU-Net segmentation network33, which automated the extraction of aortic masks from the CE-CT images. Finally, postprocessing was applied to refine the mask borders, and the resulting masks were aligned with their respective NC-CT and CE-CT image pairs to isolate the aortic regions.
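The intensity steps of the preprocessing described above — thresholding metal outliers, normalizing to [0, 1], and histogram matching — can be sketched per slice as follows. This is a minimal NumPy sketch; the clip threshold and the helper name `preprocess_slice` are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def preprocess_slice(img, ref, clip_max=2000.0):
    """Threshold metal outliers, normalize to [0, 1], and histogram-match
    the slice to a reference slice (clip_max is an illustrative threshold)."""
    img = np.clip(img.astype(np.float64), None, clip_max)      # suppress metal outliers
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)   # normalize to [0, 1]

    # Histogram matching: map each intensity quantile of img onto ref's quantiles.
    src_vals, src_counts = np.unique(img.ravel(), return_counts=True)
    ref_vals, ref_counts = np.unique(ref.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_counts) / img.size
    ref_cdf = np.cumsum(ref_counts) / ref.size
    matched_vals = np.interp(src_cdf, ref_cdf, ref_vals)
    return matched_vals[np.searchsorted(src_vals, img.ravel())].reshape(img.shape)
```

Matching against a fixed reference slice ensures, as in the paper, that all training images end up with similar contrast ranges.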
For segmenting the aortic regions in our dataset, we used ITK-Snap version 3.8.0, an advanced tool for medical image segmentation and visualization; more information, including download access, is available at http://www.itksnap.org.

After segmentation, the dataset was randomly split, with 100 pairs allocated for training the aorta synthesis model and a separate 24 pairs reserved for validation. The training set consisted of 50 healthy pairs and 50 aortic dissection pairs. The NC-CT and CE-CT volumes were sliced into a series of 2D images perpendicular to the aortic axis, and axial slices containing no aorta information were discarded. The volumes were then cropped to 256 × 256 × 420 pixels. Finally, coordinate transformation was applied to all volumes to obtain three 2D datasets (axial, sagittal and coronal), as shown in Fig. 5.

Fig. 5 Example of the dataset before and after segmentation on axial, sagittal and coronal views. (a) Without aortic dissection, (b) with aortic dissection.

Implementation details

All experiments were implemented in PyTorch. We trained the model on two NVIDIA GeForce RTX 2080 Ti GPUs with 11 GB of memory each, for 200 epochs. We set the initial learning rate to 2 × 10−4 and used the Adam optimizer with beta1 = 0.5 and a learning-rate decay of 1 × 10−7. The weight \(\lambda\) for the feature matching loss was set to 10.

Experimental comparison

We compared MCGAN with the classic image synthesis models Pix2pix34, CycleGAN35 and Pix2pixHD36, as well as with two recent AD synthesis methods, the Cascaded Deep Learning Framework (CDLF)10 and 3D MTGA11. The comparison was conducted both qualitatively and quantitatively, using the PSNR and SSIM metrics to evaluate the quality of the synthesized images relative to ground truth.

Table 1 shows the quantitative results. MCGAN outperforms the other methods in both PSNR and SSIM, with a PSNR of 32.85 and an SSIM of 0.9899.
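The two evaluation metrics used throughout, PSNR and SSIM, can be computed as below. This is a minimal NumPy sketch for images scaled to [0, 1]; the SSIM shown is the simplified global-statistics form rather than the windowed version typically used in practice, so absolute values will differ slightly from library implementations.

```python
import numpy as np

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    mse = np.mean((x - y) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

def ssim_global(x, y, data_range=1.0):
    """Simplified SSIM using global image statistics (not windowed)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical images give PSNR = ∞ and SSIM = 1; both metrics decrease as the synthesized image departs from the ground truth.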
While Pix2pixHD and CDLF show competitive results, MCGAN demonstrates a slight improvement, suggesting an enhanced ability to capture fine details and structural integrity in the synthesized images. The performance of CycleGAN and Pix2pix is lower, indicating the superiority of the multi-stage cascade structure, which is particularly beneficial for applications requiring high-fidelity image generation.

Table 1 PSNR and SSIM (mean ± standard deviation) of several algorithms.

Fig. 6 Comparative visualization of synthesized CE-CT: axial, sagittal, and coronal views with zoom-in.

Besides the quantitative evaluation, Fig. 6 shows a visual comparison between the five baseline methods and our framework. The images of each view contain a lesion area and its magnification; the first column is the input NC-CT image and the last column is the real CE-CT image used as ground truth. As can be seen in Fig. 6, CycleGAN completely fails to synthesize the intimal flap in the yellow box in all three views of the aorta. This leads to erroneous synthesis results, rendering cases with AD as if they were without AD. Pix2pix is slightly better than CycleGAN in terms of brightness, but it only synthesizes noise in the lesion area and does not synthesize the intimal flap. Pix2pixHD generates images with higher resolution and brightness, but it also lacks the grey-scale detail of the flap. CDLF generates intimal flaps with a degree of blurriness; despite the lack of sharpness, its results are superior to those of 3D MTGA, which may be attributed to the small dataset being insufficient to train a robust 3D MTGA model. Compared with these methods, our MCGAN provides smooth, clear and correct synthesis results for all three views, and synthesizes the intimal flap of the aortic dissection close to the ground truth in all three views.
In particular, on the axial view the features of the curved and irregular intimal flap are well preserved. MCGAN thus captures subtle image details and improves overall performance, achieving better visual quality and less distortion than the other methods.

Ablation study

Effects of cascade structure and number

In our method, shallow and deep NC-CT image features are extracted by different means and fused in a multi-stage cascade. To verify the benefit of the multi-stage cascades, we compared MCGAN with single-stage, two-stage and three-stage cascade networks using the same parameters.

Table 2 shows the results for different numbers of cascades; the first column lists the number of cascades (once, twice, three times and four times). As the number of cascades increases, the number of downsampling modules in the local generator increases while the number of downsampling layers in the global generator decreases. The local generator of MCGAN has three dense downsampling modules, the global generator has two convolutional downsampling layers, and the local and global features are cascaded four times. Cascading three times means that the local generator has two dense downsampling modules and the global generator has three convolutional downsampling layers. Accordingly, cascading once means that no local dense downsampling module is used, and the downsampling process consists entirely of the five convolutional downsampling layers of the global generator. The second and third columns of Table 2 report PSNR and SSIM, respectively. Our four-cascade method outperforms all the others; cascading once attained the worst performance on all evaluation metrics, while cascading twice and three times performed comparably.
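The relationship between the number of cascades and the generator configuration described above can be written down directly: the local generator has one fewer dense downsampling module than the number of cascades, the global generator makes up the remainder of the five downsampling stages. A small sketch (the function name `cascade_config` is an illustrative assumption):

```python
def cascade_config(n_cascades):
    """Local dense downsampling modules and global convolutional downsampling
    layers for a given number of feature cascades (1..4), per the ablation setup."""
    assert 1 <= n_cascades <= 4
    local_dense_modules = n_cascades - 1
    global_conv_layers = 6 - n_cascades
    return local_dense_modules, global_conv_layers
```

This reproduces the configurations in the text: four cascades give (3, 2), three give (2, 3), and one cascade gives (0, 5), with the total downsampling depth fixed at five in every case.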
The reason is that cascading once does not use the dense-connectivity module for downsampling during feature extraction, which causes feature-propagation loss; in addition, the attention module is not used in feature mapping, leaving a semantic gap between the encoder and decoder.

Table 2 PSNR and SSIM (mean ± standard deviation) of different cascade times. Significant values are given in bold.

Fig. 7 Ablation visualization of cascade times: axial, sagittal, and coronal views with zoom-in.

Figure 7 shows axial, sagittal and coronal views of synthesized CE-CT images for a randomly selected subject with AD. With one cascade, the synthesized image looks noisy and insufficiently smooth overall, the aortic edges on the axial view are unclear, and the synthesis of the intimal flap is incomplete; in the coronal view the synthesized intimal flap is discontinuous, and in the sagittal view no intimal flap is synthesized at all. With two cascades, the synthesized aorta is distorted on the axial view and blurred on the coronal view. With three cascades, intimal-flap features were extracted in all three views, but compared with ground truth the features were discontinuous in the axial view, the intimal flap was unclear in the coronal view, and no intimal flap was synthesized in the sagittal view. As can be seen from the ground-truth images in the second row (coronal view), the intensity values of the true and false lumens are inconsistent because of thrombus; as a result, noise and artefacts in the synthesized image increase and interfere with the extraction of intimal features. In the third row (sagittal view), the intimal flap in the lesion area is very thin and difficult to distinguish from the surrounding tissue, and noise may interfere with its recognition and synthesis. Compared with one, two and three cascades, our four-cascade method synthesizes images closer to the ground truth in all three views.
This shows that extracting features both shallowly and deeply, and then integrating them in multi-stage cascades, retains both global and detailed features, demonstrating the necessity and superiority of the cascade design.

Effectiveness of the attention module

Our method uses the DRAB and UNet to obtain features at four scales, and then applies spatial attention (SA) and channel attention (CA) to perform the synthesis; these attention modules therefore play an important role. To verify the necessity of the spatial and channel attention modules in local feature mapping, we synthesized CE-CT images from the same set of NC-CT images without attention, with spatial attention only, and with channel attention only.

The first column of Table 3 lists the attention configuration. "No attention" means direct cascade fusion of the global and local features in the feature-mapping part at each stage, similar to a plain skip connection between encoder and decoder. "Spatial attention only" means that the features at each downsampling scale are processed by spatial attention and then cascaded with the global features. "Channel attention only" means that the downsampled features at each scale are directly cascaded with the global features, which are fused by channel attention before upsampling. The second and third columns of Table 3 report PSNR and SSIM, respectively.

Using dual attention outperforms all the other configurations. No attention performs worst on all evaluation metrics; channel attention only improves on no attention, and spatial attention only performs better than channel attention only. The reason is that directly cascading global and local features without attention fails to emphasize the key information, making the synthesis less effective.
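The two attention mechanisms compared above can be sketched as follows. This is a minimal NumPy sketch in the common pooled-statistics style; the actual SA and CA modules in MCGAN are learned convolutional blocks, and the ordering in `dual_attention` is also an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    """Reweight the channels of a (C, H, W) feature map by pooled statistics,
    emphasizing channels whose responses are strong (e.g. flap-related ones)."""
    c = x.shape[0]
    avg = x.reshape(c, -1).mean(axis=1)   # global average pool per channel
    mx = x.reshape(c, -1).max(axis=1)     # global max pool per channel
    w = sigmoid(avg + mx)                 # per-channel weights in (0, 1)
    return x * w[:, None, None]

def spatial_attention(x):
    """Reweight the spatial positions of a (C, H, W) feature map,
    emphasizing locations such as the lesion region."""
    avg = x.mean(axis=0)                  # average across channels
    mx = x.max(axis=0)                    # max across channels
    w = sigmoid(avg + mx)                 # per-pixel weights in (0, 1)
    return x * w[None, :, :]

def dual_attention(x):
    """Channel attention followed by spatial attention (order assumed)."""
    return spatial_attention(channel_attention(x))
```

Because both weight maps lie in (0, 1), each mechanism suppresses rather than amplifies features, and their combination selects jointly on "which channel" and "where".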
In CT images, different channels can correspond to distinct structures, and aortic dissection can exhibit unique characteristics in specific channels. Through the channel attention mechanism, the model can adaptively adjust the weights of different channels to emphasize features related to the intimal flap and suppress irrelevant ones. With the spatial attention mechanism, the model can better attend to the location and shape variations of the dissection and accurately synthesize the corresponding regions in the enhanced CT images. Combining the two attention mechanisms improves the accuracy of the synthesis, resulting in higher PSNR and SSIM.

Table 3 PSNR and SSIM (mean ± standard deviation) of different attention combinations.

Fig. 8 Ablation visualization of attention combinations: axial, sagittal, and coronal views with zoom-in.

Figure 8 shows axial, sagittal and coronal views of synthesized CE-CT images. With no attention, the intimal flap is blurred or even absent. With spatial attention alone, the generated intimal flap is incomplete and unclear; with channel attention alone, it is insufficiently continuous. The image generated with dual attention is more uniform overall, and the continuity of the intimal flap along its length is also better. Compared with no attention, spatial attention only and channel attention only, MCGAN with dual attention contributes substantially to the performance of image synthesis, indicating the necessity and superiority of dual attention.

Effectiveness of multiscale fusion

The features at the four scales obtained from the cascade of shallow and deep features in the MCGAN generator are fused at multiple scales.
To evaluate the benefit of feature fusion, we ran the GAN with and without it. Without multiscale feature fusion, the feature-mapping process has only one output: each layer of features is upsampled until it reaches the input size and is then emitted as the synthetic image. In contrast, the proposed MCGAN produces feature outputs at four different scales during feature mapping and fuses them, as described in the previous section. Table 4 shows the results; its second and third columns report PSNR and SSIM, respectively. Multi-scale fusion comprehensively exploits local and global features at different scales, improving the richness and diversity of the features, and the fused results are better than those without fusion.

Table 4 PSNR and SSIM (mean ± standard deviation) of multiscale fusion.

Fig. 9 Ablation visualization of multiscale fusion: axial, sagittal, and coronal views with zoom-in.

Figure 9 shows three views of the synthesized CE-CT images. Without feature fusion, the synthesized image exhibits a grid effect and an overall lack of clarity, whereas the image synthesized with multi-scale fusion is richer in detail. Multi-scale feature fusion allows better understanding and reconstruction of the shape, edge and texture of the intimal flap, which improves the accuracy of the synthesis.
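The fusion of the four scale outputs can be sketched as follows. This is a minimal NumPy sketch in which each scale is upsampled to the finest resolution by nearest-neighbour repetition and the results are averaged; the real MCGAN fuses learned feature maps with learned upsampling, so both choices here are illustrative assumptions.

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbour upsampling of a 2D map by an integer factor."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def fuse_multiscale(outputs):
    """Fuse per-scale outputs (finest first, each half the previous size)
    by upsampling everything to the finest resolution and averaging."""
    h, w = outputs[0].shape
    ups = [upsample_nn(o, h // o.shape[0]) for o in outputs]
    return np.mean(ups, axis=0)
```

Averaging upsampled coarse outputs with the fine output is what lets global context smooth out the grid-like artifacts that a single-scale output exhibits.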
