A lighter hybrid feature fusion framework for polyp segmentation

Dataset details

To comprehensively evaluate the performance of the proposed CTHP in polyp segmentation, we conducted experiments on two popular publicly available benchmarks, Kvasir-SEG [12] and CVC-ClinicDB [39].

Kvasir-SEG

Kvasir-SEG consists of 1000 polyp images, each accompanied by a corresponding ground truth mask. The dataset encompasses various polyp morphologies, ensuring diversity in the data. The image resolutions range from 332×487 to 1920×1072 pixels. All ground truth masks were annotated and verified by medical experts.

CVC-ClinicDB

CVC-ClinicDB comprises 612 polyp images along with their corresponding ground truth masks. These images were captured from 31 different video colonoscopy sequences and have a resolution of 384×288 pixels. As with Kvasir-SEG, the ground truth masks were annotated and verified by medical experts.

Following the settings in the literature [9, 12, 17, 39], we divide each of these two datasets into training (80%), validation (10%), and testing (10%) sets.

Implementation details

We implemented the proposed CTHP using PyTorch. Consistent with the settings in the literature [17, 31], all polyp images were resized to a resolution of 352×352 pixels. Various image augmentation techniques were employed, including Gaussian blur, colour jitter, horizontal and vertical flips, and affine transforms. Using a batch size of 16, we trained the model for 200 epochs. The Adam [40] optimizer was used with an initial learning rate of 1e-4. All experiments were conducted on a single NVIDIA RTX 3090 GPU.

Evaluation

We conduct extensive experiments on the proposed CTHP model using the Kvasir-SEG and CVC-ClinicDB benchmarks. The experiments fall into two main parts: single-domain performance evaluation and cross-domain performance evaluation (Kvasir \(\rightarrow\) CVC and CVC \(\rightarrow\) Kvasir). The experimental results are presented in Table 1 and Table 2.
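
The four metrics used in this evaluation (mDice, mIoU, mPrecision, and mRecall) are all derived from pixel-wise counts of true positives, false positives, and false negatives. The following is a minimal sketch, not the evaluation code used for the reported results. It assumes that predictions and ground truth masks are already binarized, that each metric is computed per image and then averaged over the test set, and that a small smoothing constant `eps` guards against division by zero. The function names are hypothetical.

```python
from typing import Dict, List, Sequence

Mask = Sequence[Sequence[int]]  # 2D binary mask: 1 = polyp, 0 = background


def confusion_counts(pred: Mask, target: Mask) -> Dict[str, int]:
    """Count true positives, false positives and false negatives pixel by pixel."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return {"tp": tp, "fp": fp, "fn": fn}


def image_metrics(pred: Mask, target: Mask, eps: float = 1e-8) -> Dict[str, float]:
    """Dice, IoU, precision and recall for a single image (eps is an assumed smoothing term)."""
    c = confusion_counts(pred, target)
    tp, fp, fn = c["tp"], c["fp"], c["fn"]
    return {
        "dice": (2 * tp + eps) / (2 * tp + fp + fn + eps),
        "iou": (tp + eps) / (tp + fp + fn + eps),
        "precision": (tp + eps) / (tp + fp + eps),
        "recall": (tp + eps) / (tp + fn + eps),
    }


def mean_metrics(preds: List[Mask], targets: List[Mask]) -> Dict[str, float]:
    """Average the per-image metrics over a test set (assumed meaning of the 'm' prefix)."""
    per_image = [image_metrics(p, t) for p, t in zip(preds, targets)]
    return {k: sum(m[k] for m in per_image) / len(per_image) for k in per_image[0]}
```

For a 2×2 prediction `[[1, 1], [0, 0]]` against the mask `[[1, 0], [1, 0]]`, there is one true positive, one false positive, and one false negative, so Dice is 2/4 = 0.5 and IoU is 1/3. Dice weights the overlap twice, which is why it is never lower than IoU on the same image.
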
Following the settings in the literature [9, 12, 17, 18, 39], we evaluate the model's performance using the mDice, mIoU, mPrecision, and mRecall metrics to provide a comprehensive assessment. All results of the compared models are taken from their publicly available original papers.

Table 1. Single-domain performance evaluation of CTHP compared to other models. Optimal results are highlighted in bold.

Single-domain performance evaluation

The compared models include U-Net [6], ResUNet [12], ResUNet++ [14], MSRF-Net [26], PraNet [41], MMFIL-Net [42], CAFE-Net [43], SRaNet [44], MGCBFormer [32], APCNet [45], SSFormer-S [17], SSFormer-L [17], and FCBFormer [31]. It is worth noting that SSFormer (SSFormer-S, SSFormer-L) is a transformer-based method, while FCBFormer, MGCBFormer, and APCNet, like CTHP, are hybrid models combining transformers and CNNs. Table 1 shows that CTHP achieves the highest mDice, the most important metric for polyp segmentation, on both benchmarks. For the mIoU, mPrecision, and mRecall metrics, CTHP also demonstrates competitive performance, ranking second. These results demonstrate the effectiveness and superior performance of CTHP.

Table 2. Cross-domain performance evaluation of CTHP compared to other models. Optimal results are highlighted in bold.

Cross-domain performance evaluation

Following the settings in the literature [17, 18, 31], we conduct experiments on both benchmarks to test the generalization performance of CTHP under different data distributions. Specifically, with the original data splits maintained, Kvasir \(\rightarrow\) CVC refers to training on Kvasir-SEG and testing on CVC-ClinicDB, and CVC \(\rightarrow\) Kvasir refers to the reverse. Table 2 shows that, when deployed on data with a different distribution, CTHP outperforms the other methods on all tested metrics.
This can be attributed to the effectiveness of the feature complementarity strategy and the robustness of the hybrid model.

Table 3. Ablation studies of CTHP evaluated in the single-domain setting on Kvasir-SEG and CVC-ClinicDB. Optimal results are highlighted in bold.

Ablation studies

We conducted ablation experiments on the proposed CTHP on Kvasir-SEG and CVC-ClinicDB, as shown in Table 3. We adopted an incremental experimental approach. Baseline-T refers to the transformer baseline with PVT as the backbone network. TP denotes the Transformer Paradigm, which uses the modified patch embedding and self-attention computation methods and is equipped with the IPM module. CP refers to the Convolutional Paradigm. \(TP^{\#}\) is a variant of TP in which the IPM module is replaced by convolutional layers. \(TP^{\diamondsuit }\) is a variant in which position biases are not used in the attention calculations, and \(TP^{\nabla }\) is a variant in which self-attention replaces axial attention. Similarly, \(CP^{\#}\) is a variant of CP in which the IPM module is replaced by convolutional layers.

Comparing No.0 and No.1 reveals the effectiveness of improving the Transformer backbone. Comparing No.1 with No.2, and No.5 with No.6, validates the effectiveness of the IPM module in addressing attention disappearance and maintaining prediction stability. Additionally, comparing No.1 with No.3 and No.4 shows that the variant using axial attention with position biases slightly outperforms the variant using self-attention, while the variant without position biases shows a slight decrease in accuracy, which further confirms the effectiveness of our improvements. Comparing No.1 and No.6 reveals the effectiveness of the hybrid model strategy: the local information mining capability of the CNN and the long-range modeling capability of the Transformer complement each other and further enhance model performance. Comparing No.6 and No.7 reveals the effectiveness of the feature fusion strategy.
For the majority of metrics, CTHP with predicted feature fusion outperforms TP+CP without feature fusion.

Fig. 6. Visualization of segmentation results of CTHP compared with those of other models. All original images are from Kvasir-SEG. All models in the comparison used variants pre-trained on Kvasir-SEG.

Qualitative analysis

To demonstrate the segmentation ability of the proposed CTHP more intuitively, we selected some typical segmentation prediction maps for display and compared them with the segmentation results of other models, as shown in Fig. 6. Note that, for fair comparison, some of the prediction maps from other models are taken from their original papers. It can be observed that, compared to existing models, CTHP can better segment the boundaries of polyp regions, forming clearer and more accurate decision boundaries. Moreover, for polyp images with significant internal variations, CTHP can provide stable segmentation results, demonstrating its stability for polyps with different morphologies.

Table 4. The complexity and efficiency comparison of CTHP on Kvasir-SEG. Optimal results are highlighted in bold.

Complexity and efficiency comparison

To demonstrate the impact of the proposed model design strategy on the efficiency of the proposed CTHP, we conduct a comparative analysis of model complexity and efficiency on the Kvasir-SEG dataset. The results are presented in Table 4. Metrics such as floating-point operations (FLOPs), the number of parameters (Parameters), inference time, and mDice are employed to comprehensively evaluate the complexity, computational efficiency, and accuracy of the models being compared. Overall, CNN-based models (e.g., U-Net [6], UNet++ [13], ACSNet [24], PraNet [41]) have fewer parameters and faster inference speeds compared to hybrid models. However, their accuracy is somewhat lacking.
On the other hand, although hybrid models (e.g., TransFuse [20], SSFormer [17], FCBFormer [31], MGCBFormer [32]) exhibit higher accuracy, they are generally larger and slower at inference. Among all the methods compared, the proposed CTHP strikes the best balance between complexity, efficiency, and accuracy. Within the hybrid models, CTHP features relatively low to moderate FLOPs and parameter counts, the shortest inference time, and the highest accuracy, thereby validating the effectiveness of our model design strategy.

Fig. 7. Illustration of some failure cases output by CTHP.

Failure case discussions

Although the proposed CTHP model generally achieves high accuracy in locating and segmenting most polyp samples, it can still produce errors or omissions in edge segmentation in some particularly challenging scenarios. Some failure cases are illustrated in Fig. 7. As observed, when shadows or reflections blur the polyp edges (e.g., the second example in the first row and the first example in the second row), CTHP may make incorrect edge predictions or even miss polyp tissue, especially when the image contains multiple polyps with significant size variations.

To address these shortcomings, we believe that incorporating an image denoising module could enhance the model's performance. By reducing strong reflections or increasing contrast, the edges of polyp tissues could be made clearer, thereby improving the model's robustness in these challenging scenarios. This is an area we plan to refine in our future work.
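
To make the idea of reducing strong reflections and increasing contrast concrete, the sketch below applies two elementary operations to a grayscale image: clipping very bright pixels, then linearly stretching the remaining intensity range. It is only an illustration of the kind of preprocessing discussed above, not the denoising module planned for CTHP. The function names, the `[0, 1]` intensity range, and the `0.9` threshold are all assumptions made for this example.

```python
from typing import List

Image = List[List[float]]  # grayscale intensities, assumed to lie in [0, 1]


def suppress_highlights(img: Image, threshold: float = 0.9) -> Image:
    """Clip pixels brighter than `threshold`, a crude stand-in for reflection removal."""
    return [[min(v, threshold) for v in row] for row in img]


def stretch_contrast(img: Image) -> Image:
    """Linearly rescale intensities so the darkest pixel maps to 0 and the brightest to 1."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [[0.0 for _ in row] for row in img]
    return [[(v - lo) / (hi - lo) for v in row] for row in img]


def preprocess(img: Image, threshold: float = 0.9) -> Image:
    """Suppress highlights first so that they do not dominate the contrast stretch."""
    return stretch_contrast(suppress_highlights(img, threshold))
```

Clipping before stretching matters: a single saturated reflection would otherwise set the upper end of the intensity range and compress the contrast of the surrounding tissue.
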
