An attentional mechanism model for segmenting multiple lesion regions in the diabetic retina

Figure 2. Network structure.

The proposed MSAG network is designed to enhance segmentation in DR images. As Figure 2 illustrates, the MSAG architecture consists of three primary components:

Backbone (HRNet-OCR): acts as the foundational layer for processing images and extracting features, essential for identifying detailed retinal characteristics.

Spatial attention gated mechanism: integrates semantic feature mapping with a pooling operation to merge spatial information features. Activation through a sigmoid function produces spatial attention maps, focusing the model on relevant lesion areas.

MSAG inference structure: employs a hierarchical computation method allowing for selective use of prediction scales. This flexibility improves lesion localization by optionally incorporating higher scale predictions.

HRNet, developed by CUHK and Microsoft Research Asia and introduced at CVPR 2019, is notable for fusing low- and high-resolution features in parallel, in contrast to the serial fusion common in other networks. This design continuously integrates features across scales without reconstructing high-resolution representations from low-resolution inputs, enabling position-sensitive feature fusion. In segmentation tasks, HRNet achieves faster inference than both PSPNet and DeepLabv3, owing to its efficient feature-integration process.

In this work, HRNet is used as the core for semantic information extraction and is combined with Object-Contextual Representation (OCR) for context processing, together forming the backbone of the MSAG. The MSAG also incorporates U-Net-like skip connections to preserve context over long distances and introduces the Spatial Attention Gate (SAG) module, which refines noise suppression and spatial feature extraction within HRNet, enhancing semantic accuracy.

The MSAG model processes inputs at two scales: the original image size (Scale2) and a version downsampled by a factor of 2 (Scale1). The Semantic Head (Seg), which follows the OCR module, handles lesion segmentation, while the Attention Head (Head) generates attention maps. These maps undergo sigmoid activation and are then bilinearly upsampled to the original image size for precise segmentation. The SAG unit, a key innovation of this study, is detailed in the next section, with emphasis on its role in optimizing attention within the segmentation process.

Spatial attention gate

In contrast to traditional spatial attention modules33 and attention gates30, the SAG integrates feature mappings directly from the base network, using both max pooling (MP) and average pooling (AP), and employs a hierarchical fusion strategy that enhances spatial feature extraction and noise suppression. The feature channels from two distinct stages are first harmonized with a \(1\times 1\) convolution, after which spatial information is extracted through both AP and MP. Experiments with the CBAM model33 showed that max-pooled features, which encode the most salient parts, complement average-pooled features, which softly encode global statistics. Moreover, most of the connections between modules in DenseNet employ average pooling to reduce dimensionality, which helps pass information to the subsequent module for feature extraction. Consequently, we use AP for feature extraction in the lower layer (\(H_g\times W_g\times C_g\)), which reduces the number of parameters while retaining the most informative features, whereas the upper layer (\(H_p\times W_p\times C_p\)), which contains a greater proportion of less useful information, uses MP for key feature selection. The extracted features are then passed through a sigmoid activation to generate spatial attention maps that focus on relevant areas within the image. Figure 3 illustrates the SAG architecture and its role in directing the model's attention to critical spatial details.

Figure 3. Spatial attention gate and head unit.
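As a concrete illustration, the following is a minimal PyTorch sketch of such a gate. The class name, channel sizes, concatenation-based fusion, and stride-2 pooling window are assumptions inferred from the text and Figure 3, not taken from a released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionGate(nn.Module):
    """Sketch of the SAG: harmonize channels, pool, fuse, apply sigmoid."""

    def __init__(self, c_low: int, c_up: int, c_mid: int = 64):
        super().__init__()
        # 1x1 convolutions harmonize the channel counts of the two stages.
        self.conv_low = nn.Conv2d(c_low, c_mid, kernel_size=1)
        self.conv_up = nn.Conv2d(c_up, c_mid, kernel_size=1)
        # Collapse the fused features to a single-channel attention map.
        self.fuse = nn.Conv2d(2 * c_mid, 1, kernel_size=1)

    def forward(self, f_low: torch.Tensor, f_up: torch.Tensor) -> torch.Tensor:
        # Lower layer (H_g x W_g x C_g): average pooling retains global statistics.
        a = F.avg_pool2d(self.conv_low(f_low), kernel_size=2)
        # Upper layer (H_p x W_p x C_p): max pooling keeps the most salient responses.
        m = F.max_pool2d(self.conv_up(f_up), kernel_size=2)
        # Align spatial sizes before fusion (the paper does not fix this detail).
        m = F.interpolate(m, size=a.shape[-2:], mode="bilinear", align_corners=False)
        # Sigmoid yields a spatial attention map with values in [0, 1].
        return torch.sigmoid(self.fuse(torch.cat([a, m], dim=1)))
```

For example, `SpatialAttentionGate(c_low=48, c_up=96)(f_low, f_up)` returns a one-channel map that can be broadcast over the gated feature channels.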
After the spatial gating unit, the Head section assimilates global spatial features for semantic analysis and prediction. It uses a hierarchical fusion strategy to discern the attention mask across adjacent scales, with training confined to pairs of neighbouring scales. Figure 4 contrasts an original, non-downsampled image with one reduced by a factor of 2, although this downsampling factor is adjustable. The fusion technique is designed to capture the attention dynamics between scale pairs, improving the network's ability to adapt to scale variation and detect features, as highlighted in MS-OCRNet32.

Figure 4. Training of gated attention mechanisms in multi-scale space.

Figure 5 depicts the hierarchical spatial gated attention mechanism model. \(F^w\) denotes the semantic prediction obtained after the input image passes through the shared backbone network and the contextual environment capturer; \(F^i\) denotes the semantic feature output by the i-th stage, \(1<i\le 4\); q denotes the weights output by the gated attention operation; and \(F^\alpha\) denotes the attention mask obtained by convolving the weighted features. During training, the processed input image is scaled by a factor r, where \(r=0.5\) means downsampling by 2, \(r=2.0\) means upsampling by 2, and \(r=1\) means no scaling. We select the 0.5 and 1.0 scale images for training, denoted \(F_{r=0.5}\) and \(F_{r=1}\), respectively. Passing these through the shared backbone network and the contextual environment capturer yields the semantic predictions \(F_{r=1}^w\) and \(F_{r=0.5}^w\); at the i-th stage the backbone produces the semantic feature \(F_{r=0.5}^i\), which is passed through the spatial gated attention to yield the weights \(q_{r=0.5}^i\). This process captures and refines semantic information at multiple scales, as expressed by the gated attention computation in Eq. (1):

$$\begin{aligned} & q_{r=0.5}^i = \sigma _{1}[M^i;A^i] = \sigma _{1}[M(f^{1\times 1}(F_{r=0.5}^i));A(f^{1\times 1}(F_{r=0.5}^{i-1}))] \end{aligned}$$

(1)
The feature maps generated by SAG fusion are fed into the Head's attention mechanism to derive the attention weight mappings \(q_{r=0.5}^i\). Here, \(\sigma _{1}\) denotes the ReLU activation function, \(\sigma _{2}\) the sigmoid activation function, and \(f^{1\times 1}\) convolution with a \(1\times 1\) kernel, while M and A denote the max pooling and average pooling operations, respectively. This procedure computes the attention weights, focusing the analysis on pertinent image regions:

$$\begin{aligned} & F_{r=0.5}^\alpha = F_{r=0.5}^i\cdot \sigma _{2} (f^{1\times 1}(q_{r=0.5}^i )) \end{aligned}$$

(2)
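Read together, Eqs. (1) and (2) say that ReLU over the concatenated pooled features gives the weights q, and a sigmoid-activated \(1\times 1\) convolution of q then gates the stage feature. A minimal functional sketch, assuming the two stage features share a spatial size and that `w_m`, `w_a`, and `w_q` are learnable \(1\times 1\) kernels (placeholder names):

```python
import torch
import torch.nn.functional as F

def gated_attention(f_i, f_prev, w_m, w_a, w_q):
    # Eq. (1): q = ReLU([MaxPool(conv1x1(F^i)); AvgPool(conv1x1(F^{i-1}))]).
    m = F.max_pool2d(F.conv2d(f_i, w_m), kernel_size=2)
    a = F.avg_pool2d(F.conv2d(f_prev, w_a), kernel_size=2)
    q = F.relu(torch.cat([m, a], dim=1))
    # Eq. (2): F^alpha = F^i * sigmoid(conv1x1(q)); restore the spatial size first.
    alpha = torch.sigmoid(F.conv2d(q, w_q))
    alpha = F.interpolate(alpha, size=f_i.shape[-2:], mode="bilinear",
                          align_corners=False)
    return f_i * alpha
```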
Equation (3) outlines the weight assignment, which integrates the attention mappings with high-level semantic information. Specifically, the attention mapping at scale \(r=0.5\) is multiplied element-wise with the high-level semantic feature \(F_{r=0.5}^w\), providing targeted semantic enhancement. The complementary mask \(1-F_{r=0.5}^\alpha\) is multiplied element-wise with the neighbouring-scale information \(F_{r=1}^w\), and the two terms are summed to produce the refined output \(F_{r=1}^s\). Here, Up denotes linear interpolation up to the target resolution.

$$\begin{aligned} & F_{r=1}^s = Up(F_{r=0.5}^w \cdot F_{r=0.5}^\alpha )+((1-Up(F_{r=0.5}^\alpha ))\cdot F_{r=1}^w) \end{aligned}$$

(3)
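In code, Eq. (3) is a mask-weighted blend of the two scales. A short sketch, where `pred_half`, `alpha_half`, and `pred_full` are placeholder names for \(F_{r=0.5}^w\), \(F_{r=0.5}^\alpha\), and \(F_{r=1}^w\):

```python
import torch.nn.functional as F

def fuse_scales(pred_half, alpha_half, pred_full):
    """Eq. (3): blend the 0.5x-scale prediction into the 1.0x-scale one."""
    def up(t):  # Up(.): bilinear upsampling to the 1.0x resolution
        return F.interpolate(t, size=pred_full.shape[-2:],
                             mode="bilinear", align_corners=False)
    # Up(F^w * F^alpha) + (1 - Up(F^alpha)) * F^w_{r=1}
    return up(pred_half * alpha_half) + (1.0 - up(alpha_half)) * pred_full
```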
MSAG inference

During inference, the learned attention is applied hierarchically, as depicted in Fig. 5: attention is integrated through a series of computations across N prediction scales, refining predictions at multiple levels of detail.

Figure 5. Inference for hierarchical multi-scale spatial gated attention mechanisms.

In the inference phase, using a 2.0 scale simplifies the process: up-sampled images (2.0\(\times\)) are fed directly into the pre-trained base network, with no need to merge training data from the 0.5 and 1.0 scales. The initial segmentation prediction is taken from this base-network output and is then refined by element-wise multiplication with the attention mechanism's down-sampled prediction; layer-by-layer summation of these predictions yields the final segmentation. This methodology prioritizes lower scales for detail while incrementally incorporating higher-scale data for broader context, allowing prediction locations to be refined.

The approach offers two key benefits: scale flexibility and the systematic integration of higher-scale data. It enables the model to incorporate new scales (0.25\(\times\), 2.0\(\times\)) beyond the ones used in training (0.5\(\times\), 1.0\(\times\)), lifting the common restriction of models to their training scales. Moreover, the hierarchical structure improves training efficiency: because the spatially gated attention adds little training complexity, training at only the 0.5 and 1.0 scales suffices, reducing training demands.
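A minimal sketch of this hierarchical inference loop, assuming a `model` that returns a (segmentation logits, attention mask) pair for an input image; the scale list, function names, and blending order are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def msag_inference(model, image, scales=(0.5, 1.0, 2.0)):
    h, w = image.shape[-2:]

    def to_full(t):  # bring any prediction back to the original resolution
        return F.interpolate(t, size=(h, w), mode="bilinear", align_corners=False)

    pred, alpha = None, None
    # Start at the lowest scale and fold each higher scale in via Eq. (3).
    for r in sorted(scales):
        x = image if r == 1.0 else F.interpolate(
            image, scale_factor=r, mode="bilinear", align_corners=False)
        logits, a = model(x)
        logits, a = to_full(logits), to_full(a)
        if pred is None:
            pred = logits
        else:
            # The mask from the previous (lower) scale gates the running prediction.
            pred = alpha * pred + (1.0 - alpha) * logits
        alpha = a
    return pred
```

Because only neighbouring scale pairs are fused at each step, unseen scales such as 0.25\(\times\) can, in this sketch, simply be appended to `scales` at inference time without retraining.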
