Lightweight monocular depth estimation using a fusion-improved transformer

Integration of Atrous convolution residual modules and an enhanced transformer for a monocular depth estimation network

The model in this paper consists of an encoder and a decoder, adopting the classic encoder-decoder structure, as illustrated in Fig. 1. In the encoder, a fusion of convolution and transformer operations is used to extract image features, and multiscale features are aggregated in the encoding layer across four stages.

In the first stage, the input image is passed through the Conv-stem convolution module, which consists of two convolutional layers with 3 × 3 kernels and a stride of 2. The image undergoes two convolutions for downsampling and local feature extraction, producing feature maps of size \(W/2 \times H/2 \times C\). To compensate for the loss of spatial information caused by the change in feature scale, this paper uses ResNet18 for initial feature extraction from the input image. The extracted features are then passed through PoseNet, which outputs the camera's rotation matrix R and translation vector t for estimating the camera's pose. Finally, the extracted results are concatenated with an average pooling module, enabling the network to acquire more spatial information from the original image and thus a better understanding of the context and position of the target. Subsequently, the feature maps are downsampled by a 3 × 3 convolution with a stride of 2, yielding feature maps of size \(H/4 \times W/4 \times C\). From stage two to stage four, ACR modules and local-global transposed self-attention modules are used in each stage to extract features at different scales. These features are concatenated with the output of the pooling module and fed into the next stage, finally producing feature maps of size \(H/8 \times W/8 \times C\) and \(H/16 \times W/16 \times C\). In the decoder, only one convolutional layer is used to fuse features, further reducing the overall computational burden of the depth estimation network. Finally, the inverse depth maps at different resolutions are output through bilinear upsampling and prediction heads.

Fig. 1 Overall structure of the self-supervised monocular depth estimation network.

Atrous convolution residual module

The encoding layer adopts a shallow CNN for training to effectively reduce the model size and the number of training parameters. However, shallow CNNs have a limited receptive field. The proposed ACR module is introduced to improve local feature extraction. This module uses depthwise-separable convolutions instead of traditional convolutions to extract image features. A depthwise-separable convolution consists of a depthwise convolution and a pointwise convolution: the depthwise convolution extracts spatial features within each channel, while the pointwise (1 × 1) convolution combines information across channels.
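To make this factorization concrete, the following is a minimal PyTorch sketch of a depthwise-separable convolution of the kind used in the encoder; the channel counts and the parameter-count comparison are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv (per-channel spatial filtering) followed by a
    pointwise 1x1 conv (cross-channel mixing)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2  # keep the spatial size
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=pad, dilation=dilation,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Example: parameter count vs. a standard 3x3 convolution (64 -> 64 channels).
x = torch.randn(1, 64, 96, 320)
dsc = DepthwiseSeparableConv(64, 64)
std = nn.Conv2d(64, 64, 3, padding=1, bias=False)
print(sum(p.numel() for p in dsc.parameters()))  # 64*9 + 64*64 = 4672
print(sum(p.numel() for p in std.parameters()))  # 64*64*9   = 36864
print(dsc(x).shape)                              # torch.Size([1, 64, 96, 320])
```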
The feature extraction ability of the model is improved by expanding the number of feature channels through linear modules, introducing nonlinear transformations, reducing the computational cost, and capturing image information, thereby fully leveraging the advantages of depthwise-separable convolution. The ACR module is illustrated in Fig. 2.

Fig. 2 Several ACR modules with different dilation rates are inserted into different stages, and the ACR modules are repeated according to the stage to achieve multiscale fusion and aggregation of the local context.

The feature X with dimension \(H \times W \times C\) is used as the input, and the output of the ACR module is as follows:

$$\hat {X}=X+Linear(Gelu(BN(Dconv(Linear(X)))))$$
(1)
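Before unpacking the notation, here is a minimal PyTorch sketch of Eq. (1); the channel expansion ratio and the dilation rate are illustrative assumptions, the 1 × 1 convolutions stand in for the channel-wise linear layers, and Dconv is implemented as a depthwise convolution since the surrounding linear layers already provide the pointwise mixing.

```python
import torch
import torch.nn as nn

class ACRBlock(nn.Module):
    """Atrous convolution residual block, Eq. (1):
    X_hat = X + Linear(GELU(BN(DConv(Linear(X)))))."""
    def __init__(self, channels, expansion=2, dilation=2):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Conv2d(channels, hidden, 1)            # Linear (expand channels)
        self.dconv = nn.Conv2d(hidden, hidden, 3,
                               padding=dilation, dilation=dilation,
                               groups=hidden, bias=False)       # 3x3 depthwise, dilation d
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.GELU()
        self.reduce = nn.Conv2d(hidden, channels, 1)            # Linear (restore channels)

    def forward(self, x):
        return x + self.reduce(self.act(self.bn(self.dconv(self.expand(x)))))

# Sanity check: the block preserves the feature-map shape.
x = torch.randn(2, 64, 48, 160)
print(ACRBlock(64, dilation=3)(x).shape)  # torch.Size([2, 64, 48, 160])
```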
where \(Linear\) denotes a linear transformation that expands the feature channels, \(Dconv\) represents a 3 × 3 depthwise-separable convolution with a dilation rate of d, \(BN\) denotes a batch normalization layer, and \(Gelu\) denotes the activation function; the output is finally obtained by restoring the channel dimension through a fully connected layer.

Using a shallow CNN and enlarging the receptive field can better capture the global information in the image; however, it may fail to capture fine local structures, leading to the loss or neglect of detailed features. To further improve performance, this paper introduces a pooling cascading strategy18. This module is constructed from an average pooling module and a 1 × 1 convolution, which cascade multiscale image features after each downsampling step. The pooling module helps to retain critical information while reducing dimensionality and enhances the perception of features at different scales through multiscale fusion. By introducing the pooling cascading strategy, the model captures local features in the image more precisely while maintaining global information, further improving performance.

Local-global transposed transformer block

Because the computational complexity of self-attention grows quadratically with the input resolution, existing vision transformers are difficult to apply directly to high-resolution visual tasks such as depth estimation. This paper introduces the MDTA module to alleviate this issue: it replaces spatial self-attention with transposed self-attention, significantly reducing the computational burden and addressing this shortcoming of the original transformer architecture. The MDTA module applies self-attention across channels and computes cross-channel covariances to generate attention maps that encode the global context, thereby reducing the dimensionality of the attention computation. As another key component of MDTA, a depthwise-separable convolution is introduced after the linear layer to emphasize the local context before the feature covariances are computed to generate the global attention map. This helps the transformer capture relationships across the spatial dimensions of the input, enabling better handling of contextual information in the input sequence and improving model performance. Figure 3 illustrates the structure of the MDTA module.

Given an input feature with dimension \(H \times W \times C\), it is flattened into an image sequence of size \(N \times C\), where \(H \times W\) is the image resolution, \(N = H \times W\) is the total number of pixels in the input space, and C is the number of image channels. Through fully connected layers and 3 × 3 depthwise-separable convolutions, the spatial context is encoded channelwise, producing a query matrix \({\mathbf{Q}}=W_{d}^{Q}W_{L}^{Q}{\mathbf{X}}\), a key matrix \({\mathbf{K}}=W_{d}^{K}W_{L}^{K}{\mathbf{X}}\) and a value matrix \({\mathbf{V}}=W_{d}^{V}W_{L}^{V}{\mathbf{X}}\), each of dimension \(N \times C\), where \(W_{L}^{{( \cdot )}}\) denotes the fully connected layers and \(W_{{\text{d}}}^{{( \cdot )}}\) denotes the 3 × 3 depthwise-separable convolutions. The self-attention mechanism can therefore be expressed as:

$${\mathbf{\hat {X}}}=Attention({\mathbf{Q}},{\mathbf{K}},{\mathbf{V}})+{\mathbf{X}}$$
(2)
$$Attention({\mathbf{Q}},{\mathbf{K}},{\mathbf{V}})={\mathbf{V}} \cdot Softmax\left( {{{\mathbf{K}}^T} \cdot {\mathbf{Q}}} \right)$$
(3)
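The transposed (channel-wise) attention of Eqs. (2) and (3) can be sketched in PyTorch as follows, using a single head for clarity; the temperature scaling, projection shapes, and matrix orientation (equivalent to Eq. (3) up to transposition) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TransposedAttention(nn.Module):
    """Channel-wise (transposed) self-attention, Eqs. (2)-(3).
    Attention is computed over the C x C covariance of Q and K
    instead of the N x N spatial matrix."""
    def __init__(self, channels):
        super().__init__()
        self.qkv_linear = nn.Conv2d(channels, channels * 3, 1)      # W_L: pointwise "linear"
        self.qkv_dconv = nn.Conv2d(channels * 3, channels * 3, 3,
                                   padding=1, groups=channels * 3)  # W_d: 3x3 depthwise
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv_dconv(self.qkv_linear(x)).chunk(3, dim=1)
        q, k, v = q.flatten(2), k.flatten(2), v.flatten(2)           # each b x C x N
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # b x C x C
        out = (attn @ v).view(b, c, h, w)    # weighted sum over channels
        return self.out(out) + x             # residual connection, Eq. (2)

x = torch.randn(1, 64, 48, 160)
print(TransposedAttention(64)(x).shape)  # torch.Size([1, 64, 48, 160])
```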
where \({\mathbf{X}}\) and \({\mathbf{\hat {X}}}\) represent the input and output feature maps, respectively. Compared with the original self-attention mechanism, this reduces the computational complexity from \(\mathcal{O}\left( {{N^2}/h}+Nd \right)\) to \(\mathcal{O}\left( {{d^2}/h}+Nd \right)\), where d is the vector dimension and h is the number of attention heads.

Fig. 3 Local-global transposed transformer block.

Additionally, this paper proposes a TSGFN to achieve better contextual interactions. Similar to MDTA, the TSGFN introduces depthwise-separable convolutions to encode information from spatially adjacent pixel positions. The structure of the TSGFN module is also illustrated in Fig. 3. In contrast to a regular multilayer perceptron (MLP) feedforward network, which updates the activations using only the current feature and treats each feature independently, the TSGFN controls the information flow through the corresponding hierarchical levels in the pipeline. This allows each level to focus on fine details complementary to the other levels, updating the current feature in two steps and facilitating better contextual interactions across the entire feature hierarchy.

The input feature map is expanded through a fully connected layer to increase the number of feature channels, and image features are then extracted with a 3 × 3 depthwise-separable convolution to obtain \({\mathbf{X}}\). The feature map is updated in two steps. First, the features are split into two parts, \({{\mathbf{X}}_{\mathbf{f}}}\) and \({{\mathbf{X}}_{\mathbf{b}}}\), along the channel dimension. The part \({{\mathbf{X}}_{\mathbf{b}}}\) is passed through the Gaussian error linear unit (GELU) activation function and multiplied with \({{\mathbf{X}}_{\mathbf{f}}}\), yielding the first-step update \({\mathbf{\tilde {X}}}\), which is then passed through a fully connected layer. In the same manner, the former half \({{\mathbf{X}}_{\mathbf{f}}}\) of the original input feature map is used to update the latter half \({{\mathbf{\tilde {X}}}_{\mathbf{b}}}\) of the current feature in the second step. Finally, the image features are restored to their original dimension and output as \({\mathbf{\hat {X}}}\). The TSGFN can be formulated as follows:

$${\mathbf{X}}=DWConv(Linear({\mathbf{X}}))$$
(4)
$${\mathbf{\tilde {X}}}=Linear(X\left[ {Gelu({{\mathbf{X}}_{\mathbf{b}}}) \odot {{\mathbf{X}}_{\mathbf{f}}},{{\mathbf{X}}_{\mathbf{b}}}} \right])$$
(5)
$${\mathbf{\hat {X}}}=Linear({\mathbf{X}}\left[ {{{{\mathbf{\tilde {X}}}}_{\mathbf{f}}},Gelu({{\mathbf{X}}_{\mathbf{f}}}) \odot {{{\mathbf{\tilde {X}}}}_{\mathbf{b}}}} \right])$$
(6)
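A minimal PyTorch sketch of the TSGFN defined by Eqs. (4)-(6) follows; the expansion ratio and the use of 1 × 1 convolutions as the linear layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TSGFN(nn.Module):
    """Two-step gated feed-forward network, Eqs. (4)-(6)."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion               # even, so it splits into two halves
        self.expand = nn.Conv2d(channels, hidden, 1)                          # Linear (expand)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # DWConv, Eq. (4)
        self.mid = nn.Conv2d(hidden, hidden, 1)     # Linear after the first gated update, Eq. (5)
        self.out = nn.Conv2d(hidden, channels, 1)   # Linear restoring the channels, Eq. (6)
        self.act = nn.GELU()

    def forward(self, x):
        x = self.dwconv(self.expand(x))             # Eq. (4): X = DWConv(Linear(X))
        xf, xb = x.chunk(2, dim=1)                  # split along the channel dimension
        x_tilde = self.mid(torch.cat([self.act(xb) * xf, xb], dim=1))   # Eq. (5)
        tf, tb = x_tilde.chunk(2, dim=1)
        return self.out(torch.cat([tf, self.act(xf) * tb], dim=1))      # Eq. (6)

x = torch.randn(1, 64, 48, 160)
print(TSGFN(64)(x).shape)  # torch.Size([1, 64, 48, 160])
```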
where \(\odot\) represents elementwise matrix multiplication and \(Gelu\) denotes the activation function. Compared with the MLP, the TSGFN module uses feature-map splitting and elementwise multiplication to model contextual information through interactions between feature channels. Elementwise multiplication can selectively emphasize or suppress feature responses across channels, enabling richer feature representations to be captured.

Loss function

This paper uses the difference between the source image and the predicted image as the supervisory signal for model training. The loss function is therefore designed around this difference, constraining network training through a photometric reprojection loss and an edge-aware smoothness loss. Using the camera intrinsics K and the predicted pose P between two adjacent views, a reconstructed target image \(\hat {I}\) is obtained as a function \(\pi\) of the intrinsics, the pose, the source image \({I_s}\) and the depth \({D_t}\). The loss signal \({\mathcal{L}_{ss}}\) is calculated as a function \(\mathcal{F}\) of \(\hat {I}\) and I:

$${\mathcal{L}_{ss}}\left( {\hat {I},I} \right)=\mathcal{F}\left( {\pi \left( {{I_s},P,{D_t},K} \right),I} \right)$$
(7)
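A minimal sketch of the view-synthesis operator \(\pi\) in Eq. (7) follows, assuming a pinhole camera model and bilinear resampling with grid_sample; the tensor layouts and the pose convention (target-to-source) are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def reproject(img_src, depth_tgt, T, K):
    """Warp the source image into the target view (the operator pi in Eq. (7)).
    img_src:   B x 3 x H x W source image I_s
    depth_tgt: B x 1 x H x W predicted target depth D_t
    T:         B x 4 x 4 relative pose (target -> source), built from R and t
    K:         B x 3 x 3 camera intrinsics
    """
    b, _, h, w = depth_tgt.shape
    # Pixel grid in homogeneous coordinates, shape B x 3 x (H*W).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(b, -1, -1).to(depth_tgt.device)

    # Back-project to 3D points, transform into the source frame, project with K.
    cam = torch.linalg.inv(K) @ pix * depth_tgt.view(b, 1, -1)
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w, device=cam.device)], dim=1)
    src = K @ (T @ cam_h)[:, :3, :]
    uv = src[:, :2, :] / src[:, 2:3, :].clamp(min=1e-6)

    # Normalize to [-1, 1] and bilinearly sample the source image.
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1).view(b, h, w, 2)
    return F.grid_sample(img_src, grid, padding_mode="border", align_corners=True)
```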
The function \(\mathcal{F}\) is typically computed as a weighted sum of a structural similarity term and an intensity difference term, i.e., a combination of the pixelwise structural similarity (SSIM) and the L1 loss between \(\hat {I}\) and I:

$$\mathcal{F}\left( {\hat {I},I} \right)=\frac{\alpha }{2}\left( {1 - SSIM\left( {\hat {I},I} \right)} \right)+(1 - \alpha )\left\| {\hat {I} - I} \right\|_1$$
(8)
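The photometric term of Eq. (8) can be sketched as follows; the 3 × 3 average-pooling SSIM window follows common practice and is an assumption here.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified per-pixel SSIM using 3x3 average-pooling windows."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).clamp(0, 1)

def photometric_loss(pred, target, alpha=0.85):
    """Eq. (8): weighted SSIM + L1 photometric error, per pixel (B x 1 x H x W)."""
    ssim_term = (1 - ssim(pred, target)).mean(1, keepdim=True)
    l1_term = (pred - target).abs().mean(1, keepdim=True)
    return alpha / 2 * ssim_term + (1 - alpha) * l1_term
```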
where \(\alpha\) is typically set to 0.85. The per-pixel minimum photometric loss is computed to handle pixels that leave the view of the source image and occluded objects:

$${\mathcal{L}}\left( p \right) = \mathop {\min }\limits_{{i \in \{ - 1, + 1\} }} {\mathcal{F}}\left( {\widehat{{I_{i} }}(p),I(p)} \right)$$
(9)
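A short sketch of the per-pixel minimum in Eq. (9), assuming the two source views are the previous (i = -1) and next (i = +1) frames; the helper names refer to the earlier sketches.

```python
import torch

def min_reprojection_loss(losses):
    """Eq. (9): per-pixel minimum of the photometric losses over the source views.
    losses: list of B x 1 x H x W tensors, one per source view."""
    return torch.min(torch.cat(losses, dim=1), dim=1, keepdim=True).values

# Usage sketch (hypothetical tensors), combined with photometric_loss above:
# loss_prev = photometric_loss(warp_from_prev, target)
# loss_next = photometric_loss(warp_from_next, target)
# per_pixel = min_reprojection_loss([loss_prev, loss_next])
```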
An edge-aware smoothness loss is used to regularize the inverse depth map d:

$${\mathcal{L}_{smooth~}}=\left| {{\partial _x}d_{t}^{*}} \right|{e^{ - \left| {{\partial _x}{I_t}} \right|}}+\left| {{\partial _y}d_{t}^{ * }} \right|{e^{ - \left| {{\partial _y}{I_t}} \right|}}$$
(10)
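A sketch of the edge-aware smoothness term in Eq. (10); the mean normalization of the inverse depth (the asterisk in \(d_t^*\)) follows common practice and is an assumption here.

```python
import torch

def smoothness_loss(disp, img):
    """Eq. (10): edge-aware smoothness on the inverse depth, weighted by image gradients.
    disp: B x 1 x H x W inverse depth, img: B x 3 x H x W target image."""
    d = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)  # mean normalization (assumption)
    dx_d = (d[:, :, :, 1:] - d[:, :, :, :-1]).abs()
    dy_d = (d[:, :, 1:, :] - d[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```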
Finally, the outputs at each scale \(S \in \left\{ 1, \tfrac{1}{2}, \tfrac{1}{4} \right\}\) are upsampled to full resolution, and both the view reconstruction loss \({\mathcal{L}_{ss}}\) and the smoothness loss \({\mathcal{L}_{smooth~}}\) are computed at each scale and averaged to train the network with the total loss \({\mathcal{L}_{tot}}\):

$${\mathcal{L}}_{{tot}} = \frac{1}{3}\sum\limits_{{s = 1}}^{3} {\left( {\mu{\mathcal{L}}_{{ss}} + \lambda {\mathcal{L}}_{{smooth}} } \right)}$$
(11)
where \({\mathcal{L}_{ss}}\) enforces similarity between the reconstructed image and the original image by measuring the difference between the network output and the original features, and \({\mathcal{L}_{smooth~}}\) further smooths the predicted depth, suppressing excessive detail noise and irregularities and thereby improving the signal-to-noise ratio. The weights \(\mu ,\lambda \in\) [0,1] adjust the balance between the reconstruction loss and the smoothness loss in different models. The two losses contribute comparably for street-scene images, so both weight factors are set to 0.5.
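Putting the pieces together, the following is a minimal sketch of how the multi-scale total loss of Eq. (11) could be assembled; the helper functions (photometric_loss, min_reprojection_loss, smoothness_loss, reproject) are the earlier sketches, and the inverse-depth-to-depth conversion and scale handling are assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(disps, target, sources, poses, K, mu=0.5, lam=0.5):
    """Eq. (11): average of the weighted reconstruction and smoothness losses
    over the three output scales (1, 1/2, 1/4).
    disps:   list of 3 inverse-depth maps at decreasing resolution
    sources: list of source images (previous and next frames)
    poses:   list of 4x4 relative poses, one per source image"""
    h, w = target.shape[-2:]
    total = 0.0
    for disp in disps:
        disp_full = F.interpolate(disp, size=(h, w), mode="bilinear", align_corners=False)
        depth = 1.0 / disp_full.clamp(min=1e-6)          # inverse depth -> depth (assumption)
        per_view = [photometric_loss(reproject(src, depth, T, K), target)
                    for src, T in zip(sources, poses)]
        l_ss = min_reprojection_loss(per_view).mean()    # Eq. (9), then spatial mean
        l_sm = smoothness_loss(disp_full, target)        # Eq. (10)
        total = total + mu * l_ss + lam * l_sm
    return total / len(disps)
```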
