Adaptive condition-aware high-dimensional decoupling remote sensing image object detection algorithm

Data set introduction

Figure 2 depicts the RSOD data set, an open resource tailored for the detection of minute objects within RSI. The sample images of the RSOD RSI data set cover four categories: aircraft, with 4993 aircraft in 446 images; oil tank, with 1586 oil tanks in 165 images; overpass, with 180 overpasses in 176 images; and playground, with 191 playgrounds in 189 images.

Figure 2 RSOD RSI data sample diagram.

YOLO-ACPHD module

The YOLO-ACPHD model proposed in this paper mainly employs the following techniques to improve detection performance and robustness.

The ACAT module dynamically adjusts the parameters of the convolution kernel to adapt to objects of different scales and positions. In RSI, objects differ in scale, orientation and shape, so a fixed convolution kernel is difficult to adapt effectively. By conditioning the convolution kernels on the attributes of the training instances, the module automatically learns sample-dependent kernels, giving rise to multi-level, multi-aspect convolution filters grounded in the training data and enhancing adaptability and accuracy in complex scenes.

The HDDT module performs a separate deep convolution on each channel. Compared with conventional convolution, the parameter count and computation of HDDT are reduced to roughly \(\frac{1}{N}+\frac{1}{D_{k}^{2}}\) of the original (about one-ninth for a \(3\times 3\) kernel; see Eqs. 4 and 5), where N is the number of convolution kernels and \(D_{k}\) is the kernel size. Optimizing computational efficiency in the handling of extensive RSI significantly accelerates object identification by minimizing computational requirements.

By using the ACAT and HDDT modules, the YOLO-ACPHD RSIOD algorithm adapts more flexibly to objects of different scales and locations while improving computational efficiency. Consequently, identification accuracy and resilience are significantly improved in the context of complex and dynamic RSI.

In the RSIOD method based on YOLO-ACPHD, a variety of advanced OD techniques are used to improve the performance and robustness of the algorithm, as shown in Fig. 3.

Figure 3 YOLO-ACPHD overall network architecture.

ACAT and HDDT modules are used to improve the adaptability and computational efficiency of the algorithm. By adaptively adjusting the convolution parameters for each input example and each region, the ACAT module enhances the adaptability and accuracy of feature extraction. The HDDT module uses a decomposed convolution operation that preserves spatial information while improving robustness to objects. The results of this work lay a foundation for research on RSI data analysis methods for complex scenes.

Adaptive Condition Awareness Technology

As depicted in Fig. 4, in a standard convolution layer a single convolution kernel (i.e., one set of weights) is applied to all input instances: the convolution operation is the same regardless of the content of the input example. In the ACAT layer, the convolution kernel is adaptive, transforming dynamically based on the input sample at hand, so each input example can have its own unique convolution kernel. The ACAT layer is designed to adaptively learn and adjust the convolution kernel according to the features and context information of the input examples, aiming to enhance the model's capacity for expression and its generalization capabilities.
This is achieved by parameterizing the convolution kernel in terms of the input instance: the ACAT layer can share and learn different convolution kernel parameters across examples, so the model adapts better to diverse input data.

Figure 4 Conditional parameter convolution diagram.

In ACAT, the convolution kernel is parameterized as a function of the input example. Specifically, the output is computed as

$$\begin{aligned} Output(x)=\sigma ((\alpha _{1}\cdot W _{1} +\cdots +\alpha _{n}\cdot W_{n}) *x). \end{aligned}$$
(1)
Here x denotes the output of the previous layer, n denotes the number of convolution kernels \((W_{i})\) in the ACAT layer, \(\sigma\) denotes the activation function, and \(\alpha _{i}= r_{i}(x)\) is a sample-dependent weighting parameter\(^{39}\).

Traditional convolution layers typically increase their capacity by enlarging the kernel's height/width or the number of input/output channels. Nevertheless, during convolution, every extra parameter necessitates more multiply-add operations, a computational burden that scales linearly with the image's pixel count. In the ACAT layer, before applying convolution, a convolution kernel is computed for each example as a linear combination of the n expert kernels. Each combined kernel only needs to be computed once, but it is applied at many different positions of the input image. Increasing n therefore enhances the overall capacity of the network at only a negligible computational cost: each additional parameter requires just one extra multiplication operation. As Fig. 5 shows, the ACAT layer is mathematically equivalent to a more expensive linear mixture-of-experts formulation in which each expert is a static convolution:

Figure 5 Expert linear mixture diagram.

$$\begin{aligned} \sigma ((\alpha _{1}\cdot W_{1}+\cdots +\alpha _{n}\cdot W_{n}) *x)=\sigma (\alpha _{1}\cdot (W_{1}*x)+\cdots +\alpha _{n}\cdot (W_{n}*x)). \end{aligned}$$
(2)
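This equivalence can be checked numerically. The following sketch (illustrative sizes, PyTorch's functional API) verifies Eq. (2) by convolving once with the combined kernel and comparing the result against the weighted sum of per-expert convolutions:

```python
import torch
import torch.nn.functional as F

# Eq. (2): combining kernels before convolving equals combining the
# per-expert convolution outputs, because convolution is linear in W.
torch.manual_seed(0)
n, c_in, c_out, k = 4, 3, 8, 3             # experts, channels, kernel size (illustrative)
x = torch.randn(1, c_in, 16, 16)           # one input example
W = torch.randn(n, c_out, c_in, k, k)      # n expert kernels W_1..W_n
alpha = torch.rand(n)                      # sample-dependent weights alpha_i = r_i(x)

# Left-hand side: one convolution with the combined kernel.
W_combined = (alpha.view(n, 1, 1, 1, 1) * W).sum(dim=0)
lhs = F.conv2d(x, W_combined, padding=1)

# Right-hand side: n convolutions, combined afterwards (the expert view).
rhs = sum(a * F.conv2d(x, Wi, padding=1) for a, Wi in zip(alpha, W))

print(torch.allclose(lhs, rhs, atol=1e-5))  # True
```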
Therefore, ACAT has the same capacity as a linear mixture of n experts, but it is computationally efficient because it only needs to compute one expensive convolution. This formulation reveals the nature of ACAT and links it to previous work on conditional computation and mixtures of experts. In ACAT, the per-example routing function is critical. A routing function is designed that is computationally efficient, can effectively distinguish input examples, and is easy to interpret. It consists of three steps: global average pooling, a fully connected layer, and Sigmoid activation. The routing weights are computed as

$$\begin{aligned} r(x)=Sigmoid(GlobalAveragePool(x)R). \end{aligned}$$

(3)

For the input x, global average pooling is performed first, and the result is right-multiplied by a matrix R (which maps the pooled features to the n experts for the subsequent linear combination). Finally, Sigmoid compresses the weight in each dimension to the interval [0, 1], so different inputs x yield different routing weight vectors. A sketch of the full layer follows.
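Below is a minimal PyTorch sketch of an ACAT-style layer combining the routing of Eq. (3) with the kernel combination of Eq. (1). The class name, expert count, initialization scale, and the choice of ReLU for \(\sigma\) are illustrative assumptions, not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACATConv2d(nn.Module):
    """Conditionally parameterized convolution: per-example kernels
    built as a routed linear combination of n expert kernels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, n_experts=4, padding=1):
        super().__init__()
        self.experts = nn.Parameter(
            torch.randn(n_experts, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        self.route = nn.Linear(in_ch, n_experts, bias=False)  # matrix R in Eq. (3)
        self.padding = padding

    def forward(self, x):
        b = x.size(0)
        # Eq. (3): r(x) = Sigmoid(GlobalAveragePool(x) R)
        pooled = F.adaptive_avg_pool2d(x, 1).flatten(1)       # (b, in_ch)
        alpha = torch.sigmoid(self.route(pooled))             # (b, n_experts)
        # Eq. (1): per-example kernel = sum_i alpha_i * W_i
        kernels = torch.einsum('be,eoihw->boihw', alpha, self.experts)
        # Grouped-convolution trick: fold the batch into the channel axis
        # so each example is convolved with its own kernel in one call.
        out_ch = kernels.size(1)
        x = x.reshape(1, -1, *x.shape[2:])                    # (1, b*in_ch, H, W)
        w = kernels.reshape(-1, *kernels.shape[2:])           # (b*out_ch, in_ch, k, k)
        y = F.conv2d(x, w, padding=self.padding, groups=b)
        return F.relu(y.reshape(b, out_ch, *y.shape[2:]))     # sigma = ReLU (assumed)

# Usage: ACATConv2d(64, 128)(torch.randn(2, 64, 32, 32)).shape -> (2, 128, 32, 32)
```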
High-dimensional decoupling technology

Based on HDDT, the standard \(3 \times 3\) convolution operation is decomposed into two steps, deep convolution and point convolution, as shown in Fig. 6. Traditional \(3 \times 3\) convolution performs the channel and spatial computations in one step; the high-dimensional decoupling technique separates them into deep convolution and point convolution.

In deep convolution, each input channel undergoes an individual convolution with its exclusive convolution filter: every channel is equipped with its own convolution kernel, so the convolution operates along the channel dimension. Point convolution is then used to create a linear combination of the deep convolution outputs: \(1 \times 1\) convolution kernels linearly combine the output channels of the deep convolution. This step can be seen as dimension reduction and fusion along the channel dimension.

Standard convolution. Assuming the convolution kernel size is \(D_{k}\times D_{k}\), the number of input channels is M, the number of output channels is N, and the output feature map size is \(D_{F}\times D_{F}\), standard convolution has \(D_{k}\times D_{k} \times M \times N\) parameters and its computation is \(D_{k}\times D_{k} \times M \times N \times D_{F}\times D_{F}\).

Deep convolution. The kernels of deep convolution are single-channel, and each channel of the input is convolved separately, so the output feature map has the same number of channels as the input feature map. That is, the number of input channels = the number of convolution kernels = the number of output channels. Each deep convolution kernel has size \(D_{k}\times D_{k} \times 1\), there are M kernels, and each performs \(D_{F}\times D_{F}\) multiplication positions. The parameter count is \(D_{k}\times D_{k} \times M\) and the computation is \(D_{k}\times D_{k} \times M \times D_{F}\times D_{F}\).

Pointwise convolution. Pointwise convolution leaves the W/H dimensions unchanged and changes the channel count. Because deep convolution forces the number of output channels to equal the number of input channels, the output may carry too few channels (e.g., a deep convolution over an RGB image yields only 3 channels), which may limit the effectiveness of the information; point-by-point convolution remedies this. Pointwise convolution (PWConv) is essentially a \(1\times 1\) convolution used for dimension elevation. Each pointwise kernel has size \(1\times 1 \times M\), there are N kernels, and each performs \(D_{F}\times D_{F}\) multiplication positions. The parameter count is \(M\times N\) and the computation is \(M \times N \times D_{F}\times D_{F}\).

HDDT consists of deep convolution and point-by-point convolution.
Deep convolution extracts spatial features, and point-by-point convolution extracts channel features. HDDT groups convolutions along the feature dimension, performs an independent deep convolution on each channel, and aggregates all channels with a \(1\times 1\) convolution before output. The parameter count is \(D_{k}\times D_{k} \times M + M \times N\) and the computation is \(D_{k}\times D_{k} \times M \times D_{F}\times D_{F} + M \times N \times D_{F}\times D_{F}\).

Comparing standard convolution with HDDT:

$$\begin{aligned} \text {Parameter ratio}=\frac{HDDT}{Conv} = \frac{D_{k}\times D_{k} \times M + M\times N}{D_{k}\times D_{k} \times M \times N}=\frac{1}{N}+\frac{1}{{D_{k}^{2} } }. \end{aligned}$$
(4)
$$\begin{aligned} \text {Computation ratio}=\frac{HDDT}{Conv} = \frac{D_{k}\times D_{k} \times M \times D_{F}\times D_{F} + M\times N \times D_{F}\times D_{F}}{D_{k}\times D_{k} \times M \times N \times D_{F}\times D_{F}}=\frac{1}{N}+\frac{1}{{D_{k}^{2}} }. \end{aligned}$$
(5)
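A quick arithmetic check of Eqs. (4) and (5), with illustrative sizes \(D_k=3\), M=64, N=128 (any values work):

```python
# Parameter and multiplication counts for standard convolution vs. HDDT
# (deep + pointwise), with illustrative sizes.
Dk, M, N, Df = 3, 64, 128, 56

std_params  = Dk * Dk * M * N          # 73728
hddt_params = Dk * Dk * M + M * N      # 8768
std_macs    = std_params * Df * Df
hddt_macs   = hddt_params * Df * Df

print(hddt_params / std_params)        # 0.1189... -> matches Eq. (4)
print(1 / N + 1 / Dk**2)               # 0.1189...
print(hddt_macs / std_macs)            # same ratio, Eq. (5)
```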
In general, N is large, so \(\frac{1}{N}\) is negligible. \(D_{k}\) represents the size of the convolution kernel; if \(D_{k}=3\), then \(\frac{1}{{D_{k}^{2}}}=\frac{1}{9}\), so with the commonly used \(3\times 3\) kernel, HDDT reduces the number of parameters and computations to about one-ninth of the original.

In summary, the depthwise separable convolution based on high-dimensional decoupling technology decomposes the standard \(3\times 3\) convolution into two steps, deep convolution and point convolution. This decomposition improves computational efficiency and makes better use of model parameters and computing resources\(^{40}\).

Figure 6 Decomposition of standard convolution into deep convolution and point convolution.

When a standard convolution filter bank K of size \(D_{k}\times D_{k} \times M\times N\) is applied to an input feature map F of size \(W\times H \times M\), an output feature map O of size \(W^{\prime }\times H^{\prime } \times N\) is obtained, where N is the number of filters. Each filter is a tensor of size \(D_{k}\times D_{k} \times M\); it is multiplied element by element with the corresponding \(D_{k}\times D_{k} \times M\) window of the input feature map F, and the results are summed to produce one value of the output feature map:

$$\begin{aligned} O_{k,l,n}=\sum _{i,j,m} K_{i,j,m,n}\cdot F_{k+i-1,l+j-1,m}. \end{aligned}$$
(6)
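Eq. (6) written out directly as a NumPy sketch (zero-based indices, 'valid' padding; the loops are kept explicit for clarity, not speed):

```python
import numpy as np

def standard_conv(F_in, K):
    """Eq. (6): O[k,l,n] = sum_{i,j,m} K[i,j,m,n] * F[k+i-1, l+j-1, m]."""
    W, H, M = F_in.shape
    Dk, _, _, N = K.shape
    O = np.zeros((W - Dk + 1, H - Dk + 1, N))
    for n in range(N):
        for k in range(O.shape[0]):
            for l in range(O.shape[1]):
                O[k, l, n] = np.sum(K[:, :, :, n] * F_in[k:k + Dk, l:l + Dk, :])
    return O

O = standard_conv(np.random.rand(8, 8, 3), np.random.rand(3, 3, 3, 4))
print(O.shape)  # (6, 6, 4)
```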
In the high-dimensional decoupling technology, the above calculation is decomposed into two steps. The first step applies a \(3\times 3\) deep convolution \(\hat{K}\) to each input channel: for the input feature map F of size \(W\times H \times M\), a \(3\times 3\) deep convolution kernel is convolved with each channel of the input separately. In this way, M output feature maps of size \(W\times H\) are obtained.

$$\begin{aligned} \hat{O} _{k,l,m}=\sum _{i,j} \hat{K} _{i,j,m}\cdot F_{k+i-1,l+j-1,m}. \end{aligned}$$
(7)
The second step is point convolution, which creates a linear combination of the deep convolution outputs. Using \(1\times 1\) convolution kernels, the M feature maps are linearly combined by point-by-point convolution \(\hat{K}\) to obtain an output feature map of size \(W\times H\). This step is equivalent to dimensionality reduction and fusion along the channel dimension.

$$\begin{aligned} O_{k,l,n}=\sum _{m} \hat{K} _{m,n}\cdot \hat{O}_{k,l,m}. \end{aligned}$$
(8)
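Eqs. (7) and (8) correspond to a depthwise convolution (groups = M) followed by a \(1\times 1\) pointwise convolution, which is standard depthwise-separable practice. A minimal PyTorch sketch (class and layer names are illustrative):

```python
import torch
import torch.nn as nn

class HDDTConv(nn.Module):
    """High-dimensional decoupled convolution: deep convolution (Eq. 7)
    followed by pointwise 1x1 convolution (Eq. 8)."""
    def __init__(self, M, N, k=3, padding=1):
        super().__init__()
        # Eq. (7): one k x k filter per input channel (groups=M).
        self.deep = nn.Conv2d(M, M, k, padding=padding, groups=M, bias=False)
        # Eq. (8): 1x1 kernels linearly combine the M channels into N.
        self.point = nn.Conv2d(M, N, 1, bias=False)

    def forward(self, x):
        return self.point(self.deep(x))

m = HDDTConv(M=64, N=128)
x = torch.randn(1, 64, 56, 56)
print(m(x).shape)                              # torch.Size([1, 128, 56, 56])
print(sum(p.numel() for p in m.parameters()))  # 3*3*64 + 64*128 = 8768, cf. Eq. (4)
```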
Deep convolution and point convolution play different roles in generating new features. Deep convolution captures the spatial relationships within an image, extracting edges, textures, and other local details; applied channel by channel, it helps the network discern the image's spatial structure and distill rich information from it. Point convolution, by contrast, captures the dependencies among channels: using \(1\times 1\) kernels, it linearly combines the channel dimensions of each feature map to achieve interaction and fusion between channels. This operation helps the network learn the correlation and importance of different channels, thereby generating new features with stronger representational capability.
