Wildlife target detection based on improved YOLOX-s network

YOLOX

The YOLOX series represents single-stage target detection algorithms known for fast inference and real-time performance. YOLOX adopts an Anchor-Free design, predicting the location and category of targets directly through the network, which removes the limitations associated with anchor boxes and enhances detection performance. Ge et al.23 introduced the YOLOX algorithms, which include the YOLOX-s, YOLOX-m, YOLOX-l, and YOLOX-x variants. The architecture of YOLOX is divided into three main components: Backbone, Neck, and Head. The Backbone is built on the CSPDarknet24 network, where the input image undergoes initial feature extraction; this stage yields the three feature layers essential for subsequent network construction. The Neck, a PAFPN (Path Aggregation Network with Feature Pyramid Network), acts as YOLOX’s enhanced feature extraction network: it fuses the three effective feature layers from the Backbone to amalgamate feature information across different scales, and these layers are then used for continued feature extraction within the PAFPN. The Head incorporates a Decoupled Head, which offers greater expressive capability and faster network convergence.

MobileViT-Pooling

MobileViT25 is a lightweight, general-purpose vision Transformer designed for mobile devices that uses a multi-layer Transformer encoder to process sequences in chunks. However, employing multiple Transformer layers increases the model’s parameter count and slows detection in practical applications, and excessive layers may also distort the extracted features. The improved MobileViT-Pooling module proposed in this paper (Fig. 2) therefore employs just a single Transformer layer. The authors of MetaFormer26 attribute the success of both Transformer and MLP models to the generalized MetaFormer architecture. To further reduce the parameter count and computational burden of the Transformer, we replace its multi-head self-attention structure with a non-parameterized spatial pooling operator, which serves as the token mixer module; in our experiments this modification does not significantly affect detection accuracy. MobileViT’s feature extraction capability is essential for detecting wildlife in this study, yet incorporating MobileViT substantially increases the computational demands of the network. Adhering to the principles of being fast, practical, and cost-effective, we replace the multi-head self-attention modules in MobileViT with global pooling operations, reducing the network’s computational demands without overly compromising detection accuracy.

Fig. 2 MobileViT-Pooling module structure diagram.

In this module, the feature map first passes through a convolutional layer with an n × n kernel, followed by a channel adjustment using a 1 × 1 convolution. Global feature modeling is then performed with the Unfold, Transformer, and Fold structures in sequence, after which the channels are restored to their original size with another 1 × 1 convolution. The result is concatenated channel-wise with the original input feature map through a shortcut branch, and a final convolutional layer with an n × n kernel fuses the features to produce the output.
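As a concrete illustration of this design choice, the PyTorch-style sketch below shows how a parameter-free average-pooling token mixer can stand in for multi-head self-attention inside a MetaFormer-style block. It is a minimal sketch under our own assumptions: the module names, the 3 × 3 pooling window, the GroupNorm layers, and the channel sizes are illustrative and are not taken from the implementation used in our experiments.

```python
import torch
import torch.nn as nn

class PoolingTokenMixer(nn.Module):
    """Parameter-free token mixer: average pooling minus identity (PoolFormer-style)."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Subtracting the input mirrors the residual form used by attention mixers.
        return self.pool(x) - x

class PoolingBlock(nn.Module):
    """One MetaFormer-style block: spatial token mixing followed by a channel MLP."""
    def __init__(self, dim: int, mlp_ratio: int = 2):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)      # channel-wise LayerNorm substitute
        self.mixer = PoolingTokenMixer()
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))      # spatial mixing, no learnable parameters
        x = x + self.mlp(self.norm2(x))        # per-location channel mixing
        return x

if __name__ == "__main__":
    feat = torch.randn(1, 96, 20, 20)          # a backbone feature map (illustrative shape)
    print(PoolingBlock(96)(feat).shape)        # torch.Size([1, 96, 20, 20])
```

Because the token mixer in this sketch has no learnable weights, swapping it in removes the attention parameters entirely; only the channel MLP and normalization layers are trained.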
In this paper, we explore the potential of merging the strengths of CNNs and ViTs. CNNs are proficient at handling localized image features and structural information, while ViTs excel at managing long-range dependencies and global contextual information, so combining the two may yield superior performance. For this study, we designed two distinct models integrating CNN and ViT technologies:
Algorithm 1
Incorporates ViT within the CNN structure. In this hybrid model, CNN and ViT modules are alternated within the network architecture. This configuration may necessitate additional techniques and experiments to optimize the integration of CNN and ViT.

Algorithm 2
Employs ViT after CNN processing. Initially, features are extracted using the CNN, and then these features are fed into the ViT for more complex processing and inference. This arrangement leverages the ViT’s capability to handle global contexts and long-range dependencies, while retaining the CNN’s advantages for processing local features. The outcomes of this approach are detailed in Table 1.
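To make the two arrangements concrete, the hedged sketch below contrasts them with small placeholder modules. ConvStage and GlobalStage, along with all channel and layer sizes, are illustrative stand-ins we introduce here; the networks actually evaluated follow the YOLOX-s backbone and the MobileViT-Pooling module described above.

```python
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Small convolutional stage standing in for a CNN feature extractor."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.body(x)

class GlobalStage(nn.Module):
    """Transformer encoder applied to the flattened CNN feature map (Algorithm 2)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        tokens = self.encoder(tokens)               # global, long-range mixing
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Algorithm 2: local CNN features first, global reasoning afterwards.
model = nn.Sequential(ConvStage(3, 32), ConvStage(32, 64), GlobalStage(64))
# Algorithm 1 would instead interleave ConvStage and GlobalStage blocks
# throughout the backbone, e.g. Conv -> Global -> Conv -> Global -> ...

if __name__ == "__main__":
    print(model(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 64, 16, 16])
```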
Table 1 Experimental comparison results of Algorithm 1 and Algorithm 2.

Based on the experimental results, using ViT after the CNN significantly outperforms embedding ViT within the CNN architecture: the difference in mAP@0.5 is 6.9%, and the gap in mAP@0.5:0.95 is 3.0%. We believe that employing ViT after the CNN allows the model to better segregate and differentiate different kinds of information. The CNN focuses on low-level and mid-level features, enabling the ViT to subsequently concentrate on high-level, abstract features. This sequential processing may enhance the model’s performance and generalization capability. In contrast, embedding ViT directly within the CNN complicates the integration, as the alternating modules may not optimally handle the distinct features of the dataset. Specifically, in datasets where the features of wild animals often resemble the environmental background, repeatedly embedding the ViT module within the CNN can cause interactions of global information that distort the already extracted features.

Dynamic head

Environmental background transformations pose significant challenges in visual target detection and tracking. Dynamic weather conditions, such as rainy nights, often introduce substantial noise into images, complicating the task of accurately recognizing wildlife. To address this, our paper introduces the Dynamic Head27, a network structure designed to manage dynamic changes and focus attention effectively. The structure of the Dynamic Head is illustrated in Fig. 3. It primarily leverages an attention mechanism to enhance the model’s three key perceptual capabilities, thus improving detector performance. In the DyHead block shown in the figure, \(\pi_{L}\), \(\pi_{S}\), and \(\pi_{C}\) represent scale-aware attention (level-wise), spatial-aware attention (spatial-wise), and task-aware attention (channel-wise), respectively.

Fig. 3 Dynamic head structure diagram.

The introduction of the Dynamic Head brings several advantages to this work. Nonetheless, determining the optimal number of DyHead layers to stack, and understanding the effect of different layer counts on model performance, depends on the specific model, task, and dataset. Increasing the number of layers raises the model’s complexity, potentially boosting performance but also increasing the risk of overfitting; reducing the number of layers simplifies the model and reduces overfitting, but may compromise performance. We therefore experimented with various stacking configurations, keeping our changes consistent with the original authors’ design. The results are documented in Table 2.

Table 2 Experimental results of different DyHead stacking numbers.

At a stacking number of 2, mAP@0.5 and mAP@0.5:0.95 reach maxima of 81.58 and 58.36, respectively. This indicates that stacking 2 layers of DyHead performs best on this wildlife detection task. When the number of stacks exceeds 2, the mAP of the model starts to decrease, which may be because too many stacks cause the model to overfit or make optimization more difficult.
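For illustration only, the following sketch stacks a heavily simplified DyHead-like block a configurable number of times. It implements only a level-wise (scale-aware) attention term; the spatial-aware and task-aware terms of the full Dynamic Head are omitted, and every name, shape, and layer choice here is an assumption rather than the code used in our model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareAttention(nn.Module):
    """Simplified level-wise attention: each pyramid level is re-weighted by a
    learned hard-sigmoid gate computed from its globally pooled response."""
    def __init__(self, levels: int):
        super().__init__()
        self.fc = nn.Linear(levels, levels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, L, C, H, W), L pyramid levels resized to a common scale
        pooled = feats.mean(dim=(2, 3, 4))               # (B, L)
        gate = F.hardsigmoid(self.fc(pooled))            # (B, L) level weights
        return feats * gate[:, :, None, None, None]

def build_simplified_dyhead(levels: int, num_blocks: int) -> nn.Module:
    # Stacking more blocks deepens the attention refinement, at the cost of
    # extra computation and a higher risk of overfitting (cf. Table 2).
    return nn.Sequential(*[ScaleAwareAttention(levels) for _ in range(num_blocks)])

if __name__ == "__main__":
    pyramid = torch.randn(2, 3, 256, 32, 32)             # B=2, L=3 levels (illustrative)
    head = build_simplified_dyhead(levels=3, num_blocks=2)
    print(head(pyramid).shape)                           # torch.Size([2, 3, 256, 32, 32])
```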
Therefore, for this wildlife detection task, stacking 2 layers of DyHead appears to be the best choice: it achieves the highest mAP while keeping the number of parameters essentially constant, albeit with a slight increase in computation.

Focal-IoU loss

IoU Loss: Intersection over Union (IoU) is a commonly used metric for evaluating target detection results, measuring the degree of overlap between the predicted bounding box and the ground-truth bounding box. During training, minimizing the IoU loss pushes the predicted bounding box closer to the ground-truth box. The IoU-based loss function is as follows:

$$L\left(B,B^{gt}\right)=1-\frac{\left|B\cap B^{gt}\right|}{\left|B\cup B^{gt}\right|}+R\left(B,B^{gt}\right)$$
(1)
where \(B\) denotes the predicted box, \(B^{gt}\) denotes the ground-truth box, and \(R\left(B,B^{gt}\right)\) is the penalty term. Focal Loss28 is primarily employed to address category imbalance in classification tasks, i.e., the discrepancy between the numbers of positive (target) and negative (non-target) samples. In target detection tasks, the number of negative samples often far exceeds that of positive samples. Focal Loss works by reducing the weight of easily classified samples (mostly negative samples) and increasing the weight of hard-to-classify samples (such as occluded targets or targets resembling the background), enabling the model to prioritize these challenging samples during training. The formula is as follows:

$$FocalLoss=-\alpha_{t}\left(1-p_{t}\right)^{\gamma}\log\left(p_{t}\right)$$
(2)
where \(\alpha_{t}\) is used to address the imbalance between positive and negative samples, \(\gamma\) is a parameter in the range [0, 5], and \(\left(1-p_{t}\right)^{\gamma}\) is used to address the imbalance between hard and easy samples. An ideal loss function should have the following characteristics: small regression errors should yield small gradients, indicating that only minimal parameter adjustments are needed, whereas large errors should produce large gradients, prompting significant parameter changes. Because the algorithm generates a large number of ineffective prediction boxes during prediction, we apply the Focal concept to IoU to increase the weight of high-quality prediction boxes in the IoU loss and improve the quality of training, resulting in the proposed Focal-IoU loss. The Focal-IoU loss makes only small adjustments when the error is small and sharper adjustments when the error is large:

$$L_{Focal\text{-}IoU}=IoU^{\gamma}\,L_{IoU}$$
(3)
Here \(\gamma\) is a hyperparameter, which we set to 0.5, and Fig. 4 compares the resulting curve with the \(\left(1-IoU\right)\) curve. The orange curve represents \(IoU^{0.5}\left(1-IoU\right)\). For IoU values between 0 and 0.8, \(IoU^{0.5}\left(1-IoU\right)\) is clearly reduced relative to \(\left(1-IoU\right)\), while for IoU values between 0.8 and 1 the two curves remain essentially unchanged. Figure 4 thus illustrates that the Focal-IoU loss lets the network focus more on simple samples by reducing the loss associated with difficult samples.

Fig. 4 1-IoU and Focal-IoU curves.

The design objective of the Focal-IoU loss is to tackle the IoU data imbalance issue while simultaneously enhancing the localization accuracy of the prediction boxes. To evaluate the impact of the Focal-IoU loss, we assess on this dataset whether it leads to improved performance under identical conditions.

Table 3 Experimental results for different IoU loss functions.

Table 3 shows that the model employing Focal-IoU loss achieves the highest performance on the mAP@0.5 metric, reaching 80.5%. For the mAP@0.5:0.95 metric, however, Focal-IoU loss is comparable to, but slightly lower than, IoU loss, suggesting a slight decrease in detection accuracy under stricter IoU thresholds. The inferior detection accuracies of several other IoU loss functions, such as GIoU, DIoU, and CIoU, imply that they are less suited to this problem than Focal-IoU loss and IoU loss. Overall, Focal-IoU loss demonstrates relatively strong performance on this wildlife target detection problem, particularly on the mAP@0.5 metric.
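The Focal-IoU weighting of Eq. (3) can be sketched as follows. This is a minimal illustration assuming axis-aligned boxes in (x1, y1, x2, y2) format and matched prediction/ground-truth pairs; the penalty term R(B, B^gt) from Eq. (1) is omitted, and the function and parameter names are our own.

```python
import torch

def pairwise_iou(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """IoU of matched box pairs; boxes are (N, 4) in (x1, y1, x2, y2) format."""
    lt = torch.max(pred[:, :2], target[:, :2])       # top-left of intersection
    rb = torch.min(pred[:, 2:], target[:, 2:])       # bottom-right of intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    return inter / (area_p + area_t - inter + eps)

def focal_iou_loss(pred: torch.Tensor, target: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    """Eq. (3): IoU^gamma * (1 - IoU); low-IoU boxes are down-weighted so that
    training is dominated by higher-quality predictions."""
    iou = pairwise_iou(pred, target)
    return (iou.pow(gamma) * (1.0 - iou)).mean()

if __name__ == "__main__":
    pred = torch.tensor([[10., 10., 50., 50.], [0., 0., 20., 20.]])
    gt = torch.tensor([[12., 12., 48., 52.], [30., 30., 60., 60.]])
    print(focal_iou_loss(pred, gt, gamma=0.5))
```

For example, with gamma = 0.5, a prediction with IoU = 0.2 contributes about 0.2^0.5 × 0.8 ≈ 0.36 instead of 0.8, while a prediction with IoU = 0.9 contributes about 0.9^0.5 × 0.1 ≈ 0.095, close to its unweighted value of 0.1, consistent with the curves discussed for Fig. 4.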
