Lightweight safflower cluster detection based on YOLOv5

Data acquisition

The images used in this research were collected at Hongqi Farm in Jimusar County, Changji Prefecture, Xinjiang Uygur Autonomous Region, China (89°12ʹE, 44°24ʹN). The safflower cultivar was Jihong 1. An Intel RealSense D455 depth camera and an HONOR 20 were used for image acquisition and were mounted directly in front of a self-built safflower picking robot, as shown in Fig. 1. The viewing angle ranged from 20° to 45°, and images were captured at resolutions of 1080 × 720 and 1920 × 1080. To enhance the robustness and generalisation of the model, time of day and weather conditions were varied during data collection, and the dataset was augmented through rotation, noise perturbation, blurring, and colour transformation. Sunny-weather images were taken in the early morning, at noon, and in the afternoon, as depicted in Fig. 2a–c; overcast conditions are shown in Fig. 2d; Fig. 2e shows images captured at dusk; and Fig. 2f shows LED fill-in lighting at night.

Figure 1. Safflower picking robot in the field.

Figure 2. Images of safflower captured under diverse weather and light conditions: (a) sunny morning; (b) sunny noon; (c) sunny afternoon; (d) overcast; (e) nightfall; (f) night-time fill light.

Dataset production

The dataset combines images captured from various angles and under different lighting and weather conditions, enabling automated harvesting equipment to adapt to a variety of working conditions for both identification and harvesting. Our aim was to produce a versatile dataset that can be used in different settings. Safflower filaments in the field are typically reddish-orange or yellowish-orange (in their immature state), depending on maturity. Given the current market demand for safflower filaments, the two colours are not differentiated and are graded the same, so they share a single class. Because safflower plants grow unevenly in the field, may be flattened by weather, and may shade one another or suffer significant leaf damage, plants with occlusion exceeding 75% were excluded during annotation. Flower buds, stalks, and leaves were not included in the bounding boxes. Figure 3 depicts the Jihong 1 safflower variety. The data were annotated with the open-source tool LabelImg in PASCAL VOC format, with the single label name "safflower". An example annotation is shown in Fig. 4.

Figure 3. Schematic diagram of Jihong 1.

Figure 4. Example of annotated safflower data.

Safflower cluster detection

YOLOv5 network architecture

The YOLOv5 network comprises an input layer, a backbone network, a neck network, and a prediction output layer. The input image is convolved by the backbone to obtain feature maps, which are passed to the neck for multi-scale feature fusion through upsampling and downsampling. The fused features from layers 17, 20, and 23 are forwarded to the output layer, and the confidence score, predicted class, and bounding-box coordinates are obtained after non-maximum suppression and other post-processing15. The framework of the YOLOv5 network is presented in Fig. 5.

Figure 5. Framework of the YOLOv5 network.
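As a point of reference for the detection pipeline described above (backbone, neck, prediction head, and non-maximum suppression), the minimal sketch below loads a pretrained YOLOv5s model through torch.hub and runs it on a single image. The image path and the confidence/IoU thresholds are illustrative assumptions, not values used in this study.

import torch

# Load a pretrained YOLOv5s model from the Ultralytics repository (downloaded on first use).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.conf = 0.25  # confidence threshold (illustrative value)
model.iou = 0.45   # IoU threshold used by non-maximum suppression (illustrative value)

# Run inference on a single image; 'safflower.jpg' is a placeholder path.
results = model('safflower.jpg')

# Each detection row: x1, y1, x2, y2, confidence, class index.
print(results.xyxy[0])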
YOLOv5 model version determination

YOLOv5 has five versions with different network depths and numbers of residual modules. We evaluated YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x at detecting safflower, training each model on our self-constructed safflower dataset to identify the most appropriate version. The results are displayed in Table 1. Each model was assessed by its precision, F1-score, mean average precision (mAP), computational cost (GFlops), and number of model parameters (Params), where

$${\text{Precision}} = \frac{TP}{{TP + FP}}$$
(1)
$$F1 = 2*\frac{Precision*Recall}{{Precision + Recall}}$$
(2)
$$mAP = \mathop \sum \limits_{i = 1}^{N} AP_{i} /N$$
(3)
$$GFlops = L\left( {\mathop \sum \limits_{i = 1}^{N} K_{i}^{2} *C_{i - 1}^{2} + \mathop \sum \limits_{i = 1}^{N} M^{2} *C_{i} } \right)$$
(4)
$$Params = L\left( {\mathop \sum \limits_{i = 1}^{N} M_{i}^{2} *K_{i}^{2} *C_{i - 1} *C_{i} } \right).$$
(5)

True positives (TPs) are cases where the model correctly identifies an object, while false positives (FPs) are cases where the model predicts an object where there is none. The average precision (AP) is the area under the precision-recall curve. Other quantities appearing in Eqs. (4) and (5) are the kernel size (K), the number of channels (C), the size of the input image (M), and the summation index (i).
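The short sketch below illustrates how these evaluation metrics are computed from raw detection counts; the counts and the per-class AP value are made-up examples, not results from this study.

# Hypothetical counts for one class; not results from this study.
tp, fp, fn = 90, 10, 15

precision = tp / (tp + fp)                            # Eq. (1)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)    # Eq. (2)

# mAP: mean of per-class average precisions, Eq. (3).
ap_per_class = [0.91]                                 # single class ("safflower") in this dataset
map_value = sum(ap_per_class) / len(ap_per_class)

print(f"Precision={precision:.3f}, F1={f1:.3f}, mAP={map_value:.3f}")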
Table 1. Training results for each version of YOLOv5.

YOLOv5x, YOLOv5l, and YOLOv5m achieve higher detection accuracy than YOLOv5n, but their heavier computation makes them unsuitable for mobile deployment, while YOLOv5n is the most lightweight but has poor accuracy. Balancing computation and accuracy, we selected YOLOv5s and enhanced it to reduce computation (Params, GFlops) while maintaining detection accuracy.

Improvement strategies

We present a safflower detection model, SF-YOLO, designed to reduce the computational burden while coping with the complex background and environmental variability of safflower farmland. To this end, we replace the standard convolutional blocks in the backbone network with lighter Ghost_conv blocks, and we embed an attention module after the SPPF module in the backbone, which lets the model focus on relevant information, enhances its adaptive fusion ability, and ultimately improves the recognition rate. The original LGIOU loss function is replaced by the fused L(CIOU+NWD) loss function, and the initial anchor boxes of YOLOv5 are refined with K-means clustering to better suit small and medium-sized safflower targets. Figure 6 depicts the structure of the improved SF-YOLO network.

Figure 6. Structure of the improved SF-YOLO network model.

Model lightweighting

Because of the limited computing resources of mobile platforms, the original model is too large to deploy. The numbers of parameters and computations are reduced by replacing the CBS module in the backbone network with the Ghost_conv module from the GhostNet network16. Ghost_conv first applies a conventional convolution to produce feature maps with fewer channels, lowering the computational demand; it then applies cheap linear operations to these maps to generate additional feature maps at little further cost. The two groups of feature maps are concatenated to form the output, as shown in Fig. 7.

Figure 7. Ghost_conv schematic diagram.
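A minimal PyTorch sketch of a Ghost_conv block in the spirit of GhostNet is given below; the even channel split, kernel sizes, and SiLU activation are assumptions for illustration, not the exact configuration used in SF-YOLO.

import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Generate half of the output channels with a regular convolution and the
    other half with a cheap depthwise convolution, then concatenate them."""
    def __init__(self, in_ch, out_ch, k=1, s=1):
        super().__init__()
        hidden = out_ch // 2
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, hidden, k, s, k // 2, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
        )
        # Cheap operation: 5x5 depthwise convolution on the primary feature maps.
        self.cheap = nn.Sequential(
            nn.Conv2d(hidden, hidden, 5, 1, 2, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

# Example: 64 -> 128 channels on a 160 x 160 feature map.
x = torch.randn(1, 64, 160, 160)
print(GhostConv(64, 128, k=3, s=1)(x).shape)  # torch.Size([1, 128, 160, 160])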
CBAM attention mechanism

Woo et al. (2018) proposed CBAM, a mechanism that emphasises important features in both the channel and spatial dimensions and suppresses unimportant ones, improving the accuracy of safflower detection17. The process is as follows. First, the channel attention is computed: the input feature map F (H × W × C) is subjected to global max pooling and global average pooling, the two pooled results are fed into a shared multilayer perceptron (MLP), the two outputs are summed, and the channel attention map M_c is generated by sigmoid activation, as shown in Eq. (6). Second, the channel-refined feature F1 is used as the input to the spatial attention: it undergoes global max pooling and global average pooling along the channel dimension, and the resulting maps are concatenated and passed through a 7 × 7 convolution followed by sigmoid activation to generate the spatial attention map M_s, as shown in Eq. (8). The channel and spatial attention mechanisms are illustrated in Fig. 8.

$$M\_c = Sigmoid\left( {MLP\left( {Avgpooling\left( F \right)} \right) + MLP\left( {Maxpooling\left( F \right)} \right)} \right)$$
(6)
$$F_{1} = M\_c \times F$$
(7)
$$M\_s = Sigmoid\left( {conv^{7 \times 7} \left( {\left[ {Avgpooling\left( {F_{1} } \right);Maxpooling\left( {F_{1} } \right)} \right]} \right)} \right)$$
(8)
$$F_{2} = M\_s \times F_{1} ,$$
(9)

where F is the input feature map, F1 is the feature refined by the channel attention mechanism, and F2 is the output feature of the CBAM module.

Figure 8. Schematic diagram of the CBAM attention mechanism.
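The following PyTorch sketch implements Eqs. (6)-(9) as described above; the reduction ratio of the shared MLP is an assumed value, and the module is a generic CBAM rather than the exact configuration embedded in SF-YOLO.

import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP for channel attention, Eq. (6).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        # 7x7 convolution for spatial attention, Eq. (8).
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, f):
        b, c, _, _ = f.shape
        # Channel attention: global average and max pooling -> shared MLP -> sum -> sigmoid.
        avg = self.mlp(f.mean(dim=(2, 3)))
        mx = self.mlp(f.amax(dim=(2, 3)))
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        f1 = m_c * f                                   # Eq. (7)
        # Spatial attention: channel-wise average and max maps -> concat -> 7x7 conv -> sigmoid.
        m_s = torch.sigmoid(self.conv(torch.cat(
            [f1.mean(dim=1, keepdim=True), f1.amax(dim=1, keepdim=True)], dim=1)))
        return m_s * f1                                # Eq. (9)

x = torch.randn(1, 256, 20, 20)
print(CBAM(256)(x).shape)  # torch.Size([1, 256, 20, 20])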
Improvement of the K-means-based anchor frame mechanism

The initial anchor boxes of YOLOv5 are obtained by K-means clustering on the COCO dataset, and a genetic algorithm adjusts them during training. The size of the anchor boxes influences the convergence speed and accuracy of the model. The safflower dataset produced in this study contains small and medium-sized targets, whereas the 80 categories of the COCO dataset cover objects of very different sizes and classes, so the default YOLOv5 anchors are not well suited to the constructed dataset. We therefore applied K-means clustering to the safflower dataset to obtain new anchor boxes. The clustering results are shown in Table 2.

Table 2. Safflower anchor frame update results.
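A minimal sketch of anchor clustering is shown below. It applies a plain Euclidean K-means to the labelled box widths and heights (the paper does not state whether an IoU-based distance was used), and the box sizes listed are illustrative placeholders, not values from the safflower dataset.

import numpy as np
from sklearn.cluster import KMeans

# (width, height) of labelled safflower boxes in pixels; illustrative values only.
wh = np.array([[26, 30], [31, 35], [40, 44], [55, 60], [62, 70],
               [78, 85], [90, 98], [110, 120], [34, 38], [48, 52]], dtype=float)

# YOLOv5 uses 9 anchors (3 per detection scale).
kmeans = KMeans(n_clusters=9, n_init=10, random_state=0).fit(wh)
anchors = kmeans.cluster_centers_[np.argsort(kmeans.cluster_centers_.prod(axis=1))]
print(np.round(anchors).astype(int))  # sorted from smallest to largest anchor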
Loss function improvement

The loss function measures the difference between the predicted and true values of a model. The bounding-box loss in YOLOv5 can be computed with LGIOU, LDIOU, or LCIOU, where

$$L_{DIOU} = 1 - IOU + \frac{{D_{2}^{2} }}{{D_{C}^{2} }}$$
(10)
$$L_{CIOU} = 1 - IOU + \frac{{D_{2}^{2} }}{{D_{C}^{2} }} + \alpha v$$
(11)
$$\alpha = \frac{v}{{\left( {1 - IOU} \right) + v}}$$
(12)
$$v = \frac{4}{{\pi^{2} }}\left( {arctan\frac{{w^{gt} }}{{h^{gt} }} - arctan\frac{w}{h}} \right)^{2} ,$$
(13)
where IOU is the intersection-over-union ratio; \(D_{2}\) is the Euclidean distance between the centres of the predicted and ground-truth boxes; \(D_{C}\) is the diagonal length of the smallest enclosing region that contains both the predicted and ground-truth boxes; v measures the consistency of the aspect ratios; and \(w^{gt}\) and \(h^{gt}\) are the width and height of the ground-truth box.

The safflower field environment is complex, and safflower targets are not only small and medium-sized but also numerous. LCIOU provides more comprehensive loss information than LGIOU and LDIOU because it considers the shape, size, centre position, and aspect-ratio error of the bounding box; by accounting for the distance between box centres, it positions boxes more accurately. On the basis of LCIOU, the normalised Wasserstein distance (NWD) is integrated, with a balancing coefficient β introduced to accelerate the convergence of the loss. We set β to 0.8. The NWD is calculated as

$$W_{2}^{2} \left( {N_{a} ,N_{b} } \right) = \left\| {\left[ {x_{A} ,y_{A} ,\frac{{w_{A} }}{2},\frac{{h_{A} }}{2}} \right]^{T} - \left[ {x_{B} ,y_{B} ,\frac{{w_{B} }}{2},\frac{{h_{B} }}{2}} \right]^{T} } \right\|_{2}^{2}$$
(14)
$$NWD\left( {N_{a} ,N_{b} } \right) = {\text{exp}}\left( { – \frac{{\sqrt {W_{2}^{2} \left( {N_{a} ,N_{b} } \right)} }}{C}} \right)$$
(15)
$$L_{NWD} = 1 - NWD\left( {N_{a} ,N_{b} } \right),$$
(16)
where \(W_{2}^{2} \left( {N_{a} ,N_{b} } \right)\) is the second-order Wasserstein distance between the Gaussian distributions modelled from bounding boxes A and B; \(x_{A}\), \(y_{A}\), \(x_{B}\), \(y_{B}\) are the centre coordinates of bounding boxes A and B; \(w_{A}\), \(h_{A}\), \(w_{B}\), \(h_{B}\) are their widths and heights; C is a normalisation constant; and \(NWD\left( {N_{a} ,N_{b} } \right)\) is the resulting similarity measure between the two boxes.

In this study, the original LGIOU is replaced by the improved LCIOU-based loss in SF-YOLO. LCIOU extends LDIOU by introducing the term αv, where α is a balancing parameter that is not involved in the gradient calculation18,19. The total loss after fusing the improved CIOU with NWD is

$$L_{CIOU + NWD} = \left( {1 - \beta } \right) \cdot L_{NWD} + \beta \cdot L_{CIOU} .$$
(17)
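A minimal sketch of the fused loss in Eqs. (14)-(17) is given below, assuming torchvision >= 0.13 so that its complete_box_iou_loss can stand in for Eqs. (10)-(13); the box coordinates and the constant C are illustrative, while β = 0.8 follows the description above.

import torch
from torchvision.ops import complete_box_iou_loss

def nwd(boxes_a, boxes_b, c=12.8):
    """Normalised Wasserstein distance between boxes in (x1, y1, x2, y2) format.
    The constant c is dataset-dependent; 12.8 is an illustrative value."""
    def to_gauss(b):
        cx = (b[:, 0] + b[:, 2]) / 2
        cy = (b[:, 1] + b[:, 3]) / 2
        w = b[:, 2] - b[:, 0]
        h = b[:, 3] - b[:, 1]
        return torch.stack([cx, cy, w / 2, h / 2], dim=1)
    # Eq. (14): squared L2 distance between the Gaussian parameter vectors.
    w2 = ((to_gauss(boxes_a) - to_gauss(boxes_b)) ** 2).sum(dim=1)
    # Eq. (15)
    return torch.exp(-torch.sqrt(w2) / c)

def fused_loss(pred, target, beta=0.8):
    """Eq. (17): L = (1 - beta) * L_NWD + beta * L_CIOU, with beta = 0.8 as in the text."""
    l_nwd = 1.0 - nwd(pred, target)                              # Eq. (16)
    l_ciou = complete_box_iou_loss(pred, target, reduction="none")
    return ((1.0 - beta) * l_nwd + beta * l_ciou).mean()

pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
target = torch.tensor([[12.0, 8.0, 48.0, 58.0]])
print(fused_loss(pred, target))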
Figure 9 shows the loss decline curves before and after the improvement; the improved loss converges rapidly within the first 100 training epochs.

Figure 9. Improved loss function plot.
