Puzzle: taking livestock tracking to the next level

General principle

The present algorithm, named Puzzle, adheres to the principles of the Wizard tracking algorithm18, with some improvements. The general principle of the method is to leverage the detections from Yolo, recognizing that tracking might be straightforward in some sequences of the video. Indeed, it is common to observe sequences where all the animals are detected, with no occlusions or false negatives. In these sequences, tracking is easy, and a method based only on the bounding box (bbox) locations is sufficient. These sequences are called tracklets, i.e., successions of video frames where all the animals are detected, with little or no overlap between the bounding boxes.

On one tracklet, the same animal has a unique ID, but this ID might differ from one tracklet to another. In other words, the tracking problem is easy to solve at the tracklet scale but, of course, not at the video scale. The main idea of the method is to select one particular tracklet, called the best-track, on which a CNN is trained to extract the texture/appearance information that identifies each animal. This appearance CNN, denoted A-CNN, is then used on the rest of the video to extract texture information from the detections.

Once the best-track is defined and the A-CNN trained on it, the concept of tracklets can be set aside. The inference on the entire video can start, divided into two further steps: the forward and backward passes. In the forward pass, tracking is conducted from the end of the best-track to the end of the video. Conversely, in the backward pass, tracking is performed from the beginning of the best-track to the start of the video, or to the end of the next best-track in the case of the semi-supervised version. Each pass, like most tracking algorithms, is an ID assignment problem: it involves associating the detections of the previous (known) frame with the detections of the current (unknown) frame. Note that at the last (or first) frame of the best-track, the IDs of all detections are known; this is the main principle of Puzzle. The two passes consist of propagating these IDs from the best-track to the entire video.

Assigning the IDs to the detections of the current frame is based on a cost matrix C, where \(C_{i,j}\) is the cost of assigning the animal ID j to detection number i of the current frame, given the detections of the previous frame. The cost is generally a weighted sum of different metrics, such as texture or location distances.

The main improvements of Puzzle over Wizard are as follows. First, the assignment cost used for the creation of tracklets was redefined to be more robust, especially against false positive detections. Second, the assignment cost used during the backward and forward passes was also improved: \(C_{i,j}\) is no longer a weighted sum of metrics but the output of a CNN that takes the different metrics as inputs. Third, the structure of the A-CNN was changed, adding an attention layer at its head to address illumination changes, while the classification layer was replaced by a Mixture of Experts. Fourth, there is an option to manually label several best-tracks, which takes limited time but can greatly improve the tracking results.

We will begin by describing the improvements of this new method. Then, we will describe the available data and the method used to detect animals.
Finally, we will present the method used to compare the different tracking methods.

Puzzle main improvements

Assignment cost for the creation of tracklets

The main idea of Puzzle (and Wizard) is to first concentrate on simple sequences of the video, where tracking is easy. In this case, the assignment cost matrix C relies only on location information to associate the closest bounding boxes between frames.

For Puzzle, our intention was to refine the assignment cost because certain false positive detections persisted, even after employing various elimination methods. Consequently, we introduced a new assignment cost based on two metrics: the distance between bounding boxes and their overlapping area.

$$\begin{aligned} \text {overlapping}_{i,j}&= 1 - IoA(B_i^t, B_j^{t+1}) = 1 - \frac{Area(B_i^t \cap B_j^{t+1})}{Area(B_j^{t+1})}, \\ \text {distance}_{i,j}&= \sqrt{\sum \left( B(x,y)_i^t-B(x,y)_j^{t+1}\right) ^2}, \\ C_{i,j}&= (1+\text {overlapping}_{i,j})\times \text {distance}_{i,j}\times (1+\delta ). \end{aligned}$$
(1)
Where B denotes a bbox or a detection, B(x, y) denotes the centroid of the detection, and \(B^t_i\) denotes detection number i on frame t. t is used to denote the previous frame and \(t+1\) the current frame. On the previous frame, an animal ID is associated with every bbox, and the animal IDs have to be estimated for the current frame.

The overlapping metric is the intersection over area (IoA), used to quantify the spatial overlap between two consecutive bounding boxes. The \(1-\) adjustment is applied to invert the outcome, as the assignment problem is a minimization problem. The distance metric is simply the Euclidean distance between the centroids of the two detections. The composite cost function (1) combines these two metrics and is computed for every pair i and j. The bounding box association problem is then solved using the Hungarian algorithm with the defined cost matrix C.

Once the associations are resolved, various thresholds are applied, mostly in the same way as in Wizard, to remove false positives and break the tracks into tracklets.
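For illustration, a minimal Python sketch of this tracklet-level association is given below. It is not the authors' implementation: bboxes are assumed to be axis-aligned tuples (x1, y1, x2, y2), the intersection is normalized by the area of the current detection, and the \(\delta\) term of Eq. (1) is exposed as a plain parameter.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def centroid(box):
    """Centroid of an axis-aligned bbox given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def ioa(b_prev, b_curr):
    """Intersection over area: intersection divided by the area of the current bbox."""
    ix1, iy1 = max(b_prev[0], b_curr[0]), max(b_prev[1], b_curr[1])
    ix2, iy2 = min(b_prev[2], b_curr[2]), min(b_prev[3], b_curr[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (b_curr[2] - b_curr[0]) * (b_curr[3] - b_curr[1])
    return inter / area if area > 0 else 0.0

def tracklet_cost_matrix(prev_boxes, curr_boxes, delta=0.0):
    """Cost of Eq. (1): (1 + overlap term) * centroid distance * (1 + delta)."""
    C = np.zeros((len(prev_boxes), len(curr_boxes)))
    for i, bp in enumerate(prev_boxes):
        for j, bc in enumerate(curr_boxes):
            overlap = 1.0 - ioa(bp, bc)
            dist = np.linalg.norm(centroid(bp) - centroid(bc))
            C[i, j] = (1.0 + overlap) * dist * (1.0 + delta)
    return C

# Associate the detections of two consecutive frames.
prev_boxes = [(10, 10, 50, 60), (200, 120, 260, 180)]
curr_boxes = [(205, 125, 262, 184), (12, 14, 52, 63)]
rows, cols = linear_sum_assignment(tracklet_cost_matrix(prev_boxes, curr_boxes))
print([(int(r), int(c)) for r, c in zip(rows, cols)])  # [(0, 1), (1, 0)]: each previous bbox keeps its ID
```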
Definition of the A-CNN: FurryMixtureNet

The A-CNN is used to extract appearance information from the detections: what are the appearance clues allowing each animal to be distinguished? The main proposition of both Wizard and Puzzle is to train the A-CNN directly on the images from the video being analyzed.

In Puzzle, we introduced a new version of the A-CNN called FurryMixtureNet. It is a classification network with N classes, one for each animal. It consists of two modules: (i) an image encoder module to extract features at different scales, and (ii) a classification module that combines these features for classification through a mixture of experts.

FurryMixtureNet: Image Encoder module

The initial component of the network is a conventional image feature extractor, resembling a lightweight version of VGG. As suggested by the few-shot learning challenge and previous studies24,25, we utilized a smaller structure than in Wizard, with an input size of (84, 84, 3). Additionally, special convolutional blocks and skip connections were integrated to enhance the gradient path during training. Compared to Wizard, we also introduced the CBAM26 and SqueezeExcitation27 modules.

Figure 6. The image encoder architecture consists of a CBAM attention mechanism to focus on relevant parts of the image. Then, several depth-wise convolutional blocks are employed to efficiently extract features at the current scale. Additionally, skip connections, represented by the symbol \(\bigoplus\), aid in preserving some features from previous scales. Similarly, the Squeeze Excitation layer allows capturing and focusing on distinct features, particularly color information. Finally, the extracted features are flattened and linearly combined to produce a feature vector of length 1538, which is then fed into the classification module.

The Convolutional Block Attention Module (CBAM)26 is a network module employed for enhanced feature representation and classification accuracy. CBAM consists of a two-stage attention mechanism designed to learn to focus on crucial image regions and channels. In the first stage, the channel attention module pools over the spatial dimensions to generate a channel descriptor, which is then used to weigh the importance of each channel. In the second stage, the spatial attention module pools along the channel axis to generate a spatial descriptor, which is used to weigh the importance of each spatial location. The final output is obtained by applying the channel and spatial attention maps to the features through element-wise multiplication. By concentrating on significant regions and channels, CBAM contributes to improving the robustness of Convolutional Neural Networks (CNNs), particularly in the face of illumination changes. For example, the CBAM module was integrated into yolov328 and yolov529 for the same purpose.

The DepthwiseConvBlock consists of two stages of convolution, batch normalization, and a GELU activation function. This activation function is recognized for its faster convergence and greater resilience to noise compared to other activation functions such as ReLU or ELU30. The initial convolutional layer in the first DepthwiseConvBlock includes a bias, while the subsequent layers do not. The first convolutional layer of each DepthwiseConvBlock employs a \(3 \times 3\) kernel with depth-wise operations, and the second layer uses a \(1 \times 1\) kernel with point-wise operations. This approach, introduced in31, combines the DepthwiseConvBlock with skip connections, denoted by \(\bigoplus\). An essential aspect of the DepthwiseConvBlock is its computational speed: it is significantly faster than a regular convolutional block because it minimizes the number of multiplications and mathematical operations.

The incorporation of a SqueezeExcitation layer27 was motivated by empirical observations. We found that this layer enhances our network's robustness by facilitating the extraction of a more diverse range of discriminative features, particularly concerning colors. The SqueezeExcitation mechanism models interdependencies between channels, explicitly capturing and leveraging the relationships among different channel responses. By integrating the SqueezeExcitation layer, we observed a reduction in confusion between goats of different colors. However, stacking multiple SqueezeExcitation layers posed challenges during the learning process: it affected the gradient path, causing divergent learning. Therefore, only one layer was used.

A detailed illustration of the image encoder's design can be found in Fig. 6.
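As an illustration of these building blocks, here is a minimal PyTorch sketch of a depth-wise convolutional block and a SqueezeExcitation layer. The channel sizes, reduction ratio, and bias choices are assumptions for the example and do not reproduce the published FurryMixtureNet configuration.

```python
import torch
import torch.nn as nn

class DepthwiseConvBlock(nn.Module):
    """Depth-wise (3x3) then point-wise (1x1) convolution, each followed by BatchNorm and GELU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=True)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.GELU()

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

class SqueezeExcitation(nn.Module):
    """Channel re-weighting: squeeze (global average pooling) then excite (two dense layers)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.GELU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))   # (B, C) channel descriptors
        return x * w[:, :, None, None]    # scale each channel map

x = torch.randn(2, 3, 84, 84)             # input size used by FurryMixtureNet
y = SqueezeExcitation(32)(DepthwiseConvBlock(3, 32)(x))
print(y.shape)                             # torch.Size([2, 32, 84, 84])
```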
FurryMixtureNet: Classification module

The classification process is illustrated in Fig. 7. The network takes the feature vector generated by the image encoder as input, feeding it into Mixture of Experts (MoE) modules32, which mix and classify the features into their most appropriate categories. MoE is a type of neural network architecture that divides the input data into multiple subspaces, using a separate "expert" network to make predictions for each subspace. The outputs of the expert networks are then combined to produce a final prediction. MoE is relevant for classification tasks because it enables the network to model complex, multi-modal distributions in the input data, which may not be accurately captured by a single expert network. In the context of individual classification in videos, MoE can help account for the large variability in appearance and goat posture.

Figure 7. Classification module: the MoE approach.

The proposed architecture for the expert modules is provided in Fig. 8 and can be viewed as a small transformer block33. It is composed of several layers, including Dense, Batch Normalization, GELU, Multihead Attention, FeedForward, and a final Sigmoid layer for classification. Each expert specializes in certain features, and their contributions are dynamically weighted through the Gating Top-k. The Multihead Attention layer enables the model to learn to concentrate on the parts of the input that are pertinent to the classification task. Meanwhile, the FeedForward layer plays an important role in refining the feature representation obtained from the previous layers, helping the model capture high-level connections between features that are not linked to their spatial position in the image. As a result, the architecture can identify intricate patterns and interactions between features, learn to focus on important regions of the input, and detect high-level relationships between features beyond their spatial arrangement. Recall that the input features are derived from the image encoder, which includes the CBAM and SqueezeExcitation layers, providing diverse and representative features in both spatial and colorimetric space.

Figure 8. ExpertModule: small architecture based on “Attention is all you need”.

A Sigmoid activation function was used instead of Softmax to achieve proportional representation: when goats are separated and do not overlap, the classification using the Sigmoid activation function is expected to be accurate, similar to a one-hot encoder. However, in cases where two or more goats are visible in the same image, the classification output is expected to provide a proportional representation of each goat based on their visibility. Achieving this functionality is challenging with the Softmax activation function, as values tend to be maximized in a single class, whereas the Sigmoid activation function enables the computation of a proportion for each individual separately. To ensure this proportional representation, custom data augmentation techniques were designed.
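The following sketch illustrates the general idea of a top-k gated mixture of experts with sigmoid outputs. The number of experts, the value of k, the hidden size, and the use of plain dense experts (instead of the attention-based ExpertModule of Fig. 8) are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEHead(nn.Module):
    """Toy mixture-of-experts head: a gate selects the k most relevant experts per sample,
    and their class scores are blended; sigmoid outputs allow proportional representation."""
    def __init__(self, in_dim=1538, n_classes=10, n_experts=4, k=2, hidden=128):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(in_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(in_dim, hidden),
                nn.BatchNorm1d(hidden),
                nn.GELU(),
                nn.Linear(hidden, n_classes),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):
        scores = self.gate(x)                               # (B, n_experts)
        topk_val, topk_idx = scores.topk(self.k, dim=-1)    # keep the k best experts
        weights = F.softmax(topk_val, dim=-1)               # (B, k)

        expert_out = torch.stack([e(x) for e in self.experts], dim=1)      # (B, E, C)
        picked = torch.gather(
            expert_out, 1,
            topk_idx.unsqueeze(-1).expand(-1, -1, expert_out.size(-1)))    # (B, k, C)
        logits = (weights.unsqueeze(-1) * picked).sum(dim=1)               # (B, C)
        # Sigmoid (not softmax) so that two overlapping animals can both score high.
        return torch.sigmoid(logits)

features = torch.randn(8, 1538)        # e.g., output of the image encoder
print(TopKMoEHead()(features).shape)   # torch.Size([8, 10])
```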
Assignment cost for the forward and backward passes

The third aspect that differs from Wizard is the definition of the assignment cost matrix used to assign the IDs between the previous and current detections during the forward and backward passes.

Let us denote by \(\left( \text {bbox}^\text {past}_i,\text {id}=i\right) _{i=1}^N\) the set of past bboxes with their associated animal IDs, i.e., the most recently estimated bbox for each individual. Note that these bboxes do not necessarily come from the same frame, as some individuals might be undetected over several consecutive frames. Then, for each new frame, called the current frame, we have a set of bboxes \(\left( \text {bbox}^\text {current}_i\right) _{i=1}^{n}\), with \(n \le N\), as again, some animals might not be detected. Note also that, in some rare cases, \(n>N\) if false detections occur. The tracking problem consists of associating each current bbox with one of the past bboxes. Once the bbox association is made, the past bbox set is updated, and so on.

In order to make the assignment, a cost matrix C is used, where \(C_{i,j}, j\le N\), is the cost of associating the i\(^\text {th}\) bbox of the current set with the j\(^\text {th}\) bbox of the past set, or equivalently, of assigning animal ID j to the i\(^\text {th}\) bbox of the current set. \(C_{i,j}\) measures a distance or similarity between the two bboxes. Once C is computed, making the association is a simple combinatorial optimization problem, called the assignment problem, easily solved in polynomial time (\(O(n^3)\)) using the Hungarian algorithm34. Mathematically, it consists of finding optimal, pairwise distinct row indices \((i_1^*,\dots ,i^*_N)\) for which:

$$\begin{aligned} \sum _{j=1}^{N}C_{i_j^*,\,j} \le \sum _{j=1}^{N}C_{i_j,\,j},\qquad \forall \,(i_1,\dots ,i_N)\ \text {with}\ 1 \le i_j\le N\ \text {and}\ i_{j_1}\ne i_{j_2}\ \text {for}\ j_1\ne j_2. \end{aligned}$$

The major difficulty when designing a tracking algorithm is to propose a reliable cost definition that takes into account every aspect of the tracking problem, such as false negative/positive detections and occlusions. A common approach is to minimize a global cost function that measures the dissimilarity between objects inside the bounding boxes of the previous and current frames. Typically, an empirical function is used to compute this cost, as proposed in Wizard18 and in other popular MOT algorithms such as DeepSort17, BOT-Sort35, ByteTrack36, or HybridSort37.

Association metrics

Another approach to modeling the assignment cost consists of using a learnable function that maps pairs of detections to a scalar cost \(C_{i,j}\). This function can be learned using supervised learning38, where pairs of detections are labeled with their true association cost. In this context, various approaches can be found, from GNNs (Graph Neural Networks)39 to sequence classification40. In this work, we introduced a compact neural network that uses various association metrics as inputs and combines them to determine the accurate assignment cost. We proposed to compute the following metrics (a sketch assembling such a metric vector is given after the list):

texture:
The Manhattan distance between the one-hot class vector of the known past detection and the class vector predicted by the A-CNN for the current detection. By applying this distance function, we can quantitatively measure the appearance dissimilarity between the past and current detections.
IoA:
The Intersection over Area (IoA) is a similarity metric that quantifies the overlap between two bounding boxes. It is calculated by dividing the area of intersection between the two bboxes by the area of one of the boxes (cf. Eq. 1). The IoA value ranges from 0 to 1, where a value of 0 indicates no overlap and a value of 1 indicates complete overlap. By using the IoA as a similarity metric, we can assess the degree of spatial overlap between two bounding boxes and determine the level of similarity between objects or regions of interest.
IoU:
The Intersection over Union (IoU) is a widely used similarity metric for assessing the overlap between two bounding boxes. It is calculated by dividing the area of intersection between the two bboxes by the area of their union. The IoU value ranges from 0 to 1, where a value of 0 indicates no overlap and a value of 1 indicates complete overlap between the bboxes. Here, the IoU was defined as the mean of the IoU between the current detection \(B_j^{t+1}\) and each past detection in the last seconds of the same tracklet. The IoU quantitatively assesses the spatial correspondence between objects.
time:
The absolute time distance quantifies the temporal proximity or similarity between two events or time points. If the time difference between two detections of the same object is extremely short, this implies minimal change in the object’s location between the two frames. Conversely, a substantial time difference, for example, when the object remained undetected over several frames, might suggest significant alteration in the object’s location across the frames, indicating a degree of dissimilarity. By combining the absolute time distance with other metrics, the assignment process can prioritize the temporal proximity of detections.
pos_old:
The Euclidean distance between the centroids of the current and past detections is used as a dissimilarity metric. This metric quantifies the spatial disparity between the current detection and the most recent associated centroid. A smaller distance indicates a higher level of similarity, while a larger distance suggests greater dissimilarity. However, this metric may not be robust to false associations, even brief ones.
pos_med:
The Euclidean distance between the centroid of the current detection and the median of the centroids detected over the last second is used as a dissimilarity metric. Again, it measures the spatial difference between the current and previous detections. A smaller distance indicates higher similarity, and a larger distance indicates greater dissimilarity. It should be more stable than pos_old in the case of short false associations.
speed:
The speed of the animal is computed at each frame over the last 3 seconds. To assess similarity, we compare the smallest (speed_a = 10% quantile) and largest (speed_b = 90% quantile) values from this list with the current speed of the animal, calculated as the distance traveled divided by the time interval. This serves as a similarity metric by evaluating the consistency of the animal's speed over time. A current speed close to the range defined by these two quantiles indicates a higher degree of similarity, suggesting that the animal is maintaining a relatively stable and consistent motion pattern. Conversely, a current speed far outside this range indicates a deviation from the expected pattern, suggesting a change in the animal's movement dynamics. The quantiles are used to remove possible outliers coming from short false associations.
theta:
Three spatio-temporal points are used: the current detection centroid, the last associated centroid, and the median centroid over the last second. To measure the similarity between these points, we use the angle formed by these three locations. A small angle suggests that the detections are consistent with a coherent, forward-moving trajectory, indicating that the tracked object remains relatively stable over time. A medium angle may indicate that the object is in motion, while a large angle is unlikely, as it would correspond to abrupt changes in position that are not consistent with most movement patterns.
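A minimal sketch assembling part of such a metric vector is given below. It covers only a subset of the metrics listed above (texture, IoU, time, pos_old, and pos_med), assumes that the texture term compares the A-CNN output of the current detection with the one-hot ID vector of the past one, and uses illustrative field names.

```python
import numpy as np

def centroid(b):
    return np.array([(b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0])

def iou(b1, b2):
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((b1[2] - b1[0]) * (b1[3] - b1[1])
             + (b2[2] - b2[0]) * (b2[3] - b2[1]) - inter)
    return inter / union if union > 0 else 0.0

def metric_vector(curr, past):
    """Stack a few of the association metrics for one (current, past) pair."""
    texture = float(np.abs(np.asarray(curr["appearance"])
                           - np.asarray(past["one_hot_id"])).sum())   # Manhattan distance
    overlap = iou(curr["bbox"], past["bbox"])
    dt = abs(curr["time"] - past["time"])                              # temporal proximity
    pos_old = float(np.linalg.norm(centroid(curr["bbox"]) - centroid(past["bbox"])))
    pos_med = float(np.linalg.norm(centroid(curr["bbox"])
                                   - np.median(past["recent_centroids"], axis=0)))
    return np.array([texture, overlap, dt, pos_old, pos_med])

curr = {"bbox": (12, 14, 52, 63), "appearance": [0.1, 0.8, 0.1], "time": 10.4}
past = {"bbox": (10, 10, 50, 60), "one_hot_id": [0, 1, 0], "time": 10.0,
        "recent_centroids": [[30.0, 35.0], [31.0, 36.0], [29.0, 34.0]]}
print(metric_vector(curr, past))
```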
Balance between association metrics: FuzzyBuddy

To amplify the influence of the texture metric between individuals, this variable was duplicated thrice, resulting in a list of 11 metrics. This operation allocates more weights to the texture distance in the neural network, allowing the model to prioritize the correct tracks based on appearance rather than relying on incorrect tracks suggested by the other, spatially based metrics. However, the other metrics still play an important role in case of errors made by the image classification network. Therefore, determining the weight of each distance metric is a delicate balance between the two aspects. To learn this balance, we proposed the following neural network, named FuzzyBuddy and described in Fig. 9, to compute the value of the scalar \(C_{i,j}\).

Figure 9. The proposed FuzzyBuddy network used to weight the distance metrics for the ID assignment problem.

Several studies41,42 have reported that working with logarithmic data leads to more stable and robust neural networks, as well as faster convergence. In the FuzzyBuddy network, the first layer is thus a logarithmic transform, which can be expressed mathematically as \(f(x) = \text {sign}(x) \cdot \log (\text {abs}(x)+1)\). The use of the log transform was also motivated by the need to manage spatial distances, as large distance values could disrupt the learning process and place too much emphasis on these metrics compared to the textural distances. The following two layers consist of convolutions and GELU activations. The choice of convolution over a linear layer was motivated by the group-based operation, which allows specific properties to be processed separately rather than mixed together. Additionally, the GELU activation was chosen for its convergence speed and its ability to create fuzzy thresholds. A skip connection is added to mix the input data and the learned "thresholds".

The BMM (Batch Matrix Multiplication) layer is a custom layer that utilizes batch matrix multiplication to generate novel combinations of properties that cannot be achieved through standard linear operations. Specifically, this layer computes the matrix product of the input variable \(x\) with its transpose \(x^T\), resulting in a square symmetric matrix \(X\). The BMM layer then replaces the diagonal elements of the output matrix with the original input values, effectively removing the presumably unnecessary squared values. The BMM layer allows for the exploration of non-linear relationships and interactions between the input variables.

A compact dense layer was introduced to combine the features obtained from the previous, flattened BMM layer. It learns polynomial coefficients, since the previous operations can be considered as complex multivariate Bernstein expansions43,44,45. A GELU activation was used to introduce some additional non-linearity and enable a smooth/fuzzy transition between variables. A final dense layer was proposed to reduce the number of variables. Finally, to prioritize the texture metric (i.e., the appearance clue), the output was squared and multiplied by the texture metric (\(A^2 \times B^2 + A\)) before being fed into the final dense layer, which represents the optimal association cost. This ensures that the network gives greater importance to the texture metric while retaining some spatio-temporal information.
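The sketch below illustrates two of these ingredients, the signed logarithmic transform and the BMM layer, in a heavily reduced cost network. The grouped convolutions, the skip connection, and the texture re-weighting of the output are omitted, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SignedLog(nn.Module):
    """First FuzzyBuddy layer: f(x) = sign(x) * log(|x| + 1)."""
    def forward(self, x):
        return torch.sign(x) * torch.log(torch.abs(x) + 1.0)

class BMMLayer(nn.Module):
    """Pairwise products of the input variables via x x^T, with the diagonal
    (squared values) replaced by the original inputs."""
    def forward(self, x):                                     # x: (B, D)
        outer = torch.bmm(x.unsqueeze(2), x.unsqueeze(1))     # (B, D, D)
        idx = torch.arange(x.size(1))
        outer[:, idx, idx] = x                                # restore raw values on the diagonal
        return outer.flatten(1)                               # (B, D*D)

class FuzzyCostNet(nn.Module):
    """Heavily reduced FuzzyBuddy-like cost network: one scalar cost per metric vector."""
    def __init__(self, n_metrics=11, hidden=16):
        super().__init__()
        self.log = SignedLog()
        self.bmm = BMMLayer()
        self.head = nn.Sequential(
            nn.Linear(n_metrics * n_metrics, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.head(self.bmm(self.log(x)))   # scalar cost C_{i,j} per pair

pairs = torch.randn(5, 11)                         # 5 candidate (current, past) pairs
print(FuzzyCostNet()(pairs).shape)                 # torch.Size([5, 1])
```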
If the hard constraint on texture is removed, an interesting observation emerges: the learned network tends to become overly focused on spatial information, disregarding the valuable cues provided by texture. As a result, the network's performance may suffer, particularly in tasks that require accurate classification or identification based on both spatial and texture cues. For example, it becomes incapable of recovering the original track after a wrong association, even a short one. Keeping the constraint allows the network to achieve a more balanced and holistic understanding of the input data, leading to improved performance and avoiding the pitfalls of overfitting on spatial information alone.

Training of the FuzzyBuddy network

To fit the model parameters, we used the ground truth of several videos, i.e., the detections and the associated true animal IDs. At each video frame, all the distance metrics defined earlier were computed between every pair of current versus previous detected objects. Then, we used the ground truth to set the value of the assignment (i.e., 0 if the assignment is correct, 1 otherwise).

The extraction of the distance metrics, together with the assignment value, was done on 15 videos. Note that we only focused on frames of the video where there was at least one overlapping section between some detections (IoU > 0.05), while the other detections might be non-overlapping. Indeed, in most of the videos there is no overlap between bboxes, and if the network were trained on the entire video, these sequences would have a strong effect on the model parameters, resulting in a network that would probably put no weight on the texture metric.

Training was approached as a contrastive learning problem. During optimization, we clamped the CNN's output values to the range [0,1], which can be seen as optimizing this decision boundary in probability \(p_i\). Due to the imbalance in the ground truth data, with many soft cases and few hard cases (overlapping goats), we used the focal loss, a modified version of the standard cross-entropy loss function, as suggested in other studies46,47, specifically for anomaly detection42. In our case, the binary cross-entropy was employed in the focal loss and defined as follows:

$$\begin{aligned} bce\_loss(p, y) = -\frac{1}{N}\sum _{i=1}^{N}\left[ y_i\log p_i + (1-y_i)\log (1-p_i)\right] . \end{aligned}$$
(2)
Where p is the predicted probability for each example, y is the corresponding true label (either 0 or 1), and N is the total number of examples in the batch. The focal loss introduces the hyper-parameters \(\alpha\) and \(\gamma\), which balance the contribution of each class and down-weight the contribution of well-classified examples. The focal loss can be expressed as follows:

$$\begin{aligned} \begin{aligned} a_i&= y_i\alpha + (1-y_i)(1-\alpha ) \\ m_i&= \left( 1-\left[ y_ip_i + (1-y_i)(1-p_i)\right] \right) ^{\gamma } \\ focal\_loss(p, y)&= \frac{1}{N} \sum _{i=1}^{N} a_i \cdot m_i \cdot bce\_loss(p_i, y_i) \end{aligned} \end{aligned}$$
(3)
The focal loss is thus a combination of the modulating factor \(m_i\), the class weight \(a_i\), and the binary cross-entropy loss. The binary cross-entropy measures the differences between the predicted probabilities and the true labels, while the modulating factor reduces the weight given to well-classified examples. Finally, the focal loss is obtained by averaging the modified cross-entropy losses over the batch. We manually tuned \(\gamma = 2.5\) and \(\alpha = 0.25\).

Finally, we used the Adam optimizer with an initial learning rate of 0.0001 and beta parameters of 0.9 and 0.999. A ReduceLROnPlateau scheduler was also employed with default settings. The training process lasted for 20 epochs and was run on 15 videos. The remaining videos (\(53-15=38\)) were used for testing the tracking algorithm. Note that, as in Wizard, the common Hungarian algorithm is used to perform the ID assignment once the cost matrix is built from the neural network output (\(C_{i,j} = NN(x) \; |\; |x|=11\)).
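For concreteness, a minimal PyTorch sketch of Eqs. (2) and (3) is given below, using the values of \(\alpha\) and \(\gamma\) reported above; the clamping constant is only added for numerical stability and is an assumption.

```python
import torch

def focal_loss(p, y, alpha=0.25, gamma=2.5, eps=1e-7):
    """Binary focal loss of Eq. (3): class weight * modulating factor * per-example BCE."""
    p = p.clamp(eps, 1.0 - eps)
    bce = -(y * torch.log(p) + (1.0 - y) * torch.log(1.0 - p))   # per-example BCE, Eq. (2)
    p_t = y * p + (1.0 - y) * (1.0 - p)                          # probability of the true class
    a = y * alpha + (1.0 - y) * (1.0 - alpha)                    # class balancing weight
    m = (1.0 - p_t) ** gamma                                     # down-weights easy examples
    return (a * m * bce).mean()

# Example: predicted costs for three pairs, labeled 0 (correct assignment) or 1 (wrong).
p = torch.tensor([0.1, 0.8, 0.4])
y = torch.tensor([0.0, 1.0, 0.0])
print(focal_loss(p, y).item())
```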
Puzzle semi-supervised learning option

We proposed a modified version of Wizard in which the A-CNN can be trained on several best-tracks. However, in this case, the ID of each animal in the best-tracks has to be recorded manually. In the case of multiple best-tracks, the user was provided with a GUI to annotate five best-tracks. These best-tracks are strategically selected: one close to the beginning of the video, one close to the end, and the three longest tracklets of the video. These annotated best-tracks are then used for the training of the A-CNN. Note that, in this case, multiple forward and backward passes are used. A backward or forward pass always starts at the beginning or end of a best-track. Then, when a video sequence is located between two best-tracks, the forward and backward passes of each best-track end in the middle of that video sequence (see Fig. 1).

Tracking efficacy

We proposed using the percentage of good tracking (PGT) to characterize the efficacy of the tracking algorithm. The PGT was computed for each animal in the video, representing the percentage of frames where the tracking algorithm provided the correct animal ID. It is also referred to as precision in some articles. It should be noted that commonly used measures, such as MOT or ClearMOT, were tested in the previous article. While these measures are widely used, it was discussed in the previous article that they are not representative of our study.

Type and duration of errors

Common tracking errors occur when animals are not detected or when animals are close to each other, leading to bounding boxes (bbox) that contain parts of different animals. If an animal is lost and then re-detected, possibly in a different posture or location, it becomes challenging for the algorithm to make the correct association, as it relies on information from the previous detection. When animals are close to each other, confusion arises because their location information is similar and their appearance information gets mixed. Another possible error occurs when both a false positive and a false negative detection happen in the same image. In such cases, the Hungarian algorithm may forcibly associate the false positive with an individual, resulting in a momentary error where all individuals may be swapped due to the cost minimization.

To assess the ability of the different methods to handle specific cases, we employed the following methodology. For each method, video, and animal, the video sequences with incorrectly predicted animal IDs were identified, and we recorded the frame number \(t\) at the beginning of these sequences, as well as \(na^t\), the number of detected animals in this frame. We examined the detections from the frame just before the start of the error sequence, \(t-1\), and also recorded the number of detected animals \(na^{t-1}\). We then compared the number of animals in these two frames.

If \(na^t > na^{t-1}\), the error was attributed to a new detection, meaning an animal was lost and re-detected on frame \(t\) but was assigned the wrong animal ID. Conversely, if \(na^t < na^{t-1}\), an animal was lost, which could also have caused an error. Finally, when \(na^t = na^{t-1}\), the error was not due to a detection problem. In this case, we computed the overlapping ratio between the detections of the animals that received the wrong animal ID. We then calculated the proportion of cases where the bounding boxes were overlapping (i.e., with a non-zero overlapping ratio), which we attributed to a tracking error due to occlusion. We summarized the results for each method and computed the proportion of errors with the same number of detections, both with and without occlusions, and of errors attributed to detection problems, with a lower or greater number of detections at the start of the error sequence. Additionally, we recorded the duration, in seconds, of the error sequences for each method during this analysis.

Impact for monitoring behavior

In order to assess the suitability of the tracking method for behavioral studies, we analyzed the impact of tracking errors on the estimation of two behavioral traits: space use and activity.

Space use

In behavioral studies, the utilization of space is often a crucial indicator, especially for estimating the time spent in specific areas such as resting places, water sources, or food dispensers. In our study, we focused on the estimation of occurrence frequency at the pasture scale, which can subsequently be used to derive other traits (e.g., the frequency in a particular location). To analyze space utilization, we divided the pasture in each video into a square grid of 10 pixels by 10 pixels, assigning a unique ID to each grid cell. For a given video, we calculated the number of detections of animals in each cell, resulting in an occurrence frequency map for each animal and video. These maps were generated with each tracking method and then compared to the ground truth labeled data. The error for each cell in the grid was computed as follows:

$$\begin{aligned} error = 100\times abs\left( \frac{{\tilde{f}}_i}{f_i}-1\right) . \end{aligned}$$
(4)
Where \({\tilde{f}}_i\) and \(f_i\) are the estimated and true frequency on cell i, and abs is the absolute value function. The error is thus expressed as a percentage.

Activity

We categorized three types of activities: grazing, lying, and other. Grazing activity was determined based on the head position of the animal, specifically when the head was below the shoulder, often corresponding to situations where the animal's head touches the ground. An animal was considered lying when it was not on its feet, indicating a resting position on the ground. Any behaviors other than grazing or lying were classified as "other", including activities like resting while standing or moving.

To estimate the activities, we employed a neural network for each goat (see below), treating it as an image classification problem. The network took the image of the goat as input and predicted the activity class (grazing, lying, or other). The tracking method provided the images of each goat for every video frame, enabling us to predict their activities. We computed the activity time budget for each animal in each video, representing the proportion of frames where the animal was grazing, lying, or engaged in other behaviors. This activity time budget was then compared to the ground truth labeled data, and the estimation error was computed for each video, animal, and activity using Eq. 4.

To predict the activity from the image, we constructed a simple neural network in order to speed up the prediction and training process. The input size of the image was set to 28 by 28 pixels. The network is composed of a single convolutional layer, with 30 filters of size (5, 5), a stride and dilation factor of (1, 1), and no padding, followed by a batch normalization layer and a ReLU layer, and finished with a fully connected layer, a softmax, and a classification layer, using a cross-entropy loss function (see Fig. 10).

Figure 10. Activity prediction network.

We used another study to quickly construct the training and test sets. In 22 videos, the animals were equipped with accelerometers attached to one horn. Although the models based on these accelerometers were initially trained to predict more detailed behaviors, we repurposed their predictions to automatically generate images where goats were either grazing, lying, or engaged in other behaviors. For each video, we utilized the accelerometer predictions to create 100 images per animal: 50 images indicating grazing and 50 images indicating non-grazing activities (i.e., either lying or other behaviors). Using the accelerometer data, we estimated the activity and automatically saved the images in the corresponding folders for the three activity types. All the images underwent manual inspection to correct potential misclassifications from the accelerometers, with images moved to the correct folders in case of errors. Images associated with challenging situations, such as those affected by occlusion, were deleted. In total, we collected 6530 images, with 1714 images representing grazing, 2022 lying, and 2794 other behaviors. We split the images into a training set (70%) and a test set (30%). On the test data, the recall rates were 96.1% for grazing, 89.8% for lying, and 86.4% for other behaviors. The precision rates were 83.3%, 92.2%, and 93.4% for grazing, lying, and other behaviors, respectively.

For both behavioral traits, the estimation errors were averaged per method. We were particularly interested in studying the variation of the behavioral errors as a function of tracking efficacy.
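A minimal PyTorch sketch of the activity classifier of Fig. 10 is given below. The three-channel 28 by 28 input and the toy batch are assumptions; the optimizer and data preparation used in the study are not reproduced.

```python
import torch
import torch.nn as nn

# Sketch of the Fig. 10 network: one 5x5 convolution (30 filters, stride 1, no padding),
# batch normalization, ReLU, then a fully connected layer over the three activity classes.
activity_net = nn.Sequential(
    nn.Conv2d(3, 30, kernel_size=5),
    nn.BatchNorm2d(30),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(30 * 24 * 24, 3),        # 28x28 input -> 24x24 feature maps after the 5x5 conv
)
criterion = nn.CrossEntropyLoss()      # combines the softmax and the cross-entropy loss

crops = torch.randn(4, 3, 28, 28)      # a batch of goat image crops
labels = torch.tensor([0, 1, 2, 0])    # 0 = grazing, 1 = lying, 2 = other
loss = criterion(activity_net(crops), labels)
print(loss.item())
```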
