Deep learning pose detection model for sow locomotion

Data were collected on a commercial pig farm (Topgen) located in Jaguariaíva, Paraná, Brazil. The experiment was conducted with the approval of the Ethics Committee on Animal Use (CEUA) of the Faculty of Veterinary Medicine and Animal Science of the University of São Paulo (USP) under registration number 9870211117. The study was conducted according to the ARRIVE guidelines (https://arriveguidelines.org/). The computer vision models were created with support from the Robotics and Automation Group for Biosystems Engineering (RAEB) at the Faculty of Animal Science and Food Engineering of the USP.

Animal experimentation and data acquisition

A sample of 500 sows in locomotion (Landrace × Large White, Afrodite line) was used to individually record 2D video images (two dimensions: width and length) and build the video image repository. The farm's routine was not altered for this experiment. Data were collected every day between 8 a.m. and 5 p.m. from May 9th through May 17th, 2022.

The filming setup was built in an empty pen (6 × 4 m) in the farm facilities. A solid floor area was delimited with two galvanized wires (2.10 mm) to create a corridor, and a return area for the animals was provided at the end of the corridor. The corridor and the wall were painted with white acrylic wall paint to enhance the contrast between the animals and the setting. The filming area measured 1.5 m wide by 5 m long.

The following equipment was used: a Dell Inspiron 15 5502 laptop (Core i7, Microsoft Windows 11, 16 GB RAM, 512 GB SSD, NVIDIA GeForce) to record the lateral videos and a Dell Inspiron 3421 laptop (Core (TM) i5, Microsoft Windows 10, NVIDIA GeForce) to record the dorsal videos. A ZED 2i stereo camera (Stereolabs Inc., USA) was positioned 0.6 m above the floor and 3 m away from the corridor wall where the animals passed. Another camera was positioned above the corridor, 2.05 m from the floor. This configuration yielded comprehensive lateral and dorsal views of the entire length of the animals, from cranial to caudal. To regulate the luminance of the environment and enhance the video quality, an artificial light system (Greika PK-SB01 lighting kit, two 50 × 70 cm soft boxes) was placed at each end of the corridor, behind the wires and 2.5 m away from the corridor wall. Figures 1 and 2 illustrate the data acquisition facilities.

Figure 1. Scenario used to collect the data of sows in locomotion: (a) superior view and (b) lateral view. The red dashed line is the animal's route entering and turning around; the blue dashed line is the animal's route out of the pen.

Figure 2. Scenario and equipment used to collect the data of sows in locomotion from the lateral and dorsal views: (1) laptop (Core i7); (2) ZED 2i camera under the table and (3) along the feed tube; (4) corridor where the animals walked; (5) Greika PK-SB01 lighting kit; and (6) laptop (Core (TM) i5).

The ZED 2i camera was configured to capture RGB images and point clouds in HD (1920 × 1080 pixels) at 15 frames per second, with an average duration of 30 s of video per animal.

Video preprocessing and annotation

The filtering process had two stages: the sow entering the corridor and the sow passing through the corridor, as illustrated in Fig. 1. Only the returning part was used for training the models (the right lateral side of the animal and the entire dorsal side of the animal). A Python script was used to remove the moment when the animal leaves the pen.
The videos were then converted from SVO to MP4 format. A total of 1207 videos (565 lateral and 642 dorsal) were recorded; however, 40% of the videos were not used because of issues encountered during filming. These problems included the sow not walking, being inactive for a long time, or running instead of walking, as well as video defects such as cuts in the image or low image quality. A total of 364 lateral and 336 dorsal videos were converted to the MP4 format (Fig. 3).

Figure 3. Workflow chart of the process of organizing the videos to add to the SLEAP software and save on the Animal Welfare Science Hub website, as well as the training and testing of the models and the results.

Thirteen experts in farm animal locomotion assessment categorized each sow video using the Zinpro Swine Locomotion Scoring system. The scores ranged from 0 to 3, from no signs of lameness to severe lameness (Table 1).

Table 1. Zinpro's Swine Locomotion Scoring system to assess lameness in pigs.

The 364 lateral videos were evaluated by the experts using Google Forms; only the lateral view was assessed. After the 364 videos were analysed, 11 were removed because the experts were unable to classify the animals' locomotion scores: the sows slipped at the beginning of these videos, which made the assessment difficult. The score indicated by more than 50% of the experts was taken as the final score. In addition, a descriptive statistical analysis (mean, median, standard deviation, maximum and minimum, with box plot visualization) was performed to identify outlier experts, resulting in the removal of three experts and their respective responses. The statistical analysis was performed with the Jamovi37 and R38 software.

In the SLEAP (Social LEAP Estimates Animal Poses) software, only the lateral videos that had a corresponding dorsal view were used. The score assigned to each lateral view video was also assigned to the corresponding dorsal view video of the same sow. Because the expert evaluations of the locomotion scores diverged, the differences between the scores were calculated in Microsoft® Excel, after which the degree of certainty of the answers was calculated. To calculate the difference between the answers (DBA), the maximum of the locomotion scores (0, 1, 2 and 3) is taken, and the sum of the remaining scores (the total sum minus the maximum) is subtracted from it. The formula for calculating the DBA, presented here for the first time, is given in Eq. (1):

$$DBA=\text{max}\left(loc\,scores\right)-\left(\text{sum}\left(loc\,scores\right)-\text{max}\left(loc\,scores\right)\right)$$
(1)
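For illustration only (the original calculations were performed in Microsoft® Excel), a minimal Python sketch of Eq. (1) and of the if–then confidence rules described in the following paragraph might look as shown below; the function names and the example input are assumptions, and Table 2 remains the authoritative mapping.

```python
# Minimal sketch of Eq. (1) and of the if-then confidence rules; the original
# calculations were performed in Microsoft Excel, so this code is illustrative only.

def dba(loc_scores):
    """Difference between the answers (DBA), Eq. (1):
    max(loc scores) - (sum(loc scores) - max(loc scores))."""
    return max(loc_scores) - (sum(loc_scores) - max(loc_scores))


def response_confidence(dba_value):
    """Cascading if-then rules quoted in the text; Table 2 gives the full mapping."""
    if dba_value < -1:
        return "no confidence"
    elif dba_value < 0:
        return "25% certainty"
    elif dba_value < 2:
        return "75-100% certainty"
    # Values of DBA >= 2 are not covered by the rules quoted in the text;
    # see Table 2 for the complete degree-of-reliability scale.
    return "see Table 2"


# Hypothetical example input for one video.
scores = [3, 1, 1]
print(dba(scores), response_confidence(dba(scores)))  # 1 75-100% certainty
```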
After calculating the DBA, the response confidence, given in Table 2, was determined by applying cascading if–then rules: if the DBA is less than −1, there is no confidence in the score evaluation; otherwise, if the DBA is less than 0, there is 25% certainty in the score evaluation; otherwise, if the DBA is less than 2, there is 75–100% certainty in the score evaluation.

Table 2. Degree of reliability of the lameness experts' evaluation of the sows' locomotion scores.

Computer vision models

The computational models were constructed and tested with different deep learning architectures. The models were processed on an HP Z2 Tower G5 workstation with an Intel Xeon W-1270 CPU, 32 GB of RAM, Windows 10 Pro for Workstations (version 21H2), and the SLEAP tool (version 1.3.0). SLEAP (Social LEAP Estimates Animal Poses) is open-source software developed in the Python programming language that provides a deep learning-based framework for pose estimation in different animal species39.

A total of 106 2D videos in the lateral and dorsal views were input into the SLEAP software, with 33 videos for each of the locomotion scores 0, 1 and 2, and seven videos for locomotion score 3 because of the low number of animals with this score. Initially, a skeleton was defined as a set of keypoints marked on the animal's body in each frame of the video. The lateral skeleton was defined by 13 keypoints: snout, neck, right and left hock, right and left metacarpal, dorsal neck, dorsal tail, rump, and the front right, front left, posterior right and posterior left hooves (Fig. 4c). To simplify the model, a lateral skeleton with 11 keypoints was also defined, in which the dorsal neck and dorsal tail keypoints were removed (Fig. 4b). For the dorsal view, a skeleton with 10 keypoints was created: neck, right and left scapula, spine middle, right and left pelvis, tail, head, thoracic and lumbar (Fig. 4e). A preliminary analysis of the established keypoints was performed to identify variations. Based on this analysis, a lateral skeleton with 6 keypoints (removing the dorsal neck, dorsal tail, right and left hock, neck, and right and left metacarpal keypoints; Fig. 4a) and a dorsal skeleton with 7 keypoints (further removing the head, thoracic and lumbar keypoints; Fig. 4d) were established. The lateral videos contained 20,323 frames, of which 3293 were manually labelled by two trained individuals; the dorsal videos contained 14,537 frames, of which 2311 were manually labelled. Only the frames in which the sows were fully visible were considered for training and testing the models.

Figure 4. Identification of the sow skeletons in the lateral and dorsal views in the SLEAP software. Lateral view: (A) 6 keypoints, (B) 11 keypoints and (C) 13 keypoints. Dorsal view: (D) 7 keypoints and (E) 10 keypoints. Keypoints: (1) snout, (2) hoof front right, (3) hoof front left, (4) hoof posterior right, (5) hoof posterior left, (6) rump, (7) neck, (8) pastern right, (9) pastern left, (10) hock right, (11) hock left, (12) dorsal neck, (13) dorsal rump, (14) scapula left, (15) scapula right, (16) middle, (17) pelvic right, (18) pelvic left, (19) lumbar, (20) thoracic, (21) head.

The keypoints were defined to identify and analyse movements for future kinematic studies and to relate these movements to the sow's locomotion score as determined by the panel of observers. The snout and neck keypoints were chosen to identify compensatory head movements.
The dorsal neck, dorsal tail, neck, and tail keypoints were chosen to identify arching of the spine. The hock, hoof, and metacarpal keypoints were chosen to identify which limb the sow had difficulty walking on34.

The SLEAP software allows the settings to be customized to improve the computational model according to the project's needs. Nineteen models (Table 4) with 6-, 7-, 10-, 11- and 13-keypoint skeletons were developed using five different convolutional neural network (CNN) architectures: LEAP, U-Net, ResNet-50, ResNet-101, and ResNet-152.

LEAP (LEAP Estimates Animal Poses): LEAP's pose estimation architecture is based on deep learning and uses a 15-layer convolutional neural network to predict the positions of animal body parts40.

U-Net: U-Net is a convolutional neural network (CNN) architecture with 23 layers and a “U”-shaped format41. The presence of both an encoder and a decoder in this architecture helps it address complex tasks such as posture classification42,43.

ResNet-50: ResNet-50 is a 50-layer residual neural network architecture trained on the ImageNet image database; it is an improved version of the standard CNN44.

ResNet-101 and ResNet-152: ResNet-101 and ResNet-152 are residual neural network architectures with 101 and 152 layers, respectively44.
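As a hedged illustration of what distinguishes the ResNet family, the sketch below shows a generic residual (skip-connection) block in Keras; this is not the backbone implementation used by SLEAP, and the layer sizes are arbitrary assumptions.

```python
# Generic sketch of a residual (skip-connection) block, the building unit shared by
# ResNet-50, ResNet-101 and ResNet-152; illustrative only, not the SLEAP backbone code.
import tensorflow as tf
from tensorflow.keras import layers


def residual_block(x, filters):
    """Compute y = F(x) + x, where F is a small stack of convolutions."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    # Project the identity branch with a 1x1 convolution if the channel depths differ.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Add()([y, shortcut])  # the skip connection
    return layers.ReLU()(y)


# Example: apply one residual block to a 256 x 256 RGB input.
inputs = tf.keras.Input(shape=(256, 256, 3))
outputs = residual_block(inputs, filters=64)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)  # (None, 256, 256, 64)
```

The skip connection lets very deep stacks of such blocks (50, 101 or 152 layers) be trained without the degradation that affects plain CNNs.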

Figure 4a–e, as well as the pixel error graphics, were generated with the SLEAP software. Videos of sows with labelled keypoints (ground truth) and unlabelled keypoints (predicted by the algorithm) were produced in the SLEAP software. With these data, a video containing only the x and y coordinates in pixels, without the animal, was created using a script written in MATLAB R2021b (MathWorks Inc., USA). These videos are provided in the supplementary material (Supplementary Videos 1 and 2). The general and specific hyperparameters of the 19 models for the lateral and dorsal views, such as the input scale, number of epochs, batch size, initial learning rate, Gaussian noise, and rotation, were configured. The split method was used for model evaluation: the video image repository was randomly split into 85% for training and validation and 15% for testing.

Metrics for evaluating computational models

Pose estimation is a difficult task for an algorithm to perform, as it involves variations in lighting, perspective projection, and the occlusion of portions of the images39,45. Model evaluation is complex because many kinds of errors affect the performance of the algorithm45,46. In addition, the average precision (AP) metric alone cannot interpret the behaviour of the algorithm and does not identify all the existing errors45,46. Because of these limitations in pose estimation analysis, new metrics have been developed for the task, such as object keypoint similarity (OKS), mean average precision (mAP), and distance average (dist. avg).

The OKS metric was designed to identify different types of errors in pose estimation algorithms; it can be used for estimating the poses of both humans and animals. This metric calculates the average similarity between the labelled (ground truth) and unlabelled (predicted by the algorithm) keypoints45,47. OKS is defined by Eq. (2):

$$OKS=\frac{\sum_{i}\exp\left(-d_{i}^{2}/2s^{2}k_{i}^{2}\right)\delta\left(v_{i}>0\right)}{\sum_{i}\delta\left(v_{i}>0\right)}$$
(2)
where \(d_{i}\) is the Euclidean distance between the detected keypoint and its corresponding ground truth; \(v_{i}\) is the visibility flag of the ground truth; \(s\) denotes the person scale (in our study, the scale of the sow); and \(k_{i}\) is a per-keypoint constant that controls falloff22,39. When the SLEAP software was developed, OKS was employed as a lower bound on the true accuracy of the model because some reference points on animals can be difficult to locate precisely39. The distribution of OKS scores and the mAP metric were obtained to summarize accuracy across the dataset.

The mAP is based on the OKS: it is the mean of multiple AP values computed at different thresholds47 and is used to measure the average accuracy of pose estimation across multiple individuals48. The mAP metric provides an overall evaluation of all the keypoints, so its evaluation is more accurate, as each keypoint has a different scalar47. Calculating the mAP, which is used to evaluate pose estimation accuracy in SLEAP, involves classifying predicted instances as either true positives (TP) or false positives (FP) based on their OKS39. Together, the mAP and mean average recall (mAR) metrics provide balanced, reliable estimates of accuracy39. The AP at an OKS threshold \(T\) is defined by Eq. (3)47:

$$AP=\frac{\sum_{n}\delta\left(OKS_{n}>T\right)}{\sum_{n}1}$$
(3)
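For concreteness, a minimal NumPy sketch of Eqs. (2) and (3) is shown below; the variable names and example values are assumptions, and SLEAP's built-in evaluation remains the reference implementation.

```python
# Minimal sketch of OKS (Eq. 2) and of AP at a single OKS threshold (Eq. 3);
# illustrative only, SLEAP's built-in evaluation is the reference implementation.
import numpy as np


def oks(pred, gt, visible, scale, k):
    """Object keypoint similarity between one predicted and one ground-truth pose.

    pred, gt: (N, 2) keypoint coordinates; visible: (N,) visibility flags;
    scale: object scale s (here, the sow); k: (N,) per-keypoint falloff constants.
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared Euclidean distances d_i^2
    sim = np.exp(-d2 / (2 * scale**2 * k**2))    # per-keypoint similarity
    v = visible > 0
    return sim[v].sum() / v.sum()


def average_precision(oks_values, threshold=0.5):
    """Fraction of predicted instances counted as true positives at the OKS
    threshold (Eq. 3); mAP averages this quantity over several thresholds."""
    return float(np.mean(np.asarray(oks_values) > threshold))


# Hypothetical example: one sow instance with three keypoints.
gt = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])
pred = gt + np.array([[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]])
score = oks(pred, gt, visible=np.ones(3), scale=100.0, k=np.full(3, 0.1))
print(score, average_precision([score, 0.3, 0.9]))
```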
The distance average (dist. avg) measures the Euclidean distance between the ground truth and the model predictions for each skeleton keypoint, according to the algorithm and architecture45,49. Accordingly, in the labelling and training phases of the neural network, ground truth labels were provided to identify the positions of the body parts in the images40.

A variation on the use of the Euclidean distance as a metric is the percentage of correct keypoints (PCK), which assesses the accuracy of joint localization and indicates whether each keypoint prediction is correct47. The PCK metric has also been used in other studies of human50,51 and animal47 pose estimation. The PCK is calculated using Eq. (4):

$$PCK_{i}=\frac{\sum_{n}\delta\left(\frac{d_{n_{i}}}{s_{n}}<\alpha\right)}{\sum_{n}1}$$
(4)
where \(n\) indexes the \(n\)th target; \(d_{n_{i}}\) is the Euclidean distance between the \(i\)th predicted keypoint and its ground truth for the \(n\)th target; \(\alpha\) is a constant parameter that controls the relative correctness of the evaluation; and \(s_{n}\) is a normalization scalar for the \(n\)th target. \(\alpha\) and \(s\) vary among different works. In addition, the pixel errors were analysed for each of the labelled reference points on the animal's body.
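A minimal NumPy sketch of Eq. (4) is given below; the values of \(\alpha\) and of the normalization scalars \(s_{n}\) are arbitrary illustrative choices, not those used in this study.

```python
# Minimal sketch of PCK for one keypoint index i (Eq. 4); alpha and the
# normalization scalars s_n are arbitrary illustrative choices.
import numpy as np


def pck(pred, gt, scales, alpha=0.1):
    """Percentage of correct keypoints for a single keypoint across n targets.

    pred, gt: (n, 2) coordinates of keypoint i for each target;
    scales: (n,) normalization scalar s_n per target (e.g. body length in pixels).
    """
    d = np.linalg.norm(pred - gt, axis=1)      # Euclidean distance per target
    return float(np.mean(d / scales < alpha))  # fraction within alpha * s_n


# Hypothetical example: keypoint i on four sows, normalized by a 300-pixel body length.
gt = np.array([[100.0, 50.0], [120.0, 55.0], [90.0, 60.0], [110.0, 52.0]])
pred = gt + np.array([[5.0, 0.0], [40.0, 0.0], [2.0, -3.0], [10.0, 10.0]])
print(pck(pred, gt, scales=np.full(4, 300.0)))  # 0.75
```

Higher PCK values indicate that a larger fraction of the predictions fall within the tolerance \(\alpha \cdot s_{n}\) of their ground truth positions.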
