DAMM for the detection and tracking of multiple animals within complex social and environmental settings

Object detection approach

We employed the Mask R-CNN architecture for instance segmentation, which detects individual objects within an image and delineates each object's precise location with a pixel-level mask. Mask R-CNN operates as a two-stage detector: the first stage generates predictions for regions of interest that may contain objects, while the second stage classifies these objects and refines the bounding boxes (rectangular frames outlining the exact position and size of objects) and the masks associated with each object. This process involves several high-level steps: (1) Extract feature maps from the image using a convolutional neural network (CNN). (2) Predict potential object locations (regions of interest) with a Region Proposal Network (RPN) based on the feature maps. (3) Crop the region-of-interest (ROI) features from the extracted feature maps and resize them so they are all aligned. (4) For each ROI, predict the object category, refine the bounding box, and generate a mask. (5) Employ a non-maximum suppression (NMS) algorithm to eliminate overlapping or low-confidence boxes.

Tracking approach

We employed the Simple Online and Realtime Tracking (SORT) algorithm [20], which specializes in single- and multi-object tracking within video streams. SORT extends image-level detection to video tracking using only an image-level detector, enabling seamless integration with DAMM. The algorithm involves several key steps: Initialization, Prediction, Association, and Update. In the Initialization step, objects that are repeatedly detected across frames with high overlap and are not currently being tracked are added to the set of tracked objects. The system can initiate tracking for new objects at any point during the video, provided they appear consistently across frames. In the Prediction step, SORT estimates the next position of each tracked object from its previous trajectory via a Kalman filter [24]. This strategy leverages kinematic information and reduces noise in the object detector's predictions. In the Association step, SORT uses the Hungarian algorithm [25] to pair the Kalman filter's predicted locations of currently tracked objects with those provided by the object detector, optimizing matches using metrics such as bounding box IoU. In the Update step, SORT refines the Kalman filter estimate for each tracked object with its matched bounding box. If there is no match, the Kalman filter updates independently using its own state prediction, effectively handling temporary occlusions. Objects are tracked until they cannot be matched to a predicted bounding box for a certain number of frames (e.g., 25). This process loops, cycling back to the Prediction step and continuing until the video concludes.
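To make the Association step concrete, the sketch below shows how Kalman-predicted boxes can be matched to detector boxes by maximizing total bounding box IoU with the Hungarian algorithm (via SciPy). This is a minimal illustration under assumed box and threshold conventions, not DAMM's released tracking code.

```python
# Minimal sketch of SORT's association step: match Kalman-predicted boxes to
# detector boxes by maximizing total IoU with the Hungarian algorithm.
# Function and variable names are illustrative, not DAMM's actual code.
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def associate(predicted_boxes, detected_boxes, iou_threshold=0.3):
    """Pair tracked (Kalman-predicted) boxes with new detections.

    Returns (matches, unmatched_tracks, unmatched_detections), where matches
    is a list of (track_index, detection_index) pairs.
    """
    if len(predicted_boxes) == 0 or len(detected_boxes) == 0:
        return [], list(range(len(predicted_boxes))), list(range(len(detected_boxes)))

    # IoU of every track prediction against every detection.
    iou_matrix = np.array([[iou(p, d) for d in detected_boxes] for p in predicted_boxes])

    # Hungarian algorithm on the negated IoU maximizes total overlap.
    track_idx, det_idx = linear_sum_assignment(-iou_matrix)

    matches = [(t, d) for t, d in zip(track_idx, det_idx) if iou_matrix[t, d] >= iou_threshold]
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_tracks = [t for t in range(len(predicted_boxes)) if t not in matched_t]
    unmatched_dets = [d for d in range(len(detected_boxes)) if d not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```

Matched pairs feed the Update step of the Kalman filter; unmatched tracks coast on their own predictions and are dropped after the miss limit (e.g., 25 frames), while unmatched detections become candidates for initializing new tracks.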
Implementation details

Code bases

We utilized Detectron2 [26], an open-source deep learning framework developed by Meta AI, for various object detection tasks. This framework offers a user-friendly application programming interface (API) for creating model architectures, managing training, and evaluating model performance. Additionally, for bounding box annotations in Google Colab notebooks, we used a customized version of the TensorFlow Object Detection Annotation Tool [27], adapted to fit our system's data formats.

Hardware

We utilized computers with a variety of Nvidia GPUs for training and inference. The released version of DAMM requires a GPU.

Model selection and training

To pretrain DAMM, we conducted a hyperparameter search, testing weight decays of [1e−1, 1e−2, 1e−3] and learning rates of [1e−1, 1e−2, 1e−3]. We used the model that performed best on the validation set to report test set performance. The final DAMM detector was then trained with the best settings on the combined training, validation, and test datasets for 10,000 iterations, using stochastic gradient descent (SGD) with momentum and a batch size of 8, starting from the weights of an LVIS-pretrained Mask R-CNN.

For few-shot fine-tuning of DAMM on new experimental setups, we set the learning rate to 1e−1 and the weight decay to 1e−2, training for 500 iterations with SGD with momentum. This fine-tuning typically took around 5 min on an RTX 2080 GPU.

For the comparison with the SuperAnimal-TopViewMouse model released by DeepLabCut [22], we used predictions aggregated over scales [200, 300, 400, 500, 600], which was the only hyperparameter selected by the end user. To approximate a bounding-box localization from the predicted keypoints, we computed the tightest box encompassing all points, excluding the tail points.
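For reference, the following is a minimal Detectron2 fine-tuning sketch using the hyperparameters reported above. The dataset name, file paths, LVIS config file, and single "mouse" category are illustrative assumptions, not the released DAMM code.

```python
# Sketch of few-shot fine-tuning a Mask R-CNN detector with Detectron2 on a
# small COCO-style dataset, using settings similar to those reported above.
import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register a COCO-style annotation file (e.g., produced by the SAM annotation tool).
register_coco_instances("my_setup_train", {}, "annotations.json", "images/")

cfg = get_cfg()
# LVIS-pretrained Mask R-CNN baseline; exact config path may differ by Detectron2 version.
cfg.merge_from_file(model_zoo.get_config_file(
    "LVISv0.5-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml"))
cfg.MODEL.WEIGHTS = "damm_pretrained.pth"   # assumed path to pretrained DAMM weights
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1         # assumed single 'mouse' category
cfg.DATASETS.TRAIN = ("my_setup_train",)
cfg.DATASETS.TEST = ()
cfg.SOLVER.IMS_PER_BATCH = 8                # batch size used for pretraining
cfg.SOLVER.BASE_LR = 1e-1                   # fine-tuning learning rate reported above
cfg.SOLVER.MOMENTUM = 0.9                   # SGD with momentum (0.9 assumed)
cfg.SOLVER.WEIGHT_DECAY = 1e-2
cfg.SOLVER.MAX_ITER = 500                   # ~5 min on an RTX 2080
cfg.OUTPUT_DIR = "./output"

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```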
Dataset collection

AER lab generalization (AER-LG) dataset

We collected the AER-LG dataset to pretrain object detectors on diverse data encompassing a wide range of unique setups typical of behavioral studies involving mice. We compiled this dataset from a lab drive containing a rich repository of about 12,500 behavioral experiment videos collected over seven years. For each video, we randomly sampled one frame, and after a curation process we selected 10,000 diverse images for annotation. During the annotation phase, we employed an iterative process, initially annotating a small training set alongside a 200-image validation set and a 500-image test set. With each iteration, we expanded the training set by annotating additional batches of images from the remaining unannotated images. After annotating each batch, we trained our object detectors and assessed their performance on the test set. This cycle of annotating and training continued, with successive additions to the training set, until performance converged. Ultimately, we annotated 1,500 training images, reaching 92% Mask Average Precision (AP) at a 0.75 IoU threshold on our test set. Our final dataset contains 2,200 images (Fig. S1), all annotated using the SAM annotation tool (see below).

Lab experimental setups (Detect-LES) dataset

To evaluate the DAMM detector in experimental setups typical of our lab, we collected the Detect-LES dataset. The original videos in this dataset might have been previously encountered by the DAMM detector during its pretraining phase. To facilitate a thorough evaluation, we constructed a series of five mini-datasets, each corresponding to videos originating from different downstream experimental setups stored on our lab server. The first mini-dataset featured a single mouse in a simple, brightly lit environment. The second, third, and fourth mini-datasets depicted a single black mouse, two black mice, and three colored mice, respectively, in a home cage under white light containing bedding, nesting material, and food. The fifth mini-dataset featured a single black mouse in a large enclosure, which included various enrichment objects such as a running wheel, and was recorded under dim red light. From these videos, we randomly sampled 100 frames. These frames were then annotated using our SAM GUI (see below).

Publicly available experimental setups (Detect-PAES) dataset

To evaluate the DAMM detector on setups not encountered during its pretraining, we collected the Detect-PAES dataset from publicly available video data. The collection process mirrored that of the Detect-LES dataset, the key difference being the source of the videos, which were collected from the internet rather than from our lab. We acquired a total of six videos. Three videos were donated by Sam Golden and acquired from the OpenBehavior Video Repository (edspace.american.edu/openbehavior/video-repository/video-repository-2/): one depicting a single mouse in an open field ('Open field black mouse'), and two showcasing a home cage social interaction setup with a black and a white mouse, one recorded in grayscale ('Home cage mice (grayscale)') and the other in RGB ('Home cage mice (RGB)'). Additionally, we selected a video from the CalMS21 dataset [21], featuring a home cage social interaction setup with a black and a white mouse, recorded in grayscale ('CalMS21 mice (grayscale)'). From the maDLC Tri-mouse dataset [28], we curated a mini-dataset that uniquely provided images rather than videos, allowing us to directly sample 100 random images. Finally, we included a setup donated by Michael McDannald, featuring a rat in an operant chamber recorded with a fisheye lens, also available through the OpenBehavior Video Repository. We randomly sampled 100 frames for each mini-dataset. These frames were subsequently annotated using our SAM annotator (see below).

AER challenge dataset

To assess the performance of DAMM under controlled conditions, with a focus on variation in image resolution, mouse coat color, and enclosure architecture, we created the AER Challenge dataset. This dataset consists of videos that were created after DAMM pretraining, used arenas not seen during pretraining, and were taken from non-standard angles (see below), ensuring their novelty to our system. We organized the dataset around three key variables: camera quality (an entry-level camera costing tens of dollars: Explore One Action camera, 1080 × 720, 8-megapixel sensor; and a high-end camera costing hundreds of dollars: Nikon D3500 DSLR camera, 1920 × 1080, 24.2-megapixel sensor), mouse coat color (white, black, and agouti), and enclosure architecture. The enclosures included a 'Large cage' with bedding (34 cm × 24 cm × 20 cm), an 'Operant chamber' with a metal grid floor and red walls (30 cm × 32 cm × 29 cm), and an 'Enriched cage' with bedding and toys (40 cm × 30 cm × 20 cm). Our objective in recording video data from non-standard angles was to assess the effectiveness of our system in tracking mice across diverse viewpoints, addressing key challenges in computer vision such as occlusions, variations in object size, and within-class visual variability. We filmed 5-min-long videos with 3 mice in each recording for each of the 18 possible combinations (2 × 3 × 3) of these variables. From each video, we randomly sampled 70 frames, which we annotated using our SAM annotator tool (see below).
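Each of these image datasets was built by randomly sampling frames from videos; the sketch below shows one way to do this with OpenCV. The paths, file naming, and use of OpenCV are illustrative assumptions rather than the lab's actual sampling script.

```python
# Illustrative sketch of randomly sampling N frames from a video for annotation.
import random

import cv2


def sample_frames(video_path, n_frames, out_prefix, seed=0):
    """Save n_frames randomly chosen frames from video_path as PNG images."""
    cap = cv2.VideoCapture(video_path)
    # Note: the reported frame count can be approximate for some codecs.
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    random.seed(seed)
    indices = sorted(random.sample(range(total), min(n_frames, total)))
    for i, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the sampled frame
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(f"{out_prefix}_{i:04d}.png", frame)
    cap.release()


# Example: 100 frames from one Detect-LES-style video (paths are hypothetical).
sample_frames("session.mp4", n_frames=100, out_prefix="detect_les_frame")
```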
Single- and multi-animal tracking datasets

To evaluate DAMM's ability to track mice within videos, we compiled two tracking datasets. Unlike our detection datasets, which are composed of annotated images, our tracking datasets consist of annotated videos. In these datasets, each data point is a video with every frame and mouse annotated. Additionally, every mouse carries an associated ID that maintains its identity throughout the video. To generate these datasets, we collected video clips from both our AER lab drive and various publicly available datasets, with a mean duration of 46 s (standard deviation 24.7 s). These videos were converted to a maximum frame rate of 30 FPS. Subsequently, we divided them into two subgroups: single-animal and multi-animal. We annotated each frame of each video using our SAM tracking annotation strategy (see below).

Our single-animal dataset, used for evaluating single-object tracking, encompassed seven diverse experimental setups, all but one of which were distributed through the OpenBehavior Video Repository (edspace.american.edu/openbehavior/video-repository/video-repository-2/). The dataset included the following videos: (1) 'Olfactory search chamber' (donated by Matt Smear); (2) 'Open field, grayscale' (donated by Sam Golden); (3) 'Open field, RGB' (donated by Sam Golden); (4) 'Simple chamber, red light' (from the AER lab); (5) 'Elevated plus maze' (donated by Zachary Pennington and Denise Cai); (6) 'Operant chamber, mouse' (donated by Zachary Pennington and Denise Cai); and (7) 'Operant chamber, rat', acquired with a fisheye lens (donated by Michael McDannald).

Our multi-animal dataset, used for evaluating multi-object tracking, encompassed five diverse experimental setups: (1) 'Operant chamber, mixed' (donated by Sam Golden, acquired via OpenBehavior); (2) 'Home cage, mixed grayscale' [21]; (3) 'Home cage, mixed RGB' (donated by Sam Golden, acquired via OpenBehavior); (4) 'Enriched cage, mixed infrared', acquired with an infrared camera (from the AER lab); and (5) 'Large cage, white triplet' (from the AER lab).

Segment anything model (SAM)-guided annotation strategy

Image annotation

To annotate object masks both efficiently and cost-effectively, we leveraged the Segment Anything Model (SAM), developed by Meta [11], as a guide for mask generation. SAM, a deep neural network designed for interactive instance segmentation, is adept at converting simple user prompts into high-quality object masks. To annotate our detection data, we developed a graphical user interface (GUI). The interface allows users to interact with images by specifying foreground/background points or bounding boxes, which SAM then converts into precise instance masks. Our annotation process utilizes two of SAM's prompting strategies: (1) point prompts, where the user specifies a set of points to indicate an object's foreground or background; and (2) bounding box prompts, where SAM is provided with a bounding box around the object of interest, which we use for annotating tracking data efficiently.

The input to the annotation tool is a folder containing images, and its output is a Common Objects in Context (COCO)-style metadata file [29] with instance segmentation annotations for the images. The pipeline for annotating a single image is as follows: (1) the user specifies a foreground/background point using the right/left mouse click; (2) SAM converts the point prompt into an instance mask; (3) if the predicted mask is accurate, the user presses <space> to proceed to the next animal in the image, or <esc> to move to the next image. If the mask is incorrect, the user can return to step 1 and refine the prompt, prompting SAM to update the mask based on the latest set of points.
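The two prompting strategies can be illustrated with Meta's segment-anything package; the checkpoint file, image path, and coordinates below are assumptions, and the DAMM GUI wraps interactive calls of this kind rather than this exact script.

```python
# Minimal sketch of SAM-guided mask generation from a point prompt and from a
# bounding-box prompt, using Meta's segment-anything package.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (filename assumed) and wrap it in a predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB image.
image = cv2.cvtColor(cv2.imread("frame_0001.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# (1) Point prompt: one foreground click on the animal (label 1 = foreground).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=False,
)

# (2) Box prompt: a bounding box around the animal, as used for tracking annotation.
masks_box, _, _ = predictor.predict(
    box=np.array([250, 180, 400, 300]),
    multimask_output=False,
)
```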
Tracking data annotation

Annotating tracking data poses significant time and cost challenges because of the large number of frames requiring annotation in each video (e.g., a 1-min video at 25 FPS yields 1,500 frames). To expedite this process, we annotate frames sequentially, initializing the annotations for the current frame by providing SAM with the previous frame's mouse bounding boxes together with the current frame's image. This bootstraps the annotation by taking advantage of the minimal movement of mice between frames, so only minimal further adjustments to the bounding boxes are required.

Evaluation procedures

Zero-shot evaluation

This strategy assesses the effectiveness of a model on a new, downstream task without any fine-tuning specific to that task. In this study, we begin all zero-shot analyses with a pretrained DAMM detector and directly evaluate its performance on the evaluation set.

Few-shot evaluation

This strategy assesses a model's effectiveness on a downstream task when it has been exposed to a limited number of examples from that task. In this study, we conducted few-shot analyses with N ranging from 5 to 50 across various experiments. In these cases, we used the N examples to fine-tune the DAMM detector before its evaluation on the downstream task.

Evaluation metrics for detection

Intersection over Union (IoU)

IoU measures the overlap between the predicted bounding boxes/masks and the ground truth bounding boxes/masks. It is calculated as the area of intersection divided by the area of union, yielding a value between 0 and 1, where 1 indicates perfect overlap.

Mask Average Precision (AP) 75

Mask AP 75 evaluates the accuracy of instance segmentation, specifically measuring how well the model identifies object instances within an image by comparing the predicted masks to the ground truth masks. A mask is considered correctly identified if its IoU with the ground truth mask is greater than 0.75. We use the COCO-style mAP evaluation metrics implemented in Detectron2.

Evaluation metrics for tracking

Single-object tracking accuracy

Single-object tracking accuracy (TA) assesses how accurately a model tracks a single object in video sequences. It is calculated as TA = (number of correctly tracked frames) / (total number of frames). For this paper, a frame is considered correctly tracked if the IoU exceeds 0.5.

Multi-object tracking accuracy

Multi-object tracking accuracy (MOTA) assesses how accurately a model tracks multiple objects in video sequences. The primary distinction from single-object tracking accuracy is the inclusion of ID switches in the assessment. The calculation is MOTA = 1 − (false negatives + false positives + ID switches) / (number of ground truth objects). As above, a detection is considered correct if its IoU with the ground truth exceeds 0.5.
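For clarity, the tracking metrics defined above can be computed as in the following sketch. It is an illustration of the definitions, not the exact evaluation code; boxes are (x1, y1, x2, y2), and predictions are assumed to be aligned frame by frame with the ground truth.

```python
# Illustrative computation of IoU, single-object tracking accuracy (TA), and MOTA.

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def single_object_ta(pred_boxes, gt_boxes, thr=0.5):
    """TA = correctly tracked frames / total frames (correct if IoU > thr)."""
    correct = sum(iou(p, g) > thr for p, g in zip(pred_boxes, gt_boxes))
    return correct / len(gt_boxes)


def mota(false_negatives, false_positives, id_switches, num_gt_objects):
    """MOTA = 1 - (FN + FP + ID switches) / total ground-truth objects."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt_objects
```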
Behavioral experiment details

Chronic social defeat stress

Mice

We utilized male black C57BL/6J mice aged 8–12 weeks (bred in-house) and 6- to 8-month-old ex-breeder male white CD-1 mice (Strain #: 022, sourced from Charles River Laboratories). The mice were housed in a controlled environment, maintained at a temperature of 22 ± 1 °C with a 12-h light/dark cycle and ad libitum access to food and water. Prior to the experiment, the mice were also provided with nesting material. All experiments were conducted in accordance with the US National Institutes of Health Guide for the Care and Use of Laboratory Animals and approved by the University of Michigan's Institutional Animal Care and Use Committee. This study is reported in accordance with the ARRIVE guidelines (https://arriveguidelines.org). Mice were euthanized with carbon dioxide followed by cervical dislocation.

Chronic social defeat stress procedure

We implemented a chronic social defeat stress (CSDS) model, as described in Golden et al., 2011, in adult male C57BL/6J mice to induce stress-related phenotypes. The CSDS procedure spanned 10 consecutive days. Each day, a test mouse was introduced into the home cage (50 cm × 25 cm × 40 cm) of a novel aggressive CD-1 mouse for a period of 5–10 min, ensuring direct but controlled aggressive interactions. After the confrontation, the test mouse was separated from the aggressor by a transparent, perforated divider within the same cage, allowing visual, olfactory, and auditory contact for the remaining 24 h (Golden et al., 2011). The aggressor mice were selected based on their established history of aggressive behavior, screened prior to the experiment. Control mice (adult male C57BL/6J of a similar age) were left undisturbed in their home cages for 10 days. CSDS-exposed and control mice were transferred to new cages on day 11.

Social interaction (SI) test

Approximately 23 h after the transfer of the CSDS-exposed and control mice to new cages, they were subjected to a SI test to evaluate the behavioral impacts of chronic social defeat (Golden et al., 2011). This test aimed to assess changes in social behavior potentially induced by the CSDS experience. The test was conducted in an arena measuring 44.5 cm × 44.5 cm, divided into two consecutive 150-s phases. In the first phase, the test mouse was introduced into the arena containing an empty wire mesh cage (10 cm × 6.5 cm), allowing for baseline sociability observations. Subsequently, the test mouse was gently removed, and an unfamiliar CD-1 mouse was placed inside the wire mesh cage. In the second phase, the test mouse was reintroduced to the arena, now with the CD-1 mouse present in the cage, to assess changes in social behavior. All trials were video recorded using high-definition webcams (either Logitech C920 or Angetube 1080p) positioned above the arena. To calculate the SI ratio, the time a mouse spent in the interaction zone with a target CD-1 present was divided by the time it spent in the interaction zone with the target CD-1 absent.
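As an illustration of how the SI ratio can be derived from tracking output, the sketch below counts per-frame centroid occupancy of a rectangular interaction zone for the two 150-s phases. The zone coordinates, frame rate, and function names are assumptions, not the analysis code used here.

```python
# Illustrative sketch of computing the SI ratio from per-frame tracked centroids.

def time_in_zone(centroids, zone, fps):
    """Seconds spent inside a rectangular zone, given per-frame (x, y) centroids."""
    x1, y1, x2, y2 = zone
    frames_inside = sum(x1 <= x <= x2 and y1 <= y <= y2 for x, y in centroids)
    return frames_inside / fps


def si_ratio(centroids_target_present, centroids_target_absent, zone, fps=30):
    """SI ratio = time in interaction zone with CD-1 present / time with CD-1 absent."""
    t_present = time_in_zone(centroids_target_present, zone, fps)
    t_absent = time_in_zone(centroids_target_absent, zone, fps)
    return t_present / t_absent if t_absent > 0 else float("inf")
```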
Chemogenetic activation of LH-TRAPed neurons

Mice

We utilized reproductively inexperienced F1 Fos2A-iCreERT2 (TRAP2; The Jackson Laboratory, Stock #: 030323) mice > 8 weeks old (bred in-house by crossing with black C57BL/6J mice). The mice were housed in a controlled environment, maintained at a temperature of 22 ± 1 °C with a 12-h light/dark cycle and ad libitum access to food and water. The mice were provided with compressed cotton 'Nestlet' nesting material (Ancare, Bellmore, NY, U.S.A.) and shredded paper 'Enviro-Dri' nesting material (Shepherd Specialty Papers, Watertown, TN, U.S.A.). During the experiment, mice were individually housed in custom Plexiglas recording chambers (28.6 × 39.4 cm and 19.3 cm high). All experiments were conducted in accordance with the US National Institutes of Health Guide for the Care and Use of Laboratory Animals and approved by the University of Michigan's Institutional Animal Care and Use Committee.

Mice were anesthetized with a ketamine-xylazine mixture (100 and 10 mg kg−1, respectively; intraperitoneal injection, IP) and administered lidocaine and carprofen (4 mg kg−1 and 5 mg kg−1, respectively). Mice were placed into a stereotaxic frame (David Kopf Instruments, Tujunga, CA, USA) and maintained under isoflurane anesthesia (∼1% in O2). We stereotaxically infused viral vectors (AAV-EF1α-DIO-hM4Gq-mCherry) into the lateral hypothalamus (AP = −1 mm, ML = ±1.15 mm, DV = −4.9 mm) at a slow rate (100 nl min−1) using a microinjection syringe pump (UMP3T-1, World Precision Instruments, Ltd.) and a 33G needle (Nanofil syringe, World Precision Instruments, Ltd.). After infusion, the needle was kept at the injection site for ≥ 8 min and then slowly withdrawn. The skin was then closed with surgical sutures. Mice were placed on a heating pad until fully mobile. Following recovery from surgery (∼10 days), mice were separated into individual recording chambers.

Mice were acclimated to handling and intraperitoneal (IP) injections for approximately one week prior to the experiment. On the day of the experiment, starting at Zeitgeber Time (ZT) 0, the nests in the home cages of the test mice were dispersed. This was followed by 4-hydroxytamoxifen (4-OHT) administration at ZT 1. Subsequently, the original nests were removed, and the mice were provided with fresh nesting material. This intervention extended their pre-sleep phase. Throughout the 2-h period following the 4-OHT administration, an experimenter monitored the mice continuously to prevent them from sleeping, supplying additional nesting material as needed to keep them engaged and awake. After this period, from ZT 2 to ZT 24, the mice were left undisturbed.

Chemogenetic manipulation

Mice were removed from their home cages at the beginning of the dark phase (ZT 12), administered either saline or CNO (1 mg kg−1) IP, and returned to the home cage with as little disturbance to the nest as possible. Mouse behavior was video recorded using high-definition webcams (either Logitech C920 or Angetube 1080p).

Resident-intruder test

Mice

We utilized male white CD-1 mice (bred in-house). The mice were housed in a controlled environment, maintained at a temperature of 22 ± 1 °C with a 12-h light/dark cycle and ad libitum access to food and water. The fur of the mice was dyed with either Blue Moon (blue) or Electric Lizard (green) dye from Tish & Snooky's Manic Panic (manicpanic.com). All experiments were conducted in accordance with the US National Institutes of Health Guide for the Care and Use of Laboratory Animals and approved by the University of Michigan's Institutional Animal Care and Use Committee.

Experimental procedure

Mice were individually housed for approximately one week prior to the experiment. At ZT 0, a male non-sibling "intruder" was placed into the home cage of a "resident" mouse. Mouse behavior was video recorded using high-definition webcams (either Logitech C920 or Angetube 1080p).
