Unsupervised decomposition of natural monkey behavior into a sequence of motion motifs

Animals

Four laboratory-bred adult common marmosets were used (2 males, 2 females; 1.4–6.4 years old; 290–400 g; Supplementary Table 1). Each cage was exposed to a 12/12-h light-dark cycle. Room temperature and humidity were maintained at 27–30 °C and 40–50%, respectively. All experimental procedures were performed in accordance with the Guide for the Care and Use of Nonhuman Primates in Neuroscience Research (Japan Neuroscience Society; https://www.jnss.org/en/animal_primates) and were approved by the Animal Ethics Committee of the National Institutes for Quantum Science and Technology (#11-1038).

Behavior testing

Behavior experiments were conducted in a sound-attenuated room (O’hara & Co., Ltd., Tokyo, Japan; 2.4 m (h) × 1.2 m (w) × 1.6 m (d)) separate from the colony room; vocalizations from the colony room could not be detected in the experimental room. The temperature was maintained at 27–30 °C and the relative humidity at 30–40%. The interior of the sound-attenuated room was ventilated and illuminated with fluorescent lighting. The experiments were performed between 11:00 and 16:00.

Before the experimental sessions, each subject was transferred individually from the colony room to the experimental room in a small transport chamber (O’hara & Co., Ltd., Tokyo, Japan; 300 mm (h) × 100 mm (w) × 100 mm (d)). Once in the experimental room, the transport chamber was placed under a table with a green top (O’hara & Co., Ltd., Tokyo, Japan; 0.5 m × 0.5 m × 0.5 m), upon which rested a cylindrical test chamber made from transparent acrylic (0.4 m (r) × 0.5 m (h)). The bottom of the test chamber had a door that, when opened, allowed the marmoset to enter from the transport chamber (Supplementary Fig. 1a). The test chamber had 16 feeding ports (30 mm × 15 mm): 8 located at 45-degree intervals on the floor and 8 on the wall at heights of 150 mm and 200 mm (4 each, alternately arranged). Platforms (30 mm × 15 mm) were set up on the outside of the wall ports, where food rewards could be placed (Supplementary Fig. 1b). For 3D data acquisition, 4 depth cameras (RealSense Depth Camera R200, Intel, Santa Clara, USA) were placed around the chamber at 90-degree intervals (Supplementary Fig. 1b) and connected in parallel to a PC (Windows 10, 64-bit) using USB-C cables (U3S1A01C12-050, Newnex Technology, Santa Clara, USA; 1–5 m long). The subjects were allowed to adapt to the transport procedure and experimental environment for two consecutive days before behavioral testing.

Each behavioral test started when a marmoset entered the test chamber from the transport chamber through the floor entrance, and movements were recorded until the subject finished eating all the food placed in the feeding ports or until 30 min had passed. Subjects moved freely around the test chamber and ate, whenever they wanted, 7–9 pieces of sponge cake (approximately 2 g each): one piece was located on the near side of the entrance to entice them to enter the chamber, two on the wall ports, two on the near side of the floor ports, and the others 3 cm from the floor ports (Fig. 2a, Supplementary Fig. 1b). Observed feeding behaviors were manually classified into three subtypes: using the hands to take food from the floor (floor-hand), using the head and mouth directly (floor-head), and taking food from the wall (wall) (Fig. 2b, Supplementary Table 1). Fifty-one sets of 20-s data (10 s before and 10 s after eating) were used for analysis (Supplementary Table 1). Subjects were returned to the colony room after the end of the recording session. The experiments were performed once a day for each subject.
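As an illustration only, the peri-feeding windowing could be sketched in R as follows; the objects track (frame-wise tracking data with a time column, in seconds) and eat_onsets (manually annotated eating-onset times) are hypothetical names, not part of our released pipeline.

# Extract a 20-s window (10 s before and 10 s after each eating event)
extract_window <- function(track, onset, pre = 10, post = 10) {
  subset(track, time >= onset - pre & time <= onset + post)
}
windows <- lapply(eat_onsets, function(t0) extract_window(track, t0))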
Marmoset marker-less 3D motion-tracking system

Our motion-tracking system software package for depth-camera calibration, 3D data acquisition, and fundamental setup for physical simulation is available online (3DTracker-FAB, https://www.3dtrack.org). This motion-tracking system allowed us to robustly estimate the 3D trajectory of marmoset body parts as the positions of skeleton-model parts (Head, Neck, Trunk, Hip) fitted by physical simulation to the 3D point data of the marmoset body shape (3D point cloud) [9,34]. The Face position was estimated by projecting the rectangle of the marmoset face area onto the 3D point cloud. This face area was detected frame-by-frame on 2D RGB images by the object-recognition algorithm YOLO3 [35]. To achieve sufficient accuracy, the YOLO3 detector was trained to detect marmoset face regions using 2,000 manually annotated faces from two of the four marmosets (Marmo1 and Marmo2). 3D points inside the projected face rectangle were filtered to be less than 2.5 cm from the center of the Head, and the average position of these 3D points was defined as the Face position (Fig. 2c). The estimated error of the Face position is reported in Supplementary Fig. 8.

Behavioral data under chemogenetic neuronal manipulation

We used behavioral data obtained from an adult male marmoset that received a viral vector injection in the unilateral SN to express an excitatory Designer Receptor Exclusively Activated by a Designer Drug (DREADD) (AAV2.1-hSyn1-hM3Dq-IRES-AcGFP) [9]. With this DREADD setup, SN neurons are excited when the receptor is activated by administration of the agonist DCZ (HY-42110, MedChemExpress, NJ, USA; 3 μg/kg, orally). Two sets of free-moving behavior data (~60 min) following administration of either DCZ or vehicle alone (saline with 2.5% dimethyl sulfoxide; FUJIFILM Wako Pure Chemical, Osaka, Japan) were used.

Data preprocessing

All data preprocessing was performed using R version 4.0.3 (www.r-project.org) and the R packages tidyverse (version 1.3.0) [45], data.table (version 1.13.2), patchwork (version 1.1.1), and magick (version 2.7.0).

Marmoset behavioral analysis

For the free-feeding behavior analysis, the trajectory of body parts (Face, Head, Trunk, and Hip) was filtered with a locally estimated scatterplot smoothing (LOESS) filter using the stats::loess() function with span = 1/30 and downsampled to 10 Hz. Then, the spatial movement speed of each body part was calculated, and the data coordinates were transformed frame-by-frame to make them posture-centered. Specifically, distance-from-center coordinates were transformed into posture coordinates centered on the Trunk, with the vector from Trunk to Head contained in the front-up quadrant. As a result, we used a set of 13 posture parameters (Face [x, y, z, v], Head [x ≡ 0, y, z, v], Trunk [x ≡ 0, y ≡ 0, z, v], and Hip [x, y, z, v]; Fig. 2d). For the SMP analysis, PC1 and PC2, which retained 69% of the data’s variance (Supplementary Table 2), were calculated using the stats::prcomp() function in R with scale = TRUE (Fig. 2e) and then scaled to a maximum absolute value of 1.
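The smoothing, downsampling, and PC steps just described could be sketched in R as follows. The data frame traj (a time column in seconds plus the 13 posture parameters) is a hypothetical name, and scaling each PC by its own maximum absolute value is one reading of "scaled to a maximum absolute value of 1"; this is a minimal sketch, not our full pipeline.

# LOESS-smooth each posture parameter and resample it on a 10-Hz grid
smooth_par <- function(x, time, out_time) {
  fit <- stats::loess(x ~ time, span = 1/30)
  stats::predict(fit, newdata = data.frame(time = out_time))
}
out_time <- seq(min(traj$time), max(traj$time), by = 0.1)  # 10 Hz
smoothed <- as.data.frame(
  lapply(traj[setdiff(names(traj), "time")], smooth_par,
         time = traj$time, out_time = out_time)
)

# PCA with unit-variance scaling, then scale PC1 and PC2 to max |value| = 1
pca <- stats::prcomp(smoothed, scale = TRUE)
pcs <- apply(pca$x[, 1:2], 2, function(p) p / max(abs(p)))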
The same procedures were used for the data obtained after the chemogenetic neuronal manipulation, except that instead of the Face position we included the relative velocity of the body parts in the SMP analysis, calculated from the postures at t + 3 s (Fig. 5d) and normalized by the coordinates of the body at time t and the distances between these parts. Thus, the analysis of the neuronal-manipulation data used a set of 26 parameters (Head [x, y, z ≡ 0], Trunk [x ≡ 0, y ≡ 0, z ≡ 0], Hip [x, y, z], their positions at t + 3 s [x, y, z], and their velocities [x, y, z, and absolute value]). We used PC1 and PC2, which retained 48% of the data’s variance (Supplementary Table 6). To validate that these two PCs were the optimal choice, we also compared the results of the SMP analysis when using different numbers of PCs (Supplementary Fig. 7b, c, e, f).

Macaque behavioral analysis

For this analysis, we used published 3D macaque tracking data captured by OpenMonkeyStudio [14]. From the published data, we extracted 3,534 s of data, divided into 29 subsets with few missing frames (frame gap ≤ 20 frames) and sufficient duration (≥ 100 s). The extracted data were then transformed frame-by-frame from distance-from-center coordinates to posture coordinates centered on the Hip, with the vector from Hip to Neck contained in the front-up quadrant. These posture-coordinate parameters were interpolated at 2/3 Hz using the stats::loess() function with the span set to one-fifth of the data-fragment length. As a result, we used 5,452 video frames with 36 posture parameters ([x, y, z] coordinates of 13 body keypoints, with Hip_x, Hip_z, and Neck_z scaled to zero; Fig. 4a). For the SMP analysis, PC1 and PC2 extracted from the 36-parameter data, which retained 46% of the data’s variance (Supplementary Table 5), were calculated using the stats::prcomp() function with scale = TRUE.

Computational segmentation

The joint probability distribution of motion-motif length and class can be estimated by a blocked Gibbs sampler in which all motion motifs and their classes in the observed data (PC scores) are sampled. First, all data are randomly divided into motion motifs and classified. Next, the motion motifs obtained from a part of the data are excluded from the dataset, and the model parameters are updated. By iterating this procedure, the parameters can be optimized. Parameter estimation in each iteration is performed by forward filtering-backward sampling, which can be considered a maximum-likelihood estimation process.

In the forward-filtering step, the probability α that a data point at time step t is the ending point of a motion motif of length k classified as motif class c is calculated as follows:

$$\alpha\left[t\right]\left[k\right]\left[c\right] = GP\left(s_{t-k:k} \mid X_c\right) \times \sum_{k'=1}^{K} \sum_{c'=1}^{C} p_{\mathrm{trans}}\left(c \mid c'\right)\, \alpha\left[t-k\right]\left[k'\right]\left[c'\right]$$
(1)
where \(s_i\) denotes the time-series PC score at time step i, C denotes the maximum number of motion-motif classes, K denotes the maximum length of motion motifs, and \(p_{\mathrm{trans}}(c \mid c')\) denotes the posterior transition probability from motif class c′ to c. \(GP(s_{t-k:k} \mid X_c)\) is the predictive distribution of a Gaussian process of class c fitted to the PC scores s from time step t−k to k, and is computed as follows:

$$GP\left(s_{t-k:k} \mid X_c\right) = \prod_{i=t-k}^{k} p\left(s_i \mid X_{c,i}\right),$$
(2)
where \(X_c\) denotes the set of segments classified into class c. The variation within a segment is regressed on the data fragments belonging to the same class c as follows:

$$C\left(i_p, i_q\right) = k\left(i_p, i_q\right) + \omega^{-1}\sigma_{pq}.$$
(3)
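To make the dynamic program in Eqs. (1)–(3) concrete, the following is a minimal R sketch of the forward-filtering pass. It assumes a fixed number of classes C, a user-supplied Gaussian-process likelihood gp_lik() and transition matrix p_trans (both hypothetical placeholders), and it omits normalization; it illustrates the recursion, not our implementation.

# alpha[t, k, c]: probability that a motif of class c with length k ends at time step t
forward_filter <- function(s, K, C, gp_lik, p_trans) {
  T_len <- length(s)
  alpha <- array(0, dim = c(T_len, K, C))
  for (t in seq_len(T_len)) {
    for (k in seq_len(min(K, t))) {
      seg <- s[(t - k + 1):t]                  # candidate segment of length k ending at t
      for (c in seq_len(C)) {
        lik <- gp_lik(seg, c)                  # GP predictive likelihood, cf. Eq. (2)
        prev <- if (t - k == 0) 1 else         # sum over previous length k' and class c'
          sum(sapply(seq_len(C), function(cp)
            p_trans[cp, c] * sum(alpha[t - k, , cp])))
        alpha[t, k, c] <- lik * prev           # recursion of Eq. (1)
      }
    }
  }
  alpha
}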
The class \(c_i\) of the i-th motion motif is determined by the (i−1)-th segment and the transition probability \(\pi_c\), which is generated from a Dirichlet process (DP) with base distribution β and concentration parameter η as follows:

$$\pi_c \sim DP\left(\eta, \beta\right).$$
(4)
This β is an infinite-dimensional multinomial distribution whose parameters are constructed by repeatedly breaking a stick of unit length with ratios \(v_k\) sampled from a beta distribution, as follows:

$$v_k \sim \mathrm{Beta}\left(1, \gamma\right) \quad \left(k = 1, 2, \ldots, \infty\right),$$
(5)
$$\beta_k = v_k \prod_{i=1}^{k-1}\left(1 - v_i\right) \quad \left(k = 1, 2, \ldots, \infty\right).$$
(6)
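As an illustration of Eqs. (5) and (6), a truncated stick-breaking draw can be written in a few lines of R; the truncation level K_max is an assumption for illustration, since the construction is formally infinite-dimensional.

# Break a unit-length stick: v_k ~ Beta(1, gamma), beta_k = v_k * prod_{i<k} (1 - v_i)
stick_breaking <- function(K_max, gamma) {
  v <- rbeta(K_max, 1, gamma)
  v * cumprod(c(1, 1 - v[-K_max]))
}
beta_w <- stick_breaking(100, gamma = 1.0)  # gamma = 1.0, as in our analyses
sum(beta_w)                                 # approaches 1 as K_max grows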
This stick-breaking process is itself a DP, so the two-stage DP used to generate c is called a hierarchical Dirichlet process (HDP). In both the macaque and marmoset behavior analyses, the parameters were set to ω = 10.0, η = 10.0, and γ = 1.0, and the motion-motif length for the initial value of the simulation was set to around 30 frames (min = 10, max = 70). This model was reported as HDP-GP-HSMM for human motion analysis [29]. To evaluate the impact of motif-length variation on the results, the motif length was instead set to an average of 90 frames (range, 75–150) for the marmoset neural-manipulation data (Supplementary Fig. 7a, d). For the exploratory behavior analysis in NHPs, we optimized the kernel function \(k(\cdot,\cdot)\) by comparing the following candidates:

$$\mathrm{SMP\ and\ Model\ 1:}\quad k(i_p, i_q) = \exp\left(-\|i_p - i_q\|^2\right);$$

$$\mathrm{Model\ 2:}\quad k(i_p, i_q) = 1 + i_p\, i_q;$$

$$\mathrm{Model\ 3:}\quad k(i_p, i_q) = 1.$$

In addition, after learning the parameters for the data by iterating the model 100 times, the Viterbi algorithm was used to estimate the most likely path of the latent states in the data. That is, all possible α values in the data were obtained, and the most probable of all possible state sequences, combined with the learned transition probabilities, was adopted as the posterior state sequence.

Posture model

For comparison, a conventional posture model was estimated as described below. PC1 and PC2 scores, scaled to a maximum absolute value of 1 to match the input of the SMP, were mapped to two dimensions using the umap.UMAP() function in Python 3.7 [46]. Posture clusters were detected on the UMAP scores using the stats::kmeans() function in R. The representative posture for each cluster was defined as the average of the data contained in that cluster.

Manual motion segmentation

Five trained observers manually segmented four sets of approximately 20-s RGB video clips captured simultaneously from four separate angles. The clips showed marmoset free-feeding behavior and were part of the data used for the SMP analysis. Observers were instructed to segment the clips into motifs lasting 2–10 s.

Synthesis of typical body motion sequences

The inverse operation of the PC analysis was used to synthesize a typical posture series from the estimated PC coordinates. For each motif class, the typical PC waveforms (Figs. 2h and 4c) were calculated at 10 Hz by taking the moving average over all PCs. These typical PC waveforms were multiplied by the pseudo-inverse of the eigenvector matrix from the PC analysis of the test data, calculated using the MASS::ginv() function in R. Finally, the resulting values were inversely standardized using the mean and variance of the data to synthesize the posture parameters. For the marmoset free-feeding behavior and macaque ethogram detection, the synthesized postural information is displayed side-by-side along the time axis (Fig. 3c–e, Fig. 4d). For the neural-manipulation analysis, each motif’s internal variability was visualized by integrating the velocity relative to the initial position (Fig. 5g).
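A minimal R sketch of this inverse-PCA synthesis, assuming a stats::prcomp() fit pca obtained with scale = TRUE and a matrix typ_pcs of typical PC waveforms (rows = time points, columns = the PCs used); the object names are illustrative only.

library(MASS)

# Map typical PC waveforms back into posture-parameter space
synthesize_posture <- function(typ_pcs, pca) {
  rot <- pca$rotation[, seq_len(ncol(typ_pcs))]  # eigenvectors of the PCs used
  params <- typ_pcs %*% ginv(rot)                # pseudo-inverse back-projection
  # undo the standardization applied by prcomp(scale = TRUE)
  sweep(sweep(params, 2, pca$scale, `*`), 2, pca$center, `+`)
}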
Statistical analysis

Pearson’s chi-square test, implemented with the rstatix::chisq_test() function in R, was used to compare intervals containing constant-specific timing among feeding subtypes and to test for differences in motion-motif distributions among individuals. The Brunner–Munzel test, implemented with the lawstat::brunner.munzel.test() function in R, was used to compare the median values of motif length.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
