Automated construction of cognitive maps with visual predictive coding

Environment simulation

Forest–cave–river environment

These experiments leverage the Malmo framework44 to construct a controlled environment within Minecraft. The environment is a rectangular space measuring 40 × 65 lattice units and incorporates three key visual features: a prominent cave serving as a global landmark, a forest area introducing visual ambiguity between scenes and a river with a bridge that restricts the agent's movement options. Within this environment, an agent traverses paths between randomly chosen waypoints. These paths are determined using the A* search algorithm so that obstacles do not block the agent's path. The agent varies its speed and direction while traversing the generated paths and captures visual observations at regular intervals along each path.

Circular environment

To explore the model's ability to differentiate between visually identical but spatially distinct scenes, these experiments used a circular corridor environment. This environment consists of an infinitely repeating sequence of rooms, coloured red, green, red, blue and yellow in the clockwise direction; notably, there are two distinct red rooms despite their identical appearance. Technically, the environment is an infinitely long hallway segmented into these coloured rooms. As in the previous experiment, an agent navigates between randomly chosen waypoints, with paths determined by the A* search algorithm, and captures visual observations at regular intervals along its journey.

Predictive coder

Architecture

The proposed neural network follows an encoder–decoder architecture, employing a U-Net structure to process input image sequences and predict future images. The encoder and decoder components are both based on ResNet-18 convolutional neural networks.

The encoding module utilizes a ResNet-18 model to extract hierarchical features from the input image sequence. Each image in the sequence is processed independently through the ResNet-18 encoder, generating a sequence of latent vectors. The encoder consists of residual blocks, each containing convolutional layers, batch normalization and rectified linear unit (ReLU) activations. Downsampling is achieved via strided convolutions within the residual blocks.

The self-attention module applies multi-headed attention to the sequence of encoded latent units to encode the history of past visual observations. The network consists of one layer of multi-headed attention with h = 8 heads. For encoded latent units of dimension D = C × H × W, the dimension of a single head is d = (C × H × W)/h.

The latent vectors output by the encoder are concatenated to form an ordered sequence, which is then processed by the self-attention layer to capture temporal dependencies and relationships within the image sequence. The self-attention mechanism enables the model to weigh the importance of each latent vector in the context of the entire sequence, improving the temporal feature representation.

The decoding module mirrors the encoder's architecture, utilizing a ResNet-18 model adapted for upsampling. The decoder reconstructs the future images from the transformed latent vectors, employing transposed convolutions and residual blocks analogous to those in the encoder.
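A minimal PyTorch sketch of this encoder–attention–decoder pipeline is given below. It is illustrative rather than the published implementation: the 128 × 128 input resolution, the flattened 512 × 4 × 4 token dimension and the plain transposed-convolution decoder (standing in for the ResNet-18-style residual upsampling blocks) are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class PredictiveCoder(nn.Module):
    """Encoder -> temporal self-attention -> decoder (illustrative sketch)."""

    def __init__(self, feat=512, grid=4, heads=8):
        super().__init__()
        # ResNet-18 trunk up to the last conv stage (avgpool/fc dropped);
        # each frame is encoded independently to a (feat, grid, grid) map.
        self.encoder = nn.Sequential(*list(resnet18(weights=None).children())[:-2])
        d_model = feat * grid * grid  # D = C x H x W, flattened per frame
        self.attn = nn.MultiheadAttention(d_model, num_heads=heads,
                                          batch_first=True)
        # Transposed-conv decoder standing in for the residual upsampling
        # path; maps (feat, 4, 4) back to a 3 x 128 x 128 frame.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat, 256, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),
        )
        self.feat, self.grid = feat, grid

    def forward(self, frames):  # frames: (B, T, 3, 128, 128) past observations
        B, T = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1))      # (B*T, feat, 4, 4)
        tokens = z.flatten(1).unflatten(0, (B, T))  # (B, T, D)
        ctx, _ = self.attn(tokens, tokens, tokens)  # attention over time steps
        last = ctx[:, -1].view(B, self.feat, self.grid, self.grid)
        return self.decoder(last)                   # predicted future frame
```

Flattening each frame's full feature map into a single token matches the stated head dimension d = (C × H × W)/h, with attention applied across the temporal sequence.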
Training

The predictive coder is trained for 200 epochs using stochastic gradient descent as the optimization algorithm. The training parameters include a learning rate of 0.1, Nesterov momentum of 0.9 and a weight decay of 5 × 10⁻⁶. To optimize the learning process, the learning rate is scheduled using the OneCycle learning-rate policy. This policy adjusts the learning rate cyclically between a lower and upper bound, facilitating efficient convergence and improved performance. The OneCycle learning-rate schedule is characterized by an initial increase in the learning rate, followed by a subsequent decrease.

Latent units

The predictive coder's encoding and self-attention modules were used to analyse the encoded sequences as the predictive coder's latent units. The image sequence first undergoes processing through the encoder, which extracts a compressed representation capturing the key features within each image. This encoded sequence is then fed into the self-attention module, which focuses on the inherent temporal order of the images within the sequence. The self-attention module's processed output forms the predictive coder's latent units.

Auto-encoder

Architecture

Unlike the predictive coder, the auto-encoder transforms the current images (rather than the past images) into a low-dimensional latent vector. The proposed neural network follows an encoder–decoder architecture employing a U-Net structure to process input image sequences into a low-dimensional latent vector and to reconstruct the input image. The encoder and decoder components are both based on ResNet-18 convolutional neural networks. However, the auto-encoder architecture does not utilize any self-attention layers to integrate past observations.

The encoding module utilizes a ResNet-18 model to extract hierarchical features from the input image sequence. Each image in the sequence is processed independently through the ResNet-18 encoder, generating a sequence of latent vectors. The encoder consists of residual blocks, each containing convolutional layers, batch normalization and ReLU activations. Downsampling is achieved via strided convolutions within the residual blocks.

Unlike in the predictive coder, the latent vectors output by the encoder are processed directly by the decoder. Whereas the predictive coder predicts the future images within an image sequence, the auto-encoder reconstructs the current images from the low-dimensional latent vector generated by the encoder.

The decoding module mirrors the encoder's architecture, utilizing a ResNet-18 model adapted for upsampling. The decoder reconstructs the current images from the latent vectors, employing transposed convolutions and residual blocks analogous to those in the encoder.

Training

The auto-encoder is trained for 200 epochs using stochastic gradient descent as the optimization algorithm. The training parameters include a learning rate of 0.1, Nesterov momentum of 0.9 and a weight decay of 5 × 10⁻⁶. As for the predictive coder, the learning rate is scheduled using the OneCycle learning-rate policy, which adjusts the learning rate cyclically between a lower and upper bound: an initial increase in the learning rate is followed by a subsequent decrease.
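Both models share this optimization recipe, which can be wired up in PyTorch as follows. The toy data, the DataLoader and the mean-squared-error prediction loss are assumptions; the section specifies only the optimizer, its hyperparameters and the OneCycle schedule.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in data: 8-frame past sequences and the next frame as target.
past = torch.randn(32, 8, 3, 128, 128)
future = torch.randn(32, 3, 128, 128)
loader = DataLoader(TensorDataset(past, future), batch_size=4)

model = PredictiveCoder()  # sketch above; the auto-encoder trains identically
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      nesterov=True, weight_decay=5e-6)
EPOCHS = 200
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=0.1, epochs=EPOCHS, steps_per_epoch=len(loader))
loss_fn = torch.nn.MSELoss()  # assumed reconstruction/prediction loss

for epoch in range(EPOCHS):
    for past_frames, target_frame in loader:
        opt.zero_grad()
        loss = loss_fn(model(past_frames), target_frame)
        loss.backward()
        opt.step()
        sched.step()  # OneCycle updates once per optimizer step
```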
Latent units

The auto-encoder's encoding module was used to analyse the encoded images as the auto-encoder's latent units. The image sequence first undergoes processing through the encoder, which extracts a compressed representation capturing the key features within each image. The encoder's processed output forms the auto-encoder's latent units.

Positional decoder

To assess the effectiveness of the predictive coder in capturing positional information within the encoded sequences, this analysis employed an auxiliary neural network for position prediction. This network, referred to as the positional decoder, takes the latent units generated by the predictive coder (or auto-encoder) as input. The decoder architecture consists of several layers designed to extract positional information: a convolutional layer transforms the input to a higher dimension (256), followed by a ReLU activation for non-linearity. A max pooling layer then reduces the spatial resolution while maintaining relevant features. Subsequently, two fully connected (affine) layers with ReLU activations project the data to a lower dimension (64) and finally to a two-dimensional output corresponding to the agent's predicted position (x and y coordinates).

During training, the mean-squared error between the agent's actual position and the predicted position served as the loss function

$$E(x,\hat{x})={\left\Vert x-\hat{x}\right\Vert }_{{\ell }_{2}}.$$

To optimize this loss, the AdamW optimizer was employed with a two-stage learning-rate schedule. The initial stage utilized a learning rate of 10⁻⁴ for 1,000 epochs, followed by a fine-tuning stage with a reduced learning rate of 10⁻⁵ for an additional 1,000 epochs.

Modelling the correspondence between latent and physical distances

This analysis evaluated the ability of the predictive coder's latent space to encode local positional information. For each path traversed by the agent, we computed the pairwise distances between positions in physical space and the corresponding latent space distances within a neighbourhood of 100 time steps. To assess the correspondence between these two distance measures, we analysed the joint distribution of physical and latent space distances. We modelled the relationship between latent distances and their corresponding physical distances using a logarithmic function with additive Gaussian noise

$$\hat{x}=a\log x+b+\epsilon ,\qquad \epsilon \sim {\mathcal{N}}(0,\sigma ),$$

where x denotes the physical distance, x̂ the corresponding latent distance and a and b are fitted coefficients. The goodness-of-fit between the model and the data was evaluated using two metrics: the Pearson correlation coefficient, which measures the dependence between the physical and latent distances, and the Kullback–Leibler divergence

$${\mathbb{D}}_{{\rm{KL}}}(\;{p}_{{\rm{PC}}}\parallel {p}_{{\rm{model}}}),$$

which quantifies the difference between the modelled regression distribution and the observed empirical distribution.

Mutual information of the predictive coder and auto-encoder

The spatial information encoded within the latent representations of both the predictive coder and the auto-encoder was evaluated. To achieve this, this analysis computed the joint densities between the latent distances in each model and the corresponding physical distances within the environment. By analysing these joint densities, we quantified the physical information within each model's latent space. The mutual information

$$I[X;Z\;]={{\mathbb{E}}}_{p(X,Z\;)}\left[\log \frac{p(X,Z\;)}{p(X\;)p(Z\;)}\right]$$

was employed as the metric for this assessment. Higher mutual information indicates that the latent distances in a model encode a greater amount of spatial information, signifying a stronger correlation between distances in the latent space and the actual physical separations between locations in the environment. This comparison allows us to gauge the relative effectiveness of each model in capturing and representing spatial relationships within their respective latent spaces.
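The sketch below illustrates both analyses with numpy/scipy under stated assumptions: distances are binned into a 50 × 50 joint histogram, the Pearson correlation is taken in the fitted log coordinate (the published analysis may correlate raw distances) and the model density is constructed bin-wise from the fitted log-Gaussian regression.

```python
import numpy as np
from scipy import stats

def paired_distances(pos, lat, window=100):
    """Pairwise physical/latent distances within a 100-time-step window."""
    phys, latd = [], []
    for i in range(len(pos)):
        for j in range(i + 1, min(i + window, len(pos))):
            phys.append(np.linalg.norm(pos[i] - pos[j]))
            latd.append(np.linalg.norm(lat[i] - lat[j]))
    return np.asarray(phys), np.asarray(latd)

def correspondence_stats(phys, latd, bins=50):
    """Log fit, Pearson r, KL to the fitted model, and mutual information."""
    m = phys > 0
    a, b = np.polyfit(np.log(phys[m]), latd[m], 1)   # latent ~ a log(x) + b
    r, _ = stats.pearsonr(np.log(phys[m]), latd[m])

    # Empirical joint density of (physical, latent) distances on a grid.
    H, xe, ze = np.histogram2d(phys[m], latd[m], bins=bins)
    p_emp = H / H.sum()
    px = p_emp.sum(axis=1, keepdims=True)            # marginal over latent
    pz = p_emp.sum(axis=0, keepdims=True)            # marginal over physical

    # Mutual information I[X; Z] of the empirical joint (in nats).
    nz = p_emp > 0
    mi = np.sum(p_emp[nz] * np.log(p_emp[nz] / (px @ pz)[nz]))

    # Model joint p(x) * N(z; a log x + b, sigma) binned on the same grid,
    # then D_KL(p_emp || p_model).
    xc = 0.5 * (xe[:-1] + xe[1:])
    zc = 0.5 * (ze[:-1] + ze[1:])
    sigma = np.std(latd[m] - (a * np.log(phys[m]) + b))
    cond = stats.norm.pdf(zc[None, :], loc=a * np.log(xc)[:, None] + b,
                          scale=sigma) + 1e-12
    p_model = px * (cond / cond.sum(axis=1, keepdims=True))
    ok = nz & (p_model > 0)
    kl = np.sum(p_emp[ok] * np.log(p_emp[ok] / p_model[ok]))
    return (a, b), r, kl, mi
```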
Place field analysis

Place field calculation

This analysis investigated the spatial localization of individual units within the neural network's latent space. First, the histogram of the distribution of the 128-dimensional latent vectors was computed. To identify active units, a thresholding technique based on the 90th-percentile value of the continuous latent unit values was employed, ensuring a focus on units with notable activation levels. The agent's head direction was varied during data collection to ensure that the identified localized regions remained stable regardless of the agent's orientation.

Place field statistical fitting

To quantify the degree of localization of each active unit, a two-dimensional Gaussian distribution

$$P(x)=\frac{1}{2\uppi | {{\Sigma }}{| }^{1/2}}\exp \left[-\frac{1}{2}{(x-\mu )}^{T}{{{\Sigma }}}^{-1}(x-\mu )\right]$$

was fitted to the unit's corresponding distribution in physical space. The area of the resulting ellipsoid, defined by the Gaussian approximation and exceeding a probability threshold of P ≥ 0.0005, served as the localization metric. This area reflects the spatial extent of the unit's activation within the environment, relative to the overall environment size of 40 × 65 lattice units (2,600 units). Units with smaller ellipsoid areas exhibit a more concentrated activation pattern in physical space, indicating a higher degree of localization.

Vector navigation analysis

This analysis investigated the ability of the neural network's latent space not only to encode positional information but also to represent the vector heading from a current location to a goal location, called vector navigation. To assess this, we compared the overlapping regions in the latent space representations of two distinct positions x1 and x2 by computing the bitwise difference z1 − z2 between the corresponding latent codes z1 and z2. We then examined the relationship between this difference vector and the actual physical displacement vector x1 − x2 using a linear decoder

$${x}_{1}-{x}_{2}=W\;[{z}_{1}-{z}_{2}]+b.$$

This decoder was trained to predict the displacement vector based solely on the latent code difference. The predicted displacement was then decomposed into its distance and directional components to calculate the specific errors associated with predicting both the distance and the direction to the goal location. Pearson correlation coefficients were computed for the predicted distance, the predicted direction and the predicted displacement vector.
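Since the decoder is linear, it can be fitted in closed form with least squares. The following sketch is one possible realization; the helper names and the angle-wrapping convention are illustrative assumptions.

```python
import numpy as np

def fit_vector_decoder(dz, dx):
    """Least-squares fit of dx = W dz + b from latent-code differences dz."""
    A = np.hstack([dz, np.ones((len(dz), 1))])   # append a bias column
    Wb, *_ = np.linalg.lstsq(A, dx, rcond=None)  # solution shape (D + 1, 2)
    return Wb[:-1], Wb[-1]                       # W, b

def distance_direction_errors(W, b, dz, dx):
    """Decompose prediction error into distance and heading components."""
    pred = dz @ W + b
    dist_err = np.abs(np.linalg.norm(pred, axis=1) - np.linalg.norm(dx, axis=1))
    heading = lambda v: np.arctan2(v[:, 1], v[:, 0])
    # wrap angular differences into [-pi, pi]
    dir_err = np.angle(np.exp(1j * (heading(pred) - heading(dx))))
    return dist_err, dir_err

# usage: dz = z1 - z2 (bitwise code differences), dx = x1 - x2 (displacements)
# W, b = fit_vector_decoder(dz, dx)
```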
Mutual information calculation

This analysis employed a complementary approach to evaluate the spatial information encoded within the binary vectors derived from the latent space. Here, the joint densities were computed between the bitwise distances of these binary vectors and the Euclidean distances between the corresponding physical positions. The mutual information

$$I[X;Z\;]={{\mathbb{E}}}_{p(X,Z\;)}\left[\log \frac{p(X,Z\;)}{p(X\;)p(Z\;)}\right]$$

was then computed to quantify the amount of spatial information captured by the bitwise distances. This metric reflects how well the bitwise distance between latent codes tracks the actual physical separation between locations in the environment. Finally, to provide context for the obtained value, the mutual information of the binary vectors' bitwise distances was compared with the mutual information derived from the latent distances of both the predictive coder and the auto-encoder. This comparison assesses the relative effectiveness of each model in capturing spatial information within its respective latent representation.

Place field stability with shifting landmarks

To assess the stability of the identified localized regions within the latent space, this analysis investigated their resilience to changes in the environment's landmarks. The environment was manipulated: the trees, originally serving as landmarks, were removed and randomly redistributed throughout the space. Subsequently, the Jaccard index

$$J=\frac{| {S}_{{\rm{new}}}\cap {S}_{{\rm{old}}}| }{| {S}_{{\rm{new}}}\cup {S}_{{\rm{old}}}| }$$

was employed to quantify the overlap between the latent units identified in the original environment and those found in the environment with shifted landmarks. The Jaccard index ranges from 0 to 1, where a value of 1 indicates perfect overlap between the sets of latent units and 0 signifies no overlap. This analysis allowed us to evaluate how well the latent units maintain their spatial correspondence despite alterations to the environment's visual features.
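A short sketch of this stability measure, assuming the active-unit sets are obtained with the same 90th-percentile threshold used in the place field analysis:

```python
import numpy as np

def active_units(latents, pct=90):
    """Set of unit indices exceeding the 90th-percentile activation threshold.

    `latents` is (time, units); a single global threshold is assumed here,
    though a per-unit threshold is an equally plausible reading.
    """
    thresh = np.percentile(latents, pct)
    return set(np.where(latents.max(axis=0) > thresh)[0])

def jaccard(s_old, s_new):
    """|S_new ∩ S_old| / |S_new ∪ S_old|: 1 = identical sets, 0 = disjoint."""
    union = s_old | s_new
    return len(s_old & s_new) / len(union) if union else 1.0

# usage: stability = jaccard(active_units(z_before), active_units(z_after))
```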
Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.