Biophysical neural adaptation mechanisms enable artificial neural networks to capture dynamic retinal computation

Retina electrophysiology

Retina electrophysiology experiments were performed in two different labs. In the Manookin Lab, electrophysiological experiments were performed using ex vivo retina from an 11-year-old female macaque (Macaca nemestrina) obtained through the tissue distribution program at the University of Washington National Primate Research Center and in accordance with the Institutional Animal Care and Use Committee at the University of Washington. Additional primate (17-year-old male Macaca mulatta) and rat (two female Long-Evans) retina electrophysiology experiments were performed in the Field Lab at Duke University, in accordance with Duke University's Institutional Animal Care and Use Committee.

Electrophysiology experiments followed similar procedures in both labs. For primate electrophysiology experiments, eyes were enucleated from a terminally anesthetized macaque monkey and hemisected, and the vitreous humor was removed. Immediately after enucleation, the anterior portion of the eye and the vitreous were removed in room light. The eye cup was placed in a dark sealed container with Ames' solution (Sigma, St. Louis, MO) at room temperature. Under infrared illumination, segments of peripheral retina 6–15 mm (25–70°, 200 μm/°) from the fovea and 3–5 mm in diameter were dissected and isolated from the retinal pigment epithelium. Preparation of the rat retinae was similar and is described in detail in ref. 28. Briefly, the retina of a euthanized animal was extracted and dissections were performed in darkness with the assistance of infrared converters. We dissected dorsal pieces of the retina measuring approximately 3 × 2 mm. For recording, the retina was kept at 32–35 °C and was perfused with Ames' solution bubbled with 95% O2 and 5% CO2, pH 7.4.

The segment of retina was then placed flat, RGC layer down, on a planar multielectrode array (MEA) covering an area of 2000 μm × 1000 μm. The MEA consisted of 512 electrodes with 60 μm or 30 μm spacing. Spikes on each electrode were identified by thresholding the voltage traces at 4 times a robust estimate of the voltage standard deviation. For the retina experiment involving naturalistic stimuli (Manookin Lab), spike sorting was performed using the Kilosort51 software package (version 2.5). Spike waveform clusters were identified as neurons only if they exhibited a refractory period (1.5 ms) with <1% estimated contamination. For the retina experiments across light levels (Field Lab), spike sorting was performed by an automated PCA algorithm and verified by hand with custom software52,53. Spike waveform clusters were identified as neurons only if they exhibited a refractory period (1.5 ms) with <10% estimated contamination. Most sorted units from the primate retina had 0% spike contamination based on refractory period violations; other units had contamination in the range of 0.05–0.09%, with one unit at 0.8%. Only units that could be reliably tracked across all recording conditions were considered for further analysis.

For each retinal segment, a Retinal Reliability Index was computed to assess tissue quality. This involved analyzing the responses of individual sorted RGCs to multiple trials of either white noise or naturalistic movies. We first estimated the trial-averaged noise by splitting trials into two groups, averaging responses within each group, and computing the mean squared error between the two averages as

$${\sigma }_{noise}^{2}={{\rm{E}}}_{t}\left[{({y}_{t}^{A}-{y}_{t}^{B})}^{2}\right]$$
(1)
where \({y}_{t}^{A}\) and \({y}_{t}^{B}\) are the observed spike rates of an RGC at time bin t, calculated as averages across the sets of trials A and B, respectively. The sets A and B were obtained by randomly splitting the total number of repeats into two. We then computed the fraction of explainable variance, i.e., the fraction of the variance of each RGC's response that is attributable to the stimulus, as

$${\rm{Fraction}}\ {\rm{Explainable}}\ {\rm{Variance}}=\frac{Var[{y}^{A}]-{\sigma }_{noise}^{2}}{Var[{y}^{A}]}$$
(2)
where

$$Var[{y}^{A}]=\frac{1}{T}\sum\limits_{t=1}^{T}{({y}_{t}^{A}-{\bar{y}}^{A})}^{2}$$
(3)
and \({y}_{t}^{A}\) is the observed spike rate at time bin t and \({\bar{y}}^{A}\) is the mean firing rate across time. A higher fraction indicates recordings with low noise, where most of the variance in RGC responses is stimulus-driven. The Retinal Reliability Index was then taken as the median of the fraction of explainable variance across all sorted RGCs in each experiment; these values are presented in Table 1 for each retina used in this study. Intuitively, the higher the value (maximum 1), the better the quality of the recordings.

Table 1: Retinal reliability index for each retina used in this study.
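For concreteness, the reliability computation of Eqs. (1)–(3) can be sketched in a few lines of NumPy. This is an illustrative reconstruction (function and variable names, and the fixed random seed, are ours, not part of the published analysis code):

```python
import numpy as np

def fraction_explainable_variance(spike_rates, seed=0):
    """spike_rates: array of shape (n_trials, n_time_bins) for one RGC."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(spike_rates.shape[0])
    half = spike_rates.shape[0] // 2
    y_a = spike_rates[order[:half]].mean(axis=0)   # trial-averaged response, split A
    y_b = spike_rates[order[half:]].mean(axis=0)   # trial-averaged response, split B
    sigma2_noise = np.mean((y_a - y_b) ** 2)       # Eq. (1)
    var_a = np.mean((y_a - y_a.mean()) ** 2)       # Eq. (3)
    return (var_a - sigma2_noise) / var_a          # Eq. (2)

def retinal_reliability_index(all_cells):
    """all_cells: list of (n_trials, n_time_bins) arrays, one per sorted RGC."""
    return np.median([fraction_explainable_variance(c) for c in all_cells])
```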
Visual stimulation and data acquisition for primate retina experiment using naturalistic movies (Figs. 2, 3)

Visual stimuli were created with custom Matlab code. Stimuli were presented with a gamma-corrected OLED display (Emagin, Santa Clara, CA) refreshing at 60.32 Hz. The display had a resolution of 800 × 600 pixels covering 3.0 × 2.3 mm on the retinal surface.

The spectral intensity profile (in μW cm−2 nm−1) of the light stimuli was measured with a calibrated CCS100 spectrometer (Thorlabs). We transformed the stimulus intensity into equivalents of photoisomerizations per receptor per second (R* receptor−1 s−1). The spectrum was converted to photons cm−2 s−1 nm−1, convolved with the normalized spectral sensitivities of macaque cones and rods, and multiplied by the effective collection area of these photoreceptors. The ambient light level (i.e., mean stimulus intensity) was set using neutral density filters in the light path. The attenuation of each neutral density filter was measured for the red, green, and blue LEDs using a calibrated UDT 268R radiometric sensor (Gamma Scientific).

We recorded RGC activity in response to 36 min of binary checkerboard white noise stimuli and 9 min of grayscale naturalistic movies at a mean light level of 50 R* receptor−1 s−1. The checkerboard stimuli in this experiment had 100 × 75 pixels, where each pixel edge corresponded to 30 μm on the retinal surface. The refresh rate of the stimulus was set to 60.32 Hz (~16.6 ms per frame). The naturalistic movies were created by displacing natural scene images from the van Hateren dataset54 across the retina, incorporating eye movement trajectories derived from the DOVES dataset55 (Fig. 1a). We used nine different natural scene images, leading to nine naturalistic movies. Each movie was 6 s long: the image remained stationary for the first 1 s, allowing time to adapt to the spatial contrast, and then moved for the remaining 5 s. The nine movies were played in sequence and the entire sequence was repeated ten times, for a total duration of 9 min of naturalistic movies. These movies were presented to the retina at a resolution of 800 × 600 pixels, where each pixel edge corresponded to ~3.8 μm on the retinal surface. We selected 57 RGCs (27 ON and 30 OFF parasol cells) for modeling purposes based on spike sorting quality and reliability across experimental conditions.

At the model training stage, each repeat of each movie was treated as an individual movie, i.e., 9 movies with 10 trials each were treated as 90 movies. Eighty of these (8 unique movies × 10 trials) were used for training the model, and the held-out movie was used to validate the model against trial-averaged responses. Additionally, the naturalistic movies were spatially down-sampled by a factor of 8 to 100 × 75 pixels to match the resolution of the checkerboard white noise stimuli. This was necessary because we first trained the models on the checkerboard movies and then fine-tuned the same models with the naturalistic movies.

Visual stimulation and data acquisition for primate and rat retina experiments at different light levels (Figs. 4–7)

Visual stimuli were created with custom Matlab code. Stimuli were presented with a gamma-corrected OLED display (SVGA + XL Rev3, Emagin, Santa Clara, CA) refreshing at 60.35 Hz. The image from the display was focused onto the photoreceptors using an inverted microscope (Ti-E, Nikon Instruments) with a ×4 objective (CFI Super Fluor ×4, Nikon Instruments). The optimal focus was confirmed by presenting a high spatial resolution checkerboard noise stimulus (20 × 20 μm, refreshing at 15 Hz) and adjusting the focus to maximize the spike rate of RGCs over the MEA. The display had a resolution of 800 × 600 pixels covering 4 × 3 mm on the retinal surface.

The spectral intensity profile (in μW cm−2 nm−1) of the light stimuli was measured with a calibrated Thorlabs spectrophotometer (CCS100). We transformed the stimulus intensity into equivalents of photoisomerizations per receptor per second (R* receptor−1 s−1) by converting the power and the emission spectra of the display into an equivalent photon flux using Planck's equation. This converted the emission spectrum to photons cm−2 s−1 nm−1, which was then convolved with the normalized spectral sensitivity of rods56 and multiplied by the effective collection area of rods (0.5 μm2). The ambient light level (i.e., mean stimulus intensity) was set using neutral density filters in the light path. In each recording, stimuli were first presented at the darkest light level. For every subsequent higher light level, the retinal tissue was first adapted to that light level before continuing the recordings.

Stimuli consisted of non-repeated, binary checkerboard white noise interleaved with repeated (N = 126 or 225 trials) binary white noise segments (5 or 10 s) used to estimate noise. The total duration of stimulation was 60 min. We recorded primate RGC activity in response to the same 60 min white noise sequence at three different mean light levels, each differing by 1 log unit: 0.3 R* receptor−1 s−1, 3 R* receptor−1 s−1 and 30 R* receptor−1 s−1. These light levels fall within the scotopic regime, where mostly rod photoreceptors contribute to vision. The movies across the three light levels differed only in their mean pixel values, which were 0.3 R* receptor−1 s−1 (low light level), 3 R* receptor−1 s−1 (medium) and 30 R* receptor−1 s−1 (high). The checkerboard stimuli in this experiment had 39 × 30 pixels, where each pixel edge corresponded to ~140 μm on the retinal surface. The refresh rate of the stimulus was set to 15 Hz, which means that each checkerboard pattern was presented to the retina for ~67 ms. In this work, we used a subset of 37 recorded RGCs that could be reliably tracked across light levels and were classified as high-quality units after spike sorting. This subset contained 2 ON parasol, 28 ON midget, 5 OFF parasol, and 2 OFF midget RGCs.
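As an aside, the conversion of a measured display spectrum into photoisomerization rates described above (for both experiments) can be sketched in a few lines of NumPy. The function below is our illustration under the stated assumptions (argument names and sampling grids are placeholders, not part of the published analysis code):

```python
import numpy as np

H = 6.626e-34   # Planck's constant (J s)
C = 2.998e8     # speed of light (m/s)

def photoisomerization_rate(wavelength_nm, power_uW_cm2_nm, sensitivity,
                            collection_area_um2=0.5):
    """Convert a measured stimulus spectrum into R* receptor^-1 s^-1.

    wavelength_nm       : wavelengths at which the spectrum was sampled (nm)
    power_uW_cm2_nm     : spectral intensity profile (uW cm^-2 nm^-1)
    sensitivity         : normalized photoreceptor spectral sensitivity at the same wavelengths
    collection_area_um2 : effective collection area of the photoreceptor (um^2)
    """
    energy_per_photon = H * C / (wavelength_nm * 1e-9)              # J per photon
    photon_flux = power_uW_cm2_nm * 1e-6 / energy_per_photon        # photons cm^-2 s^-1 nm^-1
    effective_flux = np.trapz(photon_flux * sensitivity, wavelength_nm)  # photons cm^-2 s^-1
    return effective_flux * collection_area_um2 * 1e-8              # um^2 -> cm^2
```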
The rat experiments of ref. 28 were performed at two light levels differing by 4 log units: 1 R* receptor−1 s−1 (a scotopic light level, where mostly rod photoreceptors contribute to vision) and 10,000 R* receptor−1 s−1 (a photopic light level, where cone photoreceptors predominantly contribute to vision). The white noise checkerboard movie in these experiments had 10 × 11 pixels, with each pixel edge corresponding to ~252 μm on the retina. The refresh rate of the stimulus was 60 Hz and 30 Hz at the photopic and scotopic light levels, respectively. In this work, we used data from two rat experiments: a subset of 61 RGCs from Retina A and a subset of 55 RGCs from Retina B that could be reliably tracked across light levels and were classified as high-quality units after spike sorting. These subsets contained OFF brisk sustained and OFF brisk transient RGC subtypes.

Data preprocessing for models

Both white noise and naturalistic movies were up-sampled to 120 Hz by repeating each frame so that each frame had a duration of 8 ms. This up-sampling was necessary for the photoreceptor layer, in which differential equations are solved using the Euler method.

Spikes were grouped into 8 ms time bins spanning the duration of the movie. Firing rates were then estimated by convolving the binned spike counts with a Gaussian of standard deviation σ = 32 ms (4 frames/bins) and amplitude 0.25 σ−1 e1/2. The resulting firing rates for each RGC were normalized by the median firing rate of that RGC over the course of the experiment. This was done to ensure that the responses of all output units of the model (i.e., the modeled RGCs) were at the same scale.
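A minimal NumPy/SciPy sketch of this preprocessing (binning, Gaussian smoothing, median normalization) is given below. Note that it uses SciPy's unit-area Gaussian kernel rather than the specific amplitude convention stated above, and the function names are ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

BIN_MS = 8.0

def firing_rate(spike_times_ms, duration_ms, sigma_ms=32.0):
    """Bin spikes at 8 ms and smooth with a Gaussian of sigma = 32 ms (4 bins)."""
    n_bins = int(np.ceil(duration_ms / BIN_MS))
    edges = np.arange(n_bins + 1) * BIN_MS
    counts, _ = np.histogram(spike_times_ms, bins=edges)
    return gaussian_filter1d(counts.astype(float), sigma=sigma_ms / BIN_MS)

def normalize_by_median(rates):
    """Normalize each RGC's rate by its median over the experiment (rates: cells x time)."""
    med = np.median(rates, axis=1, keepdims=True)
    return rates / med
```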
Conventional CNN architecture

The general architecture of the conventional CNN we used was similar to Deep Retina9. The model (Fig. 1b) had three convolution layers (orange in the figure), followed by a fully-connected output layer (black arrows). The model takes as input a movie (80 frames per training example, where each frame corresponds to 8 ms) and outputs an instantaneous spike rate for each RGC at the end of that movie segment. The first convolution layer is a 3D convolutional layer operating in both the spatial and temporal dimensions. The output of the 3D convolutional layer is a 2D image, which is normalized using Batch Normalization (BatchNorm) and then passed through a rectifying nonlinearity. All the temporal information in the movie is extracted by this layer, as the temporal dimension of the convolutional filter is the same as the temporal dimension of the movie. To down-sample the spatial dimensions, we applied a 2D max pool operation (blue in the figure) that took the maximum value over 2 × 2 patches of the previous layer's output. The subsequent 2D CNN layers are followed by a final, fully connected layer with a softplus activation function that outputs the predicted spike rate for each RGC in the dataset.

To obtain the time series of RGC responses to longer movie stimuli, we feed into the model many 80-frame video samples taken from that longer movie, corresponding to 1-frame shifts; i.e., the model receives as inputs frames 1–80, 2–81, 3–82, etc., and outputs RGC responses at the times of movie frames 80, 81, 82, etc.

A Layer Normalization (Layer Norm) at the input of the first convolutional layer was applied to z-score each frame of a movie segment. Layer Norm computes normalization statistics for each pixel over its temporal history within a single training example, i.e., a single movie segment comprising 80 frames. This step removes the mean luminance from each training example, mitigating sensitivity changes associated with global luminance changes while preserving the temporal structure within each movie segment.

Each convolution operation is followed by Batch Normalization, which contributes to stable training and faster convergence of the model57. During model training, the distribution of inputs to a layer changes as the network's parameters are updated, leading to what is known as internal covariate shift, a phenomenon that hampers model convergence and introduces instabilities. The BatchNorm layers address this internal covariate shift during training by z-scoring their input, using normalization statistics computed from batches comprising over 100 movie segments. This process enforces zero mean and unit variance and introduces two trainable parameters, a shift and a scale, that systematically adjust the weights and biases in the CNN layer. Additionally, the running average and variance of the training data are stored as non-trainable parameters for later use during the test phase. This normalization mitigates extreme parameter values, preventing issues such as exploding or vanishing gradients, and the scale and shift parameters enable the model to adapt to variations in feature magnitudes across layers and activations, facilitating improved and faster convergence. During the test phase, BatchNorm uses the non-trainable moving average and variance saved during training to normalize its inputs, i.e., the outputs of the convolutional layer, and then scales and shifts the normalized input using the scale and shift parameters learned during training. Notably, in the current setup, the BatchNorm parameters are not influenced by the mean light level, because the Layer Norm at the model's input removes the mean luminance from each training sample.

For modeling RGC responses across light levels (experiments of Figs. 4–7), the input to the model was a movie segment of 120 frames instead of 80 frames. The longer movie segment allowed for longer integration times at the lowest light level of 0.3 R* receptor−1 s−1.
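To make the architecture described above concrete, a hedged Keras sketch follows. The channel counts and kernel sizes are placeholders (the actual values were selected by grid search; Supplementary Table 1), and this is our illustrative reconstruction rather than the authors' released code:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_conventional_cnn(n_frames=80, height=100, width=75, n_rgcs=57,
                           chan1=8, chan2=16, chan3=32, k_s=3):
    # Input: one movie segment of shape (time, height, width, 1).
    movie = layers.Input(shape=(n_frames, height, width, 1))

    # Layer Norm over each pixel's temporal history within a single segment.
    x = layers.LayerNormalization(axis=1)(movie)

    # 3D convolution whose temporal extent equals the movie length, then BN + ReLU.
    x = layers.Conv3D(chan1, (n_frames, k_s, k_s))(x)       # -> (1, H', W', chan1)
    x = layers.Reshape((height - k_s + 1, width - k_s + 1, chan1))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)            # 2x2 spatial down-sampling

    # Two further 2D convolutional layers, each followed by BN + ReLU.
    for chan in (chan2, chan3):
        x = layers.Conv2D(chan, (k_s, k_s))(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)

    # Fully connected readout with softplus: one instantaneous rate per RGC.
    x = layers.Flatten()(x)
    rates = layers.Dense(n_rgcs, activation="softplus")(x)
    return tf.keras.Model(movie, rates)
```

To generate a full response time series, overlapping 80-frame windows shifted by one frame would be fed to such a model in sequence, as described above.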
Biophysical photoreceptor–CNN architecture

The proposed photoreceptor convolution layer builds upon a biophysical model of the phototransduction cascade by ref. 20 (Fig. 1c). The model incorporates the various feedforward and feedback molecular processes that convert photons into electrical signals, and it therefore faithfully captures the photoreceptors' adaptation mechanisms. The biophysical model is reproduced in Supplementary Note 2 and described briefly below.

The biophysical model is represented by a set of six differential equations that mimic the enzymatic reactions of the phototransduction cascade. Rapid adaptation in this model emerges from changes in the rate of cGMP turnover, produced by light intensity-dependent changes in phosphodiesterase activity and by calcium feedback to the rate of cGMP production. The model is governed by twelve parameters. By setting the model's parameter values to match experimentally derived values from cone or rod photoreceptors, the model can be configured to represent either photoreceptor type. For all primate retina modeling in this manuscript, we configured the photoreceptor model to represent primate rods by setting the initial values of the model parameters to those derived from separate patch-clamp experiments23 on primate rod photoreceptors (Supplementary Table 2). For rat retina modeling, we configured the photoreceptor model as a cone photoreceptor for modeling responses at photopic light levels, and as a rod photoreceptor for modeling responses at scotopic light levels. The parameter values here were obtained by fitting the model to mouse cone and rod photoreceptors as part of patch-clamp experiments for other studies23. The corresponding values are stated in Supplementary Table 2.

We implemented this biophysical model as a fully-trainable neural network layer, called the photoreceptor layer, using the Keras58 package in Python. All twelve parameters of the photoreceptor layer can be trained through backpropagation using Keras and TensorFlow, although the user can also set some or all of these parameters to be non-trainable and hence held fixed at their initial values. Photoreceptor parameters were initialized to their known values (Supplementary Table 2). For the experiments presented here, 7 of the parameters were set to be non-trainable. Some of these parameters, such as the concentration of cyclic guanosine monophosphate (cGMP) in darkness, vary across photoreceptor types (rods and cones). Other parameters, governing cGMP conversion into current, calcium concentration in the dark, affinity for Ca2+, and the Hill coefficient, are comparable across photoreceptor types. The remaining five parameters, set to be trainable or non-trainable depending on the model configuration, consisted of the photopigment decay rate σ, the phosphodiesterase (PDE) activation rate η, the PDE decay rate ϕ, the rate of Ca2+ extrusion β, and γ, which controls the overall sensitivity of the model to light inputs. These trainable parameters also differ across photoreceptor types. In the current version of the model, the photoreceptor parameters are shared by all input pixels, and each pixel acts as an independent photoreceptor; i.e., the conversion of each pixel into photocurrents depends only on that pixel's previous values and not on the values of the other pixels.

The photoreceptor layer converts each pixel of the input movie, in units of receptor activations per photoreceptor per second (R* receptor−1 s−1), into photocurrents (pA) by solving the differential equations using Euler's method. Similar to the conventional CNN model, the photoreceptor layer takes as input 80 frames, where each frame corresponds to 8 ms. The output of this layer is a movie that is 80 frames long with the same spatial dimensions as the input visual stimuli. The first 20 frames of the photoreceptor layer output are truncated to account for edge effects. The photocurrent movie is then z-scored using Layer Norm. This normalization step is crucial because of substantial differences in scale between the biophysical model's parameters and the downstream CNN weights; without these normalization layers, convergence of the photoreceptor–CNN model is hindered. The resulting movie is then passed through the downstream CNN layers, where the temporal extent of the first convolution layer's filter is 60 frames instead of the 80 frames used in the conventional CNN model.

For modeling RGC responses across light levels (experiments of Figs. 4–7), the input to the photoreceptor layer was a movie segment of 180 frames. The first 60 frames were then discarded to account for edge effects. The longer movie segment allowed for longer integration times at the lowest light level of 0.3 R* receptor−1 s−1.
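The sketch below illustrates how a phototransduction-style dynamical model can be wrapped as a trainable Keras layer with a forward-Euler loop. The two-state update equations and the initial values shown are deliberately simplified, arbitrary stand-ins for the actual six-equation model and twelve parameters (Supplementary Note 2 and Supplementary Table 2); only the overall pattern (shared trainable rate constants, per-pixel state, one Euler step per 8 ms frame) is meant to be illustrative:

```python
import tensorflow as tf

class PhotoreceptorLayer(tf.keras.layers.Layer):
    """Toy phototransduction-style layer solved with forward Euler, per pixel."""

    def __init__(self, dt=0.008, trainable_params=True, **kwargs):
        super().__init__(**kwargs)
        self.dt = dt
        self.trainable_params = trainable_params

    def build(self, input_shape):
        # Arbitrary placeholder initial values, NOT those of Supplementary Table 2.
        init = {"sigma": 10.0, "eta": 100.0, "phi": 10.0, "gamma": 1.0}
        self.params = {
            name: self.add_weight(name=name, shape=(),
                                  initializer=tf.keras.initializers.Constant(val),
                                  trainable=self.trainable_params)
            for name, val in init.items()
        }

    def call(self, stim):
        # stim: (batch, time, height, width), in R* receptor^-1 s^-1 (static time dim assumed).
        p = self.params
        r = tf.zeros_like(stim[:, 0])                 # opsin-like activity
        pde = p["eta"] / p["phi"] * tf.ones_like(r)   # PDE-like activity at steady state
        currents = []
        for t in range(stim.shape[1]):                # one forward-Euler step per 8 ms frame
            r = r + self.dt * (p["gamma"] * stim[:, t] - p["sigma"] * r)
            pde = pde + self.dt * (p["eta"] + r - p["phi"] * pde)
            currents.append(-pde)                     # photocurrent-like output
        return tf.stack(currents, axis=1)
```

The actual layer replaces these toy dynamics with the full cascade (cGMP turnover, calcium feedback, etc.) but follows the same structure, with the five rate parameters listed above exposed as trainable weights.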
Linear photoreceptor–CNN architecture

The linear photoreceptor model consists of a linear convolutional filter, given by Eq. (4) and described previously in ref. 20.

$$f(t)=\alpha \left(\frac{{(t/{\tau }_{rise})}^{4}}{1+{(t/{\tau }_{rise})}^{4}}\right)\,{e}^{-t/{\tau }_{decay}}\,\cos \left(\frac{2\pi t}{{\tau }_{osc}}+\omega \right)$$
(4)
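A direct NumPy transcription of Eq. (4), with the filter parameters left as arguments (their initialization values are given in the next paragraph), might look like this; the function name and the 500 ms evaluation grid are our own choices:

```python
import numpy as np

def linear_pr_filter(t, alpha, tau_rise, tau_decay, tau_osc, omega_deg):
    """Evaluate the linear photoreceptor filter of Eq. (4) at times t (same units as the taus)."""
    rise = (t / tau_rise) ** 4
    return (alpha * (rise / (1.0 + rise))
            * np.exp(-t / tau_decay)
            * np.cos(2.0 * np.pi * t / tau_osc + np.deg2rad(omega_deg)))

# Example: evaluate the filter over 500 ms at 8 ms resolution (times and taus in ms).
t_ms = np.arange(0.0, 500.0, 8.0)
kernel = linear_pr_filter(t_ms, alpha=631.0, tau_rise=28.1, tau_decay=24.3,
                          tau_osc=2e6, omega_deg=89.97)
```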
The parameters for this model were initialized to the following values: α = 631 pA/R*/s, τrise = 28.1 ms, τdecay = 24.3 ms, τosc = 2 × 10³ s, and ω = 89.97°. These values corresponded to estimates of the single-photon response obtained by recording cone photoreceptor responses20. However, all the parameters were set to be trainable and could therefore be learned along with the downstream CNN weights.

Similar to the other models, the hyperparameters of the linear photoreceptor–CNN model were optimized via a grid search.

Model training

Model weights were optimized using Adam59, where the loss function was the negative log-likelihood under Poisson spike generation. The network layers were regularized with an L2 weight penalty at each layer, to prevent loss of information and to be more robust to outliers. In addition, an L1 penalty was applied to the output of the fully-connected layer, because the neural activity itself is relatively sparse and L1 penalties are known to induce sparsity. Learning rates were initially set to 0.001. A learning rate scheduler reduced the learning rate by a factor of 10 at epochs 3, 30, and 100.

The number of channels in each CNN layer and the filter sizes were optimized by a grid search for each model type and dataset. The grid search for each experiment and model type was conducted using the full training data for that experiment. During the grid search procedure, models were trained for 50 epochs. Optimal hyperparameters were selected by evaluating the models on validation data that was used neither during the training phase nor during the evaluation of predicted responses. Models with these optimal hyperparameters were then re-trained for at least 100 epochs. The optimal hyperparameters for each model used in this study are listed in Supplementary Table 1.
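A hedged Keras sketch of this training setup (Poisson negative log-likelihood, per-layer L2, L1 on the output activity, and the step learning-rate schedule) is shown below. It assumes a model built as in the conventional-CNN sketch above; the regularization strengths and batch size are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import regularizers

def lr_schedule(epoch, lr):
    """Reduce the learning rate by a factor of 10 at epochs 3, 30, and 100."""
    return lr * 0.1 if epoch in (3, 30, 100) else lr

# Per-layer L2 would be attached when the layers are created, e.g.
#   layers.Conv2D(16, 3, kernel_regularizer=regularizers.l2(1e-3))
# and the L1 penalty on the (sparse) output rates via
#   layers.Dense(n_rgcs, activation="softplus", activity_regularizer=regularizers.l1(1e-4))

model = build_conventional_cnn()   # sketch from the conventional-CNN subsection above
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.Poisson())   # negative log-likelihood under Poisson spiking

# model.fit(train_movies, train_rates, epochs=100, batch_size=125,
#           validation_data=(val_movies, val_rates),
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```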
Model evaluation

Trained models were evaluated using a held-out test dataset not seen during training. We quantified model performance as the fraction of explainable variance in each RGC's response that was explained by the model (FEV). This quantity (Eq. (5)) was calculated as the ratio between the variance accounted for by the model and the explainable variance (the denominator in Eq. (5)). Such metrics for quantifying how well a model predicts neural data have been used in previous studies such as ref. 6. We calculate FEV as

$$FEV=1-\frac{\frac{1}{T}\sum\limits_{t=1}^{T}{({y}_{t}^{A}-{\hat{y}}_{t})}^{2}-{\sigma }_{noise}^{2}}{Var[{y}^{A}]-{\sigma }_{noise}^{2}}$$
(5)
where

$${\sigma }_{noise}^{2}={{\rm{E}}}_{t}\left[{({y}_{t}^{A}-{y}_{t}^{B})}^{2}\right]$$
(6)
\({y}^{A}\) and \({y}^{B}\) are the observed spike rates of an RGC, calculated as averages across the sets of repeats A and B, respectively. The sets A and B were obtained by randomly splitting the total number of repeats into two. \({\hat{y}}_{t}\) represents the spike rate predicted by the model at time bin t. The explainable variance (the denominator in Eq. (5)) is the variance of each RGC attributable to the stimulus, computed by subtracting an estimate of the observed noise from the variance across time of the actual RGC responses (Eq. (7)), calculated as

$$Var[{y}^{A}]=\frac{1}{T}\sum\limits_{t=1}^{T}{({y}_{t}^{A}-{\bar{y}}^{A})}^{2}$$
(7)
where \({\bar{y}}^{A}\) is the observed spike rate \({y}^{A}\) averaged across time. In all neural datasets we considered, the number of trials was sufficient (N = 10 for naturalistic movies, N = 225 for white noise movies) and hence the estimated noise variance was quite low. As a result, our FEV values are very similar to those obtained using the usual fraction-of-explained-variance calculation, which does not correct for unexplainable noise. By definition, FEV can be negative if the prediction error is larger than the variance in the actual responses. We report each model's performance across all RGCs as the median FEV across the set of RGCs. For ease of interpretation, we present FEV as a percentage throughout our results.

Model RGC temporal receptive fields (Fig. 6)

For a given model RGC, we computed the gradient of its output spiking rate with respect to the pixel values in the input movie segments, similar to refs. 12 and 13. These gradients were evaluated for different binary white noise movie segments from the primate retina experiment across light levels (Figs. 4, 5). In total, we had 400,000 input movie segments, generated by advancing the white noise movie that was shown to the retina forward by one frame at a time (where 1 frame corresponds to 8 ms). These input movie segments spanned a total duration of 54 min. Since all the models were implemented with TensorFlow60, we calculated the gradients using automatic differentiation (a code sketch of this procedure is given at the end of this section).

The resulting gradient matrices, representing spatio-temporal receptive fields, were decomposed into spatial and temporal components (Supplementary Fig. 3) using Singular Value Decomposition (SVD), similar to the way spatial and temporal receptive fields are computed from spike-triggered average (STA) analysis applied directly to experimental data. We normalized the spatial component to have unit mean. By doing so, our process of decomposing the instantaneous spatio-temporal RF into spatial and temporal components assigned any variations in the receptive field's amplitude only to the temporal component. The average of all the instantaneous temporal receptive fields was taken as the model RGC's temporal receptive field. In Fig. 6c, d, the temporal receptive field was normalized by its maximum peak.

Statistics and reproducibility

We used the SciPy package in Python to perform a two-sided Wilcoxon signed-rank test to compare performance across models (Figs. 3, 5). This non-parametric test was chosen because of its appropriateness for paired samples (in this case, the same RGCs being modeled by different architectures) and its robustness against potential violations of normality assumptions. The null hypothesis tested was that the difference between the median fraction of explainable variance explained (FEV) by the two models for the population of RGCs modeled was zero. In comparing response latencies between two different light levels obtained by different methods (Fig. 6e), we performed a two-sided Wilcoxon rank-sum test of the null hypothesis that there was no difference between the distributions (N = 22 RGCs) of latencies at the two light levels.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
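Referenced in the "Model RGC temporal receptive fields" subsection above: a minimal TensorFlow sketch of the gradient-based receptive-field estimate (instantaneous spatio-temporal RF via automatic differentiation, followed by an SVD split into spatial and temporal components). Function and variable names are illustrative, not the authors' code:

```python
import numpy as np
import tensorflow as tf

def instantaneous_strf(model, movie_segment, cell_index):
    """Gradient of one model RGC's predicted rate w.r.t. a single input movie segment.

    movie_segment: array of shape (time, height, width, 1) for one 80- or 120-frame window.
    Returns an array of the same shape (the instantaneous spatio-temporal RF).
    """
    x = tf.convert_to_tensor(movie_segment[None], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        rate = model(x)[0, cell_index]
    return tape.gradient(rate, x)[0].numpy()

def split_spatial_temporal(strf):
    """Rank-1 SVD split of a (time, height, width, 1) STRF into temporal and spatial parts."""
    t, h, w, _ = strf.shape
    u, s, vt = np.linalg.svd(strf.reshape(t, h * w), full_matrices=False)
    temporal = u[:, 0] * s[0]
    spatial = vt[0].reshape(h, w)
    spatial_mean = spatial.mean()
    # Normalize the spatial map to unit mean so amplitude variations stay in the temporal part.
    return temporal * spatial_mean, spatial / spatial_mean
```

The per-segment temporal components would then be averaged across the ~400,000 windows to obtain a model RGC's temporal receptive field, as described above.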
