Unsupervised modeling of mutational landscapes of adeno-associated viruses viability

In this section, we describe the details of our biophysically inspired statistical model and the different training and test datasets used. The model defines a statistical energy that we regard as a proxy for sequence viability. Strictly speaking, viability is a discrete phenotypic trait that, in the series of experiments constituting our training set [16], is assessed in terms of sequencing counts. The boolean viable/non-viable label is eventually induced a posteriori through a statistical analysis. To compare the performance of our model on the simpler boolean classification task, we also introduce a standard classifier with the same architecture as the energy-based model.

Model

The machine-learning method we propose relies on a statistical model that aims to capture the entire experimental pipeline used to assess the ability of variants to package the genome within the capsid. The structure of this model is very general and permits the analysis of any DMS (Deep Mutational Scanning) experiment. As previously discussed in [18], the model comprises three main steps: selection, amplification, and sequencing.

During the selection phase, mutated viruses are introduced into tissue cells, where their survival depends on their ability to form functional capsids. Typically, only a small fraction of the initial population of viruses is selected during this phase. Following selection, viral DNA is extracted from a sample of supernatant and undergoes amplification. Finally, the amplified DNA is sequenced to gather information about the genetic composition of the selected viruses.

Selection

We examine a population of AAV2s (adeno-associated viruses, serotype 2), each carrying a mutated sequence of the gene of interest. Let \(N_s\) represent the count of viruses carrying sequence s in the initial library. From a biophysical perspective, we assume that the functional characteristics of a viral variant are determined by the thermodynamic states adopted by the protein products of the mutated gene. The expressed protein copies of variant s can either bind together, leading to the assembly of a functional capsid, or their structural properties may hinder capsid formation. The energies of these two states are denoted by \(E^{\text{capsid}}_s\) and \(E^{\text{dissol.}}_s\), respectively. Since survival depends on the ability to form the capsid, the probability of selection for viruses carrying variant s (the selectivity) can be computed according to the Boltzmann law

$$\begin{aligned} p_s&= \frac{\mathrm{e}^{\mu^{\text{capsid}} - E^{\text{capsid}}_s}}{\mathrm{e}^{\mu^{\text{capsid}} - E^{\text{capsid}}_s} + \mathrm{e}^{\mu^{\text{dissol.}} - E^{\text{dissol.}}_s}} \nonumber \\&= \frac{1}{1 + \mathrm{e}^{\mu - E_s}} \end{aligned}$$
(1)
where \(\mu^{\text{capsid}}\) and \(\mu^{\text{dissol.}}\) are the chemical potentials associated with each state, and we define \(\mu := \mu^{\text{dissol.}} - \mu^{\text{capsid}}\) and \(E_s := E^{\text{dissol.}}_s - E^{\text{capsid}}_s\). From now on, we refer to \(E_s\) and \(\mu\) as the energy specificity and the chemical potential, respectively. With these definitions, \(p_s\) takes the well-known Fermi-Dirac form of a two-level system. In the large-\(N_s\) limit, fluctuations in the number of selected sequences are expected to be small. Therefore, out of the \(N_s\) viruses carrying sequence s in the initial library, we assume that \(N_s p_s\) are selected through their ability to form the capsid. This approximation is sometimes referred to as the deterministic binding approximation [18].

Energy specificity

The capsid-formation energy \(E_s\) depends on the ability of the protein products to bind together and assemble the capsid. Protein binding depends on sequence s, and in full generality this dependence is complex. In analogy with the reference study, we use one of the neural network models tested in [16] to parametrize the energy specificity mapping. The network is composed of two convolutional layers followed by three fully connected ones (see Sect. 3 in the Supplementary Material for the detailed structure). We denote the set of parameters of this network (weights and biases) by \(\Theta\). Having specified the network architecture, the selectivities become functions of \(\Theta\) and the chemical potential,

$$\begin{aligned} p_s(\Theta, \mu) = \frac{1}{1 + \mathrm{e}^{\mu - E_s(\Theta)}} \end{aligned}$$
(2)
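For concreteness, a minimal PyTorch sketch of how a network of this general shape (two convolutional layers followed by three fully connected ones) could map a one-hot encoded sequence to an energy, and how that energy enters the selectivity of Eq. (2), is given below. The layer sizes, kernel widths, and the pooling used to obtain a length-independent representation are placeholders, not the actual choices detailed in the Supplementary Material.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Sketch of a sequence-to-energy network: two convolutional layers
    followed by three fully connected layers. All layer sizes and kernel
    widths are placeholders, not the values of the reference model."""
    def __init__(self, n_symbols=21, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_symbols, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # one possible way to get a length-independent representation
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        # x: one-hot encoded sequences, shape (batch, n_symbols, seq_len)
        h = self.conv(x).squeeze(-1)      # (batch, hidden)
        return self.fc(h).squeeze(-1)     # energies E_s, shape (batch,)

def selectivity(E, mu):
    """Fermi-Dirac selectivity of Eq. (2): p_s = 1 / (1 + exp(mu - E_s))."""
    return torch.sigmoid(E - mu)
```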
Although other choices of energy function could be used in principle, the convolutional structure chosen here allows us to handle sequences of different lengths, as explained in detail in Sect. 2 of the Supplementary Material.

Amplification

Empirically, it turns out that only a small fraction of viruses survives the selection phase. To replenish the original total population size, an amplification phase through PCR is performed after the selection round. We assume for simplicity that this process is uniform across all sequences. Then, denoting by \(N_s'\) the abundance of sequence s after amplification,

$$\begin{aligned} N_s' \propto \frac{p_s N_s}{\sum_\sigma p_\sigma N_\sigma} \end{aligned}$$
(3)
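Since amplification is assumed to be uniform, only the relative abundances \(p_s N_s\) matter; a short sketch of Eq. (3), with hypothetical array names, is:

```python
import numpy as np

def amplify(N, p, total_after=1e9):
    """Deterministic selection followed by uniform PCR amplification, Eq. (3):
    post-amplification abundances are proportional to p_s * N_s,
    rescaled to an (arbitrary) total population size."""
    w = np.asarray(p) * np.asarray(N)
    return total_after * w / w.sum()
```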
Sequencing

In the experimental pipeline of [16], sequencing is performed at two different steps of the experiment: (i) when the initial combinatorial library is inserted into the plasmids (but prior to transfection into the target cells), and (ii) after viral extraction from the cells. The target phenotype observed is thus viral viability, i.e. the ability of a variant to remain active and capable of forming a stable capsid. If we assume that each variant is sequenced with a probability proportional to its abundance, the probability of observing a count \(R_s\) for variant s, conditional on the (unobservable) abundance \(N_s\), follows the multinomial distribution

$$\begin{aligned} {\mathcal {P}}({\textbf{R}}|{\textbf{N}}) = \frac{\left(\sum_s R_s\right)!}{\prod_s R_s!}\,\frac{\prod_s N_s^{R_s}}{\left( \sum_s N_s\right)^{\sum_s R_s}} \end{aligned}$$
(4)
Here and in the following we indicate with \({\textbf{R}}\), \({\textbf{N}}\), and \({\textbf{p}}\) the vectors of read counts \(R_s\), variant abundances \(N_s\), and selectivities \(p_s\), respectively. Similarly, for the sample taken after selection and amplification,

$$\begin{aligned} {\mathcal {P}}({\textbf{R}}'|{\textbf{N}}, {\textbf{p}})&=\frac{\left(\sum_s R_s'\right)!}{\prod_s R_s'!}\,\frac{\prod_s {N_s'}^{R_s'}}{\left( \sum_s N_s'\right)^{\sum_s R_s'}} \end{aligned}$$
(5)
$$\begin{aligned}&=\frac{\left(\sum_s R_s'\right)!}{\prod_s R_s'!}\,\frac{\prod_s (p_s N_s)^{R_s'}}{\left( \sum_s p_s N_s\right)^{\sum_s R_s'}} \end{aligned}$$
(6)
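A sketch of the two multinomial log-probabilities of Eqs. (4)-(6), written with `scipy.special.gammaln` so that the factorials do not overflow, is given below; array names are hypothetical.

```python
import numpy as np
from scipy.special import gammaln

def multinomial_logprob(R, weights):
    """log P(R | weights) for reads drawn proportionally to `weights`:
    Eq. (4) with weights = N, Eq. (6) with weights = p * N."""
    R = np.asarray(R, dtype=float)
    w = np.asarray(weights, dtype=float)
    log_coeff = gammaln(R.sum() + 1.0) - gammaln(R + 1.0).sum()
    return log_coeff + (R * np.log(w)).sum() - R.sum() * np.log(w.sum())

# log P(R  | N)    = multinomial_logprob(R, N)             # pre-selection sample
# log P(R' | N, p) = multinomial_logprob(R_prime, p * N)   # post-selection sample
```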
Note that \({\mathcal {P}}({\textbf{R}}'|{\textbf{N}}, {\textbf{p}})\) depends on the selectivities \(p_s\), which in turn depend on the parameters \(\Theta, \mu\), see Eq. (2).

Maximum likelihood inference

The model defined so far has several parameters that must be inferred from data: the parameters of the neural network specifying the mapping from sequence to energy \(E_s\), which we denote by \(\Theta\), the chemical potentials \(\varvec{\mu}\), and the variant abundances in the initial library \(N_s\). As a function of these parameters, the model log-likelihood can be written

$$\begin{aligned} \begin{aligned} {\mathcal {L}}(\Theta, \varvec{\mu}, {\textbf{N}})&= \ln {\mathcal {P}}({\textbf{R}}|{\textbf{N}}) + \ln {\mathcal {P}}({\textbf{R}}'|{\textbf{N}}, {\textbf{p}}) \\&= \sum_s R_s \ln N_s + \sum_s R_s' \ln (p_s N_s) \\&\qquad - \left( \sum_s R_s\right) \ln \left( \sum_s N_s\right) - \left( \sum_s R_s'\right) \ln \left( \sum_s p_s N_s\right) + \text{const.} \end{aligned} \end{aligned}$$
(7)
Note that the dependence on the parameters \(\Theta, \mu\) enters through the selectivities, Eq. (2). Following [19], we then learn \(\Theta, \varvec{\mu}, {\textbf{N}}\) from the sequencing data by numerical maximization of the likelihood,

$$\begin{aligned} \max_{\Theta, \varvec{\mu}, {\textbf{N}}} {\mathcal {L}}(\Theta, \varvec{\mu}, {\textbf{N}}) \end{aligned}$$
(8)
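As an illustration of Eqs. (7)-(8), the sketch below maximizes the log-likelihood by gradient ascent with PyTorch autograd, optimizing the network parameters together with \(\mu\) and \(\log N_s\). The parametrization, optimizer, and learning rate are placeholders (`EnergyNet` is the hypothetical network sketched earlier), not the actual procedure.

```python
import torch

def log_likelihood(E, mu, logN, R, R_prime):
    """Eq. (7), up to an additive constant. E and logN are (n_variants,) tensors."""
    p = torch.sigmoid(E - mu)   # selectivities, Eq. (2)
    N = torch.exp(logN)         # abundances kept positive via a log-parametrization
    return ((R * logN).sum() + (R_prime * torch.log(p * N)).sum()
            - R.sum() * torch.log(N.sum())
            - R_prime.sum() * torch.log((p * N).sum()))

# model = EnergyNet(); x = one-hot encoded sequences; R, R_prime = read-count tensors
# mu   = torch.zeros((), requires_grad=True)
# logN = torch.log(R.float() + 1.0).requires_grad_()
# opt  = torch.optim.Adam(list(model.parameters()) + [mu, logN], lr=1e-3)
# for step in range(10_000):
#     opt.zero_grad()
#     loss = -log_likelihood(model(x), mu, logN, R, R_prime)   # minimizing -L maximizes L
#     loss.backward()
#     opt.step()
```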
See Sect. 4 in the Supplementary Material for details on the optimization procedure.

Once the optimal parameters have been obtained, the learning can be validated by assessing how the quantities observed experimentally correlate with the inferred ones. More precisely, we look at the correlation between \(p_s\) and the ratio \(\theta_s = R_s'/R_s\), usually called the empirical selectivity [20], on a subset of sequences not used during training of the model. Barring sampling noise, \(\theta_s\) should be proportional to the selectivities \(p_s\) inferred by the model. Unfortunately, only one round of selection has been performed in this experiment; it is therefore not possible to use information from multiple rounds to filter out experimental noise, as described in [20] and already exploited in [18].

Binarizing the output of the energy-based model

The most powerful feature of this machine learning method is that it infers quantitatively the selectivity of every single variant. However, there are cases in which the useful information is less specific, and one is only interested in whether or not a variant can perform its task. In these cases, one can use the model to perform a binary classification by setting a threshold value for \(p_s\) and classifying the variants accordingly.

One unsophisticated but sensible way to set such a threshold is to fit a two-component Gaussian mixture to the values of the inferred energies \(E_s\). If the phenotypic trait has a binary nature and the test set is more or less balanced between viable and non-viable sequences that can be recognized by the model, the energy values will cluster around two values: a high energy for non-viable variants and a low energy for viable ones. We therefore fit the distribution of the energies with a two-Gaussian mixture model. Depending on how much the two fitted Gaussians overlap, different strategies can be used to define a threshold. In our case, as shown in Fig. 2, the two Gaussians are separated enough to produce a minimum of the mixture density between the two peaks. We use this value as the threshold, since it is the most robust choice in the sense that a small perturbation around it produces a minimal change in the classification. Since our method is entirely unsupervised, there is no unique way to set the threshold. The other option, as done for instance in [16], is to use a fully supervised strategy on variants already annotated as viable/non-viable.

Fig. 2 Fit performed on the histogram of inferred energies for the two test experiments. This provides an unsupervised estimate of the threshold for binary classification of the sequences as viable or non-viable. The energy distribution has been fitted with a two-Gaussian mixture distribution, and the valley between the two peaks has been chosen as the threshold value.

As a semi-supervised check, one could use the viable/non-viable annotation only to set an "optimal threshold" after having inferred the energy model in an unsupervised way. In Sect. 4 we compare the classification obtained with the threshold described here against the one obtained when the viability information of the test sequences is used to find an "optimal threshold"; this shows that the threshold found by the naïve bimodal fit is already nearly "optimal".

Data binarization

To validate our model, we need to divide the data into viable/non-viable classes.
To do so, we first compute an empirical proxy for the "fitness" of the variants; this quantity is usually referred to as the empirical (log-)selectivity and is defined as

$$\begin{aligned} \theta_s = \log \frac{R'_s}{R_s} \end{aligned}$$
(9)
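The valley-based thresholding used for the inferred energies (Fig. 2) and, as described below, for these empirical log-selectivities (Fig. 3) can be sketched with scikit-learn's `GaussianMixture`; variable names are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def valley_threshold(values, n_grid=1000):
    """Fit a two-component Gaussian mixture to a 1D sample and return the
    position of the density minimum (valley) between the two fitted means."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    lo, hi = np.sort(gmm.means_.ravel())
    grid = np.linspace(lo, hi, n_grid).reshape(-1, 1)
    density = np.exp(gmm.score_samples(grid))   # mixture density along the grid
    return float(grid[np.argmin(density), 0])

# threshold = valley_threshold(log_selectivities)   # as in Fig. 3
# threshold = valley_threshold(inferred_energies)   # as in Fig. 2
```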
These log-selectivities can then be used to assign a label to each sequence. In [16], a binary label was assigned to each variant by setting a threshold on this quantity. The procedure we use to assign this threshold value is similar to that of the reference study. By direct inspection of the empirical log-selectivity distributions (Fig. 3), one can see that they are bimodal. We consider sequences near the rightmost peak as viable and the remaining ones as non-viable. In analogy with what we did for the model energies, we fitted a two-Gaussian mixture to those histograms and chose as threshold the log-selectivity value at the valley between the two peaks, so as to obtain a stable classification.

Fig. 3 Fit performed on the histogram of empirical log-selectivities of each experiment to determine the threshold value for binary labeling of the sequences as viable or non-viable. The empirical distribution has been fitted with a two-Gaussian mixture distribution, and the valley between the two peaks has been chosen as the threshold value.

In contrast with [16], we decided to fit a threshold for each experiment, since in principle this quantity might change between experimental realizations; as one can see from Fig. 3, this is indeed the case.

It is evident from Figs. 2 and 3 that the two-Gaussian mixture fits both the experimental data and the predictions rather poorly. In fact, there is no reason why these quantities should follow such a distribution; the Gaussian mixture is simply the simplest choice for selecting a threshold value. In the following it will become clear that this is not an issue, because the robustness of our biophysical model is not affected by the choice of threshold, as long as it is some sensible value between the two peaks. This is evident from the ROC curve analysis reported in the "Results" section.

Supervised binary classifier

To benchmark the classification performance of the energy-based model, we trained a neural network to perform viable/non-viable classification (binary classifier) in a supervised manner. The architecture of this classifier is identical to that of the energy-based model (refer to Sect. 2 in the Supplementary Material for details). Defining \(q_{s,\text{viable}}\) and \(q_{s,\text{non-viable}}\) as the two outputs of the neural network corresponding to the two states, the training objective can be expressed as

$$\begin{aligned} {\mathcal {C}} = \sum_s \left[ y_s \ln \frac{e^{q_{s,\text{viable}}}}{e^{q_{s,\text{viable}}}+e^{q_{s,\text{non-viable}}}} + (1-y_s) \ln \frac{e^{q_{s,\text{non-viable}}}}{e^{q_{s,\text{viable}}}+e^{q_{s,\text{non-viable}}}}\right] \end{aligned}$$
(10)
where \(y_s\) is 1 if the sequence is labeled as viable and 0 otherwise.
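For reference, Eq. (10) is the log-probability assigned by a softmax over the classifier's two outputs; a minimal PyTorch sketch, with hypothetical tensor names and an assumed column order, is shown below. Maximizing this quantity is equivalent, up to sign and averaging, to minimizing the standard binary cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def classifier_objective(q, y):
    """Eq. (10). q: (n_variants, 2) network outputs, assumed ordered as
    (q_nonviable, q_viable); y: 1.0 for viable, 0.0 for non-viable."""
    logp = F.log_softmax(q, dim=1)   # log of the softmax over the two outputs
    return (y * logp[:, 1] + (1.0 - y) * logp[:, 0]).sum()
```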
