Deep learning for discriminating non-trivial conformational changes in molecular dynamics simulations of SARS-CoV-2 spike-ACE2

System preparation and molecular dynamics protocolWe obtained the crystal structure of the SARS-CoV-2 spike receptor binding domain (RBD) bound to angiotensin-converting enzyme 2 (ACE2) from PDB: 6M0J and selected amino acid residues 333 to 527, corresponding to the RBD, using PyMOL (PyMOL Molecular Graphics System, Version 2.4 Schrödinger LLC)31. In total, we performed 37 systems based on the RBD FASTA sequence and used a homology model to determine the 3D structures of proteins in SWISS-MODEL32.We estimated protonation states for all residues in both systems from a standard physiological blood pH of 7.4, a salinity of 0.15 M, and internal and external relative permittivities of 10 and 80, respectively, using the H\({++}\) web server33. We parameterized the systems using the tleap tool from AmberTools2134 and solvated them in a cubic box with a minimum edge distance of 12Å. We describe the water, protein, and glycidic fractions of the systems with the AMBER force fields TIP3P, ff14SB, and GLYCAM-06, respectively.We also added ions \(Na^{+}\) and \(Cl^{-}\) to reach a physiological salinity of 0.15 M. After solvation, minimization, equilibration, and productive MD simulation steps were performed using NAMD 2.13 CUDA verbs35. with the AMBER force field. We performed all simulations with a time step of 1 fs using an NPT ensemble, with the temperature and pressure kept constant at 310 K and 1 atm, respectively, by means of a Langevin thermostat and a Langevin piston. We used periodic boundary conditions and a 10 Å electrostatic interaction cut-off point for non-bonded interactions and calculated the long-range electrostatic interactions using Particle Mesh Ewald.A relaxation protocol was applied under NPT conditions to all systems before performing the productive MD simulation, and a multistep equilibration protocol was used. This protocol consisted of (1) 500 minimization steps with harmonic constraints on protein atoms; (2) 500 steps of unconstrained minimization; (3) 300 ps equilibrium with harmonic constraints on protein atoms; (4) 300 ps equilibrium with harmonic constraints on the backbone atoms; (5) 300 ps of unrestricted balance; and (6) 1 ns preproductive MD simulation with reset speeds and no restrictions. Finally, we performed three independent 100 ns MD simulations for each S RBD protein system with the corresponding mutations on final relaxation coordinates on the SDumont supercomputer at the Brazilian National Laboratory for Scientific Computing (LNCC).Trajectory analysisWe performed a root mean square fluctuation analysis per residue (RMSF) against the average frame for the \(C\alpha\) carbons, aligning the entire RBD with the least mobile residues (333–437, 454–456, 492–494 and 509–526). We estimated RMSFs for each of the three replicates individually, as well as for their combined data. Additionally, to evaluate the convergence between the different simulations of each system, two-dimensional (2D) root mean square deviation (2D-RMSD) plots relative to the protein backbone were generated using the cpptraj plug-in36 of AmberTools 2134.DatabaseOur database consists of 38 (thirty-eight) distinct systems, each derived from MD simulations, with most systems containing between 1 (one) and 3 (three) point mutations per mutant (a comprehensive table listing all systems utilized in this study, including their specific RBD mutations, is presented in Supplementary Table 1). 
Database

Our database consists of 38 (thirty-eight) distinct systems, each derived from MD simulations, with most systems containing between 1 (one) and 3 (three) point mutations per mutant (a comprehensive table listing all systems utilized in this study, including their specific RBD mutations, is presented in Supplementary Table 1). Among these, 17 (seventeen) were selected for the machine learning-based analyses, which involved approximately 85K frames. These included the original 2019-nCoV strain, also known as the “wild type” (WT)37; neutral mutants; and variants currently being monitored by the WHO (VBM)38. These variants, previously classified as variants of concern (VOCs) or variants of interest (VOIs), are known for their increased infectivity and transmissibility; we assign the label “\(++\)” to the corresponding systems (Table 1).

Table 1 Database composition.

Furthermore, systems that represent neutral mutants, that is, those that neither increase receptor affinity for ACE2 nor enhance resistance to antibodies, are labeled “\({-}{-}\)”. The list of single mutations, presented in the S:mutation (label) format55,56, is as follows: S:V445L (\({-}{-}\)), S:K417Y (\({-}{-}\)), S:N439R (\({-}{-}\)), S:Q498I (\({-}{-}\)), S:S494K (\({-}{-}\)), S:Y489F (\({-}{-}\)), and S:Y505F (\({-}{-}\))57,58. We labeled the systems according to the published literature. The wild-type strain was labeled “\({-}{-}\)”. Complementarily, we conducted a comprehensive analysis of all molecular dynamics simulations to characterize the systems by mutant type, with particular attention to mobility changes in key regions such as the RBD, the receptor binding motif (RBM), and the loop regions. In this context, we also analyzed systems whose mutations exhibit decreased affinity for ACE2 and increased antibody resistance, labeled “\(-+\)”, as well as those whose mutations exhibit increased affinity for ACE2 and decreased antibody resistance, labeled “\(+-\)”. Table 2 summarizes the mutant types present in the database, along with their associated effects on viral binding affinity and immune evasion.

Table 2 Impact of mutations on viral binding affinity and immune evasion.

The systems correspond to triplicates of 25,000-frame trajectories (dcd files) obtained from the 100 ns simulations, totaling 75,000 frames (300 ns) per system. We sampled 5000 of these frames using a skip of 15 frames. Next, we wrote each of these 5000 frames to a .pdb file and generated 5 (five) clusters per system with the hierarchical agglomerative (hieragglo) average-linkage algorithm59, that is, using the average distance between members of two clusters as calculated in cpptraj36, with a cutoff of 2 Å. Thus, although the sample space is reduced, the most probable conformations remain well represented.
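A minimal sketch of this subsampling and clustering step is shown below; it substitutes SciPy's agglomerative average-linkage clustering for cpptraj's hieragglo purely for illustration, and the pairwise RMSD matrix is a random placeholder for values that would be computed from the sampled frames.

```python
# Frame subsampling (skip of 15) and average-linkage clustering sketch.
# SciPy is used here instead of cpptraj's hieragglo; the RMSD matrix is a placeholder.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

n_total, skip = 75_000, 15
sampled = np.arange(0, n_total, skip)              # 5000 frame indices per system

# Placeholder frame-vs-frame RMSD matrix (symmetric, zero diagonal); in practice
# it would be computed from the 5000 sampled .pdb frames.
n = len(sampled)
rng = np.random.default_rng(0)
pairwise_rmsd = rng.random((n, n))
pairwise_rmsd = (pairwise_rmsd + pairwise_rmsd.T) / 2.0
np.fill_diagonal(pairwise_rmsd, 0.0)

# Hierarchical agglomerative clustering with average linkage, cut into 5 clusters
# (the cpptraj run described above also used a 2 A distance cutoff).
Z = linkage(squareform(pairwise_rmsd, checks=False), method="average")
labels = fcluster(Z, t=5, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])
```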
Feature transformation

Our approach restricts the data representation to structural information, specifically pairwise distances within the receptor binding domain (RBD) of the SARS-CoV-2 S protein (Fig. 1a). From the MD trajectories, we generate 2D distance maps (DMs) that serve as the main geometric descriptors of the structure throughout the simulation. Distance maps are graphical representations that detail the spatial arrangements between all pairs of amino acid residues within a molecule, provided as a 2D matrix of the real-valued distances between residues (distance matrices)60,61,62. Consider a set of \(N\) points in three-dimensional space, where each point is defined by its coordinates \((x_i, y_i, z_i)\) and is represented by the position vector \(\vec {r}_i\). The Euclidean distance \(\delta _{ij}\) between the \(i\)-th and \(j\)-th points is defined by the following equation:

$$\begin{aligned} \delta _{ij} = {\left\{ \begin{array}{ll} 0, & \text {if } i = j \\ \Vert \vec {r}_j - \vec {r}_i\Vert , & \text {otherwise}, \end{array}\right. } \end{aligned}$$

(1)
In the context of structural biology, \(\delta _{ij}\) denotes the Euclidean distance between the \(i\)-th and \(j\)-th amino acid residues. Therefore, the pairwise distance matrix \([\delta _{ij}]_{N \times N}\) is constructed, where each element \(\delta _{ij}\) represents the Euclidean inter-residue distance. The matrix is symmetric, with \(\delta _{ij} = \delta _{ji}\) for all \(i, j\), and its diagonal elements are zero, indicating that the distance from any residue to itself is zero60,63. The complete distance matrix can be represented as

$$\begin{aligned} [\delta _{ij}]_{N \times N} = \begin{bmatrix} 0 & \delta _{12} & \delta _{13} & \dots & \delta _{1N} \\ \delta _{21} & 0 & \delta _{23} & \dots & \delta _{2N} \\ \delta _{31} & \delta _{32} & 0 & \dots & \delta _{3N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \delta _{N1} & \delta _{N2} & \delta _{N3} & \dots & 0 \\ \end{bmatrix} \end{aligned}$$
(2)
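A minimal NumPy sketch of Eqs. (1) and (2), together with the image-conversion step described in the next paragraph, is given below; the coordinates, colormap, and output file name are illustrative placeholders.

```python
# Pairwise C-alpha distance map (Eqs. 1-2), zero-padded and saved as a .png image.
# Coordinates, colormap, and file name are placeholders for illustration.
import numpy as np
import matplotlib.pyplot as plt

def distance_map(coords: np.ndarray) -> np.ndarray:
    """coords: (N, 3) array of C-alpha positions; returns the N x N matrix [delta_ij]."""
    diff = coords[:, None, :] - coords[None, :, :]   # r_j - r_i for every pair
    return np.linalg.norm(diff, axis=-1)             # symmetric, zero diagonal

coords = np.random.rand(194, 3) * 50.0               # placeholder for one frame's 194 CA atoms
dm = distance_map(coords)                            # 194 x 194 distance matrix

# Zero-pad to the 224 x 224 input size expected by the network.
padded = np.zeros((224, 224))
padded[:dm.shape[0], :dm.shape[1]] = dm

plt.imsave("frame_0001.png", padded, cmap="viridis") # hypothetical output file
```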
We selected the coordinates (x, y, z) of the 194 \(\alpha\)-carbon atoms5,60 that constitute the RBD, resulting in distance matrices with dimensions of \(194\times 194\) (Fig. 1b,c). The asymptotic computational complexity of this method can be expressed as \(O(n^2)\), where \(n\) is the total number of residues in the protein. We converted each matrix into a 2D image (.png format) using the Matplotlib library in Python. To ensure compatibility between the map dimensions and the model input, we resized the maps to \(224\times 224\) pixels64 by zero-padding. We implemented the algorithms for generating the DMs in Python (version 3.9.13).

CNN-based model development

Since the problem of identifying subtle conformational changes in MD trajectories falls under the AI-complete category, and considering that DMs have a grid-like topology2, we use the DL technique known as convolutional neural networks (CNN)65. We adopted the VGG (Visual Geometry Group) architecture64 in our methodology. Specifically, we implemented the VGGNet-B variant, characterized by a configuration of 13 trainable weight layers. We represent the DMs as 4D tensors, structured by image dimensions, number of image channels (RGB), and batch size (BS)66. In this context, we employ a batch size of 64 samples, as previous studies have shown that batch sizes ranging from 32 to 256 yield improved results67, as also observed in related applications9. Moreover, we preprocess the DMs by normalizing the pixel values to the range \([0,1]\), which is more suitable for efficient processing by neural networks2. Our implementation follows the VGG configuration64 (Fig. 1d), except for a slight variation, in which we use a single output neuron with a sigmoid activation function, thus transforming the CNN into a binary classifier with 129 M parameters. After each convolutional and pooling layer64, we incorporated batch normalization (BN) and rectified linear unit (ReLU) activation. To avoid the problem of internal covariate shift and to regularize the model, we added BN before the activation layer68. Additionally, we use dropout after the fully connected (FC) layers to mitigate potential overfitting and improve the generalizability of the model2. We chose a dropout rate of 0.5, as values within the range \([0.3, 0.6]\) tend to reduce the error rate during training69. Dropout is particularly beneficial for datasets exceeding 5K samples, such as the dataset of the present problem69. We developed the model using well-established ML and neural network libraries such as TensorFlow (version 2.15.0)70 and Keras66.
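The following Keras sketch illustrates a VGG-B-style binary classifier consistent with the description above (ten convolutional and three fully connected weight layers, BN before ReLU, dropout of 0.5, and a single sigmoid output); the optimizer choice (Adam) and the exact placement of normalization relative to the pooling layers are assumptions not specified in the text.

```python
# Sketch of a VGG-B-style binary classifier for 224x224 RGB distance maps.
# Layer widths follow VGG configuration B; Adam is an assumed optimizer choice.
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters, n_convs):
    # Conv -> BatchNorm -> ReLU, repeated, then 2x2 max pooling.
    for _ in range(n_convs):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)   # BN before the activation
        x = layers.Activation("relu")(x)
    return layers.MaxPooling2D(2)(x)

inputs = tf.keras.Input(shape=(224, 224, 3))        # distance maps as RGB images
x = conv_block(inputs, 64, 2)
x = conv_block(x, 128, 2)
x = conv_block(x, 256, 2)
x = conv_block(x, 512, 2)
x = conv_block(x, 512, 2)
x = layers.Flatten()(x)
x = layers.Dense(4096, activation="relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(4096, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # binary "++" vs "--" classifier

model = models.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()  # on the order of 1.3e8 parameters, consistent with the ~129 M reported
```

In training, the DMs would be passed to model.fit with the batch size of 64 and the 100 epochs stated in the text, with 5-fold cross-validation used for hyperparameter selection.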
Model training

We partitioned the database into four main subsets of trajectories: WT, VOC, VOI, and neutral. For training the model, we used DMs obtained from the MD trajectories of the wild type (WT) and of the Beta and Delta variants of concern (VOCs). We selected these variants because they share most of the mutations observed in the VOIs and are commonly associated with an increased affinity for ACE2 and resistance to natural antibodies or vaccines71. During training, we use cross-validation (CV), a heuristic to minimize the model's generalization error and optimize hyperparameters72. We set aside a fraction \(\gamma < 0.5\) of the training data for validation72 and used k-fold CV.

Fig. 1 Deep learning-based pipeline. (a) Receptor binding domain. (b) Distance map for a frame of the WT MD trajectory. (c) Distance map for a frame of the Beta variant MD trajectory. It is evident that both maps are indistinguishable to the naked eye, highlighting the intricate nature of the problem. (d) VGG-B architecture64. (e) Feature map extracted from the initial block of conv layers (Conv1_2), for the centroid of cluster 0 of the Gamma variant. (f) Projection of the high-intensity pixels of the feature map onto the 3D structure of the RBD.

In k-fold CV, we randomly partition the training set into k mutually exclusive subsets of the same size (n/k), where n is the total number of instances. Validation is then performed on one of the subsets, while the remaining \(k-1\) are used to train the model. This process is repeated k times, and we estimate the average cross-validation error rate (binary cross-entropy loss) as a reference to optimize the model hyperparameters72. We set \(k = 5\), since \(\gamma \ge 0.1\) is commonly recommended and proves effective in various applications72. We selected the optimal parameter configuration based on the error rate estimates73. After tuning the parameters, we used the entire training set to make predictions on the independent test set. We conducted the training in Google's virtual environment, Colab, which provides access to a Jupyter notebook. The computational resources included an NVIDIA A100 GPU with 40 GB of VRAM and an additional 89.6 GB of RAM.

Model evaluation

The test set comprises 14 (fourteen) systems, distributed equally across the classes, totaling 70,014 DMs. We evaluated mutations associated with VOIs such as Iota, Kappa, Lambda, Mu, and Theta, as well as VOCs such as Gamma and Omicron. Notably, key mutations including S:L452R, S:T478K, S:E484K, and S:N501Y were commonly observed in most of these systems, leading us to assign them the label “\(++\)”41,48,71,74. Furthermore, we analyzed the point mutations S:V445L, S:K417Y, S:N439R, S:Q498I, S:S494K, S:Y489F, and S:Y505F, which are considered neutral57, and assigned them the label “\({-}{-}\)”, thus completing the evaluation of the test set. During validation, our optimized model achieved an average error rate of 1% with the following parameters: a learning rate of 1E-3, dropout of 0.5, BS of 64 and 100 training epochs (more details are provided in the Supplementary Material). We used the receiver operating characteristic (ROC) curve (see Fig. 1a in the Supplementary Material) to determine the optimal operating threshold, 0.5, which provided the best balance between recall and the false positive rate (1 - specificity)75. During testing, we derived performance metrics from the confusion matrix, including precision (prec), true positive rate (TPR)/recall (rec), false positive rate (FPR), the \(\hbox {F}_\beta\) score and Matthews' correlation coefficient (MCC)75. For the \(\hbox {F}_\beta\) score, we employed the \(\hbox {F}_1\) score, which balances precision and recall by setting \(\beta =1\). Furthermore, we calculated the area under the ROC curve (AUC). We calculated complementary metrics such as precision and recall for the test set, since relying solely on the error rate can be limiting as a measure of discriminability. Furthermore, we sought to determine the success rate of the model for each class of the problem75.
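For reference, the test-set metrics listed above can be obtained from the model outputs as in the following sketch; the label and score arrays are hypothetical placeholders, and scikit-learn is used purely for illustration.

```python
# Test-set metrics from the confusion matrix: precision, recall/TPR, FPR, F1, MCC, AUC.
# y_true encodes the "++"/"--" labels as 1/0; y_score holds the model's sigmoid outputs.
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0])                # placeholder ground-truth labels
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6])   # placeholder sigmoid outputs
y_pred = (y_score >= 0.5).astype(int)                # 0.5 operating threshold from the ROC analysis

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "precision": precision_score(y_true, y_pred),
    "recall (TPR)": recall_score(y_true, y_pred),
    "FPR": fp / (fp + tn),
    "F1": f1_score(y_true, y_pred),
    "MCC": matthews_corrcoef(y_true, y_pred),
    "AUC": roc_auc_score(y_true, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```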

