Explaining Conformational Diversity in Protein Families through Molecular Motions

Overview of DANCE

DANCE takes as input a set of protein 3D structures (in Crystallographic Information File or CIF format) and outputs a set of protein- or protein family-specific conformational collections or ensembles (in CIF or PDB format). It first clusters and superimposes the input structures based on the similarities found in their corresponding amino acid sequences. Users can choose to analyse all input structures or only those representing monomeric biological units. DANCE then determines the set of principal components sufficient to explain the variability observed within each conformational ensemble. The algorithm unfolds in six main steps depicted in Fig. 1.

a- Extraction of sequences. The first step extracts the one-letter amino acid sequences of all polypeptidic chains contained in the input CIF files. In case of multiple models, DANCE retains only the first one. The names of the residues with resolved 3D coordinates are taken from the _atom_site.label_comp_id column. Residues missing from the protein structure are included as lowercase letters in the sequence if they are defined in the _entity_poly_seq category. This information helps in clustering and aligning the sequences (see below). Otherwise, they are replaced by the “X” symbol. The “X” symbol is also used for unknown amino acid types and for modified amino acids without a close natural neighbour. Sequences comprising fewer than 5 non-“X” residues are then filtered out.

b- Clustering of the sequences. DANCE clusters sequences using MMseqs2 [38]. Users can choose the desired levels of sequence similarity and coverage, both set to 80% by default. The coverage is bidirectional by default. This step outputs a TSV file specifying the clusters.
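As an illustration of how such a cluster file could be consumed downstream, the following minimal sketch groups chain identifiers by cluster. It assumes a two-column, tab-separated layout (cluster representative, then member), which is an assumption on our part and may differ from the file DANCE actually writes. The function name read_clusters and the chain identifiers are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Group member identifiers by cluster representative.

    Assumes two tab-separated columns per line: representative, member.
    This layout is an assumption, not a documented DANCE format.
    """
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


# Hypothetical chain identifiers, for illustration only.
example = "1AKE_A\t1AKE_A\n1AKE_A\t4AKE_A\n1E0D_A\t1E0D_A\n"
clusters = read_clusters(io.StringIO(example))
```

Each key of the returned dictionary then designates one sequence cluster, and its list gives the chains to be aligned together in the next step.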

c- Multiple sequence alignments. DANCE then aligns the sequences within each cluster using MAFFT [39] with default parameters and the BLOSUM62 substitution matrix [40]. It further removes all the columns containing only Xs or gaps, and reorders the sequences according to their PDB codes.
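The column-filtering rule can be sketched as follows. This is a minimal illustration, assuming that gaps are written as "-" and that the aligned sequences are plain strings of equal length; the function name drop_empty_columns is hypothetical and does not come from the DANCE code base.

```python
def drop_empty_columns(aligned, empty_symbols="X-"):
    """Remove alignment columns made only of 'X' symbols or gaps.

    `aligned` is a list of equal-length strings; the gap character '-'
    is an assumption about the alignment format.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column as soon as one sequence carries an informative symbol.
    keep = [
        i for i in range(length)
        if any(seq[i] not in empty_symbols for seq in aligned)
    ]
    return ["".join(seq[i] for i in keep) for seq in aligned]


# Columns 3 and 4 (1-based) contain only gaps or 'X' and are dropped.
filtered = drop_empty_columns(["MK-XA", "MR-X-", "M--XG"])
```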

d- Extraction of structures. DANCE extracts 3D coordinates of the backbone atoms N, C, Cα, and the O atom, of all polypeptidic chains contained in the input CIF files. It reconstructs missing O atoms based on the other atoms’ coordinates. It disregards residues with missing backbone atoms and chains shorter than 5 residues.

e- Generation of the conformational collections. DANCE then uses the sequence clusters defined in (b) to group conformations and the residue matching provided by (c) to superimpose them. The superimposition places their centers of mass at the origin and then determines the optimal least-squares rotation matrix minimizing the Root Mean Square Deviation (RMSD) between any conformation and a reference conformation (see below). This is achieved through the ultrafast Quaternion Characteristic Polynomial method [41,42]. Users can choose to account for all the atoms in the superimposition, or only the Cα atoms. Optionally, users can filter out the conformations with too few (fewer than 5 by default) residues aligning to the reference. As a post-processing step, DANCE reduces structural redundancy. Namely, it removes any conformation A deviating by less than rmscut Å from another one B, provided that the sequence of A is identical to or included in that of B. The value of rmscut is 0.1 Å by default and is customizable by the users. Finally, DANCE saves the conformational ensemble as a multi-model file in PDB or CIF format. Note that the models can display different amino acid sequences. DANCE also outputs the corresponding multiple sequence alignments (MSA) in FASTA format, and the matrix of all-to-all pairwise RMSDs.
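To make the superimposition step concrete, the sketch below centres two matched coordinate sets and finds the least-squares rotation. Note that it uses the SVD-based Kabsch algorithm, not the Quaternion Characteristic Polynomial method used by DANCE; both target the same optimal rotation. The function name superimpose is hypothetical, and equal atomic masses are assumed so that the center of mass reduces to the mean position.

```python
import numpy as np


def superimpose(mobile, reference):
    """Superimpose `mobile` onto `reference`, both of shape (m, 3).

    Returns the centred, rotated mobile coordinates and the RMSD.
    Kabsch algorithm (SVD), swapped in here for the QCP method.
    """
    mobile_c = mobile - mobile.mean(axis=0)
    reference_c = reference - reference.mean(axis=0)
    # Cross-covariance between the two centred point sets.
    h = mobile_c.T @ reference_c
    u, _, vt = np.linalg.svd(h)
    # Enforce a proper rotation (determinant +1), excluding reflections.
    d = np.sign(np.linalg.det(u @ vt))
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    aligned = mobile_c @ rotation
    rmsd = float(np.sqrt(((aligned - reference_c) ** 2).sum(axis=1).mean()))
    return aligned, rmsd
```

Under this sketch, the redundancy filter would compare the returned RMSD with rmscut for pairs of conformations whose sequences satisfy the inclusion condition.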

f- Extraction of linear motions. DANCE performs PCA on the 3D coordinates from each collection. This dimensionality reduction technique identifies orthogonal linear combinations of the variables, namely the Cartesian coordinates, maximally explaining their variance (see below). These linear combinations, which we refer to as principal components or PCA modes, represent directions in the 3D space for every atom. Deforming the protein structure using these components produces motions that connect the conformations observed in the collection. For the sake of simplicity, we directly refer to the principal components as linear motions, although they may not represent actual physical motions undergone by the protein. Furthermore, we estimate the intrinsic dimensionality of the linear motion manifold underlying an ensemble’s conformational variability as the number of principal components explaining essentially all its positional variance. The higher the dimensionality, the more complex the linear motions.
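The following sketch illustrates how PCA modes and the number of components needed to reach a given share of the variance could be obtained with NumPy. It assumes that the superimposed coordinates are stored row-wise in an array of shape (n, 3m). The function names and the default fraction of 0.9 are illustrative choices, not DANCE parameters.

```python
import numpy as np


def pca_modes(coords, center=None):
    """Return eigenvalues (descending) and modes (columns) of the ensemble.

    `coords` has shape (n, 3m); `center` defaults to the average conformation.
    """
    coords = np.asarray(coords, dtype=float)
    if center is None:
        center = coords.mean(axis=0)
    centered = coords - center
    # Thin SVD: the rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = singular_values ** 2 / (coords.shape[0] - 1)
    return eigenvalues, vt.T


def n_modes_for_variance(eigenvalues, fraction=0.9):
    """Smallest number of leading modes explaining at least `fraction`."""
    cumulative = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    count = int(np.searchsorted(cumulative, fraction)) + 1
    return min(count, len(eigenvalues))
```

For an ensemble undergoing a single rigid opening motion, one would expect n_modes_for_variance to return 1, whereas several independent motions would yield a higher count.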

Fig. 1 Outline of the study. Our approach, DANCE, exploits both amino acid sequences and 3D coordinates. We applied it to all experimentally determined protein-containing 3D structures from the PDB. Alternatively, users can provide a custom set of experimental structures or predicted models. DANCE first concentrates on sequences. It extracts them from the input structures (a) and clusters them with MMseqs2 based on user-defined similarity and coverage thresholds (b). For each cluster, it generates a multiple sequence alignment using MAFFT (c). It then extracts all 3D coordinates (d), groups the conformations according to the clusters identified in step b and superimposes them to generate conformational ensembles (e). The superimposition aims at minimizing the Root Mean Square Deviation to a chosen reference, using the alignments produced by step c for mapping the residues. The examples of the bacterial enzymes adenylate kinase (in grey, reference PDB code: 1AKEA) and MurD (in blue, 1E0DA), and the murine ABC transporter P-glycoprotein (5KOYB) are depicted. The arrows indicate adenylate kinase’s main motion. The horizontal lines behind the P-glycoprotein indicate the boundaries for the membrane bilayer. Finally, DANCE summarises conformational diversity through Principal Component Analysis (f). We further assessed the ability of classical manifold learning techniques to reconstruct and extrapolate conformations.

Choosing a reference

We choose the reference conformation for the superimposition as the one with the amino acid sequence most representative of the MSA. For this, we first determine the consensus sequence s* by identifying the most frequent symbol at each position. We consider “X” symbols as equivalent to gaps. Hence, each position is described by a 21-dimensional vector giving the frequencies of occurrence of the 20 amino acid types and of the gaps.
In case of ambiguity, we prefer an amino acid over a gap, hence longer sequences over shorter ones, and an amino acid with a higher BLOSUM62 score over a lower-scored one. Then, we compute a score for each sequence s in the MSA reflecting its similarity to s* and expressed as, $$\,\mathrm{score}\,(s)={\sum }_{i=1}^{P}\sigma ({s}_{i},{s}_{i}^{* }),$$
(1)
where P is the number of positions in the MSA and $\sigma ({s}_{i},{s}_{i}^{* })$ is the BLOSUM62 substitution score between the amino acid ${s}_{i}$ at position i in sequence s and the consensus symbol ${s}_{i}^{* }$ at position i. We set the gap score to ${\min }_{a,b}(\sigma (a,b))-1=-5$.

Judging the quality of the MSA

We compute the identity level of an MSA as the average percentage of sequence pairs sharing the same amino acid in a column, and the coverage as the percentage of positions having less than 20% gaps. In addition, we evaluate the global quality of the MSA with a sum-of-pairs score, with ${\sigma }_{match}=1$ and ${\sigma }_{mismatch}={\sigma }_{gap}=-0.5$. We normalise the raw sum-of-pairs scores by dividing them by the maximum expected values. The final score for an MSA is thus expressed as, $${\mathrm{score}}_{rel}(MSA)=\frac{\mathrm{score}\,(MSA)}{\left(\begin{array}{c}n\\ 2\end{array}\right){L}_{eff}},$$
(2)
where $\mathrm{score}\,({\rm{MSA}})$ is the raw MSA score, n is the number of chains or sequences, and ${L}_{eff}$ is the effective length of the MSA, computed as, $${L}_{eff}={\max }_{s\in {\mathscr{S}}}{\sum }_{i=1}^{L(s)}{\mathbb{I}}\{{s}_{i}\in {\mathscr{A}}\},$$
(3)
where ${\mathbb{I}}$ is the indicator function, ${\mathscr{S}}$ is the set of sequences comprised in the MSA, L(s) is the length of the aligned sequence s, and ${\mathscr{A}}$ is the 20-letter amino acid alphabet (i.e., excluding gap characters).

Extracting linear motions

The Cartesian coordinates of each conformational ensemble can be stored in a matrix R of dimension n × 3m, where n is the number of conformations and m is the number of positions in the associated MSA. Each position is represented by a Cα atom. We compute the covariance matrix as, $$C=\frac{1}{n-1}{({R}^{c})}^{T}{R}^{c}=\frac{1}{n-1}{(R-\bar{R})}^{T}(R-\bar{R}),$$
(4)
where $\bar{R}$ is obtained by averaging the coordinates over the conformations. Alternatively, the users can choose to center the data on the reference conformation. The covariance matrix is a 3m × 3m square matrix, symmetric and real.

The PCA consists in decomposing C as $C=VD{V}^{T}$, where V is a 3m × 3m matrix in which each column defines an eigenvector or a PCA mode that we interpret as a linear motion, and D is a diagonal matrix containing the eigenvalues. The sum of the eigenvalues ${\sum }_{k=1}^{3m}{\lambda }_{k}$ amounts to the total positional variance of the ensemble. The portion of the total variance explained by the kth eigenvector or linear motion is estimated as $\frac{{\lambda }_{k}}{{\sum }_{k^{\prime} =1}^{3m}{\lambda }_{k^{\prime} }}$.

In addition, we estimate the collectivity [43,44] of the kth eigenvector as, $$\,\mathrm{coll}\,({{\bf{v}}}_{k})=\frac{1}{m}\exp \left(-{\sum }_{i=1}^{3m}{v}_{ki}^{2}\log {v}_{ki}^{2}\right).$$
(5)
If $\mathrm{coll}({{\bf{v}}}_{k})=1$, then the corresponding motion is maximally collective and has all the atomic displacements identical. In case of an extremely localised motion, where only one single atom is affected, the collectivity is minimal and equals 1/m.

We also apply PCA to the correlation matrix computed by normalising the covariance matrix as, $${\mathrm{Cor}}_{i,j}=\frac{{C}_{i,j}}{\sqrt{{C}_{i,i}}\sqrt{{C}_{j,j}}}.$$
(6)
In that case, the sum of the eigenvalues ${\sum }_{k=1}^{3m}{\lambda }_{k}$ amounts to 3m, the trace of the correlation matrix, since every coordinate has unit variance.

Handling missing data

As stated above, the conformations in a collection may have different lengths reflected by the introduction of gaps in the associated MSA. We fill these gaps with the coordinates of the conformation used to center the data (average conformation, by default). In doing so, we avoid introducing biases through reconstruction of the missing coordinates. Moreover, this operation results in low variance for highly gapped positions, thus limiting their contribution to the extracted motions. To go further and explicitly account for data uncertainty, we implemented a weighting scheme. Specifically, DANCE assigns confidence scores to the residues and includes them in the structural alignment step and the PCA. The confidence score of a position i reflects its coverage in the MSA, ${w}_{i}=\frac{1}{n}{\sum }_{S}{{\mathbb{1}}}_{{a}_{i}^{S}\ne {\rm{X}}}$, where “X” is the symbol used for gaps. The structural alignment of the jth conformation onto the reference conformation amounts to determining the optimal rotation that minimises the following function [45], $$E=\frac{1}{{\sum }_{i}{w}_{i}}{\sum }_{i}{w}_{i}{\left({r}_{ij}^{c}-{r}_{i0}^{c}\right)}^{2},$$
(7)
where ${r}_{ij}^{c}$ is the ith centred coordinate of the jth conformation and ${r}_{i0}^{c}$ is the ith centred coordinate of the reference conformation. The resulting aligned coordinates are then multiplied by the confidence scores prior to the PCA.

Implementation details

We implemented DANCE in C/C++ and Python. It relies on the C++ GEMMI library [46] to parse the CIF files and manipulate the structures. It runs MMseqs2 through the following command: cluster DB clusterDB tmp --cov-mode 0 -c $cov --min-seq-id $id. It launches MAFFT with the options --auto, --amino and --preservecase. The multiple sequence alignment and structure superimposition steps are parallelized. For the PCA, we use the singular value decomposition (SVD) implemented in NumPy [47] on the R matrix directly. SVD is computationally more advantageous when 3m ≫ n, which is typically the case of our data, since we only compute the required number of n components. We created structure visualisations in PyMOL v2.5.0 [48].

Application and extension of DANCE

DANCE is applicable to experimental 3D structures as well as predicted 3D models, as long as they comply with the CIF standards.

Describing conformational variability over the whole PDB

We applied DANCE to all 748 297 protein chains with experimentally resolved 3D structures available in the PDB, as of June 2023. We downloaded all the PDB entries in CIF format from the RCSB [49]. We replaced the raw CIF files with their updated and optimised versions from PDB-REDO whenever possible [50]. It took about 2.25 hours to run DANCE on the whole PDB on a desktop computer with an Intel Xeon W-2245 @ 3.90 GHz and 32 GB of RAM (Supplementary Table S1). The most time-consuming steps are the extraction and superimposition of the 3D structures to create the conformational ensembles. We ran DANCE at eight different levels of sequence similarity, designated as ${{\rm{l}}}_{cov}^{id}$, where id and cov are the sequence identity and coverage thresholds, respectively, and range from 50 to 80%.
To investigate how the ensembles transformed across levels, we focused on the 18 616 conformational ensembles detected in the most relaxed set up, namely at 30% identity and 50% coverage (${{\rm{l}}}_{50}^{30}$). For each ensemble, we extracted its reference protein chain and we traced back the conformational ensembles to which it belonged upon progressively applying stricter thresholds.

Focusing on the ABC superfamily

We extended DANCE usage beyond the single-chain and sequence-similarity paradigms to describe the conformational variability of ABC (ATP Binding Cassette) transporters. We retrieved a set of 354 ABC protein experimental 3D structures from https://abc3d.hegelab.org [26]. They correspond to functionally relevant states annotated as biological units in the PDB. In most of these structures, several polypeptidic chains, typically 2 or 4, encode the two nucleotide-binding domains (NBDs) and two transmembrane domains (TMDs) of the ABC architecture. In addition, some structures contain several ABC protein copies or some ABC protein cellular partners (small molecules, substrate peptides, interacting proteins). We chose the murine ABC transporter P-glycoprotein (5KOYA) as reference for the subsequent analysis. Its 1182-residue-long single polypeptidic chain encodes the full-length transporter architecture.

To cope with the high sequence divergence of the ABC superfamily, we relied on structural similarity for grouping and matching the ABC conformations. Specifically, we used the method Foldseek [51] to identify structures sharing significant similarity with the reference and align them. We performed a first screen by querying the reference against all individual chains (1 244 in total) and defined significant hits as those with an e-value lower than 10.0. Then, for each structure, we estimated an upper bound on its coverage of the reference by summing up the reference residue ranges appearing in the alignments associated with its significant hits.
We filtered out the structures with coverage upper bounds lower than 90%. We performed a second screen by querying the reference against the 209 remaining structures defined as monomers by concatenating their chains. We identified two structures (5NIK, 5NIL) spanning less than 90% of the reference. Permuting their chains did not increase their coverage and thus we removed them. To further detect potentially suboptimal chain orderings, we computed reference to target residue span ratios. We identified one structure, namely 7AHD, with a highly imbalanced ratio of 1.6. Such a high value is indicative of large parts of the reference that could not be aligned to the target structure. Permuting the four chains (A,B,C,D) of 7AHD into (A,D,B,C) led to a more balanced ratio of 0.86. We did not observe discrepancies for other structures and thus we retained their original chain ordering. Finally, we removed the structures with low-quality alignments, i.e., with more than 200 gaps or with a continuous gapped region of more than 60 positions.

Among the 195 structures finally selected, 4F4C, 7SHN and 7AHD contained unknown or unrecognized amino acids which we removed. We ran Foldseek one more time to generate a structure similarity-based multiple sequence alignment centred on the reference 5KOYA. We trimmed the alignment and the 3D structures by removing the residues inserted with respect to the reference. We gave the trimmed alignment and 3D coordinate files as input to DANCE, starting directly from step d (see the overview of the DANCE algorithm above). For consistency and comparison purposes, we asked DANCE to center the data on the reference. To mitigate the impact of potential alignment errors, we applied weights reflecting position-specific confidence scores (see above, Handling missing data).
DANCE’s structural redundancy reduction step removed 7 conformations, resulting in an ensemble of 188 conformations.

We compared this ensemble with those generated by DANCE’s default sequence similarity-based end-to-end procedure applied to the whole PDB. More specifically, we took the ensembles generated at ${{\rm{l}}}_{80}^{80}$ and ${{\rm{l}}}_{50}^{30}$ and containing 5KOYA and we rebuilt them with DANCE, applying the 5KOYA centering and the uncertainty weighting scheme. We estimated the similarity between the ensembles’ motion subspaces as the Root Mean Square Inner Product (RMSIP) [52,53]. The latter measures the overlap between all pairs of the first l PCA modes and is defined as, $$\,\mathrm{RMSIP}\,=\sqrt{\frac{1}{l}{\sum }_{i=1}^{l}{\sum }_{j=1}^{l}{({{\bf{v}}}_{i}^{{{\mathscr{S}}}_{{\mathscr{A}}}}\cdot {{\bf{v}}}_{j}^{{{\mathscr{S}}}_{{\mathscr{B}}}})}^{2}},$$
(8)
where ${{\bf{v}}}_{i}^{{{\mathscr{S}}}_{{\mathscr{A}}}}$ and ${{\bf{v}}}_{j}^{{{\mathscr{S}}}_{{\mathscr{B}}}}$ are the ith and jth PCA modes extracted from the conformational ensembles ${{\mathscr{S}}}_{{\mathscr{A}}}$ and ${{\mathscr{S}}}_{{\mathscr{B}}}$, and l is the number of modes considered for the comparison. Moreover, we monitored the distance between the geometric centres of the two NBDs defined by the Cα atoms of residues numbered 346-596 and 929-1182, respectively, in the reference 5KOYA.

Benchmarking for the generation of unseen conformations

We further investigated whether the extracted linear principal components could be useful to predict unseen conformations. Moreover, since the manifold underlying our data is a priori non-linear, we tested whether non-linear methods could achieve better reconstructions than linear PCA. We focused on the widely used kernel Principal Component Analysis (kPCA) [54,55] and the uniform manifold approximation and projection (UMAP) [56].

Dimension reduction with non-linear kernel PCA

The intuition behind kPCA is to map the input data points to a higher dimensional space where they will be linearly separable by a classical PCA. The mapping function $\phi \,:{{\mathbb{R}}}^{3m}\to {{\mathbb{R}}}^{M}$ is not known. Instead of explicitly calculating it, we use a kernel function $k({{\bf{r}}}_{i},{{\bf{r}}}_{j})=\phi {({{\bf{r}}}_{i})}^{T}\phi ({{\bf{r}}}_{j})$, where ${{\bf{r}}}_{i}$ and ${{\bf{r}}}_{j}$ are two conformations (rows of the matrix R). We considered three commonly used kernels,

the polynomial kernel $k({{\bf{r}}}_{i},{{\bf{r}}}_{j})={\left(\frac{1}{2{\sigma }^{2}}{{\bf{r}}}_{i}{{\bf{r}}}_{j}^{T}+c\right)}^{d}$, where c = 1 and d = 3 by default,

the sigmoid kernel $k({{\bf{r}}}_{i},{{\bf{r}}}_{j})=\tanh \left(\frac{1}{2{\sigma }^{2}}{{\bf{r}}}_{i}{{\bf{r}}}_{j}^{T}+c\right)$, where c = 1 by default,

and the radial basis function (RBF) or Gaussian kernel $k({{\bf{r}}}_{i},{{\bf{r}}}_{j})=\exp \left(-\frac{d{({{\bf{r}}}_{i},{{\bf{r}}}_{j})}^{2}}{2{\sigma }^{2}}\right)$, where d(ri, rj) is the Euclidean distance between the two conformations ri and rj.
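The three kernels listed above can be written compactly for a pair of conformations stored as flat coordinate vectors. This is a minimal sketch following the formulas given in the text; the function names are ours, and the bandwidth sigma has no default since its value was explored as a hyperparameter.

```python
import numpy as np


def polynomial_kernel(ri, rj, sigma, c=1.0, d=3):
    """Polynomial kernel with scaling 1 / (2 sigma^2), offset c and degree d."""
    return (np.dot(ri, rj) / (2.0 * sigma ** 2) + c) ** d


def sigmoid_kernel(ri, rj, sigma, c=1.0):
    """Sigmoid kernel with scaling 1 / (2 sigma^2) and offset c."""
    return np.tanh(np.dot(ri, rj) / (2.0 * sigma ** 2) + c)


def rbf_kernel(ri, rj, sigma):
    """Gaussian kernel based on the squared Euclidean distance."""
    squared_distance = np.sum((ri - rj) ** 2)
    return np.exp(-squared_distance / (2.0 * sigma ** 2))
```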

We explored different values of the hyperparameter σ. For sufficiently large values, i.e., $\frac{1}{2{\sigma }^{2}}{{\bf{r}}}_{i}{{\bf{r}}}_{j}^{T}\ll 1$ or $\frac{1}{2{\sigma }^{2}}d{({{\bf{r}}}_{i},{{\bf{r}}}_{j})}^{2}\ll 1$, the kernel becomes effectively linear.

Thus, given the input coordinates R representing n conformations, we computed the corresponding kernel matrix K of dimension n × n and decomposed it using the classical PCA. The resulting principal components $\{{{\boldsymbol{\nu }}}_{{\bf{1}}},{{\boldsymbol{\nu }}}_{{\bf{2}}},\ldots ,{{\boldsymbol{\nu }}}_{{\bf{n}}}\}$ can then be expressed as, $${{\boldsymbol{\nu }}}_{{\bf{j}}}={\sum }_{i=1}^{n}{a}_{ji}\phi ({{\bf{r}}}_{i}),\,\mathrm{where}\,{a}_{ji}=\frac{1}{{\lambda }_{j}(n-1)}\phi {({{\bf{r}}}_{i})}^{T}{{\boldsymbol{\nu }}}_{{\bf{j}}}.$$
(9)
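The kernel decomposition can be illustrated with a short NumPy sketch. It builds a Gaussian kernel matrix, centres it in feature space, and diagonalises it. The normalisation of the projections follows a common convention (eigenvectors scaled by the square root of their eigenvalues) and may differ from the coefficients of Eq. 9 by a constant factor; the function names are hypothetical.

```python
import numpy as np


def rbf_kernel_matrix(coords, sigma):
    """Gaussian kernel matrix for conformations stored as rows (n, 3m)."""
    sq_norms = (coords ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * coords @ coords.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def kernel_pca(kernel_matrix, n_components):
    """Return leading eigenvalues and training projections of a kernel matrix."""
    n = kernel_matrix.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    # Centre the data implicitly in the feature space.
    centred = (kernel_matrix - one_n @ kernel_matrix
               - kernel_matrix @ one_n + one_n @ kernel_matrix @ one_n)
    eigenvalues, eigenvectors = np.linalg.eigh(centred)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Assumed convention: projections = eigenvectors * sqrt(eigenvalues).
    projections = eigenvectors * np.sqrt(np.maximum(eigenvalues, 0.0))
    return eigenvalues, projections
```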
Uniform manifold approximation and projection

The UMAP algorithm first builds a graph representing the data in the ambient space, and then determines the most similar graph in a lower dimension. It relies on the assumptions that there exists a low-dimensional manifold on which the original data would be uniformly distributed and that this manifold is locally connected. Under such assumptions, any ball of fixed volume on the low-dimensional manifold should contain approximately the same number of points. Thus, to build the graph, UMAP defines balls in the ambient space centred at each point and encompassing its ${n}_{neigh}$ nearest neighbours. The balls have variable sizes that reflect the topology of the dataset in the ambient space. UMAP then connects points whose corresponding balls overlap and computes the edge weights by combining the balls’ radii. The resulting graphical representation is projected into a lower-dimensional space by minimising the cross entropy between the high- and low-dimensional graphs, which can be viewed as a force-directed graph layout algorithm. We explored two hyperparameters, namely the number of neighbours ${n}_{neigh}$ controlling the balls’ radii and the minimum distance ${d}_{min}$ apart that points are allowed to be in the low-dimensional representation. Low values of ${n}_{neigh}$ will make UMAP focus on local details of the dataset topology while high values will account for more global properties. Increasing ${d}_{min}$ will push points far from each other in the representation space.

Generating conformations

For linear PCA, generating 3D conformations by combining the principal components is straightforward. More specifically, given a set of l PCA modes computed from the coordinates R, we generate a new conformation ${{\bf{r}}}_{{\bf{pred}}}^{\ast }$ as, $${{\bf{r}}}_{{\bf{pred}}}^{\ast }={{\bf{p}}}^{\ast }{V}_{l}^{T}+\bar{{\bf{r}}},$$
(10)
where the matrix ${V}_{l}\in {{\mathbb{R}}}^{3m\times l}$ contains the modes, $\bar{{\bf{r}}}\in {{\mathbb{R}}}^{3m}$ is the average conformation, and ${{\bf{p}}}^{\ast }\in {{\mathbb{R}}}^{l}$ is a point in the l-dimensional representation space defined by the modes. The coordinates of p* specify the amplitudes of the modes.

For kPCA and UMAP, we need to learn an inverse transform function that maps points in the l-dimensional representation space defined by the components back to the input space. This problem is known as the pre-image problem. To solve it for kPCA, we used kernel ridge regression of the input coordinates R on their low-dimensional projections in the representation space as described in [57,58] and implemented in the scikit-learn Python library [59]. The contribution of the L2-norm regularisation is controlled through the hyperparameter α. More technically, α connects the squared L2-norm between a point in the representation space and its reconstruction with the squared L2-norm of the kernel weights used for the reconstruction. In the case of UMAP, we used the built-in inverse_transform function [60]. It relies on stochastic gradient descent to minimise the cross entropy between the low-dimensional graph and its high-dimensional pre-image graph.

Leave-one-cluster-out cross-validation procedure

We assessed the predictive performance of PCA and kPCA with a leave-one-out cross-validation procedure. Since the conformations are not evenly distributed within an ensemble, we grouped them into clusters prior to the evaluation. We performed the clustering in the l-dimensional PCA representation space, where l is the minimal number of linear components sufficient to explain 90% of the ensemble’s total positional variance. We used the k-means clustering [61] with k = l + 2.

Given a clustered ensemble, we systematically tested the ability of the principal modes inferred from l + 1 clusters to predict the conformations belonging to the held-out cluster.
We reconstructed each test conformation r* from its projection p* in the l-dimensional representation space. For the classical PCA, we computed the projection as, $${{\bf{p}}}^{\ast }=({{\bf{r}}}^{\ast }-\bar{{\bf{r}}}){V}_{l}.$$
(11)
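Equations 10 and 11 translate directly into a projection and a reconstruction step. The sketch below assumes that conformations are flat row vectors of length 3m and that the modes are stored as the columns of a 3m × l array. The function names are ours, and the RMSD is computed without any additional superimposition.

```python
import numpy as np


def project(conformation, mean, modes):
    """Eq. 11: coordinates of a conformation in the l-dimensional space."""
    return (conformation - mean) @ modes


def reconstruct(point, mean, modes):
    """Eq. 10: conformation generated from a point of the representation space."""
    return point @ modes.T + mean


def rmsd(conf_a, conf_b):
    """RMSD between two flat (3m,) conformations, without re-superimposition."""
    diff = (conf_a - conf_b).reshape(-1, 3)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

The reconstruction error of a held-out conformation r would then read rmsd(reconstruct(project(r, mean, modes), mean, modes), r).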
For the kPCA, the projection onto the principal component ${{\boldsymbol{\nu }}}_{{\bf{j}}}$ is expressed as, $$\phi {({{\bf{r}}}^{\ast })}^{T}{{\boldsymbol{\nu }}}_{{\bf{j}}}={\sum }_{i=1}^{n}{a}_{ji}\phi {({{\bf{r}}}_{i})}^{T}\phi ({{\bf{r}}}^{\ast })={\sum }_{i=1}^{n}{a}_{ji}k({{\bf{r}}}_{i},{{\bf{r}}}^{\ast }).$$
(12)
We evaluated the reconstruction error as the RMSD between the predicted conformation ${{\bf{r}}}_{{\bf{pred}}}^{\ast }$ and the original conformation r*.

Distance to the training set

We estimated the difficulty of reconstructing a given conformation by computing its distance to the convex hull defined by the conformations used for training in the l-dimensional representation space. Setting the number of clusters in the training set to l + 1 ensures that the convex hull will be a polytope of dimension at least l. For instance, in 1 dimension, we need at least 2 affinely independent points to define a 1-polytope. The explicit computation of the convex hull of n points in l dimensions is an operation whose complexity is of the order of $O({n}^{l/2})$ [62] and rapidly becomes computationally infeasible as the value of l increases. Nevertheless, the calculation of the distance of a given point to the hull does not require computing the convex hull explicitly and is a much simpler computational problem. It can be solved in quasilinear time with quadratic programming (QP). Here, we used the efficient and exact QP simplex solver proposed in [63] and implemented in the Computational Geometry Algorithms Library (CGAL) [64]. It takes advantage of the low dimensionality of the representation space by observing that the closest features of two l-polytopes are always determined by at most l + 2 points.

In order to compare distances across systems of different sizes, we scale them by the square root of the number of positions m, $${d}^{norm}=\frac{d}{\sqrt{m}}.$$
(13)
This normalisation also allows relating distances in the representation space with RMS deviations in the 3D Cartesian space. Indeed, let us consider an ensemble of conformations exhibiting a purely one-dimensional motion. Any two conformations separated by an RMSD of 1 Å in the original 3D space will be separated by a normalised distance of 1 Å in the one-dimensional representation space.

Interpolating between states

We generated interpolation trajectories between ATPase states with PCA and kPCA. We started from the conformational clusters defined in the leave-one-out procedure and identified clusters 0 and 4 as the most extreme ones along the first PCA component. We then used these two clusters only to learn PCA and kPCA low-dimensional representation spaces. We computed the coordinates of the clusters’ centres in these spaces and defined interpolation trajectories between them with 50 regularly spaced intermediate points. We then generated 50 conformations from the 50 intermediate points. We finally determined the minimal RMS deviation between each generated conformation and the known conformations from clusters 1, 2 and 3. We qualitatively compared these trajectories with physics-based non-linear trajectories computed with NOLB [65]. NOLB extracts normal modes from a starting conformation and models the transition to a target conformation as a series of twists extrapolated from these modes with optimal amplitudes, as described in [66]. We chose 1KJUA from cluster 0 as the starting conformation and 1T5SA from cluster 4 as the target conformation.
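The interpolation procedure can be sketched in a few lines. The code below places regularly spaced points between two cluster centres in the representation space and, for conformations already generated from these points, reports the minimal RMSD to a set of known conformations. We assume here that the 50 intermediate points exclude the two end points and that no further superimposition is applied before computing the RMSD; both are assumptions of this sketch, and the function names are hypothetical.

```python
import numpy as np


def interpolate(start, end, n_points=50):
    """Regularly spaced points strictly between `start` and `end`."""
    fractions = np.linspace(0.0, 1.0, n_points + 2)[1:-1]
    return start + fractions[:, None] * (end - start)


def min_rmsd_to_known(generated, known):
    """Minimal RMSD of each generated conformation to the known ones.

    Both arrays store flat conformations as rows of length 3m.
    """
    m = generated.shape[1] // 3
    diffs = generated[:, None, :] - known[None, :, :]
    rmsds = np.sqrt((diffs ** 2).sum(axis=2) / m)
    return rmsds.min(axis=1)
```

In the linear PCA case, each interpolated point would be mapped back to Cartesian coordinates with Eq. 10 before calling min_rmsd_to_known.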
