Structure prediction of alternative protein conformations

Proteins with alternative conformations not present in the AlphaFold2 training set

AlphaFold2 (AF) was trained on all structures in the PDB14 with a maximum release date of 30 April 2018. To analyse whether AF can predict alternative conformations, it is necessary to first obtain structures with conformations not present in this training set. We obtained such a set as follows:

1. We selected all monomeric protein structures from the PDB on 2023-07-05 determined by X-ray diffraction or electron microscopy (EM) with a resolution ≤5 Å. We excluded NMR structures as the flexibility in these may give rise to the appearance of alternative conformations where there are none. In total, 69,861 structures were obtained: 51,183 were deposited on or before 30 April 2018 and 18,678 after.

2. We extracted the first protein chain in each PDB file and the corresponding sequences with less than 80% non-regular amino acids and more than 50 residues. 68,953 of the 69,861 proteins (99%) fulfilled these criteria.

3. We clustered the sequences at 30% identity using MMseqs2 (version f5f780acd64482cd59b46eae0a107f763cd17b4d)23. This is a more stringent cutoff than the 40% identity used in AlphaFold21, chosen to ensure that similar structures are indeed captured within the same clusters:

mmseqs easy-cluster examples/DB.fasta clusterRes tmp --min-seq-id 0.3 -c 0.8 --cov-mode 1

4. This resulted in 5452 clusters with more than one entry (10,116 clusters in total). For the sequence clusters with more than one entry, we used TM-align (version f0824499d8ab4fa84b2e75d253de80ab2c894c56)15 to perform pairwise structural alignment of all entries within each cluster. Entries were grouped into the same structural cluster if their pairwise TM-score was ≥0.8 (a minimal sketch of this grouping is shown after this list). 64,208 out of the 68,953 structures (93%) could be clustered into 6696 structural clusters.

5. We then checked for sequence clusters whose members fall into different structural clusters (different conformations) before and after 30 April 2018. We obtained 900 putative alternative conformations.

6. To see if these are truly alternative conformations and not a result of sequence variation, we clustered the sequences again at 90% identity using MMseqs2:

mmseqs easy-cluster examples/DB.fasta clusterRes tmp --min-seq-id 0.9 -c 0.8 --cov-mode 1

7. We obtained 14,662 clusters from the 64,208 sequences in total. Of these, 7806 clusters have more than one member and 64 map to alternative conformations (some with as many as nine different ones).

8. To confirm that these are truly alternative conformations and not a result of length differences or low resolution, we performed a manual check. Some putative alternative conformations are a result of variations in N/C-terminal loops, mutations or disconnected chains (breaks), and such structures were excluded. The manual check resulted in 38 alternative conformations in total.
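The structural clustering in step 4 can be scripted around the TM-align binary. The sketch below is a minimal illustration rather than the exact pipeline: it assumes a `TMalign` executable on PATH, parses only the first TM-score the program reports (the normalisation choice may differ from the paper's), and uses a simple single-linkage grouping at TM-score ≥0.8.

```python
import itertools
import re
import subprocess

def tm_score(pdb_a: str, pdb_b: str) -> float:
    """Run TM-align on two PDB files and parse the first reported TM-score.
    Assumes a TM-align executable named 'TMalign' is on PATH (illustrative setup)."""
    out = subprocess.run(["TMalign", pdb_a, pdb_b],
                         capture_output=True, text=True, check=True).stdout
    return float(re.search(r"TM-score=\s*([0-9.]+)", out).group(1))

def structural_clusters(members, cutoff=0.8):
    """Single-linkage grouping of one sequence cluster's structures at TM-score >= cutoff."""
    clusters = []  # list of sets of PDB file paths
    for a, b in itertools.combinations(members, 2):
        if tm_score(a, b) < cutoff:
            continue
        hit = [c for c in clusters if a in c or b in c]
        merged = set().union({a, b}, *hit)
        clusters = [c for c in clusters if c not in hit] + [merged]
    # structures that never reach the cutoff end up as singleton clusters
    clustered = set().union(*clusters) if clusters else set()
    clusters += [{m} for m in members if m not in clustered]
    return clusters
```

In this sketch, each 30% sequence cluster from step 3 would be passed to `structural_clusters` as a list of PDB file paths.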

Proteins with alternative conformations in the PDB

To assess all proteins with alternative conformations from the PDB, we continue from step 4 above using the structural clusters at 0.8 TM-score. We skip step 5 and go directly to step 6 (as no date cutoff is necessary here) to see if these are truly alternative conformations and not a result of sequence variations. Figure 6 shows the entire data selection workflow starting from step 1 above.

After step 6 (clustering at 90% sequence identity), we analyse all sequence clusters at 90% (14,662 clusters) that contain at least two different structural clusters. In total, there are 627 such sequence clusters.
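A minimal sketch of this selection, assuming hypothetical dictionaries `seq_cluster_of` and `struct_cluster_of` that map each chain identifier to its 90% sequence cluster and its structural cluster, respectively:

```python
from collections import defaultdict

def multi_conformation_clusters(seq_cluster_of, struct_cluster_of):
    """Return the 90% sequence clusters whose members span at least two structural clusters."""
    struct_per_seq = defaultdict(set)
    for chain, seq_cluster in seq_cluster_of.items():
        struct_per_seq[seq_cluster].add(struct_cluster_of[chain])
    return {sc: structs for sc, structs in struct_per_seq.items() if len(structs) >= 2}
```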

To reduce the number of necessary manual checks, we perform pairwise sequence alignment of all sequences within each structural cluster of the 90% sequence clusters. Using these alignments, we extract the corresponding structural regions and perform another structural comparison with TM-align. Only if the structures are still different (TM-score <0.8) do we consider them to be putative alternative conformations.
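A hedged sketch of this comparison is shown below. It reuses the `tm_score` wrapper from the earlier sketch and assumes a hypothetical `extract_residues` helper that writes a PDB file containing only the selected residues; the sequence alignment itself uses Biopython's `PairwiseAligner`.

```python
from Bio import Align  # Biopython pairwise aligner

def aligned_region_tm(seq_a, seq_b, pdb_a, pdb_b):
    """Align two sequences, trim both structures to the aligned residues and re-score
    the trimmed structures with TM-align. extract_residues() and tm_score() are the
    hypothetical helpers described in the lead-in."""
    aligner = Align.PairwiseAligner()           # default global alignment
    alignment = aligner.align(seq_a, seq_b)[0]  # best-scoring alignment
    blocks_a, blocks_b = alignment.aligned      # matched (start, end) index ranges
    idx_a = [i for start, end in blocks_a for i in range(start, end)]
    idx_b = [i for start, end in blocks_b for i in range(start, end)]
    trimmed_a = extract_residues(pdb_a, idx_a)
    trimmed_b = extract_residues(pdb_b, idx_b)
    return tm_score(trimmed_a, trimmed_b)       # putative alternative conformation if < 0.8
```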

We obtain 311 such examples that are within the same 90% sequence identity cluster but have aligned sequence regions with corresponding structures that differ >0.2 in TM-score. We select representatives with the longest sequence overlaps and the biggest TM-score difference from each of the structural clusters. We perform a manual check to ensure that the conformations are indeed different and not a result of variable loop regions, low resolution or other experimental differences.

We obtained 245 sequence clusters at 90% identity with alternative conformations (TM-score difference >0.2), belonging to 458 different structural clusters. For each of these, only one conformation is selected for training and the remainder is saved for evaluation; the clusters are selected at random. In total, there are 6474 structural clusters composed of 59,946 sequences for training and 222 composed of 4263 sequences for evaluation, with 244 alternative conformations (some structural clusters are present in more than one alternative conformation). We select 315 of the training structural clusters at random for validating structure prediction performance (5%, 3539 sequences).
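A minimal sketch of how such a random partition might be made (function name and inputs are illustrative, not the paper's code):

```python
import random

def split_clusters(alt_conf_clusters, all_struct_clusters, val_fraction=0.05, seed=0):
    """For every group of structural clusters representing alternative conformations of the
    same protein, keep one randomly chosen conformation for training and hold the others out
    for evaluation, then set aside ~5% of the training clusters for validation.
    alt_conf_clusters: hypothetical list of lists of structural-cluster ids per protein."""
    rng = random.Random(seed)
    train = set(all_struct_clusters)
    evaluation = set()
    for conformations in alt_conf_clusters:
        keep = rng.choice(conformations)  # one conformation stays in training
        evaluation.update(c for c in conformations if c != keep)
    train -= evaluation
    validation = set(rng.sample(sorted(train), int(val_fraction * len(train))))
    train -= validation
    return train, validation, evaluation
```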

Most structures with alternative conformations are below 500 residues and the structural change between conformations is 0.2–0.3 TM-score (see below, Fig. 7). The most represented type of conformational change is rearrangement, and the least represented is fold switching (see below, Fig. 7c). In total, there are 62 hinge motions, 180 rearrangements and three fold switches.

Fig. 6: Processing the PDB. All single-chain structures with at least 50 residues and 20% regular amino acids determined by cryo-EM or X-ray crystallography with a resolution ≤5 Å were selected. These were then clustered at 30% sequence identity and 0.8 TM-score. Within the structural clusters, sequences with 90% sequence identity in different structural clusters were analysed manually to annotate alternative conformations. In total, 245 such sequences were found.

Fig. 7: Protein lengths and structural similarities. a Sequence length distribution. Most sequences are below 500 residues. b TM-score distribution between alternative conformations. As most conformations are between 0.7 and 0.8, this suggests that the structural change is mainly 0.2–0.3 TM-score. c Strip plot of TM-score between alternative conformations per type (n = 62, n = 180 and n = 3 for hinge motion, rearrangement and fold switch, respectively). The fold switches are fewer and have lower TM-scores than the hinge motions and rearrangements.

Training, validation and test partitions

Supplementary Table 3 displays the number of sequences and structural clusters in each partition. Approximately 5% of the sequences and structural clusters were used for validation and testing and 90% for training.

Scoring

TM-score from TM-align15 was used for all scoring. We consider conformational changes to have at least a difference of 0.2 in TM-score, because smaller changes are very difficult to distinguish. Note that a TM-score above 0.5 is generally considered to indicate similar folds24, so this analysis is very fine-grained (a TM-score above 0.8 is highly accurate). Another reason is the evaluation of predicted structures: if a change of only 0.1 TM-score were permitted, the predictions would have to score >0.9 to distinguish between the different conformations. This is very unlikely, as it would mean that almost all atoms coincide between the predicted and native structures. We therefore use a threshold of 0.2 TM-score.

MSA generation

AlphaFold2 generates three different MSAs. This process constitutes the main bottleneck for the predictions, as very large databases such as the Big Fantastic Database25,26 are searched, which is very time-consuming27. To simplify this process, we instead search only uniclust30_2018_08 (ref. 28) with HHblits (from HH-suite29 version 3.1.0):

hhblits -E 0.001 -all -oa3m -n 2
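For orientation, a minimal Python wrapper around this search might look as follows; the paths and helper name are illustrative, and only the flags shown in the command above are reused (plus the input, database and output arguments they imply).

```python
import subprocess
from pathlib import Path

def build_msa(fasta: Path, db: Path, out_a3m: Path) -> Path:
    """Run the single-database HHblits search with the flags listed above.
    Assumes the hhblits binary (HH-suite 3.1.0) is on PATH and that `db` points to a
    local copy of uniclust30_2018_08 (illustrative setup)."""
    cmd = [
        "hhblits",
        "-i", str(fasta),       # query sequence
        "-d", str(db),          # sequence database prefix
        "-oa3m", str(out_a3m),  # output alignment in a3m format
        "-E", "0.001",          # E-value threshold, as in the command above
        "-all",                 # keep all hits in the result MSA
        "-n", "2",              # two search iterations
    ]
    subprocess.run(cmd, check=True)
    return out_a3m
```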

We note that it is not the number of hits or the number of effective sequences30,31 in the MSAs that determines the outcome, but rather the coevolution present18. This is an elusive concept extracted by the network, and it is possible to improve upon the MSA generation at inference by searching larger databases. Indeed, AF has been improved by searching larger sets of metagenomic sequences32.

Structure prediction network

Architecture

The structure prediction network trained here is almost a complete copy of AlphaFold21 for monomeric structure prediction. The main difference in the architecture is that the template track, which processes similar structures, is removed here to focus on the MSA only. Templates can capture alternative conformational states, although this would mean that these states are not predicted but instead copied from the templates. The configuration for all layers and modules is the same as in AlphaFold2 (Supplementary Table 4).

Training

The main overall difference between the network here (Cfold) and other structure prediction networks is how it is trained. We train Cfold on structural clusters of sequences, which enables learning a one-to-one mapping between a given MSA and a protein conformation. The structural clusters used for training are selected at random, which means that the network has to learn to extract, from each MSA, the specific coevolutionary information that relates to the conformation learned during training. This sets the network up to be more focused on one structural description in the MSA, which is important for later evaluating the impact of different MSAs on the outcome. We sample the structural clusters with probability inversely proportional to cluster size. This allows the network to learn a more refined mapping from each sequence to the exact structure compared to using sequence clusters alone.

The effective batch size is 24 distributed across eight NVIDIA A100 GPUs (three examples per GPU) with a crop size of 256 residues. We apply the same losses as in AlphaFold2:

$$\mathrm{Loss}=0.5\cdot \mathrm{FAPE}+0.5\cdot \mathrm{AUX}+0.3\cdot \mathrm{Distance}+0.2\cdot \mathrm{MSA}+0.01\cdot \mathrm{Confidence}$$

where FAPE is the frame-aligned point error, AUX is a combination of the FAPE and angular losses, Distance is a pairwise distance loss, MSA is a loss over predicting masked-out MSA positions and Confidence is the difference between true and predicted lDDT scores. These losses are defined exactly as in AlphaFold2 and we refer to the description there1.

We use a learning rate of 1e-3 with 1000 steps of linear warmup and clip the gradients at a global norm of 0.1, as in AlphaFold2. The optimiser is Adam33, applied through the Optax package in JAX, the framework in which the entire network is written (JAX version 0.3.24, https://github.com/deepmind/jax/tree/main). The model is trained until convergence for a total of 74,000 steps (compared to 78,125 in AlphaFold2). Each step takes approximately 19.7 s, resulting in a total training time of 17 days.

The different losses and how they converge can be seen in Fig. 8. During training, 56,345/56,407 (99.9%) of the training examples could be loaded; file system issues blocked access to the remaining data. We do not expect this to impact the performance of the network.

Fig. 8: Train losses and metrics vs. training steps. The losses have been smoothed with an exponential moving average (step size = 100). The lDDT CA increases continuously, although almost all performance is reached in the first 10,000 steps (a). First, the MSA (e) and distogram (d) losses saturate, followed by the structural module loss (b) and plDDT loss (c). The total loss (f) is the sum of the distogram, MSA, predicted lDDT (plDDT) and structural module losses.
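As a rough illustration of the optimiser setup described above (not the training code itself), the warmup schedule, gradient clipping and Adam update could be combined in Optax as follows; the variable names are illustrative.

```python
import optax

# Linear warmup from 0 to 1e-3 over 1000 steps, then a constant learning rate.
learning_rate = optax.linear_schedule(init_value=0.0, end_value=1e-3, transition_steps=1000)

# Clip gradients at a global norm of 0.1 before the Adam update, as described above.
optimiser = optax.chain(
    optax.clip_by_global_norm(0.1),
    optax.adam(learning_rate=learning_rate),
)

# Typical usage inside a JAX training step (params and grads are pytrees):
# opt_state = optimiser.init(params)
# updates, opt_state = optimiser.update(grads, opt_state, params)
# params = optax.apply_updates(params, updates)
```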
Structure prediction validation

To validate the training of Cfold, we use the full-length structures from 315 structural clusters (one per cluster, sampled randomly) and three recycling iterations. Of the 315 sampled structures, 307 (97%) can be evaluated; the remainder have structural inconsistencies (e.g. missing CAs) that cause errors with TM-align and the lDDT calculations.

Figure 9a shows the TM-score distribution vs. the training step and Fig. 9b the lDDT scores. The best model weights are obtained at 10,000 steps and these are the ones used in Cfold. It takes ~3 days to train the model to this point. The relationship between the predicted and true lDDT scores remains stable throughout the training process (Fig. 9c). The median TM-scores at 20,000 and 30,000 steps are similar to those at 10,000, but beyond 10,000 steps the secondary structure of the beta sheets is inaccurate (Fig. 9d).

Fig. 9: Validation. a TM-score distribution for the validation set vs. training step (n = 307). The medians are marked by black lines and the upper and lower quartiles are in colour. b lDDT CA distributions vs. training step (n = 307). c Pearson correlations for lDDT vs. plDDT for the alpha carbons (CA, n = 307). The correlation coefficients (R) remain close to 0.9 throughout the training. The medians are marked by black lines and the upper and lower quartiles are in colour. d Predicted structures for PDB ID 3BDL from the validation set at different training steps. The TM-scores are 0.59, 0.53 and 0.55 for 10,000, 20,000 and 30,000 steps, respectively. Visually, one can see that the network starts to make worse predictions beyond 10,000 training steps, although this is not apparent from the metrics in (a, b, c). This suggests that the network starts to overfit certain structures at an early stage.

Structural limitations

To correct bond length violations and overlapping atoms, specific losses are added during the fine-tuning stage in AlphaFold2. In addition, slightly more accurate structures can be obtained by relaxing predictions in the Amber force field34. Here, we did not perform fine-tuning, as the main objective is to sample globally different conformations. To generate locally highly accurate structures, the same Amber force field can be applied to obtain consistent side-chain angles for all amino acids. We did not relax the structures here, as this will likely not impact the overall performance.

Conformational generation

We select structures whose conformations are in the train set from the test clusters that Cfold can predict with a TM-score of at least 0.6 using one prediction (a TM-score of 0.5 is considered to indicate a similar fold24), resulting in 155 out of 234 possible structures (66%). One reason that not all structures are predicted with the same fold, as would be expected from AlphaFold2, is that Cfold is trained with fewer data (excluding e.g. monomers extracted from multimers) for a smaller number of steps. Other confounding factors may be that these structures do indeed have alternative conformations and that the training states may not be favourable.

We take the minimum TM-scores from the TM-align structural superpositions here. To analyse whether Cfold can predict the structure of alternative conformations of these, we apply two different strategies:

1. Dropout19,20,21

2. MSA clustering5

For the dropout, we simply activate dropout everywhere except in the structural module, following AFsample19, and make 100 predictions in total. For the MSA clustering, we let the number of clusters used in the sampling vary over [16, 32, 64, 128, 256, 512, 1024, 5120]5. We take 13 samples per clustering threshold, resulting in 104 predictions per target in total (compared to the 50 used previously). In total, 145 and 154 out of 155 structures could be predicted using strategies 1 and 2, respectively. The number of resulting structures sampled is 14,177 and 16,007 for strategies 1 and 2, respectively. The failures are due to memory issues (NVIDIA A100 GPUs with 40 GB RAM were used). We evaluate the predictions with the TM-score from TM-align, this time taking the maximum score to account for possible size differences between the native structures representing the different conformations (extracted matching regions) and the full genetic sequences used for prediction.

Embedding comparison

Out of 10,816 structures, 10,774 were sampled successfully (99.6%; the failures were due to memory issues for large cluster sizes on NVIDIA A100 GPUs with 40 GB RAM). For the single sequence embedding, we calculate the cosine similarity using the models that best correspond to the train/test conformations:

$$\mathrm{Cosine\; similarity}=\frac{A\cdot B}{\left|A\right|\times \left|B\right|}\cdot \frac{1}{L}$$
(1)

where A and B are single sequence embeddings, |A| and |B| represent their two-norms and L is the protein length.
For the pair embeddings, we instead compute the average element-wise difference:

$$\mathrm{Pairwise\; difference}=\frac{1}{n}\sum _{i=1}^{n}\sqrt{{(C_{i}-D_{i})}^{2}}$$
(2)

where C and D are the pair embeddings and n is the number of elements in the matrices.
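A small numerical sketch of these two comparisons (Eqs. 1 and 2), assuming the embeddings have already been extracted as NumPy arrays; the array shapes and the flattening of the single-sequence embedding before the dot product are illustrative assumptions.

```python
import numpy as np

def single_embedding_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalised cosine similarity between two single-sequence embeddings (Eq. 1).
    a, b: arrays of shape (L, d), flattened before comparison; L is the protein length."""
    length = a.shape[0]
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)) / length)

def pair_embedding_difference(c: np.ndarray, d: np.ndarray) -> float:
    """Mean element-wise absolute difference between two pair embeddings (Eq. 2)."""
    return float(np.mean(np.sqrt((c - d) ** 2)))
```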
Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
