Protein complex structure modeling by cross-modal alignment between cryo-EM maps and protein sequences

Benchmark setting

We curated a non-redundant dataset of cryo-EM maps from EMDB50. The collected maps are all single-particle cryo-EM maps at 2–4 Å resolution with a unique fitted PDB structure, released after January 2018. Their fitted PDB structures were then downloaded from the PDB51. To build an independent test set, a subset of maps released after May 2021 was first gathered. Maps in this subset were clustered by cd-hit52 at 25% sequence similarity (two maps with any pair of chains > 25% sequence similarity were placed in the same cluster). For each cluster, redundant maps were removed until only one map remained, resulting in a non-redundant test set of 99 cryo-EM maps (Supplementary Data 1). Maps released before May 2021 were collected as raw training data. Among them, maps sharing > 25% sequence similarity with any test map were also removed, which yielded a training set of 1529 cryo-EM maps (Supplementary Data 2). It should be noted that none of these maps underwent any preprocessing, unlike MAINMAST27 and EMBuild21, which sharpened the cryo-EM maps using PDB structures. This difference allows EModelX to be applied to maps that have no deposited PDB structure. We therefore also gathered a dataset comprising 126 cryo-EM maps without deposited PDB structures (Supplementary Data 3).

On the curated benchmark test set, we compared EModelX with four cryo-EM protein structure modeling methods:

Phenix30 (phenix-1.20.1-4487, release date: Jan 20, 2022) is a software suite for cryo-EM protein structure modeling. It integrates image sharpening, image segmentation, atomic structure construction, and real-space refinement tools to build models.

MAINMAST27 (version 1.0, release date: Mar 1, 2017) identifies the protein backbone by mean shift and employs a tabu search algorithm in backbone tracing to build single-chain structures from maps sharpened by PDB structures.

DeepTracer31 is a pioneering cryo-EM protein complex structure modeling method. It predicts the locations of amino acids, the location of the backbone, secondary structure positions, and amino acid types to determine protein complex structure.

ModelAngelo35 (version 1.0.12, release date: Nov 29, 2023) is a state-of-the-art machine-learning approach for automated atomic model building in cryo-EM maps. It combines cryo-EM data with protein sequence and structure information in a graph neural network to construct models of protein complexes with high accuracy, effectively eliminating the need for manual intervention and expertise.

We applied EModelX and these methods to our benchmark test set. It should be noted that EModelX, DeepTracer, and ModelAngelo took the original cryo-EM maps as input, while MAINMAST and Phenix used maps sharpened by PDB structures. The implementation details are described in Supplementary Note 2.

We calculated various metrics to measure modeling performance from different perspectives:

TM-score assesses the topological similarity between the backbone of built models and the PDB structures. Here, MM-align39 and TM-align53 were employed to calculate TM-scores for protein complex models and single-chain models, respectively.

Sequence Recall is defined as the proportion of PDB residues that are neighboring (Cα distance ≤ 3 Å) to a modeled residue with the same amino acid type, following Jamali et al.40; see the sketch after this list.

Coverage is the proportion of aligned PDB residues, and the number of aligned PDB residues (Aligned_length) was calculated by MM-align.

RMSD is the root-mean-square distance between the Cα atoms of the aligned residue pairs of built models and the PDB structures; MM-align was used in the RMSD computation.

Mean Length is the mean length of continuous forward segments (segments that proceed in the same direction as the PDB structure) and was calculated by phenix.chain_comparison30.

Forward Rate is the proportion of modeled residues that proceed in the same direction as the PDB structure; the number of forward residues was calculated by phenix.chain_comparison.

CC_box: the correlation coefficient between the atomic model and the whole cryo-EM map, calculated by phenix.map_model_cc43.

CC_mask: the correlation coefficient between the atomic model and the map masked by atomic centers with a fixed radius. It is also calculated by phenix.map_model_cc.
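To make the Sequence Recall definition above concrete, here is a minimal sketch of how the metric could be computed from Cα coordinates and amino acid types; the function name and array layout are illustrative assumptions, not EModelX code.

```python
import numpy as np
from scipy.spatial import cKDTree

def sequence_recall(pdb_ca, pdb_aa, model_ca, model_aa, cutoff=3.0):
    """Fraction of PDB residues with a modeled residue of the same
    amino acid type within `cutoff` Å (Cα-Cα distance).
    pdb_ca, model_ca: (N, 3) / (M, 3) arrays of Cα coordinates in Å;
    pdb_aa, model_aa: length-N / length-M sequences of residue types."""
    if len(model_ca) == 0:
        return 0.0
    tree = cKDTree(model_ca)
    hits = 0
    for coord, aa in zip(pdb_ca, pdb_aa):
        # any modeled residue close enough AND of matching type counts
        if any(model_aa[i] == aa for i in tree.query_ball_point(coord, r=cutoff)):
            hits += 1
    return hits / len(pdb_ca)
```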

Multi-task cryo-EM map interpretation

The first step of EModelX is multi-task cryo-EM map interpretation. However, raw cryo-EM maps in EMDB vary in microscope model, electron dose, electron detector, and experimental procedure, which results in a large variance in density distribution, local resolution, and noise intensity. It is therefore crucial to adopt an image preprocessing step that normalizes the maps, making them more suitable for neural network training. Given a raw cryo-EM map \(\mathcal{M}\in\mathbb{R}^{w\times h\times d}\), where w, h, d represent the width, height, and depth of the raw map, we first obtain \(\mathcal{M}'\in\mathbb{R}^{w'\times h'\times d'}\) by transposing the coordinate system of the raw map according to its header file, so that it shares the same coordinate system as the PDB-deposited structure, and resizing the transposed map by trilinear interpolation to normalize the voxel size to 1 × 1 × 1 Å. We then produce the normalized map \(\mathcal{N}\in\mathbb{R}^{w'\times h'\times d'}\) by normalizing the voxel values:

$$\mathcal{N}_{xyz}=\begin{cases}0, & \mathcal{M}'_{xyz}<\mathcal{M}'_{med}\\ \frac{\mathcal{M}'_{xyz}-\mathcal{M}'_{med}}{\mathcal{M}'_{top1}}, & \mathcal{M}'_{med}\le\mathcal{M}'_{xyz}<\mathcal{M}'_{top1}\\ 1, & \mathcal{M}'_{xyz}\ge\mathcal{M}'_{top1}\end{cases}$$

(1)

where (x, y, z) is the voxel coordinate, \(\mathcal{M}'_{med}\) is the median density value of \(\mathcal{M}'\), and \(\mathcal{M}'_{top1}\) is the top-1% density value of \(\mathcal{M}'\). All voxels in \(\mathcal{N}\) range from 0 to 1. The median density is chosen as the lower bound because of the sparsity of cryo-EM maps, and the top-1% density is set as the upper bound to reduce the impact of extreme noise densities on neural network training and inference.
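A minimal numpy sketch of the normalization in Eq. (1), reproducing the piecewise definition literally (function and variable names are ours):

```python
import numpy as np

def normalize_map(m_prime):
    """Eq. (1): densities below the median map to 0, those above the
    top-1% value map to 1, and the rest are rescaled."""
    med = np.median(m_prime)
    top1 = np.percentile(m_prime, 99)   # top-1% density value
    return np.where(m_prime < med, 0.0,
                    np.where(m_prime >= top1, 1.0,
                             (m_prime - med) / top1))
```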
As illustrated in Fig. 1a, the normalized maps were then interpreted as Cα atom, backbone atom, and amino acid (AA) type distributions by multi-task machine learning. Specifically, the employed multi-task 3D Residual U-Nets can be formulated as:

$$\mathcal{N}^{B}=F_{a}(unet_{S}(F_{s}(\mathcal{N}),\,\theta_{S}))$$

(2)

$$\mathcal{N}^{C}=F_{a}(unet_{C}(F_{s}([\mathcal{N};\mathcal{N}^{B}]),\,\theta_{C}))$$

(3)

$$\mathcal{N}^{A}=F_{a}(unet_{A}(F_{s}([\mathcal{N};\mathcal{N}^{B}]),\,\theta_{A}))$$

(4)
where \(\mathcal{N}\in\mathbb{R}^{1\times W\times H\times D}\) is the input normalized map, \(\mathcal{N}^{B}\in\mathbb{R}^{4\times W\times H\times D}\) is the predicted backbone distribution map, \(\mathcal{N}^{C}\in\mathbb{R}^{4\times W\times H\times D}\) is the predicted Cα atom distribution map, \(\mathcal{N}^{A}\in\mathbb{R}^{21\times W\times H\times D}\) is the predicted amino acid type classification map, and \([\mathcal{N};\mathcal{N}^{B}]\in\mathbb{R}^{5\times W\times H\times D}\) is the channel-wise concatenation of \(\mathcal{N}\) and \(\mathcal{N}^{B}\). \(F_{s}:\mathbb{R}^{c\times W\times H\times D}\to\{\mathbb{R}^{c\times 64\times 64\times 64}\}\) splits a given map into a set of \(\mathbb{R}^{c\times 64\times 64\times 64}\) sub-maps with a sliding stride of 8 voxels, \(unet:\mathbb{R}^{C_{in}\times 64\times 64\times 64}\to\mathbb{R}^{C_{out}\times 64\times 64\times 64}\) is a U-Net34 module with trainable parameters θ, and \(F_{a}:\{\mathbb{R}^{C_{out}\times 64\times 64\times 64}\}\to\mathbb{R}^{C_{out}\times W\times H\times D}\) assembles the prediction results from all sub-maps.

The U-Net has been widely used in image segmentation tasks. Regarding our prediction tasks as three 3D semantic segmentation tasks, U-Net's max-pooling and up-sampling operations are beneficial for extracting the coarser- and finer-grained features that are both important for semantic segmentation. Here we implemented our U-Net as a 3D Residual U-Net54, where skip connections55 are exploited to alleviate the resolution reduction caused by max-pooling and the gradient vanishing problem of deep networks. Specifically, the 3D Residual U-Net module \(unet:\mathbb{R}^{C_{in}\times 64\times 64\times 64}\to\mathbb{R}^{C_{out}\times 64\times 64\times 64}\) can be formulated as an encoder-decoder model:

$$x^{(0)}=enc^{(0)}(x)$$

(5)

$$x^{(n)}=enc^{(n)}(F_{p}(x^{(n-1)}))$$

(6)

$$y^{(n)}=dec^{(n-1)}(F_{u}(y^{(n-1)})+x^{(N-n)})$$

(7)

$$y=softmax(dec^{(N)}(y^{(N)}))$$

(8)
where \(x\in\mathbb{R}^{C_{in}\times 64\times 64\times 64}\) is the input map, \(y\in\mathbb{R}^{C_{out}\times 64\times 64\times 64}\) is the output segmentation map, \(F_{p}:\mathbb{R}^{c\times 2w\times 2h\times 2d}\to\mathbb{R}^{c\times w\times h\times d}\) is the max-pooling operation, \(F_{u}:\mathbb{R}^{c\times w\times h\times d}\to\mathbb{R}^{c\times 2w\times 2h\times 2d}\) is the up-sampling operation implemented by strided transposed convolution56, the operation '+' in Eq. (7) is the skip connection performed by element-wise summation, the normalized exponential function softmax is applied channel-wise, N denotes the total number of encoder/decoder layers, and n marks the current layer. An encoder/decoder module can be unified as:

$$f_{out}=ELU(conv^{(0)}(f_{in})+conv^{(0,1,2)}(f_{in}))$$

(9)

where \(f_{in}\in\mathbb{R}^{C_{in}\times w\times h\times d}\) is the input feature map, \(f_{out}\in\mathbb{R}^{C_{out}\times w\times h\times d}\) is the output feature map, \(ELU:\mathbb{R}^{c\times w\times h\times d}\to\mathbb{R}^{c\times w\times h\times d}\) is the exponential linear unit (ELU)57 activation function, and \(conv^{(0)}:\mathbb{R}^{C_{in}\times w\times h\times d}\to\mathbb{R}^{C_{out}\times w\times h\times d}\) and \(conv^{(1,2)}:\mathbb{R}^{C_{out}\times w\times h\times d}\to\mathbb{R}^{C_{out}\times w\times h\times d}\) are cascaded layers in which feature maps are processed by 3 × 3 × 3 convolution → group normalization58 → ELU activation; as in Eq. (7), the operation '+' is a skip connection performed by element-wise addition.
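A sketch of how the residual encoder/decoder module of Eq. (9) could look in PyTorch, assuming conv^{(0,1,2)} denotes the cascade conv^{(2)} ∘ conv^{(1)} ∘ conv^{(0)}; the layer counts and group size are assumptions, not the exact EModelX configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock3D(nn.Module):
    """Residual encoder/decoder module of Eq. (9). Each conv layer is
    3x3x3 convolution -> GroupNorm -> ELU; conv0 changes the channel
    count and conv12 is the cascaded refinement branch."""
    def __init__(self, c_in, c_out, groups=8):  # groups: assumed, must divide c_out
        super().__init__()
        def layer(ci, co):
            return nn.Sequential(nn.Conv3d(ci, co, 3, padding=1),
                                 nn.GroupNorm(groups, co), nn.ELU())
        self.conv0 = layer(c_in, c_out)                   # shortcut branch
        self.conv12 = nn.Sequential(layer(c_out, c_out),  # cascaded layers
                                    layer(c_out, c_out))
    def forward(self, f_in):
        f0 = self.conv0(f_in)               # conv^{(0)}(f_in)
        return F.elu(f0 + self.conv12(f0))  # ELU(conv^{(0)} + conv^{(0,1,2)})
```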
After the 3D Residual U-Net predictions, the subsequent step of cryo-EM map interpretation is to propose Cα atom candidates (Fig. 1a). First, we pick the set of voxels satisfying \(\mathcal{N}^{C}_{0ijk}>0.35\), where \(\mathcal{N}^{C}_{0}\) is the softmax score of the Cα class in \(\mathcal{N}^{C}\) and (i, j, k) is the voxel coordinate. We then run the density-based spatial clustering of applications with noise (DBSCAN) algorithm59 with density parameter eps = 10 to efficiently filter out outlier clusters of Cα voxels, which are usually incorrectly predicted noise. Considering that the ideal distance between consecutive Cα atoms is 3.8 Å60, neighboring predicted Cαs should roughly keep this distance. We therefore first filter out non-local-maximum Cα voxels within 3 × 3 × 3 Å through the non-maximum suppression (NMS) algorithm, and then adjust the remaining Cα coordinates by:

$$\mathcal{C}_{n}=\frac{1}{\sum_{\delta}\mathcal{N}^{C}_{0(\mathcal{C}'_{n}+\delta)}}\sum_{\delta=(-1,-1,-1)}^{(1,1,1)}\mathcal{N}^{C}_{0(\mathcal{C}'_{n}+\delta)}\times(\mathcal{C}'_{n}+\delta)$$

(10)

where \(\mathcal{C}'\in\mathbb{Z}^{N\times 3}\) denotes the original coordinates of the predicted Cα voxels, N is the total number of predicted Cα voxels and n is the index of a given Cα, δ ∈ {−1, 0, 1}3 traverses the neighboring coordinates, and \(\mathcal{C}\in\mathbb{R}^{N\times 3}\) holds the adjusted Cα coordinates.
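The candidate-proposal step above can be sketched as follows; the greedy NMS and the omitted map-boundary handling are simplifying assumptions rather than the exact EModelX implementation:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def propose_ca_candidates(NC0, thr=0.35, eps=10.0):
    """NC0: Cα-class softmax volume, shape (W, H, D), 1 Å voxels.
    Bounds checks at map edges are omitted for brevity."""
    voxels = np.argwhere(NC0 > thr)
    # DBSCAN removes outlier clusters of (usually mis-predicted) Cα voxels
    voxels = voxels[DBSCAN(eps=eps).fit_predict(voxels) != -1]
    # greedy NMS: keep local maxima, suppress neighbors within 3 x 3 x 3 Å
    scores = NC0[tuple(voxels.T)]
    suppressed = np.zeros(len(voxels), bool)
    kept = []
    for i in np.argsort(-scores):
        if suppressed[i]:
            continue
        kept.append(voxels[i])
        suppressed |= np.all(np.abs(voxels - voxels[i]) <= 1, axis=1)
    # Eq. (10): softmax-score-weighted centroid over the 3 x 3 x 3 neighborhood
    offs = np.array([(a, b, c) for a in (-1, 0, 1)
                     for b in (-1, 0, 1) for c in (-1, 0, 1)])
    cas = []
    for v in kept:
        nb = v + offs
        w = NC0[tuple(nb.T)]
        cas.append((w[:, None] * nb).sum(axis=0) / w.sum())
    return np.asarray(cas)
```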
3D Residual U-Nets Training

To train our model, we first annotated the cryo-EM maps in the training dataset according to the PDB-deposited structures. For backbone prediction, we segmented each cryo-EM map into four semantic classes. A voxel is annotated as a main chain voxel if it contains any main chain atom; otherwise it is labeled as a side chain voxel if it contains any side chain atom; otherwise it is assigned as a mask voxel if it neighbors any protein atom; otherwise it is annotated as a non-structural voxel. The mask voxels are introduced to alleviate the unfair bias caused by experimental error in PDB-deposited structures, and they do not participate in network back-propagation. Similarly, for Cα prediction, a voxel is annotated as a Cα voxel if it contains any Cα atom; otherwise it is labeled as an other-atom voxel if it contains any other protein atom; otherwise it is assigned as a mask voxel if it neighbors any protein atom; otherwise it is annotated as a non-structural voxel. For amino acid type prediction, we annotated every voxel neighboring a Cα voxel with the corresponding amino acid type; all other voxels are assigned as mask voxels and masked out in network back-propagation. Our training loss is defined as:

$$\mathcal{L}=\lambda_{S}\mathcal{L}_{S}+\lambda_{C}\mathcal{L}_{C}+\lambda_{A}\mathcal{L}_{A}$$

(11)
where λS, λC, λA are warming-up task weights that are adaptively adjusted from 1, 1, 0 to 0, 0, 1 during training, and \(\mathcal{L}_{S},\mathcal{L}_{C},\mathcal{L}_{A}\) can be unified as \(\mathcal{L}_{CE}\):

$$\mathcal{L}_{CE}=\sum -W_{\hat{y}}\log\frac{\exp(y_{\hat{y}})}{\sum_{c=0}^{C}\exp(y_{c})}$$

(12)
where C is the number of classes, \(y\in\mathbb{R}^{C\times w\times h\times d}\) is the output of the corresponding U-Net module, \(\hat{y}\in\{0,\ldots,C\}^{w\times h\times d}\) is the annotated ground-truth label, and \(W_{\hat{y}}\) is the class weight of \(\hat{y}\), set according to the ground-truth class distribution to alleviate the class imbalance problem. For \(\mathcal{L}_{S}\) the class weights are set to 1, 0.3, 0.03, 0 for the main chain, side chain, non-structural, and masked voxel classes, respectively. Similarly, for \(\mathcal{L}_{C}\) the class weights are set to 1, 0.1, 0.01, 0 for the Cα, other-atom, non-structural, and masked voxel classes, respectively. For amino acid type prediction, however, we did not apply different class weights to different amino acid types, since we do not focus on the prediction of any particular amino acid class and the amino acid class imbalance itself reflects natural protein sequence bias. Our neural networks were implemented in PyTorch 1.8.161 and trained on an Nvidia RTX 3090 Graphics Processing Unit (GPU) with the Adam optimizer62, a learning rate of \(1\times 10^{-4}\), and a batch size of 8.
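A minimal sketch of one task loss in Eqs. (11)-(12), using the backbone class weights stated above; giving the masked class zero weight removes it from back-propagation (the tensor shapes are assumptions):

```python
import torch
import torch.nn as nn

# Backbone-task loss L_S with the stated class weights; the zero
# weight on the masked class keeps it out of back-propagation.
loss_S = nn.CrossEntropyLoss(
    weight=torch.tensor([1.0, 0.3, 0.03, 0.0]))  # main chain, side chain,
                                                 # non-structural, masked

logits = torch.randn(8, 4, 64, 64, 64)           # assumed unet_S sub-map output
labels = torch.randint(0, 4, (8, 64, 64, 64))    # annotated semantic labels
print(loss_S(logits, labels))
```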
Cα-Sequence Alignment

To achieve cross-modal alignment between cryo-EM maps and protein sequences, a naive approach is to map the Cα atom candidates from the cryo-EM map onto the protein complex sequences by scoring how well their amino acid types match. However, given the large number of identical amino acid types in a sequence and the prediction error in \(\mathcal{N}^{A}\), this naive approach falls far short of correctly aligning Cα candidates to the protein sequence. Specifically, we define B as the event that a Cα is predicted to have the same amino acid type as a protein sequence position, and A as the event that this Cα matches this protein sequence position in the PDB structure. The probability P(A∣B) can be calculated by Bayes' theorem:

$$P(A|B)=\frac{P(A)\cdot P(B|A)}{P(B)}$$

(13)

$$=\frac{P(A)\cdot P(B|A)}{P(A)\cdot P(B|A)+P(\overline{A}_{1})\cdot P(B|\overline{A}_{1})+P(\overline{A}_{0})\cdot P(B|\overline{A}_{0})}$$

(14)

$$=\frac{\frac{1}{N}\cdot acc}{\frac{1}{N}\cdot acc+\frac{n-1}{N}\cdot acc+\frac{N-n}{N}\cdot\frac{1-acc}{19}}$$

(15)

$$\mathop{\approx}^{\frac{n}{N}\approx\frac{1}{20}}\frac{20\cdot acc}{N}$$

(16)
where N is the number of residues, n is the number of residues with identical amino acid type, acc = P(B∣A) is the amino acid prediction accuracy, and the event \(\overline{A}_{1}\) / \(\overline{A}_{0}\) is defined as the Cα not matching the protein sequence position in the PDB structure while its amino acid type is the same / not the same as that position's type. Eqs. (15) and (16) are derived under the uniform distribution assumption (adopted for computational convenience) that \(P(B|\overline{A}_{0})=\frac{1-acc}{19}\) and \(\frac{n}{N}\approx\frac{1}{20}\). When N is large enough (e.g., N > 1000), \(P(A|B)\approx\frac{20\cdot acc}{N}\approx 0\). Such a naive mapping approach is therefore not sufficient for accurate alignment.
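A quick numerical check of Eqs. (15)-(16) illustrates why single-residue matching is uninformative (the values below are illustrative, not measured accuracies):

```python
# Plugging illustrative numbers into Eq. (15): N = 1000 residues,
# n = 50 residues of the same type, acc = 0.6.
N, n, acc = 1000, 50, 0.6
p = (acc / N) / (acc / N + (n - 1) / N * acc + (N - n) / N * (1 - acc) / 19)
print(p, 20 * acc / N)   # both ~0.012: a single match is nearly meaningless
```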
Instead of naive mapping, EModelX leverages Cα trace sampling to enhance the confidence of alignment. As shown in Fig. 1b, the sampled traces are aligned with sub-sequences of the protein sequence. We define \(B'\) as the event that the predicted amino acid types of a trace are identical to a subsequence, and \(A'\) as the event that this trace matches this subsequence in the PDB structure. The probability \(P(A'|B')\) can be calculated by Bayes' theorem:

$$P(A'|B')=\frac{P(A')\cdot P(B'|A')}{P(B')}$$

(17)

$$=\frac{P(A')\cdot P(B'|A')}{P(A')\cdot P(B'|A')+\sum_{i=0}^{s}P(\overline{A'}_{i})\cdot P(B'|\overline{A'}_{i})}$$

(18)

$$=\frac{\frac{1}{N-s+1}\cdot acc^{s}}{\frac{1}{N-s+1}\cdot acc^{s}+\sum_{i=0}^{s}\binom{s}{i}\cdot\frac{\prod_{k=1}^{i}n_{k}\cdot\prod_{k=i+1}^{s}(N-n_{k})}{N^{s}}\cdot acc^{i}\cdot\left(\frac{1-acc}{19}\right)^{s-i}}$$

(19)

$$\mathop{\approx}^{\frac{n_{k}}{N}\approx\frac{1}{20},\,acc\approx 0.5}\frac{\frac{1}{N-s+1}}{\frac{1}{N-s+1}+\sum_{i=0}^{s}\binom{s}{i}\cdot\frac{1}{20^{s}}}$$

(20)
where s is the length of the sampled subsequence, the event \(\overline{A'}_{i}\) is defined as the sampled trace having i amino acids identical to the subsequence while not matching it in the PDB structure, \(acc=P(B'|A')\) is the amino acid prediction accuracy, and nk is the number of residues with an amino acid type identical to that of the kth residue in the subsequence. Under the assumptions that acc = 0.5 and \(\frac{n_{k}}{N}\approx\frac{1}{20}\), when s is large enough, \(P(A'|B')\approx 1\).

To implement the sampling-enhanced Cα-sequence alignment, EModelX first computes the naive amino acid type matching score \(\mathcal{S}^{A}\) between the predicted Cα candidates and the native protein sequences, which can be formulated as:

$$\mathcal{S}^{A}_{ijk}=\mathcal{N}^{A}_{s_{ij}F_{r}(\mathcal{C}_{k})}$$

(21)
where \(\mathcal{S}^{A}\in\mathbb{R}^{S\times L\times N}\) (S: number of unique sequences, L: maximum sequence length, N: number of Cα candidates), sij is the type of the jth amino acid in the ith unique sequence, \(\mathcal{C}_{k}\) is the coordinate of the kth Cα candidate, and \(F_{r}:\mathbb{R}^{3}\to\mathbb{Z}^{3}\) is the rounding function.

The subsequent step is to sample Cα local traces. As shown in Fig. 1b, traces of Cα candidates are sampled based on the backbone distribution and Cα distances. First, the Cα neighbor connection likelihood \(\mathcal{H}\in\mathbb{R}^{N\times N}\) is estimated as:

$$\mathcal{H}=\frac{\mathcal{S}^{D}+\mathcal{S}^{B}}{2}$$

(22)
where \(\mathcal{S}^{D}\in\mathbb{R}^{N\times N}\) is the distance score and \(\mathcal{S}^{B}\in\mathbb{R}^{N\times N}\) is the backbone score, defined as:

$$\mathcal{S}^{D}_{ij}=\max\left(\min\left(1-\frac{\left||\mathcal{C}_{i}-\mathcal{C}_{j}|-3.8\right|-0.5}{2},1\right),0\right)$$

(23)

$$\mathcal{S}^{B}_{ij}=\frac{1}{6}\sum_{k=0}^{5}\mathcal{N}^{B}_{0F_{r}(\mathcal{C}_{i}+\frac{k}{5}\times(\mathcal{C}_{j}-\mathcal{C}_{i}))}$$

(24)

where i and j are the indices of two Cα candidates satisfying \(|\mathcal{C}_{i}-\mathcal{C}_{j}|\in[2,6)\), \(\mathcal{N}^{B}_{0}\) is the softmax score of the main chain class in \(\mathcal{N}^{B}\), and Fr is the rounding function.
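Eqs. (23) and (24) translate directly into code; a minimal sketch (the names are ours, and map-boundary checks are omitted):

```python
import numpy as np

def distance_score(ci, cj):
    """Eq. (23): peaks when the Cα-Cα distance is near the ideal 3.8 Å."""
    d = abs(np.linalg.norm(ci - cj) - 3.8)
    return max(min(1.0 - (d - 0.5) / 2.0, 1.0), 0.0)

def backbone_score(ci, cj, NB0):
    """Eq. (24): mean main-chain softmax score at six points sampled
    along the segment between two Cα candidates (NB0: (W, H, D) volume)."""
    pts = [np.rint(ci + (k / 5.0) * (cj - ci)).astype(int) for k in range(6)]
    return sum(NB0[tuple(p)] for p in pts) / 6.0
```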
Subsequently, \(\mathcal{H}\) is used to sample local structures \(\mathcal{T}\in\mathbb{R}^{L\times 7}\), where L is the number of sampled structures and 7 is the length of each sampled local structure. We then estimate the n-hop (n ∈ [1, 6]) connection likelihood \(\mathcal{H}^{(n)}\in\mathbb{R}^{N\times N}\):

$$\mathcal{H}^{(n)}_{t_{0}t_{n}}=norm_{\mathcal{C}}\left(\max_{\mathcal{T}}\left(\prod_{i=1}^{n}\mathcal{H}_{t_{i-1}t_{i}}\right)\right)$$

(25)
where \(t\in\mathcal{T}\) is a local structure of length 7, ti is the ith Cα in t, \(max_{\mathcal{T}}\) keeps the maximum value among the different \(t\in\mathcal{T}\) that share an identical (t0, tn) pair, and \(norm_{\mathcal{C}}\) normalizes \(\mathcal{H}^{(n)}\) to sum to 1 in the first channel.

Finally, we compute \(\mathcal{S}\in\mathbb{R}^{S\times L\times N}\), the Cα-sequence aligning score of the predicted Cα candidates against the complex sequences:

$$\mathcal{S}_{ijk}=\mathcal{S}^{A}_{ijk}+\sum_{n=1}^{6}\sum_{k'=0}^{N}\left(\mathcal{S}^{A}_{i(j-n)k'}+\mathcal{S}^{A}_{i(j+n)k'}\right)\times\mathcal{H}^{(n)}_{k',k}$$

(26)

where \(k'\) traverses the indices of the predicted Cα candidates. \(\mathcal{S}\) is updated with the n-hop connection likelihood so that the Cα-sequence alignment also learns from n-hop neighboring Cαs. We call this procedure Cα-sequence score propagation.
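The score propagation of Eq. (26) can be sketched with one matrix product per hop; the array names are assumptions:

```python
import numpy as np

def propagate_scores(SA, H_hops):
    """Eq. (26). SA: (S, L, N) naive matching scores; H_hops: list of six
    (N, N) n-hop connection likelihoods H^(1)..H^(6)."""
    S = SA.copy()
    for n, Hn in enumerate(H_hops, start=1):
        minus = np.zeros_like(SA)
        plus = np.zeros_like(SA)
        minus[:, n:, :] = SA[:, :-n, :]   # S^A at sequence position j - n
        plus[:, :-n, :] = SA[:, n:, :]    # S^A at sequence position j + n
        S += (minus + plus) @ Hn          # sum over k' via matrix product
    return S
```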
We have also implemented EModelX(+AF). As shown in Fig. 1c, EModelX(+AF) leverages AlphaFold-predicted structures to assist Cα-sequence alignment. Specifically, the Cα-sequence aligning score \(\mathcal{S}\) is modified as \(\mathcal{S}'=\mathcal{S}+\mathcal{S}^{T}\), where \(\mathcal{S}^{T}\in\mathbb{R}^{S\times L\times N}\) is the structural aligning score. For each \(t\in\mathcal{T}\), \(n\in[1,6]\) and k = tn:

$$\mathcal{S}^{T}_{ijk}=-\min_{\mathcal{T}}(\delta(t,\mathcal{P}_{ijn}))$$

(27)
where \(\mathcal{P}_{ijn}\) is the [j − n, j − n + 6] sub-structure of the AlphaFold-predicted structure for the ith unique sequence, and δ is the RMSD obtained by superimposing t onto \(\mathcal{P}_{ijn}\).

Chain and Sequence Registration

After Cα-sequence alignment, high-confidence Cα-sequence mappings can be identified from the aligning score matrix. We found that the high-confidence mappings correlate strongly with the ground-truth matches between Cα candidates and protein sequence positions. A hierarchical modeling strategy is therefore adopted: first build an initial model from the high-confidence mappings, then fill the unmodeled gaps through Cα threading.

To build the initial model, chain and sequence registration is necessary to assign a chain index and sequence position to those Cαs. Following a greedy strategy, we start from the highest-confidence match of \(\mathcal{S}\) and proceed to lower ones. For each current match (i, j, k) in \(\mathcal{S}\), we iteratively explore its spatial and sequential neighbor match \((i,j',k')\) satisfying \(j'=j\pm 1\), \(\mathcal{S}^{D}_{kk'}>0\) and \(k'\) = argmax(\(\mathcal{S}_{ij'}\)) until no such match can be found. The resulting list of matches, regarded as a Cα trace matched to a protein sub-sequence, is identified as a high-confidence sequence registration result if it is long enough (≥9).

Sequence registration is straightforward once Cα traces have been aligned to sequences. Chain registration, however, can be a combinatorial optimization problem for homologous chains that share the same sequence. Here we leverage connectivity and symmetry to solve this problem. In Algorithm 1, a trace clashing with a chain means that the trace's sub-sequence is already occupied in that chain. The TOP_CONNECTIVE function proposes the chain to which trace t is most connective. Specifically, a naive greedy strategy performs Cα threading to connect trace t to each chain in a given number of steps (equal to their gap length in sequence order), and the cumulative Cα-sequence aligning scores of the connection results are used to rank these chains and pick the top candidate. The TOP_SYMMETRIC function proposes the chain to which trace t is most symmetric. Specifically, trace t is fused with each chain in \(\mathcal{V}_{t}\) and subsequently superimposed onto each chain in \(\mathcal{C}_{t}\). The chain in \(\mathcal{V}_{t}\) that attains the lowest RMSD is regarded as the most symmetric chain.
Algorithm 1: Chain Registration
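Since the listing of Algorithm 1 is not reproduced here, the following sketch shows one plausible control flow consistent with the description above; the helper callables stand in for the clash test, TOP_CONNECTIVE, and TOP_SYMMETRIC operations and are assumptions, not EModelX code:

```python
def register_chains(traces, chains_of, clashes, top_connective, top_symmetric):
    """traces: high-confidence traces, highest aligning score first;
    chains_of(t): homologous chain slots sharing trace t's sequence;
    the three helpers mirror the operations described in the text."""
    assignment = {}
    for t in traces:
        free = [c for c in chains_of(t) if not clashes(t, c)]
        if not free:
            continue                        # leave the trace for Cα threading
        if len(free) == 1:
            assignment[t] = free[0]
        else:                               # ambiguity among homologous chains:
            best = top_connective(t, free)  # prefer the most connective chain,
            if best is None:
                best = top_symmetric(t, free)  # fall back to symmetry
            assignment[t] = best
    return assignment
```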
Sequence-Guiding Cα Threading

At this point we have built a high-confidence aligned protein complex Cα model containing some unaligned structural gaps. For a given gap, we thread Cαs from one endpoint (a Cα that has been assigned a position in the high-confidence model) to the other. This suffers from high computational complexity for long gaps, so we employ a pruning-search strategy to accelerate it, a modified version of our previous work20. The schematic flowchart of the pruning search algorithm is depicted in Supplementary Fig. S8; it relies on a scoring function to filter out lower-scoring traces within the same structural cluster. The goal of this scoring function is to preserve traces that have: i. higher Cα-sequence aligning scores, ii. higher Cα connection scores, and iii. higher symmetry with the corresponding segments in other homomeric chains or AlphaFold structures. Specifically, the scoring function can be formulated as:

$$\mathcal{F}=\sum_{j\in s,k\in t}\mathcal{S}_{ijk}+\sum_{k\in t}\mathcal{H}_{k,k'}-\delta(t,\mathcal{M}_{s})$$

(28)

where t is the Cα trace searched so far, s is the corresponding sub-sequence, \(\mathcal{S}\) is the Cα-sequence aligning score, \(\mathcal{H}\) is the estimated Cα neighbor connection likelihood, \(k'\) is the Cα following k in t, and δ calculates the RMSD between t and \(\mathcal{M}_{s}\). For EModelX, \(\mathcal{M}\) is another homomeric chain whose model has been built for sub-sequence s; for EModelX(+AF), \(\mathcal{M}\) is the AlphaFold structure. Cα threading is performed on the unmodeled structural gaps, which commonly correspond to local regions of lower resolution. Incorporating AlphaFold not only provides a more reliable template \(\mathcal{M}\) but also enhances the Cα-sequence aligning score \(\mathcal{S}\) through the structural alignment term \(\mathcal{S}^{T}\) (Eq. (27)). It therefore holds promise for enhancing the accuracy of Cα threading.
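A hedged sketch of the pruning step: within each structural cluster, candidate traces are ranked by the score of Eq. (28) and only the top ones survive (the beam width and all names are assumptions):

```python
def prune_traces(traces, score_fn, cluster_fn, beam=32):
    """traces: candidate partial Cα traces across a gap; score_fn
    implements Eq. (28); cluster_fn maps a trace to its structural
    cluster. Only the top-`beam` traces per cluster survive."""
    clusters = {}
    for t in traces:
        clusters.setdefault(cluster_fn(t), []).append(t)
    kept = []
    for group in clusters.values():
        group.sort(key=score_fn, reverse=True)
        kept.extend(group[:beam])
    return kept
```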
After sequence-guiding Cα threading, we have built the Cα backbone model of the protein complex. Following MAINMAST, we adopted PULCHRA37 as the full-atom construction tool. Finally, the full-atom complex model is refined against the EM density map using phenix.real_space_refine38. Molecular graphics and analyses were performed with UCSF Chimera63.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
