Modelling protein complexes with crosslinking mass spectrometry and deep learning

Crosslink simulation

We simulate SDA crosslinks with XWalk [27], using a 25 Å Cα-Cα cutoff and trypsin digestion. XWalk identifies cross-linkable residue pairs based on solvent accessibility, crosslinker, and peptide digestion. We add noise at an FDR of 20% to match the expected FDR. Because at least one crosslink is always incorrect, the actual FDR can be much higher. False-positive crosslinks can be any residue pair with a Cα-Cα distance > 25 Å in which one residue is a Lys, Ser, Thr, or Tyr. In real data, there is often an affinity towards Lys and clustering of crosslinks. These biases are not reflected in the simulation since we uniformly subsample the crosslink candidates. This does not play a big role in training, and the trained models transfer to real data, but it may give better coverage during testing than can normally be expected. The link-level FDR is simulated by shuffling the crosslinks and counting the number of incorrect links observed so far. The coverage is set to 10% and corresponds to the sequence coverage based on the longer sequence. We sample inter- and intra-protein crosslinks independently.

Integration of crosslinks

We integrate crosslinks in AlphaLink in the same way as in the monomer version [7]. We add a crosslink embedding layer to the neural network that projects the soft-label contact map into the 128-dimensional z-space. The projection is added to the pair representation (z). This way, the crosslinking information influences retrieval of the co-evolutionary information, and the coupled updates with the MSA representation enable noise rejection.

Fine-tuning of AlphaFold-Multimer

We switched to Uni-Fold [28] since OpenFold [29] did not support multimers. To avoid training Uni-Fold from scratch, we fine-tune the weights provided by DeepMind. For v2, we use the AlphaFold-Multimer 2.2.4 weights as the starting point (https://github.com/deepmind/alphafold/releases/tag/v2.2.4). This version of AlphaFold-Multimer was trained on PDBs deposited before 2018-04-30, predating CASP13.
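
To make the crosslink integration described above more concrete, the following is a minimal NumPy sketch, not the AlphaLink implementation. It assumes that the crosslinks arrive as an L × L soft-label map and that the embedding layer is a single linear projection into the 128-dimensional pair space; the function name, the argument names and the soft-label value of 0.8 are illustrative assumptions.

```python
import numpy as np

def add_crosslink_embedding(z, xl_soft, w, b):
    """Project a soft-label crosslink map into pair space and add it to z.

    z       : (L, L, c_z) pair representation
    xl_soft : (L, L) soft labels in [0, 1]; 0 means "no crosslink observed"
    w, b    : (1, c_z) weights and (c_z,) bias of a linear layer
              (hypothetical parameters, learned during fine-tuning)
    """
    # (L, L, 1) @ (1, c_z) -> (L, L, c_z): one c_z-dimensional vector per residue pair
    projection = xl_soft[..., None] @ w + b
    return z + projection

# Toy usage: 4 residues, one symmetric crosslink between residues 0 and 3.
L, c_z = 4, 128
z = np.zeros((L, L, c_z))
xl_soft = np.zeros((L, L))
xl_soft[0, 3] = xl_soft[3, 0] = 0.8   # assumed soft label for this link
w = np.ones((1, c_z))
b = np.zeros(c_z)
z_updated = add_crosslink_embedding(z, xl_soft, w, b)
```

Because the projection is simply added to z, residue pairs without a crosslink only receive the bias term, and the later pair and MSA updates remain free to down-weight links that disagree with the co-evolutionary signal.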
For v3, we use the AlphaFold-Multimer 2.3.0 weights (https://github.com/deepmind/alphafold/releases/tag/v2.3.0). This version of AlphaFold-Multimer was trained on PDBs deposited before 2021-09-30, predating CASP15. The networks are refined on 11424 protein complexes with a total of 34054 chains from the DIPS-Plus [30] training set with simulated SDA crosslinking data. DIPS-Plus contains PDBs deposited before May 2021, predating CASP15. We use Uni-Fold v2.1.0 (https://github.com/dptech-corp/Uni-Fold/releases/tag/v2.1.0). MSAs were generated with the reduced database setting. We train and test with model_1. Since we focus on heteromers, we sample homomeric crosslinks like heteromeric crosslinks during training to have more samples.

For training, we follow the refinement training regime outlined in the AlphaFold-Multimer paper but expand the crop size to 640 AA to increase interface exposure and the number of crosslinks seen during training. We train on 4 A100 GPUs for 10 days. We used early stopping on the validation set, which consists of proteins from CAMEO [31] released after 2022.

Evaluation set-up

For the comparison, we use the official predictions from CASP15. The CASP15 targets were classified as TBM/FM (template-based modelling / free modelling), meaning that there is a partial template for a subunit and/or the assembly. Our main comparison point is NBIS-AF2-multimer, which corresponds to standard AlphaFold-Multimer v2. We use the same MSAs as NBIS-AF2-multimer, provided by Arne Elofsson (http://duffman.it.liu.se/casp15/). We increased the recycling iterations to 20; the original comparison can be found in Supplementary Fig. 1. We randomly sample 10 crosslink sets with 10% coverage and 20% FDR for each target and predict each with a different seed for a total of 200 seeds (10 in the original comparison). We only relax the best sample (chosen by model confidence) per crosslink set.
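
Drawing one crosslink set at a nominal FDR, as in the crosslink simulation and in the evaluation above, could look roughly like the following Python sketch. It is an illustration under stated assumptions rather than the pipeline used in the study: the candidate lists are assumed to come from a tool such as XWalk, the number of links is taken as coverage times the length of the longer sequence (one reading of the coverage definition), and all function and parameter names are hypothetical.

```python
import random

def sample_crosslinks(true_pairs, decoy_pairs, n_links, fdr=0.2, seed=0):
    """Draw n_links crosslinks, a fraction `fdr` of which are false positives.

    true_pairs  : residue pairs within the distance cutoff (correct candidates)
    decoy_pairs : residue pairs beyond the cutoff (false-positive candidates)
    Returns a shuffled list of (pair, is_correct) tuples.
    """
    rng = random.Random(seed)
    # At least one false link is always included, so the realised FDR
    # can exceed the nominal value when the set is small.
    n_false = max(1, round(fdr * n_links))
    n_false = min(n_false, n_links, len(decoy_pairs))
    n_true = min(n_links - n_false, len(true_pairs))
    # Uniform subsampling: no Lys preference and no clustering is modelled.
    links = [(pair, True) for pair in rng.sample(true_pairs, n_true)]
    links += [(pair, False) for pair in rng.sample(decoy_pairs, n_false)]
    rng.shuffle(links)
    return links

def running_fdr(links):
    """Fraction of incorrect links among the first k links, for every k."""
    fdrs, wrong = [], 0
    for k, (_, is_correct) in enumerate(links, start=1):
        wrong += not is_correct
        fdrs.append(wrong / k)
    return fdrs

# Toy usage with made-up residue pairs for a complex whose longer chain has 100 residues.
true_pairs = [(i, i + 5) for i in range(50)]
decoy_pairs = [(i, i + 40) for i in range(50)]
n_links = round(0.10 * 100)                      # assumed: 10% coverage of the longer sequence
links = sample_crosslinks(true_pairs, decoy_pairs, n_links, fdr=0.2, seed=1)
fdr_curve = running_fdr(links)
```

Repeating the call with ten different seeds would give ten crosslink sets per target; the running FDR over the shuffled list mirrors the link-level FDR described in the simulation section.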
The maximum MSA cluster size is 512 sequences.

For the SAbDab comparison, we compile a new data set consisting of 33 recent antibody-antigen targets between 01-01-2022 and 11-10-2023, which represent challenging protein complexes due to the lower evolutionary signal. We only include targets that have a crystal structure with a resolution of 3 Å or better and a single assembly, to simplify the evaluation. We do not remove targets that may have homologues in the training set.

The primary evaluation metric is the DockQ score. We compute the DockQ scores for AlphaLink with the official CASP15 evaluation scripts (https://git.scicore.unibas.ch/schwede/casp15_ema).

The DockQ score is the average of the fraction of native contacts (Fnat), the interface RMSD (iRMS) [32], and the ligand RMSD (LRMS) [32], with both RMSD terms rescaled to the range 0 to 1 so that smaller deviations give higher scores. The interface includes all contacting residues. Residues are in contact if they are from different chains and have at least one pair of heavy atoms within 5 Å. For the iRMS, the cutoff is increased to 10 Å. The Fnat then corresponds to the recall of the native contacts. The iRMS is the RMSD of the backbone atoms in the interface. The LRMS is the backbone RMSD of the smaller chain (ligand) after superimposing the larger chain (receptor). For protein complexes with more than two chains, the final DockQ score is the average DockQ score over all interfaces.

We only relax the best prediction per crosslink subset to save compute time. Relaxation changes the final scores only slightly.

For the Bacillus subtilis predictions, we use the same MSAs as O’Reilly et al. [5]. We predict the targets again with AlphaFold-Multimer v2.2 to be comparable. The results with the original AlphaFold-Multimer v2.1 predictions from the study are shown in Supplementary Fig. 9. The model confidences of the two versions might not be comparable. Except for a few targets, there are no crystal structures available for Bacillus subtilis, which is why we have to resort to model confidence as an indicator of improvement. Supplementary Fig. 10a shows the correlation between the model confidence and the DockQ score, and Supplementary Fig. 10b shows the relationship between the model confidences of AlphaLink and AlphaFold-Multimer with respect to the DockQ score. Although the model confidence is an overestimate on these hard targets, a better model confidence generally translates into a better DockQ score.

For the Cullin4 complex, we use the v3 weights to predict the structures. We always predict the full complex and use the same MSAs for both AlphaFold-Multimer and AlphaLink. We compare the prediction of DCAF1-Vpr to the crystal structure (PDB 6ZX9) with the T4 tag removed. The EM densities correspond to the EMDB accession codes EMD-10611 (core), EMD-10612 (conformational state-1), EMD-10613 (state-2) and EMD-10614 (state-3). There are on average 21 crosslinks per interface.

The structures are visualised with PyMOL v2.5.0 and ChimeraX 1.7.1.

Strains, media and growth conditions

E. coli DH5α and Rosetta DE3 (28) were used for cloning and for the expression of recombinant proteins, respectively. All B. subtilis strains used in this study are derivatives of the laboratory strain 168. They are listed in Supplementary Table 1. B. subtilis and E. coli were grown in Luria-Bertani (LB) or in sporulation (SP) medium [33,34]. For growth assays and the in vivo interaction experiments, B. subtilis was cultivated in LB, SP, or CSE-Glc minimal medium [34,35]. CSE-Glc is a chemically defined medium that contains sodium succinate (6 g/l), potassium glutamate (8 g/l), and glucose (1 g/l) as the carbon sources [35]. Iron sources were added as indicated. The media were supplemented with ampicillin (100 µg/ml), kanamycin (50 µg/ml), chloramphenicol (5 µg/ml), or erythromycin and lincomycin (2 and 25 µg/ml, respectively) if required. LB and SP plates were prepared by the addition of Bacto Agar (Difco) (17 g/l) to the medium.
All oligonucleotides used in this study are listed in Supplementary Table 2.

DNA manipulation

Transformation of E. coli and plasmid DNA extraction were performed using standard procedures [33]. All commercially available plasmids, restriction enzymes, T4 DNA ligase and DNA polymerases were used as recommended by the manufacturers. B. subtilis was transformed with plasmids, genomic DNA or PCR products according to the two-step protocol [34]. Transformants were selected on SP plates containing erythromycin (2 µg/ml) plus lincomycin (25 µg/ml), chloramphenicol (5 µg/ml), kanamycin (10 µg/ml), or spectinomycin (250 µg/ml). DNA fragments were purified using the QIAquick PCR Purification Kit (Qiagen, Hilden, Germany). DNA sequences were determined by the dideoxy chain termination method [33].

Construction of mutant strains by allelic replacement

Deletion of the fur and fpa genes was achieved by transformation of B. subtilis 168 or GP879 with a PCR product constructed using oligonucleotides to amplify DNA fragments flanking the target genes and an appropriate intervening resistance cassette [36]. The integrity of the regions flanking the integrated resistance cassette was verified by sequencing PCR products of about 1100 bp amplified from chromosomal DNA of the resulting mutant strains.

Phenotypic analysis

In B. subtilis, amylase activity was detected after growth on plates containing nutrient broth (7.5 g/l), Bacto agar (17 g/l; Difco) and hydrolysed starch (5 g/l; Connaught). Starch degradation was detected by sublimating iodine onto the plates.

Quantitative studies of lacZ expression in B. subtilis were performed as follows: cells were grown in CSE-Glc or LB medium supplemented with iron sources as indicated. Cells were harvested at an OD600 of 0.5 to 0.8. β-Galactosidase specific activities were determined with cell extracts obtained by lysozyme treatment [34].
One unit of β-galactosidase is defined as the amount of enzyme that produces 1 nmol of o-nitrophenol per min at 28 °C.

Plasmid constructions

To express the Fur and Fpa proteins carrying an N-terminal His-tag in E. coli, the fur and fpa genes were amplified using chromosomal DNA of B. subtilis 168 as the template and appropriate oligonucleotides that attached specific restriction sites to the fragment. Those were: BamHI and XhoI for cloning fur in pET-SUMO (Invitrogen, Germany), and BamHI and SalI for cloning fpa in pWH844 [37]. The resulting plasmids were pGP3589 and pGP2583 for Fur and Fpa, respectively.

For overexpression of fpa in B. subtilis, we constructed plasmid pGP3897. For this purpose, the fpa gene was amplified and cloned between the BamHI and SalI sites of the expression vector pBQ200 [38].

Plasmid pAC7 [39] was used to construct a translational fusion of the dhbA promoter region to the promoterless lacZ gene. For this purpose, the promoter region was amplified using oligonucleotides that attached EcoRI and BamHI restriction sites to the ends of the products. The fragments were cloned between the EcoRI and BamHI sites of pAC7. The resulting plasmid was pGP3594.

Protein expression and purification

E. coli Rosetta(DE3) was transformed with the plasmids pGP371 [40], pGP2583, and pGP3589 encoding His-tagged versions of PtsH, Fpa, and Fur, respectively. For overexpression, cells were grown in 2x LB and expression was induced by the addition of isopropyl 1-thio-β-D-galactopyranoside (final concentration, 1 mM) to exponentially growing cultures (OD600 of 0.8). The His-tagged proteins were purified in 1x ZAP buffer (50 mM Tris-HCl, 200 mM NaCl, pH 7.5). Cells were lysed by four passes (18,000 p.s.i.) through an HTU DIGI-F press (G. Heinemann, Germany). After lysis, the crude extract was centrifuged at 46,400 × g for 60 min and then passed over a Ni2+-nitrilotriacetic acid column (IBA, Göttingen, Germany). The proteins were eluted with an imidazole gradient.
After elution, the fractions were tested for the desired protein using SDS-PAGE. The purified proteins were concentrated in a Vivaspin turbo 15 (Sartorius) centrifugal filter device (cut-off 5 or 50 kDa). The protein samples were stored at −80 °C until further use. The protein concentration was determined according to the method of Bradford [41] using the Bio-Rad dye binding assay and bovine serum albumin as the standard.

Electromobility shift assay (EMSA) with DNA

To analyse the binding of Fur to the dhbA promoter region, we performed EMSAs with a 284 bp dhbA promoter fragment that carries the Fur binding site and purified Fur, Fpa, and PtsH proteins. 200 ng of DNA and 80 pmol of the proteins were used. The samples were first prepared without the proteins, containing only DNA, buffer and water, and heated for 2 minutes at 95 °C. Then the proteins were added in different combinations and the samples were incubated for 30 minutes at 37 °C. Meanwhile, the EMSA gels were pre-run at 90 V for 30 minutes immersed in TBE buffer (28). Afterwards, 2 µl of the loading dye were added and the samples were loaded into the gel pockets. The gel was run for 3 hours at 110 V. Then, the gels were immersed in TBE containing HDGreen® fluorescence dye (Intas, Germany). After 2 minutes, the gels were photographed under UV light.

Bacterial two-hybrid assay

Primary protein-protein interactions were identified by bacterial two-hybrid (BACTH) analysis [42]. The BACTH system is based on the interaction-mediated reconstruction of Bordetella pertussis adenylate cyclase (CyaA) activity in E. coli BTH101. Functional complementation between two fragments (T18 and T25) of CyaA as a consequence of the interaction between bait and prey molecules results in the synthesis of cAMP, which is monitored by measuring the β-galactosidase activity of the cAMP-CAP-dependent promoter of the E. coli lac operon.
Plasmids pUT18C and p25N allow the expression of proteins fused to the T18 and T25 fragments of CyaA, respectively. For these experiments, we used the plasmids pGP3868-pGP3875, which encode N- and C-terminal fusions of T18 or T25 to fur and fpa. The plasmids were obtained by cloning the fur and fpa genes between the KpnI and BamHI sites of pUT18C and p25N [42]. The mutant fur* allele was purchased from Eurofins Genomics (Germany) and then amplified and cloned in the same way as the wild-type fur gene. The resulting plasmids were then used for co-transformation of E. coli BTH101, and the protein-protein interactions were analysed by plating the cells on LB plates containing 100 µg/ml ampicillin, 50 µg/ml kanamycin, 40 µg/ml X-Gal (5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside), and 0.5 mM IPTG (isopropyl-β-D-thiogalactopyranoside). The plates were incubated for a maximum of 36 h at 28 °C.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
