SymProFold: Structural prediction of symmetrical biological assemblies

Pre-filtering of possible prediction candidatesSymProFold uses sequence files in fasta format as input and consequently checks their predictability by performing homodimer predictions with AlphaFold-Multimer21. Homodimer predictions are scored according to the ipTM+pTM score, a weighted model confidence score for complex predictions, which ranges from 0–1 and is calculated using 80% of the ipTM score and 20% of the pTM score as described by Evans et al. 202221. Only homodimer predictions with a minimal score of 0.3 ipTM+pTM are further processed, excluding proteins that are unlikely to dimerize. The sequences of 19 annotated SLPs, 3 non-SLPs, and viral capsid from Odonata-associated circular virus 21 were used as input. Table 1 includes a list of proteins for which SymProFold S-layer predictions were performed. We used the annotated S-layer protein for each species, as reported in the Uniprot databank, for all of the presented S-layer predictions and calculations.Identification of domains and subchainsAs a next step, domains of the protein need to be defined in a fasta file with domains separated by line breaks. Domain boundaries can be set manually, or an automated domain identification can be used. Both start with predicting the full-length protein structure as a monomer using AlphaFold. For automated domain identification, we created a ‘Domain_Separator’ tool that can be included in the SymProFold pipeline to prepare the fasta file. Domain_Separator is a Python script that uses a coordinate file as input, identifies structural domains, and creates a fasta file with domain sequences separated by line breaks. Additionally, a ChimeraX file with separately colored domains can be generated. In an iterative process, small domain subsections are merged until they describe a complete domain. The initial domain subsections are chain ranges created through local crosslinking of secondary structures. Using relations between contact areas, surface areas, and subsection sequence lengths, neighboring domain subsections are then iteratively merged into even larger domain subsections. This iterative process ends when a saturation of merging is reached. In a postprocessing step, for each linker between domain sections, a cropping point is identified that minimizes the contacts between both domains. An example of Domain_Separator utilization is found in the Supplementary Method 1 and Supplementary Fig. 1. The generated fasta file can be used as direct input for SymProFold. Both methods lead to comparable results for identifying domains in our presented examples.Once the domains are defined in the fasta file, a set of five different subchains of the protein is generated. These subchains represent truncations of the protein according to a scheme described in Supplementary Table 3, which includes the full-length sequence, a subchain without the N-terminus, a subchain without the C-terminus, the first third of the domains, and the last third of the domains. Minimum or maximum lengths were defined (Supplementary Table 3) to mitigate the effects of very large or small domains.Prediction of oligomer sets and filteringFor each subchain, the algorithm starts different oligomer predictions (dimers, trimers, tetramers, and hexamers). Five models are calculated for each prediction, and the resulting symmetry complexes are evaluated and further processed.All predictions are evaluated to check if there is a rotational symmetry axis that can align all the subchain components within the oligomer prediction to each other, satisfying the symmetry requirement. Therefore, the algorithm checks whether the occurring angles between the monomers correspond to a 2-, 3-, 4- or 6-fold rotational symmetry and whether the associated rotational symmetry axis is uniform within a tolerance range (max. of 5 Å deviation for each monomer). Models with an ipTM+pTM score of ≥0.20 and existing interfaces between the monomers are treated as symmetry complexes. The order of the rotational symmetry axis (k-fold) of each complex is deduced from its symmetry angle (2-, 3-, 4-, or 6-fold rotational symmetry axis). For example, a trimer prediction of a given subchain of an SLP that forms a p6 S-layer could result in a 3-fold axis (∆φ = 120°) or 6-fold axis (∆φ = 60°). In the latter case, the property of a 6-fold symmetry axis is described by only three predicted subunits instead of 6. The symmetry complexes are further filtered by criteria to discard predictions of low quality. With respect to the complete protein chain, relaxed models with at least 3.0 clashes per 100aa (and 60.0 per 100aa for unrelaxed) are excluded. Additionally, in all possible sequence sections of length 200aa, only 6.0 clashes per 100aa for relaxed models (and 120.0 per 100aa for unrelaxed) are allowed. The number of clashes is calculated using the corresponding method in ChimeraX59. Models in which a part of the protein chain passes incorrectly through the neighboring molecule are excluded by filtering out models with unusually high fractions of intermolecular β-strands.Clustering via interfacesThe symmetry complexes are clustered according to the agreement between their binding interfaces, which are compared via interface matrices (distograms). Each cluster can be viewed as a candidate for a rotational symmetry axis of the assembled S-layer. All symmetry complexes are represented by a node in a network graph, connected by edges. Every edge weight is proportional to a correlation coefficient between the interface matrices of the two symmetry complexes (nodes) connected by it. A more significant agreement between both interface matrices leads to a larger correlation coefficient (see Supplementary Method 4 for details). This network graph is partitioned using the Louvain60 method.Typically, each cluster contains differently cropped subchains, resulting in the same symmetry axis. For each cluster, high ipTM+pTM scores and, where present, several different molecule counts leading to the same order of symmetry are indicators for the fold of the respective rotational symmetry axis.Superposition of symmetry complexesPairs of symmetry complexes from 2 different clusters (symmetry axes) are tested on the possibility of repetitive 2D assembly formation by the superposition of overlapping regions. At least one domain must constitute the overlapping region. The pairs of symmetry complexes to be tested are selected according to indicators of highest scores in a cluster. In case no clear result is found, more pairs of symmetry complexes with lower ipTM+pTM scores from different clusters can be tested. From a pair of symmetry complexes (A, B), the symmetry complex (A) with the higher order of rotational symmetry is aligned in z direction and then complemented by the corresponding number of copies of the other (B) by overlapping superposition. The symmetry complexes are superposed using the matchmaker method of ChimeraX59. Subsequently, the rotational symmetry axes of symmetry complexes B are aligned in z direction using linkers between complexes A and B as pivot points. Then the symmetry complexes B are complemented by copies of the symmetry complex A by overlapping superposition and the rotational symmetry axes are aligned likewise.Parameter extractionThe superposition of the assembled S-layer is scored based on the average intermolecular clashes per residue and the bending score of the assembly. Assembly bending is assessed by the axial tilt between the rotational symmetry axes of symmetry complexes A and B. The bending score is the RMSD of the angles between the axes of the central symmetry complex A and the nearest neighbor symmetry complex B. The score is normalized so that an angle of 0 rad (0°) corresponds to a bending score of 0, and an angle of π/2 rad (90°) corresponds to a bending score of 1. Addition of the average clashes per residue (Eq. (1)) and the bending score (Eq. (2)) results in the combined quality score (Eq. (3)).$${{{\rm{score}}}}_{{{\rm{clash}}}}=\frac{{N}_{{{\rm{clashes}}}}}{{N}_{{{\rm{res}}}}}$$
(1)
$${{{\rm{scor}}}}{{{\rm{e}}}}_{{{\rm{bend}}}}=\frac{2}{\pi }\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{\left({\varphi }_{i}-{\varphi }_{0}\right)}^{2}}$$
(2)
$${{\rm{scor}}}{{{\rm{e}}}}_{{{\rm{quality}}}}={{\rm{scor}}}{{{\rm{e}}}}_{{{\rm{clash}}}}+{{\rm{scor}}}{{{\rm{e}}}}_{{{\rm{bend}}}}$$
(3)
Unit cell parameters are determined using averaged distances calculated from vector differences between the noncentral symmetry complexes A with their central symmetry mate. Function command clashes in ChimeraX is used to calculate the clash score (Supplementary Method 3)Assembly of S-layer unit cellA mathematically exact unit cell is created using the determined unit cell parameters. To obtain the best possible model, the quality score can be optimized by variation of the pairs of symmetry complexes. The final output is a mmcif file with the primitive unit cell, the unit cell parameters, and symmetry operations to generate the symmetry mates to get a fully assembled S-layer. Furthermore, in the output folder files from previous steps of SymProFold for a subchain definition, symmetry complexes (SymPlot) and interface network graphs are available.Prediction of p1 S-layerIf SymProFold terminates during the workflow and cannot create a model of a primitive unit cell, it might be due to a p1 symmetry of the S-layer. Within a p1 S-layer, no rotational symmetry axes of order 2 or higher occur, therefore SymProFold needs some adjustments to predict the assembly successfully. If a p1 symmetry is known or assumed, an alternative SymProFold method for the p1 S-layer can be manually applied. In the p1 pipeline, no set of subchains is generated, but a full-length prediction is split either manually or with Domain_Separator into all possible single domains. Sets of heterodimers of the full-length protein and each single domain are calculated. If two or more strong interactions are found, the full-length model is mapped to the individual domains, and the fully assembled layer in both directions is generated. Automated superposition of axis objects can still be applied to generate the fully assembled S-layer.Prediction of the viral capsidAs a test case, we selected the viral capsid from Odonata-associated circular virus 21 (Uniprot: A0A0B4UH63). Allowing for the possibility of a 5-fold rotational symmetry axis in the SymProFold workflow basically allows the prediction of some viral capsids. For this, oligomer predictions are performed for dimers, trimers, tetramers, pentamers, and hexamers. In our test case, the predictions are made for only one subchain. In the filtering step, the oligomer predictions are tested for the presence of rotational symmetry axes that can map subchains in the oligomer prediction to one another and fulfill rotational symmetry operations. Therefore, the algorithm checks whether the occurring angles between the monomers correspond to 2-, 3-, 4-, 5- or 6-fold rotational symmetries and whether the associated rotational symmetry axes are uniform within a tolerance range. Since the surface of viral capsids is not a flat 2D plane, the rotational symmetry axes of symmetry complexes B are not aligned parallel to symmetry complex A. SymProFold currently does not support the automated assembly of viral capsids, but the results from the best scoring rotational symmetry axes can be manually built into an orientation that results in fully assembled capsid.ResourcesFor SymProFold, we present an implementation in Python 3, which uses function libraries of ChimeraX 1.659. The Domain_Separator tool is written in Python and uses ChimeraX function libraries and dssp61 to determine secondary structures. Contacts between residues are determined using the function contacts of ChimeraX and the functions surface and measure to calculate solvent-excluded surface areas. Domain_Separator uses the Python library NetworkX.Computing requirementsAn AlphaFold-Multimer 2.3 installation in standard configuration with full databases and v3 model weights was used for all complex calculations. The default value of max. 20 recycling iterations was used. The predictions were calculated at the VSC-5 Vienna Scientific Cluster (Vienna, Austria) on a GPU (NVIDIA A100, 40GB of vram).Protein expressionThe codon-optimized sequences (E. coli) for Varv_VI (Viridibacillus arvi; amino acids 765–844) and Mvol_anchor (Methanococcus voltae; amino acids 24–75 and 484–576 connected with a GGS linker) were purchased precloned in pET24 (BioCat GmbH, Heidelberg, Germany). The 5 µg of plasmid were dissolved in 50 µl water and 0.5 µl were transformed into One Shot BL21 Star (DE3) Chemically Competent E. coli (Thermo Fisher Scientific, Waltham, MA, U.S.A.) and plated on LB agar containing 35 µg/ml kanamycin. Single colonies were picked for overnight cultures in LB broth containing 35 µg/ml kanamycin. Main cultures in 100 ml LB broth containing kanamycin were inoculated 1/100 with ONC and grown to an OD600 between 0.4 and 0.7 at 37 °C. The expression was induced by the addition of 0.5 mM IPTG, the temperature was reduced to 20 °C and carried out ON. Expression cultures were harvested by centrifugation (2800 g, 20 min, 4 °C). The supernatant was discarded, and the pellet was frozen (−20 °C until further use).Protein purificationExpression pellets were thawed and resuspended in 20 ml lysis buffer (50 mM HEPES pH 7.5, 300 mM NaCl) and sonicated for cell disruption (Bandelin Sonoplus sonicator at 80%, 5 cycles, 5 min, on ice). The supernatant containing the soluble protein was filtered (Rotilabo syringe filter, PVDF, pore size 0.45 µm, Carl Roth GmbH and Co., Karlsruhe, Germany) before loading samples onto an ÄKTA pure system from GE Healthcare (Chicago, United states). The HisTrap FF affinity column 5 ml from GE Healthcare was equilibrated with the previously mentioned lysis buffer. The column was washed with 20 column volumes with 5% elution buffer (50 mM HEPES pH 7.5, 300 mM NaCl, 500 mM imidazole). The proteins were eluted with 50% elution buffer, concentrated using Amicon Ultra centrifugal filters (Millipore, Merck KGaA, Darmstadt, Germany) to a volume of 500 µl and subjected to gel filtration with SEC buffer (25 mM HEPES pH 7.5, 150 mM NaCl) with a Superdex 200 Increase 10/300 column (Cytiva, Marlborough, MA, USA). Peak fractions were pooled and concentrated for crystallization.CrystallizationCrystallization experiments were performed using vapor diffusion sitting drop in Swissci UVXPO 3 Lens crystallization plates (High Wycombe, United Kingdom). Pipetting was carried out with an Oryx 8 robot (Douglas Instruments, United Kingdom) with 35 µL of condition solution in the reservoir and drops of 0.3 µl protein; the concentration ranges from 10 mg/ml to 22 mg/ml, mixed with 0.3 µl screening solution (JCSG+ eco screen, Molecular Dimensions, Calibre Scientific, Rotherham, UK).Data collection and processingCrystals were frozen in liquid nitrogen and data collection of all crystals was performed at 100 K. Crystal screening and data collection were carried out with an in-house dual port system made up of a MetalJet X-ray source (Excillium, Kista, Sweden), a D8 Venture X-ray diffractometer (Bruker, Billerica, USA) and a Photon III detector (Bruker, Billerica, USA). Data processing was performed with DIALS 3.862, and data reduction with pointless and aimless63. For the dataset collected for Methanococcus voltae, an anisotropic cutoff was applied with the Staraniso webserver64.Structures solution, refinement, and analysisBoth structures were solved by molecular replacement using Phaser65 with the predictions presented in this manuscript (monomeric fragments of the respective amino acid ranges) as templates. Refinement was performed with Refmac66 and phenix.refine67. Structures were deposited at the RCSB Databank with the PDB codes 9FS9 and 9FSA. Table1 containing the collection and refinement statistics is available in Supplementary Method 8 and Supplementary Table 6.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles