GTalign: spatial index-driven protein structure alignment, superposition, and search

Structure representationGTalign offers users the flexibility to configure and choose which protein structure atoms will serve as representatives. By default, protein structures are represented using alpha-carbon atoms. All experiments conducted with GTalign were performed using alpha-carbon atoms as representatives.Algorithm outlineThe GTalign software takes as inputs query and subject (referred to as “reference” in the software) structure databases of arbitrary size. GTalign processes this data in chunks, aligning batches of query structures with batches of subject structures, both sorted by length, iteratively until all possible batch pairs are completed. A similar batch-oriented approach to processing large databases has been described previously27. Below, we outline the (sequential) algorithmic steps representing the actions performed on a pair of batch query and subject structures.Certain steps in the algorithm involve the alignment refinement procedure described in Algorithm 10 (RefineBestAlignments) specified in Supplementary Section S2. Algorithm 10 optimizes TM-scores and refines alignments by considering differently positioned alignment fragments of different lengths. It takes three parameters: the numbers of query and subject structures, where all possible pairs are processed in parallel, and a gap opening penalty for the COMER2 DP algorithm27 to generate alignments that optimize TM-scores given the superpositions. The complete outline of the GTalign algorithm is provided below.

1.

Index query and subject structures and store the spatial indices in a k-d tree data structure.

2.

Assign secondary structure states to the structures at each residue in parallel. This assignment is determined by the coordinates of five residues centered around the residue under consideration, with distance cutoffs between residues optimized in ref. 21.

3.

Calculate transformation matrices based on continuous fragment pairs in parallel for all queries and subjects, their matched fragment pairs, and fragment positions.

4.

Apply Algorithm 10 with parameters (nQ, nS, 0) to refine alignments obtained from the superpositions found in the previous step to maximize the TM-score (always normalized by the length of the shorter protein). nQ and nS are the numbers of query and subject structures in batches. Here and in the following steps, keep track of the maximum TM-score and the corresponding transformation matrix for all query-subject pairs.

5.

If the option –add-search-by-ss is specified, apply Algorithm 10 with parameters (nQ, nS, {−0.6, 0}) to refine alignments obtained from the application of the COMER2 DP algorithm27 using a scoring function based on secondary structure matching and sequence similarity score40.

6.

Apply Algorithm 1 (DeepSuperpositionSearch) to find the most favorable superpositions through a deep search using spatial indices. The search depth is controlled with the –speed option. This step is central to the GTalign method because it enables rapid exploration of the superposition space, resulting in accurate alignments. We provide a detailed specification of Algorithm 1 in Supplementary Section S2.

7.

Apply Algorithm 10 with parameters (nQ, nS, {−0.6, 0}) to refine alignments obtained from the application of the COMER2 DP algorithm using a scoring function based on secondary structure matching and TM-score, given the optimal transformation matrices obtained so far.

8.

Apply Algorithm 10 with parameters (nQ, nS, {−0.6, 0}) to refine alignments obtained from the application of the COMER2 DP algorithm using TM-score as a scoring function. Here, the number of repetitions in Algorithm 10 is configurable (option –convergence).

9.

Produce final alignments using the COMER2 DP algorithm based on the optimal transformation matrices in parallel for all queries and subjects.

10.

Calculate TM-scores, root-mean-squared differences (RMSDs), and other alignment-related statistics in parallel for all queries and subjects.

Steps 1 and 2 prepare data for processing. Steps 3 and 4 identify protein superpositions by matching continuous protein segments, similar to TM-align’s initial gapless matching21. These steps are sufficient to capture optimal superpositions of proteins sharing high structural similarity over a significant fraction of the length of at least one protein of a pair. Step 5 occasionally improves superpositions found in the previous steps. Step 6 conducts an extensive superposition search by matching different protein spatial regions. Steps 7 and 8 represent the refinement of transformation matrices and related alignments obtained earlier, meaning that alignment regions typically do not change or change slightly. Steps 9 and 10 prepare results for output. All the steps are based on algorithms and data structures designed to maximize instruction and memory throughput.Spatial index data structureTo accelerate the superposition search, protein structures are initially indexed (step 1 of the algorithm outline). Each structure’s index is stored in a k-d tree data structure, which hierarchically organizes protein atom coordinates. This organization allows for the retrieval of the nearest neighbor in the tree for a query atom with specified coordinates in constant, O(1) time.Accelerated superposition search using spatial indexingThe accelerated superposition search process (step 6 of the algorithm outline) leverages spatial indexing to find optimal superpositions for query and subject proteins within a data chunk. Conducted in parallel for all query-subject protein pairs in the chunk, this process explores numerous initial superposition configurations per protein pair simultaneously.Initially, the process calculates initial superpositions, or transformation matrices, based on continuous query and subject protein fragments spanning the entire extent of both proteins. The search depth, determining the number of superpositions to explore, and the fragment length depend on the query and subject protein lengths, with the fragment length not exceeding 100 residues. Regions with low local secondary structure similarity between query-subject fragment pairs avoid the calculation and exploration of initial superpositions.Upon completing initial superpositions, the shorter protein undergoes transformation to obtain spatial overlays. Then, alignments are generated between the shorter protein and the other using the longer protein’s index in parallel over residues, achieved in constant time complexity. This routine repeats twice: Initially produced alignments refine spatial overlays, followed by repeated alignment production using the protein index while ensuring matching protein secondary structure this time.The most favorable alignment for the query-subject pair is then selected based on the highest TM-score. However, alignments obtained using spatial indices are sequence order-independent. Therefore, approximate sequence order-dependent TM-scores are computed from these alignments, with one structure transformed, in sub-linear time considering a maximum of 512 aligned residues.Next, a small subset of transformation matrices with the highest approximate scores is chosen for TM-score calculation using the COMER2 DP algorithm. Further refinement involves selecting an even smaller subset of transformation matrices corresponding to the highest TM-scores to optimize alignments, considering different-length and differently positioned alignment fragments. Finally, the best alignment for the query-subject pair is selected and refined similarly, employing full DP and TM-score optimization.A detailed specification of this procedure is provided in Supplementary Section S2.Calculation of rotation matricesGTalign computes rotation matrices using the Kabsch algorithm41,42. Solving for the eigenvalues and eigenvectors of a cross-covariance matrix, \(K\in {{\mathbb{R}}}^{3\times 3}\) (R in the original notation), requires double-precision arithmetic. To render the problem solvable in single precision, thereby boosting instruction and memory throughput, K is normalized by the mean of the absolute values of its elements. It is easy to show that this operation corresponds to scaling the coordinates of protein atoms. Effectively, rotation matrices for large proteins can be considered as obtained using coordinates expressed in nanometers instead of Angstroms, preventing single-precision arithmetic overflow and underflow. The resulting rotation matrices exhibit an insignificant error (on the order of 10−5 on average) with no discernible impact on superposition and structural alignment while still ensuring high performance.Dynamic programming implementationA previously published algorithm27 was employed to implement the dynamic programming (DP) algorithms. The time complexity to calculate DP matrices is \(O({\max }_{q}{l}_{q}+{\max }_{s}{\tilde{l}}_{s})\), with a constant factor dependent on the number of threads running in parallel. (The computation involves \(({\max }_{q}{l}_{q}+{\max }_{s}{\tilde{l}}_{s})/32\) iterations of independent and parallelized calculations, executed in O(32) time by GPU threads.) Here, the maximums are taken over the lengths of all query (lq) and subject (\({\tilde{l}}_{s}\)) proteins in a data chunk. In instances where DP matrix values are only required to update backtracking information, the memory complexity for DP matrices is \(O({n}_{{{{\rm{Q}}}}}\, {\sum }_{s}\, {\tilde{l}}_{s})\)27, with nQ representing the number of query proteins in the chunk.DP matrices are built using a gap opening penalty, while the gap extension cost is set to 0 unless otherwise specified. For optimizing memory usage, match scores are modified before writing to memory—either by negating them for non-negative alignment scores or subtracting a large constant for potentially negative scores. The reverse operation is subsequently applied upon reading from memory. This approach minimizes memory requirements, facilitating greater data accommodation and parallelization.Algorithm efficiencyThe high efficiency of the developed algorithms, particularly the spatial indices, is best demonstrated by aligning large protein structures. For instance, GTalign only took seconds to align and provide a superposition for two virus nucleocapsid variants, 7a4i and 7a4j (37,860 residues each), featuring different chain orders on a single Tesla V100 GPU (Supplementary Fig. S9). In contrast, aligning these complexes using TM-align took more than three months. Although TM-align is not typically used for aligning complexes, GTalign’s efficiency may open additional possibilities for exploring large complexes when chain order preservation is important.Performance improvement potentialGTalign’s efficiency can be further enhanced by considering three key aspects. First, GTalign currently uses 32-bit floating-point precision (FP32) operations. Exploring the adoption of 16-bit (FP16) or even 8-bit (FP8) floating point precision before the final stages of alignment has the potential to increase the degree of parallelization by 2 to 4-fold.Second, the COMER2 DP algorithm, a critical component for accuracy, is employed several times throughout the structural alignment search procedure. Substituting it with spatial matching, as outlined in Supplementary Section S2, at all intermediate stages and reserving it solely for the final alignment stage could result in a significant speedup.Finally, the third aspect involves similarity selection on the coarse scale. By encoding structures with embeddings and utilizing indexed vector databases43, GTalign could achieve nearly instantaneous selection of similar protein candidates and a constant-time database search and alignment, regardless of the database size.Prescreening for similarities in sequence and structure spaceGTalign allows for an initial screening in the sequence space (option –pre-similarity) to identify potential similarities before engaging in more detailed structural analysis. The implementation of this procedure is based on calculating local ungapped alignment scores between protein sequences using a sequence similarity score table40 and does not involve dynamic programming. Protein pairs with alignment scores exceeding a specified threshold progress to the subsequent stages of structural analysis.In addition, an initial screening for similarities is available in the structure space using the –pre-score option. With this option, protein pairs with provisional TM-scores, obtained in step 4 of the algorithm outline, lower than a specified threshold are excluded from further processing.GTalign softwareGTalign incorporates several key features that contribute to its versatility and user-friendly nature. Developed using the OpenMP standard for CPUs and CUDA architecture for GPUs, GTalign is compatible with various computing architectures, including NVIDIA Pascal, Turing, Volta, Ampere, Ada Lovelace, and subsequent GPU architectures. (The GPU version exhibits a 10–20x increase in speed.) Its independence from external packages ensures seamless operation across different compilers (GCC, LLVM/Clang, MSVC) and their respective versions. GTalign is cross-platform software, with binary packages precompiled for Linux and Windows x64 operating systems. For other platforms, users have the flexibility to compile GTalign from its source code. GTalign usage is straightforward: No structure database preprocessing is required. Users can effortlessly employ GTalign by directly providing files, compressed files (gzip), directories, and/or archives (tar) of protein structures as command-line arguments. This user-centric design enhances accessibility and facilitates streamlined integration into diverse computational environments.Alignment accuracy evaluationThe evaluation of structural alignment accuracy is based on assessing how accurately the structural alignments of protein pairs translate to spatial agreement in their respective structures. This self-contained evaluation is unbiased, as it does not depend on external classifications, which may be constructed using specific sequence and structure alignment tools.The superposition of two aligned proteins is evaluated by the TM-score and RMSD, calculated by the established method TM-align21 using the -I option (Fig. 7). Notably, in this setting, TM-align does not perform a global superposition search but instead optimizes superposition constrained by a given alignment, leaving it unchanged.Fig. 7: Schematic for benchmarking structure alignment tools.The entire procedure can be described as follows: (i) Run a structure alignment tool; (ii) Use TM-align to calculate the TM-score for each produced alignment; (iii) Sort the alignments by the tool’s measure (e.g., P-value, Z-score, etc.); (iv) In addition, sort the alignments by the TM-score calculated by TM-align; (v) Finally, calculate the cumulative TM-score for the results, considering both the sorting by the tool’s measure and the sorting by the TM-align-obtained TM-score. This provides a comprehensive measure of how accurately the tool produces alignments and their rate. It’s worth noting that seemingly subtle differences in cumulative TM-score can be significant, especially considering the narrow gap between successive TM-scores; a mere 0.2 difference can distinguish between an accurate and inaccurate alignment.GDT_TS scores were calculated using the TM-score tool26, with minimal modifications to the source code to normalize GDT_TS by the number of aligned residue pairs. The adapted TM-score code is publicly available.In the benchmarks, alignments are evaluated based on (i) the TM-score normalized by the length of the shorter protein and (ii) the TM-score normalized by the query length. The first scenario considers all structural similarities, including instances where smaller proteins match regions of larger proteins. The second scenario downgrades the importance of alignments between the query and a much shorter subject protein, providing a more favorable position for some methods (e.g., Dali22) as their measures (e.g., Z-score) reduce the significance of such alignments.SCOPe-based evaluationThis evaluation aimed to assess the ability of the tools to match SCOPe 2.0829 domains to families, superfamilies, and folds. True positives (TPs) at the family, superfamily, and fold level were defined as pairs of structures from the same family, the same superfamily but different families, and the same fold but different superfamilies, respectively. Self-matches were excluded. The sizes of these groups are referred to as effective sizes. False positives (FPs) were identified as pairs from different folds.Precision and recall were calculated as #TP/(#TP + #FP) and #TP/P, respectively, where P represents the total number of positive pairs. The number of TPs, #TP, and P for precision-recall (PR) curves were downweighted by the effective size of family, superfamily, and fold for respective-level calculations. The number of FPs, #FP, was downweighted by the effective fold size. The weighting for counts was consistent with the approach used in ref. 25.Before conducting sensitivity and PR analyses, alignments generated by the tools were sorted by their significance measure. Foldseek (default parametrization) alignments were sorted by E-value, while FATCAT alignments were sorted by P-value, and Dali alignments by Z-score. DeepAlign alignments were sorted by DeepScore. TM-align and GTalign alignments were sorted by the harmonic mean of the TM-scores normalized by the query and subject lengths. The harmonic mean proved superior to the arithmetic mean for TM-align and GTalign alignments due to its ability to reduce significance for structure pairs with large length differences. However, the arithmetic mean was more suitable for Foldseek –tmalign-fast 1 and –tmalign-fast 0 alignments, as most of such pairs had already been filtered out.Secondary TM-scores, referred to as 2TM-scores, were introduced to rank GTalign alignments in the SCOPe-based evaluation. The 2TM-score is calculated over the alignment excluding unmatched helices and provided slightly improved results for fold-level evaluations. Options to calculate 2TM-scores (–2tm-score) and rank alignments by the harmonic mean of the TM-scores or 2TM-scores are available starting with version 0.15.0.The SCOPe40 2.08 datasetAll protein domains from the SCOPe 2.0829 database filtered to 40% sequence identity (SCOPe40 2.08), totaling 15,177, were searched with query protein domains selected randomly, one per superfamily, from the same SCOPe40 2.08 dataset. Representatives that Dali22 failed to reformat for its initial structural representation were omitted, resulting in a total of 2045 queries.To ensure consistent structure interpretation between TM-align and the other tools, the structure files underwent the following changes: (i) the first model of multi-model files was retained; (ii) the chain identifier was set to ‘A’ to make a single-chain structure; (iii) residues were renumbered sequentially; (iv) residues lacking at least one of the N, CA, C, and O atoms were removed. HETATM records were disregarded when using Foldseek as its interpretation of these records differed from that of TM-align.The PDB20 datasetPDB30 structures filtered using blastclust44 version 2.2.26 to 20% sequence identity with a length coverage threshold of 70% (PDB20), totaling 18,801, were queried with 186 CAMEO protein structure targets45 released over 3 months from 07/24/2021 through 10/16/2021. The CAMEO targets and the PDB20 structures maintained no more than 20% sequence identity.The structure files were preprocessed to ensure consistent structure interpretation across the tools: HETATM records and residues lacking at least one of the N, CA, C, and O atoms were removed. Also, the first model of multi-model files was retained.The Swiss-Prot datasetAll UniProtKB/Swiss-Prot31 protein structures (542,378) from the AlphaFold Database4 were searched with 40 proteins representative of structurally diverse CRISPR-Cas systems46. The selection of the 40 query proteins followed a specific process: First, the 5831 PDB protein chains associated with CRISPR-Cas systems (downloaded on 10/19/2023) were clustered at a TM-score threshold of 0.4 with a length coverage threshold of 40% using GTalign with options –speed=13 –add-search-by-ss –cls-coverage=0.4 –cls-threshold=0.4 –ter=0 –split=2. Subsequently, the top 40 members from every third singleton cluster, sorted by length, were chosen as queries, with an average length of 382 residues. The query structures underwent preprocessing, involving the removal of HETATM records and residues lacking at least one of the N, CA, C, and O atoms.The HOMSTRAD datasetThe HOMSTRAD dataset, comprising reference structural alignments of protein families and accompanying structure files, was obtained from ref. 34, containing 398 multiple protein structure alignments from the HOMSTRAD database32. (The original data were inaccessible). For benchmarking purposes, each family’s first protein from the reference alignments was aligned with every other protein of the same family, resulting in a total of 1722 pairwise alignments.Computer system configurationUnless otherwise specified, all benchmark tests were conducted on a server equipped with two Intel Xeon Gold 5115 CPUs @ 2.4 GHz (20 hardware threads per CPU), 128GB DDR4 RAM, and three NVIDIA Tesla V100-PCIE-16GB GPU accelerators, running the CentOS 7 operating system.Runtime evaluationThe runtimes of all tools were measured by the Linux time command.GTalign settingsUnless otherwise specified, all analyses were performed using GTalign version 0.14.0, compiled with GPU support. For protein pairs in the HOMSTRAD dataset, alignments were generated with the command

gtalign –qrs=<query_file> –rfs=<subject_file> -o <output_dir> –hetatm –dev-min-length=3 –speed=<speed> –pre-score=0 -s 0 –add-search-by-ss –dev-mem=4096

The command used to process queries for the SCOPe40 2.08, PDB20, and Swiss-Prot datasets wasgtalign –qrs=<query_dir> –rfs=<db_dir> -o <output_dir> –hetatm –dev-queries-total-length-per-chunk=1500 –dev-min-length=3 –dev-max-length=<max_len> –speed=<speed> -s 0.44 –add-search-by-ss –nhits=<n_hits> –nalns=<n_hits> –dev-N=3 where <query_dir> represents the directory of query structures, <db_dir> denotes the directory or tar archive (Swiss-Prot dataset) containing subject structures, <max_len> is 5000 for the PDB20 dataset and 4000 for the others, and <n_hits> is 10,000, 4000, and 50,000 for the SCOPe40 2.08, PDB20, and Swiss-Prot datasets, respectively. The value of the –speed option, along with the usage of several additional options (–pre-score and –pre-similarity), is indicated in Figs. 1–5 and Supplementary Tables and Figures. When employing the initial screening in the sequence space (option –pre-similarity), options -s 0.3 and -c cachedir were specified. The latter was used unconditionally for the Swiss-Prot dataset. In the SCOPe-based evaluation, the –2tm-score option was specified to calculate 2TM-scores (version 0.15.0).GTalign calculates and outputs TM-scores normalized by the length of both proteins in a pair. Consequently, the corresponding TM-scores were utilized as GTalign’s measures to sort alignments (left panels of Figs. 1b–d and 2a–c).TM-align settingsParallel processing of all queries for each dataset was achieved by iteratively running 40 instances of TM-align21 version 20220412 simultaneously. For the HOMSTRAD and three other datasets, each process instance was executed with the following options, respectively: <query> <subject_file> -het 1 and <query> -dir2 <db_dir> <lst_file> -het 1. Here, <query> represents a query file, <db_dir> is a directory of subject structure files, and <lst_file> is a list file of all subjects. For the Swiss-Prot dataset, -outfmt 2 was included to reduce disk space usage. The fast version (TM-align -fast) utilized an additional option, -fast.Dali settingsThe standalone version DaliLite.v522 was employed in the benchmark tests. Prior to initiating searches, structure files underwent reformatting to an initial representation using the command import.pl –pdbfile <struct_file> –dat <dir> –pdbid <id>. In this command, <struct_file> refers to a structure file, <dir> is a directory for reformatted structures, and <id> is an assigned structure identifier. Reformatting failed for 104 and 525 subject structures from the SCOPe40 2.08 and PDB20 datasets, respectively. The time taken for reformatting was excluded from runtime evaluations.For parallelizing searches, 40 process instances were executed using the command dali.pl –np 40 –query <query_list> –db <sbjct_list> –dat1 <query_dir> –dat2 <db_dir> –outfmt alignments. In this command, <query_list> and <sbjct_list> represent the list files of query and subject structures, respectively, while <query_dir> and <db_dir> indicate the directories of (reformatted) query and subject structures. Dali’s Z-score served as the sorting measure, arranging the alignments in descending order based on it.For the HOMSTRAD dataset, reformatting was not used, and alignments were directly generated with the command dali.pl –pdbfile1 <query_file> –pdbfile2 <subject_file> –dat1 <query_dir> –dat2 <subject_dir> –outfmt alignments.DeepAlign settingsAll query-subject structure pairs in each dataset underwent parallel processing through the iterative execution of 40 simultaneous instances of DeepAlign18 version v1.4 Aug-20-2018 (https://github.com/realbigws/DeepAlign), using the command DeepAlign <query_file> <subject_file>.DeepAlign outputs the TM-score normalized by the length of the shorter protein, which was used as its measure for sorting alignments in the corresponding evaluations (left panel of Fig. 1b, c). When evaluating alignments based on the TM-score normalized by the query length, DeepAlign’s DeepScore was utilized as its measure to sort the alignments (left panel of Fig. 2a, b).FATCAT settingsFATCAT 2.028 searches were conducted iteratively for all queries in three datasets using the rigid structural alignment setting. The command FATCATSearch <query_file> <sbjct_list> -i2 <db_dir> -r -o <output_file> -m was utilized, with <sbjct_list> and <db_dir> representing the list file and the directory of subject structures, respectively. For each query, FATCAT automatically initiated parallel processes corresponding to the number of processors in the system, in this case, 40. FATCAT’s P-value served as the sorting measure, i.e., the alignments were sorted in ascending order based on it. It is noteworthy that FATCAT disregards HETATM records, and thus, these records were also omitted during alignment accuracy evaluation with TM-align.For the HOMSTRAD dataset, alignments were generated with the following command: FATCAT -p1 <query_file> -p2 <subject_file> -r -o <output_file> -m.Foldseek settingsFoldseek25 version d1d1b868a571a9a0c62ae50b07139ebdd224f879 was used (downloaded on 6/25/2023). All queries in the SCOPe40 2.08, PDB20, and Swiss-Prot datasets were parallelized using the following command:

foldseek easy-search <query_dir> <db_dir> <output_file> <tmp_dir> –threads 40 –max-seqs 4000 –format-output query,target,evalue,qtmscore,ttmscore,alntmscore,rmsd,qstart,qend,qlen,tstart,tend,tlen,qaln,taln

In this command, <query_dir> denotes the directory of query structures, <db_dir> indicates the directory or tar archive (Swiss-Prot dataset) containing subject structures, and <tmp_dir> is a temporary directory. For the Swiss-Prot dataset, the option value –max-seqs 20000 was specified. Foldseek was additionally parameterized with the options –alignment-type 1 and –tmalign-fast 0 to enable the use of the fast (Foldseek –tmalign-fast 1) and regular (Foldseek –tmalign-fast 0) versions of TM-align for alignment production.As Foldseek generates TM-scores, the TM-score normalized by the length of the shorter protein was employed as the sorting measure for alignments in the left panel of Fig. 1b–d. When evaluating alignments based on the TM-score normalized by the query length (Fig. 2a–c), E-value and the average TM-score (Foldseek –tmalign-fast 0/1), a recommended metric25, were utilized as measures to sort the alignments in ascending and descending order, respectively.For the HOMSTRAD dataset, the command was modified to specify individual query and subject structure files instead of their directories. Also, the option –threads was set to 1, and the additional options –prefilter-mode 2 and -e 1e6 were included.Figure preparationMolecular graphics images were generated using UCSF Chimera47 version 1.14. Plots were created using the ggplot2 package48 in R49, versions 3.6.0 and 4.3.2.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles