Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation

Fig. 2 This figure illustrates the architectures of our BRI and BSD modules and the training procedure. The roles of these modules within the prediction procedure are explained in "Decomposition into sub-tasks" section and Fig. 1. (A) Our BRI module utilizes a 3D CNN and geometric attention layers to score each candidate residue within radius \(17\,\mathrm{\AA}\). A more detailed illustration of the architecture is provided in "The BRI module" section. For inference, residues with scores higher than 0 (sigmoid probability 0.5) constitute the predicted binding residues \(\hat{R_i}\). For training, the scores are compared with the true binding residues, where a residue counts as binding if any of its non-hydrogen atoms is within \(4\,\mathrm{\AA}\) of any ligand non-hydrogen atom. Note that this \(4\,\mathrm{\AA}\) condition was used in [1] and [25] to find binding site residues and atoms, respectively. More technical details of the training are provided in "Training the BRI module" section. (B) Our BSD module utilizes the same backbone architecture as the BRI module but is followed by additional layers to produce outputs on the level of candidate binding sites. A more detailed illustration of the architecture is provided in "The BSD module" section. For inference, the n top-scored candidate binding sites constitute the predicted binding sites \(\left\{ \hat{c_i}\right\} _{i=1}^{n}\). For training, the scores are compared with the true binding sites, where a candidate counts as a true binding site if it is within \(4\,\mathrm{\AA}\) of any ligand non-hydrogen atom (i.e., the distance from center to atom (DCA) is smaller than \(4\,\mathrm{\AA}\)), following [1] and [25]. More technical details of the training are provided in "Training the BSD module" section. (C) Our BSD module is trained in two stages, where in the second stage, the parameters of the parts shared with the BRI module are initialized from the result of training a BRI module in the first stage.

The model architecture

The candidate generation module

To generate the binding site candidate centers, we use the external software Fpocket [21]. Given a protein structure, Fpocket finds sets of heavy atoms \({\hat{S}}_1,\ldots , {\hat{S}}_m\), each corresponding to a region geometrically likely to be a binding pocket. We then obtain the candidate centers \({\hat{c}}'_i\) (\(i=1,\ldots ,m\)) by taking the center of mass of the atoms in \({\hat{S}}_i\).

We chose Fpocket as the candidate generation method because it achieves a sufficiently high recall rate (96.4% on scPDB v.2017, according to [11]). This means that, for a given protein and its binding site, it is likely that at least one of the generated candidates corresponds to the binding site. Then, provided that the BSD module ranks the candidates properly, the top-n candidates approximate the true binding site centers with high accuracy.

The BSD module

The BSD module takes as input the protein structure and a candidate binding site center \({\hat{c}}'\) and outputs the predicted druggability at \({\hat{c}}'\). In this process, it featurizes the surroundings of \({\hat{c}}'\) into a set of per-residue 3D grids and processes the grids through a neural network to produce the output. Each grid in the set corresponds to a residue close enough to \({\hat{c}}'\) (distance threshold \(17\,\mathrm{\AA}\)) and encodes the local environment of that residue. A minimal sketch of the candidate-center computation and residue selection is given below.
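The following is a minimal sketch, assuming the pocket atom sets have already been parsed from Fpocket's output into numpy arrays and using alpha-carbon positions for the residue selection; the function names are illustrative, not our pipeline's actual interfaces.

```python
import numpy as np

CUTOFF = 17.0  # angstroms; residue-selection radius used by the BSD/BRI modules

def candidate_center(pocket_atom_coords: np.ndarray) -> np.ndarray:
    """Center of mass of one Fpocket atom set S_i, shape (n_atoms, 3)."""
    return pocket_atom_coords.mean(axis=0)

def select_residues(ca_coords: np.ndarray, center: np.ndarray) -> np.ndarray:
    """Indices of residues whose alpha carbon lies within CUTOFF of the center."""
    dists = np.linalg.norm(ca_coords - center, axis=1)
    return np.where(dists <= CUTOFF)[0]

# Usage: `pockets` is a list of (n_atoms, 3) arrays parsed from Fpocket output.
# centers = [candidate_center(S) for S in pockets]
# residues_per_candidate = [select_residues(ca_coords, c) for c in centers]
```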
Note that the distance threshold \(17\,\mathrm{\AA}\) is justified in Section 5 of Supplementary Information. In short, (1) it is comparable to the input size of Deeppocket's segmentation model, and (2) it reflects a consideration of computational resources.

The neural network of the BSD module is composed of (1) a residue-local feature extraction unit that runs in parallel for each grid, (2) an aggregation unit that globally aggregates the local features, and (3) a reduction unit that maps the aggregated feature to a single scalar quantity. The feature extraction unit is a 3D CNN model, and the aggregation unit is composed of several geometric self-attention layers. The reduction unit is composed of a point-wise feed-forward layer and a mean-reduction operation.

The BRI module

The BRI module takes in the protein structure and a putative binding site \({\hat{c}}\) as inputs and outputs the set of predicted binding residue indices. The BRI module shares the residue-local feature extraction and global aggregation units with the BSD module. More specifically, it shares (1) the CNN feature extractor and (2) the stack of geometric attention layers up to the penultimate one in the BSD module. The remaining part of the BRI module, however, comprises only a point-wise feed-forward layer without a mean-reduction operation. Consequently, the outputs of the last layer are used to determine (with the threshold 0) whether the corresponding residues are binding site residues or not.

The CNN model

For the CNN architecture, our BSD and BRI modules use the 3D bottleneck ResNet model adopted in [25]. The model is adapted from the bottleneck ResNet model introduced in [13] for image classification. The bottleneck architecture reduces the number of parameters, thereby facilitating the employment of a deeper network. [25] demonstrated that the 3D bottleneck ResNet model, despite its lightweight design, achieved competitive performance in comparison to its non-bottleneck counterpart.

The geometric attention layers

For the geometric attention layers, our BSD and BRI modules use the attention mechanism introduced in [18], with a slight adjustment to accommodate the form of our inputs. The inputs of the attention layers are composed of the following:

\(x_i\in \mathbb {R}^{d_{hidden}}\) (\(i=1,\cdots ,n\)), hidden vectors associated with the residues.

\(T_i=(R_i, t_i)\in SO(3)\times \mathbb {R}^3\) (\(i=1,\cdots ,n\)), the local frames associated with the residues, where \(t_i\) is the position of the alpha carbon and \(R_i\) is the rotation matrix that represents the residue orientation (see 1.3 in Supplementary Information). Note that the operation \(v\mapsto T_i v = R_i v + t_i\) maps local coordinates (with respect to the local frame) to the corresponding global coordinates, and the operation \(u\mapsto T_i^{-1} u\) does the reverse. A minimal sketch of these frame operations is given below.
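For concreteness, assuming each frame is stored as a rotation matrix plus a translation vector (function names are illustrative), the two maps can be written as:

```python
import numpy as np

def to_global(R: np.ndarray, t: np.ndarray, v: np.ndarray) -> np.ndarray:
    """T v = R v + t: local-frame coordinates -> global coordinates."""
    return R @ v + t

def to_local(R: np.ndarray, t: np.ndarray, u: np.ndarray) -> np.ndarray:
    """T^{-1} u = R^T (u - t): global coordinates -> local-frame coordinates."""
    return R.T @ (u - t)
```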

Then, the computation is carried out in the following steps (a code sketch of the full layer is provided after the list):

(1) The standard query and key vectors \(q_i^h\) and \(k_i^h\) are computed by linear mappings from \(x_i\). Here, h denotes a "head".

(2) The geometric query and key vectors \({\textbf {q}}_i^{hp}\) and \({\textbf {k}}_i^{hp}\) in \(\mathbb {R}^3\) are computed by linear mappings from \(x_i\). Here, h denotes a "head" and p denotes a "point" of attention.

(3) The attention weight from the i-th token to the j-th token is computed from a linear combination of the standard attention weight $$\begin{aligned} w_{ij}^{h,standard}=\frac{1}{\sqrt{d_{hidden}}}q_i^h\cdot k_j^h \end{aligned}$$
(3.1)
and the geometric attention weight $$\begin{aligned} w_{ij}^{h,geometric}=\frac{1}{\sqrt{N_{points}}}\sum _{p}\left\| T_i{\textbf {q}}_i^{hp} -T_j{\textbf {k}}_j^{hp}\right\| \end{aligned}$$
(3.2)
by applying a softmax operation. More precisely, the attention weight becomes $$\begin{aligned} w^h_{ij} = softmax_{j} \left(\frac{1}{\sqrt{2}}\left(w_{ij}^{h,standard}-\log \left(1+\gamma ^h\right)w_{ij}^{h,geometric}\right)\right) \end{aligned}$$
(3.3)
where \(\gamma ^h\) is a learnable parameter.

(4) The standard value vectors \(v_j^h\) are computed by a linear map from \(x_j\) and aggregated as $$\begin{aligned} o_i^h=\sum _{j}w_{ij}^hv_j^h \end{aligned}$$
(3.4)

(5) The geometric value vectors \({\textbf {v}}_j^{hp}\) are computed by a linear map from \(x_j\) and aggregated as $$\begin{aligned} {\textbf {o}}_i^{hp} = T_i^{-1} \left( \sum _{j}w_{ij}^hT_j{\textbf {v}}_j^{hp}\right) \end{aligned}$$
(3.5)

(6) The aggregated vectors as well as their norms are concatenated and linearly mapped via \(f_{final}\) to produce the output of the attention layer $$\begin{aligned} x_i'= f_{final}\left(concat_{h,p}\left(o_i^h,{\textbf {o}}_i^{hp},\left\| {\textbf {o}}_i^{hp}\right\| \right)\right) \end{aligned}$$
(3.6)

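To make the computation concrete, the following PyTorch sketch implements steps (1) through (6) for a single protein (no batching). It is an illustration under stated assumptions, not our exact implementation: the dimension names (d_hidden, n_heads, n_points) and their defaults are hypothetical, frames are passed as rotation matrices R and translations t, and numerical safeguards (e.g., constraining \(\gamma^h > -1\)) are elided.

```python
import torch
import torch.nn as nn

class GeometricAttention(nn.Module):
    """Sketch of the attention layer in Eqs. (3.1)-(3.6); names are illustrative."""

    def __init__(self, d_hidden: int = 128, n_heads: int = 4, n_points: int = 4):
        super().__init__()
        self.h, self.p, self.d = n_heads, n_points, d_hidden // n_heads
        self.scale = d_hidden ** 0.5
        # Standard q/k/v (Eqs. 3.1, 3.4) and geometric point q/k/v (Eqs. 3.2, 3.5).
        self.qkv = nn.Linear(d_hidden, 3 * d_hidden, bias=False)
        self.qkv_pts = nn.Linear(d_hidden, 3 * n_heads * n_points * 3, bias=False)
        self.gamma = nn.Parameter(torch.ones(n_heads))  # gamma^h in Eq. (3.3)
        # f_final maps concat(o, o_pts, ||o_pts||) back to d_hidden (Eq. 3.6).
        self.f_final = nn.Linear(n_heads * (self.d + 4 * n_points), d_hidden)

    def forward(self, x, R, t):
        # x: (n, d_hidden) hidden vectors; R: (n, 3, 3) rotations; t: (n, 3).
        n = x.shape[0]
        q, k, v = self.qkv(x).reshape(n, 3, self.h, self.d).unbind(1)
        qp, kp, vp = self.qkv_pts(x).reshape(n, 3, self.h, self.p, 3).unbind(1)

        # T_i maps local points to global coordinates: v -> R_i v + t_i.
        to_global = lambda pts: torch.einsum('nij,nhpj->nhpi', R, pts) + t[:, None, None, :]
        qp_g, kp_g, vp_g = to_global(qp), to_global(kp), to_global(vp)

        # Eq. (3.1): scaled dot-product weights, shape (h, n, n).
        w_std = torch.einsum('ihd,jhd->hij', q, k) / self.scale
        # Eq. (3.2): summed distances between global query and key points.
        diff = qp_g[:, None] - kp_g[None, :]                  # (n, n, h, p, 3)
        w_geo = diff.norm(dim=-1).sum(-1).permute(2, 0, 1) / self.p ** 0.5
        # Eq. (3.3): combine and normalize; gamma should stay > -1 in practice.
        coef = torch.log1p(self.gamma)[:, None, None]
        w = torch.softmax((w_std - coef * w_geo) / 2 ** 0.5, dim=-1)

        # Eq. (3.4): aggregate standard values.
        o = torch.einsum('hij,jhd->ihd', w, v).reshape(n, -1)
        # Eq. (3.5): aggregate global value points, then return to local frames.
        o_pts_g = torch.einsum('hij,jhpc->ihpc', w, vp_g) - t[:, None, None, :]
        o_pts = torch.einsum('nji,nhpj->nhpi', R, o_pts_g)    # apply T_i^{-1}
        # Eq. (3.6): concatenate features, points, and point norms.
        out = torch.cat([o, o_pts.reshape(n, -1),
                         o_pts.norm(dim=-1).reshape(n, -1)], dim=-1)
        return self.f_final(out)
```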
Note that the adjustment made to the original attention mechanism is the omission of the "attention bias" term.

The inter-resolution transfer learning

As illustrated in Fig. 2, transfer learning can be applied between the BSD and BRI sub-tasks, taking advantage of the shared architectures of the BSD and BRI modules. More specifically, we initialize the BSD module's shared parameters with the weights obtained from training the BRI module. The rationale behind this procedure is the following intuition: the protein's binding site can be determined based on the patterns of the binding residues.

The homology-based augmentation

Fig. 3 These figures illustrate how our homology-based augmentation determines the positive and negative binding site candidates in augmented proteins. A depicts an augmented protein with two positive (the blue points) and one negative (the red point) binding site candidate centers. Among the binding site candidates proposed by Fpocket, they are labeled based on the distances to the proxy centers (the star-shaped points) of binding sites inferred from homology relations. B and C depict the homologous proteins in the original database that contributed to the inferred binding sites in the augmented protein. X and Y are their ligands. The bright and dark green regions of the chains indicate the residues in close proximity to the ligands, while only the bright green region exhibits evolutionary correspondence to residues in the augmented protein. The bright green region must comprise at least 50% of the entire green region for the binding site to count.

Fig. 4 These figures illustrate how our homology-based augmentation assigns residue labels in augmented proteins for a positive binding site candidate. A illustrates UniProt protein Q9VC32 as an augmented protein. The red and purple residues correspond to the red residues in (B) of the PDB protein 4G34, which are the ligand-binding residues. Similarly, the blue and purple residues correspond to the blue residues in (C) of the PDB protein 4BID, which are the ligand-binding residues. The purple residues, which form the intersection, attain the label 1.0, while the other colored residues attain the label 0.5. This means that our augmentation method regards the purple residues as the most likely ligand-binding ones.

Table 1 Residue names for colored sets in Fig. 4

We use homology-based augmentation, a form of semi-supervised learning that aims to improve the training by utilizing the large database of protein structures whose binding sites are unlabelled. It is distinguished from conventional augmentation methods in that it does not rely on transformations applied to the samples during training. Instead, it pre-computes appropriate "augmented samples" from the unlabelled database and uses the resulting augmented dataset during training. Essentially, the augmented samples are selected based on sequence alignments computed against the proteins in the original training set. In the following, we describe the augmentation method in detail, clarifying its inputs, outputs, procedures, and underlying rationale.

The augmentation method requires a seed database \(\mathcal {S}^*\) of multi-chain protein-ligand complexes and a target database \(\mathcal {T}\) of single-chain protein structures. In our instantiation, \(\mathcal {S}^*\) was the training set of the scPDB dataset [11], different for each cross-validation fold.
For \(\mathcal {T}\), we used version 2 of the AlphaFold Protein Structure Database [38], which contained 992,316 protein structures from the proteomes of humans and 47 other key organisms, as well as the Swiss-Prot entries.

The augmentation procedure generates two types of information, collectively constituting the augmented dataset, which is subsequently utilized during training as outlined in "The homology-based augmentation" section. The first type of information consists of the centers of the binding site candidates in proteins in a selected subset of \(\mathcal {T}\), labeled either positive or negative. This is used to augment the BSD training dataset. The second type of information consists of, for each positive binding site candidate of the first type, the likelihood of each nearby protein residue being a ligand-binding residue. This is used to augment the BRI training dataset.

In the following, we describe the steps of the procedure (a code sketch of the key labeling steps follows the list). The italicized words are general terms whose specification may vary depending on one's needs. Whenever there is an italicized word, we provide specific details at the end of the step.

(1) In each holo structure of \(\mathcal {S}^*\), find the ligands associated with exactly one chain. As a result, obtain a database \(\mathcal {S}\) of protein chains associated with at least one such single-chain ligand. (A chain can be associated with multiple single-chain ligands.) We define a chain and a ligand to be associated with each other if they have heavy atoms within \(4\,\mathrm{\AA}\) of each other.

(2) Run a homology search algorithm with \(\mathcal {S}\) as the query database and \(\mathcal {T}\) (the database of single-chain protein structures) as the target database. Based on the results, obtain an MSA for each chain in \(\mathcal {S}\). For the homology search algorithm, we use the software HHblits [30] with its default settings.

(3) For each triplet (x, l, y), composed of (i) a query chain x in \(\mathcal {S}\), (ii) a ligand l associated with x found in step 1 of the procedure, and (iii) a target chain y aligned with x in the MSA, determine whether the ligand l's binding site in x is preserved in y. The triplets for which this determination is affirmative will be called preserving. We define a triplet (x, l, y) as preserving if at least half of the residues of x that are in close contact with l (heavy atoms within \(4\,\mathrm{\AA}\)) are aligned with a residue of y in the MSA.

(4) For each preserving triplet (x, l, y), find a proxy center of the binding site in y that corresponds to the ligand l's binding site in x. We define the proxy center to be the mean of the alpha carbon coordinates of the residues of y aligned in the MSA with a residue of x in close contact with l.

(5) On each chain y in \(\mathcal {T}\) that is involved in at least one preserving triplet, run Fpocket to get an initial list of binding site candidate centers. Label a candidate center "positive" if it is within a lower threshold of a proxy center obtained in the previous step. Label it "negative" if it is farther than an upper threshold from every such proxy center. If a candidate center falls into neither category, exclude it from the dataset. We define the lower threshold to be \(7.5\,\mathrm{\AA}\) and the upper threshold to be \(30\,\mathrm{\AA}\) (justifications of these values are provided in Section 5 of Supplementary Information). Figure 3 illustrates this step using schematic figures.

(6) For each positively labeled binding site candidate from the previous step, label the residues of y with the estimated likelihood of comprising the binding site. The estimate is obtained as the result of a "vote" by the homologous chains in \(\mathcal {S}\) that gave rise to the binding site. More specifically, among the preserving triplets (x, l, y) whose proxy centers gave rise to the binding site (in the sense of the positive labeling in step 5), we compute the proportion of triplets for which the residue at hand corresponds (in the MSA) to a residue in the binding site of l. Figure 4 illustrates this step using an actual example.

The assignments of different labels in the previous procedure are based on the following hypotheses:

The positive binding site label: If a pocket-like site (discovered by a geometry-based BSP method) is surrounded by sequence fragments homologous to the binding site sequence fragments of other proteins, then it is likely to be a binding site.

The negative binding site label: Even if a site exhibits pocket-like characteristics, it is unlikely to be a binding site if it is distant from any sequence fragments that share homology with the binding site sequence fragments of other proteins in a given seed database.

The residue labels: Whether a residue near a binding site is considered part of the binding site can be determined by assessing whether the same holds for the corresponding residues in homologous binding sites.

Note that similar hypotheses have been the basis of the template-based methods introduced in "Template-based methods". Also, it is important to note that we are not arguing that our augmentation procedure always reliably assigns precise labels to unlabelled proteins. Rather, we believe that it is a good approximation that expands on well-founded principles underlying template-based methods and has proven its empirical benefits through our experiments.

How these approaches address the challenges posed by the existing methods

The problem of CNN-only architectures

Since an attention layer can emulate arbitrarily distant interactions in a single step of operation, it obviates the problem of long-range dependency: the layers can be kept relatively shallow while still capturing global patterns.

The problem of excessive post-processing

Given that the fundamental units of computation in our model architecture are the pocket and the residue, its outputs directly align with the resolutions of interest. Therefore, it does not require additional post-processing that might be sub-optimal for the sub-tasks.

The problem of under-utilization of existing data sources

Our method addresses this issue through two approaches. First, inter-resolution transfer learning resolves the issue of overlooking more fine-grained information, a limitation observed in certain previous methods for binding site detection (BSD). Second, homology-based augmentation offers a mechanism to leverage databases of protein structures whose binding sites are unlabelled, which were previously overlooked by existing works.

Additional details of training

Training the BSD module

To implement the transfer learning described in "The inter-resolution transfer learning" section, the BSD training consists of two stages. The first stage pre-trains the part of the BSD module's architecture shared with the BRI module, as depicted in Fig. 2. In this process, we append the unshared portion of the BRI module architecture on top of the shared part of the BSD module and then train the combined model for the BRI task. The second stage fine-tunes the entire original BSD module for the BSD task. In the second stage, to facilitate seamless transfer learning, we freeze the parameters of the parts trained in the first stage for a certain number of gradient descent steps.

In both stages, we use balanced sampling of binding site candidates with positive and negative labels. This is because, among the binding site candidates predicted by Fpocket (on average 33 per protein in scPDB), only a few are actual binding sites, typically only one. Training without such balanced sampling may bias the model toward the majority label [17].

In addition, we address the analogous problem of unbalanced residue labels in the first stage using a weighted loss function. This loss function is a weighted sum of terms computed from different residues, where the binding and non-binding residues attain the following weights:$$\begin{aligned} w_{pos}=\frac{1}{2n_{pos}},\quad w_{neg}=\frac{1}{2n_{neg}} \end{aligned}$$
(3.7)
where \(n_{pos}\) and \(n_{neg}\) are the numbers of binding and non-binding residues, respectively.
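A condensed sketch of the two-stage training and the weighted residue loss of Eq. (3.7) follows; the module and variable names (bsd_shared, bri_shared) are hypothetical, and the optimizer loop and unfreezing schedule are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_residue_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Eq. (3.7): per-residue BCE with binding residues weighted by 1/(2 n_pos)
    and non-binding residues by 1/(2 n_neg), so the weights sum to one."""
    pos = labels > 0.5
    n_pos = pos.sum().clamp(min=1).float()
    n_neg = (~pos).sum().clamp(min=1).float()
    w = torch.where(pos, 0.5 / n_pos, 0.5 / n_neg)
    bce = F.binary_cross_entropy_with_logits(logits, labels.float(), reduction='none')
    return (w * bce).sum()

def start_bsd_stage2(bsd_shared: nn.Module, bri_shared: nn.Module) -> None:
    """Inter-resolution transfer: initialize the BSD module's shared part
    (CNN + attention stack) from stage-1 BRI training, then freeze it for
    the first scheduled gradient steps before fine-tuning everything."""
    bsd_shared.load_state_dict(bri_shared.state_dict())
    for p in bsd_shared.parameters():
        p.requires_grad = False  # unfrozen again later in training
```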
Training the BRI module

To train the BRI module, we exclusively use the positive binding site candidates, which is identical to the procedure used in Deeppocket. This is because, in our intended usage, the BRI module operates on the binding sites detected by the BSD module. Note that this intention is also reflected in the evaluation metric "average IOU of binding residues against the closest ligands". All the settings of the first stage of BSD training were maintained, except for the balanced sampling of the binding site candidates.

The homology-based augmentation

In all stages of training, applying the homology-based augmentation involves adding an auxiliary loss to the original loss (originating from the original dataset). This auxiliary loss is calculated in the same manner but stems from the augmented dataset.

Datasets

We used scPDB v.2017 [11] as the main dataset for training and validation. In addition, we used three other datasets for the tests: COACH420 [19], HOLO4K [19], and CHEN [9]. To be more specific, we used the training subset of the scPDB dataset provided by [25] for 5-fold cross-validation. This subset excludes proteins with a sequence similarity higher than 90% to the proteins in any of the external test datasets. We used the remaining part of scPDB as a test dataset. Thus, the test datasets comprised the scPDB test set and the external test datasets: COACH420, HOLO4K, and CHEN. The CHEN dataset has holo and apo subsets; for the tests using the apo subset, we obtained the ground-truth binding sites from structural alignments with the corresponding holo structures. More specifically, the ligands of the holo structures were superimposed onto the apo structures according to the structural alignments. The characteristics of each test dataset and the details of the structural alignments are described in Section 2 of Supplementary Information. For instance, the basic properties of each dataset, such as the number of proteins and the average number of binding sites, are presented in Table 1 of Supplementary Information.

Evaluation methods

We use three evaluation metrics: (1) the success rate for detection (success rate), (2) the average IOU of binding residues against the closest ligands (IOU), and (3) the average IOU of binding residues against the successfully detected ligands (conditional IOU). The metrics evaluate different combinations of BSD and BRI performance: the success rate metric evaluates BSD performance, the IOU metric evaluates BRI performance but is also influenced by BSD performance, and the conditional IOU metric aims to evaluate BRI performance alone. Additional details on how these evaluation metrics compare to their counterpart metrics in the previous literature are given in Supplementary Information. A code sketch of these metrics follows their formal definitions below.

To provide a formal definition of each metric, we adopt the following notations:

\(n^{(i)}\) is the number of ground-truth ligands bound to the i-th protein.

\(\left\{ l^{(i)}_{1},\cdots , l^{(i)}_{n^{(i)}}\right\}\) is the set of ground-truth ligands bound to the i-th protein.

\(\left\{ (c^{(i)}_1,BR^{(i)}_1),\cdots , (c^{(i)}_{n^{(i)}}, BR^{(i)}_{n^{(i)}})\right\}\) is the set of predictions of the method under evaluation.

\(\left\{ TBR^{(i)}_1,\cdots ,TBR^{(i)}_{n^{(i)}}\right\}\) is the collection of true binding residue sets, where \(TBR^{(i)}_j\) is defined to be the set of residues in the i-th protein that are within \(4\,\mathrm{\AA}\) of \(l^{(i)}_j\).

The success rate metric measures the correspondence between the predicted binding site centers \(\left\{ c^{(i)}_1,\cdots ,c^{(i)}_{n^{(i)}}\right\}\) and the positions of the ground-truth ligands \(\left\{ l^{(i)}_1,\cdots ,l^{(i)}_{n^{(i)}}\right\}\). For each i, we compute the F1 score (the harmonic mean of precision and recall) based on the following definition of detection: \(c^{(i)}_j\) is a correct detection of \(l^{(i)}_k\) when \(c^{(i)}_j\) is within \(4\,\mathrm{\AA}\) (a threshold commonly used in the literature, e.g., [1] and [25]) of any atom of \(l^{(i)}_k\). In other words, detection is correctly performed when the distance from center to atom (DCA) is \(<4\,\mathrm{\AA}\). Then, the F1 scores are weighted-averaged (weighted by \(n^{(i)}\)) over the proteins. In summary, we obtain this metric as$$\begin{aligned} \left( \sum _{i}n^{(i)}\cdot \frac{2}{\frac{1}{P^{(i)}}+\frac{1}{R^{(i)}}} \right) \bigg / \left( \sum _{i}n^{(i)} \right) \end{aligned}$$
(3.8)
where \(P^{(i)}\) is the precision, defined as follows:$$\begin{aligned} P^{(i)} = \frac{\#\left\{ 1\le j\le n^{(i)}:c^{(i)}_j\text { detects one of }l^{(i)}_1,\cdots ,l^{(i)}_{n^{(i)}}\right\} }{n^{(i)}} \end{aligned}$$
(3.9)
and \(R^{(i)}\) is the recall defined as follows:$$\begin{aligned} R^{(i)} = \frac{\#\left\{ 1\le k\le n^{(i)}:l^{(i)}_k\text { is detected by one of }c^{(i)}_1,\cdots ,c^{(i)}_{n^{(i)}}\right\} }{ n^{(i)}} \end{aligned}$$
(3.10)
This is a BSD metric since it involves only the predicted binding site centers, not the predicted binding residues.

The IOU metric compares the predicted binding residues \(BR^{(i)}_j\) with the true binding residues \(TBR^{(i)}_{\phi ^{(i)}(j)}\) of the ligand \(l^{(i)}_{\phi ^{(i)}(j)}\) closest to the predicted binding site center \(c^{(i)}_j\). Here, the index \(\phi ^{(i)}(j)\) of the closest ligand is defined as$$\begin{aligned} \phi ^{(i)}(j) = \mathop {\textrm{argmin}}\limits _{k=1}^{n^{(i)}}DCA(c^{(i)}_j, l^{(i)}_k) \end{aligned}$$
(3.11)
The comparison is performed in terms of intersection over union (IOU), and the quantity is averaged over all pairs of (i, j). In summary, we obtain the second metric as$$\begin{aligned} \left( \sum _{i}\sum _{j=1}^{n^{(i)}}\frac{\#(BR^{(i)}_j\cap TBR^{(i)}_{\phi ^{(i)}(j)})}{\#(BR^{(i)}_j\cup TBR^{(i)}_{\phi ^{(i)}(j)})}\right) \bigg / \left( \sum _{i}n^{(i)}\right) \end{aligned}$$
(3.12)
Although this is essentially a BRI metric, it also depends on BSD performance due to how \(\phi ^{(i)}(j)\) is defined. In particular, if the predicted center \(c^{(i)}_j\) is far from every ligand, the set of predicted binding residues \(BR^{(i)}_j\) does not contribute to the metric.

The conditional IOU metric is similar to the IOU metric, but it aims to eliminate the previously mentioned dependence of the IOU metric on BSD performance. It does so by restricting attention to the predicted binding sites that are close to at least one ligand. In summary, we obtain the metric as$$\begin{aligned} \left( \sum _{i}\sum _{j\in S^{(i)}}\frac{\#(BR^{(i)}_j\cap TBR^{(i)}_{\phi ^{(i)}(j)})}{\#(BR^{(i)}_j\cup TBR^{(i)}_{\phi ^{(i)}(j)})}\right) \bigg / \left( \sum _{i}\# S^{(i)} \right) \end{aligned}$$
(3.13)
where$$\begin{aligned} S^{(i)} = \left\{ j=1,\cdots ,n^{(i)}:DCA(c^{(i)}_j,l^{(i)}_{\phi ^{(i)}(j)})<4\,\mathrm{\AA}\right\} \end{aligned}$$
(3.14)
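The sketch below computes the success rate (Eqs. 3.8-3.10) and the IOU metric (Eq. 3.12) under stated assumptions: per-protein lists of predicted centers, predicted residue sets, ligand heavy-atom coordinates, and true residue sets, with predictions truncated to the top-\(n^{(i)}\); the dictionary layout and names are illustrative.

```python
import numpy as np

DCA_T = 4.0  # angstroms; detection threshold

def dca(center: np.ndarray, ligand_atoms: np.ndarray) -> float:
    """Distance from a predicted center to the closest ligand heavy atom."""
    return float(np.linalg.norm(ligand_atoms - center, axis=1).min())

def success_rate(proteins) -> float:
    """Weighted average of per-protein F1 (Eqs. 3.8-3.10). Each protein is a
    dict with 'centers' (list of (3,) arrays) and 'ligands' (list of (m, 3))."""
    num, den = 0.0, 0
    for p in proteins:
        n = len(p['ligands'])
        detects = [[dca(c, l) < DCA_T for l in p['ligands']]
                   for c in p['centers'][:n]]
        prec = sum(any(row) for row in detects) / n          # Eq. (3.9)
        rec = sum(any(col) for col in zip(*detects)) / n     # Eq. (3.10)
        f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
        num, den = num + n * f1, den + n
    return num / den

def iou_metric(proteins) -> float:
    """Average IOU against the closest ligand's true residues (Eq. 3.12).
    Each protein additionally carries 'pred_res' and 'true_res' (lists of sets)."""
    total, count = 0.0, 0
    for p in proteins:
        n = len(p['ligands'])
        for c, pred in zip(p['centers'][:n], p['pred_res'][:n]):
            k = int(np.argmin([dca(c, l) for l in p['ligands']]))  # phi(j)
            true = p['true_res'][k]
            total += len(pred & true) / len(pred | true)
        count += n
    return total / count
```

The conditional IOU (Eq. 3.13) would be obtained by additionally skipping the pairs with \(DCA \ge 4\,\mathrm{\AA}\) in the inner loop and dividing by the number of retained pairs.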
Baseline methods

We compared our method to the previous state-of-the-art deep learning methods, which are based on CNNs: Deeppocket [1], Kalasanty [35], and DeepSurf [25]. All these methods are briefly explained in "Structure-based deep learning methods" section. For Deeppocket and DeepSurf, we trained their parameters from scratch according to our dataset splits. However, for Kalasanty, we used the parameters released by the authors due to the high computational costs of training. It is important to note that the parameters of Kalasanty were trained on the entire scPDB v.2017 dataset. Specifically, the training data of Kalasanty may have included proteins whose sequences are similar (similarity above 90%) to those in the test datasets. Thus, the Kalasanty method has an advantage in terms of training dataset coverage over the other methods when evaluated on the external test datasets.

The definitions of \({\hat{c}}_1,\cdots , {\hat{c}}_n\) for the baseline methods are mostly natural and derived directly from the original papers. All methods produce a ranked list of predictions; we limit them to produce only the top-n outputs. Also, they compute the centers of predicted binding sites in their evaluation, so we can compute \({\hat{c}}_i\) as prescribed.

However, not all baseline models output the predicted binding sites at the residue level. Thus, it is necessary to map their outputs to sets of residues \({\hat{R}}_1,\cdots , {\hat{R}}_n\). For example, in Deeppocket [1], the authors used the distance threshold \(2.5\,\mathrm{\AA}\) (which performed best on their validation set) to determine the binding residues based on the segmented voxels; we therefore followed the same procedure. For Kalasanty [35] and DeepSurf [25], the authors introduced a method to convert their predictions to atom-level predictions (implemented in their code); we therefore regarded the residues containing at least one such predicted binding atom as the predicted binding residues.

Ablation study

We conducted an ablation study to assess the effectiveness of each component of our proposed method. We considered the omission of the following components:

The use of local features extracted by the CNN

The inter-resolution transfer learning

The homology-based augmentation

In the ablation of "the use of local features extracted by the CNN", we removed the CNN component from our model. To be more specific, the hidden vectors for the attention layers were directly obtained from the one-hot encoding layer for the amino acid types, followed by the token embedding layer. To compensate for the loss of model complexity, we added two more attention layers to the default configuration of the BRI and BSD model architectures.
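As a rough sketch of this ablated variant (the class name and sizes are illustrative; the actual configuration may differ), the CNN-derived hidden vectors are replaced by learned embeddings of the amino acid types:

```python
import torch.nn as nn

N_AA = 20  # standard amino acid types

class AblatedFeatureExtractor(nn.Module):
    """Replaces the 3D CNN: residue identity -> embedding, no local geometry."""
    def __init__(self, d_hidden: int = 128):
        super().__init__()
        # One-hot encoding followed by a linear map is equivalent to nn.Embedding.
        self.embed = nn.Embedding(N_AA, d_hidden)

    def forward(self, aa_types):  # (n,) integer residue-type indices
        return self.embed(aa_types)  # (n, d_hidden) hidden vectors for attention
```

In the ablation runs, this extractor feeds the (deepened) geometric attention stack in place of the CNN features.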
