EpiScan: accurate high-throughput mapping of antibody-specific epitopes using sequence information

Motivation and key idea

The rapid growth of high-throughput antibody sequencing data has opened up new possibilities for vaccine development. Identifying highly immunogenic epitopes is crucial for designing effective vaccines, and we hypothesize that these epitopes are likely to be frequently targeted by neutralizing antibodies. Similar antibodies may target two completely different conformational regions of the same antigen, which makes the epitope mapping highly challenging. To address this challenge, we developed EpiScan, incorporating biological principles into its model framework. It employs an attention mechanism within a neural network to model antibody internal dynamics and capture the underlying logic of epitope recognition. Additionally, EpiScan employs a protein language model to efficiently encode antibody amino acid sequences, thereby facilitating the retrieval of pertinent information solely from sequence data. We believe that the modeling paradigm employed by EpiScan holds great promise for addressing the challenges associated with epitope mapping against a specific antigen using high-throughput antibody sequences.

Overview of EpiScan

We present the motivation and the overall architecture of EpiScan, as illustrated in Fig. 1. EpiScan is a comprehensive framework composed of three primary modules that work together to predict unknown interactions within antibody-antigen pairs. In the following subsections, we describe the materials and methods used in each of the three primary modules of the EpiScan framework in detail.

Fig. 1: Motivation and the overall architecture of EpiScan. a The EpiScan framework consists of three primary modules: (1) Input module, which incorporates a matrix representation of antigen-antibody pairs.
These matrix input pairs are processed by the encoding layer, where the antibody sequence is encoded using a pre-trained deep learning language model (Bepler and Berger); (2) Feature extraction module, comprising four distinct blocks: Binding, Hinge, Rotation, and ECA. The pre-encoded matrix undergoes feature extraction to obtain antigen structural features and antibody sequence features. Each block implements the simulation of antigen-antibody coupling, resulting in an output probability matrix corresponding to the epitope length of the antigen primary structure; (3) Output module, designed to predict unknown interactions within antibody-antigen pairs and capable of addressing regression tasks. b Workflow for mapping potential immunogenic epitopes, demonstrated in a case study using the SARS-CoV-2 RBD structure: High-throughput neutralizing antibody sequencing data from sera of naturally infected or vaccinated survivors are used to batch-map the epitopes of each neutralizing antibody on a specific antigen structure. This process identifies immunologically advantageous regions, providing valuable insights for vaccine and drug design.

The EpiScan framework consists of an Input module, a Feature extraction module, and an Output module. Given two input matrices (an Ag-Ab pair), antigen $Z_{Ag}=\{z_{ag_i}\}_{i=1}^{L_{Ag}}$ and antibody $Z_{Ab}=\{z_{ab_i}\}_{i=1}^{L_{Ab}}$, the network produces an output $Z_{out}$ that assigns to each sample point $p_i\in P$ a probability of belonging to the positive class (i.e., of being a binding amino-acid residue). The set of sample points $P$ refers explicitly to the individual residues of the Ag that are considered in epitope mapping analysis. Each antigenic amino acid is treated as a distinct sample point for the purpose of feature extraction and epitope prediction.

The input module of EpiScan consists of several layers that process the input matrix of the antigen and antibody.
First, the input data is processed by an embedding block, which converts each amino acid in the sequence into a high-dimensional representation. The pre-trained layer only processes the antibody sequence and is not fine-tuned during training. The output of the pre-trained layer is passed through a linear layer, a ReLU layer, and a dropout layer, which help to reduce overfitting and improve the performance of the model. The output module of EpiScan consists of a max-pooling layer, which is applied to the output of the input module to reduce the dimensionality of the features and extract the most important information. The resulting feature vector is then passed through the epitope prediction (sigmoid) layer, which predicts the likelihood of each position in the antigen sequence being part of an epitope. EpiScan is trained using a specific loss function (see Methods details, “Cost and optimizer function”), and the predicted epitope residues are marked as positive samples.

Evaluation of the performance of EpiScan on DB1 dataset

As shown in Table 1, a comparison of baseline methods for predicting antibody-specific epitopes on DB1 is presented in terms of ${Precision}$, ${Recall}$, $F1{\_score}$, ${MCC}$, ${AUROC}$, and ${AUPR}$. EpiScan, the proposed SOTA model, achieved the best overall performance among all the methods. In terms of ${Precision}$, EpiScan achieved a value of 0.239 ± 0.019, outperforming most methods, although DeepBindPPI reached a higher ${Precision}$ of 0.315. EpiScan also exhibited a slightly higher ${Recall}$ of 0.776 ± 0.038 compared to PInet, the method with the next highest ${Recall}$ of 0.774. EpiScan attained the highest $F1{\_score}$ of 0.338 ± 0.021, surpassing the $F1{\_score}$ of the other models by a clear margin. The exact $F1{\_score}$ value for PInet is not provided in the table for comparison.
Additionally, EpiScan achieved the best ${AUROC}$ of 0.715 ± 0.008, which was 0.005 (0.5 percentage points) higher than the second-best performing method, EPI-EPMP, with an ${AUROC}$ of 0.710 ± 0.003.

Table 1 A comparison of baseline methods for predicting antibody-specific epitopes on DB1 (± std) is presented in terms of Precision, Recall, F1_score, the area under the receiver operating characteristic curve (AUROC), and the area under the precision-recall curve (AUPR)

These results indicate that the EpiScan model excels in predicting antibody-specific epitopes on the DB1 dataset. The model’s multi-feature representation of proteins and its attention to finer granularity information appear to be beneficial for accurate prediction. Notably, EpiScan outperformed the best-performing comparison model, PInet, in the ${AUROC}$, ${AUPR}$, $F1{\_score}$, and ${MCC}$ metrics, demonstrating its effectiveness as a SOTA model.

Evaluation of the performance of EpiScan on DB2 dataset

The generalization of EpiScan was evaluated with a separate test set (DB2). As shown in Table 2, EpiScan outperformed state-of-the-art methods including PInet across all evaluated metrics. Specifically, it achieved a ${Precision}$ of 0.215, higher than that of DeepBindPPI (0.201). Moreover, EpiScan demonstrated an improved ${Recall}$ of 0.855 compared to PInet’s 0.825. In terms of ${AUROC}$, EpiScan also showed enhancement with a score of 0.686, versus 0.647 for AbAdapt. Similarly, EpiScan obtained a higher ${AUPR}$ of 0.243 than PInet’s 0.168. Lastly, EpiScan had a superior $F1{\_score}$ of 0.327, whereas PInet reached only 0.238. In addition, EpiScan achieved an ${MCC}$ of 0.264, surpassing PInet’s 0.228.
Compared to other recent methods like DeepBindPPI, AbAdapt, and EPI-CNN-GCN, EpiScan showed consistently stronger performance, highlighting its state-of-the-art epitope prediction capabilities generalizable across datasets.Table 2 Comparison of EpiScan with the state-of-the-art method on DB2This study presented a comprehensive evaluation and analysis of the performance and generalization capabilities of EpiScan, a self-attention-based convolutional neural network model, compared with PInet, a graph neural network-based model. The comparison was based on epitope prediction results for five representative antigen–antibody complexes (Fig. 2a) and performance metrics across two distinct datasets (Fig. 2b). For each BCE type, the representative antigen structure was utilized, as shown in Supplementary Table 2. In Fig. 2a, every amino acid residue on the antigen surface is assigned an epitope probability score, indicating its potential as a component of the antibody binding site (epitope). For EpiScan, this score is derived from the predictive model’s output, a probability matrix of dimensions (Ag-seq-length, 2), where the softmax function assists in extracting the probability of each antigen residue being part of the epitope, we utilize the last dimension of this matrix to articulate the epitope probability.Fig. 2: Comparison of epitope prediction and performance evaluation of EpiScan and PInet on DB1 and DB2 datasets.a Epitope map visualization for representative queries, depicting native epitopes (column 1) and predicted epitopes by EpiScan (column 2) and PInet (column 3) in red on the RBD surface. Probability of prediction by EpiScan and PInet are presented in columns 4 and 5, with F1_score (left) and AUROC (right) values indicated below each prediction. b Boxplot comparisons of EpiScan and PInet performance on evaluation metrics, including Precision, Recall, F1_score, AUROC, and AUPR, for DB1 and DB2 datasets. 
Lines represent confidence intervals, and p-values were calculated using a Wilcoxon test.Figure 2a demonstrates that EpiScan and PInet exhibited varying degrees of success in epitope prediction across the five complexes. The $F1{\_score}$ and ${AU}{ROC}$ highlighted the strengths and limitations of each method, emphasizing the importance of selecting the appropriate approach for epitope prediction tasks. In several cases, EpiScan outperformed PInet, largely due to its ability to consider the interrelationships between VH and VL, including the interactions between the CDRS and FRs. EpiScan individually models the interactions between the antigen and VH or VL of the antibody, allowing it to determine the different levels of importance of each chain during binding. On the contrary, PInet does not have this feature. The prediction accuracy of the model varies for different epitope regions. For 3ZKM located in the “helix-loop-helix” motif and 1TZH located in the “sheet-loop-sheet” motif, the prediction accuracy of the model is relatively higher than that of 1NFD located in the loop motif. Moreover, the prediction accuracy of the model for 7ZF8 located in the RBM region is higher than that for 7WRL located in the non-RBM region. In addition to conformational differences, the possible reasons for this difference include the long-tail problem caused by the difference in the number of samples in the target epitope-enriched and non-enriched regions, exacerbating the model’s positive prediction bias towards the enriched region due to data imbalance. Figure 2b illustrates the performance of EpiScan and PInet on DB1, a public database, and DB2, an independent database of SARS-CoV-2 neutralizing antibody–antigen complexes. EpiScan outperformed PInet on all evaluation metrics in DB1, with no significant group differences observed (p > 0.05). 
In DB2, EpiScan showed significant group differences in ${Precision}$, $F1{\_score}$, and ${AUROC}$ (p < 0.05), whereas no significant differences were detected for ${Recall}$ and ${AUPR}$.As an enhancement to our methodological approach, we have included macro-average ROC curves in Supplementary Fig. 3, providing additional insights into the performance of epitope prediction across two datasets. In DB1, a robust evaluation through 5-fold cross-validation exhibits consistent performance with ${AUROC}$ values ranging from 0.70 to 0.72, as evidenced by the closely aligned ROC curves. The nearness of these curves to the top left corner of the plot underscores the effectiveness of our model in distinguishing between the positive class (epitopes) and the negative class (non-epitopes), compared to a random guess which would follow the diagonal dashed line. For DB2, the comparison among various prediction models is illustrated. EpiScan, performs exceptionally well on DB2 with an ${AUROC}$ value of 0.68. In contrast, DeepBindPPI exhibits poorer performance, achieving an ${AUROC}$ of 0.52. The PInet model has an ${AUROC}$ of 0.55, while AbAdapt achieves 0.52. Lastly, the EPI-CNN-GCN model performs least effectively on DB2, with an ${AUROC}$ of 0.51.We also have adopted a new standard for dividing the training and testing sets, ensuring that the CDR identity among antibodies is less than 70%. This adjustment aims to mitigate the learning benefits derived from antibody homology. The revised dataset, named DB1-cdr70, includes 86 complexes for training and 30 for testing, as detailed in Supplementary Table 6. The performance of EpiScan, PInet, and DeepBindPPI on DB1-cdr70 is summarized in Supplementary Tables 7 and 8. 
PInet and DeepBindPPI rank closely behind EpiScan, showcasing competitive performance as the subsequent top deep learning models in benchmark DB1 and DB2.From the results in Supplementary Table 7 and Supplementary Table 8, it is evident that when the training and testing sets are divided based on the criterion of less than 70% CDR sequence identity, all models show a decline in performance compared to the original DB1 test results. Despite the reduced training set (from 132 to 86 samples), EpiScan continues to perform best, validating the robustness of our approach.Notably, we tuned the parameter γ in the ECA module, which is responsible for the interaction between the antibody CDR and the antigen (see Methods details section). We reduced γ from 2 to 1 (where γ = 2 yielded the best performance in the DB1 test) to increase the convolutional kernel size in the ECA module. We found that increasing the convolutional kernel size improved EpiScan’s performance on the DB1-cdr70 dataset (also performs steadily on the DB2). This improvement is likely because, in datasets with significant CDR differences, focusing too much on the detailed features of the CDR region (with a smaller convolutional kernel) can lead to overfitting. In contrast, a larger convolutional kernel captures global features better, enhancing generalization.Evaluation of the performance of EpiScan on DB3 datasetOn another dataset, DB3 (sourced from SEPPA-mAb (2023)32), we re-trained and tested the performance of EpiScan. Similar to SEPPA-mAb, we used 860 complexes deposited before July 2017 as the internal training dataset and data deposited after 2017 for testing. 
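A brief aside on the γ parameter of the ECA module discussed earlier in this section. In the original ECA-Net design, the 1D convolution kernel size is derived from the channel count C by truncating |(log2 C + b)/γ| to an integer and adding one if the result is even, so a smaller γ gives a larger kernel. The sketch below assumes that EpiScan’s ECA block follows this same rule; the text above states only that reducing γ enlarges the kernel. The function name, the channel counts, and b = 1 are illustrative assumptions.

```python
import math


def eca_kernel_size(channels: int, gamma: float = 2.0, b: float = 1.0) -> int:
    """Adaptive kernel size as in ECA-Net (assumed, not confirmed, for EpiScan)."""
    # Truncate |(log2(C) + b) / gamma| to an integer ...
    t = int(abs((math.log2(channels) + b) / gamma))
    # ... and bump even values to the next odd integer so the kernel is centered.
    return t if t % 2 == 1 else t + 1


# Hypothetical channel counts, chosen only to show the effect of gamma.
for c in (64, 256):
    print(c, eca_kernel_size(c, gamma=2), eca_kernel_size(c, gamma=1))
```

Under these assumptions, moving from γ = 2 to γ = 1 widens the kernel from 3 to 7 for 64 channels and from 5 to 9 for 256 channels, which matches the direction of the change described above.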
We utilized four test sets: DB3-test-193, an independent test set from DB3 comprising 193 antigen-antibody complexes; DB3-test-HIV-36, consisting of 36 HIV Env glycoprotein complexes also sourced from DB3; DB3-test-CoV2-31, containing 31 CoV2 complexes from DB3 (before 2022); and DB2-test-CoV2-24, consisting of 24 CoV2 complexes from DB2 (after 2022). The test results are presented in Table 3.Table 3 Comparison of EpiScan with the SOTA structure-based method SEPPA on DB3Table 3 provides a detailed comparison of EpiScan’s capabilities relative to the SOTA structure-based method SEPPA, utilizing the DB3 dataset. Four distinct test sets form the basis of this analysis: DB3-test-193, DB3-test-HIV-36, DB3-test-CoV2-31, and DB2-test-CoV2-24.For the DB3-test-193 dataset, EpiScan’s AUROC of 0.864 significantly exceeds that of SEPPA-patch by 0.090. While EpiScan’s FPR of 0.150 is impressively lower than SEPPA 3.0’s 0.206, it is slightly higher than SEPPA-mAb’s remarkable FPR of 0.097, highlighting SEPPA-mAb’s strength in minimizing false positives in this context. In the DB3-test-HIV-36 set, EpiScan presents an AUROC of 0.906, outperforming SEPPA-patch by 0.071. Although EpiScan’s FPR of 0.067 is exceptionally low, it is marginally higher than SEPPA-mAb’s FPR of 0.058. This slight difference underscores SEPPA-mAb’s efficiency in reducing false positives, a testament to its predictive precision. Examining the DB3-test-CoV2-31 dataset, EpiScan’s AUROC of 0.778 is substantially higher than SEPPA 3.0’s 0.672. However, its FPR of 0.182, while lower than SEPPA-patch’s 0.304, is higher than SEPPA-mAb’s 0.224. Within the DB2-test-CoV2-24 set, EpiScan’s AUROC of 0.684 demonstrates competitive performance. Despite its accuracy, EpiScan’s FPR of 0.116, though significantly better than SEPPA 3.0’s 0.314, is not as optimized as SEPPA-mAb’s FPR of 0.202. 
This comparison reflects SEPPA-mAb’s capability to maintain a lower false positive rate, emphasizing its precision in this dataset.

Overall, the data indicate that EpiScan exhibits superior performance in terms of AUROC and ACC across all test sets, with consistently higher AUROC compared to the SEPPA methods. The results underscore EpiScan’s effectiveness and reliability as a predictive tool in antigen-antibody interaction studies.

Evaluation of the performance of EpiScan on DMS-3104 dataset

We derived the DMS-3104 dataset from previous studies33,34; it includes 3104 anti-RBD antibodies and their corresponding 12 targeted hot regions on the antigen, identified through deep mutational scanning (DMS). Given that DMS does not provide direct and explicit antigen-antibody interaction sites due to the lack of crystal structure resolution, we adapted the evaluation of EpiScan on DMS data into a targeted epitope classification problem. Specifically, we merged the 12 DMS escape regions into four epitope classes (class1-class4), ensuring minimal overlap between the classes, as illustrated in Supplementary Fig. 5.

When the EpiScan model outputs predictions for antibody-specific epitope sites, we assign the predicted target regions to one of four classes based on the following rule: the predicted epitope site is allocated to the class with which it has the largest intersection (i.e., the class for which the intersection of the model output and the class sites is the greatest). Mathematically, let $P$ be the predicted epitope site, and let $C_1, C_2, C_3, C_4$ be the four classes. The predicted class $C_p$ is given by:

$$C_p=\underset{C_i}{\operatorname{argmax}}\,|P\cap C_i| \quad \text{for } i=1,2,3,4$$

where $|P\cap C_i|$ denotes the size of the intersection between the predicted site $P$ and the class sites $C_i$.

By applying this rule, we have established a method to evaluate the specificity of epitope predictions using DMS data.
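The class-assignment rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ implementation: the function name, the residue numbers in each class, and the tie-breaking behavior (first class in listing order wins) are assumptions, since the text does not specify how ties are resolved.

```python
from typing import Dict, Set


def assign_epitope_class(predicted_sites: Set[int],
                         class_sites: Dict[str, Set[int]]) -> str:
    """Return the class whose residue set overlaps most with the prediction."""
    # max() returns the first maximal key, so ties fall back to listing order
    # (an assumption; the source does not define tie-breaking).
    return max(class_sites,
               key=lambda name: len(predicted_sites & class_sites[name]))


# Hypothetical residue numbers, for illustration only (not the real DMS classes).
classes = {
    "class1": {455, 456, 486, 487},
    "class2": {484, 490, 493},
    "class3": {346, 440, 444},
    "class4": {379, 383, 384, 385, 386},
}
predicted = {379, 383, 385, 486}
print(assign_epitope_class(predicted, classes))  # class4
```

Here the prediction shares three residues with class4 and one with class1, so it is assigned to class4.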
We have utilized three different datasets as training sets to assess the impact of varying data samples on the model’s antibody sensitivity, with the results presented in Table 4.Table 4 The table presents the performance metrics of the EpiScan model re-trained on different datasets and tested on the DMS-3104 datasetEpiScan trained on the DB1 dataset yielded a Precision of 0.417, Recall of 0.434, AUROC of 0.618, AUPR of 0.289, and an F1-score of 0.374. When trained on the DB3* dataset, the model’s performance improved, achieving a Precision of 0.539, Recall of 0.536, AUROC of 0.712, AUPR of 0.412, and an F1-score of 0.483. The highest performance across all metrics was observed when the model was trained on the combined DB3 + DB2 dataset, with a Precision of 0.576, Recall of 0.588, AUROC of 0.788, AUPR of 0.568, and an F1-score of 0.554. These observations indicate that integrating multiple datasets for model training can significantly enhance model performance. The diversity and richness of information available in the combined DB3 + DB2 dataset likely provide a more comprehensive representation of the epitope space, enabling the model to learn more generalized features that are effective across different datasets. Figure 3 has been plotted to further investigate the predilection of the model for predicting CoV2 epitopes.Fig. 3: Confusion matrices depicting the performance of the models trained on different datasets.a DB1, (b) DB3*, and (c) DB3 + DB2. The matrices illustrate the distribution of true versus predicted labels, providing insight into the classification accuracy and error patterns across the datasets.From Fig. 3, it is observable that when the EpiScan model is trained on datasets excluding CoV2 complexes, the model’s false positives are predominantly concentrated in Class1 and Class2 epitope regions, which are areas of the Receptor Binding Domain (RBD) with the highest exposure level. 
Training with the DB1 dataset, the model particularly underperforms in predicting Class3 and Class4 epitopes, with more than two-thirds of the epitope predictions falling within Class1 and Class2. When trained with DB3*, the model shows some improvement in prediction accuracy but still exhibits a bias towards Class1 and Class2. This suggests that without training on CoV2 data, the model demonstrates lower sensitivity to CoV2-specific antibodies, favoring predictions towards potential RBD candidate epitopes (i.e., the immunogenically stronger Class 1 and Class 2). However, with the inclusion of a small amount of CoV2 complex samples, the model’s sensitivity towards Class 3 and Class 4 specific antibodies increased, thereby enhancing its ability to predict anti-RBD specific epitopes. This also supports our statement that training a highly accurate and generalizable specificity epitope prediction model using existing complex data alone is challenging. Tuning the model on data specific to particular scenarios is a worthwhile approach, which is a significant reason for deploying models tailored to specific strains on our web-server.Performance of model componentsAs listed in Table 5, the effect of different blocks on the EpiScan model performance was evaluated by modifying the models, such as by removing or keeping specific components, including the Hinge block and the VH/VL separation. The results indicated that the original EpiScan model with ECA/Rotation blocks achieved the highest performance in terms of ${Precision}$ (0.239 ± 0.019), ${Recall}$ (0.776 ± 0.038), ${AUROC}$ (0.715 ± 0.008), ${AUPR}$ (0.304 ± 0.009), $F1{\_score}$ (0.338 ± 0.021) and ${MCC}$ (0.275 ± 0.018). This experiment highlighted the crucial role of each component in the EpiScan model’s overall performance. 
Removing any of these components leads to a drop in performance, emphasizing the importance of considering all these factors when designing and optimizing models for predicting antibody–antigen interactions.

Table 5 Impact of different blocks on EpiScan model performance

The importance of the Hinge block in the model was evident when it was removed (EpiScan | Hinge), as the performance metrics decreased slightly. The FRs and CDRs play a critical role in maintaining the structural stability and flexibility of the antibody, allowing it to bind to various antigens with high specificity and affinity35. Including the Hinge block in the EpiScan model enables the model to better capture antibody binding specificity through the coupling properties between FRs and CDRs, thereby improving the performance of predicting antibody-antigen interactions.

Similarly, the removal of VH/VL separation (EpiScan | HL) resulted in decreased performance metrics. The VH and VL are essential components of an antibody’s structure, and their correct separation and interaction are crucial for the antibody’s function. Including VH and VL separation in the EpiScan model can enhance the simulation of the complex structural and functional relationships between these chains, leading to improved accuracy of the predictions of antibody–antigen interactions. The lowest performance was observed when the Hinge block and VH/VL separation were disregarded simultaneously (EpiScan | Hinge | HL), further demonstrating the significance of these components in the model.

The removal of the Rotation module alone has a minimal effect on the overall performance of EpiScan but a more significant effect on the model’s stability (see the Robustness evaluation subsection for details). In conclusion, the incorporation of the Hinge block and VH/VL separation in the EpiScan model is essential for achieving high performance in predicting antibody–antigen interactions.
These components provide an accurate representation of the complex structural and functional relationships within antibodies, leading to improved model performance.Figure 4 offers an illustrative representation of how the EpiScan model operates at varying computational blocks to predict antibody-specific antigen epitopes. The model’s performance at each stage is evaluated using the ${AUROC}$ scores. The first computational block, the Rotation block, is responsible for simulating the translation and rotation of the antibody. Using this mechanism, the model attempts to identify the most probable binding region with the antigen. Before entering the Rotation block, the ${AUROC}$ score is 0.385, indicating poor initial discrimination. After passing through the Rotation block, which simulates antibody movement to identify likely binding regions, the ${AUROC}$ improves to 0.646. The model then proceeds to the next stage, the VH-Binding block. This block is responsible for initiating the reaction between the heavy chain (VH) of the antibody and the antigen, pinpointing the binding amino acid residues. This process enhances the prediction accuracy, reflected in the improved ${AUROC}$ score of 0.833. Subsequently, the model processes the information through the VL-Binding block, which builds upon the heavy chain-antigen recognition. During this stage, the reaction between the light chain (VL) of the antibody and the antigen takes place. This further includes possibly overlooked amino acid epitopes, enhancing the model’s comprehensive epitope prediction capability, the ${AUROC}$ reaches 0.942, though with some false positives. Overall, the stage-wise ${AUROC}$ scores demonstrate EpiScan’s capability in incrementally improving prediction performance through coordinated blocking representing key aspects of antibody-antigen binding. The analysis also reveals opportunities to enhance early positioning discrimination and reduce late-stage false positives.Fig. 
4: Visualization and analysis of the epitope prediction via EpiScan’s internal blocks (PDB 1TZH).The figure, from left to right, demonstrates the output of the EpiScan model at different stages of epitope prediction, namely the Rotation block, VH-Binding block, and VL-Binding block. The AUROC scores corresponding to each stage are also presented.Robustness evaluationRobustness was evaluated on DB1 and DB2 datasets by using EpiScan and PInet models, and the results are shown in Fig. 5.Fig. 5: Robustness evaluation of EpiScan and PInet on DB1 and DB2 datasets.The figure displays F1_score, AUROC, and AUPR of baseline predictions against link perturbations with varied ratios of random additions or removals. The label “EpiScan | Rotation” denotes the EpiScan model with the Rotation block removed. Similarly, “EpiScan | ECA” refers to the model with the ECA block removed, and “EpiScan | HL” represents a model without VH and VL separation, but with direct completion of antibody and antigen reaction.Different resulting datasets were generated by randomly adding or deleting different proportions of original links to evaluate the robustness of the model. Disturbance characteristic curves were then recalculated for $F1{\_score}$, ${AUROC}$, and ${AUPR}$. Figure 5 shows that the performance of the evaluation indicators of “EpiScan|Rotation” and “EpiScan | HL” models declined rapidly, indicating sensitivity to changes in datasets. The PInet model’s performance was better than that of the “EpiScan | HL” model, particularly on the DB2 dataset, when the disturbance ratio was high (>60%). The geometric topology modeling of the PInet model enabled it to extensively represent the network structure information, thereby providing some adaptability to high-scale disturbances. Meanwhile, the “EpiScan|Rotation” model was observed to be sensitive to interference. By comparison, EpiScan proved to be more stable, particularly in terms of $F1{\_score}$ and ${AUROC}$ indices. 
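The link perturbation procedure described above (randomly adding or deleting a given proportion of the original links) can be sketched as follows. This is a minimal illustration under stated assumptions: a link is modeled here as an (antibody residue, antigen residue) index pair, and the function name, the candidate pool, and the seed handling are hypothetical; the source does not give these details.

```python
import random
from typing import Iterable, Set, Tuple

Link = Tuple[int, int]  # assumed form: (antibody residue index, antigen residue index)


def perturb_links(links: Set[Link], candidates: Iterable[Link],
                  ratio: float, mode: str, seed: int = 0) -> Set[Link]:
    """Randomly remove or add round(ratio * len(links)) links."""
    rng = random.Random(seed)            # fixed seed for reproducible perturbations
    n = round(ratio * len(links))
    if mode == "remove":
        dropped = set(rng.sample(sorted(links), n))
        return links - dropped
    if mode == "add":
        # Only pairs that are not already links can be added as spurious links.
        pool = sorted(set(candidates) - links)
        return links | set(rng.sample(pool, min(n, len(pool))))
    raise ValueError("mode must be 'add' or 'remove'")


# Toy data: 10 true links on a 10 x 10 grid of possible residue pairs.
true_links = {(i, i) for i in range(10)}
all_pairs = {(i, j) for i in range(10) for j in range(10)}
fewer = perturb_links(true_links, all_pairs, ratio=0.4, mode="remove")
more = perturb_links(true_links, all_pairs, ratio=0.4, mode="add")
print(len(fewer), len(more))  # 6 14
```

Re-evaluating a model on such perturbed sets across a range of ratios gives the kind of disturbance curves shown for $F1{\_score}$, ${AUROC}$, and ${AUPR}$.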
The Rotation block that estimates the coordinate changes demonstrated some similarity to molecular docking, and it improved the anti-jamming ability of the model. The “EpiScan|Rotation” model has a flexible simulation that explains why it is sensitive to interference. The VH/VL separation calculation mechanism used to simulate the binding of antibody and antigen contributed to the edge prediction problem under high disturbance. Further, compared with different EpiScan variant models, the separation of Rotation modules with VH and VL enhanced the model’s robustness.Effects of different types of input featuresIn Methods details section, the input features used in EpiScan training based on sequence and structure information were explained in detail. This section focuses on investigating the effects of different types of input features on EpiScan performance in predicting epitopes. The input features are classified into six categories: (i) three-dimensional atomic coordinates of the antigen structure, (ii) solvent accessible surface area, (iii) local amino acid contact maps, (iv) conservative maps containing evolutionary information of antigen sequences, (v) protein language model coding of antibody sequences, and (vi) one-hot encoding and amino acid physicochemical properties coding for the antibody sequence.Supplementary Table 3 demonstrates that the combination of all four antigen features (i–iv) with the protein language model coding of antibody sequences (v) achieved the highest performance in terms of ${Precision}$, ${AUPR}$, $F1{\_score}$ and ${AUROC}$. This finding highlighted the effectiveness of incorporating the structural and evolutionary information of the antigen, in addition to the protein language model representation of the antibody, to achieve enhanced accuracy in predicting epitopes. 
Furthermore, protein language model coding (v) outperformed one-hot encoding and physicochemical property coding (vi) for the antibody sequence in terms of ${Recall}$, ${AUROC}$, ${AUPR}$, and $F1{\_score}$. This finding indicated that the language model encoding representation can better capture the complex relationships between antibody sequences and their binding epitopes. Importantly, omitting any one of the antigen features (i–iv) led to a decline in performance, illustrating the significance of considering all these features in predicting accurate epitopes. Overall, this study emphasized the crucial role of input feature selection in EpiScan performance and advocated for a combination of structural, evolutionary, and sequence-based representations to improve epitope prediction accuracy. Fig. 6a visualizes these comparisons.

Fig. 6: Ablation experiments of the EpiScan model on DB1. a Effects of various input feature combinations on EpiScan performance. The input features include (i) Three-dimensional atomic coordinates of the antigen structure, (ii) Solvent accessible surface area, (iii) Local amino acid contact maps, (iv) Conservative maps with evolutionary information of antigen sequences, (v) Protein language model coding of antibody sequences, and (vi) One-hot encoding and amino acid physicochemical properties coding for antibody sequence. b Effects of the combination of different loss functions on the performance of EpiScan.

Effects of loss functions on EpiScan performance

The EpiScan model, which is developed for predicting antigen–antibody binding epitopes, faces the challenge of data imbalance because epitope residues are far fewer than non-epitope data points. Thus, the effects of different loss functions on EpiScan performance were investigated to address this issue. Supplementary Table 4 presents the effects of various combinations of loss functions on the performance of EpiScan.
The loss functions considered include cross correlation (CC) loss, generalized dice (GD) loss, and Kullback–Leibler (KL) divergence.The results demonstrated that the best performance was achieved by combining all three loss functions. This combination effectively addressed the data imbalance issue and improved the overall performance of EpiScan in predicting antigen–antibody binding epitopes. Furthermore, using CC loss alone led to a significant increase in EpiScan’s performance across all evaluation metrics compared with using GD loss alone. Combining GD loss with either KL divergence or CC loss (i.e., GD + KL or GD + CC) resulted in a slightly enhanced performance. The results underscored the importance of selecting suitable loss functions for tackling the data imbalance challenge in the training process of EpiScan. The presentation in Fig. 6b offers a more intuitive visualization.Quantitative mapping of high-throughput neutralizing antibodies on the SARS-CoV-2 RBD reveals variable epitope immunogenicityThe distribution of the antibody–receptor binding domain (RBD) interface reflects the prevalence of group-specific sites on the antigen interface of different neutralizing antibodies of SARS-CoV-2 (wild-type, WT), as depicted in Fig. 7.Fig. 7: Distribution of antibody-RBD interfaces for neutralizing antibodies of SARS-CoV-2 (PDB:6XC4 chain A).Different colors indicate the prevalence of group-specific sites on the antigen interface, with red representing higher prevalence and blue representing lower prevalence. e Mapping of specific epitopes from existing antigen-antibody complexes. f EpiScan model-based mapping of specific epitopes on the antigen protein using high-throughput BCR antibody sequencing data. g Comprehensive map of specific sites based on epitope conservation, immunogenicity, and immune escape scores. The closer to red, the higher the immunogenicity and conservation, while the closer to blue, the lower the immunogenicity/conservation. 
Yellow sites warrant particular attention and are suitable for use as vaccine epitopes. Conformational analysis of IY-2A targeting the hot region (379, 383-386). a The RBD is shown in gray; the light chain and heavy chain of IY-2A are shown in green and yellow, respectively. b Two H-bonds between Y38 and C379, and the hydrophobic interaction between Y38 and P384. c The hydrogen-bond interactions of E56-S383, E56-T385 and T113-T385. d The electrostatic interaction between Y55 and K386.

The map in Fig. 7e, which was based on the specific epitopes of existing antigen–antibody complexes36, uses different colors to represent varying levels of prevalence, with red indicating high prevalence and blue indicating low prevalence. Furthermore, the specific epitopes on the antigen protein were mapped using the EpiScan model, which utilized high-throughput BCR antibody sequencing data, as shown in Fig. 7f. The higher frequency of neutralizing antibody binding for S477 and F486 in Fig. 7e is evident in Fig. 7f, likely due to immune escape by mutations at these sites. Indeed, in the later stages of the pandemic, it was observed that neutralizing antibodies exhibited a higher affinity towards regions other than the S477 and F486 sites33,37. Meanwhile, specific epitope mapping using high-throughput BCR sequencing data highlighted the K386 locus, contrary to the complex statistics. Previous studies have shown that site 386 is one of the representative targeting sites of RBD neutralizing antibodies37. Figure 7g presents a comprehensive evaluation of epitope conservation, immunogenicity, and immune escape score, compiling a map of specific sites. The yellow sites were identified as highly important and could serve as valuable vaccine epitopes. The red and yellow hotspots in Fig. 7g partially coincide with related studies38. Specifically, the sites Q409, E406, R403, Y473, Q474, K458, R457, L492, Q506, Y508, L461, and C379 were marked as potential vaccine epitopes.
These sites were selected based on high sequence conservation and high functional conservation39, combined with a low immune escape score40 and a high immunogenicity score calculated by EpiScan. It is noteworthy that recent studies have demonstrated a significant overlap between the epitope regions targeted by broad-spectrum neutralizing antibodies in non-RBM areas and the amino acid positions C379 and S383-K386 predicted by EpiScan41. We analyzed the potential reasons for the high antibody affinity associated with these non-RBM areas (C379, S383-K386). Non-bonded interactions play vital roles in antibody-antigen complex binding dynamics. Based on the analysis of the interactions between IY-2A (which interacts with a conserved conformational epitope and effectively neutralizes diverse sarbecoviruses while accommodating antigenic variations)41 and the SARS-CoV-2 (WT) RBD, there was a potential network of hydrogen bonds with the RBD through the CDRs of the antibody. As shown in Fig. 7, the side chain of Y38 in the heavy chain formed two hydrogen bonds with C379 (Y38:OG-C379:NH1, Y38:N-C379:OH). The E56 residue of the light chain and the S383 residue of the RBD were close to each other and engaged in two hydrogen bonds (E56:OE1-S383:HG, E56:HE2-S383:OG). Meanwhile, another two hydrogen bonds were formed between the side chain of E56 and T385 in the RBD (E56:OE2-T385:H, E56:OE2-T385:HG1). In addition, T113 in the heavy chain made a potential hydrogen bond with the oxygen atom of T385 in the RBD (T113:HG1-T385:OG1). The hydrogen bonds between T385 in the RBD and E56 in the light chain, and between T385 and T113 in the heavy chain, were shorter (2.7 Å and 2.8 Å, respectively) than the others, suggesting an important role of hydrogen-bonding interactions at this position. These interactions significantly strengthened the binding affinity between the antibody and antigen.
Furthermore, there was also an electrostatic interaction between the arene-OH of Y55 and the N-H group of K386. The hydrophobic interaction between the Y38 residue in the light chain and the P384 residue in the RBD could remain. Through structure affinity analysis, we described one of the possible sources of the strong immunogenicity of the hot region (C379, S383-K386), which as a vaccine epitope may stimulate the production of more IY-2A-like broad-spectrum neutralizing antibodies.

An interactive and user-friendly EpiScan web server

For the convenience of the community, we have developed a user-friendly web server for EpiScan, accessible at https://github.com/gzBiomedical/EpiScan, which includes the ‘Home’, ‘Submit’, and ‘Help’ pages (Fig. 8).

Fig. 8: Screenshots of the submit page of the EpiScan web server. a The page rendering before the task is submitted. b The steps to submit (detailed on the help page).

Vaccine designers and practitioners can utilize this web server to map the epitopes of high-throughput antibody sequences onto specific antigens. We conducted a comparative analysis of the CALIBER dataset42 with DB1, DB2, and DB3, from which we identified and selected 472 unique antigen-antibody PDB complex samples to serve as an independent test set for the development of our production model on the web server (Supplementary Table 9). Specifically, after benchmarking the performance of EpiScan through DB1, DB2, and DB3, we constructed a general model using the aforementioned data excluding all samples from these datasets. We then proceeded to test the generalizability of the model on the CALIBER dataset (for which we utilized a subset of 472 samples, selecting 17 humanized antibody samples with CDR identity below 70% compared to the training data; the independent test samples will be selectively adjusted in the future based on practical application scenarios).
Based on the actual test results, we set a minimum threshold of 0.80 AUROC to determine the optimal general model, EpiScan(general). Subsequently, we fine-tuned EpiScan(general) on 14 external SARS-CoV-2 samples from 2023 (Supplementary Table 10) to obtain the specialized model, EpiScan(CoV2). Similarly, we fine-tuned the model on 41 Flu-HA samples (Supplementary Table 10) to derive EpiScan(Flu-A), which is also deployed on our web server. It is worth noting that these models were required to achieve an AUROC of at least 0.85 on their respective fine-tuning datasets, reflecting a high level of precision and reliability, before being selected for deployment on the cloud server. For future work, we are considering incorporating experimental DMS data along with crystallographic data for the fine-tuning and optimization of specific models. The web server will be continuously updated with specific models for coronaviruses (SARS-CoV, SARS-CoV-2, MERS-CoV) and Influenza A viruses across various strains (we believe the current data is insufficient to train highly accurate general models). EpiScan takes antibody heavy chain and antibody light chain FASTA sequences as input. Users can select the appropriate prediction model based on the antigen type. Users can also set their own threshold values, and the prediction results are displayed interactively in a 3D protein visualization on the webpage. Additionally, the results can be downloaded as a TSV file. In the future, researchers will also be able to use this web server to validate the immunogenicity of engineered proteins and infer epitope drift of mutant viral strains. As illustrated in Fig. 8b, Step 3, we have established three reference threshold levels according to [Eq. 19], which are further detailed in Supplementary Fig. 4.
Based on the response curve for each antigen-specific model, these reference values indicate thresholds reflecting user preferences for either precision, recall, or a balanced approach between the two. For instance, with SARS-CoV-2, users seeking high precision with minimal false positives may set the threshold near 0.90; those prioritizing a higher recall rate to cover more potential epitopes may adjust the threshold closer to 0.20; and for a more balanced outcome, a threshold around 0.45 would be recommended. The graphical representation offers an optimal threshold to accommodate varying tolerance levels for false positives and false negatives. Beyond the three provided references, users can select an appropriate beta curve based on their tolerances and identify the peak value on the curve for threshold setting. We hope that the introduction of EpiScan can reduce unnecessary experimental procedures and provide hypotheses and supplements for accelerating vaccine development.
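The threshold guidance above can be illustrated with a short sketch. It assumes that [Eq. 19], which is not reproduced in this section, is the standard F-beta score, and that "identifying the peak value on the beta curve" means choosing the threshold that maximizes F-beta over a grid. The function names `f_beta` and `best_threshold`, the threshold grid, and the toy scores and labels are hypothetical and are not taken from the EpiScan web server or its code.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; returns 0.0 when it is undefined."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


def best_threshold(scores, labels, beta=1.0, thresholds=None):
    """Return (threshold, F-beta) at the peak of the F-beta curve.

    scores: per-residue predicted probabilities (assumed in [0, 1])
    labels: 1 for epitope residues, 0 for non-epitope residues
    beta < 1 favors precision, beta > 1 favors recall.
    """
    if thresholds is None:
        # Hypothetical grid: 0.01, 0.02, ..., 0.99
        thresholds = [i / 100 for i in range(1, 100)]
    best_t, best_f = thresholds[0], -1.0
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = f_beta(precision, recall, beta)
        if f > best_f:  # strict comparison keeps the lowest peak threshold
            best_t, best_f = t, f
    return best_t, best_f


# Toy data for illustration only (not EpiScan output).
toy_scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
toy_labels = [1, 1, 0, 1, 0, 0]

for name, beta in [("precision-oriented", 0.5), ("balanced", 1.0), ("recall-oriented", 2.0)]:
    t, f = best_threshold(toy_scores, toy_labels, beta=beta)
    print(f"{name}: beta={beta}, threshold={t:.2f}, F-beta={f:.3f}")
```

On a real model, the scores and labels would come from a held-out set for the chosen antigen-specific model. Under these assumptions, a smaller beta pushes the selected threshold upward (fewer false positives) and a larger beta pushes it downward (fewer false negatives), which matches the precision, balanced, and recall reference levels described in the text.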
