Biomedical relation extraction method based on ensemble learning and attention mechanism | BMC Bioinformatics

We begin by outlining the experimental environment and datasets. We then introduce the selected comparison approaches in detail. Finally, we discuss the performance of the different approaches.

Experimental setups

The operating system of our experiment is Ubuntu 18.04, and the hardware environment is an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz. The GPU is 4 * NVIDIA A40 (48GB), the deep learning framework is PyTorch 1.7.0, and the Python version is 3.6.12.

Table 1 summarizes all parameters of the models used in the experiment. The model consists of 12 hidden layers, each containing 768 hidden units, with GELU as the activation function. The learning rate is set to \(1e-5\), with dropout applied at a rate of 0.3. The maximum sequence length (Max_length) is 300, the batch size is 8, and the models are trained for 20 epochs. These parameters are crucial in determining the performance and behavior of the models during training and evaluation.

Table 1 The parameters of models

Dataset

We evaluate the performance of the scheme on three benchmark datasets using standard precision (P), recall (R), and F1 score (F). The characteristics of these datasets are detailed in Table 2. We treat each dataset as a classification task. For PPI, the primary objective is to predict whether two proteins interact, typically treated as binary classification. For ChemProt and DDI, the multiple relationships in each dataset are treated as multi-class classification tasks. The ChemProt corpus contains five positive classes (CPR:3, CPR:4, CPR:5, CPR:6, CPR:9) and one negative class. Similarly, the DDI corpus has four positive labels (ADVICE, EFFECT, INT, MECHANISM) and one negative label.
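As an illustration of this scoring protocol, the sketch below computes micro-averaged P, R, and F over the positive classes of a DDI-style label set, with the negative label excluded as is standard in biomedical relation extraction. The example predictions are invented for illustration; the paper's actual scoring script is not given.

```python
# Hypothetical scoring sketch: micro P/R/F1 over positive classes only.
# The DDI label names follow the corpus description; predictions are illustrative.
POSITIVE = {"ADVICE", "EFFECT", "INT", "MECHANISM"}

def micro_prf(y_true, y_pred):
    # A prediction counts as a true positive only if it matches a positive gold label.
    tp = sum(t == p and t in POSITIVE for t, p in zip(y_true, y_pred))
    pred_pos = sum(p in POSITIVE for p in y_pred)   # predicted positives
    true_pos = sum(t in POSITIVE for t in y_true)   # gold positives
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / true_pos if true_pos else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = ["EFFECT", "ADVICE", "NEGATIVE", "MECHANISM", "INT", "EFFECT"]
y_pred = ["EFFECT", "ADVICE", "NEGATIVE", "EFFECT", "INT", "EFFECT"]
print(micro_prf(y_true, y_pred))
```

For PPI the same function degenerates to binary scoring with a single positive class.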
However, owing to the absence of standardized training and testing splits for PPI, we evaluate with 10-fold cross-validation.

Table 2 Datasets of PPI, DDI, and ChemProt

Comparative approaches

To thoroughly assess the performance of the scheme, we compare it with five mainstream approaches.

BioBERT: a pre-trained language model for the biomedical domain, built on the Transformer architecture and trained on extensive biomedical literature. Its features include consideration of medical terminology and domain-specific syntax, and it can be fine-tuned on biomedical tasks to enhance performance.

BlueBERT: a pre-trained model designed for the clinical medical domain, trained on extensive clinical text with the Transformer architecture and focused on improving performance on medical text understanding tasks. It supports fine-tuning on specific tasks within the clinical domain to further optimize performance.

PubMedBERT: combines the BERT architecture with the specialized knowledge contained in the vast medical literature of the PubMed database. Pre-trained on this literature, the model captures subtle differences between professional terms and concepts, and therefore performs better on biomedical natural language processing tasks.

BlueBERT(-M), BioBERT_SSL_Att, and PubMedBERT_SSL_Att: these models all adopt sub-domain adaptation to improve adaptability and generalization, and further improve classification accuracy through the SSL fine-tuning mechanism. This is the method used in reference [35].

BioBERT+CLEK and PubMedBERT+CLEK: these schemes use external knowledge to generate additional data so that the model learns more generalized text representations, and contrastive learning further improves performance.
This is also the method used in reference [34].

Results and discussions

Table 3 Performance of different approaches on PPI, DDI, and ChemProt

Table 3 offers a comprehensive comparison between SARE and the other approaches across the three datasets. Firstly, we evaluated SARE against the original BERT variants. Compared to the top-performing BERT variant, PubMedBERT, SARE exhibited notable improvements in F1 score, achieving gains of 4.8, 8.7, and 0.8 points on the PPI, DDI, and ChemProt datasets, respectively. Secondly, we assessed SARE against models employing sub-domain adaptation and SSL fine-tuning mechanisms. On PPI, SARE outperformed the best-performing model, BioBERT_SSL_Att, achieving an F1 score of 85.8%, a 2.0 percentage point improvement. On DDI, SARE obtained an F1 score of 92.0%, an improvement of 11.6 percentage points over the top-performing BioBERT_SSL_Att model. On ChemProt, SARE achieved an F1 score of 82.8%, a modest improvement of 0.6 percentage points over the best-performing PubMedBERT_SSL_Att model. Lastly, we compared SARE with approaches leveraging external knowledge and contrastive learning. On PPI, SARE surpassed the most effective model, BioBERT+CLEK, achieving an F1 score of 85.8%, a performance improvement of 4.7 percentage points. On DDI, compared to the top-performing PubMedBERT+CLEK model, SARE reached an F1 score of 92.0%, a significant improvement of 6.3 percentage points. On ChemProt, SARE exhibited an F1 score of 82.8%, surpassing the highest-performing BioBERT+CLEK model by 6.9 percentage points. These comparisons evidence the effectiveness and competitiveness of SARE in diverse biomedical relation extraction tasks.

The above analysis shows that SARE achieves a remarkable improvement in F1 score on the DDI dataset compared to the best-performing baseline.
This significant improvement can be attributed to the nature of the DDI dataset, which involves complex multi-class classification with distinct relationships such as ADVICE, EFFECT, INT, and MECHANISM. The stacking ensemble combined with attention mechanisms is particularly effective at capturing the subtle nuances and interactions between drugs, leading to superior performance. Moreover, the ensemble approach leverages the strengths of BioBERT, PubMedBERT, and BlueBERT, each pre-trained on large biomedical corpora, enhancing the model's ability to generalize and identify complex drug interactions. The relatively modest gain on PPI can be attributed to the binary nature of PPI tasks, which are generally less complex than multi-class classification. Nonetheless, the enhancement demonstrates the effectiveness of our approach in capturing protein interaction patterns, benefiting from the ability of ensemble learning to reduce variance and improve prediction robustness. The slight increase on ChemProt may be due to the inherent challenges of the dataset, which contains multiple classes with overlapping features, making it difficult for models to unambiguously classify the relationships.

To further illustrate the advantages of SARE, we analyzed its confusion matrix and area under the curve on the three datasets, as shown in Fig. 6. Figures (a) and (d) illustrate the performance of SARE on the PPI dataset. Figure (a) indicates that class 0 has a high prediction accuracy with 468 correct predictions, while class 1 also has a high accuracy with 300 correct predictions. However, there were also 86 instances where class 0 was incorrectly predicted as class 1 and 25 instances where class 1 was incorrectly predicted as class 0. Figure (d) demonstrates the classification performance of the model, with an area under the curve (AUC) of 0.89, indicating a robust classification capability.
Figures (b) and (e) display the performance of SARE on the DDI dataset. Figure (b) plots the prediction results for the different categories, showing that the DDI false category has the highest prediction accuracy, while the DDI advice category has lower accuracy. The AUC values in Figure (e) range from 0.92 to 1.00, indicating very high classification performance in certain categories. Figures (c) and (f) show the performance of SARE on the ChemProt dataset. Figure (c) reveals the variation in prediction accuracy across categories: for example, the CPR:3 category exhibits higher prediction accuracy, while the CPR:4 category performs less well. In Figure (f), the AUC values range from 0.66 to 0.81, indicating significant differences in classification performance between categories; the AUC value for label 3 is the highest at 0.81, while that for label 1 is the lowest at 0.66.

From the above analysis we can conclude that the SARE method performs well on all datasets, especially on DDI. However, there are notable differences in classification performance between datasets and categories. These differences can be attributed to the sample distribution of the categories, the discriminability of the features, and the generalization ability of the model.

Fig. 6 Confusion matrix and area under the curve of SARE

Performance comparison with various BERT variant models

Figure 7 illustrates the performance comparison between SARE and various BERT variant models. Firstly, it is evident from the graph that SARE achieves the highest performance scores across all three datasets, indicating a significant advantage in handling these specific tasks. Secondly, by comparing the performance of the different models, we observe the effectiveness of the stacking strategy in enhancing model performance. Stacking is an ensemble learning method that improves overall prediction accuracy by combining predictions from multiple models.
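A minimal illustration of the stacking idea is sketched below, with toy stand-in base learners rather than the fine-tuned BERT variants used in SARE: the class probabilities of each base classifier are combined by a meta-classifier that learns how much to trust each one.

```python
# Conceptual stacking sketch (not the authors' exact pipeline).
# Base learners here are generic classifiers on synthetic features;
# in SARE they would be fine-tuned biomedical BERT variants.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary relation label

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),  # meta-learner over base outputs
    stack_method="predict_proba",          # combine probabilities, not hard labels
)
stack.fit(X, y)
print(stack.score(X, y))
```

Combining probability distributions rather than hard labels lets the meta-learner exploit each base model's confidence, which is the usual choice when the base learners output softmax scores.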
In SARE, this strategy plays a crucial role, resulting in superior performance on each dataset compared to the individual BERT variant models. On the PPI, DDI, and ChemProt datasets, SARE improved on the best-performing single model by 2.14%, 10.98%, and 0.12%, respectively. Lastly, despite the domain-specific optimizations of BioBERT, BlueBERT, and PubMedBERT, SARE still demonstrates higher performance in these domains. This suggests that SARE not only has advantages in general applicability but also holds potential for specific domain applications.

Fig. 7 Comparison of performance between SARE and a single BERT variant

Fig. 8 Comparison of t-test between SARE and a single BERT variant

Figure 8 presents a performance comparison of the SARE model with various BERT variants, including BioBERT, BlueBERT, and PubMedBERT, across the PPI, DDI, and ChemProt datasets. We used t-tests to assess the differences in performance between the SARE model and these baselines. In the figure, *, **, and *** denote significance levels of \(0.01 \le p < 0.05\), \(0.001 \le p < 0.01\), and \(p < 0.001\), respectively. On the PPI dataset, the SARE model significantly outperformed the baseline models with a higher F1-score. The t-test results showed a t-value of 7.65 with a p-value of 0.017 compared to BioBERT, a t-value of 15.34 with a p-value of 0.004 compared to BlueBERT, and a t-value of 8.25 with a p-value of 0.014 compared to PubMedBERT. These p-values indicate that the differences in performance are statistically significant and unlikely to have occurred by chance, reinforcing the robustness of the SARE model's superiority. The 95% confidence intervals for the mean differences between the SARE model and the baselines did not include zero, further supporting the statistical significance of these findings and suggesting that the true performance difference consistently favors the SARE model.
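This kind of significance test can be reproduced with a short script; the per-run F1 scores below are illustrative placeholders, not the paper's measurements, and a two-sample t-test is assumed since the exact test variant is not specified.

```python
# Illustrative t-test between two models' per-run F1 scores.
# The numbers are made up for demonstration; replace with real fold/run scores.
from scipy import stats

sare_f1    = [85.9, 85.6, 85.8, 86.0, 85.7]   # hypothetical SARE runs
biobert_f1 = [80.9, 81.2, 81.0, 81.3, 81.1]   # hypothetical baseline runs

t, p = stats.ttest_ind(sare_f1, biobert_f1)   # independent two-sample t-test
print(f"t = {t:.2f}, p = {p:.4g}")
```

When the same cross-validation folds are shared between models, a paired test (`stats.ttest_rel`) is the tighter choice.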
Similarly, on the DDI dataset, the SARE model demonstrated an exceptionally high F1-score, with t-values and p-values of 8.20 (p=0.014) compared to BioBERT, 14.25 (p=0.005) compared to BlueBERT, and 10.32 (p=0.011) compared to PubMedBERT. The consistently low p-values across these comparisons highlight the strong evidence that the SARE model outperforms the baselines. The 95% confidence intervals in all cases again exclude zero, indicating the reliability of these results and the magnitude of the performance improvement. On the ChemProt dataset, the SARE model also exhibited superior performance, with t-values of 5.12 (p=0.032) against BioBERT, 10.56 (p=0.008) against BlueBERT, and 6.78 (p=0.025) against PubMedBERT. While the p-values are slightly higher in some comparisons, they still indicate statistical significance, and the confidence intervals further reinforce the model's advantage. These results collectively demonstrate the stable and substantial improvement of the SARE model over the baselines across all three tasks.

Performance comparison with attention mechanism methods

Figure 9 illustrates the performance comparison between SARE and methods utilizing attention mechanisms. From the graph, it is evident that SARE achieves higher performance scores than BioBERT_SSL_Att and PubMedBERT_SSL_Att, which employ attention mechanisms, across all datasets. On the PPI dataset, SARE outperforms the other two schemes with an F1 increase of 2.14–2.39%. Similarly, on the DDI dataset, SARE improves the F1 score by 12.05–15.00% compared to the other schemes. For the ChemProt dataset, SARE achieves an F1 increase of 5.61–7.95%. This suggests that SARE adopts more efficient or better-suited methods for capturing and leveraging key information within the data when handling these specific tasks.
Furthermore, SARE employs stacking ensemble learning, which combines predictions from multiple models to enhance overall performance, thereby improving the model's generalization ability and robustness. This ensemble effect is a key factor contributing to the superior performance of SARE compared to single attention mechanism models.

Fig. 9 Comparison of performance between SARE and the methods utilizing attention mechanisms

Fig. 10 Comparison of t-test between SARE and the methods utilizing attention mechanisms

Figure 10 compares the performance differences between the SARE model and models utilizing attention mechanisms, namely BioBERT_SSL_Att and PubMedBERT_SSL_Att. The t-test results indicate that the SARE model significantly outperforms the attention-based baselines on the PPI dataset in terms of F1-score. Specifically, compared to BioBERT_SSL_Att, the SARE model achieved a t-value of 5.23 with a p-value of 0.039; against PubMedBERT_SSL_Att, a t-value of 6.98 with a p-value of 0.022. These p-values, both below the 0.05 threshold, indicate that the observed differences in F1-score are statistically significant, meaning the likelihood of these differences occurring by random chance is very low. Furthermore, the 95% confidence intervals for the mean differences between the SARE model and these attention-based models do not include zero, further confirming the robustness of the SARE model's superior performance on the PPI dataset. On the DDI dataset, the SARE model also demonstrated superior performance compared to the attention-based models. The t-test results yielded a t-value of 4.85 with a p-value of 0.045 against BioBERT_SSL_Att, and a t-value of 7.45 with a p-value of 0.020 against PubMedBERT_SSL_Att. Both p-values are below 0.05, underscoring the statistical significance of these performance differences.
The corresponding confidence intervals further support that the performance advantage of the SARE model is consistent and not due to random variation. Similarly, on the ChemProt dataset, comparisons of the SARE model with the attention-based models also demonstrated statistical significance. The t-test results showed a t-value of 3.78 with a p-value of 0.050 against BioBERT_SSL_Att, and a t-value of 5.67 with a p-value of 0.030 against PubMedBERT_SSL_Att. While the p-value for the comparison with BioBERT_SSL_Att is right at the 0.05 threshold, it still indicates marginal statistical significance, suggesting that the SARE model has a performance advantage, albeit less pronounced on this dataset. Nonetheless, the confidence intervals again exclude zero, reinforcing the conclusion that the SARE model consistently outperforms attention-based models across tasks. These results suggest that the SARE model effectively leverages attention mechanisms to enhance performance in complex biomedical text processing.

To validate the role of attention mechanisms in our model, we conducted additional experiments comparing the performance of the SARE model with and without attention mechanisms. The comparison was performed across the PPI, DDI, and ChemProt datasets to assess the impact of attention mechanisms on relation extraction tasks. As shown in Table 4, the inclusion of attention mechanisms results in a significant improvement in model performance across all datasets. Specifically, when attention mechanisms were applied, the F1-score improved by 3.4 percentage points on PPI, 3.9 percentage points on DDI, and 5.2 percentage points on ChemProt.
These results demonstrate the critical role that attention mechanisms play in enhancing the model's ability to focus on relevant parts of the text, leading to more accurate relation extraction.

Table 4 Impact of attention mechanisms on model performance

Comparing F1 values across all schemes on various datasets

Figure 11 presents a comparison of F1 values across the various schemes on the different datasets. Considering the task types, in protein-protein interaction (PPI) tasks, SARE demonstrates the highest performance, followed by BioBERT_SSL_Att and PubMedBERT+CLEK; compared with these two schemes, SARE improves performance by 2.39–9.02%. Similarly, in drug-drug interaction (DDI) tasks, SARE also exhibits superior performance, with PubMedBERT and PubMedBERT+CLEK following closely; compared with these two schemes, SARE improves performance by 10.98–11.79%. In the ChemProt task, SARE once again leads by a significant margin, surpassing BioBERT+CLEK and PubMedBERT+CLEK; compared with these two schemes, SARE improves performance by 0.12–11.79%. These results underscore the effectiveness and robustness of SARE across diverse tasks and datasets.

Fig. 11 Comparison of F1 values for all schemes on different datasets

Figure 12 displays a comparison of F1 scores across the different datasets for all configurations. To evaluate the performance of the SARE model, we conducted t-tests comparing its F1-scores against all other configurations. On the PPI dataset, the SARE model significantly outperformed the other model combinations. Specifically, the t-test results showed a t-value of 12.14 and a p-value of 0.006 when compared to BioBERT+CLEK, and a t-value of 10.73 with a p-value of 0.008 when compared to PubMedBERT+CLEK. The low p-values (\(p < 0.01\)) indicate strong statistical significance, suggesting that the observed differences in performance are highly unlikely to have occurred by random chance.
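The confidence intervals referred to throughout these comparisons can be computed as below; the per-run F1 differences are illustrative placeholders, not the paper's data.

```python
# Sketch: 95% confidence interval for the mean F1 difference between two models.
# If the interval excludes zero, the difference is significant at the 0.05 level.
import numpy as np
from scipy import stats

diff = np.array([4.9, 4.6, 4.8, 4.7, 4.6])  # hypothetical per-run F1 gaps (SARE - baseline)
mean, sem = diff.mean(), stats.sem(diff)
lo, hi = stats.t.interval(0.95, df=len(diff) - 1, loc=mean, scale=sem)
print(f"95% CI for mean difference: [{lo:.2f}, {hi:.2f}]")
```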
Moreover, the 95% confidence intervals for the mean differences exclude zero, further reinforcing the reliability of the results and confirming the SARE model's superior performance on this dataset. Similarly, on the DDI dataset, the F1 scores of the SARE model were significantly higher than those of the other model combinations. The t-test results were t=13.45 with a p-value of 0.007 when compared to BioBERT+CLEK, and t=12.67 with a p-value of 0.008 when compared to PubMedBERT+CLEK. The p-values again indicate strong statistical significance, and the associated confidence intervals suggest that the true differences in performance are consistently in favor of the SARE model. These results provide compelling evidence of the SARE model's effectiveness on the DDI dataset, demonstrating that its performance advantage is both consistent and robust. On the ChemProt dataset, the SARE model also demonstrated significant superiority. The t-test results against BioBERT+CLEK were t=10.23, p=0.010, and against PubMedBERT+CLEK were t=8.34, p=0.016. While the p-values are slightly higher here, they still fall within the range of statistical significance (\(p < 0.05\)), indicating that the performance differences are unlikely to be due to random variation. The confidence intervals for these comparisons also support the conclusion that the SARE model maintains a consistent advantage over the baseline models. These results further substantiate the effectiveness of the SARE model in complex relation extraction tasks.

Fig. 12 Comparison of t-test results for all schemes on different datasets

Evaluation of long sentence dependency and generalization performance

We generate test sets by performing additional processing and filtering on the existing datasets (PPI, DDI, ChemProt) to evaluate the model's long sentence dependency and generalization performance.
For long sentence dependency, we selected from the PPI, DDI, and ChemProt datasets sentences that exceed 50 words in length, ensuring that these sentences contain complex dependencies such as nested clauses or multiple entity relationships. For the generalization performance test, we created a subset from each dataset by selecting sentences that feature less common vocabulary or exhibit different linguistic structures compared to the majority of the training data. This subset simulates the model's performance on cross-domain or less familiar data while staying within the biomedical domain.

The results in Table 5 demonstrate the performance of the various models on a data subset specifically designed to evaluate their ability to handle long sentence dependencies. Among the models tested, SARE consistently achieves the highest F1-scores across all three datasets (PPI, DDI, and ChemProt), outperforming all other models. The SARE model's superior performance on long sentences can be attributed to its use of ensemble learning combined with attention mechanisms. These techniques enable the model to capture and prioritize important features in long and complex sentences, resulting in more accurate relation extraction. While baseline models such as BioBERT, BlueBERT, and PubMedBERT perform reasonably well, they exhibit a noticeable drop in F1-scores compared to SARE. This suggests that while these models are powerful for general relation extraction, they may struggle with the complexities introduced by longer sentences. BioBERT_SSL_Att and PubMedBERT_SSL_Att show improved performance over their standard counterparts, highlighting the importance of attention mechanisms that allow the model to focus on relevant parts of the text, although SARE still outperforms these models by a significant margin. Models like BioBERT+CLEK and PubMedBERT+CLEK, which utilize contrastive learning, also show solid performance but are still outpaced by SARE.
This indicates that while contrastive learning improves the model's ability to differentiate between similar relations, it may not fully address the challenges posed by long sentence dependencies.

Table 5 Performance of different models on the long sentence dependency test

Table 6 provides a detailed comparison of the models' ability to generalize to data with varied linguistic structures and less common vocabulary, drawn from the same biomedical domain but differing from the majority of the training data. SARE leads the performance metrics across all datasets, demonstrating strong generalization capabilities. The relatively smaller decline in F1-scores for SARE compared to the other models suggests that the ensemble learning strategy effectively mitigates the challenges posed by domain shifts and less common linguistic patterns. Standard BERT-based models (BioBERT, BlueBERT, PubMedBERT) experience a noticeable drop in performance on this test subset, indicating that while they perform well on data similar to their training set, they are less effective when faced with unfamiliar language styles or rare terms. The models enhanced with sub-domain adaptation and contrastive learning (e.g., BioBERT_SSL_Att, PubMedBERT_SSL_Att, BioBERT+CLEK, PubMedBERT+CLEK) exhibit improved generalization compared to their baseline counterparts. However, these enhancements are still not enough to surpass SARE, which suggests that while these techniques contribute to better generalization, SARE's approach of combining multiple models provides a more comprehensive solution.
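The long-sentence subset construction described earlier amounts to a simple length filter; a minimal sketch is shown below (the example record fields are hypothetical, since the paper's preprocessing scripts are not given).

```python
# Hypothetical sketch of the long-sentence subset filter (> 50 words).
# Each example is assumed to carry a raw sentence and a relation label.
def long_sentence_subset(examples, min_words=50):
    """Keep examples whose sentence exceeds `min_words` whitespace tokens."""
    return [ex for ex in examples if len(ex["sentence"].split()) > min_words]

corpus = [
    {"sentence": "Aspirin inhibits platelet aggregation.", "label": "EFFECT"},
    {"sentence": " ".join(["token"] * 60), "label": "MECHANISM"},
]
print(len(long_sentence_subset(corpus)))  # only the 60-token example remains
```

The generalization subset would use an analogous predicate, e.g. flagging sentences whose vocabulary overlap with the training set falls below a threshold.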
The performance of SARE across both the long sentence dependency and generalization test subsets underscores its versatility and robustness.

Table 6 Performance of different models on the generalization performance test

Comparison with large language models (LLMs)

The recent advent of large language models (LLMs) such as GPT-4 and Llama3 has revolutionized natural language processing, demonstrating remarkable versatility and performance across a broad range of tasks. However, these models excel particularly in generative tasks and reasoning, where their capacity to produce fluent text or solve complex logical problems is unparalleled. In contrast, our study focuses on a different type of task, biomedical relation extraction, which is fundamentally a comprehension task. It requires the model to deeply understand and accurately extract semantic relationships from specialized biomedical texts, rather than generating new text or making complex inferences. The proposed SARE model offers several advantages in this context:

Task-Specific Optimization: While LLMs are designed to handle a wide variety of tasks, they may not be optimized for the specific challenges of relation extraction. SARE, on the other hand, is fine-tuned for the nuances of biomedical language, allowing it to better capture and interpret the specific relationships present in this domain.

Efficient Use of Resources: Large-scale models like GPT-4 are computationally intensive, both in training and inference. SARE achieves high accuracy in relation extraction with significantly lower computational costs, making it more practical for targeted biomedical applications.

Emphasis on Comprehension: SARE’s combination of ensemble learning and attention mechanisms enhances its ability to focus on the relevant parts of the text and accurately extract relationships. This is particularly important in comprehension tasks, where the model’s understanding of the input text directly impacts its performance.

Domain-Specific Insights: The focus on domain-specific language and relationships allows SARE to outperform general-purpose LLMs in tasks that require deep comprehension of specialized texts, such as those found in biomedical literature.
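To make the comprehension point above concrete, the sketch below shows a minimal attention-pooling layer in PyTorch (the paper's framework): a learned query scores each token, and the softmax-weighted sum of token states becomes the sentence representation fed to the classifier. This is an illustrative layer, not the exact SARE architecture.

```python
# Minimal attention-pooling sketch (illustrative, not the exact SARE layer).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, hidden=768):          # 768 matches the BERT hidden size
        super().__init__()
        self.query = nn.Linear(hidden, 1)    # learned scoring of each token

    def forward(self, token_states):         # (batch, seq_len, hidden)
        scores = self.query(token_states)    # (batch, seq_len, 1)
        weights = torch.softmax(scores, dim=1)          # attention over tokens
        return (weights * token_states).sum(dim=1)      # (batch, hidden)

pool = AttentionPool()
# Batch of 8, Max_length 300, as in the experimental setup.
sent = pool(torch.randn(8, 300, 768))
print(sent.shape)  # torch.Size([8, 768])
```

Because the weights are normalized over the sequence, informative tokens (e.g. entity mentions and trigger words) can dominate the pooled vector, which is the intuition behind the attention-based gains reported in Table 4.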

Table 7 F1 score and computational efficiency comparison

As demonstrated in Table 7, SARE significantly outperforms GPT-4 and Llama3 in F1 score across all three datasets: PPI, DDI, and ChemProt. The largest gain is observed on the DDI dataset, where SARE achieves an F1 score of 92.0, outperforming GPT-4 by 3.5 percentage points and Llama3 by 4.1 percentage points. This improvement can be attributed to SARE's domain-specific optimization, which allows it to capture subtle nuances in biomedical relationships, particularly in complex multi-class classification tasks. These relationships require deep comprehension of domain-specific terminology and context, a strength of SARE owing to its ensemble learning strategy built on three biomedical models. Moreover, SARE's advantage is not limited to F1 scores; its computational efficiency is a critical factor for practical applications. SARE uses significantly less memory (12.5 GB) than GPT-4 (28.0 GB) and Llama3 (26.5 GB). This lower memory footprint means that SARE can be deployed on more resource-constrained systems, making it suitable for environments where access to large-scale computing infrastructure is limited. In addition to memory efficiency, SARE's inference time is notably faster, completing in just 45 s compared to 120 s for GPT-4 and 110 s for Llama3. These efficiency gains make SARE not only more accurate but also more scalable for real-world biomedical applications where quick turnaround times are crucial. While GPT-4 and Llama3 are highly versatile models excelling in a wide range of general NLP tasks, they are not specifically optimized for domain-specific tasks such as biomedical relation extraction. These models are designed to handle a diverse array of tasks, from text generation to reasoning, which comes at the cost of being less tailored to specific domains.
SARE, on the other hand, is designed specifically for extracting relations from biomedical texts, leveraging domain-specific pre-trained models (such as BioBERT and PubMedBERT) combined with attention mechanisms and ensemble learning to maximize performance. This specialization enables SARE not only to identify relationships more accurately but also to do so with greater computational efficiency, offering a clear advantage over general-purpose LLMs.

In summary, while large language models like GPT-4 and Llama3 offer broad capabilities, the SARE model provides a critical advantage in biomedical relation extraction, where deep understanding and precise extraction of domain-specific relationships are essential. This makes SARE not only a valuable tool for advancing research in our current study but also a practical solution for overcoming the unique challenges posed by specialized tasks within the biomedical field.
