VAIV bio-discovery service using transformer model and retrieval augmented generation | BMC Bioinformatics

Motivation example

We first describe the intended use of this database system with motivating examples. Consider Table 1: most search engines straightforwardly retrieve results for basic queries like 1) using Boolean operators and title filters. In the example, to perform the specified search on PubMed, the ‘AND’ operator and the ‘Title’ filter are necessary. However, addressing search requirements such as examples 2) and 3) can be very challenging because we have practically no prior knowledge about which genes/proteins are most closely associated with a specific entity. Although previous biomedical text-mining approaches are helpful in finding documents that include specific biomedical terms, they often fail to provide statistically ranked quantitative information or meaningful relation information between biomedical entities.

In addition, researchers seek to obtain answers through natural language queries from a large number of documents. Classical term-based search systems rank a set of documents by computing a relevance score for each document based on a given query. However, a document can be relevant to a query even without matching terms. Thus, we adopt neural search, which uses vector embeddings to represent documents and queries more semantically. This makes it possible to capture the context of a term in a document and its semantic relations with other terms. Moreover, since the answers to a given query are often scattered across multiple documents, summarization is very helpful. For search and summary generation, we adopt retrieval augmented generation based on a large language model.

Database content

In this section, we describe the system design, including indexing and implementation, along with the data sources and the informatics of data generation. For convenience, we explain the details module by module. The system architecture is depicted in Fig. 1.

Fig. 1

NE (named entity) recognition and relation extraction module

Using the specialized NLP module, biomedical named entities, general keywords, and biomedical relations between the entities are progressively extracted from the documents. We first collected 219,317 publication abstracts from the 2023 PubMed baseline and 6,924 from PubMed daily updates. The National Library of Medicine (NLM) offers an annual baseline set of PubMed citation records as well as daily update records, both available for free download in XML format. In the MEDLINE PubMed XML, certain mandatory elements such as the article title, abstract, author name, publication date and journal title are essential for a record to be complete. On the other hand, optional supplementary elements like author affiliation, grant support, reference, keywords, chemical lists, MeSH terms, tags, and other metadata provide additional information that enhances the record’s comprehensiveness and usefulness. In this study, two supplementary elements, the chemical list and the MeSH term list, are considered in addition to the essential elements. Additionally, documents related to the functions of 8,499 targets from TTD [19], which describe therapeutic protein and nucleic acid targets, related diseases, pathways, and corresponding drugs, were added. These documents were indexed, but interactions between entities were not extracted. The biomedical named entity tagger [20] we used recognizes entities of chemicals/drugs, genetics (genes/proteins), and diseases/symptoms from the abstract texts. Its core engine for text entity recognition is based on BioBERT [4]. According to the study [20], this tagger achieved a micro-average F1-score of 0.86 for partial matches on the PGxCorpus [21].

In biomedical literature, accurate recognition and category mapping of entities are very challenging since they often exhibit inconsistencies and ambiguities in expressions due to synonyms, abbreviations, and diverse nomenclature of terms. 
In addition, some entities can belong to both the chemical compound/drug and protein/gene categories. For example, ‘interferon alfa-2b’ is a form of recombinant human interferon used to treat ‘hepatitis B and C infection’, ‘genital warts’, ‘hairy cell leukemia’, ‘follicular lymphoma’, ‘malignant melanoma’, and ‘AIDS-related Kaposi’s sarcoma’. It exhibits biological characteristics as a recombinant protein and, in some contexts, is used as a drug to mimic the action of the protein. In scientific and medical contexts, a comprehensive understanding requires considering these diverse perspectives.

To address potential errors of the NE tagger, we also include terms from the chemical list provided by the PubMed XML as entities if they appear in the abstract. However, since PubMed’s chemical list encompasses genes and proteins besides chemical compounds, we constructed an additional dictionary to differentiate them under the gene/protein category. It was compiled using the entities listed in the DrugBank [22], CTD (Comparative Toxicogenomics Database) [17, 18] and UniProt [23] databases along with the entities annotated in the ChemProt [24], DDI [25] and DrugProt [26] corpora.

After identifying entities in the text, their co-occurrences are indexed to investigate their relatedness. Similarly, general key terms are extracted and indexed after removing stop words for keyword search. The keywords in our search engine correspond to general terms, named entities, and MeSH terms. For each keyword, co-occurrences with named entities and MeSH terms are indexed and sorted by frequency. Table 2 presents terms associated with ‘mcp-1’, a protein that plays a crucial role in the immune response and inflammatory processes in the human body. These terms can be categorized into gene/protein, chemical compound, disease entities, MeSH terms and general terms. In our work, MeSH terms are treated as entities. 
Consequently, the search engine has indexed various pairs of entities, including the entity–entity, general term-entity, and entity-interactions.

Table 2 Associated terms for ‘mcp-1’

Our system classifies the type of interaction into the relevant category, as presented in Table 3, if a sentence contains a pair of entities that exhibit potential interaction. For instance, in the case of DDI, when two distinct chemical compounds or drugs are mentioned within the same sentence, they are considered as candidates for interaction classification. To analyze interaction candidates, approximately 2.12 million sentences from abstracts were processed using a sentence splitter.

Table 3 Target interactions

To extract relations between entities, we adopted the T5slim_dec model proposed in our previous study [9], which is a modified version of the original T5 [4] specifically designed for interaction generation. In the relation generation task, the transformer model generates a single interaction string, such as “DDI-effect” or “AGONIST”, as its output for each given sentence input. In this task, the self-attention mechanism in the decoder block primarily functions as an identity function, and the multi-head attention does not effectively capture connections between target tokens because there is only a single target token. Thus, the T5slim_dec model removes the self-attention layer in the general transformer’s decoder and integrates the target interaction labels directly into the vocabulary.

Consequently, T5slim_dec constrains its outputs (target labels) to be generated as complete whole tokens, rather than predicting a sequence of separated tokens in an autoregressive manner. It utilizes the pretrained parameters of SciFive [8], which were further finetuned on specific training datasets, namely ChemProt [24] and DrugProt [26], for BioCreative RE tasks. The model has demonstrated improved relation classification performance compared to SOTA models in the ChemProt and DDI tasks. 
It achieved an F-score of 0.92 on the DDI dataset and 0.943 on the ChemProt dataset.

In the ChemProt BioCreative task [24], interactions were grouped into 10 semantically related classes, labeled from CPR:1 to CPR:10. However, only five relation types were utilized to evaluate system performance. The types of interest correspond to CPR:3, CPR:4, CPR:5, CPR:6, and CPR:9. In contrast to the ChemProt evaluation, this work considers all CPR interaction types as target interactions. From a granularity perspective, these groups pose challenges for practical utility in biomedical applications and add complexity to the classification procedure. Moreover, the training datasets for CPR:7 (modulator) and CPR:8 (cofactor) are quite limited in size, which makes these categories difficult to classify accurately. Nevertheless, the ChemProt and DrugProt training datasets are widely recognized as Gold Standard datasets due to their comprehensive coverage and manual annotation by experts. For more details, please refer to the study [9]. The target interaction types considered in this work are presented in Table 3.

For the CDR task, the training dataset available for the T5slim_dec transformer model is extremely limited. The datasets for the BioCreative V Chemical-Disease Relation (CDR) task [27] comprised 1,500 PubMed abstracts, which were equally divided into 500 each for training, development, and testing, and were focused on chemical-induced disease (CID) relations. At the current stage, in the case of CDR, only potential interactions are recognized, because the datasets are insufficient for training more detailed and specific interaction types.

Consequently, from 226,241 abstracts, 85,006 diseases, 167,804 chemical compounds/drugs, and 143,042 proteins/genes were recognized. 
Additionally, 663,732 (CPR), 151,193 (DDI), and 302,091 (CDR) pairs were ultimately identified from 2.12 million sentences as exhibiting specific interactions between the entities after excluding recognized false interactions such as ‘NOT’ in CPR and ‘DDI-false’ in DDI, as shown in Fig. 2.

Fig. 2

Figure 3 shows various interactions associated with ‘calcium’ including ‘POTENTIAL’, ‘REGULATOR’, ‘DDI-effect’, ‘DDI-mechanism’, ‘SUBSTRATE/PRODUCT-OF’ and so on. Additionally, specific entities that interact with ‘calcium’ are identified. It has ‘REGULATOR’ relationships with proteins such as ‘calmodulin’, ‘parathyroid hormone (pth)’, ‘alkaline phosphatase’, ‘insulin’, ‘albumin’ and others.

Fig. 3 Some interactions associated with ‘calcium’

Indexing module

The indexing process makes it easier to access the content related to each entity. As stated earlier, the associated terms covered by our search engine include gene/protein, chemical compound, disease entities and MeSH terms. For efficient retrieval in entity and relation searches, three distinct types of index tables are employed: (1) an entity–entity inverted index table, designed to find associated terms for each entity, (2) an entity-relation index table to discover associated interactions for each entity, and (3) an entity-relation-entity inverted index table, which facilitates the identification of associated terms for a given entity and interaction, as detailed in Table 4. In a given document, if interactions between the same entities occur in multiple sentences and belong to the same interaction category, they are indexed only once and counted as one.

Table 4 Indexing structure for entity and relation database

As a result, the index size is substantial, as all entity pairs and entity-relation-entity triples are stored with their associated document information in the database. The size is expected to increase significantly as more documents are added. 
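The three index tables and the once-per-document counting rule can be illustrated with a minimal in-memory sketch. It is not the HBase schema of Table 4: the record layout, the function names `build_indexes` and `ranked`, the sample document IDs, and the choice to index every pair in both directions are assumptions made for illustration. For brevity, the entity–entity table is filled from relation records only, whereas the system derives it from co-occurrences.

```python
from collections import defaultdict

# Hypothetical relation records: (doc_id, sentence_id, entity_a, relation, entity_b).
RELATIONS = [
    ("pmid1", 0, "leptin", "REGULATOR", "glucose"),
    ("pmid1", 3, "leptin", "REGULATOR", "glucose"),  # same document, same category
    ("pmid2", 1, "leptin", "REGULATOR", "glucose"),
    ("pmid2", 2, "leptin", "INHIBITOR", "vitamin c"),
]


def build_indexes(relations):
    """Build the three tables as nested dicts whose leaves are sets of doc IDs."""
    entity_entity = defaultdict(lambda: defaultdict(set))    # entity -> entity -> docs
    entity_relation = defaultdict(lambda: defaultdict(set))  # entity -> relation -> docs
    triple = defaultdict(lambda: defaultdict(set))           # (entity, relation) -> entity -> docs
    for doc_id, _sentence_id, a, rel, b in relations:
        # Assumption: index both directions so either entity can be the query.
        for src, dst in ((a, b), (b, a)):
            # Storing doc IDs in a set means repeated sentences in one
            # document are indexed once and counted as one.
            entity_entity[src][dst].add(doc_id)
            entity_relation[src][rel].add(doc_id)
            triple[(src, rel)][dst].add(doc_id)
    return entity_entity, entity_relation, triple


def ranked(postings):
    """Sort keys by document frequency (descending), then alphabetically."""
    return sorted(((key, len(docs)) for key, docs in postings.items()),
                  key=lambda item: (-item[1], item[0]))


entity_entity, entity_relation, triple = build_indexes(RELATIONS)
print(ranked(entity_relation["leptin"]))        # interactions associated with an entity
print(ranked(triple[("leptin", "REGULATOR")]))  # entities for an entity and interaction
```

Keeping a set of document IDs per key, rather than a raw counter, is what makes the deduplication rule fall out naturally; a distributed store would need an equivalent idempotent write.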
To efficiently handle this expanding volume of data, we employ the Hadoop Distributed File System and a Hadoop-based NoSQL database, HBase [28].

Search and answer generation module

We provide answers derived from papers in response to natural language queries using the Retrieval Augmented Generation (RAG) method [16]. Rather than merely extracting documents that contain query keywords, it first retrieves articles likely to contain relevant information and then generates prompts from the query and the retrieved passages. Finally, a large language model generates answers from the prompts.

To retrieve abstracts, we adopt a hybrid method for passage retrieval that combines neural search with the keyword-based probabilistic retrieval model BM25, as shown in Fig. 4. This leverages the strengths of both approaches: neural search retrieves documents containing answers to natural language queries by identifying semantically related texts through embedding vectors, while BM25, a keyword-based model, emphasizes important keywords in documents relevant to the query. In the neural search, both documents and queries are converted into vector embeddings, and answers are located based on vector relatedness by comparing the query and document embeddings [15]. The system indexes the text with its vector embeddings in a vector index.

Fig. 4 System flow for Retrieval Augmented Generation

To this end, we utilize the RoBERTa model [29], where text is first tokenized by Byte Pair Encoding (BPE). BPE begins by splitting the text into individual characters and then progressively merges the most frequently occurring pairs of characters or character sequences. Consequently, it creates a vocabulary of the most common character combinations, which consists of whole words and subwords. This method is particularly effective in handling rare words and out-of-vocabulary terms. The tokenized text is transformed into vectors through embedding. 
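The merge procedure just described can be sketched in a few lines. This is a character-level toy version for illustration only, not RoBERTa's byte-level tokenizer; the function names `learn_bpe`, `apply_merge` and `encode`, the tiny word list, and the alphabetical tie-breaking rule are all assumptions.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts as a sequence of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken alphabetically (an assumption).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = learn_bpe(["leptin", "leptin", "lectin", "insulin"], num_merges=3)
print(merges)
print(encode("leptin", merges))
print(encode("resistin", merges))  # a word not seen during training
```

Because an unseen word is still decomposed into learned subwords and single characters, no input ever falls outside the vocabulary, which is the property that makes BPE useful for rare biomedical terms.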
The RoBERTa model combines token embeddings, which capture the semantic meanings of words, with positional embeddings that encode their order and position in the sequence. This allows the model to effectively understand the context, semantics, and structural relationships of tokens within a document, thereby enhancing the relevance and accuracy of the search results.

To generate an answer for a given natural language query, the ChatGPT model [3] is adopted as the LLM. By combining search capabilities with the LLM, we can mitigate the hallucination problem, which is one of the major issues in LLMs. By conditioning on retrieved relevant documents, the RAG architecture can generate more accurate and contextually appropriate answers, especially for questions requiring factual knowledge. Furthermore, this integration supports a comprehensive understanding of the contents of the search results from papers rather than simply presenting them in a list.

Utility and discussion

In this section, we describe the user interface and the intended uses of the database. In addition, we describe the benefits of the functionality each module provides and the improvements over similar existing databases. A case study of the use of the database and future plans are also presented.

User interface and utility

This database supports three types of search. As shown in Fig. 5, it offers functionality for both keyword and entity searches. It can invoke a search filter function to either include or exclude specific keywords and a result filter to limit the types of associated entities for a given query. The figure displays entity search results related to ‘leptin’. It identifies ‘obesity’ as the most closely related disease and presents links to related publications, along with their abstracts, on the right-hand side.

Fig. 5 Keyword and entity search

Figure 6 presents the results of a relation search related to ‘leptin’. 
As shown in the figure, if the relation-centered filter option is selected, it lists associated interactions and then displays the entities related to any chosen interaction. For instance, chemical compounds like ‘glucose’, ‘cholesterol’, ‘fatty acid’, ‘nitric oxide’, ‘triglyceride’, and ‘plasminogen’ are identified as having regulatory interactions with ‘leptin’. Documents related to these regulatory interactions are then collectively displayed on the right side of the interface. Conversely, when the entity-centered button is selected, interactions involving leptin are displayed with a focus on the entity.

Fig. 6

Figure 7 shows the response the system returns to a natural language query, ‘What is leptin and how is it related to glucose?’. It presents the search results in a manner similar to general search engines. The key differences are: (1) the documents that are likely to contain answers to the question are retrieved, not just those that include keywords from the query, and (2) the LLM generates the response based on the retrieved results. It summarizes the contents of the retrieved results as “leptin is a hormone secreted by fat cells that exerts significant effects on the brain, glucose metabolism, and muscle cells. It has insulin-like properties, enhancing glucose uptake and metabolism in muscle cells. Additionally, leptin increases glucose uptake and stimulates the synthesis of glycogen, akin to insulin’s effects”. Furthermore, references to the abstracts related to the generated summary are included.

Fig. 7 Search and summary with natural language query

Comparison with other databases and text mining systems

In this section, we first compare our database with other database search systems such as PubMed and CTD [17, 18]. Figure 8 shows PubMed’s search results for ‘leptin’. To control the results, PubMed offers filters related to text availability (such as abstract, free full text, full text), article attributes, article types, and publication dates. 
Additionally, there is an option to display the abstracts of the retrieved documents. No further enriched information related to the keyword is provided. On the other hand, CTD provides richer information about ‘leptin’, as shown in Fig. 9. It displays top-ranked interacting chemicals. CTD integrates data from diverse resources such as BioGRID, ChemIDplus, CL, GO, KEGG, MeSH, and PubMed, providing manually curated data relating chemical exposures with their genetic, molecular, and biological outcomes.

Fig. 8 PubMed search results for ‘leptin’. This figure shows the search result from PubMed (https://pubmed.ncbi.nlm.nih.gov/?term=leptin). Permission to use the screenshot has been granted by the PubMed team

Fig. 9 CTD search results for ‘leptin’. a This figure shows the search results from Comparative Toxicogenomics Database (https://ctdbase.org/detail.go?type=gene&acc=3952). Permission to use the screenshot has been granted by the CTD team. b This figure shows the search results from Comparative Toxicogenomics Database (https://ctdbase.org/detail.go?type=gene&acc=3952&view=ixn&chemAcc=D004041). Permission to use the screenshot has been granted by the CTD team

Some curation tools for data entry invoke functions with automatic quality control to assist CTD biocurators with annotation, and interactions are translated into readable sentences. For example, structured interaction notation such as ‘C1/n + act G1/p’ is displayed as “bisphenol A analog results in increased activity of ESR1 protein” by conjoining terms from vocabularies, MeSH, 4 chemical qualifiers, 4 action term degrees, 55 action terms, NCBI gene symbol and gene qualifiers. Figure 9 shows interaction sentences between ‘leptin’ and ‘dietary fat’. 
However, these expressions tend to be overly rigid and lack contextual depth; the resulting evidence sentences are so structured that they offer little unique insight.

Our system found that ‘glucose’ interacts with ‘leptin’ most frequently, acting as a “regulator”. Moreover, chemical compounds such as ‘glucose’, ‘atp (adenosine triphosphate)’, ‘cholesterol’, ‘k+ (potassium ion)’, ‘nitric oxide’, and ‘fatty acid’ are identified as having relationships with ‘leptin’. Figure 10 shows the chemical compounds or drugs that interact with ‘leptin’. Their interactions are quantified at the level of individual sentences rather than documents to facilitate comparison. ‘Leptin’ often regulates ‘k-atp channel’, ‘k+’, ‘k(atp)’, and ‘atp’.

Fig. 10 Interacting chemicals with ‘leptin’ and their interaction types

Although the database has yet to accumulate a large quantity of publication documents, it has notably uncovered interesting findings not present in the Comparative Toxicogenomics Database (CTD), such as the inhibition of ‘leptin’ secretion by ‘vitamin C’ and its impact on glucose levels, the reduction in ‘5-FU (fluorouracil)’ cytotoxicity through leptin treatment, and the activation of ‘ATP’-sensitive ‘K+ channels’ by ‘leptin’. Moreover, the interaction types are more fine-grained, as shown in the figure. Figure 11 displays interacting chemicals common to both CTD and our database, such as ‘glucose’, ‘cholesterol’, ‘fatty acids’, ‘acarbose’, ‘diazoxide’, ‘nitric oxide’, ‘tempol’, and so on. CTD contains data on 474 chemical compounds interacting with ‘leptin’, a substantial volume compared with the total of 99 interacting entities found by our system. Figure 12 shows some example sentences that convey interactions between ‘leptin’ and chemical compounds/drugs. 
As seen in the figure, interaction sentences encompass a wide range of contexts, making it important to provide accompanying literature information for the specific interaction of interest.

Fig. 11 Comparison of the top-50 ranked interacting chemical compounds/drugs

Fig. 12 Evidence sentences for interactions

Currently, CTD [18] consists of 17,117 chemicals, 54,355 genes, 6,187 phenotypes, 954 anatomical terms, 7,274 diseases, 202,000 exposure statements, and over 3.4 million evidence-based, manually curated interactions including chemical–gene, chemical–phenotype, chemical–disease, gene–disease, and chemical–exposure interactions. Additionally, it generates over 31 million inferred gene-disease interactions and 2.9 million statistically ranked chemical-disease predictive interactions from the internal integration of curated direct interactions. External integration with imported annotations from other databases produces an additional 13 million inferences. In CTD, if chemical A interacts with gene C, and gene C is associated with disease B, then chemical A and disease B are inferred to be related via gene C. In total, CTD includes over 50 million toxicogenomic relationships for computational analysis and hypothesis development. Our system identified 167,804 chemical compounds/drugs, 143,042 proteins/genes, and 85,006 diseases from 2.12 million sentences, a significantly broader range of entities than CTD covers. However, as illustrated in Fig. 13, the number of recognized interactions is significantly smaller than in CTD because CTD includes inferred interactions, and we only consider abstracts, not the full texts of publications.

Fig. 13 Resource comparisons with CTD

In practice, even with a large collection of papers, some crucial knowledge may be mentioned in only a very few papers, making it difficult to discover. Thus, there are inherent limitations in relying solely on research papers to extract important knowledge. 
This highlights the importance of human-curated databases in filling the gaps in knowledge extraction from academic literature. Furthermore, to ensure accuracy, all data entry must be carefully verified.

Nevertheless, the development of such automatic knowledge construction and mining systems should proceed simultaneously with the creation of curated databases. In this context, even if our database lacks completeness, users are still capable of discovering new insights, provided that a range of relevant information is available. The most significant benefit of our system is that it can also perform QA and summarization over more extensive information through natural language queries, even at the level of relations and interactions.

We also compared our system with other text mining systems like DrugCentral 2023 [30], DrugBank [22], DiGSeE [31], and Drugs.com [32]. Figure 14 shows the results from searching for ‘leptin’ in DrugCentral 2023 [30], which identified only two related drugs, ‘metreleptin’ and ‘setmelanotide’. The results provided by DrugCentral are limited. In our system, when querying how each drug is related to leptin, information on ‘metreleptin’ was provided from TTD’s target function documents and a related paper on ‘setmelanotide’ was also found, as shown in Fig. 15. In the case of DrugBank, no interaction for leptin was found.

Fig. 14 Drugs related to ‘leptin’ in DrugCentral 2023. This figure shows the search results from DrugCentral 2023 (https://drugcentral.org/?q=leptin&approval=). Permission to use the screenshot has been granted by the DrugCentral team

Fig. 15 Relations related to ‘leptin’ in our system

Figure 16 shows the results of DiGSeE (disease gene search engine with evidence sentences) [31] regarding the association between ‘leptin (LEP)’ and ‘insulin resistance’, which is a frequently interacting pair in our system. It retrieved only one relevant document. 
DiGSeE identifies biological events such as gene expression, regulation, phosphorylation, localization, and protein catabolism in the development of diseases to understand the associations between diseases and genes. Its performance seems to require further improvement, as it primarily utilized the Turku event extraction system [33] to locate biological events, which achieved an F-measure of 52.86% (precision 58.13% and recall 48.46%).

Fig. 16 CDR comparison. b and c These figures show the search results and retrieved document from DiGSeE (http://210.107.182.61/geneSearch/ Gene Query = lep and Disease Query = Insulin Resistance). Permission to use the screenshot has been granted by the DiGSeE team

Furthermore, to investigate how well our system summarizes in response to questions, we compared its output with interaction descriptions from DrugBank [22], as shown in Fig. 17. As shown in the example, the information on drug interactions provided by DrugBank is concise, often limited to a single sentence and mainly related to increases or decreases in interactions. The description patterns are typically phrases like “A may decrease the excretion rate of B, which could result in a higher serum level,” or “The metabolism of A can be increased when combined with B.” While these descriptions are concise, they do not elaborate on the complex mechanisms of actual drug interactions, for which more detailed information is needed, especially for research or scientific analysis. Additionally, we investigated the interaction between ‘pravastatin’ and ‘paroxetine’ using the interaction checker on Drugs.com [32]. The description is likely intended as medication advice for patients, as follows: Taking pravastatin with paroxetine may increase blood glucose levels, especially in patients with diabetes. Consult your doctor about your medication use. In contrast, our system’s summarized answer further explains the mechanism of how the two drugs interact, detailing the effects on their activity.

Fig. 17 Comparison of interaction description/summarization between ‘Pravastatin’ and ‘Paroxetine’. (a) The figure shows the search results from DRUGBANK online (https://go.drugbank.com/drugs/DB00175 Interaction Drug = paroxetine). Permission to use the screenshot has been granted by the DRUGBANK team

Finally, to evaluate RAG-based summarization, we used the BioASQ Task B datasets [34], excluding the cases where our system answered ‘no papers exist with information that matches your question.’ We used 258 of the 330 questions in the Task11B-GoldenEnriched dataset of BioASQ Task B on Biomedical Semantic QA, without adding the PubMed articles for the task. The task uses benchmark datasets containing development and test questions in English, along with gold standard (reference) answers constructed by a team of biomedical experts. Participants have to respond with relevant concepts, articles, snippets, and RDF triples from designated resources, as well as exact and ‘ideal’ answers. We utilized the ideal answers as references and our system’s answers as candidates.

To assess the text summarization quality for questions, two widely used metrics, BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation), were adopted. The BLEU score measures the similarity between a generated text and a reference text based on the precision of n-grams. A higher BLEU score indicates a higher degree of similarity, reflecting that the generated summary accurately captures the content of the reference summary. ROUGE compares the overlap between the generated summary and the reference summary. ROUGE-1, ROUGE-2, and ROUGE-L measure the overlap of unigrams, bigrams and the longest common subsequences, respectively. Our system demonstrates high QA performance with a ROUGE-1 score of 0.912 (F-score) and a BLEU score of 0.795 using only the 2023 PubMed baseline, as shown in Fig. 18. 
Because the generated information is based solely on the given publications, this suggests that we can, to some extent, prevent the potential harm that could arise from incorrect interpretations of summarized drug associations.

Fig. 18 Summarization performance (our system)
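To make the overlap metrics used in the evaluation concrete, the following is a simplified sketch of ROUGE-N, ROUGE-L and the clipped n-gram precision that BLEU builds on. It is not the evaluation code behind the reported scores: whitespace tokenization, lowercasing, the absence of stemming, and the omission of BLEU's brevity penalty and geometric averaging are simplifying assumptions, and the function names and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_score(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F-score from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & Counter keeps minimum counts
    return f_score(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F-score based on the longest common subsequence."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f_score(lcs_length(cand, ref), len(cand), len(ref))


def clipped_precision(candidate, reference, n=1):
    """Modified n-gram precision, the per-order building block of BLEU."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    return sum((cand & ref).values()) / total if total else 0.0


reference = "leptin is a hormone secreted by fat cells"
candidate = "leptin is a hormone made by fat cells"
print(rouge_n(candidate, reference, 1), rouge_n(candidate, reference, 2))
print(rouge_l(candidate, reference), clipped_precision(candidate, reference, 1))
```

In this toy pair a single substituted word removes one unigram but two bigrams, which shows why ROUGE-2 is typically lower than ROUGE-1 for the same summary.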
