Text summarization for pharmaceutical sciences using hierarchical clustering with a weighted evaluation methodology

Dataset

Our main dataset is in the domain of regulatory requirements, consisting of 227 parts of texts from 38 publicly available regulatory documents33, and parts of texts from 23 PubMed documents34. These documents are primarily used for understanding the current regulatory landscape and proactively making project decisions based on recent regulatory activity, for example, new U.S. Food and Drug Administration guidelines calling for the inclusion of data from particular studies. Due to the size of these regulatory documents, it is helpful for regulatory teams to have informative summaries of them. These informative summaries are also valuable for biologics license application reviewers, who can in turn provide better inputs if they know about current regulatory trends. One benefit of informative summaries of regulatory documents is that they can reduce project timelines by decreasing the number of regulatory queries received from regulators, as each query potentially delays the final approval date. There is also a cost-saving benefit, as it can mean less work responding to regulators. The 227 parts of texts from the 38 regulatory documents are preprocessed and fed through the MedicoVerse system architecture. Finally, the summaries are evaluated and compared along with results from other models.

System architecture

Our approach addresses the challenge of condensing extensive regulatory documents into concise and informative abstractive summaries. The proposed summarizer, MedicoVerse, uses advanced NLP models, hierarchical clustering, summarization techniques, and disease-chemical keyword annotation. A comprehensive breakdown of the MedicoVerse architecture is shown in Fig. 1.

Fig. 1 The MedicoVerse text summarization architecture.

Data collection and pre-processing

Our main dataset consists of 227 parts of texts from 38 publicly available regulatory documents, and parts of texts from 23 PubMed documents. All documents are preprocessed through the elimination of unnecessary characters and newlines, text normalization, and the enforcement of a consistent format. Moreover, the text is segmented into smaller units using a customized delimiter. This segmentation makes subsequent analysis and processing more tractable.

SapBERT word embeddings

To capture the intricacies of the terminology and context, the preprocessed data is fed into the cambridgeltl/SapBERT-from-PubMedBERT-fulltext24,35 model to generate BERT embeddings. Instead of providing a fixed embedding for each word, BERT produces contextualized word embeddings, so that the representation of each word is influenced by its surrounding words in a given sentence. This model has been pre-trained on an extensive corpus of full-text biomedical articles from PubMed36,37 and clinical text data38, enabling it to capture intricate language patterns and domain-specific knowledge present in biomedical literature. This ensures that our summarizer gains the capability to comprehend the specialized language inherent in these documents. The resulting embeddings are extracted from the last hidden state of the model, which contains the contextualized representation of each token in the sequence. The last hidden state refers to the final internal representation generated by the encoder in a sequence-to-sequence model. It encapsulates the key information from the entire input sequence and serves as a condensed representation.
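As an illustration of this embedding step, the following is a minimal sketch using the Hugging Face transformers library; mean pooling over the token vectors of the last hidden state is our assumption, since the text does not specify how token embeddings are aggregated into sentence embeddings.

```python
# Minimal sketch: contextualized sentence embeddings from SapBERT's last hidden state.
# The mean pooling over non-padding tokens is an illustrative choice, not necessarily
# the exact aggregation used by MedicoVerse.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed_sentences(sentences):
    """Return one contextualized embedding vector per input sentence."""
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    last_hidden = outputs.last_hidden_state           # shape: (batch, tokens, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)     # ignore padding positions
    summed = (last_hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts                            # mean-pooled sentence vectors

# Example: two sentences from a preprocessed regulatory text segment.
vectors = embed_sentences([
    "DKA and HHS are associated with severe volume depletion.",
    "Initial fluid therapy is directed toward expansion of intravascular volume.",
])
print(vectors.shape)  # e.g. torch.Size([2, 768])
```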
This last hidden state is crucial in summarization tasks, as it is used as the initial context for the decoder to generate a concise and contextually informed summary of the input text. This process provides a faithful representation of the original text, allowing our summarization tool to create comprehensive summaries that capture the essence of the content.

Hierarchical clustering with agglomerative clustering

The resulting sentence embeddings from the cambridgeltl/SapBERT-from-PubMedBERT-fulltext24,35 model are put through a hierarchical agglomerative clustering algorithm29,30 that works in a bottom-up fashion. This process begins by treating each sentence as an individual cluster. It uses the cosine similarity metric to measure the similarity between sentences and employs the average linkage method to build a hierarchical structure. The pairwise distances form a distance matrix that captures how sentences group together based on their similarity. At each iteration, the pair of closest clusters is merged, reducing the number of clusters by one. This merging continues until all data points belong to a single cluster, forming a hierarchical structure of clusters. Leveraging this technique, we are able to group semantically related sentences into clusters. The resulting dendrogram, shown in Fig. 2, can be cut at different levels to obtain different numbers of clusters. The number of clusters is highly dependent on the complexity of the document.

We arrived at the optimal number of clusters empirically through experimentation. Our refined approach sets the number of clusters to \(\frac{n+1}{3}\), where n is the total number of sentences in the document. This formulation allowed us to obtain clusters, each containing a coherent set of sentences. The heuristic \(\frac{n+1}{3}\) is designed to balance over-segmentation and under-segmentation of the text. Each resulting cluster is ranked by the smallest index among the sentences within that cluster. The rankings of these clusters are used to order the summaries coming from each cluster, yielding a final cohesive summary.

Fig. 2 Hierarchical clustering of a piece of a regulatory document39. Top left: piece of regulatory document; top right: colored sentences corresponding to clusters; bottom left: dendrogram showing the cluster hierarchy of sentences; bottom right: key for colors corresponding to cluster IDs.

Summarization with bart-large-cnn-samsum

The next step of the MedicoVerse solution focuses on generating concise summaries for a set of clustered texts, employing the large language model philschmid/bart-large-cnn-samsum40. This is a pre-trained sequence-to-sequence model that adopts the Transformer architecture12, leveraging self-attention mechanisms to capture comprehensive dependencies between words within a sequence. The bart-large-cnn-samsum model is a Transformer-based text summarization model fine-tuned on the SamSum dataset40, which consists of conversational texts and their corresponding summaries. This dataset features interactions that also include conversations between doctors and patients, discussing diseases, symptoms, and healthcare. Models trained on the SamSum dataset are also known to perform well on MTSamples, a collection of transcribed medical sample reports, and the Medical Information Mart for Intensive Care (MIMIC) dataset41. Each cluster resulting from the agglomerative clustering step is subsequently fed into the model, as sketched below.
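The following is a minimal sketch of the clustering and per-cluster summarization steps, assuming scikit-learn and the transformers summarization pipeline; embed_sentences is the hypothetical helper from the previous sketch, and details such as the rounding of the \(\frac{n+1}{3}\) heuristic and the generation settings are our assumptions.

```python
# Minimal sketch: agglomerative clustering of sentence embeddings followed by
# per-cluster summarization with bart-large-cnn-samsum. Rounding of the
# (n + 1) / 3 heuristic and generation settings are assumptions.
from sklearn.cluster import AgglomerativeClustering
from transformers import pipeline

summarizer = pipeline("summarization", model="philschmid/bart-large-cnn-samsum")

def medicoverse_style_summary(sentences):
    n = len(sentences)
    if n < 3:
        # Too few sentences to cluster; summarize directly.
        return summarizer(" ".join(sentences), truncation=True)[0]["summary_text"]
    k = max(1, (n + 1) // 3)                 # number of clusters per the heuristic
    embeddings = embed_sentences(sentences).numpy()

    # Bottom-up clustering with cosine distance and average linkage.
    # (Older scikit-learn releases use affinity="cosine" instead of metric="cosine".)
    clustering = AgglomerativeClustering(n_clusters=k, metric="cosine", linkage="average")
    labels = clustering.fit_predict(embeddings)

    # Group sentence indices by cluster and rank clusters by their smallest index,
    # so the merged summary follows the order of the original text.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    ordered = sorted(clusters.values(), key=min)

    # Summarize each cluster and merge the cluster summaries in order.
    parts = []
    for indices in ordered:
        chunk = " ".join(sentences[i] for i in indices)
        parts.append(summarizer(chunk, truncation=True)[0]["summary_text"])
    return " ".join(parts)
```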
Summarizing each of these clusters ensures that we capture the context present across the entire document. Finally, MedicoVerse merges the summaries of the clusters according to the cluster order derived from the original sentence indices.

Disease and chemical keyword annotation with SciSpaCy

Within the scope of the current task, one notable application of the SciSpaCy42,43 model en_ner_bc5cdr_md lies in its capability to perform entity annotation by color-coding disease and chemical entities within each text. The value of this functionality is in the identification of relevant entities within the summarized text. By color-coding these entities, our summarization approach offers a visual cue, empowering users to swiftly discern and extract pertinent information concerning diseases and chemicals.

Fig. 3 MedicoVerse summary of a piece of a regulatory document from Fig. 2, highlighting disease and chemical entities.

Quality of clusters and summaries

The top left of Fig. 2 shows an example of a piece of a regulatory document. Its color-coded form at the top right of Fig. 2 depicts the cluster assignment of the corresponding sentences. The hierarchical clustering bins the sentences based on their semantic content. The color coding scheme and corresponding cluster numbers of Fig. 2 can be examined more closely to understand what each cluster comprises.

Cluster 0 sentences discuss the water deficit in DKA and HHS, showing a thematic focus on the conditions' impact on hydration levels.

Cluster 1 sentences cover the initial fluid therapy, its direction, and factors influencing the choice of fluid, indicating a focus on treatment steps.

Cluster 2 sentences elaborate on the specifics of fluid therapy, such as the type of saline and rate of infusion.

Cluster 3 sentences form the largest group and seem to discuss the management of the patient's condition in a more comprehensive manner, covering various aspects of fluid therapy, insulin administration, and their effects.

Cluster 4 has a single sentence that emphasizes the importance of considering urinary losses, a unique aspect not covered in other clusters.

Cluster 5 contains a defining sentence about DKA and HHS, which is likely quite distinct from the operational treatment discussions in the other clusters.

Figure 3 shows the summary of a part of a regulatory document39 from Fig. 2. This summary indicates that key themes such as volume depletion in DKA and HHS, fluid therapy protocols, and the consideration of electrolytes and hydration status have been maintained. Specific details such as the rate and type of saline administration, the goal of therapy, and precautions with insulin administration in hypotensive patients have also been preserved. The summary reduces the text by almost 46%. This suggests that the clustering managed to eliminate redundant information while retaining the essence of the text. Across the 227 pieces of text from the 38 regulatory documents, the summaries showed an average reduction of 37%. The ability to condense content without sacrificing quality or missing vital information showcases the effectiveness of the clustering approach. The summarizer also retains a high level of readability and presents the information in a manner that is accessible to both medical professionals and readers less familiar with the domain. It demonstrates the potential of advanced natural language processing techniques to support researchers in distilling and communicating complex datasets effectively.

Evaluation metrics

A multifaceted approach was employed for a comprehensive evaluation of MedicoVerse. We utilized the ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L) to assess the quality of the generated summaries. ROUGE44,45,46 (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for the automatic evaluation of text summarization and machine translation systems. The ROUGE metrics measure the quality of the generated summary or translation by comparing it to one or more reference summaries or translations. Scores range from 0 to 1, with 1 being the best. There are three main ROUGE variants: ROUGE-1, ROUGE-2, and ROUGE-L. ROUGE-1 compares the unigrams between the generated summary and the reference summary, while ROUGE-2 compares the bigrams. ROUGE-L computes the longest common subsequence between the machine-generated summary and the reference summary. By treating each summary as a sequence of words, ROUGE-L captures sequence matches that reflect sentence-level word order. The ROUGE precision, recall, and F1 scores are computed by comparing the model-generated summaries against the raw text, which serves as the reference for evaluation. In our work, we employed the ROUGE F1 score as the primary evaluation metric for assessing the quality of summaries generated by multiple models, given the absence of ground-truth reference summaries. We also incorporated BERTScore to evaluate the similarity between the generated summary and the reference text by utilizing contextual embeddings from models such as BERT. BERTScore computes precision, recall, and F1 scores based on token-level matches within the embedding space. While the average BERT F1 score and ROUGE F1 score provide a balanced assessment, it is important to acknowledge their sensitivity to the choice of evaluation metric and the potential for bias inherent in automated evaluation methods. To mitigate these limitations and broaden the scope of the evaluation of our models, we used business key performance indicators (KPIs), where subject matter experts identified business entities that were used to measure a model's capacity to capture domain-specific entities.
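For concreteness, the following is a minimal sketch of how the ROUGE and BERTScore components of this evaluation could be computed, assuming the rouge-score and bert-score Python packages; the raw source text serves as the reference, as described above.

```python
# Minimal sketch of the automatic evaluation step, assuming the rouge-score and
# bert-score packages. The raw source text is used as the reference, as described
# in the text; the exact tooling used by the authors is not specified.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate_summary(summary, source_text):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(source_text, summary)      # reference first, prediction second
    precision, recall, f1 = bert_score([summary], [source_text], lang="en")
    return {
        "rouge1_f1": rouge["rouge1"].fmeasure,
        "rouge2_f1": rouge["rouge2"].fmeasure,
        "rougeL_f1": rouge["rougeL"].fmeasure,
        "bert_f1": f1.mean().item(),
    }
```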
As a last measure, we considered a readability index that quantifies the ease with which a piece of text can be understood by its readers. This index provides a numerical value that reflects the complexity of the text's vocabulary, sentence structure, and overall readability. We employed the Flesch Reading Ease32 metric to assess readability, which gauges how easily a document can be comprehended. This metric relies on both the average sentence length and the average number of syllables per word to calculate readability. It evaluates the ease of comprehending a summary after the summarization process has been carried out.

Under these four metrics, we constructed a final score as shown in Eq. (1), where Eq. (2) shows how the ROUGE F1 Score Weighted Average term is computed.

$$\begin{aligned} \text {Final Score} &= 0.2 \times (\text {ROUGE F1 Score Weighted Average}) \\ &\quad + 0.2 \times (\text {BERT F1 Score Average}) \\ &\quad + 0.4 \times (\text {Unique Business KPI}) \\ &\quad + 0.2 \times (\text {Flesch Reading Ease}). \end{aligned}$$
(1)
$$\begin{aligned} \text {ROUGE F1 Score Weighted Average} &= 0.2 \times (\text {ROUGE-1 F1 Score}) \\ &\quad + 0.2 \times (\text {ROUGE-2 F1 Score}) \\ &\quad + 0.6 \times (\text {ROUGE-L F1 Score}). \end{aligned}$$
(2)
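As a concrete reading of these equations, the following is a minimal sketch of the final score computation; the example inputs are illustrative values on a 0 to 1 scale, and rescaling Flesch Reading Ease to that range is our assumption based on the score ranges reported in the evaluation.

```python
# Minimal sketch of Eqs. (1) and (2). All inputs are assumed to lie on a 0-1 scale;
# in particular, the Unique Business KPI comes from subject matter experts, and
# Flesch Reading Ease is assumed to be rescaled to [0, 1].
def rouge_f1_weighted_average(rouge1_f1, rouge2_f1, rougeL_f1):
    # Eq. (2): ROUGE-L carries the largest weight.
    return 0.2 * rouge1_f1 + 0.2 * rouge2_f1 + 0.6 * rougeL_f1

def final_score(rouge_weighted_avg, bert_f1_avg, unique_business_kpi, flesch_reading_ease):
    # Eq. (1): the Unique Business KPI carries the largest weight.
    return (0.2 * rouge_weighted_avg
            + 0.2 * bert_f1_avg
            + 0.4 * unique_business_kpi
            + 0.2 * flesch_reading_ease)

# Example with illustrative (not reported) values:
rw = rouge_f1_weighted_average(0.60, 0.40, 0.55)   # = 0.53
print(final_score(rw, 0.75, 0.70, 0.40))           # ≈ 0.616
```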
This final score of Eq. (1) was used to compare multiple summarization models such as bart-large-cnn-samsum, Facebook BART, Google Pegasus47, T5, Mixtral 8\(\times \)7b instruct, GPT 3.5, and Llama-2-70b, in addition to our own approach. The weights of Eqs. (1) and (2) were determined empirically.

Evaluation

This paper's primary objective is to assess the effectiveness of various summarization approaches and their ability to generate concise and coherent summaries for regulatory and PubMed documents. We introduced a novel scoring technique to evaluate the effectiveness of our approach, integrating four key metrics: ROUGE, BERTScore, Unique Business KPIs, and Flesch Reading Ease. A detailed analysis of the results from our evaluation is presented in performance Tables 1 and 2, together with discussions of key findings and comparisons to existing approaches. We also delve into the implications of our findings and their potential applications in pharmaceutical and biomedical sciences.

Table 1 ROUGE scores for different models on summaries of a part of a regulatory document48.

Table 2 Average scores across all models for ten sampled regulatory document texts.

According to Table 1, the MedicoVerse model demonstrates robust performance across all ROUGE metrics, consistently achieving scores exceeding 0.4. MedicoVerse attains recall scores of (0.57, 0.46, 0.56) and precision scores of (0.86, 0.73, 0.85) for ROUGE-1, ROUGE-2, and ROUGE-L, respectively, exhibiting a better balance between precision and recall than the other models. The ability to strike this balance ensures that the generated summaries contain essential information while avoiding excessive inclusion of irrelevant terms. By contrast, models such as philschmid/bart-large-cnn-samsum, Facebook/bart-large-cnn, Google/pegasus-xsum, and T5 without clustering demonstrate low recall and high precision, with precision consistently surpassing 0.6. The high precision and low recall of these models suggest that their summaries fall short of capturing the most relevant words, phrases, or context from the reference text. Furthermore, the F1 scores of these models are notably low, residing within the range of 0.05 to 0.2, indicating substantially lower overall performance compared to MedicoVerse.

To broaden the scope of our analysis and ensure a more exhaustive comparison, we explore the capabilities of MedicoVerse alongside large language models (LLMs) such as Mixtral 8\(\times \)7b instruct, GPT 3.5, and Llama-2-70b, as illustrated in Table 1. We retained MedicoVerse's architecture but substituted the bart-large-cnn-samsum model with these LLMs. Mixtral 8\(\times \)7b instruct exhibits notable F1-score performance, with values of (0.66, 0.51, 0.66) closely resembling those of MedicoVerse. Additionally, its recall scores (0.52, 0.39, 0.52) and precision scores (0.92, 0.76, 0.92), surpassing the 0.4 threshold, position it as the second-best performing model in our evaluation. Models such as GPT 3.5 and Llama-2-70b also demonstrate noteworthy F1-score performance, with values of (0.60, 0.31, 0.54) and (0.63, 0.42, 0.60), respectively. The recall and precision scores for both models fall within the range of 0.23 to 0.54, with precision values consistently exceeding 0.4.
This finding highlights a favorable balance between recall and precision, indicating that the generated summaries capture the relevant information from the reference text without excessive padding. Table 1 shows that MedicoVerse consistently performs better than the other compared models under the evaluation metrics. MedicoVerse, a unique composition of hierarchical clustering combined with the philschmid/bart-large-cnn-samsum model, demonstrates significant advantages. However, a limitation inherent in the philschmid/bart-large-cnn-samsum model is the persistent inclusion of 1–2% noise in the generated summaries. This constraint is effectively mitigated by Mixtral 8\(\times \)7b instruct, whose summaries are concise and domain-specific. Notably, models like GPT 3.5 and Llama-2-70b yield summaries that are lengthy, easy to comprehend, and tailored to a broader audience.

Fig. 4 Model evaluations by SME score and computational score.

To assess the significance of hierarchical clustering in the summarization process, we conducted evaluations using the Mixtral 8\(\times \)7b instruct, GPT 3.5, and Llama-2-70b models, as shown in Table 1, to generate summaries without clustering as an additional layer. Comparing Mixtral 8\(\times \)7b instruct without and with clustering, we observe an increase in ROUGE-1 scores from 0.15 to 0.52, ROUGE-2 from 0.04 to 0.39, and ROUGE-L from 0.14 to 0.52. This demonstrates the effectiveness of the supplementary clustering layer in capturing essential information and relationships within the raw text. Extending the analysis to GPT 3.5 and Llama-2-70b, when going from not using hierarchical clustering to using it, the F1 scores increase from (0.52, 0.28, 0.51) to (0.60, 0.31, 0.54) and from (0.57, 0.37, 0.53) to (0.63, 0.42, 0.60), respectively. The application of hierarchical clustering consistently results in higher ROUGE scores across the models, reinforcing the general applicability of clustering techniques in the context of text summarization.

The summaries of ten sampled regulatory document texts across all these models were further reviewed by subject matter experts (SMEs), as depicted in Fig. 4. The x-axis of the graph represents the different models, while the y-axis indicates the SME scores and computational scores. The graph illustrates a clear distinction in the scores assigned by SMEs to summaries generated with and without clustering. SMEs reviewed both the summaries and the original text and assigned scores based on their expertise. Specifically, SMEs consistently rated the summaries generated by models with clustering higher than those without. This trend suggests that incorporating clustering techniques in the summarization process contributes to a deeper understanding of the input data, enabling the models to generate more accurate summaries.

Table 2 provides an overview of the performance of eleven distinct models, with each metric averaged across the ten sampled regulatory document texts. Eq. (1) is used to calculate the final score. Table 2 shows that MedicoVerse emerges as the standout model with the highest final score of 0.64, demonstrating its strong overall performance. A more granular analysis of each metric, starting with the ROUGE F1 score weighted average, shows that MedicoVerse attained a score of 0.60, while Mixtral 8\(\times \)7b instruct outperformed it with a score of 0.63, followed closely by Llama-2-70b with a score of 0.61.
Delving deeper into the ROUGE F1 score weighted average across the ten sampled regulatory document texts, it is evident that Mixtral 8\(\times \)7b instruct surpasses MedicoVerse and Llama-2-70b, showcasing its ability to effectively capture the most relevant context from the reference text. This highlights that using hierarchical clustering with Mixtral 8\(\times \)7b instruct is a compelling choice for regulatory document text summarization. In addition to the ROUGE scores, BERT F1 scores provide another layer of evaluation for contextual understanding. Here, Hierarchical Clustering + Mixtral 8\(\times \)7b instruct again leads with a score of 0.77, slightly ahead of both MedicoVerse and Hierarchical Clustering + Llama-2-70b, each scoring 0.76. The Hierarchical Clustering + GPT 3.5 model also performs well with a score of 0.76, indicating strong contextual relevance across these models.

With regard to the average Flesch Reading Ease scores, MedicoVerse achieves the second-highest score of 0.43, suggesting moderately easy-to-understand summaries suitable for a broader audience. It is worth noting that the Flesch Reading Ease scores for the large language models Mixtral 8\(\times \)7b instruct, Llama-2-70b, and GPT 3.5, with and without clustering, fall within the range of 0.18 to 0.29, reflecting highly concise and domain-specific summaries. In contrast, the philschmid/bart-large-cnn-samsum, Facebook/bart-large-cnn, Google/pegasus-xsum, and T5 models without clustering exhibit comparatively higher values of 0.35, 0.39, 0.33, and 0.44, respectively, with average ROUGE scores ranging from 0.1 to 0.32. These scores indicate that while the summaries generated by these models are easy to understand, they fail to capture the key elements of the reference text.

Lastly, MedicoVerse attains a Business KPI score of 0.70, while Llama-2-70b with hierarchical clustering stands out as the top performer among all models with a score of 0.72. Following closely, GPT 3.5 with hierarchical clustering achieves a score of 0.63, and Mixtral 8\(\times \)7b instruct with hierarchical clustering achieves a score of 0.56. Conversely, the philschmid/bart-large-cnn-samsum, Facebook/bart-large-cnn, Google/pegasus-xsum, and T5 models without clustering received significantly lower scores ranging from 0.06 to 0.31, indicating limited alignment with business objectives.

Tables 1 and 2 provide a clear analysis, indicating that MedicoVerse is the most effective model that meets all the business requirements. Its comprehensive performance, coupled with being free to use and generating summaries with relevant context, positions it as a promising option. Among the other models evaluated, Llama-2-70b outperforms most of the benchmarks, demonstrating competitive performance. Despite its scores, however, Llama-2-70b with clustering produces summaries that lack conciseness and are therefore lengthy. As a result of this drawback, Mixtral 8\(\times \)7b instruct with clustering emerges as a compelling option, positioned as the second-best performer.
