Evaluating the use of large language models to provide clinical recommendations in the Emergency Department

The UCSF Information Commons contains structured clinical data and clinical text notes, all de-identified and externally certified as previously described31. The UCSF Institutional Review Board determined that this use of de-identified data within the UCSF Information Commons environment does not constitute human participants research and was therefore exempt from further approval and informed consent.

We identified all adult visits to the University of California, San Francisco (UCSF) Emergency Department (ED) from 2012 to 2023 with an ED Physician note present within Information Commons (Fig. 1). Regular expressions were used to extract the Presenting History (consisting of the ‘Chief Complaint’, ‘History of Presenting Illness’ and ‘Review of Systems’) and Physical Examination sections from each note (Supplementary Information).

We sought to evaluate GPT-3.5-turbo and GPT-4-turbo performance on three binary clinical recommendation tasks, corresponding to the following outcomes: (1) admission status, whether the patient should be admitted from the ED to the hospital; (2) radiological investigation(s) request status, whether an X-ray, US scan, CT scan or MRI scan should be requested during the ED visit; and (3) antibiotic prescription status, whether antibiotics should be ordered during the ED visit.

For each of the three outcomes, we randomly selected a balanced sample of 10,000 ED visits to evaluate LLM performance (Fig. 1). Using a secure, HIPAA-compliant Application Programming Interface (API) accessed through Microsoft Azure, we provided GPT-3.5-turbo (model = ‘gpt-3.5-turbo-0301’, role = ‘user’, temperature = 0; all other settings at default values) and GPT-4-turbo (model = ‘gpt-4-turbo-128k-1106’, role = ‘user’, temperature = 0; all other settings at default values) with only the Presenting History and Physical Examination sections of the ED Physician’s note for each visit, and queried each model to determine whether (1) the patient should be admitted to hospital, (2) the patient requires radiological investigation, and (3) the patient should be prescribed antibiotics. LLM performance was evaluated against the ground-truth outcome extracted from the electronic health record.

Separately, a resident physician with 2 years of postgraduate general medicine training labelled a balanced n = 200 subsample for each of the three tasks to allow a comparison of human and machine performance. As with the LLMs, the physician reviewer was provided with only the Presenting History and Physical Examination sections of the ED Physician’s note for each visit and asked to use their clinical judgement to decide whether the patient should be admitted to the hospital, requires radiological investigation, or should be prescribed antibiotics.
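As a minimal illustration of the section extraction described above, the sketch below matches each section from its header to the next known header. The exact regular expressions used in the study are provided in the Supplementary Information; the header spellings and function names here are assumptions, not the study code.

```python
import re

# Illustrative header spellings; real ED notes may use variants of these labels.
HEADERS = [
    "Chief Complaint",
    "History of Presenting Illness",
    "Review of Systems",
    "Physical Examination",
]
_HEADER_ALT = "|".join(re.escape(h) for h in HEADERS)

# Each section runs from its header to the next known header (or the end of the note).
SECTION_RE = re.compile(
    rf"(?P<header>{_HEADER_ALT})\s*:?\s*(?P<body>.*?)(?=(?:{_HEADER_ALT})\s*:?|\Z)",
    flags=re.IGNORECASE | re.DOTALL,
)

def extract_model_input(note_text: str) -> str:
    """Concatenate the Presenting History and Physical Examination sections of one note."""
    sections = {}
    for m in SECTION_RE.finditer(note_text):
        # Normalise the matched header to its canonical spelling.
        canonical = next(h for h in HEADERS if h.lower() == m.group("header").lower())
        sections[canonical] = m.group("body").strip()
    ordered = [sections.get(h, "") for h in HEADERS]
    return "\n\n".join(part for part in ordered if part)
```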
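The model querying described above can be sketched as follows. The role, temperature and model identifiers follow the settings reported in the Methods, but the client library, endpoint, deployment names, API version and prompt wording are assumptions (the exact prompts are given in Table S1).

```python
import os
from openai import AzureOpenAI  # openai>=1.0; library choice is an assumption, not stated in the paper

# Endpoint, key and API version are placeholders for a secure, HIPAA-compliant Azure resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-12-01-preview",
)

# Illustrative wording only; the study prompts are reproduced in Table S1,
# and the ordering of the 0/1 labels was examined in the sensitivity analysis.
PROMPT_TEMPLATE = (
    "You are reviewing the Presenting History and Physical Examination from an "
    "Emergency Department note.\n\n{note}\n\n"
    "Return 1 if the patient should be admitted to hospital, "
    "0 if the patient should not be admitted to hospital. "
    "Please do not return any additional explanation."
)

def query_model(deployment: str, note_text: str) -> str:
    """Send one note's extracted sections to one model deployment."""
    response = client.chat.completions.create(
        model=deployment,  # e.g. the GPT-3.5-turbo or GPT-4-turbo deployment name
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(note=note_text)}],
        temperature=0,     # as reported; all other settings left at default values
    )
    return response.choices[0].message.content.strip()
```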
We subsequently experimented with three iterations of prompt engineering (Table S1, Supplementary Information) to test whether modifications to the initial prompt could improve LLM performance. Chain-of-thought (CoT) prompting is a method found to improve the ability of LLMs to perform complex reasoning by decomposing multi-step problems into a series of intermediate steps27. This can be done in a zero-shot manner (zero-shot-CoT): LLMs have been shown to be decent zero-shot reasoners when a simple cue, ‘Let’s think step by step’, is added to facilitate step-by-step reasoning before answering each question14. Alternatively, few-shot chain-of-thought prompting can be used, in which additional examples of prompt and answer pairs, constructed either manually (manual CoT) or computationally (e.g., auto-CoT), are concatenated with the prompt of interest27,28. Current understanding of the impact of zero-shot-CoT, manual CoT and auto-CoT prompt engineering techniques applied to clinical text is limited. In this work, we focused on zero-shot-CoT and investigated the effect of adding ‘Let’s think step by step’ to the prompt on model performance.

Our initial prompt (Prompt A) simply asked the LLM to return whether the patient should be, for example, admitted to the hospital, without any additional explanation. We then attempted to engineer prompts to (a) reduce the high false positive rate of LLM recommendations (Prompt B) and (b) examine whether zero-shot chain-of-thought prompting could improve LLM performance (Prompts C and D). To address the high LLM false positive rate, we constructed Prompt B by adding one sentence to Prompt A: ‘Only suggest *clinical recommendation* if absolutely required’. This modification was retained in Prompts C and D, which were constructed to examine chain-of-thought prompting. Because chain-of-thought prompting is most effective when the LLM provides reasoning in its output, we removed the instruction ‘Please do not return any additional explanation’ from Prompts C and D, and added the chain-of-thought cue ‘Let’s think step by step’ to Prompt D, which increased the verbosity of GPT-3.5-turbo, but not GPT-4-turbo, responses (Table S2, Supplementary Information). Prompt C therefore served as a baseline for LLM performance when the model is permitted to return additional explanation alongside its outcome recommendation, allowing comparisons with both Prompt A (where no additional explanation was permitted) and Prompt D (where the effect of chain-of-thought prompting was examined).

To evaluate the performance of both LLMs in a real-world setting, we constructed a random, unbalanced sample of 1000 ED visits in which the distribution of patient outcomes (i.e., admission status, radiological investigation(s) request status and antibiotic prescription status) mirrored that of patients presenting to the ED in our main cohort. The Presenting History and Physical Examination sections of the ED Physician’s note for each visit were again passed to the API in an identical manner to the balanced datasets, while a resident physician was provided with the same sections and asked to manually label the entire sample to allow human versus machine comparison. In addition, an attending emergency medicine physician independently classified 10% of this subsample, with 79% concordance and comparable accuracy between reviewers (Table S3, Supplementary Information).
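The exact wording of Prompts A–D is given in Table S1. The sketch below is a hypothetical reconstruction for the admission task that combines the fragments quoted above with an assumed base instruction, simply to make the A-to-D modifications explicit.

```python
# Hypothetical base instruction; the study's actual prompts are reproduced in Table S1.
BASE = (
    "Based on the following Presenting History and Physical Examination, decide whether "
    "the patient should be admitted to hospital. "
    "Return 1 if the patient should be admitted to hospital, "
    "0 if the patient should not be admitted to hospital."
)
NO_EXPLANATION = "Please do not return any additional explanation."
ONLY_IF_REQUIRED = "Only suggest admission if absolutely required."  # fragment quoted in the Methods
STEP_BY_STEP = "Let's think step by step."                           # zero-shot chain-of-thought cue

PROMPTS = {
    # Prompt A: initial prompt, no explanation permitted.
    "A": f"{BASE} {NO_EXPLANATION}",
    # Prompt B: A plus the restriction intended to reduce false positives.
    "B": f"{BASE} {ONLY_IF_REQUIRED} {NO_EXPLANATION}",
    # Prompt C: as B but explanation permitted (baseline for the chain-of-thought comparison).
    "C": f"{BASE} {ONLY_IF_REQUIRED}",
    # Prompt D: C plus the zero-shot chain-of-thought cue.
    "D": f"{BASE} {ONLY_IF_REQUIRED} {STEP_BY_STEP}",
}

def build_prompt(variant: str, note_text: str) -> str:
    """Concatenate a prompt variant with the note sections passed to the model."""
    return f"{PROMPTS[variant]}\n\n{note_text}"
```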
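Construction of the distribution-matched, real-world sample could be sketched as follows. The Methods state only that the sampled outcome distributions mirrored the main cohort; the data frame, column names and the proportional-stratification approach shown here are assumptions.

```python
import pandas as pd

# `visits` is an assumed data frame of the main cohort, one row per ED visit,
# with binary outcome columns; the column names are illustrative.
OUTCOMES = ["admitted", "radiology_requested", "antibiotics_ordered"]

def draw_real_world_sample(visits: pd.DataFrame, n: int = 1000, seed: int = 0) -> pd.DataFrame:
    """Draw an unbalanced sample whose outcome mix mirrors the main cohort."""
    frac = n / len(visits)
    # Sampling proportionally within each joint outcome combination keeps the prevalence
    # of admission, radiology requests and antibiotic orders close to their cohort values.
    sample = (
        visits.groupby(OUTCOMES, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )
    # Top up after proportional rounding, then shuffle/trim to exactly n rows.
    if len(sample) < n:
        remainder = visits.drop(sample.index).sample(n - len(sample), random_state=seed)
        sample = pd.concat([sample, remainder])
    return sample.sample(n, random_state=seed)
```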
Sensitivity analysis

Due to the stochastic nature of LLMs, it is possible that the order of the labels presented in the original prompt may affect the labels subsequently returned. To test this, we conducted a sensitivity analysis on a balanced n = 200 subsample for each outcome in which the positive outcome was referenced before the negative outcome in the initial prompt (e.g., ‘1: Patient should be admitted to hospital’ precedes ‘0: Patient should not be admitted to hospital’ in the GPT-3.5-turbo prompt).

Statistical analysis

To assess model performance on the balanced datasets, the following evaluation metrics were calculated: true positive rate, true negative rate, false positive rate, false negative rate, sensitivity and specificity. For the unbalanced dataset, classification accuracy was calculated in addition to the aforementioned evaluation metrics to provide a summative measure for this real-world simulated task. 95% confidence intervals were calculated by bootstrapping 1000 times, with replacement (sketched below). All analyses were conducted in Python, version 3.11.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
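The evaluation metrics and bootstrap confidence intervals described under ‘Statistical analysis’ are sketched below, assuming ground-truth and predicted labels encoded as 0/1 arrays. Function names are illustrative, and percentile intervals are one common choice for bootstrapped 95% confidence intervals; the study's exact implementation may differ.

```python
import numpy as np

def metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Confusion-matrix summaries used in the study (sensitivity = TPR, specificity = TNR)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "accuracy": (tp + tn) / len(y_true),  # reported for the unbalanced, real-world sample
    }

def bootstrap_ci(y_true, y_pred, n_boot: int = 1000, seed: int = 0) -> dict:
    """Percentile 95% confidence intervals from 1000 bootstrap resamples with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample visits with replacement
        draws.append(metrics(y_true[idx], y_pred[idx]))
    return {
        name: (
            float(np.percentile([d[name] for d in draws], 2.5)),
            float(np.percentile([d[name] for d in draws], 97.5)),
        )
        for name in draws[0]
    }
```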
