Evaluating prompt engineering on GPT-3.5’s performance in USMLE-style medical calculations and clinical scenarios generated by GPT-4

Study design
We evaluated ChatGPT (GPT-3.5-turbo) using three prompt strategies: direct, chain of thought (CoT), and a modified CoT14. Our analysis covered 95 USMLE Step 1 multiple-choice questions15. We also added two sets of questions we created with GPT-4: one set of medical calculation questions and another of clinical case questions. The architecture and workflow of this experiment are detailed in Fig. 1. Analyses were performed during July 2023.

Figure 1. The figure illustrates a multi-step process in which GPT-4 generates 1000 USMLE-style medical questions, both calculation and non-calculation, and GPT-3.5-turbo answers them using three prompting strategies: direct, CoT, and modified CoT. The generated questions span 19 clinical fields and various medical topics, and the model's answers aim to mimic human problem-solving behavior, enhancing reasoning ability and clarity in its responses.

Question generation
We used GPT-4 to create 1000 questions in USMLE style11,16. We split them evenly into two groups: 500 calculation-based and 500 non-calculation-based. The non-calculation set spanned diagnoses, treatment plans, lab test readings, disease courses, pathophysiology, public health, and preventive care. The calculation set included tasks such as medication dose calculations, clinical scores, diagnostic math, and statistical evaluations. GPT-4 also rated each question by difficulty (easy, medium, or hard) and assigned it a medical field, covering 19 specialties such as Internal Medicine, Pediatrics, Psychiatry, and Surgery.

For generating questions, the prompt was: "Dear GPT-4, we are conducting research on prompt engineering and your help is needed to generate a high-quality medical question. This question should meet the following criteria, and the result should be returned in a JSON format with the following headers", where:

Clinical field: one of 19 clinical fields (internal medicine, surgery, etc.).

Broad type: either calculation or non-calculation.

Subtypes:

Calculation questions: Drug dosage calculations, Clinical score calculations, Diagnostic test calculations, Statistical data interpretation.

Non-calculation questions: Diagnosis based on symptoms, Treatment selection, Interpretation of lab results, Disease progression and prognosis, Pathophysiology questions, Public health and preventive medicine questions.

Difficulty level: easy, medium, or hard.
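To make the generation step concrete, below is a minimal Python sketch of how such a request could be sent to GPT-4 through the OpenAI chat completions API (as available in July 2023). The JSON field names and the schema portion of the prompt are illustrative assumptions, not the authors' exact wording:

    # Illustrative sketch only: field names and schema wording are assumptions,
    # not the study's verbatim generation prompt.
    import json
    import openai

    openai.api_key = "YOUR_API_KEY"

    GENERATION_PROMPT = (
        "Dear GPT-4, we are conducting research on prompt engineering and your help "
        "is needed to generate a high-quality medical question. "
        "Return the result as JSON with the headers: clinical_field, broad_type, "
        "subtype, difficulty_level, question, answer_choices, correct_answer."
    )

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": GENERATION_PROMPT}],
    )

    # Assumes the model returns valid JSON, as requested in the prompt.
    question = json.loads(response["choices"][0]["message"]["content"])
    print(question["clinical_field"], question["difficulty_level"])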

Question answering—prompt engineering
To query GPT-3.5, we used three prompting strategies:

The “direct prompt” strategy simply instructed the model to “answer the question.”

The “CoT” strategy guided the model to “reason step by step and answer the question.”

The “modified CoT” strategy directed the model to “read the problem carefully, break it down, devise a strategy for solving it, check each step for accuracy, and clearly and concisely convey your reasoning leading to your final answer.” This approach sought to mimic human problem-solving behavior, with the aim of enhancing the model’s reasoning ability while promoting clarity and precision in its responses.
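For concreteness, below is a minimal Python sketch of how these three prompt variants might be prepended to a question and submitted to gpt-3.5-turbo through the OpenAI chat completions API. The prompt strings paraphrase the descriptions above rather than reproducing the authors' verbatim text, and the request parameters mirror those reported in the next paragraph:

    # Illustrative sketch only; prompt wording paraphrases the strategies above.
    import openai

    openai.api_key = "YOUR_API_KEY"

    PROMPTS = {
        "direct": "Answer the question.",
        "cot": "Reason step by step and answer the question.",
        "modified_cot": (
            "Read the problem carefully, break it down, devise a strategy for "
            "solving it, check each step for accuracy, and clearly and concisely "
            "convey your reasoning leading to your final answer."
        ),
    }

    def ask(question_text: str, strategy: str) -> str:
        """Send one question to GPT-3.5-turbo under the chosen prompting strategy."""
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": f"{PROMPTS[strategy]}\n\n{question_text}"}],
            temperature=0.5,   # default temperature reported in the study
            max_tokens=700,    # maximum token length reported in the study
        )
        return response["choices"][0]["message"]["content"]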

All prompts were submitted through the OpenAI API using the default temperature (0.5) and a maximum token length of 700, with the prompt corresponding to one of the three instructions quoted above (direct, CoT, or modified CoT).

Human validation
Two emergency room attending physicians independently evaluated the first 50 questions generated by GPT-4 for appropriateness, type, subtype, difficulty level, clinical field, and correctness of the answer. Each aspect was reviewed blindly, and the assessments were quantified. To clearly delineate the agreement calculations, each evaluator's judgments were compared against the features of the GPT-4-generated questions, for example, appropriateness, type, and difficulty level. The percentage agreement for each evaluator was calculated as the ratio of matches (e.g., agreement with the GPT-4 difficulty-level assignment) to the total number of questions evaluated. We further analyzed the inter-rater reliability between the evaluators using Cohen's Kappa to compare their levels of agreement.

Evaluation
The main metric of our evaluation was the accuracy of GPT-3.5 and GPT-4 answers; a comparison of their features is provided in Table 1. In addition, we ran further analyses of the questions by difficulty level, question type, and medical specialty. This helped us get a full picture of ChatGPT's capabilities.

Table 1. Comparison of features between GPT-3.5 and GPT-4.

Statistical analysis
Statistical analyses were executed using Python version 3.9.16. Agreement between the human reviewers was statistically analyzed using Cohen's Kappa to measure inter-rater reliability. We used the Chi-square test to examine the relationship between prompt types and response accuracy. A p-value of less than 0.05 was considered statistically significant.
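As an illustration of the statistical workflow described above, the following Python sketch computes Cohen's Kappa with scikit-learn and a Chi-square test of prompt type versus response accuracy with SciPy. The labels and counts shown are placeholders, not the study's data:

    # Illustrative sketch of the reported analyses; all values are placeholders.
    import numpy as np
    from scipy.stats import chi2_contingency
    from sklearn.metrics import cohen_kappa_score

    # Inter-rater reliability: the two reviewers' categorical judgments on the
    # same questions (e.g., assigned difficulty levels).
    reviewer_a = ["easy", "medium", "hard", "medium", "easy"]
    reviewer_b = ["easy", "medium", "medium", "medium", "easy"]
    kappa = cohen_kappa_score(reviewer_a, reviewer_b)

    # Association between prompt type and response accuracy:
    # rows = prompt strategies, columns = (correct, incorrect) answer counts.
    contingency = np.array([
        [300, 200],   # direct
        [330, 170],   # CoT
        [350, 150],   # modified CoT
    ])
    chi2, p_value, dof, expected = chi2_contingency(contingency)

    print(f"Cohen's kappa: {kappa:.2f}")
    print(f"Chi-square: {chi2:.2f}, p = {p_value:.3f} (significant if p < 0.05)")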
