UCL Discovery

Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases

Milad, Daniel; Antaki, Fares; Milad, Jason; Farah, Andrew; Khairy, Thomas; Mikhail, David; Giguère, Charles-Édouard; ... Duval, Renaud (2024) Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases. British Journal of Ophthalmology. 10.1136/bjo-2023-325053. (In press). Green open access.


Abstract

Background/aims: This study assesses the proficiency of Generative Pre-trained Transformer (GPT)-4 in answering questions about complex clinical ophthalmology cases.

Methods: We tested GPT-4 on 422 Journal of the American Medical Association Ophthalmology Clinical Challenges, prompting the model to determine the diagnosis (open-ended question) and identify the next step (multiple-choice question). We generated responses using two zero-shot prompting strategies, including zero-shot plan-and-solve+ (PS+), to improve the model's reasoning. We compared the best-performing model to human graders in a benchmarking effort.

Results: Using PS+ prompting, GPT-4 achieved mean accuracies of 48.0% (95% CI 43.1% to 52.9%) for diagnosis and 63.0% (95% CI 58.2% to 67.6%) for the next step. Next-step accuracy did not differ significantly by subspecialty (p=0.44). However, diagnostic accuracy in pathology and tumours was significantly higher than in uveitis (p=0.027). When the diagnosis was accurate, 75.2% (95% CI 68.6% to 80.9%) of the next steps were correct. Conversely, when the diagnosis was incorrect, 50.2% (95% CI 43.8% to 56.6%) of the next steps were accurate. The next step was three times more likely to be accurate when the initial diagnosis was correct (p<0.001). No significant differences were observed in diagnostic accuracy and decision-making between board-certified ophthalmologists and GPT-4. Among trainees, senior residents outperformed GPT-4 in diagnostic accuracy (p≤0.001 and p=0.049) and in accuracy of the next step (p=0.002 and p=0.020).

Conclusion: Improved prompting enhances GPT-4's performance in complex clinical situations, although it does not surpass ophthalmology trainees in our context. Specialised large language models hold promise for future assistance in medical decision-making and diagnosis.
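The abstract does not state how the 95% confidence intervals were computed; as an illustration only, a normal-approximation (Wald) interval for the reported 48.0% diagnostic accuracy over 422 cases reproduces figures close to those above. The correct count (203) is an assumption back-derived from the reported percentage; the authors may have used a different interval method (e.g. Wilson or bootstrap).

```python
from math import sqrt

def wald_ci(successes: int, n: int, z: float = 1.96):
    """Normal-approximation (Wald) 95% CI for a binomial proportion."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)  # standard error of the proportion
    return p, p - z * se, p + z * se

# 48.0% diagnostic accuracy over 422 cases; 203 correct is an assumption
# derived from round(0.48 * 422), not a figure stated in the abstract.
p, lo, hi = wald_ci(203, 422)
print(f"{p:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

The resulting interval (roughly 43% to 53%) is consistent with the reported 43.1% to 52.9%, suggesting a standard proportion interval on the per-case accuracy.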

Type: Article
Title: Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases
Location: England
Open access status: An open access version is available from UCL Discovery
DOI: 10.1136/bjo-2023-325053
Publisher version: http://dx.doi.org/10.1136/bjo-2023-325053
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > Institute of Ophthalmology
URI: https://discovery.ucl.ac.uk/id/eprint/10189698
