Milad, Daniel;
Antaki, Fares;
Milad, Jason;
Farah, Andrew;
Khairy, Thomas;
Mikhail, David;
Giguère, Charles-Édouard;
... Duval, Renaud;
(2024)
Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases.
British Journal of Ophthalmology
10.1136/bjo-2023-325053.
(In press).
Text: Antaki_ChatGPT JAMA Opht revision-final clean.pdf (245kB)
Abstract
Background/aims: This study assesses the proficiency of Generative Pre-trained Transformer (GPT)-4 in answering questions about complex clinical ophthalmology cases.

Methods: We tested GPT-4 on 422 Journal of the American Medical Association Ophthalmology Clinical Challenges, prompting the model to determine the diagnosis (open-ended question) and identify the next step (multiple-choice question). We generated responses using two zero-shot prompting strategies, including zero-shot plan-and-solve+ (PS+), to improve the model's reasoning. We compared the best-performing model to human graders in a benchmarking effort.

Results: Using PS+ prompting, GPT-4 achieved mean accuracies of 48.0% (95% CI 43.1% to 52.9%) for diagnosis and 63.0% (95% CI 58.2% to 67.6%) for the next step. Next-step accuracy did not differ significantly by subspecialty (p=0.44), but diagnostic accuracy in pathology and tumours was significantly higher than in uveitis (p=0.027). When the diagnosis was accurate, 75.2% (95% CI 68.6% to 80.9%) of the next steps were correct; when the diagnosis was incorrect, 50.2% (95% CI 43.8% to 56.6%) of the next steps were accurate. The next step was three times more likely to be accurate when the initial diagnosis was correct (p<0.001). No significant differences were observed in diagnostic accuracy and decision-making between board-certified ophthalmologists and GPT-4. Among trainees, senior residents outperformed GPT-4 in diagnostic accuracy (p≤0.001 and p=0.049) and in next-step accuracy (p=0.002 and p=0.020).

Conclusion: Improved prompting enhances GPT-4's performance in complex clinical situations, although it does not surpass ophthalmology trainees in our context. Specialised large language models hold promise for future assistance in medical decision-making and diagnosis.
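For readers curious how zero-shot plan-and-solve+ (PS+) prompting might look in practice, below is a minimal sketch using the OpenAI Python client. The prompt wording, model settings, and case text are illustrative assumptions (the PS+ trigger is adapted from the general plan-and-solve literature), not the study's actual protocol.

```python
# Minimal sketch of zero-shot PS+ style prompting via the OpenAI chat
# completions API. The trigger phrase and case text below are hypothetical
# placeholders, not the prompts used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# PS+ style trigger (adapted): ask the model to extract findings and plan
# its reasoning before committing to an answer.
PS_PLUS_TRIGGER = (
    "Let's first understand the case and extract the relevant clinical "
    "findings. Then let's devise a plan to reach a diagnosis, carry out "
    "the plan step by step, and state the most likely diagnosis."
)

def diagnose(case_text: str) -> str:
    """Send one clinical vignette to GPT-4 with a PS+ style prompt."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # near-deterministic output for benchmarking runs
        messages=[
            {"role": "user", "content": f"{case_text}\n\n{PS_PLUS_TRIGGER}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Hypothetical vignette; the study used JAMA Ophthalmology Clinical Challenges.
    print(diagnose("A 65-year-old presents with sudden painless vision loss ..."))
```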
Type: | Article |
---|---|
Title: | Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases |
Location: | England |
Open access status: | An open access version is available from UCL Discovery |
DOI: | 10.1136/bjo-2023-325053 |
Publisher version: | http://dx.doi.org/10.1136/bjo-2023-325053 |
Language: | English |
Additional information: | This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions. |
UCL classification: | UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > Institute of Ophthalmology |
URI: | https://discovery.ucl.ac.uk/id/eprint/10189698 |