UCL Discovery

Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy

Cheong, KX; Zhang, C; Tan, TE; Fenner, BJ; Wong, WM; Teo, KYC; Wang, YX; ... Tham, YC (2024) Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy. British Journal of Ophthalmology. doi: 10.1136/bjo-2023-324533. (In press). Green open access

Text: Keane_Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy_AAM.pdf
Download (244kB)

Abstract

BACKGROUND/AIMS: To compare the performance of generative versus retrieval-based chatbots in answering patient questions regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR).

METHODS: We evaluated four chatbots in a cross-sectional study: three generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and a retrieval-based model (OcularBERT). Their accuracy in responding to 45 questions (15 on AMD, 15 on DR and 15 on other topics) was evaluated and compared. Three masked retinal specialists graded each response on a three-point Likert scale: 2 (good, error-free), 1 (borderline) or 0 (poor, with significant inaccuracies). The three scores were summed, giving an aggregate score from 0 to 6. Based on majority consensus among the graders, each response was also classified as ‘Good’, ‘Borderline’ or ‘Poor’ quality.

RESULTS: Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median scores (IQR) of 6 (1), compared with 4.5 (2) for Google Bard and 2 (1) for OcularBERT (all p≤8.4×10⁻³). Under the consensus approach, 83.3% of ChatGPT-4’s responses and 86.7% of ChatGPT-3.5’s responses were rated ‘Good’, surpassing Google Bard (50%) and OcularBERT (10%) (all p≤1.4×10⁻²). ChatGPT-4 and ChatGPT-3.5 had no responses rated ‘Poor’, whereas Google Bard produced 6.7% and OcularBERT 20%. Across question types, ChatGPT-4 outperformed Google Bard only for AMD questions, while ChatGPT-3.5 outperformed Google Bard for DR and other questions.

CONCLUSION: ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required prior to real-world implementation.
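To make the grading scheme concrete, the following is a minimal Python sketch of the score aggregation and majority-consensus classification described in the methods. It is an illustration under stated assumptions, not the authors' code: the function names, the two-of-three majority rule and the handling of three-way splits are assumptions made here for clarity.

# Minimal sketch (assumptions noted, not the study's actual code) of the
# grading scheme described in the abstract: three masked graders each
# assign 2 (good, error-free), 1 (borderline) or 0 (poor); the aggregate
# score is their sum (0-6), and the quality label is the grade that a
# majority of graders agree on.

LABELS = {2: "Good", 1: "Borderline", 0: "Poor"}

def aggregate_score(grades: list[int]) -> int:
    """Sum the three graders' Likert scores into a 0-6 aggregate."""
    assert len(grades) == 3 and all(g in LABELS for g in grades)
    return sum(grades)

def consensus_label(grades: list[int]) -> str | None:
    """Return the label at least two of the three graders agree on."""
    for grade in set(grades):
        if grades.count(grade) >= 2:
            return LABELS[grade]
    return None  # three-way split; resolution is not specified in the abstract

# Example: two graders rate a response 'good', one 'borderline'.
print(aggregate_score([2, 2, 1]))   # 5
print(consensus_label([2, 2, 1]))   # Good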

Type: Article
Title: Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy
Location: England
Open access status: An open access version is available from UCL Discovery
DOI: 10.1136/bjo-2023-324533
Publisher version: http://dx.doi.org/10.1136/bjo-2023-324533
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher's terms and conditions.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > Institute of Ophthalmology
URI: https://discovery.ucl.ac.uk/id/eprint/10193124