eprintid: 10179572
rev_number: 6
eprint_status: archive
userid: 699
dir: disk0/10/17/95/72
datestamp: 2023-10-25 11:26:41
lastmod: 2023-10-25 11:26:41
status_changed: 2023-10-25 11:26:41
type: article
metadata_visibility: show
sword_depositor: 699
creators_name: Casey, Arlene
creators_name: Davidson, Emma
creators_name: Grover, Claire
creators_name: Tobin, Richard
creators_name: Grivas, Andreas
creators_name: Zhang, Huayu
creators_name: Schrempf, Patrick
creators_name: O'Neil, Alison Q
creators_name: Lee, Liam
creators_name: Walsh, Michael
creators_name: Pellie, Freya
creators_name: Ferguson, Karen
creators_name: Cvoro, Vera
creators_name: Wu, Honghan
creators_name: Whalley, Heather
creators_name: Mair, Grant
creators_name: Whiteley, William
creators_name: Alex, Beatrice
title: Understanding the performance and reliability of NLP tools: a comparison of four NLP tools predicting stroke phenotypes in radiology reports
ispublished: pub
divisions: UCL
divisions: B02
divisions: DD4
keywords: brain radiology, electronic health records, natural language processing, stroke phenotype
note: © 2023 Casey, Davidson, Grover, Tobin, Grivas, Zhang, Schrempf, O’Neil, Lee, Walsh, Pellie, Ferguson, Cvoro, Wu, Whalley, Mair, Whiteley and Alex. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/).
abstract: BACKGROUND: Natural language processing (NLP) has the potential to automate the reading of radiology reports, but there is a need to demonstrate that NLP methods are adaptable and reliable for use in real-world clinical applications. METHODS: We tested the F1 score, precision, and recall to compare NLP tools on a cohort from a study on delirium using images and radiology reports from NHS Fife and a population-based cohort (Generation Scotland) that spans multiple National Health Service health boards. We compared four off-the-shelf rule-based and neural NLP tools (namely, EdIE-R, ALARM+, ESPRESSO, and Sem-EHR) and reported on their performance for three cerebrovascular phenotypes, namely, ischaemic stroke, small vessel disease (SVD), and atrophy. Clinical experts from the EdIE-R team defined phenotypes using labelling techniques established during the development of EdIE-R, in conjunction with an expert researcher who read underlying images. RESULTS: EdIE-R obtained the highest F1 score in both cohorts for ischaemic stroke, ≥93%, followed by ALARM+, ≥87%. The F1 score of ESPRESSO was ≥74%, whilst that of Sem-EHR was ≥66%, although ESPRESSO had the highest precision in both cohorts, 90% and 98%. For F1 scores for SVD, EdIE-R scored ≥98% and ALARM+ ≥90%. ESPRESSO scored lowest with ≥77% and Sem-EHR ≥81%. In NHS Fife, F1 scores for atrophy by EdIE-R and ALARM+ were 99%, dropping in Generation Scotland to 96% for EdIE-R and 91% for ALARM+. Sem-EHR performed lowest for atrophy at 89% in NHS Fife and 73% in Generation Scotland. When comparing NLP tool output with brain image reads using F1 scores, ALARM+ scored 80%, outperforming EdIE-R at 66% in ischaemic stroke. For SVD, EdIE-R performed best, scoring 84%, with Sem-EHR 82%. For atrophy, EdIE-R and both ALARM+ versions were comparable at 80%. CONCLUSIONS: The four NLP tools show varying F1 (and precision/recall) scores across all three phenotypes, although this variation is more apparent for ischaemic stroke.
If NLP tools are to be used in clinical settings, this cannot be performed "out of the box." It is essential to understand the context of their development to assess whether they are suitable for the task at hand or whether further training, re-training, or modification is required to adapt tools to the target task.
date: 2023
date_type: published
official_url: https://doi.org/10.3389/fdgth.2023.1184919
oa_status: green
full_text_type: pub
language: eng
primo: open
primo_central: open_green
verified: verified_manual
elements_id: 2098631
doi: 10.3389/fdgth.2023.1184919
medium: Electronic-eCollection
lyricists_name: Wu, Honghan
lyricists_id: HWWUX46
actors_name: Flynn, Bernadette
actors_id: BFFLY94
actors_role: owner
funding_acknowledgements: 216767/Z/19/Z [Wellcome Trust]; R484/0516 [The Dunhill Medical Trust]; CAF/17/01 [Chief Scientist Office]
full_text_status: public
publication: Frontiers in Digital Health
volume: 5
article_number: 1184919
event_location: Switzerland
citation: Casey, Arlene; Davidson, Emma; Grover, Claire; Tobin, Richard; Grivas, Andreas; Zhang, Huayu; Schrempf, Patrick; O'Neil, Alison Q; Lee, Liam; Walsh, Michael; Pellie, Freya; Ferguson, Karen; Cvoro, Vera; Wu, Honghan; Whalley, Heather; Mair, Grant; Whiteley, William; Alex, Beatrice; (2023) Understanding the performance and reliability of NLP tools: a comparison of four NLP tools predicting stroke phenotypes in radiology reports. Frontiers in Digital Health, 5, Article 1184919. 10.3389/fdgth.2023.1184919 <https://doi.org/10.3389/fdgth.2023.1184919>. Green open access
document_url: https://discovery.ucl.ac.uk/id/eprint/10179572/1/fdgth-05-1184919.pdf