UCL Discovery

Transformers and large language models are efficient feature extractors for electronic health record studies

Yuan, Kevin; Yoon, Chang Ho; Gu, Qingze; Munby, Henry; Walker, Ann; Zhu, Tingting; Eyre, David W; (2025) Transformers and large language models are efficient feature extractors for electronic health record studies. Communications Medicine, 5, Article 83. 10.1038/s43856-025-00790-1. Green open access

Abstract

Background: Free-text data is abundant in electronic health records, but challenges in accurate and scalable information extraction mean that less specific clinical codes are often used instead.

Methods: We evaluated the efficacy of feature extraction using modern natural language processing (NLP) methods and large language models (LLMs) on 938,150 hospital antibiotic prescriptions from Oxfordshire, UK. Specifically, we investigated inferring the type(s) of infection from a free-text “indication” field, where clinicians state the reason for prescribing antibiotics. Clinical researchers labelled a subset of the 4000 most frequent unique indications (representing 692,310 prescriptions) into 11 categories describing the infection source or clinical syndrome. Various models were then trained to determine the binary presence/absence of these infection types and also any uncertainty expressed by clinicians.

Results: We show that, on separate internal (n = 2000 prescriptions) and external (n = 2000 prescriptions) test datasets, a fine-tuned domain-specific Bio+Clinical BERT model performs best across the 11 categories (average F1 score 0.97 and 0.98 respectively) and outperforms traditional regular expression (F1 = 0.71 and 0.74) and n-grams/XGBoost (F1 = 0.86 and 0.84) models. A zero-shot OpenAI GPT4 model matches the performance of traditional NLP models without the need for labelled training data (F1 = 0.71 and 0.86), and a fine-tuned GPT3.5 model achieves similar performance to the fine-tuned BERT-based model (F1 = 0.95 and 0.97). Infection sources obtained from free-text indications reveal specific infection sources 31% more often than ICD-10 codes.

Conclusions: Modern transformer-based models have the potential to be used widely throughout medicine to extract information from structured free-text records, to facilitate better research and patient care.
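To make the evaluation setup concrete, the following is a minimal sketch of a regular-expression baseline that assigns binary labels to a free-text indication, together with an F1 score averaged across categories. It is not the authors' implementation: the category names, the patterns, and the function names (label_indication, f1_score, macro_f1) are hypothetical, only three illustrative labels stand in for the study's 11 categories, and unweighted (macro) averaging is an assumption about how the "average F1 score" is computed.

```python
import re

# Hypothetical categories and patterns, for illustration only. The study's
# 11 categories and its actual regular expressions are not reproduced here.
PATTERNS = {
    "urinary": re.compile(r"\b(uti|urosepsis|pyelonephritis)\b", re.IGNORECASE),
    "respiratory": re.compile(r"\b(cap|hap|lrti|pneumonia)\b", re.IGNORECASE),
    # Uncertainty expressed by the prescriber, e.g. "?UTI" or "possible pneumonia".
    "uncertainty": re.compile(r"\?|\b(possible|query|suspected)\b", re.IGNORECASE),
}


def label_indication(text):
    """Return binary labels (1 = present, 0 = absent) for one indication string."""
    return {name: int(bool(pattern.search(text))) for name, pattern in PATTERNS.items()}


def f1_score(y_true, y_pred):
    """F1 for one binary label: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(true_rows, pred_rows):
    """Unweighted mean of the per-category F1 scores."""
    categories = list(PATTERNS)
    scores = [
        f1_score([row[c] for row in true_rows], [row[c] for row in pred_rows])
        for c in categories
    ]
    return sum(scores) / len(scores)
```

A call such as label_indication("?UTI") flags both the urinary category and uncertainty. Comparing such predictions against clinician-assigned labels with macro_f1 gives a single summary figure of the kind reported for each model, under the averaging assumption stated above.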

Type: Article
Title: Transformers and large language models are efficient feature extractors for electronic health record studies
Open access status: An open access version is available from UCL Discovery
DOI: 10.1038/s43856-025-00790-1
Publisher version: https://doi.org/10.1038/s43856-025-00790-1
Language: English
Additional information: Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Keywords: Computational biology and bioinformatics, Health services, Infectious diseases, Medical research
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Inst of Clinical Trials and Methodology
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Inst of Clinical Trials and Methodology > MRC Clinical Trials Unit at UCL
URI: https://discovery.ucl.ac.uk/id/eprint/10205587