Pikoula, Maria; Quint, Jennifer K; Kallis, Constantinos; Henry, Albert; Denaxas, Spiros (2025) Identification of clinically meaningful, overlapping obstructive respiratory disease subtypes via data-driven approaches in a primary care population. BMC Pulmonary Medicine, 25, Article 487. 10.1186/s12890-025-03953-x.
Full text: Accepted Version (PDF, 3MB)
Abstract
Background: Obstructive respiratory conditions, including asthma, bronchiectasis, and chronic obstructive pulmonary disease (COPD), are increasingly recognised as heterogeneous syndromes with significant overlap. Multiple disease pathways contribute to phenotypes that do not always align with textbook definitions, limiting the effectiveness of a one-size-fits-all approach. This study aims to identify, validate, and characterise clinically meaningful airway disease subtypes using electronic healthcare records (EHR) and unsupervised machine learning clustering techniques.

Methods: We applied k-means clustering to 626,651 patients with a diagnosis of asthma, bronchiectasis, or COPD, using linked national structured EHRs in England. Twenty-one clinical features, including risk factors and comorbidities, were analysed, with dimensionality reduction via principal component and multiple correspondence analyses. Associations between cluster membership and exacerbations, as well as respiratory and cardiovascular mortality, were assessed. Over 3,696,962 person-years of follow-up, 102,522 deaths were recorded. Cluster stability was evaluated after five years, and genome-wide association studies (GWAS) were conducted to explore genetic associations with cluster membership.

Results: Seven clusters were identified, each encompassing patients across traditional diagnostic labels. Distinct clinical patterns emerged as follows: (1) High BMI female-predominant, (2) Older male-predominant with diabetes and cardiovascular disease, (3) Eosinophilic atopic, (4) Older non-comorbid, (5) Non-comorbid low BMI, (6) Neutrophilic smoker, (7) Anxious/depressed female-predominant. The cluster with cardiovascular comorbidities showed the highest rates of hospital admissions for exacerbations. Neutrophilic cluster 6 is a potential novel subtype marked by persistent neutrophilia and poor outcomes. Cluster stability over five years ranged from 38% to 78%. GWAS revealed significant genetic loci in a cluster enriched for allergic disease and eosinophilia, suggesting shared genetic mechanisms.

Conclusions: This study provides a data-driven dissection of the heterogeneity underlying obstructive airway diseases in a large, real-world population. Unsupervised machine learning applied to national-scale EHR data revealed distinct and partially stable subtypes that transcend conventional diagnostic boundaries. These findings highlight the complexity and overlap of airway disease phenotypes and demonstrate the value of clustering approaches for uncovering clinically and biologically meaningful subgroups. This work lays the foundation for further exploration into mechanisms and prognosis within and across airway disease phenotypes.
| Type: | Article |
|---|---|
| Title: | Identification of clinically meaningful, overlapping obstructive respiratory disease subtypes via data-driven approaches in a primary care population |
| Location: | England |
| Open access status: | An open access version is available from UCL Discovery |
| DOI: | 10.1186/s12890-025-03953-x |
| Publisher version: | https://doi.org/10.1186/s12890-025-03953-x |
| Language: | English |
| Additional information: | This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. |
| Keywords: | Asthma, Bronchiectasis, Chronic Obstructive Pulmonary Disease, Cluster Analysis, Electronic Health Records, CALIBER, Machine Learning, Genome-wide association studies, UK Biobank |
| UCL classification: | UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Institute of Health Informatics > Clinical Epidemiology |
| URI: | https://discovery.ucl.ac.uk/id/eprint/10216963 |

