UCL Discovery

Improving the processes of linking and analysing longitudinal population data to promote ethnic health equity

Lam, Joseph; (2025) Improving the processes of linking and analysing longitudinal population data to promote ethnic health equity. Doctoral thesis (Ph.D), UCL (University College London).

Lam_10215745_thesis_Redacted.pdf
Access restricted to UCL open access staff until 1 November 2026.

Download (64MB)

Abstract

Background: Ethnic health inequities arise from avoidable and systemic differences in both social determinants of health and individuals’ interactions with healthcare services. Linked administrative data have become critical to shaping health policy, but this carries a heightened responsibility to scrutinise data quality, linkage methods, and the constructs measured. Systematic errors, such as linkage biases, can disproportionately exclude marginalised groups.

Aim: To challenge structural biases embedded in data system design, linkage and analytical practices by proposing novel data linkage and processing methods that promote equitable health research processes for minoritised, marginalised and missed populations.

Methods: I undertook five interconnected methodological investigations across UK and US datasets.
i) Ethnicity recording and use in UK health research: a bibliographical review of 44 papers and three focus groups with young migrants and refugees.
ii) Cohort-administrative health data linkage: evaluation of linkage quality and inclusiveness for 19 longitudinal population cohorts with administrative health data.
iii) Synthetic data for linkage evaluation: development of a novel data linkage and evaluation framework using high-fidelity synthetic datasets based on the Avon Longitudinal Study of Parents And Children.
iv) Ethnic-based linkage bias mechanisms: analysis of name structures; corruption experiments on 8.7 million records in a US voter registry; and assessment of string-matching algorithms, including non-Latin-based name processing. I also proposed a novel name-feature-based linkage model and demonstrated its added value compared with probabilistic and term-frequency-adjusted models.
v) Ethnicity aggregation and intersectionality: empirical demonstration of aggregation effects in an inter-categorical intersectional framework using the Evidence for Equality National Survey.

Results:
i) Fewer than 30% of included studies provided any justification for aggregating ethnicity beyond recommended categories; only 25% provided any theoretical justification for including ethnicity in the analytical model. Focus groups highlighted the non-interchangeability of migration status and ethnicity.
ii) Across 16 UK longitudinal population studies linked to NHS data (n = 228,531), overall linkage rates were high for both the explicit consent (90.1%) and Section 251 (91.6%) routes. In the Section 251 group, however, linkage fell to 65–69.9% for Black participants and 80–84.9% for participants of Mixed ethnicity, compared with >95% for White participants; no such disparity was observed in the consent group. Agreement between cohort and administrative records ranged from ~55–59% for vaccination status to >95% for cancer diagnosis, but was based on an evaluable subsample that excluded most younger and ethnic minority participants. This underscores the need for disaggregated linkage quality metrics, transparent methods, and linkage algorithms that minimise differential error across groups.
iii) High-fidelity synthetic data closely reproduced linkage performance metrics, with missed match rates within 0.13–0.55% and false match rates within 0.00–0.04% of those observed in the original data. Importantly, incorporating associations between identifier errors and attributes such as maternal age and ethnicity enhanced the utility of the synthetic datasets.
iv) Name bias mechanisms: I demonstrated race-based bias in name characteristics and error distribution, which directly affects linkage rates; the linkage false negative rate was lowest in the Non-Hispanic White group. In the most realistic data corruption scenario, the proposed name-feature linkage model produced the most equitable linkage outputs, minimising the difference in false negative rate between all other racial groups and the Non-Hispanic White group.
v) Using an intercategorical intersectionality approach, I demonstrated that detailed 21-category ethnicity classifications revealed subgroup patterns of reported racism that were obscured in the standard 5-category approach, while larger intersectional effects observed in the coarser model may be spurious. This highlights the risk that broad groupings can both mask inequities and misrepresent interaction effects.

Conclusions: Addressing ethnic health inequities in research requires scrutiny of the entire data pipeline, from recording to analysis. Using empirical evaluations of name-based linkage bias, synthetic corruption experiments, and analyses of ethnic variation in name structures, this thesis shows how standard linkage models can disadvantage minoritised ethnic groups. It demonstrates mitigations such as term-frequency adjustments, principal component–based name features, and the use of high-fidelity synthetic data for evaluation in the absence of gold-standard data, with attention to corruption fairness. It recommends bias audits in linkages, adoption of flexible, transparent tools and frameworks, and investment in methodological expertise and equity-focused governance.
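
The disaggregated linkage quality metric discussed in the abstract (the false negative, or missed match, rate per group and its difference from a reference group) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the record layout `(group, is_true_match, was_linked)`, and the toy data are all hypothetical, and the sketch assumes that true match status is known, for example from gold-standard or synthetic data.

```python
from collections import defaultdict


def false_negative_rates(records):
    """Return the missed match rate per group.

    `records` is an iterable of (group, is_true_match, was_linked) tuples.
    This layout is an assumption made for illustration only.
    """
    true_matches = defaultdict(int)
    missed = defaultdict(int)
    for group, is_true_match, was_linked in records:
        if is_true_match:
            true_matches[group] += 1
            if not was_linked:
                missed[group] += 1
    # Rate = missed true matches / all true matches, computed within each group.
    return {group: missed[group] / n for group, n in true_matches.items()}


def fnr_gaps(rates, reference):
    """Return each group's false negative rate minus the reference group's."""
    return {g: r - rates[reference] for g, r in rates.items() if g != reference}


# Toy data (fabricated for illustration; not results from the thesis).
records = [
    ("A", True, True), ("A", True, True), ("A", True, True), ("A", True, False),
    ("B", True, True), ("B", True, True), ("B", True, False), ("B", True, False),
    ("B", False, False),  # a true non-match does not count towards the rate
]

rates = false_negative_rates(records)
gaps = fnr_gaps(rates, reference="A")
print(rates)
print(gaps)
```

Reporting the gap for each group alongside the overall rate is one way to make differential linkage error visible; a linkage model with smaller gaps would be considered more equitable on this measure.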

Type: Thesis (Doctoral)
Qualification: Ph.D
Title: Improving the processes of linking and analysing longitudinal population data to promote ethnic health equity
Language: English
Additional information: Copyright © The Author 2025. Original content in this thesis is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) Licence (https://creativecommons.org/licenses/by-nc/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request.
Keywords: health data science, health inequalities, inequity, record linkage, statistics
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > UCL GOS Institute of Child Health
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > UCL GOS Institute of Child Health > Population, Policy and Practice Dept
URI: https://discovery.ucl.ac.uk/id/eprint/10215745
