eprintid: 10205804
rev_number: 8
eprint_status: archive
userid: 699
dir: disk0/10/20/58/04
datestamp: 2025-03-10 13:45:33
lastmod: 2025-03-10 13:45:33
status_changed: 2025-03-10 13:45:33
type: article
metadata_visibility: show
sword_depositor: 699
creators_name: Lam, Joseph
creators_name: Cortina Borja, Mario
creators_name: Aldridge, Robert
creators_name: Blackburn, Ruth
creators_name: Harron, Katie
title: Data Note: Alternative Name Encodings - Using Jyutping or Pinyin as tonal
representations of Chinese names for data linkage
ispublished: inpress
divisions: UCL
divisions: B02
divisions: D13
divisions: G25
keywords: data linkage; romanisation; linkage errors; data equity
note: © The Authors. Open Access under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/deed.en)
abstract: Accurate data linkage across large administrative databases is crucial for addressing
complex research and policy questions, yet linkage errors—stemming from inconsistent name
representations—can introduce biases, predominantly for names not given in English. This data note
examines the impact of romanisation on linkage accuracy, focusing on Chinese names and comparing
standardised systems (Jyutping and Pinyin) with the non-standardised Hong Kong Government
Cantonese Romanisation (HKG-romanisation). We identify three primary issues: language-specific
variations in romanisation, the loss of tonal information inherent to tonal languages, and discrepancies
in name order conventions. Using a dataset of 771 Hong Kong student names, our analysis
reveals that standardised romanisation systems enhance the uniqueness and consistency of name
representations, thereby improving linkage precision and recall compared to HKG-romanisation.
Specifically, Jyutping and Pinyin achieved over 95% recall in blocking strategies, whereas HKG-romanisation
only reached 68.8%. Incorporating tonal information further improved recall. These
findings underscore the necessity of adopting standardised, tone-sensitive romanisation systems and
flexible database designs to reduce linkage errors and promote data equity for under-represented
groups. We advocate for the implementation of phonetic encodings in databases, alongside
language-specific pre-processing protocols, to ensure more inclusive and accurate data linkage
processes.
date: 2025-03-14
date_type: published
publisher: Swansea University
official_url: https://doi.org/10.23889/ijpds.v6i1.2935
oa_status: green
full_text_type: pub
language: eng
primo: open
primo_central: open_green
verified: verified_manual
elements_id: 2367304
doi: 10.23889/ijpds.v6i1.2935
lyricists_name: Lam, Joseph
lyricists_id: JLAMX69
actors_name: Lam, Joseph
actors_id: JLAMX69
actors_role: owner
funding_acknowledgements: 212953/Z/18/Z [Wellcome Trust]
full_text_status: public
publication: International Journal of Population Data Science
volume: 6
number: 1
article_number: 2935
issn: 2399-4908
citation: Lam, Joseph; Cortina Borja, Mario; Aldridge, Robert; Blackburn, Ruth; Harron, Katie; (2025) Data Note: Alternative Name Encodings - Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage. International Journal of Population Data Science, 6 (1), Article 2935. 10.23889/ijpds.v6i1.2935 <https://doi.org/10.23889/ijpds.v6i1.2935>. (In press). Green open access
 
document_url: https://discovery.ucl.ac.uk/id/eprint/10205804/1/6_1_Lam_2935.pdf