UCL Discovery

Data Note: Alternative Name Encodings - Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage

Lam, Joseph; Cortina Borja, Mario; Aldridge, Robert; Blackburn, Ruth; Harron, Katie (2025) Data Note: Alternative Name Encodings - Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage. International Journal of Population Data Science, 6 (1), Article 2935. 10.23889/ijpds.v6i1.2935. (In press). Green open access

Text: 6_1_Lam_2935.pdf - Accepted Version (2MB)

Abstract

Accurate data linkage across large administrative databases is crucial for addressing complex research and policy questions, yet linkage errors stemming from inconsistent name representations can introduce biases, predominantly for names not given in English. This data note examines the impact of romanisation on linkage accuracy, focusing on Chinese names and comparing standardised systems (Jyutping and Pinyin) with the non-standardised Hong Kong Government Cantonese Romanisation (HKG-romanisation). We identify three primary issues: language-specific variations in romanisation, the loss of tonal information inherent to tonal languages, and discrepancies in name order conventions. Using a dataset of 771 Hong Kong student names, our analysis reveals that standardised romanisation systems enhance the uniqueness and consistency of name representations, thereby improving linkage precision and recall compared to HKG-romanisation. Specifically, Jyutping and Pinyin achieved over 95% recall in blocking strategies, whereas HKG-romanisation reached only 68.8%. Incorporating tonal information further improved recall. These findings underscore the necessity of adopting standardised, tone-sensitive romanisation systems and flexible database designs to reduce linkage errors and promote data equity for under-represented groups. We advocate for the implementation of phonetic encodings in databases, alongside language-specific pre-processing protocols, to ensure more inclusive and accurate data linkage processes.
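To make the abstract's notions of blocking recall and uniqueness concrete, the following is a minimal Python sketch. It is not the authors' code or data: the function names (`strip_tones`, `blocking_recall`, `uniqueness`), the record identifiers, and the three toy surnames are assumptions introduced purely for illustration. Blocking recall is taken here to mean the share of true matching record pairs that receive identical blocking keys in both files.

```python
import re


def strip_tones(jyutping: str) -> str:
    """Drop Jyutping tone digits (1-6), e.g. 'wong4' -> 'wong'."""
    return re.sub(r"[1-6]", "", jyutping)


def blocking_recall(true_pairs, keys_a, keys_b):
    """Fraction of true matching pairs whose blocking keys are identical."""
    if not true_pairs:
        raise ValueError("true_pairs must not be empty")
    kept = sum(1 for a, b in true_pairs if keys_a[a] == keys_b[b])
    return kept / len(true_pairs)


def uniqueness(keys):
    """Share of distinct keys among all records (1.0 = every key is unique)."""
    values = list(keys.values())
    return len(set(values)) / len(values)


# Toy data, illustrative only (not taken from the study).
# Three people appear in both file A and file B. Their surnames are
# 黃 (Jyutping wong4), 汪 (wong1) and 陳 (can4).
# Free-form romanisation is assumed to vary between the two files.
freeform_a = {1: "wong", 2: "wong", 3: "chan"}
freeform_b = {1: "wong", 2: "wang", 3: "chen"}

# A standardised system yields the same encoding in both files.
jyutping_a = {1: "wong4", 2: "wong1", 3: "can4"}
jyutping_b = {1: "wong4", 2: "wong1", 3: "can4"}

# Ground truth: record i in file A matches record i in file B.
true_pairs = [(1, 1), (2, 2), (3, 3)]

toneless_a = {k: strip_tones(v) for k, v in jyutping_a.items()}

print("recall, free-form keys:", blocking_recall(true_pairs, freeform_a, freeform_b))
print("recall, Jyutping keys: ", blocking_recall(true_pairs, jyutping_a, jyutping_b))
print("uniqueness with tones:   ", uniqueness(jyutping_a))
print("uniqueness without tones:", uniqueness(toneless_a))
```

In this toy setup the free-form keys lose two of the three true pairs at the blocking stage, while the standardised keys retain all of them. Dropping the tone digits makes 黃 and 汪 indistinguishable, which is one way to picture the abstract's point about tonal information and the uniqueness of name representations.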

Type: Article
Title: Data Note: Alternative Name Encodings - Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage
Open access status: An open access version is available from UCL Discovery
DOI: 10.23889/ijpds.v6i1.2935
Publisher version: https://doi.org/10.23889/ijpds.v6i1.2935
Language: English
Additional information: © The Authors. Open Access under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/deed.en)
Keywords: data linkage; romanisation; linkage errors; data equity
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > UCL GOS Institute of Child Health
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > UCL GOS Institute of Child Health > Population, Policy and Practice Dept
URI: https://discovery.ucl.ac.uk/id/eprint/10205804