%0 Journal Article
%@ 2399-4908
%A Lam, Joseph
%A Cortina Borja, Mario
%A Aldridge, Robert
%A Blackburn, Ruth
%A Harron, Katie
%D 2025
%F discovery:10205804
%I Swansea University
%J International Journal of Population Data Science
%K data linkage; romanisation; linkage errors; data equity
%N 1
%T Data Note: Alternative Name Encodings - Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage
%U https://discovery.ucl.ac.uk/id/eprint/10205804/
%V 6
%X Accurate data linkage across large administrative databases is crucial for addressing complex research and policy questions, yet linkage errors, stemming from inconsistent name representations, can introduce biases, predominantly for names not given in English. This data note examines the impact of romanisation on linkage accuracy, focusing on Chinese names and comparing standardised systems (Jyutping and Pinyin) with the non-standardised Hong Kong Government Cantonese Romanisation (HKG-romanisation). We identify three primary issues: language-specific variations in romanisation, the loss of tonal information inherent to tonal languages, and discrepancies in name order conventions. Using a dataset of 771 Hong Kong student names, our analysis reveals that standardised romanisation systems enhance the uniqueness and consistency of name representations, thereby improving linkage precision and recall compared to HKG-romanisation. Specifically, Jyutping and Pinyin achieved over 95% recall in blocking strategies, whereas HKG-romanisation only reached 68.8%. Incorporating tonal information further improved recall. These findings underscore the necessity of adopting standardised, tone-sensitive romanisation systems and flexible database designs to reduce linkage errors and promote data equity for under-represented groups. We advocate for the implementation of phonetic encodings in databases, alongside language-specific pre-processing protocols, to ensure more inclusive and accurate data linkage processes.
%Z © The Authors. Open Access under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/deed.en)