eprintid: 10205804
rev_number: 8
eprint_status: archive
userid: 699
dir: disk0/10/20/58/04
datestamp: 2025-03-10 13:45:33
lastmod: 2025-03-10 13:45:33
status_changed: 2025-03-10 13:45:33
type: article
metadata_visibility: show
sword_depositor: 699
creators_name: Lam, Joseph
creators_name: Cortina Borja, Mario
creators_name: Aldridge, Robert
creators_name: Blackburn, Ruth
creators_name: Harron, Katie
title: Data Note: Alternative Name Encodings - Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage
ispublished: inpress
divisions: UCL
divisions: B02
divisions: D13
divisions: G25
keywords: data linkage; romanisation; linkage errors; data equity
note: © The Authors. Open Access under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/deed.en)
abstract: Accurate data linkage across large administrative databases is crucial for addressing complex research and policy questions, yet linkage errors, stemming from inconsistent name representations, can introduce biases, predominantly for names not given in English. This data note examines the impact of romanisation on linkage accuracy, focusing on Chinese names and comparing standardised systems (Jyutping and Pinyin) with the non-standardised Hong Kong Government Cantonese Romanisation (HKG-romanisation). We identify three primary issues: language-specific variations in romanisation, the loss of tonal information inherent to tonal languages, and discrepancies in name order conventions. Using a dataset of 771 Hong Kong student names, our analysis reveals that standardised romanisation systems enhance the uniqueness and consistency of name representations, thereby improving linkage precision and recall compared to HKG-romanisation. Specifically, Jyutping and Pinyin achieved over 95% recall in blocking strategies, whereas HKG-romanisation only reached 68.8%. Incorporating tonal information further improved recall. These findings underscore the necessity of adopting standardised, tone-sensitive romanisation systems and flexible database designs to reduce linkage errors and promote data equity for under-represented groups. We advocate for the implementation of phonetic encodings in databases, alongside language-specific pre-processing protocols, to ensure more inclusive and accurate data linkage processes.
date: 2025-03-14
date_type: published
publisher: Swansea University
official_url: https://doi.org/10.23889/ijpds.v6i1.2935
oa_status: green
full_text_type: pub
language: eng
primo: open
primo_central: open_green
verified: verified_manual
elements_id: 2367304
doi: 10.23889/ijpds.v6i1.2935
lyricists_name: Lam, Joseph
lyricists_id: JLAMX69
actors_name: Lam, Joseph
actors_id: JLAMX69
actors_role: owner
funding_acknowledgements: 212953/Z/18/Z [Wellcome Trust]
full_text_status: public
publication: International Journal of Population Data Science
volume: 6
number: 1
article_number: 2935
issn: 2399-4908
citation: Lam, Joseph; Cortina Borja, Mario; Aldridge, Robert; Blackburn, Ruth; Harron, Katie; (2025) Data Note: Alternative Name Encodings - Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage. International Journal of Population Data Science, 6 (1), Article 2935. 10.23889/ijpds.v6i1.2935 <https://doi.org/10.23889/ijpds.v6i1.2935>. (In press). Green open access
document_url: https://discovery.ucl.ac.uk/id/eprint/10205804/1/6_1_Lam_2935.pdf