UCL Discovery

A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage

Pita, R; Mendonca, E; Reis, S; Barreto, M; Denaxas, S; (2017) A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage. In: Bellatreche, L and Chakravarthy, S, (eds.) Proceedings of the 19th International Conference on Big Data Analytics and Knowledge Discovery: DaWaK 2017. (pp. 214-227). Springer, Cham: Regensburg, Germany. Green open access

titunse.pdf - Accepted Version


Abstract

Record linkage (RL) is the process of identifying and linking data that relate to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common, uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairwise comparisons. A key component of record linkage is accuracy assessment: the process of manually verifying and validating matched pairs to further refine linkage parameters and increase overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer’s intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from large Brazilian socioeconomic and public health care data sources. The models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values collected from a 10-fold cross-validation method. Results show that logistic regression outperforms the other classifiers and enables the creation of a generalized, highly accurate model to validate linkage results.
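
The evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names (`confusion_counts`, `linkage_metrics`, `k_fold_indices`), the 0/1 label encoding (1 = true match, 0 = non-match) and the contiguous, unshuffled fold layout are all assumptions made for illustration. It shows how sensitivity, specificity and positive predictive value follow from a classifier's decisions on candidate pairs, and how record indices could be split into 10 folds.

```python
from typing import Dict, List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 = match and 0 = non-match."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def linkage_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute sensitivity, specificity and positive predictive value (PPV)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    nan = float("nan")  # returned when a denominator is zero
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,  # true matches recovered
        "specificity": tn / (tn + fp) if (tn + fp) else nan,  # non-matches rejected
        "ppv": tp / (tp + fp) if (tp + fp) else nan,          # predicted matches that are true
    }


def k_fold_indices(n: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n) into k contiguous folds and return (train, test) index lists.

    Assumption: no shuffling or stratification, which a real evaluation would
    normally add.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds


if __name__ == "__main__":
    # Hypothetical labels for eight candidate pairs (not data from the paper).
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 1, 0, 0, 0, 1, 0, 1]
    print(linkage_metrics(truth, predicted))
    print(len(k_fold_indices(25, k=10)))
```

In a cross-validated setting, each fold's held-out pairs would be scored by a model trained on the remaining folds, and the per-fold metrics would then be summarized.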

Type: Proceedings paper
Title: A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage
Event: 19th International Conference on Big Data Analytics and Knowledge Discovery (DaWaK)
Location: Lyon, FRANCE
Dates: 28 August 2017 - 31 August 2017
ISBN-13: 978-3-319-64282-6
Open access status: An open access version is available from UCL Discovery
DOI: 10.1007/978-3-319-64283-3_16
Publisher version: https://doi.org/10.1007/978-3-319-64283-3_16
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Institute of Health Informatics
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Institute of Health Informatics > Clinical Epidemiology
URI: https://discovery.ucl.ac.uk/id/eprint/10058731
Downloads since deposit: 601
