UCL Discovery

Confound-leakage: confound removal in machine learning leads to leakage

Hamdan, S; Love, BC; von Polier, GG; Weis, S; Schwender, H; Eickhoff, SB; Patil, KR; (2023) Confound-leakage: confound removal in machine learning leads to leakage. GigaScience, 12, Article giad071. 10.1093/gigascience/giad071. Green open access

PDF: Love_Confound-leakage_VoR.pdf - Published Version. Download (1MB)

Abstract

Background: Machine learning (ML) approaches are a crucial component of modern data analysis in many fields, including epidemiology and medicine. Nonlinear ML methods often achieve accurate predictions, for instance in personalized medicine, as they are capable of modeling complex relationships between features and the target. Problematically, ML models and their predictions can be biased by confounding information present in the features. To remove this spurious signal, researchers often employ featurewise linear confound regression (CR). While this is considered a standard approach for dealing with confounding, the possible pitfalls of using CR in ML pipelines are not fully understood.

Results: We provide new evidence that, contrary to general expectations, linear confound regression can increase the risk of confounding when combined with nonlinear ML approaches. Using a simple framework that uses the target as a confound, we show that information leaked via CR can inflate null or moderate effects to near-perfect prediction. By shuffling the features, we provide evidence that this increase is indeed due to confound-leakage and not due to genuine information being revealed. We then demonstrate the danger of confound-leakage in a real-world clinical application, where the accuracy of predicting attention-deficit/hyperactivity disorder from speech-derived features is overestimated when depression is used as a confound.

Conclusions: As shown, mishandling or even amplifying confounding effects due to confound-leakage when building ML models can lead to untrustworthy, biased, and unfair predictions. Our exposé of the confound-leakage pitfall and the guidelines we provide for dealing with it can help create more robust and trustworthy ML models.
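The abstract's diagnostic framework — using the target itself as the confound — can be sketched in a few lines. The intuition: for discrete features, featurewise linear CR shifts each feature value by an amount that depends on the confound, so when the confound is the target, the residuals encode the target and a nonlinear model can recover it almost perfectly even from features carrying no real signal. The sketch below is an illustrative reconstruction with scikit-learn under that assumption, not the authors' code or data; all names and the toy dataset are hypothetical.

```python
# Toy demonstration of confound-leakage: random binary features with no
# relation to the target, the target used as the "confound" for CR, and a
# nonlinear model (decision tree) evaluated before and after CR.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 2000
X = rng.integers(0, 2, size=(n, 5)).astype(float)  # binary features, no signal
y = rng.integers(0, 2, size=n)                      # target, independent of X
conf = y.reshape(-1, 1).astype(float)               # target used as confound

X_tr, X_te, y_tr, y_te, c_tr, c_te = train_test_split(
    X, y, conf, test_size=0.5, random_state=0)

def confound_regression(X_fit, c_fit, X_apply, c_apply):
    """Featurewise linear CR: fit feature-on-confound regressions on the
    training set, then subtract the confound's contribution."""
    reg = LinearRegression().fit(c_fit, X_fit)
    return X_apply - reg.predict(c_apply)

X_tr_cr = confound_regression(X_tr, c_tr, X_tr, c_tr)
X_te_cr = confound_regression(X_tr, c_tr, X_te, c_te)

raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
cr = DecisionTreeClassifier(random_state=0).fit(X_tr_cr, y_tr)

print("raw features:", accuracy_score(y_te, raw.predict(X_te)))   # near chance
print("after CR:", accuracy_score(y_te, cr.predict(X_te_cr)))     # near-perfect
```

After CR a binary feature takes one of four exact residual values depending on the (feature, target) combination, so a single tree split threshold separates the classes — the near-perfect accuracy is pure leakage, not signal.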

Type: Article
Title: Confound-leakage: confound removal in machine learning leads to leakage
Location: United States
Open access status: An open access version is available from UCL Discovery
DOI: 10.1093/gigascience/giad071
Publisher version: https://doi.org/10.1093/gigascience/giad071
Language: English
Additional information: © The Author(s) 2023. Published by Oxford University Press GigaScience. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/).
Keywords: clinical applications, confounding, data-leakage, machine-learning
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > Div of Psychology and Lang Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > Div of Psychology and Lang Sciences > Experimental Psychology
URI: https://discovery.ucl.ac.uk/id/eprint/10178664
