eprintid: 1427772
rev_number: 43
eprint_status: archive
userid: 608
dir: disk0/01/42/77/72
datestamp: 2014-04-25 19:00:15
lastmod: 2021-12-06 00:25:34
status_changed: 2014-04-25 19:00:15
type: article
metadata_visibility: show
item_issues_count: 0
creators_name: Shah, AD
creators_name: Bartlett, JW
creators_name: Carpenter, J
creators_name: Nicholas, O
creators_name: Hemingway, H
title: Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study
ispublished: pub
divisions: UCL
divisions: B02
divisions: D65
divisions: J38
divisions: DD4
divisions: B04
divisions: C06
divisions: F61
keywords: angina, stable, imputation, missing data, missingness at random, regression trees, simulation, survival
note: © The Author 2014. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted reuse, distribution, and reproduction in any medium,
provided the original work is properly cited.

PubMed ID: 24589914
abstract: Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic
research. The “true” imputation model may contain nonlinearities which are not included in default imputation
models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions
and does not require a particular regression model to be specified. We compared parametric MICE with a
random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000
persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research
using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates.
Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained
using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard
ratios, but random forest was more efficient and produced narrower confidence intervals. The second study
used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear
way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better.
This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in
which some patients have missing data.
date: 2014-01-12
official_url: http://dx.doi.org/10.1093/aje/kwt312
vfaculties: VFPHS
oa_status: green
full_text_type: pub
primo: open
primo_central: open_green
verified: verified_manual
elements_source: WoS-Lite
elements_id: 942181
doi: 10.1093/aje/kwt312
lyricists_name: Carpenter, James
lyricists_name: Hemingway, Harry
lyricists_name: Nicholas, Owen
lyricists_name: Shah, Anoop
lyricists_id: JCARP26
lyricists_id: HHEMI65
lyricists_id: ONICH93
lyricists_id: ASHAH69
full_text_status: public
publication: AMERICAN JOURNAL OF EPIDEMIOLOGY
volume: 179
number: 6
pagerange: 764 - 774
issn: 0002-9262
citation: Shah, AD; Bartlett, JW; Carpenter, J; Nicholas, O; Hemingway, H; (2014) Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study. AMERICAN JOURNAL OF EPIDEMIOLOGY, 179 (6) 764 - 774. 10.1093/aje/kwt312 <https://doi.org/10.1093/aje%2Fkwt312>. Green open access
 
document_url: https://discovery.ucl.ac.uk/id/eprint/1427772/1/Am._J._Epidemiol.-2014-Shah-764-74.pdf