Random Forest variable importance with missing data

Hapfelmeier, Alexander; Hothorn, Torsten; Ulm, Kurt

Publication date: 15 February 2012

Abstract

Random Forests are commonly applied for data prediction and interpretation. The latter purpose is supported by variable importance measures that rate the relevance of predictors. Yet existing measures cannot be computed when the data contain missing values. Possible solutions are given by imputation methods, complete case analysis and a newly suggested importance measure. However, it is unknown to what extent these approaches are able to provide a reliable estimate of a variable's relevance. An extensive simulation study was performed to investigate this property for a variety of missing data generating processes. Findings and recommendations: Complete case analysis should not be applied, as it inappropriately penalized variables that were completely observed. The new importance measure is much better able to reflect decreased information exclusively for variables with missing values and should therefore be used to evaluate actual data situations. By contrast, multiple imputation allows for an estimation of the importances one would potentially observe in complete data situations.
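
The following minimal Python sketch illustrates one plausible reason why complete case analysis can penalize completely observed variables: dropping every row with a missing value also discards observed values of the other predictors. It is not the simulation design or the importance measure from the paper. The data, the 30% missingness rate, the missing-completely-at-random mechanism, and the helper names `complete_cases` and `observed_count` are all assumptions made for illustration.

```python
import random


def complete_cases(rows):
    """Keep only rows in which every value is observed (not None)."""
    return [row for row in rows if all(v is not None for v in row)]


def observed_count(rows, col):
    """Number of observed (non-missing) values in column `col`."""
    return sum(1 for row in rows if row[col] is not None)


random.seed(0)
n = 1000

# Column 0 (x1) is fully observed. Column 1 (x2) loses roughly 30% of its
# values completely at random. Both settings are illustrative assumptions.
rows = []
for _ in range(n):
    x1 = random.gauss(0, 1)
    x2 = random.gauss(0, 1) if random.random() > 0.3 else None
    rows.append((x1, x2))

cc = complete_cases(rows)

print("x1 observed in full data:     ", observed_count(rows, 0))
print("x1 observed in complete cases:", observed_count(cc, 0))
print("x2 observed in full data:     ", observed_count(rows, 1))
```

In this toy setting the fully observed predictor x1 ends up with exactly as few usable values as x2 after complete case analysis, although none of its own values were missing.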

Available Versions

Universität München: Elektronischen Publikationen

oai:epub.ub.uni-muenchen.de:12...

Last time updated on 09/07/2019

Open Access LMU

oai:epub.ub.uni-muenchen.de:12...

Last time updated on 19/07/2013