
Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study

By A. (Andrea) Marshall, Douglas G. Altman, Patrick Royston and Roger L. Holder


Background: There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.

Methods: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five amounts of incomplete cases from 5% to 75% were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.

Results: Performing a CC analysis produced unbiased regression estimates but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. Using SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates, and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.

Conclusion: The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
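The three missingness mechanisms named in the Methods differ only in what drives the probability that a value is missing. The following is a minimal Python sketch of that distinction, not the authors' simulation design (which was carried out in R on four covariates with a survival outcome). The function names `simulate`, `impose_missing` and `observed_mean`, the log-normal covariate, the single fully observed covariate `z` and the rank-based missingness probabilities are all illustrative assumptions.

```python
import math
import random


def simulate(n, seed=1):
    """Hypothetical data: z is fully observed, x is skewed and correlated with z."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [math.exp(0.5 * zi + rng.gauss(0.0, 0.5)) for zi in z]
    return x, z


def impose_missing(x, z, mechanism, rate=0.25, seed=0):
    """Return a copy of x with some entries replaced by None.

    MCAR: every value has the same probability of being missing.
    MAR:  the probability rises with the rank of the observed covariate z.
    MNAR: the probability rises with the rank of x itself.
    The average probability is `rate` in all three cases (assumed design).
    """
    if not 0 < rate <= 0.5:
        raise ValueError("rate must be in (0, 0.5] for this sketch")
    rng = random.Random(seed)
    n = len(x)
    if mechanism == "MCAR":
        probs = [rate] * n
    else:
        if mechanism == "MAR":
            driver = z
        elif mechanism == "MNAR":
            driver = x
        else:
            raise ValueError("mechanism must be 'MCAR', 'MAR' or 'MNAR'")
        order = sorted(range(n), key=lambda i: driver[i])
        probs = [0.0] * n
        for rank, i in enumerate(order):
            # Linear in rank, so the probabilities average to `rate`.
            probs[i] = 2 * rate * (rank + 0.5) / n
    return [None if rng.random() < p else v for v, p in zip(x, probs)]


def observed_mean(values):
    """Mean of the non-missing entries (a complete-case summary)."""
    obs = [v for v in values if v is not None]
    return sum(obs) / len(obs)


if __name__ == "__main__":
    x, z = simulate(5000)
    print("full-data mean of x:", round(sum(x) / len(x), 3))
    for mech in ("MCAR", "MAR", "MNAR"):
        x_miss = impose_missing(x, z, mech)
        frac = sum(v is None for v in x_miss) / len(x_miss)
        print(mech, "missing:", round(frac, 3),
              "observed mean:", round(observed_mean(x_miss), 3))
```

In this toy setting the mean of the observed values stays close to the full-data mean under MCAR and drifts downward under MNAR, because larger values of `x` are more likely to be removed. It says nothing about the behaviour of Cox regression coefficients, which is what the study itself evaluated.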

Topics: QA, RC
Publisher: BioMed Central Ltd.
Year: 2010

