2,072 research outputs found

    An Empirical Comparison of Multiple Imputation Methods for Categorical Data

    Full text link
    Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. A supplementary material for this article is available online

    Multiple Imputation for Multilevel Data with Continuous and Binary Variables

    Get PDF
    We present and compare multiple imputation methods for multilevel continuous and binary data where variables are systematically and sporadically missing. The methods are compared from a theoretical point of view and through an extensive simulation study motivated by a real dataset comprising multiple studies. The comparisons show that these multiple imputation methods are the most appropriate to handle missing values in a multilevel setting and why their relative performances can vary according to the missing data pattern, the multilevel structure and the type of missing variables. This study shows that valid inferences can only be obtained if the dataset includes a large number of clusters. In addition, it highlights that heteroscedastic multiple imputation methods provide more accurate inferences than homoscedastic methods, which should be reserved for data with few individuals per cluster. Finally, guidelines are given to choose the most suitable multiple imputation method according to the structure of the data

    Multiple imputation methods for bivariate outcomes in cluster randomised trials.

    Get PDF
    Missing observations are common in cluster randomised trials. The problem is exacerbated when modelling bivariate outcomes jointly, as the proportion of complete cases is often considerably smaller than the proportion having either of the outcomes fully observed. Approaches taken to handling such missing data include the following: complete case analysis, single-level multiple imputation that ignores the clustering, multiple imputation with a fixed effect for each cluster and multilevel multiple imputation. We contrasted the alternative approaches to handling missing data in a cost-effectiveness analysis that uses data from a cluster randomised trial to evaluate an exercise intervention for care home residents. We then conducted a simulation study to assess the performance of these approaches on bivariate continuous outcomes, in terms of confidence interval coverage and empirical bias in the estimated treatment effects. Missing-at-random clustered data scenarios were simulated following a full-factorial design. Across all the missing data mechanisms considered, the multiple imputation methods provided estimators with negligible bias, while complete case analysis resulted in biased treatment effect estimates in scenarios where the randomised treatment arm was associated with missingness. Confidence interval coverage was generally in excess of nominal levels (up to 99.8%) following fixed-effects multiple imputation and too low following single-level multiple imputation. Multilevel multiple imputation led to coverage levels of approximately 95% throughout. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd

    Multiple Imputation Methods for Treatment Noncompliance and Nonresponse in Randomized Clinical Trials

    Get PDF
    Summary: Randomized clinical trials are a powerful tool for investigating causal treatment effects, but in human trials there are oftentimes problems of noncompliance which standard analyses, such as the intention-to-treat or as-treated analysis, either ignore or incorporate in such a way that the resulting estimand is no longer a causal effect. One alternative to these analyses is the complier average causal effect (CACE) which estimates the average causal treatment effect among a subpopulation that would comply under any treatment assigned. We focus on the setting of a randomized clinical trial with crossover treatment noncompliance (e.g., control subjects could receive the intervention and intervention subjects could receive the control) and outcome nonresponse. In this article, we develop estimators for the CACE using multiple imputation methods, which have been successfully applied to a wide variety of missing data problems, but have not yet been applied to the potential outcomes setting of causal inference. Using simulated data we investigate the finite sample properties of these estimators as well as of competing procedures in a simple setting. Finally we illustrate our methods using a real randomized encouragement design study on the effectiveness of the influenza vaccine

    Multiple imputation methods for longitudinal blood pressure measurements from the Framingham Heart Study

    Get PDF
    Missing data are a great concern in longitudinal studies, because few subjects will have complete data and missingness could be an indicator of an adverse outcome. Analyses that exclude potentially informative observations due to missing data can be inefficient or biased. To assess the extent of these problems in the context of genetic analyses, we compared case-wise deletion to two multiple imputation methods available in the popular SAS package, the propensity score and regression methods. For both the real and simulated data sets, the propensity score and regression methods produced results similar to case-wise deletion. However, for the simulated data, the estimates of heritability for case-wise deletion and the two multiple imputation methods were much lower than for the complete data. This suggests that if missingness patterns are correlated within families, then imputation methods that do not allow this correlation can yield biased results

    Comparison of Multiple Imputation Methods for Categorical Survey Items with High Missing Rates: Application to the Family Life, Activity, Sun, Health and Eating (FLASHE) Study

    Get PDF
    Two multiple imputation methods, the Sequential Regression Multivariate Imputation Algorithm and the Cox-Lannacchione Weighted Sequential Hotdeck, were examined and compared to impute highly missing categorical variables from the Family Life, Activity, Sun, Health and Eating (FLASHE) study. This paper describes the imputation approaches and results from the study

    Laparoscopic versus open colorectal resection for cancer and polyps: A cost-effectiveness study

    Get PDF
    Methods: Participants were recruited in 2006-2007 in a district general hospital in the south of England; those with a diagnosis of cancer or polyps were included in the analysis. Quality of life data were collected using EQ-5D, on alternate days after surgery for 4 weeks. Costs per patient, from a National Health Service perspective (in British pounds, 2006) comprised the sum of operative, hospital, and community costs. Missing data were filled using multiple imputation methods. The difference in mean quality adjusted life years and costs between surgery groups were estimated simultaneously using a multivariate regression model applied to 20 imputed datasets. The probability that laparoscopic surgery is cost-effective compared to open surgery for a given societal willingness-to-pay threshold is illustrated using a cost-effectiveness acceptability curve

    Missing covariates in logistic regression, estimation and distribution selection.

    Get PDF
    We derive explicit formulae for estimation in logistic regression models where some of the covariates are missing. Our approach allows for modeling the distribution of the missing covariates either as a multivariate normal or multivariate t-distribution. A main advantage of this method is that it is fast and does not require the use of iterative procedures. A model selection method is derived which allows to choose amongst these distributions. In addition we consider versions of AIC that are based on the EM algorithm and on multiple imputation methods that have a wide applicability to model selection in likelihood models in general.Akaike information criterion; Likelihood model; Logistic regression; Missing covariates; Model selection; Multiple imputation; t-distribution;
    corecore