A Nearest-Neighbor Nonparametric Multiple Imputation Approach for Incomplete Categorical Data under Missing at Random

Abstract

Incomplete categorical data are a common problem in medical research. If researchers analyze only the complete cases, the estimates may be biased, inefficient, or both, because the missing values are ignored. A growing number of approaches handle missing data under the missing at random (MAR) assumption, i.e., that missingness depends only on the observed data and not on the unobserved data. However, most existing missing-data methods for incomplete categorical data either lack robustness or are sensitive to extreme missingness probabilities. In my dissertation, I study a nearest-neighbor nonparametric multiple imputation (NNMI) approach that uses two working models to impute values for a categorical variable that is missing at random, and to estimate marginal and conditional means under three different study designs.

In the first paper, I adopt the NNMI to handle a categorical outcome with missing values and to estimate the proportion of each category. Specifically, a multinomial logistic regression or cumulative logistic regression serves as the working model for predicting the incomplete categorical outcome, and a logistic regression serves as the working model for predicting the missingness probabilities. The predicted values from the two working models are used as scores for calculating the distance between each observation with a missing value and the observations with non-missing values. A weighting scheme balances the contributions of the two working models when generating the predictive scores. For each missing observation, the value is imputed by randomly selecting one of the donors, i.e., the non-missing values with the smallest distances from that observation. I conduct a simulation study to evaluate the performance of the NNMI method and compare it with several alternative methods. A real-data application is presented using a dataset from the 2013 Behavioral Risk Factor Surveillance System (BRFSS) survey.
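The donor-selection step described for the first paper can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function name `nnmi_impute`, the standardization of the scores, the weighted Euclidean distance, the default weight, and the toy data are all assumptions made for illustration. The sketch also takes the predicted scores from the two working models as given inputs rather than fitting the models.

```python
import numpy as np

def nnmi_impute(y, outcome_score, missing_score, w_outcome=0.8,
                n_donors=5, n_imputations=10, seed=0):
    """Return an (n_imputations, n) integer array of completed copies of y.

    y             : 1-D array of category codes, np.nan where missing
    outcome_score : (n,) or (n, q) predictions from the outcome working model
    missing_score : (n,) predictions from the missingness working model
    All names and defaults here are hypothetical.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    s1 = np.asarray(outcome_score, dtype=float).reshape(len(y), -1)
    s2 = np.asarray(missing_score, dtype=float).reshape(len(y), 1)

    # Assumption: standardize each score so the two weights are comparable.
    def standardize(s):
        sd = s.std(axis=0)
        return (s - s.mean(axis=0)) / np.where(sd > 0, sd, 1.0)

    s1, s2 = standardize(s1), standardize(s2)
    miss = np.isnan(y)
    obs_idx = np.flatnonzero(~miss)
    k = min(n_donors, len(obs_idx))

    completed = np.tile(y, (n_imputations, 1))
    for i in np.flatnonzero(miss):
        # Assumed distance: weighted squared difference of the two scores
        # between case i and every observed case.
        d2 = (w_outcome * ((s1[obs_idx] - s1[i]) ** 2).sum(axis=1)
              + (1.0 - w_outcome) * ((s2[obs_idx] - s2[i]) ** 2).sum(axis=1))
        donors = obs_idx[np.argsort(d2)[:k]]
        # Draw one donor at random for each imputed data set.
        completed[:, i] = y[rng.choice(donors, size=n_imputations)]
    return completed.astype(int)

# Toy data (made up): three categories coded 0, 1, 2, with two missing values.
y = np.array([0, 1, 2, np.nan, 1, np.nan, 0, 2])
outcome_score = np.array([0.10, 0.50, 0.90, 0.52, 0.45, 0.12, 0.15, 0.85])
missing_score = np.array([0.2, 0.3, 0.4, 0.3, 0.3, 0.2, 0.2, 0.4])

imputed = nnmi_impute(y, outcome_score, missing_score,
                      n_donors=2, n_imputations=5)

# Category proportions, averaged over the imputed data sets.
props = np.mean([np.bincount(row, minlength=3) / len(row) for row in imputed],
                axis=0)
```

In this sketch, a larger `w_outcome` makes the donors resemble the missing case mainly on the predicted outcome, while a smaller value gives more influence to the predicted missingness probability.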
In the second paper, I use the NNMI method to handle a missing covariate in logistic regression. As in the first paper, two working models are used to predict the incomplete covariate and the missingness probabilities. First, I perform a computation to assess the potential factors related to selecting an optimal number of donors. Second, the performance of the proposed method is compared with several alternative methods. Finally, the NNMI is applied to the 2013 BRFSS survey data to impute an incomplete categorical covariate and estimate the regression coefficients from a logistic regression model.

In the third paper, the NNMI is extended to handle a missing covariate in a matched case-control study. Estimation is based on a conditional logistic regression model. The performance of the NNMI is compared with complete-case analysis and six parametric multiple imputation methods. The objective is to assess whether the NNMI demonstrates a doubly robust property compared with parametric methods. The NNMI is then applied to impute an incomplete categorical covariate in a nested case-control cohort using the 2013 BRFSS survey data.

To summarize the three papers, the proposed NNMI is a reasonable approach to dealing with an incomplete categorical outcome with more than two levels when assessing the distribution of the outcome. For the working models, I suggest a multinomial logistic regression model to predict the missing outcome and a logistic regression model to predict the missingness probability. For imputing an incomplete covariate and estimating logistic regression coefficients, the NNMI demonstrates a doubly robust property and performs stably when missingness probabilities are close to 0 or 1. When missing values occur in the covariates under a matched case-control design, the NNMI can be used on multiple incomplete covariates as long as the misspecification is moderate.

Dissertation not available (per author’s request)
