
    The Impact of Multiple Imputation of Coarsened Data on Estimates on the Working Poor in South Africa

    South African household surveys typically contain coarsened earnings data: a mixture of missing values, point responses and interval-censored responses. This paper uses sequential regression multivariate imputation to impute missing and interval-censored values in the 2000 and 2006 Labour Force Surveys, and compares poverty estimates obtained under several different methods of reconciling the coarsened earnings data. Estimates of poverty amongst the employed are found not to be sensitive to the use of the multiple imputation approach, but are sensitive to the treatment of workers reporting zero earnings. When earnings are multiply imputed for all workers with missing, interval-censored or reported zero earnings, the proportion of workers earning less than R500 per month falls by almost a third between 2000 and 2006.
    Keywords: coarsened data, multiple imputation, poverty, wage distribution, working poor

    Reuse of imputed data in microarray analysis increases imputation efficiency

    BACKGROUND: The imputation of missing values is necessary for the efficient use of DNA microarray data, because many clustering algorithms and some statistical analyses require a complete data set. A few imputation methods for DNA microarray data have been introduced, but their efficiency was low and the validity of the imputed values had not been fully checked. RESULTS: We developed a new cluster-based imputation method called the sequential K-nearest neighbor (SKNN) method. It imputes missing values sequentially, starting from the gene with the fewest missing values, and reuses the imputed values in later imputations. Although it reuses imputed values, this new method greatly improves on the conventional KNN-based method and other methods based on maximum likelihood estimation in both accuracy and computational complexity. The performance of SKNN was particularly strong relative to other imputation methods for data with high missing rates and a large number of experiments. Applying Expectation Maximization (EM) to the SKNN method improved accuracy, but increased computational time in proportion to the number of iterations. The Multiple Imputation (MI) method, which is well known but had not previously been applied to microarray data, showed accuracy similar to the SKNN method, with slightly higher dependency on the type of data set. CONCLUSIONS: Sequential reuse of imputed data in KNN-based imputation greatly increases the efficiency of imputation. The SKNN method should be practically useful for salvaging data from microarray experiments with large numbers of missing entries, and it generates reliable imputed values that can be used for further cluster-based analysis of microarray data.
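The sequential-reuse idea behind SKNN can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the procedure described above, not the authors' code; the function name, the `k` parameter and the genes-as-rows layout are assumptions, and at least one complete gene is assumed to exist.

```python
import numpy as np

def sknn_impute(X, k=3):
    """Sequential KNN imputation sketch: rows are genes, columns are
    experiments. Genes are imputed in order of increasing missingness;
    once imputed, a gene joins the donor pool so later genes can borrow
    from it (the sequential reuse that gives SKNN its efficiency)."""
    X = X.astype(float).copy()
    n_missing = np.isnan(X).sum(axis=1)
    donors = X[n_missing == 0]              # start with fully observed genes
    for i in np.argsort(n_missing):         # fewest missing values first
        if n_missing[i] == 0:
            continue
        row = X[i]
        obs = ~np.isnan(row)
        # Euclidean distance to donors, computed over observed coordinates only
        d = np.sqrt(((donors[:, obs] - row[obs]) ** 2).sum(axis=1))
        nearest = donors[np.argsort(d)[:k]]
        row[~obs] = nearest[:, ~obs].mean(axis=0)   # fill with neighbour means
        donors = np.vstack([donors, row])           # reuse the imputed gene later
    return X
```

With `k=1` each missing entry is simply copied from the single closest donor gene.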

    DIAGNOSTICS FOR MULTIPLE IMPUTATION BASED ON THE PROPENSITY SCORE

    Multiple imputation (MI) is a popular approach to handling missing data; however, there has been limited work on diagnostics for imputation results. We propose two diagnostic techniques for imputations based on the propensity score: (1) comparing the conditional distributions of observed and imputed values given the propensity score; and (2) fitting regression models of the imputed data as a function of the propensity score and the missingness indicator. Simulation results show that these diagnostic methods can identify problems with the imputations under the missing at random assumption. We use 2002 US Natality public-use data to illustrate the methods, where missing values in gestational age and in covariates are imputed using the sequential regression multiple imputation method.
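The first diagnostic can be sketched as follows: estimate each record's propensity to be missing from its observed covariates, then compare observed and imputed values within propensity-score strata. This is a minimal illustration of the general idea, not the authors' procedure; the logistic fit, the stratification into quantile bins and all function names are assumptions.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Logistic regression of a missingness indicator via Newton's method."""
    X = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        H = X.T @ (X * W[:, None])              # Hessian of the log-likelihood
        beta += np.linalg.solve(H + 1e-8 * np.eye(len(beta)), X.T @ (y - p))
    return beta

def propensity_diagnostic(covariates, values, missing, n_strata=3):
    """Compare observed vs imputed means within propensity-score strata.

    `values` holds the data with imputations already filled in; `missing`
    flags which entries were imputed. Large within-stratum gaps suggest the
    imputations conflict with the observed data given the propensity score."""
    beta = fit_logistic(covariates, missing.astype(float))
    Xd = np.column_stack([np.ones(len(covariates)), covariates])
    score = 1 / (1 + np.exp(-Xd @ beta))        # estimated P(missing | covariates)
    edges = np.quantile(score, np.linspace(0, 1, n_strata + 1))
    gaps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_s = (score >= lo) & (score <= hi)
        obs = values[in_s & ~missing]
        imp = values[in_s & missing]
        if len(obs) and len(imp):
            gaps.append(abs(obs.mean() - imp.mean()))
    return gaps
```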

    AN ANALYSIS OF NONIGNORABLE NONRESPONSE IN A SURVEY WITH A ROTATING PANEL DESIGN

    Missing values for income questions are common in survey data. When the probabilities of nonresponse are assumed to depend on the observed information and not on the underlying unobserved amounts, the missing income values are missing at random (MAR), and methods such as sequential multiple imputation can be applied. However, the MAR assumption is often considered questionable in this context, since missingness of income is thought to be related to the value of income itself, even after conditioning on available covariates. In this article we describe a sensitivity analysis based on a pattern-mixture model for deviations from MAR, in the context of missing income values in a rotating panel survey. The sensitivity analysis avoids the well-known problems of underidentification of parameters in non-MAR models, is easy to carry out using existing sequential multiple imputation software, and has a number of novel features.
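Pattern-mixture sensitivity analyses of this kind are often run as a delta adjustment: impute under MAR, then offset the imputed incomes by a range of deltas to represent non-MAR departures and watch how the estimate moves. The sketch below is illustrative only; it uses a simple random draw from observed values in place of the survey's sequential multiple imputation model, and the function name and arguments are assumptions.

```python
import numpy as np

def delta_adjusted_means(y, missing, deltas, rng):
    """Delta-adjustment sketch of a pattern-mixture sensitivity analysis.

    Incomes are first imputed under MAR (here: a random draw from the
    observed values, standing in for a full imputation model), then the
    imputed values are shifted by delta to represent a non-MAR departure;
    delta = 0 recovers the MAR analysis."""
    draws = rng.choice(y[~missing], size=missing.sum(), replace=True)
    means = {}
    for delta in deltas:
        y_imp = y.copy()
        y_imp[missing] = draws + delta   # nonrespondents' incomes shifted by delta
        means[delta] = y_imp.mean()
    return means
```

Reporting the estimate across a grid of deltas shows how far the MAR conclusion survives plausible departures.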

    Using Multiple Imputation to Address Missing Values of Hierarchical Data

    Missing data are a common concern in data analysis. When the data have a hierarchical or nested structure, the SUDAAN package can be used for multiple imputation. We illustrate this with birth certificate data linked to the Centers for Disease Control and Prevention’s National Assisted Reproductive Technology Surveillance System database. The Cox-Iannacchione weighted sequential hot deck method was used to multiply impute missing/unknown values of covariates in a logistic model.
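The basic sequential hot deck idea underlying this approach can be sketched as follows. Note this is not the Cox-Iannacchione weighted variant implemented in SUDAAN, which additionally uses the sampling weights to control how often each donor is reused; this sketch only shows the sequential donor-carrying mechanism, and the function name and ordering rule are assumptions.

```python
import numpy as np

def sequential_hot_deck(values, weights):
    """Plain sequential hot deck: process records in (descending) weight
    order and fill each missing value from the most recent responding
    donor. If the first record in the ordering is missing, it stays NaN,
    since no donor has been seen yet."""
    order = np.argsort(weights)[::-1]      # heaviest records first
    out = values.astype(float).copy()
    donor = np.nan
    for i in order:
        if np.isnan(out[i]):
            out[i] = donor                 # take the last seen respondent's value
        else:
            donor = out[i]                 # this respondent becomes the donor
    return out
```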

    Analysis of Machine Learning Based Imputation of Missing Data

    Data analysis and classification can be degraded by missing data in datasets. Missing data are commonly handled either by deletion-based methods, which reduce the number of data records, or by simple imputation methods such as filling with the mean or median, which can impute inaccurate values. A significant improvement is possible if missing values are imputed more accurately at low computational cost. In this work, a workflow for analysing machine learning-based missing data imputation is proposed. The K-nearest neighbors (KNN) and sequential KNN (SKNN) algorithms are used to impute missing values in datasets. Data with missing values handled by a statistical deletion approach (listwise deletion) and by the ML-based imputation methods (KNN and SKNN) are then compared using different ML classifiers (support vector machine and decision tree) to evaluate the effectiveness of the imputed data. The methods are compared in terms of classification accuracy, and the results show that the ML-based imputation method (SKNN) outperforms the listwise deletion approach and the KNN method in handling missing data in almost every dataset with both classification algorithms (SVM and DT).
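The core trade-off being compared, dropping incomplete records versus filling them in, can be seen in a toy pairing of listwise deletion against plain KNN imputation. This is a hedged sketch of the two preprocessing choices only, not the paper's pipeline: the SVM and decision tree evaluation stage is omitted, and the function names are assumptions.

```python
import numpy as np

def knn_impute(X, k=3):
    """Plain (non-sequential) KNN imputation: fill each incomplete row's
    missing entries with the mean of its k nearest complete rows, using
    Euclidean distance over the row's observed columns."""
    out = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]      # donor pool of complete rows
    for i in range(len(out)):
        row = out[i]
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        d = np.sqrt(((complete[:, obs] - row[obs]) ** 2).sum(axis=1))
        row[miss] = complete[np.argsort(d)[:k]][:, miss].mean(axis=0)
    return out

def listwise_delete(X):
    """Listwise deletion: keep only fully observed rows (shrinks the sample)."""
    return X[~np.isnan(X).any(axis=1)]
```

Imputation preserves the full sample for the downstream classifier, which is where the accuracy differences reported above come from.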

    Assessment and Improvement of a Sequential Regression Multivariate Imputation Algorithm.

    Sequential regression multivariate imputation (SRMI, also known as chained equations or fully conditional specification) is a popular approach for handling missing values in highly complex data structures with many types of variables, structural dependencies among the variables and bounds on plausible imputation values. It is a Gibbs-style algorithm with iterative draws from the posterior predictive distribution of the missing values in any given variable, conditional on all observed and imputed values of all other variables. However, a theoretical weakness of this approach is that the specification of a set of fully conditional regression models may not be compatible with any joint distribution of the variables being imputed. Hence, the convergence properties of the iterative algorithm are not well understood. This dissertation focuses on assessing and improving the SRMI algorithm. Chapter 2 develops conditions for convergence and assesses the properties of inferences from both compatible and incompatible sequences of generalized linear regression models. The results are established for the missing data pattern in which each subject may be missing a value on at most one variable, and are used to develop criteria for the choice of regression models. Chapter 3 proposes a modified block sequential regression multivariate imputation (BSRMI) approach that divides the data into blocks for each variable based on missing data patterns and tunes the regression models through compatibility restrictions. This is extremely helpful for avoiding divergence when the data are missing in general patterns and when it is difficult to find well-fitting models across all missing data patterns. Conditions for the convergence of the algorithm are established, and the repeated sampling properties of inferences are studied using several simulated data sets.
Chapter 4 extends the imputation model selection to quasi-likelihood regression models in both SRMI and BSRMI to better capture structure in the prediction model for the missing values. The performance of the modified approach is examined through simulation studies. The results show that the extension to quasi-likelihood regression models makes it easier to choose better-fitting model sequences that yield desirable repeated sampling properties of the multiple imputation estimates.
PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/133402/1/jianzhu_1.pd
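The Gibbs-style cycling that SRMI performs can be illustrated with a stripped-down chained-equations pass over continuous variables. This sketch fits an ordinary least-squares regression per variable and redraws imputations with normal posterior-predictive noise; real SRMI additionally draws the regression parameters themselves, supports generalized linear models, and enforces bounds and restrictions, none of which appear here.

```python
import numpy as np

def chained_equations(X, n_iter=10, rng=None):
    """Minimal chained-equations (SRMI-style) imputation for a numeric
    matrix: initialize missing entries with column means, then repeatedly
    regress each incomplete variable on all the others (using current
    filled-in values) and redraw its missing entries from the fitted
    linear model plus normal residual noise."""
    if rng is None:
        rng = np.random.default_rng()
    X = X.astype(float).copy()
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[mask[:, j], j] = col_means[j]          # crude starting values
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not mask[:, j].any():
                continue
            A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
            obs = ~mask[:, j]
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid = X[obs, j] - A[obs] @ beta
            sigma = resid.std() if len(resid) > 1 else 0.0
            X[mask[:, j], j] = (A[mask[:, j]] @ beta
                                + rng.normal(0, sigma, mask[:, j].sum()))
    return X
```

Incompatibility of the per-variable models, the theoretical weakness discussed above, would show up here as a chain whose draws never settle into a stable joint distribution.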

    The Sensitivity of Estimates of Post-Apartheid Changes in South African Poverty and Inequality to key Data Imputations

    We begin by summarising the literature that has assessed medium-run changes in poverty and inequality in South Africa using census data. According to this literature, both poverty and inequality increased over the 1996 to 2001 period. In this paper we assess the robustness of these results to the large percentage of individuals and households in both censuses for whom personal income data are missing, and to the fact that personal income is collected in income bands rather than as point values. First, we use a sequential regression multiple imputation approach to impute missing values in the 2001 census data. Relative to the existing literature, the imputation results lead to estimates of mean income and inequality (as measured by the Gini coefficient) that are higher and estimates of poverty that are lower. This is true even accounting for the wider confidence intervals that arise from the uncertainty that the imputations bring into the estimation process. Next, we assess the influence of dubious zero values by setting them to missing and redoing the multiple imputation process. This increases the uncertainty associated with the imputation process, as reflected in wider confidence intervals on all estimates, and only the Gini coefficient is significantly different from the first set of estimated parameters. The final imputation exercise assesses the sensitivity of the results to the practice of attributing band midpoints to personal incomes recorded in bands. We impute an alternative set of intra-band point incomes by replicating the intra-band empirical distribution of personal incomes from a national income and expenditure survey undertaken in the year before each census. Using the empirical distributions increases estimated inequality, although the differences are relatively small.
We finish our empirical work with a discussion of provincial poverty shares as a policy-relevant illustration of the importance of dealing with missing values. Overall, our results for 1996 and 2001 confirm the major findings from the existing literature while generating more reliable confidence intervals for the key parameter of interest than are available elsewhere.
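The final exercise, replacing band midpoints with draws from a reference survey's within-band empirical distribution, can be sketched as follows. The function names and band representation are illustrative, and the reference survey is stood in for by a plain array of point incomes; the actual paper replicates the intra-band distribution of a national income and expenditure survey.

```python
import numpy as np

def midpoint_impute(bands):
    """Conventional practice: assign each banded income its band midpoint."""
    return np.array([(lo + hi) / 2 for lo, hi in bands])

def empirical_impute(bands, reference_incomes, rng):
    """Alternative: draw a point income for each banded response from the
    empirical distribution of a reference survey restricted to that band;
    falls back to the midpoint if the band is empty in the reference data."""
    out = np.empty(len(bands))
    for i, (lo, hi) in enumerate(bands):
        pool = reference_incomes[(reference_incomes >= lo) & (reference_incomes < hi)]
        out[i] = rng.choice(pool) if len(pool) else (lo + hi) / 2
    return out
```

Because real within-band income distributions are skewed rather than centred on the midpoint, the empirical draws shift measured inequality, which is the sensitivity the paper reports.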