
    Multiple Imputation Using SAS Software

    Multiple imputation provides a useful strategy for dealing with data sets that have missing values. Instead of filling in a single value for each missing value, a multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. The multiply imputed data sets are then analyzed using standard complete-data procedures, and the results of these analyses are combined. No matter which complete-data analysis is used, the process of combining parameter estimates and their associated standard errors from the different imputed data sets is essentially the same, and it yields valid statistical inferences that properly reflect the uncertainty due to missing values. This paper reviews methods for analyzing missing data and applications of multiple imputation techniques, and presents the SAS/STAT MI and MIANALYZE procedures, which perform inference by multiple imputation under numerous settings: PROC MI implements popular methods for creating imputations under monotone and nonmonotone (arbitrary) patterns of missing data, and PROC MIANALYZE analyzes results from multiply imputed data sets.
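    As a rough illustration of the combining step described in this abstract (the generic pooling rule, not the SAS procedures themselves), the Python sketch below applies Rubin's rules to a parameter estimate and its standard error from m imputed data sets; the numbers are invented for the example.

```python
import numpy as np

def pool_rubin(estimates, std_errors):
    """Combine results from m imputed data sets with Rubin's rules:
    the pooled estimate is the mean of the per-imputation estimates, and
    the total variance adds the within- and between-imputation parts."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    m = len(estimates)

    q_bar = estimates.mean()              # pooled point estimate
    u_bar = (std_errors ** 2).mean()      # within-imputation variance
    b = estimates.var(ddof=1)             # between-imputation variance
    t = u_bar + (1 + 1 / m) * b           # total variance

    return q_bar, np.sqrt(t)

# Example: a regression coefficient estimated on m = 5 imputed data sets.
est, se = pool_rubin([1.02, 0.97, 1.10, 1.05, 0.99],
                     [0.21, 0.20, 0.22, 0.21, 0.20])
print(f"pooled estimate = {est:.3f}, pooled SE = {se:.3f}")
```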

    Addressing missing data in the estimation of time-varying treatments in comparative effectiveness research

    Comparative effectiveness research is often concerned with evaluating treatment strategies sustained over time, that is, time-varying treatments. Inverse probability weighting (IPW) is often used to address the time-varying confounding by re-weighting the sample according to the probability of treatment receipt at each time point. IPW can also be used to address missing data by re-weighting individuals according to the probability of observing their data. The combination of these two distinct sets of weights may lead to inefficient estimates of treatment effects due to potentially highly variable total weights. Alternatively, multiple imputation (MI) can be used to address the missing data by replacing each missing observation with a set of plausible values drawn from the posterior predictive distribution of the missing data given the observed data. Recent studies have compared IPW and MI for addressing missing data in the evaluation of time-varying treatments, but they focused on missing confounders and monotone missing data patterns. This article assesses the relative advantages of MI and IPW for addressing missing data in both outcomes and confounders measured over time, across monotone and non-monotone missing data settings. Through a comprehensive simulation study, we find that MI consistently provided lower bias and more precise estimates than IPW across a wide range of scenarios. We illustrate the implications of method choice in an evaluation of biologic drugs for patients with severe rheumatoid arthritis, using the US National Databank for Rheumatic Diseases, in which 25% of participants had missing health outcomes or time-varying confounders.
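    The weighting idea for missing data can be sketched at a single time point: model the probability that the outcome is observed given covariates, then weight the complete cases by the inverse of that probability. The minimal Python example below uses simulated data and is only a sketch of this single-step weighting, not the time-varying estimators compared in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: y is observed more often when x is large, so the naive
# complete-case mean of y is biased upward.
n = 5000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))   # P(y observed | x)
observed = rng.random(n) < p_obs

# Step 1: estimate the probability that y is observed, given x.
model = LogisticRegression().fit(x.reshape(-1, 1), observed)
p_hat = model.predict_proba(x.reshape(-1, 1))[:, 1]

# Step 2: weight each complete case by the inverse of its observation probability.
w = 1.0 / p_hat[observed]

print("true mean of y:     ", y.mean())
print("complete-case mean: ", y[observed].mean())                # biased
print("IPW-corrected mean: ", np.average(y[observed], weights=w))
```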

    Identifiability of Normal and Normal Mixture Models With Nonignorable Missing Data

    Missing data problems arise in many applied research studies. They may jeopardize statistical inference for the model of interest if the missing mechanism is nonignorable, that is, if the missing mechanism depends on the missing values themselves even conditional on the observed data. With a nonignorable missing mechanism, the model of interest is often not identifiable without imposing further assumptions. We find that even if the missing mechanism has a known parametric form, the model is not identifiable without specifying a parametric outcome distribution. Although it is fundamental for valid statistical inference, identifiability under nonignorable missing mechanisms has not been established for many commonly used models. In this paper, we first demonstrate identifiability of the normal distribution under monotone missing mechanisms. We then extend it to the normal mixture and t mixture models with non-monotone missing mechanisms. We discover that models under the Logistic missing mechanism are less identifiable than those under the Probit missing mechanism. We give necessary and sufficient conditions for identifiability of models under the Logistic missing mechanism, which sometimes can be checked in real data analysis. We illustrate our methods using a series of simulations, and apply them to a real-life dataset.
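    To make the nonignorable setting concrete, the small simulation below (illustrative only, not the models studied in the paper) generates data whose probability of being missing follows a logistic model in the value itself; the mean of the observed values is then biased, which is why identification requires further assumptions on the outcome distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nonignorable (logistic) missing mechanism: larger values of y are more
# likely to be missing, and y is never seen when it is missing.
n = 10_000
y = rng.normal(size=n)
p_miss = 1.0 / (1.0 + np.exp(-(-1.0 + 1.5 * y)))
missing = rng.random(n) < p_miss

print("true mean of y:          ", y.mean())            # close to 0
print("mean of observed y only: ", y[~missing].mean())   # biased downward
```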

    ACCOUNTING FOR MONOTONE ATTRITION IN A POSTPARTUM DEPRESSION CLINICAL TRIAL

    Longitudinal studies in public health, medicine and the social sciences are often complicated by monotone attrition, where a participant drops out before the end of the study and all of his or her subsequent measurements are missing. To obtain accurate, unbiased results, it is of public health importance to use appropriate missing data analytic methods to address monotone attrition. The defining feature of longitudinal studies is that several measurements are taken for each participant over time. The commonly used methods for analyzing incomplete longitudinal data, complete case analysis and last observation carried forward, are not recommended because they produce biased estimators. Simple imputation and multiple imputation procedures provide alternative approaches for addressing monotone attrition. However, simple imputation is difficult in a multivariate setting and produces biased estimators. Multiple imputation addresses those shortcomings and allows a straightforward assessment of the sensitivity of inferences to various models for non-response. This thesis reviews the literature on missing data mechanisms and missing data analysis methods for monotone attrition. Data from a postpartum depression clinical trial comparing the effects of two drugs (Nortriptyline and Sertraline) on remission status at 8 weeks were re-analyzed using these methods. The original analysis, which only used available data, was replicated first. Then patterns and predictors of attrition were identified. Last observation carried forward, mean imputation and multiple imputation were used to account for both monotone attrition and a small number of intermittent missing measurements. In multiple imputation, every missing measurement was imputed 6 times by predictive matching. Each of the 6 completed data sets was analyzed separately, and the results of all the analyses were combined to obtain the overall estimates and standard errors. In each analysis, continuous remission levels were imputed but the probability of remission was analyzed. The original conclusion of no significant difference in the probability of remission at week 8 between the two drug groups was sustained under last observation carried forward, mean imputation and multiple imputation. Most dropouts occurred during the first three weeks, and participants taking Sertraline who lived alone were more likely to drop out.
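    A bare-bones sketch of the predictive-matching step mentioned above is given below, assuming a continuous outcome and a linear predictor (the variable names and data are invented). A proper multiple-imputation implementation would also draw the regression coefficients from their posterior so that parameter uncertainty is propagated; this simplified version omits that step and varies only the donor draw across imputations.

```python
import numpy as np

def pmm_impute(y, X, n_donors=5, rng=None):
    """One predictive-mean-matching draw: regress y on X using the complete
    cases, then for each missing y choose a donor among the observed cases
    with the closest predicted value and copy the donor's observed y."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])   # add intercept
    obs = ~np.isnan(y)

    beta, *_ = np.linalg.lstsq(design[obs], y[obs], rcond=None)
    pred = design @ beta

    y_imp = y.copy()
    for i in np.where(~obs)[0]:
        dist = np.abs(pred[obs] - pred[i])           # closeness of predictions
        donors = np.argsort(dist)[:n_donors]         # candidate donors
        y_imp[i] = y[obs][rng.choice(donors)]        # copy one donor's value
    return y_imp

# Example: impute a continuous score 6 times; each completed data set would
# then be analyzed separately and the results pooled (e.g. by Rubin's rules).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([0.8, -0.5]) + rng.normal(size=200)
y[rng.random(200) < 0.3] = np.nan
imputations = [pmm_impute(y, X, rng=k) for k in range(6)]
```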

    On Sharp Identification Regions for Regression Under Interval Data

    The reliable analysis of interval data (coarsened data) is one of the most promising applications of imprecise probabilities in statistics. If one refrains from making untestable, and often materially unjustified, strong assumptions on the coarsening process, then the empirical distribution of the data is imprecise, and statistical models are, in Manski's terms, partially identified. We first elaborate some subtle differences between two natural ways of handling interval data in the dependent variable of regression models, distinguishing between two different types of identification regions, called the Sharp Marrow Region (SMR) and the Sharp Collection Region (SCR) here. Focusing on the case of linear regression analysis, we then derive some fundamental geometrical properties of the SMR and SCR, allowing a comparison of the regions and providing some guidelines for their canonical construction. Relying on the algebraic framework of adjunctions of two mappings between partially ordered sets, we characterize the SMR as a right adjoint and as the monotone kernel of a criterion-function-based mapping, while the SCR is interpretable as the corresponding monotone hull. Finally we sketch some ideas on a compromise between SMR and SCR based on a set-domained loss function. This paper is an extended version of a shorter paper with the same title that is conditionally accepted for publication in the Proceedings of the Eighth International Symposium on Imprecise Probability: Theories and Applications. In the present paper we add proofs and a seventh chapter with a small Monte Carlo illustration, which would have made the original paper too long.

    Block-Conditional Missing at Random Models for Missing Data

    Two major ideas in the analysis of missing data are (a) the EM algorithm [Dempster, Laird and Rubin, J. Roy. Statist. Soc. Ser. B 39 (1977) 1--38] for maximum likelihood (ML) estimation, and (b) the formulation of models for the joint distribution of the data Z and missing data indicators M, and the associated "missing at random" (MAR) condition under which a model for M is unnecessary [Rubin, Biometrika 63 (1976) 581--592]. Most previous work has treated Z and M as single blocks, yielding selection or pattern-mixture models depending on how their joint distribution is factorized. This paper explores "block-sequential" models that interleave subsets of the variables and their missing data indicators, and then make parameter restrictions based on assumptions in each block. These include models that are not MAR. We examine a subclass of block-sequential models we call block-conditional MAR (BCMAR) models, and an associated block-monotone reduced likelihood strategy that typically yields consistent estimates by selectively discarding some data. Alternatively, full ML estimation can often be achieved via the EM algorithm. We examine in some detail BCMAR models for the case of two multinomially distributed categorical variables, and a two-block structure where the first block is categorical and the second block arises from a (possibly multivariate) exponential family distribution. Comment: Published at http://dx.doi.org/10.1214/10-STS344 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org).
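    The EM idea referenced in this abstract can be illustrated in its classical textbook special case, a bivariate normal distribution in which the second variable is incompletely observed; this sketch is only that building block, not the block-conditional MAR models developed in the paper.

```python
import numpy as np

def em_bivariate_normal(y1, y2, n_iter=200):
    """ML estimation of a bivariate normal mean and covariance when y2 has
    missing values and y1 is fully observed. The E-step replaces the missing
    sufficient statistics (y2 and y2**2) by their conditional expectations
    given y1 and the current parameters; the M-step recomputes the moments."""
    y2 = np.asarray(y2, dtype=float)
    miss = np.isnan(y2)

    # start from complete-case estimates
    mu = np.array([y1.mean(), y2[~miss].mean()])
    cov = np.cov(y1[~miss], y2[~miss])

    for _ in range(n_iter):
        # E-step: conditional mean and variance of y2 given y1
        slope = cov[0, 1] / cov[0, 0]
        cond_mean = mu[1] + slope * (y1 - mu[0])
        cond_var = cov[1, 1] - slope * cov[0, 1]
        e_y2 = np.where(miss, cond_mean, y2)
        e_y2sq = np.where(miss, cond_mean**2 + cond_var, y2**2)

        # M-step: update the mean and covariance from the completed moments
        mu = np.array([y1.mean(), e_y2.mean()])
        c12 = np.mean(y1 * e_y2) - mu[0] * mu[1]
        cov = np.array([[np.mean(y1**2) - mu[0]**2, c12],
                        [c12, np.mean(e_y2sq) - mu[1]**2]])
    return mu, cov

# Example with missingness that depends only on the observed y1 (MAR).
rng = np.random.default_rng(0)
y1 = rng.normal(size=500)
y2 = 0.5 + 0.8 * y1 + rng.normal(scale=0.6, size=500)
y2[rng.random(500) < 1 / (1 + np.exp(-y1))] = np.nan
print(em_bivariate_normal(y1, y2))
```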

    Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

    Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique, to see whether it is better to tolerate missing data or to impute missing values before applying the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact upon prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.
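    Below is a minimal sketch of the impute-then-learn pipeline evaluated here, using synthetic data, scikit-learn's KNNImputer, and a CART-style decision tree as a rough stand-in for C4.5 (scikit-learn has no C4.5 implementation); the feature set, missingness rate and scoring are all invented for the illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy stand-in for a small software project data set: four project features
# and a cost/effort target, with roughly 20% of feature values missing.
n = 300
X = rng.normal(size=(n, 4))
y = 10 + X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(scale=2.0, size=n)
X[rng.random(X.shape) < 0.2] = np.nan

# k-NN imputation followed by a decision tree, wrapped in a pipeline so the
# imputer is refit inside each cross-validation fold.
model = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("tree", DecisionTreeRegressor(min_samples_leaf=10, random_state=0)),
])
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("mean cross-validated R^2 with k-NN imputation:", scores.mean())
```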