
    MIDAS: A SAS Macro for Multiple Imputation Using Distance-Aided Selection of Donors

    In this paper we describe MIDAS, a SAS macro for multiple imputation using distance-aided selection of donors, which implements an iterative predictive mean matching hot-deck for imputing missing data. This is a flexible multiple imputation approach that can handle data in a variety of formats: continuous, ordinal, and scaled. Because the imputation models are implicit, it is not necessary to specify a parametric distribution for each variable to be imputed. MIDAS also allows the user to assess the sensitivity of their inferences to different assumptions about the missing data mechanism. An example using MIDAS to impute missing data is presented, and MIDAS is compared with existing missing data software.
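
    The core step is easy to sketch outside SAS. Below is a minimal, hypothetical Python illustration of a single predictive mean matching hot-deck pass for one continuous variable with fully observed covariates; MIDAS itself iterates this idea across variables, and all names here are illustrative rather than taken from the macro.

    ```python
    # One predictive-mean-matching (PMM) hot-deck pass: regress y on X using
    # the observed cases, then fill each missing y by drawing an observed
    # value from the k donors whose predicted means are closest.
    import numpy as np

    def pmm_impute(X, y, k=5, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        obs = ~np.isnan(y)
        Xd = np.column_stack([np.ones(len(y)), X])        # add an intercept
        beta = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)[0]
        yhat = Xd @ beta                                  # predicted means, all cases
        y_imp = y.copy()
        for i in np.flatnonzero(~obs):
            d = np.abs(yhat[obs] - yhat[i])               # distance-aided selection
            donors = y[obs][np.argsort(d)[:k]]
            y_imp[i] = rng.choice(donors)                 # hot-deck draw
        return y_imp
    ```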

    Multiple imputation for the comparison of two screening tests in two-phase Alzheimer studies

    Two-phase designs are common in epidemiological studies of dementia, and especially in Alzheimer research. In the first phase, all subjects are screened using a common screening test (or tests), while in the second phase only a subset of these subjects is tested using a more definitive verification assessment, i.e. a gold standard test. When comparing the accuracy of two screening tests in a two-phase study of dementia, inferences are commonly made using only the verified sample. It is well documented that in that case there is a risk of bias, called verification bias. When the two screening tests have only two values (e.g. positive and negative) and we are trying to estimate the differences in sensitivities and specificities of the tests, one is actually constructing a confidence interval for a difference of binomial proportions, a task that is well documented to be non-trivial even with complete data. In this paper, we suggest ways to apply imputation procedures in order to correct the verification bias. This approach allows us to use well-established complete-data methods for the difficult estimation of the difference of two binomial proportions, while also dealing with the incomplete data. We compare different methods of estimation and evaluate the use of multiple imputation in this setting. Our simulation results show that multiple imputation is superior to other commonly used methods. We demonstrate our findings using Alzheimer disease data.
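
    As a rough illustration of the imputation step (not the authors' exact procedure), one proper imputation under an ignorable verification mechanism could look like the following Python sketch, where `T1` and `T2` are the binary screening results, `D` is the gold standard outcome (missing for unverified subjects), and `verified` is a boolean mask; all names are hypothetical.

    ```python
    import numpy as np
    import statsmodels.api as sm

    def impute_disease(T1, T2, D, verified, rng):
        # Model the gold standard given both screening tests on the verified subset.
        X = sm.add_constant(np.column_stack([T1, T2]))
        fit = sm.Logit(D[verified], X[verified]).fit(disp=0)
        # "Proper" imputation: perturb the parameters by their sampling uncertainty.
        beta = rng.multivariate_normal(fit.params, fit.cov_params())
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        D_imp = D.copy()
        D_imp[~verified] = rng.binomial(1, p[~verified])  # draw the missing outcomes
        return D_imp

    # Repeating this m times gives m completed data sets; the difference in
    # sensitivities, P(T1=1 | D=1) - P(T2=1 | D=1), is estimated in each and
    # the results are pooled with Rubin's rules.
    ```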

    Multiple Imputation for Correcting Verification Bias

    In the case in which all subjects are screened using a common test, and only a subset of these subjects is tested using a gold standard test, it is well documented that there is a risk of bias, called verification bias. When the test has only two levels (e.g. positive and negative) and we are trying to estimate the sensitivity and specificity of the test, one is actually constructing a confidence interval for a binomial proportion. Since it is well documented that this estimation is not trivial even with complete data, we adopt the multiple imputation (MI) framework for the verification bias problem. We propose several proper imputation procedures for this problem and compare different methods of estimation. We show that our imputation methods do much better than the existing methods with regard to nominal coverage and confidence interval length.
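
    For context, the pooling across completed data sets uses Rubin's rules. A minimal Python sketch for a single proportion such as a sensitivity, working on the logit scale for a better normal approximation (a common choice, not necessarily the paper's), might be:

    ```python
    import numpy as np
    from scipy import stats

    def pool_logit_proportion(p_hats, ns, level=0.95):
        """p_hats, ns: length-m arrays of per-imputation estimates and denominators."""
        m = len(p_hats)
        q = np.log(p_hats / (1 - p_hats))               # logit-scale estimates
        u = 1.0 / (ns * p_hats * (1 - p_hats))          # within-imputation variances
        qbar, ubar, b = q.mean(), u.mean(), q.var(ddof=1)
        t = ubar + (1 + 1 / m) * b                      # Rubin's total variance
        df = (m - 1) * (1 + ubar / ((1 + 1 / m) * b)) ** 2
        half = stats.t.ppf(0.5 + level / 2, df) * np.sqrt(t)
        expit = lambda x: 1 / (1 + np.exp(-x))
        return expit(qbar), (expit(qbar - half), expit(qbar + half))
    ```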

    Multiple imputation - Review of theory, implementation and software

    Missing data are a common complication in data analysis. In many medical settings missing data can cause difficulties in estimation, precision and inference. Multiple imputation (MI) (Rubin, 1987) is a simulation-based approach to dealing with incomplete data. Although there are many different methods for handling incomplete data, MI has become one of the leading ones, and since the late 1980s there has been a constant increase in the use and publication of MI-related research. This tutorial does not attempt to cover all the material concerning MI; rather, it provides an overview that brings together the theory behind MI and its implementation, and discusses the growing possibilities for applying MI using commercial and free software. We illustrate some of the major points using an example from an Alzheimer disease (AD) study. In this AD study, while clinical data are available for all subjects, postmortem data are only available for the subset of those who died and underwent an autopsy. Analysis of incomplete data requires making unverifiable assumptions, which are discussed in detail in the text. Relevant S-Plus code is provided.
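
    The same workflow is now available in free software; for instance, a minimal chained-equations run in Python via statsmodels (analogous to, not a translation of, the S-Plus code in the paper) on made-up data:

    ```python
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.imputation import mice

    rng = np.random.default_rng(0)
    df = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
    df['y'] = 1.0 + 0.5 * df.x1 - 0.3 * df.x2 + rng.normal(size=200)
    df.loc[rng.random(200) < 0.2, 'x2'] = np.nan      # make ~20% of x2 missing

    imp = mice.MICEData(df)                           # chained-equations imputer
    results = mice.MICE('y ~ x1 + x2', sm.OLS, imp).fit(10, 10)
    print(results.summary())                          # estimates pooled by Rubin's rules
    ```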

    The Ethics in Synthetics: Statistics in the Service of Ethics and Law in Health-Related Research in Big Data from Multiple Sources

    An ethical advancement of scientific knowledge demands a delicate equilibrium between benefits and harms, particularly in health-related research. Under Article 4 of UNESCO’s Universal Declaration on Bioethics and Human Rights, ethically justifiable research that applies and advances scientific knowledge or technologies must maximize direct and indirect benefits and minimize possible harms. The National Institutes of Health (NIH) Data Sharing Policy and Implementation Guidance similarly states that data necessary for drawing valid conclusions and advancing medical research should be made as widely and freely available as possible (in order to share the benefits) while safeguarding the privacy of participants from potentially harmful disclosure of sensitive information. This paper discusses the challenges of maximizing research benefit and minimizing potential harm in the unique context of health-related research in Big Data from multiple sources that are differently protected by the law. Part I frames the ethical dilemma by discussing potential benefits and harms, showing the constant misalignment, in health-related research in Big Data from multiple sources, between the benefits of using confidential information for scientific purposes and the value of keeping it confidential. Part II addresses existing regulations, including their nature and legal coverage, and highlights the challenges that prevail when combining data from multiple sources that are differently protected by the law. Part III compares different requirements for consent or authorization to use persons’ health information for research, focusing on the difficulty existing regulation has in ensuring those requirements when multiple sources of data are used. Part IV investigates whether exemptions from the authorization requirement could prevail in the context of information that exceeds the protection of HIPAA and the Protection of Human Subjects Regulations. Part V proposes a solution of a statistical nature, using the method of synthetic data to balance the conflicting considerations, and Part VI shows how the use of synthetic data can overcome some of the ethical challenges.
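
    A toy sketch of the statistical idea in Part V: instead of releasing the confidential records, fit a model to them and release draws from it. Real synthesizers use far richer models than the joint normal assumed here, but the release logic is the same.

    ```python
    import numpy as np

    def synthesize(confidential, n_synthetic, rng):
        # Fit a (crude) joint model to the confidential rows...
        mu = confidential.mean(axis=0)
        cov = np.cov(confidential, rowvar=False)
        # ...and publish simulated records drawn from it, not the originals.
        return rng.multivariate_normal(mu, cov, size=n_synthetic)
    ```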

    Asymptotically Unbiased Estimation of Exposure Odds Ratios in Complete Records Logistic Regression.

    Missing data are a commonly occurring threat to the validity and efficiency of epidemiologic studies. Perhaps the most common approach to handling missing data is simply to drop those records with one or more missing values, in a so-called "complete records" or "complete case" analysis. In this paper, we bring together earlier-derived yet perhaps now somewhat neglected results which show that a complete records logistic regression analysis can provide asymptotically unbiased estimates of the association of an exposure of interest with an outcome, adjusted for a number of confounders, under a surprisingly wide range of missing data assumptions. We give detailed guidance describing how the observed data can be used to judge the plausibility of these assumptions. The results mean that in large epidemiologic studies which are affected by missing data and analyzed by logistic regression, exposure associations may be estimated without bias in a number of settings where researchers might otherwise assume that bias would occur.
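
    The headline result is easy to probe by simulation. In the hypothetical sketch below, missingness depends only on the outcome, so records are dropped informatively, yet the complete records exposure log odds ratio remains essentially unbiased (the intercept absorbs the bias instead); all parameter values are arbitrary.

    ```python
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n, true_log_or = 200_000, 0.7
    x = rng.normal(size=n)                            # exposure
    y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + true_log_or * x))))
    # Outcome-dependent missingness: cases far likelier to be complete records.
    complete = rng.random(n) < np.where(y == 1, 0.9, 0.3)
    fit = sm.Logit(y[complete], sm.add_constant(x[complete])).fit(disp=0)
    print(fit.params[1])                              # close to 0.7
    ```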

    Privacy Protection and Aggregate Health Data: A Review of Tabular Cell Suppression Methods (Not) Employed in Public Health Data Systems

    Public health research often relies on individuals’ confidential medical data. Data collecting entities, such as states, therefore seek to disseminate these data as widely as possible while still maintaining the privacy of the individual, for legal and ethical reasons. One common way in which such medical data are released is through Web-based Data Query Systems (WDQS). In this article, we examined the WDQS listed by the National Association for Public Health Statistics and Information Systems (NAPHSIS), specifically reviewing how they prevent statistical disclosure in queries that produce a tabular response. One of the most common methods to combat this type of disclosure is suppression: if a cell count in a table is below a certain threshold, the true value is suppressed. This technique does prevent the direct disclosure of small cell counts; however, primary suppression by itself is not always enough to preserve privacy in tabular data. Here, we present several real examples of tabular response queries that employ suppression but in which we are able to infer the values of the suppressed cells, including cells with counts of 1, which could be linked to auxiliary data sources and thus create an identity disclosure. We seek to stimulate awareness of the potential for an online query system to disclose information that individuals may wish to keep private. This research is undertaken in the hope that privacy concerns can be dealt with preemptively rather than only after a major disclosure has taken place; in the wake of such an event, a major concern is that state and local officials would react by permanently shutting down these sites, cutting off a valuable source of research data.
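
    A toy version of the subtraction attack the article demonstrates: with primary suppression only, a suppressed small cell is often exactly recoverable from the published margins, which is why complementary suppression of additional cells is usually needed. The counts below are made up.

    ```python
    # One published table row with its margin; '*' marks the suppressed cell.
    row = {'age_18_24': '*', 'age_25_44': 57, 'age_45_64': 38, 'total': 96}
    suppressed = row['total'] - sum(v for k, v in row.items()
                                    if k != 'total' and v != '*')
    print(suppressed)  # 1 -- the "protected" small count, recovered by subtraction
    ```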