
    Causal inference is not a statistical problem

    This paper introduces a collection of four data sets, similar to Anscombe's Quartet, that aim to highlight the challenges involved when estimating causal effects. Each of the four data sets is generated by a distinct causal mechanism: the first involves a collider, the second a confounder, the third a mediator, and the fourth M-bias induced by an included factor. The paper includes a mathematical summary of each data set, as well as directed acyclic graphs that depict the relationships between the variables. Although the statistical summaries and visualizations for each data set are identical, the true causal effect differs, and estimating it correctly requires knowledge of the data-generating mechanism. These example data sets can help practitioners gain a better understanding of the assumptions underlying causal inference methods and emphasize the importance of gathering information beyond what statistical tools alone can provide. The paper also includes R code for reproducing all figures and provides access to the data sets through an R package named quartets.
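
    As an illustration of why the data-generating mechanism matters, the following sketch simulates two of the four mechanisms described above (a confounder and a collider) and shows how the same adjustment choice is correct for one and biased for the other. This is a minimal simulation for illustration only, not the quartets package data; the variable names and effect sizes are assumptions.

        set.seed(1)
        n <- 10000

        # Confounder: z causes both x and y, so z must be adjusted for.
        z <- rnorm(n)
        x <- z + rnorm(n)
        y <- x + z + rnorm(n)            # true effect of x on y is 1
        coef(lm(y ~ x))["x"]             # biased upward (confounding left in)
        coef(lm(y ~ x + z))["x"]         # approximately 1

        # Collider: x and y both cause cvar, so cvar must NOT be adjusted for.
        x <- rnorm(n)
        y <- x + rnorm(n)                # true effect of x on y is 1
        cvar <- x + y + rnorm(n)
        coef(lm(y ~ x))["x"]             # approximately 1
        coef(lm(y ~ x + cvar))["x"]      # biased (conditioning on a collider)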

    The `Why' behind including `Y' in your imputation model

    Missing data is a common challenge when analyzing epidemiological data, and imputation is often used to address this issue. Here, we investigate the scenario where a covariate used in an analysis has missingness and will be imputed. There are recommendations to include the outcome from the analysis model in the imputation model for missing covariates, but it is not always clear whether this recommendation holds, or why it is sometimes true. We examine deterministic imputation (i.e., single imputation where the imputed values are treated as fixed) and stochastic imputation (i.e., single imputation with a random value, or multiple imputation) methods and their implications for estimating the relationship between the imputed covariate and the outcome. We mathematically demonstrate that including the outcome variable in imputation models is not just a recommendation but a requirement for achieving unbiased results with stochastic imputation methods. Moreover, we dispel common misconceptions about deterministic imputation models and demonstrate why the outcome should not be included in these models. This paper aims to bridge the gap between imputation in theory and in practice, providing mathematical derivations to explain common statistical recommendations. We offer a better understanding of the considerations involved in imputing missing covariates and emphasize when it is necessary to include the outcome variable in the imputation model.
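
    A small simulation makes the stochastic-imputation point concrete. The sketch below is illustrative only and adds assumptions not stated in the abstract (one covariate missing completely at random, linear models, and a fully observed auxiliary variable z): stochastic regression imputation that omits the outcome attenuates the covariate's coefficient, while including the outcome roughly recovers it.

        set.seed(2)
        n <- 10000
        z <- rnorm(n)                          # fully observed auxiliary variable
        x <- z + rnorm(n)                      # covariate that will be partly missing
        y <- x + rnorm(n)                      # outcome; true coefficient of x is 1
        miss <- rbinom(n, 1, 0.5) == 1
        d <- data.frame(y, z, x = ifelse(miss, NA, x))

        # Stochastic single imputation: predict x from the observed rows, then add
        # noise matching the residual standard deviation of the imputation model.
        stoch_impute <- function(form, data, miss) {
          fit <- lm(form, data = data[!miss, ])
          data$x[miss] <- predict(fit, newdata = data[miss, ]) +
            rnorm(sum(miss), sd = sigma(fit))
          data
        }

        d_no_y   <- stoch_impute(x ~ z,     d, miss)   # outcome excluded
        d_with_y <- stoch_impute(x ~ z + y, d, miss)   # outcome included
        coef(lm(y ~ x, data = d_no_y))["x"]            # attenuated (about 0.75 here)
        coef(lm(y ~ x, data = d_with_y))["x"]          # close to 1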

    Design Principles for Data Analysis

    The data science revolution has led to increased interest in the practice of data analysis. While much has been written about statistical thinking, a complementary form of thinking that appears in the practice of data analysis is design thinking, the problem-solving process used to understand the people for whom a product is being designed. For a given problem, there can be significant or subtle differences in how a data analyst (or producer of a data analysis) constructs, creates, or designs a data analysis, including differences in the choice of methods, tooling, and workflow. These choices can affect the data analysis products themselves and the experience of the consumer of the data analysis. The role of the producer can therefore be thought of as designing the data analysis with a set of design principles. Here, we introduce design principles for data analysis and describe how they can be mapped to data analyses in a quantitative, objective, and informative manner. We also provide empirical evidence of variation in principles within and between both producers and consumers of data analyses. Our work leads to two insights: it suggests a formal mechanism to describe data analyses based on the design principles for data analysis, and it provides a framework to teach students how to build data analyses using formal design principles.

    Evaluating the Alignment of a Data Analysis between Analyst and Audience

    A challenge that data analysts face is building a data analysis that is useful for a given consumer. Previously, we defined a set of principles for describing data analyses that can be used to create a data analysis and to characterize the variation between analyses. Here, we introduce a concept that we call the alignment of a data analysis between the data analyst and a consumer. We define a successfully aligned data analysis as one in which the principles of the analyst match those of the consumer for whom the analysis is developed. In this paper, we propose a statistical model for evaluating the alignment of a data analysis and describe some of its properties. We argue that this framework provides a language for characterizing alignment and can serve as a guide for practicing data scientists, and for students in data science courses, on how to build better data analyses.

    Impact of Disclosing Uncertainty on Decision Making


    Randomized controlled trial: Quantifying the impact of disclosing uncertainty on adherence to hypothetical health recommendations.

    We conducted a randomized controlled trial to assess whether disclosing elements of uncertainty in an initial public health statement would change the likelihood that participants would accept new, different advice that arises as more evidence is uncovered. Proportional odds models were fit, stratified by the baseline likelihood of agreeing with the final advice. 298 participants were randomized to the treatment arm and 298 to the control arm. Among participants who were more likely to agree with the final recommendation at baseline, those who were initially shown uncertainty had 46% lower odds of being more likely to agree with the final recommendation compared to those who were not (OR: 0.54, 95% CI: 0.27-1.03). Among participants who were less likely to agree with the final recommendation at baseline, those who were initially shown uncertainty had 1.61 times the odds of being more likely to agree with the final recommendation compared to those who were not (OR: 1.61, 95% CI: 1.15-2.25). These findings have implications for public health leaders deciding how to communicate a recommendation, suggesting that communicating uncertainty influences whether someone will adhere to a future recommendation.
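
    For readers who want to see the shape of such an analysis, below is a minimal sketch of fitting a proportional odds model within each baseline stratum using MASS::polr. The data frame and variable names (trial_data, agree_final, arm, baseline_stratum) are hypothetical placeholders, not the study's actual data or code.

        library(MASS)

        # agree_final: ordered factor for likelihood of agreeing with the final advice
        # arm: factor with levels c("control", "uncertainty") from randomization
        fit_stratum <- function(d) {
          fit <- polr(agree_final ~ arm, data = d, Hess = TRUE)
          est <- unname(coef(fit)["armuncertainty"])
          se  <- sqrt(vcov(fit)["armuncertainty", "armuncertainty"])
          exp(c(OR = est, lower = est - 1.96 * se, upper = est + 1.96 * se))
        }

        # One odds ratio (with 95% CI) per baseline stratum, as in the abstract
        lapply(split(trial_data, trial_data$baseline_stratum), fit_stratum)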

    Medicine is a data science, we should teach like it

    Medicine has always been a data science. Collecting and interpreting data is a key component of every interaction between physicians and patients. Data can be anything from blood pressure measurements at a yearly exam to complex radiology images interpreted by experts or algorithms. Interpreting these uncertain data for accurate diagnosis, management, and care is a critical component of every physician's daily life. The intimate relationship between data science and medicine is apparent in the pages of our most prominent medical journals. Using PubMed, we pulled the abstracts of all papers published in The New England Journal of Medicine, JAMA, Nature Medicine, The Lancet, PLoS Medicine, and BMJ from 2010 through March 2019. We then searched the text of these abstracts for a list of statistical terms. Across these 12,281 abstracts, a median of 50% (IQR 30%, 67%) of sentences contained a term that would require statistical training to understand.
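
    The screening step described here can be sketched in a few lines of R. The term list and the abstracts object below are hypothetical placeholders; the published analysis used its own curated list of statistical terms.

        # For each abstract, the share of sentences containing at least one term,
        # then the quartiles (median and IQR) across abstracts.
        stat_terms <- c("confidence interval", "odds ratio", "hazard ratio",
                        "p value", "regression", "randomized")   # illustrative only
        pattern <- paste(stat_terms, collapse = "|")

        prop_stat_sentences <- function(abstract) {
          sentences <- unlist(strsplit(abstract, "(?<=[.!?])\\s+", perl = TRUE))
          mean(grepl(pattern, tolower(sentences)))
        }

        props <- vapply(abstracts, prop_stat_sentences, numeric(1))  # abstracts: character vector
        quantile(props, probs = c(0.25, 0.5, 0.75))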

    ropensci/rdhs: rdhs 0.6.0

    Accepted on CRAN: https://cran.r-project.org/package=rdhs