Causal inference is not a statistical problem
This paper introduces a collection of four data sets, similar to Anscombe's
Quartet, that aim to highlight the challenges involved when estimating causal
effects. Each of the four data sets is generated based on a distinct causal
mechanism: the first involves a collider, the second involves a confounder, the
third involves a mediator, and the fourth involves the induction of M-Bias by
an included factor. The paper includes a mathematical summary of each data set,
as well as directed acyclic graphs that depict the relationships between the
variables. Despite the fact that the statistical summaries and visualizations
for each data set are identical, the true causal effect differs, and estimating
it correctly requires knowledge of the data-generating mechanism. These example
data sets can help practitioners gain a better understanding of the assumptions
underlying causal inference methods and emphasize the importance of gathering
more information beyond what can be obtained from statistical tools alone. The
paper also includes R code for reproducing all figures and provides access to
the data sets themselves through an R package named quartets.
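For readers who want to explore these data, a minimal R sketch follows. The object names (causal_collider, causal_confounding, causal_mediator, causal_m_bias) and the exposure/outcome column names are assumptions about the package's conventions, so consult the quartets documentation before relying on them.

```r
# Minimal sketch (not from the paper's supplement): fit the same naive
# regression to each of the four causal data sets. Data set and column
# names are assumptions; check the quartets documentation for the
# actual objects.
# install.packages("quartets")
library(quartets)

quartet <- list(
  collider   = causal_collider,
  confounder = causal_confounding,
  mediator   = causal_mediator,
  m_bias     = causal_m_bias
)

# Identical-looking associations, but the correct causal estimate differs
lapply(quartet, function(d) coef(lm(outcome ~ exposure, data = d)))
```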
The 'Why' behind including 'Y' in your imputation model
Missing data is a common challenge when analyzing epidemiological data, and
imputation is often used to address this issue. Here, we investigate the
scenario where a covariate used in an analysis has missingness and will be
imputed. There are recommendations to include the outcome from the analysis
model in the imputation model for missing covariates, but it is not always
clear whether this recommendation holds, or why it is sometimes true. We
examine deterministic imputation (i.e., single imputation where the imputed
values are treated as fixed) and stochastic imputation (i.e., single imputation
with a random value or multiple imputation) methods and their implications for
estimating the relationship between the imputed covariate and the outcome. We
mathematically demonstrate that including the outcome variable in imputation
models is not just a recommendation but a requirement to achieve unbiased
results when using stochastic imputation methods. Moreover, we dispel common
misconceptions about deterministic imputation models and demonstrate why the
outcome should not be included in these models. This paper aims to bridge the
gap between imputation in theory and in practice, providing mathematical
derivations to explain common statistical recommendations. We offer a better
understanding of the considerations involved in imputing missing covariates and
emphasize when it is necessary to include the outcome variable in the
imputation model.
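To make the stochastic case concrete, here is an illustrative simulation (not the paper's derivation) using the mice package: excluding the outcome from the covariate's imputation model should attenuate the estimated slope, while including it should not.

```r
# Illustrative simulation (not the paper's derivation): stochastic multiple
# imputation of a covariate x with and without the outcome y in the
# imputation model, using the mice package.
library(mice)

set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)           # true slope is 2
x[sample(n, n / 3)] <- NA           # covariate missing completely at random
dat <- data.frame(x = x, y = y)

# Imputation model that includes y (the default predictor matrix)
imp_with_y <- mice(dat, m = 20, method = "norm", printFlag = FALSE)

# Imputation model that excludes y when imputing x
pred <- make.predictorMatrix(dat)
pred["x", "y"] <- 0
imp_without_y <- mice(dat, m = 20, method = "norm",
                      predictorMatrix = pred, printFlag = FALSE)

# The pooled slope should be near 2 with y included and attenuated without it
summary(pool(with(imp_with_y,  lm(y ~ x))))
summary(pool(with(imp_without_y, lm(y ~ x))))
```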
Design Principles for Data Analysis
The data science revolution has led to an increased interest in the practice
of data analysis. While much has been written about statistical thinking, a
complementary form of thinking that appears in the practice of data analysis is
design thinking -- the problem-solving process to understand the people for
whom a product is being designed. For a given problem, there can be significant
or subtle differences in how a data analyst (or producer of a data analysis)
constructs, creates, or designs a data analysis, including differences in the
choice of methods, tooling, and workflow. These choices can affect the data
analysis products themselves and the experience of the consumer of the data
analysis. Therefore, the role of a producer can be thought of as designing the
data analysis with a set of design principles. Here, we introduce design
principles for data analysis and describe how they can be mapped to data
analyses in a quantitative, objective and informative manner. We also provide
empirical evidence of variation of principles within and between both producers
and consumers of data analyses. Our work leads to two insights: it suggests a
formal mechanism to describe data analyses based on the design principles for
data analysis, and it provides a framework to teach students how to build data
analyses using formal design principles.
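As a toy illustration of what a quantitative mapping could look like, the sketch below scores two hypothetical analyses on a set of invented principle names; none of this is the paper's actual instrument.

```r
# Hypothetical illustration of scoring analyses on design principles;
# the principle names and 0-10 scores below are invented for this sketch.
principles <- c("data_matching", "exhaustive", "skeptical",
                "second_order", "clarity", "reproducible")

analysis_a <- c(8, 3, 6, 2, 9, 10)
analysis_b <- c(5, 7, 4, 6, 6, 8)

scores <- rbind(analysis_a, analysis_b)
colnames(scores) <- principles

# One simple quantitative comparison: per-principle differences
scores["analysis_a", ] - scores["analysis_b", ]
```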
Evaluating the Alignment of a Data Analysis between Analyst and Audience
A challenge that data analysts face is building a data analysis that is
useful for a given consumer. Previously, we defined a set of principles for
describing data analyses that can be used to create a data analysis and to
characterize the variation between analyses. Here, we introduce a concept that
we call the alignment of a data analysis between the data analyst and a
consumer. We define a successfully aligned data analysis as the matching of
principles between the analyst and the consumer for whom the analysis is
developed. In this paper, we propose a statistical model for evaluating the
alignment of a data analysis and describe some of its properties. We argue that
this framework provides a language for characterizing alignment and can be used
as a guide for practicing data scientists and students in data science courses
on how to build better data analyses.
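As an invented stand-in for such a model (not the authors' formulation), one could represent the analyst and the consumer as weight vectors over the same set of principles and summarize alignment by their agreement:

```r
# Invented stand-in for an alignment measure (not the paper's model):
# analyst and consumer each weight the same principles; alignment is
# summarized as the correlation between their weight vectors.
set.seed(2)
k <- 6                    # number of design principles
analyst  <- runif(k)      # analyst's principle weights
consumer <- runif(k)      # consumer's principle weights

alignment <- cor(analyst, consumer)
alignment
```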
Randomized controlled trial: Quantifying the impact of disclosing uncertainty on adherence to hypothetical health recommendations.
We conducted a randomized controlled trial to assess whether disclosing elements of uncertainty in an initial public health statement changes the likelihood that participants will accept new, different advice that arises as more evidence is uncovered. Proportional odds models were fit, stratified by the baseline likelihood to agree with the final advice. 298 participants were randomized to the treatment arm and 298 to the control arm. Among participants who were more likely to agree with the final recommendation at baseline, those who were initially shown uncertainty had 46% lower odds of being more likely to agree with the final recommendation compared to those who were not (OR: 0.54, 95% CI: 0.27-1.03). Among participants who were less likely to agree with the final recommendation at baseline, those who were initially shown uncertainty had 1.61 times the odds of being more likely to agree with the final recommendation compared to those who were not (OR: 1.61, 95% CI: 1.15-2.25). These findings have implications for public health leaders deciding how to communicate a recommendation, suggesting that communicating uncertainty influences whether someone will adhere to a future recommendation.
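For readers unfamiliar with the method named above, here is a minimal sketch of fitting a proportional odds model in R with MASS::polr; the variable names and simulated data are placeholders, not the trial data.

```r
# Minimal sketch of the analysis approach (simulated placeholder data,
# not the trial data): a proportional odds model fit with MASS::polr,
# stratified by baseline likelihood to agree.
library(MASS)

set.seed(3)
n <- 596
df <- data.frame(
  shown_uncertainty = rbinom(n, 1, 0.5),
  baseline_agree    = rbinom(n, 1, 0.5),
  # ordered outcome: likelihood of agreeing with the final advice
  agree_final = factor(sample(1:5, n, replace = TRUE), ordered = TRUE)
)

# Fit within one baseline stratum; exponentiate for the odds ratio
fit <- polr(agree_final ~ shown_uncertainty,
            data = subset(df, baseline_agree == 1), Hess = TRUE)
exp(coef(fit))       # odds ratio for being shown uncertainty
exp(confint(fit))    # 95% confidence interval
```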
Medicine is a data science, we should teach like it
Medicine has always been a data science. Collecting and interpreting data is a key component of every interaction between physicians and patients. Data can be anything from blood pressure measurements at a yearly exam to complex radiology images interpreted by experts or algorithms. Interpreting these uncertain data for accurate diagnosis, management, and care is a critical component of every physician’s daily life. The intimate relationship between data science and medicine is apparent in the pages of our most prominent medical journals. Using PubMed, we pulled the abstracts of all papers published in The New England Journal of Medicine, JAMA, Nature Medicine, The Lancet, PLoS Medicine, and BMJ from 2010 through March 2019. We then searched for a list of statistical terms in the text of these abstracts. For these 12,281 abstracts, a median of 50% (IQR: 30%, 67%) of sentences contained a term that would require statistical training to understand.
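The screening step can be sketched in a few lines of R; the term list and sentence splitter below are stand-ins for the authors' actual choices.

```r
# Outline of the abstract-screening step (the term list here is a
# stand-in for the authors' actual list of statistical terms).
stat_terms <- c("confidence interval", "odds ratio", "p-value",
                "regression", "randomized", "hazard ratio")
pattern <- paste(stat_terms, collapse = "|")

# Proportion of sentences in one abstract that contain a statistical term
prop_stat_sentences <- function(abstract) {
  sentences <- unlist(strsplit(abstract, "(?<=[.!?])\\s+", perl = TRUE))
  mean(grepl(pattern, sentences, ignore.case = TRUE))
}

abstracts <- c(
  "We performed a randomized trial. The odds ratio was 1.3 (95% CI 1.1-1.5).",
  "Patients recovered quickly. Care teams were satisfied."
)

# Median proportion across abstracts, mirroring the paper's summary
median(sapply(abstracts, prop_stat_sentences))
```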
ropensci/rdhs: rdhs 0.6.0
Accepted on CRAN: https://cran.r-project.org/package=rdhs