A Primer on Causality in Data Science
Many questions in Data Science are fundamentally causal in that our objective
is to learn the effect of some exposure, randomized or not, on an outcome of
interest. Even studies that are seemingly non-causal, such as those with the
goal of prediction or prevalence estimation, have causal elements, including
differential censoring or measurement. As a result, we, as Data Scientists,
need to consider the underlying causal mechanisms that gave rise to the data,
rather than simply the pattern or association observed in those data. In this
work, we review the 'Causal Roadmap' of Petersen and van der Laan (2014) to
provide an introduction to some key concepts in causal inference. Similar to
other causal frameworks, the steps of the Roadmap include clearly stating the
scientific question, defining the causal model, translating the scientific
question into a causal parameter, assessing the assumptions needed to express
the causal parameter as a statistical estimand, implementing statistical
estimators, including parametric and semi-parametric methods, and interpreting
our findings. We believe that using such a framework in Data Science will
help to ensure that our statistical analyses are guided by the scientific
question driving our research, while avoiding over-interpreting our results. We
focus on the effect of an exposure occurring at a single time point and
highlight the use of targeted maximum likelihood estimation (TMLE) with Super
Learner.
Comment: 26 pages (with references); 4 figures
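The estimation step of the Roadmap can be conveyed with a minimal doubly robust (AIPW) sketch on simulated data. This is a simplified stand-in, not the TMLE/Super Learner pipeline the paper highlights: the data-generating process is ours, and for clarity the true nuisance functions are plugged in rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated point-treatment data: confounder W drives both exposure A and outcome Y.
W = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-0.5 * W))            # true propensity score P(A=1 | W)
A = rng.binomial(1, p)
Y = A + 2.0 * W + rng.normal(size=n)          # true average treatment effect: 1.0

# Doubly robust (AIPW) estimate of the ATE. In practice both nuisance
# functions would be estimated (e.g. with Super Learner), and TMLE would
# add a targeting step; here the true ones are used for illustration.
mu1 = 1.0 + 2.0 * W                           # E[Y | A=1, W]
mu0 = 2.0 * W                                 # E[Y | A=0, W]
ate = np.mean(mu1 - mu0
              + A / p * (Y - mu1)
              - (1 - A) / (1 - p) * (Y - mu0))
print(round(ate, 2))
```

The estimate recovers the simulated effect of 1.0; the augmentation terms correct the plug-in difference of outcome regressions using the inverse-propensity-weighted residuals.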
Marginal integration for nonparametric causal inference
We consider the problem of inferring the total causal effect of a single
variable intervention on a (response) variable of interest. We propose a
certain marginal integration regression technique for a very general class of
potentially nonlinear structural equation models (SEMs) with known structure,
or at least known superset of adjustment variables: we call the procedure
S-mint regression. We show that it achieves the same convergence rate as
nonparametric regression: for example, single variable intervention effects
can be estimated at rate n^{-2/5}, assuming smoothness with twice
differentiable functions. Our result can also be seen as a major
robustness property with respect to model misspecification which goes much
beyond the notion of double robustness. Furthermore, when the structure of the
SEM is not known, we can estimate (the equivalence class of) the directed
acyclic graph corresponding to the SEM, and then proceed by using S-mint based
on these estimates. We empirically compare the S-mint regression method with
more classical approaches and argue that the former is indeed more robust, more
reliable, and substantially simpler.
Comment: 40 pages, 14 figures
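The marginal integration idea can be sketched as follows: regress the response nonparametrically on the intervention variable and the adjustment variables, then average the fitted surface over the empirical distribution of the adjustment variables. This toy version, with a simulated SEM of our own, a plain Nadaraya-Watson smoother, and an arbitrary bandwidth, is only a rough illustration of the S-mint procedure, not its implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Toy nonlinear SEM with confounder Z:  X := 0.5*Z + eps,  Y := sin(X) + Z^2 + eps.
Z = rng.normal(size=n)
X = 0.5 * Z + rng.normal(size=n)
Y = np.sin(X) + Z**2 + 0.3 * rng.normal(size=n)

def nw(x, z, h=0.3):
    """Nadaraya-Watson estimate of E[Y | X=x, Z=z] with a Gaussian product kernel."""
    w = np.exp(-((X - x) ** 2 + (Z - z) ** 2) / (2 * h ** 2))
    return np.sum(w * Y) / np.sum(w)

def s_mint(x):
    """Marginal integration: average the fitted surface over the empirical Z."""
    return np.mean([nw(x, z) for z in Z])

# True interventional mean: E[Y | do(X=0)] = sin(0) + E[Z^2] = 1.
print(round(s_mint(0.0), 2))
```

The averaged surface approximates the interventional mean E[Y | do(X=0)] = 1 up to smoothing bias; choosing the bandwidth and kernel carefully is exactly where the theory of the paper comes in.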
High-dimensional regression adjustments in randomized experiments
We study the problem of treatment effect estimation in randomized experiments
with high-dimensional covariate information, and show that essentially any
risk-consistent regression adjustment can be used to obtain efficient estimates
of the average treatment effect. Our results considerably extend the range of
settings where high-dimensional regression adjustments are guaranteed to
provide valid inference about the population average treatment effect. We then
propose cross-estimation, a simple method for obtaining finite-sample-unbiased
treatment effect estimates that leverages high-dimensional regression
adjustments. Our method can be used when the regression model is estimated
using the lasso, the elastic net, subset selection, etc. Finally, we extend our
analysis to allow for adaptive specification search via cross-validation, and
flexible non-parametric regression adjustments with machine learning methods
such as random forests or neural networks.
Comment: To appear in the Proceedings of the National Academy of Sciences. The
present draft does not reflect final copyediting by the PNAS staff.
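The cross-estimation idea can be sketched in a few lines: fit the regression adjustment on one fold, apply it to the held-out fold, and average across folds, so the adjustment never sees the data it corrects. The sketch below uses plain OLS where the paper allows the lasso, elastic net, or nonparametric learners; the simulated design is our own.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 20

X = rng.normal(size=(n, d))
A = rng.binomial(1, 0.5, size=n)                 # randomized assignment
beta = rng.normal(size=d)
Y = 2.0 * A + X @ beta + rng.normal(size=n)      # true ATE = 2.0

def adjusted_ate(train, test):
    """Fit the covariate adjustment on `train`, estimate the ATE on `test`."""
    coef, *_ = np.linalg.lstsq(X[train], Y[train], rcond=None)
    resid = Y[test] - X[test] @ coef             # covariate-adjusted outcomes
    return resid[A[test] == 1].mean() - resid[A[test] == 0].mean()

half = np.arange(n) < n // 2
ate = 0.5 * (adjusted_ate(half, ~half) + adjusted_ate(~half, half))
print(round(ate, 2))
```

Because the fold used for fitting is disjoint from the fold used for estimation, randomization of A keeps the adjusted difference in means unbiased regardless of how well the regression is specified.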
Non-Parametric Causality Detection: An Application to Social Media and Financial Data
According to behavioral finance, stock market returns are influenced by
emotional, social and psychological factors. Several recent works support this
theory by providing evidence of correlation between stock market prices and
collective sentiment indexes measured using social media data. However, a pure
correlation analysis is not sufficient to prove that stock market returns are
influenced by such emotional factors since both stock market prices and
collective sentiment may be driven by a third unmeasured factor. Controlling
for factors that could influence the study by applying multivariate regression
models is challenging given the complexity of stock market data. False
assumptions about the linearity or non-linearity of the model and inaccuracies
in model specification may result in misleading conclusions.
In this work, we propose a novel framework for causal inference that does not
require any assumption about the statistical relationships among the variables
of the study and can effectively control a large number of factors. We apply
our method in order to estimate the causal impact that information posted in
social media may have on stock market returns of four big companies. Our
results indicate that social media data not only correlate with stock market
returns but also influence them.
Comment: Physica A: Statistical Mechanics and its Applications
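The flavor of such a test can be conveyed with a generic permutation test for lagged dependence on synthetic series. This is not the paper's framework, only an illustration of the model-free principle: compute a statistic linking today's sentiment to tomorrow's return, then compare it against a null built by shuffling.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Toy series: today's "sentiment" s moves tomorrow's "return" r.
s = rng.normal(size=n)
r = np.empty(n)
r[0] = rng.normal()
r[1:] = 0.5 * s[:-1] + rng.normal(size=n - 1)

def lagged_stat(sent, ret):
    """Absolute correlation between sentiment and next-day returns."""
    return abs(np.corrcoef(sent[:-1], ret[1:])[0, 1])

obs = lagged_stat(s, r)
# Permutation null: shuffling sentiment destroys any temporal link. (For
# autocorrelated real series, block permutations would be needed instead.)
null = np.array([lagged_stat(rng.permutation(s), r) for _ in range(500)])
p_value = float(np.mean(null >= obs))
print(round(obs, 2), p_value)
```

No linearity or distributional assumption enters the null distribution itself, which is the point the abstract makes against misspecified multivariate regressions; the correlation statistic here is, of course, still a linear choice and would be replaced in practice.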
Causal inference methods for combining randomized trials and observational studies: a review
With increasing data availability, causal treatment effects can be evaluated
across different datasets, both randomized controlled trials (RCTs) and
observational studies. RCTs isolate the effect of the treatment from that of
unwanted (confounding) co-occurring effects. But they may struggle with
inclusion biases, and thus lack external validity. On the other hand, large
observational samples are often more representative of the target population
but can conflate confounding effects with the treatment of interest. In this
paper, we review the growing literature on methods for causal inference on
combined RCTs and observational studies, striving for the best of both worlds.
We first discuss identification and estimation methods that improve
generalizability of RCTs using the representativeness of observational data.
Classical estimators include weighting, difference between conditional outcome
models, and doubly robust estimators. We then discuss methods that combine RCTs
and observational data to improve (conditional) average treatment effect
estimation, handling possible unmeasured confounding in the observational data.
We also connect and contrast works developed in both the potential outcomes
framework and the structural causal model framework. Finally, we compare the
main methods using a simulation study and real world data to analyze the effect
of tranexamic acid on the mortality rate in major trauma patients. Code to
implement many of the methods is provided.
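The weighting estimator mentioned among the classical approaches can be sketched with inverse probability of sampling weighting (IPSW): reweight trial units by the target-to-trial covariate density ratio so the RCT estimate transports to the target population. The simulation and the use of a known density ratio are our simplifications; in practice the ratio is estimated, e.g. by a logistic regression of sample membership.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# RCT sample with inclusion bias: X ~ N(0,1) in the trial, while the target
# population has X ~ N(1,1). The CATE is 1 + X, so the trial ATE is 1.0 but
# the target-population ATE is 2.0.
x = rng.normal(0.0, 1.0, n)
a = rng.binomial(1, 0.5, n)
y = (1.0 + x) * a + x + rng.normal(size=n)

naive = y[a == 1].mean() - y[a == 0].mean()      # valid only for the trial sample

def normal_pdf(v, m, s):
    return np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# IPSW: weight each trial unit by the target-to-trial density ratio.
w = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 1.0)
ipsw = (np.sum(w * a * y) / np.sum(w * a)
        - np.sum(w * (1 - a) * y) / np.sum(w * (1 - a)))
print(round(naive, 2), round(ipsw, 2))
```

The naive contrast recovers the trial ATE near 1.0, while the weighted contrast moves toward the target-population ATE of 2.0, illustrating why generalizability corrections matter when the CATE varies with covariates that differ between samples.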
Efficient Adjustment Sets for Population Average Causal Treatment Effect Estimation in Graphical Models
The method of covariate adjustment is often used for estimation of total
treatment effects from observational studies. Restricting attention to causal
linear models, a recent article (Henckel et al., 2019) derived two novel
graphical criteria: one to compare the asymptotic variance of linear regression
treatment effect estimators that control for certain distinct adjustment sets,
and another to identify the optimal adjustment set that yields the least
squares estimator with the smallest asymptotic variance. In this paper we show
that the same graphical criteria can be used in non-parametric causal graphical
models when treatment effects are estimated using non-parametrically adjusted
estimators of the interventional means. We also provide a new graphical
criterion for determining the optimal adjustment set among the minimal
adjustment sets, and another novel graphical criterion for comparing
time-dependent adjustment sets. We show that uniformly optimal time-dependent
adjustment sets do not always exist. For point interventions, we provide a
sound and complete graphical criterion for determining when a non-parametric
optimally adjusted estimator of an interventional mean, or of a contrast of
interventional means, is semiparametric efficient under the non-parametric
causal graphical model. In addition, when the criterion is not met, we provide
a sound algorithm that checks for possible simplifications of the efficient
influence function of the parameter. Finally, we find an interesting connection
between identification and efficient covariate adjustment estimation.
Specifically, we show that if there exists an identifying formula for an
interventional mean that depends only on treatment, outcome, and mediators,
then the non-parametric optimally adjusted estimator can never be globally
efficient under the causal graphical model.
Fil: Rotnitzky, Andrea Gloria. Consejo Nacional de Investigaciones Científicas
y Técnicas; Argentina. Universidad Torcuato Di Tella. Departamento de
Economía; Argentina.
Fil: Smucler, Ezequiel. Consejo Nacional de Investigaciones Científicas y
Técnicas; Argentina. Universidad Torcuato Di Tella. Departamento de Economía;
Argentina.
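The variance phenomenon behind these criteria can be illustrated in the simplest linear SEM. Both adjustment sets below yield unbiased OLS estimates, but adjusting for a parent of the outcome shrinks the residual variance, while adjusting for a parent of the treatment only eats up treatment variation. The SEM and sample sizes are our own toy choices, not the paper's examples.

```python
import numpy as np

rng = np.random.default_rng(5)

def ols_effects(adjust, reps=2000, n=500):
    """OLS treatment-effect estimates in the linear SEM
       A := Z1 + eps_A,  Y := A + Z2 + eps_Y,  with Z1, Z2 independent."""
    est = np.empty(reps)
    for i in range(reps):
        z1 = rng.normal(size=n)
        z2 = rng.normal(size=n)
        a = z1 + rng.normal(size=n)
        y = a + z2 + rng.normal(size=n)
        covariate = z1 if adjust == "z1" else z2
        design = np.column_stack([a, covariate, np.ones(n)])
        est[i] = np.linalg.lstsq(design, y, rcond=None)[0][0]
    return est

e_z1 = ols_effects("z1")   # adjust for a parent of the treatment
e_z2 = ols_effects("z2")   # adjust for a parent of the outcome
print(round(e_z1.var() / e_z2.var(), 1))
```

Both estimators are centered on the true effect of 1, yet the outcome-parent adjustment is markedly more precise, the kind of comparison the graphical criteria formalize and extend to non-parametric estimators.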
Causal Inference and Data-Fusion in Econometrics
Learning about cause and effect is arguably the main goal in applied
econometrics. In practice, the validity of these causal inferences is
contingent on a number of critical assumptions regarding the type of data that
has been collected and the substantive knowledge that is available. For
instance, unobserved confounding factors threaten the internal validity of
estimates, data availability is often limited to non-random, selection-biased
samples, causal effects need to be learned from surrogate experiments with
imperfect compliance, and causal knowledge has to be extrapolated across
structurally heterogeneous populations. A powerful causal inference framework
is required to tackle these challenges, which plague most data analysis to
varying degrees. Building on the structural approach to causality introduced by
Haavelmo (1943) and the graph-theoretic framework proposed by Pearl (1995), the
artificial intelligence (AI) literature has developed a wide array of
techniques for causal learning that make it possible to leverage information
from various imperfect, heterogeneous, and biased data sources (Bareinboim and
Pearl, 2016).
In this paper, we discuss recent advances in this literature that have the
potential to contribute to econometric methodology along three dimensions.
First, they provide a unified and comprehensive framework for causal inference,
in which the aforementioned problems can be addressed in full generality.
Second, due to their origin in AI, they come with sound, efficient, and
complete algorithmic criteria for automating the corresponding identification
task. And third, because of the nonparametric description of
structural models that graph-theoretic approaches build on, they combine the
strengths of both structural econometrics and the potential outcomes framework,
and thus offer a perfect middle ground between these two competing streams of
literature.
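A core identification tool from this graph-theoretic literature is Pearl's back-door adjustment, which handles exactly the unobserved-confounding threat mentioned above when a sufficient adjustment set is observed. The discrete simulation below is our own minimal example of the adjustment formula, not any estimator from the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Discrete SEM: confounder z raises both treatment uptake and the outcome.
z = rng.binomial(1, 0.5, n)
a = rng.binomial(1, 0.2 + 0.6 * z)
y = rng.binomial(1, 0.1 + 0.3 * a + 0.4 * z)     # true causal risk difference: 0.3

naive = y[a == 1].mean() - y[a == 0].mean()      # confounded contrast

# Back-door adjustment:  P(y=1 | do(a=a0)) = sum_z P(y=1 | a=a0, z) * P(z)
def do_prob(a0):
    return sum(y[(a == a0) & (z == zv)].mean() * (z == zv).mean()
               for zv in (0, 1))

adjusted = do_prob(1) - do_prob(0)
print(round(naive, 3), round(adjusted, 3))
```

The naive contrast overstates the effect because treated units disproportionately have z = 1; conditioning on z and re-averaging with the marginal P(z) recovers the interventional risk difference of 0.3.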