Data-driven Algorithms for Dimension Reduction in Causal Inference
In observational studies, the causal effect of a treatment may be confounded
with variables that are related to both the treatment and the outcome of
interest. In order to identify a causal effect, such studies often rely on the
unconfoundedness assumption, i.e., that all confounding variables are observed.
The choice of covariates to control for, which is primarily based on subject
matter knowledge, may result in a large covariate vector in the attempt to
ensure that unconfoundedness holds. However, including redundant covariates can
affect bias and efficiency of nonparametric causal effect estimators, e.g., due
to the curse of dimensionality. Data-driven algorithms for the selection of
sufficient covariate subsets are investigated. Under the assumption of
unconfoundedness the algorithms search for minimal subsets of the covariate
vector. Based, e.g., on the framework of sufficient dimension reduction or
kernel smoothing, the algorithms perform a backward elimination procedure
assessing the significance of each covariate. Their performance is evaluated in
simulations and an application using data from the Swedish Childhood Diabetes
Register is also presented.
Comment: 27 pages, 2 figures, 11 tables
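The backward elimination idea can be sketched in a few lines. The toy below substitutes an ordinary least squares significance test (with a normal approximation to the reference distribution) for the paper's sufficient-dimension-reduction and kernel-smoothing machinery, so the function name and the test itself are illustrative assumptions, not the authors' algorithm:

```python
import math
import numpy as np

def backward_eliminate(X, y, alpha=0.05):
    """Backward elimination over covariate columns of X: refit an OLS model
    of y on the remaining covariates and drop the least significant one
    until every remaining covariate is significant at level alpha.
    A parametric stand-in for the kernel-based tests in the paper."""
    keep = list(range(X.shape[1]))
    while keep:
        Z = np.column_stack([np.ones(len(y)), X[:, keep]])
        beta, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        df = len(y) - Z.shape[1]
        se = np.sqrt(resid @ resid / df * np.diag(np.linalg.inv(Z.T @ Z)))
        t = np.abs(beta / se)[1:]                      # skip the intercept
        pvals = np.array([math.erfc(v / math.sqrt(2)) for v in t])  # normal approx.
        j = int(np.argmax(pvals))                      # least significant covariate
        if pvals[j] <= alpha:
            break                                      # all remaining are significant
        keep.pop(j)
    return keep

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)  # only columns 0, 2 matter
selected = backward_eliminate(X, y)
print(selected)
```

With a strong signal in columns 0 and 2, those columns survive elimination while the noise covariates are typically dropped.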
Contrasting Identifying Assumptions of Average Causal Effects: Robustness and Semiparametric Efficiency
Semiparametric inference on average causal effects from observational data is
based on assumptions yielding identification of the effects. In practice,
several distinct identifying assumptions may be plausible; an analyst has to
make a delicate choice between these models. In this paper, we study three
identifying assumptions based on the potential outcome framework: the back-door
assumption, which uses pre-treatment covariates, the front-door assumption,
which uses mediators, and the two-door assumption using pre-treatment
covariates and mediators simultaneously. We provide the efficient influence
functions and the corresponding semiparametric efficiency bounds that hold
under these assumptions, and under their combinations. We demonstrate that none of
the identification models uniformly provides the most efficient estimation and
give conditions under which some bounds are lower than others. We show when
semiparametric estimating equation estimators based on influence functions
attain the bounds, and study the robustness of the estimators to
misspecification of the nuisance models. The theory is complemented with
simulation experiments on the finite sample behavior of the estimators. The
results obtained are relevant for an analyst facing a choice between several
plausible identifying assumptions and corresponding estimators. Our results
show that this choice implies a trade-off between efficiency and robustness to
misspecification of the nuisance models.
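For the back-door assumption, the efficient influence function leads to the familiar augmented inverse probability weighting (AIPW) estimating equation. The sketch below is a generic numpy illustration of that construction, not code from the paper; in practice the nuisance values would come from fitted propensity and outcome models rather than the true ones used here:

```python
import numpy as np

def aipw_backdoor(y, a, e_hat, m1_hat, m0_hat):
    """AIPW estimator of E[Y(1)] - E[Y(0)] under the back-door
    (unconfoundedness) assumption: outcome-model predictions plus
    inverse-probability-weighted residual corrections, averaged."""
    phi = (m1_hat - m0_hat
           + a / e_hat * (y - m1_hat)
           - (1 - a) / (1 - e_hat) * (y - m0_hat))
    est = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(len(y))   # plug-in standard error
    return est, se

# toy data with true average causal effect 1.0 and one confounder x
rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                 # true propensity score
a = rng.binomial(1, e)
y = a * 1.0 + x + rng.normal(size=n)
# use the true nuisance values as stand-ins for fitted models
est, se = aipw_backdoor(y, a, e, 1.0 + x, 0.0 + x)
print(est, se)
```

Because the influence-function correction has mean zero whenever either nuisance model is correct, this estimator is doubly robust, which is the robustness property the abstract refers to.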
Inverse probability of treatment weighting with generalized linear outcome models for doubly robust estimation
There are now many options for doubly robust estimation; however, there is a
concerning trend in the applied literature to believe that the combination of a
propensity score and an adjusted outcome model automatically results in a
doubly robust estimator and/or to misuse more complex established doubly robust
estimators. A simple alternative, canonical link generalized linear models
(GLM) fit via inverse probability of treatment (propensity score) weighted
maximum likelihood estimation followed by standardization (the g-formula) for
the average causal effect, is a doubly robust estimation method. Our aim is for
the reader not just to be able to use this method, which we refer to as IPTW
GLM, for doubly robust estimation, but to fully understand why it has the
doubly robust property. For this reason, we define clearly, and in multiple
ways, all concepts needed to understand the method and why it is doubly robust.
In addition, we want to make very clear that the mere combination of propensity
score weighting and an adjusted outcome model does not generally result in a
doubly robust estimator. Finally, we hope to dispel the misconception that one
can adjust for residual confounding remaining after propensity score weighting
by adjusting in the outcome model for what remains 'unbalanced' even when using
doubly robust estimators. We provide R code for our simulations and real
open-source data examples that can be followed step-by-step to use and
hopefully understand the IPTW GLM method. We also compare to a much
better-known but still simple doubly robust estimator.
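A minimal sketch of the IPTW GLM recipe, assuming a Gaussian identity-link outcome GLM (so the weighted fit reduces to weighted least squares) and a hand-rolled Newton-Raphson logistic propensity model; the names and simulated data are illustrative, not the authors' R code:

```python
import numpy as np

def fit_logistic(Z, a, iters=25):
    """Propensity model: logistic regression fit by Newton-Raphson."""
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Z @ beta))
        W = p * (1 - p)
        beta += np.linalg.solve(Z.T * W @ Z, Z.T @ (a - p))
    return 1 / (1 + np.exp(-Z @ beta))

def iptw_glm_ate(x, a, y):
    """IPTW GLM: fit a canonical-link (here Gaussian/identity) outcome GLM
    by inverse-probability-of-treatment weighted maximum likelihood, then
    standardize (g-formula) to get the average causal effect."""
    Z = np.column_stack([np.ones_like(y), x])
    e = fit_logistic(Z, a)
    w = a / e + (1 - a) / (1 - e)                        # IPT weights
    D = np.column_stack([np.ones_like(y), a, x])
    beta = np.linalg.solve(D.T * w @ D, D.T @ (w * y))   # weighted least squares
    D1 = D.copy(); D1[:, 1] = 1.0                        # everyone treated
    D0 = D.copy(); D0[:, 1] = 0.0                        # everyone untreated
    return (D1 @ beta - D0 @ beta).mean()                # standardized ATE

rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(size=n)
a = rng.binomial(1, 1 / (1 + np.exp(-0.8 * x))).astype(float)
y = 2.0 * a + x + rng.normal(size=n)                     # true ATE = 2
ate = iptw_glm_ate(x, a, y)
print(ate)
```

The weighting and the standardization are both essential here: it is the combination of IPT-weighted GLM fitting with the g-formula, not weighting plus an adjusted model alone, that delivers the double robustness the abstract describes.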
Propensity score weighting plus an adjusted proportional hazards model does not equal doubly robust away from the null
Recently it has become common for applied works to combine commonly used
survival analysis modeling methods, such as the multivariable Cox model, and
propensity score weighting with the intention of forming a doubly robust
estimator that is unbiased in large samples when either the Cox model or the
propensity score model is correctly specified. This combination does not, in
general, produce a doubly robust estimator, even after regression
standardization, when there is truly a causal effect. We demonstrate via
simulation this lack of double robustness for the semiparametric Cox model, the
Weibull proportional hazards model, and a simple proportional hazards flexible
parametric model, with both the latter models fit via maximum likelihood. We
provide a novel proof that the combination of propensity score weighting and a
proportional hazards survival model, fit either via full or partial likelihood,
is consistent under the null of no causal effect of the exposure on the outcome
under particular censoring mechanisms if either the propensity score or the
outcome model is correctly specified and contains all confounders. Given our
results suggesting that double robustness only exists under the null, we
outline two simple alternative estimators that are doubly robust for the
survival difference at a given time point (in the above sense), provided the
censoring mechanism can be correctly modeled, and one doubly robust method of
estimation for the full survival curve. We provide R code to use these
estimators for estimation and inference in the supplementary materials.
Introduction to statistical simulations in health research
In health research, statistical methods are frequently used to address a wide variety of research questions. For almost every analytical challenge, different methods are available. But how do we choose between different methods and how do we judge whether the chosen method is appropriate for our specific study? As in any science, experiments can be run in statistics to find out which methods should be used under which circumstances. The main objective of this paper is to demonstrate that simulation studies, that is, experiments investigating synthetic data with known properties, are an invaluable tool for addressing these questions. We aim to provide a first introduction to simulation studies for data analysts or, more generally, for researchers involved at different levels in the analyses of health data, who (1) may rely on simulation studies published in statistical literature to choose their statistical methods and who, thus, need to understand the criteria of assessing the validity and relevance of simulation results and their interpretation; and/or (2) need to understand the basic principles of designing statistical simulations in order to efficiently collaborate with more experienced colleagues or start learning to conduct their own simulations. We illustrate the implementation of a simulation study and the interpretation of its results through a simple example inspired by recent literature, which is completely reproducible using the R script available from online supplemental file 1.
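A bare-bones example of such a simulation study, assuming a simple confounded data-generating process with a known treatment effect and comparing an unadjusted difference in means against a covariate-adjusted estimator over repeated synthetic data sets (illustrative only, not the example from the paper):

```python
import numpy as np

def one_replication(rng, n=500):
    """Generate one synthetic data set with a known treatment effect (1.0)
    and a confounder x, then apply two competing estimators."""
    x = rng.normal(size=n)
    a = rng.binomial(1, 1 / (1 + np.exp(-x)))            # confounded treatment
    y = 1.0 * a + x + rng.normal(size=n)
    naive = y[a == 1].mean() - y[a == 0].mean()          # ignores confounding
    D = np.column_stack([np.ones(n), a, x])
    adjusted = np.linalg.lstsq(D, y, rcond=None)[0][1]   # adjusts for x
    return naive, adjusted

rng = np.random.default_rng(3)
reps = np.array([one_replication(rng) for _ in range(1000)])
bias = reps.mean(axis=0) - 1.0          # Monte Carlo bias of each estimator
print("naive bias:", round(bias[0], 2), " adjusted bias:", round(bias[1], 2))
```

Because the true effect is known by construction, the Monte Carlo bias of each estimator can be read off directly, which is exactly the kind of question a simulation study is designed to answer.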
Covariate selection and propensity score specification in causal inference
This thesis makes contributions to the statistical research field of causal inference in observational studies. The results obtained are directly applicable in many scientific fields where effects of treatments are investigated and yet controlled experiments are difficult or impossible to implement. In the first paper we define a partially specified directed acyclic graph (DAG) describing the independence structure of the variables under study. Using the DAG we show that, given that unconfoundedness holds, we can use the observed data to select minimal sets of covariates to control for. General covariate selection algorithms are proposed to target the defined minimal subsets. The results of the first paper are generalized in Paper II to include the presence of unobserved covariates. Moreover, the identification assumptions from the first paper are relaxed. To implement the covariate selection without parametric assumptions, we propose in Paper III the use of a model-free variable selection method from the framework of sufficient dimension reduction. The performance of the proposed selection methods is investigated by simulation. Additionally, we study finite sample properties of treatment effect estimators based on the selected covariate sets. In Paper IV we investigate misspecifications of parametric models of a scalar summary of the covariates, the propensity score. Motivated by common model specification strategies, we describe misspecifications of parametric models for which unbiased estimators of the treatment effect are available. Consequences of the misspecification for the efficiency of treatment effect estimators are also studied.