Confidence Intervals for Maximin Effects in Inhomogeneous Large-Scale Data
One challenge of large-scale data analysis is that the assumption of an
identical distribution for all samples is often not realistic. An optimal
linear regression might, for example, be markedly different for distinct groups
of the data. Maximin effects have been proposed as a computationally attractive
way to estimate effects that are common across all data without fitting a
mixture distribution explicitly. So far, only point estimators of the common
maximin effects have been proposed, in Meinshausen and Bühlmann (2014). Here
we propose asymptotically valid confidence regions for these effects.
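The underlying point estimator can be sketched as maximin aggregation: take a convex combination of per-group coefficient vectors that minimizes a quadratic form over the probability simplex. The sketch below is illustrative only (function name and interface are ours, not from the paper), assuming per-group coefficients and a common predictor covariance are already estimated:

```python
import numpy as np
from scipy.optimize import minimize

def magging(B, Sigma):
    """Maximin aggregation sketch: find the convex combination w of the
    per-group coefficient vectors (columns of B) minimizing the quadratic
    form w' B' Sigma B w over the probability simplex; return B @ w.

    B     : (p, G) array, one estimated coefficient vector per group
    Sigma : (p, p) covariance matrix of the predictors
    """
    G = B.shape[1]
    H = B.T @ Sigma @ B
    res = minimize(
        lambda w: w @ H @ w,
        x0=np.full(G, 1.0 / G),          # start at the uniform combination
        bounds=[(0.0, 1.0)] * G,          # weights stay in [0, 1]
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return B @ res.x
```

With opposing group effects (e.g. +1 and -1 on a single predictor), the aggregated effect shrinks toward zero, reflecting that no effect is common to all groups.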
Distributionally robust and generalizable inference
We discuss recently developed methods that quantify the stability and
generalizability of statistical findings under distributional changes. In many
practical problems, the data is not drawn i.i.d. from the target population.
For example, unobserved sampling bias, batch effects, or unknown associations
might inflate the variance compared to i.i.d. sampling. For reliable
statistical inference, it is thus necessary to account for these types of
variation. We discuss and review two methods that allow quantifying
distribution stability based on a single dataset. The first method computes the
sensitivity of a parameter under worst-case distributional perturbations to
understand which types of shift pose a threat to external validity. The second
method treats distributional shifts as random, which allows assessing average
robustness (instead of worst-case robustness). Based on a stability analysis of
multiple estimators on a single dataset, it integrates both sampling and
distributional uncertainty into a single confidence interval.
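As a toy instance of the first (worst-case) idea, one can ask how far a sample mean can move when each observation's weight (likelihood ratio) is bounded. This simple perturbation model is our illustration, not the estimator studied in the paper:

```python
import numpy as np

def worst_case_mean(x, rho=2.0):
    """Largest reweighted mean when each sample's likelihood ratio is
    constrained to [1/rho, rho].  The optimum is bang-bang: weight rho on
    the largest values and 1/rho on the rest, so we search the threshold."""
    xs = np.sort(x)[::-1]  # descending
    n = len(xs)
    best = -np.inf
    for k in range(n + 1):
        # weight rho on the k largest observations, 1/rho on the others
        w = np.concatenate([np.full(k, rho), np.full(n - k, 1.0 / rho)])
        best = max(best, (w * xs).sum() / w.sum())
    return best
```

The gap between `worst_case_mean(x, rho)` and the plain mean indicates how sensitive the parameter is to distributional perturbations of bounded strength.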
backShift: Learning causal cyclic graphs from unknown shift interventions
We propose a simple method to learn linear causal cyclic models in the
presence of latent variables. The method relies on equilibrium data of the
model recorded under a specific kind of interventions ("shift interventions").
The location and strength of these interventions do not have to be known and
can be estimated from the data. Our method, called backShift, only uses second
moments of the data and performs simple joint matrix diagonalization, applied
to differences between covariance matrices. We give a necessary and sufficient
condition for identifiability of the system, which is fulfilled almost surely
under some quite general assumptions if and only if there are at least three
distinct experimental settings, one of which can be purely observational. We
demonstrate the performance on some simulated data and applications in flow
cytometry and financial time series. The code is made available as the R
package backShift.
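backShift itself performs a general, non-orthogonal joint matrix diagonalization; as a toy illustration of the joint-diagonalization step, consider the simplified case where the covariance differences share an orthogonal eigenbasis exactly. Then the eigenvectors of a random linear combination diagonalize all of them at once (function name is ours):

```python
import numpy as np

def shared_eigenbasis(deltas, seed=0):
    """Toy joint diagonalization for symmetric matrices that share an
    eigenbasis: eigenvectors of a random linear combination diagonalize
    every matrix in the set simultaneously.  (backShift uses a more
    general non-orthogonal joint diagonalizer.)"""
    rng = np.random.default_rng(seed)
    combo = sum(rng.standard_normal() * D for D in deltas)
    _, V = np.linalg.eigh(combo)  # orthonormal eigenvectors as columns
    return V
```

Applied to differences between interventional and observational covariance matrices, the recovered basis carries the structural information the method exploits.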
One estimator, many estimands: fine-grained quantification of uncertainty using conditional inference
Statistical uncertainty has many components, such as measurement errors,
temporal variation, or sampling. Not all of these sources are relevant when
considering a specific application, since practitioners might view some
attributes of observations as fixed.
We study the statistical inference problem arising when data is drawn
conditionally on some attributes. These attributes are assumed to be sampled
from a super-population but viewed as fixed when conducting uncertainty
quantification. The estimand is thus defined as the parameter of a conditional
distribution. We propose methods to construct conditionally valid p-values and
confidence intervals for these conditional estimands based on asymptotically
linear estimators.
In this setting, a given estimator is conditionally unbiased for potentially
many conditional estimands, which can be seen as parameters of different
populations. Testing different populations raises questions of multiple
testing. We discuss simple procedures that control novel conditional error
rates. In addition, we introduce a bias correction technique that enables
transfer of estimators across conditional distributions arising from the same
super-population. This can be used to infer parameters and estimators on future
datasets based on some new data.
The validity and applicability of the proposed methods are demonstrated on
simulated and real-world data.
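A familiar special case of inference conditional on attributes is linear regression with the design matrix treated as fixed: only the noise, not the covariates, is considered random. A minimal sketch of that special case (not the paper's general procedure):

```python
import numpy as np
from scipy import stats

def conditional_ols_ci(X, y, j=0, level=0.95):
    """Confidence interval for coefficient j of an OLS fit, conditioning
    on the observed design X: uncertainty comes only from the noise."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)              # conditional noise variance
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[j, j])
    q = stats.t.ppf(0.5 + level / 2.0, df=n - p)  # two-sided t quantile
    return beta[j] - q * se, beta[j] + q * se
```

The paper's methods generalize this idea to conditional estimands defined through arbitrary asymptotically linear estimators.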
Causal aggregation: estimation and inference of causal effects by constraint-based data fusion
In causal inference, it is common to estimate the causal effect of a single
treatment variable on an outcome. However, practitioners may also be interested
in the effect of simultaneous interventions on multiple covariates of a fixed
target variable. We propose a novel method that allows estimation of the
effect of joint interventions using data from different experiments in which
only very few variables are manipulated. If there is little or no randomized
data, one can use observational data sets if certain parental
sets are known or instrumental variables are available. If the joint causal
effect is linear, the proposed method can be used for estimation and inference
of joint causal effects, and we characterize conditions for identifiability. In
the overidentified case, we indicate how to leverage all the available causal
information across multiple data sets to efficiently estimate the causal
effects. If the dimension of the covariate vector is large, we may only have a
few samples in each data set. Under a sparsity assumption, we derive an
estimator of the causal effects in this high-dimensional scenario. In addition,
we show how to deal with the case where a lack of experimental constraints
prevents direct estimation of the causal effects. When the joint causal effects
are non-linear, we characterize conditions under which identifiability holds,
and propose a non-linear causal aggregation methodology for experimental data
sets similar to the gradient boosting algorithm where in each iteration we
combine weak learners trained on different datasets using only unconfounded
samples. We demonstrate the effectiveness of the proposed method on simulated
and semi-synthetic data.
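In the linear, overidentified case the aggregation idea can be caricatured as follows: each dataset contributes linear constraints on the joint causal effect, and the stacked system is solved jointly. This is a deliberately simplified sketch (the paper additionally weights the constraints for efficiency):

```python
import numpy as np

def aggregate_effects(constraints):
    """Each experiment k contributes linear constraints A_k @ beta = b_k
    on the joint causal effect beta (e.g. from the few variables it
    randomizes); stack all constraints and solve by least squares."""
    A = np.vstack([A_k for A_k, _ in constraints])
    b = np.concatenate([b_k for _, b_k in constraints])
    beta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return beta
```

When the stacked system has full column rank, the joint effect is identified even though no single experiment manipulates all covariates at once.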
Guilt in voting and public good games
This paper analyzes how moral costs affect individual support of morally difficult group decisions. We study a threshold public good game with moral costs. Motivated by recent empirical findings, we assume that these costs are heterogeneous and consist of three parts. The first one is a standard cost term. The second, shared guilt, decreases in the number of supporters. The third hinges on the notion of being pivotal. We analyze equilibrium predictions, isolate the causal effects of guilt sharing, and compare results to standard utilitarian and non-consequentialist approaches. As interventions, we study information release, feedback, and fostering individual moral standards.
Learning under random distributional shifts
Many existing approaches for generating predictions in settings with
distribution shift model distribution shifts as adversarial or low-rank in
suitable representations. In various real-world settings, however, we might
expect shifts to arise through the superposition of many small and random
changes in the population and environment. Thus, we consider a class of random
distribution shift models that capture arbitrary changes in the underlying
covariate space, and dense, random shocks to the relationship between the
covariates and the outcomes. In this setting, we characterize the benefits and
drawbacks of several alternative prediction strategies: the standard approach
that directly predicts the long-term outcome of interest, the proxy approach
that directly predicts a shorter-term proxy outcome, and a hybrid approach that
utilizes both the long-term policy outcome and (shorter-term) proxy outcome(s).
We show that the hybrid approach is robust to the strength of the distribution
shift and the proxy relationship. We apply this method to datasets in two
high-impact domains: asylum-seeker assignment and early childhood education. In
both settings, we find that the proposed approach results in substantially
lower mean-squared error than current approaches.
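The hybrid strategy can be sketched as choosing a mixing weight between the long-term and proxy-based predictions on held-out data. This grid-search sketch is our illustration; the paper derives the combination from the random-shift model itself:

```python
import numpy as np

def best_mixing_weight(f_long, f_proxy, y, grid=None):
    """Grid-search the weight a in [0, 1] so that the hybrid prediction
    a * f_long + (1 - a) * f_proxy has minimal MSE on held-out outcomes y."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    mse = [np.mean((a * f_long + (1 - a) * f_proxy - y) ** 2) for a in grid]
    return float(grid[int(np.argmin(mse))])
```

When the long-term predictor is accurate the weight moves toward 1; when the proxy carries most of the signal it moves toward 0, giving robustness to both the shift strength and the proxy relationship.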