Synthetic learner: model-free inference on treatments over time
Understanding the effect of a particular treatment or policy is central to
many areas of interest, ranging from political economics and marketing to
health care and personalized treatment studies. In this paper, we develop a
non-parametric, model-free test for detecting the effects of treatment over
time that extends widely used Synthetic Control tests. The test is built on
counterfactual predictions arising from many learning algorithms. In the
Neyman-Rubin potential outcome framework with possible carry-over effects, we
show that the proposed test is asymptotically consistent for stationary,
beta-mixing processes. We do not assume that the class of learners necessarily
contains the correct model. We also discuss estimates of the average treatment
effect, and we provide regret bounds on the predictive performance. To the best
of our knowledge, this is the first set of results that allows, for example,
any random forest to be useful for provably valid statistical inference in the
Synthetic Control setting. In experiments, we show that our Synthetic Learner
is substantially more powerful than classical methods based on Synthetic
Control or Difference-in-Differences, especially in the presence of non-linear
outcome models.
A Statistical Perspective on Algorithmic Leveraging
One popular method for dealing with large-scale data sets is sampling. For
example, by using the empirical statistical leverage scores as an importance
sampling distribution, the method of algorithmic leveraging samples and
rescales rows/columns of data matrices to reduce the data size before
performing computations on the subproblem. This method has been successful in
improving computational efficiency of algorithms for matrix problems such as
least-squares approximation, least absolute deviations approximation, and
low-rank matrix approximation. Existing work has focused on algorithmic issues
such as worst-case running times and numerical issues associated with providing
high-quality implementations, but none of it addresses statistical aspects of
this method.
In this paper, we provide a simple yet effective framework to evaluate the
statistical properties of algorithmic leveraging in the context of estimating
parameters in a linear regression model with a fixed number of predictors. We
show that from the statistical perspective of bias and variance, neither
leverage-based sampling nor uniform sampling dominates the other. This result
is particularly striking, given the well-known result that, from the
algorithmic perspective of worst-case analysis, leverage-based sampling
provides uniformly superior worst-case algorithmic results, when compared with
uniform sampling. Based on these theoretical results, we propose and analyze
two new leveraging algorithms. A detailed empirical evaluation of existing
leverage-based methods as well as these two new methods is carried out on both
synthetic and real data sets. The empirical results indicate that our theory is
a good predictor of practical performance of existing and new leverage-based
algorithms and that the new algorithms achieve improved performance.Comment: 44 pages, 17 figure
Sparse regulatory networks
In many organisms the expression levels of each gene are controlled by the
activation levels of known "Transcription Factors" (TFs). A problem of
considerable interest is that of estimating the "Transcription Regulation
Networks" (TRN) relating the TFs and genes. While the expression levels of
genes can be observed, the activation levels of the corresponding TFs are
usually unknown, greatly increasing the difficulty of the problem. Based on
previous experimental work, it is often the case that partial information about
the TRN is available. For example, certain TFs may be known to regulate a given
gene or in other cases a connection may be predicted with a certain
probability. In general, the biology of the problem indicates there will be
very few connections between TFs and genes. Several methods have been proposed
for estimating TRNs. However, they all suffer from problems such as unrealistic
assumptions about prior knowledge of the network structure or computational
limitations. We propose a new approach that can directly utilize prior
information about the network structure in conjunction with observed gene
expression data to estimate the TRN. Our approach uses penalties on the
network to ensure a sparse structure. This has the advantage of being
computationally efficient as well as making many fewer assumptions about the
network structure. We use our methodology to construct the TRN for E. coli and
show that the estimate is biologically sensible and compares favorably with
previous estimates.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS350 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
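The core computational device, penalizing network edges to force sparsity, can be illustrated with a lasso-style regression of one gene's expression on candidate TF activities, fit by cyclic coordinate descent with soft-thresholding. This is a generic sketch on hypothetical data, not the paper's estimator: it assumes the TF activities are observed, which is precisely the simplification the paper avoids.

```python
import random

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent: minimize
    (1/2)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding coordinate j.
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            # Soft-threshold update: small correlations are set
            # exactly to zero, giving a sparse set of TF-gene edges.
            if rho > lam:
                beta[j] = (rho - lam) / z
            elif rho < -lam:
                beta[j] = (rho + lam) / z
            else:
                beta[j] = 0.0
    return beta

random.seed(2)
n, p = 100, 10                      # samples, candidate TFs
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
# Suppose only TFs 0 and 3 truly regulate this gene.
y = [2.0 * X[i][0] - 1.5 * X[i][3] + random.gauss(0, 0.3) for i in range(n)]

beta = lasso_cd(X, y, lam=10.0)
print([round(b, 2) for b in beta])
```

Most coefficients come out exactly zero, recovering a sparse edge set; prior biological knowledge of the kind the abstract mentions could be folded in by penalizing likely edges less than unlikely ones, though the weighting scheme here would be an assumption on our part.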