Adaptive Linear Estimating Equations
Sequential data collection has emerged as a widely adopted technique for
enhancing the efficiency of data gathering processes. Despite its advantages,
such a data collection mechanism often introduces complexities to the statistical
inference procedure. For instance, the ordinary least squares (OLS) estimator
in an adaptive linear regression model can exhibit non-normal asymptotic
behavior, posing challenges for accurate inference and interpretation. In this
paper, we propose a general method for constructing a debiased estimator that
remedies this issue. It builds on the idea of adaptive linear estimating
equations, and we establish theoretical guarantees of asymptotic normality,
supplemented by discussions on achieving near-optimal asymptotic variance. A
salient feature of our estimator is that, in the context of multi-armed bandits,
it retains the non-asymptotic performance of the least squares
estimator while also attaining asymptotic normality. Consequently, this
work helps connect two fruitful paradigms of adaptive inference: a)
non-asymptotic inference using concentration inequalities and b) asymptotic
inference via asymptotic normality. Comment: 16 pages, 3 figures
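To illustrate the inference problem the abstract describes (this is a sketch of the setting, not the paper's debiased estimator), the following simulation assumes a simple epsilon-greedy two-armed bandit with equal true arm means and tracks the sample-mean (OLS) estimate of one arm:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_bandit(T=500, eps=0.1):
    """Epsilon-greedy on two arms with equal true means (0).
    Returns the sample-mean (OLS) estimate of arm 0's mean."""
    rewards = [[], []]
    for t in range(T):
        if t < 2:
            a = t                       # pull each arm once to start
        elif rng.random() < eps:
            a = int(rng.integers(2))    # explore uniformly
        else:                           # exploit the better-looking arm
            a = int(np.mean(rewards[1]) > np.mean(rewards[0]))
        rewards[a].append(rng.normal(0.0, 1.0))
    return np.mean(rewards[0])

# Distribution of the estimator across independent replications
est = np.array([run_bandit() for _ in range(200)])
print(f"mean of arm-0 sample-mean estimates: {est.mean():+.3f}")
```

Under this kind of adaptive sampling the arm-mean estimates are typically biased (an arm that looks bad stops being sampled, so unlucky draws are never corrected), which is the distortion the paper's estimating-equation approach is designed to remove.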
Reward Imputation with Sketching for Contextual Batched Bandits
Contextual batched bandit (CBB) is a setting where a batch of rewards is
observed from the environment at the end of each episode, but the rewards of
the non-executed actions are unobserved, resulting in partial-information
feedback. Existing approaches for CBB often ignore the rewards of the
non-executed actions, leading to underutilization of feedback information. In
this paper, we propose an efficient approach called Sketched Policy Updating
with Imputed Rewards (SPUIR) that completes the unobserved rewards using
sketching, which approximates the full-information feedback. We formulate
reward imputation as an imputation-regularized ridge regression problem that
captures the feedback mechanisms of both executed and non-executed actions. To
reduce time complexity, we solve the regression problem using randomized
sketching. We prove that our approach achieves an instantaneous regret with
controllable bias and smaller variance than approaches without reward
imputation. Furthermore, our approach enjoys a sublinear regret bound against
the optimal policy. We also present two extensions, a rate-scheduled version
and a version for nonlinear rewards, making our approach more practical.
Experimental results show that SPUIR outperforms state-of-the-art baselines on
synthetic, public benchmark, and real-world datasets. Comment: Accepted by NeurIPS 202
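The computational trick, solving a ridge regression through a randomized sketch, can be illustrated generically. The sketch below uses a plain Gaussian sketching matrix and ordinary ridge regression rather than SPUIR's imputation-regularized objective, so it should be read as a minimal sketch-and-solve example, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(1)

def sketched_ridge(X, y, lam=1.0, m=200):
    """Sketch-and-solve ridge regression: compress n rows down to m
    rows with a Gaussian sketch S (E[S^T S] = I), then solve the
    small m-row ridge problem instead of the full n-row one."""
    n, d = X.shape
    S = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    Xs, ys = S @ X, S @ y
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ ys)

n, d = 2000, 10
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)

b_full = np.linalg.solve(X.T @ X + np.eye(d), X.T @ y)  # exact ridge
b_sk = sketched_ridge(X, y)
rel = np.linalg.norm(b_sk - b_full) / np.linalg.norm(b_full)
print(f"relative error of sketched solution: {rel:.3f}")
```

The solve now costs O(md^2 + mnd) instead of O(nd^2) per full pass over the data, at the price of a controllable approximation error that shrinks as the sketch size m grows.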
Statistical Limits of Adaptive Linear Models: Low-Dimensional Estimation and Inference
Estimation and inference in statistics pose significant challenges when data
are collected adaptively. Even in linear models, the Ordinary Least Squares
(OLS) estimator may fail to exhibit asymptotic normality for single coordinate
estimation and have inflated error. This issue is highlighted by a recent
minimax lower bound, which shows that the error of estimating a single
coordinate can be enlarged by a dimension-dependent multiple when data are allowed to
be arbitrarily adaptive, compared with the case when they are i.i.d. Our work
explores this striking difference in estimation performance between utilizing
i.i.d. and adaptive data. We investigate how the degree of adaptivity in data
collection impacts the performance of estimating a low-dimensional parameter
component in high-dimensional linear models. We identify conditions on the data
collection mechanism under which the estimation error for a low-dimensional
parameter component matches its counterpart in the i.i.d. setting, up to a
factor that depends on the degree of adaptivity. We show that OLS or OLS on
centered data can achieve this matching error. In addition, we propose a novel
estimator for single coordinate inference via solving a Two-stage Adaptive
Linear Estimating equation (TALE). Under a weaker form of adaptivity in data
collection, we establish an asymptotic normality property of the proposed
estimator. Comment: This paper is accepted at NeurIPS 202
An Experimental Design for Anytime-Valid Causal Inference on Multi-Armed Bandits
Typically, multi-armed bandit (MAB) experiments are analyzed at the end of
the study and thus require the analyst to specify a fixed sample size in
advance. However, in many online learning applications, it is advantageous to
continuously produce inference on the average treatment effect (ATE) between
arms as new data arrive and determine a data-driven stopping time for the
experiment. Existing work on continuous inference for adaptive experiments
assumes that the treatment assignment probabilities are bounded away from zero
and one, thus excluding nearly all standard bandit algorithms. In this work, we
develop the Mixture Adaptive Design (MAD), a new experimental design for
multi-armed bandits that enables continuous inference on the ATE with
guarantees on statistical validity and power for nearly any bandit algorithm.
On a high level, the MAD "mixes" a bandit algorithm of the user's choice with a
Bernoulli design through a deterministic tuning sequence that controls the
priority placed on the Bernoulli design as the sample size grows. We show
that, for suitable choices of this sequence, the MAD produces a confidence
sequence that is
asymptotically valid and guaranteed to shrink around the true ATE. We
empirically show that the MAD improves the coverage and power of ATE inference
in MAB experiments without significant losses in finite-sample reward.
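The mixing step itself is simple to sketch. In the toy version below, the decay rate of the tuning sequence (t to the power -0.2) is a hypothetical choice for illustration; the paper characterizes which deterministic sequences yield valid confidence sequences:

```python
import random

def mad_assign(bandit_action: int, t: int, n_arms: int = 2) -> tuple[int, bool]:
    """One assignment under a Mixture-Adaptive-Design-style scheme
    (sketch). With probability delta_t, assign uniformly at random
    (the Bernoulli design); otherwise follow the bandit algorithm's
    chosen action. delta_t here is illustrative, not the paper's
    prescribed sequence."""
    delta_t = t ** -0.2  # deterministic, slowly decaying sequence
    if random.random() < delta_t:
        return random.randrange(n_arms), True   # design (random) step
    return bandit_action, False                 # bandit step

random.seed(0)
# Fraction of steps that override the bandit over a 2000-step horizon
draws = [mad_assign(1, t)[1] for t in range(1, 2001)]
frac = sum(draws) / len(draws)
print(f"fraction of Bernoulli-design steps: {frac:.2f}")
```

Because the random-assignment probability decays but never hits zero, every arm keeps a known, strictly positive assignment probability at each step, which is what restores valid treatment-effect inference while letting the bandit dominate asymptotically.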
Clip-OGD: An Experimental Design for Adaptive Neyman Allocation in Sequential Experiments
From clinical development of cancer therapies to investigations into partisan
bias, adaptive sequential designs have become an increasingly popular method for
causal inference, as they offer the possibility of improved precision over
their non-adaptive counterparts. However, even in simple settings (e.g. two
treatments) the extent to which adaptive designs can improve precision is not
sufficiently well understood. In this work, we study the problem of Adaptive
Neyman Allocation in a design-based potential outcomes framework, where the
experimenter seeks to construct an adaptive design which is nearly as efficient
as the optimal (but infeasible) non-adaptive Neyman design, which has access to
all potential outcomes. Motivated by connections to online optimization, we
propose Neyman Ratio and Neyman Regret as two (equivalent) performance measures
of adaptive designs for this problem. We present Clip-OGD, an adaptive design
which achieves sublinear expected Neyman regret and thereby
recovers the optimal Neyman variance in large samples. Finally, we construct a
conservative variance estimator which facilitates the development of
asymptotically valid confidence intervals. To complement our theoretical
results, we conduct simulations using data from a microeconomic experiment.
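The infeasible benchmark mentioned above is easy to state concretely: the Neyman design treats with probability proportional to the potential-outcome standard deviations, which minimizes the variance of the difference-in-means ATE estimator. A toy computation, assuming (unrealistically) full access to both potential outcome arrays:

```python
import numpy as np

def neyman_allocation(y1, y0):
    """Infeasible optimal Neyman design: treatment probability
    proportional to the potential-outcome standard deviations.
    Requires all potential outcomes up front, which is exactly why
    an adaptive design must approach it from data instead."""
    s1, s0 = np.std(y1), np.std(y0)
    return s1 / (s1 + s0)

rng = np.random.default_rng(2)
y1 = rng.normal(1.0, 2.0, size=1000)   # treated outcomes, sd ~2
y0 = rng.normal(0.0, 1.0, size=1000)   # control outcomes, sd ~1
p = neyman_allocation(y1, y0)
print(f"Neyman treatment probability: {p:.2f}")  # near 2/3
```

An adaptive design like Clip-OGD estimates these standard deviations on the fly and steers its (clipped) assignment probabilities toward this target, which is what the Neyman ratio and Neyman regret measures quantify.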
Deeply-debiased off-policy interval estimation
Off-policy evaluation learns a target policy’s value with a historical dataset generated by a different behavior policy. In addition to a point estimate, many applications would benefit significantly from having a confidence interval (CI) that quantifies the uncertainty of the point estimate. In this paper, we propose a novel deeply-debiasing procedure to construct an efficient, robust, and flexible CI on a target policy’s value. Our method is justified by theoretical results and numerical experiments. A Python implementation of the proposed procedure is available at https://github.com/RunzheStat/D2OPE
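For context only (this is not the paper's deeply-debiased estimator), a standard importance-sampling baseline for a point estimate and a normal-approximation CI on a target policy's value, in a one-step contextual-bandit setting, might look like:

```python
import numpy as np

def is_value_ci(rewards, pi_b, pi_t):
    """Importance-sampling (IS) estimate of a target policy's value
    with a normal-approximation 95% CI. pi_b and pi_t are the logged
    action's probabilities under the behavior and target policies."""
    vals = (pi_t / pi_b) * rewards        # importance-weighted rewards
    est = vals.mean()
    se = vals.std(ddof=1) / np.sqrt(len(vals))
    z = 1.96                              # approx. 97.5% normal quantile
    return est, (est - z * se, est + z * se)

rng = np.random.default_rng(3)
n = 5000
actions = rng.integers(0, 2, size=n)                  # behavior: uniform
rewards = (actions == 1) + 0.5 * rng.normal(size=n)   # arm 1 pays 1 on average
pi_b = np.full(n, 0.5)
pi_t = np.where(actions == 1, 1.0, 0.0)               # target: always arm 1
est, (lo, hi) = is_value_ci(rewards, pi_b, pi_t)
print(f"value of 'always arm 1': {est:.2f}  CI=({lo:.2f}, {hi:.2f})")
```

Plain IS is unbiased here but can have high variance and is not robust to model misspecification in sequential settings; the debiasing machinery in the paper targets exactly those weaknesses.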
Did we personalize? Assessing personalization by an online reinforcement learning algorithm using resampling
There is a growing interest in using reinforcement learning (RL) to
personalize sequences of treatments in digital health to support users in
adopting healthier behaviors. Such sequential decision-making problems involve
decisions about when to treat and how to treat based on the user's context
(e.g., prior activity level, location, etc.). Online RL is a promising
data-driven approach for this problem as it learns based on each user's
historical responses and uses that knowledge to personalize these decisions.
However, to decide whether the RL algorithm should be included in an
"optimized" intervention for real-world deployment, we must assess the data
evidence indicating that the RL algorithm is actually personalizing the
treatments to its users. Due to the stochasticity in the RL algorithm, one may
get a false impression that it is learning in certain states and using this
learning to provide specific treatments. We use a working definition of
personalization and introduce a resampling-based methodology for investigating
whether the personalization exhibited by the RL algorithm is an artifact of the
RL algorithm stochasticity. We illustrate our methodology with a case study by
analyzing the data from a physical activity clinical trial called HeartSteps,
which included the use of an online RL algorithm. We demonstrate how our
approach enhances data-driven truth-in-advertising of algorithm personalization
both across all users and within specific users in the study. Comment: The first two authors contributed equally
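To give the flavor of such a resampling check, the toy below uses a simple permutation variant with a hypothetical test statistic (between-state spread in treatment rates); the paper's actual procedure resamples the RL algorithm's own stochasticity rather than permuting labels:

```python
import numpy as np

def personalization_stat(states, actions):
    """Spread (max minus min) of treatment rates across states:
    large values suggest the policy treats states differently."""
    rates = [actions[states == s].mean() for s in np.unique(states)]
    return np.ptp(rates)

def resampling_pvalue(states, actions, n_resamples=1000, seed=0):
    """Hypothetical resampling check (not the paper's procedure):
    permute state labels to mimic a non-personalizing policy, and
    count how often chance alone produces as much spread as observed."""
    rng = np.random.default_rng(seed)
    observed = personalization_stat(states, actions)
    null = [personalization_stat(rng.permutation(states), actions)
            for _ in range(n_resamples)]
    return float(np.mean([s >= observed for s in null]))

rng = np.random.default_rng(4)
states = rng.integers(0, 3, size=600)
# Synthetic "personalizing" policy: treats state 2 far more often
actions = (rng.random(600) < np.where(states == 2, 0.8, 0.3)).astype(int)
pval = resampling_pvalue(states, actions)
print(f"resampling p-value: {pval:.3f}")
```

A small p-value indicates the observed state-by-state variation in treatment is unlikely to be an artifact of chance alone, which is the kind of "truth-in-advertising" evidence the abstract argues for.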