3 research outputs found
Are sample means in multi-armed bandits positively or negatively biased?
It is well known that in stochastic multi-armed bandits (MAB), the sample
mean of an arm is typically not an unbiased estimator of its true mean. In this
paper, we decouple three different sources of this selection bias: adaptive
\emph{sampling} of arms, adaptive \emph{stopping} of the experiment, and
adaptively \emph{choosing} which arm to study. Through a new notion called
``optimism'' that captures certain natural monotonic behaviors of algorithms,
we provide a clean and unified analysis of how optimistic rules affect the sign
of the bias. The main takeaway message is that optimistic sampling induces a
negative bias, but optimistic stopping and optimistic choosing both induce a
positive bias. These results are derived in a general stochastic MAB setup that
is entirely agnostic to the final aim of the experiment (regret minimization or
best-arm identification or anything else). We provide examples of optimistic
rules of each type, demonstrate that simulations confirm our theoretical
predictions, and pose some natural but hard open problems.
Comment: 21 pages. Advances in Neural Information Processing Systems 32 (NeurIPS 2019, Spotlight Presentation).
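The sign-of-bias claims above are easy to check empirically. The sketch below is our own illustration, not the paper's code: it simulates a two-armed Bernoulli bandit sampled greedily (a simple optimistic sampling rule, always pulling the empirically best arm) and estimates the bias of each arm's final sample mean.

```python
import numpy as np

def greedy_sample_means(T=50, p=(0.5, 0.5), rng=None):
    """Run a greedy (optimistic-sampling) two-armed Bernoulli bandit
    for T pulls and return each arm's final sample mean."""
    rng = rng or np.random.default_rng()
    counts = np.array([1.0, 1.0])
    # pull each arm once to initialize
    sums = np.array([float(rng.random() < p[0]), float(rng.random() < p[1])])
    for _ in range(T - 2):
        a = int(np.argmax(sums / counts))  # always pull the empirically best arm
        sums[a] += rng.random() < p[a]
        counts[a] += 1
    return sums / counts

rng = np.random.default_rng(0)
reps = 5000
means = np.array([greedy_sample_means(rng=rng) for _ in range(reps)])
bias = means.mean(axis=0) - 0.5
print(bias)  # both coordinates come out negative: optimistic sampling -> negative bias
```

Intuitively, greedy sampling abandons an arm right after a run of low rewards, freezing its sample mean at an unluckily low value, which is exactly the negative bias the paper attributes to optimistic sampling.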
Challenges in Statistical Analysis of Data Collected by a Bandit Algorithm: An Empirical Exploration in Applications to Adaptively Randomized Experiments
Multi-armed bandit algorithms have been argued for decades as useful for
adaptively randomized experiments. In such experiments, an algorithm varies
which arms (e.g. alternative interventions to help students learn) are assigned
to participants, with the goal of assigning higher-reward arms to as many
participants as possible. We applied the bandit algorithm Thompson Sampling
(TS) to run adaptive experiments in three university classes. Instructors saw
great value in trying to rapidly use data to give their students in the
experiments better arms (e.g. better explanations of a concept). Our
deployment, however, illustrated a major barrier for scientists and
practitioners to use such adaptive experiments: a lack of quantifiable insight
into how much statistical analysis of specific real-world experiments is
impacted (Pallmann et al., 2018; FDA, 2019), compared to traditional uniform
random assignment. We therefore use our case study of the ubiquitous two-arm
binary reward setting to empirically investigate the impact of using Thompson
Sampling instead of uniform random assignment. In this setting, using common
statistical hypothesis tests, we show that collecting data with TS can as much
as double the False Positive Rate (FPR; incorrectly reporting differences when
none exist) and the False Negative Rate (FNR; failing to report differences
when they exist).
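The FPR inflation described above can be reproduced with a small simulation. The following is a minimal sketch under our own assumptions (Beta(1,1) Thompson Sampling and a pooled two-proportion z-test), not the authors' code: both arms have identical success probability, so every rejection is a false positive.

```python
import numpy as np

Z_CRIT = 1.96  # two-sided 5% critical value

def z_test_rejects(s0, n0, s1, n1):
    """Pooled two-proportion z-test; True if H0 (equal means) is rejected."""
    if min(n0, n1) == 0:
        return False
    p_pool = (s0 + s1) / (n0 + n1)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n0 + 1 / n1))
    if se == 0:
        return False
    return abs(s0 / n0 - s1 / n1) / se > Z_CRIT

def run(assign, n=100, p=(0.5, 0.5), rng=None):
    """Collect n binary rewards using the given assignment rule."""
    s, c = [0, 0], [0, 0]
    for _ in range(n):
        a = assign(s, c, rng)
        c[a] += 1
        s[a] += rng.random() < p[a]
    return s, c

def thompson(s, c, rng):
    # Beta(1,1) priors; pull the arm whose posterior draw is larger
    d0 = rng.beta(1 + s[0], 1 + c[0] - s[0])
    d1 = rng.beta(1 + s[1], 1 + c[1] - s[1])
    return 0 if d0 >= d1 else 1

def uniform(s, c, rng):
    return int(rng.integers(2))

rng = np.random.default_rng(1)
reps = 2000
fpr = {}
for name, assign in [("TS", thompson), ("uniform", uniform)]:
    rejections = 0
    for _ in range(reps):
        s, c = run(assign, rng=rng)
        rejections += z_test_rejects(s[0], c[0], s[1], c[1])
    fpr[name] = rejections / reps
print(fpr)  # FPR under TS is inflated relative to uniform assignment
```

Under uniform assignment the empirical FPR sits near the nominal 5%; under TS it is noticeably higher, because TS starves one arm of samples and the resulting imbalanced, adaptively-collected counts violate the test's independence assumptions.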
Inference for Batched Bandits
As bandit algorithms are increasingly utilized in scientific studies and
industrial applications, there is an associated increasing need for reliable
inference methods based on the resulting adaptively-collected data. In this
work, we develop methods for inference on data collected in batches using a
bandit algorithm. We first prove that the ordinary least squares estimator
(OLS), which is asymptotically normal on independently sampled data, is not
asymptotically normal on data collected using standard bandit algorithms when
there is no unique optimal arm. This asymptotic non-normality result implies
that the naive assumption that the OLS estimator is approximately normal can
lead to Type-1 error inflation and confidence intervals with below-nominal
coverage probabilities. Second, we introduce the Batched OLS estimator (BOLS)
that we prove is (1) asymptotically normal on data collected from both
multi-arm and contextual bandits and (2) robust to non-stationarity in the
baseline reward.
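The abstract ends before spelling out the BOLS construction, so the sketch below is only our illustration of the batched idea, with constants, clipping-free allocation rule, and Gaussian rewards chosen by us: within a batch the allocation probability is fixed given the past, so each batch's standardized difference is (conditionally) standard normal under the null, and summing these over batches and dividing by the square root of the number of batches yields a statistic whose rejection rate stays near nominal even under aggressive adaptivity.

```python
import numpy as np

def batched_reject(T_batches=10, n=50, rng=None, z_crit=1.96):
    """One adaptively-batched experiment under the null (equal arm means).

    Rewards are N(0, 1) for both arms with known variance, so each batch's
    standardized difference z_t is exactly N(0,1) given the counts; summing
    over batches and dividing by sqrt(T) keeps the statistic N(0,1) even
    though the allocation probability adapts between batches.
    """
    totals, counts = np.zeros(2), np.zeros(2)
    z_sum = 0.0
    for _ in range(T_batches):
        # adaptive, batch-level allocation: favor the empirically better arm
        if counts.min() == 0:
            pi = 0.5
        else:
            pi = 0.9 if totals[1] / counts[1] > totals[0] / counts[0] else 0.1
        # force one pull of each arm so both per-batch means exist
        arms = np.concatenate(([0, 1], (rng.random(n - 2) < pi).astype(int)))
        rewards = rng.standard_normal(n)  # null: both arms mean 0, sd 1
        n1 = arms.sum()
        n0 = n - n1
        diff = rewards[arms == 1].mean() - rewards[arms == 0].mean()
        z_sum += diff / np.sqrt(1 / n1 + 1 / n0)
        for a in (0, 1):
            counts[a] += (arms == a).sum()
            totals[a] += rewards[arms == a].sum()
    return abs(z_sum / np.sqrt(T_batches)) > z_crit

rng = np.random.default_rng(2)
reps = 2000
rate = np.mean([batched_reject(rng=rng) for _ in range(reps)])
print(rate)  # close to the nominal 0.05 despite adaptive allocation
```

The contrast with the OLS result above is the point: pooling all the adaptively-collected data into one estimator breaks asymptotic normality, while normalizing batch by batch preserves it.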