Motivated by practical applications, chiefly clinical trials, we study the
regret achievable for stochastic bandits under the constraint that the employed
policy must split trials into a small number of batches. We propose a simple
policy, and show that a very small number of batches gives close to minimax
optimal regret bounds. As a byproduct, we derive optimal policies with low
switching cost for stochastic bandits.

Comment: Published at http://dx.doi.org/10.1214/15-AOS1381 in the Annals of
  Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
  Statistics (http://www.imstat.org

Chassang, Sylvain

Perchet, Vianney

Rigollet, Philippe

Snowberg, Erik

English

arXiv


Submitted to the Annals of Statistics
SUPPLEMENTARY MATERIAL FOR:
BATCHED BANDIT PROBLEMS
By Vianney Perchet∗, Philippe Rigollet†, Sylvain Chassang‡, and Erik Snowberg‡
Université Paris Diderot and INRIA, Massachusetts Institute of Technology,
Princeton University, and California Institute of Technology and NBER
Abstract. Motivated by practical applications, chiefly clinical trials,
we study the regret achievable for stochastic bandits under the
constraint that the employed policy must split trials into a small number
of batches. We propose a simple policy, and show that a very small
number of batches gives close to minimax optimal regret bounds. As
a byproduct, we derive optimal policies with low switching cost for
stochastic bandits.
In this supplementary material we compare, in simulations, the various
policies (grids) introduced in [PRCS15].
These are also compared with Ucb2 [ACBF02], which, as noted in [PRCS15],
can be seen as an M batch trial with M = Θ(log T ). The simulations are
based both on data drawn from standard distributions, and from a real
medical trial: specifically, data from Project AWARE, an intervention that
sought to reduce the rate of sexually transmitted infections (STI) among
high-risk individuals [MFGea13].
Of the three policies introduced here, the minimax grid often does the best
at minimizing regret. While all three policies are often bested by Ucb2, it
is important to note that the latter algorithm uses an order of magnitude
more batches. This makes using Ucb2 for medical trials functionally
impossible. For example, in the real data we examine, the data on STI status was
not reliably available until at least six months after the intervention. Thus,
a three-batch trial would take 1.5 years to run, as intervention and data
collection would need to take place three times, six months apart. In
contrast, Ucb2 would use as many as 56 batches, meaning the overall
experiment would take at least 28 years. Despite this extreme difference
in time scales, the geometric and minimax grids produce similar levels of
average regret.
∗Supported by ANR grant ANR-13-JS01-0004.
†Supported by NSF grants DMS-1317308, CAREER-DMS-1053987.
‡Supported by NSF grant SES-1156154.
AMS 2000 subject classifications: Primary 62L05; secondary 62C20
Keywords and phrases: Multi-armed bandit problems, regret bounds, batches,
multi-phase allocation, grouped clinical trials, sample size determination, switching cost
1. Effects of different parameters in simulations.
[Figure 1: four panels, one per reward distribution (Gaussian with variance = 1; Student’s t with two degrees of freedom; Bernoulli; Poisson). Horizontal axis: Number of Subjects (T, in thousands), 0 to 250. Vertical axis: Average Regret per Subject, 0 to 0.04. Series: Arithmetic, Geometric, Minimax, UCB2.]
Figure 1. Performance of Policies with Different Distributions and M = 5. (For all
distributions µ(†) = 0.5, and µ(⋆) = 0.5 + ∆ = 0.6.)
1.1. Effect of reward distributions. We begin, in Figure 1, by examining
how different distributions affect the average regret produced by different
policies for many values of the total sample, T . For each value of T in
the figure, a sample is drawn, grids are computed based on M and T , the
policy is implemented, and average regret is calculated based on the choices
in the policy. This is repeated 100 times for each value of T . Thus, each
panel compares average regret for different policies as a function of the total
sample T .
In all panels, the number of batches is set at M = 5 for all policies except
Ucb2. The panels each consider one of four distributions: two continuous
(Gaussian and Student’s t-distribution) and two discrete (Bernoulli and
Poisson). In all cases, and no matter the number of participants T , we set the
difference between the arms at ∆ = 0.1.
A few patterns are immediately apparent. First, the arithmetic grid produces
relatively constant average regret above a certain number of participants.
The intuition is straightforward: when T is large enough, the ETC
policy will tend to commit after the first batch, as the first evaluation point
will be greater than τ(∆). Since, in the case of the arithmetic grid, the size of
this first batch is a constant proportion of the overall participant pool,
average regret will be constant once T is large enough.
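The plateau can be checked with a small calculation. The sketch below is illustrative only and rests on assumptions not stated in this section: the arithmetic grid uses M equal-size batches, the first batch is split evenly between the two arms, and the policy commits to the better arm right after that batch. The function names are hypothetical.

```python
import math

def plateau_average_regret(gap: float, num_batches: int) -> float:
    """Average regret per subject if the policy commits correctly after batch 1.

    A fraction 1/M of subjects is in the first batch and half of them
    receive the worse arm, so the per-subject regret is gap / (2 * M).
    """
    return gap / (2 * num_batches)

def first_batch_average_regret(horizon: int, gap: float, num_batches: int) -> float:
    """Same quantity, computed step by step for a given horizon T."""
    first_batch = horizon // num_batches   # equal-size batches (assumed)
    worse_arm_pulls = first_batch / 2      # even split inside the first batch
    return worse_arm_pulls * gap / horizon

# The value does not depend on T, which is the plateau described in the text.
for horizon in (50_000, 100_000, 250_000):
    print(horizon, first_batch_average_regret(horizon, gap=0.1, num_batches=5))
```

Under these assumptions the per-subject regret depends only on ∆ and M, not on T.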
Second, the minimax grid also produces relatively constant average regret,
although this holds for smaller values of T , and produces lower regret than
the geometric or arithmetic case when M is small. This indicates, using
the intuition above, that the minimax grid excels at choosing the optimal
batch size to allow a decision to commit very close to τ(∆). This advantage
over the arithmetic and geometric grids is clear, and it can even produce
lower regret than Ucb2, but with an order of magnitude fewer batches.
However, according to the theory above, with the minimax grid average
regret is bounded by a more steeply decreasing function than is apparent
in the figures. The discrepancy is due to the bounding of regret being loose
for relatively small T . As T grows, average regret does decrease, but more
slowly than the bound, so eventually the bound is tight at values greater
than shown in the figure.
Third, and finally, the Ucb2 algorithm generally produces lower regret
than any of the policies considered in the manuscript for all distributions
except the heavy-tailed Student’s t-distribution. This exception can be
explained by the central limit theorem, or its generalization to handle random
variables with infinite variance (such as the Student’s t-distribution
with 2 degrees of freedom): batching heavy-tailed random variables creates,
asymptotically, random variables with Gaussian tails.
This increase in performance comes at a steep practical cost: many more
batches. For example, with draws from a Gaussian distribution, and T
between 10,000 and 40,000, the minimax grid performs better than Ucb2.
Throughout this range, the number of batches is fixed at M = 5 for the
minimax grid, but Ucb2 uses an average of 40 to 46 batches. The average
number of batches used by Ucb2 increases with T , and with T = 250,000
it reaches 56.
The fact that Ucb2 uses so many more batches than the geometric grid
may seem a bit surprising as both use geometric batches, leading Ucb2 to
have M = Θ(log T ). The difference occurs because the geometric grid uses
exactly M batches, while the total number of batches in Ucb2 is dominated
by the constant terms in the range of T we consider. It should further be
noted that although the level of regret is higher for the geometric grid, it is
higher by a relatively constant factor.
1.2. Effect of the gap ∆. The patterns in Figure 1 are relatively independent
of the distribution used to generate the simulated data. Thus, in this
subsection, we focus on a single distribution: the exponential (to add
variety), in Figure 2. What varies here is the difference in mean value between
the two arms, ∆ ∈ {0.01, 0.5}.
In both panels of Figure 2, the mean of the second arm is set to µ(†) = 0.5,
so ∆ in these panels is 2% and 100%, respectively, of µ(†). This affects both
the maximum average regret T∆/T = ∆ and the number of participants it
will take to determine, using the statistical test in Section 3.1, which arm
to commit to.
When the value of ∆ is small (0.01), then in small to moderate samples
T , the performance of the geometric grid and Ucb2 is equivalent. When
samples get large, the minimax grid, the geometric grid, and Ucb2
have similar performance. However, as before, Ucb2 uses an order of
magnitude more batches: between 38 and 56, depending on the number
of participants, T . As in Figure 1, the arithmetic grid performs poorly, but
as expected based on the intuition built in the previous subsection: more
participants are needed before the performance of this grid stabilizes at a
constant value. Although not shown, middling values of ∆ (for example,
∆ = 0.1) produce the same patterns as those shown in the panels of
Figure 1 (except for the panel using Student’s t).
When the value of ∆ is relatively large (0.5), then there is a reversal of
the pattern found when ∆ is relatively small. In particular, the geometric
grid performs poorly (worse, in fact, than the arithmetic grid) for small
samples, but when the number of participants is large, the performance of
the minimax grid, geometric grid, and Ucb2 is comparable. Nevertheless,
the latter uses an order of magnitude more batches.
1.3. Effect of the number of batches (M). There is likely to be some
variation in how well different numbers of batches perform. This is explored
in Figure 3. The minimax grid’s performance is consistent from M = 2 to
M = 10. However, as M gets large relative to both the number of
participants T and gap between the arms ∆, all grids perform approximately
equally. This occurs because as the sizes of the batches decrease, all grids
end up with decision points near τ(∆).
These simulations also reveal an important point about implementation:
the values of a, the termination point of the first batch, suggested in
Theorems 2 and 3 are not feasible when M is “too big”, that is, if it is
comparable to log(T/(log T )) in the case of the geometric grid, or comparable
to log_2 log T in the case of the minimax grid. When this occurs, using this
initial value of a may lead to the last batch being entirely outside the range
of T . We used the suggested a whenever feasible, but, when it was not, we
selected a such that the last batch finished exactly at t_M = T . In the
simulations displayed in Figure 3, this occurs with the geometric grid for M ≥ 7
in the first panel, and M ≥ 6 in the second panel. For the minimax grid, this
occurs for M ≥ 8 in the second panel. For the geometric grid, this improves
performance, and for the minimax grid it slightly decreases performance. In
both cases, this is due to the relatively small sample, and to how the grid
locates decision points relative to τ(∆).
[Figure 2: two panels (∆ = 0.01, vertical axis 0.001 to 0.005; ∆ = 0.5, vertical axis 0 to 0.1). Horizontal axis: Number of Subjects (T, in thousands), 0 to 250. Vertical axis: Average Regret per Subject. Series: Arithmetic, Geometric, Minimax, UCB2.]
Figure 2. Performance of Policies with different ∆ and M = 5. (For all panels µ(†) = 0.5,
and µ(⋆) = 0.5 + ∆.)
[Figure 3: two panels (Bernoulli, ∆ = 0.05, T = 250,000, vertical axis 0 to 0.015; Gaussian, ∆ = 0.5, T = 10,000, vertical axis 0 to 0.15). Horizontal axis: Number of Batches (M), 2 to 10. Vertical axis: Average Regret per Subject. Series: Arithmetic, Geometric, Minimax.]
Figure 3. Performance of policies with different numbers of batches. (For all panels µ(†) =
0.5, and µ(⋆) = 0.5 + ∆.)
1.4. Real Data. Our final simulations use data from Project AWARE,
a medical intervention to reduce the rate of sexually transmitted infections
(STI) among high-risk individuals [MFGea13]. In particular, when
participants went to a clinic to get an instant blood test for HIV, they were
randomly assigned to receive either an information sheet (control, or arm 2)
or extensive “AWARE” counseling (treatment, or arm 1). The main outcome
of interest was whether a participant had an STI upon six-month follow up.
The data from this trial is useful for simulations for several reasons. First,
the time to observed outcome makes it clear that only a small number of
batches is feasible. Second, the difference in outcomes between the arms ∆
was slight, making the problem difficult. Indeed, the difference between the
arms was not statistically significant at conventional levels within the studied
sample. Third, the trial itself was fairly large by medical trial standards,
enrolling over 5,000 participants.
To simulate trials based on this data, we randomly draw observations,
with replacement, from the Project AWARE participant pool. We then
assign these participants to different batches, based on the outcomes of
previous batches. The results of these simulations, for different numbers of
participants and different numbers of batches, can be found in Figure 4. The
arithmetic grid once again provides the intuition. Note that the performance
of this grid degrades as the number of batches M is increased. This occurs
because ∆ is so small that the ETC policy does not commit until the last
round, where it “goes for broke”. However, when doing so, the policy rarely
makes a mistake. Thus, more batches cause the grid to “go for broke” later
and later, resulting in worse performance.
[Figure 4: four panels (M = 3; M = 5; M = 7; M = 9). Horizontal axis: Number of Subjects (T, in thousands), 0 to 250. Vertical axis: Average Regret per Subject, 0 to 0.006. Series: Arithmetic, Geometric, Minimax, UCB2.]
Figure 4. Performance of Policies using data from Project AWARE.
The geometric grid and minimax grid perform similarly to Ucb2, with
minimax performing best with a very small number of batches (M = 3), and
geometric performing best with a moderate number of batches (M = 9). In
both cases, this small difference comes from one grid or the other “going
for broke” at a slightly earlier time. As before, Ucb2 uses between 40 and 56
batches. Given the six-month time between intervention and outcome
measures, this suggests that a complete trial could be accomplished in 1.5 years
using the minimax grid, but would take up to 28 years (a truly infeasible
amount of time) using Ucb2.
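The two durations follow from multiplying the number of batches by the six-month wait for outcomes. A minimal sketch of that arithmetic, assuming batches run strictly one after another (the function name is hypothetical):

```python
# Calendar time of a batched trial when each batch must wait for its outcomes.
MONTHS_PER_BATCH = 6  # six-month follow-up for the STI outcome

def trial_duration_years(num_batches: int, months_per_batch: int = MONTHS_PER_BATCH) -> float:
    """Years needed when batches are run sequentially, one after another."""
    return num_batches * months_per_batch / 12

print(trial_duration_years(3))   # three-batch grid
print(trial_duration_years(56))  # upper end of the Ucb2 batch counts
```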
It is worth noting that there is nothing special in medical trials about
waiting six months for data from an intervention. Trials of cancer drugs
often measure variables like the 1- or 3-year survival rate, or the increase
in average survival off a baseline that may be greater than a year. In these
cases, the ability to get relatively low regret with a small number of batches
is extremely important.
REFERENCES
[ACBF02] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, Finite-time analysis of the multiarmed bandit problem, Mach. Learn. 47 (2002), no. 2-3, 235–256.
[MFGea13] L. R. Metsch, D. J. Feaster, L. Gooden, et al., Effect of risk-reduction counseling with rapid HIV testing on risk of acquiring sexually transmitted infections: The AWARE randomized clinical trial, JAMA 310 (2013), no. 16, 1701–1710.
[PRCS15] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg, Batched bandit problems, arXiv:1505.00369 (2015).
Vianney Perchet
LPMA, UMR 7599
Université Paris Diderot
8, Place FM/13
75013, Paris, France
E-mail: vianney.perchet@normalesup.org
Philippe Rigollet
Department of Mathematics and IDSS
Massachusetts Institute of Technology
77 Massachusetts Avenue,
Cambridge, MA 02139-4307, USA
E-mail: rigollet@math.mit.edu
Sylvain Chassang
Department of Economics
Princeton University
Bendheim Hall 316
Princeton, NJ 08544-1021
E-mail: chassang@princeton.edu
Erik Snowberg
Division of the Humanities and Social Sciences
California Institute of Technology
MC 228-77
Pasadena, CA 91125
E-mail: snowberg@caltech.edu


Batched bandit problems

INRIA a CCSD electronic archive server

Batched Bandit Problems

Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. Our results show that a very small number of batches gives close to minimax optimal regret bounds. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.

National Science Foundation (U.S.) (Grant DMS-1317308)
National Science Foundation (U.S.) (CAREER-DMS-1053987)
Meimaris Famil

DSpace@MIT

The Annals of Statistics


JMLR: Workshop and Conference Proceedings vol 40:1–2, 2015
Batched Bandit Problems
Vianney Perchet VIANNEY.PERCHET@NORMALESUP.ORG
Université Paris Diderot – INRIA
Philippe Rigollet RIGOLLET@MATH.MIT.EDU
Massachusetts Institute of Technology
Sylvain Chassang CHASSANG@PRINCETON.EDU
Princeton University
Erik Snowberg SNOWBERG@CALTECH.EDU
California Institute of Technology
Abstract
Motivated by practical applications, chiefly clinical trials, we study the regret achievable for
stochastic multi-armed bandits under the constraint that the employed policy must split trials into a small
number of batches. Our results show that a very small number of batches already gives close to
minimax optimal regret bounds, and we also evaluate the number of trials in each batch. As a
byproduct, we derive optimal policies with low switching cost for stochastic bandits.
In practice, fixed costs and delays in the observation of outcomes make it prohibitively expensive
to run clinical trials consisting of more than three or four batches of patients. The objective is to
describe the regret achievable for two-armed bandits under the constraint of a small number M of
batches, within which the likelihood of pulling each arm is constant. We study a class of
explore-then-commit (ETC) policies parameterized by the partition of patients across batches. In the
first batch, the policy randomizes uniformly between arms. At the end of each batch, a statistical test
of performance is implemented. If it is conclusive, the supposed-to-be suboptimal arm is eliminated.
If it is inconclusive, the policy keeps alternating between arms in the next batch.
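The batch-by-batch procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: rewards are unit-variance Gaussian, the confidence threshold is a hypothetical stand-in for the paper's test, and in the final batch the policy is assumed to play the empirically better arm. The function name `batched_etc` and its parameters are hypothetical.

```python
import math
import random

def batched_etc(grid, means, rng):
    """Run a two-armed explore-then-commit policy over the batches in `grid`.

    grid  : increasing batch end points t_1 < ... < t_M, with t_M the horizon T
    means : true mean reward of each arm (unit-variance Gaussian rewards)
    Returns the number of pulls of each arm.
    """
    horizon = grid[-1]
    sums, counts = [0.0, 0.0], [0, 0]
    committed = None  # index of the arm kept after elimination, if any
    start = 0
    for k, end in enumerate(grid):
        last_batch = k == len(grid) - 1
        if committed is None and last_batch and min(counts) > 0:
            # Assumption: in the final batch, play the empirically better arm.
            committed = 0 if sums[0] / counts[0] >= sums[1] / counts[1] else 1
        for i in range(end - start):
            # Before commitment, alternate between the two arms within the batch.
            arm = committed if committed is not None else i % 2
            sums[arm] += rng.gauss(means[arm], 1.0)
            counts[arm] += 1
        if committed is None and not last_batch and min(counts) > 0:
            # Hypothetical threshold, not the statistical test used in the paper.
            threshold = 2.0 * math.sqrt(2.0 * math.log(horizon) / min(counts))
            diff = sums[0] / counts[0] - sums[1] / counts[1]
            if abs(diff) > threshold:
                committed = 0 if diff > 0 else 1  # eliminate the other arm
        start = end
    return counts

rng = random.Random(0)
gap = 5.0
pulls = batched_etc(grid=[100, 400, 1000], means=[0.0, gap], rng=rng)
regret = pulls[0] * gap  # pulls of the worse arm times the gap
print(pulls, regret / 1000)
```

With a gap this large the test is conclusive after the first batch, so every later pull goes to the better arm; smaller gaps push the commitment to later batches.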
For each number of batches M , we describe ETC policies π^1 and π^2 achieving tight adaptive and
minimax bounds on regret, where ∆ is the gap in expected returns between arms and T the horizon:
\[
R_T(\pi^1) \lesssim \left(\frac{T}{\log(T)}\right)^{\frac{1}{M}} \frac{\log(T\Delta^2)}{\Delta},
\qquad
R_T(\pi^2) \lesssim T^{\frac{1}{2-2^{1-M}}} \log^{\alpha_M}\!\left(T^{\frac{1}{2^M-1}}\right),
\quad \alpha_M \in [0, 1/4).
\]
Thus losses from using few batches are small: π^1 attains the optimal adaptive rate log(T∆²)/∆ if
M = Θ(log(T/ log(T))); π^2 attains the optimal minimax rate √T whenever M = Θ(log log T ).
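To see how quickly the two bounds approach the optimal rates, the quantities that depend on M can be evaluated directly. The sketch below only evaluates the exponent 1/(2 − 2^(1−M)) from the minimax bound and the factor (T/log T)^(1/M) from the adaptive bound; the function names and the choice T = 250,000 are illustrative.

```python
import math

def minimax_exponent(num_batches: float) -> float:
    """Exponent of T in the minimax bound: 1 / (2 - 2^(1 - M))."""
    return 1.0 / (2.0 - 2.0 ** (1 - num_batches))

def geometric_factor(horizon: float, num_batches: float) -> float:
    """Leading factor (T / log T)^(1/M) in the adaptive bound."""
    return (horizon / math.log(horizon)) ** (1.0 / num_batches)

horizon = 250_000
for m in range(1, 7):
    print(m, round(minimax_exponent(m), 4), round(geometric_factor(horizon, m), 2))

# With M = log(T / log T), the factor (T / log T)^(1/M) collapses to the constant e.
m_star = math.log(horizon / math.log(horizon))
print(m_star, geometric_factor(horizon, m_star))
```

The exponent starts at 1 for a single batch and decreases toward 1/2 as M grows, which is why a handful of batches already brings the minimax bound close to √T.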
Tests on real and simulated data show that batch-optimized ETC policies with few batches per-
form well, even relative to more responsive strategies such as UCB2.
Acknowledgments
Philippe Rigollet acknowledges the support of NSF grants DMS-1317308, CAREER-DMS-1053987,
the Howard B. Wentz Jr. Junior Faculty award and the Meimaris family. Vianney Perchet received
support from the French National Research Agency (ANR) Project GAGA: ANR-13-JS01-0004-01.
Sylvain Chassang and Erik Snowberg acknowledge support from NSF grant SES-1156154.
© 2015 V. Perchet, P. Rigollet, S. Chassang & E. Snowberg.


https://hal.archives-ouvertes.fr/hal-01265077
