
Multi-scale exploration of convex functions and bandit convex optimization

Abstract

We construct a new map from a convex function to a distribution on its domain, with the property that this distribution is a multi-scale exploration of the function. We use this map to solve a decade-old open problem in adversarial bandit convex optimization by showing that the minimax regret for this problem is $\tilde{O}(\mathrm{poly}(n)\sqrt{T})$, where $n$ is the dimension and $T$ is the number of rounds. This bound is obtained by studying the dual Bayesian maximin regret via the information ratio analysis of Russo and Van Roy, and then using the multi-scale exploration to solve the Bayesian problem.

Comment: Preliminary version; 22 pages.
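For context, the regret quantity being bounded can be stated as follows. This is the standard definition of regret in adversarial bandit convex optimization, not a formula taken from the paper itself: the learner picks points $x_t$ in a convex body $\mathcal{K} \subset \mathbb{R}^n$, the adversary picks convex losses $f_t$, and only the scalar value $f_t(x_t)$ is observed.

```latex
% Standard adversarial regret (sketch; notation assumed, not from the source):
% the learner's cumulative loss minus that of the best fixed point in hindsight.
\[
  R_T \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x),
\]
% The abstract's claim is that the minimax value of this quantity satisfies
\[
  \min_{\text{algorithms}} \; \max_{f_1,\dots,f_T} \; \mathbb{E}[R_T]
  \;=\; \tilde{O}\!\left(\mathrm{poly}(n)\,\sqrt{T}\right),
\]
% i.e., regret growing as \sqrt{T} up to polynomial factors in the dimension n
% and logarithmic factors in T.
```

The significance of the $\sqrt{T}$ rate is that the average per-round excess loss $R_T/T$ vanishes as $T \to \infty$ at the same rate as in the full-information setting, despite the learner observing only bandit (single-point) feedback.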
