Pure Exploration with Multiple Correct Answers

Author: Degenne Rémy
Koolen Wouter M.
Publication date: 01/01/2019

We determine the sample complexity of pure exploration bandit problems with multiple good answers. We derive a lower bound using a new game equilibrium argument. We show how continuity and convexity properties of single-answer problems ensure that the Track-and-Stop algorithm has asymptotically optimal sample complexity. However, that convexity is lost when going to the multiple-answer setting. We present a new algorithm which extends Track-and-Stop to the multiple-answer case and has asymptotic sample complexity matching the lower bound.

arXiv.org e-Print Archive

CWI's Institutional Repository

Bounded regret in stochastic multi-armed bandits

Author: Bubeck Sébastien
Perchet Vianney
Rigollet Philippe
Publication date: 01/02/2013

We study the stochastic multi-armed bandit problem when one knows the value $\mu^{(\star)}$ of an optimal arm, as well as a positive lower bound on the smallest positive gap $\Delta$. We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows $\Delta$, and bounded regret of order $1/\Delta$ is not possible if one only knows $\mu^{(\star)}$.

arXiv.org e-Print Archive

Princeton University Open Access Repository

Sparse Stochastic Bandits

Author: Kwon Joon
Perchet Vianney
Vernade Claire
Publication date: 05/06/2017

In the classical multi-armed bandit problem, d arms are available to the decision maker, who pulls them sequentially in order to maximize their cumulative reward. Guarantees can be obtained on a relative quantity called regret, which scales linearly with d (or with sqrt(d) in the minimax sense). Here we consider the sparse case of this classical problem, in the sense that only a small number of arms, namely s < d, have a positive expected reward. We are able to leverage this additional assumption to provide an algorithm whose regret scales with s instead of d. Moreover, we prove that this algorithm is optimal by providing a matching lower bound - at least for a wide and pertinent range of parameters that we determine - and by evaluating its performance on simulated data.
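
To make the setting in this abstract concrete, the following is a minimal sketch of a sparse bandit instance and of how regret is measured on it. It is not the algorithm from the paper: it runs the standard UCB1 index policy as a generic baseline. The function name `ucb1_pseudo_regret`, the Bernoulli reward model, and the particular means (d = 10 arms, s = 2 with positive mean) are all assumptions chosen for illustration.

```python
import math
import random


def ucb1_pseudo_regret(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return (pseudo_regret, pull_counts).

    Assumes horizon >= len(means) so that every arm is pulled at least once.
    """
    rng = random.Random(seed)
    d = len(means)
    counts = [0] * d   # number of pulls per arm
    sums = [0.0] * d   # cumulative observed reward per arm
    best = max(means)  # mean of an optimal arm
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= d:
            arm = t - 1  # initialization: pull each arm once
        else:
            # UCB1 index: empirical mean plus an exploration bonus
            arm = max(
                range(d),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # expected loss of this pull
    return regret, counts


# Hypothetical sparse instance: d = 10 arms, only s = 2 have a positive mean.
means = [0.6, 0.4] + [0.0] * 8
regret, counts = ucb1_pseudo_regret(means, horizon=2000)
print(f"pseudo-regret: {regret:.1f}, pulls per arm: {counts}")
```

UCB1 does not use the sparsity assumption, so it still spends pulls exploring each of the d - s arms with zero mean. The abstract's claim is that an algorithm aware of the sparsity can achieve regret scaling with s rather than d.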

arXiv.org e-Print Archive

HAL-Polytechnique