A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences
We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed
bandit problem in the case of distributions with finite supports (not
necessarily known beforehand), whose asymptotic regret matches the lower bound
of \cite{Burnetas96}. Our contribution is to provide a finite-time analysis of
this algorithm; we obtain bounds whose main terms are smaller than those of
previously known algorithms with finite-time analyses (such as UCB-type
algorithms).
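The index at the heart of KL-divergence-based bandit algorithms can be illustrated for Bernoulli rewards: the upper confidence bound for an arm is the largest mean `q` whose KL divergence from the empirical mean stays within an exploration budget. The following is a minimal sketch of that index computation via bisection; the function names and the `log(t)/n` budget are illustrative assumptions, not the exact algorithm analyzed in the paper.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clip away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, n_pulls, t, precision=1e-6):
    """Largest q >= p_hat with n_pulls * kl(p_hat, q) <= log(t),
    found by bisection on [p_hat, 1]. Illustrative sketch only."""
    threshold = math.log(t) / n_pulls
    lo, hi = p_hat, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= threshold:
            lo = mid  # mid is still within the confidence region
        else:
            hi = mid
    return lo
```

Because the KL divergence adapts to the distribution rather than using a worst-case variance bound, this index is tighter than the Hoeffding-style radius used by classical UCB, which is the source of the smaller main terms mentioned above.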
An Information-Theoretic Analysis of Thompson Sampling
We provide an information-theoretic analysis of Thompson sampling that
applies across a broad range of online optimization problems in which a
decision-maker must learn from partial feedback. This analysis inherits the
simplicity and elegance of information theory and leads to regret bounds that
scale with the entropy of the optimal-action distribution. This strengthens
preexisting results and yields new insight into how information improves
performance.
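For concreteness, Thompson sampling itself can be sketched for Bernoulli bandits with Beta priors: sample a mean for each arm from its posterior, play the arm with the highest sample, and update. This is a standard textbook sketch under assumed Beta(1,1) priors, not the general partial-feedback setting or the information-theoretic machinery of the analysis above.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Bernoulli Thompson sampling with Beta(1,1) priors.
    Returns total reward collected over `horizon` rounds (sketch only)."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1] * k  # posterior successes + 1
    beta = [1] * k   # posterior failures + 1
    total = 0
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its Beta posterior
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        r = 1 if rng.random() < true_means[arm] else 0
        total += r
        alpha[arm] += r
        beta[arm] += 1 - r
    return total
```

The regret bounds discussed above scale with the entropy of the prior over the optimal action, so the more concentrated the decision-maker's prior beliefs, the less exploration this sampling loop needs.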
Lipschitz Bandits: Regret Lower Bounds and Optimal Algorithms
We consider stochastic multi-armed bandit problems where the expected reward
is a Lipschitz function of the arm, and where the set of arms is either
discrete or continuous. For discrete Lipschitz bandits, we derive asymptotic
problem-specific lower bounds on the regret of any algorithm, and
propose OSLB and CKL-UCB, two algorithms that efficiently exploit the Lipschitz
structure of the problem. In fact, we prove that OSLB is asymptotically
optimal, as its asymptotic regret matches the lower bound. The regret analysis
of our algorithms relies on a new concentration inequality for weighted sums of
KL divergences between the empirical distributions of rewards and their true
distributions. For continuous Lipschitz bandits, we propose to first discretize
the action space, and then apply OSLB or CKL-UCB, algorithms that provably
exploit the structure efficiently. This approach is shown, through numerical
experiments, to significantly outperform existing algorithms that directly deal
with the continuous set of arms. Finally the results and algorithms are
extended to contextual bandits with similarities.Comment: COLT 201
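The discretize-then-solve approach for continuous Lipschitz bandits can be sketched as follows: place a uniform grid on [0, 1] and run a finite-armed algorithm on the grid points. As a stand-in for OSLB/CKL-UCB (whose indices are not reproduced here), this sketch uses plain UCB1 on the discretized arms; the grid size and reward model are illustrative assumptions.

```python
import math
import random

def discretize_then_ucb(reward_fn, n_points, horizon, seed=0):
    """Discretize [0, 1] into n_points arms, then run UCB1 on the
    resulting finite bandit (a stand-in for OSLB/CKL-UCB).
    reward_fn maps an arm location to a Bernoulli mean; returns the
    arm location with the highest empirical mean after the horizon."""
    rng = random.Random(seed)
    arms = [i / (n_points - 1) for i in range(n_points)]
    counts = [0] * n_points
    sums = [0.0] * n_points
    for t in range(1, horizon + 1):
        if t <= n_points:
            arm = t - 1  # initialization: pull each arm once
        else:
            arm = max(
                range(n_points),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        r = 1 if rng.random() < reward_fn(arms[arm]) else 0
        counts[arm] += 1
        sums[arm] += r
    best = max(range(n_points), key=lambda i: sums[i] / counts[i])
    return arms[best]
```

The Lipschitz assumption is what makes discretization safe: neighboring grid points have similar means, so the approximation error introduced by the grid is bounded by the Lipschitz constant times the grid spacing, and a structure-aware algorithm on the grid can share information across nearby arms.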