A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences
We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed
bandit problem in the case of distributions with finite supports (not
necessarily known beforehand), whose asymptotic regret matches the lower bound
of \cite{Burnetas96}. Our contribution is to provide a finite-time analysis of
this algorithm; we obtain bounds whose main terms are smaller than those of
previously known algorithms with finite-time analyses (such as UCB-type
algorithms).
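The index at the heart of KL-divergence-based bandit algorithms can be illustrated for Bernoulli rewards: the upper confidence bound for an arm is the largest mean `q` whose KL divergence from the empirical mean stays within an exploration budget. The following is a minimal sketch of that index computation via bisection; the function names and the `log(t)/n` budget are illustrative assumptions, not the exact algorithm analyzed in the paper.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clip away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, n_pulls, t, precision=1e-6):
    """Largest q >= p_hat with n_pulls * kl(p_hat, q) <= log(t),
    found by bisection on [p_hat, 1]. Illustrative sketch only."""
    threshold = math.log(t) / n_pulls
    lo, hi = p_hat, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= threshold:
            lo = mid  # mid is still within the confidence region
        else:
            hi = mid
    return lo
```

Because the KL divergence adapts to the distribution rather than using a worst-case variance bound, this index is tighter than the Hoeffding-style radius used by classical UCB, which is the source of the smaller main terms mentioned above.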
An Information-Theoretic Analysis of Thompson Sampling
We provide an information-theoretic analysis of Thompson sampling that
applies across a broad range of online optimization problems in which a
decision-maker must learn from partial feedback. This analysis inherits the
simplicity and elegance of information theory and leads to regret bounds that
scale with the entropy of the optimal-action distribution. This strengthens
preexisting results and yields new insight into how information improves
performance.
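For concreteness, Thompson sampling itself can be sketched for Bernoulli bandits with Beta priors: sample a mean for each arm from its posterior, play the arm with the highest sample, and update. This is a standard textbook sketch under assumed Beta(1,1) priors, not the general partial-feedback setting or the information-theoretic machinery of the analysis above.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Bernoulli Thompson sampling with Beta(1,1) priors.
    Returns total reward collected over `horizon` rounds (sketch only)."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1] * k  # posterior successes + 1
    beta = [1] * k   # posterior failures + 1
    total = 0
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its Beta posterior
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        r = 1 if rng.random() < true_means[arm] else 0
        total += r
        alpha[arm] += r
        beta[arm] += 1 - r
    return total
```

The regret bounds discussed above scale with the entropy of the prior over the optimal action, so the more concentrated the decision-maker's prior beliefs, the less exploration this sampling loop needs.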
Lipschitz Bandits: Regret Lower Bounds and Optimal Algorithms
We consider stochastic multi-armed bandit problems where the expected reward
is a Lipschitz function of the arm, and where the set of arms is either
discrete or continuous. For discrete Lipschitz bandits, we derive asymptotic
problem-specific lower bounds on the regret of any algorithm, and
propose OSLB and CKL-UCB, two algorithms that efficiently exploit the Lipschitz
structure of the problem. In fact, we prove that OSLB is asymptotically
optimal, as its asymptotic regret matches the lower bound. The regret analysis
of our algorithms relies on a new concentration inequality for weighted sums of
KL divergences between the empirical distributions of rewards and their true
distributions. For continuous Lipschitz bandits, we propose to first discretize
the action space, and then apply OSLB or CKL-UCB, algorithms that provably
exploit the structure efficiently. This approach is shown, through numerical
experiments, to significantly outperform existing algorithms that directly deal
with the continuous set of arms. Finally the results and algorithms are
extended to contextual bandits with similarities.Comment: COLT 201
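The discretize-then-solve approach for continuous Lipschitz bandits can be sketched as follows: place a uniform grid on [0, 1] and run a finite-armed algorithm on the grid points. As a stand-in for OSLB/CKL-UCB (whose indices are not reproduced here), this sketch uses plain UCB1 on the discretized arms; the grid size and reward model are illustrative assumptions.

```python
import math
import random

def discretize_then_ucb(reward_fn, n_points, horizon, seed=0):
    """Discretize [0, 1] into n_points arms, then run UCB1 on the
    resulting finite bandit (a stand-in for OSLB/CKL-UCB).
    reward_fn maps an arm location to a Bernoulli mean; returns the
    arm location with the highest empirical mean after the horizon."""
    rng = random.Random(seed)
    arms = [i / (n_points - 1) for i in range(n_points)]
    counts = [0] * n_points
    sums = [0.0] * n_points
    for t in range(1, horizon + 1):
        if t <= n_points:
            arm = t - 1  # initialization: pull each arm once
        else:
            arm = max(
                range(n_points),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        r = 1 if rng.random() < reward_fn(arms[arm]) else 0
        counts[arm] += 1
        sums[arm] += r
    best = max(range(n_points), key=lambda i: sums[i] / counts[i])
    return arms[best]
```

The Lipschitz assumption is what makes discretization safe: neighboring grid points have similar means, so the approximation error introduced by the grid is bounded by the Lipschitz constant times the grid spacing, and a structure-aware algorithm on the grid can share information across nearby arms.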