Fighting Bandits with a New Kind of Smoothness

Author: Abernethy Jacob
Lee Chansoo
Tewari Ambuj
Publication date: 13/12/2015

We define a novel family of algorithms for the adversarial multi-armed bandit problem, and provide a simple analysis technique based on convex smoothing. We prove two main results. First, we show that regularization via the Tsallis entropy, which includes EXP3 as a special case, achieves the $\Theta(\sqrt{TN})$ minimax regret. Second, we show that a wide class of perturbation methods achieve a near-optimal regret as low as $O(\sqrt{TN \log N})$ if the perturbation distribution has a bounded hazard rate. For example, the Gumbel, Weibull, Frechet, Pareto, and Gamma distributions all satisfy this key property.

Comment: In Proceedings of NIPS, 201
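
The relationship between perturbation methods and EXP3-style exponential weights can be illustrated with the Gumbel-max identity: adding independent Gumbel noise to a score vector and taking the argmax picks index i with probability proportional to exp(score_i). The following is a minimal Python sketch of that identity only, not the authors' algorithm or analysis. The scores are hypothetical stand-ins for scaled negative loss estimates, and all function names are illustrative.

```python
import math
import random

def softmax(scores):
    """Exponential-weights distribution over arms (numerically stabilized)."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def gumbel(rng):
    """Draw one standard Gumbel sample by inverse transform."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() can return exactly 0.0
        u = rng.random()
    return -math.log(-math.log(u))

def perturbed_leader(scores, rng):
    """Return the index maximizing score plus fresh Gumbel noise."""
    perturbed = [s + gumbel(rng) for s in scores]
    return max(range(len(scores)), key=perturbed.__getitem__)

def empirical_frequencies(scores, draws, seed=0):
    """Fraction of draws in which each arm is the perturbed leader."""
    rng = random.Random(seed)
    counts = [0] * len(scores)
    for _ in range(draws):
        counts[perturbed_leader(scores, rng)] += 1
    return [c / draws for c in counts]
```

With enough draws, `empirical_frequencies(scores, draws)` should approach `softmax(scores)`, which is the sampling distribution exponential weights would use for the same scores.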

arXiv.org e-Print Archive

CiteSeerX

A near-optimal change-detection based algorithm for piecewise-stationary combinatorial semi-bandits

Author: LIM Ee-Peng
VARSHNEY Lav N.
WANG Lingda
ZHOU Huozhi
Publication venue: Association for the Advancement of Artificial Intelligence (AAAI)
Publication date: 21/02/2020

We investigate the piecewise-stationary combinatorial semi-bandit problem. Compared to the original combinatorial semi-bandit problem, our setting assumes the reward distributions of base arms may change in a piecewise-stationary manner at unknown time steps. We propose an algorithm, GLR-CUCB, which combines an efficient combinatorial semi-bandit algorithm, CUCB, with an almost parameter-free change-point detector, the Generalized Likelihood Ratio Test (GLRT). Our analysis shows that the regret of GLR-CUCB is upper bounded by $O(\sqrt{NKT \log T})$, where N is the number of piecewise-stationary segments, K is the number of base arms, and T is the number of time steps. As a complement, we also derive a nearly matching regret lower bound on the order of $\Omega(\sqrt{NKT})$, for both piecewise-stationary multi-armed bandits and combinatorial semi-bandits, using information-theoretic techniques and judiciously constructed piecewise-stationary bandit instances. Our lower bound is tighter than the best available regret lower bound, which is $\Omega(\sqrt{T})$. Numerical experiments on both synthetic and real-world datasets demonstrate the superiority of GLR-CUCB compared to other state-of-the-art algorithms.
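
As a rough illustration of the change-point detection idea, the sketch below computes a generalized likelihood ratio statistic for a stream of rewards in {0, 1}. It scans every split point and measures how much better two separate Bernoulli means explain the data than a single mean. This is a common textbook form under a Bernoulli assumption, not necessarily the exact detector used in GLR-CUCB. The threshold value and all function names are hypothetical.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def glr_statistic(observations):
    """Largest log-likelihood-ratio over all split points of the sequence."""
    n = len(observations)
    if n < 2:
        return 0.0
    total = sum(observations)
    overall_mean = total / n
    best = 0.0
    prefix = 0.0
    for s in range(1, n):  # split into the first s and the last n - s observations
        prefix += observations[s - 1]
        left_mean = prefix / s
        right_mean = (total - prefix) / (n - s)
        stat = (s * bernoulli_kl(left_mean, overall_mean)
                + (n - s) * bernoulli_kl(right_mean, overall_mean))
        best = max(best, stat)
    return best

def change_detected(observations, threshold):
    """Flag a change when the statistic exceeds a (hypothetical) threshold."""
    return glr_statistic(observations) > threshold
```

In a bandit loop, one such detector per base arm could be fed that arm's observed rewards, with the learner's statistics reset whenever a change is flagged. How the threshold should scale with the number of observations is left open here.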

arXiv.org e-Print Archive

Institutional Knowledge at Singapore Management University

Association for the Advancement of Artificial Intelligence: AAAI Publications

Boltzmann Exploration Done Right

Author: Cesa-Bianchi Nicolò
Gentile Claudio
Lugosi Gábor
Neu Gergely
Publication date: 01/01/2017

Boltzmann exploration is a classic strategy for sequential decision-making under uncertainty, and is one of the most standard tools in Reinforcement Learning (RL). Despite its widespread use, there is virtually no theoretical understanding about the limitations or the actual benefits of this exploration scheme. Does it drive exploration in a meaningful way? Is it prone to misidentifying the optimal actions or spending too much time exploring the suboptimal ones? What is the right tuning for the learning rate? In this paper, we address several of these questions in the classic setup of stochastic multi-armed bandits. One of our main results is showing that the Boltzmann exploration strategy with any monotone learning-rate sequence will induce suboptimal behavior. As a remedy, we offer a simple non-monotone schedule that guarantees near-optimal performance, albeit only when given prior access to key problem parameters that are typically not available in practical situations (like the time horizon $T$ and the suboptimality gap $\Delta$). More importantly, we propose a novel variant that uses different learning rates for different arms, and achieves a distribution-dependent regret bound of order $\frac{K\log^2 T}{\Delta}$ and a distribution-independent bound of order $\sqrt{KT}\log K$ without requiring such prior knowledge. To demonstrate the flexibility of our technique, we also propose a variant that guarantees the same performance bounds even if the rewards are heavy-tailed.
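
For readers unfamiliar with the baseline being analyzed, classic Boltzmann exploration draws each arm with probability proportional to the exponential of its empirical mean reward times a learning rate. The Python sketch below shows that baseline on a simulated Bernoulli bandit. It does not implement the paper's non-monotone schedule or its per-arm variant. The schedule passed in, the choice of 0.0 as the estimate for unpulled arms, and the function names are all illustrative assumptions.

```python
import math
import random

def boltzmann_probabilities(mean_estimates, learning_rate):
    """Softmax over learning_rate * empirical mean (numerically stabilized)."""
    scaled = [learning_rate * m for m in mean_estimates]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def run_boltzmann(true_means, horizon, learning_rate_schedule, seed=0):
    """Simulate Boltzmann exploration on Bernoulli arms; return pull counts."""
    rng = random.Random(seed)
    k = len(true_means)
    pulls = [0] * k
    reward_sums = [0.0] * k
    for t in range(1, horizon + 1):
        # Assumption: unpulled arms get an estimate of 0.0.
        estimates = [reward_sums[i] / pulls[i] if pulls[i] else 0.0
                     for i in range(k)]
        probs = boltzmann_probabilities(estimates, learning_rate_schedule(t))
        arm = rng.choices(range(k), weights=probs)[0]
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        reward_sums[arm] += reward
    return pulls
```

A call such as `run_boltzmann([0.2, 0.8], 500, lambda t: math.log(t + 1))` uses one example of a monotone schedule, the kind of tuning the abstract reports can lead to suboptimal behavior.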

arXiv.org e-Print Archive

AIR Universita degli studi di Milano

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot