383 research outputs found
Fighting Bandits with a New Kind of Smoothness
We define a novel family of algorithms for the adversarial multi-armed bandit
problem, and provide a simple analysis technique based on convex smoothing. We
prove two main results. First, we show that regularization via the
\emph{Tsallis entropy}, which includes EXP3 as a special case, achieves the
minimax regret. Second, we show that a wide class of
perturbation methods achieve a near-optimal regret as low as if the perturbation distribution has a bounded hazard rate. For example,
the Gumbel, Weibull, Frechet, Pareto, and Gamma distributions all satisfy this
key property.Comment: In Proceedings of NIPS, 201
A near-optimal change-detection based algorithm for piecewise-stationary combinatorial semi-bandits
We investigate the piecewise-stationary combinatorial semi-bandit problem. Compared to the original combinatorial semi-bandit problem, our setting assumes the reward distributions of base arms may change in a piecewise-stationary manner at unknown time steps. We propose an algorithm, GLR-CUCB, which incorporates an efficient combinatorial semi-bandit algorithm, CUCB, with an almost parameter-free change-point detector, the Generalized Likelihood Ratio Test (GLRT). Our analysis shows that the regret of GLR-CUCB is upper bounded by O(√NKT log T), where N is the number of piecewise-stationary segments, K is the number of base arms, and T is the number of time steps. As a complement, we also derive a nearly matching regret lower bound on the order of Ω(√NKT), for both piecewise-stationary multi-armed bandits and combinatorial semi-bandits, using information-theoretic techniques and judiciously constructed piecewise-stationary bandit instances. Our lower bound is tighter than the best available regret lower bound, which is Ω(√T). Numerical experiments on both synthetic and real-world datasets demonstrate the superiority of GLR-CUCB compared to other state-of-the-art algorithms
Boltzmann Exploration Done Right
Boltzmann exploration is a classic strategy for sequential decision-making
under uncertainty, and is one of the most standard tools in Reinforcement
Learning (RL). Despite its widespread use, there is virtually no theoretical
understanding about the limitations or the actual benefits of this exploration
scheme. Does it drive exploration in a meaningful way? Is it prone to
misidentifying the optimal actions or spending too much time exploring the
suboptimal ones? What is the right tuning for the learning rate? In this paper,
we address several of these questions in the classic setup of stochastic
multi-armed bandits. One of our main results is showing that the Boltzmann
exploration strategy with any monotone learning-rate sequence will induce
suboptimal behavior. As a remedy, we offer a simple non-monotone schedule that
guarantees near-optimal performance, albeit only when given prior access to key
problem parameters that are typically not available in practical situations
(like the time horizon and the suboptimality gap ). More
importantly, we propose a novel variant that uses different learning rates for
different arms, and achieves a distribution-dependent regret bound of order
and a distribution-independent bound of order
without requiring such prior knowledge. To demonstrate the
flexibility of our technique, we also propose a variant that guarantees the
same performance bounds even if the rewards are heavy-tailed
- …