An efficient algorithm for learning with semi-bandit feedback
We consider the problem of online combinatorial optimization under
semi-bandit feedback. The goal of the learner is to sequentially select its
actions from a combinatorial decision set so as to minimize its cumulative
loss. We propose a learning algorithm for this problem based on combining the
Follow-the-Perturbed-Leader (FPL) prediction method with a novel loss
estimation procedure called Geometric Resampling (GR). Contrary to previous
solutions, the resulting algorithm can be efficiently implemented for any
decision set where efficient offline combinatorial optimization is possible at
all. Assuming that the elements of the decision set can be described with
d-dimensional binary vectors with at most m non-zero entries, we show that the
expected regret of our algorithm after T rounds is $O(m\sqrt{dT\log d})$. As a
side result, we also improve the best known regret bounds for FPL in the full
information setting to $O(m^{3/2}\sqrt{T\log d})$, gaining a factor of $\sqrt{d/m}$
over previous bounds for this algorithm.
Comment: submitted to ALT 2013
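The Geometric Resampling idea is easy to state: the importance weights $1/p_{t,i}$ needed for unbiased loss estimates have no closed form under FPL, so GR estimates them by redrawing the perturbed leader until component i reappears; the number of redraws is geometric with mean $1/p_{t,i}$. Below is a minimal sketch for the simplest decision set, choosing m of d arms, where the offline oracle is a sort; the learning rate eta, the resampling cap M, and the helper names are assumptions of this sketch, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def fpl_leader(loss_est, eta, m, rng):
    """Offline oracle call: pick the m coordinates minimizing the
    exponentially perturbed cumulative loss estimates."""
    perturbed = loss_est - rng.exponential(1.0 / eta, size=loss_est.shape)
    return np.argsort(perturbed)[:m]

def fpl_gr(losses, m, eta, M):
    """FPL + Geometric Resampling sketch: `losses` is the T x d loss matrix,
    revealed semi-bandit style (only chosen coordinates are observed);
    M caps the resampling loop for bounded per-round running time."""
    T, d = losses.shape
    loss_est = np.zeros(d)                 # cumulative loss estimates
    total = 0.0
    for t in range(T):
        action = fpl_leader(loss_est, eta, m, rng)
        total += losses[t, action].sum()   # suffer the chosen coordinates
        updates = {}
        for i in action:
            # Geometric Resampling: count fresh perturbed-leader draws until
            # component i reappears; the count has expectation 1 / p_{t,i}.
            K = 1
            while K < M and i not in fpl_leader(loss_est, eta, m, rng):
                K += 1
            updates[i] = K * losses[t, i]  # importance-weighted estimate
        for i, u in updates.items():
            loss_est[i] += u               # update only after all resampling
    return total
```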
Online Learning with Switching Costs and Other Adaptive Adversaries
We study the power of different types of adaptive (nonoblivious) adversaries
in the setting of prediction with expert advice, under both full-information
and bandit feedback. We measure the player's performance using a new notion of
regret, also known as policy regret, which better captures the adversary's
adaptiveness to the player's behavior. In a setting where losses are allowed to
drift, we characterize, in a nearly complete manner, the power of adaptive
adversaries with bounded memories and switching costs. In particular, we show
that with switching costs, the attainable rate with bandit feedback is
$\widetilde{\Theta}(T^{2/3})$. Interestingly, this rate is significantly worse
than the $\Theta(\sqrt{T})$ rate attainable with switching costs in the
full-information case. Via a novel reduction from experts to bandits, we also
show that a bounded memory adversary can force $\widetilde{\Theta}(T^{2/3})$
regret even in the full information case, proving that switching costs are
easier to control than bounded memory adversaries. Our lower bounds rely on a
new stochastic adversary strategy that generates loss processes with strong
dependencies.
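The matching $\widetilde{\Theta}(T^{2/3})$ upper bound with switching costs comes from the standard batching reduction known from the broader literature (not a contribution of this paper): run a bandit algorithm such as EXP3 over blocks of length roughly $T^{1/3}$, holding the arm fixed within each block, so at most $T^{2/3}$ switches are ever paid. A minimal sketch, with eta, tau, and the loss-matrix layout as assumptions of this illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def exp3_batched(losses, switch_cost, tau, eta):
    """Batching sketch: EXP3 over blocks of length tau, arm fixed per block,
    so at most T/tau switches occur; tau ~ T^{1/3} balances the two costs."""
    T, N = losses.shape
    est = np.zeros(N)                      # cumulative loss estimates
    total, prev = 0.0, None
    for start in range(0, T, tau):
        p = np.exp(-eta * (est - est.min()))
        p /= p.sum()                       # exponential-weights distribution
        arm = rng.choice(N, p=p)
        block = losses[start:start + tau, arm]
        total += block.sum()
        if prev is not None and arm != prev:
            total += switch_cost           # pay only on an actual switch
        prev = arm
        est[arm] += block.mean() / p[arm]  # importance-weighted block loss
    return total
```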
Fighting Bandits with a New Kind of Smoothness
We define a novel family of algorithms for the adversarial multi-armed bandit
problem, and provide a simple analysis technique based on convex smoothing. We
prove two main results. First, we show that regularization via the
Tsallis entropy, which includes EXP3 as a special case, achieves the
$\Theta(\sqrt{TN})$ minimax regret. Second, we show that a wide class of
perturbation methods achieve a near-optimal regret as low as
$O(\sqrt{TN \log N})$ if the perturbation distribution has a bounded hazard
rate. For example,
the Gumbel, Weibull, Frechet, Pareto, and Gamma distributions all satisfy this
key property.
Comment: In Proceedings of NIPS, 2015
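The bounded-hazard-rate condition is straightforward to check numerically. The sketch below (an illustration, not the paper's code) evaluates the hazard rate $h(x) = f(x)/(1 - F(x))$ of the standard Gumbel distribution, which stays below 1 everywhere and approaches it as x grows:

```python
import numpy as np

def gumbel_hazard(x):
    """Hazard rate h(x) = f(x) / (1 - F(x)) of the standard Gumbel
    distribution; its boundedness is the key property named above."""
    f = np.exp(-x - np.exp(-x))     # Gumbel pdf
    F = np.exp(-np.exp(-x))         # Gumbel cdf
    return f / (1.0 - F)

xs = np.linspace(-5.0, 10.0, 1000)
print(gumbel_hazard(xs).max())      # stays below 1, the hazard-rate bound
```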
Minimax Policies for Combinatorial Prediction Games
We address the online linear optimization problem when the actions of the
forecaster are represented by binary vectors. Our goal is to understand the
magnitude of the minimax regret for the worst possible set of actions. We study
the problem under three different assumptions for the feedback: full
information, and the partial information models of the so-called "semi-bandit",
and "bandit" problems. We consider both -, and -type of
restrictions for the losses assigned by the adversary.
We formulate a general strategy using Bregman projections on top of a
potential-based gradient descent, which generalizes the ones studied in the
series of papers Gyorgy et al. (2007), Dani et al. (2008), Abernethy et al.
(2008), Cesa-Bianchi and Lugosi (2009), Helmbold and Warmuth (2009), Koolen et
al. (2010), Uchiya et al. (2010), Kale et al. (2010) and Audibert and Bubeck
(2010). We provide simple proofs that recover most of the previous results. We
propose new upper bounds for the semi-bandit game. Moreover we derive lower
bounds for all three feedback assumptions. With the only exception of the
bandit game, the upper and lower bounds are tight, up to a constant factor.
Finally, we answer a question asked by Koolen et al. (2010) by showing that the
exponentially weighted average forecaster is suboptimal against
$L_\infty$ adversaries.
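On the probability simplex with the entropic potential, the general Bregman-projection strategy reduces to the familiar exponential-weights update, since the KL projection onto the simplex is a renormalization. A minimal full-information sketch of that special case (a simplification; the paper works over the convex hull of the binary action vectors, and the semi-bandit and bandit variants replace the loss vector with an importance-weighted estimate):

```python
import numpy as np

def osmd_entropy(losses, eta):
    """Online mirror descent with the entropic potential on the simplex:
    a dual gradient step followed by a Bregman (KL) projection, which on
    the simplex is just renormalization."""
    T, d = losses.shape
    x = np.full(d, 1.0 / d)                # uniform starting point
    total = 0.0
    for t in range(T):
        total += x @ losses[t]             # full-information linear loss
        y = x * np.exp(-eta * losses[t])   # step in the dual coordinates
        x = y / y.sum()                    # KL projection onto the simplex
    return total
```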