
    Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

    Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off: the balance between staying with the option that gave the highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration-exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this survey, we focus on two extreme cases in which the analysis of regret is particularly simple and elegant: i.i.d. payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, we also analyze some of the most important variants and extensions, such as the contextual bandit model. Comment: To appear in Foundations and Trends in Machine Learning.
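
    To make the trade-off concrete, here is a minimal simulation of the i.i.d. setting the survey analyzes: in each round the learner pulls one arm and observes only that arm's payoff. The epsilon-greedy rule and the Bernoulli arm means below are illustrative assumptions for this sketch, not algorithms or parameters from the survey.

    ```python
    import random

    K, T = 3, 1000
    means = [0.2, 0.5, 0.8]   # hypothetical Bernoulli arm means (an assumption)
    counts = [0] * K          # number of pulls per arm
    totals = [0.0] * K        # cumulative payoff per arm

    for t in range(T):
        # epsilon-greedy: a basic way to balance exploration and exploitation
        if random.random() < 0.1 or 0 in counts:
            arm = random.randrange(K)  # explore a (possibly untried) arm
        else:
            # exploit the arm with the highest empirical mean payoff
            arm = max(range(K), key=lambda i: totals[i] / counts[i])
        payoff = 1.0 if random.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += payoff

    print("pulls per arm:", counts)  # the best arm should dominate
    ```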

    Delay and Cooperation in Nonstochastic Bandits

    We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm, and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\bigl(d+1+\tfrac{K}{N}\alpha_{\le d}\bigr)(T\ln K)}$, where $\alpha_{\le d}$ is the independence number of the $d$-th power of the connected communication graph $G$. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which are arbitrarily close to the full-information minimax regret $\sqrt{T\ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay. Comment: 30 pages.
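
    For orientation, here is a minimal single-agent Exp3 sketch. Per the abstract, Exp3-Coop extends this update so that each agent also incorporates loss estimates built from messages received from agents at most $d$ hops away; roughly, the importance weight becomes the probability that some agent in that neighborhood plays the arm, which lowers the estimator's variance. The learning rate and loss sequence below are illustrative assumptions.

    ```python
    import math, random

    def exp3(K, T, eta, loss_fn):
        log_w = [0.0] * K  # log-weights for numerical stability
        probs = [1.0 / K] * K
        for t in range(T):
            mx = max(log_w)
            expw = [math.exp(lw - mx) for lw in log_w]
            total = sum(expw)
            probs = [w / total for w in expw]
            arm = random.choices(range(K), weights=probs)[0]
            loss = loss_fn(t, arm)  # only the played arm's loss is observed
            # importance-weighted loss estimate; Exp3-Coop would instead divide
            # by the probability that *some* agent within d hops plays this arm
            log_w[arm] -= eta * loss / probs[arm]
        return probs

    print(exp3(K=5, T=2000, eta=0.05, loss_fn=lambda t, a: 0.5 * ((a + t) % 2)))
    ```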

    Explore no more: Improved high-probability regret bounds for non-stochastic bandits

    This work addresses the problem of regret minimization in non-stochastic multi-armed bandit problems, focusing on performance guarantees that hold with high probability. Such results are rather scarce in the literature, since proving them requires a great deal of technical effort and significant modifications to the standard, more intuitive algorithms that come only with guarantees that hold in expectation. One of these modifications is forcing the learner to sample arms from the uniform distribution at least $\Omega(\sqrt{T})$ times over $T$ rounds, which can adversely affect performance if many of the arms are suboptimal. While it is widely conjectured that this property is essential for proving high-probability regret bounds, we show in this paper that it is possible to achieve such strong results without this undesirable exploration component. Our result relies on a simple and intuitive loss-estimation strategy called Implicit eXploration (IX) that allows a remarkably clean analysis. To demonstrate the flexibility of our technique, we derive several improved high-probability bounds for various extensions of the standard multi-armed bandit framework. Finally, we conduct a simple experiment that illustrates the robustness of our implicit exploration technique. Comment: To appear at NIPS 2015.
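
    The IX idea is essentially a one-line change to Exp3: the observed loss is divided by $p+\gamma$ rather than $p$, which caps the magnitude of the loss estimate at the price of a small bias. A minimal sketch, with illustrative values of eta and gamma rather than the tuned values the paper derives:

    ```python
    import math, random

    def exp3_ix(K, T, eta, gamma, loss_fn):
        log_w = [0.0] * K
        probs = [1.0 / K] * K
        for t in range(T):
            mx = max(log_w)
            expw = [math.exp(lw - mx) for lw in log_w]  # stable softmax
            total = sum(expw)
            probs = [w / total for w in expw]
            arm = random.choices(range(K), weights=probs)[0]
            loss = loss_fn(t, arm)
            est = loss / (probs[arm] + gamma)  # IX: slightly biased, much lower variance
            log_w[arm] -= eta * est
        return probs

    probs = exp3_ix(K=10, T=5000, eta=0.05, gamma=0.025,
                    loss_fn=lambda t, a: random.random())
    print("max arm probability:", max(probs))
    ```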

    Boltzmann Exploration Done Right

    Boltzmann exploration is a classic strategy for sequential decision-making under uncertainty, and is one of the most standard tools in Reinforcement Learning (RL). Despite its widespread use, there is virtually no theoretical understanding of the limitations or the actual benefits of this exploration scheme. Does it drive exploration in a meaningful way? Is it prone to misidentifying the optimal actions or to spending too much time exploring the suboptimal ones? What is the right tuning for the learning rate? In this paper, we address several of these questions in the classic setup of stochastic multi-armed bandits. One of our main results is that the Boltzmann exploration strategy with any monotone learning-rate sequence will induce suboptimal behavior. As a remedy, we offer a simple non-monotone schedule that guarantees near-optimal performance, albeit only when given prior access to key problem parameters that are typically not available in practical situations (like the time horizon $T$ and the suboptimality gap $\Delta$). More importantly, we propose a novel variant that uses different learning rates for different arms and achieves a distribution-dependent regret bound of order $\frac{K\log^2 T}{\Delta}$ and a distribution-independent bound of order $\sqrt{KT}\log K$ without requiring such prior knowledge. To demonstrate the flexibility of our technique, we also propose a variant that guarantees the same performance bounds even if the rewards are heavy-tailed.
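
    For reference, here is the classic Boltzmann (softmax) exploration scheme the paper analyzes: arms are drawn with probability proportional to exp(eta_t times the empirical mean reward). The monotone schedule eta_t = log t below is one standard choice, and is exactly the kind of schedule the paper shows to be suboptimal; the arm means and horizon are illustrative assumptions.

    ```python
    import math, random

    def boltzmann(K, T, means):
        counts = [1] * K
        # one forced pull of each arm to initialize the empirical means
        sums = [1.0 if random.random() < means[i] else 0.0 for i in range(K)]
        for t in range(K + 1, T + 1):
            eta = math.log(t)  # a typical monotone learning-rate schedule (illustrative)
            scores = [eta * sums[i] / counts[i] for i in range(K)]
            mx = max(scores)
            expv = [math.exp(s - mx) for s in scores]  # stable softmax
            total = sum(expv)
            probs = [v / total for v in expv]
            arm = random.choices(range(K), weights=probs)[0]
            reward = 1.0 if random.random() < means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
        return counts

    print(boltzmann(K=3, T=3000, means=[0.4, 0.5, 0.6]))
    ```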

    Nonparametric Stochastic Contextual Bandits

    We analyze the $K$-armed bandit problem where the reward for each arm is a noisy realization based on an observed context, under mild nonparametric assumptions. We attain tight results for top-arm identification and a sublinear regret of $\widetilde{O}\big(T^{\frac{1+D}{2+D}}\big)$, where $D$ is the context dimension, for a modified UCB algorithm that is simple to implement ($k$NN-UCB). We then give regret bounds that depend on the global intrinsic dimension but are independent of the ambient dimension. We also discuss recovering topological structures within the context space based on expected bandit performance and provide an extension to infinite-armed contextual bandits. Finally, we experimentally show the improvement of our algorithm over existing multi-armed bandit approaches for both simulated tasks and MNIST image classification. Comment: AAAI 2018.
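
    A sketch in the spirit of the paper's $k$NN-UCB rule: each arm's reward is estimated from the $k$ past pulls of that arm whose contexts are nearest to the current context, plus an exploration bonus. The one-dimensional context, the bonus form, and the parameter values are illustrative assumptions for this sketch, not the paper's exact construction.

    ```python
    import math, random

    def knn_ucb_choose(context, history, K, k=5, c=1.0):
        """history[a] holds (context, reward) pairs for arm a."""
        best_arm, best_score = 0, -math.inf
        for a in range(K):
            if len(history[a]) < k:
                return a  # force enough initial pulls of every arm
            nearest = sorted(history[a], key=lambda cr: abs(cr[0] - context))[:k]
            mean = sum(r for _, r in nearest) / k
            bonus = c * math.sqrt(math.log(len(history[a])) / k)  # illustrative bonus
            if mean + bonus > best_score:
                best_arm, best_score = a, mean + bonus
        return best_arm

    history = [[] for _ in range(2)]
    for t in range(500):
        x = random.random()  # a 1-D context; the paper handles D-dimensional contexts
        a = knn_ucb_choose(x, history, K=2)
        r = 1.0 if random.random() < (x if a == 0 else 1.0 - x) else 0.0
        history[a].append((x, r))
    print("pulls per arm:", [len(h) for h in history])
    ```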

    Dynamic Ad Allocation: Bandits with Budgets

    We consider an application of multi-armed bandits to internet advertising (specifically, to dynamic ad allocation in the pay-per-click model, with uncertainty about the click probabilities). We focus on an important practical issue: advertisers are constrained in how much money they can spend on their ad campaigns. To the best of our knowledge, this issue has not been considered in prior work on bandit-based approaches to ad allocation. We define a simple, stylized model where an algorithm picks one ad to display in each round, and each ad has a budget: the maximal amount of money that can be spent on this ad. This model admits a natural variant of UCB1, a well-known algorithm for multi-armed bandits with stochastic rewards. We derive strong provable guarantees for this algorithm.
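
    A minimal sketch of such a budgeted variant of UCB1, under assumed model details: ads are ranked by the standard UCB1 index, an ad pays its cost-per-click only when clicked, and ads whose remaining budget cannot cover another click are dropped. The paper's exact algorithm and analysis may differ.

    ```python
    import math, random

    def budgeted_ucb1(click_p, cpc, budgets, T):
        K = len(click_p)
        counts, sums = [0] * K, [0.0] * K
        remaining = list(budgets)
        for t in range(1, T + 1):
            live = [a for a in range(K) if remaining[a] >= cpc[a]]  # affordable ads
            if not live:
                break  # every budget is exhausted
            untried = [a for a in live if counts[a] == 0]
            if untried:
                arm = untried[0]  # display each live ad once first
            else:
                # standard UCB1 index, restricted to ads with budget left
                arm = max(live, key=lambda a: sums[a] / counts[a]
                          + math.sqrt(2.0 * math.log(t) / counts[a]))
            clicked = random.random() < click_p[arm]
            counts[arm] += 1
            sums[arm] += float(clicked)
            if clicked:
                remaining[arm] -= cpc[arm]  # pay-per-click: spend only on clicks
        return remaining

    print(budgeted_ucb1([0.1, 0.3], cpc=[1.0, 1.0], budgets=[50.0, 50.0], T=5000))
    ```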

    Trend Detection based Regret Minimization for Bandit Problems

    We study a variation of the classical multi-armed bandit problem. In this problem, the learner has to make a sequence of decisions, picking from a fixed set of choices. In each round, she receives as feedback only the loss incurred by the chosen action. Conventionally, this problem has been studied with losses drawn from an unknown distribution or chosen adversarially. In this paper, we study the setting where the losses of the actions also satisfy certain structural properties, in particular a trend structure. When this holds, we show that using trend detection we can achieve regret of order $\tilde{O}(N\sqrt{TK})$ with respect to a switching strategy for the version of the problem where a single action is chosen in each round, and $\tilde{O}(Nm\sqrt{TK})$ when $m$ actions are chosen each round. This guarantee is a significant improvement over the conventional benchmark. Our approach can, as a framework, be applied in combination with various well-known bandit algorithms, like Exp3. For both versions of the problem, we give regret guarantees also for the anytime setting, i.e., when the length of the choice sequence is not known in advance. Finally, we pinpoint the advantages of our method by comparing it to some other well-known strategies.
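
    As a rough illustration of the framework, the sketch below wraps an Exp3-style base learner with a crude trend detector that restarts the learner when the recent average loss drifts away from the older one. The detector, window size, and threshold are this sketch's assumptions; the paper's trend-detection procedure and its guarantees against switching strategies are more refined.

    ```python
    import math, random
    from collections import deque

    def trend_reset_bandit(K, T, loss_fn, window=50, threshold=0.25, eta=0.1):
        log_w = [0.0] * K
        recent = deque(maxlen=2 * window)  # sliding window of observed losses
        resets = 0
        for t in range(T):
            mx = max(log_w)
            expw = [math.exp(lw - mx) for lw in log_w]
            total = sum(expw)
            probs = [w / total for w in expw]
            arm = random.choices(range(K), weights=probs)[0]
            loss = loss_fn(t, arm)
            log_w[arm] -= eta * loss / probs[arm]  # Exp3-style update
            recent.append(loss)
            if len(recent) == 2 * window:
                old = sum(list(recent)[:window]) / window
                new = sum(list(recent)[window:]) / window
                if abs(new - old) > threshold:  # crude trend-change flag
                    log_w = [0.0] * K           # restart the base learner
                    recent.clear()
                    resets += 1
        return resets

    # a loss sequence whose best action switches halfway through (an assumption)
    print(trend_reset_bandit(4, 4000, lambda t, a: float(a != (0 if t < 2000 else 3))))
    ```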
