
    Explore no more: Improved high-probability regret bounds for non-stochastic bandits

    This work addresses the problem of regret minimization in non-stochastic multi-armed bandit problems, focusing on performance guarantees that hold with high probability. Such results are rather scarce in the literature since proving them requires a great deal of technical effort and significant modifications to the standard, more intuitive algorithms that come only with guarantees that hold in expectation. One of these modifications is forcing the learner to sample arms from the uniform distribution at least $\Omega(\sqrt{T})$ times over $T$ rounds, which can adversely affect performance if many of the arms are suboptimal. While it is widely conjectured that this property is essential for proving high-probability regret bounds, we show in this paper that it is possible to achieve such strong results without this undesirable exploration component. Our result relies on a simple and intuitive loss-estimation strategy called Implicit eXploration (IX) that allows a remarkably clean analysis. To demonstrate the flexibility of our technique, we derive several improved high-probability bounds for various extensions of the standard multi-armed bandit framework. Finally, we conduct a simple experiment that illustrates the robustness of our implicit exploration technique.
    Comment: To appear at NIPS 2015
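    The IX estimator described in this abstract is simple enough to sketch in a few lines. The following is a minimal, illustrative EXP3-style learner with implicit exploration, not the authors' reference implementation: the learning rate eta, the IX parameter gamma, and the toy Bernoulli loss sequence are arbitrary choices for the demo rather than the tuning prescribed in the paper.

```python
import numpy as np

def exp3_ix(losses, eta=0.05, gamma=0.025, seed=0):
    """EXP3 with Implicit eXploration (IX) loss estimates.

    losses: (T, K) array of per-round, per-arm losses in [0, 1].
    The IX estimate divides the observed loss by p + gamma instead of p,
    which slightly biases the estimate but keeps it bounded, without any
    forced uniform exploration.
    """
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    log_w = np.zeros(K)              # log-weights for numerical stability
    total_loss = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                 # sampling distribution over arms
        arm = rng.choice(K, p=p)
        loss = losses[t, arm]
        total_loss += loss
        # Implicit eXploration estimator: observed loss / (p + gamma)
        est = np.zeros(K)
        est[arm] = loss / (p[arm] + gamma)
        log_w -= eta * est           # multiplicative-weights update on losses
    return total_loss

# Toy experiment: one good arm among several bad ones.
rng = np.random.default_rng(1)
T, K = 10_000, 10
means = np.full(K, 0.6); means[3] = 0.4            # arm 3 has the lowest loss rate
losses = (rng.random((T, K)) < means).astype(float)
print("learner loss:", exp3_ix(losses), " best-arm loss:", losses[:, 3].sum())
```

    The only difference from the vanilla importance-weighted estimator is the extra gamma in the denominator; that small modification is what the abstract credits with enabling high-probability guarantees without the forced uniform exploration component.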

    Lower Bounds on Regret for Noisy Gaussian Process Bandit Optimization

    In this paper, we consider the problem of sequentially optimizing a black-box function $f$ based on noisy samples and bandit feedback. We assume that $f$ is smooth in the sense of having a bounded norm in some reproducing kernel Hilbert space (RKHS), yielding a commonly-considered non-Bayesian form of Gaussian process bandit optimization. We provide algorithm-independent lower bounds on the simple regret, measuring the suboptimality of a single point reported after $T$ rounds, and on the cumulative regret, measuring the sum of regrets over the $T$ chosen points. For the isotropic squared-exponential kernel in $d$ dimensions, we find that an average simple regret of $\epsilon$ requires $T = \Omega\big(\frac{1}{\epsilon^2} (\log\frac{1}{\epsilon})^{d/2}\big)$, and the average cumulative regret is at least $\Omega\big( \sqrt{T(\log T)^{d/2}} \big)$, thus matching existing upper bounds up to the replacement of $d/2$ by $2d+O(1)$ in both cases. For the Matérn-$\nu$ kernel, we give analogous bounds of the form $\Omega\big( (\frac{1}{\epsilon})^{2+d/\nu} \big)$ and $\Omega\big( T^{\frac{\nu + d}{2\nu + d}} \big)$, and discuss the resulting gaps to the existing upper bounds.
    Comment: Appearing in COLT 2017. This version corrects a few minor mistakes in Table I, which summarizes the new and existing regret bounds
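    This abstract is about lower bounds rather than an algorithm, but the setting the bounds apply to (noisy bandit feedback on an RKHS-smooth objective, with simple and cumulative regret) can be made concrete with a small GP-UCB-style loop. The sketch below only illustrates that setting on a 1-D toy function; the squared-exponential length scale, observation noise, confidence weight beta, and grid discretization are assumptions for the example, not values from the paper.

```python
import numpy as np

def se_kernel(a, b, ls=0.2):
    """Isotropic squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xstar, noise=0.1, ls=0.2):
    """Standard GP regression posterior mean and variance on a grid."""
    K = se_kernel(X, X, ls) + noise ** 2 * np.eye(len(X))
    Ks = se_kernel(X, Xstar, ls)
    alpha = np.linalg.solve(K, y)
    mean = Ks.T @ alpha
    var = 1.0 - np.einsum('ij,ij->j', Ks, np.linalg.solve(K, Ks))
    return mean, np.maximum(var, 1e-12)

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)
f = np.sin(6 * grid) * np.exp(-grid)          # unknown smooth objective
fmax = f.max()

X_idx, y, cum_regret = [], [], 0.0
beta = 2.0                                    # UCB confidence weight (arbitrary)
for t in range(100):
    if not X_idx:
        idx = int(rng.integers(len(grid)))    # first query at random
    else:
        mean, var = gp_posterior(grid[X_idx], np.array(y), grid)
        idx = int(np.argmax(mean + beta * np.sqrt(var)))  # UCB acquisition
    X_idx.append(idx)
    y.append(f[idx] + 0.1 * rng.standard_normal())        # noisy bandit feedback
    cum_regret += fmax - f[idx]

best = X_idx[int(np.argmax(y))]               # report the point with the largest noisy value
print("cumulative regret over 100 rounds:", round(cum_regret, 3))
print("simple regret of reported point  :", round(fmax - f[best], 3))
```

    The loop is only meant to show what "simple regret" and "cumulative regret" refer to here; the paper's result is that no acquisition rule, GP-UCB-style or otherwise, can beat the stated lower-bound rates.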

    Learning Algorithms for Minimizing Queue Length Regret

    We consider a system consisting of a single transmitter/receiver pair and $N$ channels over which they may communicate. Packets randomly arrive to the transmitter's queue and wait to be successfully sent to the receiver. The transmitter may attempt a frame transmission on one channel at a time, where each frame includes a packet if one is in the queue. For each channel, an attempted transmission is successful with an unknown probability. The transmitter's objective is to quickly identify the best channel to minimize the number of packets in the queue over $T$ time slots. To analyze system performance, we introduce queue length regret, which is the expected difference between the total queue length of a learning policy and that of a controller that knows the rates a priori. One approach to designing a transmission policy would be to apply algorithms from the literature that solve the closely-related stochastic multi-armed bandit problem. These policies would focus on maximizing the number of successful frame transmissions over time. However, we show that these methods have $\Omega(\log T)$ queue length regret. On the other hand, we show that there exists a set of queue-length based policies that can obtain order-optimal $O(1)$ queue length regret. We use our theoretical analysis to devise heuristic methods that are shown to perform well in simulation.
    Comment: 28 pages, 11 figures
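    To make the queue length regret metric concrete, here is a toy simulation of the single-queue, N-channel setting. The UCB baseline and the queue-aware heuristic are illustrative stand-ins rather than the paper's policies: the arrival rate, channel success probabilities, and the rule of exploring only while the queue is empty are assumptions chosen for the demo.

```python
import numpy as np

def simulate(policy, mu, arrival_p=0.3, T=20_000, seed=0):
    """Single-queue, N-channel simulation; returns total queue length over T slots."""
    rng = np.random.default_rng(seed)
    N = len(mu)
    succ = np.zeros(N)                          # successes per channel
    tries = np.zeros(N)                         # attempts per channel
    q, total_q = 0, 0
    for t in range(1, T + 1):
        q += rng.random() < arrival_p           # Bernoulli packet arrival
        ch = policy(t, q, succ, tries, N, rng)
        ok = rng.random() < mu[ch]              # unknown success probability
        tries[ch] += 1
        succ[ch] += ok
        if ok and q > 0:
            q -= 1                              # a queued packet departs
        total_q += q
    return total_q

def ucb_policy(t, q, succ, tries, N, rng):
    """Classic UCB1 on transmission successes, applied in every slot."""
    if np.any(tries == 0):
        return int(np.argmin(tries))
    return int(np.argmax(succ / tries + np.sqrt(2 * np.log(t) / tries)))

def queue_aware_policy(t, q, succ, tries, N, rng):
    """Heuristic: explore while the queue is empty, play the empirically
    best channel whenever packets are actually waiting."""
    if q == 0 or np.any(tries == 0):
        return int(rng.integers(N))
    return int(np.argmax(succ / np.maximum(tries, 1)))

mu = np.array([0.5, 0.6, 0.75, 0.9])            # per-channel success rates (arbitrary)
oracle = lambda t, q, s, n, N, rng: 3           # genie that always uses the best channel
base = simulate(oracle, mu)                     # one sample path, same seed for all policies
print("queue length regret, UCB        :", simulate(ucb_policy, mu) - base)
print("queue length regret, queue-aware:", simulate(queue_aware_policy, mu) - base)
```

    Because a frame sent while the queue is empty carries no packet, exploration in idle slots costs nothing, which is the intuition behind how queue-length-aware policies can avoid the $\Omega(\log T)$ queue length regret incurred by policies that explore regardless of the queue state.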