Stationary Mixing Bandits
We study the bandit problem where arms are associated with stationary
phi-mixing processes and where rewards are therefore dependent: the question
that arises from this setting is how to recover some independence by ignoring
the values of some rewards. As we shall see, the bandit problem we tackle
requires us to address an exploration/exploitation/independence trade-off. To
do so, we provide a UCB strategy together with a general regret analysis for
the case where the size of the independence blocks (the ignored rewards) is
fixed, and we go a step further by providing an algorithm that is able to
compute the size of the independence blocks from the data. Finally, we give an
analysis of our bandit problem in the restless case, i.e., the situation where
the time counters of all mixing processes evolve simultaneously.
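To make the block device concrete, here is a minimal Python sketch (my
illustration, not the paper's algorithm): a UCB1-style index is maintained
only over rewards that are block_size pulls apart, and the intervening rewards
are drawn and discarded, which is what restores approximate independence for a
phi-mixing process. The fixed block size, the AR(1) demo arms, and all
constants are assumptions.

```python
import math
import random

def block_ucb(arms, horizon, block_size):
    """UCB1 with fixed-size independence blocks (a sketch).

    arms: list of callables; arms[i]() advances arm i's mixing process
    by one step and returns the next reward in [0, 1]. Within each block
    of block_size pulls only the last reward is retained, so retained
    rewards are block_size steps apart and approximately independent.
    """
    counts = [0] * len(arms)   # retained rewards per arm
    means = [0.0] * len(arms)  # empirical mean of retained rewards
    blocks = 0
    while (blocks + 1) * block_size <= horizon:
        if 0 in counts:
            i = counts.index(0)  # play each arm once first
        else:
            i = max(range(len(arms)),
                    key=lambda a: means[a] + math.sqrt(2 * math.log(blocks) / counts[a]))
        for _ in range(block_size):   # burn block_size - 1 rewards,
            reward = arms[i]()        # keep only the last one
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]
        blocks += 1
    return means

def make_ar1_arm(mu, rho=0.9):
    """A toy mixing arm: mean mu plus AR(1) noise (hypothetical)."""
    state = [0.0]
    def pull():
        state[0] = rho * state[0] + random.gauss(0.0, 0.05)
        return min(max(mu + state[0], 0.0), 1.0)
    return pull

print(block_ucb([make_ar1_arm(0.4), make_ar1_arm(0.6)],
                horizon=20_000, block_size=10))
```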
Regret Bounds for Reinforcement Learning with Policy Advice
In some reinforcement learning problems an agent may be provided with a set
of input policies, perhaps learned from prior experience or provided by
advisors. We present a reinforcement learning with policy advice (RLPA)
algorithm which leverages this input set and learns to use the best policy in
the set for the reinforcement learning task at hand. We prove that RLPA has a
sub-linear regret of \tilde O(\sqrt{T}) relative to the best input policy, and
that both this regret and its computational complexity are independent of the
size of the state and action spaces. Our empirical simulations support our
theoretical analysis. This suggests that RLPA may offer significant advantages
in large domains where good prior policies are provided.
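As a rough illustration of the idea, and not the RLPA algorithm itself, the
sketch below treats each input policy as an arm of a bandit: each candidate is
executed for a fixed-length trial via a hypothetical run_policy helper, and a
UCB index over average per-step reward decides which policy to run next.

```python
import math

def choose_among_policies(policies, run_policy, horizon, trial_len=100):
    """UCB-style selection among input policies (a sketch, not RLPA).

    run_policy(pi, steps) is an assumed helper: it executes policy pi in
    the environment for `steps` steps and returns the mean per-step
    reward in [0, 1]. trial_len is a hypothetical constant.
    """
    counts = [0] * len(policies)
    means = [0.0] * len(policies)
    for t in range(horizon // trial_len):
        if 0 in counts:
            i = counts.index(0)            # try each policy once
        else:
            i = max(range(len(policies)),  # UCB over per-policy estimates
                    key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = run_policy(policies[i], trial_len)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
    return policies[max(range(len(policies)), key=lambda a: means[a])]
```

Note that the loop only ever indexes the policy set, never the states or
actions, which mirrors the abstract's claim that both regret and computational
complexity are independent of the size of the state and action spaces.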
Time-Varying Gaussian Process Bandit Optimization
We consider the sequential Bayesian optimization problem with bandit
feedback, adopting a formulation that allows for the reward function to vary
with time. We model the reward function using a Gaussian process whose
evolution obeys a simple Markov model. We introduce two natural extensions of
the classical Gaussian process upper confidence bound (GP-UCB) algorithm. The
first, R-GP-UCB, resets GP-UCB at regular intervals. The second, TV-GP-UCB,
instead forgets about old data in a smooth fashion. Our main contribution
comprises novel regret bounds for these algorithms, providing an explicit
characterization of the trade-off between the time horizon and the rate at
which the function varies. We illustrate the performance of the algorithms on
both synthetic and real data, and we find the gradual forgetting of TV-GP-UCB
to perform favorably compared to the sharp resetting of R-GP-UCB. Moreover,
both algorithms significantly outperform classical GP-UCB, which treats stale
and fresh data equally.
Comment: To appear in AISTATS 201
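One simple way to realize the smooth forgetting the abstract describes,
sketched below as my own illustration rather than the authors' implementation,
is to damp the kernel between two observations by a factor that decays with
their separation in time, so stale data contributes little to the GP posterior
instead of being discarded outright. The RBF kernel, the forgetting rate eps,
and the remaining hyperparameters are all assumed values.

```python
import numpy as np

def tv_gp_ucb_step(X, y, times, candidates, t_now, eps=0.01,
                   lengthscale=0.2, noise=0.1, beta=2.0):
    """One acquisition step of a TV-GP-UCB-style rule (a sketch).

    X: past query points, shape (n, d); y: past rewards, shape (n,);
    times: observation times, shape (n,). The kernel between two
    observations is a spatial RBF term damped by (1 - eps)**(dt / 2),
    so distant-in-time data is smoothly forgotten.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    times = np.asarray(times, dtype=float)

    def k_spatial(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    # Time-varying kernel: spatial similarity times a smooth forgetting factor
    K = k_spatial(X, X) * (1 - eps) ** (np.abs(times[:, None] - times[None, :]) / 2)
    K += noise ** 2 * np.eye(len(X))

    C = np.asarray(candidates, dtype=float)
    k_star = k_spatial(X, C) * (1 - eps) ** (np.abs(times[:, None] - t_now) / 2)

    # Standard GP posterior mean and variance at the candidate points
    alpha = np.linalg.solve(K, y)
    mu = k_star.T @ alpha
    v = np.linalg.solve(K, k_star)
    var = np.clip(1.0 - np.sum(k_star * v, axis=0), 1e-12, None)

    # UCB acquisition: pick the candidate maximizing mean + beta * stddev
    return C[np.argmax(mu + beta * np.sqrt(var))]
```

Under the same interface, a resetting rule in the spirit of R-GP-UCB would
simply empty X, y, and times every fixed number of steps instead of damping
the kernel.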