
    Sequential Transfer in Multi-armed Bandit with Finite Set of Models

    Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi-armed bandit framework, where the objective is to minimize the cumulative regret over a sequence of tasks by incrementally transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for the estimation of the possible tasks and derive regret bounds for it.
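
    The abstract states the high-level idea rather than the algorithm itself. As a rough, hypothetical illustration of how a bandit learner might exploit a known finite set of candidate models (mean-reward vectors), consider the sketch below. It is not the paper's method-of-moments algorithm; the candidate models, the elimination rule, and all names are assumptions made for illustration.

    import numpy as np

    # Hypothetical sketch (not the paper's algorithm): a bandit learner that
    # knows the current task's mean-reward vector lies in a finite set of
    # candidate models. It eliminates models inconsistent with the observed
    # sample means and plays optimistically among the survivors.

    rng = np.random.default_rng(0)

    models = np.array([[0.9, 0.1, 0.5],      # candidate mean-reward vectors
                       [0.2, 0.8, 0.4],
                       [0.3, 0.3, 0.7]])
    true_model = models[1]                   # the (unknown) active task
    n_arms, horizon = models.shape[1], 5000

    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    active = np.ones(len(models), dtype=bool)

    for t in range(1, horizon + 1):
        # Confidence radius for each arm's empirical mean.
        radius = np.sqrt(2 * np.log(horizon) / np.maximum(counts, 1))
        means = sums / np.maximum(counts, 1)
        # Discard candidate models that disagree with the data on any pulled arm.
        for m in range(len(models)):
            if active[m] and np.any((counts > 0) &
                                    (np.abs(models[m] - means) > radius)):
                active[m] = False
        if not active.any():
            active[:] = True  # safety reset; should not happen w.h.p.
        # Optimism over the remaining models: play the best arm of the most
        # favourable surviving candidate.
        best = max(np.where(active)[0], key=lambda m: models[m].max())
        arm = int(np.argmax(models[best]))
        reward = rng.binomial(1, true_model[arm])
        counts[arm] += 1
        sums[arm] += reward

    print("surviving models:", np.where(active)[0],
          "| most-pulled arm:", int(np.argmax(counts)))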

    Reinforcement Learning with a Near Optimal Rate of Convergence

    We consider the problem of model-free reinforcement learning in Markovian decision processes (MDPs) under the PAC ("probably approximately correct") model. We introduce a new variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard Q-learning algorithm, and prove PAC bounds on the performance of SQL. The bounds show that for any MDP with n state-action pairs and discount factor \gamma \in [0, 1), a total of O(n \log(n/\delta)/((1-\gamma)^4 \epsilon^2)) steps suffices for the SQL algorithm to converge to an \epsilon-optimal action-value function with probability 1-\delta. We also establish a lower bound of \Omega(n \log(1/\delta)/((1-\gamma)^2 \epsilon^2)) for all reinforcement learning algorithms, which matches the upper bound in terms of \epsilon, \delta and n (up to a logarithmic factor). Further, our results have better dependencies on \epsilon and 1-\gamma, and thus are tighter than the best available results for Q-learning. SQL also improves on existing results for batch Q-value iteration, so far considered to be more efficient than incremental methods like Q-learning.
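
    For a quick sense of scale, the snippet below (my own illustration, not from the paper) evaluates the quoted upper and lower bounds for a few discount factors; the hidden constants are unknown, so only the growth rates as γ approaches 1 are meaningful.

    import math

    # Evaluate the quoted SQL upper bound n*log(n/delta)/((1-gamma)^4 eps^2)
    # and the lower bound n*log(1/delta)/((1-gamma)^2 eps^2) up to unknown
    # constants, to see how the 1/(1-gamma) exponent dominates. Illustration only.

    n, eps, delta = 1000, 0.1, 0.05
    for gamma in (0.9, 0.99, 0.999):
        upper = n * math.log(n / delta) / ((1 - gamma) ** 4 * eps ** 2)
        lower = n * math.log(1 / delta) / ((1 - gamma) ** 2 * eps ** 2)
        print(f"gamma={gamma}: upper ~ {upper:.3g}, lower ~ {lower:.3g}, "
              f"ratio ~ {upper / lower:.3g}")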

    Speedy Q-learning

    We introduce a new convergent variant of Q-learning, called speedy Q-learning, to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and discount factor γ, only T = O(log(n)/(ε^2 (1 - γ)^4)) steps are required for the SQL algorithm to converge to an ε-optimal action-value function with high probability. This bound has a better dependency on 1/ε and 1/(1 - γ), and thus is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration, which are considered to be more efficient than incremental methods like Q-learning.
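
    For concreteness, here is a minimal synchronous sketch of the speedy Q-learning update as I understand it from the paper: at each iteration every state-action pair draws one fresh next-state sample, and the update combines the empirical Bellman operator applied to the current and previous iterates with learning rate α_k = 1/(k+1). The toy MDP, the number of iterations, and all variable names are my own assumptions.

    import numpy as np

    # Minimal synchronous sketch of the speedy Q-learning (SQL) update on a
    # random toy MDP, assuming access to a generative model. Update, with
    # alpha_k = 1/(k+1):
    #   Q_{k+1} = Q_k + a_k (T_k Q_{k-1} - Q_k) + (1 - a_k)(T_k Q_k - T_k Q_{k-1})
    # where T_k Q(x,a) = r(x,a) + gamma * max_b Q(y_k, b) uses the sampled y_k.

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 5, 3, 0.9

    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition kernel
    R = rng.uniform(size=(n_states, n_actions))                       # mean rewards

    Q_prev = np.zeros((n_states, n_actions))
    Q = np.zeros((n_states, n_actions))

    for k in range(1, 5001):
        alpha = 1.0 / (k + 1)
        # One next-state sample per (x, a); the same sample is used for both
        # empirical Bellman backups T_k Q_prev and T_k Q.
        y = np.array([[rng.choice(n_states, p=P[x, a]) for a in range(n_actions)]
                      for x in range(n_states)])
        T_prev = R + gamma * Q_prev[y].max(axis=-1)
        T_curr = R + gamma * Q[y].max(axis=-1)
        Q_new = Q + alpha * (T_prev - Q) + (1 - alpha) * (T_curr - T_prev)
        Q_prev, Q = Q, Q_new

    # Compare against exact value iteration on the known model.
    Q_star = np.zeros_like(Q)
    for _ in range(2000):
        Q_star = R + gamma * (P @ Q_star.max(axis=1))
    print("max |Q_SQL - Q*| =", np.abs(Q - Q_star).max())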

    Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

    We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and discount factor γ ∈ [0, 1), only O(N log(N/δ)/[(1 - γ)^3 ε^2]) state-transition samples are required to find an ε-optimal estimation of the action-value function with probability (w.p.) 1-δ. Further, we prove that, for small values of ε, an order of O(N log(N/δ)/[(1 - γ)^3 ε^2]) samples is required to find an ε-optimal policy w.p. 1-δ. We also prove a matching lower bound of Ω(N log(N/δ)/[(1 - γ)^3 ε^2]) on the sample complexity of estimating the optimal action-value function. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bound matches the lower bound in terms of N, ε, δ and 1/(1 - γ) up to a constant factor. Also, both our lower bound and upper bound improve on the state-of-the-art in terms of their dependence on 1/(1 - γ).
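
    The model-based procedure analyzed here is easy to sketch: call the generative model m times per state-action pair, build the empirical transition kernel, and run value iteration on the empirical MDP. Below is a rough illustration under that reading; the toy MDP, the choice of m, and all names are assumptions.

    import numpy as np

    # Sketch of model-based Q-value iteration with a generative model: draw m
    # next-state samples for every (state, action), build the empirical
    # transition kernel P_hat, then run value iteration on the empirical MDP.

    rng = np.random.default_rng(1)
    n_states, n_actions, gamma, m = 6, 2, 0.95, 200

    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # true kernel
    R = rng.uniform(size=(n_states, n_actions))                       # mean rewards

    # Empirical model from m generative-model calls per (x, a).
    P_hat = np.zeros_like(P)
    for x in range(n_states):
        for a in range(n_actions):
            samples = rng.choice(n_states, size=m, p=P[x, a])
            P_hat[x, a] = np.bincount(samples, minlength=n_states) / m

    def q_value_iteration(P, R, gamma, iters=2000):
        Q = np.zeros_like(R)
        for _ in range(iters):
            Q = R + gamma * (P @ Q.max(axis=1))
        return Q

    Q_hat = q_value_iteration(P_hat, R, gamma)   # computed from samples only
    Q_star = q_value_iteration(P, R, gamma)      # exact, for comparison
    print("max |Q_hat - Q*| =", np.abs(Q_hat - Q_star).max())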

    Online Stochastic Optimization under Correlated Bandit Feedback

    In this paper we consider the problem of online stochastic optimization of a locally smooth function under bandit feedback. We introduce the high-confidence tree (HCT) algorithm, a novel any-time \mathcal{X}-armed bandit algorithm, and derive regret bounds matching the performance of the existing state-of-the-art in terms of dependency on the number of steps and the smoothness factor. The main advantage of HCT is that it handles the challenging case of correlated rewards, whereas existing methods require that the reward-generating process of each arm is an independent and identically distributed (i.i.d.) random process. HCT also improves on the state-of-the-art in terms of its memory requirement, as well as requiring a weaker smoothness assumption on the mean-reward function compared to previous anytime algorithms. Finally, we discuss how HCT can be applied to the problem of policy search in reinforcement learning and we report preliminary empirical results.
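
    HCT itself involves carefully scheduled confidence thresholds and tree refreshes, so the sketch below only illustrates the general optimistic tree-search idea behind \mathcal{X}-armed bandits (a simplified, HOO-style scheme on [0, 1]); it is not HCT, and the reward function, splitting rule, and constants are invented for illustration.

    import math, random

    # A simplified, HOO-style optimistic tree search on [0, 1]: an illustration
    # of the X-armed bandit setting that HCT addresses, not the HCT algorithm.
    # Node scores combine an empirical mean, a confidence width, and a
    # diameter term for the node's cell.

    def noisy_f(x):
        """Unknown mean-reward function observed with Gaussian noise."""
        return 0.5 * (1 - abs(x - 0.7)) + random.gauss(0, 0.1)

    class Node:
        def __init__(self, lo, hi, depth):
            self.lo, self.hi, self.depth = lo, hi, depth
            self.n, self.total = 0, 0.0
            self.children = []

        def score(self, t, rho=0.5):
            if self.n == 0:
                return float("inf")
            mean = self.total / self.n
            conf = math.sqrt(2 * math.log(t + 1) / self.n)
            return mean + conf + rho ** self.depth   # optimistic value of the cell

    def pull(root, t):
        # Descend the tree, always following the most optimistic child.
        node, path = root, [root]
        while node.children:
            node = max(node.children, key=lambda c: c.score(t))
            path.append(node)
        # Sample the midpoint of the selected leaf and update statistics.
        x = 0.5 * (node.lo + node.hi)
        r = noisy_f(x)
        for n in path:
            n.n += 1
            n.total += r
        # Split the leaf once it has been sampled a couple of times.
        if node.n >= 2 and node.hi - node.lo > 1e-3:
            mid = 0.5 * (node.lo + node.hi)
            node.children = [Node(node.lo, mid, node.depth + 1),
                             Node(mid, node.hi, node.depth + 1)]

    root = Node(0.0, 1.0, 0)
    for t in range(1, 2001):
        pull(root, t)
    # Report the most-sampled leaf as the estimated maximizer.
    leaf = root
    while leaf.children:
        leaf = max(leaf.children, key=lambda c: c.n)
    print("estimated maximizer near", 0.5 * (leaf.lo + leaf.hi))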

    On the Sample Complexity of Reinforcement Learning with a Generative Model

    We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample complexity of the model-based value iteration algorithm in the presence of a generative model, which indicates that for an MDP with N state-action pairs and discount factor \gamma \in [0, 1), only O(N\log(N/\delta)/((1-\gamma)^3\epsilon^2)) samples are required to find an \epsilon-optimal estimation of the action-value function with probability 1-\delta. We also prove a matching lower bound of \Theta(N\log(N/\delta)/((1-\gamma)^3\epsilon^2)) on the sample complexity of estimating the optimal action-value function by any RL algorithm. To the best of our knowledge, this is the first matching result on the sample complexity of estimating the optimal (action-) value function, in which the upper bound matches the lower bound of RL in terms of N, \epsilon, \delta and 1/(1-\gamma). Also, both our lower bound and our upper bound significantly improve on the state-of-the-art in terms of 1/(1-\gamma).