Q-learning with Nearest Neighbors
We consider model-free reinforcement learning for infinite-horizon discounted
Markov Decision Processes (MDPs) with a continuous state space and unknown
transition kernel, when only a single sample path under an arbitrary policy of
the system is available. We consider the Nearest Neighbor Q-Learning (NNQL)
algorithm to learn the optimal Q function using a nearest neighbor regression
method. As the main contribution, we provide a tight finite-sample analysis of
the convergence rate. In particular, for MDPs with a $d$-dimensional state
space and discount factor $\gamma \in (0,1)$, given an arbitrary sample
path with "covering time" $L$, we establish that the algorithm is guaranteed
to output an $\epsilon$-accurate estimate of the optimal Q-function using
$\tilde{O}(L/(\epsilon^3(1-\gamma)^7))$ samples. For instance, for a
well-behaved MDP, the covering time of the sample path under the purely random
policy scales as $\tilde{O}(1/\epsilon^d)$, so the sample
complexity scales as $\tilde{O}(1/\epsilon^{d+3})$. Indeed, we
establish a lower bound showing that a dependence of $\tilde{\Omega}(1/\epsilon^{d+2})$ is necessary.
Comment: Accepted to NIPS 2018
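As a rough illustration of the idea only (not the authors' exact NNQL procedure, which uses carefully scheduled step sizes and a covering of the state space), the sketch below runs one-nearest-neighbor Q-learning over a fixed set of anchor states on a toy continuous chain; the anchor grid, learning rate, and environment are all illustrative assumptions.

```python
import numpy as np

class NearestNeighborQ:
    """Q-learning over a fixed set of anchor states: each continuous state
    is mapped to its nearest anchor (1-nearest-neighbor regression)."""

    def __init__(self, anchors, n_actions, gamma=0.9, lr=0.5):
        self.anchors = np.asarray(anchors, dtype=float)   # (m, d) anchors
        self.Q = np.zeros((len(self.anchors), n_actions))
        self.gamma, self.lr = gamma, lr

    def _nearest(self, s):
        # index of the anchor closest to the continuous state s
        return int(np.argmin(np.linalg.norm(self.anchors - s, axis=1)))

    def value(self, s):
        return self.Q[self._nearest(s)].max()

    def update(self, s, a, r, s_next):
        # one asynchronous Q-learning step, applied at the nearest anchor of s
        i = self._nearest(s)
        target = r + self.gamma * self.value(s_next)
        self.Q[i, a] += self.lr * (target - self.Q[i, a])

# Toy usage: a random walk on [0, 1] observed under a purely random policy.
rng = np.random.default_rng(0)
agent = NearestNeighborQ(np.linspace(0, 1, 21).reshape(-1, 1), n_actions=2)
s = np.array([0.5])
for _ in range(5000):
    a = int(rng.integers(2))                         # arbitrary behavior policy
    step = 0.05 if a == 1 else -0.05
    s_next = np.clip(s + step + rng.normal(0, 0.02), 0.0, 1.0)
    r = float(s_next[0] > 0.9)                       # reward near the right end
    agent.update(s, a, r, s_next)
    s = s_next
print("estimated value near the goal:", agent.value(np.array([0.95])))
```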
Non-parametric Approximate Dynamic Programming via the Kernel Method
This paper presents a novel and practical non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful, dimension-independent approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable replacement for state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by ‘kernelizing’ a recent mathematical program for ADP (the ‘smoothed’ approximate LP) proposed by Desai et al. (2011). Our theoretical guarantees establish that the quality of the approximation produced by our procedure improves gracefully with sampling effort. Via a computational study on a controlled queueing network, we show that our non-parametric procedure outperforms state-of-the-art parametric ADP approaches and established heuristics.
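For intuition only, here is a minimal sketch of the kernelization idea under stated assumptions: the value function is a kernel expansion over sampled states, the Bellman inequalities are imposed at sampled transitions, and violations are absorbed by penalized slacks. The paper's smoothed ALP budgets slack under a state-relevance distribution; a slack penalty and a Gaussian kernel are substituted here to keep the example to one LP solve, and the bandwidth, penalty kappa, weight box C, and toy chain are all assumptions rather than the paper's formulation.

```python
# Toy "kernelized smoothed ALP" sketch: solve for kernel weights w so that
# V(x) = sum_i w_i k(x_i, x) approximately satisfies V(x) >= r + gamma*V(x')
# at sampled transitions, with penalized slacks smoothing the constraints.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
gamma = 0.9

def rbf(A, B, h=0.5):
    # Gaussian kernel matrix between the rows of A and the rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * h * h))

# Sampled transitions (x_j, r_j, x'_j) from a 1-d toy chain under some policy.
n = 40
X = rng.uniform(0.0, 1.0, size=(n, 1))                     # sampled states
Xn = np.clip(X + rng.normal(0.0, 0.1, size=(n, 1)), 0, 1)  # next states
R = -np.abs(X - 0.5).ravel()                               # reward: stay near 0.5

# Decision variables are (w, s) with slacks s >= 0.
K, Kn = rbf(X, X), rbf(Xn, X)
kappa, C = 10.0, 50.0                  # slack penalty and box bound on weights

# minimize sum_j V(x_j) + kappa * sum_j s_j
c = np.concatenate([K.sum(axis=0), kappa * np.ones(n)])
# subject to V(x_j) >= r_j + gamma * V(x'_j) - s_j, rewritten as A_ub x <= b_ub
A_ub = np.hstack([-(K - gamma * Kn), -np.eye(n)])
b_ub = -R
bounds = [(-C, C)] * n + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
w = res.x[:n]
print("approximate values at the first sampled states:", np.round(K @ w, 3)[:5])
```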