
    Q-learning with Nearest Neighbors

    We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path under an arbitrary policy of the system is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm to learn the optimal Q function using a nearest neighbor regression method. As the main contribution, we provide a tight finite-sample analysis of the convergence rate. In particular, for MDPs with a $d$-dimensional state space and discount factor $\gamma \in (0,1)$, given an arbitrary sample path with "covering time" $L$, we establish that the algorithm is guaranteed to output an $\varepsilon$-accurate estimate of the optimal Q-function using $\tilde{O}\big(L/(\varepsilon^3(1-\gamma)^7)\big)$ samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as $\tilde{O}\big(1/\varepsilon^d\big)$, so the sample complexity scales as $\tilde{O}\big(1/\varepsilon^{d+3}\big)$. Indeed, we establish a lower bound showing that a dependence of $\tilde{\Omega}\big(1/\varepsilon^{d+2}\big)$ is necessary. Comment: Accepted to NIPS 2018
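
    As a rough illustration of the nearest-neighbor idea (a sketch only, not the authors' exact NNQL algorithm), the snippet below keeps a Q-table over a finite set of anchor states and maps every observed continuous state to its nearest anchor; the class name NNQAgent, the constant step size, and the Euclidean metric are assumptions made for this example.

```python
import numpy as np

class NNQAgent:
    """Sketch of nearest-neighbor Q-learning on a continuous state space.

    Q-values are stored only at a finite set of anchor states; each observed
    state is mapped to its nearest anchor (the nearest-neighbor regression step).
    """

    def __init__(self, anchors, n_actions, gamma=0.9, lr=0.1):
        self.anchors = np.asarray(anchors)          # shape (m, d)
        self.Q = np.zeros((len(self.anchors), n_actions))
        self.gamma, self.lr = gamma, lr

    def _nearest(self, state):
        # Index of the anchor closest to the observed continuous state.
        return int(np.argmin(np.linalg.norm(self.anchors - state, axis=1)))

    def update(self, s, a, r, s_next):
        # Standard Q-learning update applied at the nearest anchors of s and s'.
        i, j = self._nearest(s), self._nearest(s_next)
        target = r + self.gamma * self.Q[j].max()
        self.Q[i, a] += self.lr * (target - self.Q[i, a])

    def greedy_action(self, s):
        return int(self.Q[self._nearest(s)].argmax())
```

    A finer anchor grid shrinks the discretization error at the cost of a longer covering time, which mirrors the $1/\varepsilon^d$ factor in the sample complexity above.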

    Distributionally Robust Optimization for Sequential Decision Making

    The distributionally robust Markov Decision Process (MDP) approach seeks a policy that achieves the maximal expected total reward under the most adversarial distribution of uncertain parameters. In this paper, we study distributionally robust MDPs whose ambiguity sets for the uncertain parameters can easily encode both generalized-moment and statistical-distance information about the uncertainty. In this way, we generalize existing work on distributionally robust MDPs with generalized-moment-based and statistical-distance-based ambiguity sets, combining information from the former class, such as moments and dispersions, with the latter class, which depends critically on empirical observations of the uncertain parameters. We show that, under this format of ambiguity sets, the resulting distributionally robust MDP remains tractable under mild technical conditions. More specifically, a distributionally robust policy can be constructed by solving a sequence of one-stage convex optimization subproblems.
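
    A one-stage convex subproblem of this kind can be pictured as a worst-case expectation over an ambiguity set. The sketch below is only an assumed illustration, not the paper's formulation: it uses a finite support and a mean-interval (generalized-moment) constraint, so the adversary's inner problem reduces to a small linear program.

```python
import numpy as np
from scipy.optimize import linprog

def robust_backup(values, feature, mu, delta):
    """Worst-case expected next-state value over a moment-based ambiguity set.

    Ambiguity set (illustrative): distributions p on a finite support with
        p >= 0, sum(p) = 1, |E_p[feature] - mu| <= delta.
    The adversary minimizes E_p[values]; this is the kind of one-stage convex
    subproblem that appears inside a robust Bellman backup.
    """
    n = len(values)
    c = np.asarray(values, dtype=float)             # adversary minimizes p . values
    A_eq = np.ones((1, n)); b_eq = [1.0]            # probabilities sum to one
    A_ub = np.vstack([feature, -np.asarray(feature, dtype=float)])
    b_ub = [mu + delta, -(mu - delta)]              # moment constraint as two inequalities
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * n)
    return res.fun

# Example: worst-case value when E_p[feature] is constrained to [0.8, 1.2].
print(robust_backup(values=[0.0, 1.0, 2.0], feature=[0.0, 1.0, 2.0], mu=1.0, delta=0.2))
```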

    Deep Residual Reinforcement Learning

    We revisit residual algorithms in both model-free and model-based reinforcement learning settings. We propose the bidirectional target network technique to stabilize residual algorithms, yielding a residual version of DDPG that significantly outperforms vanilla DDPG on the DeepMind Control Suite benchmark. Moreover, we find residual algorithms to be an effective approach to the distribution mismatch problem in model-based planning. Compared with the existing TD($k$) method, our residual-based method makes weaker assumptions about the model and yields a greater performance boost. Comment: AAMAS 2020
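
    To make the residual idea concrete, the sketch below shows a residual-gradient TD update with linear function approximation, in which the Bellman residual is differentiated through both the prediction and the bootstrapped target. It illustrates only the general principle residual algorithms build on, not the paper's bidirectional target network or its DDPG variant.

```python
import numpy as np

def residual_td_update(w, phi_s, phi_s_next, r, gamma=0.99, lr=0.05):
    """One residual-gradient update with a linear value function (sketch).

    Unlike semi-gradient TD, the Bellman residual
        delta = r + gamma * w.phi(s') - w.phi(s)
    is differentiated through BOTH terms that depend on w.
    """
    delta = r + gamma * w @ phi_s_next - w @ phi_s
    grad = gamma * phi_s_next - phi_s        # d(delta)/dw includes the target term
    return w - lr * delta * grad             # gradient step on 0.5 * delta**2

# Example with three illustrative features.
w = np.zeros(3)
w = residual_td_update(w, phi_s=np.array([1.0, 0.0, 0.0]),
                       phi_s_next=np.array([0.0, 1.0, 0.0]), r=1.0)
```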

    Certified Reinforcement Learning with Logic Guidance

    This paper proposes the first model-free Reinforcement Learning (RL) framework to synthesise policies for unknown, continuous-state Markov Decision Processes (MDPs) such that a given linear temporal property is satisfied. We convert the given property into a Limit Deterministic Büchi Automaton (LDBA), namely a finite-state machine expressing the property. Exploiting the structure of the LDBA, we shape a synchronous reward function on-the-fly, so that an RL algorithm can synthesise a policy resulting in traces that probabilistically satisfy the linear temporal property. When the state space of the MDP is finite, this probability (certificate) is also calculated in parallel with policy learning; as such, the RL algorithm produces a policy that is certified with respect to the property. Under the assumption of a finite state space, theoretical guarantees are provided on the convergence of the RL algorithm to an optimal policy maximising the above probability. We also show that our method produces "best available" control policies when the logical property cannot be satisfied. In the general case of a continuous state space, we propose a neural network architecture for RL and we empirically show that the algorithm finds satisfying policies, if such policies exist. The performance of the proposed framework is evaluated via a set of numerical examples and benchmarks, where we observe an improvement of one order of magnitude in the number of iterations required for policy synthesis, compared to existing approaches whenever available. Comment: This article draws from arXiv:1801.08099, arXiv:1809.0782
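
    One way to picture the on-the-fly synchronous reward is as a product of the environment and the LDBA that pays out whenever an accepting automaton state is reached. The sketch below assumes illustrative interfaces (ldba.initial, ldba.step, ldba.accepting, label_fn, and a gym-style env.step) that are not taken from the paper.

```python
class ProductEnv:
    """Sketch of synchronising an environment with an LDBA for reward shaping.

    Assumed interfaces: the automaton exposes `initial`, `step(label) -> state`,
    and a set `accepting`; `label_fn` maps an MDP state to the atomic
    propositions holding in it. These are illustrative, not the paper's exact
    construction.
    """

    def __init__(self, env, ldba, label_fn, accept_reward=1.0):
        self.env, self.ldba, self.label_fn = env, ldba, label_fn
        self.accept_reward = accept_reward

    def reset(self):
        self.q = self.ldba.initial                        # current automaton state
        return (self.env.reset(), self.q)

    def step(self, action):
        s_next, _, done, info = self.env.step(action)
        self.q = self.ldba.step(self.label_fn(s_next))    # advance the automaton
        # On-the-fly shaped reward: pay out only on accepting automaton states,
        # so maximising return aligns with satisfying the temporal property.
        r = self.accept_reward if self.q in self.ldba.accepting else 0.0
        return (s_next, self.q), r, done, info
```

    Any off-the-shelf RL algorithm can then be run on the product states (s, q), which is the sense in which the reward shaping is synchronous with policy learning.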