
Deep Conservative Policy Iteration

Author: Geist Matthieu
Pietquin Olivier
Vieillard Nino
Publication date: 06/01/2020

Conservative Policy Iteration (CPI) is a founding algorithm of Approximate Dynamic Programming (ADP). Its core principle is to stabilize greediness through stochastic mixtures of consecutive policies. It comes with strong theoretical guarantees, and has inspired approaches in deep Reinforcement Learning (RL). However, CPI itself has rarely been implemented, never with neural networks, and has only been tested on toy problems. In this paper, we show how CPI can be practically combined with deep RL with discrete actions. We also introduce adaptive mixture rates inspired by the theory. We thoroughly evaluate the resulting algorithm on the simple Cartpole problem, and validate the proposed method on a representative subset of Atari games. Overall, this work suggests that revisiting classic ADP may lead to improved and more stable deep RL algorithms. Comment: AAAI 2020 (long version)
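The "stochastic mixtures of consecutive policies" in the abstract above can be illustrated with a small tabular sketch. This is only an illustration under stated assumptions, not the paper's deep RL implementation: the function names `greedy_policy` and `cpi_mixture`, the array shapes, and the fixed mixture rate `alpha` are all hypothetical choices made here.

```python
import numpy as np

def greedy_policy(q):
    """Return a one-hot greedy policy for a (states, actions) array of action values."""
    pi = np.zeros_like(q, dtype=float)
    pi[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return pi

def cpi_mixture(pi_old, q, alpha):
    """Conservative update: mix the old policy with the greedy one.

    pi_new = (1 - alpha) * pi_old + alpha * greedy(q), with alpha in [0, 1].
    alpha is a hypothetical fixed rate; the paper studies adaptive rates.
    """
    return (1.0 - alpha) * pi_old + alpha * greedy_policy(q)

# Toy example: 2 states, 3 actions, uniform starting policy (assumed values).
pi = np.full((2, 3), 1.0 / 3.0)
q = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0]])
pi_new = cpi_mixture(pi, q, alpha=0.1)
```

With a small `alpha` the new policy stays close to the old one, which is the stabilizing effect the abstract refers to; `alpha = 1` recovers a fully greedy step.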

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Author: AG Barto
C Watkins
D Silver
LJ Lin
R Bellman
RJ Williams
VR Konda
WR Thompson
Publication date: 12/06/2019

Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks. Source code: https://github.com/vub-ai-lab/bdpi. Comment: Accepted at the European Conference on Machine Learning 2019 (ECML
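The idea of an actor "slowly imitating the average greedy policy of the critics" can be sketched in tabular form. This is a rough illustration under assumptions, not the code from the linked repository: the names `average_greedy_policy` and `actor_step`, the tabular representation, and the learning rate `rate` are hypothetical.

```python
import numpy as np

def average_greedy_policy(critic_qs):
    """Average the greedy policies of several critics.

    critic_qs has shape (critics, states, actions); the result has shape
    (states, actions) and each row sums to 1.
    """
    n_actions = critic_qs.shape[2]
    greedy_actions = critic_qs.argmax(axis=2)      # (critics, states)
    one_hot = np.eye(n_actions)[greedy_actions]    # (critics, states, actions)
    return one_hot.mean(axis=0)

def actor_step(pi, critic_qs, rate):
    """Move the actor a small step toward the critics' average greedy policy."""
    return (1.0 - rate) * pi + rate * average_greedy_policy(critic_qs)

# Toy example: 2 critics that disagree on the best action in a single state.
critic_qs = np.array([[[1.0, 0.0]],
                      [[0.0, 1.0]]])
pi = np.array([[1.0, 0.0]])
pi_next = actor_step(pi, critic_qs, rate=0.2)
```

When the critics disagree, the averaged target is itself stochastic, so the actor keeps probability mass on several actions in that state.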

arXiv.org e-Print Archive

VU Research Portal

Crossref

Efficient and Noise-Tolerant Reinforcement Learning Algorithms via Theoretical Analysis of Gap-Increasing and Softmax Operators

Author: Tadashi Kozuno
小津野将
Publication date: 31/03/2020

Model-free deep Reinforcement Learning (RL) algorithms, a combination of deep learning and model-free RL algorithms, have attained remarkable successes in solving complex tasks such as video games. However, theoretical analyses and recent empirical results indicate their proneness to various types of value update errors, including but not limited to estimation error of updates due to finite samples and function approximation error. Because real-world tasks are inherently complex and stochastic, such errors are inevitable, and thus the development of error-tolerant RL algorithms is of great importance for applications of RL to real problems. To this end, I propose two error-tolerant algorithms for RL called Conservative Value Iteration (CVI) and Gap-increasing RetrAce for Policy Evaluation (GRAPE). CVI unifies value-iteration-like single-stage-lookahead algorithms such as soft value iteration, advantage learning and Ψ-learning, all of which are characterized by the use of a gap-increasing operator and/or softmax operator in value updates. We provide a detailed theoretical analysis of CVI that not only shows CVI's advantages but also contributes to the theory of RL in the following two points: First, it elucidates the pros and cons of gap-increasing and softmax operators. Second, it provides an actual example in which the performance of algorithms with the max operator is worse than that of algorithms with the softmax operator, demonstrating the limitation of traditional greedy value updates. GRAPE is a policy evaluation algorithm extending advantage learning (AL) and retrace, both of which have different advantages: AL is noise-tolerant, as shown through our theoretical analysis of CVI, while retrace is efficient in that it is off-policy and allows the control of the bias-variance trade-off. Theoretical analysis of GRAPE shows that it enjoys the merits of both algorithms. In experiments, we demonstrate the benefit of GRAPE combined with a variant of trust region policy optimization and its superiority to previous algorithms. Through these studies, I theoretically elucidated the benefits of gap-increasing and softmax operators in both policy evaluation and control settings. While some open problems remain, as explained in the final chapter, the results presented in this thesis are an important step towards a deep understanding of RL algorithms.

Okinawa Institute of Science and Technology Graduate University
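As an illustration of the gap-increasing operators mentioned in the abstract above, the following sketch applies one advantage-learning backup to a tabular action-value function. It is not CVI or GRAPE themselves: it assumes the commonly stated form of the advantage-learning operator (a Bellman optimality backup minus `alpha` times the action gap), deterministic transitions, and hypothetical names (`advantage_learning_backup`, `next_states`).

```python
import numpy as np

def advantage_learning_backup(q, rewards, next_states, gamma, alpha):
    """One gap-increasing backup on a tabular Q-function.

    q, rewards: float arrays of shape (states, actions).
    next_states: int array of shape (states, actions), deterministic transitions
                 (an assumption made to keep the sketch short).
    Returns r + gamma * max_b Q(s', b) - alpha * (max_b Q(s, b) - Q(s, a)).
    """
    v = q.max(axis=1)                          # greedy state values
    bellman = rewards + gamma * v[next_states] # standard optimality backup
    gap = v[:, None] - q                       # action gap, zero for greedy actions
    return bellman - alpha * gap

# Toy example: 2 states, 2 actions, every action leads to state 1 (assumed values).
q = np.array([[1.0, 0.0],
              [0.0, 0.0]])
rewards = np.zeros((2, 2))
next_states = np.ones((2, 2), dtype=int)
q_new = advantage_learning_backup(q, rewards, next_states, gamma=0.9, alpha=0.5)
```

The greedy action's value is left as the plain Bellman backup, while non-greedy actions are pushed down in proportion to their gap; setting `alpha = 0` recovers the ordinary max-operator update.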

OIST Institutional Repository

Institutional Repositories DataBase (IRDB)

Probabilistically Safe Policy Transfer

Author: Abbeel Pieter
Held David
McCarthy Zoe
Shentu Fred
Zhang Michael
Publication date: 15/05/2017

Although learning-based methods have great potential for robotics, one concern is that a robot that updates its parameters might cause large amounts of damage before it learns the optimal policy. We formalize the idea of safe learning in a probabilistic sense by defining an optimization problem: we desire to maximize the expected return while keeping the expected damage below a given safety limit. We study this optimization for the case of a robot manipulator with safety-based torque limits. We would like to ensure that the damage constraint is maintained at every step of the optimization and not just at convergence. To achieve this aim, we introduce a novel method which predicts how modifying the torque limit, as well as how updating the policy parameters, might affect the robot's safety. We show through a number of experiments that our approach allows the robot to improve its performance while ensuring that the expected damage constraint is not violated during the learning process.
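The optimization problem described in the abstract above could be written roughly as follows. The notation is an assumption made here for illustration and is not taken from the paper: $\theta$ denotes the policy parameters, $\tau$ the torque limit, $R$ the return, $D$ the damage, and $d_{\max}$ the safety limit.

```latex
\begin{aligned}
\max_{\theta,\,\tau} \quad & \mathbb{E}\left[ R(\theta, \tau) \right] \\
\text{subject to} \quad & \mathbb{E}\left[ D(\theta, \tau) \right] \le d_{\max}
\end{aligned}
```

The abstract additionally requires the constraint to hold at every iterate of the optimization, not only at the final solution.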

arXiv.org e-Print Archive

Crossref