Deep Conservative Policy Iteration
Conservative Policy Iteration (CPI) is a founding algorithm of Approximate
Dynamic Programming (ADP). Its core principle is to stabilize greediness
through stochastic mixtures of consecutive policies. It comes with strong
theoretical guarantees, and inspired approaches in deep Reinforcement Learning
(RL). However, CPI itself has rarely been implemented, never with neural
networks, and has only been tested on toy problems. In this paper, we show how CPI
can be practically combined with deep RL with discrete actions. We also
introduce adaptive mixture rates inspired by the theory. We evaluate the
resulting algorithm thoroughly on the simple Cartpole problem, and validate
the proposed method on a representative subset of Atari games. Overall, this
work suggests that revisiting classic ADP may lead to improved and more stable
deep RL algorithms. Comment: AAAI 2020 (long version)
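CPI's core principle, stabilizing greediness through stochastic mixtures of consecutive policies, can be sketched in a tabular setting. This is a minimal illustration, not the paper's implementation; the function name, array shapes, and the fixed mixture rate are assumptions:

```python
import numpy as np

def cpi_update(pi, greedy_pi, alpha):
    """One Conservative Policy Iteration step: mix the current policy
    with the greedy policy using a mixture rate alpha in [0, 1].
    pi, greedy_pi: arrays of shape (n_states, n_actions) whose rows
    are probability distributions over actions."""
    return (1.0 - alpha) * pi + alpha * greedy_pi

# Toy example: a uniform policy mixed with a deterministic greedy policy.
pi = np.full((2, 2), 0.5)
greedy = np.array([[1.0, 0.0], [0.0, 1.0]])
new_pi = cpi_update(pi, greedy, alpha=0.5)
```

Because each row of the result is a convex combination of two distributions, the mixed policy is guaranteed to remain a valid stochastic policy; the paper's adaptive mixture rates would replace the fixed `alpha` with a value derived from the theory.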
Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics
Value-based reinforcement-learning algorithms provide state-of-the-art
results in model-free discrete-action settings, and tend to outperform
actor-critic algorithms. We argue that actor-critic algorithms are limited by
their need for an on-policy critic. We propose Bootstrapped Dual Policy
Iteration (BDPI), a novel model-free reinforcement-learning algorithm for
continuous states and discrete actions, with an actor and several off-policy
critics. Off-policy critics are compatible with experience replay, ensuring
high sample-efficiency, without the need for off-policy corrections. The actor,
by slowly imitating the average greedy policy of the critics, leads to
high-quality and state-specific exploration, which we compare to Thompson
sampling. Because the actor and critics are fully decoupled, BDPI is remarkably
stable, and unusually robust to its hyper-parameters. BDPI is significantly
more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete,
continuous and pixel-based tasks. Source code:
https://github.com/vub-ai-lab/bdpi. Comment: Accepted at the European Conference on Machine Learning 2019 (ECML)
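The BDPI actor update described above, where the actor slowly imitates the average greedy policy of several off-policy critics, can be sketched in tabular form. This is an illustrative sketch under assumed names and shapes, not the released source code:

```python
import numpy as np

def bdpi_actor_update(actor, critics_q, lr):
    """Sketch of the BDPI actor step: move the actor's policy toward
    the average of the critics' greedy (one-hot) policies.
    actor: (n_states, n_actions) probabilities; critics_q: list of
    (n_states, n_actions) Q-tables; lr: imitation learning rate."""
    n_states, n_actions = actor.shape
    target = np.zeros_like(actor)
    for q in critics_q:
        # One-hot greedy policy of this critic, averaged into the target.
        target += np.eye(n_actions)[q.argmax(axis=1)] / len(critics_q)
    return (1.0 - lr) * actor + lr * target

# Two critics that disagree in state 1 produce a stochastic target there.
actor = np.full((2, 2), 0.5)
critics = [np.array([[1.0, 0.0], [0.0, 1.0]]),
           np.array([[1.0, 0.0], [1.0, 0.0]])]
new_actor = bdpi_actor_update(actor, critics, lr=0.2)
```

Disagreement among the critics naturally keeps the target policy stochastic in uncertain states, which is the intuition behind the comparison to Thompson sampling.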
Efficient and Noise-Tolerant Reinforcement Learning Algorithms via Theoretical Analysis of Gap-Increasing and Softmax Operators
Model-free deep Reinforcement Learning (RL) algorithms, a combination of deep learning and model-free RL algorithms, have attained remarkable success in solving complex tasks such as video games. However, theoretical analyses and recent empirical results indicate their proneness to various types of value update errors, including, but not limited to, estimation errors due to finite samples and function approximation errors. Because real-world tasks are inherently complex and stochastic, such errors are inevitable, and thus the development of error-tolerant RL algorithms is of great importance for applications of RL to real problems. To this end, I propose two error-tolerant algorithms for RL called Conservative Value Iteration (CVI) and Gap-increasing RetrAce for Policy Evaluation (GRAPE). CVI unifies value-iteration-like single-stage-lookahead algorithms such as soft value iteration, advantage learning and Ψ-learning, all of which are characterized by the use of a gap-increasing operator and/or softmax operator in value updates. We provide a detailed theoretical analysis of CVI that not only shows CVI's advantages but also contributes to the theory of RL in the following two points: First, it elucidates the pros and cons of gap-increasing and softmax operators. Second, it provides a concrete example in which the performance of algorithms with the max operator is worse than that of algorithms with the softmax operator, demonstrating the limitation of traditional greedy value updates. GRAPE is a policy evaluation algorithm extending advantage learning (AL) and Retrace, both of which have different advantages: AL is noise-tolerant, as shown through our theoretical analysis of CVI, while Retrace is efficient in that it is off-policy and allows control of the bias-variance trade-off. Theoretical analysis of GRAPE shows that it enjoys the merits of both algorithms.
In experiments, we demonstrate the benefit of GRAPE combined with a variant of trust region policy optimization and its superiority to previous algorithms. Through these studies, I theoretically elucidated the benefits of gap-increasing and softmax operators in both policy evaluation and control settings. While some open problems remain, as explained in the final chapter, the results presented in this thesis are an important step towards a deep understanding of RL algorithms. Okinawa Institute of Science and Technology Graduate University
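The two operator families that CVI unifies can be illustrated with a small tabular sketch. This assumes a log-sum-exp softmax operator and an advantage-learning-style gap-increasing term; the function names and the exact operator choices are illustrative, not the thesis's definitions:

```python
import numpy as np

def softmax_backup(q, beta):
    """Softmax (log-sum-exp) value backup over a vector of action
    values; approaches the max operator as beta grows large."""
    return np.log(np.sum(np.exp(beta * q))) / beta

def gap_increasing_target(r, gamma, q_s, a, q_next, beta, alpha):
    """Illustrative CVI-style target: a softmax Bellman backup plus a
    term, scaled by alpha, that widens the gap between the chosen
    action's value and the soft state value."""
    return (r
            + gamma * softmax_backup(q_next, beta)
            + alpha * (q_s[a] - softmax_backup(q_s, beta)))

# With a large beta the softmax backup is close to the hard max.
q = np.array([1.0, 2.0])
backup = softmax_backup(q, beta=50.0)
```

The softmax operator trades a small amount of optimality for smoothness (and hence noise tolerance), while the gap-increasing term penalizes non-greedy actions over repeated updates; CVI's analysis makes the interaction of these two effects precise.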
Probabilistically Safe Policy Transfer
Although learning-based methods have great potential for robotics, one
concern is that a robot that updates its parameters might cause large amounts
of damage before it learns the optimal policy. We formalize the idea of safe
learning in a probabilistic sense by defining an optimization problem: we
desire to maximize the expected return while keeping the expected damage below
a given safety limit. We study this optimization for the case of a robot
manipulator with safety-based torque limits. We would like to ensure that the
damage constraint is maintained at every step of the optimization and not just
at convergence. To achieve this aim, we introduce a novel method which predicts
how modifying the torque limit, as well as how updating the policy parameters,
might affect the robot's safety. We show through a number of experiments that
our approach allows the robot to improve its performance while ensuring that
the expected damage constraint is not violated during the learning process.
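One way to picture enforcing a damage constraint at every step, rather than only at convergence, is a backtracking update that shrinks any parameter change whose predicted expected damage exceeds the safety budget. This is a hypothetical sketch, not the paper's method; `expected_damage_fn`, the backtracking rule, and all names are assumptions:

```python
import numpy as np

def safe_update(params, grad, step, expected_damage_fn, damage_limit):
    """Take a gradient step only if the predicted expected damage of
    the new parameters stays below the safety limit; otherwise halve
    the step size until the constraint is satisfied."""
    candidate = params + step * grad
    while expected_damage_fn(candidate) > damage_limit and step > 1e-8:
        step *= 0.5
        candidate = params + step * grad
    return candidate

# Toy damage model: predicted damage grows with the parameter norm.
damage = lambda p: np.linalg.norm(p)
new_params = safe_update(np.zeros(2), np.array([10.0, 0.0]),
                         step=1.0, expected_damage_fn=damage,
                         damage_limit=1.0)
```

The key property is that the constraint is checked before each update is committed, so the learner never passes through an unsafe parameter setting on its way to the optimum.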