Clipped-Objective Policy Gradients for Pessimistic Policy Optimization
To facilitate efficient learning, policy gradient approaches to deep
reinforcement learning (RL) are typically paired with variance reduction
measures and strategies for making large but safe policy changes based on a
batch of experiences. Natural policy gradient methods, including Trust Region
Policy Optimization (TRPO), seek to produce monotonic improvement through
bounded changes in policy outputs. Proximal Policy Optimization (PPO) is a
commonly used, first-order algorithm that instead uses loss clipping to take
multiple safe optimization steps per batch of data, replacing the bound on the
single step of TRPO with regularization on multiple steps. In this work, we
find that the performance of PPO, when applied to continuous action spaces, may
be consistently improved through a simple change in objective. In place of the
importance sampling objective of PPO, we recommend a basic policy gradient,
clipped in an equivalent fashion. While both objectives produce
biased gradient estimates with respect to the RL objective, they also both
display significantly reduced variance compared to the unbiased off-policy
policy gradient. Additionally, we show that (1) the clipped-objective policy
gradient (COPG) objective is on average "pessimistic" compared to the PPO
objective and (2) this pessimism promotes enhanced exploration. As a result, we
empirically observe that COPG produces improved learning compared to PPO in
single-task, constrained, and multi-task learning, without adding significant
computational cost or complexity. Compared to TRPO, the COPG approach is seen
to offer comparable or superior performance, while retaining the simplicity of
a first-order method.
Comment: 12 pages, 8 figures
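The contrast between the two surrogates can be made concrete. Below is a minimal PyTorch sketch of PPO's clipped importance-sampling loss alongside one plausible reading of the COPG loss, in which the basic log-probability policy-gradient term is clipped "in an equivalent fashion" by freezing its gradient once the probability ratio leaves the trust region; the tensor names, the gradient-masking construction, and the default epsilon are assumptions, not the authors' implementation.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    # PPO surrogate: importance ratio r = pi_new / pi_old, clipped to
    # [1 - eps, 1 + eps], combined with the pessimistic elementwise minimum.
    ratio = torch.exp(logp_new - logp_old)
    return -torch.min(ratio * adv,
                      torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv).mean()

def copg_loss(logp_new, logp_old, adv, eps=0.2):
    # Sketch of a clipped-objective policy gradient: the plain policy-gradient
    # term log(pi) * A, treated as a constant (no gradient) once the ratio has
    # already moved past 1 +/- eps in the direction the advantage favours,
    # mirroring what PPO's clip does to its ratio term.
    ratio = torch.exp(logp_new - logp_old)
    pg = logp_new * adv
    outside = ((adv > 0) & (ratio > 1.0 + eps)) | ((adv < 0) & (ratio < 1.0 - eps))
    return -torch.where(outside, pg.detach(), pg).mean()
```

Both losses are biased but low-variance surrogates for the RL objective; the abstract's claim is that the COPG form is on average the more pessimistic of the two.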
Addressing Function Approximation Error in Actor-Critic Methods
In value-based reinforcement learning methods such as deep Q-learning,
function approximation errors are known to lead to overestimated value
estimates and suboptimal policies. We show that this problem persists in an
actor-critic setting and propose novel mechanisms to minimize its effects on
both the actor and the critic. Our algorithm builds on Double Q-learning, by
taking the minimum value between a pair of critics to limit overestimation. We
draw the connection between target networks and overestimation bias, and
suggest delaying policy updates to reduce per-update error and further improve
performance. We evaluate our method on the suite of OpenAI Gym tasks,
outperforming the state of the art in every environment tested.
Comment: Accepted at ICML 2018
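As a concrete illustration, the clipped double-Q target at the core of this approach can be sketched in a few lines of PyTorch; the function and argument names are illustrative, and the clipped target-action noise shown here is part of the full algorithm rather than something spelled out in the abstract.

```python
import torch

def clipped_double_q_target(reward, next_obs, done, actor_targ, q1_targ, q2_targ,
                            gamma=0.99, noise_std=0.2, noise_clip=0.5,
                            act_limit=1.0):
    # Take the minimum over a pair of target critics so that a single
    # critic's overestimate cannot inflate the bootstrapped target.
    with torch.no_grad():
        a_next = actor_targ(next_obs)
        noise = (noise_std * torch.randn_like(a_next)).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-act_limit, act_limit)
        q_min = torch.min(q1_targ(next_obs, a_next), q2_targ(next_obs, a_next))
        return reward + gamma * (1.0 - done) * q_min
```

The delayed policy updates mentioned in the abstract would then update the actor (and the target networks) only once every few critic updates, letting the value estimate settle before it is used to change the policy.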
Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics
Value-based reinforcement-learning algorithms provide state-of-the-art
results in model-free discrete-action settings, and tend to outperform
actor-critic algorithms. We argue that actor-critic algorithms are limited by
their need for an on-policy critic. We propose Bootstrapped Dual Policy
Iteration (BDPI), a novel model-free reinforcement-learning algorithm for
continuous states and discrete actions, with an actor and several off-policy
critics. Off-policy critics are compatible with experience replay, ensuring
high sample-efficiency, without the need for off-policy corrections. The actor,
by slowly imitating the average greedy policy of the critics, leads to
high-quality and state-specific exploration, which we compare to Thompson
sampling. Because the actor and critics are fully decoupled, BDPI is remarkably
stable, and unusually robust to its hyper-parameters. BDPI is significantly
more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete,
continuous and pixel-based tasks. Source code:
https://github.com/vub-ai-lab/bdpi
Comment: Accepted at the European Conference on Machine Learning 2019 (ECML 2019)
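The actor update described here, slow imitation of the critics' average greedy policy, admits a short sketch; the array shapes, the mixing rate, and the function name are assumptions for illustration, not the reference implementation from the linked repository.

```python
import numpy as np

def bdpi_actor_update(actor_probs, critic_qs, lam=0.05):
    # Move the actor's action distribution for one state a small step
    # toward the average greedy policy of the off-policy critics.
    # actor_probs: (n_actions,) probabilities; critic_qs: list of
    # (n_actions,) Q-value vectors, one per critic.
    greedy = np.zeros_like(actor_probs)
    for q in critic_qs:
        greedy[np.argmax(q)] += 1.0 / len(critic_qs)
    return (1.0 - lam) * actor_probs + lam * greedy
```

Because each critic's greedy action can differ from state to state, the averaged target keeps probability mass on several actions, which is what gives the Thompson-sampling-like, state-specific exploration the abstract describes.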
The Importance of Clipping in Neurocontrol by Direct Gradient Descent on the Cost-to-Go Function and in Adaptive Dynamic Programming
In adaptive dynamic programming, neurocontrol and reinforcement learning, the
objective is for an agent to learn to choose actions so as to minimise a total
cost function. In this paper we show that when discretized time is used to
model the motion of the agent, it can be very important to do "clipping" on the
motion of the agent in the final time step of the trajectory. By clipping we
mean that the final time step of the trajectory is to be truncated such that
the agent stops exactly at the first terminal state reached, and no distance
further. We demonstrate that when clipping is omitted, learning performance can
fail to reach the optimum; and when clipping is done properly, learning
performance can improve significantly.
The clipping problem we describe affects algorithms which use explicit
derivatives of the model functions of the environment to calculate a learning
gradient. These include Backpropagation Through Time for Control, and methods
based on Dual Heuristic Dynamic Programming. However, the clipping problem does
not significantly affect methods based on Heuristic Dynamic Programming,
Temporal Differences or Policy Gradient Learning algorithms. Similarly, the
clipping problem does not affect fixed-length finite-horizon problems.