Reducing Estimation Bias via Weighted Delayed Deep Deterministic Policy Gradient
The overestimation phenomenon caused by function approximation is a
well-known issue in value-based reinforcement learning algorithms such as deep
Q-networks and DDPG, which could lead to suboptimal policies. To address this
issue, TD3 takes the minimum value between a pair of critics, which introduces
underestimation bias. By unifying these two opposites, we propose a novel
Weighted Delayed Deep Deterministic Policy Gradient algorithm, which can reduce
the estimation error and further improve the performance by weighting a pair of
critics. We compare the value-function learning process of DDPG, TD3, and our
proposed algorithm, which verifies that our algorithm can indeed eliminate the
estimation error of the value function. We evaluate our algorithm on the OpenAI
Gym continuous control tasks, outperforming the state-of-the-art algorithms in
every environment tested.
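The abstract describes forming the target by weighting a pair of critics rather than always taking their minimum. As a rough illustration of that idea, the sketch below (NumPy) blends the pessimistic minimum of the two critic estimates with their average when building the bootstrap target; the weight beta and the discount gamma are placeholder values, not taken from the paper. With beta = 1 this reduces to TD3's clipped double-Q target, and with beta = 0 to a plain average of the two critics.

```python
import numpy as np

def weighted_double_q_target(reward, done, q1_next, q2_next, gamma=0.99, beta=0.75):
    """Illustrative weighted bootstrap target: blend the pessimistic minimum of
    two critic estimates with their average to trade off under- and
    overestimation. beta and gamma are placeholders, not values from the paper."""
    q_min = np.minimum(q1_next, q2_next)   # TD3-style pessimistic estimate
    q_avg = 0.5 * (q1_next + q2_next)      # less pessimistic average of the two critics
    q_blend = beta * q_min + (1.0 - beta) * q_avg
    return reward + gamma * (1.0 - done) * q_blend
```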
Addressing Function Approximation Error in Actor-Critic Methods
In value-based reinforcement learning methods such as deep Q-learning,
function approximation errors are known to lead to overestimated value
estimates and suboptimal policies. We show that this problem persists in an
actor-critic setting and propose novel mechanisms to minimize its effects on
both the actor and the critic. Our algorithm builds on Double Q-learning, by
taking the minimum value between a pair of critics to limit overestimation. We
draw the connection between target networks and overestimation bias, and
suggest delaying policy updates to reduce per-update error and further improve
performance. We evaluate our method on the suite of OpenAI gym tasks,
outperforming the state of the art in every environment tested.
Comment: Accepted at ICML 2018
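The two mechanisms named in this abstract, the clipped double-Q target (backing up the minimum of a pair of critics) and delayed policy updates, can be sketched roughly as follows. This is a minimal PyTorch sketch with random tensors standing in for a replay-buffer batch; the network sizes and hyperparameters are placeholders, and the target networks and target policy smoothing used in the full algorithm are omitted for brevity.

```python
import torch
import torch.nn as nn

# Toy dimensions and random data standing in for a replay-buffer batch.
state_dim, action_dim, batch = 3, 1, 32
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic1 = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
critic2 = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=1e-3)

gamma, policy_delay = 0.99, 2
for step in range(100):
    s = torch.randn(batch, state_dim)
    a = torch.rand(batch, action_dim) * 2 - 1
    r = torch.randn(batch, 1)
    s2 = torch.randn(batch, state_dim)
    done = torch.zeros(batch, 1)

    # Clipped double Q-learning: the bootstrap target backs up the minimum of
    # the two critics, limiting the overestimation a single approximate critic accumulates.
    with torch.no_grad():
        a2 = actor(s2)
        q_next = torch.min(critic1(torch.cat([s2, a2], dim=1)),
                           critic2(torch.cat([s2, a2], dim=1)))
        target = r + gamma * (1 - done) * q_next

    q1 = critic1(torch.cat([s, a], dim=1))
    q2 = critic2(torch.cat([s, a], dim=1))
    critic_loss = ((q1 - target) ** 2 + (q2 - target) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy updates: the actor is improved only every `policy_delay`
    # critic updates, so it is trained against a lower-error value estimate.
    if step % policy_delay == 0:
        actor_loss = -critic1(torch.cat([s, actor(s)], dim=1)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
```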
Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach
Compared to on-policy counterparts, off-policy model-free deep reinforcement
learning can improve data efficiency by repeatedly using the previously
gathered data. However, off-policy learning becomes challenging when the
discrepancy between the underlying distributions of the agent's policy and
collected data increases. Although well-studied importance sampling and
off-policy policy gradient techniques have been proposed to compensate for this
discrepancy, they usually require collecting long trajectories and introduce
additional problems such as vanishing/exploding gradients or the discarding of
many useful experiences, which ultimately increases computational complexity.
Moreover, their generalization to either continuous action domains or policies
approximated by deterministic deep neural networks is strictly limited. To
overcome these limitations, we introduce a novel policy similarity measure to
mitigate the effects of such discrepancy in continuous control. Our method
offers an adequate single-step off-policy correction that is applicable to
deterministic policy networks. Theoretical and empirical studies demonstrate
that it achieves "safe" off-policy learning and substantially improves on the
state of the art, attaining higher returns in fewer steps than competing
methods through an effective learning-rate schedule for Q-learning and policy
optimization.
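The abstract does not spell out the policy similarity measure, so the sketch below is only one plausible instantiation of a single-step correction of this flavor, not the paper's method: each stored transition is re-weighted by a Gaussian kernel on the gap between the behavior action in the replay buffer and the action the current deterministic policy would select, and the one-step TD error is scaled by that weight. The function names, the kernel choice, and the bandwidth sigma are all illustrative assumptions.

```python
import numpy as np

def similarity_weight(behavior_action, policy_action, sigma=0.5):
    """Hypothetical similarity measure: a Gaussian kernel on the distance between
    the action stored in the replay buffer and the action the current
    deterministic policy would take in the same state. sigma is an illustrative
    bandwidth, not a value from the paper."""
    gap = np.linalg.norm(behavior_action - policy_action, axis=-1)
    return np.exp(-0.5 * (gap / sigma) ** 2)

def corrected_td_error(reward, done, q, q_next, weight, gamma=0.99):
    """One-step TD error scaled per transition by the similarity weight, so
    near-on-policy samples dominate the critic update and clearly off-policy
    ones contribute little."""
    return weight * (reward + gamma * (1.0 - done) * q_next - q)
```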