Reducing Estimation Bias via Weighted Delayed Deep Deterministic Policy Gradient
The overestimation phenomenon caused by function approximation is a
well-known issue in value-based reinforcement learning algorithms such as deep
Q-networks and DDPG, which could lead to suboptimal policies. To address this
issue, TD3 takes the minimum value between a pair of critics, which introduces
underestimation bias. By unifying these two opposites, we propose a novel
Weighted Delayed Deep Deterministic Policy Gradient algorithm, which can reduce
the estimation error and further improve the performance by weighting a pair of
critics. We compare the value-function learning process of DDPG, TD3, and our
proposed algorithm, which verifies that our algorithm can indeed eliminate the
estimation error of the value function. We evaluate our algorithm on the OpenAI
Gym continuous control tasks, outperforming the state-of-the-art algorithms in
every environment tested.
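The abstract describes forming the target by weighting a pair of critics rather than always taking their minimum. As a rough illustration of that idea, the sketch below (NumPy) blends the pessimistic minimum of the two critic estimates with their average when building the bootstrap target; the weight beta and the discount gamma are placeholder values, not taken from the paper. With beta = 1 this reduces to TD3's clipped double-Q target, and with beta = 0 to a plain average of the two critics.

```python
import numpy as np

def weighted_double_q_target(reward, done, q1_next, q2_next, gamma=0.99, beta=0.75):
    """Illustrative weighted bootstrap target: blend the pessimistic minimum of
    two critic estimates with their average to trade off under- and
    overestimation. beta and gamma are placeholders, not values from the paper."""
    q_min = np.minimum(q1_next, q2_next)   # TD3-style pessimistic estimate
    q_avg = 0.5 * (q1_next + q2_next)      # less pessimistic average of the two critics
    q_blend = beta * q_min + (1.0 - beta) * q_avg
    return reward + gamma * (1.0 - done) * q_blend
```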
Addressing Function Approximation Error in Actor-Critic Methods
In value-based reinforcement learning methods such as deep Q-learning,
function approximation errors are known to lead to overestimated value
estimates and suboptimal policies. We show that this problem persists in an
actor-critic setting and propose novel mechanisms to minimize its effects on
both the actor and the critic. Our algorithm builds on Double Q-learning, by
taking the minimum value between a pair of critics to limit overestimation. We
draw the connection between target networks and overestimation bias, and
suggest delaying policy updates to reduce per-update error and further improve
performance. We evaluate our method on the suite of OpenAI gym tasks,
outperforming the state of the art in every environment tested.
Comment: Accepted at ICML 2018
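The two mechanisms named in this abstract, the clipped double-Q target (backing up the minimum of a pair of critics) and delayed policy updates, can be sketched roughly as follows. This is a minimal PyTorch sketch with random tensors standing in for a replay-buffer batch; the network sizes and hyperparameters are placeholders, and the target networks and target policy smoothing used in the full algorithm are omitted for brevity.

```python
import torch
import torch.nn as nn

# Toy dimensions and random data standing in for a replay-buffer batch.
state_dim, action_dim, batch = 3, 1, 32
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic1 = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
critic2 = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=1e-3)

gamma, policy_delay = 0.99, 2
for step in range(100):
    s = torch.randn(batch, state_dim)
    a = torch.rand(batch, action_dim) * 2 - 1
    r = torch.randn(batch, 1)
    s2 = torch.randn(batch, state_dim)
    done = torch.zeros(batch, 1)

    # Clipped double Q-learning: the bootstrap target backs up the minimum of
    # the two critics, limiting the overestimation a single approximate critic accumulates.
    with torch.no_grad():
        a2 = actor(s2)
        q_next = torch.min(critic1(torch.cat([s2, a2], dim=1)),
                           critic2(torch.cat([s2, a2], dim=1)))
        target = r + gamma * (1 - done) * q_next

    q1 = critic1(torch.cat([s, a], dim=1))
    q2 = critic2(torch.cat([s, a], dim=1))
    critic_loss = ((q1 - target) ** 2 + (q2 - target) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy updates: the actor is improved only every `policy_delay`
    # critic updates, so it is trained against a lower-error value estimate.
    if step % policy_delay == 0:
        actor_loss = -critic1(torch.cat([s, actor(s)], dim=1)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
```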
Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach
Compared to on-policy counterparts, off-policy model-free deep reinforcement
learning can improve data efficiency by repeatedly using the previously
gathered data. However, off-policy learning becomes challenging when the
discrepancy between the underlying distributions of the agent's policy and
collected data increases. Although well-studied importance sampling and
off-policy policy gradient techniques have been proposed to compensate for this
discrepancy, they usually require collecting long trajectories and introduce
additional problems such as vanishing/exploding gradients or the discarding of
many useful experiences, which ultimately increases computational complexity.
Moreover, their generalization to either continuous action domains or policies
approximated by deterministic deep neural networks is strictly limited. To
overcome these limitations, we introduce a novel policy similarity measure to
mitigate the effects of such discrepancy in continuous control. Our method
offers an adequate single-step off-policy correction that is applicable to
deterministic policy networks. Theoretical and empirical studies demonstrate
that it achieves "safe" off-policy learning and substantially improves on the
state of the art, attaining higher returns in fewer steps than competing
methods through an effective learning-rate schedule for Q-learning and policy
optimization.
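The abstract does not spell out the policy similarity measure, so the sketch below is only one plausible instantiation of a single-step correction of this flavor, not the paper's method: each stored transition is re-weighted by a Gaussian kernel on the gap between the behavior action in the replay buffer and the action the current deterministic policy would select, and the one-step TD error is scaled by that weight. The function names, the kernel choice, and the bandwidth sigma are all illustrative assumptions.

```python
import numpy as np

def similarity_weight(behavior_action, policy_action, sigma=0.5):
    """Hypothetical similarity measure: a Gaussian kernel on the distance between
    the action stored in the replay buffer and the action the current
    deterministic policy would take in the same state. sigma is an illustrative
    bandwidth, not a value from the paper."""
    gap = np.linalg.norm(behavior_action - policy_action, axis=-1)
    return np.exp(-0.5 * (gap / sigma) ** 2)

def corrected_td_error(reward, done, q, q_next, weight, gamma=0.99):
    """One-step TD error scaled per transition by the similarity weight, so
    near-on-policy samples dominate the critic update and clearly off-policy
    ones contribute little."""
    return weight * (reward + gamma * (1.0 - done) * q_next - q)
```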