Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach
Compared to its on-policy counterparts, off-policy model-free deep reinforcement
learning can improve data efficiency by reusing previously gathered experience.
However, off-policy learning becomes challenging as the discrepancy grows
between the distribution induced by the agent's current policy and the
distribution underlying the collected data. Although well-studied importance
sampling and off-policy policy gradient techniques have been proposed to
compensate for this discrepancy, they typically require long trajectories and
introduce additional problems, such as vanishing or exploding gradients and the
discarding of many useful experiences, which ultimately increases computational
complexity. Moreover, they generalize poorly to continuous action domains and
to policies approximated by deterministic deep neural networks. To overcome
these limitations, we introduce a novel policy similarity measure that
mitigates the effects of such discrepancy in continuous control. Our method
provides a single-step off-policy correction that is applicable to
deterministic policy networks. Theoretical and empirical studies demonstrate
that it achieves "safe" off-policy learning and substantially improves on the
state of the art, attaining higher returns in fewer steps than competing
methods, by effectively scheduling the learning rate in Q-learning and
policy optimization.
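
The abstract does not specify the similarity measure itself. As a rough illustration of the general idea only, the sketch below weights each replay transition's one-step TD loss by a Gaussian kernel on the distance between the action the current deterministic policy would take and the behavior action stored in the replay buffer; all names (similarity_weight, weighted_td_loss) and the kernel choice with bandwidth sigma are hypothetical stand-ins, not the paper's definitions.

import numpy as np

def similarity_weight(policy_action, behavior_action, sigma=0.2):
    # Hypothetical policy similarity measure: a Gaussian kernel on the
    # Euclidean distance between the current deterministic policy's
    # action at s and the behavior action stored in the replay buffer.
    # Close actions give a weight near 1; distant ones decay toward 0.
    dist_sq = float(np.sum((policy_action - behavior_action) ** 2))
    return float(np.exp(-dist_sq / (2.0 * sigma ** 2)))

def weighted_td_loss(q_sa, reward, next_q, weight, gamma=0.99):
    # One-step Q-learning TD loss, down-weighted by the similarity of
    # the current policy to the behavior that produced the transition,
    # so experience from very different behavior contributes less.
    target = reward + gamma * next_q
    td_error = target - q_sa
    return weight * td_error ** 2

# Toy usage: a stored action close to the current policy's action at
# the same state receives nearly full weight in the critic update.
mu_s = np.array([0.10, -0.30])   # current policy's action at state s
a_b  = np.array([0.12, -0.28])   # behavior action from the buffer
w = similarity_weight(mu_s, a_b)
loss = weighted_td_loss(q_sa=5.2, reward=1.0, next_q=5.0, weight=w)
print(f"similarity weight = {w:.3f}, weighted TD loss = {loss:.3f}")

Weighting the per-sample loss in this way, rather than accumulating importance ratios along a trajectory, is what makes such a correction single-step: it needs only the stored (s, a, r, s') tuple, which matches the abstract's claim of a single-step correction applicable to deterministic policy networks.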