Handling Cost and Constraints with Off-Policy Deep Reinforcement Learning
By reusing data throughout training, off-policy deep reinforcement learning
algorithms offer improved sample efficiency relative to on-policy approaches.
For continuous action spaces, the most popular methods for off-policy learning
include policy improvement steps where a learned state-action (Q) value
function is maximized over selected batches of data. These updates are often
paired with regularization to combat the associated overestimation of Q values.
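To make this setup concrete, below is a minimal sketch (not the authors' code) of the standard Q-maximizing policy improvement step in PyTorch; the network shapes, learning rate, and batch size are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; these are assumptions, not values from the paper.
obs_dim, act_dim, hidden = 8, 2, 64

policy = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                       nn.Linear(hidden, act_dim), nn.Tanh())
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                      nn.Linear(hidden, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def policy_improvement_step(obs_batch: torch.Tensor) -> float:
    """One DDPG/TD3-style actor update: ascend Q(s, pi(s)) over a sampled batch."""
    actions = policy(obs_batch)
    # Minimizing -Q is equivalent to maximizing the learned Q function.
    actor_loss = -q_net(torch.cat([obs_batch, actions], dim=-1)).mean()
    policy_opt.zero_grad()
    actor_loss.backward()
    policy_opt.step()
    return actor_loss.item()

# Example usage with a random batch standing in for replay-buffer samples.
policy_improvement_step(torch.randn(32, obs_dim))
```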
With an eye toward safety, we revisit this strategy in environments with
"mixed-sign" reward functions; that is, with reward functions that include
independent positive (incentive) and negative (cost) terms. This setting is
common in real-world applications, and may be addressed with or without
constraints on the cost terms. We find the combination of function
approximation and a term that maximizes Q in the policy update to be
problematic in such environments, because systematic errors in value estimation
impact the contributions from the competing terms asymmetrically. This results
in overemphasis of either incentives or costs and may severely limit learning.
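As a hypothetical illustration of a mixed-sign reward (the specific terms are not the paper's benchmarks), the reward decomposes into an independent positive incentive and a negative cost:

```python
def mixed_sign_reward(task_progress: float, constraint_violation: float,
                      cost_weight: float = 1.0) -> float:
    """Illustrative mixed-sign reward: a positive incentive term plus an
    independent negative cost term. Both terms are hypothetical examples."""
    incentive = task_progress                      # positive term
    cost = -cost_weight * constraint_violation     # negative term
    return incentive + cost
```

If value estimation errors inflate one of these terms more than the other, the policy update can overweight either the incentive or the cost.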
We explore two remedies to this issue. First, consistent with prior work, we
find that periodic resetting of Q and policy networks can be used to reduce
value estimation error and improve learning in this setting. Second, we
formulate novel off-policy actor-critic methods for both unconstrained and
constrained learning that do not explicitly maximize Q in the policy update.
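The first remedy, periodic resetting, might look like the following sketch; the reset interval and the scope of what is reinitialized are assumptions in the spirit of prior work on resetting, not the paper's exact procedure.

```python
import torch.nn as nn

def reset_networks(*nets: nn.Module) -> None:
    """Reinitialize every layer that defines reset_parameters() (e.g. nn.Linear),
    while the replay buffer and collected data are kept intact."""
    for net in nets:
        for layer in net.modules():
            if hasattr(layer, "reset_parameters"):
                layer.reset_parameters()

# Hypothetical training-loop fragment: reset Q and policy networks on a fixed
# schedule (optimizer state would typically be rebuilt as well).
q_net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())
for step in range(1, 1001):
    if step % 500 == 0:          # reset interval is an illustrative choice
        reset_networks(q_net, policy)
```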
We find that this second approach, when applied to continuous action spaces
with mixed-sign rewards, consistently and significantly outperforms
state-of-the-art methods augmented by resetting. We further find that our
approach produces agents that are both competitive with popular methods overall
and more reliably competent on frequently studied control problems that do not
have mixed-sign rewards.
Comment: 22 pages, 16 figures