A key problem in off-policy Reinforcement Learning (RL) is the mismatch, or
distribution shift, between the dataset and the distribution over states and
actions visited by the learned policy. This problem is exacerbated in the fully
offline setting. The main approach to correcting this shift has been
importance sampling, which leads to high-variance gradients. Other approaches,
such as conservatism or behavior regularization, regularize the policy at the
cost of performance. In this paper, we propose a new approach for stable
off-policy Q-Learning. Our method, Projected Off-Policy Q-Learning (POP-QL), is
a novel actor-critic algorithm that simultaneously reweights off-policy samples
and constrains the policy to prevent divergence and reduce value-approximation
error. In our experiments, POP-QL not only shows competitive performance on
standard benchmarks, but also outperforms competing methods in tasks where the
data-collection policy is significantly sub-optimal.

Comment: 10 pages