Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift
In this paper we revisit the method of off-policy corrections for
reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this
method, online updates to the value function are reweighted to avoid divergence
issues typical of off-policy learning. While Hallak et al.'s solution is
appealing, it cannot easily be transferred to nonlinear function approximation.
First, it requires a projection step onto the probability simplex; second, even
though the operator describing the expected behavior of the off-policy learning
algorithm is convergent, it is not known to be a contraction mapping, and
hence, may be more unstable in practice. We address these two issues by
introducing a discount factor into COP-TD. We analyze the behavior of
discounted COP-TD and find it better behaved from a theoretical perspective. We
also propose an alternative soft normalization penalty that can be minimized
online and obviates the need for an explicit projection step. We complement our
analysis with an empirical evaluation of the two techniques in an off-policy
setting on the game Pong from the Atari domain where we find discounted COP-TD
to be better behaved in practice than the soft normalization penalty. Finally,
we perform a more extensive evaluation of discounted COP-TD in 5 games of the
Atari domain, where we find performance gains for our approach.
Comment: AAAI 2019
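As a rough illustration of the kind of reweighting described in this abstract, the sketch below shows a tabular, discounted COP-TD-style update of a covariate-shift ratio c(s) ~ d_pi(s)/d_mu(s). The variable names, learning rate, and exact update form are our assumptions for illustration, not taken from the paper.

    import numpy as np

    def discounted_cop_td_update(c, s, a, s_next, pi, mu, alpha=0.05, gamma_hat=0.99):
        """One tabular update of the ratio estimate c(s) ~ d_pi(s) / d_mu(s).

        Hypothetical sketch: gamma_hat interpolates between an undiscounted
        COP-TD-style target (gamma_hat -> 1) and the constant ratio 1
        (gamma_hat -> 0).
        """
        rho = pi[s, a] / mu[s, a]                            # importance ratio pi(a|s)/mu(a|s)
        target = gamma_hat * rho * c[s] + (1.0 - gamma_hat)  # discounted target
        c[s_next] += alpha * (target - c[s_next])            # move c(s') toward the target
        return c

    # Toy usage on a 3-state, 2-action problem with random policies.
    rng = np.random.default_rng(0)
    pi = rng.dirichlet(np.ones(2), size=3)   # target policy pi(a|s)
    mu = rng.dirichlet(np.ones(2), size=3)   # behaviour policy mu(a|s)
    c = np.ones(3)                           # ratio estimates, initialised at 1
    c = discounted_cop_td_update(c, s=0, a=1, s_next=2, pi=pi, mu=mu)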
Generalized Off-Policy Actor-Critic
We propose a new objective, the counterfactual objective, unifying existing
objectives for off-policy policy gradient algorithms in the continuing
reinforcement learning (RL) setting. Compared to the commonly used excursion
objective, which can be misleading about the performance of the target policy
when deployed, our new objective better predicts such performance. We prove the
Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient
of the counterfactual objective and use an emphatic approach to get an unbiased
sample from this policy gradient, yielding the Generalized Off-Policy
Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over
existing algorithms in MuJoCo robot simulation tasks, the first empirical
success of emphatic algorithms in prevailing deep RL benchmarks.
Comment: NeurIPS 2019
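For readers unfamiliar with the excursion objective mentioned above, a standard way to write it (our notation, not necessarily the paper's) is sketched below; the comment on the counterfactual objective is only a reading of the abstract, not the paper's exact definition.

    % Excursion objective: target-policy values weighted by the behaviour
    % policy's stationary state distribution d_mu (our notation).
    J_{\text{exc}}(\pi) \;=\; \sum_{s} d_{\mu}(s)\, v_{\pi}(s)
    % The counterfactual objective of the paper replaces d_mu with a
    % weighting distribution shifted toward the target policy's own
    % stationary distribution d_pi, which is why it better reflects the
    % performance of pi when deployed (idea only, not the exact definition).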
CrossNorm: Normalization for Off-Policy TD Reinforcement Learning
Off-policy temporal difference (TD) methods are a powerful class of
reinforcement learning (RL) algorithms. Intriguingly, deep off-policy TD
algorithms are not commonly used in combination with feature normalization
techniques, despite positive effects of normalization in other domains. We show
that naive application of existing normalization techniques is indeed not
effective, but that well-designed normalization improves optimization stability
and removes the necessity of target networks. In particular, we introduce a
normalization based on a mixture of on- and off-policy transitions, which we
call cross-normalization. It can be regarded as an extension of batch
normalization that re-centers data for two different distributions, as present
in off-policy learning. Applied to DDPG and TD3, cross-normalization improves
over the state of the art across a range of MuJoCo benchmark tasks.
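As a loose illustration of re-centering features with statistics from two distributions, the NumPy sketch below mixes on-policy and off-policy batch means before centering. The mixing weight and the choice to only re-center (no re-scaling) are our assumptions for the sketch, not the paper's exact formulation.

    import numpy as np

    def cross_normalize(on_policy_feats, off_policy_feats, alpha=0.5):
        """Re-center features with a mixture of on- and off-policy batch statistics.

        Illustrative sketch only: `alpha` mixes the two batch means, and we
        only re-center; the exact formulation is given in the paper.
        """
        mixed_mean = (alpha * on_policy_feats.mean(axis=0)
                      + (1.0 - alpha) * off_policy_feats.mean(axis=0))
        return on_policy_feats - mixed_mean, off_policy_feats - mixed_mean

    # Toy usage with random feature batches of width 8.
    rng = np.random.default_rng(0)
    on_batch = rng.normal(loc=1.0, size=(32, 8))    # features of on-policy transitions
    off_batch = rng.normal(loc=-1.0, size=(64, 8))  # features of replayed transitions
    on_centered, off_centered = cross_normalize(on_batch, off_batch)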
Grounding Aleatoric Uncertainty for Unsupervised Environment Design
Adaptive curricula in reinforcement learning (RL) have proven effective for producing policies robust to discrepancies between the train and test environment. Recently, the Unsupervised Environment Design (UED) framework generalized RL curricula to generating sequences of entire environments, leading to new methods with robust minimax regret properties. Problematically, in partially-observable or stochastic settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment in the intended deployment setting, while curriculum learning necessarily shifts the training distribution. We formalize this phenomenon as curriculum-induced covariate shift (CICS), and describe how its occurrence in aleatoric parameters can lead to suboptimal policies. Directly sampling these parameters from the ground-truth distribution avoids the issue, but thwarts curriculum learning. We propose SAMPLR, a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to CICS. We prove, and validate on challenging domains, that our approach preserves optimality under the ground-truth distribution, while promoting robustness across the full range of environment settings.
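A toy numerical picture of the covariate shift described above (our construction, not taken from the paper): an aleatoric parameter, such as the probability that a stochastic door is open, has a known deployment distribution, while a regret-driven curriculum over-samples hard settings of that same parameter.

    import numpy as np

    rng = np.random.default_rng(0)

    # Ground-truth (deployment) distribution over an aleatoric parameter.
    ground_truth_p = rng.beta(a=8, b=2, size=10_000)   # mostly easy episodes

    # A curriculum that over-samples hard settings shifts the training
    # distribution of the same parameter (toy choice of distributions).
    curriculum_p = rng.beta(a=2, b=8, size=10_000)     # mostly hard episodes

    # The policy trained under the curriculum sees a very different mean
    # parameter than it will face at deployment -- the CICS problem.
    print(f"deployment mean p = {ground_truth_p.mean():.2f}, "
          f"curriculum mean p = {curriculum_p.mean():.2f}")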
A Benchmark Comparison of Imitation Learning-based Control Policies for Autonomous Racing
Autonomous racing with scaled race cars has gained increasing attention as an
effective approach for developing perception, planning and control algorithms
for safe autonomous driving at the limits of the vehicle's handling. To train
agile control policies for autonomous racing, learning-based approaches largely
utilize reinforcement learning, albeit with mixed results. In this study, we
benchmark a variety of imitation learning policies for racing vehicles that are
applied directly or for bootstrapping reinforcement learning both in simulation
and on scaled real-world environments. We show that interactive imitation
learning techniques outperform traditional imitation learning methods and can
greatly improve the performance of reinforcement learning policies by
bootstrapping, thanks to their better sample efficiency. Our benchmarks provide
a foundation for future research on autonomous racing using imitation learning
and reinforcement learning.
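As a generic sketch of the interactive imitation learning loop referred to above (a DAgger-style scheme; the function names and environment interface here are placeholders, not the benchmark's API):

    def interactive_imitation(env, expert, learner, n_rounds=10, horizon=500):
        """DAgger-style loop: roll out the learner, relabel visited states with
        expert actions, and retrain on the aggregated dataset.

        `env`, `expert`, and `learner` are hypothetical interfaces used only to
        illustrate the structure of interactive imitation learning.
        """
        states, actions = [], []
        for _ in range(n_rounds):
            s = env.reset()
            for _ in range(horizon):
                a_learner = learner.act(s)        # the learner drives the car
                states.append(s)
                actions.append(expert.act(s))     # expert relabels the visited state
                s, done = env.step(a_learner)     # assumed (state, done) interface
                if done:
                    break
            learner.fit(states, actions)          # retrain on the aggregated data
        return learner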