On the Generalization Gap in Reparameterizable Reinforcement Learning
Understanding generalization in reinforcement learning (RL) is a significant
challenge, as many common assumptions of traditional supervised learning theory
do not apply. We focus on the special class of reparameterizable RL problems,
where the trajectory distribution can be decomposed using the reparameterization
trick. For this problem class, estimating the expected return is efficient and
the trajectory can be computed deterministically given peripheral random
variables, which enables us to study reparameterizable RL using supervised
learning and transfer learning theory. Through these relationships, we derive
guarantees on the gap between the expected and empirical return for both
intrinsic and external errors, based on Rademacher complexity as well as the
PAC-Bayes bound. Our bound suggests the generalization capability of
reparameterizable RL is related to multiple factors including "smoothness" of
the environment transition, reward and agent policy function class. We also
empirically verify the relationship between the generalization gap and these
factors through simulations.
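The idea that a trajectory becomes deterministic once the peripheral random variables are fixed can be illustrated with a minimal sketch. The 1-D linear dynamics, the quadratic reward, and the names `rollout_return`, `noise_bank`, and `empirical_return` are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def rollout_return(theta, noise, s0=1.0, a=0.9, b=0.5):
    """Return of a linear policy on a toy 1-D system, given fixed noise.

    Assumed toy model: s' = a*s + b*u + xi, action u = -theta*s,
    reward -(s^2 + u^2). The noise xi is supplied from outside, so the
    whole trajectory is a deterministic function of (theta, noise).
    """
    s, total = s0, 0.0
    for xi in noise:
        u = -theta * s
        total += -(s * s + u * u)
        s = a * s + b * u + xi
    return total

rng = np.random.default_rng(0)
# Peripheral random variables: sampled once, independently of the policy.
noise_bank = rng.normal(scale=0.1, size=(32, 20))  # 32 trajectories, 20 steps

def empirical_return(theta):
    """Average return over the fixed noise bank (a supervised-style sample mean)."""
    return float(np.mean([rollout_return(theta, xi) for xi in noise_bank]))
```

Because `noise_bank` does not depend on `theta`, the empirical return behaves like an average loss over a fixed i.i.d. sample, which is the kind of quantity that Rademacher-complexity and PAC-Bayes arguments address.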
Automatic Data Augmentation for Generalization in Deep Reinforcement Learning
Deep reinforcement learning (RL) agents often fail to generalize to unseen
scenarios, even when they are trained on many instances of semantically similar
environments. Data augmentation has recently been shown to improve the sample
efficiency and generalization of RL agents. However, different tasks tend to
benefit from different kinds of data augmentation. In this paper, we compare
three approaches for automatically finding an appropriate augmentation. These
are combined with two novel regularization terms for the policy and value
function, required to make the use of data augmentation theoretically sound for
certain actor-critic algorithms. We evaluate our methods on the Procgen
benchmark which consists of 16 procedurally-generated environments and show
that they improve test performance by ~40% relative to standard RL algorithms.
Our agent outperforms other baselines specifically designed to improve
generalization in RL. In addition, we show that our agent learns policies and
representations that are more robust to changes in the environment that do not
affect the agent, such as the background. Our implementation is available at
https://github.com/rraileanu/auto-drac
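The abstract does not state the form of the two regularization terms. One plausible reading, sketched below under that assumption, is a consistency penalty: the policy and the value estimate should change little when the observation is augmented. The function names and the choice of KL divergence and squared error are illustrative assumptions, not the released implementation.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def policy_regularizer(logits_orig, logits_aug):
    """Mean KL(pi(.|s) || pi(.|f(s))) over a batch of discrete-action logits.

    logits_orig: policy logits on the original observations s
    logits_aug:  policy logits on the augmented observations f(s)
    """
    log_p = log_softmax(logits_orig)
    log_q = log_softmax(logits_aug)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))

def value_regularizer(v_orig, v_aug):
    """Mean squared difference between value estimates on s and on f(s)."""
    return float(np.mean((v_aug - v_orig) ** 2))
```

In an actor-critic loss, terms like these would typically be added with a weighting coefficient to the usual objective; the coefficient and the augmentation `f` are left unspecified here.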
Discount Factor as a Regularizer in Reinforcement Learning
Specifying a Reinforcement Learning (RL) task involves choosing a suitable
planning horizon, which is typically modeled by a discount factor. It is known
that applying RL algorithms with a lower discount factor can act as a
regularizer, improving performance in the limited data regime. Yet the exact
nature of this regularizer has not been investigated. In this work, we fill in
this gap. For several Temporal-Difference (TD) learning methods, we show an
explicit equivalence between using a reduced discount factor and adding an
explicit regularization term to the algorithm's loss. Motivated by the
equivalence, we empirically study this technique compared to standard
regularization by extensive experiments in discrete and continuous domains,
using tabular and functional representations. Our experiments suggest the
regularization effectiveness is strongly related to properties of the available
data, such as size, distribution, and mixing rate.
Comment: Published in ICML 202
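A small numerical check can make the flavor of this equivalence concrete. The sketch below uses only an elementary algebraic identity for the one-step TD error; it is an illustration under that assumption and is not claimed to be the exact regularizer derived in the paper. The function names are hypothetical.

```python
def td_error(r, v_s, v_next, gamma):
    """One-step TD error: r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s

def td_error_with_penalty(r, v_s, v_next, gamma, gamma_low):
    """TD error at the original discount, minus a term proportional to V(s').

    Algebraically equal to td_error(r, v_s, v_next, gamma_low), since
    r + gamma_low*V(s') - V(s)
        = (r + gamma*V(s') - V(s)) - (gamma - gamma_low)*V(s').
    """
    return td_error(r, v_s, v_next, gamma) - (gamma - gamma_low) * v_next
```

Read this way, lowering the discount factor acts like keeping the original discount and adding a correction that shrinks the next-state value estimate, which is one intuition for why it can behave as a regularizer.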
SECANT: Self-Expert Cloning for Zero-Shot Generalization of Visual Policies
Generalization has been a long-standing challenge for reinforcement learning (RL). Visual RL, in particular, can be easily distracted by irrelevant factors in high-dimensional observation space. In this work, we consider robust policy learning which targets zero-shot generalization to unseen visual environments with large distributional shift. We propose SECANT, a novel self-expert cloning technique that leverages image augmentation in two stages to *decouple* robust representation learning from policy optimization. Specifically, an expert policy is first trained by RL from scratch with weak augmentations. A student network then learns to mimic the expert policy by supervised learning with strong augmentations, making its representation more robust against visual variations compared to the expert. Extensive experiments demonstrate that SECANT significantly advances the state of the art in zero-shot generalization across 4 challenging domains. Our average reward improvements over prior SOTAs are: DeepMind Control (+26.5%), robotic manipulation (+337.8%), vision-based autonomous driving (+47.7%), and indoor object navigation (+15.8%). Code release and video are available at https://linxifan.github.io/secant-site/
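The second stage described above, in which a student imitates the expert under strong augmentation, can be sketched as a supervised regression step. Everything here is a toy assumption for illustration: the linear student, the gain-and-noise augmentation in `strong_augment`, and the squared-error imitation loss are not taken from the SECANT code release.

```python
import numpy as np

def strong_augment(obs, rng):
    """Hypothetical strong augmentation: random per-sample gain plus noise."""
    gain = rng.uniform(0.5, 1.5, size=(obs.shape[0], 1))
    return gain * obs + rng.normal(scale=0.3, size=obs.shape)

def imitation_step(student_w, obs, expert_actions, rng, lr=0.05):
    """One supervised step: the student sees strongly augmented observations
    and regresses onto the expert's actions for the same states.

    student_w:      (obs_dim, act_dim) weights of a linear student (assumed)
    obs:            (batch, obs_dim) observations
    expert_actions: (batch, act_dim) actions produced by the frozen expert
    """
    x = strong_augment(obs, rng)
    err = x @ student_w - expert_actions
    loss = float(np.mean(err ** 2))
    grad = 2.0 * x.T @ err / err.size  # gradient of the mean squared error
    return student_w - lr * grad, loss
```

The expert's actions are passed in as fixed targets, so policy optimization (stage one) and robust representation learning (stage two) stay decoupled: no RL gradient flows through the augmented inputs.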