Maximum Entropy RL (Provably) Solves Some Robust RL Problems
Many potential applications of reinforcement learning (RL) require guarantees
that the agent will perform well in the face of disturbances to the dynamics or
reward function. In this paper, we prove theoretically that standard maximum
entropy RL is robust to some disturbances in the dynamics and the reward
function. While this capability of MaxEnt RL has been observed empirically in
prior work, to the best of our knowledge our work provides the first rigorous
proof and theoretical characterization of the MaxEnt RL robust set. While a
number of prior robust RL algorithms have been designed to handle similar
disturbances to the reward function or dynamics, these methods typically
require adding additional moving parts and hyperparameters on top of a base RL
algorithm. In contrast, our theoretical results suggest that MaxEnt RL by
itself is robust to certain disturbances, without requiring any additional
modifications. While this does not imply that MaxEnt RL is the best available
robust RL method, MaxEnt RL does possess a striking simplicity and appealing
formal guarantees.
Comment: Blog post and videos: https://bair.berkeley.edu/blog/2021/03/10/maxent-robust-rl/. arXiv admin note: text overlap with arXiv:1910.0191
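For context on the objective the theorem concerns, here is the standard MaxEnt RL objective in a minimal sketch (notation such as the temperature alpha and discount gamma is assumed here, not taken from the abstract):

```latex
% Standard MaxEnt RL objective (sketch): expected discounted return plus
% an entropy bonus on the policy, weighted by a temperature alpha > 0.
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\Big(r(s_t,a_t)
  + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big)\right]
```

Roughly, the paper's claim is that a policy maximizing this entropy-regularized return also performs well under a characterized set of disturbances to the reward and the dynamics, without any modification to the algorithm.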
Contrastive Difference Predictive Coding
Predicting and reasoning about the future lie at the heart of many
time-series questions. For example, goal-conditioned reinforcement learning can
be viewed as learning representations to predict which states are likely to be
visited in the future. While prior methods have used contrastive predictive
coding to model time series data, learning representations that encode
long-term dependencies usually requires large amounts of data. In this paper,
we introduce a temporal difference version of contrastive predictive coding
that stitches together pieces of different time series data to decrease the
amount of data required to learn predictions of future events. We apply this
representation learning method to derive an off-policy algorithm for
goal-conditioned RL. Experiments demonstrate that, compared with prior RL
methods, ours achieves a substantial median improvement in success rates and can
better cope with stochastic environments. In tabular settings, we show that our
method is considerably more sample efficient than the successor representation
and than the standard (Monte Carlo) version of contrastive predictive coding.
Comment: Website (https://chongyi-zheng.github.io/td_infonce) and code
(https://github.com/chongyi-zheng/td_infonce)
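As a reference point for the objective being extended, here is a minimal sketch of the standard Monte Carlo InfoNCE loss for time-series data (PyTorch; encoder names and shapes are assumptions, and the paper's temporal-difference variant is not reproduced here):

```python
import torch
import torch.nn.functional as F

def infonce_loss(state_enc, future_enc):
    """Monte Carlo InfoNCE loss over a batch of (state, future-state) pairs.

    state_enc:  [B, D] encodings of states s_t
    future_enc: [B, D] encodings of states s_{t+k} from the same trajectories
    Each row's own future is the positive; the other rows act as negatives.
    """
    logits = state_enc @ future_enc.T        # [B, B] similarity matrix
    labels = torch.arange(logits.shape[0])   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Usage with hypothetical encoders phi and psi (placeholders, not the released models):
# loss = infonce_loss(phi(states), psi(future_states))
```

The temporal-difference version described in the abstract replaces Monte Carlo sampling of far-future states with a bootstrapped target, which is what allows stitching together pieces of different trajectories.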
Contrastive Value Learning: Implicit Models for Simple Offline RL
Model-based reinforcement learning (RL) methods are appealing in the offline
setting because they allow an agent to reason about the consequences of actions
without interacting with the environment. Prior methods learn a 1-step dynamics
model, which predicts the next state given the current state and action. These
models do not immediately tell the agent which actions to take, but must be
integrated into a larger RL framework. Can we model the environment dynamics in
a different way, such that the learned model does directly indicate the value
of each action? In this paper, we propose Contrastive Value Learning (CVL),
which learns an implicit, multi-step model of the environment dynamics. This
model can be learned without access to reward functions, but nonetheless can be
used to directly estimate the value of each action, without requiring any TD
learning. Because this model represents the multi-step transitions implicitly,
it avoids having to predict high-dimensional observations and thus scales to
high-dimensional tasks. Our experiments demonstrate that CVL outperforms prior
offline RL methods on complex continuous control benchmarks.
Comment: Deep Reinforcement Learning Workshop, NeurIPS 202
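One illustrative way such an implicit model can score actions is sketched below, under the assumption that a contrastively trained critic f(s, a, s_f) approximates the log density ratio between the discounted future-state distribution and the dataset marginal (a sketch of the general idea, not necessarily the paper's exact estimator; f and reward_fn are hypothetical callables):

```python
import torch

def action_values_from_ratio_critic(f, reward_fn, state, candidate_actions,
                                    marginal_states, gamma=0.99):
    """Score candidate actions with an implicit multi-step model.

    If exp(f(s, a, s_f)) ~ p_future(s_f | s, a) / p(s_f), then self-normalized
    importance weighting over states drawn from the marginal p(s_f) estimates the
    expected reward under the discounted future-state distribution, i.e. a value.
    """
    values = []
    for a in candidate_actions:
        log_ratio = f(state, a, marginal_states)        # [N] score per marginal state
        weights = torch.softmax(log_ratio, dim=0)       # self-normalized importance weights
        values.append((weights * reward_fn(marginal_states)).sum() / (1.0 - gamma))
    return torch.stack(values)
```

Nothing in this sketch involves temporal-difference learning or predicting raw observations, which mirrors the two properties the abstract highlights.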
HIQL: Offline Goal-Conditioned RL with Latent States as Actions
Unsupervised pre-training has recently become the bedrock for computer vision
and natural language processing. In reinforcement learning (RL),
goal-conditioned RL can potentially provide an analogous self-supervised
approach for making use of large quantities of unlabeled (reward-free) data.
However, building effective algorithms for goal-conditioned RL that can learn
directly from diverse offline data is challenging, because it is hard to
accurately estimate the exact value function for faraway goals. Nonetheless,
goal-reaching problems exhibit structure, such that reaching distant goals
entails first passing through closer subgoals. This structure can be very
useful, as assessing the quality of actions for nearby goals is typically
easier than for more distant goals. Based on this idea, we propose a
hierarchical algorithm for goal-conditioned RL from offline data. Using one
action-free value function, we learn two policies that allow us to exploit this
structure: a high-level policy that treats states as actions and predicts (a
latent representation of) a subgoal and a low-level policy that predicts the
action for reaching this subgoal. Through analysis and didactic examples, we
show how this hierarchical decomposition makes our method robust to noise in
the estimated value function. We then apply our method to offline goal-reaching
benchmarks, showing that our method can solve long-horizon tasks that stymie
prior methods, can scale to high-dimensional image observations, and can
readily make use of action-free data. Our code is available at
https://seohong.me/projects/hiql
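A minimal sketch of the two-level action selection described above, with hypothetical policy modules pi_high and pi_low (the names and interfaces are assumptions, not the released implementation):

```python
import torch

def hierarchical_act(pi_high, pi_low, state, goal):
    """Two-level action selection for goal-conditioned control.

    pi_high: maps (state, goal) to a latent subgoal representation ("states as actions")
    pi_low:  maps (state, latent subgoal) to a primitive environment action
    """
    with torch.no_grad():
        subgoal = pi_high(state, goal)    # (latent representation of) an intermediate state
        action = pi_low(state, subgoal)   # action for reaching that subgoal
    return action
```

Both policies are learned from a single action-free value function in the method itself; the split matters because assessing actions for nearby subgoals is easier than for faraway goals.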
A Connection between One-Step Regularization and Critic Regularization in Reinforcement Learning
As with any machine learning problem with limited data, effective offline RL
algorithms require careful regularization to avoid overfitting. One-step
methods perform regularization by doing just a single step of policy
improvement, while critic regularization methods do many steps of policy
improvement with a regularized objective. These methods appear distinct.
One-step methods, such as advantage-weighted regression and conditional
behavioral cloning, truncate policy iteration after just one step. This "early
stopping" makes one-step RL simple and stable, but can limit its asymptotic
performance. Critic regularization typically requires more compute but has
appealing lower-bound guarantees. In this paper, we draw a close connection
between these methods: applying a multi-step critic regularization method with
a regularization coefficient of 1 yields the same policy as one-step RL. While
practical implementations violate our assumptions and critic regularization is
typically applied with smaller regularization coefficients, our experiments
nevertheless show that our analysis makes accurate, testable predictions about
practical offline RL methods (CQL and one-step RL) with commonly-used
hyperparameters. Our results do not imply that every problem can be solved with
a single step of policy improvement, but rather that one-step RL might be
competitive with critic regularization on RL problems that demand strong
regularization.
Comment: Accepted to ICML 2023. Video (https://www.youtube.com/watch?v=1xlixIHZ0R4) and code (https://github.com/ben-eysenbach/ac-connection)
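For concreteness, here is a minimal sketch of the advantage-weighted regression update that the one-step methods above are built on (a generic form with an assumed temperature lam and an assumed policy.log_prob interface, not the paper's exact implementation):

```python
import torch

def awr_policy_loss(policy, q_fn, v_fn, states, actions, lam=1.0):
    """One step of policy improvement via advantage-weighted regression.

    The policy is fit by weighted behavioral cloning: each dataset action is weighted
    by exp(advantage / lam) under a critic trained on the behavior policy. Stopping
    after a single such improvement step is the "early stopping" discussed above.
    """
    with torch.no_grad():
        advantage = q_fn(states, actions) - v_fn(states)
        weights = torch.exp(advantage / lam).clamp(max=100.0)  # clip for stability
    log_prob = policy.log_prob(states, actions)  # assumed interface: log pi(a | s)
    return -(weights * log_prob).mean()
```

The paper's connection says that a multi-step critic-regularization method with regularization coefficient 1 recovers the same policy that this one-step procedure produces, under the stated assumptions.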