Learn Goal-Conditioned Policy with Intrinsic Motivation for Deep Reinforcement Learning
It is important for an agent to learn a widely applicable and general-purpose
policy that can achieve diverse goals, including images and text descriptions.
Considering such perceptually-specific goals, the frontier of
deep reinforcement learning research is to learn a goal-conditioned policy
without hand-crafted rewards. To learn this kind of policy, recent works
usually take as the reward the non-parametric distance to a given goal in an
explicit embedding space. From a different viewpoint, we propose a novel
unsupervised learning approach named goal-conditioned policy with intrinsic
motivation (GPIM), which jointly learns both an abstract-level policy and a
goal-conditioned policy. The abstract-level policy is conditioned on a latent
variable to optimize a discriminator and discovers diverse states that are
further rendered into perceptually-specific goals for the goal-conditioned
policy. The learned discriminator serves as an intrinsic reward function for
the goal-conditioned policy to imitate the trajectory induced by the
abstract-level policy. Experiments on various robotic tasks demonstrate the
effectiveness and efficiency of our proposed GPIM method which substantially
outperforms prior techniques.
Comment: Accepted by AAAI-2
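As a rough illustration of the mechanism above, the sketch below shows how a learned discriminator q(z | s) can serve as an intrinsic reward for a goal-conditioned policy. The network shapes and the DIAYN-style log-probability reward are assumptions for illustration, not the authors' released implementation.

```python
# A hypothetical sketch: discriminator as intrinsic reward (DIAYN-style).
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Predicts the latent variable z behind a visited state, i.e. q(z | s)."""
    def __init__(self, state_dim: int, n_latents: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_latents),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # unnormalized logits over latents

def intrinsic_reward(disc: Discriminator, state: torch.Tensor,
                     z: torch.Tensor) -> torch.Tensor:
    """Reward states from which the discriminator recovers the latent z
    that generated the rendered goal, i.e. log q(z | s)."""
    log_probs = torch.log_softmax(disc(state), dim=-1)
    return log_probs.gather(-1, z.unsqueeze(-1)).squeeze(-1)

# Usage: score a batch of states visited by the goal-conditioned policy
# against the latents behind their goals (all sizes are made up).
disc = Discriminator(state_dim=8, n_latents=16)
states = torch.randn(32, 8)
z = torch.randint(0, 16, (32,))
r_int = intrinsic_reward(disc, states, z)  # shape: (32,)
```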
CLUE: Calibrated Latent Guidance for Offline Reinforcement Learning
Offline reinforcement learning (RL) aims to learn an optimal policy from
pre-collected and labeled datasets, which eliminates the time-consuming data
collection in online RL. However, offline RL still bears a large burden of
specifying/handcrafting extrinsic rewards for each transition in the offline
data. As a remedy for the labor-intensive labeling, we propose to endow offline
RL tasks with a few expert data and utilize the limited expert data to drive
intrinsic rewards, thus eliminating the need for extrinsic rewards. To achieve
that, we introduce Calibrated Latent gUidancE (CLUE), which utilizes a
conditional variational auto-encoder to learn a latent space such that
intrinsic rewards can be directly quantified over the latent space. CLUE's key
idea is to align the intrinsic rewards with the expert intention by enforcing
the embeddings of expert data onto a calibrated contextual representation. We
instantiate the expert-driven intrinsic rewards in sparse-reward offline RL
tasks, offline imitation learning (IL) tasks, and unsupervised offline RL
tasks. Empirically, we find that CLUE can effectively improve the sparse-reward
offline RL performance, outperform the state-of-the-art offline IL baselines,
and discover diverse skills from static reward-free offline data.
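As a loose sketch of how intrinsic rewards might be quantified over a learned latent space, the snippet below scores offline transitions by their distance to a calibrated expert representation. The encoder architecture, the mean-embedding calibration, and the negative-distance reward are illustrative assumptions rather than CLUE's exact objective.

```python
# A hypothetical sketch: expert-aligned intrinsic reward in a latent space.
import torch
import torch.nn as nn

class TransitionEncoder(nn.Module):
    """Maps a (state, action) pair to a latent embedding."""
    def __init__(self, state_dim: int, action_dim: int, latent_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1))

def expert_intrinsic_reward(enc, s, a, expert_s, expert_a):
    """Score offline transitions by closeness to a calibrated expert
    representation (here simply the mean expert embedding)."""
    with torch.no_grad():
        center = enc(expert_s, expert_a).mean(dim=0)  # "calibrated" context
        z = enc(s, a)
        return -torch.norm(z - center, dim=-1)  # higher = more expert-like

# Usage with made-up shapes: 64 offline transitions, 10 expert transitions.
enc = TransitionEncoder(state_dim=8, action_dim=2)
r = expert_intrinsic_reward(enc, torch.randn(64, 8), torch.randn(64, 2),
                            torch.randn(10, 8), torch.randn(10, 2))
```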
Beyond Reward: Offline Preference-guided Policy Optimization
This study focuses on the topic of offline preference-based reinforcement
learning (PbRL), a variant of conventional reinforcement learning that
dispenses with the need for online interaction or specification of reward
functions. Instead, the agent is provided with fixed offline trajectories and
human preferences between pairs of trajectories to extract the dynamics and
task information, respectively. Since the dynamics and task information are
orthogonal, a naive approach would involve using preference-based reward
learning followed by an off-the-shelf offline RL algorithm. However, this
requires separately learning a scalar reward function, which is assumed to act
as an information bottleneck in the learning process. To address this issue, we
propose the offline preference-guided policy optimization (OPPO) paradigm,
which models offline trajectories and preferences in a one-step process,
eliminating the need for separately learning a reward function. OPPO achieves
this by introducing an offline hindsight information matching objective for
optimizing a contextual policy and a preference modeling objective for finding
the optimal context. OPPO further derives a well-performing decision policy by
optimizing the two objectives iteratively. Our empirical results demonstrate
that OPPO effectively models offline preferences and outperforms prior
competing baselines, including offline RL algorithms performed over either true
or pseudo reward function specifications. Our code is available on the project
website: https://sites.google.com/view/oppo-icml-2023
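To make the two objectives concrete, the hedged sketch below pairs a hindsight information matching loss (a contextual policy reconstructing a trajectory's own actions from its encoded context) with a Bradley-Terry preference loss over contexts. The module shapes, mean-pooling encoder, MSE matching loss, and score function are hypothetical stand-ins, not OPPO's published code.

```python
# A hypothetical sketch: hindsight information matching + preference modeling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajEncoder(nn.Module):
    """Encodes a trajectory (states, actions) into a single context z."""
    def __init__(self, state_dim: int, action_dim: int, z_dim: int = 8):
        super().__init__()
        self.net = nn.Linear(state_dim + action_dim, z_dim)

    def forward(self, states, actions):
        # Mean-pool per-step features into one trajectory-level context.
        return self.net(torch.cat([states, actions], dim=-1)).mean(dim=0)

class ContextualPolicy(nn.Module):
    """Predicts actions from states and a trajectory context."""
    def __init__(self, state_dim: int, z_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Linear(state_dim + z_dim, action_dim)

    def forward(self, states, z):
        z = z.expand(states.shape[0], -1)  # broadcast context over time steps
        return self.net(torch.cat([states, z], dim=-1))

def hindsight_matching_loss(policy, encoder, states, actions):
    """The contextual policy should reconstruct a trajectory's own actions
    given that trajectory's hindsight-encoded context."""
    z = encoder(states, actions)
    return F.mse_loss(policy(states, z), actions)

def preference_loss(score, encoder, tau_w, tau_l):
    """Bradley-Terry objective for finding the optimal context: the preferred
    trajectory's context should score higher than the rejected one."""
    logit = score(encoder(*tau_w)) - score(encoder(*tau_l))
    return F.binary_cross_entropy_with_logits(logit, torch.ones_like(logit))
```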
Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization
In this work, we decouple the iterative bi-level offline RL (value estimation
and policy extraction) from the offline training phase, forming a non-iterative
bi-level paradigm and avoiding the iterative error propagation over two levels.
Specifically, this non-iterative paradigm allows us to conduct inner-level
optimization (value estimation) in training, while performing outer-level
optimization (policy extraction) in testing. Naturally, such a paradigm raises
three core questions that are not fully answered by prior non-iterative offline
RL counterparts such as reward-conditioned policies: (q1) What information should we
transfer from the inner-level to the outer-level? (q2) What should we pay
attention to when exploiting the transferred information for safe/confident
outer-level optimization? (q3) What are the benefits of concurrently conducting
outer-level optimization during testing? Motivated by model-based optimization
(MBO), we propose DROP (design from policies), which fully answers the above
questions. Specifically, in the inner-level, DROP decomposes offline data into
multiple subsets, and learns an MBO score model (a1). To safely exploit the
score model at the outer level, we explicitly learn a behavior embedding
and introduce a conservative regularization (a2). During testing, we show that
DROP permits deployment adaptation, enabling an adaptive inference across
states (a3). Empirically, we evaluate DROP on various tasks, showing that DROP
achieves comparable or better performance than prior methods.
Comment: NeurIPS 202
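As an illustration of outer-level, test-time optimization with a conservative regularizer, the sketch below gradient-ascends a learned score model over an embedding while penalizing distance to the behavior embeddings. The update rule, the support penalty, and all hyperparameters are assumptions, not DROP's actual procedure.

```python
# A hypothetical sketch: conservative test-time adaptation in embedding space.
import torch

def adapt_embedding(score_model, state, behavior_z,
                    steps: int = 10, lr: float = 0.1, beta: float = 1.0):
    """Gradient-ascend the learned score over an embedding z, penalizing
    distance to the behavior embeddings so the search stays in-support."""
    z = behavior_z.mean(dim=0).clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        support = torch.cdist(z.unsqueeze(0), behavior_z).min()  # conservatism
        loss = -score_model(state, z) + beta * support
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()  # state-adaptive context for the outer-level policy
```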