Guide Your Agent with Adaptive Multimodal Rewards
Developing an agent capable of adapting to unseen environments remains a
difficult challenge in imitation learning. In this work, we present Adaptive
Return-conditioned Policy (ARP), an efficient framework designed to enhance the
agent's generalization ability using natural language task descriptions and
pre-trained multimodal encoders. Our key idea is to compute the similarity
between visual observations and natural language instructions in a
pre-trained multimodal embedding space (such as CLIP) and use it as a reward
signal. We then train a return-conditioned policy using expert demonstrations
labeled with these multimodal rewards. Because the rewards provide adaptive
signals at each timestep, ARP effectively mitigates goal misgeneralization,
yielding superior generalization to unseen text instructions compared with
existing text-conditioned policies. To improve reward quality, we also
introduce a fine-tuning method for the pre-trained multimodal encoders,
further enhancing performance.
Video demonstrations and source code are available on the project website:
https://sites.google.com/view/2023arp.
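As a concrete illustration of the reward described above, here is a minimal sketch that scores an observation against an instruction with an off-the-shelf CLIP model via Hugging Face Transformers; the specific checkpoint and preprocessing are assumptions, not the authors' exact setup.

```python
# Sketch: multimodal reward as CLIP image-text cosine similarity.
# The checkpoint below is an assumed stand-in; see the project page
# for the authors' actual encoders and fine-tuning.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def multimodal_reward(frame, instruction: str) -> float:
    """Cosine similarity between a visual observation and a task description."""
    inputs = processor(text=[instruction], images=frame,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()
```

Expert demonstrations would then be relabeled with this per-timestep reward before training the return-conditioned policy.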
Masked World Models for Visual Control
Visual model-based reinforcement learning (RL) has the potential to enable
sample-efficient robot learning from visual observations. Yet the current
approaches typically train a single model end-to-end for learning both visual
representations and dynamics, making it difficult to accurately model the
interaction between robots and small objects. In this work, we introduce a
visual model-based RL framework that decouples visual representation learning
and dynamics learning. Specifically, we train an autoencoder with convolutional
layers and vision transformers (ViT) to reconstruct pixels given masked
convolutional features, and learn a latent dynamics model that operates on the
representations from the autoencoder. Moreover, to encode task-relevant
information, we introduce an auxiliary reward prediction objective for the
autoencoder. We continually update both the autoencoder and the dynamics model using
online samples collected from environment interaction. We demonstrate that our
decoupling approach achieves state-of-the-art performance on a variety of
visual robotic tasks from Meta-world and RLBench; e.g., we achieve an 81.7%
success rate on 50 visual robotic manipulation tasks from Meta-world, while the
baseline achieves 67.9%. Code is available on the project website:
https://sites.google.com/view/mwm-rl. (Accepted to CoRL 2022.)
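The decoupling above can be pictured with a short sketch: a convolutional stem produces feature tokens, a fraction of which are masked before a ViT encoder that drives both pixel reconstruction and an auxiliary reward head. All layer sizes, the masking ratio, and the omission of positional embeddings are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of the decoupled representation learner: masked conv features in,
# pixel reconstruction plus auxiliary reward prediction out. Sizes and the
# masking scheme are assumptions; positional embeddings are omitted for brevity.
import torch
import torch.nn as nn

class MaskedAutoencoder(nn.Module):
    def __init__(self, d=256, patch=8):
        super().__init__()
        self.conv = nn.Conv2d(3, d, kernel_size=patch, stride=patch)  # 64x64 -> 8x8 tokens
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d))
        self.pixel_head = nn.Linear(d, patch * patch * 3)  # per-token patch reconstruction
        self.reward_head = nn.Linear(d, 1)                 # auxiliary reward objective

    def forward(self, obs, mask_ratio=0.75):
        tokens = self.conv(obs).flatten(2).transpose(1, 2)       # (B, N, d)
        B, N, d = tokens.shape
        drop = torch.rand(B, N, device=obs.device) < mask_ratio  # mask conv features
        tokens = torch.where(drop.unsqueeze(-1),
                             self.mask_token.expand(B, N, d), tokens)
        h = self.vit(tokens)                                     # shared representation
        return self.pixel_head(h), self.reward_head(h.mean(dim=1)), h
```

The latent dynamics model would then be trained separately on the representations h, rather than end-to-end through the pixels.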
Learning What to Defer for Maximum Independent Sets
Designing efficient algorithms for combinatorial optimization appears ubiquitously in various scientific fields. Recently, deep reinforcement learning (DRL) frameworks have gained considerable attention as a new approach: they can automate the design of a solver while relying less on sophisticated domain knowledge of the target problem. However, existing DRL solvers determine the solution using a number of stages proportional to the number of elements in the solution, which severely limits their applicability to large-scale graphs. In this paper, we seek to resolve this issue by proposing a novel DRL scheme, coined learning what to defer (LwD), where the agent adaptively shrinks or stretches the number of stages by learning to distribute the element-wise decisions of the solution across stages. We apply the proposed framework to the maximum independent set (MIS) problem and demonstrate its significant improvement over the current state-of-the-art DRL scheme. We also show that LwD can outperform conventional MIS solvers on large-scale graphs with millions of vertices under a limited time budget.
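A toy sketch of the deferral mechanism may help: at every stage each undecided vertex chooses to join the set, leave it, or defer, and only deferred vertices carry over to the next stage. The `policy` callable and the feasibility repair below are hypothetical simplifications (the actual agent is a graph neural network trained with RL).

```python
# Toy sketch of "learning what to defer": per-vertex actions are
# include / exclude / defer, and deferred vertices are re-decided later.
# `policy` is a hypothetical stand-in returning (n, 3) per-vertex logits.
import torch

INCLUDE, EXCLUDE, DEFER = 0, 1, 2

def rollout(policy, adj: torch.Tensor, max_stages: int = 10):
    """adj: (n, n) boolean adjacency matrix of the input graph."""
    n = adj.size(0)
    label = torch.full((n,), DEFER, dtype=torch.long)  # all vertices undecided
    for _ in range(max_stages):
        undecided = label == DEFER
        if not undecided.any():
            break                                      # stages shrink when done early
        logits = policy(adj, label)                    # (n, 3) per-vertex logits
        acts = torch.distributions.Categorical(logits=logits[undecided]).sample()
        label[undecided] = acts
        # Simplified repair: deferred neighbors of included vertices are excluded,
        # keeping the partial solution an independent set.
        included = label == INCLUDE
        conflict = adj[included].any(dim=0) & (label == DEFER)
        label[conflict] = EXCLUDE
    label[label == DEFER] = EXCLUDE                    # decide any leftovers
    return label
```

Because the loop ends as soon as no vertex defers, the number of stages adapts to the instance instead of scaling with the solution size.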
Reinforcement Learning with Action-Free Pre-Training from Videos
Recent unsupervised pre-training methods have proven effective in language and
vision domains by learning useful representations for multiple
downstream tasks. In this paper, we investigate if such unsupervised
pre-training methods can also be effective for vision-based reinforcement
learning (RL). To this end, we introduce a framework that learns
representations useful for understanding the dynamics via generative
pre-training on videos. Our framework consists of two phases: we pre-train an
action-free latent video prediction model, and then utilize the pre-trained
representations for efficiently learning action-conditional world models on
unseen environments. To incorporate additional action inputs during
fine-tuning, we introduce a new architecture that stacks an action-conditional
latent prediction model on top of the pre-trained action-free prediction model.
Moreover, for better exploration, we propose a video-based intrinsic bonus that
leverages pre-trained representations. We demonstrate that our framework
significantly improves both the final performance and the sample efficiency of
vision-based RL in a variety of manipulation and locomotion tasks. Code is
available at https://github.com/younggyoseo/apv. (Accepted to ICML 2022.)
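The stacked fine-tuning architecture can be sketched as follows: a pre-trained action-free video model supplies latents, and a new action-conditional recurrent module is trained on top to predict the next latent from the current latent and action. The module shapes and the `action_free` interface are assumptions for illustration; the repository above contains the actual model.

```python
# Sketch of stacking an action-conditional predictor on a pre-trained
# action-free video model. Shapes and the action_free interface are
# illustrative assumptions, not the authors' exact architecture.
import torch
import torch.nn as nn

class ActionConditionalHead(nn.Module):
    def __init__(self, action_free: nn.Module, z_dim=128, a_dim=4):
        super().__init__()
        self.action_free = action_free               # pre-trained on videos, no actions
        self.gru = nn.GRUCell(z_dim + a_dim, z_dim)  # stacked, action-aware dynamics
        self.predict = nn.Linear(z_dim, z_dim)

    def forward(self, obs_seq, actions):
        # obs_seq: (T, B, ...) frames; actions: (T, B, a_dim)
        z = self.action_free(obs_seq)                # (T, B, z_dim) action-free latents
        h = torch.zeros(z.size(1), z.size(2), device=z.device)
        preds = []
        for t in range(z.size(0)):
            h = self.gru(torch.cat([z[t], actions[t]], dim=-1), h)
            preds.append(self.predict(h))            # prediction of the next latent
        return torch.stack(preds)                    # (T, B, z_dim)
```

Training would regress each prediction onto the next action-free latent (e.g., preds[:-1] against z[1:]), so the pre-trained representations are reused while the new module learns the action-conditional dynamics.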