Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution
Reinforcement Learning algorithms require a large number of samples to solve
complex tasks with sparse and delayed rewards. Complex tasks can often be
hierarchically decomposed into sub-tasks. A step in the Q-function can be
associated with solving a sub-task, where the expectation of the return
increases. RUDDER has been introduced to identify these steps and then
redistribute reward to them, thus immediately giving reward if sub-tasks are
solved. Since the problem of delayed rewards is mitigated, learning is
considerably sped up. However, for complex tasks, current exploration
strategies as deployed in RUDDER struggle with discovering episodes with high
rewards. Therefore, we assume that episodes with high rewards are given as
demonstrations and do not have to be discovered by exploration. Typically, the
number of demonstrations is small, and RUDDER's LSTM model, being a deep
learning method, does not learn well from so few examples. Hence, we introduce
Align-RUDDER, which is RUDDER
with two major modifications. First, Align-RUDDER assumes that episodes with
high rewards are given as demonstrations, replacing RUDDER's safe exploration
and lessons replay buffer. Second, we replace RUDDER's LSTM model by a profile
model that is obtained from multiple sequence alignment of demonstrations.
As is known from bioinformatics, profile models can be constructed from as few
as two demonstrations. Align-RUDDER inherits the concept of reward
redistribution, which considerably reduces the delay of rewards, thus speeding
up learning. Align-RUDDER outperforms competitors on complex artificial tasks
with delayed reward and few demonstrations. On the Minecraft ObtainDiamond
task, Align-RUDDER is able to mine a diamond, though not frequently. Github:
https://github.com/ml-jku/align-rudder, YouTube: https://youtu.be/HO-_8ZUl-U
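The key mechanism here is reward redistribution guided by a profile model built from aligned demonstrations. As a hedged illustration of that idea (a simplified sketch, not the authors' implementation), the snippet below greedily aligns an episode's abstract "events" to a toy position-specific profile and redistributes the episode return in proportion to the increase in alignment score; the event abstraction, the profile format, and the greedy alignment are assumptions made for the example.

```python
# Simplified sketch of Align-RUDDER-style reward redistribution (illustrative only).
# Assumes episodes are abstracted into discrete "event" symbols and that a profile
# (position-specific scoring columns) was built from aligned demonstrations.
import numpy as np

def prefix_scores(events, profile):
    """Greedy alignment of the episode to the profile: each profile column is
    consumed once matched, so uninformative events act like gaps."""
    scores = np.zeros(len(events) + 1)
    col = 0
    for t, event in enumerate(events, start=1):
        gain = 0.0
        if col < len(profile) and event in profile[col]:
            gain = profile[col][event]
            col += 1
        scores[t] = scores[t - 1] + gain
    return scores

def redistribute_reward(events, episode_return, profile):
    """Per-step reward is the increase in prefix alignment score, rescaled so the
    redistributed rewards sum to the original (delayed) episode return."""
    deltas = np.diff(prefix_scores(events, profile))
    total = deltas.sum()
    return deltas * (episode_return / total) if total != 0 else deltas

# Toy profile from demonstrations of "collect wood -> craft pickaxe -> mine".
profile = [{"wood": 1.0}, {"pickaxe": 1.0}, {"mine": 1.0}]
episode = ["wood", "idle", "pickaxe", "mine"]
print(redistribute_reward(episode, episode_return=1.0, profile=profile))
```

In this toy run, the steps that complete sub-tasks ("wood", "pickaxe", "mine") receive reward immediately, while the uninformative "idle" step receives none, which is the effect that mitigates the delayed-reward problem.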
Improving Generalization in Game Agents with Data Augmentation in Imitation Learning
Imitation learning is an effective approach for training game-playing agents
and, consequently, for efficient game production. However, generalization - the
ability to perform well in related but unseen scenarios - is an essential
requirement that remains an unsolved challenge for game AI. Generalization is
difficult for imitation learning agents because it requires the algorithm to
take meaningful actions outside of the training distribution. In this paper we
propose a solution to this challenge. Inspired by the success of data
augmentation in supervised learning, we augment the training data so the
distribution of states and actions in the dataset better represents the real
state-action distribution. This study evaluates methods for combining and
applying data augmentations to observations, to improve generalization of
imitation learning agents. It also provides a performance benchmark of these
augmentations across several 3D environments. These results demonstrate that
data augmentation is a promising framework for improving generalization in
imitation learning agents.
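Because the paper's central idea is augmenting observations so the dataset better covers the real state-action distribution, a minimal sketch of that setup follows; the particular augmentations (pad-and-crop shift, Gaussian noise) and the array shapes are assumptions for illustration, not the methods benchmarked in the paper.

```python
# Hedged sketch of observation augmentation for behavioral cloning (illustrative only).
# Augmentations perturb image observations so training covers states near, but not in,
# the expert data; the action labels are left unchanged.
import numpy as np

def random_shift(obs, max_shift=4):
    """Pad-and-crop translation, a common augmentation for pixel observations."""
    h, w, _ = obs.shape
    padded = np.pad(obs, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)), mode="edge")
    dy, dx = np.random.randint(0, 2 * max_shift + 1, size=2)
    return padded[dy:dy + h, dx:dx + w]

def gaussian_noise(obs, std=0.02):
    """Additive pixel noise, clipped back to the valid [0, 1] range."""
    return np.clip(obs + np.random.normal(0.0, std, obs.shape), 0.0, 1.0)

def augment_batch(observations, actions, augmentations):
    """Apply each augmentation to the whole batch and keep the original samples too."""
    aug_obs, aug_act = [observations], [actions]
    for augment in augmentations:
        aug_obs.append(np.stack([augment(o) for o in observations]))
        aug_act.append(actions)
    return np.concatenate(aug_obs), np.concatenate(aug_act)

# Usage: (N, H, W, C) observations in [0, 1] and integer action labels.
obs_batch = np.random.rand(8, 64, 64, 3)
act_batch = np.random.randint(0, 4, size=8)
obs_aug, act_aug = augment_batch(obs_batch, act_batch, [random_shift, gaussian_noise])
```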
Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction
We study the problem of learning goal-conditioned policies in Minecraft, a
popular, widely accessible yet challenging open-ended environment for
developing human-level multi-task agents. We first identify two main challenges
of learning such policies: 1) the indistinguishability of tasks from the state
distribution, due to the vast scene diversity, and 2) the non-stationary nature
of environment dynamics caused by partial observability. To tackle the first
challenge, we propose Goal-Sensitive Backbone (GSB) for the policy to encourage
the emergence of goal-relevant visual state representations. To tackle the
second challenge, the policy is further fueled by an adaptive horizon
prediction module that helps alleviate the learning uncertainty brought by the
non-stationary dynamics. Experiments on 20 Minecraft tasks show that our method
significantly outperforms the best baseline so far, doubling the performance on
many of them. Our ablation and exploratory studies then explain how our
approach outperforms its counterparts and also reveal a surprising bonus:
zero-shot generalization to new scenes (biomes). We hope our agent can help
shed light on learning goal-conditioned, multi-task agents in challenging,
open-ended environments like Minecraft.
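To make the two ingredients concrete, here is a minimal PyTorch sketch of a goal-conditioned policy whose visual features are modulated by a goal embedding (a FiLM-style stand-in for the Goal-Sensitive Backbone) plus an auxiliary horizon-prediction head; the layer sizes, fusion mechanism, and loss weighting are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: goal-modulated visual features + steps-to-goal prediction (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GoalConditionedPolicy(nn.Module):
    def __init__(self, num_actions, goal_dim=512, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                    # small visual backbone
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU(),
        )
        self.film = nn.Linear(goal_dim, 2 * feat_dim)    # goal -> per-feature scale & shift
        self.policy_head = nn.Linear(feat_dim, num_actions)
        self.horizon_head = nn.Linear(feat_dim, 1)       # predicts remaining steps to the goal

    def forward(self, obs, goal_embedding):
        feats = self.encoder(obs)
        scale, shift = self.film(goal_embedding).chunk(2, dim=-1)
        feats = feats * (1 + scale) + shift              # goal-sensitive modulation
        return self.policy_head(feats), self.horizon_head(feats).squeeze(-1)

# One behavioral-cloning step with an auxiliary horizon-regression loss (toy shapes).
policy = GoalConditionedPolicy(num_actions=10)
obs = torch.rand(4, 3, 64, 64)
goal = torch.rand(4, 512)
expert_actions = torch.randint(0, 10, (4,))
steps_to_goal = torch.rand(4) * 100
logits, horizon = policy(obs, goal)
loss = F.cross_entropy(logits, expert_actions) + 0.1 * F.mse_loss(horizon, steps_to_goal)
loss.backward()
```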
STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
Constructing AI models that respond to text instructions is challenging,
especially for sequential decision-making tasks. This work introduces an
instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1,
demonstrating that the unCLIP approach, utilized in DALL-E 2, is also effective
for creating instruction-following sequential decision-making agents. STEVE-1
is trained in two steps: adapting the pretrained VPT model to follow commands
in MineCLIP's latent space, then training a prior to predict latent codes from
text. This allows us to finetune VPT through self-supervised behavioral cloning
and hindsight relabeling, bypassing the need for costly human text annotations.
By leveraging pretrained models like VPT and MineCLIP and employing best
practices from text-conditioned image generation, STEVE-1 costs just $60 to
train and can follow a wide range of short-horizon open-ended text and visual
instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction
following in Minecraft with low-level controls (mouse and keyboard) and raw
pixel inputs, far outperforming previous baselines. We provide experimental
evidence highlighting key factors for downstream performance, including
pretraining, classifier-free guidance, and data scaling. All resources,
including our model weights, training scripts, and evaluation tools, are made
available for further research.
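As a hedged sketch of the two-stage recipe described above, the code below illustrates hindsight relabeling with visual goal embeddings, an unCLIP-style text-to-latent prior, and classifier-free guidance; the function and model names are placeholders (a stand-in encoder replaces MineCLIP), not the released STEVE-1 code or API.

```python
# Illustrative sketch only: the encoder, dimensions, and guidance scale are assumptions.
import torch
import torch.nn as nn

def hindsight_relabel(episode_frames, video_encoder, horizon=16):
    """For each timestep, embed a future segment of the same episode and use it as the
    'goal' the agent was implicitly pursuing -- no human text annotations needed."""
    goals = []
    for t in range(len(episode_frames)):
        segment = episode_frames[t:min(t + horizon, len(episode_frames))]
        goals.append(video_encoder(segment))
    return torch.stack(goals)

class TextToLatentPrior(nn.Module):
    """unCLIP-style prior: map a text embedding to a visual goal embedding."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 1024), nn.GELU(), nn.Linear(1024, dim))

    def forward(self, text_embedding):
        return self.net(text_embedding)

def classifier_free_guidance(cond_logits, uncond_logits, scale=6.0):
    """At inference, push conditional predictions away from the unconditional ones."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

# Toy usage with a stand-in video encoder (a real setup would embed segments with MineCLIP).
frames = [torch.rand(3, 160, 256) for _ in range(32)]
fake_encoder = lambda segment: torch.stack(segment).mean(dim=(0, 2, 3))  # tiny 3-d "embedding"
goal_labels = hindsight_relabel(frames, fake_encoder, horizon=16)
```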
Trial without Error: Towards Safe Reinforcement Learning via Human Intervention
AI systems are increasingly applied to complex tasks that involve interaction
with humans. During training, such systems are potentially dangerous, as they
haven't yet learned to avoid actions that could cause serious harm. How can an
AI system explore and learn without making a single mistake that harms humans
or otherwise causes serious damage? For model-free reinforcement learning,
having a human "in the loop" and ready to intervene is currently the only way
to prevent all catastrophes. We formalize human intervention for RL and show
how to reduce the human labor required by training a supervised learner to
imitate the human's intervention decisions. We evaluate this scheme on Atari
games, with a Deep RL agent being overseen by a human for four hours. When the
class of catastrophes is simple, we are able to prevent all catastrophes
without affecting the agent's learning (whereas an RL baseline fails due to
catastrophic forgetting). However, this scheme is less successful when
catastrophes are more complex: it reduces but does not eliminate catastrophes,
and the supervised learner fails on adversarial examples found by the agent.
Extrapolating to more challenging environments, we show that our implementation
would not scale (due to the infeasible amount of human labor required). We
outline extensions of the scheme that are necessary if we are to train
model-free agents without a single catastrophe.
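A minimal sketch of the human-in-the-loop scheme may help: the human's intervention decisions are distilled into a learned "blocker" classifier that vets every proposed action before it reaches the environment. The wrapper below assumes a classic Gym-style step API and placeholder blocker and safe-action objects; it is illustrative, not the paper's implementation.

```python
# Illustrative blocker wrapper (assumed interfaces; not the authors' code).
class BlockerWrapper:
    """Vets each proposed action with a blocker trained to imitate human interventions."""

    def __init__(self, env, blocker, safe_action, penalty=-1.0):
        self.env = env
        self.blocker = blocker          # callable: (observation, action) -> P(catastrophe)
        self.safe_action = safe_action  # fallback action assumed to be harmless
        self.penalty = penalty          # discourages proposing catastrophic actions
        self._last_obs = None

    def reset(self):
        self._last_obs = self.env.reset()
        return self._last_obs

    def step(self, action):
        blocked = self.blocker(self._last_obs, action) > 0.5
        executed = self.safe_action if blocked else action
        obs, reward, done, info = self.env.step(executed)   # classic Gym 4-tuple API
        if blocked:
            reward += self.penalty
            info["blocked"] = True
        self._last_obs = obs
        return obs, reward, done, info
```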
- …