Effective offline training and efficient online adaptation
Developing agents that behave intelligently in the world is an open challenge in
machine learning. Desiderata for such agents are efficient exploration, maximizing
long term utility, and the ability to effectively leverage prior data to solve new
tasks. Reinforcement learning (RL) is an approach that is predicated on learning
by directly interacting with an environment through trial-and-error, and presents
a way for us to train and deploy such agents. Moreover, combining RL with
powerful neural network function approximators – a sub-field known as “deep RL” –
has shown substantial progress towards this goal. For instance, deep RL has yielded
agents that can play Go at superhuman levels, improve the efficiency of microchip
designs, and learn complex novel strategies for controlling nuclear fusion reactions.
A key issue that stands in the way of deploying deep RL is poor sample efficiency. Concretely, while it is possible to train effective agents using deep
RL, the key successes have largely been in environments where we have access to
large amounts of online interaction, often through the use of simulators. However,
in many real-world problems, we are confronted with scenarios where samples
are expensive to obtain. As has been alluded to, one way to alleviate this issue
is to leverage prior data, often termed “offline data”, which can accelerate
learning: for example, exploratory data can be used to avoid redundant
deployments, and human-expert data can quickly guide agents towards promising
behaviors. However, the best way to
incorporate this data into existing deep RL algorithms is not straightforward;
naïvely pre-training with RL algorithms on this offline data, a paradigm called
“offline RL”, as a starting point for subsequent learning is often detrimental.
Moreover, it is unclear how to explicitly derive useful behaviors online that are
positively influenced by this offline pre-training.
With these factors in mind, this thesis follows a three-pronged strategy towards
improving sample efficiency in deep RL. First, we investigate effective pre-training
on offline data. Then, we tackle the online problem, looking at efficient adaptation
to environments when operating purely online. Finally, we conclude with hybrid
strategies that use offline data to explicitly augment policies when acting online.
Learning General World Models in a Handful of Reward-Free Deployments
Building generally capable agents is a grand challenge for deep reinforcement learning (RL). To approach this challenge practically, we outline two key desiderata: 1) to facilitate generalization, exploration should be task agnostic; 2) to facilitate scalability, exploration policies should collect large quantities of data without costly centralized retraining. Combining these two properties, we introduce the reward-free deployment efficiency setting, a new paradigm for RL research. We then present CASCADE, a novel approach for self-supervised exploration in this new setting. CASCADE seeks to learn a world model by collecting data with a population of agents, using an information theoretic objective inspired by Bayesian Active Learning. CASCADE achieves this by specifically maximizing the diversity of trajectories sampled by the population through a novel cascading objective. We provide theoretical intuition for CASCADE, which we show in a tabular setting improves upon naïve approaches that do not account for population diversity. We then demonstrate that CASCADE collects diverse task-agnostic datasets and learns agents that generalize zero-shot to novel, unseen downstream tasks on Atari, MiniGrid, Crafter and the DM Control Suite. Code and videos are available at https://ycxuyingchen.github.io/cascade/
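The abstract describes the cascading objective only at a high level: each agent added to the population is chosen to make the population's collected trajectories jointly diverse. As a purely illustrative sketch (not the paper's actual algorithm), a greedy, distance-based version of such a cascade might look like the following; the function name `cascading_selection`, the per-agent trajectory feature vectors, and the minimum-distance novelty score are all assumptions introduced here for illustration:

```python
import numpy as np

def cascading_selection(candidate_feats, k):
    """Greedily select k exploration agents with jointly diverse behavior.

    candidate_feats: (n, d) array with one feature vector per candidate
    agent (e.g. an embedding of the trajectories it samples) -- a
    hypothetical stand-in for the information-theoretic quantities in
    the paper.
    """
    n = candidate_feats.shape[0]
    # Seed the cascade with the candidate whose behavior is most extreme.
    selected = [int(np.argmax(np.linalg.norm(candidate_feats, axis=1)))]
    while len(selected) < k:
        chosen = candidate_feats[selected]  # (len(selected), d)
        # Novelty of each candidate w.r.t. the cascade so far: minimum
        # distance to any already-selected agent's features.
        dists = np.linalg.norm(
            candidate_feats[:, None, :] - chosen[None, :, :], axis=-1
        ).min(axis=1)
        dists[selected] = -np.inf  # never re-pick a selected agent
        selected.append(int(np.argmax(dists)))
    return selected
```

Each pick conditions on the agents already in the population, which is what makes the objective "cascading": a candidate whose trajectories duplicate an earlier agent's scores low, regardless of how exploratory it is in isolation.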