Efficient Intrinsically Motivated Robotic Grasping with Learning-Adaptive Imagination in Latent Space
Combining model-based and model-free deep reinforcement learning has shown
great promise for improving sample efficiency on complex control tasks while
still retaining high performance. Incorporating imagination is a recent effort
in this direction inspired by human mental simulation of motor behavior. We
propose a learning-adaptive imagination approach which, unlike previous
approaches, takes into account the reliability of the learned dynamics model
used for imagining the future. Our approach learns an ensemble of disjoint
local dynamics models in latent space and derives an intrinsic reward based on
learning progress, motivating the controller to take actions leading to data
that improves the models. The learned models are used to generate imagined
experiences, augmenting the training set of real experiences. We evaluate our
approach on learning vision-based robotic grasping and show that it
significantly improves sample efficiency and achieves near-optimal performance
in a sparse reward environment.

Comment: In: Proceedings of the Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-EpiRob), Oslo, Norway, Aug. 19-22, 201
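The intrinsic reward described above can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows the general idea of a learning-progress reward: the agent is rewarded in proportion to how much the ensemble of dynamics models improves after training on newly collected data. The function name and loss values are illustrative assumptions.

```python
import numpy as np

def learning_progress_reward(losses_before, losses_after):
    """Intrinsic reward as the ensemble-average decrease in prediction loss.

    A positive value means the local dynamics models improved on recently
    gathered data, i.e. the agent found informative experience. The reward
    is clipped at zero so the agent is not punished when models plateau.
    """
    improvement = np.mean(losses_before) - np.mean(losses_after)
    return max(improvement, 0.0)

# Toy example: three local models whose losses drop after a model update.
r_int = learning_progress_reward([0.9, 1.1, 1.0], [0.6, 0.8, 0.7])
```

In this sketch a larger drop in ensemble loss yields a larger intrinsic reward, steering the controller toward states where the learned models are still unreliable.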
Effective offline training and efficient online adaptation
Developing agents that behave intelligently in the world is an open challenge in
machine learning. Desiderata for such agents are efficient exploration, maximizing
long term utility, and the ability to effectively leverage prior data to solve new
tasks. Reinforcement learning (RL) is an approach that is predicated on learning
by directly interacting with an environment through trial-and-error, and presents
a way for us to train and deploy such agents. Moreover, combining RL with
powerful neural network function approximators – a sub-field known as “deep RL” –
has shown evidence towards achieving this goal. For instance, deep RL has yielded
agents that can play Go at superhuman levels, improve the efficiency of microchip
designs, and learn complex novel strategies for controlling nuclear fusion reactions.
A key issue that stands in the way of deploying deep RL is poor sample efficiency. Concretely, while it is possible to train effective agents using deep
RL, the key successes have largely been in environments where we have access to
large amounts of online interaction, often through the use of simulators. However,
in many real-world problems, we are confronted with scenarios where samples
are expensive to obtain. As alluded to above, one way to alleviate this issue is to access prior data, often termed "offline data", which can accelerate learning: for example, leveraging exploratory data to prevent redundant deployments, or using human-expert data to quickly guide agents towards promising behaviors. However, the best way to
incorporate this data into existing deep RL algorithms is not straightforward;
naïvely pre-training with RL algorithms on this offline data (a paradigm called "offline RL") as a starting point for subsequent learning is often detrimental.
Moreover, it is unclear how to explicitly derive useful behaviors online that are
positively influenced by this offline pre-training.
With these factors in mind, this thesis follows a three-pronged strategy towards improving sample efficiency in deep RL. First, we investigate effective pre-training
on offline data. Then, we tackle the online problem, looking at efficient adaptation
to environments when operating purely online. Finally, we conclude with hybrid
strategies that use offline data to explicitly augment policies when acting online.
Towards efficient and robust reinforcement learning via synthetic environments and offline data
Over the past decade, Deep Reinforcement Learning (RL) has driven many advances in sequential decision-making, including remarkable applications in superhuman Go-playing, robotic control, and automated algorithm discovery. However, despite these successes, deep RL is also notoriously sample-inefficient, usually generalizes poorly to settings beyond the original environment, and can be unstable during training. Moreover, the conventional RL setting still relies on exploring and learning tabula-rasa in new environments and does not make use of pre-existing data. This thesis investigates two promising directions to address these challenges. First, we explore the use of synthetic data and environments in order to broaden an agent's experience. Second, we propose principled techniques to leverage pre-existing datasets, thereby reducing or replacing the need for costly online data collection.
The first part of the thesis focuses on the generation of synthetic data and environments to train RL agents. While there is a rich history in model-based RL of leveraging a learned dynamics model to improve sample efficiency, these methods are usually restricted to single-task settings. To overcome this limitation, we propose Augmented World Models, a novel approach designed for offline-to-online transfer where the test dynamics may differ from the training data. Our method augments a learned dynamics model with simple transformations that seek to capture potential changes in the physical properties of a robot, leading to more robust policies. Additionally, we train the agent with the sampled augmentation as context for test-time inference, significantly improving zero-shot generalization to novel dynamics. Going beyond commonly used forward dynamics models, we propose an alternative paradigm, Synthetic Experience Replay, which uses generative modeling to directly model and upsample the agent's training data distribution. Leveraging recent advances in diffusion generative models, our approach outperforms and is composable with standard data augmentation, and is particularly effective in low-data regimes. Furthermore, our method opens the door for certain RL agents to train stably with much larger networks than before.
In the second part of the thesis, we explore a complementary direction to data efficiency where we can leverage pre-existing data. While adjacent fields of machine learning, such as computer vision and natural language processing, have made significant progress in scaling data and model size, traditional RL algorithms can find it difficult to incorporate additional data due to the need for on-policy data. We begin by investigating a principled method for incorporating expert demonstrations to accelerate online RL, KL-regularization to a behavioral prior, and identify a pathology stemming from the behavioral prior having uncalibrated uncertainties. We show that standard parameterizations of the behavioral reference policy can lead to unstable training dynamics, and propose a solution, Non-Parametric Prior Actor–Critic, that represents the new state-of-the-art in locomotion and dexterous manipulation tasks. Furthermore, we make advances in offline reinforcement learning, with which agents can be trained without any online data collection at all. In this domain, we elucidate the design space of offline model-based RL algorithms and highlight where prior methods use suboptimal heuristics and choices for hyperparameters. By rigorously searching through this space, we show that we can vastly improve standard algorithms and provide insights into which design choices are most important. Finally, we make progress towards extending offline RL to pixel-based environments by presenting Vision Datasets for Deep Data-Driven RL, the first comprehensive and publicly available evaluation suite for this field, alongside simple model-based and model-free baselines for assessing future progress in this domain.
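The pathology described above, where an uncalibrated behavioral prior destabilizes KL-regularized RL, can be made concrete with a small sketch. This is not the thesis's algorithm; it only shows, for a single state with discrete actions, how the standard KL-regularized objective collapses when the prior puts near-zero mass on an action. All names and numbers here are illustrative assumptions.

```python
import numpy as np

def kl_regularized_objective(q_values, pi, prior, alpha):
    """Per-state objective: expected Q under pi, minus alpha * KL(pi || prior).

    If the prior is overconfident (near-zero probability on some action),
    the KL term explodes and dominates the objective, illustrating the
    instability discussed above.
    """
    kl = np.sum(pi * np.log(pi / prior))
    return np.sum(pi * q_values) - alpha * kl

q = np.array([1.0, 0.5])
pi = np.array([0.5, 0.5])                 # current policy
obj_calibrated = kl_regularized_objective(q, pi, np.array([0.5, 0.5]), alpha=1.0)
obj_overconfident = kl_regularized_objective(q, pi, np.array([0.999, 0.001]), alpha=1.0)
```

With a well-calibrated prior the KL penalty vanishes and the objective is just the expected Q-value; with an overconfident prior the same policy is heavily penalized, even though the prior's mode may be reasonable.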
In conclusion, this thesis represents explorations toward making RL algorithms more efficient and readily deployable in the real world. Further progress along these directions may bring us closer to the ultimate goal of more generally capable agents, that are able to both generate appropriate learning environments for themselves and bootstrap learning from vast quantities of pre-existing data.
Safe DreamerV3: Safe Reinforcement Learning with World Models
The widespread application of Reinforcement Learning (RL) in real-world situations has yet to come to fruition, largely because it fails to satisfy the essential safety demands of such systems. Existing safe
reinforcement learning (SafeRL) methods, employing cost functions to enhance
safety, fail to achieve zero-cost in complex scenarios, including vision-only
tasks, even with comprehensive data sampling and training. To address this, we
introduce Safe DreamerV3, a novel algorithm that integrates both
Lagrangian-based and planning-based methods within a world model. Our
methodology represents a significant advancement in SafeRL as the first
algorithm to achieve nearly zero-cost in both low-dimensional and vision-only
tasks within the Safety-Gymnasium benchmark. Our project website can be found at: https://sites.google.com/view/safedreamerv3
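The Lagrangian-based component mentioned in this abstract typically relies on a dual-ascent update of a Lagrange multiplier that trades reward against a cost constraint. The sketch below is a generic illustration of that standard mechanism, not Safe DreamerV3's actual update rule; the function name, learning rate, and cost numbers are assumptions.

```python
def lagrangian_update(lmbda, episode_cost, cost_limit, lr=0.01):
    """One dual-ascent step on the Lagrange multiplier.

    The multiplier grows while realized cost exceeds the limit (pushing
    the policy toward safety) and shrinks, never below zero, once the
    constraint is satisfied.
    """
    return max(0.0, lmbda + lr * (episode_cost - cost_limit))

lam = 1.0
lam = lagrangian_update(lam, episode_cost=30.0, cost_limit=25.0)  # unsafe: lambda rises
lam = lagrangian_update(lam, episode_cost=20.0, cost_limit=25.0)  # safe: lambda falls back
```

The policy is then trained on the penalized return r - lam * c, so as lam grows, cost avoidance increasingly dominates reward seeking.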
ENTL: Embodied Navigation Trajectory Learner
We propose Embodied Navigation Trajectory Learner (ENTL), a method for
extracting long sequence representations for embodied navigation. Our approach
unifies world modeling, localization and imitation learning into a single
sequence prediction task. We train our model using vector-quantized predictions
of future states conditioned on current states and actions. ENTL's generic
architecture enables sharing of the spatio-temporal sequence encoder
for multiple challenging embodied tasks. We achieve competitive performance on
navigation tasks using significantly less data than strong baselines while
performing auxiliary tasks such as localization and future frame prediction (a
proxy for world modeling). A key property of our approach is that the model is
pre-trained without any explicit reward signal, which makes the resulting model
generalizable to multiple tasks and environments.
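The vector-quantized prediction mentioned above rests on a standard building block: snapping a continuous latent vector to its nearest entry in a learned codebook, so the sequence model can predict discrete tokens instead of raw states. The sketch below shows only this generic quantization step, not ENTL's architecture; the codebook and latent values are illustrative assumptions.

```python
import numpy as np

def quantize(latent, codebook):
    """Map a continuous latent vector to its nearest codebook entry.

    Returns the discrete token index and the quantized vector; a sequence
    model can then be trained to predict these token indices for future
    states, as a form of world modeling.
    """
    dists = np.linalg.norm(codebook - latent, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# Toy codebook with three 2-D codes; the latent is closest to code 1.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
idx, code = quantize(np.array([0.9, 1.1]), codebook)
```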