Reward-Conditioned Policies
Reinforcement learning offers the promise of automating the acquisition of
complex behavioral skills. However, compared to commonly used and
well-understood supervised learning methods, reinforcement learning algorithms
can be brittle, difficult to use and tune, and sensitive to seemingly innocuous
implementation decisions. In contrast, imitation learning utilizes standard and
well-understood supervised learning methods, but requires near-optimal expert
data. Can we learn effective policies via supervised learning without
demonstrations? The main idea that we explore in this work is that non-expert
trajectories collected from sub-optimal policies can be viewed as optimal
supervision, not for maximizing the reward, but for matching the reward of the
given trajectory. By then conditioning the policy on the numerical value of the
reward, we can obtain a policy that generalizes to larger returns. We show how
such an approach can be derived as a principled method for policy search,
discuss several variants, and compare the method experimentally to a variety of
current reinforcement learning methods on standard benchmarks.
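As a concrete illustration of the conditioning idea, the minimal sketch below trains a policy by plain supervised regression on (state, action, return) triples drawn from sub-optimal trajectories; the PyTorch framing, network shape, and MSE loss are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a reward-conditioned policy, assuming a buffer of
# (state, action, return-to-go) triples from arbitrary sub-optimal
# policies. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn

class RewardConditionedPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        # The policy sees the state concatenated with a scalar target
        # return and predicts an action.
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, target_return):
        return self.net(torch.cat([state, target_return], dim=-1))

def supervised_update(policy, optimizer, states, actions, returns):
    # Each trajectory is treated as optimal supervision for *its own*
    # return: regress the taken action given (state, achieved return).
    pred = policy(states, returns.unsqueeze(-1))
    loss = ((pred - actions) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time one would condition on a return larger than those seen in the buffer, relying on the generalization the abstract describes.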
From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following
Reinforcement learning is a promising framework for solving control problems,
but its use in practical situations is hampered by the fact that reward
functions are often difficult to engineer. Specifying goals and tasks for
autonomous machines, such as robots, is a significant challenge:
conventionally, reward functions and goal states have been used to communicate
objectives. But people can communicate objectives to each other simply by
describing or demonstrating them. How can we build learning algorithms that
will allow us to tell machines what we want them to do? In this work, we
investigate the problem of grounding language commands as reward functions
using inverse reinforcement learning, and argue that language-conditioned
rewards are more transferable than language-conditioned policies to new
environments. We propose language-conditioned reward learning (LC-RL), which
grounds language commands as a reward function represented by a deep neural
network. We demonstrate that our model learns rewards that transfer to novel
tasks and environments on realistic, high-dimensional visual environments with
natural language commands, whereas directly learning a language-conditioned
policy leads to poor performance.
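The sketch below shows the shape of a language-conditioned reward in the spirit of LC-RL: a deep network over (observation, command) pairs whose scalar output is trained with inverse RL. The GRU encoder and dimensions are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative language-conditioned reward network: the reward is a
# deep function of the observation and the encoded command, so it can
# be re-optimized in new environments. Encoder choices are assumptions.
import torch
import torch.nn as nn

class LanguageConditionedReward(nn.Module):
    def __init__(self, vocab_size, obs_dim, embed=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.lang_enc = nn.GRU(embed, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, command_tokens):
        # Encode the command with a GRU and use its final hidden state.
        _, h = self.lang_enc(self.embed(command_tokens))
        return self.head(torch.cat([obs, h[-1]], dim=-1)).squeeze(-1)
```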
FollowNet: Robot Navigation by Following Natural Language Directions with Deep Reinforcement Learning
Understanding and following directions provided by humans can enable robots
to navigate effectively in unknown situations. We present FollowNet, an
end-to-end differentiable neural architecture for learning multi-modal
navigation policies. FollowNet maps natural language instructions as well as
visual and depth inputs to locomotion primitives. FollowNet processes
instructions using an attention mechanism conditioned on its visual and depth
input to focus on the relevant parts of the command while performing the
navigation task. Deep reinforcement learning (RL) with a sparse reward simultaneously
learns the state representation, the attention function, and the control
policies. We evaluate our agent on a dataset of complex natural language
directions that guide the agent through a rich and realistic dataset of
simulated homes. We show that the FollowNet agent learns to execute previously
unseen instructions described with a similar vocabulary, and successfully
navigates along paths not encountered during training. The agent shows a 30%
improvement over a baseline model without the attention mechanism, with a 52%
success rate on novel instructions.
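A rough sketch of the attention idea from this abstract: instruction token features are attended over, with scores conditioned on the current visual/depth encoding so the agent focuses on the relevant part of the command. Dimensions and names are illustrative assumptions.

```python
# Attention over instruction words, conditioned on the image encoding.
# A sketch of the mechanism described above, not FollowNet's exact net.
import torch
import torch.nn as nn

class InstructionAttention(nn.Module):
    def __init__(self, word_dim, image_dim, hidden=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(word_dim + image_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, word_feats, image_feat):
        # word_feats: (batch, seq, word_dim); image_feat: (batch, image_dim)
        seq = word_feats.shape[1]
        img = image_feat.unsqueeze(1).expand(-1, seq, -1)
        logits = self.score(torch.cat([word_feats, img], dim=-1)).squeeze(-1)
        weights = torch.softmax(logits, dim=-1)  # focus on relevant words
        return (weights.unsqueeze(-1) * word_feats).sum(dim=1)
```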
Learning Actionable Representations with Goal-Conditioned Policies
Representation learning is a central challenge across a range of machine
learning areas. In reinforcement learning, effective and functional
representations have the potential to tremendously accelerate learning progress
and solve more challenging problems. Most prior work on representation learning
has focused on generative approaches, learning representations that capture all
underlying factors of variation in the observation space in a more disentangled
or well-ordered manner. In this paper, we instead aim to learn functionally
salient representations: representations that are not necessarily complete in
terms of capturing all factors of variation in the observation space, but
rather aim to capture those factors of variation that are important for
decision making -- that are "actionable." These representations are aware of
the dynamics of the environment, and capture only the elements of the
observation that are necessary for decision making rather than all factors of
variation, without explicit reconstruction of the observation. We show how
these representations can be useful to improve exploration for sparse reward
problems, to enable long horizon hierarchical reinforcement learning, and as a
state representation for learning policies for downstream tasks. We evaluate
our method on a number of simulated environments, and compare it to prior
methods for representation learning, exploration, and hierarchical
reinforcement learning.
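One way to read the "actionable" objective is sketched below: two goal states should be close in representation space exactly when a goal-conditioned policy acts similarly to reach them. The specific regression loss is an illustrative simplification, not the paper's exact objective.

```python
# Sketch: tie latent distance to how differently a goal-conditioned
# policy behaves toward two goals, so the representation keeps only
# decision-relevant factors. Signatures are assumptions.
import torch

def actionable_distance(policy, states, goal_a, goal_b):
    # Mean squared difference between the actions the goal-conditioned
    # policy takes toward the two goals, averaged over probe states.
    act_a = policy(states, goal_a)
    act_b = policy(states, goal_b)
    return ((act_a - act_b) ** 2).mean(dim=-1)

def representation_loss(encoder, policy, states, goal_a, goal_b):
    # Regress latent distance onto the policy's action discrepancy;
    # no reconstruction of the observation is involved.
    z_a, z_b = encoder(goal_a), encoder(goal_b)
    latent_dist = ((z_a - z_b) ** 2).sum(dim=-1)
    target = actionable_distance(policy, states, goal_a, goal_b).detach()
    return ((latent_dist - target) ** 2).mean()
```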
Online Robust Policy Learning in the Presence of Unknown Adversaries
The growing prospect of deep reinforcement learning (DRL) being used in
cyber-physical systems has raised concerns around safety and robustness of
autonomous agents. Recent work on generating adversarial attacks has shown
that it is computationally feasible for a bad actor to fool a DRL policy into
behaving sub-optimally. Although certain adversarial attacks with specific
attack models have been addressed, most studies are only interested in off-line
optimization in the data space (e.g., example fitting, distillation). This
paper introduces a Meta-Learned Advantage Hierarchy (MLAH) framework that is
attack model-agnostic and more suited to reinforcement learning, via handling
the attacks in the decision space (as opposed to data space) and directly
mitigating learned bias introduced by the adversary. In MLAH, we learn separate
sub-policies (nominal and adversarial) in an online manner, as guided by a
supervisory master agent that detects the presence of the adversary by
leveraging the advantage function for the sub-policies. We demonstrate that the
proposed algorithm enables policy learning with significantly lower bias as
compared to the state-of-the-art policy learning approaches even in the
presence of heavy state information attacks. We present algorithm analysis and
simulation results using popular OpenAI Gym environments.
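A pseudocode-style sketch of the decision flow described above: a master agent monitors the advantage of the sub-policies and routes control to the nominal or adversarial sub-policy accordingly. The simple threshold stands in for the learned master policy and is an illustrative assumption.

```python
# MLAH-style routing sketch: the master selects a sub-policy based on
# the observed advantage signal. Thresholding is a simplification of
# the learned supervisory agent described in the abstract.
def select_subpolicy(nominal_advantage, threshold=0.0):
    # A persistently negative advantage under the nominal sub-policy
    # suggests the observed states are corrupted by an adversary.
    return "adversarial" if nominal_advantage < threshold else "nominal"

def act(obs, subpolicies, nominal_advantage):
    # Route the observation to whichever sub-policy the master selects,
    # so bias from attacked states is absorbed by the adversarial one.
    return subpolicies[select_subpolicy(nominal_advantage)](obs)
```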
Visual Reinforcement Learning with Imagined Goals
For an autonomous agent to fulfill a wide range of user-specified goals at
test time, it must be able to learn broadly applicable and general-purpose
skill repertoires. Furthermore, to provide the requisite level of generality,
these skills must handle raw sensory input such as images. In this paper, we
propose an algorithm that acquires such general-purpose skills by combining
unsupervised representation learning and reinforcement learning of
goal-conditioned policies. Since the particular goals that might be required at
test-time are not known in advance, the agent performs a self-supervised
"practice" phase where it imagines goals and attempts to achieve them. We learn
a visual representation with three distinct purposes: sampling goals for
self-supervised practice, providing a structured transformation of raw sensory
inputs, and computing a reward signal for goal reaching. We also propose a
retroactive goal relabeling scheme to further improve the sample-efficiency of
our method. Our off-policy algorithm is efficient enough to learn policies that
operate on raw image observations and goals for a real-world robotic system,
and substantially outperforms prior techniques.
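The sketch below shows the latent goal machinery the abstract describes, assuming a pre-trained VAE encoder that maps images to latent vectors: goals for self-supervised practice are sampled from the prior, the reward is negative latent distance, and relabeling reuses achieved states as goals. Names and shapes are illustrative.

```python
# Latent goal sampling, reward computation, and retroactive relabeling,
# assuming `encode` is a trained VAE encoder. A sketch, not the exact
# released implementation.
import torch

def latent_reward(encode, obs_image, goal_latent):
    # Reward for goal reaching: negative Euclidean distance between the
    # encoded observation and the imagined goal latent.
    z = encode(obs_image)
    return -torch.norm(z - goal_latent, dim=-1)

def imagine_goal(latent_dim, batch=1):
    # Self-supervised practice: sample a goal from the VAE prior N(0, I).
    return torch.randn(batch, latent_dim)

def relabel(s, a, s_next, achieved_latent):
    # Retroactive goal relabeling: treat a latent state actually reached
    # later in the trajectory as the goal and recompute the reward, so
    # every trajectory yields useful off-policy data.
    new_reward = -torch.norm(s_next - achieved_latent, dim=-1)
    return (s, a, s_next, achieved_latent, new_reward)
```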
Generalization through Simulation: Integrating Simulated and Real Data into Deep Reinforcement Learning for Vision-Based Autonomous Flight
Deep reinforcement learning provides a promising approach for vision-based
control of real-world robots. However, the generalization of such models
depends critically on the quantity and variety of data available for training.
This data can be difficult to obtain for some types of robotic systems, such as
fragile, small-scale quadrotors. Simulated rendering and physics can provide
for much larger datasets, but such data is inherently of lower quality: many of
the phenomena that make the real-world autonomous flight problem challenging,
such as complex physics and air currents, are modeled poorly or not at all, and
the systematic differences between simulation and the real world are typically
impossible to eliminate. In this work, we investigate how data from both
simulation and the real world can be combined in a hybrid deep reinforcement
learning algorithm. Our method uses real-world data to learn about the dynamics
of the system, and simulated data to learn a generalizable perception system
that can enable the robot to avoid collisions using only a monocular camera. We
demonstrate our approach on a real-world nano aerial vehicle collision
avoidance task, showing that with only an hour of real-world data, the
quadrotor can avoid collisions in new environments with various lighting
conditions and geometry. Code, instructions for building the aerial vehicles,
and videos of the experiments can be found at github.com/gkahn13/GtS.
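At a high level, the data split described above can be sketched as two training loops: abundant simulated images train a generalizable perception head, while scarce real-world transitions fit the dynamics. Function and attribute names are placeholders, not the released code's API.

```python
# Hybrid training sketch under the split described in the abstract:
# simulation -> perception, real data -> dynamics. All objects here
# are assumed placeholders with simple .loss(...) interfaces.
def train_hybrid(sim_batches, real_batches, perception, dynamics, opt_p, opt_d):
    for sim in sim_batches:
        # Simulation supplies varied images with collision labels:
        # learn the monocular perception (collision prediction) here.
        loss_p = perception.loss(sim.images, sim.collision_labels)
        opt_p.zero_grad(); loss_p.backward(); opt_p.step()
    for real in real_batches:
        # A small amount of real data corrects for the dynamics
        # mismatch between the simulator and the physical quadrotor.
        loss_d = dynamics.loss(real.states, real.actions, real.next_states)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```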
Temporal Difference Models: Model-Free Deep RL for Model-Based Control
Model-free reinforcement learning (RL) is a powerful, general tool for
learning complex behaviors. However, its sample efficiency is often
impractically large for solving challenging real-world problems, even with
off-policy algorithms such as Q-learning. A limiting factor in classic
model-free RL is that the learning signal consists only of scalar rewards,
ignoring much of the rich information contained in state transition tuples.
Model-based RL uses this information, by training a predictive model, but often
does not achieve the same asymptotic performance as model-free RL due to model
bias. We introduce temporal difference models (TDMs), a family of
goal-conditioned value functions that can be trained with model-free learning
and used for model-based control. TDMs combine the benefits of model-free and
model-based RL: they leverage the rich information in state transitions to
learn very efficiently, while still attaining asymptotic performance that
exceeds that of direct model-based RL methods. Our experimental results show
that, on a range of continuous control tasks, TDMs provide a substantial
improvement in efficiency compared to state-of-the-art model-based and
model-free methods.
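The core TDM recursion can be sketched as follows: a Q-function conditioned on both a goal and a remaining horizon tau, whose target is the negative distance to the goal when tau reaches zero and a bootstrapped value with the horizon decremented otherwise. The distance metric and signatures are illustrative assumptions.

```python
# Sketch of a TDM-style bootstrapped target: goal- and horizon-
# conditioned, trained model-free. Details are illustrative.
import torch

def tdm_target(q_net, policy, s_next, goal, tau):
    # tau: remaining horizon (integer tensor); goal: desired state.
    terminal = -torch.norm(s_next - goal, dim=-1)     # tau == 0 case
    a_next = policy(s_next, goal, tau - 1)
    bootstrap = q_net(s_next, a_next, goal, tau - 1)  # tau > 0 case
    # Both branches are evaluated here; a real implementation would
    # mask the tau == 0 entries before calling the networks.
    return torch.where(tau == 0, terminal, bootstrap)
```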
Unsupervised Control Through Non-Parametric Discriminative Rewards
Learning to control an environment without hand-crafted rewards or expert
data remains challenging and is at the frontier of reinforcement learning
research. We present an unsupervised learning algorithm to train agents to
achieve perceptually-specified goals using only a stream of observations and
actions. Our agent simultaneously learns a goal-conditioned policy and a goal
achievement reward function that measures how similar a state is to the goal
state. This dual optimization leads to a co-operative game, giving rise to a
learned reward function that reflects similarity in controllable aspects of the
environment instead of distance in the space of observations. We demonstrate
the efficacy of our agent to learn, in an unsupervised manner, to reach a
diverse set of goals on three domains -- Atari, the DeepMind Control Suite and
DeepMind Lab.
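The co-operative game described above can be sketched as an embedding trained discriminatively: the achieved state should be more similar to the true goal than to decoy observations, and that learned similarity doubles as the policy's reward. The cosine similarity and loss below are illustrative choices, not necessarily the paper's exact ones.

```python
# Sketch of a learned, non-parametric-style goal achievement reward:
# an embedding scores state/goal similarity and is trained to pick the
# true goal out from decoys. Names are illustrative.
import torch
import torch.nn.functional as F

def achievement_reward(embed, final_obs, goal_obs):
    # Cosine similarity between embeddings of the reached state and the
    # goal serves as the learned, non-hand-crafted reward.
    return F.cosine_similarity(embed(final_obs), embed(goal_obs), dim=-1)

def discriminative_loss(embed, final_obs, goal_obs, decoy_obs):
    # Train the embedding so the achieved state is more similar to the
    # true goal than to a decoy drawn from other trajectories.
    pos = achievement_reward(embed, final_obs, goal_obs)
    neg = achievement_reward(embed, final_obs, decoy_obs)
    logits = torch.stack([pos, neg], dim=-1)
    labels = torch.zeros(logits.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)
```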
LEAF: Latent Exploration Along the Frontier
Self-supervised goal proposal and reaching is a key component for exploration
and efficient policy learning algorithms. Such a self-supervised approach
without access to any oracle goal sampling distribution requires deep
exploration and commitment so that long horizon plans can be efficiently
discovered. In this paper, we propose an exploration framework, which learns a
dynamics-aware manifold of reachable states. For a goal, our proposed method
deterministically visits a state at the current frontier of reachable states
(commitment/reaching) and then stochastically explores to reach the goal
(exploration). This allocates exploration budget near the frontier of the
reachable region instead of its interior. We target the challenging problem of
policy learning from initial and goal states specified as images, and do not
assume any access to the underlying ground-truth states of the robot and the
environment. To keep track of reachable latent states, we propose a
distance-conditioned reachability network that is trained to infer whether one
state is reachable from another within the specified latent space distance.
Given an initial state, we obtain a frontier of reachable states from that
state. By incorporating a curriculum for sampling easier goals (closer to the
start state) before more difficult goals, we demonstrate that the proposed
self-supervised exploration algorithm can achieve superior performance
on average compared to existing baselines on a set of challenging robotic
environments, including on a real robot manipulation task.
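The distance-conditioned reachability network described above can be sketched as a classifier over latent state pairs plus a distance budget; frontier states are then those whose reachability sits near the decision boundary. The architecture and the boundary band are illustrative assumptions.

```python
# Sketch of a distance-conditioned reachability classifier and a simple
# frontier test over its predictions. Details are assumptions.
import torch
import torch.nn as nn

class ReachabilityNet(nn.Module):
    def __init__(self, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_from, z_to, distance):
        # Probability that z_to is reachable from z_from within the
        # specified latent-space distance budget.
        x = torch.cat([z_from, z_to, distance], dim=-1)
        return torch.sigmoid(self.net(x)).squeeze(-1)

def frontier_mask(probs, band=(0.4, 0.6)):
    # Frontier states: reachability probability near the decision
    # boundary, neither clearly reachable nor clearly unreachable.
    return (probs > band[0]) & (probs < band[1])
```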