Goal-conditioned Imitation Learning
Designing rewards for Reinforcement Learning (RL) is challenging because the
reward must convey the desired task, be efficient to optimize, and be easy to
compute. The latter is particularly problematic when applying RL to robotics,
where detecting whether the desired configuration is reached might require
considerable supervision and instrumentation. Furthermore, we are often
interested in being able to reach a wide range of configurations, so setting
up a different reward every time can be impractical. Methods like Hindsight
Experience Replay (HER) have recently shown promise for learning policies able
to reach many goals without the need for a reward function. Unfortunately, without tricks
like resetting to points along the trajectory, HER might require many samples
to discover how to reach certain areas of the state space. In this work we
investigate different approaches to incorporating demonstrations to drastically
speed up convergence to a policy able to reach any goal, while also surpassing
the performance of an agent trained with other Imitation Learning algorithms.
Furthermore, we show that our method can also be used when the available expert
trajectories do not contain actions, which makes it possible to leverage
kinesthetic or third-person demonstrations. The code is available at
https://sites.google.com/view/goalconditioned-il/.
Comment: Published at NeurIPS 2019
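To make the relabeling idea concrete, here is a minimal sketch of hindsight goal relabeling applied to a demonstration trajectory. The function name, tuple layout, and sampling scheme are illustrative assumptions, not the authors' released code:

```python
# Hedged sketch: hindsight relabeling of a demonstration, HER-style
# (Andrychowicz et al., 2017). The layout of `trajectory` is an assumption.
import random

def relabel_demo(trajectory, num_relabels=4):
    """Turn one demonstration into several goal-conditioned examples.

    trajectory: list of (state, action) pairs.
    Returns (state, goal, action) triples where each goal is a state actually
    reached later in the same trajectory, so every example is a success.
    """
    examples = []
    for t, (state, action) in enumerate(trajectory[:-1]):
        # Sample "future" goals from states visited after step t.
        future = random.choices(range(t + 1, len(trajectory)), k=num_relabels)
        for idx in future:
            achieved_goal = trajectory[idx][0]
            examples.append((state, achieved_goal, action))
    return examples
```

Feeding the relabeled triples to a goal-conditioned behavioral-cloning or HER-style objective is then straightforward.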
Temporal Difference Models: Model-Free Deep RL for Model-Based Control
Model-free reinforcement learning (RL) is a powerful, general tool for
learning complex behaviors. However, its sample efficiency is often
impractically large for solving challenging real-world problems, even with
off-policy algorithms such as Q-learning. A limiting factor in classic
model-free RL is that the learning signal consists only of scalar rewards,
ignoring much of the rich information contained in state transition tuples.
Model-based RL uses this information by training a predictive model, but often
does not achieve the same asymptotic performance as model-free RL due to model
bias. We introduce temporal difference models (TDMs), a family of
goal-conditioned value functions that can be trained with model-free learning
and used for model-based control. TDMs combine the benefits of model-free and
model-based RL: they leverage the rich information in state transitions to
learn very efficiently, while still attaining asymptotic performance that
exceeds that of direct model-based RL methods. Our experimental results show
that, on a range of continuous control tasks, TDMs provide a substantial
improvement in efficiency compared to state-of-the-art model-based and
model-free methods.
Comment: Appeared in ICLR 2018; typos corrected
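The horizon-conditioned Bellman target the abstract alludes to can be sketched compactly. Below, `max_q` is an assumed callable standing in for the bootstrapped Q-value (e.g. Q at the policy's action in an actor-critic variant), and the Euclidean goal distance is an illustrative choice:

```python
# Hedged sketch of a TDM-style target for Q(s, a, g, tau); not the paper's code.
import torch

def tdm_target(max_q, s_next, goal, tau, gamma=1.0):
    """s_next: (B, d) next states; goal: (B, d) goals; tau: (B,) steps left."""
    with torch.no_grad():
        # tau == 0: supervise with the (negative) distance actually achieved,
        # a model-like learning signal that needs no external reward.
        terminal = -torch.norm(s_next - goal, dim=-1)
        # tau > 0: bootstrap from the Q-function one horizon step shorter.
        bootstrap = gamma * max_q(s_next, goal, tau - 1)
        done = (tau == 0).float()
        return done * terminal + (1.0 - done) * bootstrap
```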
Unsupervised Control Through Non-Parametric Discriminative Rewards
Learning to control an environment without hand-crafted rewards or expert
data remains challenging and is at the frontier of reinforcement learning
research. We present an unsupervised learning algorithm to train agents to
achieve perceptually-specified goals using only a stream of observations and
actions. Our agent simultaneously learns a goal-conditioned policy and a goal
achievement reward function that measures how similar a state is to the goal
state. This dual optimization leads to a co-operative game, giving rise to a
learned reward function that reflects similarity in controllable aspects of the
environment instead of distance in the space of observations. We demonstrate
the efficacy of our agent to learn, in an unsupervised manner, to reach a
diverse set of goals on three domains -- Atari, the DeepMind Control Suite and
DeepMind Lab.
Comment: 10 pages + references & 5-page appendix
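The learned goal-achievement reward can be sketched as a discriminative scoring rule: the achieved state's embedding must single out the true goal among decoys drawn from the replay buffer. The embedding network, decoy count, and temperature below are assumptions for illustration:

```python
# Hedged sketch of a non-parametric discriminative reward; not the paper's code.
import torch
import torch.nn.functional as F

def discriminative_reward(embed, achieved_obs, goal_obs, decoy_obs, beta=10.0):
    """embed: maps (N, ...) observations to (N, d) embeddings.
    decoy_obs: (B, K, ...) goals sampled from the buffer as negatives.
    Returns a reward in [0, 1]: probability the achieved state's embedding
    assigns to the true goal versus the K decoys."""
    e_a = F.normalize(embed(achieved_obs), dim=-1)               # (B, d)
    e_g = F.normalize(embed(goal_obs), dim=-1)                   # (B, d)
    B, K = decoy_obs.shape[:2]
    e_d = F.normalize(embed(decoy_obs.flatten(0, 1)), dim=-1).reshape(B, K, -1)
    pos = beta * (e_a * e_g).sum(-1, keepdim=True)               # (B, 1)
    neg = beta * torch.einsum('bd,bkd->bk', e_a, e_d)            # (B, K)
    return F.softmax(torch.cat([pos, neg], dim=-1), dim=-1)[:, 0]
```

Training the embedding with the same classification objective is what turns reward learning into the co-operative game the abstract describes.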
Zero-Shot Skill Composition and Simulation-to-Real Transfer by Learning Task Representations
Simulation-to-real transfer is an important strategy for making reinforcement
learning practical with real robots. Successful sim-to-real transfer systems
have difficulty producing policies which generalize across tasks, despite
training on the equivalent of thousands of hours of real robot time. To address this
shortcoming, we present a novel approach to efficiently learning new robotic
skills directly on a real robot, based on model-predictive control (MPC) and an
algorithm for learning task representations. In short, we show how to reuse the
simulation from the pre-training step of sim-to-real methods as a tool for
foresight, allowing the sim-to-real policy to adapt to unseen tasks. Rather than
end-to-end learning policies for single tasks and attempting to transfer them,
we first use simulation to simultaneously learn (1) a continuous
parameterization (i.e. a skill embedding or latent) of task-appropriate
primitive skills, and (2) a single policy for these skills which is conditioned
on this representation. We then transfer our multi-skill policy directly to a
real robot and control the robot by choosing sequences of skill latents that
condition the policy, with each latent corresponding to a pre-learned primitive
skill controller. We complete unseen tasks by choosing new sequences of skill
latents to control the robot using MPC, where our MPC model is composed of the
pre-trained skill policy executed in the simulation environment, run in
parallel with the real robot. We discuss the background and principles of our
method, detail its practical implementation, and evaluate its performance by
using our method to train a real Sawyer Robot to achieve motion tasks such as
drawing and block pushing.
Comment: Submitted to ICRA 2019. See https://youtu.be/te4JWe7LPKw for
supplemental video
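The planning loop reduces to random-shooting MPC over skill latents, with the pre-training simulator as the forward model. `simulate` and `task_cost` below are assumed interfaces, not the authors' API:

```python
# Hedged sketch of MPC over pre-learned skill latents; interfaces are assumed.
import numpy as np

def mpc_over_skills(simulate, task_cost, state, latent_dim,
                    horizon=4, num_candidates=128, rng=None):
    """simulate(state, latents) -> predicted state trajectory (sim rollout);
    task_cost(trajectory) -> scalar cost. Returns the first latent of the
    cheapest sampled sequence; execute it, observe, and re-plan."""
    rng = rng or np.random.default_rng()
    candidates = rng.normal(size=(num_candidates, horizon, latent_dim))
    costs = [task_cost(simulate(state, seq)) for seq in candidates]
    return candidates[int(np.argmin(costs))][0]
```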
Visual Reinforcement Learning with Imagined Goals
For an autonomous agent to fulfill a wide range of user-specified goals at
test time, it must be able to learn broadly applicable and general-purpose
skill repertoires. Furthermore, to provide the requisite level of generality,
these skills must handle raw sensory input such as images. In this paper, we
propose an algorithm that acquires such general-purpose skills by combining
unsupervised representation learning and reinforcement learning of
goal-conditioned policies. Since the particular goals that might be required at
test-time are not known in advance, the agent performs a self-supervised
"practice" phase where it imagines goals and attempts to achieve them. We learn
a visual representation with three distinct purposes: sampling goals for
self-supervised practice, providing a structured transformation of raw sensory
inputs, and computing a reward signal for goal reaching. We also propose a
retroactive goal relabeling scheme to further improve the sample-efficiency of
our method. Our off-policy algorithm is efficient enough to learn policies that
operate on raw image observations and goals for a real-world robotic system,
and substantially outperforms prior techniques.
Comment: 15 pages, NeurIPS 2018
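The self-supervised practice phase admits a compact sketch: goals are imagined by sampling the representation prior, and the reward is a distance in that latent space. `vae.encode` and the latent dimensionality are assumed interfaces, not the released implementation:

```python
# Hedged sketch of imagined goals plus a latent-distance reward; not RIG itself.
import torch

def imagined_goal_and_reward(vae, obs, latent_dim=16):
    """Sample a goal latent from the prior and score the current observation
    against it. Returns (goal_latent, reward)."""
    goal_z = torch.randn(latent_dim)             # "imagine" a goal: z ~ N(0, I)
    obs_z = vae.encode(obs)                      # assumed posterior-mean encoder
    reward = -torch.norm(obs_z - goal_z).item()  # reward = -latent distance
    return goal_z, reward
```

Combined with retroactive relabeling (replacing the imagined goal with the latent actually reached), this yields the off-policy training signal the abstract describes.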
Learning Actionable Representations with Goal-Conditioned Policies
Representation learning is a central challenge across a range of machine
learning areas. In reinforcement learning, effective and functional
representations have the potential to tremendously accelerate learning progress
and solve more challenging problems. Most prior work on representation learning
has focused on generative approaches, learning representations that capture all
underlying factors of variation in the observation space in a more disentangled
or well-ordered manner. In this paper, we instead aim to learn functionally
salient representations: representations that are not necessarily complete in
terms of capturing all factors of variation in the observation space, but
rather aim to capture those factors of variation that are important for
decision making -- that are "actionable." These representations are aware of
the dynamics of the environment, and capture only the elements of the
observation that are necessary for decision making rather than all factors of
variation, without explicit reconstruction of the observation. We show how
these representations can be useful to improve exploration for sparse reward
problems, to enable long horizon hierarchical reinforcement learning, and as a
state representation for learning policies for downstream tasks. We evaluate
our method on a number of simulated environments, and compare it to prior
methods for representation learning, exploration, and hierarchical
reinforcement learning.
Comment: To be presented at ICLR 2019
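One way to read the "actionable" criterion is that two goal states should be embedded nearby exactly when the policies for reaching them behave similarly. The sketch below regresses an embedding distance onto a symmetrized KL divergence between goal-conditioned action distributions; the discrete-action assumption and network interfaces are mine, not the paper's:

```python
# Hedged sketch of an actionable-distance objective; assumptions noted above.
import torch
import torch.nn.functional as F

def actionable_rep_loss(phi, policy_logits, states, goal_a, goal_b):
    """phi: (B, ds) -> (B, d) embedding. policy_logits(s, g): action logits of
    a pre-trained goal-conditioned policy, evaluated at probe states."""
    pa = F.log_softmax(policy_logits(states, goal_a), dim=-1)
    pb = F.log_softmax(policy_logits(states, goal_b), dim=-1)
    kl_ab = (pb.exp() * (pb - pa)).sum(-1)          # KL(p_b || p_a), per pair
    kl_ba = (pa.exp() * (pa - pb)).sum(-1)          # KL(p_a || p_b), per pair
    div = 0.5 * (kl_ab + kl_ba)                     # symmetrized divergence (B,)
    d = torch.norm(phi(goal_a) - phi(goal_b), dim=-1)
    return ((d - div) ** 2).mean()                  # match distance to divergence
```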
Universal Planning Networks
A key challenge in complex visuomotor control is learning abstract
representations that are effective for specifying goals, planning, and
generalization. To this end, we introduce universal planning networks (UPN).
UPNs embed differentiable planning within a goal-directed policy. This planning
computation unrolls a forward model in a latent space and infers an optimal
action plan through gradient descent trajectory optimization. The
plan-by-gradient-descent process and its underlying representations are learned
end-to-end to directly optimize a supervised imitation learning objective. We
find that the representations learned are not only effective for goal-directed
visual imitation via gradient-based trajectory optimization, but can also
provide a metric for specifying goals using images. The learned representations
can be leveraged to specify distance-based rewards to reach new target states
for model-free reinforcement learning, resulting in substantially more
effective learning when solving new tasks described via image-based goals. We
were able to achieve successful transfer of visuomotor planning strategies
across robots with significantly different morphologies and actuation
capabilities.
Comment: Videos available at https://sites.google.com/view/upn-public/home
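The plan-by-gradient-descent inner loop can be sketched directly: unroll a latent forward model over a candidate action sequence and descend on the latent distance to the encoded goal. `encode` and `dynamics` below are assumed differentiable modules, not the released UPN:

```python
# Hedged sketch of gradient-descent trajectory optimization in latent space.
import torch

def plan(encode, dynamics, obs, goal_obs, horizon=10, action_dim=4,
         steps=50, lr=0.1):
    z0 = encode(obs).detach()
    z_goal = encode(goal_obs).detach()
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.SGD([actions], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        z = z0
        for a in actions:                  # unroll the latent forward model
            z = dynamics(z, a)
        loss = torch.norm(z - z_goal)      # latent goal-distance objective
        loss.backward()                    # differentiate through the unroll
        opt.step()
    return actions.detach()
```

The same latent distance, once learned via imitation, is what serves as the image-goal reward for model-free RL.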
Reinforcement Learning without Ground-Truth State
To perform robot manipulation tasks, a low-dimensional state of the
environment typically needs to be estimated. However, designing a state
estimator can sometimes be difficult, especially in environments with
deformable objects. An alternative is to learn an end-to-end policy that maps
directly from high-dimensional sensor inputs to actions. However, if this
policy is trained with reinforcement learning, then without a state estimator,
it is hard to specify a reward function based on high-dimensional observations.
To meet this challenge, we propose a simple indicator reward function for
goal-conditioned reinforcement learning: we only give a positive reward when
the robot's observation exactly matches a target goal observation. We show that
by relabeling the original goal with the achieved goal to obtain positive
rewards (Andrychowicz et al., 2017), we can learn with the indicator reward
function even in continuous state spaces. We propose two methods to further
speed up convergence with indicator rewards: reward balancing and reward
filtering. We show comparable performance between our method and an oracle
which uses the ground-truth state for computing rewards. We show that our
method can perform complex tasks in continuous state spaces such as rope
manipulation from RGB-D images, without knowledge of the ground-truth state.
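The indicator reward and the relabeling step that makes it usable in continuous spaces fit in a few lines; the tuple layout is an assumption for illustration:

```python
# Hedged sketch of the indicator reward plus hindsight relabeling
# (Andrychowicz et al., 2017); the data layout is illustrative.
import numpy as np

def indicator_reward(obs, goal):
    """+1 only when the observation exactly matches the goal observation."""
    return 1.0 if np.array_equal(obs, goal) else 0.0

def relabel(transition):
    """Replace the goal with the achieved next observation so the indicator
    fires, yielding positive examples for the goal-conditioned critic."""
    obs, action, next_obs, goal = transition
    return (obs, action, next_obs, next_obs,       # goal := achieved obs
            indicator_reward(next_obs, next_obs))  # reward is always 1.0
```

In continuous spaces the original transitions almost never earn reward, so nearly all positive learning signal comes from such relabeled transitions.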
Near-Optimal Representation Learning for Hierarchical Reinforcement Learning
We study the problem of representation learning in goal-conditioned
hierarchical reinforcement learning. In such hierarchical structures, a
higher-level controller solves tasks by iteratively communicating goals which a
lower-level policy is trained to reach. Accordingly, the choice of
representation -- the mapping of observation space to goal space -- is crucial.
To study this problem, we develop a notion of sub-optimality of a
representation, defined in terms of expected reward of the optimal hierarchical
policy using this representation. We derive expressions which bound the
sub-optimality and show how these expressions can be translated to
representation learning objectives which may be optimized in practice. Results
on a number of difficult continuous-control tasks show that our approach to
representation learning yields qualitatively better representations as well as
quantitatively better hierarchical policies, compared to existing methods (see
videos at https://sites.google.com/view/representation-hrl).
Comment: ICLR 2019 Conference Paper
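The hierarchy under study can be sketched as follows, assuming a gym-style `env.step` returning `(obs, reward, done, info)`; all module names are placeholders rather than the paper's interfaces:

```python
# Hedged sketch of goal-conditioned HRL with a learned representation f.
import numpy as np

def low_level_reward(f, next_state, goal):
    """Reward the low level for moving f(state) toward the communicated goal."""
    return -np.linalg.norm(f(next_state) - goal)

def hierarchical_step(env, high_pi, low_pi, f, state, k=10):
    goal = high_pi(f(state))              # high level sets a sub-goal in f-space
    total = 0.0
    for _ in range(k):                    # low level pursues it for k steps
        action = low_pi(f(state), goal)
        next_state, ext_reward, done, _ = env.step(action)
        total += ext_reward               # high level is trained on this return
        state = next_state                # low level on low_level_reward(...)
        if done:
            break
    return state, total
```

The paper's bounds then constrain how f may be learned so that the optimal hierarchical policy using f-space goals loses little expected reward.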
VMAV-C: A Deep Attention-based Reinforcement Learning Algorithm for Model-based Control
Recent breakthroughs in Go and other strategic games have demonstrated the
great potential of reinforcement learning for intelligent decision-making in
uncertain environments, but several bottlenecks are encountered when this
paradigm is generalized to universal complex tasks. Among them, the low data
efficiency of model-free reinforcement learning algorithms is of great
concern. In contrast, model-based reinforcement learning algorithms can
capture the underlying dynamics of the learning environment and seldom suffer
from the data-utilization problem. To address this problem, this paper
proposes a model-based reinforcement learning algorithm with an embedded
attention mechanism, as an extension of World Models. We learn the environment
model through a Mixture Density Network Recurrent Network (MDN-RNN) for agents
to interact with, combining a variational auto-encoder (VAE) with attention in
the state-value estimates computed while learning the policy. In this way, the
agent can learn optimal policies with fewer interactions with the actual
environment, and experiments demonstrate the effectiveness of our model on
control problems.
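The environment model at the core of the method is an MDN-RNN in the World Models style: a recurrent network predicting a Gaussian mixture over the next VAE latent. Sizes, the diagonal-covariance choice, and the module layout below are illustrative assumptions:

```python
# Hedged sketch of an MDN-RNN next-latent model; hyperparameters are assumed.
import torch
import torch.nn as nn

class MDNRNN(nn.Module):
    def __init__(self, z_dim=32, a_dim=3, hidden=256, k=5):
        super().__init__()
        self.k, self.z_dim = k, z_dim
        self.rnn = nn.LSTM(z_dim + a_dim, hidden, batch_first=True)
        # mixture weights, means, and log-stds for k Gaussian components
        self.head = nn.Linear(hidden, k + 2 * k * z_dim)

    def forward(self, z, a, hidden=None):
        """z: (B, T, z_dim) latents; a: (B, T, a_dim) actions."""
        out, hidden = self.rnn(torch.cat([z, a], dim=-1), hidden)
        p = self.head(out)
        logits = p[..., :self.k]                           # (B, T, k) weights
        mu, log_std = p[..., self.k:].chunk(2, dim=-1)
        mu = mu.reshape(*mu.shape[:-1], self.k, self.z_dim)
        log_std = log_std.reshape(*log_std.shape[:-1], self.k, self.z_dim)
        return logits, mu, log_std, hidden
```

Training minimizes the mixture negative log-likelihood of the observed next latent; the attention module the abstract adds for value estimation sits outside this model.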