Model-Free Imitation Learning with Policy Optimization
In imitation learning, an agent learns how to behave in an environment with
an unknown cost function by mimicking expert demonstrations. Existing imitation
learning algorithms typically involve solving a sequence of planning or
reinforcement learning problems. Such algorithms are therefore not directly
applicable to large, high-dimensional environments, and their performance can
significantly degrade if the planning problems are not solved to optimality.
Under the apprenticeship learning formalism, we develop alternative model-free
algorithms for finding a parameterized stochastic policy that performs at least
as well as an expert policy on an unknown cost function, based on sample
trajectories from the expert. Our approach, based on policy gradients, scales
to large continuous environments with guaranteed convergence to local minima.
Comment: In Proceedings of the 33rd International Conference on Machine Learning, 2016.
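To make the game-theoretic setup concrete, here is a minimal sketch, assuming a linear cost class over features and hypothetical helpers (phi, grad_logp); it illustrates the apprenticeship-learning formalism, not the paper's exact algorithm. The cost that most separates policy and expert feature expectations is chosen, then the policy descends it with a REINFORCE-style gradient:

    import numpy as np

    def worst_case_cost(mu_policy, mu_expert):
        # Linear cost c(s, a) = w . phi(s, a) that maximizes the
        # policy-vs-expert gap in discounted feature expectations.
        w = mu_policy - mu_expert
        return w / (np.linalg.norm(w) + 1e-8)

    def pg_step(theta, trajs, w, phi, grad_logp, lr=1e-2, gamma=0.99):
        # REINFORCE-style descent on the adversarial cost: only sampled
        # trajectories are needed (model-free, no planner or known dynamics).
        g = np.zeros_like(theta)
        for traj in trajs:
            ret = sum(gamma**t * w @ phi(s, a) for t, (s, a) in enumerate(traj))
            g += ret * sum(grad_logp(theta, s, a) for s, a in traj)
        return theta - lr * g / len(trajs)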
Differentiable MPC for End-to-end Planning and Control
We present foundations for using Model Predictive Control (MPC) as a
differentiable policy class for reinforcement learning in continuous state and
action spaces. This provides one way of leveraging and combining the advantages
of model-free and model-based approaches. Specifically, we differentiate
through MPC by using the KKT conditions of the convex approximation at a fixed
point of the controller. Using this strategy, we are able to learn the cost and
dynamics of a controller via end-to-end learning. Our experiments focus on
imitation learning in the pendulum and cartpole domains, where we learn the
cost and dynamics terms of an MPC policy class. We show that our MPC policies
are significantly more data-efficient than a generic neural network and that
our method is superior to traditional system identification in a setting where
the expert is unrealizable.
Comment: NeurIPS 2018.
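A minimal sketch of the core trick, on an equality-constrained QP (far simpler than the paper's full MPC solver): the KKT conditions form a linear system, so a differentiable solve exposes gradients of the argmin with respect to the cost parameters via the implicit function theorem.

    import torch

    def qp_argmin(Q, p, A, b):
        # KKT system for min_u 0.5 u'Qu + p'u  s.t.  Au = b:
        #   [Q  A'] [u  ]   [-p]
        #   [A  0 ] [lam] = [ b]
        n, m = Q.shape[0], A.shape[0]
        K = torch.cat([torch.cat([Q, A.t()], dim=1),
                       torch.cat([A, torch.zeros(m, m)], dim=1)], dim=0)
        sol = torch.linalg.solve(K, torch.cat([-p, b]))  # differentiable solve
        return sol[:n]                                   # optimal action u*

    # Hypothetical imitation step: a loss on u* backpropagates into Q and p.
    Q = torch.eye(3, requires_grad=True)
    p = torch.zeros(3, requires_grad=True)
    A, b = torch.ones(1, 3), torch.ones(1)
    u_expert = torch.tensor([0.2, 0.3, 0.5])
    loss = ((qp_argmin(Q, p, A, b) - u_expert) ** 2).sum()
    loss.backward()  # gradients flow to the cost parameters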
InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations
The goal of imitation learning is to mimic expert behavior without access to
an explicit reward signal. Expert demonstrations provided by humans, however,
often show significant variability due to latent factors that are typically not
explicitly modeled. In this paper, we propose a new algorithm that can infer
the latent structure of expert demonstrations in an unsupervised way. Our
method, built on top of Generative Adversarial Imitation Learning, can not only
imitate complex behaviors, but also learn interpretable and meaningful
representations of complex behavioral data, including visual demonstrations. In
the driving domain, we show that a model learned from human demonstrations is
able to both accurately reproduce a variety of behaviors and accurately
anticipate human actions using raw visual inputs. Compared with various
baselines, our method can better capture the latent structure underlying expert
demonstrations, often recovering semantically meaningful factors of variation
in the data.
Comment: 14 pages, NIPS 2017.
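A minimal sketch of the ingredient InfoGAIL adds on top of GAIL (sizes and the discrete-code choice are illustrative assumptions): a posterior network learns to recover the latent code from state-action pairs, and its log-likelihood, a variational lower bound on mutual information, is added to the imitation reward so the code stays meaningful.

    import torch
    import torch.nn as nn

    obs_dim, act_dim, n_codes = 4, 2, 3  # hypothetical sizes
    posterior = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(),
                              nn.Linear(64, n_codes))  # logits over codes

    def info_bonus(s, a, c):
        # log q(c | s, a): added (scaled by a coefficient) to the GAIL
        # reward so the policy keeps its conditioning code c inferable.
        logits = posterior(torch.cat([s, a], dim=-1))
        return torch.log_softmax(logits, dim=-1).gather(1, c.unsqueeze(1))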
Generative Adversarial Imitation Learning
Consider learning a policy from example expert behavior, without interaction
with the expert or access to a reinforcement signal. One approach is to recover
the expert's cost function with inverse reinforcement learning, then extract a
policy from that cost function with reinforcement learning. This approach is
indirect and can be slow. We propose a new general framework for directly
extracting a policy from data, as if it were obtained by reinforcement learning
following inverse reinforcement learning. We show that a certain instantiation
of our framework draws an analogy between imitation learning and generative
adversarial networks, from which we derive a model-free imitation learning
algorithm that obtains significant performance gains over existing model-free
methods in imitating complex behaviors in large, high-dimensional environments.
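A minimal sketch of that instantiation (network sizes and the reward form are illustrative; the paper pairs this with a trust-region policy step): a discriminator classifies expert versus policy state-action pairs, and its output defines the surrogate reward for the policy update.

    import torch
    import torch.nn as nn

    obs_dim, act_dim = 4, 2                    # hypothetical sizes
    D = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(),
                      nn.Linear(64, 1))        # logit: expert vs. policy
    bce = nn.BCEWithLogitsLoss()

    def discriminator_loss(expert_sa, policy_sa):
        # Classify expert (s, a) pairs as 1 and policy pairs as 0.
        return (bce(D(expert_sa), torch.ones(len(expert_sa), 1)) +
                bce(D(policy_sa), torch.zeros(len(policy_sa), 1)))

    def surrogate_reward(policy_sa):
        # e.g. -log(1 - D): the RL step (TRPO in the paper) maximizes this.
        return -torch.log(1.0 - torch.sigmoid(D(policy_sa)) + 1e-8)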
Imitating Driver Behavior with Generative Adversarial Networks
The ability to accurately predict and simulate human driving behavior is
critical for the development of intelligent transportation systems. Traditional
modeling methods have employed simple parametric models and behavioral cloning.
This paper adopts a method for overcoming the problem of cascading errors
inherent in prior approaches, resulting in realistic behavior that is robust to
trajectory perturbations. We extend Generative Adversarial Imitation Learning
to the training of recurrent policies, and we demonstrate that our model
outperforms rule-based controllers and maximum likelihood models in realistic
highway simulations. Our model reproduces emergent behavior of human drivers,
such as lane change rates, while maintaining realistic control over long time
horizons.
Comment: 8 pages, 6 figures.
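The recurrent extension can be sketched as follows (dimensions are illustrative): a hidden state carried across time steps lets the policy stay consistent over long horizons, and it is this recurrent policy that the adversarial imitation objective is optimized through.

    import torch
    import torch.nn as nn

    class RecurrentPolicy(nn.Module):
        # Hidden state h threads through time so behavior stays coherent
        # across a long rollout.
        def __init__(self, obs_dim=10, act_dim=2, hidden=64):
            super().__init__()
            self.cell = nn.GRUCell(obs_dim, hidden)
            self.head = nn.Linear(hidden, act_dim)

        def forward(self, obs, h):
            h = self.cell(obs, h)
            return self.head(h), h  # action (mean) and next hidden state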
Dual Policy Iteration
Recently, a novel class of Approximate Policy Iteration (API) algorithms has
demonstrated impressive practical performance (e.g., ExIt from [2],
AlphaGo-Zero from [27]). This new family of algorithms maintains, and
alternately optimizes, two policies: a fast, reactive policy (e.g., a deep
neural network) deployed at test time, and a slow, non-reactive policy (e.g.,
tree search) that can plan multiple steps ahead. The reactive policy is
updated under supervision from the non-reactive policy, while the non-reactive
policy is improved with guidance from the reactive policy. In this work we
study this Dual Policy Iteration (DPI) strategy in an alternating optimization
framework and provide a convergence analysis that extends existing API theory.
We also develop a special instance of this framework which reduces the update
of non-reactive policies to model-based optimal control using learned local
models, and provides a theoretically sound way of unifying model-free and
model-based RL approaches with unknown dynamics. We demonstrate the efficacy of
our approach on various continuous control Markov Decision Processes.
Comment: NeurIPS 2018; Additional related work
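A minimal sketch of the alternation, with hypothetical helpers (collect_rollouts, fit_local_model, plan_with_model): the fast policy's rollouts feed learned local models, the slow planner improves actions using those models, and the fast policy is then distilled from the planner.

    def dual_policy_iteration(fast_policy, env, n_iters=100):
        # All helpers below are assumptions sketching the DPI loop.
        for _ in range(n_iters):
            rollouts = collect_rollouts(env, fast_policy)   # reactive policy explores
            model = fit_local_model(rollouts)               # learned local dynamics
            slow_actions = [plan_with_model(model, s) for s, _ in rollouts]
            fast_policy.fit([s for s, _ in rollouts],       # distillation: fast
                            slow_actions)                   # policy mimics planner
        return fast_policy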
Trajectory Optimization for Unknown Constrained Systems using Reinforcement Learning
In this paper, we propose a reinforcement learning-based algorithm for
trajectory optimization for constrained dynamical systems. This problem is
motivated by the fact that for most robotic systems, the dynamics may not
always be known. Generating smooth, dynamically feasible trajectories could be
difficult for such systems. Using sampling-based algorithms for motion planning
may result in trajectories that are prone to undesirable control jumps.
However, they can usually provide a good reference trajectory which a
model-free reinforcement learning algorithm can then exploit by limiting the
search domain and quickly finding a dynamically smooth trajectory. We use this
idea to train a reinforcement learning agent to learn a dynamically smooth
trajectory in a curriculum learning setting. Furthermore, for generalization,
we parameterize the policies with goal locations, so that the agent can be
trained for multiple goals simultaneously. We show results in both simulated
environments and in real experiments with a -DoF manipulator arm operated in
position-controlled mode to validate the proposed idea. We compare
the proposed ideas against a PID controller which is used to track a designed
trajectory in configuration space. Our experiments show that our RL agent
trained with a reference path outperformed a model-free PID controller of the
type commonly used on many robotic platforms for trajectory tracking.
Comment: 8 pages, 6 figures; accepted to IROS 2019.
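One way to picture the reward shaping (weights and the function interface are illustrative assumptions, not the paper's exact formulation): the agent is rewarded for staying near the planner's reference path while consecutive commands are kept close, which discourages control jumps; goal-conditioning simply appends the goal location to the policy input.

    import numpy as np

    def shaped_reward(q, prev_cmd, cmd, reference, w_track=1.0, w_smooth=0.1):
        # Track the nearest waypoint of the planner's reference path...
        track = -w_track * min(np.linalg.norm(q - r) for r in reference)
        # ...and penalize jumps between consecutive position commands.
        smooth = -w_smooth * np.linalg.norm(cmd - prev_cmd)
        return track + smooth

    # Goal-conditioned input for multi-goal training (hypothetical usage):
    # policy_input = np.concatenate([q, goal])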
Task Transfer by Preference-Based Cost Learning
The goal of task transfer in reinforcement learning is to migrate an agent's
action policy from a source task to a target task. Despite their successes in
robotic action planning, current methods mostly rely on two requirements:
exactly relevant expert demonstrations, or an explicitly coded cost function
for the target task; both, however, are inconvenient to obtain in practice. In
this paper, we relax these two strong conditions by developing a novel task
transfer framework in which expert preferences serve as guidance. In
particular, we alternate between two steps: first, experts apply pre-defined
preference rules to select demonstrations relevant to the target task; second,
based on the selection, we learn the target cost function and trajectory
distribution simultaneously via enhanced Adversarial MaxEnt IRL, and generate
more trajectories from the learned target distribution for the next round of
preference selection. We provide theoretical analysis of the distribution
learning and of the convergence of the proposed algorithm. Extensive
simulations on several benchmarks further verify the effectiveness of the
proposed method.
Comment: Accepted to AAAI 2019. Mingxuan Jing and Xiaojian Ma contributed equally to this work.
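A minimal sketch of that alternation with hypothetical helpers (expert_prefers, maxent_irl_step, sample_trajectories): preferred trajectories stand in for target-task demonstrations, and the learned distribution regenerates candidates for the next round of selection.

    def preference_based_transfer(source_demos, cost, n_rounds=10):
        # All helpers are assumptions illustrating the two alternating steps.
        candidates = source_demos
        for _ in range(n_rounds):
            selected = [t for t in candidates if expert_prefers(t)]  # step 1
            cost, policy = maxent_irl_step(cost, selected)           # step 2
            candidates = selected + sample_trajectories(policy)      # next round
        return cost, policy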
Deterministic Policy Optimization by Combining Pathwise and Score Function Estimators for Discrete Action Spaces
Policy optimization methods have shown great promise in solving complex
reinforcement and imitation learning tasks. While model-free methods are
broadly applicable, they often require many samples to optimize complex
policies. Model-based methods greatly improve sample-efficiency but at the cost
of poor generalization, requiring a carefully handcrafted model of the system
dynamics for each task. Recently, hybrid methods have been successful in
trading off applicability for improved sample-complexity. However, these have
been limited to continuous action spaces. In this work, we present a new hybrid
method based on an approximation of the dynamics as an expectation over the
next state under the current policy. This relaxation allows us to derive a
novel hybrid policy gradient estimator, combining score function and pathwise
derivative estimators, that is applicable to discrete action spaces. We show
significant gains in sample complexity when learning parameterized policies on
Cart Pole, Acrobot, Mountain Car, and Hand Mass. Our method is applicable to
both discrete and continuous action spaces, whereas competing pathwise methods
are limited to the latter.
Comment: In AAAI 2018 proceedings.
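The key relaxation can be sketched directly (the dynamics model f and sizes are illustrative assumptions): the sampled next state is replaced by its expectation over the policy, a probability-weighted mixture of per-action successors, which is differentiable in the policy parameters even when actions are discrete.

    import torch
    import torch.nn as nn

    n_actions, state_dim = 3, 4               # illustrative sizes
    policy_logits = nn.Linear(state_dim, n_actions)

    def expected_next_state(s, f):
        # pi(a | s) via softmax; f(s, a) is an assumed differentiable
        # dynamics model returning the successor state for action a.
        probs = torch.softmax(policy_logits(s), dim=-1)
        succ = torch.stack([f(s, a) for a in range(n_actions)])
        # Mixture of per-action successors: pathwise gradients flow through
        # probs, while score-function terms handle remaining stochasticity.
        return probs @ succ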
Dynamic Regret Convergence Analysis and an Adaptive Regularization Algorithm for On-Policy Robot Imitation Learning
On-policy imitation learning algorithms such as DAgger evolve a robot control
policy by executing it, measuring performance (loss), obtaining corrective
feedback from a supervisor, and generating the next policy. As the loss between
iterations can vary unpredictably, a fundamental question is under what
conditions this process will eventually achieve a converged policy. If one
assumes the underlying trajectory distribution is static (stationary), it is
possible to prove convergence for DAgger. However, in more realistic models for
robotics, the underlying trajectory distribution is dynamic because it is a
function of the policy. Recent results show it is possible to prove convergence
of DAgger when a regularity condition on the rate of change of the trajectory
distributions is satisfied. In this article, we reframe this result using
dynamic regret theory from the field of online optimization and show that
dynamic regret can be applied to any on-policy algorithm to analyze its
convergence and optimality. These results inspire a new algorithm, Adaptive
On-Policy Regularization (AOR), that ensures the conditions for convergence. We
present simulation results with cart-pole balancing and locomotion benchmarks
that suggest AOR can significantly decrease dynamic regret and chattering as
the robot learns. To our knowledge, this is the first application of dynamic
regret theory to imitation learning.
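A minimal sketch of the adaptive-regularization idea under assumed helpers (rollout_mixture and the fit interface are hypothetical): when the on-policy loss rises between iterations, the supervisor's mixing weight is increased to damp the shift in trajectory distribution, and it is annealed otherwise.

    def aor_loop(policy, supervisor, env, beta=0.5, n_iters=50):
        # beta blends supervisor and learner control during data collection.
        prev_loss = float("inf")
        for _ in range(n_iters):
            data = rollout_mixture(env, policy, supervisor, beta)
            loss = policy.fit(data.states, data.supervisor_actions)
            if loss > prev_loss:
                beta = min(1.0, beta * 1.5)  # regret growing: regularize harder
            else:
                beta *= 0.9                  # anneal toward on-policy rollouts
            prev_loss = loss
        return policy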