Learning Continuous Control Policies by Stochastic Value Gradients
We present a unified framework for learning continuous control policies using
backpropagation. It supports stochastic control by treating stochasticity in
the Bellman equation as a deterministic function of exogenous noise. The
product is a spectrum of general policy gradient algorithms that range from
model-free methods with value functions to model-based methods without value
functions. We use learned models but only require observations from the
environment instead of observations from model-predicted trajectories,
minimizing the impact of compounded model errors. We apply these algorithms
first to a toy stochastic control problem and then to several physics-based
control problems in simulation. One of these variants, SVG(1), shows the
effectiveness of learning models, value functions, and policies simultaneously
in continuous domains.
Comment: 13 pages, NIPS 201
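The idea of treating stochasticity as a deterministic function of exogenous noise can be illustrated with a minimal Python sketch. This is not the paper's SVG algorithm: the one-step quadratic cost, the Gaussian policy `a = mu + sigma * eta`, and every name below are assumptions chosen for illustration. Because the action is a deterministic function of the noise `eta`, the cost can be differentiated with respect to `mu` sample by sample.

```python
import random

def pathwise_grad_mu(mu, sigma, target, n_samples=10000, seed=0):
    """Estimate d/dmu E[(a - target)^2] with a = mu + sigma * eta, eta ~ N(0, 1).

    Hypothetical toy problem; the quadratic cost is an assumption.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eta = rng.gauss(0.0, 1.0)    # exogenous noise, independent of mu
        a = mu + sigma * eta         # action as a deterministic function of (mu, eta)
        total += 2.0 * (a - target)  # d cost / d a, times d a / d mu (which is 1)
    return total / n_samples

grad = pathwise_grad_mu(mu=1.0, sigma=0.5, target=3.0)
```

For this toy cost the analytic gradient is `2 * (mu - target)`, so the estimate should land close to -4 for the values shown.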
Structured Neural Network Dynamics for Model-based Control
We present a structured neural network architecture that is inspired by
linear time-varying dynamical systems. The network is designed to mimic the
properties of linear dynamical systems which makes analysis and control simple.
The architecture facilitates the integration of learned system models with
gradient-based model predictive control algorithms, and removes the requirement
of computing potentially costly derivatives online. We demonstrate the efficacy
of this modeling technique in computing autonomous control policies through
evaluation in a variety of standard continuous control domains.
Actor-critic versus direct policy search: a comparison based on sample complexity
Sample efficiency is a critical property when optimizing policy parameters
for the controller of a robot. In this paper, we evaluate two state-of-the-art
policy optimization algorithms. One is a recent deep reinforcement learning
method based on an actor-critic algorithm, Deep Deterministic Policy Gradient
(DDPG), that has been shown to perform well on various control benchmarks. The
other one is a direct policy search method, Covariance Matrix Adaptation
Evolution Strategy (CMA-ES), a black-box optimization method that is widely
used for robot learning. The algorithms are evaluated on a continuous version
of the mountain car benchmark problem, so as to compare their sample
complexity. From a preliminary analysis, we expect DDPG to be more sample
efficient than CMA-ES, which is confirmed by our experimental results.
Comment: Proceedings JFPDA (Journees Francaises Planification Decision Apprentissage)
Model-Based Action Exploration for Learning Dynamic Motion Skills
Deep reinforcement learning has achieved great strides in solving challenging
motion control tasks. Recently, there has been significant work on methods for
exploiting the data gathered during training, but there has been less work on
how to best generate the data to learn from. For continuous action domains, the
most common method for generating exploratory actions involves sampling from a
Gaussian distribution centred around the mean action output by a policy.
Although these methods can be quite capable, they do not scale well with the
dimensionality of the action space, and can be dangerous to apply on hardware.
We consider learning a forward dynamics model to predict the result of taking
a particular action, given a specific observation of the state. With this model
we perform internal look-ahead
predictions of outcomes and seek actions we believe have a reasonable chance of
success. This method alters the exploratory action space, thereby increasing
learning speed and enabling higher-quality solutions to difficult problems, such
as robotic locomotion and juggling.
Comment: 7 pages, 7 figures, conference paper
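The look-ahead idea in this abstract can be sketched roughly as follows. This is a simple random-shooting stand-in, not the method of the paper: `forward_model`, `value`, and `select_action` are hypothetical names, and the 1-D dynamics and value function are assumptions made only so the example runs.

```python
import random

def forward_model(state, action):
    """Hypothetical stand-in for a learned forward dynamics model (1-D point)."""
    return state + action

def value(state, goal=0.0):
    """Hypothetical value estimate: higher when the state is closer to the goal."""
    return -abs(state - goal)

def select_action(state, mean_action, sigma=0.5, n_candidates=32, seed=0):
    """Sample Gaussian candidates around the policy mean and keep the one
    whose model-predicted next state has the highest estimated value."""
    rng = random.Random(seed)
    candidates = [mean_action]  # always keep the policy's own mean action
    candidates += [rng.gauss(mean_action, sigma) for _ in range(n_candidates)]
    return max(candidates, key=lambda a: value(forward_model(state, a)))

state, mean_action = 2.0, -1.0
action = select_action(state, mean_action)
```

Since the mean action is always among the candidates, the selected action is never predicted to be worse than the mean action under the assumed model.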
Tangent: Automatic differentiation using source-code transformation for dynamically typed array programming
The need to efficiently calculate first- and higher-order derivatives of
increasingly complex models expressed in Python has stressed or exceeded the
capabilities of available tools. In this work, we explore techniques from the
field of automatic differentiation (AD) that can give researchers expressive
power, performance and strong usability. These include source-code
transformation (SCT), flexible gradient surgery, efficient in-place array
operations, higher-order derivatives as well as mixing of forward and reverse
mode AD. We implement and demonstrate these ideas in the Tangent software
library for Python, the first AD framework for a dynamic language that uses
SCT.
Towards Generalization and Simplicity in Continuous Control
This work shows that policies with simple linear and RBF parameterizations
can be trained to solve a variety of continuous control tasks, including the
OpenAI gym benchmarks. The performance of these trained policies is
competitive with state-of-the-art results obtained with more elaborate
parameterizations such as fully connected neural networks. Furthermore,
existing training and testing scenarios are shown to be very limited and prone
to over-fitting, thus giving rise to only trajectory-centric policies. Training
with a diverse initial state distribution is shown to produce more global
policies with better generalization. This allows for interactive control
scenarios where the system recovers from large on-line perturbations, as shown
in the supplementary video.
Comment: NIPS 2017, Project page: https://sites.google.com/view/simple-po
Model-Based Regularization for Deep Reinforcement Learning with Transcoder Networks
This paper proposes a new optimization objective for value-based deep
reinforcement learning. We extend conventional Deep Q-Networks (DQNs) by adding
a model-learning component yielding a transcoder network. The prediction errors
for the model are included in the basic DQN loss as additional regularizers.
This augmented objective leads to a richer training signal that provides
feedback at every time step. Moreover, because learning an environment model
shares a common structure with the RL problem, we hypothesize that the
resulting objective improves both sample efficiency and performance. We
empirically confirm our hypothesis on a range of 20 games from the Atari
benchmark, attaining superior results over vanilla DQN without model-based
regularization.
Comment: Presented at the NIPS Deep Reinforcement Learning Workshop, Montreal, Canada, 201
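One way to read "prediction errors for the model are included in the basic DQN loss as additional regularizers" is sketched below. The split into a reward term and a next-state term, the weight `lam`, and all function names are assumptions for illustration. The abstract does not specify the exact form of the objective.

```python
def td_error(q_sa, reward, max_q_next, gamma):
    """One-step temporal-difference error of a Q-value estimate."""
    return reward + gamma * max_q_next - q_sa

def augmented_loss(q_sa, reward, max_q_next, pred_reward,
                   pred_next_state, next_state, gamma=0.99, lam=1.0):
    """Squared TD error plus weighted model-prediction errors (assumed form)."""
    q_loss = td_error(q_sa, reward, max_q_next, gamma) ** 2
    reward_loss = (pred_reward - reward) ** 2
    # Mean squared error between predicted and observed next state.
    state_loss = sum((p - s) ** 2
                     for p, s in zip(pred_next_state, next_state)) / len(next_state)
    return q_loss + lam * (reward_loss + state_loss)

loss = augmented_loss(q_sa=1.0, reward=1.0, max_q_next=0.0, pred_reward=0.5,
                      pred_next_state=[1.0, 2.0], next_state=[1.0, 3.0])
```

In this form the model terms contribute a training signal at every transition, even when the TD error is zero, as it is in the example call.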
Memory-based control with recurrent neural networks
Partially observed control problems are a challenging aspect of reinforcement
learning. We extend two related, model-free algorithms for continuous control
-- deterministic policy gradient and stochastic value gradient -- to solve
partially observed domains using recurrent neural networks trained with
backpropagation through time.
We demonstrate that this approach, coupled with long short-term memory, is
able to solve a variety of physical control problems exhibiting an assortment
of memory requirements. These include the short-term integration of information
from noisy sensors and the identification of system parameters, as well as
long-term memory problems that require preserving information over many time
steps. We also demonstrate success on a combined exploration and memory problem
in the form of a simplified version of the well-known Morris water maze task.
Finally, we show that our approach can deal with high-dimensional observations
by learning directly from pixels.
We find that recurrent deterministic and stochastic policies are able to
learn similarly good solutions to these tasks, including the water maze where
the agent must learn effective search strategies.
Comment: NIPS Deep Reinforcement Learning Workshop 201
Deep Reinforcement Learning to Acquire Navigation Skills for Wheel-Legged Robots in Complex Environments
Mobile robot navigation in complex and dynamic environments is a challenging
but important problem. Reinforcement learning approaches fail to solve these
tasks efficiently due to reward sparsity, temporal complexity and the
high dimensionality of sensorimotor spaces, which are inherent in such problems.
We present a novel approach to train action policies to acquire navigation
skills for wheel-legged robots using deep reinforcement learning. The policy
maps height-map image observations to motor commands to navigate to a target
position while avoiding obstacles. We propose to acquire the multifaceted
navigation skill by learning and exploiting a number of manageable navigation
behaviors. We also introduce a domain randomization technique to improve the
versatility of the training samples. We demonstrate experimentally a
significant improvement in terms of data-efficiency, success rate, robustness
against irrelevant sensory data, and also the quality of the maneuver skills.
Comment: Submitted to IROS 201
Synthesizing Neural Network Controllers with Probabilistic Model based Reinforcement Learning
We present an algorithm for rapidly learning controllers for robotics
systems. The algorithm follows the model-based reinforcement learning paradigm,
and improves upon existing algorithms, namely Probabilistic Inference for Learning Control
(PILCO) and a sample-based version of PILCO with neural network dynamics
(Deep-PILCO). We propose training a neural network dynamics model using
variational dropout with truncated Log-Normal noise. This allows us to obtain a
dynamics model with calibrated uncertainty, which can be used to simulate
controller executions via rollouts. We also describe a set of techniques,
inspired by viewing PILCO as a recurrent neural network model, that are crucial
to improve the convergence of the method. We test our method on a variety of
benchmark tasks, demonstrating data-efficiency that is competitive with PILCO,
while being able to optimize complex neural network controllers. Finally, we
assess the performance of the algorithm for learning motor controllers for a
six-legged autonomous underwater vehicle. This demonstrates the potential of
the algorithm for scaling up the dimensionality and dataset sizes in more
complex control tasks.
Comment: 8 pages, 7 figures