A Tour of Reinforcement Learning: The View from Continuous Control
This manuscript surveys reinforcement learning from the perspective of
optimization and control with a focus on continuous control applications. It
surveys the general formulation, terminology, and typical experimental
implementations of reinforcement learning and reviews competing solution
paradigms. In order to compare the relative merits of various techniques, this
survey presents a case study of the Linear Quadratic Regulator (LQR) with
unknown dynamics, perhaps the simplest and best-studied problem in optimal
control. The manuscript describes how merging techniques from learning theory
and control can provide non-asymptotic characterizations of LQR performance and
shows that these characterizations tend to match experimental behavior. In
turn, when revisiting more complex applications, many of the observed phenomena
in LQR persist. In particular, theory and experiment demonstrate the role and
importance of models and the cost of generality in reinforcement learning
algorithms. This survey concludes with a discussion of some of the challenges
in designing learning systems that safely and reliably interact with complex
and uncertain environments and how tools from reinforcement learning and
control might be combined to approach these challenges.
Comment: minor revision with a few clarifying passages and corrected typos
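For reference, the LQR-with-unknown-dynamics case study discussed above is usually posed as follows (a standard formulation with generic symbols, not notation taken verbatim from the survey):

\[
\min_{u_0, u_1, \dots} \ \lim_{T \to \infty} \frac{1}{T}\,\mathbb{E}\left[\sum_{t=0}^{T} x_t^\top Q x_t + u_t^\top R u_t\right]
\quad \text{subject to} \quad x_{t+1} = A x_t + B u_t + w_t,
\]

where the cost matrices $Q \succeq 0$ and $R \succ 0$ are known, $w_t$ is zero-mean process noise, and the dynamics matrices $(A, B)$ are unknown and must be learned from data.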
On the Sample Complexity of the Linear Quadratic Regulator
This paper addresses the optimal control problem known as the Linear
Quadratic Regulator in the case when the dynamics are unknown. We propose a
multi-stage procedure, called Coarse-ID control, that estimates a model from a
few experimental trials, estimates the error in that model with respect to the
truth, and then designs a controller using both the model and uncertainty
estimate. Our technique uses contemporary tools from random matrix theory to
bound the error in the estimation procedure. We also employ a recently
developed approach to control synthesis called System Level Synthesis that
enables robust control design by solving a convex optimization problem. We
provide end-to-end bounds on the relative error in control cost that are nearly
optimal in the number of parameters and that highlight salient properties of
the system to be controlled such as closed-loop sensitivity and optimal control
magnitude. We show experimentally that the Coarse-ID approach enables efficient
computation of a stabilizing controller in regimes where simple control schemes
that do not take the model uncertainty into account fail to stabilize the true
system.
Comment: Contains a new analysis of finite-dimensional truncation, a new
data-dependent estimation bound, and an expanded exposition on necessary
background in control theory and System Level Synthesis
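A minimal sketch of the first stage described above, estimating a model from experimental trials by least squares, is shown below (the function name and the use of plain NumPy are illustrative assumptions, not the authors' code; the paper's contribution lies in bounding the error of such an estimate and feeding that bound into System Level Synthesis):

    import numpy as np

    def estimate_dynamics(X, U, X_next):
        """Least-squares estimate of (A, B) from transitions x_{t+1} ~ A x_t + B u_t + w_t.

        X, X_next: (N, n) arrays of states and next states; U: (N, d) array of inputs.
        """
        Z = np.hstack([X, U])                      # regressors [x_t, u_t]
        Theta, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
        n = X.shape[1]
        A_hat, B_hat = Theta[:n].T, Theta[n:].T    # unpack the stacked parameter matrix
        return A_hat, B_hat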
RLOC: Neurobiologically Inspired Hierarchical Reinforcement Learning Algorithm for Continuous Control of Nonlinear Dynamical Systems
Nonlinear optimal control problems are often solved with numerical methods
that require knowledge of the system's dynamics, which may be difficult to
infer, and that carry a large computational cost associated with iterative
and that carry a large computational cost associated with iterative
calculations. We present a novel neurobiologically inspired hierarchical
learning framework, Reinforcement Learning Optimal Control, which operates on
two levels of abstraction and utilises a reduced number of controllers to solve
nonlinear systems with unknown dynamics in continuous state and action spaces.
Our approach is inspired by research at two levels of abstraction: first, at
the level of limb coordination, human behaviour is explained by linear optimal
feedback control theory; second, in cognitive tasks involving symbolic-level
action selection, humans learn using model-free and
model-based reinforcement learning algorithms. We propose that combining these
two levels of abstraction leads to a fast global solution of nonlinear control
problems using a reduced number of controllers. Our framework learns the local
task dynamics from naive experience and forms locally optimal infinite horizon
Linear Quadratic Regulators which produce continuous low-level control. A
top-level reinforcement learner uses the controllers as actions and learns how
to best combine them in state space while maximising a long-term reward. A
single optimal control objective function drives high-level symbolic learning
by providing training signals on the desirability of each selected controller. We
show that a small number of locally optimal linear controllers are able to
solve global nonlinear control problems with unknown dynamics when combined
with a reinforcement learner in this hierarchical framework. Our algorithm
competes in terms of computational cost and solution quality with sophisticated
control algorithms, and we illustrate this with solutions to benchmark problems.
Comment: 33 pages, 8 figures
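As an illustration of the low-level controllers described above, an infinite-horizon LQR gain for a locally fitted linear model can be obtained by iterating the discrete-time Riccati recursion (a generic sketch that assumes a stabilizable local model, not the authors' implementation):

    import numpy as np

    def lqr_gain(A, B, Q, R, iters=500, tol=1e-9):
        """Infinite-horizon discrete-time LQR: iterate the Riccati recursion until it
        converges, then return the state-feedback gain K such that u = -K x."""
        P = Q.copy()
        for _ in range(iters):
            K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
            P_next = Q + A.T @ P @ (A - B @ K)    # Riccati update
            if np.max(np.abs(P_next - P)) < tol:
                P = P_next
                break
            P = P_next
        return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

A top-level reinforcement learner would then treat each such gain as one discrete action.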
Optimal and Learning Control for Autonomous Robots
Optimal and Learning Control for Autonomous Robots has been taught in the
Robotics, Systems and Controls Masters at ETH Zurich with the aim of teaching
optimal control and reinforcement learning for closed-loop control problems
from a unified point of view. The starting point is the formulation of an
optimal control problem, from which the different types of solutions and
algorithms are derived. These lecture notes aim to support this unified view
with a unified notation wherever possible, and to provide some help in
translating between the terminology and notation of the different fields. The course
assumes basic knowledge of Control Theory, Linear Algebra and Stochastic
Calculus.
Comment: Lecture Notes, 101 pages
Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator
Reinforcement learning (RL) has been successfully used to solve many
continuous control tasks. Despite its impressive results, however, fundamental
questions regarding the sample complexity of RL on continuous problems remain
open. We study the performance of RL in this setting by considering the
behavior of the Least-Squares Temporal Difference (LSTD) estimator on the
classic Linear Quadratic Regulator (LQR) problem from optimal control. We give
the first finite-time analysis of the number of samples needed to estimate the
value function for a fixed static state-feedback policy to within
$\varepsilon$-relative error. In the process of deriving our result, we give a
general characterization for when the minimum eigenvalue of the empirical
covariance matrix formed along the sample path of a fast-mixing stochastic
process concentrates above zero, extending a result by Koltchinskii and
Mendelson in the independent covariates setting. Finally, we provide
experimental evidence indicating that our analysis correctly captures the
qualitative behavior of LSTD on several LQR instances.
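For orientation, a bare-bones LSTD(0) computation for a fixed policy looks roughly like the following (generic discounted LSTD over a feature map, with quadratic monomials as the natural features for LQR value functions; this is a textbook recipe rather than code from the paper):

    import numpy as np

    def lstd(phi, phi_next, costs, discount=0.99, reg=1e-8):
        """LSTD(0): solve A w = b for the linear value-function weights, where
        A = sum_t phi_t (phi_t - gamma * phi_{t+1})^T and b = sum_t phi_t * c_t."""
        A = phi.T @ (phi - discount * phi_next)
        b = phi.T @ costs
        return np.linalg.solve(A + reg * np.eye(A.shape[0]), b)

    def quad_features(x):
        """Quadratic features for LQR value functions: the monomials x_i * x_j."""
        return np.outer(x, x)[np.triu_indices(len(x))]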
Performance guarantees for model-based Approximate Dynamic Programming in continuous spaces
We study both the value function and Q-function formulation of the Linear
Programming approach to Approximate Dynamic Programming. The approach is
model-based and optimizes over a restricted function space to approximate the
value function or Q-function. Working in the discrete time, continuous space
setting, we provide guarantees for the fitting error and online performance of
the policy. In particular, the online performance guarantee is obtained by
analyzing an iterated version of the greedy policy, and the fitting error
guarantee by analyzing an iterated version of the Bellman inequality. These
guarantees complement the existing bounds that appear in the literature. The
Q-function formulation offers benefits, for example, in decentralized
controller design; however, it can lead to computationally demanding
optimization problems. To alleviate this drawback, we provide a condition that
simplifies the formulation, resulting in improved computational times.
Comment: 18 pages, 5 figures, journal paper
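For context, the exact linear program that the approach above restricts to a finite-dimensional function space is, in the value-function case (standard formulation with generic notation, not copied from the paper):

\[
\max_{v} \ \int v(x)\, c(dx)
\quad \text{subject to} \quad
v(x) \;\le\; \ell(x,u) + \gamma\, \mathbb{E}\!\left[v(x^{+}) \mid x, u\right]
\quad \forall (x,u),
\]

whose solution is the true value function when the Bellman inequality is enforced at every state-input pair; the Q-function variant replaces $v(x)$ on the left-hand side with $q(x,u)$ and the expectation with $\mathbb{E}[\min_{u^{+}} q(x^{+},u^{+}) \mid x,u]$.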
The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint
The effectiveness of model-based versus model-free methods is a long-standing
question in reinforcement learning (RL). Motivated by recent empirical success
of RL on continuous control tasks, we study the sample complexity of popular
model-based and model-free algorithms on the Linear Quadratic Regulator (LQR).
We show that for policy evaluation, a simple model-based plugin method requires
asymptotically fewer samples than the classical least-squares temporal
difference (LSTD) estimator to reach the same quality of solution; the sample
complexity gap between the two methods can be at least a factor of state
dimension. For policy optimization, we study a simple family of problem instances
and show that nominal (certainty equivalence principle) control also requires
several factors of state and input dimension fewer samples than the policy
gradient method to reach the same level of control performance on these
instances. Furthermore, the gap persists even when employing commonly used
baselines. To the best of our knowledge, this is the first theoretical result
which demonstrates a separation in the sample complexity between model-based
and model-free methods on a continuous control task.
Comment: Improved the main result regarding policy optimization
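A rough sketch of the model-based "plugin" policy evaluation being compared against LSTD above: fit the dynamics, then read off the value function of the fixed policy from the fitted model by solving a Lyapunov equation (an illustrative undiscounted formulation that assumes the estimated closed loop is stable; not the authors' code):

    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    def plugin_policy_evaluation(A_hat, B_hat, K, Q, R):
        """Model-based evaluation of u = -K x: the value function is x^T P x with
        P solving P = (Q + K^T R K) + (A - B K)^T P (A - B K)."""
        A_cl = A_hat - B_hat @ K
        cost = Q + K.T @ R @ K
        # solve_discrete_lyapunov solves A X A^H - X + Q = 0, i.e. X = A X A^H + Q,
        # so pass the closed-loop transpose to match P = A_cl^T P A_cl + cost.
        return solve_discrete_lyapunov(A_cl.T, cost)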
From self-tuning regulators to reinforcement learning and back again
Machine and reinforcement learning (RL) are increasingly being applied to
plan and control the behavior of autonomous systems interacting with the
physical world. Examples include self-driving vehicles, distributed sensor
networks, and agile robots. However, when machine learning is to be applied in
these new settings, the algorithms had better come with the same type of
reliability, robustness, and safety bounds that are hallmarks of control
theory, or failures could be catastrophic. Thus, as learning algorithms are
increasingly and more aggressively deployed in safety critical settings, it is
imperative that control theorists join the conversation. The goal of this
tutorial paper is to provide a starting point for control theorists wishing to
work on learning related problems, by covering recent advances bridging
learning and control theory, and by placing these results within an appropriate
historical context of system identification and adaptive control.
Comment: Tutorial paper, 2019 IEEE Conference on Decision and Control, to
appear
Hamilton-Jacobi-Bellman Equations for Q-Learning in Continuous Time
In this paper, we introduce Hamilton-Jacobi-Bellman (HJB) equations for
Q-functions in continuous-time optimal control problems with Lipschitz
continuous controls. The standard Q-function used in reinforcement learning is
shown to be the unique viscosity solution of the HJB equation. A necessary and
sufficient condition for optimality is provided using the viscosity solution
framework. By using the HJB equation, we develop a Q-learning method for
continuous-time dynamical systems. A DQN-like algorithm is also proposed for
high-dimensional state and control spaces. The performance of the proposed
Q-learning algorithm is demonstrated using 1-, 10- and 20-dimensional dynamical
systems.
Comment: 2nd Annual Conference on Learning for Dynamics and Control (L4DC)
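As background for the Q-function result above, the classical HJB equation that characterises the discounted value function of a continuous-time system $\dot{x} = f(x,a)$ with running reward $r$ is (standard material, not the paper's new Q-function equation):

\[
\rho\, V(x) \;=\; \max_{a}\Big[\, r(x,a) + \nabla V(x)^{\top} f(x,a) \,\Big],
\]

and the paper derives and analyses the analogous equation satisfied by the Q-function $Q(x,a)$, with viscosity solutions handling the lack of classical differentiability.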
Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator
We study the sample complexity of approximate policy iteration (PI) for the
Linear Quadratic Regulator (LQR), building on a recent line of work using LQR
as a testbed to understand the limits of reinforcement learning (RL) algorithms
on continuous control tasks. Our analysis quantifies the tension between policy
improvement and policy evaluation, and suggests that policy evaluation is the
dominant factor in terms of sample complexity. Specifically, we show that to
obtain a controller that is within $\varepsilon$ of the optimal LQR controller,
each step of policy evaluation requires at most $\widetilde{O}\big((n+d)^3/\varepsilon^2\big)$
samples, where $n$ is the dimension of the state vector and $d$ is the
dimension of the input vector. On the other hand, only $O(\log(1/\varepsilon))$
policy improvement steps suffice, resulting in an overall sample complexity of
$\widetilde{O}\big((n+d)^3 \varepsilon^{-2} \log(1/\varepsilon)\big)$. We furthermore build on our
analysis and construct a simple adaptive procedure based on
$\varepsilon$-greedy exploration which relies on approximate PI as a
sub-routine and obtains $\widetilde{O}(T^{2/3})$ regret, improving upon a recent result of
Abbasi-Yadkori et al.
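For comparison with the approximate scheme analysed above, exact policy iteration for LQR with a known model alternates a Lyapunov-equation evaluation step with a greedy improvement step (a standard reference procedure that assumes an initial stabilizing gain; sketched here only to fix ideas, not the paper's algorithm):

    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    def exact_policy_iteration(A, B, Q, R, K0, iters=50):
        """Exact PI for LQR with u = -K x; K0 must stabilize (A, B)."""
        K = K0
        for _ in range(iters):
            A_cl = A - B @ K
            P = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)  # policy evaluation
            K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)     # policy improvement
        return K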