5,168 research outputs found
On the Design of LQR Kernels for Efficient Controller Learning
Finding optimal feedback controllers for nonlinear dynamic systems from data
is hard. Recently, Bayesian optimization (BO) has been proposed as a powerful
framework for direct controller tuning from experimental trials. For selecting
the next query point and finding the global optimum, BO relies on a
probabilistic description of the latent objective function, typically a
Gaussian process (GP). As is shown herein, GPs with a common kernel choice can,
however, lead to poor learning outcomes on standard quadratic control problems.
For a first-order system, we construct two kernels that specifically leverage
the structure of the well-known Linear Quadratic Regulator (LQR), yet retain
the flexibility of Bayesian nonparametric learning. Simulations of uncertain
linear and nonlinear systems demonstrate that the LQR kernels yield superior
learning performance.
Comment: 8 pages, 5 figures, to appear in 56th IEEE Conference on Decision and Control (CDC 2017).
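As a hedged illustration of the idea (not the authors' exact construction), the sketch below builds a kernel for a first-order system from the closed-form nominal LQR cost of a candidate feedback gain; the system parameters `a, b, q, r` are made-up values:

```python
import numpy as np

# Illustrative nominal first-order system x_{t+1} = a*x_t + b*u_t with
# state feedback u_t = -f*x_t and stage cost q*x_t^2 + r*u_t^2.
# These parameter values are assumptions for the sketch, not the paper's.
a, b, q, r = 0.9, 1.0, 1.0, 0.1

def lqr_cost(f):
    """Closed-form infinite-horizon quadratic cost of gain f for the
    nominal model (finite only if the closed loop a - b*f is stable)."""
    a_cl = a - b * f
    if abs(a_cl) >= 1.0:
        return np.inf  # unstable closed loop: infinite cost
    # sum_t (q + r*f^2) * a_cl^(2t) = (q + r*f^2) / (1 - a_cl^2), with x_0 = 1
    return (q + r * f ** 2) / (1.0 - a_cl ** 2)

def lqr_kernel(f1, f2, sigma2=1.0):
    """A linear kernel in the feature phi(f) = nominal LQR cost, so GP
    samples resemble LQR cost curves; in practice one would add a
    standard stationary kernel to retain nonparametric flexibility."""
    return sigma2 * lqr_cost(f1) * lqr_cost(f2)
```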
Stick-Breaking Policy Learning in Dec-POMDPs
Expectation maximization (EM) has recently been shown to be an efficient
algorithm for learning finite-state controllers (FSCs) in large decentralized
POMDPs (Dec-POMDPs). However, current methods use fixed-size FSCs and often
converge to maxima that are far from optimal. This paper considers a
variable-size FSC to represent the local policy of each agent. These
variable-size FSCs are constructed using a stick-breaking prior, leading to a
new framework called \emph{decentralized stick-breaking policy representation}
(Dec-SBPR). This approach learns the controller parameters with a variational
Bayesian algorithm without having to assume that the Dec-POMDP model is
available. The performance of Dec-SBPR is demonstrated on several benchmark
problems, showing that the algorithm scales to large problems while
outperforming other state-of-the-art methods.
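For intuition, here is a minimal sketch of the stick-breaking construction that underlies such a variable-size prior (the truncation level and concentration parameter below are illustrative choices, not the paper's settings):

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking construction: v_k ~ Beta(1, alpha) and
    w_k = v_k * prod_{j<k} (1 - v_j). Later weights shrink rapidly, so
    only a data-driven number of FSC nodes carries appreciable mass."""
    v = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

# Example: prior over (at most) 20 controller nodes.
rng = np.random.default_rng(0)
w = stick_breaking_weights(alpha=2.0, truncation=20, rng=rng)
print(w.sum())  # < 1; the remainder is the mass on unused nodes
```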
Optimizing the CVaR via Sampling
Conditional Value at Risk (CVaR) is a prominent risk measure that is being
used extensively in various domains. We develop a new formula for the gradient
of the CVaR in the form of a conditional expectation. Based on this formula, we
propose a novel sampling-based estimator for the CVaR gradient, in the spirit
of the likelihood-ratio method. We analyze the bias of the estimator, and prove
the convergence of a corresponding stochastic gradient descent algorithm to a
local CVaR optimum. Our method makes it possible to consider CVaR optimization in new domains. As an example, we consider a reinforcement learning application and learn a risk-sensitive controller for the game of Tetris.
Comment: To appear in AAAI 2015.
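The gradient formula described above suggests an estimator of roughly the following shape; this is a sketch under assumed conventions (costs to be minimized, CVaR taken over the worst alpha-tail, `scores` holding per-trajectory score-function gradients), not the paper's exact estimator:

```python
import numpy as np

def cvar_gradient_estimate(costs, scores, alpha):
    """Sampling-based CVaR gradient in the likelihood-ratio spirit.
    costs:  shape (N,), sampled trajectory costs Z_i
    scores: shape (N, d), assumed per-sample score functions
            grad_theta log p(trajectory_i; theta)
    alpha:  tail probability, e.g. 0.05 for the worst 5% of outcomes
    Estimate: mean over tail samples of score_i * (Z_i - VaR_alpha)."""
    var_alpha = np.quantile(costs, 1.0 - alpha)  # empirical VaR (tail threshold)
    tail = costs >= var_alpha                    # the alpha-tail of worst costs
    return np.mean(scores[tail] * (costs[tail] - var_alpha)[:, None], axis=0)
```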
Q-learning with Nearest Neighbors
We consider model-free reinforcement learning for infinite-horizon discounted
Markov Decision Processes (MDPs) with a continuous state space and unknown
transition kernel, when only a single sample path under an arbitrary policy of
the system is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm, which learns the optimal Q-function using a nearest neighbor regression method. As the main contribution, we provide a tight finite-sample analysis of the convergence rate. In particular, for MDPs with a $d$-dimensional state space and discount factor $\gamma \in (0,1)$, given an arbitrary sample path with "covering time" $L$, we establish that the algorithm is guaranteed to output an $\epsilon$-accurate estimate of the optimal Q-function using $\tilde{O}\left(L/(\epsilon^3(1-\gamma)^7)\right)$ samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as $\tilde{O}(1/\epsilon^d)$, so the sample complexity scales as $\tilde{O}(1/\epsilon^{d+3})$. Indeed, we establish a lower bound showing that a dependence of $\tilde{\Omega}(1/\epsilon^{d+2})$ is necessary.
Comment: Accepted to NIPS 2018.
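A toy sketch of the nearest-neighbor idea follows; the actual NNQL algorithm averages over neighborhoods and schedules updates more carefully, so this single-neighbor update is only an assumption-laden condensation:

```python
import numpy as np

def nnql_step(Q, centers, s, a, r, s_next, gamma, lr):
    """One simplified nearest-neighbor Q-learning update on a fixed set
    of state centers (shape (K, d)); Q has shape (K, num_actions).
    The observed transition updates the entry of the center nearest to s,
    bootstrapping from the center nearest to s_next."""
    i = int(np.argmin(np.linalg.norm(centers - s, axis=1)))       # NN of s
    j = int(np.argmin(np.linalg.norm(centers - s_next, axis=1)))  # NN of s'
    target = r + gamma * Q[j].max()       # one-step bootstrapped target
    Q[i, a] += lr * (target - Q[i, a])    # standard Q-learning step
    return Q
```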
Inverse Reinforcement Learning in Large State Spaces via Function Approximation
This paper introduces a new method for inverse reinforcement learning in
large-scale and high-dimensional state spaces. To avoid solving the
computationally expensive reinforcement learning problems in reward learning,
we propose a function approximation method to ensure that the Bellman
Optimality Equation always holds, and then estimate a function to maximize the
likelihood of the observed motion. The time complexity of the proposed method is linear in the cardinality of the action set, so it can handle large state spaces efficiently. We test the proposed method in a simulated environment and show that it is more accurate than existing methods and scales significantly better. We also show that the proposed method can extend many existing methods to high-dimensional state spaces. We then apply the method to evaluating the effect of rehabilitative stimulations on patients with spinal cord injuries, based on the observed patient motions.
Comment: Experiment update
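To see why this sidesteps repeated RL solves, consider a toy tabular rendering of the construction (the paper uses function approximation in continuous spaces; this sketch, including the shapes below, is only illustrative):

```python
import numpy as np

def reward_from_q(Q, P, gamma):
    """Toy tabular version of 'Bellman optimality holds by construction':
    choose any Q-table (shape (S, A)) and define the reward as
      r(s, a) = Q(s, a) - gamma * sum_{s'} P[s, a, s'] * max_{a'} Q(s', a'),
    so the Bellman optimality equation is satisfied identically. Learning
    then searches over Q to maximize the likelihood of observed actions
    (e.g., via a softmax over Q(s, .)). P has shape (S, A, S)."""
    v = Q.max(axis=1)         # V(s') = max_a' Q(s', a')
    return Q - gamma * P @ v  # (S, A, S) @ (S,) -> (S, A)
```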