
    Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

    The actor-critic (AC) algorithm is a popular method to find an optimal policy in reinforcement learning. In the infinite-horizon setting, the finite-sample convergence rate of the AC and natural actor-critic (NAC) algorithms has been established recently, but under independent and identically distributed (i.i.d.) sampling and single-sample updates at each iteration. In contrast, this paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling, with mini-batch data at each iteration, and with the actor using a general policy class approximation. We show that the overall sample complexity for a mini-batch AC to attain an $\epsilon$-accurate stationary point improves the best known sample complexity of AC by an order of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$, and the overall sample complexity for a mini-batch NAC to attain an $\epsilon$-accurate globally optimal point improves the existing sample complexity of NAC by an order of $\mathcal{O}(\epsilon^{-1}/\log(1/\epsilon))$. Moreover, the sample complexity of AC and NAC characterized in this work outperforms that of policy gradient (PG) and natural policy gradient (NPG) by factors of $\mathcal{O}((1-\gamma)^{-3})$ and $\mathcal{O}((1-\gamma)^{-4}\epsilon^{-1}/\log(1/\epsilon))$, respectively. This is the first theoretical study establishing that AC and NAC attain orderwise performance improvements over PG and NPG in the infinite-horizon setting due to the incorporation of the critic.
    Comment: Accepted by NeurIPS 202
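
    As a rough illustration of the setting analyzed above, the following is a minimal sketch of a mini-batch actor-critic update on a small tabular MDP: the actor is a softmax policy, the critic is a tabular TD(0) value estimate, and each iteration averages gradients over a mini-batch collected along a single Markovian trajectory. The interface (a transition tensor P and reward matrix R) and all step sizes are illustrative assumptions, not the paper's algorithm or constants.

    # Minimal mini-batch actor-critic sketch on a tabular MDP (illustrative only).
    import numpy as np

    def softmax(x):
        z = x - x.max()
        e = np.exp(z)
        return e / e.sum()

    def minibatch_actor_critic(P, R, gamma=0.95, batch=32, iters=500,
                               alpha_actor=0.05, alpha_critic=0.1, seed=0):
        """P: (S, A, S) transition tensor, R: (S, A) reward matrix."""
        rng = np.random.default_rng(seed)
        S, A, _ = P.shape
        theta = np.zeros((S, A))   # actor: softmax policy parameters
        V = np.zeros(S)            # critic: tabular value estimates
        s = rng.integers(S)
        for _ in range(iters):
            g_theta = np.zeros_like(theta)
            g_V = np.zeros_like(V)
            # Collect one mini-batch along a single Markovian trajectory.
            for _ in range(batch):
                pi = softmax(theta[s])
                a = rng.choice(A, p=pi)
                s_next = rng.choice(S, p=P[s, a])
                delta = R[s, a] + gamma * V[s_next] - V[s]   # TD error
                g_V[s] += delta                               # critic (semi-)gradient
                grad_logpi = -pi
                grad_logpi[a] += 1.0
                g_theta[s] += delta * grad_logpi              # actor gradient
                s = s_next
            V += alpha_critic * g_V / batch
            theta += alpha_actor * g_theta / batch
        return theta, V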

    Softmax Policy Gradient Methods Can Take Exponential Time to Converge

    The softmax policy gradient (PG) method, which performs gradient ascent under the softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $\mathcal{S}$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take $\frac{1}{\eta}|\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}}$ iterations to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.
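
    To make the object of study concrete, here is a minimal sketch of softmax policy gradient iteration with exact gradients on a tabular MDP, the setting whose iteration complexity the abstract lower-bounds. The generic MDP inputs (P, R, rho) are assumptions; the paper's hard instance is a specific three-action construction not reproduced here.

    # Exact-gradient softmax policy gradient on a tabular MDP (illustrative sketch).
    import numpy as np

    def softmax_rows(theta):
        z = theta - theta.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def exact_softmax_pg(P, R, rho, gamma=0.9, eta=0.1, iters=1000):
        """P: (S, A, S) transitions, R: (S, A) rewards, rho: (S,) initial distribution."""
        S, A, _ = P.shape
        theta = np.zeros((S, A))
        for _ in range(iters):
            pi = softmax_rows(theta)                      # (S, A)
            P_pi = np.einsum('sa,sat->st', pi, P)         # policy transition matrix
            r_pi = (pi * R).sum(axis=1)
            V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
            Q = R + gamma * P @ V                         # state-action values
            Adv = Q - V[:, None]
            # Unnormalized discounted visitation: d = rho + gamma * P_pi^T d.
            d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
            grad = d[:, None] * pi * Adv                  # softmax policy gradient
            theta += eta * grad                           # gradient ascent step
        return theta, V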

    Scalable Bilinear $\pi$ Learning Using State and Action Features

    Approximate linear programming (ALP) represents one of the major algorithmic families to solve large-scale Markov decision processes (MDP). In this work, we study a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm called bilinear $\pi$ learning for reinforcement learning when a sampling oracle is provided. This algorithm enjoys a number of advantages. First, it adopts (bi)linear models to represent the high-dimensional value function and state-action distributions, using given state and action features. Its run-time complexity depends on the number of features, not the size of the underlying MDPs. Second, it operates in a fully online fashion without having to store any sample, thus having minimal memory footprint. Third, we prove that it is sample-efficient, solving for the optimal policy to high precision with a sample complexity linear in the dimension of the parameter space.
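
    The following is a simplified, hypothetical sketch of one stochastic primal-dual step on the Lagrangian of the MDP linear program, with a linear value model and a bilinear score for the state-action distribution, in the spirit of the approach described above. It is not the paper's bilinear pi-learning update; the parameterizations, step sizes, and feature interface are illustrative assumptions.

    # Sketch of a stochastic primal-dual step for the MDP linear program with
    # (bi)linear models; a simplified illustration, not the paper's algorithm.
    import numpy as np

    def bilinear_primal_dual_step(w, M, phi_s, psi_a, phi_s_next, r,
                                  rho_feat, gamma=0.95, lr_v=0.01, lr_mu=0.01):
        """One saddle-point step on L(v, mu) with
        v(s) = phi(s)^T w  and  mu-score(s, a) = phi(s)^T M psi(a).
        Inputs are features of a sampled transition (s, a, s', r) and of an
        initial state drawn from rho (rho_feat)."""
        # Dual weight for this sample (kept nonnegative via exponentiation).
        mu = np.exp(phi_s @ M @ psi_a)
        # Bellman residual at the sampled transition under the current v.
        delta = r + gamma * (phi_s_next @ w) - (phi_s @ w)
        # Primal (value) descent: (1-gamma) rho-term plus mu-weighted residual gradient.
        grad_w = (1.0 - gamma) * rho_feat + mu * (gamma * phi_s_next - phi_s)
        w = w - lr_v * grad_w
        # Dual (distribution) ascent: residual times bilinear feature outer product.
        grad_M = mu * delta * np.outer(phi_s, psi_a)
        M = M + lr_mu * grad_M
        return w, M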

    A Tour of Reinforcement Learning: The View from Continuous Control

    This manuscript surveys reinforcement learning from the perspective of optimization and control with a focus on continuous control applications. It surveys the general formulation, terminology, and typical experimental implementations of reinforcement learning and reviews competing solution paradigms. In order to compare the relative merits of various techniques, this survey presents a case study of the Linear Quadratic Regulator (LQR) with unknown dynamics, perhaps the simplest and best-studied problem in optimal control. The manuscript describes how merging techniques from learning theory and control can provide non-asymptotic characterizations of LQR performance and shows that these characterizations tend to match experimental behavior. In turn, when revisiting more complex applications, many of the observed phenomena in LQR persist. In particular, theory and experiment demonstrate the role and importance of models and the cost of generality in reinforcement learning algorithms. This survey concludes with a discussion of some of the challenges in designing learning systems that safely and reliably interact with complex and uncertain environments and how tools from reinforcement learning and control might be combined to approach these challenges.
    Comment: minor revision with a few clarifying passages and corrected typo
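
    As a concrete instance of the LQR-with-unknown-dynamics case study mentioned above, the sketch below fits (A, B) by least squares from random-input rollouts and then applies certainty-equivalent LQR via the discrete Riccati equation. The true system passed in as A_true, B_true is a stand-in for the unknown dynamics; this is an illustrative baseline, not the survey's code.

    # Certainty-equivalent LQR from data (illustrative sketch).
    import numpy as np
    from scipy.linalg import solve_discrete_are

    def identify_and_control(A_true, B_true, Q, R, n_rollouts=200, horizon=20,
                             noise_std=0.1, seed=0):
        rng = np.random.default_rng(seed)
        n, m = B_true.shape
        X, U, Xn = [], [], []
        # 1) Collect excitation data with random inputs.
        for _ in range(n_rollouts):
            x = rng.normal(size=n)
            for _ in range(horizon):
                u = rng.normal(size=m)
                x_next = A_true @ x + B_true @ u + noise_std * rng.normal(size=n)
                X.append(x); U.append(u); Xn.append(x_next)
                x = x_next
        X, U, Xn = map(np.asarray, (X, U, Xn))
        # 2) Least-squares estimate of [A B] from x_{t+1} ~ A x_t + B u_t.
        Z = np.hstack([X, U])
        AB_hat, *_ = np.linalg.lstsq(Z, Xn, rcond=None)
        A_hat, B_hat = AB_hat.T[:, :n], AB_hat.T[:, n:]
        # 3) Certainty equivalence: solve the Riccati equation for the estimated model.
        P = solve_discrete_are(A_hat, B_hat, Q, R)
        K = np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
        return K  # control law u = -K x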

    Batch Policy Learning under Constraints

    When learning policies for real-world domains, two important questions arise: (i) how to efficiently use pre-collected off-policy, non-optimal behavior data; and (ii) how to mediate among different competing objectives and constraints. We thus study the problem of batch policy learning under multiple constraints, and offer a systematic solution. We first propose a flexible meta-algorithm that admits any batch reinforcement learning and online learning procedure as subroutines. We then present a specific algorithmic instantiation and provide performance guarantees for the main objective and all constraints. To certify constraint satisfaction, we propose a new and simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves strong empirical results in different domains, including in a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving. We also show experimentally that our OPE method outperforms other popular OPE techniques on a standalone basis, especially in a high-dimensional setting.
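
    For context on the off-policy policy evaluation (OPE) step, below is a standard per-trajectory importance-sampling OPE baseline of the kind such methods are compared against; it is not the paper's proposed estimator. The trajectory format and the target_policy_prob callable are assumptions for illustration.

    # Ordinary and weighted importance-sampling OPE from batch trajectories.
    # Each trajectory is a list of (state, action, reward, behavior_prob) tuples.
    import numpy as np

    def is_ope(trajectories, target_policy_prob, gamma=0.99):
        """Per-trajectory importance-sampling estimates of the target policy's
        expected discounted return from behavior-policy data."""
        returns, weights = [], []
        for traj in trajectories:
            rho, G = 1.0, 0.0
            for t, (s, a, r, b_prob) in enumerate(traj):
                rho *= target_policy_prob(s, a) / b_prob   # cumulative importance ratio
                G += (gamma ** t) * r
            returns.append(G)
            weights.append(rho)
        returns, weights = np.asarray(returns), np.asarray(weights)
        ordinary = float(np.mean(weights * returns))                    # ordinary IS
        weighted = float(np.sum(weights * returns) / np.sum(weights))   # weighted IS
        return ordinary, weighted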

    Learning Efficient Representations for Reinforcement Learning

    Markov decision processes (MDPs) are a well-studied framework for solving sequential decision-making problems under uncertainty. Exact methods for solving MDPs based on dynamic programming, such as policy iteration and value iteration, are effective on small problems. In problems with a large discrete state space or with continuous state spaces, a compact representation is essential for providing efficient approximate solutions to MDPs. Commonly used approximation algorithms involve constructing basis functions for projecting the value function onto a low-dimensional subspace, and building a factored or hierarchical graphical model to decompose the transition and reward functions. However, hand-coding a good compact representation for a given reinforcement learning (RL) task can be quite difficult and time consuming. Recent approaches have attempted to automatically discover efficient representations for RL. In this thesis proposal, we discuss the problem of automatically constructing structured kernels for kernel-based RL, a popular approach to learning non-parametric approximations of the value function. We explore a space of kernel structures which are built compositionally from base kernels using a context-free grammar. We examine a greedy algorithm for searching over the structure space. To demonstrate how the learned structure can represent and approximate the original RL problem in terms of compactness and efficiency, we plan to evaluate our method on a synthetic problem and compare it to other RL baselines.
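
    A minimal sketch of the compositional idea described above: base kernels are combined by sums and products, and a greedy search keeps the composition that best fits sampled values under kernel ridge regression. The base kernel set, scoring rule, and stopping criterion are illustrative assumptions, not the thesis proposal's grammar or search procedure.

    # Greedy compositional kernel search (illustrative sketch).
    import numpy as np

    def rbf(X, Y, ls=1.0):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ls ** 2)

    def linear(X, Y):
        return X @ Y.T

    BASE = {"rbf": rbf, "linear": linear}

    def score(kernel, X, y, lam=1e-2):
        """Negative training error of kernel ridge regression (higher is better)."""
        K = kernel(X, X)
        alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
        return -float(np.mean((K @ alpha - y) ** 2))

    def greedy_kernel_search(X, y, depth=3):
        best_name, best_k = max(BASE.items(), key=lambda nk: score(nk[1], X, y))
        for _ in range(depth - 1):
            candidates = []
            for n, k in BASE.items():
                cur = best_k  # capture the current best structure
                candidates.append((f"({best_name}+{n})",
                                   lambda A, B, k=k, cur=cur: cur(A, B) + k(A, B)))
                candidates.append((f"({best_name}*{n})",
                                   lambda A, B, k=k, cur=cur: cur(A, B) * k(A, B)))
            name, kern = max(candidates, key=lambda nk: score(nk[1], X, y))
            if score(kern, X, y) <= score(best_k, X, y):
                break  # no improvement: stop greedy expansion
            best_name, best_k = name, kern
        return best_name, best_k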

    Stochastic Policy Gradient Ascent in Reproducing Kernel Hilbert Spaces

    Reinforcement learning consists of finding policies that maximize an expected cumulative long-term reward in a Markov decision process with unknown transition probabilities and instantaneous rewards. In this paper, we consider the problem of finding such optimal policies while assuming they are continuous functions belonging to a reproducing kernel Hilbert space (RKHS). To learn the optimal policy we introduce a stochastic policy gradient ascent algorithm with three unique novel features: (i) The stochastic estimates of policy gradients are unbiased. (ii) The variance of stochastic gradients is reduced by drawing on ideas from numerical differentiation. (iii) Policy complexity is controlled using sparse RKHS representations. Novel feature (i) is instrumental in proving convergence to a stationary point of the expected cumulative reward. Novel feature (ii) facilitates reasonable convergence times. Novel feature (iii) is a necessity in practical implementations, which we show can be done in a way that does not eliminate convergence guarantees. Numerical examples in standard problems illustrate successful learning of policies with low-complexity representations which are close to stationary points of the expected cumulative reward.
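
    A minimal sketch of a policy whose mean lives in an RKHS, updated by a REINFORCE-style functional gradient that appends a weighted kernel center at each visited state. The Gaussian exploration noise, RBF kernel, and update rule are simplified assumptions; in particular, the variance-reduction and sparsification features (ii) and (iii) described above are omitted.

    # RKHS policy with functional policy gradient updates (illustrative sketch).
    import numpy as np

    def rbf_k(c, s, ls=1.0):
        return np.exp(-0.5 * np.sum((c - s) ** 2) / ls ** 2)

    class RKHSPolicy:
        def __init__(self, sigma=0.3):
            self.centers, self.coeffs, self.sigma = [], [], sigma

        def mean(self, s):
            # Kernel expansion h(s) = sum_i a_i k(c_i, s).
            return sum(a * rbf_k(c, s) for c, a in zip(self.centers, self.coeffs))

        def act(self, s, rng):
            # Gaussian exploration around the RKHS mean.
            return self.mean(s) + self.sigma * rng.normal()

        def update(self, states, actions, returns, lr=0.05):
            """states, actions, returns: per-step data from one episode, with
            returns holding the (discounted) reward-to-go at each step."""
            # Functional gradient of log N(a | h(s), sigma^2) w.r.t. h is
            # (a - h(s)) / sigma^2 times the representer k(s, .).
            for s, a, G in zip(states, actions, returns):
                w = lr * G * (a - self.mean(s)) / self.sigma ** 2
                self.centers.append(np.asarray(s))
                self.coeffs.append(w)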

    Simultaneous Perturbation Algorithms for Batch Off-Policy Search

    We propose novel policy search algorithms in the context of off-policy, batch-mode reinforcement learning (RL) with continuous state and action spaces. Given a batch collection of trajectories, we perform off-line policy evaluation using an algorithm similar to that of [Fonteneau et al., 2010]. Using this Monte-Carlo-like policy evaluator, we perform policy search over a class of parameterized policies. We propose both first-order policy gradient and second-order policy Newton algorithms. All our algorithms incorporate simultaneous perturbation estimates for the gradient as well as the Hessian of the cost-to-go vector, since the latter is unknown and only biased estimates are available. We demonstrate their practicality on a simple 1-dimensional continuous state-space problem.
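
    A minimal sketch of the simultaneous-perturbation (SPSA) gradient estimate underlying such first-order schemes: two evaluations of the cost-to-go J per step yield an estimate of the full gradient. The black-box J (for example, the batch off-line policy evaluator) and the step-size constants are assumptions for illustration.

    # SPSA gradient estimate and a first-order policy search loop built on it.
    import numpy as np

    def spsa_gradient(J, theta, c=0.1, rng=None):
        """Simultaneous-perturbation estimate of grad J(theta) from two evaluations."""
        rng = rng or np.random.default_rng()
        delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Rademacher perturbation
        j_plus = J(theta + c * delta)
        j_minus = J(theta - c * delta)
        return (j_plus - j_minus) / (2.0 * c) / delta        # elementwise division by delta_i

    def spsa_policy_search(J, theta0, steps=200, a=0.05, c=0.1, seed=0):
        """First-order policy search: gradient descent on the cost-to-go J."""
        rng = np.random.default_rng(seed)
        theta = np.array(theta0, dtype=float)
        for _ in range(steps):
            theta -= a * spsa_gradient(J, theta, c=c, rng=rng)   # minimize J
        return theta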

    The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint

    The effectiveness of model-based versus model-free methods is a long-standing question in reinforcement learning (RL). Motivated by recent empirical success of RL on continuous control tasks, we study the sample complexity of popular model-based and model-free algorithms on the Linear Quadratic Regulator (LQR). We show that for policy evaluation, a simple model-based plugin method requires asymptotically fewer samples than the classical least-squares temporal difference (LSTD) estimator to reach the same quality of solution; the sample complexity gap between the two methods can be at least a factor of the state dimension. For policy optimization, we study a simple family of problem instances and show that nominal (certainty equivalence principle) control also requires several factors of state and input dimension fewer samples than the policy gradient method to reach the same level of control performance on these instances. Furthermore, the gap persists even when employing commonly used baselines. To the best of our knowledge, this is the first theoretical result which demonstrates a separation in the sample complexity between model-based and model-free methods on a continuous control task.
    Comment: Improved the main result regarding policy optimization
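
    To make the comparison concrete, the sketch below contrasts the two LQR policy-evaluation routes discussed above: a model-based plugin (fit A, B by least squares, then solve a discounted Lyapunov equation for the policy's quadratic value matrix) versus LSTD with quadratic features. The data arrays (states X, inputs U, next states Xn, per-step costs) and the discounted formulation are illustrative assumptions, not the paper's exact estimators.

    # Model-based plugin vs. LSTD policy evaluation for LQR (illustrative sketch).
    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    def plugin_value(X, U, Xn, K, Q, R, gamma=0.99):
        """Fit (A, B) by least squares, then solve for the quadratic value matrix P
        of the policy u = -K x via a (discounted) Lyapunov equation."""
        Z = np.hstack([X, U])
        AB, *_ = np.linalg.lstsq(Z, Xn, rcond=None)
        n = X.shape[1]
        A_hat, B_hat = AB.T[:, :n], AB.T[:, n:]
        L = np.sqrt(gamma) * (A_hat - B_hat @ K)
        return solve_discrete_lyapunov(L.T, Q + K.T @ R @ K)

    def lstd_value(X, Xn, costs, gamma=0.99, reg=1e-6):
        """LSTD with quadratic features phi(x) = vec(x x^T); returns the estimated P."""
        n = X.shape[1]
        phi = np.einsum('ti,tj->tij', X, X).reshape(len(X), -1)
        phi_n = np.einsum('ti,tj->tij', Xn, Xn).reshape(len(Xn), -1)
        A_mat = phi.T @ (phi - gamma * phi_n) + reg * np.eye(n * n)
        w = np.linalg.solve(A_mat, phi.T @ costs)
        return w.reshape(n, n)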

    Riemannian Proximal Policy Optimization

    In this paper, we propose a general Riemannian proximal optimization algorithm with guaranteed convergence to solve Markov decision process (MDP) problems. To model policy functions in MDPs, we employ a Gaussian mixture model (GMM) and formulate the problem as nonconvex optimization in the Riemannian space of positive semidefinite matrices. For two given policy functions, we also provide a lower bound on policy improvement by using bounds derived from the Wasserstein distance of GMMs. Preliminary experiments show the efficacy of our proposed Riemannian proximal policy optimization algorithm.
    Comment: 12 pages, 1 figure
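
    The policy-improvement bound mentioned above is built from Wasserstein distances between Gaussian mixture components; as a small illustration, the sketch below computes the closed-form 2-Wasserstein distance between two Gaussian components. This is only the basic ingredient, not the paper's GMM bound.

    # Closed-form 2-Wasserstein distance between two Gaussians (illustrative sketch).
    import numpy as np
    from scipy.linalg import sqrtm

    def w2_gaussian(m1, S1, m2, S2):
        """2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
        S2_half = np.real(sqrtm(S2))
        cross = np.real(sqrtm(S2_half @ S1 @ S2_half))
        bures = np.trace(S1 + S2 - 2.0 * cross)           # Bures metric term
        return float(np.sqrt(np.sum((m1 - m2) ** 2) + max(bures, 0.0)))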