
    Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

    The actor-critic (AC) algorithm is a popular method to find an optimal policy in reinforcement learning. In the infinite-horizon setting, the finite-sample convergence rate of the AC and natural actor-critic (NAC) algorithms has been established recently, but under independent and identically distributed (i.i.d.) sampling and single-sample updates at each iteration. In contrast, this paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling, with mini-batch data at each iteration, and with the actor using a general policy class approximation. We show that the overall sample complexity for a mini-batch AC to attain an $\epsilon$-accurate stationary point improves the best known sample complexity of AC by an order of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$, and the overall sample complexity for a mini-batch NAC to attain an $\epsilon$-accurate globally optimal point improves the existing sample complexity of NAC by an order of $\mathcal{O}(\epsilon^{-1}/\log(1/\epsilon))$. Moreover, the sample complexity of AC and NAC characterized in this work outperforms that of policy gradient (PG) and natural policy gradient (NPG) by factors of $\mathcal{O}((1-\gamma)^{-3})$ and $\mathcal{O}((1-\gamma)^{-4}\epsilon^{-1}/\log(1/\epsilon))$, respectively. This is the first theoretical study establishing that AC and NAC attain orderwise performance improvements over PG and NPG in the infinite-horizon setting due to the incorporation of the critic.
    Comment: Accepted by NeurIPS 202
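
    As a rough illustration of the setting analyzed above, the following is a minimal sketch of a mini-batch actor-critic update on a small tabular MDP: the actor is a softmax policy, the critic is a tabular TD(0) value estimate, and each iteration averages gradients over a mini-batch collected along a single Markovian trajectory. The interface (a transition tensor P and reward matrix R) and all step sizes are illustrative assumptions, not the paper's algorithm or constants.

    # Minimal mini-batch actor-critic sketch on a tabular MDP (illustrative only).
    import numpy as np

    def softmax(x):
        z = x - x.max()
        e = np.exp(z)
        return e / e.sum()

    def minibatch_actor_critic(P, R, gamma=0.95, batch=32, iters=500,
                               alpha_actor=0.05, alpha_critic=0.1, seed=0):
        """P: (S, A, S) transition tensor, R: (S, A) reward matrix."""
        rng = np.random.default_rng(seed)
        S, A, _ = P.shape
        theta = np.zeros((S, A))   # actor: softmax policy parameters
        V = np.zeros(S)            # critic: tabular value estimates
        s = rng.integers(S)
        for _ in range(iters):
            g_theta = np.zeros_like(theta)
            g_V = np.zeros_like(V)
            # Collect one mini-batch along a single Markovian trajectory.
            for _ in range(batch):
                pi = softmax(theta[s])
                a = rng.choice(A, p=pi)
                s_next = rng.choice(S, p=P[s, a])
                delta = R[s, a] + gamma * V[s_next] - V[s]   # TD error
                g_V[s] += delta                               # critic (semi-)gradient
                grad_logpi = -pi
                grad_logpi[a] += 1.0
                g_theta[s] += delta * grad_logpi              # actor gradient
                s = s_next
            V += alpha_critic * g_V / batch
            theta += alpha_actor * g_theta / batch
        return theta, V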

    Softmax Policy Gradient Methods Can Take Exponential Time to Converge

    The softmax policy gradient (PG) method, which performs gradient ascent under the softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $\mathcal{S}$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take $\frac{1}{\eta}|\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}}$ iterations to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.
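
    To make the object of study concrete, here is a minimal sketch of softmax policy gradient iteration with exact gradients on a tabular MDP, the setting whose iteration complexity the abstract lower-bounds. The generic MDP inputs (P, R, rho) are assumptions; the paper's hard instance is a specific three-action construction not reproduced here.

    # Exact-gradient softmax policy gradient on a tabular MDP (illustrative sketch).
    import numpy as np

    def softmax_rows(theta):
        z = theta - theta.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def exact_softmax_pg(P, R, rho, gamma=0.9, eta=0.1, iters=1000):
        """P: (S, A, S) transitions, R: (S, A) rewards, rho: (S,) initial distribution."""
        S, A, _ = P.shape
        theta = np.zeros((S, A))
        for _ in range(iters):
            pi = softmax_rows(theta)                      # (S, A)
            P_pi = np.einsum('sa,sat->st', pi, P)         # policy transition matrix
            r_pi = (pi * R).sum(axis=1)
            V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
            Q = R + gamma * P @ V                         # state-action values
            Adv = Q - V[:, None]
            # Unnormalized discounted visitation: d = rho + gamma * P_pi^T d.
            d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
            grad = d[:, None] * pi * Adv                  # softmax policy gradient
            theta += eta * grad                           # gradient ascent step
        return theta, V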

    Scalable Bilinear $\pi$ Learning Using State and Action Features

    Approximate linear programming (ALP) represents one of the major algorithmic families to solve large-scale Markov decision processes (MDP). In this work, we study a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm called bilinear $\pi$ learning for reinforcement learning when a sampling oracle is provided. This algorithm enjoys a number of advantages. First, it adopts (bi)linear models to represent the high-dimensional value function and state-action distributions, using given state and action features. Its run-time complexity depends on the number of features, not the size of the underlying MDPs. Second, it operates in a fully online fashion without having to store any sample, thus having minimal memory footprint. Third, we prove that it is sample-efficient, solving for the optimal policy to high precision with a sample complexity linear in the dimension of the parameter space.
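
    The following is a simplified, hypothetical sketch of one stochastic primal-dual step on the Lagrangian of the MDP linear program, with a linear value model and a bilinear score for the state-action distribution, in the spirit of the approach described above. It is not the paper's bilinear pi-learning update; the parameterizations, step sizes, and feature interface are illustrative assumptions.

    # Sketch of a stochastic primal-dual step for the MDP linear program with
    # (bi)linear models; a simplified illustration, not the paper's algorithm.
    import numpy as np

    def bilinear_primal_dual_step(w, M, phi_s, psi_a, phi_s_next, r,
                                  rho_feat, gamma=0.95, lr_v=0.01, lr_mu=0.01):
        """One saddle-point step on L(v, mu) with
        v(s) = phi(s)^T w  and  mu-score(s, a) = phi(s)^T M psi(a).
        Inputs are features of a sampled transition (s, a, s', r) and of an
        initial state drawn from rho (rho_feat)."""
        # Dual weight for this sample (kept nonnegative via exponentiation).
        mu = np.exp(phi_s @ M @ psi_a)
        # Bellman residual at the sampled transition under the current v.
        delta = r + gamma * (phi_s_next @ w) - (phi_s @ w)
        # Primal (value) descent: (1-gamma) rho-term plus mu-weighted residual gradient.
        grad_w = (1.0 - gamma) * rho_feat + mu * (gamma * phi_s_next - phi_s)
        w = w - lr_v * grad_w
        # Dual (distribution) ascent: residual times bilinear feature outer product.
        grad_M = mu * delta * np.outer(phi_s, psi_a)
        M = M + lr_mu * grad_M
        return w, M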

    A Tour of Reinforcement Learning: The View from Continuous Control

    This manuscript surveys reinforcement learning from the perspective of optimization and control with a focus on continuous control applications. It surveys the general formulation, terminology, and typical experimental implementations of reinforcement learning and reviews competing solution paradigms. In order to compare the relative merits of various techniques, this survey presents a case study of the Linear Quadratic Regulator (LQR) with unknown dynamics, perhaps the simplest and best-studied problem in optimal control. The manuscript describes how merging techniques from learning theory and control can provide non-asymptotic characterizations of LQR performance and shows that these characterizations tend to match experimental behavior. In turn, when revisiting more complex applications, many of the observed phenomena in LQR persist. In particular, theory and experiment demonstrate the role and importance of models and the cost of generality in reinforcement learning algorithms. This survey concludes with a discussion of some of the challenges in designing learning systems that safely and reliably interact with complex and uncertain environments and how tools from reinforcement learning and control might be combined to approach these challenges.
    Comment: minor revision with a few clarifying passages and corrected typo
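
    As a concrete instance of the LQR-with-unknown-dynamics case study mentioned above, the sketch below fits (A, B) by least squares from random-input rollouts and then applies certainty-equivalent LQR via the discrete Riccati equation. The true system passed in as A_true, B_true is a stand-in for the unknown dynamics; this is an illustrative baseline, not the survey's code.

    # Certainty-equivalent LQR from data (illustrative sketch).
    import numpy as np
    from scipy.linalg import solve_discrete_are

    def identify_and_control(A_true, B_true, Q, R, n_rollouts=200, horizon=20,
                             noise_std=0.1, seed=0):
        rng = np.random.default_rng(seed)
        n, m = B_true.shape
        X, U, Xn = [], [], []
        # 1) Collect excitation data with random inputs.
        for _ in range(n_rollouts):
            x = rng.normal(size=n)
            for _ in range(horizon):
                u = rng.normal(size=m)
                x_next = A_true @ x + B_true @ u + noise_std * rng.normal(size=n)
                X.append(x); U.append(u); Xn.append(x_next)
                x = x_next
        X, U, Xn = map(np.asarray, (X, U, Xn))
        # 2) Least-squares estimate of [A B] from x_{t+1} ~ A x_t + B u_t.
        Z = np.hstack([X, U])
        AB_hat, *_ = np.linalg.lstsq(Z, Xn, rcond=None)
        A_hat, B_hat = AB_hat.T[:, :n], AB_hat.T[:, n:]
        # 3) Certainty equivalence: solve the Riccati equation for the estimated model.
        P = solve_discrete_are(A_hat, B_hat, Q, R)
        K = np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
        return K  # control law u = -K x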

    Batch Policy Learning under Constraints

    When learning policies for real-world domains, two important questions arise: (i) how to efficiently use pre-collected off-policy, non-optimal behavior data; and (ii) how to mediate among different competing objectives and constraints. We thus study the problem of batch policy learning under multiple constraints, and offer a systematic solution. We first propose a flexible meta-algorithm that admits any batch reinforcement learning and online learning procedure as subroutines. We then present a specific algorithmic instantiation and provide performance guarantees for the main objective and all constraints. To certify constraint satisfaction, we propose a new and simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves strong empirical results in different domains, including in a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving. We also show experimentally that our OPE method outperforms other popular OPE techniques on a standalone basis, especially in a high-dimensional setting.
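
    For context on the off-policy policy evaluation (OPE) step, below is a standard per-trajectory importance-sampling OPE baseline of the kind such methods are compared against; it is not the paper's proposed estimator. The trajectory format and the target_policy_prob callable are assumptions for illustration.

    # Ordinary and weighted importance-sampling OPE from batch trajectories.
    # Each trajectory is a list of (state, action, reward, behavior_prob) tuples.
    import numpy as np

    def is_ope(trajectories, target_policy_prob, gamma=0.99):
        """Per-trajectory importance-sampling estimates of the target policy's
        expected discounted return from behavior-policy data."""
        returns, weights = [], []
        for traj in trajectories:
            rho, G = 1.0, 0.0
            for t, (s, a, r, b_prob) in enumerate(traj):
                rho *= target_policy_prob(s, a) / b_prob   # cumulative importance ratio
                G += (gamma ** t) * r
            returns.append(G)
            weights.append(rho)
        returns, weights = np.asarray(returns), np.asarray(weights)
        ordinary = float(np.mean(weights * returns))                    # ordinary IS
        weighted = float(np.sum(weights * returns) / np.sum(weights))   # weighted IS
        return ordinary, weighted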

    Learning Efficient Representations for Reinforcement Learning

    Markov decision processes (MDPs) are a well-studied framework for solving sequential decision-making problems under uncertainty. Exact methods for solving MDPs based on dynamic programming, such as policy iteration and value iteration, are effective on small problems. In problems with a large discrete state space or with continuous state spaces, a compact representation is essential for providing efficient approximate solutions to MDPs. Commonly used approximation algorithms involve constructing basis functions for projecting the value function onto a low-dimensional subspace, and building a factored or hierarchical graphical model to decompose the transition and reward functions. However, hand-coding a good compact representation for a given reinforcement learning (RL) task can be quite difficult and time consuming. Recent approaches have attempted to automatically discover efficient representations for RL. In this thesis proposal, we discuss the problem of automatically constructing structured kernels for kernel-based RL, a popular approach to learning non-parametric approximations of the value function. We explore a space of kernel structures which are built compositionally from base kernels using a context-free grammar. We examine a greedy algorithm for searching over the structure space. To demonstrate how the learned structure can represent and approximate the original RL problem in terms of compactness and efficiency, we plan to evaluate our method on a synthetic problem and compare it to other RL baselines.
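
    A minimal sketch of the compositional idea described above: base kernels are combined by sums and products, and a greedy search keeps the composition that best fits sampled values under kernel ridge regression. The base kernel set, scoring rule, and stopping criterion are illustrative assumptions, not the thesis proposal's grammar or search procedure.

    # Greedy compositional kernel search (illustrative sketch).
    import numpy as np

    def rbf(X, Y, ls=1.0):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ls ** 2)

    def linear(X, Y):
        return X @ Y.T

    BASE = {"rbf": rbf, "linear": linear}

    def score(kernel, X, y, lam=1e-2):
        """Negative training error of kernel ridge regression (higher is better)."""
        K = kernel(X, X)
        alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
        return -float(np.mean((K @ alpha - y) ** 2))

    def greedy_kernel_search(X, y, depth=3):
        best_name, best_k = max(BASE.items(), key=lambda nk: score(nk[1], X, y))
        for _ in range(depth - 1):
            candidates = []
            for n, k in BASE.items():
                cur = best_k  # capture the current best structure
                candidates.append((f"({best_name}+{n})",
                                   lambda A, B, k=k, cur=cur: cur(A, B) + k(A, B)))
                candidates.append((f"({best_name}*{n})",
                                   lambda A, B, k=k, cur=cur: cur(A, B) * k(A, B)))
            name, kern = max(candidates, key=lambda nk: score(nk[1], X, y))
            if score(kern, X, y) <= score(best_k, X, y):
                break  # no improvement: stop greedy expansion
            best_name, best_k = name, kern
        return best_name, best_k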

    Stochastic Policy Gradient Ascent in Reproducing Kernel Hilbert Spaces

    Reinforcement learning consists of finding policies that maximize an expected cumulative long-term reward in a Markov decision process with unknown transition probabilities and instantaneous rewards. In this paper, we consider the problem of finding such optimal policies while assuming they are continuous functions belonging to a reproducing kernel Hilbert space (RKHS). To learn the optimal policy we introduce a stochastic policy gradient ascent algorithm with three unique novel features: (i) The stochastic estimates of policy gradients are unbiased. (ii) The variance of stochastic gradients is reduced by drawing on ideas from numerical differentiation. (iii) Policy complexity is controlled using sparse RKHS representations. Novel feature (i) is instrumental in proving convergence to a stationary point of the expected cumulative reward. Novel feature (ii) facilitates reasonable convergence times. Novel feature (iii) is a necessity in practical implementations, which we show can be done in a way that does not eliminate convergence guarantees. Numerical examples in standard problems illustrate successful learning of policies with low-complexity representations which are close to stationary points of the expected cumulative reward.
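
    A minimal sketch of a policy whose mean lives in an RKHS, updated by a REINFORCE-style functional gradient that appends a weighted kernel center at each visited state. The Gaussian exploration noise, RBF kernel, and update rule are simplified assumptions; in particular, the variance-reduction and sparsification features (ii) and (iii) described above are omitted.

    # RKHS policy with functional policy gradient updates (illustrative sketch).
    import numpy as np

    def rbf_k(c, s, ls=1.0):
        return np.exp(-0.5 * np.sum((c - s) ** 2) / ls ** 2)

    class RKHSPolicy:
        def __init__(self, sigma=0.3):
            self.centers, self.coeffs, self.sigma = [], [], sigma

        def mean(self, s):
            # Kernel expansion h(s) = sum_i a_i k(c_i, s).
            return sum(a * rbf_k(c, s) for c, a in zip(self.centers, self.coeffs))

        def act(self, s, rng):
            # Gaussian exploration around the RKHS mean.
            return self.mean(s) + self.sigma * rng.normal()

        def update(self, states, actions, returns, lr=0.05):
            """states, actions, returns: per-step data from one episode, with
            returns holding the (discounted) reward-to-go at each step."""
            # Functional gradient of log N(a | h(s), sigma^2) w.r.t. h is
            # (a - h(s)) / sigma^2 times the representer k(s, .).
            for s, a, G in zip(states, actions, returns):
                w = lr * G * (a - self.mean(s)) / self.sigma ** 2
                self.centers.append(np.asarray(s))
                self.coeffs.append(w)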

    Simultaneous Perturbation Algorithms for Batch Off-Policy Search

    We propose novel policy search algorithms in the context of off-policy, batch-mode reinforcement learning (RL) with continuous state and action spaces. Given a batch collection of trajectories, we perform off-line policy evaluation using an algorithm similar to that of [Fonteneau et al., 2010]. Using this Monte-Carlo-like policy evaluator, we perform policy search over a class of parameterized policies. We propose both first-order policy gradient and second-order policy Newton algorithms. All our algorithms incorporate simultaneous perturbation estimates for the gradient as well as the Hessian of the cost-to-go vector, since the latter is unknown and only biased estimates are available. We demonstrate their practicality on a simple 1-dimensional continuous state-space problem.
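
    A minimal sketch of the simultaneous-perturbation (SPSA) gradient estimate underlying such first-order schemes: two evaluations of the cost-to-go J per step yield an estimate of the full gradient. The black-box J (for example, the batch off-line policy evaluator) and the step-size constants are assumptions for illustration.

    # SPSA gradient estimate and a first-order policy search loop built on it.
    import numpy as np

    def spsa_gradient(J, theta, c=0.1, rng=None):
        """Simultaneous-perturbation estimate of grad J(theta) from two evaluations."""
        rng = rng or np.random.default_rng()
        delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Rademacher perturbation
        j_plus = J(theta + c * delta)
        j_minus = J(theta - c * delta)
        return (j_plus - j_minus) / (2.0 * c) / delta        # elementwise division by delta_i

    def spsa_policy_search(J, theta0, steps=200, a=0.05, c=0.1, seed=0):
        """First-order policy search: gradient descent on the cost-to-go J."""
        rng = np.random.default_rng(seed)
        theta = np.array(theta0, dtype=float)
        for _ in range(steps):
            theta -= a * spsa_gradient(J, theta, c=c, rng=rng)   # minimize J
        return theta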

    The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint

    The effectiveness of model-based versus model-free methods is a long-standing question in reinforcement learning (RL). Motivated by recent empirical success of RL on continuous control tasks, we study the sample complexity of popular model-based and model-free algorithms on the Linear Quadratic Regulator (LQR). We show that for policy evaluation, a simple model-based plugin method requires asymptotically fewer samples than the classical least-squares temporal difference (LSTD) estimator to reach the same quality of solution; the sample complexity gap between the two methods can be at least a factor of the state dimension. For policy optimization, we study a simple family of problem instances and show that nominal (certainty equivalence principle) control also requires several factors of state and input dimension fewer samples than the policy gradient method to reach the same level of control performance on these instances. Furthermore, the gap persists even when employing commonly used baselines. To the best of our knowledge, this is the first theoretical result which demonstrates a separation in the sample complexity between model-based and model-free methods on a continuous control task.
    Comment: Improved the main result regarding policy optimization
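
    To make the comparison concrete, the sketch below contrasts the two LQR policy-evaluation routes discussed above: a model-based plugin (fit A, B by least squares, then solve a discounted Lyapunov equation for the policy's quadratic value matrix) versus LSTD with quadratic features. The data arrays (states X, inputs U, next states Xn, per-step costs) and the discounted formulation are illustrative assumptions, not the paper's exact estimators.

    # Model-based plugin vs. LSTD policy evaluation for LQR (illustrative sketch).
    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    def plugin_value(X, U, Xn, K, Q, R, gamma=0.99):
        """Fit (A, B) by least squares, then solve for the quadratic value matrix P
        of the policy u = -K x via a (discounted) Lyapunov equation."""
        Z = np.hstack([X, U])
        AB, *_ = np.linalg.lstsq(Z, Xn, rcond=None)
        n = X.shape[1]
        A_hat, B_hat = AB.T[:, :n], AB.T[:, n:]
        L = np.sqrt(gamma) * (A_hat - B_hat @ K)
        return solve_discrete_lyapunov(L.T, Q + K.T @ R @ K)

    def lstd_value(X, Xn, costs, gamma=0.99, reg=1e-6):
        """LSTD with quadratic features phi(x) = vec(x x^T); returns the estimated P."""
        n = X.shape[1]
        phi = np.einsum('ti,tj->tij', X, X).reshape(len(X), -1)
        phi_n = np.einsum('ti,tj->tij', Xn, Xn).reshape(len(Xn), -1)
        A_mat = phi.T @ (phi - gamma * phi_n) + reg * np.eye(n * n)
        w = np.linalg.solve(A_mat, phi.T @ costs)
        return w.reshape(n, n)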

    Riemannian Proximal Policy Optimization

    In this paper, we propose a general Riemannian proximal optimization algorithm with guaranteed convergence to solve Markov decision process (MDP) problems. To model policy functions in MDPs, we employ a Gaussian mixture model (GMM) and formulate the problem as nonconvex optimization in the Riemannian space of positive semidefinite matrices. For two given policy functions, we also provide a lower bound on policy improvement by using bounds derived from the Wasserstein distance of GMMs. Preliminary experiments show the efficacy of our proposed Riemannian proximal policy optimization algorithm.
    Comment: 12 pages, 1 figure
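
    The policy-improvement bound mentioned above is built from Wasserstein distances between Gaussian mixture components; as a small illustration, the sketch below computes the closed-form 2-Wasserstein distance between two Gaussian components. This is only the basic ingredient, not the paper's GMM bound.

    # Closed-form 2-Wasserstein distance between two Gaussians (illustrative sketch).
    import numpy as np
    from scipy.linalg import sqrtm

    def w2_gaussian(m1, S1, m2, S2):
        """2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
        S2_half = np.real(sqrtm(S2))
        cross = np.real(sqrtm(S2_half @ S1 @ S2_half))
        bures = np.trace(S1 + S2 - 2.0 * cross)           # Bures metric term
        return float(np.sqrt(np.sum((m1 - m2) ** 2) + max(bures, 0.0)))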