1,107 research outputs found

    A Finite Time Analysis of Two Time-Scale Actor Critic Methods

    Actor-critic (AC) methods have exhibited great empirical success compared with other reinforcement learning algorithms, where the actor uses the policy gradient to improve the learning policy and the critic uses temporal difference learning to estimate the policy gradient. Under the two time-scale learning rate schedule, the asymptotic convergence of AC has been well studied in the literature. However, the non-asymptotic convergence and finite sample complexity of actor-critic methods are largely open. In this work, we provide a non-asymptotic analysis for two time-scale actor-critic methods in the non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\boldsymbol{\theta})\|_2^2 \le \epsilon$) of the non-concave performance function $J(\boldsymbol{\theta})$, with $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing finite-time analysis and sample complexity bound for two time-scale actor-critic methods. Comment: 45 page
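To illustrate what a "two time-scale learning rate schedule" can look like, here is a minimal sketch in Python. The polynomial decay form, the function name `step_sizes`, and the constants and exponents are assumptions chosen for illustration, not values taken from the abstract. The only property the sketch relies on is that the actor step size decays faster than the critic step size, so the critic effectively runs on the faster time scale.

```python
def step_sizes(t, c_actor=1.0, c_critic=1.0, sigma=0.6, nu=0.4):
    """Return (actor_step, critic_step) at iteration t >= 0.

    Assumed schedule (illustrative only):
        actor_step  = c_actor  / (1 + t) ** sigma
        critic_step = c_critic / (1 + t) ** nu
    with sigma > nu, so actor_step / critic_step -> 0 as t grows.
    """
    if not sigma > nu > 0:
        raise ValueError("need sigma > nu > 0 for time-scale separation")
    actor_step = c_actor / (1 + t) ** sigma
    critic_step = c_critic / (1 + t) ** nu
    return actor_step, critic_step


if __name__ == "__main__":
    # Print the ratio actor/critic at a few iterations; it should shrink.
    for t in (0, 9, 99, 999):
        a, c = step_sizes(t)
        print(t, a, c, a / c)
```

With these assumed exponents, the ratio of actor to critic step size is `(1 + t) ** -0.2`, which tends to zero: the actor's parameters look nearly frozen from the critic's point of view.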

    Algorithms for CVaR Optimization in MDPs

    In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in costs in addition to minimizing a standard criterion. Conditional value-at-risk (CVaR) is a relatively new risk measure that addresses some of the shortcomings of the well-known variance-related risk measures, and because of its computational efficiencies has gained popularity in finance and operations research. In this paper, we consider the mean-CVaR optimization problem in MDPs. We first derive a formula for computing the gradient of this risk-sensitive objective function. We then devise policy gradient and actor-critic algorithms, each of which uses a specific method to estimate this gradient and updates the policy parameters in the descent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in an optimal stopping problem. Comment: Submitted to NIPS 1
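As a concrete reading of the CVaR risk measure mentioned above, the following Python sketch estimates value-at-risk and CVaR from a finite sample of costs. It is not the paper's algorithm: the function name `empirical_var_cvar`, the confidence level `alpha`, and the sample data are assumptions for illustration. The estimate uses the standard Rockafellar-Uryasev representation, CVaR = VaR + E[(C - VaR)^+] / (1 - alpha), applied to the empirical distribution.

```python
import math


def empirical_var_cvar(costs, alpha):
    """Return (VaR, CVaR) at level alpha for a non-empty sample of costs.

    VaR is the smallest sample value whose empirical CDF is >= alpha.
    CVaR is VaR plus the mean excess over VaR, scaled by 1 / (1 - alpha).
    """
    if not costs:
        raise ValueError("costs must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(costs)
    n = len(ordered)
    # Index of the empirical alpha-quantile (0-based).
    k = max(math.ceil(alpha * n) - 1, 0)
    var = ordered[k]
    # Average amount by which costs exceed VaR, over the whole sample.
    mean_excess = sum(max(c - var, 0.0) for c in ordered) / n
    cvar = var + mean_excess / (1.0 - alpha)
    return var, cvar


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0]  # hypothetical episode costs
    print(empirical_var_cvar(sample, alpha=0.5))
```

For the four-point sample and `alpha = 0.5`, the CVaR is the average of the worst half of the costs, which is how the measure captures tail risk that the mean alone does not.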

    Finite Time Analysis of Constrained Actor Critic and Constrained Natural Actor Critic Algorithms

    Actor Critic methods have been applied to a wide range of Reinforcement Learning tasks, especially when the state-action space is large. In this paper, we consider actor critic and natural actor critic algorithms with function approximation for constrained Markov decision processes (C-MDP) involving inequality constraints and carry out a non-asymptotic analysis for both of these algorithms in a non-i.i.d. (Markovian) setting. We consider the long-run average cost criterion where both the objective and the constraint functions are suitable policy-dependent long-run averages of certain prescribed cost functions. We handle the inequality constraints using the Lagrange multiplier method. We prove that these algorithms are guaranteed to find a first-order stationary point (i.e., $\Vert \nabla L(\theta,\gamma)\Vert_2^2 \leq \epsilon$) of the performance (Lagrange) function $L(\theta,\gamma)$, with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ in the case of both Constrained Actor Critic (C-AC) and Constrained Natural Actor Critic (C-NAC) algorithms. We also show the results of experiments on a few different grid world settings and observe good empirical performance using both of these algorithms. In particular, for large grid sizes, Constrained Natural Actor Critic shows slightly better results than Constrained Actor Critic, while the latter is slightly better for a small grid size.
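For orientation, a Lagrange function of the kind described above is commonly written as follows. This is a generic single-constraint sketch, not the paper's exact definition: the symbols $J(\theta)$ (long-run average objective cost), $G(\theta)$ (long-run average constraint cost) and the threshold $\alpha$ are assumed notation; only $L(\theta,\gamma)$ and the stationarity condition come from the abstract.

```latex
% Assumed notation: J = objective cost, G = constraint cost, \alpha = threshold.
\begin{align*}
  &\min_{\theta}\; J(\theta)
     \quad \text{subject to} \quad G(\theta) \le \alpha, \\
  &L(\theta,\gamma) \;=\; J(\theta) + \gamma\,\bigl(G(\theta) - \alpha\bigr),
     \qquad \gamma \ge 0, \\
  &\text{$\epsilon$-stationarity (as stated in the abstract):}\quad
     \Vert \nabla L(\theta,\gamma) \Vert_2^2 \le \epsilon .
\end{align*}
```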