
    A Finite Time Analysis of Two Time-Scale Actor Critic Methods

    Actor-critic (AC) methods have exhibited great empirical success compared with other reinforcement learning algorithms: the actor uses the policy gradient to improve the policy, and the critic uses temporal difference learning to estimate the policy gradient. Under the two time-scale learning rate schedule, the asymptotic convergence of AC has been well studied in the literature, but its non-asymptotic convergence and finite sample complexity remain largely open. In this work, we provide a non-asymptotic analysis of two time-scale actor-critic methods in the non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\boldsymbol{\theta})\|_2^2 \le \epsilon$) of the non-concave performance function $J(\boldsymbol{\theta})$ with $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing a finite-time analysis and sample complexity bound for two time-scale actor-critic methods.
    Comment: 45 pages
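
    Below is a minimal sketch, in Python, of the two time-scale idea the abstract describes, not the paper's algorithm itself: a tabular softmax actor and a TD(0) critic are updated along a single Markov trajectory (so the samples are non-i.i.d.), with the critic's step size decaying more slowly than the actor's so that it tracks the value function on the fast time scale. The MDP, the step-size schedules, and all constants are assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 4, 2, 0.95  # illustrative problem size
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transitions
    R = rng.standard_normal((n_states, n_actions))                    # rewards

    theta = np.zeros((n_states, n_actions))  # actor: softmax policy parameters
    v = np.zeros(n_states)                   # critic: tabular value estimates

    def policy(s):
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    s = 0
    for t in range(1, 50001):
        alpha = 0.5 / t ** 0.6  # actor (slow) step size -- assumed schedule
        beta = 0.5 / t ** 0.4   # critic (fast) step size decays more slowly

        pi = policy(s)
        a = rng.choice(n_actions, p=pi)
        s_next = rng.choice(n_states, p=P[s, a])

        # Critic: TD(0) update of the value estimate.
        delta = R[s, a] + gamma * v[s_next] - v[s]
        v[s] += beta * delta

        # Actor: policy-gradient step, using the TD error as the advantage signal.
        grad_log = -pi
        grad_log[a] += 1.0
        theta[s] += alpha * delta * grad_log

        s = s_next  # continue the single trajectory: updates use non-i.i.d. samples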

    Hierarchical Knowledge-Gradient for Sequential Sampling

    We consider the problem of selecting the best of a finite but very large set of alternatives. Each alternative may be characterized by a multi-dimensional vector and has independent normal rewards. This problem arises in various settings such as (i) ranking and selection, (ii) simulation optimization, where the unknown mean of each alternative is estimated with stochastic simulation output, and (iii) approximate dynamic programming, where we need to estimate values based on Monte Carlo simulation. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. Because the number of alternatives is large, we propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement, thus greatly reducing the measurement effort required. We demonstrate how this hierarchical knowledge-gradient policy can be applied to efficiently maximize a continuous function, and we prove that this policy finds a globally optimal alternative in the limit.
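
    As a concrete reference point, here is a minimal Python sketch of the basic knowledge-gradient policy for independent normal rewards with known measurement noise; the hierarchical aggregation layer described above is omitted, and the problem size, priors, and noise level are assumptions made for the example. Each round it measures the alternative whose expected one-step improvement in the maximum posterior mean is largest.

    import math
    import numpy as np

    rng = np.random.default_rng(1)
    M, noise_sd = 20, 1.0                # assumed problem size and noise level
    true_means = rng.standard_normal(M)  # unknown in practice; used only to simulate

    mu = np.zeros(M)               # posterior means of the alternatives' rewards
    prec = np.full(M, 0.1)         # posterior precisions (1 / variance)
    beta_y = 1.0 / noise_sd ** 2   # measurement precision

    def kg_factor():
        # Std. dev. of the one-step change in each alternative's posterior mean.
        sigma_tilde = np.sqrt(1.0 / prec - 1.0 / (prec + beta_y))
        best_other = np.array([np.delete(mu, x).max() for x in range(M)])
        z = -np.abs(mu - best_other) / sigma_tilde
        # f(z) = z * Phi(z) + phi(z): expected improvement in the max posterior mean.
        Phi = 0.5 * (1.0 + np.array([math.erf(zi / math.sqrt(2.0)) for zi in z]))
        phi = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
        return sigma_tilde * (z * Phi + phi)

    for n in range(200):
        x = int(np.argmax(kg_factor()))  # myopically most informative measurement
        y = true_means[x] + noise_sd * rng.standard_normal()
        mu[x] = (prec[x] * mu[x] + beta_y * y) / (prec[x] + beta_y)  # conjugate update
        prec[x] += beta_y

    print("estimated best:", int(np.argmax(mu)), "true best:", int(np.argmax(true_means)))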