Algorithms for First-order Sparse Reinforcement Learning
This thesis presents a general framework for first-order temporal difference (TD) learning algorithms with an in-depth theoretical analysis. The main contribution of the thesis is the design of a family of first-order regularized TD algorithms based on stochastic approximation and stochastic optimization. To scale TD algorithms up to large problems, we use first-order optimization to explore regularized TD methods with linear value function approximation. Previous regularized TD methods often rely on matrix inversion, which requires cubic computation time and quadratic memory. We propose two algorithms, sparse-Q and RO-TD, for on-policy and off-policy learning, respectively. Both algorithms have linear per-step computational complexity, and their asymptotic convergence guarantees and error bounds are established using stochastic optimization and stochastic approximation. The second major contribution of the thesis is a unified general framework for stochastic-gradient-based TD learning algorithms that use proximal gradient methods. A primal-dual saddle-point formulation is introduced, and state-of-the-art stochastic gradient solvers, such as mirror descent and the extragradient method, are used to design several novel RL algorithms. Theoretical analysis is given, including regularization, acceleration analysis, and finite-sample analysis, along with detailed empirical experiments demonstrating the effectiveness of the proposed algorithms.
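To make the first-order regularized TD idea concrete, the following is a minimal sketch of one update of TD(0) with linear value function approximation followed by an l1 proximal (soft-thresholding) step; both operations cost O(d) per transition, in contrast to the cubic-time matrix inversion of earlier regularized TD methods. The function names, constants, and synthetic data are illustrative assumptions, not the thesis's exact sparse-Q or RO-TD updates.

    import numpy as np

    def soft_threshold(x, tau):
        """Proximal operator of tau * ||.||_1 (elementwise soft-thresholding)."""
        return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

    def regularized_td0_step(theta, phi, phi_next, reward, gamma, alpha, lam):
        """One O(d) update: TD(0) semi-gradient step followed by an l1 proximal step."""
        td_error = reward + gamma * phi_next @ theta - phi @ theta
        theta = theta + alpha * td_error * phi       # first-order TD step
        return soft_threshold(theta, alpha * lam)    # sparsity-inducing proximal step

    # Toy usage on synthetic transitions (illustrative only).
    rng = np.random.default_rng(0)
    d = 50
    theta = np.zeros(d)
    for _ in range(1000):
        phi, phi_next = rng.normal(size=d), rng.normal(size=d)
        theta = regularized_td0_step(theta, phi, phi_next, rng.normal(),
                                     gamma=0.95, alpha=0.01, lam=0.1)
    print("nonzero weights:", np.count_nonzero(theta))

The soft-thresholding step is what induces sparsity in the learned weights; for the off-policy case, the thesis builds on a primal-dual saddle-point formulation rather than the plain TD(0) objective shown here.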
Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling
Despite the wide applications of Adam in reinforcement learning (RL), the theoretical convergence of Adam-type RL algorithms has not been established. This paper provides the first such convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning that incorporate AMSGrad updates (a standard alternative to Adam in theoretical analysis), referred to as PG-AMSGrad and TD-AMSGrad, respectively. Moreover, our analysis focuses on Markovian sampling for both algorithms. We show that under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of O(1/T) (where T denotes the number of iterations), and with a diminishing stepsize converges exactly to a stationary point at the rate of O(log^2 T / sqrt(T)). Furthermore, under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of O(1/T), and with a diminishing stepsize converges exactly to the global optimum at the rate of O(log T / T). Our study develops new techniques for analyzing Adam-type RL algorithms under Markovian sampling.
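As a concrete illustration of the TD-AMSGrad update analyzed above, the sketch below applies AMSGrad-style adaptive stepsizes to the TD(0) semi-gradient under linear function approximation. The sampling loop, constants, and synthetic transitions are illustrative assumptions rather than the paper's exact construction (which, in particular, handles Markovian rather than i.i.d. samples).

    import numpy as np

    def td_amsgrad(transitions, d, gamma=0.95, alpha=0.01,
                   beta1=0.9, beta2=0.999, eps=1e-8):
        """TD(0) with AMSGrad-style adaptive stepsizes and linear features."""
        theta = np.zeros(d)
        m = np.zeros(d)        # first-moment estimate of the pseudo-gradient
        v = np.zeros(d)        # second-moment estimate
        v_hat = np.zeros(d)    # running max of v (the AMSGrad correction)
        for phi, reward, phi_next in transitions:
            td_error = reward + gamma * phi_next @ theta - phi @ theta
            g = -td_error * phi                      # negated TD semi-gradient
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            v_hat = np.maximum(v_hat, v)             # keep v non-decreasing
            theta = theta - alpha * m / (np.sqrt(v_hat) + eps)
        return theta

    # Toy usage on synthetic (i.i.d.) transitions, for illustration only.
    rng = np.random.default_rng(1)
    d = 20
    transitions = [(rng.normal(size=d), rng.normal(), rng.normal(size=d))
                   for _ in range(2000)]
    print(td_amsgrad(transitions, d)[:5])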
Optimization Foundations of Reinforcement Learning
Reinforcement learning (RL) has attracted rapidly increasing interest in the machine learning and artificial intelligence communities in the past decade. With tremendous success already demonstrated for Game AI, RL offers great potential for applications in more complex, real-world domains, for example in robotics, autonomous driving, and even drug discovery. Although researchers have devoted a lot of engineering effort to deploying RL methods at scale, many state-of-the-art RL techniques still seem mysterious, with limited theoretical guarantees on their behaviour in practice.
In this thesis, we focus on understanding convergence guarantees for two key ideas in reinforcement learning, namely temporal difference (TD) learning and policy gradient methods, from an optimization perspective. In Chapter 2, we provide a simple and explicit finite-time analysis of TD learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. Our convergence results extend seamlessly to the study of TD learning with eligibility traces, known as TD(λ), and to Q-learning for a class of high-dimensional optimal stopping problems.
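As a pointer to the object of study in Chapter 2, the sketch below runs TD(λ) with accumulating eligibility traces and linear (here, one-hot) features on a small random-walk chain. The environment, constants, and episode budget are hypothetical illustrations, not code from the thesis.

    import numpy as np

    class RandomWalk:
        """5-state random walk with one-hot features; reward +1 at the right terminal."""
        def __init__(self, n=5, seed=0):
            self.n = n
            self.rng = np.random.default_rng(seed)
        def reset(self):
            self.s = self.n // 2
            return self._phi(self.s)
        def step(self):
            self.s += 1 if self.rng.random() < 0.5 else -1
            done = self.s < 0 or self.s >= self.n
            reward = 1.0 if self.s >= self.n else 0.0
            phi_next = np.zeros(self.n) if done else self._phi(self.s)
            return phi_next, reward, done
        def _phi(self, s):
            phi = np.zeros(self.n)
            phi[s] = 1.0
            return phi

    def td_lambda(env, num_episodes, d, gamma=1.0, lam=0.9, alpha=0.1):
        """TD(lambda) with accumulating traces and linear value function approximation."""
        theta = np.zeros(d)
        for _ in range(num_episodes):
            phi = env.reset()
            z = np.zeros(d)                     # eligibility trace
            done = False
            while not done:
                phi_next, reward, done = env.step()
                td_error = reward + gamma * phi_next @ theta - phi @ theta
                z = gamma * lam * z + phi       # accumulate the trace
                theta = theta + alpha * td_error * z
                phi = phi_next
        return theta

    # Roughly approaches the true values [1/6, 2/6, 3/6, 4/6, 5/6] for this chain.
    print(td_lambda(RandomWalk(), num_episodes=2000, d=5))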
In Chapter 3, we turn our attention to policy gradient methods and present a simple and general understanding of their global convergence properties. The main challenge here is that even for simple control problems, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to a stationary point of the objective. We identify structural properties -- shared by finite MDPs and several classic control problems -- which guarantee that, despite non-convexity, any stationary point of the policy gradient objective is globally optimal. In the final chapter, we extend our analysis for finite MDPs to show linear convergence guarantees for many popular variants of policy gradient methods, such as projected policy gradient, Frank-Wolfe, mirror descent, and natural policy gradients.
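To illustrate the finite-MDP setting in which stationary points are globally optimal, the following sketch runs exact softmax policy gradient ascent on a small synthetic MDP, with the gradient computed from the policy gradient theorem using the discounted state-visitation measure. The MDP, step size, and iteration count are illustrative assumptions; the projected, mirror-descent, and natural policy gradient variants treated in the final chapter are not shown.

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, gamma = 4, 3, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
    R = rng.random((S, A))                       # rewards in [0, 1]
    rho = np.full(S, 1.0 / S)                    # initial state distribution

    def policy(theta):
        """Softmax (tabular) policy."""
        e = np.exp(theta - theta.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def objective_and_gradient(theta):
        """Exact J(theta) and its gradient via the policy gradient theorem."""
        pi = policy(theta)
        P_pi = np.einsum('sa,sap->sp', pi, P)    # state-to-state transitions under pi
        r_pi = (pi * R).sum(axis=1)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        Q = R + gamma * P @ V
        d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)   # discounted visitation
        grad = d[:, None] * pi * (Q - V[:, None])              # softmax PG theorem
        return rho @ V, grad

    theta = np.zeros((S, A))
    for _ in range(5000):
        J, grad = objective_and_gradient(theta)
        theta = theta + 0.1 * grad               # plain gradient ascent
    print("final objective:", objective_and_gradient(theta)[0])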