Algorithms for First-order Sparse Reinforcement Learning
This thesis presents a general framework for first-order temporal-difference (TD) learning algorithms with an in-depth theoretical analysis. The main contribution of the thesis is the design of a family of first-order regularized TD algorithms using stochastic approximation and stochastic optimization. To scale TD algorithms to large-scale problems, we use first-order optimization to explore regularized TD methods with linear value function approximation. Previous regularized TD methods often rely on matrix inversion, which requires cubic time and quadratic memory. We propose two algorithms, sparse-Q and RO-TD, for on-policy and off-policy learning, respectively. Both algorithms have linear per-step computational complexity, and their asymptotic convergence guarantees and error bounds are established using stochastic optimization and stochastic approximation. The second major contribution of the thesis is a unified general framework for stochastic-gradient-based TD learning algorithms that use proximal gradient methods. A primal-dual saddle-point formulation is introduced, and state-of-the-art stochastic gradient solvers, such as mirror descent and extragradient, are used to design several novel RL algorithms. Theoretical analysis is given, including regularization, acceleration, and finite-sample analysis, along with detailed empirical experiments demonstrating the effectiveness of the proposed algorithms.
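The core computational pattern behind such first-order regularized TD methods can be sketched briefly. Below is a minimal sketch, assuming an ℓ1 regularizer handled by a proximal (soft-thresholding) step after a semi-gradient TD(0) update; the feature vectors, step size, and regularization weight are illustrative assumptions, not the thesis's exact sparse-Q or RO-TD derivation.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def regularized_td0_step(theta, phi_s, phi_s_next, reward,
                         gamma=0.99, alpha=0.05, lam=1e-3):
    """One l1-regularized TD(0) update on linear value weights theta.

    phi_s, phi_s_next: feature vectors of the current and next state.
    A semi-gradient TD step is followed by a proximal (soft-threshold)
    step, so the per-update cost stays linear in the feature dimension.
    All hyperparameter values here are illustrative assumptions.
    """
    td_error = reward + gamma * phi_s_next @ theta - phi_s @ theta
    theta = theta + alpha * td_error * phi_s      # semi-gradient TD step
    return soft_threshold(theta, alpha * lam)     # proximal l1 step

# Toy usage: random features standing in for an actual trajectory.
rng = np.random.default_rng(0)
theta = np.zeros(8)
for _ in range(1000):
    phi, phi_next = rng.normal(size=8), rng.normal(size=8)
    theta = regularized_td0_step(theta, phi, phi_next, reward=rng.normal())
```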
Optimization Foundations of Reinforcement Learning
Reinforcement learning (RL) has attracted rapidly increasing interest in the machine learning and artificial intelligence communities over the past decade. With tremendous success already demonstrated for game AI, RL offers great potential for applications in more complex, real-world domains, for example in robotics, autonomous driving, and even drug discovery. Although researchers have devoted a lot of engineering effort to deploying RL methods at scale, many state-of-the-art RL techniques still seem mysterious, with limited theoretical guarantees on their behaviour in practice.
In this thesis, we focus on understanding convergence guarantees for two key ideas in reinforcement learning, namely temporal difference (TD) learning and policy gradient methods, from an optimization perspective. In Chapter 2, we provide a simple and explicit finite-time analysis of TD learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. Our convergence results extend seamlessly to the study of TD learning with eligibility traces, known as TD(λ), and to Q-learning for a class of high-dimensional optimal stopping problems.
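As a concrete reference point for the algorithm analyzed in Chapter 2, here is a minimal sketch of linear TD(λ) with accumulating eligibility traces; the random-walk environment, one-hot features, and hyperparameters are illustrative assumptions rather than the thesis's experimental setup.

```python
import numpy as np

def td_lambda(env_step, phi, theta_dim, episodes=200,
              gamma=0.95, lam=0.8, alpha=0.05):
    """Linear TD(lambda) with accumulating eligibility traces.

    env_step(state) -> (next_state, reward, done); phi(state) -> features.
    Episodes start from state 0 for simplicity (an assumption of this toy).
    """
    theta = np.zeros(theta_dim)
    for _ in range(episodes):
        state, z, done = 0, np.zeros(theta_dim), False  # reset state and trace
        while not done:
            next_state, reward, done = env_step(state)
            v = phi(state) @ theta
            v_next = 0.0 if done else phi(next_state) @ theta
            delta = reward + gamma * v_next - v          # TD error
            z = gamma * lam * z + phi(state)             # trace update
            theta += alpha * delta * z                   # weight update
            state = next_state
    return theta

# Toy 5-state random walk: move left/right, reward 1 at the right end.
N = 5
def env_step(s):
    s2 = s + (1 if np.random.rand() < 0.5 else -1)
    if s2 >= N: return s, 1.0, True
    if s2 < 0:  return s, 0.0, True
    return s2, 0.0, False

theta = td_lambda(env_step, lambda s: np.eye(N)[s], theta_dim=N)
```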
In Chapter 3, we turn our attention to policy gradient methods and present a simple and general understanding of their global convergence properties. The main challenge here is that even for simple control problems, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to a stationary point of the objective. We identify structural properties -- shared by finite MDPs and several classic control problems -- which guarantee that, despite non-convexity, any stationary point of the policy gradient objective is globally optimal. In the final chapter, we extend our analysis for finite MDPs to show linear convergence guarantees for many popular variants of policy gradient methods, such as projected policy gradient, Frank-Wolfe, mirror descent, and natural policy gradient.
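To make the non-convexity discussion concrete, the following is a minimal sketch of vanilla softmax policy gradient (a REINFORCE-style Monte Carlo estimator) on a tiny finite MDP; the two-state MDP, horizon, and step size are illustrative assumptions, and the projected, mirror-descent, and natural-gradient variants mentioned above would change only the update rule.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny finite MDP (invented for illustration): 2 states, 2 actions.
# P[s, a] is the next-state distribution; R[s, a] is the reward.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, horizon = 0.95, 50

def softmax_policy(theta, s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def episode(theta):
    """Roll out one trajectory; yield (state, action, return-to-go) in time order."""
    s, traj, rewards = 0, [], []
    for _ in range(horizon):
        a = rng.choice(2, p=softmax_policy(theta, s))
        traj.append((s, a)); rewards.append(R[s, a])
        s = rng.choice(2, p=P[s, a])
    G, out = 0.0, []
    for (s, a), r in zip(reversed(traj), reversed(rewards)):
        G = r + gamma * G
        out.append((s, a, G))
    return reversed(out)

theta = np.zeros((2, 2))
for _ in range(2000):                       # REINFORCE updates
    for t, (s, a, G) in enumerate(episode(theta)):
        grad_logp = -softmax_policy(theta, s)
        grad_logp[a] += 1.0                 # d log pi(a|s) / d theta[s]
        theta[s] += 0.01 * (gamma ** t) * G * grad_logp
```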
Q-learning with Nearest Neighbors
We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path under an arbitrary policy of the system is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm to learn the optimal Q-function using a nearest neighbor regression method. As the main contribution, we provide a tight finite-sample analysis of the convergence rate. In particular, for MDPs with a d-dimensional state space and discount factor γ ∈ (0, 1), given an arbitrary sample path with "covering time" L, we establish that the algorithm is guaranteed to output an ε-accurate estimate of the optimal Q-function using Õ(L / (ε^3 (1−γ)^7)) samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as Õ(1/ε^d), so the sample complexity scales as Õ(1/ε^(d+3)). Indeed, we establish a lower bound arguing that a dependence of Ω̃(1/ε^(d+2)) is necessary.
Comment: Accepted to NIPS 2018
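A minimal sketch of the nearest-neighbor idea, assuming a fixed grid of anchor points over a one-dimensional state space: each observed transition updates the Q-value stored at the anchor nearest the current state, with the target evaluated at the anchor nearest the next state. The paper's NNQL algorithm (its averaging over covering balls and its step-size schedule) is more involved; the toy dynamics and reward below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Anchors: a uniform grid covering the 1-D state space [0, 1] (assumption).
anchors = np.linspace(0.0, 1.0, 21)
n_actions, gamma, alpha = 2, 0.9, 0.1
Q = np.zeros((len(anchors), n_actions))

def nearest(s):
    """Index of the anchor closest to continuous state s."""
    return int(np.argmin(np.abs(anchors - s)))

def env_step(s, a):
    """Toy dynamics: drift left/right with noise; reward grows toward s = 1."""
    s2 = np.clip(s + (0.1 if a == 1 else -0.1) + 0.02 * rng.normal(), 0.0, 1.0)
    return s2, float(s2)   # reward equals the next state (assumption)

# Single sample path under a uniformly random behavior policy.
s = 0.5
for _ in range(50_000):
    a = rng.integers(n_actions)
    s2, r = env_step(s, a)
    i, j = nearest(s), nearest(s2)
    # Q-learning target evaluated via nearest-neighbor lookup.
    Q[i, a] += alpha * (r + gamma * Q[j].max() - Q[i, a])
    s = s2
```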