Algorithms for First-order Sparse Reinforcement Learning
This thesis presents a general framework for first-order temporal-difference learning algorithms with an in-depth theoretical analysis. The main contribution of the thesis is the development and design of a family of first-order regularized temporal-difference (TD) algorithms using stochastic approximation and stochastic optimization. To scale TD algorithms up to large problems, we use first-order optimization to explore regularized TD methods with linear value function approximation. Previous regularized TD methods often use matrix inversion, which requires cubic time and quadratic memory. We propose two algorithms, sparse-Q and RO-TD, for on-policy and off-policy learning, respectively. Both algorithms have linear per-step computational complexity, and their asymptotic convergence guarantees and error bounds are analyzed using stochastic optimization and stochastic approximation. The second major contribution of the thesis is a unified general framework for stochastic-gradient-based temporal-difference learning algorithms that use proximal gradient methods. The primal-dual saddle-point formulation is introduced, and state-of-the-art stochastic gradient solvers, such as mirror descent and extragradient, are used to design several novel RL algorithms. Theoretical analysis is given, including regularization, acceleration analysis, and finite-sample analysis, along with detailed empirical experiments that demonstrate the effectiveness of the proposed algorithms.
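To make the "linear per-step" claim concrete, here is a minimal sketch of a TD(0) update with linear features followed by an L1 proximal (soft-thresholding) step. It is not the thesis's sparse-Q or RO-TD algorithm; the function names and the parameters `alpha` (step size) and `lam` (L1 weight) are assumptions made for illustration.

```python
def soft_threshold(x, tau):
    """Proximal operator of tau * |x|: shrink x toward zero by tau."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def l1_td0_step(theta, phi, phi_next, reward, gamma, alpha, lam):
    """One TD(0) step with linear features, then an L1 proximal step.

    Hypothetical illustration: the cost is O(d) in the number of
    features d, because no matrix is ever formed or inverted.
    """
    v = sum(t * f for t, f in zip(theta, phi))
    v_next = sum(t * f for t, f in zip(theta, phi_next))
    delta = reward + gamma * v_next - v  # TD error
    return [
        soft_threshold(t + alpha * delta * f, alpha * lam)
        for t, f in zip(theta, phi)
    ]
```

Weights whose update stays within `alpha * lam` of zero are set exactly to zero, which is how a proximal L1 step can yield sparse solutions.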
Breaking the Deadly Triad with a Target Network
The deadly triad refers to the instability of a reinforcement learning
algorithm when it employs off-policy learning, function approximation, and
bootstrapping simultaneously. In this paper, we investigate the target network
as a tool for breaking the deadly triad, providing theoretical support for the
conventional wisdom that a target network stabilizes training. We first propose
and analyze a novel target network update rule which augments the commonly used
Polyak-averaging style update with two projections. We then apply the target
network and ridge regularization in several divergent algorithms and show their
convergence to regularized TD fixed points. Those algorithms are off-policy
with linear function approximation and bootstrapping, spanning both policy
evaluation and control, as well as both discounted and average-reward settings.
In particular, we provide the first convergent linear Q-learning algorithms
under nonrestrictive and changing behavior policies without bi-level
optimization. Comment: ICML 202
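A minimal sketch of a Polyak-averaging target update augmented with two projections, as the abstract describes it in words. The choice of L2 balls as projection sets, the order of the projections, and all names and radii below are assumptions; the paper's exact rule may differ.

```python
import math


def project_l2_ball(x, radius):
    """Project vector x onto the L2 ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    scale = radius / norm
    return [v * scale for v in x]


def target_update(target, weights, beta, r_target, r_weights):
    """Polyak-style target update with two projections (hypothetical).

    1. Project the main weights onto a ball of radius r_weights.
    2. Move the target a fraction beta toward the projected weights.
    3. Project the result onto a ball of radius r_target.
    """
    clipped = project_l2_ball(weights, r_weights)
    mixed = [t + beta * (w - t) for t, w in zip(target, clipped)]
    return project_l2_ball(mixed, r_target)
```

With both radii set very large, this reduces to the usual Polyak averaging `target + beta * (weights - target)`.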
A Dantzig Selector Approach to Temporal Difference Learning
LSTD is a popular algorithm for value function approximation. Whenever the
number of features is larger than the number of samples, it must be paired with
some form of regularization. In particular, L1-regularization methods tend to
perform feature selection by promoting sparsity, and thus, are well-suited for
high-dimensional problems. However, since LSTD is not a simple regression
algorithm but instead solves a fixed-point problem, its integration with
L1-regularization is not straightforward and might come with some drawbacks
(e.g., the P-matrix assumption for LASSO-TD). In this paper, we introduce a
novel algorithm obtained by integrating LSTD with the Dantzig Selector. We
investigate the performance of the proposed algorithm and its relationship with
the existing regularized approaches, and show how it addresses some of their
drawbacks. Comment: Appears in Proceedings of the 29th International Conference on
Machine Learning (ICML 2012)
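As a hedged illustration of what a Dantzig Selector-style constraint might look like in the TD setting, the sketch below measures how strongly the TD errors correlate with each feature and checks that the largest correlation stays under a threshold. The formulation (minimize the L1 norm of `theta` subject to this bound), the function names, and the parameter `lam` are assumptions, not taken from the abstract; solving the actual optimization problem would need a linear-programming solver, which is not shown.

```python
def dantzig_residual(features, next_features, rewards, theta, gamma):
    """Max absolute correlation between TD errors and each feature.

    Hypothetical helper: features[i] and next_features[i] are the
    feature vectors of the i-th transition's state and next state.
    """
    d = len(theta)
    corr = [0.0] * d
    for phi, phi_next, r in zip(features, next_features, rewards):
        v = sum(t * f for t, f in zip(theta, phi))
        v_next = sum(t * f for t, f in zip(theta, phi_next))
        delta = r + gamma * v_next - v  # TD error of this sample
        for j in range(d):
            corr[j] += phi[j] * delta
    return max(abs(c) for c in corr)


def is_feasible(features, next_features, rewards, theta, gamma, lam):
    """True if theta satisfies the assumed infinity-norm constraint."""
    return dantzig_residual(features, next_features, rewards, theta, gamma) <= lam
```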
Control Regularization for Reduced Variance Reinforcement Learning
Dealing with high variance is a significant challenge in model-free
reinforcement learning (RL). Existing methods are unreliable, exhibiting high
variance in performance from run to run using different initializations/seeds.
Focusing on problems arising in continuous control, we propose a functional
regularization approach to augmenting model-free RL. In particular, we
regularize the behavior of the deep policy to be similar to a policy prior,
i.e., we regularize in function space. We show that functional regularization
yields a bias-variance trade-off, and propose an adaptive tuning strategy to
optimize this trade-off. When the policy prior has control-theoretic stability
guarantees, we further show that this regularization approximately preserves
those stability guarantees throughout learning. We validate our approach
empirically on a range of settings, and demonstrate significantly reduced
variance, guaranteed dynamic stability, and more efficient learning than deep
RL alone. Comment: Appearing in ICML 201
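One simple way to keep a learned controller close to a policy prior is to blend the two actions with a regularization weight. The sketch below assumes a weighted average controlled by a hypothetical parameter `lam`; the abstract does not spell out the mixing rule, so treat this as an illustration rather than the paper's method.

```python
def regularized_action(u_rl, u_prior, lam):
    """Blend a learned action with a prior action (hypothetical rule).

    lam = 0 returns the learned action unchanged; as lam grows,
    the result approaches the prior controller's action.
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    w = 1.0 / (1.0 + lam)  # weight on the learned action
    return [w * a + (1.0 - w) * b for a, b in zip(u_rl, u_prior)]
```

A larger `lam` leans harder on the prior, which should reduce variance at the cost of bias toward the prior; a smaller `lam` does the opposite. This is the bias-variance trade-off the abstract refers to.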