13 research outputs found
Dantzig Selector with an Approximately Optimal Denoising Matrix and its Application to Reinforcement Learning
Dantzig Selector (DS) is widely used in compressed sensing and sparse
learning for feature selection and sparse signal recovery. Since the DS
formulation is essentially a linear program, many existing
linear programming solvers can readily be applied for scaling up. The DS
formulation can be interpreted as a basis pursuit denoising problem, wherein the
data matrix (or measurement matrix) is employed as the denoising matrix to
eliminate the observation noise. However, we notice that the data matrix may
not be the optimal denoising matrix, as shown by a simple counter-example. This
motivates us to pursue a better denoising matrix for defining a general DS
formulation. We first define the optimal denoising matrix through a minimax
optimization, which turns out to be an NP-hard problem. To make the problem
computationally tractable, we propose a novel algorithm, termed Optimal
Denoising Dantzig Selector (ODDS), to approximately estimate the optimal
denoising matrix. Empirical experiments validate the proposed method. Finally,
a novel sparse reinforcement learning algorithm is formulated by extending the
proposed ODDS algorithm to temporal difference learning, and empirical
results demonstrate that it outperforms the conventional vanilla DS-TD
algorithm.
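
To make the linear-programming structure of the vanilla Dantzig Selector concrete, the following minimal Python sketch (illustrative only, not the authors' code; scipy's linprog is assumed available) solves min ||b||_1 subject to ||X^T(y - X b)||_inf <= lam by splitting b into nonnegative parts u and v.

import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    # Vanilla Dantzig Selector as a linear program (illustrative sketch):
    #   min ||b||_1   s.t.   ||X^T (y - X b)||_inf <= lam
    # Split b = u - v with u, v >= 0 so the objective becomes linear.
    n, p = X.shape
    G = X.T @ X                      # p x p
    c = X.T @ y                      # length p
    obj = np.ones(2 * p)             # sum(u) + sum(v) = ||b||_1
    # Constraints:  G(u - v) <= c + lam   and   -G(u - v) <= -c + lam
    A_ub = np.vstack([np.hstack([G, -G]),
                      np.hstack([-G, G])])
    b_ub = np.concatenate([c + lam, -c + lam])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    z = res.x
    return z[:p] - z[p:]
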
Mirror Descent Search and its Acceleration
In recent years, attention has been focused on the relationship between
black-box optimization problems and reinforcement learning problems. In this
research, we propose the Mirror Descent Search (MDS) algorithm, which is
applicable both to black-box optimization problems and reinforcement
learning problems. Our method is based on the mirror descent method, which is a
general optimization algorithm. The contribution of this research is roughly
twofold. We propose two essential algorithms, called MDS and Accelerated Mirror
Descent Search (AMDS), and two more approximate algorithms: Gaussian Mirror
Descent Search (G-MDS) and Gaussian Accelerated Mirror Descent Search (G-AMDS).
This research shows that advanced methods developed in the context of
mirror descent research can be applied to reinforcement learning problems. We
also clarify the relationship between an existing reinforcement learning
algorithm and our method. With two evaluation experiments, we show our proposed
algorithms converge faster than some state-of-the-art methods. Comment: Gold open access in Journal of Robotics and Autonomous Systems:
https://www.sciencedirect.com/science/article/pii/S092188901730754
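
As a rough illustration of the mirror-descent machinery underlying MDS (not the authors' exact MDS/AMDS procedure), the sketch below performs one mirror-descent step on a probability distribution over sampled candidate solutions using the negative-entropy mirror map, i.e. the exponentiated-gradient update, and recombines the candidates by their updated weights.

import numpy as np

def mirror_descent_weights(costs, weights, eta):
    # One mirror-descent step on the simplex with the negative-entropy
    # mirror map: w_new is proportional to w * exp(-eta * cost).
    # Illustrative sketch only, not the published MDS/AMDS update.
    w = weights * np.exp(-eta * costs)
    return w / w.sum()

# Toy black-box usage: reweight K sampled candidates by evaluated costs.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(8, 2))           # K candidate parameter vectors
costs = np.sum(candidates ** 2, axis=1)        # black-box objective values
weights = np.full(len(candidates), 1.0 / len(candidates))
weights = mirror_descent_weights(costs, weights, eta=0.5)
mean_estimate = weights @ candidates           # weighted recombination
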
An Analysis of State-Relevance Weights and Sampling Distributions on L1-Regularized Approximate Linear Programming Approximation Accuracy
Recent interest in the use of regularization in value
function approximation includes Petrik et al.'s introduction of
L1-Regularized Approximate Linear Programming (RALP). RALP is unique among
L1-regularized approaches in that it approximates the optimal value function
using off-policy samples. Additionally, it produces policies which outperform
those of previous methods, such as LSPI. RALP's value function approximation
quality is affected heavily by the choice of state-relevance weights in the
objective function of the linear program, and by the distribution from which
samples are drawn; however, there has been no discussion of these
considerations in the previous literature. In this paper, we discuss and
explain the effects of choices in the state-relevance weights and sampling
distribution on approximation quality, using both theoretical and experimental
illustrations. The results provide insight not only onto these effects, but
also provide intuition into the types of MDPs which are especially well suited
for approximation with RALP. Comment: Identical to the ICML 2014 paper of the same name, but with full
proofs. Please cite the ICML paper.
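
To show how the L1 constraint keeps the approximate linear program a linear program, the following sketch assembles a sampled RALP instance and solves it with scipy's linprog; the variable layout and helper name are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.optimize import linprog

def ralp(phi, phi_next, rewards, rho, gamma, psi):
    # Sketch of L1-Regularized Approximate Linear Programming (RALP),
    # assuming one sampled successor per state:
    #   minimize   rho^T Phi w
    #   subject to r + gamma * Phi' w <= Phi w   (sampled Bellman constraints)
    #              ||w||_1 <= psi
    # Variables: w = u - v with u, v >= 0 so ||w||_1 = sum(u + v).
    n, k = phi.shape
    c_w = phi.T @ rho                        # objective in terms of w
    obj = np.concatenate([c_w, -c_w])        # in terms of [u; v]
    A_bell = gamma * phi_next - phi          # (gamma*Phi' - Phi) w <= -r
    A_ub = np.vstack([np.hstack([A_bell, -A_bell]),
                      np.ones((1, 2 * k))])  # L1 budget row
    b_ub = np.concatenate([-rewards, [psi]])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * k), method="highs")
    return res.x[:k] - res.x[k:]
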
Investigating practical linear temporal difference learning
Off-policy reinforcement learning has many applications, including learning
from demonstration, learning multiple goal-seeking policies in parallel, and
representing predictive knowledge. Recently there has been a proliferation of
new policy-evaluation algorithms that fill a longstanding algorithmic void in
reinforcement learning: combining robustness to off-policy sampling, function
approximation, linear complexity, and temporal difference (TD) updates. This
paper contains two main contributions. First, we derive two new hybrid TD
policy-evaluation algorithms, which fill a gap in this collection of
algorithms. Second, we perform an empirical comparison to elicit which of these
new linear TD methods should be preferred in different situations, and make
concrete suggestions about practical use. Comment: Autonomous Agents and Multi-agent Systems, 201
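
For orientation, the baseline that all of these linear methods refine is per-decision importance-sampled linear TD(0); the sketch below shows that update (a generic Python sketch, not code from the paper), whose possible divergence under off-policy sampling is exactly the void the newer algorithms address.

import numpy as np

def off_policy_td0_step(theta, phi, phi_next, reward, rho, alpha, gamma):
    # rho = pi(a|s) / mu(a|s): importance-sampling ratio for the taken action.
    # Plain off-policy linear TD(0); this update is NOT guaranteed to converge
    # with function approximation, which is the gap the newer methods fill.
    delta = reward + gamma * theta @ phi_next - theta @ phi
    return theta + alpha * rho * delta * phi
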
Regularized Off-Policy TD-Learning
We present a novel regularized off-policy convergent TD-learning method
(termed RO-TD), which is able to learn sparse representations of value
functions with low computational complexity. The algorithmic framework
underlying RO-TD integrates two key ideas: off-policy convergent gradient TD
methods, such as TDC, and a convex-concave saddle-point formulation of
non-smooth convex optimization, which enables first-order solvers and feature
selection using online convex regularization. A detailed theoretical and
experimental analysis of RO-TD is presented. A variety of experiments are
presented to illustrate the off-policy convergence, sparse feature selection
capability and low computational cost of the RO-TD algorithm. Comment: 26th Advances in Neural Information Processing Systems (NIPS). arXiv
admin note: substantial text overlap with arXiv:1405.675
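
RO-TD itself is derived from a convex-concave saddle-point objective; purely as a simplified stand-in for the two ingredients named above, a TDC-style correction plus online l1 regularization through a first-order proximal step, the following sketch soft-thresholds the weights after a standard linear TDC update.

import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def tdc_l1_step(theta, w, phi, phi_next, reward, alpha, beta, gamma, lam):
    # One linear TDC update followed by an l1 proximal (soft-threshold) step.
    # Simplified illustration of "TDC + online convex regularization";
    # RO-TD's actual saddle-point formulation differs in detail.
    delta = reward + gamma * theta @ phi_next - theta @ phi
    theta = theta + alpha * (delta * phi - gamma * phi_next * (phi @ w))
    w = w + beta * (delta - phi @ w) * phi
    theta = soft_threshold(theta, alpha * lam)
    return theta, w
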
Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity
In this paper, we introduce proximal gradient temporal difference learning,
which provides a principled way of designing and analyzing true stochastic
gradient temporal difference learning algorithms. We show how gradient TD (GTD)
reinforcement learning methods can be formally derived, not by starting from
their original objective functions, as previously attempted, but rather from a
primal-dual saddle-point objective function. We also conduct a saddle-point
error analysis to obtain finite-sample bounds on their performance. Previous
analyses of this class of algorithms use stochastic approximation techniques to
prove asymptotic convergence, and do not provide any finite-sample analysis. We
also propose an accelerated algorithm, called GTD2-MP, that uses proximal
``mirror maps'' to yield an improved convergence rate. The results of our
theoretical analysis imply that the GTD family of algorithms is comparable to,
and may indeed be preferred over, existing least squares TD methods for off-policy
learning, due to their linear complexity. We provide experimental results
showing the improved performance of our accelerated gradient TD methods. Comment: Journal of Artificial Intelligence Research (JAIR)
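
To make the primal-dual reading concrete, the sketch below writes the standard GTD2 updates as stochastic primal-dual iterations; GTD2-MP additionally applies a proximal extra-gradient ("mirror") step that is omitted here, so this is only a generic illustration, not the paper's accelerated method.

import numpy as np

def gtd2_step(theta, w, phi, phi_next, reward, alpha, beta, gamma):
    # theta: primal (value-function) weights; w: dual/auxiliary weights.
    delta = reward + gamma * theta @ phi_next - theta @ phi
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi
    return theta, w
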
Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces
In this paper, we set forth a new vision of reinforcement learning developed
by us over the past few years, one that yields mathematically rigorous
solutions to longstanding important questions that have remained unresolved:
(i) how to design reliable, convergent, and robust reinforcement learning
algorithms; (ii) how to guarantee that reinforcement learning satisfies
pre-specified "safety" guarantees, and remains in a stable region of the
parameter space; (iii) how to design "off-policy" temporal difference learning
algorithms in a reliable and stable manner; and finally (iv) how to integrate
the study of reinforcement learning into the rich theory of stochastic
optimization. In this paper, we provide detailed answers to all these questions
using the powerful framework of proximal operators.
The key idea that emerges is the use of primal-dual spaces connected through
the use of a Legendre transform. This allows temporal difference updates to
occur in dual spaces, allowing a variety of important technical advantages. The
Legendre transform elegantly generalizes past algorithms for solving
reinforcement learning problems, such as natural gradient methods, which we
show relate closely to the previously unconnected framework of mirror descent
methods. Equally importantly, proximal operator theory enables the systematic
development of operator splitting methods that show how to safely and reliably
decompose complex products of gradients that occur in recent variants of
gradient-based temporal difference learning. This key technical innovation
makes it possible to finally design "true" stochastic gradient methods for
reinforcement learning. Finally, Legendre transforms enable a variety of other
benefits, including modeling sparsity and domain geometry. Our work builds
extensively on recent work on the convergence of saddle-point algorithms, and
on the theory of monotone operators. Comment: 121 pages
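
Schematically, the dual-space temporal-difference update described here can be written with a mirror map psi and its Legendre conjugate psi*; the following is a generic mirror-descent TD(0) step in LaTeX notation, not a verbatim equation from the paper:

\theta_{t+1} \;=\; \nabla\psi^{*}\!\bigl(\nabla\psi(\theta_t) + \alpha_t\,\delta_t\,\phi_t\bigr),
\qquad
\delta_t \;=\; r_{t+1} + \gamma\,\theta_t^{\top}\phi_{t+1} - \theta_t^{\top}\phi_t .

Choosing psi(theta) = ||theta||^2 / 2 recovers ordinary TD(0), while other mirror maps (p-norms, entropic maps) encode sparsity or domain geometry, which is the sense in which the Legendre transform generalizes past algorithms.
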
Weak Convergence Properties of Constrained Emphatic Temporal-difference Learning with Constant and Slowly Diminishing Stepsize
We consider the emphatic temporal-difference (TD) algorithm, ETD(λ),
for learning the value functions of stationary policies in a discounted, finite
state and action Markov decision process. The ETD(λ) algorithm was
recently proposed by Sutton, Mahmood, and White to solve a long-standing
divergence problem of the standard TD algorithm when it is applied to
off-policy training, where data from an exploratory policy are used to evaluate
other policies of interest. The almost sure convergence of ETD(λ) has
been proved in our recent work under general off-policy training conditions,
but for a narrow range of diminishing stepsize. In this paper we present
convergence results for constrained versions of ETD(λ) with constant
stepsize and with diminishing stepsize from a broad range. Our results
characterize the asymptotic behavior of the trajectory of iterates produced by
those algorithms, and are derived by combining key properties of ETD(λ)
with powerful convergence theorems from the weak convergence methods in
stochastic approximation theory. For the case of constant stepsize, in addition
to analyzing the behavior of the algorithms in the limit as the stepsize
parameter approaches zero, we also analyze their behavior for a fixed stepsize
and bound the deviations of their averaged iterates from the desired solution.
These results are obtained by exploiting the weak Feller property of the Markov
chains associated with the algorithms, and by using ergodic theorems for weak
Feller Markov chains, in conjunction with the convergence results we get from
the weak convergence methods. Besides ETD(λ), our analysis also applies
to the off-policy TD(λ) algorithm, when the divergence issue is avoided
by setting λ sufficiently large. Comment: Minor edits; 53 pages. Longer, with more proof details than the journal
version
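
For reference, the unconstrained ETD(λ) update of Sutton, Mahmood, and White, specialized to a fixed λ, constant discount γ, and unit interest for readability, takes roughly the following form; the constrained variants analyzed in the paper additionally confine the iterates to a bounded set. This is an illustrative sketch, not the authors' code.

import numpy as np

def etd_lambda_step(theta, e, F, phi, phi_next, reward,
                    rho, rho_prev, alpha, gamma, lam, interest=1.0):
    # One ETD(lambda) update with state-independent lambda and unit interest.
    # F: follow-on trace, M: emphasis, e: eligibility trace,
    # rho / rho_prev: importance-sampling ratios at the current / previous step.
    F = rho_prev * gamma * F + interest
    M = lam * interest + (1.0 - lam) * F
    e = rho * (gamma * lam * e + M * phi)
    delta = reward + gamma * theta @ phi_next - theta @ phi
    theta = theta + alpha * delta * e
    return theta, e, F
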
A generalization of regularized dual averaging and its dynamics
Excessive computational cost for learning large data and streaming data can
be alleviated by using stochastic algorithms, such as stochastic gradient
descent and its variants. Recent advances improve stochastic algorithms on
convergence speed, adaptivity and structural awareness. However, distributional
aspects of these new algorithms are poorly understood, especially for
structured parameters. To develop statistical inference in this case, we
propose a class of generalized regularized dual averaging (gRDA) algorithms
with constant step size, which improves RDA (Xiao, 2010; Flammarion and Bach,
2017). Weak convergence of gRDA trajectories is studied, and as a consequence,
for the first time in the literature, the asymptotic distributions for online
l1 penalized problems become available. These general results apply to both
convex and non-convex differentiable loss functions, and in particular, recover
the existing regret bound for convex losses (Nemirovski et al., 2009). As
important applications, statistical inferential theory for online sparse linear
regression and online sparse principal component analysis is developed and
supported by extensive numerical analysis. Interestingly, when gRDA is
properly tuned, support recovery and central limiting distribution (with mean
zero) hold simultaneously in the online setting, which is in contrast with the
biased central limiting distribution of batch Lasso (Knight and Fu, 2000).
Technical devices, including weak convergence of stochastic mirror descent, are
developed as by-products with independent interest. Preliminary empirical
analysis of modern image data shows that learning very sparse deep neural
networks by gRDA does not necessarily sacrifice testing accuracy.
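
For orientation, the l1-regularized RDA update of Xiao (2010) that gRDA generalizes has a simple closed form: average the past subgradients and soft-threshold. The sketch below uses the classical sqrt(t)-scaled proximal term; gRDA as proposed here instead works with a constant step size and a generalized penalty, so this is only the baseline update it builds on.

import numpy as np

def l1_rda_update(grad_sum, t, lam, gamma_rda):
    # Closed-form l1-RDA step (Xiao, 2010): soft-threshold the running
    # average of subgradients.  grad_sum is the sum of gradients g_1..g_t;
    # gamma_rda scales the strongly convex proximal term gamma*sqrt(t)/2*||w||^2.
    g_bar = grad_sum / t
    return -(np.sqrt(t) / gamma_rda) * np.sign(g_bar) * np.maximum(
        np.abs(g_bar) - lam, 0.0)
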
On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning
We consider off-policy temporal-difference (TD) learning methods for policy
evaluation in Markov decision processes with finite spaces and discounted
reward criteria, and we present a collection of convergence results for several
gradient-based TD algorithms with linear function approximation. The algorithms
we analyze include: (i) two basic forms of two-time-scale gradient-based TD
algorithms, which we call GTD and which minimize the mean squared projected
Bellman error using stochastic gradient-descent; (ii) their "robustified"
biased variants; (iii) their mirror-descent versions which combine the
mirror-descent idea with TD learning; and (iv) a single-time-scale version of
GTD that solves minimax problems formulated for approximate policy evaluation.
We derive convergence results for three types of stepsizes: constant
stepsize, slowly diminishing stepsize, as well as the standard type of
diminishing stepsize with a square-summable condition. For the first two types
of stepsizes, we apply the weak convergence method from stochastic
approximation theory to characterize the asymptotic behavior of the algorithms,
and for the standard type of stepsize, we analyze the algorithmic behavior with
respect to a stronger mode of convergence, almost sure convergence. Our
convergence results are for the aforementioned TD algorithms with three general
ways of setting their λ-parameters: (i) state-dependent λ; (ii)
a recently proposed scheme of using history-dependent λ to keep the
eligibility traces of the algorithms bounded while allowing for relatively
large values of λ; and (iii) a composite scheme of setting the
λ-parameters that combines the preceding two schemes and allows a
broader class of generalized Bellman operators to be used for approximate
policy evaluation with TD methods. Comment: Revised technical report; added Section 4.2.4 and Section 4.3; 86
pages
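
For reference, the mean squared projected Bellman error minimized by the basic GTD algorithms can be written, in the on-policy textbook form (the off-policy variants insert importance-sampling corrections), as the LaTeX expression below, where Φ is the feature matrix, Ξ the diagonal state-weighting matrix, and Π the Ξ-weighted projection onto the span of Φ:

\mathrm{MSPBE}(\theta) \;=\; \bigl\lVert \Phi\theta - \Pi T\,\Phi\theta \bigr\rVert_{\Xi}^{2}
\;=\; \mathbb{E}[\delta\,\phi]^{\top}\,\mathbb{E}[\phi\,\phi^{\top}]^{-1}\,\mathbb{E}[\delta\,\phi],
\qquad \delta = r + \gamma\,\theta^{\top}\phi' - \theta^{\top}\phi .
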