Exponential Stability of Primal-Dual Gradient Dynamics with Non-Strong Convexity
This paper studies the exponential stability of primal-dual gradient dynamics
(PDGD) for solving convex optimization problems where constraints are in the
form of Ax + By = d and the objective is min f(x) + g(y) with strongly convex smooth
f but only convex smooth g. We show that when g is a quadratic function or when
g and matrix B together satisfy an inequality condition, the PDGD can achieve
global exponential stability given that matrix A is of full row rank. These
results indicate that the PDGD is locally exponentially stable with respect to
any convex smooth g under a regularity condition. To prove the exponential
stability, two quadratic Lyapunov functions are designed. Lastly, numerical
experiments further complement the theoretical analysis. Comment: 8 pages
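
As a concrete illustration of the dynamics being analyzed, here is a minimal sketch of the standard PDGD updates (gradient descent on the Lagrangian in the primal variables, gradient ascent in the dual), discretized with a small Euler step. The quadratic f and g, the matrices, and the step size are illustrative only and are not taken from the paper.

    import numpy as np

    # Toy instance of  min f(x) + g(y)  s.t.  Ax + By = d,  with
    # f(x) = 0.5*||x||^2 (strongly convex, smooth) and g(y) = 0.5*||y - 1||^2
    # (a quadratic, one of the cases covered above); A is full row rank (generically).
    rng = np.random.default_rng(0)
    A = rng.standard_normal((2, 3))
    B = rng.standard_normal((2, 3))
    d = rng.standard_normal(2)

    x, y, lam = np.zeros(3), np.zeros(3), np.zeros(2)
    eta = 0.05  # Euler step for the continuous-time dynamics

    for _ in range(20000):
        r = A @ x + B @ y - d            # constraint residual
        dx = -x - A.T @ lam              # -grad_x L(x, y, lam)
        dy = -(y - 1.0) - B.T @ lam      # -grad_y L(x, y, lam)
        x, y, lam = x + eta * dx, y + eta * dy, lam + eta * r

    print("||Ax + By - d|| =", np.linalg.norm(A @ x + B @ y - d))
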
Scalable Bilinear Learning Using State and Action Features
Approximate linear programming (ALP) represents one of the major algorithmic
families to solve large-scale Markov decision processes (MDP). In this work, we
study a primal-dual formulation of the ALP, and develop a scalable, model-free
algorithm called bilinear learning for reinforcement learning when a
sampling oracle is provided. This algorithm enjoys a number of advantages.
First, it adopts (bi)linear models to represent the high-dimensional value
function and state-action distributions, using given state and action features.
Its run-time complexity depends on the number of features, not the size of the
underlying MDPs. Second, it operates in a fully online fashion without having
to store any sample, thus having minimal memory footprint. Third, we prove that
it is sample-efficient, solving for the optimal policy to high precision with a
sample complexity linear in the dimension of the parameter space.
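
For context, one commonly used saddle-point (Lagrangian) form of the linear program for discounted MDPs, on which primal-dual ALP methods of this kind are built, is

    L(v, \mu) = (1 - \gamma)\,\mathbb{E}_{s \sim \xi}[v(s)]
                + \sum_{s,a} \mu(s,a)\,\big( r(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[v(s')] - v(s) \big),

to be minimized over the value function v and maximized over the nonnegative occupancy measure \mu. Restricting v and \mu to (bi)linear models in given state and action features and running stochastic primal-dual updates from the sampling oracle makes the per-iteration cost depend only on the number of features; the exact parameterization and updates in the paper may differ from this sketch.
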
Linear Stochastic Approximation: Constant Step-Size and Iterate Averaging
We consider $d$-dimensional linear stochastic approximation algorithms (LSAs)
with a constant step-size and the so called Polyak-Ruppert (PR) averaging of
iterates. LSAs are widely applied in machine learning and reinforcement
learning (RL), where the aim is to compute an appropriate $\theta^*$ (that is,
an optimum or a fixed point) using noisy data and $O(d)$ updates per iteration.
In this paper, we are motivated by the problem (in RL)
of policy evaluation from experience replay using the \emph{temporal
difference} (TD) class of learning algorithms that are also LSAs. For LSAs with
a constant step-size and PR averaging, we provide bounds for the mean squared
error (MSE) after $t$ iterations. We assume that the data is i.i.d. with finite
variance (with underlying distribution $P$) and that the expected dynamics is
Hurwitz. For a given LSA with PR averaging, and a data distribution $P$
satisfying these assumptions, we show that there exists a range of constant
step-sizes such that its MSE decays as $O(1/t)$.
We examine the conditions under which a constant step-size can be chosen
uniformly for a class of data distributions $\mathcal{P}$, and show that not
all data distributions `admit' such a uniform constant step-size. We also
suggest a heuristic step-size tuning algorithm to choose a constant step-size
of a given LSA for a given data distribution $P$. We compare our results with
related work and also discuss the implications of our results in the context of
TD algorithms that are LSAs. Comment: 16 pages, 2 figures, was submitted to NIPS 201
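
A minimal sketch of the object under study: a constant step-size LSA driven by i.i.d. noisy data, together with Polyak-Ruppert averaging of its iterates (the synthetic matrices, noise level, and step size below are illustrative only).

    import numpy as np

    # LSA:  theta_{t+1} = theta_t + gamma * (b_t - A_t theta_t),  with i.i.d. (A_t, b_t)
    # whose means (A, b) define the fixed point A^{-1} b and make -A Hurwitz,
    # plus a running Polyak-Ruppert average of the iterates.
    rng = np.random.default_rng(1)
    d = 5
    M = rng.standard_normal((d, d))
    A = M @ M.T / d + np.eye(d)          # positive definite, so -A is Hurwitz
    b = rng.standard_normal(d)
    theta_star = np.linalg.solve(A, b)   # target fixed point

    gamma, T = 0.05, 20000               # constant step size
    theta = np.zeros(d)
    theta_bar = np.zeros(d)

    for t in range(1, T + 1):
        A_t = A + 0.1 * rng.standard_normal((d, d))   # i.i.d., finite-variance noise
        b_t = b + 0.1 * rng.standard_normal(d)
        theta = theta + gamma * (b_t - A_t @ theta)
        theta_bar += (theta - theta_bar) / t          # Polyak-Ruppert average

    print("last-iterate error:", np.linalg.norm(theta - theta_star))
    print("PR-average error:  ", np.linalg.norm(theta_bar - theta_star))
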
Stabilizing Adversarial Nets With Prediction Methods
Adversarial neural networks solve many important problems in data science,
but are notoriously difficult to train. These difficulties come from the fact
that optimal weights for adversarial nets correspond to saddle points, and not
minimizers, of the loss function. The alternating stochastic gradient methods
typically used for such problems do not reliably converge to saddle points, and
when convergence does happen it is often highly sensitive to learning rates. We
propose a simple modification of stochastic gradient descent that stabilizes
adversarial networks. We show, both in theory and practice, that the proposed
method reliably converges to saddle points, and is stable with a wider range of
training parameters than a non-prediction method. This makes adversarial
networks less likely to "collapse," and enables faster training with larger
learning rates. Comment: Accepted at ICLR 201
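
A minimal sketch of the idea on a toy bilinear saddle point, assuming the prediction is a linear extrapolation of one player's latest iterate (the loss, step size, and update order here are illustrative; the method for full adversarial networks differs in the details).

    import math

    # Toy saddle-point problem  min_x max_y  x*y  (saddle at the origin).
    # Plain simultaneous gradient descent/ascent spirals away from the saddle;
    # letting the max player respond to a *predicted* (extrapolated) iterate of
    # the min player stabilizes the updates.
    lr, steps = 0.1, 2000
    x, y = 1.0, 1.0

    for _ in range(steps):
        x_new = x - lr * y              # gradient step for the min player
        x_pred = x_new + (x_new - x)    # prediction: extrapolate past the new iterate
        y = y + lr * x_pred             # max player steps against the predicted x
        x = x_new

    print("distance to saddle:", math.hypot(x, y))
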
Convergent Tree Backup and Retrace with Function Approximation
Off-policy learning is key to scaling up reinforcement learning, as it allows
learning about a target policy from the experience generated by a different
behavior policy. Unfortunately, it has been challenging to combine off-policy
learning with function approximation and multi-step bootstrapping in a way that
leads to both stable and efficient algorithms. In this work, we show that the
\textsc{Tree Backup} and \textsc{Retrace} algorithms are unstable with linear
function approximation, both in theory and in practice with specific examples.
Based on our analysis, we then derive stable and efficient gradient-based
algorithms using a quadratic convex-concave saddle-point formulation. By
exploiting the problem structure proper to these algorithms, we are able to
provide convergence guarantees and finite-sample bounds. The applicability of
our new analysis also goes beyond \textsc{Tree Backup} and \textsc{Retrace} and
allows us to provide new convergence rates for the GTD and GTD2 algorithms
without having recourse to projections or Polyak averaging.
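
For reference, one standard quadratic convex-concave saddle-point form of the mean squared projected Bellman error used by gradient TD methods is

    \min_{\theta}\,\max_{\omega}\ \langle b - A\theta,\ \omega \rangle - \tfrac{1}{2}\,\omega^{\top} M \omega,
    \qquad A = \mathbb{E}\big[\phi\,(\phi - \gamma\phi')^{\top}\big],\quad b = \mathbb{E}[r\,\phi],\quad M = \mathbb{E}\big[\phi\,\phi^{\top}\big],

whose inner maximization recovers \tfrac{1}{2}\|b - A\theta\|_{M^{-1}}^{2}; stochastic descent in \theta and ascent in \omega on sampled transitions then gives GTD2-style updates. The formulations derived in the paper for \textsc{Tree Backup} and \textsc{Retrace} share this quadratic structure, with the multi-step, off-policy correction terms entering the expectations.
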
A Block Coordinate Ascent Algorithm for Mean-Variance Optimization
Risk management in dynamic decision problems is a primary concern in many
fields, including financial investment, autonomous driving, and healthcare. The
mean-variance function is one of the most widely used objective functions in
risk management due to its simplicity and interpretability. Existing algorithms
for mean-variance optimization are based on multi-time-scale stochastic
approximation, whose learning rate schedules are often hard to tune and which
have only asymptotic convergence guarantees. In this paper, we develop a model-free
policy search framework for mean-variance optimization with finite-sample error
bound analysis (to local optima). Our starting point is a reformulation of the
original mean-variance function with its Fenchel dual, from which we propose a
stochastic block coordinate ascent policy search algorithm. Both the asymptotic
convergence guarantee of the last iteration's solution and the convergence rate
of the randomly picked solution are provided, and their applicability is
demonstrated on several benchmark domains. Comment: Accepted by NIPS 201
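
The reformulation rests on a Legendre-Fenchel identity for the square, x^2 = \max_y (2xy - y^2); a sketch of how it removes the troublesome squared expectation in the variance (the paper's exact objective and notation may differ):

    J(\theta) = \mathbb{E}[R_\theta] - \lambda\,\mathrm{Var}(R_\theta)
              = \mathbb{E}[R_\theta] - \lambda\,\mathbb{E}[R_\theta^2] + \lambda\,\big(\mathbb{E}[R_\theta]\big)^2
              = \max_{y}\ \Big\{ \mathbb{E}[R_\theta] - \lambda\,\mathbb{E}[R_\theta^2] + \lambda\,\big(2\,y\,\mathbb{E}[R_\theta] - y^2\big) \Big\},

so maximizing J(\theta) becomes a joint maximization over (\theta, y) in which every term is a plain expectation, which is what makes stochastic block coordinate ascent (alternating updates of y and \theta) applicable.
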
High-confidence error estimates for learned value functions
Estimating the value function for a fixed policy is a fundamental problem in
reinforcement learning. Policy evaluation algorithms---to estimate value
functions---continue to be developed, to improve convergence rates, improve
stability and handle variability, particularly for off-policy learning. To
understand the properties of these algorithms, the experimenter needs
high-confidence estimates of the accuracy of the learned value functions. For
environments with small, finite state-spaces, like chains, the true value
function can be computed exactly, so accuracy can be measured directly. For large or continuous
state-spaces, however, this is no longer feasible. In this paper, we address
the largely open problem of how to obtain these high-confidence estimates, for
general state-spaces. We provide a high-confidence bound relating an empirical
estimate of the value error to the true value error. We use this bound to
design an offline sampling algorithm, which stores the required quantities to
repeatedly compute value error estimates for any learned value function. We
provide experiments investigating the number of samples required by this
offline algorithm in simple benchmark reinforcement learning domains, and
highlight that there are still many open questions to be solved for this
important problem. Comment: Presented at Uncertainty in Artificial Intelligence (UAI) 201
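
A generic illustration (not the paper's exact construction) of forming an empirical value-error estimate with a confidence half-width from sampled states and truncated Monte Carlo returns; the function names and arguments below are hypothetical.

    import numpy as np

    def empirical_value_error(sample_state, rollout_return, v_hat, n, r_max, gamma, delta):
        # Estimate E_d[(v_hat(s) - G(s))^2] from n samples and attach a Hoeffding-style
        # half-width.  Note: using sampled returns G(s) adds the return variance on top
        # of the true value error; the paper's estimator and bound treat this carefully.
        errs = []
        for _ in range(n):
            s = sample_state()                # state drawn from the weighting distribution d
            g = rollout_return(s)             # truncated Monte Carlo return from s
            errs.append((v_hat(s) - g) ** 2)
        v_max = r_max / (1.0 - gamma)         # returns are bounded by r_max / (1 - gamma)
        err_range = (2.0 * v_max) ** 2        # each squared error lies in [0, err_range]
        half_width = err_range * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
        return float(np.mean(errs)), half_width   # estimate +/- half_width w.p. >= 1 - delta
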
Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization
Despite the success of single-agent reinforcement learning, multi-agent
reinforcement learning (MARL) remains challenging due to complex interactions
between agents. Motivated by decentralized applications such as sensor
networks, swarm robotics, and power grids, we study policy evaluation in MARL,
where agents with jointly observed state-action pairs and private local rewards
collaborate to learn the value of a given policy. In this paper, we propose a
double averaging scheme, where each agent iteratively performs averaging over
both space and time to incorporate neighboring gradient information and local
reward information, respectively. We prove that the proposed algorithm
converges to the optimal solution at a global geometric rate. In particular,
such an algorithm is built upon a primal-dual reformulation of the mean squared
projected Bellman error minimization problem, which gives rise to a
decentralized convex-concave saddle-point problem. To the best of our
knowledge, the proposed double averaging primal-dual optimization algorithm is
the first to achieve fast finite-time convergence on decentralized
convex-concave saddle-point problems. Comment: final version as it appeared in NeurIPS 201
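
A schematic of the two averaging operations (over space via a doubly stochastic weight matrix W on the communication graph, and over time via a running average of local gradients); the full algorithm applies these to the primal-dual saddle-point reformulation described above, and the names here are illustrative.

    import numpy as np

    def double_averaging_step(theta, grad_avg, local_grads, W, step, t):
        # theta:       (n_agents, dim) local parameter estimates
        # grad_avg:    (n_agents, dim) running averages of local stochastic gradients
        # local_grads: (n_agents, dim) fresh local stochastic gradients at time t
        theta_mixed = W @ theta                             # averaging over space (neighbors)
        grad_avg = grad_avg + (local_grads - grad_avg) / t  # averaging over time (local info)
        return theta_mixed - step * grad_avg, grad_avg
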
SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation
When function approximation is used, solving the Bellman optimality equation
with stability guarantees has remained a major open problem in reinforcement
learning for decades. The fundamental difficulty is that the Bellman operator
may become an expansion in general, resulting in oscillating and even divergent
behavior of popular algorithms like Q-learning. In this paper, we revisit the
Bellman equation, and reformulate it into a novel primal-dual optimization
problem using Nesterov's smoothing technique and the Legendre-Fenchel
transformation. We then develop a new algorithm, called Smoothed Bellman Error
Embedding, to solve this optimization problem where any differentiable function
class may be used. We provide what we believe to be the first convergence
guarantee for general nonlinear function approximation, and analyze the
algorithm's sample complexity. Empirically, our algorithm compares favorably to
state-of-the-art baselines in several benchmark control problems. Comment: 28 pages, 13 figures. To appear at the 35th International Conference
on Machine Learning (ICML 2018)
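
The smoothing step yields an entropy-regularized ("smoothed") Bellman equation with temperature \lambda > 0, whose closed form follows from the Legendre-Fenchel conjugate of the negative entropy:

    V(s) = \max_{\pi(\cdot|s) \in \Delta}\ \sum_a \pi(a|s)\big( R(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')] \big) + \lambda\,\mathcal{H}\big(\pi(\cdot|s)\big)
         = \lambda \log \sum_a \exp\!\Big( \tfrac{R(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')]}{\lambda} \Big).

Unlike the max in the original Bellman optimality equation, this operator is smooth, and the primal-dual objective solved by the algorithm is built from the resulting smoothed Bellman error; the exact form of that objective is given in the paper.
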
Variance Reduction for Deep Q-Learning using Stochastic Recursive Gradient
Deep Q-learning algorithms often suffer from poor gradient estimates with
excessive variance, resulting in unstable training and poor sample
efficiency. Stochastic variance-reduced gradient methods such as SVRG have been
applied to reduce the estimation variance (Zhao et al. 2019). However, because
reinforcement learning generates training instances online, directly
applying SVRG to deep Q-learning faces the problem of inaccurate
estimation of the anchor points, which dramatically limits the potential of
SVRG. To address this issue and inspired by the recursive gradient variance
reduction algorithm SARAH (Nguyen et al. 2017), this paper proposes to
introduce a recursive framework for updating the stochastic gradient
estimates in deep Q-learning, yielding a novel algorithm called SRG-DQN.
Unlike the SVRG-based algorithms, SRG-DQN uses a recursive update of the
stochastic gradient estimate: the parameter update follows an accumulated
direction built from past stochastic gradient information, and therefore
does not require estimating full gradients as anchors. Additionally,
SRG-DQN incorporates the Adam optimizer to further accelerate the training
process. Theoretical analysis and the experimental results on well-known
reinforcement learning tasks demonstrate the efficiency and effectiveness of
the proposed SRG-DQN algorithm. Comment: 8 pages, 3 figures
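
A minimal sketch of the SARAH-style recursive gradient estimate (Nguyen et al. 2017) applied to a Q-learning-style parameter update; grad_fn and the other names are illustrative, and the paper additionally routes the update through Adam.

    def srg_step(params, prev_params, v, batch, grad_fn, lr):
        # Recursive gradient estimate: unlike SVRG, no full ("anchor") gradient is ever
        # computed; v is corrected using gradients at the current and previous parameters
        # evaluated on the same minibatch.
        v = grad_fn(params, batch) - grad_fn(prev_params, batch) + v
        new_params = params - lr * v       # plain SGD step here; Adam can replace it
        return new_params, params, v       # caller carries (params, prev_params, v) forward

At the start of each training epoch, v is typically initialized with a plain minibatch gradient at the current parameters.
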