Accelerated Reinforcement Learning
Policy gradient methods are widely used in reinforcement learning algorithms
to search for better policies in a parameterized policy space. They perform a
gradient search in the policy space and are known to converge very slowly.
Nesterov developed an accelerated gradient search algorithm for convex
optimization problems, and this has recently been extended to non-convex and
stochastic optimization. We use Nesterov's acceleration for the policy gradient
search in the well-known actor-critic algorithm and show its convergence using
the ODE method. We tested this algorithm on a scheduling problem in which an
incoming job is scheduled into one of the four queues based on the queue
lengths. Experimental results show that the algorithm using Nesterov's
acceleration performs significantly better than the algorithm that does not use
acceleration. To the best of our knowledge, this is the first time Nesterov's
acceleration has been used with an actor-critic algorithm.
Comment: The proof is not complete, as it remains to be shown that the
algorithm tracks the ODE.
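For illustration, here is a minimal Python sketch of a Nesterov-accelerated update of the actor (policy) parameters of the kind the abstract describes; the names (grad_estimate, lr, momentum) and step sizes are assumptions for exposition, not the paper's code.

```python
import numpy as np

def nesterov_actor_step(theta, velocity, grad_estimate, lr=0.01, momentum=0.9):
    """One Nesterov-accelerated ascent step on policy parameters theta.

    grad_estimate: callable returning a stochastic policy-gradient estimate,
    e.g. built from the critic's TD error in an actor-critic scheme.
    """
    lookahead = theta + momentum * velocity   # evaluate the gradient at the lookahead point
    g = grad_estimate(lookahead)              # stochastic policy-gradient estimate
    velocity = momentum * velocity + lr * g   # update the momentum buffer
    theta = theta + velocity                  # accelerated step on the policy parameters
    return theta, velocity
```

The lookahead evaluation is what distinguishes Nesterov's scheme from plain momentum: the gradient is queried at the extrapolated point theta + momentum * velocity rather than at theta itself.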
Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint
The classic objective in a reinforcement learning (RL) problem is to find a
policy that minimizes, in expectation, a long-run objective such as the
infinite-horizon discounted or long-run average cost. In many practical
applications, optimizing the expected value alone is not sufficient, and it may
be necessary to include a risk measure in the optimization process, either as
the objective or as a constraint. Various risk measures have been proposed in
the literature, e.g., mean-variance tradeoff, exponential utility, the
percentile performance, value at risk, conditional value at risk, prospect
theory and its later enhancement, cumulative prospect theory. In this article,
we focus on the combination of risk criteria and reinforcement learning in a
constrained optimization framework, i.e., a setting where the goal is to find a
policy that optimizes the usual objective of infinite-horizon
discounted/average cost while ensuring that an explicit risk constraint is
satisfied. We introduce the risk-constrained RL framework, cover popular risk
measures based on variance, conditional value-at-risk and cumulative prospect
theory, and present a template for a risk-sensitive RL algorithm. We survey
some of our recent work on this topic, covering problems encompassing
discounted cost, average cost, and stochastic shortest path settings, together
with the aforementioned risk measures in a constrained framework. This
non-exhaustive survey is aimed at giving a flavor of the challenges involved in
solving a risk-sensitive RL problem and at outlining some potential future
research directions.
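As a hedged illustration of the algorithm template surveyed here, the Python sketch below shows the standard two-timescale Lagrangian recursion for risk-constrained RL; the estimator names (cost_grad, risk_grad, risk_value) and step sizes are assumptions, not taken from the article.

```python
def risk_constrained_step(theta, lam, cost_grad, risk_grad, risk_value,
                          alpha, lr_theta=1e-2, lr_lam=1e-3):
    """One step of descent on the Lagrangian
    L(theta, lam) = J(theta) + lam * (risk(theta) - alpha),
    with a slower projected ascent on the multiplier lam."""
    # Primal step: move the policy parameters along the Lagrangian gradient
    theta = theta - lr_theta * (cost_grad(theta) + lam * risk_grad(theta))
    # Dual step (slower timescale): increase lam when the risk constraint
    # risk(theta) <= alpha is violated, projecting onto lam >= 0
    lam = max(0.0, lam + lr_lam * (risk_value(theta) - alpha))
    return theta, lam
```

Running the two updates with lr_lam much smaller than lr_theta mirrors the multi-timescale stochastic approximation viewpoint: the policy effectively tracks the optimum for a slowly varying multiplier.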
An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes
In this article we develop the first actor-critic reinforcement learning algorithm with function approximation for a problem of control under multiple inequality constraints. We consider the infinite-horizon discounted cost framework, in which both the objective and the constraint functions are suitable expected policy-dependent discounted sums of certain sample-path functions. We apply the Lagrange multiplier method to handle the inequality constraints. Our algorithm makes use of multi-timescale stochastic approximation and incorporates a temporal difference (TD) critic and an actor that performs a gradient search in the space of policy parameters using efficient simultaneous perturbation stochastic approximation (SPSA) gradient estimates. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal policy.
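Since the actor relies on SPSA gradient estimates, a minimal Python sketch of the SPSA estimator may help; the lagrangian callable and the perturbation size delta are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def spsa_gradient(lagrangian, theta, delta=0.05, rng=None):
    """Estimate the gradient of `lagrangian` at theta with a single
    simultaneous +/-1 perturbation: two function evaluations suffice,
    regardless of the dimension of theta."""
    if rng is None:
        rng = np.random.default_rng()
    perturb = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    l_plus = lagrangian(theta + delta * perturb)
    l_minus = lagrangian(theta - delta * perturb)
    # Two-sided difference divided coordinate-wise by the perturbation
    return (l_plus - l_minus) / (2.0 * delta * perturb)
```

The two-evaluation cost per iteration is what makes SPSA attractive when each evaluation of the Lagrangian requires simulating the constrained MDP.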
Global Convergence of Policy Gradient Primal-dual Methods for Risk-constrained LQRs
While the techniques in optimal control theory are often model-based, the
policy optimization (PO) approach can directly optimize the performance metric
of interest without explicit dynamical models, and is an essential approach for
reinforcement learning problems. However, it usually leads to a non-convex
optimization problem, for which there is little theoretical understanding
of performance. In this paper, we focus on the
risk-constrained Linear Quadratic Regulator (LQR) problem with noisy input via
the PO approach, which results in a challenging non-convex problem. To this
end, we first build on our earlier result that the optimal policy has an affine
structure to show that the associated Lagrangian function is locally gradient
dominated with respect to the policy, based on which we establish strong
duality. Then, we design policy gradient primal-dual methods with global
convergence guarantees to find an optimal policy-multiplier pair in both
model-based and sample-based settings. Finally, we use samples of system
trajectories in simulations to validate our policy gradient primal-dual
methods.
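A hedged sketch of the primal-dual iteration described above, specialized to an affine policy u = -K x + b; the gradient and constraint estimators are placeholders standing in for model-based or sample-based evaluations, and none of the names come from the paper.

```python
def primal_dual_step(K, b, lam, grad_L_K, grad_L_b, constraint_value,
                     lr_primal=1e-3, lr_dual=1e-2):
    """One primal-dual step on the Lagrangian of the risk-constrained LQR.

    grad_L_K, grad_L_b: gradients of the Lagrangian w.r.t. the affine
    policy parameters (K, b).
    constraint_value: risk-constraint violation at the current policy.
    """
    # Primal: policy gradient descent on the Lagrangian over the affine policy
    K = K - lr_primal * grad_L_K(K, b, lam)
    b = b - lr_primal * grad_L_b(K, b, lam)
    # Dual: projected gradient ascent on the Lagrange multiplier
    lam = max(0.0, lam + lr_dual * constraint_value(K, b))
    return K, b, lam
```

Under the local gradient dominance and strong duality established in the paper, an alternating scheme of this kind is the sort of method that can be shown to converge globally to an optimal policy-multiplier pair.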
Algorithms for CVaR Optimization in MDPs
In many sequential decision-making problems we may want to manage risk by
minimizing some measure of variability in costs in addition to minimizing a
standard criterion. Conditional value-at-risk (CVaR) is a relatively new risk
measure that addresses some of the shortcomings of the well-known
variance-related risk measures and, because of its computational efficiency,
has gained popularity in finance and operations research. In this paper, we
consider the mean-CVaR optimization problem in MDPs. We first derive a formula
for computing the gradient of this risk-sensitive objective function. We then
devise policy gradient and actor-critic algorithms, each of which uses a
specific method to estimate this gradient and updates the policy parameters in the
descent direction. We establish the convergence of our algorithms to locally
risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our
algorithms in an optimal stopping problem.
Comment: Submitted to NIPS 1
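For context, CVaR gradient machinery of the kind used in such algorithms typically rests on the Rockafellar-Uryasev representation CVaR_alpha(C) = min_nu { nu + E[(C - nu)^+] / (1 - alpha) }; the Python sketch below estimates this objective and its subgradient in nu from sampled trajectory costs, and is an illustrative assumption rather than the paper's code.

```python
import numpy as np

def cvar_and_subgradient(costs, nu, alpha=0.95):
    """Sample estimate of the Rockafellar-Uryasev CVaR objective
    nu + E[(C - nu)^+] / (1 - alpha) and its subgradient in nu,
    from which the policy-gradient terms can be assembled."""
    excess = np.maximum(costs - nu, 0.0)               # (C - nu)^+ per sampled cost
    cvar_estimate = nu + excess.mean() / (1.0 - alpha)
    # Subgradient w.r.t. nu: 1 - P(C > nu) / (1 - alpha)
    grad_nu = 1.0 - (costs > nu).mean() / (1.0 - alpha)
    return cvar_estimate, grad_nu
```

Minimizing jointly over nu and the policy parameters is what turns the mean-CVaR problem into one amenable to stochastic gradient and actor-critic updates.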