Algorithms for CVaR Optimization in MDPs
In many sequential decision-making problems we may want to manage risk by
minimizing some measure of variability in costs in addition to minimizing a
standard criterion. Conditional value-at-risk (CVaR) is a relatively new risk
measure that addresses some of the shortcomings of the well-known
variance-related risk measures and, owing to its computational tractability,
has gained popularity in finance and operations research. In this paper, we
consider the mean-CVaR optimization problem in MDPs. We first derive a formula
for computing the gradient of this risk-sensitive objective function. We then
devise policy gradient and actor-critic algorithms, each of which uses a
specific method to estimate this gradient and updates the policy parameters in
the descent direction. We establish the convergence of our algorithms to
locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness
of our algorithms in an optimal stopping problem. Comment: Submitted to NIPS 1
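The CVaR objective discussed in this abstract can be made concrete with a small numerical sketch. Assuming the standard Rockafellar-Uryasev representation CVaR_a(C) = VaR_a(C) + E[(C - VaR_a(C))+] / (1 - a), an empirical estimator (the function name is illustrative, not from the paper) might look like:

```python
import numpy as np

def empirical_cvar(costs, alpha=0.95):
    """Empirical CVaR_alpha of a cost sample via the
    Rockafellar-Uryasev form: VaR + E[(C - VaR)+] / (1 - alpha)."""
    costs = np.asarray(costs, dtype=float)
    var = np.quantile(costs, alpha)           # empirical Value-at-Risk
    excess = np.maximum(costs - var, 0.0)     # losses beyond VaR
    return var + excess.mean() / (1.0 - alpha)
```

Since CVaR averages the worst (1 - alpha) fraction of outcomes, the estimate always upper-bounds both the sample mean and the VaR, which is what makes it a conservative risk measure.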
Optimizing the CVaR via Sampling
Conditional Value at Risk (CVaR) is a prominent risk measure used extensively
in various domains. We develop a new formula for the gradient
of the CVaR in the form of a conditional expectation. Based on this formula, we
propose a novel sampling-based estimator for the CVaR gradient, in the spirit
of the likelihood-ratio method. We analyze the bias of the estimator, and prove
the convergence of a corresponding stochastic gradient descent algorithm to a
local CVaR optimum. Our method makes it possible to consider CVaR optimization
in new domains. As an example, we consider a reinforcement learning
application and learn a risk-sensitive controller for the game of Tetris. Comment: To appear in AAAI 201
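The conditional-expectation form of the CVaR gradient described above lends itself to a likelihood-ratio (score-function) estimator. A minimal sketch, assuming per-trajectory costs and score functions grad_theta log P(tau) are available as arrays (the names and plug-in VaR choice are my assumptions, not the paper's exact estimator):

```python
import numpy as np

def cvar_gradient_lr(costs, scores, alpha=0.95):
    """Sampling-based likelihood-ratio estimate of grad CVaR_alpha:
    average of score * (cost - VaR) over the tail, scaled by 1/(1 - alpha)."""
    costs = np.asarray(costs, dtype=float)
    scores = np.asarray(scores, dtype=float)   # shape (N, d): grad log-prob per sample
    v = np.quantile(costs, alpha)              # plug-in VaR estimate
    weights = (costs >= v) * (costs - v) / (1.0 - alpha)   # nonzero only in the tail
    return (scores * weights[:, None]).mean(axis=0)
```

Only the worst-case samples carry weight, which matches the intuition that the CVaR gradient is a conditional expectation over the tail of the cost distribution.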
Risk-Sensitive Reinforcement Learning with Exponential Criteria
While reinforcement learning has shown experimental success in a number of
applications, it is known to be sensitive to noise and perturbations in the
parameters of the system, leading to high variance in the total reward amongst
different episodes on slightly different environments. To introduce robustness,
as well as sample efficiency, risk-sensitive reinforcement learning methods
have received considerable attention. In this work, we provide a definition of robust
reinforcement learning policies and formulate a risk-sensitive reinforcement
learning problem to approximate them, by solving an optimization problem with
respect to a modified objective based on exponential criteria. In particular,
we study a model-free risk-sensitive variation of the widely used Monte Carlo
Policy Gradient algorithm, and introduce a novel risk-sensitive online
Actor-Critic algorithm based on solving a multiplicative Bellman equation using
stochastic approximation updates. Analytical results suggest that the use of
exponential criteria generalizes commonly used ad-hoc regularization
approaches, improves sample efficiency, and introduces robustness with respect
to perturbations in the model parameters and the environment. The
implementation, performance, and robustness properties of the proposed methods
are evaluated in simulated experiments.
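The exponential criterion behind this abstract, J_beta(R) = (1/beta) log E[exp(beta * R)], can be sketched numerically; a second-order expansion shows it behaves like E[R] + (beta/2) Var[R], which is the sense in which it generalizes variance regularization. A small illustration (names are my own, not the paper's):

```python
import numpy as np

def exponential_criterion(returns, beta):
    """Risk-sensitive value (1/beta) * log E[exp(beta * R)].
    beta < 0 penalizes variability (risk-averse); beta -> 0 recovers the mean."""
    r = beta * np.asarray(returns, dtype=float)
    m = r.max()                                  # log-sum-exp shift for stability
    return (m + np.log(np.mean(np.exp(r - m)))) / beta
```

For beta < 0 the criterion sits below the mean return (a pessimistic, variance-averse value), for beta > 0 above it, and it converges to the plain mean as beta approaches zero.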
Actor-Critic Algorithms for Risk-Sensitive MDPs
In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.
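For the variance-related criteria above, a common route is a mean-variance objective E[R] - lam * Var[R], whose gradient decomposes via grad Var[R] = grad E[R^2] - 2 E[R] grad E[R], each term estimable with the score-function trick. A hedged sketch under these assumptions (this is not the paper's actor-critic; the names are illustrative):

```python
import numpy as np

def mean_variance_gradient(returns, scores, lam=0.1):
    """Likelihood-ratio sketch of grad_theta (E[R] - lam * Var[R]),
    using grad Var = grad E[R^2] - 2 E[R] grad E[R]."""
    R = np.asarray(returns, dtype=float)
    S = np.asarray(scores, dtype=float)           # (N, d) per-trajectory scores
    gJ = (S * R[:, None]).mean(axis=0)            # grad of the mean return
    gM2 = (S * (R ** 2)[:, None]).mean(axis=0)    # grad of the second moment
    return gJ - lam * (gM2 - 2.0 * R.mean() * gJ)
```

An actor-critic variant would replace these Monte Carlo moment estimates with learned critics for the first and second moments of the return, updated alongside the policy.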