40 research outputs found
Mean-Variance Optimization in Markov Decision Processes
We consider finite horizon Markov decision processes under performance
measures that involve both the mean and the variance of the cumulative reward.
We show that either randomized or history-based policies can improve
performance. We prove that the complexity of computing a policy that maximizes
the mean reward under a variance constraint is NP-hard for some cases, and
strongly NP-hard for others. We finally offer pseudopolynomial exact and
approximation algorithms.Comment: A full version of an ICML 2011 pape
Algorithmic aspects of mean–variance optimization in Markov decision processes
We consider finite horizon Markov decision processes under performance measures that involve both the mean and the variance of the cumulative reward. We show that either randomized or history-based policies can improve performance. We prove that the complexity of computing a policy that maximizes the mean reward under a variance constraint is NP-hard for some cases, and strongly NP-hard for others. We finally offer pseudopolynomial exact and approximation algorithms.National Science Foundation (U.S.) (CMMI-0856063
Expectations or Guarantees? I Want It All! A crossroad between games and MDPs
When reasoning about the strategic capabilities of an agent, it is important
to consider the nature of its adversaries. In the particular context of
controller synthesis for quantitative specifications, the usual problem is to
devise a strategy for a reactive system which yields some desired performance,
taking into account the possible impact of the environment of the system. There
are at least two ways to look at this environment. In the classical analysis of
two-player quantitative games, the environment is purely antagonistic and the
problem is to provide strict performance guarantees. In Markov decision
processes, the environment is seen as purely stochastic: the aim is then to
optimize the expected payoff, with no guarantee on individual outcomes.
In this expository work, we report on recent results introducing the beyond
worst-case synthesis problem, which is to construct strategies that guarantee
some quantitative requirement in the worst-case while providing an higher
expected value against a particular stochastic model of the environment given
as input. This problem is relevant to produce system controllers that provide
nice expected performance in the everyday situation while ensuring a strict
(but relaxed) performance threshold even in the event of very bad (while
unlikely) circumstances. It has been studied for both the mean-payoff and the
shortest path quantitative measures.Comment: In Proceedings SR 2014, arXiv:1404.041
Policy Gradients for CVaR-Constrained MDPs
We study a risk-constrained version of the stochastic shortest path (SSP)
problem, where the risk measure considered is Conditional Value-at-Risk (CVaR).
We propose two algorithms that obtain a locally risk-optimal policy by
employing four tools: stochastic approximation, mini batches, policy gradients
and importance sampling. Both the algorithms incorporate a CVaR estimation
procedure, along the lines of Bardou et al. [2009], which in turn is based on
Rockafellar-Uryasev's representation for CVaR and utilize the likelihood ratio
principle for estimating the gradient of the sum of one cost function
(objective of the SSP) and the gradient of the CVaR of the sum of another cost
function (in the constraint of SSP). The algorithms differ in the manner in
which they approximate the CVaR estimates/necessary gradients - the first
algorithm uses stochastic approximation, while the second employ mini-batches
in the spirit of Monte Carlo methods. We establish asymptotic convergence of
both the algorithms. Further, since estimating CVaR is related to rare-event
simulation, we incorporate an importance sampling based variance reduction
scheme into our proposed algorithms
Functional Bandits
We introduce the functional bandit problem, where the objective is to find an
arm that optimises a known functional of the unknown arm-reward distributions.
These problems arise in many settings such as maximum entropy methods in
natural language processing, and risk-averse decision-making, but current
best-arm identification techniques fail in these domains. We propose a new
approach, that combines functional estimation and arm elimination, to tackle
this problem. This method achieves provably efficient performance guarantees.
In addition, we illustrate this method on a number of important functionals in
risk management and information theory, and refine our generic theoretical
results in those cases
Exploration vs Exploitation vs Safety: Risk-averse Multi-Armed Bandits
Motivated by applications in energy management, this paper presents the
Multi-Armed Risk-Aware Bandit (MARAB) algorithm. With the goal of limiting the
exploration of risky arms, MARAB takes as arm quality its conditional value at
risk. When the user-supplied risk level goes to 0, the arm quality tends toward
the essential infimum of the arm distribution density, and MARAB tends toward
the MIN multi-armed bandit algorithm, aimed at the arm with maximal minimal
value. As a first contribution, this paper presents a theoretical analysis of
the MIN algorithm under mild assumptions, establishing its robustness
comparatively to UCB. The analysis is supported by extensive experimental
validation of MIN and MARAB compared to UCB and state-of-art risk-aware MAB
algorithms on artificial and real-world problems.Comment: 16 page
Risk Aversion in Finite Markov Decision Processes Using Total Cost Criteria and Average Value at Risk
In this paper we present an algorithm to compute risk averse policies in
Markov Decision Processes (MDP) when the total cost criterion is used together
with the average value at risk (AVaR) metric. Risk averse policies are needed
when large deviations from the expected behavior may have detrimental effects,
and conventional MDP algorithms usually ignore this aspect. We provide
conditions for the structure of the underlying MDP ensuring that approximations
for the exact problem can be derived and solved efficiently. Our findings are
novel inasmuch as average value at risk has not previously been considered in
association with the total cost criterion. Our method is demonstrated in a
rapid deployment scenario, whereby a robot is tasked with the objective of
reaching a target location within a temporal deadline where increased speed is
associated with increased probability of failure. We demonstrate that the
proposed algorithm not only produces a risk averse policy reducing the
probability of exceeding the expected temporal deadline, but also provides the
statistical distribution of costs, thus offering a valuable analysis tool