3 research outputs found

    Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint

    The classic objective in a reinforcement learning (RL) problem is to find a policy that minimizes, in expectation, a long-run objective such as the infinite-horizon discounted or long-run average cost. In many practical applications, optimizing the expected value alone is not sufficient, and it may be necessary to include a risk measure in the optimization process, either as the objective or as a constraint. Various risk measures have been proposed in the literature, e.g., the mean-variance tradeoff, exponential utility, percentile performance, value at risk, conditional value at risk, and prospect theory along with its later enhancement, cumulative prospect theory. In this article, we focus on the combination of risk criteria and reinforcement learning in a constrained optimization framework, i.e., a setting where the goal is to find a policy that optimizes the usual objective of infinite-horizon discounted/average cost, while ensuring that an explicit risk constraint is satisfied. We introduce the risk-constrained RL framework, cover popular risk measures based on variance, conditional value-at-risk, and cumulative prospect theory, and present a template for a risk-sensitive RL algorithm. We survey some of our recent work on this topic, covering problems encompassing discounted cost, average cost, and stochastic shortest path settings, together with the aforementioned risk measures in a constrained framework. This non-exhaustive survey is aimed at giving a flavor of the challenges involved in solving a risk-sensitive RL problem and at outlining some potential future research directions.
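    As a concrete illustration of the constrained setup the abstract describes, the sketch below estimates conditional value-at-risk (CVaR) from sampled trajectory costs and forms the Lagrangian relaxation commonly used to handle an explicit risk constraint. The function names, the simple tail-average CVaR estimator, and the fixed multiplier `lam` are illustrative assumptions of this sketch, not the specific algorithm template the article presents.

    ```python
    import numpy as np

    def cvar(costs, alpha=0.9):
        # Tail-average estimator: mean of the worst (1 - alpha) fraction of costs.
        costs = np.sort(np.asarray(costs, dtype=float))
        k = max(1, int(np.ceil((1 - alpha) * len(costs))))
        return costs[-k:].mean()

    def lagrangian(costs, alpha, bound, lam):
        # Relaxed objective: expected cost plus a multiplier-weighted penalty
        # on the amount by which the CVaR constraint is violated.
        costs = np.asarray(costs, dtype=float)
        return costs.mean() + lam * (cvar(costs, alpha) - bound)
    ```

    In a full risk-constrained algorithm, the policy parameters and the multiplier would typically be updated on opposite timescales: descent on the policy objective and ascent on the multiplier.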

    Stochastic Systems with Cumulative Prospect Theory

    Stochastic control problems arise in many fields. Traditionally, the most widely used class of performance criteria in stochastic control problems is risk-neutral. More recent attempts at introducing risk-sensitivity into stochastic control problems include the application of utility functions. The decision theory community has long debated the merits of using expected utility for modeling human behavior, as exemplified by the Allais paradox. Substantiated by strong experimental evidence, Cumulative Prospect Theory (CPT) based performance measures have been proposed as alternatives to expected utility based performance measures for evaluating human-centric systems. Our goal is to study stochastic control problems using performance measures derived from cumulative prospect theory. The first part of this thesis solves the problem of evaluating Markov decision processes (MDPs) using CPT-based performance measures. A well-known method of solving MDPs is dynamic programming, which has traditionally been applied with an expected utility criterion. When the performance measure is CPT-inspired, several complications arise. First, when solving a problem via dynamic programming, it is important that the performance criterion have a recursive structure, which is not true of all CPT-based criteria. Second, we need to prove the traditional optimality criteria for the updated problems (i.e., MDPs with CPT-based performance criteria). The theorems stated in this part of the thesis answer the question: what conditions are required on a CPT-inspired criterion so that the corresponding MDP is solvable via dynamic programming? The second part of this thesis deals with stochastic global optimization problems. Using ideas from cumulative prospect theory, we introduce a novel model-based randomized optimization algorithm: Cumulative Weighting Optimization (CWO). The key contributions of our research are: 1) proving the convergence of the algorithm to an optimal solution given a mild assumption on the initial condition; and 2) showing that the well-known cross-entropy optimization algorithm is a special case of CWO-based algorithms. To the best of the author's knowledge, there is no previous convergence proof for the cross-entropy method. In practice, numerical experiments have demonstrated that a CWO-based algorithm can find a better solution than the cross-entropy method. Finally, in the future, we would like to apply some of the ideas from cumulative prospect theory to games. In this thesis, we present a numerical example where cumulative prospect theory has an unexpected effect on the equilibrium points of the classic prisoner's dilemma game.
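    To make a CPT-based performance measure concrete, the sketch below evaluates a discrete lottery using the rank-dependent decision weights that CPT substitutes for raw probabilities. The Tversky-Kahneman weighting function, the gains-only linear utility, and the parameter value gamma = 0.61 are illustrative assumptions of this sketch, not the specific criteria analyzed in the thesis.

    ```python
    import numpy as np

    def tk_weight(p, gamma=0.61):
        # Tversky-Kahneman probability weighting function for gains:
        # overweights small probabilities, underweights large ones.
        return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

    def cpt_value(outcomes, probs, gamma=0.61):
        # Rank-dependent CPT value of a lottery over non-negative outcomes
        # with linear utility: sort outcomes, weight by distorted tail
        # probabilities rather than by the raw probabilities themselves.
        order = np.argsort(outcomes)
        x = np.asarray(outcomes, dtype=float)[order]
        p = np.asarray(probs, dtype=float)[order]
        tail = np.clip(np.cumsum(p[::-1])[::-1], 0.0, 1.0)  # P(X >= x_i)
        w = tk_weight(tail, gamma)
        pi = w - np.append(w[1:], 0.0)                       # decision weights
        return float(np.dot(x, pi))
    ```

    With gamma = 1 the weighting is the identity and the value collapses to the ordinary expectation; with gamma < 1 the criterion is nonlinear in the probabilities, which is the kind of distortion that, in general, prevents a CPT-based criterion from decomposing recursively.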