
    Some new results on sample path optimality in ergodic control of diffusions

    We present some new results on sample path optimality for the ergodic control problem of a class of non-degenerate diffusions controlled through the drift. The hypothesis most often used in the literature to ensure the existence of an a.s. sample path optimal stationary Markov control requires finite second moments of the first hitting times $\tau$ of bounded domains over all admissible controls. We show that this can be considerably weakened: $\mathrm{E}[\tau^2]$ may be replaced with $\mathrm{E}[\tau\ln^+(\tau)]$, thus reducing the required rate of convergence of averages from polynomial to logarithmic. A Foster-Lyapunov condition which guarantees this is also exhibited. Moreover, we study a large class of models that are neither uniformly stable nor have a near-monotone running cost, and we exhibit sufficient conditions for the existence of a sample path optimal stationary Markov control. Comment: 10 pages.
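
    In symbols (writing $\tau$ for the first hitting time of a bounded domain and taking the supremum over admissible controls $U$, as the abstract describes; the precise uniformity and domain assumptions are those of the paper and are not reproduced here), the relaxation replaces

    \[ \sup_{U}\, \mathrm{E}^{U}\big[\tau^{2}\big] < \infty \qquad \text{with} \qquad \sup_{U}\, \mathrm{E}^{U}\big[\tau \ln^{+}\!\tau\big] < \infty . \]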

    QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds

    This paper presents a discrete-time option pricing model that is rooted in Reinforcement Learning (RL), and more specifically in the famous Q-Learning method of RL. We construct a risk-adjusted Markov Decision Process for a discrete-time version of the classical Black-Scholes-Merton (BSM) model, where the option price is an optimal Q-function, while the optimal hedge is the second argument of this optimal Q-function, so that both the price and the hedge are parts of the same formula. Pricing is done by learning to dynamically optimize risk-adjusted returns for an option replicating portfolio, as in Markowitz portfolio theory. Using Q-Learning and related methods, once created in a parametric setting, the model is able to go model-free and learn to price and hedge an option directly from data, without an explicit model of the world. This suggests that RL may provide efficient data-driven and model-free methods for optimal pricing and hedging of options once we depart from the academic continuous-time limit, and, vice versa, that option pricing methods developed in Mathematical Finance may be viewed as special cases of model-based Reinforcement Learning. Further, due to the simplicity and tractability of our model, which only needs basic linear algebra (plus Monte Carlo simulation if we work with synthetic data), and its close relation to the original BSM model, we suggest that our model could be used for benchmarking different RL algorithms for financial trading applications. Comment: 30 pages (minor changes in the presentation, updated references).
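
    As a rough, self-contained illustration of the data-driven, risk-adjusted flavour of such an approach (this is a local quadratic-hedging sketch on simulated BSM paths, not the paper's QLBS algorithm; all parameter values and the polynomial-regression design are assumptions made here for brevity):

import numpy as np

def bsm_paths(s0=100.0, mu=0.05, sigma=0.2, T=1.0, n_steps=24, n_paths=20_000, seed=0):
    """Simulate discrete-time geometric Brownian motion paths of the underlying."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    z = rng.standard_normal((n_paths, n_steps))
    log_inc = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
    levels = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(log_inc, axis=1)], axis=1)
    return s0 * np.exp(levels)

def conditional_fit(x, y, degree=2):
    """Least-squares estimate of E[y | x] using a polynomial basis in x."""
    basis = np.vander(x, degree + 1)
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return basis @ coef

def quadratic_hedge_price(paths, strike=100.0, r=0.0, T=1.0, risk_aversion=1e-3):
    """Backward recursion for a variance-penalized valuation of a European call.
    At each step the hedge ratio minimizes the conditional variance of the
    one-step hedged P&L (local quadratic hedging); the time-0 value is the
    mean hedged payoff plus a crude variance charge."""
    n_paths, n_times = paths.shape
    dt = T / (n_times - 1)
    disc = np.exp(-r * dt)
    value = np.maximum(paths[:, -1] - strike, 0.0)        # payoff at maturity
    for t in range(n_times - 2, -1, -1):
        s = paths[:, t]
        ds = paths[:, t + 1] - s
        v_next = disc * value
        e_ds, e_v = conditional_fit(s, ds), conditional_fit(s, v_next)
        cov = conditional_fit(s, (ds - e_ds) * (v_next - e_v))
        var = np.maximum(conditional_fit(s, (ds - e_ds) ** 2), 1e-12)
        hedge = cov / var                                  # state-dependent hedge ratio
        value = v_next - hedge * (ds - e_ds)               # hedged continuation value
    return value.mean() + risk_aversion * value.var()

print(f"variance-penalized call value: {quadratic_hedge_price(bsm_paths()):.2f}")

    The hedge at each step is the conditional least-squares minimizer of the one-step hedged variance, and the time-0 value adds a simple variance penalty as a stand-in for a risk-adjusted return.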

    On Bellman's principle with inequality constraints

    We consider an example by Haviv (1996) of a constrained Markov decision process that, in some sense, violates Bellman's principle. We resolve this issue by showing how to preserve a form of Bellman's principle that accounts for a change of constraint at states that are reachable from the initial state.
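
    For orientation, a constrained MDP in its standard form seeks a policy that maximizes expected reward subject to a bound on an expected auxiliary cost; it is the coupling across states introduced by the constraint that can break a naive state-by-state Bellman recursion. In generic notation (introduced here, not Haviv's):

    \[ \max_{\pi}\ \mathbb{E}^{\pi}\Big[\sum_{t \ge 0} \gamma^{t} r(s_t, a_t)\Big] \quad \text{subject to} \quad \mathbb{E}^{\pi}\Big[\sum_{t \ge 0} \gamma^{t} c(s_t, a_t)\Big] \le d . \]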

    A Distributional Perspective on Reinforcement Learning

    In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning, which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter. We then use the distributional perspective to design a new algorithm which applies Bellman's equation to the learning of approximate value distributions. We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning. Finally, we combine theoretical and empirical evidence to highlight the ways in which the value distribution impacts learning in the approximate setting. Comment: ICML 2017.
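
    The key computational step in a categorical approximation of the value distribution is projecting the Bellman-updated return distribution back onto a fixed finite support. A minimal NumPy sketch of that projection follows (the support bounds, atom count, and batch interface are assumptions chosen here for illustration):

import numpy as np

def categorical_projection(rewards, dones, next_probs, v_min=-10.0, v_max=10.0,
                           n_atoms=51, gamma=0.99):
    """Project the distributional Bellman target r + gamma * z onto a fixed
    support of n_atoms evenly spaced atoms, splitting probability mass
    between the two nearest atoms."""
    atoms = np.linspace(v_min, v_max, n_atoms)            # support z_1, ..., z_N
    delta_z = (v_max - v_min) / (n_atoms - 1)
    batch = rewards.shape[0]
    projected = np.zeros((batch, n_atoms))
    for b in range(batch):
        for j in range(n_atoms):
            # Bellman-updated atom, clipped back into the support
            tz = np.clip(rewards[b] + (1.0 - dones[b]) * gamma * atoms[j], v_min, v_max)
            pos = (tz - v_min) / delta_z                  # fractional index on the support
            lo, hi = int(np.floor(pos)), int(np.ceil(pos))
            if lo == hi:                                  # lands exactly on an atom
                projected[b, lo] += next_probs[b, j]
            else:                                         # split the mass between neighbours
                projected[b, lo] += next_probs[b, j] * (hi - pos)
                projected[b, hi] += next_probs[b, j] * (pos - lo)
    return atoms, projected

# Example: a single transition with reward 1.0 and a uniform next-state distribution.
atoms, target = categorical_projection(np.array([1.0]), np.array([0.0]),
                                       np.full((1, 51), 1.0 / 51))

    The predicted distribution for a state-action pair would then be trained toward the projected target, e.g. by minimizing a cross-entropy loss.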

    Reinforcement Learning

    Reinforcement learning (RL) is a general framework for adaptive control, which has proven to be efficient in many domains, e.g., board games, video games or autonomous vehicles. In such problems, an agent faces a sequential decision-making problem where, at every time step, it observes its state, performs an action, receives a reward and moves to a new state. An RL agent learns by trial and error a good policy (or controller) based on observations and numeric reward feedback on the previously performed action. In this chapter, we present the basic framework of RL and recall the two main families of approaches that have been developed to learn a good policy. The first, which is value-based, consists of estimating the value of an optimal policy, from which a policy can then be recovered, while the other, called policy search, works directly in a policy space. Actor-critic methods can be seen as a policy search technique in which a learned value function guides the policy improvement. Besides, we give an overview of some extensions of the standard RL framework, notably when risk-averse behavior needs to be taken into account or when rewards are not available or not known. Comment: Chapter in "A Guided Tour of Artificial Intelligence Research", Springer.
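
    As a concrete instance of the value-based family mentioned above, here is a minimal tabular Q-learning sketch (the env.reset()/env.step() interface returning (next_state, reward, done) is an assumption made for this example, not an API from the chapter):

import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=500,
                       alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Value-based RL in its simplest form: learn Q(s, a) by trial and error
    with an epsilon-greedy behaviour policy, then read off a greedy policy."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (exploration vs. exploitation)
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, done = env.step(action)
            # one-step temporal-difference update toward r + gamma * max_a' Q(s', a')
            target = reward + (0.0 if done else gamma * np.max(q[next_state]))
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
    return q.argmax(axis=1), q   # greedy policy and learned action values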

    Least Inferable Policies for Markov Decision Processes

    In a variety of applications, an agent's success depends on the knowledge that an adversarial observer has, or can gather, about the agent's decisions. It is therefore desirable for the agent to achieve a task while reducing the ability of an observer to infer the agent's policy. We consider the task of the agent as a reachability problem in a Markov decision process and study the synthesis of policies that minimize the observer's ability to infer the transition probabilities of the agent between the states of the Markov decision process. We introduce a metric based on the Fisher information as a proxy for the information leaked to the observer and, using this metric, formulate a problem that minimizes the expected total information subject to the reachability constraint. We proceed to solve the problem using convex optimization methods. To verify the proposed method, we analyze the relationship between the expected total information and the estimation error of the observer, and show that, for a particular class of Markov decision processes, these two values are inversely proportional.
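
    Schematically, the optimization described above has the shape of a constrained policy-synthesis problem; in placeholder notation introduced here (per-step information proxy $\phi$, target set $G$, reachability threshold $\delta$), not the paper's:

    \[ \min_{\pi}\ \mathbb{E}^{\pi}\Big[\sum_{t=0}^{T} \phi(s_t, a_t)\Big] \quad \text{subject to} \quad \mathbb{P}^{\pi}\big(\exists\, t \le T:\ s_t \in G\big) \ge \delta . \]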

    Optimal Sensing and Data Estimation in a Large Sensor Network

    An energy-efficient use of large-scale sensor networks necessitates activating a subset of possible sensors for estimation at a fusion center. The problem is inherently combinatorial; to this end, a set of iterative, randomized algorithms is developed for sensor subset selection by exploiting the underlying statistics. Gibbs sampling-based methods are designed to optimize the estimation error and the mean number of activated sensors. The optimality of the proposed strategies is proven, along with guarantees on their convergence speeds. Also, another new algorithm exploiting stochastic approximation in conjunction with Gibbs sampling is derived for a constrained version of the sensor selection problem. The methodology is extended to the scenario where the fusion center has access to only a parametric form of the joint statistics, but not the true underlying distribution. Therein, expectation-maximization is effectively employed to learn the distribution. Strategies for i.i.d. time-varying data are also outlined. Numerical results show that the proposed methods converge very quickly to the respective optimal solutions, and therefore can be employed for optimal sensor subset selection in practical sensor networks. Comment: 9 pages.
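
    A bare-bones sketch of the Gibbs-sampling idea for subset selection (the cost function, inverse temperature, and sweep count below are placeholders chosen for illustration; the paper's algorithms, optimality proofs, and convergence guarantees are not reproduced here):

import numpy as np

def gibbs_sensor_selection(cost, n_sensors, n_sweeps=200, beta=2.0, seed=0):
    """Gibbs sampler over binary activation vectors x in {0, 1}^n_sensors.
    cost(x) is a user-supplied objective, e.g. estimation error plus a
    penalty on the number of active sensors; lower values are better."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=n_sensors)
    for _ in range(n_sweeps):
        for i in range(n_sensors):
            x_on, x_off = x.copy(), x.copy()
            x_on[i], x_off[i] = 1, 0
            # Conditional distribution of bit i given the rest (Boltzmann form)
            p_on = 1.0 / (1.0 + np.exp(-beta * (cost(x_off) - cost(x_on))))
            x[i] = 1 if rng.random() < p_on else 0
    return x

# Hypothetical objective: a decaying estimation-error term plus an activation penalty.
best = gibbs_sensor_selection(lambda x: 1.0 / (1.0 + x.sum()) + 0.1 * x.sum(),
                              n_sensors=20)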

    Optimal control of uncertain stochastic systems with Markovian switching and its applications to portfolio decisions

    This paper first describes a class of uncertain stochastic control systems with Markovian switching, and derives an Itô-Liu formula for Markov-modulated processes. We then characterize an optimal control law, which satisfies the generalized Hamilton-Jacobi-Bellman (HJB) equation with Markovian switching. Using the generalized HJB equation, we deduce the optimal consumption and portfolio policies in uncertain stochastic financial markets with Markovian switching. Finally, for constant relative risk-aversion (CRRA) felicity functions, we obtain the optimal consumption and portfolio policies explicitly. Moreover, we provide an economic analysis through numerical examples. Comment: 21 pages, 2 figures.
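
    For reference, the CRRA felicity function referred to above is $u(c) = c^{1-\gamma}/(1-\gamma)$ for $\gamma > 0$, $\gamma \neq 1$ (with $u(c) = \ln c$ in the limit $\gamma = 1$). In the classical single-regime Merton problem this utility gives the constant portfolio weight $\pi^* = (\mu - r)/(\gamma \sigma^2)$; the uncertain-stochastic, regime-switching setting of the paper generalizes this benchmark, and its explicit policies are not reproduced here.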

    QoE-aware Media Streaming in Technology and Cost Heterogeneous Networks

    We present a framework for studying the problem of media streaming in technology- and cost-heterogeneous environments. We first address the problem of efficient streaming in a technology-heterogeneous setting. We employ random linear network coding to simplify the packet selection strategies and alleviate issues such as duplicate packet reception. Then, we study the problem of media streaming from multiple cost-heterogeneous access networks. Our objective is to characterize analytically the trade-off between access cost and user experience. We model the Quality of user Experience (QoE) as the probability of interruption in playback as well as the initial waiting time. We design and characterize various control policies, and formulate the optimal control problem using a Markov Decision Process (MDP) with a probabilistic constraint. We present a characterization of the optimal policy using the Hamilton-Jacobi-Bellman (HJB) equation. For a fluid approximation model, we provide an exact and explicit characterization of a threshold policy and prove its optimality using the HJB equation. Our simulation results show that, under a properly designed control policy, the existence of an alternative access technology as a complement to a primary access network can significantly improve the user experience without any bandwidth over-provisioning. Comment: submitted to IEEE Transactions on Information Theory. arXiv admin note: substantial text overlap with arXiv:1004.352
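
    To make the threshold-policy idea concrete, here is a toy fluid-buffer simulation (all rates, costs, availability probabilities, and the specific threshold below are hypothetical values chosen for illustration; the paper's HJB-based optimality proof is not reproduced):

import numpy as np

def simulate_threshold_policy(n_steps=100_000, dt=0.01, threshold=2.0,
                              rate_primary=1.2, p_primary_on=0.8,
                              rate_secondary=0.6, playback_rate=1.0,
                              cost_secondary=1.0, buffer0=1.0, seed=0):
    """Fluid-buffer sketch of a threshold policy: the costlier secondary
    network is switched on only while the playback buffer is below
    `threshold`. The primary network is intermittently available (a crude
    stand-in for channel randomness). Returns the number of playback
    interruptions and the total secondary-access cost."""
    rng = np.random.default_rng(seed)
    buffer, interruptions, cost = buffer0, 0, 0.0
    for _ in range(n_steps):
        use_secondary = buffer < threshold
        primary_on = rng.random() < p_primary_on
        inflow = (rate_primary if primary_on else 0.0) + \
                 (rate_secondary if use_secondary else 0.0)
        if use_secondary:
            cost += cost_secondary * dt
        buffer += (inflow - playback_rate) * dt
        if buffer <= 0.0:                 # playback stalls: count an interruption
            interruptions += 1
            buffer = 0.0
    return interruptions, cost

print(simulate_threshold_policy())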

    A Time Consistent Formulation of Risk Constrained Stochastic Optimal Control

    Time consistency is an essential requirement for making rational decisions in risk-sensitive optimal control problems. An optimization problem is time consistent if its solution policy does not depend on the sequence of times at which the problem is solved. On the other hand, a dynamic risk measure is time consistent if, whenever an outcome is considered less risky in the future, it is also considered less risky at the current stage. In this paper, we study the time consistency of risk-constrained problems in which the risk metric is itself time consistent. From the Bellman optimality condition in [1], we establish an analytical "risk-to-go" that results in a time-consistent optimal policy. Finally, we demonstrate the effectiveness of the analytical solution by solving Haviv's counterexample [2] of time-inconsistent planning.
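
    A standard way to formalize time consistency of a dynamic risk measure (the usual nested construction, not necessarily the exact form used in the paper) composes one-step conditional risk mappings recursively:

    \[ \rho_{t,T}(Z_t, \dots, Z_T) \;=\; Z_t + \rho_t\Big( Z_{t+1} + \rho_{t+1}\big( Z_{t+2} + \cdots + \rho_{T-1}(Z_T) \big) \Big), \]

    so that an outcome judged less risky from time $t+1$ onward is automatically judged less risky at time $t$ as well.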