Some new results on sample path optimality in ergodic control of diffusions
We present some new results on sample path optimality for the ergodic control
problem of a class of non-degenerate diffusions controlled through the drift.
The hypothesis most often used in the literature to ensure the existence of an
a.s. sample path optimal stationary Markov control requires finite second
moments of the first hitting times of bounded domains over all
admissible controls. We show that this can be considerably weakened: the second moment $E[\tau^2]$ of the hitting time $\tau$ may be replaced with $E[\tau \log^+ \tau]$, thus reducing
the required rate of convergence of averages from polynomial to logarithmic. A
Foster-Lyapunov condition which guarantees this is also exhibited. Moreover, we
study a large class of models that are neither uniformly stable, nor have a
near-monotone running cost, and we exhibit sufficient conditions for the
existence of a sample path optimal stationary Markov control. Comment: 10 pages
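For context, Foster-Lyapunov conditions of the kind alluded to above are typically stated as a drift inequality on the extended generator; a generic form (our notation, not necessarily the paper's exact condition) is:

```latex
% Generic Foster--Lyapunov drift condition for a controlled diffusion with
% extended generator \mathcal{L}^u (illustrative notation only):
\exists\, V \in \mathcal{C}^{2}(\mathbb{R}^{d}),\ V \ge 1:\qquad
\mathcal{L}^{u} V(x) \;\le\; \kappa\,\mathbf{1}_{K}(x) \;-\; c\,h(x)
\quad \text{for all admissible } u,
```

where K is compact and kappa, c > 0; the growth of h relative to V governs which moments of the hitting times are finite, and hence the attainable rate of convergence of the averages.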
QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds
This paper presents a discrete-time option pricing model that is rooted in
Reinforcement Learning (RL), and more specifically in the famous Q-Learning
method of RL. We construct a risk-adjusted Markov Decision Process for a
discrete-time version of the classical Black-Scholes-Merton (BSM) model, where
the option price is an optimal Q-function, while the optimal hedge is a second
argument of this optimal Q-function, so that both the price and hedge are parts
of the same formula. Pricing is done by learning to dynamically optimize
risk-adjusted returns for an option replicating portfolio, as in the Markowitz
portfolio theory. Once created in a parametric setting, the model can use Q-Learning and related methods to go model-free and learn to price and hedge an option directly from data, without an explicit model of the world.
This suggests that RL may provide efficient data-driven and model-free methods
for optimal pricing and hedging of options, once we depart from the academic
continuous-time limit; conversely, option pricing methods developed in
Mathematical Finance may be viewed as special cases of model-based
Reinforcement Learning. Further, due to simplicity and tractability of our
model which only needs basic linear algebra (plus Monte Carlo simulation, if we
work with synthetic data), and its close relation to the original BSM model, we
suggest that our model could be used for benchmarking of different RL
algorithms for financial trading applications. Comment: 30 pages (minor changes in the presentation, updated references)
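To make the "basic linear algebra" claim concrete, here is a minimal sketch (our own toy code, not the paper's full backward recursion) of the cross-sectional step such schemes rely on: the one-step variance-minimizing hedge is the least-squares slope of the next-step portfolio value on the stock price change.

```python
import numpy as np

def one_step_hedge(delta_S, Pi_next):
    """Variance-minimizing hedge for one time step (illustrative sketch).

    delta_S : simulated stock price changes S_{t+1} - S_t across paths
    Pi_next : next-step replicating-portfolio values on the same paths
    Returns Cov(Pi_next, delta_S) / Var(delta_S), i.e. the least-squares
    slope of Pi_next regressed on delta_S.
    """
    dS = delta_S - delta_S.mean()
    return np.dot(dS, Pi_next - Pi_next.mean()) / np.dot(dS, dS)

# toy usage on synthetic one-week lognormal moves for an at-the-money call
rng = np.random.default_rng(0)
S_t = 100.0
dS = S_t * (np.exp(0.2 * np.sqrt(1 / 52) * rng.standard_normal(10_000)) - 1)
Pi_next = np.maximum(S_t + dS - 100.0, 0.0)   # stand-in for Pi_{t+1}
print(one_step_hedge(dS, Pi_next))            # roughly the ATM delta, ~0.5
```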
On Bellman's principle with inequality constraints
We consider an example by Haviv (1996) of a constrained Markov decision
process that, in some sense, violates Bellman's principle. We resolve this
issue by showing how to preserve a form of Bellman's principle that accounts
for a change of constraint at states that are reachable from the initial state.
A Distributional Perspective on Reinforcement Learning
In this paper we argue for the fundamental importance of the value
distribution: the distribution of the random return received by a reinforcement
learning agent. This is in contrast to the common approach to reinforcement
learning which models the expectation of this return, or value. Although there
is an established body of literature studying the value distribution, thus far
it has always been used for a specific purpose such as implementing risk-aware
behaviour. We begin with theoretical results in both the policy evaluation and
control settings, exposing a significant distributional instability in the
latter. We then use the distributional perspective to design a new algorithm
which applies Bellman's equation to the learning of approximate value
distributions. We evaluate our algorithm using the suite of games from the
Arcade Learning Environment. We obtain both state-of-the-art results and
anecdotal evidence demonstrating the importance of the value distribution in
approximate reinforcement learning. Finally, we combine theoretical and
empirical evidence to highlight the ways in which the value distribution
impacts learning in the approximate setting. Comment: ICML 2017
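The algorithm's core step in the categorical case can be sketched as follows (a minimal illustration with our own naming; the paper presents this within a full deep-RL agent): the distributional Bellman target r + gamma*z is projected back onto a fixed, evenly spaced support of atoms.

```python
import numpy as np

def categorical_bellman_target(p_next, r, gamma, z):
    """Project the distributional Bellman target onto fixed atoms (sketch).

    p_next : probabilities over the atoms z for the next state-action pair
    r, gamma : reward and discount factor
    z : evenly spaced support z_0 < ... < z_{N-1}
    Returns the projected target probabilities over z.
    """
    v_min, v_max = z[0], z[-1]
    dz = z[1] - z[0]
    tz = np.clip(r + gamma * z, v_min, v_max)   # shifted/scaled atoms
    b = (tz - v_min) / dz                       # fractional index of each atom
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    m = np.zeros_like(p_next)
    # distribute each atom's mass between its two neighbouring support points
    for j in range(len(z)):
        if lower[j] == upper[j]:                # lands exactly on an atom
            m[lower[j]] += p_next[j]
        else:
            m[lower[j]] += p_next[j] * (upper[j] - b[j])
            m[upper[j]] += p_next[j] * (b[j] - lower[j])
    return m
```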
Reinforcement Learning
Reinforcement learning (RL) is a general framework for adaptive control,
which has proven to be efficient in many domains, e.g., board games, video
games or autonomous vehicles. In such problems, an agent faces a sequential
decision-making problem where, at every time step, it observes its state,
performs an action, receives a reward and moves to a new state. An RL agent
learns by trial and error a good policy (or controller) based on observations
and numeric reward feedback on the previously performed action. In this
chapter, we present the basic framework of RL and recall the two main families
of approaches that have been developed to learn a good policy. The first, which is value-based, consists in estimating the value of an optimal policy, from which a policy can be recovered, while the other, called policy search, works directly in the policy space. Actor-critic methods can be seen as a
policy search technique where the policy value that is learned guides the
policy improvement. Besides, we give an overview of some extensions of the
standard RL framework, notably when risk-averse behavior needs to be taken into
account or when rewards are not available or not known. Comment: Chapter in "A Guided Tour of Artificial Intelligence Research", Springer
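As a concrete instance of the value-based family, a tabular Q-learning loop fits in a few lines (a generic sketch assuming a Gym-style environment interface; all names are ours):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning: learn Q(s, a) by trial and error (sketch).

    `env` is assumed to expose reset() -> s and step(a) -> (s', r, done),
    i.e. a Gym-style interface.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # temporal-difference update toward r + gamma * max_a' Q(s', a')
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q  # a greedy policy is recovered as Q.argmax(axis=1)
```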
Least Inferable Policies for Markov Decision Processes
In a variety of applications, an agent's success depends on the knowledge
that an adversarial observer has or can gather about the agent's decisions. It
is therefore desirable for the agent to achieve a task while reducing the
ability of an observer to infer the agent's policy. We consider the task of the
agent as a reachability problem in a Markov decision process and study the
synthesis of policies that minimize the observer's ability to infer the
transition probabilities of the agent between the states of the Markov decision
process. We introduce a metric that is based on the Fisher information as a
proxy for the information leaked to the observer and, using this metric,
formulate a problem that minimizes expected total information subject to the
reachability constraint. We proceed to solve the problem using convex
optimization methods. To verify the proposed method, we analyze the
relationship between the expected total information and the estimation error of
the observer, and show that, for a particular class of Markov decision
processes, these two values are inversely proportional.
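The claimed inverse relation between leaked information and estimation error is the Cramér-Rao phenomenon; as a toy check (our own illustration for a single Bernoulli transition probability, not the paper's setting), the variance of the observer's maximum-likelihood estimate tracks the reciprocal of the Fisher information:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000   # transition samples available to the observer

for p in (0.1, 0.3, 0.5):
    fisher = 1.0 / (p * (1.0 - p))          # Fisher information of Bernoulli(p)
    # Monte Carlo variance of the observer's maximum-likelihood estimate
    p_hat = rng.binomial(n, p, size=20_000) / n
    print(f"p={p}: 1/(n*I(p)) = {1 / (n * fisher):.2e}, "
          f"MLE variance = {p_hat.var():.2e}")   # the two agree closely
```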
Optimal Sensing and Data Estimation in a Large Sensor Network
An energy efficient use of large scale sensor networks necessitates
activating a subset of possible sensors for estimation at a fusion center. The
problem is inherently combinatorial; to this end, a set of iterative,
randomized algorithms is developed for sensor subset selection by exploiting
the underlying statistics. Gibbs sampling-based methods are designed to
optimize the estimation error and the mean number of activated sensors. The
optimality of the proposed strategies is proven, along with guarantees on their convergence speeds. In addition, a new algorithm exploiting stochastic
approximation in conjunction with Gibbs sampling is derived for a constrained
version of the sensor selection problem. The methodology is extended to the
scenario where the fusion center has access to only a parametric form of the
joint statistics, but not the true underlying distribution. Therein,
expectation-maximization is effectively employed to learn the distribution.
Strategies for i.i.d. time-varying data are also outlined. Numerical results show
that the proposed methods converge very fast to the respective optimal
solutions, and therefore can be employed for optimal sensor subset selection in
practical sensor networks. Comment: 9 pages
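A Gibbs sampler of the flavour described can be skeletonized as follows (an illustrative sketch with an invented cost function; the paper's objective and guarantees are more involved): each sensor's activation bit is resampled in turn from its conditional Boltzmann distribution, so that low-cost subsets dominate as the inverse temperature beta grows.

```python
import numpy as np

def gibbs_sensor_selection(cost, n_sensors, beta=2.0, sweeps=200, seed=0):
    """Sample activation vectors x in {0,1}^n from p(x) ~ exp(-beta * cost(x)).

    `cost(x)` should blend estimation error and activation count, e.g.
    cost(x) = mse(x) + lam * x.sum().
    """
    rng = np.random.default_rng(seed)
    x = rng.integers(2, size=n_sensors)
    for _ in range(sweeps):
        for i in range(n_sensors):            # resample one coordinate at a time
            x0, x1 = x.copy(), x.copy()
            x0[i], x1[i] = 0, 1
            # conditional probability of switching sensor i on
            p_on = 1.0 / (1.0 + np.exp(-beta * (cost(x0) - cost(x1))))
            x[i] = int(rng.random() < p_on)
    return x

# toy cost: estimation error falls with coverage, plus a per-sensor energy price
toy_cost = lambda x: 1.0 / (1.0 + x.sum()) + 0.05 * x.sum()
print(gibbs_sensor_selection(toy_cost, n_sensors=10))
```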
Optimal control of uncertain stochastic systems with Markovian switching and its applications to portfolio decisions
This paper first describes a class of uncertain stochastic control systems
with Markovian switching, and derives an Itô-Liu formula for Markov-modulated processes. We then characterize an optimal control law, which satisfies the
generalized Hamilton-Jacobi-Bellman (HJB) equation with Markovian switching.
Then, by using the generalized HJB equation, we deduce the optimal consumption
and portfolio policies under uncertain stochastic financial markets with
Markovian switching. Finally, for constant relative risk-aversion (CRRA)
felicity functions, we explicitly obtain the optimal consumption and portfolio
policies. Moreover, we provide an economic analysis through numerical examples. Comment: 21 pages, 2 figures
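For orientation, in the classical single-regime Merton problem with CRRA felicity the optimal portfolio fraction is the well-known constant below; under Markovian switching the market coefficients, and hence the fraction, depend on the current regime i (a standard benchmark stated for context, not the paper's derivation):

```latex
% Merton fraction under CRRA felicity; with regime switching the market
% coefficients (\mu_i, r_i, \sigma_i) depend on the Markov state i.
\pi^{*}(i) \;=\; \frac{\mu_i - r_i}{\gamma\,\sigma_i^{2}},
\qquad u(c) = \frac{c^{1-\gamma}}{1-\gamma},\quad \gamma > 0,\ \gamma \neq 1 .
```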
QoE-aware Media Streaming in Technology and Cost Heterogeneous Networks
We present a framework for studying the problem of media streaming in
technology and cost heterogeneous environments. We first address the problem of
efficient streaming in a technology-heterogeneous setting. We employ random
linear network coding to simplify the packet selection strategies and alleviate
issues such as duplicate packet reception. Then, we study the problem of media
streaming from multiple cost-heterogeneous access networks. Our objective is to
characterize analytically the trade-off between access cost and user
experience. We model the Quality of user Experience (QoE) as the probability of
interruption in playback as well as the initial waiting time. We design and
characterize various control policies, and formulate the optimal control
problem using a Markov Decision Process (MDP) with a probabilistic constraint.
We present a characterization of the optimal policy using the
Hamilton-Jacobi-Bellman (HJB) equation. For a fluid approximation model, we
provide an exact and explicit characterization of a threshold policy and prove
its optimality using the HJB equation.
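The flavour of such a threshold policy is easy to simulate (a toy sketch with invented parameters and dynamics, not the paper's fluid model): playback begins once the buffer first reaches the threshold, and the interruption probability is estimated by Monte Carlo.

```python
import numpy as np

def interruption_prob(arrival_rate, play_rate, threshold, horizon=2_000,
                      runs=5_000, seed=0):
    """Estimate P(playback interruption) for a start-threshold policy (sketch).

    Packets arrive as a Poisson stream; once the buffer first reaches
    `threshold`, playback drains it at `play_rate` per slot, and an
    interruption occurs if the buffer empties before the horizon.
    """
    rng = np.random.default_rng(seed)
    interruptions = 0
    for _ in range(runs):
        buf, playing = 0.0, False
        for _ in range(horizon):
            buf += rng.poisson(arrival_rate)
            if not playing and buf >= threshold:
                playing = True                 # initial waiting ends here
            if playing:
                buf -= play_rate
                if buf < 0:
                    interruptions += 1
                    break
    return interruptions / runs

# larger thresholds trade longer initial waiting for fewer interruptions
print(interruption_prob(arrival_rate=1.0, play_rate=1.0, threshold=20))
```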
Our simulation results show that, under a properly designed control policy, the existence of an alternative access technology as a complement to a primary access network can significantly improve the user experience without any bandwidth
over-provisioning. Comment: submitted to IEEE Transactions on Information Theory. arXiv admin
note: substantial text overlap with arXiv:1004.352
A Time Consistent Formulation of Risk Constrained Stochastic Optimal Control
Time consistency is an essential requirement for rational decision-making in risk-sensitive optimal control problems. An optimization problem is time consistent if its solution policy does not depend on the time at which the optimization problem is solved. A dynamic risk measure, in turn, is time consistent if an outcome that is considered less risky in the future is also considered less risky at the current stage.
In this paper, we study the time consistency of risk-constrained problems in which the risk metric is itself time consistent. Building on the Bellman optimality condition in [1], we establish an analytical "risk-to-go" that results in a time-consistent optimal policy. Finally, we demonstrate the effectiveness of the analytical solution by solving Haviv's counterexample [2] in time-inconsistent planning.
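For reference, the time-consistent dynamic risk measures referred to here are usually built as nested compositions of one-step conditional risk mappings (the standard construction, stated for context):

```latex
% Nested composition of one-step conditional risk mappings \rho_t over
% stage costs c_t; this is the standard time-consistent construction.
\rho_{t,T}(c_t,\dots,c_T) \;=\;
  c_t \;+\; \rho_t\!\Bigl(c_{t+1} \;+\; \rho_{t+1}\bigl(c_{t+2} + \cdots
  + \rho_{T-1}(c_T)\bigr)\Bigr),
```

which ensures precisely the property in the abstract: if one outcome is less risky than another from stage t+1 onward, it remains less risky when evaluated at stage t.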