Markov Decision Problems Where Means Bound Variances
We identify a rich class of finite-horizon Markov decision problems (MDPs) for which the variance of the optimal total reward can be bounded by a simple linear function of its expected value. The class is characterized by three natural properties: reward nonnegativity and boundedness, existence of a do-nothing action, and optimal action monotonicity. These properties are commonly present and typically easy to check. Implications of the class properties and of the variance bound are illustrated by examples of MDPs from operations research, operations management, financial engineering, and combinatorial optimization.
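Schematically, and in notation chosen here for illustration rather than drawn from the paper, the kind of bound in question can be written as follows, where R*_n denotes the optimal total reward of an n-period problem whose one-period rewards lie in [0, K]:

```latex
% Schematic form only; the paper's exact constant and hypotheses
% differ in detail. One-period rewards are assumed to lie in [0, K].
\[
  \operatorname{Var}\!\left(R^{*}_{n}\right) \;\le\; K\,\mathbb{E}\!\left[R^{*}_{n}\right],
  \qquad\text{so that}\qquad
  \frac{\sqrt{\operatorname{Var}(R^{*}_{n})}}{\mathbb{E}[R^{*}_{n}]}
  \;\le\; \sqrt{\frac{K}{\mathbb{E}[R^{*}_{n}]}}.
\]
```

A bound of this shape forces the coefficient of variation to shrink as the expected reward grows, which is what makes the optimal total reward concentrate around its mean.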
State-Augmentation Transformations for Risk-Sensitive Reinforcement Learning
In the MDP framework, the general reward function takes three arguments (current state, action, and successor state), but it is often simplified to a function of two arguments (current state and action). The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective involves only the expected cumulative reward, this simplification works perfectly. However, when the objective is risk-sensitive, it leads to an incorrect value. We present state-augmentation transformations (SATs), which preserve the reward sequences as well as the reward distributions and the optimal policy in risk-sensitive reinforcement learning. In risk-sensitive scenarios, we first prove that, for every MDP with a stochastic transition-based reward function, there exists an MDP with a deterministic state-based reward function such that, for any given (randomized) policy for the first MDP, there exists a corresponding policy for the second MDP under which both Markov reward processes share the same reward sequence. Second, we illustrate two situations in an inventory control problem that require the proposed SATs: applying Q-learning (or other learning methods) to MDPs with transition-based reward functions, and applying methods designed for Markov processes with deterministic state-based reward functions to Markov processes with general reward functions. We show the advantage of the SATs by considering Value-at-Risk as an example, a risk measure on the reward distribution itself rather than on summary statistics (such as mean and variance) of the distribution. We illustrate the error in the reward-distribution estimate that arises from the direct use of Q-learning, and show how the SATs enable a variance formula to work on Markov processes with general reward functions.
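As a rough illustration of the idea behind such a transformation (a minimal sketch under assumed naming, covering only a deterministic transition-based reward rather than the paper's general stochastic case), the realized transition can be folded into the state so that the reward becomes a deterministic function of the augmented state alone:

```python
# Minimal sketch of a state-augmentation transformation (SAT).
# Data layout and names are assumptions for illustration; the paper
# also treats stochastic rewards, which would require augmenting the
# state with the realized reward as well.

def augment(P, r, s0):
    """Turn an MDP with transition-based reward r(s, a, s_next) into
    an equivalent MDP whose reward depends only on the current
    (augmented) state.

    P : dict mapping (s, a) -> {s_next: prob}
    r : function (s, a, s_next) -> reward
    s0: initial state of the original MDP
    Returns (P_aug, R_aug, x0).
    """
    x0 = (None, None, s0)          # dummy "history" before the first step
    P_aug, R_aug = {}, {x0: 0.0}
    frontier, seen = [x0], {x0}
    while frontier:
        x = frontier.pop()
        s = x[2]                   # underlying state of the augmented state
        for (s_, a), succ in P.items():
            if s_ != s:
                continue
            dist = {}
            for s_next, p in succ.items():
                x_next = (s, a, s_next)          # remember the transition
                dist[x_next] = p
                R_aug[x_next] = r(s, a, s_next)  # now state-based
                if x_next not in seen:
                    seen.add(x_next)
                    frontier.append(x_next)
            P_aug[(x, a)] = dist
    return P_aug, R_aug, x0

# Example (hypothetical two-state MDP): reward depends on the successor.
P = {("s", "go"): {"s": 0.5, "t": 0.5}, ("t", "go"): {"s": 1.0}}
r = lambda s, a, s2: 1.0 if s2 == "t" else 0.0
P_aug, R_aug, x0 = augment(P, r, "s")
```

Along every sample path the reward sequence is unchanged, which is exactly the preservation property the abstract emphasizes; only the state space grows.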
Variability Sensitive Markov Decision Processes
The time-average Markov decision processes with finite state and action spaces are considered. Several definitions of variability are introduced and compared. It is shown that a stationary policy maximizes one of these criteria, namely, the expected long-run average variability. Furthermore, an algorithm is given which produces such an optimal stationary policy.
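For orientation, one common way a criterion of this kind is formalized (our notation; the paper introduces and compares several definitions, which may differ in detail) is as the long-run average squared deviation of the reward from the policy's average reward:

```latex
% Illustrative formalization only; not necessarily the paper's
% exact definition of "expected long-run average variability".
\[
  \phi(\pi) \;=\; \lim_{n\to\infty} \frac{1}{n}\,
    \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{n} r_t\right],
  \qquad
  V(\pi) \;=\; \lim_{n\to\infty} \frac{1}{n}\,
    \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{n} \bigl(r_t - \phi(\pi)\bigr)^{2}\right].
\]
```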
Trading Performance for Stability in Markov Decision Processes
We study the complexity of central controller synthesis problems for
finite-state Markov decision processes, where the objective is to optimize both
the expected mean-payoff performance of the system and its stability.
We argue that the basic theoretical notion of expressing the stability in
terms of the variance of the mean-payoff (called global variance in our paper)
is not always sufficient, since it ignores possible instabilities on individual
runs. For this reason we propose alternative definitions of stability, which we
call local and hybrid variance, and which express how rewards on each run
deviate from the run's own mean-payoff and from the expected mean-payoff,
respectively.
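In symbols (a sketch in our own notation, following the informal descriptions above), for a run \(\omega\) with reward sequence \(r_0, r_1, \dots\):

```latex
% Sketch only; the paper's formal definitions may differ in details
% such as limsup versus liminf.
\[
  \mathrm{mp}(\omega) \;=\; \limsup_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} r_i,
  \qquad
  \text{global: } \mathbb{E}\bigl[(\mathrm{mp} - \mathbb{E}[\mathrm{mp}])^{2}\bigr],
\]
\[
  \text{local: } \mathbb{E}\!\left[\limsup_{n\to\infty} \frac{1}{n}
    \sum_{i=0}^{n-1} \bigl(r_i - \mathrm{mp}(\omega)\bigr)^{2}\right],
  \qquad
  \text{hybrid: } \mathbb{E}\!\left[\limsup_{n\to\infty} \frac{1}{n}
    \sum_{i=0}^{n-1} \bigl(r_i - \mathbb{E}[\mathrm{mp}]\bigr)^{2}\right].
\]
```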
We show that a strategy ensuring both the expected mean-payoff and the
variance below given bounds requires randomization and memory, under all the
above semantics of variance. We then look at the problem of determining whether
there is such a strategy. For the global variance, we show that the problem
is in PSPACE, and that the answer can be approximated in pseudo-polynomial
time. For the hybrid variance, the analogous decision problem is in NP, and a
polynomial-time approximating algorithm also exists. For local variance, we
show that the decision problem is in NP. Since the overall performance can be
traded for stability (and vice versa), we also present algorithms for
approximating the associated Pareto curve in all three cases.
Finally, we study a special case of the decision problems, where we require a
given expected mean-payoff together with zero variance. Here we show that the
problems can all be solved in polynomial time.
Comment: Extended version of a paper presented at LICS 201
Adaptive Experimental Design with Temporal Interference: A Maximum Likelihood Approach
Suppose an online platform wants to compare a treatment and control policy,
e.g., two different matching algorithms in a ridesharing system, or two
different inventory management algorithms in an online retail site. Standard
randomized controlled trials are typically not feasible, since the goal is to
estimate policy performance on the entire system. Instead, the typical current
practice involves dynamically alternating between the two policies for fixed
lengths of time, and comparing the average performance of each over the
intervals in which they were run as an estimate of the treatment effect.
However, this approach suffers from *temporal interference*: one algorithm
alters the state of the system as seen by the second algorithm, biasing
estimates of the treatment effect. Further, the simple non-adaptive nature of
such designs implies they are not sample efficient.
We develop a benchmark theoretical model in which to study optimal
experimental design for this setting. We view testing the two policies as the
problem of estimating the steady state difference in reward between two unknown
Markov chains (i.e., policies). We assume estimation of the steady state reward
for each chain proceeds via nonparametric maximum likelihood, and search for
consistent (i.e., asymptotically unbiased) experimental designs that are
efficient (i.e., asymptotically minimum variance). Characterizing such designs
is equivalent to a Markov decision problem with a minimum variance objective;
such problems generally do not admit tractable solutions. Remarkably, in our
setting, using a novel application of classical martingale analysis of Markov
chains via Poisson's equation, we characterize efficient designs via a succinct
convex optimization problem. We use this characterization to propose a
consistent, efficient online experimental design that adaptively samples the
two Markov chains.
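For context, the classical martingale analysis via Poisson's equation mentioned above works as follows (standard notation, not specific to this paper): for an ergodic chain with transition kernel P, reward r, and steady-state average reward g, one solves for a bias function f satisfying Poisson's equation, and the corrected partial sums of the centered rewards then form a martingale:

```latex
% Poisson's equation (standard form) and the resulting martingale.
\[
  f(s) + g \;=\; r(s) + \sum_{s'} P(s' \mid s)\, f(s'),
\]
\[
  M_n \;=\; \sum_{t=0}^{n-1} \bigl(r(X_t) - g\bigr) \;+\; f(X_n) - f(X_0)
  \quad\text{is a martingale.}
\]
```

Martingale limit theory applied to \(M_n\) is the standard route to consistency and asymptotic-variance expressions for estimators of g, which is what makes a minimum-variance design criterion tractable here.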