Trading Performance for Stability in Markov Decision Processes
We study the complexity of central controller synthesis problems for
finite-state Markov decision processes, where the objective is to optimize both
the expected mean-payoff performance of the system and its stability.
We argue that the basic theoretical notion of expressing the stability in
terms of the variance of the mean-payoff (called global variance in our paper)
is not always sufficient, since it ignores possible instabilities on individual
runs. For this reason we propose alternative definitions of stability, which we
call local and hybrid variance, and which express how rewards on each run
deviate from the run's own mean-payoff and from the expected mean-payoff,
respectively.
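To make the three notions concrete, here is a sketch in our own notation, consistent with the description above; the paper's exact definitions may differ in technical details such as the choice of liminf:

```latex
% mp(w): mean payoff of a single run w with rewards r_0(w), r_1(w), ...
\[
  \mathrm{mp}(\omega) \;=\; \liminf_{n\to\infty} \tfrac{1}{n}\textstyle\sum_{i=0}^{n-1} r_i(\omega)
\]
% Global variance: variance of the mean payoff across runs.
\[
  \mathbb{V}_{\mathrm{glob}} \;=\; \mathbb{E}\big[\big(\mathrm{mp}(\omega)-\mathbb{E}[\mathrm{mp}]\big)^2\big]
\]
% Local variance: long-run average squared deviation of rewards
% from the run's own mean payoff.
\[
  \mathrm{lv}(\omega) \;=\; \liminf_{n\to\infty} \tfrac{1}{n}\textstyle\sum_{i=0}^{n-1}\big(r_i(\omega)-\mathrm{mp}(\omega)\big)^2
\]
% Hybrid variance: deviation measured against the expected mean payoff.
\[
  \mathrm{hv}(\omega) \;=\; \liminf_{n\to\infty} \tfrac{1}{n}\textstyle\sum_{i=0}^{n-1}\big(r_i(\omega)-\mathbb{E}[\mathrm{mp}]\big)^2
\]
```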
We show that a strategy ensuring both the expected mean-payoff and the
variance below given bounds requires randomization and memory, under all the
above semantics of variance. We then look at the problem of determining whether
there is such a strategy. For the global variance, we show that the problem
is in PSPACE, and that the answer can be approximated in pseudo-polynomial
time. For the hybrid variance, the analogous decision problem is in NP, and a
polynomial-time approximating algorithm also exists. For local variance, we
show that the decision problem is in NP. Since the overall performance can be
traded for stability (and vice versa), we also present algorithms for
approximating the associated Pareto curve in all three cases.
Finally, we study a special case of the decision problems, where we require a
given expected mean-payoff together with zero variance. Here we show that the
problems can all be solved in polynomial time.
Comment: Extended version of a paper presented at LICS 2013.
Value Iteration for Long-run Average Reward in Markov Decision Processes
Markov decision processes (MDPs) are standard models for probabilistic
systems with non-deterministic behaviours. Long-run average rewards provide a
mathematically elegant formalism for expressing long term performance. Value
iteration (VI) is one of the simplest and most efficient algorithmic approaches
to MDPs with other properties, such as reachability objectives. Unfortunately,
a naive extension of VI does not work for MDPs with long-run average rewards,
as there is no known stopping criterion. In this work our contributions are
threefold. (1) We refute a conjecture related to stopping criteria for MDPs
with long-run average rewards. (2) We present two practical algorithms for MDPs
with long-run average rewards based on VI. First, we show that a combination of
applying VI locally for each maximal end-component (MEC) and VI for
reachability objectives can provide approximation guarantees. Second, extending
the above approach with a simulation-guided on-demand variant of VI, we present
an anytime algorithm that is able to deal with very large models. (3) Finally,
we present experimental results showing that our methods significantly
outperform the standard approaches on several benchmarks.
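As a toy illustration of the underlying difficulty (our own sketch, not the paper's algorithm; the function name and array shapes below are assumptions): naive value iteration for long-run average reward performs Bellman updates and reads the gain off the one-step value differences, but this gives no sound stopping guarantee in general MDPs.

```python
import numpy as np

def naive_average_reward_vi(P, R, iters=1000):
    """Naive value iteration for long-run average reward.

    P: transitions, shape (S, A, S); R: rewards, shape (S, A).
    Returns a gain estimate from the last one-step value differences.
    Toy sketch only: as the paper points out, this naive scheme has
    no known sound stopping criterion for general MDPs.
    """
    S, A, _ = P.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = R + P @ v              # q[s, a] = R[s, a] + sum_t P[s, a, t] * v[t]
        v_next = q.max(axis=1)     # greedy Bellman update
        diff = v_next - v
        v = v_next
    # midpoint of the span of one-step differences approximates the gain
    return (diff.max() + diff.min()) / 2.0
```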
Non-Zero Sum Games for Reactive Synthesis
In this invited contribution, we summarize new solution concepts useful for
the synthesis of reactive systems that we have introduced in several recent
publications. These solution concepts are developed in the context of non-zero
sum games played on graphs. They are part of the contributions obtained in the
inVEST project funded by the European Research Council.
Comment: LATA'16 invited paper.
Conditional Value-at-Risk for Reachability and Mean Payoff in Markov Decision Processes
We present the conditional value-at-risk (CVaR) in the context of Markov
chains and Markov decision processes with reachability and mean-payoff
objectives. CVaR quantifies risk by means of the expectation of the worst
p-quantile. As such it can be used to design risk-averse systems. We consider
not only CVaR constraints, but also introduce their conjunction with
expectation constraints and quantile constraints (value-at-risk, VaR). We
derive lower and upper bounds on the computational complexity of the respective
decision problems and characterize the structure of the strategies in terms of
memory and randomization.
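For reference, one standard convention for VaR and CVaR at level p (notation ours; the paper may resolve ties on point masses slightly differently):

```latex
% Value-at-risk: the worst p-quantile of the payoff X.
\[
  \mathrm{VaR}_p(X) \;=\; \inf\{\, x \in \mathbb{R} \;:\; \mathbb{P}[X \le x] \ge p \,\}
\]
% Conditional value-at-risk: the expectation of the worst p-quantile,
% i.e., the average of all quantiles at levels below p.
\[
  \mathrm{CVaR}_p(X) \;=\; \frac{1}{p} \int_0^p \mathrm{VaR}_q(X)\,\mathrm{d}q
\]
```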
Unifying Two Views on Multiple Mean-Payoff Objectives in Markov Decision Processes
We consider Markov decision processes (MDPs) with multiple limit-average (or
mean-payoff) objectives. There exist two different views: (i) the expectation
semantics, where the goal is to optimize the expected mean-payoff objective,
and (ii) the satisfaction semantics, where the goal is to maximize the
probability of runs such that the mean-payoff value stays above a given vector.
We consider optimization with respect to both objectives at once, thus unifying
the existing semantics. Precisely, the goal is to optimize the expectation
while ensuring the satisfaction constraint. Our problem captures the notion of
optimization with respect to strategies that are risk-averse (i.e., ensure
certain probabilistic guarantee). Our main results are as follows: First, we
present algorithms for the decision problems which are always polynomial in the
size of the MDP. We also show that an approximation of the Pareto-curve can be
computed in time polynomial in the size of the MDP and the approximation
factor, but exponential in the number of dimensions. Second, we present a
complete characterization of the strategy complexity (in terms of memory bounds
and randomization) required to solve our problem.
Comment: Extended journal version of the LICS'15 paper.
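In symbols, the unified problem can be sketched as follows (notation ours): over strategies sigma, with mp the vector of mean payoffs, v the satisfaction threshold vector, and alpha the probability bound, where maximizing a vector expectation is understood in the Pareto sense,

```latex
\[
  \text{maximize}\;\; \mathbb{E}^{\sigma}\!\left[\mathrm{mp}\right]
  \quad\text{subject to}\quad
  \mathbb{P}^{\sigma}\!\left[\mathrm{mp} \ge \mathbf{v}\right] \;\ge\; \alpha
\]
```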
Approximating values of generalized-reachability stochastic games
Simple stochastic games are turn-based 2½-player games with a reachability objective. The basic question asks whether one player can ensure reaching a given target with at least a given probability. A natural extension is games with a conjunction of such conditions as objective. Despite a plethora of recent results on the analysis of systems with multiple objectives, the decidability of this basic problem remains open. In this paper, we present an algorithm approximating the Pareto frontier of the achievable values to a given precision. Moreover, it is an anytime algorithm, meaning it can be stopped at any time, returning the current approximation and its error bound.
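A hedged toy sketch of the anytime idea for a single reachability objective (the paper handles conjunctions of objectives and whole Pareto frontiers; every name below is our own invention):

```python
def anytime_reachability_vi(states, kind, succ, targets, eps=1e-4):
    """Toy anytime value iteration for a simple stochastic game with a
    single reachability objective.

    kind[s] in {'max', 'min', 'avg'}: player 1, player 2, or a random
    state (uniform over successors, for simplicity).
    Maintains lower bounds L and upper bounds U; stopping at any time
    yields a valid approximation with error max(U - L).
    Caveat (our simplification): the upper bounds may fail to converge
    without the end-component treatment used by bounded/interval VI.
    """
    L = {s: 1.0 if s in targets else 0.0 for s in states}
    U = {s: 1.0 for s in states}

    def update(V, s):
        vals = [V[t] for t in succ[s]]
        if kind[s] == 'max':
            return max(vals)
        if kind[s] == 'min':
            return min(vals)
        return sum(vals) / len(vals)

    err = 1.0
    while err > eps:
        L = {s: 1.0 if s in targets else update(L, s) for s in states}
        U = {s: 1.0 if s in targets else update(U, s) for s in states}
        err = max(U[s] - L[s] for s in states)
    return L, U, err
```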
Dark Control: The Default Mode Network as a Reinforcement Learning Agent
The default mode network (DMN) is believed to subserve the baseline mental activity in humans. Its higher energy consumption compared to other brain networks and its intimate coupling with conscious awareness both point to an unknown overarching function. Many research streams speak in favor of an evolutionarily adaptive role in envisioning experience to anticipate the future. In the present work, we propose a process model that tries to explain how the DMN may implement continuous evaluation and prediction of the environment to guide behavior. The main purpose of DMN activity, we argue, may be described by Markov Decision Processes that optimize action policies via value estimates obtained through vicarious trial and error. Our formal perspective on DMN function naturally accommodates as special cases previous interpretations based on (1) predictive coding, (2) semantic associations, and (3) a sentinel role. Moreover, this process model for the neural optimization of complex behavior in the DMN offers parsimonious explanations for recent experimental findings in animals and humans.