On Polynomial Sized MDP Succinct Policies
Policies of Markov Decision Processes (MDPs) determine the next action to
execute from the current state and, possibly, the history (the past states).
When the number of states is large, succinct representations are often used to
compactly represent both the MDPs and the policies in a reduced amount of
space. In this paper, some problems related to the size of succinctly
represented policies are analyzed. Namely, it is shown that some MDPs have
policies that can only be represented in space super-polynomial in the size of
the MDP, unless the polynomial hierarchy collapses. This fact motivates the
study of the problem of deciding whether a given MDP has a policy of a given
size and reward. Since some algorithms for MDPs work by finding a succinct
representation of the value function, the problem of deciding the existence of
a succinct representation of a value function of a given size and reward is
also considered.
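The gap between succinct and explicit policy representations can be illustrated with a toy sketch (the state encoding, action names, and rule below are illustrative, not from the paper): over n-bit states, an explicit policy table needs 2^n entries, while a succinct policy is any small program computing the same map.

```python
# Toy illustration of succinct vs. explicit policy representation.
# States are n-bit vectors, so an explicit table has 2**n entries, while a
# succinct policy is a constant-size program mapping state bits to an action.
# All names and the parity rule are illustrative, not the paper's construction.

N_BITS = 20  # 2**20 (about a million) states

def succinct_policy(state: int) -> str:
    """A policy given as a tiny program: O(1) space instead of O(2**n)."""
    # Illustrative rule: choose the action by the parity of the state's set bits.
    return "left" if bin(state).count("1") % 2 == 0 else "right"

def explicit_policy_table() -> list:
    """The same policy written out explicitly: size exponential in n."""
    return [succinct_policy(s) for s in range(2 ** N_BITS)]

# Both representations agree on every state; only their sizes differ.
```

The paper's point is that this gap is not always bridgeable: some MDPs admit good policies only in the explicit, super-polynomial form (unless the polynomial hierarchy collapses).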
On the Complexity of Value Iteration
Value iteration is a fundamental algorithm for solving Markov Decision Processes (MDPs). It computes the maximal n-step payoff by iterating n times a recurrence equation naturally associated with the MDP. At the same time, value iteration provides a policy for the MDP that is optimal on a given finite horizon n. In this paper, we settle the computational complexity of value iteration. We show that, given a horizon n in binary and an MDP, computing an optimal policy is EXPTIME-complete, thus resolving an open problem that goes back to the seminal 1987 paper on the complexity of MDPs by Papadimitriou and Tsitsiklis. To obtain this main result, we develop several stepping stones that yield results of independent interest. For instance, we show that it is EXPTIME-complete to compute the n-fold iteration (with n in binary) of a function given by a straight-line program over the integers with max and + as operators. We also provide new complexity results for the bounded halting problem in linear-update counter machines.
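The recurrence the abstract refers to is the finite-horizon Bellman equation, V_{k+1}(s) = max_a [R(s, a) + Σ_t P(t | s, a) · V_k(t)]. A minimal sketch on a made-up two-state MDP (the transition model, rewards, and state/action names are invented for illustration):

```python
# Minimal sketch of finite-horizon value iteration on a toy MDP.
# The transition model and rewards below are made up for illustration.

# States: 0, 1; actions: 'a', 'b'.
# P[s][act] = list of (probability, next_state); R[s][act] = immediate reward.
P = {
    0: {"a": [(1.0, 1)], "b": [(0.5, 0), (0.5, 1)]},
    1: {"a": [(1.0, 0)], "b": [(1.0, 1)]},
}
R = {0: {"a": 1.0, "b": 0.0}, 1: {"a": 0.0, "b": 2.0}}

def value_iteration(horizon: int):
    """Iterate V_{k+1}(s) = max_a [R(s,a) + sum_t P(t|s,a) * V_k(t)]."""
    V = {s: 0.0 for s in P}  # V_0 = 0: the zero-step payoff
    policy = {}
    for _ in range(horizon):
        V_new, policy = {}, {}
        for s in P:
            q = {a: R[s][a] + sum(p * V[t] for p, t in P[s][a]) for a in P[s]}
            best = max(q, key=q.get)
            V_new[s], policy[s] = q[best], best
        V = V_new
    return V, policy

V, pi = value_iteration(3)  # V == {0: 5.0, 1: 6.0}, pi == {0: 'a', 1: 'b'}
```

Note that this naive loop runs n iterations, i.e. time exponential in the length of n's binary encoding; the paper's EXPTIME-completeness result says that, in the worst case, this exponential cost cannot be avoided.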
Multi-Objective Model Checking of Markov Decision Processes
We study and provide efficient algorithms for multi-objective model checking
problems for Markov Decision Processes (MDPs). Given an MDP, M, and given
multiple linear-time (ω-regular or LTL) properties φ_i, and probabilities r_i ∈ [0,1], i = 1, ..., k, we ask whether there exists a strategy σ for the controller such that, for all i, the probability that a trajectory of M controlled by σ satisfies φ_i is at least r_i. We
provide an algorithm that decides whether there exists such a strategy and if
so produces it, and which runs in time polynomial in the size of the MDP. Such
a strategy may require the use of both randomization and memory. We also
consider more general multi-objective ω-regular queries, which we
motivate with an application to assume-guarantee compositional reasoning for
probabilistic systems.
Note that there can be trade-offs between different properties: satisfying property φ_1 with high probability may necessitate satisfying φ_2 with low probability. Viewing this as a multi-objective optimization problem, we want information about the "trade-off curve" or Pareto curve for maximizing the probabilities of different properties. We show that one can compute an approximate Pareto curve with respect to a set of ω-regular properties in time polynomial in the size of the MDP.
Our quantitative upper bounds use LP methods. We also study qualitative
multi-objective model checking problems, and we show that these can be analysed
by purely graph-theoretic methods, even though the strategies may still require
both randomization and memory.
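The trade-off and the need for randomization can be seen in a deliberately tiny example (not from the paper; the paper's general algorithm uses linear programming): suppose from the initial state, action 'a' surely reaches a state satisfying φ_1 and action 'b' surely reaches a state satisfying φ_2. A randomized strategy playing 'a' with probability α achieves the point (α, 1 − α), so the achievable region is {(r_1, r_2) : r_1 + r_2 ≤ 1} and the Pareto curve is its upper edge.

```python
# Toy sketch of a multi-objective query on the one-choice MDP described above:
# action 'a' surely satisfies phi_1, action 'b' surely satisfies phi_2, and a
# randomized strategy playing 'a' with probability alpha achieves (alpha, 1-alpha).
# Purely illustrative; no deterministic strategy hits interior Pareto points.

def achievable(r1: float, r2: float) -> bool:
    """Is there a strategy with Pr[phi_1] >= r1 and Pr[phi_2] >= r2?"""
    return r1 + r2 <= 1.0

def witness_strategy(r1: float, r2: float):
    """Probability of playing 'a' witnessing an achievable query, else None."""
    return r1 if achievable(r1, r2) else None

# (0.6, 0.4) lies on the Pareto curve; (0.7, 0.5) lies beyond it.
```

Only the two deterministic strategies hit (1, 0) and (0, 1); every other Pareto point here requires randomization, matching the abstract's remark that strategies may need randomization (and, in general, memory).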
Certified Reinforcement Learning with Logic Guidance
This paper proposes the first model-free Reinforcement Learning (RL)
framework to synthesise policies for unknown, continuous-state Markov
Decision Processes (MDPs), such that a given linear temporal property is
satisfied. We convert the given property into a Limit-Deterministic Büchi Automaton (LDBA), namely a finite-state machine expressing the property.
Exploiting the structure of the LDBA, we shape a synchronous reward function
on-the-fly, so that an RL algorithm can synthesise a policy resulting in traces
that probabilistically satisfy the linear temporal property. This probability
(certificate) is also calculated in parallel with policy learning when the
state space of the MDP is finite: as such, the RL algorithm produces a policy
that is certified with respect to the property. Under the assumption of finite
state space, theoretical guarantees are provided on the convergence of the RL
algorithm to an optimal policy, maximising the above probability. We also show
that our method produces "best available" control policies when the logical
property cannot be satisfied. In the general case of a continuous state space,
we propose a neural network architecture for RL and we empirically show that
the algorithm finds satisfying policies, if there exist such policies. The
performance of the proposed framework is evaluated via a set of numerical
examples and benchmarks, where we observe an improvement of one order of
magnitude in the number of iterations required for the policy synthesis,
compared to existing approaches whenever available.
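The core idea, synchronising the MDP with the automaton and rewarding accepting transitions, can be sketched with standard tabular Q-learning on an invented toy task (the environment, the "eventually reach cell 3" property, and all hyperparameters are illustrative assumptions, not the paper's setup or architecture):

```python
import random

# Hedged sketch of logic-guided reward shaping: run the MDP in lockstep with an
# automaton for the property and give reward only on accepting transitions.
# Toy property: "eventually reach cell 3" on a 4-cell line; its automaton has a
# non-accepting state 0 and an accepting sink 1. Everything here is illustrative.

random.seed(0)
ACTIONS = (-1, +1)  # move left / move right

def automaton_step(q, cell):
    """Advance the property automaton: jump to the accepting sink at the goal."""
    return 1 if cell == 3 else q

Q = {}  # Q[(cell, automaton_state)][action]

def q_learn(episodes=2000, alpha=0.5, gamma=0.95, eps=0.2):
    for _ in range(episodes):
        cell, q = 0, 0
        for _ in range(20):
            s = (cell, q)
            Q.setdefault(s, {a: 0.0 for a in ACTIONS})
            # epsilon-greedy action selection on the product state
            a = random.choice(ACTIONS) if random.random() < eps else max(Q[s], key=Q[s].get)
            cell = min(3, max(0, cell + a))
            q_next = automaton_step(q, cell)
            r = 1.0 if (q == 0 and q_next == 1) else 0.0  # reward on acceptance
            s2 = (cell, q_next)
            Q.setdefault(s2, {a2: 0.0 for a2 in ACTIONS})
            Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
            q = q_next
            if q == 1:
                break  # property satisfied; end the episode

q_learn()
```

After training, the greedy policy moves right from every non-accepting product state, i.e. it satisfies the toy property; the paper additionally computes the satisfaction probability as a certificate, which this sketch omits.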
Parameter Synthesis for Markov Models
Markov chain analysis is a key technique in reliability engineering. A
practical obstacle is that all probabilities in Markov models need to be known.
However, system quantities such as failure rates or packet loss ratios are often not, or only partially, known. This motivates considering
parametric models with transitions labeled with functions over parameters.
Whereas traditional Markov chain analysis evaluates a reliability metric for a
single, fixed set of probabilities, analysing parametric Markov models focuses
on synthesising parameter values that establish a given reliability or performance specification. Examples are: what component failure rates ensure that the probability of a system breakdown is below 0.00000001?, or which failure rates maximise reliability? This paper presents various analysis algorithms for parametric Markov chains and Markov decision processes. We focus on three problems: (a) do all parameter values within a given region satisfy the specification?, (b) which regions satisfy the specification and which do not?, and (c)
an approximate version of (b) focusing on covering a large fraction of all
possible parameter values. We give a detailed account of the various
algorithms, present a software tool realising these techniques, and report on
an extensive experimental evaluation on benchmarks that span a wide range of
applications.
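Problem (a), region verification, can be illustrated on a made-up parametric chain (the paper's algorithms are exact and symbolic; the brute-force grid check below is only a sketch): from s0 the system reaches s1 with probability p and fails otherwise, and from s1 it reaches the goal with probability p, so Pr[reach goal] = p².

```python
# Hedged sketch of region checking for a parametric Markov chain by grid
# sampling. The chain is invented: Pr[reach goal] = p**2, so with threshold
# 0.9 the satisfying region is p >= sqrt(0.9), roughly p >= 0.9487.

def reach_probability(p: float) -> float:
    """Closed-form reachability probability of the toy two-step chain."""
    return p * p

def region_satisfies(lo: float, hi: float, threshold: float, samples: int = 1000) -> bool:
    """Do all sampled parameter values in [lo, hi] meet Pr[reach goal] >= threshold?"""
    return all(
        reach_probability(lo + (hi - lo) * i / (samples - 1)) >= threshold
        for i in range(samples)
    )

# region_satisfies(0.95, 1.0, 0.9) holds; region_satisfies(0.90, 1.0, 0.9) does not.
```

Grid sampling can of course miss violations between sample points (here p² is monotone, so checking the left endpoint would suffice); the paper's techniques instead give sound answers over whole regions, which is what makes them usable for certification.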