Scalable Verification of Markov Decision Processes
Markov decision processes (MDPs) are useful for modelling concurrent process
optimisation problems, but verifying them with numerical methods is often
intractable. Existing approximate approaches do not scale well and are
limited to memoryless schedulers. Here we present the basis of scalable
verification for MDPs, using an O(1)-memory representation of
history-dependent schedulers. We thus facilitate scalable learning techniques
and the use of massively parallel verification.
Comment: V4: FMDS version, 12 pages, 4 figures
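As a toy illustration of the kind of numerical MDP verification the abstract refers to, here is a minimal value-iteration sketch on a hand-made three-state MDP. The states, actions, rewards, and discount factor are assumptions for illustration only; this is standard value iteration, not the paper's O(1)-memory scheduler construction.

```python
# Illustrative sketch: computing optimal expected discounted reward in a
# tiny hand-made MDP via value iteration (not the paper's algorithm).

# MDP encoding: state -> action -> list of (probability, next_state, reward)
MDP = {
    "s0": {"a": [(0.9, "s1", 1.0), (0.1, "s0", 0.0)],
           "b": [(1.0, "s2", 0.5)]},
    "s1": {"a": [(1.0, "s2", 0.0)]},
    "s2": {},  # absorbing terminal state, no actions
}

def value_iteration(mdp, gamma=0.95, eps=1e-8):
    """Iterate the Bellman optimality operator until the update is < eps."""
    v = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:
                continue  # terminal state keeps value 0
            best = max(sum(p * (r + gamma * v[t]) for p, t, r in outs)
                       for outs in actions.values())
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < eps:
            return v

values = value_iteration(MDP)
```

At the fixed point, action "a" dominates in s0: its value solves v = 0.9 + 0.095 v, i.e. 0.9/0.905.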
Markov Decision Processes with Multiple Long-run Average Objectives
We study Markov decision processes (MDPs) with multiple limit-average (or
mean-payoff) functions. We consider two different objectives, namely,
expectation and satisfaction objectives. Given an MDP with k limit-average
functions, in the expectation objective the goal is to maximize the expected
limit-average value, and in the satisfaction objective the goal is to maximize
the probability of runs such that the limit-average value stays above a given
vector. We show that under the expectation objective, in contrast to the case
of one limit-average function, both randomization and memory are necessary for
strategies even for epsilon-approximation, and that finite-memory randomized
strategies are sufficient for achieving Pareto optimal values. Under the
satisfaction objective, in contrast to the case of one limit-average function,
infinite memory is necessary for strategies achieving a specific value (i.e.
randomized finite-memory strategies are not sufficient), whereas memoryless
randomized strategies are sufficient for epsilon-approximation, for all
epsilon>0. We further prove that the decision problems for both expectation and
satisfaction objectives can be solved in polynomial time and the trade-off
curve (Pareto curve) can be epsilon-approximated in time polynomial in the size
of the MDP and 1/epsilon, and exponential in the number of limit-average
functions, for all epsilon>0. Our analysis also reveals flaws in previous work
for MDPs with multiple mean-payoff functions under the expectation objective,
corrects the flaws, and allows us to obtain improved results.
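For a fixed memoryless randomized strategy, the limit-average (mean-payoff) vector can be read off the stationary distribution of the induced Markov chain. A small hand-made sketch of this computation, with a satisfaction-style comparison against a threshold vector; the chain, the two reward functions, and the threshold are illustrative assumptions, not taken from the paper:

```python
# Sketch: mean-payoff vector of the Markov chain induced by a fixed
# memoryless strategy, for two limit-average reward functions.

# Transition matrix of the induced 2-state chain and two reward vectors.
P = [[0.5, 0.5],
     [0.2, 0.8]]
r1 = [1.0, 0.0]   # first limit-average payoff function
r2 = [0.0, 1.0]   # second limit-average payoff function

def stationary(P, iters=10_000):
    """Power iteration for the stationary distribution of an ergodic chain."""
    d = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        d = [sum(d[i] * P[i][j] for i in range(len(P)))
             for j in range(len(P))]
    return d

pi = stationary(P)
mean_payoff = (sum(p * x for p, x in zip(pi, r1)),
               sum(p * x for p, x in zip(pi, r2)))

# Satisfaction-style check: does the mean-payoff vector dominate a
# given threshold vector componentwise?
threshold = (0.25, 0.5)
meets = all(m >= t for m, t in zip(mean_payoff, threshold))
```

For this chain the stationary distribution is (2/7, 5/7), so the mean-payoff vector (2/7, 5/7) dominates the threshold (0.25, 0.5).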
Experimental Results: Reinforcement Learning of POMDPs using Spectral Methods
We propose a new reinforcement learning algorithm for partially observable
Markov decision processes (POMDPs) based on spectral decomposition methods.
While spectral methods have been previously employed for consistent learning of
(passive) latent variable models such as hidden Markov models, POMDPs are more
challenging since the learner interacts with the environment and possibly
changes the future observations in the process. We devise a learning algorithm
running through epochs; in each epoch, we employ spectral techniques to learn
the POMDP parameters from a trajectory generated by a fixed policy. At the end
of the epoch, an optimization oracle returns the optimal memoryless planning
policy which maximizes the expected reward based on the estimated POMDP model.
We prove an order-optimal regret bound with respect to the optimal memoryless
policy and efficient scaling with respect to the dimensionality of observation
and action spaces.
Comment: 30th Conference on Neural Information Processing Systems (NIPS 2016),
Barcelona, Spain
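The epoch-based control flow the abstract describes can be sketched as follows. This is only a skeleton: the spectral estimation step is replaced here by simple frequency counting on a fully observable toy chain (a deliberate, named simplification), and the environment, policy, and trajectory length are all made-up illustrative values.

```python
# Skeleton of an epoch-based scheme: run a fixed policy for an epoch,
# estimate model parameters from the trajectory, then a planner would
# replan from the estimate. Counting replaces the spectral step here.
import random

random.seed(0)

# Hidden true dynamics of a toy 2-state chain: (state, action) -> outcomes.
TRUE_P = {("s0", "a"): [("s0", 0.3), ("s1", 0.7)],
          ("s1", "a"): [("s0", 0.6), ("s1", 0.4)]}

def step(state, action):
    """Sample the next state from the true dynamics."""
    outs = TRUE_P[(state, action)]
    x, acc = random.random(), 0.0
    for nxt, p in outs:
        acc += p
        if x < acc:
            return nxt
    return outs[-1][0]

def run_epoch(policy, length=5000):
    """Generate a trajectory under a fixed policy and count transitions."""
    counts = {}
    state = "s0"
    for _ in range(length):
        a = policy(state)
        nxt = step(state, a)
        counts[(state, a, nxt)] = counts.get((state, a, nxt), 0) + 1
        state = nxt
    return counts

counts = run_epoch(lambda s: "a")
total = counts.get(("s0", "a", "s0"), 0) + counts.get(("s0", "a", "s1"), 0)
p_hat = counts.get(("s0", "a", "s1"), 0) / total
# A planner would now compute the optimal memoryless policy from the
# estimated model and use it in the next epoch.
```

With a few thousand samples, the empirical estimate p_hat lands close to the true transition probability 0.7.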
Expectations or Guarantees? I Want It All! A crossroad between games and MDPs
When reasoning about the strategic capabilities of an agent, it is important
to consider the nature of its adversaries. In the particular context of
controller synthesis for quantitative specifications, the usual problem is to
devise a strategy for a reactive system which yields some desired performance,
taking into account the possible impact of the environment of the system. There
are at least two ways to look at this environment. In the classical analysis of
two-player quantitative games, the environment is purely antagonistic and the
problem is to provide strict performance guarantees. In Markov decision
processes, the environment is seen as purely stochastic: the aim is then to
optimize the expected payoff, with no guarantee on individual outcomes.
In this expository work, we report on recent results introducing the beyond
worst-case synthesis problem, which is to construct strategies that guarantee
some quantitative requirement in the worst case while providing a higher
expected value against a particular stochastic model of the environment given
as input. This problem is relevant to produce system controllers that provide
good expected performance in the everyday situation while ensuring a strict
(but relaxed) performance threshold even in the event of very bad (but
unlikely) circumstances. It has been studied for both the mean-payoff and the
shortest path quantitative measures.
Comment: In Proceedings SR 2014, arXiv:1404.041
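The beyond-worst-case criterion on a shortest-path measure can be illustrated with a toy acceptance test: a strategy is acceptable only if its worst-case cost stays below a hard bound and its expected cost under the stochastic environment model beats a softer target (lower cost is better). All strategies, outcome distributions, and thresholds below are made up for illustration:

```python
# Toy beyond-worst-case check on a shortest-path-style instance.
# Each strategy records its outcome distribution under the stochastic
# environment model and its worst-case cost over all environment
# behaviours (both hand-made numbers).
strategies = {
    "safe":  {"outcomes": [(1.0, 10.0)],             "worst_case": 10.0},
    "risky": {"outcomes": [(0.9, 4.0), (0.1, 30.0)], "worst_case": 30.0},
    "bwc":   {"outcomes": [(0.8, 5.0), (0.2, 11.0)], "worst_case": 12.0},
}

def acceptable(s, hard_bound=12.0, soft_target=8.0):
    """Beyond-worst-case test: hard worst-case bound AND good expectation."""
    expected = sum(p * c for p, c in s["outcomes"])
    return s["worst_case"] <= hard_bound and expected <= soft_target

chosen = {name for name, s in strategies.items() if acceptable(s)}
```

Here "safe" meets the hard bound but has a poor expectation (10 > 8), "risky" has a good expectation but violates the hard bound (30 > 12), and only "bwc" satisfies both (worst case 12, expectation 6.2).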