5,196 research outputs found

    Scalable Verification of Markov Decision Processes

    Markov decision processes (MDPs) are useful for modelling concurrent process optimisation problems, but verifying them with numerical methods is often intractable. Existing approximate approaches do not scale well and are limited to memoryless schedulers. Here we present the basis of scalable verification for MDPs, using an O(1) memory representation of history-dependent schedulers. We thus facilitate scalable learning techniques and the use of massively parallel verification.
    Comment: V4: FMDS version, 12 pages, 4 figures
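    The O(1) representation suggests a scheduler identified by nothing more than an integer seed, with actions derived by hashing that seed together with the history observed so far. Below is a minimal sketch of that idea in Python on an invented three-state MDP; the toy model, names, and hash details are assumptions for illustration, not necessarily the paper's construction.

        import random
        import zlib

        # Toy MDP: state -> action -> list of (probability, successor).
        MDP = {
            "s0": {"a": [(0.5, "s1"), (0.5, "s2")], "b": [(1.0, "s2")]},
            "s1": {"a": [(1.0, "goal")]},
            "s2": {"a": [(1.0, "s0")]},
        }

        def choose_action(seed, history, state):
            # The integer seed is the entire O(1)-memory description of a
            # deterministic history-dependent scheduler: the history is
            # re-read from the run itself, never stored by the scheduler.
            h = zlib.crc32(repr((seed, history, state)).encode())
            actions = sorted(MDP[state])
            return actions[h % len(actions)]

        def reach_goal(seed, max_steps=20):
            state, history = "s0", []
            for _ in range(max_steps):
                if state == "goal":
                    return True
                action = choose_action(seed, tuple(history), state)
                u, acc = random.random(), 0.0
                for p, succ in MDP[state][action]:
                    acc += p
                    if u <= acc:
                        history.append((state, action))
                        state = succ
                        break
            return False

        # Each seed denotes one scheduler; the Monte Carlo runs per seed
        # are independent, hence embarrassingly parallel.
        for seed in range(3):
            runs = [reach_goal(seed) for _ in range(2000)]
            print(f"scheduler {seed}: P(reach goal) ~ {sum(runs) / len(runs):.3f}")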

    Markov Decision Processes with Multiple Long-run Average Objectives

    We study Markov decision processes (MDPs) with multiple limit-average (or mean-payoff) functions. We consider two different objectives, namely, expectation and satisfaction objectives. Given an MDP with k limit-average functions, in the expectation objective the goal is to maximize the expected limit-average value, and in the satisfaction objective the goal is to maximize the probability of runs such that the limit-average value stays above a given vector. We show that under the expectation objective, in contrast to the case of one limit-average function, both randomization and memory are necessary for strategies even for epsilon-approximation, and that finite-memory randomized strategies are sufficient for achieving Pareto-optimal values. Under the satisfaction objective, in contrast to the case of one limit-average function, infinite memory is necessary for strategies achieving a specific value (i.e., randomized finite-memory strategies are not sufficient), whereas memoryless randomized strategies are sufficient for epsilon-approximation, for all epsilon>0. We further prove that the decision problems for both expectation and satisfaction objectives can be solved in polynomial time, and the trade-off curve (Pareto curve) can be epsilon-approximated in time polynomial in the size of the MDP and 1/epsilon, and exponential in the number of limit-average functions, for all epsilon>0. Our analysis also reveals flaws in previous work on MDPs with multiple mean-payoff functions under the expectation objective, corrects the flaws, and allows us to obtain improved results.
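    For the expectation objective, randomized memoryless strategies of this kind can be computed via the standard linear program over state-action occupation measures. Below is a minimal Python (NumPy/SciPy) sketch on an invented two-state ergodic MDP with two mean-payoff functions: maximize the first expected mean payoff subject to a lower bound on the second. All numbers, including the 0.6 threshold, are assumptions for illustration; the paper's general multichain and Pareto-curve algorithms need more than this toy shows.

        import numpy as np
        from scipy.optimize import linprog

        # Invented ergodic MDP: 2 states x 2 actions.
        # P[s, a, s2] = transition probability; r[k, s, a] = k-th reward.
        P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                      [[0.5, 0.5], [0.1, 0.9]]])
        r = np.array([[[1.0, 0.0], [0.0, 0.0]],    # mean-payoff function 0
                      [[0.0, 1.0], [0.5, 1.0]]])   # mean-payoff function 1

        nS, nA = 2, 2

        def idx(s, a):
            return s * nA + a

        # Occupation measure x(s, a): stationary under P, sums to 1.
        A_eq = np.zeros((nS + 1, nS * nA))
        for s in range(nS):
            for a in range(nA):
                A_eq[s, idx(s, a)] += 1.0
                for s2 in range(nS):
                    A_eq[s2, idx(s, a)] -= P[s, a, s2]
        A_eq[nS, :] = 1.0
        b_eq = np.zeros(nS + 1)
        b_eq[nS] = 1.0

        threshold = 0.6                       # required value of function 1
        res = linprog(c=-r[0].reshape(-1),    # maximize function 0
                      A_ub=-r[1].reshape(-1)[None, :],
                      b_ub=np.array([-threshold]),
                      A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * (nS * nA))

        x = res.x.reshape(nS, nA)
        print("expected mean payoff 0:", -res.fun)
        print("randomized memoryless strategy:\n",
              x / x.sum(axis=1, keepdims=True))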

    Experimental results: Reinforcement Learning of POMDPs using Spectral Methods

    We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDPs) based on spectral decomposition methods. While spectral methods have previously been employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging, since the learner interacts with the environment and possibly changes the future observations in the process. We devise a learning algorithm that runs through epochs; in each epoch, we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the epoch, an optimization oracle returns the optimal memoryless planning policy, which maximizes the expected reward based on the estimated POMDP model. We prove an order-optimal regret bound with respect to the optimal memoryless policy and efficient scaling with respect to the dimensionality of the observation and action spaces.
    Comment: 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain
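    The full algorithm (tensor-based moment estimation interleaved with planning, epoch by epoch) is beyond a short sketch, but the passive building block the abstract alludes to is easy to see: pairwise observation moments of a latent-variable model factor through the hidden space, so their numerical rank exposes the number of hidden states. A minimal Python illustration on an invented two-state HMM follows; all parameters are assumptions.

        import numpy as np

        rng = np.random.default_rng(0)

        # Invented HMM: 2 hidden states, 3 observations.
        T = np.array([[0.8, 0.2], [0.3, 0.7]])            # P(x' | x)
        O = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])  # P(y | x)

        def sample(n):
            x, ys = 0, []
            for _ in range(n):
                ys.append(rng.choice(3, p=O[x]))
                x = rng.choice(2, p=T[x])
            return ys

        # Empirical pairwise moment M2[k, l] ~= P(y_t = k, y_{t+1} = l).
        ys = sample(100_000)
        M2 = np.zeros((3, 3))
        for a, b in zip(ys, ys[1:]):
            M2[a, b] += 1.0
        M2 /= M2.sum()

        # Spectral step: M2 = O^T diag(pi) T O factors through the 2-dim
        # hidden space, so only two singular values sit far from zero.
        print(np.linalg.svd(M2, compute_uv=False))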

    Expectations or Guarantees? I Want It All! A crossroad between games and MDPs

    When reasoning about the strategic capabilities of an agent, it is important to consider the nature of its adversaries. In the particular context of controller synthesis for quantitative specifications, the usual problem is to devise a strategy for a reactive system that yields some desired performance, taking into account the possible impact of the system's environment. There are at least two ways to look at this environment. In the classical analysis of two-player quantitative games, the environment is purely antagonistic, and the problem is to provide strict performance guarantees. In Markov decision processes, the environment is seen as purely stochastic: the aim is then to optimize the expected payoff, with no guarantee on individual outcomes. In this expository work, we report on recent results introducing the beyond worst-case synthesis problem, which is to construct strategies that guarantee some quantitative requirement in the worst case while providing a higher expected value against a particular stochastic model of the environment given as input. This problem is relevant for producing system controllers that provide good expected performance in everyday situations while ensuring a strict (but relaxed) performance threshold even in the event of very bad (though unlikely) circumstances. It has been studied for both the mean-payoff and the shortest-path quantitative measures.
    Comment: In Proceedings SR 2014, arXiv:1404.041
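    The two-sided evaluation behind beyond-worst-case synthesis can be sketched in a few lines of Python with invented shortest-path numbers: each candidate strategy is scored against an adversarial environment (the worst cost in its support) and against the stochastic model (its expected cost); strategies violating a strict worst-case bound are discarded, and expectation ranks the survivors. The real problem synthesizes finite-memory strategies inside an MDP, which this one-shot toy does not attempt.

        # strategy -> list of (probability, total cost to the target);
        # all names and numbers below are made up for illustration.
        STRATEGIES = {
            "highway":  [(0.9, 5.0), (0.1, 40.0)],   # fast but risky
            "backroad": [(1.0, 20.0)],               # slow but safe
            "mixed":    [(0.5, 5.0), (0.5, 25.0)],   # hedged route
        }

        WORST_CASE_BOUND = 30.0   # strict guarantee we insist on

        def worst_case(outcomes):
            # Adversarial view: the environment picks the costliest outcome.
            return max(cost for _, cost in outcomes)

        def expectation(outcomes):
            # Stochastic view: average cost under the given model.
            return sum(p * cost for p, cost in outcomes)

        admissible = {name: outs for name, outs in STRATEGIES.items()
                      if worst_case(outs) <= WORST_CASE_BOUND}
        best = min(admissible, key=lambda n: expectation(admissible[n]))

        for name, outs in STRATEGIES.items():
            print(f"{name:9s} worst={worst_case(outs):5.1f} "
                  f"exp={expectation(outs):5.1f}")
        print("beyond-worst-case choice:", best)

    On these numbers, "highway" has the best expectation (8.5) but is rejected for breaching the worst-case bound, and "mixed" (expectation 15, worst case 25) beats the fully safe "backroad".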