
    Smart Sampling for Lightweight Verification of Markov Decision Processes

    Markov decision processes (MDPs) are useful for modelling optimisation problems in concurrent systems. Verifying MDPs with efficient Monte Carlo techniques requires that their nondeterminism be resolved by a scheduler. Recent work has introduced the elements of lightweight techniques that sample directly from scheduler space, but finding optimal schedulers by simple sampling may be inefficient. Here we describe "smart" sampling algorithms that can make substantial improvements in performance. Comment: IEEE conference style, 11 pages, 5 algorithms, 11 figures, 1 table
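
    A minimal sketch of the lightweight idea described in this abstract: a scheduler is identified by an integer seed that resolves nondeterminism by hashing (seed, state), and a "smart" refinement loop spends a fixed per-round simulation budget on a shrinking candidate set of seeds. The MDP interface (reset/step/actions), the hashing scheme, and the halving schedule are assumptions for illustration, not the paper's exact algorithms.

```python
import hashlib
import random

def scheduler_action(seed, state, actions):
    """Resolve nondeterminism deterministically from (seed, state)."""
    h = hashlib.sha256(f"{seed}:{state}".encode()).digest()
    return actions[int.from_bytes(h[:4], "big") % len(actions)]

def estimate(mdp, seed, n_sims, horizon):
    """Monte Carlo estimate of the probability that a run satisfies the property."""
    hits = 0
    for _ in range(n_sims):
        state = mdp.reset()
        for _ in range(horizon):
            a = scheduler_action(seed, state, mdp.actions(state))
            state, reached = mdp.step(state, a)  # assumed interface
            if reached:
                hits += 1
                break
    return hits / n_sims

def smart_sampling(mdp, budget_per_round=10_000, n_schedulers=1_000, horizon=100):
    """Iteratively refine a candidate set of scheduler seeds: spend the same
    total budget each round, but on fewer, more promising schedulers."""
    candidates = [random.getrandbits(64) for _ in range(n_schedulers)]
    while len(candidates) > 1:
        sims_each = max(1, budget_per_round // len(candidates))
        scored = [(estimate(mdp, s, sims_each, horizon), s) for s in candidates]
        scored.sort(reverse=True)  # maximise the reachability estimate
        candidates = [s for _, s in scored[: max(1, len(candidates) // 2)]]
    return candidates[0]
```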

    Scalable First-Order Methods for Robust MDPs

    Robust Markov Decision Processes (MDPs) are a powerful framework for modeling sequential decision-making problems with model uncertainty. This paper proposes the first first-order framework for solving robust MDPs. Our algorithm interleaves primal-dual first-order updates with approximate Value Iteration updates. By carefully controlling the tradeoff between the accuracy and cost of Value Iteration updates, we achieve an ergodic convergence rate of $O\left(A^{2} S^{3}\log(S)\log(\epsilon^{-1})\epsilon^{-1}\right)$ for the best choice of parameters on ellipsoidal and Kullback-Leibler $s$-rectangular uncertainty sets, where $S$ and $A$ are the numbers of states and actions, respectively. Our dependence on the number of states and actions is significantly better (by a factor of $O(A^{1.5}S^{1.5})$) than that of pure Value Iteration algorithms. In numerical experiments on ellipsoidal uncertainty sets, we show that our algorithm is significantly more scalable than state-of-the-art approaches. Our framework is also the first to solve robust MDPs with $s$-rectangular KL uncertainty sets.
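
    To make the structure concrete, here is a toy illustration (not the paper's primal-dual algorithm) of mixing first-order inner updates with outer Value Iteration sweeps: for each state-action pair, a few projected-gradient steps search for worst-case transition probabilities inside an assumed spherical ball (a special case of an ellipsoidal set) around the nominal model.

```python
import numpy as np

def project_simplex(p):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(p)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / np.arange(1, len(p) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(p + theta, 0)

def robust_value_iteration(P_hat, R, radius, gamma=0.9,
                           outer_iters=200, inner_iters=20, lr=0.1):
    """Toy robust VI: P_hat has shape (S, A, S), R has shape (S, A).
    The inner projected-gradient loop approximates the adversary's
    worst-case transition kernel near the nominal P_hat[s, a]."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(outer_iters):
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                p = P_hat[s, a].copy()
                for _ in range(inner_iters):
                    p = p - lr * gamma * V       # adversary minimises p @ V
                    p = project_simplex(p)
                    d = p - P_hat[s, a]          # pull back into the assumed ball
                    n = np.linalg.norm(d)
                    if n > radius:
                        p = P_hat[s, a] + d * radius / n
                Q[s, a] = R[s, a] + gamma * p @ V
        V = Q.max(axis=1)
    return V, Q
```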

    Approximate Policy Iteration Schemes: A Comparison

    We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on several approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration (API), Conservative Policy Iteration (CPI), a natural adaptation of the Policy Search by Dynamic Programming algorithm to the infinite-horizon case (PSDP$_\infty$), and the recently proposed Non-Stationary Policy Iteration (NSPI(m)). For all algorithms, we describe performance bounds and make a comparison, paying particular attention to the concentrability constants involved, the number of iterations, and the memory required. Our analysis highlights the following points: 1) The performance guarantee of CPI can be arbitrarily better than that of API/API($\alpha$), but this comes at the cost of a relative increase in the number of iterations that is exponential in $\frac{1}{\epsilon}$. 2) PSDP$_\infty$ enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a number of iterations similar to that of API. 3) Contrary to API, which requires constant memory, the memory needed by CPI and PSDP$_\infty$ is proportional to their number of iterations, which may be problematic when the discount factor $\gamma$ is close to 1 or the approximation error $\epsilon$ is close to 0; we show that the NSPI(m) algorithm allows an overall trade-off between memory and performance. Simulations with these schemes confirm our analysis. Comment: ICML (2014)
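
    As a reference point for the schemes being compared, the sketch below contrasts the full greedy switch of API with the conservative mixture update of CPI. Policy evaluation is done exactly on a small tabular model purely for exposition, standing in for the approximate evaluation step; the transition tensor layout and step size are assumptions.

```python
import numpy as np

def evaluate(policy, P, R, gamma):
    """Exact evaluation of a stochastic policy (tabular stand-in for the
    approximate evaluation step). P: (S, A, S), R: (S, A), policy: (S, A)."""
    S, A = R.shape
    P_pi = np.einsum("sa,saz->sz", policy, P)
    r_pi = (policy * R).sum(axis=1)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def greedy(V, P, R, gamma):
    """Greedy (deterministic) policy with respect to V, as one-hot rows."""
    Q = R + gamma * np.einsum("saz,z->sa", P, V)
    pi = np.zeros_like(R)
    pi[np.arange(len(V)), Q.argmax(axis=1)] = 1.0
    return pi

def api(P, R, gamma, iters=50):
    """Approximate Policy Iteration: evaluate, then switch fully to the greedy policy."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)
    for _ in range(iters):
        pi = greedy(evaluate(pi, P, R, gamma), P, R, gamma)
    return pi

def cpi(P, R, gamma, alpha=0.1, iters=200):
    """Conservative Policy Iteration: move only a small step alpha toward the
    greedy policy, which is what drives the larger iteration count noted above."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)
    for _ in range(iters):
        pi = (1 - alpha) * pi + alpha * greedy(evaluate(pi, P, R, gamma), P, R, gamma)
    return pi
```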

    Regret Bounds for Reinforcement Learning with Policy Advice

    In some reinforcement learning problems an agent may be provided with a set of input policies, perhaps learned from prior experience or provided by advisors. We present a reinforcement learning with policy advice (RLPA) algorithm which leverages this input set and learns to use the best policy in the set for the reinforcement learning task at hand. We prove that RLPA has a sub-linear regret of $\tilde{O}(\sqrt{T})$ relative to the best input policy, and that both this regret and its computational complexity are independent of the size of the state and action spaces. Our empirical simulations support our theoretical analysis. This suggests RLPA may offer significant advantages in large domains where some prior good policies are provided.
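
    A hedged sketch of the "learn to use the best provided policy" idea: candidate policies are treated as arms and chosen by an optimistic index over their observed returns. The environment interface, episode scheduling, and confidence-bonus form here are illustrative assumptions, not the RLPA algorithm itself.

```python
import math

def select_best_policy(env, policies, total_steps, horizon=100):
    """Run each candidate policy in episodes, choosing the next one to try by an
    optimistic (UCB-style) index over its empirical average return."""
    stats = {i: {"n": 0, "total": 0.0} for i in range(len(policies))}
    t = 0
    while t < total_steps:
        def index(i):
            s = stats[i]
            if s["n"] == 0:
                return float("inf")              # try every policy at least once
            return s["total"] / s["n"] + math.sqrt(2 * math.log(max(t, 2)) / s["n"])
        i = max(range(len(policies)), key=index)
        state, ret = env.reset(), 0.0            # assumed env interface
        for _ in range(horizon):
            state, reward, done = env.step(policies[i](state))
            ret += reward
            t += 1
            if done or t >= total_steps:
                break
        stats[i]["n"] += 1
        stats[i]["total"] += ret
    best = max(stats, key=lambda i: stats[i]["total"] / max(stats[i]["n"], 1))
    return policies[best]
```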
