
    Asymptotic Optimality of Finite Approximations to Markov Decision Processes with Borel Spaces

    Calculating optimal policies is known to be computationally difficult for Markov decision processes (MDPs) with Borel state and action spaces. This paper studies finite-state approximations of discrete-time Markov decision processes with Borel state and action spaces, for both discounted and average cost criteria. The stationary policies thus obtained are shown to approximate the optimal stationary policy with arbitrary precision, under quite general conditions for the discounted cost and more restrictive conditions for the average cost. For compact-state MDPs, we obtain explicit rate-of-convergence bounds quantifying how the approximation improves as the size of the approximating finite state space increases. Using information-theoretic arguments, the order optimality of the obtained convergence rates is established for a large class of problems. We also show that, as a pre-processing step, the action space can be finitely approximated using a sufficiently large number of points; thereby, well-known algorithms such as value or policy iteration, Q-learning, etc., can be used to calculate near-optimal policies. Comment: 41 pages
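
    A minimal sketch of the general recipe the paper justifies (not its exact construction): quantize the state space, map each transition to its bin, and solve the resulting finite MDP by value iteration. The 1-D deterministic dynamics, cost, and uniform grid below are illustrative placeholders.

```python
import numpy as np

def solve_quantized_mdp(n_states, actions, transition, cost,
                        gamma=0.9, iters=500):
    """Quantize [0, 1) into n_states bins and solve the finite MDP."""
    centers = (np.arange(n_states) + 0.5) / n_states
    P = np.empty((n_states, len(actions)), dtype=int)   # successor bin index
    C = np.empty((n_states, len(actions)))              # stage cost
    for i, s in enumerate(centers):
        for j, a in enumerate(actions):
            s_next = np.clip(transition(s, a), 0.0, 1.0 - 1e-9)
            P[i, j] = int(s_next * n_states)            # project to nearest bin
            C[i, j] = cost(s, a)
    V = np.zeros(n_states)
    for _ in range(iters):                              # value iteration
        V = (C + gamma * V[P]).min(axis=1)
    policy = np.asarray(actions)[(C + gamma * V[P]).argmin(axis=1)]
    return V, policy

# Hypothetical toy problem: dynamics s' = s + a, cost s^2 + a^2
V, pi = solve_quantized_mdp(200, np.linspace(-0.1, 0.1, 11),
                            lambda s, a: s + a, lambda s, a: s**2 + a**2)
```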

    A Convex Optimization Approach to Dynamic Programming in Continuous State and Action Spaces

    In this paper, a convex optimization-based method is proposed for numerically solving dynamic programs in continuous state and action spaces. The key idea is to approximate the output of the Bellman operator at a particular state by the optimal value of a convex program. The approximate Bellman operator has a computational advantage because it involves a convex optimization problem in the case of control-affine systems and convex costs. Using this feature, we propose a simple dynamic programming algorithm that evaluates the approximate value function at pre-specified grid points by solving convex optimization problems in each iteration. We show that the proposed method approximates the optimal value function with a uniform convergence property in the case of convex optimal value functions. We also propose an interpolation-free design method for a control policy, whose performance converges uniformly to the optimum as the grid resolution becomes finer. When a nonlinear control-affine system is considered, the convex optimization approach provides an approximate policy with a provable suboptimality bound. For general cases, the proposed convex formulation of dynamic programming operators can be modified into a nonconvex bi-level program in which the inner problem is a linear program, without losing the uniform convergence properties.
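
    As a rough illustration of the key idea, the backup at a single grid state can be posed as a convex program when the dynamics are control-affine, the stage cost is convex, and the current value estimate is convex piecewise linear (a max of affine functions). The sketch below uses cvxpy; the quadratic stage cost and box constraint are assumptions, not the paper's exact setup.

```python
import numpy as np
import cvxpy as cp

def approx_bellman_at(x, f_x, B_x, planes, gamma=0.95, u_max=1.0):
    """One evaluation of an approximate Bellman backup at state x.

    planes: list of (a_k, b_k) with V_hat(y) = max_k a_k @ y + b_k,
    a convex piecewise-linear value estimate, so the backup is convex.
    """
    u = cp.Variable(B_x.shape[1])
    t = cp.Variable()                        # epigraph variable for V_hat(x')
    x_next = f_x + B_x @ u                   # control-affine dynamics
    cons = [t >= a @ x_next + b for a, b in planes]
    cons.append(cp.norm(u, 'inf') <= u_max)  # assumed input constraint
    stage = cp.sum_squares(x) + cp.sum_squares(u)   # assumed convex stage cost
    prob = cp.Problem(cp.Minimize(stage + gamma * t), cons)
    prob.solve()
    return prob.value, u.value
```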

    Scalable Bilinear π Learning Using State and Action Features

    Approximate linear programming (ALP) represents one of the major algorithmic families for solving large-scale Markov decision processes (MDPs). In this work, we study a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm called bilinear π learning for reinforcement learning when a sampling oracle is provided. This algorithm enjoys a number of advantages. First, it adopts (bi)linear models to represent the high-dimensional value function and state-action distributions, using given state and action features; its run-time complexity depends on the number of features, not the size of the underlying MDP. Second, it operates in a fully online fashion without having to store any sample, and thus has a minimal memory footprint. Third, we prove that it is sample-efficient, solving for the optimal policy to high precision with a sample complexity linear in the dimension of the parameter space.
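
    A heavily simplified, schematic rendering of the saddle-point idea (not the paper's algorithm): the primal side keeps linear value weights, the dual side keeps an occupancy distribution over state-action pairs, and each oracle sample drives one descent/ascent step. The tabular dual, the exponentiated update on a single sampled entry, and the step sizes are all simplifications; the paper uses features on both sides and mirror-descent-style updates with guarantees.

```python
import numpy as np

def bilinear_pi_learning(oracle, phi, nu, n_s, n_a, d,
                         gamma=0.99, lr=0.05, steps=50_000, rng=None):
    """oracle(s, a) -> (reward, next_state); phi(s) -> d-dim state features;
    nu() -> a sampled initial state. All names here are hypothetical."""
    rng = rng or np.random.default_rng(0)
    theta = np.zeros(d)            # primal: value weights, V(s) = phi(s) @ theta
    logits = np.zeros(n_s * n_a)   # dual: occupancy over (s, a), tabular here
    for _ in range(steps):
        p = np.exp(logits - logits.max()); p /= p.sum()
        sa = rng.choice(n_s * n_a, p=p)          # sample from the current dual
        s, a = divmod(int(sa), n_a)
        r, s2 = oracle(s, a)
        adv = r + gamma * phi(s2) @ theta - phi(s) @ theta
        logits[sa] += lr * adv                   # dual: exponentiated ascent
        theta -= lr * ((1 - gamma) * phi(nu())   # primal: descent along the
                       + gamma * phi(s2) - phi(s))  # sampled Lagrangian gradient
    occ = np.exp(logits - logits.max()).reshape(n_s, n_a)
    return theta, occ / occ.sum(axis=1, keepdims=True)   # policy from occupancy
```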

    Coupling and a generalised Policy Iteration Algorithm in continuous time

    We analyse a version of the policy iteration algorithm for the discounted infinite-horizon problem for controlled multidimensional diffusion processes, where both the drift and the diffusion coefficient can be controlled. We prove that, under assumptions on the problem data, the payoffs generated by the algorithm converge monotonically to the value function, and that any accumulation point of the sequence of policies is an optimal policy. The algorithm is stated and analysed in continuous time and state, with discretisation featuring in neither the theorems nor the proofs. A key technical tool used to show that the algorithm is well defined is the mirror coupling of Lindvall and Rogers. Comment: 21 pages, 2 figures
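
    The paper's analysis is purely continuous in time and state; the sketch below discretises a 1-D controlled diffusion only to make the two PIA steps concrete: solve a linear ODE for the current policy's payoff, then improve the policy pointwise from the HJB right-hand side. The grid, reflecting boundaries, and problem data are hypothetical.

```python
import numpy as np

def pia_1d_diffusion(xs, actions, b, sigma, f, rho=0.1, sweeps=20):
    """Policy iteration for  rho v = min_a [ f + b v' + (sigma^2 / 2) v'' ]."""
    h, n = xs[1] - xs[0], len(xs)
    policy = np.full(n, actions[0])
    for _ in range(sweeps):
        # evaluation: solve the linear ODE  rho v - b v' - D v'' = f
        A, rhs = np.zeros((n, n)), np.zeros(n)
        for i in range(1, n - 1):
            drift = b(xs[i], policy[i])
            D = 0.5 * sigma(xs[i], policy[i]) ** 2
            A[i, i - 1] = drift / (2 * h) - D / h**2
            A[i, i] = rho + 2 * D / h**2
            A[i, i + 1] = -drift / (2 * h) - D / h**2
            rhs[i] = f(xs[i], policy[i])
        A[0, 0], A[0, 1] = 1.0, -1.0          # reflecting (Neumann) boundaries
        A[-1, -1], A[-1, -2] = 1.0, -1.0
        v = np.linalg.solve(A, rhs)
        # improvement: pointwise minimisation of the HJB right-hand side
        dv = np.gradient(v, h)
        d2v = np.gradient(dv, h)
        for i in range(n):
            vals = [f(xs[i], a) + b(xs[i], a) * dv[i]
                    + 0.5 * sigma(xs[i], a) ** 2 * d2v[i] for a in actions]
            policy[i] = actions[int(np.argmin(vals))]
    return v, policy
```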

    A Fenchel-Moreau-Rockafellar type theorem on the Kantorovich-Wasserstein space with Applications in Partially Observable Markov Decision Processes

    By using the fact that the space of all probability measures with finite support can be completed in two different fashions, one generating the Arens-Eells space and the other generating the Kantorovich-Wasserstein (Wasserstein-1) space, and by exploiting the duality between the Arens-Eells space and the space of Lipschitz functions, we provide a dual representation of Fenchel-Moreau-Rockafellar type for proper convex functionals on the Wasserstein-1 space. We recover dual transportation inequalities as a corollary, and we give examples where the theorem can be used to easily prove dual expressions such as the celebrated Donsker-Varadhan variational formula. Finally, our result allows a convex function to be written as the supremum over all linear functions generated by roots of its conjugate dual, which we apply to partially observable Markov decision processes (POMDPs) to approximate the value function of a given POMDP by iterating level sets. This extends the method used by Smallwood (1973) for finite state spaces to the case where the state space is a Polish metric space. Comment: 20 pages
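
    The finite-state method being generalised is the classical alpha-vector picture: the POMDP value function is a maximum of linear functionals of the belief, and value iteration acts on the set of supporting vectors. A sketch of one exact (unpruned, hence exponential-size) backup for a tiny model:

```python
import numpy as np
from itertools import product

def exact_backup(alphas, P, Z, R, gamma=0.95):
    """One value-iteration step over alpha vectors (tiny problems only).

    P[a]: (S, S) transitions, Z[a]: (S, O) observation probs, R[a]: (S,) rewards.
    The value function is V(b) = max_alpha alpha @ b -- a supremum of
    linear functions of the belief b.
    """
    new = []
    n_obs = Z[0].shape[1]
    for a in range(len(P)):
        # g[o][k](s) = gamma * sum_{s'} P[a][s, s'] Z[a][s', o] alphas[k][s']
        g = [[gamma * P[a] @ (Z[a][:, o] * al) for al in alphas]
             for o in range(n_obs)]
        # choose the best successor vector independently per observation
        for choice in product(range(len(alphas)), repeat=n_obs):
            new.append(R[a] + sum(g[o][k] for o, k in enumerate(choice)))
    return new
```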

    Q-learning with Nearest Neighbors

    We consider model-free reinforcement learning for infinite-horizon discounted Markov decision processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path under an arbitrary policy of the system is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm, which learns the optimal Q function using nearest neighbor regression. As the main contribution, we provide a tight finite-sample analysis of the convergence rate. In particular, for MDPs with a $d$-dimensional state space and discount factor $\gamma \in (0,1)$, given an arbitrary sample path with "covering time" $L$, we establish that the algorithm is guaranteed to output an $\varepsilon$-accurate estimate of the optimal Q-function using $\tilde{O}\big(L/(\varepsilon^3(1-\gamma)^7)\big)$ samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as $\tilde{O}\big(1/\varepsilon^d\big)$, so the sample complexity scales as $\tilde{O}\big(1/\varepsilon^{d+3}\big)$. Indeed, we establish a lower bound showing that a dependence of $\tilde{\Omega}\big(1/\varepsilon^{d+2}\big)$ is necessary. Comment: Accepted to NIPS 2018
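
    A schematic of the NNQL-style update under simplifying assumptions: a 1-D state space, hard nearest-neighbor assignment rather than the paper's nearest-neighbor regression, and a plain 1/n step-size schedule.

```python
import numpy as np

def nnql(path, centers, n_actions, gamma=0.9):
    """path: iterable of (s, a, r, s_next) transitions from a single run;
    centers: 1-D array of representative states where Q is stored."""
    Q = np.zeros((len(centers), n_actions))
    visits = np.zeros((len(centers), n_actions))
    for s, a, r, s2 in path:
        i = np.abs(centers - s).argmin()     # nearest center to s
        j = np.abs(centers - s2).argmin()    # nearest center to s'
        visits[i, a] += 1
        lr = 1.0 / visits[i, a]              # simplified step-size schedule
        Q[i, a] += lr * (r + gamma * Q[j].max() - Q[i, a])
    return Q
```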

    Empirical Dynamic Programming

    We propose empirical dynamic programming algorithms for Markov decision processes (MDPs). In these algorithms, the exact expectation in the Bellman operator of classical value iteration is replaced by an empirical estimate, giving `empirical value iteration' (EVI); policy evaluation and policy improvement in classical policy iteration are likewise replaced by simulation, giving `empirical policy iteration' (EPI). These empirical dynamic programming algorithms thus involve iterating a random operator, the empirical Bellman operator. We introduce notions of probabilistic fixed points for such random monotone operators and develop a stochastic dominance framework for their convergence analysis, which we then use to give sample complexity bounds for both EVI and EPI. We also provide variations and extensions, including asynchronous empirical dynamic programming and the minimax empirical dynamic program, and show how the framework can be used to solve the dynamic newsvendor problem. Preliminary experimental results suggest a faster rate of convergence than stochastic approximation algorithms. Comment: 34 pages, 1 figure
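
    The core substitution is easy to state in code: replace the expectation in the Bellman backup with a sample average over simulated next states, so each sweep applies a random operator. A minimal sketch with a user-supplied simulator (names and the cost-minimisation setup are illustrative):

```python
import numpy as np

def empirical_value_iteration(simulate, cost, n_states, n_actions,
                              gamma=0.9, n_samples=50, iters=100, rng=None):
    """simulate(s, a, n, rng) -> array of n sampled next-state indices."""
    rng = rng or np.random.default_rng(0)
    V = np.zeros(n_states)
    for _ in range(iters):
        Vn = np.empty(n_states)
        for s in range(n_states):
            q = []
            for a in range(n_actions):
                nxt = simulate(s, a, n_samples, rng)
                # empirical estimate replaces the exact expectation E[V(s')]
                q.append(cost(s, a) + gamma * V[nxt].mean())
            Vn[s] = min(q)
        V = Vn
    return V
```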

    An Empirical Dynamic Programming Algorithm for Continuous MDPs

    We propose universal randomized function approximation-based empirical value iteration (EVI) algorithms for Markov decision processes. The `empirical' nature comes from each iteration being carried out using samples of the next state from simulations, which makes the Bellman operator a random operator. A parametric method and a non-parametric method for function approximation, using a parametric function space and a reproducing kernel Hilbert space (RKHS) respectively, are then combined with EVI. Both function spaces have the universal function approximation property, and basis functions are picked randomly. Convergence analysis is carried out in a random operator framework with techniques from the theory of stochastic dominance, and finite-time sample complexity bounds are derived for both universal approximate dynamic programming algorithms. Numerical experiments support the versatility and effectiveness of this approach. Comment: Accepted for publication in IEEE Transactions on Automatic Control
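
    A sketch of one way to instantiate the scheme, with random Fourier features standing in for the paper's randomly drawn universal basis functions: each iterate computes empirical Bellman targets at sample states (via a user-supplied backup routine, assumed here) and refits the value weights by least squares.

```python
import numpy as np

def evi_random_features(states, bellman_target, d_feat=200, iters=50, rng=None):
    """states: (m, dim) sample states; bellman_target(states, V) -> (m,)
    empirical Bellman backup values (user-supplied, hypothetical)."""
    rng = rng or np.random.default_rng(0)
    W = rng.normal(size=(d_feat, states.shape[1]))   # random frequencies
    b = rng.uniform(0, 2 * np.pi, d_feat)            # random phases
    feats = lambda X: np.cos(X @ W.T + b)            # random Fourier features
    theta = np.zeros(d_feat)
    V = lambda X: feats(X) @ theta                   # current value estimate
    for _ in range(iters):
        y = bellman_target(states, V)                # empirical backup targets
        theta, *_ = np.linalg.lstsq(feats(states), y, rcond=None)
    return V
```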

    Solving Factored MDPs with Hybrid State and Action Variables

    Efficient representations and solutions for large decision problems with continuous and discrete variables are among the most important challenges faced by the designers of automated decision support systems. In this paper, we describe a novel hybrid factored Markov decision process (MDP) model that allows a compact representation of these problems, and a new hybrid approximate linear programming (HALP) framework that permits their efficient solution. The central idea of HALP is to approximate the optimal value function by a linear combination of basis functions and to optimize its weights by linear programming. We analyze both theoretical and computational aspects of this approach, and demonstrate its scale-up potential on several hybrid optimization problems.
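
    The ALP core that HALP builds on, in miniature: approximate V by a linear combination of basis functions and pick the weights by a linear program whose constraints are (sampled) Bellman inequalities. In HALP the constraint expectations have closed form for suitable basis/transition pairs; in this sketch they are simply passed in precomputed, and all names and shapes are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def alp_weights(Phi, rewards, next_feats, c, gamma=0.95):
    """Phi: (m, k) basis features at the states of m sampled (s, a) pairs;
    next_feats: (m, k) expected next-state features E[phi(s') | s, a];
    rewards: (m,); c: (m,) state-relevance weights over the samples.

    Solves  min_w  c @ Phi @ w
            s.t.   (Phi w)(s) >= r(s, a) + gamma * E[(Phi w)(s')]  per sample.
    """
    A_ub = -(Phi - gamma * next_feats)       # flip sign for linprog's <= form
    b_ub = -rewards
    res = linprog(c=c @ Phi, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * Phi.shape[1])
    return res.x
```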

    Robustness to incorrect system models in stochastic control

    In stochastic control applications, typically only an ideal model (controlled transition kernel) is assumed, and the control design is based on this given model, raising the problem of performance loss due to the mismatch between the assumed and actual models. To address this, we study continuity properties of discrete-time stochastic control problems with respect to system models (i.e., controlled transition kernels) and the robustness of optimal control policies designed for incorrect models when applied to the true system. We study both fully observed and partially observed setups under an infinite-horizon discounted expected cost criterion. We show that continuity and robustness cannot be established under weak and setwise convergence of transition kernels in general, but that the expected induced cost is robust under total variation. By imposing further assumptions on the measurement models and on the kernel itself (such as continuous convergence), we show that the optimal cost can be made continuous under weak convergence of transition kernels as well. Using these continuity properties, we establish convergence results and error bounds for the mismatch that occurs when a control policy designed for an incorrectly estimated system model is applied to the true model, thus establishing both positive and negative results on robustness.

    Compared to the existing literature, we obtain strictly refined robustness results that are applicable even when the incorrect models can be investigated under weak convergence and setwise convergence criteria (with respect to a true model), in addition to the total variation criterion. These have positive implications for empirical learning in (data-driven) stochastic control, since system models are often learned through empirical training data, where typically the weak convergence criterion applies but stronger convergence criteria do not. Comment: Conference version to appear at the 2018 IEEE CDC with the title "Robustness to Incorrect System Models in Stochastic Control and Application to Data-Driven Learning". The paper is to appear in the SIAM Journal on Control and Optimization.
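
    The robustness question is easy to probe numerically in a toy finite model: design an optimal policy for a perturbed kernel, evaluate it under the true kernel, and measure the induced cost gap. All data below are made up, and the finite setting is only an illustration; the paper's results concern general Borel models and convergence notions for kernels.

```python
import numpy as np

def policy_for(P, C, gamma=0.9, iters=500):
    """Optimal policy for kernel P[s, a, s'] and cost C[s, a]."""
    V = np.zeros(C.shape[0])
    for _ in range(iters):
        Q = C + gamma * np.einsum('sat,t->sa', P, V)
        V = Q.min(axis=1)
    return Q.argmin(axis=1)

def evaluate(P, C, pi, gamma=0.9, iters=500):
    """Discounted cost of stationary policy pi under the true kernel P."""
    idx = np.arange(len(pi))
    V = np.zeros(len(pi))
    for _ in range(iters):
        V = C[idx, pi] + gamma * np.einsum('st,t->s', P[idx, pi], V)
    return V

rng = np.random.default_rng(0)
nS, nA = 5, 3
P = rng.dirichlet(np.ones(nS), size=(nS, nA))                       # true kernel
P_hat = 0.9 * P + 0.1 * rng.dirichlet(np.ones(nS), size=(nS, nA))   # mismatched
C = rng.random((nS, nA))
pi_hat = policy_for(P_hat, C)              # designed for the wrong model
gap = evaluate(P, C, pi_hat) - evaluate(P, C, policy_for(P, C))     # >= 0
```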