    An efficient policy iteration algorithm for dynamic programming equations

    We present an accelerated algorithm for the solution of static Hamilton–Jacobi–Bellman equations related to optimal control problems. Our scheme is based on a classic policy iteration procedure, which is known to have superlinear convergence in many relevant cases provided the initial guess is sufficiently close to the solution. Without such a guess, the iteration often degenerates into behavior similar to that of a value iteration method, with increased computation time. The new scheme circumvents this problem by combining the advantages of both algorithms through an efficient coupling. The method starts with a coarse-mesh value iteration phase and then switches to a fine-mesh policy iteration procedure when a certain error threshold is reached. A delicate point is to determine this threshold so as to avoid unnecessary value iteration computations while still ensuring that the policy iteration method converges to the optimal solution. We analyze the method and its coupling in a number of examples in different dimensions, illustrating its properties.
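
    To make the coupling concrete, here is a minimal sketch in Python/NumPy for a finite-state, finite-action discounted problem standing in for the discretized Hamilton–Jacobi–Bellman system; the toy problem data and the switching threshold switch_tol are illustrative assumptions, not the authors' setup. Value iteration runs until the sup-norm residual drops below the threshold, and the resulting value function seeds policy iteration.

    import numpy as np

    def vi_then_pi(P, c, gamma, switch_tol=1e-2):
        """Value iteration until the Bellman residual falls below switch_tol,
        then policy iteration started from the resulting guess.
        P: (nA, nS, nS) transition matrices; c: (nS, nA) stage costs."""
        nA, nS, _ = P.shape
        V = np.zeros(nS)
        # Phase 1: value iteration (cheap iterations, but slow near the solution).
        while True:
            Q = c + gamma * np.einsum('aij,j->ia', P, V)      # Q[s, a]
            V_new = Q.min(axis=1)
            if np.max(np.abs(V_new - V)) < switch_tol:
                V = V_new
                break
            V = V_new
        # Phase 2: policy iteration (costlier iterations, fast local convergence).
        policy = (c + gamma * np.einsum('aij,j->ia', P, V)).argmin(axis=1)
        while True:
            P_mu = P[policy, np.arange(nS), :]                # transitions under policy
            c_mu = c[np.arange(nS), policy]
            V = np.linalg.solve(np.eye(nS) - gamma * P_mu, c_mu)   # exact evaluation
            new_policy = (c + gamma * np.einsum('aij,j->ia', P, V)).argmin(axis=1)
            if np.array_equal(new_policy, policy):
                return V, policy                              # no further improvement
            policy = new_policy

    # Toy data: a random controlled Markov chain with 50 states and 4 actions.
    rng = np.random.default_rng(0)
    nS, nA = 50, 4
    P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
    c = rng.random((nS, nA))
    V, policy = vi_then_pi(P, c, gamma=0.95)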

    Adaptive Channel Recommendation For Opportunistic Spectrum Access

    We propose a dynamic spectrum access scheme where secondary users recommend "good" channels to each other and access the spectrum accordingly. We formulate the problem as an average-reward Markov decision process. We show the existence of the optimal stationary spectrum access policy and explore its structural properties in two asymptotic cases. Since the action space of the Markov decision process is continuous, it is difficult to find the optimal policy by simply discretizing the action space and using policy iteration, value iteration, or Q-learning. Instead, we propose a new algorithm based on the Model Reference Adaptive Search method and prove its convergence to the optimal policy. Numerical results show that the proposed algorithm achieves up to 18% and 100% performance improvement over the static channel recommendation scheme in homogeneous and heterogeneous channel environments, respectively, and is more robust to channel dynamics.
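
    Because the action space is continuous, the authors rely on sampling-based search (Model Reference Adaptive Search) rather than discretization. The sketch below only conveys the flavor of such model-based sampling methods with a simplified, cross-entropy-style loop over a scalar decision variable in [0, 1]; the surrogate objective and all parameters are made-up stand-ins, not the paper's average-reward spectrum-access objective or the exact MRAS updates.

    import numpy as np

    def surrogate_reward(theta):
        # Stand-in objective; the paper instead evaluates the average reward of
        # the spectrum-access policy induced by the continuous parameter.
        return -(theta - 0.37) ** 2

    def sampling_search(reward, n_iter=50, n_samples=200, elite_frac=0.1, seed=0):
        """Gaussian sampling search over a scalar decision in [0, 1]: sample
        candidates, keep the best ones, refit the sampling distribution."""
        rng = np.random.default_rng(seed)
        mu, sigma = 0.5, 0.3
        n_elite = max(1, int(elite_frac * n_samples))
        for _ in range(n_iter):
            thetas = np.clip(rng.normal(mu, sigma, n_samples), 0.0, 1.0)
            rewards = np.array([reward(t) for t in thetas])
            elites = thetas[np.argsort(rewards)[-n_elite:]]    # best candidates
            mu, sigma = elites.mean(), elites.std() + 1e-6     # updated sampling model
        return mu

    print(sampling_search(surrogate_reward))   # close to 0.37 for this toy objective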

    Dynamic Programming for Positive Linear Systems with Linear Costs

    Recent work by Rantzer [Ran22] formulated a class of optimal control problems involving positive linear systems, linear stage costs, and linear constraints. It was shown that the associated Bellman equation can be characterized by a finite-dimensional nonlinear equation, which can be solved by linear programming. In this work, we report complementary theoretical results for the same class of problems. In particular, we provide conditions under which the solution is unique, investigate properties of the optimal policy, study the convergence of value iteration, policy iteration, and optimistic policy iteration applied to such problems, and analyze the boundedness of the solution to the associated linear program. Apart from a form of the Frobenius-Perron theorem, the majority of our results are built upon generic dynamic programming theory applicable to problems involving nonnegative stage costs.
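
    As a hedged illustration of the linear-programming characterization mentioned above, the sketch below solves a finite-dimensional monotone Bellman-type equation p = min_u (c_u + A_u p), with nonnegative matrices A_u scaled to have spectral radius below one, by maximizing 1^T p subject to p <= c_u + A_u p, and cross-checks the result against value iteration. The problem data are random stand-ins and the equation is a generic analogue, not the specific equation derived in [Ran22].

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(1)
    n, m, gamma = 30, 3, 0.9          # dimension, number of modes, scaling factor

    # Nonnegative matrices with spectral radius < 1 and nonnegative cost vectors.
    A = rng.random((m, n, n)); A /= A.sum(axis=2, keepdims=True); A *= gamma
    c = rng.random((m, n))

    # LP characterization: maximize 1^T p subject to (I - A_u) p <= c_u for every mode u.
    A_ub = np.vstack([np.eye(n) - A[u] for u in range(m)])
    b_ub = np.concatenate([c[u] for u in range(m)])
    res = linprog(-np.ones(n), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n, method="highs")
    p_lp = res.x

    # Cross-check: value iteration p <- min_u (c_u + A_u p) reaches the same point.
    p = np.zeros(n)
    for _ in range(500):
        p = np.min(c + A @ p, axis=0)
    print(np.max(np.abs(p - p_lp)))    # should be near zero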

    Nested Pseudo-likelihood Estimation and Bootstrap-based Inference for Structural Discrete Markov Decision Models

    This paper analyzes the higher-order properties of nested pseudo-likelihood (NPL) estimators and their practical implementation for parametric discrete Markov decision models in which the probability distribution is defined as a fixed point. We propose a new NPL estimator that can achieve quadratic convergence without fully solving the fixed point problem in every iteration. We then extend the NPL estimators to develop one-step NPL bootstrap procedures for discrete Markov decision models and provide some Monte Carlo evidence based on a machine replacement model of Rust (1987). The proposed one-step bootstrap test statistics and confidence intervals improve upon the first-order asymptotics even with a relatively small number of iterations. Improvements are particularly noticeable when analyzing the dynamic impacts of counterfactual policies.
    Keywords: Edgeworth expansion, k-step bootstrap, maximum pseudo-likelihood estimators, nested fixed point algorithm, Newton-Raphson method, policy iteration
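
    For readers unfamiliar with NPL, the sketch below implements the basic iteration (pseudo-likelihood maximization followed by one application of the policy-iteration mapping Psi) on a stylized machine-replacement model in the spirit of Rust (1987). The state space, utilities, transition rule, parameter values, and the use of Nelder-Mead are illustrative assumptions, and the higher-order refinements and bootstrap procedures discussed in the abstract are not implemented.

    import numpy as np
    from scipy.optimize import minimize

    EULER = 0.5772156649
    K, beta = 5, 0.95                        # mileage states, discount factor

    # Deterministic transitions f(x'|x,a): keeping ages the machine, replacing resets it.
    F = np.zeros((2, K, K))
    for x in range(K):
        F[0, x, min(x + 1, K - 1)] = 1.0     # a = 0: keep
        F[1, x, 0] = 1.0                     # a = 1: replace

    def flow_utility(theta):
        # u[x, a] with maintenance cost theta[0] and replacement cost theta[1].
        u = np.zeros((K, 2))
        u[:, 0] = -theta[0] * np.arange(K)
        u[:, 1] = -theta[1]
        return u

    def choice_values(u, V):
        # Choice-specific values v(x, a) = u(x, a) + beta * E[V(x') | x, a].
        return u + beta * np.stack([F[0] @ V, F[1] @ V], axis=1)

    def psi(theta, P):
        """NPL mapping: given CCPs P, return the best-response CCPs Psi(theta, P)."""
        u = flow_utility(theta)
        e_P = (P * (u + EULER - np.log(P))).sum(axis=1)        # expected flow payoff
        F_P = P[:, 0, None] * F[0] + P[:, 1, None] * F[1]      # transitions under P
        V = np.linalg.solve(np.eye(K) - beta * F_P, e_P)       # value implied by P
        v = choice_values(u, V)
        v -= v.max(axis=1, keepdims=True)
        return np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)

    def solve_ccp(theta, iters=1000):
        # Solve the logit dynamic program by smoothed value iteration (a contraction).
        u, V = flow_utility(theta), np.zeros(K)
        for _ in range(iters):
            v = choice_values(u, V)
            m = v.max(axis=1)
            V = EULER + m + np.log(np.exp(v - m[:, None]).sum(axis=1))
        v = choice_values(u, V)
        v -= v.max(axis=1, keepdims=True)
        return np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)

    # Synthetic data drawn from the model at the true parameters.
    theta_true = np.array([0.5, 2.0])
    P_star = solve_ccp(theta_true)
    rng = np.random.default_rng(0)
    xs = rng.integers(0, K, 5000)
    acts = (rng.random(5000) < P_star[xs, 1]).astype(int)

    # NPL iterations: maximize the pseudo-likelihood, then update the CCPs once.
    P, theta = np.full((K, 2), 0.5), np.array([1.0, 1.0])
    for k in range(5):
        nll = lambda th, P=P: -np.log(psi(th, P)[xs, acts]).sum()
        theta = minimize(nll, theta, method="Nelder-Mead").x
        P = psi(theta, P)
        print(k, theta)                      # parameter estimates at each NPL step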

    A note on the policy iteration algorithm for discounted Markov decision processes for a class of semicontinuous models

    The standard version of the policy iteration (PI) algorithm fails for semicontinuous models, that is, for models with lower semicontinuous one-step costs and a weakly continuous transition law. This is due to the lack of continuity properties of the discounted cost for stationary policies, which gives rise to a measurability problem in the improvement step. The present work proposes an alternative version of the PI algorithm that performs a smoothing step to avoid the measurability problem. Assuming that the model satisfies a Lyapunov growth condition and some standard continuity-compactness properties, linear convergence of the policy iteration functions to the optimal value function is shown. Under strengthened continuity conditions, a second result shows that among the improvement policies there is one achieving the best possible improvement whose cost function is continuous.

    Deep Reinforcement Learning for Approximate Policy Iteration: Convergence Analysis and a Post-Earthquake Disaster Response Case Study

    Approximate Policy Iteration (API) is a class of Reinforcement Learning (RL) algorithms that seek to solve the long-run discounted reward Markov Decision Process (MDP), via the policy iteration paradigm, without learning the transition model in the underlying Bellman equation. Unfortunately, these algorithms suffer from a defect known as chattering, in which the solution (policy) delivered in each iteration of the algorithm oscillates between improved and worsened policies, leading to sub-optimal behavior. Two causes for this that have been traced to the crucial policy improvement step are: (i) the inaccuracies in the policy improvement function and (ii) the exploration/exploitation tradeoff integral to this step, which generates variability in performance. Both of these defects are amplified by simulation noise. Deep RL belongs to a newer class of algorithms in which the resolution of the learning process is refined via mechanisms such as experience replay and/or deep neural networks for improved performance. In this paper, a new deep learning approach is developed for API which employs a more accurate policy improvement function, via an enhanced resolution Bellman equation, thereby reducing chattering and eliminating the need for exploration in the policy improvement step. Versions of the new algorithm for both the long-run discounted MDP and semi-MDP are presented. Convergence properties of the new algorithm are studied mathematically, and a post-earthquake disaster response case study is employed to demonstrate numerically the algorithm's efficacy.
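
    To illustrate where chattering enters a generic approximate-policy-iteration loop, the sketch below evaluates each policy by Monte-Carlo rollouts on a small random MDP and then improves greedily; with few rollouts the noisy Q estimates can make the greedy policy oscillate between iterations, while more rollouts tend to stabilize it. This is a plain API loop on made-up data, not the paper's deep-RL algorithm or its enhanced-resolution Bellman equation.

    import numpy as np

    rng = np.random.default_rng(0)
    nS, nA, gamma = 6, 2, 0.9

    # Toy MDP with random transitions and rewards (reward maximization).
    P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
    R = rng.random((nS, nA))

    def rollout_return(s, a, policy, horizon=40):
        # One simulated discounted return starting from (s, a), then following policy.
        g, disc = 0.0, 1.0
        for _ in range(horizon):
            g += disc * R[s, a]
            disc *= gamma
            s = rng.choice(nS, p=P[s, a])
            a = policy[s]
        return g

    def api(n_rollouts, n_iters=10):
        """Approximate policy iteration with Monte-Carlo policy evaluation.
        Few rollouts -> noisy Q estimates -> the greedy policy may chatter."""
        policy = np.zeros(nS, dtype=int)
        for _ in range(n_iters):
            Q = np.zeros((nS, nA))
            for s in range(nS):
                for a in range(nA):
                    Q[s, a] = np.mean([rollout_return(s, a, policy)
                                       for _ in range(n_rollouts)])
            policy = Q.argmax(axis=1)          # greedy policy improvement step
            print(policy)
        return policy

    api(n_rollouts=5)     # noisy evaluation: the policy may oscillate across iterations
    api(n_rollouts=100)   # more accurate evaluation: the policy typically settles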