52,489 research outputs found
An Efficient Policy Iteration Algorithm for Dynamic Programming Equations
We present an accelerated algorithm for the solution of static
Hamilton-Jacobi-Bellman equations related to optimal control problems. Our
scheme is based on a classic policy iteration procedure, which is known to have
superlinear convergence in many relevant cases provided the initial guess is
sufficiently close to the solution. When this condition is not met, the method
degenerates into a behavior similar to a value iteration method, with an
increased computation time. The new scheme circumvents this problem by
combining the advantages of both algorithms through an efficient coupling. The
method starts with a coarse-mesh value iteration phase and then switches to a
fine-mesh policy iteration procedure when a certain error threshold is reached.
A delicate point is to determine this threshold so as to avoid cumbersome
computations with the value iteration and, at the same time, to ensure that the
policy iteration method will converge to the optimal solution. We analyze the
methods and their coupling in a number of examples in dimensions two, three,
and four, illustrating their properties.
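The two-phase strategy described above can be sketched on a toy problem. The sketch below uses a hypothetical 2-state, 2-action discounted MDP (all transition and cost data invented for illustration) rather than a discretized Hamilton-Jacobi-Bellman equation: value iteration runs until the residual drops below a switching threshold, then exact policy iteration finishes from that warm start.

```python
# Minimal sketch of a value-iteration / policy-iteration coupling on a
# hypothetical 2-state, 2-action discounted MDP (toy data, not the paper's
# HJB discretization).
import numpy as np

gamma = 0.9
# P[a][s, t] = transition probability from s to t under action a (assumed)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
# C[s, a] = one-step cost of action a in state s (assumed)
C = np.array([[1.0, 2.0], [2.0, 0.5]])

def bellman(V):
    # Q[s, a] = C[s, a] + gamma * sum_t P[a][s, t] * V[t]
    Q = C + gamma * np.einsum('ast,t->sa', P, V)
    return Q.min(axis=1), Q.argmin(axis=1)

# Phase 1: value iteration down to a coarse switching threshold.
V = np.zeros(2)
while True:
    V_new, _ = bellman(V)
    if np.max(np.abs(V_new - V)) < 1e-2:  # the "error threshold"
        V = V_new
        break
    V = V_new

# Phase 2: policy iteration from the value-iteration warm start.
_, policy = bellman(V)
while True:
    # Exact policy evaluation: solve (I - gamma * P_pi) V = c_pi
    P_pi = P[policy, np.arange(2), :]
    c_pi = C[np.arange(2), policy]
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, c_pi)
    _, new_policy = bellman(V)
    if np.all(new_policy == policy):
        break
    policy = new_policy
```

The switching threshold (here 1e-2) plays the role the abstract highlights: loose enough that value iteration stops early, tight enough that policy iteration starts inside its fast-convergence region.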
Adaptive Channel Recommendation For Opportunistic Spectrum Access
We propose a dynamic spectrum access scheme in which secondary users recommend
"good" channels to each other and access the spectrum accordingly. We formulate
the problem as an average-reward Markov decision process. We show the existence
of an optimal stationary spectrum access policy and explore its structural
properties in two asymptotic cases. Since the action space of the Markov
decision process is continuous, it is difficult to find the optimal policy by
simply discretizing the action space and applying policy iteration, value
iteration, or Q-learning. Instead, we propose a new algorithm based on the
Model Reference Adaptive Search method and prove its convergence to the
optimal policy. Numerical results show that the proposed algorithms achieve up
to 18% and 100% performance improvement over the static channel recommendation
scheme in homogeneous and heterogeneous channel environments, respectively, and
are more robust to channel dynamics.
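Model Reference Adaptive Search maintains a sampling distribution over the continuous action/policy space and reshapes it toward high-reward regions using a reference model. The sketch below shows only the simpler cross-entropy-style core of that idea (sample, select elites, refit), on a hypothetical one-dimensional objective standing in for the average reward; it is not the paper's algorithm.

```python
# Hedged sketch of the sampling loop underlying methods like Model Reference
# Adaptive Search, in its simplest cross-entropy-like form.
import numpy as np

def reward(theta):
    # Stand-in for the average reward of a policy with continuous parameter
    # theta (hypothetical smooth objective, maximized at 0.7).
    return -(theta - 0.7) ** 2

rng = np.random.default_rng(1)
mu, sigma = 0.0, 1.0
for _ in range(50):
    # Sample candidate parameters from the current Gaussian.
    samples = rng.normal(mu, sigma, size=200)
    # Keep the elite 10% by reward and refit the Gaussian to them.
    elite = samples[np.argsort(reward(samples))[-20:]]
    mu, sigma = elite.mean(), elite.std() + 1e-6
# mu concentrates near the maximizer of the reward
```

The advantage over grid discretization, as the abstract notes, is that the sampling distribution adapts its resolution to the promising region of the continuous action space instead of fixing it in advance.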
Dynamic Programming for Positive Linear Systems with Linear Costs
Recent work by Rantzer [Ran22] formulated a class of optimal control problems
involving positive linear systems, linear stage costs, and linear constraints.
It was shown that the associated Bellman equation can be characterized by a
finite-dimensional nonlinear equation, which is solved by linear programming.
In this work, we report complementary theory for the same class of problems.
In particular, we provide conditions under which the solution is unique,
investigate properties of the optimal policy, study the convergence of value
iteration, policy iteration, and optimistic policy iteration applied to such
problems, and analyze the boundedness of the solution to the associated linear
program. Apart from a form of the Perron-Frobenius theorem, the majority of our
results are built upon generic dynamic programming theory applicable to
problems involving nonnegative stage costs.
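For this problem class the value function stays linear, J(x) = p'x with p ≥ 0, so value iteration reduces to a finite-dimensional fixed-point iteration on the cost vector p. The sketch below illustrates this with invented toy data, assuming a per-coordinate action structure in which each coordinate independently selects a nonnegative dynamics column and a cost coefficient; it is a simplified stand-in for, not a reproduction of, the [Ran22] formulation.

```python
# Hedged sketch: value iteration over linear value functions J(x) = p @ x
# for a positive linear system with linear stage costs (toy data, assumed
# per-coordinate action structure).
import numpy as np

n, m = 3, 2  # number of states and of actions per coordinate (toy sizes)
rng = np.random.default_rng(0)
# A[u][:, j] = nonnegative dynamics column selected by action u at
# coordinate j; entries kept small so column sums are < 1 (subcritical).
A = 0.25 * rng.random((m, n, n))
c = 1.0 + rng.random((m, n))  # c[u, j] = stage cost coefficient (assumed)

p = np.zeros(n)
for _ in range(500):
    # Bellman operator on linear value functions:
    # p_new[j] = min_u ( c[u, j] + p @ A[u][:, j] )
    p_new = np.min(c + p @ A, axis=0)
    if np.max(np.abs(p_new - p)) < 1e-10:
        p = p_new
        break
    p = p_new
# p now solves the finite-dimensional equation p = min_u (c_u + A_u' p)
```

Because the dynamics are subcritical, the iteration is a contraction and converges linearly, matching the kind of value-iteration convergence the abstract studies.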
Nested Pseudo-likelihood Estimation and Bootstrap-based Inference for Structural Discrete Markov Decision Models
This paper analyzes the higher-order properties of nested pseudo-likelihood (NPL) estimators and their practical implementation for parametric discrete Markov decision models in which the probability distribution is defined as a fixed point. We propose a new NPL estimator that can achieve quadratic convergence without fully solving the fixed point problem in every iteration. We then extend the NPL estimators to develop one-step NPL bootstrap procedures for discrete Markov decision models and provide some Monte Carlo evidence based on a machine replacement model of Rust (1987). The proposed one-step bootstrap test statistics and confidence intervals improve upon the first order asymptotics even with a relatively small number of iterations. Improvements are particularly noticeable when analyzing the dynamic impacts of counterfactual policies.
Keywords: Edgeworth expansion, k-step bootstrap, maximum pseudo-likelihood estimators, nested fixed point algorithm, Newton-Raphson method, policy iteration
A note on the policy iteration algorithm for discounted Markov decision processes for a class of semicontinuous models
The standard version of the policy iteration (PI) algorithm fails for
semicontinuous models, that is, for models with lower semicontinuous one-step
costs and a weakly continuous transition law. This is due to the lack of
continuity properties of the discounted cost for stationary policies, which
gives rise to a measurability problem in the improvement step. The present
work proposes an alternative version of the PI algorithm that performs a
smoothing step to avoid the measurability problem. Assuming that the model
satisfies a Lyapunov growth condition and some standard continuity-compactness
properties, linear convergence of the policy iteration functions to the
optimal value function is shown. In a second result, under strengthened
continuity conditions, it is shown that among the improvement policies there
is one with the best possible improvement and whose cost function is
continuous.
Comment: fourteen pages
Deep Reinforcement Learning for Approximate Policy Iteration: Convergence Analysis and a Post-Earthquake Disaster Response Case Study
Approximate Policy Iteration (API) is a class of Reinforcement Learning (RL) algorithms that seek to solve the long-run discounted reward Markov decision process (MDP), via the policy iteration paradigm, without learning the transition model in the underlying Bellman equation. Unfortunately, these algorithms suffer from a defect known as chattering, in which the solution (policy) delivered in each iteration of the algorithm oscillates between improved and worsened policies, leading to sub-optimal behavior. Two causes for this that have been traced to the crucial policy improvement step are: (i) the inaccuracies in the policy improvement function and (ii) the exploration/exploitation tradeoff integral to this step, which generates variability in performance. Both of these defects are amplified by simulation noise. Deep RL belongs to a newer class of algorithms in which the resolution of the learning process is refined via mechanisms such as experience replay and/or deep neural networks for improved performance. In this paper, a new deep learning approach is developed for API which employs a more accurate policy improvement function, via an enhanced-resolution Bellman equation, thereby reducing chattering and eliminating the need for exploration in the policy improvement step. Versions of the new algorithm for both the long-run discounted MDP and semi-MDP are presented. Convergence properties of the new algorithm are studied mathematically, and a post-earthquake disaster response case study is employed to demonstrate numerically the algorithm's efficacy.