
    Approximate Modified Policy Iteration

    Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximate form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide error propagation analyses that unify those for approximate policy and value iteration. For the last, classification-based implementation, we develop a finite-sample analysis showing that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation.
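
    To make the role of MPI's main parameter concrete, here is a minimal sketch of exact MPI on a toy tabular MDP (an illustration only, not the paper's approximate implementations); the array shapes and names below are assumptions.

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, m, n_iters=100):
    """Exact MPI on a tabular MDP (illustrative sketch, not the paper's AMPI).

    P: transition kernel, shape (A, S, S); R: rewards, shape (A, S).
    m: number of partial evaluation backups per iteration; under this
       indexing, m = 1 recovers value iteration and m -> infinity
       approaches policy iteration.
    """
    A, S, _ = P.shape
    v = np.zeros(S)
    for _ in range(n_iters):
        # Greedy step: pi_{k+1} is greedy with respect to v_k.
        q = R + gamma * np.einsum("asx,x->as", P, v)   # shape (A, S)
        pi = q.argmax(axis=0)                          # shape (S,)
        # Partial evaluation step: apply T_pi to v_k exactly m times.
        R_pi = R[pi, np.arange(S)]
        P_pi = P[pi, np.arange(S), :]
        for _ in range(m):
            v = R_pi + gamma * P_pi @ v
    return v, pi
```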

    Tight Performance Bounds for Approximate Modified Policy Iteration with Non-Stationary Policies

    We consider approximate dynamic programming for the infinite-horizon stationary γ-discounted optimal control problem formalized by Markov Decision Processes. While in the exact case it is known that there always exists an optimal policy that is stationary, we show that when using value function approximation, looking for a non-stationary policy may lead to a better performance guarantee. We define a non-stationary variant of MPI that unifies a broad family of approximate DP algorithms of the literature. For this algorithm we provide an error propagation analysis in the form of a performance bound on the resulting policies that can improve the usual performance bound by a factor O(1−γ), which is significant when the discount factor γ is close to 1. In doing so, our approach unifies recent results for Value and Policy Iteration. Furthermore, we show, by constructing a specific deterministic MDP, that our performance guarantee is tight.
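
    A quick numeric illustration of why a factor of O(1−γ) matters (a sketch using the stationary and l-periodic guarantees quoted in the next abstract, with ε denoting the per-iteration error):

```python
# Compare the stationary bound 2*gamma*eps/(1-gamma)^2 with the
# l-periodic non-stationary bound 2*gamma*eps/((1-gamma)*(1-gamma**l)).
gamma, eps = 0.99, 1.0
stationary = 2 * gamma * eps / (1 - gamma) ** 2
for l in (1, 2, 8, 32, 128):
    non_stationary = 2 * gamma * eps / ((1 - gamma) * (1 - gamma ** l))
    print(f"l={l:4d}  bound={non_stationary:10.1f}  "
          f"improvement over stationary: {stationary / non_stationary:6.1f}x")
# As l grows, the improvement factor approaches 1/(1-gamma) = 100 for gamma = 0.99,
# i.e. the bound shrinks by a factor of order (1-gamma).
```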

    Non-stationary approximate modified policy iteration

    We consider the infinite-horizon γ-discounted optimal control problem formalized by Markov Decision Processes. Running any instance of Modified Policy Iteration (a family of algorithms that can interpolate between Value and Policy Iteration) with an error ε at each iteration is known to lead to stationary policies that are at least 2γε/(1−γ)^2-optimal. Variations of Value and Policy Iteration that build l-periodic non-stationary policies have recently been shown to display a better 2γε/((1−γ)(1−γ^l))-optimality guarantee. We describe a new algorithmic scheme, Non-Stationary Modified Policy Iteration, a family of algorithms parameterized by two integers m ≥ 0 and l ≥ 1 that generalizes all the above-mentioned algorithms. While m allows one to interpolate between Value-Iteration-style and Policy-Iteration-style updates, l specifies the period of the non-stationary policy that is output. We show that this new family of algorithms also enjoys the improved 2γε/((1−γ)(1−γ^l))-optimality guarantee. Perhaps more importantly, we show, by exhibiting an original problem instance, that this guarantee is tight for all m and l; this tightness was, to our knowledge, previously known only in two specific cases: Value Iteration (m = 0, l = 1) and Policy Iteration (m = ∞, l = 1).
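
    To make the role of the period l concrete, the following sketch evaluates an l-periodic non-stationary policy on a tabular MDP (this is only the object the algorithm outputs, not the NSMPI scheme itself; shapes and names are assumptions). The composed Bellman operator is a γ^l-contraction, which is where the (1 − γ^l) term in the guarantee comes from.

```python
import numpy as np

def periodic_policy_value(policies, P, R, gamma, tol=1e-10):
    """Value (at the start of a period) of running the l-periodic
    non-stationary policy pi_1, pi_2, ..., pi_l, pi_1, ... on a tabular MDP.

    v is the fixed point of the composed operator T_{pi_1} ... T_{pi_l},
    a gamma^l-contraction, so simple fixed-point iteration converges.
    """
    A, S, _ = P.shape
    v = np.zeros(S)
    while True:
        v_new = v
        for pi in reversed(policies):     # apply T_{pi_l} first, T_{pi_1} last
            R_pi = R[pi, np.arange(S)]
            P_pi = P[pi, np.arange(S), :]
            v_new = R_pi + gamma * P_pi @ v_new
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```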

    Approximate Dynamic Programming Algorithms for United States Air Force Officer Sustainment

    The United States Air Force (USAF) officer sustainment system involves making accession and promotion decisions for nearly 64 thousand officers annually. We formulate a discrete-time stochastic Markov decision process model to examine this military workforce planning problem. The large size of the motivating problem suggests that conventional exact dynamic programming algorithms are inappropriate. As such, we propose two approximate dynamic programming (ADP) algorithms to solve the problem. We employ a least-squares approximate policy iteration (API) algorithm with instrumental variables Bellman error minimization to determine approximate policies. In this API algorithm, we use a modified version of the Bellman equation based on the post-decision state variable. Approximating the value function using a post-decision state variable allows us to find the best policy for a given approximation using a decomposable mixed integer nonlinear programming formulation. We also propose an approximate value iteration algorithm using concave adaptive value estimation (CAVE). The CAVE algorithm identifies an improved policy for a test problem based on the current USAF officer sustainment system, obtaining a statistically significant 2.8% improvement over the currently employed USAF policy, which serves as the benchmark.
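
    As a rough sketch of the post-decision-state idea (illustration only; the paper's features, instrumental-variables Bellman-error minimization, and MINLP policy step are not reproduced, and all names below are assumptions): the value function is fit on post-decision states, so the policy step only needs the deterministic part of the transition.

```python
import numpy as np

def fit_post_decision_value(samples, phi, theta, gamma):
    """One evaluation pass of a least-squares API sketch around the
    post-decision state: fit V(s_post) ~ theta . phi(s_post).

    samples: (post_state, next_reward, next_post_state) tuples simulated
             under the current policy (assumed data layout).
    """
    X = np.array([phi(s_post) for s_post, _, _ in samples])
    # Sampled Bellman-style targets bootstrapped with the current weights.
    y = np.array([r + gamma * phi(s_next) @ theta
                  for _, r, s_next in samples])
    theta_new, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_new

def greedy_action(state, actions, reward, post_state, phi, theta):
    # Policy step: pick the action maximizing immediate reward plus the
    # value of the resulting (deterministic) post-decision state.
    return max(actions,
               key=lambda a: reward(state, a) + phi(post_state(state, a)) @ theta)
```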

    Smart building real time pricing for offering load-side regulation service reserves

    Provision of Regulation Service (RS) reserves to power markets by smart building demand response has attracted attention in the recent literature. This paper develops tractable dynamic optimal pricing algorithms for distributed RS reserve provision. It shows monotonicity and convexity properties of the optimal pricing policies and the associated differential cost function, and then uses them to propose and implement a modified Least Squares Temporal Differences (LSTD) Actor-Critic algorithm with a bounded and continuous action space. This algorithm solves for the best policy within a pre-specified broad family. In addition, the paper develops a novel Approximate Policy Iteration (API) algorithm and uses it successfully to optimize the parameters of an analytic policy function. Numerical results demonstrate and compare the Actor-Critic and Approximate Policy Iteration algorithms, showing that the novel API algorithm outperforms the Bounded LSTD Actor-Critic algorithm in both computational effort and minimum policy cost.
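
    For reference, a generic LSTD critic with a projected (bounded) continuous price action can be sketched as follows; this is the textbook discounted-cost variant under assumed linear features, not the paper's modified Actor-Critic or its average-cost setting.

```python
import numpy as np

class LSTDCritic:
    """Generic least-squares temporal-difference critic (discounted variant)."""

    def __init__(self, n_features, gamma, ridge=1e-6):
        self.A = np.zeros((n_features, n_features))
        self.b = np.zeros(n_features)
        self.gamma = gamma
        self.ridge = ridge

    def observe(self, phi_s, cost, phi_s_next):
        # Accumulate A = sum phi (phi - gamma*phi')^T and b = sum phi*cost.
        self.A += np.outer(phi_s, phi_s - self.gamma * phi_s_next)
        self.b += cost * phi_s

    def weights(self):
        # Ridge term keeps the solve well-posed before enough data arrives.
        return np.linalg.solve(self.A + self.ridge * np.eye(len(self.b)), self.b)

def bounded_price(raw_price, lo, hi):
    # Project the actor's continuous action (a price signal) onto its bounds.
    return float(np.clip(raw_price, lo, hi))
```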

    Approximate modified policy iteration and its application to the game of Tetris

    Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximate form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of the well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide error propagation analyses that unify those for approximate policy and value iteration. We develop the finite-sample analysis of these algorithms, which highlights the influence of their parameters. In the classification-based version of the algorithm (CBMPI), the analysis shows that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation. We illustrate and evaluate the behavior of these new algorithms in the Mountain Car and Tetris problems. Remarkably, in Tetris, CBMPI outperforms the existing DP approaches by a large margin and competes with the current state-of-the-art methods while using fewer samples.
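
    The heart of the classification-based implementation (CBMPI) is an m-step truncated rollout estimate of action values: follow the current policy for m steps after the first action and bootstrap the tail with the current value approximation; a classifier is then trained to imitate the greedy action over sampled states, and a regressor fits the value to the rollout returns. A minimal sketch of the rollout estimator (the simulator interface env.step(state, action) -> (next_state, reward, done) is an assumption):

```python
def rollout_q_estimate(env, v, policy, state, action, gamma, m):
    """m-step truncated rollout estimate of Q(state, action):
    r_0 + gamma*r_1 + ... + gamma^m*r_m + gamma^(m+1) * v(s_{m+1}).
    Illustrative sketch; env, policy, and v interfaces are assumptions."""
    s, r, done = env.step(state, action)
    total, discount = r, 1.0
    for _ in range(m):
        if done:
            return total
        discount *= gamma
        s, r, done = env.step(s, policy(s))
        total += discount * r
    return total if done else total + discount * gamma * v(s)
```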