Approximate Modified Policy Iteration
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that
contains the two celebrated policy and value iteration methods. Despite its
generality, MPI has not been thoroughly studied, especially its approximation
form which is used when the state and/or action spaces are large or infinite.
In this paper, we propose three implementations of approximate MPI (AMPI) that
are extensions of well-known approximate DP algorithms: fitted-value iteration,
fitted-Q iteration, and classification-based policy iteration. We provide error
propagation analyses that unify those for approximate policy and value
iteration. For the last, classification-based implementation, we develop a
finite-sample analysis showing that MPI's main parameter controls
the balance between the estimation error of the classifier and the overall
value function approximation
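To make the role of MPI's main parameter concrete, here is a minimal tabular sketch of exact (not approximate) modified policy iteration. It assumes the common convention in which each iteration takes a greedy policy and then applies that policy's Bellman operator m times, so that m = 1 behaves like value iteration and large m approaches policy iteration; some papers index the same family starting at m = 0. The function names and the two-state toy MDP are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def q_value(P, R, gamma, v, s, a):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + gamma * sum(p * v[t] for t, p in enumerate(P[s][a]))


def greedy(P, R, gamma, v):
    """Greedy policy with respect to v (ties go to the lowest action index)."""
    policy = []
    for s in range(len(R)):
        q = [q_value(P, R, gamma, v, s, a) for a in range(len(R[s]))]
        policy.append(max(range(len(q)), key=q.__getitem__))
    return policy


def modified_policy_iteration(P, R, gamma, m, n_iters):
    """Tabular MPI: greedy step, then m applications of the policy's Bellman operator."""
    v = [0.0] * len(R)
    policy = greedy(P, R, gamma, v)
    for _ in range(n_iters):
        policy = greedy(P, R, gamma, v)
        for _ in range(m):  # partial policy evaluation
            v = [q_value(P, R, gamma, v, s, a) for s, a in enumerate(policy)]
    return policy, v


# Hypothetical toy MDP: P[s][a] lists next-state probabilities, R[s][a] is the reward.
# Action 0 always leads to state 0, action 1 always leads to state 1;
# only staying in state 1 (action 1 there) is rewarded.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
R = [
    [0.0, 0.0],
    [0.0, 1.0],
]

for m in (1, 5, 50):
    policy, v = modified_policy_iteration(P, R, gamma=0.9, m=m, n_iters=500)
    print(m, policy, [round(x, 4) for x in v])
```

In this toy problem every choice of m ends at the same policy and value function; m only changes how much evaluation work is done per greedy step. The approximate variants described in the abstract replace the exact evaluation and greedy steps with fitted or classification-based estimates, which this sketch does not attempt.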
Tight Performance Bounds for Approximate Modified Policy Iteration with Non-Stationary Policies
We consider approximate dynamic programming for the infinite-horizon
stationary γ-discounted optimal control problem formalized by Markov
Decision Processes. While in the exact case it is known that there always
exists an optimal policy that is stationary, we show that when using value
function approximation, looking for a non-stationary policy may lead to a
better performance guarantee. We define a non-stationary variant of MPI that
unifies a broad family of approximate DP algorithms of the literature. For this
algorithm we provide an error propagation analysis in the form of a performance
bound of the resulting policies that can improve the usual performance bound by
a factor O(1−γ), which is significant when the discount factor γ
is close to 1. Doing so, our approach unifies recent results for Value and
Policy Iteration. Furthermore, we show, by constructing a specific
deterministic MDP, that our performance guarantee is tight
Non-stationary approximate modified policy iteration
We consider the infinite-horizon γ-discounted optimal control problem formalized by Markov Decision Processes. Running any instance of Modified Policy Iteration (a family of algorithms that can interpolate between Value and Policy Iteration) with an error at each iteration is known to lead to stationary policies that are at least 2γ/(1−γ)^2-optimal. Variations of Value and Policy Iteration that build l-periodic non-stationary policies have recently been shown to display a better 2γ/((1−γ)(1−γ^l))-optimality guarantee. We describe a new algorithmic scheme, Non-Stationary Modified Policy Iteration, a family of algorithms parameterized by two integers m ≥ 0 and l ≥ 1 that generalizes all the above-mentioned algorithms. While m allows one to interpolate between Value-Iteration-style and Policy-Iteration-style updates, l specifies the period of the non-stationary policy that is output. We show that this new family of algorithms also enjoys the improved 2γ/((1−γ)(1−γ^l))-optimality guarantee. Perhaps more importantly, we show, by exhibiting an original problem instance, that this guarantee is tight for all m and l; this tightness was to our knowledge only known in two specific cases, Value Iteration (m = 0, l = 1) and Policy Iteration (m = ∞, l = 1)
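As a rough numerical illustration of the two guarantees quoted in this abstract, the sketch below evaluates the stationary coefficient 2γ/(1−γ)^2 and the l-periodic coefficient 2γ/((1−γ)(1−γ^l)). The function names are hypothetical, and the chosen values of γ and l are arbitrary examples rather than figures from the paper.

```python
def stationary_bound(gamma):
    """Coefficient 2γ/(1−γ)^2 for stationary policies."""
    return 2 * gamma / (1 - gamma) ** 2


def nonstationary_bound(gamma, l):
    """Coefficient 2γ/((1−γ)(1−γ^l)) for l-periodic non-stationary policies."""
    return 2 * gamma / ((1 - gamma) * (1 - gamma ** l))


gamma = 0.99  # example discount factor close to 1
for l in (1, 10, 100, 1000):
    ratio = nonstationary_bound(gamma, l) / stationary_bound(gamma)
    print(l, round(nonstationary_bound(gamma, l), 1), round(ratio, 4))
```

The ratio of the two coefficients is (1−γ)/(1−γ^l): it equals 1 when l = 1, where the two expressions coincide, and approaches 1−γ as l grows, which is why the improvement matters most when γ is close to 1.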
Approximate Dynamic Programming Algorithms for United States Air Force Officer Sustainment
The United States Air Force (USAF) officer sustainment system involves making accession and promotion decisions for nearly 64 thousand officers annually. We formulate a discrete time stochastic Markov decision process model to examine this military workforce planning problem. The large size of the motivating problem suggests that conventional exact dynamic programming algorithms are inappropriate. As such, we propose two approximate dynamic programming (ADP) algorithms to solve the problem. We employ a least-squares approximate policy iteration (API) algorithm with instrumental variables Bellman error minimization to determine approximate policies. In this API algorithm, we use a modified version of the Bellman equation based on the post-decision state variable. Approximating the value function using a post-decision state variable allows us to find the best policy for a given approximation using a decomposable mixed integer nonlinear programming formulation. We also propose an approximate value iteration algorithm using concave adaptive value estimation (CAVE). The CAVE algorithm identifies an improved policy for a test problem based on the current USAF officer sustainment system. The CAVE algorithm obtains a statistically significant 2.8% improvement over the currently employed USAF policy, which serves as the benchmark
Smart building real time pricing for offering load-side regulation service reserves
Provision of Regulation Service (RS) reserves to Power Markets by smart building demand response has attracted attention in recent literature. This paper develops tractable dynamic optimal pricing algorithms for distributed RS reserve provision. It shows monotonicity and convexity properties of the optimal pricing policies and the associated differential cost function. Then, it uses them to propose and implement a modified Least Squares Temporal Differences (LSTD) Actor-Critic algorithm with a bounded and continuous action space. This algorithm solves for the best policy within a pre-specified broad family. In addition, the paper develops a novel Approximate Policy Iteration (API) algorithm and uses it successfully to optimize the parameters of an analytic policy function. Numerical results compare the Actor-Critic and Approximate Policy Iteration algorithms and show that the novel API algorithm outperforms the Bounded LSTD Actor-Critic algorithm in both computational effort and policy minimum cost
Approximate modified policy iteration and its application to the game of Tetris
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of the well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide error propagation analyses that unify those for approximate policy and value iteration. We develop the finite-sample analysis of these algorithms, which highlights the influence of their parameters. In the classification-based version of the algorithm (CBMPI), the analysis shows that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation. We illustrate and evaluate the behavior of these new algorithms in the Mountain Car and Tetris problems. Remarkably, in Tetris, CBMPI outperforms the existing DP approaches by a large margin, and competes with the current state-of-the-art methods while using fewer samples