Approximate Modified Policy Iteration
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that
contains the two celebrated policy and value iteration methods. Despite its
generality, MPI has not been thoroughly studied, especially its approximation
form which is used when the state and/or action spaces are large or infinite.
In this paper, we propose three implementations of approximate MPI (AMPI) that
are extensions of well-known approximate DP algorithms: fitted-value iteration,
fitted-Q iteration, and classification-based policy iteration. We provide error
propagation analyses that unify those for approximate policy and value
iteration. On the last classification-based implementation, we develop a
finite-sample analysis showing that MPI's main parameter controls the
balance between the estimation error of the classifier and the overall
value function approximation.
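The interpolation that MPI performs is easy to see in the exact, tabular case. Below is a minimal sketch, assuming a finite MDP given as arrays P and R (names illustrative, not from the paper): the greedy step picks the policy implied by the current value function, and the partial evaluation step applies that policy's Bellman operator m times, so m = 1 recovers value iteration and m → ∞ recovers policy iteration.

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, m, n_iters=100):
    """Tabular MPI sketch: P has shape (S, A, S), R has shape (S, A)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        # Greedy step: pi_{k+1} is greedy with respect to V_k.
        Q = R + gamma * (P @ V)                # shape (S, A)
        pi = Q.argmax(axis=1)
        # Partial evaluation: apply the Bellman operator T_pi exactly m times.
        # m = 1 gives value iteration; large m approaches policy iteration.
        for _ in range(m):
            V = R[np.arange(S), pi] + gamma * (P[np.arange(S), pi] @ V)
    return V, pi
```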
Tight Performance Bounds for Approximate Modified Policy Iteration with Non-Stationary Policies
We consider approximate dynamic programming for the infinite-horizon
stationary $\gamma$-discounted optimal control problem formalized by Markov
Decision Processes. While in the exact case it is known that there always
exists an optimal policy that is stationary, we show that when using value
function approximation, looking for a non-stationary policy may lead to a
better performance guarantee. We define a non-stationary variant of MPI that
unifies a broad family of approximate DP algorithms of the literature. For this
algorithm we provide an error propagation analysis in the form of a performance
bound of the resulting policies that can improve the usual performance bound by
a factor $O(1-\gamma)$, which is significant when the discount factor
$\gamma$ is close to 1. In doing so, our approach unifies recent results for Value and
Policy Iteration. Furthermore, we show, by constructing a specific
deterministic MDP, that our performance guarantee is tight.
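To make the shape of these guarantees concrete, the display below contrasts the usual stationary bound with a non-stationary one of period $\ell$. It is a sketch in the spirit of this line of work (sup-norm setting, per-iteration errors bounded by $\epsilon$); the exact constants may differ from the paper's statement.

```latex
\[
  \limsup_{k\to\infty} \|v_* - v_{\pi_k}\|_\infty
    \;\le\; \frac{2\gamma}{(1-\gamma)^2}\,\epsilon
  \qquad\text{vs.}\qquad
  \limsup_{k\to\infty} \|v_* - v_{\pi_{k,\ell}}\|_\infty
    \;\le\; \frac{2\gamma}{(1-\gamma)(1-\gamma^{\ell})}\,\epsilon .
\]
```

As the period $\ell$ grows, the right-hand bound tends to $\frac{2\gamma}{1-\gamma}\epsilon$, an improvement over the stationary bound by a factor on the order of $1-\gamma$, which is exactly why the gain matters when $\gamma$ is close to 1.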
Approximate modified policy iteration and its application to the game of Tetris
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of the well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide error propagation analyses that unify those for approximate policy and value iteration. We develop the finite-sample analysis of these algorithms, which highlights the influence of their parameters. In the classification-based version of the algorithm (CBMPI), the analysis shows that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation. We illustrate and evaluate the behavior of these new algorithms in the Mountain Car and Tetris problems. Remarkably, in Tetris, CBMPI outperforms the existing DP approaches by a large margin, and competes with the current state-of-the-art methods while using fewer samples.
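The role of MPI's parameter m in a CBMPI-style method can be seen in the rollout target used for both the classifier and the regressor. The sketch below assumes a generative simulator env.step(s, a) returning a next state and reward (a hypothetical interface, not the paper's): larger m makes the target rely more on sampled rewards (more estimation error for the classifier) and less on the value regressor v_hat (less value approximation error).

```python
def rollout_target(env, s, a, policy, v_hat, gamma, m):
    """CBMPI-style m-step rollout estimate of Q(s, a): one step with
    action a, m steps following the current policy, then bootstrap
    with the value regressor v_hat."""
    s, r = env.step(s, a)          # assumed simulator interface
    total, discount = r, 1.0
    for _ in range(m):
        discount *= gamma
        s, r = env.step(s, policy(s))
        total += discount * r
    return total + discount * gamma * v_hat(s)
```

The greedy step then trains a classifier to imitate the argmax action over these targets, and the evaluation step regresses the value function on the same rollouts.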
Sampling-based Approximations with Quantitative Performance for the Probabilistic Reach-Avoid Problem over General Markov Processes
This article deals with stochastic processes endowed with the Markov
(memoryless) property and evolving over general (uncountable) state spaces. The
models further depend on a non-deterministic quantity in the form of a control
input, which can be selected to affect the probabilistic dynamics. We address
the computation of maximal reach-avoid specifications, together with the
synthesis of the corresponding optimal controllers. The reach-avoid
specification deals with assessing the likelihood that any finite-horizon
trajectory of the model enters a given goal set, while avoiding a given set of
undesired states. This article newly provides an approximate computational
scheme for the reach-avoid specification based on the Fitted Value Iteration
algorithm, which hinges on random sample extractions, and gives a priori
computable formal probabilistic bounds on the error made by the approximation
algorithm: as such, the output of the numerical scheme is quantitatively
assessed and thus meaningful for safety-critical applications. Furthermore, we
provide tighter probabilistic error bounds that are sample-based. The overall
computational scheme is put in relationship with alternative approximation
algorithms in the literature, and finally its performance is practically
assessed over a benchmark case study.
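A sampling-based scheme of this kind can be sketched compactly. The code below is an illustrative fitted-value-iteration loop for the finite-horizon reach-avoid probability, assuming hypothetical helpers simulate(x, u, n) (draws n next states), in_goal, and in_safe (vectorized set indicators); the paper's actual scheme and its error bounds are more refined.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def fitted_reach_avoid(sample_states, actions, simulate, in_goal, in_safe,
                       horizon, n_next=20):
    """Fitted VI for the reach-avoid probability. Assumed interfaces:
    simulate(x, u, n) -> array of n sampled next states;
    in_goal(X), in_safe(X) -> boolean arrays over a batch of states X."""
    V = lambda X: in_goal(X).astype(float)   # terminal value: goal indicator
    for _ in range(horizon):
        targets = []
        for x in sample_states:
            # Empirical estimate of max_u E[V_{k+1}(x')] from samples.
            best = max(np.mean(V(simulate(x, u, n_next))) for u in actions)
            g = float(in_goal([x])[0])
            safe_not_goal = float(in_safe([x])[0]) * (1.0 - g)
            targets.append(g + safe_not_goal * best)
        reg = KernelRidge(kernel="rbf").fit(np.asarray(sample_states), targets)
        # Clip regression output to [0, 1]: V is a probability.
        V = lambda X, reg=reg: np.clip(reg.predict(np.atleast_2d(X)), 0.0, 1.0)
    return V
```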
A Theory of Regularized Markov Decision Processes
Many recent successful (deep) reinforcement learning algorithms make use of
regularization, generally based on entropy or Kullback-Leibler divergence. We
propose a general theory of regularized Markov Decision Processes that
generalizes these approaches in two directions: we consider a larger class of
regularizers, and we consider the general modified policy iteration approach,
encompassing both policy iteration and value iteration. The core building
blocks of this theory are a notion of regularized Bellman operator and the
Legendre-Fenchel transform, a classical tool of convex optimization. This
approach allows for error propagation analyses of general algorithmic schemes
of which (possibly variants of) classical algorithms such as Trust Region
Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy
Programming are special cases. This also draws connections to proximal convex
optimization, especially to Mirror Descent.
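For the most common special case, the negative-entropy regularizer, these building blocks have closed forms: the Legendre-Fenchel conjugate of $\Omega(\pi) = \tau \sum_a \pi(a)\log\pi(a)$ is the scaled log-sum-exp, and the regularized-greedy policy is a softmax. A minimal tabular sketch (names illustrative):

```python
import numpy as np
from scipy.special import logsumexp

def soft_bellman_step(Q, P, R, gamma, tau):
    """One application of the entropy-regularized Bellman operator.
    Q: (S, A) action values; P: (S, A, S) transitions; R: (S, A) rewards."""
    # Conjugate Omega*(q) = tau * log sum_a exp(q(a) / tau): a "soft max".
    V = tau * logsumexp(Q / tau, axis=1)            # shape (S,)
    # Regularized-greedy policy = gradient of the conjugate = softmax.
    pi = np.exp(Q / tau - (V / tau)[:, None])       # shape (S, A)
    # Regularized Bellman backup.
    Q_next = R + gamma * (P @ V)                    # shape (S, A)
    return Q_next, pi
```

As $\tau \to 0$ the soft maximum tends to the maximum and the softmax policy to the greedy one, recovering the unregularized operators.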
Batch Policy Learning under Constraints
When learning policies for real-world domains, two important questions arise:
(i) how to efficiently use pre-collected off-policy, non-optimal behavior data;
and (ii) how to mediate among different competing objectives and constraints.
We thus study the problem of batch policy learning under multiple constraints,
and offer a systematic solution. We first propose a flexible meta-algorithm
that admits any batch reinforcement learning and online learning procedure as
subroutines. We then present a specific algorithmic instantiation and provide
performance guarantees for the main objective and all constraints. To certify
constraint satisfaction, we propose a new and simple method for off-policy
policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves
strong empirical results in different domains, including in a challenging
problem of simulated car driving subject to multiple constraints such as lane
keeping and smooth driving. We also show experimentally that our OPE method
outperforms other popular OPE techniques on a standalone basis, especially in a
high-dimensional setting.
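The game-theoretic structure of such a meta-algorithm fits in a few lines. The sketch below is an illustrative Lagrangian loop, not the paper's exact procedure: best_response stands for any batch RL subroutine trained on the current Lagrangian, eval_constraints for any off-policy estimate of the constraint values (all names hypothetical), and the multipliers are updated by projected online gradient ascent.

```python
import numpy as np

def constrained_batch_learning(best_response, eval_constraints, thresholds,
                               n_rounds=50, lr=0.1, lam_max=10.0):
    """Lagrangian meta-algorithm sketch: the policy player best-responds
    to the current multipliers; the lambda-player ascends on estimated
    constraint violations. The output mixes the iterates uniformly."""
    lam = np.zeros(len(thresholds))
    policies = []
    for _ in range(n_rounds):
        pi = best_response(lam)                    # batch RL subroutine
        g = np.asarray(eval_constraints(pi))       # off-policy estimates
        # Projected online gradient ascent on the multipliers.
        lam = np.clip(lam + lr * (g - thresholds), 0.0, lam_max)
        policies.append(pi)
    return policies                                # uniform mixture over iterates
```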