Asymptotic Optimality of Finite Approximations to Markov Decision Processes with Borel Spaces
Calculating optimal policies is known to be computationally difficult for
Markov decision processes (MDPs) with Borel state and action spaces. This paper
studies finite-state approximations of discrete time Markov decision processes
with Borel state and action spaces, for both discounted and average cost
criteria. The stationary policies thus obtained are shown to approximate the
optimal stationary policy with arbitrary precision under quite general
conditions for discounted cost and more restrictive conditions for average
cost. For compact-state MDPs, we obtain explicit rate of convergence bounds
quantifying how the approximation improves as the size of the approximating
finite state space increases. Using information theoretic arguments, the order
optimality of the obtained convergence rates is established for a large class
of problems. We also show that, as a pre-processing step, the action space can
also be finitely approximated using a sufficiently large number of points; thereby,
well-known algorithms, such as value or policy iteration and Q-learning, can
be used to calculate near-optimal policies.
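As a rough illustration (our own toy instance, not the paper's construction; the dynamics, cost, and bin counts below are hypothetical), the following sketch quantizes a one-dimensional state space into bins, estimates a finite transition model by simulation, and runs standard value iteration on the resulting finite MDP:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n_bins, n_mc = 0.9, 50, 200
edges = np.linspace(0.0, 1.0, n_bins + 1)
centers = 0.5 * (edges[:-1] + edges[1:])       # quantized state space
actions = np.linspace(-0.2, 0.2, 5)            # finite action grid (pre-processing step)

def sim(x, a, n):                              # hypothetical controlled dynamics
    return np.clip(0.9 * x + a + 0.05 * rng.standard_normal(n), 0.0, 1.0)

# empirical transition matrix of the quantized model
P = np.zeros((n_bins, actions.size, n_bins))
for i, x in enumerate(centers):
    for j, a in enumerate(actions):
        idx = np.digitize(sim(x, a, n_mc), edges[1:-1])
        P[i, j] = np.bincount(idx, minlength=n_bins) / n_mc

C = centers[:, None] ** 2 + actions[None, :] ** 2   # illustrative stage cost
V = np.zeros(n_bins)
for _ in range(500):                           # value iteration on the finite model
    V = (C + gamma * P @ V).min(axis=1)
```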
A Convex Optimization Approach to Dynamic Programming in Continuous State and Action Spaces
In this paper, a convex optimization-based method is proposed for numerically
solving dynamic programs in continuous state and action spaces. The key idea is
to approximate the output of the Bellman operator at a particular state by the
optimal value of a convex program. The approximate Bellman operator has a
computational advantage because it involves a convex optimization problem in
the case of control-affine systems and convex costs. Using this feature, we
propose a simple dynamic programming algorithm to evaluate the approximate
value function at pre-specified grid points by solving convex optimization
problems in each iteration. We show that the proposed method approximates the
optimal value function with a uniform convergence property in the case of
convex optimal value functions. We also propose an interpolation-free design
method for a control policy whose performance converges uniformly to the
optimum as the grid resolution becomes finer. When a nonlinear control-affine
system is considered, the convex optimization approach provides an approximate
policy with a provable suboptimality bound. For general cases, the proposed
convex formulation of dynamic programming operators can be modified as a
nonconvex bi-level program, in which the inner problem is a linear program,
without losing the uniform convergence properties.
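As a minimal sketch of the idea (our own toy instance, not the paper's algorithm), one Bellman backup at a grid point reduces to a convex program when the dynamics are control-affine, the stage cost is convex, and the value estimate is kept convex, e.g. as a maximum of affine functions:

```python
import cvxpy as cp
import numpy as np

# hypothetical 1-D control-affine system x' = f(x) + g(x) u with convex
# stage cost c(x, u) = x^2 + u^2; all parameters here are illustrative.
gamma = 0.95
f = lambda x: 0.9 * x
g = lambda x: 1.0

# convex value estimate stored as a maximum of affine pieces:
# v(x) = max_i (a[i] * x + b[i])
a = np.array([-1.0, 0.0, 1.0])
b = np.array([0.2, 0.0, 0.2])

def approx_bellman(x):
    """Evaluate (T v)(x) = min_u c(x, u) + gamma * v(x') as a convex program."""
    u = cp.Variable()
    t = cp.Variable()                          # epigraph variable for v(x')
    x_next = f(x) + g(x) * u                   # affine in u
    cons = [t >= a[i] * x_next + b[i] for i in range(len(a))]
    prob = cp.Problem(cp.Minimize(x**2 + cp.square(u) + gamma * t), cons)
    prob.solve()
    return prob.value, u.value

value, control = approx_bellman(0.5)
```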
Scalable Bilinear Learning Using State and Action Features
Approximate linear programming (ALP) represents one of the major algorithmic
families to solve large-scale Markov decision processes (MDPs). In this work, we
study a primal-dual formulation of the ALP, and develop a scalable, model-free
algorithm called bilinear learning for reinforcement learning when a
sampling oracle is provided. This algorithm enjoys a number of advantages.
First, it adopts (bi)linear models to represent the high-dimensional value
function and state-action distributions, using given state and action features.
Its run-time complexity depends on the number of features, not the size of the
underlying MDPs. Second, it operates in a fully online fashion without having
to store any samples, and thus has a minimal memory footprint. Third, we prove that
it is sample-efficient, solving for the optimal policy to high precision with a
sample complexity linear in the dimension of the parameter space.
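The following schematic conveys the flavor of such a primal-dual update (our own simplification, with made-up features, step sizes, and sign conventions, omitting the projections and the exact saddle-point form used in the paper): linear value weights are updated by stochastic descent while simplex-constrained dual weights are updated multiplicatively, each step touching only one sample from the oracle.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 20, 4, 0.9
Phi = rng.random((S, 8))                   # state features (value side)
Psi = rng.random((S * A, 8))               # state-action features (dual side)
theta = np.zeros(8)                        # linear value weights
w = np.ones(8) / 8                         # dual weights, kept on the simplex

P = rng.dirichlet(np.ones(S), size=(S, A))  # used only as a sampling oracle
R = rng.random((S, A))

for t in range(1, 10_000):
    s, a = rng.integers(S), rng.integers(A)
    s2 = rng.choice(S, p=P[s, a])          # one oracle call: next-state sample
    td = R[s, a] + gamma * Phi[s2] @ theta - Phi[s] @ theta
    eta = 0.5 / np.sqrt(t)
    # dual: exponentiated-gradient step toward pairs with positive TD error
    w *= np.exp(np.clip(eta * td * Psi[s * A + a], -30, 30))
    w /= w.sum()
    # primal: stochastic descent on the value weights
    theta -= eta * (Psi[s * A + a] @ w) * (gamma * Phi[s2] - Phi[s])
```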
Coupling and a generalised Policy Iteration Algorithm in continuous time
We analyse a version of the policy iteration algorithm for the discounted
infinite-horizon problem for controlled multidimensional diffusion processes,
where both the drift and the diffusion coefficient can be controlled. We prove
that, under assumptions on the problem data, the payoffs generated by the
algorithm converge monotonically to the value function and an accumulation
point of the sequence of policies is an optimal policy. The algorithm is stated
and analysed in continuous time and state, with discretisation featuring
in neither the theorems nor the proofs. A key technical tool used to show that the
algorithm is well-defined is the mirror coupling of Lindvall and Rogers.
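In standard notation (ours, not verbatim from the paper), one step of the scheme reads: given a Markov policy $\pi_n$, compute its payoff $v_n$ and improve greedily,
\[
\mathcal{L}^{\pi_n} v_n - \rho\, v_n + f(\cdot, \pi_n(\cdot)) = 0,
\qquad
\pi_{n+1}(x) \in \arg\max_{a}\Big[ \mathcal{L}^{a} v_n(x) + f(x,a) \Big],
\]
where $\mathcal{L}^{a} = b(x,a)\cdot\nabla + \tfrac12 \operatorname{tr}\!\big(\sigma\sigma^{\top}(x,a)\,\nabla^2\big)$ is the controlled generator and $\rho > 0$ the discount rate; the point of the paper is to make this loop rigorous directly in continuous time and state.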
A Fenchel-Moreau-Rockafellar type theorem on the Kantorovich-Wasserstein space with Applications in Partially Observable Markov Decision Processes
By using the fact that the space of all probability measures with finite
support can be completed in two different fashions, one generating the
Arens-Eells space and another generating the Kantorovich-Wasserstein
(Wasserstein-1) space, and by exploiting the duality relationship between the
Arens-Eells space with the space of Lipschitz functions, we provide a dual
representation of Fenchel-Moreau-Rockafellar type for proper convex functionals
on Wasserstein-1. We retrieve dual transportation inequalities as a corollary,
and we provide examples where the theorem can be used to easily prove dual
expressions such as the celebrated Donsker-Varadhan variational formula. Finally,
our result allows one to write convex functions as the supremum over all linear
functions that are generated by roots of its conjugate dual, something that we
apply to the field of partially observable Markov decision processes (POMDPs)
to approximate the value function of a given POMDP by iterating level sets.
This extends the method used in Smallwood (1973) for finite state spaces to the
case where the state space is a Polish metric space.
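In rough form (our notation), the duality reads: for a proper convex, suitably lower semicontinuous functional $\phi$ on the Wasserstein-1 space,
\[
\phi(\mu) \;=\; \sup_{f \in \mathrm{Lip}} \left( \int f \, d\mu \;-\; \phi^*(f) \right),
\qquad
\phi^*(f) \;=\; \sup_{\nu} \left( \int f \, d\nu \;-\; \phi(\nu) \right).
\]
Taking $\phi(\nu) = D(\nu \,\|\, \mu_0)$ (relative entropy), this machinery recovers the Donsker-Varadhan formula $\log \int e^{f}\, d\mu_0 = \sup_{\nu} \big( \int f \, d\nu - D(\nu \,\|\, \mu_0) \big)$ mentioned in the abstract.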
Q-learning with Nearest Neighbors
We consider model-free reinforcement learning for infinite-horizon discounted
Markov Decision Processes (MDPs) with a continuous state space and unknown
transition kernel, when only a single sample path under an arbitrary policy of
the system is available. We consider the Nearest Neighbor Q-Learning (NNQL)
algorithm to learn the optimal Q function using a nearest neighbor regression
method. As our main contribution, we provide a tight finite-sample analysis of
the convergence rate. In particular, for MDPs with a $d$-dimensional state
space and discount factor $\gamma \in (0,1)$, given an arbitrary sample
path with "covering time" $L$, we establish that the algorithm is guaranteed
to output an $\varepsilon$-accurate estimate of the optimal Q-function using
$\tilde{O}\big(L/(\varepsilon^3 (1-\gamma)^7)\big)$ samples. For instance, for a
well-behaved MDP, the covering time of the sample path under the purely random
policy scales as $\tilde{O}(1/\varepsilon^d)$, so the sample
complexity scales as $\tilde{O}(1/\varepsilon^{d+3})$. Indeed, we
establish a lower bound showing that a dependence of $\tilde{\Omega}(1/\varepsilon^{d+2})$ is necessary.
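A highly simplified single-path variant looks as follows (our toy version: the paper's algorithm aggregates over neighborhoods and epochs, and the landmark states, dynamics, and step sizes below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
landmarks = np.linspace(0.0, 1.0, 20)          # finite set of anchor states
n_actions = 2
Q = np.zeros((landmarks.size, n_actions))
counts = np.zeros_like(Q)

def nearest(x):
    return int(np.argmin(np.abs(landmarks - x)))

def step(x, a):                                # hypothetical dynamics and reward
    x_next = np.clip(x + (0.1 if a == 1 else -0.1)
                     + 0.05 * rng.standard_normal(), 0.0, 1.0)
    return x_next, -abs(x_next - 0.5)

x = 0.2
for t in range(20_000):
    a = int(rng.integers(n_actions))           # arbitrary (random) behavior policy
    x_next, r = step(x, a)
    i = nearest(x)
    counts[i, a] += 1
    alpha = 1.0 / counts[i, a]                 # decaying step size
    target = r + gamma * Q[nearest(x_next)].max()
    Q[i, a] += alpha * (target - Q[i, a])      # TD update on the nearest landmark
    x = x_next
```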
Empirical Dynamic Programming
We propose empirical dynamic programming algorithms for Markov decision
processes (MDPs). In these algorithms, the exact expectation in the Bellman
operator in classical value iteration is replaced by an empirical estimate to
get `empirical value iteration' (EVI). Policy evaluation and policy improvement
in classical policy iteration are also replaced by simulation to get `empirical
policy iteration' (EPI). Thus, these empirical dynamic programming algorithms
involve iteration of a random operator, the empirical Bellman operator. We
introduce notions of probabilistic fixed points for such random monotone
operators. We develop a stochastic dominance framework for convergence analysis
of such operators. We then use this to give sample complexity bounds for both
EVI and EPI. We then present several variations and extensions, including
asynchronous empirical dynamic programming and a minimax empirical dynamic
program, and show how the framework can also be used to solve the dynamic
newsvendor problem. Preliminary experimental results suggest a faster rate of
convergence than stochastic approximation algorithms.
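To make the EVI idea concrete, here is a minimal sketch on a small synthetic MDP (our own example; the kernel below is visible only to the simulator, and the sizes and sample counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, n = 10, 3, 0.9, 64
P = rng.dirichlet(np.ones(S), size=(S, A))     # hypothetical kernel (simulator only)
R = rng.random((S, A))

def simulate(s, a, n):
    return rng.choice(S, size=n, p=P[s, a])

V = np.zeros(S)
for _ in range(200):
    V_new = np.empty(S)
    for s in range(S):
        # exact expectation replaced by a sample average: a random operator
        q = [R[s, a] + gamma * V[simulate(s, a, n)].mean() for a in range(A)]
        V_new[s] = max(q)
    V = V_new
```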
An Empirical Dynamic Programming Algorithm for Continuous MDPs
We propose universal randomized function approximation-based empirical value
iteration (EVI) algorithms for Markov decision processes. The `empirical'
nature comes from performing each iteration empirically, using samples of the
next state obtained from simulation. This makes the Bellman operator a random
operator. Parametric and non-parametric methods for function approximation,
using a parametric function space and a reproducing kernel Hilbert space
(RKHS) respectively, are then combined with EVI. Both function spaces have the
universal function approximation property. Basis functions are picked randomly.
Convergence analysis is done using a random operator framework with techniques
from the theory of stochastic dominance. Finite time sample complexity bounds
are derived for both universal approximate dynamic programming algorithms.
Numerical experiments support the versatility and effectiveness of this
approach.
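As a rough sketch of the parametric variant (random Fourier features standing in for the randomly drawn basis; the simulator, cost, and constants are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, n_feat, n_samp = 0.9, 50, 32
omega = rng.standard_normal(n_feat)            # random frequencies
phase = rng.uniform(0, 2 * np.pi, n_feat)
feats = lambda x: np.cos(np.outer(x, omega) + phase)  # random Fourier features

X = np.linspace(-1, 1, 100)                    # states at which the fit is done
theta = np.zeros(n_feat)

def sim_next(x, a, n):                         # hypothetical simulator
    return np.clip(0.8 * x + 0.2 * a + 0.1 * rng.standard_normal(n), -1, 1)

for _ in range(50):
    targets = np.empty_like(X)
    for i, x in enumerate(X):
        # empirical Bellman backup with an illustrative reward and action grid
        q = [(-x**2 - 0.1 * a**2)
             + gamma * (feats(sim_next(x, a, n_samp)) @ theta).mean()
             for a in (-1.0, 0.0, 1.0)]
        targets[i] = max(q)
    theta, *_ = np.linalg.lstsq(feats(X), targets, rcond=None)  # fit in random basis
```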
Solving Factored MDPs with Hybrid State and Action Variables
Efficient representations and solutions for large decision problems with
continuous and discrete variables are among the most important challenges faced
by the designers of automated decision support systems. In this paper, we
describe a novel hybrid factored Markov decision process (MDP) model that
allows for a compact representation of these problems, and a new hybrid
approximate linear programming (HALP) framework that permits their efficient
solutions. The central idea of HALP is to approximate the optimal value
function by a linear combination of basis functions and optimize its weights by
linear programming. We analyze both theoretical and computational aspects of
this approach, and demonstrate its scale-up potential on several hybrid
optimization problems.
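In rough form (our notation, suppressing the hybrid discrete/continuous measure-theoretic details), the underlying approximate linear program is
\[
\min_{w} \; \mathbb{E}_{\psi}\!\left[\sum_i w_i\, \phi_i(x)\right]
\quad \text{s.t.} \quad
\sum_i w_i\, \phi_i(x) \;\ge\; r(x,a) + \gamma\, \mathbb{E}\!\left[\sum_i w_i\, \phi_i(x') \,\middle|\, x, a\right] \quad \forall\, (x,a),
\]
where $\psi$ is a state-relevance weighting; HALP's contribution is making this program tractable when $x$ mixes discrete and continuous variables.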
Robustness to incorrect system models in stochastic control
In stochastic control applications, typically only an ideal model (controlled
transition kernel) is assumed and the control design is based on the given
model, raising the problem of performance loss due to the mismatch between the
assumed model and the actual model. To address this, we study continuity
properties of discrete-time stochastic control problems with respect to system
models (i.e., controlled transition kernels) and robustness of optimal control
policies designed for incorrect models applied to the true system. We study
both fully observed and partially observed setups under an infinite horizon
discounted expected cost criterion. We show that continuity and robustness
cannot be established under weak and setwise convergences of transition kernels
in general, but that the expected induced cost is robust under total variation.
By imposing further assumptions on the measurement models and on the kernel
itself (such as continuous convergence), we show that the optimal cost can be
made continuous under weak convergence of transition kernels as well. Using
these continuity properties, we establish convergence results and error bounds
due to the mismatch that occurs when a control policy designed for an
incorrectly estimated system model is applied to the true model, thus
establishing positive and negative results on robustness. Compared to the
existing literature, we obtain strictly refined robustness results that are
applicable even when the incorrect models can only be investigated under weak
convergence and setwise convergence criteria (with respect to a true model), in
addition to the total variation criterion. These results have positive
implications for empirical learning in (data-driven) stochastic control, since
system models are often learned from empirical training data, for which the
weak convergence criterion typically applies but stronger convergence criteria
do not.
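In rough notation (ours), with $J(P, \gamma)$ the infinite-horizon discounted expected cost of policy $\gamma$ under controlled transition kernel $P$, the mismatch loss studied is
\[
J\big(P, \gamma^*_{Q}\big) \;-\; J\big(P, \gamma^*_{P}\big) \;\ge\; 0,
\]
where $\gamma^*_{Q}$ is a policy optimal for the (incorrect) model $Q$ but applied to the true kernel $P$; the results characterize when this loss vanishes as $Q \to P$ in total variation, weakly, or setwise.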