Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model
In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP) provided we can only access its transition function through a generative sampling model that, given any state-action pair, samples from the transition function in $O(1)$ time. Given such a DMDP with states $\mathcal{S}$, actions $\mathcal{A}$, discount factor $\gamma \in (0,1)$, and rewards in range $[0,1]$, we provide an algorithm which computes an $\epsilon$-optimal policy with probability $1-\delta$ where \emph{both} the time spent and number of samples taken are upper bounded by $O\!\left[\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\epsilon^{2}} \log\!\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\delta\epsilon}\right) \log\!\left(\frac{1}{(1-\gamma)\epsilon}\right)\right]$. For fixed values of $\epsilon$, this improves upon the previous best known bounds by a factor of $1/(1-\gamma)$ and matches the sample complexity lower bounds proved in Azar et al. (2013) up to logarithmic factors. We also extend our method to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model and provide a nearly matching sample complexity lower bound.
Comment: 31 pages. Accepted to NeurIPS 2018.
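For intuition about what the generative sampling model provides, here is a minimal sketch, assuming a sampler `sample_next_state(s, a)` and a known reward table `reward[s, a]` in $[0,1]$ (both hypothetical names). It is not the paper's variance-reduced algorithm, only the plain empirical-model baseline such methods improve upon: build an empirical transition matrix from a fixed number of samples per state-action pair and run Q-value iteration on it.

```python
# A minimal sketch (NOT the paper's method): empirical-model Q-value iteration
# under a generative sampling model. `sample_next_state` and `reward` are
# hypothetical stand-ins for the generative model's interface.
import numpy as np

def empirical_q_iteration(sample_next_state, reward, n_states, n_actions,
                          gamma=0.9, samples_per_pair=100, n_iters=500):
    # Build an empirical transition model from samples_per_pair draws per (s, a).
    p_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(samples_per_pair):
                p_hat[s, a, sample_next_state(s, a)] += 1.0
    p_hat /= samples_per_pair

    # Run standard Q-value iteration on the empirical MDP.
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        v = q.max(axis=1)                  # current greedy value estimate
        q = reward + gamma * (p_hat @ v)   # empirical Bellman backup
    return q.argmax(axis=1), q             # greedy policy and its Q estimate
```

The total number of samples drawn here is $|\mathcal{S}||\mathcal{A}| \cdot$ `samples_per_pair`; the paper's contribution is, roughly, how small that budget can be made (as a function of $\epsilon$ and $1-\gamma$) while still guaranteeing an $\epsilon$-optimal policy.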
Probabilistic Contraction Analysis of Iterated Random Operators
Consider a contraction operator $T$ over a complete metric space with fixed point $x^{\star}$. In many computational applications, it is difficult to compute $T$ exactly; therefore, one replaces the application of the contraction operator at iteration $k$ by a random operator $\hat{T}_{k}^{n}$ constructed using $n$ independent and identically distributed samples of a random variable. Consider the Markov chain $(X_{k}^{n})_{k \ge 0}$, which is generated by $X_{k+1}^{n} = \hat{T}_{k}^{n}(X_{k}^{n})$. In this paper, we identify some sufficient conditions under which (i) the distribution of $X_{k}^{n}$ converges to a Dirac mass over $x^{\star}$ as $k$ and $n$ go to infinity, and (ii) the probability that $X_{k}^{n}$ is far from $x^{\star}$ as $k$ goes to infinity can be made arbitrarily small by an appropriate choice of $n$. We also derive an upper bound on the probability that $X_{k}^{n}$ is far from $x^{\star}$ as $k \to \infty$. We apply the result to study the convergence in probability of iterates generated by empirical value iteration algorithms for discounted and average cost Markov decision problems.
Comment: 37 pages, submitted to SIAM Journal on Control and Optimization.
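As a concrete instance of this framework, the sketch below (illustrative only, reusing the same hypothetical `sample_next_state`/`reward` interface as above) implements empirical value iteration: the exact Bellman operator is replaced at every iteration by a random operator built from $n$ fresh i.i.d. samples per state-action pair, producing the Markov chain $X_{k+1}^{n} = \hat{T}_{k}^{n}(X_{k}^{n})$ studied in the abstract.

```python
# A minimal sketch of empirical value iteration as an iterated random operator.
# `sample_next_state(s, a)` and `reward[s, a]` are hypothetical stand-ins for a
# discounted MDP accessed only through sampling.
import numpy as np

def empirical_bellman(v, sample_next_state, reward, gamma, n):
    # One application of the random operator: each (s, a) uses n fresh i.i.d. samples.
    n_states, n_actions = reward.shape
    q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            draws = [sample_next_state(s, a) for _ in range(n)]
            q[s, a] = reward[s, a] + gamma * np.mean(v[draws])
    return q.max(axis=1)

def empirical_value_iteration(sample_next_state, reward, gamma=0.9, n=50, k_max=200):
    # The Markov chain X_{k+1} = T_hat_k^n(X_k), started from the zero value function.
    v = np.zeros(reward.shape[0])
    for _ in range(k_max):
        v = empirical_bellman(v, sample_next_state, reward, gamma, n)
    return v
```

Because each iterate depends on the previous one only through fresh samples, the sequence of iterates is exactly the kind of Markov chain whose probabilistic contraction behavior the paper analyzes.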
Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage
The goal of an offline reinforcement learning (RL) algorithm is to learn optimal policies using historical (offline) data, without access to the environment for online exploration. One of the main challenges in offline RL is distribution shift, which refers to the difference between the state-action visitation distribution of the data-generating policy and that of the learning policy.
Many recent works have used the idea of pessimism for developing offline RL
algorithms and characterizing their sample complexity under a relatively weak
assumption of single policy concentrability. Different from the offline RL
literature, the area of distributionally robust learning (DRL) offers a
principled framework that uses a minimax formulation to tackle model mismatch
between training and testing environments. In this work, we aim to bridge these
two areas by showing that the DRL approach can be used to tackle the
distributional shift problem in offline RL. In particular, we propose two
offline RL algorithms using the DRL framework, for the tabular and linear
function approximation settings, and characterize their sample complexity under
the single policy concentrability assumption. We also demonstrate the superior
performance of our proposed algorithms through simulation experiments.
Comment: 33 pages, preprint.
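To make the minimax idea concrete, here is a minimal sketch (illustrative only; the paper's uncertainty sets and algorithms may differ) of a pessimistic Bellman backup in the tabular setting: the nominal expectation under the empirical transition probabilities is replaced by its worst case over a total-variation ball of radius `rho`. The inputs `p_hat`, `reward`, and `rho` are assumed to be given.

```python
# A minimal sketch of a distributionally robust (pessimistic) Bellman backup
# over a total-variation uncertainty set; purely illustrative, not the paper's
# exact algorithm. `p_hat`, `reward`, `v`, and `rho` are assumed inputs.
import numpy as np

def worst_case_expectation(p_hat, v, rho):
    # min over { p : TV(p, p_hat) <= rho } of E_p[v], in closed form:
    # up to rho of probability mass is moved from the highest-value states
    # onto the lowest-value state.
    order = np.argsort(v)[::-1]            # states from highest to lowest value
    p = p_hat.astype(float).copy()
    budget = rho
    for s in order[:-1]:
        take = min(p[s], budget)
        p[s] -= take
        budget -= take
        if budget <= 0:
            break
    p[np.argmin(v)] += rho - budget        # place the removed mass on the worst state
    return float(p @ v)

def robust_bellman_backup(p_hat, reward, v, gamma, rho):
    # Q(s, a) = r(s, a) + gamma * min_{p in TV ball around p_hat(.|s, a)} E_p[v].
    n_states, n_actions = reward.shape
    q = np.empty((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            q[s, a] = reward[s, a] + gamma * worst_case_expectation(p_hat[s, a], v, rho)
    return q
```

Backing up against the worst model in a neighborhood of the empirical one is one way a robust formulation can encode pessimism when the offline data only partially covers the state-action space.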