10 research outputs found

    Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model

    In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP) provided we can only access its transition function through a generative sampling model that, given any state-action pair, samples from the transition function in $O(1)$ time. Given such a DMDP with states $S$, actions $A$, discount factor $\gamma \in (0,1)$, and rewards in range $[0, 1]$, we provide an algorithm which computes an $\epsilon$-optimal policy with probability $1 - \delta$ where \emph{both} the time spent and the number of samples taken are upper bounded by
    $$O\left[\frac{|S||A|}{(1-\gamma)^3 \epsilon^2} \log \left(\frac{|S||A|}{(1-\gamma)\delta \epsilon} \right) \log\left(\frac{1}{(1-\gamma)\epsilon}\right)\right].$$
    For fixed values of $\epsilon$, this improves upon the previous best known bounds by a factor of $(1 - \gamma)^{-1}$ and matches the sample complexity lower bounds proved in Azar et al. (2013) up to logarithmic factors. We also extend our method to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model and provide a nearly matching sample complexity lower bound.
    Comment: 31 pages. Accepted to NeurIPS, 2018
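
    To make the bound concrete, here is a minimal sketch (not from the paper) that simply evaluates the stated complexity expression with the hidden constant set to 1; the helper name `sample_budget` and the example numbers are hypothetical.

```python
import math

def sample_budget(num_states, num_actions, gamma, eps, delta):
    """Evaluate the stated time/sample bound with the hidden constant set to 1."""
    lead = num_states * num_actions / ((1.0 - gamma) ** 3 * eps ** 2)
    log1 = math.log(num_states * num_actions / ((1.0 - gamma) * delta * eps))
    log2 = math.log(1.0 / ((1.0 - gamma) * eps))
    return lead * log1 * log2

# Example: |S| = 100, |A| = 10, gamma = 0.99, eps = 0.1, delta = 0.05.
print(f"{sample_budget(100, 10, 0.99, 0.1, 0.05):.3e}")
```

    The cubic dependence on the effective horizon $1/(1-\gamma)$ is what dominates this budget for discount factors close to 1, which is exactly where the claimed $(1-\gamma)^{-1}$ improvement matters most.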

    Probabilistic Contraction Analysis of Iterated Random Operators

    Consider a contraction operator $T$ over a complete metric space $\mathcal{X}$ with fixed point $x^\star$. In many computational applications, it is difficult to compute $T(x)$; therefore, one replaces the application of the contraction operator $T$ at iteration $k$ by a random operator $\hat{T}^n_k$ built from $n$ independent and identically distributed samples of a random variable. Consider the Markov chain $(\hat{X}^n_k)_{k \in \mathbb{N}}$ generated by $\hat{X}^n_{k+1} = \hat{T}^n_k(\hat{X}^n_k)$. In this paper, we identify some sufficient conditions under which (i) the distribution of $\hat{X}^n_k$ converges to a Dirac mass over $x^\star$ as $k$ and $n$ go to infinity, and (ii) the probability that $\hat{X}^n_k$ is far from $x^\star$ as $k$ goes to infinity can be made arbitrarily small by an appropriate choice of $n$. We also derive an upper bound on the probability that $\hat{X}^n_k$ is far from $x^\star$ as $k \rightarrow \infty$. We apply the result to study the convergence in probability of iterates generated by empirical value iteration algorithms for discounted and average cost Markov decision problems.
    Comment: 37 pages, submitted to SIAM Journal on Control and Optimization
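
    The discounted-cost application mentioned at the end of the abstract can be illustrated with a small sketch of empirical value iteration, where the exact Bellman operator $T$ is replaced at every iteration by a random operator built from $n$ i.i.d. next-state samples per state-action pair; the toy MDP below and the helper name `empirical_bellman` are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def empirical_bellman(V, P, R, gamma, n, rng):
    """One application of a random operator \\hat T^n_k: for each (s, a), draw n
    i.i.d. next states from P[s, a] and replace the expectation in the Bellman
    operator by the empirical average."""
    S, A = R.shape
    Q = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=n, p=P[s, a])
            Q[s, a] = R[s, a] + gamma * V[next_states].mean()
    return Q.max(axis=1)

# Toy discounted MDP (hypothetical data): 3 states, 2 actions.
rng = np.random.default_rng(0)
S, A, gamma, n = 3, 2, 0.9, 200
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(S, A))       # rewards in [0, 1]
V = np.zeros(S)
for k in range(100):                          # the chain \hat X^n_{k+1} = \hat T^n_k(\hat X^n_k)
    V = empirical_bellman(V, P, R, gamma, n, rng)
print(V)
```

    Here $\mathcal{X}$ is the space of value functions, $T$ is the (contracting) Bellman optimality operator, and the iterates form exactly the kind of Markov chain whose concentration around $x^\star$ the paper analyzes.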

    Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage

    The goal of an offline reinforcement learning (RL) algorithm is to learn optimal policies using historical (offline) data, without access to the environment for online exploration. One of the main challenges in offline RL is distribution shift, which refers to the difference between the state-action visitation distribution of the data generating policy and that of the learning policy. Many recent works have used the idea of pessimism for developing offline RL algorithms and characterizing their sample complexity under the relatively weak assumption of single-policy concentrability. Different from the offline RL literature, the area of distributionally robust learning (DRL) offers a principled framework that uses a minimax formulation to tackle model mismatch between training and testing environments. In this work, we aim to bridge these two areas by showing that the DRL approach can be used to tackle the distribution shift problem in offline RL. In particular, we propose two offline RL algorithms using the DRL framework, for the tabular and linear function approximation settings, and characterize their sample complexity under the single-policy concentrability assumption. We also demonstrate the superior performance of our proposed algorithms through simulation experiments.
    Comment: 33 pages, preprint
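
    As a rough illustration of the minimax (pessimistic) idea, and not the paper's actual algorithm, the sketch below runs a robust Bellman backup that takes the worst case over a small finite ambiguity set of candidate transition kernels; the ambiguity set, the helper name `robust_bellman_backup`, and the toy data are all hypothetical.

```python
import numpy as np

def robust_bellman_backup(V, models, R, gamma):
    """One pessimistic (distributionally robust) Q-update: for each (s, a), take the
    worst case over a finite ambiguity set of candidate transition kernels, i.e. a
    min over models inside the usual max over actions."""
    worst = np.min([P @ V for P in models], axis=0)   # worst-case E[V(s')] per (s, a)
    return R + gamma * worst

# Hypothetical setup: the ambiguity set is a handful of candidate kernels; in the paper
# the set would be built around a model estimated from the offline data.
rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.9
R = rng.uniform(0.0, 1.0, size=(S, A))
models = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(5)]  # candidate P[s, a, :]
Q = np.zeros((S, A))
for _ in range(200):
    Q = robust_bellman_backup(Q.max(axis=1), models, R, gamma)
print(Q.max(axis=1))     # robust value estimate
print(Q.argmax(axis=1))  # greedy (pessimistic) policy
```

    The inner minimum is what makes the learned policy conservative on state-action pairs the offline data covers poorly, which is how the DRL formulation mitigates distribution shift without online exploration.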