10 research outputs found

    Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model

    In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP) provided we can only access its transition function through a generative sampling model that, given any state-action pair, samples from the transition function in $O(1)$ time. Given such a DMDP with states $S$, actions $A$, discount factor $\gamma \in (0,1)$, and rewards in range $[0, 1]$, we provide an algorithm which computes an $\epsilon$-optimal policy with probability $1 - \delta$ where \emph{both} the time spent and the number of samples taken are upper bounded by
    $$O\left[\frac{|S||A|}{(1-\gamma)^3 \epsilon^2} \log \left(\frac{|S||A|}{(1-\gamma)\delta \epsilon} \right) \log\left(\frac{1}{(1-\gamma)\epsilon}\right)\right].$$
    For fixed values of $\epsilon$, this improves upon the previous best known bounds by a factor of $(1 - \gamma)^{-1}$ and matches the sample complexity lower bounds proved in Azar et al. (2013) up to logarithmic factors. We also extend our method to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model and provide a nearly matching sample complexity lower bound.
    Comment: 31 pages. Accepted to NeurIPS, 2018
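
    To make the bound concrete, here is a minimal sketch (not from the paper) that simply evaluates the stated complexity expression with the hidden constant set to 1; the helper name `sample_budget` and the example numbers are hypothetical.

```python
import math

def sample_budget(num_states, num_actions, gamma, eps, delta):
    """Evaluate the stated time/sample bound with the hidden constant set to 1."""
    lead = num_states * num_actions / ((1.0 - gamma) ** 3 * eps ** 2)
    log1 = math.log(num_states * num_actions / ((1.0 - gamma) * delta * eps))
    log2 = math.log(1.0 / ((1.0 - gamma) * eps))
    return lead * log1 * log2

# Example: |S| = 100, |A| = 10, gamma = 0.99, eps = 0.1, delta = 0.05.
print(f"{sample_budget(100, 10, 0.99, 0.1, 0.05):.3e}")
```

    The cubic dependence on the effective horizon $1/(1-\gamma)$ is what dominates this budget for discount factors close to 1, which is exactly where the claimed $(1-\gamma)^{-1}$ improvement matters most.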

    Probabilistic Contraction Analysis of Iterated Random Operators

    Consider a contraction operator $T$ over a complete metric space $\mathcal{X}$ with fixed point $x^\star$. In many computational applications, it is difficult to compute $T(x)$; therefore, one replaces the application of the contraction operator $T$ at iteration $k$ by a random operator $\hat{T}^n_k$ built from $n$ independent and identically distributed samples of a random variable. Consider the Markov chain $(\hat{X}^n_k)_{k \in \mathbb{N}}$ generated by $\hat{X}^n_{k+1} = \hat{T}^n_k(\hat{X}^n_k)$. In this paper, we identify some sufficient conditions under which (i) the distribution of $\hat{X}^n_k$ converges to a Dirac mass over $x^\star$ as $k$ and $n$ go to infinity, and (ii) the probability that $\hat{X}^n_k$ is far from $x^\star$ as $k$ goes to infinity can be made arbitrarily small by an appropriate choice of $n$. We also derive an upper bound on the probability that $\hat{X}^n_k$ is far from $x^\star$ as $k \rightarrow \infty$. We apply the result to study the convergence in probability of iterates generated by empirical value iteration algorithms for discounted and average cost Markov decision problems.
    Comment: 37 pages, submitted to SIAM Journal on Control and Optimization
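
    The discounted-cost application mentioned at the end of the abstract can be illustrated with a small sketch of empirical value iteration, where the exact Bellman operator $T$ is replaced at every iteration by a random operator built from $n$ i.i.d. next-state samples per state-action pair; the toy MDP below and the helper name `empirical_bellman` are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def empirical_bellman(V, P, R, gamma, n, rng):
    """One application of a random operator \\hat T^n_k: for each (s, a), draw n
    i.i.d. next states from P[s, a] and replace the expectation in the Bellman
    operator by the empirical average."""
    S, A = R.shape
    Q = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=n, p=P[s, a])
            Q[s, a] = R[s, a] + gamma * V[next_states].mean()
    return Q.max(axis=1)

# Toy discounted MDP (hypothetical data): 3 states, 2 actions.
rng = np.random.default_rng(0)
S, A, gamma, n = 3, 2, 0.9, 200
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(S, A))       # rewards in [0, 1]
V = np.zeros(S)
for k in range(100):                          # the chain \hat X^n_{k+1} = \hat T^n_k(\hat X^n_k)
    V = empirical_bellman(V, P, R, gamma, n, rng)
print(V)
```

    Here $\mathcal{X}$ is the space of value functions, $T$ is the (contracting) Bellman optimality operator, and the iterates form exactly the kind of Markov chain whose concentration around $x^\star$ the paper analyzes.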

    Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage

    The goal of an offline reinforcement learning (RL) algorithm is to learn optimal policies using historical (offline) data, without access to the environment for online exploration. One of the main challenges in offline RL is distribution shift, which refers to the difference between the state-action visitation distribution of the data generating policy and that of the learning policy. Many recent works have used the idea of pessimism for developing offline RL algorithms and characterizing their sample complexity under the relatively weak assumption of single-policy concentrability. Different from the offline RL literature, the area of distributionally robust learning (DRL) offers a principled framework that uses a minimax formulation to tackle model mismatch between training and testing environments. In this work, we aim to bridge these two areas by showing that the DRL approach can be used to tackle the distribution shift problem in offline RL. In particular, we propose two offline RL algorithms using the DRL framework, for the tabular and linear function approximation settings, and characterize their sample complexity under the single-policy concentrability assumption. We also demonstrate the superior performance of our proposed algorithms through simulation experiments.
    Comment: 33 pages, preprint
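
    As a rough illustration of the minimax (pessimistic) idea, and not the paper's actual algorithm, the sketch below runs a robust Bellman backup that takes the worst case over a small finite ambiguity set of candidate transition kernels; the ambiguity set, the helper name `robust_bellman_backup`, and the toy data are all hypothetical.

```python
import numpy as np

def robust_bellman_backup(V, models, R, gamma):
    """One pessimistic (distributionally robust) Q-update: for each (s, a), take the
    worst case over a finite ambiguity set of candidate transition kernels, i.e. a
    min over models inside the usual max over actions."""
    worst = np.min([P @ V for P in models], axis=0)   # worst-case E[V(s')] per (s, a)
    return R + gamma * worst

# Hypothetical setup: the ambiguity set is a handful of candidate kernels; in the paper
# the set would be built around a model estimated from the offline data.
rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.9
R = rng.uniform(0.0, 1.0, size=(S, A))
models = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(5)]  # candidate P[s, a, :]
Q = np.zeros((S, A))
for _ in range(200):
    Q = robust_bellman_backup(Q.max(axis=1), models, R, gamma)
print(Q.max(axis=1))     # robust value estimate
print(Q.argmax(axis=1))  # greedy (pessimistic) policy
```

    The inner minimum is what makes the learned policy conservative on state-action pairs the offline data covers poorly, which is how the DRL formulation mitigates distribution shift without online exploration.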