
    Large-Scale Markov Decision Problems via the Linear Programming Dual

    We consider the problem of controlling a fully specified Markov decision process (MDP), also known as the planning problem, when the state space is very large and calculating the optimal policy is intractable. Instead, we pursue the more modest goal of optimizing over some small family of policies. Specifically, we show that the family of policies associated with a low-dimensional approximation of occupancy measures yields a tractable optimization. Moreover, we propose an efficient algorithm, scaling with the size of the subspace but not the state space, that is able to find a policy with low excess loss relative to the best policy in this class. To the best of our knowledge, no such results previously existed in the literature. We bound the excess loss in the average-cost and discounted-cost cases, which are treated separately. Preliminary experiments show the effectiveness of the proposed algorithms in a queueing application.
    Comment: 53 pages. arXiv admin note: text overlap with arXiv:1402.676
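
    As a concrete illustration of the occupancy-measure viewpoint in this abstract, the sketch below solves the average-cost LP after restricting occupancy measures to a low-dimensional subspace mu = Phi @ theta. It is only a toy illustration under assumed names (Phi, P, c, and a small random MDP); unlike the paper's algorithm, it enumerates the state space and calls an off-the-shelf LP solver.

        # Toy sketch (not the paper's algorithm): average-cost LP over occupancy
        # measures restricted to a subspace mu = Phi @ theta. All quantities here
        # (Phi, P, c, the random MDP) are illustrative assumptions.
        import numpy as np
        from scipy.optimize import linprog

        n_states, n_actions, d = 20, 3, 5
        rng = np.random.default_rng(0)

        # Random MDP: transition kernel P[s, a, s'] and per-step costs c[s, a].
        P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        c = rng.random((n_states, n_actions))

        # Stationary state-action distribution of the uniform policy, kept as one
        # feature column so that the restricted LP is guaranteed to be feasible.
        P_unif = P.mean(axis=1)
        pi_stat = np.ones(n_states) / n_states
        for _ in range(1000):
            pi_stat = pi_stat @ P_unif
        mu_unif = np.repeat(pi_stat / n_actions, n_actions)

        Phi = rng.random((n_states * n_actions, d))
        Phi[:, 0] = mu_unif

        # minimize c^T (Phi theta)  s.t.  flow conservation, normalization, Phi theta >= 0
        c_flat = c.reshape(-1)
        flow = np.zeros((n_states, n_states * n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                col = s * n_actions + a
                flow[:, col] += P[s, a]   # expected inflow from (s, a) into every next state
                flow[s, col] -= 1.0       # minus the occupancy of (s, a) leaving state s
        A_eq = np.vstack([flow @ Phi, np.ones((1, n_states * n_actions)) @ Phi])
        b_eq = np.concatenate([np.zeros(n_states), [1.0]])

        res = linprog(c_flat @ Phi, A_eq=A_eq, b_eq=b_eq,
                      A_ub=-Phi, b_ub=np.zeros(n_states * n_actions),
                      bounds=[(None, None)] * d)
        assert res.success
        mu = Phi @ res.x
        print("approximate average cost:", float(c_flat @ mu))

    The point of the restriction is that the number of decision variables is d rather than |S||A|; the paper's contribution is an algorithm and excess-loss analysis that scale with the subspace dimension rather than with the |S|-sized constraints kept explicit in this toy version.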

    Large Scale Markov Decision Processes with Changing Rewards

    We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a state-of-the-art regret bound of $O(\sqrt{\tau(\ln|S|+\ln|A|)T}\,\ln(T))$, where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$ per period. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d \ll |S|$, we propose a modified algorithm with computational complexity polynomial in $d$. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first $\tilde{O}(\sqrt{T})$ regret bound for large-scale MDPs with changing rewards.
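
    To make the quantity being bounded explicit: the regret here is measured against the best stationary policy in hindsight. In occupancy-measure notation (our phrasing, not quoted from the paper),

        R_T \;=\; \max_{\mu \in \Delta} \sum_{t=1}^{T} \langle \mu, r_t \rangle \;-\; \sum_{t=1}^{T} \langle \mu_t, r_t \rangle,

    where $\Delta$ is the set of stationary state-action occupancy measures of the MDP, $r_t$ is the reward vector chosen (possibly adversarially) in period $t$, and $\mu_t$ is the occupancy measure induced by the learner in that period. The $d$-dimensional linear architecture replaces $\Delta$ by the measures representable as $\Phi\theta$, which is what keeps the per-period computation polynomial in $d$.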

    A maximum-entropy approach to off-policy evaluation in average-reward MDPs

    This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e., where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, when the feature dynamics are approximately linear and the rewards are arbitrary, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.
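
    The exponential-family conclusion follows the standard maximum-entropy pattern. In generic form (our notation, with $\psi$ standing in for the paper's feature-based constraints under the empirical dynamics),

        \max_{\mu \in \Delta(S)} H(\mu) \quad \text{s.t.} \quad \mathbb{E}_{s \sim \mu}[\psi(s)] = c
        \qquad \Longrightarrow \qquad \mu_{\lambda}(s) \;\propto\; \exp\!\big(\lambda^{\top}\psi(s)\big),

    where $\lambda$ is the vector of Lagrange multipliers fit so that the constraints hold. This is the sense in which the estimated stationary distribution is an exponential-family member whose sufficient statistics are built from the features.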

    Efficient Planning in Large MDPs with Weak Linear Function Approximation

    Large-scale Markov decision processes (MDPs) require planning algorithms with runtime independent of the number of states of the MDP. We consider the planning problem in MDPs using linear value function approximation with only weak requirements: low approximation error for the optimal value function, and a small set of "core" states whose features span those of other states. In particular, we make no assumptions about the representability of policies or of the value functions of non-optimal policies. Our algorithm produces almost-optimal actions for any state using a generative oracle (simulator) for the MDP, while its computation time scales polynomially with the number of features, core states, and actions, and with the effective horizon.
    Comment: 12 pages and appendix (10 pages). Submitted to the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
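
    A minimal sketch of the kind of procedure this abstract describes is given below: approximate value iteration whose Bellman backups are evaluated only at the core states via the generative oracle, with values elsewhere represented linearly in the features. The function and all names in it (simulator, features, core_states) are assumptions for illustration, not the paper's exact algorithm.

        # Hedged sketch: approximate value iteration anchored at core states.
        # Illustrative only; the paper's algorithm and guarantees differ in detail.
        import numpy as np

        def plan_with_core_states(simulator, core_states, features, n_actions,
                                  gamma=0.95, n_samples=50, n_iters=200, rng=None):
            """Fit w so that V(s) ~= features(s) @ w using backups at core states.

            simulator(s, a, rng) -> (reward, next_state)  is the generative oracle.
            features(s) -> np.ndarray of shape (d,); core-state features are assumed
            to (approximately) span the features of all other states.
            """
            rng = rng or np.random.default_rng(0)
            Phi_core = np.stack([features(s) for s in core_states])   # (m, d)
            w = np.zeros(Phi_core.shape[1])
            for _ in range(n_iters):
                targets = np.empty(len(core_states))
                for i, s in enumerate(core_states):
                    # Monte-Carlo Bellman optimality backup at a core state.
                    q = np.zeros(n_actions)
                    for a in range(n_actions):
                        for _ in range(n_samples):
                            r, s_next = simulator(s, a, rng)
                            q[a] += (r + gamma * features(s_next) @ w) / n_samples
                    targets[i] = q.max()
                # Project the backed-up values onto the span of the core features.
                w, *_ = np.linalg.lstsq(Phi_core, targets, rcond=None)
            return w

    Given the fitted weights, an action for any state can be computed on demand by one-step lookahead through the same oracle; the cost of the whole procedure depends on the number of features, core states, actions, samples, and iterations, not on the size of the state space.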