
    Large-Scale Markov Decision Problems via the Linear Programming Dual

    We consider the problem of controlling a fully specified Markov decision process (MDP), also known as the planning problem, when the state space is very large and calculating the optimal policy is intractable. Instead, we pursue the more modest goal of optimizing over some small family of policies. Specifically, we show that the family of policies associated with a low-dimensional approximation of occupancy measures yields a tractable optimization. Moreover, we propose an efficient algorithm, scaling with the size of the subspace but not the state space, that is able to find a policy with low excess loss relative to the best policy in this class. To the best of our knowledge, no such results previously existed in the literature. We bound the excess loss in the average-cost and discounted-cost cases, which are treated separately. Preliminary experiments show the effectiveness of the proposed algorithms in a queueing application.
    Comment: 53 pages. arXiv admin note: text overlap with arXiv:1402.676
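
    As a concrete illustration of the occupancy-measure viewpoint in this abstract, the sketch below solves the average-cost LP after restricting occupancy measures to a low-dimensional subspace mu = Phi @ theta. It is only a toy illustration under assumed names (Phi, P, c, and a small random MDP); unlike the paper's algorithm, it enumerates the state space and calls an off-the-shelf LP solver.

        # Toy sketch (not the paper's algorithm): average-cost LP over occupancy
        # measures restricted to a subspace mu = Phi @ theta. All quantities here
        # (Phi, P, c, the random MDP) are illustrative assumptions.
        import numpy as np
        from scipy.optimize import linprog

        n_states, n_actions, d = 20, 3, 5
        rng = np.random.default_rng(0)

        # Random MDP: transition kernel P[s, a, s'] and per-step costs c[s, a].
        P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        c = rng.random((n_states, n_actions))

        # Stationary state-action distribution of the uniform policy, kept as one
        # feature column so that the restricted LP is guaranteed to be feasible.
        P_unif = P.mean(axis=1)
        pi_stat = np.ones(n_states) / n_states
        for _ in range(1000):
            pi_stat = pi_stat @ P_unif
        mu_unif = np.repeat(pi_stat / n_actions, n_actions)

        Phi = rng.random((n_states * n_actions, d))
        Phi[:, 0] = mu_unif

        # minimize c^T (Phi theta)  s.t.  flow conservation, normalization, Phi theta >= 0
        c_flat = c.reshape(-1)
        flow = np.zeros((n_states, n_states * n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                col = s * n_actions + a
                flow[:, col] += P[s, a]   # expected inflow from (s, a) into every next state
                flow[s, col] -= 1.0       # minus the occupancy of (s, a) leaving state s
        A_eq = np.vstack([flow @ Phi, np.ones((1, n_states * n_actions)) @ Phi])
        b_eq = np.concatenate([np.zeros(n_states), [1.0]])

        res = linprog(c_flat @ Phi, A_eq=A_eq, b_eq=b_eq,
                      A_ub=-Phi, b_ub=np.zeros(n_states * n_actions),
                      bounds=[(None, None)] * d)
        assert res.success
        mu = Phi @ res.x
        print("approximate average cost:", float(c_flat @ mu))

    The point of the restriction is that the number of decision variables is d rather than |S||A|; the paper's contribution is an algorithm and excess-loss analysis that scale with the subspace dimension rather than with the |S|-sized constraints kept explicit in this toy version.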

    Large Scale Markov Decision Processes with Changing Rewards

    We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a state-of-the-art regret bound of $O(\sqrt{\tau(\ln|S|+\ln|A|)T}\,\ln(T))$, where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$ per period. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d \ll |S|$, we propose a modified algorithm with computational complexity polynomial in $d$. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first $\tilde{O}(\sqrt{T})$ regret bound for large-scale MDPs with changing rewards.
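
    To make the quantity being bounded explicit: the regret here is measured against the best stationary policy in hindsight. In occupancy-measure notation (our phrasing, not quoted from the paper),

        R_T \;=\; \max_{\mu \in \Delta} \sum_{t=1}^{T} \langle \mu, r_t \rangle \;-\; \sum_{t=1}^{T} \langle \mu_t, r_t \rangle,

    where $\Delta$ is the set of stationary state-action occupancy measures of the MDP, $r_t$ is the reward vector chosen (possibly adversarially) in period $t$, and $\mu_t$ is the occupancy measure induced by the learner in that period. The $d$-dimensional linear architecture replaces $\Delta$ by the measures representable as $\Phi\theta$, which is what keeps the per-period computation polynomial in $d$.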

    A maximum-entropy approach to off-policy evaluation in average-reward MDPs

    This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e., where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, when the feature dynamics are approximately linear and the rewards are arbitrary, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.
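
    The exponential-family conclusion follows the standard maximum-entropy pattern. In generic form (our notation, with $\psi$ standing in for the paper's feature-based constraints under the empirical dynamics),

        \max_{\mu \in \Delta(S)} H(\mu) \quad \text{s.t.} \quad \mathbb{E}_{s \sim \mu}[\psi(s)] = c
        \qquad \Longrightarrow \qquad \mu_{\lambda}(s) \;\propto\; \exp\!\big(\lambda^{\top}\psi(s)\big),

    where $\lambda$ is the vector of Lagrange multipliers fit so that the constraints hold. This is the sense in which the estimated stationary distribution is an exponential-family member whose sufficient statistics are built from the features.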

    Efficient Planning in Large MDPs with Weak Linear Function Approximation

    Large-scale Markov decision processes (MDPs) require planning algorithms with runtime independent of the number of states of the MDP. We consider the planning problem in MDPs using linear value function approximation with only weak requirements: low approximation error for the optimal value function, and a small set of "core" states whose features span those of other states. In particular, we make no assumptions about the representability of policies or of the value functions of non-optimal policies. Our algorithm produces almost-optimal actions for any state using a generative oracle (simulator) for the MDP, while its computation time scales polynomially with the number of features, core states, and actions, and with the effective horizon.
    Comment: 12 pages and appendix (10 pages). Submitted to the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
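
    A minimal sketch of the kind of procedure this abstract describes is given below: approximate value iteration whose Bellman backups are evaluated only at the core states via the generative oracle, with values elsewhere represented linearly in the features. The function and all names in it (simulator, features, core_states) are assumptions for illustration, not the paper's exact algorithm.

        # Hedged sketch: approximate value iteration anchored at core states.
        # Illustrative only; the paper's algorithm and guarantees differ in detail.
        import numpy as np

        def plan_with_core_states(simulator, core_states, features, n_actions,
                                  gamma=0.95, n_samples=50, n_iters=200, rng=None):
            """Fit w so that V(s) ~= features(s) @ w using backups at core states.

            simulator(s, a, rng) -> (reward, next_state)  is the generative oracle.
            features(s) -> np.ndarray of shape (d,); core-state features are assumed
            to (approximately) span the features of all other states.
            """
            rng = rng or np.random.default_rng(0)
            Phi_core = np.stack([features(s) for s in core_states])   # (m, d)
            w = np.zeros(Phi_core.shape[1])
            for _ in range(n_iters):
                targets = np.empty(len(core_states))
                for i, s in enumerate(core_states):
                    # Monte-Carlo Bellman optimality backup at a core state.
                    q = np.zeros(n_actions)
                    for a in range(n_actions):
                        for _ in range(n_samples):
                            r, s_next = simulator(s, a, rng)
                            q[a] += (r + gamma * features(s_next) @ w) / n_samples
                    targets[i] = q.max()
                # Project the backed-up values onto the span of the core features.
                w, *_ = np.linalg.lstsq(Phi_core, targets, rcond=None)
            return w

    Given the fitted weights, an action for any state can be computed on demand by one-step lookahead through the same oracle; the cost of the whole procedure depends on the number of features, core states, actions, samples, and iterations, not on the size of the state space.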