3 research outputs found

    Efficient Planning in Large MDPs with Weak Linear Function Approximation

    Large-scale Markov decision processes (MDPs) require planning algorithms with runtime independent of the number of states of the MDP. We consider the planning problem in MDPs using linear value function approximation with only weak requirements: low approximation error for the optimal value function, and a small set of "core" states whose features span those of other states. In particular, we make no assumptions about the representability of policies or value functions of non-optimal policies. Our algorithm produces almost-optimal actions for any state using a generative oracle (simulator) for the MDP, while its computation time scales polynomially with the number of features, core states, and actions, and with the effective horizon.
    Comment: 12 pages and appendix (10 pages). Submitted to the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
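    A rough formalization of the two requirements, in notation of our own choosing (the feature map $\phi$, parameter $\theta^*$, approximation error $\varepsilon$, core set $\mathcal{C}$, and mixing weights $\alpha_s$ are not taken from the abstract): the optimal value function is nearly linear in the features, and every state's feature vector lies in the span of the core-state features,
    \[
        \sup_{s}\bigl|V^*(s) - \phi(s)^{\top}\theta^*\bigr| \le \varepsilon,
        \qquad
        \phi(s) = \sum_{c \in \mathcal{C}} \alpha_s(c)\,\phi(c) \quad \text{for some weights } \alpha_s(c),
    \]
    where $|\mathcal{C}|$ is small; the planner's runtime may depend on $|\mathcal{C}|$, the number of features and actions, and the effective horizon, but not on the total number of states.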

    On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function

    We consider local planning in fixed-horizon MDPs with a generative model under the assumption that the optimal value function lies close to the span of a feature map. The generative model provides local access to the MDP: the planner can ask for random transitions from previously returned states and arbitrary actions, and features are only accessible for states that are encountered in this process. As opposed to previous work (e.g. Lattimore et al. (2020)), where linear realizability of all policies was assumed, we consider the significantly relaxed assumption of a single linearly realizable (deterministic) policy. A recent lower bound by Weisz et al. (2020) established that the related problem, where the action-value function of the optimal policy is linearly realizable, requires an exponential number of queries, either in H (the horizon of the MDP) or in d (the dimension of the feature mapping). Their construction crucially relies on having an exponentially large action set. In contrast, in this work we establish that poly(H, d) planning is possible with state-value function realizability whenever the action set has a constant size. In particular, we present the TensorPlan algorithm, which uses poly((dH/δ)^A) simulator queries to find a δ-optimal policy relative to any deterministic policy for which the value function is linearly realizable with some bounded parameter. This is the first algorithm to give a polynomial query complexity guarantee using only linear realizability of a single competing value function. Whether the computation cost is similarly bounded remains an open question. We extend the upper bound to the near-realizable case and to the infinite-horizon discounted setup. We also present a lower bound in the infinite-horizon episodic setting: planners that achieve constant suboptimality need exponentially many queries, either in d or in the number of actions.
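    As a sketch of the realizability condition behind TensorPlan, in our own notation (the stage-wise feature maps $\phi_h$, parameters $\theta_h$, bound $B$, and competing deterministic policy $\pi$ are not spelled out in the abstract): the competing policy's state-value function is exactly linear at every stage with a bounded parameter,
    \[
        v^{\pi}_h(s) = \phi_h(s)^{\top}\theta_h \quad \text{for all stages } h \in \{1,\dots,H\} \text{ and states } s,
        \qquad
        \max_{h}\|\theta_h\| \le B,
    \]
    and the algorithm returns a δ-optimal policy relative to any such π, with the query bound poly((dH/δ)^A) quoted above, where A is the (constant) number of actions.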

    An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap

    A fundamental question in the theory of reinforcement learning is: suppose the optimal Q-function lies in the linear span of a given d-dimensional feature mapping; is sample-efficient reinforcement learning (RL) possible? The recent and remarkable result of Weisz et al. (2020) resolved this question in the negative, providing an exponential (in d) sample size lower bound, which holds even if the agent has access to a generative model of the environment. One may hope that this information-theoretic barrier for RL can be circumvented by further supposing an even more favorable assumption: there exists a constant suboptimality gap between the optimal Q-value of the best action and that of the second-best action (for all states). The hope is that having a large suboptimality gap would permit easier identification of optimal actions themselves, thus making the problem tractable; indeed, provided the agent has access to a generative model, sample-efficient RL is in fact possible with the addition of this more favorable assumption. This work focuses on this question in the standard online reinforcement learning setting, where our main result resolves it in the negative: our hardness result shows that an exponential sample complexity lower bound still holds even if a constant suboptimality gap is assumed in addition to having a linearly realizable optimal Q-function. Perhaps surprisingly, this implies an exponential separation between the online RL setting and the generative model setting. Complementing our negative hardness result, we give two positive results showing that provably sample-efficient RL is possible either under an additional low-variance assumption or under a novel hypercontractivity assumption (both implicitly place stronger conditions on the underlying dynamics model).
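    For concreteness, the two assumptions in question can be written as follows, in our own notation (feature map $\phi$, parameter $\theta^*$, gap $\Delta$): linear realizability of the optimal Q-function together with a constant suboptimality gap,
    \[
        Q^*(s,a) = \phi(s,a)^{\top}\theta^* \quad \text{for all } (s,a),
        \qquad
        Q^*(s,a^*_s) - Q^*(s,a) \ge \Delta \quad \text{for all } s \text{ and all } a \neq a^*_s,
    \]
    where $a^*_s$ denotes the optimal action at state $s$ and $\Delta > 0$ is an absolute constant.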