3 research outputs found
Efficient Planning in Large MDPs with Weak Linear Function Approximation
Large-scale Markov decision processes (MDPs) require planning algorithms with
runtime independent of the number of states of the MDP. We consider the
planning problem in MDPs using linear value function approximation with only
weak requirements: low approximation error for the optimal value function, and
a small set of "core" states whose features span those of other states. In
particular, we make no assumptions about the representability of policies or
value functions of non-optimal policies. Our algorithm produces almost-optimal
actions for any state using a generative oracle (simulator) for the MDP, while
its computation time scales polynomially with the number of features, core
states, and actions, and with the effective horizon.
Comment: 12 pages and appendix (10 pages). Submitted to the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
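In symbols, the two weak requirements can be sketched as follows (my own notation, assuming a feature map $\phi : \mathcal{S} \to \mathbb{R}^d$ and a core set $\mathcal{C}$; the paper's exact conditions may differ in norms and constants):
\[
\inf_{\theta \in \mathbb{R}^d} \max_{s \in \mathcal{S}} \bigl| V^\star(s) - \phi(s)^\top \theta \bigr| \le \varepsilon,
\qquad
\phi(s) \in \operatorname{span}\{ \phi(c) : c \in \mathcal{C} \} \ \text{for all } s \in \mathcal{S}.
\]
The first condition is the low approximation error for the optimal value function; the second is the core-state spanning condition, with $|\mathcal{C}|$ small.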
On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function
We consider local planning in fixed-horizon MDPs with a generative model
under the assumption that the optimal value function lies close to the span of
a feature map. The generative model provides local access to the MDP: the
planner can ask for random transitions from previously returned states and
arbitrary actions, and features are only accessible for states that are
encountered in this process. As opposed to previous work (e.g. Lattimore et al.
(2020)) where linear realizability of all policies was assumed, we consider the
significantly relaxed assumption of a single linearly realizable
(deterministic) policy. A recent lower bound by Weisz et al. (2020) established
that the related problem when the action-value function of the optimal policy
is linearly realizable requires an exponential number of queries, either in
$H$ (the horizon of the MDP) or $d$ (the dimension of the feature mapping). Their
construction crucially relies on having an exponentially large action set. In
contrast, in this work, we establish that poly$(H, d)$ planning is possible with
state value function realizability whenever the action set has a constant size.
In particular, we present the TensorPlan algorithm which uses
poly$\bigl((dH/\delta)^A\bigr)$ simulator queries (with $A$ the number of actions) to find a $\delta$-optimal policy
relative to any deterministic policy for which the value function is linearly
realizable with some bounded parameter. This is the first algorithm to give a
polynomial query complexity guarantee using only linear-realizability of a
single competing value function. Whether the computation cost is similarly
bounded remains an open question. We extend the upper bound to the
near-realizable case and to the infinite-horizon discounted setup. We also
present a lower bound in the infinite-horizon episodic setting: Planners that
achieve constant suboptimality need exponentially many queries, either in
$d$ or the number of actions.
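To make the local-access protocol concrete, the following is a minimal Python sketch of the interface such a planner would query (the names ToyMDP, LocalAccessSimulator, step, and features are illustrative inventions, not from the paper):

import random

class ToyMDP:
    """Stand-in environment; hidden from the planner."""
    def sample(self, state, action):
        # Random transition over two states with a simple reward.
        next_state = random.randint(0, 1)
        reward = 1.0 if (state, action) == (1, 1) else 0.0
        return reward, next_state

    def phi(self, state):
        # Two-dimensional feature map.
        return (1.0, float(state))

class LocalAccessSimulator:
    """Local access: transitions may only be requested from states the
    simulator has already returned, and features are revealed only for
    such states."""
    def __init__(self, mdp, initial_state):
        self._mdp = mdp
        self._seen = {initial_state}

    def step(self, state, action):
        """Sample a random transition from a previously returned state
        and an arbitrary action."""
        if state not in self._seen:
            raise ValueError("local access only: state never returned")
        reward, next_state = self._mdp.sample(state, action)
        self._seen.add(next_state)
        return reward, next_state

    def features(self, state):
        """Feature vector phi(state), only for encountered states."""
        if state not in self._seen:
            raise ValueError("features only for encountered states")
        return self._mdp.phi(state)

sim = LocalAccessSimulator(ToyMDP(), initial_state=0)
reward, s = sim.step(0, action=1)   # allowed: state 0 was returned initially
print(sim.features(s))              # allowed: s was just returned

A planner in this model only ever interacts with the environment through calls like these, which is what makes counting simulator queries meaningful.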
An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap
A fundamental question in the theory of reinforcement learning is: suppose
the optimal $Q$-function lies in the linear span of a given $d$-dimensional
feature mapping, is sample-efficient reinforcement learning (RL) possible? The
recent and remarkable result of Weisz et al. (2020) resolved this question in
the negative, providing an exponential (in $d$) sample size lower bound, which
holds even if the agent has access to a generative model of the environment.
One may hope that this information theoretic barrier for RL can be circumvented
by further supposing an even more favorable assumption: there exists a
\emph{constant suboptimality gap} between the optimal $Q$-value of the best
action and that of the second-best action (for all states). The hope is that
having a large suboptimality gap would permit easier identification of optimal
actions themselves, thus making the problem tractable; indeed, provided the
agent has access to a generative model, sample-efficient RL is in fact possible
with the addition of this more favorable assumption.
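For reference, the gap assumption can be written as follows ($\Delta$ is my symbol for the gap; $\pi^\star$ denotes an optimal policy):
\[
\Delta \;=\; \min_{s \in \mathcal{S}} \Bigl( Q^\star\bigl(s, \pi^\star(s)\bigr) - \max_{a \neq \pi^\star(s)} Q^\star(s, a) \Bigr) \;\ge\; c \;>\; 0,
\]
with $c$ a constant independent of the dimension $d$ and the horizon.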
This work focuses on this question in the standard online reinforcement
learning setting, where our main result resolves this question in the negative:
our hardness result shows that an exponential sample complexity lower bound
still holds even if a constant suboptimality gap is assumed in addition to
having a linearly realizable optimal $Q$-function. Perhaps surprisingly, this
implies an exponential separation between the online RL setting and the
generative model setting. Complementing our negative hardness result, we give
two positive results showing that provably sample-efficient RL is possible
either under an additional low-variance assumption or under a novel
hypercontractivity assumption (both implicitly place stronger conditions on the
underlying dynamics model).
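The abstract does not spell out the hypercontractivity condition; a standard (2,4)-hypercontractivity assumption on the feature distribution, which is the flavor of condition meant here (my formulation, possibly differing from the paper's in constants and scope), reads:
\[
\mathbb{E}\bigl[ \langle \phi, v \rangle^4 \bigr] \;\le\; C \cdot \Bigl( \mathbb{E}\bigl[ \langle \phi, v \rangle^2 \bigr] \Bigr)^2 \qquad \text{for all } v \in \mathbb{R}^d,
\]
i.e., fourth moments of linear functions of the features are controlled by their second moments, which rules out heavy-tailed feature distributions.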