3 research outputs found
Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?
Learning to plan for long horizons is a central challenge in episodic
reinforcement learning problems. A fundamental question is to understand how
the difficulty of the problem scales as the horizon increases. Here the natural
measure of sample complexity is a normalized one: we are interested in the
number of episodes it takes to provably discover a policy whose value is
ε-close to the optimal value, where the value is measured by
the normalized cumulative reward in each episode. In a COLT 2018 open problem,
Jiang and Agarwal conjectured that, for tabular, episodic reinforcement
learning problems, there exists a sample complexity lower bound which exhibits
a polynomial dependence on the horizon -- a conjecture which is consistent with
all known sample complexity upper bounds. This work refutes this conjecture,
proving that tabular, episodic reinforcement learning is possible with a sample
complexity that scales only logarithmically with the planning horizon. In other
words, when the values are appropriately normalized (to lie in the unit
interval), this result shows that long horizon RL is no more difficult than
short horizon RL, at least in a minimax sense. Our analysis introduces two
ideas: (i) the construction of an ε-net for optimal policies whose
log-covering number scales only logarithmically with the planning horizon, and
(ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all
policies in a given policy class using sample complexity that scales with the
log-covering number of the given policy class. Both may be of independent
interest.
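To make the normalization concrete, here is a minimal Python sketch (not the paper's algorithm) of Monte Carlo evaluation of a policy's normalized value in a tabular episodic MDP. The function names sample_episode and estimate_value, and the array-based MDP representation, are illustrative assumptions; the point is only that the per-episode value is divided by the horizon H, so it lies in [0, 1] for every H, which is the sense in which the logarithmic-in-H sample complexity statement is made.

import numpy as np

rng = np.random.default_rng(0)

def sample_episode(P, R, policy, H, s0=0):
    """Roll out one episode of length H in a tabular, non-stationary MDP.
    P[h][s, a] is a distribution over next states, R[h][s, a] a mean reward
    in [0, 1], and policy[h][s] the action taken at step h in state s."""
    s, total = s0, 0.0
    for h in range(H):
        a = policy[h][s]
        total += R[h][s, a]
        s = rng.choice(len(P[h][s, a]), p=P[h][s, a])
    return total / H  # normalized cumulative reward, always in [0, 1]

def estimate_value(P, R, policy, H, n_episodes):
    """Average normalized return; episodes are the unit in which
    sample complexity is counted."""
    return float(np.mean([sample_episode(P, R, policy, H) for _ in range(n_episodes)]))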
Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning
Motivated by the prevailing paradigm of using unsupervised learning for
efficient exploration in reinforcement learning (RL) problems
(Tang et al., 2017; Bellemare et al., 2016), we investigate when this paradigm
is provably efficient. We study episodic Markov decision processes with rich
observations generated from a small number of latent states. We present a
general algorithmic framework that is built upon two components: an
unsupervised learning algorithm and a no-regret tabular RL algorithm.
Theoretically, we prove that as long as the unsupervised learning algorithm
enjoys a polynomial sample complexity guarantee, we can find a near-optimal
policy with sample complexity polynomial in the number of latent states, which
is significantly smaller than the number of observations. Empirically, we
instantiate our framework on a class of hard exploration problems to
demonstrate the practicality of our theory.
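A minimal sketch of the two-component idea follows (illustrative only, not the paper's instantiation): k-means stands in for the unsupervised learning component that maps rich observations to a small set of latent-state indices, and plain tabular Q-learning stands in for the tabular RL component. The paper's analysis uses a no-regret episodic tabular algorithm rather than Q-learning, and the class name DecodedTabularAgent is an assumption made here for concreteness.

import numpy as np
from sklearn.cluster import KMeans

class DecodedTabularAgent:
    def __init__(self, observations, n_latent_states, n_actions, lr=0.1, gamma=0.99):
        # Unsupervised component: learn a decoder from rich observations
        # to a small number of latent-state indices.
        self.decoder = KMeans(n_clusters=n_latent_states, n_init=10).fit(observations)
        self.Q = np.zeros((n_latent_states, n_actions))
        self.lr, self.gamma, self.n_actions = lr, gamma, n_actions

    def decode(self, obs):
        return int(self.decoder.predict(obs.reshape(1, -1))[0])

    def act(self, obs, eps=0.1):
        # Epsilon-greedy over the decoded (tabular) state.
        s = self.decode(obs)
        if np.random.rand() < eps:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.Q[s]))

    def update(self, obs, a, r, next_obs):
        # Tabular RL component: ordinary Q-learning on decoded indices,
        # so the table size scales with latent states, not observations.
        s, s_next = self.decode(obs), self.decode(next_obs)
        target = r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.lr * (target - self.Q[s, a])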
FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs
In order to deal with the curse of dimensionality in reinforcement learning
(RL), it is common practice to make parametric assumptions where values or
policies are functions of some low dimensional feature space. This work focuses
on the representation learning question: how can we learn such features? Under
the assumption that the underlying (unknown) dynamics correspond to a low rank
transition matrix, we show how the representation learning question is related
to a particular non-linear matrix decomposition problem. Structurally, we make
precise connections between these low rank MDPs and latent variable models,
showing how they significantly generalize prior formulations for representation
learning in RL. Algorithmically, we develop FLAMBE, which engages in
exploration and representation learning for provably efficient RL in low rank
transition models.

Comment: New algorithm and analysis to remove the reachability assumption.
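For concreteness, the low rank transition structure assumed above can be sketched in Python as follows (an illustration of the structural assumption only, not the FLAMBE algorithm): T(s' | s, a) = <phi(s, a), mu(s')> for d-dimensional embeddings phi and mu, so the flattened (S·A) × S transition matrix has rank at most d. The random construction and the specific sizes below are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
S, A, d = 20, 5, 3  # states, actions, embedding dimension (d << S)

phi = rng.random((S, A, d))   # features of each (state, action) pair
mu = rng.random((d, S))       # component measures over next states

# T[s, a, s'] = <phi(s, a), mu(s')>, then each row is normalized into a
# distribution; row normalization is a diagonal rescaling, so the rank
# bound rank <= d is preserved.
T = np.einsum('sad,dt->sat', phi, mu)
T /= T.sum(axis=-1, keepdims=True)

flat = T.reshape(S * A, S)
print(np.linalg.matrix_rank(flat))  # at most d (here: 3)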