Provably Efficient Q-learning with Function Approximation via Distribution Shift Error Checking Oracle
Q-learning with function approximation is one of the most popular methods
in reinforcement learning. Though the idea of using function approximation was
proposed at least 60 years ago, even in the simplest setup, i.e., approximating
Q-functions with linear functions, it is still an open problem how to
design a provably efficient algorithm that learns a near-optimal policy. The
key challenges are how to efficiently explore the state space and how to decide
when to stop exploring in conjunction with the function approximation scheme.
The current paper presents a provably efficient algorithm for Q-learning
with linear function approximation. Under certain regularity assumptions, our
algorithm, Difference Maximization Q-learning (DMQ), combined with linear
function approximation, returns a near-optimal policy using a polynomial number
of trajectories. Our algorithm introduces a new notion, the Distribution Shift
Error Checking (DSEC) oracle. This oracle tests whether there exists a function
in the function class that predicts well on a distribution $D_1$, but
predicts poorly on another distribution $D_2$, where $D_1$
and $D_2$ are distributions over states induced by two different
exploration policies. For the linear function class, this oracle is equivalent
to solving a top eigenvalue problem. We believe our algorithmic insights,
especially the DSEC oracle, are also useful in designing and analyzing
reinforcement learning algorithms with general function approximation.
Comment: In NeurIPS 2019
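As a concrete illustration of the last point, here is a minimal sketch (not the paper's implementation; the feature inputs, regularizer, and threshold are illustrative assumptions) of how a DSEC-style check for linear functions reduces to a top generalized eigenvalue problem:

```python
import numpy as np
from scipy.linalg import eigh

def dsec_oracle(phi_d1, phi_d2, reg=1e-3, threshold=10.0):
    """Hypothetical sketch of a DSEC-style check for linear functions.
    phi_d1, phi_d2: (n_i, d) arrays of state features sampled under two exploration
    policies.  Returns True if some linear function can fit the first distribution
    well yet err badly on the second, signalling that more exploration is needed."""
    d = phi_d1.shape[1]
    sigma1 = phi_d1.T @ phi_d1 / len(phi_d1) + reg * np.eye(d)  # regularized covariance on D_1
    sigma2 = phi_d2.T @ phi_d2 / len(phi_d2)                    # covariance on D_2
    # max_w (w' sigma2 w) / (w' sigma1 w): a top generalized eigenvalue problem.
    top = eigh(sigma2, sigma1, eigvals_only=True)[-1]
    return top > threshold
```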
Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?
Learning to plan for long horizons is a central challenge in episodic
reinforcement learning problems. A fundamental question is to understand how
the difficulty of the problem scales as the horizon increases. Here the natural
measure of sample complexity is a normalized one: we are interested in the
number of episodes it takes to provably discover a policy whose value is
$\epsilon$-near to that of the optimal value, where the value is measured by
the normalized cumulative reward in each episode. In a COLT 2018 open problem,
Jiang and Agarwal conjectured that, for tabular, episodic reinforcement
learning problems, there exists a sample complexity lower bound which exhibits
a polynomial dependence on the horizon -- a conjecture which is consistent with
all known sample complexity upper bounds. This work refutes this conjecture,
proving that tabular, episodic reinforcement learning is possible with a sample
complexity that scales only logarithmically with the planning horizon. In other
words, when the values are appropriately normalized (to lie in the unit
interval), this result shows that long horizon RL is no more difficult than
short horizon RL, at least in a minimax sense. Our analysis introduces two
ideas: (i) the construction of an $\epsilon$-net for optimal policies whose
log-covering number scales only logarithmically with the planning horizon, and
(ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all
policies in a given policy class using sample complexity that scales with the
log-covering number of the given policy class. Both may be of independent
interest.
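Read together, the two ideas combine roughly as follows (a heuristic schematic, not the paper's theorem statement; $S$ and $A$ denote the numbers of states and actions):

```latex
% Idea (i) gives a net \mathcal{N}_\epsilon over near-optimal policies whose
% log-covering number grows only logarithmically in H; idea (ii) evaluates every
% policy in the net with an episode count scaling with that log-covering number:
\[
  \#\text{episodes} \;\lesssim\; \mathrm{poly}\!\Big(S, A, \tfrac{1}{\epsilon}\Big)\cdot \log\lvert\mathcal{N}_\epsilon\rvert
  \;\lesssim\; \mathrm{poly}\!\Big(S, A, \tfrac{1}{\epsilon}\Big)\cdot \log H ,
\]
% so the planning horizon enters the sample complexity only logarithmically.
```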
Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning
Motivated by the prevailing paradigm of using unsupervised learning for
efficient exploration in reinforcement learning (RL) problems
[Tang et al., 2017; Bellemare et al., 2016], we investigate when this paradigm
is provably efficient. We study episodic Markov decision processes with rich
observations generated from a small number of latent states. We present a
general algorithmic framework that is built upon two components: an
unsupervised learning algorithm and a no-regret tabular RL algorithm.
Theoretically, we prove that as long as the unsupervised learning algorithm
enjoys a polynomial sample complexity guarantee, we can find a near-optimal
policy with sample complexity polynomial in the number of latent states, which
is significantly smaller than the number of observations. Empirically, we
instantiate our framework on a class of hard exploration problems to
demonstrate the practicality of our theory.
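A minimal sketch of the two-component framework described above, assuming a gym-like environment with vector observations; KMeans stands in for the generic unsupervised learner and epsilon-greedy Q-learning for the no-regret tabular algorithm (both are illustrative substitutes, not the paper's specific choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def decode_then_tabular_q(env, n_latent, n_actions, episodes=500, horizon=50,
                          alpha=0.5, eps=0.1, gamma=0.99):
    """Illustrative sketch: an unsupervised learner (KMeans, standing in for any
    decoder with a polynomial sample complexity guarantee) maps rich observations
    to latent states, then a tabular RL method (epsilon-greedy Q-learning,
    standing in for a no-regret algorithm) runs on the decoded states."""
    unwrap = lambda o: o[0] if isinstance(o, tuple) else o  # tolerate both gym reset APIs
    # Phase 1: collect observations with a random policy and fit the decoder.
    buf = []
    for _ in range(50):
        o = unwrap(env.reset())
        for _ in range(horizon):
            buf.append(np.asarray(o))
            o, _, done, *_ = env.step(np.random.randint(n_actions))
            if done:
                break
    decoder = KMeans(n_clusters=n_latent).fit(np.array(buf))
    # Phase 2: tabular Q-learning on decoded latent states.
    Q = np.zeros((n_latent, n_actions))
    for _ in range(episodes):
        s = decoder.predict(np.asarray(unwrap(env.reset()))[None])[0]
        for _ in range(horizon):
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
            o, r, done, *_ = env.step(a)
            s2 = decoder.predict(np.asarray(o)[None])[0]
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
            s = s2
            if done:
                break
    return Q, decoder
```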
FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs
In order to deal with the curse of dimensionality in reinforcement learning
(RL), it is common practice to make parametric assumptions where values or
policies are functions of some low dimensional feature space. This work focuses
on the representation learning question: how can we learn such features? Under
the assumption that the underlying (unknown) dynamics correspond to a low rank
transition matrix, we show how the representation learning question is related
to a particular non-linear matrix decomposition problem. Structurally, we make
precise connections between these low rank MDPs and latent variable models,
showing how they significantly generalize prior formulations for representation
learning in RL. Algorithmically, we develop FLAMBE, which engages in
exploration and representation learning for provably efficient RL in low rank
transition models.
Comment: New algorithm and analysis to remove the reachability assumption
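The low-rank structural assumption itself is easy to visualize on a tabular toy example: factor an empirical transition matrix into nonnegative rank-d factors that play the roles of phi(s, a) and mu(s'). The sketch below is only an illustration of that assumption (NMF as a stand-in decomposition), not the FLAMBE algorithm:

```python
import numpy as np
from sklearn.decomposition import NMF

def factor_transitions(counts, rank):
    """counts: (SA, S) array of empirical transition counts.  Returns (phi, mu)
    such that the row-normalized transition matrix is approximately phi @ mu,
    i.e. T[(s,a), s'] ~ phi(s,a) . mu(s'), mirroring the low-rank MDP assumption."""
    T = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)  # empirical T(s'|s,a)
    model = NMF(n_components=rank, init="nndsvda", max_iter=500)
    phi = model.fit_transform(T)   # (SA, rank), plays the role of phi(s, a)
    mu = model.components_         # (rank, S),  plays the role of mu(s')
    return phi, mu
```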
Smooth Structured Prediction Using Quantum and Classical Gibbs Samplers
We introduce two quantum algorithms for solving structured prediction
problems. We show that a stochastic subgradient descent method that uses the
quantum minimum finding algorithm and takes its probabilistic failure into
account solves the structured prediction problem with a runtime that scales
with the square root of the size of the label space, and polynomially in the inverse of the precision, $\epsilon$, of the
solution. Motivated by robust inference techniques in machine learning, we
introduce another quantum algorithm that solves a smooth approximation of the
structured prediction problem with a similar quantum speedup in the size of the
label space and a similar scaling in the precision parameter. In doing so, we
analyze a stochastic gradient algorithm for convex optimization in the presence
of an additive error in the calculation of the gradients, and show that its
convergence rate does not deteriorate if the additive errors are of an order
controlled by the target precision. This algorithm uses quantum Gibbs sampling
at a temperature set by that precision as a subroutine. Based on these theoretical observations,
we propose a method for using quantum Gibbs samplers to combine feedforward
neural networks with probabilistic graphical models for quantum machine
learning. Our numerical results using Monte Carlo simulations on an image
tagging task demonstrate the benefit of the approach.
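The robustness claim about additive gradient errors can be illustrated classically. The sketch below is illustrative only (the least-squares objective and the `err` magnitude are assumptions, and no quantum subroutine is involved): it runs SGD with an explicit additive error injected into every gradient, which mimics the inexact gradients a Gibbs-sampling subroutine would provide.

```python
import numpy as np

def noisy_sgd(A, b, steps=2000, lr=0.01, err=1e-3, seed=0):
    """SGD on the least-squares objective ||Aw - b||^2 / n, with a small additive
    error added to every stochastic gradient.  With `err` small relative to the
    target accuracy, the iterate stays close to the least-squares solution."""
    rng = np.random.default_rng(seed)
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        i = rng.integers(len(b))                   # sample one data point
        grad = 2 * (A[i] @ w - b[i]) * A[i]        # stochastic gradient
        grad += err * rng.standard_normal(len(w))  # additive gradient error
        w -= lr * grad
    return w
```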
Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension
Value function approximation has demonstrated phenomenal empirical success in
reinforcement learning (RL). Nevertheless, despite a handful of recent progress
on developing theory for RL with linear function approximation, the
understanding of general function approximation schemes largely remains
missing. In this paper, we establish a provably efficient RL algorithm with
general value function approximation. We show that if the value functions admit
an approximation with a function class $\mathcal{F}$, our algorithm achieves a
regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ where $d$ is a
complexity measure of $\mathcal{F}$ that depends on the eluder dimension [Russo
and Van Roy, 2013] and log-covering numbers, $H$ is the planning horizon, and
$T$ is the number of interactions with the environment. Our theory generalizes
recent progress on RL with linear value function approximation and does not
make explicit assumptions on the model of the environment. Moreover, our
algorithm is model-free and provides a framework to justify the effectiveness
of algorithms used in practice.
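For readers unfamiliar with the complexity measure, the toy sketch below illustrates the notion of eps-independence that underlies the eluder dimension, for a finite function class represented as dictionaries (a brute-force illustration of the definition, unrelated to the paper's algorithm):

```python
import itertools
import numpy as np

def eps_independent(F, x, prev, eps):
    """x is eps-independent of prev (w.r.t. the finite class F) if two functions in
    F are eps-close in aggregate on prev yet differ by more than eps at x."""
    for f, g in itertools.combinations(F, 2):
        close_on_prev = np.sqrt(sum((f[z] - g[z]) ** 2 for z in prev)) <= eps
        if close_on_prev and abs(f[x] - g[x]) > eps:
            return True
    return False

def greedy_eluder_lower_bound(F, domain, eps):
    """Greedily builds a sequence of points, each eps-independent of its
    predecessors; its length lower-bounds the eluder dimension of F at scale eps.
    F: list of dicts mapping each point in `domain` to a real value."""
    seq = []
    for x in domain:
        if eps_independent(F, x, seq, eps):
            seq.append(x)
    return len(seq)
```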
Efficient Planning in Large MDPs with Weak Linear Function Approximation
Large-scale Markov decision processes (MDPs) require planning algorithms with
runtime independent of the number of states of the MDP. We consider the
planning problem in MDPs using linear value function approximation with only
weak requirements: low approximation error for the optimal value function, and
a small set of "core" states whose features span those of other states. In
particular, we make no assumptions about the representability of policies or
value functions of non-optimal policies. Our algorithm produces almost-optimal
actions for any state using a generative oracle (simulator) for the MDP, while
its computation time scales polynomially with the number of features, core
states, and actions and the effective horizon.
Comment: 12 pages and appendix (10 pages). Submitted to the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada
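A rough sketch of core-state-based approximate value iteration with a generative oracle, under the assumptions named in the abstract (spanning core features, simulator access); the interfaces and sampling budget below are illustrative, not the paper's exact procedure:

```python
import numpy as np

def core_state_planning(simulator, core_states, core_phi, phi, actions,
                        iters=100, samples=20, gamma=0.99):
    """Approximate value iteration that only maintains values at the core states.
      simulator(s, a) -> (reward, next_state)   generative oracle for the MDP
      core_states     list of the m core states
      core_phi        (m, d) feature matrix of the core states
      phi(s)          feature vector of an arbitrary state s
    Values elsewhere are read off through a linear fit on the core features."""
    m = len(core_states)
    v = np.zeros(m)
    for _ in range(iters):
        w, *_ = np.linalg.lstsq(core_phi, v, rcond=None)  # V(s) ~ phi(s) @ w
        q = np.zeros((m, len(actions)))
        for i, s in enumerate(core_states):
            for j, a in enumerate(actions):
                # Monte Carlo Bellman backup through the generative oracle.
                backups = [r + gamma * (phi(s2) @ w)
                           for r, s2 in (simulator(s, a) for _ in range(samples))]
                q[i, j] = np.mean(backups)
        v = q.max(axis=1)
    w, *_ = np.linalg.lstsq(core_phi, v, rcond=None)
    return w  # act greedily anywhere: argmax_a (sampled) E[r + gamma * phi(s') @ w]
```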
An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap
A fundamental question in the theory of reinforcement learning is: suppose
the optimal Q-function lies in the linear span of a given $d$-dimensional
feature mapping, is sample-efficient reinforcement learning (RL) possible? The
recent and remarkable result of Weisz et al. (2020) resolved this question in
the negative, providing an exponential (in $d$) sample size lower bound, which
holds even if the agent has access to a generative model of the environment.
One may hope that this information theoretic barrier for RL can be circumvented
by further supposing an even more favorable assumption: there exists a
constant suboptimality gap between the optimal Q-value of the best
action and that of the second-best action (for all states). The hope is that
having a large suboptimality gap would permit easier identification of optimal
actions themselves, thus making the problem tractable; indeed, provided the
agent has access to a generative model, sample-efficient RL is in fact possible
with the addition of this more favorable assumption.
This work focuses on this question in the standard online reinforcement
learning setting, where our main result resolves this question in the negative:
our hardness result shows that an exponential sample complexity lower bound
still holds even if a constant suboptimality gap is assumed in addition to
having a linearly realizable optimal Q-function. Perhaps surprisingly, this
implies an exponential separation between the online RL setting and the
generative model setting. Complementing our negative hardness result, we give
two positive results showing that provably sample-efficient RL is possible
either under an additional low-variance assumption or under a novel
hypercontractivity assumption (both implicitly place stronger conditions on the
underlying dynamics model).
Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium
We develop provably efficient reinforcement learning algorithms for
two-player zero-sum finite-horizon Markov games with simultaneous moves. To
incorporate function approximation, we consider a family of Markov games where
the reward function and transition kernel possess a linear structure. Both the
offline and online settings of the problems are considered. In the offline
setting, we control both players and aim to find the Nash Equilibrium by
minimizing the duality gap. In the online setting, we control a single player
playing against an arbitrary opponent and aim to minimize the regret. For both
settings, we propose an optimistic variant of the least-squares minimax value
iteration algorithm. We show that our algorithm is computationally efficient
and provably achieves an $\widetilde{O}(\sqrt{d^3 H^3 T})$ upper bound on the
duality gap and regret, where $d$ is the linear dimension, $H$ the horizon and
$T$ the total number of timesteps. Our results do not require additional
assumptions on the sampling model.
Our setting requires overcoming several new challenges that are absent in
Markov decision processes or turn-based Markov games. In particular, to achieve
optimism with simultaneous moves, we construct both upper and lower confidence
bounds of the value function, and then compute the optimistic policy by solving
a general-sum matrix game with these bounds as the payoff matrices. As finding
the Nash Equilibrium of a general-sum game is computationally hard, our
algorithm instead solves for a Coarse Correlated Equilibrium (CCE), which can
be obtained efficiently. To the best of our knowledge, such a CCE-based scheme for
optimism has not appeared in the literature and might be of interest in its own
right.
Comment: Accepted for presentation at COLT 2020
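The CCE subroutine mentioned above is a standard linear program over joint action distributions; in the abstract's construction, the two payoff matrices would come from the upper and lower confidence bounds of the value function. The sketch below is a generic formulation of that LP (not the paper's code):

```python
import numpy as np
from scipy.optimize import linprog

def coarse_correlated_equilibrium(A, B):
    """Computes a coarse correlated equilibrium (CCE) of a two-player general-sum
    matrix game by linear programming.  A, B: (n, m) payoff matrices for players 1
    and 2.  Returns a joint distribution x over action pairs such that neither
    player gains in expectation by deviating to any fixed action."""
    n, m = A.shape
    nm = n * m
    rows = []
    # Player 1 deviation constraints: for every i', E_x[A[i', j] - A[i, j]] <= 0.
    for i_dev in range(n):
        gain = np.repeat(A[i_dev][None, :], n, axis=0) - A    # (n, m): A[i',j] - A[i,j]
        rows.append(gain.ravel())
    # Player 2 deviation constraints: for every j', E_x[B[i, j'] - B[i, j]] <= 0.
    for j_dev in range(m):
        gain = np.repeat(B[:, j_dev][:, None], m, axis=1) - B  # (n, m): B[i,j'] - B[i,j]
        rows.append(gain.ravel())
    A_ub = np.array(rows)
    b_ub = np.zeros(len(rows))
    A_eq = np.ones((1, nm))        # probabilities sum to one
    b_eq = np.array([1.0])
    res = linprog(c=np.zeros(nm), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * nm, method="highs")
    return res.x.reshape(n, m)     # joint distribution over (action1, action2)
```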
Sample Efficient Reinforcement Learning In Continuous State Spaces: A Perspective Beyond Linearity
Reinforcement learning (RL) is empirically successful in complex nonlinear
Markov decision processes (MDPs) with continuous state spaces. By contrast, the
majority of theoretical RL literature requires the MDP to satisfy some form of
linear structure, in order to guarantee sample efficient RL. Such efforts
typically assume the transition dynamics or value function of the MDP are
described by linear functions of the state features. To resolve this
discrepancy between theory and practice, we introduce the Effective Planning
Window (EPW) condition, a structural condition on MDPs that makes no linearity
assumptions. We demonstrate that the EPW condition permits sample efficient RL,
by providing an algorithm which provably solves MDPs satisfying this condition.
Our algorithm requires minimal assumptions on the policy class, which can
include multi-layer neural networks with nonlinear activation functions.
Notably, the EPW condition is directly motivated by popular gaming benchmarks,
and we show that many classic Atari games satisfy this condition. We
additionally show the necessity of conditions like EPW, by demonstrating that
simple MDPs with slight nonlinearities cannot be solved sample efficiently.
Comment: ICML 2021