PAC Reinforcement Learning with Rich Observations
We propose and study a new model for reinforcement learning with rich
observations, generalizing contextual bandits to sequential decision making.
These models require an agent to take actions based on observations (features)
with the goal of achieving long-term performance competitive with a large set
of policies. To avoid barriers to sample-efficient learning associated with
large observation spaces and general POMDPs, we focus on problems that can be
summarized by a small number of hidden states and have long-term rewards that
are predictable by a reactive function class. In this setting, we design and
analyze a new reinforcement learning algorithm, Least Squares Value Elimination
by Exploration. We prove that the algorithm learns near optimal behavior after
a number of episodes that is polynomial in all relevant parameters, logarithmic
in the number of policies, and independent of the size of the observation
space. Our result provides theoretical justification for reinforcement learning
with function approximation.
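Schematically, the guarantee has the following form (our notation: M hidden states, K actions, H the horizon, Π the policy class; a hedged restatement, not the paper's exact theorem):

    % Schematic PAC guarantee: with probability at least 1 - \delta,
    % the learned policy \hat{\pi} satisfies
    V(\hat{\pi}) \;\ge\; \max_{\pi \in \Pi} V(\pi) - \epsilon
    % after a number of episodes bounded by
    n \;=\; \mathrm{poly}\bigl(M, K, H, 1/\epsilon, \log(|\Pi|/\delta)\bigr),
    % with no dependence on the size of the observation space.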
Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?
Learning to plan for long horizons is a central challenge in episodic
reinforcement learning problems. A fundamental question is to understand how
the difficulty of the problem scales as the horizon increases. Here the natural
measure of sample complexity is a normalized one: we are interested in the
number of episodes it takes to provably discover a policy whose value is ε
near to that of the optimal value, where the value is measured by
the normalized cumulative reward in each episode. In a COLT 2018 open problem,
Jiang and Agarwal conjectured that, for tabular, episodic reinforcement
learning problems, there exists a sample complexity lower bound which exhibits
a polynomial dependence on the horizon -- a conjecture which is consistent with
all known sample complexity upper bounds. This work refutes this conjecture,
proving that tabular, episodic reinforcement learning is possible with a sample
complexity that scales only logarithmically with the planning horizon. In other
words, when the values are appropriately normalized (to lie in the unit
interval), this result shows that long horizon RL is no more difficult than
short horizon RL, at least in a minimax sense. Our analysis introduces two
ideas: (i) the construction of an ε-net for optimal policies whose
log-covering number scales only logarithmically with the planning horizon, and
(ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all
policies in a given policy class using sample complexity that scales with the
log-covering number of the given policy class. Both may be of independent
interest.
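To make the normalization concrete, the setup can be written schematically as follows (notation ours; the exact bound is in the paper):

    % Values are normalized to the unit interval:
    V^{\pi} \;=\; \mathbb{E}_{\pi}\!\Bigl[\tfrac{1}{H}\textstyle\sum_{h=1}^{H} r_h\Bigr] \;\in\; [0, 1]
    % and the result shows an \epsilon-optimal policy is learnable from
    n \;=\; \mathrm{poly}\bigl(|S|, |A|, 1/\epsilon\bigr)\cdot\mathrm{polylog}(H)
    % episodes, i.e. only a logarithmic dependence on the planning horizon H.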
EX2: Exploration with Exemplar Models for Deep Reinforcement Learning
Deep reinforcement learning algorithms have been shown to learn complex tasks
using highly general policy classes. However, sparse reward problems remain a
significant challenge. Exploration methods based on novelty detection have been
particularly successful in such settings but typically require generative or
predictive models of the observations, which can be difficult to train when the
observations are very high-dimensional and complex, as in the case of raw
images. We propose a novelty detection algorithm for exploration that is based
entirely on discriminatively trained exemplar models, where classifiers are
trained to discriminate each visited state against all others. Intuitively,
novel states are easier to distinguish against other states seen during
training. We show that this kind of discriminative modeling corresponds to
implicit density estimation, and that it can be combined with count-based
exploration to produce competitive results on a range of popular benchmark
tasks, including state-of-the-art results on challenging egocentric
observations in the vizDoom benchmark.
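As a rough illustration of the implicit-density view (a minimal sketch, not the paper's implementation; the scikit-learn classifier, the (1 - D)/D conversion, and the bonus form are our illustrative choices):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def novelty_bonus(new_state, visited_states, beta=1.0):
        """Exemplar-style novelty bonus: discriminate new_state against past states.

        If D is an (approximately) optimal discriminator between a point mass at
        new_state and the distribution of visited states, a smoothed density
        estimate is p(new_state) ~ (1 - D) / D; rarely-visited states push D
        toward 1, making p small and the bonus large.
        """
        X = np.vstack([new_state[None, :], visited_states])
        y = np.concatenate([[1.0], np.zeros(len(visited_states))])
        clf = LogisticRegression().fit(X, y)
        d = clf.predict_proba(new_state[None, :])[0, 1]   # discriminator output in (0, 1)
        p_hat = (1.0 - d) / max(d, 1e-8)                  # implicit density estimate
        return beta / np.sqrt(p_hat + 1e-8)               # count-style exploration bonus

In practice the paper amortizes the exemplar models rather than fitting one classifier per state; the sketch only conveys the density-from-discriminator conversion.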
PAC-Bayes Control: Learning Policies that Provably Generalize to Novel Environments
Our goal is to learn control policies for robots that provably generalize
well to novel environments given a dataset of example environments. The key
technical idea behind our approach is to leverage tools from generalization
theory in machine learning by exploiting a precise analogy (which we present in
the form of a reduction) between generalization of control policies to novel
environments and generalization of hypotheses in the supervised learning
setting. In particular, we utilize the Probably Approximately Correct
(PAC)-Bayes framework, which allows us to obtain upper bounds that hold with
high probability on the expected cost of (stochastic) control policies across
novel environments. We propose policy learning algorithms that explicitly seek
to minimize this upper bound. The corresponding optimization problem can be
solved using convex optimization (Relative Entropy Programming in particular)
in the setting where we are optimizing over a finite policy space. In the more
general setting of continuously parameterized policies (e.g., neural network
policies), we minimize this upper bound using stochastic gradient descent. We
present simulated results of our approach applied to learning (1) reactive
obstacle avoidance policies and (2) neural network-based grasping policies. We
also present hardware results for the Parrot Swing drone navigating through
different obstacle environments. Our examples demonstrate the potential of our
approach to provide strong generalization guarantees for robotic systems with
continuous state and action spaces, complicated (e.g., nonlinear) dynamics,
rich sensory inputs (e.g., depth images), and neural network-based policies.
Comment: Extended version of paper presented at the 2018 Conference on Robot Learning (CoRL).
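For orientation, a generic PAC-Bayes bound of the kind being minimized here reads as follows (the standard form with costs in [0, 1]; the paper's precise bound and constants may differ). Fix a prior P_0 over policies before seeing the N training environments; then for any posterior P, with probability at least 1 - δ over the sampled environments,

    \mathbb{E}_{E}\,\mathbb{E}_{\pi \sim P}\bigl[C(\pi; E)\bigr]
      \;\le\; \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{\pi \sim P}\bigl[C(\pi; E_i)\bigr]
      \;+\; \sqrt{\frac{\mathrm{KL}(P \,\|\, P_0) + \log\frac{2\sqrt{N}}{\delta}}{2N}}

The finite-policy-space case minimizes a bound of this form over P via Relative Entropy Programming; the continuously parameterized case does so with stochastic gradient descent.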
Provably efficient RL with Rich Observations via Latent State Decoding
We study the exploration problem in episodic MDPs with rich observations
generated from a small number of latent states. Under certain identifiability
assumptions, we demonstrate how to estimate a mapping from the observations to
latent states inductively through a sequence of regression and clustering
steps---where previously decoded latent states provide labels for later
regression problems---and use it to construct good exploration policies. We
provide finite-sample guarantees on the quality of the learned state decoding
function and exploration policies, and complement our theory with an empirical
evaluation on a class of hard exploration problems. Our method exponentially
improves over Q-learning with naïve exploration, even when Q-learning has
cheating access to latent states.
Comment: ICML 2019.
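A schematic of the regression-and-clustering recursion (interfaces and names here are hypothetical placeholders, not the paper's algorithm):

    def learn_decoders(num_levels, collect, regress, cluster, plan_to_reach):
        """Inductively build observation -> latent-state decoders, level by level.

        Hypothetical interfaces:
          collect(h, policies): gather (obs, prev_obs) pairs at level h using the
              exploration policies already built for earlier levels.
          regress(data, labels): fit features predicting the decoded label of the
              predecessor observation.
          cluster(features): group level-h observations into estimated latent
              states; returns a callable decoder with a num_states attribute.
          plan_to_reach(decoders, h, s): policy that tries to reach state s at level h.
        """
        decoders, explore = {}, {}
        for h in range(num_levels):
            data = collect(h, explore)
            labels = None if h == 0 else [decoders[h - 1](prev) for (_, prev) in data]
            features = regress(data, labels)      # regression step
            decoders[h] = cluster(features)       # clustering step defines the level-h decoder
            explore[h] = {s: plan_to_reach(decoders, h, s)   # one exploration policy per state
                          for s in range(decoders[h].num_states)}
        return decoders, explore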
Information-Theoretic Considerations in Batch Reinforcement Learning
Value-function approximation methods that operate in batch mode have
foundational importance to reinforcement learning (RL). Finite sample
guarantees for these methods often crucially rely on two types of assumptions:
(1) mild distribution shift, and (2) representation conditions that are
stronger than realizability. However, the necessity ("why do we need them?")
and the naturalness ("when do they hold?") of such assumptions have largely
eluded the literature. In this paper, we revisit these assumptions and provide
theoretical results towards answering the above questions, and make steps
towards a deeper understanding of value-function approximation.
Comment: Published in ICML 2019.
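For readers scanning the abstract, the two assumption types are commonly formalized roughly as follows (our paraphrase with standard notation, not the paper's exact definitions):

    % (1) Mild distribution shift, via a concentrability coefficient bounding how far
    %     state-action distributions \nu induced by candidate policies can deviate
    %     from the data distribution \mu:
    C \;=\; \sup_{\nu\ \text{admissible}} \Bigl\|\tfrac{d\nu}{d\mu}\Bigr\|_{\infty} \;<\; \infty
    % (2) Representation: realizability only asks that Q^{*} \in \mathcal{F}; the
    %     stronger completeness condition asks that \mathcal{F} be closed under the
    %     Bellman operator:
    \mathcal{T}f \in \mathcal{F} \quad \text{for all } f \in \mathcal{F}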
Markov Decision Processes with Continuous Side Information
We consider a reinforcement learning (RL) setting in which the agent
interacts with a sequence of episodic MDPs. At the start of each episode the
agent has access to some side-information or context that determines the
dynamics of the MDP for that episode. Our setting is motivated by applications
in healthcare where baseline measurements of a patient at the start of a
treatment episode form the context that may provide information about how the
patient might respond to treatment decisions. We propose algorithms for
learning in such Contextual Markov Decision Processes (CMDPs) under an
assumption that the unobserved MDP parameters vary smoothly with the observed
context. We also give lower and upper PAC bounds under the smoothness
assumption. Because our lower bound has an exponential dependence on the
dimension, we consider a tractable linear setting where the context is used to
create linear combinations of a finite set of MDPs. For the linear setting, we
give a PAC learning algorithm based on KWIK learning techniques.
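Schematically, the two settings look as follows (notation ours): smoothness of the episode-specific MDP in the context, and the linear special case where the context mixes a finite set of base MDPs.

    % Smoothness (schematic): for contexts c, c' and a Lipschitz constant L,
    \bigl\|P(\cdot \mid s, a, c) - P(\cdot \mid s, a, c')\bigr\|_{1} \;\le\; L\,\|c - c'\|
    % Linear setting: the context weights a finite set of base MDPs P_1, \dots, P_d,
    P(\cdot \mid s, a, c) \;=\; \sum_{i=1}^{d} c_i\, P_i(\cdot \mid s, a),
    \qquad c_i \ge 0,\ \ \textstyle\sum_{i} c_i = 1
    % (and analogously for the rewards).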
Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning
Motivated by the prevailing paradigm of using unsupervised learning for
efficient exploration in reinforcement learning (RL) problems
(Tang et al., 2017; Bellemare et al., 2016), we investigate when this paradigm
is provably efficient. We study episodic Markov decision processes with rich
observations generated from a small number of latent states. We present a
general algorithmic framework that is built upon two components: an
unsupervised learning algorithm and a no-regret tabular RL algorithm.
Theoretically, we prove that as long as the unsupervised learning algorithm
enjoys a polynomial sample complexity guarantee, we can find a near-optimal
policy with sample complexity polynomial in the number of latent states, which
is significantly smaller than the number of observations. Empirically, we
instantiate our framework on a class of hard exploration problems to
demonstrate the practicality of our theory.
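In schematic form, the two-component framework can be read as the following loop (a sketch with hypothetical environment and learner interfaces, not the paper's pseudocode):

    def unsupervised_then_tabular_rl(env, unsup_learn, tabular_rl, num_warmup_episodes):
        """(1) Fit a decoder from raw observations to latent-state ids with an
        unsupervised learner; (2) run a no-regret tabular RL algorithm that only
        ever sees the decoded, small latent state space."""
        # Phase 1: collect raw observations and learn observation -> latent id
        observations = [obs
                        for _ in range(num_warmup_episodes)
                        for obs in env.rollout(random_actions=True)]
        decoder = unsup_learn(observations)

        # Phase 2: wrap the environment so the tabular learner sees latent states
        def decoded_step(action):
            obs, reward, done = env.step(action)
            return decoder(obs), reward, done

        return tabular_rl(step=decoded_step, reset=lambda: decoder(env.reset()))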
On Oracle-Efficient PAC RL with Rich Observations
We study the computational tractability of PAC reinforcement learning with
rich observations. We present new provably sample-efficient algorithms for
environments with deterministic hidden state dynamics and stochastic rich
observations. These methods operate in an oracle model of computation --
accessing policy and value function classes exclusively through standard
optimization primitives -- and therefore represent computationally efficient
alternatives to prior algorithms that require enumeration. With stochastic
hidden state dynamics, we prove that the only known sample-efficient algorithm,
OLIVE, cannot be implemented in the oracle model. We also present several
examples that illustrate fundamental challenges of tractable PAC reinforcement
learning in such general settings.
Comment: appeared at NeurIPS 18; full paper including appendix; updated style file.
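To make the oracle model concrete, "oracle access" usually means the learner touches the policy or value-function class only through calls like the following (a schematic interface of our own, not the paper's exact primitives):

    from typing import Callable, Protocol, Sequence

    class RegressionOracle(Protocol):
        """Optimization primitive: given a dataset, return the member of the
        value-function class minimizing empirical squared error. An
        oracle-efficient algorithm makes polynomially many such calls and never
        enumerates the class."""
        def __call__(self, xs: Sequence, ys: Sequence[float]) -> Callable: ...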
Provably Efficient Q-learning with Function Approximation via Distribution Shift Error Checking Oracle
Q-learning with function approximation is one of the most popular methods
in reinforcement learning. Though the idea of using function approximation was
proposed at least 60 years ago, even in the simplest setup, i.e., approximating
Q-functions with linear functions, it is still an open problem how to
design a provably efficient algorithm that learns a near-optimal policy. The
key challenges are how to efficiently explore the state space and how to decide
when to stop exploring in conjunction with the function approximation scheme.
The current paper presents a provably efficient algorithm for Q-learning
with linear function approximation. Under certain regularity assumptions, our
algorithm, Difference Maximization Q-learning (DMQ), combined with linear
function approximation, returns a near-optimal policy using a polynomial number
of trajectories. Our algorithm introduces a new notion, the Distribution Shift
Error Checking (DSEC) oracle. This oracle tests whether there exists a function
in the function class that predicts well on a distribution D_1, but
predicts poorly on another distribution D_2, where D_1
and D_2 are distributions over states induced by two different
exploration policies. For the linear function class, this oracle is equivalent
to solving a top eigenvalue problem. We believe our algorithmic insights,
especially the DSEC oracle, are also useful in designing and analyzing
reinforcement learning algorithms with general function approximation.
Comment: In NeurIPS 2019.
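For the linear case, one plausible way to see the eigenvalue connection is the following schematic reading (our notation and simplification under squared loss; not the paper's exact formulation), with Σ_j the feature covariance under distribution D_j:

    % Linear predictors f_\theta(s) = \theta^\top \phi(s), with
    % \Sigma_j = \mathbb{E}_{s \sim D_j}\bigl[\phi(s)\,\phi(s)^\top\bigr].
    \max_{\theta \neq 0}\;
      \frac{\theta^\top \Sigma_2\, \theta}{\theta^\top \Sigma_1\, \theta}
      \;\;\gtrless\;\; \tau
    % The test fires when some direction \theta carries little mass under D_1 but
    % large mass under D_2 -- a generalized top-eigenvalue problem for (\Sigma_2, \Sigma_1).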