156 research outputs found
Model-based Reinforcement Learning and the Eluder Dimension
We consider the problem of learning to optimize an unknown Markov decision
process (MDP). We show that, if the MDP can be parameterized within some known
function class, we can obtain regret bounds that scale with the dimensionality,
rather than cardinality, of the system. We characterize this dependence
explicitly as $\widetilde{O}(\sqrt{d_K d_E T})$, where $T$ is the time elapsed, $d_K$ is
the Kolmogorov dimension and $d_E$ is the \emph{eluder dimension}. These
represent the first unified regret bounds for model-based reinforcement
learning and provide state of the art guarantees in several important settings.
Moreover, we present a simple and computationally efficient algorithm,
\emph{posterior sampling for reinforcement learning} (PSRL), that satisfies
these bounds.
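Since the abstract gives no pseudocode, the following is a minimal tabular sketch of the posterior-sampling loop that PSRL-style algorithms follow: sample an MDP from the posterior at the start of each episode, plan optimally in the sample, execute the resulting policy, and update the posterior with the observed data. The conjugate Dirichlet/Gaussian priors, the `env.reset`/`env.step` interface, and all constants are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sample_mdp(trans_counts, rew_sum, rew_count):
    """Draw one MDP from the posterior: Dirichlet transitions, Gaussian-mean rewards (assumed priors)."""
    S, A, _ = trans_counts.shape
    P = np.array([[np.random.dirichlet(trans_counts[s, a] + 1.0)  # +1.0 acts as a uniform prior
                   for a in range(A)] for s in range(S)])
    R = np.random.normal(rew_sum / np.maximum(rew_count, 1), 1.0 / np.sqrt(rew_count + 1))
    return P, R

def solve_finite_horizon(P, R, H):
    """Backward induction on the sampled MDP; returns a greedy policy per step."""
    S, A = R.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                      # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

def psrl(env, S, A, H, num_episodes):
    trans_counts = np.zeros((S, A, S))
    rew_sum = np.zeros((S, A))
    rew_count = np.zeros((S, A))
    for _ in range(num_episodes):
        P, R = sample_mdp(trans_counts, rew_sum, rew_count)   # posterior sample
        policy = solve_finite_horizon(P, R, H)                # plan in the sampled MDP
        s = env.reset()
        for h in range(H):                                    # execute and record transitions
            a = policy[h, s]
            s_next, r, done = env.step(a)
            trans_counts[s, a, s_next] += 1
            rew_sum[s, a] += r
            rew_count[s, a] += 1
            s = s_next
            if done:
                break
```

Randomness in the posterior sample is the only exploration mechanism here, which is what makes the loop simple and computationally cheap compared with explicit optimism constructions.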
A General Framework for Sample-Efficient Function Approximation in Reinforcement Learning
With the increasing need for handling large state and action spaces, general
function approximation has become a key technique in reinforcement learning
(RL). In this paper, we propose a general framework that unifies model-based
and model-free RL, and an Admissible Bellman Characterization (ABC) class that
subsumes nearly all Markov Decision Process (MDP) models in the literature for
tractable RL. We propose a novel estimation function with decomposable
structural properties for optimization-based exploration and the functional
eluder dimension as a complexity measure of the ABC class. Under our framework,
a new sample-efficient algorithm, namely OPtimization-based ExploRation with
Approximation (OPERA), is proposed, achieving regret bounds that match or
improve over the best-known results for a variety of MDP models. In particular,
for MDPs with low Witness rank, under a slightly stronger assumption, OPERA
improves the state-of-the-art sample complexity results by a factor of .
Our framework provides a generic interface to design and analyze new RL models
and algorithms.
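As a rough illustration of what optimization-based exploration with an estimation function looks like in general (not OPERA's actual estimator, constraint set, or constants), the sketch below selects, from a finite hypothesis class, the hypothesis with the largest predicted return among those whose accumulated estimation loss stays within a slack of the best fit, then acts greedily on it. The names `estimation_loss`, `initial_value`, `greedy_action`, and `beta` are placeholders introduced for this sketch.

```python
import numpy as np

def select_optimistic(hypotheses, dataset, estimation_loss, beta):
    """Pick the hypothesis with the largest initial value among those that fit the data well."""
    losses = np.array([sum(estimation_loss(f, transition) for transition in dataset)
                       for f in hypotheses])
    feasible = losses <= losses.min() + beta        # data-dependent confidence set
    values = np.array([f.initial_value() for f in hypotheses])
    values[~feasible] = -np.inf                     # exclude hypotheses that fit poorly
    return hypotheses[int(values.argmax())]         # optimism within the confidence set

def run_episode(env, f, horizon):
    """Act greedily with respect to the selected hypothesis and collect transitions."""
    data, s = [], env.reset()
    for _ in range(horizon):
        a = f.greedy_action(s)
        s_next, r, done = env.step(a)
        data.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return data
```

The alternation between selecting an optimistic hypothesis and collecting data under it is the generic pattern; the decomposable estimation function and the functional eluder dimension analysis are what the paper contributes on top of it.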
One Objective to Rule Them All: A Maximization Objective Fusing Estimation and Planning for Exploration
In online reinforcement learning (online RL), balancing exploration and
exploitation is crucial for finding an optimal policy in a sample-efficient
way. To achieve this, existing sample-efficient online RL algorithms typically
consist of three components: estimation, planning, and exploration. However, in
order to cope with general function approximators, most of them involve
impractical algorithmic components to incentivize exploration, such as
optimization within data-dependent level-sets or complicated sampling
procedures. To address this challenge, we propose an easy-to-implement RL
framework called \textit{Maximize to Explore} (\texttt{MEX}), which only needs
to optimize \emph{unconstrainedly} a single objective that integrates the
estimation and planning components while balancing exploration and exploitation
automatically. Theoretically, we prove that \texttt{MEX} achieves a sublinear
regret with general function approximations for Markov decision processes (MDP)
and is further extendable to two-player zero-sum Markov games (MG). Meanwhile,
we adapt deep RL baselines to design practical versions of \texttt{MEX}, in
both model-free and model-based manners, which can outperform baselines by a
stable margin in various MuJoCo environments with sparse rewards. Compared with
existing sample-efficient online RL algorithms with general function
approximations, \texttt{MEX} achieves similar sample efficiency while enjoying
a lower computational cost and is more compatible with modern deep RL methods.
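The key idea described above is a single unconstrained objective that adds a planning (value) term to an estimation term. Below is a minimal model-free sketch of that idea; the squared-Bellman-error loss, the value estimate, and the temperature `eta` are placeholder choices for illustration and are not claimed to match the paper's exact objective.

```python
import torch

def mex_style_loss(q_net, target_net, batch, gamma=0.99, eta=1.0):
    """Single unconstrained objective: estimation loss minus an estimated-value (planning) term."""
    s, a, r, s_next, done = batch                        # tensors from a replay buffer (assumed)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    estimation_loss = ((q_sa - target) ** 2).mean()      # fit the data (estimation)
    planning_term = q_net(s).max(dim=1).values.mean()    # favour high estimated value (planning)
    # Minimizing this scalar trades off fitting the data against optimism about value.
    return eta * estimation_loss - planning_term
```

A standard optimizer step on this scalar (e.g., Adam on `q_net.parameters()`) updates estimation and planning jointly, with no constrained inner optimization or level-set search.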
The Role of Coverage in Online Reinforcement Learning
Coverage conditions -- which assert that the data logging distribution
adequately covers the state space -- play a fundamental role in determining the
sample complexity of offline reinforcement learning. While such conditions
might seem irrelevant to online reinforcement learning at first glance, we
establish a new connection by showing -- somewhat surprisingly -- that the mere
existence of a data distribution with good coverage can enable sample-efficient
online RL. Concretely, we show that coverability -- that is, existence of a
data distribution that satisfies a ubiquitous coverage condition called
concentrability -- can be viewed as a structural property of the underlying
MDP, and can be exploited by standard algorithms for sample-efficient
exploration, even when the agent does not know said distribution. We complement
this result by proving that several weaker notions of coverage, despite being
sufficient for offline RL, are insufficient for online RL. We also show that
existing complexity measures for online RL, including Bellman rank and
Bellman-Eluder dimension, fail to optimally capture coverability, and propose a
new complexity measure, the sequential extrapolation coefficient, to provide a
unification.
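For concreteness, the sketch below shows what the coverage quantities in question look like in a tabular setting: given state-action occupancy measures for a set of policies, it computes the concentrability of a candidate data distribution (the worst-case density ratio) and the concentrability achieved by one natural choice of data distribution, which upper-bounds coverability. The normalization and the finite policy set are assumptions made for this sketch; the sequential extrapolation coefficient introduced in the paper is not reproduced here.

```python
import numpy as np

# occupancies: array of shape (num_policies, S, A); each slice is a state-action occupancy measure.

def concentrability(occupancies, mu, eps=1e-12):
    """Worst-case density ratio max over policies and (s, a) of d^pi(s, a) / mu(s, a)."""
    return (occupancies / (mu + eps)).max()

def coverability_upper_bound(occupancies):
    """Concentrability of a natural data distribution:
    mu proportional to the per-(s, a) maximum occupancy over the policy class."""
    pointwise_max = occupancies.max(axis=0)
    mu = pointwise_max / pointwise_max.sum()
    return concentrability(occupancies, mu)
```

Coverability is the infimum of concentrability over all data distributions, so any specific choice of `mu`, like the one above, only certifies an upper bound on it.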
- …