112 research outputs found
Model-based Reinforcement Learning and the Eluder Dimension
We consider the problem of learning to optimize an unknown Markov decision
process (MDP). We show that, if the MDP can be parameterized within some known
function class, we can obtain regret bounds that scale with the dimensionality,
rather than cardinality, of the system. We characterize this dependence
explicitly as $\tilde{O}(\sqrt{d_K d_E T})$, where $T$ is time elapsed, $d_K$ is
the Kolmogorov dimension and $d_E$ is the \emph{eluder dimension}. These
represent the first unified regret bounds for model-based reinforcement
learning and provide state-of-the-art guarantees in several important settings.
Moreover, we present a simple and computationally efficient algorithm,
\emph{posterior sampling for reinforcement learning} (PSRL), that satisfies
these bounds.
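As a rough illustration of the PSRL loop (a minimal tabular sketch assuming a Dirichlet posterior over transitions and a Gaussian-style posterior over mean rewards; the paper's analysis covers general parameterized model classes, and none of this is the authors' code):

```python
import numpy as np

# Minimal tabular PSRL sketch: sample an MDP from the posterior, plan in it,
# act greedily for one episode, then update the posterior with observed data.
# (Illustrative assumptions throughout; not the paper's implementation.)
rng = np.random.default_rng(0)
S, A, H = 5, 2, 10                                  # states, actions, horizon
P_true = rng.dirichlet(np.ones(S), size=(S, A))     # unknown transition kernel
R_true = rng.uniform(size=(S, A))                   # unknown mean rewards

trans_counts = np.ones((S, A, S))                   # Dirichlet(1, ..., 1) prior
r_sum, r_cnt = np.zeros((S, A)), np.ones((S, A))    # sufficient stats for rewards

def plan(P, R):
    """Finite-horizon value iteration; greedy policy for each step h."""
    V, pi = np.zeros(S), np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                               # Q[s, a] backup
        pi[h], V = Q.argmax(axis=1), Q.max(axis=1)
    return pi

for episode in range(200):
    # Sample one plausible MDP from the posterior (the heart of PSRL).
    P_hat = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                      for s in range(S)])
    R_hat = r_sum / r_cnt + rng.standard_normal((S, A)) / np.sqrt(r_cnt)
    pi = plan(P_hat, R_hat)                         # act as if the sample were true
    s = 0
    for h in range(H):
        a = pi[h, s]
        s_next = rng.choice(S, p=P_true[s, a])
        r = R_true[s, a] + 0.1 * rng.standard_normal()
        trans_counts[s, a, s_next] += 1             # posterior update
        r_sum[s, a] += r
        r_cnt[s, a] += 1
        s = s_next
```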
Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning
Finding unified complexity measures and algorithms for sample-efficient
learning is a central topic of research in reinforcement learning (RL). The
Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al.
(2021) as a necessary and sufficient complexity measure for sample-efficient
no-regret RL. This paper makes progress towards a unified theory for RL with
the DEC framework. First, we propose two new DEC-type complexity measures:
Explorative DEC (EDEC), and Reward-Free DEC (RFDEC). We show that they are
necessary and sufficient for sample-efficient PAC learning and reward-free
learning, thereby extending the original DEC, which captures only no-regret
learning. Next, we design new unified sample-efficient algorithms for all three
learning goals. Our algorithms instantiate variants of the
Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model
estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA
improves upon the algorithms of Foster et al. (2021), which require either
bounding a variant of the DEC that may be prohibitively large, or designing
problem-specific estimation subroutines. As applications, we recover existing
and obtain new sample-efficient learning results for a wide range of tractable
RL problems using essentially a single algorithm. We also generalize the DEC to
give sample-efficient algorithms for all-policy model estimation, with
applications for learning equilibria in Markov Games. Finally, as a connection,
we re-analyze two existing optimistic model-based algorithms based on Posterior
Sampling or Maximum Likelihood Estimation, showing that they enjoy regret
bounds similar to those of E2D-TA under structural conditions similar to the DEC.
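For reference, the no-regret DEC this abstract builds on can be written as follows, loosely following Foster et al. (2021) (notation may differ slightly from the paper's):

```latex
% DEC of a model class \mathcal{M} at reference model \bar{M}, with
% estimation-error weight \gamma > 0 and decision space \Pi:
\operatorname{dec}_{\gamma}(\mathcal{M}, \bar{M})
  = \inf_{p \in \Delta(\Pi)} \sup_{M \in \mathcal{M}}
    \mathbb{E}_{\pi \sim p}\!\left[
        f^{M}(\pi_{M}) - f^{M}(\pi)
        - \gamma \, D_{\mathrm{H}}^{2}\!\big(M(\pi), \bar{M}(\pi)\big)
    \right]
```

Here $f^{M}(\pi)$ is the value of decision $\pi$ under model $M$, $\pi_{M}$ is the optimal decision for $M$, and $D_{\mathrm{H}}^{2}$ is the squared Hellinger distance between observation distributions. The abstract's EDEC and RFDEC are DEC-type variants of this quantity tailored to PAC and reward-free learning.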
The Role of Coverage in Online Reinforcement Learning
Coverage conditions -- which assert that the data logging distribution
adequately covers the state space -- play a fundamental role in determining the
sample complexity of offline reinforcement learning. While such conditions
might seem irrelevant to online reinforcement learning at first glance, we
establish a new connection by showing -- somewhat surprisingly -- that the mere
existence of a data distribution with good coverage can enable sample-efficient
online RL. Concretely, we show that coverability -- that is, existence of a
data distribution that satisfies a ubiquitous coverage condition called
concentrability -- can be viewed as a structural property of the underlying
MDP, and can be exploited by standard algorithms for sample-efficient
exploration, even when the agent does not know said distribution. We complement
this result by proving that several weaker notions of coverage, despite being
sufficient for offline RL, are insufficient for online RL. We also show that
existing complexity measures for online RL, including Bellman rank and
Bellman-Eluder dimension, fail to optimally capture coverability, and propose a
new complexity measure, the sequential extrapolation coefficient, to provide a
unification.
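The coverability condition at the center of this abstract can be stated as a worked definition (a paraphrase of the abstract with assumed notation, not quoted from the paper):

```latex
% Coverability: the best concentrability any data distribution could
% achieve, viewed as a structural property of the MDP itself.
C_{\mathrm{cov}}
  = \inf_{\mu_{1}, \ldots, \mu_{H} \in \Delta(\mathcal{S} \times \mathcal{A})}
    \max_{h \in [H]} \; \sup_{\pi} \,
    \left\| \frac{d_{h}^{\pi}}{\mu_{h}} \right\|_{\infty}
```

where $d_{h}^{\pi}$ is the state-action occupancy measure of policy $\pi$ at step $h$. Finiteness of $C_{\mathrm{cov}}$ requires only that a good logging distribution exists; per the abstract, the online learner never needs to know it.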
Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning
Provably efficient Model-Based Reinforcement Learning (MBRL) based on
optimism or posterior sampling (PSRL) attains global optimality asymptotically
at a rate governed by a complexity measure of the model class. However, this
complexity can grow exponentially even for the simplest nonlinear models, in
which case global convergence within finitely many iterations is impossible.
When the model suffers a large generalization error, as quantified by the
model complexity, the uncertainty can be large. The sampled model on which the
current policy is greedily optimized is then unstable, resulting in aggressive
policy updates and over-exploration. In this work, we propose Conservative
Dual Policy Optimization (CDPO), which consists of a Referential Update and a
Conservative Update. The policy is first optimized under a reference model,
which imitates the mechanism of PSRL while offering more stability. A
conservative range of randomness is then guaranteed by maximizing the
expectation of the model value. Without harmful sampling procedures, CDPO
still achieves the same regret as PSRL. More importantly, CDPO enjoys
monotonic policy improvement and global optimality simultaneously. Empirical
results also validate the exploration efficiency of CDPO.
Comment: Published at NeurIPS 2022
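A toy tabular rendering of the two updates the abstract names (a paraphrase of the abstract only; the posterior, the candidate set, and every helper below are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

# Toy tabular sketch of CDPO's Referential and Conservative updates.
# (Paraphrase of the abstract; candidate set and posterior are assumptions.)
rng = np.random.default_rng(1)
S, A, H = 4, 2, 8
counts = np.ones((S, A, S))                    # Dirichlet posterior over transitions
R = rng.uniform(size=(S, A))                   # rewards assumed known for brevity

def plan(P):
    V, pi = np.zeros(S), np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V
        pi[h], V = Q.argmax(axis=1), Q.max(axis=1)
    return pi

def value(P, pi):
    """Exact policy evaluation of pi under model P (value at state 0)."""
    V = np.zeros(S)
    for h in reversed(range(H)):
        V = np.array([R[s, pi[h, s]] + P[s, pi[h, s]] @ V for s in range(S)])
    return V[0]

def sample_model():
    return np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                     for s in range(S)])

# Referential Update: optimize the policy under one reference model,
# imitating PSRL's sampled-model planning but kept fixed within the iteration.
reference_pi = plan(sample_model())

# Conservative Update: among a small candidate set around the reference,
# keep the policy maximizing the EXPECTED value over the model posterior,
# rather than the value under any single sampled model.
models = [sample_model() for _ in range(16)]   # Monte Carlo posterior estimate
candidates = [reference_pi] + [plan(m) for m in models[:4]]
best = max(candidates, key=lambda pi: np.mean([value(m, pi) for m in models]))
```

The point of the second step is that the greedy target is an expectation over the posterior rather than a single sampled model, which is what tempers the aggressive updates described above.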
A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation
The exploration-exploitation dilemma has been a central challenge in
reinforcement learning (RL) with complex model classes. In this paper, we
propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound
(MQL-UCB) for RL with general function approximation. Our key algorithmic
design includes (1) a general deterministic policy-switching strategy that
achieves low switching cost, (2) a monotonic value function structure with
carefully controlled function class complexity, and (3) a variance-weighted
regression scheme that exploits historical trajectories with high data
efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$
when $K$ is sufficiently large and near-optimal policy switching cost of
$\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class,
$H$ being the planning horizon, and $K$ being the number of episodes.
Our work sheds light on designing provably sample-efficient and
deployment-efficient Q-learning with nonlinear function approximation.
Comment: 52 pages, 1 table
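The variance-weighted regression ingredient, in isolation, is roughly inverse-variance-weighted ridge regression (a generic sketch of the idea, not MQL-UCB's exact estimator; the variance estimates are given here, whereas the algorithm must construct them):

```python
import numpy as np

# Variance-weighted ridge regression: down-weight high-variance targets.
# (Generic sketch of the ingredient, not MQL-UCB's estimator; the per-sample
# variances sigma2 are assumed known for illustration.)
def weighted_ridge(X, y, sigma2, lam=1.0):
    """argmin_w sum_i (x_i^T w - y_i)^2 / sigma2_i + lam * ||w||^2."""
    W = 1.0 / sigma2                              # weights = inverse variance
    A = (X * W[:, None]).T @ X + lam * np.eye(X.shape[1])
    b = (X * W[:, None]).T @ y
    return np.linalg.solve(A, b)

rng = np.random.default_rng(2)
d, n = 3, 500
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
sigma2 = rng.uniform(0.05, 4.0, size=n)          # heteroscedastic noise levels
y = X @ w_true + np.sqrt(sigma2) * rng.standard_normal(n)
w_hat = weighted_ridge(X, y, sigma2)             # sharper than unweighted OLS
```

Down-weighting noisy targets is the standard route to the sharper, variance-dependent error bounds that underlie minimax-optimal rates of this kind.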
VOQL: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation
We study time-inhomogeneous episodic reinforcement learning (RL) under
general function approximation and sparse rewards. We design a new algorithm,
Variance-weighted Optimistic $Q$-Learning (VOQL), based on $Q$-learning and
bound its regret assuming completeness and bounded Eluder dimension for the
regression function class. As a special case, VOQL achieves
$\tilde{O}(d\sqrt{HT} + d^6H^5)$ regret over $T$ episodes for a horizon-$H$ MDP
under ($d$-dimensional) linear function approximation, which is asymptotically
optimal. Our algorithm incorporates weighted regression-based upper and lower
bounds on the optimal value function to obtain this improved regret. The
algorithm is computationally efficient given a regression oracle over the
function class, making this the first computationally tractable and
statistically optimal approach for linear MDPs.
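The upper/lower confidence pattern the abstract describes can be sketched with the standard elliptical bonus from ridge regression (an illustrative stand-in only; VOQL's actual bounds come from variance-weighted regression, and `beta` below is a placeholder for a properly calibrated confidence radius):

```python
import numpy as np

# Optimism/pessimism from a regression fit plus an elliptical confidence
# bonus (the textbook linear form; a stand-in for VOQL's variance-weighted
# construction, not the paper's algorithm).
def confidence_bounds(X, y, x_query, lam=1.0, beta=1.0):
    d = X.shape[1]
    Sigma = X.T @ X + lam * np.eye(d)             # regularized Gram matrix
    w_hat = np.linalg.solve(Sigma, X.T @ y)       # ridge regression fit
    mean = x_query @ w_hat
    width = beta * np.sqrt(x_query @ np.linalg.solve(Sigma, x_query))
    return mean - width, mean + width             # (lower, upper) bounds

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 4))
y = X @ np.array([1.0, -0.5, 0.2, 0.0]) + 0.1 * rng.standard_normal(200)
lo, hi = confidence_bounds(X, y, rng.standard_normal(4))
```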
- …