16 research outputs found
No-Regret Reinforcement Learning with Value Function Approximation: a Kernel Embedding Approach
We consider the regret minimization problem in reinforcement learning (RL) in
the episodic setting. In many real-world RL environments, the state and action
spaces are continuous or very large. Existing approaches establish regret
guarantees by either a low-dimensional representation of the stochastic
transition model or an approximation of the $Q$-functions. However, the
understanding of function approximation schemes for state-value functions
largely remains missing. In this paper, we propose an online model-based RL
algorithm, namely the CME-RL, that learns representations of transition
distributions as embeddings in a reproducing kernel Hilbert space while
carefully balancing the exploitation-exploration tradeoff. We demonstrate the
efficiency of our algorithm by proving a frequentist (worst-case) regret bound
that is of order $\tilde{\mathcal{O}}\big(H\gamma_N\sqrt{N}\big)$, where $H$ is the
episode length, $N$ is the total number of time steps, and $\gamma_N$ is an
information-theoretic quantity relating to the effective dimension of the
state-action feature space. Our method bypasses the need for estimating
transition probabilities and applies to any domain on which kernels can be
defined. It also brings new insights into the general theory of kernel methods
for approximate inference and RL regret minimization.
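The conditional mean embedding (CME) idea at the core of CME-RL can be illustrated with a small kernel-ridge-regression sketch. The code below is a minimal, hypothetical illustration rather than the authors' implementation: it estimates the embedding of the transition distribution $P(\cdot\mid s,a)$ from logged transitions and uses it to approximate the expected next-state value $\mathbb{E}[V(s')]$; the RBF kernel, regularization constant, and all variable names are assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def cme_expected_value(sa_data, s_next_data, V, sa_query, lam=1e-2):
    """
    Estimate E[V(s') | s, a] via a conditional mean embedding.

    sa_data:      (n, d_sa) observed state-action pairs
    s_next_data:  (n, d_s)  observed next states
    V:            callable mapping an array of states to values
    sa_query:     (m, d_sa) state-action pairs to evaluate
    """
    n = sa_data.shape[0]
    K = rbf_kernel(sa_data, sa_data)        # Gram matrix on state-action inputs
    k_q = rbf_kernel(sa_data, sa_query)     # cross-kernel to the query points
    # Kernel-ridge weights: alpha[:, j] are the CME coefficients for query j.
    alpha = np.linalg.solve(K + lam * n * np.eye(n), k_q)
    # Expected value is a weighted sum of V at the observed next states.
    return alpha.T @ V(s_next_data)

# Toy usage: 2-D state-action input, 1-D next state, V(s) = -s^2.
rng = np.random.default_rng(0)
sa = rng.normal(size=(200, 2))
s_next = sa[:, :1] + 0.1 * rng.normal(size=(200, 1))   # noisy toy dynamics
est = cme_expected_value(sa, s_next, lambda s: -(s[:, 0] ** 2), sa[:5])
print(est)
```

The same coefficient vector can be reused for any value function, which is why this style of embedding avoids estimating transition probabilities explicitly.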
Sample Complexity of Offline Reinforcement Learning with Deep ReLU Networks
We study the statistical theory of offline reinforcement learning (RL) with
deep ReLU network function approximation. We analyze a variant of fitted-Q
iteration (FQI) algorithm under a new dynamic condition that we call Besov
dynamic closure, which encompasses the conditions from prior analyses for deep
neural network function approximation. Under Besov dynamic closure, we prove
that the FQI-type algorithm enjoys a sample complexity of
$\tilde{\mathcal{O}}\big(\kappa^{1+d/\alpha}\,\epsilon^{-2-2d/\alpha}\big)$, where $\kappa$ is a distribution shift measure, $d$ is the
dimensionality of the state-action space, $\alpha$ is the (possibly fractional)
smoothness parameter of the underlying MDP, and $\epsilon$ is a user-specified
precision. This improves over the sample complexity of
$\tilde{\mathcal{O}}\big(K\,\kappa^{2+2d/\alpha}\,\epsilon^{-2-2d/\alpha}\big)$ in the prior result [Yang et al., 2019], where $K$ is an
algorithmic iteration number which can be arbitrarily large in practice.
Importantly, our sample complexity is obtained under the new general dynamic
condition and a data-dependent structure, the latter of which is either ignored in
prior algorithms or improperly handled by prior analyses. This is the first
comprehensive analysis for offline RL with deep ReLU network function
approximation under a general setting.
Comment: A short version was published in the ICML Workshop on Reinforcement
Learning Theory, 2021.
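As a rough illustration of what a fitted-Q iteration (FQI) procedure looks like in code, the sketch below runs FQI with a small ReLU network on a fixed offline dataset. It is a minimal sketch under assumed names, shapes, and hyperparameters, not the paper's algorithm or analysis setting; it uses PyTorch for the function approximator and ignores terminal-state handling.

```python
import torch
import torch.nn as nn

def fitted_q_iteration(dataset, state_dim, action_count,
                       iterations=50, gamma=0.99, epochs=20, lr=1e-3):
    """
    dataset: tuple of tensors (states, actions, rewards, next_states)
             collected offline; actions are integer indices.
    Returns a ReLU network approximating Q(s, .).
    """
    states, actions, rewards, next_states = dataset
    q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                          nn.Linear(64, action_count))

    for _ in range(iterations):
        # Freeze the current network to build the regression targets.
        with torch.no_grad():
            target_q = q_net(next_states).max(dim=1).values
            targets = rewards + gamma * target_q

        # Fit Q(s, a) to the frozen targets by least squares.
        opt = torch.optim.Adam(q_net.parameters(), lr=lr)
        for _ in range(epochs):
            pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            loss = ((pred - targets) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return q_net

# Toy usage with random offline data (2-D states, 3 discrete actions).
n = 512
data = (torch.randn(n, 2), torch.randint(0, 3, (n,)),
        torch.randn(n), torch.randn(n, 2))
q = fitted_q_iteration(data, state_dim=2, action_count=3, iterations=5)
```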
Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature
This paper studies model-based bandit and reinforcement learning (RL) with
nonlinear function approximations. We propose to study convergence to
approximate local maxima because we show that global convergence is
statistically intractable even for a one-layer neural net bandit with a
deterministic reward. For both nonlinear bandit and RL, the paper presents a
model-based algorithm, Virtual Ascent with Online Model Learner (ViOlin), which
provably converges to a local maximum with sample complexity that only depends
on the sequential Rademacher complexity of the model class. Our results imply
novel global or local regret bounds in several concrete settings, such as linear
bandits with a finite or sparse model class, and two-layer neural net bandits. A
key algorithmic insight is that optimism may lead to over-exploration even for a
two-layer neural net model class. On the other hand, for convergence to local
maxima, it suffices to maximize the virtual return if the model can also
reasonably predict the size of the gradient and Hessian of the real return.
Comment: Added an instantiation (Example 4.3) of the RL theorem (Theorem 4.4)
and more references.
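To make the "shelve optimism, embrace virtual curvature" idea concrete, here is a minimal, hypothetical sketch of a virtual-ascent loop for a continuous-action bandit: an online-learned reward model is refit on the observed data each round, and the next action is obtained by plain gradient ascent on the model's predicted (virtual) reward rather than by an optimistic exploration bonus. The reward function, network size, and step sizes below are illustrative assumptions, not the paper's ViOlin algorithm or guarantees.

```python
import torch
import torch.nn as nn

# Hypothetical unknown reward function the learner can only query.
def true_reward(a):
    return -((a - 0.7) ** 2).sum(-1)

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))  # learned reward model
history_a, history_r = [], []
action = torch.zeros(2)

for round_ in range(200):
    # 1. Play the current action and record the noisy observed reward.
    reward = true_reward(action) + 0.01 * torch.randn(())
    history_a.append(action.clone())
    history_r.append(reward)

    # 2. Online model learning: refit the reward model on all observations.
    A = torch.stack(history_a)
    R = torch.stack(history_r).unsqueeze(1)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(50):
        loss = ((model(A) - R) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # 3. Virtual ascent: gradient ascent on the model's predicted reward
    #    (the virtual return), with no optimism bonus.
    a_var = action.clone().requires_grad_(True)
    a_opt = torch.optim.SGD([a_var], lr=0.1)
    for _ in range(10):
        virtual_loss = -model(a_var).sum()   # minimize negative predicted reward
        a_opt.zero_grad(); virtual_loss.backward(); a_opt.step()
    action = a_var.detach()

print("final action:", action, "true reward:", true_reward(action).item())
```

The point of the sketch is the structure of step 3: exploration comes only from following the model's own landscape toward a local maximum, which is where the model's ability to predict gradient and Hessian sizes of the real return matters.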