
    No-Regret Reinforcement Learning with Value Function Approximation: a Kernel Embedding Approach

    We consider the regret minimization problem in reinforcement learning (RL) in the episodic setting. In many real-world RL environments, the state and action spaces are continuous or very large. Existing approaches establish regret guarantees by either a low-dimensional representation of the stochastic transition model or an approximation of the $Q$-functions. However, the understanding of function approximation schemes for state-value functions largely remains missing. In this paper, we propose an online model-based RL algorithm, namely CME-RL, that learns representations of transition distributions as embeddings in a reproducing kernel Hilbert space while carefully balancing the exploitation-exploration tradeoff. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$, where $H$ is the episode length, $N$ is the total number of time steps, and $\gamma_N$ is an information-theoretic quantity related to the effective dimension of the state-action feature space. Our method bypasses the need for estimating transition probabilities and applies to any domain on which kernels can be defined. It also brings new insights into the general theory of kernel methods for approximate inference and RL regret minimization.
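
    The sketch below illustrates the kernel object the abstract builds on: a conditional mean embedding (CME) of the transition distribution, estimated by kernel ridge regression and used to back up a value function without estimating transition probabilities. It is not the authors' CME-RL algorithm (no exploration bonus, no episodic structure); the Gaussian kernel, function names, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def cme_value_backup(SA, S_next, V_next, sa_query, lam=1e-1):
    """Estimate E[V(s') | s, a] at query state-action pairs via a CME.

    SA       : (n, d_sa) observed state-action pairs
    S_next   : (n, d_s)  observed next states
    V_next   : callable returning the value of each next state, shape (n,)
    sa_query : (m, d_sa) state-action pairs at which to back up the value
    """
    K = rbf_kernel(SA, SA)                            # Gram matrix on state-actions
    alpha = np.linalg.solve(K + lam * np.eye(len(SA)), V_next(S_next))
    k_q = rbf_kernel(sa_query, SA)                    # cross-kernel to the data
    return k_q @ alpha                                # predicted expected next-state value
```

    An optimistic, regret-minimizing variant would add an exploration bonus on top of this prediction (for example, a posterior-variance-like width term), which is where the balance between exploitation and exploration in the abstract comes in.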

    Sample Complexity of Offline Reinforcement Learning with Deep ReLU Networks

    We study the statistical theory of offline reinforcement learning (RL) with deep ReLU network function approximation. We analyze a variant of the fitted-Q iteration (FQI) algorithm under a new dynamic condition that we call Besov dynamic closure, which encompasses the conditions from prior analyses for deep neural network function approximation. Under Besov dynamic closure, we prove that the FQI-type algorithm enjoys a sample complexity of $\tilde{\mathcal{O}}\left( \kappa^{1 + d/\alpha} \cdot \epsilon^{-2 - 2d/\alpha} \right)$, where $\kappa$ is a distribution shift measure, $d$ is the dimensionality of the state-action space, $\alpha$ is the (possibly fractional) smoothness parameter of the underlying MDP, and $\epsilon$ is a user-specified precision. This improves on the sample complexity of $\tilde{\mathcal{O}}\left( K \cdot \kappa^{2 + d/\alpha} \cdot \epsilon^{-2 - d/\alpha} \right)$ in the prior result [Yang et al., 2019], where $K$ is an algorithmic iteration number which is arbitrarily large in practice. Importantly, our sample complexity is obtained under the new general dynamic condition and a data-dependent structure, where the latter is either ignored in prior algorithms or improperly handled by prior analyses. This is the first comprehensive analysis for offline RL with deep ReLU network function approximation under a general setting.
    Comment: A short version published in the ICML Workshop on Reinforcement Learning Theory, 202
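
    For readers unfamiliar with the algorithm being analyzed, the following is a minimal sketch of a fitted-Q iteration loop with a ReLU network on an offline dataset, written in PyTorch. The network width, optimizer, tensor shapes, and iteration counts are assumptions made for illustration; this is not the paper's algorithm or experimental setup, and the Besov dynamic closure condition is a property of the MDP and function class rather than something the code enforces.

```python
import torch
import torch.nn as nn

def fqi(S, A, R, S_next, n_actions, gamma=0.99, iters=50, epochs=100):
    """Offline FQI: S (n, d) states, A (n,) long-dtype actions, R (n,) rewards, S_next (n, d)."""
    d = S.shape[1]
    q_net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, n_actions))
    for _ in range(iters):
        with torch.no_grad():                       # freeze targets from the previous iterate
            targets = R + gamma * q_net(S_next).max(dim=1).values
        opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
        for _ in range(epochs):                     # regress Q(s, a) onto the frozen targets
            q_sa = q_net(S).gather(1, A.view(-1, 1)).squeeze(1)
            loss = ((q_sa - targets) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return q_net
```

    The sample-complexity question the abstract answers is, roughly, how many offline transitions such a loop needs before the greedy policy of the returned network is $\epsilon$-close to optimal, as a function of the distribution shift $\kappa$, dimension $d$, and smoothness $\alpha$.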

    Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature

    This paper studies model-based bandit and reinforcement learning (RL) with nonlinear function approximation. We propose to study convergence to approximate local maxima, because we show that global convergence is statistically intractable even for a one-layer neural net bandit with a deterministic reward. For both nonlinear bandits and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOL), which provably converges to a local maximum with sample complexity that depends only on the sequential Rademacher complexity of the model class. Our results imply novel global or local regret bounds in several concrete settings, such as linear bandits with a finite or sparse model class and two-layer neural net bandits. A key algorithmic insight is that optimism may lead to over-exploration even for a two-layer neural net model class. On the other hand, for convergence to local maxima, it suffices to maximize the virtual return if the model can also reasonably predict the size of the gradient and Hessian of the real return.
    Comment: Added an instantiation (Example 4.3) of the RL theorem (Theorem 4.4) and more references
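
    The following is a hypothetical sketch of the virtual-ascent idea in the nonlinear bandit setting: fit a reward model to the actions and rewards seen so far (online model learning), then gradient-ascend the model's predicted, i.e. virtual, reward to pick the next action instead of adding an optimism bonus. The two-layer network, optimizers, and step counts are illustrative assumptions, and the sketch omits the gradient/Hessian prediction conditions the paper requires; it is not the ViOL algorithm or its guarantees.

```python
import torch

def virtual_ascent_step(actions, rewards, a_init, model_epochs=200, ascent_steps=50):
    """actions: (n, d) past actions, rewards: (n,) observed rewards, a_init: (d,) starting action."""
    d = actions.shape[1]
    # Step 1: online model learning -- fit a two-layer ReLU reward model to the data so far.
    model = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(model_epochs):
        loss = ((model(actions).squeeze(1) - rewards) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # Step 2: virtual ascent -- maximize the model's predicted reward over the action.
    a = a_init.clone().requires_grad_(True)
    a_opt = torch.optim.SGD([a], lr=1e-1)
    for _ in range(ascent_steps):
        virtual_reward = model(a.unsqueeze(0)).squeeze()
        a_opt.zero_grad(); (-virtual_reward).backward(); a_opt.step()
    return a.detach()                                 # next action to play
```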