
    Efficient Algorithms for Generalized Linear Bandits with Heavy-tailed Rewards

    This paper investigates the problem of generalized linear bandits with heavy-tailed rewards, whose $(1+\epsilon)$-th moment is bounded for some $\epsilon \in (0,1]$. Although there exist methods for generalized linear bandits, most of them focus on bounded or sub-Gaussian rewards and are not well-suited for many real-world scenarios, such as financial markets and web advertising. To address this issue, we propose two novel algorithms based on truncation and mean of medians. These algorithms achieve an almost optimal regret bound of $\widetilde{O}(dT^{\frac{1}{1+\epsilon}})$, where $d$ is the dimension of contextual information and $T$ is the time horizon. Our truncation-based algorithm supports online learning, distinguishing it from existing truncation-based approaches. Additionally, our mean-of-medians-based algorithm requires only $O(\log T)$ rewards and one estimator per epoch, making it more practical. Moreover, our algorithms improve the regret bounds by a logarithmic factor compared to existing algorithms when $\epsilon = 1$. Numerical experimental results confirm the merits of our algorithms.
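    As a rough illustration of the truncation primitive the abstract refers to (not the authors' bandit algorithm, whose truncation schedule is tied to the generalized-linear-model updates), the sketch below clips rewards at a sample-size-dependent threshold before averaging; the `truncated_mean` helper and its $n^{1/(1+\epsilon)}$ threshold schedule are illustrative assumptions.

```python
import numpy as np

def truncated_mean(rewards, eps, c=1.0):
    """Truncation-based robust mean for heavy-tailed samples.

    Observations are clipped at a threshold growing like n**(1/(1+eps)),
    a standard schedule when only the (1+eps)-th moment is bounded
    (illustrative choice, not the paper's exact tuning).
    """
    r = np.asarray(rewards, dtype=float)
    threshold = c * len(r) ** (1.0 / (1.0 + eps))
    return np.clip(r, -threshold, threshold).mean()

# Pareto rewards here have a finite (1+eps)-th moment only for eps < 0.8,
# so the naive sample mean is volatile while the truncated mean is stable.
rng = np.random.default_rng(0)
samples = rng.pareto(a=1.8, size=5000)
print(samples.mean(), truncated_mean(samples, eps=0.5))
```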

    Robust Offline Policy Evaluation and Optimization with Heavy-Tailed Rewards

    This paper endeavors to augment the robustness of offline reinforcement learning (RL) in scenarios laden with heavy-tailed rewards, a prevalent circumstance in real-world applications. We propose two algorithmic frameworks, ROAM and ROOM, for robust off-policy evaluation (OPE) and offline policy optimization (OPO), respectively. Central to our frameworks is the strategic incorporation of the median-of-means method with offline RL, enabling straightforward uncertainty estimation for the value function estimator. This not only adheres to the principle of pessimism in OPO but also adeptly manages heavy-tailed rewards. Theoretical results and extensive experiments demonstrate that our two frameworks outperform existing methods when the logged dataset exhibits heavy-tailed reward distributions.
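    The median-of-means primitive at the core of ROAM and ROOM is easy to sketch; the pessimistic adjustment below (a median-absolute-deviation penalty with weight `kappa`) is a hypothetical stand-in for the paper's uncertainty quantification, not its actual construction.

```python
import numpy as np

def mom_value_with_pessimism(returns, num_blocks=10, kappa=1.0):
    """Median-of-means estimate of a policy's value from logged returns,
    plus a crude pessimistic lower bound built from the spread of the
    block means (hypothetical proxy for a principled uncertainty bonus).
    """
    r = np.asarray(returns, dtype=float)
    block_means = np.array([b.mean() for b in np.array_split(r, num_blocks)])
    estimate = np.median(block_means)
    spread = np.median(np.abs(block_means - estimate))  # MAD of block means
    return estimate, estimate - kappa * spread

# Logged returns with infinite variance (Student-t, df = 1.5) but finite mean.
rng = np.random.default_rng(1)
logged_returns = 1.0 + 0.1 * rng.standard_t(df=1.5, size=1000)
print(mom_value_with_pessimism(logged_returns))
```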

    Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

    While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action space exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon \in (0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K})$. Here, $H$ is the length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moment of reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
    Comment: NeurIPS 202
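    Regret analyses in this line of work typically rest on a robust regression oracle for the reward model; the sketch below implements a generic Huber-loss ridge regression via iteratively reweighted least squares as an illustrative assumption, without claiming it is the estimator inside \textsc{Heavy-OFUL} or \textsc{Heavy-LSVI-UCB}.

```python
import numpy as np

def huber_ridge_irls(X, y, tau=1.0, lam=1.0, n_iter=50):
    """Huber-loss ridge regression via iteratively reweighted least squares.

    Residuals larger than tau are down-weighted, limiting the influence of
    heavy-tailed reward noise on the parameter estimate (generic robust
    regression, not necessarily the paper's estimator).
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        resid = y - X @ theta
        w = np.where(np.abs(resid) <= tau,
                     1.0, tau / np.maximum(np.abs(resid), 1e-12))
        A = X.T @ (w[:, None] * X) + lam * np.eye(d)
        theta = np.linalg.solve(A, X.T @ (w * y))
    return theta

# Linear rewards corrupted by Student-t noise with infinite variance:
# the robust fit stays close to theta_star while least squares degrades.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
theta_star = rng.normal(size=5)
y = X @ theta_star + rng.standard_t(df=2.0, size=500)
print(np.linalg.norm(huber_ridge_irls(X, y, tau=2.0) - theta_star))
print(np.linalg.norm(np.linalg.lstsq(X, y, rcond=None)[0] - theta_star))
```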