251 research outputs found

    Efficient Algorithms for Generalized Linear Bandits with Heavy-tailed Rewards

    This paper investigates the problem of generalized linear bandits with heavy-tailed rewards, whose $(1+\epsilon)$-th moment is bounded for some $\epsilon\in (0,1]$. Although there exist methods for generalized linear bandits, most of them focus on bounded or sub-Gaussian rewards and are not well suited for many real-world scenarios, such as financial markets and web advertising. To address this issue, we propose two novel algorithms based on truncation and mean of medians. These algorithms achieve an almost optimal regret bound of $\widetilde{O}(dT^{\frac{1}{1+\epsilon}})$, where $d$ is the dimension of contextual information and $T$ is the time horizon. Our truncation-based algorithm supports online learning, distinguishing it from existing truncation-based approaches. Additionally, our mean-of-medians-based algorithm requires only $O(\log T)$ rewards and one estimator per epoch, making it more practical. Moreover, our algorithms improve the regret bounds by a logarithmic factor compared to existing algorithms when $\epsilon=1$. Numerical experimental results confirm the merits of our algorithms.
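    As a rough illustration of the truncation idea (not the paper's algorithm), the sketch below maintains a truncated mean estimate online: each incoming reward is discarded when its magnitude exceeds a round-dependent threshold calibrated by an assumed bound `u` on the $(1+\epsilon)$-th moment and a confidence level `delta`. The threshold schedule follows the classical truncated-empirical-mean analysis and is an assumption here.

```python
# Illustrative sketch of online truncation for heavy-tailed rewards.
# The threshold b_t = (u * t / log(1/delta))^(1/(1+epsilon)) follows the classical
# truncated empirical mean; it is an assumption, not this paper's exact rule.
import math

class TruncatedRunningMean:
    def __init__(self, u, epsilon, delta):
        self.u = u              # assumed bound on E[|reward|^(1+epsilon)]
        self.epsilon = epsilon  # moment parameter, epsilon in (0, 1]
        self.delta = delta      # confidence level
        self.t = 0              # number of rewards seen so far
        self.total = 0.0        # running sum of the kept (non-truncated) rewards

    def update(self, reward):
        """One-pass update: keep the reward only if it is below the current threshold."""
        self.t += 1
        b_t = (self.u * self.t / math.log(1.0 / self.delta)) ** (1.0 / (1.0 + self.epsilon))
        if abs(reward) <= b_t:
            self.total += reward
        return self.total / self.t  # current truncated-mean estimate
```

    Because the estimate is kept as a single running sum, each reward is processed exactly once, which is the sense in which truncation-based estimation can support fully online learning.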

    Bandits with heavy tail

    The stochastic multi-armed bandit problem is well understood when the reward distributions are sub-Gaussian. In this paper we examine the bandit problem under the weaker assumption that the distributions have moments of order $1+\epsilon$, for some $\epsilon \in (0,1]$. Surprisingly, moments of order 2 (i.e., finite variance) are sufficient to obtain regret bounds of the same order as under sub-Gaussian reward distributions. In order to achieve such regret, we define sampling strategies based on refined estimators of the mean, such as the truncated empirical mean, Catoni's M-estimator, and the median-of-means estimator. We also derive matching lower bounds, which show that the best achievable regret deteriorates when $\epsilon < 1$.
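    Two of the robust mean estimators named above are sketched below as a minimal, self-contained illustration (not the paper's code). The confidence level `delta`, the variance bound passed to Catoni's estimator (which corresponds to the $\epsilon = 1$ case), and the number-of-blocks rule for median-of-means are the standard textbook choices and are assumptions here.

```python
# Minimal sketches of two robust mean estimators for heavy-tailed samples:
# the median-of-means estimator and Catoni's M-estimator.
import numpy as np

def median_of_means(x, delta=0.05):
    """Split the samples into ~8*log(1/delta) blocks, average each block,
    and return the median of the block means."""
    x = np.asarray(x, dtype=float)
    k = max(1, min(len(x), int(np.ceil(8 * np.log(1.0 / delta)))))  # number of blocks
    blocks = np.array_split(x, k)                                   # i.i.d. data, so order-based split is fine
    return float(np.median([b.mean() for b in blocks]))

def catoni_mean(x, variance_bound, delta=0.05, iters=60):
    """Catoni's M-estimator: the root of sum_i psi(alpha * (x_i - mu)) = 0, where
    psi(t) = sign(t) * log(1 + |t| + t^2/2). Requires a finite-variance bound,
    i.e. the epsilon = 1 case of the moment condition above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    psi = lambda t: np.sign(t) * np.log1p(np.abs(t) + 0.5 * t * t)
    alpha = np.sqrt(2.0 * np.log(1.0 / delta) / (n * variance_bound))  # simplified tuning
    lo, hi = x.min() - 1.0, x.max() + 1.0   # the root is bracketed by the sample range
    for _ in range(iters):                  # bisection on the monotone influence sum
        mid = 0.5 * (lo + hi)
        if psi(alpha * (x - mid)).sum() > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = rng.standard_t(df=2.5, size=2000) + 1.0   # heavy-tailed, true mean 1, variance 5
    print(rewards.mean(), median_of_means(rewards), catoni_mean(rewards, variance_bound=5.0))
```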

    Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

    While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action spaces exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K})$. Here, $H$ is the length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moments of the reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
    Comment: NeurIPS 202
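    The abstract does not spell out the estimator behind these bounds. A common robust-regression primitive in the heavy-tailed linear bandit literature is Huber-type regression, and the sketch below is a generic illustration of that idea only: the Huber threshold `tau`, the ridge weight `lam`, and the IRLS solver are assumptions for illustration, not a reproduction of \textsc{Heavy-OFUL} or its self-normalized analysis.

```python
# Generic sketch: ridge regression with a Huber loss for a linear reward model
# r_t ~ <theta, x_t> + heavy-tailed noise. Illustrative assumptions throughout.
import numpy as np

def huber_ridge(X, r, tau=1.0, lam=1.0, iters=100, tol=1e-8):
    """Minimize sum_t huber_tau(r_t - <theta, x_t>) + (lam/2) * ||theta||^2
    by iteratively reweighted least squares (IRLS)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        resid = r - X @ theta
        # Huber weights: full weight for small residuals, down-weighted for large ones.
        w = np.where(np.abs(resid) <= tau, 1.0, tau / np.maximum(np.abs(resid), 1e-12))
        A = X.T @ (w[:, None] * X) + lam * np.eye(d)
        b = X.T @ (w * r)
        theta_new = np.linalg.solve(A, b)
        if np.linalg.norm(theta_new - theta) < tol:
            theta = theta_new
            break
        theta = theta_new
    return theta
```

    In an optimism-based bandit loop, an estimate of this kind would be combined with a confidence ellipsoid to select actions; the abstract's contribution is the concentration machinery that keeps such confidence sets tight under only a $(1+\epsilon)$-th moment assumption.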

    Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk

    We study the trade-off between expectation and tail risk of the regret distribution in the stochastic multi-armed bandit problem. We fully characterize the interplay among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of the expected regret exactly affects the decay rate of the regret tail probability in both the worst-case and instance-dependent scenarios. A novel policy is proposed to characterize the optimal regret tail probability for any regret threshold. Concretely, for any given $\alpha\in[1/2, 1)$ and $\beta\in[0, \alpha]$, our policy achieves a worst-case expected regret of $\tilde O(T^\alpha)$ (we call it $\alpha$-optimal) and an instance-dependent expected regret of $\tilde O(T^\beta)$ (we call it $\beta$-consistent), while enjoying a probability of incurring an $\tilde O(T^\delta)$ regret ($\delta\geq\alpha$ in the worst-case scenario and $\delta\geq\beta$ in the instance-dependent scenario) that decays exponentially with a polynomial $T$ term. This decay rate is proved to be the best achievable. Moreover, we discover an intrinsic gap in the optimal tail rate under the instance-dependent scenario depending on whether the time horizon $T$ is known a priori or not. Interestingly, in the worst-case scenario this gap disappears. Finally, we extend our proposed policy design to (1) a stochastic multi-armed bandit setting with non-stationary baseline rewards, and (2) a stochastic linear bandit setting. Our results reveal insights on the trade-off between regret expectation and regret tail risk for both worst-case and instance-dependent scenarios, indicating that more sub-optimality and inconsistency leave space for a lighter-tailed risk of incurring a large regret, and that knowing the planning horizon in advance can make a difference in alleviating tail risks.
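    Purely to make the central object concrete, the sketch below estimates the regret tail probability $P(\mathrm{Regret}_T \geq T^\delta)$ by Monte Carlo for a standard UCB1 policy on a synthetic two-armed Gaussian instance. The policy, instance, horizon, and threshold exponent are all assumptions for illustration; the paper's proposed policy and its tail guarantees are not reproduced here.

```python
# Illustration only: Monte Carlo estimate of P(Regret_T >= T**delta_exp)
# for UCB1 on a two-armed Gaussian bandit (not the paper's proposed policy).
import numpy as np

def ucb1_regret(T, gap=0.2, rng=None):
    """Pseudo-regret of UCB1 over T rounds on a two-armed Gaussian instance."""
    rng = rng if rng is not None else np.random.default_rng()
    means = np.array([0.5, 0.5 - gap])
    counts = np.zeros(2)
    sums = np.zeros(2)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= 2:
            a = t - 1                                    # pull each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            a = int(np.argmax(ucb))
        sums[a] += rng.normal(means[a], 1.0)
        counts[a] += 1
        regret += means.max() - means[a]
    return regret

def tail_probability(T=2000, delta_exp=0.6, runs=200, seed=0):
    """Empirical frequency of the event {Regret_T >= T**delta_exp}."""
    rng = np.random.default_rng(seed)
    threshold = T ** delta_exp
    regrets = np.array([ucb1_regret(T, rng=rng) for _ in range(runs)])
    return float((regrets >= threshold).mean())

if __name__ == "__main__":
    print(tail_probability())
```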