251 research outputs found
Efficient Algorithms for Generalized Linear Bandits with Heavy-tailed Rewards
This paper investigates the problem of generalized linear bandits with
heavy-tailed rewards, whose $(1+\epsilon)$-th moment is bounded for some
$\epsilon \in (0,1]$. Although there exist methods for generalized linear
bandits, most of them focus on bounded or sub-Gaussian rewards and are not
well-suited for many real-world scenarios, such as financial markets and
web-advertising. To address this issue, we propose two novel algorithms based
on truncation and mean of medians. These algorithms achieve an almost optimal
regret bound of $\widetilde{O}(dT^{\frac{1}{1+\epsilon}})$, where $d$ is the
dimension of contextual information and $T$ is the time horizon. Our
truncation-based algorithm supports online learning, distinguishing it from
existing truncation-based approaches. Additionally, our mean-of-medians-based
algorithm requires only $O(\log T)$ rewards and one estimator per epoch, making
it more practical. Moreover, our algorithms improve the regret bounds by a
logarithmic factor compared to existing algorithms when $\epsilon = 1$.
Numerical experimental results confirm the merits of our algorithms.
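As a loose, generic illustration of the truncation idea (not the paper's exact
online procedure), the sketch below implements the classical truncated
empirical mean for heavy-tailed samples; the threshold schedule follows the
standard recipe from the heavy-tailed bandits literature, and the inputs `u`
(an assumed bound on the $(1+\epsilon)$-th raw moment) and `delta` (target
failure probability) are parameters the user must supply.

```python
import numpy as np

def truncated_mean(x, epsilon, u, delta):
    """Truncated empirical mean for heavy-tailed samples.

    Sample x_i is zeroed out when |x_i| exceeds a threshold that grows
    with i; clipping rare extreme observations trades a small bias for
    much lighter tails of the estimation error.
    Assumes E|X|^(1+epsilon) <= u and target failure probability delta.
    """
    x = np.asarray(x, dtype=float)
    i = np.arange(1, len(x) + 1)
    threshold = (u * i / np.log(1.0 / delta)) ** (1.0 / (1.0 + epsilon))
    return float(np.mean(np.where(np.abs(x) <= threshold, x, 0.0)))
```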
Bandits with heavy tail
The stochastic multi-armed bandit problem is well understood when the reward
distributions are sub-Gaussian. In this paper we examine the bandit problem
under the weaker assumption that the distributions have moments of order
$1+\epsilon$, for some $\epsilon \in (0,1]$. Surprisingly, moments of order 2
(i.e., finite variance) are sufficient to obtain regret bounds of the same
order as under sub-Gaussian reward distributions. In order to achieve such
regret, we define sampling strategies based on refined estimators of the mean
such as the truncated empirical mean, Catoni's M-estimator, and the
median-of-means estimator. We also derive matching lower bounds showing that
the best achievable regret deteriorates when $\epsilon < 1$.
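The other two estimators named above admit equally compact sketches; in the
code below the number of blocks `k` and the scale `alpha` are illustrative
tuning parameters, not values prescribed by the paper.

```python
import numpy as np

def median_of_means(x, k):
    """Median-of-means: split the samples into k blocks, average each
    block, and return the median of the block means. Choosing
    k ~ log(1/delta) blocks yields sub-Gaussian-like deviations under
    only a finite-variance assumption."""
    blocks = np.array_split(np.asarray(x, dtype=float), k)
    return float(np.median([b.mean() for b in blocks]))

def catoni(x, alpha, iters=100):
    """Catoni's M-estimator: the root of sum_i psi(alpha * (x_i - theta))
    with the soft influence function psi(t) = sign(t) * log(1+|t|+t^2/2).
    Solved by bisection; alpha trades bias against tail robustness."""
    x = np.asarray(x, dtype=float)
    psi = lambda t: np.sign(t) * np.log1p(np.abs(t) + 0.5 * t * t)
    f = lambda theta: np.sum(psi(alpha * (x - theta)))
    lo, hi = x.min() - 1.0, x.max() + 1.0  # f(lo) > 0 > f(hi)
    for _ in range(iters):  # f is decreasing in theta
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)
```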
Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds
While numerous works have focused on devising efficient algorithms for
reinforcement learning (RL) with uniformly bounded rewards, it remains an open
question whether sample or time-efficient algorithms for RL with large
state-action space exist when the rewards are \emph{heavy-tailed}, i.e., with
only finite $(1+\epsilon)$-th moments for some $\epsilon \in (0,1]$. In this
work, we address the challenge of such rewards in RL with linear function
approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for
heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round
regret of $\tilde{O}\big(dT^{\frac{1-\epsilon}{2(1+\epsilon)}}\sqrt{\sum_{t=1}^T \nu_t^2} + dT^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the
\emph{first} of this kind. Here, $d$ is the feature dimension, and
$\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at
the $t$-th round. We further show the above bound is minimax optimal when
applied to the worst-case instances in stochastic and deterministic linear
bandits. We then extend this algorithm to the RL settings with linear function
approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the
\emph{first} computationally efficient \emph{instance-dependent} $K$-episode
regret of $\tilde{O}\big(d\sqrt{H\mathcal{U}^*}K^{\frac{1}{1+\epsilon}} + d\sqrt{H\mathcal{V}^*K}\big)$.
Here, $H$ is the length of the episode, $K$ is the number of episodes, and
$\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with
the central moment of reward and value functions, respectively. We also provide
a matching minimax lower bound to demonstrate the optimality of our algorithm in the worst
case. Our result is achieved via a novel robust self-normalized concentration
inequality that may be of independent interest in handling heavy-tailed noise
in general online regression problems.
Comment: NeurIPS 2023
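Robust regression of this flavor is typically built around a Huber-type loss,
whose gradient clips large residuals. The sketch below is a generic
ridge-regularized Huber regression solved by gradient descent, offered purely
as an illustration (it is not the paper's adaptive, self-normalized
estimator); `tau`, `lam`, `lr`, and `iters` are assumed tuning parameters.

```python
import numpy as np

def huber_regression(X, y, tau, lam=1.0, lr=0.05, iters=2000):
    """Ridge-regularized Huber regression via gradient descent.

    The Huber loss is quadratic for residuals |r| <= tau and linear
    beyond, so its gradient clips each residual to [-tau, tau]; a few
    huge heavy-tailed observations therefore cannot dominate the fit
    the way they can in ordinary least squares.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        r = y - X @ theta
        psi = np.clip(r, -tau, tau)          # clipped residuals
        grad = -(X.T @ psi) / n + lam * theta / n
        theta -= lr * grad
    return theta
```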
Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk
We study the trade-off between expectation and tail risk for regret
distribution in the stochastic multi-armed bandit problem. We fully
characterize the interplay among three desired properties for policy design:
worst-case optimality, instance-dependent consistency, and light-tailed risk.
We show how the order of expected regret exactly affects the decaying rate of
the regret tail probability for both the worst-case and instance-dependent
scenarios. A novel policy is proposed to characterize the optimal regret tail
probability for any regret threshold. Concretely, for any given
$\alpha \in [1/2, 1)$ and $\beta \in [0, \alpha]$, our policy achieves a
worst-case expected regret of $\tilde{O}(T^\alpha)$ (we call it
$\alpha$-optimal) and an instance-dependent expected regret of
$\tilde{O}(T^\beta)$ (we call it $\beta$-consistent), while enjoying a
probability of incurring an $\tilde{O}(T^\delta)$ regret ($\delta \geq \alpha$
in the worst-case scenario and $\delta \geq \beta$ in the instance-dependent
scenario) that decays exponentially with a polynomial $T$ term. Such a decay
rate is proved to be the best achievable. Moreover, we discover
an intrinsic gap in the optimal tail rate under the instance-dependent
scenario depending on whether the time horizon is known a priori. Interestingly,
when it comes to the worst-case scenario, this gap disappears. Finally, we
extend our proposed policy design to (1) a stochastic multi-armed bandit
setting with non-stationary baseline rewards, and (2) a stochastic linear
bandit setting. Our results reveal insights on the trade-off between regret
expectation and regret tail risk for both worst-case and instance-dependent
scenarios, indicating that more sub-optimality and inconsistency leave space
for more light-tailed risk of incurring a large regret, and that knowing the
planning horizon in advance can make a difference on alleviating tail risks
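The central quantity here, the regret tail probability, can be estimated
empirically for any candidate policy. The Monte-Carlo sketch below does so for
a Gaussian multi-armed bandit; `policy` is a hypothetical stand-in for
whichever bandit algorithm is being evaluated, and all other parameters are
illustrative.

```python
import numpy as np

def regret_tail_prob(policy, means, T, threshold, runs=1000, seed=0):
    """Monte-Carlo estimate of P(Regret_T >= threshold) for a policy on
    a Gaussian bandit with the given arm means. `policy(history)` maps
    the list of (arm, reward) pairs observed so far to the next arm
    index; pseudo-regret is accumulated against the best arm."""
    rng = np.random.default_rng(seed)
    best = max(means)
    exceed = 0
    for _ in range(runs):
        history, regret = [], 0.0
        for _ in range(T):
            arm = policy(history)
            reward = rng.normal(means[arm], 1.0)
            history.append((arm, reward))
            regret += best - means[arm]
        exceed += regret >= threshold
    return exceed / runs
```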
- …