74 research outputs found
Efficient Algorithms for Generalized Linear Bandits with Heavy-tailed Rewards
This paper investigates the problem of generalized linear bandits with
heavy-tailed rewards, whose $(1+\epsilon)$-th moment is bounded for some
$\epsilon \in (0,1]$. Although there exist methods for generalized linear
bandits, most of them focus on bounded or sub-Gaussian rewards and are not
well-suited for many real-world scenarios, such as financial markets and
web-advertising. To address this issue, we propose two novel algorithms based
on truncation and mean of medians. These algorithms achieve an almost optimal
regret bound of $\widetilde{O}(dT^{\frac{1}{1+\epsilon}})$, where $d$ is the
dimension of contextual information and $T$ is the time horizon. Our
truncation-based algorithm supports online learning, distinguishing it from
existing truncation-based approaches. Additionally, our mean-of-medians-based
algorithm requires only $O(\log T)$ rewards and one estimator per epoch, making
it more practical. Moreover, our algorithms improve the regret bounds by a
logarithmic factor compared to existing algorithms when $\epsilon = 1$. Numerical
experimental results confirm the merits of our algorithms.
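As a concrete illustration of the two robust-mean ideas named above, here is a minimal Python sketch (hypothetical helper names; the paper's actual algorithms wrap such estimators in confidence-region updates for generalized linear models and schedule the truncation level using the moment bound):

import numpy as np

def truncated_mean(rewards, threshold):
    # Truncation: discard samples whose magnitude exceeds the
    # threshold, then average. Outliers from the heavy tail are
    # removed at the cost of a small, controllable bias.
    r = np.asarray(rewards, dtype=float)
    return float(np.mean(np.where(np.abs(r) <= threshold, r, 0.0)))

def mean_of_medians(rewards, num_blocks):
    # Mean of medians: split the samples into blocks, take the
    # median within each block, and average the block medians,
    # damping the influence of extreme samples in any one block.
    r = np.asarray(rewards, dtype=float)
    blocks = np.array_split(r, num_blocks)
    return float(np.mean([np.median(b) for b in blocks]))

With heavy-tailed samples, both estimators concentrate around the true mean far better than the raw empirical average, which is the property regret analyses of this kind rely on.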
Robust Offline Policy Evaluation and Optimization with Heavy-Tailed Rewards
This paper endeavors to augment the robustness of offline reinforcement
learning (RL) in scenarios laden with heavy-tailed rewards, a prevalent
circumstance in real-world applications. We propose two algorithmic frameworks,
ROAM and ROOM, for robust off-policy evaluation (OPE) and offline policy
optimization (OPO), respectively. Central to our frameworks is the strategic
incorporation of the median-of-means method with offline RL, enabling
straightforward uncertainty estimation for the value function estimator. This
not only adheres to the principle of pessimism in OPO but also adeptly manages
heavy-tailed rewards. Theoretical results and extensive experiments demonstrate
that our two frameworks outperform existing methods when the logged dataset
exhibits heavy-tailed reward distributions.
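A rough sketch of the median-of-means mechanism behind ROAM and ROOM follows (hypothetical simplification: the frameworks fit a separate value-function estimator on each data split, whereas here each "estimator" is just a split mean):

import numpy as np

def pessimistic_value_estimate(logged_returns, num_splits, beta=1.0):
    # Split the logged returns into disjoint folds, mimicking
    # independently trained value estimators.
    folds = np.array_split(np.asarray(logged_returns, dtype=float),
                           num_splits)
    means = np.array([f.mean() for f in folds])
    point = float(np.median(means))  # robust OPE point estimate
    spread = float(np.std(means))    # crude uncertainty proxy
    # Subtracting a multiple of the spread implements pessimism
    # for offline policy optimization.
    return point - beta * spread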
Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds
While numerous works have focused on devising efficient algorithms for
reinforcement learning (RL) with uniformly bounded rewards, it remains an open
question whether sample- or time-efficient algorithms for RL with a large
state-action space exist when the rewards are \emph{heavy-tailed}, i.e., with
only finite $(1+\epsilon)$-th moments for some $\epsilon \in (0,1]$. In this
work, we address the challenge of such rewards in RL with linear function
approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for
heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round
regret of $\tilde{O}\big(dT^{\frac{1-\epsilon}{2(1+\epsilon)}}\sqrt{\sum_{t=1}^T \nu_t^2} + dT^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the
\emph{first} of this kind. Here, $d$ is the feature dimension, and
$\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at
the $t$-th round. We further show the above bound is minimax optimal when
applied to the worst-case instances in stochastic and deterministic linear
bandits. We then extend this algorithm to the RL settings with linear function
approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the
\emph{first} computationally efficient \emph{instance-dependent} $K$-episode
regret of $\tilde{O}\big(d\sqrt{H\mathcal{U}^*}K^{\frac{1}{1+\epsilon}} + d\sqrt{H\mathcal{V}^* K}\big)$. Here, $H$ is the length of the episode, and
$\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with
the central moments of the reward and value functions, respectively. We also provide
a matching minimax lower bound to demonstrate the optimality of our algorithm in the worst
case. Our result is achieved via a novel robust self-normalized concentration
inequality that may be of independent interest in handling heavy-tailed noise
in general online regression problems.
Comment: NeurIPS 2023
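Algorithms in this line typically replace least squares with a Huber-type regression so that heavy-tailed noise cannot dominate the fit; the paper's estimator additionally adapts the robustness threshold round by round using the moment information $\nu_t$. A minimal fixed-threshold sketch (hypothetical code, not the paper's \textsc{Heavy-OFUL}):

import numpy as np

def huber_regression(X, y, delta, lr=0.1, steps=500):
    # Gradient descent on the Huber loss: quadratic for small
    # residuals, linear for large ones, so a few extreme rewards
    # contribute only a bounded gradient.
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        residuals = X @ theta - y
        # Huber gradient: each residual is clipped to [-delta, delta].
        grad = X.T @ np.clip(residuals, -delta, delta) / len(y)
        theta -= lr * grad
    return theta

For residuals below delta the update coincides with ordinary least squares; beyond delta the per-sample gradient saturates, which is the bounded-influence property that makes such estimators suited to heavy-tailed noise.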