251 research outputs found

    Efficient Algorithms for Generalized Linear Bandits with Heavy-tailed Rewards

    This paper investigates the problem of generalized linear bandits with heavy-tailed rewards, whose $(1+\epsilon)$-th moment is bounded for some $\epsilon\in (0,1]$. Although there exist methods for generalized linear bandits, most of them focus on bounded or sub-Gaussian rewards and are not well suited for many real-world scenarios, such as financial markets and web advertising. To address this issue, we propose two novel algorithms based on truncation and mean of medians. These algorithms achieve an almost optimal regret bound of $\widetilde{O}(dT^{\frac{1}{1+\epsilon}})$, where $d$ is the dimension of contextual information and $T$ is the time horizon. Our truncation-based algorithm supports online learning, distinguishing it from existing truncation-based approaches. Additionally, our mean-of-medians-based algorithm requires only $O(\log T)$ rewards and one estimator per epoch, making it more practical. Moreover, our algorithms improve the regret bounds by a logarithmic factor compared to existing algorithms when $\epsilon=1$. Numerical experimental results confirm the merits of our algorithms.
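    As a rough illustration of the truncation idea (not the paper's algorithm), the sketch below maintains a truncated mean estimate online: each incoming reward is discarded when its magnitude exceeds a round-dependent threshold calibrated by an assumed bound `u` on the $(1+\epsilon)$-th moment and a confidence level `delta`. The threshold schedule follows the classical truncated-empirical-mean analysis and is an assumption here.

```python
# Illustrative sketch of online truncation for heavy-tailed rewards.
# The threshold b_t = (u * t / log(1/delta))^(1/(1+epsilon)) follows the classical
# truncated empirical mean; it is an assumption, not this paper's exact rule.
import math

class TruncatedRunningMean:
    def __init__(self, u, epsilon, delta):
        self.u = u              # assumed bound on E[|reward|^(1+epsilon)]
        self.epsilon = epsilon  # moment parameter, epsilon in (0, 1]
        self.delta = delta      # confidence level
        self.t = 0              # number of rewards seen so far
        self.total = 0.0        # running sum of the kept (non-truncated) rewards

    def update(self, reward):
        """One-pass update: keep the reward only if it is below the current threshold."""
        self.t += 1
        b_t = (self.u * self.t / math.log(1.0 / self.delta)) ** (1.0 / (1.0 + self.epsilon))
        if abs(reward) <= b_t:
            self.total += reward
        return self.total / self.t  # current truncated-mean estimate
```

    Because the estimate is kept as a single running sum, each reward is processed exactly once, which is the sense in which truncation-based estimation can support fully online learning.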

    Bandits with heavy tail

    The stochastic multi-armed bandit problem is well understood when the reward distributions are sub-Gaussian. In this paper we examine the bandit problem under the weaker assumption that the distributions have moments of order $1+\epsilon$, for some $\epsilon \in (0,1]$. Surprisingly, moments of order 2 (i.e., finite variance) are sufficient to obtain regret bounds of the same order as under sub-Gaussian reward distributions. In order to achieve such regret, we define sampling strategies based on refined estimators of the mean, such as the truncated empirical mean, Catoni's M-estimator, and the median-of-means estimator. We also derive matching lower bounds, which show that the best achievable regret deteriorates when $\epsilon < 1$.
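    Two of the robust mean estimators named above are sketched below as a minimal, self-contained illustration (not the paper's code). The confidence level `delta`, the variance bound passed to Catoni's estimator (which corresponds to the $\epsilon = 1$ case), and the number-of-blocks rule for median-of-means are the standard textbook choices and are assumptions here.

```python
# Minimal sketches of two robust mean estimators for heavy-tailed samples:
# the median-of-means estimator and Catoni's M-estimator.
import numpy as np

def median_of_means(x, delta=0.05):
    """Split the samples into ~8*log(1/delta) blocks, average each block,
    and return the median of the block means."""
    x = np.asarray(x, dtype=float)
    k = max(1, min(len(x), int(np.ceil(8 * np.log(1.0 / delta)))))  # number of blocks
    blocks = np.array_split(x, k)                                   # i.i.d. data, so order-based split is fine
    return float(np.median([b.mean() for b in blocks]))

def catoni_mean(x, variance_bound, delta=0.05, iters=60):
    """Catoni's M-estimator: the root of sum_i psi(alpha * (x_i - mu)) = 0, where
    psi(t) = sign(t) * log(1 + |t| + t^2/2). Requires a finite-variance bound,
    i.e. the epsilon = 1 case of the moment condition above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    psi = lambda t: np.sign(t) * np.log1p(np.abs(t) + 0.5 * t * t)
    alpha = np.sqrt(2.0 * np.log(1.0 / delta) / (n * variance_bound))  # simplified tuning
    lo, hi = x.min() - 1.0, x.max() + 1.0   # the root is bracketed by the sample range
    for _ in range(iters):                  # bisection on the monotone influence sum
        mid = 0.5 * (lo + hi)
        if psi(alpha * (x - mid)).sum() > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = rng.standard_t(df=2.5, size=2000) + 1.0   # heavy-tailed, true mean 1, variance 5
    print(rewards.mean(), median_of_means(rewards), catoni_mean(rewards, variance_bound=5.0))
```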

    Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

    While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action spaces exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K})$. Here, $H$ is the length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moments of the reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
    Comment: NeurIPS 202
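    The abstract does not spell out the estimator behind these bounds. A common robust-regression primitive in the heavy-tailed linear bandit literature is Huber-type regression, and the sketch below is a generic illustration of that idea only: the Huber threshold `tau`, the ridge weight `lam`, and the IRLS solver are assumptions for illustration, not a reproduction of \textsc{Heavy-OFUL} or its self-normalized analysis.

```python
# Generic sketch: ridge regression with a Huber loss for a linear reward model
# r_t ~ <theta, x_t> + heavy-tailed noise. Illustrative assumptions throughout.
import numpy as np

def huber_ridge(X, r, tau=1.0, lam=1.0, iters=100, tol=1e-8):
    """Minimize sum_t huber_tau(r_t - <theta, x_t>) + (lam/2) * ||theta||^2
    by iteratively reweighted least squares (IRLS)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        resid = r - X @ theta
        # Huber weights: full weight for small residuals, down-weighted for large ones.
        w = np.where(np.abs(resid) <= tau, 1.0, tau / np.maximum(np.abs(resid), 1e-12))
        A = X.T @ (w[:, None] * X) + lam * np.eye(d)
        b = X.T @ (w * r)
        theta_new = np.linalg.solve(A, b)
        if np.linalg.norm(theta_new - theta) < tol:
            theta = theta_new
            break
        theta = theta_new
    return theta
```

    In an optimism-based bandit loop, an estimate of this kind would be combined with a confidence ellipsoid to select actions; the abstract's contribution is the concentration machinery that keeps such confidence sets tight under only a $(1+\epsilon)$-th moment assumption.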

    Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk

    We study the trade-off between expectation and tail risk of the regret distribution in the stochastic multi-armed bandit problem. We fully characterize the interplay among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of the expected regret exactly affects the decay rate of the regret tail probability in both the worst-case and instance-dependent scenarios. A novel policy is proposed to characterize the optimal regret tail probability for any regret threshold. Concretely, for any given $\alpha\in[1/2, 1)$ and $\beta\in[0, \alpha]$, our policy achieves a worst-case expected regret of $\tilde O(T^\alpha)$ (we call it $\alpha$-optimal) and an instance-dependent expected regret of $\tilde O(T^\beta)$ (we call it $\beta$-consistent), while enjoying a probability of incurring an $\tilde O(T^\delta)$ regret ($\delta\geq\alpha$ in the worst-case scenario and $\delta\geq\beta$ in the instance-dependent scenario) that decays exponentially with a polynomial $T$ term. This decay rate is proved to be the best achievable. Moreover, we discover an intrinsic gap in the optimal tail rate under the instance-dependent scenario depending on whether the time horizon $T$ is known a priori or not. Interestingly, in the worst-case scenario this gap disappears. Finally, we extend our proposed policy design to (1) a stochastic multi-armed bandit setting with non-stationary baseline rewards, and (2) a stochastic linear bandit setting. Our results reveal insights on the trade-off between regret expectation and regret tail risk for both worst-case and instance-dependent scenarios, indicating that more sub-optimality and inconsistency leave space for a lighter-tailed risk of incurring a large regret, and that knowing the planning horizon in advance can make a difference in alleviating tail risks.
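    Purely to make the central object concrete, the sketch below estimates the regret tail probability $P(\mathrm{Regret}_T \geq T^\delta)$ by Monte Carlo for a standard UCB1 policy on a synthetic two-armed Gaussian instance. The policy, instance, horizon, and threshold exponent are all assumptions for illustration; the paper's proposed policy and its tail guarantees are not reproduced here.

```python
# Illustration only: Monte Carlo estimate of P(Regret_T >= T**delta_exp)
# for UCB1 on a two-armed Gaussian bandit (not the paper's proposed policy).
import numpy as np

def ucb1_regret(T, gap=0.2, rng=None):
    """Pseudo-regret of UCB1 over T rounds on a two-armed Gaussian instance."""
    rng = rng if rng is not None else np.random.default_rng()
    means = np.array([0.5, 0.5 - gap])
    counts = np.zeros(2)
    sums = np.zeros(2)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= 2:
            a = t - 1                                    # pull each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            a = int(np.argmax(ucb))
        sums[a] += rng.normal(means[a], 1.0)
        counts[a] += 1
        regret += means.max() - means[a]
    return regret

def tail_probability(T=2000, delta_exp=0.6, runs=200, seed=0):
    """Empirical frequency of the event {Regret_T >= T**delta_exp}."""
    rng = np.random.default_rng(seed)
    threshold = T ** delta_exp
    regrets = np.array([ucb1_regret(T, rng=rng) for _ in range(runs)])
    return float((regrets >= threshold).mean())

if __name__ == "__main__":
    print(tail_probability())
```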