
    Efficient Algorithms for Generalized Linear Bandits with Heavy-tailed Rewards

    This paper investigates the problem of generalized linear bandits with heavy-tailed rewards, whose $(1+\epsilon)$-th moment is bounded for some $\epsilon \in (0,1]$. Although there exist methods for generalized linear bandits, most of them focus on bounded or sub-Gaussian rewards and are not well-suited for many real-world scenarios, such as financial markets and web advertising. To address this issue, we propose two novel algorithms based on truncation and mean of medians. These algorithms achieve an almost optimal regret bound of $\widetilde{O}(dT^{\frac{1}{1+\epsilon}})$, where $d$ is the dimension of contextual information and $T$ is the time horizon. Our truncation-based algorithm supports online learning, distinguishing it from existing truncation-based approaches. Additionally, our mean-of-medians-based algorithm requires only $O(\log T)$ rewards and one estimator per epoch, making it more practical. Moreover, our algorithms improve the regret bounds by a logarithmic factor compared to existing algorithms when $\epsilon = 1$. Numerical experimental results confirm the merits of our algorithms.
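    As a rough illustration of the truncation primitive the abstract refers to (not the authors' bandit algorithm, whose truncation schedule is tied to the generalized-linear-model updates), the sketch below clips rewards at a sample-size-dependent threshold before averaging; the `truncated_mean` helper and its $n^{1/(1+\epsilon)}$ threshold schedule are illustrative assumptions.

```python
import numpy as np

def truncated_mean(rewards, eps, c=1.0):
    """Truncation-based robust mean for heavy-tailed samples.

    Observations are clipped at a threshold growing like n**(1/(1+eps)),
    a standard schedule when only the (1+eps)-th moment is bounded
    (illustrative choice, not the paper's exact tuning).
    """
    r = np.asarray(rewards, dtype=float)
    threshold = c * len(r) ** (1.0 / (1.0 + eps))
    return np.clip(r, -threshold, threshold).mean()

# Pareto rewards here have a finite (1+eps)-th moment only for eps < 0.8,
# so the naive sample mean is volatile while the truncated mean is stable.
rng = np.random.default_rng(0)
samples = rng.pareto(a=1.8, size=5000)
print(samples.mean(), truncated_mean(samples, eps=0.5))
```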

    Robust Offline Policy Evaluation and Optimization with Heavy-Tailed Rewards

    This paper endeavors to augment the robustness of offline reinforcement learning (RL) in scenarios laden with heavy-tailed rewards, a prevalent circumstance in real-world applications. We propose two algorithmic frameworks, ROAM and ROOM, for robust off-policy evaluation (OPE) and offline policy optimization (OPO), respectively. Central to our frameworks is the strategic incorporation of the median-of-means method with offline RL, enabling straightforward uncertainty estimation for the value function estimator. This not only adheres to the principle of pessimism in OPO but also adeptly manages heavy-tailed rewards. Theoretical results and extensive experiments demonstrate that our two frameworks outperform existing methods when the logged dataset exhibits heavy-tailed reward distributions.
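    The median-of-means primitive at the core of ROAM and ROOM is easy to sketch; the pessimistic adjustment below (a median-absolute-deviation penalty with weight `kappa`) is a hypothetical stand-in for the paper's uncertainty quantification, not its actual construction.

```python
import numpy as np

def mom_value_with_pessimism(returns, num_blocks=10, kappa=1.0):
    """Median-of-means estimate of a policy's value from logged returns,
    plus a crude pessimistic lower bound built from the spread of the
    block means (hypothetical proxy for a principled uncertainty bonus).
    """
    r = np.asarray(returns, dtype=float)
    block_means = np.array([b.mean() for b in np.array_split(r, num_blocks)])
    estimate = np.median(block_means)
    spread = np.median(np.abs(block_means - estimate))  # MAD of block means
    return estimate, estimate - kappa * spread

# Logged returns with infinite variance (Student-t, df = 1.5) but finite mean.
rng = np.random.default_rng(1)
logged_returns = 1.0 + 0.1 * rng.standard_t(df=1.5, size=1000)
print(mom_value_with_pessimism(logged_returns))
```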

    Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

    While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action space exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon \in (0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K})$. Here, $H$ is the length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moment of reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
    Comment: NeurIPS 202
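    Regret analyses in this line of work typically rest on a robust regression oracle for the reward model; the sketch below implements a generic Huber-loss ridge regression via iteratively reweighted least squares as an illustrative assumption, without claiming it is the estimator inside \textsc{Heavy-OFUL} or \textsc{Heavy-LSVI-UCB}.

```python
import numpy as np

def huber_ridge_irls(X, y, tau=1.0, lam=1.0, n_iter=50):
    """Huber-loss ridge regression via iteratively reweighted least squares.

    Residuals larger than tau are down-weighted, limiting the influence of
    heavy-tailed reward noise on the parameter estimate (generic robust
    regression, not necessarily the paper's estimator).
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        resid = y - X @ theta
        w = np.where(np.abs(resid) <= tau,
                     1.0, tau / np.maximum(np.abs(resid), 1e-12))
        A = X.T @ (w[:, None] * X) + lam * np.eye(d)
        theta = np.linalg.solve(A, X.T @ (w * y))
    return theta

# Linear rewards corrupted by Student-t noise with infinite variance:
# the robust fit stays close to theta_star while least squares degrades.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
theta_star = rng.normal(size=5)
y = X @ theta_star + rng.standard_t(df=2.0, size=500)
print(np.linalg.norm(huber_ridge_irls(X, y, tau=2.0) - theta_star))
print(np.linalg.norm(np.linalg.lstsq(X, y, rcond=None)[0] - theta_star))
```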