235 research outputs found
Stochastic Bandits with Delay-Dependent Payoffs
Motivated by recommendation problems in music streaming platforms, we propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a class of ranking policies provably approximating, to within a constant factor, the expected reward of the optimal policy. We show an algorithm whose regret with respect to the best ranking policy is bounded by $\widetilde{O}(\sqrt{kT})$, where $k$ is the number of arms and $T$ is time. Our algorithm uses only $O(k \ln\ln T)$ switches, which helps when switching between policies is costly. As constructing the class of learning policies requires ordering the arms according to their expectations, we also bound the number of pulls required to do so. Finally, we run experiments to compare our algorithm against UCB on different problem instances.
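To make the model concrete, here is a minimal Python sketch of one delay-dependent instance. The recharging curve $b_i(1 - e^{-\tau/c_i})$ and all parameter values are hypothetical choices for illustration (the abstract only requires that the expected reward depend on the delay $\tau$ since the arm's last pull, and reward noise is omitted here); comparing a cyclic ranking policy against a delay-blind greedy baseline shows why accounting for delays matters.

```python
import numpy as np

rng = np.random.default_rng(0)
k, T = 4, 10_000
b = rng.uniform(0.5, 1.0, size=k)   # asymptotic mean rewards (assumed)
c = rng.uniform(1.0, 5.0, size=k)   # recharge speeds (assumed)

def expected_reward(i, tau):
    # Hypothetical recharging curve: reward grows with the delay tau
    # since arm i was last pulled, saturating at b[i].
    return b[i] * (1.0 - np.exp(-tau / c[i]))

def run(policy):
    last_pull = np.full(k, -1e9)    # arms start fully "recharged"
    total = 0.0
    for t in range(T):
        i = policy(t)
        total += expected_reward(i, t - last_pull[i])  # mean reward, no noise
        last_pull[i] = t
    return total / T

# A ranking policy cycles through a fixed ordering of the arms, so each
# arm is pulled with delay k; greedy ignores delays and hammers the top arm.
order = np.argsort(-b)
ranking = lambda t: order[t % k]
greedy = lambda t: int(np.argmax(b))

print(f"cyclic ranking policy: avg reward {run(ranking):.3f}")
print(f"delay-blind greedy:    avg reward {run(greedy):.3f}")
```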
Last Switch Dependent Bandits with Monotone Payoff Functions
In a recent work, Laforgue et al. introduce the model of last switch
dependent (LSD) bandits, in an attempt to capture nonstationary phenomena
induced by the interaction between the player and the environment. Examples
include satiation, where consecutive plays of the same action lead to decreased
performance, or deprivation, where the payoff of an action increases after an
interval of inactivity. In this work, we take a step towards understanding the
approximability of planning LSD bandits, namely, the (NP-hard) problem of
computing an optimal arm-pulling strategy under complete knowledge of the
model. In particular, we design the first efficient constant approximation
algorithm for the problem and show that, under a natural monotonicity
assumption on the payoffs, its approximation guarantee (almost) matches the
state-of-the-art for the special and well-studied class of recharging bandits
(also known as delay-dependent). In this attempt, we develop new tools and
insights for this class of problems, including a novel higher-dimensional
relaxation and the technique of mirroring the evolution of virtual states. We
believe that these novel elements could potentially be used for approaching
richer classes of action-induced nonstationary bandits (e.g., special instances
of restless bandits). In the case where the model parameters are initially
unknown, we develop an online learning adaptation of our algorithm for which we
provide sublinear regret guarantees against its full-information counterpart.
Comment: Accepted to the 40th International Conference on Machine Learning
(ICML 2023).
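As a rough illustration of the LSD dynamics (not of the paper's approximation algorithm), the sketch below tracks each arm's signed state: the number of rounds since the player last switched to or away from it, positive while the arm is being played and negative while it rests. The monotone payoff functions, initial states, and the myopic policy are all hypothetical choices for demonstration; they merely exhibit the satiation and deprivation effects described above.

```python
import numpy as np

def payoff(i, tau, base):
    if tau >= 0:   # played for the last tau rounds: satiation decay
        return base[i] * 0.5 ** tau
    else:          # rested for |tau| rounds: deprivation recovery
        return base[i] * (1.0 - 0.5 ** (-tau))

k, T = 3, 12
base = np.array([1.0, 0.8, 0.6])   # assumed baseline payoffs
state = np.full(k, -5)             # all arms start well-rested (assumed)

for t in range(T):
    # Myopic policy, for illustration only: pull the best current payoff.
    rewards = [payoff(i, state[i], base) for i in range(k)]
    j = int(np.argmax(rewards))
    print(f"t={t}: pull arm {j}, payoff {rewards[j]:.3f}, states {state.tolist()}")
    for i in range(k):
        if i == j:
            # Pulled arm: consecutive plays continue, or a switch to it occurs.
            state[i] = state[i] + 1 if state[i] > 0 else 1
        else:
            # Idle arm: rest deepens, or a switch away from it occurs.
            state[i] = state[i] - 1 if state[i] < 0 else -1
```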
Further Optimal Regret Bounds for Thompson Sampling
Thompson Sampling is one of the oldest heuristics for multi-armed bandit
problems. It is a randomized algorithm based on Bayesian ideas, and has
recently generated significant interest after several studies demonstrated it
to have better empirical performance compared to state-of-the-art methods.
In this paper, we provide a novel regret analysis for Thompson Sampling that
simultaneously proves both the optimal problem-dependent bound of
$(1+\epsilon)\sum_i \frac{\ln T}{d(\mu_i,\mu_1)} + O(\frac{N}{\epsilon^2})$ and the
first near-optimal problem-independent bound of $O(\sqrt{NT\ln T})$ on the
expected regret of this algorithm. Our near-optimal problem-independent bound
solves a COLT 2012 open problem of Chapelle and Li. The optimal
problem-dependent regret bound for this problem was first proven recently by
Kaufmann et al. [ALT 2012]. Our novel martingale-based analysis techniques are
conceptually simple, easily extend to distributions other than the Beta
distribution, and also extend to the more general contextual bandits setting
[Manuscript, Agrawal and Goyal, 2012].
Comment: arXiv admin note: substantial text overlap with arXiv:1111.179
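For reference, Beta-Bernoulli Thompson Sampling, the algorithm analyzed in this paper, can be sketched in a few lines; the arm means below are an arbitrary test instance, and the regret printout is only a sanity check against the problem-independent rate.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.9, 0.8, 0.7])   # true (unknown) Bernoulli means, assumed
N, T = len(mu), 50_000
successes = np.zeros(N)
failures = np.zeros(N)

regret = 0.0
for t in range(T):
    # Sample a mean estimate for each arm from its Beta(1+S, 1+F) posterior.
    theta = rng.beta(successes + 1, failures + 1)
    i = int(np.argmax(theta))          # play the arm with the largest sample
    reward = rng.random() < mu[i]      # Bernoulli reward
    successes[i] += reward
    failures[i] += 1 - reward
    regret += mu.max() - mu[i]         # expected regret of this pull

print(f"expected regret after T={T} rounds: {regret:.1f}")
```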