Rotting bandits are not harder than stochastic ones
In stochastic multi-armed bandits, the reward distribution of each arm is
assumed to be stationary. This assumption is often violated in practice (e.g.,
in recommendation systems), where the reward of an arm may change each time it
is selected, i.e., the rested bandit setting. In this paper, we consider the
non-parametric rotting bandit setting, where rewards can only decrease. We
introduce the filtering on expanding window average (FEWA) algorithm that
constructs moving averages of increasing windows to identify arms that are more
likely to return high rewards when pulled once more. We prove that, for an
unknown horizon $T$, and without any knowledge of the decreasing behavior of
the $K$ arms, FEWA achieves a problem-dependent regret bound of
$\widetilde{O}(\log(KT))$ and a problem-independent one of
$\widetilde{O}(\sqrt{KT})$. Our result substantially improves over
the algorithm of Levine et al. (2017), which suffers regret
$\widetilde{O}(K^{1/3}T^{2/3})$. FEWA also matches known bounds for
the stochastic bandit setting, thus showing that the rotting bandits are not
harder. Finally, we report simulations confirming the theoretical improvements
of FEWA.
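For illustration, here is a minimal sketch of the expanding-window filtering idea in Python. It assumes a known horizon for the confidence width (the paper handles an unknown horizon), and the callback pull, the noise scale sigma, and the constant alpha are illustrative assumptions rather than the paper's exact tuning.

import math
from collections import defaultdict

def fewa(pull, n_arms, horizon, sigma=1.0, alpha=4.0):
    # Sketch of Filtering on Expanding Window Averages (FEWA).
    # pull(arm) returns one (possibly rotting) stochastic reward.
    history = defaultdict(list)              # rewards of each arm, in pull order

    def window_mean(arm, h):
        return sum(history[arm][-h:]) / h    # average of the last h rewards

    def width(h):
        # confidence width for a window of size h (illustrative tuning)
        return math.sqrt(2 * alpha * sigma ** 2 * math.log(horizon) / h)

    for arm in range(n_arms):                # pull each arm once to initialize
        history[arm].append(pull(arm))

    for _ in range(horizon - n_arms):
        active, h = set(range(n_arms)), 1
        chosen = None
        while chosen is None:
            undersampled = [a for a in active if len(history[a]) < h]
            if undersampled:
                chosen = undersampled[0]     # explore a surviving arm first
            else:
                # filter out arms whose window-h average is clearly worse
                best = max(window_mean(a, h) for a in active)
                active = {a for a in active
                          if window_mean(a, h) >= best - 2 * width(h)}
                h += 1
        history[chosen].append(pull(chosen))
    return history

Because the filter never removes the empirically best arm, the active set stays nonempty, and some surviving arm eventually has fewer than h samples and gets pulled.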
Training a Single Bandit Arm
The stochastic multi-armed bandit problem captures the fundamental
exploration vs. exploitation tradeoff inherent in online decision-making in
uncertain settings. However, in several applications, the traditional objective
of maximizing the expected sum of rewards obtained can be inappropriate.
Motivated by the problem of optimizing job assignments to groom novice workers
with unknown trainability in labor platforms, we consider a new objective in
the classical setup. Instead of maximizing the expected total reward from
$T$ pulls, we consider the vector of cumulative rewards earned from each of the
$K$ arms at the end of the $T$ pulls, and aim to maximize the expected value of the
highest reward. This corresponds to the objective of grooming a
single, highly skilled worker using a limited supply of training jobs.
For this new objective, we show that any policy must incur a regret of
$\Omega(K^{1/3}T^{2/3})$ in the worst case. We design an explore-then-commit
policy featuring exploration based on finely tuned confidence bounds on the
mean reward and an adaptive stopping criterion, which adapts to the problem
difficulty and guarantees a regret of $\widetilde{O}(K^{1/3}T^{2/3})$ in the
worst case. Our numerical experiments demonstrate that this policy improves
upon several natural candidate policies for this setting.
Comment: 23 pages, 1 figure, 1 table
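As a rough illustration of the shape of such a policy (not the authors' exact construction), the sketch below explores in round robin, stops adaptively once one arm's lower confidence bound dominates every other arm's upper bound, and commits the remaining budget to that arm. The callback pull and the plain Hoeffding-style radius are assumptions; the paper's confidence bounds are tuned more finely.

import math

def explore_then_commit(pull, n_arms, horizon, delta=0.01):
    # Sketch: maximize the expected highest per-arm cumulative reward.
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    cumulative = [0.0] * n_arms              # per-arm cumulative rewards
    best = 0

    def radius(n):
        # plain Hoeffding-style radius; an illustrative choice
        return math.sqrt(math.log(2 * n_arms * horizon / delta) / (2 * n))

    t = 0
    while t < horizon:
        arm = t % n_arms                     # round-robin exploration
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        cumulative[arm] += r
        t += 1
        if min(counts) > 0:
            means = [s / c for s, c in zip(sums, counts)]
            lcb = [m - radius(c) for m, c in zip(means, counts)]
            ucb = [m + radius(c) for m, c in zip(means, counts)]
            best = max(range(n_arms), key=lambda a: lcb[a])
            # adaptive stopping: commit once one arm clearly dominates
            if all(lcb[best] >= ucb[a] for a in range(n_arms) if a != best):
                break
    for _ in range(horizon - t):             # commit phase: train one arm
        cumulative[best] += pull(best)
    return max(cumulative)

Under this objective, pulls spent on non-committed arms contribute nothing to the final maximum, which is why the stopping time, rather than the committed arm alone, drives the regret.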
A Field Test of Bandit Algorithms for Recommendations: Understanding the Validity of Assumptions on Human Preferences in Multi-armed Bandits
Personalized recommender systems suffuse modern life, shaping what media we
read and what products we consume. Algorithms powering such systems tend to
consist of supervised learning-based heuristics, such as latent factor models
with a variety of heuristically chosen prediction targets. Meanwhile,
theoretical treatments of recommendation frequently address the
decision-theoretic nature of the problem, including the need to balance
exploration and exploitation, via the multi-armed bandits (MABs) framework.
However, MAB-based approaches rely heavily on assumptions about human
preferences. These preference assumptions are seldom tested using human subject
studies, partly due to the lack of publicly available toolkits to conduct such
studies. In this work, we conduct a study with crowdworkers in a comics
recommendation MAB setting. Each arm represents a comic category, and users
provide feedback after each recommendation. We check the validity of a core
MAB assumption, namely that human preferences (reward distributions) are fixed
over time, and find that it does not hold. This finding suggests that any MAB
algorithm used for recommender systems should account for human preference
dynamics. While answering these questions, we provide a flexible experimental
framework for understanding human preference dynamics and testing MAB
algorithms with human users. The code for our experimental framework and the
collected data can be found at
https://github.com/HumainLab/human-bandit-evaluation.
Comment: Accepted to CHI. 16 pages, 6 figures
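The authors' toolkit and data live in the linked repository; as a simple standalone illustration of the kind of check involved, one can split a user's per-category ratings chronologically and run a permutation test on the difference in means. The function name and example data below are hypothetical, not the paper's analysis.

import random

def stationarity_pvalue(ratings, n_perm=10_000, seed=0):
    # Permutation test: are early and late ratings of one comic
    # category (one arm) plausibly drawn from the same distribution?
    rng = random.Random(seed)
    half = len(ratings) // 2
    early, late = ratings[:half], ratings[half:]
    observed = abs(sum(late) / len(late) - sum(early) / len(early))
    pooled = list(ratings)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:half], pooled[half:]
        if abs(sum(b) / len(b) - sum(a) / len(a)) >= observed:
            hits += 1
    return hits / n_perm                     # small value suggests drift

# hypothetical ratings of one category over a session, drifting downward
print(stationarity_pvalue([5, 5, 4, 5, 4, 3, 3, 2, 3, 2]))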
Finite Continuum-Armed Bandits
We consider a situation where an agent has $m$ resources to be allocated to
a larger number $n$ of actions. Each action can be completed at most once and
results in a stochastic reward with unknown mean. The goal of the agent is to
maximize her cumulative reward. Nontrivial strategies are possible when side
information on the actions is available, for example in the form of covariates.
Focusing on a nonparametric setting, where the mean reward is an unknown
function of a one-dimensional covariate, we propose an optimal strategy for
this problem. Under natural assumptions on the reward function, we prove that
the optimal regret scales as $\sqrt{m}$ up to poly-logarithmic factors when
the budget $m$ is proportional to the number of actions $n$. When $m$ becomes
small compared to $n$, a smooth transition occurs. When the ratio $m/n$
decreases from a constant to $m^{-1/3}$, the regret increases progressively up
to the $m^{2/3}$ rate encountered in continuum-armed bandits.
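The paper's optimal strategy is more refined, but a natural baseline that makes the covariate structure concrete is to bin the one-dimensional covariate and run UCB over bins, consuming each action at most once. Everything below (function name, pull callback, bin count) is an illustrative assumption; the bin count trades smoothness bias against estimation noise.

import math

def binned_ucb(actions, pull, budget, n_bins):
    # actions: list of (action_id, covariate in [0, 1]); budget <= len(actions).
    # Each action is playable at most once; the mean reward is a smooth
    # unknown function of the covariate, so nearby actions share a bin.
    bins = [[] for _ in range(n_bins)]
    for action_id, x in actions:
        bins[min(int(x * n_bins), n_bins - 1)].append(action_id)
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    total = 0.0
    for t in range(1, budget + 1):
        def ucb(b):
            if not bins[b]:
                return -math.inf             # bin exhausted
            if counts[b] == 0:
                return math.inf              # sample every nonempty bin once
            return sums[b] / counts[b] + math.sqrt(2 * math.log(t) / counts[b])
        b = max(range(n_bins), key=ucb)
        reward = pull(bins[b].pop())         # consume a fresh action
        counts[b] += 1
        sums[b] += reward
        total += reward
    return total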
Bandit problems with fidelity rewards
The fidelity bandits problem is a variant of the $K$-armed bandit problem in which the reward of each arm is augmented by a fidelity reward that provides the player with an additional payoff depending on how ‘loyal’ the player has been to that arm in the past. We propose two models for fidelity. In the loyalty-points model, the amount of extra reward depends on the number of times the arm has previously been played. In the subscription model, the additional reward depends on the current number of consecutive draws of the arm. We consider both stochastic and adversarial problems. Since single-arm strategies are not always optimal in stochastic problems, the notion of regret in the adversarial setting needs careful adjustment. We introduce three possible notions of regret and investigate which of them can be bounded sublinearly. We study in detail the special cases of increasing, decreasing, and coupon fidelity rewards (where, in the coupon case, the player gets an additional reward after every $m$ plays of an arm). For the models which do not necessarily enjoy sublinear regret, we provide a worst-case lower bound. For those models which exhibit sublinear regret, we provide algorithms and bound their regret.
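To make the two fidelity models concrete, here is a small sketch of how a realized reward could be augmented. The function names and the coupon example are illustrative, not the paper's notation.

def fidelity_reward(base, arm, history, model, bonus):
    # base: the arm's stochastic reward; history: past arm choices in order.
    if model == "loyalty":
        # loyalty-points model: bonus depends on total past plays of the arm
        return base + bonus(history.count(arm))
    if model == "subscription":
        # subscription model: bonus depends on the current consecutive streak
        streak = 0
        for past in reversed(history):
            if past != arm:
                break
            streak += 1
        return base + bonus(streak)
    raise ValueError(model)

# coupon fidelity: one extra unit of reward on every 3rd play of an arm
coupon = lambda n_prev: 1.0 if (n_prev + 1) % 3 == 0 else 0.0
print(fidelity_reward(0.5, "a", ["a", "b", "a"], "loyalty", coupon))  # 1.5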