
    Rotting bandits are not harder than stochastic ones

    In stochastic multi-armed bandits, the reward distribution of each arm is assumed to be stationary. This assumption is often violated in practice (e.g., in recommendation systems), where the reward of an arm may change whenever it is selected, i.e., the rested bandit setting. In this paper, we consider the non-parametric rotting bandit setting, where rewards can only decrease. We introduce the filtering on expanding window average (FEWA) algorithm, which constructs moving averages of increasing windows to identify arms that are more likely to return high rewards when pulled once more. We prove that for an unknown horizon $T$, and without any knowledge of the decreasing behavior of the $K$ arms, FEWA achieves a problem-dependent regret bound of $\widetilde{\mathcal{O}}(\log(KT))$ and a problem-independent one of $\widetilde{\mathcal{O}}(\sqrt{KT})$. Our result substantially improves over the algorithm of Levine et al. (2017), which suffers regret $\widetilde{\mathcal{O}}(K^{1/3}T^{2/3})$. FEWA also matches known bounds for the stochastic bandit setting, thus showing that rotting bandits are not harder. Finally, we report simulations confirming the theoretical improvements of FEWA.
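
    To make the expanding-window idea concrete, here is a minimal Python sketch of one FEWA-style decision round. It assumes rewards lie in [0, 1] and that `histories` holds each arm's observed rewards in pull order; the constant `c` and the confidence width are simplified stand-ins for the paper's exact thresholds, not the authors' specification.

import numpy as np

def fewa_round(histories, horizon, c=2.0):
    """One FEWA-style round (simplified sketch): filter arms using averages
    over expanding windows, then pull a surviving arm that lacks data."""
    active = set(range(len(histories)))
    h = 1
    while True:
        # Any surviving arm with fewer than h samples is pulled next.
        short = [i for i in active if len(histories[i]) < h]
        if short:
            return short[0]
        # Average of the h most recent rewards of each surviving arm.
        means = {i: float(np.mean(histories[i][-h:])) for i in active}
        width = c * np.sqrt(np.log(horizon) / h)
        best = max(means.values())
        # Keep only arms whose recent average is within the confidence band
        # of the best recent average, then expand the window and repeat.
        active = {i for i in active if means[i] >= best - 2.0 * width}
        h += 1

    A caller would pull the returned arm, append the observed reward to that arm's history, and repeat for the next round.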

    The Assistive Multi-Armed Bandit

    Learning preferences implicit in the choices humans make is a well-studied problem in both economics and computer science. However, most work makes the assumption that humans are acting (noisily) optimally with respect to their preferences. Such approaches can fail when people are themselves learning about what they want. In this work, we introduce the assistive multi-armed bandit, where a robot assists a human playing a bandit task to maximize cumulative reward. In this problem, the human does not know the reward function but can learn it through the rewards received from arm pulls; the robot only observes which arms the human pulls but not the reward associated with each pull. We offer necessary and sufficient conditions for successfully assisting the human in this framework. Surprisingly, better human performance in isolation does not necessarily lead to better performance when assisted by the robot: a human policy can do better by effectively communicating its observed rewards to the robot. We conduct proof-of-concept experiments that support these results. We see this work as contributing towards a theory behind algorithms for human-robot interaction. Comment: Accepted to HRI 201
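
    The interaction protocol can be illustrated with a toy simulation step: the human sees rewards and updates running means, while the robot only sees which arm was pulled. The greedy human policy and the counting robot below are placeholders chosen for illustration, not the policies analyzed in the paper.

import numpy as np

def assistive_step(rng, true_means, human_est, human_counts, robot_counts):
    """One round of a toy assistive-bandit interaction (illustrative only)."""
    # Human: try each arm once, then act greedily on its empirical means.
    untried = np.flatnonzero(human_counts == 0)
    arm = int(untried[0]) if untried.size else int(np.argmax(human_est))

    reward = rng.normal(true_means[arm], 0.1)      # observed by the human only
    human_counts[arm] += 1
    human_est[arm] += (reward - human_est[arm]) / human_counts[arm]

    robot_counts[arm] += 1                         # the robot sees only the pull
    robot_guess = int(np.argmax(robot_counts))     # robot's current best guess
    return arm, reward, robot_guess

    Running this step for many rounds illustrates the information asymmetry the paper studies: the robot can only infer the reward function from the pattern of pulls, which is why a human policy that communicates its observed rewards through its choices can help.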

    Training a Single Bandit Arm

    The stochastic multi-armed bandit problem captures the fundamental exploration vs. exploitation tradeoff inherent in online decision-making in uncertain settings. However, in several applications, the traditional objective of maximizing the expected sum of rewards obtained can be inappropriate. Motivated by the problem of optimizing job assignments to groom novice workers with unknown trainability in labor platforms, we consider a new objective in the classical setup. Instead of maximizing the expected total reward from $T$ pulls, we consider the vector of cumulative rewards earned from each of the $K$ arms at the end of $T$ pulls, and aim to maximize the expected value of the highest cumulative reward. This corresponds to the objective of grooming a single, highly skilled worker using a limited supply of training jobs. For this new objective, we show that any policy must incur a regret of $\Omega(K^{1/3}T^{2/3})$ in the worst case. We design an explore-then-commit policy featuring exploration based on finely tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and guarantees a regret of $O(K^{1/3}T^{2/3}\sqrt{\log K})$ in the worst case. Our numerical experiments demonstrate that this policy improves upon several natural candidate policies for this setting. Comment: 23 pages, 1 figure, 1 table
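
    A rough sketch of an explore-then-commit policy for this objective is given below. The `pull` callback, the round-robin exploration, and the LCB-versus-UCB stopping test are simplified assumptions standing in for the paper's finely tuned confidence bounds and adaptive stopping criterion; the quantity tracked is the largest single-arm cumulative reward described above (assuming K >= 2 and rewards in [0, 1]).

import numpy as np

def explore_then_commit_max(pull, n_arms, horizon, c=2.0):
    """Explore-then-commit sketch for maximizing the highest single-arm
    cumulative reward (simplified; not the paper's exact policy)."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    per_arm_total = np.zeros(n_arms)   # cumulative reward earned on each arm
    committed = None

    for t in range(horizon):
        arm = committed if committed is not None else int(np.argmin(counts))
        r = pull(arm)                  # caller-supplied environment, reward in [0, 1]
        counts[arm] += 1
        sums[arm] += r
        per_arm_total[arm] += r

        if committed is None and counts.min() >= 2:
            means = sums / counts
            width = c * np.sqrt(np.log(horizon) / counts)
            best = int(np.argmax(means))
            others_ucb = np.delete(means + width, best)
            # Adaptive stop: commit once the leader is clearly separated.
            if means[best] - width[best] >= others_ucb.max():
                committed = best
    return per_arm_total.max()         # objective: highest cumulative reward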

    Rotting infinitely many-armed bandits

    We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho = o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T, \sqrt{T}\})$ worst-case regret lower bound, where $T$ is the time horizon. We show that a matching upper bound of $\widetilde{O}(\max\{\varrho^{1/3}T, \sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate $\varrho$. We also show that an $\widetilde{O}(\max\{\varrho^{1/3}T, T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.
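
    The UCB-plus-threshold rule for the known-$\varrho$ case can be sketched as follows. The `sample_new_arm` and `pull` callbacks, the concrete index, and the threshold below are simplified placeholders chosen for illustration; the paper's exact index and threshold (and the adaptive variant for unknown $\varrho$) differ.

import numpy as np

def rotting_many_armed(sample_new_arm, pull, horizon, varrho, c=1.0):
    """Sketch: keep pulling the current arm while its UCB index clears a
    threshold; otherwise discard it and draw a fresh arm (rewards in [0, 1])."""
    threshold = 1.0 - max(varrho ** (1.0 / 3.0), 1.0 / np.sqrt(horizon))
    arm, rewards, total = sample_new_arm(), [], 0.0

    for _ in range(horizon):
        r = pull(arm)                  # mean reward may rot with every pull
        rewards.append(r)
        total += r
        n = len(rewards)
        ucb = float(np.mean(rewards)) + c * np.sqrt(np.log(horizon) / n)
        if ucb < threshold:            # arm looks too rotten: replace it
            arm, rewards = sample_new_arm(), []
    return total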

    Stochastic Bandits with Delay-Dependent Payoffs

    Motivated by recommendation problems in music streaming platforms, we propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a class of ranking policies provably approximating, to within a constant factor, the expected reward of the optimal policy. We show an algorithm whose regret with respect to the best ranking policy is bounded by $\widetilde{O}(\sqrt{kT})$, where $k$ is the number of arms and $T$ is the time horizon. Our algorithm uses only $O(k \ln\ln T)$ switches, which helps when switching between policies is costly. As constructing the class of learning policies requires ordering the arms according to their expected rewards, we also bound the number of pulls required to do so. Finally, we run experiments to compare our algorithm against UCB on different problem instances.
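
    A ranking policy in this model can be executed by cycling through a fixed ordered subset of arms, so that each listed arm is pulled again a fixed number of rounds after its previous pull. The sketch below assumes a caller-supplied `payoff(arm, delay)` giving the expected reward as a function of the delay since the arm's last pull; it only illustrates the policy class, not the learning algorithm or its switching analysis.

import itertools

def run_ranking_policy(ranking, payoff, horizon):
    """Play the arms in `ranking` cyclically and accumulate delay-dependent
    expected rewards (illustrative sketch of the ranking-policy class)."""
    last_pull = {arm: None for arm in ranking}
    total = 0.0
    for t, arm in zip(range(horizon), itertools.cycle(ranking)):
        # Treat a never-pulled arm as fully rested.
        delay = horizon if last_pull[arm] is None else t - last_pull[arm]
        total += payoff(arm, delay)
        last_pull[arm] = t
    return total

# Toy example: rewards recover the longer an arm has rested, so a brute-force
# search over short rankings of 3 arms finds the best cyclic schedule.
payoff = lambda arm, delay: (0.3 + 0.1 * arm) * (1.0 - 0.5 ** delay)
best_value, best_ranking = max(
    (run_ranking_policy(list(r), payoff, horizon=1000), r)
    for size in range(1, 4)
    for r in itertools.permutations(range(3), size)
)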

    Congested Bandits: Optimal Routing via Short-term Resets

    For traffic routing platforms, the choice of which route to recommend to a user depends on the congestion on these routes -- indeed, an individual's utility depends on the number of people using the recommended route at that instance. Motivated by this, we introduce the problem of Congested Bandits, where each arm's reward is allowed to depend on the number of times it was played in the past $\Delta$ timesteps. This dependence on the past history of actions leads to a dynamical system where an algorithm's present choices also affect its future payoffs, and requires an algorithm to plan for this dependence. We study the congestion-aware formulation in the multi-armed bandit (MAB) setup and in the contextual bandit setup with linear rewards. For the multi-armed setup, we propose a UCB-style algorithm and show that its policy regret scales as $\tilde{O}(\sqrt{K \Delta T})$. For the linear contextual bandit setup, our algorithm, based on an iterative least squares planner, achieves policy regret $\tilde{O}(\sqrt{dT} + \Delta)$. From an experimental standpoint, we corroborate the no-regret properties of our algorithms via a simulation study. Comment: Published at ICML 202
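
    For the multi-armed setup, a congestion-aware UCB rule can be sketched as below: it keeps one optimistic estimate per (arm, congestion) pair, where congestion counts how often the arm appeared in the last $\Delta$ plays. The greedy one-step use of these indices is a simplification for illustration; the paper's algorithm controls policy regret, which additionally requires planning over the window.

import numpy as np
from collections import defaultdict, deque

def congested_ucb(pull, n_arms, delta_window, horizon, c=1.0):
    """Congestion-aware UCB sketch: index each (arm, congestion) pair, where
    congestion = pulls of that arm in the last `delta_window` rounds."""
    history = deque(maxlen=delta_window)   # sliding window of past plays
    counts = defaultdict(int)              # (arm, congestion) -> number of pulls
    sums = defaultdict(float)              # (arm, congestion) -> summed reward
    total = 0.0

    for t in range(1, horizon + 1):
        congestion = [sum(1 for a in history if a == arm) for arm in range(n_arms)]

        def index(arm):
            key = (arm, congestion[arm])
            if counts[key] == 0:
                return float("inf")        # optimism for unseen pairs
            return sums[key] / counts[key] + c * np.sqrt(np.log(t + 1) / counts[key])

        arm = max(range(n_arms), key=index)
        r = pull(arm, congestion[arm])     # caller-supplied reward in [0, 1]
        key = (arm, congestion[arm])
        counts[key] += 1
        sums[key] += r
        total += r
        history.append(arm)
    return total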

    Online Learning and Bandits with Queried Hints

    We consider the classic online learning and stochastic multi-armed bandit (MAB) problems, when at each step, the online policy can probe and find out which of a small number ($k$) of choices has better reward (or loss) before making its choice. In this model, we derive algorithms whose regret bounds have exponentially better dependence on the time horizon compared to the classic regret bounds. In particular, we show that probing with $k=2$ suffices to achieve time-independent regret bounds for online linear and convex optimization. The same number of probes improves the regret bound of stochastic MAB with independent arms from $O(\sqrt{nT})$ to $O(n^2 \log T)$, where $n$ is the number of arms and $T$ is the horizon length. For stochastic MAB, we also consider a stronger model where a probe reveals the reward values of the probed arms, and show that in this case, $k=3$ probes suffice to achieve parameter-independent constant regret, $O(n^2)$. Such regret bounds cannot be achieved even with full feedback after the play, showcasing the power of limited ``advice'' via probing before making the play. We also present extensions to the setting where the hints can be imperfect, and to the case of stochastic MAB where the rewards of the arms can be correlated. Comment: To appear in ITCS 202
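
    A toy policy for the $k=2$ probe model is sketched below: before each play it probes the empirically best arm against a random challenger and plays whichever arm the probe reports as better this round. The `probe` callback and the challenger rule are illustrative assumptions; the algorithms achieving the stated bounds in the paper are more refined.

import numpy as np

def two_probe_policy(rng, probe, n_arms, horizon):
    """Probe two candidate arms each round and play the probe's winner
    (toy sketch of the k = 2 hint model, not the paper's algorithm)."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    total = 0.0
    for _ in range(horizon):
        # Empirical means, with optimistic values for untried arms.
        means = np.where(counts > 0, sums / np.maximum(counts, 1.0), 1.0)
        best = int(np.argmax(means))
        challenger = int(rng.integers(n_arms))     # rng: numpy Generator
        winner, reward = probe(best, challenger)   # hint: better arm this round
        counts[winner] += 1
        sums[winner] += reward
        total += reward
    return total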