600 research outputs found

On Learning to Rank Long Sequences with Contextual Bandits

Authors: Aggarwal Gaurav; Gentile Claudio; Li Shuai; Santara Anirban
Publication date: 07/06/2021

Motivated by problems of learning to rank long item sequences, we introduce a variant of the cascading bandit model that considers flexible length sequences with varying rewards and losses. We formulate two generative models for this problem within the generalized linear setting, and design and analyze upper confidence algorithms for it. Our analysis delivers tight regret bounds which, when specialized to vanilla cascading bandits, result in sharper guarantees than previously available in the literature. We evaluate our algorithms on a number of real-world datasets, and show significantly improved empirical performance as compared to known cascading bandit baselines.
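As a rough illustration of the feedback a cascading bandit learns from, the sketch below simulates the standard cascade click model, in which a user scans a ranked list from the top and clicks the first attractive item. It is not the flexible-length variant introduced in this paper; the function name `cascade_feedback` and the attraction probabilities are hypothetical.

```python
import random

def cascade_feedback(attraction_probs, rng):
    """Simulate one round of the standard cascade click model.

    Assumption: the vanilla cascade model, not the paper's variant.
    Returns (click_position, items_examined); click_position is None
    when the user reaches the end of the list without clicking.
    """
    for position, prob in enumerate(attraction_probs):
        # The user examines this item and clicks with probability `prob`.
        if rng.random() < prob:
            # Items after the click are never examined, so the learner
            # receives feedback only for positions 0..position.
            return position, position + 1
    return None, len(attraction_probs)

# Hypothetical ranked list of three items.
click, examined = cascade_feedback([0.0, 1.0, 0.5], random.Random(0))
```

Only the examined prefix of the list yields feedback, which is why the length and order of the recommended sequence matter to the learner.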

arXiv.org e-Print Archive

Observe Before Play: Multi-armed Bandit with Pre-observations

Authors: Joe-Wong Carlee; Zhang Xiaoxi; Zuo Jinhang
Publication date: 21/11/2019

We consider the stochastic multi-armed bandit (MAB) problem in a setting where a player can pay to pre-observe arm rewards before playing an arm in each round. Apart from the usual trade-off between exploring new arms to find the best one and exploiting the arm believed to offer the highest reward, we encounter an additional dilemma: pre-observing more arms gives a higher chance to play the best one, but incurs a larger cost. For the single-player setting, we design an Observe-Before-Play Upper Confidence Bound (OBP-UCB) algorithm for $K$ arms with Bernoulli rewards, and prove a $T$-round regret upper bound $O(K^2 \log T)$. In the multi-player setting, collisions will occur when players select the same arm to play in the same round. We design a centralized algorithm, C-MP-OBP, and prove its $T$-round regret relative to an offline greedy strategy is upper bounded in $O\left(\frac{K^4}{M^2} \log T\right)$ for $K$ arms and $M$ players. We also propose distributed versions of the C-MP-OBP policy, called D-MP-OBP and D-MP-Adapt-OBP, achieving logarithmic regret with respect to collision-free target policies. Experiments on synthetic data and wireless channel traces show that C-MP-OBP and D-MP-OBP outperform random heuristics and offline optimal policies that do not allow pre-observations.
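The cost dilemma described in this abstract can be made concrete with a toy calculation. The sketch below assumes a simplified model that is not taken from the paper: the player knows the Bernoulli means, pre-observes the `k` best arms at a fixed `cost` each, and earns reward 1 if at least one of them shows a success. The function names are hypothetical.

```python
import math

def expected_net_reward(means, k, cost):
    """Expected reward minus observation cost in the toy model (an assumption)."""
    top = sorted(means, reverse=True)[:k]
    # Probability that none of the k pre-observed arms shows a success.
    p_no_success = math.prod(1.0 - m for m in top)
    return (1.0 - p_no_success) - cost * k

def best_num_preobservations(means, cost):
    """Number of pre-observations (1..K) that maximizes expected net reward."""
    return max(range(1, len(means) + 1),
               key=lambda k: expected_net_reward(means, k, cost))

# Hypothetical arm means: a cheap observation cost favors observing more arms.
means = [0.9, 0.5, 0.1]
k_expensive = best_num_preobservations(means, cost=0.1)
k_cheap = best_num_preobservations(means, cost=0.01)
```

In this toy model the marginal gain from each extra observation shrinks while the cost grows linearly, so the best `k` drops as `cost` rises.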

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Asymptotically Optimal Algorithms for Budgeted Multiple Play Bandits

Authors: Chambaz Antoine; Kaufmann Emilie; Luedtke Alexander
Publication date: 01/09/2019

We study a generalization of the multi-armed bandit problem with multiple plays where there is a cost associated with pulling each arm and the agent has a budget at each time that dictates how much she can expect to spend. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in our setting. We then study a variant of Thompson sampling for Bernoulli rewards and a variant of KL-UCB for both single-parameter exponential families and bounded, finitely supported rewards. We show these algorithms are asymptotically optimal, both in rate and leading problem-dependent constants, including in the thick margin setting where multiple arms fall on the decision boundary.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot