Observe Before Play: Multi-armed Bandit with Pre-observations
We consider the stochastic multi-armed bandit (MAB) problem in a setting
where a player can pay to pre-observe arm rewards before playing an arm in each
round. Apart from the usual trade-off between exploring new arms to find the
best one and exploiting the arm believed to offer the highest reward, we
encounter an additional dilemma: pre-observing more arms gives a higher chance
to play the best one, but incurs a larger cost. For the single-player setting,
we design an Observe-Before-Play Upper Confidence Bound (OBP-UCB) algorithm for
$K$ arms with Bernoulli rewards, and prove a $T$-round regret upper bound of
$O(K^2 \log T)$. In the multi-player setting, collisions will occur when players
select the same arm to play in the same round. We design a centralized
algorithm, C-MP-OBP, and prove that its $T$-round regret relative to an offline
greedy strategy is upper bounded by $O(\frac{K^4}{M^2} \log T)$ for $K$ arms and
$M$ players. We also propose distributed versions of the C-MP-OBP policy,
called D-MP-OBP and D-MP-Adapt-OBP, achieving logarithmic regret with respect
to collision-free target policies. Experiments on synthetic data and wireless
channel traces show that C-MP-OBP and D-MP-OBP outperform random heuristics and
offline optimal policies that do not allow pre-observations.
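To make the observe-before-play idea concrete, here is a minimal Python sketch. It is our own simplification, not the paper's OBP-UCB policy: the stopping rule (stop at the first successful Bernoulli draw), the cost value, and the function name are assumptions for illustration only.

```python
import numpy as np

def obp_ucb_sketch(true_means, T, cost=0.05, seed=0):
    """Simplified observe-before-play with UCB indices.

    Each round, arms are pre-observed in decreasing UCB order, paying
    `cost` per observation; observation stops at the first Bernoulli
    reward of 1, and the last observed arm is played.
    """
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts = np.zeros(K)
    means = np.zeros(K)
    net_reward = 0.0
    for t in range(1, T + 1):
        ucb = means + np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1))
        ucb[counts == 0] = np.inf            # unobserved arms are tried first
        reward = 0
        for k in np.argsort(-ucb):           # pre-observe in decreasing UCB order
            reward = rng.binomial(1, true_means[k])
            net_reward -= cost               # pay for the pre-observation
            counts[k] += 1                   # observations update the estimates
            means[k] += (reward - means[k]) / counts[k]
            if reward == 1:                  # Bernoulli: cannot do better, play k
                break
        net_reward += reward                 # play the last observed arm
    return net_reward
```

A full policy would also weigh the expected gain of one more pre-observation against its cost, which is the dilemma the abstract describes; this sketch simply stops at the first success.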
Partial Bandit and Semi-Bandit: Making the Most Out of Scarce Users' Feedback
Recent works on Multi-Armed Bandits (MAB) and Combinatorial Multi-Armed
Bandits (COM-MAB) show good results on a global accuracy metric. This can be
achieved, in the case of recommender systems, with personalization. However,
with a combinatorial online learning approach, personalization implies a large
amount of user feedback. Such feedback can be hard to acquire when users need
to be directly and frequently solicited. For many fields of activity
undergoing the digitization of their business, online learning is unavoidable.
Thus, a number of approaches allowing implicit user feedback retrieval have
been implemented. Nevertheless, this implicit feedback can be misleading or
inefficient for the agent's learning. Herein, we propose a novel approach that
reduces the number of explicit feedbacks required by Combinatorial Multi-Armed
Bandit (COM-MAB) algorithms while providing levels of global accuracy and
learning efficiency similar to those of classical competitive methods. In this
paper, we present this approach for considering user feedback and evaluate it
using three distinct strategies. Despite a limited amount of feedback returned
by users (as low as 20% of the total), our approach obtains results similar to
those of state-of-the-art approaches.