Observe Before Play: Multi-armed Bandit with Pre-observations
We consider the stochastic multi-armed bandit (MAB) problem in a setting
where a player can pay to pre-observe arm rewards before playing an arm in each
round. Apart from the usual trade-off between exploring new arms to find the
best one and exploiting the arm believed to offer the highest reward, we
encounter an additional dilemma: pre-observing more arms gives a higher chance
to play the best one, but incurs a larger cost. For the single-player setting,
we design an Observe-Before-Play Upper Confidence Bound (OBP-UCB) algorithm for
$K$ arms with Bernoulli rewards, and prove a $T$-round regret upper bound of
$O(K^2 \log T)$. In the multi-player setting, collisions will occur when players
select the same arm to play in the same round. We design a centralized
algorithm, C-MP-OBP, and prove that its $T$-round regret relative to an offline
greedy strategy is upper bounded by $O(\frac{K^4}{M^2} \log T)$ for $K$ arms and
$M$ players. We also propose distributed versions of the C-MP-OBP policy,
called D-MP-OBP and D-MP-Adapt-OBP, achieving logarithmic regret with respect
to collision-free target policies. Experiments on synthetic data and wireless
channel traces show that C-MP-OBP and D-MP-OBP outperform random heuristics and
offline optimal policies that do not allow pre-observations.
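To make the observe-before-play idea concrete, here is a minimal Python sketch. It is our own simplification, not the paper's OBP-UCB policy: the stopping rule (stop at the first successful Bernoulli draw), the cost value, and the function name are assumptions for illustration only.

```python
import numpy as np

def obp_ucb_sketch(true_means, T, cost=0.05, seed=0):
    """Simplified observe-before-play with UCB indices.

    Each round, arms are pre-observed in decreasing UCB order, paying
    `cost` per observation; observation stops at the first Bernoulli
    reward of 1, and the last observed arm is played.
    """
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts = np.zeros(K)
    means = np.zeros(K)
    net_reward = 0.0
    for t in range(1, T + 1):
        ucb = means + np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1))
        ucb[counts == 0] = np.inf            # unobserved arms are tried first
        reward = 0
        for k in np.argsort(-ucb):           # pre-observe in decreasing UCB order
            reward = rng.binomial(1, true_means[k])
            net_reward -= cost               # pay for the pre-observation
            counts[k] += 1                   # observations update the estimates
            means[k] += (reward - means[k]) / counts[k]
            if reward == 1:                  # Bernoulli: cannot do better, play k
                break
        net_reward += reward                 # play the last observed arm
    return net_reward
```

A full policy would also weigh the expected gain of one more pre-observation against its cost, which is the dilemma the abstract describes; this sketch simply stops at the first success.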
Partial Bandit and Semi-Bandit: Making the Most Out of Scarce Users' Feedback
Recent works on Multi-Armed Bandits (MAB) and Combinatorial Multi-Armed
Bandits (COM-MAB) show good results on a global accuracy metric. This can be
achieved, in the case of recommender systems, with personalization. However,
with a combinatorial online learning approach, personalization implies a large
amount of user feedback. Such feedback can be hard to acquire when users need
to be directly and frequently solicited. For many fields of activity
undergoing the digitization of their business, online learning is unavoidable.
Thus, a number of approaches allowing implicit user feedback retrieval have
been implemented. Nevertheless, this implicit feedback can be misleading or
inefficient for the agent's learning. Herein, we propose a novel approach that
reduces the number of explicit feedbacks required by Combinatorial Multi-Armed
Bandit (COM-MAB) algorithms while providing levels of global accuracy and
learning efficiency similar to those of classical competitive methods. In this
paper, we present this approach for considering user feedback and evaluate it
using three distinct strategies. Despite a limited amount of feedback returned
by users (as low as 20% of the total), our approach obtains results similar to
those of state-of-the-art approaches.