Multiple-Play Bandits in the Position-Based Model
Sequentially learning to place items in multi-position displays or lists is a
task that can be cast into the multiple-play semi-bandit setting. A major
concern in this context, however, is that the system cannot always tell whether
the user feedback for each item is actually exploitable: much of the content
may simply have been ignored by the user. The present work proposes to exploit
available information regarding the display position bias under the so-called
Position-based click model (PBM). We first discuss how this model differs from
the Cascade model and its variants considered in several recent works on
multiple-play bandits. We then provide a novel regret lower bound for this
model as well as computationally efficient algorithms that display good
empirical and theoretical performance.
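The Position-Based Model above can be sketched as a simple generative process: each display slot l has an examination probability kappa_l, and an examined item a is clicked with probability theta_a. This is an illustrative sketch, not code from the paper; all names are hypothetical.

```python
import random

def pbm_click(position_bias, attractiveness, ranking):
    """Simulate one user session under the Position-Based Model (PBM).

    The item in slot l is examined with probability position_bias[l]
    (kappa_l) and, if examined, clicked with probability
    attractiveness[item] (theta_a). Returns one 0/1 click per slot.
    Variable names are illustrative, not from the paper.
    """
    clicks = []
    for kappa, item in zip(position_bias, ranking):
        examined = random.random() < kappa
        clicked = examined and random.random() < attractiveness[item]
        clicks.append(int(clicked))
    return clicks
```

Note that a zero click is ambiguous under this model: the user may have disliked the item or never examined its position, which is exactly the exploitability concern the abstract raises.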
Copeland Dueling Bandits
A version of the dueling bandit problem is addressed in which a Condorcet
winner may not exist. Two algorithms are proposed that instead seek to minimize
regret with respect to the Copeland winner, which, unlike the Condorcet winner,
is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed
for small numbers of arms, while the second, Scalable Copeland Bandits (SCB),
works better for large-scale problems. We provide theoretical results bounding
the regret accumulated by CCB and SCB, both substantially improving existing
results. Such existing results either offer bounds of the form O(K log T) but
require restrictive assumptions, or offer bounds of the form O(K^2 log T)
without requiring such assumptions. Our results offer the best of both worlds:
O(K log T) bounds without restrictive assumptions.
Comment: 33 pages, 8 figures
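The Copeland winner mentioned above can be computed directly from a pairwise preference matrix: it is any arm that beats the largest number of other arms. A minimal sketch, not tied to the CCB or SCB algorithms themselves:

```python
def copeland_winners(pref):
    """Return the Copeland winner(s) of a pairwise preference matrix.

    pref[i][j] is the probability that arm i beats arm j in a duel;
    arm i "beats" arm j when pref[i][j] > 0.5. A Copeland winner
    maximizes the number of arms it beats and, unlike a Condorcet
    winner, always exists (possibly as a tie).
    """
    k = len(pref)
    scores = [sum(1 for j in range(k) if j != i and pref[i][j] > 0.5)
              for i in range(k)]
    best = max(scores)
    return [i for i, s in enumerate(scores) if s == best]
```

For cyclic preferences (arm 0 beats 1, 1 beats 2, 2 beats 0) no Condorcet winner exists, yet every arm is a Copeland winner with score 1, which is why regret against the Copeland winner remains well defined.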
Factored Bandits
We introduce the factored bandits model, which is a framework for learning
with limited (bandit) feedback, where actions can be decomposed into a
Cartesian product of atomic actions. Factored bandits incorporate rank-1
bandits as a special case, but significantly relax the assumptions on the form
of the reward function. We provide an anytime algorithm for stochastic
factored bandits, with upper and lower regret bounds for the problem that
match up to constants. Furthermore, we show that with a slight modification
the proposed algorithm can be applied to utility-based dueling bandits. We
obtain an improvement in the additive terms of the regret bound compared to
state-of-the-art algorithms (the additive terms dominate up to time horizons
that are exponential in the number of arms).
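The factored action space described above, and the rank-1 special case it generalizes, can be illustrated in a few lines. Here a joint action is one atomic action per factor, and the rank-1 reward is the product u_a * v_b; the names u and v are hypothetical, and general factored bandits relax this product form.

```python
import itertools

def rank1_reward(action, u, v):
    """Rank-1 reward, a special case of factored bandits: the expected
    reward of joint action (a, b) is u[a] * v[b]. Illustrative only."""
    a, b = action
    return u[a] * v[b]

# Two factors with 2 and 3 atomic actions; the joint action set is
# their Cartesian product (2 * 3 = 6 joint actions).
factors = [range(2), range(3)]
joint = list(itertools.product(*factors))

u, v = [0.2, 0.9], [0.1, 0.5, 0.8]
best = max(joint, key=lambda act: rank1_reward(act, u, v))
```

The point of the factored structure is that feedback about one atomic action is shared across all joint actions containing it, rather than learning each of the (exponentially many) joint actions independently.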