On Learning to Rank Long Sequences with Contextual Bandits
Motivated by problems of learning to rank long item sequences, we introduce a
variant of the cascading bandit model that considers flexible length sequences
with varying rewards and losses. We formulate two generative models for this
problem within the generalized linear setting, and design and analyze upper
confidence algorithms for it. Our analysis delivers tight regret bounds which,
when specialized to vanilla cascading bandits, result in sharper guarantees
than previously available in the literature. We evaluate our algorithms on a
number of real-world datasets, and show significantly improved empirical
performance compared to known cascading bandit baselines.
Observe Before Play: Multi-armed Bandit with Pre-observations
We consider the stochastic multi-armed bandit (MAB) problem in a setting
where a player can pay to pre-observe arm rewards before playing an arm in each
round. Apart from the usual trade-off between exploring new arms to find the
best one and exploiting the arm believed to offer the highest reward, we
encounter an additional dilemma: pre-observing more arms gives a higher chance
to play the best one, but incurs a larger cost. For the single-player setting,
we design an Observe-Before-Play Upper Confidence Bound (OBP-UCB) algorithm for
arms with Bernoulli rewards, and prove a regret upper bound.
In the multi-player setting, collisions will occur when players
select the same arm to play in the same round. We design a centralized
algorithm, C-MP-OBP, and prove that its regret relative to an offline
greedy strategy is upper bounded in terms of the number of arms and
players. We also propose distributed versions of the C-MP-OBP policy,
called D-MP-OBP and D-MP-Adapt-OBP, achieving logarithmic regret with respect
to collision-free target policies. Experiments on synthetic data and wireless
channel traces show that C-MP-OBP and D-MP-OBP outperform random heuristics and
offline optimal policies that do not allow pre-observations.
Asymptotically Optimal Algorithms for Budgeted Multiple Play Bandits
We study a generalization of the multi-armed bandit problem with multiple
plays where there is a cost associated with pulling each arm and the agent has
a budget at each time that dictates how much she can expect to spend. We derive
an asymptotic regret lower bound for any uniformly efficient algorithm in our
setting. We then study a variant of Thompson sampling for Bernoulli rewards and
a variant of KL-UCB for both single-parameter exponential families and bounded,
finitely supported rewards. We show these algorithms are asymptotically
optimal, both in rate and in leading problem-dependent constants, including in the
thick margin setting where multiple arms fall on the decision boundary.