Rotting bandits are not harder than stochastic ones
In stochastic multi-armed bandits, the reward distribution of each arm is
assumed to be stationary. This assumption is often violated in practice (e.g.,
in recommendation systems), where the reward of an arm may change each time it
is selected, i.e., the rested bandit setting. In this paper, we consider the
non-parametric rotting bandit setting, where rewards can only decrease. We
introduce the filtering on expanding window average (FEWA) algorithm that
constructs moving averages of increasing windows to identify arms that are more
likely to return high rewards when pulled once more. We prove that, for an
unknown horizon $T$, and without any knowledge of the decreasing behavior of
the $K$ arms, FEWA achieves a problem-dependent regret bound of
$\widetilde{O}(\log(KT))$ and a problem-independent one of
$\widetilde{O}(\sqrt{KT})$. Our result substantially improves over
the algorithm of Levine et al. (2017), which suffers regret
$\widetilde{O}(K^{1/3}T^{2/3})$. FEWA also matches known bounds for
the stochastic bandit setting, thus showing that the rotting bandits are not
harder. Finally, we report simulations confirming the theoretical improvements
of FEWA.
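For illustration, here is a minimal sketch of the expanding-window filtering idea in Python. It assumes a known horizon for the confidence width (the paper handles an unknown horizon), and the callback pull, the noise scale sigma, and the constant alpha are illustrative assumptions rather than the paper's exact tuning.

import math
from collections import defaultdict

def fewa(pull, n_arms, horizon, sigma=1.0, alpha=4.0):
    # Sketch of Filtering on Expanding Window Averages (FEWA).
    # pull(arm) returns one (possibly rotting) stochastic reward.
    history = defaultdict(list)              # rewards of each arm, in pull order

    def window_mean(arm, h):
        return sum(history[arm][-h:]) / h    # average of the last h rewards

    def width(h):
        # confidence width for a window of size h (illustrative tuning)
        return math.sqrt(2 * alpha * sigma ** 2 * math.log(horizon) / h)

    for arm in range(n_arms):                # pull each arm once to initialize
        history[arm].append(pull(arm))

    for _ in range(horizon - n_arms):
        active, h = set(range(n_arms)), 1
        chosen = None
        while chosen is None:
            undersampled = [a for a in active if len(history[a]) < h]
            if undersampled:
                chosen = undersampled[0]     # explore a surviving arm first
            else:
                # filter out arms whose window-h average is clearly worse
                best = max(window_mean(a, h) for a in active)
                active = {a for a in active
                          if window_mean(a, h) >= best - 2 * width(h)}
                h += 1
        history[chosen].append(pull(chosen))
    return history

Because the filter never removes the empirically best arm, the active set stays nonempty, and some surviving arm eventually has fewer than h samples and gets pulled.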
Training a Single Bandit Arm
The stochastic multi-armed bandit problem captures the fundamental
exploration vs. exploitation tradeoff inherent in online decision-making in
uncertain settings. However, in several applications, the traditional objective
of maximizing the expected sum of rewards obtained can be inappropriate.
Motivated by the problem of optimizing job assignments to groom novice workers
with unknown trainability in labor platforms, we consider a new objective in
the classical setup. Instead of maximizing the expected total reward from
$T$ pulls, we consider the vector of cumulative rewards earned from each of the
$K$ arms at the end of the $T$ pulls, and aim to maximize the expected value of the
highest reward. This corresponds to the objective of grooming a
single, highly skilled worker using a limited supply of training jobs.
For this new objective, we show that any policy must incur a regret of
$\Omega(K^{1/3}T^{2/3})$ in the worst case. We design an explore-then-commit
policy featuring exploration based on finely tuned confidence bounds on the
mean reward and an adaptive stopping criterion, which adapts to the problem
difficulty and guarantees a regret of $\widetilde{O}(K^{1/3}T^{2/3})$ in the
worst case. Our numerical experiments demonstrate that this policy improves
upon several natural candidate policies for this setting.
Comment: 23 pages, 1 figure, 1 table
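As a rough illustration of the shape of such a policy (not the authors' exact construction), the sketch below explores in round robin, stops adaptively once one arm's lower confidence bound dominates every other arm's upper bound, and commits the remaining budget to that arm. The callback pull and the plain Hoeffding-style radius are assumptions; the paper's confidence bounds are tuned more finely.

import math

def explore_then_commit(pull, n_arms, horizon, delta=0.01):
    # Sketch: maximize the expected highest per-arm cumulative reward.
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    cumulative = [0.0] * n_arms              # per-arm cumulative rewards
    best = 0

    def radius(n):
        # plain Hoeffding-style radius; an illustrative choice
        return math.sqrt(math.log(2 * n_arms * horizon / delta) / (2 * n))

    t = 0
    while t < horizon:
        arm = t % n_arms                     # round-robin exploration
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        cumulative[arm] += r
        t += 1
        if min(counts) > 0:
            means = [s / c for s, c in zip(sums, counts)]
            lcb = [m - radius(c) for m, c in zip(means, counts)]
            ucb = [m + radius(c) for m, c in zip(means, counts)]
            best = max(range(n_arms), key=lambda a: lcb[a])
            # adaptive stopping: commit once one arm clearly dominates
            if all(lcb[best] >= ucb[a] for a in range(n_arms) if a != best):
                break
    for _ in range(horizon - t):             # commit phase: train one arm
        cumulative[best] += pull(best)
    return max(cumulative)

Under this objective, pulls spent on non-committed arms contribute nothing to the final maximum, which is why the stopping time, rather than the committed arm alone, drives the regret.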
A Field Test of Bandit Algorithms for Recommendations: Understanding the Validity of Assumptions on Human Preferences in Multi-armed Bandits
Personalized recommender systems suffuse modern life, shaping what media we
read and what products we consume. Algorithms powering such systems tend to
consist of supervised learning-based heuristics, such as latent factor models
with a variety of heuristically chosen prediction targets. Meanwhile,
theoretical treatments of recommendation frequently address the
decision-theoretic nature of the problem, including the need to balance
exploration and exploitation, via the multi-armed bandits (MABs) framework.
However, MAB-based approaches rely heavily on assumptions about human
preferences. These preference assumptions are seldom tested using human subject
studies, partly due to the lack of publicly available toolkits to conduct such
studies. In this work, we conduct a study with crowdworkers in a comics
recommendation MAB setting. Each arm represents a comic category, and users
provide feedback after each recommendation. We check the validity of a core
MAB assumption, namely that human preferences (reward distributions) are fixed
over time, and find that it does not hold. This finding suggests that any MAB
algorithm used for recommender systems should account for human preference
dynamics. While answering these questions, we provide a flexible experimental
framework for understanding human preference dynamics and testing MAB
algorithms with human users. The code for our experimental framework and the
collected data can be found at
https://github.com/HumainLab/human-bandit-evaluation.
Comment: Accepted to CHI. 16 pages, 6 figures
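The authors' toolkit and data live in the linked repository; as a simple standalone illustration of the kind of check involved, one can split a user's per-category ratings chronologically and run a permutation test on the difference in means. The function name and example data below are hypothetical, not the paper's analysis.

import random

def stationarity_pvalue(ratings, n_perm=10_000, seed=0):
    # Permutation test: are early and late ratings of one comic
    # category (one arm) plausibly drawn from the same distribution?
    rng = random.Random(seed)
    half = len(ratings) // 2
    early, late = ratings[:half], ratings[half:]
    observed = abs(sum(late) / len(late) - sum(early) / len(early))
    pooled = list(ratings)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:half], pooled[half:]
        if abs(sum(b) / len(b) - sum(a) / len(a)) >= observed:
            hits += 1
    return hits / n_perm                     # small value suggests drift

# hypothetical ratings of one category over a session, drifting downward
print(stationarity_pvalue([5, 5, 4, 5, 4, 3, 3, 2, 3, 2]))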
Finite Continuum-Armed Bandits
We consider a situation where an agent has $m$ resources to be allocated to
a larger number $n$ of actions. Each action can be completed at most once and
results in a stochastic reward with unknown mean. The goal of the agent is to
maximize her cumulative reward. Nontrivial strategies are possible when side
information on the actions is available, for example in the form of covariates.
Focusing on a nonparametric setting, where the mean reward is an unknown
function of a one-dimensional covariate, we propose an optimal strategy for
this problem. Under natural assumptions on the reward function, we prove that
the optimal regret scales as $\sqrt{m}$ up to poly-logarithmic factors when
the budget $m$ is proportional to the number of actions $n$. When $m$ becomes
small compared to $n$, a smooth transition occurs. When the ratio $m/n$
decreases from a constant to $m^{-1/3}$, the regret increases progressively up
to the $m^{2/3}$ rate encountered in continuum-armed bandits.
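The paper's optimal strategy is more refined, but a natural baseline that makes the covariate structure concrete is to bin the one-dimensional covariate and run UCB over bins, consuming each action at most once. Everything below (function name, pull callback, bin count) is an illustrative assumption; the bin count trades smoothness bias against estimation noise.

import math

def binned_ucb(actions, pull, budget, n_bins):
    # actions: list of (action_id, covariate in [0, 1]); budget <= len(actions).
    # Each action is playable at most once; the mean reward is a smooth
    # unknown function of the covariate, so nearby actions share a bin.
    bins = [[] for _ in range(n_bins)]
    for action_id, x in actions:
        bins[min(int(x * n_bins), n_bins - 1)].append(action_id)
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    total = 0.0
    for t in range(1, budget + 1):
        def ucb(b):
            if not bins[b]:
                return -math.inf             # bin exhausted
            if counts[b] == 0:
                return math.inf              # sample every nonempty bin once
            return sums[b] / counts[b] + math.sqrt(2 * math.log(t) / counts[b])
        b = max(range(n_bins), key=ucb)
        reward = pull(bins[b].pop())         # consume a fresh action
        counts[b] += 1
        sums[b] += reward
        total += reward
    return total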
Bandit problems with fidelity rewards
The fidelity bandits problem is a variant of the $K$-armed bandit problem in which the reward of each arm is augmented by a fidelity reward that provides the player with an additional payoff depending on how ‘loyal’ the player has been to that arm in the past. We propose two models for fidelity. In the loyalty-points model, the amount of extra reward depends on the number of times the arm has previously been played. In the subscription model, the additional reward depends on the current number of consecutive draws of the arm. We consider both stochastic and adversarial problems. Since single-arm strategies are not always optimal in stochastic problems, the notion of regret in the adversarial setting needs careful adjustment. We introduce three possible notions of regret and investigate which of them can be bounded sublinearly. We study in detail the special cases of increasing, decreasing, and coupon fidelity rewards (where, in the coupon case, the player gets an additional reward after every $m$ plays of an arm). For the models which do not necessarily enjoy sublinear regret, we provide a worst-case lower bound. For those models which exhibit sublinear regret, we provide algorithms and bound their regret.
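To make the two fidelity models concrete, here is a small sketch of how a realized reward could be augmented. The function names and the coupon example are illustrative, not the paper's notation.

def fidelity_reward(base, arm, history, model, bonus):
    # base: the arm's stochastic reward; history: past arm choices in order.
    if model == "loyalty":
        # loyalty-points model: bonus depends on total past plays of the arm
        return base + bonus(history.count(arm))
    if model == "subscription":
        # subscription model: bonus depends on the current consecutive streak
        streak = 0
        for past in reversed(history):
            if past != arm:
                break
            streak += 1
        return base + bonus(streak)
    raise ValueError(model)

# coupon fidelity: one extra unit of reward on every 3rd play of an arm
coupon = lambda n_prev: 1.0 if (n_prev + 1) % 3 == 0 else 0.0
print(fidelity_reward(0.5, "a", ["a", "b", "a"], "loyalty", coupon))  # 1.5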