Rotting bandits are not harder than stochastic ones
In stochastic multi-armed bandits, the reward distribution of each arm is
assumed to be stationary. This assumption is often violated in practice (e.g.,
in recommendation systems), where the reward of an arm may change whenever it is
selected, i.e., the rested bandit setting. In this paper, we consider the
non-parametric rotting bandit setting, where rewards can only decrease. We
introduce the filtering on expanding window average (FEWA) algorithm that
constructs moving averages of increasing windows to identify arms that are more
likely to return high rewards when pulled once more. We prove that for an
unknown horizon T, and without any knowledge on the decreasing behavior of
the K arms, FEWA achieves a problem-dependent regret bound of Õ(log T)
and a problem-independent one of Õ(√(KT)). Our result substantially improves over
the algorithm of Levine et al. (2017), which suffers regret
Õ(K^{1/3} T^{2/3}). FEWA also matches known bounds for
the stochastic bandit setting, thus showing that the rotting bandits are not
harder. Finally, we report simulations confirming the theoretical improvements
of FEWA.
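The filtering idea above can be illustrated with a small sketch. The interface `mean_fns`, the noise model, the doubling window schedule, and the confidence constant are all illustrative choices, not the paper's tuned algorithm: arms whose recent average over an expanding window falls too far below the best recent average are filtered out, and among the survivors the least-pulled arm is played.

```python
import math
import random

def fewa_sketch(mean_fns, horizon, seed=0):
    """Toy sketch of filtering on expanding window averages.

    mean_fns[i](n) is an assumed interface giving arm i's mean reward at its
    (n+1)-th pull; in the rotting setting it is non-increasing in n.
    """
    rng = random.Random(seed)
    K = len(mean_fns)
    history = [[] for _ in range(K)]          # per-arm observed rewards
    for _ in range(horizon):
        untried = [i for i in range(K) if not history[i]]
        if untried:
            arm = untried[0]                  # pull every arm once first
        else:
            active = list(range(K))
            h = 1
            # Filter with averages over windows of doubling size: keep arms
            # whose recent average is close to the best recent average.
            while len(active) > 1 and h <= min(len(history[i]) for i in active):
                conf = math.sqrt(2 * math.log(horizon) / h)
                means = {i: sum(history[i][-h:]) / h for i in active}
                best = max(means.values())
                active = [i for i in active if means[i] >= best - conf]
                h *= 2
            # Among the surviving arms, pull the least-pulled one.
            arm = min(active, key=lambda i: len(history[i]))
        mu = mean_fns[arm](len(history[arm]))
        history[arm].append(mu + rng.gauss(0.0, 0.1))
    return [len(rewards) for rewards in history]
```

With one constant arm and one rotting arm, the rotting arm keeps being played only until its expanding-window averages reveal the gap, after which the filter removes it round after round.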
The Assistive Multi-Armed Bandit
Learning preferences implicit in the choices humans make is a well studied
problem in both economics and computer science. However, most work makes the
assumption that humans are acting (noisily) optimally with respect to their
preferences. Such approaches can fail when people are themselves learning about
what they want. In this work, we introduce the assistive multi-armed bandit,
where a robot assists a human playing a bandit task to maximize cumulative
reward. In this problem, the human does not know the reward function but can
learn it through the rewards received from arm pulls; the robot only observes
which arms the human pulls but not the reward associated with each pull. We
offer sufficient and necessary conditions for successfully assisting the human
in this framework. Surprisingly, better human performance in isolation does not
necessarily lead to better performance when assisted by the robot: a human
policy can do better by effectively communicating its observed rewards to the
robot. We conduct proof-of-concept experiments that support these results. We
see this work as contributing towards a theory behind algorithms for
human-robot interaction.
Comment: Accepted to HRI 2019
Training a Single Bandit Arm
The stochastic multi-armed bandit problem captures the fundamental
exploration vs. exploitation tradeoff inherent in online decision-making in
uncertain settings. However, in several applications, the traditional objective
of maximizing the expected sum of rewards obtained can be inappropriate.
Motivated by the problem of optimizing job assignments to groom novice workers
with unknown trainability in labor platforms, we consider a new objective in
the classical setup. Instead of maximizing the expected total reward from
T pulls, we consider the vector of cumulative rewards earned from each of the
K arms at the end of T pulls, and aim to maximize the expected value of the
highest of these cumulative rewards. This corresponds to the objective of grooming a
single, highly skilled worker using a limited supply of training jobs.
For this new objective, we establish a worst-case regret lower bound that any
policy must incur. We design an explore-then-commit
policy featuring exploration based on finely tuned confidence bounds on the
mean reward and an adaptive stopping criterion, which adapts to the problem
difficulty and admits a corresponding worst-case regret guarantee.
Our numerical experiments demonstrate that this policy improves
upon several natural candidate policies for this setting.
Comment: 23 pages, 1 figure, 1 table
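The commit step can be made concrete with a minimal sketch. The paper's policy tunes confidence bounds and stops exploring adaptively; here the exploration budget is a fixed parameter and the noise model is an illustrative assumption. The point is the objective: all post-exploration pulls are funneled into one arm so that a single arm's cumulative reward is maximized.

```python
import random

def etc_single_arm(arm_means, horizon, explore_per_arm, seed=0):
    """Sketch of explore-then-commit for the max-cumulative-reward objective:
    explore uniformly, then commit every remaining pull to the empirically
    best arm so that a single arm accumulates the highest total reward."""
    rng = random.Random(seed)
    K = len(arm_means)
    totals = [0.0] * K
    counts = [0] * K
    t = 0
    for _ in range(explore_per_arm):          # uniform exploration phase
        for i in range(K):
            if t == horizon:
                break
            totals[i] += arm_means[i] + rng.gauss(0.0, 0.1)
            counts[i] += 1
            t += 1
    # Commit: all remaining pulls go to the arm with the best empirical mean.
    best = max(range(K), key=lambda i: totals[i] / max(counts[i], 1))
    while t < horizon:
        totals[best] += arm_means[best] + rng.gauss(0.0, 0.1)
        t += 1
    return best, max(totals)
```

Under the max-reward objective, even exploration pulls are "wasted" on arms that will not be the champion, which is why the adaptive stopping rule in the paper matters: it caps how much of the training budget is spent identifying the trainee.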
Rotting infinitely many-armed bandits
We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate ϱ = o(1). We show that this learning problem has an Ω(max{ϱ^{1/3}T, √T}) worst-case regret lower bound, where T is the time horizon. We show that a matching upper bound Õ(max{ϱ^{1/3}T, √T}), up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove it from further consideration, when the algorithm knows the value of the maximum rotting rate ϱ. We also show that an Õ(max{ϱ^{1/3}T, T^{3/4}}) regret upper bound can be achieved by an algorithm that does not know the value of ϱ, by using an adaptive UCB index along with an adaptive threshold value.
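The index-plus-threshold rule admits a compact sketch. The reservoir distribution `new_arm_mean`, the linear rotting trend, and the fixed threshold are illustrative stand-ins for the paper's arbitrary trends and tuned parameters: the policy keeps pulling its current arm while an optimistic index stays above the threshold, and otherwise discards it and samples a fresh arm.

```python
import math
import random

def rotting_ucb_threshold(new_arm_mean, rho, threshold, horizon, seed=0):
    """Sketch of a UCB-index-plus-threshold rule for rotting infinitely
    many-armed bandits: pull the current arm while its optimistic index
    exceeds `threshold`; otherwise draw a fresh arm from the reservoir."""
    rng = random.Random(seed)
    total, arms_tried = 0.0, 0
    mean, pulls, cum = None, 0, 0.0
    for _ in range(horizon):
        if mean is None:
            mean = new_arm_mean(rng)          # sample a fresh arm
            arms_tried += 1
            pulls, cum = 0, 0.0
        r = mean + rng.gauss(0.0, 0.1)
        cum, pulls, total = cum + r, pulls + 1, total + r
        mean = max(0.0, mean - rho)           # the arm rots at rate <= rho
        index = cum / pulls + math.sqrt(2 * math.log(horizon) / pulls)
        if index < threshold:
            mean = None                       # discard and move on
    return total, arms_tried
```

Because new arms are always available, the policy never needs to revisit a discarded arm; the threshold trades off how long a rotting arm is tolerated against the cost of restarting exploration on a fresh one.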
Stochastic Bandits with Delay-Dependent Payoffs
Motivated by recommendation problems in music streaming platforms, we propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a class of ranking policies provably approximating, to within a constant factor, the expected reward of the optimal policy. We show an algorithm whose regret with respect to the best ranking policy is bounded by Õ(√(kT)), where k is the number of arms and T is time. Our algorithm uses only O(k ln ln T) switches, which helps when switching between policies is costly. As constructing the class of learning policies requires ordering the arms according to their expectations, we also bound the number of pulls required to do so. Finally, we run experiments to compare our algorithm against UCB on different problem instances.
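A ranking policy in this model is easy to evaluate by simulation. The payoff function and the convention for never-pulled arms below are illustrative assumptions; the sketch shows only the mechanics: arms are played cyclically in a fixed order, and each pull's expected payoff depends on the delay since that arm was last played.

```python
def ranking_policy_reward(payoff, ranking, horizon):
    """Sketch: play the arms in `ranking` cyclically. payoff(arm, delay)
    gives the expected payoff as a function of the number of rounds since
    the arm was last pulled (the delay-dependent reward model)."""
    last_pull = {arm: None for arm in ranking}
    total = 0.0
    for t in range(horizon):
        arm = ranking[t % len(ranking)]
        # Convention: an arm never pulled before has delay t + 1.
        delay = t - last_pull[arm] if last_pull[arm] is not None else t + 1
        total += payoff(arm, delay)
        last_pull[arm] = t
    return total
```

With payoffs that recover as delay grows (e.g., listener appetite for a song replenishing between plays), longer rankings rest each arm for more rounds, which is exactly the trade-off the approximation result quantifies.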
Congested Bandits: Optimal Routing via Short-term Resets
For traffic routing platforms, the choice of which route to recommend to a
user depends on the congestion on these routes -- indeed, an individual's
utility depends on the number of people using the recommended route at that
instance. Motivated by this, we introduce the problem of Congested Bandits
where each arm's reward is allowed to depend on the number of times it was
played within a recent window of timesteps. This dependence on the past history of
actions leads to a dynamical system where an algorithm's present choices also
affect its future pay-offs, and requires an algorithm to plan for this. We
study the congestion-aware formulation in the multi-armed bandit (MAB) setup
and in the contextual bandit setup with linear rewards. For the multi-armed
setup, we propose a UCB-style algorithm and bound its policy regret. For the
linear contextual bandit setup, our algorithm, based on an iterative least
squares planner, achieves a corresponding policy regret bound. From an
experimental standpoint, we corroborate the no-regret properties of our
algorithms via a simulation study.
Comment: Published at ICML 2022
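The congestion effect can be made concrete with a small simulation. The linear penalty model and plain UCB selection rule below are illustrative assumptions, not the paper's planner: an arm's realized reward drops with how often it was played in the last `window` rounds, so a myopic index must implicitly balance its own recent load.

```python
import math
import random

def congested_ucb(base_means, penalty, window, horizon, seed=0):
    """Sketch of a congested bandit environment: the realized reward of an
    arm is its base mean minus a penalty proportional to how often the arm
    was played within the last `window` rounds; selection uses a plain UCB
    index on observed rewards."""
    rng = random.Random(seed)
    K = len(base_means)
    counts, sums = [0] * K, [0.0] * K
    recent = []                               # arms played in recent rounds
    for t in range(horizon):
        if t < K:
            arm = t                           # play each arm once to start
        else:
            arm = max(range(K), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t + 1) / counts[i]))
        load = recent.count(arm)              # congestion from recent plays
        reward = base_means[arm] - penalty * load + rng.gauss(0.0, 0.05)
        sums[arm] += reward
        counts[arm] += 1
        recent.append(arm)
        if len(recent) > window:
            recent.pop(0)
    return counts
```

Because today's pull raises tomorrow's congestion on the same arm, the benchmark is policy regret against a planner that anticipates these dynamics, not per-round regret against a fixed best arm.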
Online Learning and Bandits with Queried Hints
We consider the classic online learning and stochastic multi-armed bandit
(MAB) problems, when at each step, the online policy can probe and find out
which of a small number of choices has better reward (or loss) before
making its choice. In this model, we derive algorithms whose regret bounds have
exponentially better dependence on the time horizon compared to the classic
regret bounds. In particular, we show that such probing suffices to
achieve time-independent regret bounds for online linear and convex
optimization. The same number of probes improves the regret bound of
stochastic MAB with n independent arms, exponentially reducing its
dependence on the horizon length T. For stochastic MAB, we also
consider a stronger model where a probe reveals the reward values of the probed
arms, and show that in this case, probing suffices to achieve
parameter-independent constant regret, O(1). Such regret bounds cannot be
achieved even with full feedback after the play, showcasing the power of
limited ``advice'' via probing before making the play. We also present
extensions to the setting where the hints can be imperfect, and to the case of
stochastic MAB where the rewards of the arms can be correlated.
Comment: To appear in ITCS 2023
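Why a probe is so powerful is visible even without any learning. The arm set, noise model, and uniform probing below are illustrative assumptions: each round the policy probes m arms, observes which draw is better, and plays that one. Taking the max of m draws already lifts the per-round reward above the average arm, before any estimation of arm means is done.

```python
import random

def best_of_m_probes(arm_means, horizon, m=2, seed=0):
    """Toy illustration of the probe model: each round, probe m arms chosen
    uniformly at random, learn which draw is better, and play that arm."""
    rng = random.Random(seed)
    K = len(arm_means)
    total = 0.0
    for _ in range(horizon):
        probed = rng.sample(range(K), m)
        draws = [arm_means[i] + rng.gauss(0.0, 0.1) for i in probed]
        total += max(draws)          # play whichever probe looked better
    return total
```

The algorithms in the paper go further by choosing which arms to probe adaptively; this sketch only shows the baseline advantage that even "advice" on a random pair confers before the play is made.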