Unimodal Bandits: Regret Lower Bounds and Optimal Algorithms
We consider stochastic multi-armed bandits where the expected reward is a
unimodal function over partially ordered arms. This important class of problems
has been recently investigated in (Cope 2009, Yu 2011). The set of arms is
either discrete, in which case arms correspond to the vertices of a finite
graph whose structure represents similarity in rewards, or continuous, in which
case arms belong to a bounded interval. For discrete unimodal bandits, we
derive asymptotic lower bounds for the regret achieved under any algorithm, and
propose OSUB, an algorithm whose regret matches this lower bound. Our algorithm
optimally exploits the unimodal structure of the problem, and surprisingly, its
asymptotic regret does not depend on the number of arms. We also provide a
regret upper bound for OSUB in non-stationary environments where the expected
rewards smoothly evolve over time. The analytical results are supported by
numerical experiments showing that OSUB performs significantly better than the
state-of-the-art algorithms. For continuous sets of arms, we provide a brief
discussion. We show that combining an appropriate discretization of the set of
arms with the UCB algorithm yields an order-optimal regret, and in practice,
outperforms recently proposed algorithms designed to exploit the unimodal
structure.
Comment: ICML 2014 (technical report). arXiv admin note: text overlap with
arXiv:1307.730
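The discretize-then-UCB approach mentioned for continuous arms can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the grid size, horizon, and test function are arbitrary choices of ours, and plain UCB1 is run on the grid points.

```python
import math
import random

def ucb_on_grid(reward_fn, n_points=16, horizon=20000, seed=0):
    """Run UCB1 on a uniform discretization of [0, 1].

    reward_fn maps an arm location x in [0, 1] to the mean of a
    Bernoulli reward.  The grid size here is an arbitrary choice for
    illustration; the abstract states that a suitable discretization
    combined with UCB achieves order-optimal regret.
    """
    rng = random.Random(seed)
    grid = [i / (n_points - 1) for i in range(n_points)]
    counts = [0] * n_points
    sums = [0.0] * n_points
    for t in range(1, horizon + 1):
        if t <= n_points:
            i = t - 1  # pull each arm once to initialize
        else:
            # pick the arm maximizing the UCB1 index
            i = max(range(n_points),
                    key=lambda a: sums[a] / counts[a]
                    + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < reward_fn(grid[i]) else 0.0
        counts[i] += 1
        sums[i] += reward
    # report the most-pulled grid point as the estimated maximizer
    best = max(range(n_points), key=lambda a: counts[a])
    return grid[best]

# Unimodal mean reward peaking at x = 0.7 (hypothetical test function).
x_hat = ucb_on_grid(lambda x: 0.9 - abs(x - 0.7))
```

Under a unimodal mean-reward function, the most-pulled grid point concentrates near the mode once the horizon is large relative to the grid resolution.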
Stochastic Bandit Models for Delayed Conversions
Online advertising and product recommendation are important domains of
applications for multi-armed bandit methods. In these fields, the reward that
is immediately available is most often only a proxy for the actual outcome of
interest, which we refer to as a conversion. For instance, in web advertising,
clicks can be observed within a few seconds after an ad display but the
corresponding sale --if any-- will take hours, if not days to happen. This
paper proposes and investigates a new stochastic multi-armed bandit model in
the framework proposed by Chapelle (2014) --based on empirical studies in the
field of web advertising-- in which each action may trigger a future reward
that will then happen with a stochastic delay. We assume that the probability
of conversion associated with each action is unknown while the distribution of
the conversion delay is known, distinguishing between the (idealized) case
where the conversion events may be observed whatever their delay and the more
realistic setting in which late conversions are censored. We provide
performance lower bounds as well as two simple but efficient algorithms based
on the UCB and KLUCB frameworks. The latter algorithm, which is preferable when
conversion rates are low, is based on a Poissonization argument, of independent
interest in other settings where aggregation of Bernoulli observations with
different success probabilities is required.
Comment: Conference on Uncertainty in Artificial Intelligence, Aug 2017,
Sydney, Australia
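The core idea behind estimating conversion probabilities from censored observations can be sketched as follows. This is our simplified illustration of the principle, not the paper's exact estimator: since the delay distribution is known, a pull made s time units ago can only have produced an observable conversion with probability F(s), so the raw conversion count is rescaled accordingly. All function and variable names here are hypothetical.

```python
def delayed_conversion_estimate(pull_times, conversion_observed, now, delay_cdf):
    """Estimate a conversion probability from censored observations.

    pull_times[j] is when pull j was made; conversion_observed[j] is True
    if its conversion has been seen by time `now`.  A conversion with
    probability p and delay CDF F is observable for a pull of age s with
    probability p * F(s), so we rescale the raw count:

        p_hat = (# observed conversions) / sum_j F(now - t_j)
    """
    mass = sum(delay_cdf(now - t) for t in pull_times)
    if mass == 0:
        return 0.0
    return sum(conversion_observed) / mass

# Example with a (hypothetical) geometric delay distribution.
geom_cdf = lambda s: 1.0 - (1.0 - 0.1) ** s
p_hat = delayed_conversion_estimate([0, 5, 9], [True, False, False],
                                    now=10, delay_cdf=geom_cdf)
```

A UCB- or KLUCB-style index can then be built on top of such an estimate; when few conversions have had time to arrive, the denominator is small and the estimate correctly remains uncertain.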
Rotting bandits are not harder than stochastic ones
In stochastic multi-armed bandits, the reward distribution of each arm is
assumed to be stationary. This assumption is often violated in practice (e.g.,
in recommendation systems), where the reward of an arm may change whenever it
is selected, i.e., the rested bandit setting. In this paper, we consider the
non-parametric rotting bandit setting, where rewards can only decrease. We
introduce the filtering on expanding window average (FEWA) algorithm that
constructs moving averages of increasing windows to identify arms that are more
likely to return high rewards when pulled once more. We prove that, for an
unknown horizon $T$, and without any knowledge on the decreasing behavior of
the $K$ arms, FEWA achieves a problem-dependent regret bound of
$\widetilde{\mathcal{O}}(\log(KT))$ and a problem-independent one of
$\widetilde{\mathcal{O}}(\sqrt{KT})$. Our result substantially improves over
the algorithm of Levine et al. (2017), which suffers regret
$\widetilde{\mathcal{O}}(K^{1/3}T^{2/3})$. FEWA also matches known bounds for
the stochastic bandit setting, thus showing that the rotting bandits are not
harder. Finally, we report simulations confirming the theoretical improvements
of FEWA.
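FEWA's expanding-window filter can be sketched as a toy arm-selection rule. This is our simplification of the abstract's description, not the paper's algorithm: windows are doubled rather than grown one step at a time, the confidence radius is a generic Hoeffding-style term, and all names are hypothetical.

```python
import math

def fewa_select(recent_rewards, delta=0.01):
    """Select an arm via an expanding-window filtering rule (toy sketch).

    recent_rewards[i] is the list of observed rewards for arm i, most
    recent last.  Starting from window h = 1, we keep only arms whose
    average over their last h pulls is within a confidence radius c(h)
    of the best such average; the first surviving arm with fewer than h
    samples is pulled next, so recent means drive the decision even
    when rewards decay ("rot") over time.
    """
    active = list(range(len(recent_rewards)))
    h = 1
    while True:
        for i in active:
            if len(recent_rewards[i]) < h:
                return i  # too few samples at this window: pull this arm
        means = {i: sum(recent_rewards[i][-h:]) / h for i in active}
        # Hoeffding-style radius for a window of h samples (our choice).
        c = math.sqrt(2 * math.log(1 / delta) / h)
        best = max(means.values())
        active = [i for i in active if means[i] >= best - c]
        h *= 2  # expand the window
```

Because only the last h rewards of each arm enter the average at level h, an arm whose reward has decayed is judged by its recent performance rather than its full history.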