605 research outputs found
Collaborative Learning of Stochastic Bandits over a Social Network
We consider a collaborative online learning paradigm, wherein a group of
agents connected through a social network are engaged in playing a stochastic
multi-armed bandit game. Each time an agent takes an action, the corresponding
reward is instantaneously observed by the agent, as well as its neighbours in
the social network. We perform a regret analysis of various policies in this
collaborative learning setting. A key finding of this paper is that natural
extensions of widely-studied single agent learning policies to the network
setting need not perform well in terms of regret. In particular, we identify a
class of non-altruistic and individually consistent policies, and argue by
deriving regret lower bounds that they are liable to suffer a large regret in
the networked setting. We also show that the learning performance can be
substantially improved if the agents exploit the structure of the network, and
develop a simple learning algorithm based on dominating sets of the network.
Specifically, we first consider a star network, which is a common motif in
hierarchical social networks, and show analytically that the hub agent can be
used as an information sink to expedite learning and improve the overall
regret. We also derive networkwide regret bounds for the algorithm applied to
general networks. We conduct numerical experiments on a variety of networks to
corroborate our analytical results.Comment: 14 Pages, 6 Figure
Stochastic Bandit Models for Delayed Conversions
Online advertising and product recommendation are important domains of
applications for multi-armed bandit methods. In these fields, the reward that
is immediately available is most often only a proxy for the actual outcome of
interest, which we refer to as a conversion. For instance, in web advertising,
clicks can be observed within a few seconds after an ad display but the
corresponding sale --if any-- will take hours, if not days to happen. This
paper proposes and investigates a new stochas-tic multi-armed bandit model in
the framework proposed by Chapelle (2014) --based on empirical studies in the
field of web advertising-- in which each action may trigger a future reward
that will then happen with a stochas-tic delay. We assume that the probability
of conversion associated with each action is unknown while the distribution of
the conversion delay is known, distinguishing between the (idealized) case
where the conversion events may be observed whatever their delay and the more
realistic setting in which late conversions are censored. We provide
performance lower bounds as well as two simple but efficient algorithms based
on the UCB and KLUCB frameworks. The latter algorithm, which is preferable when
conversion rates are low, is based on a Poissonization argument, of independent
interest in other settings where aggregation of Bernoulli observations with
different success probabilities is required.Comment: Conference on Uncertainty in Artificial Intelligence, Aug 2017,
Sydney, Australi
Bandits with heavy tail
The stochastic multi-armed bandit problem is well understood when the reward
distributions are sub-Gaussian. In this paper we examine the bandit problem
under the weaker assumption that the distributions have moments of order
1+\epsilon, for some . Surprisingly, moments of order 2
(i.e., finite variance) are sufficient to obtain regret bounds of the same
order as under sub-Gaussian reward distributions. In order to achieve such
regret, we define sampling strategies based on refined estimators of the mean
such as the truncated empirical mean, Catoni's M-estimator, and the
median-of-means estimator. We also derive matching lower bounds that also show
that the best achievable regret deteriorates when \epsilon <1
- âŠ