Further Optimal Regret Bounds for Thompson Sampling
Thompson Sampling is one of the oldest heuristics for multi-armed bandit
problems. It is a randomized algorithm based on Bayesian ideas, and has
recently generated significant interest after several studies demonstrated it
to have better empirical performance compared to the state of the art methods.
In this paper, we provide a novel regret analysis for Thompson Sampling that
simultaneously proves both the optimal problem-dependent bound of
$(1+\epsilon)\sum_i \frac{\ln T}{d(\mu_i,\mu_1)}\Delta_i + O(N/\epsilon^2)$ and the
first near-optimal problem-independent bound of $O(\sqrt{NT\ln T})$ on the
expected regret of this algorithm. Our near-optimal problem-independent bound
solves a COLT 2012 open problem of Chapelle and Li. The optimal
problem-dependent regret bound for this problem was first proven recently by
Kaufmann et al. [ALT 2012]. Our novel martingale-based analysis techniques are
conceptually simple, easily extend to distributions other than the Beta
distribution, and also extend to the more general contextual bandits setting
[Manuscript, Agrawal and Goyal, 2012].
Comment: arXiv admin note: substantial text overlap with arXiv:1111.179
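The algorithm analyzed above is simple to state. The following is a minimal, illustrative sketch of Thompson Sampling for Bernoulli rewards with Beta posteriors (the standard setting in this line of work); all function and variable names are hypothetical, and the true arm means are supplied only to simulate rewards:

```python
import random

def thompson_sampling(arms, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling (illustrative sketch).

    arms: true Bernoulli success probabilities, unknown to the
    algorithm and used here only to simulate reward draws.
    Each arm keeps a Beta(successes + 1, failures + 1) posterior;
    at every step we sample once from each posterior and pull the
    arm with the largest sample.
    """
    rng = random.Random(seed)
    s = [0] * len(arms)  # observed successes per arm
    f = [0] * len(arms)  # observed failures per arm
    total = 0
    for _ in range(horizon):
        # One draw from each arm's Beta posterior.
        theta = [rng.betavariate(s[i] + 1, f[i] + 1) for i in range(len(arms))]
        i = max(range(len(arms)), key=theta.__getitem__)
        reward = 1 if rng.random() < arms[i] else 0
        s[i] += reward
        f[i] += 1 - reward
        total += reward
    return total, s, f

total, s, f = thompson_sampling([0.3, 0.5, 0.8], horizon=2000)
```

Over a long enough horizon, the posterior of the best arm concentrates and that arm receives the bulk of the pulls; the regret bounds above quantify how quickly.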
On the Prior Sensitivity of Thompson Sampling
The empirically successful Thompson Sampling algorithm for stochastic bandits
has drawn much interest in understanding its theoretical properties. One
important benefit of the algorithm is that it allows domain knowledge to be
conveniently encoded as a prior distribution to balance exploration and
exploitation more effectively. While it is generally believed that the
algorithm's regret is low (high) when the prior is good (bad), little is known
about the exact dependence. In this paper, we fully characterize the
algorithm's worst-case dependence of regret on the choice of prior, focusing on
a special yet representative case. These results also provide insights into the
general sensitivity of the algorithm to the choice of priors. In particular,
with $p$ being the prior probability mass of the true reward-generating model,
we prove regret upper bounds, stated in terms of $p$, for the
bad- and good-prior cases, respectively, as well as matching lower
bounds. Our proofs rely on the discovery of a fundamental property of Thompson
Sampling and make heavy use of martingale theory, both of which appear novel in
the literature, to the best of our knowledge.
Comment: Appears in the 27th International Conference on Algorithmic Learning Theory (ALT), 201
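The prior-sensitivity phenomenon discussed above is easy to observe empirically. Below is a small illustrative simulation (names hypothetical, not from the paper) of Bernoulli Thompson Sampling where domain knowledge enters through explicit per-arm Beta priors; a prior whose mass sits on the wrong model must first be washed out by data:

```python
import random

def ts_with_prior(arms, prior, horizon, seed=0):
    """Bernoulli Thompson Sampling with an explicit Beta prior per arm.

    prior: list of (alpha, beta) pairs, one per arm -- this is where
    domain knowledge is encoded. A prior concentrated near the truth
    ("good") speeds up learning; one concentrated on the wrong model
    ("bad") delays it.
    """
    rng = random.Random(seed)
    post = [list(p) for p in prior]   # posterior parameters, updated online
    pulls = [0] * len(arms)
    for _ in range(horizon):
        theta = [rng.betavariate(a, b) for a, b in post]
        i = max(range(len(arms)), key=theta.__getitem__)
        r = 1 if rng.random() < arms[i] else 0
        post[i][0] += r
        post[i][1] += 1 - r
        pulls[i] += 1
    return pulls

arms = [0.2, 0.8]            # arm 1 is the best arm
good = [(1, 4), (4, 1)]      # prior mass near the true means
bad  = [(4, 1), (1, 4)]      # prior mass on the wrong model
print(ts_with_prior(arms, good, 1000))
print(ts_with_prior(arms, bad, 1000))
```

Comparing the pull counts of the best arm under the two priors gives a rough empirical picture of the dependence that the paper characterizes exactly.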
Asymptotically Optimal Algorithms for Budgeted Multiple Play Bandits
We study a generalization of the multi-armed bandit problem with multiple
plays where there is a cost associated with pulling each arm and the agent has
a budget at each time that dictates how much she can expect to spend. We derive
an asymptotic regret lower bound for any uniformly efficient algorithm in our
setting. We then study a variant of Thompson sampling for Bernoulli rewards and
a variant of KL-UCB for both single-parameter exponential families and bounded,
finitely supported rewards. We show these algorithms are asymptotically
optimal, both in rate and in the leading problem-dependent constants, including in the
thick-margin setting where multiple arms fall on the decision boundary.
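For the KL-UCB family mentioned above, the core computation is the upper confidence index for each arm. The following is a minimal sketch of the standard Bernoulli KL-UCB index, computed by bisection (function names are illustrative; the paper's budgeted variant adds cost and budget handling on top of such an index):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clamp away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mean, pulls, t):
    """Bernoulli KL-UCB index: the largest q >= mean such that
    pulls * kl(mean, q) <= log(t), found by bisection."""
    if pulls == 0:
        return 1.0  # unexplored arms get the maximal index
    level = math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(50):  # 50 halvings: far below float precision
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```

The index shrinks toward the empirical mean as an arm accumulates pulls, which is what drives the asymptotically optimal exploration rate.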
Collaborative Learning of Stochastic Bandits over a Social Network
We consider a collaborative online learning paradigm, wherein a group of
agents connected through a social network are engaged in playing a stochastic
multi-armed bandit game. Each time an agent takes an action, the corresponding
reward is instantaneously observed by the agent, as well as its neighbours in
the social network. We perform a regret analysis of various policies in this
collaborative learning setting. A key finding of this paper is that natural
extensions of widely-studied single agent learning policies to the network
setting need not perform well in terms of regret. In particular, we identify a
class of non-altruistic and individually consistent policies, and argue by
deriving regret lower bounds that they are liable to suffer a large regret in
the networked setting. We also show that the learning performance can be
substantially improved if the agents exploit the structure of the network, and
develop a simple learning algorithm based on dominating sets of the network.
Specifically, we first consider a star network, which is a common motif in
hierarchical social networks, and show analytically that the hub agent can be
used as an information sink to expedite learning and improve the overall
regret. We also derive networkwide regret bounds for the algorithm applied to
general networks. We conduct numerical experiments on a variety of networks to
corroborate our analytical results.
Comment: 14 Pages, 6 Figure
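The observation model above (each reward is seen by the acting agent and its neighbours) is easy to prototype. The sketch below runs standard UCB1 at every agent of a star network while sharing each observation with neighbours; it is an illustrative toy under assumed names, not the dominating-set algorithm developed in the paper:

```python
import math
import random

def networked_ucb(arms, adjacency, horizon, seed=0):
    """UCB1 run by several agents; each reward is also observed by
    the acting agent's neighbours (the paper's observation model).
    adjacency: dict mapping agent -> set of neighbouring agents.
    Returns the network-wide number of pulls of the best arm.
    """
    rng = random.Random(seed)
    n_agents, K = len(adjacency), len(arms)
    best_arm = max(range(K), key=arms.__getitem__)
    # per-agent statistics over every reward the agent has observed
    counts = [[0] * K for _ in range(n_agents)]
    sums = [[0.0] * K for _ in range(n_agents)]
    pulls_of_best = 0
    for _ in range(horizon):
        for agent in range(n_agents):
            untried = [k for k in range(K) if counts[agent][k] == 0]
            if untried:
                k = untried[0]  # try each arm once before using the index
            else:
                obs = sum(counts[agent])
                k = max(range(K),
                        key=lambda a: sums[agent][a] / counts[agent][a]
                        + math.sqrt(2 * math.log(obs) / counts[agent][a]))
            r = 1.0 if rng.random() < arms[k] else 0.0
            # the reward is observed by the agent and all its neighbours
            for watcher in {agent} | adjacency[agent]:
                counts[watcher][k] += 1
                sums[watcher][k] += r
            if k == best_arm:
                pulls_of_best += 1
    return pulls_of_best

# star network: agent 0 is the hub, a common motif in the paper
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
best = networked_ucb([0.2, 0.5, 0.8], star, horizon=500)
```

In this toy, the hub observes every leaf's rewards, so its statistics grow fastest, which is the intuition behind using the hub as an information sink.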