Thompson Sampling for Bandits with Clustered Arms
We propose algorithms based on a multi-level Thompson sampling scheme, for the stochastic multi-armed bandit and its contextual variant with linear expected rewards, in the setting where arms are clustered. We show, both theoretically and empirically, how exploiting a given cluster structure can significantly improve the regret and computational cost compared to using standard Thompson sampling. In the case of the stochastic multi-armed bandit we give upper bounds on the expected cumulative regret showing how it depends on the quality of the clustering. Finally, we perform an empirical evaluation showing that our algorithms perform well compared to previously proposed algorithms for bandits with clustered arms.
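As a hedged illustration of the multi-level idea, here is a minimal sketch of one round of a two-level Thompson sampling scheme under assumed Beta-Bernoulli posteriors; the scheme and all names are illustrative, not necessarily the authors' exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def clustered_thompson_step(alpha, beta, clusters):
    """One round of a two-level Thompson sampling sketch.

    alpha, beta: Beta posterior parameters, one entry per arm.
    clusters: list of index arrays, one per cluster.
    Returns the index of the arm to pull.
    """
    # Level 1: draw a posterior sample per arm and score each cluster
    # by the best sample it contains.
    samples = rng.beta(alpha, beta)
    best_cluster = max(range(len(clusters)),
                       key=lambda c: samples[clusters[c]].max())
    # Level 2: a fresh Thompson draw restricted to the chosen cluster.
    idx = np.asarray(clusters[best_cluster])
    return int(idx[np.argmax(rng.beta(alpha[idx], beta[idx]))])
```

After pulling arm `a` and observing a Bernoulli reward `r`, the usual posterior update is `alpha[a] += r; beta[a] += 1 - r`.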
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, experiments that have been lacking in the literature until now for the Bernoulli case.
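Since the analyzed algorithm itself is simple, a self-contained Bernoulli Thompson sampling simulation fits in a few lines (the simulation harness and parameter choices below are our own, illustrative additions):

```python
import numpy as np

def thompson_bernoulli(means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling: sample each arm's posterior,
    pull the argmax, update with the observed 0/1 reward."""
    rng = np.random.default_rng(seed)
    k = len(means)
    alpha = np.ones(k)  # Beta(1, 1) uniform priors
    beta = np.ones(k)
    regret = 0.0
    best = max(means)
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(alpha, beta)))
        reward = float(rng.random() < means[arm])
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        regret += best - means[arm]
    return regret

# The cumulative regret should grow roughly logarithmically in the horizon.
print(thompson_bernoulli([0.5, 0.6], horizon=10_000))
```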
Bounded regret in stochastic multi-armed bandits
We study the stochastic multi-armed bandit problem when one knows the value $\mu^{(\star)}$ of an optimal arm, as well as a positive lower bound on the smallest positive gap $\Delta$. We propose a new randomized policy that attains a regret {\em uniformly bounded over time} in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows $\Delta$, and bounded regret of order $1/\Delta$ is not possible if one only knows $\mu^{(\star)}$.
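The abstract does not spell the policy out, so the following is only a hedged, elimination-style stand-in showing why knowing $\mu^{(\star)}$ and a gap lower bound can cap exploration after finitely many pulls; it is not the authors' (randomized) policy, and `pull` and all names are our assumptions:

```python
import numpy as np

def known_value_policy(pull, k, mu_star, delta, horizon):
    """Hedged sketch, NOT the paper's policy: knowing mu_star and a lower
    bound delta on the smallest positive gap, permanently drop an arm once
    its empirical mean sits confidently below mu_star - delta / 2, so each
    suboptimal arm is pulled only finitely often with high probability.
    pull(arm) returns a reward in [0, 1]."""
    sums = np.zeros(k)
    counts = np.zeros(k, dtype=int)
    active = list(range(k))
    for _ in range(horizon):
        # Round-robin over surviving arms; commit once one arm remains.
        arm = min(active, key=lambda a: counts[a])
        sums[arm] += pull(arm)
        counts[arm] += 1
        if len(active) > 1:
            keep = []
            for a in active:
                n = counts[a]
                # Confidence radius shrinking like sqrt(log n / n).
                radius = np.sqrt(np.log(max(n, 2)) / n) if n else np.inf
                if sums[a] / max(n, 1) + radius >= mu_star - delta / 2:
                    keep.append(a)
            active = keep or active  # never empty the active set
    return counts
```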
Achieving Fairness in the Stochastic Multi-armed Bandit Problem
We study an interesting variant of the stochastic multi-armed bandit problem, called the Fair-SMAB problem, where each arm is required to be pulled for at least a given fraction of the total available rounds. We investigate the interplay between learning and fairness in terms of a pre-specified vector $r$ denoting the fractions of guaranteed pulls. We define a fairness-aware regret, called $r$-Regret, that takes into account the above fairness constraints and naturally extends the conventional notion of regret. Our primary contribution is characterizing a class of Fair-SMAB algorithms by two parameters: the unfairness tolerance and the learning algorithm used as a black-box. We provide a fairness guarantee for this class that holds uniformly over time, irrespective of the choice of the learning algorithm. In particular, when the learning algorithm is UCB1, we show that our algorithm achieves $O(\ln T)$ $r$-Regret. Finally, we evaluate the cost of fairness in terms of the conventional notion of regret.
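A hedged sketch of one round of such a meta-policy; the quota rule and all names are our assumptions rather than the paper's exact construction:

```python
import numpy as np

def fair_smab_step(t, counts, r, tolerance, learner_choice):
    """If some arm has fallen more than `tolerance` pulls behind its
    guaranteed share r[a] * t, pull the most deficient arm; otherwise
    defer to the black-box learner (e.g., the arm proposed by UCB1)."""
    deficits = r * t - counts
    worst = int(np.argmax(deficits))
    if deficits[worst] > tolerance:
        return worst           # fairness overrides learning this round
    return learner_choice      # quotas satisfied; let the learner act
```

The tolerance parameter caps how far any arm may fall behind its guaranteed fraction before fairness overrides the learner, which is what lets the fairness guarantee hold uniformly over time regardless of the learner plugged in.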
Bandits with heavy tail
The stochastic multi-armed bandit problem is well understood when the reward distributions are sub-Gaussian. In this paper we examine the bandit problem under the weaker assumption that the distributions have moments of order $1+\epsilon$, for some $\epsilon \in (0, 1]$. Surprisingly, moments of order 2 (i.e., finite variance) are sufficient to obtain regret bounds of the same order as under sub-Gaussian reward distributions. In order to achieve such regret, we define sampling strategies based on refined estimators of the mean such as the truncated empirical mean, Catoni's M-estimator, and the median-of-means estimator. We also derive matching lower bounds that show that the best achievable regret deteriorates when $\epsilon < 1$.
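As a hedged illustration of one of these refined estimators, here is the median-of-means (the demo distribution and block count are our choices); note how it stabilizes the estimate under a heavy-tailed reward with finite mean but infinite variance:

```python
import numpy as np

def median_of_means(x, num_blocks):
    """Split the sample into blocks, average each block, and return the
    median of the block means; far more resistant to heavy tails than
    the plain empirical mean."""
    blocks = np.array_split(np.asarray(x, dtype=float), num_blocks)
    return float(np.median([b.mean() for b in blocks]))

# Pareto(1.5) rewards: mean 3, infinite variance (so epsilon < 1 here).
rng = np.random.default_rng(0)
sample = rng.pareto(1.5, size=1_000) + 1.0
print(sample.mean(), median_of_means(sample, num_blocks=15))
```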