On the Prior Sensitivity of Thompson Sampling
The empirically successful Thompson Sampling algorithm for stochastic bandits
has drawn much interest in understanding its theoretical properties. One
important benefit of the algorithm is that it allows domain knowledge to be
conveniently encoded as a prior distribution to balance exploration and
exploitation more effectively. While it is generally believed that the
algorithm's regret is low (high) when the prior is good (bad), little is known
about the exact dependence. In this paper, we fully characterize the
algorithm's worst-case dependence of regret on the choice of prior, focusing on
a special yet representative case. These results also provide insights into the
general sensitivity of the algorithm to the choice of priors. In particular,
with the prior probability mass assigned to the true reward-generating model
as the key quantity, we prove regret upper bounds for both the bad- and
good-prior cases, as well as \emph{matching} lower
bounds. Our proofs rely on the discovery of a fundamental property of Thompson
Sampling and make heavy use of martingale theory, both of which appear novel in
the literature, to the best of our knowledge.
Comment: Appears in the 27th International Conference on Algorithmic Learning
Theory (ALT), 2016
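The prior's role described above can be seen in a minimal Beta-Bernoulli Thompson Sampling sketch. This is an illustrative example only: the arm means, prior parameters, and function names below are assumptions for the demonstration, not quantities from the paper.

```python
import random

def thompson_sampling(true_means, priors, horizon, rng):
    """Beta-Bernoulli Thompson Sampling (illustrative sketch).

    priors: one (alpha, beta) pair per arm -- this is where domain
    knowledge enters; a prior that puts little mass near the true
    best arm slows learning and inflates regret.
    """
    params = [list(p) for p in priors]
    total_reward = 0
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its posterior
        # and play the arm with the largest sample.
        samples = [rng.betavariate(a, b) for a, b in params]
        arm = max(range(len(params)), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        params[arm][0] += reward        # posterior update: alpha += reward
        params[arm][1] += 1 - reward    # posterior update: beta += 1 - reward
        total_reward += reward
    return total_reward

# Arm 1 is truly best (mean 0.8). A "good" prior concentrates on it;
# a "bad" prior favors the inferior arm 0.
good = thompson_sampling([0.2, 0.8], [(1, 5), (5, 1)], 2000, random.Random(0))
bad = thompson_sampling([0.2, 0.8], [(5, 1), (1, 5)], 2000, random.Random(0))
```

With the bad prior, the algorithm must first overturn the misplaced prior mass through repeated pulls of the inferior arm, which is exactly the kind of prior-dependent cost the worst-case analysis above quantifies.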
Thompson Sampling for Linearly Constrained Bandits
We address multi-armed bandits (MAB) where the objective is to maximize the
cumulative reward under a probabilistic linear constraint. For a few real-world
instances of this problem, constrained extensions of the well-known Thompson
Sampling (TS) heuristic have recently been proposed. However, finite-time
analysis of constrained TS is challenging; as a result, only O(\sqrt{T}) bounds
on the cumulative reward loss (i.e., the regret) are available. In this paper,
we describe LinConTS, a TS-based algorithm for bandits that place a linear
constraint on the probability of earning a reward in every round. We show that
for LinConTS, the regret as well as the cumulative constraint violations are
upper bounded by O(\log T) for the suboptimal arms. We develop a proof
technique that relies on careful analysis of the dual problem and combine it
with recent theoretical work on unconstrained TS. Through numerical experiments
on two real-world datasets, we demonstrate that LinConTS outperforms an
asymptotically optimal upper confidence bound (UCB) scheme in terms of
simultaneously minimizing the regret and the violation.
Comment: 10 pages, 2 figures, updated version of paper accepted at AISTATS 2020
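The general idea of sampling-based decisions under a probabilistic constraint can be sketched as follows. This is an illustrative heuristic under assumed Beta posteriors for both the reward and the constraint event, not the authors' LinConTS algorithm; the function and parameter names are invented for the example.

```python
import random

def constrained_ts_step(reward_post, constr_post, threshold, rng):
    """One round of a constrained Thompson-Sampling-style heuristic.

    reward_post, constr_post: one (alpha, beta) Beta-posterior pair per
    arm for the reward and the constraint-satisfaction probability.
    Sample both quantities, restrict attention to arms whose sampled
    constraint probability meets the threshold, and play the
    best-reward arm among them.
    """
    n = len(reward_post)
    r = [rng.betavariate(a, b) for a, b in reward_post]
    c = [rng.betavariate(a, b) for a, b in constr_post]
    feasible = [i for i in range(n) if c[i] >= threshold]
    if not feasible:
        # No arm looks feasible this round: fall back to the arm whose
        # sampled constraint probability is highest (the "safest" arm).
        return max(range(n), key=c.__getitem__)
    return max(feasible, key=r.__getitem__)
```

After each round one would update the pulled arm's two posteriors from the observed reward and constraint outcome, mirroring the standard Beta-Bernoulli update; the per-round feasibility filtering is the point the sketch illustrates.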