Optimality of Thompson Sampling for Gaussian Bandits Depends on Priors
In stochastic bandit problems, a Bayesian policy called Thompson sampling
(TS) has recently attracted much attention for its excellent empirical
performance. However, the theoretical analysis of this policy is difficult and
its asymptotic optimality has been proved only for one-parameter models. In this
paper we discuss the optimality of TS for the model of normal distributions
with unknown means and variances as one of the most fundamental examples of
multiparameter models. First we prove that the expected regret of TS with the
uniform prior achieves the theoretical bound, which is the first result to show
that the asymptotic bound is achievable for the normal distribution model. Next
we prove that TS with the Jeffreys prior or the reference prior cannot achieve
the theoretical bound. Therefore, the choice of prior is important for TS, and
non-informative priors are sometimes risky in multiparameter models.
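
To make the role of the prior concrete, the following is a minimal sketch of TS for normally distributed rewards with unknown mean and variance. It is an illustration under stated assumptions, not the authors' implementation. It assumes the prior family pi(mu, sigma^2) proportional to (sigma^2)^(-1 - alpha), in which alpha = -1 gives the uniform prior, alpha = 0 the reference prior, and alpha = 1/2 the Jeffreys prior. Under that family, the marginal posterior of the mean after n observations is a scaled Student t distribution with n + 2*alpha - 1 degrees of freedom. The names PRIOR_ALPHA, sample_posterior_mean, and thompson_sampling, the simulated arms, and the number of initial pulls per arm are all hypothetical choices made for this example.

```python
import numpy as np

# Assumed prior family: pi(mu, sigma^2) proportional to (sigma^2)^(-1 - alpha).
# The alpha values follow from the standard form of each prior.
PRIOR_ALPHA = {"uniform": -1.0, "reference": 0.0, "jeffreys": 0.5}


def sample_posterior_mean(rewards, alpha, rng):
    """Draw one sample of mu from its marginal posterior for a single arm."""
    x = np.asarray(rewards, dtype=float)
    n = x.size
    # Integrating sigma^2 out leaves a density proportional to
    # (S + n * (mu - mean)^2)^(-(n/2 + alpha)), i.e. a scaled Student t.
    dof = n + 2.0 * alpha - 1.0
    if dof <= 0:
        raise ValueError("improper posterior: need n > 1 - 2 * alpha")
    mean = x.mean()
    s = np.sum((x - mean) ** 2)  # sum of squared deviations
    scale = np.sqrt(s / (n * dof))
    return mean + scale * rng.standard_t(dof)


def thompson_sampling(arms, horizon, prior="uniform", seed=0):
    """Simulate TS on arms given as (mean, std) pairs; return pull counts."""
    alpha = PRIOR_ALPHA[prior]
    rng = np.random.default_rng(seed)
    # Pull each arm often enough for the posterior to be proper (assumption:
    # at least two pulls so the sum of squared deviations is positive).
    n_init = max(2, int(np.floor(1.0 - 2.0 * alpha)) + 1)
    history = [[rng.normal(mu, sigma) for _ in range(n_init)]
               for mu, sigma in arms]
    for _ in range(horizon - n_init * len(arms)):
        # Sample a plausible mean for every arm and play the largest one.
        samples = [sample_posterior_mean(h, alpha, rng) for h in history]
        k = int(np.argmax(samples))
        mu, sigma = arms[k]
        history[k].append(rng.normal(mu, sigma))
    return [len(h) for h in history]


if __name__ == "__main__":
    for name in PRIOR_ALPHA:
        print(name, thompson_sampling([(0.0, 1.0), (2.0, 1.0)], 500, prior=name))
```

A smaller alpha gives fewer degrees of freedom and therefore heavier posterior tails, so the uniform prior explores under-sampled arms more aggressively than the reference or Jeffreys prior. A short simulation like this one only shows the mechanics; it cannot confirm the asymptotic results stated above.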