We consider the stochastic bandit problem with a continuous set of arms, with
the expected reward function over the arms assumed to be fixed but unknown. We
provide two new Gaussian process-based algorithms for continuous bandit
optimization-Improved GP-UCB (IGP-UCB) and GP-Thomson sampling (GP-TS), and
derive corresponding regret bounds. Specifically, the bounds hold when the
expected reward function belongs to the reproducing kernel Hilbert space (RKHS)
that naturally corresponds to a Gaussian process kernel used as input by the
algorithms. Along the way, we derive a new self-normalized concentration
inequality for vector- valued martingales of arbitrary, possibly infinite,
dimension. Finally, experimental evaluation and comparisons to existing
algorithms on synthetic and real-world environments are carried out that
highlight the favorable gains of the proposed strategies in many cases