Perturbed-History Exploration in Stochastic Linear Bandits
We propose a new online algorithm for minimizing the cumulative regret in
stochastic linear bandits. The key idea is to build a perturbed history, which
mixes the history of observed rewards with a pseudo-history of randomly
generated i.i.d. pseudo-rewards. Our algorithm, perturbed-history exploration
in a linear bandit (LinPHE), estimates a linear model from its perturbed
history and pulls the arm with the highest value under that model. We prove a
$\tilde{O}(d \sqrt{n})$ gap-free bound on the expected $n$-round regret of
LinPHE, where $d$ is the number of features. Our analysis relies on novel
concentration and anti-concentration bounds on the weighted sum of Bernoulli
random variables. To show the generality of our design, we extend LinPHE to a
logistic reward model. We evaluate both algorithms empirically and show that
they are practical.
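
The per-round update described above is simple enough to sketch. Below is a
minimal Python sketch of one LinPHE round under assumed details: rewards in
[0, 1], Bernoulli(1/2) pseudo-rewards, a ridge regularizer lam, and a
perturbation scale a (the number of pseudo-rewards mixed in per observation).
The name linphe_round and this interface are illustrative, not the authors'
reference implementation.

    import numpy as np

    def linphe_round(X_hist, y_hist, arms, a=1, lam=1.0, rng=None):
        # One LinPHE round (illustrative sketch). X_hist: (t, d) features of
        # previously pulled arms, y_hist: (t,) observed rewards in [0, 1],
        # arms: (K, d) candidate arm features, a: pseudo-rewards per observation.
        rng = np.random.default_rng() if rng is None else rng
        t, d = X_hist.shape
        # Perturbed history: mix each observed reward with a i.i.d.
        # Bernoulli(1/2) pseudo-rewards attached to the same feature vector.
        pseudo = rng.binomial(1, 0.5, size=(t, a)).sum(axis=1)
        # Regularized least squares on the perturbed history; each feature
        # row effectively appears a + 1 times (once per real/pseudo reward).
        G = (a + 1) * X_hist.T @ X_hist + lam * np.eye(d)
        theta = np.linalg.solve(G, X_hist.T @ (y_hist + pseudo))
        # Pull the arm with the highest value under the perturbed-model estimate.
        return int(np.argmax(arms @ theta))

A full run would warm-start by pulling each arm once, then call linphe_round
every round and append the chosen arm's features and realized reward to the
history.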
PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits
We address the problem of regret minimization in logistic contextual bandits,
where a learner decides among sequential actions or arms given their respective
contexts to maximize binary rewards. Using a fast inference procedure with
Polya-Gamma distributed augmentation variables, we propose an improved version
of Thompson Sampling, a Bayesian formulation of contextual bandits with
near-optimal performance. Our approach, Polya-Gamma augmented Thompson Sampling
(PG-TS), achieves state-of-the-art performance on simulated and real data.
PG-TS explores the action space efficiently and exploits high-reward arms,
quickly converging to solutions of low regret. Its explicit estimation of the
posterior distribution of the context feature covariance leads to substantial
empirical gains over approximate approaches. PG-TS is the first approach to
demonstrate the benefits of Polya-Gamma augmentation in bandits and to propose
an efficient Gibbs sampler for approximating the analytically unsolvable
integral of logistic contextual bandits.
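
To make the augmentation concrete, here is a minimal Python sketch of one
PG-TS round under assumed details: a Gaussian N(0, prior_var * I) prior, a
fixed number of Gibbs sweeps, and a truncated-series approximation to the
Polya-Gamma draw (exact samplers exist, e.g. in the pypolyagamma package).
Names such as pg_ts_round and sample_pg1 are illustrative, not the authors'
reference implementation.

    import numpy as np

    def sample_pg1(c, rng, terms=200):
        # Approximate PG(1, c) draw via the truncated infinite-sum
        # representation PG(1, c) = (1 / (2 pi^2)) * sum_k g_k /
        # ((k - 1/2)^2 + c^2 / (4 pi^2)) with g_k ~ Gamma(1, 1);
        # truncating at `terms` terms is an approximation.
        k = np.arange(1, terms + 1)
        g = rng.gamma(1.0, 1.0, size=terms)
        return (g / ((k - 0.5) ** 2 + (c / (2 * np.pi)) ** 2)).sum() / (2 * np.pi ** 2)

    def pg_ts_round(X_hist, y_hist, arms, n_gibbs=20, prior_var=1.0, rng=None):
        # One PG-TS round (illustrative sketch). X_hist: (t, d) contexts of
        # pulled arms, y_hist: (t,) binary rewards, arms: (K, d) candidates.
        rng = np.random.default_rng() if rng is None else rng
        t, d = X_hist.shape
        B_inv = np.eye(d) / prior_var   # Gaussian prior N(0, prior_var * I)
        kappa = y_hist - 0.5            # Polya-Gamma pseudo-observations
        theta = np.zeros(d)
        for _ in range(n_gibbs):
            # Augmentation: one Polya-Gamma variable per past observation
            # makes the conditional posterior over theta exactly Gaussian.
            omega = np.array([sample_pg1(x @ theta, rng) for x in X_hist])
            V = np.linalg.inv(X_hist.T * omega @ X_hist + B_inv)
            m = V @ (X_hist.T @ kappa)
            theta = rng.multivariate_normal(m, V)
        # Thompson sampling: act greedily under the posterior sample; the
        # sigmoid is monotone, so the linear score gives the same argmax.
        return int(np.argmax(arms @ theta))

Note that the Gaussian conditional carries an explicit covariance V, which is
the posterior-covariance estimate the abstract credits for PG-TS's gains over
approximate (e.g. Laplace-style) posteriors.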