Perturbed-History Exploration in Stochastic Linear Bandits
We propose a new online algorithm for minimizing the cumulative regret in
stochastic linear bandits. The key idea is to build a perturbed history, which
mixes the history of observed rewards with a pseudo-history of randomly
generated i.i.d. pseudo-rewards. Our algorithm, perturbed-history exploration
in a linear bandit (LinPHE), estimates a linear model from its perturbed
history and pulls the arm with the highest value under that model. We prove a
$\tilde{O}(d \sqrt{n})$ gap-free bound on the expected $n$-round regret of
LinPHE, where $d$ is the number of features. Our analysis relies on novel
concentration and anti-concentration bounds on the weighted sum of Bernoulli
random variables. To show the generality of our design, we extend LinPHE to a
logistic reward model. We evaluate both algorithms empirically and show that
they are practical.
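
The per-round update described above is simple enough to sketch. Below is a
minimal Python sketch of one LinPHE round under assumed details: rewards in
[0, 1], Bernoulli(1/2) pseudo-rewards, a ridge regularizer lam, and a
perturbation scale a (the number of pseudo-rewards mixed in per observation).
The name linphe_round and this interface are illustrative, not the authors'
reference implementation.

    import numpy as np

    def linphe_round(X_hist, y_hist, arms, a=1, lam=1.0, rng=None):
        # One LinPHE round (illustrative sketch). X_hist: (t, d) features of
        # previously pulled arms, y_hist: (t,) observed rewards in [0, 1],
        # arms: (K, d) candidate arm features, a: pseudo-rewards per observation.
        rng = np.random.default_rng() if rng is None else rng
        t, d = X_hist.shape
        # Perturbed history: mix each observed reward with a i.i.d.
        # Bernoulli(1/2) pseudo-rewards attached to the same feature vector.
        pseudo = rng.binomial(1, 0.5, size=(t, a)).sum(axis=1)
        # Regularized least squares on the perturbed history; each feature
        # row effectively appears a + 1 times (once per real/pseudo reward).
        G = (a + 1) * X_hist.T @ X_hist + lam * np.eye(d)
        theta = np.linalg.solve(G, X_hist.T @ (y_hist + pseudo))
        # Pull the arm with the highest value under the perturbed-model estimate.
        return int(np.argmax(arms @ theta))

A full run would warm-start by pulling each arm once, then call linphe_round
every round and append the chosen arm's features and realized reward to the
history.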
PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits
We address the problem of regret minimization in logistic contextual bandits,
where a learner decides among sequential actions or arms given their respective
contexts to maximize binary rewards. Using a fast inference procedure with
Polya-Gamma distributed augmentation variables, we propose an improved version
of Thompson Sampling, a Bayesian formulation of contextual bandits with
near-optimal performance. Our approach, Polya-Gamma augmented Thompson Sampling
(PG-TS), achieves state-of-the-art performance on simulated and real data.
PG-TS explores the action space efficiently and exploits high-reward arms,
quickly converging to solutions of low regret. Its explicit estimation of the
posterior distribution of the context feature covariance leads to substantial
empirical gains over approximate approaches. PG-TS is the first approach to
demonstrate the benefits of Polya-Gamma augmentation in bandits and to propose
an efficient Gibbs sampler for approximating the analytically unsolvable
integral of logistic contextual bandits.
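
To make the augmentation concrete, here is a minimal Python sketch of one
PG-TS round under assumed details: a Gaussian N(0, prior_var * I) prior, a
fixed number of Gibbs sweeps, and a truncated-series approximation to the
Polya-Gamma draw (exact samplers exist, e.g. in the pypolyagamma package).
Names such as pg_ts_round and sample_pg1 are illustrative, not the authors'
reference implementation.

    import numpy as np

    def sample_pg1(c, rng, terms=200):
        # Approximate PG(1, c) draw via the truncated infinite-sum
        # representation PG(1, c) = (1 / (2 pi^2)) * sum_k g_k /
        # ((k - 1/2)^2 + c^2 / (4 pi^2)) with g_k ~ Gamma(1, 1);
        # truncating at `terms` terms is an approximation.
        k = np.arange(1, terms + 1)
        g = rng.gamma(1.0, 1.0, size=terms)
        return (g / ((k - 0.5) ** 2 + (c / (2 * np.pi)) ** 2)).sum() / (2 * np.pi ** 2)

    def pg_ts_round(X_hist, y_hist, arms, n_gibbs=20, prior_var=1.0, rng=None):
        # One PG-TS round (illustrative sketch). X_hist: (t, d) contexts of
        # pulled arms, y_hist: (t,) binary rewards, arms: (K, d) candidates.
        rng = np.random.default_rng() if rng is None else rng
        t, d = X_hist.shape
        B_inv = np.eye(d) / prior_var   # Gaussian prior N(0, prior_var * I)
        kappa = y_hist - 0.5            # Polya-Gamma pseudo-observations
        theta = np.zeros(d)
        for _ in range(n_gibbs):
            # Augmentation: one Polya-Gamma variable per past observation
            # makes the conditional posterior over theta exactly Gaussian.
            omega = np.array([sample_pg1(x @ theta, rng) for x in X_hist])
            V = np.linalg.inv(X_hist.T * omega @ X_hist + B_inv)
            m = V @ (X_hist.T @ kappa)
            theta = rng.multivariate_normal(m, V)
        # Thompson sampling: act greedily under the posterior sample; the
        # sigmoid is monotone, so the linear score gives the same argmax.
        return int(np.argmax(arms @ theta))

Note that the Gaussian conditional carries an explicit covariance V, which is
the posterior-covariance estimate the abstract credits for PG-TS's gains over
approximate (e.g. Laplace-style) posteriors.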