    Randomized Exploration in Generalized Linear Bandits

    We study two randomized algorithms for generalized linear bandits, GLM-TSL and GLM-FPL. GLM-TSL samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution. GLM-FPL fits a GLM to a randomly perturbed history of past rewards. We prove $\tilde{O}(d \sqrt{n \log K})$ bounds on the $n$-round regret of GLM-TSL and GLM-FPL, where $d$ is the number of features and $K$ is the number of arms. The regret bound of GLM-TSL improves upon prior work and the regret bound of GLM-FPL is the first of its kind. We apply both GLM-TSL and GLM-FPL to logistic and neural network bandits, and show that they perform well empirically. In more complex models, GLM-FPL is significantly faster. Our results showcase the role of randomization, beyond sampling from the posterior, in exploration.
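
    Below is a minimal sketch of the perturbed-history idea behind GLM-FPL for a logistic bandit, assuming a fixed finite arm set with known feature vectors. The noise scale `sigma`, the gradient-based fitting routine, and the initial warm-up phase are illustrative assumptions, not the paper's exact construction or tuning.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, steps=100, lr=0.5, lam=1.0):
    # Ridge-regularized logistic fit by gradient descent
    # (an illustrative solver, not the paper's exact MLE routine).
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = (X.T @ (sigmoid(X @ theta) - y) + lam * theta) / len(y)
        theta -= lr * grad
    return theta

def glm_fpl(arms, reward_fn, n_rounds, sigma=0.5, seed=0):
    # arms: (K, d) array of arm feature vectors; reward_fn(x) -> reward in {0, 1}.
    rng = np.random.default_rng(seed)
    K, d = arms.shape
    X, y = [], []
    for t in range(n_rounds):
        if t < K:
            idx = t  # pull each arm once to initialize (illustrative warm-up)
        else:
            Xh = np.asarray(X)
            yh = np.asarray(y, dtype=float)
            # GLM-FPL idea: perturb the observed rewards with i.i.d. noise,
            # then fit the GLM to the perturbed history.
            y_pert = yh + sigma * rng.standard_normal(len(yh))
            theta = fit_logistic(Xh, y_pert)
            # Act greedily with respect to the perturbed estimate.
            idx = int(np.argmax(sigmoid(arms @ theta)))
        x = arms[idx]
        X.append(x)
        y.append(reward_fn(x))
    return np.asarray(y)
```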

    Perturbed-History Exploration in Stochastic Linear Bandits

    We propose a new online algorithm for minimizing the cumulative regret in stochastic linear bandits. The key idea is to build a perturbed history, which mixes the history of observed rewards with a pseudo-history of randomly generated i.i.d. pseudo-rewards. Our algorithm, perturbed-history exploration in a linear bandit (LinPHE), estimates a linear model from its perturbed history and pulls the arm with the highest value under that model. We prove an $\tilde{O}(d \sqrt{n})$ gap-free bound on the expected $n$-round regret of LinPHE, where $d$ is the number of features. Our analysis relies on novel concentration and anti-concentration bounds on the weighted sum of Bernoulli random variables. To show the generality of our design, we extend LinPHE to a logistic reward model. We evaluate both algorithms empirically and show that they are practical.
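
    A minimal sketch of LinPHE-style exploration follows, again assuming a fixed finite arm set and rewards in [0, 1]. The number of Bernoulli(1/2) pseudo-rewards per observation `a` and the ridge parameter `lam` are illustrative choices here; the paper's analysis prescribes specific settings.

```python
import numpy as np

def lin_phe(arms, reward_fn, n_rounds, a=1, lam=1.0, seed=0):
    # arms: (K, d) array of arm feature vectors; reward_fn(x) -> reward in [0, 1].
    rng = np.random.default_rng(seed)
    K, d = arms.shape
    X, y = [], []
    for t in range(n_rounds):
        if t == 0:
            idx = int(rng.integers(K))  # first pull is arbitrary
        else:
            Xh = np.asarray(X)
            yh = np.asarray(y, dtype=float)
            # Perturbed history: add `a` Bernoulli(1/2) pseudo-rewards to each
            # past observation (their sum is drawn directly as a Binomial).
            pseudo = rng.binomial(a, 0.5, size=len(yh))
            # Regularized least squares in which every past feature vector is
            # counted (a + 1) times: once for the real reward and a times for
            # the pseudo-rewards.
            G = (a + 1) * Xh.T @ Xh + lam * np.eye(d)
            theta = np.linalg.solve(G, Xh.T @ (yh + pseudo))
            # Pull the arm with the highest value under the perturbed model.
            idx = int(np.argmax(arms @ theta))
        x = arms[idx]
        X.append(x)
        y.append(reward_fn(x))
    return np.asarray(y)
```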