Perturbed-History Exploration in Stochastic Linear Bandits

Authors: Boutilier Craig, Ghavamzadeh Mohammad, Kveton Branislav, Szepesvari Csaba
Publication date: 21/03/2019

We propose a new online algorithm for minimizing the cumulative regret in stochastic linear bandits. The key idea is to build a perturbed history, which mixes the history of observed rewards with a pseudo-history of randomly generated i.i.d. pseudo-rewards. Our algorithm, perturbed-history exploration in a linear bandit (LinPHE), estimates a linear model from its perturbed history and pulls the arm with the highest value under that model. We prove a $\tilde{O}(d \sqrt{n})$ gap-free bound on the expected $n$-round regret of LinPHE, where $d$ is the number of features. Our analysis relies on novel concentration and anti-concentration bounds on the weighted sum of Bernoulli random variables. To show the generality of our design, we extend LinPHE to a logistic reward model. We evaluate both algorithms empirically and show that they are practical.
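
The loop described in this abstract (perturb the history, fit a linear model, pull the greedy arm) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the Bernoulli(1/2) pseudo-rewards, the count `a` of pseudo-rewards per observation, the ridge parameter `lam`, the scaling of the Gram matrix by `a + 1`, and the helper names `solve` and `linphe_choose` are all assumptions made here, since the abstract does not spell them out. Rewards are assumed to lie in [0, 1].

```python
import random


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def linphe_choose(arms, history, a=1, lam=1.0, rng=random):
    """Return the index of the arm that is greedy under a perturbed-history fit.

    arms: list of feature vectors; history: list of (features, reward) pairs.
    """
    d = len(arms[0])
    # Ridge-regularized Gram matrix, scaled by (a + 1) because each observed
    # reward is accompanied by `a` pseudo-rewards (assumed design).
    G = [[lam * (a + 1) if i == j else 0.0 for j in range(d)] for i in range(d)]
    v = [0.0] * d
    for x, y in history:
        # Fresh i.i.d. Bernoulli(1/2) pseudo-rewards are drawn on every call.
        pseudo = sum(rng.random() < 0.5 for _ in range(a))
        for i in range(d):
            v[i] += x[i] * (y + pseudo)
            for j in range(d):
                G[i][j] += (a + 1) * x[i] * x[j]
    theta = solve(G, v)  # least-squares estimate from the perturbed history
    values = [sum(xi * ti for xi, ti in zip(x, theta)) for x in arms]
    return max(range(len(arms)), key=values.__getitem__)


def simulate(n=200, seed=0):
    """Run the sketch on a toy 2-feature problem with Bernoulli rewards."""
    rng = random.Random(seed)
    arms = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]  # hypothetical arm features
    theta_star = [0.8, 0.2]                      # hypothetical true parameter
    history, pulls = [], [0] * len(arms)
    for _ in range(n):
        k = linphe_choose(arms, history, a=1, rng=rng)
        mean = sum(xi * ti for xi, ti in zip(arms[k], theta_star))
        reward = 1.0 if rng.random() < mean else 0.0
        history.append((arms[k], reward))
        pulls[k] += 1
    return pulls
```

Calling `simulate()` returns the number of times each arm was pulled. The sketch rebuilds the Gram matrix from the full history on every round, which keeps the code short but costs time linear in the number of past rounds; an incremental update would be the natural refinement.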

arXiv.org e-Print Archive

GBOSE: Generalized Bandit Orthogonalized Semiparametric Estimation

Authors: Chowdhury Mubarrat, Ismayilzada Elkhan, Kim Gi-Soo, Sayem Khalequzzaman
Publication date: 20/01/2023

In sequential decision-making scenarios such as mobile health, recommendation systems, and revenue management, contextual multi-armed bandit algorithms have garnered attention for their performance. However, most existing algorithms assume a strictly parametric reward model, usually a linear one. In this work, we propose a new algorithm with a semi-parametric reward model whose regret upper bound has state-of-the-art complexity among existing semi-parametric algorithms. Our work expands the scope of another representative algorithm of state-of-the-art complexity with a similar reward model: our algorithm builds on the same action-filtering procedures but provides an explicit action-selection distribution for scenarios involving more than two arms at a particular time step, while requiring fewer computations. We derive this regret upper bound and present simulation results that affirm our method's superiority over all prevalent semi-parametric bandit algorithms for cases involving more than two arms.

arXiv.org e-Print Archive