Search CORE

148 research outputs found

Linearly Parameterized Bandits

Author: Auer P.
Bertsekas D.
Bertsekas D.
Bertsimas D.
Ginebra J.
John N. Tsitsiklis
Paat Rusmevichientong
Pressman E. L.
Thompson W. R.
Publication venue
Publication date: 01/01/2008
Field of study

We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an

r

-dimensional random vector

\mathbf{Z} \in \mathbb{R}^r

, where

r \geq 2

. The objective is to minimize the cumulative regret and Bayes risk. When the set of arms corresponds to the unit sphere, we prove that the regret and Bayes risk is of order

\Theta(r \sqrt{T})

, by establishing a lower bound for an arbitrary policy, and showing that a matching upper bound is obtained through a policy that alternates between exploration and exploitation phases. The phase-based policy is also shown to be effective if the set of arms satisfies a strong convexity condition. For the case of a general set of arms, we describe a near-optimal policy whose regret and Bayes risk admit upper bounds of the form

O(r \sqrt{T} \log^{3/2} T)

.Comment: 40 pages; updated results and reference

arXiv.org e-Print Archive

CiteSeerX

DSpace@MIT

Crossref

On Kernelized Multi-armed Bandits

Author: Brown André EX
Ch'ng Quee-Lim
Currie Michael
Grundy Laura J
Hokanson Jim
Javer Avelino
Kerr Rex
Lee Chee Wai
Li Chris
Li Kezhi
Schafer William R
Yemini Eviatar
Publication venue
Publication date: 17/05/2017
Field of study

We consider the stochastic bandit problem with a continuous set of arms, with the expected reward function over the arms assumed to be fixed but unknown. We provide two new Gaussian process-based algorithms for continuous bandit optimization-Improved GP-UCB (IGP-UCB) and GP-Thomson sampling (GP-TS), and derive corresponding regret bounds. Specifically, the bounds hold when the expected reward function belongs to the reproducing kernel Hilbert space (RKHS) that naturally corresponds to a Gaussian process kernel used as input by the algorithms. Along the way, we derive a new self-normalized concentration inequality for vector- valued martingales of arbitrary, possibly infinite, dimension. Finally, experimental evaluation and comparisons to existing algorithms on synthetic and real-world environments are carried out that highlight the favorable gains of the proposed strategies in many cases

arXiv.org e-Print Archive

ZENODO

FigShare

Hierarchical Exploration for Accelerating Contextual Bandits

Author: Guestrin Carlos
Hong Sue Ann
Yue Yisong
Publication venue
Publication date: 27/06/2012
Field of study

Contextual bandit learning is an increasingly popular approach to optimizing recommender systems via user feedback, but can be slow to converge in practice due to the need for exploring a large feature space. In this paper, we propose a coarse-to-fine hierarchical approach for encoding prior knowledge that drastically reduces the amount of exploration required. Intuitively, user preferences can be reasonably embedded in a coarse low-dimensional feature space that can be explored efficiently, requiring exploration in the high-dimensional space only as necessary. We introduce a bandit algorithm that explores within this coarse-to-fine spectrum, and prove performance guarantees that depend on how well the coarse space captures the user's preferences. We demonstrate substantial improvement over conventional bandit algorithms through extensive simulation as well as a live user study in the setting of personalized news recommendation.Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012

arXiv.org e-Print Archive

Caltech Authors