Linearly Parameterized Bandits
We consider bandit problems involving a large (possibly infinite) collection
of arms, in which the expected reward of each arm is a linear function of an
$r$-dimensional random vector $\mathbf{Z} \in \mathbb{R}^r$, where $r \geq 2$.
The objective is to minimize the cumulative regret and Bayes risk. When the set
of arms corresponds to the unit sphere, we prove that the regret and Bayes risk
are of order $\Theta(r \sqrt{T})$, by establishing a lower bound for an
arbitrary policy, and showing that a matching upper bound is obtained through a
policy that alternates between exploration and exploitation phases. The
phase-based policy is also shown to be effective if the set of arms satisfies a
strong convexity condition. For the case of a general set of arms, we describe
a near-optimal policy whose regret and Bayes risk admit upper bounds of the
form $O(r \sqrt{T} \log^{3/2} T)$.
Comment: 40 pages; updated results and reference
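As a rough illustration of a policy that alternates between exploration and exploitation phases on the unit sphere, the sketch below simulates one possible schedule. The phase lengths (one pull of each standard basis vector, then `cycle` greedy pulls), the Gaussian noise model, and the names `phased_policy`, `noise_std`, and `seed` are assumptions made for this example and are not taken from the abstract.

```python
import numpy as np

def phased_policy(z, horizon, noise_std=0.1, seed=0):
    """Simulate an alternating explore/exploit policy on the unit sphere.

    z is the unknown parameter vector (used here only to simulate rewards).
    Returns the cumulative expected regret against the best arm z / ||z||.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    r = z.size
    best = np.linalg.norm(z)          # expected reward of the optimal arm
    sums = np.zeros(r)                # running reward totals per basis arm
    regret, t, cycle = 0.0, 0, 0
    while t < horizon:
        cycle += 1
        # Exploration phase: pull each standard basis vector once.
        for k in range(r):
            if t == horizon:
                break
            sums[k] += z[k] + noise_std * rng.standard_normal()
            regret += best - z[k]
            t += 1
        # Estimate z coordinate-wise from the exploration samples.
        z_hat = sums / cycle
        norm = np.linalg.norm(z_hat)
        arm = z_hat / norm if norm > 0 else np.eye(r)[0]
        # Exploitation phase: pull the greedy arm for `cycle` periods
        # (assumed schedule), truncated at the horizon.
        steps = min(cycle, horizon - t)
        regret += steps * (best - arm @ z)
        t += steps
    return regret
```

Calling `phased_policy([1.0, 0.5, -0.5], horizon=2000)` returns the simulated regret for a three-dimensional instance; with growing exploitation phases, most of the regret in this sketch comes from the exploration pulls.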
On Kernelized Multi-armed Bandits
We consider the stochastic bandit problem with a continuous set of arms, with
the expected reward function over the arms assumed to be fixed but unknown. We
provide two new Gaussian process-based algorithms for continuous bandit
optimization, Improved GP-UCB (IGP-UCB) and GP-Thompson sampling (GP-TS), and
derive corresponding regret bounds. Specifically, the bounds hold when the
expected reward function belongs to the reproducing kernel Hilbert space (RKHS)
that naturally corresponds to a Gaussian process kernel used as input by the
algorithms. Along the way, we derive a new self-normalized concentration
inequality for vector-valued martingales of arbitrary, possibly infinite,
dimension. Finally, experimental evaluations and comparisons with existing
algorithms on synthetic and real-world environments highlight the gains of the
proposed strategies in many cases.
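For orientation, a generic GP-UCB loop over a finite grid of arms might look like the sketch below. It is not the IGP-UCB or GP-TS algorithm from the paper: the fixed confidence multiplier `beta`, the RBF kernel with `length_scale=0.2`, the noise level, and the function names `rbf` and `gp_ucb` are all assumptions chosen for illustration.

```python
import numpy as np

def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_ucb(f, arms, rounds, beta=2.0, noise_std=0.1, seed=0):
    """Pull arms by maximizing posterior mean + beta * posterior std."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for _ in range(rounds):
        if xs:
            X, y = np.array(xs), np.array(ys)
            K = rbf(X, X) + noise_std ** 2 * np.eye(len(X))
            k_star = rbf(arms, X)                 # shape (n_arms, n_obs)
            mean = k_star @ np.linalg.solve(K, y)
            # Posterior variance: k(x, x) - k_*^T K^{-1} k_*, with k(x, x) = 1.
            var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star.T).T, axis=1)
            std = np.sqrt(np.clip(var, 0.0, None))
        else:
            # Prior: zero mean, unit variance everywhere.
            mean, std = np.zeros(len(arms)), np.ones(len(arms))
        i = int(np.argmax(mean + beta * std))
        xs.append(arms[i])
        ys.append(f(arms[i]) + noise_std * rng.standard_normal())
    return np.array(xs), np.array(ys)
```

A call such as `gp_ucb(lambda x: -(x - 0.3) ** 2, np.linspace(0.0, 1.0, 51), rounds=30)` returns the sequence of pulled arms and their noisy rewards. The papers' algorithms differ mainly in how the confidence width (here the constant `beta`) is chosen.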
Hierarchical Exploration for Accelerating Contextual Bandits
Contextual bandit learning is an increasingly popular approach to optimizing
recommender systems via user feedback, but can be slow to converge in practice
due to the need for exploring a large feature space. In this paper, we propose
a coarse-to-fine hierarchical approach for encoding prior knowledge that
drastically reduces the amount of exploration required. Intuitively, user
preferences can be reasonably embedded in a coarse low-dimensional feature
space that can be explored efficiently, requiring exploration in the
high-dimensional space only as necessary. We introduce a bandit algorithm that
explores within this coarse-to-fine spectrum, and prove performance guarantees
that depend on how well the coarse space captures the user's preferences. We
demonstrate substantial improvement over conventional bandit algorithms through
extensive simulation as well as a live user study in the setting of
personalized news recommendation.
Comment: Appears in Proceedings of the 29th International Conference on
Machine Learning (ICML 2012)