Factored Bandits
We introduce the factored bandits model, which is a framework for learning
with limited (bandit) feedback, where actions can be decomposed into a
Cartesian product of atomic actions. Factored bandits incorporate rank-1
bandits as a special case, but significantly relax the assumptions on the form
of the reward function. We provide an anytime algorithm for stochastic factored
bandits and up to constants matching upper and lower regret bounds for the
problem. Furthermore, we show that with a slight modification the proposed
algorithm can be applied to utility based dueling bandits. We obtain an
improvement in the additive terms of the regret bound compared to state of the
art algorithms (the additive terms are dominating up to time horizons which are
exponential in the number of arms)
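The factored action space can be illustrated with a small rank-1 instance, the special case the abstract mentions: an action is a pair from the Cartesian product of two atomic-action sets, and the mean reward factors as a product. The sketch below (hypothetical numbers; a plain UCB1 baseline over the full product, not the paper's algorithm) shows the setup and what ignoring the factored structure looks like:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Illustrative rank-1 instance: an action is a pair (i, j) from the
# Cartesian product of two atomic-action sets, with mean reward u[i] * v[j].
u = np.array([0.2, 0.9, 0.5])
v = np.array([0.8, 0.3])
actions = list(itertools.product(range(len(u)), range(len(v))))

def pull(a):
    i, j = a
    return u[i] * v[j] + rng.normal(0.0, 0.05)

def ucb1(horizon):
    """Baseline that ignores the factored structure: UCB1 over all |U|*|V| arms."""
    n = np.zeros(len(actions))
    s = np.zeros(len(actions))
    for t in range(horizon):
        if t < len(actions):
            k = t                     # play each arm once first
        else:
            bonus = np.sqrt(2.0 * np.log(t + 1) / n)
            k = int(np.argmax(s / n + bonus))
        s[k] += pull(actions[k])
        n[k] += 1
    return actions[int(np.argmax(n))]  # most-played action

best = ucb1(2000)
```

A factored algorithm would instead exploit that the 6-arm product decomposes into 3 + 2 atomic actions, which is where the regret savings come from.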
Hierarchical Exploration for Accelerating Contextual Bandits
Contextual bandit learning is an increasingly popular approach to optimizing
recommender systems via user feedback, but can be slow to converge in practice
due to the need for exploring a large feature space. In this paper, we propose
a coarse-to-fine hierarchical approach for encoding prior knowledge that
drastically reduces the amount of exploration required. Intuitively, user
preferences can be reasonably embedded in a coarse low-dimensional feature
space that can be explored efficiently, requiring exploration in the
high-dimensional space only as necessary. We introduce a bandit algorithm that
explores within this coarse-to-fine spectrum, and prove performance guarantees
that depend on how well the coarse space captures the user's preferences. We
demonstrate substantial improvement over conventional bandit algorithms through
extensive simulation as well as a live user study in the setting of
personalized news recommendation.
Comment: Appears in Proceedings of the 29th International Conference on Machine
Learning (ICML 2012).
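The coarse-to-fine idea can be sketched as LinUCB run on a low-dimensional projection of the full feature space. Everything below is a hypothetical instance (the projection P, dimensions, and reward model are assumptions for illustration, not the paper's exact algorithm):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: article features live in d dimensions, but prior
# knowledge supplies a projection P onto a k-dimensional coarse space,
# and exploration (LinUCB) happens only in those k << d dimensions.
d, k = 50, 3
P = rng.normal(size=(k, d)) / np.sqrt(d)   # hypothetical coarse projection
theta_coarse = np.array([1.0, 0.0, 0.0])   # true preferences in the coarse space

def linucb_coarse(contexts, alpha=1.0, reg=1.0):
    """LinUCB on coarse features; `contexts` is a list of (n_arms, d) arrays."""
    A = reg * np.eye(k)
    b = np.zeros(k)
    rewards = []
    for ctx in contexts:
        X = P @ ctx.T                      # (k, n_arms) coarse features
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b
        width = np.sqrt(np.einsum('ij,jk,ki->i', X.T, A_inv, X))
        a = int(np.argmax(X.T @ theta_hat + alpha * width))
        x = X[:, a]
        r = float(x @ theta_coarse) + rng.normal(0.0, 0.1)
        A += np.outer(x, x)                # rank-one update of the design matrix
        b += r * x
        rewards.append(r)
    return float(np.mean(rewards))

contexts = [rng.normal(size=(10, d)) for _ in range(300)]
avg_reward = linucb_coarse(contexts)
```

With k = 3 the confidence widths shrink after a handful of rounds, which is the intuition behind the claimed reduction in exploration; the paper's guarantees additionally cover falling back to the high-dimensional space when the coarse space is a poor fit.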
Exploration via linearly perturbed loss minimisation
We introduce exploration via linear loss perturbations (EVILL), a randomised
exploration method for structured stochastic bandit problems that works by
solving for the minimiser of a linearly perturbed regularised negative
log-likelihood function. We show that, for the case of generalised linear
bandits, EVILL reduces to perturbed history exploration (PHE), a method where
exploration is done by training on randomly perturbed rewards. In doing so, we
provide a simple and clean explanation of when and why random reward
perturbations give rise to good bandit algorithms. With the data-dependent
perturbations we propose, not present in previous PHE-type methods, EVILL is
shown to match the performance of Thompson-sampling-style
parameter-perturbation methods, both in theory and in practice. Moreover, we
show an example outside of generalised linear bandits where PHE leads to
inconsistent estimates, and thus linear regret, while EVILL remains performant.
Like PHE, EVILL can be implemented in just a few lines of code.
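The PHE mechanism the abstract describes, acting greedily on a model fit to randomly perturbed rewards, can indeed be sketched in a few lines for the plain linear-bandit case. This is an illustrative instance with made-up arms and parameters, not the paper's EVILL implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def phe_linear_bandit(arms, true_theta, horizon, perturb_scale=1.0, reg=1.0):
    """PHE sketch for a linear bandit: fit ridge regression on the observed
    rewards plus fresh i.i.d. Gaussian pseudo-noise, then act greedily."""
    d = arms.shape[1]
    feats, rewards = [], []
    pulls = np.zeros(len(arms), dtype=int)
    total = 0.0
    for t in range(horizon):
        if t < len(arms):
            k = t                                  # play each arm once first
        else:
            A = np.array(feats)
            z = rng.normal(0.0, perturb_scale, size=len(rewards))
            theta_hat = np.linalg.solve(
                A.T @ A + reg * np.eye(d),
                A.T @ (np.array(rewards) + z))     # perturbed-history fit
            k = int(np.argmax(arms @ theta_hat))
        r = float(arms[k] @ true_theta) + rng.normal(0.0, 0.1)
        feats.append(arms[k])
        rewards.append(r)
        pulls[k] += 1
        total += r
    return total, pulls

arms = np.array([[1.0, 0.0], [0.0, 1.0]])          # hypothetical 2-arm instance
total, pulls = phe_linear_bandit(arms, np.array([1.0, 0.2]), horizon=200)
```

The perturbations shrink in effect as data accumulates, so the greedy step concentrates on the better arm; EVILL's contribution is to derive such perturbations from a linearly perturbed loss, which coincides with this reward-side scheme in the generalised linear case.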