Hierarchical Exploration for Accelerating Contextual Bandits
Contextual bandit learning is an increasingly popular approach to optimizing
recommender systems via user feedback, but can be slow to converge in practice
due to the need for exploring a large feature space. In this paper, we propose
a coarse-to-fine hierarchical approach for encoding prior knowledge that
drastically reduces the amount of exploration required. Intuitively, user
preferences can be reasonably embedded in a coarse low-dimensional feature
space that can be explored efficiently, requiring exploration in the
high-dimensional space only as necessary. We introduce a bandit algorithm that
explores within this coarse-to-fine spectrum, and prove performance guarantees
that depend on how well the coarse space captures the user's preferences. We
demonstrate substantial improvement over conventional bandit algorithms through
extensive simulation as well as a live user study in the setting of
personalized news recommendation.
Comment: Appears in Proceedings of the 29th International Conference on
Machine Learning (ICML 2012)
Master-slave Deep Architecture for Top-K Multi-armed Bandits with Non-linear Bandit Feedback and Diversity Constraints
We propose a novel master-slave architecture to solve the top-K
combinatorial multi-armed bandits problem with non-linear bandit feedback and
diversity constraints, which, to the best of our knowledge, is the first
combinatorial bandits setting considering diversity constraints under bandit
feedback. Specifically, to efficiently explore the combinatorial and
constrained action space, we introduce six slave models with distinct
merits to generate diversified samples that balance rewards, constraints,
and efficiency. Moreover, we propose teacher-learning-based optimization
and the policy co-training technique to boost the performance of the multiple
slave models. The master model then collects the elite samples provided by the
slave models and selects the best sample estimated by a neural contextual
UCB-based network to make a decision with a trade-off between exploration and
exploitation. Thanks to the elaborate design of slave models, the co-training
mechanism among slave models, and the novel interactions between the master and
slave models, our approach significantly surpasses existing state-of-the-art
algorithms in both synthetic and real datasets for recommendation tasks. The
code is available at:
https://github.com/huanghanchi/Master-slave-Algorithm-for-Top-K-Bandits.
Comment: IEEE Transactions on Neural Networks and Learning Systems
Minimax Policies for Combinatorial Prediction Games
We address the online linear optimization problem when the actions of the
forecaster are represented by binary vectors. Our goal is to understand the
magnitude of the minimax regret for the worst possible set of actions. We study
the problem under three different assumptions for the feedback: full
information, and the partial-information models of the so-called "semi-bandit"
and "bandit" problems. We consider both L∞- and L2-type restrictions on the
losses assigned by the adversary.
We formulate a general strategy using Bregman projections on top of a
potential-based gradient descent, which generalizes the ones studied in the
series of papers Gyorgy et al. (2007), Dani et al. (2008), Abernethy et al.
(2008), Cesa-Bianchi and Lugosi (2009), Helmbold and Warmuth (2009), Koolen et
al. (2010), Uchiya et al. (2010), Kale et al. (2010) and Audibert and Bubeck
(2010). We provide simple proofs that recover most of the previous results. We
propose new upper bounds for the semi-bandit game. Moreover, we derive lower
bounds for all three feedback assumptions. With the only exception of the
bandit game, the upper and lower bounds are tight, up to a constant factor.
Finally, we answer a question asked by Koolen et al. (2010) by showing that the
exponentially weighted average forecaster is suboptimal against L∞
adversaries.
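
For readers unfamiliar with the exponentially weighted average forecaster named in this abstract, the following is a minimal sketch of it in the full-information setting with binary-vector actions. It is not the paper's code: the helper `m_subsets`, the learning rate `eta`, and the toy loss sequence are all assumptions made for illustration, and the sketch enumerates the action set explicitly, which is only feasible for small problems.

```python
import math
from itertools import combinations


def exp_weights_forecaster(actions, loss_vectors, eta):
    """Exponentially weighted average forecaster, full-information feedback.

    actions: list of binary tuples (the forecaster's action set).
    loss_vectors: one loss vector per round, chosen by the adversary.
    eta: learning rate (an assumed tuning parameter).
    Returns (expected cumulative loss, cumulative loss of the best fixed action).
    """
    cum_loss = [0.0] * len(actions)  # cumulative loss of each action so far
    expected_total = 0.0
    for loss in loss_vectors:
        # Weights are proportional to exp(-eta * cumulative loss); the minimum
        # is subtracted before exponentiating for numerical stability.
        base = min(cum_loss)
        weights = [math.exp(-eta * (c - base)) for c in cum_loss]
        total = sum(weights)
        probs = [w / total for w in weights]
        # The linear loss of an action is its inner product with the loss vector.
        round_losses = [sum(a_i * l_i for a_i, l_i in zip(a, loss))
                        for a in actions]
        expected_total += sum(p * r for p, r in zip(probs, round_losses))
        cum_loss = [c + r for c, r in zip(cum_loss, round_losses)]
    return expected_total, min(cum_loss)


def m_subsets(d, m):
    """Hypothetical action set: binary vectors of length d with exactly m ones."""
    return [tuple(1 if i in chosen else 0 for i in range(d))
            for chosen in combinations(range(d), m)]


# Toy run: 4 coordinates, pick 2 per round, the same loss vector every round.
actions = m_subsets(4, 2)
losses = [(1.0, 0.0, 0.0, 1.0)] * 50
expected, best = exp_weights_forecaster(actions, losses, eta=0.5)
regret = expected - best
```

In this toy run the best fixed action avoids both lossy coordinates, so `regret` measures how much the forecaster pays while its weights concentrate on that action. The sketch only shows how the forecaster is defined; it does not reproduce the suboptimality result stated in the abstract.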
- …