11 research outputs found
MaxGap Bandit: Adaptive Algorithms for Approximate Ranking
This paper studies the problem of adaptively sampling from K distributions
(arms) in order to identify the largest gap between any two adjacent means. We
call this the MaxGap-bandit problem. This problem arises naturally in
approximate ranking, noisy sorting, outlier detection, and top-arm
identification in bandits. The key novelty of the MaxGap-bandit problem is that
it aims to adaptively determine the natural partitioning of the distributions
into a subset with larger means and a subset with smaller means, where the
split is determined by the largest gap rather than a pre-specified rank or
threshold. Estimating an arm's gap requires sampling its neighboring arms in
addition to itself, and this dependence results in a novel hardness parameter
that characterizes the sample complexity of the problem. We propose elimination
and UCB-style algorithms and show that they are minimax optimal. Our
experiments show that the UCB-style algorithms require 6-8x fewer samples than
non-adaptive sampling to achieve the same error.
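As a minimal illustration of the quantity the abstract describes, the sketch below finds the largest gap between adjacent means and the resulting two-way split when the means are already known. This is not the paper's adaptive sampling procedure (the elimination or UCB-style algorithms); the function name `max_gap_split` and the example values are hypothetical.

```python
def max_gap_split(means):
    """Return (gap, lower, upper) for the largest gap between adjacent sorted means.

    `lower` and `upper` are lists of arm indices on each side of the split.
    Assumes the true means are known; a bandit algorithm would have to
    estimate them from samples.
    """
    if len(means) < 2:
        raise ValueError("need at least two means")
    # Arm indices ordered by increasing mean.
    order = sorted(range(len(means)), key=lambda i: means[i])
    # Gap between each pair of adjacent means in sorted order.
    gaps = [means[order[j + 1]] - means[order[j]] for j in range(len(order) - 1)]
    # Position of the largest adjacent gap.
    j_star = max(range(len(gaps)), key=lambda j: gaps[j])
    lower = order[: j_star + 1]
    upper = order[j_star + 1:]
    return gaps[j_star], lower, upper


# Hypothetical example: five arms with two natural clusters.
means = [0.1, 0.9, 0.15, 0.8, 0.2]
gap, lower, upper = max_gap_split(means)
```

Here the split is decided by the data (the widest adjacent gap) rather than by a pre-specified rank or threshold, which is the distinction the abstract draws.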
Multi-Task Off-Policy Learning from Bandit Feedback
Many practical applications, such as recommender systems and learning to
rank, involve solving multiple similar tasks. One example is learning of
recommendation policies for users with similar movie preferences, where the
users may still rank the individual movies slightly differently. Such tasks can
be organized in a hierarchy, where similar tasks are related through a shared
structure. In this work, we formulate this problem as a contextual off-policy
optimization in a hierarchical graphical model from logged bandit feedback. To
solve the problem, we propose a hierarchical off-policy optimization algorithm
(HierOPO), which estimates the parameters of the hierarchical model and then
acts pessimistically with respect to them. We instantiate HierOPO in linear
Gaussian models, for which we also provide an efficient implementation and
analysis. We prove per-task bounds on the suboptimality of the learned
policies, which show a clear improvement over not using the hierarchical model.
We also evaluate the policies empirically. Our theoretical and empirical
results show a clear advantage of using the hierarchy over solving each task
independently.
Comment: 14 pages, 3 figures
Carousel Personalization in Music Streaming Apps with Contextual Bandits
Media services providers, such as music streaming platforms, frequently
leverage swipeable carousels to recommend personalized content to their users.
However, selecting the most relevant items (albums, artists, playlists...) to
display in these carousels is a challenging task, as items are numerous and as
users have different preferences. In this paper, we model carousel
personalization as a contextual multi-armed bandit problem with multiple plays,
cascade-based updates and delayed batch feedback. We empirically show the
effectiveness of our framework at capturing characteristics of real-world
carousels by addressing a large-scale playlist recommendation task on a global
music streaming mobile app. Along with this paper, we publicly release
industrial data from our experiments, as well as an open-source environment to
simulate comparable carousel personalization learning problems.
Comment: 14th ACM Conference on Recommender Systems (RecSys 2020, Best Short Paper Candidate)
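To make "cascade-based updates" concrete, here is a minimal sketch under a generic cascade assumption: the user scans the carousel left to right and stops at the first clicked item, so items after the click are treated as unseen. The paper's exact update rule and its delayed batch feedback are not reproduced; `cascade_update` and the item names are hypothetical.

```python
from collections import defaultdict


def cascade_update(clicks, views, displayed, clicked_item=None):
    """Update click/view counts for one carousel impression.

    Assumption (generic cascade model): items are scanned in display order
    and scanning stops at the clicked item. If nothing is clicked, every
    displayed item counts as seen and not clicked.
    """
    if clicked_item is not None and clicked_item not in displayed:
        raise ValueError("clicked_item must be one of the displayed items")
    for item in displayed:
        views[item] += 1
        if item == clicked_item:
            clicks[item] += 1
            break  # items after the click are treated as unobserved


# Hypothetical usage with two impressions.
clicks = defaultdict(int)
views = defaultdict(int)
cascade_update(clicks, views, ["a", "b", "c", "d"], clicked_item="b")
cascade_update(clicks, views, ["a", "c", "b"], clicked_item=None)
```

The resulting counts could feed any per-item reward estimate (for example, click rate `clicks[item] / views[item]`); which estimator the authors use is not stated in the abstract.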
Meta-Learning for Simple Regret Minimization
We develop a meta-learning framework for simple regret minimization in
bandits. In this framework, a learning agent interacts with a sequence of
bandit tasks, which are sampled i.i.d. from an unknown prior distribution, and
learns its meta-parameters to perform better on future tasks. We propose the
first Bayesian and frequentist algorithms for this meta-learning problem. The
Bayesian algorithm has access to a prior distribution over the meta-parameters
and its meta simple regret over the bandit tasks with a given horizon is small.
In contrast, we show that the meta simple regret of the frequentist algorithm
is worse. However, the frequentist algorithm is more general, because it does not need a prior
distribution over the meta-parameters, and is easier to implement for various
distributions. We instantiate our algorithms for several classes of bandit
problems. Our algorithms are general and we complement our theory by evaluating
them empirically in several environments.