Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem
This paper proposes a new method for the K-armed dueling bandit problem, a
variation on the regular K-armed bandit problem that offers only relative
feedback about pairs of arms. Our approach extends the Upper Confidence Bound
algorithm to the relative setting by using estimates of the pairwise
probabilities to select a promising arm and applying Upper Confidence Bound
with the winner as a benchmark. We prove a finite-time regret bound of order
O(log t). In addition, our empirical results using real data from an
information retrieval application show that it greatly outperforms the state of
the art.
Comment: 13 pages, 6 figures
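The selection rule described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's exact algorithm: the exploration constant `alpha` and the tie-breaking by random choice are hypothetical choices made here for concreteness.

```python
import math
import random

def rucb_round(wins, t, alpha=0.51):
    """One illustrative round of Relative-UCB-style pair selection.

    wins[i][j] = number of duels arm i has won against arm j.
    Returns the pair (champion, challenger) to duel next.
    """
    k = len(wins)

    def ucb(i, j):
        # Optimistic estimate of P(arm i beats arm j).
        n = wins[i][j] + wins[j][i]
        if n == 0:
            return 1.0  # full optimism for unexplored pairs
        return wins[i][j] / n + math.sqrt(alpha * math.log(t) / n)

    # Candidate champions: arms whose optimistic win rate is >= 1/2 vs all others.
    champs = [i for i in range(k)
              if all(ucb(i, j) >= 0.5 for j in range(k) if j != i)]
    c = random.choice(champs) if champs else random.randrange(k)

    # Challenger: the arm most optimistic about beating the champion.
    d = max((j for j in range(k) if j != c), key=lambda j: ucb(j, c))
    return c, d
```

For example, with two arms where arm 0 has won 10 of 12 duels, arm 0 is the only plausible champion and arm 1 is chosen as its challenger.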
Copeland Dueling Bandits
A version of the dueling bandit problem is addressed in which a Condorcet
winner may not exist. Two algorithms are proposed that instead seek to minimize
regret with respect to the Copeland winner, which, unlike the Condorcet winner,
is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed
for small numbers of arms, while the second, Scalable Copeland Bandits (SCB),
works better for large-scale problems. We provide theoretical results bounding
the regret accumulated by CCB and SCB, both substantially improving existing
results. Such existing results either offer bounds of the form O(K log T)
but require restrictive assumptions, or offer bounds of the form O(K^2 log T)
without requiring such assumptions. Our results offer the best of both
worlds: O(K log T) bounds without restrictive assumptions.
Comment: 33 pages, 8 figures
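The Copeland winner the abstract refers to is easy to state concretely: an arm's Copeland score counts the opponents it beats with probability above 1/2, and a Copeland winner maximizes that score. The helper below illustrates the definition only; it is not part of CCB or SCB themselves.

```python
def copeland_winner(p):
    """Copeland winners of a pairwise-preference matrix.

    p[i][j] is the probability that arm i beats arm j in a duel
    (so p[i][j] + p[j][i] == 1).  Unlike a Condorcet winner, a
    Copeland winner always exists, though it may be tied.
    """
    k = len(p)
    scores = [sum(1 for j in range(k) if j != i and p[i][j] > 0.5)
              for i in range(k)]
    best = max(scores)
    return [i for i in range(k) if scores[i] == best]  # may be a tie
```

With a preference cycle among arms 0, 1, and 2 (each beating one other and a weak arm 3), no Condorcet winner exists, yet all three cycle members are Copeland winners.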
Exponential Regret Bounds for Gaussian Process Bandits with Deterministic Observations
This paper analyzes the problem of Gaussian process (GP) bandits with
deterministic observations. The analysis uses a branch and bound algorithm that
is related to the UCB algorithm of (Srinivas et al, 2010). For GPs with
Gaussian observation noise, with variance strictly greater than zero, Srinivas
et al proved that the regret vanishes at the approximate rate of
O(1/√t), where t is the number of observations. To complement their
result, we attack the deterministic case and attain a much faster exponential
convergence rate. Under some regularity assumptions, we show that the regret
decreases asymptotically according to O(e^{-τt/(ln t)^{d/4}})
with high probability. Here, d is the dimension of the search space and tau is
a constant that depends on the behaviour of the objective function near its
global maximum.
Comment: Appears in Proceedings of the 29th International Conference on
Machine Learning (ICML 2012). arXiv admin note: substantial text overlap with
arXiv:1203.217
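The gap between the two regimes can be made tangible numerically. The sketch below compares the polynomial rate of the noisy setting with the exponential rate of the deterministic setting; the constants τ and d are illustrative choices, not values from the paper, and leading constants are omitted.

```python
import math

def noisy_rate(t):
    """Approximate O(1/sqrt(t)) regret rate from the noisy GP setting
    (Srinivas et al., 2010); constants omitted for illustration."""
    return 1.0 / math.sqrt(t)

def deterministic_rate(t, d=2, tau=1.0):
    """Exponential rate O(exp(-tau * t / (ln t)^(d/4))) from the
    deterministic setting; tau and d are illustrative choices here."""
    return math.exp(-tau * t / math.log(t) ** (d / 4))
```

Already at t = 100 the exponential rate is many orders of magnitude below the polynomial one, which is the sense in which the deterministic analysis gives "much faster" convergence.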
MergeDTS: A Method for Effective Large-Scale Online Ranker Evaluation
Online ranker evaluation is one of the key challenges in information
retrieval. While the preferences of rankers can be inferred by interleaving
methods, the problem of how to effectively choose the ranker pair that
generates the interleaved list without degrading the user experience too much
is still challenging. On the one hand, if two rankers have not been compared
enough, the inferred preference can be noisy and inaccurate. On the other, if
two rankers are compared too many times, the interleaving process inevitably
hurts the user experience too much. This dilemma is known as the exploration
versus exploitation tradeoff. It is captured by the K-armed dueling bandit
problem, which is a variant of the K-armed bandit problem, where the feedback
comes in the form of pairwise preferences. Today's deployed search systems can
evaluate a large number of rankers concurrently, and scaling effectively in the
presence of numerous rankers is a critical aspect of K-armed dueling bandit
problems.
In this paper, we focus on solving the large-scale online ranker evaluation
problem under the so-called Condorcet assumption, where there exists an optimal
ranker that is preferred to all other rankers. We propose Merge Double Thompson
Sampling (MergeDTS), which first utilizes a divide-and-conquer strategy that
localizes the comparisons carried out by the algorithm to small batches of
rankers, and then employs Thompson Sampling (TS) to reduce the comparisons
between suboptimal rankers inside these small batches. The effectiveness
(regret) and efficiency (time complexity) of MergeDTS are extensively evaluated
using examples from the domain of online evaluation for web search. Our main
finding is that for large-scale Condorcet ranker evaluation problems, MergeDTS
outperforms the state-of-the-art dueling bandit algorithms.
Comment: Accepted at TOIS
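The Thompson Sampling step inside a batch can be sketched as follows. This is a simplified illustration of the idea, not the published MergeDTS algorithm: each pairwise win probability gets a Beta posterior, a plausible value is sampled from it, and the duel is played between the two rankers that look strongest under the sample, so clearly suboptimal rankers are rarely compared.

```python
import random

def ts_duel_pair(wins, batch):
    """Thompson-Sampling-style choice of a duel inside one batch.

    wins[i][j] counts duels ranker i has won against ranker j.  For
    each opponent we sample a win probability from a Beta(wins+1,
    losses+1) posterior and duel the two rankers with the highest
    sampled total score.
    """
    def sampled_score(i):
        return sum(random.betavariate(wins[i][j] + 1, wins[j][i] + 1)
                   for j in batch if j != i)

    ranked = sorted(batch, key=sampled_score, reverse=True)
    return ranked[0], ranked[1]  # champion and strongest rival
```

When one ranker has dominated its batch so far, the sampled scores almost always keep it as the champion, and exploration concentrates on its closest rivals.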
Density-based User Representation through Gaussian Process Regression for Multi-interest Personalized Retrieval
Accurate modeling of the diverse and dynamic interests of users remains a
significant challenge in the design of personalized recommender systems.
Existing user modeling methods, like single-point and multi-point
representations, have limitations w.r.t. accuracy, diversity, computational
cost, and adaptability. To overcome these deficiencies, we introduce
density-based user representations (DURs), a novel model that leverages
Gaussian process regression for effective multi-interest recommendation and
retrieval. Our approach, GPR4DUR, exploits DURs to capture user interest
variability without manual tuning, incorporates uncertainty-awareness, and
scales well to large numbers of users. Experiments using real-world offline
datasets confirm the adaptability and efficiency of GPR4DUR, while online
experiments with simulated users demonstrate its ability to address the
exploration-exploitation trade-off by effectively utilizing model uncertainty.
Comment: 16 pages, 5 figures
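A toy stand-in for the GP-regression backbone of such a representation is shown below. It computes the posterior mean of a 1-D GP at a query point; here `xs` could be (scalar stand-ins for) item embeddings a user interacted with, `ys` the observed affinities, and the posterior mean scores a candidate item. All names are illustrative assumptions, not GPR4DUR's API, and real embeddings would be high-dimensional.

```python
import math

def rbf(x, y, ls=1.0):
    """Squared-exponential kernel, a standard GPR covariance choice."""
    return math.exp(-((x - y) ** 2) / (2 * ls ** 2))

def gp_posterior_mean(xs, ys, x_star, noise=1e-6):
    """Posterior mean of a 1-D GP regression at x_star.

    Solves (K + noise*I) a = y by Gaussian elimination, then returns
    the dot product of the query's kernel vector with a.
    """
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    a = list(ys)
    # Forward elimination (no pivoting; adequate for this tiny PSD system).
    for i in range(n):
        for r in range(i + 1, n):
            f = K[r][i] / K[i][i]
            for c in range(i, n):
                K[r][c] -= f * K[i][c]
            a[r] -= f * a[i]
    # Back substitution.
    for i in reversed(range(n)):
        a[i] = (a[i] - sum(K[i][c] * a[c] for c in range(i + 1, n))) / K[i][i]
    return sum(rbf(x_star, xs[i]) * a[i] for i in range(n))
```

With near-zero noise the posterior mean interpolates the observations, so querying at a training point returns (approximately) the observed affinity.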
Carousel Personalization in Music Streaming Apps with Contextual Bandits
Media services providers, such as music streaming platforms, frequently
leverage swipeable carousels to recommend personalized content to their users.
However, selecting the most relevant items (albums, artists, playlists...) to
display in these carousels is a challenging task, as items are numerous and as
users have different preferences. In this paper, we model carousel
personalization as a contextual multi-armed bandit problem with multiple plays,
cascade-based updates and delayed batch feedback. We empirically show the
effectiveness of our framework at capturing characteristics of real-world
carousels by addressing a large-scale playlist recommendation task on a global
music streaming mobile app. Along with this paper, we publicly release
industrial data from our experiments, as well as an open-source environment to
simulate comparable carousel personalization learning problems.
Comment: 14th ACM Conference on Recommender Systems (RecSys 2020, Best Short
Paper Candidate)
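The cascade-based update the abstract mentions can be sketched as follows. This is a simplified illustration under a standard cascade assumption, with hypothetical names rather than the paper's released environment API: the user scans the carousel left to right and stops after the last click, so only items up to and including that position yield feedback. When nothing is clicked, we count every displayed item as seen, which is a further simplification made here.

```python
def cascade_update(displayed, clicks, successes, pulls):
    """Cascade-style feedback update for one carousel display.

    displayed: ordered list of item ids shown in the carousel.
    clicks:    set of item ids the user clicked.
    successes, pulls: per-item Bernoulli counters, updated in place.
    """
    if clicks:
        # Items after the last clicked position are assumed unseen.
        last = max(i for i, item in enumerate(displayed) if item in clicks)
        seen = displayed[: last + 1]
    else:
        seen = displayed
    for item in seen:
        pulls[item] += 1        # seen: counts as a pull
        if item in clicks:
            successes[item] += 1  # clicked: counts as a success
```

For a carousel [5, 3, 8, 1] where only item 3 is clicked, items 5 and 3 are credited with a pull, item 3 with a success, and items 8 and 1 receive no feedback at all, which is exactly why the batch updates are delayed and partial.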
Regret bounds for Gaussian process bandits without observation noise
This thesis presents some statistical refinements of the bandits approach presented in [11] in the situation where there is no observation noise. We give an improved bound on the cumulative regret of the samples chosen by an algorithm that is related (though not identical) to the UCB algorithm of [11] in a complementary setting. Given a function f on a domain D ⊆ R^d, sampled from a Gaussian process with an anisotropic kernel that is four times differentiable at 0, and a lattice L ⊆ D, we show that if the points in L are chosen for sampling using our branch-and-bound algorithm, the regret asymptotically decreases according to O(e^{-τt/(ln t)^{d/4}}) with high probability, where t is the number of observations carried out so far and τ is a constant that depends on the objective function.