14 research outputs found

    Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem

    This paper proposes a new method for the K-armed dueling bandit problem, a variation on the regular K-armed bandit problem that offers only relative feedback about pairs of arms. Our approach extends the Upper Confidence Bound algorithm to the relative setting by using estimates of the pairwise probabilities to select a promising arm and applying Upper Confidence Bound with the winner as a benchmark. We prove a finite-time regret bound of order O(log t). In addition, our empirical results using real data from an information retrieval application show that it greatly outperforms the state of the art.
    Comment: 13 pages, 6 figures
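
    To make the selection rule concrete, here is a minimal sketch of one relative-UCB step along the lines described above; the function name, the exploration parameter alpha, and the tie-breaking choices are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def rucb_select(wins, t, alpha=0.51):
    """One selection step of a relative-UCB rule (illustrative sketch).

    wins[i, j] = number of past duels in which arm i beat arm j.
    Returns (champion, challenger), the next pair of arms to duel.
    """
    K = wins.shape[0]
    n = wins + wins.T                                   # duels per pair
    p_hat = np.divide(wins, n,
                      out=np.full((K, K), 0.5),         # unseen pairs -> 0.5
                      where=n > 0)                      # empirical win rates
    bonus = np.sqrt(alpha * np.log(max(t, 2)) / np.maximum(n, 1))
    u = p_hat + bonus                                   # optimistic win rates
    np.fill_diagonal(u, 0.5)

    # Champion: an arm whose optimistic win rate against every rival is
    # at least 1/2, i.e. it could still plausibly be the best arm.
    plausible = np.flatnonzero((u >= 0.5).all(axis=1))
    champion = int(np.random.choice(plausible if plausible.size else np.arange(K)))

    # Challenger: the rival with the best optimistic chance of beating
    # the champion, making the next duel maximally informative.
    u_vs_champ = u[:, champion].copy()
    u_vs_champ[champion] = -np.inf
    challenger = int(np.argmax(u_vs_champ))
    return champion, challenger
```

    After each duel, incrementing wins[winner, loser] and repeating this step yields the benchmark-against-the-winner behaviour the abstract describes.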

    Copeland Dueling Bandits

    A version of the dueling bandit problem is addressed in which a Condorcet winner may not exist. Two algorithms are proposed that instead seek to minimize regret with respect to the Copeland winner, which, unlike the Condorcet winner, is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed for small numbers of arms, while the second, Scalable Copeland Bandits (SCB), works better for large-scale problems. We provide theoretical results bounding the regret accumulated by CCB and SCB, both substantially improving existing results. Such existing results either offer bounds of the form O(K log T) but require restrictive assumptions, or offer bounds of the form O(K^2 log T) without requiring such assumptions. Our results offer the best of both worlds: O(K log T) bounds without restrictive assumptions.
    Comment: 33 pages, 8 figures
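
    Since the contrast between the two solution concepts drives the paper, a minimal sketch may help; the preference-matrix input and the tie-breaking rule are assumptions for illustration.

```python
import numpy as np

def copeland_winner(P):
    """Return a Copeland winner of a preference matrix (sketch).

    P[i, j] is the probability that arm i beats arm j.  An arm's
    Copeland score is the number of rivals it beats with probability
    above 1/2.  A Condorcet winner (score K-1) may not exist, but an
    arm maximizing the Copeland score always does; ties are broken
    here by lowest index.
    """
    beats = P > 0.5
    np.fill_diagonal(beats, False)
    scores = beats.sum(axis=1)
    return int(np.argmax(scores)), scores
```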

    Exponential Regret Bounds for Gaussian Process Bandits with Deterministic Observations

    This paper analyzes the problem of Gaussian process (GP) bandits with deterministic observations. The analysis uses a branch and bound algorithm that is related to the UCB algorithm of Srinivas et al. (2010). For GPs with Gaussian observation noise, with variance strictly greater than zero, Srinivas et al. proved that the regret vanishes at the approximate rate of O(1/√t), where t is the number of observations. To complement their result, we attack the deterministic case and attain a much faster exponential convergence rate. Under some regularity assumptions, we show that the regret decreases asymptotically according to O(e^{-τt/(ln t)^{d/4}}) with high probability. Here, d is the dimension of the search space and τ is a constant that depends on the behaviour of the objective function near its global maximum.
    Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012). arXiv admin note: substantial text overlap with arXiv:1203.217
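
    To see how large the gap between the two regimes is, the following toy comparison evaluates both rates numerically; the constants τ = 0.1 and d = 2 are arbitrary illustrative choices, not values from the paper.

```python
import math

def noisy_rate(t):
    """O(1/sqrt(t)): the decay rate known for noisy GP observations."""
    return 1.0 / math.sqrt(t)

def deterministic_rate(t, tau=0.1, d=2):
    """O(exp(-tau*t/(ln t)^(d/4))): the exponential rate shown here."""
    return math.exp(-tau * t / math.log(t) ** (d / 4))

for t in (10, 100, 1000):
    print(f"t={t:4d}  noisy={noisy_rate(t):.2e}  "
          f"deterministic={deterministic_rate(t):.2e}")
```

    Even at t = 1000 the noisy-case bound is still around 3e-2, while the deterministic-case bound has already dropped below 1e-16 for these constants.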

    MergeDTS: A Method for Effective Large-Scale Online Ranker Evaluation

    Online ranker evaluation is one of the key challenges in information retrieval. While the preferences of rankers can be inferred by interleaving methods, the problem of how to effectively choose the ranker pair that generates the interleaved list without degrading the user experience too much is still challenging. On the one hand, if two rankers have not been compared enough, the inferred preference can be noisy and inaccurate. On the other hand, if two rankers are compared too many times, the interleaving process inevitably hurts the user experience. This dilemma is known as the exploration versus exploitation tradeoff. It is captured by the K-armed dueling bandit problem, which is a variant of the K-armed bandit problem, where the feedback comes in the form of pairwise preferences. Today's deployed search systems can evaluate a large number of rankers concurrently, and scaling effectively in the presence of numerous rankers is a critical aspect of K-armed dueling bandit problems. In this paper, we focus on solving the large-scale online ranker evaluation problem under the so-called Condorcet assumption, where there exists an optimal ranker that is preferred to all other rankers. We propose Merge Double Thompson Sampling (MergeDTS), which first utilizes a divide-and-conquer strategy that localizes the comparisons carried out by the algorithm to small batches of rankers, and then employs Thompson Sampling (TS) to reduce the comparisons between suboptimal rankers inside these small batches. The effectiveness (regret) and efficiency (time complexity) of MergeDTS are extensively evaluated using examples from the domain of online evaluation for web search. Our main finding is that for large-scale Condorcet ranker evaluation problems, MergeDTS outperforms the state-of-the-art dueling bandit algorithms.
    Comment: Accepted at TOIS
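
    A drastically simplified sketch of the divide-and-conquer idea follows; it is not MergeDTS itself (the duel oracle, batch size, and the majority-vote elimination below stand in for the paper's Thompson-Sampling-based comparisons).

```python
import random

def batched_tournament(rankers, duel, batch_size=8, duels_per_pair=15):
    """Coarse sketch of batch-localized dueling, not the real MergeDTS.

    rankers: list of arms; duel(a, b) returns the winner of one noisy
    interleaved comparison.  Comparisons stay inside small batches and
    only batch winners advance, so most duels never waste user
    impressions on pairs of clearly suboptimal rankers.
    """
    rankers = list(rankers)
    while len(rankers) > 1:
        random.shuffle(rankers)
        survivors = []
        for i in range(0, len(rankers), batch_size):
            batch = rankers[i:i + batch_size]
            best = batch[0]
            for arm in batch[1:]:
                wins = sum(duel(best, arm) == arm
                           for _ in range(duels_per_pair))
                if wins > duels_per_pair // 2:    # arm beats current best
                    best = arm
            survivors.append(best)
        rankers = survivors
    return rankers[0]
```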

    Density-based User Representation through Gaussian Process Regression for Multi-interest Personalized Retrieval

    Accurate modeling of the diverse and dynamic interests of users remains a significant challenge in the design of personalized recommender systems. Existing user modeling methods, like single-point and multi-point representations, have limitations w.r.t. accuracy, diversity, computational cost, and adaptability. To overcome these deficiencies, we introduce density-based user representations (DURs), a novel model that leverages Gaussian process regression for effective multi-interest recommendation and retrieval. Our approach, GPR4DUR, exploits DURs to capture user interest variability without manual tuning, incorporates uncertainty-awareness, and scales well to large numbers of users. Experiments using real-world offline datasets confirm the adaptability and efficiency of GPR4DUR, while online experiments with simulated users demonstrate its ability to address the exploration-exploitation trade-off by effectively utilizing model uncertainty.
    Comment: 16 pages, 5 figures
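
    A minimal sketch of the density-based scoring idea, written with scikit-learn's GP regressor; the RBF kernel, the noise level alpha, and the UCB-style bonus weight beta are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def score_candidates(interacted_emb, ratings, candidate_emb, beta=1.0):
    """Score candidate items with a GP-based user representation (sketch).

    A GP is fit over the embeddings of items the user interacted with,
    giving a density-like view of their interests.  Candidates are then
    ranked by posterior mean plus a beta-weighted uncertainty bonus, so
    regions of the item space the user has not explored still surface.
    """
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                   alpha=1e-3, normalize_y=True)
    gpr.fit(interacted_emb, ratings)
    mean, std = gpr.predict(candidate_emb, return_std=True)
    return mean + beta * std
```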

    Carousel Personalization in Music Streaming Apps with Contextual Bandits

    Media services providers, such as music streaming platforms, frequently leverage swipeable carousels to recommend personalized content to their users. However, selecting the most relevant items (albums, artists, playlists...) to display in these carousels is a challenging task, as items are numerous and as users have different preferences. In this paper, we model carousel personalization as a contextual multi-armed bandit problem with multiple plays, cascade-based updates and delayed batch feedback. We empirically show the effectiveness of our framework at capturing characteristics of real-world carousels by addressing a large-scale playlist recommendation task on a global music streaming mobile app. Along with this paper, we publicly release industrial data from our experiments, as well as an open-source environment to simulate comparable carousel personalization learning problems.
    Comment: 14th ACM Conference on Recommender Systems (RecSys 2020), Best Short Paper Candidate
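
    The cascade-based update is the distinctive modelling choice here, so a small sketch may help; the counts structure and the convention for no-click carousels are assumptions, not the released code.

```python
def cascade_update(counts, carousel, clicked_index):
    """Cascade-style feedback update for one displayed carousel (sketch).

    counts maps item -> [clicks, seen_displays].  Under a cascade model
    the user scans the carousel top-down, so items before the clicked
    position count as seen-but-skipped and the clicked item as a
    success; items after the click are treated as unobserved.  With no
    click, all displayed items are assumed seen (one common convention).
    """
    last_seen = clicked_index if clicked_index is not None else len(carousel) - 1
    for pos in range(last_seen + 1):
        item = carousel[pos]
        counts[item][1] += 1                 # displayed and actually seen
        if pos == clicked_index:
            counts[item][0] += 1             # clicked
    return counts
```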

    Regret bounds for Gaussian process bandits without observation noise

    This thesis presents some statistical refinements of the bandits approach presented in [11] in the situation where there is no observation noise. We give an improved bound on the cumulative regret of the samples chosen by an algorithm that is related (though not identical) to the UCB algorithm of [11] in a complementary setting. Given a function f on a domain D ⊆ R^d, sampled from a Gaussian process with an anisotropic kernel that is four times differentiable at 0, and a lattice L ⊆ D, we show that if the points in L are chosen for sampling using our branch-and-bound algorithm, the regret asymptotically decreases according to O(e^{-τt/(ln t)^{d/4}}) with high probability, where t is the number of observations carried out so far and τ is a constant that depends on the objective function.
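
    For concreteness, the quantity being bounded can be written out as follows; this is standard GP-bandit notation assumed for illustration, not text quoted from the thesis.

```latex
% f is the function sampled from the GP on D, x_t the point chosen at step t.
\[
  r_t \;=\; f(x^\ast) - f(x_t),
  \qquad x^\ast \;=\; \arg\max_{x \in D} f(x),
\]
% and the thesis shows that, with high probability,
\[
  r_t \;=\; O\!\left(e^{-\tau t/(\ln t)^{d/4}}\right).
\]
```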