36 research outputs found

    Copeland Dueling Bandits

    Full text link
    A version of the dueling bandit problem is addressed in which a Condorcet winner may not exist. Two algorithms are proposed that instead seek to minimize regret with respect to the Copeland winner, which, unlike the Condorcet winner, is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed for small numbers of arms, while the second, Scalable Copeland Bandits (SCB), works better for large-scale problems. We provide theoretical results bounding the regret accumulated by CCB and SCB, both substantially improving existing results. Such existing results either offer bounds of the form O(KlogT)O(K \log T) but require restrictive assumptions, or offer bounds of the form O(K2logT)O(K^2 \log T) without requiring such assumptions. Our results offer the best of both worlds: O(KlogT)O(K \log T) bounds without restrictive assumptions.Comment: 33 pages, 8 figure

    Correlational Dueling Bandits with Application to Clinical Treatment in Large Decision Spaces

    Get PDF
    We consider sequential decision making under uncertainty, where the goal is to optimize over a large decision space using noisy comparative feedback. This problem can be formulated as a K-armed Dueling Bandits problem where K is the total number of decisions. When K is very large, existing dueling bandits algorithms suffer huge cumulative regret before converging on the optimal arm. This paper studies the dueling bandits problem with a large number of arms that exhibit a low-dimensional correlation structure. Our problem is motivated by a clinical decision making process in large decision space. We propose an efficient algorithm CorrDuel which optimizes the exploration/exploitation tradeoff in this large decision space of clinical treatments. More broadly, our approach can be applied to other sequential decision problems with large and structured decision spaces. We derive regret bounds, and evaluate performance in simulation experiments as well as on a live clinical trial of therapeutic spinal cord stimulation. To our knowledge, this marks the first time an online learning algorithm was applied towards spinal cord injury treatments. Our experimental results show the effectiveness and efficiency of our approach

    Correlational Dueling Bandits with Application to Clinical Treatment in Large Decision Spaces

    Get PDF
    We consider sequential decision making under uncertainty, where the goal is to optimize over a large decision space using noisy comparative feedback. This problem can be formulated as a K-armed Dueling Bandits problem where K is the total number of decisions. When K is very large, existing dueling bandits algorithms suffer huge cumulative regret before converging on the optimal arm. This paper studies the dueling bandits problem with a large number of arms that exhibit a low-dimensional correlation structure. Our problem is motivated by a clinical decision making process in large decision space. We propose an efficient algorithm CorrDuel which optimizes the exploration/exploitation tradeoff in this large decision space of clinical treatments. More broadly, our approach can be applied to other sequential decision problems with large and structured decision spaces. We derive regret bounds, and evaluate performance in simulation experiments as well as on a live clinical trial of therapeutic spinal cord stimulation. To our knowledge, this marks the first time an online learning algorithm was applied towards spinal cord injury treatments. Our experimental results show the effectiveness and efficiency of our approach

    MergeDTS: A Method for Effective Large-Scale Online Ranker Evaluation

    Full text link
    Online ranker evaluation is one of the key challenges in information retrieval. While the preferences of rankers can be inferred by interleaving methods, the problem of how to effectively choose the ranker pair that generates the interleaved list without degrading the user experience too much is still challenging. On the one hand, if two rankers have not been compared enough, the inferred preference can be noisy and inaccurate. On the other, if two rankers are compared too many times, the interleaving process inevitably hurts the user experience too much. This dilemma is known as the exploration versus exploitation tradeoff. It is captured by the KK-armed dueling bandit problem, which is a variant of the KK-armed bandit problem, where the feedback comes in the form of pairwise preferences. Today's deployed search systems can evaluate a large number of rankers concurrently, and scaling effectively in the presence of numerous rankers is a critical aspect of KK-armed dueling bandit problems. In this paper, we focus on solving the large-scale online ranker evaluation problem under the so-called Condorcet assumption, where there exists an optimal ranker that is preferred to all other rankers. We propose Merge Double Thompson Sampling (MergeDTS), which first utilizes a divide-and-conquer strategy that localizes the comparisons carried out by the algorithm to small batches of rankers, and then employs Thompson Sampling (TS) to reduce the comparisons between suboptimal rankers inside these small batches. The effectiveness (regret) and efficiency (time complexity) of MergeDTS are extensively evaluated using examples from the domain of online evaluation for web search. Our main finding is that for large-scale Condorcet ranker evaluation problems, MergeDTS outperforms the state-of-the-art dueling bandit algorithms.Comment: Accepted at TOI
    corecore