Copeland Dueling Bandits
A version of the dueling bandit problem is addressed in which a Condorcet
winner may not exist. Two algorithms are proposed that instead seek to minimize
regret with respect to the Copeland winner, which, unlike the Condorcet winner,
is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed
for small numbers of arms, while the second, Scalable Copeland Bandits (SCB),
works better for large-scale problems. We provide theoretical results bounding
the regret accumulated by CCB and SCB, both substantially improving existing
results. Such existing results either offer bounds of the form O(K log T)
but require restrictive assumptions, or offer bounds of the form O(K^2 log T) without requiring such assumptions. Our results offer the best of both
worlds: O(K log T) bounds without restrictive assumptions.
Comment: 33 pages, 8 figures
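The distinction the abstract draws can be made concrete with a small sketch: a Condorcet winner must beat every other arm, while a Copeland winner only needs to beat the largest number of arms, so one always exists. The 4-arm preference matrix below is invented for illustration and is not from the paper:

```python
# Hypothetical pairwise preference matrix: P[i][j] is the probability
# that arm i beats arm j in a duel (diagonal entries are unused 0.5s).
P = [
    [0.5, 0.6, 0.4, 0.7],
    [0.4, 0.5, 0.8, 0.6],
    [0.6, 0.2, 0.5, 0.3],
    [0.3, 0.4, 0.7, 0.5],
]
k = len(P)

# Copeland score: the number of other arms each arm beats.
copeland = [sum(P[i][j] > 0.5 for j in range(k) if j != i) for i in range(k)]

# A Condorcet winner beats all k-1 others; here no arm does.
condorcet = [i for i in range(k) if copeland[i] == k - 1]

# Copeland winners (maximal score) always exist, ties allowed.
winners = [i for i in range(k) if copeland[i] == max(copeland)]

print(condorcet)  # [] -- no Condorcet winner in this instance
print(winners)    # [0, 1] -- Copeland winners still exist
```

In this instance arms 0 and 1 each beat two others, so both are Copeland winners even though every arm loses at least one duel.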
Correlational Dueling Bandits with Application to Clinical Treatment in Large Decision Spaces
We consider sequential decision making under uncertainty, where the goal is to optimize over a large decision space using noisy comparative feedback. This problem can be formulated as a K-armed Dueling Bandits problem, where K is the total number of decisions. When K is very large, existing dueling bandits algorithms suffer huge cumulative regret before converging on the optimal arm. This paper studies the dueling bandits problem with a large number of arms that exhibit a low-dimensional correlation structure. Our problem is motivated by a clinical decision making process in a large decision space. We propose an efficient algorithm, CorrDuel, which optimizes the exploration/exploitation tradeoff in this large decision space of clinical treatments. More broadly, our approach can be applied to other sequential decision problems with large and structured decision spaces. We derive regret bounds, and evaluate performance in simulation experiments as well as on a live clinical trial of therapeutic spinal cord stimulation. To our knowledge, this marks the first time an online learning algorithm was applied to spinal cord injury treatments. Our experimental results show the effectiveness and efficiency of our approach.
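The comparative-feedback setting described above can be sketched in a few lines: each round is a noisy duel between two arms rather than an observed numeric reward. The Bradley-Terry-style win model and the utilities below are illustrative assumptions, not the paper's clinical setup or the CorrDuel algorithm itself:

```python
import math
import random

def duel(i, j, utility, rng):
    """One noisy comparison: arm i beats arm j with probability
    sigmoid(utility[i] - utility[j]). The Bradley-Terry-style model
    is an assumption for illustration only."""
    p = 1.0 / (1.0 + math.exp(utility[j] - utility[i]))
    return i if rng.random() < p else j

rng = random.Random(0)
utility = [0.0, 2.0, 0.5]   # hypothetical latent utilities; arm 1 is best
wins = [0, 0, 0]
for _ in range(1000):
    winner = duel(1, 0, utility, rng)  # repeatedly duel arm 1 vs. arm 0
    wins[winner] += 1
# Arm 1 wins roughly sigmoid(2.0) ~ 88% of these duels, so the learner
# only ever sees which arm won, never the utilities themselves.
```

The algorithmic challenge the abstract points to is that with K arms there are on the order of K^2 such pairs to compare, which is what the low-dimensional correlation structure is exploited to avoid.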
MergeDTS: A Method for Effective Large-Scale Online Ranker Evaluation
Online ranker evaluation is one of the key challenges in information
retrieval. While the preferences of rankers can be inferred by interleaving
methods, the problem of how to effectively choose the ranker pair that
generates the interleaved list without degrading the user experience too much
is still challenging. On the one hand, if two rankers have not been compared
enough, the inferred preference can be noisy and inaccurate. On the other
hand, if two rankers are compared too many times, the interleaving process
inevitably degrades the user experience. This dilemma is known as the exploration
versus exploitation tradeoff. It is captured by the K-armed dueling bandit
problem, which is a variant of the K-armed bandit problem, where the feedback
comes in the form of pairwise preferences. Today's deployed search systems can
evaluate a large number of rankers concurrently, and scaling effectively in the
presence of numerous rankers is a critical aspect of K-armed dueling bandit
problems.
In this paper, we focus on solving the large-scale online ranker evaluation
problem under the so-called Condorcet assumption, where there exists an optimal
ranker that is preferred to all other rankers. We propose Merge Double Thompson
Sampling (MergeDTS), which first utilizes a divide-and-conquer strategy that
localizes the comparisons carried out by the algorithm to small batches of
rankers, and then employs Thompson Sampling (TS) to reduce the comparisons
between suboptimal rankers inside these small batches. The effectiveness
(regret) and efficiency (time complexity) of MergeDTS are extensively evaluated
using examples from the domain of online evaluation for web search. Our main
finding is that for large-scale Condorcet ranker evaluation problems, MergeDTS
outperforms the state-of-the-art dueling bandit algorithms.
Comment: Accepted at TOIS
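The Thompson Sampling ingredient mentioned above can be sketched as follows. This is a simplified Beta-Bernoulli sketch of sampling-based pair selection under a Copeland-style criterion, not the MergeDTS algorithm itself: the merge-style batching is omitted, and drawing a single posterior sample per round is a simplification of Double Thompson Sampling:

```python
import random

class PairwiseTS:
    """Beta-Bernoulli Thompson Sampling over pairwise win probabilities.
    A minimal sketch of the sampling idea, under assumptions noted in
    the surrounding text; not the full MergeDTS algorithm."""

    def __init__(self, k, rng=None):
        self.k = k
        # Beta(1, 1) priors on each pairwise win probability.
        self.wins = [[1] * k for _ in range(k)]
        self.losses = [[1] * k for _ in range(k)]
        self.rng = rng or random.Random(0)

    def select_pair(self):
        # Sample a plausible preference matrix from the posterior,
        # pick the arm with the highest sampled Copeland score as the
        # champion, and its strongest sampled challenger as opponent.
        theta = [[self.rng.betavariate(self.wins[i][j], self.losses[i][j])
                  for j in range(self.k)] for i in range(self.k)]
        copeland = [sum(theta[i][j] > 0.5 for j in range(self.k) if j != i)
                    for i in range(self.k)]
        champ = max(range(self.k), key=lambda i: copeland[i])
        challenger = max((j for j in range(self.k) if j != champ),
                         key=lambda j: theta[j][champ])
        return champ, challenger

    def update(self, winner, loser):
        # Record the observed duel outcome in the Beta posteriors.
        self.wins[winner][loser] += 1
        self.losses[loser][winner] += 1
```

As comparisons accumulate, the posterior concentrates and suboptimal pairs are sampled less often, which is the mechanism the abstract credits for reducing comparisons between suboptimal rankers inside each batch.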