32 research outputs found
Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem
This paper proposes a new method for the K-armed dueling bandit problem, a
variation on the regular K-armed bandit problem that offers only relative
feedback about pairs of arms. Our approach extends the Upper Confidence Bound
algorithm to the relative setting by using estimates of the pairwise
probabilities to select a promising arm and applying Upper Confidence Bound
with the winner as a benchmark. We prove a finite-time regret bound of order
O(log t). In addition, our empirical results using real data from an
information retrieval application show that it greatly outperforms the state of
the art.Comment: 13 pages, 6 figure
Factored Bandits
We introduce the factored bandits model, which is a framework for learning
with limited (bandit) feedback, where actions can be decomposed into a
Cartesian product of atomic actions. Factored bandits incorporate rank-1
bandits as a special case, but significantly relax the assumptions on the form
of the reward function. We provide an anytime algorithm for stochastic factored
bandits and up to constants matching upper and lower regret bounds for the
problem. Furthermore, we show that with a slight modification the proposed
algorithm can be applied to utility based dueling bandits. We obtain an
improvement in the additive terms of the regret bound compared to state of the
art algorithms (the additive terms are dominating up to time horizons which are
exponential in the number of arms)
Preference-Based Monte Carlo Tree Search
Monte Carlo tree search (MCTS) is a popular choice for solving sequential
anytime problems. However, it depends on a numeric feedback signal, which can
be difficult to define. Real-time MCTS is a variant which may only rarely
encounter states with an explicit, extrinsic reward. To deal with such cases,
the experimenter has to supply an additional numeric feedback signal in the
form of a heuristic, which intrinsically guides the agent. Recent work has
shown evidence that in different areas the underlying structure is ordinal and
not numerical. Hence erroneous and biased heuristics are inevitable, especially
in such domains. In this paper, we propose a MCTS variant which only depends on
qualitative feedback, and therefore opens up new applications for MCTS. We also
find indications that translating absolute into ordinal feedback may be
beneficial. Using a puzzle domain, we show that our preference-based MCTS
variant, wich only receives qualitative feedback, is able to reach a
performance level comparable to a regular MCTS baseline, which obtains
quantitative feedback.Comment: To be publishe
Correlational Dueling Bandits with Application to Clinical Treatment in Large Decision Spaces
We consider sequential decision making under uncertainty, where the goal is to optimize over a large decision space using noisy comparative feedback. This problem can be formulated as a K-armed Dueling Bandits problem where K is the total number of decisions. When K is very large, existing dueling bandits algorithms suffer huge cumulative regret before converging on the optimal arm. This paper studies the dueling bandits problem with a large number of arms that exhibit a low-dimensional correlation structure. Our problem is motivated by a clinical decision making process in large decision space. We propose an efficient algorithm CorrDuel which optimizes the exploration/exploitation tradeoff in this large decision space of clinical treatments. More broadly, our approach can be applied to other sequential decision problems with large and structured decision spaces. We derive regret bounds, and evaluate performance in simulation experiments as well as on a live clinical trial of therapeutic spinal cord stimulation. To our knowledge, this marks the first time an online learning algorithm was applied towards spinal cord injury treatments. Our experimental results show the effectiveness and efficiency of our approach
A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits
We study the K-armed dueling bandit problem which is a variation of the
classical Multi-Armed Bandit (MAB) problem in which the learner receives only
relative feedback about the selected pairs of arms. We propose a new algorithm
called Relative Exponential-weight algorithm for Exploration and Exploitation
(REX3) to handle the adversarial utility-based formulation of this problem.
This algorithm is a non-trivial extension of the Exponential-weight algorithm
for Exploration and Exploitation (EXP3) algorithm. We prove a finite time
expected regret upper bound of order O(sqrt(K ln(K)T)) for this algorithm and a
general lower bound of order omega(sqrt(KT)). At the end, we provide
experimental results using real data from information retrieval applications