123 research outputs found
Reducing Dueling Bandits to Cardinal Bandits
We present algorithms for reducing the Dueling Bandits problem to the
conventional (stochastic) Multi-Armed Bandits problem. The Dueling Bandits
problem is an online model of learning with ordinal feedback of the form "A is
preferred to B" (as opposed to cardinal feedback like "A has value 2.5"),
giving it wide applicability in learning from implicit user feedback and
revealed and stated preferences. In contrast to existing algorithms for the
Dueling Bandits problem, our reductions -- named \Doubler, \MultiSbm and
\DoubleSbm -- provide a generic schema for translating the extensive body of
known results about conventional Multi-Armed Bandit algorithms to the Dueling
Bandits setting. For \Doubler and \MultiSbm we prove regret upper bounds in
both finite and infinite settings, and conjecture about the performance of
\DoubleSbm which empirically outperforms the other two as well as previous
algorithms in our experiments. In addition, we provide the first almost optimal
regret bound in terms of second order terms, such as the differences between
the values of the arms
Factored Bandits
We introduce the factored bandits model, which is a framework for learning
with limited (bandit) feedback, where actions can be decomposed into a
Cartesian product of atomic actions. Factored bandits incorporate rank-1
bandits as a special case, but significantly relax the assumptions on the form
of the reward function. We provide an anytime algorithm for stochastic factored
bandits and up to constants matching upper and lower regret bounds for the
problem. Furthermore, we show that with a slight modification the proposed
algorithm can be applied to utility based dueling bandits. We obtain an
improvement in the additive terms of the regret bound compared to state of the
art algorithms (the additive terms are dominating up to time horizons which are
exponential in the number of arms)
Decoy Bandits Dueling on a Poset
We adress the problem of dueling bandits defined on partially ordered sets,
or posets. In this setting, arms may not be comparable, and there may be
several (incomparable) optimal arms. We propose an algorithm, UnchainedBandits,
that efficiently finds the set of optimal arms of any poset even when pairs of
comparable arms cannot be distinguished from pairs of incomparable arms, with a
set of minimal assumptions. This algorithm relies on the concept of decoys,
which stems from social psychology. For the easier case where the
incomparability information may be accessible, we propose a second algorithm,
SlicingBandits, which takes advantage of this information and achieves a very
significant gain of performance compared to UnchainedBandits. We provide
theoretical guarantees and experimental evaluation for both algorithms
- …