
    Borda Regret Minimization for Generalized Linear Dueling Bandits

    Dueling bandits are widely used to model the preferential feedback prevalent in many applications, such as recommendation systems and ranking. In this paper, we study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score while minimizing the cumulative regret. We propose a rich class of generalized linear dueling bandit models, which cover many existing models. We first prove a regret lower bound of order $\Omega(d^{2/3} T^{2/3})$ for the Borda regret minimization problem, where $d$ is the dimension of the contextual vectors and $T$ is the time horizon. To match this lower bound, we propose an explore-then-commit type algorithm for the stochastic setting, which has a nearly matching regret upper bound $\tilde{O}(d^{2/3} T^{2/3})$. We also propose an EXP3-type algorithm for the adversarial linear setting, where the underlying model parameter can change at each round. Our algorithm achieves an $\tilde{O}(d^{2/3} T^{2/3})$ regret, which is also optimal. Empirical evaluations on both synthetic data and a simulated real-world environment are conducted to corroborate our theoretical analysis. Comment: 33 pages, 5 figures. This version includes new results for dueling bandits in the adversarial setting.
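    A minimal sketch of the explore-then-commit idea: duel uniformly random pairs for roughly $T^{2/3}$ rounds, estimate the pairwise win probabilities, and commit to the item with the highest empirical Borda score. The `duel` callback, the uniform pair sampling, and all constants below are illustrative assumptions; the paper's algorithm additionally exploits the generalized linear structure in dimension $d$.

```python
import numpy as np

def borda_etc(duel, K, T):
    """Explore-then-commit sketch for Borda regret minimization.

    `duel(i, j)` is an assumed environment callback returning 1 if item i
    beats item j, else 0. The T^(2/3) exploration length mirrors the
    d^(2/3) T^(2/3) rate in the abstract; constants are ad hoc.
    """
    n_explore = min(T, int(T ** (2 / 3)))
    wins = np.zeros((K, K))
    counts = np.ones((K, K))  # start at 1 to avoid division by zero (small bias)
    for _ in range(n_explore):
        i, j = np.random.choice(K, size=2, replace=False)
        outcome = duel(i, j)
        wins[i, j] += outcome
        wins[j, i] += 1 - outcome
        counts[i, j] += 1
        counts[j, i] += 1
    # empirical Borda score: average probability of beating the other items
    # (the never-updated diagonal adds the same offset to every item)
    borda = (wins / counts).sum(axis=1) / (K - 1)
    best = int(np.argmax(borda))
    for _ in range(T - n_explore):
        duel(best, best)  # commit: play the pair (best, best) for the rest
    return best
```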

    Versatile Dueling Bandits: Best-of-both-World Analyses for Online Learning from Preferences

    We study the $K$-armed dueling bandit problem for both stochastic and adversarial environments, where the goal of the learner is to aggregate information through relative preferences over pairs of decision points queried in an online sequential manner. We first propose a novel reduction from (general) dueling bandits to multi-armed bandits which, despite its simplicity, allows us to improve many existing results in dueling bandits. In particular, \emph{we give the first best-of-both-worlds result for the dueling bandits regret minimization problem}: a unified framework that is guaranteed to perform optimally for both stochastic and adversarial preferences simultaneously. Moreover, our algorithm is also the first to achieve an optimal $O(\sum_{i=1}^K \frac{\log T}{\Delta_i})$ regret bound against the Condorcet-winner benchmark, which scales optimally both in terms of the arm size $K$ and the instance-specific suboptimality gaps $\{\Delta_i\}_{i=1}^K$. This resolves the long-standing problem of designing an instance-wise, gap-dependent, order-optimal regret algorithm for dueling bandits (with matching lower bounds up to small constant factors). We further justify the robustness of our proposed algorithm by proving its optimal regret rate under adversarially corrupted preferences; this outperforms the existing state-of-the-art corrupted dueling results by a large margin. In summary, we believe our reduction idea will find a broader scope in solving a diverse class of dueling bandit settings, which are otherwise studied separately from multi-armed bandits, often with more complex solutions and worse guarantees. The efficacy of our proposed algorithms is empirically corroborated against existing dueling bandit methods.
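    As a rough illustration of the dueling-to-MAB reduction, the sketch below draws both arms of each duel i.i.d. from a single EXP3-style distribution and feeds each queried arm an importance-weighted reward derived from the relative outcome. The `duel` callback and the EXP3 base learner are assumptions made for illustration; the paper's best-of-both-worlds guarantee rests on its own reduction and choice of base algorithm.

```python
import numpy as np

def dueling_via_mab(duel, K, T, eta=0.05):
    """Sketch: reduce dueling bandits to a single multi-armed bandit.

    Two arms are drawn i.i.d. (possibly equal) from one EXP3-style
    distribution; `duel(i, j)` is an assumed callback returning 1 when
    arm i is preferred over arm j.
    """
    log_w = np.zeros(K)
    for _ in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        i, j = np.random.choice(K, size=2, p=p)  # two i.i.d. draws
        o = duel(i, j)
        # importance-weighted reward estimates for the two queried arms
        log_w[i] += eta * o / p[i]
        log_w[j] += eta * (1 - o) / p[j]
    return int(np.argmax(log_w))
```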

    Factored Bandits

    We introduce the factored bandits model, a framework for learning with limited (bandit) feedback where actions can be decomposed into a Cartesian product of atomic actions. Factored bandits incorporate rank-1 bandits as a special case, but significantly relax the assumptions on the form of the reward function. We provide an anytime algorithm for stochastic factored bandits, together with upper and lower regret bounds for the problem that match up to constants. Furthermore, we show that with a slight modification the proposed algorithm can be applied to utility-based dueling bandits. We obtain an improvement in the additive terms of the regret bound compared to state-of-the-art algorithms (the additive terms dominate up to time horizons that are exponential in the number of arms).
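    To picture the factored action structure, the toy loop below treats each action as a tuple of atomic actions and prunes each factor's candidates from rewards observed while the other factors are sampled uniformly. `pull`, the phase lengths, and the elimination tolerance are hypothetical; this only illustrates the Cartesian-product structure, not the paper's anytime algorithm or its regret analysis.

```python
import numpy as np

def factored_elimination(pull, factor_sizes, pulls_per_arm=100, n_phases=10):
    """Toy per-factor elimination over a Cartesian product of atomic actions.

    `pull(action)` is an assumed callback returning a stochastic reward for
    an action tuple with one atomic action per factor.
    """
    candidates = [list(range(s)) for s in factor_sizes]
    for _ in range(n_phases):
        for k, cand in enumerate(candidates):
            if len(cand) == 1:
                continue  # this factor is already decided
            means = np.zeros(len(cand))
            for idx, a in enumerate(cand):
                for _ in range(pulls_per_arm):
                    # fix factor k to `a`, sample the other factors uniformly
                    action = tuple(
                        a if m == k else int(np.random.choice(candidates[m]))
                        for m in range(len(factor_sizes))
                    )
                    means[idx] += pull(action)
            means /= pulls_per_arm
            # keep atomic actions within a crude confidence tolerance of the best
            tol = 2.0 / np.sqrt(pulls_per_arm)
            candidates[k] = [a for a, m in zip(cand, means) if m >= means.max() - tol]
    return tuple(c[0] for c in candidates)  # one surviving choice per factor
```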

    A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits

    We study the $K$-armed dueling bandit problem, a variation of the classical Multi-Armed Bandit (MAB) problem in which the learner receives only relative feedback about the selected pairs of arms. We propose a new algorithm called the Relative Exponential-weight algorithm for Exploration and Exploitation (REX3) to handle the adversarial utility-based formulation of this problem. This algorithm is a non-trivial extension of the Exponential-weight algorithm for Exploration and Exploitation (EXP3). We prove a finite-time expected regret upper bound of order $O(\sqrt{K \ln(K) T})$ for this algorithm and a general lower bound of order $\Omega(\sqrt{KT})$. Finally, we provide experimental results using real data from information retrieval applications.
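    In the spirit of REX3, the sketch below runs exponential weights where both arms are drawn from one exploration-mixed distribution and each queried arm's weight is adjusted by an importance-weighted, centered relative gain. The `duel` callback, the estimator, and the constants are simplifications of the paper's actual update.

```python
import numpy as np

def rex3_sketch(duel, K, T, gamma=0.1):
    """Relative exponential weighting, simplified in the spirit of REX3.

    `duel(a, b)` is an assumed callback returning 1 if arm a beats arm b.
    """
    w = np.ones(K)
    for _ in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / K  # mix in uniform exploration
        a, b = np.random.choice(K, size=2, p=p)
        o = duel(a, b)
        # centered relative gains: the winner's weight rises, the loser's falls
        w[a] *= np.exp(gamma / K * (o - 0.5) / p[a])
        w[b] *= np.exp(gamma / K * ((1 - o) - 0.5) / p[b])
    return w / w.sum()  # final sampling distribution over the arms
```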

    Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem

    This paper proposes a new method for the $K$-armed dueling bandit problem, a variation on the regular $K$-armed bandit problem that offers only relative feedback about pairs of arms. Our approach extends the Upper Confidence Bound algorithm to the relative setting by using estimates of the pairwise probabilities to select a promising arm and applying Upper Confidence Bound with the winner as a benchmark. We prove a finite-time regret bound of order $O(\log t)$. In addition, our empirical results using real data from an information retrieval application show that it greatly outperforms the state of the art. Comment: 13 pages, 6 figures.
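    The pairwise-UCB idea translates naturally into code: keep pairwise win counts, form optimistic estimates of every preference probability, pick a champion that optimistically beats all other arms, and duel it against its strongest optimistic challenger. The sketch below simplifies the paper's bookkeeping and tie-breaking; `duel` is an assumed callback returning 1 when the first arm wins.

```python
import numpy as np

def rucb_sketch(duel, K, T, alpha=0.51):
    """Simplified Relative-UCB-style loop over pairwise win counts."""
    wins = np.zeros((K, K))
    for t in range(1, T + 1):
        n = wins + wins.T  # number of duels per pair
        with np.errstate(divide="ignore", invalid="ignore"):
            u = wins / n + np.sqrt(alpha * np.log(t) / n)
        u[np.isnan(u)] = 1.0      # unexplored pairs are fully optimistic
        np.fill_diagonal(u, 0.5)
        # champion: an arm that optimistically beats every other arm
        champs = np.where((u >= 0.5).all(axis=1))[0]
        c = int(np.random.choice(champs)) if len(champs) else np.random.randint(K)
        col = u[:, c].copy()
        col[c] = -np.inf          # challenger must differ from the champion
        d = int(np.argmax(col))   # strongest optimistic challenger of c
        o = duel(c, d)
        wins[c, d] += o
        wins[d, c] += 1 - o
    n = np.maximum(wins + wins.T, 1)
    return int(np.argmax((wins / n).mean(axis=1)))  # best empirical win rate
```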
