
    Kernelized Offline Contextual Dueling Bandits

    Preference-based feedback is important for many applications where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback on large language models. For many of these applications, the cost of acquiring the human feedback can be substantial or even prohibitive. In this work, we take advantage of the fact that often the agent can choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and introduce the offline contextual dueling bandit setting. We give an upper-confidence-bound style algorithm for this setting and prove a regret bound. We also give empirical confirmation that this method outperforms a similar strategy that uses uniformly sampled contexts.
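    The abstract does not spell out the algorithm, but the core idea of an upper-confidence-bound style rule for choosing where to query preferences can be illustrated. Below is a minimal sketch, not the paper's method: it assumes an RBF kernel and a kernel-ridge posterior over duel outcomes, and picks the candidate context with the widest confidence interval. All names (`rbf`, `confidence_width`, the data shapes) are hypothetical.

```python
# Hedged sketch: uncertainty-driven selection of the next context to query,
# assuming a kernelized (RBF) model of preference feedback. Not the paper's
# exact algorithm.
import numpy as np

def rbf(X, Y, ls=1.0):
    """RBF kernel matrix between rows of X and rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def confidence_width(X_seen, X_cand, lam=1.0):
    """Posterior std of a kernel-ridge model at candidate points:
    a large width means a duel queried there is most informative."""
    K = rbf(X_seen, X_seen) + lam * np.eye(len(X_seen))
    k = rbf(X_cand, X_seen)
    prior = rbf(X_cand, X_cand).diagonal()
    post = np.einsum("ij,jk,ik->i", k, np.linalg.inv(K), k)
    return np.sqrt(np.maximum(prior - post, 0.0))

rng = np.random.default_rng(0)
X_seen = rng.normal(size=(20, 3))   # (context, arm-pair) features queried so far
X_pool = rng.normal(size=(100, 3))  # candidate contexts we could query next
next_idx = int(np.argmax(confidence_width(X_seen, X_pool)))
```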

    Borda Regret Minimization for Generalized Linear Dueling Bandits

    Dueling bandits are widely used to model preferential feedback prevalent in many applications such as recommendation systems and ranking. In this paper, we study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score while minimizing the cumulative regret. We propose a rich class of generalized linear dueling bandit models, which cover many existing models. We first prove a regret lower bound of order $\Omega(d^{2/3} T^{2/3})$ for the Borda regret minimization problem, where $d$ is the dimension of contextual vectors and $T$ is the time horizon. To attain this lower bound, we propose an explore-then-commit type algorithm for the stochastic setting, which has a nearly matching regret upper bound $\tilde{O}(d^{2/3} T^{2/3})$. We also propose an EXP3-type algorithm for the adversarial linear setting, where the underlying model parameter can change at each round. Our algorithm achieves an $\tilde{O}(d^{2/3} T^{2/3})$ regret, which is also optimal. Empirical evaluations on both synthetic data and a simulated real-world environment are conducted to corroborate our theoretical analysis.
    Comment: 33 pages, 5 figures. This version includes new results for dueling bandits in the adversarial setting.
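    To make the explore-then-commit structure concrete, here is a minimal generic sketch, not the paper's generalized linear algorithm: the Borda score of an item is its probability of beating a uniformly random opponent, so uniform dueling in the explore phase yields an unbiased estimate, after which the learner commits to the empirical winner. The `duel(i, j)` callback is an assumed environment hook returning 1 if item i beats item j.

```python
# Hedged sketch of explore-then-commit for Borda score maximization.
import numpy as np

def etc_borda(duel, K, T, explore_rounds):
    wins = np.zeros(K)
    plays = np.zeros(K)
    rng = np.random.default_rng(0)
    for _ in range(explore_rounds):
        i, j = rng.integers(K), rng.integers(K)  # uniform pair => Borda estimate
        wins[i] += duel(i, j)
        plays[i] += 1
    borda_hat = wins / np.maximum(plays, 1)      # empirical Borda scores
    best = int(np.argmax(borda_hat))
    for _ in range(T - explore_rounds):          # commit: play (best, best),
        duel(best, best)                         # which has zero Borda regret
    return best
```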

    Calibrated Fairness in Bandits

    We study fairness within the stochastic multi-armed bandit (MAB) decision making framework. We adapt the fairness framework of "treating similar individuals similarly" to this setting. Here, an "individual" corresponds to an arm and two arms are "similar" if they have a similar quality distribution. First, we adopt a smoothness constraint that if two arms have a similar quality distribution then the probability of selecting each arm should be similar. In addition, we define the fairness regret, which corresponds to the degree to which an algorithm is not calibrated, where perfect calibration requires that the probability of selecting an arm is equal to the probability with which the arm has the best quality realization. We show that a variation on Thompson sampling satisfies smooth fairness for total variation distance, and give an $\tilde{O}((kT)^{2/3})$ bound on fairness regret. This complements prior work, which protects an on-average better arm from being less favored. We also explain how to extend our algorithm to the dueling bandit setting.
    Comment: To be presented at the FAT-ML'17 workshop.
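    The calibration property described here is easiest to see in plain Bernoulli Thompson sampling, where an arm is pulled with exactly the posterior probability that it is the best. The sketch below shows that base case only; the paper's variant additionally enforces the smoothness constraint, which this plain version omits.

```python
# Hedged sketch: Bernoulli Thompson sampling with Beta(1, 1) priors.
# Selection probabilities are calibrated to the posterior probability
# that each arm is best.
import numpy as np

def thompson_step(successes, failures, rng):
    theta = rng.beta(successes + 1, failures + 1)  # one posterior sample per arm
    return int(np.argmax(theta))                   # pull the sampled best arm

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.5, 0.7])                 # hypothetical arm qualities
s = np.zeros(3)
f = np.zeros(3)
for _ in range(1000):
    a = thompson_step(s, f, rng)
    reward = rng.random() < true_p[a]
    s[a] += reward
    f[a] += 1 - reward
```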

    A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits

    We study the K-armed dueling bandit problem, a variation of the classical Multi-Armed Bandit (MAB) problem in which the learner receives only relative feedback about the selected pairs of arms. We propose a new algorithm called the Relative Exponential-weight algorithm for Exploration and Exploitation (REX3) to handle the adversarial utility-based formulation of this problem. This algorithm is a non-trivial extension of the Exponential-weight algorithm for Exploration and Exploitation (EXP3). We prove a finite-time expected regret upper bound of order $O(\sqrt{K \ln(K)\, T})$ for this algorithm and a general lower bound of order $\Omega(\sqrt{KT})$. Finally, we provide experimental results using real data from information retrieval applications.
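    The extension from EXP3 to relative feedback can be sketched briefly. The version below is loosely in the spirit of REX3 but is not its exact update: two arms are drawn independently from the exponential-weights distribution, and the duel outcome becomes an importance-weighted gain for one arm and a loss for the other. The `duel(a, b)` callback is an assumed environment hook returning 1 if arm a wins, 0 otherwise (0.5 when a == b).

```python
# Hedged sketch of an EXP3-style algorithm with relative (dueling) feedback.
import numpy as np

def rex3_like(duel, K, T, gamma=0.1):
    w = np.ones(K)
    rng = np.random.default_rng(0)
    for _ in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / K  # mix in uniform exploration
        a, b = rng.choice(K, size=2, p=p)          # two independent draws
        psi = duel(a, b) - 0.5                     # relative feedback in [-1/2, 1/2]
        g = np.zeros(K)
        g[a] += psi / p[a]                         # importance-weighted gain
        g[b] -= psi / p[b]                         # symmetric loss for the loser
        w *= np.exp(gamma * g / (2 * K))
    return w / w.sum()
```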