One Arrow, Two Kills: A Unified Framework for Achieving Optimal Regret Guarantees in Sleeping Bandits
We address the problem of \emph{`Internal Regret'} in \emph{Sleeping Bandits}
in the fully adversarial setup, and draw connections between different
existing notions of sleeping regret in the multi-armed bandits (MAB) literature,
analyzing their implications. Our first contribution is to propose
the new notion of \emph{Internal Regret} for sleeping MAB. We then propose an
algorithm that yields sublinear regret in that measure, even for a completely
adversarial sequence of losses and availabilities. We further show that low
sleeping internal regret always implies low external regret, as well as
low policy regret for i.i.d. sequences of losses. The main contribution of this
work lies precisely in unifying the different existing notions of regret in
sleeping bandits and understanding what each implies about the others. Finally, we
also extend our results to the setting of \emph{Dueling Bandits} (DB) -- a
preference-feedback variant of MAB -- and propose a reduction-to-MAB idea to
design a low-regret algorithm for sleeping dueling bandits with stochastic
preferences and adversarial availabilities. The efficacy of our algorithms is
justified through empirical evaluations.
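To make the sleeping-bandit setting concrete, here is a minimal, hypothetical sketch (not the paper's algorithm): an EXP3-style learner whose sampling distribution is renormalized over the awake arms each round, which is the basic device behind achieving sublinear regret under adversarial losses and availabilities. All names and parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sleeping_exp3(losses, avail, eta=0.1):
    """EXP3-style play restricted to the arms awake each round.

    losses: (T, K) loss matrix with entries in [0, 1] (may be adversarial).
    avail:  (T, K) boolean availability mask (the 'sleeping' part).
    Returns the list of arms played.
    """
    T, K = losses.shape
    w = np.ones(K)
    played = []
    for t in range(T):
        p = np.where(avail[t], w, 0.0)
        p = p / p.sum()                      # renormalize over awake arms only
        arm = rng.choice(K, p=p)
        played.append(arm)
        # importance-weighted loss estimate: update only the played arm
        w[arm] *= np.exp(-eta * losses[t, arm] / p[arm])
    return played
```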
Faster Convergence with Multiway Preferences
We address the problem of convex optimization with preference feedback, where
the goal is to minimize a convex function given a weaker form of comparison
queries: each query consists of two points, and the dueling feedback returns a
(noisy) single-bit comparison of the function values at the two queried
points. We consider the sign-function-based comparison feedback model and
analyze the convergence rates with batched and multiway (argmin of a set of
queried points) comparisons. Our main goal is to understand the improved
convergence rates owing to parallelization in sign-feedback-based optimization
problems. Our work is the first to study convex optimization
with multiway preferences and analyze the optimal convergence rates. Our first
contribution lies in designing efficient algorithms for batched
preference feedback, where the learner can query multiple pairs in parallel. We next
study multiway comparison (`battling') feedback, where the learner
gets to see the argmin feedback over a subset of queried points, and establish the
corresponding convergence rate. We show further improved convergence rates under an additional assumption
of strong convexity. Finally, we also study convergence lower bounds for
batched-preference and multiway-feedback optimization, showing the optimality
of our convergence rates.
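To illustrate the sign-feedback model, here is a hedged sketch of comparison-based convex optimization: a random unit direction is drawn, the 1-bit oracle compares the function at two symmetric perturbations, and the iterate steps opposite the sign. This is a normalized-gradient-style illustration under a noiseless oracle, not the batched or multiway algorithms analyzed above; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def compare(f, x, y):
    """1-bit dueling oracle: +1 if f(x) > f(y), else -1 (noiseless here)."""
    return 1 if f(x) > f(y) else -1

def sign_comparison_descent(f, x0, steps=2000, delta=1e-3, eta=0.05):
    """Minimize a convex f from comparison feedback only (a sketch).

    Each round draws a random unit direction u, asks the oracle whether
    f(x + delta*u) exceeds f(x - delta*u), and steps opposite the sign.
    """
    x = np.array(x0, dtype=float)
    for t in range(steps):
        u = rng.normal(size=x.shape)
        u /= np.linalg.norm(u)
        s = compare(f, x + delta * u, x - delta * u)
        x -= eta / np.sqrt(t + 1) * s * u    # decaying step size
    return x
```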
Dueling Bandits with Adversarial Sleeping
We introduce the problem of sleeping dueling bandits with stochastic
preferences and adversarial availabilities (DB-SPAA). In almost all dueling
bandit applications, the decision space changes over time; e.g., retail
store management, online shopping, restaurant recommendation, search engine
optimization, etc. Surprisingly, this `sleeping aspect' of dueling bandits has
never been studied in the literature. As in dueling bandits, the goal is to
compete with the best arm by sequentially querying preference feedback on
item pairs. The non-triviality, however, arises from the non-stationary item
spaces, which allow arbitrary subsets of items to become unavailable in every round.
The goal is to find an optimal `no-regret' policy that can identify the best
available item at each round, as opposed to the standard `fixed best-arm regret
objective' of dueling bandits. We first derive an instance-specific lower bound
for DB-SPAA of $\Omega\big(\sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \frac{\log T}{\Delta(i,j)}\big)$, where $K$ is the number of items and $\Delta(i,j)$ is the
gap between items $i$ and $j$. This indicates that the sleeping problem with
preference feedback is inherently more difficult than its classical
multi-armed bandit (MAB) counterpart. We then propose two algorithms with near-optimal
regret guarantees. Our results are corroborated empirically.
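As a small accounting sketch of the regret objective above (hypothetical; the convention that items are indexed by strength, with item 0 best, is our assumption, not the paper's notation):

```python
import numpy as np

def sleeping_dueling_regret(P, avail, plays):
    """Cumulative regret against the best *available* item each round.

    P:     (K, K) preference matrix, P[i, j] = Pr(item i beats item j);
           items are assumed ordered by strength, so item 0 is best overall.
    avail: (T, K) boolean availability mask (may be adversarial).
    plays: list of (i, j) item pairs queried by the learner each round.
    """
    regret = 0.0
    for t, (i, j) in enumerate(plays):
        best = int(np.flatnonzero(avail[t]).min())   # best awake item
        # average gap of the played pair to the best available item
        regret += 0.5 * ((P[best, i] - 0.5) + (P[best, j] - 0.5))
    return regret
```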
Versatile Dueling Bandits: Best-of-both-World Analyses for Online Learning from Preferences
We study the problem of $K$-armed dueling bandits for both stochastic and adversarial environments, where the goal of the learner is to aggregate information through relative preferences over pairs of decision points queried in an online sequential manner. We first propose a novel reduction from any (general) dueling bandit problem to multi-armed bandits, and despite its simplicity, it allows us to improve many existing results in dueling bandits. In particular, \emph{we give the first best-of-both-worlds result for the dueling bandits regret minimization problem} -- a unified framework that is guaranteed to perform optimally for both stochastic and adversarial preferences simultaneously. Moreover, our algorithm is also the first to achieve an optimal regret bound against the Condorcet-winner benchmark, scaling optimally both in the number of arms and the instance-specific suboptimality gaps. This resolves the long-standing problem of designing an instance-wise, gap-dependent, order-optimal regret algorithm for dueling bandits (with matching lower bounds up to small constant factors). We further justify the robustness of our proposed algorithm by proving its optimal regret rate under adversarially corrupted preferences -- this outperforms the existing state-of-the-art corrupted dueling results by a large margin. In summary, we believe our reduction idea will find broader scope in solving a diverse class of dueling bandit settings, which are otherwise studied separately from multi-armed bandits, often with more complex solutions and worse guarantees. The efficacy of our proposed algorithms is empirically corroborated against existing dueling bandit methods.
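The reduction idea can be caricatured in a few lines. The sketch below runs a single EXP3-style MAB learner, draws both duel arms from its distribution, and feeds the loser an importance-weighted loss; this mirrors the spirit of a dueling-to-MAB reduction, though the paper's actual construction and tuning differ, and all parameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def duel_via_mab(P, T=3000, eta=0.05):
    """Dueling bandits through one EXP3-style MAB learner (a sketch).

    P: (K, K) stochastic preference matrix, P[i, j] = Pr(i beats j).
    Both duel arms are drawn i.i.d. from the MAB distribution; the loser
    receives an importance-weighted loss of 1. Returns the apparent winner.
    """
    K = P.shape[0]
    w = np.ones(K)
    for _ in range(T):
        p = w / w.sum()
        i, j = rng.choice(K, p=p), rng.choice(K, p=p)
        if i == j:
            continue                          # self-duel is uninformative
        loser = j if rng.random() < P[i, j] else i
        w[loser] *= np.exp(-eta / p[loser])   # importance-weighted loss of 1
    return int(np.argmax(w))
```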
Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons
We provide a theoretical framework for Reinforcement Learning with Human
Feedback (RLHF). Our analysis shows that when the true reward function is
linear, the widely used maximum likelihood estimator (MLE) converges under both
the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However,
we show that when training a policy based on the learned reward model, the MLE
fails, while a pessimistic MLE provides policies with improved performance under
certain coverage assumptions. Additionally, we demonstrate that under the PL
model, both the true MLE and an alternative MLE that splits the $K$-wise comparison
into pairwise comparisons converge. Moreover, the true MLE is
asymptotically more efficient. Our results validate the empirical success of
existing RLHF algorithms in InstructGPT and provide new insights for algorithm
design. Furthermore, our results unify the problem of RLHF and max-entropy
Inverse Reinforcement Learning (IRL), and provide the first sample complexity
bound for max-entropy IRL.
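For intuition, the pairwise BTL portion of the analysis can be sketched as a tiny logistic-regression-style MLE with a linear reward; the gradient-ascent settings and feature setup below are illustrative assumptions, not the paper's estimator or guarantees.

```python
import numpy as np

rng = np.random.default_rng(3)

def btl_mle(features, comps, iters=500, lr=0.5):
    """MLE of a linear reward under the Bradley-Terry-Luce model (a sketch).

    features: (N, d) feature vectors phi(x) for the N items.
    comps:    list of (i, j) pairs meaning "item i was preferred over j".
    Maximizes sum log sigmoid(theta . (phi_i - phi_j)) by gradient ascent.
    """
    diffs = np.array([features[i] - features[j] for i, j in comps])
    theta = np.zeros(features.shape[1])
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(-diffs @ theta))         # modeled win probs
        theta += lr * diffs.T @ (1.0 - s) / len(comps)   # avg. log-lik gradient
    return theta
```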