Learning Contextual Bandits in a Non-stationary Environment
Multi-armed bandit algorithms have become a reference solution for handling
the explore/exploit dilemma in recommender systems, and many other important
real-world problems, such as display advertisement. However, such algorithms
usually assume a stationary reward distribution, which hardly holds in practice
as users' preferences are dynamic. As a result, a recommender system built on
such algorithms performs consistently below optimum. In this paper, we consider
the situation where the underlying distribution of reward remains unchanged over
(possibly short) epochs and shifts at unknown time instants. For this setting,
we propose a contextual bandit algorithm that detects possible changes of
environment based on its reward estimation confidence and updates its arm
selection strategy accordingly. A rigorous regret upper-bound analysis of the
proposed algorithm demonstrates its learning effectiveness in such a non-trivial
environment.
Extensive empirical evaluations on both synthetic and real-world datasets for
recommendation confirm its practical utility in a changing environment.
Comment: 10 pages, 13 figures, to appear at the ACM Special Interest Group on
Information Retrieval (SIGIR) 201
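The idea of flagging a change from reward-estimate confidence can be illustrated with a minimal, context-free sketch. This is not the algorithm proposed in the paper, which is contextual; it only compares the mean of a recent window of rewards with the mean of the older rewards and flags a change when the gap exceeds the sum of two Hoeffding-style confidence radii. The class name `ChangeAwareMean`, the parameters `window` and `delta`, and the assumption that rewards lie in [0, 1] are all illustrative.

```python
import math
from collections import deque


def hoeffding_radius(n, delta):
    """Confidence radius for the mean of n rewards bounded in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


class ChangeAwareMean:
    """Tracks one arm's rewards and flags a suspected distribution shift.

    Hypothetical helper for illustration only; rewards are assumed in [0, 1].
    """

    def __init__(self, window=20, delta=0.05):
        self.window = window
        self.delta = delta
        self.reset()

    def reset(self):
        self.total = 0.0
        self.count = 0
        self.recent = deque(maxlen=self.window)

    def update(self, reward):
        """Record one reward; return True if a change was flagged."""
        self.recent.append(reward)
        self.total += reward
        self.count += 1
        # Wait until both the old part and the recent window are large enough.
        if self.count < 2 * self.window:
            return False
        recent_sum = sum(self.recent)
        old_n = self.count - self.window
        old_mean = (self.total - recent_sum) / old_n
        new_mean = recent_sum / self.window
        gap = (hoeffding_radius(old_n, self.delta)
               + hoeffding_radius(self.window, self.delta))
        if abs(new_mean - old_mean) > gap:
            # Discard stale statistics so the estimate restarts after the shift.
            self.reset()
            return True
        return False
```

Feeding a long run of rewards equal to 1.0 followed by rewards equal to 0.0 should trigger the flag once the recent window has drifted far enough from the older mean; a full bandit would keep one such tracker per arm and restart exploration for an arm whenever its flag fires.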
Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem
This paper proposes a new method for the K-armed dueling bandit problem, a
variation on the regular K-armed bandit problem that offers only relative
feedback about pairs of arms. Our approach extends the Upper Confidence Bound
algorithm to the relative setting by using estimates of the pairwise
probabilities to select a promising arm and applying Upper Confidence Bound
with the winner as a benchmark. We prove a finite-time regret bound of order
O(log t). In addition, our empirical results using real data from an
information retrieval application show that it greatly outperforms the state of
the art.
Comment: 13 pages, 6 figures
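The selection step described in this abstract can be sketched roughly as follows. This is a simplified reading, not the authors' reference implementation: `wins[i][j]` is assumed to count how often arm i beat arm j, the exploration parameter `alpha` and its default value are assumptions, and ties are broken by arm order rather than at random.

```python
import math
import random


def ucb_matrix(wins, t, alpha=0.51):
    """Optimistic estimates u[i][j] of the probability that arm i beats arm j."""
    k = len(wins)
    u = [[0.5] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            n = wins[i][j] + wins[j][i]
            if n == 0:
                u[i][j] = 1.0  # never compared: stay fully optimistic
            else:
                u[i][j] = wins[i][j] / n + math.sqrt(alpha * math.log(t) / n)
    return u


def select_pair(wins, t, alpha=0.51, rng=random):
    """Pick a promising arm c, then the challenger d most likely to beat it."""
    k = len(wins)
    u = ucb_matrix(wins, t, alpha)
    # An arm is promising if it could plausibly beat every other arm.
    candidates = [c for c in range(k) if all(u[c][j] >= 0.5 for j in range(k))]
    c = rng.choice(candidates) if candidates else rng.randrange(k)
    # Apply the upper confidence bound with c as the benchmark.
    d = max((j for j in range(k) if j != c), key=lambda j: u[j][c])
    return c, d
```

After each duel between `c` and `d`, the caller would increment `wins[winner][loser]` and advance `t` before calling `select_pair` again.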
Copeland Dueling Bandits
A version of the dueling bandit problem is addressed in which a Condorcet
winner may not exist. Two algorithms are proposed that instead seek to minimize
regret with respect to the Copeland winner, which, unlike the Condorcet winner,
is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed
for small numbers of arms, while the second, Scalable Copeland Bandits (SCB),
works better for large-scale problems. We provide theoretical results bounding
the regret accumulated by CCB and SCB, both substantially improving existing
results. Such existing results either offer bounds of the form O(K log T)
but require restrictive assumptions, or offer bounds of the form O(K^2 log T)
without requiring such assumptions. Our results offer the best of both
worlds: O(K log T) bounds without restrictive assumptions.
Comment: 33 pages, 8 figures
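To make the distinction between the two winner notions concrete, the sketch below computes Copeland scores from a known preference matrix. It is only an illustration of the definitions, not of CCB or SCB: `p[i][j]` is assumed to be the probability that arm i beats arm j, and the function names are hypothetical.

```python
def copeland_scores(p):
    """Number of other arms that each arm beats with probability above 1/2."""
    k = len(p)
    return [sum(1 for j in range(k) if j != i and p[i][j] > 0.5)
            for i in range(k)]


def copeland_winners(p):
    """Arms with the maximal Copeland score; at least one always exists."""
    scores = copeland_scores(p)
    best = max(scores)
    return [i for i, s in enumerate(scores) if s == best]


def condorcet_winner(p):
    """The arm that beats every other arm, or None if there is no such arm."""
    k = len(p)
    for i, s in enumerate(copeland_scores(p)):
        if s == k - 1:
            return i
    return None


# Cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0.
cyclic = [[0.5, 0.6, 0.4],
          [0.4, 0.5, 0.6],
          [0.6, 0.4, 0.5]]
print(condorcet_winner(cyclic))   # None
print(copeland_winners(cyclic))   # [0, 1, 2]
```

In the cyclic example no arm beats all others, so there is no Condorcet winner, yet the set of Copeland winners is still non-empty.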
A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits
We study the K-armed dueling bandit problem, which is a variation of the
classical Multi-Armed Bandit (MAB) problem in which the learner receives only
relative feedback about the selected pairs of arms. We propose a new algorithm
called Relative Exponential-weight algorithm for Exploration and Exploitation
(REX3) to handle the adversarial utility-based formulation of this problem.
This algorithm is a non-trivial extension of the Exponential-weight algorithm
for Exploration and Exploitation (EXP3). We prove a finite-time
expected regret upper bound of order O(sqrt(K ln(K)T)) for this algorithm and a
general lower bound of order Omega(sqrt(KT)). Finally, we provide
experimental results using real data from information retrieval applications
- …