Bandit Problems with Side Observations
An extension of the traditional two-armed bandit problem is considered, in
which the decision maker has access to some side information before deciding
which arm to pull. At each time t, before making a selection, the decision
maker is able to observe a random variable X_t that provides some information
on the rewards to be obtained. The focus is on finding uniformly good rules
(that minimize the growth rate of the inferior sampling time) and on
quantifying how much the additional information helps. Various settings are
considered and for each setting, lower bounds on the achievable inferior
sampling time are developed and asymptotically optimal adaptive schemes
achieving these lower bounds are constructed.
Comment: 16 pages, 3 figures. To be published in the IEEE Transactions on Automatic Control.
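The abstract specifies the interaction protocol but not the schemes themselves, so the following is only a minimal simulation sketch of the setting in Python: the side information X_t is taken to indicate which of two reward regimes is active, a simple forced-exploration rule (not the paper's asymptotically optimal scheme) picks the arm, and the inferior sampling time, i.e. the number of pulls of the regime-suboptimal arm, is tracked. The regime means and all parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: side information X_t indicates which of two reward
# regimes is active; Bernoulli arm means depend on the regime. This
# illustrates the setting only, not the paper's optimal schemes.
MEANS = {0: (0.3, 0.7), 1: (0.8, 0.2)}  # regime -> (mean of arm 0, mean of arm 1)
T = 10_000

counts = np.zeros((2, 2))   # pull counts per (regime, arm)
sums = np.zeros((2, 2))     # reward sums per (regime, arm)
inferior_pulls = 0          # pulls of the regime-suboptimal arm so far

for t in range(1, T + 1):
    x = int(rng.integers(2))          # observe side information X_t
    # Illustrative forced-exploration rule, conditioned on X_t.
    if rng.random() < min(1.0, 2.0 / t):
        arm = int(rng.integers(2))
    else:
        means_hat = sums[x] / np.maximum(counts[x], 1)
        arm = int(np.argmax(means_hat))
    reward = float(rng.random() < MEANS[x][arm])
    counts[x, arm] += 1
    sums[x, arm] += reward
    inferior_pulls += int(arm != int(np.argmax(MEANS[x])))

print(f"inferior sampling time after T={T}: {inferior_pulls}")
```

Conditioning the empirical means on X_t is the point of the setting: a rule that ignores the side information must treat the two regimes as one mixed bandit and pulls the inferior arm far more often.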
Learning Contextual Bandits in a Non-stationary Environment
Multi-armed bandit algorithms have become a reference solution for handling
the explore/exploit dilemma in recommender systems, and many other important
real-world problems, such as display advertisement. However, such algorithms
usually assume a stationary reward distribution, which hardly holds in practice
as users' preferences are dynamic. This mismatch inevitably leads a recommender
system to consistently suboptimal performance. In this paper, we consider the
situation where the underlying reward distribution remains unchanged over
(possibly short) epochs and shifts at unknown time instants. Accordingly, we
propose a contextual bandit algorithm that detects possible changes in the
environment based on its reward-estimation confidence and updates its
arm-selection strategy in response. A rigorous upper regret bound analysis
demonstrates the proposed algorithm's learning effectiveness in such a
non-trivial environment.
Extensive empirical evaluations on both synthetic and real-world datasets for
recommendation confirm its practical utility in a changing environment.
Comment: 10 pages, 13 figures. To appear at ACM Special Interest Group on Information Retrieval (SIGIR) 2018.
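The abstract names the key mechanism, detecting changes when the reward-estimation confidence is contradicted by observed rewards, but not the algorithm itself. The sketch below is a hedged Python illustration of that idea on top of a standard LinUCB learner: it flags rounds where the realized reward falls outside the model's confidence band and resets the model when violations pile up. The detector, window size, and threshold are assumptions for illustration, not the authors' construction.

```python
import numpy as np

class RestartingLinUCB:
    """Minimal sketch: LinUCB that resets itself when too many recent
    rewards fall outside its own confidence band, signalling a possible
    change point. Illustrative only."""

    def __init__(self, d, alpha=1.0, window=50, tol=0.5):
        self.d, self.alpha, self.window, self.tol = d, alpha, window, tol
        self.reset()

    def reset(self):
        self.A = np.eye(self.d)   # ridge-regularized design matrix
        self.b = np.zeros(self.d)
        self.errors = []          # recent out-of-confidence indicators

    def choose(self, arms):
        # arms: list of d-dimensional context vectors, one per candidate arm
        theta = np.linalg.solve(self.A, self.b)
        scores = [x @ theta + self.alpha * np.sqrt(x @ np.linalg.solve(self.A, x))
                  for x in arms]
        return int(np.argmax(scores))

    def update(self, x, reward):
        theta = np.linalg.solve(self.A, self.b)
        width = self.alpha * np.sqrt(x @ np.linalg.solve(self.A, x))
        # Flag the round if the realized reward leaves the confidence band.
        self.errors.append(abs(reward - x @ theta) > width)
        self.errors = self.errors[-self.window:]
        self.A += np.outer(x, x)
        self.b += reward * x
        # Too many recent violations: suspect a change point and restart.
        if len(self.errors) == self.window and np.mean(self.errors) > self.tol:
            self.reset()
```

A practical variant would keep several candidate models alive and route decisions to the currently best-trusted one, rather than discarding all history on every restart.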
Explore no more: Improved high-probability regret bounds for non-stochastic bandits
This work addresses the problem of regret minimization in non-stochastic
multi-armed bandit problems, focusing on performance guarantees that hold with
high probability. Such results are rather scarce in the literature since
proving them requires a great deal of technical effort and significant
modifications to the standard, more intuitive algorithms, which come only with
guarantees that hold in expectation. One of these modifications is forcing the
learner to sample arms from the uniform distribution at least Ω(√T)
times over T rounds, which can adversely affect
performance if many of the arms are suboptimal. While it is widely conjectured
that this property is essential for proving high-probability regret bounds, we
show in this paper that it is possible to achieve such strong results without
this undesirable exploration component. Our result relies on a simple and
intuitive loss-estimation strategy called Implicit eXploration (IX) that allows
a remarkably clean analysis. To demonstrate the flexibility of our technique,
we derive several improved high-probability bounds for various extensions of
the standard multi-armed bandit framework. Finally, we conduct a simple
experiment that illustrates the robustness of our implicit exploration
technique.
Comment: To appear at NIPS 2015.
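The IX estimator itself is simple to state: instead of the unbiased importance-weighted loss estimate ℓ_t/p_t, it uses the biased estimate ℓ_t/(p_t + γ), which caps the variance at the cost of a controlled optimistic bias. Below is a minimal Python sketch of Exp3 run with this estimator; the parameter choices are illustrative (the paper couples γ to the learning rate, e.g. γ = η/2) rather than a reproduction of the tuned algorithm.

```python
import numpy as np

def exp3_ix(losses, eta, gamma, rng=None):
    """Exp3 with the Implicit eXploration (IX) loss estimator.

    losses: (T, K) array of per-round losses in [0, 1], fixed in advance
    (an oblivious adversary). The IX estimate divides by p + gamma rather
    than p, trading a small bias for bounded variance; that is the idea
    named in the abstract. Returns the learner's cumulative loss.
    """
    rng = rng or np.random.default_rng()
    T, K = losses.shape
    cum_est = np.zeros(K)          # cumulative IX loss estimates per arm
    total_loss = 0.0
    for t in range(T):
        # Exponential-weights distribution over arms (softmax of -eta * estimates).
        logits = -eta * cum_est
        p = np.exp(logits - logits.max())
        p /= p.sum()
        arm = rng.choice(K, p=p)
        loss = losses[t, arm]
        total_loss += loss
        # Implicit exploration: inflate the denominator by gamma.
        cum_est[arm] += loss / (p[arm] + gamma)
    return total_loss
```

For instance, `exp3_ix(np.random.default_rng(1).random((10_000, 5)), eta=0.05, gamma=0.025)` runs the learner against i.i.d. uniform losses; no explicit uniform-mixing term appears anywhere in the sampling distribution, which is exactly the exploration component the paper shows can be dropped.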