Bandit Problems with Side Observations
An extension of the traditional two-armed bandit problem is considered, in
which the decision maker has access to some side information before deciding
which arm to pull. At each time t, before making a selection, the decision
maker is able to observe a random variable X_t that provides some information
on the rewards to be obtained. The focus is on finding uniformly good rules
(that minimize the growth rate of the inferior sampling time) and on
quantifying how much the additional information helps. Various settings are
considered and for each setting, lower bounds on the achievable inferior
sampling time are developed and asymptotically optimal adaptive schemes
achieving these lower bounds are constructed.
Comment: 16 pages, 3 figures. To be published in the IEEE Transactions on Automatic Control.
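The model above is easiest to see with a concrete rule. As a hedged illustration only (the paper constructs asymptotically optimal schemes; this is not one of them), the sketch below assumes the side observation X_t takes finitely many values and simply keeps separate UCB statistics per observed value; all names and parameters are illustrative.

    import numpy as np
    from collections import defaultdict

    def per_context_ucb(side_obs, pull, K=2, T=1000, c=2.0):
        # side_obs: callable t -> hashable side observation X_t (assumed
        #           to take finitely many values for this sketch).
        # pull:     callable (x, arm) -> observed reward in [0, 1].
        n = defaultdict(lambda: np.zeros(K))      # pulls per (context, arm)
        mean = defaultdict(lambda: np.zeros(K))   # empirical means
        total = 0.0
        for t in range(1, T + 1):
            x = side_obs(t)                       # observe X_t before choosing
            if np.any(n[x] == 0):                 # try each arm once per context
                arm = int(np.argmin(n[x]))
            else:
                arm = int(np.argmax(mean[x] + np.sqrt(c * np.log(t) / n[x])))
            r = pull(x, arm)
            n[x][arm] += 1
            mean[x][arm] += (r - mean[x][arm]) / n[x][arm]
            total += r
        return total

The point of the sketch is the ordering of events: the side observation arrives before the pull, so sampling effort on inferior arms can be confined to the contexts where the best arm is actually ambiguous.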
Efficient learning by implicit exploration in bandit problems with side observations
We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets to observe the losses of some other actions. The revealed losses depend on the learner's action and a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial-information setting that models online combinatorial optimization problems where the feedback received by the learner is between semi-bandit and full feedback. As the predictions of our first algorithm cannot always be computed efficiently in this setting, we propose another algorithm with similar properties that has the benefit of always being computationally efficient, at the price of a slightly more complicated tuning mechanism. Both algorithms rely on a novel exploration strategy called implicit exploration, which is shown to be more efficient both computationally and information-theoretically than previously studied exploration strategies for the problem.
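As a rough sketch of the implicit-exploration idea in this feedback model (an illustration under assumed names and interfaces, not the authors' exact algorithm or tuning), one can run exponential weights and divide each observed loss by the probability of observing it plus a bias parameter gamma:

    import numpy as np

    def exp_weights_ix(losses, obs_graph, eta=0.1, gamma=0.05, rng=None):
        # losses:    (T, K) array of per-round losses in [0, 1].
        # obs_graph: (K, K) boolean matrix; obs_graph[j, i] == True means
        #            playing action j reveals the loss of action i
        #            (self-loops included, so the played arm is observed).
        rng = np.random.default_rng() if rng is None else rng
        T, K = losses.shape
        cum_est = np.zeros(K)                 # cumulative loss estimates
        total = 0.0
        for t in range(T):
            w = np.exp(-eta * (cum_est - cum_est.min()))
            p = w / w.sum()
            action = rng.choice(K, p=p)
            total += losses[t, action]
            observed = obs_graph[action]      # losses revealed this round
            o = obs_graph.T @ p               # o[i] = P(action i is observed)
            # Implicit exploration: the extra gamma in the denominator biases
            # estimates downward and keeps them bounded, removing the need
            # for explicit uniform exploration.
            cum_est += observed * losses[t] / (o + gamma)
        return total

Note that the selection step uses only the cumulative estimates; obs_graph is consulted only after the action is taken, matching the setting where the learner need not know the observation system before acting.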
Path Planning Problems with Side Observations-When Colonels Play Hide-and-Seek
Resource allocation games such as the famous Colonel Blotto (CB) and
Hide-and-Seek (HS) games are often used to model a large variety of practical
problems, but only in their one-shot versions. Indeed, due to their extremely
large strategy space, it remains an open question how one can efficiently learn
in these games. In this work, we show that the online CB and HS games can be
cast as path planning problems with side-observations (SOPPP): at each stage, a
learner chooses a path on a directed acyclic graph and suffers the sum of
losses that are adversarially assigned to the corresponding edges; and she then
receives semi-bandit feedback with side-observations (i.e., she observes the
losses on the chosen edges plus some others). We propose a novel algorithm, EXP3-OE, the first of its kind to guarantee efficient running time for SOPPP without requiring any auxiliary oracle. We provide an expected-regret bound for
EXP3-OE in SOPPP matching the order of the best benchmark in the literature.
Moreover, we introduce additional assumptions on the observability model under
which we can further improve the regret bounds of EXP3-OE. We illustrate the
benefit of using EXP3-OE in SOPPP by applying it to the online CB and HS games.
Comment: Previously, this work appeared as arXiv:1911.09023, which was mistakenly submitted as a new article (a withdrawal has been requested). This is a preprint of the work published in Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI 2020).
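A toy version of the SOPPP feedback loop can be written directly, at the cost of enumerating paths; the sketch below is deliberately brute-force (EXP3-OE itself avoids path enumeration, which is the point of the paper), and the interfaces, parameter names, and observation-set function are assumptions for illustration:

    import numpy as np

    def enumerate_paths(dag, source, sink, prefix=None):
        # All source->sink paths in a DAG given as {node: [successors]},
        # each returned as a tuple of edges.
        prefix = [source] if prefix is None else prefix
        if source == sink:
            yield tuple(zip(prefix, prefix[1:]))
            return
        for nxt in dag.get(source, []):
            yield from enumerate_paths(dag, nxt, sink, prefix + [nxt])

    def exp_weights_soppp(dag, source, sink, edge_losses, obs_set,
                          eta=0.1, gamma=0.05, rng=None):
        # edge_losses: iterable over rounds of {edge: loss in [0, 1]},
        #              assumed to cover every observable edge.
        # obs_set:     callable path -> set of observed edges (the chosen
        #              edges plus whatever side observations are granted).
        rng = np.random.default_rng() if rng is None else rng
        paths = list(enumerate_paths(dag, source, sink))
        est = {e: 0.0 for p in paths for e in p}  # cumulative edge estimates
        for losses in edge_losses:
            scores = np.array([-eta * sum(est[e] for e in p) for p in paths])
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()
            chosen = paths[rng.choice(len(paths), p=probs)]
            for e in obs_set(chosen):
                # q: probability that edge e would have been observed this
                # round, computed here by brute force over all paths.
                q = sum(pr for p, pr in zip(paths, probs) if e in obs_set(p))
                est[e] += losses[e] / (q + gamma)   # IX-style estimator
        return est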
Information Directed Sampling for Stochastic Bandits with Graph Feedback
We consider stochastic multi-armed bandit problems with graph feedback, where
the decision maker is allowed to observe the neighboring actions of the chosen
action. We allow the graph structure to vary with time and consider both
deterministic and Erd\H{o}s-R\'enyi random graph models. For such a graph
feedback model, we first present a novel analysis of Thompson sampling that leads to a tighter performance bound than existing work. Next, we propose new
Information Directed Sampling based policies that are graph-aware in their
decision making. Under the deterministic graph case, we establish a Bayesian
regret bound for the proposed policies that scales with the clique cover number
of the graph instead of the number of actions. Under the random graph case, we
provide a Bayesian regret bound for the proposed policies that scales with the
ratio of the number of actions over the expected number of observations per
iteration. To the best of our knowledge, this is the first analytical result
for stochastic bandits with random graph feedback. Finally, using numerical
evaluations, we demonstrate that our proposed IDS policies outperform existing
approaches, including adaptations of the upper confidence bound, $\epsilon$-greedy, and Exp3 algorithms.
Comment: Accepted by AAAI 2018.
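The baseline being analyzed, Thompson sampling under graph feedback, is short enough to sketch. The version below is a hedged illustration for Bernoulli rewards with a known, fixed graph (names and the simulation interface are assumptions): every revealed neighbor updates its own Beta posterior, which is where the extra observations pay off.

    import numpy as np

    def thompson_graph_feedback(reward_probs, adj, T, rng=None):
        # reward_probs: length-K true Bernoulli means (simulation only).
        # adj: (K, K) boolean adjacency with self-loops; adj[a, i] means
        #      pulling arm a also reveals a sample of arm i.
        rng = np.random.default_rng() if rng is None else rng
        K = len(reward_probs)
        alpha = np.ones(K)                  # Beta(1, 1) priors per arm
        beta = np.ones(K)
        total = 0.0
        for _ in range(T):
            theta = rng.beta(alpha, beta)   # one posterior sample per arm
            arm = int(np.argmax(theta))
            for i in np.flatnonzero(adj[arm]):
                r = rng.binomial(1, reward_probs[i])
                alpha[i] += r               # side observations sharpen the
                beta[i] += 1 - r            # posterior at no reward cost
                if i == arm:
                    total += r              # only the pulled arm pays out
        return total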
Explore no more: Improved high-probability regret bounds for non-stochastic bandits
This work addresses the problem of regret minimization in non-stochastic
multi-armed bandit problems, focusing on performance guarantees that hold with
high probability. Such results are rather scarce in the literature since
proving them requires a great deal of technical effort and significant modifications to the standard, more intuitive algorithms that come only with guarantees that hold in expectation. One of these modifications is forcing the learner to sample arms from the uniform distribution at least $\Omega(\sqrt{T})$ times over $T$ rounds, which can adversely affect
performance if many of the arms are suboptimal. While it is widely conjectured
that this property is essential for proving high-probability regret bounds, we
show in this paper that it is possible to achieve such strong results without
this undesirable exploration component. Our result relies on a simple and
intuitive loss-estimation strategy called Implicit eXploration (IX) that allows
a remarkably clean analysis. To demonstrate the flexibility of our technique,
we derive several improved high-probability bounds for various extensions of
the standard multi-armed bandit framework. Finally, we conduct a simple
experiment that illustrates the robustness of our implicit exploration
technique.
Comment: To appear at NIPS 2015.
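The loss-estimation strategy itself fits in one line. Writing $p_{t,i}$ for the probability of playing arm $i$ at round $t$, $I_t$ for the chosen arm, and $\gamma > 0$ for the IX parameter, the standard importance-weighted estimator and its implicit-exploration variant are

  \hat{\ell}_{t,i} = \frac{\ell_{t,i}}{p_{t,i}} \mathbb{I}\{I_t = i\}
  \qquad\text{versus}\qquad
  \tilde{\ell}_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \mathbb{I}\{I_t = i\}.

The added $\gamma$ caps every estimate at $\ell_{t,i}/\gamma$, so the estimator's tails are controlled without ever forcing uniform plays; the price is a small downward bias that the analysis must absorb.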
Learning Contextual Bandits in a Non-stationary Environment
Multi-armed bandit algorithms have become a reference solution for handling
the explore/exploit dilemma in recommender systems, and many other important
real-world problems, such as display advertisement. However, such algorithms
usually assume a stationary reward distribution, which hardly holds in practice
as users' preferences are dynamic. This inevitably costs a recommender system consistently suboptimal performance. In this paper, we consider the situation where the underlying reward distribution remains unchanged over (possibly short) epochs and shifts at unknown time instants. In response, we propose a contextual bandit algorithm that detects possible changes of the environment based on its reward estimation confidence and updates its arm selection strategy accordingly. A rigorous upper regret bound analysis of the proposed algorithm
demonstrates its learning effectiveness in such a non-trivial environment.
Extensive empirical evaluations on both synthetic and real-world datasets for
recommendation confirm its practical utility in a changing environment.
Comment: 10 pages, 13 figures. To appear at ACM Special Interest Group on Information Retrieval (SIGIR) 2018.
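The detection idea in the abstract, resetting when the model's own confidence intervals stop covering the observed rewards, can be sketched around a standard LinUCB-style estimator. Everything below (class name, window size, violation threshold) is an assumed illustration of that idea, not the paper's algorithm:

    import numpy as np
    from collections import deque

    class ChangeAwareLinUcb:
        def __init__(self, dim, alpha=1.0, window=50, tol=0.3, lam=1.0):
            self.dim, self.alpha, self.lam = dim, alpha, lam
            self.window, self.tol = window, tol
            self._reset()

        def _reset(self):
            self.A = self.lam * np.eye(self.dim)   # ridge design matrix
            self.b = np.zeros(self.dim)
            self.violations = deque(maxlen=self.window)

        def choose(self, contexts):
            # contexts: (K, dim) array, one feature vector per arm.
            theta = np.linalg.solve(self.A, self.b)
            A_inv = np.linalg.inv(self.A)
            width = np.sqrt(np.einsum('kd,dc,kc->k', contexts, A_inv, contexts))
            return int(np.argmax(contexts @ theta + self.alpha * width))

        def update(self, x, reward):
            theta = np.linalg.solve(self.A, self.b)
            width = self.alpha * np.sqrt(x @ np.linalg.solve(self.A, x))
            # A violation: the observed reward fell outside the model's
            # own confidence band around its prediction.
            self.violations.append(abs(reward - x @ theta) > width)
            self.A += np.outer(x, x)
            self.b += reward * x
            # Persistent violations suggest the reward distribution
            # shifted, so forget the stale model and start over.
            if (len(self.violations) == self.window
                    and np.mean(self.violations) > self.tol):
                self._reset()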
A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem
The multi-armed bandit problem has been extensively studied under the
stationary assumption. However, in reality, this assumption often does not hold
because the distributions of rewards themselves may change over time. In this
paper, we propose a change-detection (CD) based framework for multi-armed
bandit problems under the piecewise-stationary setting, and study a class of
change-detection based UCB (Upper Confidence Bound) policies, CD-UCB, that
actively detect change points and restart the UCB indices. We then develop CUSUM-UCB and PHT-UCB, which belong to the CD-UCB class and use the cumulative sum (CUSUM) and Page-Hinkley test (PHT), respectively, to detect changes. We show that CUSUM-UCB
obtains the best known regret upper bound under mild assumptions. We also
demonstrate the regret reduction of the CD-UCB policies on arbitrary Bernoulli rewards and Yahoo! datasets of webpage click-through rates.
Comment: Accepted by AAAI 2018.
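The restart mechanism is simple to sketch. The version below is a hedged, simplified illustration of the CD-UCB pattern (parameter names, the use of the running mean as the CUSUM reference, and the global restart are assumptions of this sketch rather than details taken from the paper):

    import numpy as np

    class CusumUcb:
        def __init__(self, K, eps=0.05, threshold=10.0, c=2.0):
            self.K, self.eps, self.h, self.c = K, eps, threshold, c
            self._restart()

        def _restart(self):
            self.n = np.zeros(self.K)       # pulls since the last restart
            self.mean = np.zeros(self.K)    # empirical means per arm
            self.g_pos = np.zeros(self.K)   # CUSUM statistic, upward shift
            self.g_neg = np.zeros(self.K)   # CUSUM statistic, downward shift
            self.t = 0

        def choose(self):
            self.t += 1
            if np.any(self.n == 0):         # pull every arm once first
                return int(np.argmin(self.n))
            ucb = self.mean + np.sqrt(self.c * np.log(self.t) / self.n)
            return int(np.argmax(ucb))

        def update(self, arm, reward):
            self.n[arm] += 1
            self.mean[arm] += (reward - self.mean[arm]) / self.n[arm]
            # Accumulate deviations larger than the allowed drift eps.
            self.g_pos[arm] = max(0.0, self.g_pos[arm] + reward - self.mean[arm] - self.eps)
            self.g_neg[arm] = max(0.0, self.g_neg[arm] - reward + self.mean[arm] - self.eps)
            if self.g_pos[arm] > self.h or self.g_neg[arm] > self.h:
                self._restart()             # change detected: forget the past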