Information Directed Sampling for Stochastic Bandits with Graph Feedback
We consider stochastic multi-armed bandit problems with graph feedback, where
the decision maker is allowed to observe the neighboring actions of the chosen
action. We allow the graph structure to vary with time and consider both
deterministic and Erdős–Rényi random graph models. For such a graph
feedback model, we first present a novel analysis of Thompson sampling that
leads to a tighter performance bound than existing work. Next, we propose new
Information Directed Sampling based policies that are graph-aware in their
decision making. Under the deterministic graph case, we establish a Bayesian
regret bound for the proposed policies that scales with the clique cover number
of the graph instead of the number of actions. Under the random graph case, we
provide a Bayesian regret bound for the proposed policies that scales with the
ratio of the number of actions over the expected number of observations per
iteration. To the best of our knowledge, this is the first analytical result
for stochastic bandits with random graph feedback. Finally, using numerical
evaluations, we demonstrate that our proposed IDS policies outperform existing
approaches, including adaptations of the upper confidence bound, $\epsilon$-greedy,
and Exp3 algorithms.
Comment: Accepted by AAAI 201
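
To make the graph-feedback model concrete, here is a minimal sketch of the Thompson sampling baseline analyzed above (not the proposed IDS policies), assuming Bernoulli arms, Beta priors, and a known adjacency matrix; these modeling choices are illustrative assumptions, not the paper's exact setup.

import numpy as np

def thompson_graph_feedback(means, adjacency, horizon, rng=None):
    # Thompson sampling for Bernoulli arms with graph feedback: playing
    # arm i also reveals the rewards of i's neighbors in the graph.
    rng = np.random.default_rng(rng)
    n = len(means)
    alpha = np.ones(n)  # Beta posterior parameters (successes + 1)
    beta = np.ones(n)   # Beta posterior parameters (failures + 1)
    regret = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)      # one posterior sample per arm
        i = int(np.argmax(theta))          # play the sampled-best arm
        regret += max(means) - means[i]
        for j in {i, *np.flatnonzero(adjacency[i])}:  # free side observations
            r = float(rng.random() < means[j])        # Bernoulli reward of arm j
            alpha[j] += r
            beta[j] += 1.0 - r
    return regret
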
Analysis of Thompson Sampling for Graphical Bandits Without the Graphs
We study multi-armed bandit problems with graph feedback, in which the
decision maker is allowed to observe the neighboring actions of the chosen
action, in a setting where the graph may vary over time and is never fully
revealed to the decision maker. We show that when the feedback graphs are
undirected, the original Thompson Sampling achieves the optimal (within
logarithmic factors) regret $\tilde{O}(\sqrt{\bar{\alpha}\,T})$ over
time horizon $T$, where $\bar{\alpha}$ is the average independence number of the
latent graphs. To the best of our knowledge, this is the first result showing
that the original Thompson Sampling is optimal for graphical bandits in the
undirected setting. A slightly weaker regret bound of Thompson Sampling in the
directed setting is also presented. To fill this gap, we propose a variant of
Thompson Sampling that attains the optimal regret in the directed setting
within a logarithmic factor. Both algorithms can be implemented efficiently and
do not require the knowledge of the feedback graphs at any time.
Comment: Accepted by UAI 201
Feedback graph regret bounds for Thompson Sampling and UCB
We study the stochastic multi-armed bandit problem with the graph-based
feedback structure introduced by Mannor and Shamir. We analyze the performance
of the two most prominent stochastic bandit algorithms, Thompson Sampling and
Upper Confidence Bound (UCB), in the graph-based feedback setting. We show that
these algorithms achieve regret guarantees that combine the graph structure and
the gaps between the means of the arm distributions. Surprisingly, this holds
despite the fact that these algorithms do not explicitly use the graph
structure to select arms; they observe the additional feedback but do not
explore based on it. Towards this result we introduce a "layering technique"
highlighting the commonalities in the two algorithms.
Comment: Appeared in ALT 202
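
The point that these algorithms "observe the additional feedback but do not explore based on it" can be seen directly in code. Below is a hedged sketch of UCB1 under graph feedback, again assuming Bernoulli arms and a known adjacency matrix purely for illustration: the arm choice is graph-blind, while the statistics absorb every side observation.

import numpy as np

def ucb_with_graph_feedback(means, adjacency, horizon, rng=None):
    # UCB1 whose arm choice ignores the graph entirely, while every
    # side observation delivered by the feedback graph updates the stats.
    rng = np.random.default_rng(rng)
    n = len(means)
    counts = np.zeros(n)
    sums = np.zeros(n)
    for t in range(1, horizon + 1):
        if (counts == 0).any():
            i = int(np.argmin(counts))  # sample unobserved arms first
        else:
            index = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            i = int(np.argmax(index))   # standard UCB1 index, graph-blind
        for j in {i, *np.flatnonzero(adjacency[i])}:
            counts[j] += 1              # neighbors observed for free
            sums[j] += float(rng.random() < means[j])
    return sums / np.maximum(counts, 1.0)  # empirical mean estimates
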
An Information-Theoretic Approach to Minimax Regret in Partial Monitoring
We prove a new minimax theorem connecting the worst-case Bayesian regret and
minimax regret under partial monitoring with no assumptions on the space of
signals or decisions of the adversary. We then generalise the
information-theoretic tools of Russo and Van Roy (2016) for proving Bayesian
regret bounds and combine them with the minimax theorem to derive minimax
regret bounds for various partial monitoring settings. The highlight is a clean
analysis of "non-degenerate easy" and "hard" finite partial monitoring, with
new regret bounds that are independent of arbitrarily large game-dependent
constants. The power of the generalised machinery is further demonstrated by
proving that the minimax regret for $k$-armed adversarial bandits is at most
$\sqrt{2kn}$, improving on existing results by a factor of 2. Finally, we provide
a simple analysis of the cops and robbers game, also improving the best known
constants.
Comment: 29 pages, to appear in COLT 201
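
For orientation, the Russo and Van Roy (2016) tools being generalised bound Bayesian regret through the information ratio. A standard statement of that earlier result (my summary, not this paper's generalisation): if the per-round information ratio, the squared expected instantaneous regret divided by the information gained about the optimal decision $A^*$, is at most $\bar{\Gamma}$, then over $n$ rounds

\[ \mathbb{E}\,[\mathrm{Regret}(n)] \;\le\; \sqrt{\bar{\Gamma}\, H(A^*)\, n}, \]

where $H(A^*)$ is the entropy of the optimal decision under the prior. The paper combines a generalisation of this machinery with its new minimax theorem to convert such Bayesian bounds into minimax regret bounds.
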
Preference-based Online Learning with Dueling Bandits: A Survey
In machine learning, the notion of multi-armed bandits refers to a class of
online learning problems, in which an agent is supposed to simultaneously
explore and exploit a given set of choice alternatives in the course of a
sequential decision process. In the standard setting, the agent learns from
stochastic feedback in the form of real-valued rewards. In many applications,
however, numerical reward signals are not readily available -- instead, only
weaker information is provided, in particular relative preferences in the form
of qualitative comparisons between pairs of alternatives. This observation has
motivated the study of variants of the multi-armed bandit problem, in which
more general representations are used both for the type of feedback to learn
from and the target of prediction. The aim of this paper is to provide a survey
of the state of the art in this field, referred to as preference-based
multi-armed bandits or dueling bandits. To this end, we provide an overview of
problems that have been considered in the literature as well as methods for
tackling them. Our taxonomy is mainly based on the assumptions made by these
methods about the data-generating process and, related to this, the properties
of the preference-based feedback.
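
As a concrete instance of this feedback model, here is a toy dueling-bandit loop with Bernoulli pairwise preferences and a naive Thompson-style duel-selection rule; both the setup and the strategy are illustrative stand-ins, not one of the surveyed algorithms.

import numpy as np

def dueling_bandit_loop(pref, horizon, rng=None):
    # pref[i, j] = probability that arm i beats arm j in a duel,
    # with pref[i, j] + pref[j, i] = 1; only duel outcomes are observed.
    rng = np.random.default_rng(rng)
    n = pref.shape[0]
    wins = np.ones((n, n))    # Beta(1, 1) prior on each pairwise probability
    losses = np.ones((n, n))
    for _ in range(horizon):
        theta = rng.beta(wins, losses)       # sample a plausible preference matrix
        np.fill_diagonal(theta, 0.5)         # ignore self-duels
        scores = (theta > 0.5).sum(axis=1)   # sampled Copeland-style scores
        i, j = np.argsort(scores)[-2:]       # duel the two sampled leaders
        i_beats_j = rng.random() < pref[i, j]
        wins[i, j] += i_beats_j
        losses[j, i] += i_beats_j
        wins[j, i] += 1 - i_beats_j
        losses[i, j] += 1 - i_beats_j
    return wins, losses
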
Contextual Bandits with Latent Confounders: An NMF Approach
Motivated by online recommendation and advertising systems, we consider a
causal model for stochastic contextual bandits with a latent low-dimensional
confounder. In our model, there are $L$ observed contexts and $K$ arms of the
bandit. The observed context influences the reward obtained through a latent
confounder variable with cardinality $m$ ($m \ll L, K$). The arm choice and the
latent confounder causally determine the reward, while the observed context is
correlated with the confounder. Under this model, the $L \times K$ mean reward
matrix (for each context in $[L]$ and each arm in $[K]$)
factorizes into non-negative factors of dimensions $L \times m$ and
$m \times K$. This insight enables us to propose an
$\epsilon$-greedy NMF-Bandit algorithm that designs a sequence of interventions
(selecting specific arms) that achieves a balance between learning this
low-dimensional structure and selecting the best arm to minimize regret. Our
algorithm achieves a regret of $\mathcal{O}(L\,\mathrm{poly}(m, \log K)\,\log T)$
at time $T$, as compared to $\mathcal{O}(LK \log T)$ for conventional
contextual bandits, assuming a constant gap between the best arm and the rest
for each context. These guarantees are obtained under mild sufficiency
conditions on the factors that are weaker versions of the well-known
Statistical RIP condition. We further propose a class of generative models that
satisfy our sufficient conditions, and derive a lower bound. These are the first regret guarantees for
online matrix completion with bandit feedback, when the rank is greater than
one. We further compare the performance of our algorithm with the state of the
art on synthetic and real-world data sets.
Comment: 37 pages, 2 figures
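
To illustrate how a rank-$m$ reward matrix can be learned from bandit observations, here is a toy $\epsilon$-greedy scheme wrapped around a mask-weighted NMF fit. This is my own simplified construction for intuition, not the paper's NMF-Bandit algorithm, which chooses its intervention sequence far more carefully and carries the guarantees above.

import numpy as np

def masked_nmf(R, mask, m, iters=200, eps=1e-9, rng=None):
    # Mask-weighted multiplicative updates: fit R ~= A @ W using only
    # the observed (context, arm) entries, with non-negative factors.
    rng = np.random.default_rng(rng)
    L, K = R.shape
    A = rng.random((L, m)) + eps
    W = rng.random((m, K)) + eps
    for _ in range(iters):
        A *= ((mask * R) @ W.T) / ((mask * (A @ W)) @ W.T + eps)
        W *= (A.T @ (mask * R)) / (A.T @ (mask * (A @ W)) + eps)
    return A, W

def eps_greedy_nmf_bandit(reward, m, horizon, explore=0.1, rng=None):
    # eps-greedy over arms; the exploit step trusts the rank-m estimate.
    rng = np.random.default_rng(rng)
    L, K = reward.shape
    sums, counts = np.zeros((L, K)), np.zeros((L, K))
    for t in range(horizon):
        x = rng.integers(L)                    # a context arrives
        if t < L * m or rng.random() < explore:
            a = rng.integers(K)                # exploratory intervention
        else:                                  # (refit periodically in practice)
            R_hat = np.divide(sums, counts, out=np.zeros_like(sums),
                              where=counts > 0)
            A, W = masked_nmf(R_hat, counts > 0, m)
            a = int(np.argmax((A @ W)[x]))     # best arm under rank-m model
        r = float(rng.random() < reward[x, a]) # Bernoulli reward
        sums[x, a] += r
        counts[x, a] += 1
    return sums, counts
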
Causal Bandits: Learning Good Interventions via Causal Inference
We study the problem of using causal models to improve the rate at which good
interventions can be learned online in a stochastic environment. Our formalism
combines multi-arm bandits and causal inference to model a novel type of bandit
feedback that is not exploited by existing approaches. We propose a new
algorithm that exploits the causal feedback and prove a bound on its simple
regret that is strictly better (in all quantities) than algorithms that do not
use the additional causal information.
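
A minimal way to see the extra feedback is the "parallel graph" special case: independent binary causes X_1, ..., X_d of a reward Y, where arms are interventions do(X_i = x). Because the causes are unconfounded there, a single sample is informative about all 2d interventional arms at once. The sketch below is my toy construction of that pooled feedback, not the paper's algorithm, which also allocates targeted interventions to rarely observed values.

import numpy as np

def observe_and_recommend(px, reward_fn, horizon, rng=None):
    # Toy "parallel graph" causal bandit. Arms are interventions
    # do(X_i = x) on independent binary causes of the reward Y.
    # One sample of (X, Y) gives feedback about every arm (i, X_i),
    # which a standard bandit algorithm would simply discard.
    rng = np.random.default_rng(rng)
    d = len(px)
    sums = np.zeros((d, 2))
    counts = np.zeros((d, 2))
    for _ in range(horizon):
        X = (rng.random(d) < px).astype(int)   # natural draw of all causes
        y = float(rng.random() < reward_fn(X)) # Bernoulli reward
        sums[np.arange(d), X] += y             # credit every observed arm
        counts[np.arange(d), X] += 1
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    i, x = np.unravel_index(np.argmax(means), means.shape)
    return int(i), int(x)                      # recommended intervention
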
First-Order Bayesian Regret Analysis of Thompson Sampling
We address online combinatorial optimization when the player has a prior over
the adversary's sequence of losses. In this framework, Russo and Van Roy
proposed an information-theoretic analysis of Thompson Sampling based on the
{\em information ratio}, resulting in optimal worst-case regret bounds. In this
paper we introduce three novel ideas to this line of work. First we propose a
new quantity, the scale-sensitive information ratio, which allows us to obtain
more refined first-order regret bounds (i.e., bounds of the form $\sqrt{L^*}$
where $L^*$ is the loss of the best combinatorial action). Second we replace
the entropy over combinatorial actions by a coordinate entropy, which allows us
to obtain the first optimal worst-case bound for Thompson Sampling in the
combinatorial setting. Finally, we introduce a novel link between Bayesian
agents and frequentist confidence intervals. Combining these ideas we show that
the classical multi-armed bandit first-order regret bound still holds true in the more challenging and more general semi-bandit
scenario. This latter result improves the previous state-of-the-art bound
by Lykouris, Sridharan and Tardos.
Comment: 42 pages. v2 adds results on graphical feedback and contextual
bandit, and tightens previous results using Tsallis entropy and log barrier
Horde of Bandits using Gaussian Markov Random Fields
The gang of bandits (GOB) model (Cesa-Bianchi et al., 2013) is a recent contextual
bandits framework that shares information between a set of bandit problems,
related by a known (possibly noisy) graph. This model is useful in problems
like recommender systems where the large number of users makes it vital to
transfer information between users. Despite its effectiveness, the existing GOB
model can only be applied to small problems due to its quadratic
time-dependence on the number of nodes. Existing solutions to combat the
scalability issue require an often-unrealistic clustering assumption. By
exploiting a connection to Gaussian Markov random fields (GMRFs), we show that
the GOB model can be made to scale to much larger graphs without additional
assumptions. In addition, we propose a Thompson sampling algorithm which uses
the recent GMRF sampling-by-perturbation technique, allowing it to scale to
even larger problems (leading to a "horde" of bandits). We give regret bounds
and experimental results for GOB with Thompson sampling and epoch-greedy
algorithms, indicating that these methods are as good as or significantly
better than ignoring the graph or adopting a clustering-based approach.
Finally, when an existing graph is not available, we propose a heuristic for
learning it on the fly and show promising results.
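
To sketch why the GMRF view helps: a graph-Laplacian prior keeps the posterior precision sparse, so Thompson samples can be drawn with sparse linear algebra rather than dense covariance updates. Below is a hedged sketch with a single scalar weight per node (the actual model attaches a feature vector to each node, and the paper's sampling-by-perturbation technique avoids even the explicit factorization used here).

import numpy as np

def sample_gmrf_posterior(laplacian, X, y, sigma2=1.0, lam=1.0, rng=None):
    # One Thompson sample of per-node weights under a GMRF prior whose
    # precision is lam * (graph Laplacian + I): connected nodes are pulled
    # toward similar weights, and the posterior precision stays sparse.
    rng = np.random.default_rng(rng)
    n = laplacian.shape[0]
    prec = lam * (laplacian + np.eye(n)) + (X.T @ X) / sigma2  # posterior precision
    mean = np.linalg.solve(prec, (X.T @ y) / sigma2)           # posterior mean
    C = np.linalg.cholesky(prec)              # prec = C @ C.T
    z = rng.standard_normal(n)
    return mean + np.linalg.solve(C.T, z)     # sample ~ N(mean, prec^{-1})
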
On the Performance of Thompson Sampling on Logistic Bandits
We study the logistic bandit, in which rewards are binary with success
probability $\exp(\beta a^\top \theta) / (1 + \exp(\beta a^\top \theta))$, and
actions $a$ and coefficients $\theta$ are within the $d$-dimensional unit ball.
While prior regret bounds for algorithms that address the logistic bandit
exhibit exponential dependence on the slope parameter $\beta$, we establish a
regret bound for Thompson sampling that is independent of $\beta$.
Specifically, we establish that, when the set of feasible actions is identical
to the set of possible coefficient vectors, the Bayesian regret of Thompson
sampling is $\tilde{O}(d\sqrt{T})$. We also establish a
$\tilde{O}(\sqrt{d\eta T}/\lambda)$ bound that applies more broadly, where
$\lambda$ is the worst-case optimal log-odds and $\eta$ is the "fragility
dimension," a new statistic we define to capture the degree to which an optimal
action for one model fails to satisfice for others. We demonstrate that the
fragility dimension plays an essential role by showing that, for any
$\epsilon > 0$, no algorithm can achieve $\mathrm{poly}(d, 1/\lambda)\,T^{1-\epsilon}$ regret.
Comment: Accepted for presentation at the Conference on Learning Theory (COLT) 201
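
The Thompson sampling analyzed here is exact posterior sampling; in practice it is usually approximated. Below is a hedged sketch in the spirit of the Laplace-approximation recipe of Chapelle and Li (2011), with the slope folded into theta; it illustrates the algorithm family, not the paper's analysis or bounds.

import numpy as np

def logistic_ts_step(actions, mu, Sigma, rng=None):
    # One Thompson-sampling step for a logistic bandit, using a Gaussian
    # (e.g. Laplace) approximation N(mu, Sigma) of the posterior over theta.
    rng = np.random.default_rng(rng)
    theta = rng.multivariate_normal(mu, Sigma)  # approximate posterior sample
    return int(np.argmax(actions @ theta))      # reward is monotone in the logit

def laplace_update(mu, Sigma, a, r, iters=20):
    # Newton refit of the Gaussian approximation after observing reward
    # r in {0, 1} for action a (single-observation sketch).
    prec = np.linalg.inv(Sigma)
    theta = mu.copy()
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-a @ theta))    # predicted success probability
        grad = prec @ (theta - mu) + (p - r) * a
        hess = prec + p * (1 - p) * np.outer(a, a)
        theta -= np.linalg.solve(hess, grad)    # Newton step toward the mode
    p = 1.0 / (1.0 + np.exp(-a @ theta))
    return theta, np.linalg.inv(prec + p * (1 - p) * np.outer(a, a))
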