
    Information Directed Sampling for Stochastic Bandits with Graph Feedback

    We consider stochastic multi-armed bandit problems with graph feedback, where the decision maker is allowed to observe the neighboring actions of the chosen action. We allow the graph structure to vary with time and consider both deterministic and Erdős–Rényi random graph models. For such a graph feedback model, we first present a novel analysis of Thompson sampling that leads to a tighter performance bound than existing work. Next, we propose new Information Directed Sampling based policies that are graph-aware in their decision making. Under the deterministic graph case, we establish a Bayesian regret bound for the proposed policies that scales with the clique cover number of the graph instead of the number of actions. Under the random graph case, we provide a Bayesian regret bound for the proposed policies that scales with the ratio of the number of actions over the expected number of observations per iteration. To the best of our knowledge, this is the first analytical result for stochastic bandits with random graph feedback. Finally, using numerical evaluations, we demonstrate that our proposed IDS policies outperform existing approaches, including adaptations of upper confidence bound, $\epsilon$-greedy and Exp3 algorithms. Comment: Accepted by AAAI 201
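
    The graph feedback model lends itself to a short simulation. Below is a minimal sketch, assuming Bernoulli rewards and a fixed undirected feedback graph, of plain Thompson Sampling that updates the posterior of every arm it observes (the pulled arm and its neighbors). It illustrates the feedback model only; it is not the IDS policy proposed in the paper, and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
means = rng.uniform(0.2, 0.8, size=K)   # hidden Bernoulli means
adj = np.eye(K, dtype=bool)             # feedback graph (self-loops included)
adj[0, 1] = adj[1, 0] = True
adj[2, 3] = adj[3, 2] = True

alpha = np.ones(K)                      # Beta posterior parameters per arm
beta = np.ones(K)

for t in range(2000):
    theta = rng.beta(alpha, beta)       # one posterior sample per arm
    a = int(np.argmax(theta))           # play the sampled best arm
    observed = np.flatnonzero(adj[a])   # arm a plus its neighbors are revealed
    rewards = rng.binomial(1, means[observed])
    alpha[observed] += rewards          # update every observed arm's posterior
    beta[observed] += 1 - rewards

print("estimated means:", np.round(alpha / (alpha + beta), 2))
print("true means     :", np.round(means, 2))
```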

    Analysis of Thompson Sampling for Graphical Bandits Without the Graphs

    We study multi-armed bandit problems with graph feedback, in which the decision maker is allowed to observe the neighboring actions of the chosen action, in a setting where the graph may vary over time and is never fully revealed to the decision maker. We show that when the feedback graphs are undirected, the original Thompson Sampling achieves the optimal (within logarithmic factors) regret $\tilde{O}(\sqrt{\beta_0(G)T})$ over time horizon $T$, where $\beta_0(G)$ is the average independence number of the latent graphs. To the best of our knowledge, this is the first result showing that the original Thompson Sampling is optimal for graphical bandits in the undirected setting. A slightly weaker regret bound of Thompson Sampling in the directed setting is also presented. To fill this gap, we propose a variant of Thompson Sampling that attains the optimal regret in the directed setting within a logarithmic factor. Both algorithms can be implemented efficiently and do not require knowledge of the feedback graphs at any time. Comment: Accepted by UAI 201
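
    For intuition about the quantity $\beta_0(G)$ appearing in the bound, here is a tiny illustrative brute-force routine (exponential time, suitable only for small graphs) that computes the independence number of an undirected graph given as an edge list.

```python
from itertools import combinations

def independence_number(n, edges):
    """Brute-force size of the largest vertex set with no internal edges."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(n, 0, -1):                       # try large sets first
        for subset in combinations(range(n), size):
            if all(frozenset(pair) not in edge_set
                   for pair in combinations(subset, 2)):
                return size
    return 0

# Example: a 5-cycle has independence number 2.
print(independence_number(5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]))
```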

    Feedback graph regret bounds for Thompson Sampling and UCB

    We study the stochastic multi-armed bandit problem with the graph-based feedback structure introduced by Mannor and Shamir. We analyze the performance of the two most prominent stochastic bandit algorithms, Thompson Sampling and Upper Confidence Bound (UCB), in the graph-based feedback setting. We show that these algorithms achieve regret guarantees that combine the graph structure and the gaps between the means of the arm distributions. Surprisingly, this holds despite the fact that these algorithms do not explicitly use the graph structure to select arms; they observe the additional feedback but do not explore based on it. Towards this result we introduce a "layering technique" highlighting the commonalities in the two algorithms. Comment: Appeared in ALT 202
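
    As a rough illustration of the point that these algorithms need not plan with the graph, the sketch below runs vanilla UCB1 (a standard textbook index, not necessarily the exact variant analysed in the paper) in a toy graph-feedback environment: the index uses only each arm's own statistics, but every side observation updates those statistics.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
means = np.array([0.3, 0.5, 0.6, 0.8])
adj = np.eye(K, dtype=bool)
adj[0, 3] = adj[3, 0] = True   # pulling arm 0 also reveals arm 3, and vice versa

counts = np.zeros(K)
sums = np.zeros(K)

for t in range(1, 3001):
    if np.any(counts == 0):
        a = int(np.argmin(counts))                       # forced initial exploration
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    observed = np.flatnonzero(adj[a])
    rewards = rng.binomial(1, means[observed])
    counts[observed] += 1                                # side observations count too
    sums[observed] += rewards

print("empirical means:", np.round(sums / counts, 2))
print("pull counts    :", counts)
```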

    An Information-Theoretic Approach to Minimax Regret in Partial Monitoring

    We prove a new minimax theorem connecting the worst-case Bayesian regret and minimax regret under partial monitoring with no assumptions on the space of signals or decisions of the adversary. We then generalise the information-theoretic tools of Russo and Van Roy (2016) for proving Bayesian regret bounds and combine them with the minimax theorem to derive minimax regret bounds for various partial monitoring settings. The highlight is a clean analysis of 'non-degenerate easy' and 'hard' finite partial monitoring, with new regret bounds that are independent of arbitrarily large game-dependent constants. The power of the generalised machinery is further demonstrated by proving that the minimax regret for $k$-armed adversarial bandits is at most $\sqrt{2kn}$, improving on existing results by a factor of 2. Finally, we provide a simple analysis of the cops and robbers game, also improving best known constants. Comment: 29 pages, to appear in COLT 201
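
    For readers unfamiliar with the framework, a finite partial monitoring game is just a pair of matrices: a loss matrix the learner suffers from and a signal matrix it observes from. The snippet below encodes the classic "apple tasting" game as an illustrative instance; it is not one of the games analysed in the paper.

```python
import numpy as np

# outcomes: column 0 = apple is rotten, column 1 = apple is fresh
L = np.array([[0.0, 1.0],    # action 0: taste (wastes a fresh apple)
              [1.0, 0.0]])   # action 1: sell  (selling a rotten apple costs 1)
Phi = np.array([["rotten", "fresh"],   # tasting reveals the outcome
                ["none", "none"]])     # selling reveals nothing

rng = np.random.default_rng(2)
outcome = rng.integers(2)              # adversary's choice for this round
action = rng.integers(2)               # learner's choice for this round
print("loss suffered:", L[action, outcome], "| signal seen:", Phi[action, outcome])
```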

    Preference-based Online Learning with Dueling Bandits: A Survey

    In machine learning, the notion of multi-armed bandits refers to a class of online learning problems, in which an agent is supposed to simultaneously explore and exploit a given set of choice alternatives in the course of a sequential decision process. In the standard setting, the agent learns from stochastic feedback in the form of real-valued rewards. In many applications, however, numerical reward signals are not readily available -- instead, only weaker information is provided, in particular relative preferences in the form of qualitative comparisons between pairs of alternatives. This observation has motivated the study of variants of the multi-armed bandit problem, in which more general representations are used both for the type of feedback to learn from and the target of prediction. The aim of this paper is to provide a survey of the state of the art in this field, referred to as preference-based multi-armed bandits or dueling bandits. To this end, we provide an overview of problems that have been considered in the literature as well as methods for tackling them. Our taxonomy is mainly based on the assumptions made by these methods about the data-generating process and, related to this, the properties of the preference-based feedback.
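
    To make the feedback model concrete, here is a small simulation of dueling-bandit feedback: the learner only observes which of two queried arms wins a noisy comparison. The pair-selection rule is a crude Thompson-style heuristic written purely for illustration, not one of the algorithms covered by the survey, and the preference matrix is an assumed toy instance.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 4
# hidden preference matrix: P[i, j] = probability that arm i beats arm j,
# here consistent with the total order 3 > 2 > 1 > 0
P = np.full((K, K), 0.5)
for i in range(K):
    for j in range(K):
        if i != j:
            P[i, j] = 0.5 + 0.1 * (i - j)

wins = np.ones((K, K))      # Beta(1, 1) prior on each pairwise win probability
losses = np.ones((K, K))

for t in range(5000):
    theta = rng.beta(wins, losses)                 # sampled pairwise win probabilities
    np.fill_diagonal(theta, 0.5)                   # ignore self-comparisons
    i = int(np.argmax((theta > 0.5).sum(axis=1)))  # crude Copeland-style pick
    others = [j for j in range(K) if j != i]
    j = min(others, key=lambda j: theta[i, j])     # toughest sampled opponent
    if rng.random() < P[i, j]:                     # run the duel, observe only the winner
        wins[i, j] += 1; losses[j, i] += 1
    else:
        losses[i, j] += 1; wins[j, i] += 1

print("estimated P(row beats column):\n", np.round(wins / (wins + losses), 2))
```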

    Contextual Bandits with Latent Confounders: An NMF Approach

    Motivated by online recommendation and advertising systems, we consider a causal model for stochastic contextual bandits with a latent low-dimensional confounder. In our model, there are $L$ observed contexts and $K$ arms of the bandit. The observed context influences the reward obtained through a latent confounder variable with cardinality $m$ ($m \ll L, K$). The arm choice and the latent confounder causally determine the reward, while the observed context is correlated with the confounder. Under this model, the $L \times K$ mean reward matrix $\mathbf{U}$ (for each context in $[L]$ and each arm in $[K]$) factorizes into non-negative factors $\mathbf{A}$ ($L \times m$) and $\mathbf{W}$ ($m \times K$). This insight enables us to propose an $\epsilon$-greedy NMF-Bandit algorithm that designs a sequence of interventions (selecting specific arms) that achieves a balance between learning this low-dimensional structure and selecting the best arm to minimize regret. Our algorithm achieves a regret of $\mathcal{O}(L\,\mathrm{poly}(m, \log K)\log T)$ at time $T$, as compared to $\mathcal{O}(LK\log T)$ for conventional contextual bandits, assuming a constant gap between the best arm and the rest for each context. These guarantees are obtained under mild sufficiency conditions on the factors that are weaker versions of the well-known Statistical RIP condition. We further propose a class of generative models that satisfy our sufficient conditions, and derive a lower bound of $\mathcal{O}(Km\log T)$. These are the first regret guarantees for online matrix completion with bandit feedback when the rank is greater than one. We further compare the performance of our algorithm with the state of the art on synthetic and real-world data sets. Comment: 37 pages, 2 figures
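
    The low-rank structure is easy to mock up. Below is a minimal sketch in the spirit of the model, not the paper's NMF-Bandit algorithm: the L x K mean-reward matrix is generated as a product of non-negative factors, and an epsilon-greedy loop periodically refits a rank-m NMF (here scikit-learn's) to the empirical reward matrix to guide exploitation. Dimensions, schedules, and the refit rule are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(4)
L_ctx, K, m = 20, 10, 2
A = rng.random((L_ctx, m))          # hidden non-negative factors
W = rng.random((m, K))
U = A @ W
U /= U.max()                        # Bernoulli mean-reward matrix in [0, 1]

sums = np.zeros((L_ctx, K))
counts = np.zeros((L_ctx, K))
U_hat = np.full((L_ctx, K), 0.5)    # current low-rank estimate of U
eps = 0.1

for t in range(20000):
    ctx = rng.integers(L_ctx)                         # context shown to the learner
    if rng.random() < eps:
        arm = int(rng.integers(K))                    # uniform exploration
    else:
        arm = int(np.argmax(U_hat[ctx]))              # exploit the low-rank estimate
    reward = rng.binomial(1, U[ctx, arm])
    sums[ctx, arm] += reward
    counts[ctx, arm] += 1
    if t % 2000 == 1999:                              # periodically refit a rank-m NMF
        empirical = sums / np.maximum(counts, 1)
        nmf = NMF(n_components=m, init="random", random_state=0, max_iter=500)
        A_hat = nmf.fit_transform(empirical)
        U_hat = A_hat @ nmf.components_

print("max |U_hat - U|:", np.abs(U_hat - U).max())
```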

    Causal Bandits: Learning Good Interventions via Causal Inference

    We study the problem of using causal models to improve the rate at which good interventions can be learned online in a stochastic environment. Our formalism combines multi-armed bandits and causal inference to model a novel type of bandit feedback that is not exploited by existing approaches. We propose a new algorithm that exploits the causal feedback and prove a bound on its simple regret that is strictly better (in all quantities) than algorithms that do not use the additional causal information.
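
    As a toy illustration of the kind of feedback being exploited (not the paper's algorithm or regret analysis), the sketch below uses the "parallel graph" commonly used in the causal-bandit literature: independent binary causes feed into Y, every variable is observed after each round, and because the causes are unconfounded roots, a single observational round informs many interventional arms at once. The structural equations here are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 6
p = rng.uniform(0.2, 0.8, size=N)    # P(X_j = 1) when X_j is not intervened on
w = rng.uniform(-1, 1, size=N)       # Y ~ Bernoulli(sigmoid(w . X))

def pull(i=None, x=None):
    """One round: optionally intervene do(X_i = x), then observe all X's and Y."""
    X = (rng.random(N) < p).astype(float)
    if i is not None:
        X[i] = x
    y = float(rng.random() < 1.0 / (1.0 + np.exp(-w @ X)))
    return X, y

# arms are (variable, value) pairs; since the X's are independent causes of Y,
# observational rounds give unbiased estimates for every arm at once
sums = np.zeros((N, 2))
counts = np.zeros((N, 2))
for t in range(6000):
    if t % 2 == 0:
        X, y = pull()                                 # observational round
        for j in range(N):
            sums[j, int(X[j])] += y
            counts[j, int(X[j])] += 1
    else:
        i, x = int(rng.integers(N)), int(rng.integers(2))
        X, y = pull(i, x)                             # interventional round
        sums[i, x] += y
        counts[i, x] += 1

print("estimated E[Y | do(X_j = x)]:\n",
      np.round(sums / np.maximum(counts, 1), 2))
```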

    First-Order Bayesian Regret Analysis of Thompson Sampling

    We address online combinatorial optimization when the player has a prior over the adversary's sequence of losses. In this framework, Russo and Van Roy proposed an information-theoretic analysis of Thompson Sampling based on the information ratio, resulting in optimal worst-case regret bounds. In this paper we introduce three novel ideas to this line of work. First, we propose a new quantity, the scale-sensitive information ratio, which allows us to obtain more refined first-order regret bounds (i.e., bounds of the form $\sqrt{L^*}$ where $L^*$ is the loss of the best combinatorial action). Second, we replace the entropy over combinatorial actions by a coordinate entropy, which allows us to obtain the first optimal worst-case bound for Thompson Sampling in the combinatorial setting. Finally, we introduce a novel link between Bayesian agents and frequentist confidence intervals. Combining these ideas, we show that the classical multi-armed bandit first-order regret bound $\tilde{O}(\sqrt{d L^*})$ still holds true in the more challenging and more general semi-bandit scenario. This latter result improves the previous state-of-the-art bound $\tilde{O}(\sqrt{(d+m^3)L^*})$ by Lykouris, Sridharan and Tardos. Comment: 42 pages. v2 adds results on graphical feedback and contextual bandit, and tightens previous results using Tsallis entropy and log barrier
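
    To fix ideas about the semi-bandit setting the bounds refer to, here is a generic sketch not tied to the paper's analysis: an action is a size-m subset of d coordinates, the learner observes the loss of each chosen coordinate, and Thompson Sampling keeps an independent Beta posterior per coordinate. The Bernoulli losses and problem sizes are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
d, m = 10, 3
loss_means = rng.uniform(0.1, 0.9, size=d)   # Bernoulli loss of each coordinate

alpha = np.ones(d)    # Beta posterior on each coordinate's loss mean
beta = np.ones(d)

total_loss = 0.0
T = 5000
for t in range(T):
    theta = rng.beta(alpha, beta)                  # sampled loss means
    action = np.argsort(theta)[:m]                 # play the m smallest sampled losses
    losses = rng.binomial(1, loss_means[action])   # semi-bandit feedback, per coordinate
    alpha[action] += losses
    beta[action] += 1 - losses
    total_loss += losses.sum()

print("average per-round loss :", total_loss / T)
print("best possible per round:", np.sort(loss_means)[:m].sum())
```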

    Horde of Bandits using Gaussian Markov Random Fields

    The gang of bandits (GOB) model (Cesa-Bianchi et al., 2013) is a recent contextual bandits framework that shares information between a set of bandit problems, related by a known (possibly noisy) graph. This model is useful in problems like recommender systems where the large number of users makes it vital to transfer information between users. Despite its effectiveness, the existing GOB model can only be applied to small problems due to its quadratic time-dependence on the number of nodes. Existing solutions to combat the scalability issue require an often-unrealistic clustering assumption. By exploiting a connection to Gaussian Markov random fields (GMRFs), we show that the GOB model can be made to scale to much larger graphs without additional assumptions. In addition, we propose a Thompson sampling algorithm which uses the recent GMRF sampling-by-perturbation technique, allowing it to scale to even larger problems (leading to a "horde" of bandits). We give regret bounds and experimental results for GOB with Thompson sampling and epoch-greedy algorithms, indicating that these methods are as good as or significantly better than ignoring the graph or adopting a clustering-based approach. Finally, when an existing graph is not available, we propose a heuristic for learning it on the fly and show promising results.
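
    The sampling-by-perturbation idea can be shown in a few lines. This is a hedged, generic sketch of the Gaussian "perturb the potentials, then solve the same linear system" trick on a toy chain-graph GMRF with a handful of noisy node observations; it is not the paper's bandit algorithm, but it does produce exact posterior samples while only requiring linear solves against the (sparse, in real applications) posterior precision.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
# a simple tridiagonal GMRF prior precision (chain graph plus a small ridge)
Q_prior = (np.diag(2.0 * np.ones(n))
           - np.diag(np.ones(n - 1), 1)
           - np.diag(np.ones(n - 1), -1)
           + 0.1 * np.eye(n))

obs_idx = np.array([3, 10, 17, 24])          # nodes with noisy observations
tau = 4.0                                    # observation precision
x_true = 0.3 * np.cumsum(rng.normal(size=n))
y = x_true[obs_idx] + rng.normal(scale=tau ** -0.5, size=obs_idx.size)

A = np.zeros((obs_idx.size, n))
A[np.arange(obs_idx.size), obs_idx] = 1.0
Q_post = Q_prior + tau * A.T @ A
mean = np.linalg.solve(Q_post, tau * A.T @ y)     # posterior mean

C = np.linalg.cholesky(Q_prior)                   # Q_prior = C C^T

def sample_by_perturbation():
    b_prior = C @ rng.normal(size=n)              # ~ N(0, Q_prior)
    y_tilde = y + rng.normal(scale=tau ** -0.5, size=obs_idx.size)
    return np.linalg.solve(Q_post, b_prior + tau * A.T @ y_tilde)

samples = np.stack([sample_by_perturbation() for _ in range(2000)])
print("max |sample mean - posterior mean|:", np.abs(samples.mean(0) - mean).max())
```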

    On the Performance of Thompson Sampling on Logistic Bandits

    We study the logistic bandit, in which rewards are binary with success probability $\exp(\beta a^\top \theta) / (1 + \exp(\beta a^\top \theta))$ and actions $a$ and coefficients $\theta$ are within the $d$-dimensional unit ball. While prior regret bounds for algorithms that address the logistic bandit exhibit exponential dependence on the slope parameter $\beta$, we establish a regret bound for Thompson sampling that is independent of $\beta$. Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $\tilde{O}(d\sqrt{T})$. We also establish a $\tilde{O}(\sqrt{d\eta T}/\lambda)$ bound that applies more broadly, where $\lambda$ is the worst-case optimal log-odds and $\eta$ is the "fragility dimension," a new statistic we define to capture the degree to which an optimal action for one model fails to satisfice for others. We demonstrate that the fragility dimension plays an essential role by showing that, for any $\epsilon > 0$, no algorithm can achieve $\mathrm{poly}(d, 1/\lambda)\cdot T^{1-\epsilon}$ regret. Comment: Accepted for presentation at the Conference on Learning Theory (COLT) 201
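
    For concreteness, the sketch below simulates the reward model from the abstract and runs Thompson sampling with a Laplace approximation to the posterior. This is a common practical stand-in; the paper's analysis concerns exact-posterior Thompson sampling, and the prior, slope value, and finite action set used here are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
d, slope = 3, 3.0                     # "slope" plays the role of beta in the abstract
theta_true = rng.normal(size=d)
theta_true /= np.linalg.norm(theta_true)          # coefficients in the unit ball

actions = rng.normal(size=(50, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)

def p_success(a, th):
    return 1.0 / (1.0 + np.exp(-slope * a @ th))

X, r = [], []
for t in range(300):
    if X:
        Xa, ra = np.array(X), np.array(r)
        def neg_log_post(th):         # N(0, I) prior + Bernoulli-logistic likelihood
            z = slope * Xa @ th
            return 0.5 * th @ th + np.sum(np.logaddexp(0.0, z) - ra * z)
        th_map = minimize(neg_log_post, np.zeros(d), method="BFGS").x
        pr = 1.0 / (1.0 + np.exp(-slope * Xa @ th_map))
        H = np.eye(d) + slope ** 2 * (Xa * (pr * (1 - pr))[:, None]).T @ Xa
        th_sample = rng.multivariate_normal(th_map, np.linalg.inv(H))
    else:
        th_sample = rng.normal(size=d)             # draw from the prior on round one
    a = actions[int(np.argmax(actions @ th_sample))]
    X.append(a)
    r.append(float(rng.random() < p_success(a, theta_true)))

best = actions[int(np.argmax(actions @ theta_true))]
print("mean of last pulled arm:", round(p_success(a, theta_true), 3),
      "| best arm mean:", round(p_success(best, theta_true), 3))
```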