
    Information Directed Sampling for Stochastic Bandits with Graph Feedback

    We consider stochastic multi-armed bandit problems with graph feedback, where the decision maker is allowed to observe the neighboring actions of the chosen action. We allow the graph structure to vary with time and consider both deterministic and Erdős–Rényi random graph models. For such a graph feedback model, we first present a novel analysis of Thompson sampling that leads to a tighter performance bound than existing work. Next, we propose new Information Directed Sampling based policies that are graph-aware in their decision making. Under the deterministic graph case, we establish a Bayesian regret bound for the proposed policies that scales with the clique cover number of the graph instead of the number of actions. Under the random graph case, we provide a Bayesian regret bound for the proposed policies that scales with the ratio of the number of actions over the expected number of observations per iteration. To the best of our knowledge, this is the first analytical result for stochastic bandits with random graph feedback. Finally, using numerical evaluations, we demonstrate that our proposed IDS policies outperform existing approaches, including adaptations of upper confidence bound, $\epsilon$-greedy and Exp3 algorithms. Comment: Accepted by AAAI 201
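
    The graph feedback model lends itself to a short simulation. Below is a minimal sketch, assuming Bernoulli rewards and a fixed undirected feedback graph, of plain Thompson Sampling that updates the posterior of every arm it observes (the pulled arm and its neighbors). It illustrates the feedback model only; it is not the IDS policy proposed in the paper, and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
means = rng.uniform(0.2, 0.8, size=K)   # hidden Bernoulli means
adj = np.eye(K, dtype=bool)             # feedback graph (self-loops included)
adj[0, 1] = adj[1, 0] = True
adj[2, 3] = adj[3, 2] = True

alpha = np.ones(K)                      # Beta posterior parameters per arm
beta = np.ones(K)

for t in range(2000):
    theta = rng.beta(alpha, beta)       # one posterior sample per arm
    a = int(np.argmax(theta))           # play the sampled best arm
    observed = np.flatnonzero(adj[a])   # arm a plus its neighbors are revealed
    rewards = rng.binomial(1, means[observed])
    alpha[observed] += rewards          # update every observed arm's posterior
    beta[observed] += 1 - rewards

print("estimated means:", np.round(alpha / (alpha + beta), 2))
print("true means     :", np.round(means, 2))
```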

    Analysis of Thompson Sampling for Graphical Bandits Without the Graphs

    We study multi-armed bandit problems with graph feedback, in which the decision maker is allowed to observe the neighboring actions of the chosen action, in a setting where the graph may vary over time and is never fully revealed to the decision maker. We show that when the feedback graphs are undirected, the original Thompson Sampling achieves the optimal (within logarithmic factors) regret $\tilde{O}(\sqrt{\beta_0(G)T})$ over time horizon $T$, where $\beta_0(G)$ is the average independence number of the latent graphs. To the best of our knowledge, this is the first result showing that the original Thompson Sampling is optimal for graphical bandits in the undirected setting. A slightly weaker regret bound of Thompson Sampling in the directed setting is also presented. To fill this gap, we propose a variant of Thompson Sampling that attains the optimal regret in the directed setting within a logarithmic factor. Both algorithms can be implemented efficiently and do not require knowledge of the feedback graphs at any time. Comment: Accepted by UAI 201
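
    For intuition about the quantity $\beta_0(G)$ appearing in the bound, here is a tiny illustrative brute-force routine (exponential time, suitable only for small graphs) that computes the independence number of an undirected graph given as an edge list.

```python
from itertools import combinations

def independence_number(n, edges):
    """Brute-force size of the largest vertex set with no internal edges."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(n, 0, -1):                       # try large sets first
        for subset in combinations(range(n), size):
            if all(frozenset(pair) not in edge_set
                   for pair in combinations(subset, 2)):
                return size
    return 0

# Example: a 5-cycle has independence number 2.
print(independence_number(5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]))
```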

    Feedback graph regret bounds for Thompson Sampling and UCB

    We study the stochastic multi-armed bandit problem with the graph-based feedback structure introduced by Mannor and Shamir. We analyze the performance of the two most prominent stochastic bandit algorithms, Thompson Sampling and Upper Confidence Bound (UCB), in the graph-based feedback setting. We show that these algorithms achieve regret guarantees that combine the graph structure and the gaps between the means of the arm distributions. Surprisingly, this holds despite the fact that these algorithms do not explicitly use the graph structure to select arms; they observe the additional feedback but do not explore based on it. Towards this result we introduce a "layering technique" highlighting the commonalities in the two algorithms. Comment: Appeared in ALT 202
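
    As a rough illustration of the point that these algorithms need not plan with the graph, the sketch below runs vanilla UCB1 (a standard textbook index, not necessarily the exact variant analysed in the paper) in a toy graph-feedback environment: the index uses only each arm's own statistics, but every side observation updates those statistics.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
means = np.array([0.3, 0.5, 0.6, 0.8])
adj = np.eye(K, dtype=bool)
adj[0, 3] = adj[3, 0] = True   # pulling arm 0 also reveals arm 3, and vice versa

counts = np.zeros(K)
sums = np.zeros(K)

for t in range(1, 3001):
    if np.any(counts == 0):
        a = int(np.argmin(counts))                       # forced initial exploration
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    observed = np.flatnonzero(adj[a])
    rewards = rng.binomial(1, means[observed])
    counts[observed] += 1                                # side observations count too
    sums[observed] += rewards

print("empirical means:", np.round(sums / counts, 2))
print("pull counts    :", counts)
```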

    An Information-Theoretic Approach to Minimax Regret in Partial Monitoring

    We prove a new minimax theorem connecting the worst-case Bayesian regret and minimax regret under partial monitoring with no assumptions on the space of signals or decisions of the adversary. We then generalise the information-theoretic tools of Russo and Van Roy (2016) for proving Bayesian regret bounds and combine them with the minimax theorem to derive minimax regret bounds for various partial monitoring settings. The highlight is a clean analysis of 'non-degenerate easy' and 'hard' finite partial monitoring, with new regret bounds that are independent of arbitrarily large game-dependent constants. The power of the generalised machinery is further demonstrated by proving that the minimax regret for $k$-armed adversarial bandits is at most $\sqrt{2kn}$, improving on existing results by a factor of 2. Finally, we provide a simple analysis of the cops and robbers game, also improving best known constants. Comment: 29 pages, to appear in COLT 201
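
    For readers unfamiliar with the framework, a finite partial monitoring game is just a pair of matrices: a loss matrix the learner suffers from and a signal matrix it observes from. The snippet below encodes the classic "apple tasting" game as an illustrative instance; it is not one of the games analysed in the paper.

```python
import numpy as np

# outcomes: column 0 = apple is rotten, column 1 = apple is fresh
L = np.array([[0.0, 1.0],    # action 0: taste (wastes a fresh apple)
              [1.0, 0.0]])   # action 1: sell  (selling a rotten apple costs 1)
Phi = np.array([["rotten", "fresh"],   # tasting reveals the outcome
                ["none", "none"]])     # selling reveals nothing

rng = np.random.default_rng(2)
outcome = rng.integers(2)              # adversary's choice for this round
action = rng.integers(2)               # learner's choice for this round
print("loss suffered:", L[action, outcome], "| signal seen:", Phi[action, outcome])
```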

    Preference-based Online Learning with Dueling Bandits: A Survey

    In machine learning, the notion of multi-armed bandits refers to a class of online learning problems, in which an agent is supposed to simultaneously explore and exploit a given set of choice alternatives in the course of a sequential decision process. In the standard setting, the agent learns from stochastic feedback in the form of real-valued rewards. In many applications, however, numerical reward signals are not readily available -- instead, only weaker information is provided, in particular relative preferences in the form of qualitative comparisons between pairs of alternatives. This observation has motivated the study of variants of the multi-armed bandit problem, in which more general representations are used both for the type of feedback to learn from and the target of prediction. The aim of this paper is to provide a survey of the state of the art in this field, referred to as preference-based multi-armed bandits or dueling bandits. To this end, we provide an overview of problems that have been considered in the literature as well as methods for tackling them. Our taxonomy is mainly based on the assumptions made by these methods about the data-generating process and, related to this, the properties of the preference-based feedback.
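
    To make the feedback model concrete, here is a small simulation of dueling-bandit feedback: the learner only observes which of two queried arms wins a noisy comparison. The pair-selection rule is a crude Thompson-style heuristic written purely for illustration, not one of the algorithms covered by the survey, and the preference matrix is an assumed toy instance.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 4
# hidden preference matrix: P[i, j] = probability that arm i beats arm j,
# here consistent with the total order 3 > 2 > 1 > 0
P = np.full((K, K), 0.5)
for i in range(K):
    for j in range(K):
        if i != j:
            P[i, j] = 0.5 + 0.1 * (i - j)

wins = np.ones((K, K))      # Beta(1, 1) prior on each pairwise win probability
losses = np.ones((K, K))

for t in range(5000):
    theta = rng.beta(wins, losses)                 # sampled pairwise win probabilities
    np.fill_diagonal(theta, 0.5)                   # ignore self-comparisons
    i = int(np.argmax((theta > 0.5).sum(axis=1)))  # crude Copeland-style pick
    others = [j for j in range(K) if j != i]
    j = min(others, key=lambda j: theta[i, j])     # toughest sampled opponent
    if rng.random() < P[i, j]:                     # run the duel, observe only the winner
        wins[i, j] += 1; losses[j, i] += 1
    else:
        losses[i, j] += 1; wins[j, i] += 1

print("estimated P(row beats column):\n", np.round(wins / (wins + losses), 2))
```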

    Contextual Bandits with Latent Confounders: An NMF Approach

    Motivated by online recommendation and advertising systems, we consider a causal model for stochastic contextual bandits with a latent low-dimensional confounder. In our model, there are $L$ observed contexts and $K$ arms of the bandit. The observed context influences the reward obtained through a latent confounder variable with cardinality $m$ ($m \ll L, K$). The arm choice and the latent confounder causally determine the reward, while the observed context is correlated with the confounder. Under this model, the $L \times K$ mean reward matrix $\mathbf{U}$ (for each context in $[L]$ and each arm in $[K]$) factorizes into non-negative factors $\mathbf{A}$ ($L \times m$) and $\mathbf{W}$ ($m \times K$). This insight enables us to propose an $\epsilon$-greedy NMF-Bandit algorithm that designs a sequence of interventions (selecting specific arms) that achieves a balance between learning this low-dimensional structure and selecting the best arm to minimize regret. Our algorithm achieves a regret of $\mathcal{O}(L\,\mathrm{poly}(m, \log K)\log T)$ at time $T$, as compared to $\mathcal{O}(LK\log T)$ for conventional contextual bandits, assuming a constant gap between the best arm and the rest for each context. These guarantees are obtained under mild sufficiency conditions on the factors that are weaker versions of the well-known Statistical RIP condition. We further propose a class of generative models that satisfy our sufficient conditions, and derive a lower bound of $\mathcal{O}(Km\log T)$. These are the first regret guarantees for online matrix completion with bandit feedback when the rank is greater than one. We further compare the performance of our algorithm with the state of the art on synthetic and real-world data sets. Comment: 37 pages, 2 figures
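
    The low-rank structure is easy to mock up. Below is a minimal sketch in the spirit of the model, not the paper's NMF-Bandit algorithm: the L x K mean-reward matrix is generated as a product of non-negative factors, and an epsilon-greedy loop periodically refits a rank-m NMF (here scikit-learn's) to the empirical reward matrix to guide exploitation. Dimensions, schedules, and the refit rule are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(4)
L_ctx, K, m = 20, 10, 2
A = rng.random((L_ctx, m))          # hidden non-negative factors
W = rng.random((m, K))
U = A @ W
U /= U.max()                        # Bernoulli mean-reward matrix in [0, 1]

sums = np.zeros((L_ctx, K))
counts = np.zeros((L_ctx, K))
U_hat = np.full((L_ctx, K), 0.5)    # current low-rank estimate of U
eps = 0.1

for t in range(20000):
    ctx = rng.integers(L_ctx)                         # context shown to the learner
    if rng.random() < eps:
        arm = int(rng.integers(K))                    # uniform exploration
    else:
        arm = int(np.argmax(U_hat[ctx]))              # exploit the low-rank estimate
    reward = rng.binomial(1, U[ctx, arm])
    sums[ctx, arm] += reward
    counts[ctx, arm] += 1
    if t % 2000 == 1999:                              # periodically refit a rank-m NMF
        empirical = sums / np.maximum(counts, 1)
        nmf = NMF(n_components=m, init="random", random_state=0, max_iter=500)
        A_hat = nmf.fit_transform(empirical)
        U_hat = A_hat @ nmf.components_

print("max |U_hat - U|:", np.abs(U_hat - U).max())
```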

    Causal Bandits: Learning Good Interventions via Causal Inference

    We study the problem of using causal models to improve the rate at which good interventions can be learned online in a stochastic environment. Our formalism combines multi-armed bandits and causal inference to model a novel type of bandit feedback that is not exploited by existing approaches. We propose a new algorithm that exploits the causal feedback and prove a bound on its simple regret that is strictly better (in all quantities) than algorithms that do not use the additional causal information.
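
    As a toy illustration of the kind of feedback being exploited (not the paper's algorithm or regret analysis), the sketch below uses the "parallel graph" commonly used in the causal-bandit literature: independent binary causes feed into Y, every variable is observed after each round, and because the causes are unconfounded roots, a single observational round informs many interventional arms at once. The structural equations here are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 6
p = rng.uniform(0.2, 0.8, size=N)    # P(X_j = 1) when X_j is not intervened on
w = rng.uniform(-1, 1, size=N)       # Y ~ Bernoulli(sigmoid(w . X))

def pull(i=None, x=None):
    """One round: optionally intervene do(X_i = x), then observe all X's and Y."""
    X = (rng.random(N) < p).astype(float)
    if i is not None:
        X[i] = x
    y = float(rng.random() < 1.0 / (1.0 + np.exp(-w @ X)))
    return X, y

# arms are (variable, value) pairs; since the X's are independent causes of Y,
# observational rounds give unbiased estimates for every arm at once
sums = np.zeros((N, 2))
counts = np.zeros((N, 2))
for t in range(6000):
    if t % 2 == 0:
        X, y = pull()                                 # observational round
        for j in range(N):
            sums[j, int(X[j])] += y
            counts[j, int(X[j])] += 1
    else:
        i, x = int(rng.integers(N)), int(rng.integers(2))
        X, y = pull(i, x)                             # interventional round
        sums[i, x] += y
        counts[i, x] += 1

print("estimated E[Y | do(X_j = x)]:\n",
      np.round(sums / np.maximum(counts, 1), 2))
```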

    First-Order Bayesian Regret Analysis of Thompson Sampling

    We address online combinatorial optimization when the player has a prior over the adversary's sequence of losses. In this framework, Russo and Van Roy proposed an information-theoretic analysis of Thompson Sampling based on the information ratio, resulting in optimal worst-case regret bounds. In this paper we introduce three novel ideas to this line of work. First, we propose a new quantity, the scale-sensitive information ratio, which allows us to obtain more refined first-order regret bounds (i.e., bounds of the form $\sqrt{L^*}$ where $L^*$ is the loss of the best combinatorial action). Second, we replace the entropy over combinatorial actions by a coordinate entropy, which allows us to obtain the first optimal worst-case bound for Thompson Sampling in the combinatorial setting. Finally, we introduce a novel link between Bayesian agents and frequentist confidence intervals. Combining these ideas, we show that the classical multi-armed bandit first-order regret bound $\tilde{O}(\sqrt{d L^*})$ still holds true in the more challenging and more general semi-bandit scenario. This latter result improves the previous state-of-the-art bound $\tilde{O}(\sqrt{(d+m^3)L^*})$ by Lykouris, Sridharan and Tardos. Comment: 42 pages. v2 adds results on graphical feedback and contextual bandit, and tightens previous results using Tsallis entropy and log barrier
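
    To fix ideas about the semi-bandit setting the bounds refer to, here is a generic sketch not tied to the paper's analysis: an action is a size-m subset of d coordinates, the learner observes the loss of each chosen coordinate, and Thompson Sampling keeps an independent Beta posterior per coordinate. The Bernoulli losses and problem sizes are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
d, m = 10, 3
loss_means = rng.uniform(0.1, 0.9, size=d)   # Bernoulli loss of each coordinate

alpha = np.ones(d)    # Beta posterior on each coordinate's loss mean
beta = np.ones(d)

total_loss = 0.0
T = 5000
for t in range(T):
    theta = rng.beta(alpha, beta)                  # sampled loss means
    action = np.argsort(theta)[:m]                 # play the m smallest sampled losses
    losses = rng.binomial(1, loss_means[action])   # semi-bandit feedback, per coordinate
    alpha[action] += losses
    beta[action] += 1 - losses
    total_loss += losses.sum()

print("average per-round loss :", total_loss / T)
print("best possible per round:", np.sort(loss_means)[:m].sum())
```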

    Horde of Bandits using Gaussian Markov Random Fields

    The gang of bandits (GOB) model (Cesa-Bianchi et al., 2013) is a recent contextual bandits framework that shares information between a set of bandit problems, related by a known (possibly noisy) graph. This model is useful in problems like recommender systems where the large number of users makes it vital to transfer information between users. Despite its effectiveness, the existing GOB model can only be applied to small problems due to its quadratic time-dependence on the number of nodes. Existing solutions to combat the scalability issue require an often-unrealistic clustering assumption. By exploiting a connection to Gaussian Markov random fields (GMRFs), we show that the GOB model can be made to scale to much larger graphs without additional assumptions. In addition, we propose a Thompson sampling algorithm which uses the recent GMRF sampling-by-perturbation technique, allowing it to scale to even larger problems (leading to a "horde" of bandits). We give regret bounds and experimental results for GOB with Thompson sampling and epoch-greedy algorithms, indicating that these methods are as good as or significantly better than ignoring the graph or adopting a clustering-based approach. Finally, when an existing graph is not available, we propose a heuristic for learning it on the fly and show promising results.
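
    The sampling-by-perturbation idea can be shown in a few lines. This is a hedged, generic sketch of the Gaussian "perturb the potentials, then solve the same linear system" trick on a toy chain-graph GMRF with a handful of noisy node observations; it is not the paper's bandit algorithm, but it does produce exact posterior samples while only requiring linear solves against the (sparse, in real applications) posterior precision.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
# a simple tridiagonal GMRF prior precision (chain graph plus a small ridge)
Q_prior = (np.diag(2.0 * np.ones(n))
           - np.diag(np.ones(n - 1), 1)
           - np.diag(np.ones(n - 1), -1)
           + 0.1 * np.eye(n))

obs_idx = np.array([3, 10, 17, 24])          # nodes with noisy observations
tau = 4.0                                    # observation precision
x_true = 0.3 * np.cumsum(rng.normal(size=n))
y = x_true[obs_idx] + rng.normal(scale=tau ** -0.5, size=obs_idx.size)

A = np.zeros((obs_idx.size, n))
A[np.arange(obs_idx.size), obs_idx] = 1.0
Q_post = Q_prior + tau * A.T @ A
mean = np.linalg.solve(Q_post, tau * A.T @ y)     # posterior mean

C = np.linalg.cholesky(Q_prior)                   # Q_prior = C C^T

def sample_by_perturbation():
    b_prior = C @ rng.normal(size=n)              # ~ N(0, Q_prior)
    y_tilde = y + rng.normal(scale=tau ** -0.5, size=obs_idx.size)
    return np.linalg.solve(Q_post, b_prior + tau * A.T @ y_tilde)

samples = np.stack([sample_by_perturbation() for _ in range(2000)])
print("max |sample mean - posterior mean|:", np.abs(samples.mean(0) - mean).max())
```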

    On the Performance of Thompson Sampling on Logistic Bandits

    We study the logistic bandit, in which rewards are binary with success probability $\exp(\beta a^\top \theta) / (1 + \exp(\beta a^\top \theta))$ and actions $a$ and coefficients $\theta$ are within the $d$-dimensional unit ball. While prior regret bounds for algorithms that address the logistic bandit exhibit exponential dependence on the slope parameter $\beta$, we establish a regret bound for Thompson sampling that is independent of $\beta$. Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $\tilde{O}(d\sqrt{T})$. We also establish a $\tilde{O}(\sqrt{d\eta T}/\lambda)$ bound that applies more broadly, where $\lambda$ is the worst-case optimal log-odds and $\eta$ is the "fragility dimension," a new statistic we define to capture the degree to which an optimal action for one model fails to satisfice for others. We demonstrate that the fragility dimension plays an essential role by showing that, for any $\epsilon > 0$, no algorithm can achieve $\mathrm{poly}(d, 1/\lambda)\cdot T^{1-\epsilon}$ regret. Comment: Accepted for presentation at the Conference on Learning Theory (COLT) 201
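
    For concreteness, the sketch below simulates the reward model from the abstract and runs Thompson sampling with a Laplace approximation to the posterior. This is a common practical stand-in; the paper's analysis concerns exact-posterior Thompson sampling, and the prior, slope value, and finite action set used here are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
d, slope = 3, 3.0                     # "slope" plays the role of beta in the abstract
theta_true = rng.normal(size=d)
theta_true /= np.linalg.norm(theta_true)          # coefficients in the unit ball

actions = rng.normal(size=(50, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)

def p_success(a, th):
    return 1.0 / (1.0 + np.exp(-slope * a @ th))

X, r = [], []
for t in range(300):
    if X:
        Xa, ra = np.array(X), np.array(r)
        def neg_log_post(th):         # N(0, I) prior + Bernoulli-logistic likelihood
            z = slope * Xa @ th
            return 0.5 * th @ th + np.sum(np.logaddexp(0.0, z) - ra * z)
        th_map = minimize(neg_log_post, np.zeros(d), method="BFGS").x
        pr = 1.0 / (1.0 + np.exp(-slope * Xa @ th_map))
        H = np.eye(d) + slope ** 2 * (Xa * (pr * (1 - pr))[:, None]).T @ Xa
        th_sample = rng.multivariate_normal(th_map, np.linalg.inv(H))
    else:
        th_sample = rng.normal(size=d)             # draw from the prior on round one
    a = actions[int(np.argmax(actions @ th_sample))]
    X.append(a)
    r.append(float(rng.random() < p_success(a, theta_true)))

best = actions[int(np.argmax(actions @ theta_true))]
print("mean of last pulled arm:", round(p_success(a, theta_true), 3),
      "| best arm mean:", round(p_success(best, theta_true), 3))
```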