74 research outputs found

    Thompson Sampling: An Asymptotically Optimal Finite Time Analysis

    Full text link
    The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, experiments that have been lacking in the literature until now for the Bernoulli case.Comment: 15 pages, 2 figures, submitted to ALT (Algorithmic Learning Theory

    A two-armed bandit based scheme for accelerated decentralized learning

    Get PDF
    The two-armed bandit problem is a classical optimization problem where a decision maker sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus, one must balance between exploiting existing knowledge about the arms, and obtaining new information. Bandit problems are particularly fascinating because a large class of real world problems, including routing, QoS control, game playing, and resource allocation, can be solved in a decentralized manner when modeled as a system of interacting gambling machines. Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel scheme for decentralized decision making based on the Goore Game in which each decision maker is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyper parameters of sibling conjugate priors, and on random sampling from these posteriors. We further report theoretical results on the variance of the random rewards experienced by each individual decision maker. Based on these theoretical results, each decision maker is able to accelerate its own learning by taking advantage of the increasingly more reliable feedback that is obtained as exploration gradually turns into exploitation in bandit problem based learning. Extensive experiments demonstrate that the accelerated learning allows us to combine the benefits of conservative learning, which is high accuracy, with the benefits of hurried learning, which is fast convergence. In this manner, our scheme outperforms recently proposed Goore Game solution schemes, where one has to trade off accuracy with speed. We thus believe that our methodology opens avenues for improved performance in a number of applications of bandit based decentralized decision making

    Further Optimal Regret Bounds for Thompson Sampling

    Full text link
    Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state of the art methods. In this paper, we provide a novel regret analysis for Thompson Sampling that simultaneously proves both the optimal problem-dependent bound of (1+ϵ)ilnTΔi+O(Nϵ2)(1+\epsilon)\sum_i \frac{\ln T}{\Delta_i}+O(\frac{N}{\epsilon^2}) and the first near-optimal problem-independent bound of O(NTlnT)O(\sqrt{NT\ln T}) on the expected regret of this algorithm. Our near-optimal problem-independent bound solves a COLT 2012 open problem of Chapelle and Li. The optimal problem-dependent regret bound for this problem was first proven recently by Kaufmann et al. [ALT 2012]. Our novel martingale-based analysis techniques are conceptually simple, easily extend to distributions other than the Beta distribution, and also extend to the more general contextual bandits setting [Manuscript, Agrawal and Goyal, 2012].Comment: arXiv admin note: substantial text overlap with arXiv:1111.179

    Towards Thompson Sampling for Complex Bayesian Reasoning

    Get PDF
    Paper III, IV, and VI are not available as a part of the dissertation due to the copyright.Thompson Sampling (TS) is a state-of-art algorithm for bandit problems set in a Bayesian framework. Both the theoretical foundation and the empirical efficiency of TS is wellexplored for plain bandit problems. However, the Bayesian underpinning of TS means that TS could potentially be applied to other, more complex, problems as well, beyond the bandit problem, if suitable Bayesian structures can be found. The objective of this thesis is the development and analysis of TS-based schemes for more complex optimization problems, founded on Bayesian reasoning. We address several complex optimization problems where the previous state-of-art relies on a relatively myopic perspective on the problem. These includes stochastic searching on the line, the Goore game, the knapsack problem, travel time estimation, and equipartitioning. Instead of employing Bayesian reasoning to obtain a solution, they rely on carefully engineered rules. In all brevity, we recast each of these optimization problems in a Bayesian framework, introducing dedicated TS based solution schemes. For all of the addressed problems, the results show that besides being more effective, the TS based approaches we introduce are also capable of solving more adverse versions of the problems, such as dealing with stochastic liars.publishedVersio

    Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game

    Get PDF
    The two-armed bandit problem is a classical optimization problem where a decision maker sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus, one must balance between exploiting existing knowledge about the arms, and obtaining new information. Bandit problems are particularly fascinating because a large class of real world problems, including routing, Quality of Service (QoS) control, game playing, and resource allocation, can be solved in a decentralized manner when modeled as a system of interacting gambling machines. Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel scheme for decentralized decision making based on the Goore Game in which each decision maker is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyper parameters of sibling conjugate priors, and on random sampling from these posteriors. We further report theoretical results on the variance of the random rewards experienced by each individual decision maker. Based on these theoretical results, each decision maker is able to accelerate its own learning by taking advantage of the increasingly more reliable feedback that is obtained as exploration gradually turns into exploitation in bandit problem based learning. Extensive experiments, involving QoS control in simulated wireless sensor networks, demonstrate that the accelerated learning allows us to combine the benefits of conservative learning, which is high accuracy, with the benefits of hurried learning, which is fast convergence. In this manner, our scheme outperforms recently proposed Goore Game solution schemes, where one has to trade off accuracy with speed. As an additional benefit, performance also becomes more stable. We thus believe that our methodology opens avenues for improved performance in a number of applications of bandit based decentralized decision making

    Learning Techniques in Multi-Armed Bandits

    Get PDF
    Multi-Armed bandit problem is a classic example of the exploration vs. exploitation dilemma in which a collection of one-armed bandits, each with unknown but fixed reward probability, is given. The key idea is to develop a strategy, which results in the arm with the highest reward probability to be played such that the total reward obtained is maximized. Although seemingly a simplistic problem, solution strategies are important because of their wide applicability in a myriad of areas such as adaptive routing, resource allocation, clinical trials, and more recently in the area of online recommendation of news articles, advertisements, coupons, etc. to name a few. In this dissertation, we present different types of Bayesian Inference based bandit algorithms for Two and Multiple Armed Bandits which use Order Statistics to select the next arm to play. The Bayesian strategies, also known in literature as Thompson Method are shown to function well for a whole range of values, including very small values, outperforming UCB and other commonly used strategies. Empirical analysis results show a significant improvement on multiple datasets. In the second part of the dissertation, two types of Successive Reduction (SR) strategies - 1) Successive Reduction Hoeffding (SRH) and 2) Successive Reduction Order Statistics (SRO) are introduced. Both use an Order Statistics based Sampling method for arm selection, and then successively eliminate bandit arms from consideration depending on a confidence threshold. While SRH uses Hoeffding Bounds for elimination, SRO uses the probability of an arm being superior to the currently selected arm to measure confidence. The empirical results show that the performance advantage of proposed SRO scheme increasing persistently with the number of bandit arms while the SRH scheme shows similar performance as pure Thompson Sampling Method. In the third part of the dissertation, the assumption of the reward probability being fixed is removed. We model problems where reward probabilities are drifting , and introduce a new method called Dynamic Thompson Sampling (DTS) which adapts the reward probability estimate faster than traditional schemes and thus leads to improved performance in terms of lower regret. Our empirical results demonstrate that DTS method outperforms the state-of-the-art techniques, namely pure Thompson Sampling, UCB-Normal and UCB-f, for the case of dynamic reward probabilities. Furthermore, the performance advantage of the proposed DTS scheme increases persistently with the number of bandit arms. In the last part of the dissertation, we delve into arm space decomposition and use of multiple agents in the Bandit process. The three most important characteristics of a multi-agent systems are 1) Autonomy --- agents are completely or partially autonomous, 2) Local views --- agents are restricted to a local view of information, and 3) Decentralization of control --- each agent influences a limited part of the overall decision space. We study and compare Centralized vs. Decentralized Sampling Algorithm in Multi-Armed Bandit problems in the context of common payoff games . In the Centralized Decision Making, a central agent maintains a global view of the currently available information and makes a decision to choose the next arm just as the regular Bayesian Algorithm. In Decentralized Decision Making, each agent maintains a local view of the arms and makes decisions just based on the local information available at its end without communicating with other agents. The Decentralized Decision Making is modeled as a Game Theory problem. Our results show that the Decentralized systems perform well for both the cases of Pure as well Mixed Nash equilibria and their performance scales well with the increase in the number of arms due to reduced dimensionality of the space. We thus believe that this dissertation establishes Bayesian Multi-Armed bandit strategies as one of the prominent strategies in the field of bandits and opens up venues for new interesting research in the future
    corecore