
    Learning in A Changing World: Restless Multi-Armed Bandit with Unknown Dynamics

    We consider the restless multi-armed bandit (RMAB) problem with unknown dynamics, in which a player chooses M out of N arms to play at each time. The reward state of each arm transitions according to an unknown Markovian rule when the arm is played and evolves according to an arbitrary unknown random process when it is passive. The performance of an arm selection policy is measured by regret, defined as the reward loss with respect to the case where the player knows which M arms are the most rewarding and always plays those M best arms. We construct a policy with an interleaved exploration and exploitation epoch structure that achieves a regret of logarithmic order when arbitrary (but nontrivial) bounds on certain system parameters are known. When no knowledge about the system is available, we show that the proposed policy achieves a regret arbitrarily close to the logarithmic order. We further extend the problem to a decentralized setting where multiple distributed players share the arms without information exchange. Under both an exogenous restless model and an endogenous restless model, we show that a decentralized extension of the proposed policy preserves the logarithmic regret order as in the centralized setting. The results apply to adaptive learning in various dynamic systems and communication networks, as well as financial investment. Comment: 33 pages, 5 figures, submitted to IEEE Transactions on Information Theory, 201
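    To make the regret criterion above concrete, here is a minimal formalization consistent with the abstract's wording (a sketch only; the symbols mu_(m) for the m-th largest steady-state mean reward, A_t for the set of M arms played at time t, and r_a(t) for the reward collected from arm a are our notation, not necessarily the paper's):

```latex
% Reward lost relative to a player who always plays the M most rewarding arms:
R(T) = T \sum_{m=1}^{M} \mu_{(m)}
       \;-\; \mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{a \in A_t} r_a(t) \Big],
\qquad \text{goal: } R(T) = O(\log T).
```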

    Decentralized Restless Bandit with Multiple Players and Unknown Dynamics

    We consider decentralized restless multi-armed bandit problems with unknown dynamics and multiple players. The reward state of each arm transitions according to an unknown Markovian rule when the arm is played and evolves according to an arbitrary unknown random process when it is passive. Players activating the same arm at the same time collide and suffer reward loss. The objective is to maximize the long-term reward by designing a decentralized arm selection policy that handles the unknown reward models and the collisions among players. We construct a decentralized policy that achieves a regret of logarithmic order when an arbitrary nontrivial bound on certain system parameters is known. When no knowledge about the system is available, we extend the policy to achieve a regret arbitrarily close to the logarithmic order. The result finds applications in communication networks, financial investment, and industrial engineering. Comment: 7 pages, 2 figures, in Proc. of Information Theory and Applications Workshop (ITA), January, 201
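    The collision model described above is easy to state in code. The following is a minimal sketch of one round of play under that model (our own illustration with hypothetical names such as play_round and arm_means; it is not the paper's policy, which must additionally learn the unknown dynamics):

```python
import random

# Minimal sketch (our own illustration, not the paper's policy) of the
# collision model described above: players that activate the same arm
# at the same time collide and receive no reward.

def play_round(chosen_arms, arm_means):
    """chosen_arms[p] is the arm index picked by player p; arm_means[a] is
    the Bernoulli reward probability of arm a (unknown to the players)."""
    rewards = {}
    for player, arm in enumerate(chosen_arms):
        if chosen_arms.count(arm) > 1:
            rewards[player] = 0.0                      # lost to a collision
        else:
            rewards[player] = float(random.random() < arm_means[arm])
    return rewards

# Example: two players, three arms; picking distinct arms avoids collisions.
print(play_round([0, 2], [0.2, 0.5, 0.9]))
```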

    Decentralized Learning for Multi-player Multi-armed Bandits

    We consider the problem of distributed online learning with multiple players in multi-armed bandit (MAB) models. Each player can pick among multiple arms, and when a player picks an arm, it gets a reward. We consider both an i.i.d. reward model and a Markovian reward model. In the i.i.d. model, each arm is modelled as an i.i.d. process with an unknown distribution and an unknown mean. In the Markovian model, each arm is modelled as a finite, irreducible, aperiodic and reversible Markov chain with an unknown probability transition matrix and stationary distribution. The arms give different rewards to different players. If two players pick the same arm, there is a "collision", and neither of them gets any reward. There is no dedicated control channel for coordination or communication among the players; any other communication between the players is costly and adds to the regret. We propose an online index-based distributed learning policy, called the dUCB_4 algorithm, that trades off exploration vs. exploitation in the right way and achieves expected regret that grows at most as near-O(log^2 T). The motivation comes from opportunistic spectrum access by multiple secondary users in cognitive radio networks, wherein they must pick among various wireless channels that look different to different users. To the best of our knowledge, this is the first distributed learning algorithm for multi-player MABs. Comment: 33 pages, 3 figures. Submitted to IEEE Transactions on Information Theory
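    For readers unfamiliar with index-based policies, the sketch below shows a generic UCB1-style index (sample mean plus a confidence bonus), which illustrates the exploration-vs-exploitation trade-off mentioned above. It is not the paper's dUCB_4 index, whose exact form and epoch structure differ and should be taken from the paper itself.

```python
import math

# Generic UCB1-style index: empirical mean (exploitation) plus a confidence
# bonus that shrinks as an arm is pulled more often (exploration). Shown only
# as background; NOT the dUCB_4 index from the paper above.

def ucb_index(mean_reward, n_pulls, t):
    """Index value of an arm after n_pulls pulls, at global time t."""
    if n_pulls == 0:
        return float("inf")          # force every arm to be tried at least once
    return mean_reward + math.sqrt(2.0 * math.log(t) / n_pulls)

# At time t, a player would play the arm with the largest index value.
print(ucb_index(0.4, n_pulls=10, t=100))
```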

    Online Influence Maximization in Non-Stationary Social Networks

    Social networks have been popular platforms for information propagation. An important use case is viral marketing: given a promotion budget, an advertiser can choose some influential users as the seed set and provide them free or discounted sample products; in this way, the advertiser hopes to increase the popularity of the product in the users' friend circles through the word-of-mouth effect, and thus maximize the number of users that information about the product can reach. There is a body of literature studying the influence maximization problem. Nevertheless, the existing studies mostly investigate the problem on a one-off basis, assuming fixed, known influence probabilities among users, or knowledge of the exact social network topology. In practice, the social network topology and the influence probabilities are typically unknown to the advertiser and can vary over time, e.g., when social ties are newly established, strengthened, or weakened. In this paper, we focus on a dynamic, non-stationary social network and design a randomized algorithm, RSB, based on multi-armed bandit optimization, to maximize influence propagation over time. The algorithm produces a sequence of online decisions and calibrates its explore-exploit strategy using the outcomes of previous decisions. It is rigorously proven to achieve an upper-bounded regret in reward and is applicable to large-scale social networks. The practical effectiveness of the algorithm is evaluated using both synthetic and real-world datasets, demonstrating that our algorithm outperforms previous stationary methods under non-stationary conditions. Comment: 10 pages. To appear in IEEE/ACM IWQoS 2016. Full version
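    As a rough illustration of the explore-exploit loop described above, here is a toy sketch in which candidate seed users are treated as arms and the spread observed in past campaigns is the reward signal. This is our own epsilon-greedy illustration with hypothetical names (pick_seeds, spread_estimates), not the paper's randomized RSB algorithm.

```python
import random

# Toy explore-exploit seed selection: explore random seeds occasionally,
# otherwise exploit the seeds with the highest estimated spread so far.

def pick_seeds(candidates, spread_estimates, budget, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the best estimates."""
    if random.random() < epsilon:
        return random.sample(candidates, budget)
    ranked = sorted(candidates,
                    key=lambda u: spread_estimates.get(u, 0.0), reverse=True)
    return ranked[:budget]

def update_estimate(spread_estimates, counts, seed, observed_spread):
    """Maintain a running average of the spread observed when `seed` was used."""
    counts[seed] = counts.get(seed, 0) + 1
    old = spread_estimates.get(seed, 0.0)
    spread_estimates[seed] = old + (observed_spread - old) / counts[seed]

# Example: pick 2 seeds from 5 candidates, then record an observed spread of 17.
est, cnt = {}, {}
seeds = pick_seeds(list(range(5)), est, budget=2)
update_estimate(est, cnt, seeds[0], observed_spread=17)
```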

    Learning Techniques in Multi-Armed Bandits

    The multi-armed bandit problem is a classic example of the exploration vs. exploitation dilemma, in which a collection of one-armed bandits, each with an unknown but fixed reward probability, is given. The key idea is to develop a strategy that results in the arm with the highest reward probability being played, so that the total reward obtained is maximized. Although seemingly a simplistic problem, solution strategies are important because of their wide applicability in a myriad of areas such as adaptive routing, resource allocation, clinical trials, and more recently the online recommendation of news articles, advertisements, coupons, etc., to name a few. In this dissertation, we present different types of Bayesian-inference-based bandit algorithms for two-armed and multi-armed bandits which use order statistics to select the next arm to play. The Bayesian strategies, also known in the literature as the Thompson method, are shown to function well for a whole range of values, including very small values, outperforming UCB and other commonly used strategies. Empirical analysis results show a significant improvement on multiple datasets.

    In the second part of the dissertation, two types of Successive Reduction (SR) strategies are introduced: 1) Successive Reduction Hoeffding (SRH) and 2) Successive Reduction Order Statistics (SRO). Both use an order-statistics-based sampling method for arm selection, and then successively eliminate bandit arms from consideration depending on a confidence threshold. While SRH uses Hoeffding bounds for elimination, SRO uses the probability of an arm being superior to the currently selected arm to measure confidence. The empirical results show that the performance advantage of the proposed SRO scheme increases persistently with the number of bandit arms, while the SRH scheme shows performance similar to the pure Thompson Sampling method.

    In the third part of the dissertation, the assumption of fixed reward probabilities is removed. We model problems where reward probabilities are drifting, and introduce a new method called Dynamic Thompson Sampling (DTS), which adapts the reward probability estimate faster than traditional schemes and thus leads to improved performance in terms of lower regret. Our empirical results demonstrate that the DTS method outperforms the state-of-the-art techniques, namely pure Thompson Sampling, UCB-Normal and UCB-f, for the case of dynamic reward probabilities. Furthermore, the performance advantage of the proposed DTS scheme increases persistently with the number of bandit arms.

    In the last part of the dissertation, we delve into arm-space decomposition and the use of multiple agents in the bandit process. The three most important characteristics of a multi-agent system are 1) autonomy: agents are completely or partially autonomous; 2) local views: agents are restricted to a local view of information; and 3) decentralization of control: each agent influences a limited part of the overall decision space. We study and compare centralized vs. decentralized sampling algorithms for multi-armed bandit problems in the context of common-payoff games. In centralized decision making, a central agent maintains a global view of the currently available information and chooses the next arm just as the regular Bayesian algorithm does. In decentralized decision making, each agent maintains a local view of the arms and makes decisions based only on the local information available at its end, without communicating with other agents. Decentralized decision making is modeled as a game-theoretic problem. Our results show that the decentralized systems perform well for both pure and mixed Nash equilibria, and their performance scales well with an increasing number of arms due to the reduced dimensionality of the space. We thus believe that this dissertation establishes Bayesian multi-armed bandit strategies as one of the prominent strategies in the field of bandits and opens up avenues for interesting new research in the future.
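    Since several parts of the dissertation build on Thompson Sampling, a minimal Beta-Bernoulli sketch may help. The capped-count variant below only illustrates how an estimate can be made to track drifting reward probabilities; it is our assumption for illustration, not necessarily the dissertation's exact DTS update rule.

```python
import random

# Beta-Bernoulli Thompson Sampling: sample one posterior draw per arm and play
# the arm with the largest draw (an order-statistics-based selection). The
# `cap` option is our own illustrative tweak for drifting rewards, not the
# dissertation's exact DTS rule.

def thompson_pick(alpha, beta):
    """Return the index of the arm whose Beta(alpha, beta) sample is largest."""
    samples = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(samples)), key=lambda i: samples[i])

def update(alpha, beta, arm, reward, cap=None):
    """Standard posterior update; with `cap`, old evidence is scaled down so the
    effective sample size stays bounded and recent rewards dominate."""
    alpha[arm] += reward
    beta[arm] += 1 - reward
    if cap is not None and alpha[arm] + beta[arm] > cap:
        scale = cap / (alpha[arm] + beta[arm])
        alpha[arm] *= scale
        beta[arm] *= scale

# Example: 3 arms with uniform Beta(1, 1) priors.
alpha, beta = [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]
arm = thompson_pick(alpha, beta)
update(alpha, beta, arm, reward=1, cap=100)
```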