73 research outputs found

    Heterogeneous Stochastic Interactions for Multiple Agents in a Multi-armed Bandit Problem

    We define and analyze a multi-agent multi-armed bandit problem in which decision-making agents can observe the choices and rewards of their neighbors. Neighbors are defined by a network graph with heterogeneous and stochastic interconnections. These interactions are determined by the sociability of each agent, which corresponds to the probability that the agent observes its neighbors. We design an algorithm for each agent to maximize its own expected cumulative reward and prove performance bounds that depend on the sociability of the agents and the network structure. We use the bounds to predict the rank ordering of agents according to their performance and verify the accuracy of this prediction analytically and computationally.
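    A minimal sketch of the core idea (not the paper's exact algorithm or notation): a UCB-style agent keeps per-arm statistics from its own pulls and, with probability equal to its sociability, folds in the choices and rewards it observes from neighbors. The class name `SocialUCBAgent` and its interface are illustrative assumptions.

```python
import numpy as np

class SocialUCBAgent:
    """UCB agent that also learns from stochastically observed neighbors (illustrative sketch)."""

    def __init__(self, n_arms, sociability, rng=None):
        self.n_arms = n_arms
        self.sociability = sociability          # probability of observing a neighbor
        self.counts = np.zeros(n_arms)          # own + observed pulls per arm
        self.sums = np.zeros(n_arms)            # own + observed rewards per arm
        self.t = 0
        self.rng = rng or np.random.default_rng()

    def select_arm(self):
        self.t += 1
        # Play each arm once before using the UCB index.
        untried = np.where(self.counts == 0)[0]
        if len(untried) > 0:
            return int(untried[0])
        ucb = self.sums / self.counts + np.sqrt(2 * np.log(self.t) / self.counts)
        return int(np.argmax(ucb))

    def update_own(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

    def observe_neighbor(self, arm, reward):
        # A neighbor's (arm, reward) pair is seen only with probability = sociability.
        if self.rng.random() < self.sociability:
            self.counts[arm] += 1
            self.sums[arm] += reward
```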

    Upper-Confidence Bound for Channel Selection in LPWA Networks with Retransmissions

    In this paper, we propose and evaluate different learning strategies based on Multi-Armed Bandit (MAB) algorithms. They allow Internet of Things (IoT) devices to improve both their network access and their autonomy, while taking into account the impact of the radio collisions they encounter. To that end, several heuristics employing Upper-Confidence Bound (UCB) algorithms are examined, in order to exploit the contextual information provided by the number of retransmissions. Our results show that UCB-based approaches obtain a significant improvement in terms of successful transmission probability. Furthermore, they reveal that pure UCB channel access is as efficient as more sophisticated learning strategies.
    Comment: The source code (MATLAB or Octave) used for the simulations and the figures is open-sourced under the MIT License at Bitbucket.org/scee_ietr/ucb_smart_retran
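    A minimal sketch of a plain UCB1 channel selector of the kind the abstract refers to as "pure UCB channel access", under the assumption that the reward is 1 for an acknowledged transmission and 0 for a collision or retransmission. The class `UCB1ChannelSelector` and its interface are hypothetical, not the paper's MATLAB/Octave code.

```python
import math

class UCB1ChannelSelector:
    """Pick a radio channel per uplink attempt with the UCB1 index (illustrative sketch)."""

    def __init__(self, n_channels):
        self.n = n_channels
        self.counts = [0] * n_channels          # attempts per channel
        self.successes = [0.0] * n_channels     # acknowledged transmissions per channel
        self.t = 0

    def choose_channel(self):
        self.t += 1
        for c in range(self.n):                 # try every channel once first
            if self.counts[c] == 0:
                return c
        def index(c):
            mean = self.successes[c] / self.counts[c]
            return mean + math.sqrt(2 * math.log(self.t) / self.counts[c])
        return max(range(self.n), key=index)

    def feedback(self, channel, acked):
        self.counts[channel] += 1
        self.successes[channel] += 1.0 if acked else 0.0

# Hypothetical usage: one selector per device; a retransmission simply calls
# choose_channel() again after a failed attempt and reports its own feedback.
```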

    Learning to Optimize under Non-Stationarity

    We introduce algorithms that achieve state-of-the-art \emph{dynamic regret} bounds for the non-stationary linear stochastic bandit setting. This setting captures natural applications such as dynamic pricing and ads allocation in a changing environment. We show how the difficulty posed by the non-stationarity can be overcome by a novel marriage between stochastic and adversarial bandit learning algorithms. Defining $d$, $B_T$, and $T$ as the problem dimension, the \emph{variation budget}, and the total time horizon, respectively, our main contributions are the tuned Sliding Window UCB (\texttt{SW-UCB}) algorithm with optimal $\widetilde{O}(d^{2/3}(B_T+1)^{1/3}T^{2/3})$ dynamic regret, and the tuning-free bandit-over-bandit (\texttt{BOB}) framework built on top of the \texttt{SW-UCB} algorithm with $\widetilde{O}(d^{2/3}(B_T+1)^{1/4}T^{3/4})$ dynamic regret.
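    A minimal sketch of the sliding-window idea behind SW-UCB, shown for the simpler K-armed case rather than the paper's linear setting: only the last W (arm, reward) pairs enter the empirical means, so samples gathered before a distribution change are eventually forgotten. The class name and window handling are illustrative assumptions.

```python
from collections import deque
import math

class SlidingWindowUCB:
    """K-armed sliding-window UCB (illustrative sketch of the windowing idea only)."""

    def __init__(self, n_arms, window):
        self.n_arms = n_arms
        self.history = deque(maxlen=window)   # (arm, reward) pairs inside the window

    def select_arm(self):
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        # Play any arm currently unseen within the window.
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm
        window_len = len(self.history)
        ucb = [sums[a] / counts[a] + math.sqrt(2 * math.log(window_len) / counts[a])
               for a in range(self.n_arms)]
        return max(range(self.n_arms), key=lambda a: ucb[a])

    def update(self, arm, reward):
        self.history.append((arm, reward))    # old samples fall out automatically
```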

    Online Influence Maximization in Non-Stationary Social Networks

    Social networks have been popular platforms for information propagation. An important use case is viral marketing: given a promotion budget, an advertiser can choose some influential users as the seed set and provide them free or discounted sample products; in this way, the advertiser hopes to increase the popularity of the product in the users' friend circles through the word-of-mouth effect, and thus maximize the number of users that information about the product can reach. There is a body of literature studying the influence maximization problem. Nevertheless, the existing studies mostly investigate the problem on a one-off basis, assuming fixed and known influence probabilities among users, or exact knowledge of the social network topology. In practice, the social network topology and the influence probabilities are typically unknown to the advertiser and can vary over time, e.g., as social ties are newly established, strengthened, or weakened. In this paper, we focus on a dynamic non-stationary social network and design a randomized algorithm, RSB, based on multi-armed bandit optimization, to maximize influence propagation over time. The algorithm produces a sequence of online decisions and calibrates its explore-exploit strategy using the outcomes of previous decisions. It is rigorously proven to achieve an upper-bounded regret and is applicable to large-scale social networks. The practical effectiveness of the algorithm is evaluated using both synthetic and real-world datasets, demonstrating that our algorithm outperforms previous stationary methods under non-stationary conditions.
    Comment: 10 pages. To appear in IEEE/ACM IWQoS 2016. Full version
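    A generic explore-exploit sketch for online seed selection under non-stationarity (this is not the RSB algorithm itself, whose details are not given in the abstract): the estimated spread credited to each candidate seed is discounted every round so stale feedback fades, and a small amount of random exploration is mixed into the otherwise greedy choice. All names and parameters are hypothetical.

```python
import random

class OnlineSeedSelector:
    """Discounted estimates plus epsilon-exploration for repeated seed selection (illustrative sketch)."""

    def __init__(self, candidates, budget, epsilon=0.1, discount=0.95):
        self.candidates = list(candidates)   # candidate seed users
        self.budget = budget                 # number of seeds per round
        self.epsilon = epsilon               # exploration probability per seed slot
        self.discount = discount             # forget old, possibly stale, feedback
        self.est = {u: 0.0 for u in self.candidates}      # discounted spread sum
        self.weight = {u: 0.0 for u in self.candidates}   # discounted observation count

    def choose_seeds(self):
        ranked = sorted(self.candidates,
                        key=lambda u: self.est[u] / (self.weight[u] + 1e-9),
                        reverse=True)
        seeds = []
        for _ in range(self.budget):
            if random.random() < self.epsilon:
                choice = random.choice([u for u in self.candidates if u not in seeds])
            else:
                choice = next(u for u in ranked if u not in seeds)
            seeds.append(choice)
        return seeds

    def feedback(self, spread_per_seed):
        # spread_per_seed: observed influence credited to each chosen seed this round.
        for u in self.est:                   # decay all estimates each round
            self.est[u] *= self.discount
            self.weight[u] *= self.discount
        for u, spread in spread_per_seed.items():
            self.est[u] += spread
            self.weight[u] += 1.0
```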

    Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism

    We consider undiscounted reinforcement learning (RL) in Markov decision processes (MDPs) under drifting non-stationarity, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Notably, learning non-stationary MDPs via the conventional optimistic exploration technique presents a unique challenge absent in existing (non-stationary) bandit learning settings. We overcome the challenge by a novel confidence widening technique that incorporates additional optimism.
    Comment: To appear in the proceedings of the 37th International Conference on Machine Learning. Shortened conference version of its journal version (available at arXiv:1906.02922).
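    A minimal sketch of two ingredients named in the abstract, sliding-window estimation and confidence widening, under assumed notation (window size W, widening parameter eta): transition estimates are built only from the last W observed transitions, and a UCRL2-style confidence radius is enlarged by eta to inject extra optimism. This is an illustration, not the SWUCRL2-CW algorithm itself.

```python
from collections import deque
import numpy as np

class SlidingWindowModel:
    """Windowed transition estimates with a widened confidence radius (illustrative sketch)."""

    def __init__(self, n_states, n_actions, window, eta, delta=0.05):
        self.nS, self.nA = n_states, n_actions
        self.eta = eta                        # confidence-widening parameter (assumed notation)
        self.delta = delta                    # confidence level
        self.buffer = deque(maxlen=window)    # most recent (s, a, s') transitions only

    def record(self, s, a, s_next):
        self.buffer.append((s, a, s_next))

    def estimate(self, s, a):
        counts = np.zeros(self.nS)
        for (si, ai, sn) in self.buffer:
            if si == s and ai == a:
                counts[sn] += 1
        n_sa = max(counts.sum(), 1.0)
        p_hat = counts / n_sa
        # UCRL2-style L1 confidence radius, plus the extra optimism eta from widening.
        radius = np.sqrt(2 * self.nS * np.log(2 * self.nA / self.delta) / n_sa) + self.eta
        return p_hat, radius
```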
    • …