A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem
The multi-armed bandit problem has been extensively studied under the
stationary assumption. However, in reality, this assumption often does not hold
because the distributions of rewards themselves may change over time. In this
paper, we propose a change-detection (CD) based framework for multi-armed
bandit problems under the piecewise-stationary setting, and study a class of
change-detection based UCB (Upper Confidence Bound) policies, CD-UCB, that
actively detect change points and restart the UCB indices. We then develop
CUSUM-UCB and PHT-UCB, which belong to the CD-UCB class and use the cumulative sum
(CUSUM) test and the Page-Hinkley Test (PHT) to detect changes. We show that CUSUM-UCB
obtains the best known regret upper bound under mild assumptions. We also
demonstrate the regret reduction of the CD-UCB policies over arbitrary
Bernoulli rewards and Yahoo! datasets of webpage click-through rates.
Comment: accepted by AAAI 201
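The detect-and-restart idea described in the abstract above can be illustrated with a minimal sketch. This is not the authors' CUSUM-UCB algorithm: the class name CusumDetector, the function cd_ucb, the parameters warmup, eps, threshold and alpha, and the uniform-exploration step are all assumptions chosen for illustration. The sketch pairs a standard UCB1 index with a two-sided CUSUM test per arm and resets an arm's statistics whenever its detector flags a change.

```python
import math
import random


class CusumDetector:
    """Two-sided CUSUM test on one arm's rewards (illustrative parameters)."""

    def __init__(self, warmup=30, eps=0.05, threshold=5.0):
        self.warmup = warmup        # samples used to estimate the reference mean
        self.eps = eps              # drift tolerance subtracted at each step
        self.threshold = threshold  # alarm level for the cumulative statistics
        self.reset()

    def reset(self):
        self.n = 0
        self.ref_sum = 0.0
        self.g_pos = 0.0
        self.g_neg = 0.0

    def update(self, x):
        """Feed one reward; return True if a change is flagged."""
        self.n += 1
        if self.n <= self.warmup:
            self.ref_sum += x
            return False
        ref_mean = self.ref_sum / self.warmup
        self.g_pos = max(0.0, self.g_pos + x - ref_mean - self.eps)
        self.g_neg = max(0.0, self.g_neg + ref_mean - x - self.eps)
        return self.g_pos > self.threshold or self.g_neg > self.threshold


def cd_ucb(reward_fn, n_arms, horizon, alpha=0.05, seed=0):
    """Run a UCB1 policy with per-arm change detection and restarts.

    reward_fn(arm, t, rng) is a hypothetical environment callback.
    Returns the list of pulled arms and the number of restarts.
    """
    rng = random.Random(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    detectors = [CusumDetector() for _ in range(n_arms)]
    pulls, restarts = [], 0
    for t in range(horizon):
        if rng.random() < alpha:
            # Assumed uniform exploration so every arm keeps feeding its detector.
            arm = rng.randrange(n_arms)
        elif 0 in counts:
            # Pull any arm with no samples since its last restart.
            arm = counts.index(0)
        else:
            total = sum(counts)
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(total) / counts[i]),
            )
        r = reward_fn(arm, t, rng)
        counts[arm] += 1
        sums[arm] += r
        if detectors[arm].update(r):
            # Change detected: discard this arm's history and start over.
            counts[arm] = 0
            sums[arm] = 0.0
            detectors[arm].reset()
            restarts += 1
        pulls.append(arm)
    return pulls, restarts
```

A caller would supply its own reward_fn, for example one that draws Bernoulli rewards whose means switch at fixed times, and compare the pulls against those of the same policy with detection disabled. The thresholds here are untuned placeholders and would need to be chosen for the reward scale at hand.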
A Decentralized Communication Policy for Multi Agent Multi Armed Bandit Problems
This paper proposes a novel policy for a group of agents to, individually as
well as collectively, solve a multi armed bandit (MAB) problem. The policy
relies solely on the information that an agent has obtained through sampling of
the options on its own and through communication with neighbors. The option
selection policy is based on an Upper Confidence Bound (UCB) strategy, while the
communication strategy that is proposed forces agents to communicate with other
agents who they believe are more likely to be exploring than exploiting. The
overall strategy is shown to significantly outperform an independent
Erdős-Rényi (ER) graph based random communication policy. The policy is
shown to be cost effective in terms of communication and thus to be easily
scalable to a large network of agents.
Comment: This is the full version of a preprint that will appear in the
proceedings of the 2020 European Control Conference (ECC
Satisficing in multi-armed bandit problems
Satisficing is a relaxation of maximizing and allows for less risky decision
making in the face of uncertainty. We propose two sets of satisficing
objectives for the multi-armed bandit problem, where the objective is to
achieve reward-based decision-making performance above a given threshold. We
show that these new problems are equivalent to various standard multi-armed
bandit problems with maximizing objectives and use the equivalence to find
bounds on performance. The different objectives can result in qualitatively
different behavior; for example, agents explore their options continually in
one case and only a finite number of times in another. For the case of Gaussian
rewards we show an additional equivalence between the two sets of satisficing
objectives that allows algorithms developed for one set to be applied to the
other. We then develop variants of the Upper Credible Limit (UCL) algorithm
that solve the problems with satisficing objectives and show that these
modified UCL algorithms achieve efficient satisficing performance.
Comment: To appear in IEEE Transactions on Automatic Contro