Decentralized Learning for Multi-player Multi-armed Bandits
We consider the problem of distributed online learning with multiple players
in multi-armed bandits (MAB) models. Each player can pick among multiple arms.
When a player picks an arm, it gets a reward. We consider both an i.i.d. reward
model and a Markovian reward model. In the i.i.d. model, each arm is modelled as
an i.i.d. process with an unknown distribution and an unknown mean. In the
Markovian model, each arm is modelled as a finite, irreducible, aperiodic and
reversible Markov chain with an unknown probability transition matrix and
stationary distribution. The arms give different rewards to different players.
If two players pick the same arm, there is a "collision", and neither of them
gets any reward. There is no dedicated control channel for coordination or
communication among the players. Any other communication between the users is
costly and will add to the regret. We propose an online index-based distributed
learning policy, called the dUCB4 algorithm, that trades off
\textit{exploration v. exploitation} in the right way, and achieves expected
regret that grows at most as near-O(log^2 T). The motivation comes from
opportunistic spectrum access by multiple secondary users in cognitive radio
networks wherein they must pick among various wireless channels that look
different to different users. To the best of our knowledge, this is the first
distributed learning algorithm for multi-player MABs.
Comment: 33 pages, 3 figures. Submitted to IEEE Transactions on Information Theory.
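
A minimal Python sketch of the collision model may help fix ideas. It is an illustrative toy under assumptions of my own (Bernoulli arms with per-player means, independent UCB1 indices), not the paper's dUCB4 policy; in particular it does nothing to break the herding that the paper's policy is designed to resolve.

    # Toy simulation of the multi-player collision model (illustrative only).
    # Each player runs an independent UCB1 index; when two players pick the
    # same arm, there is a collision and both observe zero reward.
    import math
    import random

    def simulate(n_players=2, n_arms=5, horizon=10000, seed=0):
        rng = random.Random(seed)
        # Arms pay different (unknown) Bernoulli means to different players.
        means = [[rng.random() for _ in range(n_arms)] for _ in range(n_players)]
        counts = [[0] * n_arms for _ in range(n_players)]
        sums = [[0.0] * n_arms for _ in range(n_players)]

        def ucb(p, k, t):
            if counts[p][k] == 0:
                return float("inf")            # sample every arm at least once
            return (sums[p][k] / counts[p][k]
                    + math.sqrt(2 * math.log(t) / counts[p][k]))

        total = 0.0
        for t in range(1, horizon + 1):
            picks = []
            for p in range(n_players):
                picks.append(max(range(n_arms), key=lambda k: ucb(p, k, t)))
            for p, k in enumerate(picks):
                collided = picks.count(k) > 1  # collision: nobody is paid
                r = 0.0 if collided else float(rng.random() < means[p][k])
                counts[p][k] += 1
                sums[p][k] += r                # colliding players record zero
                total += r
        return total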
Deterministic Sequencing of Exploration and Exploitation for Multi-Armed Bandit Problems
In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with
unknown reward models. At each time, a player selects one arm to play, aiming
to maximize the total expected reward over a horizon of length T. An approach
based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is
developed for constructing sequential arm selection policies. It is shown that
for all light-tailed reward distributions, DSEE achieves the optimal
logarithmic order of the regret, where regret is defined as the total expected
reward loss against the ideal case with known reward models. For heavy-tailed
reward distributions, DSEE achieves O(T^(1/p)) regret when the moments of the
reward distributions exist up to the p-th order for 1 < p <= 2, and O(T^(1/(1+p/2)))
for p > 2. With the knowledge of an upper bound on a finite moment of the
heavy-tailed reward distributions, DSEE offers the optimal logarithmic regret
order. The proposed DSEE approach complements existing work on MAB by providing
corresponding results for general reward distributions. Furthermore, with a
clearly defined tunable parameter (the cardinality of the exploration sequence),
the DSEE approach is easily extendable to variations of MAB, including MAB with
various objectives, decentralized MAB with multiple players and incomplete
reward observations under collisions, MAB with unknown Markov dynamics, and
combinatorial MAB with dependent arms that often arise in network optimization
problems such as the shortest path, the minimum spanning tree, and the dominating
set problems under unknown random weights.
Comment: 22 pages, 2 figures.
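
The deterministic interleaving is simple enough to sketch. The Python toy below is a hedged reading of the DSEE idea, with all names and constants assumed by me: explore round-robin whenever the exploration count lags w*log(t), where w plays the role of the tunable cardinality parameter, and exploit the empirical best arm otherwise.

    # DSEE-style toy: deterministic exploration whenever its cumulative
    # budget lags w*log(t); otherwise exploit the best empirical mean.
    import math
    import random

    def dsee(means, horizon=20000, w=10.0, seed=0):
        rng = random.Random(seed)
        n = len(means)
        counts, sums = [0] * n, [0.0] * n
        explored = 0          # size of the exploration sequence used so far
        total = 0.0
        for t in range(1, horizon + 1):
            if explored < n or explored < w * math.log(t + 1):
                k = explored % n                 # round-robin exploration
                explored += 1
            else:
                k = max(range(n), key=lambda i: sums[i] / counts[i])
            r = float(rng.random() < means[k])   # Bernoulli arms in this toy
            counts[k] += 1
            sums[k] += r
            total += r
        return total, counts

Growing the exploration sequence faster than logarithmically is, roughly, how the approach trades the logarithmic order against the polynomial rates quoted above for heavy-tailed rewards.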
Decentralized Exploration in Multi-Armed Bandits
We consider the decentralized exploration problem: a set of players
collaborate to identify the best arm by asynchronously interacting with the
same stochastic environment. The objective is to ensure privacy in the best arm
identification problem between asynchronous, collaborative, and thrifty
players. In the context of a digital service, we advocate that this
decentralized approach allows a good balance between the interests of users and
those of service providers: the providers optimize their services, while
protecting the privacy of the users and saving resources. We define the privacy
level as the amount of information an adversary could infer by intercepting the
messages concerning a single user. We provide a generic algorithm Decentralized
Elimination, which uses any best arm identification algorithm as a subroutine.
We prove that this algorithm insures privacy, with a low communication cost,
and that in comparison to the lower bound of the best arm identification
problem, its sample complexity suffers from a penalty depending on the inverse
of the probability of the most frequent players. Then, thanks to the genericity
of the approach, we extend the proposed algorithm to non-stationary
bandits. Finally, experiments illustrate and complete the analysis.
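
Because Decentralized Elimination treats the best arm identification routine as a black box, a sketch of the simplest such subroutine may be useful. The Python toy below is single-player successive elimination with confidence bounds, under my own assumptions (Bernoulli arms, a unique best arm); the paper's protocol instead spreads these samples over asynchronous players and communicates only elimination decisions.

    # Toy successive-elimination subroutine (single-player, illustrative).
    # Arms whose upper confidence bound falls below the best lower bound
    # are eliminated; the survivor is returned with the sample count.
    import math
    import random

    def successive_elimination(means, delta=0.05, seed=0):
        rng = random.Random(seed)
        active = list(range(len(means)))
        counts = {k: 0 for k in active}
        sums = {k: 0.0 for k in active}
        t = 0
        while len(active) > 1:
            for k in active:               # one sample per active arm per round
                sums[k] += float(rng.random() < means[k])
                counts[k] += 1
                t += 1
            def bounds(k):
                radius = math.sqrt(math.log(4 * counts[k] ** 2 / delta)
                                   / (2 * counts[k]))
                return sums[k] / counts[k], radius
            stats = {k: bounds(k) for k in active}
            best_lcb = max(m - r for m, r in stats.values())
            active = [k for k in active if stats[k][0] + stats[k][1] >= best_lcb]
        return active[0], t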
Learning in A Changing World: Restless Multi-Armed Bandit with Unknown Dynamics
We consider the restless multi-armed bandit (RMAB) problem with unknown
dynamics in which a player chooses M out of N arms to play at each time. The
reward state of each arm transits according to an unknown Markovian rule when
it is played and evolves according to an arbitrary unknown random process when
it is passive. The performance of an arm selection policy is measured by
regret, defined as the reward loss with respect to the case where the player
knows which M arms are the most rewarding and always plays the M best arms. We
construct a policy with an interleaving exploration and exploitation epoch
structure that achieves a regret with logarithmic order when arbitrary (but
nontrivial) bounds on certain system parameters are known. When no knowledge
about the system is available, we show that the proposed policy achieves a
regret arbitrarily close to the logarithmic order. We further extend the
problem to a decentralized setting where multiple distributed players share the
arms without information exchange. Under both an exogenous restless model and
an endogenous restless model, we show that a decentralized extension of the
proposed policy preserves the logarithmic regret order as in the centralized
setting. The results apply to adaptive learning in various dynamic systems and
communication networks, as well as financial investment.
Comment: 33 pages, 5 figures. Submitted to IEEE Transactions on Information
Theory, 201
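
The interleaved epoch structure can be sketched independently of the arm statistics. In the Python toy below every constant is an assumption of mine and the within-epoch play is omitted; the point is only the mechanism: cumulative exploration time is kept near c*log(t), and exploitation epochs grow geometrically, which is what caps the regret contribution of exploration at a logarithmic order.

    # Toy epoch scheduler: interleave exploration and exploitation epochs,
    # keeping total exploration time close to c*log(t). The policy played
    # inside each epoch is omitted here.
    import math

    def epoch_schedule(horizon, c=20.0):
        t, n_exploit, explored = 0, 0, 0
        plan = []
        while t < horizon:
            if explored <= c * math.log(t + 2):
                length = 1 << n_exploit          # exploration epoch
                plan.append(("explore", length))
                explored += length
            else:
                length = 2 * (1 << n_exploit)    # exploitation epoch
                plan.append(("exploit", length))
                n_exploit += 1
            t += length
        return plan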
Distributed Learning in Multi-Armed Bandit with Multiple Players
We formulate and study a decentralized multi-armed bandit (MAB) problem.
There are M distributed players competing for N independent arms. Each arm,
when played, offers i.i.d. reward according to a distribution with an unknown
parameter. At each time, each player chooses one arm to play without exchanging
observations or any information with other players. Players choosing the same
arm collide, and, depending on the collision model, either no one receives
reward or the colliding players share the reward in an arbitrary way. We show
that the minimum system regret of the decentralized MAB grows with time at the
same logarithmic order as in the centralized counterpart where players act
collectively as a single entity by exchanging observations and making decisions
jointly. A decentralized policy is constructed to achieve this optimal order
while ensuring fairness among players and without assuming any pre-agreement or
information exchange among players. Based on a Time Division Fair Sharing
(TDFS) of the M best arms, the proposed policy is constructed and its order
optimality is proven under a general reward model. Furthermore, the basic
structure of the TDFS policy can be used with any order-optimal single-player
policy to achieve order optimality in the decentralized setting. We also
establish a lower bound on the system regret growth rate for a general class of
decentralized policies, to which the proposed policy belongs. This problem finds
potential applications in cognitive radio networks, multi-channel communication
systems, multi-agent systems, web search and advertising, and social networks.Comment: 31 pages, 8 figures, revised paper submitted to IEEE Transactions on
Signal Processing, April, 2010, the pre-agreement in the decentralized TDFS
policy is eliminated to achieve a complete decentralization among player
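
The time-division idea at the core of TDFS is easy to sketch. The Python toy below is illustrative only (the actual policy also specifies how each player learns the arm ranking from its own observations): M players rotate through the M currently best arms, so each player spends an equal share of time on each good arm and no two players target the same arm in a slot.

    # Toy round-robin targeting in the spirit of TDFS (illustrative only).
    def tdfs_targets(ranked_arms, n_players, t):
        # ranked_arms: arms sorted by estimated quality, best first
        best = ranked_arms[:n_players]
        offset = t % n_players            # rotate each slot for fairness
        return [best[(p + offset) % n_players] for p in range(n_players)]

    # Example: 3 players sharing the 3 best of 5 arms over 6 slots.
    for t in range(6):
        print(t, tdfs_targets([4, 1, 3, 0, 2], 3, t))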
Channel Selection for Network-assisted D2D Communication via No-Regret Bandit Learning with Calibrated Forecasting
We consider the distributed channel selection problem in the context of
device-to-device (D2D) communication as an underlay to a cellular network.
Underlaid D2D users communicate directly by utilizing the cellular spectrum but
their decisions are not governed by any centralized controller. Selfish D2D
users that compete for access to the resources form a distributed system,
where the transmission performance depends on channel availability and quality.
This information, however, is difficult to acquire. Moreover, the adverse
effects of D2D users on cellular transmissions should be minimized. In order to
overcome these limitations, we propose a network-assisted distributed channel
selection approach in which D2D users are only allowed to use vacant cellular
channels. This scenario is modeled as a multi-player multi-armed bandit game
with side information, for which a distributed algorithmic solution is
proposed. The solution is a combination of no-regret learning and calibrated
forecasting, and can be applied to a broad class of multi-player stochastic
learning problems, in addition to the formulated channel selection problem.
Analytically, it is established that this approach not only yields vanishing
regret (in comparison to the globally optimal solution), but also guarantees that
the empirical joint frequencies of the game converge to the set of correlated
equilibria.
Comment: 31 pages (one column), 9 figures.
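
For concreteness, here is a hedged Python sketch of regret matching, the classic no-regret rule whose empirical joint play converges to the set of correlated equilibria. It is not the paper's algorithm: the paper couples such a learner with calibrated forecasts of side information, and this toy further assumes full (counterfactual) payoff feedback rather than bandit feedback.

    # Toy regret matching (illustrative). Play each channel with probability
    # proportional to its positive cumulative regret.
    import random

    def regret_matching_step(cum_regret, rng):
        positive = [max(r, 0.0) for r in cum_regret]
        total = sum(positive)
        if total == 0.0:
            return rng.randrange(len(cum_regret))   # uniform when no regret
        x = rng.random() * total
        acc = 0.0
        for k, w in enumerate(positive):
            acc += w
            if x <= acc:
                return k
        return len(cum_regret) - 1

    def update_regret(cum_regret, payoffs, played):
        # payoffs[k]: what channel k would have paid this slot (full feedback)
        for k in range(len(cum_regret)):
            cum_regret[k] += payoffs[k] - payoffs[played]

    # One round with 3 channels:
    rng = random.Random(0)
    cum = [0.0, 0.0, 0.0]
    k = regret_matching_step(cum, rng)
    update_regret(cum, [0.2, 0.7, 0.4], k)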