Decentralized Learning for Multi-player Multi-armed Bandits
We consider the problem of distributed online learning with multiple players
in multi-armed bandits (MAB) models. Each player can pick among multiple arms.
When a player picks an arm, it gets a reward. We consider both i.i.d. reward
model and Markovian reward model. In the i.i.d. model each arm is modelled as
an i.i.d. process with an unknown distribution with an unknown mean. In the
Markovian model, each arm is modelled as a finite, irreducible, aperiodic and
reversible Markov chain with an unknown probability transition matrix and
stationary distribution. The arms give different rewards to different players.
If two players pick the same arm, there is a "collision", and neither of them
gets any reward. There is no dedicated control channel for coordination or
communication among the players. Any other communication between the users is
costly and will add to the regret. We propose an online index-based distributed
learning policy called algorithm that trades off
exploration vs. exploitation in the right way, and achieves expected
regret that grows at most as near-. The motivation comes from
opportunistic spectrum access by multiple secondary users in cognitive radio
networks wherein they must pick among various wireless channels that look
different to different users. This is the first distributed learning algorithm
for multi-player MABs to the best of our knowledge.
Comment: 33 pages, 3 figures. Submitted to IEEE Transactions on Information
Theor
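The collision rule described in the abstract above can be illustrated with a minimal sketch of a single round under the i.i.d. model. It is not the paper's learning policy: the function name `play_round`, the Bernoulli rewards, and the example means are assumptions made only for illustration.

```python
import random

def play_round(choices, mean_reward, rng):
    """Return the reward each player receives in one round.

    choices[i] is the arm picked by player i; mean_reward[i][k] is the
    (hypothetical) Bernoulli mean of arm k as seen by player i, so the
    same arm can pay differently to different players.
    """
    # Count how many players picked each arm.
    counts = {}
    for arm in choices:
        counts[arm] = counts.get(arm, 0) + 1

    rewards = []
    for player, arm in enumerate(choices):
        if counts[arm] > 1:
            # Collision: every player on this arm gets nothing.
            rewards.append(0.0)
        else:
            # Sole occupant: draw an i.i.d. Bernoulli reward.
            hit = rng.random() < mean_reward[player][arm]
            rewards.append(1.0 if hit else 0.0)
    return rewards

rng = random.Random(0)
mean_reward = [[0.9, 0.2, 0.5],   # player 0's view of the three arms
               [0.3, 0.8, 0.4]]   # player 1's view of the same arms
collided = play_round([0, 0], mean_reward, rng)   # both pick arm 0
separate = play_round([0, 1], mean_reward, rng)   # distinct arms
```

With no control channel, the players would have to infer a collision-free assignment from rewards like these alone.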
Online Influence Maximization in Non-Stationary Social Networks
Social networks have been popular platforms for information propagation. An
important use case is viral marketing: given a promotion budget, an advertiser
can choose some influential users as the seed set and provide them free or
discounted sample products; in this way, the advertiser hopes to increase the
popularity of the product in the users' friend circles through the word-of-mouth
effect, and thus maximize the number of users that information about the
product can reach. There has been a body of literature studying the
influence maximization problem. Nevertheless, the existing studies mostly
investigate the problem on a one-off basis, assuming fixed known influence
probabilities among users, or the knowledge of the exact social network
topology. In practice, the social network topology and the influence
probabilities are typically unknown to the advertiser and can vary
over time, e.g., as social ties are newly established, strengthened, or weakened.
In this paper, we focus on a dynamic non-stationary social network and
design a randomized algorithm, RSB, based on multi-armed bandit optimization,
to maximize influence propagation over time. The algorithm produces a sequence
of online decisions and calibrates its explore-exploit strategy utilizing
outcomes of previous decisions. It is rigorously proven to achieve an
upper-bounded regret in reward and is applicable to large-scale social networks.
Practical effectiveness of the algorithm is evaluated using both synthetic and
real-world datasets, which demonstrates that our algorithm outperforms previous
stationary methods under non-stationary conditions.
Comment: 10 pages. To appear in IEEE/ACM IWQoS 2016. Full versio
Deterministic Sequencing of Exploration and Exploitation for Multi-Armed Bandit Problems
In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with
unknown reward models. At each time, a player selects one arm to play, aiming
to maximize the total expected reward over a horizon of length T. An approach
based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is
developed for constructing sequential arm selection policies. It is shown that
for all light-tailed reward distributions, DSEE achieves the optimal
logarithmic order of the regret, where regret is defined as the total expected
reward loss against the ideal case with known reward models. For heavy-tailed
reward distributions, DSEE achieves O(T^{1/p}) regret when the moments of the
reward distributions exist up to the pth order for 1<p<=2 and O(T^{1/(1+p/2)})
for p>2. With the knowledge of an upper bound on a finite moment of the
heavy-tailed reward distributions, DSEE offers the optimal logarithmic regret
order. The proposed DSEE approach complements existing work on MAB by providing
corresponding results for general reward distributions. Furthermore, with a
clearly defined tunable parameter, the cardinality of the exploration sequence,
the DSEE approach is easily extendable to variations of MAB, including MAB with
various objectives, decentralized MAB with multiple players and incomplete
reward observations under collisions, MAB with unknown Markov dynamics, and
combinatorial MAB with dependent arms that often arise in network optimization
problems such as the shortest path, the minimum spanning tree, and the dominating
set problems under unknown random weights.
Comment: 22 pages, 2 figure
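The idea of separating exploration from exploitation in a deterministic way, as the abstract above describes, can be sketched as follows. This is a hedged illustration rather than the authors' exact construction. The logarithmic exploration budget `n_arms * ceil(w * log(t + 1))`, the constant `w`, the round-robin order, and the use of exploration samples only for the sample means are all assumptions chosen for the light-tailed case.

```python
import math

def dsee(pull, n_arms, horizon, w=2.0):
    """Run a DSEE-style policy; pull(arm) returns one reward sample."""
    sums = [0.0] * n_arms    # reward totals from exploration slots
    counts = [0] * n_arms    # exploration pulls per arm
    n_explore = 0            # cardinality of the exploration sequence so far
    history = []             # (time, arm, phase) for every slot

    for t in range(1, horizon + 1):
        # Assumed budget: the exploration sequence may hold at most
        # n_arms * ceil(w * log(t + 1)) slots by time t.
        target = n_arms * math.ceil(w * math.log(t + 1))
        if n_explore < target:
            # Exploration slot: cycle through the arms in round-robin order.
            arm = n_explore % n_arms
            reward = pull(arm)
            sums[arm] += reward
            counts[arm] += 1
            n_explore += 1
            history.append((t, arm, "explore"))
        else:
            # Exploitation slot: play the arm with the largest sample mean.
            arm = max(range(n_arms), key=lambda k: sums[k] / counts[k])
            pull(arm)
            history.append((t, arm, "exploit"))
    return history

means = [0.2, 0.8]  # hypothetical arms that pay deterministic rewards
history = dsee(lambda arm: means[arm], n_arms=2, horizon=1000)
explore_slots = sum(1 for _, _, phase in history if phase == "explore")
```

Because the slot type depends only on `t` and not on the observed rewards, the single knob is how fast the exploration budget grows. A heavy-tailed setting would call for a different budget and mean estimator than the ones assumed here.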