Influence Maximization with Bandits

Author: Lakshmanan Laks. V. S.
Schmidt Mark
Vaswani Sharan
Publication date: 27/04/2016

We consider the problem of influence maximization: maximizing the number of people who become aware of a product by finding the 'best' set of 'seed' users to expose the product to. Most prior work on this topic assumes that we know the probability of each user influencing each other user, or that we have data that lets us estimate these influences. However, this information is typically not available initially or is difficult to obtain. To avoid this assumption, we adopt a combinatorial multi-armed bandit paradigm that estimates the influence probabilities as we sequentially try different seed sets. We establish bounds on the performance of this procedure under the existing edge-level feedback as well as a novel and more realistic node-level feedback. Beyond our theoretical results, we describe a practical implementation and experimentally demonstrate its efficiency and effectiveness on four real datasets. Comment: 12 pages

arXiv.org e-Print Archive

CiteSeerX

Nonparametric Stochastic Contextual Bandits

Author: Guan Melody Y.
Jiang Heinrich
Publication date: 05/01/2018

We analyze the $K$-armed bandit problem where the reward for each arm is a noisy realization based on an observed context under mild nonparametric assumptions. We attain tight results for top-arm identification and a sublinear regret of $\widetilde{O}\Big(T^{\frac{1+D}{2+D}}\Big)$, where $D$ is the context dimension, for a modified UCB algorithm that is simple to implement ($k$NN-UCB). We then give global intrinsic dimension dependent and ambient dimension independent regret bounds. We also discuss recovering topological structures within the context space based on expected bandit performance and provide an extension to infinite-armed contextual bandits. Finally, we experimentally show the improvement of our algorithm over existing multi-armed bandit approaches for both simulated tasks and MNIST image classification. Comment: AAAI 201

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Exploration vs Exploitation vs Safety: Risk-averse Multi-Armed Bandits

Author: Galichet Nicolas
Sebag Michèle
Teytaud Olivier
Publication date: 13/11/2013

Motivated by applications in energy management, this paper presents the Multi-Armed Risk-Aware Bandit (MARAB) algorithm. With the goal of limiting the exploration of risky arms, MARAB takes the conditional value at risk of an arm as its quality. When the user-supplied risk level goes to 0, the arm quality tends toward the essential infimum of the arm distribution density, and MARAB tends toward the MIN multi-armed bandit algorithm, which aims at the arm with maximal minimal value. As a first contribution, this paper presents a theoretical analysis of the MIN algorithm under mild assumptions, establishing its robustness compared to UCB. The analysis is supported by extensive experimental validation of MIN and MARAB compared to UCB and state-of-the-art risk-aware MAB algorithms on artificial and real-world problems. Comment: 16 pages
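
To illustrate the arm-quality notion mentioned in this abstract, here is a minimal sketch of an empirical conditional value at risk, taken as the mean of the worst fraction `alpha` of the rewards observed for an arm. The function names `empirical_cvar` and `pick_arm` are hypothetical, and the sketch omits any exploration term, so it should not be read as the MARAB algorithm itself.

```python
import numpy as np

def empirical_cvar(rewards, alpha):
    """Mean of the lowest alpha-fraction of observed rewards (lower tail).

    As alpha goes to 0 this reduces to the minimum observed reward,
    which mirrors the limit behaviour described in the abstract.
    """
    rewards = np.sort(np.asarray(rewards, dtype=float))
    # Always keep at least one sample in the tail.
    k = max(1, int(np.ceil(alpha * len(rewards))))
    return float(rewards[:k].mean())

def pick_arm(history, alpha):
    """Index of the arm with the largest empirical CVaR.

    `history` is a list with one sequence of observed rewards per arm
    (an assumed data layout, used for illustration only).
    """
    scores = [empirical_cvar(r, alpha) for r in history]
    return int(np.argmax(scores))

if __name__ == "__main__":
    history = [[0.9, 0.1, 0.8, 0.2], [0.5, 0.4, 0.6, 0.5]]
    print(pick_arm(history, alpha=0.25))
```

With a small `alpha`, the second arm in the usage example is preferred because its worst outcomes are better, even though both arms have the same mean.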

arXiv.org e-Print Archive

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

On the Prior Sensitivity of Thompson Sampling

Author: BC May
D Russo
E Kaufmann
J Bartroff
N Cesa-Bianchi
P Auer
S Bubeck
SL Scott
TL Lai
W Thompson
Publication date: 20/07/2016

The empirically successful Thompson Sampling algorithm for stochastic bandits has drawn much interest in understanding its theoretical properties. One important benefit of the algorithm is that it allows domain knowledge to be conveniently encoded as a prior distribution to balance exploration and exploitation more effectively. While it is generally believed that the algorithm's regret is low (high) when the prior is good (bad), little is known about the exact dependence. In this paper, we fully characterize the algorithm's worst-case dependence of regret on the choice of prior, focusing on a special yet representative case. These results also provide insights into the general sensitivity of the algorithm to the choice of priors. In particular, with $p$ being the prior probability mass of the true reward-generating model, we prove $O(\sqrt{T/p})$ and $O(\sqrt{(1-p)T})$ regret upper bounds for the bad- and good-prior cases, respectively, as well as matching lower bounds. Our proofs rely on the discovery of a fundamental property of Thompson Sampling and make heavy use of martingale theory, both of which appear novel in the literature, to the best of our knowledge. Comment: Appears in the 27th International Conference on Algorithmic Learning Theory (ALT), 201
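
As a rough illustration of the setting in this abstract, the sketch below runs Thompson Sampling over a finite set of candidate reward models with a prior over that set. It assumes Bernoulli rewards, strictly positive prior masses, and arm means strictly between 0 and 1; the function name `thompson_sampling` and its arguments are hypothetical and are not taken from the paper.

```python
import numpy as np

def thompson_sampling(models, prior, true_idx, horizon, seed=0):
    """Thompson Sampling over a finite set of Bernoulli bandit models.

    models:   array of shape (n_models, n_arms) with arm means in (0, 1)
    prior:    strictly positive prior mass per model (sums to 1)
    true_idx: index of the model that actually generates the rewards
    Returns the cumulative (pseudo-)regret and the final posterior.
    """
    rng = np.random.default_rng(seed)
    models = np.asarray(models, dtype=float)
    log_post = np.log(np.asarray(prior, dtype=float))
    true_means = models[true_idx]
    regret = 0.0

    def normalise(lp):
        # Subtract the max before exponentiating for numerical stability.
        w = np.exp(lp - lp.max())
        return w / w.sum()

    for _ in range(horizon):
        post = normalise(log_post)
        m = rng.choice(len(models), p=post)       # sample a model
        arm = int(np.argmax(models[m]))           # play its best arm
        reward = rng.random() < true_means[arm]   # Bernoulli reward
        # Bayesian update: likelihood of the observation under each model.
        lik = models[:, arm] if reward else 1.0 - models[:, arm]
        log_post = log_post + np.log(lik)
        regret += true_means.max() - true_means[arm]

    return regret, normalise(log_post)

if __name__ == "__main__":
    models = [[0.9, 0.1], [0.1, 0.9]]
    for p in (0.9, 0.1):  # prior mass on the true model
        r, post = thompson_sampling(models, [p, 1 - p], 0, horizon=200)
        print(p, r, post)
```

Varying the prior mass placed on the true model, as in the usage example, gives an informal feel for the prior sensitivity that the abstract quantifies; it is not a reproduction of the paper's analysis.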

arXiv.org e-Print Archive

Crossref