Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits
We derive an algorithm that achieves the optimal (within constants)
pseudo-regret in both adversarial and stochastic multi-armed bandits without
prior knowledge of the regime and time horizon. The algorithm is based on
online mirror descent (OMD) with Tsallis entropy regularization with power
α = 1/2 and reduced-variance loss estimators. More generally, we define an
adversarial regime with a self-bounding constraint, which includes stochastic
regime, stochastically constrained adversarial regime (Wei and Luo), and
stochastic regime with adversarial corruptions (Lykouris et al.) as special
cases, and show that the algorithm achieves logarithmic regret guarantee in
this regime and all of its special cases simultaneously with the adversarial
regret guarantee. The algorithm also achieves adversarial and stochastic
optimality in the utility-based dueling bandit setting. We provide empirical
evaluation of the algorithm demonstrating that it significantly outperforms
UCB1 and EXP3 in stochastic environments. We also provide examples of
adversarial environments, where UCB1 and Thompson Sampling exhibit almost
linear regret, whereas our algorithm suffers only logarithmic regret. To the
best of our knowledge, this is the first example demonstrating vulnerability of
Thompson Sampling in adversarial environments. Last, but not least, we present
a general stochastic analysis and a general adversarial analysis of OMD
algorithms with Tsallis entropy regularization for α ∈ [0, 1] and explain
why α = 1/2 works best.
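The core update can be sketched in a few lines: the OMD step with 1/2-Tsallis regularization reduces to finding a normalization constant by Newton's method, after which an arm is sampled and its loss fed back through an importance-weighted estimator. The sketch below uses plain importance weighting rather than the paper's reduced-variance estimators, and all function names are illustrative.

```python
import numpy as np

def tsallis_inf_weights(L_hat, eta, n_iter=50):
    """Sampling distribution of OMD with 1/2-Tsallis regularization:
    solve for x so that w_i = 4 / (eta * (L_hat_i - x))**2 sums to 1,
    via Newton's method, started below min(L_hat) so every w_i > 0."""
    x = np.min(L_hat) - 2.0 / eta
    for _ in range(n_iter):
        w = 4.0 / (eta * (L_hat - x)) ** 2
        x -= (w.sum() - 1.0) / (eta * (w ** 1.5).sum())  # Newton step
    w = 4.0 / (eta * (L_hat - x)) ** 2
    return w / w.sum()  # guard against residual numeric drift

def play_round(L_hat, eta, loss_fn, rng):
    """One bandit round with a plain importance-weighted loss estimate
    (the paper uses reduced-variance estimators instead)."""
    w = tsallis_inf_weights(L_hat, eta)
    arm = rng.choice(len(L_hat), p=w)
    loss = loss_fn(arm)
    L_hat[arm] += loss / w[arm]  # importance weighting
    return arm, loss
```

Because the weights are decreasing in the cumulative loss estimates, arms with smaller estimated loss are sampled more often while every arm keeps positive probability.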
Decentralized Exploration in Multi-Armed Bandits
We consider the decentralized exploration problem: a set of players
collaborate to identify the best arm by asynchronously interacting with the
same stochastic environment. The objective is to ensure privacy in the best arm
identification problem between asynchronous, collaborative, and thrifty
players. In the context of a digital service, we advocate that this
decentralized approach allows a good balance between the interests of users and
those of service providers: the providers optimize their services, while
protecting the privacy of the users and saving resources. We define the privacy
level as the amount of information an adversary could infer by intercepting the
messages concerning a single user. We provide a generic algorithm Decentralized
Elimination, which uses any best arm identification algorithm as a subroutine.
We prove that this algorithm ensures privacy, with a low communication cost,
and that in comparison to the lower bound of the best arm identification
problem, its sample complexity suffers from a penalty depending on the inverse
of the probability of the most frequent players. Then, thanks to the genericity
of the approach, we extend the proposed algorithm to non-stationary
bandits. Finally, experiments illustrate and complete the analysis.
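As a rough illustration of the generic structure, the following is a simplified, synchronized sketch (hypothetical names, Gaussian arms; not the paper's exact protocol), with successive elimination standing in for the pluggable best arm identification subroutine. The only quantity ever shared between players is the set of eliminated arm ids, which is what limits what an eavesdropper can infer about any single user's draws.

```python
import math
import random

def successive_elimination_step(means, counts, active, delta):
    """Drop arms whose UCB falls below the best LCB: a standard best arm
    identification test, standing in for the generic subroutine."""
    def radius(n):
        return math.sqrt(math.log(4 * n * n / delta) / (2 * n)) if n else float("inf")
    ucb = {a: means[a] + radius(counts[a]) for a in active}
    lcb = {a: means[a] - radius(counts[a]) for a in active}
    best_lcb = max(lcb.values())
    return {a for a in active if ucb[a] < best_lcb}

def decentralized_elimination(arm_means, n_players, delta, horizon, rng):
    """Players take turns sampling the shared active set; only
    eliminated-arm ids are ever 'broadcast'."""
    active = set(range(len(arm_means)))
    means = [0.0] * len(arm_means)
    counts = [0] * len(arm_means)
    for _ in range(horizon):
        if len(active) == 1:
            break
        player = rng.randrange(n_players)  # an arbitrary player wakes up
        for a in list(active):             # and performs a sampling pass
            r = rng.gauss(arm_means[a], 1.0)
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
        active -= successive_elimination_step(means, counts, active, delta)
    return active
```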
Nonparametric Stochastic Contextual Bandits
We analyze the K-armed bandit problem where the reward for each arm is a
noisy realization based on an observed context under mild nonparametric
assumptions. We attain tight results for top-arm identification and a sublinear
regret of Õ(T^((1+D)/(2+D))), where D is the
context dimension, for a modified UCB algorithm that is simple to implement
(NN-UCB). We then give global intrinsic dimension dependent and ambient
dimension independent regret bounds. We also discuss recovering topological
structures within the context space based on expected bandit performance and
provide an extension to infinite-armed contextual bandits. Finally, we
experimentally show the improvement of our algorithm over existing multi-armed
bandit approaches for both simulated tasks and MNIST image classification.
Comment: AAAI 2018
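One way such a modified UCB can look, sketched with nearest-neighbour mean estimates per arm (class name, constants, and the exact exploration bonus are illustrative assumptions, not the paper's NN-UCB):

```python
import math
import numpy as np

class KNNUCB:
    """Contextual UCB sketch: each arm's mean reward at context x is
    estimated from the k nearest past contexts played on that arm."""

    def __init__(self, n_arms, k=5):
        self.k = k
        self.contexts = [[] for _ in range(n_arms)]
        self.rewards = [[] for _ in range(n_arms)]
        self.t = 0

    def select(self, x):
        self.t += 1
        scores = []
        for a in range(len(self.contexts)):
            if not self.contexts[a]:
                return a  # play every arm once before scoring
            X = np.array(self.contexts[a])
            d = np.linalg.norm(X - x, axis=1)
            idx = np.argsort(d)[: self.k]  # k nearest past contexts
            mean = float(np.mean(np.array(self.rewards[a])[idx]))
            bonus = math.sqrt(2.0 * math.log(self.t) / len(idx))
            scores.append(mean + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, r):
        self.contexts[arm].append(np.asarray(x, dtype=float))
        self.rewards[arm].append(float(r))
```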
Bandit problems with fidelity rewards
The fidelity bandits problem is a variant of the K-armed bandit problem in which the reward of each arm is augmented by a fidelity reward that provides the player with an additional payoff depending on how ‘loyal’ the player has been to that arm in the past. We propose two models for fidelity. In the loyalty-points model the amount of extra reward depends on the number of times the arm has previously been played. In the subscription model the additional reward depends on the current number of consecutive draws of the arm. We consider both stochastic and adversarial problems. Since single-arm strategies are not always optimal in stochastic problems, the notion of regret in the adversarial setting needs careful adjustment. We introduce three possible notions of regret and investigate which can be bounded sublinearly. We study in detail the special cases of increasing, decreasing and coupon (where the player gets an additional reward after every m plays of an arm) fidelity rewards. For the models which do not necessarily enjoy sublinear regret, we provide a worst case lower bound. For those models which exhibit sublinear regret, we provide algorithms and bound their regret.
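The two fidelity models can be made concrete with a small helper (an illustrative sketch; the linear bonus function is an arbitrary choice, not taken from the paper):

```python
def fidelity_reward(base, history, arm, model="loyalty", bonus=lambda n: 0.1 * n):
    """Total payoff for one pull of `arm`: base reward plus fidelity bonus.

    Loyalty-points model: the bonus depends on the total number of past
    plays of the arm.  Subscription model: it depends on the current run
    of consecutive plays ending at the present round."""
    if model == "loyalty":
        n = history.count(arm)          # all past plays of this arm
    else:                               # "subscription"
        n = 0
        for a in reversed(history):     # length of the current streak
            if a != arm:
                break
            n += 1
    return base + bonus(n)
```

A coupon-style bonus would instead pay out only on every m-th play, e.g. `lambda n: 1.0 if n and n % m == 0 else 0.0`.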