2 research outputs found
A Gang of Adversarial Bandits
We consider running multiple instances of multi-armed bandit (MAB) problems in parallel. A main motivation for this study is online recommendation systems, in which each of N users is associated with a MAB problem and the goal is to exploit users' similarity in order to learn users' preferences over K items more efficiently. We consider the adversarial MAB setting, whereby an adversary is free to choose which user and which loss to present to the learner during the learning process. Users are in a social network and the learner is aided by a-priori knowledge of the strengths of the social links between all pairs of users. It is assumed that if the social link between two users is strong then they tend to share the same action. The regret is measured relative to an arbitrary function which maps users to actions. The smoothness of the function is captured by a resistance-based dispersion measure Ψ. We present two learning algorithms, GABA-I and GABA-II, which exploit the network structure to bias towards functions of low Ψ values. We show that GABA-I has an expected regret bound of O(√(ln(NK/Ψ) ΨKT)) and per-trial time complexity of O(K ln(N)), whilst GABA-II has a weaker O(√(ln(N/Ψ) ln(NK/Ψ) ΨKT)) regret, but a better O(ln(K) ln(N)) per-trial time complexity. We highlight improvements of both algorithms over running independent standard MABs across users.
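For orientation, the baseline that the abstract compares against (independent standard MABs, one per user) can be sketched as follows. This is a minimal illustration, not the GABA-I or GABA-II algorithm: it assumes the standard learner is Exp3 with losses in [0, 1], and the names `Exp3`, `run_independent_baseline`, `learning_rate` and the `(user, loss_vector)` trial format are hypothetical choices made for this sketch.

```python
import math
import random


class Exp3:
    """Standard Exp3 learner for a single user with losses in [0, 1]."""

    def __init__(self, n_actions, learning_rate, rng=None):
        self.n_actions = n_actions
        self.learning_rate = learning_rate
        self.rng = rng or random.Random()
        # Cumulative importance-weighted loss estimate per action.
        self.cum_loss_estimates = [0.0] * n_actions

    def probabilities(self):
        # Subtract the smallest estimate so every exponent is <= 0.
        base = min(self.cum_loss_estimates)
        weights = [
            math.exp(-self.learning_rate * (est - base))
            for est in self.cum_loss_estimates
        ]
        total = sum(weights)
        return [w / total for w in weights]

    def select(self):
        probs = self.probabilities()
        action = self.rng.choices(range(self.n_actions), weights=probs)[0]
        return action, probs[action]

    def update(self, action, prob, loss):
        # Only the played action's loss is observed (bandit feedback);
        # dividing by its probability keeps the estimate unbiased.
        self.cum_loss_estimates[action] += loss / prob


def run_independent_baseline(n_users, n_actions, trials, learning_rate, seed=0):
    """Run one Exp3 learner per user, sharing nothing between users.

    `trials` is an iterable of (user, loss_vector) pairs; in the adversarial
    setting both are chosen by the environment (assumed format).
    """
    rng = random.Random(seed)
    learners = [Exp3(n_actions, learning_rate, rng) for _ in range(n_users)]
    total_loss = 0.0
    for user, losses in trials:
        action, prob = learners[user].select()
        total_loss += losses[action]
        learners[user].update(action, prob, losses[action])
    return total_loss, learners
```

Because each learner here sees only its own user's trials, nothing learned for one user transfers to a strongly linked neighbour; that lack of sharing is what the network-aware algorithms in the abstract are designed to overcome.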
Online Network Source Optimization with Graph-Kernel MAB
We propose Grab-UCB, a graph-kernel multi-armed bandit algorithm to learn
online the optimal source placement in large scale networks, such that the
reward obtained from a priori unknown network processes is maximized. The
uncertainty calls for online learning, which, however, suffers from the curse of
dimensionality. To achieve sample efficiency, we describe the network processes
with an adaptive graph dictionary model, which typically leads to sparse
spectral representations. This enables a data-efficient learning framework,
whose learning rate scales with the dimension of the spectral representation
model rather than that of the network. We then propose Grab-UCB, an online
sequential decision strategy that learns the parameters of the spectral
representation while optimizing the action strategy. We derive the performance
guarantees that depend on network parameters, which further influence the
learning curve of the sequential decision strategy. We introduce a
computationally simplified solving method, Grab-arm-Light, an algorithm that
walks along the edges of the polytope representing the objective function.
Simulation results show that the proposed online learning algorithm
outperforms baseline offline methods that typically separate the learning phase
from the testing one. The results confirm the theoretical findings, and further
highlight the gain of the proposed online learning strategy in terms of
cumulative regret, sample efficiency and computational complexity.