Functional Bandits
We introduce the functional bandit problem, where the objective is to find an
arm that optimises a known functional of the unknown arm-reward distributions.
These problems arise in many settings such as maximum entropy methods in
natural language processing, and risk-averse decision-making, but current
best-arm identification techniques fail in these domains. We propose a new
approach that combines functional estimation and arm elimination to tackle
this problem, and prove that it achieves efficient performance guarantees.
In addition, we illustrate this method on a number of important functionals
in risk management and information theory, and refine our generic
theoretical results in those cases.
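The estimate-and-eliminate idea can be sketched as a toy successive-elimination loop. Everything here is an illustrative assumption, not the paper's actual algorithm: the `functional_elimination` helper, the phase schedule, and the choice of negative variance as a risk-averse functional.

```python
import random
import statistics

def functional_elimination(arms, functional, rounds_per_phase=200, phases=5):
    """Successive-elimination sketch: sample every surviving arm, score each
    arm by a plug-in estimate of the functional on its empirical distribution,
    then drop the worst-scoring half of the arms each phase."""
    samples = {a: [] for a in arms}
    active = list(arms)
    for _ in range(phases):
        for a in active:
            samples[a].extend(arms[a]() for _ in range(rounds_per_phase))
        scores = {a: functional(samples[a]) for a in active}
        active.sort(key=lambda a: scores[a], reverse=True)  # best score first
        active = active[: max(1, len(active) // 2)]
    return active[0]

random.seed(0)
arms = {
    "steady": lambda: random.gauss(0.5, 0.1),
    "noisy": lambda: random.gauss(0.5, 1.0),
    "wild": lambda: random.gauss(0.5, 2.0),
}
# Functional: negative variance, so a risk-averse player prefers the
# steadiest arm even though all three arms share the same mean.
best = functional_elimination(arms, lambda xs: -statistics.variance(xs))
print(best)  # -> steady
```

Note that mean-based best-arm identification would see all three arms as equivalent here; only the functional view separates them.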
Conditionally Risk-Averse Contextual Bandits
Contextual bandits with average-case statistical guarantees are inadequate in
risk-averse situations because they might trade off degraded worst-case
behaviour for better average performance. Designing a risk-averse contextual
bandit is challenging because exploration is necessary but risk-aversion is
sensitive to the entire distribution of rewards; nonetheless we exhibit the
first risk-averse contextual bandit algorithm with an online regret guarantee.
We conduct experiments in diverse scenarios where worst-case outcomes should
be avoided, spanning dynamic pricing, inventory management, and self-tuning
software, including a production exascale data processing system.
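Risk-averse objectives of the kind described are sensitive to the entire reward distribution, not just its mean. A standard example of such an objective is conditional value-at-risk (CVaR); the plain empirical estimator below is a sketch of that general idea, not the paper's algorithm, and shows why averaging hides the worst-case tail:

```python
def cvar(losses, alpha=0.95):
    """Empirical conditional value-at-risk: the mean of the worst
    (1 - alpha) fraction of losses."""
    xs = sorted(losses, reverse=True)          # worst losses first
    k = max(1, int(round((1 - alpha) * len(xs))))
    return sum(xs[:k]) / k

losses = [1, 2, 3, 100]                        # one catastrophic outcome
print(cvar(losses, alpha=0.75))                # worst 25% tail -> 100.0
print(sum(losses) / len(losses))               # the average hides it: 26.5
```

An average-case bandit would happily trade a slightly better mean for that 100-loss tail event; a CVaR-style objective rules the trade out.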
Game of Thrones: Fully Distributed Learning for Multi-Player Bandits
We consider a multi-armed bandit game where N players compete for M arms for
T turns. Each player has different expected rewards for the arms, and the
instantaneous rewards are independent and identically distributed or Markovian.
When two or more players choose the same arm, they all receive zero reward.
Performance is measured using the expected sum of regrets, compared to optimal
assignment of arms to players. We assume that each player only knows her
actions and the reward she received each turn. Players cannot observe the
actions of other players, and no communication between players is possible. We
present a distributed algorithm and prove that it achieves an expected sum of
regrets of near-O(log T). This is the first algorithm to achieve a
near order optimal regret in this fully distributed scenario. All other works
have assumed that either all players have the same vector of expected rewards
or that communication between players is possible.

Comment: A preliminary version was accepted to NIPS 2018. This extended
paper, currently under review (submitted in September 2019), improves the
regret bound to near-log(T), generalizes to unbounded and Markovian rewards,
and has a much better convergence rate.
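The collision model from the abstract (players choosing the same arm all receive zero reward) can be sketched as a tiny one-round simulator. The function name and the Bernoulli reward model are illustrative assumptions; the paper's distributed learning algorithm is not reproduced here.

```python
import random

def play_round(choices, means, rng):
    """One round of the multi-player collision model: each entry of
    `choices` is the arm picked by one player. Players sharing an arm
    all get zero; a sole player on an arm gets a Bernoulli(mean) reward."""
    counts = {}
    for arm in choices:
        counts[arm] = counts.get(arm, 0) + 1
    rewards = []
    for arm in choices:
        if counts[arm] > 1:
            rewards.append(0.0)                     # collision: zero reward
        else:
            rewards.append(float(rng.random() < means[arm]))
    return rewards

rng = random.Random(1)
means = [0.9, 0.8, 0.7]        # arm means (shared across players for brevity)
print(play_round([0, 0, 2], means, rng))  # players 0 and 1 collide on arm 0
print(play_round([0, 1, 2], means, rng))  # collision-free assignment
```

The learning problem is precisely to reach a collision-free, near-optimal assignment like the second call using only each player's own observed rewards, with no communication.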