Functional Bandits
We introduce the functional bandit problem, where the objective is to find an
arm that optimises a known functional of the unknown arm-reward distributions.
These problems arise in many settings such as maximum entropy methods in
natural language processing, and risk-averse decision-making, but current
best-arm identification techniques fail in these domains. We propose a new
approach that combines functional estimation and arm elimination to tackle
this problem. This method achieves provably efficient performance guarantees.
In addition, we illustrate this method on a number of important functionals in
risk management and information theory, and refine our generic theoretical
results in those cases.
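
To make the idea of combining functional estimation with arm elimination concrete, here is a minimal Python sketch. It is not the authors' algorithm: the names `empirical_cvar` and `functional_elimination`, the batch size, and the confidence radius are all hypothetical choices made for illustration, and the radius is a heuristic rather than a bound taken from the paper. The functional shown is a plug-in estimate of conditional value-at-risk (CVaR), one common risk measure; any other functional of the empirical reward samples could be passed in its place.

```python
import math
from typing import Callable, List, Sequence


def empirical_cvar(samples: Sequence[float], alpha: float = 0.25) -> float:
    """Plug-in CVaR: mean of the worst alpha-fraction of rewards (higher is better)."""
    ordered = sorted(samples)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def functional_elimination(
    arms: List[Callable[[], float]],
    functional: Callable[[Sequence[float]], float],
    rounds: int = 200,
    batch: int = 20,
    width: float = 1.0,
) -> int:
    """Return the index of the arm whose estimated functional value is largest.

    Each round pulls every surviving arm `batch` times, re-estimates the
    functional from all samples so far, and drops arms whose optimistic
    estimate falls below the pessimistic estimate of the current leader.
    """
    active = list(range(len(arms)))
    samples: List[List[float]] = [[] for _ in arms]
    for _ in range(rounds):
        if len(active) == 1:
            break
        for i in active:
            samples[i].extend(arms[i]() for _ in range(batch))
        estimates = {i: functional(samples[i]) for i in active}
        n = len(samples[active[0]])  # all surviving arms share the same count
        # Heuristic confidence radius (an assumption, not a proven bound).
        radius = width * math.sqrt(math.log(max(n, 2)) / n)
        best = max(estimates.values())
        active = [i for i in active if estimates[i] + radius >= best - radius]
    return max(active, key=lambda i: functional(samples[i]))
```

With one high-mean, high-variance arm and one slightly lower-mean, low-variance arm, passing `empirical_cvar` as the functional should favour the low-variance arm, whereas passing a plain sample mean should favour the high-mean arm. This is the kind of gap between mean-based best-arm identification and functional objectives that the abstract refers to.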
A Risk-Averse Framework for Non-Stationary Stochastic Multi-Armed Bandits
In a typical stochastic multi-armed bandit problem, the objective is often to
maximize the expected sum of rewards over some time horizon. While a
strategy that accomplishes this is optimal when no additional information is
available, it is no longer optimal when additional environment-specific
knowledge is provided. In particular, in areas of high volatility like
healthcare or finance, a naive reward maximization approach often does not
accurately capture the complexity of the learning problem and results in
unreliable solutions. To tackle problems of this nature, we propose a framework
of adaptive risk-aware strategies that operate in non-stationary environments.
Our framework incorporates various risk measures prevalent in the literature to
map multiple families of multi-armed bandit algorithms into a risk-sensitive
setting. In addition, we equip the resulting algorithms with the Restarted
Bayesian Online Change-Point Detection (R-BOCPD) algorithm and impose a
(tunable) forced exploration strategy to detect local (per-arm) switches. We
provide finite-time theoretical guarantees and an asymptotic regret bound
expressed in terms of the time horizon and the total number of
change-points. In practice, our framework compares favorably to the
state-of-the-art in both synthetic and real-world environments and manages to
perform efficiently with respect to both risk-sensitivity and non-stationarity.
Conditionally Risk-Averse Contextual Bandits
Contextual bandits with average-case statistical guarantees are inadequate in
risk-averse situations because they might trade off degraded worst-case
behaviour for better average performance. Designing a risk-averse contextual
bandit is challenging because exploration is necessary but risk-aversion is
sensitive to the entire distribution of rewards; nonetheless we exhibit the
first risk-averse contextual bandit algorithm with an online regret guarantee.
We conduct experiments in diverse scenarios where worst-case outcomes should
be avoided, including dynamic pricing, inventory management, and self-tuning
software, the latter including a production exascale data processing system.