445 research outputs found

Functional Bandits

Authors: Tran-Thanh Long; Yu Jia Yuan
Publication date: 10/05/2014

We introduce the functional bandit problem, where the objective is to find an arm that optimises a known functional of the unknown arm-reward distributions. These problems arise in many settings, such as maximum entropy methods in natural language processing and risk-averse decision-making, but current best-arm identification techniques fail in these domains. We propose a new approach that combines functional estimation and arm elimination to tackle this problem. This method achieves provably efficient performance guarantees. In addition, we illustrate this method on a number of important functionals in risk management and information theory, and refine our generic theoretical results in those cases.
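As a rough illustration of how functional estimation and arm elimination could fit together, the sketch below runs successive elimination on a plug-in estimate of conditional value-at-risk (CVaR). It is not the authors' algorithm: the choice of CVaR as the functional, the names `empirical_cvar` and `eliminate_arms`, the confidence radius, and the `width` constant are all assumptions made for this example.

```python
import math
import random

def empirical_cvar(samples, alpha=0.2):
    """Plug-in CVaR estimate: mean of the worst alpha-fraction of rewards."""
    ordered = sorted(samples)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k

def eliminate_arms(arms, functional, rounds=200, batch=20, width=1.0, seed=0):
    """Successive elimination on a functional of each arm's rewards.

    arms: list of callables taking a random.Random and returning a reward.
    width: hypothetical confidence-width constant, not taken from the paper.
    """
    rng = random.Random(seed)
    samples = {i: [] for i in range(len(arms))}
    active = set(samples)
    for _ in range(rounds):
        if len(active) == 1:
            break
        # Pull every surviving arm the same number of times.
        for i in active:
            samples[i].extend(arms[i](rng) for _ in range(batch))
        est = {i: functional(samples[i]) for i in active}
        n = len(samples[next(iter(active))])
        # Assumed generic radius; a real bound depends on the functional.
        radius = width * math.sqrt(math.log(len(arms) * n) / n)
        best = max(est.values())
        # Keep arms whose upper bound still reaches the leader's lower bound.
        active = {i for i in active if est[i] + 2 * radius >= best}
    return max(active, key=lambda i: functional(samples[i]))

# Arm 0 has the higher mean but a heavy downside; arm 1 is nearly constant.
risky = lambda rng: rng.gauss(1.0, 2.0)
safe = lambda rng: rng.gauss(0.8, 0.1)
best = eliminate_arms([risky, safe], empirical_cvar)
```

In this toy setup a mean-based rule would favour arm 0, while the CVaR-based rule is expected to favour arm 1, which is the kind of mismatch the abstract attributes to standard best-arm identification.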

arXiv.org e-Print Archive

CiteSeerX

A Risk-Averse Framework for Non-Stationary Stochastic Multi-Armed Bandits

Authors: Achab Mastane; Alami Reda; Mahfoud Mohammed
Publication date: 24/10/2023

In a typical stochastic multi-armed bandit problem, the objective is often to maximize the expected sum of rewards over some time horizon $T$. While the choice of a strategy that accomplishes that is optimal with no additional information, it is no longer the case when provided additional environment-specific knowledge. In particular, in areas of high volatility like healthcare or finance, a naive reward maximization approach often does not accurately capture the complexity of the learning problem and results in unreliable solutions. To tackle problems of this nature, we propose a framework of adaptive risk-aware strategies that operate in non-stationary environments. Our framework incorporates various risk measures prevalent in the literature to map multiple families of multi-armed bandit algorithms into a risk-sensitive setting. In addition, we equip the resulting algorithms with the Restarted Bayesian Online Change-Point Detection (R-BOCPD) algorithm and impose a (tunable) forced exploration strategy to detect local (per-arm) switches. We provide finite-time theoretical guarantees and an asymptotic regret bound of order $\tilde{O}(\sqrt{K_T T})$ up to time horizon $T$, with $K_T$ the total number of change-points. In practice, our framework compares favorably to the state-of-the-art in both synthetic and real-world environments and manages to perform efficiently with respect to both risk-sensitivity and non-stationarity.
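To make the idea of a risk-sensitive index with tunable forced exploration more concrete, here is a minimal sketch. It is only an illustration under stated assumptions: the mean-variance risk measure, the UCB-style bonus, and the names `mean_variance`, `choose_arm`, `explore_every` and `rho` are hypothetical, and the R-BOCPD change-point detection and restarts described in the abstract are deliberately left out.

```python
import math

def mean_variance(rewards, rho=1.0):
    """Risk-adjusted value: empirical mean minus rho times empirical variance."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return mean - rho * var

def choose_arm(t, history, explore_every=10, rho=1.0):
    """Pick an arm at step t (0-indexed) from per-arm reward histories.

    explore_every is a hypothetical tunable: every explore_every-th step is
    a forced round-robin pull, so no arm goes unobserved for long.
    """
    k = len(history)
    # Pull any arm that has never been observed.
    for arm in range(k):
        if not history[arm]:
            return arm
    # Forced exploration step: cycle through the arms in order.
    if t % explore_every == 0:
        return (t // explore_every) % k

    def index(arm):
        # Assumed UCB-style bonus added to the risk-adjusted value.
        bonus = math.sqrt(2 * math.log(t + 1) / len(history[arm]))
        return mean_variance(history[arm], rho) + bonus

    return max(range(k), key=index)

# Two arms with equal means; arm 1 is far more volatile.
history = [[1.0, 1.0, 1.0, 1.0], [3.0, -1.0, 3.0, -1.0]]
regular_pick = choose_arm(7, history)
forced_pick = choose_arm(10, history)
```

In a non-stationary setting, the forced pulls are what would keep fresh data arriving for every arm, so that a per-arm change detector (not shown here) has something to work with.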

arXiv.org e-Print Archive

Conditionally Risk-Averse Contextual Bandits

Authors: Farsang Mónika; Mineiro Paul; Zhang Wangda
Publication date: 08/07/2023

Contextual bandits with average-case statistical guarantees are inadequate in risk-averse situations because they might trade off degraded worst-case behaviour for better average performance. Designing a risk-averse contextual bandit is challenging because exploration is necessary but risk-aversion is sensitive to the entire distribution of rewards; nonetheless, we exhibit the first risk-averse contextual bandit algorithm with an online regret guarantee. We conduct experiments across diverse scenarios where worst-case outcomes should be avoided, spanning dynamic pricing, inventory management, and self-tuning software, and including a production exascale data processing system.

arXiv.org e-Print Archive