
    Trading off rewards and errors in multi-armed bandits

    In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation errors) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large reward). In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm, whose performance is provably close to the best possible tradeoff strategy. Finally, we demonstrate on real-world educational data that ForcingBalance returns useful information about the arms without compromising the overall reward.
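    The abstract does not spell out the ForcingBalance algorithm itself, so the sketch below is only an illustration of the reward/error trade-off it describes: a bandit that mixes forced pulls of the least-sampled arm (shrinking every arm's estimation error) with greedy pulls of the empirically best arm (protecting the cumulative reward). The mixing rate epsilon, the Gaussian rewards, and the arm means are assumptions for the example, not the paper's setup.

        import numpy as np

        rng = np.random.default_rng(0)

        def forced_exploration_bandit(means, T, epsilon=0.1):
            """Illustrative reward/error trade-off (not the paper's ForcingBalance):
            with probability epsilon pull the least-sampled arm to reduce the worst
            estimation error, otherwise pull the empirically best arm."""
            K = len(means)
            counts = np.zeros(K)
            sums = np.zeros(K)
            total_reward = 0.0
            for t in range(T):
                if t < K:                      # pull each arm once to initialise
                    arm = t
                elif rng.random() < epsilon:   # forced exploration step
                    arm = int(np.argmin(counts))
                else:                          # greedy exploitation step
                    arm = int(np.argmax(sums / counts))
                r = rng.normal(means[arm], 1.0)
                counts[arm] += 1
                sums[arm] += r
                total_reward += r
            est_error = np.abs(sums / counts - np.asarray(means))
            return total_reward, est_error

        reward, err = forced_exploration_bandit(means=[0.2, 0.5, 0.8], T=5000)
        print(f"cumulative reward: {reward:.1f}, max estimation error: {err.max():.3f}")

    Raising epsilon buys smaller estimation errors at the cost of cumulative reward, which is exactly the trade-off the paper formalizes.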

    Hedging using reinforcement learning: Contextual k-Armed Bandit versus Q-learning

    The construction of replication strategies for contingent claims in the presence of risk and market friction is a key problem of financial engineering. In real markets, continuous replication, such as in the model of Black, Scholes and Merton, is not only unrealistic but also undesirable due to high transaction costs. Over the last decades, stochastic optimal-control methods have been developed to balance between effective replication and losses. More recently, with the rise of artificial intelligence, temporal-difference Reinforcement Learning, in particular variations of Q-learning in conjunction with Deep Neural Networks, have attracted significant interest. From a practical point of view, however, such methods are often relatively sample-inefficient, hard to train and lack performance guarantees. This motivates the investigation of a stable benchmark algorithm for hedging. In this article, the hedging problem is viewed as an instance of a risk-averse contextual k-armed bandit problem, for which a large body of theoretical results and well-studied algorithms are available. We find that the k-armed bandit model naturally fits the P&L formulation of hedging, providing a more accurate and sample-efficient approach than Q-learning and reducing to the Black-Scholes model in the absence of transaction costs and risks.
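    As a rough illustration of the bandit view of hedging described above, the sketch below treats a small grid of hedge ratios as arms, observes a noisy one-period P&L per round under a toy linear payoff with proportional transaction costs, and selects arms with a mean-minus-standard-deviation criterion plus a UCB-style exploration bonus. The payoff model, cost level and risk-aversion weight are assumptions for the example; this is not the algorithm of the paper.

        import numpy as np

        rng = np.random.default_rng(1)

        # Arms: candidate hedge ratios for a short claim (toy setup, not the paper's).
        hedge_ratios = np.linspace(0.0, 1.0, 11)
        cost = 0.002          # proportional transaction cost (assumed)
        risk_aversion = 1.0   # weight on the P&L standard deviation (assumed)

        def pnl(h):
            """One-period P&L of holding h units of the underlying against a short
            claim whose payoff is approximated as 0.6 * dS (toy linear model)."""
            dS = rng.normal(0.0, 1.0)
            return (h - 0.6) * dS - cost * abs(h)

        T, K = 5000, len(hedge_ratios)
        counts = np.zeros(K)
        means = np.zeros(K)
        m2 = np.zeros(K)          # running sum of squared deviations (Welford)

        for t in range(T):
            if t < K:
                arm = t           # try every hedge ratio once
            else:
                std = np.sqrt(m2 / np.maximum(counts - 1, 1))
                bonus = np.sqrt(2 * np.log(t + 1) / counts)   # exploration bonus
                arm = int(np.argmax(means - risk_aversion * std + bonus))
            x = pnl(hedge_ratios[arm])
            counts[arm] += 1
            delta = x - means[arm]
            means[arm] += delta / counts[arm]
            m2[arm] += delta * (x - means[arm])

        print(f"most-played hedge ratio: {hedge_ratios[int(np.argmax(counts))]:.1f}")

    Under this toy model the risk-averse criterion concentrates play on the hedge ratio closest to the claim's exposure (0.6), mirroring how the bandit formulation works directly on the P&L rather than on a learned value function.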

    Rewards and errors in multi-arm bandits for interactive education

    In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation errors) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large reward). In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm, whose performance is provably close to the best possible tradeoff strategy. Finally, we demonstrate on real-world educational data that ForcingBalance returns useful information about the arms without compromising the overall reward.

    Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk

    We study the trade-off between expectation and tail risk for the regret distribution in the stochastic multi-armed bandit problem. We fully characterize the interplay among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of the expected regret exactly affects the decay rate of the regret tail probability in both the worst-case and instance-dependent scenarios. A novel policy is proposed to characterize the optimal regret tail probability for any regret threshold. Concretely, for any given $\alpha \in [1/2, 1)$ and $\beta \in [0, \alpha]$, our policy achieves a worst-case expected regret of $\tilde{O}(T^\alpha)$ (we call it $\alpha$-optimal) and an instance-dependent expected regret of $\tilde{O}(T^\beta)$ (we call it $\beta$-consistent), while enjoying a probability of incurring an $\tilde{O}(T^\delta)$ regret ($\delta \geq \alpha$ in the worst-case scenario and $\delta \geq \beta$ in the instance-dependent scenario) that decays exponentially with a polynomial $T$ term. This decay rate is proved to be the best achievable. Moreover, we discover an intrinsic gap in the optimal tail rate under the instance-dependent scenario between whether the time horizon $T$ is known a priori or not. Interestingly, when it comes to the worst-case scenario, this gap disappears. Finally, we extend our proposed policy design to (1) a stochastic multi-armed bandit setting with non-stationary baseline rewards, and (2) a stochastic linear bandit setting. Our results reveal insights into the trade-off between regret expectation and regret tail risk for both worst-case and instance-dependent scenarios, indicating that more sub-optimality and inconsistency leave room for lighter-tailed risk of incurring a large regret, and that knowing the planning horizon in advance can make a difference in alleviating tail risks.
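    The policy of the paper is not reproduced here, but the distinction it draws between expected regret and regret tail risk can be made concrete with a Monte Carlo sketch: run a standard UCB policy many times and estimate both the mean regret and the probability that the realised regret exceeds a threshold $T^\delta$. The two-armed Gaussian instance, the horizon and the threshold exponent below are illustrative assumptions.

        import numpy as np

        def run_ucb(means, T, rng):
            """One run of UCB1 on Gaussian arms; returns the realised pseudo-regret."""
            K = len(means)
            counts = np.zeros(K)
            sums = np.zeros(K)
            pulls = np.empty(T, dtype=int)
            for t in range(T):
                if t < K:
                    arm = t
                else:
                    ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
                    arm = int(np.argmax(ucb))
                r = rng.normal(means[arm], 1.0)
                counts[arm] += 1
                sums[arm] += r
                pulls[t] = arm
            return (max(means) - np.asarray(means)[pulls]).sum()

        means, T, runs = [0.0, 0.5], 2000, 500
        rng = np.random.default_rng(2)
        regrets = np.array([run_ucb(means, T, rng) for _ in range(runs)])
        delta = 0.75   # regret-threshold exponent (illustrative)
        tail = np.mean(regrets >= T ** delta)
        print(f"mean regret: {regrets.mean():.1f}, P(regret >= T^{delta}): {tail:.3f}")

    Comparing such empirical tail estimates across policies (for example, more or less aggressive exploration) is one way to see the trade-off the abstract describes: policies tuned for a smaller expected regret can exhibit heavier regret tails.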