40 research outputs found

Learning to Optimize under Non-Stationarity

Authors: Cheung Wang Chi, Simchi-Levi David, Zhu Ruihao
Publication date: 02/03/2019

We introduce algorithms that achieve state-of-the-art dynamic regret bounds for the non-stationary linear stochastic bandit setting. This setting captures natural applications such as dynamic pricing and ads allocation in a changing environment. We show how the difficulty posed by the non-stationarity can be overcome by a novel marriage between stochastic and adversarial bandit learning algorithms. Defining $d$, $B_T$, and $T$ as the problem dimension, the variation budget, and the total time horizon, respectively, our main contributions are the tuned Sliding Window UCB (SW-UCB) algorithm with optimal $\widetilde{O}(d^{2/3}(B_T+1)^{1/3}T^{2/3})$ dynamic regret, and the tuning-free bandit-over-bandit (BOB) framework built on top of the SW-UCB algorithm with best $\widetilde{O}(d^{2/3}(B_T+1)^{1/4}T^{3/4})$ dynamic regret.
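
The sliding-window idea named in the abstract can be illustrated with a minimal sketch. Note the simplifications: this is a K-armed (not linear) sliding-window UCB, the confidence bonus is the generic UCB1-style term, and the window length is passed in by hand rather than tuned from $d$, $B_T$, and $T$ as in the paper. The names `sw_ucb`, `reward_fn`, and `window` are hypothetical and chosen only for this illustration.

```python
import math
from collections import deque

def sw_ucb(reward_fn, n_arms, horizon, window):
    """Simplified K-armed sliding-window UCB (illustrative, not the paper's linear version)."""
    history = deque()          # (arm, reward) pairs from at most `window` recent rounds
    counts = [0] * n_arms      # pulls of each arm inside the window
    sums = [0.0] * n_arms      # reward totals of each arm inside the window
    choices = []
    for t in range(horizon):
        # Forget the oldest observation once the window is full.
        if len(history) == window:
            old_arm, old_reward = history.popleft()
            counts[old_arm] -= 1
            sums[old_arm] -= old_reward
        untried = [a for a in range(n_arms) if counts[a] == 0]
        if untried:
            arm = untried[0]   # every arm needs at least one sample in the window
        else:
            n = len(history)
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(n) / counts[a]),
            )
        reward = reward_fn(arm, t)
        history.append((arm, reward))
        counts[arm] += 1
        sums[arm] += reward
        choices.append(arm)
    return choices

# Hypothetical environment: the best arm switches from 0 to 1 at round 200.
def switching_reward(arm, t):
    best = 0 if t < 200 else 1
    return 1.0 if arm == best else 0.0

choices = sw_ucb(switching_reward, n_arms=2, horizon=400, window=50)
```

Because statistics older than `window` rounds are discarded, the estimates track the current reward means after a change, at the cost of higher variance from using fewer samples.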

arXiv.org e-Print Archive

DSpace@MIT

Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits

Authors: Seldin Yevgeny, Zimmert Julian
Publication date: 23/03/2020

We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent (OMD) with Tsallis entropy regularization with power $\alpha=1/2$ and reduced-variance loss estimators. More generally, we define an adversarial regime with a self-bounding constraint, which includes the stochastic regime, the stochastically constrained adversarial regime (Wei and Luo), and the stochastic regime with adversarial corruptions (Lykouris et al.) as special cases, and show that the algorithm achieves a logarithmic regret guarantee in this regime and all of its special cases simultaneously with the adversarial regret guarantee. The algorithm also achieves adversarial and stochastic optimality in the utility-based dueling bandit setting. We provide an empirical evaluation of the algorithm demonstrating that it significantly outperforms UCB1 and EXP3 in stochastic environments. We also provide examples of adversarial environments where UCB1 and Thompson Sampling exhibit almost linear regret, whereas our algorithm suffers only logarithmic regret. To the best of our knowledge, this is the first example demonstrating the vulnerability of Thompson Sampling in adversarial environments. Last, but not least, we present a general stochastic analysis and a general adversarial analysis of OMD algorithms with Tsallis entropy regularization for $\alpha\in[0,1]$ and explain why $\alpha=1/2$ works best.
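
A rough sketch of the update described in the abstract, under stated assumptions: the arm probabilities take the form $w_i = 4\,(\eta\,(\hat{L}_i - x))^{-2}$, with the normalizer $x$ chosen so the weights sum to one. Here $x$ is found by bisection, the learning rate $\eta_t = 1/\sqrt{t}$ is an illustrative choice, and plain importance-weighted loss estimators are swapped in for the reduced-variance estimators the abstract mentions. The function names `tsallis_probs` and `run_tsallis_inf` are hypothetical.

```python
import math
import random

def tsallis_probs(cum_losses, eta, iters=60):
    """Arm probabilities w_i = 4 / (eta * (L_i - x))^2, with x found by bisection."""
    k = len(cum_losses)
    l_min = min(cum_losses)
    lo = l_min - 2.0 * math.sqrt(k) / eta  # weights sum to at most 1 at this x
    hi = l_min - 2.0 / eta                 # weights sum to at least 1 at this x
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(4.0 / (eta * (l - mid)) ** 2 for l in cum_losses)
        if total > 1.0:
            hi = mid
        else:
            lo = mid
    x = 0.5 * (lo + hi)
    w = [4.0 / (eta * (l - x)) ** 2 for l in cum_losses]
    s = sum(w)
    return [wi / s for wi in w]            # remove the residual bisection error

def run_tsallis_inf(loss_means, horizon, seed=0):
    """Play Bernoulli-loss arms; returns the number of pulls per arm."""
    rng = random.Random(seed)
    k = len(loss_means)
    cum_losses = [0.0] * k                 # importance-weighted loss estimates
    pulls = [0] * k
    for t in range(1, horizon + 1):
        eta = 1.0 / math.sqrt(t)           # assumed schedule, for illustration only
        probs = tsallis_probs(cum_losses, eta)
        arm = rng.choices(range(k), weights=probs)[0]
        loss = 1.0 if rng.random() < loss_means[arm] else 0.0
        cum_losses[arm] += loss / probs[arm]   # plain importance weighting
        pulls[arm] += 1
    return pulls

pulls = run_tsallis_inf([0.2, 0.8], horizon=500)
```

Bisection is used here only because it is easy to verify; the bracket follows from the fact that the weight sum is increasing in $x$ below the smallest cumulative loss.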

arXiv.org e-Print Archive

Copenhagen University Research Information System

Conditionally Risk-Averse Contextual Bandits

Authors: Farsang Mónika, Mineiro Paul, Zhang Wangda
Publication date: 08/07/2023

Contextual bandits with average-case statistical guarantees are inadequate in risk-averse situations because they might trade off degraded worst-case behaviour for better average performance. Designing a risk-averse contextual bandit is challenging because exploration is necessary but risk-aversion is sensitive to the entire distribution of rewards; nonetheless, we exhibit the first risk-averse contextual bandit algorithm with an online regret guarantee. We conduct experiments in diverse scenarios where worst-case outcomes should be avoided, drawn from dynamic pricing, inventory management, and self-tuning software, including a production exascale data processing system.

arXiv.org e-Print Archive