2 research outputs found
MNL-Bandit in non-stationary environments
In this paper, we study the MNL-Bandit problem in a non-stationary
environment and present an algorithm with a worst-case expected regret of
. Here is the number of arms, is the number of
changes and is a variation measure of the unknown
parameters. Furthermore, we show matching lower bounds on the expected regret
(up to logarithmic factors), implying that our algorithm is optimal. Our
approach builds upon the epoch-based algorithm for stationary MNL-Bandit in
Agrawal et al. 2016. However, non-stationarity poses several challenges and we
introduce new techniques and ideas to address these. In particular, we give a
tight characterization for the bias introduced in the estimators due to non
stationarity and derive new concentration bounds
Last Switch Dependent Bandits with Monotone Payoff Functions
In a recent work, Laforgue et al. introduce the model of last switch
dependent (LSD) bandits, in an attempt to capture nonstationary phenomena
induced by the interaction between the player and the environment. Examples
include satiation, where consecutive plays of the same action lead to decreased
performance, or deprivation, where the payoff of an action increases after an
interval of inactivity. In this work, we take a step towards understanding the
approximability of planning LSD bandits, namely, the (NP-hard) problem of
computing an optimal arm-pulling strategy under complete knowledge of the
model. In particular, we design the first efficient constant approximation
algorithm for the problem and show that, under a natural monotonicity
assumption on the payoffs, its approximation guarantee (almost) matches the
state-of-the-art for the special and well-studied class of recharging bandits
(also known as delay-dependent). In this attempt, we develop new tools and
insights for this class of problems, including a novel higher-dimensional
relaxation and the technique of mirroring the evolution of virtual states. We
believe that these novel elements could potentially be used for approaching
richer classes of action-induced nonstationary bandits (e.g., special instances
of restless bandits). In the case where the model parameters are initially
unknown, we develop an online learning adaptation of our algorithm for which we
provide sublinear regret guarantees against its full-information counterpart.Comment: Accepted to the 40th International Conference on Machine Learning
(ICML 2023