4 research outputs found
Non-stationary Delayed Combinatorial Semi-Bandit with Causally Related Rewards
Sequential decision-making under uncertainty is often associated with long
feedback delays. Such delays degrade the performance of the learning agent in
identifying a subset of arms with the optimal collective reward in the long
run. This problem becomes significantly challenging in a non-stationary
environment with structural dependencies amongst the reward distributions
associated with the arms. Therefore, besides adapting to delays and
environmental changes, learning the causal relations alleviates the adverse
effects of feedback delay on the decision-making process. We formalize the
described setting as a non-stationary and delayed combinatorial semi-bandit
problem with causally related rewards. We model the causal relations by a
directed graph in a stationary structural equation model. The agent maximizes
the long-term average payoff, defined as a linear function of the base arms'
rewards. We develop a policy that learns the structural dependencies from
delayed feedback and utilizes that to optimize the decision-making while
adapting to drifts. We prove a regret bound for the performance of the proposed
algorithm. Besides, we evaluate our method via numerical analysis using
synthetic and real-world datasets to detect the regions that contribute the
most to the spread of Covid-19 in Italy.Comment: 33 pages, 9 figures. arXiv admin note: text overlap with
arXiv:2212.1292
MNL-Bandit in non-stationary environments
In this paper, we study the MNL-Bandit problem in a non-stationary
environment and present an algorithm with a worst-case expected regret of
. Here is the number of arms, is the number of
changes and is a variation measure of the unknown
parameters. Furthermore, we show matching lower bounds on the expected regret
(up to logarithmic factors), implying that our algorithm is optimal. Our
approach builds upon the epoch-based algorithm for stationary MNL-Bandit in
Agrawal et al. 2016. However, non-stationarity poses several challenges and we
introduce new techniques and ideas to address these. In particular, we give a
tight characterization for the bias introduced in the estimators due to non
stationarity and derive new concentration bounds