Learning Contextual Bandits in a Non-stationary Environment
Multi-armed bandit algorithms have become a reference solution for handling
the explore/exploit dilemma in recommender systems, and many other important
real-world problems, such as display advertisement. However, such algorithms
usually assume a stationary reward distribution, which hardly holds in practice
as users' preferences are dynamic. As a result, a recommender system suffers
consistently suboptimal performance. In this paper, we consider the situation
where the underlying distribution of reward remains unchanged over (possibly
short) epochs and shifts at unknown time instants. In response, we propose a
contextual bandit algorithm that detects possible changes of environment based
on its reward estimation confidence and updates its arm selection strategy
accordingly. A rigorous regret upper bound analysis of the proposed algorithm
demonstrates its learning effectiveness in such a non-trivial environment.
Extensive empirical evaluations on both synthetic and real-world datasets for
recommendation confirm its practical utility in a changing environment.
Comment: 10 pages, 13 figures, To appear on ACM Special Interest Group on
Information Retrieval (SIGIR) 201
Self-Concordant Analysis of Generalized Linear Bandits with Forgetting
Contextual sequential decision problems with categorical or numerical observations are ubiquitous, and Generalized Linear Bandits (GLB) offer a solid theoretical framework to address them. In contrast to the case of linear bandits, existing algorithms for GLB have two drawbacks undermining their applicability. First, they rely on excessively pessimistic concentration bounds due to the non-linear nature of the model. Second, they require either non-convex projection steps or burn-in phases to enforce boundedness of the estimators. Both of these issues are worsened when considering non-stationary models, in which the GLB parameter may vary with time. In this work, we focus on self-concordant GLB (which include logistic and Poisson regression) with forgetting achieved either by the use of a sliding window or exponential weights. We propose a novel confidence-based algorithm for the maximum-likelihood estimator with forgetting and analyze its performance in abruptly changing environments. These results, as well as the accompanying numerical simulations, highlight the potential of the proposed approach to address non-stationarity in GLB.
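The idea of forgetting through exponential weights can be illustrated with a simplified sketch. It assumes a linear reward model with a discounted ridge estimator, not the self-concordant GLB maximum-likelihood estimator the abstract describes; the class name `DiscountedRidge` and the parameters `gamma` and `lam` are hypothetical and chosen only for illustration.

```python
import numpy as np


class DiscountedRidge:
    """Ridge regression in which each past observation is down-weighted
    by a factor gamma per time step (illustrative sketch, linear model only)."""

    def __init__(self, dim, gamma=0.95, lam=1.0):
        self.gamma = gamma              # forgetting factor in (0, 1]; 1 means no forgetting
        self.lam = lam                  # ridge regularization strength
        self.A = np.zeros((dim, dim))   # discounted sum of x x^T
        self.b = np.zeros(dim)          # discounted sum of reward * x

    def update(self, x, reward):
        x = np.asarray(x, dtype=float)
        # Shrink old statistics, then add the new observation at full weight.
        self.A = self.gamma * self.A + np.outer(x, x)
        self.b = self.gamma * self.b + reward * x

    def estimate(self):
        # Regularization is applied at estimation time and is not discounted.
        dim = self.b.shape[0]
        return np.linalg.solve(self.A + self.lam * np.eye(dim), self.b)
```

With `gamma` below 1, observations collected before an abrupt change lose influence geometrically, so the estimate tracks the new parameter. With `gamma = 1` the estimator averages over both regimes.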
Factored Bandits
We introduce the factored bandits model, which is a framework for learning
with limited (bandit) feedback, where actions can be decomposed into a
Cartesian product of atomic actions. Factored bandits incorporate rank-1
bandits as a special case, but significantly relax the assumptions on the form
of the reward function. We provide an anytime algorithm for stochastic factored
bandits and up to constants matching upper and lower regret bounds for the
problem. Furthermore, we show that with a slight modification the proposed
algorithm can be applied to utility based dueling bandits. We obtain an
improvement in the additive terms of the regret bound compared to state of the
art algorithms (the additive terms are dominating up to time horizons which are
exponential in the number of arms).
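The action structure described in this abstract, where each action is a tuple drawn from a Cartesian product of atomic action sets, can be sketched as follows. The function names `joint_actions` and `action_counts` are hypothetical, and the sketch only enumerates the action space; it does not implement the paper's algorithm.

```python
from itertools import product
from math import prod


def joint_actions(atomic_sets):
    """Enumerate every joint action as a tuple with one atomic action per factor."""
    return list(product(*atomic_sets))


def action_counts(atomic_sets):
    """Return (number of joint actions, total number of atomic actions)."""
    n_joint = prod(len(s) for s in atomic_sets)
    n_atomic = sum(len(s) for s in atomic_sets)
    return n_joint, n_atomic
```

The number of joint actions grows multiplicatively with the factors while the number of atomic actions grows additively, which is why exploiting the factored structure can matter.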