25 research outputs found
Stochastic Bandit Models for Delayed Conversions
Online advertising and product recommendation are important application domains
for multi-armed bandit methods. In these fields, the reward that
is immediately available is most often only a proxy for the actual outcome of
interest, which we refer to as a conversion. For instance, in web advertising,
clicks can be observed within a few seconds after an ad display but the
corresponding sale --if any-- may take hours, if not days, to happen. This
paper proposes and investigates a new stochastic multi-armed bandit model in
the framework proposed by Chapelle (2014) --based on empirical studies in the
field of web advertising-- in which each action may trigger a future reward
that will then happen with a stochastic delay. We assume that the probability
of conversion associated with each action is unknown while the distribution of
the conversion delay is known, distinguishing between the (idealized) case
where the conversion events may be observed whatever their delay and the more
realistic setting in which late conversions are censored. We provide
performance lower bounds as well as two simple but efficient algorithms based
on the UCB and KLUCB frameworks. The latter algorithm, which is preferable when
conversion rates are low, is based on a Poissonization argument, of independent
interest in other settings where aggregation of Bernoulli observations with
different success probabilities is required. Comment: Conference on Uncertainty
in Artificial Intelligence, Aug 2017, Sydney, Australia.
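To make the censored delayed feedback concrete, here is a minimal sketch in which the conversion probability theta, the geometric delay parameter q and the horizon T are illustrative assumptions, not values from the paper. Since the delay distribution is known, the expected number of conversions already visible at time t is theta times the sum of the delay CDF at the elapsed times, so dividing the observed conversion count by that sum corrects the bias induced by censoring. The paper's UCB and KLUCB algorithms are index policies; the snippet only illustrates the censoring-aware estimation idea, not their exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy censored delayed-conversion setup (assumed, not the paper's code):
# each pull converts with unknown probability theta, after a delay drawn from a
# *known* geometric law with parameter q; at time t we only see conversions
# whose delay has already elapsed.
theta, q, T = 0.05, 0.1, 5000
pulls = np.arange(T)                         # suppose the arm is pulled at every step
converts = rng.random(T) < theta             # latent conversion indicator per pull
delays = rng.geometric(q, size=T)            # latent conversion delays (known law)

t = T                                        # current time
observed = np.sum(converts & (pulls + delays <= t))

# F(t - s) = known probability that a conversion triggered at time s is already
# disclosed at time t.  Dividing by the sum of these probabilities, instead of
# by the raw number of pulls, removes the bias caused by censored conversions.
F = 1.0 - (1.0 - q) ** (t - pulls)
theta_hat = observed / F.sum()
print(f"naive estimate: {observed / T:.4f}  censor-aware estimate: {theta_hat:.4f}  truth: {theta}")
```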
Sparse Stochastic Bandits
In the classical multi-armed bandit problem, d arms are available to the
decision maker who pulls them sequentially in order to maximize his cumulative
reward. Guarantees can be obtained on a relative quantity called regret, which
scales linearly with d (or with sqrt(d) in the minimax sense). We here consider
the sparse case of this classical problem in the sense that only a small number
of arms, namely s < d, have a positive expected reward. We are able to leverage
this additional assumption to provide an algorithm whose regret scales with s
instead of d. Moreover, we prove that this algorithm is optimal by providing a
matching lower bound - at least for a wide and pertinent range of parameters
that we determine - and by evaluating its performance on simulated data.
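As a rough illustration of why sparsity matters (a toy simulation with made-up means, noise level and horizon, not the paper's algorithm or experiments): a vanilla UCB1 index pays an exploration cost for every one of the d arms, including the d - s arms with zero mean, whereas the point of the sparse setting is to pay only for the s useful ones.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sparse bandit instance (illustrative): d arms, only s of them have positive mean.
d, s, T = 50, 3, 20000
means = np.zeros(d)
means[:s] = np.linspace(0.3, 0.5, s)        # the s "useful" arms
best = means.max()

# Plain UCB1 on all d arms: every arm, including the d - s zero-mean ones, keeps
# being explored, so the regret grows with d rather than with s.
counts = np.ones(d)
sums = rng.normal(means, 0.1)               # one initial pull of each arm
regret = (best - means).sum()
for t in range(d, T):
    ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
    a = int(ucb.argmax())
    sums[a] += rng.normal(means[a], 0.1)
    counts[a] += 1
    regret += best - means[a]
print(f"UCB1 regret with d={d}, s={s}: {regret:.0f}")
```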
Multiple-Play Bandits in the Position-Based Model
Sequentially learning to place items in multi-position displays or lists is a
task that can be cast into the multiple-play semi-bandit setting. However, a
major concern in this context is that the system cannot decide whether the user's
feedback on each item is actually exploitable. Indeed, much of the content may
have been simply ignored by the user. The present work proposes to exploit
available information regarding the display position bias under the so-called
Position-based click model (PBM). We first discuss how this model differs from
the Cascade model and its variants considered in several recent works on
multiple-play bandits. We then provide a novel regret lower bound for this
model as well as computationally efficient algorithms that display good
empirical and theoretical performance.
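For concreteness, here is a minimal sketch of PBM feedback and of the estimator it suggests when the position biases are known. The parameters, the uniformly random choice of the displayed list, and the names kappa/theta are illustrative assumptions, not the paper's algorithm: a click on the item shown in slot l requires examining the slot (probability kappa_l) and liking the item (probability theta_k), so dividing an item's click count by its accumulated position bias, rather than by its raw number of displays, removes the under-estimation caused by poorly examined slots.

```python
import numpy as np

rng = np.random.default_rng(3)

# Position-based model (PBM), illustrative parameters: K items, L display slots.
K, L, T = 10, 3, 20000
theta = rng.uniform(0.05, 0.4, size=K)      # unknown attractiveness of each item
kappa = np.array([1.0, 0.6, 0.3])           # known position biases, one per slot

clicks = np.zeros(K)
exposure = np.zeros(K)                      # sum of kappa[l] over the slots item k occupied
for _ in range(T):
    items = rng.choice(K, size=L, replace=False)      # here: a uniformly random list
    examined = rng.random(L) < kappa                  # user examines each slot or not
    clicked = examined & (rng.random(L) < theta[items])
    clicks[items] += clicked
    exposure[items] += kappa                          # expected examinations received

# Debiased estimate: E[clicks_k] = theta_k * (accumulated position bias of item k).
theta_hat = clicks / np.maximum(exposure, 1e-12)
print(np.round(theta_hat - theta, 3))                 # estimation errors, close to zero
```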
Beyond Average Return in Markov Decision Processes
What are the functionals of the reward that can be computed and optimized
exactly in Markov Decision Processes? In the finite-horizon, undiscounted
setting, Dynamic Programming (DP) can only handle these operations efficiently
for certain classes of statistics. We summarize the characterization of these
classes for policy evaluation, and give a new answer for the planning problem.
Interestingly, we prove that only generalized means can be optimized exactly,
even in the more general framework of Distributional Reinforcement Learning
(DistRL). DistRL does, however, allow other functionals to be evaluated approximately.
We provide error bounds on the resulting estimators, and discuss the potential
of this approach as well as its limitations. These results contribute to
advancing the theory of Markov Decision Processes by examining overall
characteristics of the return, and particularly risk-conscious strategies. Comment: NeurIPS 2023, Dec 2023, New Orleans, United States.
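As one classical instance of a generalized (quasi-arithmetic) mean that plain backward induction does optimize exactly, take f(x) = exp(beta*x), i.e. the entropic functional (1/beta) * log E[exp(beta * return)]: because the exponential of a sum factorizes across time steps, the usual DP recursion goes through with a multiplicative backup. The sketch below is only illustrative; the random MDP instance, the horizon and the choice beta > 0 are assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-horizon MDP (illustrative instance): S states, A actions, horizon H.
S, A, H, beta = 4, 2, 5, 0.5                 # beta > 0: risk-seeking exponential utility

P = rng.dirichlet(np.ones(S), size=(A, S))   # P[a, s] is a distribution over next states
R = rng.uniform(0.0, 1.0, size=(A, S))       # deterministic reward r(s, a)

# Backward induction on W_t(s) = max_a exp(beta * r(s, a)) * E[W_{t+1}(s')].
# Because exp(beta * sum of rewards) factorizes across steps, this ordinary DP
# optimizes the generalized mean (1/beta) * log E[exp(beta * return)] exactly.
W = np.ones(S)                               # W_H(s) = 1 (no remaining reward)
policy = np.zeros((H, S), dtype=int)         # greedy action at each step, kept for reference
for t in reversed(range(H)):
    Q = np.exp(beta * R) * (P @ W)           # Q[a, s] = exp(beta*r(s,a)) * sum_s' P(s'|s,a) W(s')
    policy[t] = Q.argmax(axis=0)
    W = Q.max(axis=0)

s0 = 0
print("optimal entropic value from s0:", np.log(W[s0]) / beta)
```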
Modèle de distraction pour la sélection séquentielle de contenu (A distraction model for sequential content selection)
In the context of online marketing, the advertisements shown to users are typically ranked: the best-placed items capture the user's attention and collect more clicks, regardless of their intrinsic content. To sequentially build a campaign that gathers many clicks without prior information on item quality, one must therefore be able to learn the best ordered list of L products among the K available in the catalogue. Each time a list is shown to a visitor, the visitor clicks on some of the products, and this is the only information sent back to the system. In the sequential learning setting, the system must then update its estimators so as to propose a potentially better list to the next visitor. The drawback of existing methods for this problem lies in their models: they neglect the user's relative inattention, which leads to under-estimating the click probabilities of the displayed products and to possible gaps in exploration. We therefore propose a way to include this aspect in an original multi-armed bandit model. After carefully studying the impact of user distraction on the asymptotic performance of algorithms, we exploit the principle of optimism in the face of uncertainty to propose a family of efficient algorithms that we evaluate experimentally.
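A toy caricature of the under-estimation phenomenon described above (the distraction probability p_dis, the click probability theta and the "ignore the whole list" mechanism are illustrative assumptions, not the paper's model): if the visitor ignores the display altogether with some probability, the naive clicks-per-display estimate under-estimates the true click probability, while rescaling by the attention probability removes the bias.

```python
import numpy as np

rng = np.random.default_rng(4)

# Caricature of user distraction: with probability p_dis the visitor ignores the
# display and clicks nothing; otherwise a displayed item is clicked with probability theta.
theta, p_dis, T = 0.2, 0.4, 100000
attentive = rng.random(T) >= p_dis
clicks = attentive & (rng.random(T) < theta)

naive = clicks.mean()                        # clicks / displays, ignores inattention
corrected = clicks.mean() / (1 - p_dis)      # rescale by the attention probability
print(f"naive: {naive:.3f}  corrected: {corrected:.3f}  truth: {theta}")
```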