88 research outputs found
Stochastic Bandit Models for Delayed Conversions
Online advertising and product recommendation are important domains of
applications for multi-armed bandit methods. In these fields, the reward that
is immediately available is most often only a proxy for the actual outcome of
interest, which we refer to as a conversion. For instance, in web advertising,
clicks can be observed within a few seconds after an ad display but the
corresponding sale --if any-- will take hours, if not days to happen. This
paper proposes and investigates a new stochastic multi-armed bandit model in
the framework proposed by Chapelle (2014) --based on empirical studies in the
field of web advertising-- in which each action may trigger a future reward
that will then happen with a stochastic delay. We assume that the probability
of conversion associated with each action is unknown while the distribution of
the conversion delay is known, distinguishing between the (idealized) case
where the conversion events may be observed whatever their delay and the more
realistic setting in which late conversions are censored. We provide
performance lower bounds as well as two simple but efficient algorithms based
on the UCB and KLUCB frameworks. The latter algorithm, which is preferable when
conversion rates are low, is based on a Poissonization argument, of independent
interest in other settings where aggregation of Bernoulli observations with
different success probabilities is required. Comment: Conference on Uncertainty in Artificial Intelligence, Aug 2017,
Sydney, Australia.
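The censored-delay setting described above can be illustrated with a small simulation: a UCB1-style learner that only credits a conversion once its (bounded, random) delay has elapsed. This is a minimal sketch under assumed parameters (uniform delays, two arms), not the paper's UCB or KLUCB algorithms.

```python
import math
import random

def ucb_delayed(conv_probs, delay_max, horizon, seed=0):
    """UCB1 over arms whose Bernoulli conversions arrive after a random delay.

    At each round, only conversions whose delay has elapsed are credited,
    mimicking the delayed-feedback setting (here delays are bounded, so no
    conversion is censored forever).
    """
    rng = random.Random(seed)
    n_arms = len(conv_probs)
    pulls = [0] * n_arms
    observed = [0] * n_arms        # conversions seen so far, per arm
    pending = []                   # (arrival_time, arm) for in-flight conversions

    for t in range(1, horizon + 1):
        # Credit conversions whose delay has elapsed.
        for _, arm in (p for p in pending if p[0] <= t):
            observed[arm] += 1
        pending = [p for p in pending if p[0] > t]

        if t <= n_arms:
            a = t - 1              # play each arm once first
        else:
            # Standard UCB1 index computed on the (delayed) observations.
            a = max(range(n_arms),
                    key=lambda i: observed[i] / pulls[i]
                    + math.sqrt(2 * math.log(t) / pulls[i]))
        pulls[a] += 1
        if rng.random() < conv_probs[a]:
            pending.append((t + rng.randint(1, delay_max), a))

    return pulls

pulls = ucb_delayed([0.1, 0.3], delay_max=20, horizon=5000)
```

Because every conversion eventually arrives, the per-arm conversion estimates converge to the true rates and the learner concentrates its pulls on the better arm despite the delay.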
An Efficient Bandit Algorithm for Realtime Multivariate Optimization
Optimization is commonly employed to determine the content of web pages, such
as to maximize conversions on landing pages or click-through rates on search
engine result pages. Often the layout of these pages can be decoupled into
several separate decisions. For example, the composition of a landing page may
involve deciding which image to show, which wording to use, what color
background to display, etc. Such optimization is a combinatorial problem over
an exponentially large decision space. Randomized experiments do not scale well
to this setting, and therefore, in practice, one is typically limited to
optimizing a single aspect of a web page at a time. This represents a missed
opportunity in both the speed of experimentation and the exploitation of
possible interactions between layout decisions.
Here we focus on multivariate optimization of interactive web pages. We
formulate an approach where the possible interactions between different
components of the page are modeled explicitly. We apply bandit methodology to
explore the layout space efficiently and use hill-climbing to select optimal
content in realtime. Our algorithm also extends to contextualization and
personalization of layout selection. Simulation results show the suitability of
our approach to large decision spaces with strong interactions between content.
We further apply our algorithm to optimize a message that promotes adoption of
an Amazon service. After only a single week of online optimization, we saw a
21% conversion increase compared to the median layout. Our technique is
currently being deployed to optimize content across several locations at
Amazon.com. Comment: KDD'17 Audience Appreciation Award.
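The hill-climbing step over the combinatorial layout space can be sketched as coordinate ascent over per-slot choices. The score function here is an arbitrary toy with one pairwise interaction, standing in for the bandit model's reward estimates; all names and parameters are illustrative, not from the paper.

```python
def hill_climb_layout(n_slots, n_options, score, restarts=3):
    """Coordinate-ascent search over a combinatorial layout space.

    `score` maps a tuple of per-slot choices to an estimated reward (in
    the paper this would come from a model with explicit interaction
    terms; here it is any black-box function). Random restarts guard
    against poor local optima.
    """
    best_layout, best_val = None, float("-inf")
    for r in range(restarts):
        layout = [(r + s) % n_options for s in range(n_slots)]  # varied starts
        improved = True
        while improved:
            improved = False
            for slot in range(n_slots):
                for opt in range(n_options):
                    cand = layout.copy()
                    cand[slot] = opt
                    if score(tuple(cand)) > score(tuple(layout)):
                        layout, improved = cand, True
        val = score(tuple(layout))
        if val > best_val:
            best_layout, best_val = tuple(layout), val
    return best_layout, best_val

# Toy score with a pairwise interaction: slots 0 and 1 agreeing earns a bonus,
# so optimizing one slot at a time would miss the interaction.
def score(layout):
    return sum(layout) + (5 if layout[0] == layout[1] else 0)

layout, val = hill_climb_layout(n_slots=3, n_options=3, score=score)
```

Each pass costs only `n_slots * n_options` evaluations, which is what makes per-request ("realtime") selection feasible even when the full layout space is exponentially large.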
Sales Channel Optimization via Simulations Based on Observational Data with Delayed Rewards: A Case Study at LinkedIn
Training models on data obtained from randomized experiments is ideal for
making good decisions. However, randomized experiments are often
time-consuming, expensive, risky, infeasible or unethical to perform, leaving
decision makers little choice but to rely on observational data collected under
historical policies when training models. This opens questions regarding not
only which decision-making policies would perform best in practice, but also
regarding the impact of different data collection protocols on the performance
of various policies trained on the data, or the robustness of policy
performance with respect to changes in problem characteristics such as action-
or reward-specific delays in observing outcomes. We aim to answer such
questions for the problem of optimizing sales channel allocations at LinkedIn,
where sales accounts (leads) need to be allocated to one of three channels,
with the goal of maximizing the number of successful conversions over a period
of time. A key feature of the problem is the presence of stochastic delays in
observing allocation outcomes, whose distribution is both channel- and outcome-
dependent. We built a discrete-time simulation that can handle our problem
features and used it to evaluate: a) a historical rule-based policy; b) a
supervised machine learning policy (XGBoost); and c) multi-armed bandit (MAB)
policies, under different scenarios involving: i) data collection used for
training (observational vs randomized); ii) lead conversion scenarios; iii)
delay distributions. Our simulation results indicate that LinUCB, a simple MAB
policy, consistently outperforms the other policies, achieving an 18-47% lift
relative to a rule-based policy. Comment: Accepted at REVEAL'22 Workshop (16th ACM Conference on Recommender
Systems - RecSys 2022).
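The discrete-time simulation described above can be sketched as a loop that allocates one lead per step and reveals each outcome only after a channel-dependent delay. The epsilon-greedy policy below is a simple stand-in for the compared policies; the rates, delay ranges, and names are assumptions for the demo, not LinkedIn's actual figures.

```python
import random

def simulate(policy, conv_rates, delay_ranges, n_leads, seed=0):
    """Discrete-time simulator: each step allocates one lead to a channel;
    the outcome (converted or not) is revealed only after a delay drawn
    from that channel's range, so the policy acts on stale information.
    """
    rng = random.Random(seed)
    n_channels = len(conv_rates)
    observed = [[0, 0] for _ in range(n_channels)]  # [outcomes seen, conversions seen]
    pending = []                                    # (reveal_time, channel, converted)
    total_conversions = 0

    for t in range(n_leads):
        # Reveal outcomes whose delay has elapsed.
        for _, ch, conv in (p for p in pending if p[0] <= t):
            observed[ch][0] += 1
            observed[ch][1] += conv
        pending = [p for p in pending if p[0] > t]

        ch = policy(observed, rng)
        converted = rng.random() < conv_rates[ch]
        total_conversions += converted
        lo, hi = delay_ranges[ch]
        pending.append((t + rng.randint(lo, hi), ch, int(converted)))

    return total_conversions

# A simple epsilon-greedy allocation policy, standing in for the MAB policies.
def eps_greedy(observed, rng, eps=0.1):
    if rng.random() < eps or any(n == 0 for n, _ in observed):
        return rng.randrange(len(observed))
    return max(range(len(observed)), key=lambda i: observed[i][1] / observed[i][0])

total = simulate(eps_greedy, conv_rates=[0.05, 0.2, 0.1],
                 delay_ranges=[(1, 5), (1, 20), (1, 10)], n_leads=5000)
```

Swapping in different `policy` functions (rule-based, supervised, LinUCB) and different `delay_ranges` is exactly the kind of scenario comparison the simulation study performs.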
Dynamical Linear Bandits
In many real-world sequential decision-making problems, an action does not
immediately reflect on the feedback and spreads its effects over a long time
frame. For instance, in online advertising, investing in a platform produces an
instantaneous increase of awareness, but the actual reward, i.e., a conversion,
might occur far in the future. Furthermore, whether a conversion takes place
depends on: how fast the awareness grows, its vanishing effects, and the
synergy or interference with other advertising platforms. Previous work has
investigated the Multi-Armed Bandit framework with the possibility of delayed
and aggregated feedback, without a particular structure on how an action
propagates in the future, disregarding possible dynamical effects. In this
paper, we introduce a novel setting, the Dynamical Linear Bandits (DLB), an
extension of the linear bandits characterized by a hidden state. When an action
is performed, the learner observes a noisy reward whose mean is a linear
function of the hidden state and of the action. Then, the hidden state evolves
according to linear dynamics, affected by the performed action too. We start by
introducing the setting, discussing the notion of optimal policy, and deriving
an expected regret lower bound. Then, we provide an optimistic regret
minimization algorithm, Dynamical Linear Upper Confidence Bound (DynLin-UCB),
that suffers an expected regret whose order depends on the horizon, on a
measure of the stability of the system, and on the dimension of the action
vector. Finally, we conduct a numerical validation on a synthetic environment
and on real-world data to show the effectiveness of DynLin-UCB in comparison
with several baselines.
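The hidden-state dynamics can be illustrated with a scalar toy environment: the reward is linear in the hidden state, and the state evolves linearly under the chosen action. The coefficients below are illustrative assumptions, not values from the paper.

```python
def dlb_step(h, action, a_coef, b_coef, omega):
    """One step of a scalar Dynamical Linear Bandit environment (sketch):
    the mean reward is linear in the hidden state, and the hidden state
    then evolves with linear dynamics driven by the performed action."""
    reward = omega * h                     # noiseless mean reward
    h_next = a_coef * h + b_coef * action  # hidden-state transition
    return reward, h_next

# Playing a constant action a=1 under stable dynamics (|a_coef| < 1): the
# hidden state, and hence the mean reward, ramps up toward the steady state
# b_coef / (1 - a_coef) = 5 instead of appearing immediately. This is the
# delayed, dynamical effect the DLB setting is designed to capture.
h, rewards = 0.0, []
for t in range(50):
    r, h = dlb_step(h, action=1.0, a_coef=0.8, b_coef=1.0, omega=1.0)
    rewards.append(r)
```

A learner that judged the action by its immediate reward (here, 0 on the first pull) would badly underestimate it, which is why the notion of optimal policy must account for the state dynamics.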
Capturing Delayed Feedback in Conversion Rate Prediction via Elapsed-Time Sampling
Conversion rate (CVR) prediction is one of the most critical tasks for
digital display advertising. Commercial systems often need to update models
in an online learning manner to catch up with the evolving data distribution.
However, conversions usually do not happen immediately after a user click. This
may result in inaccurate labeling, which is known as the delayed feedback problem. In
previous studies, the delayed feedback problem is handled either by waiting
for the positive label over a long period of time, or by consuming the negative
sample on its arrival and then inserting a positive duplicate when a conversion happens
later. Indeed, there is a trade-off between waiting for more accurate labels
and utilizing fresh data, which is not considered in existing works. To strike
a balance in this trade-off, we propose Elapsed-Time Sampling Delayed Feedback
Model (ES-DFM), which models the relationship between the observed conversion
distribution and the true conversion distribution. Then we optimize the
expectation of true conversion distribution via importance sampling under the
elapsed-time sampling distribution. We further estimate the importance weight
for each instance, which is used as the weight of loss function in CVR
prediction. To demonstrate the effectiveness of ES-DFM, we conduct extensive
experiments on a public dataset and a private industrial dataset. Experimental
results confirm that our method consistently outperforms the previous
state-of-the-art results. Comment: This paper has been accepted by AAAI 2021.
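The core correction idea, relating the observed (censored) conversion distribution to the true one, can be shown with a toy estimator: conversions whose delay falls outside the labeling window are observed as negatives, and dividing by the probability of falling inside the window recovers an unbiased rate. This captures the flavor of the correction, not ES-DFM's exact per-instance importance weights; the delay distribution is an assumption for the demo.

```python
import random

def corrected_cvr(clicks, window, true_cvr, delay_cdf, seed=0):
    """Toy correction of a censored conversion rate.

    A conversion is observed only if its delay falls inside the labeling
    window; late conversions are (wrongly) labeled negative. Dividing the
    naive rate by P(delay <= window) undoes that bias in expectation.
    """
    rng = random.Random(seed)
    observed = 0
    for _ in range(clicks):
        # A click converts with prob. true_cvr; the conversion is seen
        # only if its delay lands inside the window.
        if rng.random() < true_cvr and rng.random() < delay_cdf(window):
            observed += 1
    naive = observed / clicks
    return naive, naive / delay_cdf(window)

# Geometric-style delay, P(delay <= w) = 1 - 0.9**w (assumed for the demo).
naive, corrected = corrected_cvr(100_000, window=5, true_cvr=0.10,
                                 delay_cdf=lambda w: 1 - 0.9 ** w)
```

The naive estimate is biased low by exactly the censoring probability, while the corrected one recovers the 10% true rate, which is the trade-off a shorter, fresher labeling window forces and the correction repairs.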
Profit maximization through budget allocation in display advertising
Online display advertising provides advertisers a unique opportunity to calculate real-time return on investment for advertising campaigns. Based on the target audiences, each advertising campaign is divided into sub campaigns, called ad sets, which all have their individual returns. Consequently, the advertiser faces an optimization problem of how to allocate the advertising budget across ad sets so that the total return on investment is maximized. Performance of each ad set is unknown to the advertiser beforehand. Thus the advertiser risks choosing a suboptimal ad set if allocating budget to the one assumed to be the optimal. On the other hand, the advertiser wastes money when exploring the returns and not allocating budget to the optimal ad set.
This exploration vs. exploitation dilemma is known from the so-called multi-armed bandit problem. The standard multi-armed bandit problem consists of a gambler and multiple gambling slot machines, i.e., bandits. The gambler needs to balance exploring which of the bandits has the highest rewards against maximising the reward by playing the bandit with the highest return. I formalize the budget allocation problem faced by the online advertiser as a batched bandit problem where the bandits have to be played in batches instead of one by one. Based on the previous literature, I propose several allocation policies to solve the budget allocation problem. In addition, I use an extensive real-world dataset from over 200 Facebook advertising campaigns to test the performance impact of different allocation policies.
My empirical results give evidence that the return on investment of online advertising campaigns can be improved by dynamically allocating budget. So-called greedy algorithms, which allocate more of the budget to the ad set with the best historical average, seem to perform notably well. I show that performance can be further improved by decreasing the exploration budget over time. Another well-performing policy is Thompson sampling, which allocates budget by sampling return estimates from a prior distribution formed from historical returns. Upper confidence and probability policies, often proposed in the machine learning literature, don't seem to apply as well to this real-world resource allocation problem.
I also contribute to the previous literature by providing evidence that the advertiser should base the budget allocation on observations of the real revenue-generating event (e.g. product purchase) instead of observations of more general events (e.g. clicks on ads). In addition, my research gives evidence that the performance of the allocation policies depends on the number of observations available to the policy when it makes its decisions. This may be an issue in real-world applications if the number of available observations is scarce. I believe this issue is not unique to display advertising and consequently propose a future research topic of developing more robust batched bandit algorithms for resource allocation decisions where the rate of return is small.
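The batched Thompson sampling policy highlighted above can be sketched as follows: sample each ad set's conversion rate from its Beta posterior many times, and split the next batch's budget in proportion to how often each ad set wins the draw. The ad-set counts and budget below are assumptions for the demo, not from the thesis data.

```python
import random

def thompson_batch_allocation(successes, trials, budget, n_samples=1000, seed=0):
    """Batched Thompson sampling for budget allocation (sketch).

    Each ad set's conversion rate gets a Beta(successes+1, failures+1)
    posterior; the share of posterior draws an ad set wins determines
    its share of the next batch's budget, so exploration shrinks
    naturally as the posteriors concentrate.
    """
    rng = random.Random(seed)
    n = len(successes)
    wins = [0] * n
    for _ in range(n_samples):
        draws = [rng.betavariate(successes[i] + 1,
                                 trials[i] - successes[i] + 1)
                 for i in range(n)]
        wins[draws.index(max(draws))] += 1
    return [budget * w / n_samples for w in wins]

# Three ad sets with 1000 trials each; the second has the best historical
# conversion rate, so it should receive most (but not all) of the budget.
alloc = thompson_batch_allocation(successes=[20, 60, 30],
                                  trials=[1000, 1000, 1000],
                                  budget=100.0)
```

Because budget is split across all ad sets that still plausibly lead the posterior, the policy keeps exploring weaker ad sets exactly as long as the data leave room for doubt, which matches the observation that greedy-style policies with decaying exploration perform well.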