Asymptotically Optimal Algorithms for Budgeted Multiple Play Bandits
We study a generalization of the multi-armed bandit problem with multiple
plays where there is a cost associated with pulling each arm and the agent has
a budget at each time that dictates how much she can expect to spend. We derive
an asymptotic regret lower bound for any uniformly efficient algorithm in our
setting. We then study a variant of Thompson sampling for Bernoulli rewards and
a variant of KL-UCB for both single-parameter exponential families and bounded,
finitely supported rewards. We show these algorithms are asymptotically
optimal, both in rate and leading problem-dependent constants, including in the
thick margin setting where multiple arms fall on the decision boundary.
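The setting the abstract describes is straightforward to sketch. Below is a minimal, illustrative Thompson sampling loop for Bernoulli rewards under a per-round budget, assuming known fixed arm costs and a greedy fill by sampled reward-cost ratio; the class name and interface are my own, not the paper's exact algorithm.

```python
import numpy as np

class BudgetedThompsonSampling:
    """Illustrative Thompson sampling for Bernoulli rewards with known
    arm costs and a per-round budget (a sketch, not the paper's method)."""

    def __init__(self, costs, budget, rng=None):
        self.costs = np.asarray(costs, dtype=float)
        self.budget = budget
        self.alpha = np.ones(len(costs))  # Beta posterior: successes + 1
        self.beta = np.ones(len(costs))   # Beta posterior: failures + 1
        self.rng = rng or np.random.default_rng()

    def select(self):
        """Sample a posterior mean per arm, then greedily pick arms by
        sampled reward-to-cost ratio until the budget is exhausted."""
        theta = self.rng.beta(self.alpha, self.beta)
        chosen, spent = [], 0.0
        for a in np.argsort(-theta / self.costs):
            if spent + self.costs[a] <= self.budget:
                chosen.append(int(a))
                spent += self.costs[a]
        return chosen

    def update(self, arm, reward):
        """Update the Beta posterior with a Bernoulli reward in {0, 1}."""
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```

The multiple-play aspect enters through the greedy budget fill, which can select several arms in a single round.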
Profit maximization through budget allocation in display advertising
Online display advertising provides advertisers a unique opportunity to calculate real-time return on investment for advertising campaigns. Based on the target audiences, each advertising campaign is divided into sub-campaigns, called ad sets, each with its own return. Consequently, the advertiser faces an optimization problem: how to allocate the advertising budget across ad sets so that the total return on investment is maximized. The performance of each ad set is unknown to the advertiser beforehand. Thus the advertiser risks choosing a suboptimal ad set when allocating budget to the one assumed to be optimal. On the other hand, the advertiser wastes money when exploring the returns instead of allocating budget to the optimal ad set.
This exploration vs. exploitation dilemma is known from the so-called multi-armed bandit problem. The standard multi-armed bandit problem consists of a gambler and multiple slot machines, i.e., bandits. The gambler needs to balance exploring which of the bandits yields the highest rewards against maximizing the reward by playing the bandit with the highest return. I formalize the budget allocation problem faced by the online advertiser as a batched bandit problem, where the bandits have to be played in batches instead of one by one. Based on the previous literature, I propose several allocation policies to solve the budget allocation problem. In addition, I use an extensive real-world dataset from over 200 Facebook advertising campaigns to test the performance impact of the different allocation policies.
My empirical results give evidence that the return on investment of online advertising campaigns can be improved by dynamically allocating budget. So-called greedy algorithms, which allocate more of the budget to the ad set with the best historical average, seem to perform notably well. I show that performance can be further improved by decreasing the exploration budget over time. Another well-performing policy is Thompson sampling, which allocates budget by sampling return estimates from a prior distribution formed from historical returns. Upper confidence and probability policies, often proposed in the machine learning literature, do not seem to apply as well to this real-world resource allocation problem.
I also contribute to the previous literature by providing evidence that the advertiser should base budget allocation on observations of the real revenue-generating event (e.g. product purchase) instead of observations of more general events (e.g. ad clicks). In addition, my research gives evidence that the performance of the allocation policies depends on the number of observations available to the policy when making its decisions. This may be an issue in real-world applications where observations are scarce. I believe this issue is not unique to display advertising and consequently propose, as a future research topic, developing more robust batched bandit algorithms for resource allocation decisions where the rate of return is small.
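Since the thesis compares greedy, decreasing-exploration, and Thompson sampling policies, one batched allocation step is worth sketching. The interface below (conversion counts per ad set, a batch budget to split) is hypothetical and much simplified relative to the thesis's experiments.

```python
import numpy as np

def allocate_batch(successes, trials, budget, policy="thompson",
                   epsilon=0.1, rng=None):
    """Split one batch's budget across ad sets (hypothetical interface).

    successes/trials: observed conversions and impressions per ad set.
    Implements two of the policy families the thesis compares, in
    simplified form.
    """
    rng = rng or np.random.default_rng()
    successes = np.asarray(successes, dtype=float)
    trials = np.asarray(trials, dtype=float)
    k = len(trials)
    if policy == "greedy":
        # Exploit the best historical mean; spread a small epsilon share
        # evenly for exploration (epsilon can be decreased over time).
        means = successes / np.maximum(trials, 1.0)
        shares = np.full(k, epsilon / k)
        shares[np.argmax(means)] += 1.0 - epsilon
    elif policy == "thompson":
        # Sample conversion-rate estimates from Beta posteriors and
        # allocate in proportion to how often each ad set wins.
        draws = rng.beta(1 + successes, 1 + trials - successes,
                         size=(1000, k))
        wins = np.bincount(draws.argmax(axis=1), minlength=k)
        shares = wins / wins.sum()
    else:
        raise ValueError(f"unknown policy: {policy}")
    return budget * shares
```

Decreasing epsilon across batches corresponds to the dynamic-exploration variant the abstract reports as an improvement over plain greedy.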
Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals
We study the stochastic Budgeted Multi-Armed Bandit (MAB) problem, in which a
player chooses among arms with unknown expected rewards and costs. The goal
is to maximize the total reward under a budget constraint. A player thus seeks
to choose the arm with the highest reward-cost ratio as often as possible.
Current state-of-the-art policies for this problem have several issues, which
we illustrate. To overcome them, we propose a new upper confidence bound (UCB)
sampling policy, ω-UCB, that uses asymmetric confidence intervals. These
intervals scale with the distance between the sample mean and the bounds of a
random variable, yielding a more accurate and tighter estimation of the
reward-cost ratio than those of our competitors. We show that our approach has
logarithmic regret and consistently outperforms existing policies in synthetic
and real settings.
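The abstract does not spell out the interval construction, so the sketch below uses Wilson-score-style bounds as a stand-in: like the intervals described, they are asymmetric and tighten as the sample mean approaches the variable's bounds. The actual ω-UCB formula differs; this only illustrates the optimistic reward-cost-ratio index idea.

```python
import numpy as np

def wilson_bounds(mean, n, z=2.0, lo=0.0, hi=1.0):
    """Asymmetric confidence interval for a mean in [lo, hi].

    Wilson-score-style stand-in: the interval shrinks as the sample
    mean approaches either bound, unlike symmetric Hoeffding boxes.
    """
    p = (mean - lo) / (hi - lo)              # rescale to [0, 1]
    denom = n + z ** 2
    center = (n * p + z ** 2 / 2) / denom
    half = z * np.sqrt(n * p * (1 - p) + z ** 2 / 4) / denom
    return lo + (hi - lo) * (center - half), lo + (hi - lo) * (center + half)

def ratio_index(reward_mean, cost_mean, n, z=2.0):
    """Optimistic reward-cost ratio: reward UCB over cost LCB."""
    _, r_ucb = wilson_bounds(reward_mean, n, z)
    c_lcb, _ = wilson_bounds(cost_mean, n, z)
    return r_ucb / max(c_lcb, 1e-12)
```

A budgeted UCB policy would then play the arm maximizing `ratio_index` each round, charging the realized cost against the remaining budget.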
Unimodal Thompson Sampling for Graph-Structured Arms
We study, to the best of our knowledge, the first Bayesian algorithm for
unimodal Multi-Armed Bandit (MAB) problems with graph structure. In this
setting, each arm corresponds to a node of a graph and each edge provides a
relationship, unknown to the learner, between two nodes in terms of expected
reward. Furthermore, for any node of the graph there is a path leading to the
unique node providing the maximum expected reward, along which the expected
reward is monotonically increasing. Previous results on this setting describe
the behavior of frequentist MAB algorithms. In our paper, we design a Thompson
Sampling-based algorithm whose asymptotic pseudo-regret matches the lower bound
for the considered setting. We show that, as happens in a wide range of
scenarios, Bayesian MAB algorithms dramatically outperform frequentist ones. In
particular, we provide a thorough experimental evaluation of the performance of
our algorithm and of state-of-the-art algorithms as the properties of the graph vary.
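One plausible reading of the idea, mirroring the frequentist leader-based algorithms for unimodal graphs it builds on, is to restrict Thompson sampling to the empirical leader and its neighbors: unimodality guarantees a monotone path to the optimum, so local moves suffice. The sketch below follows that reading and is not the paper's exact procedure.

```python
import numpy as np

def unimodal_ts_step(neighbors, successes, trials, rng=None):
    """One round of a unimodal-TS-style policy (illustrative sketch).

    neighbors: dict mapping each node to its adjacent nodes.
    Thompson sampling is restricted to the current empirical leader
    and its graph neighborhood.
    """
    rng = rng or np.random.default_rng()
    means = successes / np.maximum(trials, 1.0)
    leader = int(np.argmax(means))
    candidates = [leader] + list(neighbors[leader])
    # Beta-Bernoulli posterior sample for the candidate arms only.
    samples = {a: rng.beta(1 + successes[a], 1 + trials[a] - successes[a])
               for a in candidates}
    return max(samples, key=samples.get)
```

Repeating this step and updating `successes`/`trials` with the observed Bernoulli rewards yields the full policy.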
On the Prior Sensitivity of Thompson Sampling
The empirically successful Thompson Sampling algorithm for stochastic bandits
has drawn much interest in understanding its theoretical properties. One
important benefit of the algorithm is that it allows domain knowledge to be
conveniently encoded as a prior distribution to balance exploration and
exploitation more effectively. While it is generally believed that the
algorithm's regret is low (high) when the prior is good (bad), little is known
about the exact dependence. In this paper, we fully characterize the
algorithm's worst-case dependence of regret on the choice of prior, focusing on
a special yet representative case. These results also provide insights into the
general sensitivity of the algorithm to the choice of priors. In particular,
with p denoting the prior probability mass of the true reward-generating model,
we prove regret upper bounds, as functions of p, for the bad- and good-prior
cases, respectively, as well as matching lower
bounds. Our proofs rely on the discovery of a fundamental property of Thompson
Sampling and make heavy use of martingale theory, both of which appear novel in
the literature, to the best of our knowledge.
Comment: Appears in the 27th International Conference on Algorithmic Learning
Theory (ALT), 2016.
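To see the dependence the abstract characterizes, one can run Thompson sampling over a finite set of candidate reward models and vary the prior mass placed on the true one. The toy simulation below (a hypothetical setup, not the paper's construction) makes the good-prior versus bad-prior contrast concrete.

```python
import numpy as np

def ts_finite_models(models, prior, true_model, horizon, rng=None):
    """Thompson sampling over a finite model class (toy illustration).

    models: list of mean-reward vectors, one per candidate model, with
    means strictly inside (0, 1). prior: probability mass per model;
    the sensitivity result concerns the mass on the true model.
    Returns cumulative pseudo-regret under Bernoulli rewards.
    """
    rng = rng or np.random.default_rng()
    posterior = np.array(prior, dtype=float)
    true_means = np.asarray(models[true_model], dtype=float)
    mu_star = true_means.max()
    regret = 0.0
    for _ in range(horizon):
        m = rng.choice(len(models), p=posterior)   # sample a model...
        arm = int(np.argmax(models[m]))            # ...and act greedily on it
        reward = rng.random() < true_means[arm]    # Bernoulli observation
        regret += mu_star - true_means[arm]
        # Bayes update: likelihood of the observed reward under each model.
        mus = np.array([mdl[arm] for mdl in models], dtype=float)
        posterior *= mus if reward else 1 - mus
        posterior /= posterior.sum()
    return regret
```

Sweeping, say, `prior = [p, 1 - p]` over small and large p for a two-model instance exhibits the regret gap between the bad- and good-prior regimes.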