
    Asymptotically Optimal Algorithms for Budgeted Multiple Play Bandits

    We study a generalization of the multi-armed bandit problem with multiple plays where there is a cost associated with pulling each arm and the agent has a budget at each time step that dictates how much she can expect to spend. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in our setting. We then study a variant of Thompson sampling for Bernoulli rewards and a variant of KL-UCB for both single-parameter exponential families and bounded, finitely supported rewards. We show these algorithms are asymptotically optimal, both in rate and in the leading problem-dependent constants, including in the thick margin setting where multiple arms fall on the decision boundary.
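    As an illustration of the kind of policy described above, the following is a minimal sketch of budgeted multiple-play Thompson sampling with Bernoulli rewards. The known arm costs, the fixed per-round budget, and the greedy ratio-based selection within the budget are assumptions made for the example, not necessarily the paper's exact allocation rule.

```python
# Hypothetical sketch of budgeted multiple-play Thompson sampling with
# Bernoulli rewards. Arm costs, the per-round budget B, and the greedy
# knapsack-style selection are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

K = 5                                   # number of arms
true_means = rng.uniform(0.1, 0.9, K)   # unknown Bernoulli reward means
costs = rng.uniform(0.5, 1.5, K)        # known pull costs
B = 2.0                                 # per-round budget

alpha = np.ones(K)                      # Beta posterior parameters
beta = np.ones(K)

for t in range(10_000):
    theta = rng.beta(alpha, beta)       # posterior sample for each arm
    # Greedily fill the budget with the best sampled reward/cost ratios.
    order = np.argsort(-theta / costs)
    remaining, chosen = B, []
    for i in order:
        if costs[i] <= remaining:
            chosen.append(i)
            remaining -= costs[i]
    for i in chosen:                    # pull the chosen arms, update posteriors
        r = rng.binomial(1, true_means[i])
        alpha[i] += r
        beta[i] += 1 - r
```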

    Profit maximization through budget allocation in display advertising

    Online display advertising provides advertisers a unique opportunity to calculate real-time return on investment for advertising campaigns. Based on the target audiences, each advertising campaign is divided into sub-campaigns, called ad sets, each with its individual return. Consequently, the advertiser faces an optimization problem: how to allocate the advertising budget across ad sets so that the total return on investment is maximized. The performance of each ad set is unknown to the advertiser beforehand. Thus the advertiser risks choosing a suboptimal ad set if the budget is allocated to the one assumed to be optimal. On the other hand, the advertiser wastes money when exploring the returns instead of allocating budget to the optimal ad set. This exploration vs. exploitation dilemma is known from the so-called multi-armed bandit problem. The standard multi-armed bandit problem consists of a gambler and multiple gambling-slot machines, i.e. bandits. The gambler needs to balance exploring which of the bandits has the highest reward against maximizing the reward by playing the bandit with the highest return.

    I formalize the budget allocation problem faced by the online advertiser as a batched bandit problem, where the bandits have to be played in batches instead of one by one. Based on the previous literature, I propose several allocation policies to solve the budget allocation problem. In addition, I use an extensive real-world dataset from over 200 Facebook advertising campaigns to test the performance impact of the different allocation policies.

    My empirical results give evidence that the return on investment of online advertising campaigns can be improved by dynamically allocating budget. So-called greedy algorithms, which allocate more of the budget to the ad set with the best historical average, seem to perform notably well, and I show that performance can be further improved by decreasing the exploration budget over time. Another well-performing policy is Thompson sampling, which allocates budget by sampling return estimates from a prior distribution formed from historical returns. Upper confidence and probability policies, often proposed in the machine learning literature, do not seem to apply as well to this real-world resource allocation problem. I also contribute to the previous literature by providing evidence that the advertiser should base the budget allocation on observations of the real revenue-generating event (e.g. product purchase) instead of observations of more general events (e.g. ad clicks). In addition, my research gives evidence that the performance of the allocation policies depends on the number of observations available to the policy when making the decision, which may be an issue in real-world applications where observations are scarce. I believe this issue is not unique to display advertising and consequently propose, as a future research topic, developing more robust batched bandit algorithms for resource allocation decisions where the rate of return is small.
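    To make the batched allocation concrete, below is a minimal sketch of a greedy policy whose exploration share decays over time, one of the policy families discussed above. The ad-set revenue model, the daily budget, and the decay schedule are illustrative assumptions, not the thesis's exact implementation.

```python
# Hypothetical sketch of batched budget allocation across ad sets using a
# greedy rule with a decaying exploration share. All quantities (number of
# ad sets, ROI model, decay schedule) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

n_ad_sets = 4
true_roi = rng.uniform(0.5, 1.5, n_ad_sets)   # unknown return per unit spend
daily_budget = 100.0
n_days = 60

spend = np.zeros(n_ad_sets)
revenue = np.zeros(n_ad_sets)

for day in range(n_days):
    explore_share = 1.0 / (day + 1)           # shrink exploration budget over time
    averages = np.divide(revenue, spend, out=np.ones(n_ad_sets), where=spend > 0)
    best = int(np.argmax(averages))

    # Split the day's budget: most to the current best, the rest spread evenly.
    allocation = np.full(n_ad_sets, explore_share * daily_budget / n_ad_sets)
    allocation[best] += (1.0 - explore_share) * daily_budget

    # Observe noisy returns for the day's spend and update the statistics.
    observed = allocation * true_roi * rng.lognormal(0.0, 0.2, n_ad_sets)
    spend += allocation
    revenue += observed
```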

    Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals

    We study the stochastic Budgeted Multi-Armed Bandit (MAB) problem, where a player chooses from $K$ arms with unknown expected rewards and costs. The goal is to maximize the total reward under a budget constraint. A player thus seeks to choose the arm with the highest reward-cost ratio as often as possible. Current state-of-the-art policies for this problem have several issues, which we illustrate. To overcome them, we propose a new upper confidence bound (UCB) sampling policy, $\omega$-UCB, that uses asymmetric confidence intervals. These intervals scale with the distance between the sample mean and the bounds of the random variable, yielding a more accurate and tighter estimation of the reward-cost ratio than that of our competitors. We show that our approach has logarithmic regret and consistently outperforms existing policies in synthetic and real settings.
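    The sketch below illustrates the general idea of a UCB index for the reward-cost ratio built from asymmetric confidence intervals. It uses a Wilson-score-style interval, whose width shrinks as the sample mean approaches the bounds of the support, purely as a stand-in; the paper's $\omega$-UCB interval differs in its exact form.

```python
# Illustrative sketch of a UCB-style policy for the budgeted bandit that uses
# an asymmetric, Wilson-score-style interval as a stand-in for the paper's
# omega-UCB confidence interval. Rewards and costs are assumed to lie in
# [0, 1]; all names and parameters below are hypothetical.
import math
import numpy as np

def wilson_bounds(mean, n, z):
    """Asymmetric bounds on a [0, 1]-valued mean; tighter near the boundaries."""
    if n == 0:
        return 0.0, 1.0
    denom = 1.0 + z * z / n
    center = (mean + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(mean * (1 - mean) / n + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)

rng = np.random.default_rng(2)
K = 4
reward_means = rng.uniform(0.2, 0.8, K)
cost_means = rng.uniform(0.2, 0.8, K)

pulls = np.zeros(K)
reward_sums = np.zeros(K)
cost_sums = np.zeros(K)
budget = 2_000.0

t = 0
while budget > 0:
    t += 1
    z = math.sqrt(2.0 * math.log(t + 1))       # exploration level, grows slowly
    index = np.empty(K)
    for i in range(K):
        r_hat = reward_sums[i] / pulls[i] if pulls[i] else 0.0
        c_hat = cost_sums[i] / pulls[i] if pulls[i] else 0.0
        _, r_ucb = wilson_bounds(r_hat, pulls[i], z)
        c_lcb, _ = wilson_bounds(c_hat, pulls[i], z)
        index[i] = r_ucb / max(c_lcb, 1e-6)     # optimistic reward-cost ratio
    arm = int(np.argmax(index))
    r = float(rng.random() < reward_means[arm])
    c = float(rng.random() < cost_means[arm])
    pulls[arm] += 1
    reward_sums[arm] += r
    cost_sums[arm] += c
    budget -= c
```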

    Unimodal Thompson Sampling for Graph-Structured Arms

    We study, to the best of our knowledge, the first Bayesian algorithm for unimodal Multi-Armed Bandit (MAB) problems with graph structure. In this setting, each arm corresponds to a node of a graph and each edge provides a relationship, unknown to the learner, between two nodes in terms of expected reward. Furthermore, from any node of the graph there is a path leading to the unique node providing the maximum expected reward, along which the expected reward is monotonically increasing. Previous results on this setting describe the behavior of frequentist MAB algorithms. In our paper, we design a Thompson Sampling-based algorithm whose asymptotic pseudo-regret matches the lower bound for the considered setting. We show that, as happens in a wide range of scenarios, Bayesian MAB algorithms dramatically outperform frequentist ones. In particular, we provide a thorough experimental evaluation of the performance of our algorithm and of state-of-the-art algorithms as the properties of the graph vary.
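    A minimal sketch of the neighborhood-restricted Thompson sampling idea follows, assuming Bernoulli rewards on a small line graph with a unimodal mean profile. The leader rule and the restriction of posterior sampling to the leader's neighborhood reflect the high-level description above; details such as forced plays of the leader are simplified away in this assumption-laden example.

```python
# Minimal sketch of Thompson sampling restricted to the empirical leader's
# graph neighborhood, on a line graph with a unimodal expected reward.
# The leader rule and the absence of forced plays are simplifications.
import numpy as np

rng = np.random.default_rng(3)

# Line graph 0 - 1 - 2 - 3 - 4 with a unimodal reward peaking at node 2.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
true_means = np.array([0.2, 0.4, 0.7, 0.5, 0.3])
K = len(true_means)

alpha = np.ones(K)                              # Beta posterior parameters
beta = np.ones(K)

for t in range(5_000):
    means = alpha / (alpha + beta)
    leader = int(np.argmax(means))              # empirical best node
    candidates = [leader] + neighbors[leader]   # restrict to its neighborhood
    samples = {i: rng.beta(alpha[i], beta[i]) for i in candidates}
    arm = max(samples, key=samples.get)         # Thompson draw among candidates
    r = rng.binomial(1, true_means[arm])
    alpha[arm] += r
    beta[arm] += 1 - r
```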

    On the Prior Sensitivity of Thompson Sampling

    The empirically successful Thompson Sampling algorithm for stochastic bandits has drawn much interest in understanding its theoretical properties. One important benefit of the algorithm is that it allows domain knowledge to be conveniently encoded as a prior distribution to balance exploration and exploitation more effectively. While it is generally believed that the algorithm's regret is low (high) when the prior is good (bad), little is known about the exact dependence. In this paper, we fully characterize the algorithm's worst-case dependence of regret on the choice of prior, focusing on a special yet representative case. These results also provide insights into the general sensitivity of the algorithm to the choice of priors. In particular, with $p$ being the prior probability mass of the true reward-generating model, we prove $O(\sqrt{T/p})$ and $O(\sqrt{(1-p)T})$ regret upper bounds for the bad- and good-prior cases, respectively, as well as matching lower bounds. Our proofs rely on the discovery of a fundamental property of Thompson Sampling and make heavy use of martingale theory, both of which appear novel in the literature, to the best of our knowledge.
    Comment: Appears in the 27th International Conference on Algorithmic Learning Theory (ALT), 2016.
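    The simulation below illustrates the kind of sensitivity being characterized: Thompson sampling over a finite set of candidate reward models, with prior mass $p$ on the true model. The two-model, two-arm construction is a hypothetical example for intuition, not the paper's worst-case instance.

```python
# Illustrative simulation of how Thompson sampling's regret varies with the
# prior mass p placed on the true reward-generating model. The two-model,
# two-arm construction below is a hypothetical example.
import numpy as np

rng = np.random.default_rng(4)

# Two candidate models; each row gives the Bernoulli means of the two arms.
models = np.array([[0.8, 0.2],    # model 0 (the true one)
                   [0.2, 0.8]])   # model 1
true_model = 0
horizon = 2_000

def run(p):
    """Run Thompson sampling with prior mass p on the true model; return regret."""
    log_post = np.log(np.array([p, 1.0 - p]))
    regret = 0.0
    best_mean = models[true_model].max()
    for _ in range(horizon):
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        m = rng.choice(2, p=post)               # sample a model from the posterior
        arm = int(np.argmax(models[m]))         # play that model's best arm
        reward = rng.binomial(1, models[true_model, arm])
        # Bayesian update of the model posterior given the observed reward.
        lik = models[:, arm] if reward else 1.0 - models[:, arm]
        log_post += np.log(lik)
        regret += best_mean - models[true_model, arm]
    return regret

for p in (0.01, 0.5, 0.99):
    print(f"p = {p:4.2f}  cumulative regret ~ {run(p):.1f}")
```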