
    Dynamic Ad Allocation: Bandits with Budgets

    We consider an application of multi-armed bandits to internet advertising (specifically, to dynamic ad allocation in the pay-per-click model, with uncertainty on the click probabilities). We focus on an important practical issue: advertisers are constrained in how much money they can spend on their ad campaigns. This issue has not been considered in prior work on bandit-based approaches to ad allocation, to the best of our knowledge. We define a simple, stylized model in which an algorithm picks one ad to display in each round, and each ad has a \emph{budget}: the maximal amount of money that can be spent on this ad. This model admits a natural variant of UCB1, a well-known algorithm for multi-armed bandits with stochastic rewards. We derive strong provable guarantees for this algorithm.
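
    The abstract only names a "natural variant of UCB1" without spelling it out, so the sketch below is a hypothetical reading for intuition: standard UCB1 indices over ads, where an ad stops being shown once its remaining budget can no longer cover another click. The click probabilities, unit cost per click, and function names are illustrative assumptions, not the paper's exact algorithm or analysis.

    ```python
    import math
    import random

    def budgeted_ucb1(click_probs, budgets, cost_per_click=1.0, horizon=10_000):
        """Illustrative UCB1 variant with per-ad budgets (sketch only; the
        paper's algorithm and guarantees may differ in details)."""
        k = len(click_probs)
        counts = [0] * k            # times each ad was displayed
        clicks = [0] * k            # clicks observed per ad
        remaining = list(budgets)   # remaining budget per ad

        def index(i, t):
            if counts[i] == 0:
                return float("inf")             # force one initial display
            mean = clicks[i] / counts[i]
            bonus = math.sqrt(2 * math.log(t) / counts[i])
            return mean + bonus

        total_clicks = 0
        for t in range(1, horizon + 1):
            # Only ads whose budget still covers another click are eligible.
            alive = [i for i in range(k) if remaining[i] >= cost_per_click]
            if not alive:
                break                           # all budgets exhausted
            arm = max(alive, key=lambda i: index(i, t))
            counts[arm] += 1
            if random.random() < click_probs[arm]:   # pay-per-click feedback
                clicks[arm] += 1
                remaining[arm] -= cost_per_click     # money is spent only on a click
                total_clicks += 1
        return total_clicks

    # Example: three ads with different click probabilities and budgets.
    print(budgeted_ucb1([0.05, 0.10, 0.02], budgets=[50, 20, 100]))
    ```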

    Contextual Bandits with Cross-learning

    In the classical contextual bandits problem, in each round $t$ a learner observes some context $c$, chooses some action $a$ to perform, and receives some reward $r_{a,t}(c)$. We consider the variant of this problem where, in addition to receiving the reward $r_{a,t}(c)$, the learner also learns the values of $r_{a,t}(c')$ for all other contexts $c'$; i.e., the rewards that would have been achieved by performing that action under different contexts. This variant arises in several strategic settings, such as learning how to bid in non-truthful repeated auctions (in this setting the context is the decision maker's private valuation for each auction). We call this problem the contextual bandits problem with cross-learning. The best algorithms for the classical contextual bandits problem achieve $\tilde{O}(\sqrt{CKT})$ regret against all stationary policies, where $C$ is the number of contexts, $K$ the number of actions, and $T$ the number of rounds. We demonstrate algorithms for the contextual bandits problem with cross-learning that remove the dependence on $C$ and achieve regret $O(\sqrt{KT})$ (when contexts are stochastic with known distribution), $\tilde{O}(K^{1/3}T^{2/3})$ (when contexts are stochastic with unknown distribution), and $\tilde{O}(\sqrt{KT})$ (when contexts are adversarial but rewards are stochastic). Comment: 48 pages, 5 figures.
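
    As a rough illustration of the cross-learning feedback structure in the stochastic-context case, the sketch below keeps a per-(context, action) mean estimate while all contexts share a single per-action pull count, since one pull reveals a reward sample for every context. This is an assumed minimal scheme (UCB indices on the realized context's estimates) meant only to show how the extra feedback is used; it is not the paper's algorithm, and it does not reproduce the stated regret bounds.

    ```python
    import math
    import random

    def cross_learning_ucb(reward_means, context_dist, horizon=10_000):
        """Sketch of cross-learning: playing action a reveals a reward sample
        for every context, so all per-context estimates of a are updated."""
        num_contexts = len(reward_means)          # reward_means[c][a] = mean of r_a(c)
        num_actions = len(reward_means[0])
        pulls = [0] * num_actions                 # shared per-action pull counts
        est = [[0.0] * num_actions for _ in range(num_contexts)]

        total_reward = 0.0
        contexts = list(range(num_contexts))
        for t in range(1, horizon + 1):
            c = random.choices(contexts, weights=context_dist)[0]   # stochastic context

            def ucb(a):
                if pulls[a] == 0:
                    return float("inf")
                return est[c][a] + math.sqrt(2 * math.log(t) / pulls[a])

            a = max(range(num_actions), key=ucb)
            # Cross-learning feedback: a reward sample for action a in ALL contexts.
            samples = [float(random.random() < reward_means[cc][a])
                       for cc in range(num_contexts)]
            total_reward += samples[c]            # only the realized context pays off
            pulls[a] += 1
            for cc in range(num_contexts):        # incremental mean update per context
                est[cc][a] += (samples[cc] - est[cc][a]) / pulls[a]
        return total_reward

    # Example: 3 contexts, 2 actions, uniform context distribution.
    means = [[0.2, 0.5], [0.6, 0.4], [0.3, 0.3]]
    print(cross_learning_ucb(means, context_dist=[1/3, 1/3, 1/3]))
    ```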