Dynamic Ad Allocation: Bandits with Budgets

Author: Slivkins Aleksandrs
Publication date: 01/06/2013

We consider an application of multi-armed bandits to internet advertising (specifically, to dynamic ad allocation in the pay-per-click model, with uncertainty on the click probabilities). We focus on an important practical issue: advertisers are constrained in how much money they can spend on their ad campaigns. To the best of our knowledge, this issue has not been considered in prior work on bandit-based approaches for ad allocation. We define a simple, stylized model where an algorithm picks one ad to display in each round, and each ad has a budget: the maximal amount of money that can be spent on this ad. This model admits a natural variant of UCB1, a well-known algorithm for multi-armed bandits with stochastic rewards. We derive strong provable guarantees for this algorithm.
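The stylized model in this abstract (one ad per round, a per-ad budget, unknown click probabilities) can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the paper's algorithm or analysis: the function name `budgeted_ucb1`, the per-click payment `bids`, and the exact form of the index are illustrative choices. It uses the standard UCB1 confidence bonus and simply retires an ad once it can no longer pay for another click.

```python
import math
import random


def budgeted_ucb1(click_probs, bids, budgets, rounds, seed=0):
    """Simulate a UCB1-style ad allocator with per-ad budgets.

    click_probs: true click probabilities (unknown to the allocator).
    bids: assumed payment per click for each ad (hypothetical parameter).
    budgets: maximal total spend per ad.
    Returns (total revenue, list of per-ad spend).
    """
    rng = random.Random(seed)
    n = len(click_probs)
    shows = [0] * n      # times each ad was displayed
    clicks = [0] * n     # clicks each ad received
    spent = [0.0] * n    # money charged to each ad so far
    revenue = 0.0

    for t in range(1, rounds + 1):
        # An ad stays eligible only while it can afford one more click.
        active = [i for i in range(n) if spent[i] + bids[i] <= budgets[i]]
        if not active:
            break

        def index(i):
            if shows[i] == 0:
                return float("inf")  # display every ad at least once
            mean = clicks[i] / shows[i]
            bonus = math.sqrt(2.0 * math.log(t) / shows[i])
            # Optimistic estimate of expected payment per display.
            return bids[i] * min(1.0, mean + bonus)

        chosen = max(active, key=index)
        shows[chosen] += 1
        if rng.random() < click_probs[chosen]:
            clicks[chosen] += 1
            spent[chosen] += bids[chosen]
            revenue += bids[chosen]

    return revenue, spent
```

Filtering the candidate set by remaining budget before computing indices is the simplest way to respect the constraint; whether this matches the variant analyzed in the paper is an assumption.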

arXiv.org e-Print Archive

CiteSeerX

Model-Based Reinforcement Learning with Multinomial Logistic Function Approximation

Author: Hwang Taehyun
Oh Min-hwan
Publication date: 27/12/2022

We study model-based reinforcement learning (RL) for episodic Markov decision processes (MDP) whose transition probability is parametrized by an unknown transition core with features of state and action. Despite much recent progress in analyzing algorithms in the linear MDP setting, the understanding of more general transition models is very restrictive. In this paper, we establish a provably efficient RL algorithm for the MDP whose state transition is given by a multinomial logistic model. To balance the exploration-exploitation trade-off, we propose an upper confidence bound-based algorithm. We show that our proposed algorithm achieves a $\tilde{\mathcal{O}}(d \sqrt{H^3 T})$ regret bound, where $d$ is the dimension of the transition core, $H$ is the horizon, and $T$ is the total number of steps. To the best of our knowledge, this is the first model-based RL algorithm with multinomial logistic function approximation with provable guarantees. We also comprehensively evaluate our proposed algorithm numerically and show that it consistently outperforms the existing methods, hence achieving both provable efficiency and practical superior performance. Comment: Accepted in AAAI 2023 (Main Technical Track)
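To make the multinomial logistic transition model concrete, here is a minimal sketch of how next-state probabilities could be computed from features and a parameter vector. It assumes one feature vector per candidate next state and a softmax over their inner products with the parameter vector; the names `mnl_transition_probs`, `features`, and `theta` are illustrative and not taken from the paper.

```python
import math


def mnl_transition_probs(features, theta):
    """Return a softmax distribution over next states.

    features: dict mapping each reachable next state to its feature vector
              for the current (state, action) pair (assumed layout).
    theta: parameter vector standing in for the transition core.
    """
    logits = {
        s: sum(f * w for f, w in zip(phi, theta))
        for s, phi in features.items()
    }
    # Subtract the maximum logit for numerical stability.
    top = max(logits.values())
    weights = {s: math.exp(z - top) for s, z in logits.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}
```

For example, with `theta` set to all zeros every reachable next state receives the same probability, since all logits coincide.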

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Reward-Biased Maximum Likelihood Estimation for Linear Stochastic Bandits

Author: Hsieh Ping-Chun
Hung Yu-Heng
Kumar P. R.
Liu Xi
Publication date: 08/10/2020

Modifying the reward-biased maximum likelihood method originally proposed in the adaptive control literature, we propose novel learning algorithms to handle the explore-exploit trade-off in linear bandit problems as well as generalized linear bandit problems. We develop novel index policies that we prove achieve order-optimality, and show that they achieve empirical performance competitive with the state-of-the-art benchmark methods in extensive experiments. The new policies achieve this with low computation time per pull for linear bandits, thereby yielding both favorable regret and computational efficiency.

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms

Author: Chu Wei
Langford John
Li Lihong
Wang Xuanhui
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2011

Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences but very challenging due to their "partial-label" nature. Common practice is to create a simulator which simulates the online environment for the problem at hand and then run an algorithm against this simulator. However, creating the simulator itself is often difficult, and modeling bias is usually unavoidable. In this paper, we introduce a replay methodology for contextual bandit algorithm evaluation. Unlike simulator-based approaches, our method is completely data-driven and very easy to adapt to different applications. More importantly, our method can provide provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from Yahoo! Front Page conform well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluation of several contextual bandit algorithms show the accuracy and effectiveness of our offline evaluation method. Comment: 10 pages, 7 figures, revised from the published version at the WSDM 2011 conference
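The replay idea can be sketched in a few lines: stream through logged events, keep an event only when the policy under evaluation picks the same action that was logged, and average the rewards of the kept events. The sketch below assumes the log was collected by choosing actions uniformly at random, which is the condition under which this kind of estimate is unbiased; the names `replay_evaluate` and `policy`, and the tuple layout of the log, are illustrative.

```python
def replay_evaluate(policy, logged_events):
    """Estimate a policy's average reward from logged bandit data.

    policy: callable (history, context) -> action, hypothetical interface.
    logged_events: iterable of (context, logged_action, reward) tuples,
                   assumed to be logged under uniformly random actions.
    Returns (estimated reward per round, number of matched events).
    """
    history = []        # events the policy has been allowed to learn from
    total_reward = 0.0
    matched = 0

    for context, logged_action, reward in logged_events:
        if policy(history, context) == logged_action:
            # The logged reward is valid feedback for this choice: keep it.
            history.append((context, logged_action, reward))
            total_reward += reward
            matched += 1
        # Otherwise the reward of the policy's choice is unknown: skip.

    if matched == 0:
        return None, 0
    return total_reward / matched, matched
```

Only matched events are appended to `history`, so a learning policy sees exactly the feedback it would have received online. With K actions and a uniformly random log, roughly one event in K is expected to match.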

arXiv.org e-Print Archive

CiteSeerX

Crossref

Fixed-Budget Best-Arm Identification in Contextual Bandits: A Static-Adaptive Algorithm

Author: Azizi Mohammad Javad
Ghavamzadeh Mohammad
Kveton Branislav
Publication date: 08/07/2021

We study the problem of best-arm identification (BAI) in contextual bandits in the fixed-budget setting. We propose a general successive elimination algorithm that proceeds in stages and eliminates a fixed fraction of suboptimal arms in each stage. This design takes advantage of the strengths of static and adaptive allocations. We analyze the algorithm in linear models and obtain a better error bound than prior work. We also apply it to generalized linear models (GLMs) and bound its error. This is the first BAI algorithm for GLMs in the fixed-budget setting. Our extensive numerical experiments show that our algorithm outperforms the state of the art. Comment: 23 pages
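The staged structure described in this abstract (split the budget across stages, drop a fixed fraction of arms after each stage) can be sketched for the simplest possible case. This is a deliberately simplified illustration, assuming no contexts and plain empirical means instead of the linear or GLM estimates the paper analyzes; `staged_elimination`, `pull`, and `keep_fraction` are hypothetical names.

```python
import math


def staged_elimination(pull, n_arms, budget, keep_fraction=0.5):
    """Fixed-budget best-arm identification by staged elimination.

    pull: callable arm -> observed reward (hypothetical environment hook).
    budget: total number of pulls available.
    keep_fraction: fraction of surviving arms kept after each stage.
    Returns the index of the single remaining arm.
    """
    # Plan how many arms survive each stage, always dropping at least one.
    sizes = [n_arms]
    while sizes[-1] > 1:
        keep = math.ceil(sizes[-1] * keep_fraction)
        sizes.append(max(1, min(sizes[-1] - 1, keep)))

    n_stages = len(sizes) - 1
    survivors = list(range(n_arms))
    if n_stages == 0:
        return survivors[0]

    per_stage = budget // n_stages  # equal budget share per stage
    for stage in range(n_stages):
        # Static allocation within a stage: same pulls for each survivor.
        pulls_each = max(1, per_stage // len(survivors))
        means = {
            arm: sum(pull(arm) for _ in range(pulls_each)) / pulls_each
            for arm in survivors
        }
        # Adaptive step between stages: keep the empirically best arms.
        ranked = sorted(survivors, key=means.get, reverse=True)
        survivors = ranked[: sizes[stage + 1]]

    return survivors[0]
```

Allocation is fixed inside each stage and adapts only between stages, which is one way to read the "static-adaptive" combination the abstract mentions.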

arXiv.org e-Print Archive

From Bandits to Experts: On the Value of Side-Observations

Author: Mannor Shie
Shamir Ohad
Publication date: 01/01/2011

We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known "experts" setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. Comment: Presented at the NIPS 2011 conference
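One way to see how a feedback graph interpolates between the experts and bandit settings is through an importance-weighted reward estimate: each observed reward is divided by the probability that it would have been observed at all. The sketch below shows that generic estimator only, not the paper's full algorithms; the function name and the `observes` encoding of the graph are assumptions made for illustration.

```python
def graph_feedback_estimates(probs, observes, chosen, rewards):
    """Importance-weighted reward estimates under graph feedback.

    probs: sampling probability of each arm in this round.
    observes: observes[i] is the set of arms whose reward is revealed
              when arm i is played, including i itself (assumed encoding).
    chosen: the arm actually played.
    rewards: this round's rewards; only revealed entries are read.
    """
    n = len(probs)
    estimates = [0.0] * n
    for j in observes[chosen]:
        # Probability that arm j's reward is revealed this round.
        reveal_prob = sum(probs[k] for k in range(n) if j in observes[k])
        estimates[j] = rewards[j] / reveal_prob
    return estimates
```

When every arm reveals only itself, this reduces to the usual bandit estimate (reward divided by the arm's own probability); when every arm reveals all arms, the divisor is 1 and the estimates equal the full reward vector, as in the experts setting.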

arXiv.org e-Print Archive

CiteSeerX