208 research outputs found
Dynamic Ad Allocation: Bandits with Budgets
We consider an application of multi-armed bandits to internet advertising
(specifically, to dynamic ad allocation in the pay-per-click model, with
uncertainty on the click probabilities). We focus on an important practical
issue that advertisers are constrained in how much money they can spend on
their ad campaigns. This issue has not been considered in the prior work on
bandit-based approaches for ad allocation, to the best of our knowledge.
We define a simple, stylized model where an algorithm picks one ad to display
in each round, and each ad has a \emph{budget}: the maximal amount of money
that can be spent on this ad. This model admits a natural variant of UCB1, a
well-known algorithm for multi-armed bandits with stochastic rewards. We derive
strong provable guarantees for this algorithm
Model-Based Reinforcement Learning with Multinomial Logistic Function Approximation
We study model-based reinforcement learning (RL) for episodic Markov decision
processes (MDP) whose transition probability is parametrized by an unknown
transition core with features of state and action. Despite much recent progress
in analyzing algorithms in the linear MDP setting, the understanding of more
general transition models is very restrictive. In this paper, we establish a
provably efficient RL algorithm for the MDP whose state transition is given by
a multinomial logistic model. To balance the exploration-exploitation
trade-off, we propose an upper confidence bound-based algorithm. We show that
our proposed algorithm achieves regret
bound where is the dimension of the transition core, is the horizon,
and is the total number of steps. To the best of our knowledge, this is the
first model-based RL algorithm with multinomial logistic function approximation
with provable guarantees. We also comprehensively evaluate our proposed
algorithm numerically and show that it consistently outperforms the existing
methods, hence achieving both provable efficiency and practical superior
performance.Comment: Accepted in AAAI 2023 (Main Technical Track
Reward-Biased Maximum Likelihood Estimation for Linear Stochastic Bandits
Modifying the reward-biased maximum likelihood method originally proposed in
the adaptive control literature, we propose novel learning algorithms to handle
the explore-exploit trade-off in linear bandits problems as well as generalized
linear bandits problems. We develop novel index policies that we prove achieve
order-optimality, and show that they achieve empirical performance competitive
with the state-of-the-art benchmark methods in extensive experiments. The new
policies achieve this with low computation time per pull for linear bandits,
and thereby resulting in both favorable regret as well as computational
efficiency
Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms
Contextual bandit algorithms have become popular for online recommendation
systems such as Digg, Yahoo! Buzz, and news recommendation in general.
\emph{Offline} evaluation of the effectiveness of new algorithms in these
applications is critical for protecting online user experiences but very
challenging due to their "partial-label" nature. Common practice is to create a
simulator which simulates the online environment for the problem at hand and
then run an algorithm against this simulator. However, creating simulator
itself is often difficult and modeling bias is usually unavoidably introduced.
In this paper, we introduce a \emph{replay} methodology for contextual bandit
algorithm evaluation. Different from simulator-based approaches, our method is
completely data-driven and very easy to adapt to different applications. More
importantly, our method can provide provably unbiased evaluations. Our
empirical results on a large-scale news article recommendation dataset
collected from Yahoo! Front Page conform well with our theoretical results.
Furthermore, comparisons between our offline replay and online bucket
evaluation of several contextual bandit algorithms show accuracy and
effectiveness of our offline evaluation method.Comment: 10 pages, 7 figures, revised from the published version at the WSDM
2011 conferenc
Fixed-Budget Best-Arm Identification in Contextual Bandits: A Static-Adaptive Algorithm
We study the problem of best-arm identification (BAI) in contextual bandits
in the fixed-budget setting. We propose a general successive elimination
algorithm that proceeds in stages and eliminates a fixed fraction of suboptimal
arms in each stage. This design takes advantage of the strengths of static and
adaptive allocations. We analyze the algorithm in linear models and obtain a
better error bound than prior work. We also apply it to generalized linear
models (GLMs) and bound its error. This is the first BAI algorithm for GLMs in
the fixed-budget setting. Our extensive numerical experiments show that our
algorithm outperforms the state of art.Comment: 23 page
From Bandits to Experts: On the Value of Side-Observations
We consider an adversarial online learning setting where a decision maker can
choose an action in every stage of the game. In addition to observing the
reward of the chosen action, the decision maker gets side observations on the
reward he would have obtained had he chosen some of the other actions. The
observation structure is encoded as a graph, where node i is linked to node j
if sampling i provides information on the reward of j. This setting naturally
interpolates between the well-known "experts" setting, where the decision maker
can view all rewards, and the multi-armed bandits setting, where the decision
maker can only view the reward of the chosen action. We develop practical
algorithms with provable regret guarantees, which depend on non-trivial
graph-theoretic properties of the information feedback structure. We also
provide partially-matching lower bounds.Comment: Presented at the NIPS 2011 conferenc
- …