24,311 research outputs found
Thompson Sampling in Dynamic Systems for Contextual Bandit Problems
We consider the multiarm bandit problems in the timevarying dynamic system
for rich structural features. For the nonlinear dynamic model, we propose the
approximate inference for the posterior distributions based on Laplace
Approximation. For the context bandit problems, Thompson Sampling is adopted
based on the underlying posterior distributions of the parameters. More
specifically, we introduce the discount decays on the previous samples impact
and analyze the different decay rates with the underlying sample dynamics.
Consequently, the exploration and exploitation is adaptively tradeoff according
to the dynamics in the system.Comment: 22 pages, 10 figure
A Contextual-Bandit Approach to Personalized News Article Recommendation
Personalized web services strive to adapt their services (advertisements,
news articles, etc) to individual users by making use of both content and user
information. Despite a few recent advances, this problem remains challenging
for at least two reasons. First, web service is featured with dynamically
changing pools of content, rendering traditional collaborative filtering
methods inapplicable. Second, the scale of most web services of practical
interest calls for solutions that are both fast in learning and computation.
In this work, we model personalized recommendation of news articles as a
contextual bandit problem, a principled approach in which a learning algorithm
sequentially selects articles to serve users based on contextual information
about the users and articles, while simultaneously adapting its
article-selection strategy based on user-click feedback to maximize total user
clicks.
The contributions of this work are three-fold. First, we propose a new,
general contextual bandit algorithm that is computationally efficient and well
motivated from learning theory. Second, we argue that any bandit algorithm can
be reliably evaluated offline using previously recorded random traffic.
Finally, using this offline evaluation method, we successfully applied our new
algorithm to a Yahoo! Front Page Today Module dataset containing over 33
million events. Results showed a 12.5% click lift compared to a standard
context-free bandit algorithm, and the advantage becomes even greater when data
gets more scarce.Comment: 10 pages, 5 figure
Learning Contextual Bandits in a Non-stationary Environment
Multi-armed bandit algorithms have become a reference solution for handling
the explore/exploit dilemma in recommender systems, and many other important
real-world problems, such as display advertisement. However, such algorithms
usually assume a stationary reward distribution, which hardly holds in practice
as users' preferences are dynamic. This inevitably costs a recommender system
consistent suboptimal performance. In this paper, we consider the situation
where the underlying distribution of reward remains unchanged over (possibly
short) epochs and shifts at unknown time instants. In accordance, we propose a
contextual bandit algorithm that detects possible changes of environment based
on its reward estimation confidence and updates its arm selection strategy
respectively. Rigorous upper regret bound analysis of the proposed algorithm
demonstrates its learning effectiveness in such a non-trivial environment.
Extensive empirical evaluations on both synthetic and real-world datasets for
recommendation confirm its practical utility in a changing environment.Comment: 10 pages, 13 figures, To appear on ACM Special Interest Group on
Information Retrieval (SIGIR) 201
Corrupt Bandits for Preserving Local Privacy
We study a variant of the stochastic multi-armed bandit (MAB) problem in
which the rewards are corrupted. In this framework, motivated by privacy
preservation in online recommender systems, the goal is to maximize the sum of
the (unobserved) rewards, based on the observation of transformation of these
rewards through a stochastic corruption process with known parameters. We
provide a lower bound on the expected regret of any bandit algorithm in this
corrupted setting. We devise a frequentist algorithm, KLUCB-CF, and a Bayesian
algorithm, TS-CF and give upper bounds on their regret. We also provide the
appropriate corruption parameters to guarantee a desired level of local privacy
and analyze how this impacts the regret. Finally, we present some experimental
results that confirm our analysis
- …