Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms
Contextual bandit algorithms have become popular for online recommendation
systems such as Digg, Yahoo! Buzz, and news recommendation in general.
\emph{Offline} evaluation of the effectiveness of new algorithms in these
applications is critical for protecting online user experiences but very
challenging due to their "partial-label" nature. Common practice is to create a
simulator which simulates the online environment for the problem at hand and
then run an algorithm against this simulator. However, creating a simulator is
itself often difficult, and modeling bias is usually introduced unavoidably.
In this paper, we introduce a \emph{replay} methodology for contextual bandit
algorithm evaluation. Unlike simulator-based approaches, our method is
completely data-driven and very easy to adapt to different applications. More
importantly, our method can provide provably unbiased evaluations. Our
empirical results on a large-scale news article recommendation dataset
collected from Yahoo! Front Page conform well with our theoretical results.
Furthermore, comparisons between our offline replay and online bucket
evaluation of several contextual bandit algorithms show the accuracy and
effectiveness of our offline evaluation method. Comment: 10 pages, 7 figures,
revised from the published version at the WSDM 2011 conference
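The replay idea above can be sketched in a few lines: stream logged (context, action, reward) triples collected under a uniformly random logging policy, and count an event only when the evaluated policy picks the logged action; matched events then form an unbiased sample of the policy's own interaction stream. A minimal sketch in Python, with illustrative function names and event format (not the authors' code):

```python
import random

def replay_evaluate(select, update, logged_events):
    """Replay evaluation: an event counts only when the evaluated policy
    picks the logged action; matched events form an unbiased sample of
    the policy's own interactions, provided logging was uniformly random."""
    total_reward, matched = 0.0, 0
    for context, logged_action, reward in logged_events:
        if select(context) == logged_action:
            total_reward += reward
            matched += 1
            update(context, logged_action, reward)  # policy learns online
    return total_reward / matched if matched else 0.0

# Toy log: 3 arms chosen uniformly at random; reward is Bernoulli(0.4)
# regardless of arm, so any fixed policy has true value 0.4.
random.seed(0)
log = [(None, random.choice([0, 1, 2]), float(random.random() < 0.4))
       for _ in range(3000)]
est = replay_evaluate(lambda ctx: 1, lambda *args: None, log)
```

Roughly a third of the 3000 events match the fixed policy, so the estimate concentrates around the true value 0.4.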
Contextual Linear Bandits under Noisy Features: Towards Bayesian Oracles
We study contextual linear bandit problems under feature uncertainty: the
observed features are noisy and may have missing entries. To address the
challenges posed by the noise,
we analyze Bayesian oracles given observed noisy features. Our Bayesian
analysis finds that the optimal hypothesis can be far from the underlying
realizability function, depending on noise characteristics, which is highly
non-intuitive and does not occur for classical noiseless setups. This implies
that classical approaches cannot guarantee a non-trivial regret bound. We thus
propose an algorithm aiming at the Bayesian oracle from observed information
under this model, achieving a regret bound stated in terms of the feature
dimension and the time horizon. We demonstrate the proposed algorithm using
synthetic and real-world datasets. Comment: 30 pages
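A one-dimensional sketch of why the Bayesian oracle differs from trusting raw features: with a Gaussian prior and additive Gaussian noise, the Bayes-optimal feature estimate shrinks the observation toward the prior mean, so an oracle acting on E[x | z] is not the same as one acting on z. The variances and names below are illustrative assumptions, not the paper's model:

```python
import random

def posterior_mean_feature(z, prior_var, noise_var):
    """E[x | z] for x ~ N(0, prior_var) observed as z = x + e with
    e ~ N(0, noise_var): the observation is shrunk toward the prior
    mean by the factor prior_var / (prior_var + noise_var)."""
    return z * prior_var / (prior_var + noise_var)

# Heavy noise relative to the prior -> strong shrinkage (factor 0.2).
random.seed(1)
prior_var, noise_var = 1.0, 4.0
xs = [random.gauss(0, prior_var ** 0.5) for _ in range(20000)]
zs = [x + random.gauss(0, noise_var ** 0.5) for x in xs]

# Mean squared error of trusting the raw noisy feature vs. the
# posterior-mean (Bayes-oracle) estimate.
err_raw = sum((z - x) ** 2 for x, z in zip(xs, zs)) / len(xs)
err_bayes = sum((posterior_mean_feature(z, prior_var, noise_var) - x) ** 2
                for x, z in zip(xs, zs)) / len(xs)
```

Here `err_raw` concentrates near the noise variance (4.0) while `err_bayes` concentrates near 0.8, illustrating the gap between the two estimates that a noiseless analysis would miss.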
A Contextual-Bandit Approach to Personalized News Article Recommendation
Personalized web services strive to adapt their services (advertisements,
news articles, etc.) to individual users by making use of both content and user
information. Despite a few recent advances, this problem remains challenging
for at least two reasons. First, web services feature dynamically
changing pools of content, rendering traditional collaborative filtering
methods inapplicable. Second, the scale of most web services of practical
interest calls for solutions that are both fast in learning and computation.
In this work, we model personalized recommendation of news articles as a
contextual bandit problem, a principled approach in which a learning algorithm
sequentially selects articles to serve users based on contextual information
about the users and articles, while simultaneously adapting its
article-selection strategy based on user-click feedback to maximize total user
clicks.
The contributions of this work are three-fold. First, we propose a new,
general contextual bandit algorithm that is computationally efficient and well
motivated from learning theory. Second, we argue that any bandit algorithm can
be reliably evaluated offline using previously recorded random traffic.
Finally, using this offline evaluation method, we successfully applied our new
algorithm to a Yahoo! Front Page Today Module dataset containing over 33
million events. Results showed a 12.5% click lift compared to a standard
context-free bandit algorithm, and the advantage becomes even greater when data
gets more scarce. Comment: 10 pages, 5 figures
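The new algorithm this paper proposes is LinUCB. A minimal sketch of its disjoint-arm variant, a per-arm ridge regression plus an upper-confidence bonus, is below; the class names, the two-arm toy environment, and alpha = 1.0 are illustrative choices, not the paper's implementation:

```python
import numpy as np

class LinUCBArm:
    """Disjoint LinUCB arm: ridge-regression reward estimate plus an
    upper-confidence bonus alpha * sqrt(x^T A^{-1} x)."""
    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(dim)       # ridge-regularized X^T X
        self.b = np.zeros(dim)     # X^T r

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                       # reward estimate
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def select_arm(arms, x):
    """Play the arm with the highest upper confidence bound."""
    return max(range(len(arms)), key=lambda a: arms[a].ucb(x))

# Toy run: two arms whose true reward weights prefer opposite features.
rng = np.random.default_rng(0)
true_theta = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
arms = [LinUCBArm(dim=2) for _ in range(2)]
for _ in range(500):
    x = rng.random(2)
    a = select_arm(arms, x)
    arms[a].update(x, true_theta[a] @ x + 0.1 * rng.standard_normal())
```

After training, the learned models prefer arm 0 on contexts dominated by the first feature and arm 1 on contexts dominated by the second.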
Safe Exploration for Optimizing Contextual Bandits
Contextual bandit problems are a natural fit for many information retrieval
tasks, such as learning to rank, text classification, recommendation, etc.
However, existing learning methods for contextual bandit problems have one of
two drawbacks: they either do not explore the space of all possible document
rankings (i.e., actions) and, thus, may miss the optimal ranking, or they
present suboptimal rankings to a user and, thus, may harm the user experience.
We introduce a new learning method for contextual bandit problems, Safe
Exploration Algorithm (SEA), which overcomes the above drawbacks. SEA starts by
using a baseline (or production) ranking system (i.e., policy), which does not
harm the user experience and, thus, is safe to execute, but has suboptimal
performance and, thus, needs to be improved. Then SEA uses counterfactual
learning to learn a new policy based on the behavior of the baseline policy.
SEA also uses high-confidence off-policy evaluation to estimate the performance
of the newly learned policy. Once the performance of the newly learned policy
is at least as good as the performance of the baseline policy, SEA starts using
the new policy to execute new actions, allowing it to actively explore
favorable regions of the action space. This way, SEA never performs worse than
the baseline policy and, thus, does not harm the user experience, while still
exploring the action space and, thus, being able to find an optimal policy. Our
experiments using text classification and document retrieval confirm the above
by comparing SEA (and a boundless variant called BSEA) to online and offline
learning methods for contextual bandit problems. Comment: 23 pages, 3 figures
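SEA's safety test, deploying the new policy only once a high-confidence off-policy estimate says it is at least as good as the baseline, can be sketched with a clipped inverse-propensity-score estimator and a Hoeffding-style lower bound. This is a simplified stand-in assuming rewards in [0, 1] and known logging propensities; SEA's actual estimator and bound differ in detail:

```python
import math

def safe_to_deploy(events, target_probs, baseline_value,
                   delta=0.05, w_max=10.0):
    """High-confidence safety test: estimate the new policy's value by
    clipped inverse propensity scoring over logged (action, reward,
    logging_prob) events, then require a Hoeffding lower bound (rewards
    in [0, 1], importance weights clipped to w_max) to reach the
    baseline policy's value before switching policies."""
    n = len(events)
    clipped = [min(tp / lp, w_max) * r
               for (a, r, lp), tp in zip(events, target_probs)]
    mean = sum(clipped) / n
    bound = w_max * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return mean - bound >= baseline_value

# Toy check: logging chose between two actions uniformly (propensity 0.5)
# and action 1 always paid reward 1, so a policy that always plays
# action 1 is clearly better than the baseline's value of 0.5.
events = [(i % 2, float(i % 2), 0.5) for i in range(2000)]
new_probs = [1.0 if a == 1 else 0.0 for a, _, _ in events]
deploy_ok = safe_to_deploy(events, new_probs, baseline_value=0.5)
```

The clipping constant `w_max` trades a little bias for a much tighter confidence interval, which is what makes the lower bound usable at realistic sample sizes.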
Thompson Sampling Regret Bounds for Contextual Bandits with sub-Gaussian rewards
In this work, we study the performance of the Thompson Sampling algorithm for
Contextual Bandit problems based on the framework introduced by Neu et al. and
their concept of lifted information ratio. First, we prove a comprehensive
bound on the Thompson Sampling expected cumulative regret that depends on the
mutual information of the environment parameters and the history. Then, we
introduce new bounds on the lifted information ratio that hold for sub-Gaussian
rewards, thus generalizing the results of Neu et al., whose analysis requires
binary rewards. Finally, we provide explicit regret bounds for the special
cases of unstructured bounded contextual bandits, structured bounded contextual
bandits with Laplace likelihood, structured Bernoulli bandits, and bounded
linear contextual bandits. Comment: 8 pages (5 of main text, 1 of references,
and 2 of appendices). Accepted to ISIT 2023
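For the unstructured bounded case with finitely many contexts and Bernoulli rewards, Thompson Sampling reduces to maintaining one Beta posterior per (context, arm) pair: sample a mean from each posterior, play the argmax, and update the winner. A sketch under those assumptions (the toy reward means are illustrative):

```python
import random

class ContextualThompson:
    """Thompson Sampling for an unstructured contextual bandit:
    finitely many contexts, Bernoulli rewards, and an independent
    Beta(1, 1) posterior for every (context, arm) pair."""
    def __init__(self, n_contexts, n_arms):
        self.ab = [[[1, 1] for _ in range(n_arms)]
                   for _ in range(n_contexts)]

    def select(self, ctx):
        # Sample a plausible mean reward per arm; play the argmax.
        samples = [random.betavariate(a, b) for a, b in self.ab[ctx]]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, ctx, arm, reward):
        self.ab[ctx][arm][0] += reward       # successes
        self.ab[ctx][arm][1] += 1 - reward   # failures

# Toy run: the best arm flips with the context (means are illustrative).
random.seed(2)
means = [[0.9, 0.1], [0.1, 0.9]]
ts = ContextualThompson(n_contexts=2, n_arms=2)
total, rounds = 0.0, 2000
for t in range(rounds):
    ctx = t % 2
    arm = ts.select(ctx)
    reward = 1 if random.random() < means[ctx][arm] else 0
    ts.update(ctx, arm, reward)
    total += reward
avg_reward = total / rounds
```

Because each context has its own posteriors, the sampler quickly learns the context-dependent best arm, and the average reward approaches the optimal 0.9.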
An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits
In this paper, we propose an information-theoretic exploration strategy for
stochastic, discrete multi-armed bandits that achieves optimal regret. Our
strategy is based on the value of information criterion. This criterion
measures the trade-off between policy information and obtainable rewards. High
amounts of policy information are associated with exploration-dominant searches
of the space and yield high rewards. Low amounts of policy information favor
the exploitation of existing knowledge. Information, in this criterion, is
quantified by a parameter that can be varied during search. We demonstrate that
a simulated-annealing-like update of this parameter, with a sufficiently fast
cooling schedule, leads to an optimal regret that is logarithmic with respect
to the number of episodes. Comment: Entropy
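The annealed information parameter plays a role similar to a temperature in soft-max (Boltzmann) exploration: high values spread play across arms, low values concentrate it on the empirically best arm. A generic sketch of that mechanism, not the paper's exact value-of-information criterion, with a logarithmic cooling-style schedule:

```python
import math
import random

def boltzmann_bandit(means, episodes=5000, seed=0):
    """Soft-max (Boltzmann) exploration with an annealed temperature:
    arm i is played with probability proportional to exp(Q_i / temp),
    and temp decays over episodes so the policy sweeps from
    exploration-dominant to exploitation-dominant behavior."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n
    estimates = [0.0] * n        # running mean reward per arm
    total = 0.0
    for t in range(1, episodes + 1):
        temp = 1.0 / math.log(t + 1)              # cooling schedule
        logits = [q / temp for q in estimates]
        m = max(logits)                           # stabilize the exp
        probs = [math.exp(l - m) for l in logits]
        # Sample an arm from the soft-max distribution.
        r_val = rng.random() * sum(probs)
        arm, acc = 0, 0.0
        for i, p in enumerate(probs):
            acc += p
            if r_val <= acc:
                arm = i
                break
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / episodes, counts

avg, counts = boltzmann_bandit([0.2, 0.8])
```

With this schedule, early episodes sample both arms while late episodes almost always play the better one, so pull counts and average reward both tilt heavily toward the 0.8 arm.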
Thirty Years of Machine Learning: The Road to Pareto-Optimal Wireless Networks
Future wireless networks have substantial potential to support a broad range
of complex and compelling applications in both military and civilian
fields, where the users are able to enjoy high-rate, low-latency, low-cost and
reliable information services. Achieving this ambitious goal requires new radio
techniques for adaptive learning and intelligent decision making because of the
complex heterogeneous nature of the network structures and wireless services.
Machine learning (ML) algorithms have great success in supporting big data
analytics, efficient parameter estimation and interactive decision making.
Hence, in this article, we review the thirty-year history of ML by elaborating
on supervised learning, unsupervised learning, reinforcement learning and deep
learning. Furthermore, we investigate their employment in the compelling
applications of wireless networks, including heterogeneous networks (HetNets),
cognitive radios (CR), the Internet of Things (IoT), machine-to-machine (M2M)
networks, and so on. This article aims to assist readers in understanding the
motivation and methodology of the various ML algorithms, so that they can be
invoked for hitherto unexplored services and scenarios in future wireless
networks. Comment: 46 pages, 22 figures