1,174 research outputs found
Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms
Contextual bandit algorithms have become popular for online recommendation
systems such as Digg, Yahoo! Buzz, and news recommendation in general.
\emph{Offline} evaluation of the effectiveness of new algorithms in these
applications is critical for protecting online user experiences but very
challenging due to their "partial-label" nature. Common practice is to create a
simulator which simulates the online environment for the problem at hand and
then run an algorithm against this simulator. However, creating simulator
itself is often difficult and modeling bias is usually unavoidably introduced.
In this paper, we introduce a \emph{replay} methodology for contextual bandit
algorithm evaluation. Different from simulator-based approaches, our method is
completely data-driven and very easy to adapt to different applications. More
importantly, our method can provide provably unbiased evaluations. Our
empirical results on a large-scale news article recommendation dataset
collected from Yahoo! Front Page conform well with our theoretical results.
Furthermore, comparisons between our offline replay and online bucket
evaluation of several contextual bandit algorithms show accuracy and
effectiveness of our offline evaluation method.Comment: 10 pages, 7 figures, revised from the published version at the WSDM
2011 conferenc
Learning Contextual Bandits in a Non-stationary Environment
Multi-armed bandit algorithms have become a reference solution for handling
the explore/exploit dilemma in recommender systems, and many other important
real-world problems, such as display advertisement. However, such algorithms
usually assume a stationary reward distribution, which hardly holds in practice
as users' preferences are dynamic. This inevitably costs a recommender system
consistent suboptimal performance. In this paper, we consider the situation
where the underlying distribution of reward remains unchanged over (possibly
short) epochs and shifts at unknown time instants. In accordance, we propose a
contextual bandit algorithm that detects possible changes of environment based
on its reward estimation confidence and updates its arm selection strategy
respectively. Rigorous upper regret bound analysis of the proposed algorithm
demonstrates its learning effectiveness in such a non-trivial environment.
Extensive empirical evaluations on both synthetic and real-world datasets for
recommendation confirm its practical utility in a changing environment.Comment: 10 pages, 13 figures, To appear on ACM Special Interest Group on
Information Retrieval (SIGIR) 201
Distributed Online Learning via Cooperative Contextual Bandits
In this paper we propose a novel framework for decentralized, online learning
by many learners. At each moment of time, an instance characterized by a
certain context may arrive to each learner; based on the context, the learner
can select one of its own actions (which gives a reward and provides
information) or request assistance from another learner. In the latter case,
the requester pays a cost and receives the reward but the provider learns the
information. In our framework, learners are modeled as cooperative contextual
bandits. Each learner seeks to maximize the expected reward from its arrivals,
which involves trading off the reward received from its own actions, the
information learned from its own actions, the reward received from the actions
requested of others and the cost paid for these actions - taking into account
what it has learned about the value of assistance from each other learner. We
develop distributed online learning algorithms and provide analytic bounds to
compare the efficiency of these with algorithms with the complete knowledge
(oracle) benchmark (in which the expected reward of every action in every
context is known by every learner). Our estimates show that regret - the loss
incurred by the algorithm - is sublinear in time. Our theoretical framework can
be used in many practical applications including Big Data mining, event
detection in surveillance sensor networks and distributed online recommendation
systems
Bandit-Based Task Assignment for Heterogeneous Crowdsourcing
We consider a task assignment problem in crowdsourcing, which is aimed at
collecting as many reliable labels as possible within a limited budget. A
challenge in this scenario is how to cope with the diversity of tasks and the
task-dependent reliability of workers, e.g., a worker may be good at
recognizing the name of sports teams, but not be familiar with cosmetics
brands. We refer to this practical setting as heterogeneous crowdsourcing. In
this paper, we propose a contextual bandit formulation for task assignment in
heterogeneous crowdsourcing, which is able to deal with the
exploration-exploitation trade-off in worker selection. We also theoretically
investigate the regret bounds for the proposed method, and demonstrate its
practical usefulness experimentally
Budget-constrained Edge Service Provisioning with Demand Estimation via Bandit Learning
Shared edge computing platforms, which enable Application Service Providers
(ASPs) to deploy applications in close proximity to mobile users are providing
ultra-low latency and location-awareness to a rich portfolio of services.
Though ubiquitous edge service provisioning, i.e., deploying the application at
all possible edge sites, is always preferable, it is impractical due to often
limited operational budget of ASPs. In this case, an ASP has to cautiously
decide where to deploy the edge service and how much budget it is willing to
use. A central issue here is that the service demand received by each edge
site, which is the key factor of deploying benefit, is unknown to ASPs a
priori. What's more complicated is that this demand pattern varies temporally
and spatially across geographically distributed edge sites. In this paper, we
investigate an edge resource rental problem where the ASP learns service demand
patterns for individual edge sites while renting computation resource at these
sites to host its applications for edge service provisioning. An online
algorithm, called Context-aware Online Edge Resource Rental (COERR), is
proposed based on the framework of Contextual Combinatorial Multi-armed Bandit
(CC-MAB). COERR observes side-information (context) to learn the demand
patterns of edge sites and decides rental decisions (including where to rent
the computation resource and how much to rent) to maximize ASP's utility given
a limited budget. COERR provides a provable performance achieving sublinear
regret compared to an Oracle algorithm that knows exactly the expected service
demand of edge sites. Experiments are carried out on a real-world dataset and
the results show that COERR significantly outperforms other benchmarks
Reinforcement Learning and Bandits for Speech and Language Processing: Tutorial, Review and Outlook
In recent years, reinforcement learning and bandits have transformed a wide
range of real-world applications including healthcare, finance, recommendation
systems, robotics, and last but not least, the speech and natural language
processing. While most speech and language applications of reinforcement
learning algorithms are centered around improving the training of deep neural
networks with its flexible optimization properties, there are still many
grounds to explore to utilize the benefits of reinforcement learning, such as
its reward-driven adaptability, state representations, temporal structures and
generalizability. In this survey, we present an overview of recent advancements
of reinforcement learning and bandits, and discuss how they can be effectively
employed to solve speech and natural language processing problems with models
that are adaptive, interactive and scalable.Comment: To appear in Expert Systems with Applications. Accompanying
INTERSPEECH 2022 Tutorial on the same topic. Including latest advancements in
large language models (LLMs
- …