252 research outputs found
A Survey on Practical Applications of Multi-Armed and Contextual Bandits
In recent years, the multi-armed bandit (MAB) framework has attracted
considerable attention in various applications, from recommender systems and
information retrieval to healthcare and finance, due to its strong performance
combined with attractive properties such as learning from limited feedback. The
multi-armed bandit field is currently flourishing, as novel problem settings
and algorithms motivated by various practical applications are being
introduced, building on top of the classical bandit problem. This article aims
to provide a comprehensive review of top recent developments in multiple
real-life applications of the multi-armed bandit. Specifically, we introduce a
taxonomy of common MAB-based applications and summarize the state of the art for each
of those domains. Furthermore, we identify important current trends and provide
new perspectives pertaining to the future of this exciting and fast-growing
field.
Comment: under review by IJCAI 2019 Survey Track
Meta-Learning Bandit Policies by Gradient Ascent
Most bandit policies are designed to either minimize regret in any problem
instance, making very few assumptions about the underlying environment, or to
minimize it in a Bayesian sense, assuming a prior distribution over environment parameters. The
former are often too conservative in practical settings, while the latter
require assumptions that are hard to verify in practice. We study bandit
problems that fall between these two extremes, where the learning agent has
access to sampled bandit instances from an unknown prior distribution
$\mathcal{P}$ and aims to achieve high reward on average over the bandit
instances drawn from $\mathcal{P}$. This setting is of particular importance
because it lays the foundations for meta-learning of bandit policies and reflects
more realistic assumptions in many practical domains. We propose the use of
parameterized bandit policies that are differentiable and can be optimized
using policy gradients. This provides a broadly applicable framework that is
easy to implement. We derive reward gradients that reflect the structure of
bandit problems and policies, for both non-contextual and contextual settings,
and propose a number of interesting policies that are both differentiable and
have low regret. Our algorithmic and theoretical contributions are supported by
extensive experiments that show the importance of baseline subtraction, learned
biases, and the practicality of our approach on a range of problems.
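To make the approach concrete, the following is a minimal, hypothetical sketch of the recipe the abstract describes: a one-parameter softmax policy over empirical means, trained by a REINFORCE-style gradient with a running-average baseline on Bernoulli instances sampled from a prior. The policy class, prior, and constants are illustrative stand-ins, not the paper's exact constructions.

```python
# Hypothetical sketch: gradient ascent on a differentiable bandit policy
# over instances sampled from a prior, with baseline subtraction.
import numpy as np

rng = np.random.default_rng(0)
K, HORIZON = 5, 100  # arms, rounds per sampled bandit instance

def sample_instance():
    """Draw Bernoulli arm means from a (here: uniform) prior."""
    return rng.uniform(0.0, 1.0, size=K)

def run_episode(theta, means):
    """Play one instance with a softmax policy over empirical means.

    theta acts as an inverse temperature; returns the episode return
    and the accumulated score-function term d log pi / d theta.
    """
    counts, sums = np.ones(K), np.ones(K)  # one optimistic pseudo-pull each
    total, grad_log = 0.0, 0.0
    for _ in range(HORIZON):
        mu_hat = sums / counts
        logits = theta * mu_hat
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(K, p=probs)
        grad_log += mu_hat[a] - probs @ mu_hat  # d log pi(a) / d theta
        r = float(rng.random() < means[a])
        total += r
        counts[a] += 1
        sums[a] += r
    return total, grad_log

theta, lr, baseline = 1.0, 0.05, 0.0
for step in range(2000):
    means = sample_instance()
    ret, glog = run_episode(theta, means)
    baseline += 0.1 * (ret - baseline)     # running-average baseline
    theta += lr * (ret - baseline) * glog  # REINFORCE ascent step
```

Baseline subtraction is not cosmetic here: without it, the product of a large raw return and the summed score function makes the gradient estimate very noisy, which is consistent with the importance the abstract assigns to it.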
Smoothness-Adaptive Contextual Bandits
We study a non-parametric multi-armed bandit problem with stochastic
covariates, where a key complexity driver is the smoothness of payoff functions
with respect to covariates. Previous studies have focused on deriving
minimax-optimal algorithms in cases where it is a priori known how smooth the
payoff functions are. In practice, however, the smoothness of payoff functions
is typically not known in advance, and misspecification of smoothness may
severely deteriorate the performance of existing methods. In this work, we
consider a framework where the smoothness of payoff functions is not known, and
study when and how algorithms may adapt to unknown smoothness. First, we
establish that designing algorithms that adapt to unknown smoothness of payoff
functions is, in general, impossible. However, under a self-similarity
condition (which does not reduce the minimax complexity of the dynamic
optimization problem at hand), we establish that adapting to unknown smoothness
is possible, and further devise a general policy for achieving
smoothness-adaptive performance. Our policy infers the smoothness of payoffs
throughout the decision-making process, while leveraging the structure of
off-the-shelf non-adaptive policies. We establish that for problem settings
with either differentiable or non-differentiable payoff functions, this policy
matches (up to a logarithmic scale) the regret rate that is achievable when the
smoothness of payoffs is known a priori.
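For reference, the kind of non-adaptive, smoothness-tuned building block such a policy can leverage looks roughly as follows: a binned UCB rule for a one-dimensional covariate whose bin count is tuned to an assumed Hölder exponent. Payoff functions and constants are hypothetical, and the paper's adaptive layer (inferring the smoothness under self-similarity) is deliberately omitted.

```python
# Hypothetical sketch of a non-adaptive binned UCB policy tuned to a
# *known* smoothness BETA; adapting when BETA is unknown is the hard part.
import numpy as np

rng = np.random.default_rng(1)
T, K, BETA = 20_000, 2, 1.0                 # rounds, arms, assumed exponent
M = max(1, int(T ** (1 / (2 * BETA + 1))))  # bins: nonparametric tuning

def payoff(arm, x):
    """Unknown smooth payoff functions; two toy curves for illustration."""
    return 0.5 + 0.4 * np.sin(3 * x) if arm == 0 else 0.6 - 0.3 * x

counts = np.zeros((M, K))
sums = np.zeros((M, K))
for t in range(1, T + 1):
    x = rng.random()            # stochastic covariate in [0, 1)
    b = min(int(x * M), M - 1)  # its bin
    # Treat each bin as a standalone bandit and play UCB within it.
    ucb = np.where(
        counts[b] > 0,
        sums[b] / np.maximum(counts[b], 1)
        + np.sqrt(2 * np.log(t) / np.maximum(counts[b], 1)),
        np.inf,
    )
    a = int(np.argmax(ucb))
    r = payoff(a, x) + 0.1 * rng.standard_normal()
    counts[b, a] += 1
    sums[b, a] += r
```

The bin count $M \approx T^{1/(2\beta+1)}$ is exactly where knowledge of the smoothness enters, which is why misspecifying it degrades performance.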
Differentiable Bandit Exploration
Exploration policies in Bayesian bandits maximize the average reward over
problem instances drawn from some distribution $\mathcal{P}$. In this work, we
learn such policies for an unknown distribution $\mathcal{P}$ using samples
from $\mathcal{P}$. Our approach is a form of meta-learning and exploits
the properties of $\mathcal{P}$ without making strong assumptions about its form.
To do this, we parameterize our policies in a differentiable way and optimize
them by policy gradients, an approach that is general and easy to implement. We
derive effective gradient estimators and introduce novel variance reduction
techniques. We also analyze and experiment with various bandit policy classes,
including neural networks and a novel softmax policy. The latter has regret
guarantees and is a natural starting point for our optimization. Our
experiments show the versatility of our approach. We also observe that neural
network policies can learn implicit biases expressed only through the sampled
instances.
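A generic form such a gradient estimator can take (not necessarily the paper's exact one) is the score-function identity with a constant baseline $b$: for a policy $\pi_\theta$ run for $n$ rounds on an instance drawn from $\mathcal{P}$,
\[
\nabla_\theta\, \mathbb{E}\big[R\big]
= \mathbb{E}\!\left[(R - b)\sum_{t=1}^{n} \nabla_\theta \log \pi_\theta\big(A_t \mid H_{t-1}\big)\right],
\]
where $R$ is the total reward of the run, $A_t$ the action at round $t$, and $H_{t-1}$ the interaction history. The identity holds for any fixed $b$, so the baseline changes only the variance of the estimator, never its bias.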
Stochastic Structured Prediction under Bandit Feedback
Stochastic structured prediction under bandit feedback follows a learning
protocol where on each of a sequence of iterations, the learner receives an
input, predicts an output structure, and receives partial feedback in form of a
task loss evaluation of the predicted structure. We present applications of
this learning scenario to convex and non-convex objectives for structured
prediction and analyze them as stochastic first-order methods. We present an
experimental evaluation on problems of natural language processing over
exponential output spaces, and compare convergence speed across different
objectives under the practical criterion of optimal task performance on
development data and the optimization-theoretic criterion of minimal squared
gradient norm. Best results under both criteria are obtained for a non-convex
objective for pairwise preference learning under bandit feedback.
Comment: 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain
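The basic objective behind this protocol is the expected task loss under the model's own distribution over structures; one standard form (the convex and non-convex objectives in the paper are variations on this theme) is
\[
J(w) = \mathbb{E}_{x}\, \mathbb{E}_{\tilde y \sim p_w(\cdot \mid x)}\big[\Delta(\tilde y)\big],
\qquad
\nabla_w J(w) = \mathbb{E}_{x}\, \mathbb{E}_{\tilde y \sim p_w(\cdot \mid x)}\big[\Delta(\tilde y)\, \nabla_w \log p_w(\tilde y \mid x)\big],
\]
so sampling one structure $\tilde y$, observing the single loss value $\Delta(\tilde y)$, and stepping along $\Delta(\tilde y)\,\nabla_w \log p_w(\tilde y \mid x)$ gives an unbiased stochastic first-order method, which is what makes a convergence analysis possible despite the partial feedback.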
Smooth Bandit Optimization: Generalization to H\"older Space
We consider bandit optimization of a smooth reward function, where the goal
is cumulative regret minimization. This problem has been studied for
-H\"older continuous (including Lipschitz) functions with . Our main result is in generalization of the reward function to H\"older
space with exponent to bridge the gap between Lipschitz bandits and
infinitely-differentiable models such as linear bandits. For H\"older
continuous functions, approaches based on random sampling in bins of a
discretized domain suffices as optimal. In contrast, we propose a class of
two-layer algorithms that deploy misspecified linear/polynomial bandit
algorithms in bins. We demonstrate that the proposed algorithm can exploit
higher-order smoothness of the function by deriving a regret upper bound of
$\tilde{O}\big(T^{\frac{d+\alpha}{d+2\alpha}}\big)$ when $\alpha > 1$, which matches the
existing lower bound. We also study adaptation to unknown function smoothness
over a continuous scale of H\"older spaces indexed by $\alpha$, with a bandit
model selection approach applied with our proposed two-layer algorithms. We
show that it achieves a regret rate that matches the existing lower bound for
adaptation within the $\alpha \leq 1$ subset.
Comment: 11 main pages, 2 figures, 13 appendix pages
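To see how this bridges the two regimes, note how the exponent behaves at the ends of the smoothness scale:
\[
\tilde{O}\!\left(T^{\frac{d+\alpha}{d+2\alpha}}\right):
\qquad
\frac{d+\alpha}{d+2\alpha}\Big|_{\alpha=1} = \frac{d+1}{d+2},
\qquad
\frac{d+\alpha}{d+2\alpha} \xrightarrow{\;\alpha\to\infty\;} \frac{1}{2},
\]
recovering the classical Lipschitz-bandit rate at $\alpha = 1$ and approaching the $\sqrt{T}$-type regret of parametric (e.g., linear) bandits as the smoothness grows.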
Learning to Actively Learn: A Robust Approach
This work proposes a procedure for designing algorithms for specific adaptive
data collection tasks like active learning and pure-exploration multi-armed
bandits. Unlike the design of traditional adaptive algorithms that rely on
concentration of measure and careful analysis to justify the correctness and
sample complexity of the procedure, our adaptive algorithm is learned via
adversarial training over equivalence classes of problems derived from
information-theoretic lower bounds. In particular, a single adaptive learning
algorithm is learned that competes with the best adaptive algorithm learned for
each equivalence class. Our procedure takes as input just the available
queries, set of hypotheses, loss function, and total query budget. This is in
contrast to existing meta-learning work that learns an adaptive algorithm
relative to an explicit, user-defined subset or prior distribution over
problems, which can be challenging to define and may be mismatched to the instance
encountered at test time. This work is particularly focused on the regime when
the total query budget is very small, such as a few dozen, which is much
smaller than those budgets typically considered by theoretically derived
algorithms. We perform synthetic experiments to justify the stability and
effectiveness of the training procedure, and then evaluate the method on tasks
derived from real data including a noisy 20 Questions game and a joke
recommendation task.
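A heavily simplified, hypothetical sketch of the adversarial training loop: the adversary scans one equivalence class (here, permutations of a fixed gap pattern) for the instance on which the current policy suffers the largest estimated simple regret, and the learner updates its policy parameter against that hardest instance. The policy class, instance set, and finite-difference update are all illustrative.

```python
# Hypothetical sketch: adversarial training of an adaptive sampling
# policy over an equivalence class of pure-exploration instances.
import numpy as np

rng = np.random.default_rng(2)
BUDGET, K = 20, 4  # a few dozen queries, four candidate arms

def simple_regret(theta, means):
    """Allocate BUDGET noisy queries with a softmax policy, recommend the
    best-looking arm, and return its gap to the true best arm."""
    counts, sums = np.ones(K), np.zeros(K)
    for _ in range(BUDGET):
        logits = theta * sums / counts
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(K, p=probs)
        counts[a] += 1
        sums[a] += means[a] + rng.standard_normal()
    rec = int(np.argmax(sums / counts))
    return means.max() - means[rec]

# One equivalence class: all permutations of the same gap pattern.
candidates = [rng.permutation([0.8, 0.5, 0.5, 0.5]) for _ in range(16)]
theta = 1.0
for it in range(100):
    # Adversary: find the instance the current policy handles worst.
    losses = [np.mean([simple_regret(theta, m) for _ in range(8)])
              for m in candidates]
    hard = candidates[int(np.argmax(losses))]
    # Learner: finite-difference step to reduce worst-case simple regret.
    eps = 0.1
    g = (np.mean([simple_regret(theta + eps, hard) for _ in range(16)])
         - np.mean([simple_regret(theta - eps, hard) for _ in range(16)])) / (2 * eps)
    theta -= 0.5 * g
```

The point of training against the worst member of an equivalence class, rather than a user-specified prior, is that the learned policy inherits a worst-case flavor without anyone having to guess the test-time instance distribution.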
Bandit Structured Prediction for Learning from Partial Feedback in Statistical Machine Translation
We present an approach to structured prediction from bandit feedback, called
Bandit Structured Prediction, where only the value of a task loss function at a
single predicted point, instead of a correct structure, is observed in
learning. We present an application to discriminative reranking in Statistical
Machine Translation (SMT) where the learning algorithm only has access to a
1-BLEU loss evaluation of a predicted translation instead of obtaining a gold
standard reference translation. In our experiment, bandit feedback is obtained
by evaluating BLEU on reference translations without revealing them to the
algorithm. This can be thought of as a simulation of interactive machine
translation where an SMT system is personalized by a user who provides single
point feedback to predicted translations. Our experiments show that our
approach improves translation quality and is comparable to approaches that
employ more informative feedback in learning.
Comment: In Proceedings of MT Summit XV, 2015, Miami, Florida
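The feedback simulation described above is simple to state in code. A minimal sketch, assuming NLTK's sentence_bleu as the BLEU implementation; the class and method names are hypothetical:

```python
# The learner observes only the scalar loss, never the hidden reference.
from nltk.translate.bleu_score import sentence_bleu

class BanditSMTEnvironment:
    """Simulates single-point bandit feedback for SMT reranking."""

    def __init__(self, references):
        self._references = references  # tokenized references, kept hidden

    def feedback(self, sent_id, hypothesis_tokens):
        """Return the (1 - BLEU) task loss of a predicted translation."""
        ref = self._references[sent_id]
        return 1.0 - sentence_bleu([ref], hypothesis_tokens)

# The learner only ever sees scalar losses like this one.
env = BanditSMTEnvironment({17: "the house is small".split()})
loss = env.feedback(17, "the house is little".split())
```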
Differentiable Linear Bandit Algorithm
Upper Confidence Bound (UCB) is arguably the most commonly used method for
linear multi-armed bandit problems. While conceptually and computationally
simple, this method relies heavily on its confidence bounds, failing to strike
the optimal exploration-exploitation trade-off if these bounds are not properly set. In
the literature, confidence bounds are typically derived from concentration
inequalities based on assumptions on the reward distribution, e.g.,
sub-Gaussianity. The validity of these assumptions however is unknown in
practice. In this work, we aim at learning the confidence bound in a
data-driven fashion, making it adaptive to the actual problem structure.
Specifically, noting that existing UCB-type algorithms are not differentiable
with respect to confidence bound, we first propose a novel differentiable
linear bandit algorithm. Then, we introduce a gradient estimator, which allows
the confidence bound to be learned via gradient ascent. Theoretically, we show
that the proposed algorithm achieves a $\tilde{\mathcal{O}}(\hat{\beta}\sqrt{dT})$
upper bound on the $T$-round regret, where $d$ is the dimension of the arm
features and $\hat{\beta}$ is the learned size of the confidence bound.
Empirical results show that $\hat{\beta}$ is significantly smaller than its
theoretical upper bound, and the proposed algorithms outperform baseline ones
on both simulated and real-world datasets.
Comment: 16 pages
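A minimal sketch of the mechanism, not the paper's exact algorithm: replacing the hard argmax over UCB scores with a softmax makes arm selection differentiable in the width $\beta$, so $\beta$ can be updated with a score-function gradient of the observed reward. The linear environment and constants below are made up for illustration, and a baseline is omitted for brevity.

```python
# Hypothetical sketch: a softmax-relaxed LinUCB whose confidence width
# beta is learned by gradient ascent on observed reward.
import numpy as np

rng = np.random.default_rng(3)
d, K, T = 4, 10, 2000
theta_star = rng.standard_normal(d) / np.sqrt(d)  # unknown parameter
arms = rng.standard_normal((K, d))                # fixed arm features

beta, lr = 1.0, 0.01
V, bvec = np.eye(d), np.zeros(d)  # ridge-regression statistics
for t in range(T):
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ bvec
    width = np.sqrt(np.einsum("kd,de,ke->k", arms, V_inv, arms))
    scores = arms @ theta_hat + beta * width  # soft UCB scores
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    a = rng.choice(K, p=probs)
    r = arms[a] @ theta_star + 0.1 * rng.standard_normal()
    # Score-function gradient: d log pi(a) / d beta = width[a] - E[width].
    beta += lr * r * (width[a] - probs @ width)
    V += np.outer(arms[a], arms[a])
    bvec += r * arms[a]
```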
Lifelong Learning in Multi-Armed Bandits
Continuously learning and leveraging the knowledge accumulated from prior
tasks in order to improve future performance is a long-standing machine
learning problem. In this paper, we study the problem in the multi-armed bandit
framework with the objective to minimize the total regret incurred over a
series of tasks. While most bandit algorithms are designed to have a low
worst-case regret, we examine here the average regret over bandit instances
drawn from some prior distribution which may change over time. We specifically
focus on confidence-interval tuning of UCB algorithms. We propose a
bandit-over-bandit approach with greedy algorithms, and we perform extensive experimental
evaluations in both stationary and non-stationary environments. We further
apply our solution to the mortal bandit problem, showing empirical improvement
over previous work.
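A hypothetical sketch of such a bandit-over-bandit scheme: a small grid of confidence widths is treated as the arms of a mostly greedy meta-bandit, and each incoming task is played with the width that has earned the highest average reward so far. The grid, the stationary task prior, and all constants are illustrative; the paper also handles priors that drift over time.

```python
# Hypothetical sketch: greedy bandit-over-bandit tuning of the UCB
# confidence width across a sequence of tasks drawn from a prior.
import numpy as np

rng = np.random.default_rng(4)
SCALES = [0.1, 0.5, 1.0, 2.0]  # candidate confidence-interval widths
K, HORIZON, TASKS = 5, 500, 200

def run_ucb(c, means):
    """UCB with its confidence width scaled by c; returns total reward."""
    counts = np.ones(K)
    sums = np.array([float(rng.random() < m) for m in means])  # init pulls
    total = sums.sum()
    for t in range(K, HORIZON):
        ucb = sums / counts + c * np.sqrt(np.log(t) / counts)
        a = int(np.argmax(ucb))
        r = float(rng.random() < means[a])
        counts[a] += 1
        sums[a] += r
        total += r
    return total

value = np.zeros(len(SCALES))  # running mean reward of each width
plays = np.zeros(len(SCALES))
for task in range(TASKS):
    means = rng.uniform(0.4, 0.6, size=K)  # task drawn from the prior
    if plays.min() == 0:
        i = int(np.argmin(plays))           # try each width once first
    elif rng.random() < 0.1:
        i = int(rng.integers(len(SCALES)))  # occasional exploration
    else:
        i = int(np.argmax(value))           # greedy meta-choice
    reward = run_ucb(SCALES[i], means)
    plays[i] += 1
    value[i] += (reward - value[i]) / plays[i]
```

Which width wins depends on the gap structure the prior induces; letting earlier tasks vote on the width is precisely the transfer of knowledge across tasks that the abstract targets.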