Contextual Bandits with Random Projection
Contextual bandits with linear payoffs, which are also known as linear
bandits, provide a powerful alternative for solving practical sequential
decision problems, e.g., online advertising. In the era of big data,
contextual data are usually high-dimensional, which creates new challenges
for traditional linear bandits, most of which are designed for
low-dimensional contextual data. Due to the curse of dimensionality, most
current bandit algorithms face two challenges: high time complexity, and
extremely large upper regret bounds with high-dimensional data. In this
paper, to address these two challenges, we develop an algorithm of Contextual Bandits via
RAndom Projection (\texttt{CBRAP}) in the setting of linear payoffs, which
works especially for high-dimensional contextual data. The proposed
\texttt{CBRAP} algorithm is time-efficient and flexible, because it enables
players to choose an arm in a low-dimensional space, and relaxes the sparsity
assumption of a constant number of non-zero components made in previous work.
Moreover, we provide a linear upper regret bound for the proposed algorithm,
which depends on the reduced dimension.
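The projection step CBRAP relies on can be sketched as follows; the Gaussian matrix, its scaling, and the idea of running a standard linear bandit such as LinUCB on the projected contexts are our illustrative assumptions, not details taken from the paper:

```python
import math
import random

def random_projection_matrix(d_high, d_low, seed=0):
    # Gaussian random matrix scaled by 1/sqrt(d_low): the standard
    # Johnson-Lindenstrauss construction for norm-preserving projections.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / math.sqrt(d_low) for _ in range(d_high)]
            for _ in range(d_low)]

def project(A, x):
    # Map a high-dimensional context x into the low-dimensional space (A @ x);
    # a linear bandit then chooses arms using the projected contexts only.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
```

The payoff is that the per-round cost of the bandit update depends on the reduced dimension rather than the ambient one.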
Contextual Dueling Bandits
We consider the problem of learning to choose actions using contextual
information when provided with limited feedback in the form of relative
pairwise comparisons. We study this problem in the dueling-bandits framework of
Yue et al. (2009), which we extend to incorporate context. Roughly, the
learner's goal is to find the best policy, or way of behaving, in some space of
policies, although "best" is not always so clearly defined. Here, we propose a
new and natural solution concept, rooted in game theory, called a von Neumann
winner, a randomized policy that beats or ties every other policy. We show that
this notion overcomes important limitations of existing solutions, particularly
the Condorcet winner which has typically been used in the past, but which
requires strong and often unrealistic assumptions. We then present three
efficient algorithms for online learning in our setting, and for approximating
a von Neumann winner from batch-like data. The first of these algorithms
achieves particularly low regret, even when data is adversarial, although its
time and space requirements are linear in the size of the policy space. The
other two algorithms require time and space only logarithmic in the size of the
policy space when provided access to an oracle for solving classification
problems on the space.
Comment: 25 pages, 4 figures, published at COLT 201
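The von Neumann winner concept can be made concrete. A minimal sketch, assuming a fully known pairwise preference matrix and using multiplicative-weights self-play on the induced zero-sum game (the paper's online algorithms are considerably more involved):

```python
import math

def von_neumann_winner(P, iters=2000, eta=0.1):
    # P[i][j] = probability that policy i beats policy j, with
    # P[i][j] + P[j][i] = 1. A von Neumann winner is a mixture w with
    # sum_i w[i] * P[i][j] >= 1/2 against every opponent j. We approximate
    # it by multiplicative-weights self-play on the symmetric zero-sum game
    # with payoffs P[i][j] - 1/2 and return the time-averaged strategy.
    n = len(P)
    w = [1.0 / n] * n
    avg = [0.0] * n
    for _ in range(iters):
        payoff = [sum(P[i][j] * w[j] for j in range(n)) - 0.5
                  for i in range(n)]
        w = [wi * math.exp(eta * p) for wi, p in zip(w, payoff)]
        s = sum(w)
        w = [wi / s for wi in w]
        avg = [a + wi for a, wi in zip(avg, w)]
    return [a / iters for a in avg]
```

With cyclic preferences (no Condorcet winner) the mixture stays uniform, which is exactly the case where the von Neumann solution concept exists while the Condorcet one does not.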
Semiparametric Contextual Bandits
This paper studies semiparametric contextual bandits, a generalization of the
linear stochastic bandit problem where the reward for an action is modeled as a
linear function of known action features confounded by a non-linear
action-independent term. We design new algorithms that achieve
$\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is
$d$-dimensional, which matches the best known bounds for the simpler
unconfounded case and improves on a recent result of Greenewald et al. (2017).
Via an empirical evaluation, we show that our algorithms outperform prior
approaches when there are non-linear confounding effects on the rewards.
Technically, our algorithms use a new reward estimator inspired by
doubly-robust approaches, and our proofs require new concentration inequalities
for self-normalized martingales.
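The confounder-cancelling idea behind such estimators can be sketched as follows; the function name and the simple centering shown are illustrative assumptions, not the paper's exact estimator:

```python
def centered_feature(probs, feats, chosen):
    # Center the chosen action's features by the action-averaged feature
    # under the sampling distribution. Because the confounding term is
    # action-independent, it is orthogonal to centered features in
    # expectation, which is the core idea behind doubly-robust-style
    # reward estimators (the paper's estimator has further components).
    d = len(feats[0])
    mean = [sum(probs[a] * feats[a][k] for a in range(len(feats)))
            for k in range(d)]
    return [feats[chosen][k] - mean[k] for k in range(d)]
```

Regressing rewards on these centered features then recovers the linear part without bias from the non-linear term.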
CONQUER: Confusion Queried Online Bandit Learning
We present a new recommendation setting for picking out two items from a
given set to be highlighted to a user, based on contextual input. These two
items are presented to a user who chooses one of them, possibly stochastically,
with a bias that favours the item with the higher value. We propose a
second-order algorithmic framework whose members use relative
upper-confidence bounds to trade off exploration and exploitation, with some
exploring via sampling. We analyze one algorithm in this framework in an
adversarial setting with only mild assumptions on the data, and prove a regret
bound in terms of the number of rounds $T$ and the cumulative approximation
error of item values under a linear model. Experiments with product reviews
from 33 domains show the
advantage of our methods over algorithms designed for related settings, and
that UCB-based algorithms are inferior to greedy or sampling-based algorithms.
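As a rough illustration of confidence-bound-based pair selection (a hypothetical simplification; the framework's *relative* bounds between items are not reproduced by this one-liner):

```python
def pick_pair(means, widths):
    # Highlight the two items with the largest upper confidence bounds
    # (estimate + confidence width), so uncertain items still get exposure.
    order = sorted(range(len(means)),
                   key=lambda i: means[i] + widths[i], reverse=True)
    return order[0], order[1]
```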
An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives
We consider a contextual version of multi-armed bandit problem with global
knapsack constraints. In each round, the outcome of pulling an arm is a scalar
reward and a resource consumption vector, both dependent on the context, and
the global knapsack constraints require the total consumption for each resource
to be below some pre-fixed budget. The learning agent competes with an
arbitrary set of context-dependent policies. This problem was introduced by
Badanidiyuru et al. (2014), who gave a computationally inefficient algorithm
with near-optimal regret bounds for it. We give a computationally efficient
algorithm for this problem with slightly better regret bounds, by generalizing
the approach of Agarwal et al. (2014) for the non-constrained version of the
problem. The computational time of our algorithm scales logarithmically in the
size of the policy space. This answers the main open question of Badanidiyuru
et al. (2014). We also extend our results to a variant where there are no
knapsack constraints but the objective is an arbitrary Lipschitz concave
function of the sum of outcome vectors.
Comment: Extended abstract appeared in COLT 201
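A hypothetical greedy sketch of budget-aware arm selection (not the paper's policy-oracle algorithm, whose regret guarantees this toy rule does not share):

```python
def choose_arm(ucb_reward, lcb_cost, remaining_budget):
    # Rank arms by optimistic reward per unit of (pessimistic) cost and
    # skip arms whose estimated cost alone would exceed the remaining
    # budget; return None when no arm is feasible.
    best, best_ratio = None, float("-inf")
    for a in range(len(ucb_reward)):
        if lcb_cost[a] > remaining_budget:
            continue
        ratio = ucb_reward[a] / max(lcb_cost[a], 1e-9)
        if ratio > best_ratio:
            best, best_ratio = a, ratio
    return best
```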
Scalable Generalized Linear Bandits: Online Computation and Hashing
Generalized Linear Bandits (GLBs), a natural extension of the stochastic
linear bandits, has been popular and successful in recent years. However,
existing GLBs scale poorly with the number of rounds and the number of arms,
limiting their utility in practice. This paper proposes new, scalable solutions
to the GLB problem in two respects. First, unlike existing GLBs, whose
per-time-step space and time complexity grow at least linearly with time $t$,
we propose a new algorithm that performs online computations to enjoy a
constant space and time complexity. At its heart is a novel Generalized Linear
extension of the Online-to-confidence-set Conversion (GLOC method) that takes
\emph{any} online learning algorithm and turns it into a GLB algorithm. As a
special case, we apply GLOC to the online Newton step algorithm, which results
in a low-regret GLB algorithm with much lower time and memory complexity than
prior work. Second, for the case where the number of arms is very large, we
propose new algorithms in which each next arm is selected via an inner product
search. Such methods can be implemented via hashing algorithms (i.e.,
"hash-amenable") and result in a time complexity sublinear in the number of
arms $N$. While a Thompson sampling extension of GLOC is hash-amenable, its
regret bound for $d$-dimensional arm sets scales with $d^{3/2}$, whereas GLOC's
regret bound scales with $d$. Towards closing this gap, we propose a new
hash-amenable algorithm whose regret bound scales with $d^{5/4}$. Finally, we propose a fast
approximate hash-key computation (inner product) with a better accuracy than
the state-of-the-art, which can be of independent interest. We conclude the
paper with preliminary experimental results confirming the merits of our
methods.
Comment: accepted to NIPS'17 (typos fixed)
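The hash-amenable selection step can be sketched as a maximum inner product search; the exact linear scan below is the baseline that locality-sensitive hashing approximates in time sublinear in the number of arms:

```python
def select_arm(theta, arms):
    # With a fixed parameter estimate theta, picking the next arm reduces
    # to maximizing the inner product <theta, x> over the arm set. Exact
    # scan shown for clarity; hashing replaces it when the set is large.
    best, best_val = 0, float("-inf")
    for i, x in enumerate(arms):
        v = sum(t * xi for t, xi in zip(theta, x))
        if v > best_val:
            best, best_val = i, v
    return best
```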
BISTRO: An Efficient Relaxation-Based Method for Contextual Bandits
We present efficient algorithms for the problem of contextual bandits with
i.i.d. covariates, an arbitrary sequence of rewards, and an arbitrary class of
policies. Our algorithm BISTRO requires $d$ calls to the empirical risk
minimization (ERM) oracle per round, where $d$ is the number of actions. The
method uses unlabeled data to make the problem computationally simple. When the
ERM problem itself is computationally hard, we extend the approach by employing
multiplicative approximation algorithms for the ERM. The integrality gap of the
relaxation only enters in the regret bound rather than the benchmark. Finally,
we show that the adversarial version of the contextual bandit problem is
learnable (and efficiently so) whenever the full-information supervised online
learning problem has a non-trivial regret guarantee (and is efficiently solvable).
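A toy illustration of oracle-based learning, assuming a finite policy class and importance-weighted logged data; this is plain epsilon-greedy built on an ERM oracle, *not* BISTRO's relaxation-based method:

```python
import random

def erm_oracle(policies, dataset):
    # ERM oracle over a finite policy class: return the policy with the
    # highest empirical importance-weighted reward on the logged data.
    # dataset entries are (context, action, ips_weighted_reward).
    def value(pi):
        return sum(r for (x, a, r) in dataset if pi(x) == a)
    return max(policies, key=value)

def epsilon_greedy_round(policies, dataset, context, n_actions, eps, rng):
    # One bandit round: explore uniformly with probability eps, otherwise
    # play the action of the empirically best policy from the oracle.
    if rng.random() < eps:
        return rng.randrange(n_actions)
    return erm_oracle(policies, dataset)(context)
```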
Multi-Objective Generalized Linear Bandits
In this paper, we study the multi-objective bandits (MOB) problem, where a
learner repeatedly selects one arm to play and then receives a reward vector
consisting of multiple objectives. MOB has found many real-world applications
as varied as online recommendation and network routing. On the other hand,
these applications typically contain contextual information that can guide the
learning process which, however, is ignored by most of existing work. To
utilize this information, we associate each arm with a context vector and
assume the reward follows the generalized linear model (GLM). We adopt the
notion of Pareto regret to evaluate the learner's performance and develop a
novel algorithm for minimizing it. The essential idea is to apply a variant of
the online Newton step to estimate model parameters, based on which we utilize
the upper confidence bound (UCB) policy to construct an approximation of the
Pareto front, and then uniformly at random choose one arm from the approximate
Pareto front. Theoretical analysis shows that the proposed algorithm achieves
an $\tilde{O}(\sqrt{dT})$ Pareto regret, where $T$ is the time horizon and
$d$ is the dimension of contexts, matching the optimal result for the
single-objective contextual bandit problem. Numerical experiments demonstrate
the effectiveness of our method.
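The Pareto-front step can be sketched directly from per-arm UCB vectors; per the abstract, the learner then picks uniformly at random from this set:

```python
def pareto_front(ucb_vectors):
    # Keep every arm whose UCB vector is not dominated by another arm's:
    # dominated means no better in any objective and strictly worse in one.
    def dominated(i):
        vi = ucb_vectors[i]
        return any(all(b >= a for a, b in zip(vi, ucb_vectors[j])) and
                   any(b > a for a, b in zip(vi, ucb_vectors[j]))
                   for j in range(len(ucb_vectors)) if j != i)
    return [i for i in range(len(ucb_vectors)) if not dominated(i)]
```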
Off-policy evaluation for slate recommendation
This paper studies the evaluation of policies that recommend an ordered set
of items (e.g., a ranking) based on some context---a common scenario in web
search, ads, and recommendation. We build on techniques from combinatorial
bandits to introduce a new practical estimator that uses logged data to
estimate a policy's performance. A thorough empirical evaluation on real-world
data reveals that our estimator is accurate in a variety of settings, including
as a subroutine in a learning-to-rank task, where it achieves competitive
performance. We derive conditions under which our estimator is unbiased---these
conditions are weaker than prior heuristics for slate evaluation---and
experimentally demonstrate a smaller bias than parametric approaches, even when
these conditions are violated. Finally, our theory and experiments also show
exponential savings in the amount of required data compared with general
unbiased estimators.
Comment: 31 pages (9 main paper, 20 supplementary), 12 figures (2 main paper,
10 supplementary)
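For contrast with the paper's estimator, a minimal whole-slate inverse-propensity baseline (standard IPS, not the combinatorial estimator introduced here):

```python
def ips_slate_value(logged, target_prob):
    # Standard inverse propensity scoring over whole slates: reweight each
    # logged reward by target_prob(slate) / logging_propensity. This needs
    # the target slates to have support under logging and its variance
    # grows exponentially with slate size, which is precisely the data
    # requirement the paper's combinatorial estimator avoids.
    total = 0.0
    for slate, p_log, reward in logged:
        total += target_prob(slate) / p_log * reward
    return total / len(logged)
```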
Provably Optimal Algorithms for Generalized Linear Contextual Bandits
Contextual bandits are widely used in Internet services from news
recommendation to advertising, and to Web search. Generalized linear models
(logistic regression in particular) have demonstrated stronger performance
than linear models in many applications where rewards are binary. However, most
theoretical analyses on contextual bandits so far are on linear bandits. In
this work, we propose an upper confidence bound based algorithm for generalized
linear contextual bandits, which achieves an $\tilde{O}(\sqrt{dT})$ regret
over $T$ rounds with $d$-dimensional feature vectors. This regret matches the
minimax lower bound, up to logarithmic terms, and improves on the best previous
result by a $\sqrt{d}$ factor, assuming the number of arms is fixed. A key
component in our analysis is to establish a new, sharp finite-sample confidence
bound for maximum-likelihood estimates in generalized linear models, which may
be of independent interest. We also analyze a simpler upper confidence bound
algorithm, which is useful in practice, and prove it to have optimal regret for
certain cases.
Comment: Published at ICML 201
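A minimal sketch of the UCB scoring such algorithms use, assuming a logistic link; the rank-one inverse update and the bonus form are standard ingredients of this algorithm family, not the paper's full method:

```python
import math

def sherman_morrison(Ainv, x):
    # Rank-one update of the inverse design matrix: (A + x x^T)^{-1},
    # keeping the per-round cost at O(d^2) instead of a fresh inversion.
    d = len(x)
    Ax = [sum(Ainv[i][j] * x[j] for j in range(d)) for i in range(d)]
    denom = 1.0 + sum(x[i] * Ax[i] for i in range(d))
    return [[Ainv[i][j] - Ax[i] * Ax[j] / denom for j in range(d)]
            for i in range(d)]

def ucb_score(theta, Ainv, x, alpha):
    # Optimistic score for one arm: predicted logistic reward plus an
    # exploration bonus alpha * ||x||_{A^{-1}}.
    d = len(x)
    mean = 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))
    Ax = [sum(Ainv[i][j] * x[j] for j in range(d)) for i in range(d)]
    bonus = alpha * math.sqrt(sum(x[i] * Ax[i] for i in range(d)))
    return mean + bonus
```

The arm with the highest score is played, the reward updates the maximum-likelihood estimate, and the design matrix absorbs the played feature vector.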