Action Centered Contextual Bandits
Contextual bandits have become popular as they offer a middle ground between
very simple approaches based on multi-armed bandits and very complex approaches
using the full power of reinforcement learning. They have demonstrated success
in web applications and have a rich body of associated theoretical guarantees.
Linear models are well understood theoretically and preferred by practitioners
because they are not only easily interpretable but also simple to implement and
debug. Furthermore, if the linear model is true, we get very strong performance
guarantees. Unfortunately, in emerging applications in mobile health, the
time-invariant linear model assumption is untenable. We provide an extension of
the linear model for contextual bandits that has two parts: baseline reward and
treatment effect. We allow the former to be complex but keep the latter simple.
We argue that this model is plausible for mobile health applications. At the
same time, it leads to algorithms with strong performance guarantees as in the
linear model setting, while still allowing for complex nonlinear baseline
modeling. Our theory is supported by experiments on data gathered in a recently
concluded mobile health study.

Comment: to appear at NIPS 201
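The two-part model can be illustrated with a small simulation: the reward is a complex nonlinear baseline plus, when treatment is delivered, a simple linear effect, and regressing rewards on action-centered features recovers the treatment effect without modeling the baseline at all. A minimal sketch, with an invented effect vector, baseline function, and randomization probability:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 20000
theta = np.array([0.5, -0.3, 0.2])   # invented linear treatment effect

def baseline(x):
    # complex nonlinear baseline reward; the estimator never models it
    return np.sin(x.sum()) + x[0] ** 2

pi = 0.5                             # randomization probability of treatment
Z, R = [], []
for _ in range(T):
    x = rng.normal(size=d)
    a = rng.binomial(1, pi)          # a = 1: deliver treatment
    r = baseline(x) + a * (x @ theta) + rng.normal(scale=0.1)
    Z.append((a - pi) * x)           # action-centered feature
    R.append(r)

# least squares on action-centered features is consistent for theta:
# the baseline term averages out because E[a - pi] = 0
theta_hat = np.linalg.lstsq(np.array(Z), np.array(R), rcond=None)[0]
```

The centering is what buys robustness: the baseline enters the normal equations only through terms with mean zero, so it never biases the treatment-effect estimate.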
Semiparametric Contextual Bandits
This paper studies semiparametric contextual bandits, a generalization of the
linear stochastic bandit problem where the reward for an action is modeled as a
linear function of known action features confounded by a non-linear
action-independent term. We design new algorithms that achieve
$\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is
$d$-dimensional, which matches the best known bounds for the simpler
unconfounded case and improves on a recent result of Greenewald et al. (2017).
Via an empirical evaluation, we show that our algorithms outperform prior
approaches when there are non-linear confounding effects on the rewards.
Technically, our algorithms use a new reward estimator inspired by
doubly-robust approaches and our proofs require new concentration inequalities
for self-normalized martingales.
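The confounded reward model, and why centering the chosen action's features against the policy mean neutralizes the action-independent term, can be sketched as follows (the confound signal and all dimensions are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T = 4, 5, 20000
theta = rng.normal(size=d)                   # unknown linear parameter

X = rng.normal(size=(T, K, d))               # K action feature vectors per round
conf = 2.0 * np.sin(np.arange(T) / 50.0)     # non-linear action-independent confound
a = rng.integers(K, size=T)                  # uniform exploration policy
Xa = X[np.arange(T), a]
r = Xa @ theta + conf + rng.normal(scale=0.1, size=T)

# centering the chosen features against the per-round policy mean makes the
# confound vanish from the regression in expectation
Z = Xa - X.mean(axis=1)
theta_hat = np.linalg.lstsq(Z, r, rcond=None)[0]
```

The paper's estimator is more refined (doubly-robust-inspired, with adaptive action probabilities), but this uniform-policy version already shows the key cancellation.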
Tight Regret Bounds for Infinite-armed Linear Contextual Bandits
Linear contextual bandit is an important class of sequential decision making
problems with a wide range of applications to recommender systems, online
advertising, healthcare, and many other machine learning related tasks. While
there is a lot of prior research, tight regret bounds of linear contextual
bandit with infinite action sets remain open. In this paper, we address this
open problem by considering the linear contextual bandit with (changing)
infinite action sets. We prove a regret upper bound on the order of
$\sqrt{d^2 T \log T} \cdot \mathrm{poly}(\log\log T)$, where $d$ is the domain
dimension and $T$ is the time horizon. Our upper bound matches the previous
lower bound of $\Omega(\sqrt{d^2 T \log T})$ in [Li et al., 2019] up to iterated
logarithmic terms.

Comment: 10 pages, accepted for presentation at AISTATS 202
Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits
We study the linear contextual bandit problem with finite action sets. When
the problem dimension is $d$, the time horizon is $T$, and there are $K$
candidate actions per time period, we (1) show that the minimax
expected regret is $\Omega(\sqrt{dT \log T \log K})$ for every algorithm,
and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose
regret matches the lower bound up to iterated logarithmic factors. Our
algorithmic result saves two $\sqrt{\log T}$ factors from previous analysis,
and our information-theoretic lower bound also improves previous results by
one $\sqrt{\log K}$ factor, revealing a regret scaling quite different from
classical multi-armed bandits in which no logarithmic term is present in
minimax regret. Our proof techniques include variable confidence levels and a
careful analysis of layer sizes of SupLinUCB on the upper bound side, and
delicately constructed adversarial sequences showing the tightness of
elliptical potential lemmas on the lower bound side.
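The SupLinUCB family refines the basic principle of optimism under a confidence ellipsoid. A minimal LinUCB sketch for finite action sets (not the VCL SupLinUCB algorithm itself, whose layered confidence levels are more intricate), with invented parameters and constants:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, T, alpha = 3, 10, 3000, 1.0
theta = np.array([0.6, -0.4, 0.2])           # unknown parameter (invented)

A = np.eye(d)                                # regularized design matrix
b = np.zeros(d)
regret = 0.0
for t in range(T):
    X = rng.normal(size=(K, d))              # this round's K candidate actions
    theta_hat = np.linalg.solve(A, b)        # ridge estimate of theta
    # confidence width: x^T A^{-1} x for each candidate action
    width = np.sqrt(np.einsum('kd,dc,kc->k', X, np.linalg.inv(A), X))
    a = int(np.argmax(X @ theta_hat + alpha * width))   # optimistic choice
    r = X[a] @ theta + rng.normal(scale=0.1)
    A += np.outer(X[a], X[a])
    b += X[a] * r
    regret += np.max(X @ theta) - X[a] @ theta
```

SupLinUCB-style analyses control the width term in layers so that rewards used in each layer are conditionally independent, which is where the sharper logarithmic factors come from.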
Contextual bandits with surrogate losses: Margin bounds and efficient algorithms
We use surrogate losses to obtain several new regret bounds and new
algorithms for contextual bandit learning. Using the ramp loss, we derive new
margin-based regret bounds in terms of standard sequential complexity measures
of a benchmark class of real-valued regression functions. Using the hinge loss,
we derive an efficient algorithm with a $\sqrt{dT}$-type mistake bound against
benchmark policies induced by $d$-dimensional regressors. Under realizability
assumptions, our results also yield classical regret bounds.
Nonparametric Stochastic Contextual Bandits
We analyze the $K$-armed bandit problem where the reward for each arm is a
noisy realization based on an observed context under mild nonparametric
assumptions. We attain tight results for top-arm identification and a sublinear
regret of $\tilde{O}(T^{\frac{1+D}{2+D}})$, where $D$ is the
context dimension, for a modified UCB algorithm that is simple to implement
(NN-UCB). We then give global intrinsic dimension dependent and ambient
dimension independent regret bounds. We also discuss recovering topological
structures within the context space based on expected bandit performance and
provide an extension to infinite-armed contextual bandits. Finally, we
experimentally show the improvement of our algorithm over existing multi-armed
bandit approaches for both simulated tasks and MNIST image classification.

Comment: AAAI 201
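A simplified stand-in for the nearest-neighbour idea: estimate each arm's reward by averaging past rewards observed at nearby contexts, and add a count-based exploration bonus. This is not the paper's NN-UCB as specified; the radius, bonus constant, and reward functions below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
K, T, h = 3, 4000, 0.05                      # arms, horizon, neighbourhood radius
means = [lambda x: np.sin(3 * x), lambda x: x ** 2, lambda x: 0.5]

ctx = rng.uniform(size=T)
best = np.array([max(m(x) for m in means) for x in ctx])

hist_x = [[] for _ in range(K)]              # contexts at which each arm was pulled
hist_r = [[] for _ in range(K)]              # corresponding observed rewards
regret_ucb = 0.0
for t, x in enumerate(ctx):
    ucb = np.full(K, np.inf)                 # unexplored neighbourhoods first
    for arm in range(K):
        near = np.abs(np.array(hist_x[arm]) - x) <= h
        n = int(near.sum())
        if n:
            ucb[arm] = (np.array(hist_r[arm])[near].mean()
                        + np.sqrt(0.2 * np.log(t + 1) / n))
    a = int(np.argmax(ucb))
    hist_x[a].append(x)
    hist_r[a].append(means[a](x) + rng.normal(scale=0.1))
    regret_ucb += best[t] - means[a](x)

# uniform-random baseline for comparison
regret_rand = sum(best[t] - means[rng.integers(K)](x) for t, x in enumerate(ctx))
```

Local averaging needs no parametric form for the arm means, which is the point of the nonparametric setting; the price is regret that degrades with the context dimension.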
Provably Optimal Algorithms for Generalized Linear Contextual Bandits
Contextual bandits are widely used in Internet services from news
recommendation to advertising, and to Web search. Generalized linear models
(logistic regression in particular) have demonstrated stronger performance
than linear models in many applications where rewards are binary. However, most
theoretical analyses on contextual bandits so far are on linear bandits. In
this work, we propose an upper confidence bound based algorithm for generalized
linear contextual bandits, which achieves an $\tilde{O}(\sqrt{dT})$ regret over
$T$ rounds with $d$-dimensional feature vectors. This regret matches the
minimax lower bound, up to logarithmic terms, and improves on the best previous
result by a $\sqrt{d}$ factor, assuming the number of arms is fixed. A key
component in our analysis is to establish a new, sharp finite-sample confidence
bound for maximum-likelihood estimates in generalized linear models, which may
be of independent interest. We also analyze a simpler upper confidence bound
algorithm, which is useful in practice, and prove it to have optimal regret for
certain cases.

Comment: Published at ICML 201
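The two ingredients, a maximum-likelihood fit of the generalized linear model and an upper confidence bound driven by the design matrix, can be sketched for logistic rewards. The constants and the warm-started Newton refit are illustrative simplifications, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(4)
d, K, T, alpha = 3, 10, 3000, 0.5
theta = np.array([1.0, -1.0, 0.5])           # unknown parameter (invented)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

V = np.eye(d)                                # regularized design matrix
X_hist, y_hist = [], []
theta_hat = np.zeros(d)
regret = 0.0
for t in range(T):
    X = rng.normal(size=(K, d))
    width = np.sqrt(np.einsum('kd,dc,kc->k', X, np.linalg.inv(V), X))
    a = int(np.argmax(X @ theta_hat + alpha * width))   # optimistic choice
    y = rng.binomial(1, sigmoid(X[a] @ theta))          # binary reward
    X_hist.append(X[a]); y_hist.append(y)
    V += np.outer(X[a], X[a])
    # a few Newton steps toward the l2-regularized logistic MLE
    Xh, yh = np.array(X_hist), np.array(y_hist)
    for _ in range(5):
        p = sigmoid(Xh @ theta_hat)
        grad = Xh.T @ (p - yh) + theta_hat
        hess = Xh.T @ (Xh * (p * (1 - p))[:, None]) + np.eye(d)
        theta_hat = theta_hat - np.linalg.solve(hess, grad)
    regret += np.max(sigmoid(X @ theta)) - sigmoid(X[a] @ theta)
```

Because the link function is monotone, ranking arms by the linear score plus a confidence width is enough; the hard part of the analysis is the finite-sample confidence bound for the MLE itself.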
Gaussian Process bandits with adaptive discretization
In this paper, the problem of maximizing a black-box function $f$ over a domain
$\mathcal{X}$ is studied in the Bayesian framework with a Gaussian Process
(GP) prior. In particular, a new algorithm for this problem is proposed, and
high probability bounds on its simple and cumulative regret are established.
The query point selection rule in most existing methods involves an exhaustive
search over an increasingly fine sequence of uniform discretizations of
$\mathcal{X}$. The proposed algorithm, in contrast, adaptively refines
$\mathcal{X}$, which leads to a lower computational complexity, particularly
when $\mathcal{X}$ is a subset of a high dimensional Euclidean space. In
addition to the computational gains, sufficient conditions are identified under
which the regret bounds of the new algorithm improve upon the known results.
Finally, an extension of the algorithm to the case of contextual bandits is
proposed, and high probability bounds on the contextual regret are presented.

Comment: 34 pages, 2 figures
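For contrast with the adaptive scheme, a GP-UCB-style loop over one fixed uniform discretization, essentially the exhaustive-search baseline the paper improves on, can be sketched. The objective, kernel lengthscale, and confidence multiplier are all invented:

```python
import numpy as np

rng = np.random.default_rng(5)
T, noise = 60, 0.1
f = lambda x: np.sin(5 * x) * np.exp(-x)     # black-box objective (invented)
kern = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / 0.2) ** 2)

grid = np.linspace(0.0, 1.0, 200)            # fixed uniform discretization
xs, ys = [], []
for t in range(T):
    if xs:
        X = np.array(xs)
        Kxx = kern(X, X) + noise ** 2 * np.eye(len(X))
        Kgx = kern(grid, X)
        # GP posterior mean and variance on the whole grid
        mu = Kgx @ np.linalg.solve(Kxx, np.array(ys))
        var = 1.0 - np.einsum('ij,ji->i', Kgx, np.linalg.solve(Kxx, Kgx.T))
        x = grid[int(np.argmax(mu + 2.0 * np.sqrt(np.clip(var, 0.0, None))))]
    else:
        x = grid[rng.integers(len(grid))]     # first query at random
    xs.append(x)
    ys.append(f(x) + rng.normal(scale=noise))

# recommend the grid point with the highest posterior mean
X = np.array(xs)
Kxx = kern(X, X) + noise ** 2 * np.eye(T)
mu = kern(grid, X) @ np.linalg.solve(Kxx, np.array(ys))
x_hat = grid[int(np.argmax(mu))]
```

Every round scores the entire grid, which is exactly the cost the paper's adaptive refinement avoids: it keeps the discretization coarse except near promising regions.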
Linear Contextual Bandits with Knapsacks
We consider the linear contextual bandit problem with resource consumption,
in addition to reward generation. In each round, the outcome of pulling an arm
is a reward as well as a vector of resource consumptions. The expected values
of these outcomes depend linearly on the context of that arm. The
budget/capacity constraints require that the total consumption doesn't exceed
the budget for each resource. The objective is once again to maximize the total
reward. This problem turns out to be a common generalization of classic linear
contextual bandits (linContextual), bandits with knapsacks (BwK), and the
online stochastic packing problem (OSPP). We present algorithms with
near-optimal regret bounds for this problem. Our bounds compare favorably to
results on the unstructured version of the problem where the relation between
the contexts and the outcomes could be arbitrary, but the algorithm only
competes against a fixed set of policies accessible through an optimization
oracle. We combine techniques from the work on linContextual, BwK, and OSPP in
a nontrivial manner while also tackling new difficulties that are not present
in any of these special cases.
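A much simpler heuristic than the paper's algorithm conveys the setting: estimate reward and cost vectors by regularized least squares, pick the arm with the best optimistic reward per pessimistic unit of cost, and stop when the budget runs out. Everything here (weights, budget, and the bang-per-buck rule itself) is an invented illustration, not the paper's method:

```python
import numpy as np

d, K, B = 3, 5, 100.0
theta_r = np.array([0.5, 0.3, 0.2])          # true reward weights (invented)
theta_c = np.array([0.2, 0.4, 0.4])          # true cost weights, one resource

def run(use_ucb):
    rng = np.random.default_rng(7)
    A = np.eye(d); br = np.zeros(d); bc = np.zeros(d)
    spent = reward = 0.0
    while spent < B:                         # stop once the budget is exhausted
        X = np.abs(rng.normal(size=(K, d)))  # nonnegative contexts
        if use_ucb:
            th_r = np.linalg.solve(A, br)    # ridge estimates of both vectors
            th_c = np.linalg.solve(A, bc)
            bonus = np.sqrt(np.einsum('kd,dc,kc->k', X, np.linalg.inv(A), X))
            # optimistic reward over pessimistic cost: a bang-per-buck heuristic
            a = int(np.argmax((X @ th_r + bonus) /
                              np.clip(X @ th_c - bonus, 0.05, None)))
        else:
            a = int(rng.integers(K))
        r = X[a] @ theta_r + rng.normal(scale=0.05)
        c = max(X[a] @ theta_c + rng.normal(scale=0.05), 0.0)
        A += np.outer(X[a], X[a]); br += X[a] * r; bc += X[a] * c
        spent += c; reward += r
    return reward

reward_ucb, reward_rand = run(True), run(False)
```

The last pull may slightly overshoot the budget, which is the usual convention in bandits-with-knapsacks simulations; the algorithms in the paper handle multiple resources and come with regret guarantees this heuristic lacks.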
contextual: Evaluating Contextual Multi-Armed Bandit Problems in R
Over the past decade, contextual bandit algorithms have been gaining in
popularity due to their effectiveness and flexibility in solving sequential
decision problems---from online advertising and finance to clinical trial
design and personalized medicine. At the same time, there are, as of yet,
surprisingly few options that enable researchers and practitioners to simulate
and compare the wealth of new and existing bandit algorithms in a standardized
way. To help close this gap between analytical research and empirical
evaluation, the current paper introduces the object-oriented R package
"contextual": a user-friendly and, through its object-oriented structure,
easily extensible framework that facilitates parallelized comparison of
contextual and context-free bandit policies through both simulation and offline
analysis.

Comment: 55 pages, 12 figures
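The simulate-and-compare workflow such a framework automates can be mimicked in a few lines (the package itself is R, and none of the names below are its API): define a bandit, plug in interchangeable policies, and compare cumulative reward under a shared protocol:

```python
import numpy as np

def simulate(policy, T=2000, seed=0):
    # shared evaluation protocol: same bandit, same horizon, same seed
    rng = np.random.default_rng(seed)
    probs = np.array([0.2, 0.4, 0.8])        # Bernoulli arm means (invented)
    wins, pulls = np.zeros(3), np.zeros(3)
    total = 0
    for t in range(T):
        a = policy(wins, pulls, t, rng)
        r = rng.binomial(1, probs[a])
        wins[a] += r; pulls[a] += 1; total += r
    return total

def epsilon_greedy(wins, pulls, t, rng, eps=0.1):
    if pulls.min() == 0 or rng.random() < eps:
        return int(rng.integers(3))
    return int(np.argmax(wins / pulls))

def ucb1(wins, pulls, t, rng):
    if pulls.min() == 0:
        return int(np.argmin(pulls))         # pull each arm once first
    return int(np.argmax(wins / pulls + np.sqrt(2 * np.log(t + 1) / pulls)))

results = {p.__name__: simulate(p) for p in (epsilon_greedy, ucb1)}
```

The package's value is precisely this separation of bandit, policy, and evaluation loop, plus parallelization and offline (logged-data) evaluation on top.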