2,272 research outputs found

    Action Centered Contextual Bandits

    Contextual bandits have become popular as they offer a middle ground between very simple approaches based on multi-armed bandits and very complex approaches using the full power of reinforcement learning. They have demonstrated success in web applications and have a rich body of associated theoretical guarantees. Linear models are well understood theoretically and preferred by practitioners because they are not only easily interpretable but also simple to implement and debug. Furthermore, if the linear model is true, we get very strong performance guarantees. Unfortunately, in emerging applications in mobile health, the time-invariant linear model assumption is untenable. We provide an extension of the linear model for contextual bandits that has two parts: baseline reward and treatment effect. We allow the former to be complex but keep the latter simple. We argue that this model is plausible for mobile health applications. At the same time, it leads to algorithms with strong performance guarantees as in the linear model setting, while still allowing for complex nonlinear baseline modeling. Our theory is supported by experiments on data gathered in a recently concluded mobile health study. Comment: to appear at NIPS 201
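    As a rough illustration of the two-part model described above (complex baseline plus simple linear treatment effect), the sketch below simulates such rewards and recovers the treatment effect with an action-centered least-squares estimator. It is a minimal sketch with a fixed randomization probability and hypothetical names, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 20000
theta = rng.normal(size=d)          # true treatment effect (unknown to the learner)

def baseline(x):
    # arbitrary non-linear, action-independent baseline reward
    return np.sin(3.0 * x[0]) + x[1] ** 2

A = 1e-6 * np.eye(d)
b = np.zeros(d)
for t in range(T):
    x = rng.normal(size=d)          # context
    pi = 0.5                        # fixed randomization probability (a simplification)
    a = float(rng.random() < pi)    # 1 = deliver treatment, 0 = do nothing
    r = baseline(x) + a * (theta @ x) + rng.normal(scale=0.1)
    # action-centered pseudo-reward: E[(a - pi) * r | x] = pi * (1 - pi) * theta^T x,
    # so the unknown baseline cancels in expectation and least squares on (z, y)
    # recovers theta despite the non-linear baseline
    z = pi * (1 - pi) * x
    y = (a - pi) * r
    A += np.outer(z, z)
    b += y * z
theta_hat = np.linalg.solve(A, b)
print("treatment-effect recovery error:", np.linalg.norm(theta_hat - theta))
```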

    Semiparametric Contextual Bandits

    This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for an action is modeled as a linear function of known action features confounded by a non-linear, action-independent term. We design new algorithms that achieve $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is $d$-dimensional, which matches the best known bounds for the simpler unconfounded case and improves on a recent result of Greenewald et al. (2017). Via an empirical evaluation, we show that our algorithms outperform prior approaches when there are non-linear confounding effects on the rewards. Technically, our algorithms use a new reward estimator inspired by doubly-robust approaches, and our proofs require new concentration inequalities for self-normalized martingales.
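    The centering idea behind such estimators can be illustrated with a small simulation: if the played action is randomized and its feature vector is centered by the policy-mean feature, the action-independent confounder drops out of the normal equations. This is a hedged sketch of that idea under a uniform policy with hypothetical names, not the paper's algorithm or its confidence bounds.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T = 4, 6, 30000
theta = rng.normal(size=d)                   # true linear part (unknown)

A = 1e-6 * np.eye(d)
b = np.zeros(d)
for t in range(T):
    X = rng.normal(size=(K, d))              # known action features for this round
    nu = 5.0 * np.sin(0.01 * t)              # action-independent confounding term
    probs = np.full(K, 1.0 / K)              # a simplification: uniform randomization
    a = rng.choice(K, p=probs)
    r = X[a] @ theta + nu + rng.normal(scale=0.1)
    x_bar = probs @ X                        # policy-mean feature
    centered = X[a] - x_bar
    # E[centered * nu] = 0, so the confounder cancels in the normal equations
    A += np.outer(centered, centered)
    b += centered * r
theta_hat = np.linalg.solve(A, b)
print("estimation error:", np.linalg.norm(theta_hat - theta))
```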

    Tight Regret Bounds for Infinite-armed Linear Contextual Bandits

    The linear contextual bandit is an important class of sequential decision making problems with a wide range of applications to recommender systems, online advertising, healthcare, and many other machine learning related tasks. While there is a lot of prior research, tight regret bounds for linear contextual bandits with infinite action sets remain open. In this paper, we address this open problem by considering the linear contextual bandit with (changing) infinite action sets. We prove a regret upper bound on the order of $O(\sqrt{d^2 T \log T}) \times \mathrm{poly}(\log\log T)$, where $d$ is the domain dimension and $T$ is the time horizon. Our upper bound matches the previous lower bound of $\Omega(\sqrt{d^2 T \log T})$ in [Li et al., 2019] up to iterated logarithmic terms. Comment: 10 pages, accepted for presentation at AISTATS 202
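    For intuition, a generic LinUCB-style selection rule can be run over a changing continuous action set by maximizing the optimistic index over sampled candidate actions, as in the sketch below. The fixed confidence scaling and the candidate-sampling step are simplifications; this does not implement the paper's algorithm or attain its stated bound.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, lam, beta = 3, 2000, 1.0, 1.0
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)      # unknown parameter on the unit sphere

V = lam * np.eye(d)                 # ridge-regression design matrix
b = np.zeros(d)
regret = 0.0
for t in range(T):
    # a changing "infinite" action set (the unit sphere), approximated by sampled candidates
    cands = rng.normal(size=(500, d))
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    theta_hat = np.linalg.solve(V, b)
    V_inv = np.linalg.inv(V)
    # optimistic index: estimated reward plus an elliptical confidence width
    width = np.sqrt(np.einsum("ij,jk,ik->i", cands, V_inv, cands))
    x = cands[np.argmax(cands @ theta_hat + beta * width)]
    r = x @ theta + rng.normal(scale=0.1)
    V += np.outer(x, x)
    b += r * x
    regret += np.max(cands @ theta) - x @ theta
print("cumulative regret over", T, "rounds:", regret)
```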

    Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits

    We study the linear contextual bandit problem with finite action sets. When the problem dimension is $d$, the time horizon is $T$, and there are $n \leq 2^{d/2}$ candidate actions per time period, we (1) show that the minimax expected regret is $\Omega(\sqrt{dT (\log T)(\log n)})$ for every algorithm, and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose regret matches the lower bound up to iterated logarithmic factors. Our algorithmic result saves two $\sqrt{\log T}$ factors from previous analyses, and our information-theoretic lower bound also improves previous results by one $\sqrt{\log T}$ factor, revealing a regret scaling quite different from classical multi-armed bandits, in which no logarithmic $T$ term is present in the minimax regret. Our proof techniques include variable confidence levels and a careful analysis of layer sizes of SupLinUCB on the upper bound side, and delicately constructed adversarial sequences showing the tightness of elliptical potential lemmas on the lower bound side.
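    The layered exploit/explore/filter structure that SupLinUCB-style algorithms use is sketched below for a finite action set. It uses a single fixed confidence scaling rather than the variable confidence levels of the VCL algorithm, and the constants are illustrative assumptions, so it should be read as a structural sketch rather than the paper's method.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, T = 5, 20, 3000
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)
S = int(np.ceil(np.log2(T)))              # number of layers
alpha = 1.0                               # confidence scaling (the VCL variant varies this per layer)

V = [np.eye(d) for _ in range(S)]         # one ridge-regression state per layer
bvec = [np.zeros(d) for _ in range(S)]
regret = 0.0
for t in range(T):
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    active = list(range(n))
    chosen, layer = None, None
    for s in range(S):
        Vinv = np.linalg.inv(V[s])
        mu = X[active] @ (Vinv @ bvec[s])
        w = alpha * np.sqrt(np.einsum("ij,jk,ik->i", X[active], Vinv, X[active]))
        if np.all(w <= 1.0 / np.sqrt(T)):
            chosen = active[int(np.argmax(mu + w))]   # widths negligible: exploit, record nothing
            break
        if np.any(w > 2.0 ** (-s)):
            chosen = active[int(np.argmax(w))]        # width too large: explore, record in layer s
            layer = s
            break
        ucb = mu + w                                  # otherwise filter and descend a layer
        active = [a for a, keep in zip(active, ucb >= ucb.max() - 2.0 ** (1 - s)) if keep]
    if chosen is None:
        chosen = active[rng.integers(len(active))]    # fallback if every layer was passed
    x = X[chosen]
    r = x @ theta + rng.normal(scale=0.1)
    if layer is not None:
        V[layer] += np.outer(x, x)
        bvec[layer] += r * x
    regret += np.max(X @ theta) - x @ theta
print("cumulative regret:", regret)
```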

    Contextual bandits with surrogate losses: Margin bounds and efficient algorithms

    We use surrogate losses to obtain several new regret bounds and new algorithms for contextual bandit learning. Using the ramp loss, we derive new margin-based regret bounds in terms of standard sequential complexity measures of a benchmark class of real-valued regression functions. Using the hinge loss, we derive an efficient algorithm with a $\sqrt{dT}$-type mistake bound against benchmark policies induced by $d$-dimensional regressors. Under realizability assumptions, our results also yield classical regret bounds.
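    Purely to make the surrogate-loss idea concrete, the toy below combines a multiclass hinge loss with importance-weighted bandit feedback under epsilon-greedy exploration. It is not the paper's algorithm (which works with benchmark regressor classes and a different reduction); all names and constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d, K, T, eps, lr = 6, 4, 20000, 0.1, 0.01
W_true = rng.normal(size=(K, d))              # defines which arm is best for each context

W = np.zeros((K, d))                          # linear scorers, one per arm
correct = 0
for t in range(T):
    x = rng.normal(size=d)
    scores = W @ x
    greedy = int(np.argmax(scores))
    explore = rng.random() < eps
    a = int(rng.integers(K)) if explore else greedy
    p = eps / K + (1.0 - eps) * (a == greedy) # probability of the chosen arm
    r = float(a == int(np.argmax(W_true @ x)))  # bandit feedback: 1 if the chosen arm was best
    correct += r
    # importance-weighted hinge update: treat the chosen arm as a positive example
    # with weight r / p, and take a subgradient step if the margin is violated
    g = r / p
    if g > 0.0:
        rival = int(np.argmax(np.where(np.arange(K) == a, -np.inf, scores)))
        if scores[a] - scores[rival] < 1.0:
            W[a] += lr * g * x
            W[rival] -= lr * g * x
print("fraction of rounds the best arm was chosen:", correct / T)
```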

    Nonparametric Stochastic Contextual Bandits

    We analyze the $K$-armed bandit problem where the reward for each arm is a noisy realization based on an observed context, under mild nonparametric assumptions. We attain tight results for top-arm identification and a sublinear regret of $\widetilde{O}\big(T^{\frac{1+D}{2+D}}\big)$, where $D$ is the context dimension, for a modified UCB algorithm that is simple to implement ($k$NN-UCB). We then give global intrinsic dimension dependent and ambient dimension independent regret bounds. We also discuss recovering topological structures within the context space based on expected bandit performance and provide an extension to infinite-armed contextual bandits. Finally, we experimentally show the improvement of our algorithm over existing multi-armed bandit approaches for both simulated tasks and MNIST image classification. Comment: AAAI 201
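    The kNN-UCB idea is concrete enough to sketch: estimate each arm's reward from the k nearest observed contexts and add an optimism bonus. The bonus form and constants below are heuristic assumptions for illustration, not the paper's exact algorithm or analysis.

```python
import numpy as np

rng = np.random.default_rng(5)
D, K, T, k = 2, 3, 5000, 20

def mean_reward(arm, x):
    # smooth, arm-dependent reward functions of the context (toy environment)
    centers = np.array([[0.2, 0.2], [0.5, 0.8], [0.8, 0.3]])
    return np.exp(-4.0 * np.sum((x - centers[arm]) ** 2))

hist = [([], []) for _ in range(K)]        # (contexts, rewards) observed per arm
regret = 0.0
for t in range(1, T + 1):
    x = rng.random(D)
    ucbs = []
    for a in range(K):
        ctxs, rs = hist[a]
        if len(ctxs) < k:
            ucbs.append(np.inf)            # force initial exploration of each arm
            continue
        dists = np.linalg.norm(np.array(ctxs) - x, axis=1)
        idx = np.argsort(dists)[:k]
        mu = np.mean(np.array(rs)[idx])
        # heuristic bonus: estimation noise plus bias from the k-th neighbour distance
        ucbs.append(mu + np.sqrt(2.0 * np.log(t) / k) + dists[idx[-1]])
    a = int(np.argmax(ucbs))
    r = mean_reward(a, x) + rng.normal(scale=0.1)
    hist[a][0].append(x)
    hist[a][1].append(r)
    regret += max(mean_reward(b, x) for b in range(K)) - mean_reward(a, x)
print("cumulative regret:", regret)
```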

    Provably Optimal Algorithms for Generalized Linear Contextual Bandits

    Contextual bandits are widely used in Internet services, from news recommendation to advertising to Web search. Generalized linear models (logistic regression in particular) have demonstrated stronger performance than linear models in many applications where rewards are binary. However, most theoretical analyses of contextual bandits so far are on linear bandits. In this work, we propose an upper confidence bound based algorithm for generalized linear contextual bandits, which achieves $\tilde{O}(\sqrt{dT})$ regret over $T$ rounds with $d$-dimensional feature vectors. This regret matches the minimax lower bound, up to logarithmic terms, and improves on the best previous result by a $\sqrt{d}$ factor, assuming the number of arms is fixed. A key component in our analysis is to establish a new, sharp finite-sample confidence bound for maximum-likelihood estimates in generalized linear models, which may be of independent interest. We also analyze a simpler upper confidence bound algorithm, which is useful in practice, and prove it to have optimal regret for certain cases. Comment: Published at ICML 201
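    One natural upper-confidence rule for a logistic reward model, in the spirit of this abstract, fits a (regularised) maximum-likelihood estimate and adds an elliptical confidence width. The sketch below uses a short forced-exploration phase, ridge regularisation, and a fixed width constant, all of which are simplifying assumptions rather than the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(6)
d, K, T, alpha = 4, 10, 3000, 1.0
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)                     # unknown parameter

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, iters=10, lam=1.0):
    # Newton's method on a ridge-regularised logistic log-likelihood
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + lam * w
        H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(X.shape[1])
        w -= np.linalg.solve(H, grad)
    return w

X_hist, y_hist = [], []
V = np.eye(d)
total = 0.0
for t in range(T):
    arms = rng.normal(size=(K, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)
    if t < 50:                                     # short forced-exploration phase
        a = t % K
    else:
        w = fit_logistic(np.array(X_hist), np.array(y_hist))
        Vinv = np.linalg.inv(V)
        width = np.sqrt(np.einsum("ij,jk,ik->i", arms, Vinv, arms))
        a = int(np.argmax(arms @ w + alpha * width))   # optimistic index
    x = arms[a]
    y = float(rng.random() < sigmoid(x @ theta))       # binary (Bernoulli) reward
    total += y
    X_hist.append(x)
    y_hist.append(y)
    V += np.outer(x, x)
print("average reward:", total / T)
```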

    Gaussian Process bandits with adaptive discretization

    In this paper, the problem of maximizing a black-box function $f:\mathcal{X} \to \mathbb{R}$ is studied in the Bayesian framework with a Gaussian Process (GP) prior. In particular, a new algorithm for this problem is proposed, and high probability bounds on its simple and cumulative regret are established. The query point selection rule in most existing methods involves an exhaustive search over an increasingly fine sequence of uniform discretizations of $\mathcal{X}$. The proposed algorithm, in contrast, adaptively refines $\mathcal{X}$, which leads to a lower computational complexity, particularly when $\mathcal{X}$ is a subset of a high dimensional Euclidean space. In addition to the computational gains, sufficient conditions are identified under which the regret bounds of the new algorithm improve upon the known results. Finally, an extension of the algorithm to the case of contextual bandits is proposed, and high probability bounds on the contextual regret are presented. Comment: 34 pages, 2 figure
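    The contrast between uniform and adaptive discretization can be seen in a toy 1-D GP-UCB loop that only refines the candidate grid near the current UCB maximiser. The kernel, the crude local-refinement rule, and all constants below are assumptions for illustration; this is not the paper's algorithm or its tree-based discretization.

```python
import numpy as np

rng = np.random.default_rng(7)
noise, ell, T = 0.05, 0.15, 40

def f(x):                                    # unknown black-box function on [0, 1]
    return np.sin(6.0 * x) + 0.5 * np.cos(11.0 * x)

def kern(a, b):                              # RBF kernel with unit prior variance
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_posterior(Xq, X, y):
    K = kern(X, X) + noise ** 2 * np.eye(len(X))
    Ks = kern(Xq, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mu, np.maximum(var, 1e-12)

cands = np.linspace(0.0, 1.0, 5)             # coarse initial discretization
X = np.array([0.5])
y = np.array([f(0.5) + noise * rng.normal()])
for t in range(T):
    mu, var = gp_posterior(cands, X, y)
    ucb = mu + 2.0 * np.sqrt(var)
    x_next = cands[np.argmax(ucb)]
    X = np.append(X, x_next)
    y = np.append(y, f(x_next) + noise * rng.normal())
    # adaptive refinement: add candidates only near the current UCB maximiser,
    # instead of uniformly refining the whole domain
    step = 0.5 ** (t // 5 + 2)
    new = np.clip([x_next - step, x_next + step], 0.0, 1.0)
    cands = np.unique(np.concatenate([cands, new]))
best = X[np.argmax(y)]
print("best query found:", best, "value:", f(best))
```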

    Linear Contextual Bandits with Knapsacks

    We consider the linear contextual bandit problem with resource consumption, in addition to reward generation. In each round, the outcome of pulling an arm is a reward as well as a vector of resource consumptions. The expected values of these outcomes depend linearly on the context of that arm. The budget/capacity constraints require that the total consumption doesn't exceed the budget for each resource. The objective is once again to maximize the total reward. This problem turns out to be a common generalization of classic linear contextual bandits (linContextual), bandits with knapsacks (BwK), and the online stochastic packing problem (OSPP). We present algorithms with near-optimal regret bounds for this problem. Our bounds compare favorably to results on the unstructured version of the problem, where the relation between the contexts and the outcomes could be arbitrary, but the algorithm only competes against a fixed set of policies accessible through an optimization oracle. We combine techniques from the work on linContextual, BwK, and OSPP in a nontrivial manner while also tackling new difficulties that are not present in any of these special cases.
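    A simplified single-resource sketch of this combination is shown below: optimistic linear estimates of reward and consumption, together with a multiplicatively updated dual price on the resource, in the general spirit of BwK-style algorithms. It is not the paper's algorithm; the constants, the stopping rule, and the "do nothing" option are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
d, K, T, B = 4, 8, 5000, 1000.0        # B = total budget of a single resource
th_r = rng.uniform(0.1, 0.9, size=d)   # unknown reward parameter
th_c = rng.uniform(0.1, 0.9, size=d)   # unknown consumption parameter

Vr, br = np.eye(d), np.zeros(d)
Vc, bc = np.eye(d), np.zeros(d)
Z, eta, beta = 1.0, 0.05, 0.5          # dual price on the resource, its step size, confidence width
budget, total_reward = B, 0.0
for t in range(T):
    if budget <= 1.0:
        break                          # stop once a pull might exceed the budget
    X = rng.uniform(size=(K, d)) / d   # contexts scaled so rewards and costs lie roughly in [0, 1]
    wr, wc = np.linalg.solve(Vr, br), np.linalg.solve(Vc, bc)
    Vinv_r, Vinv_c = np.linalg.inv(Vr), np.linalg.inv(Vc)
    width_r = beta * np.sqrt(np.einsum("ij,jk,ik->i", X, Vinv_r, X))
    width_c = beta * np.sqrt(np.einsum("ij,jk,ik->i", X, Vinv_c, X))
    # optimistic index: upper bound on reward minus price times lower bound on cost
    idx = (X @ wr + width_r) - Z * np.maximum(X @ wc - width_c, 0.0)
    a = int(np.argmax(idx))
    r_obs, c_obs = 0.0, 0.0
    if idx[a] > 0.0:                   # otherwise take the "do nothing" option this round
        x = X[a]
        r_obs = float(np.clip(x @ th_r + 0.05 * rng.normal(), 0.0, 1.0))
        c_obs = float(np.clip(x @ th_c + 0.05 * rng.normal(), 0.0, 1.0))
        Vr += np.outer(x, x); br += r_obs * x
        Vc += np.outer(x, x); bc += c_obs * x
    total_reward += r_obs
    budget -= c_obs
    Z *= np.exp(eta * (c_obs - B / T)) # raise the price when spending faster than the per-round pace B/T
print("total reward:", total_reward, "budget remaining:", budget)
```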

    contextual: Evaluating Contextual Multi-Armed Bandit Problems in R

    Over the past decade, contextual bandit algorithms have been gaining in popularity due to their effectiveness and flexibility in solving sequential decision problems---from online advertising and finance to clinical trial design and personalized medicine. At the same time, there are, as of yet, surprisingly few options that enable researchers and practitioners to simulate and compare the wealth of new and existing bandit algorithms in a standardized way. To help close this gap between analytical research and empirical evaluation, the current paper introduces the object-oriented R package "contextual": a user-friendly and, through its object-oriented structure, easily extensible framework that facilitates parallelized comparison of contextual and context-free bandit policies through both simulation and offline analysis. Comment: 55 pages, 12 figure
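    The package itself is written in R; purely to illustrate the kind of policy-versus-policy simulation comparison it is built for, here is a small Python sketch with hypothetical class names. This is not the package's API, just the general bandit/policy/simulator pattern the abstract describes.

```python
import numpy as np

class ToyContextualBandit:
    # toy environment with linear expected rewards (hypothetical; not the package's API)
    def __init__(self, d=5, k=4, seed=0):
        self.rng = np.random.default_rng(seed)
        self.theta = self.rng.normal(size=(k, d))
        self.d, self.k = d, k
    def context(self):
        return self.rng.normal(size=self.d)
    def reward(self, x, a):
        return float(self.theta[a] @ x + self.rng.normal(scale=0.1))

class ContextFreeEpsGreedy:
    # context-free policy: epsilon-greedy on per-arm mean rewards
    def __init__(self, k, d, eps=0.1, seed=1):
        self.rng = np.random.default_rng(seed)
        self.eps, self.sums, self.counts = eps, np.zeros(k), np.ones(k)
    def act(self, x):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.sums)))
        return int(np.argmax(self.sums / self.counts))
    def update(self, x, a, r):
        self.sums[a] += r
        self.counts[a] += 1

class LinUCBPolicy:
    # contextual policy: per-arm ridge regression with an optimism bonus
    def __init__(self, k, d, alpha=1.0):
        self.alpha = alpha
        self.V = [np.eye(d) for _ in range(k)]
        self.b = [np.zeros(d) for _ in range(k)]
    def act(self, x):
        scores = []
        for V, b in zip(self.V, self.b):
            Vinv = np.linalg.inv(V)
            scores.append(x @ (Vinv @ b) + self.alpha * np.sqrt(x @ Vinv @ x))
        return int(np.argmax(scores))
    def update(self, x, a, r):
        self.V[a] += np.outer(x, x)
        self.b[a] += r * x

def simulate(policy_factory, horizon=2000):
    env = ToyContextualBandit()
    policy = policy_factory(env.k, env.d)
    total = 0.0
    for _ in range(horizon):
        x = env.context()
        a = policy.act(x)
        r = env.reward(x, a)
        policy.update(x, a, r)
        total += r
    return total

for name, factory in [("context-free eps-greedy", ContextFreeEpsGreedy), ("per-arm LinUCB", LinUCBPolicy)]:
    print(name, simulate(factory))
```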