2,272 research outputs found

    Action Centered Contextual Bandits

    Contextual bandits have become popular as they offer a middle ground between very simple approaches based on multi-armed bandits and very complex approaches using the full power of reinforcement learning. They have demonstrated success in web applications and have a rich body of associated theoretical guarantees. Linear models are well understood theoretically and preferred by practitioners because they are not only easily interpretable but also simple to implement and debug. Furthermore, if the linear model is true, we get very strong performance guarantees. Unfortunately, in emerging applications in mobile health, the time-invariant linear model assumption is untenable. We provide an extension of the linear model for contextual bandits that has two parts: baseline reward and treatment effect. We allow the former to be complex but keep the latter simple. We argue that this model is plausible for mobile health applications. At the same time, it leads to algorithms with strong performance guarantees as in the linear model setting, while still allowing for complex nonlinear baseline modeling. Our theory is supported by experiments on data gathered in a recently concluded mobile health study. Comment: to appear at NIPS 201
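    As a rough illustration of the two-part model described above (complex baseline plus simple linear treatment effect), the sketch below simulates such rewards and recovers the treatment effect with an action-centered least-squares estimator. It is a minimal sketch with a fixed randomization probability and hypothetical names, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 20000
theta = rng.normal(size=d)          # true treatment effect (unknown to the learner)

def baseline(x):
    # arbitrary non-linear, action-independent baseline reward
    return np.sin(3.0 * x[0]) + x[1] ** 2

A = 1e-6 * np.eye(d)
b = np.zeros(d)
for t in range(T):
    x = rng.normal(size=d)          # context
    pi = 0.5                        # fixed randomization probability (a simplification)
    a = float(rng.random() < pi)    # 1 = deliver treatment, 0 = do nothing
    r = baseline(x) + a * (theta @ x) + rng.normal(scale=0.1)
    # action-centered pseudo-reward: E[(a - pi) * r | x] = pi * (1 - pi) * theta^T x,
    # so the unknown baseline cancels in expectation and least squares on (z, y)
    # recovers theta despite the non-linear baseline
    z = pi * (1 - pi) * x
    y = (a - pi) * r
    A += np.outer(z, z)
    b += y * z
theta_hat = np.linalg.solve(A, b)
print("treatment-effect recovery error:", np.linalg.norm(theta_hat - theta))
```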

    Semiparametric Contextual Bandits

    This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for an action is modeled as a linear function of known action features confounded by a non-linear, action-independent term. We design new algorithms that achieve $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is $d$-dimensional, which matches the best known bounds for the simpler unconfounded case and improves on a recent result of Greenewald et al. (2017). Via an empirical evaluation, we show that our algorithms outperform prior approaches when there are non-linear confounding effects on the rewards. Technically, our algorithms use a new reward estimator inspired by doubly-robust approaches, and our proofs require new concentration inequalities for self-normalized martingales.
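    The centering idea behind such estimators can be illustrated with a small simulation: if the played action is randomized and its feature vector is centered by the policy-mean feature, the action-independent confounder drops out of the normal equations. This is a hedged sketch of that idea under a uniform policy with hypothetical names, not the paper's algorithm or its confidence bounds.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T = 4, 6, 30000
theta = rng.normal(size=d)                   # true linear part (unknown)

A = 1e-6 * np.eye(d)
b = np.zeros(d)
for t in range(T):
    X = rng.normal(size=(K, d))              # known action features for this round
    nu = 5.0 * np.sin(0.01 * t)              # action-independent confounding term
    probs = np.full(K, 1.0 / K)              # a simplification: uniform randomization
    a = rng.choice(K, p=probs)
    r = X[a] @ theta + nu + rng.normal(scale=0.1)
    x_bar = probs @ X                        # policy-mean feature
    centered = X[a] - x_bar
    # E[centered * nu] = 0, so the confounder cancels in the normal equations
    A += np.outer(centered, centered)
    b += centered * r
theta_hat = np.linalg.solve(A, b)
print("estimation error:", np.linalg.norm(theta_hat - theta))
```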

    Tight Regret Bounds for Infinite-armed Linear Contextual Bandits

    The linear contextual bandit is an important class of sequential decision making problems with a wide range of applications to recommender systems, online advertising, healthcare, and many other machine learning related tasks. While there is a lot of prior research, tight regret bounds for linear contextual bandits with infinite action sets remain open. In this paper, we address this open problem by considering the linear contextual bandit with (changing) infinite action sets. We prove a regret upper bound on the order of $O(\sqrt{d^2 T \log T}) \times \mathrm{poly}(\log\log T)$, where $d$ is the domain dimension and $T$ is the time horizon. Our upper bound matches the previous lower bound of $\Omega(\sqrt{d^2 T \log T})$ in [Li et al., 2019] up to iterated logarithmic terms. Comment: 10 pages, accepted for presentation at AISTATS 202
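    For intuition, a generic LinUCB-style selection rule can be run over a changing continuous action set by maximizing the optimistic index over sampled candidate actions, as in the sketch below. The fixed confidence scaling and the candidate-sampling step are simplifications; this does not implement the paper's algorithm or attain its stated bound.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, lam, beta = 3, 2000, 1.0, 1.0
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)      # unknown parameter on the unit sphere

V = lam * np.eye(d)                 # ridge-regression design matrix
b = np.zeros(d)
regret = 0.0
for t in range(T):
    # a changing "infinite" action set (the unit sphere), approximated by sampled candidates
    cands = rng.normal(size=(500, d))
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    theta_hat = np.linalg.solve(V, b)
    V_inv = np.linalg.inv(V)
    # optimistic index: estimated reward plus an elliptical confidence width
    width = np.sqrt(np.einsum("ij,jk,ik->i", cands, V_inv, cands))
    x = cands[np.argmax(cands @ theta_hat + beta * width)]
    r = x @ theta + rng.normal(scale=0.1)
    V += np.outer(x, x)
    b += r * x
    regret += np.max(cands @ theta) - x @ theta
print("cumulative regret over", T, "rounds:", regret)
```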

    Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits

    We study the linear contextual bandit problem with finite action sets. When the problem dimension is $d$, the time horizon is $T$, and there are $n \leq 2^{d/2}$ candidate actions per time period, we (1) show that the minimax expected regret is $\Omega(\sqrt{dT (\log T)(\log n)})$ for every algorithm, and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose regret matches the lower bound up to iterated logarithmic factors. Our algorithmic result saves two $\sqrt{\log T}$ factors from previous analyses, and our information-theoretic lower bound also improves previous results by one $\sqrt{\log T}$ factor, revealing a regret scaling quite different from classical multi-armed bandits, in which no logarithmic $T$ term is present in the minimax regret. Our proof techniques include variable confidence levels and a careful analysis of layer sizes of SupLinUCB on the upper bound side, and delicately constructed adversarial sequences showing the tightness of elliptical potential lemmas on the lower bound side.
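    The layered exploit/explore/filter structure that SupLinUCB-style algorithms use is sketched below for a finite action set. It uses a single fixed confidence scaling rather than the variable confidence levels of the VCL algorithm, and the constants are illustrative assumptions, so it should be read as a structural sketch rather than the paper's method.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, T = 5, 20, 3000
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)
S = int(np.ceil(np.log2(T)))              # number of layers
alpha = 1.0                               # confidence scaling (the VCL variant varies this per layer)

V = [np.eye(d) for _ in range(S)]         # one ridge-regression state per layer
bvec = [np.zeros(d) for _ in range(S)]
regret = 0.0
for t in range(T):
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    active = list(range(n))
    chosen, layer = None, None
    for s in range(S):
        Vinv = np.linalg.inv(V[s])
        mu = X[active] @ (Vinv @ bvec[s])
        w = alpha * np.sqrt(np.einsum("ij,jk,ik->i", X[active], Vinv, X[active]))
        if np.all(w <= 1.0 / np.sqrt(T)):
            chosen = active[int(np.argmax(mu + w))]   # widths negligible: exploit, record nothing
            break
        if np.any(w > 2.0 ** (-s)):
            chosen = active[int(np.argmax(w))]        # width too large: explore, record in layer s
            layer = s
            break
        ucb = mu + w                                  # otherwise filter and descend a layer
        active = [a for a, keep in zip(active, ucb >= ucb.max() - 2.0 ** (1 - s)) if keep]
    if chosen is None:
        chosen = active[rng.integers(len(active))]    # fallback if every layer was passed
    x = X[chosen]
    r = x @ theta + rng.normal(scale=0.1)
    if layer is not None:
        V[layer] += np.outer(x, x)
        bvec[layer] += r * x
    regret += np.max(X @ theta) - x @ theta
print("cumulative regret:", regret)
```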

    Contextual bandits with surrogate losses: Margin bounds and efficient algorithms

    We use surrogate losses to obtain several new regret bounds and new algorithms for contextual bandit learning. Using the ramp loss, we derive new margin-based regret bounds in terms of standard sequential complexity measures of a benchmark class of real-valued regression functions. Using the hinge loss, we derive an efficient algorithm with a $\sqrt{dT}$-type mistake bound against benchmark policies induced by $d$-dimensional regressors. Under realizability assumptions, our results also yield classical regret bounds.
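    Purely to make the surrogate-loss idea concrete, the toy below combines a multiclass hinge loss with importance-weighted bandit feedback under epsilon-greedy exploration. It is not the paper's algorithm (which works with benchmark regressor classes and a different reduction); all names and constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d, K, T, eps, lr = 6, 4, 20000, 0.1, 0.01
W_true = rng.normal(size=(K, d))              # defines which arm is best for each context

W = np.zeros((K, d))                          # linear scorers, one per arm
correct = 0
for t in range(T):
    x = rng.normal(size=d)
    scores = W @ x
    greedy = int(np.argmax(scores))
    explore = rng.random() < eps
    a = int(rng.integers(K)) if explore else greedy
    p = eps / K + (1.0 - eps) * (a == greedy) # probability of the chosen arm
    r = float(a == int(np.argmax(W_true @ x)))  # bandit feedback: 1 if the chosen arm was best
    correct += r
    # importance-weighted hinge update: treat the chosen arm as a positive example
    # with weight r / p, and take a subgradient step if the margin is violated
    g = r / p
    if g > 0.0:
        rival = int(np.argmax(np.where(np.arange(K) == a, -np.inf, scores)))
        if scores[a] - scores[rival] < 1.0:
            W[a] += lr * g * x
            W[rival] -= lr * g * x
print("fraction of rounds the best arm was chosen:", correct / T)
```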

    Nonparametric Stochastic Contextual Bandits

    We analyze the $K$-armed bandit problem where the reward for each arm is a noisy realization based on an observed context, under mild nonparametric assumptions. We attain tight results for top-arm identification and a sublinear regret of $\widetilde{O}\big(T^{\frac{1+D}{2+D}}\big)$, where $D$ is the context dimension, for a modified UCB algorithm that is simple to implement ($k$NN-UCB). We then give global intrinsic dimension dependent and ambient dimension independent regret bounds. We also discuss recovering topological structures within the context space based on expected bandit performance and provide an extension to infinite-armed contextual bandits. Finally, we experimentally show the improvement of our algorithm over existing multi-armed bandit approaches for both simulated tasks and MNIST image classification. Comment: AAAI 201
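    The kNN-UCB idea is concrete enough to sketch: estimate each arm's reward from the k nearest observed contexts and add an optimism bonus. The bonus form and constants below are heuristic assumptions for illustration, not the paper's exact algorithm or analysis.

```python
import numpy as np

rng = np.random.default_rng(5)
D, K, T, k = 2, 3, 5000, 20

def mean_reward(arm, x):
    # smooth, arm-dependent reward functions of the context (toy environment)
    centers = np.array([[0.2, 0.2], [0.5, 0.8], [0.8, 0.3]])
    return np.exp(-4.0 * np.sum((x - centers[arm]) ** 2))

hist = [([], []) for _ in range(K)]        # (contexts, rewards) observed per arm
regret = 0.0
for t in range(1, T + 1):
    x = rng.random(D)
    ucbs = []
    for a in range(K):
        ctxs, rs = hist[a]
        if len(ctxs) < k:
            ucbs.append(np.inf)            # force initial exploration of each arm
            continue
        dists = np.linalg.norm(np.array(ctxs) - x, axis=1)
        idx = np.argsort(dists)[:k]
        mu = np.mean(np.array(rs)[idx])
        # heuristic bonus: estimation noise plus bias from the k-th neighbour distance
        ucbs.append(mu + np.sqrt(2.0 * np.log(t) / k) + dists[idx[-1]])
    a = int(np.argmax(ucbs))
    r = mean_reward(a, x) + rng.normal(scale=0.1)
    hist[a][0].append(x)
    hist[a][1].append(r)
    regret += max(mean_reward(b, x) for b in range(K)) - mean_reward(a, x)
print("cumulative regret:", regret)
```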

    Provably Optimal Algorithms for Generalized Linear Contextual Bandits

    Contextual bandits are widely used in Internet services, from news recommendation to advertising to Web search. Generalized linear models (logistic regression in particular) have demonstrated stronger performance than linear models in many applications where rewards are binary. However, most theoretical analyses of contextual bandits so far are on linear bandits. In this work, we propose an upper confidence bound based algorithm for generalized linear contextual bandits, which achieves $\tilde{O}(\sqrt{dT})$ regret over $T$ rounds with $d$-dimensional feature vectors. This regret matches the minimax lower bound, up to logarithmic terms, and improves on the best previous result by a $\sqrt{d}$ factor, assuming the number of arms is fixed. A key component in our analysis is to establish a new, sharp finite-sample confidence bound for maximum-likelihood estimates in generalized linear models, which may be of independent interest. We also analyze a simpler upper confidence bound algorithm, which is useful in practice, and prove it to have optimal regret for certain cases. Comment: Published at ICML 201
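    One natural upper-confidence rule for a logistic reward model, in the spirit of this abstract, fits a (regularised) maximum-likelihood estimate and adds an elliptical confidence width. The sketch below uses a short forced-exploration phase, ridge regularisation, and a fixed width constant, all of which are simplifying assumptions rather than the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(6)
d, K, T, alpha = 4, 10, 3000, 1.0
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)                     # unknown parameter

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, iters=10, lam=1.0):
    # Newton's method on a ridge-regularised logistic log-likelihood
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + lam * w
        H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(X.shape[1])
        w -= np.linalg.solve(H, grad)
    return w

X_hist, y_hist = [], []
V = np.eye(d)
total = 0.0
for t in range(T):
    arms = rng.normal(size=(K, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)
    if t < 50:                                     # short forced-exploration phase
        a = t % K
    else:
        w = fit_logistic(np.array(X_hist), np.array(y_hist))
        Vinv = np.linalg.inv(V)
        width = np.sqrt(np.einsum("ij,jk,ik->i", arms, Vinv, arms))
        a = int(np.argmax(arms @ w + alpha * width))   # optimistic index
    x = arms[a]
    y = float(rng.random() < sigmoid(x @ theta))       # binary (Bernoulli) reward
    total += y
    X_hist.append(x)
    y_hist.append(y)
    V += np.outer(x, x)
print("average reward:", total / T)
```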

    Gaussian Process bandits with adaptive discretization

    In this paper, the problem of maximizing a black-box function $f:\mathcal{X} \to \mathbb{R}$ is studied in the Bayesian framework with a Gaussian Process (GP) prior. In particular, a new algorithm for this problem is proposed, and high probability bounds on its simple and cumulative regret are established. The query point selection rule in most existing methods involves an exhaustive search over an increasingly fine sequence of uniform discretizations of $\mathcal{X}$. The proposed algorithm, in contrast, adaptively refines $\mathcal{X}$, which leads to a lower computational complexity, particularly when $\mathcal{X}$ is a subset of a high dimensional Euclidean space. In addition to the computational gains, sufficient conditions are identified under which the regret bounds of the new algorithm improve upon the known results. Finally, an extension of the algorithm to the case of contextual bandits is proposed, and high probability bounds on the contextual regret are presented. Comment: 34 pages, 2 figure
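    The contrast between uniform and adaptive discretization can be seen in a toy 1-D GP-UCB loop that only refines the candidate grid near the current UCB maximiser. The kernel, the crude local-refinement rule, and all constants below are assumptions for illustration; this is not the paper's algorithm or its tree-based discretization.

```python
import numpy as np

rng = np.random.default_rng(7)
noise, ell, T = 0.05, 0.15, 40

def f(x):                                    # unknown black-box function on [0, 1]
    return np.sin(6.0 * x) + 0.5 * np.cos(11.0 * x)

def kern(a, b):                              # RBF kernel with unit prior variance
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_posterior(Xq, X, y):
    K = kern(X, X) + noise ** 2 * np.eye(len(X))
    Ks = kern(Xq, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mu, np.maximum(var, 1e-12)

cands = np.linspace(0.0, 1.0, 5)             # coarse initial discretization
X = np.array([0.5])
y = np.array([f(0.5) + noise * rng.normal()])
for t in range(T):
    mu, var = gp_posterior(cands, X, y)
    ucb = mu + 2.0 * np.sqrt(var)
    x_next = cands[np.argmax(ucb)]
    X = np.append(X, x_next)
    y = np.append(y, f(x_next) + noise * rng.normal())
    # adaptive refinement: add candidates only near the current UCB maximiser,
    # instead of uniformly refining the whole domain
    step = 0.5 ** (t // 5 + 2)
    new = np.clip([x_next - step, x_next + step], 0.0, 1.0)
    cands = np.unique(np.concatenate([cands, new]))
best = X[np.argmax(y)]
print("best query found:", best, "value:", f(best))
```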

    Linear Contextual Bandits with Knapsacks

    We consider the linear contextual bandit problem with resource consumption, in addition to reward generation. In each round, the outcome of pulling an arm is a reward as well as a vector of resource consumptions. The expected values of these outcomes depend linearly on the context of that arm. The budget/capacity constraints require that the total consumption doesn't exceed the budget for each resource. The objective is once again to maximize the total reward. This problem turns out to be a common generalization of classic linear contextual bandits (linContextual), bandits with knapsacks (BwK), and the online stochastic packing problem (OSPP). We present algorithms with near-optimal regret bounds for this problem. Our bounds compare favorably to results on the unstructured version of the problem, where the relation between the contexts and the outcomes could be arbitrary, but the algorithm only competes against a fixed set of policies accessible through an optimization oracle. We combine techniques from the work on linContextual, BwK, and OSPP in a nontrivial manner while also tackling new difficulties that are not present in any of these special cases.
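    A simplified single-resource sketch of this combination is shown below: optimistic linear estimates of reward and consumption, together with a multiplicatively updated dual price on the resource, in the general spirit of BwK-style algorithms. It is not the paper's algorithm; the constants, the stopping rule, and the "do nothing" option are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
d, K, T, B = 4, 8, 5000, 1000.0        # B = total budget of a single resource
th_r = rng.uniform(0.1, 0.9, size=d)   # unknown reward parameter
th_c = rng.uniform(0.1, 0.9, size=d)   # unknown consumption parameter

Vr, br = np.eye(d), np.zeros(d)
Vc, bc = np.eye(d), np.zeros(d)
Z, eta, beta = 1.0, 0.05, 0.5          # dual price on the resource, its step size, confidence width
budget, total_reward = B, 0.0
for t in range(T):
    if budget <= 1.0:
        break                          # stop once a pull might exceed the budget
    X = rng.uniform(size=(K, d)) / d   # contexts scaled so rewards and costs lie roughly in [0, 1]
    wr, wc = np.linalg.solve(Vr, br), np.linalg.solve(Vc, bc)
    Vinv_r, Vinv_c = np.linalg.inv(Vr), np.linalg.inv(Vc)
    width_r = beta * np.sqrt(np.einsum("ij,jk,ik->i", X, Vinv_r, X))
    width_c = beta * np.sqrt(np.einsum("ij,jk,ik->i", X, Vinv_c, X))
    # optimistic index: upper bound on reward minus price times lower bound on cost
    idx = (X @ wr + width_r) - Z * np.maximum(X @ wc - width_c, 0.0)
    a = int(np.argmax(idx))
    r_obs, c_obs = 0.0, 0.0
    if idx[a] > 0.0:                   # otherwise take the "do nothing" option this round
        x = X[a]
        r_obs = float(np.clip(x @ th_r + 0.05 * rng.normal(), 0.0, 1.0))
        c_obs = float(np.clip(x @ th_c + 0.05 * rng.normal(), 0.0, 1.0))
        Vr += np.outer(x, x); br += r_obs * x
        Vc += np.outer(x, x); bc += c_obs * x
    total_reward += r_obs
    budget -= c_obs
    Z *= np.exp(eta * (c_obs - B / T)) # raise the price when spending faster than the per-round pace B/T
print("total reward:", total_reward, "budget remaining:", budget)
```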

    contextual: Evaluating Contextual Multi-Armed Bandit Problems in R

    Over the past decade, contextual bandit algorithms have been gaining in popularity due to their effectiveness and flexibility in solving sequential decision problems---from online advertising and finance to clinical trial design and personalized medicine. At the same time, there are, as of yet, surprisingly few options that enable researchers and practitioners to simulate and compare the wealth of new and existing bandit algorithms in a standardized way. To help close this gap between analytical research and empirical evaluation, the current paper introduces the object-oriented R package "contextual": a user-friendly and, through its object-oriented structure, easily extensible framework that facilitates parallelized comparison of contextual and context-free bandit policies through both simulation and offline analysis. Comment: 55 pages, 12 figure
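    The package itself is written in R; purely to illustrate the kind of policy-versus-policy simulation comparison it is built for, here is a small Python sketch with hypothetical class names. This is not the package's API, just the general bandit/policy/simulator pattern the abstract describes.

```python
import numpy as np

class ToyContextualBandit:
    # toy environment with linear expected rewards (hypothetical; not the package's API)
    def __init__(self, d=5, k=4, seed=0):
        self.rng = np.random.default_rng(seed)
        self.theta = self.rng.normal(size=(k, d))
        self.d, self.k = d, k
    def context(self):
        return self.rng.normal(size=self.d)
    def reward(self, x, a):
        return float(self.theta[a] @ x + self.rng.normal(scale=0.1))

class ContextFreeEpsGreedy:
    # context-free policy: epsilon-greedy on per-arm mean rewards
    def __init__(self, k, d, eps=0.1, seed=1):
        self.rng = np.random.default_rng(seed)
        self.eps, self.sums, self.counts = eps, np.zeros(k), np.ones(k)
    def act(self, x):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.sums)))
        return int(np.argmax(self.sums / self.counts))
    def update(self, x, a, r):
        self.sums[a] += r
        self.counts[a] += 1

class LinUCBPolicy:
    # contextual policy: per-arm ridge regression with an optimism bonus
    def __init__(self, k, d, alpha=1.0):
        self.alpha = alpha
        self.V = [np.eye(d) for _ in range(k)]
        self.b = [np.zeros(d) for _ in range(k)]
    def act(self, x):
        scores = []
        for V, b in zip(self.V, self.b):
            Vinv = np.linalg.inv(V)
            scores.append(x @ (Vinv @ b) + self.alpha * np.sqrt(x @ Vinv @ x))
        return int(np.argmax(scores))
    def update(self, x, a, r):
        self.V[a] += np.outer(x, x)
        self.b[a] += r * x

def simulate(policy_factory, horizon=2000):
    env = ToyContextualBandit()
    policy = policy_factory(env.k, env.d)
    total = 0.0
    for _ in range(horizon):
        x = env.context()
        a = policy.act(x)
        r = env.reward(x, a)
        policy.update(x, a, r)
        total += r
    return total

for name, factory in [("context-free eps-greedy", ContextFreeEpsGreedy), ("per-arm LinUCB", LinUCBPolicy)]:
    print(name, simulate(factory))
```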