
    Context-lumpable stochastic bandits

    We consider a contextual bandit problem with $S$ contexts and $A$ actions. In each round $t = 1, 2, \dots$ the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into $r \le \min\{S, A\}$ groups such that the mean reward for the various actions is the same for any two contexts in the same group, we give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde{O}(r(S+A)/\epsilon^2)$ samples with high probability, and provide a matching $\widetilde{\Omega}(r(S+A)/\epsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde{O}(\sqrt{r^3(S+A)T})$. To the best of our knowledge, we are the first to show near-optimal sample complexity in the PAC setting and $\widetilde{O}(\sqrt{\mathrm{poly}(r)(S+A)T})$ minimax regret in the online setting for this problem. We also show that our algorithms can be applied to more general low-rank bandits and obtain improved regret bounds in some scenarios.
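    As a rough illustration of the setting (not the paper's algorithm), the sketch below simulates a context-lumpable bandit: $S$ contexts partitioned into $r$ hidden groups, with the mean reward of an action depending only on the group. The sizes, the uniform-exploration baseline, and all names are hypothetical.

```python
import numpy as np

# Minimal sketch of the context-lumpable bandit setting: S contexts fall into
# r groups, and an action's mean reward depends only on the context's group.
rng = np.random.default_rng(0)
S, A, r = 12, 5, 3                      # contexts, actions, groups (r <= min(S, A))
group_of = rng.integers(r, size=S)      # hidden lumping of contexts into groups
mu = rng.uniform(size=(r, A))           # mean reward per (group, action)

def pull(context: int, action: int) -> float:
    """Bernoulli reward whose mean depends only on the context's group."""
    return float(rng.random() < mu[group_of[context], action])

# Naive baseline: estimate each (context, action) mean separately, ignoring
# the lumpable structure -- this needs on the order of S*A/eps^2 samples,
# whereas the paper's algorithm exploits the grouping to get ~ r*(S+A)/eps^2.
counts = np.zeros((S, A))
sums = np.zeros((S, A))
for t in range(20000):
    c = rng.integers(S)                 # a random context arrives
    a = rng.integers(A)                 # uniform exploration for the sketch
    sums[c, a] += pull(c, a)
    counts[c, a] += 1

est = sums / np.maximum(counts, 1)
greedy_policy = est.argmax(axis=1)      # near-optimal once estimates converge
print(greedy_policy)
```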

    Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems

    In the stochastic contextual low-rank matrix bandit problem, the expected reward of an action is given by the inner product between the action's feature matrix and some fixed but initially unknown $d_1 \times d_2$ matrix $\Theta^*$ with rank $r \ll \min\{d_1, d_2\}$, and an agent sequentially takes actions based on past experience to maximize the cumulative reward. In this paper, we study the generalized low-rank matrix bandit problem, recently proposed in \cite{lu2021low} under the Generalized Linear Model (GLM) framework. To overcome the computational infeasibility and theoretical restrictions of existing algorithms for this problem, we first propose the G-ESTT framework, which modifies the idea of \cite{jun2019bilinear} by using Stein's method for the subspace estimation and then leverages the estimated subspaces via a regularization idea. Furthermore, we markedly improve the efficiency of G-ESTT by instead using a novel exclusion idea on the estimated subspace, and propose the G-ESTS framework. We also show that, under mild assumptions and up to logarithmic terms, G-ESTT achieves a regret bound of $\tilde{O}(\sqrt{(d_1+d_2)MrT})$ while G-ESTS achieves a regret bound of $\tilde{O}(\sqrt{(d_1+d_2)^{3/2}Mr^{3/2}T})$, where $M$ is some problem-dependent value. Under a reasonable assumption that $M = O((d_1+d_2)^2)$ in our problem setting, the regret of G-ESTT is consistent with the current best regret of $\tilde{O}((d_1+d_2)^{3/2}\sqrt{rT}/D_{rr})$ \citep{lu2021low} ($D_{rr}$ will be defined later). For completeness, we conduct experiments to illustrate that our proposed algorithms, especially G-ESTS, are computationally tractable and consistently outperform other state-of-the-art (generalized) linear matrix bandit methods on a suite of simulations.
    Comment: Revision of the paper accepted by NeurIPS 2022
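    The following sketch illustrates the GLM low-rank reward model and a crude stand-in for the subspace-estimation stage: it averages reward-weighted feature matrices over Gaussian arms (a first-order Stein-type identity makes this average roughly proportional to $\Theta^*$) and reads off the top-$r$ singular subspaces. It is only a sketch under these Gaussian-design assumptions, not the paper's G-ESTT procedure; all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, r = 8, 6, 2                      # ambient dimensions and rank (hypothetical)

# Rank-r hidden parameter Theta* = U diag(s) V^T, as in the low-rank model.
U = np.linalg.qr(rng.normal(size=(d1, r)))[0]
V = np.linalg.qr(rng.normal(size=(d2, r)))[0]
Theta = U @ np.diag(rng.uniform(1.0, 2.0, size=r)) @ V.T

def expected_reward(X, link=lambda z: 1 / (1 + np.exp(-z))):
    """GLM reward model: mean reward is link(<X, Theta*>) (logistic link here)."""
    return float(link(np.sum(X * Theta)))

# Average reward-weighted Gaussian feature matrices; by a Stein-type identity
# this estimate is approximately proportional to Theta*, so its leading
# singular subspaces approximate those of Theta*.
n = 5000
M_hat = np.zeros((d1, d2))
for _ in range(n):
    X = rng.normal(size=(d1, d2)) / np.sqrt(d1 * d2)
    y = expected_reward(X) + 0.1 * rng.normal()   # noisy observed reward
    M_hat += y * X / n

U_hat, _, Vt_hat = np.linalg.svd(M_hat)
# The top-r subspaces would then feed a regularized regression stage.
print("estimated column subspace shape:", U_hat[:, :r].shape)
```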

    A Simple Unified Framework for High Dimensional Bandit Problems

    Stochastic high-dimensional bandit problems with low-dimensional structure are useful in applications such as online advertising and drug discovery. In this work, we propose a simple unified algorithm for such problems and present a general analysis framework for the regret upper bound of our algorithm. We show that, under some mild unified assumptions, our algorithm can be applied to different high-dimensional bandit problems. Our framework uses the low-dimensional structure to guide the parameter estimation, so our algorithm achieves the best regret bounds in the LASSO bandit, as well as novel bounds in the low-rank matrix bandit, the group-sparse matrix bandit, and in a new problem: the multi-agent LASSO bandit.
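    As a toy illustration of how low-dimensional structure guides estimation in the LASSO bandit instance (a simple explore-then-commit sketch, not the paper's unified algorithm; dimensions, the regularization weight, and all names are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
d, s, K, n_explore = 100, 5, 10, 400     # ambient dim, sparsity, arms, exploration rounds

theta = np.zeros(d)                      # s-sparse hidden parameter of the LASSO bandit
theta[rng.choice(d, size=s, replace=False)] = rng.uniform(0.5, 1.0, size=s)

# Exploration phase: pull uniformly at random, record (feature, reward) pairs.
X_hist, y_hist = [], []
for _ in range(n_explore):
    arms = rng.normal(size=(K, d)) / np.sqrt(d)   # fresh random arm features
    a = rng.integers(K)
    X_hist.append(arms[a])
    y_hist.append(arms[a] @ theta + 0.1 * rng.normal())

# The L1 penalty exploits sparsity, recovering theta from far fewer samples
# than the ambient dimension d would otherwise require.
theta_hat = Lasso(alpha=0.02).fit(np.array(X_hist), np.array(y_hist)).coef_

# Commit phase: act greedily with the sparse estimate.
arms = rng.normal(size=(K, d)) / np.sqrt(d)
print("greedy arm:", int(np.argmax(arms @ theta_hat)))
```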