
    Sparse Stochastic Bandits

    In the classical multi-armed bandit problem, $d$ arms are available to the decision maker, who pulls them sequentially in order to maximize their cumulative reward. Guarantees can be obtained on a relative quantity called regret, which scales linearly with $d$ (or with $\sqrt{d}$ in the minimax sense). Here we consider the sparse variant of this classical problem, in which only a small number of arms, namely $s < d$, have a positive expected reward. We leverage this additional assumption to provide an algorithm whose regret scales with $s$ instead of $d$. Moreover, we prove that this algorithm is optimal by providing a matching lower bound, at least for a wide and pertinent range of parameters that we determine, and by evaluating its performance on simulated data.
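
    The abstract does not reproduce the paper's algorithm, so the following is only a minimal sketch of the sparsity idea under my own assumptions: explore all $d$ arms uniformly for a short screening phase, discard arms whose empirical mean does not clear a confidence margin above zero, then run standard UCB1 on the (hopefully $s$-sized) surviving set. The screening length and threshold here are ad-hoc choices, not the paper's.

```python
import numpy as np

def sparse_bandit(means, horizon, screen_rounds=50, seed=0):
    """Two-phase heuristic: screen out non-positive arms, then UCB1 on the rest.

    Illustrative sketch only; `screen_rounds` and the zero threshold are
    ad-hoc choices, not the paper's algorithm.
    """
    rng = np.random.default_rng(seed)
    d = len(means)
    pulls = np.zeros(d)
    sums = np.zeros(d)
    total = 0.0

    # Phase 1: pull every arm the same number of times to estimate its mean.
    for t in range(screen_rounds * d):
        a = t % d
        r = means[a] + rng.normal()                  # Gaussian reward noise
        pulls[a] += 1; sums[a] += r; total += r

    # Keep arms whose empirical mean clears a confidence margin above zero.
    margin = np.sqrt(2 * np.log(horizon) / pulls)
    active = np.where(sums / pulls - margin > 0)[0]
    if active.size == 0:                             # nothing survived: keep all
        active = np.arange(d)

    # Phase 2: standard UCB1 restricted to the surviving arms.
    for t in range(screen_rounds * d, horizon):
        ucb = (sums[active] / pulls[active]
               + np.sqrt(2 * np.log(t + 1) / pulls[active]))
        a = active[np.argmax(ucb)]
        r = means[a] + rng.normal()
        pulls[a] += 1; sums[a] += r; total += r
    return total
```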

    Bandit Theory meets Compressed Sensing for high dimensional Stochastic Linear Bandit

    We consider a linear stochastic bandit problem where the dimension $K$ of the unknown parameter $\theta$ is larger than the sampling budget $n$. In such cases, it is in general impossible to derive sub-linear regret bounds, since usual linear bandit algorithms have a regret in $O(K\sqrt{n})$. In this paper we assume that $\theta$ is $S$-sparse, i.e. has at most $S$ non-zero components, and that the space of arms is the unit ball for the $\|\cdot\|_2$ norm. We combine ideas from Compressed Sensing and Bandit Theory and derive algorithms with regret bounds in $O(S\sqrt{n})$.
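
    As a rough illustration of combining sparse recovery with bandit exploitation, here is an explore-then-commit sketch under my own assumptions, not the paper's algorithm: sample random directions on the unit ball, recover an $S$-sparse estimate of $\theta$ by hard-thresholding a regularized least-squares fit, then commit to the unit vector aligned with the estimate.

```python
import numpy as np

def cs_linear_bandit(theta, n, S, explore=None, seed=0):
    """Explore with random unit vectors, hard-threshold to an S-sparse
    estimate of theta, then commit to the estimated best direction.

    Illustrative sketch only; the phase split and the thresholding recovery
    step are my assumptions, not the paper's procedure.
    """
    rng = np.random.default_rng(seed)
    K = theta.size
    explore = min(explore or int(S * np.sqrt(n)), n)

    # Phase 1: random measurements on the unit sphere, noisy linear rewards.
    X = rng.normal(size=(explore, K))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = X @ theta + rng.normal(size=explore)

    # Sparse recovery: ridge-regularised least squares, keep the top-S entries.
    theta_hat = np.linalg.solve(X.T @ X + 0.1 * np.eye(K), X.T @ y)
    keep = np.argsort(np.abs(theta_hat))[-S:]
    sparse_hat = np.zeros(K)
    sparse_hat[keep] = theta_hat[keep]

    # Phase 2: commit to the unit-ball arm aligned with the sparse estimate.
    # The commit term below is the *expected* reward of that arm, used here
    # as a simple proxy for the realised cumulative reward.
    best_arm = sparse_hat / (np.linalg.norm(sparse_hat) + 1e-12)
    return float(y.sum()) + (n - explore) * float(best_arm @ theta)
```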

    Linear Bandits with Feature Feedback

    This paper explores a new form of the linear bandit problem in which the algorithm receives the usual stochastic rewards as well as stochastic feedback about which features are relevant to the rewards, the latter feedback being the novel aspect. The focus of this paper is the development of new theory and algorithms for linear bandits with feature feedback. We show that linear bandits with feature feedback can achieve regret over time horizon $T$ that scales like $k\sqrt{T}$, without prior knowledge of which features are relevant nor of the number $k$ of relevant features. In comparison, the regret of traditional linear bandits is $d\sqrt{T}$, where $d$ is the total number of (relevant and irrelevant) features, so the improvement can be dramatic if $k \ll d$. The computational complexity of the new algorithm is proportional to $k$ rather than $d$, making it much more suitable for real-world applications than traditional linear bandits. We demonstrate the performance of the new algorithm with synthetic and real human-labeled data.
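
    A minimal sketch of how feature feedback might be folded into a LinUCB-style learner (my own construction; the paper's algorithm and feedback model may differ): keep the set of features revealed as relevant so far, and compute confidence-bound scores only on those coordinates, so that irrelevant features never influence arm selection.

```python
import numpy as np

class FeatureFeedbackLinUCB:
    """LinUCB-style learner restricted to features flagged relevant so far.

    Illustrative sketch of the feature-feedback idea; not the paper's
    algorithm, whose feedback model and confidence bounds may differ.
    """
    def __init__(self, d, alpha=1.0):
        self.d, self.alpha = d, alpha
        self.relevant = set()        # feature indices revealed as relevant
        self.A = np.eye(d)           # regularised Gram matrix
        self.b = np.zeros(d)

    def choose(self, arms):
        """Pick the arm with the highest upper confidence bound."""
        mask = np.zeros(self.d)
        if self.relevant:            # before any feedback, all scores tie at 0
            mask[list(self.relevant)] = 1.0
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        scores = [(x * mask) @ theta
                  + self.alpha * np.sqrt((x * mask) @ A_inv @ (x * mask))
                  for x in arms]
        return int(np.argmax(scores))

    def update(self, x, reward, revealed=()):
        self.relevant.update(revealed)   # stochastic feature feedback
        self.A += np.outer(x, x)
        self.b += reward * x
```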

    Misspecified Linear Bandits

    We consider the problem of online learning in misspecified linear stochastic multi-armed bandit problems. Regret guarantees for state-of-the-art linear bandit algorithms such as Optimism in the Face of Uncertainty Linear bandit (OFUL) hold under the assumption that the arms' expected rewards are perfectly linear in their features. It is, however, of interest to investigate the impact of potential misspecification in linear bandit models, where the expected rewards are perturbed away from the linear subspace determined by the arms' features. Although OFUL has recently been shown to be robust to relatively small deviations from linearity, we show that any linear bandit algorithm that enjoys optimal regret performance in the perfectly linear setting (e.g., OFUL) must suffer linear regret under a sparse additive perturbation of the linear model. In an attempt to overcome this negative result, we define a natural class of bandit models characterized by a non-sparse deviation from linearity. We argue that the OFUL algorithm can fail to achieve sublinear regret even under models that have non-sparse deviation. We finally develop a novel bandit algorithm, comprising a hypothesis test for linearity followed by a decision to use either the OFUL or the Upper Confidence Bound (UCB) algorithm. For perfectly linear bandit models, the algorithm provably exhibits OFUL's favorable regret performance, while for misspecified models satisfying the non-sparse deviation property, the algorithm avoids the linear regret phenomenon and falls back on UCB's sublinear regret scaling. Numerical experiments on synthetic data, and on recommendation data from the public Yahoo! Learning to Rank Challenge dataset, empirically support our findings.
    Comment: Thirty-First AAAI Conference on Artificial Intelligence, 2017
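
    The two-stage structure described above can be caricatured as follows (a sketch with an ad-hoc residual test of my own, not the paper's hypothesis test): fit a weighted least-squares model of empirical arm means on arm features, and switch to plain UCB when the worst-case residual suggests the linear model is misspecified.

```python
import numpy as np

def choose_base_algorithm(features, emp_means, counts, tol=0.1):
    """Crude linearity check in the spirit of the two-stage scheme above.

    Fits weighted least squares of empirical arm means on arm features and
    inspects the worst residual. `tol` and this test statistic are my own
    ad-hoc stand-ins, not the paper's hypothesis test.
    """
    # Weight arms by sqrt(pull count): frequently pulled arms are trusted more.
    W = np.diag(np.sqrt(counts))
    theta, *_ = np.linalg.lstsq(W @ features, W @ emp_means, rcond=None)
    worst_residual = np.max(np.abs(features @ theta - emp_means))
    return "OFUL" if worst_residual < tol else "UCB"
```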

    Hierarchical Exploration for Accelerating Contextual Bandits

    Contextual bandit learning is an increasingly popular approach to optimizing recommender systems via user feedback, but it can be slow to converge in practice due to the need to explore a large feature space. In this paper, we propose a coarse-to-fine hierarchical approach for encoding prior knowledge that drastically reduces the amount of exploration required. Intuitively, user preferences can be reasonably embedded in a coarse low-dimensional feature space that can be explored efficiently, with exploration in the high-dimensional space required only as necessary. We introduce a bandit algorithm that explores within this coarse-to-fine spectrum, and prove performance guarantees that depend on how well the coarse space captures the user's preferences. We demonstrate substantial improvement over conventional bandit algorithms through extensive simulation as well as a live user study in the setting of personalized news recommendation.
    Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
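
    A minimal sketch of the coarse half of this idea, assuming a fixed projection matrix U encoding the prior knowledge (my own simplification; the paper's algorithm also escalates to the full feature space, which is omitted here): run LinUCB entirely in the coarse subspace, which is cheap to explore because the coarse dimension is much smaller than the ambient one.

```python
import numpy as np

def coarse_linucb_scores(arms, U, A_coarse, b_coarse, alpha=1.0):
    """Score arms with LinUCB run entirely in a coarse subspace.

    `U` (d x k) is an assumed, fixed projection encoding prior knowledge
    about user preferences. Sketch of the coarse half of the coarse-to-fine
    scheme; the escalation to the full feature space is omitted.
    """
    A_inv = np.linalg.inv(A_coarse)           # k x k inverse, cheap for small k
    theta = A_inv @ b_coarse                  # coarse-space parameter estimate
    Z = arms @ U                              # project each arm to k dimensions
    bonus = alpha * np.sqrt(np.einsum('ij,jk,ik->i', Z, A_inv, Z))
    return Z @ theta + bonus                  # mean estimate + exploration bonus
```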