
    Sparse Stochastic Bandits

    In the classical multi-armed bandit problem, $d$ arms are available to the decision maker, who pulls them sequentially in order to maximize their cumulative reward. Guarantees can be obtained on a relative quantity called regret, which scales linearly with $d$ (or with $\sqrt{d}$ in the minimax sense). Here we consider the sparse variant of this classical problem, in which only a small number of arms, namely $s < d$, have a positive expected reward. We leverage this additional assumption to provide an algorithm whose regret scales with $s$ instead of $d$. Moreover, we prove that this algorithm is optimal by providing a matching lower bound, at least for a wide and pertinent range of parameters that we determine, and by evaluating its performance on simulated data.
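
    The abstract does not reproduce the paper's algorithm, so the following is only a minimal sketch of the sparsity idea under my own assumptions: explore all $d$ arms uniformly for a short screening phase, discard arms whose empirical mean does not clear a confidence margin above zero, then run standard UCB1 on the (hopefully $s$-sized) surviving set. The screening length and threshold here are ad-hoc choices, not the paper's.

```python
import numpy as np

def sparse_bandit(means, horizon, screen_rounds=50, seed=0):
    """Two-phase heuristic: screen out non-positive arms, then UCB1 on the rest.

    Illustrative sketch only; `screen_rounds` and the zero threshold are
    ad-hoc choices, not the paper's algorithm.
    """
    rng = np.random.default_rng(seed)
    d = len(means)
    pulls = np.zeros(d)
    sums = np.zeros(d)
    total = 0.0

    # Phase 1: pull every arm the same number of times to estimate its mean.
    for t in range(screen_rounds * d):
        a = t % d
        r = means[a] + rng.normal()                  # Gaussian reward noise
        pulls[a] += 1; sums[a] += r; total += r

    # Keep arms whose empirical mean clears a confidence margin above zero.
    margin = np.sqrt(2 * np.log(horizon) / pulls)
    active = np.where(sums / pulls - margin > 0)[0]
    if active.size == 0:                             # nothing survived: keep all
        active = np.arange(d)

    # Phase 2: standard UCB1 restricted to the surviving arms.
    for t in range(screen_rounds * d, horizon):
        ucb = (sums[active] / pulls[active]
               + np.sqrt(2 * np.log(t + 1) / pulls[active]))
        a = active[np.argmax(ucb)]
        r = means[a] + rng.normal()
        pulls[a] += 1; sums[a] += r; total += r
    return total
```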

    Bandit Theory meets Compressed Sensing for high dimensional Stochastic Linear Bandit

    We consider a linear stochastic bandit problem where the dimension $K$ of the unknown parameter $\theta$ is larger than the sampling budget $n$. In such cases, it is in general impossible to derive sub-linear regret bounds, since usual linear bandit algorithms have a regret in $O(K\sqrt{n})$. In this paper we assume that $\theta$ is $S$-sparse, i.e. has at most $S$ non-zero components, and that the space of arms is the unit ball for the $\|\cdot\|_2$ norm. We combine ideas from Compressed Sensing and Bandit Theory and derive algorithms with regret bounds in $O(S\sqrt{n})$.
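
    As a rough illustration of combining sparse recovery with bandit exploitation, here is an explore-then-commit sketch under my own assumptions, not the paper's algorithm: sample random directions on the unit ball, recover an $S$-sparse estimate of $\theta$ by hard-thresholding a regularized least-squares fit, then commit to the unit vector aligned with the estimate.

```python
import numpy as np

def cs_linear_bandit(theta, n, S, explore=None, seed=0):
    """Explore with random unit vectors, hard-threshold to an S-sparse
    estimate of theta, then commit to the estimated best direction.

    Illustrative sketch only; the phase split and the thresholding recovery
    step are my assumptions, not the paper's procedure.
    """
    rng = np.random.default_rng(seed)
    K = theta.size
    explore = min(explore or int(S * np.sqrt(n)), n)

    # Phase 1: random measurements on the unit sphere, noisy linear rewards.
    X = rng.normal(size=(explore, K))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = X @ theta + rng.normal(size=explore)

    # Sparse recovery: ridge-regularised least squares, keep the top-S entries.
    theta_hat = np.linalg.solve(X.T @ X + 0.1 * np.eye(K), X.T @ y)
    keep = np.argsort(np.abs(theta_hat))[-S:]
    sparse_hat = np.zeros(K)
    sparse_hat[keep] = theta_hat[keep]

    # Phase 2: commit to the unit-ball arm aligned with the sparse estimate.
    # The commit term below is the *expected* reward of that arm, used here
    # as a simple proxy for the realised cumulative reward.
    best_arm = sparse_hat / (np.linalg.norm(sparse_hat) + 1e-12)
    return float(y.sum()) + (n - explore) * float(best_arm @ theta)
```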

    Linear Bandits with Feature Feedback

    This paper explores a new form of the linear bandit problem in which the algorithm receives the usual stochastic rewards as well as stochastic feedback about which features are relevant to the rewards, the latter feedback being the novel aspect. The focus of this paper is the development of new theory and algorithms for linear bandits with feature feedback. We show that linear bandits with feature feedback can achieve regret over time horizon $T$ that scales like $k\sqrt{T}$, without prior knowledge of which features are relevant nor of the number $k$ of relevant features. In comparison, the regret of traditional linear bandits is $d\sqrt{T}$, where $d$ is the total number of (relevant and irrelevant) features, so the improvement can be dramatic if $k \ll d$. The computational complexity of the new algorithm is proportional to $k$ rather than $d$, making it much more suitable for real-world applications than traditional linear bandits. We demonstrate the performance of the new algorithm with synthetic and real human-labeled data.
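
    A minimal sketch of how feature feedback might be folded into a LinUCB-style learner (my own construction; the paper's algorithm and feedback model may differ): keep the set of features revealed as relevant so far, and compute confidence-bound scores only on those coordinates, so that irrelevant features never influence arm selection.

```python
import numpy as np

class FeatureFeedbackLinUCB:
    """LinUCB-style learner restricted to features flagged relevant so far.

    Illustrative sketch of the feature-feedback idea; not the paper's
    algorithm, whose feedback model and confidence bounds may differ.
    """
    def __init__(self, d, alpha=1.0):
        self.d, self.alpha = d, alpha
        self.relevant = set()        # feature indices revealed as relevant
        self.A = np.eye(d)           # regularised Gram matrix
        self.b = np.zeros(d)

    def choose(self, arms):
        """Pick the arm with the highest upper confidence bound."""
        mask = np.zeros(self.d)
        if self.relevant:            # before any feedback, all scores tie at 0
            mask[list(self.relevant)] = 1.0
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        scores = [(x * mask) @ theta
                  + self.alpha * np.sqrt((x * mask) @ A_inv @ (x * mask))
                  for x in arms]
        return int(np.argmax(scores))

    def update(self, x, reward, revealed=()):
        self.relevant.update(revealed)   # stochastic feature feedback
        self.A += np.outer(x, x)
        self.b += reward * x
```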

    Misspecified Linear Bandits

    We consider the problem of online learning in misspecified linear stochastic multi-armed bandit problems. Regret guarantees for state-of-the-art linear bandit algorithms such as Optimism in the Face of Uncertainty Linear bandit (OFUL) hold under the assumption that the arms' expected rewards are perfectly linear in their features. It is, however, of interest to investigate the impact of potential misspecification in linear bandit models, where the expected rewards are perturbed away from the linear subspace determined by the arms' features. Although OFUL has recently been shown to be robust to relatively small deviations from linearity, we show that any linear bandit algorithm that enjoys optimal regret performance in the perfectly linear setting (e.g., OFUL) must suffer linear regret under a sparse additive perturbation of the linear model. In an attempt to overcome this negative result, we define a natural class of bandit models characterized by a non-sparse deviation from linearity. We argue that the OFUL algorithm can fail to achieve sublinear regret even under models that have non-sparse deviation. We finally develop a novel bandit algorithm, comprising a hypothesis test for linearity followed by a decision to use either the OFUL or the Upper Confidence Bound (UCB) algorithm. For perfectly linear bandit models, the algorithm provably exhibits OFUL's favorable regret performance, while for misspecified models satisfying the non-sparse deviation property, the algorithm avoids the linear regret phenomenon and falls back on UCB's sublinear regret scaling. Numerical experiments on synthetic data, and on recommendation data from the public Yahoo! Learning to Rank Challenge dataset, empirically support our findings.
    Comment: Thirty-First AAAI Conference on Artificial Intelligence, 2017
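
    The two-stage structure described above can be caricatured as follows (a sketch with an ad-hoc residual test of my own, not the paper's hypothesis test): fit a weighted least-squares model of empirical arm means on arm features, and switch to plain UCB when the worst-case residual suggests the linear model is misspecified.

```python
import numpy as np

def choose_base_algorithm(features, emp_means, counts, tol=0.1):
    """Crude linearity check in the spirit of the two-stage scheme above.

    Fits weighted least squares of empirical arm means on arm features and
    inspects the worst residual. `tol` and this test statistic are my own
    ad-hoc stand-ins, not the paper's hypothesis test.
    """
    # Weight arms by sqrt(pull count): frequently pulled arms are trusted more.
    W = np.diag(np.sqrt(counts))
    theta, *_ = np.linalg.lstsq(W @ features, W @ emp_means, rcond=None)
    worst_residual = np.max(np.abs(features @ theta - emp_means))
    return "OFUL" if worst_residual < tol else "UCB"
```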

    Hierarchical Exploration for Accelerating Contextual Bandits

    Contextual bandit learning is an increasingly popular approach to optimizing recommender systems via user feedback, but it can be slow to converge in practice due to the need to explore a large feature space. In this paper, we propose a coarse-to-fine hierarchical approach for encoding prior knowledge that drastically reduces the amount of exploration required. Intuitively, user preferences can be reasonably embedded in a coarse low-dimensional feature space that can be explored efficiently, with exploration in the high-dimensional space required only as necessary. We introduce a bandit algorithm that explores within this coarse-to-fine spectrum, and prove performance guarantees that depend on how well the coarse space captures the user's preferences. We demonstrate substantial improvement over conventional bandit algorithms through extensive simulation as well as a live user study in the setting of personalized news recommendation.
    Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
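
    A minimal sketch of the coarse half of this idea, assuming a fixed projection matrix U encoding the prior knowledge (my own simplification; the paper's algorithm also escalates to the full feature space, which is omitted here): run LinUCB entirely in the coarse subspace, which is cheap to explore because the coarse dimension is much smaller than the ambient one.

```python
import numpy as np

def coarse_linucb_scores(arms, U, A_coarse, b_coarse, alpha=1.0):
    """Score arms with LinUCB run entirely in a coarse subspace.

    `U` (d x k) is an assumed, fixed projection encoding prior knowledge
    about user preferences. Sketch of the coarse half of the coarse-to-fine
    scheme; the escalation to the full feature space is omitted.
    """
    A_inv = np.linalg.inv(A_coarse)           # k x k inverse, cheap for small k
    theta = A_inv @ b_coarse                  # coarse-space parameter estimate
    Z = arms @ U                              # project each arm to k dimensions
    bonus = alpha * np.sqrt(np.einsum('ij,jk,ik->i', Z, A_inv, Z))
    return Z @ theta + bonus                  # mean estimate + exploration bonus
```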