Towards Optimal and Efficient Best Arm Identification in Linear Bandits
We give a new algorithm for best arm identification in linearly parameterised
bandits in the fixed confidence setting. The algorithm generalises the
well-known LUCB algorithm of Kalyanakrishnan et al. (2012) by playing an arm
which minimises a suitable notion of geometric overlap of the statistical
confidence set for the unknown parameter, and is fully adaptive and
computationally efficient as compared to several state-of-the-art methods. We
theoretically analyse the sample complexity of the algorithm for problems with
two and three arms, showing optimality in many cases. Numerical results
indicate favourable performance over other algorithms with which we compare it.
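The LUCB-style sampling strategy referenced above can be sketched for the simplest case of plain Bernoulli arms, rather than the linearly parameterised setting of the paper. Everything below (the arm means, confidence radius, and stopping rule) is an illustrative assumption, not the paper's algorithm:

```python
import math
import random

def lucb_best_arm(means, delta=0.05, seed=0, max_pulls=200_000):
    """Minimal LUCB-style sketch for Bernoulli arms.

    Each round pulls the empirical best arm and its strongest
    challenger (highest upper confidence bound among the rest), and
    stops once the best arm's lower bound clears every challenger's
    upper bound.
    """
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n
    sums = [0.0] * n

    def pull(i):
        counts[i] += 1
        sums[i] += 1.0 if rng.random() < means[i] else 0.0

    for i in range(n):                      # initialise: one pull per arm
        pull(i)
    t = n
    while t < max_pulls:
        mu = [sums[i] / counts[i] for i in range(n)]
        rad = [math.sqrt(math.log(4.0 * n * t * t / delta) / (2 * counts[i]))
               for i in range(n)]
        best = max(range(n), key=lambda i: mu[i])
        challenger = max((i for i in range(n) if i != best),
                         key=lambda i: mu[i] + rad[i])
        if mu[best] - rad[best] >= mu[challenger] + rad[challenger]:
            return best                     # confidently separated: stop
        pull(best)
        pull(challenger)
        t += 2
    return max(range(n), key=lambda i: sums[i] / counts[i])
```

The fixed-confidence guarantee comes from the union-bound term inside the logarithm: with probability at least 1 - delta, every confidence interval holds simultaneously, so the returned arm is the true best.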
Best arm identification in multi-armed bandits with delayed feedback
We propose a generalization of the best arm identification problem in
stochastic multi-armed bandits (MAB) to the setting where every pull of an arm
is associated with delayed feedback. The delay in feedback increases the
effective sample complexity of standard algorithms, but can be offset if we
have access to partial feedback received before a pull is completed. We propose
a general framework to model the relationship between partial and delayed
feedback, and as a special case we introduce efficient algorithms for settings
where the partial feedback are biased or unbiased estimators of the delayed
feedback. Additionally, we propose a novel extension of the algorithms to the
parallel MAB setting where an agent can control a batch of arms. Our
experiments in real-world settings, involving policy search and hyperparameter
optimization in computational sustainability domains for fast charging of
batteries and wildlife corridor construction, demonstrate that exploiting the
structure of partial feedback can lead to significant improvements over
baselines in both sequential and parallel MAB.
Comment: AISTATS 201
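The special case where partial feedback is an unbiased estimator of the delayed reward can be illustrated with a minimal simulation. The delay model, noise level, and function names below are assumptions for illustration, not the paper's framework:

```python
import random

def delayed_pull(mean, delay, rng):
    """Simulate one pull whose final Bernoulli reward arrives after
    `delay` steps, with unbiased noisy previews available meanwhile."""
    final = 1.0 if rng.random() < mean else 0.0
    partials = [final + rng.gauss(0.0, 0.5) for _ in range(delay - 1)]
    return partials, final

def estimate_mean(mean, n_pulls=400, delay=5, use_partial=True, seed=0):
    """Average all unbiased feedback (partials + finals) vs. finals only.

    Because the partial observations are unbiased estimators of the
    delayed reward, folding them in yields a consistent estimate from
    many more observations per pull, offsetting the delay.
    """
    rng = random.Random(seed)
    obs = []
    for _ in range(n_pulls):
        partials, final = delayed_pull(mean, delay, rng)
        if use_partial:
            obs.extend(partials)
        obs.append(final)
    return sum(obs) / len(obs)
```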
Optimally Confident UCB: Improved Regret for Finite-Armed Bandits
I present the first algorithm for stochastic finite-armed bandits that
simultaneously enjoys order-optimal problem-dependent regret and worst-case
regret. Besides the theoretical results, the new algorithm is simple, efficient
and empirically superb. The approach is based on UCB, but with a carefully
chosen confidence parameter that optimally balances the risk of failing
confidence intervals against the cost of excessive optimism.
Comment: 26 pages
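The role of the confidence parameter can be sketched with a vanilla UCB loop; `alpha` here is a hypothetical knob for illustration, not the optimally chosen parameter of the paper's algorithm:

```python
import math
import random

def ucb(means, horizon, alpha=2.0, seed=0):
    """Vanilla UCB sketch with an explicit confidence parameter `alpha`.

    Larger `alpha` widens the exploration bonus (confidence intervals
    fail less often, but optimism is costlier); smaller `alpha` risks
    prematurely discarding the best arm. Balancing the two is exactly
    the trade-off the abstract describes.
    """
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n
    sums = [0.0] * n
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n:
            arm = t - 1                     # play each arm once first
        else:
            arm = max(range(n), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(alpha * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total
```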
Preference-based Online Learning with Dueling Bandits: A Survey
In machine learning, the notion of multi-armed bandits refers to a class of
online learning problems, in which an agent is supposed to simultaneously
explore and exploit a given set of choice alternatives in the course of a
sequential decision process. In the standard setting, the agent learns from
stochastic feedback in the form of real-valued rewards. In many applications,
however, numerical reward signals are not readily available -- instead, only
weaker information is provided, in particular relative preferences in the form
of qualitative comparisons between pairs of alternatives. This observation has
motivated the study of variants of the multi-armed bandit problem, in which
more general representations are used both for the type of feedback to learn
from and the target of prediction. The aim of this paper is to provide a survey
of the state of the art in this field, referred to as preference-based
multi-armed bandits or dueling bandits. To this end, we provide an overview of
problems that have been considered in the literature as well as methods for
tackling them. Our taxonomy is mainly based on the assumptions made by these
methods about the data-generating process and, related to this, the properties
of the preference-based feedback.
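As a concrete instance of learning from qualitative pairwise comparisons, a minimal dueling-bandit loop with uniform pairwise exploration and a Borda-score winner estimate might look like this. The preference matrix and budget are assumptions; the surveyed algorithms are far more sample-efficient:

```python
import itertools
import random

def borda_dueling(pref, rounds_per_pair=200, seed=0):
    """Minimal dueling-bandit sketch: uniform pairwise exploration
    followed by a Borda-score winner estimate.

    `pref[i][j]` is the probability that arm i beats arm j in a duel.
    The learner only ever sees the binary outcome of each comparison,
    never a numeric reward.
    """
    rng = random.Random(seed)
    n = len(pref)
    wins = [[0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        for _ in range(rounds_per_pair):
            if rng.random() < pref[i][j]:
                wins[i][j] += 1
            else:
                wins[j][i] += 1
    # Borda score: average estimated probability of beating each rival
    borda = [sum(wins[i][j] / rounds_per_pair
                 for j in range(n) if j != i) / (n - 1)
             for i in range(n)]
    return max(range(n), key=lambda i: borda[i])
```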
A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit
Adaptive and sequential experiment design is a well-studied area in numerous
domains. We survey and synthesize the work of the online statistical learning
paradigm referred to as multi-armed bandits, integrating the existing research
as a resource for a certain class of online experiments. We first explore the
traditional stochastic model of a multi-armed bandit, then explore a taxonomic
scheme of complications to that model, for each complication relating it to a
specific requirement or consideration of the experiment design context.
Finally, at the end of the paper, we present a table of known upper bounds on
regret for all studied algorithms, providing both perspectives for future
theoretical work and a decision-making tool for practitioners looking for
theoretical guarantees.
Comment: 49 pages, 1 figure
A Survey on Practical Applications of Multi-Armed and Contextual Bandits
In recent years, the multi-armed bandit (MAB) framework has attracted a lot of
attention in various applications, from recommender systems and information
retrieval to healthcare and finance, due to its stellar performance combined
with certain attractive properties, such as learning from less feedback. The
multi-armed bandit field is currently flourishing, as novel problem settings
and algorithms motivated by various practical applications are being
introduced, building on top of the classical bandit problem. This article aims
to provide a comprehensive review of top recent developments in multiple
real-life applications of the multi-armed bandit. Specifically, we introduce a
taxonomy of common MAB-based applications and summarize the state of the art for each
of those domains. Furthermore, we identify important current trends and provide
new perspectives pertaining to the future of this exciting and fast-growing
field.
Comment: under review by IJCAI 2019 Survey
Thompson Sampling Guided Stochastic Searching on the Line for Deceptive Environments with Applications to Root-Finding Problems
The multi-armed bandit problem forms the foundation for solving a wide range
of on-line stochastic optimization problems through a simple, yet effective
mechanism. One simply casts the problem as a gambler that repeatedly pulls one
out of N slot machine arms, eliciting random rewards. Learning of reward
probabilities is then combined with reward maximization, by carefully balancing
reward exploration against reward exploitation. In this paper, we address a
particularly intriguing variant of the multi-armed bandit problem, referred to
as the {\it Stochastic Point Location (SPL) Problem}. The gambler is here only
told whether the optimal arm (point) lies to the "left" or to the "right" of
the arm pulled, with the feedback being erroneous with probability p.
This formulation thus captures optimization in continuous action spaces with
both {\it informative} and {\it deceptive} feedback. To tackle this class of
problems, we formulate a compact and scalable Bayesian representation of the
solution space that simultaneously captures both the location of the optimal
arm as well as the probability of receiving correct feedback. We further
introduce the accompanying Thompson Sampling guided Stochastic Point Location
(TS-SPL) scheme for balancing exploration against exploitation. By learning
p, TS-SPL also supports {\it deceptive} environments that are lying about
the direction of the optimal arm. This, in turn, allows us to solve the
fundamental Stochastic Root Finding (SRF) Problem. Empirical results
demonstrate that our scheme deals with both deceptive and informative
environments, significantly outperforming competing algorithms both for SRF and
SPL.
Comment: 17 pages, 2 figures. A preliminary version of some of the results of
this paper appears in the Proceedings of AIAI'1
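The SPL interaction protocol with a Bayesian posterior over candidate locations can be sketched as follows. Unlike TS-SPL, this sketch assumes the feedback-correctness probability `p_correct` is known rather than learned, and uses a discrete grid of candidate points:

```python
import random

def ts_point_location(optimal, p_correct=0.8, n_points=100,
                      rounds=2000, seed=0):
    """Thompson-sampling-style sketch of stochastic point location.

    The environment answers "left"/"right" of each queried grid point,
    and each answer is correct only with probability `p_correct`. We
    keep a discrete posterior over candidate locations, query a
    posterior sample each round, and update by Bayes' rule.
    """
    rng = random.Random(seed)
    post = [1.0 / n_points] * n_points      # uniform prior over the grid
    for _ in range(rounds):
        # Thompson step: sample the query point from the posterior
        q = rng.choices(range(n_points), weights=post)[0]
        truth_right = optimal > q
        says_right = truth_right if rng.random() < p_correct \
            else not truth_right
        # Bayes update: likelihood of the (possibly erroneous) feedback
        for x in range(n_points):
            agrees = (x > q) == says_right
            post[x] *= p_correct if agrees else 1.0 - p_correct
        z = sum(post)
        post = [w / z for w in post]
    return max(range(n_points), key=lambda x: post[x])   # MAP estimate
```

A deceptive environment corresponds to `p_correct < 0.5`; learning `p_correct` online, as TS-SPL does, is what makes that case solvable.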
Nearly Optimal Sampling Algorithms for Combinatorial Pure Exploration
We study the combinatorial pure exploration problem Best-Set in stochastic
multi-armed bandits. In a Best-Set instance, we are given n arms with unknown
reward distributions, as well as a family F of feasible subsets
over the arms. Our goal is to identify the feasible subset in F
with the maximum total mean using as few samples as possible. The problem
generalizes the classical best arm identification problem and the top-k arm
identification problem, both of which have attracted significant attention in
recent years. We provide a novel instance-wise lower bound for the sample
complexity of the problem, as well as a nontrivial sampling algorithm, matching
the lower bound up to a factor of ln|F|. For an important class of
combinatorial families, we also provide polynomial time implementation of the
sampling algorithm, using the equivalence of separation and optimization for
convex programs, and approximate Pareto curves in multi-objective optimization.
We also show that the ln|F| factor is inevitable in general
through a nontrivial lower bound construction. Our results significantly
improve several previous results for several important combinatorial
constraints, and provide a tighter understanding of the general Best-Set
problem.
We further introduce an even more general problem, formulated in geometric
terms. We are given n Gaussian arms with unknown means and unit variance.
Consider the n-dimensional Euclidean space R^n, and a collection O
of disjoint subsets. Our goal is to determine the subset in O
that contains the n-dimensional vector of the means. The
problem generalizes most pure exploration bandit problems studied in the
literature. We provide the first nearly optimal sample complexity upper and
lower bounds for the problem.
Comment: Accepted to COLT 201
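For contrast with the paper's nearly optimal algorithm, a naive uniform-sampling baseline for Best-Set (sample every arm equally, return the feasible subset with the largest empirical total mean) can be written as a few lines. The instance below is purely illustrative:

```python
import random

def naive_best_set(means, feasible_sets, samples_per_arm=500, seed=0):
    """Naive uniform-sampling baseline for the Best-Set problem.

    Samples every Gaussian arm (unit variance) the same number of
    times, then returns the feasible subset maximising the total
    empirical mean. The paper's algorithm is far more sample-efficient;
    this only illustrates the objective being optimised.
    """
    rng = random.Random(seed)
    est = [sum(rng.gauss(m, 1.0) for _ in range(samples_per_arm))
           / samples_per_arm
           for m in means]
    return max(feasible_sets, key=lambda s: sum(est[i] for i in s))
```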
Bandits meet Computer Architecture: Designing a Smartly-allocated Cache
In many embedded systems, such as imaging systems, the system has a single
designated purpose, and the same threads are executed repeatedly. Profiling
thread behavior allows the system to allocate each thread its resources in a
way that improves overall system performance. We study an online resource
allocation problem, where a resource manager simultaneously allocates resources
(exploration), learns the impact on the different consumers (learning) and
improves allocation towards optimal performance (exploitation). We build on the
rich framework of multi-armed bandits and present online and offline
algorithms. Through extensive experiments with both synthetic data and
real-world cache allocation to threads we show the merits and properties of our
algorithm.
Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits
We study the linear contextual bandit problem with finite action sets. When
the problem dimension is , the time horizon is , and there are candidate actions per time period, we (1) show that the minimax
expected regret is for every algorithm,
and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose
regret matches the lower bound up to iterated logarithmic factors. Our
algorithmic result saves two factors from previous analysis,
and our information-theoretical lower bound also improves previous results by
one factor, revealing a regret scaling quite different from
classical multi-armed bandits in which no logarithmic term is present in
minimax regret. Our proof techniques include variable confidence levels and a
careful analysis of layer sizes of SupLinUCB on the upper bound side, and
delicately constructed adversarial sequences showing the tightness of
elliptical potential lemmas on the lower bound side
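SupLinUCB builds on the basic LinUCB scheme, which can be sketched as follows. This is plain LinUCB with a Sherman-Morrison rank-one update; the layered confidence levels of SupLinUCB and the variable confidence levels of the VCL variant are omitted, and the action set, noise level, and `alpha` are illustrative assumptions:

```python
import math
import random

def linucb(actions, theta, horizon=3000, alpha=1.0, seed=0):
    """Plain LinUCB sketch for a fixed finite action set.

    `actions` are feature vectors and `theta` is the unknown parameter,
    used here only to simulate noisy linear rewards. Each round plays
    the arm maximising (estimate + alpha * elliptical confidence width).
    """
    rng = random.Random(seed)
    d = len(theta)
    A_inv = [[float(i == j) for j in range(d)] for i in range(d)]  # I^-1
    b = [0.0] * d
    pulls = [0] * len(actions)

    def mat_vec(M, v):
        return [sum(M[i][j] * v[j] for j in range(d)) for i in range(d)]

    for _ in range(horizon):
        theta_hat = mat_vec(A_inv, b)       # ridge-regression estimate

        def score(x):                       # estimate + exploration bonus
            width = math.sqrt(sum(xi * yi
                                  for xi, yi in zip(x, mat_vec(A_inv, x))))
            return sum(th * xi for th, xi in zip(theta_hat, x)) \
                + alpha * width

        k = max(range(len(actions)), key=lambda i: score(actions[i]))
        x = actions[k]
        pulls[k] += 1
        reward = sum(ti * xi for ti, xi in zip(theta, x)) \
            + rng.gauss(0.0, 0.1)
        # Sherman-Morrison rank-one update of A_inv after A += x x^T
        Ax = mat_vec(A_inv, x)
        denom = 1.0 + sum(xi * axi for xi, axi in zip(x, Ax))
        A_inv = [[A_inv[i][j] - Ax[i] * Ax[j] / denom for j in range(d)]
                 for i in range(d)]
        b = [bi + reward * xi for bi, xi in zip(b, x)]
    return pulls
```

The elliptical potential lemmas mentioned in the abstract bound how quickly the confidence widths computed in `score` can shrink over a horizon, which is exactly what the paper's adversarial sequences stress-test.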