Towards Optimal and Efficient Best Arm Identification in Linear Bandits
We give a new algorithm for best arm identification in linearly parameterised
bandits in the fixed confidence setting. The algorithm generalises the
well-known LUCB algorithm of Kalyanakrishnan et al. (2012) by playing an arm
which minimises a suitable notion of geometric overlap of the statistical
confidence set for the unknown parameter, and is fully adaptive and
computationally efficient as compared to several state-of-the-art methods. We
theoretically analyse the sample complexity of the algorithm for problems with
two and three arms, showing optimality in many cases. Numerical results
indicate favourable performance over other algorithms with which we compare it.
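The LUCB-style sampling strategy referenced above can be sketched for the simplest case of plain Bernoulli arms, rather than the linearly parameterised setting of the paper. Everything below (the arm means, confidence radius, and stopping rule) is an illustrative assumption, not the paper's algorithm:

```python
import math
import random

def lucb_best_arm(means, delta=0.05, seed=0, max_pulls=200_000):
    """Minimal LUCB-style sketch for Bernoulli arms.

    Each round pulls the empirical best arm and its strongest
    challenger (highest upper confidence bound among the rest), and
    stops once the best arm's lower bound clears every challenger's
    upper bound.
    """
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n
    sums = [0.0] * n

    def pull(i):
        counts[i] += 1
        sums[i] += 1.0 if rng.random() < means[i] else 0.0

    for i in range(n):                      # initialise: one pull per arm
        pull(i)
    t = n
    while t < max_pulls:
        mu = [sums[i] / counts[i] for i in range(n)]
        rad = [math.sqrt(math.log(4.0 * n * t * t / delta) / (2 * counts[i]))
               for i in range(n)]
        best = max(range(n), key=lambda i: mu[i])
        challenger = max((i for i in range(n) if i != best),
                         key=lambda i: mu[i] + rad[i])
        if mu[best] - rad[best] >= mu[challenger] + rad[challenger]:
            return best                     # confidently separated: stop
        pull(best)
        pull(challenger)
        t += 2
    return max(range(n), key=lambda i: sums[i] / counts[i])
```

The fixed-confidence guarantee comes from the union-bound term inside the logarithm: with probability at least 1 - delta, every confidence interval holds simultaneously, so the returned arm is the true best.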
Best arm identification in multi-armed bandits with delayed feedback
We propose a generalization of the best arm identification problem in
stochastic multi-armed bandits (MAB) to the setting where every pull of an arm
is associated with delayed feedback. The delay in feedback increases the
effective sample complexity of standard algorithms, but can be offset if we
have access to partial feedback received before a pull is completed. We propose
a general framework to model the relationship between partial and delayed
feedback, and as a special case we introduce efficient algorithms for settings
where the partial feedback are biased or unbiased estimators of the delayed
feedback. Additionally, we propose a novel extension of the algorithms to the
parallel MAB setting where an agent can control a batch of arms. Our
experiments in real-world settings, involving policy search and hyperparameter
optimization in computational sustainability domains for fast charging of
batteries and wildlife corridor construction, demonstrate that exploiting the
structure of partial feedback can lead to significant improvements over
baselines in both sequential and parallel MAB.
Comment: AISTATS 201
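The special case where partial feedback is an unbiased estimator of the delayed reward can be illustrated with a minimal simulation. The delay model, noise level, and function names below are assumptions for illustration, not the paper's framework:

```python
import random

def delayed_pull(mean, delay, rng):
    """Simulate one pull whose final Bernoulli reward arrives after
    `delay` steps, with unbiased noisy previews available meanwhile."""
    final = 1.0 if rng.random() < mean else 0.0
    partials = [final + rng.gauss(0.0, 0.5) for _ in range(delay - 1)]
    return partials, final

def estimate_mean(mean, n_pulls=400, delay=5, use_partial=True, seed=0):
    """Average all unbiased feedback (partials + finals) vs. finals only.

    Because the partial observations are unbiased estimators of the
    delayed reward, folding them in yields a consistent estimate from
    many more observations per pull, offsetting the delay.
    """
    rng = random.Random(seed)
    obs = []
    for _ in range(n_pulls):
        partials, final = delayed_pull(mean, delay, rng)
        if use_partial:
            obs.extend(partials)
        obs.append(final)
    return sum(obs) / len(obs)
```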
Optimally Confident UCB: Improved Regret for Finite-Armed Bandits
I present the first algorithm for stochastic finite-armed bandits that
simultaneously enjoys order-optimal problem-dependent regret and worst-case
regret. Besides the theoretical results, the new algorithm is simple, efficient
and empirically superb. The approach is based on UCB, but with a carefully
chosen confidence parameter that optimally balances the risk of failing
confidence intervals against the cost of excessive optimism.
Comment: 26 pages
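The role of the confidence parameter can be sketched with a vanilla UCB loop; `alpha` here is a hypothetical knob for illustration, not the optimally chosen parameter of the paper's algorithm:

```python
import math
import random

def ucb(means, horizon, alpha=2.0, seed=0):
    """Vanilla UCB sketch with an explicit confidence parameter `alpha`.

    Larger `alpha` widens the exploration bonus (confidence intervals
    fail less often, but optimism is costlier); smaller `alpha` risks
    prematurely discarding the best arm. Balancing the two is exactly
    the trade-off the abstract describes.
    """
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n
    sums = [0.0] * n
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n:
            arm = t - 1                     # play each arm once first
        else:
            arm = max(range(n), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(alpha * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total
```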
Preference-based Online Learning with Dueling Bandits: A Survey
In machine learning, the notion of multi-armed bandits refers to a class of
online learning problems, in which an agent is supposed to simultaneously
explore and exploit a given set of choice alternatives in the course of a
sequential decision process. In the standard setting, the agent learns from
stochastic feedback in the form of real-valued rewards. In many applications,
however, numerical reward signals are not readily available -- instead, only
weaker information is provided, in particular relative preferences in the form
of qualitative comparisons between pairs of alternatives. This observation has
motivated the study of variants of the multi-armed bandit problem, in which
more general representations are used both for the type of feedback to learn
from and the target of prediction. The aim of this paper is to provide a survey
of the state of the art in this field, referred to as preference-based
multi-armed bandits or dueling bandits. To this end, we provide an overview of
problems that have been considered in the literature as well as methods for
tackling them. Our taxonomy is mainly based on the assumptions made by these
methods about the data-generating process and, related to this, the properties
of the preference-based feedback.
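As a concrete instance of learning from qualitative pairwise comparisons, a minimal dueling-bandit loop with uniform pairwise exploration and a Borda-score winner estimate might look like this. The preference matrix and budget are assumptions; the surveyed algorithms are far more sample-efficient:

```python
import itertools
import random

def borda_dueling(pref, rounds_per_pair=200, seed=0):
    """Minimal dueling-bandit sketch: uniform pairwise exploration
    followed by a Borda-score winner estimate.

    `pref[i][j]` is the probability that arm i beats arm j in a duel.
    The learner only ever sees the binary outcome of each comparison,
    never a numeric reward.
    """
    rng = random.Random(seed)
    n = len(pref)
    wins = [[0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        for _ in range(rounds_per_pair):
            if rng.random() < pref[i][j]:
                wins[i][j] += 1
            else:
                wins[j][i] += 1
    # Borda score: average estimated probability of beating each rival
    borda = [sum(wins[i][j] / rounds_per_pair
                 for j in range(n) if j != i) / (n - 1)
             for i in range(n)]
    return max(range(n), key=lambda i: borda[i])
```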
A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit
Adaptive and sequential experiment design is a well-studied area in numerous
domains. We survey and synthesize the work of the online statistical learning
paradigm referred to as multi-armed bandits, integrating the existing research
as a resource for a certain class of online experiments. We first explore the
traditional stochastic model of a multi-armed bandit, then explore a taxonomic
scheme of complications to that model, for each complication relating it to a
specific requirement or consideration of the experiment design context.
Finally, at the end of the paper, we present a table of known upper bounds on
regret for all studied algorithms, providing both perspectives for future
theoretical work and a decision-making tool for practitioners looking for
theoretical guarantees.
Comment: 49 pages, 1 figure
A Survey on Practical Applications of Multi-Armed and Contextual Bandits
In recent years, the multi-armed bandit (MAB) framework has attracted a lot of
attention in various applications, from recommender systems and information
retrieval to healthcare and finance, due to its stellar performance combined
with certain attractive properties, such as learning from less feedback. The
multi-armed bandit field is currently flourishing, as novel problem settings
and algorithms motivated by various practical applications are being
introduced, building on top of the classical bandit problem. This article aims
to provide a comprehensive review of top recent developments in multiple
real-life applications of the multi-armed bandit. Specifically, we introduce a
taxonomy of common MAB-based applications and summarize the state of the art for each
of those domains. Furthermore, we identify important current trends and provide
new perspectives pertaining to the future of this exciting and fast-growing
field.
Comment: under review by IJCAI 2019 Survey
Thompson Sampling Guided Stochastic Searching on the Line for Deceptive Environments with Applications to Root-Finding Problems
The multi-armed bandit problem forms the foundation for solving a wide range
of on-line stochastic optimization problems through a simple, yet effective
mechanism. One simply casts the problem as a gambler that repeatedly pulls one
out of N slot machine arms, eliciting random rewards. Learning of reward
probabilities is then combined with reward maximization, by carefully balancing
reward exploration against reward exploitation. In this paper, we address a
particularly intriguing variant of the multi-armed bandit problem, referred to
as the {\it Stochastic Point Location (SPL) Problem}. The gambler is here only
told whether the optimal arm (point) lies to the "left" or to the "right" of
the arm pulled, with the feedback being erroneous with probability p.
This formulation thus captures optimization in continuous action spaces with
both {\it informative} and {\it deceptive} feedback. To tackle this class of
problems, we formulate a compact and scalable Bayesian representation of the
solution space that simultaneously captures both the location of the optimal
arm as well as the probability of receiving correct feedback. We further
introduce the accompanying Thompson Sampling guided Stochastic Point Location
(TS-SPL) scheme for balancing exploration against exploitation. By learning
p, TS-SPL also supports {\it deceptive} environments that are lying about
the direction of the optimal arm. This, in turn, allows us to solve the
fundamental Stochastic Root Finding (SRF) Problem. Empirical results
demonstrate that our scheme deals with both deceptive and informative
environments, significantly outperforming competing algorithms both for SRF and
SPL.
Comment: 17 pages, 2 figures. A preliminary version of some of the results of
this paper appears in the Proceedings of AIAI'1
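The SPL interaction protocol with a Bayesian posterior over candidate locations can be sketched as follows. Unlike TS-SPL, this sketch assumes the feedback-correctness probability `p_correct` is known rather than learned, and uses a discrete grid of candidate points:

```python
import random

def ts_point_location(optimal, p_correct=0.8, n_points=100,
                      rounds=2000, seed=0):
    """Thompson-sampling-style sketch of stochastic point location.

    The environment answers "left"/"right" of each queried grid point,
    and each answer is correct only with probability `p_correct`. We
    keep a discrete posterior over candidate locations, query a
    posterior sample each round, and update by Bayes' rule.
    """
    rng = random.Random(seed)
    post = [1.0 / n_points] * n_points      # uniform prior over the grid
    for _ in range(rounds):
        # Thompson step: sample the query point from the posterior
        q = rng.choices(range(n_points), weights=post)[0]
        truth_right = optimal > q
        says_right = truth_right if rng.random() < p_correct \
            else not truth_right
        # Bayes update: likelihood of the (possibly erroneous) feedback
        for x in range(n_points):
            agrees = (x > q) == says_right
            post[x] *= p_correct if agrees else 1.0 - p_correct
        z = sum(post)
        post = [w / z for w in post]
    return max(range(n_points), key=lambda x: post[x])   # MAP estimate
```

A deceptive environment corresponds to `p_correct < 0.5`; learning `p_correct` online, as TS-SPL does, is what makes that case solvable.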
Nearly Optimal Sampling Algorithms for Combinatorial Pure Exploration
We study the combinatorial pure exploration problem Best-Set in stochastic
multi-armed bandits. In a Best-Set instance, we are given n arms with unknown
reward distributions, as well as a family F of feasible subsets
over the arms. Our goal is to identify the feasible subset in F
with the maximum total mean using as few samples as possible. The problem
generalizes the classical best arm identification problem and the top-k arm
identification problem, both of which have attracted significant attention in
recent years. We provide a novel instance-wise lower bound for the sample
complexity of the problem, as well as a nontrivial sampling algorithm, matching
the lower bound up to a factor of ln|F|. For an important class of
combinatorial families, we also provide polynomial time implementation of the
sampling algorithm, using the equivalence of separation and optimization for
convex programs, and approximate Pareto curves in multi-objective optimization.
We also show that the ln|F| factor is inevitable in general
through a nontrivial lower bound construction. Our results significantly
improve several previous results for several important combinatorial
constraints, and provide a tighter understanding of the general Best-Set
problem.
We further introduce an even more general problem, formulated in geometric
terms. We are given n Gaussian arms with unknown means and unit variance.
Consider the n-dimensional Euclidean space R^n, and a collection O
of disjoint subsets. Our goal is to determine the subset in O
that contains the n-dimensional vector of the means. The
problem generalizes most pure exploration bandit problems studied in the
literature. We provide the first nearly optimal sample complexity upper and
lower bounds for the problem.
Comment: Accepted to COLT 201
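For contrast with the paper's nearly optimal algorithm, a naive uniform-sampling baseline for Best-Set (sample every arm equally, return the feasible subset with the largest empirical total mean) can be written as a few lines. The instance below is purely illustrative:

```python
import random

def naive_best_set(means, feasible_sets, samples_per_arm=500, seed=0):
    """Naive uniform-sampling baseline for the Best-Set problem.

    Samples every Gaussian arm (unit variance) the same number of
    times, then returns the feasible subset maximising the total
    empirical mean. The paper's algorithm is far more sample-efficient;
    this only illustrates the objective being optimised.
    """
    rng = random.Random(seed)
    est = [sum(rng.gauss(m, 1.0) for _ in range(samples_per_arm))
           / samples_per_arm
           for m in means]
    return max(feasible_sets, key=lambda s: sum(est[i] for i in s))
```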
Bandits meet Computer Architecture: Designing a Smartly-allocated Cache
In many embedded systems, such as imaging systems, the system has a single
designated purpose, and the same threads are executed repeatedly. Profiling
thread behavior allows the system to allocate each thread its resources in a
way that improves overall system performance. We study an online resource
allocation problem, where a resource manager simultaneously allocates resources
(exploration), learns the impact on the different consumers (learning) and
improves allocation towards optimal performance (exploitation). We build on the
rich framework of multi-armed bandits and present online and offline
algorithms. Through extensive experiments with both synthetic data and
real-world cache allocation to threads we show the merits and properties of our
algorithm.
Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits
We study the linear contextual bandit problem with finite action sets. When
the problem dimension is , the time horizon is , and there are candidate actions per time period, we (1) show that the minimax
expected regret is for every algorithm,
and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose
regret matches the lower bound up to iterated logarithmic factors. Our
algorithmic result saves two factors from previous analysis,
and our information-theoretical lower bound also improves previous results by
one factor, revealing a regret scaling quite different from
classical multi-armed bandits in which no logarithmic term is present in
minimax regret. Our proof techniques include variable confidence levels and a
careful analysis of layer sizes of SupLinUCB on the upper bound side, and
delicately constructed adversarial sequences showing the tightness of
elliptical potential lemmas on the lower bound side
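SupLinUCB builds on the basic LinUCB scheme, which can be sketched as follows. This is plain LinUCB with a Sherman-Morrison rank-one update; the layered confidence levels of SupLinUCB and the variable confidence levels of the VCL variant are omitted, and the action set, noise level, and `alpha` are illustrative assumptions:

```python
import math
import random

def linucb(actions, theta, horizon=3000, alpha=1.0, seed=0):
    """Plain LinUCB sketch for a fixed finite action set.

    `actions` are feature vectors and `theta` is the unknown parameter,
    used here only to simulate noisy linear rewards. Each round plays
    the arm maximising (estimate + alpha * elliptical confidence width).
    """
    rng = random.Random(seed)
    d = len(theta)
    A_inv = [[float(i == j) for j in range(d)] for i in range(d)]  # I^-1
    b = [0.0] * d
    pulls = [0] * len(actions)

    def mat_vec(M, v):
        return [sum(M[i][j] * v[j] for j in range(d)) for i in range(d)]

    for _ in range(horizon):
        theta_hat = mat_vec(A_inv, b)       # ridge-regression estimate

        def score(x):                       # estimate + exploration bonus
            width = math.sqrt(sum(xi * yi
                                  for xi, yi in zip(x, mat_vec(A_inv, x))))
            return sum(th * xi for th, xi in zip(theta_hat, x)) \
                + alpha * width

        k = max(range(len(actions)), key=lambda i: score(actions[i]))
        x = actions[k]
        pulls[k] += 1
        reward = sum(ti * xi for ti, xi in zip(theta, x)) \
            + rng.gauss(0.0, 0.1)
        # Sherman-Morrison rank-one update of A_inv after A += x x^T
        Ax = mat_vec(A_inv, x)
        denom = 1.0 + sum(xi * axi for xi, axi in zip(x, Ax))
        A_inv = [[A_inv[i][j] - Ax[i] * Ax[j] / denom for j in range(d)]
                 for i in range(d)]
        b = [bi + reward * xi for bi, xi in zip(b, x)]
    return pulls
```

The elliptical potential lemmas mentioned in the abstract bound how quickly the confidence widths computed in `score` can shrink over a horizon, which is exactly what the paper's adversarial sequences stress-test.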