Good Arm Identification via Bandit Feedback
We consider a novel stochastic multi-armed bandit problem called {\em good
arm identification} (GAI), where a good arm is defined as an arm with expected
reward greater than or equal to a given threshold. GAI is a pure-exploration
problem in which a single agent repeatedly outputs an arm as soon as it is
identified as good, before confirming that the other arms are actually not
good. The objective of GAI is to minimize the number of samples for each such
identification. We find that GAI faces a new kind of dilemma, the {\em
exploration-exploitation dilemma of confidence}, which poses a difficulty
different from that of best arm identification. As a result, an efficient
design of
algorithms for GAI is quite different from that for the best arm
identification. We derive a lower bound on the sample complexity of GAI that is
tight up to the logarithmic factor for
acceptance error rate . We also develop an algorithm whose sample
complexity almost matches the lower bound. Finally, we confirm experimentally
that the proposed algorithm outperforms naive algorithms in synthetic settings
based on a conventional bandit problem and on clinical trial research for
rheumatoid arthritis.
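To make the protocol concrete, here is a minimal sketch of a GAI-style confidence-bound loop, assuming rewards in [0, 1] and anytime Hoeffding intervals. The names `sample`, `threshold`, and `delta` are illustrative, and this is a generic sketch rather than the paper's algorithm:

```python
import math

def gai(sample, n_arms, threshold, delta, horizon=100_000):
    """Generic good-arm-identification loop with Hoeffding bounds.

    sample(i) draws one reward from arm i; rewards assumed in [0, 1].
    threshold and delta (acceptance error rate) are the GAI inputs.
    A sketch, not the algorithm proposed in the paper above.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    undecided = set(range(n_arms))
    good = []

    def radius(i):
        # Anytime Hoeffding-style confidence radius (union bound over arms).
        return math.sqrt(math.log(4 * n_arms * counts[i] ** 2 / delta)
                         / (2 * counts[i]))

    for _ in range(horizon):
        if not undecided:
            break
        # Pull the undecided arm with the highest upper confidence bound.
        i = max(undecided,
                key=lambda j: (sums[j] / counts[j] + radius(j))
                if counts[j] else float("inf"))
        sums[i] += sample(i)
        counts[i] += 1
        mean = sums[i] / counts[i]
        if mean - radius(i) >= threshold:    # confidently good: output now
            good.append(i)
            undecided.discard(i)
        elif mean + radius(i) < threshold:   # confidently not good: discard
            undecided.discard(i)
    return good
```

An arm is output the moment its lower confidence bound clears the threshold, which is exactly the "output as soon as identified" behavior described above.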
Causal Bandits: Learning Good Interventions via Causal Inference
We study the problem of using causal models to improve the rate at which good
interventions can be learned online in a stochastic environment. Our formalism
combines multi-armed bandits and causal inference to model a novel type of bandit
feedback that is not exploited by existing approaches. We propose a new
algorithm that exploits the causal feedback and prove a bound on its simple
regret that is strictly better (in all quantities) than algorithms that do not
use the additional causal information.
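As a toy illustration of the kind of feedback a causal model exposes (hypothetical names, not the paper's algorithm): in a parallel causal graph, a single observational sample reveals the value every variable happened to take, so it simultaneously updates the reward estimate of every intervention arm do(var = val) consistent with that sample:

```python
from collections import defaultdict

def causal_simple_regret(env, arms, rounds):
    """Toy illustration of causal bandit feedback (hypothetical names).

    env() returns (assignment, reward): the value each causal variable
    took and the resulting reward.  A single sample then updates the
    estimate of every intervention arm do(var=val) consistent with the
    assignment -- the extra feedback a causal model exposes, which a
    standard bandit treating arms as unrelated would waste.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for _ in range(rounds):
        assignment, reward = env()
        for var, val in assignment.items():
            if (var, val) in arms:          # credit every consistent arm
                total[(var, val)] += reward
                count[(var, val)] += 1
    # Recommend the intervention with the best empirical mean (simple regret).
    return max(arms, key=lambda a: total[a] / count[a] if count[a] else -1.0)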
Best arm identification in multi-armed bandits with delayed feedback
We propose a generalization of the best arm identification problem in
stochastic multi-armed bandits (MAB) to the setting where every pull of an arm
is associated with delayed feedback. The delay in feedback increases the
effective sample complexity of standard algorithms, but can be offset if we
have access to partial feedback received before a pull is completed. We propose
a general framework to model the relationship between partial and delayed
feedback, and as a special case we introduce efficient algorithms for settings
where the partial feedback is a biased or an unbiased estimator of the delayed
feedback. Additionally, we propose a novel extension of the algorithms to the
parallel MAB setting where an agent can control a batch of arms. Our
experiments in real-world settings, involving policy search and hyperparameter
optimization in computational sustainability domains for fast charging of
batteries and wildlife corridor construction, demonstrate that exploiting the
structure of partial feedback can lead to significant improvements over
baselines in both sequential and parallel MAB.
Comment: AISTATS 2018
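One way to picture the partial-feedback mechanism is a running mean that books an unbiased partial estimate immediately and swaps in the true reward once the delay elapses. A minimal simulation sketch, assuming hypothetical `pull` and `partial` callables (these names are not from the paper):

```python
import heapq

def run_with_partial_feedback(pull, partial, delay, n_arms, horizon):
    """Running means that use partial feedback while pulls mature.

    pull(i) starts a pull of arm i and returns its eventual reward
    (revealed only `delay` steps later); partial(i) is an immediate
    unbiased estimate of that reward.  Names are illustrative, and the
    round-robin pulling below stands in for a real BAI sampling rule.
    """
    sums = [0.0] * n_arms
    counts = [0] * n_arms
    pending = []  # heap of (reveal_time, arm, true_reward, partial_estimate)

    for t in range(horizon):
        # Matured pulls: replace each partial estimate with the truth.
        while pending and pending[0][0] <= t:
            _, i, true_r, est_r = heapq.heappop(pending)
            sums[i] += true_r - est_r
        i = t % n_arms                     # placeholder sampling rule
        est = partial(i)
        sums[i] += est                     # book the partial estimate now
        counts[i] += 1
        heapq.heappush(pending, (t + delay, i, pull(i), est))
    return sums, counts
```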
Instrument-Armed Bandits
We extend the classic multi-armed bandit (MAB) model to the setting of
noncompliance, where the arm pull is a mere instrument and the treatment
applied may differ from it, which gives rise to the instrument-armed bandit
(IAB) problem. The IAB setting is relevant whenever the experimental units are
human since free will, ethics, and the law may prohibit unrestricted or forced
application of treatment. In particular, the setting is relevant in bandit
models of dynamic clinical trials and other controlled trials on human
interventions. Nonetheless, the setting has not been fully investigated in the
bandit literature. We show that there are various and divergent notions of
regret in this setting, all of which coincide only in the classic MAB setting.
We characterize the behavior of these regrets and analyze standard MAB
algorithms. We argue for a particular kind of regret that captures the causal
effect of treatments but show that standard MAB algorithms cannot achieve
sublinear control on this regret. Instead, we develop new algorithms for the
IAB problem, prove new regret bounds for them, and compare them to standard MAB
algorithms in numerical examples.
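The core twist is that the learner's pull is only an instrument for the treatment actually taken. A toy simulation of this noncompliance, with an eps-greedy learner and hypothetical `comply` and `outcome_mean` inputs (not the paper's algorithms):

```python
import random

def simulate_iab(comply, outcome_mean, n_arms, horizon, eps=0.1):
    """Toy instrument-armed bandit run (hypothetical inputs).

    The pulled arm z is only an instrument: the treatment actually taken
    is d = comply(z), and the reward is a noisy draw around
    outcome_mean[d].  Eps-greedy keeps statistics per *instrument*,
    which is why regret on instruments and regret on treatments (the
    causal effect) can diverge under noncompliance.
    """
    sums = [0.0] * n_arms
    counts = [0] * n_arms
    history = []
    for t in range(horizon):
        if t < n_arms:                       # pull each instrument once
            z = t
        elif random.random() < eps:
            z = random.randrange(n_arms)
        else:
            z = max(range(n_arms), key=lambda i: sums[i] / counts[i])
        d = comply(z)                        # noncompliance: d may differ from z
        r = outcome_mean[d] + random.gauss(0.0, 0.1)
        sums[z] += r
        counts[z] += 1
        history.append((z, d, r))
    return history
```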
Best-of-K Bandits
This paper studies the Best-of-K Bandit game: At each time the player chooses
a subset S among all N-choose-K possible options and observes reward max(X(i) :
i in S) where X is a random vector drawn from a joint distribution. The
objective is to identify the subset that achieves the highest expected reward
with high probability using as few queries as possible. We present
distribution-dependent lower bounds based on a particular construction which
force a learner to consider all N-choose-K subsets, and match naive extensions
of known upper bounds in the bandit setting obtained by treating each subset as
a separate arm. Nevertheless, we present evidence that exhaustive search may be
avoided for certain, favorable distributions because the influence of
higher-order correlations may be dominated by lower-order statistics.
Finally, we present an algorithm and analysis for independent arms, which
mitigates the surprising non-trivial information occlusion that occurs due to
only observing the max in the subset. This may inform strategies for more
general dependent measures, and we complement these results with
independent-arm lower bounds.
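For independent Bernoulli arms the objective has a closed form: the expected max over a subset S is 1 - prod_{i in S}(1 - p_i). A small sketch of exhaustive evaluation under that independence assumption (illustrative only; the identification problem is estimating the p_i from max-only feedback):

```python
from itertools import combinations

def best_subset_independent(p, k):
    """Exhaustive Best-of-K search for independent Bernoulli arms.

    With independent arms of success probabilities p[i], the expected
    max over a subset S is 1 - prod_{i in S}(1 - p[i]), so the best
    subset is simply the K arms with the largest p[i].  The hard part
    the paper addresses is estimating the p[i] from max-only feedback,
    where observing max = 1 occludes which arm fired.
    """
    def expected_max(subset):
        q = 1.0
        for i in subset:
            q *= 1.0 - p[i]
        return 1.0 - q

    best = max(combinations(range(len(p)), k), key=expected_max)
    return best, expected_max(best)
```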
Asynchronous Parallel Empirical Variance Guided Algorithms for the Thresholding Bandit Problem
This paper considers the multi-armed thresholding bandit problem --
identifying all arms whose expected rewards are above a predefined threshold
via as few pulls (or rounds) as possible -- recently proposed by Locatelli et
al. [2016]. Although the algorithm of Locatelli et al. [2016] achieves
the optimal round complexity in a certain sense, there still remain unsolved
issues. This paper proposes an asynchronous parallel thresholding algorithm and
its parameter-free version to improve efficiency and applicability. On
one hand, the two proposed algorithms use the empirical variance to guide the
pull decision at each round, and significantly improve the round complexity of
the "optimal" algorithm when all arms have bounded high order moments. The
proposed algorithms can be proven to be optimal. On the other hand, most bandit
algorithms assume that the reward can be observed immediately after the pull or
the next decision would not be made before all rewards are observed. Our
proposed asynchronous parallel algorithms allow making the choice of the next
pull with unobserved rewards from earlier pulls, which avoids such an
unrealistic assumption and significantly improves the identification process.
Our theoretical analysis justifies the effectiveness and efficiency of the
proposed asynchronous parallel algorithms.
Comment: added lower bound
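A sketch of what variance-guided pulling can look like, in the spirit of empirical-Bernstein confidence radii (not the exact rule from the paper): pull the arm whose classification against the threshold is least certain, with radii that shrink faster for low-variance arms.

```python
import math

def variance_guided_pull(sums, sqsums, counts, theta, t):
    """Choose the next arm to pull in a thresholding bandit.

    A sketch in the spirit of empirical-Bernstein radii, not the exact
    rule from the paper: pull the arm whose classification against the
    threshold theta is least certain, with a radius that shrinks faster
    for low-variance arms.
    """
    scores = []
    for i in range(len(counts)):
        n = counts[i]
        if n < 2:
            return i                       # make sure every arm is seen
        mean = sums[i] / n
        var = max(sqsums[i] / n - mean * mean, 0.0)
        rad = (math.sqrt(2.0 * var * math.log(t + 1) / n)
               + 3.0 * math.log(t + 1) / n)  # empirical-Bernstein radius
        scores.append((abs(mean - theta) - rad, i))
    return min(scores)[1]                  # least confidently classified arm
```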
A Survey on Practical Applications of Multi-Armed and Contextual Bandits
In recent years, the multi-armed bandit (MAB) framework has attracted a lot of
attention in various applications, from recommender systems and information
retrieval to healthcare and finance, due to its stellar performance combined
with certain attractive properties, such as learning from less feedback. The
multi-armed bandit field is currently flourishing, as novel problem settings
and algorithms motivated by various practical applications are being
introduced, building on top of the classical bandit problem. This article aims
to provide a comprehensive review of top recent developments in multiple
real-life applications of the multi-armed bandit. Specifically, we introduce a
taxonomy of common MAB-based applications and summarize the state of the art for each
of those domains. Furthermore, we identify important current trends and provide
new perspectives pertaining to the future of this exciting and fast-growing
field.
Comment: under review by IJCAI 2019 Survey track
Combinatorial Bandits with Relative Feedback
We consider combinatorial online learning with subset choices when only
relative feedback information from subsets is available, instead of bandit or
semi-bandit feedback which is absolute. Specifically, we study two regret
minimisation problems over subsets of a finite ground set $[n]$, with
subset-wise relative preference information feedback according to the
Multinomial logit choice model. In the first setting, the learner can play
subsets of size bounded by a maximum size $k$ and receives top-$m$
rank-ordered feedback, while in the second setting the learner can play
subsets of a fixed size $k$ with a full subset ranking observed as feedback.
For both settings, we
devise instance-dependent and order-optimal regret algorithms with regret
$O(\frac{n}{m} \ln T)$ and $O(\frac{n}{k} \ln T)$, respectively. We derive
fundamental limits on the regret performance of online learning with
subset-wise preferences, proving the tightness of our regret guarantees. Our
results also show the value of eliciting more general top-$m$ rank-ordered
feedback over single-winner feedback ($m = 1$). Our theoretical results are
corroborated with empirical evaluations.
Comment: 47 pages, 12 figures
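The feedback model itself is easy to simulate: under the Multinomial logit (Plackett-Luce) model, the winner of a played subset is drawn in proportion to item weights, and repeating on the remainder yields a top-$m$ ranking. A minimal sketch with illustrative names:

```python
import random

def mnl_top_m_feedback(weights, subset, m):
    """Draw top-m rank-ordered feedback from a Multinomial logit model.

    Under MNL, the winner of a played subset S is item i with
    probability weights[i] / sum_{j in S} weights[j]; removing the
    winner and repeating yields a top-m ranking (Plackett-Luce
    sampling).  weights are the latent MNL parameters; names are
    illustrative.
    """
    remaining = list(subset)
    ranking = []
    for _ in range(min(m, len(remaining))):
        total = sum(weights[i] for i in remaining)
        r = random.random() * total
        acc = 0.0
        for i in remaining:
            acc += weights[i]
            if r <= acc:
                ranking.append(i)
                remaining.remove(i)
                break
    return ranking
```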
A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit
Adaptive and sequential experiment design is a well-studied area in numerous
domains. We survey and synthesize the work on the online statistical learning
paradigm referred to as multi-armed bandits, integrating the existing research
as a resource for a certain class of online experiments. We first explore the
traditional stochastic model of a multi-armed bandit, then explore a taxonomic
scheme of complications to that model, relating each complication to a
specific requirement or consideration of the experiment-design context.
Finally, we present a table of known upper bounds on regret for all studied
algorithms, providing both perspectives for future theoretical work and a
decision-making tool for practitioners looking for theoretical guarantees.
Comment: 49 pages, 1 figure
Learning to Hire Teams
Crowdsourcing and human computation have been employed in increasingly
sophisticated projects that require the solution of a heterogeneous set of
tasks. We explore the challenge of building or hiring an effective team to
perform the tasks required by such projects on an ongoing basis, from an
available pool of applicants or workers who have bid for the tasks. The
recruiter needs to learn workers' skills and expertise by performing online
tests and interviews, and would like to minimize the amount of budget or time
spent in this process before committing to hiring the team. How can one
optimally spend budget to learn the expertise of workers as part of recruiting
a team? How can one exploit the similarities among tasks as well as underlying
social ties or commonalities among the workers for faster learning? We tackle
these decision-theoretic challenges by casting them as an instance of online
learning for best action selection. We present algorithms with PAC bounds on
the required budget to hire a near-optimal team with high confidence.
Furthermore, we consider an embedding of the tasks and workers in an underlying
graph that may arise from task similarities or social ties, and that can
provide additional side-observations for faster learning. We then quantify the
improvement in the bounds that we can achieve depending on the characteristic
properties of this graph structure. We evaluate our methodology on simulated
problem instances as well as on real-world crowdsourcing data collected from
the oDesk platform. Our methodology and results present an interesting
direction of research to tackle the challenges faced by a recruiter for
contract-based crowdsourcing.
Comment: Short version of this paper will appear in HCOMP'15
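A sketch of the best-action-selection view with graph side-observations, assuming a hypothetical `test(w)` that returns observed scores for worker w and, via task similarity or social ties, its graph neighbours (a successive-elimination sketch, not the paper's method):

```python
import math

def hire_best_worker(test, n_workers, delta, budget):
    """Successive elimination with graph side-observations (a sketch).

    test(w) runs one paid test of worker w and returns a dict of
    observed scores in [0, 1] for w *and* its graph neighbours (the
    free side-observations from task similarity or social ties).
    Assumes the budget covers at least one sweep over all workers.
    Hypothetical interface, not the paper's method.
    """
    sums = [0.0] * n_workers
    counts = [0] * n_workers
    active = set(range(n_workers))
    spent = 0

    def radius(v):
        return math.sqrt(math.log(4 * n_workers * counts[v] ** 2 / delta)
                         / (2 * counts[v]))

    while len(active) > 1 and spent < budget:
        for w in list(active):
            for v, score in test(w).items():   # w plus side-observations
                sums[v] += score
                counts[v] += 1
            spent += 1
        best_lcb = max(sums[v] / counts[v] - radius(v) for v in active)
        # Drop workers whose upper bound falls below the best lower bound.
        active = {v for v in active
                  if sums[v] / counts[v] + radius(v) >= best_lcb}
    return max(active, key=lambda v: sums[v] / counts[v])
```

The denser the side-observation graph, the more observations each paid test yields, which is the source of the improved budget bounds the abstract describes.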