Good Arm Identification via Bandit Feedback
We consider a novel stochastic multi-armed bandit problem called {\em good
arm identification} (GAI), where a good arm is defined as an arm with expected
reward greater than or equal to a given threshold. GAI is a pure-exploration
problem in which a single agent repeatedly outputs an arm as soon as it is
identified as good, before confirming that the other arms are actually not
good. The objective of GAI is to minimize the number of samples for each such
identification. We find that GAI faces a new kind of dilemma, the {\em
exploration-exploitation dilemma of confidence}, which poses a difficulty
different from that of best arm identification. As a result, an efficient
design of
algorithms for GAI is quite different from that for the best arm
identification. We derive a lower bound on the sample complexity of GAI that is
tight up to the logarithmic factor for
acceptance error rate . We also develop an algorithm whose sample
complexity almost matches the lower bound. Finally, we confirm experimentally
that the proposed algorithm outperforms naive algorithms in synthetic settings
based on a conventional bandit problem and on clinical trial research for
rheumatoid arthritis.
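To make the protocol concrete, here is a minimal sketch of a GAI-style confidence-bound loop, assuming rewards in [0, 1] and anytime Hoeffding intervals. The names `sample`, `threshold`, and `delta` are illustrative, and this is a generic sketch rather than the paper's algorithm:

```python
import math

def gai(sample, n_arms, threshold, delta, horizon=100_000):
    """Generic good-arm-identification loop with Hoeffding bounds.

    sample(i) draws one reward from arm i; rewards assumed in [0, 1].
    threshold and delta (acceptance error rate) are the GAI inputs.
    A sketch, not the algorithm proposed in the paper above.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    undecided = set(range(n_arms))
    good = []

    def radius(i):
        # Anytime Hoeffding-style confidence radius (union bound over arms).
        return math.sqrt(math.log(4 * n_arms * counts[i] ** 2 / delta)
                         / (2 * counts[i]))

    for _ in range(horizon):
        if not undecided:
            break
        # Pull the undecided arm with the highest upper confidence bound.
        i = max(undecided,
                key=lambda j: (sums[j] / counts[j] + radius(j))
                if counts[j] else float("inf"))
        sums[i] += sample(i)
        counts[i] += 1
        mean = sums[i] / counts[i]
        if mean - radius(i) >= threshold:    # confidently good: output now
            good.append(i)
            undecided.discard(i)
        elif mean + radius(i) < threshold:   # confidently not good: discard
            undecided.discard(i)
    return good
```

An arm is output the moment its lower confidence bound clears the threshold, which is exactly the "output as soon as identified" behavior described above.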
Causal Bandits: Learning Good Interventions via Causal Inference
We study the problem of using causal models to improve the rate at which good
interventions can be learned online in a stochastic environment. Our formalism
combines multi-armed bandits and causal inference to model a novel type of bandit
feedback that is not exploited by existing approaches. We propose a new
algorithm that exploits the causal feedback and prove a bound on its simple
regret that is strictly better (in all quantities) than algorithms that do not
use the additional causal information.
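As a toy illustration of the kind of feedback a causal model exposes (hypothetical names, not the paper's algorithm): in a parallel causal graph, a single observational sample reveals the value every variable happened to take, so it simultaneously updates the reward estimate of every intervention arm do(var = val) consistent with that sample:

```python
from collections import defaultdict

def causal_simple_regret(env, arms, rounds):
    """Toy illustration of causal bandit feedback (hypothetical names).

    env() returns (assignment, reward): the value each causal variable
    took and the resulting reward.  A single sample then updates the
    estimate of every intervention arm do(var=val) consistent with the
    assignment -- the extra feedback a causal model exposes, which a
    standard bandit treating arms as unrelated would waste.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for _ in range(rounds):
        assignment, reward = env()
        for var, val in assignment.items():
            if (var, val) in arms:          # credit every consistent arm
                total[(var, val)] += reward
                count[(var, val)] += 1
    # Recommend the intervention with the best empirical mean (simple regret).
    return max(arms, key=lambda a: total[a] / count[a] if count[a] else -1.0)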
Best arm identification in multi-armed bandits with delayed feedback
We propose a generalization of the best arm identification problem in
stochastic multi-armed bandits (MAB) to the setting where every pull of an arm
is associated with delayed feedback. The delay in feedback increases the
effective sample complexity of standard algorithms, but can be offset if we
have access to partial feedback received before a pull is completed. We propose
a general framework to model the relationship between partial and delayed
feedback, and as a special case we introduce efficient algorithms for settings
where the partial feedback is a biased or an unbiased estimator of the delayed
feedback. Additionally, we propose a novel extension of the algorithms to the
parallel MAB setting where an agent can control a batch of arms. Our
experiments in real-world settings, involving policy search and hyperparameter
optimization in computational sustainability domains for fast charging of
batteries and wildlife corridor construction, demonstrate that exploiting the
structure of partial feedback can lead to significant improvements over
baselines in both sequential and parallel MAB.
Comment: AISTATS 2018
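One way to picture the partial-feedback mechanism is a running mean that books an unbiased partial estimate immediately and swaps in the true reward once the delay elapses. A minimal simulation sketch, assuming hypothetical `pull` and `partial` callables (these names are not from the paper):

```python
import heapq

def run_with_partial_feedback(pull, partial, delay, n_arms, horizon):
    """Running means that use partial feedback while pulls mature.

    pull(i) starts a pull of arm i and returns its eventual reward
    (revealed only `delay` steps later); partial(i) is an immediate
    unbiased estimate of that reward.  Names are illustrative, and the
    round-robin pulling below stands in for a real BAI sampling rule.
    """
    sums = [0.0] * n_arms
    counts = [0] * n_arms
    pending = []  # heap of (reveal_time, arm, true_reward, partial_estimate)

    for t in range(horizon):
        # Matured pulls: replace each partial estimate with the truth.
        while pending and pending[0][0] <= t:
            _, i, true_r, est_r = heapq.heappop(pending)
            sums[i] += true_r - est_r
        i = t % n_arms                     # placeholder sampling rule
        est = partial(i)
        sums[i] += est                     # book the partial estimate now
        counts[i] += 1
        heapq.heappush(pending, (t + delay, i, pull(i), est))
    return sums, counts
```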
Instrument-Armed Bandits
We extend the classic multi-armed bandit (MAB) model to the setting of
noncompliance, where the arm pull is a mere instrument and the treatment
applied may differ from it, which gives rise to the instrument-armed bandit
(IAB) problem. The IAB setting is relevant whenever the experimental units are
human since free will, ethics, and the law may prohibit unrestricted or forced
application of treatment. In particular, the setting is relevant in bandit
models of dynamic clinical trials and other controlled trials on human
interventions. Nonetheless, the setting has not been fully investigated in the
bandit literature. We show that there are various and divergent notions of
regret in this setting, all of which coincide only in the classic MAB setting.
We characterize the behavior of these regrets and analyze standard MAB
algorithms. We argue for a particular kind of regret that captures the causal
effect of treatments but show that standard MAB algorithms cannot achieve
sublinear control on this regret. Instead, we develop new algorithms for the
IAB problem, prove new regret bounds for them, and compare them to standard MAB
algorithms in numerical examples.
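The core twist is that the learner's pull is only an instrument for the treatment actually taken. A toy simulation of this noncompliance, with an eps-greedy learner and hypothetical `comply` and `outcome_mean` inputs (not the paper's algorithms):

```python
import random

def simulate_iab(comply, outcome_mean, n_arms, horizon, eps=0.1):
    """Toy instrument-armed bandit run (hypothetical inputs).

    The pulled arm z is only an instrument: the treatment actually taken
    is d = comply(z), and the reward is a noisy draw around
    outcome_mean[d].  Eps-greedy keeps statistics per *instrument*,
    which is why regret on instruments and regret on treatments (the
    causal effect) can diverge under noncompliance.
    """
    sums = [0.0] * n_arms
    counts = [0] * n_arms
    history = []
    for t in range(horizon):
        if t < n_arms:                       # pull each instrument once
            z = t
        elif random.random() < eps:
            z = random.randrange(n_arms)
        else:
            z = max(range(n_arms), key=lambda i: sums[i] / counts[i])
        d = comply(z)                        # noncompliance: d may differ from z
        r = outcome_mean[d] + random.gauss(0.0, 0.1)
        sums[z] += r
        counts[z] += 1
        history.append((z, d, r))
    return history
```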
Best-of-K Bandits
This paper studies the Best-of-K Bandit game: At each time the player chooses
a subset S among all N-choose-K possible options and observes reward max(X(i) :
i in S) where X is a random vector drawn from a joint distribution. The
objective is to identify the subset that achieves the highest expected reward
with high probability using as few queries as possible. We present
distribution-dependent lower bounds based on a particular construction which
force a learner to consider all N-choose-K subsets, and match naive extensions
of known upper bounds in the bandit setting obtained by treating each subset as
a separate arm. Nevertheless, we present evidence that exhaustive search may be
avoided for certain, favorable distributions because the influence of
higher-order correlations may be dominated by lower-order statistics.
Finally, we present an algorithm and analysis for independent arms, which
mitigates the surprising non-trivial information occlusion that occurs due to
only observing the max in the subset. This may inform strategies for more
general dependent measures, and we complement these results with
independent-arm lower bounds.
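For independent Bernoulli arms the objective has a closed form: the expected max over a subset S is 1 - prod_{i in S}(1 - p_i). A small sketch of exhaustive evaluation under that independence assumption (illustrative only; the identification problem is estimating the p_i from max-only feedback):

```python
from itertools import combinations

def best_subset_independent(p, k):
    """Exhaustive Best-of-K search for independent Bernoulli arms.

    With independent arms of success probabilities p[i], the expected
    max over a subset S is 1 - prod_{i in S}(1 - p[i]), so the best
    subset is simply the K arms with the largest p[i].  The hard part
    the paper addresses is estimating the p[i] from max-only feedback,
    where observing max = 1 occludes which arm fired.
    """
    def expected_max(subset):
        q = 1.0
        for i in subset:
            q *= 1.0 - p[i]
        return 1.0 - q

    best = max(combinations(range(len(p)), k), key=expected_max)
    return best, expected_max(best)
```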
Asynchronous Parallel Empirical Variance Guided Algorithms for the Thresholding Bandit Problem
This paper considers the multi-armed thresholding bandit problem --
identifying all arms whose expected rewards are above a predefined threshold
via as few pulls (or rounds) as possible -- recently proposed by Locatelli et
al. [2016]. Although the algorithm of Locatelli et al. [2016] achieves
the optimal round complexity in a certain sense, there still remain unsolved
issues. This paper proposes an asynchronous parallel thresholding algorithm and
its parameter-free version to improve efficiency and applicability. On
one hand, the two proposed algorithms use the empirical variance to guide the
pull decision at each round, and significantly improve the round complexity of
the "optimal" algorithm when all arms have bounded high order moments. The
proposed algorithms can be proven to be optimal. On the other hand, most bandit
algorithms assume that the reward can be observed immediately after the pull or
the next decision would not be made before all rewards are observed. Our
proposed asynchronous parallel algorithms allow making the choice of the next
pull with unobserved rewards from earlier pulls, which avoids such an
unrealistic assumption and significantly improves the identification process.
Our theoretical analysis justifies the effectiveness and efficiency of the
proposed asynchronous parallel algorithms.
Comment: added lower bound
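A sketch of what variance-guided pulling can look like, in the spirit of empirical-Bernstein confidence radii (not the exact rule from the paper): pull the arm whose classification against the threshold is least certain, with radii that shrink faster for low-variance arms.

```python
import math

def variance_guided_pull(sums, sqsums, counts, theta, t):
    """Choose the next arm to pull in a thresholding bandit.

    A sketch in the spirit of empirical-Bernstein radii, not the exact
    rule from the paper: pull the arm whose classification against the
    threshold theta is least certain, with a radius that shrinks faster
    for low-variance arms.
    """
    scores = []
    for i in range(len(counts)):
        n = counts[i]
        if n < 2:
            return i                       # make sure every arm is seen
        mean = sums[i] / n
        var = max(sqsums[i] / n - mean * mean, 0.0)
        rad = (math.sqrt(2.0 * var * math.log(t + 1) / n)
               + 3.0 * math.log(t + 1) / n)  # empirical-Bernstein radius
        scores.append((abs(mean - theta) - rad, i))
    return min(scores)[1]                  # least confidently classified arm
```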
A Survey on Practical Applications of Multi-Armed and Contextual Bandits
In recent years, the multi-armed bandit (MAB) framework has attracted a lot of
attention in various applications, from recommender systems and information
retrieval to healthcare and finance, due to its stellar performance combined
with certain attractive properties, such as learning from less feedback. The
multi-armed bandit field is currently flourishing, as novel problem settings
and algorithms motivated by various practical applications are being
introduced, building on top of the classical bandit problem. This article aims
to provide a comprehensive review of top recent developments in multiple
real-life applications of the multi-armed bandit. Specifically, we introduce a
taxonomy of common MAB-based applications and summarize the state of the art for each
of those domains. Furthermore, we identify important current trends and provide
new perspectives pertaining to the future of this exciting and fast-growing
field.
Comment: under review by IJCAI 2019 Survey track
Combinatorial Bandits with Relative Feedback
We consider combinatorial online learning with subset choices when only
relative feedback information from subsets is available, instead of bandit or
semi-bandit feedback which is absolute. Specifically, we study two regret
minimisation problems over subsets of a finite ground set $[n]$, with
subset-wise relative preference information feedback according to the
Multinomial logit choice model. In the first setting, the learner can play
subsets of size bounded by a maximum size $k$ and receives top-$m$
rank-ordered feedback, while in the second setting the learner can play
subsets of a fixed size $k$ with a full subset ranking observed as feedback.
For both settings, we
devise instance-dependent and order-optimal regret algorithms with regret
$O(\frac{n}{m} \ln T)$ and $O(\frac{n}{k} \ln T)$, respectively. We derive
fundamental limits on the regret performance of online learning with
subset-wise preferences, proving the tightness of our regret guarantees. Our
results also show the value of eliciting more general top-$m$ rank-ordered
feedback over single-winner feedback ($m = 1$). Our theoretical results are
corroborated with empirical evaluations.
Comment: 47 pages, 12 figures
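The feedback model itself is easy to simulate: under the Multinomial logit (Plackett-Luce) model, the winner of a played subset is drawn in proportion to item weights, and repeating on the remainder yields a top-$m$ ranking. A minimal sketch with illustrative names:

```python
import random

def mnl_top_m_feedback(weights, subset, m):
    """Draw top-m rank-ordered feedback from a Multinomial logit model.

    Under MNL, the winner of a played subset S is item i with
    probability weights[i] / sum_{j in S} weights[j]; removing the
    winner and repeating yields a top-m ranking (Plackett-Luce
    sampling).  weights are the latent MNL parameters; names are
    illustrative.
    """
    remaining = list(subset)
    ranking = []
    for _ in range(min(m, len(remaining))):
        total = sum(weights[i] for i in remaining)
        r = random.random() * total
        acc = 0.0
        for i in remaining:
            acc += weights[i]
            if r <= acc:
                ranking.append(i)
                remaining.remove(i)
                break
    return ranking
```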
A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit
Adaptive and sequential experiment design is a well-studied area in numerous
domains. We survey and synthesize the work on the online statistical learning
paradigm referred to as multi-armed bandits, integrating the existing research
as a resource for a certain class of online experiments. We first explore the
traditional stochastic model of a multi-armed bandit, then explore a taxonomic
scheme of complications to that model, relating each complication to a
specific requirement or consideration of the experiment-design context.
Finally, we present a table of known upper bounds on regret for all studied
algorithms, providing both perspectives for future theoretical work and a
decision-making tool for practitioners looking for theoretical guarantees.
Comment: 49 pages, 1 figure
Learning to Hire Teams
Crowdsourcing and human computation have been employed in increasingly
sophisticated projects that require the solution of a heterogeneous set of
tasks. We explore the challenge of building or hiring an effective team to
perform the tasks required by such projects on an ongoing basis, from an
available pool of applicants or workers who have bid for the tasks. The
recruiter needs to learn workers' skills and expertise by performing online
tests and interviews, and would like to minimize the amount of budget or time
spent in this process before committing to hiring the team. How can one
optimally spend budget to learn the expertise of workers as part of recruiting
a team? How can one exploit the similarities among tasks as well as underlying
social ties or commonalities among the workers for faster learning? We tackle
these decision-theoretic challenges by casting them as an instance of online
learning for best action selection. We present algorithms with PAC bounds on
the required budget to hire a near-optimal team with high confidence.
Furthermore, we consider an embedding of the tasks and workers in an underlying
graph that may arise from task similarities or social ties, and that can
provide additional side-observations for faster learning. We then quantify the
improvement in the bounds that we can achieve depending on the characteristic
properties of this graph structure. We evaluate our methodology on simulated
problem instances as well as on real-world crowdsourcing data collected from
the oDesk platform. Our methodology and results present an interesting
direction of research to tackle the challenges faced by a recruiter for
contract-based crowdsourcing.
Comment: Short version of this paper will appear in HCOMP'15
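A sketch of the best-action-selection view with graph side-observations, assuming a hypothetical `test(w)` that returns observed scores for worker w and, via task similarity or social ties, its graph neighbours (a successive-elimination sketch, not the paper's method):

```python
import math

def hire_best_worker(test, n_workers, delta, budget):
    """Successive elimination with graph side-observations (a sketch).

    test(w) runs one paid test of worker w and returns a dict of
    observed scores in [0, 1] for w *and* its graph neighbours (the
    free side-observations from task similarity or social ties).
    Assumes the budget covers at least one sweep over all workers.
    Hypothetical interface, not the paper's method.
    """
    sums = [0.0] * n_workers
    counts = [0] * n_workers
    active = set(range(n_workers))
    spent = 0

    def radius(v):
        return math.sqrt(math.log(4 * n_workers * counts[v] ** 2 / delta)
                         / (2 * counts[v]))

    while len(active) > 1 and spent < budget:
        for w in list(active):
            for v, score in test(w).items():   # w plus side-observations
                sums[v] += score
                counts[v] += 1
            spent += 1
        best_lcb = max(sums[v] / counts[v] - radius(v) for v in active)
        # Drop workers whose upper bound falls below the best lower bound.
        active = {v for v in active
                  if sums[v] / counts[v] + radius(v) >= best_lcb}
    return max(active, key=lambda v: sums[v] / counts[v])
```

The denser the side-observation graph, the more observations each paid test yields, which is the source of the improved budget bounds the abstract describes.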