917 research outputs found
SPRT-based Efficient Best Arm Identification in Stochastic Bandits
This paper investigates the best arm identification (BAI) problem in
stochastic multi-armed bandits in the fixed confidence setting. The general
class of the exponential family of bandits is considered. The state-of-the-art
algorithms for the exponential family of bandits face computational challenges.
To mitigate these challenges, a novel framework is proposed, which views the
BAI problem as sequential hypothesis testing, and is amenable to tractable
analysis for the exponential family of bandits. Based on this framework, a BAI
algorithm is designed that leverages the canonical sequential probability ratio
tests. This algorithm has three features: (1) its sample
complexity is asymptotically optimal, (2) it is guaranteed to be δ-PAC,
and (3) it addresses the computational challenge of the state-of-the-art
approaches. Specifically, these approaches, which are focused only on the
Gaussian setting, require Thompson sampling from the arm that is deemed the
best and a challenger arm. This paper analytically shows that identifying the
challenger is computationally expensive and that the proposed algorithm
circumvents it. Finally, numerical experiments are provided to support the
analysis.
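As a rough illustration of the sequential-testing viewpoint, the sketch below phrases a pairwise GLR/SPRT-style stopping check for Gaussian arms; the function `glr_stop`, its threshold, and the Gaussian assumption are illustrative choices and do not reproduce the paper's exact statistic or sampling rule.

```python
import numpy as np

def glr_stop(counts, means, sigma=1.0, delta=0.05):
    """Pairwise GLR/SPRT-style stopping check for Gaussian arms.

    Returns (stop, best_arm). Illustrative only: the paper's exact test
    statistics, thresholds, and sampling rule are not reproduced here.
    """
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    best = int(np.argmax(means))
    stats = []
    for a in range(len(means)):
        if a == best:
            continue
        # GLR between the empirically best arm and arm a under a
        # common-variance Gaussian model: weighted squared mean gap.
        n_b, n_a = counts[best], counts[a]
        gap = means[best] - means[a]
        stats.append((n_b * n_a / (n_b + n_a)) * gap ** 2 / (2 * sigma ** 2))
    threshold = np.log(1.0 / delta)  # schematic; practical thresholds also grow with t
    return min(stats) > threshold, best
```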
Decentralized Exploration in Multi-Armed Bandits
We consider the decentralized exploration problem: a set of players
collaborate to identify the best arm by asynchronously interacting with the
same stochastic environment. The objective is to ensure privacy in the best arm
identification problem among asynchronous, collaborative, and thrifty
players. In the context of a digital service, we advocate that this
decentralized approach allows a good balance between the interests of users and
those of service providers: the providers optimize their services, while
protecting the privacy of the users and saving resources. We define the privacy
level as the amount of information an adversary could infer by intercepting the
messages concerning a single user. We provide a generic algorithm Decentralized
Elimination, which uses any best arm identification algorithm as a subroutine.
We prove that this algorithm ensures privacy, with a low communication cost,
and that in comparison to the lower bound of the best arm identification
problem, its sample complexity suffers from a penalty depending on the inverse
of the probability of the most frequent players. Then, thanks to the genericity
of the approach, we extend the proposed algorithm to non-stationary
bandits. Finally, experiments illustrate and complete the analysis.
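To make the elimination-by-messages idea concrete, here is a schematic sketch in which a simple Hoeffding-style elimination stands in for the generic BAI subroutine; the callable `env(player, arm)`, the thresholds, and the synchronous phases are illustrative assumptions rather than the paper's protocol.

```python
import numpy as np

def decentralized_elimination_sketch(env, n_players, n_arms, n_phases, delta=0.05):
    """Schematic phased elimination with arm-set (not reward) messages.

    `env(player, arm)` is a hypothetical callable returning a reward in [0, 1].
    A Hoeffding-style elimination rule stands in for the generic BAI
    subroutine; the real protocol, thresholds, and privacy analysis differ.
    """
    active = set(range(n_arms))
    sums = np.zeros((n_players, n_arms))
    pulls = np.zeros((n_players, n_arms))
    for phase in range(1, n_phases + 1):
        for p in range(n_players):              # asynchronous in the real setting
            for a in active:
                sums[p, a] += env(p, a)
                pulls[p, a] += 1
        for p in range(n_players):
            # Each player broadcasts only the arms it eliminates, never raw rewards.
            arms = sorted(active)
            mu = sums[p, arms] / pulls[p, arms]
            rad = np.sqrt(np.log(2 * n_arms * phase ** 2 / delta) / (2 * phase))
            best_lcb = mu.max() - rad
            active -= {a for a, m in zip(arms, mu) if m + rad < best_lcb}
        if len(active) == 1:
            break
    pooled = sums.sum(axis=0) / np.maximum(pulls.sum(axis=0), 1)
    return max(active, key=lambda a: pooled[a])
```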
Best Arm Identification in Stochastic Bandits: Beyond β-optimality
This paper investigates a hitherto unaddressed aspect of best arm
identification (BAI) in stochastic multi-armed bandits in the fixed-confidence
setting. Two key metrics for assessing bandit algorithms are computational
efficiency and performance optimality (e.g., in sample complexity). In
stochastic BAI literature, there have been advances in designing algorithms to
achieve optimal performance, but they are generally computationally expensive
to implement (e.g., optimization-based methods). There also exist approaches
with high computational efficiency, but they have provable gaps to the optimal
performance (e.g., the β-optimal approaches in top-two methods). This
paper introduces a framework and an algorithm for BAI that achieves optimal
performance with a computationally efficient set of decision rules. The central
process that facilitates this is a routine for sequentially estimating the
optimal allocations up to sufficient fidelity. Specifically, these estimates
are accurate enough for identifying the best arm (hence, achieving optimality)
but not so accurate that computing them incurs excessive
computational cost (hence, maintaining efficiency). Furthermore, the
existing relevant literature focuses on the exponential family of
distributions. This paper considers the more general setting of an arbitrary
family of distributions parameterized by their mean values (under mild
regularity conditions). The optimality is established analytically, and
numerical evaluations are provided to assess the analytical guarantees and
compare the performance with that of existing approaches.
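The following sketch illustrates the general pattern of tracking a cheaply estimated allocation combined with a GLR-style stopping check; the inverse-squared-gap estimate, the `pull(a)` callable, and the threshold are illustrative assumptions and not the paper's estimation routine.

```python
import numpy as np

def track_and_stop_sketch(pull, n_arms, delta=0.05, max_pulls=100_000, sigma=1.0):
    """Sketch of allocation tracking with a cheap plug-in allocation estimate.

    `pull(a)` is a hypothetical callable returning one (roughly Gaussian)
    reward. The inverse-squared-gap allocation below is a crude stand-in for
    the paper's sequential estimation routine; the stopping threshold is
    schematic.
    """
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for a in range(n_arms):                     # initialize: pull each arm once
        sums[a] += pull(a)
        counts[a] += 1
    for t in range(n_arms, max_pulls):
        mu = sums / counts
        best = int(np.argmax(mu))
        # Schematic GLR stopping check against the closest competitor.
        glr = min((counts[best] * counts[b] / (counts[best] + counts[b]))
                  * (mu[best] - mu[b]) ** 2 / (2 * sigma ** 2)
                  for b in range(n_arms) if b != best)
        if glr > np.log((1 + np.log(t)) / delta):
            return best
        # Coarse plug-in estimate of the optimal sampling proportions.
        gaps = np.maximum(mu[best] - mu, 1e-3)
        w = 1.0 / gaps ** 2
        w[best] = np.sqrt(np.sum(np.delete(w, best) ** 2))
        w /= w.sum()
        arm = int(np.argmin(counts - t * w))    # pull the most under-sampled arm
        sums[arm] += pull(arm)
        counts[arm] += 1
    return int(np.argmax(sums / counts))
```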
Bayesian Best-Arm Identification for Selecting Influenza Mitigation Strategies
Pandemic influenza has the epidemic potential to kill millions of people.
While various preventive measures exist (i.a., vaccination and school
closures), deciding on strategies that lead to their most effective and
efficient use remains challenging. To this end, individual-based
epidemiological models are essential to assist decision makers in determining
the best strategy to curb epidemic spread. However, individual-based models are
computationally intensive, and it is therefore pivotal to identify the optimal
strategy using a minimal number of model evaluations. Additionally, as
epidemiological modeling experiments need to be planned, a computational budget
needs to be specified a priori. Consequently, we present a new sampling
technique to optimize the evaluation of preventive strategies using fixed
budget best-arm identification algorithms. We use epidemiological modeling
theory to derive knowledge about the reward distribution which we exploit using
Bayesian best-arm identification algorithms (i.e., Top-two Thompson sampling
and BayesGap). We evaluate these algorithms in a realistic experimental setting
and demonstrate that it is possible to identify the optimal strategy using only
a limited number of model evaluations, i.e., 2 to 3 times faster than
the uniform sampling method, the predominant technique used for epidemiological
decision making in the literature. Finally, we contribute and evaluate a
statistic for Top-two Thompson sampling to inform the decision makers about the
confidence of an arm recommendation.
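For reference, a minimal sketch of Top-two Thompson sampling in the fixed-budget setting is given below, using a simplified Gaussian posterior; the `evaluate(a)` callable is a hypothetical stand-in for one run of the epidemic model, and the priors and final recommendation are simplified relative to the paper's Bayesian setup.

```python
import numpy as np

def ttts_fixed_budget(evaluate, n_arms, budget, beta=0.5, seed=None):
    """Minimal Top-Two Thompson Sampling sketch with Gaussian posteriors.

    `evaluate(a)` is a hypothetical stand-in for one stochastic run of the
    epidemic model under strategy `a`. Priors, posteriors, and the final
    recommendation are simplified relative to the paper's setup.
    """
    rng = np.random.default_rng(seed)
    counts = np.ones(n_arms)                    # one pseudo-observation per arm
    sums = np.zeros(n_arms)
    for _ in range(budget):
        # Posterior over each arm's mean, here simply Normal(mean, 1/counts).
        sample = rng.normal(sums / counts, 1.0 / np.sqrt(counts))
        leader = int(np.argmax(sample))
        arm = leader
        if rng.random() > beta:
            # Re-sample until a different arm tops the draw (capped for safety);
            # that arm plays the role of the challenger.
            for _ in range(100):
                challenger = int(np.argmax(rng.normal(sums / counts,
                                                      1.0 / np.sqrt(counts))))
                if challenger != leader:
                    arm = challenger
                    break
        sums[arm] += evaluate(arm)
        counts[arm] += 1
    return int(np.argmax(sums / counts))        # recommend the empirically best arm
```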
Simple regret for infinitely many armed bandits
We consider a stochastic bandit problem with infinitely many arms. In this
setting, the learner has no chance of trying all the arms even once and has to
dedicate its limited number of samples only to a certain number of arms. All
previous algorithms for this setting were designed for minimizing the
cumulative regret of the learner. In this paper, we propose an algorithm aiming
at minimizing the simple regret. As in the cumulative regret setting of
infinitely many armed bandits, the rate of the simple regret will depend on a
parameter β characterizing the distribution of the near-optimal arms. We
prove that depending on β, our algorithm is minimax optimal either up to
a multiplicative constant or up to a log(n) factor. We also provide
extensions to several important cases: when β is unknown, in a natural
setting where the near-optimal arms have a small variance, and in the case of
unknown time horizon.
Comment: In the 32nd International Conference on Machine Learning (ICML 2015).
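A bare-bones sketch of the subsample-then-identify idea follows; the `draw_arm()` and `pull(arm)` callables and the choice of subsample size are illustrative assumptions, not the paper's algorithm or tuning.

```python
import numpy as np

def simple_regret_sketch(draw_arm, pull, budget, beta=1.0):
    """Subsample-then-identify sketch for infinitely many arms.

    `draw_arm()` returns a fresh arm from the reservoir and `pull(arm)` returns
    one reward; both are hypothetical. The subsample size k ~ budget**(beta/(beta+1))
    only illustrates how k should scale with the tail parameter beta; it is not
    the paper's exact tuning, nor does it use the paper's BAI subroutine.
    """
    k = max(2, int(budget ** (beta / (beta + 1.0))))
    arms = [draw_arm() for _ in range(k)]
    per_arm = max(1, budget // k)               # uniform allocation over the subsample
    means = [np.mean([pull(a) for _ in range(per_arm)]) for a in arms]
    return arms[int(np.argmax(means))]          # recommend the empirical best
```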