Simple regret for infinitely many armed bandits
We consider a stochastic bandit problem with infinitely many arms. In this
setting, the learner has no chance of trying all the arms even once and has to
dedicate its limited number of samples only to a certain number of arms. All
previous algorithms for this setting were designed for minimizing the
cumulative regret of the learner. In this paper, we propose an algorithm aiming
at minimizing the simple regret. As in the cumulative regret setting of
infinitely many armed bandits, the rate of the simple regret will depend on a
parameter β characterizing the distribution of the near-optimal arms. We
prove that, depending on β, our algorithm is minimax optimal either up to
a multiplicative constant or up to a log(n) factor. We also provide
extensions to several important cases: when β is unknown, in a natural
setting where the near-optimal arms have a small variance, and in the case of
an unknown time horizon.
Comment: In the 32nd International Conference on Machine Learning (ICML 2015).
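To make the setting concrete, here is a minimal sketch (not the paper's algorithm) of a simple-regret strategy for infinitely many arms: draw a finite set of m arms from the reservoir, spread the budget of n pulls evenly over them, and recommend the empirical best. The reservoir sampler, the reward oracle, and the fixed choice of m are illustrative assumptions; the paper instead tunes the number of sampled arms to n and β.

```python
import numpy as np

def uniform_simple_regret(sample_arm, pull_arm, n, m, rng=None):
    """Illustrative baseline: sample m arms from the infinite reservoir,
    split the budget of n pulls evenly, and recommend the empirical best."""
    rng = rng or np.random.default_rng()
    arms = [sample_arm(rng) for _ in range(m)]
    pulls = max(1, n // m)  # assumes n >= m
    means = [np.mean([pull_arm(a, rng) for _ in range(pulls)]) for a in arms]
    return arms[int(np.argmax(means))]

# Example: Bernoulli arms whose means are drawn uniformly from [0, 1].
best = uniform_simple_regret(
    sample_arm=lambda rng: rng.uniform(),                # an arm is its mean
    pull_arm=lambda mu, rng: float(rng.random() < mu),   # Bernoulli reward
    n=10_000, m=100)
```

The simple regret of the recommendation is the gap between the best achievable mean reward and the mean of the returned arm.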
On Kernelized Multi-armed Bandits
We consider the stochastic bandit problem with a continuous set of arms, with
the expected reward function over the arms assumed to be fixed but unknown. We
provide two new Gaussian process-based algorithms for continuous bandit
optimization: Improved GP-UCB (IGP-UCB) and GP-Thompson sampling (GP-TS), and
derive corresponding regret bounds. Specifically, the bounds hold when the
expected reward function belongs to the reproducing kernel Hilbert space (RKHS)
that naturally corresponds to a Gaussian process kernel used as input by the
algorithms. Along the way, we derive a new self-normalized concentration
inequality for vector-valued martingales of arbitrary, possibly infinite,
dimension. Finally, an experimental evaluation and comparisons to existing
algorithms on synthetic and real-world environments are carried out, which
highlight the favorable gains of the proposed strategies in many cases.
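For intuition, the following is a minimal GP-UCB-style loop in the spirit of IGP-UCB, using scikit-learn's GaussianProcessRegressor as the posterior model. The finite candidate grid, the RBF kernel, and the fixed exploration weight beta are simplifying assumptions; the paper instead derives the confidence width from its self-normalized concentration inequality.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_ucb(f, candidates, T, beta=2.0, noise=0.1, seed=0):
    """Sketch: each round, fit a GP posterior to past observations and
    pull the candidate arm maximizing mean + beta * std."""
    rng = np.random.default_rng(seed)
    X = [candidates[rng.integers(len(candidates))]]      # random first pull
    y = [f(X[0]) + noise * rng.standard_normal()]
    for _ in range(T - 1):
        gp = GaussianProcessRegressor(kernel=RBF(), alpha=noise**2)
        gp.fit(np.array(X), np.array(y))
        mean, std = gp.predict(candidates, return_std=True)
        x = candidates[int(np.argmax(mean + beta * std))]
        X.append(x)
        y.append(f(x) + noise * rng.standard_normal())
    return np.array(X), np.array(y)

# Example: a 1-D arm set discretized to 200 candidates in [0, 1].
candidates = np.linspace(0, 1, 200).reshape(-1, 1)
X, y = gp_ucb(lambda x: float(np.sin(3 * x[0])), candidates, T=50)
```

A GP-TS variant would instead draw one sample from the posterior over the candidate grid and pull its argmax.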
Ballooning Multi-Armed Bandits
In this paper, we introduce Ballooning Multi-Armed Bandits (BL-MAB), a novel
extension of the classical stochastic MAB model. In the BL-MAB model, the set
of available arms grows (or balloons) over time. In contrast to the classical
MAB setting where the regret is computed with respect to the best arm overall,
the regret in a BL-MAB setting is computed with respect to the best available
arm at each time. We first observe that the existing stochastic MAB algorithms
result in linear regret for the BL-MAB model. We prove that, if the best arm is
equally likely to arrive at any time instant, a sub-linear regret cannot be
achieved. Next, we show that if the best arm is more likely to arrive in the
early rounds, one can achieve sub-linear regret. Our proposed algorithm
determines (1) the fraction of the time horizon for which the newly arriving
arms should be explored and (2) the sequence of arm pulls in the exploitation
phase from among the explored arms. Making reasonable assumptions on the
arrival distribution of the best arm in terms of the thinness of the
distribution's tail, we prove that the proposed algorithm achieves sub-linear
instance-independent regret. We further quantify the explicit dependence of the
regret on the arrival distribution parameters. We reinforce our theoretical findings
with extensive simulation results. We conclude by showing that our algorithm
would achieve sub-linear regret even if (a) the distributional parameters are
not exactly known, but are obtained using a reasonable learning mechanism or
(b) the best arm is not more likely to arrive early, but a large fraction of
arms is likely to arrive relatively early.
Comment: A full version of this paper is accepted in the Artificial Intelligence Journal (AIJ), Elsevier. A preliminary version was published as an extended abstract in AAMAS 2020, Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, 2020.
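As a rough illustration of the two design decisions above, the sketch below admits every arm arriving during the first fraction omega of the horizon into the explored set and then plays UCB over that set. The arrival process and the fixed value of omega are assumptions; the paper instead derives the exploration fraction from the tail of the best arm's arrival distribution.

```python
import math

def bl_mab(arrivals, pull, T, omega=0.5):
    """Sketch for a ballooning bandit: arrivals[t] lists the arms appearing
    at round t. Arms arriving before round omega*T are admitted; every round
    is played with a UCB index over the admitted arms only."""
    counts, sums, explored = {}, {}, []
    for t in range(T):
        if t < omega * T:                      # still admitting new arms
            for arm in arrivals.get(t, []):
                explored.append(arm)
                counts[arm], sums[arm] = 0, 0.0
        if not explored:
            continue                           # nothing has arrived yet
        def ucb(a):                            # unpulled arms come first
            if counts[a] == 0:
                return math.inf
            return sums[a] / counts[a] + math.sqrt(2 * math.log(t + 1) / counts[a])
        arm = max(explored, key=ucb)
        counts[arm] += 1
        sums[arm] += pull(arm, t)
    return counts, sums
```

With a thin-tailed arrival distribution for the best arm, most of the value lies in arms admitted early, which is why ignoring late arrivals can still yield sub-linear regret.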
Training a Single Bandit Arm
The stochastic multi-armed bandit problem captures the fundamental
exploration vs. exploitation tradeoff inherent in online decision-making in
uncertain settings. However, in several applications, the traditional objective
of maximizing the expected sum of rewards obtained can be inappropriate.
Motivated by the problem of optimizing job assignments to groom novice workers
with unknown trainability in labor platforms, we consider a new objective in
the classical setup. Instead of maximizing the expected total reward from T
pulls, we consider the vector of cumulative rewards earned from each of the K
arms at the end of the T pulls, and aim to maximize the expected value of the
highest cumulative reward. This corresponds to the objective of grooming a
single, highly skilled worker using a limited supply of training jobs.
For this new objective, we establish a lower bound on the regret that any
policy must incur in the worst case. We design an explore-then-commit
policy featuring exploration based on finely tuned confidence bounds on the
mean reward and an adaptive stopping criterion, which adapts to the problem
difficulty and comes with a corresponding worst-case regret guarantee. Our
numerical experiments demonstrate that this policy improves upon several
natural candidate policies for this setting.
Comment: 23 pages, 1 figure, 1 table.
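To make the objective concrete, here is a simplified explore-then-commit sketch for maximizing the expected highest cumulative reward: explore round-robin with Hoeffding-style confidence bounds and, once one arm's lower bound dominates every other arm's upper bound, commit the rest of the budget to it. The crude bounds and stopping rule are stand-ins for the paper's finely tuned versions.

```python
import math
import numpy as np

def etc_max_reward(pull, K, T):
    """Sketch: explore arms round-robin; when the empirical leader's lower
    confidence bound exceeds all other arms' upper bounds, commit every
    remaining pull to it. Returns the vector of cumulative rewards."""
    counts = np.zeros(K, dtype=int)
    sums = np.zeros(K)
    for t in range(T):
        if counts.min() > 0:
            means = sums / counts
            width = np.sqrt(2 * math.log(T) / counts)
            leader = int(np.argmax(means))
            others_ucb = np.delete(means + width, leader).max() if K > 1 else -math.inf
            if means[leader] - width[leader] >= others_ucb:
                for _ in range(T - t):         # commit phase
                    sums[leader] += pull(leader)
                    counts[leader] += 1
                break
        arm = int(np.argmin(counts))           # round-robin exploration
        sums[arm] += pull(arm)
        counts[arm] += 1
    return sums  # the objective is E[max over arms of sums]
```

Unlike the total-reward objective, rewards collected on arms other than the final leader are wasted here, which is why committing early, to the right arm, is valuable.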
A simple dynamic bandit algorithm for hyper-parameter tuning
Hyper-parameter tuning is a major part of modern machine learning systems. The tuning itself can be seen as a sequential resource allocation problem. As such, methods for multi-armed bandits have already been applied. In this paper, we view hyper-parameter optimization as an instance of best-arm identification in infinitely many-armed bandits. We propose D-TTTS, a new adaptive algorithm inspired by Thompson sampling, which dynamically balances between refining the estimate of the quality of hyper-parameter configurations previously explored and adding new hyper-parameter configurations to the pool of candidates. The algorithm is easy to implement and shows competitive performance compared to state-of-the-art algorithms for hyper-parameter tuning.
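As a loose illustration of the dynamic balance described above (refining known configurations versus adding new ones), here is a Thompson-sampling-style loop over a growing pool of hyper-parameter configurations. The Beta posteriors over scores in [0, 1], the sampler for fresh configurations, and the fixed probability p_new of enlarging the pool are illustrative assumptions, not the authors' D-TTTS.

```python
import numpy as np

def dynamic_ts(sample_config, evaluate, budget, p_new=0.3, seed=0):
    """Sketch: with probability p_new add a fresh configuration to the pool;
    otherwise re-evaluate the configuration whose Beta-posterior sample is
    largest. evaluate(config) must return a score in [0, 1]."""
    rng = np.random.default_rng(seed)
    pool = []                                  # entries: [config, alpha, beta]
    for _ in range(budget):
        if not pool or rng.random() < p_new:
            pool.append([sample_config(rng), 1.0, 1.0])   # fresh Beta(1, 1)
        samples = [rng.beta(a, b) for _, a, b in pool]    # Thompson step
        i = int(np.argmax(samples))
        score = evaluate(pool[i][0])
        pool[i][1] += score                    # fractional Bernoulli update
        pool[i][2] += 1.0 - score
    best = max(pool, key=lambda e: e[1] / (e[1] + e[2]))  # posterior mean
    return best[0]
```

For hyper-parameter tuning, sample_config would draw a configuration from the search space and evaluate would train and score a model with it, with the score normalized to [0, 1].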