Deterministic Sequencing of Exploration and Exploitation for Multi-Armed Bandit Problems
In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with
unknown reward models. At each time, a player selects one arm to play, aiming
to maximize the total expected reward over a horizon of length T. An approach
based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is
developed for constructing sequential arm selection policies. It is shown that
for all light-tailed reward distributions, DSEE achieves the optimal
logarithmic order of the regret, where regret is defined as the total expected
reward loss against the ideal case with known reward models. For heavy-tailed
reward distributions, DSEE achieves $O(T^{1/p})$ regret when the moments of the
reward distributions exist up to the $p$th order for $1 < p \le 2$, and
$O(T^{1/(1+p/2)})$ regret for $p > 2$. With the knowledge of an upper bound on a
finite moment of the heavy-tailed reward distributions, DSEE offers the optimal logarithmic regret
order. The proposed DSEE approach complements existing work on MAB by providing
corresponding results for general reward distributions. Furthermore, with a
clearly defined tunable parameter (the cardinality of the exploration sequence),
the DSEE approach is easily extendable to variations of MAB, including MAB with
various objectives, decentralized MAB with multiple players and incomplete
reward observations under collisions, MAB with unknown Markov dynamics, and
combinatorial MAB with dependent arms that often arise in network optimization
problems such as the shortest path, minimum spanning tree, and dominating
set problems under unknown random weights.
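As a concrete illustration of the deterministic interleaving described above, here is a minimal Python sketch of a DSEE-style policy. It is a toy under stated assumptions, not the paper's construction: the paper organizes exploration in epochs, while this sketch simply explores whenever the least-explored arm falls below a logarithmic budget. The function `pull` and the budget constant `w` (standing in for the cardinality parameter of the exploration sequence) are hypothetical names. The key property is preserved: whether a round is exploration or exploitation depends only on the round index and the deterministic schedule, never on the observed rewards.

```python
import numpy as np

def dsee(pull, n_arms, horizon, w=2.0):
    """DSEE-style policy (simplified sketch, not the paper's epoch structure).

    pull(arm) -> float reward; w tunes the cardinality of the
    exploration sequence (the abstract's tunable parameter).
    """
    counts = np.zeros(n_arms)  # exploration pulls per arm
    sums = np.zeros(n_arms)    # rewards observed during exploration
    rewards = []
    for t in range(1, horizon + 1):
        if counts.min() < w * np.log(t + 1):
            # Exploration slot: the schedule depends only on t and the
            # deterministic pull counts, never on observed rewards.
            arm = int(np.argmin(counts))
            r = pull(arm)
            counts[arm] += 1
            sums[arm] += r
        else:
            # Exploitation slot: play the empirical best arm, using
            # sample means computed from exploration observations only.
            arm = int(np.argmax(sums / counts))
            r = pull(arm)
        rewards.append(r)
    return np.array(rewards)

# Toy usage: three Gaussian arms with means 0.3, 0.5, 0.7.
rng = np.random.default_rng(0)
means = [0.3, 0.5, 0.7]
total_reward = dsee(lambda a: rng.normal(means[a], 1.0), 3, 10_000).sum()
```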
Kernelized Reinforcement Learning with Order Optimal Regret Bounds
Reinforcement learning (RL) has shown empirical success in various real world
settings with complex models and large state-action spaces. The existing
analytical results, however, typically focus on settings with a small number of
state-actions or simple models such as linearly modeled state-action value
functions. To derive RL policies that efficiently handle large state-action
spaces with more general value functions, some recent works have considered
nonlinear function approximation using kernel ridge regression. We propose
$\pi$-KRVI, an optimistic modification of least-squares value iteration, when
the state-action value function is represented by an RKHS. We prove the first
order-optimal regret guarantees under a general setting. Our results show a
significant improvement over the state of the art that is polynomial in the
number of episodes. In particular, with highly non-smooth kernels (such as the
Neural Tangent kernel or some Mat\'ern kernels), the existing results lead to
trivial (superlinear in the number of episodes) regret bounds. We show a
sublinear regret bound that is order optimal in the case of Mat\'ern kernels,
where a lower bound on regret is known.
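The optimistic construction underlying algorithms of this kind can be sketched in a few lines of kernel ridge regression. The following toy computes the mean-plus-width upper-confidence value at query points; the episodic value-iteration loop, the choice of the confidence multiplier, and the specifics of $\pi$-KRVI are all omitted, and the Mat\'ern-5/2 kernel and the constants `lam` and `beta` are assumptions for the demo, not the paper's settings.

```python
import numpy as np

def matern52(X, Z, ls=0.5):
    """Matern-5/2 kernel matrix between the rows of X and the rows of Z."""
    d = np.sqrt(((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    a = np.sqrt(5.0) * d / ls
    return (1.0 + a + a * a / 3.0) * np.exp(-a)

def optimistic_value(X, y, queries, lam=1.0, beta=2.0):
    """Kernel ridge regression mean plus beta times the predictive width
    at each query point: the upper-confidence state-action value that an
    optimistic value-iteration scheme would maximize over actions."""
    A = matern52(X, X) + lam * np.eye(len(y))
    Ks = matern52(queries, X)              # (m, n) cross-kernel
    mean = Ks @ np.linalg.solve(A, y)
    # Predictive variance: k(q, q) - k_q^T (K + lam I)^{-1} k_q, with k(q, q) = 1.
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(A, Ks.T))
    return mean + beta * np.sqrt(np.maximum(var, 0.0))

# Toy usage on scalar "state-action" features.
X = np.linspace(0.0, 1.0, 8)[:, None]
y = np.sin(3.0 * X[:, 0])
q = np.random.default_rng(1).uniform(0.0, 1.0, size=(5, 1))
ucb_values = optimistic_value(X, y, q)
```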
Random Exploration in Bayesian Optimization: Order-Optimal Regret and Computational Efficiency
We consider Bayesian optimization using Gaussian Process models, also
referred to as kernel-based bandit optimization. We study the methodology of
exploring the domain using random samples drawn from a distribution. We show
that this random exploration approach achieves the optimal error rates. Our
analysis is based on novel concentration bounds in an infinite-dimensional
Hilbert space established in this work, which may be of independent interest.
We further develop an algorithm based on random exploration with domain
shrinking and establish its order-optimal regret guarantees under both
noise-free and noisy settings. In the noise-free setting, our analysis closes
the existing gap in regret performance and thereby resolves a COLT open
problem. The proposed algorithm also enjoys a computational advantage over
prevailing methods due to the random exploration that obviates the expensive
optimization of a non-convex acquisition function for choosing the query points
at each iteration.
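To make the computational advantage concrete, here is a minimal sketch of Bayesian optimization by pure random exploration: every query point is an i.i.d. uniform draw, so no non-convex acquisition function is ever optimized, and the GP posterior is used only once at the end to report a point. The RBF kernel, the unit-cube domain, and the constants are assumptions for the demo, and the paper's domain-shrinking component is not shown.

```python
import numpy as np

def rbf(X, Z, ls=0.2):
    """RBF kernel matrix between the rows of X and the rows of Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def random_exploration_bo(f, dim, n_queries, noise=0.1, seed=0):
    """Query i.i.d. uniform points (no acquisition optimization),
    then report the query with the highest GP posterior mean."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_queries, dim))   # random exploration
    y = f(X) + noise * rng.standard_normal(n_queries)  # noisy observations
    K = rbf(X, X) + noise ** 2 * np.eye(n_queries)
    mu = rbf(X, X) @ np.linalg.solve(K, y)  # posterior mean at the queries
    return X[np.argmax(mu)]

# Toy usage: maximize a smooth function on [0, 1]^2.
x_hat = random_exploration_bo(
    lambda X: np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1]), dim=2, n_queries=200)
```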
Regret Bounds for Noise-Free Bayesian Optimization
Bayesian optimization is a powerful method for non-convex black-box
optimization in low-data regimes. However, establishing tight upper bounds
for common algorithms in the noiseless setting remains largely open. In this
paper, we establish new bounds, the tightest known, for two
algorithms, namely GP-UCB and Thompson sampling, under the assumption that the
objective function is smooth in terms of having a bounded norm in a Mat\'ern
RKHS. Importantly, unlike several related works, we do not assume perfect
knowledge of the kernel of the Gaussian process emulator used within the
Bayesian optimization loop. This allows us to provide results for practical
algorithms that sequentially estimate the Gaussian process kernel parameters
from the available data.
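The practical ingredient, estimating the kernel parameters sequentially inside the optimization loop, can be sketched as follows. This is a generic GP-UCB round with a grid-search maximum-likelihood lengthscale, not the paper's algorithm or analysis; the Mat\'ern-5/2 kernel, the candidate grid, the small jitter (standing in for the noise-free likelihood), and the multiplier `beta` are all assumptions.

```python
import numpy as np

def matern52(X, Z, ls):
    """Matern-5/2 kernel matrix between the rows of X and the rows of Z."""
    d = np.sqrt(((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    a = np.sqrt(5.0) * d / ls
    return (1.0 + a + a * a / 3.0) * np.exp(-a)

def log_marginal_likelihood(X, y, ls, jitter=1e-8):
    """GP log marginal likelihood (up to constants) for a given lengthscale."""
    K = matern52(X, X, ls) + jitter * np.eye(len(y))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet)

def gp_ucb_step(X, y, candidates, beta=2.0, jitter=1e-8):
    """One GP-UCB round that first re-estimates the lengthscale by maximum
    likelihood on the data gathered so far (the sequential kernel estimation
    the abstract refers to), then maximizes mean + beta * std over candidates."""
    ls = max([0.05, 0.1, 0.2, 0.5, 1.0],
             key=lambda l: log_marginal_likelihood(X, y, l, jitter))
    K = matern52(X, X, ls) + jitter * np.eye(len(y))
    Ks = matern52(candidates, X, ls)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return candidates[np.argmax(mu + beta * np.sqrt(np.maximum(var, 0.0)))]
```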
Near-Optimal Collaborative Learning in Bandits
This paper introduces a general multi-agent bandit model in which each agent
is facing a finite set of arms and may communicate with other agents through a
central controller in order to identify, in pure exploration, or play, in
regret minimization, its optimal arm. The twist is that the optimal arm for
each agent is the arm with the largest expected mixed reward, where the mixed
reward of an arm is a weighted sum of the rewards of this arm for all agents.
This often makes communication between agents necessary. This general setting
allows us to recover and extend several recent models for collaborative bandit
learning, including the recently proposed federated learning with
personalization (Shi et al., 2021). In this paper, we provide new lower bounds
on the sample complexity of pure exploration and on the regret. We then propose
a near-optimal algorithm for pure exploration. This algorithm is based on
phased elimination with two novel ingredients: a data-dependent sampling scheme
within each phase, aimed at matching a relaxation of the lower bound.
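The mixed-reward objective is simple to state in code. Below is a hedged sketch of one elimination step under that objective: an agent's value for an arm is the weighted sum of the per-agent mean rewards, confidence widths propagate through the same weights, and arms that are suboptimal even under optimism are dropped. This is a generic phased-elimination step for the model described above, not the paper's near-optimal algorithm; in particular, its data-dependent sampling scheme within each phase is not shown.

```python
import numpy as np

def eliminate(mu_hat, width, weights, active):
    """One elimination step for the mixed-reward model.

    mu_hat  : (N, K) per-agent sample means of arm rewards
    width   : (N, K) confidence widths for those means
    weights : (M, N) mixing weights; agent m values arm k at
              sum_n weights[m, n] * mu[n, k]
    active  : dict agent -> list of still-active arms
    """
    mixed = weights @ mu_hat            # (M, K) estimated mixed rewards
    slack = np.abs(weights) @ width     # widths propagated by the weights
    lo, hi = mixed - slack, mixed + slack
    new_active = {}
    for m, arms in active.items():
        best_lo = max(lo[m, k] for k in arms)
        # Keep an arm only if it could still be optimal for agent m.
        new_active[m] = [k for k in arms if hi[m, k] >= best_lo]
    return new_active

# Toy usage: 2 agents, 3 arms, personalization weights on the rows.
mu_hat = np.array([[0.2, 0.5, 0.4], [0.6, 0.1, 0.3]])
width = np.full((2, 3), 0.05)
weights = np.array([[0.8, 0.2], [0.2, 0.8]])
active = eliminate(mu_hat, width, weights, {0: [0, 1, 2], 1: [0, 1, 2]})
```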