Stochastic differential equations for limiting description of UCB rule for Gaussian multi-armed bandits
We consider the upper confidence bound strategy for Gaussian multi-armed
bandits with known control horizon sizes and build its limiting description
with a system of stochastic differential equations and ordinary differential
equations. Rewards for the arms are assumed to have unknown expected values and
known variances. A set of Monte-Carlo simulations was performed for the case of
close distributions of rewards, when mean rewards differ by the magnitude of
order , as it yields the highest normalized regret, to verify the
validity of the obtained description. The minimal size of the control horizon
when the normalized regret is not noticeably larger than the maximum possible was
estimated. Comment: 9 pages, 2 figures
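The kind of Monte-Carlo check described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the exploration constant `a`, the index formula, the function names, and the normalization by the square root of variance times horizon are all assumptions made for the example, and the gap between mean rewards is left as a free input because the abstract's expression for it did not survive extraction.

```python
import math
import random


def ucb_regret(means, horizon, variance=1.0, a=2.0, rng=None):
    """Run one UCB episode on Gaussian arms and return cumulative pseudo-regret.

    `a` is a hypothetical tuning constant for the exploration bonus.
    """
    rng = rng or random.Random()
    k = len(means)
    counts = [0] * k          # number of pulls per arm
    sums = [0.0] * k          # sum of observed rewards per arm
    best = max(means)
    sigma = math.sqrt(variance)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # pull every arm once to initialise estimates
        else:
            # Index = sample mean + confidence bonus (assumed form).
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(a * variance * math.log(t) / counts[i]),
            )
        reward = rng.gauss(means[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret


def mean_normalized_regret(means, horizon, variance=1.0, runs=200, seed=0):
    """Average regret over `runs` episodes, divided by sqrt(variance * horizon)."""
    rng = random.Random(seed)
    total = sum(
        ucb_regret(means, horizon, variance=variance, rng=rng)
        for _ in range(runs)
    )
    return total / runs / math.sqrt(variance * horizon)
```

Sweeping the gap between the two means for a fixed horizon, and then repeating for several horizons, would give a rough picture of where the normalized regret peaks under these assumptions.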
From Random Search to Bandit Learning in Metric Measure Spaces
Random Search is one of the most widely used methods for Hyperparameter
Optimization, and is critical to the success of deep learning models. Despite
its astonishing performance, little non-heuristic theory has been developed to
describe the underlying working mechanism. This paper gives a theoretical
accounting of Random Search. We introduce the concept of \emph{scattering
dimension} that describes the landscape of the underlying function, and
quantifies the performance of random search. We show that, when the environment
is noise-free, the output of random search converges to the optimal value in
probability at rate , where is the scattering
dimension of the underlying function. When the observed function values are
corrupted by bounded noise, the output of random search converges to the
optimal value in probability at rate . In addition, based on the
principles of random search, we introduce an algorithm, called BLiN-MOS, for
Lipschitz bandits in doubling metric spaces that are also endowed with a Borel
measure, and show that BLiN-MOS achieves a regret rate of order , where
is the zooming dimension of the problem instance. Our results show that under
certain conditions, the known information-theoretical lower bounds for
Lipschitz bandits can be improved.
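For readers unfamiliar with the baseline being analysed, plain random search in the noise-free setting can be sketched as below. This is only an illustration under stated assumptions: sampling is uniform over a box, the objective is maximised, and the names `random_search`, `f`, and `bounds` are hypothetical. It does not implement BLiN-MOS or the scattering-dimension analysis.

```python
import random


def random_search(f, bounds, n_samples, rng=None):
    """Return the best point and value found by uniform random sampling.

    `bounds` is a list of (low, high) pairs, one per coordinate.
    Maximisation is assumed here.
    """
    rng = rng or random.Random()
    best_x, best_val = None, float("-inf")
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        val = f(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


def toy_objective(x):
    """Hypothetical objective with maximum value 0 at (0.3, -0.2)."""
    return -((x[0] - 0.3) ** 2) - ((x[1] + 0.2) ** 2)


if __name__ == "__main__":
    box = [(-1.0, 1.0), (-1.0, 1.0)]
    for n in (10, 100, 1000):
        _, value = random_search(toy_objective, box, n, random.Random(0))
        print(n, value)
```

With a fixed seed, a longer run extends the sample sequence of a shorter run, so the best value found never gets worse as `n_samples` grows.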
Reward Imputation with Sketching for Contextual Batched Bandits
Contextual batched bandit (CBB) is a setting where a batch of rewards is
observed from the environment at the end of each episode, but the rewards of
the non-executed actions are unobserved, resulting in partial-information
feedback. Existing approaches for CBB often ignore the rewards of the
non-executed actions, leading to underutilization of feedback information. In
this paper, we propose an efficient approach called Sketched Policy Updating
with Imputed Rewards (SPUIR) that completes the unobserved rewards using
sketching, which approximates the full-information feedback. We formulate
reward imputation as an imputation regularized ridge regression problem that
captures the feedback mechanisms of both executed and non-executed actions. To
reduce time complexity, we solve the regression problem using randomized
sketching. We prove that our approach achieves an instantaneous regret with
controllable bias and smaller variance than approaches without reward
imputation. Furthermore, our approach enjoys a sublinear regret bound against
the optimal policy. We also present two extensions, a rate-scheduled version
and a version for nonlinear rewards, making our approach more practical.
Experimental results show that SPUIR outperforms state-of-the-art baselines on
synthetic, public benchmark, and real-world datasets. Comment: Accepted by NeurIPS 202
Online Learning of Energy Consumption for Navigation of Electric Vehicles
Energy-efficient navigation constitutes an important challenge for electric
vehicles, due to their limited battery capacity. We employ a Bayesian approach
to model the energy consumption at road segments for efficient navigation. In
order to learn the model parameters, we develop an online learning framework
and investigate several exploration strategies such as Thompson Sampling and
Upper Confidence Bound. We then extend our online learning framework to the
multi-agent setting, where multiple vehicles adaptively navigate and learn the
parameters of the energy model. We analyze Thompson Sampling and establish
rigorous regret bounds on its performance in the single-agent and multi-agent
settings, through an analysis of the algorithm under batched feedback. Finally,
we demonstrate the performance of our methods via experiments on several
real-world city road networks. Comment: Extension of arXiv:2003.0141
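A single-agent Thompson Sampling loop of the kind this abstract mentions might look like the sketch below. It is a simplified illustration, not the paper's model: it assumes a Normal prior per road segment with known observation noise, a small fixed set of candidate routes instead of a shortest-path search, and hypothetical names (`SegmentPosterior`, `thompson_route`, `run`) and default parameters.

```python
import random


class SegmentPosterior:
    """Normal posterior over a segment's mean energy use (known noise variance)."""

    def __init__(self, prior_mean=1.0, prior_var=1.0, noise_var=0.25):
        self.mean = prior_mean
        self.var = prior_var
        self.noise_var = noise_var

    def sample(self, rng):
        return rng.gauss(self.mean, self.var ** 0.5)

    def update(self, observed):
        # Conjugate Normal-Normal update for one noisy observation.
        precision = 1.0 / self.var + 1.0 / self.noise_var
        self.mean = (self.mean / self.var + observed / self.noise_var) / precision
        self.var = 1.0 / precision


def thompson_route(paths, posteriors, rng):
    """Sample one energy value per segment, then pick the cheapest route."""
    sampled = {seg: post.sample(rng) for seg, post in posteriors.items()}
    return min(paths, key=lambda path: sum(sampled[seg] for seg in path))


def run(paths, true_means, rounds, seed=0):
    """Simulate repeated trips; `true_means` is the hidden energy per segment."""
    rng = random.Random(seed)
    posteriors = {seg: SegmentPosterior() for seg in true_means}
    choices = []
    for _ in range(rounds):
        path = thompson_route(paths, posteriors, rng)
        for seg in path:
            noise_sd = posteriors[seg].noise_var ** 0.5
            posteriors[seg].update(rng.gauss(true_means[seg], noise_sd))
        choices.append(path)
    return choices, posteriors
```

For example, `run([("a", "b"), ("c",)], {"a": 1.0, "b": 1.0, "c": 0.5}, 300)` simulates a choice between a two-segment route and a cheaper one-segment route; only segments on the travelled route receive feedback, which is what makes exploration necessary.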