
20 research outputs found

Stochastic differential equations for limiting description of UCB rule for Gaussian multi-armed bandits

Author: Garbar Sergey
Publication date: 11/05/2022

We consider the upper confidence bound strategy for Gaussian multi-armed bandits with known control horizon sizes $N$ and build its limiting description with a system of stochastic differential equations and ordinary differential equations. Rewards for the arms are assumed to have unknown expected values and known variances. A set of Monte-Carlo simulations was performed for the case of close distributions of rewards, when mean rewards differ by a magnitude of order $N^{-1/2}$, as it yields the highest normalized regret, to verify the validity of the obtained description. The minimal size of the control horizon when the normalized regret is not noticeably larger than the maximum possible was estimated.

Comment: 9 pages, 2 figure
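As a rough illustration of the setting in this abstract, the sketch below runs a generic UCB rule on a two-armed Gaussian bandit whose mean rewards differ by a quantity of order $N^{-1/2}$. It is not the author's exact strategy: the bonus term `a * sigma * sqrt(log(t) / n_i)`, the exploration weight `a`, the scaled gap `d`, and the normalization of regret by `sqrt(N)` are all assumptions made for this example.

```python
import math
import random


def run_ucb(means, sigma, horizon, a=1.0, seed=0):
    """Run a generic UCB rule on a Gaussian bandit with known variance sigma**2.

    Returns the cumulative pseudo-regret and the number of pulls per arm.
    The exploration weight `a` is a hypothetical tuning parameter.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            # Pull every arm once so that each empirical mean is defined.
            arm = t - 1
        else:
            # Empirical mean plus a confidence bonus that shrinks with pulls.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + a * sigma * math.sqrt(math.log(t) / counts[i]),
            )
        reward = rng.gauss(means[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward
        # Pseudo-regret: expected loss relative to always pulling the best arm.
        regret += best - means[arm]
    return regret, counts


N = 2000  # control horizon
d = 2.0   # hypothetical scaled gap, so the means differ by 2 * d / sqrt(N)
means = [d / math.sqrt(N), -d / math.sqrt(N)]
regret, counts = run_ucb(means, sigma=1.0, horizon=N)
normalized_regret = regret / math.sqrt(N)  # assumed normalization
```

Averaging `normalized_regret` over many seeds and several values of `d` would give a crude Monte-Carlo picture of how the normalized regret depends on the gap between the arms.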

arXiv.org e-Print Archive

From Random Search to Bandit Learning in Metric Measure Spaces

Author: Feng Yasong
Han Chuying
Wang Tianyu
Publication date: 23/05/2023

Random Search is one of the most widely-used methods for Hyperparameter Optimization, and is critical to the success of deep learning models. Despite its astonishing performance, little non-heuristic theory has been developed to describe the underlying working mechanism. This paper gives a theoretical accounting of Random Search. We introduce the concept of \emph{scattering dimension} that describes the landscape of the underlying function, and quantifies the performance of random search. We show that, when the environment is noise-free, the output of random search converges to the optimal value in probability at rate $\widetilde{\mathcal{O}} \left( \left( \frac{1}{T} \right)^{ \frac{1}{d_s} } \right)$, where $d_s \ge 0$ is the scattering dimension of the underlying function. When the observed function values are corrupted by bounded i.i.d. noise, the output of random search converges to the optimal value in probability at rate $\widetilde{\mathcal{O}} \left( \left( \frac{1}{T} \right)^{ \frac{1}{d_s + 1} } \right)$. In addition, based on the principles of random search, we introduce an algorithm, called BLiN-MOS, for Lipschitz bandits in doubling metric spaces that are also endowed with a Borel measure, and show that BLiN-MOS achieves a regret rate of order $\widetilde{\mathcal{O}} \left( T^{ \frac{d_z}{d_z + 1} } \right)$, where $d_z$ is the zooming dimension of the problem instance. Our results show that under certain conditions, the known information-theoretical lower bounds for Lipschitz bandits $\Omega \left( T^{\frac{d_z+1}{d_z+2}} \right)$ can be improved.
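For readers unfamiliar with the basic procedure analysed in this abstract, the following is a minimal sketch of pure random search in the noise-free case. It does not implement BLiN-MOS and does not compute the scattering dimension. The objective `objective`, the uniform sampler `sample_unit_interval`, and the choice to maximize rather than minimize are hypothetical and chosen only for illustration.

```python
import random


def random_search(f, sample, budget, seed=0):
    """Evaluate f at `budget` random points and return the best point found."""
    rng = random.Random(seed)
    best_x, best_val = None, float("-inf")
    for _ in range(budget):
        x = sample(rng)
        val = f(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


def objective(x):
    # Hypothetical noise-free test function with maximum value 0 at x = 0.3.
    return -abs(x - 0.3)


def sample_unit_interval(rng):
    # Uniform sampling measure on [0, 1).
    return rng.random()


# Gap between the optimal value (0) and the best value found, per budget T.
gaps = {}
for budget in (10, 100, 1000):
    _, best_val = random_search(objective, sample_unit_interval, budget)
    gaps[budget] = 0.0 - best_val
```

Because every run uses the same seed, the smaller budgets see a prefix of the samples drawn for the larger ones, so the recorded gap can only shrink as the budget `T` grows. Estimating an actual convergence rate would require averaging over many independent seeds.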

arXiv.org e-Print Archive

Reward Imputation with Sketching for Contextual Batched Bandits

Author: Shao Ninglu
Si Zihua
Su Hanjing
Wang Wenhan
Wen Ji-Rong
Xu Jun
Zhang Xiao
Publication date: 07/10/2023

Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode, but the rewards of the non-executed actions are unobserved, resulting in partial-information feedback. Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information. In this paper, we propose an efficient approach called Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching, which approximates the full-information feedback. We formulate reward imputation as an imputation-regularized ridge regression problem that captures the feedback mechanisms of both executed and non-executed actions. To reduce time complexity, we solve the regression problem using randomized sketching. We prove that our approach achieves an instantaneous regret with controllable bias and smaller variance than approaches without reward imputation. Furthermore, our approach enjoys a sublinear regret bound against the optimal policy. We also present two extensions, a rate-scheduled version and a version for nonlinear rewards, making our approach more practical. Experimental results show that SPUIR outperforms state-of-the-art baselines on synthetic, public benchmark, and real-world datasets.

Comment: Accepted by NeurIPS 202

arXiv.org e-Print Archive

Online Learning of Energy Consumption for Navigation of Electric Vehicles

Author: Chehreghani Morteza Haghir
Chen Yuxin
Åkerblom Niklas
Publication venue: 'Elsevier BV'
Publication date: 01/01/2023

Energy-efficient navigation constitutes an important challenge in electric vehicles, due to their limited battery capacity. We employ a Bayesian approach to model the energy consumption at road segments for efficient navigation. To learn the model parameters, we develop an online learning framework and investigate several exploration strategies such as Thompson Sampling and Upper Confidence Bound. We then extend our online learning framework to the multi-agent setting, where multiple vehicles adaptively navigate and learn the parameters of the energy model. We analyze Thompson Sampling and establish rigorous regret bounds on its performance in the single-agent and multi-agent settings, through an analysis of the algorithm under batched feedback. Finally, we demonstrate the performance of our methods via experiments on several real-world city road networks.

Comment: Extension of arXiv:2003.0141

arXiv.org e-Print Archive

Chalmers Research

Knowledge UChicago