Asymptotically Optimal Pure Exploration for Infinite-Armed Bandits

Abstract

We study pure exploration with infinitely many bandit arms generated i.i.d. from an unknown distribution. Our goal is to efficiently select a single high quality arm whose average reward is, with probability $1-\delta$, within $\varepsilon$ of being among the top $\eta$-fraction of arms; this is a natural adaptation of the classical PAC guarantee for infinite action sets. We consider both the fixed confidence and fixed budget settings, aiming respectively for minimal expected and fixed sample complexity. For fixed confidence, we give an algorithm with expected sample complexity $O\left(\frac{\log(1/\eta)\log(1/\delta)}{\eta\varepsilon^2}\right)$. This is optimal except for the $\log(1/\eta)$ factor, and the $\delta$-dependence closes a quadratic gap in the literature. For fixed budget, we show the asymptotically optimal sample complexity as $\delta\to 0$ is $c^{-1}\log(1/\delta)\big(\log\log(1/\delta)\big)^2$ to leading order. Equivalently, the optimal failure probability given exactly $N$ samples decays as $\exp\big(-cN/\log^2 N\big)$, up to a factor $1\pm o_N(1)$ inside the exponent. The constant $c$ depends explicitly on the problem parameters (including the unknown arm distribution) through a certain Fisher information distance. Even the strictly super-linear dependence on $\log(1/\delta)$ was not known and resolves a question of Grossman and Moshkovitz (FOCS 2016, SIAM Journal on Computing 2020).
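
The equivalence between the two fixed-budget statements can be checked in a few lines. The following is only a sketch: $L=\log(1/\delta)$ is shorthand introduced here (not notation from the paper), and $c$ is treated as a fixed positive constant as $\delta\to 0$.

\[
N = c^{-1} L (\log L)^2
\;\Longrightarrow\;
\log N = \log L + 2\log\log L - \log c = (1+o(1))\log L,
\]
\[
\frac{cN}{\log^2 N} = \frac{L(\log L)^2}{(1+o(1))(\log L)^2} = (1+o(1))\,L,
\qquad\text{so}\qquad
\delta = e^{-L} = \exp\!\big(-(1\pm o(1))\,cN/\log^2 N\big).
\]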
