We study pure exploration with infinitely many bandit arms generated i.i.d.
from an unknown distribution. Our goal is to efficiently select a single
high-quality arm whose average reward is, with probability 1−δ, within
ε of being among the top η-fraction of arms; this is a natural
adaptation of the classical PAC guarantee for infinite action sets. We consider
both the fixed confidence and fixed budget settings, aiming, respectively,
for minimal expected sample complexity and minimal fixed sample complexity.
For fixed confidence, we give an algorithm with expected sample complexity
O(η^{-1}ε^{-2}·log(1/η)·log(1/δ)). This is
optimal except for the log(1/η) factor, and the δ-dependence
closes a quadratic gap in the literature. For fixed budget, we show the
asymptotically optimal sample complexity as δ→0 is
c^{-1}·log(1/δ)·(log log(1/δ))^2 to leading order.
Equivalently, the optimal failure probability given exactly N samples decays
as exp(−cN/log^2 N), up to a factor 1 ± o_N(1) inside the
exponent. The constant c depends explicitly on the problem parameters
(including the unknown arm distribution) through a certain Fisher information
distance. Even the strictly super-linear dependence on log(1/δ) was not
known and resolves a question of Grossman and Moshkovitz (FOCS 2016, SIAM
Journal on Computing 2020)
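To make the fixed-confidence setting concrete, here is a minimal sketch of the textbook uniform-sampling baseline, not the algorithm of this work: draw enough arms from the unknown distribution that one of them is, with high probability, in the top η-fraction, then pull each enough times (via Hoeffding and a union bound) that empirical means are ε/2-accurate, and return the empirical best. The names `naive_pac_select`, `draw_arm`, and `pull` are illustrative assumptions, and the constants in the sample sizes are the usual textbook choices. Note that this baseline's total pull count scales like η^{-1}ε^{-2}·log(1/δ)·(log(1/η) + log(1/δ)), i.e. quadratically in log(1/δ); it is exactly this quadratic δ-dependence that the abstract's bound improves to linear.

```python
import math
import random


def naive_pac_select(draw_arm, pull, eta, eps, delta, rng):
    """Baseline (ε, η, δ)-PAC selection by uniform sampling (illustrative sketch).

    draw_arm(rng) -> samples an arm from the unknown arm distribution.
    pull(arm, rng) -> one reward in [0, 1] from that arm.
    """
    # Draw m arms so that, w.p. >= 1 - delta/2, at least one drawn arm
    # lies in the top eta-fraction: (1 - eta)^m <= exp(-eta * m) <= delta/2.
    m = math.ceil(math.log(2 / delta) / eta)
    arms = [draw_arm(rng) for _ in range(m)]

    # Pull each arm n times so that, by Hoeffding plus a union bound over
    # the m arms, every empirical mean is within eps/2 of its true mean
    # w.p. >= 1 - delta/2.
    n = math.ceil((2 / eps**2) * math.log(4 * m / delta))

    def empirical_mean(arm):
        return sum(pull(arm, rng) for _ in range(n)) / n

    best = max(arms, key=empirical_mean)
    return best, m, n


# Toy usage: an arm is identified with its mean, drawn Uniform(0, 1),
# and rewards are Bernoulli(mean).
rng = random.Random(0)
draw = lambda r: r.random()
pull = lambda mu, r: 1.0 if r.random() < mu else 0.0
best, m, n = naive_pac_select(draw, pull, eta=0.1, eps=0.1, delta=0.1, rng=rng)
```

With η = ε = δ = 0.1 this uses m·n pulls in total; the factor log(4m/δ) inside n carries the extra log(1/δ) that makes the baseline quadratic in log(1/δ), in contrast to the linear dependence achieved above.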