We consider a stochastic bandit problem with a possibly infinite number of
arms. We write p∗ for the proportion of optimal arms and Δ for the
minimal mean-gap between optimal and sub-optimal arms. We characterize the
optimal learning rates in both the cumulative regret setting and the
best-arm identification setting, in terms of the problem parameters T (the
budget), p∗ and Δ. For the objective of minimizing the cumulative
regret, we provide a lower bound of order Ω(log(T)/(p∗Δ)) and a
UCB-style algorithm with a matching upper bound up to a factor of
log(1/Δ). Our algorithm needs to know p∗ to calibrate its parameters, and we
prove that this knowledge is necessary: adapting to p∗ in this setting
is impossible. For best-arm identification, we also provide a lower bound of
order Ω(exp(−cTΔ²p∗)) on the probability of outputting a
sub-optimal arm, where c>0 is an absolute constant. We also provide an
elimination algorithm with an upper bound matching the lower bound up to a
factor of order log(T) in the exponential, and that does not need p∗ or
Δ as parameters. Our results apply directly to the three related problems
of competing against the j-th best arm, identifying an ϵ-good arm,
and finding an arm with mean larger than a quantile of a known order.
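
For concreteness, the sketch below (in Python) illustrates the generic "subsample, then run UCB" strategy for a reservoir with an optimal-arm proportion p∗; it is not the paper's algorithm, and the Bernoulli reservoir, the choice K = ⌈log(T)/p∗⌉, and the UCB1 index are illustrative assumptions. Drawing K such arms leaves at least one optimal arm in the subsample with probability at least 1 − 1/T, since (1 − p∗)^K ≤ exp(−K·p∗) ≤ 1/T.

# Minimal sketch (not the paper's exact algorithm): subsample K arms from the
# reservoir, then run a standard UCB1 index on the subsample.
import math
import numpy as np

rng = np.random.default_rng(0)

def draw_arm_mean(p_star: float) -> float:
    """Toy reservoir: an arm is optimal (mean 1.0) w.p. p_star, else mean 0.5."""
    return 1.0 if rng.random() < p_star else 0.5

def subsample_then_ucb(T: int, p_star: float) -> float:
    """Run UCB1 on K = ceil(log T / p_star) sampled arms; return realized regret."""
    K = max(1, math.ceil(math.log(T) / p_star))  # enough arms to include an optimal one w.h.p.
    means = np.array([draw_arm_mean(p_star) for _ in range(K)])
    counts = np.zeros(K)
    sums = np.zeros(K)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:                  # play each sampled arm once
            i = t - 1
        else:                       # then pick the arm with the largest UCB1 index
            ucb = sums / counts + np.sqrt(2.0 * math.log(t) / counts)
            i = int(np.argmax(ucb))
        reward = float(rng.random() < means[i])   # Bernoulli reward
        counts[i] += 1
        sums[i] += reward
        regret += 1.0 - means[i]    # regret against the best possible mean (1.0)
    return regret

print(subsample_then_ucb(T=10_000, p_star=0.05))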