Bandits with many optimal arms

Abstract

We consider a stochastic bandit problem with a possibly infinite number of arms. We write p^* for the proportion of optimal arms and Δ for the minimal mean-gap between optimal and sub-optimal arms. We characterize the optimal learning rates, both in the cumulative-regret setting and in the best-arm-identification setting, in terms of the problem parameters T (the budget), p^*, and Δ. For the objective of minimizing the cumulative regret, we provide a lower bound of order Ω(log(T)/(p^* Δ)) and a UCB-style algorithm with a matching upper bound up to a factor of log(1/Δ). Our algorithm needs p^* to calibrate its parameters, and we prove that this knowledge is necessary, since adapting to p^* in this setting is impossible. For best-arm identification we also provide a lower bound of order Ω(exp(-c T Δ² p^*)) on the probability of outputting a sub-optimal arm, where c > 0 is an absolute constant. We also provide an elimination algorithm whose upper bound matches the lower bound up to a factor of order log(T) in the exponential, and which does not need p^* or Δ as parameters. Our results apply directly to three related problems: competing against the j-th best arm, identifying an ε-good arm, and finding an arm with mean larger than a quantile of a known order.

Comment: Substantial rewrite and added experiments. Accepted for NeurIPS 202
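To illustrate the role of p^* described above, here is a minimal simulation sketch: if p^* is known, one can draw roughly log(T)/p^* arms from the (possibly infinite) reservoir, so that with high probability at least one sampled arm is optimal, and then run standard UCB1 on that finite subsample. This is a hypothetical illustration of the general subsampling idea, not the paper's exact algorithm; `draw_arm` and `pull` are assumed user-supplied interfaces to the arm reservoir.

```python
import math
import random

def subsample_ucb(draw_arm, pull, p_star, T):
    """Sketch: sample ~log(T)/p_star arms from the reservoir, then run UCB1.

    draw_arm() -> an arm (here: its mean); pull(arm) -> a reward in [0, 1].
    Hypothetical illustration of subsampling + UCB, not the paper's algorithm.
    """
    # With K ~ log(T)/p_star uniform draws, the chance that no drawn arm is
    # optimal is about (1 - p_star)^K <= exp(-K * p_star) ~ 1/T.
    K = max(1, math.ceil(math.log(T) / p_star))
    arms = [draw_arm() for _ in range(K)]
    counts = [0] * K
    sums = [0.0] * K
    for t in range(T):
        if t < K:
            i = t  # pull each sampled arm once to initialize
        else:
            # UCB1 index: empirical mean plus exploration bonus
            i = max(range(K), key=lambda a: sums[a] / counts[a]
                    + math.sqrt(2 * math.log(t + 1) / counts[a]))
        counts[i] += 1
        sums[i] += pull(arms[i])
    return arms, counts, sums
```

For example, with a reservoir in which a fraction p^* of arms has mean 0.9 and the rest mean 0.9 − Δ, the subsample contains an optimal arm with high probability and UCB1 then concentrates its pulls on it.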
