ABSTRACT. In the stochastic multi-armed bandit problem we consider a modification of the UCB algorithm of Auer et al. For this modified algorithm we give an improved bound on the regret with respect to the optimal reward. While for the original UCB algorithm the regret in $K$-armed bandits after $T$ trials is bounded by $\text{const} \cdot \frac{K \log(T)}{\Delta}$, where $\Delta$ measures the distance between a suboptimal arm and the optimal arm, for the modified UCB algorithm we show an upper bound on the regret of $\text{const} \cdot \frac{K \log(T \Delta^2)}{\Delta}$.
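To give a feel for the size of the improvement, the following minimal sketch evaluates the two bound shapes, const · K log(T)/∆ and const · K log(T∆²)/∆, for a few gap values. It is an illustration under stated assumptions, not part of the paper's analysis: the function names `ucb_bound` and `improved_bound` are hypothetical, the unspecified constant is set to 1 in both expressions (the true constants may differ between the two bounds), and the second expression is evaluated only where T∆² > 1 so that the logarithm is positive.

```python
import math


def ucb_bound(K, T, delta, const=1.0):
    """Shape of the original UCB regret bound: const * K * log(T) / delta.

    `const` is a placeholder (assumed 1.0); the abstract does not state it.
    """
    return const * K * math.log(T) / delta


def improved_bound(K, T, delta, const=1.0):
    """Shape of the modified-UCB regret bound: const * K * log(T * delta^2) / delta.

    Assumes T * delta^2 > 1; otherwise the logarithm is non-positive and
    the expression is not meaningful as a regret bound.
    """
    if T * delta ** 2 <= 1:
        raise ValueError("this sketch requires T * delta^2 > 1")
    return const * K * math.log(T * delta ** 2) / delta


if __name__ == "__main__":
    K, T = 10, 100_000  # hypothetical problem size, chosen for illustration
    for delta in (0.5, 0.1, 0.02):
        original = ucb_bound(K, T, delta)
        improved = improved_bound(K, T, delta)
        print(f"delta={delta}: original={original:.1f}, "
              f"improved={improved:.1f}, ratio={improved / original:.2f}")
```

With equal constants, the ratio of the two expressions is log(T∆²)/log(T), so the gain is largest when ∆ is small relative to 1/√T and shrinks as ∆ approaches 1.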