High-probability regret bounds for bandit online linear optimization

Abstract

We present a modification of the algorithm of Dani et al. [8] for the online linear optimization problem in the bandit setting, which with high probability has regret at most $O^*(\sqrt{T})$ against an adaptive adversary. This improves on the previous algorithm [8], whose regret is bounded in expectation against an oblivious adversary. We obtain the same dependence on the dimension ($n^{3/2}$) as that exhibited by Dani et al. The results of this paper rest firmly on those of [8] and on the remarkable technique of Auer et al. [2] for obtaining high-probability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information and bandit settings.
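For orientation, here is a minimal sketch of the quantity being bounded, assuming the standard bandit online linear optimization setup; the symbols $K$ (the decision set in $\mathbb{R}^n$), $x_t$ (the learner's play), and $\ell_t$ (the adversary's loss vector) are our notation, not necessarily the paper's:

\[
R_T \;=\; \sum_{t=1}^{T} \langle x_t, \ell_t \rangle \;-\; \min_{x \in K} \sum_{t=1}^{T} \langle x, \ell_t \rangle ,
\]

where in the bandit setting only the scalar loss $\langle x_t, \ell_t \rangle$ is revealed to the learner each round, rather than the full vector $\ell_t$. Read together, the abstract's two statements assert a bound of the form $R_T = O^*(n^{3/2}\sqrt{T})$ holding with high probability, where $O^*$ hides polylogarithmic factors.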
