Bandit Online Optimization Over the Permutahedron

D. Suehiro; D.P. Helmbold; J. Yellott; L.G. Valiant; M. Jerrum; N. Cesa-Bianchi; P. Auer; S. Beggs; S. Yasutake

research

Bandit Online Optimization Over the Permutahedron

Authors: D. Suehiro
D.P. Helmbold
J. Yellott
L.G. Valiant
M. Jerrum
N. Cesa-Bianchi
P. Auer
S. Beggs
S. Yasutake
Publication date: 1 January 2014
Publisher
Doi

Abstract

The permutahedron is the convex polytope with vertex set consisting of the vectors

(\pi(1),\dots, \pi(n))

for all permutations (bijections)

\pi

over

\{1,\dots, n\}

. We study a bandit game in which, at each step

t

, an adversary chooses a hidden weight weight vector

s_t

, a player chooses a vertex

\pi_t

of the permutahedron and suffers an observed loss of

\sum_{i=1}^n \pi(i) s_t(i)

. A previous algorithm CombBand of Cesa-Bianchi et al (2009) guarantees a regret of

O(n\sqrt{T \log n})

for a time horizon of

T

. Unfortunately, CombBand requires at each step an

n

-by-

n

matrix permanent approximation to within improved accuracy as

T

grows, resulting in a total running time that is super linear in

T

, making it impractical for large time horizons. We provide an algorithm of regret

O(n^{3/2}\sqrt{T})

with total time complexity

O(n^3T)

. The ideas are a combination of CombBand and a recent algorithm by Ailon (2013) for online optimization over the permutahedron in the full information setting. The technical core is a bound on the variance of the Plackett-Luce noisy sorting process's "pseudo loss". The bound is obtained by establishing positive semi-definiteness of a family of 3-by-3 matrices generated from rational functions of exponentials of 3 parameters

Similar works

Full text

Available Versions

Crossref

info:doi/10.1007%2F978-3-319-1...

Last time updated on 05/06/2019