The permutahedron is the convex polytope with vertex set consisting of the
vectors (π(1),…,π(n)) for all permutations (bijections) π over
{1,…,n}. We study a bandit game in which, at each step t, an
adversary chooses a hidden weight vector $s_t$, a player chooses a
vertex $\pi_t$ of the permutahedron and suffers an observed loss of
$\sum_{i=1}^{n} \pi_t(i)\, s_t(i)$.
A previous algorithm, CombBand of Cesa-Bianchi et al. (2009), guarantees a
regret of $O(n\sqrt{T \log n})$ for a time horizon of $T$. Unfortunately,
CombBand requires, at each step, an approximation of an n-by-n matrix
permanent whose accuracy must improve as T grows. The resulting total running
time is superlinear in T, making it impractical for large time horizons.
We provide an algorithm of regret $O(n^{3/2}\sqrt{T})$ with total time
complexity $O(n^3 T)$. The ideas are a combination of CombBand and a recent
algorithm by Ailon (2013) for online optimization over the permutahedron in the
full information setting. The technical core is a bound on the variance of the
Plackett-Luce noisy sorting process's "pseudo loss". The bound is obtained by
establishing positive semi-definiteness of a family of 3-by-3 matrices
generated from rational functions of exponentials of 3 parameters.
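
To make the objects in the game concrete, the following is a minimal Python sketch of one round: a permutation is drawn by a Plackett-Luce process and its loss against a weight vector is computed. It is an illustration only, not the algorithm of this paper. The function names `plackett_luce_sample` and `loss`, the parameter vector `theta`, and the convention that the item drawn first receives position 1 are all assumptions made for the example.

```python
import math
import random


def plackett_luce_sample(theta, rng):
    """Draw a permutation from a Plackett-Luce model (illustrative sketch).

    theta: list of real parameters, one per item (hypothetical name).
    Items are drawn one at a time without replacement, each with
    probability proportional to exp(theta[i]) among the remaining items.
    Returns pi, where pi[i] is the 1-based position at which item i was drawn
    (an assumed convention).
    """
    remaining = list(range(len(theta)))
    order = []
    while remaining:
        weights = [math.exp(theta[i]) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        order.append(pick)
        remaining.remove(pick)

    pi = [0] * len(theta)
    for position, item in enumerate(order, start=1):
        pi[item] = position
    return pi


def loss(pi, s):
    """Loss of vertex pi against weight vector s: sum_i pi(i) * s(i)."""
    return sum(p * w for p, w in zip(pi, s))


if __name__ == "__main__":
    rng = random.Random(0)
    s_t = [0.5, 0.0, 1.0]  # adversary's hidden weights (made-up values)
    pi_t = plackett_luce_sample([0.0, 1.0, -1.0], rng)
    # In the bandit setting the player observes only this scalar, not s_t.
    print(pi_t, loss(pi_t, s_t))
```

In the bandit setting, only the scalar returned by `loss` is revealed to the player; the vector `s_t` itself stays hidden.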