Delay and Cooperation in Nonstochastic Bandits

Cesa-Bianchi, Nicolo'; Gentile, Claudio; Mansour, Yishay; Minora, Alberto

Authors: Nicolo' Cesa-Bianchi, Claudio Gentile, Yishay Mansour, Alberto Minora
Publication date: 1 January 2016
Abstract

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to receive messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm, and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\bigl(d+1 + \tfrac{K}{N}\alpha_{\le d}\bigr)(T\ln K)}$, where $\alpha_{\le d}$ is the independence number of the $d$-th power of the connected communication graph $G$. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which are arbitrarily close to the full information minimax regret $\sqrt{T\ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.

Comment: 30 page
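
To make the quantities in the regret bound concrete, the following is a minimal Python sketch that evaluates $\sqrt{(d+1 + \tfrac{K}{N}\alpha_{\le d})(T\ln K)}$ on a small graph. It is an illustration only, not the authors' code: the function names (`graph_power`, `independence_number`, `coop_regret_bound`) and the 6-node cycle are assumptions made for this example, the independence number is found by brute force (feasible only for tiny graphs), and the constants hidden by "of order" are ignored.

```python
import math
from itertools import combinations


def graph_power(adj, d):
    """Adjacency sets of the d-th power: u ~ v iff 0 < dist(u, v) <= d.

    `adj` is a list of neighbor sets, one per node (hypothetical input format).
    """
    power = []
    for source in range(len(adj)):
        dist = {source: 0}
        frontier = [source]
        while frontier:
            next_frontier = []
            for u in frontier:
                if dist[u] == d:
                    continue  # do not expand beyond d hops
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        next_frontier.append(v)
            frontier = next_frontier
        power.append({v for v in dist if v != source})
    return power


def independence_number(adj):
    """Size of a largest set of pairwise non-adjacent nodes (brute force)."""
    n = len(adj)
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(v not in adj[u] for u, v in combinations(subset, 2)):
                return size
    return 0


def coop_regret_bound(adj, d, K, T):
    """Evaluate sqrt((d + 1 + (K/N) * alpha_{<=d}) * T * ln K), constants omitted."""
    N = len(adj)
    alpha = independence_number(graph_power(adj, d))
    return math.sqrt((d + 1 + K / N * alpha) * T * math.log(K))


# Assumed example: a cycle on 6 agents, K = 16 actions, T = 10_000 rounds.
cycle = [{(i - 1) % 6, (i + 1) % 6} for i in range(6)]
for d in range(4):
    alpha = independence_number(graph_power(cycle, d))
    print(d, alpha, round(coop_regret_bound(cycle, d, K=16, T=10_000), 1))
```

With $d = 0$ every message is dropped, the $d$-th power has no edges, $\alpha_{\le 0} = N$, and the expression reduces to $\sqrt{(1+K)\,T\ln K}$. Larger $d$ shrinks $\alpha_{\le d}$ but grows the additive $d+1$ term, which is the trade-off the abstract refers to when it discusses choosing $d$.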
