We study networks of communicating learning agents that cooperate to solve a
common nonstochastic bandit problem. Agents use an underlying communication
network to get messages about actions selected by other agents, and drop
messages that took more than d hops to arrive, where d is a delay
parameter. We introduce \textsc{Exp3-Coop}, a cooperative version of the {\sc
Exp3} algorithm and prove that with K actions and N agents the average
per-agent regret after T rounds is at most of order (d+1+NKα≤d)(TlnK), where α≤d is the
independence number of the d-th power of the connected communication graph
G. We then show that for any connected graph, for d=K the regret
bound is K1/4T, strictly better than the minimax regret KT
for noncooperating agents. More informed choices of d lead to bounds which
are arbitrarily close to the full information minimax regret TlnK
when G is dense. When G has sparse components, we show that a variant of
\textsc{Exp3-Coop}, allowing agents to choose their parameters according to
their centrality in G, strictly improves the regret. Finally, as a by-product
of our analysis, we provide the first characterization of the minimax regret
for bandit learning with delay.Comment: 30 page