We consider a decentralized multi-agent Multi-Armed Bandit (MAB) setup
consisting of N agents that solve the same MAB instance to minimize their
individual cumulative regret. In our model, agents collaborate by exchanging messages
through pairwise gossip-style communications on an arbitrary connected graph.
We develop two novel algorithms in which each agent plays only from a subset of
all the arms. Agents use the communication medium to recommend only arm IDs
(not samples), and thereby update the set of arms from which they play. We
establish that, if agents communicate Ω(log(T)) times through any
connected pairwise gossip mechanism, then every agent's regret is smaller by a
factor of order N than in the case of no collaboration. Furthermore, we
show that the communication constraints only have a second order effect on the
regret of our algorithm. We then analyze this second order term of the regret
to derive bounds on the regret-communication tradeoffs. Finally, we empirically
evaluate our algorithm and conclude that the insights are fundamental and not
artifacts of our bounds. We also prove a lower bound showing that the regret
scaling obtained by our algorithm cannot be improved even in the absence of any
communication constraints. Our results thus demonstrate that even a minimal
level of collaboration among agents greatly reduces regret for all agents.

Comment: To appear in AISTATS 2020. The first two authors contributed equally.
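
To make the setup described in the abstract concrete, the following is a minimal illustrative sketch, not the paper's algorithms. It rests on several assumptions that do not come from the source: Bernoulli rewards, a UCB1 index restricted to each agent's active set, gossip on a complete graph at a fixed period, a recommendation rule that sends the peer's most-played arm ID, and active sets that only grow. The function name `simulate` and all of its parameters are hypothetical.

```python
import math
import random


def simulate(means, n_agents=4, horizon=2000, subset_size=2,
             gossip_period=50, seed=0):
    """Toy gossip bandit: agents exchange arm IDs only, never samples."""
    rng = random.Random(seed)
    k = len(means)
    best = max(means)

    # Assumption: initial active sets are a round-robin split of the arm IDs.
    active = [
        {(a * subset_size + j) % k for j in range(subset_size)}
        for a in range(n_agents)
    ]
    counts = [[0] * k for _ in range(n_agents)]
    sums = [[0.0] * k for _ in range(n_agents)]
    regret = [0.0] * n_agents

    for t in range(1, horizon + 1):
        for a in range(n_agents):
            # UCB1 index, computed only over this agent's active arms.
            def index(arm):
                n = counts[a][arm]
                if n == 0:
                    return float("inf")
                return sums[a][arm] / n + math.sqrt(2 * math.log(t) / n)

            arm = max(sorted(active[a]), key=index)
            reward = 1.0 if rng.random() < means[arm] else 0.0
            counts[a][arm] += 1
            sums[a][arm] += reward
            regret[a] += best - means[arm]

        # Assumption: gossip at a fixed period with a uniformly random peer.
        if n_agents > 1 and t % gossip_period == 0:
            for a in range(n_agents):
                peer = rng.choice([b for b in range(n_agents) if b != a])
                # The peer recommends the ID of its most-played active arm.
                rec = max(sorted(active[peer]),
                          key=lambda arm: counts[peer][arm])
                active[a].add(rec)

    return regret, active
```

Calling `simulate([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9])` returns each agent's cumulative pseudo-regret and final active set. The only information that crosses between agents is a single arm ID per gossip exchange, which is the communication restriction the abstract describes.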