Delay and Cooperation in Nonstochastic Bandits
We study networks of communicating learning agents that cooperate to solve a
common nonstochastic bandit problem. Agents use an underlying communication
network to get messages about actions selected by other agents, and drop
messages that took more than $d$ hops to arrive, where $d$ is a delay
parameter. We introduce \textsc{Exp3-Coop}, a cooperative version of the
\textsc{Exp3} algorithm, and prove that with $K$ actions and $N$ agents the average
per-agent regret after $T$ rounds is at most of order
$\sqrt{\bigl(d+1+\frac{K}{N}\,\alpha_{\le d}(G)\bigr)(T\ln K)}$, where $\alpha_{\le d}(G)$ is the
independence number of the $d$-th power of the connected communication graph
$G$. We then show that for any connected graph, for $d=\sqrt{K}$ the regret
bound is of order $K^{1/4}\sqrt{T}$ (up to logarithmic factors), strictly better than the minimax regret
$\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which
are arbitrarily close to the full information minimax regret $\sqrt{T\ln K}$
when $G$ is dense. When $G$ has sparse components, we show that a variant of
\textsc{Exp3-Coop}, allowing agents to choose their parameters according to
their centrality in $G$, strictly improves the regret. Finally, as a by-product
of our analysis, we provide the first characterization of the minimax regret
for bandit learning with delay.
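
The bound makes clear that cooperation enters only through the variance of the
importance-weighted loss estimates. As a minimal Python sketch (our own toy
code, not the authors' implementation), the single-agent Exp3 step below is the
template that \textsc{Exp3-Coop} modifies: the cooperative version divides by
the probability that some agent within $d$ hops plays the arm, rather than the
local probability alone.

    import numpy as np

    rng = np.random.default_rng(0)
    K, eta = 5, 0.1          # number of arms and learning rate (toy values)
    p = np.ones(K) / K       # current sampling distribution over arms

    # One round for a single agent: sample an arm, observe its loss, build an
    # importance-weighted estimate, and apply the exponential-weights update.
    arm = rng.choice(K, p=p)
    loss = rng.random()      # stand-in for the adversarial loss of the played arm

    est = np.zeros(K)
    # Plain Exp3 divides by p[arm]; Exp3-Coop (roughly) divides by the
    # probability that at least one agent within d hops plays the arm, which
    # is what shrinks the estimate variance as cooperation increases.
    est[arm] = loss / p[arm]

    w = p * np.exp(-eta * est)
    p = w / w.sum()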
On Regret-optimal Cooperative Nonstochastic Multi-armed Bandits
We consider the nonstochastic multi-agent multi-armed bandit problem with
agents collaborating via a communication network with delays. We show a lower
bound on the individual regret of all agents. We show that with suitable
regularizers and communication protocols, a collaborative multi-agent
\emph{follow-the-regularized-leader} (FTRL) algorithm has an individual regret
upper bound that matches the lower bound up to a constant factor when the
number of arms is large enough relative to degrees of agents in the
communication graph. We also show that an FTRL algorithm with a suitable
regularizer is regret optimal with respect to the scaling with the edge-delay
parameter. We present numerical experiments validating our theoretical results
and demonstrate cases where our algorithms outperform previously proposed
algorithms.
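
For readers less familiar with FTRL, the sketch below shows the generic step
under the classical negative-entropy regularizer, whose minimizer is the
exponential-weights distribution. This only illustrates the template; the
paper's contribution lies in choosing other regularizers and communication
protocols, and the names here are our own.

    import numpy as np

    def ftrl_neg_entropy(cum_losses: np.ndarray, eta: float) -> np.ndarray:
        # FTRL step: argmin over the simplex of <p, L> + (1/eta) * sum_i p_i*log(p_i),
        # which has the closed-form exponential-weights solution.
        shifted = cum_losses - cum_losses.min()  # shift for numerical stability
        w = np.exp(-eta * shifted)
        return w / w.sum()

    # Toy usage: cumulative (estimated) losses for four arms.
    print(ftrl_neg_entropy(np.array([3.0, 1.0, 2.5, 1.2]), eta=0.5))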
Cooperative Online Learning: Keeping your Neighbors Updated
We study an asynchronous online learning setting with a network of agents. At
each time step, some of the agents are activated, requested to make a
prediction, and pay the corresponding loss. The loss function is then revealed
to these agents and also to their neighbors in the network. Our results
characterize how much knowing the network structure affects the regret as a
function of the model of agent activations. When activations are stochastic,
the optimal regret (up to constant factors) is shown to be of order
$\sqrt{\alpha T}$, where $T$ is the horizon and $\alpha$ is the independence
number of the network. We prove that the upper bound is achieved even when
agents have no information about the network structure. When activations are
adversarial the situation changes dramatically: if agents ignore the network
structure, an $\Omega(T)$ lower bound on the regret can be proven, showing that
learning is impossible. However, when agents can choose to ignore some of their
neighbors based on the knowledge of the network structure, we prove a
$\sqrt{\overline{\chi}\,T}$ sublinear regret bound, where $\overline{\chi} \ge \alpha$ is the clique-covering number of the network.
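
The clique-covering number enters because agents within one clique all observe
each other's losses, so one learner can in effect be shared per clique. The
snippet below (our illustration of the graph quantity, not the paper's
protocol) computes a greedy clique cover by coloring the complement graph, in
which each color class is a clique of the original network.

    import networkx as nx

    # A clique cover of G is a proper coloring of the complement graph:
    # each color class of complement(G) is a clique in G.
    G = nx.cycle_graph(6)
    clique_of = nx.greedy_color(nx.complement(G))  # agent -> clique index
    num_cliques = len(set(clique_of.values()))
    print(clique_of, num_cliques)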
Cooperative Multi-agent Bandits: Distributed Algorithms with Optimal Individual Regret and Constant Communication Costs
Recently, there has been extensive study of cooperative multi-agent
multi-armed bandits where a set of distributed agents cooperatively play the
same multi-armed bandit game. The goal is to develop bandit algorithms with the
optimal group and individual regrets and low communication between agents. The
prior work tackled this problem using two paradigms: leader-follower and fully
distributed algorithms. Prior algorithms in both paradigms achieve the optimal
group regret. The leader-follower algorithms achieve constant communication
costs but fail to achieve optimal individual regrets. The state-of-the-art
fully distributed algorithms achieve optimal individual regrets but fail to
achieve constant communication costs. This paper presents a simple yet
effective communication policy and integrates it into a learning algorithm for
cooperative bandits. Our algorithm achieves the best of both paradigms: optimal
individual regret and constant communication costs.
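
The abstract does not spell out the communication policy, but the following toy
sketch (entirely our illustration, under the assumption of a UCB-style
elimination scheme) shows one generic way constant communication cost can
arise: if agents broadcast only on arm-elimination events, each agent sends at
most $K-1$ messages in total, independent of the horizon $T$.

    import math

    def conf_interval(mean: float, pulls: int, t: int, c: float = 2.0):
        # Standard UCB-style confidence interval around an empirical mean.
        rad = math.sqrt(c * math.log(max(t, 2)) / pulls)
        return mean - rad, mean + rad

    def arms_to_eliminate(stats: dict, t: int) -> list:
        # stats maps arm -> (empirical mean reward, pull count); an arm is
        # eliminated (triggering a broadcast) once its upper confidence bound
        # falls below the best lower confidence bound.
        best_lcb = max(conf_interval(m, n, t)[0] for m, n in stats.values())
        return [a for a, (m, n) in stats.items()
                if conf_interval(m, n, t)[1] < best_lcb]

    print(arms_to_eliminate({0: (0.9, 500), 1: (0.2, 500), 2: (0.5, 500)}, t=1500))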
On-Demand Communication for Asynchronous Multi-Agent Bandits
This paper studies a cooperative multi-agent multi-armed stochastic bandit
problem where agents operate asynchronously -- agent pull times and rates are
unknown, irregular, and heterogeneous -- and face the same instance of a
K-armed bandit problem. Agents can share reward information to speed up the
learning process at additional communication costs. We propose ODC, an
on-demand communication protocol that tailors the communication of each pair of
agents based on their empirical pull times. ODC is efficient when the pull
times of agents are highly heterogeneous, and its communication complexity
depends on the empirical pull times of agents. ODC is a generic protocol that
can be integrated into most cooperative bandit algorithms without degrading
their performance. We then incorporate ODC into the natural extensions of the
UCB and AAE (active arm elimination) algorithms and propose two
communication-efficient cooperative
algorithms. Our analysis shows that both algorithms are near-optimal in regret.
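
Because the protocol tailors communication to each pair of agents, the idea can
be pictured as a per-pair channel that pushes fresh statistics only when enough
new local samples have accumulated. The class below is a toy stand-in of ours
with a simple doubling trigger; ODC's actual rule is driven by the empirical
pull times of both agents in the pair.

    class OnDemandChannel:
        # Toy pairwise send rule: push statistics to a peer only when the
        # local pull count has doubled since the last push.

        def __init__(self) -> None:
            self.sent_at = 1  # local pull count at the last push

        def should_send(self, my_pulls: int) -> bool:
            if my_pulls >= 2 * self.sent_at:
                self.sent_at = my_pulls
                return True
            return False

    ch = OnDemandChannel()
    print([t for t in range(1, 100) if ch.should_send(t)])  # [2, 4, 8, 16, 32, 64]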