DSGD-CECA: Decentralized SGD with Communication-Optimal Exact Consensus Algorithm
Decentralized Stochastic Gradient Descent (SGD) is an emerging neural network
training approach that enables multiple agents to train a model collaboratively
and simultaneously. Rather than relying on a central parameter server to collect
gradients from all the agents, each agent keeps a copy of the model parameters
and exchanges model updates with a small number of other agents, as governed by
the communication topology and the gossip weight matrices.
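As a rough illustration of this setup (a minimal sketch, not the paper's algorithm; the names `dsgd_step` and `lr` and the ring-topology example are ours), one DSGD iteration combines gossip averaging under a weight matrix with a local stochastic gradient step:

```python
import numpy as np

def dsgd_step(X, W, grads, lr):
    """One generic DSGD iteration (illustrative sketch).

    X:     (n_agents, dim) array; row i is agent i's model copy.
    W:     (n_agents, n_agents) gossip weight matrix; W[i, j] != 0
           only when agent j sends its parameters to agent i.
    grads: (n_agents, dim) stochastic gradients, one row per agent.
    """
    return W @ X - lr * grads

# Example: a ring topology, where each agent averages with its
# two neighbors (a doubly stochastic gossip weight matrix).
n, dim = 5, 3
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3
X = np.random.randn(n, dim)
grads = np.random.randn(n, dim)  # placeholder gradients
X = dsgd_step(X, W, grads, lr=0.1)
```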
The state-of-the-art approach uses the dynamic one-peer exponential-2 topology,
achieving faster training and better scalability than the ring, grid, torus,
and hypercube topologies. However, this approach requires the number of agents
to be a power of 2, which is impractical at scale.
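To make the power-of-2 restriction concrete, the sketch below shows one common one-peer exponential-2 schedule (an assumed construction for illustration, not code from the paper): at round k, each agent averages with the single peer 2^k hops away. The schedule reaches the exact average after log2(n) rounds when n = 8, but not when n = 6:

```python
import numpy as np

def one_peer_exp2(x):
    """Run ceil(log2 n) one-peer rounds: at round k, agent i averages
    with the single agent (i - 2**k) mod n (illustrative sketch)."""
    n = len(x)
    for k in range(int(np.ceil(np.log2(n)))):
        # np.roll(x, 2**k)[i] == x[(i - 2**k) % n]
        x = (x + np.roll(x, 2 ** k)) / 2
    return x

for n in (8, 6):
    x = np.random.randn(n)
    exact = np.allclose(one_peer_exp2(x), x.mean())
    print(f"n={n}: exact consensus -> {exact}")  # True for 8, False for 6
```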
In this paper, we remove this restriction and propose
\underline{D}ecentralized \underline{SGD} with
\underline{C}ommunication-optimal \underline{E}xact \underline{C}onsensus
\underline{A}lgorithm (DSGD-CECA), which works for any number of agents while
still achieving state-of-the-art properties.
In particular, DSGD-CECA incurs a
unit per-iteration communication overhead and an $\tilde{O}(n^3)$ transient
iteration complexity.
Our proof is based on newly discovered properties of
gossip weight matrices and a novel approach to combining them with DSGD's
convergence analysis. Numerical experiments demonstrate the efficiency of
DSGD-CECA …