Asynchronous Decentralized Parallel Stochastic Gradient Descent
Most commonly used distributed machine learning systems are either
synchronous or centralized asynchronous. Synchronous algorithms like
AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous
algorithms using a parameter server suffer from 1) communication bottleneck at
parameter servers when workers are many, and 2) significantly worse convergence
when the traffic to parameter server is congested. Can we design an algorithm
that is robust in a heterogeneous environment, while being communication
efficient and maintaining the best-possible convergence rate? In this paper, we
propose an asynchronous decentralized stochastic gradient descent algorithm
(AD-PSGD) satisfying all above expectations. Our theoretical analysis shows
AD-PSGD converges at the optimal $O(1/\sqrt{K})$ rate as SGD and has linear
speedup w.r.t. number of workers. Empirically, AD-PSGD outperforms the best of
decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and
standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a
heterogeneous environment. When training ResNet-50 on ImageNet with up to 128
GPUs, AD-PSGD converges (w.r.t. epochs) similarly to AllReduce-SGD, but each
epoch can be up to 4-8X faster than its synchronous counterparts in a
network-sharing HPC environment. To the best of our knowledge, AD-PSGD is the
first asynchronous algorithm that achieves a similar epoch-wise convergence
rate as AllReduce-SGD at a scale of over 100 GPUs.
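To make the update structure concrete, below is a minimal Python sketch (not the authors' implementation) of an AD-PSGD-style step: a randomly chosen worker averages its model with one random neighbor and applies a local stochastic gradient, with no global synchronization. The ring topology, the quadratic toy objective, and the sequential simulation of the asynchronous wake-ups are illustrative assumptions.

# A minimal sketch (not the authors' implementation) of the AD-PSGD-style update:
# at each step a randomly chosen worker (i) averages its model with one random
# neighbor on a ring and (ii) applies a local stochastic gradient. True AD-PSGD
# runs these steps asynchronously across workers; here the asynchrony is
# simulated by sampling one worker at a time. The quadratic loss and ring
# topology are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

n_workers, dim, lr, steps = 8, 10, 0.05, 2000
target = rng.normal(size=dim)                     # optimum of the toy objective
models = [rng.normal(size=dim) for _ in range(n_workers)]

def stochastic_grad(x):
    """Noisy gradient of 0.5*||x - target||^2 (stand-in for a minibatch gradient)."""
    return (x - target) + 0.1 * rng.normal(size=dim)

for t in range(steps):
    i = rng.integers(n_workers)                   # worker that wakes up
    j = (i + rng.choice([-1, 1])) % n_workers     # random neighbor on a ring
    g = stochastic_grad(models[i])                # gradient on i's current (possibly stale) model
    avg = 0.5 * (models[i] + models[j])           # pairwise gossip averaging
    models[i], models[j] = avg - lr * g, avg      # worker i also takes a gradient step

consensus = np.mean(models, axis=0)
print("distance to optimum:", np.linalg.norm(consensus - target))

In the actual asynchronous algorithm the gradient is computed on a possibly stale copy of the model while averaging proceeds concurrently; the sketch above only illustrates the shape of the pairwise-averaging-plus-local-gradient update.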
Communication-Efficient Distributed Strongly Convex Stochastic Optimization: Non-Asymptotic Rates
We examine fundamental tradeoffs in iterative distributed zeroth and first
order stochastic optimization in multi-agent networks in terms of
\emph{communication cost} (number of per-node transmissions) and
\emph{computational cost}, measured by the number of per-node noisy function
(respectively, gradient) evaluations with zeroth order (respectively, first
order) methods. Specifically, we develop novel distributed stochastic
optimization methods for zeroth and first order strongly convex optimization by
utilizing a probabilistic inter-agent communication protocol that increasingly
sparsifies communications among agents as time progresses. Under standard
assumptions on the cost functions and the noise statistics, we establish
non-asymptotic mean square error convergence rates for the proposed method in
terms of the expected number of network communications, for first and zeroth
order optimization, respectively, up to an arbitrarily small loss in the rate
exponent. The methods are shown to achieve order-optimal convergence rates in
terms of computational cost (the number of per-node gradient evaluations for
first order optimization and noisy function evaluations for zeroth order
optimization), while achieving the order-optimal convergence rates in terms of
iterations. Experiments on
real-life datasets illustrate the efficacy of the proposed algorithms.
Comment: 32 pages. Submitted for journal publication. Initial Submission:
September 201
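As an illustration of the increasingly sparse, probabilistic communication protocol described above, the following Python sketch (not from the paper) has two agents gossip only with a probability that decays over the iterations, so the expected number of communications grows sublinearly in the iteration count while a local noisy gradient step is taken at every iteration. The decay exponent, step sizes, and toy quadratic objective are assumptions chosen for illustration only.

# A minimal sketch (not from the paper) of an increasingly sparse, probabilistic
# communication schedule: at iteration t the agents gossip only with probability
# p_t, and p_t decays so that the expected number of communications grows
# sublinearly in t, while every agent still takes a local noisy (first order)
# step at every iteration. The decay exponent 0.5, the 1/t step size, and the
# two-agent setup are illustrative assumptions, not the paper's tuning.

import numpy as np

rng = np.random.default_rng(1)

dim, iters = 5, 5000
target = np.ones(dim)
x = [rng.normal(size=dim), rng.normal(size=dim)]   # the two agents' iterates
comms = 0

for t in range(1, iters + 1):
    p_t = min(1.0, 1.0 / t**0.5)                   # communication probability, decaying in t
    alpha_t = 1.0 / t                              # diminishing step size

    if rng.random() < p_t:                         # sparse gossip: average the two iterates
        mean = 0.5 * (x[0] + x[1])
        x = [mean.copy(), mean.copy()]
        comms += 1

    for a in range(2):                             # local noisy gradient step every iteration
        grad = (x[a] - target) + 0.1 * rng.normal(size=dim)
        x[a] = x[a] - alpha_t * grad

print(f"communications used: {comms} out of {iters} iterations")
print("errors:", [float(np.linalg.norm(xi - target)) for xi in x])

A zeroth order variant would replace the noisy gradient with a finite-difference estimate built from noisy function evaluations; the communication schedule itself is unchanged.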