2 research outputs found

    Asynchronous Decentralized Parallel Stochastic Gradient Descent

    Most commonly used distributed machine learning systems are either synchronous or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous algorithms using a parameter server suffer from 1) a communication bottleneck at the parameter servers when there are many workers, and 2) significantly worse convergence when the traffic to the parameter server is congested. Can we design an algorithm that is robust in a heterogeneous environment, while being communication efficient and maintaining the best-possible convergence rate? In this paper, we propose an asynchronous decentralized parallel stochastic gradient descent algorithm (AD-PSGD) that satisfies all of the above expectations. Our theoretical analysis shows that AD-PSGD converges at the same optimal $O(1/\sqrt{K})$ rate as SGD and has linear speedup with respect to the number of workers. Empirically, AD-PSGD outperforms the best of decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and standard data-parallel SGD (AllReduce-SGD), often by orders of magnitude in a heterogeneous environment. When training ResNet-50 on ImageNet with up to 128 GPUs, AD-PSGD converges (w.r.t. epochs) similarly to AllReduce-SGD, but each epoch can be up to 4-8X faster than its synchronous counterparts in a network-sharing HPC environment. To the best of our knowledge, AD-PSGD is the first asynchronous algorithm that achieves a similar epoch-wise convergence rate as AllReduce-SGD at over 100-GPU scale.
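    The abstract describes an update in which each worker computes a local stochastic gradient, gossip-averages its model with a randomly chosen neighbor, and applies the gradient without any global barrier or parameter server. Below is a minimal, serial sketch of that idea; the ring topology, toy least-squares objective, mini-batch size, and single-process loop are illustrative assumptions rather than the paper's actual setup.

    ```python
    # Serial simulation of an asynchronous decentralized SGD-style update:
    # at each "event" one worker wakes up, averages with a random neighbor,
    # and applies its own (possibly stale) stochastic gradient.
    import numpy as np

    rng = np.random.default_rng(0)
    n_workers, dim, lr, steps = 8, 10, 0.05, 2000

    # Toy data: each worker holds its own shard of a linear regression problem.
    A = [rng.normal(size=(50, dim)) for _ in range(n_workers)]
    b = [a @ np.ones(dim) + 0.1 * rng.normal(size=50) for a in A]

    # One model copy per worker.
    x = [np.zeros(dim) for _ in range(n_workers)]

    def local_grad(i, w):
        """Mini-batch gradient of worker i's local least-squares loss."""
        idx = rng.integers(0, A[i].shape[0], size=8)
        Ai, bi = A[i][idx], b[i][idx]
        return Ai.T @ (Ai @ w - bi) / len(idx)

    for t in range(steps):
        i = rng.integers(n_workers)                   # the worker that "wakes up"
        g = local_grad(i, x[i])                       # gradient on its current copy
        j = (i + rng.choice([-1, 1])) % n_workers     # random ring neighbor
        avg = 0.5 * (x[i] + x[j])                     # pairwise gossip averaging
        x[i], x[j] = avg.copy(), avg.copy()
        x[i] = x[i] - lr * g                          # apply the local gradient

    consensus = np.mean(x, axis=0)
    print("distance to true model:", np.linalg.norm(consensus - np.ones(dim)))
    ```

    In a real deployment each worker would run this loop concurrently, so the averaging and the gradient computation overlap; the serial loop above only illustrates the update rule itself.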

    Communication-Efficient Distributed Strongly Convex Stochastic Optimization: Non-Asymptotic Rates

    We examine fundamental tradeoffs in iterative distributed zeroth- and first-order stochastic optimization in multi-agent networks in terms of \emph{communication cost} (number of per-node transmissions) and \emph{computational cost}, measured by the number of per-node noisy function (respectively, gradient) evaluations with zeroth-order (respectively, first-order) methods. Specifically, we develop novel distributed stochastic optimization methods for zeroth- and first-order strongly convex optimization by utilizing a probabilistic inter-agent communication protocol that increasingly sparsifies communications among agents as time progresses. Under standard assumptions on the cost functions and the noise statistics, we establish with the proposed methods the $O(1/(C_{\mathrm{comm}})^{4/3-\zeta})$ and $O(1/(C_{\mathrm{comm}})^{8/9-\zeta})$ mean square error convergence rates for first- and zeroth-order optimization, respectively, where $C_{\mathrm{comm}}$ is the expected number of network communications and $\zeta>0$ is arbitrarily small. The methods are shown to achieve order-optimal convergence rates in terms of computational cost $C_{\mathrm{comp}}$, namely $O(1/C_{\mathrm{comp}})$ (first-order optimization) and $O(1/(C_{\mathrm{comp}})^{2/3})$ (zeroth-order optimization), while also achieving order-optimal convergence rates in terms of iterations. Experiments on real-life datasets illustrate the efficacy of the proposed algorithms.
    Comment: 32 pages. Submitted for journal publication. Initial Submission: September 201
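    The key mechanism in this abstract is a communication probability that decays over time, so the expected per-node communication cost grows sublinearly in the iteration count while every iteration still performs a noisy gradient step. The sketch below illustrates that idea for the first-order case; the ring graph, quadratic local costs, decay exponents, and gain constants are illustrative assumptions, not the paper's exact protocol or tuning.

    ```python
    # Consensus-plus-gradient iterations where each link is used only with a
    # probability p_t that decays over time (increasingly sparse communication).
    import numpy as np

    rng = np.random.default_rng(1)
    n_agents, dim, iters = 10, 5, 5000

    # Strongly convex local costs f_i(x) = 0.5 * ||x - c_i||^2;
    # the network-wide optimum is the average of the c_i.
    c = rng.normal(size=(n_agents, dim))
    x = np.zeros((n_agents, dim))
    comm_count = 0

    for t in range(1, iters + 1):
        alpha = 1.0 / t            # diminishing gradient step size (assumed)
        beta = 1.0 / t ** 0.51     # diminishing consensus weight (assumed)
        p_t = 1.0 / t ** 0.4       # communication probability, decaying in t (assumed)

        new_x = x.copy()
        for i in range(n_agents):
            # Noisy first-order oracle for agent i's local cost.
            grad = (x[i] - c[i]) + 0.1 * rng.normal(size=dim)

            # Ring neighbors; each link is activated only with probability p_t.
            consensus = np.zeros(dim)
            for j in ((i - 1) % n_agents, (i + 1) % n_agents):
                if rng.random() < p_t:
                    consensus += x[i] - x[j]
                    comm_count += 1

            new_x[i] = x[i] - beta * consensus - alpha * grad
        x = new_x

    print("avg per-node communications:", comm_count / n_agents)
    print("error vs. optimum:", np.linalg.norm(x.mean(axis=0) - c.mean(axis=0)))
    ```

    The printed communication count grows much more slowly than the iteration count, which is the tradeoff the stated $C_{\mathrm{comm}}$ versus $C_{\mathrm{comp}}$ rates quantify.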