Theoretical Limits of Pipeline Parallel Optimization and Application to Distributed Deep Learning
We investigate the theoretical limits of pipeline parallel learning of deep
learning architectures, a distributed setup in which the computation is
distributed per layer instead of per example. For smooth convex and non-convex
objective functions, we provide matching lower and upper complexity bounds and
show that a naive pipeline parallelization of Nesterov's accelerated gradient
descent is optimal. For non-smooth convex functions, we provide a novel
algorithm coined Pipeline Parallel Random Smoothing (PPRS) that is within a
d^{1/4} multiplicative factor of the optimal convergence rate, where d is
the underlying dimension. While the convergence rate still obeys a slow
ε^{-2} convergence rate, the depth-dependent part is accelerated,
resulting in a near-linear speed-up and convergence time that only slightly
depends on the depth of the deep learning architecture. Finally, we perform an
empirical analysis of the non-smooth non-convex case and show that, for
difficult and highly non-smooth problems, PPRS outperforms more traditional
optimization algorithms such as gradient descent and Nesterov's accelerated
gradient descent for problems where the sample size is limited, such as
few-shot or adversarial learning.
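
The sketch below illustrates only the random-smoothing idea that the abstract says PPRS builds on: a non-smooth objective is replaced by a Gaussian-smoothed surrogate whose gradient can be estimated by sampling, so gradient-based (and accelerated) methods become applicable. The function name smoothed_grad, the parameters gamma and n_samples, and the plain gradient loop are illustrative assumptions; this is not the authors' pipeline-parallel algorithm.

    # Minimal sketch of randomized smoothing for a non-smooth objective
    # (illustrative only; not the authors' PPRS implementation).
    # Smoothed objective: f_gamma(x) = E[f(x + gamma * Z)], Z ~ N(0, I),
    # which is differentiable even when f is not.
    import numpy as np

    def smoothed_grad(f, x, gamma=0.1, n_samples=64, rng=None):
        """Monte Carlo estimate of grad f_gamma(x) via Gaussian perturbations."""
        rng = rng or np.random.default_rng(0)
        d = x.shape[0]
        z = rng.standard_normal((n_samples, d))           # Z ~ N(0, I)
        fvals = np.array([f(x + gamma * zi) for zi in z])
        # Score-function identity: grad f_gamma(x) = E[f(x + gamma Z) Z] / gamma
        return (fvals[:, None] * z).mean(axis=0) / gamma

    # Toy usage: minimize a non-smooth L1 objective with plain gradient steps
    f = lambda x: np.abs(x).sum()
    x = np.ones(10)
    rng = np.random.default_rng(0)
    for _ in range(200):
        x -= 0.05 * smoothed_grad(f, x, rng=rng)

In the paper's setting the smoothed gradients would additionally be computed layer by layer across a pipeline of workers, which is where the depth-dependent speed-up comes from; that part is not reproduced here.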
Optimal Complexity in Decentralized Training
Decentralization is a promising method of scaling up parallel machine
learning systems. In this paper, we provide a tight lower bound on the
iteration complexity for such methods in a stochastic non-convex setting. Our
lower bound reveals a theoretical gap in known convergence rates of many
existing decentralized training algorithms, such as D-PSGD. We prove by
construction that this lower bound is tight and achievable. Motivated by our
insights, we further propose DeTAG, a practical gossip-style decentralized
algorithm that achieves the lower bound with only a logarithmic gap. Empirically,
we compare DeTAG with other decentralized algorithms on image classification
tasks, and we show DeTAG enjoys faster convergence compared to baselines,
especially on unshuffled data and in sparse networks.
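
For context, the sketch below shows one step of a generic gossip-style decentralized SGD update of the kind the abstract compares against (D-PSGD-like): each worker averages its parameters with its neighbors through a doubly stochastic mixing matrix, then takes a local stochastic gradient step. The function name dpsgd_step, the ring topology, and the learning rate are illustrative assumptions; DeTAG adds further machinery beyond this baseline that is not reproduced here.

    # Minimal sketch of one gossip-averaging + local SGD step
    # (illustrative baseline only; not the DeTAG algorithm).
    import numpy as np

    def dpsgd_step(params, grads, W, lr=0.1):
        """params: (n_workers, d) local models; grads: (n_workers, d) local
        stochastic gradients; W: (n_workers, n_workers) doubly stochastic
        gossip matrix encoding the communication graph."""
        mixed = W @ params            # each worker averages with its neighbors
        return mixed - lr * grads     # then takes a local gradient step

    # Toy usage: 4 workers on a ring, each mixing with itself and two neighbors
    n, d = 4, 3
    W = np.array([[1/3 if abs(i - j) in (0, 1, n - 1) else 0.0
                   for j in range(n)] for i in range(n)])
    params = np.random.default_rng(0).standard_normal((n, d))
    grads = 2 * params                # gradients of a toy quadratic ||x||^2
    params = dpsgd_step(params, grads, W)

The choice of mixing matrix (here a simple ring) controls how quickly information spreads across workers, which is exactly the quantity the paper's lower bound and the DeTAG analysis reason about.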