Theoretical Limits of Pipeline Parallel Optimization and Application to Distributed Deep Learning
We investigate the theoretical limits of pipeline parallel learning of deep
learning architectures, a distributed setup in which the computation is
distributed per layer instead of per example. For smooth convex and non-convex
objective functions, we provide matching lower and upper complexity bounds and
show that a naive pipeline parallelization of Nesterov's accelerated gradient
descent is optimal. For non-smooth convex functions, we provide a novel
algorithm coined Pipeline Parallel Random Smoothing (PPRS) that is within a
d^{1/4} multiplicative factor of the optimal convergence rate, where d is
the underlying dimension. While the convergence rate still obeys a slow
ε^{-2} convergence rate, the depth-dependent part is accelerated,
resulting in a near-linear speed-up and convergence time that only slightly
depends on the depth of the deep learning architecture. Finally, we perform an
empirical analysis of the non-smooth non-convex case and show that, for
difficult and highly non-smooth problems, PPRS outperforms more traditional
optimization algorithms such as gradient descent and Nesterov's accelerated
gradient descent for problems where the sample size is limited, such as
few-shot or adversarial learning.
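
The sketch below illustrates only the random-smoothing idea that the abstract says PPRS builds on: a non-smooth objective is replaced by a Gaussian-smoothed surrogate whose gradient can be estimated by sampling, so gradient-based (and accelerated) methods become applicable. The function name smoothed_grad, the parameters gamma and n_samples, and the plain gradient loop are illustrative assumptions; this is not the authors' pipeline-parallel algorithm.

    # Minimal sketch of randomized smoothing for a non-smooth objective
    # (illustrative only; not the authors' PPRS implementation).
    # Smoothed objective: f_gamma(x) = E[f(x + gamma * Z)], Z ~ N(0, I),
    # which is differentiable even when f is not.
    import numpy as np

    def smoothed_grad(f, x, gamma=0.1, n_samples=64, rng=None):
        """Monte Carlo estimate of grad f_gamma(x) via Gaussian perturbations."""
        rng = rng or np.random.default_rng(0)
        d = x.shape[0]
        z = rng.standard_normal((n_samples, d))           # Z ~ N(0, I)
        fvals = np.array([f(x + gamma * zi) for zi in z])
        # Score-function identity: grad f_gamma(x) = E[f(x + gamma Z) Z] / gamma
        return (fvals[:, None] * z).mean(axis=0) / gamma

    # Toy usage: minimize a non-smooth L1 objective with plain gradient steps
    f = lambda x: np.abs(x).sum()
    x = np.ones(10)
    rng = np.random.default_rng(0)
    for _ in range(200):
        x -= 0.05 * smoothed_grad(f, x, rng=rng)

In the paper's setting the smoothed gradients would additionally be computed layer by layer across a pipeline of workers, which is where the depth-dependent speed-up comes from; that part is not reproduced here.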
Optimal Complexity in Decentralized Training
Decentralization is a promising method of scaling up parallel machine
learning systems. In this paper, we provide a tight lower bound on the
iteration complexity for such methods in a stochastic non-convex setting. Our
lower bound reveals a theoretical gap in known convergence rates of many
existing decentralized training algorithms, such as D-PSGD. We prove by
construction that this lower bound is tight and achievable. Motivated by our
insights, we further propose DeTAG, a practical gossip-style decentralized
algorithm that achieves the lower bound with only a logarithmic gap. Empirically,
we compare DeTAG with other decentralized algorithms on image classification
tasks, and we show DeTAG enjoys faster convergence compared to baselines,
especially on unshuffled data and in sparse networks.
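
For context, the sketch below shows one step of a generic gossip-style decentralized SGD update of the kind the abstract compares against (D-PSGD-like): each worker averages its parameters with its neighbors through a doubly stochastic mixing matrix, then takes a local stochastic gradient step. The function name dpsgd_step, the ring topology, and the learning rate are illustrative assumptions; DeTAG adds further machinery beyond this baseline that is not reproduced here.

    # Minimal sketch of one gossip-averaging + local SGD step
    # (illustrative baseline only; not the DeTAG algorithm).
    import numpy as np

    def dpsgd_step(params, grads, W, lr=0.1):
        """params: (n_workers, d) local models; grads: (n_workers, d) local
        stochastic gradients; W: (n_workers, n_workers) doubly stochastic
        gossip matrix encoding the communication graph."""
        mixed = W @ params            # each worker averages with its neighbors
        return mixed - lr * grads     # then takes a local gradient step

    # Toy usage: 4 workers on a ring, each mixing with itself and two neighbors
    n, d = 4, 3
    W = np.array([[1/3 if abs(i - j) in (0, 1, n - 1) else 0.0
                   for j in range(n)] for i in range(n)])
    params = np.random.default_rng(0).standard_normal((n, d))
    grads = 2 * params                # gradients of a toy quadratic ||x||^2
    params = dpsgd_step(params, grads, W)

The choice of mixing matrix (here a simple ring) controls how quickly information spreads across workers, which is exactly the quantity the paper's lower bound and the DeTAG analysis reason about.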