Local SGD Converges Fast and Communicates Little
Mini-batch stochastic gradient descent (SGD) is state of the art in large
scale distributed training. The scheme can reach a linear speedup with respect
to the number of workers, but this is rarely seen in practice as the scheme
often suffers from large network delays and bandwidth limits. To overcome this
communication bottleneck recent works propose to reduce the communication
frequency. An algorithm of this type is local SGD that runs SGD independently
in parallel on different workers and averages the sequences only once in a
while.
This scheme shows promising results in practice, but has so far eluded thorough
theoretical analysis. We prove concise convergence rates for local SGD on
convex problems and show that it converges at the same rate as mini-batch SGD
in terms of number of evaluated gradients, that is, the scheme achieves linear
speedup in the number of workers and mini-batch size. The number of
communication rounds can be reduced up to a factor of T^{1/2}---where T denotes
the number of total steps---compared to mini-batch SGD. This also holds for
asynchronous implementations. Local SGD can also be used for large scale
training of deep learning models.
The results shown here aim to serve as a guideline for further exploring the
theoretical and practical aspects of local SGD in these applications.
Comment: to appear at ICLR 2019, 19 pages
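To make the scheme concrete, below is a minimal serial simulation of local SGD under illustrative assumptions: each worker runs SGD independently for a fixed number of local steps and the models are averaged once per communication round. The gradient oracle, constants, and names are placeholders, not taken from the paper.

```python
# A minimal sketch of local SGD (serial simulation), assuming a generic
# stochastic gradient oracle `stochastic_grad`; names and constants are
# illustrative, not taken from the paper.
import numpy as np

def local_sgd(stochastic_grad, x0, num_workers=4, local_steps=10,
              rounds=100, lr=0.01, rng=None):
    rng = rng or np.random.default_rng(0)
    # Every worker starts from the same model.
    models = [np.array(x0, dtype=float) for _ in range(num_workers)]
    for _ in range(rounds):
        # Each worker runs SGD independently for `local_steps` iterations.
        for w in range(num_workers):
            for _ in range(local_steps):
                models[w] -= lr * stochastic_grad(models[w], rng)
        # Communication: average the worker models once per round.
        avg = np.mean(models, axis=0)
        models = [avg.copy() for _ in range(num_workers)]
    return models[0]

# Example: noisy gradients of f(x) = 0.5 * ||x||^2.
grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
x_out = local_sgd(grad, x0=np.ones(5))
```

Setting local_steps to 1 recovers averaging after every step, i.e., the synchronous mini-batch-style baseline that the reduction in communication rounds is measured against.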
Compressed Distributed Gradient Descent: Communication-Efficient Consensus over Networks
Network consensus optimization has received increasing attention in recent
years and has found important applications in many scientific and engineering
fields. To solve network consensus optimization problems, one of the most
well-known approaches is the distributed gradient descent method (DGD).
However, in networks with slow communication rates, DGD's performance is
unsatisfactory for solving high-dimensional network consensus problems due to
the communication bottleneck. This motivates us to design a
communication-efficient DGD-type algorithm based on compressed information
exchanges. Our contributions in this paper are three-fold: i) We develop a
communication-efficient algorithm called amplified-differential compression DGD
(ADC-DGD) and show that it converges under {\em any} unbiased compression
operator; ii) We rigorously prove the convergence performances of ADC-DGD and
show that they match with those of DGD without compression; iii) We reveal an
interesting phase transition phenomenon in the convergence speed of ADC-DGD.
Collectively, our findings advance the state-of-the-art of network consensus
optimization theory.
Comment: 11 pages, 11 figures, IEEE INFOCOM 201
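The sketch below illustrates the two ingredients the abstract combines: DGD-style consensus iterations over a mixing matrix and an unbiased compression operator (random sparsification with rescaling). It does not reproduce the amplified-differential update of ADC-DGD itself; the mixing matrix, the toy objectives, and all names are illustrative assumptions.

```python
# A rough sketch of DGD-style consensus optimization with an unbiased
# compression operator (random sparsification). The exact amplified-
# differential update of ADC-DGD is not reproduced; this only illustrates
# the building blocks the abstract refers to.
import numpy as np

def rand_sparsify(v, keep_prob, rng):
    # Unbiased compression: keep each coordinate with prob. keep_prob, rescale.
    mask = rng.random(v.shape) < keep_prob
    return np.where(mask, v / keep_prob, 0.0)

def compressed_dgd(grads, W, x0, steps=200, lr=0.05, keep_prob=0.3, seed=0):
    rng = np.random.default_rng(seed)
    n = W.shape[0]                       # number of network nodes
    X = np.tile(np.asarray(x0, float), (n, 1))
    for _ in range(steps):
        # Each node compresses its state before "transmitting" it.
        Q = np.stack([rand_sparsify(X[i], keep_prob, rng) for i in range(n)])
        # Consensus step on the compressed states + local gradient step.
        X = W @ Q - lr * np.stack([grads[i](X[i]) for i in range(n)])
    return X.mean(axis=0)

# Example: 3 nodes, doubly stochastic mixing matrix, quadratic local objectives.
W = np.array([[.5, .25, .25], [.25, .5, .25], [.25, .25, .5]])
grads = [lambda x, c=c: x - c for c in (0.0, 1.0, 2.0)]
print(compressed_dgd(grads, W, x0=np.zeros(4)))
```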
MindTheStep-AsyncPSGD: Adaptive Asynchronous Parallel Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is very useful in optimization problems
with high-dimensional non-convex target functions, and hence constitutes an
important component of several Machine Learning and Data Analytics methods.
Recently there has been significant work on understanding the parallelism
inherent to SGD, and its convergence properties. Asynchronous, parallel SGD
(AsyncPSGD) has received particular attention, due to observed performance
benefits. On the other hand, asynchrony implies inherent challenges in
understanding the execution of the algorithm and its convergence, stemming from
the fact that the contribution of a thread might be based on an old (stale)
view of the state. In this work we aim to deepen the understanding of AsyncPSGD
in order to increase the statistical efficiency in the presence of stale
gradients. We propose new models for capturing the nature of the staleness
distribution in a practical setting. Using the proposed models, we derive a
staleness-adaptive SGD framework, MindTheStep-AsyncPSGD, for adapting the step
size in an online fashion, which provably reduces the negative impact of
asynchrony. Moreover, we provide general convergence time bounds for a wide
class of staleness-adaptive step size strategies for convex target functions.
We also provide a detailed empirical study, showing how our approach implies
faster convergence for deep learning applications.
Comment: 12 pages, 3 figures, accepted in IEEE BigData 201
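A minimal simulated example of the staleness-adaptive idea is sketched below: gradients are computed on stale snapshots of the model, and the step size is shrunk as a function of the observed staleness. The specific adaptation rule of MindTheStep-AsyncPSGD is not reproduced; eta0 / (1 + staleness) is only one plausible choice, and all names are illustrative.

```python
# A minimal sketch of staleness-adaptive step sizes in simulated
# asynchronous SGD; the delay distribution and adaptation rule are
# illustrative assumptions, not the paper's.
import numpy as np

def async_sgd_staleness_adaptive(grad, x0, steps=1000, eta0=0.1,
                                 max_delay=8, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    history = [x.copy()]                 # past parameter snapshots
    for _ in range(steps):
        # A worker computed its gradient on a stale snapshot of the model.
        staleness = int(rng.integers(0, min(max_delay, len(history))))
        stale_x = history[-1 - staleness]
        g = grad(stale_x, rng)
        # Staleness-adaptive step: shrink the step for more stale gradients.
        eta = eta0 / (1.0 + staleness)
        x -= eta * g
        history.append(x.copy())
    return x

grad = lambda x, rng: x + 0.05 * rng.standard_normal(x.shape)
print(async_sgd_staleness_adaptive(grad, x0=np.ones(3)))
```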
Hogwild! over Distributed Local Data Sets with Linearly Increasing Mini-Batch Sizes
Hogwild! implements asynchronous Stochastic Gradient Descent (SGD) where
multiple threads in parallel access a common repository containing training
data, perform SGD iterations and update shared state that represents a jointly
learned (global) model. We consider big data analysis where training data is
distributed among local data sets in a heterogeneous way -- and we wish to move
SGD computations to local compute nodes where local data resides. The results
of these local SGD computations are aggregated by a central "aggregator" which
mimics Hogwild!. We show how local compute nodes can start choosing small
mini-batch sizes which increase to larger ones in order to reduce communication
cost (rounds of interaction with the aggregator). We improve on the state-of-the-art
literature and show $O(\sqrt{K})$ communication rounds for heterogeneous data
for strongly convex problems, where $K$ is the total number of gradient
computations across all local compute nodes. For our scheme, we prove a
\textit{tight} and novel non-trivial convergence analysis for strongly convex
problems for {\em heterogeneous} data which does not use the bounded gradient
assumption as seen in many existing publications. The tightness is a
consequence of our proofs for lower and upper bounds of the convergence rate,
which show a constant factor difference. We show experimental results for plain
convex and non-convex problems for biased (i.e., heterogeneous) and unbiased
local data sets.
Comment: arXiv admin note: substantial text overlap with arXiv:2007.09208;
AISTATS 202
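The following sketch simulates the round structure described above under illustrative assumptions: each local compute node runs SGD on its own (possibly biased) local data set with a mini-batch size that grows linearly with the round index, and the aggregator averages the local models once per round. The schedule, aggregation rule, and names are placeholders, not the paper's exact scheme.

```python
# A minimal sketch of local SGD rounds with linearly increasing mini-batch
# sizes: small batches early, larger ones later, so fewer aggregator
# interactions are needed per gradient computed. Schedule and aggregation
# rule are illustrative assumptions.
import numpy as np

def local_rounds_increasing_batches(sample_grad, data_per_node, x0,
                                    rounds=20, base_batch=4, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for r in range(rounds):
        batch_size = base_batch * (r + 1)        # linear growth per round
        local_models = []
        for data in data_per_node:               # each node works locally
            x_local = x.copy()
            idx = rng.integers(0, len(data), size=batch_size)
            for i in idx:                        # one local pass over the batch
                x_local -= lr * sample_grad(x_local, data[i])
            local_models.append(x_local)
        x = np.mean(local_models, axis=0)        # one communication round
    return x

# Example: heterogeneous (biased) local data sets, quadratic per-sample losses.
data_per_node = [np.full(100, 0.0), np.full(100, 2.0)]
sample_grad = lambda x, d: x - d
print(local_rounds_increasing_batches(sample_grad, data_per_node, np.zeros(2)))
```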
Adaptiveness and Lock-free Synchronization in Parallel Stochastic Gradient Descent
The emergence of big data in recent years, due to vast societal digitalization and large-scale sensor deployment, has entailed significant interest in machine learning methods to enable automatic data analytics. In a majority of the learning algorithms used in industrial as well as academic settings, the first-order iterative optimization procedure stochastic gradient descent (SGD) is the backbone. However, SGD is often time-consuming, as it typically requires several passes through the entire dataset to converge to a solution of sufficient quality.

In order to cope with increasing data volumes, and to facilitate accelerated processing utilizing contemporary hardware, various parallel SGD variants have been proposed. In addition to traditional synchronous parallelization schemes, asynchronous ones have received particular interest in recent literature due to their improved ability to scale, owing to less coordination and hence less waiting time. However, asynchrony implies inherent challenges in understanding the execution of the algorithm and its convergence properties, due to the presence of both stale and inconsistent views of the shared state.

In this work, we aim to increase the understanding of the convergence properties of SGD for practical applications under asynchronous parallelism, and to develop tools and frameworks that facilitate improved convergence as well as further research and development. First, we focus on understanding the impact of staleness and introduce models for capturing the dynamics of parallel execution of SGD. This enables (i) quantifying the statistical penalty on the convergence due to staleness and (ii) deriving an adaptation scheme, introducing a staleness-adaptive SGD variant, MindTheStep-AsyncSGD, which provably reduces this penalty. Second, we explore the impact of synchronization mechanisms, in particular consistency-preserving ones, and their overall effect on the convergence properties. To this end, we propose Leashed-SGD, an extensible algorithmic framework supporting various synchronization mechanisms for different degrees of consistency, enabling in particular a lock-free and consistency-preserving implementation. In addition, the algorithmic construction of Leashed-SGD enables dynamic memory allocation, claiming memory only when necessary, which reduces the overall memory footprint. We perform an extensive empirical study, benchmarking the proposed methods together with established baselines, focusing on the prominent application of deep learning for image classification on the benchmark datasets MNIST and CIFAR, and showing significant improvements in convergence time for Leashed-SGD and MindTheStep-AsyncSGD.
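To illustrate the distinction between consistency-preserving and lock-free access to the shared state discussed above, here is a minimal Python threading sketch under illustrative assumptions; it is not the Leashed-SGD construction, and CPython's GIL means it demonstrates the programming pattern rather than real parallel speedup.

```python
# A minimal sketch contrasting lock-based (consistency-preserving) and
# lock-free (Hogwild-style) updates of a shared model with threads.
# Illustration only; not the actual Leashed-SGD construction.
import threading
import numpy as np

shared_model = np.zeros(10)
model_lock = threading.Lock()

def worker(grad, steps, lr, use_lock):
    rng = np.random.default_rng()
    for _ in range(steps):
        if use_lock:
            with model_lock:                       # consistent read + write
                g = grad(shared_model, rng)
                shared_model[:] -= lr * g
        else:
            g = grad(shared_model, rng)            # possibly inconsistent read
            shared_model[:] -= lr * g              # racy, unsynchronized write

def run(use_lock, n_threads=4, steps=500, lr=0.01):
    grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
    threads = [threading.Thread(target=worker, args=(grad, steps, lr, use_lock))
               for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()

run(use_lock=True)    # consistency-preserving variant
run(use_lock=False)   # lock-free variant with inconsistent views
```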
Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Deep learning at scale is dominated by communication time. Distributing
samples across nodes usually yields the best performance, but poses scaling
challenges due to global information dissemination and load imbalance across
uneven sample lengths. State-of-the-art decentralized optimizers mitigate the
problem, but require more iterations to achieve the same accuracy as their
globally-communicating counterparts. We present Wait-Avoiding Group Model
Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global
communication via subgroup weight exchange. The key insight is a combination of
algorithmic changes to the averaging scheme and the use of a group allreduce
operation. We prove the convergence of WAGMA-SGD, and empirically show that it
retains convergence rates similar to Allreduce-SGD. For evaluation, we train
ResNet-50 on ImageNet; Transformer for machine translation; and deep
reinforcement learning for navigation at scale. Compared with state-of-the-art
decentralized SGD variants, WAGMA-SGD significantly improves training
throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves
the fastest time-to-solution (e.g., the highest score using the shortest
training time for Transformer).
Comment: Published in IEEE Transactions on Parallel and Distributed Systems
(IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 202
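A minimal synchronous simulation of the subgroup-averaging idea is sketched below: each round, workers take a local SGD step and then average only within a reshuffled subgroup instead of performing a global allreduce. The wait-avoiding, asynchronous group allreduce of WAGMA-SGD itself is not reproduced; the group size, shuffle schedule, and names are illustrative assumptions.

```python
# A synchronous toy simulation of subgroup model averaging: averaging within
# small, reshuffled groups instead of globally. Illustration only; the
# wait-avoiding asynchronous allreduce of WAGMA-SGD is not reproduced.
import numpy as np

def group_averaging_sgd(grad, x0, num_workers=8, group_size=4,
                        rounds=50, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    models = np.tile(np.asarray(x0, float), (num_workers, 1))
    for _ in range(rounds):
        # Local SGD step on every worker.
        for w in range(num_workers):
            models[w] -= lr * grad(models[w], rng)
        # Group "allreduce": average within each subgroup only.
        perm = rng.permutation(num_workers)
        for start in range(0, num_workers, group_size):
            group = perm[start:start + group_size]
            models[group] = models[group].mean(axis=0)
        # (A global allreduce would instead average all `models` here.)
    return models.mean(axis=0)

grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
print(group_averaging_sgd(grad, x0=np.ones(4)))
```

Reshuffling group membership each round is what lets information propagate across all workers over time even though no single round communicates globally.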