OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed Training
The training of modern deep neural networks calls for large amounts
of computation, which is often provided by GPUs or other specific accelerators.
To scale out and achieve faster training, two update algorithms are mainly
applied in distributed training: the Synchronous SGD algorithm (SSGD) and the
Asynchronous SGD algorithm (ASGD). SSGD reaches a good convergence point, but
its training speed is slowed down by the synchronization barrier. ASGD trains
faster, but converges to a worse point than SSGD. To exploit the advantages of
both, we propose a novel technique named One-step Delay SGD (OD-SGD) that
combines their strengths during training, achieving a convergence point similar
to that of SSGD and a training speed close to that of ASGD. To the best of our knowledge,
we make the first attempt to combine the features of SSGD and ASGD to improve
distributed training performance. Each iteration of OD-SGD contains a global
update on the parameter server node and local updates on the worker nodes; the
local update is introduced to refresh and compensate the one-step-delayed local weights.
We evaluate our proposed algorithm on MNIST, CIFAR-10 and ImageNet datasets.
Experimental results show that OD-SGD obtains similar or even slightly better
accuracy than SSGD while training much faster; its training speed even exceeds
that of ASGD.
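To make the one-step-delay idea concrete, here is a minimal NumPy sketch, assuming a single parameter vector and vanilla SGD on both the server and the worker; the function name and the exact compensation rule are illustrative, not the authors' implementation.

```python
import numpy as np

def od_sgd_step(w_server, w_local, delayed_grad, fresh_grad, lr=0.1):
    """One illustrative OD-SGD iteration (a sketch, not the paper's exact rule).

    w_server     -- parameters held on the parameter server
    w_local      -- the worker's one-step-stale local copy
    delayed_grad -- gradient the worker sent during the previous iteration
    fresh_grad   -- gradient the worker just computed on w_local
    """
    # Global update: the server applies the one-step-delayed gradient, so it
    # never has to wait for the current iteration's gradient to arrive.
    w_server = w_server - lr * delayed_grad
    # Local update: the worker compensates for the delay by applying its fresh
    # gradient on top of the newly published server weights.
    w_local = w_server - lr * fresh_grad
    return w_server, w_local

# Toy usage on a quadratic objective f(w) = ||w||^2 / 2, whose gradient is w.
w_server = w_local = np.ones(3)
delayed = np.zeros(3)
for _ in range(5):
    fresh = w_local.copy()                     # gradient of the toy objective
    w_server, w_local = od_sgd_step(w_server, w_local, delayed, fresh)
    delayed = fresh                            # becomes the delayed gradient next time
```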
Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training
Distributed training of deep nets is an important technique to address some
of the present-day computing challenges like memory consumption and
computational demands. Classical distributed approaches, synchronous or
asynchronous, are based on the parameter-server architecture: worker nodes
compute gradients that are communicated to the parameter server, and updated
parameters are returned. Recently, distributed training with AllReduce
operations gained popularity as well. While many of those operations seem
appealing, little is reported about wall-clock training time improvements. In
this paper, we carefully analyze the AllReduce based setup, propose timing
models which include network latency, bandwidth, cluster size and compute time,
and demonstrate that a pipelined training with a width of two combines the best
of both synchronous and asynchronous training. Specifically, for a setup
consisting of a four-node GPU cluster we show wall-clock time training
improvements of up to 5.4x compared to conventional approaches.
Comment: Accepted at NeurIPS 2018.
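The gain from a pipeline of width two can be illustrated with a back-of-the-envelope timing model; the sketch below uses the standard ring-AllReduce communication cost and hypothetical cluster numbers, so it conveys the reasoning rather than reproducing the paper's exact timing models.

```python
def iteration_time(compute_s, grad_bytes, nodes, bandwidth_Bps, latency_s,
                   pipelined=True):
    """Rough per-iteration wall-clock estimate (illustrative, not the paper's
    exact model). Communication assumes a ring AllReduce that moves
    2*(n-1)/n of the gradient bytes plus per-step latency."""
    steps = 2 * (nodes - 1)
    comm_s = steps * latency_s + (steps / nodes) * grad_bytes / bandwidth_Bps
    if pipelined:
        # Pipeline of width two: communication of iteration t overlaps with
        # computation of iteration t+1, so the slower of the two dominates.
        return max(compute_s, comm_s)
    # Plain synchronous training pays compute and communication in sequence.
    return compute_s + comm_s

# Hypothetical 4-node cluster: 250 ms compute, 100 MB gradients, 10 Gb/s, 50 us latency.
print(iteration_time(0.25, 100e6, 4, 1.25e9, 50e-6, pipelined=False))  # ~0.37 s
print(iteration_time(0.25, 100e6, 4, 1.25e9, 50e-6, pipelined=True))   # ~0.25 s
```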
A Hitchhiker's Guide On Distributed Training of Deep Neural Networks
Deep learning has led to tremendous advancements in the field of Artificial
Intelligence. One caveat however is the substantial amount of compute needed to
train these deep learning models. Training a model on a benchmark dataset like
ImageNet on a single machine with a modern GPU can take up to a week;
distributing the training across multiple machines has been observed to bring
this time down drastically.
Recent work has brought ImageNet training time down to as little as 4 minutes
by using a cluster of 2048 GPUs. This paper surveys the various
algorithms and techniques used to distribute training and presents the current
state of the art for a modern distributed training framework. More
specifically, we explore the synchronous and asynchronous variants of
distributed Stochastic Gradient Descent, various All Reduce gradient
aggregation strategies, and best practices for obtaining higher throughput and
lower latency over a cluster, such as mixed precision training, large batch
training and gradient compression.
Comment: 14 pages.
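As a concrete illustration of one of the surveyed AllReduce aggregation strategies, the toy below simulates a ring AllReduce average inside a single process; it is a sketch of the communication pattern only, not a real backend such as NCCL or Horovod.

```python
import numpy as np

def ring_allreduce(worker_grads):
    """Single-process simulation of a ring AllReduce average over per-worker
    gradients (an illustrative toy, not a production communication library)."""
    n = len(worker_grads)
    # Each worker splits its gradient vector into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in worker_grads]

    # Reduce-scatter: after n-1 steps, worker r owns the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so every send uses its pre-step value.
        outgoing = [chunks[r][(r - step) % n].copy() for r in range(n)]
        for r in range(n):
            src = (r - 1) % n                 # ring neighbour
            idx = (src - step) % n            # chunk forwarded by that neighbour
            chunks[r][idx] += outgoing[src]

    # All-gather: circulate the reduced chunks until every worker has them all.
    for step in range(n - 1):
        for r in range(n):
            src = (r - 1) % n
            idx = (r - step) % n
            chunks[r][idx] = chunks[src][idx].copy()

    # Every worker ends up with the same averaged gradient.
    return [np.concatenate(chunks[r]) / n for r in range(n)]

# Example: 4 workers, each with a random gradient of length 10.
grads = [np.random.randn(10) for _ in range(4)]
avg = ring_allreduce(grads)
assert np.allclose(avg[0], np.mean(grads, axis=0))
```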
Splash: User-friendly Programming Interface for Parallelizing Stochastic Algorithms
Stochastic algorithms are efficient approaches to solving machine learning
and optimization problems. In this paper, we propose a general framework called
Splash for parallelizing stochastic algorithms on multi-node distributed
systems. Splash consists of a programming interface and an execution engine.
Using the programming interface, the user develops sequential stochastic
algorithms without worrying about any details of distributed computing. The
algorithm is then automatically parallelized by a communication-efficient
execution engine. We provide theoretical justifications on the optimal rate of
convergence for parallelizing stochastic gradient descent. Splash is built on
top of Apache Spark. The real-data experiments on logistic regression,
collaborative filtering and topic modeling verify that Splash yields
order-of-magnitude speedup over single-thread stochastic algorithms and over
state-of-the-art implementations on Spark.
Comment: redo experiments to learn bigger models; compare Splash with
state-of-the-art implementations on Spark.
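The division of labor described above can be sketched as follows: the user supplies a purely sequential per-example update (here, logistic regression), and a stand-in execution engine parallelizes it by running the update on data shards and averaging the resulting models once per round. The function names and the plain model averaging are assumptions for illustration, not Splash's actual API or its reweighting scheme.

```python
import numpy as np

def sgd_update(w, example, lr=0.1):
    """User-written sequential update for logistic regression on one example
    (the kind of per-element routine a Splash-style interface asks for)."""
    x, y = example
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return w + lr * (y - p) * x

def parallel_run(update, w0, data, n_workers=4, epochs=5):
    """Toy stand-in for the execution engine (not Splash's actual engine):
    each 'worker' runs the sequential update on its shard, and the resulting
    models are averaged once per round (communication-efficient)."""
    w = w0.copy()
    shards = np.array_split(np.arange(len(data)), n_workers)
    for _ in range(epochs):
        local_models = []
        for shard in shards:
            w_local = w.copy()
            for i in shard:
                w_local = update(w_local, data[i])
            local_models.append(w_local)
        w = np.mean(local_models, axis=0)   # one average per round
    return w

# Toy usage: 200 random examples with a noisy linear decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
data = [(x, float(x[0] + x[1] > 0)) for x in X]
w = parallel_run(sgd_update, np.zeros(3), data)
```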
Decoupled Parallel Backpropagation with Convergence Guarantee
The backpropagation algorithm is indispensable for the training of feedforward
neural networks. It requires propagating error gradients sequentially from the
output layer all the way back to the input layer. The backward locking in the
backpropagation algorithm prevents us from updating network layers in parallel
and from fully leveraging the computing resources. Recently, several algorithms
have been proposed for breaking the backward locking. However, their
performance degrades seriously when networks are deep. In this paper, we
propose a decoupled parallel backpropagation algorithm for deep learning
optimization with a convergence guarantee. Firstly, we decouple the
backpropagation algorithm using delayed gradients, and show that the backward
locking is removed when we split the networks into multiple modules. Then, we
utilize decoupled parallel backpropagation in two stochastic methods and prove
that our method guarantees convergence to critical points for the non-convex
problem. Finally, we perform experiments for training deep convolutional neural
networks on benchmark datasets. The experimental results not only confirm our
theoretical analysis, but also demonstrate that the proposed method can achieve
significant speedup without loss of accuracy.
Comment: ICML 2018.
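A minimal sketch of the delayed-gradient idea, assuming the network is split into K modules and that gradients for every module arrive as a stream; module k applies a gradient that is K-1-k iterations old, which is what removes the backward locking. The scheduling below is illustrative, not the paper's implementation, and it ignores how the gradients themselves are produced.

```python
from collections import deque
import numpy as np

def delayed_gradient_sgd(modules, grads_stream, lr=0.01):
    """Toy sketch of decoupled backpropagation with delayed gradients.

    modules      -- list of parameter arrays, one per network split
    grads_stream -- iterator yielding, per iteration, one gradient per module
    Module k updates with a gradient that is (K - 1 - k) iterations old, so
    earlier modules never wait for the full backward pass to finish.
    """
    K = len(modules)
    # Per-module buffers holding gradients that have not been applied yet.
    buffers = [deque() for _ in range(K)]
    for grads in grads_stream:
        for k in range(K):
            buffers[k].append(grads[k])
            # Apply the gradient only once it is old enough; earlier modules
            # tolerate a longer delay because their gradients arrive last.
            if len(buffers[k]) > K - 1 - k:
                modules[k] -= lr * buffers[k].popleft()
    return modules

# Toy usage: 2 modules, constant unit gradients for 5 iterations.
modules = [np.zeros(2), np.zeros(2)]
stream = ([np.ones(2), np.ones(2)] for _ in range(5))
delayed_gradient_sgd(modules, stream)
```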
DeepSpark: A Spark-Based Distributed Deep Learning Framework for Commodity Clusters
The increasing complexity of deep neural networks (DNNs) has made it
challenging to exploit existing large-scale data processing pipelines for
handling massive data and parameters involved in DNN training. Distributed
computing platforms and GPGPU-based acceleration provide a mainstream solution
to this computational challenge. In this paper, we propose DeepSpark, a
distributed and parallel deep learning framework that exploits Apache Spark on
commodity clusters. To support parallel operations, DeepSpark automatically
distributes workloads and parameters to Caffe/TensorFlow-running nodes using
Spark, and iteratively aggregates training results with a novel lock-free
asynchronous variant of the popular elastic averaging stochastic gradient
descent (EASGD) update scheme, effectively complementing the synchronized
processing capabilities of Spark. DeepSpark is an on-going project, and the
current release is available at http://deepspark.snu.ac.kr
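The update scheme DeepSpark builds on can be sketched with the generic elastic averaging SGD rule below; the hyperparameters and the single-worker usage are assumptions for illustration and are not taken from the DeepSpark code or its lock-free bookkeeping.

```python
import numpy as np

def easgd_worker_step(w_worker, w_center, grad, lr=0.05, alpha=0.1):
    """One asynchronous elastic-averaging step (a generic EASGD sketch).

    The worker takes a gradient step and is pulled toward the center
    variable; the center is pulled toward the worker. Workers apply this
    independently, without a synchronization barrier."""
    elastic = alpha * (w_worker - w_center)
    w_worker = w_worker - lr * grad - elastic   # local descent + elastic pull
    w_center = w_center + elastic               # center drifts toward the worker
    return w_worker, w_center

# Toy usage: one worker pulling the center toward its local optimum.
w_center = np.zeros(3)
w_worker = np.zeros(3)
target = np.ones(3)
for _ in range(100):
    grad = w_worker - target                    # gradient of ||w - target||^2 / 2
    w_worker, w_center = easgd_worker_step(w_worker, w_center, grad)
print(w_center)                                 # drifts toward the worker's optimum
```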
Stochastic Gradient Push for Distributed Deep Learning
Distributed data-parallel algorithms aim to accelerate the training of deep
neural networks by parallelizing the computation of large mini-batch gradient
updates across multiple nodes. Approaches that synchronize nodes using exact
distributed averaging (e.g., via AllReduce) are sensitive to stragglers and
communication delays. The PushSum gossip algorithm is robust to these issues,
but only performs approximate distributed averaging. This paper studies
Stochastic Gradient Push (SGP), which combines PushSum with stochastic gradient
updates. We prove that SGP converges to a stationary point of smooth,
non-convex objectives at the same sub-linear rate as SGD, and that all nodes
achieve consensus. We empirically validate the performance of SGP on image
classification (ResNet-50, ImageNet) and machine translation (Transformer,
WMT'16 En-De) workloads. Our code will be made publicly available.
Comment: ICML 2019.
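A single-process sketch of one SGP iteration, combining a local SGD step with PushSum mixing over a directed graph; the uniform splitting of mass among out-neighbors and the variable names are illustrative simplifications of the algorithm described above.

```python
import numpy as np

def sgp_iteration(x, w, grads, out_neighbors, lr=0.1):
    """One Stochastic Gradient Push iteration over all nodes (simplified sketch).

    x[i]             -- PushSum numerator (the parameters being mixed) at node i
    w[i]             -- PushSum weight (scalar denominator) at node i
    grads[i]         -- stochastic gradient at node i's de-biased estimate x[i]/w[i]
    out_neighbors[i] -- nodes that i pushes to this iteration (including itself)
    """
    n = len(x)
    # Local SGD step on the numerator.
    x = [x[i] - lr * grads[i] for i in range(n)]
    # PushSum mixing: each node splits its (x, w) equally among out-neighbors.
    new_x = [np.zeros_like(x[i]) for i in range(n)]
    new_w = [0.0] * n
    for i in range(n):
        share = 1.0 / len(out_neighbors[i])
        for j in out_neighbors[i]:
            new_x[j] = new_x[j] + share * x[i]
            new_w[j] = new_w[j] + share * w[i]
    # Each node's de-biased parameter estimate is new_x[i] / new_w[i].
    return new_x, new_w

# Toy usage: 3 nodes on a directed ring, pure consensus (zero gradients).
x = [np.array([0.0]), np.array([3.0]), np.array([6.0])]
w = [1.0, 1.0, 1.0]
neigh = [[0, 1], [1, 2], [2, 0]]
for _ in range(20):
    grads = [np.zeros(1) for _ in range(3)]
    x, w = sgp_iteration(x, w, grads, neigh)
print([float(xi / wi) for xi, wi in zip(x, w)])   # all close to the average 3.0
```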
PipeDream: Fast and Efficient Pipeline Parallel DNN Training
PipeDream is a Deep Neural Network (DNN) training system for GPUs that
parallelizes computation by pipelining execution across multiple machines. Its
pipeline parallel computing model avoids the slowdowns faced by data-parallel
training when large models and/or limited network bandwidth induce high
communication-to-computation ratios. PipeDream reduces communication by up to
95% for large DNNs relative to data-parallel training, and allows perfect
overlap of communication and computation. PipeDream keeps all available GPUs
productive by systematically partitioning DNN layers among them to balance work
and minimize communication, versions model parameters for backward pass
correctness, and schedules the forward and backward passes of different inputs
in round-robin fashion to optimize "time to target accuracy". Experiments with
five different DNNs on two different clusters show that PipeDream is up to 5x
faster in time-to-accuracy compared to data-parallel training.
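The round-robin interleaving of forward and backward passes can be sketched as a per-stage schedule generator; the warm-up rule and the one-forward-one-backward steady state below are an illustrative reconstruction of the pipelining idea, not PipeDream's actual scheduler or its weight-versioning logic.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Sketch of a round-robin forward/backward schedule per pipeline stage.

    Each stage issues a few warm-up forward passes, then alternates one
    forward and one backward pass for different in-flight inputs, so all
    stages stay busy (an illustrative schedule, not PipeDream's exact one)."""
    schedules = []
    for stage in range(num_stages):
        warmup = num_stages - stage          # forwards issued before the first backward
        ops, f, b = [], 0, 0
        while b < num_microbatches:
            if f < min(b + warmup, num_microbatches):
                ops.append(("F", f)); f += 1   # forward pass for input f
            if f >= b + warmup or f == num_microbatches:
                ops.append(("B", b)); b += 1   # backward pass for the oldest input
        schedules.append(ops)
    return schedules

# Example: 4 pipeline stages, 8 inputs in flight.
for s, ops in enumerate(pipeline_schedule(4, 8)):
    print(f"stage {s}: {ops}")
```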
Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
Large-scale distributed training requires significant communication bandwidth
for gradient exchange that limits the scalability of multi-node training, and
requires expensive high-bandwidth network infrastructure. The situation gets
even worse with distributed training on mobile devices (federated learning),
which suffers from higher latency, lower throughput, and intermittent poor
connections. In this paper, we find 99.9% of the gradient exchange in
distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to
greatly reduce the communication bandwidth. To preserve accuracy during
compression, DGC employs four methods: momentum correction, local gradient
clipping, momentum factor masking, and warm-up training. We have applied Deep
Gradient Compression to image classification, speech recognition, and language
modeling on multiple datasets including CIFAR-10, ImageNet, Penn Treebank, and
the LibriSpeech corpus. In these scenarios, Deep Gradient Compression achieves a
gradient compression ratio from 270x to 600x without losing accuracy, cutting
the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from
488MB to 0.74MB. Deep Gradient Compression enables large-scale distributed
training on inexpensive commodity 1Gbps Ethernet and facilitates distributed
training on mobile. Code is available at:
https://github.com/synxlin/deep-gradient-compression.
Comment: we find 99.9% of the gradient exchange in distributed SGD is
redundant; we reduce the communication bandwidth by two orders of magnitude
without losing accuracy. Code is available at:
https://github.com/synxlin/deep-gradient-compression.
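The core of the compression step can be sketched for a single flattened gradient tensor as below; the sparsity level and hyperparameters are illustrative, and local gradient clipping and warm-up training are omitted, so this is a simplified reading of DGC rather than the released implementation.

```python
import numpy as np

def dgc_compress(grad, residual, velocity, momentum=0.9, sparsity=0.999):
    """Sketch of one Deep Gradient Compression step on a flattened tensor.

    Momentum correction: accumulate the velocity locally, add it to a residual
    so small updates are not lost, and communicate only the largest entries;
    momentum factor masking clears the velocity at the positions that were sent."""
    velocity = momentum * velocity + grad          # local momentum
    residual = residual + velocity                 # accumulate unsent updates
    k = max(1, int(residual.size * (1.0 - sparsity)))
    # Indices of the k largest-magnitude accumulated values.
    idx = np.argpartition(np.abs(residual), -k)[-k:]
    sparse_update = np.zeros_like(residual)
    sparse_update[idx] = residual[idx]             # what actually gets communicated
    residual[idx] = 0.0                            # clear what was sent
    velocity[idx] = 0.0                            # momentum factor masking
    return sparse_update, residual, velocity

# Toy usage on a flattened 10,000-element gradient.
g = np.random.randn(10_000)
res, vel = np.zeros_like(g), np.zeros_like(g)
update, res, vel = dgc_compress(g, res, vel)
print(np.count_nonzero(update))    # ~10 entries sent instead of 10,000
```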
Strategies and Principles of Distributed Machine Learning on Big Data
The rise of Big Data has led to new demands for Machine Learning (ML) systems
to learn complex models with millions to billions of parameters that promise
adequate capacity to digest massive datasets and offer powerful predictive
analytics thereupon. In order to run ML algorithms at such scales, on a
distributed cluster with 10s to 1000s of machines, it is often the case that
significant engineering efforts are required --- and one might fairly ask if
such engineering truly falls within the domain of ML research or not. Taking
the view that Big ML systems can benefit greatly from ML-rooted statistical and
algorithmic insights --- and that ML researchers should therefore not shy away
from such systems design --- we discuss a series of principles and strategies
distilled from our recent efforts on industrial-scale ML solutions. These
principles and strategies span a continuum from application, to engineering,
and to theoretical research and development of Big ML systems and
architectures, with the goal of understanding how to make them efficient,
generally-applicable, and supported with convergence and scaling guarantees.
They concern four key questions which traditionally receive little attention in
ML research: How to distribute an ML program over a cluster? How to bridge ML
computation with inter-machine communication? How to perform such
communication? What should be communicated between machines? By exposing
underlying statistical and algorithmic characteristics unique to ML programs
but not typically seen in traditional computer programs, and by dissecting
successful cases to reveal how we have harnessed these principles to design and
develop both high-performance distributed ML software as well as
general-purpose ML frameworks, we present opportunities for ML researchers and
practitioners to further shape and grow the area that lies between ML and
systems.