Distributed Deep Learning Using Synchronous Stochastic Gradient Descent
We design and implement a distributed multinode synchronous SGD algorithm,
without altering hyperparameters, compressing data, or altering algorithmic
behavior. We perform a detailed analysis of scaling, and identify optimal
design points for different networks. We demonstrate scaling of CNNs on hundreds of nodes, and present what we believe to be record training throughputs: a VGG-A CNN with a minibatch of 512 scales 90X on 128 nodes, while with a minibatch of 256, VGG-A and OverFeat-FAST scale 53X and 42X, respectively, on a 64-node cluster. We also demonstrate the generality of our approach via best-in-class 6.5X scaling for a 7-layer DNN on 16 nodes. Thereafter we attempt to democratize deep learning by training on an Ethernet-based AWS cluster, showing ~14X scaling on 16 nodes.
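To make the setup concrete, here is a minimal single-process sketch of one synchronous data-parallel SGD step as described above: each worker computes a gradient on its own shard, the gradients are averaged, and the identical update is applied everywhere. The least-squares model, four-way sharding, and learning rate are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def synchronous_sgd_step(params, worker_batches, grad_fn, lr=0.01):
        # Each worker computes a gradient on its local shard; the gradients
        # are averaged (the all-reduce step) and the same update is applied
        # everywhere, so all replicas stay bit-identical.
        grads = [grad_fn(params, batch) for batch in worker_batches]
        avg_grad = np.mean(grads, axis=0)
        return params - lr * avg_grad

    # Toy usage: least-squares regression sharded across 4 simulated workers.
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(256, 8)), rng.normal(size=256)
    shards = [(X[i::4], y[i::4]) for i in range(4)]
    grad_fn = lambda w, b: 2.0 * b[0].T @ (b[0] @ w - b[1]) / len(b[1])
    w = np.zeros(8)
    for _ in range(100):
        w = synchronous_sgd_step(w, shards, grad_fn)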
A Hitchhiker's Guide On Distributed Training of Deep Neural Networks
Deep learning has led to tremendous advancements in the field of Artificial
Intelligence. One caveat, however, is the substantial amount of compute needed to train these deep learning models. Training on a benchmark dataset like ImageNet with a single modern GPU can take up to a week; distributing training across multiple machines has been observed to bring this time down drastically. Recent work has brought ImageNet training time down to as little as 4 minutes by using a cluster of 2048 GPUs. This paper surveys the various
algorithms and techniques used to distribute training and presents the current
state of the art for a modern distributed training framework. More
specifically, we explore the synchronous and asynchronous variants of
distributed Stochastic Gradient Descent, various All Reduce gradient
aggregation strategies, and best practices for obtaining higher throughput and lower latency over a cluster, such as mixed precision training, large batch training, and gradient compression.
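Among the all-reduce aggregation strategies such a survey covers, the bandwidth-optimal ring all-reduce is the workhorse; below is a minimal in-process simulation of its two phases (reduce-scatter, then all-gather), with the ring's message passing replaced by direct array writes. This is a sketch of the general technique, not code from the paper.

    import numpy as np

    def ring_allreduce(worker_grads):
        # Each worker's gradient is split into n chunks; every step, worker i
        # passes one chunk to worker (i + 1) % n. After n - 1 accumulation
        # steps (reduce-scatter) and n - 1 copy steps (all-gather), every
        # worker holds the full sum while sending only 2 * (n - 1) / n of
        # the gradient's size in total.
        n = len(worker_grads)
        chunks = [np.array_split(g.astype(float), n) for g in worker_grads]
        for step in range(n - 1):          # reduce-scatter
            for i in range(n):
                c = (i - step) % n
                chunks[(i + 1) % n][c] += chunks[i][c]
        for step in range(n - 1):          # all-gather
            for i in range(n):
                c = (i + 1 - step) % n
                chunks[(i + 1) % n][c] = chunks[i][c].copy()
        return [np.concatenate(c) for c in chunks]

    # Toy usage: 4 workers, each contributing a constant gradient.
    sums = ring_allreduce([np.full(10, float(i)) for i in range(4)])
    assert all(np.allclose(s, 6.0) for s in sums)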
High Throughput Synchronous Distributed Stochastic Gradient Descent
We introduce a new, high-throughput, synchronous, distributed, data-parallel,
stochastic-gradient-descent learning algorithm. This algorithm uses amortized
inference in a compute-cluster-specific, deep, generative, dynamical model to
perform joint posterior predictive inference of the mini-batch gradient
computation times of all worker-nodes in a parallel computing cluster. We show
that a synchronous parameter server can, by utilizing such a model, choose an optimal cutoff time, beyond which mini-batch gradient messages from slow workers are ignored, that maximizes the overall rate of mini-batch gradient computations per second.
In keeping with earlier findings, we observe that, under realistic conditions,
eagerly discarding the mini-batch gradient computations of stragglers not only
increases throughput but actually increases the overall rate of convergence as
a function of wall-clock time by virtue of eliminating idleness. The principal
novel contribution and finding of this work goes beyond this by demonstrating
that using the predicted run-times from a generative model of cluster worker
performance to dynamically adjust the cutoff improves substantially over the
static-cutoff prior art, leading to, among other things, significantly reduced
deep neural net training times on large computer clusters.
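Independent of the paper's generative run-time model, the cutoff-selection step itself can be sketched: given predicted per-worker finish times, the throughput-maximizing cutoff is found by checking each candidate finish time. The lognormal straggler distribution below is an assumption standing in for the paper's amortized-inference predictions.

    import numpy as np

    def best_cutoff(predicted_times):
        # At cutoff t, gradients from workers predicted to finish after t are
        # discarded, so expected throughput is (#workers done by t) / t; only
        # the predicted finish times themselves are candidate maximizers.
        times = np.sort(np.asarray(predicted_times, dtype=float))
        throughput = np.arange(1, times.size + 1) / times
        return times[np.argmax(throughput)]

    # Toy usage: 64 workers with heavy-tailed (assumed lognormal) run-times.
    rng = np.random.default_rng(1)
    cutoff = best_cutoff(rng.lognormal(mean=0.0, sigma=0.7, size=64))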
Scaling GRPC Tensorflow on 512 nodes of Cori Supercomputer
We explore scaling of the standard distributed Tensorflow with GRPC
primitives on up to 512 Intel Xeon Phi (KNL) nodes of the Cori supercomputer with
synchronous stochastic gradient descent (SGD), and identify causes of scaling
inefficiency at higher node counts. To our knowledge, this is the first
exploration of distributed GRPC Tensorflow scalability on an HPC supercomputer
at such a large scale with synchronous SGD. We studied the scaling of two convolutional
neural networks - ResNet-50, a state-of-the-art deep network for classification
with roughly 25.5 million parameters, and HEP-CNN, a shallow topology with fewer than 1 million parameters for common scientific use cases. For ResNet-50, we
achieve >80% scaling efficiency on up to 128 workers, using 32 parameter servers (PS tasks), with a steep decline to 23% for 512 workers using 64 PS
tasks. Our analysis of the efficiency drop points to low network bandwidth
utilization due to the combined effect of three factors. (a) The heterogeneous distributed parallelization algorithm, which uses PS tasks as centralized servers for gradient averaging, is suboptimal for utilizing interconnect bandwidth. (b) Load imbalance among PS tasks hinders their efficient scaling. (c) The underlying communication primitive, GRPC, is currently inefficient on Cori's high-speed interconnect. HEP-CNN demands less interconnect bandwidth, and
shows >80% weak scaling efficiency for up to 256 nodes with only 1 PS task. Our
findings are applicable to other deep learning networks: large networks with millions of parameters run into the issues discussed here, while shallower networks like HEP-CNN, with relatively few parameters, can efficiently enjoy weak scaling even with a single parameter server.
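Factor (b) above stems from how model variables are assigned to PS tasks. As a hedged illustration (not from the paper), the sketch below contrasts with the size-oblivious round-robin default by greedily packing variables onto the least-loaded server; the variable names and sizes are made up.

    import heapq

    def place_variables(var_sizes, num_ps):
        # Greedy bin packing: repeatedly assign the largest remaining variable
        # to the currently least-loaded PS task, so no single task ends up
        # holding most of the bytes (and hence most of the gradient traffic).
        heap = [(0, ps) for ps in range(num_ps)]   # (bytes assigned, ps id)
        heapq.heapify(heap)
        placement = {}
        for name, size in sorted(var_sizes.items(), key=lambda kv: -kv[1]):
            load, ps = heapq.heappop(heap)
            placement[name] = ps
            heapq.heappush(heap, (load + size, ps))
        return placement

    # Toy usage: one huge fc layer plus many small conv layers (hypothetical).
    sizes = {"fc/W": 2048 * 1000}
    sizes.update({f"conv{i}/W": 3 * 3 * 64 * 64 for i in range(16)})
    print(place_variables(sizes, num_ps=4))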
Revisiting Distributed Synchronous SGD
Distributed training of deep learning models on large-scale training data is
typically conducted with asynchronous stochastic optimization to maximize the
rate of updates, at the cost of additional noise introduced from asynchrony. In
contrast, the synchronous approach is often thought to be impractical due to
idle time wasted on waiting for straggling workers. We revisit these
conventional beliefs in this paper, and examine the weaknesses of both
approaches. We demonstrate that a third approach, synchronous optimization with
backup workers, can avoid asynchronous noise while mitigating the worst
stragglers. Our approach is empirically validated and shown to converge faster
and to better test accuracies.
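The backup-worker rule reduces to a small aggregation step; here is a minimal sketch in which asynchronously completing worker results are stood in for by (finish_time, gradient) pairs. The timing model and dimensions are assumptions for illustration.

    import numpy as np

    def sync_step_with_backups(params, results, n_needed, lr=0.01):
        # Launch N + b gradient computations but aggregate only the first N
        # to finish; the b stragglers' results are simply dropped, so no
        # asynchronous staleness enters the update.
        first_n = sorted(results, key=lambda fg: fg[0])[:n_needed]
        avg_grad = np.mean([g for _, g in first_n], axis=0)
        return params - lr * avg_grad

    # Toy usage: 10 workers with 2 backups (keep the fastest 8).
    rng = np.random.default_rng(2)
    results = [(rng.exponential(), rng.normal(size=5)) for _ in range(10)]
    w = sync_step_with_backups(np.zeros(5), results, n_needed=8)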
Sparse Communication for Training Deep Networks
Synchronous stochastic gradient descent (SGD) is the most common method used
for distributed training of deep learning models. In this algorithm, each
worker shares its local gradients with others and updates the parameters using
the average gradients of all workers. Although distributed training reduces the
computation time, the communication overhead associated with the gradient
exchange forms a scalability bottleneck for the algorithm. There are many
compression techniques proposed to reduce the number of gradients that need to
be communicated. However, compressing the gradients introduces yet another
overhead to the problem. In this work, we study several compression schemes and
identify how three key parameters affect the performance. We also provide a set
of insights on how to increase performance and introduce a simple
sparsification scheme, random-block sparsification, that reduces communication
while keeping the performance close to that of standard SGD.
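One plausible reading of random-block sparsification is sketched below: instead of a top-k selection that must transmit a full index list, each worker ships one random contiguous block per step, which costs only a start offset to encode. The block fraction and the omission of residual accumulation are assumptions; the paper's exact protocol may differ.

    import numpy as np

    def random_block_sparsify(grad, block_frac=0.01, rng=None):
        # Pick one contiguous block of the flattened gradient; a block is
        # described by (start, values) alone, so the compression itself adds
        # almost no indexing overhead.
        rng = rng or np.random.default_rng()
        flat = grad.ravel()
        block = max(1, int(block_frac * flat.size))
        start = int(rng.integers(0, flat.size - block + 1))
        return start, flat[start:start + block].copy()

    def densify(start, values, shape):
        # Receiver side: rebuild a dense gradient, zero outside the block.
        out = np.zeros(int(np.prod(shape)))
        out[start:start + values.size] = values
        return out.reshape(shape)

    # Toy usage round trip.
    g = np.random.default_rng(3).normal(size=(32, 32))
    start, vals = random_block_sparsify(g)
    g_sparse = densify(start, vals, g.shape)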
Nonlinear Conjugate Gradients For Scaling Synchronous Distributed DNN Training
Nonlinear conjugate gradient (NLCG) based optimizers have shown superior loss
convergence properties compared to gradient descent based optimizers for
traditional optimization problems. However, in Deep Neural Network (DNN)
training, the dominant optimization algorithm of choice is still Stochastic
Gradient Descent (SGD) and its variants. In this work, we propose and evaluate
the stochastic preconditioned nonlinear conjugate gradient algorithm for large-scale DNN training tasks. We show that a nonlinear conjugate gradient algorithm
improves the convergence speed of DNN training, especially in the large
mini-batch scenario, which is essential for scaling synchronous distributed DNN
training to a large number of workers. We show how to efficiently use second-order information in the NLCG preconditioner to improve DNN training
convergence. For the ImageNet classification task, at extremely large
mini-batch sizes of greater than 65k, the NLCG optimizer is able to improve top-1 accuracy by more than 10 percentage points for standard training of the Resnet-50 model for 90 epochs. For the CIFAR-100 classification task, at extremely large mini-batch sizes of greater than 16k, the NLCG optimizer is able to improve top-1 accuracy by more than 15 percentage points for standard training of the Resnet-32 model for 200 epochs.
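As a minimal deterministic sketch of the direction update behind such an optimizer, the following implements the preconditioned Polak-Ribiere nonlinear conjugate gradient rule with a PR+ restart, demonstrated on a quadratic with exact line search. The paper's stochastic, second-order-preconditioned variant is more elaborate; the diagonal preconditioner here is an assumption.

    import numpy as np

    def nlcg_direction(grad, state, precond=None):
        # d_k = -P g_k + beta_k d_{k-1}, with the Polak-Ribiere coefficient
        # beta_k = max(0, g_k^T P (g_k - g_{k-1}) / (g_{k-1}^T P g_{k-1})).
        P = precond if precond is not None else np.ones_like(grad)
        pg = P * grad
        if state is None:                       # first step: steepest descent
            return -pg, (grad, pg, -pg)
        g_prev, pg_prev, d_prev = state
        beta = max(0.0, grad @ (pg - pg_prev) / (g_prev @ pg_prev))
        d = -pg + beta * d_prev
        return d, (grad, pg, d)

    # Toy usage: minimize 0.5 x^T A x - b^T x with exact line search.
    rng = np.random.default_rng(4)
    G = rng.normal(size=(20, 20))
    A, b = G @ G.T + 20 * np.eye(20), rng.normal(size=20)
    x, state = np.zeros(20), None
    for _ in range(50):
        g = A @ x - b
        d, state = nlcg_direction(g, state)
        x = x + (-(g @ d) / (d @ A @ d)) * d    # exact step on a quadratic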
Bandwidth Reduction using Importance Weighted Pruning on Ring AllReduce
Training large deep learning models inevitably requires a large-scale cluster equipped with accelerators. Deep gradient compression can greatly improve bandwidth utilization and speed up the training process, but it is hard to implement on a ring structure. In this paper, we find that redundant gradients and gradient staleness have a negative effect on training. We have observed that, across different epochs and steps, neural networks focus on updating different layers and different parameters. In order to save communication bandwidth and preserve accuracy on a ring structure, whose restrictions grow as the node count increases, we propose a new algorithm for measuring the importance of gradients on a large-scale cluster implementing ring all-reduce, based on the ratio of a parameter's gradient to its value. Our importance-weighted pruning approach achieves gradient compression ratios of 64X and 58.8X for AlexNet and ResNet50 on ImageNet. Meanwhile, in order to maintain the sparseness of gradient propagation, each node randomly broadcasts the indices of its important gradients, while the remaining nodes gather the indexed gradients and perform the all-reduce update. This speeds up the convergence of the model and preserves the training accuracy.
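A hedged sketch of the stated importance measure, the ratio of a parameter's gradient to its value, used as a top-k filter; the 1/64 keep fraction is chosen to mirror the reported 64X compression, and the ring broadcast protocol around it is not reproduced.

    import numpy as np

    def importance_prune(grad, param, keep_frac=1.0 / 64, eps=1e-8):
        # Score each gradient by |g / w| and keep only the top keep_frac of
        # entries; the returned (indices, values) pair is what would travel
        # around the ring instead of the dense tensor.
        score = np.abs(grad) / (np.abs(param) + eps)
        k = max(1, int(keep_frac * grad.size))
        idx = np.argpartition(score.ravel(), -k)[-k:]
        return idx, grad.ravel()[idx]

    # Toy usage: ~64X compression on a million-parameter layer.
    rng = np.random.default_rng(5)
    w, g = rng.normal(size=1_000_000), rng.normal(size=1_000_000)
    idx, vals = importance_prune(g, w)
    assert vals.size == 15625          # 1_000_000 / 64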
Asynchronous Decentralized Parallel Stochastic Gradient Descent
Most commonly used distributed machine learning systems are either
synchronous or centralized asynchronous. Synchronous algorithms like
AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous
algorithms using a parameter server suffer from 1) communication bottleneck at
parameter servers when workers are many, and 2) significantly worse convergence
when the traffic to parameter server is congested. Can we design an algorithm
that is robust in a heterogeneous environment, while being communication
efficient and maintaining the best-possible convergence rate? In this paper, we
propose an asynchronous decentralized parallel stochastic gradient descent algorithm (AD-PSGD) satisfying all of the above expectations. Our theoretical analysis shows that AD-PSGD converges at the same optimal rate as SGD and achieves linear speedup w.r.t. the number of workers. Empirically, AD-PSGD outperforms the best of
decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and
standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a
heterogeneous environment. When training ResNet-50 on ImageNet with up to 128
GPUs, AD-PSGD converges (w.r.t. epochs) similarly to AllReduce-SGD, but each epoch can be up to 4-8X faster than its synchronous counterparts in a network-sharing HPC environment. To the best of our knowledge, AD-PSGD is the first asynchronous algorithm that achieves a similar epoch-wise convergence rate to AllReduce-SGD at a scale of over 100 GPUs.
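The core AD-PSGD iteration is compact: a worker averages its model with one randomly chosen peer and applies a local stochastic gradient, with no global barrier. The single-process sketch below runs workers round-robin purely for simulation and applies the gradient at the averaged model, a simplification of the paper's overlap of computation and communication; the toy loss and noise are assumptions.

    import numpy as np

    def ad_psgd_step(models, i, grad_fn, rng, lr=0.05):
        # Pairwise averaging with a random neighbor replaces the global
        # all-reduce; only two workers synchronize per update, so stragglers
        # slow down nobody else.
        j = int(rng.integers(0, len(models) - 1))
        j = j + 1 if j >= i else j              # uniform peer choice, j != i
        avg = 0.5 * (models[i] + models[j])
        models[i] = avg - lr * grad_fn(avg)     # local stochastic gradient step
        models[j] = avg
        return models

    # Toy usage: 8 workers jointly minimizing ||x - 1||^2 from different starts.
    rng = np.random.default_rng(6)
    models = [rng.normal(size=4) for _ in range(8)]
    grad_fn = lambda x: 2.0 * (x - 1.0) + 0.1 * rng.normal(size=4)
    for t in range(400):
        models = ad_psgd_step(models, t % 8, grad_fn, rng)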
Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD
Large-scale machine learning training, in particular distributed stochastic
gradient descent, needs to be robust to inherent system variability such as
node straggling and random communication delays. This work considers a
distributed training framework where each worker node is allowed to perform
local model updates and the resulting models are averaged periodically. We
analyze the true speed of error convergence with respect to wall-clock time
(instead of the number of iterations), and study how it is affected by the
frequency of averaging. The main contribution is the design of AdaComm, an
adaptive communication strategy that starts with infrequent averaging to save
communication delay and improve convergence speed, and then increases the
communication frequency in order to achieve a low error floor. Rigorous
experiments on training deep neural networks show that AdaComm can take less time than fully synchronous SGD and still reach the same final training loss.
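A minimal sketch of local-update SGD with an AdaComm-style decreasing communication period: workers take several local steps between averaging rounds, and the period shrinks over time. The fixed halving schedule below is an assumption for illustration; the paper adapts the period from the observed training loss.

    import numpy as np

    def local_update_round(models, grad_fns, period, lr=0.05):
        # Each worker takes `period` local SGD steps, then all models are
        # averaged in a single communication round.
        for _ in range(period):
            models = [m - lr * g(m) for m, g in zip(models, grad_fns)]
        avg = np.mean(models, axis=0)
        return [avg.copy() for _ in models]

    def adacomm_period(initial_period, round_idx, halve_every=10):
        # Start with infrequent averaging (large period, low communication
        # delay), then halve the period so later rounds approach fully
        # synchronous SGD and reach a lower error floor.
        return max(1, initial_period >> (round_idx // halve_every))

    # Toy usage: 4 workers, period decaying 16 -> 8 -> 4 -> ...
    rng = np.random.default_rng(7)
    models = [rng.normal(size=4) for _ in range(4)]
    grad_fns = [lambda x, r=rng: 2.0 * (x - 1.0) + 0.1 * r.normal(size=4)] * 4
    for r in range(40):
        models = local_update_round(models, grad_fns, adacomm_period(16, r))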