Bandwidth Reduction using Importance Weighted Pruning on Ring AllReduce
Training large deep learning models increasingly requires large-scale clusters
equipped with accelerators. Deep gradient compression can greatly improve
bandwidth utilization and speed up training, but it is hard to implement on a
ring structure. In this paper, we find that redundant gradients and gradient
staleness have a negative effect on training. We also observe that, in
different epochs and at different steps, the neural network focuses on updating
different layers and different parameters. To save communication bandwidth and
preserve accuracy on a ring structure, breaking the restriction that arises as
the number of nodes increases, we propose a new algorithm for measuring the
importance of gradients on a large-scale cluster running ring all-reduce, based
on the magnitude of the ratio of each computed gradient to its parameter value.
Our importance-weighted pruning approach achieves gradient compression ratios
of 64x on AlexNet and 58.8x on ResNet-50 on ImageNet. Meanwhile, to keep the
propagated gradients sparse, each node broadcasts the indices of its important
gradients at random; the remaining nodes collect the gradients at those indices
and perform the all-reduce update. This speeds up model convergence and
preserves training accuracy.
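As an illustration of the ratio-based importance measure described above, here is a minimal PyTorch sketch; the keep_ratio value and the top-k selection rule are illustrative assumptions, not the authors' implementation.

    import torch

    def importance_prune(param, grad, keep_ratio=0.01):
        # Importance of each entry: |gradient / parameter| (the eps term is an
        # assumption to avoid division by zero), per the ratio described above.
        importance = (grad / (param.abs() + 1e-12)).abs().reshape(-1)
        k = max(1, int(keep_ratio * importance.numel()))
        _, idx = torch.topk(importance, k)   # indices of the most important gradients
        sparse = torch.zeros_like(grad).reshape(-1)
        sparse[idx] = grad.reshape(-1)[idx]
        # Return the pruned gradient for all-reduce and the indices to broadcast.
        return sparse.view_as(grad), idx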
Proportionate gradient updates with PercentDelta
Deep Neural Networks are generally trained using iterative gradient updates.
Magnitudes of gradients are affected by many factors, including choice of
activation functions and initialization. More importantly, gradient magnitudes
can greatly differ across layers, with some layers receiving much smaller
gradients than others, causing some layers to train more slowly than others and
therefore slowing down the overall convergence. We analytically explain this
disproportionality. Then we propose to explicitly train all layers at the same
speed, by scaling the gradient w.r.t. every trainable tensor to be proportional
to its current value. In particular, at every batch, we want to update all
trainable tensors, such that the relative change of the L1-norm of the tensors
is the same, across all layers of the network, throughout training time.
Experiments on MNIST show that our method appropriately scales gradients, such
that the relative change in trainable tensors is approximately equal across
layers. In addition, measuring the test accuracy with training time, shows that
our method trains faster than other methods, giving higher test accuracy for the
same budget of training steps.
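A minimal sketch of the per-tensor scaling idea, assuming the goal is that every tensor's relative L1 change per step equals one shared rate; the target_delta name and value are illustrative, not the paper's exact optimizer.

    import torch

    def percent_delta_step(params, target_delta=1e-3):
        # Rescale each tensor's gradient so that ||update||_1 / ||tensor||_1 equals
        # target_delta for every layer (sketch of proportionate updates).
        for p in params:
            if p.grad is None:
                continue
            g_norm = p.grad.abs().sum()
            w_norm = p.detach().abs().sum()
            if g_norm > 0 and w_norm > 0:
                scale = target_delta * w_norm / g_norm
                p.data.add_(p.grad, alpha=-scale.item())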
Image Classification at Supercomputer Scale
Deep learning is extremely computationally intensive, and hardware vendors
have responded by building faster accelerators in large clusters. Training deep
learning models at petaFLOPS scale requires overcoming both algorithmic and
systems software challenges. In this paper, we discuss three systems-related
optimizations: (1) distributed batch normalization to control per-replica batch
sizes, (2) input pipeline optimizations to sustain model throughput, and (3)
2-D torus all-reduce to speed up gradient summation. We combine these
optimizations to train ResNet-50 on ImageNet to 76.3% accuracy in 2.2 minutes
on a 1024-chip TPU v3 Pod with a training throughput of over 1.05 million
images/second and no accuracy drop.
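A single-process sketch of the distributed batch-normalization idea in (1), assuming per_replica_x is a list of per-replica activation batches and the cross-replica all-reduce is simulated by concatenation; group_size is the knob that controls the effective batch size seen by the normalization.

    import torch

    def distributed_batch_norm(per_replica_x, group_size, eps=1e-5):
        # Compute BN statistics over groups of `group_size` replicas rather than
        # per replica, simulating the cross-replica reduction with a concat.
        out = []
        for start in range(0, len(per_replica_x), group_size):
            group = per_replica_x[start:start + group_size]
            pooled = torch.cat(group, dim=0)
            mean = pooled.mean(dim=0)
            var = pooled.var(dim=0, unbiased=False)
            out.extend([(x - mean) / torch.sqrt(var + eps) for x in group])
        return out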
Distributed Deep Learning Strategies For Automatic Speech Recognition
In this paper, we propose and investigate a variety of distributed deep
learning strategies for automatic speech recognition (ASR) and evaluate them
with a state-of-the-art Long short-term memory (LSTM) acoustic model on the
2000-hour Switchboard (SWB2000), which is one of the most widely used datasets
for ASR performance benchmarking. We first investigate the proper
hyper-parameters (e.g., learning rate) needed to enable training with a
sufficiently large batch size without impairing model accuracy. We then implement
various distributed strategies, including Synchronous (SYNC), Asynchronous
Decentralized Parallel SGD (ADPSGD), and a hybrid of the two (HYBRID), to study
their runtime/accuracy trade-off. We show that we can train the LSTM model
using ADPSGD in 14 hours with 16 NVIDIA P100 GPUs to reach a 7.6% WER on the
Hub5-2000 Switchboard (SWB) test set and a 13.1% WER on the CallHome (CH) test
set. Furthermore, we can train the model using HYBRID in 11.5 hours with 32
NVIDIA V100 GPUs without loss in accuracy.
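A single-process sketch of the decentralized update pattern behind ADPSGD, assuming each worker's model is a list of parameter values and communication is simulated in-process; the ring-neighbor choice and averaging rule are illustrative, not the exact algorithm evaluated in the paper.

    def adpsgd_round(models, grads, lr=0.01):
        # Each worker averages its weights with a ring neighbor, then applies its
        # own local gradient (simulated; no real asynchrony or message passing).
        n = len(models)
        averaged = [[(w + v) / 2 for w, v in zip(models[i], models[(i + 1) % n])]
                    for i in range(n)]
        return [[w - lr * g for w, g in zip(ws, gs)]
                for ws, gs in zip(averaged, grads)]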
Scaling Distributed Training of Flood-Filling Networks on HPC Infrastructure for Brain Mapping
Mapping all the neurons in the brain requires automatic reconstruction of
entire cells from volume electron microscopy data. The flood-filling network
(FFN) architecture has demonstrated leading performance for segmenting
structures from this data. However, the training of the network is
computationally expensive. In order to reduce the training time, we implemented
synchronous and data-parallel distributed training using the Horovod library,
which is different from the asynchronous training scheme used in the published
FFN code. We demonstrated that our distributed training scaled well up to 2048
Intel Knights Landing (KNL) nodes on the Theta supercomputer. Our trained
models achieved a similar level of inference performance but took less training
time than previous methods. Our study of the effects of different batch sizes
on FFN training suggests ways to further improve training efficiency. Our
findings on the optimal learning rate and batch sizes agree with previous work.
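A minimal Horovod-style synchronous data-parallel skeleton of the kind described above, using the TensorFlow/Keras Horovod API; the model, optimizer settings, and dataset here are placeholders rather than the FFN training pipeline from the paper.

    import horovod.tensorflow.keras as hvd
    import tensorflow as tf

    hvd.init()                                                  # one process per worker
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])   # placeholder model
    # Scale the learning rate with the worker count and wrap the optimizer so
    # gradients are averaged across workers with an all-reduce every step.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(optimizer=opt, loss="mse")
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
    # model.fit(dataset.shard(hvd.size(), hvd.rank()), callbacks=callbacks, ...)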
High Throughput Synchronous Distributed Stochastic Gradient Descent
We introduce a new, high-throughput, synchronous, distributed, data-parallel,
stochastic-gradient-descent learning algorithm. This algorithm uses amortized
inference in a compute-cluster-specific, deep, generative, dynamical model to
perform joint posterior predictive inference of the mini-batch gradient
computation times of all worker-nodes in a parallel computing cluster. We show
that a synchronous parameter server can, by utilizing such a model, choose an
optimal cutoff time, beyond which mini-batch gradient messages from slow
workers are ignored, that maximizes the overall number of mini-batch gradient
computations per second.
In keeping with earlier findings we observe that, under realistic conditions,
eagerly discarding the mini-batch gradient computations of stragglers not only
increases throughput but actually increases the overall rate of convergence as
a function of wall-clock time by virtue of eliminating idleness. The principal
novel contribution and finding of this work goes beyond this by demonstrating
that using the predicted run-times from a generative model of cluster worker
performance to dynamically adjust the cutoff improves substantially over the
static-cutoff prior art, leading to, among other things, significantly reduced
deep neural net training times on large computer clusters.
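A toy sketch of the cutoff selection described above, assuming the generative model is replaced by an array of point predictions of per-worker runtimes; it simply maximizes (gradients received) / (cutoff time) over candidate cutoffs.

    import numpy as np

    def choose_cutoff(predicted_runtimes):
        # With cutoff t, only workers predicted to finish within t contribute a
        # gradient and the step takes time t; pick t maximizing gradients per second.
        times = np.sort(np.asarray(predicted_runtimes, dtype=float))
        throughput = [(times <= t).sum() / t for t in times]
        return times[int(np.argmax(throughput))]

    # Example: for predictions [1.0, 1.1, 1.2, 5.0] s the chosen cutoff is 1.2 s,
    # since 3 gradients / 1.2 s beats waiting 5.0 s for all 4.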
Parallel Complexity of Forward and Backward Propagation
We show that the forward and backward propagation can be formulated as a
solution of lower and upper triangular systems of equations. For standard
feedforward (FNNs) and recurrent neural networks (RNNs) the triangular systems
are always block bi-diagonal, while for a general computation graph (directed
acyclic graph) they can have a more complex triangular sparsity pattern. We
discuss direct and iterative parallel algorithms that can be used for their
solution and interpreted as different ways of performing model parallelism.
Also, we show that for FNNs and RNNs with a given number of layers and time
steps, the backward propagation can be performed in parallel in O() and O()
steps, respectively. Finally, we outline the generalization of this technique
using Jacobians, which potentially allows us to handle arbitrary layers.
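For intuition, here is the triangular-system view for a three-layer network with linear activations a_l = W_l a_{l-1} (a simplification; the paper's formulation also covers nonlinear activations and general computation graphs). The forward pass solves a block bi-diagonal lower triangular system, and backpropagation solves the transposed upper triangular system for the adjoints g_l = dL/da_l:

    \begin{bmatrix} I & & \\ -W_2 & I & \\ & -W_3 & I \end{bmatrix}
    \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}
    =
    \begin{bmatrix} W_1 x \\ 0 \\ 0 \end{bmatrix},
    \qquad
    \begin{bmatrix} I & -W_2^{\top} & \\ & I & -W_3^{\top} \\ & & I \end{bmatrix}
    \begin{bmatrix} g_1 \\ g_2 \\ g_3 \end{bmatrix}
    =
    \begin{bmatrix} 0 \\ 0 \\ \nabla_{a_3} L \end{bmatrix}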
Stochastic Normalized Gradient Descent with Momentum for Large Batch Training
Stochastic gradient descent (SGD) and its variants have been the dominating
optimization methods in machine learning. Compared with small batch training,
SGD with large batch training can better utilize the computational power of
current multi-core systems like GPUs and can reduce the number of communication
rounds in distributed training. Hence, SGD with large batch training has
attracted more and more attention. However, existing empirical results show
that large batch training typically leads to a drop in generalization accuracy.
As a result, large batch training has also become a challenging topic. In this
paper, we propose a novel method, called stochastic normalized gradient descent
with momentum (SNGM), for large batch training. We theoretically prove that
compared to momentum SGD (MSGD) which is one of the most widely used variants
of SGD, SNGM can adopt a larger batch size to converge to an
ε-stationary point with the same computation complexity (total number
of gradient computations). Empirical results on deep learning also show that
SNGM can achieve state-of-the-art accuracy with a large batch size.
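A minimal sketch of a normalized-gradient-with-momentum step of the kind described above, assuming the stochastic gradient is normalized by its global L2 norm before entering the momentum buffer; the exact normalization and hyper-parameters in the paper may differ.

    import torch

    def sngm_step(params, momentum_buf, lr=0.1, beta=0.9):
        # Normalize the full stochastic gradient, then apply momentum and update.
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.cat([g.reshape(-1) for g in grads]).norm() + 1e-12
        for p, buf in zip(params, momentum_buf):
            if p.grad is None:
                continue
            buf.mul_(beta).add_(p.grad / norm)   # momentum on the normalized gradient
            p.data.add_(buf, alpha=-lr)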
AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks
Training deep neural networks with Stochastic Gradient Descent, or its
variants, requires careful choice of both learning rate and batch size. While
smaller batch sizes generally converge in fewer training epochs, larger batch
sizes offer more parallelism and hence better computational efficiency. We have
developed a new training approach that, rather than statically choosing a
single batch size for all epochs, adaptively increases the batch size during
the training process. Our method delivers the convergence rate of small batch
sizes while achieving performance similar to large batch sizes. We analyse our
approach using the standard AlexNet, ResNet, and VGG networks operating on the
popular CIFAR-10, CIFAR-100, and ImageNet datasets. Our results demonstrate
that learning with adaptive batch sizes can improve performance by factors of
up to 6.25 on 4 NVIDIA Tesla P100 GPUs while changing accuracy by less than 1%
relative to training with fixed batch sizes.
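A sketch of an adaptive batch-size schedule in the spirit of the approach above; the concrete numbers (base batch, doubling interval, cap) are placeholders, not the paper's settings.

    def adaptive_batch_size(epoch, base_batch=128, double_every=30, max_batch=4096):
        # Start small for fast convergence, then double the batch size every
        # `double_every` epochs to gain parallel efficiency later in training.
        return min(max_batch, base_batch * (2 ** (epoch // double_every)))

    # e.g. epochs 0-29 use 128, epochs 30-59 use 256, epochs 60-89 use 512, ...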
Scale MLPerf-0.6 models on Google TPU-v3 Pods
The recent submission of Google TPU-v3 Pods to the industry wide MLPerf v0.6
training benchmark demonstrates the scalability of a suite of industry relevant
ML models. MLPerf defines a suite of models, datasets and rules to follow when
benchmarking to ensure results are comparable across hardware, frameworks and
companies. Using this suite of models, we discuss the optimizations and
techniques including choice of optimizer, spatial partitioning and weight
update sharding necessary to scale to 1024 TPU chips. Furthermore, we identify
properties of models that make scaling them challenging, such as limited data
parallelism and unscaled weights. These optimizations contribute to record
performance on the Transformer, ResNet-50, and SSD models in the Google
MLPerf-0.6 submission.
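A single-process sketch of the weight-update-sharding idea mentioned above, assuming plain SGD and NumPy arrays, with the collectives simulated in-process: each replica applies the optimizer step only to its own shard of the weights, and the updated shards are then gathered back together.

    import numpy as np

    def sharded_weight_update(weights, grads, num_replicas, lr=0.1):
        # Split the flattened weights into one shard per replica, update each
        # shard independently (per-replica optimizer step), then "all-gather".
        flat_w, flat_g = weights.ravel(), grads.ravel()
        shards = np.array_split(np.arange(flat_w.size), num_replicas)
        updated = [flat_w[idx] - lr * flat_g[idx] for idx in shards]
        return np.concatenate(updated).reshape(weights.shape)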