Revisiting Distributed Synchronous SGD
Distributed training of deep learning models on large-scale training data is
typically conducted with asynchronous stochastic optimization to maximize the
rate of updates, at the cost of additional noise introduced from asynchrony. In
contrast, the synchronous approach is often thought to be impractical due to
idle time wasted on waiting for straggling workers. We revisit these
conventional beliefs in this paper, and examine the weaknesses of both
approaches. We demonstrate that a third approach, synchronous optimization with
backup workers, can avoid asynchronous noise while mitigating the impact of the worst stragglers. Our approach is empirically validated and shown to converge faster and to reach better test accuracies. Comment: 10 pages
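A minimal simulation sketch of the backup-worker idea described above (not the paper's implementation; the quadratic loss, worker delays, and hyper-parameters are illustrative assumptions): the server aggregates only the first num_workers - num_backups gradients per step and ignores the stragglers.

```python
import numpy as np

rng = np.random.default_rng(0)

def worker_gradient(w):
    """Hypothetical per-worker gradient on the toy loss f(w) = 0.5 * ||w||^2."""
    noise = rng.normal(scale=0.01, size=w.shape)   # mini-batch noise
    compute_time = rng.exponential(scale=1.0)      # simulated straggler delay
    return w + noise, compute_time

def sync_step_with_backups(w, num_workers=8, num_backups=2, lr=0.1):
    grads = [worker_gradient(w) for _ in range(num_workers)]
    # Keep only the gradients that "arrive" first; drop the slowest num_backups.
    grads.sort(key=lambda pair: pair[1])
    kept = [g for g, _ in grads[: num_workers - num_backups]]
    return w - lr * np.mean(kept, axis=0)

w = np.ones(4)
for step in range(100):
    w = sync_step_with_backups(w)
print("final ||w||:", np.linalg.norm(w))
```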
Toward Understanding the Impact of Staleness in Distributed Machine Learning
Many distributed machine learning (ML) systems adopt non-synchronous
execution in order to alleviate the network communication bottleneck, resulting
in stale parameters that do not reflect the latest updates. Despite much
development in large-scale ML, the effects of staleness on learning are
inconclusive as it is challenging to directly monitor or control staleness in
complex distributed environments. In this work, we study the convergence
behaviors of a wide array of ML models and algorithms under delayed updates.
Our extensive experiments reveal the rich diversity of the effects of staleness
on the convergence of ML algorithms and offer insights into seemingly
contradictory reports in the literature. The empirical findings also inspire a
new convergence analysis of stochastic gradient descent in non-convex
optimization under staleness, matching the best-known convergence rate of
O(1/\sqrt{T}). Comment: 19 pages, 12 figures
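A toy sketch of the staleness effect studied above (assumptions: a quadratic loss, a fixed delay, and made-up hyper-parameters): the gradient applied at step t is computed from the parameters of step t - s, as in non-synchronous execution with staleness s.

```python
import numpy as np
from collections import deque

def run_delayed_sgd(staleness=4, lr=0.05, steps=200, dim=8, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    # Ring buffer of past parameter snapshots; history[0] is the oldest.
    history = deque([w.copy()] * (staleness + 1), maxlen=staleness + 1)
    for _ in range(steps):
        w_stale = history[0]                                   # params from s steps ago
        grad = w_stale + rng.normal(scale=0.01, size=dim)      # grad of 0.5*||w||^2 + noise
        w = w - lr * grad                                      # update applied to current params
        history.append(w.copy())
    return np.linalg.norm(w)

for s in (0, 4, 16):
    print(f"staleness={s:>2}  final ||w|| = {run_delayed_sgd(staleness=s):.4f}")
```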
High Throughput Synchronous Distributed Stochastic Gradient Descent
We introduce a new, high-throughput, synchronous, distributed, data-parallel,
stochastic-gradient-descent learning algorithm. This algorithm uses amortized
inference in a compute-cluster-specific, deep, generative, dynamical model to
perform joint posterior predictive inference of the mini-batch gradient
computation times of all worker-nodes in a parallel computing cluster. We show that, by utilizing such a model, a synchronous parameter server can choose an optimal cutoff time, beyond which mini-batch gradient messages from slow workers are ignored, that maximizes the overall number of mini-batch gradient computations per second.
In keeping with earlier findings we observe that, under realistic conditions,
eagerly discarding the mini-batch gradient computations of stragglers not only
increases throughput but actually increases the overall rate of convergence as
a function of wall-clock time by virtue of eliminating idleness. The principal
novel contribution and finding of this work goes beyond this by demonstrating
that using the predicted run-times from a generative model of cluster worker
performance to dynamically adjust the cutoff improves substantially over the
static-cutoff prior art, leading to, among other things, significantly reduced
deep neural net training times on large computer clusters.
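A simplified sketch of the cutoff-selection step described above (not the authors' generative model; the log-normal run-time predictions are a stand-in): given predicted per-worker gradient compute times, choose the cutoff that maximizes expected gradient computations aggregated per second.

```python
import numpy as np

def best_cutoff(predicted_times):
    """Evaluate each candidate cutoff; throughput = gradients finished by the
    cutoff divided by the cutoff, since the server waits exactly that long."""
    times = np.sort(np.asarray(predicted_times, dtype=float))
    throughputs = [(t, (i + 1) / t) for i, t in enumerate(times)]
    return max(throughputs, key=lambda pair: pair[1])

rng = np.random.default_rng(1)
predicted = rng.lognormal(mean=0.0, sigma=0.5, size=32)   # hypothetical run-time predictions
cutoff, rate = best_cutoff(predicted)
print(f"cutoff = {cutoff:.3f}s, gradients/sec = {rate:.2f}, "
      f"workers kept = {np.sum(predicted <= cutoff)} / {predicted.size}")
```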
Scaling Distributed Training of Flood-Filling Networks on HPC Infrastructure for Brain Mapping
Mapping all the neurons in the brain requires automatic reconstruction of
entire cells from volume electron microscopy data. The flood-filling network
(FFN) architecture has demonstrated leading performance for segmenting
structures from this data. However, the training of the network is
computationally expensive. In order to reduce the training time, we implemented
synchronous and data-parallel distributed training using the Horovod library,
which is different from the asynchronous training scheme used in the published
FFN code. We demonstrated that our distributed training scaled well up to 2048
Intel Knights Landing (KNL) nodes on the Theta supercomputer. Our trained
models achieved a similar level of inference performance but took less training
time compared to previous methods. Our study on the effects of different batch
sizes on FFN training suggests ways to further improve training efficiency. Our
findings on optimal learning rate and batch sizes agree with previous works. Comment: 9 pages, 10 figures
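A minimal Horovod sketch of the synchronous data-parallel scheme described above (a generic Keras model stands in for the FFN; the layer sizes, learning rate, and data shards are illustrative assumptions): every worker computes a gradient on its shard and the gradients are averaged with all-reduce at each step.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per node or accelerator

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate with the number of workers (a common heuristic for
# synchronous data parallelism), then wrap the optimizer so gradients are
# all-reduced across workers before each update.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

callbacks = [
    # Ensure every worker starts from the same initial weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# x_shard / y_shard stand in for this worker's shard of the training data.
x_shard = tf.random.normal((1024, 64))
y_shard = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
model.fit(x_shard, y_shard, batch_size=32, epochs=1,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```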
Bandwidth Reduction using Importance Weighted Pruning on Ring AllReduce
Training large deep learning models inevitably requires a large-scale cluster equipped with accelerators. Deep gradient compression can greatly improve bandwidth utilization and speed up the training process, but it is hard to implement on a ring structure. In this paper, we find that redundant gradients and gradient staleness have a negative effect on training. We observe that in different epochs and at different steps, the neural network focuses on updating different layers and different parameters. In order to save communication bandwidth and preserve accuracy on a ring structure, which removes the scaling restriction as the number of nodes increases, we propose a new algorithm for measuring the importance of gradients on a large-scale cluster running ring all-reduce, based on the magnitude of the ratio of a parameter's gradient to its value. Our importance-weighted pruning approach achieves gradient compression ratios of 64X on AlexNet and 58.8X on ResNet50 on ImageNet. Meanwhile, in order to maintain the sparsity of the propagated gradients, each node randomly broadcasts the indices of its important gradients, while the remaining nodes gather the indexed gradients and perform the all-reduce update. This speeds up the convergence of the model and preserves the training accuracy.
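A sketch of the importance measure described above (the exact form is assumed here as the magnitude of the gradient-to-parameter ratio, and the keep ratio is illustrative), used to keep only the top fraction of gradient entries before a ring all-reduce; this is not the authors' implementation.

```python
import numpy as np

def prune_by_importance(grad, param, keep_ratio=1.0 / 64, eps=1e-12):
    """Return the indices and values of the most important gradient entries."""
    importance = np.abs(grad) / (np.abs(param) + eps)   # |g / w| per element
    k = max(1, int(grad.size * keep_ratio))
    idx = np.argpartition(importance, -k)[-k:]          # top-k by importance
    return idx, grad[idx]

rng = np.random.default_rng(0)
param = rng.normal(size=1_000_000)
grad = rng.normal(scale=0.01, size=param.shape)

idx, vals = prune_by_importance(grad, param)
print(f"kept {idx.size} of {grad.size} entries "
      f"(compression ratio ~{grad.size / idx.size:.0f}X)")
```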
Distributed Deep Learning Strategies For Automatic Speech Recognition
In this paper, we propose and investigate a variety of distributed deep
learning strategies for automatic speech recognition (ASR) and evaluate them
with a state-of-the-art Long short-term memory (LSTM) acoustic model on the
2000-hour Switchboard (SWB2000), which is one of the most widely used datasets
for ASR performance benchmarking. We first investigate the proper hyper-parameters (e.g., learning rate) needed to enable training with a sufficiently large batch size without impairing model accuracy. We then implement
various distributed strategies, including Synchronous (SYNC), Asynchronous
Decentralized Parallel SGD (ADPSGD), and a hybrid of the two (HYBRID), to study
their runtime/accuracy trade-off. We show that we can train the LSTM model
using ADPSGD in 14 hours with 16 NVIDIA P100 GPUs to reach a 7.6% WER on the
Hub5-2000 Switchboard (SWB) test set and a 13.1% WER on the CallHome (CH) test
set. Furthermore, we can train the model using HYBRID in 11.5 hours with 32
NVIDIA V100 GPUs without loss in accuracy. Comment: Published in ICASSP'19
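A simplified, single-process sketch of the decentralized-averaging idea behind ADPSGD-style training mentioned above (not IBM's implementation; the synchronous ring schedule, toy quadratic loss, and hyper-parameters are assumptions): each worker takes a local gradient step and then averages its weights with its ring neighbors instead of performing a global all-reduce.

```python
import numpy as np

def decentralized_step(weights, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(weights)
    # Local SGD step per worker on a toy quadratic loss.
    stepped = [w - lr * (w + rng.normal(scale=0.01, size=w.shape)) for w in weights]
    # Ring averaging: each worker mixes with its left and right neighbors.
    return [(stepped[(i - 1) % n] + stepped[i] + stepped[(i + 1) % n]) / 3.0
            for i in range(n)]

workers = [np.random.default_rng(i).normal(size=16) for i in range(8)]
for step in range(200):
    workers = decentralized_step(workers, seed=step)

spread = max(np.linalg.norm(w - workers[0]) for w in workers)
print(f"worker disagreement after training: {spread:.2e}")
```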
Exponential Moving Average Model in Parallel Speech Recognition Training
As training data grows rapidly, large-scale parallel training on multi-GPU clusters is now widely applied to neural network model learning. We present a new approach that applies the exponential moving average method to large-scale parallel training of neural network models. It is a non-interfering strategy: the exponential moving average model is not broadcast to the distributed workers to update their local models after model synchronization during training, and it is used as the final model of the training system. Fully-connected feed-forward neural networks (DNNs) and deep unidirectional Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) are successfully trained with the proposed method for large-vocabulary continuous
speech recognition on Shenma voice search data in Mandarin. The character error
rate (CER) of Mandarin speech recognition decreases further than with state-of-the-art parallel training approaches. Comment: 5 pages
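A small sketch of the exponential moving average idea described above (the decay value and toy training loop are assumptions): an EMA copy of the parameters is maintained on the side, never fed back to the workers, and serves as the final model.

```python
import numpy as np

class EmaModel:
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.ema = np.array(params, dtype=float, copy=True)

    def update(self, params):
        # ema <- decay * ema + (1 - decay) * current parameters
        self.ema = self.decay * self.ema + (1.0 - self.decay) * params

rng = np.random.default_rng(0)
w = rng.normal(size=32)
ema = EmaModel(w)

for step in range(1000):
    grad = w + rng.normal(scale=0.1, size=w.shape)   # noisy gradient of 0.5*||w||^2
    w = w - 0.05 * grad                              # synchronized worker update
    ema.update(w)                                    # side EMA, not broadcast back

print("||w||:", np.linalg.norm(w), " ||ema||:", np.linalg.norm(ema.ema))
```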
MXNET-MPI: Embedding MPI parallelism in Parameter Server Task Model for scaling Deep Learning
Existing Deep Learning frameworks exclusively use either the Parameter Server (PS)
approach or MPI parallelism. In this paper, we discuss the drawbacks of such
approaches and propose a generic framework supporting both PS and MPI
programming paradigms, co-existing at the same time. The key advantage of the
new model is to embed the scaling benefits of MPI parallelism into the loosely
coupled PS task model. Apart from providing a practical usage model of MPI in
cloud, such framework allows for novel communication avoiding algorithms that
perform parameter averaging in Stochastic Gradient Descent (SGD) approaches. We show how the MPI and PS models can synergistically apply algorithms such as Elastic SGD to improve the rate of convergence over existing approaches. These new algorithms directly help scale SGD cluster-wide. Further, we also optimize the
critical component of the framework, namely global aggregation or allreduce
using a novel concept of tensor collectives. These treat a group of vectors on
a node as a single object allowing for the existing single vector algorithms to
be directly applicable. We back our claims with empirical evidence using the large-scale ImageNet-1K data. Our framework is built upon MXNet, but the design is generic and can be adapted to other popular DL infrastructures.
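A minimal mpi4py sketch of the communication-avoiding parameter-averaging pattern mentioned above (not the MXNET-MPI code; the toy loss, step counts, and learning rate are assumptions): each rank trains locally for a few steps, then all ranks average their weights with a single MPI all-reduce.

```python
# Run with e.g.: mpirun -np 4 python average_params.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)
w = rng.normal(size=1024)                         # this rank's local model

for outer in range(10):
    for _ in range(5):                            # local SGD steps, no communication
        grad = w + rng.normal(scale=0.01, size=w.shape)
        w -= 0.05 * grad
    avg = np.empty_like(w)
    comm.Allreduce(w, avg, op=MPI.SUM)            # sum local models across ranks
    w = avg / size                                # periodic parameter averaging

if rank == 0:
    print("||w|| after averaging rounds:", np.linalg.norm(w))
```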
A DAG Model of Synchronous Stochastic Gradient Descent in Distributed Deep Learning
With huge amounts of training data, deep learning has made great
breakthroughs in many artificial intelligence (AI) applications. However, such
large-scale data sets present computational challenges, requiring training to
be distributed on a cluster equipped with accelerators like GPUs. With the rapid increase in GPU computing power, data communication among GPUs has become a potential bottleneck for overall training performance. In this paper, we
first propose a general directed acyclic graph (DAG) model to describe the
distributed synchronous stochastic gradient descent (S-SGD) algorithm, which
has been widely used in distributed deep learning frameworks. To understand the
practical impact of data communications on training performance, we conduct
extensive empirical studies on four state-of-the-art distributed deep learning
frameworks (i.e., Caffe-MPI, CNTK, MXNet and TensorFlow) over multi-GPU and
multi-node environments with different data communication techniques, including
PCIe, NVLink, 10GbE, and InfiniBand. Through both analytical and experimental
studies, we identify the potential bottlenecks and overheads that could be
further optimized. Finally, we make the data set of our experimental traces publicly available so that it can be used to support simulation-based studies. Comment: 8 pages. Accepted by ICPADS'2018
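A back-of-the-envelope sketch in the spirit of a DAG model of S-SGD (the layer timings are hypothetical, not measurements from the paper): the backward pass of one layer can overlap with the gradient communication of layers that have already finished, and the iteration ends once all gradients have been aggregated.

```python
layers = [
    # (backward_time_ms, gradient_comm_time_ms) per layer, output layer first
    (2.0, 1.5),
    (3.0, 4.0),
    (5.0, 2.5),
    (4.0, 6.0),
]

forward_ms = 10.0
t_backward, t_comm_done = 0.0, 0.0
for bwd_ms, comm_ms in layers:
    t_backward += bwd_ms                      # backward of this layer finishes here
    # Its gradient transfer starts once backward is done AND the link is free.
    t_comm_done = max(t_backward, t_comm_done) + comm_ms

iteration_ms = forward_ms + t_comm_done       # update applied once all grads arrive
print(f"estimated S-SGD iteration time: {iteration_ms:.1f} ms "
      f"(compute only: {forward_ms + t_backward:.1f} ms)")
```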
Image Classification at Supercomputer Scale
Deep learning is extremely computationally intensive, and hardware vendors
have responded by building faster accelerators in large clusters. Training deep
learning models at petaFLOPS scale requires overcoming both algorithmic and
systems software challenges. In this paper, we discuss three systems-related
optimizations: (1) distributed batch normalization to control per-replica batch
sizes, (2) input pipeline optimizations to sustain model throughput, and (3)
2-D torus all-reduce to speed up gradient summation. We combine these
optimizations to train ResNet-50 on ImageNet to 76.3% accuracy in 2.2 minutes
on a 1024-chip TPU v3 Pod with a training throughput of over 1.05 million
images/second and no accuracy drop. Comment: Presented as part of the Systems for ML Workshop @ NIPS 2018
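A toy numpy sketch of the 2-D (torus) all-reduce idea mentioned above (the mesh shape and gradient sizes are illustrative; the real implementation runs on TPU interconnect links): instead of one big ring over all chips, gradients are summed along one torus dimension and then along the other, so each phase involves only a row or a column of the mesh.

```python
import numpy as np

rows, cols, grad_len = 4, 8, 16
rng = np.random.default_rng(0)
# grads[r, c] is the local gradient held by the chip at mesh position (r, c).
grads = rng.normal(size=(rows, cols, grad_len))

# Phase 1: all-reduce along each row (sum over the column dimension).
row_sums = grads.sum(axis=1, keepdims=True)
row_result = np.broadcast_to(row_sums, grads.shape)   # every chip in a row holds its row sum

# Phase 2: all-reduce along each column of the row results.
total = row_result.sum(axis=0, keepdims=True)
result = np.broadcast_to(total, grads.shape)          # every chip holds the global sum

# Sanity check: matches a single flat all-reduce over all chips.
assert np.allclose(result[0, 0], grads.reshape(-1, grad_len).sum(axis=0))
print("2-D all-reduce matches flat all-reduce")
```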