Distributed Training of Structured SVM
Training structured prediction models is time-consuming. However, most existing approaches use only a single machine, so the computing power and larger data capacity offered by multiple machines remain unexploited. In this work, we propose an efficient algorithm for training structured support vector machines in a distributed fashion, based on a distributed block-coordinate descent method. Both theoretical and experimental results indicate that our method is efficient.
Comment: NIPS Workshop on Optimization for Machine Learning, 201
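The abstract does not spell out the algorithm, but the general pattern of distributed block-coordinate training can be sketched as follows. This is a minimal, CoCoA-style illustration for a binary linear SVM in numpy, with workers simulated as data partitions; the function names and the averaging scheme are assumptions for illustration, not the paper's method for structured SVMs.

```python
# Minimal sketch of a CoCoA-style distributed dual block-coordinate step for a
# binary linear SVM (hinge loss). Each "worker" updates only the dual variables
# of its own data block; the primal corrections are then averaged. Illustrative
# only -- not the paper's algorithm for structured SVMs.
import numpy as np

def local_dual_cd(X, y, alpha, w, C, epochs=1):
    """Dual coordinate ascent on one worker's block; returns its primal correction."""
    dw = np.zeros_like(w)
    for _ in range(epochs):
        for i in range(X.shape[0]):
            g = y[i] * X[i] @ (w + dw) - 1.0            # gradient of the i-th dual coordinate
            qii = X[i] @ X[i]
            new_alpha = np.clip(alpha[i] - g / (qii + 1e-12), 0.0, C)
            dw += (new_alpha - alpha[i]) * y[i] * X[i]   # keep w-correction in sync with alpha
            alpha[i] = new_alpha
    return dw

def distributed_round(blocks, alphas, w, C):
    """One synchronous round: every worker updates its block, then deltas are averaged."""
    deltas = [local_dual_cd(X, y, a, w, C) for (X, y), a in zip(blocks, alphas)]
    return w + sum(deltas) / len(deltas)
```

Because each worker touches only its own dual block, the per-round communication is just the small primal correction vectors rather than the training data.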
Distributed Training Large-Scale Deep Architectures
Scale of data and scale of computation infrastructures together enable the
current deep learning renaissance. However, training large-scale deep
architectures demands both algorithmic improvement and careful system
configuration. In this paper, we focus on employing the system approach to
speed up large-scale training. Via lessons learned from our routine
benchmarking effort, we first identify bottlenecks and overheads that hinder
data parallelism. We then devise guidelines that help practitioners to
configure an effective system and fine-tune parameters to achieve desired
speedup. Specifically, we develop a procedure for setting minibatch size and
choosing computation algorithms. We also derive lemmas for determining the
quantity of key components such as the number of GPUs and parameter servers.
Experiments and examples show that these guidelines help effectively speed up
large-scale deep learning training.
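As an illustration of the kind of capacity arithmetic such guidelines involve, here is a back-of-the-envelope estimate of how many parameter-server shards a data-parallel job might need. The formula and every number below are assumptions for illustration only, not the lemmas derived in the paper.

```python
# Illustrative sizing estimate for a parameter-server setup (NOT the paper's lemmas).
import math

model_bytes   = 100e6    # assumed bytes pushed/pulled per worker per iteration
iters_per_sec = 5        # assumed iteration rate per worker
num_workers   = 16
nic_bandwidth = 1.25e9   # assumed 10 Gb/s NIC per parameter-server shard, in bytes/s

# Aggregate traffic the parameter-server tier must absorb (push + pull):
total_bytes_per_sec = 2 * model_bytes * iters_per_sec * num_workers

# Shards needed so that no single server's NIC is saturated:
num_ps = math.ceil(total_bytes_per_sec / nic_bandwidth)
print(f"parameter servers needed (rough estimate): {num_ps}")
```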
Theano-MPI: a Theano-based Distributed Training Framework
We develop a scalable and extendable training framework that can utilize GPUs
across nodes in a cluster and accelerate the training of deep learning models
based on data parallelism. Both synchronous and asynchronous training are
implemented in our framework, where parameter exchange among GPUs is based on
CUDA-aware MPI. In this report, we analyze the convergence and capability of
the framework to reduce training time when scaling the synchronous training of
AlexNet and GoogLeNet from 2 GPUs to 8 GPUs. In addition, we explore novel ways
to reduce the communication overhead caused by exchanging parameters. Finally,
we release the framework as open-source for further research on distributed
deep learning.
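A minimal sketch of the synchronous data-parallel pattern the framework implements, written with mpi4py on the host side. This is an illustration of gradient averaging with Allreduce, not the released Theano-MPI code (which exchanges parameters with CUDA-aware MPI directly between GPUs).

```python
# Synchronous data-parallel SGD sketch: each rank computes gradients on its own
# shard, gradients are averaged with MPI Allreduce, and every rank applies the
# same update. Run with e.g. `mpirun -np 8 python train.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

params = np.zeros(1000, dtype=np.float32)   # toy parameter vector
lr = 0.01

def local_gradient(params):
    # Stand-in for a forward/backward pass on this rank's minibatch.
    return np.random.randn(*params.shape).astype(np.float32)

for step in range(100):
    grad = local_gradient(params)
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)   # sum gradients across all ranks
    avg /= size                             # average
    params -= lr * avg                      # identical update on every rank
```

Asynchronous variants drop the blocking Allreduce and instead push gradients to (and pull parameters from) a server process, trading gradient staleness for the removal of the synchronization barrier.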
Nested Dithered Quantization for Communication Reduction in Distributed Training
In distributed training, the communication cost due to the transmission of
gradients or the parameters of the deep model is a major bottleneck in scaling
up the number of processing nodes. To address this issue, we propose
\emph{dithered quantization} for the transmission of the stochastic gradients
and show that training with \emph{Dithered Quantized Stochastic Gradients
(DQSG)} is similar to training with unquantized SGs perturbed by independent, bounded uniform noise, in contrast to other quantization methods, where the perturbation depends on the gradients and hence complicates the convergence analysis. We study the convergence of training algorithms using DQSG and the trade-off between the number of quantization levels and the training time.
Next, we observe that there is a correlation among the SGs computed by
workers that can be utilized to further reduce the communication overhead
without any performance loss. Hence, we develop a simple yet effective
quantization scheme, nested dithered quantized SG (NDQSG), that can reduce the
communication significantly \emph{without requiring the workers to communicate extra information to each other}. We prove that although NDQSG requires significantly fewer bits, it achieves the same quantization variance bound as DQSG.
Our simulation results confirm the effectiveness of training using DQSG and
NDQSG in reducing the communication bits or the convergence time compared to
the existing methods without sacrificing the accuracy of the trained model.
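A minimal sketch of subtractive dithered quantization of a gradient vector, under the assumption that sender and receiver reproduce the same dither from a shared random seed. The step size and the nested (NDQSG) construction of the paper are not reproduced here.

```python
# Subtractive dithered quantization: add a uniform dither before rounding to a
# uniform grid, transmit only the integer levels, and subtract the same dither
# at the receiver. The reconstruction error is then uniform on [-step/2, step/2]
# and independent of the gradient, which is the property the abstract relies on.
import numpy as np

def dithered_quantize(grad, step, seed):
    rng = np.random.default_rng(seed)
    dither = rng.uniform(-step / 2, step / 2, size=grad.shape)
    return np.round((grad + dither) / step).astype(np.int32)   # levels on the wire

def dithered_dequantize(q, step, seed):
    rng = np.random.default_rng(seed)                           # same seed -> same dither
    dither = rng.uniform(-step / 2, step / 2, size=q.shape)
    return q * step - dither

g = np.random.randn(5)
q = dithered_quantize(g, step=0.1, seed=42)
g_hat = dithered_dequantize(q, step=0.1, seed=42)
```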
Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
Large-scale distributed training requires significant communication bandwidth
for gradient exchange that limits the scalability of multi-node training, and
requires expensive high-bandwidth network infrastructure. The situation gets
even worse with distributed training on mobile devices (federated learning),
which suffers from higher latency, lower throughput, and intermittent poor
connections. In this paper, we find 99.9% of the gradient exchange in
distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to
greatly reduce the communication bandwidth. To preserve accuracy during
compression, DGC employs four methods: momentum correction, local gradient
clipping, momentum factor masking, and warm-up training. We have applied Deep
Gradient Compression to image classification, speech recognition, and language
modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and
Librispeech Corpus. In these scenarios, Deep Gradient Compression achieves a
gradient compression ratio from 270x to 600x without losing accuracy, cutting
the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from
488MB to 0.74MB. Deep gradient compression enables large-scale distributed
training on inexpensive commodity 1Gbps Ethernet and facilitates distributed
training on mobile. Code is available at:
https://github.com/synxlin/deep-gradient-compression.
Comment: we find 99.9% of the gradient exchange in distributed SGD is redundant; we reduce the communication bandwidth by two orders of magnitude without losing accuracy. Code is available at: https://github.com/synxlin/deep-gradient-compression
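A minimal sketch of the core DGC idea, top-k gradient sparsification with local accumulation of the unsent residues. Momentum correction, local gradient clipping, momentum factor masking, and warm-up training from the paper are omitted, and the class below is an illustrative assumption rather than the released code.

```python
# Gradient sparsification with local accumulation: only the largest-magnitude
# entries of the accumulated gradient are transmitted each step; everything else
# stays in a local residual buffer until it grows large enough to be sent.
import numpy as np

class SparsifyingSender:
    def __init__(self, num_params, sparsity=0.999):
        self.residual = np.zeros(num_params, dtype=np.float32)
        self.k = max(1, int(num_params * (1.0 - sparsity)))    # e.g. keep top 0.1%

    def compress(self, grad):
        acc = self.residual + grad
        idx = np.argpartition(np.abs(acc), -self.k)[-self.k:]  # top-k by magnitude
        values = acc[idx]
        self.residual = acc.copy()
        self.residual[idx] = 0.0        # sent entries are cleared; the rest accumulate
        return idx, values              # only ~0.1% of entries go on the wire
```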
A Hitchhiker's Guide On Distributed Training of Deep Neural Networks
Deep learning has led to tremendous advancements in the field of Artificial
Intelligence. One caveat however is the substantial amount of compute needed to
train these deep learning models. Training on a benchmark dataset like ImageNet with a single machine and a modern GPU can take up to a week; distributing training across multiple machines has been observed to bring this time down drastically. Recent work has reduced ImageNet training time to as little as 4 minutes by using a cluster of 2048 GPUs. This paper surveys the various
algorithms and techniques used to distribute training and presents the current
state of the art for a modern distributed training framework. More
specifically, we explore the synchronous and asynchronous variants of
distributed Stochastic Gradient Descent, various All Reduce gradient
aggregation strategies, and best practices for obtaining higher throughput and lower latency over a cluster, such as mixed-precision training, large-batch training, and gradient compression.
Comment: 14 pages
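One of the all-reduce aggregation strategies surveyed in such work is ring all-reduce. The following single-process simulation sketches its two phases, reduce-scatter followed by all-gather; it illustrates the communication pattern only and is not any particular library's implementation.

```python
# Ring all-reduce simulation: each of the N "workers" holds a gradient split into
# N chunks. After N-1 reduce-scatter steps each worker owns one fully summed chunk;
# N-1 all-gather steps then circulate the finished chunks to everyone.
import numpy as np

def ring_allreduce(grads):
    n = len(grads)
    chunks = [np.array_split(g.copy(), n) for g in grads]   # chunks[worker][chunk]

    # Reduce-scatter: worker i forwards chunk (i - step) % n to its right neighbour.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c]

    return [np.concatenate(c) for c in chunks]

grads = [np.full(8, float(i)) for i in range(4)]
out = ring_allreduce(grads)
assert all(np.allclose(o, sum(range(4))) for o in out)      # every worker has the sum
```

The appeal of the ring is that each worker sends and receives roughly 2x the gradient size per all-reduce regardless of the number of workers, which is why it scales better than a naive parameter-server fan-in on bandwidth-limited clusters.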
An MPI-Based Python Framework for Distributed Training with Keras
We present a lightweight Python framework for distributed training of neural
networks on multiple GPUs or CPUs. The framework is built on the popular Keras
machine learning library. The Message Passing Interface (MPI) protocol is used
to coordinate the training process, and the system is well suited for job
submission at supercomputing sites. We detail the software's features, describe
its use, and demonstrate its performance on systems of varying sizes on a
benchmark problem drawn from high-energy physics research.
Comment: 4 pages, 4 figures, 1 table, DS@HEP, SC1
AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training
Highly distributed training of Deep Neural Networks (DNNs) on future compute
platforms (offering 100s of TeraOps/s of computational capacity) is expected to
be severely communication constrained. To overcome this limitation, new
gradient compression techniques are needed that are computationally friendly,
applicable to a wide variety of layers seen in Deep Neural Networks and
adaptable to variations in network architectures as well as their
hyper-parameters. In this paper we introduce a novel technique - the Adaptive
Residual Gradient Compression (AdaComp) scheme. AdaComp is based on localized
selection of gradient residues and automatically tunes the compression rate
depending on local activity. We show excellent results on a wide spectrum of state-of-the-art deep learning models in multiple domains (vision, speech,
language), datasets (MNIST, CIFAR10, ImageNet, BN50, Shakespeare), optimizers
(SGD with momentum, Adam) and network parameters (number of learners,
minibatch-size etc.). Exploiting both sparsity and quantization, we demonstrate
end-to-end compression rates of ~200X for fully-connected and recurrent layers,
and ~40X for convolutional layers, without any noticeable degradation in model
accuracies.
Comment: IBM Research AI, 9 pages, 7 figures, AAAI18 accepted
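A rough sketch of locally adaptive residual selection in the spirit of AdaComp: the residual-accumulated gradient is split into fixed-size bins and, within each bin, only entries near the bin's maximum magnitude are transmitted. The bin size, the threshold rule, and the omission of quantization are simplifications, not the exact AdaComp criterion.

```python
# Locally adaptive residual selection (simplified). Entries not sent this step
# remain in the residual buffer and get another chance on later iterations.
import numpy as np

def adacomp_like_select(grad, residual, bin_size=256, scale=2.0):
    acc = residual + grad
    send_mask = np.zeros(acc.shape, dtype=bool)
    for start in range(0, acc.size, bin_size):
        block = np.abs(acc[start:start + bin_size])
        threshold = block.max() / scale                    # per-bin adaptive threshold
        send_mask[start:start + bin_size] = block >= threshold
    sent_values = acc[send_mask]
    residual = acc.copy()
    residual[send_mask] = 0.0                              # keep unsent residues locally
    return send_mask, sent_values, residual
```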
High-Rate, Distributed Training-Embedded Complex Orthogonal Designs for Relay Networks
Distributed Space-Time Block Codes (DSTBCs) from Complex Orthogonal Designs
(CODs) (both square and non-square CODs other than the Alamouti design) are
known to lose their single-symbol ML decodable (SSD) property when used in
two-hop wireless relay networks using the amplify-and-forward protocol. For such a
network, in this paper, a new class of high-rate, training-embedded (TE) SSD DSTBCs is constructed from TE-CODs. The proposed codes embed the training symbols in the structure of the code, which is shown to be the key to obtaining high rate as well as the SSD property. TE-CODs are shown to offer full diversity for arbitrary complex constellations. Non-square TE-CODs are shown to provide higher rates (in symbols per channel use) compared to the known SSD DSTBCs for relay networks with number of relays less than
Comment: 6 pages, 2 figures
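For context, the canonical example of a square COD, and the one exception noted in the abstract above, is the 2x2 Alamouti design; its column orthogonality is what makes single-symbol ML decoding possible:

```latex
% Alamouti design: the classic 2x2 complex orthogonal design.
G_2(s_1, s_2) =
\begin{pmatrix}
  s_1    & s_2 \\
  -s_2^* & s_1^*
\end{pmatrix},
\qquad
G_2^{\mathsf{H}} G_2 = \left(|s_1|^2 + |s_2|^2\right) I_2 .
```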
OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed Training
Training modern deep neural networks calls for large amounts
of computation, which is often provided by GPUs or other specific accelerators.
To scale out to achieve faster training speed, two update algorithms are mainly
applied in the distributed training process, i.e. the Synchronous SGD algorithm
(SSGD) and the Asynchronous SGD algorithm (ASGD). SSGD reaches a good convergence point, but its training speed is slowed down by the synchronous barrier. ASGD trains faster, but its convergence point is worse than that of SSGD. To exploit the advantages of both, we propose a novel method named One-step Delay SGD (OD-SGD), which combines their strengths in the training process so as to achieve a convergence point similar to SSGD at a training speed close to ASGD. To the best of our knowledge, this is the first attempt to combine the features of SSGD and ASGD to improve distributed training performance. Each iteration of OD-SGD contains a global update on the parameter-server node and local updates on the worker nodes; the local update is introduced to update and compensate for the delayed local weights.
We evaluate our proposed algorithm on MNIST, CIFAR-10 and ImageNet datasets.
Experimental results show that OD-SGD obtains similar or even slightly better accuracy than SSGD, while its training speed is much faster and even exceeds that of ASGD.
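A schematic single-process simulation of the one-step-delay idea: the worker computes gradients against weights that are one global step stale and applies a local compensating update, while the parameter server applies the previous step's gradient. The update rule below is a simplified illustration under those assumptions, not the paper's exact algorithm; momentum and multi-worker scheduling are omitted.

```python
# One-step-delay update sketch on a toy quadratic loss 0.5 * ||w - 1||^2.
import numpy as np

dim, lr, steps = 10, 0.1, 100
global_w = np.zeros(dim)     # parameter-server copy
local_w = global_w.copy()    # worker's (one step stale) copy

def gradient(w):
    return w - np.ones(dim)  # gradient of the toy loss

pending_grad = None
for t in range(steps):
    g = gradient(local_w)                  # gradient on one-step-delayed weights
    if pending_grad is not None:
        global_w -= lr * pending_grad      # global update with last step's gradient
    local_w = global_w - lr * g            # local compensating update with fresh gradient
    pending_grad = g

print(np.round(local_w, 3))                # converges toward the optimum at 1.0
```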