ImageNet Training in Minutes
Finishing 90-epoch ImageNet-1k training with ResNet-50 on an NVIDIA M40 GPU
takes 14 days. This training requires 10^18 single precision operations in
total. On the other hand, the world's current fastest supercomputer can finish
2 * 10^17 single precision operations per second (Dongarra et al 2017,
https://www.top500.org/lists/2017/06/). If we can make full use of the
supercomputer for DNN training, we should be able to finish the 90-epoch
ResNet-50 training in one minute. However, the current bottleneck for fast DNN
training is at the algorithm level. Specifically, the current batch size (e.g.
512) is too small to make efficient use of many processors. For large-scale DNN
training, we focus on using large-batch data-parallelism synchronous SGD
without losing accuracy in the fixed epochs. The LARS algorithm (You, Gitman,
Ginsburg, 2017, arXiv:1708.03888) enables us to scale the batch size to
extremely large sizes (e.g. 32K). We finish the 100-epoch ImageNet training with
AlexNet in 11 minutes on 1024 CPUs. We also finish the 90-epoch ImageNet training
with ResNet-50 in 20 minutes on 2048 KNLs without losing accuracy, which is about
three times faster than Facebook's result (Goyal et al 2017, arXiv:1706.02677).
The state-of-the-art ImageNet training speed with ResNet-50 is 74.9% top-1 test
accuracy in 15 minutes. We reach 74.9% top-1 test accuracy in 64 epochs, which
takes only 14 minutes. Furthermore, when we increase the batch size to above
16K, our accuracy is much higher than Facebook's on corresponding batch sizes.
Our source code is available upon request.
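The layerwise scaling at the heart of LARS can be sketched as follows. This is a minimal NumPy illustration of the trust-ratio idea from arXiv:1708.03888, with the momentum term omitted and hyperparameter values chosen for illustration:

```python
import numpy as np

def lars_step(w, grad, lr=0.1, trust_coef=0.001, weight_decay=5e-4):
    """One LARS update for a single layer (sketch; momentum omitted).

    The global learning rate lr is rescaled per layer by a local rate
    proportional to ||w|| / (||grad|| + weight_decay * ||w||), so layers
    whose gradients are large relative to their weights take smaller
    steps. This is what keeps very large batch sizes stable.
    """
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + 1e-9)
    return w - lr * local_lr * (grad + weight_decay * w)
```

Because the local rate is proportional to the weight norm, the step length scales with the layer itself rather than with the (batch-size-dependent) gradient magnitude alone.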
Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes
We demonstrate that training ResNet-50 on ImageNet for 90 epochs can be
achieved in 15 minutes with 1024 Tesla P100 GPUs. This was made possible by
using a large minibatch size of 32k. To maintain accuracy with this large
minibatch size, we employed several techniques such as RMSprop warm-up, batch
normalization without moving averages, and a slow-start learning rate schedule.
This paper also describes the details of the hardware and software of the
system used to achieve the above performance. Comment: NIPS'17 Workshop: Deep Learning at Supercomputer Scale
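The "slow-start" idea can be illustrated with a generic gradual warm-up schedule. This is a common pattern for large-batch training, not the paper's exact schedule (which also involves an RMSprop-to-SGD transition):

```python
def slow_start_lr(step, base_lr=0.1, warmup_steps=500):
    """Gradual warm-up (sketch): ramp the learning rate linearly from
    base_lr / warmup_steps up to base_lr over warmup_steps iterations,
    then hold it constant. Applying the full large-batch learning rate
    from the first iteration tends to diverge; warming up avoids that.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```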
Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes
Synchronized stochastic gradient descent (SGD) optimizers with data
parallelism are widely used in training large-scale deep neural networks.
Although using larger mini-batch sizes can improve the system scalability by
reducing the communication-to-computation ratio, it may hurt the generalization
ability of the models. To this end, we build a highly scalable deep learning
training system for dense GPU clusters with three main contributions: (1) We
propose a mixed-precision training method that significantly improves the
training throughput of a single GPU without losing accuracy. (2) We propose an
optimization approach for extremely large mini-batch size (up to 64k) that can
train CNN models on the ImageNet dataset without losing accuracy. (3) We
propose highly optimized all-reduce algorithms that achieve up to 3x and 11x
speedups on AlexNet and ResNet-50, respectively, over NCCL-based training on a
cluster with 1024 Tesla P40 GPUs. On training ResNet-50 with 90 epochs, the
state-of-the-art GPU-based system with 1024 Tesla P100 GPUs spent 15 minutes
and achieved 74.9\% top-1 test accuracy, and another KNL-based system with 2048
Intel KNLs spent 20 minutes and achieved 75.4\% accuracy. Our training system
can achieve 75.8\% top-1 test accuracy in only 6.6 minutes using 2048 Tesla P40
GPUs. When training AlexNet with 95 epochs, our system can achieve 58.7\% top-1
test accuracy within 4 minutes, which also outperforms all other existing
systems. Comment: arXiv admin note: text overlap with arXiv:1803.03383 by other authors.
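Contribution (1) can be sketched with a toy loss-scaling step. This is a hypothetical NumPy illustration of the general fp16-compute / fp32-master-weights pattern; the system described in the paper may differ in detail:

```python
import numpy as np

def mixed_precision_step(w_master, grad_fn, lr=0.01, loss_scale=1024.0):
    """Sketch of mixed-precision training with loss scaling.

    Forward/backward run in fp16 on a half-precision copy of the
    weights; the loss is multiplied by loss_scale so small gradients do
    not underflow in fp16. Gradients are then unscaled in fp32 and
    applied to the fp32 master weights.
    """
    w16 = w_master.astype(np.float16)                     # fp16 copy for compute
    scaled_grad = grad_fn(w16) * np.float16(loss_scale)   # grad of the scaled loss
    grad = scaled_grad.astype(np.float32) / loss_scale    # unscale in fp32
    return w_master - lr * grad                           # update master weights
```

Keeping master weights in fp32 preserves small updates that would be rounded away if the weights themselves were stored in fp16.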
A Hitchhiker's Guide On Distributed Training of Deep Neural Networks
Deep learning has led to tremendous advancements in the field of Artificial
Intelligence. One caveat however is the substantial amount of compute needed to
train these deep learning models. Training on a benchmark dataset like ImageNet
with a single machine and a modern GPU can take up to a week; distributing
training across multiple machines has been observed to bring this time down
drastically. Recent work has brought ImageNet training time down to as little as
4 minutes by using a cluster of 2048 GPUs. This paper surveys the various
algorithms and techniques used to distribute training and presents the current
state of the art for a modern distributed training framework. More
specifically, we explore the synchronous and asynchronous variants of
distributed Stochastic Gradient Descent, various All Reduce gradient
aggregation strategies and best practices for obtaining higher throughput and
lower latency over a cluster such as mixed precision training, large batch
training and gradient compression. Comment: 14 pages.
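The all-reduce aggregation strategies surveyed can be illustrated with a single-process simulation of the classic ring algorithm (a sketch of the protocol itself, not any particular framework's implementation):

```python
import numpy as np

def ring_allreduce(worker_data):
    """Simulate ring all-reduce on N logical workers in one process.

    Each worker's vector is split into N chunks. A scatter-reduce phase
    passes partial sums around the ring for N-1 steps; an all-gather
    phase then circulates the finished chunks for another N-1 steps.
    Each worker sends/receives roughly twice the vector size in total,
    independent of the number of workers.
    """
    n = len(worker_data)
    chunks = [np.array_split(np.asarray(d, dtype=float), n) for d in worker_data]
    # Scatter-reduce: at step s, worker r sends chunk (r - s) mod n to r+1.
    for s in range(n - 1):
        outgoing = [chunks[r][(r - s) % n].copy() for r in range(n)]
        for r in range(n):
            chunks[(r + 1) % n][(r - s) % n] += outgoing[r]
    # Now worker r holds the fully reduced chunk (r + 1) mod n.
    # All-gather: at step s, worker r forwards chunk (r + 1 - s) mod n.
    for s in range(n - 1):
        outgoing = [chunks[r][(r + 1 - s) % n].copy() for r in range(n)]
        for r in range(n):
            chunks[(r + 1) % n][(r + 1 - s) % n] = outgoing[r]
    return [np.concatenate(c) for c in chunks]
```

After both phases, every worker holds the element-wise sum of all workers' vectors, which is exactly the quantity synchronous SGD needs for gradient averaging.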
Image Classification at Supercomputer Scale
Deep learning is extremely computationally intensive, and hardware vendors
have responded by building faster accelerators in large clusters. Training deep
learning models at petaFLOPS scale requires overcoming both algorithmic and
systems software challenges. In this paper, we discuss three systems-related
optimizations: (1) distributed batch normalization to control per-replica batch
sizes, (2) input pipeline optimizations to sustain model throughput, and (3)
2-D torus all-reduce to speed up gradient summation. We combine these
optimizations to train ResNet-50 on ImageNet to 76.3% accuracy in 2.2 minutes
on a 1024-chip TPU v3 Pod with a training throughput of over 1.05 million
images/second and no accuracy drop. Comment: Presented as part of the Systems for ML Workshop @ NIPS 2018.
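Optimization (1) can be sketched as follows. This is a toy NumPy illustration of pooling batch-norm statistics over small replica groups; the group size and array shapes are illustrative, not the paper's configuration:

```python
import numpy as np

def grouped_batch_norm(replica_acts, group_size=2, eps=1e-5):
    """Sketch of distributed batch normalization.

    With huge global batches, per-replica batches become tiny and BN
    statistics get noisy, while normalizing across all replicas changes
    the effective BN batch size. A middle ground: pool mean/variance
    over small groups of replicas so the statistics batch size stays
    fixed as the global batch grows.
    """
    acts = [np.asarray(a, dtype=float) for a in replica_acts]
    out = []
    for g in range(0, len(acts), group_size):
        group = np.concatenate(acts[g:g + group_size], axis=0)
        mu, var = group.mean(axis=0), group.var(axis=0)
        for a in acts[g:g + group_size]:
            out.append((a - mu) / np.sqrt(var + eps))
    return out
```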
PowerAI DDL
As deep neural networks become more complex and input datasets grow larger,
it can take days or even weeks to train a deep neural network to the desired
accuracy. Therefore, distributed Deep Learning at a massive scale is a critical
capability, since it offers the potential to reduce the training time from
weeks to hours. In this paper, we present a software-hardware co-optimized
distributed Deep Learning system that can achieve near-linear scaling up to
hundreds of GPUs. The core algorithm is a multi-ring communication pattern that
provides a good tradeoff between latency and bandwidth and adapts to a variety
of system configurations. The communication algorithm is implemented as a
library for easy use. This library has been integrated into Tensorflow, Caffe,
and Torch. We train ResNet-101 on ImageNet 22K with 64 IBM Power8 S822LC
servers (256 GPUs) in about 7 hours to a validation accuracy of 33.8%.
Microsoft's ADAM and Google's DistBelief results did not reach 30%
validation accuracy for ImageNet 22K. Compared to Facebook AI Research's recent
paper on 256 GPU training, we use a different communication algorithm, and our
combined software and hardware system offers lower communication overhead for
ResNet-50. A PowerAI DDL enabled version of Torch completed 90 epochs of
training on ResNet-50 for 1K classes in 50 minutes using 64 IBM Power8 S822LC
servers (256 GPUs).
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Training large deep neural networks on massive datasets is computationally
very challenging. There has been recent surge in interest in using large batch
stochastic optimization methods to tackle this issue. The most prominent
algorithm in this line of research is LARS, which by employing layerwise
adaptive learning rates trains ResNet on ImageNet in a few minutes. However,
LARS performs poorly for attention models like BERT, indicating that its
performance gains are not consistent across tasks. In this paper, we first
study a principled layerwise adaptation strategy to accelerate training of deep
neural networks using large mini-batches. Using this strategy, we develop a new
layerwise adaptive large batch optimization technique called LAMB; we then
provide convergence analysis of LAMB as well as LARS, showing convergence to a
stationary point in general nonconvex settings. Our empirical results
demonstrate the superior performance of LAMB across various tasks such as BERT
and ResNet-50 training with very little hyperparameter tuning. In particular,
for BERT training, our optimizer enables use of very large batch sizes of 32868
without any degradation of performance. By increasing the batch size to the
memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to
just 76 minutes (Table 1). The LAMB implementation is available at
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py Comment: Published as a conference paper at ICLR 2020.
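The LAMB update combines Adam-style moment estimates with a LARS-style trust ratio. The sketch below is a minimal per-layer NumPy version with illustrative hyperparameters; see the linked implementation for the authoritative one:

```python
import numpy as np

def lamb_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB update for a single layer (sketch).

    Adam-style first/second moments produce a normalized update; that
    update is then rescaled by the layerwise trust ratio ||w|| / ||u||,
    as in LARS, before being applied.
    """
    m = b1 * m + (1 - b1) * grad              # first moment
    v = b2 * v + (1 - b2) * grad ** 2         # second moment
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    u = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(u)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return w - lr * trust * u, m, v
```

The trust ratio plays the same stabilizing role as in LARS, while the Adam-style normalization is what makes the update effective for attention models like BERT.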
Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes
It is important to scale out deep neural network (DNN) training for reducing
model training time. The high communication overhead is one of the major
performance bottlenecks for distributed DNN training across multiple GPUs. Our
investigations have shown that popular open-source DNN systems could only
achieve a 2.5x speedup on 64 GPUs connected by a 56 Gbps network. To address
this problem, we propose a communication backend named GradientFlow for
distributed DNN training, and employ a set of network optimization techniques.
First, we integrate ring-based allreduce, mixed-precision training, and
computation/communication overlap into GradientFlow. Second, we propose lazy
allreduce to improve network throughput by fusing multiple communication
operations into a single one, and design coarse-grained sparse communication to
reduce network traffic by only transmitting important gradient chunks. When
training ImageNet/AlexNet on 512 GPUs, our approach achieves a 410.2x speedup
and completes the 95-epoch training in 1.5 minutes, outperforming existing
approaches.
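The lazy all-reduce idea (fusing many small communication calls into fewer large ones) amounts to bucketing gradient tensors before reduction. The sketch below shows only the bucketing step, with an illustrative threshold policy:

```python
import numpy as np

def fuse_gradients(grads, threshold):
    """Sketch of lazy all-reduce bucketing.

    Pack per-layer gradient tensors into flat buckets of at least
    `threshold` elements, so each bucket needs only one all-reduce call
    instead of one call per tensor. Fewer, larger messages amortize
    per-call network latency.
    """
    buckets, cur, cur_size = [], [], 0
    for g in grads:
        cur.append(np.asarray(g, dtype=float).ravel())
        cur_size += cur[-1].size
        if cur_size >= threshold:
            buckets.append(np.concatenate(cur))
            cur, cur_size = [], 0
    if cur:  # flush the last partial bucket
        buckets.append(np.concatenate(cur))
    return buckets
```

In a real system the bucket would be all-reduced as one message and then split back into per-layer gradients.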
Multi-Sample Dropout for Accelerated Training and Better Generalization
Dropout is a simple but efficient regularization technique for achieving
better generalization of deep neural networks (DNNs); hence it is widely used
in tasks based on DNNs. During training, dropout randomly discards a portion of
the neurons to avoid overfitting. This paper presents an enhanced dropout
technique, which we call multi-sample dropout, for both accelerating training
and improving generalization over the original dropout. The original dropout
creates a randomly selected subset (called a dropout sample) from the input in
each training iteration while the multi-sample dropout creates multiple dropout
samples. The loss is calculated for each sample, and then the sample losses are
averaged to obtain the final loss. This technique can be easily implemented
without implementing a new operator by duplicating a part of the network after
the dropout layer while sharing the weights among the duplicated fully
connected layers. Experimental results showed that multi-sample dropout
significantly accelerates training by reducing the number of iterations until
convergence for image classification tasks using the ImageNet, CIFAR-10,
CIFAR-100, and SVHN datasets. Multi-sample dropout does not significantly
increase computation cost per iteration because most of the computation time is
consumed in the convolution layers before the dropout layer, which are not
duplicated. Experiments also showed that networks trained using multi-sample
dropout achieved lower error rates and losses for both the training set and
validation set.
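The loss averaging described above can be sketched as follows, using a toy NumPy setup with a shared linear classifier (the paper applies the technique to the fully connected head of a CNN):

```python
import numpy as np

def multi_sample_dropout_loss(feats, labels_1hot, w, n_samples=4, p=0.5, seed=0):
    """Sketch of multi-sample dropout.

    Draw several independent dropout masks for the same minibatch of
    features, push each dropout sample through the shared classifier
    weights w, and average the per-sample cross-entropy losses.
    """
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(n_samples):
        mask = (rng.random(feats.shape) >= p) / (1.0 - p)  # inverted dropout
        logits = (feats * mask) @ w                        # shared weights
        z = logits - logits.max(axis=1, keepdims=True)     # stable softmax
        log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        losses.append(-(labels_1hot * log_p).sum(axis=1).mean())
    return float(np.mean(losses))
```

Because the features are computed once and only the post-dropout layers are duplicated, the extra cost per iteration stays small, which is the source of the training speedup.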
Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark
Researchers have proposed hardware, software, and algorithmic optimizations
to improve the computational performance of deep learning. While some of these
optimizations perform the same operations faster (e.g., increasing GPU clock
speed), many others modify the semantics of the training procedure (e.g.,
reduced precision), and can impact the final model's accuracy on unseen data.
Due to a lack of standard evaluation criteria that considers these trade-offs,
it is difficult to directly compare these optimizations. To address this
problem, we recently introduced DAWNBench, a benchmark competition focused on
end-to-end training time to achieve near-state-of-the-art accuracy on an unseen
dataset---a combined metric called time-to-accuracy (TTA). In this work, we
analyze the entries from DAWNBench, which received optimized submissions from
multiple industrial groups, to investigate the behavior of TTA as a metric as
well as trends in the best-performing entries. We show that TTA has a low
coefficient of variation and that models optimized for TTA generalize nearly as
well as those trained using standard methods. Additionally, even though
DAWNBench entries were able to train ImageNet models in under 3 minutes, we
find they still underutilize hardware capabilities such as Tensor Cores.
Furthermore, we find that distributed entries can spend more than half of their
time on communication. We show similar findings with entries to the MLPerf v0.5
benchmark.
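The TTA metric itself is straightforward to compute from a training log, as the sketch below shows (the accuracy target is set per task by the benchmark; the log format here is assumed):

```python
def time_to_accuracy(log, target):
    """Time-to-accuracy (sketch): the elapsed time of the first
    evaluation whose validation accuracy meets the target, or None if
    the target is never reached.

    `log` is a list of (elapsed_seconds, accuracy) pairs in time order.
    """
    for elapsed, acc in log:
        if acc >= target:
            return elapsed
    return None
```

Measuring the whole run end-to-end is what lets TTA capture the accuracy cost of semantics-changing optimizations like reduced precision, which pure throughput metrics miss.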