Image Classification at Supercomputer Scale
Deep learning is extremely computationally intensive, and hardware vendors
have responded by building faster accelerators in large clusters. Training deep
learning models at petaFLOPS scale requires overcoming both algorithmic and
systems software challenges. In this paper, we discuss three systems-related
optimizations: (1) distributed batch normalization to control per-replica batch
sizes, (2) input pipeline optimizations to sustain model throughput, and (3)
2-D torus all-reduce to speed up gradient summation. We combine these
optimizations to train ResNet-50 on ImageNet to 76.3% accuracy in 2.2 minutes
on a 1024-chip TPU v3 Pod with a training throughput of over 1.05 million
images/second and no accuracy drop.
Comment: Presented as part of the Systems for ML Workshop @ NIPS 2018.
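As a rough illustration of the first optimization, the sketch below simulates distributed batch normalization in NumPy: normalization statistics are aggregated across a group of replicas, so the effective normalization batch stays fixed even when per-replica batches shrink. The group size, shapes, and function name are illustrative assumptions, not the paper's TPU implementation.

import numpy as np

def distributed_batch_norm(replica_batches, eps=1e-5):
    """replica_batches: list of (per_replica_batch, features) arrays
    forming one normalization group."""
    # Each replica contributes its local sufficient statistics ...
    means = [x.mean(axis=0) for x in replica_batches]
    sq_means = [(x ** 2).mean(axis=0) for x in replica_batches]
    # ... which are all-reduced (here: simply averaged) across the group.
    group_mean = np.mean(means, axis=0)
    group_var = np.mean(sq_means, axis=0) - group_mean ** 2
    return [(x - group_mean) / np.sqrt(group_var + eps)
            for x in replica_batches]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    group = [rng.standard_normal((16, 4)) for _ in range(4)]  # 4 replicas
    normalized = distributed_batch_norm(group)
    stacked = np.concatenate(normalized)
    assert np.allclose(stacked.mean(axis=0), 0.0, atol=1e-6)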
Massively Distributed SGD: ImageNet/ResNet-50 Training in a Flash
Scaling distributed deep learning to a massive GPU cluster is
challenging due to the instability of large mini-batch training and the
overhead of gradient synchronization. We address the instability of
large mini-batch training with batch-size control and label smoothing. We
address the overhead of gradient synchronization with 2D-Torus all-reduce.
Specifically, 2D-Torus all-reduce arranges GPUs in a logical 2D grid and
performs a series of collective operations in different orientations. These two
techniques are implemented with Neural Network Libraries (NNL). We have
successfully trained ImageNet/ResNet-50 in 122 seconds without significant
accuracy loss on the ABCI cluster.
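The sketch below is a toy NumPy simulation of the two-phase idea behind 2D-Torus all-reduce: gradients are first reduced along one orientation of the logical grid, and the partial sums are then reduced along the other orientation before being broadcast back. The chunked reduce-scatter/all-gather scheduling of the real NNL implementation is omitted; the grid shape and tensor size are illustrative assumptions.

import numpy as np

def torus_allreduce_2d(grads, rows, cols):
    """grads: list of per-GPU gradient vectors, length rows * cols."""
    grid = np.stack(grads).reshape(rows, cols, -1)
    row_sums = grid.sum(axis=1)   # reduce along the horizontal orientation
    total = row_sums.sum(axis=0)  # reduce along the vertical orientation
    return [total.copy() for _ in range(rows * cols)]  # broadcast to all GPUs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(8) for _ in range(4 * 4)]
    out = torus_allreduce_2d(grads, rows=4, cols=4)
    assert np.allclose(out[0], np.sum(grads, axis=0))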
A Study of BFLOAT16 for Deep Learning Training
This paper presents the first comprehensive empirical study demonstrating the
efficacy of the Brain Floating Point (BFLOAT16) half-precision format for Deep
Learning training across image classification, speech recognition, language
modeling, generative networks and industrial recommendation systems. BFLOAT16
is attractive for Deep Learning training for two reasons: the range of values
it can represent is the same as that of IEEE 754 floating-point format (FP32)
and conversion to/from FP32 is simple. Maintaining the same range as FP32 is
important to ensure that no hyper-parameter tuning is required for convergence;
e.g., IEEE 754 compliant half-precision floating point (FP16) requires
hyper-parameter tuning. In this paper, we discuss the flow of tensors and
various key operations in mixed precision training, and delve into details of
operations, such as the rounding modes for converting FP32 tensors to BFLOAT16.
We have implemented a method to emulate BFLOAT16 operations in TensorFlow,
Caffe2, IntelCaffe, and Neon for our experiments. Our results show that deep
learning training using BFLOAT16 tensors achieves the same state-of-the-art
(SOTA) results across domains as FP32 tensors in the same number of iterations
and with no changes to hyper-parameters.
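To make the rounding-mode discussion concrete, here is a minimal NumPy emulation of FP32-to-BFLOAT16 conversion with round-to-nearest-even: BFLOAT16 keeps FP32's 8 exponent bits (hence the same range) and truncates the mantissa to 7 bits. This is only an illustrative emulation, not the code used in the paper's experiments.

import numpy as np

def fp32_to_bf16_rne(x):
    """Round FP32 values to the nearest BFLOAT16 value (ties to even),
    returned as FP32 numbers exactly representable in BF16."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)       # ties go to even
    bf16_bits = ((bits + rounding_bias) >> 16) << 16  # keep top 16 bits
    return bf16_bits.view(np.float32)

if __name__ == "__main__":
    vals = np.array([1.0, 1.0009765625, 3.14159e8], dtype=np.float32)
    print(fp32_to_bf16_rne(vals))  # values snapped to the BF16 grid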
Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training
Deep learning has become widely used in complex AI applications. Yet,
training a deep neural network (DNN) model requires a considerable amount of
computation, long running times, and a great deal of energy. Nowadays, many-core AI
accelerators (e.g., GPUs and TPUs) are designed to improve the performance of
AI training. However, processors from different vendors perform dissimilarly in
terms of performance and energy consumption. To investigate the differences
among several popular off-the-shelf processors (i.e., Intel CPU, NVIDIA GPU,
AMD GPU, and Google TPU) in training DNNs, we carry out a comprehensive
empirical study on the performance and energy efficiency of these processors by
benchmarking a representative set of deep learning workloads, including
computation-intensive operations, classical convolutional neural networks
(CNNs), recurrent neural networks (LSTM), Deep Speech 2, and Transformer.
Unlike existing end-to-end benchmarks, which only report the
training time, we investigate the impact of the hardware, the vendor's software
library, and the deep learning framework on the performance and energy consumption
of AI training. Our evaluation methods and results not only provide an
informative guide for end-users to select proper AI accelerators, but also
expose some opportunities for hardware vendors to improve their software
libraries.
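A minimal sketch of the kind of measurement loop such a study depends on is shown below: a fixed number of training steps is timed to get throughput, and sampled power draw is integrated to estimate energy per image. The function read_power_watts() is a hypothetical stand-in for a vendor tool such as nvidia-smi or RAPL counters, and the dummy training step is a placeholder; neither reflects the paper's actual harness.

import time

def read_power_watts():
    return 250.0  # placeholder: replace with a real power sensor reading

def benchmark(train_step, steps, batch_size, sample_every=0.1):
    samples, start = [], time.time()
    next_sample = start
    for _ in range(steps):
        train_step()
        now = time.time()
        if now >= next_sample:              # sample power periodically
            samples.append(read_power_watts())
            next_sample = now + sample_every
    elapsed = time.time() - start
    images = steps * batch_size
    avg_power = sum(samples) / max(len(samples), 1)
    return images / elapsed, avg_power * elapsed / images  # img/s, J/img

if __name__ == "__main__":
    throughput, joules_per_image = benchmark(lambda: time.sleep(0.01),
                                             steps=50, batch_size=32)
    print(f"{throughput:.1f} images/s, {joules_per_image:.2f} J/image")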
Highly Available Data Parallel ML training on Mesh Networks
Data-parallel ML models can take several days or weeks to train on many
accelerators. Such long-running jobs require the cluster of resources to
remain available for the entire duration of training. On a mesh
network this is challenging because failures will create holes in the mesh.
Packets must be routed around the failed chips for full connectivity. In this
paper, we present techniques to route gradient summation allreduce traffic
around failed chips on 2-D meshes. We evaluate performance of our fault
tolerant allreduce techniques via the MLPerf-v0.7 ResNet-50 and BERT
benchmarks. Performance results show minimal impact on training throughput on
512 and 1024 TPU-v3 chips.
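The sketch below is only a loose illustration of the fault-tolerance idea (not the paper's routing algorithm): healthy chips of a 2-D mesh are enumerated in snake order, failed coordinates are skipped, and a naive all-reduce runs over the remaining chips. The mesh size, failure set, and helper names are illustrative assumptions.

import numpy as np

def healthy_ring(rows, cols, failed):
    """Snake-order enumeration of the mesh, skipping failed coordinates."""
    ring = []
    for r in range(rows):
        line = range(cols) if r % 2 == 0 else reversed(range(cols))
        ring.extend((r, c) for c in line if (r, c) not in failed)
    return ring

def allreduce_over_ring(grads, ring):
    total = sum(grads[chip] for chip in ring)     # reduce over healthy chips
    return {chip: total.copy() for chip in ring}  # broadcast back

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    failed = {(1, 2)}
    grads = {(r, c): rng.standard_normal(4)
             for r in range(4) for c in range(4) if (r, c) not in failed}
    reduced = allreduce_over_ring(grads, healthy_ring(4, 4, failed))
    assert len(reduced) == 15  # the failed chip is routed around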
Distributed Low Precision Training Without Mixed Precision
Low precision training is one of the most popular strategies for deploying
deep models on limited hardware resources. Fixed-point implementations of
DCNs have the potential to reduce complexity and facilitate deployment on
embedded hardware. However, most low precision training solutions are based
on a mixed precision strategy. In this paper, we present an ablation study
of different low precision training strategies and propose a solution that
uses the IEEE FP16 format throughout the training process. We tested
ResNet-50 on a 128-GPU cluster on the full ImageNet dataset. We find that
the FP32 format is not essential for training deep models. We also observe
that communication cost reduction, model compression, and large-scale
distributed training are three coupled problems.
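As a rough illustration of training without mixed precision, the NumPy sketch below keeps weights, gradients, and the SGD update entirely in IEEE FP16, with no FP32 master copy. The learning rate, shapes, and toy loss are illustrative assumptions, not the paper's training recipe.

import numpy as np

def fp16_sgd_step(w, grad, lr=np.float16(0.01)):
    """Plain SGD entirely in float16; no FP32 master weights."""
    return (w - lr * grad).astype(np.float16)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(8).astype(np.float16)
    x = rng.standard_normal(8).astype(np.float16)
    for _ in range(100):
        grad = (2 * (w @ x) * x).astype(np.float16)  # d/dw of (w.x)^2
        w = fp16_sgd_step(w, grad)
    print(w.dtype, float(w @ x))  # stays float16 throughout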
Large batch size training of neural networks with adversarial training and second-order information
The most straightforward method to accelerate Stochastic Gradient Descent
(SGD) computation is to distribute the randomly selected batch of inputs over
multiple processors. To keep the distributed processors fully utilized requires
commensurately growing the batch size. However, large batch training often
leads to poorer generalization. A recently proposed solution for this problem
is to use adaptive batch sizes in SGD. In this case, one starts with a small
number of processes and scales the processes as training progresses. Two major
challenges with this approach are (i) that dynamically resizing the cluster can
add non-trivial overhead, in part since it is currently not supported, and (ii)
that the overall speed up is limited by the initial phase with smaller batches.
In this work, we address both challenges by developing a new adaptive batch
size framework, with autoscaling based on the Ray framework. This allows very
efficient elastic scaling with negligible resizing overhead (0.32% of training time for
ResNet18 ImageNet training). Furthermore, we propose a new adaptive batch size
training scheme using second order methods and adversarial training. These
enable increasing batch sizes earlier during training, which leads to better
training time. We extensively evaluate our method on Cifar-10/100, SVHN,
TinyImageNet, and ImageNet datasets, using multiple neural networks, including
ResNets and smaller networks such as SqueezeNext. Our method exceeds the
performance of existing solutions in terms of both accuracy (by up to 1%) and
the number of SGD iterations required. Importantly, this is
achieved without any additional hyper-parameter tuning to tailor our method in
any of these experiments.
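For a concrete picture of an adaptive batch-size schedule, the sketch below grows the global batch at chosen points in training and scales the learning rate linearly with it. The growth criterion here is a simple epoch milestone chosen for illustration; the paper instead uses second-order information and adversarial training to decide when larger batches are safe, and the milestone values are assumptions.

def adaptive_batch_schedule(epoch, base_batch=256, base_lr=0.1,
                            growth_epochs=(30, 60, 80), max_batch=16384):
    """Double the global batch at each milestone and scale the LR linearly."""
    factor = 2 ** sum(epoch >= e for e in growth_epochs)
    batch = min(base_batch * factor, max_batch)
    return batch, base_lr * batch / base_batch  # linear LR scaling

if __name__ == "__main__":
    for epoch in (0, 30, 60, 80):
        print(epoch, adaptive_batch_schedule(epoch))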
Scale MLPerf-0.6 models on Google TPU-v3 Pods
The recent submission of Google TPU-v3 Pods to the industry wide MLPerf v0.6
training benchmark demonstrates the scalability of a suite of industry relevant
ML models. MLPerf defines a suite of models, datasets and rules to follow when
benchmarking to ensure results are comparable across hardware, frameworks and
companies. Using this suite of models, we discuss the optimizations and
techniques, including choice of optimizer, spatial partitioning, and weight
update sharding, necessary to scale to 1024 TPU chips. Furthermore, we identify
properties of models that make scaling them challenging, such as limited data
parallelism and unscaled weights. These optimizations contribute to record
performance in Transformer, ResNet-50, and SSD in the Google MLPerf-0.6
submission.
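As a rough illustration of weight update sharding, the NumPy sketch below applies the optimizer update to a distinct 1/N shard of the parameters per replica after the gradient all-reduce, then concatenates (all-gathers) the shards. The SGD rule, shapes, and shard layout are illustrative assumptions, not the TPU implementation.

import numpy as np

def sharded_update(weights, reduced_grad, n_replicas, lr=0.1):
    shards_w = np.array_split(weights, n_replicas)
    shards_g = np.array_split(reduced_grad, n_replicas)
    # each "replica" i updates only its own shard of the parameters ...
    updated = [w - lr * g for w, g in zip(shards_w, shards_g)]
    # ... and the shards are all-gathered into the full parameter vector
    return np.concatenate(updated)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w, g = rng.standard_normal(10), rng.standard_normal(10)
    assert np.allclose(sharded_update(w, g, n_replicas=4), w - 0.1 * g)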
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Training large deep neural networks on massive datasets is computationally
very challenging. There has been a recent surge of interest in using large batch
stochastic optimization methods to tackle this issue. The most prominent
algorithm in this line of research is LARS, which by employing layerwise
adaptive learning rates trains ResNet on ImageNet in a few minutes. However,
LARS performs poorly for attention models like BERT, indicating that its
performance gains are not consistent across tasks. In this paper, we first
study a principled layerwise adaptation strategy to accelerate training of deep
neural networks using large mini-batches. Using this strategy, we develop a new
layerwise adaptive large batch optimization technique called LAMB; we then
provide convergence analysis of LAMB as well as LARS, showing convergence to a
stationary point in general nonconvex settings. Our empirical results
demonstrate the superior performance of LAMB across various tasks such as BERT
and ResNet-50 training with very little hyperparameter tuning. In particular,
for BERT training, our optimizer enables use of very large batch sizes of 32868
without any degradation of performance. By increasing the batch size to the
memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to
just 76 minutes (Table 1). The LAMB implementation is available at
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py
Comment: Published as a conference paper at ICLR 2020.
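The sketch below illustrates the layerwise trust-ratio idea behind LAMB in NumPy: an Adam-style update is computed for a layer, then rescaled by the ratio of the parameter norm to the update norm before being applied. Bias correction, the clipping function, and hyperparameter choices are simplified here; the linked repository contains the actual optimizer.

import numpy as np

def lamb_step(w, grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-6, wd=0.01):
    m = b1 * m + (1 - b1) * grad           # first moment
    v = b2 * v + (1 - b2) * grad ** 2      # second moment
    update = m / (np.sqrt(v) + eps) + wd * w
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return w - lr * trust_ratio * update, m, v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(16)
    m = v = np.zeros_like(w)
    for _ in range(10):
        grad = 2 * w                        # toy quadratic loss ||w||^2
        w, m, v = lamb_step(w, grad, m, v)
    print(np.linalg.norm(w))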
Parameterized Structured Pruning for Deep Neural Networks
As a result of the growing size of Deep Neural Networks (DNNs), the gap
between model requirements and hardware capabilities in terms of memory and
compute keeps widening. To effectively
compress DNNs, quantization and connection pruning are usually considered.
However, unconstrained pruning usually leads to unstructured parallelism, which
maps poorly to massively parallel processors, and substantially reduces the
efficiency of general-purpose processors. The same applies to quantization,
which often requires dedicated hardware. We propose Parameterized Structured
Pruning (PSP), a novel method to dynamically learn the shape of DNNs through
structured sparsity. PSP parameterizes structures (e.g. channel- or layer-wise)
in a weight tensor and leverages weight decay to learn a clear distinction
between important and unimportant structures. As a result, PSP maintains
prediction performance, creates a substantial amount of sparsity that is
structured and, thus, easy and efficient to map to a variety of massively
parallel processors, which are mandatory for utmost compute power and energy
efficiency. PSP is experimentally validated on the popular CIFAR10/100 and
ILSVRC2012 datasets using ResNet and DenseNet architectures, respectively.
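As a rough illustration of the structured-parameterization idea, the NumPy sketch below attaches a learned scale to each output channel of a weight tensor; weight decay on those scales pushes unimportant channels toward zero, after which whole channels can be pruned. The decay value, threshold, and update loop are illustrative assumptions, not PSP's exact formulation.

import numpy as np

def apply_channel_gates(weights, alpha):
    """weights: (out_channels, in_channels); alpha: (out_channels,)."""
    return weights * alpha[:, None]

def decay_and_prune(alpha, weight_decay=0.01, threshold=0.05):
    alpha = alpha * (1 - weight_decay)  # weight decay on the gate parameters
    keep = np.abs(alpha) >= threshold   # structured: whole channels survive or not
    return alpha * keep, keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 4))
    alpha = np.abs(rng.standard_normal(8))
    for _ in range(50):                           # decay shrinks unimportant gates
        alpha, keep = decay_and_prune(alpha)
    pruned_w = apply_channel_gates(w, alpha)      # pruned channels are exactly zero
    print("channels kept:", int(keep.sum()), "of", len(keep))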