Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. Most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (No. 6 on the Top500 list). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review.
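To make the communication step concrete, below is a minimal sketch of the pattern these designs optimize: an Allreduce over GPU-resident gradient buffers through a CUDA-aware MPI, expressed here with mpi4py and CuPy purely for illustration. It is not the authors' modified MPI_Allreduce, and the buffer size is an arbitrary assumption.

```python
# Sketch: allreduce of GPU-resident "gradients" with a CUDA-aware MPI build
# (e.g., MVAPICH2-GDR or CUDA-aware Open MPI). Illustrative only.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
size = comm.Get_size()

grads = cp.random.randn(25_000_000).astype(cp.float32)  # ~100 MB of local gradients on this GPU
summed = cp.empty_like(grads)

# With a CUDA-aware MPI, device pointers are passed directly (no host staging copies).
comm.Allreduce(grads, summed, op=MPI.SUM)
summed /= size  # averaged gradients, as in data-parallel SGD
```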
A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training
A new neural network architecture called Mixture-of-Experts (MoE) has been
proposed recently that increases the parameters of a neural network (the base
model) by adding sparsely activated expert blocks, without changing the total
number of floating point operations for training or inference. In theory, this
architecture allows us to train arbitrarily large models while keeping the
computational costs the same as those of the base model. However, beyond 64 to 128
expert blocks, prior work has observed diminishing returns in the test
accuracies of these MoE models. Thus, training high-quality MoE models requires
us to scale the size of the base models, along with the number of expert
blocks. In this work, we propose a novel, three-dimensional, hybrid parallel
algorithm that combines tensor, expert, and data parallelism to enable the
training of MoE models with 4-8x larger base models than the current
state-of-the-art -- DeepSpeed-MoE. We propose memory optimizations in the
optimizer step, and communication optimizations that eliminate redundant
movement of data. Removing these redundancies provides a speedup of nearly 21%.
When training a 40 billion parameter MoE model (6.7 billion base model with 16
experts) on 128 V100 GPUs, our optimizations improve the fraction of peak
half-precision flop/s achieved from 20% to 27%.
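As a rough illustration of the idea (not the paper's implementation), the sketch below shows one way ranks could be carved into tensor-, expert-, and data-parallel process groups with PyTorch's distributed API; the grid layout, group roles, and names are assumptions.

```python
# Conceptual sketch of a 3D (data x expert x tensor) process-group layout.
# Assumes dist.init_process_group(...) has already been called on every rank;
# new_group must be invoked by all ranks with identical rank lists, as done here.
import torch.distributed as dist

def build_3d_groups(world_size, tensor_par, expert_par):
    assert world_size % (tensor_par * expert_par) == 0
    data_par = world_size // (tensor_par * expert_par)
    # View the flat rank space as a (data, expert, tensor) grid.
    grid = [[[(d * expert_par + e) * tensor_par + t
              for t in range(tensor_par)]
             for e in range(expert_par)]
            for d in range(data_par)]

    tensor_groups = [dist.new_group(grid[d][e])                                  # shard weights
                     for d in range(data_par) for e in range(expert_par)]
    expert_groups = [dist.new_group([grid[d][e][t] for e in range(expert_par)])  # route tokens to experts
                     for d in range(data_par) for t in range(tensor_par)]
    data_groups   = [dist.new_group([grid[d][e][t] for d in range(data_par)])    # gradient allreduce
                     for e in range(expert_par) for t in range(tensor_par)]
    return tensor_groups, expert_groups, data_groups
```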
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
Mixture-of-Experts (MoE) is a neural network architecture that
adds sparsely activated expert blocks to a base model, increasing
the number of parameters without impacting computational costs.
However, current distributed deep learning frameworks are limited
in their ability to train high-quality MoE models with large base
models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor,
and expert parallelism to enable the training of MoE models with
4–8× larger base models than the current state-of-the-art. We also
describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement.
We implement our approach in DeepSpeed and achieve speedups of
26% over a baseline (i.e. without our communication optimizations)
when training a 40 billion parameter MoE model (6.7 billion base
model with 16 experts) on 128 V100 GPUs.
https://doi.org/10.1145/3577193.359370
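For context, here is a hedged sketch of how an expert block is typically wrapped with DeepSpeed's public MoE layer. The constructor arguments follow the deepspeed.moe.layer.MoE API as commonly documented, may differ across DeepSpeed versions, and do not include the tensor-parallel dimension that DeepSpeed-TED adds; the sizes shown only echo the paper's 6.7B-base / 16-expert configuration.

```python
# Hedged sketch of DeepSpeed's MoE layer wrapping a base-model FFN.
# Assumes the model is later set up via deepspeed.initialize, which creates
# the expert-parallel process groups implied by ep_size.
import torch.nn as nn
from deepspeed.moe.layer import MoE

hidden = 4096
expert_ffn = nn.Sequential(
    nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
)

moe_block = MoE(
    hidden_size=hidden,
    expert=expert_ffn,   # the base-model FFN replicated as experts
    num_experts=16,      # 16 expert blocks, as in the paper's 40B configuration
    ep_size=16,          # expert-parallel group size: experts spread over 16 ranks
    k=1,                 # top-1 gating
)
```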
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
As the training of giant dense models hits the boundary on the availability
and capability of the hardware resources today, Mixture-of-Experts (MoE) models
become one of the most promising model architectures due to their significant
training cost reduction compared to a quality-equivalent dense model. Its
training cost saving is demonstrated from encoder-decoder models (prior works)
to a 5x saving for auto-regressive language models (this work along with
parallel explorations). However, due to the much larger model size and unique
architecture, how to provide fast MoE model inference remains challenging and
unsolved, limiting its practical usage. To tackle this, we present
DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the
DeepSpeed library, including novel MoE architecture designs and model
compression techniques that reduce MoE model size by up to 3.7x, and a highly
optimized inference system that provides 7.3x better latency and cost compared
to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented
scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x
cheaper inference compared to quality-equivalent dense models. We hope our
innovations and systems help open a promising path to new directions in the
large model landscape, a shift from dense to sparse MoE models, where training
and deploying higher-quality models with fewer resources becomes more widely
possible.
Comment: This paper is published at ICML 2022: https://proceedings.mlr.press/v162/rajbhandari22
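To illustrate why sparsely activated expert blocks add parameters without adding per-token compute, here is a toy top-1 gated MoE block in PyTorch. It is a conceptual sketch of the general MoE mechanism, not the specific architecture designs or compression techniques proposed in the paper, and all sizes are arbitrary.

```python
# Toy top-1 gated MoE block: each token is routed to exactly one expert,
# so adding experts grows parameters but not per-token FLOPs.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=16):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (tokens, d_model)
        expert_idx = self.gate(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])     # only the selected expert runs per token
        return out

tokens = torch.randn(8, 512)
print(Top1MoE()(tokens).shape)                  # torch.Size([8, 512])
```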