Mixture-of-Experts (MoE) is a recently proposed neural network architecture
that increases the parameter count of a neural network (the base model) by
adding sparsely activated expert blocks, without changing the total number of
floating point operations for training or inference. In theory, this
architecture allows us to train arbitrarily large models while keeping the
computational cost the same as that of the base model. However, beyond 64 to
128 expert blocks, prior work has observed diminishing returns in the test
accuracies of these MoE models. Thus, training high-quality MoE models requires
us to scale the size of the base model along with the number of expert blocks.
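To make the sparse activation of expert blocks concrete, the following is a
minimal sketch of a top-1 gated MoE layer (illustrative only; the class name,
layer sizes, and gating scheme are assumptions, not the implementation studied
in this work). Per token, only one expert FFN runs, so per-token compute stays
roughly constant as the number of experts, and hence the parameter count, grows:

    # Hypothetical top-1 gated MoE layer (PyTorch); sizes are illustrative.
    import torch
    import torch.nn as nn

    class TopOneMoE(nn.Module):
        def __init__(self, d_model=1024, d_ff=4096, num_experts=16):
            super().__init__()
            self.gate = nn.Linear(d_model, num_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ])

        def forward(self, x):  # x: (num_tokens, d_model)
            expert_ids = self.gate(x).argmax(dim=-1)  # route each token to one expert
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = expert_ids == e                # tokens assigned to expert e
                if mask.any():
                    out[mask] = expert(x[mask])       # only these tokens touch expert e
            return out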
In this work, we propose a novel, three-dimensional, hybrid parallel algorithm
that combines tensor, expert, and data parallelism to enable the training of
MoE models with 4-8x larger base models than the current state-of-the-art,
DeepSpeed-MoE. We also propose memory optimizations in the optimizer step and
communication optimizations that eliminate redundant data movement. Removing
these redundancies provides a speedup of nearly 21%.
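As a rough illustration of how such a three-dimensional tensor-expert-data
decomposition partitions GPUs, the sketch below enumerates the rank groups for
a hypothetical configuration (the dimension ordering and example sizes are
assumptions, not necessarily the exact layout of our algorithm):

    # Hypothetical 3D decomposition of GPU ranks into tensor (T), expert (E),
    # and data (D) parallel groups; tensor parallelism varies fastest here.
    def parallel_groups(world_size, T, E):
        assert world_size % (T * E) == 0
        D = world_size // (T * E)
        ranks = list(range(world_size))
        tensor_groups = [ranks[i:i + T] for i in range(0, world_size, T)]
        expert_groups = [[d * T * E + e * T + t for e in range(E)]
                         for d in range(D) for t in range(T)]
        data_groups = [[d * T * E + e * T + t for d in range(D)]
                       for e in range(E) for t in range(T)]
        return tensor_groups, expert_groups, data_groups

    # Example: 128 GPUs with 4-way tensor and 16-way expert parallelism leave
    # 2-way data parallelism (4 * 16 * 2 = 128).
    tp, ep, dp = parallel_groups(128, T=4, E=16)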
When training a 40 billion parameter MoE model (a 6.7 billion parameter base
model with 16 experts) on 128 V100 GPUs, our optimizations significantly
improve the achieved throughput from 20% to 27% of peak half precision flop/s.
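For context on the model size quoted above, a hedged back-of-envelope
calculation shows how a 6.7 billion parameter base model with 16 experts lands
near 40 billion total parameters, assuming GPT-3 6.7B dimensions (32 layers,
hidden size 4096, 4x FFN expansion) and a 16-expert MoE layer replacing every
other FFN, as in DeepSpeed-MoE style models; the exact architecture may differ:

    # Approximate parameter counts under the stated (assumed) dimensions.
    d, layers, experts = 4096, 32, 16
    ffn = 2 * d * 4 * d                          # ~134M parameters per FFN
    attn = 4 * d * d                             # ~67M parameters per attention block
    base = layers * (ffn + attn)                 # ~6.4B; embeddings bring this to ~6.7B
    extra = (layers // 2) * (experts - 1) * ffn  # parameters added by the extra experts
    print((base + extra) / 1e9)                  # ~38.7, i.e. roughly 40B in total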