Mixture-of-Experts (MoE) is a recently proposed neural network architecture
that increases the parameter count of a neural network (the base model) by
adding sparsely activated expert blocks, without changing the total number of
floating point operations for training or inference. In theory, this
architecture allows us to train arbitrarily large models while keeping the
computational cost the same as that of the base model. However, beyond 64 to
128 expert blocks, prior work has observed diminishing returns in the test
accuracies of these MoE models. Thus, training high-quality MoE models requires
us to scale the size of the base model along with the number of expert blocks.
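To make the sparse activation of expert blocks concrete, the following is a
minimal sketch of a top-1 gated MoE layer (illustrative only; the class name,
layer sizes, and gating scheme are assumptions, not the implementation studied
in this work). Per token, only one expert FFN runs, so per-token compute stays
roughly constant as the number of experts, and hence the parameter count, grows:

    # Hypothetical top-1 gated MoE layer (PyTorch); sizes are illustrative.
    import torch
    import torch.nn as nn

    class TopOneMoE(nn.Module):
        def __init__(self, d_model=1024, d_ff=4096, num_experts=16):
            super().__init__()
            self.gate = nn.Linear(d_model, num_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ])

        def forward(self, x):  # x: (num_tokens, d_model)
            expert_ids = self.gate(x).argmax(dim=-1)  # route each token to one expert
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = expert_ids == e                # tokens assigned to expert e
                if mask.any():
                    out[mask] = expert(x[mask])       # only these tokens touch expert e
            return out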
In this work, we propose a novel, three-dimensional, hybrid parallel algorithm
that combines tensor, expert, and data parallelism to enable the training of
MoE models with 4-8x larger base models than the current state-of-the-art,
DeepSpeed-MoE. We also propose memory optimizations in the optimizer step and
communication optimizations that eliminate redundant data movement. Removing
these redundancies provides a speedup of nearly 21%.
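As a rough illustration of how such a three-dimensional tensor-expert-data
decomposition partitions GPUs, the sketch below enumerates the rank groups for
a hypothetical configuration (the dimension ordering and example sizes are
assumptions, not necessarily the exact layout of our algorithm):

    # Hypothetical 3D decomposition of GPU ranks into tensor (T), expert (E),
    # and data (D) parallel groups; tensor parallelism varies fastest here.
    def parallel_groups(world_size, T, E):
        assert world_size % (T * E) == 0
        D = world_size // (T * E)
        ranks = list(range(world_size))
        tensor_groups = [ranks[i:i + T] for i in range(0, world_size, T)]
        expert_groups = [[d * T * E + e * T + t for e in range(E)]
                         for d in range(D) for t in range(T)]
        data_groups = [[d * T * E + e * T + t for d in range(D)]
                       for e in range(E) for t in range(T)]
        return tensor_groups, expert_groups, data_groups

    # Example: 128 GPUs with 4-way tensor and 16-way expert parallelism leave
    # 2-way data parallelism (4 * 16 * 2 = 128).
    tp, ep, dp = parallel_groups(128, T=4, E=16)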
When training a 40 billion parameter MoE model (a 6.7 billion parameter base
model with 16 experts) on 128 V100 GPUs, our optimizations significantly
improve the achieved throughput from 20% to 27% of peak half precision flop/s.
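For context on the model size quoted above, a hedged back-of-envelope
calculation shows how a 6.7 billion parameter base model with 16 experts lands
near 40 billion total parameters, assuming GPT-3 6.7B dimensions (32 layers,
hidden size 4096, 4x FFN expansion) and a 16-expert MoE layer replacing every
other FFN, as in DeepSpeed-MoE style models; the exact architecture may differ:

    # Approximate parameter counts under the stated (assumed) dimensions.
    d, layers, experts = 4096, 32, 16
    ffn = 2 * d * 4 * d                          # ~134M parameters per FFN
    attn = 4 * d * d                             # ~67M parameters per attention block
    base = layers * (ffn + attn)                 # ~6.4B; embeddings bring this to ~6.7B
    extra = (layers // 2) * (experts - 1) * ffn  # parameters added by the extra experts
    print((base + extra) / 1e9)                  # ~38.7, i.e. roughly 40B in total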