Recent years have seen an increase in the development of large deep learning
(DL) models, which makes training efficiency crucial. Common practice is
struggling with the trade-off between usability and performance. On one hand,
DL frameworks such as PyTorch use dynamic graphs to facilitate model developers
at a price of sub-optimal model training performance. On the other hand,
practitioners propose various approaches to improving the training efficiency
by sacrificing some of the flexibility, ranging from making the graph static
for more thorough optimization (e.g., XLA) to customizing optimization towards
large-scale distributed training (e.g., DeepSpeed and Megatron-LM).
In this paper, we aim to address the tension between usability and training
efficiency through separation of concerns. Inspired by DL compilers that
decouple the platform-specific optimizations of a tensor-level operator from
its arithmetic definition, this paper proposes a schedule language to decouple
model execution from definition. Specifically, the schedule works on a PyTorch
model and uses a set of schedule primitives to convert the model for common
model training optimizations such as high-performance kernels, effective 3D
parallelism, and efficient activation checkpointing. Compared to existing
optimization solutions, we optimize the model as-needed through high-level
primitives, and thus preserving programmability and debuggability for users to
a large extent. Our evaluation results show that by scheduling the existing
hand-crafted optimizations in a systematic way, we are able to improve training
throughput by up to 3.35x on a single machine with 8 NVIDIA V100 GPUs, and by
up to 1.32x on multiple machines with up to 64 GPUs, when compared to the
out-of-the-box performance of DeepSpeed and Megatron-LM