A Study of Checkpointing in Large Scale Training of Deep Neural Networks
Deep learning (DL) applications are increasingly being deployed on HPC
systems to leverage the massive parallelism and computing power of those
systems for DL model training. While significant effort has been put into
facilitating distributed training in DL frameworks, fault tolerance has been
largely ignored. In this work, we evaluate checkpoint-restart, a common fault
tolerance technique in HPC workloads. We perform experiments with three
state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow).
We evaluate the computational cost of checkpointing, file formats and file
sizes, the impact of scale, and deterministic checkpointing. Our evaluation
shows some critical differences in checkpoint mechanisms and exposes several
bottlenecks in existing checkpointing implementations. We provide discussion
points that can aid users in selecting a fault-tolerant framework to use in
HPC. We also provide takeaway points that framework developers can use to
facilitate better checkpointing of DL workloads in HPC.
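
For context, checkpoint-restart amounts to periodically serializing the training state to stable storage and restoring it after a failure. Below is a minimal sketch of the pattern in PyTorch, one of the evaluated frameworks; the file name, toy model, and training loop are illustrative assumptions, not the benchmark setup used in the paper.

```python
import os

import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"  # hypothetical file name for illustration

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Restart: if a checkpoint exists, restore the model, optimizer, and
# epoch counter so training resumes where the failed run left off.
start_epoch = 0
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 10):
    # Stand-in training step on random data, just to make the loop runnable.
    x = torch.randn(4, 10)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Checkpoint: serialize the full training state to disk each epoch.
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT_PATH,
    )
```

The per-epoch torch.save call here is the kind of checkpointing cost the paper measures: its overhead depends on model size, file format, and the storage system, which is why the frequency and placement of such calls matter at scale.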