AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning
In the last few years, the memory requirements to train state-of-the-art
neural networks have far exceeded the DRAM capacities of modern hardware
accelerators. This has necessitated the development of efficient algorithms to
train these neural networks in parallel on large-scale GPU-based clusters.
Since computation is relatively inexpensive on modern GPUs, designing and
implementing extremely efficient communication in these parallel training
algorithms is critical for extracting the maximum performance. This paper
presents AxoNN, a parallel deep learning framework that exploits asynchrony and
message-driven execution to schedule neural network operations on each GPU,
thereby reducing GPU idle time and maximizing hardware efficiency. By using the
CPU memory as a scratch space for periodically offloading data during training,
AxoNN reduces GPU memory consumption fourfold. This allows us to quadruple the
number of parameters per GPU, which in turn reduces the amount of communication
and improves performance by over 13%. When tested
against large transformer models with 12-100 billion parameters on 48-384
NVIDIA Tesla V100 GPUs, AxoNN achieves a per-GPU throughput of 49.4-54.78% of
theoretical peak and reduces the training time by 22-37 days (15-25% speedup)
as compared to the state-of-the-art.

Comment: Proceedings of the IEEE International Parallel & Distributed
Processing Symposium (IPDPS). IEEE Computer Society, May 202
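The offload idea from the abstract, using host DRAM as a scratch space so the compute stream never blocks on the copy, can be sketched with a plain Python thread and queue. This is a toy model, not AxoNN's implementation: the `cpu_scratch` dict stands in for host memory, Python lists stand in for GPU buffers, and all names are hypothetical.

```python
# Minimal sketch of asynchronous offload to a CPU-side scratch space.
# A background thread drains a queue of "GPU" buffers into host memory,
# so the main (compute) loop keeps running while copies happen.
import threading
import queue

cpu_scratch = {}          # stands in for host DRAM scratch space
offload_q = queue.Queue() # buffers waiting to be offloaded

def offload_worker():
    while True:
        item = offload_q.get()
        if item is None:              # shutdown sentinel
            break
        key, buf = item
        cpu_scratch[key] = list(buf)  # "copy" device buffer to host
        buf.clear()                   # release the "GPU" memory
        offload_q.task_done()

t = threading.Thread(target=offload_worker, daemon=True)
t.start()

# Main training loop: enqueue per-step activations for offload and
# immediately continue computing, never waiting on the copy itself.
for step in range(3):
    activations = [float(step)] * 4   # pretend per-layer activations
    offload_q.put((f"act_{step}", activations))

offload_q.join()                      # wait until all offloads land
offload_q.put(None)
t.join()
```

In a real framework the copy would be a pinned-memory, stream-ordered device-to-host transfer; the queue-plus-worker structure is only meant to show why the compute loop stays busy while data moves out.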
Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks
For many distributed applications, data communication is a major bottleneck for
both performance and energy consumption. As more cores are integrated per node,
the overall performance of the system generally increases, yet it eventually
becomes limited by the interconnection network. This is the case for
distributed data-parallel training of convolutional neural networks (CNNs), which
usually proceeds on a cluster with a small to moderate number of nodes. In this paper,
we analyze the performance of the Allreduce collective communication primitive,
which is key to efficient data-parallel distributed training of CNNs. Our study targets the
distinct realizations of this primitive in three high-performance instances of the Message
Passing Interface (MPI), namely MPICH, Open MPI, and Intel MPI, and employs a
cluster equipped with state-of-the-art processor and network technologies. In addition,
we apply the insights gained from the experimental analysis to the optimization of the
TensorFlow framework when running on top of Horovod. Our study reveals that a
careful selection of the most suitable MPI library and Allreduce (ARD) realization
accelerates training throughput by a factor of 1.2× compared with the default
algorithm in the same MPI library, and by up to 2.8× when comparing distinct MPI
libraries across a number of relevant CNN model+dataset combinations.
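The Allreduce primitive the abstract analyzes can be illustrated with a pure-Python simulation of the ring algorithm, one common realization inside MPI libraries: a reduce-scatter phase followed by an allgather, where each worker exchanges one chunk per step with its ring neighbours. This is a didactic sketch of the algorithm's data movement, not any specific MPI library's implementation; worker buffers are plain lists.

```python
# Toy simulation of ring allreduce (sum) across p "workers".
# Phase 1 (reduce-scatter): after p-1 steps, worker r holds the fully
# summed chunk (r+1) % p. Phase 2 (allgather): the completed chunks
# circulate around the ring until every worker has the full sum.
def ring_allreduce(grads):
    p = len(grads)                   # number of workers in the ring
    n = len(grads[0])                # length of each gradient vector
    assert n % p == 0, "vector length must split into p equal chunks"
    c = n // p                       # chunk size
    data = [list(g) for g in grads]  # per-worker working buffers

    def chunk(i):                    # slice covering chunk i
        return slice(i * c, (i + 1) * c)

    # Phase 1: reduce-scatter. In step t, worker r receives chunk
    # (r-1-t) % p from its left neighbour and adds it in place.
    for t in range(p - 1):
        for r in range(p):
            s = chunk((r - 1 - t) % p)
            left = data[(r - 1) % p][s]
            data[r][s] = [a + b for a, b in zip(data[r][s], left)]

    # Phase 2: allgather. In step t, worker r overwrites chunk
    # (r-t) % p with its left neighbour's completed copy.
    for t in range(p - 1):
        for r in range(p):
            s = chunk((r - t) % p)
            data[r][s] = list(data[(r - 1) % p][s])
    return data
```

Each worker sends and receives only n/p elements per step, which is why ring-based realizations scale well with message size; MPI libraries switch among this and other algorithms (e.g. recursive doubling) depending on message size and worker count, which is exactly the selection effect the paper measures.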