Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that require computation and communication at scale. The most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters, including the Piz
Daint system (No. 6 on the Top500 list). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.
Comment: 10 pages, 9 figures; submitted to IEEE IPDPS 2019 for peer review.
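To make the role of gradient aggregation concrete, here is a minimal sketch of data-parallel summation with MPI_Allreduce, the operation at the heart of the No-gRPC designs above (my illustration using mpi4py and NumPy with a random placeholder gradient, not the paper's CUDA-Aware implementation):

    # Minimal data-parallel gradient aggregation via MPI_Allreduce.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    # Each rank computes a local gradient on its shard of the mini-batch
    # (a random placeholder here).
    local_grad = np.random.rand(1 << 20).astype(np.float32)

    # Sum gradients across all ranks, then average: every rank ends up with
    # the same aggregated gradient, keeping the model replicas in sync.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= comm.size

A CUDA-Aware MPI can apply the same call directly to GPU buffers, which is what lets the proposed design keep large reductions on the device.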
Exascale Deep Learning for Climate Analytics
We extract pixel-level masks of extreme weather patterns using variants of
Tiramisu and DeepLabv3+ neural networks. We describe improvements to the
software frameworks, input pipeline, and the network training algorithms
necessary to efficiently scale deep learning on the Piz Daint and Summit
systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained
throughput of 21.0 PF/s and parallel efficiency of 79.0%. DeepLabv3+ scales up
to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel
efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor
Cores, a half-precision version of the DeepLabv3+ network achieves a peak and
sustained throughput of 1.13 EF/s and 999.0 PF/s, respectively.
Comment: 12 pages, 5 tables, 4 figures; Supercomputing Conference, November 11-16, 2018, Dallas, TX, US.
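For context on the half-precision result, below is a hedged sketch of the kind of mixed-precision setup that exploits FP16 Tensor Cores (my illustration using the TensorFlow 2.x Keras mixed-precision API with a toy model; it is not the paper's DeepLabv3+ code): compute runs in float16 while variables and the final layer stay in float32, and loss scaling protects small gradients from underflow.

    # Mixed-precision training sketch for FP16 Tensor Cores (TensorFlow 2.x).
    import tensorflow as tf
    from tensorflow.keras import mixed_precision

    # Compute in float16 on Tensor Cores; variables stay float32 for stability.
    mixed_precision.set_global_policy("mixed_float16")

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, activation="relu",
                               input_shape=(128, 128, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        # Keep the final softmax in float32 to avoid numeric issues.
        tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
    ])

    # Loss scaling prevents small FP16 gradients from underflowing to zero.
    opt = mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD(0.01))
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")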
Optimizing Collective Communication for Scalable Scientific Computing and Deep Learning
In the realm of distributed computing, collective operations involve coordinated communication and synchronization among multiple processing units, enabling efficient data exchange and collaboration. Scientific applications, such as simulations, computational fluid dynamics, and scalable deep learning, require complex computations that can be parallelized across multiple nodes in a distributed system. These applications often involve data-dependent communication patterns, where collective operations are critical for achieving high performance in data exchange. Optimizing collective operations for scientific applications and deep learning involves improving the algorithms, communication patterns, and data distribution strategies to minimize communication overhead and maximize computational efficiency.
Within the context of this dissertation, the specific focus is on optimizing the alltoall operation in 3D Fast Fourier Transform (FFT) applications and the allreduce operation in parallel deep learning, particularly on High-Performance Computing (HPC) systems. Advanced communication algorithms and methods are explored and implemented to improve communication efficiency, consequently enhancing the overall performance of 3D FFT applications.
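As an illustration of where the alltoall arises, here is a toy slab-decomposed 3D FFT sketch (my own example with mpi4py and NumPy, not the dissertation's implementation; grid sizes are placeholders): each rank transforms its local axes, then a single MPI_Alltoall repartitions the grid so the remaining axis becomes local.

    # Toy distributed 3D FFT: local FFTs, one alltoall transpose, final FFT.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    p = comm.size

    n = 8 * p                                # global grid size (toy value)
    local = np.random.rand(n // p, n, n)     # this rank's slab of the grid

    # FFT along the two axes that are fully local to this rank.
    stage1 = np.fft.fftn(local, axes=(1, 2))

    # Repartition: split axis 1 into p blocks and exchange them, so that
    # axis 0 becomes globally complete on every rank.
    send = np.ascontiguousarray(
        stage1.reshape(n // p, p, n // p, n).transpose(1, 0, 2, 3))
    recv = np.empty_like(send)
    comm.Alltoall(send, recv)

    # Final FFT along the axis that is now local.
    result = np.fft.fft(recv.reshape(n, n // p, n), axis=0)

At scale, this exchange dominates the runtime of distributed FFTs, which is why the communication algorithm behind the alltoall is worth optimizing.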
Furthermore, this dissertation investigates the identification of performance bottlenecks during collective communication in Horovod on distributed systems. These bottlenecks are addressed by proposing an optimized parallel communication pattern specifically tailored to alleviate them during the training phase of distributed deep learning. The objective is to achieve faster convergence and improve overall training efficiency.
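To show where this communication sits in practice, here is a hedged Horovod/TensorFlow sketch (the model, learning-rate scaling, and dataset are my placeholder choices, not the dissertation's setup): the allreduce being profiled happens inside DistributedGradientTape, which averages gradients across workers after every backward pass.

    # Horovod data-parallel training step; the allreduce is inside the tape.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()  # one process per GPU; rank 0 should broadcast initial weights

    model = tf.keras.applications.ResNet50(weights=None)
    opt = tf.keras.optimizers.SGD(0.01 * hvd.size())  # scale LR with workers
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

    @tf.function
    def train_step(images, labels):
        with tf.GradientTape() as tape:
            loss = loss_fn(labels, model(images, training=True))
        # The allreduce happens here: gradients are averaged across ranks.
        tape = hvd.DistributedGradientTape(tape)
        grads = tape.gradient(loss, model.trainable_variables)
        opt.apply_gradients(zip(grads, model.trainable_variables))
        return loss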
Moreover, this dissertation proposes fault tolerance and elastic scaling features for distributed deep learning by leveraging the User-Level Failure Mitigation (ULFM) extensions of the Message Passing Interface (MPI). By incorporating ULFM MPI, the dissertation aims to enhance the elastic capabilities of distributed deep learning systems. This approach enables graceful and lightweight handling of failures while facilitating seamless scaling in dynamic computing environments.
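A heavily hedged sketch of the failure-handling idea follows (my illustration; it assumes an MPI build with ULFM support, and the mpi4py spelling of the shrink call is an assumption that may differ across versions): collectives run under MPI.ERRORS_RETURN so that a peer failure raises a catchable exception instead of aborting the whole job.

    # Elastic-training sketch: survive a rank failure during allreduce.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    comm.Set_errhandler(MPI.ERRORS_RETURN)   # report failures, don't abort

    grad = np.ones(1024, dtype=np.float32)
    out = np.empty_like(grad)
    try:
        comm.Allreduce(grad, out, op=MPI.SUM)
    except MPI.Exception:
        # ULFM recovery (assumed binding for MPIX_Comm_shrink): drop failed
        # ranks and continue training on the surviving communicator, after
        # re-balancing the data shards across the remaining workers.
        comm = comm.Shrink()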
SparCML: High-Performance Sparse Communication for Machine Learning
Applying machine learning techniques to the quickly growing data in science
and industry requires highly scalable algorithms. Large datasets are most
commonly processed in a "data-parallel" fashion, distributed across many nodes. Each node's
contribution to the overall gradient is summed using a global allreduce. This
allreduce is the single communication step, and thus the scalability bottleneck, for most
machine learning workloads. We observe that, frequently, many gradient values
are (close to) zero, leading to sparse or sparsifiable communication. To
exploit this insight, we analyze, design, and implement a set of
communication-efficient protocols for sparse input data, in conjunction with
efficient machine learning algorithms which can leverage these primitives. Our
communication protocols generalize standard collective operations, by allowing
processes to contribute arbitrary sparse input data vectors. Our generic
communication library, SparCML, extends MPI to support additional features,
such as non-blocking (asynchronous) operations and low-precision data
representations. As such, SparCML and its techniques will form the basis of
future highly scalable machine learning frameworks.
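To give a flavor of the approach, here is a minimal sketch of a sparse gradient exchange (my illustration, not SparCML's actual protocol; the top-k selection and the allgather-then-sum strategy are simplifications): each rank contributes only its k largest-magnitude entries as (index, value) pairs, and every rank densifies and sums the gathered contributions locally, trading bandwidth for a little extra compute.

    # Sparse allreduce sketch: exchange top-k (index, value) pairs per rank.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    k = 16                                   # entries kept per rank (toy value)

    grad = np.random.randn(1 << 16).astype(np.float32)
    idx = np.argpartition(np.abs(grad), -k)[-k:].astype(np.int64)
    val = grad[idx]

    # Gather everyone's sparse contributions (equal sizes, so Allgather works).
    all_idx = np.empty(k * comm.size, dtype=np.int64)
    all_val = np.empty(k * comm.size, dtype=np.float32)
    comm.Allgather(idx, all_idx)
    comm.Allgather(val, all_val)

    # Densify and sum; np.add.at accumulates duplicate indices correctly.
    result = np.zeros_like(grad)
    np.add.at(result, all_idx, all_val)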
The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism
We present scalable hybrid-parallel algorithms for training large-scale 3D
convolutional neural networks. Deep learning-based emerging scientific
workflows often require model training with large, high-dimensional samples,
which can make training much more costly and even infeasible due to excessive
memory usage. We solve these challenges by extensively applying hybrid
parallelism throughout the end-to-end training pipeline, including both
computations and I/O. Our hybrid-parallel algorithm extends the standard data
parallelism with spatial parallelism, which partitions a single sample in the
spatial domain, realizing strong scaling beyond the mini-batch dimension with a
larger aggregated memory capacity. We evaluate our proposed training algorithms
with two challenging 3D CNNs, CosmoFlow and 3D U-Net. Our comprehensive
performance studies show that good weak and strong scaling can be achieved for
both networks using up to 2K GPUs. More importantly, we enable training of
CosmoFlow with much larger samples than previously possible, realizing an
order-of-magnitude improvement in prediction accuracy.
Comment: 12 pages, 10 figures.
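To make the spatial-parallelism idea concrete, below is a toy 1D halo-exchange sketch (my illustration with mpi4py; the paper partitions 3D samples, while this shows only the core exchange in one dimension): each rank owns a slice of a single sample and swaps one-cell halos with its neighbors, so border outputs of a small convolution see the neighbor data they need.

    # Halo exchange for a spatially partitioned sample (1D toy version).
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    halo = 1                                 # radius of a 3-point stencil

    local = np.random.rand(64).astype(np.float32)   # this rank's slice
    left, right = rank - 1, rank + 1

    # Exchange one-cell halos with neighbors (non-periodic boundaries).
    recv_l = np.zeros(halo, np.float32)
    recv_r = np.zeros(halo, np.float32)
    reqs = []
    if left >= 0:
        reqs += [comm.Isend(local[:halo], dest=left),
                 comm.Irecv(recv_l, source=left)]
    if right < size:
        reqs += [comm.Isend(local[-halo:], dest=right),
                 comm.Irecv(recv_r, source=right)]
    MPI.Request.Waitall(reqs)

    # Convolve the padded slice; border outputs now use neighbor data.
    padded = np.concatenate([recv_l, local, recv_r])
    kernel = np.array([0.25, 0.5, 0.25], np.float32)
    out = np.convolve(padded, kernel, mode="valid")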
Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks
For many distributed applications, data communication poses an important bottleneck
from the points of view of performance and energy consumption. As more cores
are integrated per node, in general the global performance of the system increases
yet eventually becomes limited by the interconnection network. This is the case for
distributed data-parallel training of convolutional neural networks (CNNs), which
usually proceeds on a cluster with a small to moderate number of nodes. In this paper,
we analyze the performance of the Allreduce collective communication primitive, a
key to the efficient data-parallel distributed training of CNNs. Our study targets the
distinct realizations of this primitive in three high-performance instances of the Message
Passing Interface (MPI), namely MPICH, Open MPI, and Intel MPI, and employs a
cluster equipped with state-of-the-art processor and network technologies. In addition,
we apply the insights gained from the experimental analysis to the optimization of the
TensorFlow framework when running on top of Horovod. Our study reveals that a
careful selection of the most convenient MPI library and Allreduce (ARD) realization
accelerates the training throughput by a factor of 1.2× compared with the default
algorithm in the same MPI library, and up to 2.8× when comparing distinct MPI
libraries in a number of relevant combinations of CNN model and dataset.
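In the spirit of this study, a small benchmarking sketch follows (my illustration; message sizes and repetition counts are arbitrary choices): the same mpi4py script, launched under MPICH, Open MPI, or Intel MPI, times Allreduce across a range of message sizes and thereby exposes the differences among the libraries' realizations of the primitive.

    # Time MPI_Allreduce across message sizes under the current MPI library.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    for exp in range(10, 25):                # 1 KiB .. 16 MiB messages
        n = (1 << exp) // 4                  # float32 element count
        buf = np.ones(n, np.float32)
        out = np.empty_like(buf)
        comm.Barrier()
        t0 = MPI.Wtime()
        for _ in range(10):
            comm.Allreduce(buf, out, op=MPI.SUM)
        dt = (MPI.Wtime() - t0) / 10
        if comm.rank == 0:
            print(f"{4 * n:>10d} B  {dt * 1e6:10.1f} us")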