Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. The
most commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (ranked 6th on the Top500 list). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review
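The gradient aggregation that dominates No-gRPC performance is an Allreduce over all workers' gradients. A minimal pure-Python sketch of the ring-allreduce pattern, for illustration only: Horovod runs this over MPI or NCCL on GPU buffers, and the function name here is hypothetical.

```python
def ring_allreduce(grads):
    """Sum equal-length gradient vectors across all workers via the
    ring algorithm: a reduce-scatter phase followed by an allgather.

    grads: one gradient list per worker. Returns the reduced gradient
    every worker ends up holding. Assumes len(grads[0]) is divisible
    by the number of workers.
    """
    n = len(grads)                      # number of workers
    chunk = len(grads[0]) // n          # elements per ring chunk
    bufs = [list(g) for g in grads]     # each worker's local buffer

    # Phase 1: reduce-scatter. In step s, worker r sends chunk
    # (r - s) mod n to its ring neighbor, which adds it in. After
    # n-1 steps, worker r holds the fully reduced chunk (r+1) mod n.
    for step in range(n - 1):
        # Snapshot outgoing chunks: all sends happen "simultaneously".
        out = []
        for r in range(n):
            c = (r - step) % n
            lo = c * chunk
            out.append((c, bufs[r][lo:lo + chunk]))
        for r in range(n):
            c, data = out[r]
            lo = c * chunk
            dst = (r + 1) % n
            for i, v in enumerate(data):
                bufs[dst][lo + i] += v

    # Phase 2: allgather. The reduced chunks circulate around the
    # ring, overwriting stale data, until every worker has all of them.
    for step in range(n - 1):
        out = []
        for r in range(n):
            c = (r + 1 - step) % n
            lo = c * chunk
            out.append((c, bufs[r][lo:lo + chunk]))
        for r in range(n):
            c, data = out[r]
            lo = c * chunk
            bufs[(r + 1) % n][lo:lo + chunk] = data

    return bufs[0]
```

Each worker transfers about 2(n-1)/n of the gradient in total, independent of n, which is why the efficiency of the Allreduce implementation, rather than gRPC overhead, governs scaling in the No-gRPC designs.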
In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes
The Data Science domain has expanded monumentally in both research and
industry communities during the past decade, predominantly owing to the Big
Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are
bringing more complexities to data engineering applications, which are now
integrated into data processing pipelines to process terabytes of data.
Typically, a significant amount of time is spent on data preprocessing in these
pipelines, and hence improving its efficiency directly impacts the overall
pipeline performance. The community has recently embraced the concept of
Dataframes as the de-facto data structure for data representation and
manipulation. However, the most widely used serial Dataframes today (R, pandas)
experience performance limitations while working on even moderately large data
sets. We believe that there is plenty of room for improvement by taking a look
at this problem from a high-performance computing point of view. In a prior
publication, we presented a set of parallel processing patterns for distributed
dataframe operators and the reference runtime implementation, Cylon [1]. In
this paper, we expand on the initial concept by introducing a cost model
for evaluating the said patterns. Furthermore, we evaluate the performance of
Cylon on the ORNL Summit supercomputer …
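Distributed dataframe operators such as join and groupby share a common pattern: hash-shuffle rows so that equal keys land on the same worker, then apply the serial operator locally. A minimal pure-Python sketch of that pattern for a groupby-sum follows; all names are hypothetical, and a real runtime like Cylon performs the shuffle over MPI with columnar buffers rather than Python dicts.

```python
from collections import defaultdict

def hash_partition(rows, key, n_workers):
    """Shuffle rows to workers by hashing the key column (the
    'shuffle' step shared by distributed join/groupby patterns)."""
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[hash(row[key]) % n_workers].append(row)
    return parts

def distributed_groupby_sum(rows, key, value, n_workers=4):
    """Groupby-sum via hash shuffle: after partitioning, all rows of
    a given group are co-located, so each worker aggregates its
    partition independently and the partial results never overlap."""
    parts = hash_partition(rows, key, n_workers)
    result = {}
    for part in parts:            # each iteration = one worker's local work
        local = defaultdict(int)
        for row in part:
            local[row[key]] += row[value]
        result.update(local)      # groups are disjoint across workers
    return result
```

The cost of such a pattern is dominated by the shuffle volume, which is what a cost model for these operators would capture.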