Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding of the capabilities TensorFlow offers for the distributed training of large ML/DL models that require computation and communication at scale. The most commonly used distributed training approaches for TensorFlow (TF) can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X, where X = (InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (ranked 6th on the Top500 list). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review
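The core claim, that a "truly CUDA-Aware" MPI lets the application hand GPU device pointers directly to communication calls, is worth making concrete. Below is a minimal sketch of the application-side pattern for gradient aggregation over a CUDA-aware MPI_Allreduce; it illustrates the interface, not the paper's optimized design, and the buffer name and size are made up for the example.

```cpp
// Minimal sketch: gradient aggregation with a CUDA-aware MPI_Allreduce.
// Assumes an MPI library built with CUDA support (e.g. a CUDA-aware
// MVAPICH2 or Open MPI); buffer name and size are illustrative.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const size_t n = 1 << 20;               // 1M float gradients (illustrative)
    float *d_grad = nullptr;
    cudaMalloc(&d_grad, n * sizeof(float));
    // ... backward pass fills d_grad on each rank ...

    // CUDA-aware MPI: the device pointer is passed directly to MPI,
    // with no explicit cudaMemcpy staging through host memory.
    MPI_Allreduce(MPI_IN_PLACE, d_grad, (int)n, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    // The 1/world_size averaging factor is typically folded into the
    // optimizer step rather than applied here.
    cudaFree(d_grad);
    MPI_Finalize();
    return 0;
}
```

Whether such a call is fast for large GPU buffers depends entirely on how the MPI library implements the reduction internally, which is precisely where the paper's CUDA-kernel and pointer-caching optimizations apply.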
Design and Implementation of MPICH2 over InfiniBand with RDMA Support
For several years, MPI has been the de facto standard for writing parallel
applications. One of the most popular MPI implementations is MPICH. Its
successor, MPICH2, features a completely new design that provides better performance and greater flexibility. To ensure portability, it has a hierarchical structure that allows porting to be done at different levels. In this
paper, we present our experiences designing and implementing MPICH2 over
InfiniBand. Because of its high performance and open standard, InfiniBand is
gaining popularity in the area of high-performance computing. Our study focuses
on optimizing the performance of MPI-1 functions in MPICH2. One of our
objectives is to exploit Remote Direct Memory Access (RDMA) in InfiniBand to
achieve high performance. We have based our design on the RDMA Channel
interface provided by MPICH2, which encapsulates architecture-dependent
communication functionalities into a very small set of functions. Starting with
a basic design, we apply different optimizations and also propose a
zero-copy-based design. We characterize the impact of our optimizations and
designs using microbenchmarks. We have also performed an application-level
evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2
implementation achieves 7.6 μs latency and 857 MB/s bandwidth, which are
close to the raw performance of the underlying InfiniBand layer. Our study
shows that the RDMA Channel interface in MPICH2 provides a simple, yet
powerful, abstraction that enables implementations with high performance by
exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is
the first high-performance design and implementation of MPICH2 on InfiniBand
using RDMA support.
Comment: 12 pages, 17 figures
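The latency and bandwidth figures above come from standard point-to-point microbenchmarks. For readers unfamiliar with how such numbers are measured, a minimal ping-pong latency test looks roughly like the following (iteration count and message size are illustrative; run with two ranks):

```cpp
// Minimal ping-pong latency microbenchmark between ranks 0 and 1, in
// the spirit of the microbenchmarks used in the paper. Launch with
// two MPI ranks, e.g. "mpirun -np 2 ./pingpong".
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const int bytes = 4;                    // small message for latency
    std::vector<char> buf(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}
```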
Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference
Autoregressive models, despite their commendable performance in a myriad of
generative tasks, face challenges stemming from their inherently sequential
structure. Inference on these models is, by design, a temporal process: the current token's probability distribution is conditioned on all preceding tokens. This inherent characteristic severely impedes computational efficiency during inference, as a typical inference request can require thousands of tokens, and generating each token requires loading the entire model weights, making inference memory-bound. The overhead becomes pronounced in real deployments, where requests arrive at random times and require varying generation lengths. Existing solutions, such as dynamic
batching and concurrent instances, introduce significant response delays and
bandwidth contention, falling short of achieving optimal latency and
throughput. To address these shortcomings, we propose Flover -- a temporal
fusion framework for efficiently inferring multiple requests in parallel. We
deconstruct the general generation pipeline into pre-processing and token
generation, and equip the framework with a dedicated work scheduler for fusing
the generation process temporally across all requests. By orchestrating the
token-level parallelism, Flover exhibits optimal hardware efficiency and
significantly reduces system resource consumption. By further employing a fast buffer
reordering algorithm that allows memory eviction of finished tasks, it brings
over 11x inference speedup on GPT and 16x on LLAMA compared to the cutting-edge
solutions provided by NVIDIA FasterTransformer. Crucially, by leveraging the
advanced tensor parallel technique, Flover proves efficacious across diverse
computational landscapes, from single-GPU setups to distributed scenarios,
thereby offering robust performance optimization that adapts to variable use
cases.
Comment: In Proceedings of the 30th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC)
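To make the temporal-fusion idea concrete, here is a schematic sketch of the control flow described above: each iteration runs one fused token-generation step over every in-flight request, and finished requests are evicted by compacting the live set, standing in for the paper's fast buffer-reordering algorithm. The model step is stubbed out and all names and lengths are hypothetical; this is not Flover's actual code.

```cpp
// Schematic sketch of temporal fusion across inference requests with
// differing generation lengths. The fused model step is a stub; in a
// real system it would be one batched forward pass over all requests.
#include <cstdio>
#include <utility>
#include <vector>

struct Request {
    int id;
    int generated;   // tokens produced so far
    int target;      // requested generation length
};

// Stub for one fused forward pass that emits one token per request.
void fused_model_step(std::vector<Request> &active) {
    for (auto &r : active) ++r.generated;
}

int main() {
    // Three requests fused into a single generation loop.
    std::vector<Request> active = {{0, 0, 3}, {1, 0, 5}, {2, 0, 4}};

    while (!active.empty()) {
        fused_model_step(active);

        // Evict finished requests by swapping them to the tail and
        // shrinking, so live buffers stay contiguous for the next pass
        // (a simplified stand-in for the fast buffer reordering).
        for (size_t i = 0; i < active.size();) {
            if (active[i].generated >= active[i].target) {
                printf("request %d finished after %d tokens\n",
                       active[i].id, active[i].generated);
                std::swap(active[i], active.back());
                active.pop_back();
            } else {
                ++i;
            }
        }
        // Newly arrived requests could be appended to `active` here,
        // fusing them into the loop after their pre-processing completes.
    }
    return 0;
}
```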
Structural and Functional Analysis of a β2-Adrenergic Receptor Complex with GRK5.
The phosphorylation of agonist-occupied G-protein-coupled receptors (GPCRs) by GPCR kinases (GRKs) functions to turn off G-protein signaling and turn on arrestin-mediated signaling. While a structural understanding of GPCR/G-protein and GPCR/arrestin complexes has emerged in recent years, the molecular architecture of a GPCR/GRK complex remains poorly defined. We used a comprehensive integrated approach of cross-linking, hydrogen-deuterium exchange mass spectrometry (MS), electron microscopy, mutagenesis, molecular dynamics simulations, and computational docking to analyze GRK5 interaction with the β2-adrenergic receptor (β2AR). These studies revealed a dynamic mechanism of complex formation that involves large conformational changes in the GRK5 RH/catalytic domain interface upon receptor binding. These changes facilitate contacts between intracellular loops 2 and 3 and the C terminus of the β2AR with the GRK5 RH bundle subdomain, membrane-binding surface, and kinase catalytic cleft, respectively. These studies significantly contribute to our understanding of the mechanism by which GRKs regulate the function of activated GPCRs.