Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding of the capabilities TensorFlow offers for the distributed training of large ML/DL models that require computation and communication at scale. The most commonly used distributed training approaches for TensorFlow (TF) can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X, where X = (InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (ranked 6th on the Top500 list). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review
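The core claim, that a "truly CUDA-Aware" MPI lets the application hand GPU device pointers directly to communication calls, is worth making concrete. Below is a minimal sketch of the application-side pattern for gradient aggregation over a CUDA-aware MPI_Allreduce; it illustrates the interface, not the paper's optimized design, and the buffer name and size are made up for the example.

```cpp
// Minimal sketch: gradient aggregation with a CUDA-aware MPI_Allreduce.
// Assumes an MPI library built with CUDA support (e.g. a CUDA-aware
// MVAPICH2 or Open MPI); buffer name and size are illustrative.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const size_t n = 1 << 20;               // 1M float gradients (illustrative)
    float *d_grad = nullptr;
    cudaMalloc(&d_grad, n * sizeof(float));
    // ... backward pass fills d_grad on each rank ...

    // CUDA-aware MPI: the device pointer is passed directly to MPI,
    // with no explicit cudaMemcpy staging through host memory.
    MPI_Allreduce(MPI_IN_PLACE, d_grad, (int)n, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    // The 1/world_size averaging factor is typically folded into the
    // optimizer step rather than applied here.
    cudaFree(d_grad);
    MPI_Finalize();
    return 0;
}
```

Whether such a call is fast for large GPU buffers depends entirely on how the MPI library implements the reduction internally, which is precisely where the paper's CUDA-kernel and pointer-caching optimizations apply.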
Design and Implementation of MPICH2 over InfiniBand with RDMA Support
For several years, MPI has been the de facto standard for writing parallel
applications. One of the most popular MPI implementations is MPICH. Its
successor, MPICH2, features a completely new design that provides better performance and greater flexibility. To ensure portability, it has a hierarchical structure that allows porting to be done at different levels. In this
paper, we present our experiences designing and implementing MPICH2 over
InfiniBand. Because of its high performance and open standard, InfiniBand is
gaining popularity in the area of high-performance computing. Our study focuses
on optimizing the performance of MPI-1 functions in MPICH2. One of our
objectives is to exploit Remote Direct Memory Access (RDMA) in InfiniBand to
achieve high performance. We have based our design on the RDMA Channel
interface provided by MPICH2, which encapsulates architecture-dependent
communication functionalities into a very small set of functions. Starting with
a basic design, we apply different optimizations and also propose a
zero-copy-based design. We characterize the impact of our optimizations and
designs using microbenchmarks. We have also performed an application-level
evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2
implementation achieves 7.6 μs latency and 857 MB/s bandwidth, which are
close to the raw performance of the underlying InfiniBand layer. Our study
shows that the RDMA Channel interface in MPICH2 provides a simple, yet
powerful, abstraction that enables implementations with high performance by
exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is
the first high-performance design and implementation of MPICH2 on InfiniBand
using RDMA support.
Comment: 12 pages, 17 figures
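The latency and bandwidth figures above come from standard point-to-point microbenchmarks. For readers unfamiliar with how such numbers are measured, a minimal ping-pong latency test looks roughly like the following (iteration count and message size are illustrative; run with two ranks):

```cpp
// Minimal ping-pong latency microbenchmark between ranks 0 and 1, in
// the spirit of the microbenchmarks used in the paper. Launch with
// two MPI ranks, e.g. "mpirun -np 2 ./pingpong".
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const int bytes = 4;                    // small message for latency
    std::vector<char> buf(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}
```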
Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference
Autoregressive models, despite their commendable performance in a myriad of
generative tasks, face challenges stemming from their inherently sequential
structure. Inference on these models is, by design, a temporal process: the current token's probability distribution is conditioned on all preceding tokens. This inherent characteristic severely impedes computational efficiency during inference, as a typical inference request can require thousands of tokens, and generating each token requires loading the entire model weights, making inference memory-bound. The overhead becomes pronounced in real deployments, where requests arrive at random times and require varying generation lengths. Existing solutions, such as dynamic
batching and concurrent instances, introduce significant response delays and
bandwidth contention, falling short of achieving optimal latency and
throughput. To address these shortcomings, we propose Flover -- a temporal
fusion framework for efficiently inferring multiple requests in parallel. We
deconstruct the general generation pipeline into pre-processing and token
generation, and equip the framework with a dedicated work scheduler for fusing
the generation process temporally across all requests. By orchestrating the
token-level parallelism, Flover exhibits optimal hardware efficiency and
significantly reduces system resource consumption. By further employing a fast buffer
reordering algorithm that allows memory eviction of finished tasks, it brings
over 11x inference speedup on GPT and 16x on LLAMA compared to the cutting-edge
solutions provided by NVIDIA FasterTransformer. Crucially, by leveraging the
advanced tensor parallel technique, Flover proves efficacious across diverse
computational landscapes, from single-GPU setups to distributed scenarios,
thereby offering robust performance optimization that adapts to variable use
cases.
Comment: In Proceedings of the 30th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC)
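To make the temporal-fusion idea concrete, here is a schematic sketch of the control flow described above: each iteration runs one fused token-generation step over every in-flight request, and finished requests are evicted by compacting the live set, standing in for the paper's fast buffer-reordering algorithm. The model step is stubbed out and all names and lengths are hypothetical; this is not Flover's actual code.

```cpp
// Schematic sketch of temporal fusion across inference requests with
// differing generation lengths. The fused model step is a stub; in a
// real system it would be one batched forward pass over all requests.
#include <cstdio>
#include <utility>
#include <vector>

struct Request {
    int id;
    int generated;   // tokens produced so far
    int target;      // requested generation length
};

// Stub for one fused forward pass that emits one token per request.
void fused_model_step(std::vector<Request> &active) {
    for (auto &r : active) ++r.generated;
}

int main() {
    // Three requests fused into a single generation loop.
    std::vector<Request> active = {{0, 0, 3}, {1, 0, 5}, {2, 0, 4}};

    while (!active.empty()) {
        fused_model_step(active);

        // Evict finished requests by swapping them to the tail and
        // shrinking, so live buffers stay contiguous for the next pass
        // (a simplified stand-in for the fast buffer reordering).
        for (size_t i = 0; i < active.size();) {
            if (active[i].generated >= active[i].target) {
                printf("request %d finished after %d tokens\n",
                       active[i].id, active[i].generated);
                std::swap(active[i], active.back());
                active.pop_back();
            } else {
                ++i;
            }
        }
        // Newly arrived requests could be appended to `active` here,
        // fusing them into the loop after their pre-processing completes.
    }
    return 0;
}
```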
Structural and Functional Analysis of a β2-Adrenergic Receptor Complex with GRK5.
The phosphorylation of agonist-occupied G-protein-coupled receptors (GPCRs) by GPCR kinases (GRKs) functions to turn off G-protein signaling and turn on arrestin-mediated signaling. While a structural understanding of GPCR/G-protein and GPCR/arrestin complexes has emerged in recent years, the molecular architecture of a GPCR/GRK complex remains poorly defined. We used a comprehensive integrated approach of cross-linking, hydrogen-deuterium exchange mass spectrometry (MS), electron microscopy, mutagenesis, molecular dynamics simulations, and computational docking to analyze GRK5 interaction with the β2-adrenergic receptor (β2AR). These studies revealed a dynamic mechanism of complex formation that involves large conformational changes in the GRK5 RH/catalytic domain interface upon receptor binding. These changes facilitate contacts between intracellular loops 2 and 3 and the C terminus of the β2AR with the GRK5 RH bundle subdomain, membrane-binding surface, and kinase catalytic cleft, respectively. These studies significantly contribute to our understanding of the mechanism by which GRKs regulate the function of activated GPCRs.