
    Performance Analysis of Message Passing Interface Collective Communication on Intel Xeon Quad-Core Gigabit Ethernet and InfiniBand Clusters

    The performance of MPI collective operations still presents critical issues for high-performance computing systems, particularly on more advanced processor technology. This study therefore benchmarks an MPI implementation on multi-core architecture, measuring the performance of Open MPI collective communication on Intel Xeon dual quad-core Gigabit Ethernet and InfiniBand clusters using SKaMPI. It focuses on well-known collective communication routines: MPI_Bcast, MPI_Alltoall, MPI_Scatter and MPI_Gather. Across the collected results, MPI collective communication on the InfiniBand cluster showed distinctly better performance in both latency and throughput. The analysis indicates that the algorithms used for collective communication performed well at all message sizes, except for inter-node MPI_Bcast and MPI_Alltoall. InfiniBand delivers the lowest latency for all operations because it exposes an easy-to-use messaging service directly to applications, whereas Gigabit Ethernet still requires the operating system to mediate access to the server's communication resources, adding a complex exchange between application and network on every transfer.
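    The measurement pattern this kind of study relies on can be illustrated with a short sketch. Below is a minimal mpi4py timing loop for MPI_Bcast; it is not SKaMPI itself, and the message sizes and repetition count are illustrative assumptions, but it shows the barrier-bracketed, repeated-collective timing the abstract describes:

```python
# Minimal MPI_Bcast latency sketch using mpi4py (illustrative; the paper used SKaMPI).
# Run with e.g.: mpirun -np 8 python bcast_bench.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
REPS = 100  # assumed repetition count

for size in (1 << 10, 1 << 16, 1 << 20):   # 1 KiB, 64 KiB, 1 MiB payloads (assumed)
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()                          # synchronize ranks before timing
    t0 = MPI.Wtime()
    for _ in range(REPS):
        comm.Bcast(buf, root=0)             # the collective under test
    comm.Barrier()                          # wait until every rank has finished
    elapsed = (MPI.Wtime() - t0) / REPS
    if rank == 0:
        print(f"MPI_Bcast {size} B: {elapsed * 1e6:.1f} us/op")
```

    The same loop works for the other collectives studied (MPI_Scatter, MPI_Gather, MPI_Alltoall) by swapping the call under test.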

    Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

    TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X, where X = InfiniBand Verbs, Message Passing Interface, or GPUDirect RDMA, and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (No. 6 on the Top500 list). We perform experiments to gain novel insights along the following vectors: 1) application-level scalability of DNN training, 2) effect of batch size on scaling efficiency, 3) impact of the MPI library used for No-gRPC approaches, and 4) type and size of DNN architectures. Based on these experiments, we present two key insights: 1) overall, No-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) the performance of No-gRPC is heavily influenced by the gradient aggregation done with Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.
    Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review
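    The No-gRPC, Allreduce-based data parallelism evaluated here follows the standard Horovod pattern: each rank computes gradients locally, and the distributed optimizer averages them with Allreduce (over MPI or NCCL, depending on the build). A minimal sketch using Horovod's TensorFlow Keras API, where the model, learning rate, and dataset are placeholders rather than the paper's actual setup:

```python
# Minimal Horovod data-parallel training sketch (illustrative, not the paper's code).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each rank to a single GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None)  # placeholder model

# Scale the learning rate by the number of workers (common heuristic), then
# wrap the optimizer so gradients are averaged with Allreduce each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

# Broadcast rank 0's initial weights so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# dataset would be a per-rank-sharded tf.data pipeline, e.g.:
# model.fit(dataset, epochs=1, callbacks=callbacks)
```

    The gradient aggregation step in hvd.DistributedOptimizer is exactly where the Allreduce implementation (MPI library vs. NCCL2) determines the scaling behavior the paper measures.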