Design and Implementation of MPICH2 over InfiniBand with RDMA Support
For several years, MPI has been the de facto standard for writing parallel
applications. One of the most popular MPI implementations is MPICH. Its
successor, MPICH2, features a completely new design that provides more
performance and flexibility. To ensure portability, it has a hierarchical
structure that allows porting to be done at different levels. In this
paper, we present our experiences designing and implementing MPICH2 over
InfiniBand. Because of its high performance and open standard, InfiniBand is
gaining popularity in the area of high-performance computing. Our study focuses
on optimizing the performance of MPI-1 functions in MPICH2. One of our
objectives is to exploit Remote Direct Memory Access (RDMA) in InfiniBand to
achieve high performance. We have based our design on the RDMA Channel
interface provided by MPICH2, which encapsulates architecture-dependent
communication functionalities into a very small set of functions. Starting with
a basic design, we apply different optimizations and also propose a
zero-copy-based design. We characterize the impact of our optimizations and
designs using microbenchmarks. We have also performed an application-level
evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2
implementation achieves 7.6 μs latency and 857 MB/s bandwidth, which are
close to the raw performance of the underlying InfiniBand layer. Our study
shows that the RDMA Channel interface in MPICH2 provides a simple, yet
powerful, abstraction that enables implementations with high performance by
exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is
the first high-performance design and implementation of MPICH2 on InfiniBand
using RDMA support.
Comment: 12 pages, 17 figures
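As a concrete illustration of the RDMA operations this design exploits, the following sketch posts a one-sided RDMA write through the InfiniBand verbs API. It shows the semantics the RDMA Channel interface builds on, not the MPICH2 code itself; the connected queue pair, the registered memory region, and the out-of-band exchange of the peer's address and rkey are assumed setup steps.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_buf;   /* source buffer, must be registered */
    sge.length = len;
    sge.lkey   = mr->lkey;               /* local key from ibv_reg_mr() */

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE; /* one-sided: no CPU at the target */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED; /* ask for a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;       /* target virtual address */
    wr.wr.rdma.rkey        = rkey;              /* target's memory-region key */

    return ibv_post_send(qp, &wr, &bad_wr);     /* 0 on success */
}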
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC
arena. Traditionally, MPI runtimes have been primarily designed for clusters
with a large number of nodes. However, with the advent of MPI+CUDA applications
and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important
to address efficient communication schemes for such dense Multi-GPU nodes.
This, coupled with the new application workloads brought forward by Deep
Learning frameworks like Caffe and Microsoft CNTK, poses additional design
constraints due to the very large GPU-buffer messages communicated during the
training phase.
In this context, special-purpose libraries like NVIDIA NCCL have been proposed
for GPU-based collective communication on dense GPU systems. In this paper, we
propose a pipelined chain (ring) design for the MPI_Bcast collective operation
along with an enhanced collective tuning framework in MVAPICH2-GDR that enables
efficient intra-/inter-node multi-GPU communication. We present an in-depth
performance landscape for the proposed MPI_Bcast schemes along with a
comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The
proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement,
compared to NCCL-based solutions, for intra- and inter-node broadcast latency,
respectively. In addition, the proposed designs provide up to 7% improvement
over NCCL-based solutions for data parallel training of the VGG network on 128
GPUs using Microsoft CNTK.
Comment: 8 pages, 3 figures
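A minimal sketch of the pipelined chain idea, written with plain MPI point-to-point calls rather than the actual MVAPICH2-GDR internals: the root streams the buffer down a linear chain in chunks, so each rank overlaps receiving chunk k+1 with forwarding chunk k. The chunk size and the simple rank-order chain are illustrative assumptions.

#include <mpi.h>

/* Streams `count` bytes from rank 0 down the chain 0 -> 1 -> ... -> size-1
 * in `chunk`-byte pieces; receiving piece k+1 overlaps forwarding piece k. */
void chain_bcast(void *buf, int count, int chunk, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    char *p = (char *)buf;
    MPI_Request fwd = MPI_REQUEST_NULL;

    for (int off = 0; off < count; off += chunk) {
        int n = (count - off < chunk) ? count - off : chunk;
        if (rank > 0)                  /* receive this piece from the predecessor */
            MPI_Recv(p + off, n, MPI_BYTE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
        if (rank < size - 1) {         /* forward it while the next piece arrives */
            MPI_Wait(&fwd, MPI_STATUS_IGNORE);
            MPI_Isend(p + off, n, MPI_BYTE, rank + 1, 0, comm, &fwd);
        }
    }
    MPI_Wait(&fwd, MPI_STATUS_IGNORE); /* complete the last forward */
}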
DD-AMG on QPACE 3
We describe our experience porting the Regensburg implementation of the
DD-AMG solver from QPACE 2 to QPACE 3. We first review how the code was
ported from the first generation Intel Xeon Phi processor (Knights Corner) to
its successor (Knights Landing). We then describe the modifications in the
communication library necessitated by the switch from InfiniBand to Omni-Path.
Finally, we present the performance of the code on a single processor as well
as the scaling on many nodes, where in both cases the speedup factor is close
to the theoretical expectations.
Comment: 12 pages, 6 figures, Proceedings of Lattice 2017
Using GPI-2 for Distributed Memory Parallelization of the Caffe Toolbox to Speed up Deep Neural Network Training
Deep Neural Networks (DNNs) are currently of great interest in research and
application. The training of these networks is a compute-intensive and
time-consuming task. To reduce training times to a bearable amount at
reasonable cost, we extend the popular Caffe toolbox for DNNs with an
efficient distributed memory communication pattern. To achieve good
scalability we emphasize the overlap of computation and communication and
prefer fine-granular synchronization patterns over global barriers. To
implement these communication patterns we rely on the Global address space
Programming Interface version 2 (GPI-2) communication library. This interface
provides a light-weight set of asynchronous one-sided communication primitives
supplemented by non-blocking, fine-granular data synchronization mechanisms.
CaffeGPI is the name of our parallel version of Caffe. First benchmarks
demonstrate better scaling behavior compared with other extensions, e.g.,
Intel Caffe. Even within a single symmetric multiprocessing machine with four
graphics processing units, CaffeGPI scales better than the standard Caffe
toolbox. These first results demonstrate that the use of standard High
Performance Computing (HPC) hardware is a valid cost-saving approach to train
large DNNs. I/O is another bottleneck when working with DNNs in a standard
parallel HPC setting, which we will consider in more detail in a forthcoming
paper.
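The communication pattern described above can be sketched with the standard GASPI/GPI-2 primitives: a one-sided write that carries a notification, and a receiver that waits on exactly that notification instead of a global barrier. The segment ids, notification id, and queue below are illustrative choices, not CaffeGPI's actual values.

#include <GASPI.h>

/* Sender: deposit a gradient slice into the peer's segment and raise a
 * notification the peer can wait on. */
void push_slice(gaspi_rank_t peer, gaspi_offset_t off, gaspi_size_t bytes)
{
    gaspi_write_notify(0 /* local segment */, off,
                       peer,
                       0 /* remote segment */, off,
                       bytes,
                       42 /* notification id */, 1 /* notification value */,
                       0 /* queue */, GASPI_BLOCK);
}

/* Receiver: block only until this particular slice has arrived, then
 * reset the notification so it can be reused in the next iteration. */
void wait_slice(void)
{
    gaspi_notification_id_t id;
    gaspi_notification_t old;
    gaspi_notify_waitsome(0 /* segment */, 42 /* first id */, 1 /* count */,
                          &id, GASPI_BLOCK);
    gaspi_notify_reset(0, id, &old);
}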
On the conditions for efficient interoperability with threads: An experience with PGAS languages using Cray communication domains
Today's high performance systems are typically built from shared memory nodes connected by a high speed network. That architecture, combined with the trend towards less memory per core, encourages programmers to use a mixture of message passing and multithreaded programming. Unfortunately, the advantages of using threads for in-node programming are hindered by their inability to efficiently communicate between nodes. In this work, we identify some of the performance problems that arise in such hybrid programming environments and characterize conditions needed to achieve high communication performance for multiple threads: addressability of targets, separability of communication paths, and full direct reachability to targets. Using the GASNet communication layer on the Cray XC30 as our experimental platform, we show how to satisfy these conditions. We also discuss how satisfying these conditions is influenced by the communication abstraction, implementation constraints, and the interconnect messaging capabilities. To evaluate these ideas, we compare the communication performance of a thread-based node runtime to a process-based runtime. Without our GASNet extensions, thread communication is significantly slower than processes - up to 21x slower. Once the implementation is modified to address each of our conditions, the two runtimes have comparable communication performance. This allows programmers to more easily mix models like OpenMP, CILK, or pthreads with a GASNet-based model like UPC, with the associated performance, convenience and interoperability advantages that come from using threads within a node.
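The "separability of communication paths" condition can be illustrated by analogy in plain MPI (this is not the paper's GASNet extension): with MPI_THREAD_MULTIPLE, duplicating the communicator once per thread gives each thread its own communication context, rather than serializing all threads through one shared path.

#include <mpi.h>
#include <omp.h>

#define NTHREADS 4

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* One communicator per thread: logically separate communication paths. */
    MPI_Comm comms[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);

    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        int rank, token = t;
        MPI_Comm_rank(comms[t], &rank);
        int peer = rank ^ 1;   /* pairwise exchange; assumes even world size */
        MPI_Sendrecv_replace(&token, 1, MPI_INT, peer, t, peer, t,
                             comms[t], MPI_STATUS_IGNORE);
    }

    for (int t = 0; t < NTHREADS; t++)
        MPI_Comm_free(&comms[t]);
    MPI_Finalize();
    return 0;
}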
DART-MPI: An MPI-based Implementation of a PGAS Runtime System
A Partitioned Global Address Space (PGAS) approach treats a distributed
system as if the memory were shared on a global level. Given such a global view
of memory, the user may program applications much as they would for shared
memory systems. This greatly simplifies the task of developing parallel applications,
because no explicit communication has to be specified in the program for data
exchange between different computing nodes. In this paper we present DART, a
runtime environment, which implements the PGAS paradigm on large-scale
high-performance computing clusters. A specific feature of our implementation
is the use of one-sided communication of the Message Passing Interface (MPI)
version 3 (i.e. MPI-3) as the underlying communication substrate. We evaluated
the performance of the implementation with several low-level kernels in order
to determine overheads and limitations in comparison to the underlying MPI-3.
Comment: 11 pages, International Conference on Partitioned Global Address Space Programming Models (PGAS14)
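A hedged sketch of the substrate DART builds on, i.e. standard MPI-3 one-sided operations: a window of globally addressable memory plus a passive-target put that needs no matching receive at the target. DART's real runtime is considerably more elaborate; this shows only the underlying ingredients.

#include <mpi.h>

void pgas_put_example(int target, int value)
{
    MPI_Win win;
    int *base;

    /* Each rank contributes one int of globally addressable memory. */
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = 0;

    /* Passive-target epoch: no explicit receive on the target side. */
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Put(&value, 1, MPI_INT, target, 0 /* displacement */, 1, MPI_INT, win);
    MPI_Win_flush(target, win);   /* wait for remote completion of the put */
    MPI_Win_unlock(target, win);

    MPI_Win_free(&win);
}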
POSH: Paris OpenSHMEM: A High-Performance OpenSHMEM Implementation for Shared Memory Systems
In this paper we present the design and implementation of POSH, an
open-source implementation of the OpenSHMEM standard. We present a model for
its communications and prove some properties of the memory model defined in
the OpenSHMEM specification. We present some performance measurements of the
communication library provided by POSH and compare them with an existing
one-sided communication library. POSH can be downloaded from
\url{http://www.lipn.fr/~coti/POSH}.
Comment: This is an extended version (featuring the full proofs) of a paper accepted at ICCS'14
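For context, the one-sided OpenSHMEM model that POSH implements looks as follows in plain OpenSHMEM 1.x; this is standard API usage, not POSH internals.

#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: the same address is valid on every PE. */
    int *dst = shmem_malloc(sizeof(int));
    *dst = -1;
    shmem_barrier_all();

    /* One-sided put into the next PE's copy of dst. */
    shmem_int_put(dst, &me, 1, (me + 1) % npes);
    shmem_barrier_all();

    printf("PE %d received %d\n", me, *dst);
    shmem_free(dst);
    shmem_finalize();
    return 0;
}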