4,929 research outputs found

    Hierarchical Dynamic Loop Self-Scheduling on Distributed-Memory Systems Using an MPI+MPI Approach

    Full text link
    Computationally-intensive loops are the primary source of parallelism in scientific applications. Such loops are often irregular and a balanced execution of their loop iterations is critical for achieving high performance. However, several factors may lead to an imbalanced load execution, such as problem characteristics, algorithmic, and systemic variations. Dynamic loop self-scheduling (DLS) techniques are devised to mitigate these factors, and consequently, improve application performance. On distributed-memory systems, DLS techniques can be implemented using a hierarchical master-worker execution model and are, therefore, called hierarchical DLS techniques. These techniques self-schedule loop iterations at two levels of hardware parallelism: across and within compute nodes. Hybrid programming approaches that combine the message passing interface (MPI) with open multi-processing (OpenMP) dominate the implementation of hierarchical DLS techniques. The MPI-3 standard includes the feature of sharing memory regions among MPI processes. This feature introduced the MPI+MPI approach that simplifies the implementation of parallel scientific applications. The present work designs and implements hierarchical DLS techniques by exploiting the MPI+MPI approach. Four well-known DLS techniques are considered in the evaluation proposed herein. The results indicate certain performance advantages of the proposed approach compared to the hybrid MPI+OpenMP approach

    DART-MPI: An MPI-based Implementation of a PGAS Runtime System

    Full text link
    A Partitioned Global Address Space (PGAS) approach treats a distributed system as if the memory were shared on a global level. Given such a global view on memory, the user may program applications very much like shared memory systems. This greatly simplifies the tasks of developing parallel applications, because no explicit communication has to be specified in the program for data exchange between different computing nodes. In this paper we present DART, a runtime environment, which implements the PGAS paradigm on large-scale high-performance computing clusters. A specific feature of our implementation is the use of one-sided communication of the Message Passing Interface (MPI) version 3 (i.e. MPI-3) as the underlying communication substrate. We evaluated the performance of the implementation with several low-level kernels in order to determine overheads and limitations in comparison to the underlying MPI-3.Comment: 11 pages, International Conference on Partitioned Global Address Space Programming Models (PGAS14

    Design and Implementation of MPICH2 over InfiniBand with RDMA Support

    Full text link
    For several years, MPI has been the de facto standard for writing parallel applications. One of the most popular MPI implementations is MPICH. Its successor, MPICH2, features a completely new design that provides more performance and flexibility. To ensure portability, it has a hierarchical structure based on which porting can be done at different levels. In this paper, we present our experiences designing and implementing MPICH2 over InfiniBand. Because of its high performance and open standard, InfiniBand is gaining popularity in the area of high-performance computing. Our study focuses on optimizing the performance of MPI-1 functions in MPICH2. One of our objectives is to exploit Remote Direct Memory Access (RDMA) in Infiniband to achieve high performance. We have based our design on the RDMA Channel interface provided by MPICH2, which encapsulates architecture-dependent communication functionalities into a very small set of functions. Starting with a basic design, we apply different optimizations and also propose a zero-copy-based design. We characterize the impact of our optimizations and designs using microbenchmarks. We have also performed an application-level evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2 implementation achieves 7.6 μ\mus latency and 857 MB/s bandwidth, which are close to the raw performance of the underlying InfiniBand layer. Our study shows that the RDMA Channel interface in MPICH2 provides a simple, yet powerful, abstraction that enables implementations with high performance by exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is the first high-performance design and implementation of MPICH2 on InfiniBand using RDMA support.Comment: 12 pages, 17 figure

    POSH: Paris OpenSHMEM: A High-Performance OpenSHMEM Implementation for Shared Memory Systems

    Get PDF
    In this paper we present the design and implementation of POSH, an Open-Source implementation of the OpenSHMEM standard. We present a model for its communications, and prove some properties on the memory model defined in the OpenSHMEM specification. We present some performance measurements of the communication library featured by POSH and compare them with an existing one-sided communication library. POSH can be downloaded from \url{http://www.lipn.fr/~coti/POSH}. % 9 - 67Comment: This is an extended version (featuring the full proofs) of a paper accepted at ICCS'1

    Increasing the Efficiency of Sparse Matrix-Matrix Multiplication with a 2.5D Algorithm and One-Sided MPI

    Full text link
    Matrix-matrix multiplication is a basic operation in linear algebra and an essential building block for a wide range of algorithms in various scientific fields. Theory and implementation for the dense, square matrix case are well-developed. If matrices are sparse, with application-specific sparsity patterns, the optimal implementation remains an open question. Here, we explore the performance of communication reducing 2.5D algorithms and one-sided MPI communication in the context of linear scaling electronic structure theory. In particular, we extend the DBCSR sparse matrix library, which is the basic building block for linear scaling electronic structure theory and low scaling correlated methods in CP2K. The library is specifically designed to efficiently perform block-sparse matrix-matrix multiplication of matrices with a relatively large occupation. Here, we compare the performance of the original implementation based on Cannon's algorithm and MPI point-to-point communication, with an implementation based on MPI one-sided communications (RMA), in both a 2D and a 2.5D approach. The 2.5D approach trades memory and auxiliary operations for reduced communication, which can lead to a speedup if communication is dominant. The 2.5D algorithm is somewhat easier to implement with one-sided communications. A detailed description of the implementation is provided, also for non ideal processor topologies, since this is important for actual applications. Given the importance of the precise sparsity pattern, and even the actual matrix data, which decides the effective fill-in upon multiplication, the tests are performed within the CP2K package with application benchmarks. Results show a substantial boost in performance for the RMA based 2.5D algorithm, up to 1.80x, which is observed to increase with the number of involved processes in the parallelization.Comment: In Proceedings of PASC '17, Lugano, Switzerland, June 26-28, 2017, 10 pages, 4 figure
    corecore