55,493 research outputs found

    DART-MPI: An MPI-based Implementation of a PGAS Runtime System

    Full text link
    A Partitioned Global Address Space (PGAS) approach treats a distributed system as if the memory were shared on a global level. Given such a global view on memory, the user may program applications very much like shared memory systems. This greatly simplifies the tasks of developing parallel applications, because no explicit communication has to be specified in the program for data exchange between different computing nodes. In this paper we present DART, a runtime environment, which implements the PGAS paradigm on large-scale high-performance computing clusters. A specific feature of our implementation is the use of one-sided communication of the Message Passing Interface (MPI) version 3 (i.e. MPI-3) as the underlying communication substrate. We evaluated the performance of the implementation with several low-level kernels in order to determine overheads and limitations in comparison to the underlying MPI-3.Comment: 11 pages, International Conference on Partitioned Global Address Space Programming Models (PGAS14

    Lightweight MPI Communicators with Applications to Perfectly Balanced Quicksort

    Get PDF
    MPI uses the concept of communicators to connect groups of processes. It provides nonblocking collective operations on communicators to overlap communication and computation. Flexible algorithms demand flexible communicators. E.g., a process can work on different subproblems within different process groups simultaneously, new process groups can be created, or the members of a process group can change. Depending on the number of communicators, the time for communicator creation can drastically increase the running time of the algorithm. Furthermore, a new communicator synchronizes all processes as communicator creation routines are blocking collective operations. We present RBC, a communication library based on MPI, that creates range-based communicators in constant time without communication. These RBC communicators support (non)blocking point-to-point communication as well as (non)blocking collective operations. Our experiments show that the library reduces the time to create a new communicator by a factor of more than 400 whereas the running time of collective operations remains about the same. We propose Janus Quicksort, a distributed sorting algorithm that avoids any load imbalances. We improved the performance of this algorithm by a factor of 15 for moderate inputs by using RBC communicators. Finally, we discuss different approaches to bring nonblocking (local) communicator creation of lightweight (range-based) communicators into MPI

    Preparing HPC Applications for the Exascale Era: A Decoupling Strategy

    Full text link
    Production-quality parallel applications are often a mixture of diverse operations, such as computation- and communication-intensive, regular and irregular, tightly coupled and loosely linked operations. In conventional construction of parallel applications, each process performs all the operations, which might result inefficient and seriously limit scalability, especially at large scale. We propose a decoupling strategy to improve the scalability of applications running on large-scale systems. Our strategy separates application operations onto groups of processes and enables a dataflow processing paradigm among the groups. This mechanism is effective in reducing the impact of load imbalance and increases the parallel efficiency by pipelining multiple operations. We provide a proof-of-concept implementation using MPI, the de-facto programming system on current supercomputers. We demonstrate the effectiveness of this strategy by decoupling the reduce, particle communication, halo exchange and I/O operations in a set of scientific and data-analytics applications. A performance evaluation on 8,192 processes of a Cray XC40 supercomputer shows that the proposed approach can achieve up to 4x performance improvement.Comment: The 46th International Conference on Parallel Processing (ICPP-2017

    An OpenSHMEM Implementation for the Adapteva Epiphany Coprocessor

    Full text link
    This paper reports the implementation and performance evaluation of the OpenSHMEM 1.3 specification for the Adapteva Epiphany architecture within the Parallella single-board computer. The Epiphany architecture exhibits massive many-core scalability with a physically compact 2D array of RISC CPU cores and a fast network-on-chip (NoC). While fully capable of MPMD execution, the physical topology and memory-mapped capabilities of the core and network translate well to Partitioned Global Address Space (PGAS) programming models and SPMD execution with SHMEM.Comment: 14 pages, 9 figures, OpenSHMEM 2016: Third workshop on OpenSHMEM and Related Technologie

    Idle Period Propagation in Message-Passing Applications

    Full text link
    Idle periods on different processes of Message Passing applications are unavoidable. While the origin of idle periods on a single process is well understood as the effect of system and architectural random delays, yet it is unclear how these idle periods propagate from one process to another. It is important to understand idle period propagation in Message Passing applications as it allows application developers to design communication patterns avoiding idle period propagation and the consequent performance degradation in their applications. To understand idle period propagation, we introduce a methodology to trace idle periods when a process is waiting for data from a remote delayed process in MPI applications. We apply this technique in an MPI application that solves the heat equation to study idle period propagation on three different systems. We confirm that idle periods move between processes in the form of waves and that there are different stages in idle period propagation. Our methodology enables us to identify a self-synchronization phenomenon that occurs on two systems where some processes run slower than the other processes.Comment: 18th International Conference on High Performance Computing and Communications, IEEE, 201
    corecore