
    Partitioned Global Address Space Languages

    The Partitioned Global Address Space (PGAS) model is a parallel programming model that aims to improve programmer productivity while still delivering high performance. The main premise of PGAS is that a globally shared address space improves productivity, but that a distinction between local and remote data accesses is required to allow performance optimizations and to support scalability on large-scale parallel architectures. To this end, PGAS preserves the global address space while embracing awareness of non-uniform communication costs. Today, about a dozen languages adhere to the PGAS model. This survey proposes a definition and a taxonomy along four axes: how parallelism is introduced, how the address space is partitioned, how data is distributed among the partitions, and finally how data is accessed across partitions. Our taxonomy reveals that today's PGAS languages focus on distributing regular data and distinguish only between local and remote data access cost, whereas the distribution of irregular data and the adoption of richer data access cost models remain open challenges.
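
    To make the local/remote distinction concrete, below is a minimal C sketch in the style of the OpenSHMEM API (one of several PGAS-flavoured interfaces). The symmetric array, the neighbour computation, and the printed output are illustrative assumptions, not details taken from the survey.

    ```c
    /* Minimal PGAS-style example using OpenSHMEM (compile with an OpenSHMEM
     * wrapper such as oshcc). Every processing element (PE) owns a local
     * partition of a symmetric array; remote partitions are read with an
     * explicit get, which makes the cheap-local / costly-remote distinction
     * visible in the source code. */
    #include <stdio.h>
    #include <shmem.h>

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric allocation: every PE holds its own partition. */
        long *src = shmem_malloc(sizeof(long));
        long local_copy = 0;
        *src = 100 + me;                 /* local access: an ordinary store */
        shmem_barrier_all();

        int right = (me + 1) % npes;     /* neighbour PE (illustrative) */
        shmem_long_get(&local_copy, src, 1, right);  /* remote access: explicit get */

        printf("PE %d read %ld from PE %d\n", me, local_copy, right);

        shmem_free(src);
        shmem_finalize();
        return 0;
    }
    ```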

    A Theory of Partitioned Global Address Spaces

    Partitioned global address space (PGAS) is a parallel programming model for the development of applications on clusters. It provides a global address space partitioned among the cluster nodes, and is supported in programming languages like C, C++, and Fortran by means of APIs. In this paper we provide a formal model for the semantics of single program, multiple data (SPMD) programs using PGAS APIs. Our model reflects the main features of popular real-world APIs such as SHMEM, ARMCI, GASNet, GPI, and GASPI. A key feature of PGAS is the support for one-sided communication: a node may directly read and write the memory located at a remote node, without explicit synchronization with the processes running on the remote side. One-sided communication increases performance by decoupling process synchronization from data transfer, but requires the programmer to reason about appropriate synchronizations between reads and writes. As a second contribution, we propose and investigate robustness, a criterion for correct synchronization of PGAS programs. Robustness corresponds to acyclicity of a suitable happens-before relation defined on PGAS computations. The requirement is finer than classical data race freedom and rules out most false error reports. Our main result is an algorithm for checking robustness of PGAS programs. The algorithm makes use of two insights. Using combinatorial arguments we first show that, if a PGAS program is not robust, then there are computations in a certain normal form that violate happens-before acyclicity. Intuitively, normal-form computations delay remote accesses in an ordered way. We then devise an algorithm that checks for cyclic normal-form computations. Essentially, the algorithm is an emptiness check for a novel automaton model that accepts normal-form computations in streaming fashion. Altogether, we prove that the robustness problem is PSpace-complete.
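
    The "data then flag" idiom below is a small C sketch, again in OpenSHMEM style, of the kind of one-sided pattern whose correct synchronization the robustness criterion is meant to check. The variable names, the producer/consumer split, and the placement of the fence are assumptions for illustration, not examples taken from the paper.

    ```c
    /* One-sided "data then flag" pattern (OpenSHMEM-style sketch). Without the
     * fence between the two puts, the target PE could observe flag == 1 before
     * the payload has landed -- exactly the kind of reordering a
     * happens-before / robustness analysis is designed to detect. */
    #include <shmem.h>

    long payload = 0;   /* symmetric variables: same address on every PE */
    long flag = 0;

    static void producer(int target_pe) {
        long value = 42, one = 1;
        shmem_long_put(&payload, &value, 1, target_pe);  /* write data remotely */
        shmem_fence();                                   /* order the two puts  */
        shmem_long_put(&flag, &one, 1, target_pe);       /* then publish flag   */
    }

    static void consumer(void) {
        shmem_long_wait_until(&flag, SHMEM_CMP_EQ, 1);   /* spin until flag set */
        /* payload now holds 42 */
    }

    int main(void) {
        shmem_init();
        if (shmem_n_pes() >= 2) {
            if (shmem_my_pe() == 0) producer(1);
            else if (shmem_my_pe() == 1) consumer();
        }
        shmem_barrier_all();
        shmem_finalize();
        return 0;
    }
    ```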

    DART-MPI: An MPI-based Implementation of a PGAS Runtime System

    A Partitioned Global Address Space (PGAS) approach treats a distributed system as if the memory were shared on a global level. Given such a global view on memory, the user may program applications very much like shared-memory systems. This greatly simplifies the task of developing parallel applications, because no explicit communication has to be specified in the program for data exchange between different computing nodes. In this paper we present DART, a runtime environment which implements the PGAS paradigm on large-scale high-performance computing clusters. A specific feature of our implementation is the use of one-sided communication of the Message Passing Interface (MPI) version 3 (i.e. MPI-3) as the underlying communication substrate. We evaluated the performance of the implementation with several low-level kernels in order to determine overheads and limitations in comparison to the underlying MPI-3. Comment: 11 pages, International Conference on Partitioned Global Address Space Programming Models (PGAS14).
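
    DART's internals are not described in the abstract; as background, here is a generic C sketch of how a PGAS-style remote write can be expressed with the MPI-3 one-sided (RMA) operations the paper names as its communication substrate. The window layout, the ranks, and the fence-based synchronization are illustrative choices, not DART's actual design.

    ```c
    /* Generic sketch of a PGAS-style remote write on top of MPI-3 RMA.
     * Each rank exposes a window of memory; any rank can write into a
     * target rank's window with MPI_Put, without the target posting a
     * matching receive. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *base;
        MPI_Win win;
        /* Each rank contributes one int to the globally addressable space. */
        MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &base, &win);
        *base = -1;

        MPI_Win_fence(0, win);                        /* open RMA epoch */
        int target = (rank + 1) % size;
        int value = rank;
        MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);  /* one-sided put */
        MPI_Win_fence(0, win);                        /* close epoch: puts visible */

        printf("rank %d: window now holds %d\n", rank, *base);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }
    ```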

    A framework for unit testing with coarray Fortran

    Parallelism is a ubiquitous feature of modern computing architectures; indeed, we might even say that serial code is now automatically legacy code. Writing parallel code poses significant challenges to programmers and is often error-prone. Partitioned Global Address Space (PGAS) languages, such as Coarray Fortran (CAF), represent a promising development direction in the quest for a trade-off between simplicity and performance. CAF is a parallel programming model that allows a smooth migration from serial to parallel code. However, despite CAF's simplicity, refactoring serial code and migrating it to parallel versions is still error-prone, especially in complex software. The combination of unit testing, which drastically reduces defect injection, and CAF is therefore a very appealing prospect; however, it requires appropriate tools to realize its potential. In this paper, we present the first CAF-compatible framework for unit tests, developed as an extension to the Parallel Fortran Unit Test framework (pFUnit).
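
    pFUnit's interface is Fortran and is not reproduced here. As a language-neutral illustration of the core problem such a framework must solve, the hedged C/MPI sketch below lets every process evaluate an assertion locally and aggregates the results so that the test fails if any image fails; the helper assert_all_true is hypothetical and is not part of pFUnit or the presented framework.

    ```c
    /* Hypothetical sketch (not pFUnit's API) of the aggregation step a parallel
     * unit-test framework needs: an assertion is evaluated on every process and
     * the test only passes if it holds everywhere. Written in C with MPI for
     * illustration; pFUnit itself targets Fortran and coarray images. */
    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical helper: returns 1 only if the condition held on every rank. */
    static int assert_all_true(int local_ok, MPI_Comm comm) {
        int global_ok = 0;
        MPI_Allreduce(&local_ok, &global_ok, 1, MPI_INT, MPI_LAND, comm);
        return global_ok;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int local_ok = (rank * 2 == rank + rank);    /* trivial "unit under test" */
        if (!assert_all_true(local_ok, MPI_COMM_WORLD) && rank == 0)
            fprintf(stderr, "test failed on at least one rank\n");

        MPI_Finalize();
        return 0;
    }
    ```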

    C Language Extensions for Hybrid CPU/GPU Programming with StarPU

    Modern platforms used for high-performance computing (HPC) include machines with both general-purpose CPUs and "accelerators", often in the form of graphical processing units (GPUs). StarPU is a C library to exploit such platforms. It provides users with ways to define "tasks" to be executed on CPUs or GPUs, along with the dependencies among them, and it automatically schedules those tasks over all the available processing units. In doing so, it also relieves programmers from the need to know the underlying architecture details: it adapts to the available CPUs and GPUs, and automatically transfers data between main memory and GPUs as needed. While StarPU's approach is successful at addressing run-time scheduling issues, being a C library makes for a poor and error-prone programming interface. This paper presents an effort started in 2011 to promote some of the concepts exported by the library as C language constructs, by means of an extension of the GCC compiler suite. Our main contribution is the design and implementation of language extensions that map to StarPU's task programming paradigm. We argue that the proposed extensions make it easier to get started with StarPU, eliminate errors that can occur when using the C library, and help diagnose possible mistakes. We conclude with directions for future work.
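
    The GCC language extensions themselves are not shown in the abstract; for context, the following is a rough C sketch of the library-level task interface those extensions aim to wrap, written from a general understanding of StarPU's public API. Field names and helpers (e.g. STARPU_MAIN_RAM, starpu_task_insert) vary across StarPU versions, so treat this as an assumption-laden illustration rather than canonical usage.

    ```c
    /* Rough sketch of StarPU's task-based C API: register data, describe a
     * task as a codelet, and let the runtime schedule it and move data. */
    #include <starpu.h>

    static void scale_cpu(void *buffers[], void *cl_arg) {
        float factor;
        starpu_codelet_unpack_args(cl_arg, &factor);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        float *v   = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        for (unsigned i = 0; i < n; i++)
            v[i] *= factor;                      /* the task's actual work */
    }

    static struct starpu_codelet scale_cl = {
        .cpu_funcs = { scale_cpu },              /* CPU implementation only here */
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

    int main(void) {
        float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float factor = 2.0f;

        if (starpu_init(NULL) != 0) return 1;

        starpu_data_handle_t handle;
        starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                    (uintptr_t)data, 8, sizeof(float));

        /* Submit one task; the runtime picks a worker and transfers data. */
        starpu_task_insert(&scale_cl,
                           STARPU_RW, handle,
                           STARPU_VALUE, &factor, sizeof(factor),
                           0);

        starpu_task_wait_for_all();
        starpu_data_unregister(handle);          /* data is back in main memory */
        starpu_shutdown();
        return 0;
    }
    ```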

    Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs

    Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses into larger remote access operations. A straightforward implementation of the inspector-executor transformation results in excessive instrumentation that hinders performance. This paper addresses this issue and introduces various techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs), the inlining of data locality checks, and the usage of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body. A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase the performance of UPC programs to up to 1.8x that of their hand-optimized UPC counterparts for applications with regular accesses, and up to 6.3x for applications with irregular accesses.
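
    The paper's transformations (CSLMAD-based localization, inlined locality checks, index-vector aggregation) are applied inside the compiler and runtime; to fix the underlying idea, here is a deliberately simplified, plain-C caricature of the inspector-executor pattern itself. The fake remote array and the fetch_remote_bulk helper are hypothetical stand-ins for shared UPC data and a coalesced runtime transfer (e.g. a single upc_memget instead of many single-element reads).

    ```c
    /* Caricature of the inspector-executor transformation in plain C.
     * Original fine-grained loop:
     *     for (i = 0; i < n; i++) y[i] = shared_x[idx[i]];  // one remote access per i
     * Transformed version: the inspector records the remote elements the loop
     * will touch, one bulk transfer fetches them, and the executor runs over
     * the local copy. */
    #include <stddef.h>

    /* Stand-in for globally shared data owned by a remote UPC thread. */
    static double remote_x[1024];

    /* Hypothetical coalesced transfer; idx values are assumed to be < 1024. */
    static void fetch_remote_bulk(const size_t *indices, size_t count, double *out) {
        for (size_t i = 0; i < count; i++)
            out[i] = remote_x[indices[i]];
    }

    void gather_inspector_executor(const size_t *idx, size_t n, double *y) {
        size_t needed[1024];
        double local_buf[1024];
        if (n > 1024) n = 1024;              /* illustrative fixed bound */

        /* Inspector: record which shared elements the loop will touch. */
        for (size_t i = 0; i < n; i++)
            needed[i] = idx[i];

        /* One coalesced transfer instead of n fine-grained remote accesses. */
        fetch_remote_bulk(needed, n, local_buf);

        /* Executor: the original loop body, now reading only local memory. */
        for (size_t i = 0; i < n; i++)
            y[i] = local_buf[i];
    }
    ```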