20 research outputs found

    GMT: Enabling easy development and efficient execution of irregular applications on commodity clusters

    In this poster we introduce GMT (Global Memory and Threading library), a custom runtime library that enables efficient execution of irregular applications on commodity clusters. GMT only requires a cluster of x86 nodes supporting MPI. GMT integrates the Partitioned Global Address Space (PGAS) locality-aware global data model with a fork/join control model common in single-node multithreaded environments. GMT supports lightweight software multithreading to tolerate latencies when accessing data on remote nodes, and is built around data aggregation to maximize network bandwidth utilization.
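
    To make the programming model concrete, below is a minimal Python sketch of the PGAS-plus-fork/join combination the abstract describes. The GlobalArray and par_for names are invented for illustration and are not GMT's actual API.

        # Hypothetical illustration of a PGAS global array with per-node
        # shards, combined with a fork/join parallel loop.
        from concurrent.futures import ThreadPoolExecutor

        class GlobalArray:
            """A global array partitioned into per-node shards (PGAS-style)."""
            def __init__(self, size, nodes):
                self.shard = size // nodes        # elements per node
                self.data = [[0] * self.shard for _ in range(nodes)]

            def owner(self, i):
                return i // self.shard            # locality-aware: index -> node

            def get(self, i):
                return self.data[self.owner(i)][i % self.shard]

            def put(self, i, v):
                self.data[self.owner(i)][i % self.shard] = v

        def par_for(n, body, workers=4):
            """Fork tasks over the iteration space; join when all complete."""
            with ThreadPoolExecutor(max_workers=workers) as pool:
                list(pool.map(body, range(n)))    # pool exit is the join

        ga = GlobalArray(size=16, nodes=4)
        par_for(16, lambda i: ga.put(i, i * i))
        print(ga.get(10))                         # -> 100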

    COMPOSE-HPC: A Transformational Approach to Exascale

    The goal of the COMPOSE-HPC project is to 'democratize' tools for automatic transformation of program source code, so that it becomes tractable for the developers of scientific applications to create and use their own transformations reliably and safely. This paper describes our approach to this challenge: the creation of the KNOT tool chain, which includes tools for creating annotation languages to control the transformations (PAUL), for performing the transformations (ROTE), and for optimization and code generation (BRAID), which can be used individually and in combination. We also provide examples of current and future uses of the KNOT tools, which include transforming code to use different programming models and environments, providing tests that can detect errors in software or its execution, and composing software written in different programming languages or with different threading patterns.
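
    As a toy illustration of the annotation-driven approach, the Python rewriter below consumes a source annotation and instruments the function that follows it. The '#%trace' syntax is invented here for illustration; it is not PAUL's actual annotation language, and the rewrite is far simpler than what ROTE performs.

        # Toy annotation-driven source transformation: a '#%trace' line
        # causes the next function definition to be instrumented.
        import re

        SOURCE = """\
        #%trace
        def solve(x):
            return x * 2
        """

        def add_tracing(src):
            out, pending = [], False
            for line in src.splitlines():
                if line.strip() == "#%trace":
                    pending = True                # consume annotation, do not emit
                    continue
                out.append(line)
                m = re.match(r"(\s*)def (\w+)\(", line)
                if pending and m:
                    body_indent = m.group(1) + "    "
                    out.append(f'{body_indent}print("enter {m.group(2)}")')
                    pending = False
            return "\n".join(out)

        print(add_tracing(SOURCE))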

    Advanced data-parallel compilation

    Over the past few decades, scientific research has grown to rely increasingly on simulation and other computational techniques. This strategy has been named in silico research. Computation is increasingly important for testing theories and obtaining results in fields where experimentation is not currently possible (e.g. astrophysics, cosmology, climate modeling) or where the detail and resolution of the results cannot be provided by traditional experimental methodologies (e.g. computational biology, materials science). The quantities of data manipulated and produced by such scientific programs are often too large to be processed on a uniprocessor computer. Parallel computers have been used to obtain such results. Creating software for parallel machines has been a very difficult task. Most parallel applications have been written using low-level communication libraries based on message passing, which require application programmers to deal with all aspects of programming the machine. Compilers for data-parallel languages have been proposed as an alternative. High-Performance Fortran (HPF) was designed to simplify the construction of data-parallel programs operating on dense distributed arrays. HPF compilers have not implemented the necessary transformations and mapping strategies to translate complex high-level data-parallel codes into low-level high-performance applications. This thesis demonstrates that it is possible to generate scalable high-performance code for a range of complex, regular applications written in high-level data-parallel languages, through the design and implementation of several analysis, compilation and runtime techniques in the dHPF compiler. The major contributions of this thesis are: (1) analysis, code generation and data distribution strategies (multipartitioning) for tightly-coupled codes; (2) compiler and runtime support based on generalized multipartitioning; (3) communication scheduling to hide latency through the overlap of embarrassingly parallel loops; (4) advanced static analysis for communication coalescing and simplification; (5) support for efficient single-image executables running on a parameterized number of processors; (6) strategies for generation of large-scale scalable executables on up to hundreds of processors. Experiments with the NAS SP, BT and LU application benchmarks show that these techniques enable the dHPF compiler to generate code that scales efficiently up to hundreds of processors with only a few percent overhead with respect to high-quality hand-coded implementations. The techniques are necessary but not sufficient to produce efficient code for the NAS MG multigrid benchmark, which still exhibits large overhead compared to its hand-coded counterpart.
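
    To make the dense distributed-array setting concrete, the short Python sketch below shows the standard BLOCK mapping that HPF-style compilers use to assign array elements to processors; the helper names are illustrative only.

        # BLOCK distribution: element i lives on the processor that owns
        # the contiguous block containing i.
        from math import ceil

        def block_owner(i, n, p):
            """Owner of element i of an n-element array on p processors."""
            return i // ceil(n / p)

        def local_range(rank, n, p):
            """Half-open index range [lo, hi) owned by processor 'rank'."""
            b = ceil(n / p)
            return rank * b, min(n, (rank + 1) * b)

        n, p = 10, 4                      # block size b = 3
        print([local_range(r, n, p) for r in range(p)])
        # -> [(0, 3), (3, 6), (6, 9), (9, 10)]
        print(block_owner(7, n, p))       # -> 2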

    Effective Communication Coalescing for Data-Parallel Applications

    Communication coalescing is a static optimization that can reduce both communication frequency and redundant data transfer in compiler-generated code for regular, data-parallel applications. We present an algorithm for coalescing the communication that arises when generating code for regular, data-parallel applications written in High-Performance Fortran (HPF). To handle sophisticated computation partitionings, our algorithm normalizes communication before attempting coalescing. We experimentally evaluate our algorithm, which is implemented in the dHPF compiler, in the compilation of HPF versions of the NAS application benchmarks SP, BT and LU. Our normalized coalescing algorithm improves the performance and scalability of compiler-generated code for these benchmarks by reducing communication volume by up to 55% compared to a simpler coalescing strategy, and enables us to match the communication volume and frequency of hand-optimized MPI implementations of these codes.
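
    The Python sketch below illustrates the core idea: fine-grained messages to the same destination whose index ranges overlap or abut are merged into single transfers. It is a simplification for exposition, not the dHPF algorithm itself.

        # Coalescing: sort ("normalize") per-destination ranges, then merge
        # any that overlap or are adjacent into one transfer.
        from collections import defaultdict

        def coalesce(messages):
            """messages: list of (dest, lo, hi) half-open ranges to send."""
            by_dest = defaultdict(list)
            for dest, lo, hi in messages:
                by_dest[dest].append((lo, hi))
            merged = []
            for dest, ranges in by_dest.items():
                ranges.sort()
                cur_lo, cur_hi = ranges[0]
                for lo, hi in ranges[1:]:
                    if lo <= cur_hi:               # overlapping or adjacent
                        cur_hi = max(cur_hi, hi)
                    else:
                        merged.append((dest, cur_lo, cur_hi))
                        cur_lo, cur_hi = lo, hi
                merged.append((dest, cur_lo, cur_hi))
            return merged

        msgs = [(1, 0, 4), (1, 4, 8), (2, 0, 2), (1, 2, 6)]
        print(coalesce(msgs))   # -> [(1, 0, 8), (2, 0, 2)]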

    Data-Parallel Compiler Support for Multipartitioning

    Multipartitioning is a skewed-cyclic block distribution that yields better parallel efficiency and scalability for line-sweep computations than traditional block partitionings. This paper describes extensions to the Rice dHPF compiler for High Performance Fortran that enable it to support multipartitioned data distributions, along with optimizations that enable dHPF to generate efficient multipartitioned code. We describe experiments applying these techniques to parallelize serial versions of the NAS SP and BT application benchmarks and show that the performance of the code generated by dHPF approaches that of hand-coded parallelizations based on multipartitioning.
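
    The skewed-cyclic idea can be shown in a few lines of Python: cut the array into p x p tiles and assign tile (i, j) to processor (j - i) mod p. Every step of a sweep along either dimension then uses all p processors, which is exactly the balance that makes multipartitioning attractive. This diagonal mapping is a standard 2-D form; the compiler support described above is, of course, far more general.

        # 2-D multipartitioning: diagonal tile-to-processor assignment.
        p = 4
        owner = [[(j - i) % p for j in range(p)] for i in range(p)]
        for row in owner:
            print(row)

        # Each step of a row sweep uses every processor exactly once ...
        for i in range(p):
            assert sorted(owner[i]) == list(range(p))
        # ... and so does each step of a column sweep.
        for j in range(p):
            assert sorted(owner[i][j] for i in range(p)) == list(range(p))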

    On Efficient Parallelization of Line-Sweep Computations

    Multipartitioning is a strategy for partitioning multidimensional arrays among a collection of processors so that line-sweep computations can be performed efficiently. The principal property of a multipartitioned array is that, for a line sweep along any array dimension, all processors have the same number of tiles to compute at each step in the sweep. This property results in full, balanced parallelism. A secondary benefit of multipartitionings is that they induce only coarse-grain communication. Previously, computing a d-dimensional multipartitioning required that p^(1/(d-1)) be integral, where p is the number of processors. Here, we describe an algorithm to compute a d-dimensional multipartitioning of an n-dimensional array for an arbitrary number of processors, for any d, 2 ≤ d ≤ n. When using a multipartitioning to parallelize a line-sweep computation, the best partitioning is the one that exploits all of the processors and has the smallest communication volume. To compute the best multipartitioning of an n-dimensional array, we describe a cost model for selecting d, the dimensionality of the best partitioning, and the number of cuts along each partitioned dimension. In practice, our technique will choose a 3-dimensional multipartitioning for a 3-dimensional line-sweep computation, except when p is prime; previously, a 3-dimensional multipartitioning could be applied only when √p was integral. We describe an implementation of multipartitioning in the Rice dHPF compiler and performance results obtained when parallelizing a line-sweep computation on a range of different numbers of processors. (This work was performed while a visiting scholar at Rice University.)
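
    The classic restriction is easy to check numerically. The sketch below reports, for a few processor counts p, which dimensionalities d admit a traditional multipartitioning (d = 2 always qualifies, since p^(1/1) = p is trivially integral); the generalized algorithm described above removes this restriction.

        # Feasibility of a classic d-dimensional multipartitioning:
        # p^(1/(d-1)) must be an integer.
        def classic_feasible(p, d):
            root = round(p ** (1.0 / (d - 1)))
            return root ** (d - 1) == p

        for p in (16, 25, 12):
            print(p, [d for d in (2, 3, 4) if classic_feasible(p, d)])
        # -> 16 [2, 3]    25 [2, 3]    12 [2]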

    Scaling irregular applications through data aggregation and software multithreading

    Emerging applications in areas such as bioinformatics, data analytics, semantic databases and knowledge discovery employ datasets from tens to hundreds of terabytes. Currently, only distributed-memory clusters have enough aggregate space to enable in-memory processing of datasets of this size. However, in addition to their large sizes, the data structures used by these new application classes are usually characterized by unpredictable and fine-grained accesses: i.e., they present an irregular behavior. Traditional commodity clusters, instead, exploit cache-based processors and high-bandwidth networks optimized for locality, regular computation and bulk communication. For these reasons, irregular applications are inefficient on these systems, and require custom, hand-coded optimizations to provide scaling in both performance and size. Lightweight software multithreading, which enables tolerating data access latencies by overlapping network communication with computation, and aggregation, which reduces overheads and increases bandwidth utilization by coalescing fine-grained network messages, are key techniques that can speed up the performance of large-scale irregular applications on commodity clusters. In this paper we describe GMT (Global Memory and Threading), a runtime system library that couples software multithreading and message aggregation together with a Partitioned Global Address Space (PGAS) data model to enable higher performance and scaling of irregular applications on multi-node systems. We present the architecture of the runtime, explaining how it is designed around these two critical techniques. We show that irregular applications written using our runtime can outperform, even by orders of magnitude, the corresponding applications written using other programming models that do not exploit these techniques.
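
    The aggregation half of the design can be pictured as a per-destination buffer that converts many fine-grained requests into a few large network messages. The Python class below is a sketch of that technique only; the names and the batching policy are invented and do not reflect GMT's internals.

        # Per-destination aggregation: buffer small requests, send one
        # large message when the buffer fills or on an explicit flush.
        class Aggregator:
            def __init__(self, send, max_batch=1024):
                self.send = send                  # send(dest, [requests])
                self.max_batch = max_batch
                self.buffers = {}

            def request(self, dest, req):
                buf = self.buffers.setdefault(dest, [])
                buf.append(req)
                if len(buf) >= self.max_batch:    # full: ship one big message
                    self.flush(dest)

            def flush(self, dest):
                if self.buffers.get(dest):
                    self.send(dest, self.buffers.pop(dest))

        agg = Aggregator(send=lambda d, r: print(f"node {d}: {len(r)} requests"),
                         max_batch=3)
        for i in range(7):
            agg.request(dest=i % 2, req=("get", i))
        agg.flush(0)
        agg.flush(1)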
