5,958 research outputs found

    GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

    Get PDF
    While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.Comment: 32 pages, 11 figure

    BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing

    Full text link
    Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. We present BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing. BISMO utilizes the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We characterize the resource usage and performance of BISMO across a range of parameters to build a hardware cost model, and demonstrate a peak performance of 6.5 TOPS on the Xilinx PYNQ-Z1 board.Comment: To appear at FPL'1

    LEGaTO: first steps towards energy-efficient toolset for heterogeneous computing

    Get PDF
    LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC.Peer ReviewedPostprint (author's final draft

    Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC

    Full text link
    Fault tolerance is one of the major design goals for HPC. The emergence of non-volatile memories (NVM) provides a solution to build fault tolerant HPC. Data in NVM-based main memory are not lost when the system crashes because of the non-volatility nature of NVM. However, because of volatile caches, data must be logged and explicitly flushed from caches into NVM to ensure consistence and correctness before crashes, which can cause large runtime overhead. In this paper, we introduce an algorithm-based method to establish crash consistence in NVM for HPC applications. We slightly extend application data structures or sparsely flush cache blocks, which introduce ignorable runtime overhead. Such extension or cache flushing allows us to use algorithm knowledge to \textit{reason} data consistence or correct inconsistent data when the application crashes. We demonstrate the effectiveness of our method for three algorithms, including an iterative solver, dense matrix multiplication, and Monte-Carlo simulation. Based on comprehensive performance evaluation on a variety of test environments, we demonstrate that our approach has very small runtime overhead (at most 8.2\% and less than 3\% in most cases), much smaller than that of traditional checkpoint, while having the same or less recomputation cost.Comment: 12 page

    Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

    Full text link
    We present a method for parallel block-sparse matrix-matrix multiplication on distributed memory clusters. By using a quadtree matrix representation, data locality is exploited without prior information about the matrix sparsity pattern. A distributed quadtree matrix representation is straightforward to implement due to our recent development of the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined with the Chunks and Tasks model leads to favorable weak and strong scaling of the communication cost with the number of processes, as shown both theoretically and in numerical experiments. Matrices are represented by sparse quadtrees of chunk objects. The leaves in the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by the matrix library and may occur at any level in the hierarchy and/or within the submatrix leaves. In case graphics processing units (GPUs) are available, both CPUs and GPUs are used for leaf-level multiplication work, thus making use of the full computing capacity of each node. The performance is evaluated for matrices with different sparsity structures, including examples from electronic structure calculations. Compared to methods that do not exploit data locality, our locality-aware approach reduces communication significantly, achieving essentially constant communication per node in weak scaling tests.Comment: 35 pages, 14 figure

    A Test Suite for High-Performance Parallel Java

    Get PDF
    The Java programming language has a number of features that make it attractive for writing high-quality, portable parallel programs. A pure object formulation, strong typing and the exception model make programs easier to create, debug, and maintain. The elegant threading provides a simple route to parallelism on shared-memory machines. Anticipating great improvements in numerical performance, this paper presents a suite of simple programs that indicate how a pure Java Navier-Stokes solver might perform. The suite includes a parallel Euler solver. We present results from a 32-processor Hewlett-Packard machine and a 4-processor Sun server. While speedup is excellent on both machines, indicating a high-quality thread scheduler, the single-processor performance needs much improvement
    • …
    corecore