9,166 research outputs found

    Reducing communication in sparse solvers

    Get PDF
    Sparse matrix operations dominate the cost of many scientific applications. In parallel, the performance and scalability of these operations is limited by irregular point-to-point communication. Multiple methods are investigated throughout this dissertation for reducing the cost associated with communication throughout sparse matrix operations. Algorithmic changes reduce communication requirements, but also affect accuracy of the operation, leading to reduced convergence of scientific codes. We investigate a method of systematically removing relatively small non-zeros throughout an algebraic multigrid hierarchy, yielding significant reductions to the cost of sparse matrix-vector multiplication that outweigh affects of reduced accuracy of the multiplication. Therefore, the reduction in per-iteration communication costs outweigh the cost of extra solver iterations. As a result, sparsification yields improvement of both the performance and scalability of algebraic multigrid. Alterations to the parallel implementation of MPI communication also yield reduced costs with no effect on accuracy. We investigate methods of agglomerating messages on-node before injecting into the network, reducing the amount of costly inter-node communication. This node-aware communication yields improvements to both performance and scalability of matrix operations, particularly in strong scaling studies. Furthermore, we show an improvement in the cost of algebraic multigrid as a result of reduced communication costs in sparse matrix operations. Finally, performance models can be used to analyze the costs of matrix operations, indicating the source of dominant communication costs, such as initializing messages or transporting bytes of data. We investigate methods of improving traditional performance models of irregular point-to-point communication through the addition of node-awareness, queue search costs, and network contention penalties

    An efficient hardware-aware matrix-free implementation for finite-element discretized matrix-multivector products

    Full text link
    The finite-element (FE) discretization of a partial differential equation usually involves the construction of a FE discretized operator and computing its action on trial FE discretized fields for the solution of a linear system of equations or eigenvalue problems and is traditionally computed using global sparse-vector multiplication modules. However, recent hardware-aware algorithms for evaluating such matrix-vector multiplications suggest that on-the-fly matrix-vector products without building and storing the cell-level dense matrices reduce both arithmetic complexity and memory footprint and are referred to as matrix-free approaches. These approaches exploit the tensor-structured nature of the FE polynomial basis for evaluating the underlying integrals and, the current state-of-the-art matrix-free implementations deal with the action of FE discretized matrix on a single vector. These are neither optimal nor readily applicable for matrix-multivector products involving a large number of vectors. We discuss a computationally efficient and scalable matrix-free implementation procedure to compute the FE discretized matrix-multivector products on multi-node CPU and GPU architectures. The accuracy and performance of our implementation is assessed on the problem of computing FE overlap (mass) matrix-multivector multiplications as a representative benchmark example, and we observe superior performance of our implementation. For instance, computational gains up to 2.9x on GPU architectures and 6x on CPU-only architectures are observed for matrix-multivector products compared to the FE cell-level matrix-multiplication approach for the case of 100 vectors. We further benchmark our performance against the matrix-free module in deal.II and speedups up to 2x are observed for multivectors on CPU-only architectures and 1.5x on GPUs.Comment: 5 pages, 7 figure

    GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

    Get PDF
    While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.Comment: 32 pages, 11 figure

    Parallel sparse matrix-vector multiplication as a test case for hybrid MPI+OpenMP programming

    Full text link
    We evaluate optimized parallel sparse matrix-vector operations for two representative application areas on widespread multicore-based cluster configurations. First the single-socket baseline performance is analyzed and modeled with respect to basic architectural properties of standard multicore chips. Going beyond the single node, parallel sparse matrix-vector operations often suffer from an unfavorable communication to computation ratio. Starting from the observation that nonblocking MPI is not able to hide communication cost using standard MPI implementations, we demonstrate that explicit overlap of communication and computation can be achieved by using a dedicated communication thread, which may run on a virtual core. We compare our approach to pure MPI and the widely used "vector-like" hybrid programming strategy.Comment: 12 pages, 6 figure

    Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems

    Full text link
    We evaluate optimized parallel sparse matrix-vector operations for several representative application areas on widespread multicore-based cluster configurations. First the single-socket baseline performance is analyzed and modeled with respect to basic architectural properties of standard multicore chips. Beyond the single node, the performance of parallel sparse matrix-vector operations is often limited by communication overhead. Starting from the observation that nonblocking MPI is not able to hide communication cost using standard MPI implementations, we demonstrate that explicit overlap of communication and computation can be achieved by using a dedicated communication thread, which may run on a virtual core. Moreover we identify performance benefits of hybrid MPI/OpenMP programming due to improved load balancing even without explicit communication overlap. We compare performance results for pure MPI, the widely used "vector-like" hybrid programming strategies, and explicit overlap on a modern multicore-based cluster and a Cray XE6 system.Comment: 16 pages, 10 figure
    corecore