8,873 research outputs found

    Runtime sparse matrix format selection

    No full text
    There exist many storage formats for the in-memory representation of sparse matrices. Choosing the format that yields the quickest processing of any given sparse matrix requires considering the exact non-zero structure of the matrix, as well as the current execution environment. Each of these factors can change at runtime. The matrix structure can vary as computation progresses, while the environment can change due to varying system load, the live migration of jobs across a heterogeneous cluster, etc. This paper describes an algorithm that learns at runtime how to map sparse matrices onto the format which provides the quickest sparse matrix-vector product calculation, and which can adapt to the hardware platform changing underfoot. We show multiplication times reduced by over 10% compared with the best non-adaptive format selection

    Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

    Full text link
    This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among different processors together with structural diversity among different sparse matrices lead to bottleneck diversity. This justifies an SpMV optimizer that is both matrix- and architecture-adaptive through runtime specialization. To this direction, we present an approach that first identifies the performance bottlenecks of SpMV for a given sparse matrix on the target platform either through profiling or by matrix property inspection, and then selects suitable optimizations to tackle those bottlenecks. Our optimization pool is based on the widely used Compressed Sparse Row (CSR) sparse matrix storage format and has low preprocessing overheads, making our overall approach practical even in cases where fast decision making and optimization setup is required. We evaluate our optimizer on three x86-based computing platforms and demonstrate that it is able to distinguish and appropriately optimize SpMV for the majority of matrices in a representative test suite, leading to significant speedups over the CSR and Inspector-Executor CSR SpMV kernels available in the latest release of the Intel MKL library.Comment: 10 pages, 7 figures, ICPP 201

    A Library for Pattern-based Sparse Matrix Vector Multiply

    Get PDF
    Pattern-based Representation (PBR) is a novel approach to improving the performance of Sparse Matrix-Vector Multiply (SMVM) numerical kernels. Motivated by our observation that many matrices can be divided into blocks that share a small number of distinct patterns, we generate custom multiplication kernels for frequently recurring block patterns. The resulting reduction in index overhead significantly reduces memory bandwidth requirements and improves performance. Unlike existing methods, PBR requires neither detection of dense blocks nor zero filling, making it particularly advantageous for matrices that lack dense nonzero concentrations. SMVM kernels for PBR can benefit from explicit prefetching and vectorization, and are amenable to parallelization. The analysis and format conversion to PBR is implemented as a library, making it suitable for applications that generate matrices dynamically at runtime. We present sequential and parallel performance results for PBR on two current multicore architectures, which show that PBR outperforms available alternatives for the matrices to which it is applicable, and that the analysis and conversion overhead is amortized in realistic application scenarios

    GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

    Get PDF
    While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.Comment: 32 pages, 11 figure

    Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA

    Get PDF
    Sparse Matrix Vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with block diagonal matrix which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture combined with the effectiveness of the proposed customisation method reduces BRAM resource utilisation by as much as 10 times, while achieving identical throughput with existing state of the art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, enabling the applicability of FPGAs to more interesting HPC problems

    Simultaneously Sparse Solutions to Linear Inverse Problems with Multiple System Matrices and a Single Observation Vector

    Full text link
    A linear inverse problem is proposed that requires the determination of multiple unknown signal vectors. Each unknown vector passes through a different system matrix and the results are added to yield a single observation vector. Given the matrices and lone observation, the objective is to find a simultaneously sparse set of unknown vectors that solves the system. We will refer to this as the multiple-system single-output (MSSO) simultaneous sparsity problem. This manuscript contrasts the MSSO problem with other simultaneous sparsity problems and conducts a thorough initial exploration of algorithms with which to solve it. Seven algorithms are formulated that approximately solve this NP-Hard problem. Three greedy techniques are developed (matching pursuit, orthogonal matching pursuit, and least squares matching pursuit) along with four methods based on a convex relaxation (iteratively reweighted least squares, two forms of iterative shrinkage, and formulation as a second-order cone program). The algorithms are evaluated across three experiments: the first and second involve sparsity profile recovery in noiseless and noisy scenarios, respectively, while the third deals with magnetic resonance imaging radio-frequency excitation pulse design.Comment: 36 pages; manuscript unchanged from July 21, 2008, except for updated references; content appears in September 2008 PhD thesi
    • …
    corecore