Runtime sparse matrix format selection
There exist many storage formats for the in-memory representation of sparse matrices. Choosing the format
that yields the quickest processing of any given sparse matrix requires considering the exact non-zero structure of
the matrix, as well as the current execution environment. Each of these factors can change at runtime. The matrix
structure can vary as computation progresses, while the environment can change due to varying system load, the live
migration of jobs across a heterogeneous cluster, etc. This paper describes an algorithm that learns at runtime how
to map sparse matrices onto the format which provides the quickest sparse matrix-vector product calculation, and
which can adapt to the hardware platform changing underfoot. We show multiplication times reduced by over 10%
compared with the best non-adaptive format selection.
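The idea of choosing a format at runtime can be illustrated with a minimal sketch. It is not the learning algorithm from the paper: it simply times the sparse matrix-vector product in two candidate formats (COO and CSR) on the current machine and keeps the faster one. The function names `to_csr`, `spmv_coo`, `spmv_csr`, and `select_format`, and the `repeats` parameter, are assumptions made for this example.

```python
import time

def to_csr(n_rows, triples):
    """Build CSR arrays (row_ptr, cols, vals) from (row, col, value) triples."""
    row_ptr = [0] * (n_rows + 1)
    for r, _, _ in triples:
        row_ptr[r + 1] += 1
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]
    cols = [0] * len(triples)
    vals = [0.0] * len(triples)
    cursor = list(row_ptr[:-1])  # next free slot in each row
    for r, c, v in triples:
        cols[cursor[r]], vals[cursor[r]] = c, v
        cursor[r] += 1
    return row_ptr, cols, vals

def spmv_coo(n_rows, triples, x):
    """y = A @ x with A stored as coordinate triples."""
    y = [0.0] * n_rows
    for r, c, v in triples:
        y[r] += v * x[c]
    return y

def spmv_csr(csr, x):
    """y = A @ x with A stored in CSR."""
    row_ptr, cols, vals = csr
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[cols[k]]
    return y

def select_format(n_rows, triples, x, repeats=5):
    """Time each candidate on this machine and return (fastest name, timings)."""
    csr = to_csr(n_rows, triples)
    candidates = {
        "coo": lambda: spmv_coo(n_rows, triples, x),
        "csr": lambda: spmv_csr(csr, x),
    }
    timings = {}
    for name, kernel in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel()
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return min(timings, key=timings.get), timings

# Example matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] as triples.
triples = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0), (2, 0, 4.0), (2, 2, 5.0)]
x = [1.0, 1.0, 1.0]
chosen, timings = select_format(3, triples, x)
```

Re-running `select_format` when the matrix structure or the host changes is the simplest way to adapt. A learned model, as the abstract describes, would avoid paying the timing cost on every change.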
Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors
This paper presents a low-overhead optimizer for the ubiquitous sparse
matrix-vector multiplication (SpMV) kernel. Architectural diversity among
different processors together with structural diversity among different sparse
matrices lead to bottleneck diversity. This justifies an SpMV optimizer that is
both matrix- and architecture-adaptive through runtime specialization. To this
end, we present an approach that first identifies the performance
bottlenecks of SpMV for a given sparse matrix on the target platform either
through profiling or by matrix property inspection, and then selects suitable
optimizations to tackle those bottlenecks. Our optimization pool is based on
the widely used Compressed Sparse Row (CSR) sparse matrix storage format and
has low preprocessing overheads, making our overall approach practical even in
cases where fast decision making and optimization setup are required. We
evaluate our optimizer on three x86-based computing platforms and demonstrate
that it is able to distinguish and appropriately optimize SpMV for the majority
of matrices in a representative test suite, leading to significant speedups
over the CSR and Inspector-Executor CSR SpMV kernels available in the latest
release of the Intel MKL library. Comment: 10 pages, 7 figures, ICPP 201
A Library for Pattern-based Sparse Matrix Vector Multiply
Pattern-based Representation (PBR) is a novel approach to improving the performance of Sparse Matrix-Vector Multiply (SMVM) numerical kernels. Motivated by our observation that many matrices can be divided into blocks that share a small number of distinct patterns, we generate custom multiplication kernels for frequently recurring block patterns.
The resulting reduction in index overhead significantly reduces memory bandwidth requirements and improves performance. Unlike existing methods, PBR requires neither detection of dense blocks nor zero filling, making it particularly advantageous for matrices that lack dense nonzero concentrations. SMVM kernels for PBR can benefit from explicit prefetching and vectorization, and are amenable to parallelization. The analysis and format conversion to PBR is implemented as a library, making it suitable for applications that generate matrices dynamically at runtime. We present sequential and parallel performance results for PBR on two current multicore architectures, which show that PBR outperforms available alternatives for the matrices to which it is applicable,
and that the analysis and conversion overhead is amortized in realistic application scenarios.
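A toy version of the pattern-based idea might look as follows. It is a sketch under stated assumptions, not the PBR library: the block size `B = 2`, the bitmask encoding of a pattern, and the names `to_pbr` and `pbr_spmv` are all invented for illustration. Blocks that share a non-zero pattern are grouped, so the positions of the non-zeros are decoded once per pattern instead of being stored per element.

```python
from collections import defaultdict

B = 2  # block edge length (an assumption for this sketch)

def to_pbr(triples):
    """Group BxB blocks by their non-zero pattern (a bitmask over block cells)."""
    blocks = defaultdict(dict)  # (block_row, block_col) -> {cell bit: value}
    for r, c, v in triples:
        blocks[(r // B, c // B)][(r % B) * B + (c % B)] = v
    groups = defaultdict(list)  # pattern mask -> [(block_row, block_col, values)]
    for (br, bc), cells in blocks.items():
        bits = sorted(cells)
        mask = sum(1 << b for b in bits)
        groups[mask].append((br, bc, [cells[b] for b in bits]))
    return groups

def pbr_spmv(groups, x, n_rows):
    """y = A @ x, running one specialized loop per distinct block pattern."""
    y = [0.0] * n_rows
    for mask, members in groups.items():
        # Decode the pattern once; every block in the group reuses these
        # offsets, so no per-element column index is stored or read.
        offsets = [(b // B, b % B) for b in range(B * B) if (mask >> b) & 1]
        for br, bc, vals in members:
            for (dr, dc), v in zip(offsets, vals):
                y[br * B + dr] += v * x[bc * B + dc]
    return y

# 4x4 example: a diagonal plus one extra entry at (0, 3).
triples = [(0, 0, 1.0), (1, 1, 2.0), (2, 2, 3.0), (3, 3, 4.0), (0, 3, 5.0)]
groups = to_pbr(triples)
y = pbr_spmv(groups, [1.0, 2.0, 3.0, 4.0], 4)
```

Here the two diagonal blocks share one pattern and the off-diagonal block has another, so only two patterns need a kernel. No zero filling is needed, because only stored non-zeros appear in `vals`.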
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack. Comment: 32 pages, 11 figures
Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA
Sparse Matrix Vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with a block diagonal matrix, which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture combined with the effectiveness of the proposed customisation method reduces BRAM resource utilisation by as much as 10 times, while matching the throughput of existing state-of-the-art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, making FPGAs applicable to more interesting HPC problems.
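The structure that a block diagonal matrix offers can be shown with a short software sketch. This is plain Python, not the FPGA design from the paper, and the name `block_diag_spmv` and the list-of-square-blocks layout are assumptions for illustration. Only the dense diagonal blocks are stored, and each block multiplies its own slice of the input vector.

```python
def block_diag_spmv(blocks, x):
    """y = A @ x where A is block diagonal and `blocks` lists its square blocks."""
    y = []
    offset = 0
    for blk in blocks:
        n = len(blk)                 # block is n x n (assumed square)
        xs = x[offset:offset + n]    # slice of x that this block touches
        y.extend(sum(a * b for a, b in zip(row, xs)) for row in blk)
        offset += n
    if offset != len(x):
        raise ValueError("block sizes do not match the vector length")
    return y

# A = diag([[1, 2], [3, 4]], [[5]]) applied to x = [1, 1, 2].
y = block_diag_spmv([[[1, 2], [3, 4]], [[5]]], [1, 1, 2])
```

No column indices are stored at all, and the blocks are independent of one another, so they can be processed in parallel or streamed one at a time.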
Simultaneously Sparse Solutions to Linear Inverse Problems with Multiple System Matrices and a Single Observation Vector
A linear inverse problem is proposed that requires the determination of
multiple unknown signal vectors. Each unknown vector passes through a different
system matrix and the results are added to yield a single observation vector.
Given the matrices and lone observation, the objective is to find a
simultaneously sparse set of unknown vectors that solves the system. We will
refer to this as the multiple-system single-output (MSSO) simultaneous sparsity
problem. This manuscript contrasts the MSSO problem with other simultaneous
sparsity problems and conducts a thorough initial exploration of algorithms
with which to solve it. Seven algorithms are formulated that approximately
solve this NP-Hard problem. Three greedy techniques are developed (matching
pursuit, orthogonal matching pursuit, and least squares matching pursuit) along
with four methods based on a convex relaxation (iteratively reweighted least
squares, two forms of iterative shrinkage, and formulation as a second-order
cone program). The algorithms are evaluated across three experiments: the first
and second involve sparsity profile recovery in noiseless and noisy scenarios,
respectively, while the third deals with magnetic resonance imaging
radio-frequency excitation pulse design. Comment: 36 pages; manuscript unchanged from July 21, 2008, except for updated
references; content appears in September 2008 PhD thesis