4,438 research outputs found
Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors
This paper presents a low-overhead optimizer for the ubiquitous sparse
matrix-vector multiplication (SpMV) kernel. Architectural diversity among
different processors together with structural diversity among different sparse
matrices lead to bottleneck diversity. This justifies an SpMV optimizer that is
both matrix- and architecture-adaptive through runtime specialization. To this
direction, we present an approach that first identifies the performance
bottlenecks of SpMV for a given sparse matrix on the target platform either
through profiling or by matrix property inspection, and then selects suitable
optimizations to tackle those bottlenecks. Our optimization pool is based on
the widely used Compressed Sparse Row (CSR) sparse matrix storage format and
has low preprocessing overheads, making our overall approach practical even in
cases where fast decision making and optimization setup is required. We
evaluate our optimizer on three x86-based computing platforms and demonstrate
that it is able to distinguish and appropriately optimize SpMV for the majority
of matrices in a representative test suite, leading to significant speedups
over the CSR and Inspector-Executor CSR SpMV kernels available in the latest
release of the Intel MKL library.Comment: 10 pages, 7 figures, ICPP 201
A Library for Pattern-based Sparse Matrix Vector Multiply
Pattern-based Representation (PBR) is a novel approach to improving the performance of Sparse Matrix-Vector Multiply (SMVM) numerical kernels. Motivated by our observation that many matrices can be divided into blocks that share a small number of distinct patterns, we generate custom multiplication kernels for frequently recurring block patterns.
The resulting reduction in index overhead significantly reduces memory bandwidth requirements and improves performance. Unlike existing methods, PBR requires neither detection of dense blocks nor zero filling, making it particularly advantageous for matrices that lack dense nonzero concentrations. SMVM kernels for PBR can benefit from explicit prefetching and vectorization, and are amenable to parallelization. The analysis and format conversion to PBR is implemented as a library, making it suitable for applications that generate matrices dynamically at runtime. We present sequential and parallel performance results for PBR on two current multicore architectures, which show that PBR outperforms available alternatives for the matrices to which it is applicable,
and that the analysis and conversion overhead is amortized in realistic application scenarios
Parallel structurally-symmetric sparse matrix-vector products on multi-core processors
We consider the problem of developing an efficient multi-threaded
implementation of the matrix-vector multiplication algorithm for sparse
matrices with structural symmetry. Matrices are stored using the compressed
sparse row-column format (CSRC), designed for profiting from the symmetric
non-zero pattern observed in global finite element matrices. Unlike classical
compressed storage formats, performing the sparse matrix-vector product using
the CSRC requires thread-safe access to the destination vector. To avoid race
conditions, we have implemented two partitioning strategies. In the first one,
each thread allocates an array for storing its contributions, which are later
combined in an accumulation step. We analyze how to perform this accumulation
in four different ways. The second strategy employs a coloring algorithm for
grouping rows that can be concurrently processed by threads. Our results
indicate that, although incurring an increase in the working set size, the
former approach leads to the best performance improvements for most matrices.Comment: 17 pages, 17 figures, reviewed related work section, fixed typo
Value-Compressed Sparse Column (VCSC): Sparse Matrix Storage for Redundant Data
Compressed Sparse Column (CSC) and Coordinate (COO) are popular compression
formats for sparse matrices. However, both CSC and COO are general purpose and
cannot take advantage of any of the properties of the data other than sparsity,
such as data redundancy. Highly redundant sparse data is common in many machine
learning applications, such as genomics, and is often too large for in-core
computation using conventional sparse storage formats. In this paper, we
present two extensions to CSC: (1) Value-Compressed Sparse Column (VCSC) and
(2) Index- and Value-Compressed Sparse Column (IVCSC). VCSC takes advantage of
high redundancy within a column to further compress data up to 3-fold over COO
and 2.25-fold over CSC, without significant negative impact to performance
characteristics. IVCSC extends VCSC by compressing index arrays through delta
encoding and byte-packing, achieving a 10-fold decrease in memory usage over
COO and 7.5-fold decrease over CSC. Our benchmarks on simulated and real data
show that VCSC and IVCSC can be read in compressed form with little added
computational cost. These two novel compression formats offer a broadly useful
solution to encoding and reading redundant sparse data
- …