21 research outputs found
Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors
This paper presents a low-overhead optimizer for the ubiquitous sparse
matrix-vector multiplication (SpMV) kernel. Architectural diversity among
different processors together with structural diversity among different sparse
matrices leads to bottleneck diversity. This justifies an SpMV optimizer that is
both matrix- and architecture-adaptive through runtime specialization. To this
end, we present an approach that first identifies the performance
bottlenecks of SpMV for a given sparse matrix on the target platform either
through profiling or by matrix property inspection, and then selects suitable
optimizations to tackle those bottlenecks. Our optimization pool is based on
the widely used Compressed Sparse Row (CSR) sparse matrix storage format and
has low preprocessing overheads, making our overall approach practical even in
cases where fast decision making and optimization setup are required. We
evaluate our optimizer on three x86-based computing platforms and demonstrate
that it is able to distinguish and appropriately optimize SpMV for the majority
of matrices in a representative test suite, leading to significant speedups
over the CSR and Inspector-Executor CSR SpMV kernels available in the latest
release of the Intel MKL library.
Comment: 10 pages, 7 figures, ICPP 201
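For readers unfamiliar with the baseline this abstract builds on, here is a minimal sketch of a CSR SpMV kernel together with one simple "matrix property inspection". It is not the paper's optimizer: the function names, the imbalance metric, and the 3x3 example matrix are all assumptions chosen for illustration.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero values).
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect access to x is the irregular part of SpMV.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def row_length_imbalance(row_ptr):
    """Hypothetical inspection metric: longest row / mean row length.

    Values well above 1 hint that a row-wise work split may be
    unbalanced. This is an illustrative heuristic, not the paper's.
    """
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    mean = sum(lengths) / len(lengths)
    return max(lengths) / mean if mean else 0.0


# Example matrix (an assumption for illustration):
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 5, 6]]
ROW_PTR = [0, 2, 3, 6]
COL_IDX = [0, 2, 1, 0, 1, 2]
VALUES = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

if __name__ == "__main__":
    print(csr_spmv(ROW_PTR, COL_IDX, VALUES, [1.0, 1.0, 1.0]))
    print(row_length_imbalance(ROW_PTR))
```

The inspection step only reads row_ptr, which is one reason such checks can be cheap relative to running the kernel itself.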
CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication
Sparse matrix-vector multiplication (SpMV) is a fundamental building block
for numerous applications. In this paper, we propose CSR5 (Compressed Sparse
Row 5), a new storage format, which offers high-throughput SpMV on various
platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is
insensitive to the sparsity structure of the input matrix. Thus the single
format can support an SpMV algorithm that is efficient both for regular
matrices and for irregular matrices. Furthermore, we show that the overhead of
the format conversion from CSR to CSR5 can be as low as the cost of a
few SpMV operations. We compare the CSR5-based SpMV algorithm with 11
state-of-the-art formats and algorithms on four mainstream processors using 14
regular and 10 irregular matrices as a benchmark suite. For the 14 regular
matrices in the suite, we achieve comparable or better performance over the
previous work. For the 10 irregular matrices, CSR5 obtains average
performance improvements of 17.6%, 28.5%, 173.0% and 293.3% (up to 213.3%,
153.6%, 405.1% and 943.3%) over the best existing work on dual-socket Intel
CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi, respectively. For
real-world applications such as a solver with only tens of iterations, the CSR5
format can be more practical because of its low overhead for format conversion.
The source code of this work is downloadable at
https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR5
Comment: 12 pages, 10 figures, In Proceedings of the 29th ACM International
Conference on Supercomputing (ICS '15)
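The abstract's point about short-running solvers is an amortization argument: a one-time conversion cost pays off only if enough SpMV iterations follow. A rough sketch of that arithmetic is below. The function name and all timing numbers are hypothetical and are not measurements from the paper.

```python
import math


def break_even_iterations(conversion_cost, t_baseline, t_converted):
    """Smallest iteration count n at which converting is no worse.

    Solves conversion_cost + n * t_converted <= n * t_baseline for n.
    All arguments are times in the same (arbitrary) unit. Returns None
    when the converted format is not faster per iteration, since the
    conversion can then never pay off.
    """
    saving_per_iteration = t_baseline - t_converted
    if saving_per_iteration <= 0:
        return None
    return math.ceil(conversion_cost / saving_per_iteration)


if __name__ == "__main__":
    # Hypothetical numbers: conversion costs as much as 3 baseline SpMVs,
    # and the converted kernel takes half the time per iteration.
    print(break_even_iterations(3.0, 1.0, 0.5))
```

With these assumed numbers the conversion breaks even after 6 iterations, so a solver running tens of iterations would come out ahead. A costlier conversion would push the break-even point proportionally higher.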
Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads
Sparse matrices are the key ingredients of several application domains, from
scientific computation to machine learning. The primary challenge with sparse
matrices has been efficiently storing and transferring data, for which many
sparse formats have been proposed to significantly eliminate zero entries. Such
formats, essentially designed to optimize memory footprint, may not be as
successful in performing faster processing. In other words, although they allow
faster data transfer and improve memory bandwidth utilization -- the classic
challenge of sparse problems -- their decompression mechanism can potentially
create a computation bottleneck. Not only is this challenge unresolved, but it
also becomes more serious with the advent of domain-specific architectures
(DSAs), as they intend to more aggressively improve performance. The
performance implications of using various formats along with DSAs, however, have
not been extensively studied by prior work. To fill this knowledge gap, we
characterize the impact of using seven frequently used sparse formats on
performance, based on a DSA for sparse matrix-vector multiplication (SpMV),
implemented on an FPGA using high-level synthesis (HLS) tools, a growing and
popular method for developing DSAs. Seeking a fair comparison, we tailor and
optimize the HLS implementation of decompression for each format. We thoroughly
explore diverse metrics, including decompression overhead, latency, balance
ratio, throughput, memory bandwidth utilization, resource utilization, and
power consumption, on a variety of real-world and synthetic sparse workloads.
Comment: 11 pages, 14 figures, 2 tables
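To make "decompression" concrete, the sketch below expands a CSR matrix into explicit (row, column, value) triples. The abstract does not name its seven formats or describe its hardware decompressors, so the choice of CSR and the function name here are assumptions for illustration only.

```python
def csr_to_triples(row_ptr, col_idx, values):
    """Decompress CSR into explicit (row, col, value) triples.

    CSR stores row membership implicitly in row_ptr, so recovering the
    row index of each nonzero requires walking the row_ptr ranges. That
    walk is the decompression work a consumer must perform before it can
    use the coordinates.
    """
    triples = []
    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            triples.append((row, col_idx[k], values[k]))
    return triples


if __name__ == "__main__":
    # Example matrix (an assumption for illustration):
    # [[1, 0, 2],
    #  [0, 3, 0],
    #  [4, 5, 6]]
    print(csr_to_triples([0, 2, 3, 6],
                         [0, 2, 1, 0, 1, 2],
                         [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))
```

The trade-off is visible even at this scale: CSR needs 4 row pointers for 6 nonzeros, while the triple form needs 6 explicit row indices but requires no pointer walk to interpret.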