5 research outputs found
Computing the sparse matrix vector product using block-based kernels without zero padding on processors with AVX-512 instructions
The sparse matrix-vector product (SpMV) is a fundamental operation in many
scientific applications from various fields. The High Performance Computing
(HPC) community has therefore continuously invested a lot of effort to provide
an efficient SpMV kernel on modern CPU architectures. Although it has been
shown that block-based kernels help to achieve high performance, they are
difficult to use in practice because of the zero padding they require. In the
current paper, we propose new kernels using the AVX-512 instruction set, which
makes it possible to use a blocking scheme without any zero padding in the
matrix memory storage. We describe mask-based sparse matrix formats and their
corresponding SpMV kernels highly optimized in assembly language. Considering
that the optimal blocking size depends on the matrix, we also provide a method
to predict the best kernel to be used utilizing a simple interpolation of
results from previous executions. We compare the performance of our approach to
that of the Intel MKL CSR kernel and the CSR5 open-source package on a set of
standard benchmark matrices. We show that we can achieve significant
improvements in many cases, both for sequential and for parallel executions.
Finally, we provide the corresponding code in an open source library, called
SPC5.Comment: Published in Peer J C
Efficient sparse matrix multiple-vector multiplication using a bitmapped format
The problem of obtaining high computational throughput from sparse matrix multiple--vector multiplication routines is considered.
Current sparse matrix formats and algorithms have high bandwidth requirements and poor reuse of cache and register loaded entries, which restrict their performance.
We propose the mapped blocked row format: a bitmapped sparse matrix format that stores entries as blocks without a fill overhead, thereby offering blocking without additional storage and bandwidth overheads.
An efficient algorithm decodes bitmaps using de Bruijn sequences and minimizes the number of conditionals evaluated.
Performance is compared with that of popular formats, including vendor implementations of sparse BLAS.
Our sparse matrix multiple-vector multiplication algorithm achieves high throughput on all platforms and is implemented using platform neutral optimizations