4 research outputs found
Optimization of SpGEMM with Risc-V vector instructions
The Sparse GEneral Matrix-Matrix multiplication (SpGEMM) is
a fundamental routine extensively used in domains like machine learning or
graph analytics. Despite its relevance, the efficient execution of SpGEMM on
vector architectures is a relatively unexplored topic. The most recent
algorithm to run SpGEMM on these architectures is based on the SParse
Accumulator (SPA) approach, and it is relatively efficient for sparse matrices
featuring several tens of non-zero coefficients per column, as it computes the
columns of the output matrix C one by one. However, when dealing with matrices containing just a few
non-zero coefficients per column, the state-of-the-art algorithm cannot
fully exploit long vector architectures when computing the SpGEMM kernel. To
overcome this issue, we propose the SPA paRallel with Sorting (SPARS) algorithm,
which, among other optimizations, computes several columns of C in parallel, and the
HASH algorithm, which uses dynamically sized hash tables to store intermediate
output values. To combine the efficiency of SPA for relatively dense matrix
blocks with the high performance that SPARS and HASH deliver for very sparse
matrix blocks, we propose H-SPA(t) and H-HASH(t), which dynamically switch
between different algorithms. H-SPA(t) and H-HASH(t) obtain average speed-ups of
1.24× and 1.57×, respectively, over SPA on a set of
40 sparse matrices obtained from the SuiteSparse Matrix Collection. For the 22
sparsest matrices, H-SPA(t) and H-HASH(t) deliver average speed-ups of 1.42× and
1.99×, respectively.
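As an illustration of the SPA approach this abstract builds on, the following Python sketch computes C = A·B one column at a time using a dense accumulator. It is a scalar, minimal sketch and not the authors' vectorized RISC-V implementation; the CSC array layout, the function name `spgemm_spa`, and its parameter names are assumptions made for this example.

```python
def spgemm_spa(n_rows, a_colptr, a_rowidx, a_vals, b_colptr, b_rowidx, b_vals):
    """Multiply two CSC matrices, producing C = A @ B in CSC form.

    n_rows is the number of rows of A (and therefore of C).
    Hypothetical helper for illustration only.
    """
    n_cols_b = len(b_colptr) - 1
    spa_vals = [0.0] * n_rows      # dense accumulator for one column of C
    spa_flag = [False] * n_rows    # marks which accumulator rows are in use
    c_colptr, c_rowidx, c_vals = [0], [], []

    for j in range(n_cols_b):                          # one column of C at a time
        touched = []                                   # rows hit in this column
        for p in range(b_colptr[j], b_colptr[j + 1]):  # non-zeros B[k, j]
            k, b_kj = b_rowidx[p], b_vals[p]
            for q in range(a_colptr[k], a_colptr[k + 1]):  # non-zeros A[i, k]
                i = a_rowidx[q]
                if not spa_flag[i]:
                    spa_flag[i] = True
                    touched.append(i)
                spa_vals[i] += a_vals[q] * b_kj
        for i in sorted(touched):                      # gather and reset the SPA
            c_rowidx.append(i)
            c_vals.append(spa_vals[i])
            spa_vals[i] = 0.0
            spa_flag[i] = False
        c_colptr.append(len(c_rowidx))
    return c_colptr, c_rowidx, c_vals


# A = [[1, 0], [2, 3]] and B = [[4, 0], [0, 5]], both stored in CSC.
c_colptr, c_rowidx, c_vals = spgemm_spa(
    2, [0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [0, 1, 2], [0, 1], [4.0, 5.0]
)
```

When each column holds only a few non-zeros, the inner loops in this sketch are very short, which is the situation the abstract identifies as a poor fit for long vector architectures.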
SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations
Important workloads, such as machine learning and graph analytics
applications, heavily involve sparse linear algebra operations. These
operations use sparse matrix compression as an effective means to avoid storing
zeros and performing unnecessary computation on zero elements. However,
compression techniques like Compressed Sparse Row (CSR) that are widely used
today introduce significant instruction overhead and expensive pointer-chasing
operations to discover the positions of the non-zero elements. In this paper,
we identify the discovery of the positions (i.e., indexing) of non-zero
elements as a key bottleneck in sparse matrix-based workloads, which greatly
reduces the benefits of compression. We propose SMASH, a hardware-software
cooperative mechanism that enables highly efficient indexing and storage of
sparse matrices. The key idea of SMASH is to explicitly enable the hardware to
recognize and exploit sparsity in data. To this end, we devise a novel software
encoding based on a hierarchy of bitmaps. This encoding can be used to
efficiently compress any sparse matrix, regardless of the extent and structure
of sparsity. At the same time, the bitmap encoding can be directly interpreted
by the hardware. We design a lightweight hardware unit, the Bitmap Management
Unit (BMU), that buffers and scans the bitmap hierarchy to perform
highly efficient indexing of sparse matrices. SMASH exposes an expressive and
rich ISA to communicate with the BMU, which enables its use in accelerating any
sparse matrix computation. We demonstrate the benefits of SMASH on four use
cases that include sparse matrix kernels and graph analytics applications.
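To make the idea of a hierarchy of bitmaps more concrete, the Python sketch below encodes a flattened matrix and then locates its non-zero blocks by scanning from the coarsest bitmap downward. It is a simplified software illustration, not the SMASH encoding or the BMU itself; the function names and the `block_size`, `fanout`, and `levels` parameters are assumptions introduced for this example.

```python
def build_bitmap_hierarchy(values, block_size, fanout, levels):
    """Return (bitmaps, nonzero_blocks); bitmaps[0] is the finest level.

    Level 0 holds one bit per block of `block_size` consecutive elements.
    Each higher level holds one bit per group of `fanout` bits below it.
    """
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    level0 = [any(v != 0 for v in blk) for blk in blocks]
    # Only blocks containing at least one non-zero element are stored.
    nonzero_blocks = [blk for blk, bit in zip(blocks, level0) if bit]
    bitmaps = [level0]
    for _ in range(levels - 1):
        prev = bitmaps[-1]
        bitmaps.append([any(prev[i:i + fanout]) for i in range(0, len(prev), fanout)])
    return bitmaps, nonzero_blocks


def nonzero_block_indices(bitmaps, fanout):
    """Scan top-down, descending only beneath set bits."""
    candidates = [i for i, bit in enumerate(bitmaps[-1]) if bit]
    for level in range(len(bitmaps) - 2, -1, -1):
        bitmap = bitmaps[level]
        next_candidates = []
        for parent in candidates:
            start = parent * fanout
            for child in range(start, min(start + fanout, len(bitmap))):
                if bitmap[child]:
                    next_candidates.append(child)
        candidates = next_candidates
    return candidates


# Sixteen elements with two non-zeros, grouped into blocks of two.
values = [0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0]
bitmaps, stored = build_bitmap_hierarchy(values, block_size=2, fanout=2, levels=3)
indices = nonzero_block_indices(bitmaps, fanout=2)
```

In this sketch, regions whose higher-level bit is clear are skipped without inspecting the bits beneath them, which is the intuition behind indexing through a bitmap hierarchy rather than chasing pointers.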
Algebraic, Block and Multiplicative Preconditioners based on Fast Tridiagonal Solves on GPUs
This thesis contributes to the field of sparse linear algebra, graph applications, and preconditioners for Krylov iterative solvers of sparse linear equation systems by providing a (block) tridiagonal solver library, a generalized sparse matrix-vector implementation, a linear forest extraction, and a multiplicative preconditioner based on tridiagonal solves. The tridiagonal library, which supports (scaled) partial pivoting, outperforms cuSPARSE's tridiagonal solver by a factor of five while completely utilizing the available GPU memory bandwidth. For performance-optimized solving with multiple right-hand sides, the explicit factorization of the tridiagonal matrix can be computed. The extraction of a weighted linear forest (a union of disjoint paths) from a general graph is used to build algebraic (block) tridiagonal preconditioners, and it deploys the generalized sparse matrix-vector implementation of this thesis for preconditioner construction. During linear forest extraction, a new parallel bidirectional scan pattern, which can operate on doubly linked list structures, identifies the path ID and the position of a vertex. The algebraic preconditioner construction is also used to build more advanced preconditioners, which contain multiple tridiagonal factors, based on generalized ILU factorizations. Additionally, other preconditioners based on tridiagonal factors are presented and evaluated in comparison to ILU and ILU incomplete sparse approximate inverse preconditioners (ILU-ISAI) for the solution of large sparse linear equation systems from the Sparse Matrix Collection. For every problem presented in this thesis, an efficient parallel algorithm and its CUDA implementation for single-GPU systems are provided.
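For readers unfamiliar with tridiagonal solves, the Python sketch below shows the classic sequential Thomas algorithm. It is only a reference point under stated assumptions: it performs no pivoting and is not parallel, so it is not the thesis's GPU solver, which supports (scaled) partial pivoting. The function name `solve_tridiagonal` and the array convention (`lower[0]` and `upper[-1]` unused) are choices made for this example.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve T x = rhs for a tridiagonal T using the Thomas algorithm.

    lower[i] is T[i, i-1], diag[i] is T[i, i], upper[i] is T[i, i+1].
    All four lists have length n. No pivoting is performed, so this
    sketch assumes the elimination never hits a zero pivot (for
    example, a diagonally dominant matrix).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination.
    if n > 1:
        c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# T = [[2, 1, 0], [1, 2, 1], [0, 1, 2]] with exact solution x = [1, 2, 3].
x = solve_tridiagonal([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [4.0, 8.0, 8.0])
```

The forward sweep in this sketch has a strict dependency from row i-1 to row i, which is why GPU solvers such as the one described in the abstract need a different, parallel formulation.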