Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads
Sparse matrices are the key ingredients of several application domains, from
scientific computation to machine learning. The primary challenge with sparse
matrices has been efficiently storing and transferring data, for which many
sparse formats have been proposed to significantly eliminate zero entries. Such
formats, essentially designed to optimize memory footprint, may not be as
successful in performing faster processing. In other words, although they allow
faster data transfer and improve memory bandwidth utilization -- the classic
challenge of sparse problems -- their decompression mechanism can potentially
create a computation bottleneck. Not only is this challenge unresolved, but it
also becomes more serious with the advent of domain-specific architectures
(DSAs), as they intend to more aggressively improve performance. The
performance implications of using various formats along with DSAs, however, have
not been extensively studied by prior work. To fill this gap of knowledge, we
characterize the impact of using seven frequently used sparse formats on
performance, based on a DSA for sparse matrix-vector multiplication (SpMV),
implemented on an FPGA using high-level synthesis (HLS) tools, a growing and
popular method for developing DSAs. Seeking a fair comparison, we tailor and
optimize the HLS implementation of decompression for each format. We thoroughly
explore diverse metrics, including decompression overhead, latency, balance
ratio, throughput, memory bandwidth utilization, resource utilization, and
power consumption, on a variety of real-world and synthetic sparse workloads.
Comment: 11 pages, 14 figures, 2 tables
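
To make the "decompression" step discussed in this abstract concrete, here is a minimal sketch of SpMV over one common sparse format, Compressed Sparse Row (CSR). It is plain Python for illustration only, not the paper's HLS implementation, and the abstract does not say which seven formats were studied. The function names `dense_to_csr` and `spmv_csr` are hypothetical.

```python
def dense_to_csr(dense):
    """Compress a dense matrix (list of rows) into CSR arrays.

    Returns (values, col_idx, row_ptr); row i's non-zeros live in
    values[row_ptr[i]:row_ptr[i + 1]].
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)   # keep only non-zero entries
                col_idx.append(j)  # remember the column of each one
        row_ptr.append(len(values))  # running count marks the row boundary
    return values, col_idx, row_ptr


def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x with A given in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # "Decompression": every non-zero requires an index lookup
        # (col_idx[k]) before the multiply can be issued.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


if __name__ == "__main__":
    A = [[1, 0, 2],
         [0, 0, 0],
         [0, 3, 0]]
    vals, cols, ptrs = dense_to_csr(A)
    print(vals, cols, ptrs)
    print(spmv_csr(vals, cols, ptrs, [1, 2, 3]))
```

The indirect access `x[col_idx[k]]` in the inner loop is the kind of per-element index decoding that saves memory traffic but adds work on the compute side.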
SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations
Important workloads, such as machine learning and graph analytics
applications, heavily involve sparse linear algebra operations. These
operations use sparse matrix compression as an effective means to avoid storing
zeros and performing unnecessary computation on zero elements. However,
compression techniques like Compressed Sparse Row (CSR) that are widely used
today introduce significant instruction overhead and expensive pointer-chasing
operations to discover the positions of the non-zero elements. In this paper,
we identify the discovery of the positions (i.e., indexing) of non-zero
elements as a key bottleneck in sparse matrix-based workloads, which greatly
reduces the benefits of compression. We propose SMASH, a hardware-software
cooperative mechanism that enables highly-efficient indexing and storage of
sparse matrices. The key idea of SMASH is to explicitly enable the hardware to
recognize and exploit sparsity in data. To this end, we devise a novel software
encoding based on a hierarchy of bitmaps. This encoding can be used to
efficiently compress any sparse matrix, regardless of the extent and structure
of sparsity. At the same time, the bitmap encoding can be directly interpreted
by the hardware. We design a lightweight hardware unit, the Bitmap Management
Unit (BMU), that buffers and scans the bitmap hierarchy to perform
highly-efficient indexing of sparse matrices. SMASH exposes an expressive and
rich ISA to communicate with the BMU, which enables its use in accelerating any
sparse matrix computation. We demonstrate the benefits of SMASH on four use
cases that include sparse matrix kernels and graph analytics applications