Search CORE

3 research outputs found

On Memory Footprints of Partitioned Sparse Matrices

Author
Publication venue: 'Polish Information Processing Society PTI'
Publication date
Field of study

Algorithm Architecture Co-design for Dense and Sparse Matrix Computations

Author
Publication venue
Publication date: 01/01/2018
Field of study

abstract: With the end of Dennard scaling and Moore's law, architects have moved towards heterogeneous designs consisting of specialized cores to achieve higher performance and energy efficiency for a target application domain. Applications of linear algebra are ubiquitous in the field of scientific computing, machine learning, statistics, etc. with matrix computations being fundamental to these linear algebra based solutions. Design of multiple dense (or sparse) matrix computation routines on the same platform is quite challenging. Added to the complexity is the fact that dense and sparse matrix computations have large differences in their storage and access patterns and are difficult to optimize on the same architecture. This thesis addresses this challenge and introduces a reconfigurable accelerator that supports both dense and sparse matrix computations efficiently. The reconfigurable architecture has been optimized to execute the following linear algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM (Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver), LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication), SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture where each core consists of a 2D array of processing elements (PE). The 2D array of PEs is of size 4x4 and is scheduled to perform 4x4 sized matrix updates efficiently. A sequence of such updates is used to solve a larger problem inside a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR) is used to perform sparse kernel updates. Scalable partitioning and mapping schemes are presented that map input matrices of any given size to the multicore architecture. Design trade-offs related to the PE array dimension, size of local memory inside a core and the bandwidth between on-chip memories and the cores have been presented. An optimal core configuration is developed from this analysis. Synthesis results using a 7nm PDK show that the proposed accelerator can achieve a performance of upto 32 GOPS using a single core.Dissertation/ThesisMasters Thesis Computer Engineering 201

ASU Digital Repository

On Optimal Partitioning For Sparse Matrices In Variable Block Row Format

Author: Ahrens Peter
Boman Erik G.
Publication venue
Publication date: 25/05/2020
Field of study

The Variable Block Row (VBR) format is an influential blocked sparse matrix format designed to represent shared sparsity structure between adjacent rows and columns. VBR consists of groups of adjacent rows and columns, storing the resulting blocks that contain nonzeros in a dense format. This reduces the memory footprint and enables optimizations such as register blocking and instruction-level parallelism. Existing approaches use heuristics to determine which rows and columns should be grouped together. We adapt and optimize a dynamic programming algorithm for sequential hypergraph partitioning to produce a linear time algorithm which can determine the optimal partition of rows under an expressive cost model, assuming the column partition remains fixed. Furthermore, we show that the problem of determining an optimal partition for the rows and columns simultaneously is NP-Hard under a simple linear cost model. To evaluate our algorithm empirically against existing heuristics, we introduce the 1D-VBR format, a specialization of VBR format where columns are left ungrouped. We evaluate our algorithms on all 1626 real-valued matrices in the SuiteSparse Matrix Collection. When asked to minimize an empirically derived cost model for a sparse matrix-vector multiplication kernel, our algorithm produced partitions whose 1D-VBR realizations achieve a speedup of at least 1.18 over an unblocked kernel on 25% of the matrices, and a speedup of at least 1.59 on 12.5% of the matrices. The 1D-VBR representation produced by our algorithm had faster SpMVs than the 1D-VBR representations produced by any existing heuristics on 87.8% of the test matrices

arXiv.org e-Print Archive