Algorithm Architecture Co-design for Dense and Sparse Matrix Computations
abstract: With the end of Dennard scaling and Moore's law, architects have moved towards
heterogeneous designs consisting of specialized cores to achieve higher performance
and energy efficiency for a target application domain. Applications of linear algebra
are ubiquitous in scientific computing, machine learning, statistics, and related
fields, and matrix computations are fundamental to these linear-algebra-based solutions.
Designing multiple dense (or sparse) matrix computation routines for the
same platform is challenging. Adding to the complexity, dense
and sparse matrix computations differ greatly in their storage and access
patterns, which makes them difficult to optimize on the same architecture. This thesis addresses
this challenge and introduces a reconfigurable accelerator that supports both dense
and sparse matrix computations efficiently.
The reconfigurable architecture has been optimized to execute the following linear
algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM
(Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver),
LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication), and
SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture where
each core consists of a 2D array of processing elements (PEs).
The 2D array of PEs is of size 4x4 and is scheduled to perform 4x4 sized matrix
updates efficiently. A sequence of such updates is used to solve a larger problem inside
a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR)
is used to perform sparse kernel updates. Scalable partitioning and mapping schemes
are presented that map input matrices of any given size to the multicore architecture.
Design trade-offs related to the PE array dimension, size of local memory inside a core
and the bandwidth between on-chip memories and the cores have been presented. An
optimal core configuration is developed from this analysis. Synthesis results using a
7nm PDK show that the proposed accelerator can achieve a performance of up to
32 GOPS using a single core.
Dissertation/Thesis, Masters Thesis, Computer Engineering 201
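To illustrate how a sequence of 4x4 updates can solve a larger dense problem, here is a minimal software sketch of a tiled GEMM. It is not the thesis's hardware schedule: the function name `tiled_gemm`, the loop order, and the handling of ragged edge tiles are assumptions made for illustration only.

```python
T = 4  # tile edge, chosen to match the 4x4 PE array in the abstract


def tiled_gemm(A, B):
    """Compute C = A @ B as a sequence of (at most) 4x4 block updates.

    A and B are lists of lists. Edge tiles smaller than 4x4 are handled
    by clamping the loop bounds (an assumption, not taken from the thesis).
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, T):          # tile row of C
        for j0 in range(0, m, T):      # tile column of C
            for k0 in range(0, k, T):  # tile along the reduction dimension
                # One block update: C_tile += A_tile @ B_tile
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, m)):
                        acc = C[i][j]
                        for kk in range(k0, min(k0 + T, k)):
                            acc += A[i][kk] * B[kk][j]
                        C[i][j] = acc
    return C
```

Each pass through the three innermost loops touches only one 4x4 tile of each operand, which is the kind of working set a small PE array with local memory could hold.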
On Optimal Partitioning For Sparse Matrices In Variable Block Row Format
The Variable Block Row (VBR) format is an influential blocked sparse matrix
format designed to represent shared sparsity structure between adjacent rows
and columns. VBR consists of groups of adjacent rows and columns, storing the
resulting blocks that contain nonzeros in a dense format. This reduces the
memory footprint and enables optimizations such as register blocking and
instruction-level parallelism. Existing approaches use heuristics to determine
which rows and columns should be grouped together. We adapt and optimize a
dynamic programming algorithm for sequential hypergraph partitioning to produce
a linear time algorithm which can determine the optimal partition of rows under
an expressive cost model, assuming the column partition remains fixed.
Furthermore, we show that the problem of determining an optimal partition for
the rows and columns simultaneously is NP-Hard under a simple linear cost
model.
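The idea of choosing a contiguous row partition by dynamic programming can be sketched as follows. This is a simple quadratic-style DP, not the linear-time algorithm the abstract describes. The cost model (a fixed `overhead` per group plus rows times distinct columns, i.e. the dense storage of the group) and the `max_group` limit are hypothetical choices for illustration, and columns are left ungrouped.

```python
def optimal_row_partition(row_cols, max_group=8, overhead=1):
    """Return (cost, boundaries) for the cheapest contiguous row grouping.

    row_cols[i] is the set of column indices holding nonzeros in row i.
    Assumed cost of a group: overhead + (#rows) * (#distinct columns).
    boundaries is a list [0, b1, ..., n]; group g spans rows b[g]..b[g+1]-1.
    """
    n = len(row_cols)
    best = [0] + [float("inf")] * n  # best[i]: min cost for rows 0..i-1
    prev = [0] * (n + 1)             # prev[i]: start of the last group
    for i in range(1, n + 1):
        cols = set()
        # Grow the last group backwards from row i-1, at most max_group rows.
        for j in range(i - 1, max(i - max_group, 0) - 1, -1):
            cols |= row_cols[j]
            cost = best[j] + overhead + (i - j) * len(cols)
            if cost < best[i]:
                best[i], prev[i] = cost, j
    # Walk the prev pointers back to recover the group boundaries.
    bounds = [n]
    while bounds[-1] > 0:
        bounds.append(prev[bounds[-1]])
    return best[n], bounds[::-1]


rows = [{0, 1}, {0, 1}, {5}, {5, 6}]
print(optimal_row_partition(rows, overhead=2))  # (12, [0, 2, 4])
```

With a per-group overhead of 2, merging rows that share most of their columns is cheaper than storing them separately, so the sketch groups rows 0-1 and rows 2-3.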
To evaluate our algorithm empirically against existing heuristics, we
introduce the 1D-VBR format, a specialization of VBR format where columns are
left ungrouped. We evaluate our algorithms on all 1626 real-valued matrices in
the SuiteSparse Matrix Collection. When asked to minimize an empirically
derived cost model for a sparse matrix-vector multiplication kernel, our
algorithm produced partitions whose 1D-VBR realizations achieve a speedup of at
least 1.18 over an unblocked kernel on 25% of the matrices, and a speedup of at
least 1.59 on 12.5% of the matrices. The 1D-VBR representation produced by our
algorithm had faster SpMVs than the 1D-VBR representations produced by any
existing heuristics on 87.8% of the test matrices.
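As a rough picture of what a 1D-VBR sparse matrix-vector product might look like, the sketch below groups adjacent rows and stores each group as a small dense block over the union of its nonzero columns. The in-memory layout (a list of `(first_row, columns, block)` tuples) and the helper names `to_1d_vbr` and `spmv_1d_vbr` are assumptions for illustration, not the paper's data structure or kernel.

```python
def to_1d_vbr(dense, bounds):
    """Build a hypothetical 1D-VBR layout from a dense matrix.

    bounds = [0, b1, ..., n] gives the row groups; columns stay ungrouped.
    Each group stores its nonzero columns once, plus a dense block that
    may contain explicit zeros where rows do not share a column.
    """
    groups = []
    for lo, hi in zip(bounds, bounds[1:]):
        cols = sorted({j for i in range(lo, hi)
                       for j, v in enumerate(dense[i]) if v != 0})
        block = [[dense[i][j] for j in cols] for i in range(lo, hi)]
        groups.append((lo, cols, block))
    return groups


def spmv_1d_vbr(groups, x, n_rows):
    """Compute y = A @ x from the grouped layout."""
    y = [0] * n_rows
    for lo, cols, block in groups:
        xs = [x[j] for j in cols]  # gather x once, reuse for every row in the group
        for r, row in enumerate(block):
            y[lo + r] = sum(a * b for a, b in zip(row, xs))
    return y


A = [[1, 2, 0, 0],
     [3, 4, 0, 0],
     [0, 0, 5, 0],
     [0, 0, 6, 7]]
groups = to_1d_vbr(A, [0, 2, 4])
print(spmv_1d_vbr(groups, [1, 1, 1, 1], 4))  # [3, 7, 5, 13]
```

The gather of `x` happens once per row group rather than once per row, which is one way blocking can reduce index traffic; the price is the explicit zero stored in the second block.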