Abstract-Accelerators for sparse matrix multiplication are important components in emerging systems. In this paper, we study the main challenges of accelerating Sparse Matrix Multiplication (SpMM). For the situations that data is not stored in the desired order (row/column order), we propose a compact high performance sparse format, which allows for random access to a dataset with low memory access overhead. We show that using this format results in a 14-49 times speedup for SpMM. Next, we propose a high performance systolic architecture for SpMM, which uses a mesh of comparators to locate the useful (non-zero) computation. This design maximizes data reuse by sharing the input data among a row/column of the mesh. We also show that, with similar memory access assumptions, the proposed architecture results in a 9-30 times speedup in comparison with the state of the art.
I. INTRODUCTION
In emerging data-inference applications the data-set is often large and sparse. Hardware accelerators can benefit from sparse formats for these datasets as those store only the nonzero elements, reducing the required storage and bandwidth. A common kernel in this context is sparse matrix multiplication (SpMM). This is used in many applications such as graph analysis, including breadth first search [8] and graph clustering [18] , in addition to its scientific applications in algebraic multigrid solving [3] and quantum modeling [17] . While there are many accelerators suggested in the literature for sparse matrix to vector (SpMV) multiplication [13] , [10] , [7] , there are only a few proposed accelerators for SpMM. Accelerating SpMM cannot always be simplified to SpMV acceleration. As the second operand of SpMM is a two dimensional matrix rather than a vector. This poses additional challenges:
Accessing data in two different orders: In SpMM, the first dataset (the multiplicand) is accessed in row order and the other (the multiplier) is accessed in column order. However, sparse data formats store the non-zeros in one order, say roworder. Accessing data in the other orders has a high cost in the number of memory accesses (MA) and it might not be practical to store the large datasets in both orders. For instance, suppose matrix A is stored in row order and appears as the first operand in A × B. In another matrix multiplication, the same matrix A might appear as the second operand, where it needs to be stored in column order.
Complexity of the SpMM algorithm: Designing a high performance SpMM algorithm is challenging. SpMM algo- 1 The author was with Princeton University when the work was done.
rithms should skip operations on zero elements. However, in unstructured sparse datasets, location of zeros is arbitrary. This makes it non-trivial to locate the non-zeros without incurring an overhead that overwhelms the benefit of skipping zeros. This challenge will be elaborated in Section II.
The first challenge, which is a special case of accessing sparse data in non-trivial patterns, has not been studied in the literature thus far. There are various high performance sparse formats proposed in the literature [12] , [16] , [4] , [7] . However, most of these are proposed for SpMV acceleration, and therefore, they do not address the non-trivial access challenge. On the architecture front, to the best of our knowledge, there is only one work to design a systolic-like structure for SpMM, the FPIC design [11] . In a conventional matrix multiplier (MM), there is a mesh of multiplier and accumulator (MAC) nodes and input data is shared among a row or a column of those nodes. This maximizes the reusability of data.
The previous FPIC SpMM design [11] does not have data sharing along the rows and columns, and each MAC node reads all its arguments directly from the inputs. To reduce memory bandwidth, these inputs are buffered at the rows and columns. This buffering limits the size of the SpMM unit and the lack of scalability increases the overall latency when it targets large matrices.
In addressing the above gaps, this paper makes the following contributions: (i) We propose the Indexed Compressed Row Storage (InCRS) format, a new row-based sparse format with reduced memory access for arbitrary access patterns. We show how using InCRS can speedup SpMM when the dataset is not stored in the desired order. (ii) We propose a high performance and scalable systolic SpMM. In the proposed architecture, input is shared among a row or a column of the nodes and allows for maximum data reuse.
We evaluate our design on a range of large and sparse datasets. Using InCRS format when the second dataset is not stored in row-order, can speedup SpMM ≈ 14-49 times. Further, our systolic SpMM can speedup matrix multiplication 9-30 times in comparison with the state of the art FPIC design.
II. COMPLEXITY OF NON-TRIVIAL DATA ACCESS IN SPARSE FORMATS
In matrix multiplication the first matrix is accessed in row order and the second in column order. Since the second matrix might be accessed in row order in another application, e.g., being the first operand of another matrix multiplication, we assume all matrices are stored in one order, which we assume is row order (assuming column order is fine too, and will just switch the direction of our solution). Sparse formats store the non-zeros in one order, say row order, and accessing data in other orders is complex. For instance, to read one column of data stored in a row-based format, many of the non-zeros of each row are accessed to locate the elements of that column.
In the following, we briefly study popular unstructured sparse formats and their cost of random data access.
A. Popular Unstructured Sparse Formats

1) ELLPACK:
One matrix stores the non-zero values of row i in its i th row and another matrix stores the column indices of the non-zeroes [14] . Table I compares the complexity of reading one arbitrary element in sparse formats, where M and N are the number of rows and columns. D is the density, i.e., the ratio of the number of non-zeros to the dataset size. Note that in conventional dense format, a single memory access is enough to read an arbitrary element. As Table I shows, CRS has amongst the least memory access requirements while it is also amongst the most compact formats. Further, previous work [9] has shown that CRS incurs the lowest number of memory accesses in many linear algebra operations. However, CRS still incurs a high number of memory accesses in comparison with the conventional dense format. Next, we propose a lowoverhead improvement for accessing arbitrary data in CRS. th location in the row. We propose the Indexed CRS (InCRS) format, which augments the CRS format with such information.
B. Cost of Non-Trivial Data Access
III. ACCELERATING NON-TRIVIAL
A. Indexed CRS (InCRS) Format
InCRS divides each row into sections of S elements, which are sub-divided into blocks of b elements. It stores information on the number of non-zeros inside the sections and blocks. To access B[i] [j] , this information is used to find the number of non-zeros that exist before the respective block to locate that block. This information is represented using countervectors, which are the addition to CRS. Each counter-vector is designed to be single word and stores information of one section. Therefore, to access the counter-vector of a section, only one memory access is required. Figure 1 shows an example of a row with 24 columns. In this setup, the row is divided to sections of 8 elements (S = 8) and each section is divided to blocks of 2 elements (b = 2). The first part of the counter-vector stores the total number of non-zeros that exist in the previous sections of that row and the rest of it stores information about the number of nonzeros inside each block in that section. In this example, the counter-vector indicates that there are between 5 to 7 number of non-zeros(NZs) located in row i and before column j = 13. 
To locate B[i][j], we first locate its counter-vector. B[i][j] belongs to the section number
B. Implementation of InCRS
In our implementation, the section size is 256 elements (S = 256) and the blocks are 32 elements (b = 32). In this implementation each counter-vector has 64 bits, where 8 × 6 bits are used for storing the non-zeros inside the 8 blocks. The remaining 16 bits of the counter-vector store the number of non-zeros before this section. This is based on the assumption that each row has maximum of 65k nonzeros, which is reasonable for a sparse dataset. Note that these parameters can be adjusted for a given dataset.
C. Accelerating SpMM Data Access
As mentioned before, the CRS format requires on average ≈ . This is the estimated reduction factor of memory accesses for reading one column of the dataset. Table II shows this estimated factor for the 5 large and sparse datasets that we used in our experiments. The datasets are further discussed in Section V. The memory access ratio indicates that the datasets with a larger number of non-zeros in each row benefit more from using InCRS.
The main cost of this improvement is the additional storage required for the counter-vectors. In our design, we require a 64bit counter-vector for each section (256 elements) of a row. Thus, this extra storage is 1 S .N.M words. Since the storage required for CRS is ≈ 2.M.N.D words, the ratio of the storage required for CRS format to the storage for InCRS format is ≈ 2D.S 2D.S+1 , which we refer to as storage ratio in Table II. As  the Table shows , for these datasets, the storage required for CRS varies between 0.99 and 0.88 of the storage required to store the same datasets in InCRS format. It is clear that by reducing the size of the blocks the storage overhead (small b is needed to fit the counter-vector in a word) and the expected benefit both increase.
IV. SYSTOLIC SPMM ARCHITECTURE
In this section, we propose a systolic sparse matrix multiplier which consists of a mesh of comparators located before the MAC units. The main advantages of this architecture are maximizing the data reuse and fast consumption of the operands. It also has a simple data access, avoiding extra buffering at the input ports, which makes the design scalable. In the long version of this paper [1], this design is described in detail and is compared with conventional MM and state of the art FPIC design [11] .
V. EXPERIMENTS
We evaluated acceleration of SpMM memory access when using the InCRS format. We also evaluated the proposed systolic SpMM design in the long version of this paper [1] .
A. Methodology
Here, our goal is to evaluate memory access acceleration when using the InCRS format for SpMM operation. For this, we used the Gem5 [2] tool, which allows for simulating the memory hierarchy. Table III summarizes 
B. SpMM Memory Access Acceleration using InCRS
As mentioned in Section III-C, the gain from using the InCRS format depends on the number of non-zeros per row. Therefore, we have selected datasets with a different average number of non-zeros per row ranging from 150 to 1400 per row (Table II) [5] , [15] . We select datasets such as graph and network applications, as well as bag of words and user ratings that are used by inference applications.
We resized our target datasets because of the slow simulation speed in the experiments. We simplified the first operand (in SpMM) to a vector since the focus of this work is the Fig. 2 : CRS vs. InCRS memory access ratio column-order memory access to the second operand. Also, simplifying this part of the computation is fair because roworder access to a matrix is the same for the CRS and the InCRS formats. To resize the second operand, we randomly removed a number of rows from the datasets. At the end, the resized datasets were larger than at-least twice the L2 size to avoid eliminating the possible L2 cache miss effects. Table II reports the sizes of the second operand after they are resized. We did not change or remove any of the columns of the second operand, since the columns' lengths and distributions of nonzeros are important factors in this comparison. Figure 2 summarizes the benefits of replacing CRS format by InCRS for SpMM. This figure shows the number of cache accesses, total memory access time, and total run time when using CRS format normalized to the same parameters when using InCRS. As the figure shows, the number of cache accesses decrease for all the datasets with InCRS. The ratios of memory access reduction for different datasets are close to our estimations in Table II . For instance, the L1 access is decreased 49 and 31 times for Belcastro and Docword datasets, which is close to our estimation of 39 and 14 times reduction, respectively, in Table II . Here, the average cache miss latency was 11k-22k cycles for L1 and 75k-91k cycles for L2.
The third column in Figure 2 shows the ratio of the overall time spent on memory accesses. As a result of memory access speedup, the overall execution time of SpMM decreases noticeably for all the datasets. For instance, SpMM on Docword is around 31 times faster if we use InCRS instead of CRS.
C. Evaluation of the Systolic SpMM Architecture
We compared the performance of our proposed architecture with the conventional MM and the FPIC work [11] performing A × A T on a set of large and sparse datasets with a range of 0.057% to 14% density. Overall, our architecture's acceleration increases as the density decreases. For this range of densities, the proposed architecture performs SpMM 1.5-39 times faster in comparison with the conventional MM and 2-30 times faster in comparison with FPIC. Details of this set of experiments is available in the long version of this paper [1].
VI. CONCLUSION
Matrix multiplication is a key kernel in inference applications, which often have large and sparse data sets. Thus, there is strong interest in accelerating this important operation with sparse data representations. We studied the main challenges of accelerating Sparse Matrix Multiplication (SpMM) including non-trivial data access and designing a systolic architecture. We proposed a modification to sparse formats to accelerate access to a dataset when the dataset is not stored in the desired order (row/column order). Our experiments show that the proposed InCRS format speeds up SpMM 14-49 times due to fewer memory accesses. Next, we proposed a high performance systolic SpMM architecture, which accelerates the computation and maximizes the data reuse by performing synchronized movement of the data through the architecture. We showed that, with similar memory access assumptions, the proposed architecture performs SpMM 9-30 times faster than the state of the art FPIC architecture.
