Abstract-Exploiting spatial and temporal localities is investigated for efficient row-by-row parallelization of general sparse matrix-matrix multiplication (SpGEMM) operation of the form C ¼ A B on many-core architectures. Hypergraph and bipartite graph models are proposed for 1D rowwise partitioning of matrix A to evenly partition the work across threads with the objective of reducing the number of B-matrix words to be transferred from the memory and between different caches. A hypergraph model is proposed for B-matrix column reordering to exploit spatial locality in accessing entries of thread-private temporary arrays, which are used to accumulate results for C-matrix rows. A similarity graph model is proposed for B-matrix row reordering to increase temporal reuse of these accumulation array entries. The proposed models and methods are tested on a wide range of sparse matrices from real applications and the experiments were carried on a 60-core Intel Xeon Phi processor, as well as a two-socket Xeon processor. Results show the validity of the models and methods proposed for enhancing the locality in parallel SpGEMM operations.
INTRODUCTION

Applications Involving SpGEMM
G ENERAL sparse matrix-matrix multiplication (SpGEMM) of the form C ¼ A B is an important kernel in various applications such as molecular dynamics simulations [1] , [2] , [3] , [4] , [5] , [6] , [7] , finite element simulations based on domain decomposition [8] , [9] , linear programming (LP) [10] , [11] , [12] , multigrid interpolation and restriction [13] , breadth-first search from multiple source vertices [14] , finding all-pair shortest-paths [15] , similarity join [16] , data summarization [17] , and item-to-item collaborative filtering in recommendation systems [18] .
Some of the above-mentioned applications require repeated SpGEMM operations which involve matrices with same sparsity patterns but differing numerical values: Solution of LP problems through iterative interior point methods that use normal equations constitutes a typical example of such applications. These methods solve positive-definite linear systems of the form ðAD 2 A T Þ x ¼ b at each iteration, where A is the original constraint matrix and D is a positive diagonal matrix which varies with each iteration. Direct solvers [10] , [11] , [12] that utilize Cholesky factorization as well as iterative solvers [11] that utilize preconditioners require explicitly forming the coefficient matrix at each iteration through the SpGEMM computation C ¼ A B, where the sparsity patterns of both A and B ¼ D 2 A T remain the same throughout the iterations.
Similarity join and collaborative filtering applications may also involve repeated SpGEMM operations. In the self similarity join application that involves C ¼ AA and in the similarity join of two different sparse data sets that involve C ¼ AB, different weightings can be defined on the relative importance of features that will incur repeated SpGEMM operations of the forms C ¼ AWA and C ¼ AWB for different W matrices. Here, W is a diagonal matrix that contains values for relative ranking of the features in the data sets. In item-to-item collaborative filtering in recommendation systems, given a similar-items table, which is A, SpGEMM is performed in order to find items similar to each of the user's preferences, which are stored in B, and then the system aggregates those items and recommends the most popular or correlated items. This application may incur repeated SpGEMM operations of the form of C ¼ AWB, where W is a diagonal matrix that adjusts importance of items in filtering.
In this work, we propose matrix partitioning and row/ column reordering schemes for exploiting temporal and spatial localities in row-by-row parallelization of SpGEMM on many-core architectures. All of the above-mentioned applications will benefit from these matrix partitioning and reordering methods, whereas the preprocessing overhead due to these intelligent partitioning operations will amortize especially for the applications that involve repeated SpGEMM operations.
Parallel SpGEMM Algorithms
Parallel SpGEMM schemes can be broadly classified into four categories depending on the following 1D GEMM formulations [19] : Outer product, inner product, column-bycolumn, and row-by-row. In the outer-product parallelization, an atomic task is defined as the outer product of column i of A with row i of B, so that each thread is held responsible for computing one or more outer products. The outer-product formulation is reported to lead to scalable parallelization on distributed-memory architectures [20] , [21] , [22] . However, this SpGEMM scheme requires sparse accumulation operations for obtaining the output matrix from the partial results of concurrent thread computations on shared-memory architectures. These operations necessitate complex indexing schemes to efficiently identify the contributions to the same matrix entries by different threads. Although there exists an indexing scheme proposed for outer-product kernel on distributed-memory architectures [20] , this scheme does not seem to be viable for manycore architectures due to the overhead of many concurrent writes.
In the inner-product parallelization, an atomic task is defined as the inner product of row i of A with column j of B, so that each thread is held responsible for computing one or more inner products. This scheme requires merging two sparse vectors which is not efficient because of large number of element accesses per floating-point operation (Flop) [23] . Therefore, this scheme is also not viable for current many-core architectures with deep cache hierarchies.
In the column-by-column parallelization, an atomic task is defined as the post-multiply of the whole A matrix with a column of B, so that each thread is held responsible for computing one or more post-multiplies. In the row-by-row parallelization, an atomic task is defined as the pre-multiply of a row of A with the whole B matrix, so that each thread is held responsible for computing one or more pre-multiplies. Both schemes share the nice property of not requiring concurrent writes as opposed to the outer-product formulation. The former and latter schemes can complete the computation of a column and a row of the output matrix at a time, respectively. Both schemes also share the nice property of performing all multiplications related with a nonzero of one of the two input matrices as opposed to the inner-product formulation. So, both of these two schemes are found to be viable for shared-memory parallelism.
Related Work
There exist several recent works on SpGEMM parallelization on distributed-memory architectures [7] , [20] , [21] , [22] , [24] , [25] , [26] , [27] . Here, we discuss the related work on SpGEMM parallelization on shared-memory architectures according to the above-mentioned categorization.
Sulatcyke and Ghose [28] examine the impact of data reuse on the SpGEMM performance. They consider outerproduct and row-by-row formulations and report that the row-by-row formulation is more amenable for data reuse opportunities. They also report that the row-by-row parallelization does not need any synchronization among threads so it is more eligible for shared-memory parallelization.
There are successful GPU libraries, cuSPARSE [29] and Cusp [30] , which perform SpGEMM. cuSPARSE uses rowby-row formulation as its top-level parallelism. A hash table is used for accumulating partial results for each row of the output matrix. In Cusp [31] , elements of input matrices are accessed via row-by-row scheme, but the partial results for the output matrix are stored in an intermediate list consisting of row, column, and value tuples. This list is sorted and duplicate column indices are compressed to compute the final output matrix. The SpGEMM operation in Cusp is further enhanced by the works [32] and [33] .
Matam et al. [34] investigate row-by-row and outerproduct parallelization schemes, which utilize blocking to decrease the size of the temporary accumulation array used to accumulate the result of each pre-multiply and outer product, respectively. They report that the row-by-row parallelization performs better than the outer-product parallelization. A similar blocking approach is also utilized in [35] to decrease the size of the temporary accumulation array.
Liu and Vinter [36] use row-by-row parallelization. They utilize progressive allocation of C matrix for memory efficiency, computing final C-matrix rows via merging partial results, and partitioning rows of C with respect to their nonzero density for better load balancing.
Gremse et al. [37] uses row-by-row parallelization. Instead of using accumulation array, W (subwarp size) results are merged at a time via utilizing a warp reduction method, which uses only hardware registers.
Siegel et al. [38] use inner-product formulation for coarsegrained parallelism among tasks and row-by-row formulation for fine-grained parallelism within a task. Task sharing approach is used in [38] for dynamic load balancing among nodes and GPUs of a node.
The above-mentioned works on SpGEMM do not exploit sparsity patterns of input or output matrices for efficient parallelization via reducing cache misses. Although there exist models and methods that utilize sparsity pattern for exploiting locality in sparse matrix-vector multiplication [39] , [40] , [41] , the literature lacks locality exploiting methods for SpGEMM operations. In this work, we propose models and matrix partitioning/reordering methods, which utilize matrix sparsity pattern in order to exploit data locality in parallel SpGEMM operation.
Contributions
We propose a hypergraph model and a bipartite graph model for encapsulating computational and B-matrix transfer requirements of thread-level row-by-row parallelization of SpGEMM operation that is based on rowwise A-matrix partitioning. We show that the minimization objectives of partitioning the hypergraph and bipartite graph models relate to reducing data transfers from memory or another cache. We devise fast bottom-up clustering methods utilizing both proposed hypergraph and bipartite graph models in order to decrease the preprocessing overhead.
We also investigate how to exploit locality in accumulation of partial results for C-matrix rows. For this purpose, we propose a spatial hypergraph model and a temporal graph model for reordering B-matrix columns and rows, respectively. For the same purpose, we also show how to encode an existing A-matrix partition as a row/column reordering of the B matrix for the special cases of C ¼ AA and C ¼ AA T . The validity of the proposed matrix partitioning and row/column reordering methods are extensively experimented on real SpGEMM instances using many-core architectures Intel Xeon Phi and Xeon.
We should note that the column-by-column and row-byrow parallelization schemes, both of which are found to be viable for shared-memory parallelization, can be considered as dual of each other. So, the discussion for the column-bycolumn parallelization directly follows the discussion given for the row-by-row parallelization.
The rest of the paper is organized as follows: The row-byrow parallelization of SpGEMM and its data locality properties are given in Sections 2 and 3, respectively. The hypergraph and bipartite graph models for exploiting temporal locality in accessing B-matrix rows are introduced in Section 4. In Section 5, we present the models and methods for exploiting spatial and temporal localities in accessing entries of accumulation arrays. The experimental results are presented in Section 6. Finally, we conclude the paper in 
THE PARALLEL ALGORITHM
The row-by-row parallelization is based on one-dimensional conformable rowwise partitioning of the input matrix A and the output matrix C as follows:
In (1), submatrices A k and C k denote the kth row slices of the reordered A and C matrices, respectively, where K denotes the number of parts. Here, P denotes the permutation matrix obtained from partitioning. The use of the same permutation matrix for row reorderings of A and C shows that a rowwise partition on A induces a conformable rowwise partition on C. The reorderedÂ andĈ will be referred to as A and C. for each nonzero c i;j in row c i;Ã do 7:
for each nonzero a i;j in row a i;Ã do 9:
acc ¼ acc þ a i;j b j;Ã 10:
//Gather dense array to sparse storage 11:
for each nonzero c i;j in row c i;Ã do 12:
The input and output data partitioning given in (1) leads to the row-by-row algorithm as shown in Algorithm 1, where the output matrix C is computed as,
Each submatrix-matrix multiply C k ¼ A k B is assigned to a thread that will be executed by an individual core as also depicted by the "for ... in parallel do" construct in line 2 of Algorithm 1. This algorithm uses CSR (compressed storage by rows) scheme for storing both input matrices and the output matrix. In lines 6-12, row i of A k (a i;Ã ) is premultiplied by B and the result of this multiplication is written to row i of C (c i;Ã ). Each thread allocates a private accumulation array acc (i.e., a simple 1D dense array) of size N for efficient computation of pre-multiplies, where N is the number of columns of B. In line 6, acc is initialized to zero and then a pre-multiply is performed by the for-loop starting in line 8. In line 11, the final values for c i;Ã are gathered from acc and stored in compressed row storage of C. A symbolic SpGEMM operation is performed prior to the numeric multiplication for identifying the column indices of nonzeros in row c i;Ã . The pseudo-code of symbolic multiplication is given in Algorithm A.1 of Appendix A, available in the online supplemental material. Fig. 1 shows an SpGEMM instance to demonstrate the atomic task of computing row i of the C matrix according to row-by-row formulation. As seen in the figure, row i of A contains three nonzeros at columns x, y, and z. Each of B-matrix rows x, y, and z contains two nonzeros and these six nonzeros are spread over three columns j, k, and m. Thus, they incur the addition of the results of six scalar multiply-and-add operations to the jth, kth, and mth entries of the acc array. Each of three B-matrix rows has a nonzero at column k, which incurs the addition of the results of three scalar multiply-and-add operations to the same entry of the acc array.
DATA LOCALITY
We present spatial and temporal locality characteristics of the row-by-row parallelization given in Algorithm 1. Discussions presented here are valid for private caches and also for processors with private and shared caches existing at different levels of the memory hieararchy.
Spatial locality in accessing A-matrix nonzeros is simply achieved by storing nonzeros of each row consecutively using CSR and processing nonzeros of A consecutively. Temporal locality in accessing A-matrix nonzeros is not feasible since each A-matrix nonzero is accessed once.
Spatial locality in accessing B-matrix nonzeros is partially achieved by storing nonzeros of each row consecutively using CSR. However, rows of B are not processed consecutively and the processing order is determined by sparsity patterns and processing order of A-matrix rows. The performance improvement due to exploiting spatial locality is expected to be negligible because at most one extra cache miss can be avoided for each B-matrix row. So this locality is not considered in the paper.
Temporal locality in accessing B-matrix nonzeros (together with their indices) is feasible since B-matrix nonzeros are accessed multiple times. Each nonzero in row x of B is accessed nnzða Ã;x Þ times. Here, nnzðÁÞ denotes the number of nonzeros in a row, a column or the whole of the matrix. Temporal locality in accessing B-matrix nonzeros can be exploited through clustering A-matrix rows that are similar in terms of the number of nonzeros of required B-matrix rows, where each cluster constitutes a row-slice of the reordered A matrix. Reuse of B-matrix nonzeros can be achieved at a coarse level of reuse of B-matrix rows.
The access pattern to the accumulation array is determined by the sparsity pattern of B as well as the access order of B-matrix rows, which is induced by the processing order of A-matrix rows. Spatial locality in accessing the accumulation array entries can be exploited through reordering B-matrix columns with similar sparsity patterns nearby. Temporal locality in accessing the accumulation array entries can be exploited through reordering of B-matrix rows that are accessed within the same coarsegrain thread-level task with similar sparsity patterns nearby. This is because B-matrix rows with similar sparsity patterns are likely to access the same accumulation array entries during the execution of the coarse-grain task.
For exploiting temporal locality in accessing B-matrix rows, we propose and discuss two models for conformable rowwise partitioning of A and C matrices.
Hypergraph Model
Here, H T A ðA; fnnzðb x;Ã Þg x Þ refers to the fact that the sparsity pattern of matrix A determines the topology of the proposed hypergraph model, whereas the nonzero counts of B-matrix rows determine vertex and net weights. The superscript "T " denotes temporal locality.
In H T A , V contains a vertex v i for each row i of A so that it represents the atomic task
of pre-multiplying row i (a i;Ã ) of A with the B matrix to compute row i (c i;Ã ) of C. Note that this atomic task definition effectively means that vertex v i also represents row i of the C matrix according to owner computes rule. Hence, we associate vertex v i with a weight wðv i Þ proportional to the computational load of this pre-multiply in terms of scalar multiply-and-add operations. That is,
where colsða i;Ã Þ denotes the column indices of the nonzeros in row i of A. N contains a net (hyperedge) n x for each row x of B. Net n x connects vertices corresponding to rows that have nonzeros at column x of A. So, the vertices connected by n x correspond to the atomic tasks that need to access row x of B. Hence, we associate net n x with a weight wðn x Þ proportional to the cost of transferring B-matrix row x from memory or another cache, that is, Fig. 2 illustrates the input and output dependency of the proposed hypergraph model. In this figure, as well as in Fig. 4 , circles represent vertices, whereas dots represent nets. As seen in Fig. 2 , n x connects vertices v i and v j , whereas n y connects vertices v i , v j , and v k . So net n x encodes the need of accessing B-matrix row x for the atomic tasks of computing rows i and j of C, whereas net n y encodes the need of accessing B-matrix row y for the atomic tasks of computing rows i, j, and k of C. Vertex v i connected by both n x and n y shows that the atomic task of computing row i of C needs to access both rows x and y of B, which in turn shows the need for an accumulation operation depending on the sparsity patterns of rows x and y of B.
Consider a K-way partition PðVÞ ¼ fV 1 ; V 2 ; . . . ; V K g of vertices of H T A , where parts are mutually exclusive and exhaustive. In PðVÞ, the weight W ðV k Þ of a part k is defined as the sum of the weights of the vertices in part k. That is, W ðV k Þ ¼ P v2V k wðvÞ. A K-way partition PðVÞ is decoded in such a way that the set of atomic tasks (fine-grain tasks) corresponding to the vertices in a part of PðVÞ constitutes a coarse-grain task to be executed by a distinct thread. That is, vertex part V k denotes the submatrix-matrix multiply C k ¼ A k B to be executed by a distinct thread, where A k and C k are the submatrices respectively formed by A-matrix and C-matrix rows that are represented by the vertices in V k . Without loss of generality, we assume that each core executes only one distinct thread. K is selected such that the storage size of the B-matrix rows required by each submatrix-matrix multiply C k ¼ A k B together with the storage sizes of C k and A k matrices are below the size of the cache of a single core. Thus, the working set of each iteration of the for-loop starting at line 2 of Algorithm 1 fits into the cache of a single core due to thread-level coarse-grain task definition induced by the partitioning.
The weight of a part corresponds to the computational load of a thread-level coarse-grain task. So, the partitioning constraint of maintaining balance on the part weights corresponds to maintaining balance on the computational loads of thread-level coarse-grain tasks. This constraint corresponds to reducing overall execution time for arbitrary coarse-grain task scheduling.
In a K-way partition PðVÞ of H T A , a net n x is said to connect a part if n x connects at least one vertex in that part. The connectivity conðn x Þ of n x is the set of parts connected by n x . The cutsize of PðVÞ is defined as 
Here, we present the following discussion to show that the cutsize of a given partition PðVÞ of H T A is equal to the number of B-matrix words to be transferred from the memory and between different caches under the following assumptions: single-word line, fully associative cache, and single thread per core.
Consider a net n x with connectivity conðn x Þ in P. Let T x denote the set of jT x j ¼ jconðn x Þj threads that will execute the coarse-grain tasks corresponding to the vertex parts in conðn x Þ on different cores of the system. Without loss of generality, let t be the first thread to be executed in T x . Thread t will transfer the B-matrix row x from the memory, whereas the other threads in T x will transfer B-matrix row x either from the memory or from the cache of another core. Hence the amount of data transfer due to accessing B-matrix row x, which is represented by net n x , will be equal to wðn x Þjconðn x Þj words. Here, the costs of transferring from memory and another cache are assumed to be equal based on the results reported in [43] . Thus cutsizeðPÞ according to (6) will show the total amount of data transferred due to accessing B-matrix rows under the above-mentioned assumptions. It is clear that the amount of data transfer will reduce for the general cases of multiword line and multiple threads per core. On the other hand, the amount of data transfer will increase for the general case of set associative caches because of the possibility of the conflict misses, whereas fully associative caches will incur only capacity misses. So, for the general case, the partitioning objective of minimizing the cutsize relates to minimizing the amount of data transfer due to accessing B-matrix rows.
In Appendix B, available in the online supplemental material, we also show that the objective of minimizing the cutsize also relates to maximizing B-matrix row reuse. Fig. 3 shows a sample SpGEMM computation C ¼ A B, where A and B are 11 by 9 and 9 by 8 matrices with 27 and 26 nonzeros, respectively, and C is an 11 by 8 matrix with 57 nonzeros. Fig. 4 shows the hypergraph model H T A that represents the sample SpGEMM computation instance given in Fig. 3 . As seen in Fig. 4 , H T A contains 11 vertices, 9 nets, and 27 pins. Note that a connection between a net and a vertex is said to be pin of that net. As also seen in the figure, wðv 5 Þ ¼ 14 since row 5 of A contains nonzeros at columns 2, 3, 5, and 6, and nnzðb 2;Ã Þ þ nnzðb 3;Ã Þþ nnzðb 5;Ã Þþ In the 3-way partition PðVÞ given in Fig. 4 , there are four cut nets n 1 , n 3 , n 5 , and n 9 , whereas the remaining five nets are uncut. Note that a net is said to be cut if it connects more than one part, uncut otherwise. As seen in the figure, net n 3 connects all vertex parts, and hence jconðn 3 Þj ¼ 3. Consequently, cut net n 3 incurs wðn 3 Þjconðn 3 Þj ¼ 3 Á 3 ¼ 9 to the cutsize, which is equal to the amount of data transfer due to accessing B-matrix row 3, under the mentioned assumptions.
Hypergraph Partitioning (HP)
The state-of-the-art hypergraph partitioning tools such as PaToH [44] and hMeTiS [45] support the following cutsize metrics: hyperedge/net-cut [44], [45] , "connectivity-1" metric [44], [46] , [47] and sum of external degrees (SOED) [45] . None of these metrics directly encode the cutsize definition given in (6) for the proposed hypergraph model. On the other hand, it can easily be shown that there exist a constant difference of nnzðBÞ between the cutsize definition in (6) and the cutsize definition according to the "connectivity-1" metric. So, "connectivity-1" cutsize metric of PaToH can be safely used for minimizing the connectivity cutsize definition given in (6).
Fast Bottom-Up Hypergraph Clustering (HC)
In order to reduce the preprocessing overhead due to the top-down partitioning using PaToH, we propose a method that performs bottom-up clustering on the proposed hypergraph model. This bottom-up clustering method exploits the multi-level coarsening phase of PaToH in such a way that coarsening is continued until the number of supervertices in a level becomes smaller than or equal to K 0 ¼ 1:30K. Note that each supervertex in the last level is decoded as a vertex part/cluster in such a way that the set of atomic tasks (fine-grain tasks) constituting that supervertex is taken as a coarse-grain task to be executed by a distinct thread.
The proposed method uses a multi-level matching-based clustering scheme. As the matching selection criteria, "absorption metric" and a variant of "scaled heavy connectivity metric" (SHCM) are implemented. The SHCM-variant used in this experimentation corresponds to the generalized Jaccard similarity metric [48] . This bottom-up method establishes a tradeoff between the solution quality and the preprocessing time and it will be referred to as hypergraph clustering HC.
HC aims at clustering vertices with similar net connectivities. This corresponds to clustering A-matrix rows that are similar in terms of the number of nonzeros of the required B-matrix rows to the same cluster, thus exploiting temporal locality in accessing B-matrix nonzeros.
HC does not enforce any explicit constraint for maintaining balance on clusters. So this scheme may incur load imbalance for instances with high atomic-task weight variation. In order to partially alleviate this bottleneck, the coarse-grain tasks obtained from partitioning are sorted in decreasing order with respect to their coarse-grain task weights and they are given to the OpenMP scheduler in this order. with zero weight since they do not incur any computation. That is,
Bipartite Graph Model
A is decoded in a way similar to decoding PðVÞ of H T A . That is, the set of atomic tasks (fine-grain tasks) corresponding to the vertices in a part of PðV A Þ constitutes a coarse-grain task to be executed by a distinct thread. So, the partitioning constraint of maintaining balance on the part weights corresponds to maintaining balance on the computational loads of thread-level coarse-grain tasks. PðV B Þ is not decoded since vertices in V B do not signify any computation.
Edgecut Formulation (BGPe)
We associate each edge e i;x with a weight equal to
So the partitioning objective of minimizing
relates to minimizing the amount of B-matrix rows to be transferred from the memory and between different caches under the assumption that accessing each B-matrix row always incurs compulsory miss(es), i.e., there is no reuse of B-matrix rows. In other words, minimizing the edgecut corresponds to minimizing a loose upper bound on the number of cache misses due to accessing B-matrix rows.
Totalv Formulation (BGPv)
The above-mentioned edgecut formulation is based on the fact that this objective is widely supported by almost all graph partitioning tools. Here, we propose a formulation that encodes the partitioning objective of minimizing the "total communication volume" (totalv) supported by MeTiS [49] ,
Here, V b denotes the set of boundary vertices, where a vertex is said to be boundary if it is incident to at least one cut edge. For a vertex v 2 V k , NadjðvÞ denotes the number of parts other than V k that the vertices adjacent to v (i.e., AdjðvÞ) belong to. In (10), sðvÞ denotes the size of vertex v so that MeTiS assumes a weight wðvÞ and a size sðvÞ for each vertex v, where edges are unweighted.
In the proposed totalv formulation, the vertex weight assignment is as same as in the edgecut formulation (see (7)), whereas vertex size assignment is as follows:
This vertex size assignment is based on establishing a oneto-one correspondence between vertices of V B ðB It 
Fast Bottom-Up Graph Clustering (BGCe)
In order to reduce the preprocessing overhead due to the top-down partitioning, we propose a method that performs bottom-up clustering on the proposed bipartite graph models. Similar to the HC method proposed in Section 4.1.2, this method exploits only the multi-level coarsening phase of the multi-threaded graph partitioning tool mt-MeTiS [50] . The initial bipartitioning and refinement phases of mtMeTiS are omitted. This method will be referred to as bipartite graph clustering BGCe since mt-MeTiS performs vertex matching according to weights of the connecting edges.
B-MATRIX ROW/COLUMN REORDERING FOR LOCALITY IN ACCESSING ACC ARRAY
In Sections 5.1 and 5.2, we respectively propose a spatial graph model and a temporal hypergraph model for preserving locality in accessing acc array. A K 00 -way vertex partition of both models is used to induce a partial ordering on either the rows or columns of the B matrix as follows: The rows/columns corresponding to the vertices in V kþ1 are ordered after the rows/columns corresponding the vertices in V k for k ¼ 1; 2; . . . ; K 00 À 1. In both reordering schemes, K 00 is selected such that each vertex part is sufficiently small.
Spatial Hypergraph Model
In order to exploit spatial locality in accessing acc-array entries, the B-matrix columns with similar sparsity patterns should be reordered nearby. Furthermore, the sparsity pattern similarity among B-matrix columns should be weighted according to the access counts of the B-matrix rows that contain nonzeros in those columns. So, we propose to represent the pattern of accessing acc-array entries as a spatial hypergraph model H S B ðB; fnnzða Ã;x Þg x Þ¼ðV; N Þ for column reordering of B matrix. Here, the superscript "S" stands for spatial locality. In H S B , V contains a vertex v j for each column j of matrix B. We associate vertices with unit weights. N contains a net n x for each row x of matrix B. We associate each net n x with a weight wðn x Þ equal to the number of times B-matrix row x will be accessed during C ¼ A B. That is, Assume that L number of B-matrix columns represented by vertices in a part are ordered consecutively so that corresponding acc-array entries are stored in consecutive locations of the memory (i.e., in the same cache line) and accessed consecutively by the SpGEMM algorithm. Consider a net n x 2 H S B with connectivity set conðn x Þ and with weight wðn x Þ. This net n x means that the B-matrix row x will be transferred from memory or another cache wðn x Þ times. Each time this row is transferred and accessed during multiplication, jconðn x Þj number of lines containing acc-array entries will be accessed also. So, in total, the number of accesses to acc-array entries will be equal to P nx2N wðn x Þjconðn x Þj. Therefore, the cutsize according to the connectivity metric (6) corresponds to the number of accesses to the cache lines containing acc-array entries during C ¼ A B. Hence, the proposed model exploits spatial locality in accessing acc-array entries.
Temporal Graph Model
In order to exploit temporal locality in accessing acc-array entries, B-matrix rows with similar sparsity patterns should be reordered nearby. Furthermore, the sparsity pattern similarity among B-matrix rows should be weighted according to the number of accesses to the same acc-arrray entries within the same coarse-grain thread-level tasks. So, exploiting this locality depends on coarse-grain task set induced by the rowwise A-matrix partition, PðAÞ ¼ fA 1 ; A 2 ; . . . ; A K g, obtained via using the models proposed in Section 4. Thus, B-matrix-row access pattern is represented as a B-matrix-row similarity graph G T B ðB; PðAÞÞ ¼ ðV; EÞ for conformable column and row reordering of matrices A and B, respectively.
In G T B , V contains a vertex v i for each row i of matrix B. We associate vertices with unit weights. E contains an edge e i;j if B-matrix rows i and j have nonzeros in at least one common column and these rows are accessed together by at least one coarse-grain task executed by a thread. We associate each edge e i;j with a weight wðe i;j Þ equal to the number of common columns of B-matrix rows i and j that are accessed together in the same coarse-grain task, i.e., wðe i;j Þ¼jfx : a Ã;i ; a Ã;j 2 A k ; x 2 colsðb i;Ã Þ\colsðb j;Ã Þgj: This edge e i;j means that the B-matrix rows i and j involve accesses to wðe i;j Þ same entries of acc array. If B-matrix rows i and j are always processed close in time, which in turn corresponds to the case where edge e i;j is uncut and parts of PðG T B Þ are small enough, the wðe i;j Þ different acc-array entries are likely to be reused. If e i;j is cut, these wðe i;j Þ acc-array entries probably will not be reused. So, in total, the amount of cache misses due to loss of acc-array entry reuse is proportional to P e i;j 2E wðe i;j Þ. Therefore, partitioning objective given in (9) relates to maximizing the amount of reuse.
Using A-matrix Partitioning for C¼AA T and C¼AA
Here, we will show how a rowwise partition of matrix A obtained by partitioning the two models proposed in Section 4 can be used to exploit spatial and temporal localities in accessing acc-array entries during the SpGEMM operations of the forms C ¼ AA T [10] , [11] , [12] and C ¼ AA [3] , [7] , [15] , [16] , [17] . We should here mention that in both cases, second input matrices A T and A are stored separately from the first input matrix A.
A rowwise partition on the A matrix induces a partial ordering R A row on the rows of A, where rows in row slice A kþ1 are ordered after the rows in A k for k ¼ 1; 2; . . . ; K À 1. In the bipartite graph model B Here, we will briefly discuss how closely R T is conformably applied on the columns of A.
C ¼ AA C ¼ AA Case. We exploit the simple fact that the first and second input matrices are the same. For exploiting spatial locality, R A col is used for reordering columns of second A matrix. For symmetric matrices, R A row can be a better alternative for hypergraph models to avoid the disadvantage of using net ordering. For exploiting temporal locality, there are two options: In the first option, R A col is used for reordering the rows of second A matrix, whereas in the second option, R A row is used. Note that in both options, the ordering obtained on the rows of the second A is conformably applied on the columns of the first A matrix.
EXPERIMENTS
Data Set
The validity of the proposed methods are tested on three different categories: C ¼ AA T , C ¼ AA, and C ¼ A B. The C ¼ AA T category contains 12 LP constraint matrices, all of which are selected from the UFL sparse matrix collection [51] .
The C ¼ AA category contains 17 sparse matrices. Two of these matrices, cp2k-h2o-e6 and cp2k-h2o-.5e7, are obtained from the simulation of H 2 O molecules via using CP2K's implementation of KohnSham density functional theory calculations [3] , [21] . The remaining 15 matrices are selected from the UFL collection. The two matrices, 144 and cage12, represent graphs, which can be used in finding all-pairs shortest-paths [15] , self similarity joins of sparse datasets [16] , and summarization of sparse data [17] . Some of the remaining matrices are included in this dataset because of their use in recent works [36] , [37] , [42] as synthetic applications.
The C ¼ A B category for the general SpGEMM case contains 4 SpGEMM instances. This general case arises in two applications: First application is item-to-item collaborative filtering [18] in recommendation systems. The first two A matrices in the C ¼ A B category, i.e., amazon0302 and amazon0312, represent similarities between items, and they are obtained from the UFL collection. The corresponding B matrices are randomly generated according to Zipf distribution with exponent equal to 3.0. The second application arises in similarity joins of two different sparse datasets [16] . The remaining two SpGEMM instances in the C ¼ A B category represent sparse networks obtained from the DIMACS Implementation Challenges [52] .
In the appendix, available in the online supplemental material, Table A.1 displays the properties of the 33 SpGEMM instances. For each category, the SpGEMM instances are listed in alphabetical order by name of the first input matrix. The properties of SpGEMM instances are displayed in terms of total number of rows, columns, and nonzeros of the input matrices, as well as their average and maximum number of nonzeros per row and column. The properties of SpGEMM instances are also displayed in terms of the number of thread-level coarse-grain tasks (K), and statistics of atomic tasks in terms of number of multiply-and-adds and kilo bytes (KB). K values are introduced to show the amount of parallelism in each SpGEMM instance. The "cov" (coefficient of variation) value of an SpGEMM instance is given as an indication of the level of irregularity in the atomic task weights. As seen in Table A .1, available in the online supplemental material, C ¼ AA T instances have much higher "cov" values than C ¼ AA instances in general, thus showing the higher irregularity of LP instances.
For each of the 33 SpGEMM instances, the properties of the hypergraph and bipartite graph models can also be extracted from the information available in [49] are used for K-way partitioning of these hypergraphs and bipartite graphs, respectively. PaToH is used with the PATOH_SUGPARAM_SPEED parameter in order to trade off between the partitioning quality and the preprocessing overhead due to partitioning. MeTiS is used with default values except it is made to use multilevel recursive bisectioning. For the partitioning constraint, the maximum allowed imbalance threshold is set to be equal to 0.30 for both PaToH and MeTiS. For the partitioning objective, "connectivity-1" cutsize metric is used with PaToH, whereas both edgecut and totalv metrics are used with MeTiS.
As the selection criteria in the HC method, "absorption metric" and a variant of "scaled heavy connectivity metric" (SHCM) are respectively used for the C ¼ AA (and C ¼ A B) and C ¼ AA T categories. The cache size threshold utilized for calculating K values for each SpGEMM instance is selected as half of the effective cache size per thread. This cache size thresholding scheme is used to alleviate the possibility of capacity misses due to the small set-associativity (8 ways) during the execution of a coarse-grain task.
PaToH, MeTiS, HC, and BGCe involve randomized algorithms. So for each SpGEMM instance, these tools/methods are run five times and the average results are reported in the following tables.
B B-Matrix Row/Column Reordering. For each test SpGEMM instance, the spatial hypergraph model proposed in Section 5.1 is partitioned by PaToH using the same parameters mentioned above. For hypergraph models, K 00 is calculated such that each part has about 10 vertices. For each rowwise A-matrix partition of each test SpGEMM instance, the temporal graph model proposed in Section 5.2 is partitioned using MeTiS three times, and the average result is reported. MeTiS is used with default values except it is made to use multilevel recursive bisectioning. For graph models, K 00 is calculated such that each part has about 10 vertices.
Baseline Algorithms
We use two baseline algorithms MKL and BinP, neither of which exploits locality. MKL. We use mkl_dcsrmultcsr function [53] of the latest MKL library (version 11.3). In the parallelization strategy adopted by MKL [54] , the first input matrix is divided into chunks with more or less equal number of rows, and every row chunk is assigned to a thread. Since the number of chunks is equal to the number of threads, the chunk size cannot be set by the user outside MKL. So MKL considers balancing the computational loads of threads in a very rough manner.
BinP. The BinP algorithm is implemented to achieve a much better balancing of the computational loads of the threads as follows: BinP is a binpacking-based algorithm and it adapts the best-fit-decreasing heuristic used in solving the K-feasible binpacking problem [55] . In adapting this heuristic for our purpose, A-matrix rows are considered for assignment into one of the K bins in decreasing number of multiply-and-add operations incurred by the pre-multiply of this row with the B matrix. The best-fit criterion is the assignment of the A-matrix row to the minimally loaded bin (part). The bin capacity constraint is not used in BinP. At the termination of the algorithm, each bin represents a coarse-grain task to be executed by a distinct thread. The number of resulting parts becomes much larger than the number of threads thus enabling the utilization of dynamic part-to-thread scheduling for further load balancing.
The cache size thresholding scheme described in Section 6.2.1 for the proposed algorithms is also used to determine the K value in the BinP algorithm.
Parallel Systems and Implementation
We conducted experiments on a single Xeon Phi 5110P coprocessor. We used the offload mode instead of the native mode to enable future vertical integration that involves hybrid parallelization on a Xeon Phi cluster. The Xeon Phi coprocessor has 8 GB on-device RAM and provides 59 cores in the offload mode and each core can handle up to four hardware threads. Each core has 32 KB 8-way set associative L1 data cache with 64-byte lines, and 512 KB 8-way set associative L2 with 64-byte lines.
We also conducted experiments on a two-socket Xeon server. Each X5650 Xeon processor clocked at 2.67 GHz has 6 cores and 12 MB 16-way set associative L3 cache. Each core has 32 KB 8-way associative L1 cache with 8-byte lines and 256 KB 8-way set associative L2 with 8-byte lines. Only Tables 4, A.2, and A.4, available in the online supplemental material, contain results on the two-socket Xeon server, whereas all other tables contain results for Xeon Phi.
For evaluating the performance of the proposed models as well as BinP, SpGEMM routines are implemented and integrated into shoc-mic benchmark [56] , which is compiled with -O3 flag. OpenMP's dynamic scheduler is used with the default chunk size. The best results for 59, 118, 177, and 236 threads are reported for Xeon Phi and results for 12 threads are reported for the Xeon server.
Parallel SpGEMM Performance
This section compares the SpGEMM performance of the methods without considering the preprocesing overhead, whereas Section 6.5 gives an overall discussion including the preprocessing overheads.
Sorting Coarse-Grain Tasks for Computational
Load Balancing in HC Table 1 displays the performance of the sorting-based coarse-grain task scheduling scheme described for HC at the end of Section 4.1.2 for the C ¼ AA T category. In the table, the properties of the coarse-grain tasks are given in terms of the number K of tasks, the average and maximum task weights, and the covariance of task weights. Each T value shows the number of threads that achieve the best reported result for the respective SpGEMM instance. The last column displays the percent performance improvement achieved by using the sorting-based scheduling scheme. As seen in Table 1 , the proposed sorting scheme considerably improves the performance in all SpGEMM instances. As seen in the table, sorting scheme achieves relatively better performance for small K=T and large covariance values, as expected. For example, sorting does not improve the performance much for the rlfprim, sc205-2r and scsd8-2r instances, which have large K=T ratios of 3821/236%16, 4260/118%36, and 4393/236%19, respectively.
On the average, the proposed sorting scheme improves HC by 12.9, 0.9, and 3.8 percent for C ¼ AA T , C ¼ AA, and C ¼ A B categories, respectively. The significant amount of improvement in the C ¼ AA T category can be attributed to the relatively smaller average K=T ratio of ðK=T Þ avg =10 and higher average covariance value of cov avg =0.71 of this category, whereas ðK=T Þ avg =24 and cov avg =0.29 for C ¼ AA, and ðK=T Þ avg =14 and cov avg =0.64 for C ¼ A B.
We should note here that, for the BGCe method, the sorting scheme achieves only slight improvement of 1.5 percent in C ¼ AA T whereas it does not achieve any improvement for the other two categories.
As expected, the sorting scheme does not considerably improve the performance of partitioning-based methods since they explicitly enforce balance among coarse-grain task weights.
Exploiting Locality in Accessing acc Array
Here, we evaluate the validity of the B-matrix row/column reordering algorithms H S B , G T B , R A col , and R A row proposed in Section 5. For this purpose, we report the performance improvement induced by these reordering algorithms on each of the five methods HP, HC, BGPe, BGPv, and BGCe. Table 2 displays the average performance improvement over the original orderings of A-matrix columns and B-matrix rows and columns. The "Spatial" and "Temporal" columns respectively refer to the separate use of B-matrix-column and B-matrix-row reorderings described in Section 5.
As seen in Table 2 , the spatial H 
Exploiting Temporal Locality in Accessing B
Matrix Table 3 compares the performance (in terms of GFlops) of the proposed locality-aware HP, HC, BGPe, BGPv, and BGCe methods against the performance of the baseline MKL and BinP methods on Xeon Phi. For all SpGEMM instances, the proposed methods use the best-performing B-matrix row and column reordering methods as discussed in Section 5 for exploiting locality in accessing acc array. In the table, for each method, performance averages and the number of best results attained over each of the three different categories are listed at the bottom of each part, whereas overall averages and number of best results are displayed at the bottom of the table. A bold value in a row of the table indicates the highest GFlops performance attained for that instance.
In Fig. 8 , we present a performance profile, which is a generic tool introduced by Dolan and Mor e [57] , in order to give a more comprehensive view of the runtime results of the proposed, as well as the baseline methods. The test set in this figure consists of all instances listed in Table A .1, available in the online supplemental material.
We will first briefly discuss the relative performance of the two baseline methods MKL and BinP. As seen in Table 3 , the proposed baseline algorithm BinP performs significantly better than MKL both in C ¼ AA T and C ¼ A B categories, whereas BinP and MKL display comparable performance in the C ¼ AA category. On average, BinP performs 2.32x and 1.39x better than MKL for C ¼ AA T and C ¼ A B categories, respectively, whereas BinP performs only 1.04x better than MKL for the C ¼ AA category. This is because atomic tasks of the SpGEMM instances in the C ¼ AA T and C ¼ A B categories are much more irregular compared to those in C ¼ AA category as mentioned in Section 6.1. So, by assigning equal number of A-matrix rows to threads, MKL may attain reasonable load balancing among threads in the C ¼ AA category. This significant performance improvement of BinP over MKL for the C ¼ AA T and C ¼ A B categories shows the merits of using load balancing alone.
As seen in Table 3 , all of the proposed locality exploiting methods perform significantly better than both baseline methods for all of the three categories. Among the proposed methods, top-down partitioning-based methods HP, BGPe, and BGPv perform better than the bottom-up clustering methods HC and BGCe as expected. As seen in Table 3 , topdown partitioning-based methods HP, BGPe, and BGPv display comparable performance for all categories. As seen in the performance profile curves given in Fig. 8 , in almost 75 percent of the instances, BGPe, BGPv, and HP perform nearly same.
As seen in Table 3 , top-down partitioning-based methods HP, BGPe and BGPv performs 3.14x, 3.18x, and 3.19x better than MKL on the overall average. As also seen in the table, HP, BGPe, and BGPv perform 2.18x, 2.21x, and 2.21x better than BinP on the overall average. These relative performance improvements of HP, BGPe, and BGPv over BinP show the benefit of exploiting locality. Despite the inferior performance of HC and BGCe over the top-down partitioning-based methods, they still perform significantly better than BinP, where HC is the clear winner. HC and BGCe perform 1.88x and 1.52x better than BinP on the overall average. The better performance of HC over BGCe can be attributed to the use of nets that encode multi-way similarity instead of edges that encode only twoway similarity during the bottom-up clustering process. Table 4 displays results of experiments conducted on Xeon server. In Table 4 , GFlops performance results are displayed as averages over the three different SpGEMM categories. We refer the reader to Table A.2 of the appendix for instance-based detailed performance results, available in the online supplemental material. Comparison of Tables 3  and 4 show that the performance gap between the proposed locality aware methods and the baseline method BinP slightly reduces on Xeon compared to Xeon Phi. For example, the performance improvement of HP over BinP decreases from 2.18x on Xeon Phi to 2.05x on Xeon on the overall average. The average performance gap between HP and BinP remains almost the same (1.62x versus 1.61x) for the C ¼ A B category. The average performance improvement of HP over BinP decreases from 1.82x on Xeon Phi to 1.41x on Xeon for the C ¼ AA T category, whereas the average improvement of HP over BinP increases from 2.65x on Xeon Phi to 2.83x on Xeon for the C ¼ AA category. This experimental finding may be attributed to the reduced cache miss overhead in Xeon due to out-of-order execution capability as opposed to the in-order execution of Xeon Phi.
Reducing Data Transfer
In the appendix, Table A.3, available in the online supplemental material, is introduced in order to compare the improvements provided by the proposed partitioning-based methods against BinP in terms of cutsize. For each SpGEMM instance, the cutsize (computed according to (6) ) of the respective method is divided by the cutsize of the baseline method BinP. As seen in the table, the normalized cutsize values in general conform with the relative GFlops performance values of the respective methods given in Table 3 . For example, as seen in Table A .3, available in the online supplemental material, for the C ¼ AA T and C ¼ AA categories, HP, which aims at exploiting locality, respectively achieves 3.22x and 8.33x less cutsize than the baseline partitioning method BinP which only considers load-balancing. As seen in Table 3 , HP achieves respectively 1.82x and 2.65x speedups over BinP for the C ¼ AA T and C ¼ AA categories, on average.
We also conducted experiments with likwid [58] , which enables counting data transfers in a multi-threaded setting, to measure data transfer between L2 caches and last level caches of the Xeon server. Table A.4, available in the online supplemental material, displays the data transfer amounts incurred by the proposed partitoning-based methods normalized with respect to those of BinP. As seen in Table A .4, available in the online supplemental material, the proposed locality-exploiting methods achieve up to 2.63x less data transfers than BinP, on the overall average. Comparison of Tables A. 4 , available in the online supplemental material, and 4 show that data transfer amounts incurred by the partitioning-based methods correlate with the attained GFlops performances. For example, on the overall average, BinP incurs 2.44x more data transfers than HP, whereas HP performs 2.18x better than BinP. In conclusion, the cutsize minimization objective in the proposed methods successfully achieve reducing pressure on memory and caches. Table 5 is introduced to compare the partitioning overheads of the proposed methods, as well as BinP. For each SpGEMM instance, the partitioning time of the respective method in the host machine (Xeon) is divided by the parallel SpGEMM time obtained by MKL on Xeon Phi and averages of these normalized values over the matrix categories are reported in the table. For BGCe, which uses multithreaded implementation, running times for the number of threads that achieve the lowest time are used. As seen in Tables 3 and 5 , BinP performs considerably better than MKL in terms of parallel SpGEMM performance, whereas the preprocessing overhead of BinP amortizes for only one SpGEMM operation. Hence BinP is a very simple yet effective heuristic that can easily be integrated into existing libraries.
Partitioning Overhead versus SpGEMM Performance
The comparison of HP and BGPv is as follows: As seen in Table 5 , BGPv incurs significantly higher partitioning overhead than both HP and BGPe. Although BGPv achieves a similar parallel SpGEMM performance as HP, it is not recommended because of its significantly higher preprocessing overhead. The comparison of HP and BGPe is as follows: Although bipartite graph and hypergraph models are of similar size (same number of edges and pins), HP incurs higher preprocessing overhead than BGPe as seen in Table 5 . This is due to the fact that graph partitioning is considerably faster than hypergraph partitioning in general. As BGPe shows a close SpGEMM performance to HP on the average, BGPe can be considered as a good alternative to HP because of its considerably lower preprocessing overhead.
The comparison of HP and HC is as follows: In terms of parallel SpGEMM performance, for C ¼ AA and C ¼ A B categories, the average performance of HC is close to that of HP as seen in Table 3 . In terms of partitioning time, on the average, HC runs approximately 10x faster than HP as seen in Table 5 . Hence the fast bottom-up clustering approach is recommended for C ¼ AA and C ¼ A B categories instead of HP.
The comparison of BGPe and BGCe is as follows: In terms of parallel SpGEMM performance, BGCe performs significantly worse (31 percent worse) than BGPe, on the overall average. On the other hand, BGCe runs approximately 23x faster than BGPe, on the overall average. For example, running time of BGCe is no more than the running time of a single MKL SpGEMM time for the C ¼ AB category.
CONCLUSION
We investigated row-by-row formulation of the sparse matrix-matrix multiplication (SpGEMM) operation of the form C ¼ A B for locality-aware parallelization on many-core architectures. We proposed a hypergraph model and a bipartite graph model for 1D rowwise A-matrix partitioning to exploit locality in accessing B-matrix rows. In the partitioning methods utilizing these models, the partitioning constraint corresponds to maintaining balance on the computational loads of threads, whereas the partitioning objective relates to reducing data transfer amount from memory and between caches. For both hypergraph and bipartite graph models, we proposed bottom-up clustering methods, which were experimentally shown to be producing reasonably good partitions while being significantly faster than the respective partitioning methods.
We also investigated how to exploit locality in accessing entries of temporary arrays utilized by threads during accumulation of results for C-matrix rows. We proposed a hypergraph and a graph model for B-matrix column and row reordering to exploit spatial and temporal localities in these operations, respectively.
We tested the validity of the proposed methods on a wide range of realistic SpGEMM instances from three different categories of C ¼ AA T , C ¼ AA, and C ¼ A B. Experimental results showed the validity of the proposed methods. These results also showed that Xeon Phi and Xeon processors can benefit from locality enhancements for sparse and irregular applications through intelligent partitioning and reordering.
