Exploiting Locality in Sparse Matrix-Matrix Multiplication on Many-Core Architectures by Akbudak K. & Aykanat C.
Exploiting Locality in Sparse Matrix-Matrix
Multiplication on Many-Core Architectures
Kadir Akbudak and Cevdet Aykanat
Abstract—Exploiting spatial and temporal localities is investigated for efficient row-by-row parallelization of general sparsematrix-matrix
multiplication (SpGEMM) operation of the formC¼AB onmany-core architectures. Hypergraph and bipartite graphmodels are
proposed for 1D rowwise partitioning of matrixA to evenly partition the work across threadswith the objective of reducing the number of
B-matrix words to be transferred from thememory and between different caches. A hypergraphmodel is proposed forB-matrix column
reordering to exploit spatial locality in accessing entries of thread-private temporary arrays, which are used to accumulate results for
C-matrix rows. A similarity graphmodel is proposed forB-matrix row reordering to increase temporal reuse of these accumulation array
entries. The proposedmodels andmethods are tested on a wide range of sparsematrices from real applications and the experiments
were carried on a 60-core Intel Xeon Phi processor, aswell as a two-socket Xeon processor. Results show the validity of themodels and
methods proposed for enhancing the locality in parallel SpGEMMoperations.
Index Terms—Data locality, sparsematrix, sparsematrix-matrix multiplication, SpGEMM, computational hypergraphmodel, hypergraph
partitioning, hypergraph clustering, graphmodel, bipartite graphmodel, graph partitioning, graph clustering,many-core architecture, Intel XeonPhi
Ç
1 INTRODUCTION
1.1 Applications Involving SpGEMM
GENERAL sparse matrix-matrix multiplication (SpGEMM)of the form C¼AB is an important kernel in various
applications such as molecular dynamics simulations [1], [2],
[3], [4], [5], [6], [7], finite element simulations based ondomain
decomposition [8], [9], linear programming (LP) [10], [11],
[12], multigrid interpolation and restriction [13], breadth-first
search from multiple source vertices [14], finding all-pair
shortest-paths [15], similarity join [16], data summariza-
tion [17], and item-to-item collaborative filtering in recom-
mendation systems [18].
Some of the above-mentioned applications require
repeated SpGEMM operations which involve matrices with
same sparsity patterns but differing numerical values: Solu-
tion of LP problems through iterative interior point meth-
ods that use normal equations constitutes a typical example
of such applications. These methods solve positive-definite
linear systems of the form ðAD2AT Þ x ¼ b at each iteration,
where A is the original constraint matrix and D is a positive
diagonal matrix which varies with each iteration. Direct
solvers [10], [11], [12] that utilize Cholesky factorization as
well as iterative solvers [11] that utilize preconditioners
require explicitly forming the coefficient matrix at each iter-
ation through the SpGEMM computation C¼AB, where
the sparsity patterns of both A and B ¼ D2AT remain the
same throughout the iterations.
Similarity join and collaborative filtering applications
may also involve repeated SpGEMM operations. In the self
similarity join application that involves C ¼ AA and in the
similarity join of two different sparse data sets that involve
C ¼ AB, different weightings can be defined on the relative
importance of features that will incur repeated SpGEMM
operations of the forms C ¼ AWA and C ¼ AWB for differ-
ent W matrices. Here, W is a diagonal matrix that contains
values for relative ranking of the features in the data sets. In
item-to-item collaborative filtering in recommendation sys-
tems, given a similar-items table, which is A, SpGEMM is
performed in order to find items similar to each of the user’s
preferences, which are stored in B, and then the system
aggregates those items and recommends the most popular
or correlated items. This application may incur repeated
SpGEMM operations of the form of C ¼ AWB, whereW is a
diagonal matrix that adjusts importance of items in filtering.
In this work, we propose matrix partitioning and row/
column reordering schemes for exploiting temporal and spa-
tial localities in row-by-row parallelization of SpGEMM on
many-core architectures. All of the above-mentioned appli-
cations will benefit from these matrix partitioning and reor-
dering methods, whereas the preprocessing overhead due to
these intelligent partitioning operations will amortize espe-
cially for the applications that involve repeated SpGEMM
operations.
1.2 Parallel SpGEMM Algorithms
Parallel SpGEMM schemes can be broadly classified into
four categories depending on the following 1D GEMM for-
mulations [19]: Outer product, inner product, column-by-
column, and row-by-row. In the outer-product paralleliza-
tion, an atomic task is defined as the outer product of
column i of A with row i of B, so that each thread is held
responsible for computing one or more outer products. The
outer-product formulation is reported to lead to scalable
 The authors are with Computer EngineeringDepartment, Bilkent University,
Ankara 06800, Turkey. E-mail: {kadir, aykanat}@cs.bilkent.edu.tr.
Manuscript received 22 Dec. 2015; revised 15 Jan. 2017; accepted 16 Jan.
2017. Date of publication 23 Jan. 2017; date of current version 14 July 2017.
Recommended for acceptance by T. Hoefler.
For information on obtaining reprints of this article, please send e-mail to:
reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TPDS.2017.2656893
2258 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 8, AUGUST 2017
1045-9219 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
parallelization on distributed-memory architectures [20],
[21], [22]. However, this SpGEMM scheme requires sparse
accumulation operations for obtaining the output matrix
from the partial results of concurrent thread computations
on shared-memory architectures. These operations necessi-
tate complex indexing schemes to efficiently identify the con-
tributions to the same matrix entries by different threads.
Although there exists an indexing scheme proposed for
outer-product kernel on distributed-memory architec-
tures [20], this scheme does not seem to be viable for many-
core architectures due to the overhead of many concurrent
writes.
In the inner-product parallelization, an atomic task is
defined as the inner product of row i of A with column j of
B, so that each thread is held responsible for computing one
or more inner products. This scheme requires merging two
sparse vectors which is not efficient because of large
number of element accesses per floating-point operation
(Flop) [23]. Therefore, this scheme is also not viable for cur-
rent many-core architectures with deep cache hierarchies.
In the column-by-column parallelization, an atomic task
is defined as the post-multiply of the whole A matrix with a
column of B, so that each thread is held responsible for
computing one or more post-multiplies. In the row-by-row
parallelization, an atomic task is defined as the pre-multiply
of a row of Awith the whole Bmatrix, so that each thread is
held responsible for computing one or more pre-multiplies.
Both schemes share the nice property of not requiring con-
current writes as opposed to the outer-product formulation.
The former and latter schemes can complete the computa-
tion of a column and a row of the output matrix at a time,
respectively. Both schemes also share the nice property of
performing all multiplications related with a nonzero of one
of the two input matrices as opposed to the inner-product
formulation. So, both of these two schemes are found to be
viable for shared-memory parallelism.
1.3 Related Work
There exist several recent works on SpGEMM paralleliza-
tion on distributed-memory architectures [7], [20], [21], [22],
[24], [25], [26], [27]. Here, we discuss the related work on
SpGEMM parallelization on shared-memory architectures
according to the above-mentioned categorization.
Sulatcyke and Ghose [28] examine the impact of data
reuse on the SpGEMM performance. They consider outer-
product and row-by-row formulations and report that the
row-by-row formulation is more amenable for data reuse
opportunities. They also report that the row-by-row paralle-
lization does not need any synchronization among threads
so it is more eligible for shared-memory parallelization.
There are successful GPU libraries, cuSPARSE [29] and
Cusp [30], which perform SpGEMM. cuSPARSE uses row-
by-row formulation as its top-level parallelism. A hash table
is used for accumulating partial results for each row of the
output matrix. In Cusp [31], elements of input matrices are
accessed via row-by-row scheme, but the partial results for
the output matrix are stored in an intermediate list consist-
ing of row, column, and value tuples. This list is sorted and
duplicate column indices are compressed to compute the
final output matrix. The SpGEMM operation in Cusp is fur-
ther enhanced by the works [32] and [33].
Matam et al. [34] investigate row-by-row and outer-
product parallelization schemes, which utilize blocking to
decrease the size of the temporary accumulation array used
to accumulate the result of each pre-multiply and outer prod-
uct, respectively. They report that the row-by-row paralleli-
zation performs better than the outer-product parallelization.
A similar blocking approach is also utilized in [35] to decrease
the size of the temporary accumulation array.
Liu and Vinter [36] use row-by-row parallelization. They
utilize progressive allocation of C matrix for memory effi-
ciency, computing final C-matrix rows via merging partial
results, and partitioning rows of C with respect to their non-
zero density for better load balancing.
Gremse et al. [37] uses row-by-rowparallelization. Instead
of using accumulation array, W (subwarp size) results are
merged at a time via utilizing a warp reduction method,
which uses only hardware registers.
Siegel et al. [38] use inner-product formulation for coarse-
grained parallelism among tasks and row-by-row formula-
tion for fine-grained parallelism within a task. Task sharing
approach is used in [38] for dynamic load balancing among
nodes andGPUs of a node.
The above-mentioned works on SpGEMM do not exploit
sparsity patterns of input or output matrices for efficient
parallelization via reducing cache misses. Although there
exist models and methods that utilize sparsity pattern for
exploiting locality in sparse matrix-vector multiplica-
tion [39], [40], [41], the literature lacks locality exploiting
methods for SpGEMM operations. In this work, we propose
models and matrix partitioning/reordering methods, which
utilize matrix sparsity pattern in order to exploit data local-
ity in parallel SpGEMM operation.
1.4 Contributions
We propose a hypergraph model and a bipartite graph
model for encapsulating computational and B-matrix trans-
fer requirements of thread-level row-by-row parallelization
of SpGEMM operation that is based on rowwise A-matrix
partitioning. We show that the minimization objectives of
partitioning the hypergraph and bipartite graph models
relate to reducing data transfers from memory or another
cache. We devise fast bottom-up clustering methods utiliz-
ing both proposed hypergraph and bipartite graph models
in order to decrease the preprocessing overhead.
We also investigate how to exploit locality in accumula-
tion of partial results forC-matrix rows. For this purpose, we
propose a spatial hypergraph model and a temporal graph
model for reordering B-matrix columns and rows, respec-
tively. For the same purpose, we also show how to encode an
existing A-matrix partition as a row/column reordering of
theBmatrix for the special cases ofC ¼ AA andC ¼ AAT .
The validity of the proposed matrix partitioning and
row/column reordering methods are extensively experi-
mented on real SpGEMM instances using many-core archi-
tectures Intel Xeon Phi and Xeon.
We should note that the column-by-column and row-by-
row parallelization schemes, both of which are found to be
viable for shared-memory parallelization, can be considered
as dual of each other. So, the discussion for the column-by-
column parallelization directly follows the discussion given
for the row-by-row parallelization.
AKBUDAK AND AYKANAT: EXPLOITING LOCALITY IN SPARSE MATRIX-MATRIX MULTIPLICATION ON MANY-CORE ARCHITECTURES 2259
The rest of the paper is organized as follows: The row-by-
row parallelization of SpGEMM and its data locality proper-
ties are given in Sections 2 and 3, respectively. The hyper-
graph and bipartite graph models for exploiting temporal
locality in accessing B-matrix rows are introduced in
Section 4. In Section 5, we present the models and methods
for exploiting spatial and temporal localities in accessing
entries of accumulation arrays. The experimental results are
presented in Section 6. Finally, we conclude the paper in
Section 7. The supplemental material contains Appendix
Sections A–D, Algorithm A.1, Tables A.1–A.4, and Fig. A.1,
which can be found on the Computer Society Digital
Library at http://doi.ieeecomputersociety.org/10.1109/
TPDS.2017.2656893.
2 THE PARALLEL ALGORITHM
The row-by-row parallelization is based on one-dimensional
conformable rowwise partitioning of the input matrix A and
the outputmatrixC as follows:



















In (1), submatrices Ak and Ck denote the kth row slices of the
reordered A and C matrices, respectively, where K denotes
the number of parts. Here, P denotes the permutationmatrix
obtained from partitioning. The use of the same permutation
matrix for row reorderings of A and C shows that a rowwise
partition on A induces a conformable rowwise partition on
C. The reordered Â and Ĉ will be referred to asA andC.
Algorithm 1. The Thread-Level Row-by-Row Paralleliza-
tion of SpGEMMOperation for Numeric Multiply
Require: Ak, B, and Ck matrices in CSR format
1: //Each iteration performs Ck¼AkB
2: for k ¼ 1 toK in parallel do
3: //Each iteration performs ci;¼ai;B
4: for each row ai; of Ak do
5: //Init thread-private accumulation array
6: for each nonzero ci;j in row ci; do
7: acc½j ¼ 0:0
8: for each nonzero ai;j in row ai; do
9: acc ¼ accþ ai;j bj;
10: //Gather dense array to sparse storage
11: for each nonzero ci;j in row ci; do
12: ci;j ¼ acc½j
The input and output data partitioning given in (1) leads
to the row-by-row algorithm as shown in Algorithm 1,
where the output matrix C is computed as,
Ck ¼ AkB; for k ¼ 1; 2; . . . ; K: (2)
Each submatrix-matrix multiply Ck¼AkB is assigned to a
thread that will be executed by an individual core as also
depicted by the “for ... in parallel do” construct in line 2 of
Algorithm 1. This algorithm uses CSR (compressed storage
by rows) scheme for storing both input matrices and the
output matrix. In lines 6–12, row i of Ak (ai;) is pre-
multiplied by B and the result of this multiplication is writ-
ten to row i of C (ci;). Each thread allocates a private accu-
mulation array acc (i.e., a simple 1D dense array) of size N
for efficient computation of pre-multiplies, where N is the
number of columns of B. In line 6, acc is initialized to zero
and then a pre-multiply is performed by the for-loop starting
in line 8. In line 11, the final values for ci; are gathered from
acc and stored in compressed row storage of C. A symbolic
SpGEMM operation is performed prior to the numeric mul-
tiplication for identifying the column indices of nonzeros in
row ci;. The pseudo-code of symbolic multiplication is
given in Algorithm A.1 of Appendix A, available in the
online supplemental material.
Fig. 1 shows an SpGEMM instance to demonstrate the
atomic task of computing row i of the C matrix according to
row-by-row formulation. As seen in the figure, row i ofA con-
tains three nonzeros at columns x, y, and z. Each of B-matrix
rows x, y, and z contains two nonzeros and these six nonzeros
are spread over three columns j, k, and m. Thus, they incur
the addition of the results of six scalarmultiply-and-add oper-
ations to the jth, kth, andmth entries of the acc array. Each of
three B-matrix rows has a nonzero at column k, which incurs
the addition of the results of three scalar multiply-and-add
operations to the same entry of the acc array.
3 DATA LOCALITY
We present spatial and temporal locality characteristics of
the row-by-row parallelization given in Algorithm 1. Dis-
cussions presented here are valid for private caches and
also for processors with private and shared caches existing
at different levels of the memory hieararchy.
Spatial locality in accessing A-matrix nonzeros is simply
achieved by storing nonzeros of each row consecutively
using CSR and processing nonzeros of A consecutively.
Temporal locality in accessing A-matrix nonzeros is not fea-
sible since each A-matrix nonzero is accessed once.
Spatial locality in accessing B-matrix nonzeros is par-
tially achieved by storing nonzeros of each row consecu-
tively using CSR. However, rows of B are not processed
consecutively and the processing order is determined by
sparsity patterns and processing order of A-matrix rows.
The performance improvement due to exploiting spatial
locality is expected to be negligible because at most one
extra cache miss can be avoided for each B-matrix row. So
this locality is not considered in the paper.
Temporal locality in accessing B-matrix nonzeros
(together with their indices) is feasible since B-matrix non-
zeros are accessed multiple times. Each nonzero in row x of
B is accessed nnzða;xÞ times. Here, nnzðÞ denotes the
Fig. 1. Row-by-row formulation based SpGEMM.
2260 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 8, AUGUST 2017
number of nonzeros in a row, a column or the whole of the
matrix. Temporal locality in accessing B-matrix nonzeros
can be exploited through clustering A-matrix rows that are
similar in terms of the number of nonzeros of required
B-matrix rows, where each cluster constitutes a row-slice of
the reordered A matrix. Reuse of B-matrix nonzeros can be
achieved at a coarse level of reuse ofB-matrix rows.
The access pattern to the accumulation array is deter-
mined by the sparsity pattern of B as well as the access
order of B-matrix rows, which is induced by the processing
order of A-matrix rows. Spatial locality in accessing the
accumulation array entries can be exploited through reor-
dering B-matrix columns with similar sparsity patterns
nearby. Temporal locality in accessing the accumulation
array entries can be exploited through reordering of
B-matrix rows that are accessed within the same coarse-
grain thread-level task with similar sparsity patterns
nearby. This is because B-matrix rows with similar sparsity
patterns are likely to access the same accumulation array
entries during the execution of the coarse-grain task.
4 A-MATRIX PARTITIONING FOR TEMPORAL
LOCALITY IN ACCESSING BMATRIX
For exploiting temporal locality in accessing B-matrix rows,
we propose and discuss two models for conformable row-
wise partitioning of A and C matrices.
4.1 Hypergraph ModelHTA
A given SpGEMM computation C¼AB is represented as a
temporal hypergraph HTA ðA; fnnzðbx;ÞgxÞ ¼ ðV;NÞ. Here,
HTA ðA; fnnzðbx;ÞgxÞ refers to the fact that the sparsity pat-
tern of matrix A determines the topology of the proposed
hypergraph model, whereas the nonzero counts of B-matrix
rows determine vertex and net weights. The superscript
“T” denotes temporal locality.
In HTA , V contains a vertex vi for each row i of A so that it
represents the atomic task
ci; ¼ ai;B; (3)
of pre-multiplying row i (ai;) of A with the B matrix to
compute row i (ci;) of C. Note that this atomic task defini-
tion effectively means that vertex vi also represents row i of
the C matrix according to owner computes rule. Hence, we
associate vertex vi with a weight wðviÞ proportional to the
computational load of this pre-multiply in terms of scalar





where colsðai;Þ denotes the column indices of the nonzeros
in row i of A.
N contains a net (hyperedge) nx for each row x of B. Net
nx connects vertices corresponding to rows that have non-
zeros at column x of A. So, the vertices connected by nx cor-
respond to the atomic tasks that need to access row x of B.
Hence, we associate net nx with a weight wðnxÞ proportional
to the cost of transferring B-matrix row x from memory or
another cache, that is,
wðnxÞ ¼ nnzðbx;Þ: (5)
Fig. 2 illustrates the input and output dependency of the
proposed hypergraph model. In this figure, as well as in
Fig. 4, circles represent vertices, whereas dots represent
nets. As seen in Fig. 2, nx connects vertices vi and vj,
whereas ny connects vertices vi, vj, and vk. So net nx encodes
the need of accessing B-matrix row x for the atomic tasks of
computing rows i and j of C, whereas net ny encodes the
need of accessing B-matrix row y for the atomic tasks of
computing rows i, j, and k of C. Vertex vi connected by
both nx and ny shows that the atomic task of computing
row i of C needs to access both rows x and y of B, which in
turn shows the need for an accumulation operation depend-
ing on the sparsity patterns of rows x and y of B.
Consider a K-way partition PðVÞ ¼ fV1;V2; . . . ;VKg of
vertices of HTA , where parts are mutually exclusive and
exhaustive. In PðVÞ, the weight WðVkÞ of a part k is defined




AK-way partitionPðVÞ is decoded in such a way that the
set of atomic tasks (fine-grain tasks) corresponding to the
vertices in a part ofPðVÞ constitutes a coarse-grain task to be
executed by a distinct thread. That is, vertex part Vk denotes
the submatrix-matrix multiply Ck¼AkB to be executed by
a distinct thread, where Ak and Ck are the submatrices
respectively formed by A-matrix and C-matrix rows that are
represented by the vertices in Vk. Without loss of generality,
we assume that each core executes only one distinct thread.
K is selected such that the storage size of the B-matrix
rows required by each submatrix-matrix multiply Ck¼AkB
together with the storage sizes of Ck and Ak matrices are
below the size of the cache of a single core. Thus, theworking
set of each iteration of the for-loop starting at line 2 of Algo-
rithm 1 fits into the cache of a single core due to thread-level
coarse-grain task definition induced by the partitioning.
The weight of a part corresponds to the computational
load of a thread-level coarse-grain task. So, the partitioning
constraint of maintaining balance on the part weights corre-
sponds to maintaining balance on the computational loads
of thread-level coarse-grain tasks. This constraint corre-
sponds to reducing overall execution time for arbitrary
coarse-grain task scheduling.
In a K-way partition PðVÞ of HTA , a net nx is said to con-
nect a part if nx connects at least one vertex in that part. The
connectivity conðnxÞ of nx is the set of parts connected by
nx. The cutsize of PðVÞ is defined as
Fig. 2. Hypergraph model for exploiting temporal locality in accessing
B-matrix rows during thread-level row-by-row parallelization of SpGEMM
operation.





Here, we present the following discussion to show that the
cutsize of a given partition PðVÞ of HTA is equal to the num-
ber of B-matrix words to be transferred from the memory
and between different caches under the following assump-
tions: single-word line, fully associative cache, and single
thread per core.
Consider a net nx with connectivity conðnxÞ in P. Let T x
denote the set of jT xj ¼ jconðnxÞj threads that will execute
the coarse-grain tasks corresponding to the vertex parts in
conðnxÞ on different cores of the system. Without loss of
generality, let t be the first thread to be executed in T x.
Thread t will transfer the B-matrix row x from the mem-
ory, whereas the other threads in T x will transfer B-matrix
row x either from the memory or from the cache of another
core. Hence the amount of data transfer due to accessing
B-matrix row x, which is represented by net nx, will be
equal to wðnxÞjconðnxÞj words. Here, the costs of transfer-
ring from memory and another cache are assumed to be
equal based on the results reported in [43]. Thus
cutsizeðPÞ according to (6) will show the total amount of
data transferred due to accessing B-matrix rows under the
above-mentioned assumptions. It is clear that the amount
of data transfer will reduce for the general cases of multi-
word line and multiple threads per core. On the other
hand, the amount of data transfer will increase for the gen-
eral case of set associative caches because of the possibility
of the conflict misses, whereas fully associative caches will
incur only capacity misses. So, for the general case, the par-
titioning objective of minimizing the cutsize relates to min-
imizing the amount of data transfer due to accessing
B-matrix rows.
In Appendix B, available in the online supplemental
material, we also show that the objective of minimizing the
cutsize also relates to maximizing B-matrix row reuse.
Fig. 3 shows a sample SpGEMM computation C¼AB,
where A and B are 11 by 9 and 9 by 8 matrices with 27 and
26 nonzeros, respectively, and C is an 11 by 8 matrix with
57 nonzeros. Fig. 4 shows the hypergraph model HTA that
represents the sample SpGEMM computation instance
given in Fig. 3. As seen in Fig. 4, HTA contains 11 vertices,
9 nets, and 27 pins. Note that a connection between a net
and a vertex is said to be pin of that net. As also seen in the
figure, wðv5Þ ¼ 14 since row 5 of A contains nonzeros at col-
umns 2, 3, 5, and 6, and nnzðb2;Þ þ nnzðb3;Þþ nnzðb5;Þþ
nnzðb6;Þ ¼ 3þ 3þ 4þ 4 ¼ 14.
Fig. 4 also shows a 3-way partition PðVÞ ofHTA , and Fig. 5
shows the 3-way partition of the sample input and output
matrices induced by this PðVÞ. In PðVÞ, WðV1Þ ¼ wðv1Þþ
wðv5Þ þ wðv7Þ ¼ 6þ 14þ 7 ¼ 27. Similarly, WðV2Þ ¼ 22 and
WðV3Þ ¼ 31. So PðVÞ incurs 16 percent load imbalance.
In the 3-way partition PðVÞ given in Fig. 4, there are four
cut nets n1, n3, n5, and n9, whereas the remaining five nets
are uncut. Note that a net is said to be cut if it connects more
than one part, uncut otherwise. As seen in the figure, net n3
connects all vertex parts, and hence jconðn3Þj ¼ 3. Conse-
quently, cut net n3 incurs wðn3Þjconðn3Þj ¼ 3  3 ¼ 9 to the
cutsize, which is equal to the amount of data transfer due to
accessingB-matrix row 3, under thementioned assumptions.
4.1.1 Hypergraph Partitioning (HP)
The state-of-the-art hypergraph partitioning tools such as
PaToH [44] and hMeTiS [45] support the following cutsize
metrics: hyperedge/net-cut [44], [45], “connectivity-1” met-
ric [44], [46], [47] and sum of external degrees (SOED) [45].
None of these metrics directly encode the cutsize definition
given in (6) for the proposed hypergraph model. On the
other hand, it can easily be shown that there exist a constant
difference of nnzðBÞ between the cutsize definition in (6)
and the cutsize definition according to the “connectivity-1”
metric. So, “connectivity-1” cutsize metric of PaToH can be
safely used for minimizing the connectivity cutsize defini-
tion given in (6).
4.1.2 Fast Bottom-Up Hypergraph Clustering (HC)
In order to reduce the preprocessing overhead due to the
top-down partitioning using PaToH, we propose a method
that performs bottom-up clustering on the proposed hyper-
graph model. This bottom-up clustering method exploits
the multi-level coarsening phase of PaToH in such a way
Fig. 5. Matrices A and C partitioned according to the partition PðVÞ of
HTA given in Fig. 4 and matrix B.Fig. 3. A sample SpGEMM computation.
Fig. 4. Hypergraph model for the sample SpGEMM instance given in
Fig. 4 and its 3-way partition.
2262 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 8, AUGUST 2017
that coarsening is continued until the number of superverti-
ces in a level becomes smaller than or equal to K0 ¼ 1:30K.
Note that each supervertex in the last level is decoded as a
vertex part/cluster in such a way that the set of atomic tasks
(fine-grain tasks) constituting that supervertex is taken as a
coarse-grain task to be executed by a distinct thread.
The proposed method uses a multi-level matching-based
clustering scheme. As the matching selection criteria,
“absorption metric” and a variant of “scaled heavy connec-
tivity metric” (SHCM) are implemented. The SHCM-variant
used in this experimentation corresponds to the generalized
Jaccard similarity metric [48]. This bottom-up method estab-
lishes a tradeoff between the solution quality and the pre-
processing time and it will be referred to as hypergraph
clustering HC.
HC aims at clustering vertices with similar net connectiv-
ities. This corresponds to clustering A-matrix rows that are
similar in terms of the number of nonzeros of the required
B-matrix rows to the same cluster, thus exploiting temporal
locality in accessing B-matrix nonzeros.
HC does not enforce any explicit constraint for maintain-
ing balance on clusters. So this schememay incur load imbal-
ance for instances with high atomic-task weight variation. In
order to partially alleviate this bottleneck, the coarse-grain
tasks obtained from partitioning are sorted in decreasing
order with respect to their coarse-grain task weights and
they are given to the OpenMP scheduler in this order.
4.2 Bipartite Graph Model BTA
A given SpGEMM computation C¼AB is represented as a
temporal bipartite graph BTA ðA; fnnzðbx;ÞgxÞ ¼ ðV ¼ VA [
V B; EÞ. VA contains a vertex vi for each row i of A and VB
contains a vertex vx for each row x of B. E contains an edge
ei;x between vi 2 VA and vx 2 VB if ai;x is nonzero. So the
adjacency lists of the row and column vertices represent
the sparsity patterns of the respective rows and columns.
Note that BTA and HTA are topologically equivalent, where
there exist one-to-one correspondences between vertices of
VAðBTA Þ and vertices of VðHTA Þ, between vertices of VBðBTA Þ
and nets of NðHTA Þ, and between edges of EðBTA Þ and pins of
PinsðHTA Þ. Fig. 6 illustrates the input and output depen-
dency of the proposed bipartite graph model.
Atomic task and weight definitions associated with
the vertices of VðHTA Þ in (3) and (4) directly apply to the
vertices of VAðBTA Þ, and vertices of VBðBTA Þ are associated
with zero weight since they do not incur any computa-
tion. That is,
vi 2 VA: wðviÞ ¼
X
x2colsðai;Þ
nnzðbx;Þ; vx 2 VB: wðvxÞ ¼ 0: (7)
A K-way vertex partition of BTA induces a K-way parti-
tion on both VA and VB. PðVAÞ of BTA is decoded in a way
similar to decoding PðVÞ of HTA . That is, the set of atomic
tasks (fine-grain tasks) corresponding to the vertices in a
part of PðVAÞ constitutes a coarse-grain task to be executed
by a distinct thread. So, the partitioning constraint of main-
taining balance on the part weights corresponds to main-
taining balance on the computational loads of thread-level
coarse-grain tasks. PðVBÞ is not decoded since vertices in
VB do not signify any computation.
4.2.1 Edgecut Formulation (BGPe)
We associate each edge ei;x with a weight equal to
wðei;xÞ ¼ nnzðbx;Þ: (8)





relates to minimizing the amount of B-matrix rows to be
transferred from the memory and between different caches
under the assumption that accessing each B-matrix row
always incurs compulsory miss(es), i.e., there is no reuse of
B-matrix rows. In other words, minimizing the edgecut cor-
responds to minimizing a loose upper bound on the number
of cache misses due to accessing B-matrix rows.
4.2.2 Totalv Formulation (BGPv)
The above-mentioned edgecut formulation is based on the
fact that this objective is widely supported by almost all
graph partitioning tools. Here, we propose a formulation that
encodes the partitioning objective of minimizing the “total





Here, Vb denotes the set of boundary vertices, where a ver-
tex is said to be boundary if it is incident to at least one cut
edge. For a vertex v 2 Vk, NadjðvÞ denotes the number of
parts other than Vk that the vertices adjacent to v (i.e.,
AdjðvÞ) belong to. In (10), sðvÞ denotes the size of vertex v so
that MeTiS assumes a weight wðvÞ and a size sðvÞ for each
vertex v, where edges are unweighted.
In the proposed totalv formulation, the vertex weight
assignment is as same as in the edgecut formulation
(see (7)), whereas vertex size assignment is as follows:
vi 2 VA : sðviÞ ¼ 0; vx 2 VB : sðvxÞ ¼ nnzðbx;Þ: (11)
This vertex size assignment is based on establishing a one-
to-one correspondence between vertices of VBðBTA Þ and nets
of NðHTA Þ. Note that NadjðvxÞ is equal to jconðvxÞj  1 for
vx 2 VB, where conðvxÞ to be the set of parts that the vertices
in fvx [AdjðvxÞg reside.
Fig. 6. Bipartite graph model for exploiting temporal locality in accessing
B-matrix rows during thread-level row-by-row parallelization of SpGEMM
operation.
AKBUDAK AND AYKANAT: EXPLOITING LOCALITY IN SPARSE MATRIX-MATRIX MULTIPLICATION ON MANY-CORE ARCHITECTURES 2263
It can easily be shown that, for the same partition P on
VðHTA Þ  VAðBTA Þ, we have a constant difference of nnzðBÞ
between cutsizeðPðV ðHTA ÞÞÞ and totalvðPðVðBTA ÞÞÞ. So, the
totalv-based partitioning objective of the bipartite graph
model becomes equivalent to the partitioning objective of
the hypergraph model given in Section 4.1.
4.2.3 Fast Bottom-Up Graph Clustering (BGCe)
In order to reduce the preprocessing overhead due to the
top-down partitioning, we propose a method that performs
bottom-up clustering on the proposed bipartite graph mod-
els. Similar to the HC method proposed in Section 4.1.2, this
method exploits only the multi-level coarsening phase of
the multi-threaded graph partitioning tool mt-MeTiS [50].
The initial bipartitioning and refinement phases of mt-
MeTiS are omitted. This method will be referred to as bipar-
tite graph clustering BGCe since mt-MeTiS performs vertex
matching according to weights of the connecting edges.
5 B-MATRIX ROW/COLUMN REORDERING FOR
LOCALITY IN ACCESSING ACC ARRAY
In Sections 5.1 and 5.2, we respectively propose a spatial
graph model and a temporal hypergraph model for preserv-
ing locality in accessing acc array. A K00-way vertex parti-
tion of both models is used to induce a partial ordering on
either the rows or columns of the B matrix as follows: The
rows/columns corresponding to the vertices in Vkþ1 are
ordered after the rows/columns corresponding the vertices
in Vk for k ¼ 1; 2; . . . ; K00  1. In both reordering schemes,
K00 is selected such that each vertex part is sufficiently small.
5.1 Spatial Hypergraph ModelHSB for C¼AB
In order to exploit spatial locality in accessing acc-array
entries, the B-matrix columns with similar sparsity patterns
should be reordered nearby. Furthermore, the sparsity pat-
tern similarity among B-matrix columns should be
weighted according to the access counts of the B-matrix
rows that contain nonzeros in those columns. So, we pro-
pose to represent the pattern of accessing acc-array entries
as a spatial hypergraph model HSBðB; fnnzða;xÞgxÞ¼ðV;NÞ
for column reordering of B matrix. Here, the superscript
“S” stands for spatial locality. In HSB , V contains a vertex vj
for each column j of matrix B. We associate vertices with
unit weights. N contains a net nx for each row x of matrix
B. We associate each net nx with a weight wðnxÞ equal to the
number of times B-matrix row x will be accessed during
C¼AB. That is,
wðnxÞ ¼ nnzða;xÞ: (12)
Fig. 7a illustrates the proposed model.
Let a single cache line contain L words. Consider a parti-
tion P of vertices ofHSB , where each part contains L vertices.
Assume that L number of B-matrix columns represented by
vertices in a part are ordered consecutively so that corre-
sponding acc-array entries are stored in consecutive loca-
tions of the memory (i.e., in the same cache line) and
accessed consecutively by the SpGEMMalgorithm. Consider
a net nx 2 HSB with connectivity set conðnxÞ and with weight
wðnxÞ. This net nx means that the B-matrix row x will be
transferred frommemory or another cachewðnxÞ times. Each
time this row is transferred and accessed during multiplica-
tion, jconðnxÞj number of lines containing acc-array entries
will be accessed also. So, in total, the number of accesses to
acc-array entries will be equal to
P
nx2N wðnxÞjconðnxÞj.
Therefore, the cutsize according to the connectivitymetric (6)
corresponds to the number of accesses to the cache lines con-
taining acc-array entries during C¼AB. Hence, the pro-
posed model exploits spatial locality in accessing acc-array
entries.
5.2 Temporal Graph Model GTB for C¼AB
In order to exploit temporal locality in accessing acc-array
entries, B-matrix rows with similar sparsity patterns should
be reordered nearby. Furthermore, the sparsity pattern sim-
ilarity among B-matrix rows should be weighted according
to the number of accesses to the same acc-arrray entries
within the same coarse-grain thread-level tasks. So, exploit-
ing this locality depends on coarse-grain task set induced
by the rowwise A-matrix partition, PðAÞ ¼ fA1; A2; . . . ;
AKg, obtained via using the models proposed in Section 4.
Thus, B-matrix-row access pattern is represented as a
B-matrix-row similarity graph GTB ðB;PðAÞÞ ¼ ðV; EÞ for con-
formable column and row reordering of matrices A and B,
respectively.
In GTB , V contains a vertex vi for each row i of matrix B.
We associate vertices with unit weights. E contains an edge
ei;j if B-matrix rows i and j have nonzeros in at least one
common column and these rows are accessed together by at
least one coarse-grain task executed by a thread. We associ-
ate each edge ei;j with a weight wðei;jÞ equal to the number
of common columns of B-matrix rows i and j that are
accessed together in the same coarse-grain task, i.e.,
wðei;jÞ¼jfx :a;i; a;j2Ak;x2colsðbi;Þ\colsðbj;Þgj: (13)
Fig. 7b illustrates the proposed temporal graph model.
Consider a partition PðGTB Þ of vertices of the similarity
graph GTB and an edge ei;j 2 GTB with weight wðei;jÞ. This edge
ei;j means that the B-matrix rows i and j involve accesses to
wðei;jÞ same entries of acc array. If B-matrix rows i and j are
always processed close in time, which in turn corresponds to
the case where edge ei;j is uncut and parts ofPðGTB Þ are small
enough, the wðei;jÞ different acc-array entries are likely to be
reused. If ei;j is cut, these wðei;jÞ acc-array entries probably
will not be reused. So, in total, the amount of cache misses
Fig. 7. Models for exploiting spatial and temporal localities in accessing
acc array.
2264 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 8, AUGUST 2017
due to loss of acc-array entry reuse is proportional toP
ei;j2E wðei;jÞ. Therefore, partitioning objective given in (9)
relates tomaximizing the amount of reuse.
5.3 Using A-matrix Partitioning for C¼AAT and
C¼AA
Here, we will show how a rowwise partition of matrix A
obtained by partitioning the two models proposed in
Section 4 can be used to exploit spatial and temporal locali-
ties in accessing acc-array entries during the SpGEMM
operations of the forms C¼AAT [10], [11], [12] and C¼AA
[3], [7], [15], [16], [17]. We should here mention that in both
cases, second input matrices AT and A are stored separately
from the first input matrix A.
A rowwise partition on the A matrix induces a partial
orderingRArow on the rows ofA, where rows in row sliceAkþ1
are ordered after the rows inAk for k ¼ 1; 2; . . . ; K  1. In the
bipartite graph model BTA , the partition PðVBÞ on the
A-matrix columns also induces a partial column ordering
RAcol on the A-matrix columns in a similar manner. In the
hypergraph model HTA , the partition PðVÞ on the A-matrix
rows induces a partial orderingRAcol on theA-matrix columns
as follows: The A-matrix columns corresponding to the
uncut nets of Vkþ1 are ordered after the columns correspond-
ing to the uncut nets of Vk for k ¼ 1; 2; . . . ; K  1, whereas
the columns corresponding to the cut nets are ordered last.
For C¼AAT and C¼AA cases, RArow and RAcol can be used
for reordering rows and columns of B, where B¼AT in the
former case and B¼A in the latter case.
Here, we will briefly discuss how closely RArow and R
A
col
serve the purpose of clustering B-matrix rows and columns
for exploiting localities inC¼AAT andC¼AA. In the hyper-
graph model, the objective of minimizing cutsize directly
and closely relates to clustering vertices with similar net con-
nectivity to the same parts. So, RArow induced by the vertex
partition PðVAÞ closely relates to clustering A-matrix rows
with similar sparsity patterns. However, the partitioning
objective of the hypergraph model indirectly and hence
loosely relates to clustering nets with similar vertex connec-
tivity to the same parts as uncut nets. So, RAcol induced by the
net reordering described above loosely relates to clustering
A-matrix columnswith similar sparsity patterns.
The bipartite graph model has the nice property of induc-
ing RArow and R
A
col based on the vertex reordering obtained
from the respective vertex partitions PðVAÞ and PðVBÞ.
However it may suffer from the relatively loose relation
between the objective of graph partitioning and clustering
vertices with similar adjacency patterns.
C¼AAT Case. We exploit the simple fact of rows and col-
umns of B ¼ AT being columns and rows of A, respectively,
as follows. For exploiting spatial locality, RArow is used for
reordering AT -matrix columns. For exploiting temporal
locality, RAcol is used for reordering A
T -matrix rows. Note
that the ordering obtained on the rows of AT is conformably
applied on the columns of A.
C¼AA Case. We exploit the simple fact that the first and
second input matrices are the same. For exploiting spatial
locality, RAcol is used for reordering columns of second A
matrix. For symmetric matrices, RArow can be a better alterna-
tive for hypergraph models to avoid the disadvantage of
using net ordering. For exploiting temporal locality, there
are two options: In the first option, RAcol is used for reorder-
ing the rows of second A matrix, whereas in the second
option, RArow is used. Note that in both options, the ordering
obtained on the rows of the second A is conformably
applied on the columns of the first Amatrix.
6 EXPERIMENTS
6.1 Data Set
The validity of the proposed methods are tested on three
different categories: C¼AAT , C¼AA, and C¼AB.
The C¼AAT category contains 12 LP constraint matrices,
all of which are selected from the UFL sparse matrix
collection [51].
The C¼AA category contains 17 sparse matrices. Two of
these matrices, cp2k-h2o-e6 and cp2k-h2o-.5e7, are
obtained from the simulation of H2O molecules via using
CP2K’s implementation of KohnSham density functional
theory calculations [3], [21]. The remaining 15 matrices are
selected from the UFL collection. The two matrices, 144
and cage12, represent graphs, which can be used in find-
ing all-pairs shortest-paths [15], self similarity joins of
sparse datasets [16], and summarization of sparse data [17].
Some of the remaining matrices are included in this dataset
because of their use in recent works [36], [37], [42] as syn-
thetic applications.
The C¼AB category for the general SpGEMM case con-
tains 4 SpGEMM instances. This general case arises in two
applications: First application is item-to-item collaborative
filtering [18] in recommendation systems. The first two A
matrices in the C¼AB category, i.e., amazon0302 and
amazon0312, represent similarities between items, and
they are obtained from the UFL collection. The correspond-
ing B matrices are randomly generated according to Zipf
distribution with exponent equal to 3.0. The second applica-
tion arises in similarity joins of two different sparse data-
sets [16]. The remaining two SpGEMM instances in the
C¼AB category represent sparse networks obtained from
the DIMACS Implementation Challenges [52].
In the appendix, available in the online supplemental
material, Table A.1 displays the properties of the 33
SpGEMM instances. For each category, the SpGEMM instan-
ces are listed in alphabetical order by name of the first input
matrix. The properties of SpGEMM instances are displayed
in terms of total number of rows, columns, and nonzeros of
the input matrices, as well as their average and maximum
number of nonzeros per row and column. The properties of
SpGEMM instances are also displayed in terms of the num-
ber of thread-level coarse-grain tasks (K), and statistics of
atomic tasks in terms of number of multiply-and-adds and
kilo bytes (KB).K values are introduced to show the amount
of parallelism in each SpGEMM instance. The “cov” (coeffi-
cient of variation) value of an SpGEMM instance is given as
an indication of the level of irregularity in the atomic task
weights. As seen in Table A.1, available in the online supple-
mental material, C¼AAT instances have much higher “cov”
values than C¼AA instances in general, thus showing the
higher irregularity of LP instances.
For each of the 33 SpGEMM instances, the properties
of the hypergraph and bipartite graph models can also
be extracted from the information available in Table A.1,
AKBUDAK AND AYKANAT: EXPLOITING LOCALITY IN SPARSE MATRIX-MATRIX MULTIPLICATION ON MANY-CORE ARCHITECTURES 2265
available in the online supplementalmaterial. Statistics given
for rows of A matrices correspond to statistics for vertices in
VðHTA Þ and VAðBTA Þ. Similarly, statistics given for columns of
Amatrices correspond to statistics for nets inNðHTA Þ and ver-
tices VBðBTA Þ. Statistics given for rows of B matrices corre-
spond to statistics for nets NðHSBÞ, whereas statistics given
for columns correspond to statistics for vertices in VðHSBÞ.
6.2 Implementation of Partitioning Methods
6.2.1 Proposed Algorithms
A-matrix Partitioning. The hypergraph and bipartite graph
models of the test SpGEMM instances are generated as
described in Sections 4.1 and 4.2. The state-of-the-art tools
PaToH [44] and MeTiS [49] are used for K-way partitioning
of these hypergraphs and bipartite graphs, respectively.
PaToH is used with the PATOH_SUGPARAM_SPEED parame-
ter in order to trade off between the partitioning quality and
the preprocessing overhead due to partitioning. MeTiS
is used with default values except it is made to use multi-
level recursive bisectioning. For the partitioning constraint,
the maximum allowed imbalance threshold is set to be
equal to 0.30 for both PaToH and MeTiS. For the partition-
ing objective, “connectivity-1” cutsize metric is used with
PaToH, whereas both edgecut and totalv metrics are used
with MeTiS.
As the selection criteria in the HC method, “absorption
metric” and a variant of “scaled heavy connectivity metric”
(SHCM) are respectively used for the C¼AA (and C¼AB)
and C¼AAT categories.
The cache size threshold utilized for calculatingK values
for each SpGEMM instance is selected as half of the effective
cache size per thread. This cache size thresholding scheme
is used to alleviate the possibility of capacity misses due to
the small set-associativity (8 ways) during the execution of a
coarse-grain task.
PaToH, MeTiS, HC, and BGCe involve randomized algo-
rithms. So for each SpGEMM instance, these tools/methods
are run five times and the average results are reported in
the following tables.
B-Matrix Row/Column Reordering. For each test
SpGEMM instance, the spatial hypergraph model proposed
in Section 5.1 is partitioned by PaToH using the same param-
eters mentioned above. For hypergraph models,K00 is calcu-
lated such that each part has about 10 vertices. For each
rowwise A-matrix partition of each test SpGEMM instance,
the temporal graph model proposed in Section 5.2 is parti-
tioned using MeTiS three times, and the average result is
reported. MeTiS is usedwith default values except it is made
to use multilevel recursive bisectioning. For graph models,
K00 is calculated such that each part has about 10 vertices.
6.2.2 Baseline Algorithms
We use two baseline algorithms MKL and BinP, neither of
which exploits locality.
MKL. We use mkl_dcsrmultcsr function [53] of the
latest MKL library (version 11.3). In the parallelization strat-
egy adopted by MKL [54], the first input matrix is divided
into chunks with more or less equal number of rows, and
every row chunk is assigned to a thread. Since the number
of chunks is equal to the number of threads, the chunk size
cannot be set by the user outside MKL. So MKL considers
balancing the computational loads of threads in a very
rough manner.
BinP. The BinP algorithm is implemented to achieve a
much better balancing of the computational loads of the
threads as follows: BinP is a binpacking-based algorithm
and it adapts the best-fit-decreasing heuristic used in solv-
ing the K-feasible binpacking problem [55]. In adapting this
heuristic for our purpose, A-matrix rows are considered for
assignment into one of the K bins in decreasing number of
multiply-and-add operations incurred by the pre-multiply
of this row with the B matrix. The best-fit criterion is the
assignment of the A-matrix row to the minimally loaded bin
(part). The bin capacity constraint is not used in BinP. At
the termination of the algorithm, each bin represents a
coarse-grain task to be executed by a distinct thread. The
number of resulting parts becomes much larger than the
number of threads thus enabling the utilization of dynamic
part-to-thread scheduling for further load balancing.
The cache size thresholding scheme described in
Section 6.2.1 for the proposed algorithms is also used to
determine theK value in the BinP algorithm.
6.3 Parallel Systems and Implementation
We conducted experiments on a single Xeon Phi 5110P
coprocessor. We used the offload mode instead of the native
mode to enable future vertical integration that involves
hybrid parallelization on a Xeon Phi cluster. The Xeon Phi
coprocessor has 8 GB on-device RAM and provides 59 cores
in the offload mode and each core can handle up to four
hardware threads. Each core has 32 KB 8-way set associative
L1 data cache with 64-byte lines, and 512 KB 8-way set asso-
ciative L2 with 64-byte lines.
We also conducted experiments on a two-socket Xeon
server. Each X5650 Xeon processor clocked at 2.67 GHz has
6 cores and 12 MB 16-way set associative L3 cache. Each
core has 32 KB 8-way associative L1 cache with 8-byte lines
and 256 KB 8-way set associative L2 with 8-byte lines. Only
Tables 4, A.2, and A.4, available in the online supplemental
material, contain results on the two-socket Xeon server,
whereas all other tables contain results for Xeon Phi.
For evaluating the performance of the proposed models
as well as BinP, SpGEMM routines are implemented and
integrated into shoc-mic benchmark [56], which is com-
piled with -O3 flag. OpenMP’s dynamic scheduler is used
with the default chunk size. The best results for 59, 118, 177,
and 236 threads are reported for Xeon Phi and results for 12
threads are reported for the Xeon server.
6.4 Parallel SpGEMM Performance
This section compares the SpGEMM performance of the
methods without considering the preprocesing overhead,
whereas Section 6.5 gives an overall discussion including
the preprocessing overheads.
6.4.1 Sorting Coarse-Grain Tasks for Computational
Load Balancing in HC
Table 1 displays the performance of the sorting-based
coarse-grain task scheduling scheme described for HC at
the end of Section 4.1.2 for the C¼AAT category. In the
2266 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 8, AUGUST 2017
table, the properties of the coarse-grain tasks are given in
terms of the number K of tasks, the average and maximum
task weights, and the covariance of task weights. Each T
value shows the number of threads that achieve the best
reported result for the respective SpGEMM instance. The
last column displays the percent performance improvement
achieved by using the sorting-based scheduling scheme.
As seen in Table 1, the proposed sorting scheme consid-
erably improves the performance in all SpGEMM instances.
As seen in the table, sorting scheme achieves relatively bet-
ter performance for small K=T and large covariance values,
as expected. For example, sorting does not improve the per-
formance much for the rlfprim, sc205-2r and scsd8-
2r instances, which have large K=T ratios of 3821/23616,
4260/11836, and 4393/23619, respectively.
On the average, the proposed sorting scheme improves
HC by 12.9, 0.9, and 3.8 percent for C¼AAT , C¼AA, and
C¼AB categories, respectively. The significant amount of
improvement in the C¼AAT category can be attributed to
the relatively smaller averageK=T ratio of ðK=T Þavg=10 and
higher average covariance value of covavg=0.71 of this cate-
gory, whereas ðK=T Þavg=24 and covavg=0.29 for C¼AA, and
ðK=T Þavg=14 and covavg=0.64 for C¼AB.
We should note here that, for the BGCe method, the sort-
ing scheme achieves only slight improvement of 1.5 percent
in C¼AAT whereas it does not achieve any improvement
for the other two categories.
As expected, the sorting scheme does not considerably
improve the performance of partitioning-based methods
since they explicitly enforce balance among coarse-grain
task weights.
6.4.2 Exploiting Locality in Accessing acc Array
Here, we evaluate the validity of the B-matrix row/column
reordering algorithms HSB , GTB , RAcol, and RArow proposed in
Section 5. For this purpose, we report the performance
improvement induced by these reordering algorithms on
each of the five methods HP, HC, BGPe, BGPv, and BGCe.
Table 2 displays the average performance improvement
over the original orderings ofA-matrix columns andB-matrix
rows and columns. The “Spatial” and “Temporal” columns
respectively refer to the separate use ofB-matrix-column and
B-matrix-row reorderings described in Section 5.
As seen in Table 2, the spatial HSB model achieves signifi-
cant performance improvement in all categories. The tem-
poral GTB model achieves significant improvement in C¼AA
and C¼AB, whereas it achieves relatively small improve-
ment in C¼AAT . These experimental findings show the
validity ofHSB and GTB models.
Here, for the C¼AAT and C¼AA cases, we compare the
performance ofRArow/R
A
col against the performance ofHSB and
GTB . As seen in Table 2, although HSB performs better than
RAcol/R
A
row in general, the performance gap is rather small for
HP, BGPe, and BGPv, whereas it is significant in HC and
BGCe. For example, for HC and BGCe on the C ¼ AAT cate-
gory, while HSB achieves 38.5 and 38.9 percent performance
improvement, RArow achieves only -7.7 and 8.3 percent,
respectively. This may be attributed to the high sensitivity of
exploiting spatial locality to sparsity patterns of B-matrix
columns and less sophistication of HC and BGCe compared
to HP and BGPe, respectively. A similar discussion follows
for the comparison of RArow/R
A
col against GTB for exploiting
temporal locality. So forB-matrix row/column reordering in
theC¼AAT and C¼AA cases, we recommend the use ofHSB
and GTB only for HC and BGCe, whereas we recommend the
use of RArow/R
A
col, which are induced by existing A-matrix
partitionings, for all othermethods.
6.4.3 Exploiting Temporal Locality in Accessing B
Matrix
Table 3 compares the performance (in terms of GFlops) of the
proposed locality-aware HP, HC, BGPe, BGPv, and BGCe
methods against the performance of the baseline MKL and
BinP methods on Xeon Phi. For all SpGEMM instances, the
TABLE 1
Coarse-Grain Task Sorting in HC for C¼AAT
Coarse grain task #of
count weight threads Percent
Matrix K avg max cov T impr.
e18 411 8606 15060 0.22 236 12.7%
fxm3_16 766 5301 44310 1.44 177 27.8%
fxm4_6 556 5591 17268 0.93 236 30.2%
rlfdual 259 8000 20787 0.53 177 15.9%
rlfprim 3821 8701 23632 0.53 236 3.1%
sc205-2r 4260 4900 25709 0.63 118 8.7%
scfxm1-2b 278 6176 25415 1.12 118 16.4%
scfxm1-2r 1319 6560 44432 1.28 236 19.8%
scrs8-2r 2327 5922 25160 0.71 177 11.2%
scsd8-2r 4393 7363 15660 0.37 236 5.9%
sctap1-2b 1754 6098 22924 0.90 236 20.1%
testbig 1006 5198 19412 0.81 177 22.0%
HC: Fast bottom-up clustering method (see Section 4.1.2)
TABLE 2
Performance Improvement of B-matrix Row/Column Reordering for Locality in Accessing acc on Xeon Phi
C ¼ AAT (LP) C ¼ AA C ¼ AB
Spatial Temporal Spatial Temporal Spatial Temporal
RArow HSB RAcol GTB RAcol RArow HSB RAcol RArow GTB HSB GTB
HP 34.6% 38.0% 0.8% 0.1% 0.4% 18.6% 19.3% 1.6% 8.5% 8.8% 20.4% 4.8%
HC -7.7% 38.5% 0.8% 1.7% 0.8% 10.3% 18.9% 1.8% 5.7% 8.4% 18.4% 1.6%
BGPe 30.8% 36.2% 1.0% 0.6% 19.7% 19.4% 19.2% 9.4% 8.8% 9.1% 20.4% 4.5%
BGPv 29.2% 37.8% 1.8% 2.7% 19.3% 18.0% 18.7% 8.5% 8.7% 9.1% 20.4% 4.6%
BGCe 8.3% 38.9% 0.8% 2.8% 6.0% 5.5% 20.4% 5.0% 4.6% 10.4% 20.7% 2.5%
Reordering for spatial locality: reordering B-matrix columns. Reordering for temporal locality: reordering B-matrix rows.
AKBUDAK AND AYKANAT: EXPLOITING LOCALITY IN SPARSE MATRIX-MATRIX MULTIPLICATION ON MANY-CORE ARCHITECTURES 2267
proposed methods use the best-performing B-matrix row
and column reorderingmethods as discussed in Section 5 for
exploiting locality in accessing acc array. In the table, for
each method, performance averages and the number of best
results attained over each of the three different categories are
listed at the bottom of each part, whereas overall averages
and number of best results are displayed at the bottom of the
table. A bold value in a row of the table indicates the highest
GFlops performance attained for that instance.
In Fig. 8, we present a performance profile, which is a
generic tool introduced by Dolan and More [57], in order to
give a more comprehensive view of the runtime results of
the proposed, as well as the baseline methods. The test set
in this figure consists of all instances listed in Table A.1,
available in the online supplemental material.
We will first briefly discuss the relative performance of
the two baseline methods MKL and BinP. As seen in Table 3,
the proposed baseline algorithm BinP performs significantly
better than MKL both in C¼AAT and C¼AB categories,
whereas BinP and MKL display comparable performance in
the C¼AA category. On average, BinP performs 2.32x and
1.39x better than MKL for C¼AAT and C¼AB categories,
respectively, whereas BinP performs only 1.04x better than
MKL for the C¼AA category. This is because atomic tasks
of the SpGEMM instances in the C¼AAT and C¼AB cate-
gories are much more irregular compared to those in
C¼AA category as mentioned in Section 6.1. So, by assign-
ing equal number of A-matrix rows to threads, MKL may
attain reasonable load balancing among threads in the
C¼AA category. This significant performance improve-
ment of BinP over MKL for the C¼AAT and C¼AB catego-
ries shows the merits of using load balancing alone.
As seen in Table 3, all of the proposed locality exploiting
methods perform significantly better than both baseline
methods for all of the three categories. Among the proposed
methods, top-down partitioning-based methods HP, BGPe,
and BGPv perform better than the bottom-up clustering
methods HC and BGCe as expected. As seen in Table 3, top-
down partitioning-based methods HP, BGPe, and BGPv
display comparable performance for all categories. As seen
in the performance profile curves given in Fig. 8, in almost
75 percent of the instances, BGPe, BGPv, and HP perform
nearly same.
As seen in Table 3, top-down partitioning-based methods
HP, BGPe and BGPv performs 3.14x, 3.18x, and 3.19x better
than MKL on the overall average. As also seen in the table,
HP, BGPe, and BGPv perform 2.18x, 2.21x, and 2.21x better
than BinP on the overall average. These relative perfor-
mance improvements of HP, BGPe, and BGPv over BinP
show the benefit of exploiting locality.
TABLE 3
Performance Results (in GFlops) on Xeon Phi
Baseline Proposed methods (locality aware)
methods Hypergraph Bipartite graph
(no locality) part. cluster. partitioning cluster.
MKL BinP HP HC BGPe BGPv BGCe
C ¼ AAT (Linear programming)
e18 0.65 1.60 3.41 2.54 3.10 3.34 2.26
fxm3_16 2.48 3.53 5.79 4.24 6.19 6.25 6.41
fxm4_6 1.75 4.29 5.84 5.16 5.66 6.07 6.85
rlfdual 0.51 2.76 4.00 2.50 3.94 3.86 1.68
rlfprim 1.84 2.90 5.52 4.13 5.50 5.42 3.88
sc205-2r 1.32 1.41 4.86 3.94 5.75 5.86 1.45
scfxm1-2b 1.04 3.06 4.73 2.61 5.70 4.92 3.71
scfxm1-2r 1.20 3.47 7.30 5.67 7.62 8.13 4.68
scrs8-2r 1.15 2.42 5.59 4.66 5.46 5.26 2.70
scsd8-2r 1.54 9.00 11.78 11.29 11.71 11.51 8.88
sctap1-2b 2.59 3.73 5.69 4.81 5.60 5.56 2.90
testbig 1.33 2.59 5.03 4.18 5.35 5.39 2.83
Average 1.31 3.04 5.52 4.26 5.68 5.68 3.48
# of bests – – 6 – 1 3 2
C ¼ AA
144 1.66 1.75 6.28 6.55 6.81 6.88 2.67
2cubes_sphere 3.17 2.84 6.93 6.97 6.91 6.98 5.23
cage12 2.04 1.91 4.67 4.26 4.51 4.54 2.66
conf5_4-8x8-05 5.10 7.96 10.61 10.65 10.24 10.47 9.10
cop20k_A 2.46 3.93 9.65 9.27 9.64 9.54 5.51
cp2k-h2o-.5e7 2.99 3.43 8.11 7.95 8.05 8.14 7.21
cp2k-h2o-e6 2.72 2.53 6.69 6.47 6.61 6.66 6.03
filter3D 3.75 3.48 9.73 9.66 9.52 9.63 6.11
mac_econ_fwd500 1.67 1.24 3.31 3.02 3.34 3.39 2.92
majorbasis 2.89 3.09 6.49 6.22 6.42 6.34 4.95
mario002 1.27 1.22 4.41 3.94 4.41 4.41 2.48
mc2depi 1.04 0.83 3.42 3.27 3.33 3.35 2.63
offshore 2.83 2.17 6.99 7.02 6.99 7.05 4.24
poisson3Da 2.03 3.20 8.39 7.92 8.21 8.42 4.11
scircuit 1.29 1.72 4.95 3.36 4.84 4.98 3.75
tmt_sym 2.23 1.90 5.90 5.65 5.90 5.93 4.62
torso2 2.84 3.23 6.53 6.36 6.44 6.52 6.40
Average 2.28 2.38 6.31 5.97 6.27 6.32 4.41
# of bests – – 8 1 – 8 –
C ¼ AB
amazon0302 1.36 1.89 2.54 2.25 2.58 2.56 2.12
amazon0312 1.60 1.90 2.86 2.57 2.87 2.89 2.39
rgg_n_2_16_s0 0.74 0.65 1.69 1.74 1.76 1.71 1.66
smallworld 0.28 0.72 0.95 0.58 1.12 1.11 1.00
Average 0.82 1.14 1.85 1.55 1.95 1.94 1.70
# of bests – – – – 3 1 –
Overall average 1.65 2.38 5.18 4.48 5.25 5.27 3.61
Overall # of bests – – 14 1 4 12 2
Fig. 8. GFlops performance profiles on Xeon Phi.
2268 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 8, AUGUST 2017
Despite the inferior performance of HC and BGCe over
the top-down partitioning-based methods, they still per-
form significantly better than BinP, where HC is the clear
winner. HC and BGCe perform 1.88x and 1.52x better than
BinP on the overall average. The better performance of HC
over BGCe can be attributed to the use of nets that encode
multi-way similarity instead of edges that encode only two-
way similarity during the bottom-up clustering process.
Table 4 displays results of experiments conducted on
Xeon server. In Table 4, GFlops performance results are dis-
played as averages over the three different SpGEMM cate-
gories. We refer the reader to Table A.2 of the appendix for
instance-based detailed performance results, available in
the online supplemental material. Comparison of Tables 3
and 4 show that the performance gap between the proposed
locality aware methods and the baseline method BinP
slightly reduces on Xeon compared to Xeon Phi. For exam-
ple, the performance improvement of HP over BinP
decreases from 2.18x on Xeon Phi to 2.05x on Xeon on the
overall average. The average performance gap between HP
and BinP remains almost the same (1.62x versus 1.61x) for
the C¼AB category. The average performance improve-
ment of HP over BinP decreases from 1.82x on Xeon Phi to
1.41x on Xeon for the C¼AAT category, whereas the aver-
age improvement of HP over BinP increases from 2.65x on
Xeon Phi to 2.83x on Xeon for the C¼AA category. This
experimental finding may be attributed to the reduced
cache miss overhead in Xeon due to out-of-order execution
capability as opposed to the in-order execution of Xeon Phi.
6.4.4 Reducing Data Transfer
In the appendix, Table A.3, available in the online supple-
mental material, is introduced in order to compare the
improvements provided by the proposed partitioning-based
methods against BinP in terms of cutsize. For each SpGEMM
instance, the cutsize (computed according to (6)) of the
respective method is divided by the cutsize of the baseline
method BinP. As seen in the table, the normalized cutsize
values in general conform with the relative GFlops
performance values of the respective methods given in
Table 3. For example, as seen in Table A.3, available in the
online supplemental material, for the C¼AAT and C¼AA
categories, HP, which aims at exploiting locality, respec-
tively achieves 3.22x and 8.33x less cutsize than the baseline
partitioning method BinP which only considers load-balanc-
ing. As seen in Table 3, HP achieves respectively 1.82x and
2.65x speedups over BinP for the C¼AAT and C¼AA cate-
gories, on average.
We also conducted experiments with likwid [58], which
enables counting data transfers in a multi-threaded setting,
to measure data transfer between L2 caches and last level
caches of the Xeon server. Table A.4, available in the online
supplemental material, displays the data transfer amounts
incurred by the proposed partitoning-based methods nor-
malized with respect to those of BinP. As seen in Table A.4,
available in the online supplemental material, the proposed
locality-exploiting methods achieve up to 2.63x less data
transfers than BinP, on the overall average. Comparison of
Tables A.4, available in the online supplemental material,
and 4 show that data transfer amounts incurred by the
partitioning-based methods correlate with the attained
GFlops performances. For example, on the overall average,
BinP incurs 2.44x more data transfers than HP, whereas HP
performs 2.18x better than BinP. In conclusion, the cutsize
minimization objective in the proposed methods success-
fully achieve reducing pressure onmemory and caches.
6.5 Partitioning Overhead versus SpGEMM
Performance
Table 5 is introduced to compare the partitioning overheads
of the proposed methods, as well as BinP. For each
SpGEMM instance, the partitioning time of the respective
method in the host machine (Xeon) is divided by the paral-
lel SpGEMM time obtained by MKL on Xeon Phi and aver-
ages of these normalized values over the matrix categories
are reported in the table. For BGCe, which uses multi-
threaded implementation, running times for the number of
threads that achieve the lowest time are used.
As seen in Tables 3 and 5, BinP performs considerably
better than MKL in terms of parallel SpGEMM perfor-
mance, whereas the preprocessing overhead of BinP amor-
tizes for only one SpGEMM operation. Hence BinP is a very
simple yet effective heuristic that can easily be integrated
into existing libraries.
The comparison of HP and BGPv is as follows: As seen in
Table 5, BGPv incurs significantly higher partitioning
TABLE 5
Partitioning Overhead in Terms of Parallel MKL
SpGEMM Times
Proposed methods (locality aware)
Hypergraph Bipartite graph
partition. cluster. partitioning cluster.
BinP HP HC BGPe BGPv BGCe
C ¼ AAT 0.14 18.58 2.33 17.07 95.99 1.14
C ¼ AA 0.47 80.18 7.33 58.81 219.23 2.02
C ¼ AB 0.34 36.81 6.37 31.21 165.97 0.99
Overall 0.29 42.87 4.75 34.74 156.98 1.50
TABLE 4
Average Performance Results (in GFlops)
on Two-Socket Xeon Server
Baseline Proposed methods (locality aware)
methods Hypergraph Bipartite graph
(no locality) part. cluster. partitioning cluster.
MKL BinP HP HC BGPe BGPv BGCe
C ¼ AAT (Linear programming)
Average 3.32 4.31 6.06 4.85 6.01 6.04 4.94
# of bests – – 5 1 2 4 –
C ¼ AA
Average 2.22 1.63 4.61 4.39 4.60 4.62 3.24
# of bests – – 5 – 2 10 –
C ¼ AB
Average 0.86 1.18 1.90 1.85 1.92 1.94 1.57
# of bests – – – 1 – 3 –
Overall average 2.29 2.23 4.57 4.10 4.56 4.58 3.46
Overall # of bests – – 10 2 4 17 –
AKBUDAK AND AYKANAT: EXPLOITING LOCALITY IN SPARSE MATRIX-MATRIX MULTIPLICATION ON MANY-CORE ARCHITECTURES 2269
overhead than both HP and BGPe. Although BGPv achieves
a similar parallel SpGEMM performance as HP, it is not rec-
ommended because of its significantly higher preprocessing
overhead.
The comparison of HP and BGPe is as follows: Although
bipartite graph and hypergraph models are of similar size
(same number of edges and pins), HP incurs higher prepro-
cessing overhead than BGPe as seen in Table 5. This is due
to the fact that graph partitioning is considerably faster than
hypergraph partitioning in general. As BGPe shows a close
SpGEMM performance to HP on the average, BGPe can be
considered as a good alternative to HP because of its consid-
erably lower preprocessing overhead.
The comparison of HP and HC is as follows: In terms of
parallel SpGEMM performance, for C¼AA and C¼AB cat-
egories, the average performance of HC is close to that
of HP as seen in Table 3. In terms of partitioning time, on
the average, HC runs approximately 10x faster than HP as
seen in Table 5. Hence the fast bottom-up clustering
approach is recommended for C¼AA and C¼AB catego-
ries instead of HP.
The comparison of BGPe and BGCe is as follows: In
terms of parallel SpGEMM performance, BGCe performs
significantly worse (31 percent worse) than BGPe, on the
overall average. On the other hand, BGCe runs approxi-
mately 23x faster than BGPe, on the overall average. For
example, running time of BGCe is no more than the running
time of a single MKL SpGEMM time for the C ¼ AB
category.
7 CONCLUSION
We investigated row-by-row formulation of the sparse
matrix-matrix multiplication (SpGEMM) operation of the
form C¼AB for locality-aware parallelization on many-core
architectures. We proposed a hypergraph model and a bipar-
tite graph model for 1D rowwise A-matrix partitioning to
exploit locality in accessingB-matrix rows. In the partitioning
methods utilizing these models, the partitioning constraint
corresponds to maintaining balance on the computational
loads of threads, whereas the partitioning objective relates to
reducing data transfer amount from memory and between
caches. For both hypergraph and bipartite graph models, we
proposed bottom-up clustering methods, which were experi-
mentally shown to be producing reasonably good partitions
while being significantly faster than the respective partition-
ingmethods.
We also investigated how to exploit locality in accessing
entries of temporary arrays utilized by threads during accu-
mulation of results for C-matrix rows. We proposed a
hypergraph and a graph model for B-matrix column and
row reordering to exploit spatial and temporal localities in
these operations, respectively.
We tested the validity of the proposed methods on a
wide range of realistic SpGEMM instances from three differ-
ent categories of C¼AAT , C¼AA, and C¼AB. Experimen-
tal results showed the validity of the proposed methods.
These results also showed that Xeon Phi and Xeon process-
ors can benefit from locality enhancements for sparse and
irregular applications through intelligent partitioning and
reordering.
ACKNOWLEDGMENT
This work was partially supported by the Scientific and
Technological Research Council of Turkey (TUBITAK)
under Grant EEEAG-115E212. The authors would also like
to acknowledge the contribution of the COST Action IC1406
High-Performance Modeling and Simulation for Big Data
Applications (cHiPSet).
REFERENCES
[1] M. Challacombe, “A general parallel sparse-blocked matrix multi-
ply for linear scaling SCF theory,” Comput. Phys. Commun.,
vol. 128, no. 12, pp. 93–107, 2000.
[2] S. Itoh, P. Ordejon, and R. M. Martin, “Order-N tight-binding
molecular dynamics on parallel computers,” Comput. phys. Com-
mun., vol. 88, no. 2, pp. 173–185, 1995.
[3] J. VandeVondele, U. Borstnik, and J. Hutter, “Linear scaling self-
consistent field calculations with millions of atoms in the con-
densed phase,” J. Chem. Theor. Comput, vol. 8, no. 10, pp. 3565–
3573, 2012.
[4] N. Hine, P. Haynes, A. Mostofi, and M. Payne, “Linear-scaling
density-functional simulations of charged point defects in Al2O3
using hierarchical sparse matrix algebra,” J. Chemical Phys.,
vol. 133, no. 11, 2010, p. 114111.
[5] C. Saravanan, Y. Shao, R. Baer, P. N. Ross, and M. Head-
Gordon, “Sparse matrix multiplications for linear scaling elec-
tronic structure calculations in an atom-centered basis set using
multiatom blocks,” J. Comput. Chemistry, vol. 24, no. 5, pp. 618–
622, 2003.
[6] D. Bowler, T. Miyazaki, and M. Gillan, “Parallel sparse matrix
multiplication for linear scaling electronic structure calcu-
lations,” Comput. Phys. Commun., vol. 137, no. 2, pp. 255–273,
2001.
[7] U. Borstnik, J. VandeVondele, V. Weber, and J. Hutter, “Sparse
matrix multiplication: The distributed block-compressed sparse
row library,” Parallel Comput., vol. 40, no. 5, pp. 47–58, 2014.
[8] Total-FETI massively parallel implementation research group.
(2015). [Online]. Available: http://spomech.vsb.cz/feti/
[9] V. Hapla, D. Hork, and M. Merta, “Use of direct solvers in TFETI
massively parallel implementation,” in Applied Parallel and Scien-
tific Computing, P. Manninen and P. ster, Eds. Berlin, Germany:
Springer, 2013, pp. 192–205.
[10] R. H. Bisseling, T. M. Doup, and L. Loyens, “A parallel interior
point algorithm for linear programming on a network of trans-
puters,” Ann. Operations Res., vol. 43, pp. 51–86, 1993.
[11] E. Boman, O. Parekh, and C. Phillips, “LDRD final report on
massively-parallel linear programming: The parPCx system,” San-
dia National Laboratories Albuquerque, NM, USA, Tech. Rep.
SAND2004–6440, 2005.
[12] G. Karypis, A. Gupta, and V. Kumar, “A parallel formulation of
interior point algorithms,” in Proc. ACM/IEEE Conf. Supercomput-
ing, 1994, pp. 204–213.
[13] W. L. Briggs, V. E. Henson, and S. F. McCormick, A Multigrid
Tutorial. Philadelphia, PA, USA: SIAM, 2000.
[14] A. Buluç and J. R. Gilbert, “The Combinatorial BLAS: Design,
implementation, and applications,” Int. J. High Performance Com-
put. Appl., vol. 25, no. 4, pp. 496–509, 2011.
[15] P. D’Alberto and A. Nicolau, “R-Kleene: A high-performance
divide-and-conquer algorithm for the all-pair shortest path for
densely connected networks,” Algorithmica, vol. 47, no. 2, pp. 203–
213, 2007.
[16] C. Ordonez, “Optimization of linear recursive queries in SQL,”
IEEE Trans. Knowl. Data Eng., vol. 22, no. 2, pp. 264–277, Feb. 2010.
[17] C. Ordonez, Y. Zhang, and W. Cabrera, “The Gamma matrix to
summarize dense and sparse data sets for big data analytics,”
IEEE Trans. Knowl. Data Eng., vol. 28, no. 7, pp. 1905–1918,
Jul. 2016.
[18] G. Linden, B. Smith, and J. York, “Amazon.com recommenda-
tions: item-to-item collaborative filtering,” IEEE Internet Comput.,
vol. 7, no. 1, pp. 76–80, Jan. 2003.
[19] J. Kepner, Ed.,Graph Algorithms in the Language of Linear Algebra.
Philadelphia, PA, USA: SIAM, 2011.
[20] A. Buluc and J. R. Gilbert, “On the representation and multiplica-
tion of hypersparse matrices,” in Proc. IEEE Int. Symp. Parallel Dis-
trib. Process., 2008, pp. 1–11.
2270 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 8, AUGUST 2017
[21] K. Akbudak and C. Aykanat, “Simultaneous input and output
matrix partitioning for outer-product–parallel sparse matrix-
matrix multiplication,” SIAM J. Scientific Comput., vol. 36, no. 5,
pp. C568–C590, 2014.
[22] A. Buluc and J. R. Gilbert, “Parallel sparse matrix-matrix multipli-
cation and indexing: Implementation and experiments,” SIAM
J. Scientific Comput., vol. 34, no. 4, pp. C170–C191, 2012.
[23] F. G. Gustavson, “Two fast algorithms for sparse matrices: Multi-
plication and permuted transposition,” ACM Trans. Math. Softw.,
vol. 4, no. 3, pp. 250–269, 1978.
[24] K. Akbudak, O. Selvitopi, and C. Aykanat, “Partitioning models
for scaling parallel sparse matrix-matrix multiplication,” ACM
TOPC (under review).
[25] G. Ballard, et al., “Communication optimal parallel multiplication
of sparse random matrices,” in Proc. 25th Annu. ACM Symp. Paral-
lelism Algorithms Archit., 2013, pp. 222–231.
[26] G. Ballard, A. Druinsky, N. Knight, and O. Schwartz, “Brief
announcement: Hypergraph partitioning for parallel sparse
matrix-matrix multiplication,” in Proc. ACM Symp. Parallelism
Algorithms Archit., 2015, pp. 86–88.
[27] A. Azad, et al., “Exploiting multiple levels of parallelism in
sparse matrix-matrix multiplication,” CoRR, vol. abs/1510.00844,
2015. [Online]. Available: http://arxiv.org/abs/1510.00844
[28] P. D. Sulatycke and K. Ghose, “Caching-efficient multithreaded fast
multiplication of sparse matrices,” in Proc. 1st Merged Int. Parallel
Proc. Symp. Symp. Parallel Distrib. Process., 1998, pp. 117–123.
[29] NVIDIA Corporation, Cusparse Library, NVIDIA Corporation,
Santa Clara, California, 2014, https://developer.nvidia.com/
cusparse
[30] N. Bell and M. Garland, “CUSP: Generic parallel algorithms for
sparse matrix and graph computations,” 2014, http://cuspli-
brary.github.io/
[31] N. Bell, S. Dalton, and L. N. Olson, “Exposing fine-grained paral-
lelism in algebraic multigrid methods,” SIAM J. Scientific Comput.,
vol. 34, no. 4, pp. C123–C152, 2012.
[32] S. Dalton, L. Olson, and N. Bell, “Optimizing sparse matrix-matrix
multiplication for the GPU,” ACM Trans. Math. Softw., New York,
NY, USA, vol. 41, no. 4, pp. 25:1–25:20, Oct. 2015, http://doi.acm.
org/10.1145/2699470
[33] L. Polok, S. V. Ila, and P. Smrvz, “Fast sparse matrix multiplica-
tion on GPU,” in Proc. Symp. High Performance Comput., 2015,
pp. 33–40.
[34] K. Matam, S. Indarapu, and K. Kothapalli, “Sparse matrix-matrix
multiplication on modern architectures,” in Proc. 19th Int. Conf.
High Performance Comput., Dec. 2012, pp. 1–10.
[35] M. M. A. Patwary, et al., “Parallel efficient sparse matrix-matrix
multiplication on multicore platforms,” in Proc. Symp. High Perfor-
mance Comput., 2015, pp. 48–57.
[36] W. Liu and B. Vinter, “An efficient GPU general sparse matrix-
matrix multiplication for irregular data,” in Proc. IEEE 28th
Int. Parallel Distrib. Processing Symp., 2014, pp. 370–381.
[37] F. Gremse, A. H€ofter, L. O. Schwen, F. Kiessling, and U. Nau-
mann, “GPU-accelerated sparse matrix-matrix multiplication by
iterative row merging,” SIAM J. Scientific Comput., vol. 37, no. 1,
pp. C54–C71, 2015.
[38] J. Siegel, O. Villa, S. Krishnamoorthy, A. Tumeo, and X. Li,
“Efficient sparse matrix-matrix multiplication on heterogeneous
high performance systems,” in Proc. IEEE Int. Conf. Cluster Com-
put. Workshops Posters, 2010, pp. 1–8.
[39] K. Akbudak, E. Kayaaslan, and C. Aykanat, “Hypergraph parti-
tioning based models and methods for exploiting cache locality in
sparse matrix-vector multiplication,” SIAM J. Scientific Comput.,
vol. 35, no. 3, pp. C237–C262, 2013.
[40] E. Saule, K. Kaya, and €U. V. Çataly€urek, “Performance evaluation
of sparse matrix multiplication kernels on Intel Xeon Phi,” in
Parallel Processing and Applied Mathematics. Berlin, Germany:
Springer, 2014, pp. 559–570.
[41] M. O. Karsavuan, K. Akbudak, , and C. Aykanat, “Locality-aware
parallel sparse matrix-vector and matrix-transpose-vector multi-
plication on many-core processors,” IEEE Trans. Parallel Distrib.
Syst., vol. 27, no. 6, pp. 1713–1726, Jun. 2016.
[42] M. McCourt, B. Smith, and H. Zhang, “Sparse matrix-matrix prod-
ucts executed through coloring,” SIAM J. Matrix Anal. Appl.,
vol. 36, no. 1, pp. 90–109, 2015.
[43] S. Ramos and T. Hoefler, “Modeling communication in cache-coher-
ent SMP systems: A case-study with Xeon Phi,” in Proc. 22nd Int.
Symp. High-Performance Parallel Distrib. Comput., 2013, pp. 97–108.
[44] €U. V. Çataly€urek and C. Aykanat, PaToH: A Multilevel Hypergraph
Partitioning Tool, Version 3.0, Comput. Eng. Dept., Bilkent Univ.,
Ankara, Turkey., 1999.
[45] G. Karypis and V. Kumar, “hMETIS: A hypergraph partitioning
package, version 1.5.3,” Dept. Comput. Sci., Univ. Minnesota,
Minneapolis, MN, USA, 1998.
[46] U. V. Çataly€urek and C. Aykanat, “Hypergraph-partitioning
based decomposition for parallel sparse-matrix vector multi-
plication,” IEEE Trans. Parallel Distrib. Syst., vol. 10, no. 7, pp. 673–
693, Jul. 1999.
[47] B. U car and C. Aykanat, “Revisiting hypergraph models for
sparse matrix partitioning,” SIAM Rev., vol. 49, no. 4, pp. 595–603,
2007.
[48] H. Sp€ath, “The minisum location problem for the Jaccard metric,”
Operations-Research-Spektrum, vol. 3, no. 2, pp. 91–94, 1981.
[49] G. Karypis and V. Kumar, “METIS manual, version 5.1,” Dept.
Comput. Sci., Univ. Minnesota, Minneapolis, MN, USA, 1998.
[50] D. LaSalle and G. Karypis, “Multi-threaded graph partitioning,”
in Proc. IEEE 27th Int. Symp. Parallel Distrib. Processing, 2013,
pp. 225–236.
[51] T. A. Davis and Y. Hu, “The University of Florida sparse matrix
collection,” ACM Trans. Math. Softw., vol. 38, no. 1, 2011, Art. no. 1.
[52] The DIMACS Implementation Challenges. (2012). [Online]. Avail-
able: http://dimacs.rutgers.edu/Challenges/
[53] MKL Developer Reference. (2016). [Online]. Available: software.
intel.com/en-us/node/520855
[54] MKL Forum. (2014). [Online]. Available: software.intel.com/en-
us/forums/topic/508188
[55] E. Horowitz and S. Sahni, Fundamentals of Computer Algorithms.
Rockville, MD, USA: Comput. Sci. Press, 1978.
[56] The Scalable Heterogeneous Computing Benchmark Suite (SHOC)
for Intel Xeon Phi. (2013). [Online]. Available: https://github.
com/vetter/shoc-mic
[57] E. D. Dolan and J. J. More, “Benchmarking optimization software
with performance profiles,”Math. Prog., vol. 91, pp. 201–213, 2002.
[58] J. Treibig, G. Hager, and G. Wellein, “LIKWID: A lightweight per-
formance-oriented tool suite for x86 multicore environments,” in
Proc. 39th Int. Conf. Parallel Process. Workshops, Sep. 2010, pp. 207–
216.
Kadir Akbudak received the BS degree from
Hacettepe University, Turkey, and the MS and
PhD degrees from Bilkent University, Turkey, all in
computer engineering. His research interests
mainly include parallel scientific computing.
Cevdet Aykanat received the the BS and MS
degrees from METU, Turkey, and the PhD degree
from Ohio State University, Columbus, in electri-
cal and computer engineering. Since 1989, he
has been in Computer Engineering Department,
Bilkent University. His research interests mainly
include parallel scientific computing.
" For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
AKBUDAK AND AYKANAT: EXPLOITING LOCALITY IN SPARSE MATRIX-MATRIX MULTIPLICATION ON MANY-CORE ARCHITECTURES 2271
