A Recursive Algebraic Coloring Technique for Hardware-Efficient
  Symmetric Sparse Matrix-Vector Multiplication by Alappat, Christie L. et al.
1A Recursive Algebraic Coloring Technique for Hardware-Eicient
Symmetric Sparse Matrix-Vector Multiplication
CHRISTIE ALAPPAT, Department of Computer Science, Friedrich-Alexander-Universita¨t Erlangen-
Nu¨rnberg
GEORG HAGER, Erlangen Regional Computing Center, Friedrich-Alexander-Universita¨t Erlangen-
Nu¨rnberg
OLAF SCHENK, Institute of Computational Science, Universita` della Svizzera italiana
JONAS THIES, Simulation and Soware Technology, German Aerospace Center
ACHIM BASERMANN, Simulation and Soware Technology, German Aerospace Center
ALAN R. BISHOP, eory, Simulation and Computation, Los Alamos National Laboratory
HOLGER FEHSKE, Institute of Physics, University of Greifswald
GERHARD WELLEIN, Department of Computer Science, Friedrich-Alexander-Universita¨t Erlangen-
Nu¨rnberg
e symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many
numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on
today’s multicore platforms with up to 100 cores is dicult due to the need to manage conicting updates
on the result vector. Coloring approaches can be used to solve this problem without data duplication, but
existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering
scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE),
a novel coloring algorithm and open-source library implementation, which eliminates the shortcomings of
previous coloring methods in terms of hardware eciency and parallelization overhead. We describe the
level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV,
and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel
MKL on two modern multicore processors. RACE outperforms all other approaches substantially and behaves
in accordance with the roofline model. Outliers are discussed and analyzed in detail. While we focus on
SymmSpMV in this paper, our algorithm and soware is applicable to any sparse matrix operation with data
dependencies that can be resolved by distance-k coloring.
CCS Concepts: •Mathematics of computing→ Graph algorithms;
Additional Key Words and Phrases: sparse matrix, sparse symmetric matrix-vector multiplication, graph
algorithms, graph coloring, scheduling, memory hierarchies
ACM Reference format:
CHRISTIE ALAPPAT, GEORG HAGER, OLAF SCHENK, JONAS THIES, ACHIM BASERMANN, ALAN R.
BISHOP, HOLGER FEHSKE, and GERHARD WELLEIN. xxxx. A Recursive Algebraic Coloring Technique for
Hardware-Ecient Symmetric Sparse Matrix-Vector Multiplication. 1, 1, Article 1 (January xxxx), 39 pages.
DOI:
1 INTRODUCTION AND RELATEDWORK
e ecient solution of linear systems or eigenvalue problems involving large sparse matrices has
been an active research eld in parallel and high performance computing for many decades. Well-
known, traditional application areas include quantum physics, quantum chemistry or engineering.
xxxx. XXXX-XXXX/xxxx/1-ART1 $15.00
DOI:
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
ar
X
iv
:1
90
7.
06
48
7v
1 
 [c
s.D
C]
  1
5 J
ul 
20
19
1:2 Christie Alappat et al.
In recent years, new elds such as social graph analysis [41] or spectral clustering in the context
of learning algorithms [35, 45] have further increased the need for hardware-ecient, parallel
sparse solvers and/or ecient matrix-free solvers. Assuming suciently large problems, the solvers
are typically based on iterative subspace methods and may include advanced preconditioning
techniques. In many methods, two components, sparse matrix-vector multiplication (SpMV) and
coloring techniques, are crucial for hardware eciency and parallel scalability. Typically, these two
components are considered to be orthogonal, i.e., hardware eciency for SpMV is mainly related
to data formats and local structures while coloring is used to address dependencies in the enclosing
iteration scheme. Interestingly, the hardware-ecient parallelization of symmetric SpMV has not
aracted a lot of aention over the years, though symmetry is widespread in the application elds.
e SpMV operation is an essential building block in a number of applications such as algebraic
multigrid methods, sparse iterative solvers, shortest path algorithms, breadth rst search algorithms,
and Markov cluster algorithms, and therefore it is an integral part of numerous scientic algorithms.
In the past decades, much research has been focusing on designing new data structures, ecient
algorithms, and parallelization techniques for the SpMV operation. Its performance is typically
limited by main memory bandwidth. On cache-based architectures, the main factors that inuence
performance are spatial access locality to the matrix data and temporal locality when reusing the
elements of the vectors involved. To address this problem, over the last two decades a plethora
of partitioning techniques and data structures to improve SpMV on cache-based architectures
have been suggested, including cache-oblivious methods using hypergraph partitioning. One of
the rst studies on temporal locality optimizations was done by Toledo [43], who investigated
Cuthill–McKee (CM) ordering techniques on three-dimensional nite-element test matrices when
used in combination with blocking into small dense blocks. Various authors [19, 47] used advanced
data storage formats and techniques such as register and cache blocking for SpMV by spliing the
matrix into several smaller p × q sparse submatrices and presented an analytic cache-aware model
to determine the optimal block size. ese algorithms are, e.g., included in OSKI [46], which is
a collection of low-level primitives of tuned sparse kernels for modern cache-based superscalar
machines. Kreutzer et al. [26] and Xing et al. [32] used techniques to improve SIMD eciency and
performance on many-core and GPU architectures. Recent work can be found, e.g., in [29–31].
Previous work on SpMV has also focused on reducing communication volume for distributed-
memory parallelization, oen by using variants of graph or hypergraph partitioning techniques [6].
Yzelman and Bisseling [49, 50] extended hypergraph partitioning techniques in a cache-oblivious
method, permuting rows and columns of the input matrix using a recursive hypergraph-based
sparse matrix partitioning scheme so that the resulting matrix exhibits cache-friendly behavior
during the SpMV.
Despite SpMV being a bandwidth-limited operation, not much work has been done to exploit the
symmetry property of symmetric matrices to reduce storage requirements and data transfers by
using only the upper/lower triangular part of the matrix. e major challenge here is to resolve the
potential write conicts of explicit symmetric sparse matrix-vector multiplication (SymmSpMV)
kernels in parallel processing. ere are general solutions for such problems like lock based
methods and thread private target arrays [10, 17, 27, 36]. However they have in common that their
overhead may increase with the degree of parallelism. Another recent research direction is the use
of specialized storage formats like CSB [4], RSB [34], CSX [10] combined with the use of bitmasked
register blocking techniques as in [5]. As pointed out by [31] these approaches have drawbacks like
missing backward compatibility and matrix conversion costs. Due to these problems there are only
a very few standard libraries, like Intel MKL [20], that support primitives for ecient SymmSpMV
operation. Another potential way of tackling this inherent data dependency problem is using a
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:3
distance-2 coloring of the underlying undirected graph, which has not been investigated so far to
the best of our knowledge.
Multicoloring (MC) reordering to tackle data dependencies is a very well established strategy in
parallelization of iterative solvers. As it is applied to the underlying graph it is not bound to a specic
data format and may use existing highly optimized (serial) kernels, i.e., it is orthogonal to general
code optimization strategies. Prominent examples for MC in iterative solvers are Gauss-Seidel,
incomplete Cholesky factorization or Kaczmarz method [11, 13, 22], where typically a distance-1
or distance-2 coloring is applied subject to the underlying dependencies of the iterative scheme.
However, coloring changes the evaluation order of the original solver and may lead to worse
convergence rates. is is dierent when using MC methods for parallelization of SymmSpMV
where we only need to ensure that entries of the target vector is not wrien in parallel. Here we
do not require strict serial ordering to get to the same result as in serial processing. In terms of
hardware utilization long-standing MC methods oen generate colorings which lack eciency on
modern cache-based processors. Studies have been made to increase their performance and improve
inherent heuristics; an overview of the methods can be found in [14–16, 33]. However, for irregular
and/or large sparse matrices MC may lead to load imbalance, frequent global synchronization,
and loss of data locality, severely reducing (single-node) performance. ese problems typically
become more stringent for higher order distance colorings and larger matrices. e algebraic block
multicoloring (ABMC) [21] proposed by Iwashita et al. in 2012 addresses some of these issues as it
tries to increase data locality by applying graph partitioning (blocking) before coloring. Beyond
the quality of the actual coloring, the time to generate it is also critical, especially for very large
problems. Here, widely used and publicly available coloring packages such as COLPACK[16],
Kokkos[24] and ZOLTAN[2, 3] speed-up the coloring process itself by parallelization and other
heuristics.
Design and implementation of hardware ecient computational kernels can be supported by a
structured performance engineering process based on white-box models. On the processor/node
level, the most prominent model is the roofline model [48]. Its basic applicability as a reasonable
light-speed estimate for SpMV was demonstrated already in [18], including an extension to sparse
matrix multiple vector multiplication. e SpMV performance model was rened in [26] with a focus
on modeling the performance impact of irregular accesses to the right-hand side (RHS) vector. It has
been successfully used to model performance on CPUs and GPGPUs for SpMV kernels [26] and for
augmented sparse matrix multiple vector kernels for Chebyshev lter diagonalization [25]. However,
there is no extension towards explicit SymmSpMV, which shows increased computational intensity
and irregular accesses to both involved vectors. Typically the expectation is that SymmSpMV
should be approximately twice as fast as SpMV as only half of the matrix information needs to be
stored and accessed.
Finally, there is a clear hardware trend towards processors with advanced vector-style processing,
higher core counts and more complex cache hierarchies. Also, aainable bandwidth may increase
even for “standard” CPU based systems through the use of high bandwidth memory solutions at
the cost of very restricted memory sizes. A rst step into this direction was the Intel Xeon Knights
Landing processor. e specication of the ARM-based Fujitsu A64FX processor (to be used in
the Post-K computer) may provide another blueprint for future processor congurations [12]: A
48-core processor supporting 512-bit SIMD execution units on top of 32 GiB HBM2 main memory,
which provides a bandwidth of 1 TB/s. It is obvious that such hardware trends call for revisiting
existing, time-critical components in simulation codes both in terms of scalability and hardware
eciency. Moreover, the potential of SymmSpMV to substantially reduce the memory footprint of
sparse solvers needs to be exploited to meet the constraint of very limited memory space.
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:4 Christie Alappat et al.
Contribution and Outline
is paper addresses the general problem of generating hardware ecient distance-k coloring of
undirected graphs for modern multicore processors. As an application we choose parallelization of
the SymmSpMV operation. We cover thread-level parallelization and focus on a single multicore
processor. e main contributions can be summarized as follows:
• A new recursive algebraic coloring scheme (RACE) is proposed, which generates hardware
ecient distance-k colorings of undirected graphs. Special emphasis in the design of RACE
is put on achieving data locality, generating levels of parallelism matching the core count of
the underlying multicore processor and load balancing for shared memory parallelization.
• We propose shared memory parallelization of SymmSpMV using a distance-2 coloring of
the underlying undirected graph to avoid write conicts and apply RACE for generating
the colorings.
• A comprehensive performance study of shared memory parallel SymmSpMV using RACE
demonstrates the benet of our approach. Performance modeling is deployed to substantiate
our performance measurements, and a comparison to existing coloring methods as well
a vendor optimized library (Intel MKL) are presented. e broad applicability and the
sustainability is validated by using a wide set of 31 test matrices and two very dierent
generations of Intel Xeon processors.
• We extend the existing proven SpMV performance modeling approach to the SymmSpMV
kernel. In the course of the performance analysis we further demonstrate why in some
cases the ideal speedup may not be achievable.
We have implemented our graph coloring algorithms in the open source library recursive algebraic
coloring engine (RACE).1 Information required to reproduce the performance numbers provided in
this paper is also available.2
is paper is organized as follows: Our soware and hardware environment as well as the
benchmark matrices are introduced in Section 2. In Section 3 we describe the properties of the
SpMV and SymmSpMV kernels, including roofline performance limits, and motivate the need
for an advanced coloring scheme. In Section 4 we detail the steps of the RACE algorithm via
an articial stencil matrix and show how recursive level group construction and coloring can be
leveraged to exploit a desired level of parallelism for distance-k dependencies. e interaction
between the parameters of the method and their impact on the parallel eciency is studied in
Section 5. Section 6 presents performance data for SymmSpMV for a wide range of matrices on
two dierent multicore systems, comparing RACE with ABMC and MC as well as Intel MKL, and
also shows the eciency of RACE as dened by the roofline model yardstick. Section 7 concludes
the paper and gives an outlook to future work.
2 HARDWARE AND SOFTWARE ENVIRONMENT
2.1 Hardware test bed
We conducted all benchmarks on a single CPU socket from Intel’s Ivy Bridge EP and Skylake SP
families, respectively, since these represent the oldest and the latest Intel architectures in active
use within the scientic community at the time of writing:
• e Intel Ivy Bridge EP architecture belongs to the class of “classic” designs with three
inclusive cache levels. While the L1 and L2 caches are private to each core, the L3 cache is
1hp://tiny.cc/RACElib
2hp://tiny.cc/RACElib-AD
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:5
Table 1. Technical details (per socket) of the Intel CPUs used for the benchmarks.
Model name Xeon® E5-2660 Xeon® Gold 6148
Microarchitecture Ivy Bridge EP Skylake SP
Base clock frequency 2.2 GHz 2.4 GHz
Uncore clock frequency 2.2 GHz 2.4 GHz
Physical cores per socket 10 20
L1D cache 10 × 32 KiB 20 × 32 KiB
L2 cache 10 × 256 KiB 20 × 1 MiB
L3 cache 25 MiB 27.5 MiB
L3 type inclusive noninclusive, victim
Main memory 32 GiB 48 GiB
Bandwidth per socket, load-only 47 GB/s 115 GB/s
Bandwidth per socket, copy 40 GB/s 104 GB/s
20 32 64 128 256 512 1024 2048
40
60
90
130
200
Size (MB)
B
an
dw
id
th
(G
B
/s
) load-onlycopy
(a) Ivy Bridge EP
20 32 64 128 256 512 1024 2048
100
200
400
800
Size (MB)
B
an
dw
id
th
(G
B
/s
) load-onlycopy
(b) Skylake SP
Fig. 1. Aained bandwidth versus total data size for a range from 20 MB to 2 GB. The doed lines show
the asymptotic bandwidth given in Table 1 for the load-only and copy benchmark. The benchmarks were
performed on the full socket using the likwid-bench tool. The gray vertical lines correspond to the positions
of matrices that might show caching eects; see Section 2.3. Note the logarithmic scales.
shared but scalable in terms of bandwidth. e processor supports the AVX instruction set
extension, which is capable of 256-bit wide SIMD execution.
• Contrary to its predecessors, the Intel Skylake SP architecture has a shared but noninclusive
victim L3 cache and much larger private L2 caches. e model we use in this work supports
AVX-512, which features 512-bit wide SIMD execution.
Architectural details along with the aainable memory bandwidths are given in Table 1. All the
measurements were made with CPU clock speeds xed at the indicated base frequencies. Note
that for the Skylake SP architecture the clock frequency is scaled down internally to 2.2 GHz when
using multicore support and the AVX-512 instruction set; however, this is of minor importance for
the algorithms discussed here.
As the aainable main memory bandwidth is the input parameter to the roofline model used
later, we have carefully measured this value depending on the data set size for two access paerns
(copy and load-only). e data presented in Figure 1 basically show the characteristic performance
drop if the data set size is too large to t into the last level cache (LLC), which is an L3 cache on
both architectures (cf. Table 1 for the actual sizes). Interestingly there is no sharp drop at the exact
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:6 Christie Alappat et al.
size of the LLC but a rather steady performance decrease with enhanced data access rates also for
data set sizes up to twice the LLC size on Ivy Bridge EP. For Skylake SP this eect is even more
pronounced as the noninclusive victim L3 cache architecture only stores data which are not in
the L2 cache; thus the available cache size for an application may be the aggregate sizes of the
L2 and L3 caches on this architecture. e nal bandwidth for the roofline model is chosen as
the asymptotic value depicted in Figure 1. Of course caching eects are extremely sensitive to the
data access paern and thus the values presented here only provide simple upper bounds for the
SymmSpMV kernel with its potentially strong irregular data access.
2.2 External Tools and Soware
e LIKWID [44] tool suite in version 4.3.2 was used, specically likwid-bench for bandwidth
benchmarks (see Table 1), and likwid-perfctr for counting hardware events and measuring
derived metrics. LIKWID validates the quality of it’s performance metrics and validation data is
publicly available.3 Overall the LIKWID data trac measurements can be considered as highly
accurate. Only the L3 data trac measurement on Skylake SP fails the quantitative validation but
it still provides good qualitative results.4
For coloring we used the COLPACK [16] library and METIS [23] version 5.1.0 for graph parti-
tioning with the ABMC method. e Intel SpMP [42] library was employed for RCM bandwidth
reduction, and the Intel MKL version 19.0.2 for some reference computations and comparisons.
All code was compiled with the Intel compiler in version 19.0.2 and the following compiler ags:
-fno-alias -xHost -O3 for Ivy Bridge EP and -fno-alias -xCORE-AVX512 -O3 for Skylake SP.
2.3 Benchmark Matrices
Most test matrices were taken from the SuiteSparse Matrix Collection (formerly University of
Florida Sparse Matrix Collection) [8] combining sets from two related papers [34, 38], which allows
the reader to make a straightforward comparison of results. We also added some matrices from
the Scalable Matrix Collection (ScaMaC) library[1], which allows for scalable generation of large
matrices related to quantum physics applications. A brief description of the background of these
matrices can be found in ScaMaC documentation.5 All the matrices considered are real, although
our underlying soware would also support complex matrices. As mentioned before, we restrict
ourselves to matrices representing fully connected undirected graphs. Table 2 gives an overview of
the most important matrix properties like number of rows (Nr), total number of nonzeros (Nnz),
average number of nonzeros per row (Nnzr), along with the bandwidth of the matrix without (bw)
and with (bwRCM ) RCM preprocessing.
Due to the extended cache size as seen in Figure 1 it might happen that some of the matrices
aain higher eective bandwidths due to partial/full caching, especially on Skylake SP. e ten
potential candidates for the Skylake SP chip in terms of symmetric and full storage (< 128 MB) are
marked with an asterisk in Table 2, while only two among these (offshore and parabolic fem)
satisfy the criteria for Ivy Bridge EP (< 40 MB). e corresponding data set size for storing the
upper triangular part of these matrices have been labeled in Figure 1.
3 KERNELS
We evaluate our methods by parallelization of the SymmSpMV kernel using distance-2 coloring,
which avoids concurrent updates of the same vector entries by dierent threads.
3hps://github.com/RRZE-HPC/likwid/wiki/TestAccuracy
4hps://github.com/RRZE-HPC/likwid/wiki/L2-L3-MEM-trac-on-Intel-Skylake-SP-CascadeLake-SP
5hps://alvbit.bitbucket.io/scamac docs/ matrices page.html
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:7
Table 2. Details of the benchmark matrices. Nr is the number of matrix rows and Nnz is the number of
nonzeros. Nnzr = Nnz/Nr is the average number of nonzeros per row. bw and bwRCM refer to the matrix
bandwidth without and with RCM preprocessing. The leer “C” in the parentheses of the matrix name
indicates a corner case matrix that will be discussed in detail, while the leer “Q” marks a matrix from
quantum physics that is not part of the SuiteSparse Matrix Collection. With an asterisk (*) we have labeled
all the matrices which are less than 128 MB, which could potentially lead to some caching eects especially
on the Skylake SP architecture.
Index Matrix name Nr Nnz Nnzr bw bwRCM
1 crankseg 1* (C) 52,804 10,614,210 201.01 50,388 5126
2 ship 003* 121,728 8,086,034 66.43 3659 3833
3 pwtk* 217,918 11,634,424 53.39 189,331 2029
4 oshore* 259,789 4,242,673 16.33 237,738 19,534
5 F1 343,791 26,837,113 78.06 343,754 10,052
6 inline 1 (C) 503,712 36,816,342 73.09 502,403 6002
7 parabolic fem* (C) 525,825 3,674,625 6.99 525,820 514
8 gsm 106857* 589,446 21,758,924 36.91 588,744 17,865
9 Fault 639 638,802 28,614,564 44.79 19,988 19,487
10 Hubbard-12* (Q) 853,776 11,098,164 13.00 232,848 38,780
11 Emilia 923 923,136 41,005,206 44.42 17,279 14,672
12 audikw 1 943,695 77,651,847 82.29 925,946 35,084
13 bone010 986,703 71,666,325 72.63 13,016 14,540
14 dielFilterV3real 1,102,824 89,306,020 80.98 1,036,475 25,637
15 thermal2* 1,228,045 8,580,313 6.99 1,226,000 797
16 Serena 1,391,349 64,531,701 46.38 81,578 84,947
17 Geo 1438 1,437,960 63,156,690 43.92 26,018 30,623
18 Hook 1498 1,498,023 60,917,445 40.67 29,036 28,994
19 Flan 1565 1,564,794 117,406,044 75.03 20,702 20,849
20 G3 circuit* 1,585,478 7,660,826 4.83 947,128 5068
21 Anderson-16.5* (Q) 2,097,152 14,680,064 7.00 1,198,372 24,620
22 FreeBosonChain-18 (Q) 3,124,550 38,936,700 12.46 2,042,975 131,749
23 nlpkkt120 3,542,400 96,845,792 27.34 1,814,521 86,876
24 channel-500x100x100-b050 4,802,000 90,164,744 18.78 600,299 23,766
25 HPCG-192 7,077,888 189,119,224 26.72 37,057 110,017
26 FreeFermionChain-26 (Q) 10,400,600 140,616,112 13.52 5,490,811 434,345
27 Spin-26 (Q) 10,400,600 145,608,400 14.00 709,995 211,828
28 Hubbard-14 (Q) 11,778,624 176,675,928 15.00 3,171,168 425,415
29 nlpkkt200 16,240,000 448,225,632 27.60 8,240,201 240,796
30 delaunay n24 16,777,216 100,663,202 6.00 16,769,102 32,837
31 Graphene-4096 (C,Q) 16,777,216 218,013,704 13.00 4098 6145
Since the kernel is closely related to the sparse matrix-vector multiplication (SpMV) kernel by
structure and computational intensity, we start with a discussion of SpMV and extend it towards
SymmSpMV later. In all cases the aim is to derive realistic upper performance bounds, which can
be estimated once the computational intensity and main memory bandwidth (bs; see Table 1) are
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:8 Christie Alappat et al.
Algorithm 1 SpMV using the CRS format: b = Ax
1: double :: A[nnz],b[nrows],x[nrows]
2: inteдer :: col[nnz], rowPtr [nrows + 1], tmp
3: for row = 1 : nrows do
4: tmp = 0
5: for idx = rowPtr [row] : (rowPtr [row + 1] − 1) do
6: tmp+ = A[idx] ∗ x[col[idx]]
7: end for
8: b[row] = tmp
9: end for
known [48], i.e.,
Pkernel = Ikernel × bS . (1)
Since bS depends on the ratio of load to store streams we present the model for both upper (load-
only) and lower bound (copy) bandwidth cases. In the following we choose the compressed row
storage (CRS) format for the implementation of SpMV as well as SymmSpMV and assume symmetric
matrices.
3.1 SpMV
A baseline SpMV kernel is presented in Algorithm 1. It has no loop-carried dependencies, so
parallelization of the outer loop using, e.g., OpenMP, is straightforward. Following the discussion
in [26], its computational intensity is
ISpMV(α) = 28 + 4 + 8α + 20/Nnzr
ops
bytes . (2)
Here we assume that the matrix data (A[], col[]), the le-hand side (LHS) vector (b[]), and the row
pointer information (rowPtr []) are loaded only once from main memory, since these data structures
are consecutively accessed. e intensity is calculated from the average cost of performing all
computations required for one nonzero element of the matrix. us, contributions which are
independent of the inner (short) loop are rescaled by Nnzr, which is the average number of nonzeros
per row (i.e., the average length of the inner loop).
e 8α term quanties the data trac caused by accessing the RHS vector (x[]). e value of
α depends on the matrix structure as well as on the RHS vector data set size and the available
cache size. e minimum value of α = Nnzr−1 is aained if the RHS vector is only loaded once
from main memory to the cache and all subsequent accesses in the same SpMV are cache hits. is
limit is typically observed for matrices with low bandwidth (high access locality) or if the cache is
large enough to hold the full RHS data during one SpMV. e actual value of α can be determined
experimentally by measuring the data trac when executing the SpMV; see [26] for more details.6
e optimal value of α = Nnzr−1 together with the corresponding computational intensities for all
matrices is shown in Table 3. e measured αSpMV is used as a sensible lower bound for αSymmSpMV
values (see Section 3.2) in cases where advanced cache replacement strategies do not apply; therefore
the table also presents the corresponding measured αSpMV (= assumed αSymmSpMV ) values for
dierent matrices. Choosing the matrices 10, 22, and 31, which have approximately the same
optimal αSpMV , one can study the delicate inuence of matrix structure (i.e., matrix bandwidth and
6In [26] the trac for the row pointer was not accounted for, i.e., the denominator in (2) is larger by 4Nnzr bytes. is error
is only signicant when Nnzr is small.
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:9
Table 3. The optimal value of αSpMV is shown in column three. Following Equation (1) the maximum SpMV
performance can be calculated for each architecture using the best intensity values (ISpMV (αSpMV ) in flopsbytes )
shown in the fourth column. The assumed αSymmSpMV on Skylake SP and Ivy Bridge EP architectures are
presented in columns five and six, respectively. The assumed αSymmSpMV is equal to the measured αSpMV
for all matrices except the ones marked with asterisk, where αSymmSpMV is set to optimal αSymmSpMV (=
1/N symmnzr ).
Index Matrix name αSpMV ISpMV (αSpMV ) Assumed αSymmSpMVOptimal Optimal SKX IVB
1 crankseg 1 0.0050 0.1648 0.0099* 0.0179
2 ship 003 0.0151 0.1610 0.0297* 0.0390
3 pwtk 0.0187 0.1597 0.0368* 0.0383
4 oshore 0.0612 0.1458 0.1154* 0.1058
5 F1 0.0128 0.1618 0.0253* 0.0436
6 inline 1 0.0137 0.1615 0.0137 0.0340
7 parabolic fem 0.1431 0.1249 0.2504* 0.2250
8 gsm 106857 0.0271 0.1568 0.0528* 0.0946
9 Fault 639 0.0223 0.1584 0.0453 0.0861
10 Hubbard-12 0.0769 0.1413 0.1429* 0.2318
11 Emilia 923 0.0225 0.1583 0.0827 0.0855
12 audikw 1 0.0122 0.1621 0.0624 0.0638
13 bone010 0.0138 0.1615 0.0492 0.0523
14 dielFilterV3real 0.0123 0.1620 0.0728 0.0675
15 thermal2 0.1431 0.1249 0.2504* 0.2277
16 Serena 0.0216 0.1587 0.1006 0.1156
17 Geo 1438 0.0228 0.1583 0.0896 0.0917
18 Hook 1498 0.0246 0.1576 0.1031 0.0948
19 Flan 1565 0.0133 0.1616 0.0541 0.0525
20 G3 circuit 0.2070 0.1124 0.3429* 0.3360
21 Anderson-16.5 0.1429 0.1250 0.3634 0.3187
22 FreeBosonChain-18 0.0802 0.1404 0.2708 0.2628
23 nlpkkt120 0.0366 0.1536 0.1600 0.1656
24 channel-500x100x100-b050 0.0533 0.1482 0.1735 0.1339
25 HPCG-192 0.0374 0.1533 0.1358 0.1391
26 FreeFermionChain-26 0.0740 0.1421 0.3879 0.3973
27 Spin-26 0.0714 0.1429 0.3670 0.3518
28 Hubbard-14 0.0667 0.1442 0.3575 0.3598
29 nlpkkt200 0.0362 0.1537 0.1669 0.1720
30 delaunay n24 0.1667 0.1200 0.4065 0.3192
31 Graphene-4096 0.0770 0.1413 0.1604 0.1278
number of rows; see Table 2) and the cache size on the actual data trac, i.e., the measured values
of αSpMV .
For most of the ten candidate matrices on the Skylake SP architecture that could potentially show
a caching eect (see Table 2) we observe the measured αSpMV to be lower than optimal. In this case
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:10 Christie Alappat et al.
Algorithm 2 SymmSpMV b = Ax , where A is an upper triangular matrix
1: for row = 1 : nrows do
2: diaд idx = rowPtr [row]
3: b[row]+ = A[diaд idx] ∗ x[row]
4: tmp = 0
5: for idx = (rowPtr [row] + 1) : (rowPtr [row + 1] − 1) do
6: tmp+ = A[idx] ∗ x[col[idx]]
7: b[col[idx]]+ = A[idx] ∗ x[row]
8: end for
9: b[row]+ = tmp
10: end for
we set their αSymmSpMV values to the optimal alpha value of SymmSpMV (αSymmSpMV ) which will
be dened in the following Section 3.2. ese cases are marked with an asterisk in Table 3.
3.2 SymmSpMV
SymmSpMV exploits the symmetry of the matrix (Ai j = Aji ) to reduce storage size for matrix
data and reduce the overall memory trac by operating on the upper (or lower) half of the matrix
only. us for every nonzero matrix entry we need to update two entries in the LHS vector (b[]) as
shown in Algorithm 2.
In line with the discussion above, the computational intensity of SymmSpMV is
ISymmSpMV(α) = 48 + 4 + 24α + 4/N symmnzr
ops
bytes , (3)
where N symmnzr = (Nnzr − 1)/2 + 1 . (4)
For a given nonzero matrix element 4 ops are performed, which is twice the amount of work
than in SpMV. In addition we have indirect access to the LHS vector (read and write) which triples
the trac contribution quantied by α . e only term scaled with N symmnzr (number of nonzeros
per row in the upper triangular part of the matrix) is the row pointer. e most optimistic value
of α (αSymmSpMV ) in this case is 1/N symmnzr , which corresponds to a one time transfer of the LHS
and RHS vectors. Note that the α for SpMV and SymmSpMV may be dierent even for the same
matrix and the same compute device, as in the laer case the two vectors are accessed irregularly
and compete for cache. us we can assume that the α value measured for SpMV (αSpMV) is a
lower bound for SymmSpMV. Table 3 show the assumed αSymmSpMV values taken for performance
modeling. Since an upper bound for the performance is the product of computational intensity and
main memory bandwidth (see Equation (1)), this approach provides an upper performance bound
for SymmSpMV. However, note that the performance models derived for matrices having caching
eects (see Table 3) need not be strictly upper bound, as they heavily depends on the caching
strategy of the underlying architecture.
Comparing Equation (3) and Equation (2) it is obvious that the perfect speedup of 2× when
using SymmSpMV instead of SpMV is only aainable in the limit of small α . Considering the large
prefactor of the α contribution, any implementation of SymmSpMV must aim at ensuring high data
locality. e indirect update of the LHS also has a large impact on parallelization strategy as two
rows which have a nonzero in the same column cannot be computed in parallel. In a graph-based
approach to this problem, this is equivalent to the constraint that only vertices which have at least
distance two can be computed in parallel.
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:11
0 2 4 6 8 10
0
1
2
3
4
5
6
Threads
P
er
fo
rm
an
ce
(G
F
/s
)
SpMV
SymmSpMV MC
SymmSpMV ABMC
(a) SymmSpMV
SpM
V
Sym
mS
pM
V-M
C
Sym
mS
pM
V-A
BM
C
0
10
20
30
40
50
60
70
B
/N
N
Z
MEM
L3
L2
Ideal SymmSpMV
(b) Data traic
0 2 4 6 8 10 12 14 16 18 20
0
2
4
6
8
10
12
14
16
Threads
P
er
fo
rm
an
ce
(G
F
/s
)
SpMV
SymmSpMV MC
SymmSpMV ABMC
(c) SymmSpMV
SpM
V
Sym
mS
pM
V-M
C
Sym
mS
pM
V-A
BM
C
0
10
20
30
40
50
60
70
B
/N
N
Z
MEM
L3
L2
Ideal SymmSpMV
(d) Data traic
Fig. 2. Scaling performance of SymmSpMV with MC and ABMC compared to SpMV on one socket of Ivy
Bridge EP and Skylake SP is shown in Figures 2(a) and 2(c) respectively. Figures 2(b) and 2(d) show average
data traic per nonzero entry (Nnz) of the full matrix as measured with LIKWID for all cache levels and main
memory on full socket of Ivy Bridge EP and Skylake SP respectively. The Spin-26 matrix was prepermuted
with RCM.
3.3 Analysis of the SymmSpMV kernel using parallel coloring schemes for the Spin-26
matrix
As pointed out previously, the parallelization of the SymmSpMV kernel can be done via distance-2
coloring of the corresponding graph. e computational intensity, and hence the performance,
depends on the data access paerns to the LHS and RHS vectors. As coloring schemes change those
paerns, they may change the computational intensity, and we have to investigate this eect in
more detail. We apply the basic MC scheme generated by COLPACK [16] to parallelize SymmSpMV
and compare it with SpMV, which serves as our performance yardstick. Note that any required
preprocessing is excluded from the timings. In Figure 2 we present performance and data transfer
volumes for the Spin-26 matrix on a single socket of the Ivy Bridge EP and Skylake SP systems.
For SpMV we recover the well-known memory bandwidth saturation paern as we ll the chip
(Figures 2(a) and 2(c)). Measuring the actual data volume from main memory using LIKWID we
nd 16.24 and 16.36 bytes per nonzero matrix entry (Figures 2(b) and 2(d)) on Ivy Bridge EP and
Skylake SP architectures. is corresponds to the denominator of ISpMV in Equation (2), so we
can determine αSpMV = 0.351 for Ivy Bridge EP and 0.367 for Skylake SP, thus we can calculate
an optimistic bound for the intensity of SymmSpMV according to Equation (3). Using the copy
and the load-only bandwidth of Ivy Bridge EP (see Table 1) in Equation (1) we nd a maximum
aainable SymmSpMV performance range for this matrix of PSymmSpMV = 7.63, . . . , 8.96 GF/s, while
for Skylake SP we expect PSymmSpMV = 19.49, . . . , 21.55 GF/s . is indicates a possible speedup of
approximately 1.4× – 1.6× compared to the SpMV baseline (5.5 GF/s and 13.41 GF/s on Ivy Bridge
EP and Skylake SP), the SymmSpMV implementation using MC falls short of this expectation and
is more than three times slower than SpMV.
e reason for this decrease is the nature of the MC permutation. For distance-2 coloring, sets
of structurally orthogonal rows have to be determined [15], i.e., rows that do not overlap in any
column entry. ese sets are referred to as colors. Figure 3 shows the corresponding permutation
and the obtained sets of colors applied to a toy problem with high data locality. Dierent rows of
the same color can be executed in parallel, but colors are operated one aer the other. Aer MC a
color may contain rows from very distant parts of the matrix, potentially destroying data locality.
Assuming that the LLC can hold a maximum of six elements, we nd that the RHS vector must
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:12 Christie Alappat et al.
1 1
2 2 2
3 3
1
1
2
2
2
3
3
1 1 1
2 2 2
1
1
1
2
2
2 1 1 1
2 2 2
1
1
1
2
2
2
Fig. 3. Illustration of the increase of α by MC. Numbers represent thread ids. Note that this figure shows only
rows of the matrix permuted according to MC, but in practice one would permute both rows and columns.
be loaded every time for each new color. is is the reason why we observe 3× more bytes per
nonzero for SymmSpMV with MC compared to SpMV, as seen in Figures 2(b) and 2(d). However,
our performance model indicates that SymmSpMV should exhibit only 0.7× the data trac of SpMV
(see red doed line in Figures 2(b) and 2(d)). Of course this eect strongly depends on the matrix
structure, the matrix size, and the cache size.
ABMC[21] tries to preserve data locality by rst partitioning the entire matrix into blocks of
specied size and then applying MC to these blocks. reads then work in parallel between blocks
of the same color. Along the lines of [39] we use METIS [23] to partition the matrix into blocks, and
COLPACK for MC. e size of blocks for ABMC is determined by a parameter scan (range 4 . . . 128;
see [21]). As stated above, the timing for the performance measurements excludes preprocessing
and the parameter search. is method reduces the data trac (see Figures 2(b) and 2(d)) as there
is beer data locality within a block. Consequently, the performance improves over plain MC (see
Figures 2(a) and 2(c)). However, we are far o the performance model prediction. In addition to data
locality, other factors like global synchronizations and false sharing also contribute to this failure.
ese eects strongly depend on the number of colors and in general increase with chromatic
number. In the case of the Spin-26 matrix the overhead of synchronization is roughly 10% for the
MC method. For most of the matrices considered in this work one can also observe a strong positive
correlation between false sharing and the number of threads for SymmSpMV kernels due to the
indirect writes in SymmSpMV.
4 RECURSIVE ALGEBRAIC COLORING ENGINE (RACE)
Our advanced coloring algorithm is based on three steps:
(1) level construction,
(2) distance-k coloring,
(3) load balancing.
In the rst step we apply a bandwidth reduction algorithm including level construction and matrix
reordering. We then use the information from the level construction step to form subsets of levels
which allow for hardware ecient distance-k coloring of the graph. Finally we present a concept
to ensure load balancing between threads. ese steps are applied recursively if required.
To illustrate the method we choose a simple matrix which is associated with an articially
constructed two-dimensional stencil as shown in Figure 4(a). e corresponding sparsity paern
and the graph of the matrix are shown in Figures 4(b) and 4(c) respectively.
Definitions
We need the following denitions from graph theory:
• Graph: G = (V ,E) represents a graph, with V (G) denoting its set of vertices and E(G)
denoting its edges. Note that we restrict ourselves to irreducible undirected graphs.
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:13
(a) Stencil example
0 10 20 30 40 50 60
0
10
20
30
40
50
60
(b) Sparsity paern
0
8
16
24
32
40
48
56
1
9
17
25
33
41
49
57
2
10
18
26
34
42
50
58
3
11
19
27
35
43
51
59
4
12
20
28
36
44
52
60
5
13
21
29
37
45
53
61
6
14
22
30
38
46
54
62
7
15
23
31
39
47
55
63
(c) Graph
Fig. 4. (a) Structure of an artificially designed stencil, (b) corresponding sparsity paern of its matrix represen-
tation on an 8 × 8 laice with Dirichlet boundary conditions, and (c) the graph representation of the matrix.
The stencil structure was chosen for illustration purposes and does not represent any specific application
scenario.
• Neighborhood: N (u) is the neighborhood of a vertex u and is dened as
N (u) = {v ∈ V (G) : (u,v) ∈ E(G) } .
• kth Neighborhood: N k (u) of a vertex u is dened as
N 2(u) = N (N (u))
N 3(u) = N 2(N (u))
...
N k (u) = N k−1(N (u)) .
• Subgraph: In this paper a subgraph H of G specically refers to the subgraph induced by
vertices V ′ ⊆ V (G) and is dened as
H = (V ′, { (u,v) : (u,v) ∈ E(G) and u,v ∈ V ′ }) .
4.1 Level Construction
e rst step of RACE is to determine dierent levels in the graph and permute the graph data
structure. is we achieve using well-known bandwidth reduction algorithms such as reverse
Cuthill McKee (RCM) [7] or breadth-rst search (BFS) [28]. Although the RCM method is also
implemented in RACE, we use the BFS reordering in the following for simpler illustration.
First we choose a root vertex and assign it to the rst level, L(0). For i > 0, level L(i) is dened to
contain vertices that are in the neighborhood of vertices in L(i − 1) but not in the neighborhood of
vertices in L(i − 2) [9], i.e.,
L(i) =

root if i = 0,
u : u ∈ N (L(i − 1)) if i = 1,
u : u ∈ N (L(i − 1)) ∩ N (L(i − 2)) otherwise.
(5)
From Equation (5) one nds that the ith level consists of all vertices that have a minimum distance
i from the root node. Algorithm 3 shows how to determine this distance and thus set up the levels
L(i). We refer to the total number of levels obtained for a particular graph as N` . Figure 5(a) shows
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:14 Christie Alappat et al.
0
8
16
24
32
40
48
56
1
9
17
25
33
41
49
57
2
10
18
26
34
42
50
58
3
11
19
27
35
43
51
59
4
12
20
28
36
44
52
60
5
13
21
29
37
45
53
61
6
14
22
30
38
46
54
62
7
15
23
31
39
47
55
63
00
81
162
243
324
405
486
567
11
92
173
254
335
416
497
578
22
103
184
265
346
427
508
589
33
114
195
276
357
438
519
5910
44
125
206
287
368
449
5210
6011
55
136
217
298
379
4510
5311
6112
66
147
228
309
3810
4611
5412
6213
77
158
239
3110
3911
4712
5 13
6314
(a) Level construction
00
21
52
93
144
205
276
357
11
42
83
134
195
266
347
428
32
73
124
185
256
337
418
489
63
114
175
246
327
408
479
5310
104
165
236
317
398
469
5210
5711
155
226
307
388
459
5110
5611
6012
216
297
378
449
5010
5511
5912
6213
287
368
439
4910
5411
5812
6113
6314
(b) Permuted graph (G ′)

0
1
3
6
10
15
21
28
36
..

(c)
Fig. 5. (a) Levels of the original graph and (b) the permuted graph for the stencil example. Insets show the
corresponding sparsity paerns. (c) Shows the entries of the level ptr array associated with G ′.
the N`=14 levels of our articial stencil operator, where the index of each vertex (v) is the vertex
number and the superscript represents the level number, i.e.,
vi =⇒ v ∈ L(i) . (6)
Note that the L(i) are substantially dierent from the levels used in the “level-scheduling” [40]
approach, which applies “depth rst search.”
Aer the levels have been determined, the matrix is permuted in the order of its levels, such that
the vertices in L(i) are stored consecutively and appear before those of L(i + 1). Figure 5 shows the
graph (G ′ = P(G)) of the stencil example aer applying this permutation (P ) and demonstrates the
enhanced spatial locality of the vertices within and between levels (see Figure 5(b)) as compared to
the original (lexicographic) numbering (see Figure 5(a)). Until now the procedure is the same as
BFS (or RCM).
As RACE uses information about the levels for resolving dependencies in the coloring step, we
store the index of the entry point to each level in the permuted data structure (of G ′) in an array
level ptr[0 : N`], so that levels on G ′ can be identied as
L(i) = {u : u ∈ [level ptr[i] : (level ptr[i + 1] − 1)] and u ∈ V (G ′) } .
e entries of level ptr for the stencil example are shown in Figure 5(c).
4.2 Distance-k coloring
e data structure generated above serves as the basis for our distance-k coloring procedure as
it contains information about the neighborhood relation between the vertices of any two levels.
Following the denition in [15], two vertices are called distance-k neighbors if the shortest path
connecting them consists of at most k edges. is implies that u is a distance-k neighbor of v
(referred to as u k−→ v) if
u
k−→ v ⇐⇒ v ∈ {u ∪ N (u) ∪ N 2(u) ∪ · · ·N k (u) } . (7)
For the undirected graphs as used here, u k−→ v also implies v k−→ u. Based on this denition we
consider two vertices to be distance-k independent if they are not distance-k neighbors, and two
levels are said to be distance-k independent if their vertices are mutually distance-k independent.
us, levels L(i) and L(i ± (k + j)) of the permuted graphG ′ are distance-k independent for all j ≥ 1,
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:15
00
21
52
93
144
205
276
357
11
42
83
134
195
266
347
428
32
73
124
185
256
337
418
489
63
114
175
246
327
408
479
5310
104
165
236
317
398
469
5210
5711
155
226
307
388
459
5110
5611
6012
216
297
378
449
5010
5511
5912
6213
287
368
439
4910
5411
5812
6113
6314
(a) Distance-1 independent level groups
00
21
52
93
144
205
276
357
11
42
83
134
195
266
347
428
32
73
124
185
256
337
418
489
63
114
175
246
327
408
479
5310
104
165
236
317
398
469
5210
5711
155
226
307
388
459
5110
5611
6012
216
297
378
449
5010
5511
5912
6213
287
368
439
4910
5411
5812
6113
6314
T (0)
T (1)
T (2)
T (3)
T (4) T (5) T (6) T (7)
(b) Distance-2 independent level groups
Fig. 6. Forming distance-1 and distance-2 independent level groups for the stencil example.
denoted as
L(i) 6 k−→ L(i ± (k + j))∀j ≥ 1 . (8)
Equation (8) implies that if there is a gap of at least one level between any two levels (e.g.,
L(i) and L(i + 2)) all pairs of vertices between these two levels are distance-1 independent. Similarly
if the gap consists of at least two levels between any two levels (e.g., L(i) and L(i + 3)) we have
distance-2 independent levels, and so on.
e denition used in Equation (8) oers many choices for forming distance-k independent sets
of vertices, which can then be executed in parallel. In Figure 6 we present one example each for
distance-1 (Figure 6(a)) and distance-2 (Figure 6(b)) colorings of our stencil example. e distance-1
coloring uses a straightforward approach by assigning two colors to alternating levels, i.e., levels
of a color can be calculated concurrently. In case of distance-2 independence we do not use three
colors but rather aggregate two adjacent levels to form a level group (denoted by T (i)) and perform
a distance-1 coloring on top of those groups. is guarantees that vertices of two level groups of
the same color are distance-2 independent and can be executed in parallel. Here, the vertices in
T (0), T (2), T (4), and T (6) can be operated on by four threads in parallel, i.e., one thread per level
group. Aer synchronization the remaining four blue level groups can also be executed in parallel.
is idea can be generalized such that for distance-k coloring, each level group contains k adjacent
levels. us formed level groups are then distance-1 colored. en, all level groups within a color
can be executed in parallel. is simple approach allows one to generate workload for a maximum
of N`/2k threads if distance-k coloring is requested.7 Note that in all cases, vertices within a single
level group are computed in their original order, which allows for good spatial access locality.
Choosing the same number of levels for each level group may, however, cause severe load
imbalance depending on the matrix structure. In particular, the use of bandwidth reduction
schemes such as BFS or RCM will further worsen this problem due to the lenslike shape of the
reordered matrix (see inset of Figure 5(b)), leading to low workload for level groups containing the
top and boom rows of the matrix. Compare, e.g., T (0) and T (7) with T (3) and T (4) in Figure 6(b).
However, Equation (8) does not require exactly k levels to be in a level group but only at least k . In
the following we exploit this to alleviate the imbalance problem.
7is implies that as the number of levels increases, so does the parallelism. E.g., if the matrix contains at least one dense
row, there is parallelism as N` = 2 in this case.
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:16 Christie Alappat et al.
4.3 Load balancing
e RACE load balancing scheme tries to balance the workload across level groups within each
color for a given number of threads while maintaining data locality and the distance-k constraint
between the two colors. To achieve this we use an idea similar to incremental graph partitioning
[37]. e level groups containing low workload “grab” adjacent levels from neighboring level
groups; overloaded level groups shi levels to adjacent level groups. One can either balance the
number of rows (i.e., vertices) Nr(T (i)) or the number of nonzeros (i.e., edges) Nnz(T (i)). Both
variants are supported by our implementation, and we choose balancing by number of rows in the
following to demonstrate the method (see Algorithm 4).
For a given set of level groups we calculate the mean and variance of Nr(T (i)) within each color
(red and blue). e overall variance, which is the target of our minimization procedure, is then
found by summing up the variances across colors. In order to reduce this value we rst select the
two level groups with largest negative/positive deviation from the mean (which is T (5) and T (4)
in step 1 of Figure 7) and try to add/remove levels to/from them (see top row of Figure 7). When
removing levels from a level group, the distance-k coloring is strictly maintained by keeping at
least k levels in it. e shi of levels is done with the help of an arrayT ptr [], which holds pointers
to the beginning of each level group (see Figure 7), avoiding any copy operation. If shiing levels
between the two level groups with the largest deviation does not lead to a lower overall variance,
no levels are exchanged and we choose the next pair of level groups according to a ranking which
is based on the absolute deviation from the mean (see Algorithm 4 for implementation details) and
continue. Following this process in an iterative way we nally end up in a state of lowest overall
variance at which no further moves are possible, either because they would violate the distance-k
dependency or lead to an increase in overall variance. Figure 7 shows the load balancing procedure
under a distance-2 constraint for some initial mapping of 17 levels to six level groups. Applying
the procedure to our stencil example of size 16 × 16, requesting distance-2 coloring and ten level
groups leads to the mapping shown in Figure 8(a). Note that level groups at the extreme ends have
more levels due to fewer vertices (Nr) in each level, while level groups in the middle, having more
vertices, maintain two levels to preserve the distance-2 constraint.
4.4 Recursion
As discussed in Section 4.2, the maximum degree of parallelism is limited by the total number
of levels (N`) and may be further reduced by level aggregation. In case of our 16 × 16 stencil
example the maximum possible parallelism is eight threads, which may cause load imbalance as
seen in Figure 8(b). Hence, further parallelism must be found within the level groups. Compared to
methods like MC we do not require all vertices in a level group to be distance-1 (or distance-k in
general) independent. is is a consequence of our level-based approach, which requires distance-k
independence between vertices of dierent levels but not within a level (see Equation (8)). ere
may be more parallelism hidden within the level groups, which can be interpreted as subgraphs.
us we apply the three steps of our method recursively on selected subgraphs to exploit the
parallelism within them.
In the following section we rst demonstrate the basic idea in the context of distance-1 dependen-
cies, which can be resolved within the given level group by design. However, for k > 1, vertices in a
level group may have distance-k dependencies via vertices in adjacent level groups. We generalize
our procedure to distance-k dependencies as a second step in Section 4.4.2. Finally, in Section 4.4.3
we apply the recursive scheme to our stencil example and introduce proper subgraph selection as
well as global load balancing strategies.
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:17
0 3 6 9 12 15 17
1 4 2 3 12 5 6 5 4 3 9 2 8 4 11 2 4
7 20 15 14 23 6 mean r = 15 mean b = 13.3
-8.06.7 0 0.7 8.0-7.3 var = 37.78
T ptr
Nr(L(i))
T size =
Nr(T (j))
signed dev.
Step 1
0 3 6 9 12 14 17
1 4 2 3 12 5 6 5 4 3 9 2 8 4 11 2 4
7 20 15 14 12 17 mean r = 11.3 mean b = 17
-4.33.0 3.7-3.00.7 0 var = 8.45
T ptr
Nr(L(i))
T size
signed dev.
Step 2
0 4 7 9 12 14 17
1 4 2 3 12 5 6 5 4 3 9 2 8 4 11 2 4
10 23 9 14 12 17 mean r = 10.3 mean b = 18
-0.35.0-1.3-3.01.7-1.0 var = 6.6
T ptr
Nr(L(i))
T size
signed dev.
Step 3
0 4 6 8 12 14 17
1 4 2 3 12 5 6 5 4 3 9 2 8 4 11 2 4
10 17 11 18 12 17 mean r = 11 mean b = 17.3
-1.0-0.3 0 0.7 1.0-0.3 var = 0.44
T ptr
Nr(L(i))
T size
signed dev.
Step 4
Fig. 7. All steps of the load balancing scheme, applied to an arbitrarily chosen initial distribution of 17 levels
into six level groups for distance-2 coloring. Rebalancing steps are performed clockwise starting from top le.
mean r andmean b denote the current average number of rows per level group and color. var is the overall
variance.
00
21
52
93
144
205
276
357
448
549
6510
7711
9012
10413
11914
13515
11
42
83
134
195
266
347
438
539
6410
7611
8912
10313
11814
13415
15016
32
73
124
185
256
337
428
529
6310
7511
8812
10213
11714
13315
14916
16417
63
114
175
246
327
418
519
6210
7411
8712
10113
11614
13215
14816
16317
17718
104
165
236
317
408
509
6110
7311
8612
10013
11514
13115
14716
16217
17618
18919
155
226
307
398
499
6010
7211
8512
9913
11414
13015
14616
16117
17518
18819
20020
216
297
388
489
5910
7111
8412
9813
11314
12915
14516
16017
17418
18719
19920
21021
287
378
479
5810
7011
8312
9713
11214
12815
14416
15917
17318
18619
19820
20921
21922
368
469
5710
6911
8212
9613
11114
12715
14316
15817
17218
18519
19720
20821
21822
22723
459
5610
6811
8112
9513
11014
12615
14216
15717
17118
18419
19620
20721
21722
22623
23424
5510
6711
8012
9413
10914
12515
14116
15617
17018
18319
19520
20621
21622
22523
23324
24025
6611
7912
9313
10814
12415
14016
15517
16918
18219
19420
20521
21522
22423
23224
23925
24526
7812
9213
10714
12315
13916
15417
16818
18119
19320
20421
21422
22323
23124
23825
24426
24927
9113
10614
12215
13816
15317
16718
18019
19220
20321
21322
22223
23024
23725
24326
24827
25228
10514
12115
13716
15217
16618
17919
19120
20221
21222
22123
22924
23625
24226
24727
25128
25429
12015
13616
15117
16518
17819
19020
20121
21122
22023
22824
23525
24126
24627
25028
25329
25530
(a) Five threads
00
21
52
93
144
205
276
357
448
549
6510
7711
9012
10413
11914
13515
11
42
83
134
195
266
347
438
539
6410
7611
8912
10313
11814
13415
15016
32
73
124
185
256
337
428
529
6310
7511
8812
10213
11714
13315
14916
16417
63
114
175
246
327
418
519
6210
7411
8712
10113
11614
13215
14816
16317
17718
104
165
236
317
408
509
6110
7311
8612
10013
11514
13115
14716
16217
17618
18919
155
226
307
398
499
6010
7211
8512
9913
11414
13015
14616
16117
17518
18819
20020
216
297
388
489
5910
7111
8412
9813
11314
12915
14516
16017
17418
18719
19920
21021
287
378
479
5810
7011
8312
9713
11214
12815
14416
15917
17318
18619
19820
20921
21922
368
469
5710
6911
8212
9613
11114
12715
14316
15817
17218
18519
19720
20821
21822
22723
459
5610
6811
8112
9513
11014
12615
14216
15717
17118
18419
19620
20721
21722
22623
23424
5510
6711
8012
9413
10914
12515
14116
15617
17018
18319
19520
20621
21622
22523
23324
24025
6611
7912
9313
10814
12415
14016
15517
16918
18219
19420
20521
21522
22423
23224
23925
24526
7812
9213
10714
12315
13916
15417
16818
18119
19320
20421
21422
22323
23124
23825
24426
24927
9113
10614
12215
13816
15317
16718
18019
19220
20321
21322
22223
23024
23725
24326
24827
25228
10514
12115
13716
15217
16618
17919
19120
20221
21222
22123
22924
23625
24226
24727
25128
25429
12015
13616
15117
16518
17819
19020
20121
21122
22023
22824
23525
24126
24627
25028
25329
25530
(b) Eight threads
Fig. 8. Domain size 16 × 16 for the stencil example and distance-2 dependency, (a) aer load balancing for
five threads, (b) aer load balancing for eight threads.
In order to visualize the basic concepts easily and discuss important corner cases of the recursive
approach we start with the simple graph shown in Figure 9(a), which is not related to our stencil
example. To distinguish between level groups at dierent stages s of the recursive procedure we
add a subscript to the levels (Ls (i)) and level groups (Ts (i)) indicating the stage of recursion at
which they are generated, with s = 0 being the original distribution before recursion is applied to
any subgraph.
4.4.1 Distance-1 dependency. For the distance-1 coloring of the graph in Figure 9 we nd that
three of the four level groups of the initial stage still contain distance-1-independent vertices; e.g.,
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:18 Christie Alappat et al.
0
1 2
3 4 5 6
7 8 9
(a) Example graph
00
11 21
32 42 52 62
73 83 93
L0(0)
L0(1)
L0(2)
L0(3)
(b) Stage 0, levels in graph
00
11 21
32 42 52 62
73 83 93
T0(0)
T0(1)
T0(2)
T0(3)
(c) distance-1 coloring
Fig. 9. Exposing potential for more parallelism in a graph with distance-1 coloring. T0(1),T0(2), and T0(3)
have internal unexposed parallelism. Note that the graph shown here is not related to the previous stencil
example.
00
11 21
32 42 52 62
73 83 93
(a)
62
52
42
32
(b)
62,4
52,3
42,2
32,0
(c)
62,4
52,3
42,2
32,0
(d)
62,4
52,3
42,2
32,0
(e)
Fig. 10. Applying recursion to the subgraph induced byT0(2). Figure 10(b) shows the isolated subgraph, while
Figure 10(c) presents the level construction step on the subgraph. Two potential distance-1 colorings of this
subgraph are shown in Figures 10(d) and 10(e).
in T0(2) we have vertices 3 6 1−→ 4 (3 distance-1 independent to 4), 3 6 1−→ 5, 3 6 1−→ 6, and 4 6 1−→ 6, implying
each of these pairs can be computed in parallel without any distance-1 conicts. is parallelism is
not exposed at the rst stage (s = 0) as vertices in L0(i) are chosen such that they are distance-1
neighbors of L0(i − 1), ignoring any vertex relations within L0(i).
Recursion starts with the selection of a subgraph of the matrix, which is discussed in more detail
later (see Section 4.4.3). Here we choose the subgraph induced by T0(2). It can be isolated from the
rest of the graph since the distance-1 coloring step in stage 0 has already produced independent level
groups. Now we just need to repeat the three steps explained previously (Section 4.1–Section 4.3)
on this subgraph.
Figure 10 shows an illustration of applying the rst recursive step (s = 1) on T0(2), where we
extend the denition of the vertex numbering in Equation (6) to the following:
vi, j,k ... =⇒ v ∈ { L0(i) ∩ L1(j) ∩ L2(k) ∩ · · · } . (9)
At the end of the recursion (cf. Figures 10(d) and 10(e)) onT0(2), we obtain parallelism for two more
threads in this case. Note that the subgraphs might have “islands” (groups of vertices that are not
connected to the rest of the graph); e.g., vertex 3 and vertices 4,5,6 form two islands in Figure 10(b).
Since an island is disconnected from the rest of the (sub)graph it can be executed independently
and in parallel to it. To take advantage of this, the starting node in the next island is assigned
a level number with an increment of two, as seen in Figure 10(c). is allows for two dierent
colorings of the island, increasing the number of valid distance-1 congurations (cf. Figures 10(d)
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:19
00
11 21
32 42 52 62
73 83 93
T0(0)
T0(1)
(a)
32 42 52 62
73 83 93
(b)
32,0 42,3 52,3 62,4
73,1 83,2 93,4
(c)
11
32,0 42,3 52,3 62,4
73,1 83,2 93,4
7
T1(0)
T1(1)
T1(2)
(d)
Fig. 11. Two level groups generated by a distance-2 coloring (Figure 11(a)). Figure 11(b) shows the subgraph
induced by level group T0(1). Level construction on the selected subgraph is shown in Figure 11(c). Forming
distance-2 independent level groups on these levels does not guarantee a distance-2 independence between
the newly generated level groups of the same sweep (color) as seen in Figure 11(d).
and 10(e)). e selection of the optimal one will be done in the nal load balancing step as described
in Section 4.3.
As this recursive process nds independent level groups (Ts+1) within a level group of the
previous stage (Ts ), the thread assigned to Ts has to spawn threads to parallelize within Ts+1.
4.4.2 Distance-k dependencies with k > 1. In general, it is insucient to consider only the
subgraphs induced by level groups in the recursion step, as can be seen in Figure 11(a) for distance-
2 coloring. Applying the three steps (see Figures 11(b) to 11(d)) to the subgraph induced by T0(1)
does not guarantee distance-2 independence between the new level groups T1(0) and T1(2). It
is obvious that for general distance-k colorings two vertices a,b within a level group might be
connected by a shared vertex c outside the level group. us, our three step procedure must be
applied to a subgraph which contains the actual level group (Ts (j)) as well as its all distance-p
neighbors, where p = 1, 2, . . . , (k − 1).
is ensures that there is no vertex outside the subgraph which can mediate a distance-k
dependency between vertices in the embedded level group (Ts (j)). We can now construct the
new levels on this subgraph considering the neighborhood, but we only store the vertices in the
new levels Ls+1(:) that are in the embedded level group (Ts (j)). Next we apply distance-k coloring
by aggregation of the new levels, leading to a set of level groups Ts+1(:) within Ts (j). Figure 12
demonstrates this approach to resolve the conict shown in Figure 11(d). Figure 12(b) presents the
subgraph containing the selected level group T0(1) and its distance-1 neighborhood.
Level construction is performed on the subgraph (Figure 12(c)), but the new levels only contain
vertices of T0(1), i.e., L1(1) = {73,1}. Finally, distance-2 coloring by aggregation of two adjacent
levels is performed, leading to three level groups of the second stage s = 1 (Figure 12(d)), i.e.,
T1(0) = {L1(0) ∪ L1(1)}. Now vertices 3 and 6 are mapped to level groups of dierent colors. Note
that the permutation step on the newly generated levels is not shown but is performed as well to
maintain data locality.
4.4.3 Level group construction and global load balancing. e recursive renement of level groups
allows us to tackle load imbalance problems and limited degree of parallelism as we are no longer
restricted by the one thread per level group constraint. Instead, we have the opportunity to form
level groups and assign appropriate thread counts to them such that the load per thread approaches
the optimal value, i.e., the total workload divided by the number of threads available. Pairs of
adjacent level groups having dierent colors within a stage, i.e.,Ts (i) andTs (i+1)with i = 0, 2, 4, ...,
are typically handled by the same threads, so we assign an equal number of threads to them. We
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:20 Christie Alappat et al.
00
11 21
32 42 52 62
73 83 93
T0(0)
T0(1)
(a)
11 21
32 42 52 62
73 83 93
(b)
11,1 21,4
32,0 42,2 52,3 62,2
73,1 83,2 93,4
(c)
11,1 21,4
32,0 42,2 52,3 62,2
73,1 83,2 93,4T1(0)
T1(1)
T1(2)
(d)
Fig. 12. Correct procedure for distance-2 coloring of level group T0(1). The subgraph as shown in Figure 12(b)
contains level group T0(1) and its distance-1 neighborhood. A level construction step is applied to this
subgraph in Figure 12(c). Distance-2 coloring by level aggregation leading to level groups of stage 1 is shown
in Figure 12(d); we get three level groups at the end of the recursion on T0(1).
0
2
5
9
14
20
27
35
44
54
65
77
90
104
119
135
1
4
8
13
19
26
34
43
53
64
76
89
103
118
134
150
3
7
12
18
25
33
42
52
63
75
88
102
117
133
149
164
6
11
17
24
32
41
51
62
74
87
101
116
132
148
163
177
10
16
23
31
40
50
61
73
86
100
115
131
147
162
176
189
15
22
30
39
49
60
72
85
99
114
130
146
161
175
188
200
21
29
38
48
59
71
84
98
113
129
145
160
174
187
199
210
28
37
47
58
70
83
97
112
128
144
159
173
186
198
209
219
36
46
57
69
82
96
111
127
143
158
172
185
197
208
218
227
45
56
68
81
95
110
126
142
157
171
184
196
207
217
226
234
55
67
80
94
109
125
141
156
170
183
195
206
216
225
233
240
66
79
93
108
124
140
155
169
182
194
205
215
224
232
239
245
78
92
107
123
139
154
168
181
193
204
214
223
231
238
244
249
91
106
122
138
153
167
180
192
203
213
222
230
237
243
248
252
105
121
137
152
166
179
191
202
212
221
229
236
242
247
251
254
120
136
151
165
178
190
201
211
220
228
235
241
246
250
253
255
70
6
6
71
7273
4
6
75
77
789
80
82
1
3
48
8
88
7
5
7
93
94
96
98
9
100
103
1
02
04
05106
107
9
1
08
0
2
13
14
17
5
6
3
5
22
24
26
278
9
31
33
3
32
34
35
36
9
4
37
38
0
2
43
44
47
49
5
6
8
4
6
53
55
5
8
9
62
0
1
63
64
65
8
66
67
69
07
7
74
76
3
5
T0(4)
Initial Stage Recursion
(s = 0) (s = 1)
red orange
pink
blue brown
cyan
execution
tim
e
Fig. 13. Graph coloring of the stencil exam-
ple for eight threads. Recursion is applied
on level groupsT0(4−7)with two threads as-
signed to each. The parallel execution order
is shown on the right. Horizontal red dot-
ted lines indicate synchronization and its
extent. Vertical lines distinguish between
level groups of dierent stages (here T0 and
T1) which can run in parallel.
then apply recursion to the level groups with more than one thread assigned. Starting with the
original graph as the base level group (T−1(0)) to which all available threads Nt(T−1(0)) = Nt and all
vertices Nr(T−1(0)) = Nrtotal are assigned, we perform the following steps to form level groups Ts (:)
at stage s ≥ 0 to which we assign Nt(Ts (:)) threads. To illustrate the procedure we use the 16 × 16
stencil example and construct a coloring scheme for eight threads (see Figure 13).
(1) Assign weights to all levels at stage (s) of the recursion. Assuming that Ls (i) ⊂ Ts−1(j), its
weight is dened by
w(Ls (i)) = Nr(Ls (i))Nr(Ts−1(j))
Nt(Ts−1(j))
=
Nr(Ls (i))
Nr(Ts−1(j))Nt(Ts−1(j)).
For a given level group (Ts−1(j)) that has to be split up (Nt(Ts−1(j)) > 1), the weight
describes the fraction of the optimal load per thread, Nr(Ts−1(j))Nt(Ts−1(j)) , in the specic level (Ls (i)).
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:21
RequestingNt(T−1(0)) = 8 threads for theNr(T−1(0)) = 256 vertices of the stencil example
in Figure 13 produces the following weights for the initial (s = 0) levels:
{w(L0(0)),w(L0(1)),w(L0(2)), ...} =
{ 1
256 × 8,
2
256 × 8,
3
256 × 8, ...
}
.
(2) e above denition implies that if the weight is close to a natural number b, the corre-
sponding workload is near optimal for operation with b threads. us, starting with Ls (0)
we aggregate successive levels until their combined weight forms a number a close to
a natural number b. Distance-k coloring is ensured by enforcing it to aggregate at least
2 × k levels, i.e., for distance-2 coloring at least four levels (two for red and two for blue).
Closeness to the natural number is quantied by a parameter ϵ dened as
ϵ = 1 − abs(a − b), where b = max(1, [a])
and [a] is the nearest integer to a,
and controlled by the criterion
ϵ > ϵs ,where the ϵs ∈ [0.5, 1) are user dened parameters.
e choice of this parameter may be dierent for every stage of recursion. Once we nd
a collection of successive levels satisfying this criterion, the natural number b is xed.
We try to further increase the number of levels to test if there exists a number a′ > a
which is closer to b leading to an ϵ value closer to one. We nally choose the set of levels
with the best ϵ value and dene them to form Ts (0) and Ts (1) which are to be executed by
Nt(Ts (0)) = Nt(Ts (1)) = b threads. In Figure 13 we choose ϵs = 0.6, which selects the rst
seven levels to form T0(0) and T0(1). As their combined weight is 2832 = 0.875, one thread
will execute these two level groups.
(3) We continue with subsequent pairs of level groups (Ts (i),Ts (i + 1); i = 2, 4...) by applying
this procedure starting with the very next level. Finally, once all the levels have been
touched, a total of Nt(Ts−1(j)) threads have been assigned to the levels Ls (i) ⊂ Ts−1(j). For
example, forT0(4) andT0(5) in Figure 13 two threads satisfy the criterion as the total weight
of the four levels included is 5432 = 1.69.
(4) e distribution between adjacent red and blue level groups which are assigned to the same
thread(s) as well as the nal global load balancing is performed using a slight modication of
the scheme presented in Section 4.3 (shown at the beginning of Algorithm 4 in appendix A):
Now the calculation of mean and variance must consider the number of threads (Nt(Ts (j)))
assigned to each level group. e worker array now has to be replaced by the number of
threads assigned to each level group (Nt(Ts (j))). e algorithm then tries to minimize the
variance of the number of vertices per thread in level groups. Ideally, aer this step the
load per thread in each level group should approach the optimal value given above.
Once the level group of stage s has been formed, the recursion and the above procedure are
separately applied to all new level groups with more than one thread assigned. is continues
until every level group is assigned to one thread. e depth of the recursion is determined by the
parameter ϵs and depends on the matrix structure as well as degree of parallelism requested.
For our stencil example in Figure 13 the inner four level groups of stage s = 0 required one
stage of recursion. is led to 16 level groups at stage s = 1, as we require four new level groups
per recursion to schedule two threads. In terms of parallel computation, rst the red vertices
will be computed in parallel with the orange ones using four threads for both colors. Once the
orange vertices are done, each pair of threads assigned to T0(4) and T0(6) synchronize locally
(i.e., within T0(4) and T0(6) separately). en the pink vertices are computed followed by a global
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:22 Christie Alappat et al.
[0,255]
id=0,...,7
〈44〉
[0,14]
id=0
〈15〉
[15,27]
id=0
〈13〉
[28,44]
id=1
〈17〉
[45,65]
id=1
〈21〉
[66,90]
id=2,3
〈13〉
[66,72]
id=2
〈7〉
[73,78]
id=2
〈6〉
[79,84]
id=3
〈6〉
[85,90]
id=3
〈6〉
[91,119]
id=2,3
〈15〉
[91,98]
id=2
〈8〉
[99,105]
id=2
〈7〉
[106,112]
id=3
〈7〉
[113,119]
id=3
〈7〉
[120,150]
id=4,5
〈16〉
[120,127]
id=4
〈8〉
[128,134]
id=4
〈7〉
[135,142]
id=5
〈8〉
[143,150]
id=5
〈8〉
[151,177]
id=4,5
〈14〉
[151,157]
id=4
〈7〉
[158,163]
id=4
〈6〉
[164,170]
id=5
〈7〉
[171,177]
id=5
〈7〉
[178,200]
id=6
〈23〉
[201,219]
id=6
〈19〉
[220,234]
id=7
〈15〉
[235,255]
id=7
〈21〉
s=0
s=1
Fig. 14. The internal tree structure of RACE representing the stencil example for domain size 16 × 16 and
eight threads. The range [. . .] specified in each leaf represents the vertices belonging to each level group and
the id refers to the thread id assigned to each level group assuming compact pinning . The last entry 〈N er 〉
gives the eective row count introduced in Section 5.
synchronization of all threads. e scheme continues with the blue vertices and the brown/cyan
ones, which represent the two blue level groups to which recursion has been applied (see table in
Figure 13).
e recursive nature of our scheme can be best described by a tree data structure, where every
node represents one level group and the maximum depth is equivalent to the maximum level (stage)
of recursion. e data structure for the colored graph in Figure 13 and its thread assignments are
shown in Figure 14. e root node represents our baseline level group T−1(0) comprising all 256
vertices and all eight threads (having unique id = 0, . . . , 7). e rst level of child nodes gives
the initial (s = 0) distribution, with each node storing the information of a level group including
its color. reads are mapped consecutively to the level groups. e red T0(4) level group, which
consists of vertices 66, . . . , 90 (omiing the superscript for level numbers), is executed by threads
with id = 2, 3. Applying recursion to T0(4), this node spawns four new child nodes at stage s = 1,
i.e., level groups T1(0, . . . , 3) ⊂ T0(4), to be executed by the two threads. Synchronization only
happens between threads having the same parent node aer executing the same color. Note that
actual computations are only performed on the leaf nodes of the nal tree.
5 PARAMETER STUDY
e RACE method has a set of input parameters {ϵs ; s = 0, 1, . . .} which control the assignment of
threads to adjacent level groups. To determine useful seings, we analyze the interaction between
these input parameters, the number of threads used, and the parallel eciency of the generated
workload distribution.
As the internal tree structure contains all information about the nal workload distribution, we
can use it to identify the critical path in terms of workload and thus the parallel eciency. To
this end we introduce the eective row count for every node (or level group) N er (Ts (i)), which is
a measure for the absolute runtime to calculate the corresponding level group. For level groups
that are not further rened (leaf nodes) this value is their actual workload, i.e., the number of rows
assigned to them (N er (T0(0)) = 15 in Figure 14). For an inner node, the eective row count is the
sum of the maximum workload (i.e., the maximum eective row count value) across each of the two
colors of its child nodes:
N er (Ts (i)) = max
(
N er (Ts+1(j) ⊆ Ts (i))
)
+ max
(
N er (Ts+1(j + 1) ⊆ Ts (i))
)
for j = 0, 2, . . .
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:23
Such a denition is based on the idea that nodes at a given stage s have to synchronize with
each other and have to wait for their siblings with the largest workload in each sweep (color).
Propagating this information upwards on the tree until we reach the root node constructs the
critical path in terms of longest runtime taking into account single thread workloads, dependencies,
and synchronizations. us, the nal value in the root node N er (T−1(0)) can be considered as the
eective maximum workload of a single thread. Dividing the globally optimal workload per thread,
Nr
total/Nt, by this number gives the parallel eciency (η) of our workload distribution:
η =
Nr
total
N er (T−1(0)) × Nt
.
For the tree presented in Figure 14, the parallel eciency is limited to η = 25644×8 = 0.73 on eight
threads, i.e., the maximum parallel speedup is 5.8.
5.1 Parameter analysis and selection
e parallel eciency η as dened above can be calculated for any given matrix, number of threads
Nt, and choice of {ϵs ; s = 0, 1, . . .}; it reects the quality of parallelism generated by RACE for
the problem at hand. is way we can understand the interaction between these parameters and
identify useful choices for the ϵs . Of course, running a specic kernel such as SymmSpMV on
actual hardware will add further hardware and soware constraints such as aainable memory
bandwidth or cost of synchronization.
As a rst step we can limit the parameter space by simple corner case analysis. Seing all
parameters close to one requests high-quality load balancing but may prevent our balancing
scheme from terminating. In the extreme case of {ϵs = 1; s = 0, 1, . . .} the scheme may generate
only two level groups (one of each color) in each recursion, assign all threads to them, and may
further aempt to rene them in the same way. e lowest possible value of ϵs is the maximum
deviation of a real number from its nearest integer, which is 0.5. A range of [0.5,0.9] for the ϵs
is therefore used in the following. For a basic analysis we have selected the inline 1 matrix (see
Table 2) as it has a rather small amount of parallelism and allows us to cover basic scenarios. In
Figure 15 we demonstrate the impact of dierent choices for ϵ0 and ϵ1 on the parallel eciency for
thread counts up to 100, which is a useful limit for modern CPU-based compute nodes. For s > 1
we always set the minimum value of ϵs = 0.5. e limited parallelism can be clearly observed in
Figure 15(a), with eciency steadily decreasing with increasing thread count. At ϵ1 = 0.5 there is
only a minor impact of the parameter ϵ0. In Figures 15(b) to 15(d) the interplay between these two
parameters is analyzed at dierent thread counts in more detail. We nd that up to intermediate
parallelism (Nt = 50) the exact choice has only a minor impact on the parallel eciency (see y-axis
scaling). For larger parallelism the interplay becomes more intricate, where too large values of ϵ0,1
may lead to stronger imbalance. Based on this evaluation, we choose ϵ0,1 = 0.8 and ϵs = 0.5 for
s > 1 for all subsequent performance measurements. e quality of this choice in terms of parallel
eciency for all matrices is presented in Figure 16. Here we plot the η value for all the matrices
over a large thread count. We nd that our parameter seing achieves parallel eciencies of 80%
or higher for a substantial fraction of the matrices up to intermediate thread counts. Representing
the upper (lower) values in Figure 16 is the best (worst) case matrix Graphene-4096 (crankseg 1),
exhibiting almost perfect (very low) parallel eciency at intermediate to high thread counts.
Finally, we evaluate the scalability of RACE using these two corner cases and the inline 1 matrix
as well as the parabolic fem matrix, which is small enough to t into the cache. In Figure 17 we
mimic scaling tests on one Skylake processor with up to 20 cores (i.e., threads) and plot the parallel
eciency η as well as the maximum number of threads which can be “perfectly” used N et (i.e.,
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:24 Christie Alappat et al.
10 20 30 40 50 60 70 80 90 100
0
0.2
0.4
0.6
0.8
1
Nt
η
0 = 0.50
0 = 0.70
0 = 0.80
(a) η versus Nt for inline 1 matrix, ϵ1 = 0.5
0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9
0
0.2
0.4
0.6
0.8
1
0
η
1 = 0.50
1 = 0.70
1 = 0.80
1 = 0.90
(b) Nt=25
0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9
0
0.2
0.4
0.6
0.8
1
0
η
1 = 0.50
1 = 0.70
1 = 0.80
1 = 0.90
(c) Nt=45
0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9
0
0.2
0.4
0.6
0.8
1
0
η
1 = 0.50
1 = 0.70
1 = 0.80
1 = 0.90
(d) Nt=100
Fig. 15. Parameter study on the inline 1 matrix. In Figures 15(b) to 15(d) each of the lines in the plot are
iso-ϵ1 and impact of η with respect to ϵ0 is shown. ϵs for s > 1 is fixed to 0.5.
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
0
0.2
0.4
0.6
0.8
1
Nt
η
Fig. 16. Parallel eiciency η versus Nt for all test matrices with ϵ0,1 = 0.8 and ϵs>1 = 0.5.
N et = η × Nt). e unfavorable structure of the crankseg 1 matrix puts strict limits on parallelism
even for low thread counts. e combination of small matrix size with a rather dense population (see
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:25
2 4 6 8 10 12 14 16 18 20
1
2
3
4
5
6
7
Nt
N
ef
f
t
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
η
N efft
η
(a) crankseg 1
2 4 6 8 10 12 14 16 18 20
0
2
4
6
8
10
12
14
16
18
20
Nt
N
ef
f
t
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
η
N efft
η
(b) inline 1
2 4 6 8 10 12 14 16 18 20
0
2
4
6
8
10
12
14
16
18
20
Nt
N
ef
f
t
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
η
N efft
η
(c) parabolic fem
2 4 6 8 10 12 14 16 18 20
0
2
4
6
8
10
12
14
16
18
20
Nt
N
ef
f
t
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
η
N efft
η
(d) Graphene-4096
Fig. 17. N et and η versus Nt for the four corner case matrices, with the same seings used in experiment
runs. N et is defined as η × Nt.
Table 2) leads to large inner levels when constructing the graph, triggering strong load imbalance
if using more than six threads. A search for beer ϵs slightly changes the characteristic scaling
but not the maximum parallelism that can be extracted. For the inline 1 matrix we nd a weak
but steady decrease of the parallel eciency, which is in good agreement with the discussion of
Figure 15. e other two matrices scale very well in the range of thread counts considered.
e corresponding performance measurements for the SymmSpMV kernel (see Section 3.2) on
a single Skylake SP processor chip with 20 cores are shown in Figure 18.8 For the crankseg 1
matrix (see Figure 18(a)) we recover the limited scaling due to load imbalance as theoretically
predicted. A performance maximum is at nine cores, where the maximum SpMV performance can
be slightly exceeded. However, based on the roofline performance model given by Equations (1)
and (3) together with the matrix parameters from Table 2, a theoretical speedup of approximately
two as compared to SpMV can be expected for the full processor chip under best conditions. Indeed,
in case of the inline 1 and Graphene-4096 matrices, performance scales almost linearly until the
main memory bandwidth boleneck is hit. e saturated performance is in good agreement with
the roofline limits. Note that even though the inline 1 matrix does not exhibit perfect theoretical
eciency (η ≈ 0.85 at Nt = 20), it still generates sucient parallelism to achieve main memory
saturation: e memory boleneck can mitigate a limited load imbalance.
e peculiar performance behavior of parabolic fem (see Figures 17(c) and 18(c)) is due to its
smallness (≈ 23 MB), which lets it t into the caches of the Skylake processor (LLC size = 28 MB).
us, performance is not limited by the main memory bandwidth constraint and the roofline model
limits do not apply.
We have demonstrated that a simple choice for the only set of RACE input parameters {ϵs ; s =
0, 1, . . .} can extract sucient parallelism for most matrices considered in this study. Moreover, the
parallel eciency as calculated by RACE in combination with the roofline performance model is a
good indication for scalability and maximum performance of the actual computations.
6 PERFORMANCE EVALUATION OF SYMMSPMV USING RACE
We evaluate the performance of the SymmSpMV based on parallelization and reordering performed
by RACE and compare it with the two MC approaches introduced above and the Intel MKL. As a
yardstick for baseline performance we choose the general SpMV kernel and use the performance
model introduced in Section 3 to quantify the quality of our absolute performance numbers. As the
8For the benchmarking setup see Section 6.
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:26 Christie Alappat et al.
2 4 6 8 10 12 14 16 18 20
2
4
6
8
10
12
14
16
18
20
22
24
Nt
G
F
/s
SpMV
SymmSpMV
7.22 B
(a) crankseg 1
2 4 6 8 10 12 14 16 18 20
0
5
10
15
20
25
30
35
Nt
G
F
/s
SpMV
SymmSpMV
RLM-copy
RLM-load
12.18 B
(b) inline 1
2 4 6 8 10 12 14 16 18 20
0
5
10
15
20
25
30
Nt
G
F
/s
SpMV
SymmSpMV
RLM-copy
RLM-load
6.68 B
(c) parabolic fem
2 4 6 8 10 12 14 16 18 20
0
5
10
15
20
25
30
Nt
G
F
/s
SpMV
SymmSpMV
RLM-copy
RLM-load
16.72 B
(d) Graphene-4096
Fig. 18. Parallel performance measurements of SymmSpMV with RACE on one Skylake SP socket for the
four corner case matrices. The performance of the basic SpMV kernel is presented for reference. For the
matrices Figures 18(b) to 18(d) the maximum roofline performance limits Equation (1) are given using the
computational intensity Equation (3) for the two extreme cases of load-only memory bandwidth (RLM-load)
and copy memory bandwidth (RLM-copy). The measured full socket main memory data traic per nonzero
entry of the symmetric matrix (in bytes) for the SymmSpMV operation is also shown, where values below 12
bytes indicate caching of the matrix entries.
deviations between dierent measurement runs are less than 5%, we do not show the error bar in
our performance measurements.
6.1 Experimental Setup
All matrix data are encoded in the CRS format. For the SymmSpMV only the nonzeros of the upper
triangular matrix are stored. In the case of RACE and the coloring approaches every thread executes
the SymmSpMV kernel Algorithm 2 with appropriate outer loop boundary seings depending on
the color (MC, ABMC) or level groups (RACE) to be computed. In order to ensure vectorization
of the inner loop in Algorithm 2 we use the SIMD pragma #pragma simd reduction(+:tmp)
vectorlength(VECWIDTH). Here VECWIDTH is the maximum vector width supported by the archi-
tecture, i.e., VECWIDTH = 4 (8) for Ivy Bridge EP (Skylake SP).
e Intel MKL oers two choices for the two sparse matrix kernels under consideration: First,
CRS based data structures are provided and are used in the subroutines (mkl cspblas dcsrgemv for
SpMV and mkl cspblas dcsrsymv for SymmSpMV) without any modication (MKL). is mode
of operation is deprecated from Intel MKL.v.18. Instead, the inspector-executor mode (MKL-IE)
is recommended to be used. Here, the user initially provides the matrix along with hints (e.g.,
symmetry) and operations to be carried out to the inspector routine (mkl sparse set mv hint).
en an optimization routine (mkl sparse optimize) is called where the matrix is preprocessed
based on the inspector information to achieve best performance and highest parallelism for the
problem at hand. e subroutine mkl sparse d mv is then used to do the SpMV or SymmSpMV
operations on this optimized matrix structure. is approach does not provide any insight into
which kernel or data structure is actually used “under the hood.”
In the performance measurements the kernels are executed multiple times in order to ensure
reasonable measurement times and average out potential performance uctuations. Doing succes-
sive invocations of the kernel on the same two vectors, however, may lead to unrealistic caching of
these vectors if the number of rows is small enough. us, we use two ring buers (at least of 50
MB each) holding separate vectors of size Nr. Aer each kernel invocation we switch to the next
vector in the two buers. is way we mimic the typical structure of iterative sparse solvers where
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:27
0 2 4 6 8 10
0
1
2
3
4
5
6
7
8
9
Threads
P
er
fo
rm
an
ce
(G
F
/s
)
SpMV
SymmSpMV MC
SymmSpMV ABMC
SymmSpMV RACE
RLM-copy
RLM-load
(a) SymmSpMV
SpM
V
Sym
mS
pM
V-M
C
Sym
mS
pM
V-A
BM
C
Sym
mS
pM
V-R
AC
E
0
10
20
30
40
50
60
70
B
/N
N
Z
MEM
L3
L2
Ideal SymmSpMV
(b) Data traic
0 2 4 6 8 10 12 14 16 18 20
0
2
4
6
8
10
12
14
16
18
20
22
Threads
P
er
fo
rm
an
ce
(G
F
/s
)
SpMV
SymmSpMV MC
SymmSpMV ABMC
SymmSpMV RACE
RLM-copy
RLM-load
(c) SymmSpMV
SpM
V
Sym
mS
pM
V-M
C
Sym
mS
pM
V-A
BM
C
Sym
mS
pM
V-R
AC
E
0
10
20
30
40
50
60
70
B
/N
N
Z
MEM
L3
L2
Ideal SymmSpMV
(d) Data traic
Fig. 19. Performance (Figure 19(a)) and data traic (Figure 19(b)) analysis for SymmSpMV kernel with Spin-26
matrix using MC, ABMC, and RACE on a single socket of Ivy Bridge EP. The corresponding measurements for
a single socket of Skylake SP is shown in Figures 19(c) and 19(d). The roofline performance model (using copy
and load-only bandwidth) and the performance of the SpMV kernel is ploed for reference in scaling plots
Figures 19(a) and 19(c). The average data traic per nonzero entry (Nnzr) of the full matrix as measured with
LIKWID for all cache levels and main memory is shown together with the minimal value for main memory
access (horizontal dashed line) in Figures 19(b) and 19(d).
between successive matrix-vector operations other data intensive kernels are executed, e.g., several
Level 1 BLAS routines or preconditioning steps. We run over these two buers 100 iterations (times)
and report the mean performance.
For all methods and libraries the input matrices have been preprocessed with RCM bandwidth
reduction using the Intel SpMP library [42]. is provides the same or beer performance on all
matrices as compared to the original ordering. If not otherwise noted we use the full processor
chip and assign one thread to each core. As we focus on a single chip and sub-NUMA clustering
(SNC) is not enabled on Skylake SP, no NUMA data placement eects impact our results.
6.2 Results
Before we evaluate the performance across the full set of matrices presented in Table 2 we return
to the analysis of the SymmSpMV performance and data trac for the Spin-26 matrix which we
have presented in Section 3.3 for the established coloring approaches.
6.2.1 Analysis of SymmSpMV kernel using RACE for the Spin-26 matrix. e shortcomings in
terms of performance and excessive data transfer for parallelization of SymmSpMV using MC and
ABMC have been demonstrated in Figure 2. We extend this evaluation by comparison with the
RACE results in Figure 19. e gures clearly demonstrate the ability of RACE to ensure high
data locality in the parallel SymmSpMV kernel. e actual main memory trac achieved is inline
with the minimum trac for that matrix (see discussion in Section 3.3) and a factor of up to 4×
lower than the coloring approaches. Correspondingly, RACE SymmSpMV performance is at least
3.3× higher than its best competitor and 25% beer than the SpMV kernel on both architectures.
It achieves more than 84% of the rooine performance limit based on the copy main memory
performance. Note that the indirect update of the LHS vector will generate a store instruction for
every inner loop iteration (see Algorithm 2), while the SpMV kernel only does a nal store at the
end of the inner loop iteration. In combination with the low number of nonzeros per row (Nnzr) of
the Spin-26 matrix, the “copy” induced limit poses a realistic upper performance bound.
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:28 Christie Alappat et al.
cr
an
ks
eg
-1
sh
ip
-0
03
pw
tk
off
sh
or
e
F
1
in
lin
e-
1
pa
ra
bo
lic
-f
em
gs
m
-1
06
85
7
Fa
ul
t-
63
9
H
ub
ba
rd
-1
2
E
m
ili
a-
92
3
au
di
kw
-1
bo
ne
01
0
di
el
F
ilt
er
V
3r
ea
l
th
er
m
al
2
Se
re
na
G
eo
-1
43
8
H
oo
k-
14
98
F
la
n-
15
65
G
3-
ci
rc
ui
t
A
nd
er
so
n-
16
.5
Fr
ee
B
os
on
C
ha
in
-1
8
nl
pk
kt
12
0
ch
an
ne
l-5
00
x1
00
x1
00
-b
05
0
H
P
C
G
-1
92
Fr
ee
Fe
rm
io
nC
ha
in
-2
6
Sp
in
-2
6
H
ub
ba
rd
-1
4
nl
pk
kt
20
0
de
la
un
ay
-n
24
G
ra
ph
en
e-
40
96
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
P
er
f
(G
F
/s
)
RACE-SymmSpMV MKL-IE-SymmSpMV MKL-SymmSpMV MKL-SpMV RLM
(a) Ivy Bridge EP
cr
an
ks
eg
-1
sh
ip
-0
03
pw
tk
off
sh
or
e
F
1
in
lin
e-
1
pa
ra
bo
lic
-f
em
gs
m
-1
06
85
7
Fa
ul
t-
63
9
H
ub
ba
rd
-1
2
E
m
ili
a-
92
3
au
di
kw
-1
bo
ne
01
0
di
el
F
ilt
er
V
3r
ea
l
th
er
m
al
2
Se
re
na
G
eo
-1
43
8
H
oo
k-
14
98
F
la
n-
15
65
G
3-
ci
rc
ui
t
A
nd
er
so
n-
16
.5
Fr
ee
B
os
on
C
ha
in
-1
8
nl
pk
kt
12
0
ch
an
ne
l-5
00
x1
00
x1
00
-b
05
0
H
P
C
G
-1
92
Fr
ee
Fe
rm
io
nC
ha
in
-2
6
Sp
in
-2
6
H
ub
ba
rd
-1
4
nl
pk
kt
20
0
de
la
un
ay
-n
24
G
ra
ph
en
e-
40
96
0
2
4
6
8
10
12
14
16
18
20
22
24
26
28
30
32
34
36
38
40
42
P
er
f
(G
F
/s
)
RACE-SymmSpMV MKL-IE-SymmSpMV MKL-SymmSpMV MKL-SpMV RLM
(b) Skylake SP
Fig. 20. Performance of SymmSpMV executed with RACE compared to the performance model and Intel
MKL implementations. SpMV performance obtained using Intel MKL library is also shown for reference. The
model prediction is derived for bandwidths in the range of load and copy bandwidth, and using the measured
αSpMV shown in Table 3.
6.2.2 Analyzing absolute performance of RACE. We now extend our RACE performance investi-
gation to the full set of test matrices presented in Table 2. In Figures 20(a) and 20(b) the performance
results for the full Ivy Bridge EP processor chip (10 cores) and the full Skylake SP processor chip (20
cores) are presented along with the upper rooine limits and the performance of the baseline SpMV
kernel using Intel MKL. e matrices are arranged along the abscissa according to the ascending
number of rows (Nr), i.e., increasing size of the two vectors involved in SymmSpMV. Overall
RACE performance comes close to or matches our performance model for many test cases on both
architectures. A comparison of the architectures shows that the corner case matrices crankseg 1
and parabolic fem have a strikingly dierent behavior for RACE. For crankseg 1 this is caused
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:29
by the limited amount of parallelism in its structures. Here we refer to the discussion of Figure 17(a)
where best performance and highest parallelism (N et ) were achieved at approximately 10 cores.
Using only 9 cores on Skylake SP lis the SymmSpMV performance of crankseg 1 slightly above
the SpMV level of the MKL. e parabolic fem has been chosen to t into the LLC of the Skylake
SP architecture to provide a corner case where scalability is not intrinsically limited by main
memory bandwidth (see Figure 17(c)) and thus our rooine performance limit does not apply for
this matrix on Skylake SP. However, on Ivy Bridge EP the matrix data set just exceeds the LLC and
the performance is inline with our model.
On both architectures a characteristic drop in performance levels is encountered around the
Flan 1565 and G3 circuit matrices, where the aggregate size of the two vectors (25 MB) ap-
proaches the available LLC sizes. For smaller matrices we have a higher chance that the vectors
stay in the cache during the SymmSpMV, i.e., the vectors must only be transferred once between
main memory and the processor for every kernel invocation. For larger matrices (i.e., larger Nr )
the reuse of vector data during a single SymmSpMV kernel decreases and vector entries may be
accessed several times from the main memory. is is reected by the increase in measured αSpMV
(assumed αSymmSpMV ) values for matrices with index 20 and higher in Table 3.
In short, RACE has an average speedup of 1.4× and 1.5× compared to SpMV on the Skylake SP and
Ivy Bridge EP architectures, respectively. On Skylake SP RACE SymmSpMV aains on an average
87% and 80% of the rooine performance limits predicted using the copy and load bandwidth,
respectively, while on Ivy Bridge EP we are 91% and 83% close to the respective performance
models.
e MKL implementations of SymmSpMV deserves a special consideration in this context.
erefore, in Figure 20 we also compare our approach with the two Intel MKL options described
above. For the MKL-IE variant we specify exploiting the symmetry of the matrix when calling the
inspector routine. On the Ivy Bridge EP architecture, RACE always provides superior performance
levels and the best performing Intel variant depends on the underlying matrix. On the Skylake SP,
however, MKL-IE always outperforms the deprecated MKL routine and is superior to RACE for two
matrices (crankseg-1,offshore). ese are the same matrices where RACE is slower than the
MKL SpMV kernel (see Figure 20(b)). It can be clearly seen that the MKL-IE data for SymmSpMV
are identical with the MKL SpMV numbers presented in Figure 20, i.e., the inspector calls the
baseline SpMV kernel and uses the full matrix, though it knows about the symmetry of the matrix.
One reason for that strategy might be that the parallelization approach used in the deprecated
MKL implementation for SymmSpMV is not scalable which would explain the fact that MKL is
worse than MKL-IE for all cases on Skylake SP. As neither the algorithm used to parallelize the
SymmSpMV nor its low level code implementation is known, we refrain from a deep analysis of the
Intel performance behavior. In summary we nd that RACE is on average 1.4× faster than the best
Intel variant and can achieve speedups of up to 2×. Note that on Skylake SP the best MKL variant
is always MKL-IE, which has almost twice the memory footprint compared to the SymmSpMV
with RACE.
6.2.3 Single core performance. Although single core performance is oen considered not to be
crucial for the full chip SymmSpMV performance, we demonstrate that it is vital to explain some
of the performance behaviors. For example the drop in SymmSpMV performance for matrices like
Hubbard-12 and delaunay n24 strongly correlates with the lower performance of the baseline
SpMV (see Figure 20). ese matrices are characterized by a rather low Nnzr and a larger αSpMV
value. Note that αSpMV measured for the SpMV kernel mainly accounts for the RHS vector trac
and the actual αSymmSpMV may even be higher as SymmSpMV requires two vectors to stay in
cache concurrently. Moreover, for these matrices the inner loop lengths are typically very short
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:30 Christie Alappat et al.
cr
an
ks
eg
-1
sh
ip
-0
03
pw
tk
off
sh
or
e
F
1
in
lin
e-
1
pa
ra
bo
lic
-f
em
gs
m
-1
06
85
7
Fa
ul
t-
63
9
H
ub
ba
rd
-1
2
E
m
ili
a-
92
3
au
di
kw
-1
bo
ne
01
0
di
el
F
ilt
er
V
3r
ea
l
th
er
m
al
2
Se
re
na
G
eo
-1
43
8
H
oo
k-
14
98
F
la
n-
15
65
G
3-
ci
rc
ui
t
A
nd
er
so
n-
16
.5
Fr
ee
B
os
on
C
ha
in
-1
8
nl
pk
kt
12
0
ch
an
ne
l-5
00
x1
00
x1
00
-b
05
0
H
P
C
G
-1
92
Fr
ee
Fe
rm
io
nC
ha
in
-2
6
Sp
in
-2
6
H
ub
ba
rd
-1
4
nl
pk
kt
20
0
de
la
un
ay
-n
24
G
ra
ph
en
e-
40
96
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
2.2
2.4
2.6
2.8
3
3.2
P
er
f
(G
F
/s
)
RACE-SymmSpMV MKL-SpMV
(a) Skylake SP
Fig. 21. Single core performance of SymmSpMV executed with RACE compared to SpMV performance using
Intel MKL.
(approximately Nnzr/2 on average) and consequently the SIMD vectorization performed by the
compiler may become inecient. is leads to lower single core performance as shown in Figure 21
for the Skylake SP architecture, where bad performance of SymmSpMV and SpMV can oen be
correlated with a small Nnzr value. For several matrices these combined eects overcompensate
the reduced matrix data trac of the SymmSpMV leading to worse single core performance
than running SpMV with the full matrix. Using the delaunay n24 matrix as a representative
for this class of matrices we demonstrate the basic challenge for SymmSpMV to exploit its basic
performance advantage over SpMV in Figure 22. Starting with an approximately 25% lower single
core performance (0.75 GF/s versus 0.98 GF/s) but having a 50% higher rooine performance limit
(approximately 18 GF/s; see Figure 20(b)) than the SpMV, the SymmSpMV is not able to saturate
the main memory bandwidth of the Skylake SP on its 20 cores. As speculated above, the single
core performance is limited by inecient SIMD vectorization of the extremely short inner loop
and switching back to scalar code does improve performance by 15% (see Figure 22). As we are still
substantially o the bandwidth limit we see this benet over the full chip. Using chips with larger
core counts would allow for further improving the SymmSpMV performance of this matrix. e
same arguments hold for the offshore matrix but here the eect compared to SpMV performance
is even more pronounced on Skylake SP. Here the full matrix can at least partially be held in the
large aggregate cache between successive kernel invocations and its performance is not limited
by the main memory bandwidth. In terms of caching eects we have also further identied at
least partial caching of the matrix for ship-003 and pwtk test cases by analyzing the overall data
trac in the kernel invocations. is is inline with their higher performance levels presented in
Figure 20(b).
6.2.4 Comparing RACE with MC and ABMC. Having well understood the performance charac-
teristics of SymmSpMV with RACE we nally compare this with the performance achieved by the
two coloring methods in Figure 23. Here the underlying algorithm as well as implementation are
known and are closely related to our approach. Overall the MC is not competitive and provides
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:31
0 2 4 6 8 10 12 14 16 18 20
0
2
4
6
8
10
12
Nt
G
F
/s
SpMV
SymmSpMV
SymmSpMV-Scalar
Fig. 22. Parallel performance of SymmSpMV (with RACE)
and SpMV (with Intel MKL) for the delaunay n24 ma-
trix on one socket of Skylake SP. To disable vectorization
(SymmSpMV-Scalar) we set VECWIDTH = 1 when compiling
the SymmSpMV kernel.
low performance levels for almost all the matrices on both architectures. e ABMC shows similar
performance characteristics as RACE until the two vectors involved in SymmSpMV approach the
size of the caches (cf. discussion of Figure 20). For matrices with suciently small Nr (le in the
diagram) the method can achieve between 70% and 90% of RACE performance on most cases. For
matrices in the right part of the diagram with their higher Nr and αSpMV values, the ABMC falls
substantially behind RACE. Here, the strict orientation of the RACE design towards data locality
in the vector accesses delivers its full power. See also the data transfer discussion in Section 6.2.1
for the Spin-26 matrix. In total there are only three cases where ABMC performance is on a par
with or slightly above the RACE measurement and the average speedup of RACE is 1.5× and 1.65×
for Ivy Bridge EP and Skylake SP, respectively. Note that all three methods use the same baseline
kernels and thus performance dierences between the methods do not arise from dierent low
level code but from the ability to generate appropriate degrees of parallelism and to maintain data
locality.
7 CONCLUSION AND OUTLOOK
In this paper we have developed RACE, a coloring algorithm and open-source library imple-
mentation for exploiting parallelism in algorithms with inherent dependencies. RACE generates
hardware-ecient distance-k colorings of undirected graphs and puts emphasis on data access
locality, load balancing, and parallelism that is adapted to the number of cores of the underlying
architecture. We demonstrated these benets by applying RACE to symmetric sparse matrix-vector
multiplication (SymmSpMV) on modern multicore architectures and compared its performance
against standard multicoloring, algebraic block multicoloring, and Intel MKL implementations.
Average and maximum speedups of 1.4 and 2, respectively, could be observed across a representa-
tive set of 31 matrices on two modern Intel processors. Our entire experimental and performance
analysis process was backed by the Roofline performance model, corroborating the optimality of
the RACE approach in terms of resource utilization and shedding some new light on the challenges
of the SymmSpMV kernel on modern hardware. We demonstrated that RACE runs very close to
the Roofline limit for most of the 31 test cases. Outliers were analyzed and discussed in detail.
Similar to other coloring algorithms, the RACE method is not limited to the SymmSpMV kernel
and can be used to eciently parallelize solvers and kernels having general distance-k dependencies.
Moreover, due to the level-based formulation of RACE, the framework has an added advantage
that allows us to address other classes of problems. Future work with RACE will involve variants
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:32 Christie Alappat et al.
cr
an
ks
eg
-1
sh
ip
-0
03
pw
tk
off
sh
or
e
F
1
in
lin
e-
1
pa
ra
bo
lic
-f
em
gs
m
-1
06
85
7
Fa
ul
t-
63
9
H
ub
ba
rd
-1
2
E
m
ili
a-
92
3
au
di
kw
-1
bo
ne
01
0
di
el
F
ilt
er
V
3r
ea
l
th
er
m
al
2
Se
re
na
G
eo
-1
43
8
H
oo
k-
14
98
F
la
n-
15
65
G
3-
ci
rc
ui
t
A
nd
er
so
n-
16
.5
Fr
ee
B
os
on
C
ha
in
-1
8
nl
pk
kt
12
0
ch
an
ne
l-5
00
x1
00
x1
00
-b
05
0
H
P
C
G
-1
92
Fr
ee
Fe
rm
io
nC
ha
in
-2
6
Sp
in
-2
6
H
ub
ba
rd
-1
4
nl
pk
kt
20
0
de
la
un
ay
-n
24
G
ra
ph
en
e-
40
96
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
P
er
f
(G
F
/s
)
RACE ABMC MC
(a) Ivy Bridge EP
cr
an
ks
eg
-1
sh
ip
-0
03
pw
tk
off
sh
or
e
F
1
in
lin
e-
1
pa
ra
bo
lic
-f
em
gs
m
-1
06
85
7
Fa
ul
t-
63
9
H
ub
ba
rd
-1
2
E
m
ili
a-
92
3
au
di
kw
-1
bo
ne
01
0
di
el
F
ilt
er
V
3r
ea
l
th
er
m
al
2
Se
re
na
G
eo
-1
43
8
H
oo
k-
14
98
F
la
n-
15
65
G
3-
ci
rc
ui
t
A
nd
er
so
n-
16
.5
Fr
ee
B
os
on
C
ha
in
-1
8
nl
pk
kt
12
0
ch
an
ne
l-5
00
x1
00
x1
00
-b
05
0
H
P
C
G
-1
92
Fr
ee
Fe
rm
io
nC
ha
in
-2
6
Sp
in
-2
6
H
ub
ba
rd
-1
4
nl
pk
kt
20
0
de
la
un
ay
-n
24
G
ra
ph
en
e-
40
96
0
2
4
6
8
10
12
14
16
18
20
22
24
26
28
30
32
34
36
38
40
42
P
er
f
(G
F
/s
)
RACE ABMC MC
(b) Skylake SP
Fig. 23. Comparison of SymmSpMV performance between RACE and coloring variants MC and ABMC.
Matrices are arranged in increasing number of rows (Nr).
of linear solvers and kernel operations like in-place matrix powers and polynomials, which are of
high interest in the scientic community.
ACKNOWLEDGMENTS
e project is funded by the German DFG priority programme 1648 “Soware for Exascale Comput-
ing (SPPEXA)” and the Swiss National Science Foundation (SNF) under the projects “Dual-Phase
Steels – From Micro to Macro Properties (EXASTEEL-2)” (DFG, SNF) and “Equipping Sparse Solvers
for Exascale (ESSEX-II)” (DFG). e authors wish to thank Andreas Alvermann for providing access
to his ScaMaC library, omas Gruber for supporting our LIKWID measurements and Moritz
Kreutzer for helpful discussions.
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:33
ACRONYMS
L(i) ith level.
Ls (i) ith level at stage s .
N` total levels in the graph.
Nr number of rows.
Nt number of threads.
Nnzr average number of nonzeros per row.
N
symm
nzr average number of nonzeros per row in symmetric matrix.
Nnz total number of nonzeros.
N er eective number of rows.
N et eective number of threads.
T (i) ith level group.
Ts (i) ith level group at stage s .
η theoretical parallel eciency.
bs single socket bandwidth.
s stage number of recursion.
ABMC algebraic block multicoloring.
BFS breadth-rst search.
CRS compressed row storage.
Intel MKL Intel math kernel library.
LLC last level cache.
MC multicoloring.
RACE recursive algebraic coloring engine.
RCM reverse Cuthill McKee.
SNC sub-NUMA clustering.
SpMV sparse matrix-vector multiplication.
SymmSpMV symmetric sparse matrix-vector multiplication.
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:34 Christie Alappat et al.
REFERENCES
[1] Andreas Alvermann. 2019. ScaMaC: e Scalable Matrix Collection. hps://bitbucket.org/essex/matrixcollection/.
[2] D. Bozdag˘, U¨. C¸atalyu¨rek, A. Gebremedhin, F. Manne, E. Boman, and F. O¨zgu¨ner. 2010. Distributed-Memory Parallel
Algorithms for Distance-2 Coloring and Related Problems in Derivative Computation. SIAM Journal on Scientic
Computing 32, 4 (2010), 2418–2446. hps://doi.org/10.1137/080732158 arXiv:hps://doi.org/10.1137/080732158
[3] Doruk Bozdag˘, Assefaw H. Gebremedhin, Fredrik Manne, Erik G. Boman, and Umit V. Catalyurek. 2008. A framework
for scalable greedy coloring on distributed-memory parallel computers. J. Parallel and Distrib. Comput. 68, 4 (2008),
515 – 535. hps://doi.org/10.1016/j.jpdc.2007.08.002
[4] Aydin Buluc¸, Jeremy T. Fineman, Maeo Frigo, John R. Gilbert, and Charles E. Leiserson. 2009. Parallel Sparse
Matrix-vector and Matrix-transpose-vector Multiplication Using Compressed Sparse Blocks. In Proceedings of the
Twenty-rst Annual Symposium on Parallelism in Algorithms and Architectures (SPAA ’09). ACM, New York, NY, USA,
233–244. hps://doi.org/10.1145/1583991.1584053
[5] Aydin Buluc¸, Samuel Williams, Leonid Oliker, and James Demmel. 2011. Reduced-Bandwidth Multithreaded Algorithms
for Sparse Matrix-Vector Multiplication. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing
Symposium (IPDPS ’11). IEEE Computer Society, Washington, DC, USA, 721–733. hps://doi.org/10.1109/IPDPS.2011.73
[6] U.V. Catalyurek and C. Aykanat. 1999). Hypergraph-partitioning-based decomposition for parallel sparse-matrix
vector multiplication. IEEE Trans. Parallel Distrib. Systems 10, 7 (1999)), 673–693. hps://doi.org/10.1109/71.780863
[7] Elizabeth Cuthill. 1972. Several Strategies for Reducing the Bandwidth of Matrices. Springer US, Boston, MA, 157–166.
hps://doi.org/10.1007/978-1-4615-8675-3 14
[8] Timothy A. Davis and Yifan Hu. 2011. e University of Florida Sparse Matrix Collection. ACM Trans. Math. Sow. 38,
1, Article 1 (Dec. 2011), 25 pages. hps://doi.org/10.1145/2049662.2049663
[9] Josep Dı´az, Jordi Petit, and Maria Serna. 2002. A Survey of Graph Layout Problems. ACM Comput. Surv. 34, 3 (Sept.
2002), 313–356. hps://doi.org/10.1145/568522.568523
[10] Athena Elafrou, Vasileios Karakasis, eodoros Gkountouvas, Kornilios Kourtis, Georgios Goumas, and Nectarios
Koziris. 2018. SparseX: A Library for High-Performance Sparse Matrix-Vector Multiplication on Multicore Platforms.
ACM Trans. Math. Sow. 44, 3, Article 26 (Jan. 2018), 32 pages. hps://doi.org/10.1145/3134442
[11] D. J. Evans. 1984. Parallel S.O.R. Iterative Methods. Parallel Comput. 1, 1 (Aug. 1984), 3–18. hps://doi.org/10.1016/
S0167-8191(84)90380-6
[12] Fujitsu. 2019. Retrieved 2019/05/14 from hps://www.fujitsu.com/global/Images/post-k supercomputer with fujitsu%
27s original cpu a64fx powered by arm isa.pdf
[13] Martin Galgon, Lukas Kra¨mer, Jonas ies, Achim Basermann, and Bruno Lang. 2015. On the Parallel Iterative Solution
of Linear Systems Arising in the FEAST Algorithm for Computing Inner Eigenvalues. Parallel Comput. 49, C (Nov.
2015), 153–163. hps://doi.org/10.1016/j.parco.2015.06.005
[14] Assefaw Hadish Gebremedhin and Fredrik Manne. 2000. Scalable parallel graph coloring algorithms. Concurrency:
Practice and Experience 12, 12 (2000), 1131–1146.
[15] Assefaw Hadish Gebremedhin, Fredrik Manne, and Alex Pothen. 2002. Parallel Distance-k Coloring Algorithms for
Numerical Optimization. In Proceedings of the 8th International Euro-Par Conference on Parallel Processing (Euro-Par
’02). Springer-Verlag, London, UK, UK, 912–921. hp://dl.acm.org/citation.cfm?id=646667.699892
[16] Assefaw H. Gebremedhin, Duc Nguyen, Md. Mostofa Ali Patwary, and Alex Pothen. 2013. ColPack: Soware for
Graph Coloring and Related Problems in Scientic Computing. ACM Trans. Math. Sow. 40, 1, Article 1 (Oct. 2013),
31 pages. hps://doi.org/10.1145/2513109.2513110 Soware last accessed on 2019/02/25.
[17] T. Gkountouvas, V. Karakasis, K. Kourtis, G. Goumas, and N. Koziris. 2013. Improving the Performance of the Symmetric
Sparse Matrix-Vector Multiplication in Multicore. In 2013 IEEE 27th International Symposium on Parallel and Distributed
Processing. 273–283. hps://doi.org/10.1109/IPDPS.2013.43
[18] W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. 2000. Towards Realistic Performance Bounds for Implicit
CFD Codes. In Parallel Computational Fluid Dynamics 1999, D. Keyes, J. Periaux, A. Ecer, N. Satofuka, and P. Fox (Eds.).
Elsevier, 241–248. hps://doi.org/10.1016/B978-044482851-4.50030-X
[19] Eun-Jin Im, Katherine Yelick, and Richard Vuduc. 2004. Sparsity: Optimization Framework for Sparse Matrix Kernels.
e International Journal of High Performance Computing Applications 18, 1 (2004), 135–158. hps://doi.org/10.1177/
1094342004041296 arXiv:hps://doi.org/10.1177/1094342004041296
[20] Intel. 2019. Intel Math Kernel Library. hps://soware.intel.com/en-us/mkl
[21] Takeshi Iwashita, Hiroshi Nakashima, and Yasuhito Takahashi. 2012. Algebraic Block Multi-Color Ordering Method
for Parallel Multi-readed Sparse Triangular Solver in ICCG Method. In Proceedings of the 2012 IEEE 26th International
Parallel and Distributed Processing Symposium (IPDPS ’12). IEEE Computer Society, Washington, DC, USA, 474–483.
hps://doi.org/10.1109/IPDPS.2012.51
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:35
[22] Mark T. Jones and Paul E. Plassmann. 1994. Scalable Iterative Solution of Sparse Linear Systems. Parallel Comput. 20,
5 (May 1994), 753–773. hps://doi.org/10.1016/0167-8191(94)90004-3
[23] George Karypis and Vipin Kumar. 1998. A Fast and High ality Multilevel Scheme for Partitioning Irregular Graphs.
SIAM Journal on Scientic Computing 20, 1 (1998), 359–392. hps://doi.org/10.1137/S1064827595287997
[24] Kokkos. 2018. Kokkos C++ Performance Portability Programming EcoSystem: e Programming Model - Parallel
Execution and Memory Abstraction. hp://trilinos.sandia.gov/packages/kokkos
[25] Moritz Kreutzer, Dominik Ernst, Alan R. Bishop, Holger Fehske, Georg Hager, Kengo Nakajima, and Gerhard Wellein.
2018. Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs. In High Performance Computing,
Rio Yokota, Miche`le Weiland, David Keyes, and Carsten Trinitis (Eds.). Springer International Publishing, Cham,
329–349.
[26] Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R. Bishop. 2014. A Unied Sparse Matrix
Data Format for Ecient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units.
SIAM Journal on Scientic Computing 36, 5 (2014), C401–C423. hps://doi.org/10.1137/130930352
[27] M. Krotkiewski and M. Dabrowski. 2010. Parallel Symmetric Sparse Matrix-vector Product on Scalar Multi-core CPUs.
Parallel Comput. 36, 4 (April 2010), 181–198. hps://doi.org/10.1016/j.parco.2010.02.003
[28] C. Y. Lee. 1961. An Algorithm for Path Connections and Its Applications. IRE Transactions on Electronic Computers
EC-10, 3 (Sept 1961), 346–365. hps://doi.org/10.1109/TEC.1961.5219222
[29] Ang Li, Weifeng Liu, Mads R. B. Kristensen, Brian Vinter, Hao Wang, Kaixi Hou, Andres Marquez, and Shuaiwen Leon
Song. 2017. Exploring and Analyzing the Real Impact of Modern On-package Memory on HPC Scientic Kernels. In
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’17).
ACM, New York, NY, USA, Article 26, 14 pages. hps://doi.org/10.1145/3126908.3126931
[30] Weifeng Liu and Brian Vinter. 2015. CSR5: An Ecient Storage Format for Cross-Platform Sparse Matrix-Vector
Multiplication. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS ’15). ACM, New
York, NY, USA, 339–350. hps://doi.org/10.1145/2751205.2751209
[31] Weifeng Liu and Brian Vinter. 2015. Speculative Segmented Sum for Sparse Matrix-vector Multiplication on Heteroge-
neous Processors. Parallel Comput. 49, C (Nov. 2015), 179–193. hps://doi.org/10.1016/j.parco.2015.04.004
[32] Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. 2013. Ecient Sparse Matrix-vector Multiplication
on x86-based Many-core Processors. In Proceedings of the 27th International ACM Conference on International Conference
on Supercomputing (ICS ’13). ACM, New York, NY, USA, 273–282. hps://doi.org/10.1145/2464996.2465013
[33] H. Lu, M. Halappanavar, D. Chavarra-Miranda, A. H. Gebremedhin, A. Panyala, and A. Kalyanaraman. 2017. Algorithms
for Balanced Graph Colorings with Applications in Parallel Computing. IEEE Transactions on Parallel and Distributed
Systems 28, 5 (May 2017), 1240–1256. hps://doi.org/10.1109/TPDS.2016.2620142
[34] Michele Martone. 2014. Ecient Multithreaded Untransposed, Transposed or Symmetric Sparse Matrix-vector
Multiplication with the Recursive Sparse Blocks Format. Parallel Comput. 40, 7 (July 2014), 251–270. hps://doi.org/
10.1016/j.parco.2014.03.008
[35] James Mceen, Marina Meila˘, Jacob VanderPlas, and Zhongyue Zhang. 2016. Megaman: Scalable Manifold Learning
in Python. Journal of Machine Learning Research 17, 148 (2016), 1–5. hp://jmlr.org/papers/v17/16-109.html
[36] P. Mironowicz, A. Dziekonski, and M. Mrozowski. 2015. A Task-Scheduling Approach for Ecient Sparse Symmetric
Matrix-Vector Multiplication on a GPU. SIAM Journal on Scientic Computing 37, 6 (2015), C643–C666. hps:
//doi.org/10.1137/14097135X
[37] Chao-Wei Ou and Sanjay Ranka. 1997. Parallel Incremental Graph Partitioning. IEEE Trans. Parallel Distrib. Syst. 8, 8
(Aug. 1997), 884–896. hps://doi.org/10.1109/71.605773
[38] Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram, and Pradeep Dubey. 2014. Sparsifying Synchronization
for High-Performance Shared-Memory Sparse Triangular Solver. In Proceedings of the 29th International Conference
on Supercomputing - Volume 8488 (ISC 2014). Springer-Verlag New York, Inc., New York, NY, USA, 124–140. hps:
//doi.org/10.1007/978-3-319-07518-1 8
[39] J. Park, M. Smelyanskiy, K. Vaidyanathan, A. Heinecke, D. D. Kalamkar, X. Liu, M. M. A. Patwary, Y. Lu, and P.
Dubey. 2014. Ecient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its
Application to Unstructured Matrices. In SC14: International Conference for High Performance Computing, Networking,
Storage and Analysis. 945–955. hps://doi.org/10.1109/SC.2014.82
[40] Y. Saad. 2003. Iterative Methods for Sparse Linear Systems (second ed.). Society for Industrial and Applied Mathematics.
hps://doi.org/10.1137/1.9780898718003 arXiv:hps://epubs.siam.org/doi/pdf/10.1137/1.9780898718003
[41] Toby Simpson, Dimosthenis Pasadakis, Drosos Kourounis, Kohei Fujita, Takuma Yamaguchi, Tsuyoshi Ichimura,
and Olaf Schenk. 2018. Balanced Graph Partition Renement Using the Graph p-Laplacian. In Proceedings of the
Platform for Advanced Scientic Computing Conference (PASC ’18). ACM, New York, NY, USA, Article 8, 11 pages.
hps://doi.org/10.1145/3218176.3218232
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:36 Christie Alappat et al.
[42] SpMP Development Team. [n.d.]. Sparse matrix pre-processing library. Retrieved 2019/02/25 from hps://github.
com/IntelLabs/SpMP
[43] S. Toledo. 1997. Improving the Memory-system Performance of Sparse-matrix Vector Multiplication. IBM J. Res. Dev.
41, 6 (Nov. 1997), 711–726. hps://doi.org/10.1147/rd.416.0711
[44] J. Treibig, G. Hager, and G. Wellein. 2010. LIKWID: A lightweight performance-oriented tool suite for x86 multicore
environments. (2010).
[45] Ulrike von Luxburg. 2007. A tutorial on spectral clustering. Statistics and Computing 17, 4 (01 Dec 2007), 395–416.
hps://doi.org/10.1007/s11222-007-9033-z
[46] Richard Vuduc, James W Demmel, and Katherine A Yelick. 2005. OSKI: A library of automatically tuned sparse matrix
kernels. Journal of Physics: Conference Series 16, 1 (2005), 521. hp://stacks.iop.org/1742-6596/16/i=1/a=071
[47] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2009. Optimization
of Sparse Matrix-vector Multiplication on Emerging Multicore Platforms. Parallel Comput. 35, 3 (March 2009), 178–194.
hps://doi.org/10.1016/j.parco.2008.12.006
[48] Samuel Williams, Andrew Waterman, and David Paerson. 2009. Rooine: An Insightful Visual Performance Model
for Multicore Architectures. Commun. ACM 52, 4 (April 2009), 65–76. hps://doi.org/10.1145/1498765.1498785
[49] A. Yzelman and R. Bisseling. 2009. Cache-Oblivious Sparse Matrix-Vector Multiplication by Using Sparse Matrix
Partitioning Methods. SIAM Journal on Scientic Computing 31, 4 (2009), 3128–3154. hps://doi.org/10.1137/080733243
arXiv:hps://doi.org/10.1137/080733243
[50] Albert-Jan Nicholas Yzelman. 2011. Fast sparse matrix-vector multiplication by partitioning and reordering. Ph.D.
Dissertation. Utrecht University, Utrecht.
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:37
A ALGORITHMS
Algorithm 3 Construction of levels
1: inteдer :: root = n % Choose starting node
2: bool :: marked all = f alse % Stopping criteria
3: inteдer :: N = nrows(дraph)
4: inteдer :: distFromRoot[N ] = {−1}
5: inteдer :: curr children[] = {}
6: curr children.push back(root);
7: inteдer :: currLvl = 0
8: while !marked all do
9: marked all = true
10: inteдer :: nxt children[] = {}
11: for i = 1 : size(curr children) do
12: if distFromRoot[curr children[i]] == −1 then
13: distFromRoot[curr children[i]] = currLvl
14: for j in дraph[curr children[i]].children do
15: if distFromRoot[j] == −1 then
16: nxt children.push back(j)
17: end if
18: end for
19: end if
20: end for
21: curr children = nxt children
22: currLvl = currLvl + 1
23: end while
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
1:38 Christie Alappat et al.
Algorithm 4 Load Balancing for two sweep, distance-2, two colors
1: if section 4.3 then
2: inteдer :: nthreads = Nt
3: inteдer :: len = 2 ∗ nthreads % number of level groups
4: inteдer :: worker [len] = 1
5: inteдer :: T ptr [len + 1] = linspace(0,N`, len) % level group pointer
6: else
7: inteдer :: nthreads = nt (Ts−1(i)) % i is the index of level group in stage s − 1
8: % where recursion is applied.
9: inteдer :: len = 2 ∗ nthreads % number of level groups
10: inteдer :: worker [len] = [nt (Ts (0)), ...,nt (Ts (len − 1))] = b
11: inteдer :: T ptr [len + 1] = linspace(0,N`, len)
12: end if
13: bool :: exit = f alse
14: inteдer :: T size[len],absRankIdx[len], rankIdx[len], currRank
15: double :: mean r ,mean b,di f f [len],var ,newVar
16: while !(exit) do
17: T size[:] = update(T ptr [:]) % T size contains nrows in each level group
18: inteдer :: T size worker [:] = T size[:]/worker [:]
19: mean r = sum(T size worker [0 : 2 : len − 1]) / nthreads %mean of red color
20: mean b = sum(T size worker [1 : 2 : len − 1]) / nthreads %mean of blue color
21: di f f [0 : 2 : len − 1] = T size worker [0 : 2 : len − 1]. −mean r
22: di f f [1 : 2 : len − 1] = T size worker [1 : 2 : len − 1]. −mean b
23: var = dot product(di f f ,di f f )/len % overall variance
24: absRankIdx = argsort(-abs(di f f )) % ranking according to absolute deviation
25: rankIdx = argsort(di f f ) % ranking according to signed deviation
26: currRank = 0,newVar = var
27: inteдer :: old T ptr [len + 1] = T ptr [:],acquireIdx ,дiveIdx
28: while newVar ≥ var do
29: T ptr = old T ptr
30: bool :: f ail=true
31: if di f f [absRankIdx[currRank]] < 0 then
32: for el in rankIdx[(len − 1) : −1 : 0] do
33: if (T Ptr [el + 1] −T ptr [el]) > 2 then
34: acquireIdx = el
35: f ail=false
36: break
37: end if
38: end for
39: shi(T ptr ,acquireIdx , currRank) % shis T ptr by 1 from acquireIdx
40: % to currRank if currIdx < acquireIdx else shi by -1
41: else if (T ptr [currRank + 1] −T ptr [currRank]) > 2 then
42: дiveIdx = rankIdx[0]
43: f ail=false
44: shi(T ptr , currRank,дiveIdx )
45: end if
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
SymmSpMV with RACE 1:39
46: if !f ail then
47: newVar = calculate variance(T ptr ) % as seen in Line 17 to Line 23
48: end if
49: if (currRank == (len − 1)) && (newVar ≥ var ) then
50: T Ptr = old T ptr
51: exit = true
52: break
53: end if
54: currRank+ = 1
55: end while
56: end while
, Vol. 1, No. 1, Article 1. Publication date: January xxxx.
