The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations and graph traversal applications. Parallelizing SymmSpMV on today's multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation, which eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead. We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially and behaves in accordance with the roofline model. Outliers are discussed and analyzed in detail. While we focus on SymmSpMV in this paper, our algorithm and software are applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring.
INTRODUCTION AND RELATED WORK
The efficient solution of linear systems or eigenvalue problems involving large sparse matrices has been an active research field in parallel and high performance computing for many decades. Well-known, traditional application areas include quantum physics, quantum chemistry, and engineering.
In recent years, new fields such as social graph analysis [41] or spectral clustering in the context of learning algorithms [35, 45] have further increased the need for hardware-efficient, parallel sparse solvers and/or efficient matrix-free solvers. Assuming sufficiently large problems, the solvers are typically based on iterative subspace methods and may include advanced preconditioning techniques. In many methods, two components, sparse matrix-vector multiplication (SpMV) and coloring techniques, are crucial for hardware efficiency and parallel scalability. Typically, these two components are considered to be orthogonal, i.e., hardware efficiency for SpMV is mainly related to data formats and local structures, while coloring is used to address dependencies in the enclosing iteration scheme. Interestingly, the hardware-efficient parallelization of symmetric SpMV has not attracted much attention over the years, though symmetry is widespread in the application fields.
The SpMV operation is an essential building block in a number of applications such as algebraic multigrid methods, sparse iterative solvers, shortest path algorithms, breadth-first search algorithms, and Markov cluster algorithms, and it is therefore an integral part of numerous scientific algorithms. In the past decades, much research has focused on designing new data structures, efficient algorithms, and parallelization techniques for the SpMV operation. Its performance is typically limited by main memory bandwidth. On cache-based architectures, the main factors that influence performance are spatial access locality to the matrix data and temporal locality when reusing the elements of the vectors involved. To address this problem, over the last two decades a plethora of partitioning techniques and data structures to improve SpMV on cache-based architectures have been suggested, including cache-oblivious methods using hypergraph partitioning. One of the first studies on temporal locality optimizations was done by Toledo [43], who investigated Cuthill-McKee (CM) ordering techniques on three-dimensional finite-element test matrices when used in combination with blocking into small dense blocks. Various authors [19, 47] used advanced data storage formats and techniques such as register and cache blocking for SpMV by splitting the matrix into several smaller p × q sparse submatrices and presented an analytic cache-aware model to determine the optimal block size. These algorithms are, e.g., included in OSKI [46], a collection of low-level primitives of tuned sparse kernels for modern cache-based superscalar machines. Kreutzer et al. [26] and Xing et al. [32] used techniques to improve SIMD efficiency and performance on many-core and GPU architectures. Recent work can be found, e.g., in [29-31]. Previous work on SpMV has also focused on reducing communication volume for distributed-memory parallelization, often by using variants of graph or hypergraph partitioning techniques [6]. Yzelman and Bisseling [49, 50] extended hypergraph partitioning techniques in a cache-oblivious method, permuting rows and columns of the input matrix using a recursive hypergraph-based sparse matrix partitioning scheme so that the resulting matrix exhibits cache-friendly behavior during the SpMV.
Despite SpMV being a bandwidth-limited operation, not much work has been done to exploit the symmetry of symmetric matrices to reduce storage requirements and data transfers by using only the upper/lower triangular part of the matrix. The major challenge here is to resolve the potential write conflicts of explicit symmetric sparse matrix-vector multiplication (SymmSpMV) kernels in parallel processing.
There are general solutions for such problems, like lock-based methods and thread-private target arrays [10, 17, 27, 36]. However, they have in common that their overhead may increase with the degree of parallelism. Another recent research direction is the use of specialized storage formats like CSB [4], RSB [34], and CSX [10], combined with bitmasked register blocking techniques as in [5]. As pointed out by [31], these approaches have drawbacks like missing backward compatibility and matrix conversion costs. Due to these problems only very few standard libraries, like Intel MKL [20], support primitives for an efficient SymmSpMV operation. Another potential way of tackling this inherent data dependency problem is using a distance-2 coloring of the underlying undirected graph, which has not been investigated so far to the best of our knowledge.
Multicoloring (MC) reordering to tackle data dependencies is a very well established strategy for parallelizing iterative solvers. As it is applied to the underlying graph, it is not bound to a specific data format and may use existing highly optimized (serial) kernels, i.e., it is orthogonal to general code optimization strategies. Prominent examples for MC in iterative solvers are Gauss-Seidel, incomplete Cholesky factorization, and the Kaczmarz method [11, 13, 22], where typically a distance-1 or distance-2 coloring is applied subject to the underlying dependencies of the iterative scheme. However, coloring changes the evaluation order of the original solver and may lead to worse convergence rates. This is different when using MC methods for the parallelization of SymmSpMV, where we only need to ensure that entries of the target vector are not written in parallel. Here we do not require a strict serial ordering to obtain the same result as in serial processing. In terms of hardware utilization, long-standing MC methods often generate colorings which lack efficiency on modern cache-based processors. Studies have been made to increase their performance and improve the inherent heuristics; an overview of the methods can be found in [14-16, 33]. However, for irregular and/or large sparse matrices MC may lead to load imbalance, frequent global synchronization, and loss of data locality, severely reducing (single-node) performance. These problems typically become more severe for higher order distance colorings and larger matrices. The algebraic block multicoloring (ABMC) [21] proposed by Iwashita et al. in 2012 addresses some of these issues as it tries to increase data locality by applying graph partitioning (blocking) before coloring. Beyond the quality of the actual coloring, the time to generate it is also critical, especially for very large problems. Here, widely used and publicly available coloring packages such as COLPACK [16], Kokkos [24], and ZOLTAN [2, 3] speed up the coloring process itself by parallelization and other heuristics.
Design and implementation of hardware-efficient computational kernels can be supported by a structured performance engineering process based on white-box models. On the processor/node level, the most prominent model is the roofline model [48]. Its basic applicability as a reasonable light-speed estimate for SpMV was demonstrated already in [18], including an extension to sparse matrix multiple-vector multiplication. The SpMV performance model was refined in [26] with a focus on modeling the performance impact of irregular accesses to the right-hand side (RHS) vector. It has been successfully used to model the performance of SpMV kernels on CPUs and GPGPUs [26] and of augmented sparse matrix multiple-vector kernels for Chebyshev filter diagonalization [25]. However, there is no extension towards explicit SymmSpMV, which shows increased computational intensity and irregular accesses to both involved vectors. Typically the expectation is that SymmSpMV should be approximately twice as fast as SpMV, as only half of the matrix information needs to be stored and accessed.
Finally, there is a clear hardware trend towards processors with advanced vector-style processing, higher core counts, and more complex cache hierarchies. Also, attainable bandwidth may increase even for "standard" CPU-based systems through the use of high-bandwidth memory solutions, at the cost of very restricted memory sizes. A first step in this direction was the Intel Xeon Knights Landing processor. The specification of the ARM-based Fujitsu A64FX processor (to be used in the Post-K computer) may provide another blueprint for future processor configurations [12]: a 48-core processor supporting 512-bit SIMD execution units on top of 32 GiB of HBM2 main memory, which provides a bandwidth of 1 TB/s. It is obvious that such hardware trends call for revisiting existing, time-critical components in simulation codes both in terms of scalability and hardware efficiency. Moreover, the potential of SymmSpMV to substantially reduce the memory footprint of sparse solvers needs to be exploited to meet the constraint of very limited memory space.
Contribution and Outline
This paper addresses the general problem of generating hardware-efficient distance-k colorings of undirected graphs for modern multicore processors. As an application we choose the parallelization of the SymmSpMV operation. We cover thread-level parallelization and focus on a single multicore processor. The main contributions can be summarized as follows:
• A new recursive algebraic coloring scheme (RACE) is proposed, which generates hardware-efficient distance-k colorings of undirected graphs. Special emphasis in the design of RACE is put on achieving data locality, generating levels of parallelism matching the core count of the underlying multicore processor, and load balancing for shared memory parallelization.
• We propose a shared memory parallelization of SymmSpMV using a distance-2 coloring of the underlying undirected graph to avoid write conflicts, and apply RACE to generate the colorings.
• A comprehensive performance study of shared memory parallel SymmSpMV using RACE demonstrates the benefit of our approach. Performance modeling is deployed to substantiate our performance measurements, and a comparison to existing coloring methods as well as a vendor-optimized library (Intel MKL) is presented. The broad applicability and sustainability are validated by using a wide set of 31 test matrices and two very different generations of Intel Xeon processors.
• We extend the existing proven SpMV performance modeling approach to the SymmSpMV kernel. In the course of the performance analysis we further demonstrate why in some cases the ideal speedup may not be achievable.
We have implemented our graph coloring algorithms in the open-source library recursive algebraic coloring engine (RACE). 1 Information required to reproduce the performance numbers provided in this paper is also available. 2 This paper is organized as follows: Our software and hardware environment as well as the benchmark matrices are introduced in Section 2. In Section 3 we describe the properties of the SpMV and SymmSpMV kernels, including roofline performance limits, and motivate the need for an advanced coloring scheme. In Section 4 we detail the steps of the RACE algorithm via an artificial stencil matrix and show how recursive level group construction and coloring can be leveraged to exploit a desired level of parallelism for distance-k dependencies. The interaction between the parameters of the method and their impact on the parallel efficiency is studied in Section 5. Section 6 presents performance data for SymmSpMV for a wide range of matrices on two different multicore systems, comparing RACE with ABMC and MC as well as Intel MKL, and also shows the efficiency of RACE as defined by the roofline model yardstick. Section 7 concludes the paper and gives an outlook to future work.
HARDWARE AND SOFTWARE ENVIRONMENT

Hardware test bed
We conducted all benchmarks on a single CPU socket from Intel's Ivy Bridge EP and Skylake SP families, respectively, since these represent the oldest and the latest Intel architectures in active use within the scientific community at the time of writing:
• The Intel Ivy Bridge EP architecture belongs to the class of "classic" designs with three inclusive cache levels. While the L1 and L2 caches are private to each core, the L3 cache is shared but scalable in terms of bandwidth. The processor supports the AVX instruction set extension, which is capable of 256-bit wide SIMD execution.
• Contrary to its predecessors, the Intel Skylake SP architecture has a shared but noninclusive victim L3 cache and much larger private L2 caches. The model we use in this work supports AVX-512, which features 512-bit wide SIMD execution.

(Fig. 1, caption fragment: attainable bandwidth versus data set size; see Table 1 for the load-only and copy benchmarks. The benchmarks were performed on the full socket using the likwid-bench tool. The gray vertical lines correspond to the positions of matrices that might show caching effects; see Section 2.3. Note the logarithmic scales.)
Architectural details along with the attainable memory bandwidths are given in Table 1. All measurements were made with CPU clock speeds fixed at the indicated base frequencies. Note that for the Skylake SP architecture the clock frequency is scaled down internally to 2.2 GHz when using multicore support and the AVX-512 instruction set; however, this is of minor importance for the algorithms discussed here.
As the attainable main memory bandwidth is the input parameter to the roofline model used later, we have carefully measured this value as a function of the data set size for two access patterns (copy and load-only). The data presented in Figure 1 basically show the characteristic performance drop when the data set size is too large to fit into the last level cache (LLC), which is an L3 cache on both architectures (cf. Table 1 for the actual sizes). Interestingly, there is no sharp drop at the exact size of the LLC but rather a steady performance decrease, with enhanced data access rates also for data set sizes up to twice the LLC size on Ivy Bridge EP. For Skylake SP this effect is even more pronounced, as the noninclusive victim L3 cache only stores data which are not in the L2 caches; thus the cache size available to an application may be the aggregate size of the L2 and L3 caches on this architecture. The final bandwidth for the roofline model is chosen as the asymptotic value depicted in Figure 1. Of course, caching effects are extremely sensitive to the data access pattern, and thus the values presented here only provide simple upper bounds for the SymmSpMV kernel with its potentially strongly irregular data access.
External Tools and Software
The LIKWID [44] tool suite in version 4.3.2 was used, specifically likwid-bench for bandwidth benchmarks (see Table 1) and likwid-perfctr for counting hardware events and measuring derived metrics. LIKWID validates the quality of its performance metrics, and the validation data is publicly available. 3 Overall, the LIKWID data traffic measurements can be considered highly accurate. Only the L3 data traffic measurement on Skylake SP fails the quantitative validation, but it still provides good qualitative results. 4 For coloring we used the COLPACK [16] library. All code was compiled with the Intel compiler in version 19.0.2 and the following compiler flags: -fno-alias -xHost -O3 for Ivy Bridge EP and -fno-alias -xCORE-AVX512 -O3 for Skylake SP.
Benchmark Matrices
Most test matrices were taken from the SuiteSparse Matrix Collection (formerly University of Florida Sparse Matrix Collection) [8], combining sets from two related papers [34, 38], which allows the reader to make a straightforward comparison of results. We also added some matrices from the Scalable Matrix Collection (ScaMaC) library [1], which allows for scalable generation of large matrices related to quantum physics applications. A brief description of the background of these matrices can be found in the ScaMaC documentation. 5 All matrices considered are real, although our underlying software would also support complex matrices. As mentioned before, we restrict ourselves to matrices representing fully connected undirected graphs. Table 2 gives an overview of the most important matrix properties, like the number of rows (N_r), the total number of nonzeros (N_nz), the average number of nonzeros per row (N_nzr), and the bandwidth of the matrix without (bw) and with (bw_RCM) RCM preprocessing.
Due to the extended effective cache size seen in Figure 1, it might happen that some of the matrices attain higher effective bandwidths due to partial or full caching, especially on Skylake SP. The ten potential candidates for the Skylake SP chip in terms of symmetric and full storage (< 128 MB) are marked with an asterisk in Table 2, while only two among these (offshore and parabolic_fem) satisfy the criterion for Ivy Bridge EP (< 40 MB). The corresponding data set sizes for storing the upper triangular part of these matrices have been labeled in Figure 1.
KERNELS
Table 2. Details of the benchmark matrices. N_r is the number of matrix rows and N_nz is the number of nonzeros. N_nzr = N_nz/N_r is the average number of nonzeros per row. bw and bw_RCM refer to the matrix bandwidth without and with RCM preprocessing. The letter "C" in parentheses after a matrix name indicates a corner case matrix that will be discussed in detail, while the letter "Q" marks a matrix from quantum physics that is not part of the SuiteSparse Matrix Collection. With an asterisk (*) we have labeled all matrices that occupy less than 128 MB, which could potentially lead to caching effects, especially on the Skylake SP architecture.

We evaluate our methods by parallelization of the SymmSpMV kernel using distance-2 coloring, which avoids concurrent updates of the same vector entries by different threads. Since the kernel is closely related to the sparse matrix-vector multiplication (SpMV) kernel by structure and computational intensity, we start with a discussion of SpMV and extend it towards SymmSpMV later. In all cases the aim is to derive realistic upper performance bounds, which can be estimated once the computational intensity I and the main memory bandwidth (b_S; see Table 1) are known [48], i.e.,

P = I × b_S. (1)
Since b_S depends on the ratio of load to store streams, we present the model for both the upper bound (load-only) and the lower bound (copy) bandwidth cases. In the following we choose the compressed row storage (CRS) format for the implementation of SpMV as well as SymmSpMV and assume symmetric matrices.
SpMV
A baseline SpMV kernel is presented in Algorithm 1. It has no loop-carried dependencies, so parallelization of the outer loop using, e.g., OpenMP, is straightforward. Following the discussion in [26], its computational intensity is

I_SpMV = 2 / (8 + 4 + 8α + 20/N_nzr) F/B = 2 / (12 + 8α + 20/N_nzr) F/B. (2)
Here we assume that the matrix data (A[], col[]), the left-hand side (LHS) vector (b[]), and the row pointer information (rowPtr[]) are loaded only once from main memory, since these data structures are accessed consecutively. The intensity is calculated from the average cost of performing all computations required for one nonzero element of the matrix. Thus, contributions which are independent of the inner (short) loop are rescaled by N_nzr, which is the average number of nonzeros per row (i.e., the average length of the inner loop). The 8α term quantifies the data traffic caused by accessing the RHS vector (x[]). The value of α depends on the matrix structure as well as on the RHS vector data set size and the available cache size. The minimum value of α = 1/N_nzr is attained if the RHS vector is only loaded once from main memory to the cache and all subsequent accesses in the same SpMV are cache hits. This limit is typically observed for matrices with low bandwidth (high access locality) or if the cache is large enough to hold the full RHS data during one SpMV. The actual value of α can be determined experimentally by measuring the data traffic when executing the SpMV; see [26] for more details. (In [26] the traffic for the row pointer was not accounted for, i.e., the denominator in (2) is larger by 4/N_nzr bytes; this error is only significant when N_nzr is small.)

The optimal value α = 1/N_nzr together with the corresponding computational intensities for all matrices is shown in Table 3. The measured α_SpMV is used as a sensible lower bound for α_SymmSpMV (see Section 3.2) in cases where advanced cache replacement strategies do not apply; therefore the table also presents the corresponding measured α_SpMV (= assumed α_SymmSpMV) values for the different matrices. Choosing the matrices 10, 22, and 31, which have approximately the same optimal α_SpMV, one can study the delicate influence of matrix structure (i.e., matrix bandwidth; see Table 2) and the cache size on the actual data traffic, i.e., the measured values of α_SpMV. For most of the ten candidate matrices on the Skylake SP architecture that could potentially show a caching effect (see Table 2) we observe the measured α_SpMV to be lower than optimal. In this case
we set their α_SymmSpMV values to the optimal α value of SymmSpMV, which will be defined in the following Section 3.2. These cases are marked with an asterisk in Table 3.
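For reference, a minimal C/OpenMP version of the CRS SpMV kernel of Algorithm 1 could look as follows. This is a sketch only; the array names (A, col, rowPtr, x, b) follow the notation used above, and 0-based CRS indexing is assumed.

// Sketch of Algorithm 1: CRS SpMV, b = A*x.
// rowPtr has Nr+1 entries; col[] and A[] hold the column indices and values of all nonzeros.
void spmv_crs(int Nr, const int *rowPtr, const int *col,
              const double *A, const double *x, double *b)
{
    #pragma omp parallel for schedule(static)
    for (int row = 0; row < Nr; ++row) {
        double tmp = 0.0;
        for (int idx = rowPtr[row]; idx < rowPtr[row + 1]; ++idx) {
            tmp += A[idx] * x[col[idx]];   // 2 flops per nonzero
        }
        b[row] = tmp;
    }
}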
SymmSpMV
SymmSpMV exploits the symmetry of the matrix (A_ij = A_ji) to reduce the storage size for the matrix data and to reduce the overall memory traffic by operating on the upper (or lower) half of the matrix only. Thus, for every nonzero matrix entry we need to update two entries in the LHS vector (b[]), as shown in Algorithm 2.
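A serial sketch of Algorithm 2 is shown below. The explicit diagonal check is one possible way to avoid double counting of diagonal entries; the actual RACE kernel may handle the diagonal differently. The result vector b is assumed to be zero-initialized.

// Sketch of Algorithm 2: SymmSpMV, b = A*x with only the upper triangle
// (including the diagonal) of the symmetric matrix A stored in CRS.
// Serial version; the indirect update b[col] is what a parallel version
// must protect, e.g., via distance-2 coloring. b[] must be zero on entry.
void symmspmv_crs(int Nr, const int *rowPtr, const int *col,
                  const double *A, const double *x, double *b)
{
    for (int row = 0; row < Nr; ++row) {
        double tmp = 0.0;
        for (int idx = rowPtr[row]; idx < rowPtr[row + 1]; ++idx) {
            tmp += A[idx] * x[col[idx]];           // contribution of A(row,col) to b[row]
            if (col[idx] != row) {                 // skip diagonal to avoid double counting
                b[col[idx]] += A[idx] * x[row];    // contribution of A(col,row) to b[col]
            }
        }
        b[row] += tmp;
    }
}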
In line with the discussion above, the computational intensity of SymmSpMV is

I_SymmSpMV = 4 / (8 + 4 + 24α + 4/N^symm_nzr) F/B = 4 / (12 + 24α + 4/N^symm_nzr) F/B, (3)

where N^symm_nzr is the average number of nonzeros per row in the upper triangular part of the matrix. For a given nonzero matrix element 4 flops are performed, which is twice the amount of work compared to SpMV. In addition we have indirect access to the LHS vector (read and write), which triples the traffic contribution quantified by α. The only term scaled with N^symm_nzr is the row pointer. The most optimistic value of α in this case, α^opt_SymmSpMV = 1/N^symm_nzr, corresponds to a one-time transfer of the LHS and RHS vectors. Note that the α for SpMV and SymmSpMV may be different even for the same matrix and the same compute device, as in the latter case the two vectors are accessed irregularly and compete for cache space.
Thus we can assume that the α value measured for SpMV (α_SpMV) is a lower bound for SymmSpMV. Table 3 shows the assumed α_SymmSpMV values taken for performance modeling. Since an upper bound for the performance is the product of computational intensity and main memory bandwidth (see Equation (1)), this approach provides an upper performance bound for SymmSpMV. However, note that the performance models derived for matrices having caching effects (see Table 3) need not be strict upper bounds, as they heavily depend on the caching strategy of the underlying architecture.
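To make the use of Equations (1) and (3) concrete, a small helper that evaluates the SymmSpMV roofline bound for given α, N^symm_nzr, and bandwidth might look like this. The coefficients follow the intensity expression given above, and the numbers in the example call are purely illustrative (not taken from Table 3).

#include <stdio.h>

// Sketch: SymmSpMV roofline estimate P = I * bS (Eq. (1)) with
// I = 4 / (12 + 24*alpha + 4/nnzr_symm) F/B (Eq. (3) as given above).
// alpha quantifies the per-nonzero vector traffic, bS_GBs is the attainable
// memory bandwidth in GB/s; the result is in GF/s.
static double symmspmv_roofline(double alpha, double nnzr_symm, double bS_GBs)
{
    double intensity = 4.0 / (12.0 + 24.0 * alpha + 4.0 / nnzr_symm); // F/B
    return intensity * bS_GBs;                                        // GF/s
}

int main(void)
{
    // Illustrative parameters only.
    printf("P_max = %.2f GF/s\n", symmspmv_roofline(0.05, 30.0, 100.0));
    return 0;
}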
Comparing Equation (3) and Equation (2) it is obvious that the perfect speedup of 2× when using SymmSpMV instead of SpMV is only attainable in the limit of small α. Considering the large prefactor of the α contribution, any implementation of SymmSpMV must aim at ensuring high data locality. The indirect update of the LHS also has a large impact on the parallelization strategy, as two rows which have a nonzero in the same column cannot be computed in parallel. In a graph-based approach to this problem, this is equivalent to the constraint that only vertices which have at least distance two can be computed in parallel.

3.3 Analysis of the SymmSpMV kernel using parallel coloring schemes for the Spin-26 matrix

As pointed out previously, the parallelization of the SymmSpMV kernel can be done via distance-2 coloring of the corresponding graph. The computational intensity, and hence the performance, depends on the data access patterns to the LHS and RHS vectors. As coloring schemes change those patterns, they may change the computational intensity, and we have to investigate this effect in more detail. We apply the basic MC scheme generated by COLPACK [16] to parallelize SymmSpMV and compare it with SpMV, which serves as our performance yardstick. Note that any required preprocessing is excluded from the timings. In Figure 2 we present performance and data transfer volumes for the Spin-26 matrix on a single socket of the Ivy Bridge EP and Skylake SP systems. For SpMV we recover the well-known memory bandwidth saturation pattern as we fill the chip. The measured data traffic corresponds to the denominator of I_SpMV in Equation (2), so we can determine α_SpMV = 0.351 for Ivy Bridge EP and 0.367 for Skylake SP, and thus calculate an optimistic bound for the intensity of SymmSpMV according to Equation (3). Using the copy and load-only bandwidths of Ivy Bridge EP (see Table 1) in Equation (1) we find a maximum attainable SymmSpMV performance range for this matrix of P_SymmSpMV = 7.63 to 8.96 GF/s, while for Skylake SP we expect P_SymmSpMV = 19.49 to 21.55 GF/s. This indicates a possible speedup of approximately 1.4× to 1.6× compared to the SpMV baseline (5.5 GF/s and 13.41 GF/s on Ivy Bridge EP and Skylake SP). However, the SymmSpMV implementation using MC falls short of this expectation and is more than three times slower than SpMV.
The reason for this decrease is the nature of the MC permutation. For distance-2 coloring, sets of structurally orthogonal rows have to be determined [15], i.e., rows that do not overlap in any column entry. These sets are referred to as colors. Figure 3 shows the corresponding permutation and the obtained sets of colors applied to a toy problem with high data locality. Different rows of the same color can be executed in parallel, but colors are operated on one after the other. After MC a color may contain rows from very distant parts of the matrix, potentially destroying data locality. Assuming that the LLC can hold a maximum of six elements, we find that the RHS vector must be loaded from main memory multiple times during a single SymmSpMV. Of course this effect strongly depends on the matrix structure, the matrix size, and the cache size.

ABMC [21] tries to preserve data locality by first partitioning the entire matrix into blocks of a specified size and then applying MC to these blocks. Threads then work in parallel on blocks of the same color. Along the lines of [39] we use METIS [23] to partition the matrix into blocks, and COLPACK for MC. The block size for ABMC is determined by a parameter scan (range 4...128; see [21]). As stated above, the timing for the performance measurements excludes preprocessing and the parameter search. This method reduces the data traffic (see Figures 2(b) and 2(d)) as there is better data locality within a block. Consequently, the performance improves over plain MC (see Figures 2(a) and 2(c)). However, we are still far off the performance model prediction. In addition to data locality, other factors like global synchronization and false sharing also contribute to this shortfall.
These effects strongly depend on the number of colors and in general increase with the chromatic number. In the case of the Spin-26 matrix the overhead of synchronization is roughly 10% for the MC method. For most of the matrices considered in this work one can also observe a strong positive correlation between false sharing and the number of threads, caused by the indirect writes in the SymmSpMV kernel.
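The color-wise execution scheme that MC and ABMC share can be sketched as follows. The colorPtr array (row ranges of each color after permutation) is a hypothetical data layout chosen for illustration; COLPACK's actual interface differs.

// Sketch of color-wise parallel SymmSpMV execution after multicoloring.
// colorPtr[c] .. colorPtr[c+1]-1 are the (permuted) rows belonging to color c;
// rows of one color are structurally orthogonal, so the indirect updates
// b[col] cannot collide within a color. b[] must be zero on entry.
void symmspmv_mc(int nColors, const int *colorPtr,
                 const int *rowPtr, const int *col,
                 const double *A, const double *x, double *b)
{
    for (int c = 0; c < nColors; ++c) {             // colors are processed one after the other
        #pragma omp parallel for schedule(static)   // implicit barrier at the end of each color
        for (int row = colorPtr[c]; row < colorPtr[c + 1]; ++row) {
            double tmp = 0.0;
            for (int idx = rowPtr[row]; idx < rowPtr[row + 1]; ++idx) {
                tmp += A[idx] * x[col[idx]];
                if (col[idx] != row)
                    b[col[idx]] += A[idx] * x[row];
            }
            b[row] += tmp;
        }
    }
}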
RECURSIVE ALGEBRAIC COLORING ENGINE (RACE)
Our advanced coloring algorithm is based on three steps:
In the first step we apply a bandwidth reduction algorithm including level construction and matrix reordering. We then use the information from the level construction step to form subsets of levels which allow for a hardware-efficient distance-k coloring of the graph. Finally we present a concept to ensure load balancing between threads. These steps are applied recursively if required.
To illustrate the method we choose a simple matrix which is associated with an artificially constructed two-dimensional stencil, as shown in Figure 4. (The stencil structure was chosen for illustration purposes and does not represent any specific application scenario.)
Definitions
We need the following definitions from graph theory:
• Graph: G = (V, E) represents a graph, with V(G) denoting its set of vertices and E(G) denoting its set of edges. Note that we restrict ourselves to irreducible undirected graphs.
• Neighborhood: N(u) is the neighborhood of a vertex u and is defined as N(u) = {v ∈ V(G) : {u, v} ∈ E(G)}.
• kth Neighborhood: N^k(u) of a vertex u is defined as N^k(u) = {v ∈ V(G) : 1 ≤ d(u, v) ≤ k}, where d(u, v) is the number of edges on a shortest path between u and v, i.e., the set of all vertices within distance k of u (excluding u itself).
• Subgraph: In this paper a subgraph H of G specifically refers to the subgraph induced by a vertex set V' ⊆ V(G) and is defined as H = (V', E') with E' = {{u, v} ∈ E(G) : u, v ∈ V'}.
4.1 Level Construction

The first step of RACE is to determine different levels in the graph and to permute the graph data structure accordingly.
This we achieve using well-known bandwidth reduction algorithms such as reverse Cuthill-McKee (RCM) [7] or breadth-first search (BFS) [28]. Although the RCM method is also implemented in RACE, we use the BFS reordering in the following for simpler illustration.
First we choose a root vertex and assign it to the first level, L(0). For i > 0, level L(i) is defined to contain vertices that are in the neighborhood of vertices in L(i − 1) but not in the neighborhood of vertices in L(i − 2) [9], i.e.,

L(i) = N(L(i − 1)) \ (N(L(i − 2)) ∪ L(i − 2)), (5)

with the convention L(j) = ∅ for j < 0.
From Equation (5) one finds that the ith level consists of all vertices that have a minimum distance i from the root node. Algorithm 3 shows how to determine this distance and thus set up the levels L(i). We refer to the total number of levels obtained for a particular graph as N. Figure 5(b) shows the N = 14 levels of our artificial stencil operator, where the index of each vertex is the vertex number and the superscript represents the level number, i.e., a vertex i belonging to level L(l) is written as i^l. (6)
Note that the L(i) are substantially different from the levels used in the "level-scheduling" [40] approach, which applies depth-first search. After the levels have been determined, the matrix is permuted in the order of its levels, such that the vertices in L(i) are stored consecutively and appear before those of L(i + 1). Figure 5 shows the graph (G' = P(G)) of the stencil example after applying this permutation (P) and demonstrates the enhanced spatial locality of the vertices within and between levels (see Figure 5(b)) as compared to the original (lexicographic) numbering (see Figure 5(a)). Up to this point the procedure is the same as BFS (or RCM).
As RACE uses information about the levels for resolving dependencies in the coloring step, we store the index of the entry point of each level in the permuted data structure (of G') in an array level_ptr[0 : N], so that the levels of G' can be identified as

L(i) = {v ∈ V(G') : level_ptr[i] ≤ v < level_ptr[i + 1]}. (7)

The entries of level_ptr for the stencil example are shown in Figure 5(c).
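A compact sketch of the level construction step (Algorithm 3) is given below. It assumes that the CRS arrays describe the full symmetric adjacency (both triangles), a connected graph, and root vertex 0; the RCM variant and the handling of multiple connected components are omitted.

// Sketch of BFS level construction (Algorithm 3).
// dist[v] becomes the level index of vertex v, perm[] the new-to-old permutation
// ordered level by level, and levelPtr[i] the start of level i in perm[].
// Returns the number of levels N.
int build_levels(int Nr, const int *rowPtr, const int *col,
                 int *dist, int *perm, int *levelPtr)
{
    for (int v = 0; v < Nr; ++v) dist[v] = -1;

    int head = 0, tail = 0;
    dist[0] = 0;
    perm[tail++] = 0;                       // perm doubles as the BFS queue

    while (head < tail) {
        int u = perm[head++];
        for (int idx = rowPtr[u]; idx < rowPtr[u + 1]; ++idx) {
            int v = col[idx];
            if (dist[v] == -1) {            // first visit: one level further out
                dist[v] = dist[u] + 1;
                perm[tail++] = v;
            }
        }
    }

    int nLevels = dist[perm[Nr - 1]] + 1;   // BFS order: last vertex has maximum distance
    // Entry points of each level in the permuted ordering.
    levelPtr[0] = 0;
    for (int i = 1, v = 0; i <= nLevels; ++i) {
        while (v < Nr && dist[perm[v]] < i) ++v;
        levelPtr[i] = v;
    }
    return nLevels;
}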
Distance-k coloring
The data structure generated above serves as the basis for our distance-k coloring procedure, as it contains information about the neighborhood relation between the vertices of any two levels. Following the definition in [15], two vertices are called distance-k neighbors if the shortest path connecting them consists of at most k edges.
This implies that u is a distance-k neighbor of v exactly if v ∈ N^k(u). For the undirected graphs used here this relation is symmetric: u is a distance-k neighbor of v if and only if v is a distance-k neighbor of u.
Based on this definition we consider two vertices to be distance-k independent if they are not distance-k neighbors, and two levels are said to be distance-k independent if their vertices are mutually distance-k independent. Thus, levels L(i) and L(i ± (k + j)) of the permuted graph G' are distance-k independent for all j ≥ 1. (8)
Equation (8) implies that if there is a gap of at least one level between any two levels (e.g., L(i) and L(i + 2)), all pairs of vertices between these two levels are distance-1 independent. Similarly, if the gap consists of at least two levels (e.g., L(i) and L(i + 3)), we have distance-2 independent levels, and so on. The definition in Equation (8) offers many choices for forming distance-k independent sets of vertices, which can then be executed in parallel. In Figure 6 we present one example each for distance-1 (Figure 6(a)) and distance-2 (Figure 6(b)) colorings of our stencil example. The distance-1 coloring uses a straightforward approach by assigning two colors to alternating levels, i.e., levels of the same color can be calculated concurrently. In case of distance-2 independence we do not use three colors but rather aggregate two adjacent levels to form a level group (denoted by T(i)) and perform a distance-1 coloring on top of those groups. This guarantees that vertices of two level groups of the same color are distance-2 independent and can be executed in parallel. Here, the vertices in T(0), T(2), T(4), and T(6) can be operated on by four threads in parallel, i.e., one thread per level group. After synchronization the remaining four blue level groups can also be executed in parallel.
This idea can be generalized such that for distance-k coloring, each level group contains k adjacent levels. The level groups thus formed are then distance-1 colored. Then, all level groups within a color can be executed in parallel. This simple approach allows one to generate workload for a maximum of N/(2k) threads if distance-k coloring is requested. 7 Note that in all cases, vertices within a single level group are computed in their original order, which allows for good spatial access locality.
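The resulting parallel execution pattern can be sketched as follows: level groups of the same parity (color) are processed concurrently, separated by a barrier. The groupPtr array (first level of each level group) and the per-row callback are illustrative simplifications, not RACE's actual interface.

// Sketch of distance-k parallel execution over level groups.
// groupPtr[g] is the first level belonging to level group g (nGroups+1 entries);
// with at least k levels per group, groups of the same parity (color) are
// distance-k independent and can be processed concurrently.
void execute_level_groups(int nGroups, const int *groupPtr, const int *levelPtr,
                          void (*rowKernel)(int row, void *arg), void *arg)
{
    for (int color = 0; color < 2; ++color) {         // red sweep, then blue sweep
        #pragma omp parallel for schedule(static)     // one thread per level group
        for (int g = color; g < nGroups; g += 2) {
            int firstRow = levelPtr[groupPtr[g]];
            int lastRow  = levelPtr[groupPtr[g + 1]];
            for (int row = firstRow; row < lastRow; ++row)
                rowKernel(row, arg);                  // e.g., one row of SymmSpMV
        }                                             // implicit barrier between colors
    }
}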
Choosing the same number of levels for each level group may, however, cause severe load imbalance depending on the matrix structure. In particular, the use of bandwidth reduction schemes such as BFS or RCM will further worsen this problem due to the lens-like shape of the reordered matrix (see inset of Figure 5(b)), leading to low workload for level groups containing the top and bottom rows of the matrix; compare, e.g., T(0) and T(7) with T(3) and T(4) in Figure 6(b). However, Equation (8) does not require exactly k levels in a level group but only at least k. In the following we exploit this to alleviate the imbalance problem.
Load balancing
The RACE load balancing scheme tries to balance the workload across level groups within each color for a given number of threads, while maintaining data locality and the distance-k constraint between the two colors. To achieve this we use an idea similar to incremental graph partitioning [37].
Level groups with low workload "grab" adjacent levels from neighboring level groups; overloaded level groups shift levels to adjacent level groups. One can either balance the number of rows (i.e., vertices) N_r(T(i)) or the number of nonzeros (i.e., edges) N_nz(T(i)). Both variants are supported by our implementation, and we choose balancing by number of rows in the following to demonstrate the method (see Algorithm 4). For a given set of level groups we calculate the mean and variance of N_r(T(i)) within each color (red and blue). The overall variance, which is the target of our minimization procedure, is then found by summing up the variances across colors. In order to reduce this value we first select the two level groups with the largest negative/positive deviation from the mean (which are T(5) and T(4) in step 1 of Figure 7) and try to add/remove levels to/from them (see top row of Figure 7). When removing levels from a level group, the distance-k coloring is strictly maintained by keeping at least k levels in it. The shift of levels is done with the help of an array T_ptr[], which holds pointers to the beginning of each level group (see Figure 7), avoiding any copy operation. If shifting levels between the two level groups with the largest deviation does not lead to a lower overall variance, no levels are exchanged and we choose the next pair of level groups according to a ranking based on the absolute deviation from the mean (see Algorithm 4 for implementation details) and continue. Following this process in an iterative way we finally end up in a state of lowest overall variance at which no further moves are possible, either because they would violate the distance-k dependency or because they would increase the overall variance.

Figure 7 shows the load balancing procedure under a distance-2 constraint for some initial mapping of 17 levels to six level groups. Applying the procedure to our stencil example of size 16 × 16, requesting distance-2 coloring and ten level groups, leads to the mapping shown in Figure 8(a). Note that level groups at the extreme ends have more levels due to fewer vertices (N_r) in each level, while level groups in the middle, having more vertices, maintain two levels to preserve the distance-2 constraint.
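For illustration, the objective that the rebalancing loop of Algorithm 4 minimizes, i.e., the sum over both colors of the variance of the per-level-group row counts, can be sketched as follows (row-count balancing variant; T_ptr and levelPtr as introduced above; at least one level group per color is assumed).

// Sketch: overall variance of the number of rows per level group,
// computed separately for the two colors (even/odd group index) and summed.
// Tptr[g]..Tptr[g+1]-1 are the levels of group g; levelPtr gives row offsets.
double balance_objective(int nGroups, const int *Tptr, const int *levelPtr)
{
    double var_total = 0.0;
    for (int color = 0; color < 2; ++color) {
        int    n    = 0;
        double mean = 0.0, m2 = 0.0;
        for (int g = color; g < nGroups; g += 2) {
            double rows = levelPtr[Tptr[g + 1]] - levelPtr[Tptr[g]];  // Nr(T(g))
            mean += rows;
            m2   += rows * rows;
            ++n;
        }
        mean /= n;
        var_total += m2 / n - mean * mean;   // variance of Nr(T(g)) within this color
    }
    return var_total;  // levels are only shifted if this value decreases
}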
Recursion
As discussed in Section 4.2, the maximum degree of parallelism is limited by the total number of levels (N) and may be further reduced by level aggregation. In case of our 16 × 16 stencil example the maximum possible parallelism is eight threads, which may cause load imbalance as seen in Figure 8(b). Hence, further parallelism must be found within the level groups. Compared to methods like MC we do not require all vertices in a level group to be distance-1 (or, in general, distance-k) independent. This is a consequence of our level-based approach, which requires distance-k independence between vertices of different levels but not within a level (see Equation (8)). There may be more parallelism hidden within the level groups, which can be interpreted as subgraphs.
Thus we apply the three steps of our method recursively on selected subgraphs to exploit the parallelism within them.
In the following we first demonstrate the basic idea in the context of distance-1 dependencies, which can be resolved within the given level group by design. However, for k > 1, vertices in a level group may have distance-k dependencies via vertices in adjacent level groups. We generalize our procedure to distance-k dependencies as a second step in Section 4.4.2. Finally, in Section 4.4.3 we apply the recursive scheme to our stencil example and introduce proper subgraph selection as well as global load balancing strategies.
Fig. 7. All steps of the load balancing scheme, applied to an arbitrarily chosen initial distribution of 17 levels into six level groups for distance-2 coloring. Rebalancing steps are performed clockwise starting from the top left. mean_r and mean_b denote the current average number of rows per level group and color; var is the overall variance.

4.4.1 Distance-1 dependency. In order to visualize the basic concepts easily and discuss important corner cases of the recursive approach we start with the simple graph shown in Figure 9(a), which is not related to our stencil example. To distinguish between level groups at different stages s of the recursive procedure we add a subscript to the levels (L_s(i)) and level groups (T_s(i)) indicating the stage of recursion at which they are generated, with s = 0 being the original distribution before recursion is applied to any subgraph.
For the distance-1 coloring of the graph in Figure 9 we find that three of the four level groups of the initial stage still contain distance-1 independent vertices, implying that such pairs of vertices could be computed in parallel without any distance-1 conflicts. This parallelism is not exposed at the first stage (s = 0), as the vertices in L_0(i) are chosen such that they are distance-1 neighbors of L_0(i − 1), ignoring any vertex relations within L_0(i).
Recursion starts with the selection of a subgraph of the matrix, which is discussed in more detail later (see Section 4.4.3). Here we choose the subgraph induced by T_0(2). It can be isolated from the rest of the graph since the distance-1 coloring step in stage 0 has already produced independent level groups. Now we just need to repeat the three steps explained previously (Sections 4.1-4.3) on this subgraph. Figure 10 illustrates the application of the first recursive step (s = 1) to T_0(2), where we extend the vertex numbering of Equation (6) to i^{l_0,l_1,...,l_s}, with l_s being the index of the level that contains vertex i at stage s of the recursion.
At the end of the recursion on T_0(2) (cf. Figures 10(d) and 10(e)) we obtain parallelism for two more threads in this case. Note that the subgraphs might have "islands" (groups of vertices that are not connected to the rest of the subgraph); e.g., vertex 3 and vertices 4, 5, 6 form two islands in Figure 10(b).
Since an island is disconnected from the rest of the (sub)graph it can be executed independently of and in parallel to it. To take advantage of this, the starting node of the next island is assigned a level number with an increment of two, as seen in Figures 10(d) and 10(e). The selection of the optimal one will be done in the final load balancing step as described in Section 4.3.

As this recursive process finds independent level groups (T_{s+1}) within a level group of the previous stage (T_s), the thread assigned to T_s has to spawn threads to parallelize within T_{s+1}.

Fig. 11. Two level groups generated by a distance-2 coloring (Figure 11(a)). Figure 11(b) shows the subgraph induced by level group T_0(1). Level construction on the selected subgraph is shown in Figure 11(c). Forming distance-2 independent level groups on these levels does not guarantee distance-2 independence between the newly generated level groups of the same sweep (color), as seen in Figure 11(d).
4.4.2 Distance-k dependencies with k > 1. In general, it is insufficient to consider only the subgraphs induced by level groups in the recursion step, as can be seen in Figure 11(a) for distance-2 coloring. Applying the three steps (see Figures 11(b) to 11(d)) to the subgraph induced by T_0(1) does not guarantee distance-2 independence between the new level groups T_1(0) and T_1(2). It is obvious that for general distance-k colorings two vertices a, b within a level group might be connected by a shared vertex c outside the level group. Thus, our three-step procedure must be applied to a subgraph which contains the actual level group (T_s(j)) as well as all its distance-p neighbors, where p = 1, 2, ..., k − 1.
This ensures that there is no vertex outside the subgraph which can mediate a distance-k dependency between vertices of the embedded level group (T_s(j)). We can now construct the new levels on this subgraph considering the full neighborhood, but we only store those vertices in the new levels L_{s+1}(:) that belong to the embedded level group (T_s(j)). Next we apply distance-k coloring by aggregation of the new levels, leading to a set of level groups T_{s+1}(:) within T_s(j). Figure 12 demonstrates this approach to resolve the conflict shown in Figure 11(d). Figure 12(b) presents the subgraph containing the selected level group T_0(1) and its distance-1 neighborhood.
Level construction is performed on the subgraph (Figure 12(c)), but the new levels only contain vertices of T_0(1), i.e., L_1(1) = {7^{3,1}}. Finally, distance-2 coloring by aggregation of two adjacent levels is performed, leading to three level groups of the second stage s = 1 (Figure 12(d)), i.e., T_1(0) = {L_1(0) ∪ L_1(1)}. Now vertices 3 and 6 are mapped to level groups of different colors. Note that the permutation step on the newly generated levels is not shown but is performed as well to maintain data locality.
Fig. 12. Correct procedure for distance-2 coloring of level group T_0(1). The subgraph shown in Figure 12(b) contains level group T_0(1) and its distance-1 neighborhood. A level construction step is applied to this subgraph in Figure 12(c). Distance-2 coloring by level aggregation leading to level groups of stage 1 is shown in Figure 12(d); we get three level groups at the end of the recursion on T_0(1).

4.4.3 Level group construction and global load balancing. The recursive refinement of level groups allows us to tackle load imbalance problems and a limited degree of parallelism, as we are no longer restricted by the one-thread-per-level-group constraint. Instead, we have the opportunity to form level groups and assign appropriate thread counts to them such that the load per thread approaches the optimal value, i.e., the total workload divided by the number of threads available. Pairs of adjacent level groups having different colors within a stage, i.e., T_s(i) and T_s(i + 1) with i = 0, 2, 4, ..., are typically handled by the same threads, so we assign an equal number of threads to them. We then apply recursion to the level groups with more than one thread assigned. Starting with the original graph as the base level group (T_-1(0)), to which all available threads N_t(T_-1(0)) = N_t and all vertices N_r(T_-1(0)) = N_r,total are assigned, we perform the following steps to form level groups T_s(:) at stage s ≥ 0, to which we assign N_t(T_s(:)) threads. To illustrate the procedure we use the 16 × 16 stencil example and construct a coloring scheme for eight threads (see Figure 13).
(1) Assign weights to all levels at stage s of the recursion. Assuming that L_s(i) ⊂ T_{s−1}(j), its weight is defined by weight(L_s(i)) = N_r(L_s(i)) · N_t(T_{s−1}(j)) / N_r(T_{s−1}(j)). For a given level group T_{s−1}(j) that has to be split up (N_t(T_{s−1}(j)) > 1), the weight thus describes the fraction of the optimal load per thread, N_r(T_{s−1}(j)) / N_t(T_{s−1}(j)), contained in the level.
Requesting N_t(T_-1(0)) = 8 threads for the N_r(T_-1(0)) = 256 vertices of the stencil example in Figure 13 produces the initial (s = 0) level weights weight(L_0(i)) = N_r(L_0(i))/32, since the optimal load per thread is 256/8 = 32 rows.
(2) The above definition implies that if the weight is close to a natural number b, the corresponding workload is near optimal for operation with b threads. Thus, starting with L_s(0) we aggregate successive levels until their combined weight forms a number a close to a natural number b. Distance-k coloring is ensured by enforcing the aggregation of at least 2k levels, i.e., for distance-2 coloring at least four levels (two for red and two for blue). Closeness to the natural number is quantified by a parameter ϵ defined as ϵ = 1 − |a − [a]|, where [a] is the nearest integer to a, and is controlled by the criterion ϵ > ϵ_s, where the ϵ_s ∈ [0.5, 1) are user-defined parameters (a sketch of this aggregation criterion is given after this list).
The choice of this parameter may be different for every stage of the recursion. Once we find a collection of successive levels satisfying this criterion, the natural number b is fixed. We then try to further increase the number of levels to test whether there exists a number a' > a which is closer to b, leading to an ϵ value closer to one. We finally choose the set of levels with the best ϵ value and define them to form T_s(0) and T_s(1), which are to be executed by N_t(T_s(0)) = N_t(T_s(1)) = b threads. In Figure 13 we choose ϵ_s = 0.6, which selects the first seven levels to form T_0(0) and T_0(1). As their combined weight is 28/32 = 0.875, one thread will execute these two level groups.

(3) We continue with subsequent pairs of level groups (T_s(i), T_s(i + 1); i = 2, 4, ...) by applying this procedure starting with the very next level. Finally, once all levels have been touched, a total of N_t(T_{s−1}(j)) threads have been assigned to the levels L_s(i) ⊂ T_{s−1}(j). For example, for T_0(4) and T_0(5) in Figure 13 two threads satisfy the criterion, as the total weight of the four levels included is 54/32 ≈ 1.69.

(4) The distribution between adjacent red and blue level groups which are assigned to the same thread(s), as well as the final global load balancing, is performed using a slight modification of the scheme presented in Section 4.3 (shown at the beginning of Algorithm 4 in Appendix A): now the calculation of mean and variance must consider the number of threads (N_t(T_s(j))) assigned to each level group, and the worker array has to be replaced by the number of threads assigned to each level group (N_t(T_s(j))). The algorithm then tries to minimize the variance of the number of vertices per thread over the level groups. Ideally, after this step the load per thread in each level group approaches the optimal value given above.

Once the level groups of stage s have been formed, the recursion and the above procedure are applied separately to all new level groups with more than one thread assigned. This continues until every level group is assigned to exactly one thread. The depth of the recursion is determined by the parameters ϵ_s and depends on the matrix structure as well as the degree of parallelism requested.
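A sketch of the aggregation criterion of step (2) is given below; the closeness measure ϵ = 1 − |a − [a]| follows the definition above, and the refinement that searches for an even better ϵ after b is fixed is omitted for brevity.

#include <math.h>

// Sketch of the aggregation criterion of step (2): eps = 1 - |a - [a]| for the
// combined weight a of the levels aggregated so far.
static double closeness(double a)
{
    return 1.0 - fabs(a - round(a));
}

// Aggregate levels starting at 'first' until the combined weight is close enough
// (eps > eps_s) to a natural number, but never fewer than 2*k levels (distance-k).
// weight[i] is the per-level weight defined in step (1). Returns one past the last
// aggregated level; *threads receives the natural number b = [a].
// The refinement step (testing whether adding more levels improves eps) is omitted.
int aggregate_levels(const double *weight, int first, int nLevels,
                     int k, double eps_s, int *threads)
{
    double a = 0.0;
    int i = first;
    while (i < nLevels && (i - first < 2 * k || closeness(a) <= eps_s))
        a += weight[i++];
    *threads = (int)round(a) > 0 ? (int)round(a) : 1;
    return i;
}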
For our stencil example in Figure 13 the inner four level groups of stage s = 0 required one stage of recursion. This led to 16 level groups at stage s = 1, as we require four new level groups per recursion to schedule two threads. In terms of parallel computation, first the red vertices are computed in parallel with the orange ones, using four threads for each of the two colors. Once the orange vertices are done, each pair of threads assigned to T_0(4) and T_0(6) synchronizes locally (i.e., within T_0(4) and T_0(6) separately). Then the pink vertices are computed, followed by a global synchronization of all threads. The scheme continues with the blue vertices and the brown/cyan ones, which represent the two blue level groups to which recursion has been applied (see the table in Figure 13).

The recursive nature of our scheme is best described by a tree data structure, where every node represents one level group and the maximum depth is equivalent to the maximum level (stage) of recursion. The data structure for the colored graph in Figure 13 and its thread assignments are shown in Figure 14. The root node represents our baseline level group T_-1(0) comprising all 256 vertices and all eight threads (having unique id = 0, ..., 7). The first level of child nodes gives the initial (s = 0) distribution, with each node storing the information of a level group including its color. Threads are mapped consecutively to the level groups. The red T_0(4) level group, which consists of vertices 66, ..., 90 (omitting the superscripts for level numbers), is executed by the threads with id = 2, 3. Applying recursion to T_0(4), this node spawns four new child nodes at stage s = 1, i.e., level groups T_1(0, ..., 3) ⊂ T_0(4), to be executed by the two threads. Synchronization only happens between threads having the same parent node after executing the same color. Note that actual computations are only performed on the leaf nodes of the final tree.
PARAMETER STUDY
The RACE method has a set of input parameters {ϵ_s; s = 0, 1, ...} which control the assignment of threads to adjacent level groups. To determine useful settings, we analyze the interaction between these input parameters, the number of threads used, and the parallel efficiency of the generated workload distribution.
As the internal tree structure contains all information about the final workload distribution, we can use it to identify the critical path in terms of workload and thus the parallel efficiency. To this end we introduce the effective row count N^eff_r(T_s(i)) for every node (or level group), which is a measure for the absolute runtime needed to calculate the corresponding level group. For level groups that are not further refined (leaf nodes) this value is their actual workload, i.e., the number of rows assigned to them (N^eff_r(T_0(0)) = 15 in Figure 14). For an inner node, the effective row count is the sum over the two colors of the maximum workload (i.e., the maximum effective row count) among its child nodes of that color:

N^eff_r(T_s(i)) = max over red children T_{s+1}(j) of N^eff_r(T_{s+1}(j)) + max over blue children T_{s+1}(j) of N^eff_r(T_{s+1}(j)).
Such a definition is based on the idea that nodes at a given stage s have to synchronize with each other and have to wait for their siblings with the largest workload in each sweep (color). Propagating this information upwards through the tree until we reach the root node constructs the critical path in terms of the longest runtime, taking into account single-thread workloads, dependencies, and synchronizations. Thus, the final value in the root node, N^eff_r(T_-1(0)), can be considered the effective maximum workload of a single thread. Dividing the globally optimal workload per thread, N_r,total/N_t, by this number gives the parallel efficiency (η) of our workload distribution:

η = N_r,total / (N_t · N^eff_r(T_-1(0))).
For the tree presented in Figure 14, the parallel efficiency is limited to η = 256/(44 × 8) = 0.73 on eight threads, i.e., the maximum parallel speedup is 5.8.
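The critical-path computation described above can be sketched as a recursive walk over the level-group tree. The Node structure is hypothetical and only mirrors the information discussed in the text; RACE's internal data layout may differ.

// Sketch of the effective row count and parallel efficiency computation.
// Each tree node is a level group; leaves carry their actual row count.
typedef struct Node {
    int          nRows;       // Nr(T_s(i)), used directly for leaf nodes
    int          color;       // 0 = red, 1 = blue (relative to the parent)
    int          nChildren;
    struct Node *children;
} Node;

// N_r^eff: leaves return their workload; inner nodes return the sum over colors
// of the maximum effective row count among children of that color.
static int effectiveRows(const Node *n)
{
    if (n->nChildren == 0)
        return n->nRows;
    int maxPerColor[2] = {0, 0};
    for (int c = 0; c < n->nChildren; ++c) {
        int v = effectiveRows(&n->children[c]);
        if (v > maxPerColor[n->children[c].color])
            maxPerColor[n->children[c].color] = v;
    }
    return maxPerColor[0] + maxPerColor[1];
}

// Parallel efficiency eta = (Nr_total / Nt) / N_r^eff(root), cf. the text.
// For the example tree of Figure 14: 256 / (8 * 44) = 0.73.
static double parallelEfficiency(const Node *root, int nrTotal, int nThreads)
{
    return (double)nrTotal / ((double)nThreads * effectiveRows(root));
}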
Parameter analysis and selection
The parallel efficiency η as defined above can be calculated for any given matrix, number of threads N_t, and choice of {ϵ_s; s = 0, 1, ...}; it reflects the quality of the parallelism generated by RACE for the problem at hand. This way we can understand the interaction between these parameters and identify useful choices for the ϵ_s. Of course, running a specific kernel such as SymmSpMV on actual hardware will add further hardware and software constraints such as attainable memory bandwidth or the cost of synchronization.
As a first step we can limit the parameter space by a simple corner case analysis. Setting all parameters close to one requests high-quality load balancing but may prevent our balancing scheme from terminating. In the extreme case of {ϵ_s = 1; s = 0, 1, ...} the scheme may generate only two level groups (one of each color) in each recursion, assign all threads to them, and may further attempt to refine them in the same way. The lowest possible value of ϵ_s is the maximum deviation of a real number from its nearest integer, which is 0.5. A range of [0.5, 0.9] for the ϵ_s is therefore used in the following. For a basic analysis we have selected the inline_1 matrix (see Table 2) as it has a rather small amount of parallelism and allows us to cover the basic scenarios. In Figure 15 we demonstrate the impact of different choices for ϵ_0 and ϵ_1 on the parallel efficiency for thread counts up to 100, which is a useful limit for modern CPU-based compute nodes. For s > 1 we always set the minimum value of ϵ_s = 0.5. The limited parallelism can be clearly observed in Figure 15(a), with efficiency steadily decreasing as the thread count increases. At ϵ_1 = 0.5 there is only a minor impact of the parameter ϵ_0. In Figures 15(b) to 15(d) the interplay between these two parameters is analyzed in more detail at different thread counts. We find that up to intermediate parallelism (N_t = 50) the exact choice has only a minor impact on the parallel efficiency (note the y-axis scaling). For larger parallelism the interplay becomes more intricate, and too large values of ϵ_0,1 may lead to stronger imbalance. Based on this evaluation, we choose ϵ_0,1 = 0.8 and ϵ_s = 0.5 for s > 1 for all subsequent performance measurements. The quality of this choice in terms of parallel efficiency for all matrices is presented in Figure 16. Here we plot the η value for all matrices over a large range of thread counts. We find that our parameter setting achieves parallel efficiencies of 80% or higher for a substantial fraction of the matrices up to intermediate thread counts. Representing the upper (lower) values in Figure 16 is the best (worst) case matrix Graphene-4096 (crankseg_1), exhibiting almost perfect (very low) parallel efficiency at intermediate to high thread counts.
Finally, we evaluate the scalability of RACE using these two corner cases and the inline_1 matrix as well as the parabolic_fem matrix, which is small enough to fit into the cache. In Figure 17 we mimic scaling tests on one Skylake SP processor with up to 20 cores (i.e., threads) and plot the parallel efficiency η as well as the maximum number of threads which can be used "perfectly", N^eff_t (i.e., N^eff_t = η × N_t). The unfavorable structure of the crankseg_1 matrix puts strict limits on parallelism even for low thread counts. The combination of small matrix size with a rather dense population (see Table 2) leads to large inner levels when constructing the graph, triggering strong load imbalance if more than six threads are used. A search for better ϵ_s slightly changes the characteristic scaling but not the maximum parallelism that can be extracted. For the inline_1 matrix we find a weak but steady decrease of the parallel efficiency, which is in good agreement with the discussion of Figure 15. The other two matrices scale very well in the range of thread counts considered.

The corresponding performance measurements for the SymmSpMV kernel (see Section 3.2) on a single Skylake SP processor chip with 20 cores are shown in Figure 18. 8 For the crankseg_1 matrix (see Figure 18(a)) we recover the limited scaling due to load imbalance as theoretically predicted. The performance maximum is at nine cores, where the maximum SpMV performance can be slightly exceeded. However, based on the roofline performance model given by Equations (1) and (3) together with the matrix parameters from Table 2, a theoretical speedup of approximately two compared to SpMV can be expected for the full processor chip under best conditions. Indeed, in case of the inline_1 and Graphene-4096 matrices, performance scales almost linearly until the main memory bandwidth bottleneck is hit. The saturated performance is in good agreement with the roofline limits. Note that even though the inline_1 matrix does not exhibit perfect theoretical efficiency (η ≈ 0.85 at N_t = 20), it still generates sufficient parallelism to achieve main memory saturation: the memory bottleneck can mitigate a limited load imbalance.
The peculiar performance behavior of parabolic_fem (see Figures 17(c) and 18(c)) is due to its small size (≈ 23 MB), which lets it fit into the caches of the Skylake processor (LLC size = 28 MB).
Thus, performance is not limited by the main memory bandwidth and the roofline model limits do not apply.
We have demonstrated that a simple choice for the only set of RACE input parameters {ϵ_s; s = 0, 1, . . .} can extract sufficient parallelism for most matrices considered in this study. Moreover, the parallel efficiency as calculated by RACE, in combination with the roofline performance model, is a good indicator for the scalability and maximum performance of the actual computations.
PERFORMANCE EVALUATION OF SYMMSPMV USING RACE
We evaluate the performance of SymmSpMV based on the parallelization and reordering performed by RACE and compare it with the two MC approaches introduced above and with the Intel MKL. As a yardstick for baseline performance we choose the general SpMV kernel and use the performance model introduced in Section 3 to quantify the quality of our absolute performance numbers. As the deviations between different measurement runs are less than 5%, we do not show error bars in our performance measurements.
[Caption fragment: roofline performance limits according to Equation (1), using the computational intensity from Equation (3), for the two extreme cases of load-only memory bandwidth (RLM-load) and copy memory bandwidth (RLM-copy); the measured full-socket main memory data traffic per nonzero entry of the symmetric matrix (in bytes) for the SymmSpMV operation is also shown, where values below 12 bytes indicate caching of the matrix entries.]
Experimental Setup
All matrix data are encoded in the CRS format. For SymmSpMV only the nonzeros of the upper triangular matrix are stored. In the case of RACE and the coloring approaches, every thread executes the SymmSpMV kernel of Algorithm 2 with appropriate outer loop boundary settings depending on the colors (MC, ABMC) or level groups (RACE) to be computed. In order to ensure vectorization of the inner loop in Algorithm 2 we use the SIMD pragma #pragma simd reduction(+:tmp) vectorlength(VECWIDTH). Here VECWIDTH is the maximum vector width supported by the architecture, i.e., VECWIDTH = 4 (8) for Ivy Bridge EP (Skylake SP).
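For illustration, the following is a minimal sketch of such a CRS-based SymmSpMV kernel body in the spirit of Algorithm 2. The array and parameter names (rowPtr, col, val, x, y, rowStart, rowEnd) are illustrative, and the per-thread loop bounds are assumed to be supplied by the respective method (color or level group), which guarantees that concurrent indirect updates of y do not conflict.

#define VECWIDTH 8  /* e.g., Skylake SP; 4 on Ivy Bridge EP */

/* Sketch of a CRS-based SymmSpMV chunk working on the stored upper
 * triangle (including the diagonal); rowStart/rowEnd delimit the rows
 * assigned to the calling thread (per color or level group). */
void symm_spmv_chunk(int rowStart, int rowEnd,
                     const int *rowPtr, const int *col, const double *val,
                     const double *x, double *y)
{
    for (int i = rowStart; i < rowEnd; ++i) {
        double tmp = 0.0;
#pragma simd reduction(+:tmp) vectorlength(VECWIDTH)
        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
            tmp += val[j] * x[col[j]];           /* row contribution            */
            if (col[j] != i)                     /* avoid double-counting diag  */
                y[col[j]] += val[j] * x[i];      /* symmetric (column) update   */
        }
        y[i] += tmp;                             /* single store per row        */
    }
}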
The Intel MKL offers two choices for the two sparse matrix kernels under consideration: First, CRS-based data structures are provided and are used without any modification in the subroutines mkl_cspblas_dcsrgemv (SpMV) and mkl_cspblas_dcsrsymv (SymmSpMV); we label this variant MKL. This mode of operation is deprecated as of Intel MKL v.18. Instead, the inspector-executor mode (MKL-IE) is recommended. Here, the user initially provides the matrix along with hints (e.g., symmetry) and the operations to be carried out to the inspector routine (mkl_sparse_set_mv_hint).
Then an optimization routine (mkl_sparse_optimize) is called, in which the matrix is preprocessed based on the inspector information to achieve best performance and highest parallelism for the problem at hand. The subroutine mkl_sparse_d_mv is then used to perform the SpMV or SymmSpMV operations on this optimized matrix structure. This approach does not provide any insight into which kernel or data structure is actually used "under the hood."
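A minimal sketch of such an inspector-executor call sequence is shown below; the argument names and the concrete hint values (e.g., the expected call count) are illustrative and not necessarily the exact settings used in our benchmarks.

#include <mkl_spblas.h>

/* Sketch: inspector-executor setup and execution for a symmetric CRS
 * matrix with the upper triangle stored. */
void symm_spmv_mkl_ie(MKL_INT nrows, MKL_INT *rowPtr, MKL_INT *colIdx,
                      double *values, const double *x, double *y)
{
    sparse_matrix_t A;
    struct matrix_descr descr;

    mkl_sparse_d_create_csr(&A, SPARSE_INDEX_BASE_ZERO, nrows, nrows,
                            rowPtr, rowPtr + 1, colIdx, values);

    descr.type = SPARSE_MATRIX_TYPE_SYMMETRIC;   /* hint: symmetric matrix   */
    descr.mode = SPARSE_FILL_MODE_UPPER;         /* upper triangle is stored */
    descr.diag = SPARSE_DIAG_NON_UNIT;

    /* Announce the operation with an (illustrative) expected call count,
     * then let MKL preprocess the matrix. */
    mkl_sparse_set_mv_hint(A, SPARSE_OPERATION_NON_TRANSPOSE, descr, 100);
    mkl_sparse_optimize(A);

    /* Executor call: y = 1.0 * A * x + 0.0 * y */
    mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1.0, A, descr, x, 0.0, y);

    mkl_sparse_destroy(A);
}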
In the performance measurements the kernels are executed multiple times in order to ensure reasonable measurement times and to average out potential performance fluctuations. Doing successive invocations of the kernel on the same two vectors, however, may lead to unrealistic caching of these vectors if the number of rows is small enough. Thus, we use two ring buffers (of at least 50 MB each) holding separate vectors of size N_r. After each kernel invocation we switch to the next vector in the two buffers. This way we mimic the typical structure of iterative sparse solvers, where other data-intensive kernels, e.g., several Level 1 BLAS routines or preconditioning steps, are executed between successive matrix-vector operations. We iterate 100 times over these two buffers and report the mean performance. For all methods and libraries the input matrices have been preprocessed with RCM bandwidth reduction using the Intel SpMP library [42]. This provides the same or better performance on all matrices as compared to the original ordering. If not otherwise noted we use the full processor chip and assign one thread to each core. As we focus on a single chip and sub-NUMA clustering (SNC) is not enabled on Skylake SP, no NUMA data placement effects impact our results.
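The following is a minimal sketch of this measurement scheme, assuming a generic kernel function pointer; the names (run_benchmark, kernel_fn, nBuf, nIter) are illustrative.

#include <omp.h>

typedef void (*kernel_fn)(const double *x, double *y);

/* Sketch of the ring-buffer timing loop: each invocation uses the next
 * vector pair so that small vectors are not kept artificially
 * cache-resident between kernel calls. Returns mean time per call. */
static double run_benchmark(kernel_fn kernel, int nIter,
                            double **xBuf, double **yBuf, int nBuf)
{
    double t0 = omp_get_wtime();
    for (int it = 0; it < nIter; ++it) {
        int b = it % nBuf;              /* advance in both ring buffers */
        kernel(xBuf[b], yBuf[b]);
    }
    return (omp_get_wtime() - t0) / nIter;
}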
Results
Before we evaluate the performance across the full set of matrices presented in Table 2, we return to the analysis of the SymmSpMV performance and data traffic for the Spin-26 matrix, which we have presented in Section 3.3 for the established coloring approaches.
6.2.1 Analysis of the SymmSpMV kernel using RACE for the Spin-26 matrix. The shortcomings in terms of performance and excessive data transfer when parallelizing SymmSpMV with MC and ABMC have been demonstrated in Figure 2. We extend this evaluation by comparison with the RACE results in Figure 19. The figures clearly demonstrate the ability of RACE to ensure high data locality in the parallel SymmSpMV kernel. The actual main memory traffic achieved is in line with the minimum traffic for this matrix (see the discussion in Section 3.3) and a factor of up to 4× lower than for the coloring approaches. Correspondingly, RACE SymmSpMV performance is at least 3.3× higher than its best competitor and 25% better than the SpMV kernel on both architectures. It achieves more than 84% of the roofline performance limit based on the copy main memory bandwidth. Note that the indirect update of the LHS vector generates a store instruction for every inner loop iteration (see Algorithm 2), while the SpMV kernel only does a final store at the end of the inner loop. In combination with the low number of nonzeros per row (N_nzr) of the Spin-26 matrix, the "copy"-induced limit poses a realistic upper performance bound.
6.2.2 Analyzing absolute performance of RACE. We now extend our RACE performance investigation to the full set of test matrices presented in Table 2. In Figures 20(a) and 20(b) the performance results for the full Ivy Bridge EP processor chip (10 cores) and the full Skylake SP processor chip (20 cores) are presented along with the upper roofline limits and the performance of the baseline SpMV kernel using Intel MKL. The matrices are arranged along the abscissa according to the ascending number of rows (N_r), i.e., increasing size of the two vectors involved in SymmSpMV. Overall, RACE performance comes close to or matches our performance model for many test cases on both architectures. A comparison of the architectures shows that the corner case matrices crankseg_1 and parabolic_fem behave strikingly differently for RACE. For crankseg_1 this is caused by the limited amount of parallelism in its structure. Here we refer to the discussion of Figure 17(a), where best performance and highest parallelism (N_t^eff) were achieved at approximately 10 cores. Using only 9 cores on Skylake SP lifts the SymmSpMV performance of crankseg_1 slightly above the SpMV level of the MKL. The parabolic_fem matrix has been chosen to fit into the LLC of the Skylake SP architecture, providing a corner case where scalability is not intrinsically limited by main memory bandwidth (see Figure 17(c)); thus our roofline performance limit does not apply for this matrix on Skylake SP. However, on Ivy Bridge EP the matrix data set just exceeds the LLC and the performance is in line with our model.
On both architectures a characteristic drop in performance is encountered around the Flan_1565 and G3_circuit matrices, where the aggregate size of the two vectors (25 MB) approaches the available LLC sizes. For smaller matrices there is a higher chance that the vectors stay in the cache during the SymmSpMV, i.e., the vectors need to be transferred between main memory and the processor only once per kernel invocation. For larger matrices (i.e., larger N_r) the reuse of vector data during a single SymmSpMV kernel decreases and vector entries may be accessed several times from main memory. This is reflected by the increase in measured α_SpMV (assumed α_SymmSpMV) values for matrices with index 20 and higher in Table 3.
In short, RACE achieves an average speedup of 1.4× and 1.5× compared to SpMV on the Skylake SP and Ivy Bridge EP architectures, respectively. On Skylake SP, RACE SymmSpMV attains on average 87% and 80% of the roofline performance limits predicted using the copy and load bandwidth, respectively, while on Ivy Bridge EP it attains 91% and 83% of the respective limits.
The MKL implementations of SymmSpMV deserve special consideration in this context. Therefore, in Figure 20 we also compare our approach with the two Intel MKL options described above. For the MKL-IE variant we specify that the symmetry of the matrix should be exploited when calling the inspector routine. On the Ivy Bridge EP architecture, RACE always provides superior performance, and the best performing Intel variant depends on the underlying matrix. On the Skylake SP, however, MKL-IE always outperforms the deprecated MKL routine and is superior to RACE for two matrices (crankseg_1, offshore). These are the same matrices where RACE is slower than the MKL SpMV kernel (see Figure 20(b)). It can be clearly seen that the MKL-IE data for SymmSpMV are identical to the MKL SpMV numbers presented in Figure 20, i.e., the inspector calls the baseline SpMV kernel and uses the full matrix, even though it knows about the symmetry of the matrix. One reason for that strategy might be that the parallelization approach used in the deprecated MKL implementation of SymmSpMV is not scalable, which would explain the fact that MKL is worse than MKL-IE in all cases on Skylake SP. As neither the algorithm used to parallelize SymmSpMV nor its low-level code implementation is known, we refrain from a deeper analysis of the Intel performance behavior. In summary we find that RACE is on average 1.4× faster than the best Intel variant and can achieve speedups of up to 2×. Note that on Skylake SP the best MKL variant is always MKL-IE, which has almost twice the memory footprint compared to SymmSpMV with RACE.
6.2.3 Single core performance. Although single core performance is often considered not to be crucial for the full chip SymmSpMV performance, we demonstrate that it is vital for explaining some of the observed performance behavior. For example, the drop in SymmSpMV performance for matrices like Hubbard-12 and delaunay_n24 strongly correlates with the lower performance of the baseline SpMV (see Figure 20). These matrices are characterized by a rather low N_nzr and a larger α_SpMV value. Note that α_SpMV measured for the SpMV kernel mainly accounts for the RHS vector traffic, and the actual α_SymmSpMV may even be higher as SymmSpMV requires two vectors to stay in the cache concurrently. Moreover, for these matrices the inner loop lengths are typically very short (approximately N_nzr/2 on average) and consequently the SIMD vectorization performed by the compiler may become inefficient. This leads to lower single core performance, as shown in Figure 21 for the Skylake SP architecture, where bad performance of SymmSpMV and SpMV can often be correlated with a small N_nzr value. For several matrices these combined effects overcompensate the reduced matrix data traffic of SymmSpMV, leading to worse single core performance than running SpMV with the full matrix. Using the delaunay_n24 matrix as a representative of this class, we demonstrate in Figure 22 the basic challenge for SymmSpMV to exploit its performance advantage over SpMV. Starting with an approximately 25% lower single core performance (0.75 GF/s versus 0.98 GF/s) but having a 50% higher roofline performance limit (approximately 18 GF/s; see Figure 20(b)) than SpMV, the SymmSpMV is not able to saturate the main memory bandwidth of the Skylake SP on its 20 cores. As speculated above, the single core performance is limited by inefficient SIMD vectorization of the extremely short inner loop, and switching back to scalar code improves performance by 15% (see Figure 22). As we are still substantially off the bandwidth limit, this benefit persists over the full chip. Using chips with larger core counts would allow for further improving the SymmSpMV performance of this matrix. The same arguments hold for the offshore matrix, but here the effect compared to SpMV performance is even more pronounced on Skylake SP. Here the full matrix can at least partially be held in the large aggregate cache between successive kernel invocations and its performance is not limited by the main memory bandwidth. In terms of caching effects we have also identified at least partial caching of the matrix for the ship_003 and pwtk test cases by analyzing the overall data traffic of the kernel invocations. This is in line with their higher performance levels presented in Figure 20(b).
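As an illustration, one possible (assumed) way to obtain such a scalar variant with the Intel compiler is to replace the SIMD pragma in the kernel sketch shown in the Experimental Setup by #pragma novector, which suppresses vectorization of the short inner loop; whether this exact mechanism was used here is not essential to the argument.

/* Assumed scalar variant of the inner loop: #pragma novector (Intel
 * compiler hint) prevents SIMD vectorization of the short inner loop. */
for (int i = rowStart; i < rowEnd; ++i) {
    double tmp = 0.0;
#pragma novector
    for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
        tmp += val[j] * x[col[j]];
        if (col[j] != i)
            y[col[j]] += val[j] * x[i];
    }
    y[i] += tmp;
}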
6.2.4 Comparing RACE with MC and ABMC. Having well understood the performance characteristics of SymmSpMV with RACE, we finally compare it with the performance achieved by the two coloring methods in Figure 23. Here the underlying algorithm as well as the implementation are known and closely related to our approach. Overall, MC is not competitive and provides low performance for almost all matrices on both architectures. ABMC shows performance characteristics similar to RACE until the two vectors involved in SymmSpMV approach the size of the caches (cf. the discussion of Figure 20). For matrices with sufficiently small N_r (left in the diagram) the method achieves between 70% and 90% of the RACE performance in most cases. For matrices in the right part of the diagram, with their higher N_r and α_SpMV values, ABMC falls substantially behind RACE. Here, the strict orientation of the RACE design towards data locality in the vector accesses delivers its full benefit; see also the data transfer discussion in Section 6.2.1 for the Spin-26 matrix. In total there are only three cases where ABMC performance is on a par with or slightly above the RACE measurement, and the average speedup of RACE is 1.5× and 1.65× on Ivy Bridge EP and Skylake SP, respectively. Note that all three methods use the same baseline kernels, and thus performance differences between the methods do not arise from different low-level code but from the ability to generate appropriate degrees of parallelism and to maintain data locality.
CONCLUSION AND OUTLOOK
In this paper we have developed RACE, a coloring algorithm and open-source library implementation for exploiting parallelism in algorithms with inherent dependencies. RACE generates hardware-efficient distance-k colorings of undirected graphs and puts emphasis on data access locality, load balancing, and parallelism that is adapted to the number of cores of the underlying architecture. We demonstrated these benefits by applying RACE to symmetric sparse matrix-vector multiplication (SymmSpMV) on modern multicore architectures and compared its performance against standard multicoloring, algebraic block multicoloring, and Intel MKL implementations. Average and maximum speedups of 1.4× and 2×, respectively, were observed across a representative set of 31 matrices on two modern Intel processors. Our entire experimental and performance analysis process was backed by the roofline performance model, corroborating the optimality of the RACE approach in terms of resource utilization and shedding new light on the challenges of the SymmSpMV kernel on modern hardware. We demonstrated that RACE runs very close to the roofline limit for most of the 31 test cases. Outliers were analyzed and discussed in detail.
Similar to other coloring algorithms, the RACE method is not limited to the SymmSpMV kernel and can be used to efficiently parallelize solvers and kernels with general distance-k dependencies. Moreover, due to the level-based formulation of RACE, the framework has an added advantage that allows us to address other classes of problems. Future work with RACE will involve variants
