We consider the problem of writing performance portable sparse matrix-sparse matrix multiplication (SPGEMM) kernel for many-core architectures. We approach the SPGEMM kernel from the perspectives of algorithm design and implementation, and its practical usage. First, we design a hierarchical, memory-efficient SPGEMM algorithm. We then design and implement thread scalable data structures that enable us to develop a portable SPGEMM implementation. We show that the method achieves performance portability on massively threaded architectures, namely Intel's Knights Landing processors (KNLs) and NVIDIA's Graphic Processing Units (GPUs), by comparing its performance to specialized implementations. Second, we study an important aspect of SPGEMM's usage in practice by reusing the structure of input matrices, and show speedups up to 3× compared to the best specialized implementation on KNLs. We demonstrate that the portable method outperforms 4 native methods on 2 different GPU architectures (up to 17× speedup), and it is highly thread scalable on KNLs, in which it obtains 101× speedup on 256 threads.
We consider the problem of writing performance portable sparse matrix-sparse matrix multiplication (SPGEMM) kernel for many-core architectures. We approach the SPGEMM kernel from the perspectives of algorithm design and implementation, and its practical usage. First, we design a hierarchical, memory-efficient SPGEMM algorithm. We then design and implement thread scalable data structures that enable us to develop a portable SPGEMM implementation. We show that the method achieves performance portability on massively threaded architectures, namely Intel's Knights Landing processors (KNLs) and NVIDIA's Graphic Processing Units (GPUs), by comparing its performance to specialized implementations. Second, we study an important aspect of SPGEMM's usage in practice by reusing the structure of input matrices, and show speedups up to 3× compared to the best specialized implementation on KNLs. We demonstrate that the portable method outperforms 4 native methods on 2 different GPU architectures (up to 17× speedup), and it is highly thread scalable on KNLs, in which it obtains 101× speedup on 256 threads.
I. INTRODUCTION
Modern supercomputer architectures are following two different paths using either the Intel's Knights Landing processors or NVIDIA's Graphic Processing Units. As a result, it is important to design algorithms that can perform well on both platforms. The focus on this problem has been around programming models to implement an algorithm on multiple architectures [1] , [2] . We consider this problem from an algorithmic perspective to design a "performance-portable algorithm", an algorithm that can perform well on multiple architectures with similar accuracy and robustness. Another approach would be to consider different, architecture specific ("native") algorithms for every key kernel. This leads to the question "How much performance will be sacrificed for portability?". We address this question by comparing the implementation of our performance-portable algorithm to several native implementations.
We choose the sparse matrix-matrix multiply (SPGEMM) kernel as our benchmark for this study due to its importance in several applications. SPGEMM is a fundamental kernel that is used in various applications such as graph analytics and scientific computing, especially in the setup phase of multigrid solvers. The kernel has been studied extensively in the contexts of sequential [3] , shared memory parallel [4] , [5] and GPU [6] , [7] , [8] , [9] implementations. There are native kernels available in different architectures [5] , [7] , [9] , [10] , [11] providing us with good comparison points.
This paper focuses on SPGEMM from the perspectives of algorithm design and implementation for performance portability and its practical usage. Specifically, we try to address the following questions:
• What are the performance critical design choices and data structures for a SPGEMM algorithm to map well to different architectures (thousands vs hundreds of threads, streaming multiprocessors vs lightweight cores, shared memory vs MCDRAM) ? • How will the kernel serve the needs of real applications, when there is a reuse of the symbolic structure? In addressing these questions we make the following contributions in this paper.
• We design two thread scalable data structures (a multilevel hashmap and memory pool) to achieve performance portability, and a graph compression technique to speedup symbolic factorization phase of SPGEMM.
• We design a hierarchical, thread-scalable SPGEMM algorithm and implement it using the Kokkos programming model [1] . Our implementation is available at https://github.com/trilinos/Trilinos. • We evaluate the performance portability of our method on various platforms, including traditional CPUs, KNLs, and two different GPU architectures. We show that our method outperforms 4 native methods on 2 different GPU architectures, (up to 17× speedup). We also show that it has better thread-scalablilty than native OpenMP methods on KNLs, where it obtains 101× speedup on 256 threads w.r.t. its sequential run.
• We also present results for the practical case of matrix structure reuse, where the method is up to 3× faster than best native OpenMP method. The rest of the paper is organized as follows: Section II covers the background for SPGEMM. Our SPGEMM algorithm and related data structures are described in Section III. Finally, the performance comparisons that demonstrate the efficacy of our approach is given in Section IV.
II. BACKGROUND Given matrices A of size m × n and B of size n × k SPGEMM finds the m×k matrix C s. t. C = A×B. Multigrid solvers use triple products in their setup phase, which is in the form of A coarse = R × A f ine × P (R = P T if A f ine is symmetric), to coarsen the matrices. SPGEMM is also widely used in the literature for various graph analytic problems.
In the literature, most parallel SPGEMM methods follow Gustavson's algorithm [3] (Algorithm 1). This algorithm iterates over rows of A in a 1D fashion (line 1) to compute all entries in the corresponding row of C. Each iteration of the second loop (line 2) accumulates the intermediate values of multiple columns within the row using an accumulator. The number of the necessary multiplications to perform this matrix multiplication is referred as f m (there are f m additions too) for the rest of the paper.
Algorithm 1 1D row-wise SPGEMM for C = A × B. We use matlab notation, i.e., C(i, :) and C(:, i) refer to i th row and column of C, respectively.
Require: Matrices A, B 1:
//accumulate partial row results 4:
Design Choices: There are three design choices that can be observed in Algorithm 1: (a) the partitioning needed for the iteration, (b) how to determine the size of C as it is not known ahead of time, and (c) the different choices for the accumulators. The key differences in past work are related to these three choices in addition to the target parallel programming model (distributed memory vs shared memory). This section gives a brief background of the different choices and what is explored in the literature with a summary of different methods listed in Table I .
Different partitioning schemes have been used in the past for SPGEMM. A 1D method [12] partitions C in single dimension, and each row (or column) is computed by a single execution unit. On the other hand, 2D [4] , [13] and 3D [9] , [14] methods assign each nonzero of C or each multiplication to a single execution unit, respectively. Hypergraph model [15] , [16] based partitioning schemes have also been used in the past. 1D row-wise is the most popular choice for the scientific computing applications. Using other partitioning schemes just within SPGEMM will require reordering and maintaining a copy of one or both of the input matrices when used within a scientific computing application. There are also hierarchical algorithms where rows are assigned to first level of parallelism (blocks or warps), and the calculations within the rows are done using the second level parallelism [6] , [10] , [7] . In this work, we use a hierarchical partitioning of the computation where the first level will do 1D partitioning and the second level will exploit further vector parallelism.
The next design choice is to determine the size of C. Finding the structure of C is usually as expensive as finding C. There exists some work in the literature to estimate its structure [17] . However, it does not provide a robust upper bound. It also requires a generation of a random dense matrix, and a sparse matrix-dense matrix multiplication which is not significantly cheaper than calculating the exact size in practice. As a result, one-phase or two-phase methods are commonly used. One-phase methods rely either on finding a loose upper bound for the size of C or doing dynamic reallocations when needed. The former could result in overallocation and the later is not feasible in GPUs. Two-phase methods use the structure of A and B to determine/estimate the structure of C (symbolic phase), before computing C in the second phase (numeric phase). They either find the size of the rows of C or the exact non-zero pattern of C [11] , [6] , [18] , and allow reusing of the structure for different multiplies with the same structure. This is an important use case in scientific computing, where matrix structure stays the same with changing values. For example, Algorithm 2 presents an example of a nonlinear system solve. Given a mesh, a nonlinear solve consists of solving series of linear systems with the same non-zero pattern but different values. The structure of A k remains the same, but the values are assembled in each iteration of k (line 6) based on the solution of the previous nonlinear system. Multigrid solvers usually exploit this in their reuse case [19] . Given the matrix A k , the solve method in Algorithm 2 constructs a multigrid hierarchy of some number of levels (maxl). At each level l (l = 1 . . . maxl) the corresponding matrix is coarsened using a triple matrix product
The structure of the prolongation (P * l ) and restriction (R * l ) operators as well as the linear system A * l stays same over all the nonlinear systems. Therefore, the symbolic phase of a two-phase SPGEMM method can be executed only once per level of the first linear system solve, and can be reused for the rest of the linear systems. In this work, we use a two-phase approach, and we aim to speedup the symbolic phase using matrix compression. //nonlinear solve 5:
//calculate residual 8:
//solve problem -using multigrid 10:
//update the solution 12:
The third design choice is the data structure to use for the accumulators. Some algorithms use a dense data structure of size k. However, such thread private arrays are not scalable on massively threaded architectures. Therefore, sparse accumulators such as heaps or hashmaps are usually preferred in parallel implementations. In this work, we use multi-level hashmaps as sparse accumulators.
Related Work: There are a number of distributedmemory algorithms for SPGEMM [13] , [15] , [16] , [14] , [12] . Most of the multithreaded SPGEMM studies [4] , [14] , [10] , [8] , [5] are based on Gustavson's algorithm. They usually differ in the data structure used for row accumulation. Some use dense accumulators [4] , others a heap with an assumption of sorted columns in B rows [14] , or a sorted list with row merges [10] , [8] or other sparse accumulators [5] . Most of the SPGEMM algorithms for GPUs are hierarchical. CUSP [9] uses a 3D algorithm where each multiplication is computed by a single thread and later accumulated with a sort operation (ESC). Such global expansion results in high memory requirements. AmgX [6] follows a hierarchical Gustavson algorithm. Each row is calculated by a single warp, and multiplications within a row are done by different threads of the warp. It uses 2-level hash map accumulators, and does not make any assumption on the order of the column indices. On the other hand, row merge algorithm [8] and its implementation in ViennaCL [10] use merge sorts for accumulations of the sorted rows. bhSPARSE [7] exploits this assumption on GPUs. It has methods to predict the size of the result matrix. It performs binning based on the size of the result rows and chooses different accumulators based on the size of the row. The kokkos-parallel hierarchy consists of teams, threads and vector lanes. Table II shows how these terms map to execution units on GPUs, KNLs and CPUs. The mapping is the same for KNLs and CPUs. A team in Kokkos handles a workset assigned to a group of threads sharing resources. On GPUs, a team is mapped to a thread block, which has access to a software managed L1 cache. A team on the KNLs can be the threads sharing either the DDR memory, or the L2/L1 cache. We use the hyperthreads that share an L1 cache as the team. There is no one-to-one mapping from kokkos-teams to number of execution units used. That is the number of teams even on the CPUs is much higher than the number of execution units. An execution unit is assigned various number of teams. A kokkos-thread within a team maps to a warp (half, quarter warp, etc.) on the GPUs and work on a single thread on the KNLs. A kokkos-thread uses multiple vector lanes. The vector lanes map to threads within a warp in the GPUs and the vectorization units on the KNLs. We will use the terms, kokkos-teams/teams, kokkos-threads/threads and vector lanes in the rest of the paper.
The portability provided by Kokkos comes with some overhead. For example, template meta-programming used heavily in the Kokkos model results in some of the current compilers failing to perform some optimizations. Typically, portable data structures in Kokkos have some small overheads as well. Portable methods also avoid exploiting any architecture-specific assumption or instructions. However, all these overheads can be overcome by careful algorithm design unless the problem sizes are very small. The overall structure of our SPGEMM method, KKMEM, is given in Algorithm 3. It follows a two-phase approach, in which the first (symbolic) phase computes the number of nonzeros in each row (lines 4 − 5) of C, and second (numeric) phase (line 9) computes C. Both phases use the CORE SPGEMM kernel with small changes to the input. The amount of work is typically doubled in such two-phase approaches. The main difference is that, the symbolic phase does not use matrix values, it does not perform floating point operations. We aim to improve this phase and reduce memory usage by performing compression of B (Line 4).
III. ALGORITHM

A. Core SPGEMM Kernel
The core SPGEMM kernel used by both symbolic and numeric phases follow a hierarchical, row-wise, algorithm. This core SPGEMM kernel is given in Algorithm 4. A simplified example of the kernel is shown in Figure 1 . The allocate the first level accumulator L 1 7:
if L2 not allocated then 12:
allocate the second level accumulator L 2 
13:
L2 allocated ← T rue 14:
if PHASE IS SYMBOLIC then 16:
Crow
release L 2 performance of KKMEM derives from the hierarchical parallelism and two thread-scalable data structures, a memory pool and an accumulator. First, we focus on partitioning the work for hierarchical parallelism. The kernel achieves the first level of parallelism by assigning a set of rows of C to kokkos-teams (loop in Line 4). Each kokkos-thread within the team is assigned a subset of these rows. For example, the first two rows of C is assigned to team-1 (shown in orange in Figure 1 ). Each team has two threads in this example (highlighted in blue and green). Thread-1 of team-1 is assigned to the first row (highlighted in blue). These two levels of team and thread parallelism use a 1D (row-wise) distribution. At the third level, each thread has four vector lanes in this example. Vector lanes are assigned to the non-zeros in a row of B. Thread-1 will use four vector lanes for each row of B it accesses (highlighted in different shades of blue) in succession. B is naturally laid out for this level of vector parallelism as the non-zeros in a row are consecutive. Vector parallelism is implemented using vectorized reads and multiplies on the corresponding rows of B. For example, thread-1 first does a read of A(0, 0), a VECTORREAD(B(0, :)) and the corresponding VECTORMULT followed by the read of A(0, 1), VECTORREAD(B(1, :)) and the corresponding VECTORMULT (loop in Line 7). Given the partitioning above, we can describe the accumulator to accumulate the entries of C. We use a two-level, sparse, hashmap-based accumulator. Accumulators are used to compute the row size of C in the symbolic phase, and the column indices and their values of C in the numeric phase. Once kokkos-threads determine the row they work on, they allocate some scratch memory (Line 6) for their private level-1 (L 1 ) accumulator (not to be confused with the L1 cache). The scratch memory maps to the GPU shared memory in GPUs and the default memory (i.e., DDR4 or high bandwidth memory) on KNLs. If the L 1 accumulator runs out of space, global memory is allocated (Line 12) in a scalable way using memory pools (explained below) for a thread private L 2 accumulator. This L 2 accumulator is used for failed insertions from L 1 , and its size is chosen to guarantee to hold all insertions. Upon the completion of a row computation, any allocated L 2 accumulator is explicitly released. Scratch spaces used by L 1 accumulators are automatically released by Kokkos when the threads retire.
1) Implementation: This subsection focuses on two implementation aspects -length of the vector lanes and size of the accumulators. The length of vector lanes in each kokkos-thread is a runtime parameter, and it is fixed for all threads in a parallel kernel. We set it by rounding the average number of nonzeroes in a row of B (δ B ) to the closest power of 2 on GPUs (upper bound by the warp size). Kokkos sets the length depending on the compiler and underlying architecture specifications on the KNLs.
The size of L 1 accumulators depends on the available shared memory on GPUs. In the numeric phase, the size of the L 2 accumulator (in the global memory) is chosen to equal the "maximum row size in C" (MAXRS) to guarantee enough space for the work in any row of C. MAXRS is not known before symbolic so this phase uses an upper bound. The upper bound is the maximum number of multiplies (MAXRF) required by any row. This is found by summing up the size of rows of B that contributes to a row. In contrast to GPUs, both L 1 and L 2 accumulators are in the same memory space on KNLs/CPUs (DDR or MCDRAM). Since there are more resources per thread on the KNLs/CPUs, we choose the size of L 1 accumulator big enough to hold MAXRF or MAXRS depending on the phase. Even this is usually small enough to fit into L1 or L2 caches of KNLs/CPUs.
B. Compression
This section addresses the problem of compressing the graph of B in the symbolic phase. This method, based on packing columns of B as bits, can reduce the size of B up to 32×. The graph structure of B encodes binary relations -existence of a nonzero in (i, j) or not. This can be represented using single bits. We compress the rows of B such that 32 (64) columns of B are represented using a single integer (long integer) similar to the color compression idea [20] . In this scheme, the traditional column index array in a compressed-row matrix is represented with 2 arrays of smaller size: "column set" (CS) and "column set index" (CSI). Set bits in the CS denote existing columns. That is, if Compression reduces the problem size, allows faster rowunion operations using BITWISEOR, and makes the symbolic phase more efficient. The reduction in row lengths of B also reduces the calculated MAXRF, the upper bound prediction for the required memory of accumulators in the symbolic phase. There exist more complicated graph compression methods in the literature. Some uses delta encoding to compress the columns of a row [21] , while others seek for similarity of the rows (block structures) [22] , [23] , [24] . These expensive methods can be amortized on repetitive or computationally intensive problems. We avoid using such expensive methods, and show the effectiveness of our lightweight compression method in Section IV-C.
C. Memory Pool
Algorithm 4 requires a portable, thread-scalable memory pool to allocate memory for L 2 accumulators when a row of C cannot be fit in a L 1 accumulator. The memory pool is allocated and initialized before the kernel call and services requests to allocate and release memory from thousands of kokkos-threads. As a result, allocate and release has to be thread scalable. Allocate function returns a memory chunk to a thread and locks it. This lock is released when the thread releases the chunk back to the pool. The memory pool reserves NUMCHUNKS many memory chunks, where each has fixed size (CHUNKSIZE). CHUNKSIZE is chosen based on the MAXRF, and MAXRS in the symbolic and numeric phases, respectively. The memory pool has two operational modes: unique (non-unique) mapping of chunks to threads (ONE2ONE and MANY2MANY).
The parameters of the memory pool are architecture specific. NUMCHUNKS is chosen based on the available concurrency in an architecture. It is an exact match to the number of threads on the KNLs/CPUs. On GPUs, we over estimate the concurrency to efficiently acquire memory. We use an upper bound for the maximum allocated memory for the pool (such as O(nnz C )), and reduce the NUMCHUNKS if the memory allocation becomes too expensive on GPUs. CPUs/KNLs use ONE2ONE and GPUs use MANY2MANY. The allocate function of the memory pool uses thread indices. These indices assist the look up for a free chunk. Pool returns the chunk with the given thread index on CPUs/KNLs (ONE2ONE mode). This allows CPU/KNL threads to reuse local NUMA memory regions. It starts a scan from the given thread-index until an available chunk is found on GPUs. It also helps the scan of GPU threads as they start their scan from different indices.
D. HashMap Accumulator
This section describes the hashmap based accumulator that supports parallel insertions/accumulations from multiple vector lanes. It is thread-private within kokkos-threads, so it needs to be highly scalable in terms of memory to fit within a small scratch space. The hashmap accumulator here extends the hashmap we used in [24] for parallel insertions. It consists of 4 parallel arrays as shown in Figure 2b , which shows an example of a hashmap that has a capacity of 8 hash entries and 5 (key, value) pairs. Hashmap is stored in a linked list structure. Ids and V alues stores the (key, value) pairs, which corresponds to column indices and their numerical values in numeric phase, to CSI Table III . KKMEM method is implemented using the Kokkos library (Trilinos 12.8 release), and publicly available in Trilinos. Each run reported in this paper is the average of 5 executions with double precision arithmetic and 32 bit integers. We evaluate matrix multiplications in the forms of P T × A × P and A × A depending on the application domain of the corresponding matrices. The experiments use 29 (20 on Haswell and KNL) matrices from various multigrid problems and UF sparse matrices [25] used in [7] , [14] (for A × A). The problems are listed in Table IV . The experiments are organized in four parts. Section IV-A and Section IV-B evaluate the performance of KKMEM on CPUs/KNLs and GPUs respectively. Section IV-C evaluates the effect of the compression method. Overall results are summarized in Section IV-D.
A. Experiments on CPUs and KNLs
In this subsection, we compare KKMEM against ViennaCL (OpenMP) and the two SPGEMM methods provided in Intel Math Kernel Library (MKL) on multicore CPUs and KNLs. In the experiments, mkl_sparse_spmm routine in MKL's inspector-executor is referred as MKL1, and mkl_dcsrmultcsr routine is referred as MKL2. mkl_dcsrmultcsr is used with the option to sort the output rows, as it does not return sorted outputs. However, it requires sorted input rows. In the multigrid hierarchy the output matrices in one level is the input for the next level. This constrains us to set the outputs to be sorted. Both MKL routines are 2 − 3× slower for the first call than the following calls in any run. Even though an application using MKL will observe this difference, we exclude the first run, for these routines. Thus, the comparisons against MKL are conservative. The test dataset includes 12 multigrid multiplications from Table IV, and 8 A × A multiplications for the largest matrices from our dataset. refers to the case where the symbolic phase is reused (recomputed) for every call to numeric phase. Two-phase algorithms (KKMEM and MKL2) benefit from the reuse. Although ViennaCL runs in two-phases, the public interface does not support the 2-phase usage. Hence we do not reuse the symbolic structure for ViennaCL.
In the NoReuse case MKL2 is up to 22% faster than KKMEM on smaller number of threads. However, KKMEM scales better and is the faster method after 16 threads (not shown). It is 2× faster than MKL2 on 64 threads. Native implementation in ViennaCL is typically slower than the other three variants. MKL1 is the best method on traditional CPUs. KKMEM is designed to be thread-scalable for massively threaded architectures. The introduced overhead costs are usually not amortized on smaller number of threads. However, it scales better than MKL1. It is 1.90× slower than MKL1 on sequential run, however the performance difference drops to 16% as the number of threads increase to 64 threads. The traditional CPU results are shown for demonstrating portability. It is not our target platform. Two-phase methods take advantage of reusing the symbolic structure in Reuse experiments. MKL2 and KKMEM run up to 45% and 80% faster w.r.t. their NoReuse runs. KKMEM achieves a 51× speed up on 64 threads (30× w.r.t. seq. Reuse). It is slower than MKL1 only when the resources are underutilized and become the fastest of the four methods beyond four threads, where it is 1.53× faster than MKL1. Figure 4 shows the speedups of the algorithms w.r.t. sequential KKMEM (on DDR4) on KNL. It differentiates runs using MCDRAM and DDR4 and the NoReuse/Reuse cases. In these experiments, we use 64 of 68 cores with 2 and 4 hyperthreads on 128 and 256 threads. MKL1 does not complete in 1000 seconds or fails for 7/20 instances with 256 threads, for which we proportionally scale the speedup of the dataset that it runs so we can find a geometric mean (assuming that Speedup (20, 256) Speedup (20, 128) = Speedup (13, 256) Speedup (13, 128) ). We compare the performance of the algorithms on both MCDRAM and DDR4. MCDRAM provides not only more bandwidth, but also more parallelism for the memory controllers compared to DDR4. When neither the bandwidth nor the memory controllers are saturated, the performance is expected to be similar on DDR (Figure 4a ) demonstrate the strength of a thread-scalable KKMEM algorithm. MKL1 is initially 1.74× faster than KKMEM on single-thread runs, however, the difference reduces to 8% on 64 threads. KKMEM becomes 1.28× and 1.78× faster than MKL1 on 128 and 256 threads, respectively. MKL1 performance drops on 128 and 256 threads suggesting that it is memory latency bound on 128 threads. On the other hand, KKMEM scales up to 128 threads, and its performance stays almost constant on 256 threads, which suggests that it is memory bandwidth bound on 256 threads. As in the multicore experiment, the performance of ViennaCL and MKL2 is slightly lower than MKL1 and KKMEM for NoReuse.
When the symbolic structure is reused, the performance of MKL2 and KKMEM improves up to 1.48× and 1.93× w.r.t. their "NoReuse" performance. KKMEM has similar performance to MKL1 on smaller number of threads, however it scales better beyond four threads. It is 2.47× (3.45×) faster than MKL1 on 128 (256) threads. MKL2 is usually slower than MKL1 even in Reuse case.
In general, storing matrices in MCDRAM improves the performances of all the algorithms (Figure 4b) . KKMEM does not saturate the bandwidth or the memory controllers on DDR4 up to 32 threads (not shown), therefore the performance on MCDRAM is very close to its performance on DDR4 (at most 6% difference). After this point the speedups on MCDRAM are higher (up to 1.53×), and KKMEM scales well up to 256 threads. It results in a 101× and 186× speedup with 256 threads for NoReuse and Reuse case, respectively. The respective performance of ViennaCL and MKL2 improve up to 1.14× and 1.32×, w.r.t. their performances on DDR4. However, they are still slower than KKMEM and MKL2 for both "NoReuse" and "Reuse" cases. More memory controllers provided in MCDRAM significantly helps the latency-bound problem of MKL1 on 128 and 256 threads. MKL1 scales up to 128 threads on MCDRAM, after which it hits the bandwidth bound. As a result, MKL1 improves its performance up to 2.77× in MCDRAM. Although more bandwidth provided by the hardware helps to improve the scalability of MKL1, it still hits the memory bound earlier than KKMEM. In general, storing matrices in MCDRAM pays good dividends for larger thread counts as the larger bandwidth can be effectively utilized. This is reflected in the 101.70× speedup for MCDRAM as opposed to the 66.46× speedup for the DDR4 when using 256 threads.
B. Experiments on GPUs
We evaluate KKMEM against CUSP (0.5.1), bhSPARSE, cuSPARSE, ViennaCL (1.7.1) on two GPUs. Table IV shows GFLOPS and execution time of KKMEM, and its speedup w.r.t. other methods on 28 different matrix multiplications on K80 GPUs. KKMEM and cuSPARSE are the most robust methods that can multiply all the matrices (except cage15, since the output matrix does not fit into memory). However, cuSPARSE is the slowest among the methods compared. ViennaCL, bhSPARSE and CUSP run out of memory for 21, 11 and 5 multiplications, respectively (excluding cage15). The bottom row shows the geometric mean of the speedup of KKMEM w.r.t. each algorithm for the set of matrices the method works. In general, the fastest algorithm is KKMEM, which is on average 3.80×, 1.59×, 1.47× and 3.19× faster than cuSPARSE, ViennaCL, BhSPARSE, and CUSP, respectively. KKMEM obtains the best performance on 22/30 multiplications. On Laplace R A and webbase multiplications, KKMEM is slower than BhSPARSE and CUSP. It is possible to change the vector length and improve the performance of KKMEM by 1.47× and 2.80×, so that KKMEM has the best performance on these two matrices. However, we only report the performance with the default auto-parameter selection mechanism, and leave exploration of the better parameterization as future work. Overall, KKMEM achieves the best performance, without sacrificing robustness, mainly due to the hierarchical algorithm that maps well to the GPUs and the thread scalable data structures that can effectively use the GPU hardware features.
1) K80 Results:
2) AmgX comparisons: In Table V , we compare the performance of KKMEM against the SPGEMM method in AmgX library [6] . This method is neither open-source, nor there exists a public interface for SPGEMM. Therefore, there is no way for us to compare with this method. However, we have been provided the performance numbers listed in the table for K80 GPUs. The performance of AmgX is better than other native methods, and more close to the performance of KKMEM. Execution time of both methods are usually in the orders of the milliseconds. We use Kokkos which is a wrapper around the CUDA library. It adds some overheads on the execution time, which are more likely to be visible in such tiny problems. KKMEM outperforms AmgX in the largest two matrices. Moreover, KKMEM is better than AmgX on webbase, since it is reported to be at least 3× slower than CUSP [6] . In general, our portable method achieves similar performances to native AmgX method without using any architecture specific intrinsics functions as in AmgX.
3) P100 results: Table VI shows GFLOPS and execution time of KKMEM, and its speedup w.r.t. other methods on 29 different matrix multiplications on P100 GPUs. In general, performances of all methods improved w.r.t. K80 GPUs, and they run on larger set of matrices. KKMEM, CUSP, bhSPARSE, ViennaCL, and cuSPARSE improved their performances by 3.29×, 1.53×, 3.74×, 4.10×, and 5.26×, respectively (on the set of multiplications they run on both GPUs). With a faster GPU, the difference in execution time narrowed. The difference in execution time is negligible for the smaller matrices. Even in P100 KKMEM is the most robust and the fastest method. It obtains the best performance on 17 multiplications. In addition, the KKMEM times on ldoor, dielFilterV3real and delaunay n24 and cage15 matrices are comparable to the times achieved in the distributed-memory implementation in [14] using 512 to 4096 cores. For example, the performance of KKMEM on delaunay n24 on a single P100 GPU is comparable to using 2048 to 4096 Intel Ivy Bridge cores [14] .
The effect of the memory pool: Kokkos-threads use memory pool to allocate memory for their L 2 accumulators when their L 1 accumulator is full. In the numeric phase, the size of each row of the result matrix is known beforehand. It is possible to allocate 4 arrays in the size of result matrix and allow threads to directly use them as accumulators to avoid using memory pools. We implemented this idea in the numeric phase to evaluate the performance of memory pool. The new method increases the memory requirement, where it runs out of memory on 3 multiplications on K80 GPUs (audi, dielFilterV3real, Brick R A), and on 1 multiplication on P100 (cage15). However, on average, the new method improves the results of numeric phase by 6% and 8% on K80, and 2% and 6% on P100 over using the memory pool (over all matrices). This demonstrates that memory pool increases the robustness of the algorithm by reducing the memory requirement, with a small increase on the runtime.
C. The effect of the compression
The compression technique is applied in the symbolic phase of KKMEM. This is a critical process to reduce the time and the memory requirements of the symbolic phase. At the beginning of this phase, the sizes of the rows of C are unknown. KKMEM estimates the maximum row size as MAXRF as used in the literature [10] , [7] . The size of the accumulators are fixed to these sizes, which might cause memory problems on massively threaded architectures. By reducing the size of B using the compression technique, we reduce not only the number of insertions to hashmap accumulators (originally as many as f m ), but also the estimated maximum row size which in turn reduces the size of the accumulators used. We list the calculated maximum row sizes and the required size for accumulators with and without compression in Table VII together with the reductions on the size of B and the number of hashmap insertions. On average, the compression reduces the size of B, the number of insertions and the required memory by 57%, 59% and 53%. There are cases where compression increases the required size of an accumulator due to the additional array needed. However, in most of these cases, the size of accumulators are already very small. Therefore these cases are not likely to create memory issues. The memory reductions are more significant where it is more critical. Starting from Brick R AP multiplication, the required accumulator size per thread becomes more than 16KB, and gets as high as 2.18MB on Empire R A matrices. In these sets of matrices, compression reduces the memory requirement on average by 74% and up to 96% on certain matrices. Compression usually reduces the runtime of the symbolic phase. When the reduction on insertions is low, it might not amortize compression cost. It is more successful when columns of the right hand side matrices are packed. Compression achieves lower reductions when the columns are spread out, which is the case for multigrid matrices (A and AP).
D. Overall Results
From previous experiments it is clear that the best method is different for different problems in different architectures. However, we allow a meta-algorithm that can choose the best method (excluding KKMEM) for a given problem in a given architecture. We can then compare KKMEM to this meta-algorithm that always chooses the best method for very conservative speedup numbers. Table VIII gives the geometric mean of the execution times of the meta-algorithm ("best method") for each instance (excluding KKMEM) on 19 matrices (excluding small matrices and cage15) and KKMEM on all architectures. These results also allows us to compare different architectures. First, the fastest overall multiplication times are obtained in P100 GPUs. Overall, performance achieved by the portable kernel is better than the best method by 17%, 4% and 54% on KNL-DDR4, Table IV : The matrices and multiplications used throughout this paper. The (#rows, #cols, #nnz) of the input matrices and #multiplications performed are given in the first four columns. The right side lists the execution time in seconds and GFLOPS of KKMEM on K80, and its speedup w.r.t other SPGEMM methods. Blank spaces indicate the method failed. The matrices are sorted based the success of the algorithms, then by the #multiplications. The matrices 2cubes sphere, cage12, webbase, offshore, filter3D, cant, hood, pwtk, ldoor are used as they are repetitively used in the literature [6] , [7] . We run these matrices only on GPUs 
V. CONCLUSION
We described performance-portable, thread-scalable SPGEMM kernel for highly threaded architectures. We conclude by answering the primary question we started with -"How much performance will be sacrificed for portability?". We conclude that we do not sacrifice much in terms of performance on highly-threaded architectures. This is demonstrated by the experiments comparing our portable method against 5 native methods on GPUs, and 2 native methods on KNLs. Our SPGEMM kernel is also the most robust among publicly available codes whereas we see several failures with other kernels with the exception of cuSPARSE. It is possible to adapt our algorithm in a native method and improve it with architecture specific techniques. However, in our experience, there is only small room left to improve the portable kernel in these two architectures.
We addressed the questions raised in Section I throughout the paper. We summarize the conclusions here.
• The key to performance portability is to design an algorithm that relies on thread scalable data structures, uses memory efficiently and maps correctly to the hierarchical parallelism. The data structures and compression technique play a major role in performance and robustness.
• Designing for application use cases such as the reuse results in significantly better performance than past methods.
