Sparse matrix-matrix multiplication (SpGEMM) is an important primitive for many data analytics algorithms, such as Markov clustering. Unlike the dense case, where the performance of matrix-matrix multiplication is considerably higher than matrix-vector multiplication, the opposite is true for the sparse case on GPUs. A significant challenge is that the sparsity structure of the output sparse matrix is not known a priori, and many additive contributions must be combined to generate its non-zero elements. We use synthetic matrices to characterize the effectiveness of alternate approaches and devise a hybrid approach that is demonstrated to be consistently superior to other available GPU SpGEMM implementations.
INTRODUCTION
Sparse matrix-matrix multiplication (SpGEMM) is a fundamental operation required in several domains. It is a primitive operation in many data-analytic and graph-analytic algorithms [6, 13, 19, 23] as well as in algebraic multigrid methods [3]. There have been a number of recent efforts to develop GPU implementations for SpGEMM [2, 8, 9, 14, 16, 18, 20], but achieving high performance for SpGEMM is very challenging. Unlike the dense case, where achieved performance with matrix-matrix multiplication (in GFLOPs) is considerably higher than sparse matrix-vector multiplication, performance of SpGEMM is generally much lower than SpMV for all known GPU implementations. Table 1 shows the achieved SpGEMM performance with Nvidia's cuSPARSE library on a set of sparse matrices from the Florida sparse matrix collection [10, 22]. This set of sparse matrices has been used in recent work to evaluate sparse matrix multiplication [2, 14, 16, 18], and we use the same set of sparse matrices in this paper. As has been done in prior work, we report performance of SpGEMM implementations using the same matrix for both input operands, i.e., we compute C = A * A.
This enables better insights into the impact of matrix structure on SpGEMM performance, which would be more difficult to isolate if two sparse matrices with different sparsity structures were used.
It can be seen that the performance achieved in computing the product of each matrix with itself (A*A) does not even reach 5 GFLOPs (single precision), and is less than 0.5 GFLOPs for more than half the matrices in the test set. In contrast, on the same GPU, the Nvidia cuBLAS routine for dense MM achieves over 2500 GFLOPs for a 4k × 4k matrix.
There are a number of reasons why achieving high GPU performance with SpGEMM is very challenging:
• Low Arithmetic Intensity: If we define the arithmetic intensity of a computation as the ratio of the number of arithmetic operations to the number of distinct data elements accessed, dense MM for computing C = A*A has an arithmetic intensity of 0.5 × N FMAs (Fused Multiply-Adds), since N^3 FMAs are executed and N^2 + N^2 distinct data elements are accessed for the input and output matrices. The compression ratio (CR) for an SpGEMM computation is defined as the ratio of the number of FMA operations to the number of non-zeros in the result matrix. Thus, CR serves as an upper bound on the arithmetic intensity for SpGEMM. It can be observed from Table 1 that CR is no more than 30 for any of the test matrices, and is less than 5 for more than half of the matrices. The arithmetic intensity can be used to upper-bound the operational intensity [24], defined as the ratio of the number of executed computational (here floating-point) operations to the number of bytes of data moved between main memory and the GPU cores. Unless the operational intensity is sufficiently high, a computation will necessarily be memory-bandwidth limited.
The very high arithmetic intensity of dense MM means it is not memory-bandwidth limited on GPUs, enabling it to achieve a significant fraction of peak performance. In contrast, the low CR for SpGEMM means that it is impossible to achieve anywhere close to peak performance on GPUs.
• Index Matching: The achieved performance for SpGEMM in Table 1 is in all cases much lower than the roofline-based [24] bound using the CR and the achievable bandwidth between global memory and the GPU cores. This is because of the index-matching problem for SpGEMM, explained in greater detail in the next section. The problem is that the sparsity structure of the result matrix (i.e., which elements in each row are non-zero) is unknown, and further it is non-trivial to locate where a given contribution from the product of a_ik with b_kj to the result non-zero c_ij should be accumulated. As explained in the next section, several approaches have been proposed to address this index-matching problem, and each of these approaches entails additional data access and computational operations.
• Load Balancing: Unlike dense MM, which features the same number of elements in every row (column) and the same number of arithmetic operations for each result matrix element, sparse matrices that arise in practice exhibit significant non-uniformity. Our test set of sparse matrices includes structural matrices, web connection graphs, protein models, macroeconomic models and circuit simulation graphs. The chosen matrices are diverse and inherently irregular in their sparsity structure. Table 1 shows the mean (µ) and standard deviation (σ) of the number of non-zeros per row. It can be seen that most matrices have significant non-uniformity across different rows. Table 1 also shows the distribution of work (number of operations) per row, grouped into four bins: i) 1-256 operations, ii) 257-1024 ops., iii) 1025-4K ops., and iv) more than 4K ops. It can be seen that different matrices can have very different distributions of work. (A sketch of how these per-row work counts and the compression ratio can be derived from CSR inputs is shown after this list.)
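To make the per-row work counts and the compression ratio concrete, the following sketch (our illustration, not code from any of the cited libraries; the CSR field names are generic) computes the number of FMAs needed for each row of C = A*B directly from the CSR row pointers, and derives CR once nnz(C) is known:

#include <cstdint>
#include <vector>

// Per-row FMA (work) count for C = A*B with both matrices in CSR.
// flops[i] = sum over non-zeros A(i,k) of nnz(row k of B).
std::vector<int64_t> rowFlops(int n,
                              const std::vector<int>& a_row_ptr,
                              const std::vector<int>& a_col_idx,
                              const std::vector<int>& b_row_ptr) {
    std::vector<int64_t> flops(n, 0);
    for (int i = 0; i < n; ++i)
        for (int p = a_row_ptr[i]; p < a_row_ptr[i + 1]; ++p) {
            int k = a_col_idx[p];
            flops[i] += b_row_ptr[k + 1] - b_row_ptr[k];
        }
    return flops;
}

// Compression ratio: total FMAs / nnz(C). nnz(C) requires a symbolic
// pass (see the "precise" method later); here it is passed in as a parameter.
double compressionRatio(const std::vector<int64_t>& flops, int64_t nnz_c) {
    int64_t total = 0;
    for (int64_t f : flops) total += f;
    return static_cast<double>(total) / static_cast<double>(nnz_c);
}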
In this paper, we develop a hybrid SpGEMM algorithm for GPUs, based on the following rationale:
• Many sparse matrices exhibit significant non-uniformity in the number of non-zero elements in different rows/columns, with some rows having very few non-zeros, while others have thousands or more. Hence, it is very challenging to achieve both inter-thread-block and intra-warp load balance, while avoiding thread divergence, with any approach that uses a uniform row-wise division of work among threads/warps/thread-blocks.
• Achieving load balance is much more feasible with an element-wise division of work among threads, but all known SpGEMM approaches (elaborated later) that take this approach suffer from high data access costs.
• We therefore develop a hybrid SpGEMM algorithm that combines two approaches with complementary characteristics: i) for rows which have few operations and thus low concurrency potential, we use an element-wise parallel strategy, and ii) for rows with a large number of operations, we resort to a row-wise work-partitioning strategy that prioritizes minimization of data movement.

In this paper, we make the following contributions:
• We characterize different approaches to SpGEMM using synthetic matrices that remove inter-row non-uniformity.
• We develop an efficient GPU implementation of SpGEMM based on the use of a scatter-vector.
• Based on the performance characterization using synthetic matrices, we devise a hybrid SpGEMM implementation that uses different approaches for light versus heavy rows, based on the number of operations needed for them.
• We present an experimental evaluation using 23 sparse matrices from different domains that demonstrates higher performance with the new GPU SpGEMM implementation than other SpGEMM implementations for GPUs.
Challenges to Parallel Formulation of SpGEMM
As pointed out by Buluc and Gilbert [4], dividing the work among different threads and allocating memory for the result matrix are two major challenges for parallel formulations of SpGEMM. The number of non-zeros in the result matrix C is not known in advance, which complicates memory management for the result matrix.
Load-balanced work division among threads is non-trivial for SpGEMM. This is mainly because the number of non-zeros in different rows of a sparse matrix can differ greatly, so the work per row is non-uniform. One approach to divide the work among the threads in a uniform fashion is to derive work-lists and divide the work from the work-lists among the threads in a cyclic fashion. But this approach has an inherent drawback: it results in a large volume of data movement to/from GPU global memory, thereby limiting performance.
In the rest of this section, we describe how the above two concerns have been addressed by previous SpGEMM implementations. In later sections of the paper, we discuss how we address these issues and develop a more efficient implementation of SpGEMM.
A number of considerations are important in developing efficient GPU applications. When dividing work among different threads on a GPU, the approach to work distribution affects several important metrics: warp occupancy, data movement volume and efficiency, and thread divergence. Consider the case of performing SpGEMM using work-lists. This approach has very high concurrency, but the drawback is that it involves high data movement. If we try to reduce this data movement using shared memory, we instead face the problem of low occupancy. Thus, a high-performance SpGEMM implementation must balance all of these concerns.
As stated earlier, the size of the result matrix is not known a priori. In the next section, we briefly describe how this issue is addressed by different approaches.
Alternative Approaches to SpGEMM
2-Phase Approach: In this approach, the number of elements in the result matrix is computed in the first phase and the necessary memory is allocated accordingly. During the second phase, the actual values of the result matrix are computed. There are a number of ways to determine the structure of the result matrix C and allocate space for it: precise, probabilistic, upper bound and progressive [16]. The precise method pre-computes the exact number of non-zero elements in the result matrix. A row-merging approach [14], as well as the SpGEMM methods in cuSPARSE and MKL, use the precise method. Probabilistic methods use random sampling and probability analysis to estimate the number of non-zero elements in the result matrix [1, 7]. The upper bound method calculates an upper bound on the number of non-zero elements. The progressive method starts the sparse matrix computation with an arbitrary initial amount of memory and reallocates memory if more space is required. Gilbert et al. use the progressive method in their parallel implementation [12].
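As an illustration of the precise method, the symbolic first phase can be sketched as follows (a generic sequential sketch under our own naming, not the code of cuSPARSE, MKL or [14]); a dense marker array of size equal to the number of columns records which columns of the current result row have already been counted:

#include <vector>

// Symbolic phase of the "precise" method: count nnz per row of C = A*B
// (CSR inputs) by marking which columns of C receive a contribution.
std::vector<int> symbolicRowCounts(int n_rows, int n_cols,
                                   const std::vector<int>& a_ptr,
                                   const std::vector<int>& a_col,
                                   const std::vector<int>& b_ptr,
                                   const std::vector<int>& b_col) {
    std::vector<int> nnz_per_row(n_rows, 0);
    std::vector<int> last_seen(n_cols, -1);   // last row that touched column j
    for (int i = 0; i < n_rows; ++i) {
        for (int p = a_ptr[i]; p < a_ptr[i + 1]; ++p) {
            int k = a_col[p];
            for (int q = b_ptr[k]; q < b_ptr[k + 1]; ++q) {
                int j = b_col[q];
                if (last_seen[j] != i) {      // first contribution to c_ij
                    last_seen[j] = i;
                    ++nnz_per_row[i];
                }
            }
        }
    }
    return nnz_per_row;
}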
Expand-Sort-Compress (ESC)
This algorithm was developed as part of Nvidia's CUSP library [8, 9] for computing the product of two sparse matrices. It creates an intermediate list of key-value pairs containing all the elemental contributions generated by the arithmetic operations. After this expansion step, all the intermediate pairs are sorted according to their row and column numbers. In the final step, the values belonging to the same row and column are accumulated to produce the result matrix C.
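The sort and compress steps map naturally onto standard GPU primitives. The sketch below (our illustration using Thrust, which CUSP builds upon; it is not CUSP's actual code) assumes the expand phase has already produced one 64-bit key per intermediate product, encoding (row, column) as row * n_cols + col:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

// keys[i] encodes (row, col) of the i-th intermediate product,
// vals[i] = a_ik * b_kj. Both are filled by an "expand" kernel (not shown).
void sortAndCompress(thrust::device_vector<unsigned long long>& keys,
                     thrust::device_vector<float>& vals,
                     thrust::device_vector<unsigned long long>& out_keys,
                     thrust::device_vector<float>& out_vals,
                     int& out_nnz) {
    // Sort step: bring duplicate (row, col) keys next to each other.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
    // Compress step: accumulate values that share the same key.
    out_keys.resize(keys.size());
    out_vals.resize(vals.size());
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      out_keys.begin(), out_vals.begin());
    out_nnz = static_cast<int>(ends.first - out_keys.begin());
}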
Hashing: Hashing is a technique that has been used for SpGEMM [2, 11, 20]. In this method, during the computation step a hash table is created for the elements of the result matrix, with the row/column indices as the key and the computed result as the value of a computed sparse matrix element. A potential race condition is that multiple threads could attempt to modify the same element simultaneously. To avoid this, the update operation on the hash table must be atomic. The cuSPARSE SpGEMM implementation [20] uses a global-memory-based hash implementation. Balanced Hashing [2] is another efficient implementation of this approach, where shared-memory hashing is used as much as possible.
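The accumulation step of a hash-based scheme can be sketched as follows (a generic open-addressing insert of our own, not the cuSPARSE or Balanced Hashing code); keys holds column indices, vals the accumulated values, and EMPTY marks unused slots. The table is assumed to be sized so that every non-zero of the current row fits:

#define EMPTY (-1)

// Accumulate a contribution val into the hash table for one result row.
// keys[] and vals[] may live in shared or global memory; size is a power of two.
__device__ void hashAccumulate(int* keys, float* vals, int size,
                               int col, float val) {
    int h = col & (size - 1);                        // simple modulo hash
    while (true) {
        int existing = keys[h];
        if (existing == col) break;                  // slot already owned by this column
        if (existing == EMPTY) {
            int prev = atomicCAS(&keys[h], EMPTY, col);
            if (prev == EMPTY || prev == col) break; // claimed it (or another thread did, for the same column)
        }
        h = (h + 1) & (size - 1);                    // linear probing
    }
    atomicAdd(&vals[h], val);                        // atomic: several threads may contribute to c_ij
}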
Scatter-Vector
Algorithm 2: Sequential Scatter Vector Implementation
The scatter-vector approach for SpGEMM was devised for CPUs [15]. A dense vector of size n is allocated, where n is the number of columns in B. As the rows of A are processed in serial fashion, this 'scatter vector' stores pointers to the compacted elements of the corresponding result row of C. After each additive contribution is generated, the scatter vector is accessed at the pertinent column index to determine whether the corresponding element already has a valid location in the result row of C. If so, the new intermediate product is added at that position. Otherwise, the intermediate product is written to a new location and the corresponding column id in the scatter-vector is updated with a pointer to this new location.
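A minimal sequential sketch in the spirit of Algorithm 2 (our reconstruction, not the paper's listing; it assumes the output arrays were sized with an upper bound on nnz(C)) is shown below. sv[j] holds the position of c_ij in the compacted result row, or -1 if no location has been assigned yet:

#include <vector>

// Sequential scatter-vector SpGEMM: C = A*B, all matrices in CSR.
void spgemmScatterVector(int n_rows, int n_cols,
                         const std::vector<int>& a_ptr, const std::vector<int>& a_col,
                         const std::vector<float>& a_val,
                         const std::vector<int>& b_ptr, const std::vector<int>& b_col,
                         const std::vector<float>& b_val,
                         std::vector<int>& c_ptr, std::vector<int>& c_col,
                         std::vector<float>& c_val) {
    std::vector<int> sv(n_cols, -1);          // scatter vector: column -> slot in current row
    int nnz = 0;
    c_ptr.assign(n_rows + 1, 0);
    for (int i = 0; i < n_rows; ++i) {
        int row_start = nnz;
        for (int p = a_ptr[i]; p < a_ptr[i + 1]; ++p) {
            int k = a_col[p];
            float a_ik = a_val[p];
            for (int q = b_ptr[k]; q < b_ptr[k + 1]; ++q) {
                int j = b_col[q];
                float prod = a_ik * b_val[q];
                if (sv[j] < 0) {              // first contribution to c_ij: new location
                    sv[j] = nnz;
                    c_col[nnz] = j;
                    c_val[nnz] = prod;
                    ++nnz;
                } else {                      // accumulate into existing location
                    c_val[sv[j]] += prod;
                }
            }
        }
        for (int p = row_start; p < nnz; ++p) sv[c_col[p]] = -1;  // reset for next row
        c_ptr[i + 1] = nnz;
    }
}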
In this paper, we characterize performance of some alternative SpGEMM approaches for GPUs and develop a hybrid implementation that combines multiple strategies.
SCATTER VECTOR IMPLEMENTATION ON GPU
One of the earliest approaches developed for efficient SpGEMM [15] used a scatter-vector, but its use in a GPU implementation of SpGEMM has not been reported. In this section, we describe a scatter-vector based SpGEMM for GPUs. Threads in a thread block (or virtual warp, as discussed later) collectively process a row of matrix A/C, and the rows of A are distributed among the thread blocks in a cyclic fashion. An array of size n * (number of blocks) * (number of warps per block) is created in global memory, where n is the number of columns in B; thus each warp gets its own private scatter vector. Threads in a warp sequentially process the elements of a row of A. The elements of the corresponding row of B are mapped in a cyclic manner among the threads, to compute updates in parallel to different elements of the result row of C.
Algorithm 3: Scatter Vector Implementation on GPU
In this implementation, memory allocation is based on the upper bound method. The number of multiplication operations for each row in the matrix is pre-computed by adding the number of non-zeros in the relevant rows of B. This provides an upper bound on the maximum possible number of non-zeros in each row of C. The time taken for this pre-computation is negligible.
Initially, none of the C elements is assigned a memory location. The first thread that tries to add its contribution to a given C element is responsible for assigning a unique memory location and updating the scatter-vector to point to the new location. Multiple threads may simultaneously try to obtain a unique memory location for different C elements. In order to prevent the competing threads from assigning the same memory location to different elements of C, a locking mechanism is required. Since each row of C has its own memory pool, only the threads working on the same row of C compete for the same lock, and locking is only required when creating a new location. The above locking mechanism is implemented using an atomic add operation on a shared variable count (Algorithm 3). The number of times this atomic operation is performed is equal to the number of non-zeros created in the given row of C.
The update of intermediate products also requires atomic operations if concurrent updates could occur for an element. We avoid this by sequentially processing the elements of a given row of A.
This guarantees that no two threads can simultaneously attempt to update the same non-zero element in a row of C. The update_temp_C function updates the products formed for a given row of C in the temporary space allocated for it. Later, an additional step is required to copy all the products belonging to all the rows from their temporary space to the result arrays C.col and C.val. Also, a simple scan operation on the C.row array is performed to obtain the row pointers of C, which along with C.col and C.val form the final result matrix C in unordered CSR format. The total time for these additional steps is very low relative to the time taken to form the result elements.
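The per-row inner loop of this scheme can be sketched as follows (a simplified illustration of the idea behind Algorithm 3, not the paper's kernel; names such as sv, temp_col and temp_val are ours). Because the non-zeros of the current B row have distinct column indices, lanes of the (virtual) warp never target the same element of C at the same time, and a new slot is claimed with a single atomicAdd on the per-row counter:

// Process one row i of A with one (virtual) warp; sv is this warp's private
// scatter vector (initialized to -1), count points to this row's slot counter.
__device__ void processRow(int i, int lane, int warp_size,
                           const int* a_ptr, const int* a_col, const float* a_val,
                           const int* b_ptr, const int* b_col, const float* b_val,
                           int* sv, int* count,
                           int* temp_col, float* temp_val) {
    for (int p = a_ptr[i]; p < a_ptr[i + 1]; ++p) {   // elements of row i of A, processed in order
        int k = a_col[p];
        float a_ik = a_val[p];
        // Lanes stride cyclically over the non-zeros of row k of B.
        for (int q = b_ptr[k] + lane; q < b_ptr[k + 1]; q += warp_size) {
            int j = b_col[q];
            float prod = a_ik * b_val[q];
            int slot = sv[j];
            if (slot < 0) {                           // first contribution to c_ij
                slot = atomicAdd(count, 1);           // claim a new slot in the row's pool
                sv[j] = slot;
                temp_col[slot] = j;
                temp_val[slot] = prod;
            } else {
                temp_val[slot] += prod;               // distinct j per lane: no race here
            }
        }
        __syncwarp();                                 // finish this B row before the next A element
    }
}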
SPGEMM CHARACTERIZATION USING SYNTHETIC MATRICES
The performance of different SpGEMM algorithms depends on the characteristics of the input matrix. The number of non-zeros per row in these matrices (Table 1) can vary considerably, making it challenging to effectively distribute the work among the threads. Significant variance in the number of operations per row (Table 1) impacts warp occupancy, data movement and thread divergence. In order to systematically characterize the different alternatives, we measure the performance of each algorithm on synthetic sparse matrices having distinct features. Details on the generation of these synthetic matrices are discussed later in Section 7.1. Results from the experimentation with synthetic matrices are presented in Table 2.
Synthetic sparse matrices were created with random column positions but a fixed total number of non-zeros in every row of the matrix. The purpose of this experiment was to remove one source of variability in real matrices, the variance in the number of non-zeros per row, and observe the impact of the number of non-zeros per row on performance. Different matrices were generated, differing in the number of non-zeros per row. When computing C = A*A, each row of C requires exactly the same number of operations, since every row of the source matrix has a fixed number of non-zeros. Table 2 shows the performance achieved by ESC in global memory, ESC in shared memory and the scatter-vector approach. As the work per row increases, the performance of ESC-shared initially increases. However, as the work increases further, the shared memory requirement also increases; this limits occupancy and negatively affects performance. ESC using global memory does not have this constraint. However, ESC-global involves sorting and compression in global memory, which leads to higher global memory traffic and in turn limits performance. The scatter-vector approach performs poorly when the work per row is small: in this case the accesses to the scatter-vector in global memory are random and the probability of L2 cache hits is low. However, the scatter-vector approach neither faces the problem of low occupancy (like ESC-shared) nor is it I/O intensive (like ESC-global). As the work per row increases, the scatter-vector approach performs significantly better.
The experiments indicate that ESC with shared memory is the better choice for rows having low work (total number of ops), and that the scatter-vector approach is better suited for rows with a high operation count. However, the number of non-zeros in a row also impacts the performance of the ESC algorithm. In the next section we develop an alternative variant of ESC in shared memory, with the choice between the two variants made based on the number of non-zeros in a given row.
ESC VARIANTS
For the experiments in the previous section, although the work per row was varied, the number of elements per row of A was not varied. The number of elements present in a row of A can have a significant impact on performance. Consider the following two cases: i) rows of A have a large number of non-zero elements, but they point to very sparsely populated rows of B, and ii) rows of A have very few non-zero elements, but each points to rows of B that are heavily populated. In both cases the work per result row could be similar, but the amount of concurrency is quite different. In this section we show how the classical approach is better suited for the first scenario and a slightly modified approach, called ESC-row-wise, suits the second scenario. The classical parallel ESC algorithm [9] slices the rows of matrix A and, for each row slice, generates the key-value pairs in the expand phase, sorts the generated pairs according to the keys and compresses all the values with the same key. This approach is quite efficient in terms of dividing the work among different threads, since no synchronization or locks on update locations are necessary during the expand phase. At the beginning of the expand phase, each thread has to identify the elements a_ij and b_jk to be multiplied. The cost of identifying the multiplicands is well amortized if the number of non-zeros in the rows of A is large. We refer to the classical ESC approach as ESC-continuous.
If the number of elements in a given row of A is small, and each of these elements points to sufficiently populated rows of B, we can avoid the overhead of identifying the multiplicands. This is done by sequentially iterating over the elements in a row of A and exploiting parallelism by mapping threads cyclically across the elements of the corresponding row of matrix B to compute the intermediate products. Thus, by modifying the expand phase of the classical ESC approach, we can achieve better performance. Since the rows of A are binned according to total operation count, a low number of non-zeros in a row of A implies a higher number of operations per element (since the total op-count remains the same). We call this approach ESC-row-wise.
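The ESC-row-wise expand phase can be sketched as follows (our illustration with hypothetical names). The outer loop over the few non-zeros of the A row is sequential, so each lane knows its multiplicands directly from its lane index, avoiding the search that ESC-continuous performs to map a flat operation index to an (a_ij, b_jk) pair:

// ESC-row-wise expand phase for one row i of A, executed by one (virtual) warp.
// Emits (column, value) pairs for later sorting and compression, starting at out_base.
__device__ void expandRowWise(int i, int lane, int warp_size,
                              const int* a_ptr, const int* a_col, const float* a_val,
                              const int* b_ptr, const int* b_col, const float* b_val,
                              int* out_col, float* out_val, int out_base) {
    int emitted = 0;                                      // products already generated for this row
    for (int p = a_ptr[i]; p < a_ptr[i + 1]; ++p) {       // few non-zeros in this A row
        int k = a_col[p];
        float a_ik = a_val[p];
        int len = b_ptr[k + 1] - b_ptr[k];
        for (int q = lane; q < len; q += warp_size) {     // lanes cycle over the long B row
            out_col[out_base + emitted + q] = b_col[b_ptr[k] + q];
            out_val[out_base + emitted + q] = a_ik * b_val[b_ptr[k] + q];
        }
        emitted += len;
    }
}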
We compared the two ESC variants on synthetic matrices. Figure 1 compares ESC-continuous with ESC-row-wise. We observe that for rows that have very few non-zeros, ESC-row-wise fares better than ESC-continuous. But when the number of non-zeros in a row of A is high, ESC-continuous performs well.
From the characterization experiments in Section 4, it is clear that ESC is preferable for rows with low work, while the scatter-vector approach is preferable for rows with a lot of work. In the next section, we seek to further improve the performance of the scatter-vector approach for such heavy rows.
SCATTER VECTOR WITH MULTIPLE LOCKS
As discussed in Section 2.3, the scatter-vector approach requires locking to acquire new locations. The drawback of this approach is that thread contention is high when all threads compete for a single lock. One way to reduce thread contention is to introduce more locks. However, it must be guaranteed that different threads that update the same C element go through the same lock; otherwise, different threads may assign different memory locations to the same C element. A two-lock scheme can be used where threads trying to acquire a new location choose one of two locks based on whether the column id is even or odd. This ensures that different threads trying to write to the same column index necessarily go through the same lock, guaranteeing that only one of them creates a new location (the one that fails to create the new location adds its contribution to the location generated by the other). The staging area/memory pool is also divided into two: the first lock allocates new locations from the left of the staging area and the second one allocates from the right, ensuring maximum space utilization. Once the entire row is processed, the two contiguous regions are written back to global memory.
Figure 4: Memory Management in Multi-Lock Approach
We further extend our two-lock approach to multiple locks. The main challenge with this approach is that although the total number of operations is known, the number of new locations required per thread is not known a priori. One way to overcome this problem is to use a staging area of size N * max_ops, which would satisfy the space requirements. However, if the staging area is in shared memory, this reduces occupancy and thus limits performance. In the multi-lock approach, we divide the entire staging area into N chunks and each lock owns/guards a chunk. An extra indirection array called lock_selector is used to indicate which lock should be used to acquire a new location. The lock_selector entries are initialized with values from 0 to N-1. For example, consider the use of eight locks. In this case we divide the entire scratch space into 8 chunks. lock_selector[0] is initialized with the value 0, lock_selector[1] with the value 1, and so on. This is shown in Figure 3. When a thread attempts to acquire a new location, it indexes the lock_selector array with its column id modulo N to obtain the lock id. It then performs a lock operation on the corresponding lock. In our example (where N = 8), if a thread attempts to acquire a new location for column 42 it refers to lock_selector[2] (42 % 8 == 2), which initially contains the value 2. Hence the thread tries to acquire lock 2 and, if successful, gets a new location in the chunk owned by lock 2. If a thread acquiring a lock observes that it obtained the last location in the given chunk, it updates the corresponding lock_selector value with the id of the next sequential chunk that has a free location. This is shown in Figure 4. In our example, assume that a thread locking for column 42 observes that it received the last location owned by lock 2; it then updates lock_selector[2] with the value 3. Now assume that another thread requires a new location for column 50: it checks lock_selector[2], which contains 3, and hence tries to get a new location by locking on lock 3. Any other thread trying to obtain a new location for column 50 will also go through lock 3 and will fail (the failed threads add their contribution to the existing location). Once the entire row is processed, each occupied entry in the staging area is written back to global memory. Using this strategy, we reduce thread contention without any extra space overhead.
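The bookkeeping behind the lock-selector scheme can be illustrated with the following serial sketch (our own simplification: the per-chunk locking that protects this state in the GPU kernel is omitted, and the structure and function names are hypothetical):

#include <vector>

// One chunk of the staging area, guarded by one lock in the real kernel.
struct Chunk {
    int base;      // first slot of this chunk in the staging area
    int used;      // slots already handed out
    int capacity;  // total slots in this chunk
};

struct MultiLockAllocator {
    std::vector<Chunk> chunks;         // N chunks, one per lock
    std::vector<int> lock_selector;    // column_id % N  ->  chunk/lock to use

    MultiLockAllocator(int n_locks, int chunk_capacity) {
        chunks.resize(n_locks);
        lock_selector.resize(n_locks);
        for (int i = 0; i < n_locks; ++i) {
            chunks[i] = {i * chunk_capacity, 0, chunk_capacity};
            lock_selector[i] = i;                       // initially selector i points to chunk i
        }
    }

    // Returns the staging-area slot for a new C element in column 'col'.
    // Total staging space equals the row's op-count, so allocation cannot overflow.
    int newLocation(int col) {
        int sel = col % static_cast<int>(lock_selector.size());
        int c = lock_selector[sel];                     // chunk currently serving this selector
        int slot = chunks[c].base + chunks[c].used++;
        if (chunks[c].used == chunks[c].capacity) {     // chunk exhausted: redirect the selector
            int next = c;
            do {
                next = (next + 1) % static_cast<int>(chunks.size());
            } while (chunks[next].used == chunks[next].capacity && next != c);
            lock_selector[sel] = next;
        }
        return slot;
    }
};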
Algorithm 6: 2-lock SV with staging in shared memory
Scatter vector with managed memory
The scatter-vector approach discussed in Section 3 uses an upper bound, based on the number of operations for a row, to allocate memory for the corresponding result row of matrix C. The rows of matrix A are binned based on their operation count (more information about binning is provided in Section 7.2). Each bin contains a set of rows whose operation counts fall in a particular range. For all the rows in a given bin, memory space corresponding to the maximum op-count of that bin is allocated. But this strategy of memory allocation becomes very inefficient at high operation counts, especially for matrices with a high compression ratio. Hence we use a single bin for all rows having an op-count greater than 1024 and adopt a dynamic memory allocation strategy for this bin. A chunk of predefined size is initially allocated to each block. As blocks process rows, they sequentially write the elements into the chunk space they own. If a block runs out of chunk space, a new chunk of memory from the memory bank is allocated.
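The dynamic allocation step can be sketched as follows (our illustration; memory_bank and next_chunk are hypothetical names for the preallocated global pool and its global counter). A block that exhausts its current chunk simply claims the next one with an atomic increment:

// Claim a fresh chunk from the global memory bank once the current chunk is full.
// next_chunk is a single global counter shared by all thread blocks.
__device__ float* grabNewChunk(float* memory_bank, unsigned int* next_chunk,
                               int chunk_elems) {
    unsigned int id = atomicAdd(next_chunk, 1u);   // index of the newly claimed chunk
    return memory_bank + static_cast<size_t>(id) * chunk_elems;
}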
IMPLEMENTATION
7.1 Creating Synthetic Matrices
For the characterization experiments, two sets of synthetic matrices were generated. The first set of experiments was designed to study the performance of different SpGEMM approaches with increasing work per row. For this purpose, matrices were generated with a fixed number of non-zeros in every row of a matrix, differing across the set of synthetic matrices. Consider the case where the work per row is 64; for this we created a matrix A such that each row has 8 elements, each with a random column id. With such a matrix, when computing C = A * A, exactly 64 FMAs are required for every row of C. Since the column ids are chosen randomly, we avoid any structure in the access pattern. The matrices were created with a sufficiently large number of rows (of the order of 200k) so that there is enough work for load-balanced execution across the SMs.
The second set of experiments was designed to study the impact on the performance of the different algorithms when the structure of the rows of A changes but the work per row remains constant. Consider a case where the work per row is held constant at 220 FMAs. We created a matrix A with 10 non-zeros per row, each with a random column id, and correspondingly created a matrix B with 22 non-zeros per row, each with a random column id. Thus, when we compute A * B, the work per row is guaranteed to be 220. Similarly, we created other synthetic matrix pairs A and B having the same work per row but a different number of non-zeros per row of A.
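Our reconstruction of the generator described above is sketched below (illustrative names; values are set to 1.0 since only the sparsity structure matters for these experiments). The same routine covers both experiment sets: the first set squares a single matrix with k non-zeros per row, while the second pairs an A with r non-zeros per row and a B with s non-zeros per row so that r * s stays fixed:

#include <random>
#include <vector>

// Synthetic CSR matrix: every row has exactly nnz_per_row non-zeros at
// random, distinct column positions.
struct Csr {
    std::vector<int> ptr, col;
    std::vector<float> val;
};

Csr makeUniformRandomCsr(int n_rows, int n_cols, int nnz_per_row, unsigned seed = 42) {
    Csr m;
    m.ptr.resize(n_rows + 1);
    m.col.reserve(static_cast<size_t>(n_rows) * nnz_per_row);
    m.val.assign(static_cast<size_t>(n_rows) * nnz_per_row, 1.0f);
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> pick(0, n_cols - 1);
    for (int i = 0; i < n_rows; ++i) {
        m.ptr[i] = i * nnz_per_row;
        std::vector<int> cols;
        while (static_cast<int>(cols.size()) < nnz_per_row) {
            int c = pick(gen);
            bool dup = false;                    // keep columns distinct within the row
            for (int existing : cols) if (existing == c) { dup = true; break; }
            if (!dup) cols.push_back(c);
        }
        m.col.insert(m.col.end(), cols.begin(), cols.end());
    }
    m.ptr[n_rows] = n_rows * nnz_per_row;
    return m;
}

// First set:  A with k non-zeros/row, measure C = A*A (k*k ops per row of C).
// Second set: A with r non-zeros/row and B with s non-zeros/row, measure C = A*B
//             (r*s ops per row), varying r while keeping r*s fixed.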
7.2 Binning
In order to have good concurrency among all the threads, the rows of A are initially grouped into different bins based on the total number of ops each row is involved in. We create bins whose op ranges are powers of 2, i.e., the ranges of ops for the different bins are 2-4, 4-8, 8-16, 16-32 and so on. For all the rows in a given bin, the memory in matrix C is allocated in an upper-bound fashion; e.g., all the rows in the bin having 32-64 ops are preallocated 64 memory locations, since the maximum number of non-zeros possible in the corresponding result row of C is 64. As the number of ops per row increases beyond a certain point, the upper-bound method for memory allocation becomes very inefficient. Hence we put all the rows having more than 1024 ops in the last bin and allocate memory in a progressive fashion. After binning the rows based on ops, we sub-bin them based on the number of non-zeros in each row of A, and employ a different SpGEMM algorithm for each sub-bin. Other kinds of binning have been explored in prior work [16].
7.3 Thread block configuration and kernel selection
The main goal of binning is to choose the thread block size according to the bins, such that sufficient work and load balancing can be ensured. Also, the shared memory used can be tuned per bin to achieve higher occupancy. In addition, we choose different SpGEMM strategies/kernels for different bins. For the bins having a small number of ops (i.e., up to 8 ops per row) we use the 2-phase mechanism described in Section 2.2: in the first phase we determine the number of non-zeros in the result row and reserve space for these rows in C accordingly, and in the second phase we perform the actual multiplication.
For the bins having between 8 and 256 ops, we use the ESC approach in shared memory with virtual warps. Virtual warps are used to ensure load balancing among all the threads in a virtual warp; hence we choose the size of the virtual warp for a given bin in accordance with the maximum number of ops for the bin. Different kernels are launched for different sub-bins based on the number of non-zero elements per row of A.
As observed in Section 4, for rows with a higher number of ops, the scatter-vector approach performs better than ESC. Hence for all the higher bins we use the scatter-vector approach. For the rows in bins having 256-1024 ops we allocate memory corresponding to the maximum ops for that bin, and for rows having more than 1024 ops we allocate memory dynamically. A sketch of the resulting bin-to-kernel mapping is shown below.
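The dispatch logic can be sketched as follows (our illustration; the bin boundaries follow the text above, while the enum and function names are hypothetical placeholders rather than the paper's actual kernels):

#include <cmath>

enum class Strategy { TwoPhase, EscShared, ScatterVector, ScatterVectorDynamic };

// Bin id: rows with an op-count in (2^(b-1), 2^b] go to bin b; everything
// above 1024 ops falls into the last (dynamically allocated) bin.
int binOf(long long ops) {
    if (ops <= 2) return 1;
    int b = static_cast<int>(std::ceil(std::log2(static_cast<double>(ops))));
    return b > 10 ? 11 : b;             // 2^10 = 1024
}

Strategy strategyFor(int bin) {
    if (bin <= 3)  return Strategy::TwoPhase;             // up to 8 ops per row
    if (bin <= 8)  return Strategy::EscShared;            // 8 - 256 ops per row
    if (bin <= 10) return Strategy::ScatterVector;        // 256 - 1024 ops per row
    return Strategy::ScatterVectorDynamic;                // more than 1024 ops per row
}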
EXPERIMENTAL EVALUATION
The new HybridSparse SpGEMM implementation was evaluated by comparing its performance against three publicly available SpGEMM implementations: from Nvidia's CUSP library, from Nvidia's cuSPARSE library, and bhSPARSE [16, 18]. Tests were performed on two GPUs; details are provided in Table 4. Both single- and double-precision performance were measured. We used the same set of sparse matrices from the Florida collection [22] that was used by Liu et al. [16, 18].
A recent SpGEMM implementation called balanced hashing has reported better performance than bhSPARSE [2]. Since its source code was unavailable, we present an indirect comparison against it. Anh et al. [2] present a comparison of their balanced hashing approach against bhSPARSE for single-precision data on an Nvidia Tesla K20Xm GPU with 2688 cores in 14 SMs. In Fig. 6a and Fig. 6b, for each test matrix, GFLOPs performance data is presented for four scenarios: i) measured performance of our hybrid SpGEMM implementation on a K20c GPU, ii) measured performance of bhSPARSE on the K20c GPU, iii) performance reported by Anh et al. for their balanced hashing approach on a K20Xm GPU, and iv) performance reported by Anh et al. for bhSPARSE on a K20Xm GPU. From Figures 6 and 5, we can see that except for the scircuit dataset, the new hybrid SpGEMM outperforms the balanced hashing approach as well as bhSPARSE. Figures 6 and 5 also show that HybridSparse is faster than the other SpGEMM algorithms for both single and double precision on all of the matrices. In particular, for the web-network (webbase-1M) and biological-network (protein) matrices, we achieve speedups of up to 3.9x and 4.1x over bhSPARSE, respectively. The hybrid approach achieves high GFLOPs for matrices such as hood, cant, harbor, pwtk, ship, spheres and protein. These matrices have rows with high op-counts, which are processed with the scatter-vector approach; the high GFLOPs for these matrices is thus a consequence of our GPU implementation of the scatter-vector approach. We achieve on average a 2x speedup over bhSPARSE across all the matrices, for both single- and double-precision multiplication and on both machines.
Even for the matrices that do not have rows with a high number of ops, the new hybrid approach outperforms bhSPARSE by a good margin.
DISCUSSION AND CONCLUSION
In this paper, we have attempted a systematic analysis of alternate approaches to performing SpGEMM on a GPU. By use of synthetic matrices whose characteristics could be controlled, the benefits of alternate approaches were characterized, and a hybrid scheme was implemented.
Although the new hybrid SpGEMM improves performance over prior GPU SpGEMM implementations, there appears to be considerable further room for improvement. An interesting open question is whether good upper bounds on SpGEMM performance can be established. For sparse matrix-vector multiplication, a practically useful bound can be established using the roofline model: given a sparse n × n matrix with nnz non-zero elements, SpMV requires a minimum data movement from/to global memory of 12 * nnz + 8 * n bytes for double-precision computation using the CSR representation.
This allows an upper-bound estimate of SpMV time as (12 * nnz + 8 * n)/Peak-memory-BW. A similar upper limit could be attempted for SpGEMM, based on the total data volume of the input and output matrices. Fig. 7a shows the results of such an exercise for the tested matrices, with Fig. 7b showing the actual GFLOPs achieved by SpMV and SpGEMM. For SpMV, we used the state-of-the-art CSR5 implementation of Liu and Vinter [17], and for SpGEMM we used the hybrid-sparse implementation presented here. It can be seen that whereas the achieved performance for SpMV is often within a factor of 2 of the roofline upper bound based on the global-memory bandwidth of the GPU, the achieved performance for SpGEMM is very far from such a bound. It may also be seen that the absolute performance achieved by SpGEMM is often far lower than SpMV performance, in stark contrast to dense matrix multiplication, where MM achieves orders of magnitude higher performance than MV.
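For reference, turning the SpMV roofline bound above into a GFLOPs number is a small piece of arithmetic, sketched below (the 12 * nnz + 8 * n byte count is the double-precision CSR estimate from the text; the matrix size and bandwidth used in main are hypothetical example values):

#include <cstdio>

// Roofline-style upper bound on double-precision CSR SpMV performance.
// Data moved >= 12*nnz + 8*n bytes; useful work = 2*nnz flops (one FMA per non-zero).
double spmvRooflineGflops(long long n, long long nnz, double peak_bw_gb_s) {
    double bytes = 12.0 * nnz + 8.0 * n;
    double min_time_s = bytes / (peak_bw_gb_s * 1e9);
    return (2.0 * nnz) / min_time_s / 1e9;
}

int main() {
    // Example (hypothetical): 1M rows, 10M non-zeros, 200 GB/s peak bandwidth.
    std::printf("SpMV roofline bound = %.1f GFLOPs\n",
                spmvRooflineGflops(1000000, 10000000, 200.0));
    return 0;
}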
