We present a perfectly balanced, "merge-based" parallel method for computing sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format, a predominant in-memory representation for general-purpose sparse linear algebra computations. Our CsrMV performs an equitable multi-partitioning of the input dataset, ensuring that no single thread can be overwhelmed by assignment to (a) arbitrarily-long rows or (b) an arbitrarily-large number of zerolength rows. This parallel decomposition requires neither offline preprocessing nor specialized/ancillary data formats. We evaluate our method on both CPU and GPU microarchitecture across an enormous corpus of diverse real world matrix datasets. We show that traditional CsrMV methods are inconsistent performers subject to order-of-magnitude slowdowns, whereas the performance response of our method is substantially impervious to row-length heterogeneity.
Introduction
High performance algorithms for sparse linear algebra are commonplace within many application domains, including computational science, graph analytics, and machine learning. The sparse matrix-vector product (SpMV) operation is of particular importance for solving sparse linear systems, eigenvalue systems, and similar problems. In its simplest form, a SpMV operation computes y = Ax where the matrix A is sparse and vectors x and y are dense.
SpMV usage is typically repetitive, often to the point of dominating overall application runtime. The primary design concerns for parallel SpMV performance are (1) uniform processor utilization and (2) efficient use of memory bandwidth [4] . Achieving these goals requires complementary strategies for parallel decomposition and matrix storage formatting, and has remained an active field of research.
This research was developed, in part, with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings contained in this article/presentation are those of the author(s)/presenter(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. Distribution Statement A (Approved for Public Release, Distribution Unlimited) Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, contact the Owner/Author. Request permissions from permissions@acm.org or Publications Dept., ACM, Inc., fax +1 (212) However, improving SpMV performance with innovative matrix formatting comes at significant practical cost. Applications predominantly rely upon the general-purpose Compressed Sparse Row (CSR) format for in-memory representation (as illustrated in Figure 1) . As a result, SpMV implementations that require alternative formatting ultimately burden the application with both (1) processing time for format conversion and (2) duplicated matrix storage space (because the original CSR matrix is likely required by other routines and cannot be discarded). In contrast, our "merge-based" method of CsrMV parallelization delivers high performance without format conversion or augmentation.
Although the CSR format is amenable to straightforward SpMV parallelization via row-wise assignment among individual threads for privatized row-reductions, contemporary parallel CsrMV implementations are subject to data-dependent performance degradation arising from irregular row lengths and/or wide aspect ratios. Alternatively, a "nonzero-splitting" strategy can be employed in which parallel processors are assigned regularized sections of the non-zero data (i.e., the values and column_indices arrays). As each processor consumes its section of non-zeros, it must track its progress through the row_offsets array. Imbalance is still possible because some processors may consume arbitrarily more row offsets than others. Many heuristics have been developed to ameliorate these issues. Williams et al. survey a wide range of optimization techniques for multicore CPUs [5] , and Filippone et al. survey more than sixty GPU-specific variations [2] .
However, neither general strategy has proven to be a panacea. Poor performance can be exacerbated by (1) wider parallel processors that are increasingly susceptible to underutilization, (2) scalefree datasets in which a small minority of rows are many orders of magnitude longer than others, and (3) datasets having an abundant number of zero-length rows corresponding to "leaf" vertices in directed graphs. For both CPU and GPU microarchitecture, we show that these CsrMV strategies commonly suffer performance deficiencies of one or more orders of magnitude among similarly-sized matrices.
To eliminate workload imbalance altogether, our method adapts the fine-grained MergePath decomposition [3] for merging two sorted lists, allowing our implementations to distribute equal shares of the total CsrMV problem among parallel processing elements. The result is an even utilization of parallel processing elements to provide predictably-good performance response regardless of rowirregularity and skew. As illustrated in Figure 2 , we can visualize the CsrMV work list as a 2-dimensional decision path in which the row_offsets array is merged with the sequence of natural numbers, which indexes the non-zeros in the column_indices and values arrays. The decision path begins in the top-left corner and ends in the bottom-right. When traced sequentially, the path moves downward when consuming indices (accumulating matrix-vector dot-products within a given row) and rightward when consuming row offset sentinels (outputting row-aggregates). The path itself is linearly indexed by the grid diagonals, where diagonals are enumerated from top-left to bottom-right. The goal of the MergePath parallel decomposition is to use the diagonals to partition this decision path into equal-length sections (one section per processing element).
In this example, the path is being split into three sections of four work items each. The fundamental insight is that each grid coordinate (i, j) along the path can be found by independent binary searches along the diagonal k , where k = i+j. Given the diagonal constraint, we simply find the first (i, j) where row_offsets[i] is greater than j − 1. In this example, the ninth of the CsrMV work item begins the third partition, and is located at (row3, nonzero5).
Once a given thread's starting coordinate (i, j) is found, the remainder of its section can be processed using a sequential CsrMV algorithm starting at rowi and nonzeroj. Finally, the partial sums from rows that span multiple threads can be aggregated in a subsequent "reduce-value-by-key" pass.
Assuming the number of processors p is a finite constant of the underlying machine (unrelated to the number of rows or non-zeros), the searching overhead does not asymptotically affect the total work complexity, which remains linear O(N ). Furthermore, the decision path can be partitioned hierarchically, trivially enabling parallelization across multi-scale systems (e.g., GPUs).
We evaluate our method across the approximately 4,400 nontrivial matrices of the entire Florida Sparse Matrix Collection [1] . Figure 3 and Figure 4 plot runtime vs. dataset size (nonzeros) as performed by the Intel MKL, the NVIDIA cuSPARSE, and our merge-based CsrMV implementations. These performance land- scapes serve to highlight performance inconsistencies. The presence of significant outliers is readily apparent for the row-based parallelizations (MKL, cuSPARSE). In comparison, our mergebased CsrMV achieves a substantially more consistent performance response. Statistically, our runtimes are much less anti-correlated to row variation (-0.07 versus -0.16 on Intel e5-2695x2, and -0.017 versus -0.24 on K40). We demonstrate speedups of up to 16× and 216× on the CPUs and GPU, respectively. Furthermore, we demonstrate aggregate evaluation times that are 1.1× and 2.9× faster for the entire corpus on the CPUs and GPU, respectively.
