Abstract Pattern-based Representation (PBR) is a novel approach to improving the performance of Sparse Matrix-Vector Multiply (SMVM) numerical kernels. Motivated by our observation that many matrices can be divided into blocks that share a small number of distinct patterns, we generate custom multiplication kernels for frequently recurring block patterns. The resulting reduction in index overhead significantly reduces memory bandwidth requirements and improves performance. Unlike existing methods, PBR requires neither detection of dense blocks nor zero filling, making it particularly advantageous for matrices that lack dense nonzero concentrations. SMVM kernels for PBR can benefit from explicit prefetching and vectorization, and are amenable to parallelization. The analysis and format conversion to PBR is implemented as a library, making it suitable for applications that generate matrices dynamically at runtime. We present sequential and parallel performance results for PBR on two current multicore architectures, which show that PBR outperforms available alternatives for the matrices to which it is applicable, and that the analysis and conversion overhead is amortized in realistic application scenarios.
Introduction
Sparse Matrix Vector Multiply (SMVM) is a numerical kernel that dominates the runtime of iterative solvers for systems of linear equations, which form the core of many scientific codes. Compressed representations of sparse matrices lead to a low ratio of floating point operations to memory accesses. For instance, if a matrix is represented in the commonly used compressed sparse row (CSR) format, each floating point multiply operation is accompanied by at least two memory accesses that trigger compulsory cache misses: one to retrieve the matrix element and a second one to retrieve its column index. Since accesses to main memory proceed at only a fraction of CPU speed [39] , sparse solvers achieve only a small fraction of the processor's peak performance.
Blocking [17, 18, 25, 26, 33, 35] reduces this memory overhead if a matrix contains dense substructures. Instead of recording one index per matrix element, blocked representations record one index per block. However, blocking may require zero-filling, which introduces unnecessary memory and floating point operations, and it cannot be applied if a matrix does not contain dense substructures or if those structures cannot be identified.
This paper introduces a novel representation called PBR, or Pattern-based Representation, which reduces the index overhead for many matrices without zero-filling and without requiring the existence or identification of dense substructures. Instead, PBR exploits a simple analysis that identifies recurring block structures that share the same pattern of nonzeros within a matrix. For any pattern that covers more than a threshold number of nonzeros, PBR represents the submatrix formed by this pattern in block coordinate (BCOO) format, along with a "block code" bitmask that describes the repeated pattern. A code generator generates optimized custom kernels for each block code. Thus, PBR expresses matrix structure in terms of specialized inner loops, thereby creating locality for repeating structure via the processor's instruction cache, and reducing the amount of index data that must be fetched from memory. We have implemented PBR and applied it to a number of matrices from different application areas. For a majority of matrices, we found that a large proportion of nonzeros is covered by PBR, with coverage equal or close to 100% in some cases. When applicable, PBR can shorten time to solution by up to 3.4×, with 1.4× on average, when compared to CSR in a sequential implementation, and it can also improve time to solution when compared to the widely used OSKI [33] library. On a multicore machine with 8 cores, PBR improves the SMVM performance by up to 5.0× over parallel CSR. SMVM using PBR is amenable to three optimizations, which we have implemented: explicit prefetching, vectorization, and parallelization. First, explicit prefetching ensures that data is brought into the L1 cache and tagged with the correct temporal locality. We exhaustively search for the optimal prefetch distance for each architecture. Second, we have adapted our code generator to generate vectorized code using SSE-intrinsics, which is possible since the substructure for each block code is known at code generation time. Third, PBR is amenable to parallelization for use on multi-core architectures. PBR can be applied in situations where the repeated use of a matrix justifies the tuning effort, or where multiple matrices with identical structure are used, such as in so-called ensemble computations, e.g., model reduction [4, 15] , weather modeling [12] , and drug design applications [23] . We presented the idea of PBR and an initial performance evaluation in [5] . This evaluation assumed off-line matrix analysis, structure conversion, and code generation, which precluded a quantitative overhead vs. benefit analysis. In this paper, we add several significant contributions: First, we implemented a library that performs all required steps at run time. This library can be used as a drop-in replacement SMVM kernel, minimizing the number of changes users have to make to benefit from our method. Second, we validated the actual implementation costs of the analysis and structure conversion by comparing them to the expected costs based on the asymptotic complexity of our algorithms. We verified that PBR can be implemented with a linear time cost of O(N + NNZ) for a matrix of dimension N with NNZ nonzeros. Third, we modeled the performance of PBR for large matrices to derive a predictor for choosing a block size. This predictor yields optimal or near-optimal performance for almost all matrices in the training set we considered. Fourth, we present a comprehensive evaluation of PBR's runtime costs relative to benefits it provides on two recent architectures. Our analysis shows that the cost of PBR are amortized after 500 to 700 steps on average if precompiled codes are available. For matrices that require code generation and compilation, this break even point may reach several thousand steps. Finally, we implement a row-partitioning strategy to significantly improve the performance and scalability of parallel PBR in today's multicore architectures.
The remainder of this paper is structured as follows: Sect. 2 describes our method in detail. Section 3 describes our implementation, including matrix structure analysis, blocksize detection and code generation. Section 4 provides an extensive performance analysis and discusses limitations. Section 5 compares PBR to related approaches and Sect. 6 concludes.
Pattern Based Representation

Background
SMVM computes a dense vector y that is the product of a sparse matrix A and a dense vector x : y = Ax. Figure 1 shows the basic form of an SMVM kernel's inner loop if a matrix is represented in CSR format. The matrix nonzeros are stored in a continuous array aa. Two indices record the structure of the matrix: an index ja[j] records the column index of the j th nonzero element, and a row pointer ia[i] records at which matrix element row i begins. Each pair of multiply-add floating point operations is thus accompanied by two memory accesses: one to load the matrix element aa[j], and another one to retrieve the column index ja[j] for each matrix element. Neither of these values is reused within the same invocation of the kernel. Therefore, even if there is maximum reuse of x, the performance is limited by the speed with which aa, ja, and ia can be fetched from memory. Except when SMVM is repeatedly called for small matrices that fit in the cache, each of these accesses will encounter compulsory or capacity misses.
Exploiting Recurring Patterns
Because we cannot reduce the number of nonzero values that must be read from main memory, our approach focuses on reducing the size of the index data structure. We identify blocks of recurring patterns and generate custom code for those patterns. As a result, fewer indices are needed since a pair of coordinates expresses the coordinates of each block, rather than needing indices for each nonzero within a block. The microstructure of each block is expressed in the machine code of the inner loop that iterates over all blocks of identical structure within a matrix.
We use a simple analysis to identify repeating patterns. Given a block size R×C, we divide a m ×n matrix into a grid of m R × n C rectangular blocks and count how often each of the possible 2 R×C patterns occurs. We represent the aggregate submatrix for each block pattern by recording block coordinates in COO format, along with a "block code," which is a bit vector of size R × C that encodes the nonzero micro-pattern.
We exclude two types of blocks. First, we exclude blocks with patterns that include less than three nonzeros, because such patterns would yield little or no reduction in terms of index overhead. Second, we exclude blocks whose patterns do not occur frequently enough to cover a significant number of nonzero elements, because the overhead of dispatching to the kernel specific to their block code may not be amortized. We empirically set this threshold as one thousand. Our analysis aggregates nonzeros that belong to excluded blocks in a remainder matrix, which is stored using conventional CSR representation. Figure 2 shows an example 12 × 12 matrix in which three recurring patterns occur when using a block size of 4 × 4. 7 of the matrix's 9 blocks exhibit recurring patterns with more than 2 nonzeros. 3 remainder elements (shown in red) are located within Figure 4 shows how the original matrix is split into a sum of submatrices and a remainder.
Block Size Selection and Nonzero Coverage
To benefit from PBR, we must choose a blocksize that yields sufficient nonzero coverage relative to the number of nonzeros assigned to the remainder matrix. For a given matrix, the nonzero coverage depends on the choice of the number of rows R and columns C within each block and the described cutoff criteria. Depending on the structure of a matrix, different values of R and C will yield different degrees of coverage. For our experiments, we chose square block sizes R = C = 2, 3, 4, . . . , 8 and performed the analysis described in Sect. 2.2 for each size. We found that the block size that yields the largest coverage does not always yield the best performance, because different block sizes lead to different decompositions of the matrix, which results in different memory access patterns for the x and y vectors. We describe a model and a predictor that consider these factors in Sect. 3.2.
We selected a set of 53 matrices to evaluate the applicablity and potential of PBR. We included all obtainable matrices from two sets that were previously used for sequen-tial and parallel evaluation of SMVM optimizations by Im et al. [18] and by Williams et al. [37] . We also added a number of larger matrices from the University of Florida repository [8] to include matrices of widely varying patterns, stemming from a variety of application areas. The properties of the matrices we considered can be found in Table 1 . The "Opt. Blocksize" column list the block size that yielded the best performance for the Intel Harpertown (HPT) and AMD Opteron (OPT) architectures we benchmarked. The columns "PBR Coverage," "Index Savings," and "# of Patterns" list the coverage, index overhead reduction, and the number of patterns for the optimal block size for the respective architecture. This table excludes the jpwh_991 matrix, which is a small matrix that includes no shared patterns that satisfy PBR criteria for any of the blocksizes. Figure 5 provides a histogram that summarizes the achieved nonzero coverage, based on the block size that yielded the best performance on both architectures. For 56% (HPT) and 58% (OPT) of matrices, PBR encodes over 80% of nonzeros; coverage is 40% or less for only 19% (HPT) and 26% (OPT) of matrices. Table 2 displays how often each block size yielded the best performance. We do not consider block sizes larger than 8 because good coverage can be obtained from block sizes less than or equal to 8, and because larger block sizes tend to increase the number of patterns, making it less likely that individual patterns meet the cut-off criteria.
Our analysis appears to capture the underlying regularity in the nonzero structure of many types of matrices. Only matrices that are nearly random, such as those that model the relationship between hyperlinked web pages, tend to yield lower coverage. Table 3 shows the distribution of the number of distinct patterns that satisfied our inclusion criteria. This table shows that the high degree of nonzero coverage we observe requires only a small subset of all theoretically possible 2 R×C patterns, thus making it feasible to generate code for all qualifying patterns.
PBR Library
The PBR library provides a conversion routine that implements the matrix analysis, structure conversion, and any necessary code generation. Users provide the input matrix in CSR format, thus making PBR drop-in compatible for any codes exploiting this widely used format. The conversion routine returns an opaque handle that is used in subsequent SMVM operations. The handle refers to an internal data structure that encapsulates matrix-specific information (such as dimension and sparsity), PBR-specific information (such as blocksize, number of recurring patters, nonzero coverage and identifiers for generated code and compiled object files), the data structures to hold the converted matrix itself (including nonzero values, block indices, list of patterns and their occurrence) and the remainder matrix in CSR format. An analyzed matrix structure can be saved to and restored from disk, allowing re-use of the analysis results for the same or matrix or other structurally identical matrices. The library is callable from C code, but it is implemented in portable C++ using several high-performance container and utility classes provided by the STL [2] and Boost [1] libraries. The custom SMVM kernels that implement the multiplication step are generated as C code and compiled with an optimizing C compiler. The coverage, index overhead savings, and number of patterns are based on the block size that yielded the best performance on the Intel Harpertown (HPT) and AMD Opteron (OPT) architectures Table 1 . Values are tabulated for the block size that yielded the highest performance on each of the architectures shown Table 2 Frequency with which each block size yielded maximum performance for matrix set shown in Table 1 2 PBR conversion involves the following steps: analyzing the matrix structure to find recurring patterns, selection of a block size, structural conversion, code generation, compilation (if necessary), and loading. The following sections describe these steps.
Matrix Structure Analysis
The structure analysis determines which block patterns occur in the matrix and how often. To avoid reading the matrix index structure multiple times, we analyze all block sizes from 2 × 2 to 8 × 8 in a single pass, as illustrated in Fig. 6 . We divide the matrix into stripes, each stripe comprising lcm(2, 3, . . . , 8) = 840 rows. For each block size, we use an instance of Boost's unordered hash map to store the indices of those blocks that contain at least one nonzero element in the stripe. We cumulatively update these blocks as we process the input matrix index array. To provide for efficient iteration over the blocks we additionally store their column indices in a linear STL vector. We optimized this step further by using a custom memory allocator based on GNU's obstack implementation, which allows for fast allocation and deallocation of these temporary data structures. After all nonzeros within a stripe are accounted for, all encountered patterns and their frequency are tabulated in a separate hash map that is keyed by bitsets encoding each pattern. Since all lookups are performed with hash maps that provide O(1) average lookup time, the asymptotic time complexity of these analysis steps is linear in the number of nonzeros contained in the matrix. Some operations, such as the initialization and teardown of the hash maps containing the indices of the nonzeros within each stripe, are performed once for each stripe. These operations incur an asymptotic complexity that is linear in the number of matrix rows N .
Our actual implementation collects pattern statistics only for the 8×8, 7×7, 6×6, and 5 × 5 block sizes. The recurring patterns and their frequencies for blocks of size 4 × 4 and 2 × 2 are derived from the recurring 8 × 8 patterns by examining subsets of the bit patterns of each 8 × 8 pattern, as shown in Fig. 7 . Similarly, 3 × 3 patterns and their frequency are derived from their encapsulating 6 × 6 blocks. This derivation reduces the cost of our analysis substantially, because it is done only once for each distinct 8 × 8 and 6 × 6 pattern.
The result of the analysis step provides, for each of the considered block sizes, the number and kind of distinct recurring patterns. No information regarding block locations in the matrix are collected at this point. We exclude patterns for which the product of pattern frequency and number of bits contained in the pattern does not meet our threshold, and we exclude patterns with less than three nonzeros.
Block Size Selection
To predict which block size would yield the best performance, we developed a simple multiple linear regression model that estimates the performance for a decomposition resulting from each potential choice of block size. For sufficiently large matrices, we expect that PBR's execution time is dominated by memory accesses, therefore our model includes three variables that capture the memory accesses performed during the multiplication. The values of these input variables can be derived from statistics collected during the analysis step.
-The number of bytes fetched from memory while accessing the covered nonzero values, computed as the product of coverage × N N Z× sizeof(double), plus the number of bytes comprising the block index array, which is i∈P 2 × f req(P i ) × sizeof(int) for the set of qualifying patterns P. -An approximation of the number of accesses to the x and y vectors while multiplying the submatrix for each included block pattern. Since each submatrix may cover all rows and columns, we approximate this number as the product of the matrix dimension and the number of patterns: N × |P|. -The third model variable is the number of writes to y, which can be derived for each block pattern from its structure. This variable is needed because unlike CSR, which writes each element of y exactly once, PBR may read and write each y element multiple times.
Our model must also predict the performance of the remainder matrix, which is kept in CSR format. We modeled CSR's performance with a separate multiple regression model that is based on matrix dimension and number of nonzeros. Section 4 discusses the computation of parameter estimates and the fit of these models for our training set and the architectures we considered.
Structure Conversion
The structure conversion step creates the block indices and arranges the nonzero values in PBR format for the selected blocksize. We keep the (row, col) block indices and the matrix nonzero values in two one-dimensional contiguous arrays. As depicted in Fig. 4 , the nonzeros and indices of all blocks that belong to the same pattern are stored in contiguous slices of these arrays, which guarantees spatial locality in the inner loop of each kernel. Because the number of blocks for each recurring pattern has been determined during the analysis step, the start and end locations of these slices can be precomputed in a single pass over all patterns. Thus, nonzeros and block indices can be copied directly to their target destinations by maintaining pointers to the current nonzero and block index offsets within each pattern's slices.
Code Generation
The code generation step generates custom C functions for all qualifying patterns, stores them to disk, and invokes the C compiler to create shared .so object files. Each .so file is dynamically linked into the process's address space. A pointer to the C function it contains is added to an array of function pointers. The outer loop of the SMVM routine iterates over this array and invokes the generated block multiplication functions.
The shared object modules are created on demand and stored in a repository on disk. Since object modules are specific to only the block pattern and size, they can be reused both across multiple uses of the PBR library on the same matrix as well as across multiple matrices that contain the same pattern. We expect caching to be particularly beneficial for application domains that repeatedly generate similarly structured matrices, e.g., k-stencil FEM methods. Optionally, the code cache can be primed by generating and compiling all possible patterns for blocksizes 2 × 2, 3 × 3 and 4 × 4, which would pay for these code generation costs upfront and avoid any cost for individual matrices when these block sizes are chosen.
Our code cache is designed to hold modules for different target architectures and with different optimization options (e.g., SSE) simultaneously, thus allowing it to be shared by multiple machines on a network. If needed, the size of the code cache repository could be managed by a periodically scheduled cronjob that deletes files that have not been accessed for a certain time period, which is similar to the strategy used by many Unix distributions to purge stale files from the /tmp directory.
Explicit Prefetching
We optimized our baseline implementation by applying explicit prefetching for the matrix elements. Although the streaming pattern of these accesses would ordinarily prevent gains from prefetching, prior work on SMVM optimizations has shown [37] that explicit prefetching is beneficial on SSE-enabled x86-based architectures, which are used by 91% of the machines in the June 2010 Top-500 list [24] . Prefetching can place data directly into the L1 cache and labels cache entries with the correct temporal locality, thus allowing eviction in preference to other data such as elements of the x and y vectors, which may be reused. Explicit prefetching is implemented via the GCC __builtin_prefetch() compiler intrinsic, which results in the emission of prefetch* SSE instructions. Our code allows varying the prefetch distance at runtime, which enables dynamic tuning of the prefetch distance on a per matrix basis.
Vectorization
We adapted our code generator to emit vectorized code that exploits SSE2/3 SIMD intrinsics. This instruction set allows simultaneous operations on a vector of two double values via the __ mm128d data type, which is mapped to a 128 bit SSE register. For simplicity, we vectorize only blocks with even sizes. Each R × C block is divided into R/2 × C/2 number of 2 × 2 subblocks. We allocate C/2 and R/2 vector variables for each (x i , x i+1 ) and y j , y j+1 pair. We allocate 1 vector variable for aa elements if a subblock has 1 or 2 nonzeros, and 2 variables if it has 3 or 4 nonzero elements, independent of where those nonzeros are located within the subblock. To minimize the number of shuffle operations that are required if aa elements are not located in the order in which they are needed, we store the aa elements in zigzag order in memory. Figure 8 shows the code our generator produces for the pattern
We introduce padding to maximize the use of 128bit aligned load instructions provided by the SSE instruction set. The first nonzero element in each block is always aligned at a 16 byte boundary. In addition, we implemented two different alignment strategies within a block. Our first strategy uses aligned loads only when it is known that the elements to be loaded have the correct alignment, based on the alignment of the initial nonzero within the block. If the number of nonzero elements is odd (1 or 3) , we use 64-bit double load instructions and set the unused element within the vector to zero. Our second strategy stores an additional zero in memory whenever the number of nonzeros within a 2 × 2 subblock is odd, thus allowing the use of aligned load instructions throughout. Our performance evaluation reports the results for the strategy that yielded the best performance.
Parallelization
SMVM using a PBR representation contains inherent data parallelism since all floating point multiply operations are independent from each other. Moreover, since the distribution of repeating blocks and the number of nonzeros they contain are known post analysis, the workload can be distributed across threads easily. We row-partition the matrix, which is a common strategy also used in parallel CSR. We included an additional step in the matrix analyzer to assign blocks to threads for equally distributing the workload, i.e., the number of nonzeros. Currently, the PBR analyzer prepares assumes a 128-thread granularity and keeps this information inside the PBR represen- This approach is different than the initial PBR version described in [5] , which did not implement row-partitioning. Instead, parallelization was achieved by distributing blocks evenly to threads, regardless of locations of blocks in the matrix. This approach required use of private copies of y vectors on each threads, as well as an additional reduce-add step to construct the final solution vector. For many matrices, this extra step greatly limited the parallel performance and scalability of PBR, especially for small matrices and large number of cores.
Evaluation
Our evaluation contains multiple parts: first, we demonstrate how PBR-based SMVM shortens time to solution, or overall runtime, in both sequential and parallel environments, relative to readily available alternatives. Second, we demonstrate the relative impact of the prefetching and vectorization optimizations we applied to PBR. Finally, we evaluate the runtime costs associated with the PBR implementation in comparison to benefits by PBR. For all experiments, we used the matrix set in Table 1 , which was introduced in Sect. 2.3.
Methodology
We used two x86-based architectures for our evaluation. The first architecture is a 2.8Ghz 2-socket quadcore (8-core total) Intel Xeon Harpertown 5400 with 12 MB L2 cache per socket (each core pair sharing a 6MB cache,) and 8GB of RAM. The second architecture is a 2.5 GHz 2-socket quadcode (8-core total) AMD Opteron 2380 with 512 KB L2 and 128 KB L1 caches per core, a 6MB shared L3 cache per socket, and 16GB of RAM. We use the GCC compiler version 4.1.2 on the Harpertown and version 4.3.2 on the Opteron with optimization flags: -O2, -funroll-loops, -mfpmath =sse, -msse, -mtune and -march using proper values for each architecture.
Sequential Performance
Figures 9 and 10 compare the performance of unoptimized (plain) sequential PBR to the performance of PBR with prefetching and PBR with vectorization on Harpertown and Opteron, respectively. Speedup results are shown relative to the naïve CSR performance.
For matrices with 80% or more nonzero coverage, which account for at least 30 of the 53 matrices on both architectures, PBR provides a maximum speedup of 3.41 with an average of 1.53 on the Harpertown and a maximum speedup of 2.32 with an average of 1.64 on the Opteron.
The relative contribution of prefetching and vectorization is shown as zero if prefetching or vectorization did not improve performance. For a few matrices, these optimizations yielded slight slowdown. On Harpertown, small matrices benefit particularly from vectorization, whereas the benefit of prefetching is more pronounced on the Opteron.
These charts also contain the results we obtained using OSKI 1.0.1h [33] , a widely used optimization library. We used OSKI's aggressive tuning option without providing structural hints. OSKI is consistently slower than PBR, and even provides little or no speedup compared to naïve CSR on the architectures considered, which is consistent with earlier results obtained by others [37] . We built a basic thread pool implementation on top of Linux's PThreads API. To avoid the potentially random placement of threads onto cores by the OS scheduler, we used the Portable Linux Processor Affinity (PLPA) interface by the OpenMPI project [14] to explicitly set CPU affinity for each thread, which forces a fixed mapping of threads to cores. When using 2 threads, we placed threads on neighboring cores that share the same L2 cache on Harpertown and the same L3 cache on Opteron. When using 4 threads, we placed them onto the same socket, thus sharing a single front-side memory bus.
We compared our parallel implementation of PBR to a parallel implementation of naïve CSR. We parallelized CSR in a manner that partitions the matrices' rows such that each thread operates on roughly the same number of nonzero elements. We used the same thread pool implementation as for PBR. Figure 11 shows the overall runtime of parallelized PBR relative to parallelized CSR for 2, 4, and 8 threads on Intel Harpertown (top) and AMD Opteron (bottom). For this discussion, we consider only the subset of matrices that provided more than 80% coverage and that yielded at least 1.1 sequential speedup. A value greater than 1.0 indicates that the use of PBR is beneficial for a given number of threads. These results indicate that PBR retains its performance advantage over CSR in parallel as well, providing a maximum speedup of 3.8 on Harpertown and 5.0 on Opteron using 8 cores. Figure 12 depicts parallel speedup by CSR and PBR relative to their respective sequential implementation for the same matrix set used in Fig. 11 . The averages of the speedups in this figure are summarized in Table 4 . Both methods exhibit superlinear speedup with 8 threads for medium sized matrices due to doubled memory bandwidth and cache capacity by using two sockets. These results are in line with CSR superlinear speedups that have been observed on similar architectures by others [37] .
Overhead Analysis
As with any optimization method, PBR's overhead must be evaluated relative to the benefits it provides. We relate PBR's overhead to its performance benefits over the CSR method in the context of iterative solvers. The break-even column (BEP) in Table 5 Table 4 Average of parallel speedups given in Fig. 12 2 core 4 core 8 core
Harpertown PBR shows how many iterations would be needed to compensate for PBR's overhead. Since the BEP depends on the achieved speedup for a given matrix, we also report as a second point of comparison how many CSR iterations could be performed in the time it takes to analyze the matrix (Analysis), convert to PBR (Structure) and link to kernel objects (dynLink) in the same table. These numbers provide a cutoff value below which PBR cannot be amortized. The best block size (BBS) column depicts the block sizes that yielded the best performance. This table excludes the code generation/compilation cost, which is shown separately in Table 7 . Matrices in this table are sorted by their size, with smaller matrices in the top half. Since small matrices benefit less from PBR, their break-even point is generally higher. The break-even point reduces significantly with increasing matrix size.
To ensure that the overhead measured by our evaluation accurately reflects the inherent complexity of our method, we validated that our implementation in fact exhibits the asymptotic complexity we predicted. During this validation process, we eliminated several lurking quadratic complexities in our implementation by applying the techniques described in Sect. 3.1. We fit a multiple linear regression model using the variables N and NNZ to the benchmarked analysis times. This model yielded an adjusted R 2 value of 0.90. Parameter estimates and standard errors for the model are shown in Table 6 . We performed the same validation for the structure conversion step. We fit separate models for each block size; the adjusted R 2 values for these models ranged from 0.77 to 0.88. These results confirm that PBR can be implemented with a cost that is linear in matrix size and number of nonzeros. Table 5 assumed that no code generation is necessary, which is true if the code cache contains already compiled modules for all encountered patterns. Table 7 shows the additional cost incurred by the code generation/compilation step for the subset of matrices that yielded the best performance with a blocksize greater than 4 × 4, in columns BEP and CSR_itr. We report numbers only for these matrices because we assume that the code cache contains precompiled modules for all smaller block sizes.
Detection of the optimal prefetch distance on systems that benefit from prefetching adds to these overheads. This tuning step can be performed during the first few SMVM operations in which the converted matrix is used. We vary the distance in 64-byte increments within a range between 256 and 1024 bytes for the first 12 iterations, then retain the distance that yielded the best performance in all subsequent iterations. This table depicts cost of block size detection (Analysis), conversion of the matrix (Structure) and dynamically linking to kernel objects (dynLink), expressed in number of CSR iterations for the given matrix and architecture. BEP denotes the break even point, which is the number of iterations needed to compensate PBR overhead. BBS is the blocksize that yielded the best performance Table 6 Multi-regression parameters for the matrix analysis costs on the Harpertown architecture, which models the 'Analysis' column of Table 5 Adjusted R 2 = 0.90 BEP column shows number of extra iterations incurred by code generation/compilation, to be added on values given in BEP column in Table 5 . CSR_itr column shows the code generation/compilation cost expressed in number of CSR iterations In Sect. 3.2, we described a method for estimating the PBR execution time in order to choose a block size that yields optimal or near-optimal performance. For each candidate block size, we first estimate the PBR and remainder CSR execution time using separate models; the overall predicted execution time is the sum of these two components. The model for the PBR part uses three estimation factors as described in Sect. 3.2 (the memory reads due to the nonzero and index arrays, the approximation to the number of accesses to the x and y vectors, and the number of writes to y). The model for the remainder CSR uses matrix size and number of remainder nonzeros. Then, we pick the block size that leads to the smallest predicted execution time. Since our model is based on the assumption that PBR performance is dominated by memory accesses, we chose the 29 largest matrices of our set to train the model. The resulting adjusted R 2 values for these models on the Intel Harpertown architecture are given in Table 8 . We verified that the overall memory consumption of these matrices exceeds the available L2 cache capacity, making PBR memory bound. For these experiments, we do not consider the impact of vectorization and we assume a constant prefetching distance.
Our model correctly selected the best-performing block size for 12 matrices. For the remaining 17 matrices, the performance loss due to mispredicted block size was less than 3%, with a maximum slowdown of 12% for the finan512 matrix.
Because our prediction includes two components, the PBR and the remainder CSR parts, prediction errors with opposite signs may cancel each other out. To validate that the high accuracy of our model was not due to chance, we computed the weighted sum of the absolute values of the relative error η for both the PBR and the CSR remainder parts:
where cov denotes the nonzero coverage by PBR. We saw that for 25 matrices, the total error was less than 25% for both the predicted block size as well as the block size that performed best. Therefore, the cancelation of model errors is not significant for most matrices, and the high model accuracy is due to the good fit of the model.
Overall, these results show that the block sizes selected by our predictor model lead to optimal or near-optimal speedup. We note that block size prediction is an optimization; if the optimal block size is desired and a matrix structure is sufficiently often reused, it is possible to perform a trial structure conversion using each block size and choose the optimal block size via benchmarking.
Limitations
PBR faces a number of limitations that affect its applicability and performance. First, there is the inherent requirement that a block size exists for which there is significant nonzero coverage. Second, PBR may decrease the locality of the right-hand side x vector. Third, whereas SMVM using CSR writes each element of the left-hand side vector y only once, PBR may require multiple reads and writes. If the x or y vectors do not fit into the cache, such as for very sparse problems, the performance impact of these limitations may reduce PBR's effectiveness.
Related Work
The optimization and tuning of numerical kernel for sparse matrices has received a substantial amount of attention in the literature. An overview of techniques for performance tuning of sparse kernels is provided in [32] .
Register blocking [17, 18, 31, 35, 37] identifies dense substructures and fills in zeros to obtain dense subblocks, which can then be stored using more efficient indices. By contrast, PBR achieves the same index reduction without storing (or computing with) zeros. As such, it will include blocks that register blocking heuristics would reject because they would require too much zero-filling. Others have proposed preprocessing [3] and matrix permutation to find or create dense substructures [19] .
Cache blocking [17, 25, 26] improves performance by improving the locality of accesses to the x vector, which is also the goal of classical bandwidth reduction techniques [29] . Cache blocking could be applied to the individual submatrices on which PBR operates as well, although likely with smaller benefits.
Most related to our work are a number of techniques that attempt to reduce index overhead. For instance, if a row contains many contiguous columns of nonzeros, run-length encoding can be used to reduce the index overhead [27] . The benefit of run-length encoding is amplified if the matrix is first reordered to create contiguous nonzero columns [28] . The use of delta coding and row pattern compression was proposed in [36] . These two methods compress the index overhead relative to CSR, but may require hand-tuned assembly language decompression routines to be effective. An alternative to delta coding that can take advantage of near dense, but not necessarily contiguous nonzero concentrations was suggested in [20] . By contrast, PBR does not perform decompression at runtime, but may not be applicable for some matrices that benefit from compression.
Capturing sparse matrix structure in generated code was previously proposed for specialized sparse problems in [10] and [16] . In comparison, PBR does not make any assumptions about the structure of a matrix when identifying repeating patterns. Compilation techniques that optimize for specific sparse structures have been considered in [6] .
Vectorized versions of CSR were proposed in [7] and [9] . Similarly, length-sorted CSR, which processes pairs of rows with equal lengths, can use unroll-and-jam to increase instruction-level parallelism [22] . Neither technique reduces index overhead, however.
Other tuning techniques include diagonal cache blocking [30] , the detection of diagonal substructures [11] , the exploitation of symmetries [21] , and optimizations for specific higher-level kernels, such as sparse triangular solve [34] . The impact of prefetching on SMVM performance was previously explored in [31] and [37] .
The results of a comparative evaluation study of techniques for improving SMVM performance [13] concur with our emphasis on reducing memory bandwidth usage. Another recent study [37, 38] considered the impact of several known optimization techniques for SMVM on emerging multicore architectures. These techniques included thread blocking, cache and register blocking, prefetching, SSE-based SIMDization of dense substructures, and others. We included the matrices used in this study in ours and found that many significantly benefit from PBR. We did not compare PBR to the optimizations implemented in that study because the implementation is not readily available.
Conclusion
Sparse matrix vector multiply (SMVM) has long been recognized as an extremely challenging numerical kernel. Its speed is limited by available memory bandwidth, particularly on modern architectures. We proposed a novel method to reduce its memory bandwidth requirements by exploiting a memory-efficient index structure that identifies and exploits repeating patterns, which are captured in generated code. Unlike previous techniques, our method is agnostic with respect to structure. Perhaps counterintuitively, the lion's share of the structure of many matrices that arise in scientific problems can be captured using a relatively small number of frequently recurring patterns. Our technique can benefit from features that are available on dominant modern architectures, such as prefetching and vectorization, and it is easily parallelizable.
We presented a library that performs matrix conversions on the fly, allowing our method to be used as a drop-in replacement for existing methods. We evaluated PBR by demonstrating its performance advantage for sequential and parallel implementations and by quantifying its overheads relative to its benefits. We described a performance predictor for choosing a blocksize that achieves an optimal or near-optimal performance. For many practical scenarios in which matrix structures are reused sufficiently often, pattern-based representation can augment the arsenal of tuning methods for sparse matrix-vector multiplication.
