Abstract-Sparse matrix-vector multiplication (SpM Â V) has been characterized as one of the most significant computational scientific kernels. The key algorithmic characteristic of the SpM Â V kernel, that inhibits it from achieving high performance, is its very low flop:byte ratio. In this paper, we present a compressed storage format, called Compressed Sparse eXtended (CSX), that is able to detect and encode simultaneously multiple commonly encountered substructures inside a sparse matrix. Relying on aggressive compression techniques of the sparse matrix's indexing structure, CSX is able to considerably reduce the memory footprint of a sparse matrix, alleviating the pressure to the memory subsystem. In a diverse set of sparse matrices, CSX was able to provide a more than 40 percent average performance improvement over the standard CSR format in SMP architectures and surpassed 20 percent improvement in NUMA systems, significantly outperforming other CSR alternatives. Additionally, it was able to adapt successfully to the nonzero element structure of the considered matrices, exhibiting very stable performance. Finally, in the context of a "real-life" multiphysics simulation software, CSX accelerated the SpM Â V component nearly 40 percent and the total solver time approximately 15 percent.
INTRODUCTION
S PARSE matrices arise in a variety of scientific disciplines, ranging from physics simulations to data mining and financial analysis. Most of these applications boil down to the execution of basic sparse matrix kernels. Among these kernels, the Sparse Matrix-Vector Multiplication kernel (SpM Â V), which computes the product of a sparse matrix and a dense vector, is probably the most prominent and challenging, being recently classified as one of the kernels particularly important for science and engineering in the next decade [1] . The algorithmic characteristic that renders the optimization of SpM Â V so challenging in modern multiprocessor systems is its very low flop:byte ratio [2] , [3] , [4] . In effect, this means that the algorithm must retrieve a significant amount of data from the memory hierarchy to perform a useful operation. This adds significant pressure to the memory subsystem and tends to saturate the memory bandwidth of modern symmetric shared memory multicore architectures. In the case of NUMA architectures, although the higher memory bandwidth offered alleviates the pressure to the memory subsystem, the high memory intensity of the SpM Â V kernel renders its performance very sensitive to the placement of the data on the different memory nodes.
To store a sparse matrix efficiently, one needs to store only the nonzero values of the matrix along with some location information for iterating over the nonzero elements. The most common format for storing sparse matrices for use with the SpM Â V kernel is the Compressed Sparse Row (CSR) format [5] . The idea behind CSR is that it suffices to store only the column indices and "pointers" to the start of each row for locating each nonzero element of the matrix. However, despite being relatively compact, CSR has a lot of redundant information. The nonzero elements of sparse matrices are usually arranged in dense substructures, for example, horizontal and/or diagonal sequences, 2D blocks, and so on. One could keep just a single column index for each encountered substructure, because it is known beforehand how to iterate over the elements inside the substructure; the matrix size, therefore, could be significantly reduced, especially for matrices with a lot of dense regions. Several storage formats have been proposed in the past that target the reduction of the matrix size, either explicitly or implicitly [5] , [6] , [7] , [8] , [9] . The main drawback of the majority of these formats, however, is that they take an "all-or-nothing" approach: they will either offer very high-performance improvement over CSR when they find the expected substructures or plummet otherwise. The most typical example is the Blocked Compressed Sparse Row (BCSR) format [8] , which can offer 2Â performance improvement in certain matrices, but half the CSR performance in others.
Our approach to the optimization of the SpM Â V kernel is the Compressed Sparse eXtended (CSX) format. CSX targets the minimization of the memory footprint of the matrix, because memory bandwidth contention is the key performance problem of the SpM Â V kernel in modern multicore architectures [2] , [3] , [4] . We believe that data compression techniques will play a more significant role in future multicore and manycore chips as a means not only for minimizing the communication cost in different levels (processor-to-memory, processor-toprocessor, etc.), but also for increasing energy efficiency. CSX employs an aggressive compression scheme that builds on top of the CSR-DU format [10] . By applying run-length encoding in conjunction with matrix coordinate transformations, CSX is able to detect and incorporate a variety of different substructures, including horizontal, diagonal, and 2D blocks. The very condensed representation of a substructure allows CSX to significantly compress the input matrix, leading to considerable performance improvements. On a 24-core SMP system, CSX managed a 42 percent average performance improvement over CSR, significantly surpassing other CSR alternatives. In NUMA architectures, where the memory bottleneck is not so intense, we relax the compression scheme and fine-tune our substructure selection heuristic, to avoid some of the time-consuming decompression computations. In these architectures, CSX manages an 18 percent average performance improvement over CSR, which climbs to 21 percent when simultaneous multithreading is enabled. A key advantage of CSX over other CSR alternatives is its performance stability; even for matrices that CSX does not achieve the highest performance, its performance lies within 5 percent of the best. Finally, we add to the versatility of CSX by employing runtime code generation using Clang and LLVM [11] for generating optimized substructure-specific SpM Â V routines.
In this paper, we take a step further from the simple optimization of the SpM Â V kernel by focusing also on the minimization of the preprocessing cost of CSX, to make it practical for "real-life" SpM Â V applications. With the use of advanced sampling of the input matrix and a careful implementation, we were able to reduce the CSX preprocessing cost down to approximately 100 serial CSR SpM Â V iterations. This low preprocessing cost can be amortized easily in the context of a large sparse linear solver that may need a few thousands of SpM Â V iterations to converge. Indeed, despite its preprocessing cost, integrating CSX into the Elmer multiphysics simulation software [12] offered a nearly 40 percent performance improvement of the SpM Â V component and a 15 percent average improvement of the overall solver time.
The rest of the paper is organized as follows: Section 2 provides background information about the SpM Â V kernel, while Sections 3, 4, 5, 6, and 7 describe the details of the CSX format (detection and encoding of substructures, code generation, preprocessing, and porting to NUMA architectures). Section 8 presents an experimental evaluation of the CSX format, Section 9 describes the related work in the field and, finally, Section 10 concludes the paper.
BACKGROUND AND MOTIVATION
The SpM Â V kernel lies at the core of iterative algorithms for the solution of sparse linear systems [13] . It is one of the most time-consuming kernels in the execution of these methods, taking up even 90 percent of the solver's total execution time under certain circumstances [13] . The major performance problem of the SpM Â V kernel in modern commodity microarchitectures stems primarily from its algorithmic nature. The matrix-vector kernels (dense and sparse) sweep once through the whole matrix and perform a constant number of floating point operations per element. This automatically leads to a Âð1Þ flop:byte ratio compared to the ÂðNÞ (N being the rank of the matrix) of the matrix-matrix multiplication kernels. In practice, this means that to avoid bottlenecks, the memory hierarchy must be able to provide data to the processor at a comparable speed, which is hardly ever the case for any modern commodity microarchitecture. The situation gets worse with sparse matrices, where the kernel must first retrieve the nonzero element's location information, before accessing it. This algorithmic behavior tends to quickly saturate the available memory bandwidth, especially for modern symmetric shared memory multicore architectures, rendering the minimization of the memory footprint of the sparse matrix of vital importance for the optimization of the SpM Â V kernel.
The most straightforward sparse matrix storage format is the Coordinate (COO) format [5] , which stores the row index and column index of every nonzero element. Assuming 4-byte integers for indices, the matrix size in the COO format is 16NNZ bytes (NNZ being the nonzero elements of the matrix), twice as much the size of the theoretical lower bound (see Supplementary Material, Section II-B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/ 10.1109/TPDS.2012.290). The Compressed Sparse Row format [5] addresses the problem of the increased matrix size by compressing the row indices and leaving the column indices intact. More specifically, instead of storing all the row indices, the CSR format stores only N "pointers" to the start of each row, achieving a 25 percent gain in the total matrix size compared to the COO format. The typical implementation of the CSR format uses three arrays to store a sparse matrix: the rowptr array stores the row pointers, the colind the column indices, and the values the nonzero values. However, the colind structure of CSR has a lot of redundant information. Indeed, the matrix size of CSR is still 50 percent over the theoretical lower bound. The most common approach for compressing the column indices is grouping together neighboring elements in blocks and keeping one column index per block. We can distinguish two large categories of blocking storage formats: 1) those with fixed size blocks and 2) those with variable size blocks. The most representative storage format in the first category is the Blocked Compressed Sparse Row format [5] , [8] , which is actually a blocked version of the CSR format. BCSR arranges all the nonzero elements of the matrix into fixed size, r Â c, strictly aligned blocks, employing zero-padding, if necessary, to construct full blocks, and keeps a single column index per block. Despite being initially utilized for register optimizations [8] and still being very computation friendly, the BCSR format can significantly reduce the size of matrices with large amounts of dense blocks. In the category of variable size block formats, the most representative is the Variable Block Length (VBL) format [7] , which forms one-dimensional variable sized horizontal blocks. Again, the VBL format keeps a single column index per block, but uses also an additional data structure for the different block sizes. A single byte for storing the block size is usually more than enough for the majority of sparse matrices, splitting larger blocks in 255-element chunks. From a theoretical standpoint, VBL has a higher compression potential than BCSR (see Supplementary Material, Section II-B, available in the online supplemental material) and, in practice, achieves almost always higher compression ratios.
THE COMPRESSED SPARSE EXTENDED FORMAT

The Need for an Integrated Storage Format
Sparse matrices arising from the discretization of partial differential equations have their nonzero elements arranged in substructures either extending to some onedimensional direction (e.g., horizontal, vertical, diagonal) or expanding to two-dimensional blocks. The exact nature of these substructures depends chiefly on the underlying application domain and any preprocessing performed in the original matrix that alters the distribution of the nonzero elements. As already discussed in Section 2, exploiting these substructures can result in a reduction of the matrix size. However, restraining a storage format to detect a single substructure may have diminishing returns in cases where the detected substructure does not exist in adequate quantities inside the sparse matrix ( Fig. 1) , leading even to an increase in matrix size compared to the baseline CSR. Alternatives that split the matrix into multiple submatrices of the same rank, each one storing a different substructure [6] , [14] , should experience additional scaling problems, since a final reduction of partial output vectors is required; the SpM Â V kernel performance can be hindered as well, due to very small rows of the resulting submatrices [2] , and load balancing issues might also creep in. The need for an integrated storage format that could incorporate a multitude of substructures is seemingly the most viable approach to a performant CSR alternative.
The proposed Compressed Sparse eXtended format integrates five different substructure categories (horizontal, vertical, diagonal, antidiagonal, 2D blocks) into a single storage format and can be expanded easily to support more. Since we focus on the minimization of the matrix size, we build CSX on top of the CSR-DU [10] format. CSR-DU is a CSR alternative that applies delta indexing (delta encoding) in the CSR's colind structure; instead of storing the full, 4-byte column index for every nonzero element, CSR-DU stores the delta distance from the previous index based on the premise that 1 or 2 bytes will be enough to store the delta distances in most of the cases. Basing CSX on CSR-DU has the benefit of using a more compressed storage format than the standard CSR as a baseline, giving rise to performance improvements even in case where no substructures are detected.
The Data Structures
CSX replaces both the rowptr and colind arrays of the standard CSR with a single control byte-array containing all the required information, called ctl (Fig. 2) . Similar to CSR-DU, CSX divides the matrix into units. In CSX's terminology, a unit represents a substructure inside the sparse matrix, will it be a substructure (substructure unit) or a sequence of column index delta distances represented by the same number of bytes (delta unit). A CSX unit is comprised of two parts: the head and the body. The head contains a 2-byte descriptor of the unit and its initial column index (ucol field) encoded as a variable-size integer. The 2-byte descriptor stores a 6-bit ID of the substructure unit (id field) and the size of the unit in nonzero elements. The nr bit denotes the start of a new row. Finally, the body part is present only in delta units and stores the column index delta distances as fixed-size integers.
The rjmp bit and the ujmp field serve a special purpose. Since CSX encodes nonhorizontal substructures, it is possible for substructures to group together all the elements of subsequent rows. For example, the antidiagonal substructure in Fig. 1c starting at position ð1; 9Þ contains also the sole element of the second row, leaving it empty. These rows must be skipped when computing the SpM Â V. Therefore, the rjmp bit denotes the existence of empty rows, while the ujmp field stores the number of empty rows to skip in a variable-size integer. Similar to CSR, CSX stores the nonzero elements of the matrix in a values array, but in a substructure row-wise order. For example, the substructures in Fig. 1c will be stored in the following order: horiz(1), anti-diag(1), bcol(4,2), vert(1), diag(2), bcol(4,2) and bcol(3,2). 
DETECTION AND ENCODING OF SUBSTRUCTURES
Mining the Matrix for Substructures
CSX can detect a variety of nonhorizontal substructures with the use of coordinate transformations on the nonzero elements of the matrix. To facilitate the detection process, CSX uses a special internal representation for the sparse matrix, which is a hybrid of CSR and a generic version of the COO format. Instead of simple nonzero elements, the CSX's internal representation stores generic elements; a generic element is either a single nonzero element or a substructure. Similarly to the COO format, each generic element is stored as an ði; j; vÞ tuple, lexicographically sorted on the ði; jÞ. In the case of a substructure, ði; jÞ represents the coordinate of the first element in the substructure and v is an array of the substructure's elements. We also keep CSR's row pointers for fast row accessing.
CSX detects substructures by applying run-length encoding on the column indices of the matrix. The run-length encoding first computes the delta distances of the column indices and then assembles groups, called runs, from the same distance values (Fig. 3) . Each run is identified by the common delta value and its length. Runs with length greater or equal to two form a substructure; nonetheless, we impose a restriction on the minimum length of a run (currently set to four) to avoid a proliferation of very small runs that could be more of an overhead than a benefit. The process of scanning the whole matrix for a specific type of substructures is described in Algorithm 1; we iterate over the rows of the matrix and for each row we gather all the column indices that are not yet part of a substructure and apply run-length encoding. For all the detected substructure instantiations (different delta values), we keep statistics (number of units and number of nonzero elements covered) to guide us on the final selection of the substructures to encode in CSX. Algorithm 1. Substructure detection in CSX.
1: procedure DETECTSUBSTR(matrix::in, stats::inout) matrix: CSX's internal repr., lexicographically sorted stats: substructure statistics 2:
colind ;
. Column indices to encode 3:
for all rows in matrix do 4:
for allgeneric elements eði; j; vÞ in row do 5:
if e is not a substructure then 6: colind colind [ e:j 7: continue 8: enc RLENCODEðcolindÞ 9:
UPDATESTATSðstats; encÞ . Update statistics for this encoding 10:
enc RLENCODEðcolindÞ 12:
UPDATESTATSðstats; encÞ . Update statistics for this encoding 13:
Detecting Nonhorizontal Substructures
The detection process described above "sees" just column indices and the only requirement is that these indices must be sorted. Detecting horizontal substructures is, therefore, straightforward, because the matrix elements are arranged in row-wise (horizontal) order and are lexicographically sorted by default. To detect linear nonhorizontal substructures, it suffices to transform the matrix elements into the desired iteration order, sort them lexicographically and use the DETECTSUBSTR() procedure described in Algorithm 1.
For the 2D substructures, we segment the matrix into fixedwidth bands, either row-or column-aligned, and apply a space-filling transformation inside every band, passing the resulting sequence to the same procedure. In total, the use of coordinate transformations has allowed CSX to support seamlessly 18 different substructure types and enables the straightforward expansion to other substructure families (for a detailed description of the coordinate transformations, see Supplementary Material, Section III-B, available in the online supplemental material). Fig. 4 shows the substructures detected by CSX in our matrix suite and is characteristic of the capability of CSX in detecting a variety of substructures inside a sparse matrix. It is very interesting that CSX was able to detect and encode even unusual substructures, such as the diagonal ones with delta distances of 857 and 1,714 in the pre2 matrix. In matrices dominated by dense blocks, CSX was able to detect large blocks in significant amounts, as is the case of the inline_1 and af_k_101 matrices. The detection of large dense blocks not only has a positive impact on the overall compression ratio, but also provides a performance advantage in the SpM Â V computations.
Substructure Types and Their Instantiations
In CSX, we make a distinction between a substructure type and its instantiation. A substructure instantiation denotes how exactly a substructure type is encountered inside the sparse matrix. For example, the horizontal substructures with d ¼ 1 and d ¼ 20 of Fig. 3 belong to the same substructure type (horizontal), but are different instantiations. A substructure type, therefore, may have indefinitely many instantiations in a sparse matrix, which adds great flexibility to the CSX format, allowing it to detect almost every dense pattern inside a sparse matrix. We should note here that the 6-bit id field in the ctl structure of CSX ( Fig. 2) refers to the exact substructure instantiation being encoded by the related unit and not to an abstract substructure type. This has the downside of limiting the total number of substructure instantiations in a sparse matrix to 64, but, in practice, only 4-5 instantiations are selected for encoding (see Fig. 4 ).
Selecting Substructures for Final Encoding
The selection of the substructures for the final encoding is described in Algorithm 2, which is a typical local search optimization algorithm. More specifically, for each available substructure type, we transform the matrix to the corresponding iteration order and scan it, collecting statistics for the examined substructure type. After having collected statistics for all the available substructure types, we filter out all the instantiations that fail to surpass a certain threshold in encoding the nonzero elements of the matrix (empirically set to 5 percent of the total nonzero elements) and proceed with the selection of the most suitable type for encoding the matrix (SELECTTYPE()). The algorithm then proceeds by encoding the matrix with the selected substructure type (ENCODESUBSTR()) and repeating the same procedure until no type can be selected.
The criterion we use for selecting the substructures to encode is a rough estimate of the achieved reduction in the size of the original CSR's colind structure. Assuming that we keep a single, full column index per detected substructure (ignoring delta units), the gain in the CSR's colind size is
where N units is the total number of encoded substructure units, and NNZ enc is the number of the nonzero elements encoded by the substructure type. 
The indefinitely large and a priori unknown number of substructure instantiations inside a sparse matrix dictates the use of runtime code generation for the substructure-specific SpM Â V routines. We have built a new Just-In-Time (JIT) compilation framework for CSX based on Clang and LLVM [11] . In our initial implementation of CSX [15] , we used to generate the LLVM intermediate representation (IR) directly through the related library interface. However, this is a tedious and error-prone task (also in terms of performance), because not only a good understanding of the intermediate representation is required, but also the actual SpM Â V code is "obfuscated" by the complex calls needed to construct the IR. For this reason, we now generate simple C code for the substructure-specific SpM Â V routines and use Clang to generate optimized LLVM IR. The overhead of parsing the resulting C source (one translation unit, less than 150 l.o.c.) and generating the optimized LLVM IR is negligible compared to the matrix preprocessing time to justify the extra pain of maintaining a pure LLVM IR codebase or even an on-disk cache of precompiled SpM Â V routines. Fig. 5 shows how the just-in-time compilation is organized. CSX maintains a directory of C source templates (not to be confused with the C++ templates), which define a set of text hooks to be filled in the runtime. For each substructure type, we maintain a source template for performing the SpM Â V computation inside the substructure, and we also have a "top-level" template that is responsible for the execution of the full SpM Â V kernel in the CSX format. After selecting the substructures for encoding and generating the CSX's data structures, we pick the suitable templates and generate the corresponding SpM Â V C code passing it to the Clang front end. We program the front end to emit an optimized LLVM IR module, from which we get a function pointer to the generated SpM Â V kernel and, eventually, the LLVM back end takes over to generate the final native code for the host machine.
The SpM Â V kernel that we execute for CSX is shown in Algorithm 3. First, we should note that in CSX we must explicitly zero out the output vector (lines [4] [5] , an artifact that is prescribed by the use of multiple nonhorizontal substructures. However, the effect of this operation is very small, especially in multithreaded configurations, where the initialization is performed in parallel. The algorithm iterates over the ctl structure decoding a field at a time. If a new row is detected, we store the computed dot product (yr) in the output vector and advance the current position (y curr ) with the __NEW_ROW_HOOK(). If the matrix does not contain row jumps, the generated code simply advances y curr by one. In the opposite case, it checks the rjmp bit and if there is a jump, it advances y curr with the value of the variable size integer containing the jump size. The algorithm proceeds by decoding the variable size integer containing the delta distance from the previous column index (DECODECOLUMN()) and computes the starting column index (x curr ) of the current CSX unit. It then retrieves the ID of the unit and executes the unitspecific SpM Â V code within the __BODY_HOOK(). The __BODY_HOOK() is actually replaced by a C switch statement, switching on the id variable. The unit-specific SpM Â V routines are responsible for correctly advancing the ctl structure and the current column index, as well as updating the local accumulator yr, if necessary. 
__NEW_ROW_HOOK() . Advances y curr 15:
x curr x 16:
x curr x curr þ DECODECOLUMNðctlÞ 17:
id GETID(flags) . Retrieve the ID of the next unit 18:
__BODY_HOOK() . Unit-specific SpM Â V code 19: until ctl ends Since we encode specific substructure instantiations in CSX, we know during the compilation time the exact delta distances of the one-dimensional substructures and the exact block dimensions of the 2D ones. This allows us to generate efficient code with hard-coded constant delta distances and constant block dimensions, matching the computational advantage of the fixed-size blocks of BCSR. Additionally, if only one substructure instantiation is encoded, we optimize out the switch statement completely.
Parallelization
CSX follows the parallelization pattern of CSR: during the construction of the CSX's internal representation, we split the input matrix row-wise maintaining roughly the same number of nonzero elements per partition. After this point, the detection and encoding phases proceed independently, producing different final CSX submatrices per partition. The SpM Â V kernel in Algorithm 3 changes only in line 2, where y curr is initialized to the beginning of the partition and in line 4, where we iterate within the bounds of the partition.
TACKLING THE PREPROCESSING COST
The preprocessing time of CSX is bounded by the multiple calls to the LEXSORT() routine, which performs a lexicographic sort on the transformed matrix elements, during the scanning for substructures (Algorithm 2, lines 4-8) . We address this issue with a combination of partial sorting and sampling. More specifically, instead of scanning the matrix as a whole, we split it into constant size preprocessing windows based on the nonzero elements and scan every window separately. This modification reduces automatically the asymptotic complexity of the scanning phase from ÂðNNZ lg NNZÞ to ÂðNNZÞ at the expense of missing the substructures that cross the boundaries of the windows. To further reduce the preprocessing cost, we only examine a certain number of windows uniformly distributed over the whole matrix covering only a small amount of the total nonzero elements.
Having minimized the substructure scanning phase, the preprocessing time is now bound from the final encoding of the matrix (Algorithm 2, lines 12-14), which is still in the order of ÂðNNZ lg NNZÞ, because the whole matrix must be encoded, after all. Nonetheless, the impact of this phase on the overall preprocessing cost is not so crucial for two main reasons: in most of the cases, a very small number of substructure types will be finally encoded, and the full cost of the sorting will be paid only during the encoding of the first substructure type, because every time we sort only the unencoded elements.
In addition to the use of sampling and preprocessing windows, we further reduce the preprocessing cost by parallelizing the whole process. The parallelization is straightforward: after loading the initial CSR matrix, we setup a new thread for every partition that proceeds independently with the scanning and the construction of the final CSX matrix. The preprocessing cost of CSX for detecting all the supported substructure types ranges now in the order of 100 serial CSR SpM Â V operations (see the approximately 500 operations of [15] ), rendering it viable even for online processing of the input matrix.
PORTING TO NUMA ARCHITECTURES
The key factor for the high performance of SpM Â V in NUMA architectures is the correct placement of the involved data on the system's memory nodes [2] , [16] . The data each thread accesses must lie on its local memory node, to avoid the saturation of its peers' memory or interconnection links and also the increased latency incurred by the multiple hops required for a remote access. To avoid the programming burden of converting the SpM Â V algorithm to be aware of the data placement, to exploit the NUMA capabilities of a machine, we implemented a low-level interleaved allocation scheme that takes care of the correct data placement transparently to the application code. The idea is simple: we keep the allocation continuous in the virtual address space, in order not to influence the application code, and instruct the OS to physically allocate the pages according to the desired placement. We have used this mechanism for allocating the SpM Â V vectors and for implementing the NUMAaware versions of the formats considered in this paper with the least effort. Details of the implementation are provided in Supplementary Material, Section III-C, available in the online supplemental material.
Optimizing the Computations
Modern NUMA architectures have an important side effect on the execution of the SpM Â V kernel: the increased memory bandwidth when accessing a local memory node exposes more the computational part of the kernel, which is no longer negligible. CSX performs some heavy decompression operations in decoding the variable size integers used to store the column delta distances and the row jumps. For this reason, we replace the column delta distance with the full starting column index of the substructure, stored in a standard 32-bit integer. This optimization degrades the overall compression ratio 2-3 percent, but the gain in performance can exceed 15 percent in certain matrices. Finally, the proliferation of delta units when encoding multiple substructures can also overwhelm the benefit of compression in NUMA architectures due to the increased performance overhead of the switch statement (branch mispredictions). For this reason, we revise our score function for these architectures to include an estimate of the total switch statements (N switch ) executed as follows:
This heuristic balances better the tradeoff between the memory and the computational part of the kernel by penalizing the encodings that lead to a large computational cost. We have encountered more than 20 percent performance improvement for certain matrices in NUMA-aware thread configurations with the revised heuristic.
EXPERIMENTAL EVALUATION
Setup and Methodology
For the evaluation of the CSX format, we have selected 30 matrices from the University of Florida Sparse Matrix Collection [17] (see Supplementary Material, Table II , available in the online supplemental material). We made an effort to include large matrices from various scientific disciplines with different nonzero element structures. To evaluate the adaptivity and the performance stability of CSX in unfavorable cases, we have selected a large amount of matrices with a rather irregular structure. These matrices are challenging for memory bandwidth minimization formats, because contention to the memory bandwidth is not the key performance problem in these cases [2] .
The alternative storage formats we consider in this paper are BCSR and VBL. We have selected these formats as the most established paradigms of CSR alternatives and as the best representatives of the fixed and variable size blocking storage formats, respectively. We have implemented our own optimized versions of all these formats, including NUMA-aware implementations. Specifically for BCSR, we have implemented block specific, optimized SpM Â V routines for all the block sizes (1D and 2D) up to size eight plus the 3 Â 3 block. The results reported for BCSR correspond to the best performing block, which was obtained after an exhaustive search of the 20 available blocks.
Our computational testbed comprises a symmetric shared memory system and a NUMA platform (see Supplementary Material, Table III , available in the online supplemental material). Specifically, the Dunnington system is a four-way six-core Intel Xeon X7460 configuration (24 cores, symmetric shared memory), and the Gainestown system is a two-way quad-core Intel Xeon W5580 configuration (eight cores/16 threads, NUMA). All systems were running a 64-bit version of the Linux OS (kernel version 2.6) and we used LLVM 2.9 for the compilation of the SpM Â V routines for all the considered formats. Finally, for the assignment of threads to logical processors, we followed a "share-all" policy, by first "filling" a socket before moving to the next one, with the exception of Gainestown; in this case, we did not use the HyperThreading feature but for the last 16-thread configuration. For a more detailed description of the experimental setup, the reader is referred to Supplementary Material, Section IV-A, available in the online supplemental material. Fig. 6a shows the speedups achieved by all the considered methods in the Dunnington platform. The effect of compression is dominant in SMP architectures, especially in the multithreaded configurations. CSX offers a 41.8 percent performance improvement over CSR on average, VBL follows further behind with a 28.8 percent improvement, while BCSR is able to achieve only a bare 6.3 percent improvement. BCSR suffers from its inherently low compression capability and the use of padding to construct full blocks. It can be classified as an "all-or-nothing" storage format, because in almost half of the matrices of our suite, it degraded SpM Â V performance more than 30 percent on average, while for the other half matrices it provided a more than 25 percent performance improvement. The compression potential of VBL and CSX is definitely higher than BCSRs and this is clearly depicted by the speedup diagrams in Dunnington. The ability of CSX to detect and encode multiple types of substructures, and especially blocks, is a significant advantage of CSX over VBL, not only in terms of pure compression ratio, but also in terms of the involved SpM Â V computations, because CSX keeps the computational advantage of the BCSR's fixed size blocks, thanks to the runtime code generation, without paying the padding overhead at all. Indeed, CSX starts off with a significant 13.1 percent performance advantage over CSR right from the single-threaded configuration.
The Performance of the CSX Format
In NUMA architectures, where the available memory bandwidth is considerably higher, the performance landscape changes, but CSX remains still the most performant storage format across the full range of multithreaded configurations. Fig. 6b shows the speedups of the considered storage formats in Gainestown using a NUMAaware data placement. The very first observation is that CSR is now rather competitive, because the memory bottleneck is not so intense as before until the 16-threaded configuration, where HyperThreading is also enabled. CSX achieves a 17.5 percent performance improvement over CSR when using eight threads and increases the gap to 20.7 percent in the 16-threaded configuration, where the memory bandwidth contention becomes visible. The respective numbers for VBL are 8.4 and 13.9 percent for the two aforementioned configurations. The case of BCSR is interesting, because it manages to achieve a performance comparable to VBL, thanks chiefly to the ample memory bandwidth that better exposes its computational advantage. While VBL's performance falls behind CSRs up to the two-threaded configuration, reaching a 23.5 percent performance degradation for the single-threaded one, CSX experiences a slight 2.8 percent performance degradation only for the single-threaded configuration. From the twothreaded configuration, however, CSX is already ahead of every other format having achieved an 11.4 percent performance improvement over CSR and reaching 20.7 percent when the full system is utilized.
This behavior can be further explained by examining the per-matrix performance results for eight threads in Gainestown presented in Fig. 7 . We have selected this configuration instead of the 16-threaded configuration as being less favorable for CSX and most representative of systems, where the memory bottleneck is not so intense. Observing the performance results, we can first separate our matrix suite into two matrix categories: 1) lowperforming matrices (%3 Gflop/s in CSR) and 2) highperforming matrices (%5 Gflop/s in CSR). The first category is formed by matrices with an irregular nonzero element structure and very short rows; SpM Â V is rather latency-bound than bandwidth-bound in this case, because it suffers chiefly from cache misses in the input vector [2] . The second category, which is the most typical, consists of matrices with a more regular structure arising mostly from the discretization of PDEs; the key problem here is the contention for memory bandwidth. CSX manages to achieve considerable performance improvements in matrices of the second category, approaching the upper bound of dense-matrix vector multiplication performance in this configuration (8.5 Gflop/s). In the low-performing matrices, CSX exhibits a rather stable behavior, matching or even surpassing the performance of CSR. Although VBL and BCSR exhibit similarly high performance in more regular matrices, with BCSR matching the performance of CSX in block-dominated matrices (xenon2, consph, m_t1, bmwcra_1, inline_1), the situation changes for irregular matrices, where both BCSR and VBL tend to exhibit a significant performance degradation, due to their increased overhead in matrix size and computations. The advantage of CSX is also clear in matrices dominated by diagonal patterns (e.g., torso3, cage13, atmosmodj), which cannot be efficiently exploited by the alternative formats considered.
In Table 1 , we exhibit quantitatively the performance stability of CSX in the eight-threaded, NUMA-aware configuration in Gainestown. 1 The competitiveness of CSR and BCSR is clear from this table, gaining seven and nine matrices, respectively. Nonetheless, even in this not so favorable configuration, CSX manages to achieve the absolute best performance in the majority of matrices (12), while VBL contents itself to just two matrices. The most important information of this table, however, are the performance differences of each format from the best overall performance. Specifically, for each considered format, we present its average performance difference from Fig. 7 . The SpM Â V performance in Gainestown for every sparse matrix in our suite using eight threads in a NUMA-aware configuration. the absolute best for all the matrices that it did not gained. We also present a 95 percent confidence interval (C.I.). 2 These metrics confirm the predominance of CSX both in terms of overall SpM Â V performance and in terms of adaptivity to the different characteristics of each matrix. Despite obtaining the absolutely best performance in less than half of the matrices, CSX manages to be as close as 5.8 percent to the overall best performance. VBL follows further back with an average difference of 12 percent from the best performance, while BCSR and CSR come last with 20.1 and 21.2 percent differences, respectively. The stability of CSX is also superior to the other considered formats as it is depicted by the computed confidence intervals, while BCSR, as expected, is the least stable format with a performance variation of almost 5 percent. CSX can be, therefore, considered as a high-performance storage format for sparse matrices, because not only achieves the highest performance in the majority of matrices in both SMP and NUMA architectures, but also manages to stay very close to the overall best performance in the rest of the matrices, including corner cases, such as very irregular matrices, exhibiting significant performance stability compared to other CSR alternatives.
Before closing the presentation of the performance of CSX, it is important to emphasize the effect of the data placement in NUMA architectures. The correct placement allows SpM Â V to scale beyond the first socket, while a non-NUMA implementation encounters a performance slowdown, due to saturation of the interprocessor link by the remote accesses (we measured a sustained bandwidth of %9 GB/s.). Using our interleaved allocator (Section 7) for the allocation of the matrix's data structures, the data placement remained completely transparent to the application level and SpM Â V was able to exploit the NUMA advantage at a minimal developer's cost.
The Preprocessing Cost
CSX can be programmed to detect only delta units or a specific subset of the supported substructures, for example, horizontal substructures only. This is a simple way to reduce the preprocessing cost. Nonetheless, this kind of reduction comes at a cost of the overall CSX performance as it is depicted in Fig. 8 . The cost when detecting only delta units (CSX-delta) or horizontal substructures (CSX-horiz) is very small ranging from 16 to 35 CSR SpM Â V operations, but only a 14.8 to 23.9 percent performance improvement is achieved. To exploit its full potential, CSX must be programmed to detect as many substructure types as possible. In this case, however, the preprocessing cost approaches the 500 serial CSR SpM Â V operations. Though not irrational even for online preprocessing, this cost is large and can eradicate the performance benefit of CSX in the midterm. The use of statistical sampling drops the preprocessing cost nearly an order of magnitude, at 88 SpM Â V iterations, with a minimal impact in CSX's overall performance, not exceeding 1.5 percent. A key aspect for sampling a sparse matrix with CSX is to use a lot of sampling windows scattered all over the matrix, to obtain a "good look" of the whole matrix and avoid any over-or underestimation of the presence of certain substructures. In our case, we have used 48 sampling windows for sampling only 1 percent of the total nonzero elements of the matrix. Finally, as a matter of reference, we should mention here that the cost of sampling in the initial CSX implementation [15] surpassed 300 SpM Â V operations, revealing the considerable progress performed in optimizing the preprocessing cost of CSX.
Integrating CSX into Multiphysics Simulation Software
As a further evaluation of the potential of CSX in optimizing SpM Â V in the context of a multiphysics application, we have integrated it into the Elmer [12] multiphysics simulation software. Elmer employs iterative Krylov subspace methods for treating large problems using the preconditioned BiCGStab method for the solution of the resulting linear systems. Elmer supports parallelism across multiple nodes using MPI, but uses only a single thread inside every node. For this reason, we have also implemented a multithreaded CSR version for Elmer, to achieve a fair comparison with CSX. Fig. 9 shows the average speedup achieved by Elmer in a 24-node, two-way, quad core Intel Xeon E5405/E5335 (Harpetown/Clovertown) mixed cluster (192 cores) for five large (>576 MiB) benchmark problems of the Elmer test suite. We have also used a diagonal preconditioner, which consumed almost half of the execution time in three of our benchmarks. We show results after 1,000 linear system iterations, because in practical applications the solver may need thousands of iterations to converge. CSX, including its preprocessing cost, was able to improve the performance of the Elmer's SpM Â V component by Fig. 8 . The preprocessing cost of CSX in Dunnington using 24 threads. Fig. 9 . Speedup of the Elmer multiphysics simulation software by employing CSX (preprocessing time is included) after 1,000 linear system iterations. The knee in the diagrams is due to the lower memory bandwidth of the Clovertown nodes, which we start to use at the 16-node configuration.
37 percent over the multithreaded CSR, while the performance of the overall solver was improved by a noticeable 14.8 percent, despite the large preconditioning cost in three of our benchmarks. We believe this improvement could have been much more significant, have other parts of the solver been also multithreaded.
RELATED WORK
Research in sparse matrices has been active since the times of the first computer systems. Early descriptions of the CSR format go back in late 1960s, where the CSR format was described as a possible way of storing a sparse matrix [18] . One of the first and very revealing surveys on the indexing structures of sparse matrices, dating back in 1973, is that of Pooch and Nieder [19] . In this paper, the authors describe a series of different indexing structures for sparse matrices, referred to as rowcolumn schemes, including the Coordinate format, the CSR and BCSR, and also designate the use of delta indexing of the column indices. The term Compressed Sparse Row format or CSR was "standardized" by Saad [5] , [20] , who also coined the terms Coordinate format and Blocked Sparse Row (BSR), later standardized as Blocked Compressed Sparse Row by Im and Yelick [8] , [21] . Several research has been conducted on BCSR and its variations [22] , including register and cache optimizations [8] and autotuning [23] , which culminated in the OSKI sparse kernel library [24] . Pinar and Heath [7] employ variable sized one-dimensional blocks and present the Variable Block Length format, which we discussed in more detail in previous sections. Agarwal et al. [6] decompose the input matrix into multiple submatrices, each one storing a different substructure. Belgin et al. [14] also employ the same decomposition technique in their Pattern Block Row (PBR) format. Specifically, they store in each submatrix an arbitrary block pattern and generate specific C SpM Â V routines for each pattern. Another interesting approach is the Compressed Sparse Block (CSB) format proposed recently by Buluç et al. [25] , [26] , targeted at supporting efficiently both the Ax and A T x operations. CSB divides the matrix into large sparse square blocks, which are stored with the coordinate storage scheme, but using small integers for the row and column indices. The authors employ also task parallelism across the different blocks. Willcock and Lumsdaine [27] take a different approach and apply delta compression explicitly in the column indices of CSR, proposing the Delta-Coded Sparse Row (DCSR) format. Kourtis et al. [28] take a similar approach by delta-encoding the column indices and applying run-length encoding for detecting arbitrary horizontal substructures in the sparse matrix. The authors also propose a storage format for compacting the nonzero values by keeping only unique values and an associated indexing structure [10] .
CONCLUSIONS
In this paper, we have presented and evaluated the performance of the Compressed Sparse eXtended sparse matrix storage format in modern multicore architectures. CSX integrates in a single storage format the most commonly encountered substructures inside sparse matrices, including horizontal, vertical, diagonal, and two-dimensional substructures. CSX is able to minimize the memory footprint of the sparse matrix, alleviating the memory bandwidth demands of the streaming SpM Â V kernel. Compared to other CSR alternatives, CSX is able to provide considerable performance improvements not only in symmetric shared memory architectures, but also in NUMA platforms, where the computational part of the kernel is more prominent, exhibiting consistent and high performance across a variety of matrices. Most notably, this performance advantage does not come at an excessive preprocessing cost and CSX was able to considerably accelerate the performance of a multiphysics simulation, despite its initial bootstrap cost.
As a future research direction, we plan to investigate alternative parallelization schemes for CSX, for example, task-parallelism techniques, and experiment with more advanced heuristics for the selection of the substructures for encoding in CSX. Finally, we plan to exploit the insight of the matrix structure that CSX offers during its preprocessing phase, to guide the SpM Â V execution for achieving high performance and increased energy efficiency. He also serves as the vice chairman for the Greek Research and Education Network (GRNET-Greek NREN, www.grnet.gr), vice chairman for the Free/Libre/Open Source Software nonprofit company (ELLAK www.ellak.gr), founded by the Greek Universities and Research Centers. He is leading GRNET efforts in developing a public cloud infrastructure, codenamed "okeanos," offering IaaS to all the academic community (okeanos.grnet.gr). He is a member of the IEEE Computer Society, member of the IEEE-CS TCPP and TCCA, senior member of the ACM, and chairs the Greek IEEE Computer Society Chapter.
. For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
