Abstract-We examine the implementation of block compressed row storage (BCSR) sparse matrix-vector multiplication (SpMV) for sparse matrices with dense block substructure, optimized for blocks with sizes from 2x2 to 32x32, on CPU, Intel many-integrated-core, and GPU architectures. Previous research on SpMV for matrices with dense block substructure has largely focused on the design of novel data structures to optimize performance for specific architectures or to store variable-sized, variably-aligned blocks, but depending on alternate storage formats breaks compatibility with existing preconditioners and solvers or imposes significant runtime costs when converting between matrix formats. This paper instead focuses on the optimization of SpMV using the standard block compressed row storage (BCSR) format. We give a set of algorithms that performs SpMV up to 4x faster than the NVIDIA cuSPARSE cusparseDbsrmv routine, up to 147x faster than the Intel Math Kernel Library (MKL) mkl dbsrmv routine (a single-threaded BCSR SpMV kernel), and up to 3x faster than the MKL mkl dcsrmv routine (a multi-threaded CSR SpMV kernel).
I. INTRODUCTION Sparse matrix-vector multiplication (SpMV) computes the operation y ← αy + βAx, where A is a sparse matrix and x and y are dense vectors. It dominates the running time in many scientific and engineering applications and is notorious for sustaining low percentages of machine peak performance due to its typically poor cache usage and memorybound nature. Improving performance of SpMV generally requires selecting appropriate data structure transformations and memory access patterns for the matrices being used and the underlying hardware architecture.
Sparse matrices arising from higher-order discretizations and problems with multiple degrees of freedom per mesh point often exhibit a dense block substructure in which dense blocks are constant-sized and aligned. The Block Compressed Sparse Row (BCSR) format is a popular choice for representing matrices of this type; a variant of the standard Compressed Sparse Row format, it stores pointers to block rows and indices of block columns, as opposed to storing pointers and indices for individual elements.
This paper does not seek to compare the BCSR format against other formats; rather, it intends to explore the optimization of BCSR SpMV algorithms on various architectures. The applications of interest to us produce matrices with aligned, constant-size blocks and use BCSR as a native storage format. BCSR may not be the optimal format for SpMV, but designing SpMV algorithms compatible with BCSR maintains compatibility with existing application and linear solver software while avoiding expensive data structure conversions.
This paper contributes algorithms for the efficient execution of SpMV using BCSR on shared-memory parallel architectures, including traditional CPU architectures (e.g., Sandy Bridge), the Intel Knights Corner (KNC) many-integratedcore architecture, and NVIDIA GPU architectures. The paper is structured as follows: Section II presents an overview of related work, Section III describes our algorithms in detail, Section IV compares our algorithms with other BCSR and CSR implementations, and Section V concludes our paper.
II. RELATED WORK
A great amount of literature exists regarding the optimization of SpMV for various parallel architectures using standard non-block formats. Williams et al. investigate the tuning of CSR SpMV on AMD and Intel CPU architectures (among others) [13] , Bell and Garland compare the performance of SpMV using a variety of sparse matrix formats on NVIDIA GPUs [3] , and Saule et al. review the performance of CSR on Intel Xeon Phi (KNC) [9] . Greathouse and Daga propose an algorithm called CSR-Adaptive, which maintains the standard CSR format but improves performance on GPUs [6] . Some researchers, including Liu et al. [8] and Baskaran and Bordawekar [2] , have developed new sparse matrix representations to optimize performance on specific architectures. Other researchers have developed alternate matrix formats to optimize SpMV when matrices exhibit block substructure. Vuduc and Moon offer the Unaligned Block Compressed Sparse Row (UBCSR) format [12] and Shahnaz and Usman present the Sparse Block Compressed Row Storage (SBCRS) format for matrices with variablesized and variably-aligned blocks [10] .
This paper seeks to focus on optimization of sparse matrix-vector multiplication using the standard Block CSR (BCSR) format. As mentioned in Section I, BCSR may not be the optimal format for SpMV, but designing BCSR SpMV algorithms maintains compatibility with existing BCSR linear solver software while avoiding expensive data structure conversions. BCSR was first introduced for CPU architectures by Im and Yelick [7] . Choi et al. found GPU-specific optimizations for Block CSR algorithms to be insufficient in outperforming the cuSPARSE HYB kernel [4] , but recent work by Abdelfattah et al. provides Block CSR algorithms that outperform cuSPARSE HYB for both structured and unstructured matrices [1] . Saule et al. investigated the performance of various SpMV and SpMM algorithms on Intel Xeon Phi, but found that using BCSR for the purposes of accelerating ordinary non-block SpMV did not improve performance [9] . In this paper, we explore Block CSR SpMV algorithms for CPU, GPU, and Xeon Phi architectures and compare them against Abdelfattah et al.'s algorithms (as implemented in KSPARSE 1 ) and the proprietary algorithms in the NVIDIA cuSPARSE and Intel MKL libraries.
III. BCSR SPMV ON SHARED-MEMORY PARALLEL ARCHITECTURES

A. Storage Format -Block CSR
The algorithms presented in this paper use sparse matrices stored in the Block Compressed Sparse Row (BCSR/BSR) format with column-major blocks. BCSR stores nonzero blocks contiguously, row by row, in an array val of length nnzb · bs 2 (where nnzb is the number of nonzero blocks and bs is the block size). For each nonzero block, an entry is added to an array col idx (also of length nnzb) with its column index. An array row ptr stores m + 1 values; each value stores the index of a row's first block in val and col idx. For a more in-depth description of the Block CSR format, please refer to Richard Vuduc's Ph.D. thesis [11] .
We have elected to use column-major blocks in order to improve temporal cache locality for x (the dense vector in the SpMV operation). By streaming through columns of val, we access elements in x consecutively; a row-major layout would access segments of val multiple times, causing L1 evictions for larger blocks. Additionally, using column-major blocks simplifies the reductions for the GPU algorithms presented in Sections III-D1 and III-D2.
B. High-Level Algorithm Design
In our algorithm, we assign a parallel worker (a thread or core for CPU and CPU-like architectures, or a warp of threads for GPU architectures) to one or more block rows. In each of its block rows, a parallel worker iterates down each column, across the block row, in order to achieve streaming access to the val array (which has column-major blocks stored row-wise). The manner in which a parallel worker does this is tuned for different hardware architectures (e.g., 1 http://ecrc.kaust.edu.sa/Pages/ksparse.aspx.
for GPU architectures, each warp has threads enabling it to iterate through multiple columns at a time) and is described in Sections III-C and III-D.
C. CPU and Many-Integrated-Core Architectures
For these architectures, we delegate the block rows of the sparse matrix across n threads, such that each thread is responsible for approximately mb/n block rows (where mb is the number of block rows in the sparse matrix).
Each thread iterates over its block rows serially, using a temporary output array in registers for each block row. After processing the block row, the thread updates the output array y. Threads do not interact. A pseudocode implementation of this algorithm is shown in Algorithm 1. Memory access patterns for this algorithm are improved over most CSR algorithms. Each thread achieves streaming access to val; the access pattern is also predictable by a hardware prefetcher, so access latencies are reduced. If the architecture offers a large cache and bs is small, the algorithm also achieves streaming access to x and col idx. row ptr is accessed only twice per thread, so its memory access cost is generally insignificant.
This algorithm is also friendly to SIMD vectorization. If bs is known at compile time, the innermost loop over the elements in a column has compile-time bounds and can be vectorized by the compiler. In our tests, vectorization provided up to 4.7x the performance of non-vectorized code on KNC.
Reuse of x is limited; though better than naive non-block CSR, x is reused a maximum of bs times. Performance could be improved by tiling blocks to improve temporal locality; this possibility is discussed in Section V, Conclusions and Future Work.
D. GPU Architectures
In this section, we seek to adapt the previous algorithm for highly multi-threaded GPU architectures. The CUDA programming model operates with a large number of threads operating in SIMT (single instruction, multiple thread) style, where a group of 32 threads (known as a warp) concurrently execute the same instruction. The threads in a warp must work cooperatively; if threads diverge and issue different sets of instructions, the warp scheduler will mask off parts of the warp, executing different branch paths separately and reducing the rate at which instructions can be issued.
For optimal performance, the threads in a warp must also access contiguous segments of memory. NVIDIA devices have a global memory bus width of 128 bytes, which can hold 16 8-byte double-precision values. If 16 threads access a sequential range of doubles that lie in a 128-byte row of DRAM in the same SIMD instruction issue, the memory accesses is "coalesced" into a single memory access instruction.
In the following three sections, we propose algorithms that each assign a warp to a single block row, but assign the threads within a warp to the elements in a block row in different ways based on the block size of the matrix.
1) Block row per warp, operating by block:
In this algorithm, a warp is assigned to a single block row, and threads in the warp are assigned to elements in the block row column-wise (see Figure 2 for an illustration of thread assignment). To cover the entire block row, a warp iterates over the row's blocks, handling 32/bs 2 blocks at a time, where bs is the block size. This is the number of blocks that a warp can cover completely if each thread is assigned to a single element. For example, with a block size of 3, the warp can cover 3 complete blocks (27 elements) at a time. The warp will only cover complete blocks; in the case of 3x3 blocks, five threads will be inactive in each iteration
Because a warp only covers complete blocks, a thread's assigned position within a block will never change between iterations. On initialization, a thread calculates the index of the block row it is targeting, finds pointers to the start and end of its row from row ptr, and finds a pointer to the block it should begin its work on using the row start pointer and its lane number. It also calculates its assigned position within a block (r and c, where r is the row index within a block and c is the column index within a block). It then iterates through the block row, beginning at its assigned block and advancing by 32/bs 2 blocks at a time (the number of complete blocks that the warp covers), multiplying its target element in A by the corresponding value from x and adding the product to a register that it uses to maintain a running total. Once the threads in a warp have iterated through a block row, threads that covered the same vertical (r) position within a block reduce the values stored in their local registers and write the reduced output to global memory ( Figure 1 ).
This algorithm is described in pseudocode in Algorithm 2, and its memory access pattern is illustrated in Figure 2 . The reduction step could be implemented using a warp shuffle but is shown using shared memory in Algorithm 2. Figure 2 . Each thread in a warp reduces with other threads that also covered the same row within a block; the output is then written to global memory.
This algorithm exhibits high-performing memory access patterns for val and x. Accesses to val will be fully coalesced. Accesses to x are fully coalesced when the block size is 2 and partially coalesced for larger block sizes. Accesses to row ptr and writes to global out are T0   T2  T1  T3   T4  T6  T5  T7   T0  T2  T1  T3   T8  T10  T9  T11   T12 T14  T13 T15   T8  T10  T9  T11   T12 T14  T13 T15   T8  T10  T9  T11   T16 T18  T17 T19   T20 T22  T21 T23   T16 T18  T17 T19   T20 T22  T21 T23 1st iteration 2nd iteration 3rd iteration Figure 2 . Thread assignment to elements in the sparse matrix for the by-block algorithm for a matrix with 2x2 blocks. For the purposes of visualization, this is illustrated as if there were 8 threads in a warp (though in practice, the warp size will always be 32). Warp 0 (threads 0-7) handles block row 0, warp 1 (threads 8-15) handles block row 1, and warp 2 (threads 16-23) handles block row 2. T32 T35 T38  T33 T36 T39  T34 T37 T40   ⎤   ⎦   ⎡   ⎣  T41 T44 T47  T42 T45 T48  T43 T46 T49   ⎤   ⎦   ⎡   ⎣  T50 T53 T56  T51 T54 T57  T52 T55 T58   ⎤   ⎦   ⎡   ⎣  T59 T32 T35  T60 T33 T36  T61 T34 T37   ⎤   ⎦   ⎡   ⎣  T64 T67 T70  T65 T68 T71  T66 T69 T72   ⎤   ⎦   ⎡   ⎣  T73 T76 T79  T74 T77 T80  T75 T78 T81   ⎤   ⎦   ⎡   ⎣  T82 T85 T88  T83 T86 T89  T84 T87 T90   ⎤   ⎦   ⎡   ⎣  T91 T64 T67  T92 T65 T68  T93 T66 generally not coalesced, but these transactions represent an insignificant portion of all memory operations. Unfortunately, accesses to col idx are generally not coalesced, and since the values in col idx are required to access x, this adds latency to those memory transactions. However, as the algorithm uses few registers, many warps may fit into the streaming multiprocessors, so the GPU can generally hide the latency and keep the memory bus saturated by switching between warps.
Each thread accumulates a result in a register and there is no interaction between threads until the final reduction, so latency is low. The reduction always occurs within a warp, so no synchronization primitives are required.
While this algorithm theoretically achieves 100% theoretical thread utilization for block sizes that are powers of two (because warps can cover blocks evenly, with no threads leftover), it potentially produces a large number of inactive threads for other block sizes. Consider the case of 3x3 blocks; a warp can cover only 3 full blocks (27 elements), leaving 5 threads in each warp (15.6%) inactive. The algorithm uses few registers and achieves high occupancy, so it is generally able to issue enough instructions from a large number of warps to keep the memory bus saturated despite this decrease in active threads. However, this is an important consideration, and performance does decrease relative to the algorithm described in the next section (which is able to handle these cases more efficiently) for 3x3 and 5x5 blocks (see Figure 7) .
A more significant problem is the block size limitation that the whole-block constraint imposes. Because a warp of only 32 threads must cover an entire block, this algorithm cannot multiply matrices that have block sizes larger than 5. This motivates the design of a similar algorithm that better handles large block sizes.
2) Block row per warp, operating by column: This algorithm is similar to the previous one, but relaxes the requirement that warps cover whole blocks. Instead, warps cover whole columns, enabling larger block sizes (up to 32x32, where a warp would cover a single column) and a more efficient handling of matrices with block sizes that are not powers of two. Unlike the previous algorithm, a warp iterates through its block row's columns, covering 32/bs columns at a time. With the whole-block requirement replaced with a whole-column requirement, the assigned vertical position of a thread within a block (r) will never change, but a thread must compute its target block and target column in every iteration. The algorithm is shown in pseudocode in Algorithm 3 and its memory access pattern for 3x3 blocks -improved over that of the previous algorithm -is shown in Figure 3 . Like the previous block-by-block algorithm, this algorithm has minimal interaction between threads and achieves coalesced or partially coalesced accesses to val and x, but it is now able to handle large block sizes more efficiently. For the case of a matrix with 3x3 blocks, this algorithm has only 2 inactive threads (an improvement from the 5 inactive threads of the previous algorithm for this scenario). However, this comes at the cost of increased latency from integer operations. The algorithm must compute a block index and a horizontal position within that block, and the memory requests stall on these operations. By maximizing occupancy, we are usually able to minimize this additional latency, but the previous algorithm tends to outperform this one for block sizes up to 5x5.
3) Row-per-thread: When block sizes are sufficiently large, a simpler, more efficient algorithm may be introduced. For these cases, a CUDA thread block is assigned to a block row, and each thread within a thread block is responsible for a single row within the block row. A thread iterates through its assigned (non-block) row, multiplying each element by the appropriate value from x and accumulating the results in a register. A simple version of this algorithm is shown in Algorithm 4.
This algorithm achieves fully-coalesced accesses to val when the block size is a multiple of 16 and partiallycoalesced accesses when this is not the case. As threads use few registers, depend on little arithmetic for memory requests, and do not interact with other threads, occupancy is high and latency is low. Additionally, the inner loop can be unrolled for decreased instruction traffic and increased instruction-level parallelism.
In the basic implementation shown in Algorithm 4, access to x is not coalesced. This problem can be remedied by loading a portion of x into shared memory in a fullycoalesced fashion, where it can be reused by all threads in a thread block. An implementation of this is shown in Algorithm 5. Using shared memory in this way requires two barrier synchronizations for every block iteration, but we found this cost not to be significant. We find that this algorithm performs best for block sizes ≥ 16, as access to val and x have the opportunity to be fully-coalesced. See Figure 7 in Section IV-B.
The performance of this algorithm would seem to depend heavily on how well the matrix block size is matched to the thread block size; since the thread block size must be a multiple of 32, it would seem that the algorithm would perform poorly for a block size of 33, where 31 of 64 threads would be inactive. While performance certainly drops in this case, we did not notice as great of a performance impact as expected. The algorithm appears to have sufficient occupancy and instruction-level parallelism to hide latency despite the drop in active threads. Experiments with Intel hardware were run on the Sandia National Laboratories Compton test bed with two 8-core Sandy Bridge Xeon E5-2670 processors running at 2.6GHz (see Table 1 ) and a KNC Xeon Phi 3120A card (see Table  2 ). Intel ICC 15.0.2 and MKL 11.2.2.164 were used with the −O3 optimization flag. We compared our algorithm against MKL's mkl dcsrmv and mkl dbsrmv functions (CSR and BCSR, respectively). The mkl dbsrmv routine is not multithreaded, 2 so we have also compared our algorithm against mkl dcsrmv as a multithreaded reference.
Experiments with NVIDIA hardware were run on the Shannon test bed with an NVIDIA K80S dual-GPU card (see Table 3 ). Only one GPU on the card was used to run the tests. ECC was on and Boost Clock was off, which reduced the theoretical maximum bandwidth. GCC 4.9.0 and CUDA 7.0.18 were used. Implementations of our GPU algorithms used a texture cache to optimize accesses to x. We compared our algorithms against cuSPARSE's cusparseDbsrmv function and against KSPARSE's ksparse dbsrmv function.
Test matrices were obtained from the University of Florida Sparse Matrix Collection [5] and are shown in Table 4 . Matrices were selected from a variety of real-world applications. 2 The recent MKL 11.3 introduced a multithreaded BCSR algorithm for iterative solvers in the inspector-executor API, involving a separate setup/inspection phase in which MKL analyzes the sparsity pattern and applies matrix transformations. We did not compare against this algorithm. To measure the performance of algorithms, we measured the time it took to execute them 3000 times and then divided that time by 3000 to obtain an average execution time. We divided the execution time by the amount of unique data read (i.e. the total size of x, A.val, A.col idx, and A.row ptr) to determine the achieved throughput. This throughput represents the performance of the algorithm only and does not include the population of the input matrices or the zero-filling of the output matrices.
On the KNC architecture, we also tested a variant of our algorithm that pads columns within blocks in the A.val matrix to multiples of 8, so that each column (as accessed by the innermost loop in Algorithm 1) falls on a 64-byte boundaries. For example, for bs = 7, we access elements as val[block * 8 * bs + c * 8 + r] instead of val[block * bs * bs+c * bs+r]. This improves the performance of the inner loop in some cases, but requires additional memory and may not be directly compatible with other existing software. The performance for different block sizes is explored in Section IV-B. We found that padding data did not significantly improve performance on the Sandy Bridge architecture.
All computations were performed using double-precision values.
A. Algorithm Performance
We partitioned our test matrices into their respective block sizes (see Table 4 ) and compared our algorithms to vendor implementations. On Sandy Bridge, all 16 cores were used and hyperthreading was enabled; on KNC, all 57 cores were used with all 4 hardware threads active on each core. Results are shown in Figures 4, 5 , and 6.
Performance on Sandy Bridge is the most consistent of all three architectures. Our CPU algorithm exhibits L3 cache effects for the smallest two matrices (GT01R and raefsky4) but achieves a consistent throughput of approximately 80GB/s for the rest of the matrices (78% of the maximum bandwidth Table  4 ). Table 4 ). By-column algorithm Figure 6 . Performance comparison on the Kepler GPU architecture for various test matrices using each matrix's dominant block size (see Table 4 ). The blocks of bmw7st 1, pwtk, and RM07R are larger than 5x5, so they cannot be handled by the by-block algorithm. No matrices in our test set had blocks large enough to justify the use of the row-per-thread algorithm, so we have omitted it from this comparison.
of two Xeon E5-2670 processors). Our algorithm did not perform well on the Xeon Phi KNC architecture, sustaining approximately 40GB/s throughput (16% of the theoretical maximum bandwidth). We suspect this is due to the poor performance of gather/scatter instructions on the KNC architecture -this is discussed further in Section IV-B. MKL's BSRMV algorithm also performs poorly on KNC due to its single-threaded design.
On the Kepler architecture, we found that our algorithms performed best with larger block sizes as they were able to achieve better reuse of x. Also, larger block sizes enable better use of cache. For a block size of 2x2, a warp will cover 8 blocks; in each loop iteration (see line 13 of Algorithm 2 and line 12 of Algorithm 3), a warp must fetch 8 block vectors (i.e. 16 elements or 128 bytes, which is one cache line). For a block size of 4x4, a warp will cover only 2 blocks; in each loop iteration, it will then only need to fetch 2 block vectors (i.e. 8 elements or 64 bytes, which is only half a cache line). The next block vector will be brought into cache as part of the memory request, and the next iteration of the loop will then not need to request it.
B. Impact of Block Sizes
To understand how the algorithms perform on manycore architectures with matrices of different block sizes, we partitioned the GT01R test matrix (a discretization from a 2D inviscid fluid simulation -see Table 4 ) into blocks with sizes from 2 to 32, placing nonzero elements in the appropriate aligned blocks and filling in zeros. The results are shown in Figure 7 .
On the Sandy Bridge architecture, the performance of our algorithm generally increases with block size. This is expected, as larger block sizes offer improved temporal cache reuse for x. For all block sizes, ICC completely unrolls the inner two loops (lines 7 and 9 of Algorithm 1), but does not vectorize.
Performance of our algorithm on KNC is generally poor, often sustaining under 50% of peak bandwidth. This is likely due to the performance of gather/scatter instructions on the KNC architecture. We found that for block sizes < 16, not including 8, the Intel compiler emits gather instructions for fetching elements from A.val, despite these accesses being consecutive and unit-strided. For a block size of 8, the innermost loop of the algorithm is perfectly vectorized and no gather instructions are emitted, yielding higher performance as expected. For block sizes ≥ 16, the compiler fuses the middle loops and emits no gather instructions, again yielding higher performance. Further investigation is required to determine why the compiler produces gather/scatter instructions for smaller block sizes and whether the code can be restructured to avoid unnecessary gather/scatters. Performance will also likely improve with the introduction of gather/scatter-related AVX-512 instructions in the upcoming Knight's Landing (KNL) architecture. Padding and aligning blocks such that columns (within blocks) always fall on 64-byte boundaries does improve performance for small block sizes, as shown in Figure 7 , but this breaks direct compatibility with other software that does not use padded data structures, and it still suffers from poor gather/scatter performance.
On Kepler, our by-block and by-column algorithms perform comparably for block sizes ≤ 5 (the maximum block size the by-block algorithm can handle). The by-block algorithm appears to perform marginally faster, likely due to reduced integer opterations (both r and c are constant, as discussed in III-D1). For block sizes from 6 to 16, the bycolumn algorithm performs optimally, and for block sizes greater than 16, the row-per-thread algorithm performs best. Note that for block sizes greater than 16, the by-column and row-per-thread algorithms operate similarly in that a warp covers a single column at a time, with each thread assigned to a row within a block. However, the row-perthread algorithm exhibits better cache performance for x, as it loads an entire block vector (16+ elements at a time) into shared memory in a coalesced manner, while the by-column algorithm accesses only a single element from x at a time.
The cuSPARSE algorithm appears to have been optimized for block sizes that are powers of two. While our algorithms also exhibit performance drops for the block size after each power of two, they are less significant, and by using a combination of our three algorithms we achieve fairly consistent performance across block sizes.
Our algorithms achieve performance comparable to KSPARSE [1] for larger block sizes. Our algorithms consistently outperform KSPARSE for smaller block sizes, but as the KSPARSE algorithm for small blocks is similar to ours, more investigation is necessary to determine the reason for the performance difference.
C. Scalability
We also examined how our CPU-oriented algorithm scales over a varying number of cores within a single node. For Sandy Bridge, we tested the algorithm with up to 8 cores on a single socket, and used two sockets to test with up to 16 cores. We also tested how hyperthreading and the usage of multiple hardware threads would affect performance. Results are shown in Figure 8 .
Performance appears to scale nearly linearly with additional cores within a node, even across sockets. While we performed no tests with multiple nodes, we expect this algorithm will continue to scale well due to a lack of interthread communication.
Using multiple threads enhanced performance on both Sandy Bridge and KNC architectures. Hyperthreading on Sandy Bridge yielded slight performance gains, while using multiple hardware threads on KNC yielded more significant improvements. Figure 8 . Scaling of our algorithm on a Sandy Bridge Xeon E5-2670 processor and a KNC Xeon Phi using the GT01R test matrix. We tested the impact of hyperthreading (HT) on the CPU and the impact of hardware (HW) thread usage on the Xeon Phi.
V. CONCLUSIONS AND FUTURE WORK
By optimizing memory access patterns and minimizing visible latencies, the algorithms presented in this paper are able to achieve high bandwidth utilization and outperform the vendor-optimized block sparse matrix-vector implementations on the Intel Sandy Bridge and NVIDIA Kepler architectures. However, the implementation of our algorithm on the Intel Xeon Phi KNC many-integratedcore architecture is lacking. Additional optimizations may be made to further improve the throughput of our algorithms.
Our algorithm typically utilizes only 15-20% of the Xeon Phi's theoretical bandwidth. As discussed in Section IV-B, this appears to be caused by the compiler emitting gather/scatter instructions when they should not be used, and may be addressable by using pragmas or refactoring the algorithm to avoid this. Additionally, a cooperative threading strategy may be developed so that KNC hardware threads may work on consecutive block rows and achieve increased temporal cache locality with x. In the current algorithm, hardware threads on a core operate independently on separate block rows. As KNC is an in-order architecture and this severely limits instruction throughput, a better thread cooperation strategy will benefit performance.
The performance of the GPU algorithms may be optimized as well, as they used only 25-50% of the K80's theoretical 240GB/s bandwidth in our tests. However, full utilization will not be possible without enabling NVIDIA GPU Boost on the GPU and disabling ECC, which we were not able to do in our test setting.
To improve performance for iterative solvers when nonzero blocks are unevenly distributed between rows, a preliminary analysis stage may be introduced in which the algorithm groups rows by nnzb/row. The multiplication can then be executed in a multi-pass style, where each pass includes rows of a certain length, in order to better distribute load between cores.
Data structure transformations may be required to further improve performance by a significant margin. Specifically, blocks may be grouped into and processed as tiles in order to potentially (a) further reduce the sizes of the row ptr and col idx arrays and (b) improve cache performance for x. Such an optimization may be possible without changing the basic BCSR format by adding metadata pointing to tiles in the matrix.
