Abstract: This paper presents the design and implementation of a high performance sparse matrix-vector multiplication (SpMV) on fieldprogrammable gate array (FPGA). By proposing a new storage format to compress the indexes of non-zero elements by exploiting the substructure of the sparse matrix, our SpMV implementation on a reconfigurable computing platform with a multi-channel memory subsystem is capable of obtaining similar performance by using a single FPGA to that of a highly optimized BFS implementation on a commercial heterogeneous system containing four FPGAs.
Introduction
Sparse matrix-vector multiplication (SpMV) is the kernel algorithm of sparse linear system solvers and the fundamental core of many scientific and engineering applications. The SpMV algorithm is the multiplication of a sparse matrix A and a dense vector x, with the result stored in a dense vector y. The SpMV algorithm has data-driven computations dictated by the property of the sparse matrix. The sparse matrix is unstructured and irregular, requiring fine-grained random memory accesses with poor spatial and temporal locality. In addition, SpMV algorithm tends to perform a relatively small amount of computations, leading to execution times dominated by the bandwidth of external memory subsystems. Therefore, conventional implementations of SpMV have historically performed poorly, running at 10% or less of system peak performance on conventional cache-based microprocessors.
The field-programmable gate array (FPGA) is a reconfigurable integrated circuit. Combining the flexibility of software and the high performance of customized hardware design, FPGA can offer superior performance and power efficiency for many specific applications. FPGA has great potential in coping with the irregularity of SpMV, due to its easily pipelined ability, inherent parallelism and configurable architecture. Many studies have attempt to implement the SpMV algorithm on FPGA by improving the utilization of memory bandwidth.
In this paper, we propose a high performance SpMV algorithm based on a new storage format to compress the indexes of non-zero elements to reduce the volume of the indexing information of the sparse matrix. By exploiting the substructure of the matrix, we adapt the large-scale sparse matrix to the FPGA. The performance of our implementation is evaluated on a selfdesigned reconfigurable algorithm accelerator with a multi-channel memory subsystem. The results show that our algorithm is capable of obtaining similar performance by using a single FPGA to that of a highly optimized BFS implementation on a commercial heterogeneous system with four FPGAs.
Background and related works

CSR storage format
The compressed sparse row (CSR) storage format is the most general format for storing the sparse matrix, as shown in Fig. 1 . The CSR format stores only the non-zero elements of the matrix and employs additional indexing information representing the position of these elements. This format is comprised of three arrays: column indexes array col, row pointers array ptr and values array val. The val array and col array contain the value and corresponding column index for each nonzero element in the sparse matrix, arranged in a raster order starting with the upper-left and continuing from left to right then top to bottom. The row pointers array ptr stores the locations within val and col where each row begins and terminated with the number of nonzero elements. For a n * n sparse matrix with m nonzero elements, the data volume is O(n + 3 * m) (n for integer array ptr, m for integer array col and 2 * m for double-precision floating-point array val). It is obvious that the data volume of the indexing information accounts for a large proportion of the data volume of the sparse matrix. 
Previous work
There has been much prior work to leverage the highly parallel and specialized computational fabric of modern FPGAs to develop FPGA-based SpMV implementations. Zhuo employs double-precision floating-point multipliers and adders, and performs multiple floating-point and I/O operations in parallel [1] . However, the performance of this design is greatly affected by the padding overhead and the design parameters are depend on the sparsity structure of the matrix. Delorimier partitions the set of n dot products across multiple Processing Elements (PEs) [2] and Morris proposes a deeply pipelined dual-FPGA design [3] . However, since only on-chip Block RAM (BRAM) is used to store the whole matrix for both of the implementations, the matrix size is constrained by the BRAM resources. The design of Sun depends less on the matrix structure compared with the above approaches, but the reduction circuit is exclusively shared by multiple PEs, which becomes the potential performance bottleneck resulting from the access conflict of PEs [4] . Kevin proposes a new design methodology for SpMV that seeks to utilize system memory bandwidth more efficiently [5] . Nagar implements the SpMV algorithm on Convey HC-1 (a commercial heterogeneous system containing four FPGAs) and demonstrates up to 40% of the peak performance for largescale matrices in CSR format [6] .
3 SpMV in BVCSR format on FPGA
BVCSR storage format
In order to reduce the data volume involved in the computation and increase the utilization of the bandwidth, we propose a new storage format to adapt the sparse matrix to FPGA, the Blocked Variable Compressed Sparse Row (BVCSR) format. The BVCSR format is similar to the Blocked Compressed Sparse Row (BCSR) format regarding its relation with CSR, which stores and indexes two-dimensional submatrices with at least one nonzero element instead of storing and indexing each single nonzero elements. The BVCSR format is designed for the FPGA, where the blocks are not assumed to be dense and of arbitrary size as the BCSR format, but rather are fixed-size sparse blocks. For BVCSR, the sparse matrix is divided into submatrices of fixed size s * s and each submatrix is aligned at s row and column boundaries. This restriction simplifies the design of the SpMV controller design and facilitates the construction of the BVCSR. A larger s reduces the number of load instructions for the elements of the vector x and y, but the size of s is limited by the on-chip memory of FPGA.
As shown in Fig. 2 , the BVCSR format is a two-level storage format. The high-level storage format is used to store the position of the nonzero submatrices (submatrices with at least one nonzero element), and the low-level storage format is used to store the nonzero elements within each submatrix.
For the high-level storage format, we use a CSR-like format where the elements are pointers to the position of the non-empty submatrices. Three arrays is used to store the submatrices, sV al, sCol and sP tr. Array sV al For the low-level storage format, we propose the variable compressed sparse row (VCSR) format to reduce the index data of nonzero elements across the same submatrix row by applying compression for the indexing information. The VCSR format tries to exploit contiguous nonzero elements in the same row of a submatrix and maintains a single index pair (head, len) for each contiguous area, where head and len are respectively the offset of the first nonzero element and the length of the area. Therefore, only two integers are maintained to record the indexes for all the elements of a contiguous area, instead of an integer for each element in the CSR format. To increase the granularity of the contiguous area, we further merge the adjacent areas (the distance between the two areas is smaller than the merging threshold r) in the same row with explicit zero padding. In addition, instead of using the ptr array, we encode the end-of-row information within the val and col arrays using zero termination. Thus, to mark the termination of a row, we use 0 for both the val and col values. These methods substantially reduce the working set of the SpMV algorithm, thus reducing the demands on memory bandwidth and alleviating the pressure to the memory subsystem.
Implementation on FPGA
We design a multi-channel memory based architecture that fully utilizes the parallelism on FPGA for high-performance SpMV implementation. Our design is implement on a self-designed reconfigurable algorithm accelerator which resides on a host server. The architecture of our top-level design is shown in Fig. 3 . The basic organization of the accelerator is a FPGA chip and a multi-channel memory subsystem, which consists of three external DRAM modules. The accelerator receives the sparse matrix in BVCSR format and vector x, executes the SpMV algorithm and sends the result vector y back to the host.
On the front end of our design, k processing elements (PEs) accepts the data from the external memory. Each PE contains one multiplier and BRAM in which the multiplicand fragment of vector x is preloaded which is addressed Fig. 3 . Top level design of SpMV on FPGA by the column indexes. Obtained products are then accumulated by the adders in the adder tree and one partial sum is produced every clock cycle. On the back end, we use the same approach as [6] to implement the doubleprecision floating-point accumulator.
Performance evaluation
We evaluated the proposed architecture on a self-designed reconfigurable algorithm accelerator, as shown in Fig. 4 , which is mainly composed of one Xilinx Virtex-7 XC7VX485T FPGA and three DDR3-1600 8 GB DRAM modules. Our BFS hardware design is expressed in RTL using Verilog HDL, and was compiled using Xilinx ISE v14.3. The number of front-end PEs is 24, each having a 64 KB BRAM memory. The implementation reach a clock frequency of 208 MHz and the working frequency is set to 200 MHz. The sparse matrices used in our experiment are from the University of Florida Sparse Matrix Collection [7] with different sparsity patterns, which are used in Nagar's work [6] . The submatrix size is set to 1024*1024. As shown in Fig. 5 , our implementation obtains similar performance by using a Xilinx Virtex-7 FPGA to that of a highly optimized BFS implementation on a commercial heterogeneous system containing four Xilinx Virtex-5 LX330 FPGAs. It should be noted that the Convey HC-1 system possess vastly superior memory systems relative to our system (80 GB/s on the Convey HC-1 and 38.4 GB/s on our system). Thus our bandwidth utilization is about twice that of Nagar's work. This mainly result from the proposed BVCSR format, which reduce the data volume involving in the computation. 
Conclusions
This paper presents the design and implementation of a high performance sparse matrix-vector multiplication (SpMV) on field-programmable gate array (FPGA). By proposing a new storage format to compress the indexing information of non-zero elements by exploiting the substructure of the sparse matrix, our SpMV implementation on a reconfigurable computing platform with a multi-channel memory subsystem is capable of improving the utilization of the memory bandwidth and obtains similar performance by using a single FPGA to that of a highly optimized BFS implementation on a commercial heterogeneous system containing four FPGAs.
