Iterative methods on irregular grids have been used widely in all areas of comptational science and engineering for solving partial differential equations with complex geometry. They provide the flexibility to express complex shapes with relatively low computational cost. However, the direction of the evolution of high-performance processors in the last two decades have caused serious degradation of the computational efficiency of iterative methods on irregular grids, because of relatively low memory bandwidth. Data compression can in principle reduce the necessary memory memory bandwidth of iterative methods and thus improve the efficiency. We have implemented several data compression algorithms on the PEZY-SC processor, using the matrix generated for the HPCG benchmark as an example. For the SpMV (Sparse Matrix-Vector multiplication) part of the HPCG benchmark, the best implementation without data compression achieved 11.6Gflops/chip, close to the theoretical limit due to the memory bandwidth. Our implementation with data compression has achieved 32.4Gflops. This is of course rather extreme case, since the grid used in HPCG is geometrically regular and thus its compression efficiency is very high. However, in real applications, it is in many cases possible to make a large part of the grid to have regular geometry, in particular when the resolution is high. Note that we do not need to change the structure of the program, except for the addition of the data compression/decompression subroutines. Thus, we believe the data compression will be very useful way to improve the performance of many applications which rely on the use of irregular grids.
Abstract-Iterative methods on irregular grids have been used widely in all areas of comptational science and engineering for solving partial differential equations with complex geometry. They provide the flexibility to express complex shapes with relatively low computational cost. However, the direction of the evolution of high-performance processors in the last two decades have caused serious degradation of the computational efficiency of iterative methods on irregular grids, because of relatively low memory bandwidth. Data compression can in principle reduce the necessary memory memory bandwidth of iterative methods and thus improve the efficiency. We have implemented several data compression algorithms on the PEZY-SC processor, using the matrix generated for the HPCG benchmark as an example. For the SpMV (Sparse Matrix-Vector multiplication) part of the HPCG benchmark, the best implementation without data compression achieved 11.6Gflops/chip, close to the theoretical limit due to the memory bandwidth. Our implementation with data compression has achieved 32.4Gflops. This is of course rather extreme case, since the grid used in HPCG is geometrically regular and thus its compression efficiency is very high. However, in real applications, it is in many cases possible to make a large part of the grid to have regular geometry, in particular when the resolution is high. Note that we do not need to change the structure of the program, except for the addition of the data compression/decompression subroutines. Thus, we believe the data compression will be very useful way to improve the performance of many applications which rely on the use of irregular grids.
Index Terms-Finite Element Analysis, Sparse Matrices, Data Compression
I. INTRODUCTION
In this paper, we describe the implementation and performance of the multiplication of sparse matrix and vector (hereafter the SpMV multiplication) on the PEZY-SC processor. In particular, we focus on the effect of various data compression schemes on the performance. The multiplication of sparse matrix and vector is the most time consuming part of many real applications which use irregular grids.
It has become very difficult to achieve even reasonable efficiency on the SpMV on modern HPC systems. The main reason is the memory bandwidth. Consider the multiplication of matrix A and vector x,
For real applications like FEM, the matrix A is too large to fit to the cache memory. On the other hand, vectors x and y are much smaller, and there is always the possibility of extensive data reuse for them. Thus, the dominant part of memory access for an SpMV operation is the reading of the (sparse) matrix A.
The exact data size of the matrix A depends on the used data format, but it cannot be smaller than the number of non-zero elements of A. The number of floating-point operation per one non-zero element of A is two. Thus, If the data format is used is the double-precision format, memory read of eight bytes takes place for every two floating point operations. In other words, the "required" B/F (byte per flops) number is 8/2 = 4. If we use the ELL format, required bandwidth can increase by 50-100%.
Thus, the required memory bandwidth, in terms of the B/F number can be between six and eight. However, the B/F numbers of microprocessors used for modern HPC systems are much smaller. The low memory bandwidth of modern HPC systems is clearly the primary reason for the very low efficiency of SpMV multiplication on them.
One of ways which can potentially be useful in reducing the required amount of memory access is to compress the matrix using some data compression algorithm. If the decompression algorithm requires more than a few instructions, it will cause quite significant increase of the total cost. Moreover, generally the decompression algorithm requires some table-lookup op-erations, in other words, the indirect memory access for which modern microprocessors are not particularly efficient.
On the other hand, if the hardware B/F number of a system is extremely low, we might be able to achieve significant performance improvement on SpMV multiplications by using data compression/decompression.
In this paper we first present the performance of HPCG benchmark [1] , [2] on ZettaScaler-1.5 [3] with PEZY-SC processor, with usual optimizations applied in previous works. Then we proceed to present the performance of "optimized" implementations of SpMV operation with on-the-fly data compression and decompression.
II. THE PEZY-SC PROCESSOR CHIP AND THE ZETTASCALER SYSTEM
The ZettaScaler (previously called ExaScaler) system is based on the first-generation PEZY-SC 1024-core processor chip. It appeared in the TOP500 list of November 2014 and ranked #2 in the Green500 list. In the June 2015 Green500 list, three ExaScaler systems occupied top three ranks. The system listed #1 achieved the performance per watt exceeding 7 Gflops/W, significantly higher than the number of the #1 system for November 2014 Green500 list. As of June 2016, it still keeps the #1 position in the Green500 list.
The reason why we used the ZettaScaler system as the testbed for the data compression algorithm is that its hardware B/F number is rather low, around 0.05. Thus, it is ideal as the testbed of algorithms which will be useful for processors in the near future. In addition, its processor cores do not have SIMD units. Thus, we might be able to achieve pretty good speedup for SpMV multiplication using data compression.
The ZettaScaler system is rather similar to modern GPGPUbased systems, in which the GPGPUs are connected to Intel Xeon processors through PCIe interface, and Xeon processors are connected using Infiniband network.
At least for the HPL benchmark, or more specifically the DGEMM operation (double precision dense matrix multiplication), the PEZY-SC processor has achieved quite impressive performance per watt, even though the efficiency compared to the theoretical peak performance is still rather low (slightly better than 50% for HPL). On the other hand, the porting of applications could be relatively easy, since the PEZY-SC processor is an MIMD manycore processor with hierarchical (but non-coherent) cache and physically shared memory. Also, a fairly well designed PEZY-SC processor supports a language called PZCL, a dialect of OpenCL. It supports most of the features of OpenCL, but there are some limitations in particular when the performance is important (which is of course almost always the case). The number of software threads created should be same as the maximum number of hardware threads (8192 per chip) to achieve best efficiency. Another difference comes from the fact that the cache is not coherent. Functions to flush appropriate levels of cache should be inserted manually to guarantee the correct result. For small computing kernels, this is not too difficult, but of course can be a source of hard-to fix bugs.
As one PEZY-SC processor has eight channels of DDR4 DRAMs, the theoretical peak memory bandwidth is 85GB/s when the DDR4 clock is 1333 MHz. The actual read bandwidth is around 75 GB/s, and STREAM copy performance is 40 GB/s. The copy performance is low because the write bandwidth is 1/2 of the read bandwidth.
The read bandwidth of L1, L2 and L3 caches (chip total) are 2000, 2000, and 700 GB/s, respectively.
III. THE OVERVIEW OF HPCG BENCHMARK AND
IMPLEMENTATION ON PEZY-SC In this section, we briefly describe the HPCG benchmark itself and our reference implementation on the PEZY-SC processor. In subsection III-A, we describe the HPCG benchmark and in subsection III-B, our implementation of HPCG on PEZY-SC.
A. The HPCG benchmark
The HPCG benchmark [1] , [2] is, according to its designers, "designed to measure performance that is representative of many important scientific calculations, with low computationto-data-access ratios." As such, it mimics the major operations of FEM using the CG with Multigrid solver, on irregular grid. Unfortunately, the currently available official specification of HPCG [2] is rather old, and the algorithm described there and what is used in the current benchmark code are quite different. In the following, we first follow [2] and then summarize the changes made.
From the mathematical point of view, the problem solved in HPCG is a 3D diffusion equation discretized using 27-point stencil on a regular grid of size (n x n px , n y n py , n z n pz ), where (n x , n y , n z ) is the size of the grid on each MPI process and (n px , n py , n pz ) is the MPI process grid. Thus, the total number of the MPI processes is n px n py n pz .
In the original specification, HPCG solves this problem using the symmetric Gauss-Seidel preconditioned CG iteration, and the users are not allowed to change this basic CG algorithm. In particular, the multigrid method, which is essential if one wants to solve large 3D problems, is not included. Thus, not surprisingly, this is changed in the current specification. Four-stage V-cycle geometric multigrid preconditioner is used.
What is measured in the HPCG benchmark is the weighted average of the computing speed of major operations, in particular SymGS, SpMV, Restriction, Prolongation, DotProduct, and Waxpby. Usually, two functions, ComputeSPMV and ComputeSYMGS, dominate the total computing time and thus determine the performance.
B. Implementation of HPCG on PEZY-SC
Our reference implementation of HPCG on PEZY-SC is pretty straightforward. The major operations described in last section are ported to PEZY-SC (rewritten using the PZCL language). Both the matrix data and vector data are kept on the memory of PEZY-SC. Therefore, only a small amount of data to be transferred for convergence check and other operations and the boundary data to be exchanged between nodes are exchanged between the host Xeon processor and the PEZY-SC processors.
Since the changes which directly take advantage of the regular structure of the grid are not allowed in the optimization phase, algebraic block multicolor ordering [4] is used for the SymGS part.
IV. HPCG BENCHMARK RESULT

A. Measured Performance
We have measured the performance of HPCG on the "Ajisai" ZettaScaler system installed in RIKEN AICS. It has the total of 64 Xeon nodes each with four PEZY-SC processors. We made the measurement of the performance on up to 32 PEZY-SC processors. We assign one PEZY-SC processor to one MPI process. Therefore, four MPI processes run on each Xeon processor. The core clock of the PEZY-SC processor is 733 MHz. Memory clock is 1333 MHz. The host CPU is Xeon E5-2618L v3 with 8 cores at 2.3 GHz clock. Each PEZY-SC processor has 32 GB or DDR4 memory, and the host Xeon processor 128 GB.
On 32 PEZY-SC processor, the achieved performance for HPCG 3.0 rating is 168.06 Gflops (For HPCG 2.4 rating, 189.15 Gflops). The problem size used is 176 3 local grid with 4 × 4 × 2 processor grid, for the global problem size of (704, 704, 352).
For this particular problem size, HPCG reported convergence after exactly 50 iterations. Figure 1 shows the breakdown of the execution time. As usual, SpMV and SymGS dominate the execution time. The speed of these two sections are 238.4 Gflops and 217.6 Gflops, respectively. If we calculate the performance per MPI process (or per PEZY-SC processor), they are 7.45 Gflops and 6.80 Gflops.
B. Performance Analysis of the reference implementation
We limit our analysis to SpMV, to simplify the discussion. Since the achieved performance numbers of SpMV and SymGS are not much different, we believe the analysis of SpMV is sufficient to discuss the behavior of hardware and software.
The single-chip performance of the SpMV operation on a PEZY-SC processor is 11.6 Gflops. Since the matrix is quite large and each non-zero element of the matrix is used only once per one SpMV operation, the performance of SpMV is bandwidth limited. One element is expressed by one four-byte integer number and one eight-byte floating-point number, and for this element two floating-point operations are performed (one multiplication and one addition). The theoretical peak memory bandwidth of a PEZY-SC processor is 85 GB/s, and the actual measured read performance is around 75 GB/s. Therefore, the peak performance achievable by SpMV is around 12.5 Gflops. We can see that the performance of 11.6 Gflops is quite close to what can be achieved.
We can thus conclude that we have successfully ported HPCG on the PEZY-SC processor, and for the SpMV operation we achieve the performance very close to the theoretical limit determined by the throughput of the external memory. Therefore, we now have a good reference implementation, against which we can measure the effect of data compression.
V. IMPLEMENTATION OF THE SPMV MULTIPLICATION WITH DATA COMPRESSION/DECOMPRESSION AND ITS PERFORMANCE
The use of the data compression in HPC is currently an active area of research, and many methods, both lossless and lossy, have been proposed [5] .
Many of the previous proposals of the use of data compression is for storage and checkpointing, but there are also many studies of the use of data compression on cache and main memories [6] - [8] .
We have tested several implementation of fast compression/decompression algorithms for simplified implementation of SpMV operation on the PEZY-SC processor. The original matrix is the same as what appears in the HPCG benchmark.
So far, the best result is achieved by a simple tablebased compression. In this algorithm, first the entire matrix is scanned and all unique values in the matrix elements are listed and sorted in the ascending order. We call this list the value
Then, for each row of the matrix, the non-zero elements are also sorted in the ascending order. Now we have the list of column indices, sorted by the actual value of the element. We call this list sorted column list S i . Now, for each v i of V , we calculate the last position of that value in the sorted list of non-zero elements, and record that value to create the list of "terminal" indices, T i .
We can further compress the list of S i in the following way. First, we convert the actual values of column indices to the value relative to the diagonal element. Then, we register the pattern of this relative displacement of column indices to a So far, We have actually implemented the first two compression schemes. Table I show the resulting performance. The first approach of compressing the data array only should theoretically give around a factor of three speedup, and actually realized the speedup by 50%. The second one, in which both the index array and the data array are compressed, should theoretically give around a factor of 25 speedup, and actually achieved the speedup by a factor of 2.8.
The reason why the actual speedup is much smaller than the theoretical limit is simply that to estimate the theoretical limit we ignore the access cost of the input vector.
The fact indices and data are compressed means that their actual values are obtained by table lookup operations. Thus, on modern microprocessors with wide SIMD instruction sets, it would be difficult to achieve reasonable performance with the compression algorithm, since the throughput of indirect access operations are generally low. The fully-MIMD, non-SIMD nature of the PEZY-SC processor is critical to achieve the actual speedup.
VI. SUMMARY
In this paper, we report the effect of data compression/decompression algorithms for the SpMV multiplication on the ZettaScaler system, in which the 1024-core, MIMD ultramany-core PEZY-SC processors are used as accelerators. We have used the matrix generated by HPCG benchmark code as the example. Usually, the performance of the well-optimized implementation of HPCG is limited by the bandwidth of the sequential read access of the external memory of the processor (or accelerator). In the case of a PEZY-SC processor, the theoretical limit of the read bandwidth is 85 GB/s, and actual measured bandwidth is 75 GB/s. Thus, the performance of SpMV and SymGS operations are limited to around 10 Gflops. The actual performance achieved is close to this number.
The theoretical speedup by data and index compression is as large as a factor of 25. We actually achieved the speedup of a factor of 2.8 for the SpMV operation. Even though the actual achieved improvement is much smaller than the theoretical maximum, we have demonstrated that the used of data compression/decompression can actually improve the performance of SpMV multiplication on the PEZY-SC processor. We therefore conclude that the use of data compression/decompression will be quite useful technique to improve the performance of SpMV operations in FEM applications on the current and future high-performance processors.
