Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the out-of-control data reuse and high cost for global memory or shared cache access in inter-level synchronization have been largely neglected in existing work.
Introduction
The sparse triangular solve (SpTRSV) operation computes a solution vector x for a linear system Lx = b 1 , where L is a sparse lower triangular matrix and b is a right-hand side vector. The SpTRSV operation is in general indispensable to the solve phase of sparse direct solvers [11] and preconditioners of sparse iterative solvers [44] in a wide range of applications from numerical simulation [42] to machine learning [35] .
Unlike other sparse basic linear algebra subprograms [28] such as sparse transposition [60] , sparse matrix-vector multiplication [31, 33] and sparse matrix-matrix multiplication [32] , the SpTRSV is an inherently sequential operation. In worst cases (e.g., when L is a triangular double-band matrix), any solution component x i , if i 0, needs to wait for its former components to solve out. This makes parallelizing SpTRSV seem not possible.
But by grouping components without dependencies between each other into one set, SpTRSV can run in parallel. Although multiple sets have to run level-by-level (i.e., in serial, because of the dependencies between their components), at least the components inside a set can be calculated simultaneously. This is so-called level-set method first presented by Anderson and Saad [2] and Saltz [45] , and recently improved by Park et al. [40] and Liu et al. [29] . The total overheads for the level-set methods are two-fold: the calculation cost inside each level and the synchronization cost between one level to another. The proportions of the two parts highly depend on the sparsity structure of input matrix. makes cores in the same column share x through register communication but not global memory.
All 2057 square matrices from the Florida Sparse Matrix Collection [8] are tested to evaluate the performance of our proposed method and the latest approaches running on Intel Xeon Phi (KNC) and Nvidia K80 GPU. Compared with the parallel level-set method on SW26010, our algorithm achieves an average speedup of 6.9 and a maximal speedup of 38.5. Our method also outperforms the latest methods on KNC in 1856 benchmarks and the latest methods on K80 in 1672 benchmarks, respectively.
The paper makes the following contributions: • The Sparse Level Tile layout is proposed to make data reuse of vectors x and b under control and to make their memory access coarse-grained, predictable and prefetchable.
• A Producer-Consumer pairing method is designed to make any inter-level synchronization only happen through very fast register communication but not slow global memory.
• Totally 2057 sparse matrices are used for performance evaluation, in which our method largely outperforms parallel level-set methods and the latest methods on KNC and GPU.
Background and Motivation

Serial Algorithm for SpTRSV
Alg.1 lists a serial algorithm for Lx = b 2 . As can be seen, there is no naïve concurrency in the outer for loop (line 1), due to the dependency between each element of x and its previous elements (line 4) . Taking the problem in the left half of Fig.1 as an example 3 , x 0 , x 2 and x 3 cannot be solved in parallel, since x 3 depends on x 0 and x 2 to solve out first.
Algorithm 1 A serial SpTRSV algorithm for Lx = b.
Input: L of size n × n, b of size n; Output: x of size n;
1: for j = 0 to n − 1 do 2:
for each l i j on column j of L do 4:
end for 6: end for
Level-Set Methods for Parallel SpTRSV
Luckily, some potential concurrency can be exploited in this problem with careful analysis. For example, x 3 -x 8 can be calculated concurrently after x 0 -x 2 are completed. Based on this discovery, parallelizable components x 0 -x 2 , x 3 -x 8 , x 9 -x 13 , x 14 -x 15 are grouped into four sets, and a dependency graph composed of four levels (i.e., the four sets) can be 2 Here L can be stored in any format, though the compressed sparse column (CSC) layout may be better since Alg.1 is a column-wise algorithm. 3 Note that here the components in the levels are ordered in restrict ascending order. This is not typical in real-world problems. But doing so is helpful for describing our problems, and it is easy to reorder/permute matrix rows to this form when the level information is known. extracted for parallel execution. The right half of Fig.1 shows the dependency graph of the level-set method.
However, to make sure all the dependent elements have been calculated out, a synchronization (see the three dotted lines in the dependency graph in Fig.1) is introduced after each level's computation. Unfortunately, on modern processors, especially many-core architectures, the cost for synchronization is expensive and sometimes dominates the execution time of parallel SpTRSV [25, 29, 40] .
To reduce the costs for synchronization across threads, two approaches have been proposed. One method by Park et al. [40] replaces the full-synchronization with multiple corecore (P2P) synchronization. In the above example, the x 13 calculated by Thread 2 on level 2 only depends on the x 4 of Thread 0 and x 7 and x 8 of Thread 2 on level 1. Thus, a single P2P synchronization between Threads 0 and 2 is enough to guarantee correctness. Another optimization by Liu et al. [29] replaces synchronization with atomic operations. In Fig.1 , Threads 0 and 2 on level 0 can update x 6 by atomically adding the results of x 0 and x 2 . Before the atomic-add is finished, Thread 1 busy-waits the lock on x 6 getting updated.
Performance Problems in Existing Methods
Traditional data layouts, such as CSR 4 and CSC, are unaware of data reuse. As shown in line 4 of Alg.1, methods based on the CSC layout traverse the nonzeros in the column (vertical) direction, which leads to perfect data reuse of x but makes the reuse of b handled by hardware. Similar, methods based on the CSR layout traverse the nonzeros in the row (horizontal) direction, which leads to perfect reuse of b but makes the reuse of x out-of-control. Both methods only consider data reuse in one dimension. Thus, methods based on either of the two layouts may not achieve the best performance. Later on, we will design a new layout to carefully traverse the nonzeros in both row and column directions, i.e., in 2D space, and to make the reuse of x and b both under control.
It is well known that the inter-level synchronization cost is expensive for parallel SpTRSV. Even though [40] and [29] have replaced full synchronization with P2P synchronization or atomic operations, the overhead for synchronization is still high. The main reason is that inter-core synchronization has to go through global memories or low-level shared caches, both of which offer lower bandwidth and higher latency compared with high-level private caches and registers. Furthermore, as the number of cores goes up, the potential competition of accessing the global memory or shared cache between different cores often further hurts the performance.
Sunway Architecture
The major building block of the supercomputer Sunway Taihulight is the SW26010 many-core processor, as shown in Fig.2 . Each processor is composed of four Core Groups (CGs), and each CG contains a Management Processing Element (MPE) for latency sensitive tasks and one Computing Processing Element (CPE) cluster of 64 CPEs organized as an 8×8 mesh for throughput sensitive tasks. Each MPE has 32 KB L1 data cache and 256 KB L2 instruction/data cache, while each CPE has its own 16 KB L1 instruction cache and a 64 KB Scratch Pad Memory (SPM), whose speed is equal with that of L1 cache. A CG has 34.1 GB/s theoretical memory bandwidth and 765 GFlops double-precision peak performance [9, 13] . Two kinds of memory access, i.e., Direct Memory Access (DMA) and global load/store (Gload/Gstore), are supported, between which an obvious performance gap exists: DMA prefers transferring massive data from main memory to SPM, and Gload/Gstore prefers transferring small and random data between main memory and registers. A Stream Triad test [63] shows that the bandwidth of DMA and Gload/Gstore are 22.6 GB/s and 1.48 GB/s, respectively.
The CPE cluster offers low-latency register data communication among the 8 × 8 CPEs, and this is one of the key features of the SW26010 processor. Each CPE has Register Send Buffer and Register Receive Buffer. The hardware will send data from one Send Buffer to another Receive Buffer uninterruptedly and automatically until the Send Buffer is empty or the Receive Buffer is full. According to the benchmarking results [63] , the latency of the register communication is at most 11 cycles and the integrated core-core communication bandwidth is 637 GB/s. However, due to the limitation of the hardware, the data in register can only be communicated between the CPEs in the same row or column.
Challenges of SpTRSV on Sunway Architecture
Even though the SPM can help us manually control the timing of data swapping for better reuse, and the register communication technique improves the efficiency of inter-core data-sharing and synchronization, it is still challenging to use these advantages for parallel SpTRSV.
First, the fine-grained and unpredictable memory access of SpTRSV is unsuitable with the features of Sunway architecture. The fine-grained memory access can only use the Gload/Gstore method with low bandwidth and high latency, or use the DMA method in low efficiency. Meanwhile, the memory access of SpTRSV is unpredictable, which means that we need to estimate whether the data is cached or not before each load/store instruction. This is natural for hardware controlled caches but will cost much more for SPM, because the estimating process is done by software but not hardware. Furthermore, the unpredictable memory access also makes it impossible to prefetch any data.
On the other hand, when parallelizing the SpTRSV, a CPE may need to synchronize to any other CPEs for sharing data or finishing dependencies. However, This cannot be directly supported by the register communication technique on Sunway processors, because such high-speed communication only works for CPEs in the same row or the same column. A straight forward idea is to select some CPEs as "transfer stations" to achieve register communication between those CPEs not in the same row/column. However, such method may introduce a cycle of messaging route, thus potential deadlock may occur [26] .
Sparse Level Tile Layout
In this section, we design a new data layout, called Sparse Level Tile (SLT for short) layout, to make the reuse of x and b both under control and to migrate the memory access of SpTRSV from fine-grained, random and unprefetchable to coarse-grained, predictable and prefetchable.
Data Layout Design
The key point of the SLT layout is to divide a triangular matrix into multiple Tiles, each of which only applies to part of x and b. Here, we use the general term "cache" as the fastest memory on different processors, meaning that for Sunway architecture, the "cache" is the SPM, and for other processors, the "cache" may be high-level cache. Fig.3 presents the establishing process of the SLT layout, where Fig.3A is the example matrix plotted in Fig.1 with four level-sets, and Fig.3B-Fig.3F are the output after each of the five steps described in the rest of this section.
Step 1. Import the Concept of Region. We introduce a high-level concept named Region to divide both x and b, and then make X-Region and B-Region, respectively. Each Step 1
Step 4
Step 5
Step 3 Step 2 Figure 3 . The establishing process for Sparse Level Tile layout. Based on Region, the x and b will be divided into X-Regions and B-Regions (B). The original levels will be separated into new levels if crossing Regions (C). Each nonzero belongs to a Diagonal-Tile (D) or an Offdiagonal-Tile (E) according to the ID of the used X-Region and modified B-Region. The Tiles modifying the same B-Region will be gathered together to improve the data reuse of b elements (F). The numbers in (F) denote tile order for both storing and computing.
Tile uses x elements from some X-Regions to modify b elements from some B-Regions. The Region size is corresponding to the cache size. In the following steps, we will group the nonzeros into different Tiles to limit the number of XRegions and B-Regions that each Tile uses and modifies. In this example, we assume the Region size is 4 elements, which leads to 16/4 = 4 Regions for the 16 unknowns. As an example in Fig.3 , the first off-diagonal nonzeros on the top-left (l 3,0 ) of Fig.3B uses x 0 from X-Region 0 (i.e., ⌊0/4⌋ = 0) to modify b 3 from B-Region 0 (i.e., ⌊3/4⌋ = 0).
Step 2. Separate Original Levels. A level is separated into multiple levels when going across more than one X-Regions. For example, the original L1 will be separated into three levels, as it crosses three X-Regions. Totally, the original four levels will be translated into seven levels. Note that Inherit from previous Tile Load from main memory Obtained from Used in
Inherit
Load from main memory Store to main memory Figure 4 . The dataflow using the SLT layout. Each b i will only be loaded once and will never be swapped out until it is used to compute x i (Line 2 in Alg.1). The nonzeros in two dimension in each Tile make x i be reused as much as possible. When targeting a Tile, all the corresponding data must be cached to guarantee predictability. The streaming access of x and b from main memory makes it possible to prefetch coarse-grainedly.
this operation usually introduces more levels to a problem. But since our method processes synchronizations in a very low-cost way, it is not sensitive to the number of level-sets.
Step 3. Build Diagonal-Tiles. To improve the data reuse of x and b and to make memory access predictable, we combine the diagonal nonzeros of each level with their "nearest" off-diagonal nonzeros together to consist the first kind of Tiles, Diagonal-Tiles or DiaTiles for short. The width of a DiaTile is the number of diagonal nonzeros in each level, and the height is the width plus the Region size. Hence, the term "nearest" refers to a range that the DiaTile's nonzeros should be in. The number of DiaTiles is equal to the number of divided levels. In our case, this number is seven.
In the seven DiaTiles, all x elements inside one single Tile can be solved out without dependencies inside the current Tile, but only with dependencies from its previous Tiles. From the figure, it can be seen that a DiaTile is composed of a diagonal on top and a square matrix on the bottom, meaning that x elements corresponding to this diagonal can be together solved by x j = b j /l j j (if previous dependencies outside this Tile are released). This spatial feature actually decouples operations x j = b j /l j j (only for the diagonal on top) and b i = b i − l i j x j (only for the square in the bottom) inside a single DiaTile. This makes tasks in a DiaTile work in a perfectly parallel-friendly way. It is worth to note that this is why we use this non-uniformed tiling instead of regular 2D tiling selected by previous work [5, 19, 37, 41, 62] .
Taking the left most DiaTile in Fig.3D as an example, we can see that its width is 3 as there are three diagonal nonzeros, and the Region size is 4. So its height is 4 + 3 = 7. In this DiaTile, we can solve out x 0 , x 1 and x 2 , which must be cached. As the Region size is 4, only b 3 , b 4 , b 5 and b 6 are stored in cache. To make memory access predictable, which means that we can guarantee the corresponding x and b elements must be cached without any extra estimating, each nonzero in this Tile must have a row index smaller than heiдht = 7. Then, this Tile only uses X-Region 0 to modify B-Region 0 and B-Region 1. In this DiaTile, x 0 and x 2 can be reused three times and twice respectively, as three and two nonzeros exist below. Similar, b 3 and b 6 , stored in the cache, can be reused twice as two nonzeros exist on the left, respectively. Note that the cached b elements will be updated by some b i = b i − l i j x j operations (Line 4 of Alg.1), and can be reused when conducting the calculation within its following Tiles.
Step 4. Build Offdiagonal-Tiles. Note that not all the off-diagonal nonzeros have been assigned to DiaTiles. We now need to group the rest of the nonzeros together to build another kind of Tile: Offdiagonal-Tile, or OffdiaTile for short. We let the nonzeros of an OffdiaTile only use a single X-Region and modify a single B-Region. As a result, both the maximal width and height of an OffdiaTile are equal with Region size. However, as some off-diagonal nonzeros have been assigned to DiaTiles, the actual range of an OffdiaTile might be smaller. Take the top-left OffdiaTile of the matrix (storing one nonzero) in Fig.3E as an example. This Tile uses X-Region 0 and modifies B-Region 1. Similar, the OffdiaTile located below the former OffdiaTile (storing six nonzeros) uses X-Region 0 and modifies B-Region 2.
The goal of building OffdiaTile is to make memory access predictable and to improve data reuse. Before targeting an OffdiaTile, we can load a bunch of x and b elements, the size of which is equal to the Region size. As each OffdiaTile only uses a single X-Region and modifies a single B-Region, the corresponding x and b must be cached and can be reused.
Step 5. Sort Tiles. We want to further improve the datareuse of b. The way is to closely store the Tiles modifying the same B-Region. Then we can only load part of a B-Region but not the whole B-Region before processing each Tile. Then a Tile, regardless it is Diagonal or Offdiagonal, belongs to a Region ID, which is the maximal ID of the B-Region it can modify. We will store Tiles according to their Region IDs.
As can be seen in Fig.3F , because the potential maximal row index of Tile 1 is 7 and it can modify b 7 from B-Region 1 (even though there are no nonzeros in this tile having a row index of 7), Tile 1 belongs to Region ⌊7/4⌋ = 1. In summary, Tiles 0, 1 and 2 belong to Region 1, Tiles 3 and 4 belong to Region 2, Tiles 5, 6, 7 and 8 belong to Region 3, and Tiles 9 and 10 belong to Region 4 (The B-Region 4 cannot be found in Fig.3 since it locates beyond the matrix size and has no effect on solving process). For the Tiles belong to different Region IDs, Tiles belong to smaller IDs are stored in front of those belong to larger IDs. For Tiles belong to the same Region ID, DiaTiles are stored in front of OffdiaTiles and OffdiaTiles with the same Region ID can be stored in arbitrary order. The storing order of Tiles, as shown in Fig.3F , is same with the computing order. So we can load the Tiles in a stream. Fig.4 uses the example matrix in Fig.1 and Fig.3 for explaining dataflow and data reuse based on the SLT layout. Here we take Tiles 0, 1 and 2 as an example. Tile 0 only focuses on x 0 -x 2 and b 3 -b 6 . So we can guarantee that these data must be cached without any extra estimating when using Tile 0. x 0 is reused three times to modify b 3 , b 4 and b 6 , and x 2 is reused twice to modify b 3 and b 6 . Similar, b 3 and b 6 are reused twice to be modified by x 0 and x 3 . After Tile 0 is finished, b 3 -b 6 are in the cache. So, when targeting Tile 1, x 3 is obtained by
Dataflow, Data Reuse and Data Storage
, and b 3 -b 6 are inherited directly. Both of these two operations are in the cache. Thanks to the order well organized, only b 7 are loaded from the main memory for Tile 1. Next, for Tile 2, x 0 -x 3 are reloaded from main memory and b 4 -b 7 can be inherited from Tile 1 directly and perfectly.
The inheritance of b elements in this order can make us reuse b until they are transformed to x elements after x j = b j /l j j (Line 2 in Alg.1). It can also avoid any redundant memory access to b. Meanwhile, the nonzeros in two dimensions in each Tile can reuse the x elements as much as possible. When targeting a Tile, any memory access is predictable since the corresponding data must be cached. As the memory access to x and b for DiaTiles are in streaming, it can be guaranteed to prefetch b and integrally store x by introducing other on-chip buffers in implementation, and the memory access must be coarse-grained.
Alg.2 presents the pseudocode that establishes the SLT layout. We first calculate the level-set information and separate the original level-sets into new levels, as introduced in Steps 1 and 2. Next, we traverse all the nonzeros to classify them into different Tiles, including Diagonal-and OffdiagonalTiles, as introduced in Steps 3 and 4. At last, we sort these tiles to improve the data-reuse, as motioned in Step 5.
The layout of storing each Tile is flexible. Any sparse matrix representations, e.g., CSR, CSR or COO 5 , are usable. To minimize redundant information, we select the COO as the low-level layout for each Tile and make the nonzeros inside each Tile ordered (this can be done by calling the parallel segmented sort kernel [16] ). Also note that converting a matrix from another layout to the SLT layout mainly includes obtaining Tile information and moving nonzeros. Thus the 5 The coordinate format stores row/col indices and value for each nonzero.
converting process is fast. In our experiments, the conversion of large matrices in our benchmark requires up to several seconds. Further considering an SLT matrix can be used many times once created, the conversion cost is trivial.
SpTRSV Algorithm based on the SLT Layout
To store a matrix using the SLT layout, six variables are needed: l, li, lj, tiles, sizes and idx, as exhibited in Alg.3. The l, li, lj stores the values, the row and column indices of the lower triangular matrix L respectively; The tiles stores the number of Tiles, including both Diagonal-and OffdiagonalTiles; The idx stores the index of the first nonzero of each Tile; The sizes stores the number of diagonal nonzeros for the DiaTiles and zero for OffdiaTiles.
Line 10 computes x j = b j /l j j and Line 15 computes b i = b i − l i j x j . The memory access to b and x in both Line 10 and 15 is predictable, because we can guarantee that the corresponding x and b elements have already been cached (cx and cb). The memory access to matrix L is also coarsegrained and prefectchable, because the Tiles are stored in the same order with that of computation, and we can load a whole Tile from the main memory before each kernel starts. //Separate original levels into new levels based on Regions 4:
for i = 0 → #new _l evels − 1 do 5:
for < r ow, col, val
Diaдonal _T il e[i].add (< r ow, col, val >) 8:
end if 12: end for 13:
end for 14:
//Sort Diaдonal _T il es and O f f diaдonal _T il es if not empty 15: end function
Tuning the Region Size in the SLT Layout
The Region size in the SLT layout essentially affects the performance. Compared with large Region size, small Region size leads to more levels and more Offdiagonal-Tiles, the latter of which will reduce the x elements' reuse and cause more extra reloading of x elements (Line 13 in Alg.3). While large Region size makes each targeting OffdiagonalTile reload a large amount of x elements. If the targeting Tile is very sparse, there will be lots of redundancy in these reloading x elements. According to our experiment, a small Region size suits problems with lots of levels thus few parallelism, while a large Region size suits problems with high parallelism. Detailed results can be found in Sec.5.3.
Algorithm 3
The SLT-based SpTRSV algorithm. Replenish cb from b for REGION_SIZE 5:
for t = 0 → t il es − 1 do 6:
if
In this subsection, we propose a Producer-Consumer pairing method to utilize the regular register communication pattern for our irregular problem. Each Producer has a cache_x and a buffer_l, the former one is used to cache and reuse x elements and the latter one is to coarse-grainedly prefetch the nonzeros in each Tile. Similarly, each Consumer has a cache_b and a buffer_b. The former one is used to cache and reuse b elements and the latter one is to coarse-grainedly prefetch b. In fact, both the cache and the buffer are part of the SPM. But in our design, only the data in cache can be reused, while the data in the buffer cannot, till moved from buffer to cache. Note that multiplying cache_b size on each Consumer and the number of Consumers is the Region size of the SLT layout.
Each Producer will be paired with one Consumer. The b element in b i = b i − ∆ i j on one Consumer is exactly needed by x j = b j /l j j on the paired Producer. So, before each level's computation, Consumers will send corresponding b elements to their paired Producers using the Row register communication to replace full synchronization. This sending operation, together with corresponding x j = b j /l j j , is represented by the "x j = b j /l j j " Line in Fig.4 . Except the b elements sent out, other b elements are still in the cache_b, which is represented by the "Inherit" Line in Fig .4 .
Here, we introduce a concept owner for each x element, b element and nonzero. Each b element's owner is a Consumer using the b element for b i = b i − ∆ i j . Each x element's owner is a Producer computing the x element by x j = b j /l j j , and Each nonzero's owner is a Producer using the nonzero for
We first averagely distribute the b elements to all the Consumers. Due to the one-to-one pairing, we can further determine the owner of each x element in the same row simply. Therefore, we can successfully assign the owner for each x element, b element and diagonal nonzero.
Next, we will assign each off-diagonal nonzero's owner, to make sure the communication only happens in the same row or column. Observe that each b i = b i −l i j x j , imported by the off-diagonal nonzero, uses x j to modify b i , and the owners of both x j and b i have been assigned. So we can distribute the nonzero to the Producer, which has the same column with the x j 's owner, and the same row with the b i 's owner. By such distributing method, each ∆ will be sent only by the Row register communication and each x element will be shared across the CPEs in the same column only by the Column register communication.
SpTRSV with Producer-Consumer Pairing
In this subsection, we introduce the solving process for both Diagonal-Tiles and Offdiagonal-Tiles, as exhibited in Fig.5B .
Step 1. The Producers load the current Tile from the main memory to their buffer_l. Then all the CPEs check whether the current Tile is Diagonal. If the answer is 'Yes', the Consumers will send the b elements to their paired Producers by the Row register communication for reuse (2a). While, if the answer is 'no', the Producers will reload the x elements from the main memory directly to the cache_x.
Step 2. If we are targeting a Diagonal-Tile, the Producers will finish the x j = b j /l j j and then store the x back to the main memory, and the Consumers will move the buffer_b's b elements to their cache_b. The Consumers will replenish buffer_b once it is empty.
Step 3. When the current Tile is 'thin' (which means the cache_x of each Producer is large enough to hold the x elements from all the Producers in the same column), the Producers will do a Whole-Rolling by the Column register communication to share their x elements to all the Producers in the same column (5a). Then Producers finish ∆ i j = l i j x j and send ∆ to corresponding Consumers by the Row register Communication (6a). The Consumers finally complete "b i = b i − ∆ i j " after getting the ∆. Figure 5 . The Producer-Consumer pairing method and solving process on Sunway architecture. Each Producer will be paired with one Consumer, and the synchronization before each level's computation will only happen in each pair. The Producers in the same column will share x elements using Column register communication. The numbers 2a, 5a, 5b, 6a and 6b in subfigure A are the related operations in subfigure B.
Step 4. If the current Tile is not 'thin', each Producer does a Sub-Rolling to send cached x elements to its lower neighbour and receive new x elements from its upper neighbour by Column register communication to replace its cached x elements (6b), after consuming all the nonzeros based on its cached x elements (5b). We need to make each x element cross all the Producers in the same column, so this step needs to be repeated multiple times (named as Cycle for short), and the times equal with the number of CPE rows we adopt. The only reason of having this Step is that the cache_x has a limited size and may not hold all the data at the same time.
In summary, thanks to the Producer-Consumer pairing method, we complete inter-level synchronization and data sharing by very fast Row/Column register communication.
Experimental Results
Experimental Setup
To benchmark our proposed algorithm and existing methods, we evaluate 2057 sparse matrices on three platforms. Here we test all of the 2057 square matrices from the University of Florida Sparse Matrix Collection [8] (containing in total 2757 matrices: 700 rectangular and 2057 square matrices) to maximize the coverage of matrix features (i.e., matrix size, sparsity structure, the number of level-sets and application domains). Without loss of generality, we only execute lower triangular solve for each matrix, and add a major diagonal to make it nonsingular. To better visualize the large amount of results recorded, we divide the 2057 matrices into four groups according to their parallelism. Here, as work [29, 40] already did, we define the term parallelism as the average number of nonzeros per level (i.e., nnz/#levels). Tab.1 lists a statistics of the four groups of matrices.
1.E-4
1.E-3
1.E-2
1.E-1 1.E+0 Number of nonzeros Figure 6 . The inspector cost for all the matrices. The cheapest cost is 0.0003s, and the most expensive cost is 28.1621s. The harmonic average is 0.0033s
The three platforms are equipped with an SW26010 processor, an Nvidia Tesla K80 GPU, and an Intel Xeon Phi (KNC) processor, respectively. We test three SpTRSV methods (i.e., serial code, parallel level-set method, and the proposed swSp-TRSV) on the SW26010 processor, three GPU methods (i.e., cuSparse 8.0 SpTRSV v1 and v2, and SyncFree [29] ) and two x86 methods from Intel (i.e., serial/parallel MKL and P2P [40] ). Note that to achieve the best performance, we test the above algorithms on the processors they designed for. Also, if a device has multiple "work nodes", we only use one of them. Specifically, a work node of an SW26010 is one of its four CGs, and of an Nvidia K80 GPU is one of its two GK210 chips. Tab.2 lists more details. Also note that the experiments are completed in double precision, and each performance number is the average of 200 runs. Figure 7 . The impact of tuning the number of CPEs (i.e., the number of Producers/Consumers, see Sec. 4.1) and the size of cache_b (see Sec. 3.4 ). Each matrix is tested with 24 (4 × 6) parameter settings. Each value filled in the heat maps is a harmonic mean, this mean is calculated from ratios of the performance of current parameter setting to the highest performance of the 24 settings of all matrices in a group. matrix 'Tina_DisCog', while the most expensive cost is 28.16 s, occurs in the matrix 'europe_osm'. The harmonic average of all costs is 0.0033 s. It can be seen that the cost is basically linear with the number of nonzeros, and most costs (68.8%) are between 0.001 s and 0.1 s. So in practice, establishing SLT layout is quite fast, especially when considering an SLT matrix can be used many times once created.
The Cost of establishing SLT layout
Effect of Parameter Tuning
Note that multiplying #Producers or #Consumers and the size of cache_b is the Region size in the SLT layout. Thus, as described in Sec. 3.4 and Sec. 4.1, tuning the two parameters, i.e., #CPEs (2× #Producers or #Consumers) and the size of cache_b on each Consumer, may bring varied performance.
To tune them for best performance, we exhaustively execute different combinations of the two parameters. We set #CPEs to {8, 16, 32, 64} and cache_b to {128, 256, 512, 1024, 2048, 4096}, and run all 2057 matrices with those 24 parameter settings. For each of the four group, we calculate a normalized performance for one parameter setting. The results are illustrated in Fig.7 . As can be seen, with the increase of parallelism, best throughput tends to be obtained by more CPEs and larger cache_b. For Group A, the best performance occurs at the point when #CPEs is 16 and cache_b is 256; For Group B, a combination of 32 CPEs and cache_b = 2048 achieves the best performance; And for Groups C and D, using all CPEs with 4096 entries in each cache_b is the best. It can also be seen in Fig.7 that the influence of tuning the number of CPEs is more significant than tuning cache_b size, in particular for the groups with less parallelism. So, we fix the cache_b size to the best, and adjust the number of CPEs to find more insights of where the performance differences come. By profiling the code with all 2057 matrices, we find that the ratio of the actual and the theoretical amount of memory access (both in Bytes) is the key issue here. We plot the metric in Fig.8 . As can be seen, in Group A, the ratio is pretty high (up to about 12) when using 32 or 64 Parallelism (nnz/levels) Performance (MFlops) Figure 9 . The performance of different methods on Sunway architecture. Compared with the serial method and the parallel level-set method, our parallel method obtains an average speedup of 7.8 (maximal speedup of 117.3) and an average speed of 6.9 (maximal speedup of 38.5), respectively. CPEs. According to some other performance counters (not shown here for brevity), the extra memory accesses are from loading SLT information to more CPEs, and also from more frequently reloading x elements for Offdiagonal-Tiles. Thus, for matrices in Group A, 16 CPEs are enough to behave the best. With the increase of parallelism, the ratio becomes smaller. For Group B, the ratios of using 32, 16 or 8 CPEs are comparable (gradually become 1). More CPEs bring more parallel resource thus achieve better performance. Therefore, using 32 CPEs is the best. As for matrices in Groups C and D, the ratios of using different number of CPEs are almost identical (i.e., 1). Hence, using all the CPEs achieves the best performance due to higher parallelism.
We will use the performance with the best group-wise (but not matrix-wise) parameter settings in the rest of the paper. In other words, for each matrix, we select the best parameter setting of the group it belongs to, regardless whether the setting brings the best performance for this single matrix.
Methods on Sunway Processor
Here we compared the peformance of three methods on SW architecture: serial SpTRSV on one MPE, parallel level-set method and the proposed method on the 64 CPEs. Note that for the level-set method, memory access are set to go through the best paths, i.e., DMA for accessing L and b, and Gload/Gstore for accessing x, due to its randomness.
The results are presented in Fig.9 . For the benchmarks from Group A, swSpTRSV has comparable performance with the serial method, due to the poor parallelism and more memory accesses for the SLT layout information. But when the parallelism increases to 32 and beyond (i.e., Groups B, C and D), our swSpTRSV can achieve noticeably higher performance. The poor performance of parallel level-set method is because of frequent, fine-grained and random memory access for x, which can only use Gload/Gstore. Thus level-set method's performance may be even worse than the serial method in some cases. With the increase of parallelism, the performance gap between swSpTRSV and the other two methods is larger.
Compared with the serial method, our method achieves an average speedup of 7.79 and the best speedup 117.3, occurs in matrix 'kron_g500-logn21' (n = 2M, nnz = 93M, #levels = 4340, parallelism=21460). Compared with the level-set method, our method obtains an average speedup of 6.9 and the best speedup 38.5, occurs in matrix 'torso1' (n = 116K, nnz = 4.5M, #levels = 1689, parallelism=2672).
Tab.3 presents the power information of 20 typical benchmarks. The parallelism range from 21 to 669K. The power consist of two parts: the ddr power and the core power, the former of which reflects the energy cost of memory access and the latter reflects the cost of computation. With the increasing of the performance, both the ddr power, the core power, together with the performance/power increase. For the benchmark 'net150', where we obtain the highest performance, the total power reach 38.18 Watt and the performance/power is 89.22 MFlops/W. However, for the last three benchmarks, the power is high but the performance is low. This is because that there are lots of Offdiagonal-Tiles, which need much more memory access and obstruct high Table 3 . The power of 20 typical benchmarks. The ddr power reflects the energy cost of memory access and the core power reflects the cost of computation.
performance. What reflects in the table is that the ddr power is high and does not matches the low core power. MKL  24  0  0  0  24  P2P  0  3  49  25  77  cuSparse v1  0  0  66  107  173  cuSparse v2  0  10  143  6  159  SyncFree  0  0  0  0  0  swSpTRSV  225  1002  376  21  1624  Total  249  1015  634  159  2057  Table 4 . A statistics of the behavior of each method, in terms of the #matri-ces with the best performance achieved by it.
Different Methods on Different Processors
We compare swSpTRSV with five recent SpTRSV methods developed for KNC Xeon Phi and K80 GPU. The performance is presented in Fig.10 , and a statistics about the number of best performance achieved by each method is listed in Tab. 4 .
In Group A, the MKL (serial) method has comparable performance with ours, and both of them are faster than the other implementations. The reason is that less parallelism brings more synchronizations, which makes parallel methods slow. But the serial MKL function does not suffer from synchronizations cost, and our method is not so sensitive to the number of synchronizations. When the parallelism increases, in Group B (containing around half of the tested matrices), swSpTRSV largely outperforms any other methods, thanks to the SLT layout and the low-cost synchronization techniques. For the matrices in Groups C and D, as the parallelism increases, the synchronization cost is relatively lower, and the calculation cost gradually dominates the overall cost. As a result, the compute pattern of SpTRSV is more like SpMV. Hence, GPU with more concurrent threads (which are very helpful for latency hiding) in general behaves the best.
In summary, as listed in Tab.4, for all of the 2057 benchmarks, our method outperforms MKL and P2P methods on KNC in 1856 cases, and cuSparse and SyncFree methods on K80 in 1672 benchmarks. Totally, our method can achieve the best performance in 1624 matrices. It is worth to note that although the SyncFree method outperforms cuSparse v1 and v2 methods in many cases, it never demonstartes the best performance in the four Groups A-D. The reason is that its performance is lower than our swSpTRSV method in the low-parallelism benchmarks and is also lower than cuSparse v1 and v2 methods in the high-parallelism benchmarks.
Though we list the performance of the recent work on various platforms, it is worth to note that this brief comparison may only guide algorithm design (which is the primary target of this work) within a certain range, due to dramatic architectural difference between the three platforms and their distinct design objectives for power-performance tradeoff.
Related Work
Concurrent data structures are fundamental building blocks for computer science. Data layout for better spatial locality and concurrency is becoming ever more crucial in the multiand many-core era [6, 10, 15, 22, 34, 36, 43, 53, 61, 68, 69] . Im and Yelick presented a register blocking method to improve the performance of SpMV [17] . Vuduc et al. demonstrated that well-designed data layouts for cache/register data reuse in SpMV operation provides superior performance [58] , and similar optimization techniques are effective to serial Sp-TRSV operation as well [56, 59] . Strout et al. proposed sparse tiling [48, 49] that brings data locality and extra parallelism from both intra-and inter-iteration, and such techniques can also be extended to compile-time [47, 50] for automatically generating efficient code running in the inspector-executor mode [7] . Liu and Vinter [31] proposed the CSR5 storage format for avoiding degraded SpMV performance due to the irregularity of the distribution of nonzero entries. Moreover, because of the diversity of sparsity structures, recent research [24, 52, 70] suggested to analyze nonzero layout and select the best sparse kernels by using machine learning and deep learning methods.
Because of the serial nature, algorithm optimization for parallel SpTRSV have been mainly developed on top of the level-set methods by Anderson and Saad [2] and Saltz [45] and color-set methods by Schreiber and Tang [46] for various parallel architectures [25, 29, 39, 40, 51] . Despite their effectiveness, the barrier synchronization (where itself is a well-known bottleneck for parallel program [4, 14, 20, 21, 38, 65] ) often limits the performance of parallel SpTRSV. To address this problem, Park et al. [40] sparsified synchronization through pruning unneeded dependencies, and Liu et al. replaced synchronization with atomic operations [29] and developed a implementation for further parallelizing multiple right-hand sides [30] . Recently, Venkat et al. [54, 55] developed several techniques for loop and data transformations and dependency-aware optimization for sparse matrix computations, and faster level-set scheduling is one of the key targets of their research. However, synchronizations, even of the reduced number, still need to go through global shared memories and slow down the overall performance.
Mayer [37] and Wolf et al. [62] first pointed out that data layout optimization for parallel SpTRSV actually plays a more important role through some experiments on multicore systems. Kabir et al. [19] built A = L + L T and utilized both level-set and color-set as well as graph partition techniques for better data locality. But this method may not be suitable for broader architectures beyond multi-core and NUMA hardware. As for GPUs, Picciau et al. [41] developed an approach that partitions a matrix into multiple sub-graphs and uses graph theory for global scheduling. However, this method emphasizes on better utilization of scratchpad memory and task scheduling but not data reuse. Bradley [5] designed a method that mainly works well for matrices with dense submatrices generated from sparse direct solvers. All of the above tiling schemes, however, share a common disadvantage that the information from level-sets is not taken into consideration. This makes the computations in the tiles on the critical path (i.e., tiles containing diagonal elements) are still inherently serial (i.e., operations x j = b j /l j j and b i = b i − l i j x j are intertwined thus tightly coupled). Also, their data reuse is still out-of-control and inter-level synchronization still needs to go through global shared memories.
Compared to existing research, our Sparse Level Tile layout described in this paper divides a matrix into 2D tiles of non-uniformed shapes and connects them through their spatial relationships. Because we consider sparse tiling with the information from level-set, a DiaTile of SLT is always composed of a diagonal and a square submatrix, and an OffdiaTiles is always a rectangular or square submatrix. This means that although each Tile may not have uniformed shape, the decoupled operations for it (i.e., x j = b j /l j j , b i = b i − ∆ i j and ∆ i j = l i j x j ) must be regular and parallelfriendly. This design thus brings controllable data reuse of x and b, and exploits the advantages of register communication on Sunway architecture for very fast inter-level synchronizations. As shown in the experiments, a large amount of sparse problems have obtained benefits from our method.
Compared with recent research on regular problems on Sunway architecture, such as stencil [3] , DNN [12] , GEMM [18, 27] and fully-implicit solver for nonhydrostatic atmospheric dynamics [66] , our work presented in this paper is more complicated, as the irregularities from various matrix sparsity structures are dynamic and leveraging such irregularity is known to be more challenging [57, 67] .
Conclusion
Sparse triangular solve plays an important role in many applications. Despite its importance, out-of-control data reuse and slow synchronization via global memory have been largely neglected in existing research. In this work, we proposed swSpTRSV for SW26010 processor, the main building block of Sunway Taihulight supercomputer. The swSpTRSV consists of a Sparse Level Tile layout to make all data reuse under control and a Producer-Consumer pairing method to make any inter-level synchronization only happen through fast register communication. The experimental results benchmarked with 2057 matrices showed that our approach largely outperforms parallel level-set method and has superior performance over five recent methods on KNC and GPU processors.
It is worth to note that even though in this paper the SLT layout and the Producer-Consumer Pairing method are designed for Sunway architectures with Scratch Pad memory, DMA-based memory access and register communication on mesh network, we believe that our proposed method can bring insightful experience to algorithm design on future architectures, such as GPUs with register-communication and x86 processors with 2D mesh interconnect. -Use nohup to run all the benchmarks $ nohup ./total_run.sh & Note: Use bjobs to check whether the jobs finish or not. Each job will produce a new file in the subdirectory 'results' in the current directory.
-The referenced performance results are listed in the file 'ref_SwSpTRSV', 'ref_serial' and 'ref_levelsets' in the current directory. These results are used to draw the Scatter Plot of Fig. 9 and Fig. 10 .
Note: There are totally 2057 works and will cost about 2 hours.
You do not need to wait for finishing all the test cases and can compare the result of any case with the referenced result as long as it is finished.
