We present a simulation-based performance model to analyze a parallel sparse LU factorization algorithm on modern cached-based, high-end parallel architectures. We consider supernodal right-looking parallel factorization on a bi-dimensional grid of processors, that uses static pivoting. Our model characterizes the algorithmic behavior by taking into account the underlying processor speed, memory system performance, as well as the interconnect speed. The model is validated using the implementation in the SuperLU_DIST linear system solver, the sparse matrices from real application, and an IBM POWER3 parallel machine. Our modeling methodology can be adapted to study performance of other types of sparse factorizations, such as Cholesky or QR, and on different parallel machines.
has been much research dedicated to efficient parallel implementation of the sparse LU factorization. Although the factorization has been made more and more scalable, loss of performance is often observed with increasing number of processors, and more so with very sparse and unsymmetric problems. It is difficult to predict the performance of a parallel sparse factorization, which largely depends on the sparsity of the matrix, and the sparsity patterns vary with applications. Several analytical results exist in the literature [3, 7, 11] , which were obtained for particular classes of matrices arising from the discretization of certain PDEs.
In our previous work we considered a supernodal right-looking parallel factorization on two dimensional grids of processors, using static pivoting [6] . We derived an analytical estimation of the parallel runtime for arbitrary input matrices, as presented in Eq. (14) later in this paper. This was obtained by considering a simple machine architecture and by making several simplified assumptions on the parallel algorithm. We assumed that the floating-point operations are performed at a constant rate throughout the factorization, that was determined for every matrix by the sequential runtime and the number of floating point operations necessary to factorize it. We also assumed that the messages are transferred at a fixed bandwidth value independent of their sizes and that the load and the data is evenly distributed among processors. The analytical model allowed us to obtain performance upper bounds on parallel computers, to perform scalability analysis, and to derive a relation between parallel runtime and matrix sparsity. We used the SuperLU_DIST solver [10] to compare its actual runtime with the model's predicted speed, and observed that the speedup predicted by the analytical performance model started to deviate above the measured speedup of SuperLU_DIST with increasing number of processors.
In this paper we present a much refined model that removes the unrealistic assumptions. In addition, our new model takes into account not only the speeds of CPU and interconnect network, but also the memory system performance. We will show that the new model can accurately predict the actual performance of the code. First, we develop a runtime model for the sequential algorithm. For this, we perform a detailed analysis of the numerical kernels used in the sparse LU factorization, which takes into account that blocking and loop unrolling are used for obtaining high performance. Moreover, instead of estimating the runtime by counting the floating-point operations alone and assuming that they are performed at a constant speed, we also evaluate the cost of memory accesses, because in a sparse factorization, a nontrivial amount of time is often spent in streaming through data. The cost of memory access is different whether the data are already in cache or they have to be fetched from main memory. Thus, different latencies are incurred when the data are fetched at different levels of the memory hierarchy. Second, for the parallel runtime, we include the classical latency-bandwidth communication cost model, but with varying bandwidth values which are calibrated off-line based on the message size distribution throughout the factorization.
The new performance model is based on a simulation of the execution of the right-looking supernodal LU factorization, which accounts for one pass over the structure of the factors L and U. This simulation estimates, at each step of factorization, the time spent in computation, the time spent in communication and the possible overlap of communication and computation. The model uses several input parameters, that include the dimensions of the blocks and loop unrollings used in the numerical kernels, the line size and the total size of the cache memories, the policies used for evicting lines from cache, and the bandwidth values for different message sizes. The parameters for a specific parallel architecture are determined off-line. They are either taken from the machine's specification sheet if available, or obtained using the hardware counter results.
In this paper, we validate our performance model using the right-looking factorization algorithm implemented in SuperLU_DIST, representative matrices from several applications and an IBM POWER3 parallel machine. In the future, we plan to validate the model on different parallel machines. This involves mainly tuning the parameters of the model on the targeted machine.
The rest of the paper is organized as follows. Section 2 describes our modeling methodology and assumptions. Section 3 analyzes the performance of the numerical kernels. Section 4 describes our methodology for estimating the number of cache misses. Section 5 describes a performance model that estimates the execution time of the sequential and parallel factorization. The results validating the performance model are presented in Sect. 6. Finally, Sect. 7 draws the conclusions.
Background
Algorithm 1 presents a right-looking algorithm that factorizes a sparse unsymmetric n × n matrix A into the product of a unit lower triangular matrix L and an upper triangular matrix U. The matrix is partitioned into N × N blocks of submatrices using unsymmetric supernodes (columns of L with the same nonzero structure). The algorithm loops over N supernodes. At the kth step, the first k − 1 block columns of L and block rows of U are already computed. Now, the block column L(k : N, k) is factored, the triangular solves are performed to compute U(k, k + 1 : N), and the trailing matrix is updated using L(k + 1 : N, k) and U(k, k + 1 : N). The last step requires most of the work and also exhibits most of the parallelism in the right-looking approach. 
Algorithm 1 Right-looking factorization
for k := 1 to N do (1) Factorize block column L(k : N, k) (2) Perform triangular solves: U(k, k + 1 : N) := L(k, k) −1 × A(k, k + 1 : N) for j := k + 1 to N with U(k, j) = 0 do for i := k + 1 to N with L(i, k) = 0 do (3) Update trailing submatrix: A(i, j) := A(i, j) − L(i, k) × U(k, j)
Distributed matrix Grid of processors
The parallel algorithm in SuperLU_DIST uses a data distribution on a two dimensional process grid and a look-ahead technique to overlap communication and computation, as shown in Algorithm 2. Figure 1 illustrates the data distribution on a 2D grid of six processors. The blocks of the matrix partitioned by supernodes are distributed among P r × P c (= P) processors using a block cyclic distribution. A block at position (i, j) of the matrix (0 ≤ i, j < N) will be mapped on the process at position (imodP r , jmodP c ) of the grid. U(k, j) (L(k, j)) denotes a submatrix of U (L) at row block index k and column block index j. We use (L + U)(k, j) to denote a block of L or U at block row k and block column j.
The look-ahead technique in Algorithm 2 appears in the update of the trailing matrix. At step k, we first update the block column (k + 1) and perform the block factorization [steps (3.1) and (1.2)]. The column processes owning block column k + 1 then send it to all the processes that need it using nonblocking sends, while the potential receivers post nonblocking receives [steps (a.3) and (a.4)]. Afterwards, the rest of the trailing matrix (block columns k + 2 : N)) is updated [step (3.2)], which overlaps with the communication in steps (a.3) and (a.4). Steps (c.1) and (c.2) contain the matching waits for the nonblocking send and receive from steps (a.3) and (a.4).
The performance model we developed has the following characteristics:
-We identify which part of the code spends most of the time on floating-point operations (i.e., flops-bound) and which part of the code spends most of the time in loading and storing the data (i.e., memory-bound). Flops-bound means that the number of cycles spent in floating-point operations is larger than that spent in loads and stores, and hence we consider that the loads and stores are overlapped with computation. For the loops that are flops-bound, the model uses one parameter to describe the processor's speed, denoted as γ , which is the time taken to perform one flop. For in-cache data, we assume that the processor performs at the peak speed. -We consider a hierarchy of caches with a perfect nesting and that accessing different memory levels incurs different latencies. We use α i to denote the level i cache access latency (in cycles), and α mem to denote the main mem- ory access latency. We only account for the capacity and compulsory misses but ignore the conflict misses. We assume that the cache memories are fully associative.
-We do not model TLB misses in detail. But in our experimental section, we count the effect of TLB misses by considering a value for α mem that includes a memory access and a TLB miss. -For interprocessor communication, we estimate the time for sending a message of m items between two processors as α + mβ, where α denotes the latency and β the inverse of the bandwidth. We ignore the effect of message collisions. The value of the bandwidth is dependent on the size of message being transferred.
Performance analysis of the relevant BLAS routines
In this section we model the performance of the numerical kernels used in The triangular solve and the rank one update operations (DTRSV and DGER) are memory-bound, since there is only O(1) flop associated with each data item and most of the time is spent in accessing the data. For the matrix multiply and update operations (DGEMM), depending on the size of the matrices, the model determines if the routine is memory-bound or flops-bound. We assume that all the computing kernels are implemented using register blocking, which decreases the number of loads and stores between registers and memory and ensures that the multiple functional units are effectively utilized. When using the vendor-supplied BLAS library, the number of registers used for every routine is not known. We determined the number of registers experimentally such that the hardware counter results match well with our estimations.
Analysis of DGEMM Most of the computation in SuperLU_DIST takes place in the rank-k updates [steps (3.1) and (3.2) of Algorithm 2], which are performed using DGEMM. The operation is C = aAB + bC, where C is an M × N dense matrix, A is an M × K dense matrix, B is an K × N dense matrix and a and b are scalars. We use the superscalar implementation [8, 9] described in Algorithm 3 to analyze DGEMM.
Algorithm 3 Superscalar matrix multiply routine
for J = 1 to N step N B do for L = 1 to K step K B do for I = 1 to M step M B do T 1 = a × A(I, L) C(I, J) = bC(I, J) + T T 1
B(L, J) end for end for end for
The algorithm is designed for cache reuse: every block B(L, J) is used multiple times by looping over the block rows of matrix A. At each step of the inner loop, a block of matrix A is multiplied with a and copied into a temporary buffer T 1 , which avoids bad leading dimension problem. Then a block of matrix C is updated with the result of the product of T 1 with a block of B. Loop unrolling is used to ensure good performance of the innermost in-cache matrix multiplication, which we assume is done as follows. E a registers are used for the elements of T 1 , E b registers are used for the elements of B and E a E b registers are used for the elements of C. An E a × E b block of C(I, J) is computed in the innermost loop. First, the E a × E b block of C is multiplied with b and stored in the registers, second, the registers are updated with elements of the product 
(1)
If the innermost loop is flops-bound and the blocks are all in cache, the performance of the in-cache block multiplication 
If the innermost loop of the in-cache matrix multiplication is flops-bound, the overall performance of Algorithm 3 is given by:
Copy from registers to C (5)
The choice to base our simulation on the superscalar matrix multiplication routine was motivated by the experiments presented in the IBM POWER3 Introduction and Tuning Guide [2, p. 155], which compare its performance to the performance of DGEMM implemented in IBM's ESSL POWER3-enhanced library. The results showed the performances are close for square matrices. The authors expect that ESSL may perform better for rectangular matrices.
The routine DGEMM is used at each step k of Algorithm 2, step (3) for the rank-k update of a block of the trailing matrix. To estimate the performance of step (3), we use the following notations. Flops dgemm denotes the number of floating-point operations for all the calls to DGEMM that are computed as follows. For the calls to DGEMM that are flops-bound, the number of flops in Eq. (6) Analysis of DGER Consider the rank one update A = A − xy T , where A is a M × N dense matrix, x is a dense vector of length M and y is a dense vector of length N.
Since the matrix uses a column oriented storage, we assume that the outermost loop loops over columns, which minimizes the cache and TLB misses. The optimal unrolling factor of the innermost loop should be such that the size of each subcolumn is the same as the cache line size. But because of the floating-point register limitation, the unrolling factor may be restricted to a smaller number. The innermost loop then will operate on the number of columns in a vertical block. We denote the unrolling of the outermost loop as E y and of the innermost loop as E x . The number of loads and stores are calculated as:
The routine DGER is used in step (1) of Algorithm 2 for factorizing block column L(k : N, k). This factorization is based on rank-1 updates and scaling. We estimate the memory accesses of step (1) by summing the loads and stores corresponding to all the calls to DGER, which are denoted as Loads dger and Stores dger .
Analysis of DTRSV Consider the triangular solve Lx = y, where L is a N × N lower triangular dense matrix, and x and y are dense vectors. In our analysis we consider a column oriented (axpy) algorithm. The estimation of the number of loads and stores is similar to the one described by Vuduc et al. [13] . It assumes that register blocking is used such that the matrix is partitioned into register blocks of size E L × E L , which leads to E L loads of the vector x for each register block.
The computation of the row block U(k, :) [step (2)] of Algorithm 2 is performed through calls to DTRSV. An estimation of the memory accesses of this step can be therefore computed by adding the number of loads and stores associated with all the calls to DTRSV, which we denote as Loads dtrsv and Stores dtrsv .
Estimation of cache misses
In this section we first present analytical lower and upper bounds for the number of cache misses. We then describe a method that provides a better estimation of the cache misses by simulating the factorization algorithm.
Considering level q (Lq) cache, we use l q to denote its line size and C q to denote its capacity, both in doubles.
At step k of Algorithm 2, the working storage consists of the space needed to store the block columns k and k + 1 of L, the block row k of U, the blocks of the trailing matrix that will be updated, and several temporary arrays used during the updates.
We now give lower and upper bounds for the number of times each block of L and U is brought into cache. Consider block L(i, j), we use x ij to denote the number of updates performed on L(i, j), and y ij to denote the cost of finding L(i, j) when it is the destination of an update. z j denotes the number of nonzero blocks in block row j of U, which equals the number of updates performed at step j and for which L(i, j) is the source of an update. We denote sz(L(i, j)) as the size (in doubles) of the block L(i, j) (nonzero values and indices). Our estimation takes into account the storage by columns, the leading dimension and the line size of the cache.
A lower bound M L on the number of cache misses is obtained by assuming that there is only one compulsory miss for every block. That is, during the first update on the block L(i, j), this block is brought into cache, and it will stay in cache throughout the rest of the updates when it is either source or destination, and during its own factorization.
An upper bound M U for the number of misses is obtained by assuming that for every block, there is a compulsory/capacity miss each time when it is the destination of an update, when its factorization is computed, and when it is the source of an update.
Note that these analytic bounds do not take into account the working space needed at each step of the algorithm compared to the size of the cache C q . If the working space fits in cache, at most one compulsory/capacity miss is generated for each element of the working space. If some of the blocks updated at step k − 1 are also updated at step k and the blocks are still in cache, then fewer misses will be incurred. If the working space does not fit in cache, it means that the cache is not large enough to hold the block row U(k, :) and the block column L(:, k) during the column factorization, triangular solves and the update of the trailing matrix. Then, when updating the trailing matrix, in addition to the misses incurred for every updated block, there is one additional capacity miss for the elements of U(k, :) (values and indices), and for every block of U(k, :), there is one capacity miss for the elements of L(:, k) (values and indices). This discussion shows that it is difficult to derive realistic analytic bounds for the number of cache misses. We expect that at the beginning of the factorization, when the matrix is very sparse, the working space will fit in cache. But when the trailing matrix becomes denser, the working space will no longer fit in cache.
To obtain accurate account of cache misses, we wrote a simulator that simulates the factorization algorithm and the memory system behavior. Every level of cache is associated with a policy for evicting blocks from cache, such as the least recently used (LRU) or the round robin policy. Every time a block of the matrix is accessed, the simulator checks whether the block is already in cache and at which level. If it exists and the eviction policy is LRU, this block is marked as the LRU. If the block is not in cache, it is is brought in cache, and the counter for the cache misses is incremented. When the cache capacity is exceeded, some blocks are evicted. The experimental results in Sect. 6 show that this simulation leads to results close to the cache misses obtained from the hardware counters.
Runtime estimation
We first present an estimation of the sequential algorithm, and then describe the communication model that extends the sequential runtime estimation to an estimation of the parallel runtime.
Sequential runtime estimation.
In an earlier work on performance analysis of sparse matrix-vector multiply [13] , only the cost of memory accesses needs to be accounted, because that algorithm is entirely memory-bound and there is little data reuse. Sparse factorization algorithms on the other hand contain both flops-and memory-bound kernels, and the kernel mix changes at different stages of the factorization, which complicates our analysis.
Assume that there are C cache levels. We denote by h i the number of hits and by m i the number of misses at the ith cache level. The execution time charges for the memory accesses to different levels of the memory hierarchy in a similar way to [13] . The hits for i ≥ 1 are computed as h i+1 = m i − m i+1 . The sequential runtime is estimated as follows:
The memory access cost is computed using the number of loads and stores performed for the memory-bound parts of Algorithm 2. For steps (1.1) and (1.2), the number of loads and stores is given by Loads dger and Stores dger . For step (2) , the number of loads and stores is given by Loads dtrsv and Stores dtrsv . For steps (3.1) and (3.2), we sum up the number of loads and stores in the calls to DGEMM (denoted by Loads dgemm and Stores dgemm ), and the number of loads and stores in copying the matrix blocks into a temporary buffer to prepare for the call to DGEMM (denoted by Loads temp and Stores temp ). The L1 cache hit h 1 is computed as follows: As described in Sect. 3, Loads dgemm represents the number of loads performed in the memory-bound loops of DGEMM, plus the number of loads involved in copying the matrix blocks to temporary buffers. Stores dgemm represents the number of stores involved in copying into the temporary buffers or copying from the registers back to the matrix destination. The time spent in the flops-bound loops of DGEMM is counted in the first term of T seq in Eq. (13) (Flops dgemm γ ).
Parallel runtime estimation
In our previous work, under some simplifying assumptions, we have established the following parallel runtime estimation using a square grid of processors [6] :
where N is the order of the matrix; F is the total number of flops in the factorization; nnz(L) is the number of nonzeros in the off-diagonal blocks of L; nnz(U) is the number of nonzeros in the off-diagonal blocks of U. The first term represents the parallelization of the computation. The second term represents the number of broadcast messages. The third term represents the volume of communication.
For more accurate accounting of the parallel runtime, we wrote a simulator to simulate the execution of Algorithm 2. This simulation has several purposes. First, the simulation can provide an accurate estimation of the cache misses which otherwise cannot be determined analytically. Second, the simulation also allows us to include load imbalance factor in the model. Third, using simulation we can analyze the critical path behavior by considering load balance at each step and the amount of overlap of computation and communication.
We use the commonly adopted latency-bandwidth cost model to analyze the interprocessor communication. An improvement we made is that different values of the inverse of the communication bandwidth β are used based on the size of the messages.
The simulation algorithm needs to estimate at each iteration of Algorithm 2 the time spent in each computation step. This is obtained using Eq. (13), where the number of loads, stores, flops and cache misses are the numbers associated with that step and performed during the current iteration by each processor.
The time spent in communication is estimated at each iteration as follows. In steps (b.1) and (b.2), messages are communicated using blocking sends and receives, which is implemented as follows. Consider for example that processor PE0 needs to send a message of size m to processors PE1 and PE2. PE0 first sends it to PE1 and then sends it to PE2. The time spent by all the processors PE0, PE1 and PE2 in this communication is estimated as 2(α + mβ). In steps (a.1), (a.2), (a.3) and (a.4), messages are communicated using nonblocking sends and receives, in which case the time is estimated as (α + mβ).
The parallel runtime of Algorithm 2 is obtained by computing for each processor the runtime of the steps that lie on the critical path. In the initialization step, each processor that owns blocks of L(:, 1) factorizes L(:, 1) and then sends the blocks to processors in the same row of the grid that need them. The processors that participate in this step and that are in the same row of the grid have the same runtime, which is given by the time spent in factorizing block column 1, and the time spent in the communication (a.1), (a.2).
At each iteration k of Algorithm 2, the runtime is estimated as follows. The time spent in the computation of block row U(k, :) via triangular solves [step (2) ] is added to the processors in the row of the grid that own blocks of U(k, :). 
Experimental results
In this section we compare the analytic bounds and the simulation results against the actual performance of the code. We used the IBM SP RS/6000 distributed memory machine at NERSC (National Energy Research Scientific Computing Center). The system contains 2,944 compute processors distributed among 184 compute nodes. Each node of 16 processors has 16-64 Gbytes of shared memory. Each POWER3 processor is clocked at 375 Mhz and has a peak performance of 1.5 GFlops. It has a L1 data cache of 32 KBytes and a L2 cache of 8, 192 KBytes. The cache line for both L1 and L2 is 128 bytes. The L1 access latency is α 1 = 1 cycle. A data access that misses L1 but hits L2 incurs a load latency of α 2 = 9 cycles. We consider that a data access that misses both L1 and L2 cache incurs a latency α mem between 35 and 139 cycles [13] . Each processor has two floating-point units, each of which can execute one compound floating-point multiply-add (FMA) operation per cycle. Hence POWER3 can execute up to four floating-point operations per cycle. In addition, POWER3 can commit two loads/stores per cycle, and so the L 1 access latency α 1 is divided by two in our estimations.
We used several medium to large size matrices from different application domains. The characteristics of the matrices are given in Table 1 , which includes the matrix order, the number of nonzeros in the input matrix A, the number of nonzeros in the factors L and U, and the number of floating-point operations performed by SuperLU_DIST in the numeric factorization. The matrices are available from the University of Florida Sparse Matrix collection [5] . Matrices af23560, bbmat, fidapm11 and inv-extr1 come from fluid flow and CFD problems. Matrix ecl32 is a device simulation problem. Matrix stomach is issued from a 3D electro-physical model. Matrices g7jac200sc and mark3jac140sc are from economic models.
The goal of our experiments is three-fold. First, we want to compare our estimation of the sequential runtime with that of SuperLU_DIST. This would validate the accuracy of our model in counting the number of loads, stores Table 1 Benchmark matrices and their characteristics: the order n, the number of nonzeros of the input matrix nnz(A), the number of nonzeros in the factors nnz(L + U), the number of floating point operations in the factorization, the sparsity of the matrix measured as nnz(A)/n and the structural symmetry Sym Our model uses several parameters that have to be tuned on the target machine for which the simulation is performed, such as the block size used in DGEMM. When there is no documentation, we determined their values as follows. The computing kernels were executed on the target architecture and the PAPI [4] results for loads, stores and cache misses were collected. Then we choose the parameters that provide a good fit for our estimation of the number of loads, stores and cache misses.
In the parameter tuning phase, we have to estimate the size of the blocks MB, NB and KB used in the superscalar matrix multiplication algorithm presented in Sect. 3. In our experiments we have used MB = 32, NB = 48 and KB = 48. We also have to estimate the number of registers used for loop unrolling. For the DGEMM routine, we have tried several values and we have found that a 4 by 2 unrolling matches well our estimation of number of loads. Note that this happens to be also the unrolling level used on the IBM POWER2 [1] . In fact, both processors have two floating-point units, and while POWER3 has a floating-point pipe of three or four cycles long, POWER2 has a shorter floating-point pipe of two or three cycles. For both processors, a 4 by 2 unrolling ensures that the two floating point units are well utilized.
To estimate the number of cache misses, our simulation can consider different policies for replacing lines in cache. For L1 cache, POWER3 replaces lines using a round-robin policy within a congruence class. The congruence class is defined by the 128-way associativity of the L1 cache. For L2 cache, real addresses are directly mapped to the cache. In our simulation, we assume cache is fully associative, and we use round-robin for L1 cache and LRU policy for L2 cache. Figure 2 presents the number of loads and stores obtained by our performance model, and compares with the PAPI results of loads and stores obtained for SuperLU_DIST. Our model estimates well the number of loads for all the matrices, but does not do so well with the number of stores-the estimation can be smaller than PAPI numbers (up to 51% for matrix mark3jac140sc). We also observed such underestimation when tuning our DGEMM model for the POWER3 architecture. We think that the overhead incurred by a call to DGEMM is nontrivial and is not taken into account by our model. This overhead affects the overall number of stores. But since the number of stores is generally three or four times smaller than the number of loads, the underestimation of number of stores does not have much impact on the overall estimation. Figure 3 presents the results of cache misses comparing our estimation, the analytic lower and upper bounds, and the PAPI results. The lower and upper bounds do not depend on the cache size, and are the same for L1 and L2 cache. For the sake of clarity, we do not include the upper bound in the plot of the L2 results. The lower bound underestimates severely the number of L1 misses (by two orders of magnitude). It also underestimates the number of misses in L2 cache, very often by one order of magnitude. For L1 cache, the upper bound is usually twice the number of PAPI misses. But as expected, it overestimates severely the number of L2 cache misses. These results show that it is difficult to derive analytic bounds for the number of misses for sparse factorizations. Our estimation predicts well the number of cache misses, both for L1 and L2 caches. One exception is matrix mark3jac140sc, which is one of the most unsymmetric matrix in our test set. For this matrix, our estimation tends to overestimate the number of cache misses. For the L1 cache, we underestimate up to 17% for matrix af23560 and overestimates up to 47% for matrix mark3jac140sc, when compared to the PAPI L1 cache miss results. For the L2 cache, for five matrices our estimation underestimates the number of L2 cache misses (up to 33% for matrix fidapm11). For the other three matrices, our estimation overestimates the number of L2 cache misses (less than 1% for af23560 and g7jac200sc, 72% for matrix mark3jac140sc).
In the plot depicting the L2 cache misses we also include the number of TLB misses. Note that SuperLU_DIST incurs large number of TLB misses. For af23560, the number of TLB misses is larger than the number of L2 misses. For bbmat, the number of TLB misses is half the number of L2 misses, and for the other matrices it is less than half. This indicates that using the minimum value of the memory access latency would lead to an underestimation of the runtime. Figure 4 presents the performance results in number of cycles. The lower bound is obtained by using the minimum value of memory access latency for α mem (35 cycles) in Eq. (13), which assumes a TLB hit. The upper bound is obtained by taking the maximum value for α mem (139 cycles), which assumes a TLB miss. Both bounds are smaller than the measured runtime. The upper bound can underestimate the runtime with up to 32% for bbmat. The underestimation is larger for the lower bound (57% for mark3jac140sc and 47% for g7jac200sc.)
For parallel runtime estimation, we first need to determine the two parameters α and β used in the communication cost model. To this end, we ran the point-to-point pingpong micro-benchmark available at the NERSC web site [12] to callibrate the times needed to transfer messages of various sizes that appeared throughout the factorization. The latency value α is 8.0 µs. The bandwidth values used for β are tabulated in Table 2 . This approach is an improvement over our previous work, where we used a single bandwidth value independent of the message size. Figures 5 and 6 compare the runtimes in number of cycles obtained by our simulation and by SuperLU_DIST with increasing number of processors. We also present in Table 3 the SuperLU_DIST runtimes in seconds. Our simulation predicts well the actual performance of SuperLU_DIST.
Our performance model examines the amount of computation on the critical path. In Algorithm 2, the computation performed in steps (2), (3.1) and (1.2) lies on the critical path. That is, the computation of a block row of U, the update and the factorization of block column k + 1 of L lie on the critical path. Since the algorithm implemented in SuperLU_DIST uses a look-ahead technique, the computation corresponding to step (3.2) (update of the trailing matrix at step k) is overlapped with the communication involved in steps (a.3) and (a.4) (send/receive of block column k + 1 of L). Hence at each step of the algorithm, we add the computation corresponding to step (3.2) to the critical path if it is more important than the communication involved in steps (a.3) lying on the critical path. These plots show that on small number of processors, the computation is overlapped with the communication and the computation dominates the critical path. With increasing number of processors, the critical path is dominated by communication. This is because there are more messages of smaller size, and the time spent in communication is larger. Note that for af23560, g7jac200sc and mark3jac140sc our simulation overestimates the runtime on 64 processors. We believe this is due to the loss of accuracy associated with the communication cost model.
We also examined load balance LB throughout the parallel factorization. We define the load L to be the computation involved in the entire factorization. We then compute the load lying on the critical path L CP by adding at each step the load of the most loaded processor. More precisely, consider l pi being the load of processor p at step i [i.e., the computation in number of cycles performed by this processor at step i, which takes into account the number of loads/stores and cache misses as in Eq. (13)]. Then, L CP = N i=1 max P p=1 l pi . The load balance factor is computed as LB = L CP P L . In other words, LB is the load of heaviest processors lying on the critical path divided by the average load per processor. The closer is this factor to 1, the better balanced is the workload. Table 3 shows that the workload distribution is good on a small number of processors. But it can degrade quickly for some matrices, such as af23560, leading to quick performance degradation.
Conclusions
We developed a simulation-based performance model to analyze a parallel sparse LU factorization algorithm on modern cached-based, high-end parallel architectures. Our model characterizes the algorithmic behavior by taking into account the underlying processor speed, memory system performance, as well as the interconnect speed. The simulation-based model differs from our previously developed analytical model [6] in that it actually simulates the execution of the parallel algorithm. When comparing the predicted runtime with the results of SuperLU_DIST, we noticed that the old analytical model failed to predict accurately the parallel execution time with increasing number of processors.
Unlike Level 2 sparse BLAS operations, such as sparse matrix-vector multiply or sparse triangular solution, where performance is usually solely dominated by memory system performance, the sparse factorization algorithms usually contain a mixture of Level 2 and Level 3 BLAS routines. Depending on the sparsity of the matrix, which changes at different stages of factorization, the kernels can be either memory-bound or flops-bound. We have taken into account this feature in our analysis, and our new model is more accurate in predicting performance.
Another improvement over the previous work is to use different bandwidth values to model interconnect communication speed for different message sizes. This is very important for sparse factorization algorithms, because throughout factorization, the message sizes can differ by several orders of magnitudes.
We have validated our model using SuperLU_DIST on an IBM POWER3 parallel machine, and showed that the simulated runtime is close to the measured runtime. In the future, we plan to validate the model on different parallel architectures. After this validation phase, we will be able to use the model realistically for predicting code performance on newly designed architectures and for new matrices. Furthermore, our model reveals that the loss of parallel efficiency on a large number of processors is due to load imbalance and communication cost. This points to possible future work to improve the code. For example, we can use better matrix distribution scheme to reduce load imbalance. We can use a higher level of look-ahead scheme to increase the overlap of communication with computation.
Finally, our modeling methodology can be adapted to study performance of other types of sparse factorizations, such as Cholesky or QR. This remains future work.
