Abstract-A sparse LU factorization based on Gaussian elimination with partial pivoting (GEPP) is important to many scientific applications, but it is still an open problem to develop a high performance GEPP code on distributed memory machines. The main difficulty is that partial pivoting operations dynamically change computation and nonzero fill-in structures during the elimination process. This paper presents an approach called S* for parallelizing this problem on distributed memory machines. The S* approach adopts static symbolic factorization to avoid run-time control overhead, incorporates 2D L/U supernode partitioning and amalgamation strategies to improve caching performance, and exploits irregular task parallelism embedded in sparse LU using asynchronous computation scheduling. The paper discusses and compares the algorithms using 1D and 2D data mapping schemes, and presents experimental studies on Cray-T3D and T3E. The performance results for a set of nonsymmetric benchmark matrices are very encouraging, and S* has achieved up to 6.878 GFLOPS on 128 T3E nodes. To the best of our knowledge, this is the highest performance ever achieved for this challenging problem and the previous record was 2.583 GFLOPS on shared memory machines [8] .
INTRODUCTION
PARSE matrix factorization is an important approach to solving a sparse system of linear equations. If a matrix is symmetric and positive definite, Cholesky factorization can be used, for which fast sequential and parallel algorithms have been developed in [24] , [30] , [31] . However, in many applications, the associated equation systems involve nonsymmetric matrices and pivoting may be required to maintain numerical stability for such nonsymmetric linear systems [6] , [23] . Because pivoting operations interchange rows based on the numerical values of matrix elements during the elimination process, it is impossible to predict the precise structures of L and U factors without actually performing the numerical factorization. The adaptive and irregular nature of sparse LU data structures makes an efficient implementation of this algorithm very hard, even on a modern sequential machine with memory hierarchies.
There are several approaches that can be used for solving nonsymmetric systems. One approach is the unsymmetric-pattern multifrontal method [5] , [25] that uses elimination graphs to model irregular parallelism and guide the parallel computation. Another approach [19] is to restructure a sparse matrix into a bordered block upper triangular form and use a special pivoting technique which preserves the structure and maintains numerical stability at acceptable levels. This method has been implemented on Illinois Cedar multiprocessors, based on Aliant shared memory clusters. This paper focuses on parallelization issues for a given column ordering, with row interchanges to maintain numerical stability. Parallelization of sparse LU with partial pivoting is also studied in [21] on a shared memory machine by using static symbolic LU factorization to overestimate nonzero fill-ins and avoid dynamic variation of LU data structures. This approach leads to good speedups for up to six processors on a Sequent machine and further work is needed to assess the performance of the sequential code.
As far as we know, there are no published results for parallel sparse LU on popular commercial distributed memory machines such as Cray-T3D/T3E, Intel Paragon, IBM SP/2, TMC CM-5, and Meiko CS-2. One difficulty in the parallelization of sparse LU on these machines is how to utilize a sophisticated uniprocessor architecture. The design of a sequential algorithm must take advantage of caching, which makes some previously proposed techniques less effective. On the other hand, a parallel implementation must utilize the fast communication mechanisms available on these machines. It is easy to get speedups by comparing a parallel code to a sequential code which does not fully exploit the uniprocessor capability, but it is not as easy to parallelize a highly optimized sequential code. One such sequential code is SuperLU [7] , which uses a supernode approach to conduct sequential sparse LU with partial pivoting. The supernode partitioning makes it possible to perform most of the numerical updates using BLAS-2 level dense matrix-vector multiplications and, therefore, to better exploit memory hierarchies. SuperLU performs symbolic factorization and generates supernodes on the fly as the factorization proceeds. UMFPACK is another competitive sequential code for this problem and neither SuperLU nor UMFPACK is always better than the other [3] , [4] , [7] . MA41 is a code for sparse matrices with symmetric patterns. All of them are regarded as of high quality and deliver excellent megaflop performance. In this paper, we focus on the performance analysis and comparison with SuperLU code since the structure of our code is closer to that of SuperLU.
In this paper, we present an approach called S* that considers the following key strategies together in parallelizing the sparse LU algorithm: 1) Adopt a static symbolic factorization scheme to eliminate the data structure variation caused by dynamic pivoting. 2) Identify data regularity from the sparse structure obtained by the symbolic factorization scheme so that efficient dense operations can be used to perform most of computation and the impact of nonzero fill-in overestimation on overall elimination time is minimized. 3) Develop scheduling techniques for exploiting maximum irregular parallelism and reducing memory requirements for solving large problems.
We observe that on most current commodity processors with memory hierarchies, a highly optimized BLAS-3 subroutine usually outperforms a BLAS-2 subroutine in implementing the same numerical operations [6] , [9] . We can afford to introduce some extra BLAS-3 operations in redesigning the LU algorithm so that the new algorithm is easily parallelized, but the sequential performance of this code is still competitive to the current best sequential code. We use the static symbolic factorization technique first proposed in [20] , [21] to predict the worst possible structures of L and U factors without knowing the actual numerical values, then we develop a 2D L/U supernode partitioning technique to identify dense structures in both L and U factors, and maximize the use of BLAS-3 level subroutines for these dense structures. We also incorporate a supernode amalgamation [1] , [10] technique to increase the granularity of the computation.
In exploiting irregular parallelism in the redesigned sparse LU algorithm, we have experimented with two mapping methods, one of which uses 1D data mapping and the other uses 2D data mapping. One advantage of using 1D data mapping is that the corresponding LU algorithm can be easily modeled by directed acyclic task graphs (DAGs). Graph scheduling techniques and efficient runtime support are available to schedule and execute DAG parallelism [15] , [16] . Scheduling and executing DAG parallelism is a difficult job because parallelism in sparse problems is irregular and execution must be asynchronous. The important optimizations are overlapping computation with communication, balancing processor loads, and eliminating unnecessary communication overhead. Graph scheduling can do an excellent job in exploiting irregular parallelism, but it leads to extra memory space per node to achieve the best performance. Also, the 1D data mapping can only expose limited parallelism. Due to these restrictions, we have also examined a 2D data mapping method and an asynchronous execution scheme which exploits parallelism under memory constraints. We have implemented our sparse LU algorithms and conducted experiments with a set of nonsymmetric benchmark matrices on Cray-T3D and T3E. Our experiments show that our approach is quite effective in delivering good performance in terms of high megaflop numbers. In particular, the 1D code outperforms the current 2D code when processors have sufficient memory. But, the 2D code has more potential to solve larger problems and produces higher megaflop numbers.
The rest of the paper is organized as follows. Section 2 gives the problem definition. Section 3 describes structure prediction and 2D L/U supernode partitioning for sparse LU factorization. Section 4 describes program partitioning and data mapping schemes. Section 5 addresses the asynchronous computation scheduling and execution. Section 6 presents the experimental results. Section 7 concludes the paper. Fig. 1 shows how a nonsingular n × n matrix A can be factored into two matrices, L and U, using GEPP. The elimination steps are controlled by loop index k. For elements manipulated at Step k, we use i for row indexing and j for column indexing. This convention will be used through the rest of this paper. During each step of the elimination process, a row interchange may be needed to maintain numerical stability. The result of LU factorization process can be expressed by PA = LU, where L is a unit lower triangular matrix, U is a upper triangular matrix, and P is a permutation matrix which contains the row interchange information. The solution of a linear system Ax = b can hence be solved by two triangular solvers, Ly = Pb and Ux = y. The triangular solvers are much less time consuming than the Gaussian elimination process.
PRELIMINARIES
Caching behavior plays an important role in achieving good performance for scientific computations. To better exploit memory hierarchy in modern architectures, supernode partitioning is an important technique to exploit the regularity of sparse matrix computations and utilize BLAS routines to speed up the computation. It has been successfully applied to Cholesky factorization [26] , [30] , [31] . The difficulty for the nonsymmetric factorization is that supernode structure depends on pivoting choices during the factorization and, thus, cannot be determined in advance. SuperLU performs symbolic factorization and identifies supernodes on the fly. It also maximizes the use of BLAS-2 level operations to improve the caching performance of sparse LU. However, it is challenging to parallelize SuperLU on distributed memory machines. Using the precise pivoting information at each elimination step can certainly optimize data space usage, reduce communication, and improve load balance, but such benefits could be offset by high run-time control and communication overhead.
The strategy of static data structure prediction in [20] is valuable in avoiding dynamic symbolic factorization, identifying the maximum data dependence patterns, and minimizing dynamic control overhead. We will use this static strategy in our S* approach. But the overestimation does introduce extra fill-ins and lead to a substantial amount of unnecessary operations in the numerical factorization. We observe that, in SuperLU [7] , the DGEMV routine (the BLAS-2 level dense matrix vector multiplication) accounts for 78 percent to 98 percent of the floating point operations (excluding the symbolic factorization part). It is also a fact that BLAS-3 routine DGEMM (matrix-matrix multiplication) is usually much faster than BLAS-1 and BLAS-2 routines [6] . On Cray-T3D with a matrix of size 25 ¥ 25, DGEMM can achieve 103 MFLOPS, while DGEMV only reaches 85 MFLOPS. Thus, the key idea of our approach is that, if we could find a way to maximize the use of DGEMM after using static symbolic factorization, even with overestimated nonzeros and extra numerical operations, the overall code performance could still be competitive to SuperLU, which mainly uses DGEMV.
STORAGE PREDICTION AND DENSE STRUCTURE IDENTIFICATION

Storage Prediction
The purpose of symbolic factorization is to obtain structures of L and U factors. Since pivoting sequences are not known until the numerical factorization, the only way to allocate enough storage space for the fill-ins generated in the numerical factorization phase is to overestimate. Given a sparse matrix A with a zero-free diagonal, a simple solution is to use the Cholesky factor L c of A T A. It has been
shown that the structure of L c can be used as an upper bound for the structures of L and U factors regardless of the choice of the pivot row at each step [20] . But it turns out that this bound is not very tight. It often substantially overestimates the structures of the L and U factors (refer to Table 1 ). Instead, we consider another method from [20] . The basic idea is to statically consider all possible pivoting choices at each step. The space is allocated for all the possible nonzeros that would be introduced by any pivoting sequence that could occur during the numerical factorization. We summarize the symbolic factorization method briefly as follows.
The nonzero structure of a row is defined as a set of column indices at which nonzeros or fill-ins are present in the given n ¥ n matrix A. Since the nonzero pattern of each row will change as the factorization proceeds, we use 5 k i to denote the structure of row i after Step k of the factorization and A k to denote the structure of the matrix A after Step k.
And, a ij k denotes the element a ij in A k . Notice that the structures of each row or the whole matrix cover the structures of both L and U factors. In addition, during the process of symbolic factorization, we assume that no exact numerical cancelation occurs. Thus, we have
We also define the set of candidate pivot rows at
Step k as follows:
is structurally nonzero 1 
> C
We assume that a kk is always a nonzero. For any nonsingular matrix which does not have a zero-free diagonal, it is always possible to permute the rows of the matrix so that the permuted matrix has a zero-free diagonal [11] . Though the symbolic factorization does work on a matrix that contains zero entries in the diagonal, it is not preferable because it makes the overestimation too generous. The symbolic factorization process will iterate n steps and at Step k, for each row i OE 3 k , its structure will be updated as: 
Essentially, the structure of each candidate pivot row at
Step k will be replaced by the union of the structures of all the candidate pivot rows, except those column indices less than k. In this way, it is guaranteed that the resulting structure A n will be able to accommodate the fill-ins introduced by any possible pivot sequence. A simple example in Fig. 2 demonstrates the whole process. This symbolic factorization is applied after an ordering is performed on the matrix A to reduce fill-ins. The ordering we are currently using is the multiple minimum degree ordering for A T A. We also permute the rows of the matrix using a transversal obtained from Duff's algorithm [11] to make A have a zero-free diagonal. The transversal can often help reduce fill-ins [12] . We have tested the storage impact of overestimation for a number of nonsymmetric testing matrices from various sources. The results are listed in Table 1 . The fourth column in the table is the original number of nonzeros, and the fifth column measures the symmetry of the structure of the original matrix. The bigger the symmetry number is, the more nonsymmetric the original matrix is. A unit symmetry number indicates a matrix is symmetric, but all matrices have nonsymmetric numerical values. We have compared the number of nonzeros obtained by the static approach and the number of nonzeros obtained by SuperLU, as well as that of the Cholesky factor of A T A, for these matrices. The results in Table 1 show that the overestimation usually leads to less than 50 percent extra nonzeros than SuperLU scheme does. Extra nonzeros do imply additional computational cost. For example, one has to either check if a symbolic nonzero is an actual nonzero during a numerical factorization, or directly perform arithmetic operations which could be unnecessary. If we can aggregate these element-wise floating point operations and maximize the use of BLAS-3 subroutines, the sequential code performance will still be competitive. Even the last column of Table 1 shows that the floating operations from the overestimating approach can be as high as five times, the results in Section 6 will show that actual ratios of running times are much less. Thus, it is necessary and beneficial to identify dense structures in a sparse matrix after the static symbolic factorization. It should be noted that there are some cases where static symbolic factorization leads to excessive overestimation. For example, matrix memplus [7] is such a case. The static scheme produces 119 times as many nonzeros as SuperLU does. In fact, for this case, the ordering for SuperLU is applied based on A T + A instead of A T A. Otherwise, the overestimation ratio is 2.34 if using A T A for SuperLU also. For another matrix wang3 [7] , the static scheme produces four times as many nonzeros as SuperLU does. But our code can still produce one GFLOPS for it on 128 nodes of T3E. This paper focuses on the development of a high performance parallel code when overestimation ratios are not too high for a given ordering. Future work is to study ordering strategies that minimize overestimation ratios.
2D L/U Supernode Partitioning and Dense Structure Identification
Supernode partitioning is a commonly used technique to improve the caching performance of sparse code [2] . For a symmetric sparse matrix, a supernode is defined as a group of consecutive columns that have nested structure in the L factor of the matrix. Excellent performance has been achieved in [26] , [30] , [31] using supernode partitioning for Cholesky factorization. However, the above definition is not directly applicable to sparse LU with nonsymmetric matrices. A good analysis for defining unsymmetric supernodes in an L factor is available in [7] . Notice that supernodes may need to be further broken into smaller ones to fit into cache and to expose more parallelism. For the SuperLU approach, after L supernode partitioning, there are no regular dense structures in a U factor that could make it possible to use BLAS-3 routines (see Fig. 3a ). However, in the S* approach, there are dense columns (or subcolumns) in a U factor that we can identify after the static symbolic factorization (see Fig. 3b ). The U partitioning strategy is explained as follows: After an L supernode partition has been obtained on a sparse matrix A, i.e., a set of column blocks with possible different block sizes, the same partition is applied to the rows of the matrix to further break each supernode panel into submatrices. Now, each offdiagonal submatrix in the L part is either a dense block or contains dense blocks. Furthermore, the following theorem identifies dense structure patterns in U factors. This is the key to maximizing the use of BLAS-3 subroutines in our algorithm.
In the following theorem, we show that the 2D L/U partitioning strategy is successful and there is a rich set of dense structures to exploit. The following notations will be used through the rest of the paper:
• The L and U partitioning divides the columns of A into N column blocks and the rows of A into N row blocks so that the whole matrix is divided into N ¥ N submatrices. For submatrices in the U factor, we denote them as
denotes the diagonal submatrix. We use A ij to denote a submatrix when it is not necessary to distinguish between L and U factors.
• Define S(i) as the starting column (or row) number of the ith column (or row) block. For convenience, we define S(N + 1) = n + 1.
• A subcolumn (or subrow) is a column (or row) in a submatrix. For simplicity, we use a global column (or row) index to denote a subcolumn (or subrow) in a submatrix. For example, by subcolumn k in the submatrix block U ij , it means the subcolumn in this submatrix with the global column index k where S(j) £ k < S(j + 1). Similarly, we use a ij to indicate an individual nonzero element based on global indices. A compound structure in L or U is a submatrix, a subcolumn, or a subrow.
• A compound structure is nonzero if it contains at least one nonzero element or fill-in. We use A ij π 0 to indicate that block A ij is nonzero. Notice that an algorithm only needs to operate on nonzero compound structures. A compound structure is structurally dense if all of its elements are nonzeros or fill-ins. In the following, we will not differentiate between nonzero and fill-in entries. They are all considered as nonzero elements. PROOF. Recall that 3 k is the set of candidate pivot rows at symbolic factorization step k. Given a supernode spanning from column k to k + s, from its definition and the fact that after step k the static symbolic factorization will only affect the nonzero patterns in submatrix a k+1:n, k+1:n , and A has a zero-free diagonal, we have
Notice at each step k, the final structures of row i (i OE 3 k ) are updated by the symbolic factorization procedure as
For the structure of a row i, where k £ i £ k + s, we are only interested in nonzero patterns of the U part (excluding the part belonging to L kk ). We call this partial structure 85 i . Thus, for i OE 3 k :
85 85
It can be seen that after the kth step updating, 85 85
Knowing that the structure of row k is unchanged after Step k, we only need to prove that 85 85 85
, as shown below. Then, we can infer that the nonzero structures of rows from k to k + s are the same and subcolumns at the U part are either structurally dense or zero. Now, since 3 k … 3 k+1 , and 3 k+1 π f, it is clear that: The above theorem shows that the L/U partitioning can generate a rich set of structurally dense subcolumns or even structurally dense submatrices in a U factor. We also further incorporate this result with supernode amalgamation in Section 3.3, and our experiments indicate that more than 64 percent of numerical updates is performed by the BLAS-3 routine DGEMM in S*, which shows the effectiveness of the L/U partitioning method. Fig. 4 demonstrates the result of a supernode partitioning on a 7 ¥ 7 sample sparse matrix. One can see that all the submatrices in the upper triangular part of the matrix only contain structurally dense subcolumns.
Based on the above theorem, we can further show a structural relationship between two submatrices in the same supernode column block, which will be useful in implementing our algorithm to detect nonzero structures efficiently for numerical updating. ¢ . This will lead to a nonzero element in the subcolumn k in U i j ¢ , because the subcolumn k of U ij is structurally dense. According to Theorem 1, subcolumn k in U i j ¢ is structurally dense.
COROLLARY 2. Given two nonzero submatrices U ij , U i j ¢ , where i < i¢ < j and L i i
¢ is nonzero. If U ij is structurally dense, U i j ¢ must be structurally dense. PROOF. That is straightforward using Corollary 1.
Supernode Amalgamation
For most tested sparse matrices, the average size of a supernode after L/U partitioning is very small, about 1.5 to two columns. This results in very fine-grained tasks. Amalgamating small supernodes can lead to great performance improvement for both parallel and sequential sparse codes because it can improve caching performance and reduce interprocessor communication overhead.
There could be many ways to amalgamate supernodes [7] , [30] . The basic idea is to relax the restriction that all the columns in a supernode must have exactly the same nonzero structure below diagonal. The amalgamation is usually guided by a supernode elimination tree. A parent could be merged with its children if the merging does not introduce too many extra zero entries into a supernode. Row and column permutations are needed if the parent is not consecutive with its children. However, a column permutation introduced by the above amalgamation method could undermine the correctness of the static symbolic factorization. We have used a simpler approach that does not require any permutation. This approach only amalgamates consecutive supernodes if their nonzero structures only differ by a small number of entries and it can be performed in a very efficient manner, which only has a time complexity of O(n) [27] . We can control the maximum allowed differences by an amalgamation factor r. Our experiments show that when r is in the range of four to six, it gives the best performance for the tested matrices and leads to 10-55 percent improvement on the execution times of the sequential code. The reason is that, by getting bigger supernodes, we are getting larger dense structures, although there may be a few zero entries in them, and we are taking more advantage of BLAS-3 kernels. Notice that after applying the supernode amalgamation, the dense structures identified in the Theorem 1 are not strictly dense any more. We call them almost-dense structures and can still use the result of Theorem 1 with a minor revision. That is summarized in the following corollary. All the results presented in Section 6 are obtained using this amalgamation strategy.
COROLLARY 3. Given a sparse matrix A, if supernode amalgamation is applied to A after the static symbolic factorization and 2D L/U supernode partitioning are performed on A, each nonzero submatrix in the U factor of A contains only almost-structurally-dense subcolumns.
PROGRAM PARTITIONING, TASK DEPENDENCE, AND PROCESSOR MAPPING
After dividing a sparse matrix A into submatrices using the L/U supernode partitioning, we need to partition the LU code accordingly and define coarse grained tasks that manipulate on partitioned dense data structures.
Program Partitioning
Column block partitioning follows supernode structures. Typically, there are two types of tasks. One is Factor(k), which is to factorize all the columns in the kth column block, including finding the pivoting sequence associated with those columns. The other is Update(k, j), which is to apply the pivoting sequence derived from Factor(k) to the jth column block, and modify the jth column block using the kth column block, where k < j and U kj π 0. Instead of performing the row interchange to the right part of the matrix right after each pivoting search, a technique called "delayed-pivoting" is used [6] . In this technique, the pivoting sequence is held until the factorization of the kth column block is completed. Then, the pivoting sequence is applied to the rest of the matrix, i.e., interchange rows. Delayed-pivoting is important, especially to the parallel algorithm, because it is equivalent to aggregating multiple small messages into a larger one. Here, the owner of the kth column block sends the column block packed together with the pivoting information to other processors. An outline of the partitioned sparse LU factorization algorithm with partial pivoting is described in Fig. 6 . The code of Factor(k) is summarized in Fig. 7 . It uses BLAS-1 and BLAS-2 subroutines. The computational cost of the numerical factorization is mainly dominated by Update() tasks. The function of task Update(k, j) is presented in Fig. 8 . The lines (05) DQG (12) are using dense matrix multiplications.
We use directed acyclic task graphs (DAGs) to model irregular parallelism arising in this partitioned sparse LU program. The DAGs are constructed statically before numerical factorization. Previous work on exploiting task parallelism for sparse Cholesky factorization has used elimination trees (e.g., [28] , [30] ), which is a good way to expose the available parallelism because pivoting is not required. For sparse LU, an elimination tree of A T A does not directly reflect the available parallelism. Dynamically created DAGs have been used for modeling parallelism and guiding run-time execution in a nonsymmetric multifrontal method [5] , [25] .
Given the task definitions in Figs. 6, 7, and 8, we can define the structure of a sparse LU task graph in the following.
These four properties are necessary.
• There are N tasks Factor(k), where
For a dense matrix, there will be a total of N(N -1)/2 updating tasks.
• There is a dependence edge from Factor(k) to task Update(k, j), where k < j and U kj π 0.
• There is a dependence from Update(k, k¢) to Factor(k¢), where k < k¢, U kk¢ π 0 and there exists no task Update(t, k¢) such that k < t < k¢.
As described below, we add one more property that, while not necessary, simplifies implementation. This property essentially does not allow exploiting commutativity among Update() tasks. However, according to our experience with Cholesky factorization [16] , the performance loss due to this property is not substantial, about 6 percent on average when graph scheduling is used.
• There is a dependence from Update(k, j) to Update(k¢, j),
where k < k¢ and there exists no task Update(t, j) such that k < t < k¢. Fig. 9a shows the nonzero pattern of the partitioned matrix shown in Fig. 4. Fig. 9b is the corresponding task dependence graph.
1D Data Mapping
In the 1D data mapping, all submatrices, from both L and U part, of the same column block will reside in the same processor. Column blocks are mapped to processors in a cyclic manner or based on other scheduling techniques such as graph scheduling. Tasks are assigned based on owner-compute rule, i.e., tasks that modify the same column block are assigned to the same processor that owns the column block.
One disadvantage of this mapping is that it serializes the computation in a single Factor(k) or Update(k, j) task. In other words, a single Factor(k) or Update(k, j) task will be performed by one processor. But this mapping strategy has an advantage that both pivot searching and subrow interchange can be done locally, without any communication. Another advantage is that parallelism modeled by the above dependence structure can be effectively exploited using graph scheduling techniques.
2D Data Mapping
In the literature, 2D mapping has been shown more scalable than 1D for sparse Cholesky [30] , [31] . However, there are several difficulties to apply the 2D blockoriented mapping to the case of sparse LU factorization; even the static structure is predicted. First, pivoting operations and row interchanges require frequent and wellsynchronized interprocessor communication when submatrices in the same column block are assigned to different processors. Effective exploitation of limited irregular parallelism in the 2D case requires a highly efficient asynchronous execution mechanism and a delicate message buffer management. Second, it is difficult to utilize (
Find the pivoting row t in column m; (4) Interchange row t and row m of the column block k; (5) Scale column m and update rest of columns in this column block; (6) endfor mod , mod . The 2D data mapping is considered more scalable than 1D data mapping because it enables parallelization of a single Factor(k) or Update(k, j) task on p r processors. In the next section, we will discuss how 2D parallelism is exploited using asynchronous schedule execution.
PARALLELISM EXPLOITATION
Scheduling and Run-Time Support for 1D Methods
We discuss how 1D sparse LU tasks are scheduled and executed so that parallel time can be minimized. George and Ng [21] used a dynamic load balancing algorithm on a shared memory machine. For distributed memory machines, dynamic and adaptive load balancing works well for problems with very coarse grained computations, but it is still an open problem to balance the benefits of dynamic scheduling with the run-time control overhead, since task and data migration cost is too expensive for sparse problems with mixed granularities. We use task dependence graphs to guide scheduling and have investigated two types of scheduling schemes.
• Compute-ahead scheduling (CA). This is to use block-cyclic mapping of tasks with a compute-ahead execution strategy, which is demonstrated in Fig. 10 . This idea has been used to speed up parallel dense factorizations [23] . It executes the numerical factorization layer by layer based on the current submatrix index. The parallelism is exploited for concurrent updating. In order to overlap computation with communication, the Factor(k + 1) task is executed as soon as Factor(k) and Update(k, k + 1) (if exists) finish so that the pivoting sequence and column block k + 1 for the next layer can be communicated as early as possible.
• Graph scheduling. We order task execution within each processor using the graph scheduling algorithms in [36] . The basic optimizations are balancing processor loads and overlapping computation with communication to hide communication latency. These are done by utilizing global dependence structures and critical path information.
Graph scheduling has been shown effective in exploiting irregular parallelism for other applications (e.g., [15] , [16] ). Graph scheduling should outperform the CA scheduling for sparse LU because it does not have a constraint in ordering Factor() tasks. We demonstrate this point using the LU task graph in Fig. 9 . For this example, the Gantt charts of the CA schedule and the schedule derived by our graph scheduling algorithm are listed in Fig. 11 . It is assumed that each task has a computation weight two and each edge has communication weight one. It is easy to see that our scheduling approach produces a better result than the CA schedule. If we look at the CA schedule carefully, we can see that the reason is that CA can look ahead only one step so that the execution of task Factor(3) is placed after Update (1, 5) . On the other hand, the graph scheduling algorithm detects that Factor(3) can be executed before Update (1, 5) , which leads to better overlap of communication with computation. Interchange rows according to the pivoting sequence; (09) Perform task Update(k, k + 1); (10) Perform task Factor(k + 1); (11) Broadcast column block k + 1 and the pivoting sequence;
(12) for j = k + 2 to N with U kj π 0 (13) if column block j is local (14) if column block k has not been received (15) Receive column block k and the pivoting choices; (16) Interchange rows according to the pivoting sequence; (17) Perform task Update(k, j); (18) endfor (19) endfor Fig. 10 . The 1D code using compute-ahead schedule.
However, the implementation of the CA algorithm is much easier, since the efficient execution of a sparse task graph schedule requires a sophisticated run-time system to support asynchronous communication protocols. We have used the RAPID run-time system [16] for the parallelization of sparse LU using graph scheduling. The key optimization is to use Remote Memory Access(RMA) to communicate a data object between two processors. It does not incur any copying/buffering during a data transfer, since low communication overhead is critical for sparse code with mixed granularities. RMA is available in modern multiprocessor architectures such as Cray-T3D [34] , T3E [32] , and Meiko CS-2 [15] . Since the RMA directly writes data to a remote address, it is possible that the content at the remote address is still being used by other tasks and, then, the execution at the remote processor could be incorrect. Thus, for a general computation, a permission to write the remote address needs to be obtained before issuing a remote write. However, in the RAPID system, this hand-shaking process is avoided by a carefully designed task communication protocol [16] . This property greatly reduces task synchronization cost. As shown in [17] , the RAPID sparse code can deliver more than 70 percent of the speedup predicted by the scheduler on Cray-T3D. In addition, using RAPID system greatly reduces the amount of implementation work to parallelize sparse LU.
Asynchronous Execution for the 2D Code
As we discussed previously, 1D data mapping cannot expose parallelism to a maximum extent. Another issue is that a time-efficient schedule may not be space-efficient. Specifically, to support concurrency among multiple updating stages in both RAPID and CA code, multiple buffers are needed to keep pivoting column blocks of different stages on each processor. Therefore, for a given problem, the per processor space complexity of the 1D codes could be as high as O(S 1 ), where S 1 is the space complexity for a sequential algorithm. For sparse LU, each processor in the worst case may need a space for holding the entire matrix. The RAPID system [16] also needs extra memory space to hold dependence structures.
Based on the above observation, our goal for the 2D code is to reduce memory space requirement while exploiting a reasonable amount of parallelism so that it can solve large problem instances in an efficient way. In this section, we present an asynchronous 2D algorithm which can substantially overlap multistages of updating but its memory requirement is much smaller than that of 1D methods. Fig. 12 shows the main control of the algorithm in an SPMD coding style. Fig. 13 shows the SPMD code for Factor(k), which is executed by processors of column k mod p c . Recall that the algorithm uses 2D block-cyclic data mapping and the coordinates for the processor that owns submatrix A ij are (i mod p r , j mod p c ). Also, we divide the function of Update() (in Fig. 8 ) into two parts: ScaleSwap(), which does scaling and delayed row interchange for submatrix
+1 , as shown in Fig. 14; Update_2D( ), which does submatrix updating as shown in Fig. 15 . In all figures, the statements which involve interprocessor communication are marked with *.
It can be seen that the computation flow of this 2D code is still controlled by the pivoting tasks Factor(k). The order of execution for Factor(k), k = 1, 2, , N is sequential, but Update_2D() tasks, where most of the computation comes from, can execute in parallel among all processors. The asynchronous parallelism comes from two levels. First, a single stage of tasks Update_2D(k, k + 1 : N) can be executed concurrently on all processors. In addition, different stages of Update_2D() tasks from Update_2D(k, k + 1 : N) and Update_2D(k¢, k¢ + 1 : N), where k π k¢, can also be overlapped. The idea of computeahead scheduling is also incorporated, i.e., Factor(k + 1) is executed as soon as Update_2D(k, k + 1) finishes. Some detailed explanation for pivoting, scaling and swapping is given below. In line (5) (p r -1, p c ) .
PROOF. We will use the following facts in proving the theorem:
• When p c = 1, it is trivial based on Fact 1. When p c > 1, we can imagine a scenario in which all processors in column 0 have just finished task Factor(k), and some of them are still working on Update_2D(k -1, *) tasks, and processors in column 1 could go ahead and execute Update_2D(k, *) tasks. After processors in column 1 finish Update_2D(k, k + 1) task, they will execute Factor(k + 1). Then, after finishing Update_2D(k, *) tasks, processors in column 2 could execute Update_2D(k + 1, k + 2) task and start Factor(k + 2) task, and so on. Finally, processors in column p c -1 could execute Factor(k + p c -1) and, then, start Update_2D(k + p c -1, *). At this moment, processors in column 0 may be still working on Update_2D(k -1, *) . Thus, the overlapping degree is p c . Now we will show by contradiction that the maximum overlapping degree is p c . Assume that, at some moment, there exist two updating stages being executed concurrently: Update_2D(k, *) and Update_2D(k¢, *), where k¢ > k + p c . Then, Factor(k¢) must have been completed. Without loss of generality, assuming that processors in column 0 execute Factor(k¢), then, according to Fact 1, all Update_2D(j, *) tasks, where j < k¢ -1 should be completed before this moment. Since block cyclic mapping is used, it is easy to see each processor column has performed one of the Factor(j) tasks, where k¢ -p c + 2 £ j £ k¢. Then, Update_2D(k¢ -p c + 1) should be completed on all processors. Then, for any concurrent stage Update_2D(k, *), k must satisfy k ≥ k¢ -p c , which is a contradiction. (
Update A ij using L ik and U kj ; (5) endfor Part 2. First, we show that overlapping degree min(p r -1, p c ) can be achieved within a processor column. For convenience of illustration, we consider a scenario in which all delayed row interchanges in ScaleSwap() take place locally without any communication within a processor column. Therefore, there is no interprocessor synchronization going on within a processor column except in Factor() tasks. Assuming p r -1 < p c , we can imagine at some moment, processors in column 0 have completed Factor(s), and P 0, 0 has just finished ScaleSwap(s), and starts executing
Update_2D(s,
*), where s mod p r = 0 and s mod p c = 0. Then, processors in column 1 will execute Update_2D(s, s + 1) and, then, Factor(s + 1), after which P 1,0 can start ScaleSwap(s + 1) and, then, Update_2D(s + 1, *). Following this reasoning, after Update_2D(s + p r -2, s + p r -1) and Factor(s + p r -1) have been finished on processors of column p r -1, P p r -1,0 could complete previous Update_2D() tasks and ScaleSwap(s + p r -1
), and start
Update_2D(s + p r -1, *). Now, P 0, 0 may be still working on Update_2D(s, *). Thus, the overlapping degree is p r -1. If p r ≥ p c + 1, obviously the above reasoning will stop when processors of column p c -1 finish Update_2D(s + p c -2, s + p c -1) and Factor(s + p c -1). In that case, when P p c -1,0 is to start Update_2D(s + p c -1, *), the processor P p r -1,0 could still be working on Update_2D(s -1, *) because of the compute ahead scheduling. Hence, the overlapping degree is p c . Now we need to show that the upper bound of overlapping degree within a processor column is min (p r -1, p c ) . We have already shown, in the proof of Part 1, that the overall overlapping degree is less than p c , so it is the overlapping degree within a processor column. To prove it is also less than p r -1, we can use a similar proof as that in Part 1, except using ScaleSwap(k) to replace Factor(k), and using Fact 2 instead of Fact 1.
Knowing degree of overlapping is important in determining the amount of memory space needed to accommodate those communication buffers on each processor for supporting asynchronous execution. Buffer space is additional to data space needed to distribute the original matrix. There are four types of communication that needs buffering: 1) Pivoting along a processor column (lines (05), (07), and (08) in Fig. 13 ), which includes communicating pivot positions and multicasting pivot rows. We call the buffer for this purpose Pbuffer. 2) Multicasting along a processor row (lines (12), (13) , and (14) in Fig. 13 ). The communicated data includes L k k , local nonzero blocks in L k+1:N,k , and pivoting sequences. We call the buffer for this purpose Cbuffer. 3) Row interchange within a processor column (line (05) in Fig. 14) . We call this buffer Ibuffer. Fig. 14) . The data includes local nonzero blocks of a row panel. We call the buffer Rbuffer.
4) Multicasting along a processor column (line (10) in
Here, we assume that p r £ p c + 1, because, based on our experimental results, setting p r £ p c + 1 always leads to better performance. Thus, the overlapping degree of Update_2D() tasks within a processor row is, at most, p c , and the overlapping degree within a processor column is, at most, p r -1. Then, we need p c to separate Cbuffer's for overlapping among different columns and p r -1 to separate Rbuffer's for overlapping among different rows.
We estimate the size of each Cbuffer and Rbuffer as follows: Assuming that the sparsity ratio of a given matrix is s after fill-in and the maximum block size is BSIZE, each Cbuffer is of size:
Similarly, each Rbuffer is of size:
We ignore the buffer size for Pbuffer and Ibuffer because they are very small (the size of Pbuffer is only about BSIZE ◊ BSIZE and the size of Ibuffer is about s ◊ n/p c ). Thus, the total buffer space needed on each processor for the asynchronous execution is:
Notice that the sequential space complexity S 1 = n 2 s. In practice, we set p c /p r = 2. Therefore, the buffer space complexity for each processor is 2.5 ◊ n ◊ BSIZE ◊ s, or 2 5 1
S , which is very small for a large matrix. For all the benchmark matrices we have tested, the buffer space is less than 100K words. Given a sparse matrix, if the matrix data is evenly distributed onto p processors, the total memory requirement per processor is S 1 /p + O(1) considering n @ p and n @ BSIZE.
This leads us to conclude that the 2D asynchronous algorithm is very space scalable.
EXPERIMENTAL STUDIES
Our experiments were originally conducted on a Cray-T3D distributed memory machine at San Diego Supercomputing Center. Each node of the T3D includes a DEC Alpha EV4(21064) processor with 64 Mbytes of memory. The size of the internal cache is 8 Kbytes per processor. The BLAS-3 matrix-matrix multiplication routine DGEMM can achieve 103 MFLOPS, and the BLAS-2 matrix-vector multiplication routine DGEMV can reach 85 MFLOPS. These numbers are obtained assuming all the data is in cache and using cache read-ahead optimization on T3D, and the matrix block size is chosen as 25. The communication network of the T3D is a 3D torus. Cray provides a shared memory access library called shmem, which can achieve 126 Mbytes/s bandwidth and 2.7ms communication overhead using shmem_put() primitive [34] . We have used shmem_put() for the communications in all the implementations.
We have also conducted experiments on a newly acquired Cray-T3E at the San Diego Supercomputing Center.
Each T3E node has a clock rate of 300 MHZ, an 8 Kbytes internal cache, 96 Kbytes second level cache, and 128 Mbytes main memory. The peak bandwidth between nodes is reported as 500 Mbytes/s and the peak round trip communication latency is about 0.5 to 2 ms [33] . We have observed that, when block size is 25, DGEMM achieves 388 MFLOPS, while DGEMV reaches 255 MFLOPS. We have used block size 25 in our experiments, since, if the block size is too large, the available parallelism will be reduced. In this section, we mainly report results on T3E. In some occasions where the absolute performance is concerned, we also list the results on T3D to see how our approach scales when the underlining architecture is upgraded. All the results are obtained on T3E unless explicitly stated.
In calculating the MFLOPS achieved by our parallel algorithms, we do not include extra floating point operations introduced by the overestimation. We use the following formula:
Achieved MFLOPS Operation count obtained from SuperLU Parallel time of our algorithm on T3D or T3E . =
The operation count for a matrix is reported by running SuperLU code on a SUN workstation with large memory, since SuperLU code cannot run for some large matrices on a single T3D or T3E node due to memory constraint. We also compare the S* sequential code with SuperLU to make sure that the code using static symbolic factorization is not too slow and will not prevent the parallel version from delivering high megaflops.
Impact of Static Symbolic Factorization on Sequential Performance
We study whether the introduction of extra nonzero elements by the static factorization substantially affects the time complexity of numerical factorization. We compare the performance of the S* sequential code with SuperLU code performance in Table 2 1 for those matrices from Table 1 that can be executed on a single T3D or T3E node. We also 1 . The times for S* in this table do not include symbolic preprocessing cost, while the times for SuperLU include symbolic factorization because SuperLU does it on the fly. Our implementation for static symbolic preprocessing is very efficient. For example, the preprocessing time is only about 2.76 seconds on a single node of T3E for the largest matrix we tested (vavasis3).
introduce two other matrices to show how well the method works for larger matrices and denser matrices. One of the two matrices is b33_5600, which is truncated from BCSSTK33 because the current sequential implementation is not able to handle the entire matrix due to memory constraint, and the other one is dense1000.
Though the static symbolic factorization introduces a lot of extra computation, as shown in Table 1 , the performance of S* after 2D L/U partitioning is consistently competitive to that of highly optimized SuperLU. The absolute single node performance that has been achieved by the S* approach on both T3D and T3E is consistently in the range of 5-10 percent of the highest DGEMM performance for those matrices of small or medium sizes. Considering the fact that sparse codes usually suffer poor cache reuse, this performance is reasonable. In addition, the amount of computation for the testing matrices in Table 2 is small, ranging from 18 to 107 million double precision floating operations. Since the characteristic of the S* approach is to explore more dense structures and utilize BLAS-3 kernels, better performance is expected on larger or denser matrices. This is verified on a matrix b33_5600. For even larger matrices, such as vavasis3, we cannot run S* on one node, but as shown later, the 2D code can achieve 32.8 MFLOPS per node on 16 T3D processors. Notice that the megaflops performance per node for sparse Choleksy reported in [24] on 16 T3D nodes is around 40 MFLOPS, which is also a good indication that S* single-node performance is satisfactory.
We present a quantitative analysis to explain why S* can be competitive to SuperLU. Assume the speed of BLAS-2 kernel is w 2 second/flop and the speed of BLAS-3 kernel is w 3 second/flop. The total amount of numerical updates is & flops for SuperLU and & ¢ flops for the S*. Apparently, &¢ ≥ &. For simplicity, we ignore the computation from the scaling part within each column because it contributes very little to the total execution time. Hence, we have:
and
where T symbolic is the time spent on dynamic symbolic factorization in the SuperLU approach, r is the percentage of the T3D  T3E  T3D  T3E  T3D  T3E  T3D  T3E  T3D  T3E numerical updates that are performed by DGEMM in S*. Let h be the ratio of symbolic factorization time to numerical factorization time in SuperLU, then we simplify (1) to the following:
We estimate that h < 0.82 for the tested matrices, based on the results in [7] . In [17] , we have also measured r as approximately r < 0.67. The ratios of the number of floating point operations performed in S* and SuperLU for the tested matrices are available in Table 1 . On average, the value of
is 3.98. We plug these typical parameters into (2) and (3), and we have: . . These estimations are close to the ratios obtained in Table 2 . The discrepancy is caused by the fact that the submatrix sizes of supernodes are nonuniform, which leads to different caching performance. If submatrices are of uniform sizes, we expect our prediction is more accurate. For instance, in the dense case, Table 2 .
The above analysis shows that using BLAS-3 as much as possible makes S* competitive to SuperLU. Suppose in a machine that DGEMM outperforms DGEMV substantially and the ratio of the computation that is performed by DGEMM is high enough, S* could be faster than SuperLU for some matrices. The last two entries in Table 2 have already shown this.
Parallel Performance of 1D Codes
In this subsection, we report a set of experiments conducted to examine the overall parallel performance of 1D codes, the effectiveness of scheduling and supernode amalgamation.
Overall Performance
We list the MFLOPS numbers of the 1D RAPID code obtained on various number of processors for several testing matrices in Table 3 (A "-" entry implies the data is not available due memory constraint, same below). We know that the megaflops of DGEMM on T3E is about 3.7 times as large as that on T3D, and the RAPID code after using a upgraded machine is speeded up about three times on average, which is satisfactory. For the same machine, the performance of the RAPID code increases when the number of processors increases and speedups compared to the pure S* sequential code (if applicable) can reach up to 17.7 on 64 T3D nodes and 24.1 on 64 T3E nodes. From 32 to 64 nodes, the performance gain is small except for matrices goodwin, e40r0100, and b33_5600, which are much larger problems than the rest. The reason is that those small tested matrices do not have enough computation and parallelism to saturate a large number of processors when the elimination process proceeds toward the end. It is our belief that better and more scalable performance can be obtained on larger matrices. But, currently, the available memory on each node of T3D or T3E limits the problem size that can be solved with the current version of the RAPID code.
Effectiveness of Graph Scheduling
We compare the performance of 1D CA code with 1D RAPID code in Fig. 16 . The Y-axis is 1-PT RAPID /PT CA , where PT stands for parallel time. For two and four processors, in certain cases, the compute-ahead code is slightly faster than the RAPID code. But for the number of processors more than four, the RAPID code runs 10-72 percent faster. The more processors involved, the bigger the performance gap tends to be. The reason is that for a small number of processors, there are sufficient tasks making all processors busy and the compute-ahead schedule performs well, while the RAPID code suffers a certain degree of system overhead. For a larger number of processors, schedule optimization becomes important since there is limited parallelism to exploit.
Effectiveness of Supernode Amalgamation
We have examined how effective our supernode amalgamation strategy is using the 1D RAPID code. Let PT a and PT be the parallel time with and without supernode amalgamation, respectively. The parallel time improvement ratios (1 -PT a /PT) on T3E for several testing matrices are T3D  T3E  T3D  T3E  T3D  T3E  T3D  T3E  T3D  T3E  T3D  T3E   sherman5 Table 4 and similar results on T3D are in [17] . Apparently, the supernode amalgamation has brought significant improvement due to the increase of supernode size, which implies an increase of the task granularities. This is important to obtaining good parallel performance [22] .
2D Code Performance
As mentioned before, our 2D code exploits more parallelism but maintains a lower space complexity, and has much more potential to solve large problems. We show the absolute performance obtained for some large matrices on T3D in Table 5 .
Since some matrices cannot fit for a small number of processors, we only list results on 16 or more processors. The maximum absolute performance achieved on 64 nodes of T3D is 1.48 GFLOPS, which is translated to 23.1 MFLOPS per node. For 16 nodes, the per-node performance is 32.8 MFLOPS. Table 6 shows the performance numbers on T3E for the 2D code. We have achieved up to 6.878 GFLOPS on 128 nodes. For 64 nodes, megaflops on T3E are from 3.1 to 3.4 times as large as that on T3D. Again, considering that DGEMM megaflops on T3E is about 3.7 times as large as that on T3D, our code performance after using an upgraded machine is good. Notice that 1D codes cannot solve the last six matrices of Table 6 due to memory constraint. For those matrices solvable using both 1D RAPID and 2D codes, we compare the average parallel time differences by computing the ratio: 1 -PT RAPID /PT 2D and the result is in Fig. 17 . The 1D RAPID code achieves better performance because it uses sophisticated graph scheduling technique to guide the mapping of column blocks and ordering of tasks, which results in better overlapping of communication with computation. The performance difference is larger for the matrices listed in the left of Fig. 17 compared to the right of Fig. 17 . We partially explain the reason by analyzing load balance factors of the 1D RAPID code and the 2D code in Fig. 18 . The load balance factor is defined as work total /(P ◊ work max ) [31] . Here, we only count the work from the updating part because it is the major part of the computation. The 2D code has better load balance, which can make up for the impact of lacking of efficient task scheduling. This is verified by Fig. 17 and Fig. 18 . One can see that when the load balance factor of the 2D code is close to that of the RAPID code (e.g., lnsp3937), the performance of the RAPID code is much better than the 2D code; when the load balance factor of the 2D code is significantly better than that of the RAPID code (e.g., jpwh991 and orserg1), the performance differences are smaller.
Synchronous versus Asynchronous 2D Code
Using a global barrier in the 2D code at each elimination step can simplify the implementation, but it cannot overlap computations among different updating stages. We have compared parallel time reductions between the asynchronous code and the synchronous code for some testing matrices in Table 7 . It shows that our asynchronous design for 2D mapping improves performance significantly, especially on large number of processors on T3E. It demonstrates the importance of exploiting parallelism using asynchronous execution. The experiment data on T3D is in [14] .
CONCLUDING REMARKS
In this paper, we present an approach for parallelizing sparse LU factorization with partial pivoting on distributed memory machines. The major contribution of this paper is that we integrate several techniques together, such as static symbolic factorization, scheduling for asynchronous parallelism, 2D L/U supernode partitioning techniques to effectively identify dense structures, and maximize the use of BLAS-3 subroutines in the algorithm design. Using these ideas, we are able to exploit more data regularity for this open irregular problem and achieve up to 6.878 GFLOPS on 128 T3E nodes. This is the highest performance known for this challenging problem; the previous highest performance was 2.583 GFLOPS on shared memory machines [8] . The comparison results show that the 2D code has a better scalability than 1D codes because 2D mapping exposes more parallelism with a carefully designed buffering scheme. But the 1D RAPID code still outperforms the 2D code if there is sufficient memory, since the scheduling and execution techniques for the 2D code are simple, and are not competitive to graph scheduling. Recently, we have conducted research on developing space efficient scheduling algorithms while retaining good time efficiency [18] . It is still an open problem to develop advanced scheduling techniques that better exploit parallelism for 2D sparse LU factorization with partial pivoting. There are other issues which are related to this work and need to be further studied; for example, alternative for parallel sparse LU based on Schur complements [13] and static estimation and parallelism exploitation for sparse QR [29] , [35] .
It should be noted that the static symbolic factorization could fail to be practical if the input matrix has a nearly dense row because it will lead to an almost complete fill-in of the whole matrix. It might be possible to use different matrix reordering to avoid such a case. Fortunately, this is not the case in most of the matrices we have tested. Therefore, our approach is applicable to a wide range of problems using a simple ordering strategy. It will be interesting in the future to study ordering strategies that minimize overestimation ratios so that S* can consistently deliver good performance for various classes of sparse matrices. 
