Although Gaussian elimination with partial pivoting is a robust algorithm to solve unsymmetric sparse linear systems of equations, it is difficult to implement efficiently on parallel machines, because of its dynamic and somewhat unpredictable way of generating work and intermediate results at run time. In this paper, we present an efficient parallel algorithm that overcomes this difficulty. The high performance of our algorithm is achieved through (1) using a graph reduction technique and a supernode-panel computational kernel for high single-processor utilization, and (2) scheduling two types of parallel tasks for a high level of concurrency. One such task is factoring the independent panels on the disjoint subtrees in the column elimination tree of A. Another task is updating a panel by previously computed supernodes. A scheduler assigns tasks to free processors dynamically and facilitates the smooth transition between the two types of parallel tasks. No global synchronization is used in the algorithm. The algorithm is well suited for shared memory machines (SMPs) with a modest number of processors. We demonstrate 4-7 fold speedups on a range of 8-processor SMPs, and more on larger SMPs. One realistic problem arising from a 3-D flow calculation achieves factorization rates of 1.0, 2.5, 0.8 and 0.8 Gigaflops, on the 12-processor Power Challenge, 8-processor Cray C90, 16-processor Cray J90, and 8-processor AlphaServer 8400, respectively.
1 Introduction
In earlier work with Eisenstat and Liu, we described a publicly released sequential software library, SuperLU, to solve unsymmetric sparse linear systems using Gaussian elimination with partial pivoting [5]. This left-looking, blocked algorithm includes symmetric structural reduction for fast symbolic factorization, and supernode-panel updates to achieve better data reuse in cache and floating-point registers.
Computer Science Division, University of California, Berkeley, CA 94720 (demmel@cs.berkeley.edu). The research of Demmel and Li was supported in part by NSF grant ASC-9313958, DOE grant DE-FG03-94ER25219, UT Subcontract No. ORA4466 from ARPA Contract No. DAAL03-91-C0047, DOE grant DE-FG03-94ER25206, and NSF Infrastructure grants CDA-8722788 and CDA-9401156.

In this paper we study an efficient parallel algorithm based on SuperLU. The primary objective of this work is to achieve good efficiency on shared memory systems with a modest number of processors (for example, between 10 and 20). In addition to measuring the efficiency of our parallel algorithm on these machines, we also study a theoretical upper bound on the performance of this algorithm. The efficiency of the algorithm has been demonstrated on several shared memory parallel machines. When compared to the best sequential runtime of SuperLU, the parallel algorithm typically achieves 4- to 7-fold speedups on 8-processor platforms for large sparse matrices.
The rest of the paper is organized as follows. In Section 2 we review the sequential SuperLU algorithm. Section 3 presents the sources and the characteristics of the test matrices. Section 4 presents the parallel machines used in our study. In Section 5 we describe several design choices we have made in parallelization, including how to find parallelism, how to define individual tasks, and how to manage memory for supernodes. Section 6 sketches the high-level parallel scheduling algorithm. In Section 7, we present the parallel performance achieved with the test matrices on a number of platforms; both time and space efficiency are illustrated. We also quantify the sources of overhead in parallelization, and give a thorough analysis of their impact on performance. At the end of that section we establish a PRAM (Parallel Random-Access Machine) model to predict an upper bound on the speedups attainable by the proposed algorithm. Finally, Section 8 draws conclusions and suggests future research.
2 Overview of the sequential algorithm in SuperLU

Figure 1 sketches the supernode-panel factorization algorithm used in SuperLU. A supernode is defined to be a range (r:s) of columns of L with the triangular block just below the diagonal being full, and with the same row structure below this block. We store a supernode as a rectangular block, including the triangle of U in rows and columns r through s; see Figure 2. This allows us to address each supernode as a two-dimensional array in calls to BLAS routines [6, 7], and so get high performance. To increase the average size of supernodes (and hence performance), we merge groups of consecutive columns (usually no more than 4 columns) at the fringe of the column elimination tree (Section 5.1) into relaxed supernodes, regardless of their row structures. A panel is a block of w consecutive columns of the matrix which are updated simultaneously by a supernode using calls to the BLAS. The row structures of the columns in a panel may not be correlated in any fashion, and the boundaries between panels may differ from those between supernodes. Each panel factorization, the outer loop in Figure 1, consists of three distinct steps: (1) the symbolic factorization to determine the nonzero structure, (2) the numerical updates by supernodes, and (3) the factorization of each column in the panel. The pivot selection, detection of the supernode boundary, and symmetric structure reduction (to reduce the cost of later symbolic factorization steps) are all performed in the inner factorization step. Both the panel and column symbolic steps use depth-first search (DFS). A further refinement, a two-dimensional supernode partitioning (defined by the blocking parameters t and b in Figure 2), enhances performance for large matrices and machines with small caches. A more detailed description of SuperLU is given in [5].
We conducted an extensive performance evaluation of SuperLU on several recent superscalar architectures. For large sparse matrices, SuperLU achieves up to 40% of the peak floating-point performance on both the IBM RS/6000-590 and the MIPS R8000. It achieves nearly 25% of peak on the DEC Alpha 21164. More details can be found in [24].
    for column j = 1 to n step w do
        F(:, j : j+w-1) = A(:, j : j+w-1);
        (1) Predict the nonzero structure of panel F(:, j : j+w-1):
            Determine which supernodes will update any of F(:, j : j+w-1);
        (2) Update panel F(:, j : j+w-1) using previous supernodes:
            for each updating supernode (r:s) < j in topological order do
                F(s+1:n, j : j+w-1) = F(s+1:n, j : j+w-1)
                                      - L(s+1:n, r:s) * U(r:s, j : j+w-1);
            end for (r:s);
        (3) Inner factorization for each column in the panel:
            for column jj = j to j+w-1 do
                Supernode-column update for column F(j:n, jj);
                Row pivoting for column F(jj:n, jj);
                Determine whether jj belongs to the same supernode as jj-1;
                Symmetric structure pruning;
            end for jj;
    end for j;

Figure 1: The supernode-panel factorization algorithm used in SuperLU.
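To make step (2) of Figure 1 concrete, the following sketch shows how one supernode, stored column-major as a dense rectangular block, could be applied to a dense working copy of a panel with two Level-3 BLAS calls. This is a simplified illustration assuming a CBLAS interface, not the actual SuperLU kernel: the gather/scatter between the sparse panel columns and the dense workspace, and the 2-D blocking by t and b, are omitted, and the array names are ours.

    #include <cblas.h>

    /* Apply supernode (r:s) to a dense working copy of a w-column panel.
     * snode : nsupr x nsupc block holding the U triangle in its top rows and
     *         L below it, column-major, leading dimension nsupr (nsupc = s-r+1).
     * dense : nsupr x w panel workspace whose rows are aligned with snode's
     *         rows (rows r..s first, then the rows below the supernode).
     * This mirrors step (2) of Figure 1: a triangular solve produces the U
     * segment, then a matrix-matrix update modifies the rows below it.       */
    static void supernode_panel_update(int nsupr, int nsupc, int w,
                                       const double *snode, double *dense)
    {
        /* U(r:s, j:j+w-1) := L(r:s, r:s)^{-1} * dense(top nsupc rows) */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasUnit, nsupc, w, 1.0, snode, nsupr, dense, nsupr);

        /* dense(below) := dense(below) - L(s+1:n, r:s) * U(r:s, j:j+w-1) */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    nsupr - nsupc, w, nsupc,
                    -1.0, snode + nsupc, nsupr,   /* rows below the triangle */
                    dense, nsupr,                 /* just-computed U segment */
                    1.0, dense + nsupc, nsupr);   /* rows below, updated     */
    }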
3 Test matrices
To evaluate our algorithms, we have collected matrices from various sources, with their characteristics summarized in Table 1 .
Some of the matrices are from the Harwell-Boeing collection [8]. Many of the larger matrices are from the ftp site maintained by Tim Davis of the University of Florida. Those matrices are as follows. Memplus is a circuit simulation matrix from Steve Hamm of Motorola. Rdist1 is a reactive distillation problem in chemical process separation calculations, provided by Stephen Zitney at Cray Research, Inc. Shyy161 is derived from a direct, fully coupled method for solving the Navier-Stokes equations for viscous flow calculations, provided by Wei Shyy of the University of Florida. Goodwin is a finite element matrix from a nonlinear solver for a fluid mechanics problem, provided by Ralph Goodwin of the University of Illinois at Urbana-Champaign. Venkat01, Inaccura and Raefsky3/4 were provided by Horst Simon, then of NASA and currently at NERSC. Venkat01 comes from an implicit 2-D Euler solver for an unstructured grid in a flow simulation. Raefsky3 is from a fluid-structure interaction turbulence problem. Raefsky4 is from a buckling problem for a container model. Af23560 is from solving an unsymmetric eigenvalue problem, provided by Zhaojun Bai of the University of Kentucky. Ex11 is from a 3-D steady flow calculation in the SPARSKIT collection maintained by Youcef Saad at the University of Minnesota. Wang3 is from solving a coupled nonlinear PDE system in a 3-D (30 × 30 × 30 uniform mesh) semiconductor device simulation, provided by Song Wang of the University of New South Wales, Sydney. Vavasis3 is an unsymmetric augmented matrix for a 2-D PDE with highly varying coefficients [31]. Dense1000 is a dense 1000 × 1000 random matrix.

This paper does not address the performance of column preordering for sparsity; we simply use the existing ordering algorithms provided by Matlab [17]. For all matrices except 1, 15 and 21, the columns were permuted by Matlab's minimum degree ordering of A^T A, also known as "column minimum degree" ordering. However, this ordering produces a tremendous amount of fill for matrices 1, 15 and 21, because it only attempts to minimize an upper bound on the actual fill, and the upper bounds are too loose in these cases. We found that when these three matrices were symmetrically permuted by Matlab's symmetric minimum degree ordering on A + A^T, the amount of fill is much smaller than with the column minimum degree ordering. The last column in Table 1 shows the number of nonzeros in the factored matrix F when using these column preorderings.
The matrices are sorted in increasing order of flops/nnz(F), the ratio of the number of floating-point operations to the number of nonzeros nnz(F). This "figure of merit" gives the maximum potential data reuse, as described in [5]. Thus, we expect our performance to increase with increasing flops/nnz(F).
4 Shared memory multiprocessor systems used for testing
We evaluated the parallel algorithm on several commercially popular machines, including the Sun SPARCcenter 2000 [30], SGI Power Challenge [29], DEC AlphaServer 8400 [11], and Cray C90/J90 [32, 33]. Table 2 summarizes the configurations and several key parameters of the five parallel systems. In the column "Bus Bandwidth" we report the effective or sustainable bandwidth to main memory. In "Read Latency" we report the minimum amount of time it takes a processor to fetch a piece of data from main memory into a register in response to a load instruction.
Table 3: Some characteristics of the processors used in the parallel systems.
Creation and context switching of the threads are accomplished rapidly at the user level, without entering the OS kernel. For P processors, we usually create P (logical) threads for the scheduling loop Slave_worker() (Figure 10). Scheduling these threads on the available physical processors is done by the operating system or the runtime library; thread migration between processors is usually invisible to us. The program is easily portable to multiple platforms: the source codes on different machines differ only in the thread spawning and locking primitives.

Table 3 summarizes the characteristics of the individual processors in the parallel machines, including the clock speed, the cache size, the peak Mflop rate, and the DGEMM and DGEMV peak Mflop rates. Most DGEMM and DGEMV Mflop rates were measured using vendor-supplied BLAS libraries. When a vendor does not provide a BLAS library, we report the results from PHiPAC [4], with an asterisk (*) beside such a number. For some machines, PHiPAC is often faster than the vendor-supplied DGEMM.
5 Parallel strategies
In this section, we present the crucial design choices we have made in parallelizing SuperLU: how to exploit both coarse and fine levels of parallelism, how to define the individual tasks, and how to deal with the issue of dynamic memory growth.
In order to make the parallel algorithm efficient, we need to make non-trivial modifications to serial SuperLU. All these changes are summarized in Table 4 and discussed in the subsections below. They show that the parallel algorithm is not a straightforward parallelization of the serial one, and illustrate the program complications arising from parallelization. In the performance evaluation, we will time various parts of the algorithm; the time notation to be used is listed in Table 5.

Table 4: The differences of the parallel algorithm from serial SuperLU.
5.1 Parallelism
We exploit two sources of parallelism in the sparse LU factorization. The coarse-level parallelism comes from the sparsity of the matrix, and is exposed to us by the column elimination tree (or column etree for short) of A: panels lying in disjoint subtrees of the etree can be factored independently of one another.

Having studied the parallelism arising from different subtrees, we now turn our attention to the dependent columns, that is, the columns having ancestor-descendant relations. When the elimination process proceeds to a stage where there are more processors than independent subtrees, we need to make sure all processors work cooperatively on the dependent columns. Thus the second level of parallelism comes from pipelining the computations of the dependent columns.
Consider a simple situation with only two processors. Processor 1 gets a task Task 1 containing column j, processor 2 gets another task Task 2 containing column k, and node j is a descendant of node k in the etree. The (potential) dependency says only that Task 2 cannot finish its execution before Task 1 finishes. However, processor 2 can start Task 2 right away with the computations not involving column j; this includes performing the symbolic structure prediction and accumulating the numeric updates using the finished columns that are descendants in the etree. After processor 2 has finished this other part of the computation, it has to wait for Task 1 to finish. (If Task 1 is already finished at this moment, processor 2 does not waste any time waiting.) Then processor 2 will predict the new fills and perform the numeric updates that result from the finished columns in Task 1. In this way, both processors do useful work concurrently while still preserving the precedence constraint. Note that we assume the updates can be done in any order. This could give different (indeed, nondeterministic) numerical results from run to run. (In order to guarantee determinism, we would have to statically assign the tasks to processors; the performance cost we pay for determinism may be load imbalance and reduced parallelism. We are considering adding a debugging option to the software that guarantees determinism.)

Although this pipelining mechanism is complicated to implement, it is essential for achieving higher concurrency. This is because, in most problems, a large percentage of the computation occurs at the upper levels of the etree, where there are fewer branches than processors. An extreme example is a dense matrix, whose etree is a single chain. In this case, the parallel SuperLU "reduces to" a pipelined column-oriented dense LU algorithm.
5.2 Panel tasks
As studied in [5], the introduction of supernodes and panels makes the computational kernels highly efficient. To retain the serial algorithm's ability to reuse data in cache and registers, we treat the factorization of one panel as a unit task to be scheduled; it computes the part of U and the part of L for all columns within this panel. Choosing a panel as the scheduling unit affords the best granularity on the SMPs we targeted, and requires only modest changes to the serial code [5]. The alternative, blocking the matrix by rows and columns [22, 28], introduces too much synchronization overhead to make it worthwhile on SMPs with modest parallelism.

A panel task consists of two distinct subtasks. The first corresponds to the outer factorization, which accumulates the updates from the descendant supernodes. The second subtask is to perform the panel's inner factorization. We exploit parallelism within the first subtask, but not the second.
Since the parallel algorithm uses the column etree as the main scheduling tool, it is worth studying the relationship between the panels and the structure of the column etree. We assume that the columns of the matrix are ordered according to a postorder on the column etree. We expect a postorder on the column etree to bring together unsymmetric supernodes, just as a postorder on the symmetric etree brings together symmetric supernodes. Pictorially, panels can be classified into three types, depending on where they are located in the etree, as illustrated in Figure 3.
In the pipelining algorithm, panels of type (c) complicate the record-keeping if a processor owning such a panel needs to wait for, and later perform, the updates from the busy panels down the etree. To simplify this, we imposed two restrictions. First, we restricted the definition of panels so that type (c) panels do not occur: a panel stops before a node (column) that has more than one child in the etree, so that every branching node necessarily starts a new panel. Secondly, we make sure that all busy descendant panels always form one path in the etree, so the processor waiting for these busy panels can simply walk up the path in the etree starting from the most distant busy descendant.
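As an illustration, the following sketch marks the columns at which a new panel must start under this restricted definition: a column begins a panel when the current panel already holds w columns or when the column is a branching node of the column etree. The routine and array names (parent, start) are hypothetical, and the special treatment of relaxed supernodes at the fringe of the etree is omitted.

    #include <stdlib.h>

    /* Mark panel boundaries under the restricted panel definition: a new panel
     * starts whenever the current panel already holds w columns, or the column
     * is a branching node (more than one child) in the column etree.
     * parent[j] is the etree parent of column j (parent[root] == n).
     * Returns an array start[] with start[j] == 1 iff column j begins a panel. */
    int *mark_panel_starts(int n, int w, const int *parent)
    {
        int *nchild = calloc(n, sizeof(int));
        int *start  = calloc(n, sizeof(int));
        for (int j = 0; j < n; j++)
            if (parent[j] < n) nchild[parent[j]]++;

        int in_panel = 0;               /* columns accumulated in current panel */
        for (int j = 0; j < n; j++) {
            if (j == 0 || in_panel == w || nchild[j] > 1) {
                start[j] = 1;           /* panel is full, or branching node     */
                in_panel = 0;
            }
            in_panel++;
        }
        free(nchild);
        return start;
    }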
With this restricted definition of panels there will be more panels of smaller sizes, and the question arises whether this will hurt performance. We studied the distribution of floating-point operations over different panel sizes for all of our test matrices, and observed that usually more than 95% of the floating-point operations are performed in the panels of largest size, and these panels tend to occur at a few topmost levels of the etree. Thus, panels of small sizes normally do not represent much computation. On uniprocessors, we see almost identical performance using the earlier and the new definitions of panels. Therefore, we believe that this restriction on panels simplifies and accelerates the parallel scheduling algorithm with little performance loss on individual processors.
5.3 Supernode storage using nonzero column counts in QR factorization
It is important to store the columns of a supernode consecutively in memory, so that we can call BLAS routines directly in place without paying the cost of copying the columns into contiguous memory. Although this contiguity is easy to achieve in a sequential code, it poses problems in the parallel algorithm. Consider the scenario of parallel execution depicted in Figure 4. According to the order of the finishing times specified in the figure, the panel consisting of columns {3,4} will be stored in memory first, followed by panel {1,2}, and then by panel {5,6}. The supernode {3,4,5,6} is thus separated by the panel {1,2} in memory. The major difficulty comes from the fact that the supernodal structure emerges dynamically as the factorization proceeds, so we cannot statically calculate the amount of storage required by each supernode. Another difficulty is that panels and supernodes can overlap in several different ways. One immediate solution would be not to allow any supernode to cross the boundary of a panel. In other words, the leading column of a panel would always be treated as the beginning of a new supernode. Thus a panel could possibly be subdivided into more than one supernode, but not vice versa. In such circumstances, the columns of a supernode would always be contiguous in memory, because they would be assigned to a single processor by the scheduler. Each processor would simply store a (partial) ongoing supernode in its local temporary store, and copy the whole supernode into the global data structure as soon as it was finished.
This restriction on supernodes would mean that the maximum size of supernodes would be bounded by the panel size. As discussed in Section 7.5 (also [24]), for best per-processor efficiency and parallelism we would like to have large supernodes but relatively small panels. These conflicting demands make it hard to find one good size for both supernodes and panels. We conducted an experiment with this scheme for the sequential algorithm. Figure 5 shows the uniprocessor performance loss for various panel sizes (i.e., maximum sizes of supernodes). For large matrices, say matrices 12-21, the smaller panels and supernodes result in more performance loss. For example, when w = 16, the slowdown can be as large as 20% to 68%. Even for large panel sizes, such as w = 48, the slowdown is still between 5% and 20%. In the parallel algorithm, however, such large panels give rise to too large a task granularity and severely limit the level of concurrency. We therefore feel that this simple solution is not satisfactory. Instead, we seek a solution that does not impose any restriction on the relation between panels and supernodes, and that allows us to vary the sizes of panels and supernodes independently in order to better trade off concurrency and single-processor efficiency. Our approach is to preallocate storage for the supernodes, so that there will always be space to store supernode columns as they are computed. We now describe how we preallocate enough (but not too much) space.
After Gaussian elimination with partial pivoting, we can write A = P_1 L_1 P_2 L_2 ... P_{n-1} L_{n-1} U, where P_i is an elementary permutation matrix representing the row interchange at step i, and L_i is a unit lower triangular matrix whose i-th column contains the multipliers at step i. We now define L as the unit lower triangular matrix whose i-th column is the i-th column of L_i, so that L - I = Σ_i (L_i - I). We shall make use of the following structure containment property in our storage scheme; here we only quote the result without proof: ignoring numerical cancellation, the nonzero structure of L is contained in the nonzero structure of H, the Householder matrix of the symbolic QR factorization of A, for any choice of row interchanges.

We restrict attention to fundamental supernodes. There are several reasons why fundamental supernodes are appropriate, one of which is that the set of fundamental supernodes is the same regardless of the particular etree postordering. For consistency, we now also impose this restriction on the supernodes in L and H, respectively. For convenience, let S_L denote the fundamental supernodes in the L factor, and S_H the fundamental supernodes in the symbolic Householder matrix H. We shall omit the word "fundamental" when it is clear from context. Our code breaks an L supernode at the boundary of an H supernode, forcing the L supernode to be contained in the H supernode. In fact, if we use fundamental L supernodes and ignore numerical cancellation (which we must do anyway for symmetric pruning), we can show that an L supernode is always contained in an H supernode [20].
Our objective is to allocate storage based on the number of nonzeros in S_H, so that this storage is sufficiently large to hold S_L. Figure 6 illustrates the idea of using S_H as a bound. Two supernodes in S_L from different branches of the etree go to the corresponding memory locations of the enclosing supernodes in S_H. Even if an H supernode breaks into multiple L supernodes, those L supernodes all lie on one path in the column etree; thus an L supernode from a different subtree cannot interrupt the storage for a supernode as in Figure 4. Since the panels (and hence the supernodes) within an H supernode are finished in order of increasing column numbers, the columns of each S_L supernode are contiguous in the storage of the S_H supernode.
To determine the storage for S_H, we need an efficient algorithm to compute the column counts nnz(H_j) for H. We also need to identify the first vertex of each supernode in S_H. Then the number of nonzeros in each supernode is simply the product of the column count of its first vertex and the number of columns in the supernode.
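In code, the static reservation could look like the following sketch, which assumes the column counts and the supernode first vertices have already been computed; the names colcnt, first_col and offset are ours, not SuperLU's.

    #include <stddef.h>

    /* Compute the starting offset of each H supernode in the preallocated
     * supernode storage.  There are nsuper H supernodes; supernode k spans
     * columns first_col[k] .. first_col[k+1]-1 (first_col[nsuper] == n).
     * colcnt[j] is nnz(H_j) from the modified QR-column-count algorithm.
     * The space reserved for supernode k is colcnt[first_col[k]] times the
     * number of its columns, which bounds the space needed by all the
     * L supernodes it encloses.                                             */
    size_t layout_supernode_storage(int nsuper, const int *first_col,
                                    const int *colcnt, size_t *offset)
    {
        size_t total = 0;
        for (int k = 0; k < nsuper; k++) {
            int ncols = first_col[k + 1] - first_col[k];
            offset[k] = total;
            total += (size_t)colcnt[first_col[k]] * (size_t)ncols;
        }
        return total;   /* total number of numerical entries to allocate */
    }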
Finding the first vertex and computing the column count can be done using a variant of the QR-column-count algorithm by Gilbert et al. [20]. The modified QR-column-count algorithm takes Struct(A) and the postordered T as inputs, and computes nnz(H_j) and S_H. The complexity of the algorithm is O(m α(m, n)), where m = nnz(A) and α(m, n) is the slowly growing inverse of Ackermann's function coming from the disjoint set union operations. In practice, it is as fast as computing the column etree T [24, Table 5.2]. In both the etree and QR-column-count algorithms, the disjoint set union operations are implemented using path halving and no union by rank (see [21] for details).

One remaining issue is what we should do if the static storage given by an upper bound structure is much more generous than actually needed. We developed a dynamic prediction scheme as a fallback for this situation. In this scheme, we still use the supernode partition S_H. Unlike the static scheme, which uses the column counts nnz(H_j), we dynamically compute the column count for the first column of each supernode in S_H, as follows. When a processor obtains a panel that includes the first column of some supernode H(:, r:s) in S_H, the processor invokes a search procedure on the directed graph G(L(:, 1:r-1)^T), using the nonzeros in A(:, r:s), to determine the union of the row structures in the submatrix (r:n, r:s). We use the notation D(r:n, r:s) to denote this structure. It is true that

    Struct((L+U)(r:n, r:s)) ⊆ Struct(D(r:n, r:s)) ⊆ Struct(H(r:n, r:s)).        (1)

The search procedure is analogous to (but simpler than) the panel symbolic step (Figure 1, step (1)); here we only want to determine the count for the column D(r:n, r), without the nonzero structure or the topological order of the updates. Then we use the product of nnz(D(r:n, r)) and s - r + 1 to allocate storage for the L supernodes within columns r through s. Since nnz(L(r:n, r)) ≤ nnz(D(r:n, r)) ≤ nnz(H(r:n, r)), the dynamic storage bound so obtained is usually tighter than the static bound.

Table 6: Supernode storage utilization, using static and dynamic upper bounds (partial). The number tabulated is the ratio of the number of nonzeros in the supernodes of L to that in the prediction.

        Matrix        Static  Dynamic
    12  Saylr4          .89     .92
    13  Shyy161         .91     .92
    14  Goodwin         .95     .98
    15  Venkat01        .11     .74
    16  Inaccura        .96     .99
    17  Af23560         .95     .97
    18  Dense1000      1.00    1.00
    19  Raefsky3        .99     .99
    20  Ex11            .99    1.00
    21  Wang3           .14     .89
    22  Raefsky4        .99     .99
    23  Vavasis3        .95     .98
The storage utilizations for the supernodes in S_L are tabulated in Table 6. The utilization is calculated as the ratio of the actual number of nonzeros in the supernodes of L to the number of nonzeros in the supernodes of H. When collecting this data, the maximum supernode size t was set to 64. For most matrices, the storage utilizations using the static bound by H are quite high; they are often greater than 70%, and are over 85% for 14 out of the 21 problems. However, in the static scheme, the storage utilizations for matrices 1, 15 and 21 are only 4%, 11% and 14%, respectively. The dynamic scheme overcomes these low utilizations: for the three matrices above, the utilizations in the dynamic scheme are 68%, 74% and 89%, which is quite satisfactory. For the other problems, the dynamic approach also results in higher utilizations.
The runtime overhead associated with the dynamic scheme is usually between 2% and 15% on a single processor of the RS/6000-590. From these experiments, we conclude that the static scheme using H often gives a tight enough storage bound for S_L. For some problems, such as matrices 15 and 21, the dynamic scheme must be employed to achieve better storage utilization; the program then suffers a certain amount of slowdown. Our code tries the static scheme first and switches to the dynamic scheme only if the static scheme requests more space than is available.
5.4 Nonblocking pruning and depth-first search
The idea of symmetric pruning [9, 10] is to use a graph G' with fewer edges than the graph G of L^T to represent the structure of L. Traversing G' gives the same reachable set as traversing G, but is less expensive. As shown in [10], this technique significantly reduces the symbolic factorization time.
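The role of the two graphs can be seen from a generic reachability sketch like the one below: the structure of a new column is the set of vertices reachable from the nonzeros of A(:, j), and the traversal may follow either the full adjacency lists of G or the shorter pruned lists of G' without changing the reachable set. This is a simplified, column-wise illustration; the adjacency arrays and names are assumptions, the real code works supernode by supernode, and the topological ordering of the updates is not recorded here.

    /* Mark all vertices reachable from the nonzeros of column j of A by a
     * depth-first search over a column adjacency structure (either G or G').
     * adj_ptr/adj_ind : adjacency lists in compressed form.
     * marker[v] == j  : v has already been reached for this column.
     * stack[]         : workspace of n entries.                            */
    static void reach(int j, int a_nnz, const int *a_ind,
                      const int *adj_ptr, const int *adj_ind,
                      int *marker, int *stack)
    {
        int top = 0;
        for (int p = 0; p < a_nnz; p++) {          /* seeds: nonzeros of A(:,j) */
            int v = a_ind[p];
            if (marker[v] != j) { marker[v] = j; stack[top++] = v; }
        }
        while (top > 0) {
            int v = stack[--top];                  /* v is in column j's structure */
            if (v >= j) continue;                  /* only earlier columns propagate fill */
            for (int q = adj_ptr[v]; q < adj_ptr[v + 1]; q++) {
                int u = adj_ind[q];
                if (marker[u] != j) { marker[u] = j; stack[top++] = u; }
            }
        }
    }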
In the sequential algorithm, in addition to the adjacency structure for G, there is another adjacency structure to represent the reduced graph G'. For each supernode, since the row indices are the same among the columns, we only store the row indices of the first column for G and the row indices of the last column for G'. (If we used only one adjacency list for each supernode, then, since pivoting may have reordered the rows so that the pruned and unpruned rows are intermingled in the original row order, it would be necessary to reorder all of L and A to account for it.) Figure 7 illustrates the storage layout for the adjacency lists of G and G' of a sample matrix.
The array Lsub[*] stores the row subscripts. In the parallel algorithm, contention occurs when one processor is performing a DFS using the G' adjacency list of column j (a READ operation) while another processor is pruning the structure of column j, because pruning reorders the row indices in the list (a MODIFY operation). There are two possible solutions to avoid this contention. The first is to associate one mutually exclusive (mutex) lock with each adjacency list of G'. A processor acquires the lock before it prunes the list and releases the lock thereafter; similarly, a processor holds the lock while performing a DFS on the list. Although the critical section for pruning can be very short, the critical section for the DFS may be very long, because the list must stay locked until the entire depth-first search starting from all nodes in the list is completed. During this period, all other processors attempting to prune the list or to traverse the list are blocked. Therefore this approach may incur too much overhead, and the benefit of pruning may be completely offset by the cost of locking.
We now describe a better algorithm that is free from locking. We use both graphs G' and G in the depth-first search: a processor traverses the pruned list of a column only if the pruning of that column has already been completed, and otherwise traverses the original (unpruned) list in G, which is never modified. Since G' is generally a subgraph of G, the depth-first searches in the parallel code may traverse more edges than those in the sequential code. This is because, in the parallel algorithm, a supernode may be pruned later than in the sequential algorithm. However, because of the effectiveness of symmetric reduction, the search very often still uses the pruned list in G'. So the time spent in the occasional extra search in the G-lists is likely to be much less than the cost of the locking mechanism. Figure 8 shows the relative size of the reduced supernodal graph, and Figure 9 shows the fraction of searches that use the G'-lists. The numbers in both figures were collected on a single-processor Alpha 21164.

6 The parallel scheduling algorithm

A common practice is to organize the program as a self-scheduling loop, interacting with a global pool of tasks that are ready to be executed. Each processor repeatedly takes a task from the pool, executes it, and puts new ready task(s) in the pool. This pool-of-tasks approach has the merit of balancing the work load automatically, even for tasks with a large variance in granularity. There is no notion of ownership of tasks or submatrices by processors; the assignment of tasks to processors is completely dynamic, depending on the execution speed of the individual processors. Our scheduling algorithm employs this model as well. This is in contrast to some implementations of sparse Cholesky, which can schedule work to processors carefully and cheaply ahead of time [22]; the dynamic nature of partial pivoting prevents us from doing this.
Our scheduling approach uses some techniques from the parallel column-oriented algorithm developed by Gilbert [19]. Figure 10 sketches the top-level scheduling loop. Each processor executes this loop until its termination criterion is met, that is, until all panels have been factorized.
The parallel algorithm maintains a central priority queue of tasks (panels) that are ready to be executed by any free processor. The contents of this task queue can be accessed and altered by any processor. At any moment during the elimination, a panel is tagged with a certain state, such as READY, BUSY, or DONE. Every processor repeatedly asks the scheduler (at line 4) for a panel task from the queue. The Scheduler() routine implements the priority-based scheduling policy described below. The input argument oldpanel denotes the panel that was just finished by this processor; the output argument newpanel is a newly selected panel to be factorized by this processor. The selection preference is as follows:
(1) the parent of oldpanel, if all of its children are now DONE, so that it is immediately available; (2) a panel taken from the head of the task queue which can be computed without pipelining, that is, a leaf panel; (3) if no more leaf panels exist, the scheduler takes a panel that has some BUSY descendant panels currently being worked on by other processors; this new panel must then be computed by this processor in a pipelined fashion.

One might argue that (1) and (2) should be reversed in priority. Choosing to eliminate the immediately available parent first is primarily motivated by locality of reference: since a just-finished panel is likely to update its parent or other ancestors in the etree, it is advantageous to schedule its parent and other ancestors on the same processor.

The per-panel body of the scheduling loop (lines 9-14 of Figure 10) is:

     9. Determine which supernodes will update panel newpanel;
    10. Skip all BUSY panels/supernodes;
    11. panel_numeric_factor( newpanel );
    12. Accumulate updates from the DONE supernodes, updating newpanel;
    13. Wait for the BUSY supernodes to become DONE, then predict new fills
        and accumulate more updates to newpanel;
    14. inner_factorization( newpanel );   /* independent from other processors */
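A stripped-down version of the Slave_worker() loop and its central task pool might look like the following sketch. It is a minimal illustration using POSIX threads: the pool_t structure, the placeholder routines and the omitted error handling are our simplifications, not the actual SuperLU code, and the Scheduler() policy (rules (1)-(3) above) is hidden behind an external function that is assumed always to return a panel.

    #include <pthread.h>

    enum state { READY, BUSY, DONE };

    typedef struct {                /* central pool of panel tasks              */
        int *queue, head, tail;     /* panel numbers, in priority order         */
        enum state *panel_state;    /* one entry per panel                      */
        int panels_left;
        pthread_mutex_t lock;
        pthread_cond_t  event;      /* signaled when a supernode becomes DONE   */
    } pool_t;

    extern int  Scheduler(pool_t *pool, int oldpanel);       /* rules (1)-(3)   */
    extern void panel_symbolic_factor(int panel);            /* lines 9-10      */
    extern void panel_numeric_factor(pool_t *pool, int panel); /* lines 11-13   */
    extern void inner_factorization(int panel);              /* line 14         */

    void *Slave_worker(void *arg)
    {
        pool_t *pool = arg;
        int newpanel, oldpanel = -1;

        for (;;) {
            pthread_mutex_lock(&pool->lock);
            if (pool->panels_left == 0) {            /* termination criterion   */
                pthread_mutex_unlock(&pool->lock);
                break;
            }
            newpanel = Scheduler(pool, oldpanel);    /* pick the next panel     */
            pool->panel_state[newpanel] = BUSY;
            pthread_mutex_unlock(&pool->lock);

            panel_symbolic_factor(newpanel);
            panel_numeric_factor(pool, newpanel);    /* may wait on pool->event */
            inner_factorization(newpanel);

            pthread_mutex_lock(&pool->lock);
            pool->panel_state[newpanel] = DONE;
            pool->panels_left--;
            pthread_cond_broadcast(&pool->event);    /* wake waiters at line 13 */
            pthread_mutex_unlock(&pool->lock);
            oldpanel = newpanel;
        }
        return NULL;
    }

In this sketch the event notification described below corresponds to the condition-variable broadcast issued when a panel becomes DONE.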
To implement the above priority scheme, the task queue is initialized with the leaf panels, that is, the relaxed supernodes, which are marked as READY. Later on, Scheduler() may add more panels at the tail of the queue. This happens when all the children of newpanel's parent (call it parent) are BUSY; parent is then enqueued and marked as eligible for pipelining. By rule (1), some panel in the middle of the queue may be taken when all its children are DONE. This may happen even before all the initial leaf panels are finished; all the intermediate leaf panels are taken in this way. By rules (2) and (3), Scheduler() removes tasks from the head of the queue.
It is worth noting that the executions of different processors are completely asynchronous. There is no global barrier; the only synchronization occurs at line 13 in Figure 10, where a processor stalls when it waits for some BUSY updating supernode to finish. As soon as this BUSY supernode is finished, all the processors waiting on this supernode are awakened to proceed. This type of synchronization is commonly referred to as event notification. Since the newly finished supernode may produce new fills in the waiting panels, the symbolic mechanism is needed to discover and accommodate these new fills.
7 Parallel performance
We now evaluate the performance of the algorithm. The organization of this section is as follows. Section 7.1 summarizes the observed speedups on the various platforms; the speedup is measured relative to serial SuperLU. Section 7.2 quantifies the parallel overheads and their impact on performance. Section 7.3 gives statistics on load balance. Section 7.4 studies the space efficiency of the algorithm.
7.1 Speedup summary
Figures 11 through 15 report the speedups of the parallel algorithm on the five platforms, with the number of threads P varied. Because of memory limits we could not test all problems on the SPARCcenter 2000. The speedup is measured against the best sequential runtime achieved by SuperLU on a single processor of each parallel machine.
In each figure, the bottom curve labeled "P = 1" illustrates the overhead in the parallel code when compared to serial SuperLU, using the same blocking parameters. The structure of the parallel code, when run on a single processor, does not differ much from sequential SuperLU, except that a global task queue and various locks are involved; the extra work in the parallel code is purely integer arithmetic. In order to achieve a higher degree of concurrency, the panel size and the maximum supernode size are set to smaller values than those that are best on one processor (see Section 7.2.1). We also tabulate these speedup figures in the Appendix (Tables 13 through 17), where the last two columns in each table show the factorization time and Megaflop rate, respectively, corresponding to the largest number of processors used.
7.2 Impact of overhead on parallel efficiency
The parallel algorithm experiences some overhead, which mainly comes from three sources: the reduced per-processor efficiency due to the smaller granularity of unit tasks, accessing critical sections via locks, and orchestrating the dependent tasks via event notification. The purpose of this section is to understand how much time is spent in each part of the algorithm and to explain the speedups we saw in Section 7.1.
7.2.1 Decreased per-processor performance due to smaller blocking
The first overhead is due to the necessity of reducing the blocking parameters in order to achieve more concurrency. Recall that two blocking parameters affect performance: the panel size (w) and the maximum size of a supernode (maxsup). For better per-processor performance, we prefer larger values; on the other hand, a large granularity of unit tasks limits the degree of concurrency.
On the Cray J90, this trade-off is not so important, because w = 1 is good for the sequential algorithm; we therefore also use w = 1 in the parallel algorithm. When varying the value of maxsup, we find that performance is quite robust in the range between 16 and 64.
On the Power Challenge and the AlphaServer 8400, we observe more dramatic differences with varied blockings. Figures 16 and 17 illustrate this loss of efficiency for several large problems on single processors of the two machines. In this experiment, the parallel code is run on a single processor with two different settings of w and maxsup. Figure 16 shows, on a single processor of the Power Challenge, the ratio of the runtime with the best blocking for 1 CPU (w = 24, maxsup = 64) to the runtime with the best blocking for 12 CPUs (w = 12, maxsup = 48). Figure 17 shows the analogous ratio for the 8-CPU AlphaServer 8400. On the Power Challenge, the blocking used for best parallel performance achieves only 81% uniprocessor efficiency for matrices 17 and 19; the corresponding lowest number on the AlphaServer 8400 is 86%, for matrix 22.
7.2.2 Accessing critical sections
Several places in the program must be protected by mutual exclusion. In Table 7, we give a rough count of the number of times the program acquires and relinquishes various locks. Note that the total number of locking operations performed is independent of the number of processors. Since we want to allow different processors to enter different critical sections simultaneously, we use five mutex variables to guard the five critical regions.
To see how much cost is associated with locking, in Table 8 we measure the time it takes to acquire and relinquish a lock on several platforms, with different numbers of threads P. The figure in parentheses is the number of clock cycles. In this small benchmark code, the critical section is simply one statement, which increments a counter; the locking and unlocking are placed around this statement, and the measurement is done in a tight loop with many iterations. When there is more than one thread, the time increases slightly, but not linearly in the number of threads. The uniprocessor slowdown is partly due to the overhead incurred by using these locks, even when there are no other processors competing for them. By multiplying the time for a single lock/unlock in Table 8 by the number of lockings performed in Table 7, we can estimate the locking overhead. As a concrete example, consider the medium-size matrix 13 on a single processor of the Cray J90. Since the sequential code performance is 26 Mflops, each lock/unlock is equivalent to roughly 69 floating-point operations. When the factorization is performed with panel size w = 1, the total number of lock acquisitions is 237004, which, when multiplied by 2.67 microseconds, results in about 0.64 seconds. This is less than 3% of the entire factorization time (24.85 seconds). We observe that this percentage is typical for large matrices (see also the bottom curve in Figure 18). The locking overhead also varies across machines; for example, it is higher on the Cray J90 than on the Power Challenge or the AlphaServer 8400.
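The micro-benchmark behind Table 8 can be approximated by a sketch like the following, in which the critical section is a single counter increment timed over a tight loop; pthread mutexes stand in for whatever platform-specific locking primitives were actually used, so the constants and structure here are ours.

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000L

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile long counter = 0;

    static void *worker(void *arg)
    {
        for (long i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&lock);       /* acquire                        */
            counter++;                       /* one-statement critical section */
            pthread_mutex_unlock(&lock);     /* relinquish                     */
        }
        return NULL;
    }

    int main(void)
    {
        int P = 4;                           /* number of threads              */
        pthread_t tid[16];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int p = 0; p < P; p++) pthread_create(&tid[p], NULL, worker, NULL);
        for (int p = 0; p < P; p++) pthread_join(tid[p], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        printf("avg lock/unlock: %.3f microseconds\n",
               1e6 * secs / ((double)ITERS * P));
        return 0;
    }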
7.2.3 Coordinating dependent tasks
The third source of overhead is due to insufficient parallelism in the pipelined execution of the dependent panels. Dependent panels are those that have an ancestor-descendant relation in the column etree. When a processor factoring a panel needs an update from a BUSY descendant panel, it simply spins, waiting for that panel to finish, as shown at line 13 in the scheduling loop of Figure 10. During the spin wait the processor does nothing useful. The total amount of spin wait time observed is significant in some cases, especially with larger numbers of processors. For example, for matrix 16 on the 12-CPU Power Challenge, about 40% of the parallel runtime is spent spinning; the corresponding number for the dense matrix is about 58%. The dense matrix is the worst case, because the factorization of all panels must be carried out in pipelined fashion. Figure 18 depicts the locking overhead (Section 7.2.2) and the spinning due to dependencies on the 8-CPU Cray J90; the locking overhead here also includes the possible contention from the 8 processors. In this figure, we also plot the inefficiency (i.e., 1 - efficiency) of the parallel algorithm. For most matrices, the spinning overhead due to dependencies is much higher than the overhead from lock acquisition.
7.2.4 Putting all overheads together
In this subsection we evaluate the effect of the combined overheads on the parallel efficiency. In summary, the overheads include:

Overhead (1): reduced uniprocessor performance due to smaller blocking;
Overhead (2): accessing critical sections;
Overhead (3): idle time (from spin wait in the panel pipeline and in the top-level scheduling loop).

Overhead (1) only affects uniprocessor performance. Overhead (2) decreases both the uniprocessor performance of the parallel code and the parallel performance. Compared with the serial execution, the parallel execution experiences more contention for locks, but Table 8 and Figure 18 indicate that the runtime does not increase significantly because of this contention. Therefore, we may model Overhead (2) as part of the degradation in uniprocessor performance.

We now analyze the relations among the various times defined in Table 5. All the times are measured independently. In particular, T_I is obtained by timing two kinds of idle periods on each processor and summing over all processors: one is the spin wait in the panel update pipeline, and the other is when a processor calls Scheduler() (line 4 in Figure 10) and fails to get a panel from the scheduler. We found that, for the test matrices and the numbers of processors being considered, failure from the scheduler rarely occurs, so most of the idle time is due to pipeline waiting. The following relation holds for the parallel runtime (in the absence of errors in the individual time measurements, equality should hold):

    P · T_P ≥ T_1 + T_I .                                   (2)

We compute the observed efficiency E_actual as

    E_actual = T_s / (P · T_P) .                            (3)

Since T_P, T_1, and T_I are obtained from different runs of the program, the left-hand side and the right-hand side of Equation (2) may not match well. For the purpose of checking, we also compute the quantity

    E_check = T_s / (T_1 + T_I) .                           (4)

The closeness of E_check to E_actual indicates the accuracy of the timings; see Tables 9 and 10.

In order to understand the impact of the overheads discussed in the previous subsections on the parallel efficiency, we introduce two parameters, ζ_1 and ζ_p, which are calculated from T_s, T_1, T_P, and T_I as follows:

    ζ_1 = (T_1 - T_s) / T_1 ,                               (5)
    ζ_p = T_I / (P · T_P) .                                 (6)

Both ζ_1 and ζ_p are in the range [0, 1); ζ_1 measures the overhead that degrades the uniprocessor performance, while ζ_p measures the overhead in the parallel execution. The smaller ζ_1 and ζ_p are, the more efficient is the parallel algorithm. Since

    E_actual = T_s / (P · T_P) = (T_s / T_1) · (T_1 / (P · T_P)) ≈ (1 - ζ_1)(1 - ζ_p) ,

we can use

    E_est = (1 - ζ_1)(1 - ζ_p)                              (7)

as an estimate for the actual efficiency.

In Tables 9 and 10 we report E_actual, E_est, ζ_1, ζ_p, and E_check obtained on the two parallel machines. For example, for matrix Raefsky4 on the 16-CPU Cray J90, Table 9 lists ζ_1 = .02 and ζ_p = .43, giving E_est = .55, close to the observed E_actual = .51 and E_check = .53.

Table 9: Efficiencies and overheads on a 16-CPU Cray J90.
Cray J90
In the first two columns of Table 9, we compare the estimated efficiency E_est in Equation (7) with the actually observed efficiency E_actual in Equation (3). The estimated and observed efficiencies are very close; their differences are mostly within 5%, except for matrices 13 and 20, which show 15% and 9% differences, respectively. For these two matrices, E_actual and E_check differ significantly, indicating that some overhead is not reflected in T_1 or T_I.
As mentioned in Section 7.2.1, the uniprocessor performance on the J90 does not degrade much with smaller maxsup; that is, Overhead (1) does not exist (T_s' = T_s). Therefore, T_s/T_1 can be read off the bottom curve in Figure 15. We gathered the statistics for ζ_p and B on 16 processors, as shown in Table 9. For most problems the pipeline spin waiting, as measured by ζ_p, is the primary cause of inefficiency. This is particularly evident for matrices 15, 18 and 21, for which the processors are idle 74%, 67% and 71% of the time, respectively. This explains the low speedups achieved for these matrices.

Table 10: Efficiencies and overheads on a 12-CPU Power Challenge.
Power Challenge
On a cache-based machine, the uniprocessor performance loss of the parallel code is a combination of performing the lockings and less efficient cache utilization. Therefore, T_s/T_1 equals the product of the numbers from the bottom curve in Figure 12 and the blocking ratios in Figure 16. Compared to the J90, we observe that ζ_1 is much larger, because the cache plays an important role on the Power Challenge. In fact, for matrices 13, 17, 19, 20 and 22, the uniprocessor performance loss is more severe than the parallel overhead ζ_p.
Again, for matrices 15, 18 and 21, the spin wait time is the major bottleneck; the processors are idle more than 50% of the time. We found that E_est and E_actual did not match as well as they did on the J90: for matrices 13, 14, and 23, the gaps are 16%, 12%, and 15%, respectively (for Vavasis3, matrix 23, Table 10 lists E_est = .71 against E_actual = .56).
The corresponding gaps between E_check and E_actual are large as well. This again indicates some overhead not accounted for in T_1 or T_I; we need further study to fully understand why this is.
7.3 Load balance
As mentioned earlier, our dynamic scheduling approach can automatically balance the workload.
One way to measure the load balance is as follows. Let f_i denote the number of floating-point operations performed on processor i, and P the number of processors. We define the load balance B as

    B = ( Σ_i f_i ) / ( P · max_i f_i ) .                   (8)

In words, B equals the average work load divided by the maximum work load. It is readily seen that 0 < B ≤ 1, and higher B indicates better load balance. If load imbalance were the sole overhead in a parallel program, the parallel execution time would simply be the execution time of the slowest processor, whose work load is highest. We should note that the load balance measured by Equation (8) is an accurate measure of work distribution only under the condition that each floating-point operation takes the same amount of time. This is not the case in practice, but the large values of B shown in the last columns of Tables 9 and 10 still show that good load balance was achieved in terms of flop counts. Matrix 13 is an exception.
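Equation (8) translates directly into code; the following small routine (a hypothetical helper, not part of SuperLU) computes B from the per-processor flop counts f_i.

    /* Load balance B = (sum_i f_i) / (P * max_i f_i), Equation (8).
     * flops[i] is the number of floating-point operations done by processor i. */
    double load_balance(int P, const double *flops)
    {
        double sum = 0.0, max = 0.0;
        for (int i = 0; i < P; i++) {
            sum += flops[i];
            if (flops[i] > max) max = flops[i];
        }
        return (max > 0.0) ? sum / (P * max) : 1.0;
    }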
7.4 Working storage requirement
The parallel algorithm may require more working storage than the sequential one. Multiple threads share heap storage, static storage, and code, all residing in main memory. Each thread, upon execution, is allocated a private stack and has its own register set. Our program does not use many stack variables, so the stack size for each thread need not be very large. All working storage is allocated via malloc() from the heap. The working storage consists of two parts, where one part is shared among all threads, and another part is local to each thread. The shared working storage is mainly used to facilitate the central scheduling activity and memory management. It includes:
- one integer array of size p used as the task queue, where p is the total number of panels;
- one bit vector of size n to mark whether a column is busy;
- four integer arrays of size n to record the status of each panel;
- one integer array of size n to record a column's most distant busy column down the etree during pipelining;
- three integer arrays of size n to implement the storage layout for supernodes (Section 5.3).
The local working storage used by each thread is very similar to that used by sequential SuperLU, that is, all that is necessary to factorize one single panel. It includes:

- eight integer arrays of size n to perform the panel and column depth-first search;
- one n-by-w integer array to keep track of the position of the first nonzero of each supernodal segment in U;
- one n-by-w integer array to temporarily store the row subscripts of the nonzeros filled in the panel;
- one n-by-w real array used as the sparse accumulator (SPA);
- one scratch space of size (t + b) × w to help the BLAS calls (see Figure 1 for the definition of t, b and w).

This amount of local storage must be multiplied by P, the number of threads created. Thus the working storage grows affinely with respect to P, and this algorithm, albeit efficient, is hard to scale up from a memory point of view.
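The accounting just listed can be summarized in a small sketch; the byte counts below follow the enumeration above under simplifying assumptions (the busy-column bit vector is charged n/8 bytes, and allocator overhead is ignored).

    #include <stddef.h>

    /* Rough working-storage estimate (in bytes) for the parallel factorization:
     * a shared part independent of the number of threads P, plus a per-thread
     * part proportional to P.  n = matrix dimension, p = number of panels,
     * w = panel width, t and b = 2-D supernode blocking parameters.            */
    size_t working_storage_bytes(int n, int p, int w, int t, int b, int P)
    {
        size_t ints = sizeof(int), dbls = sizeof(double);

        size_t shared = (size_t)p * ints            /* task queue                   */
                      + (size_t)n / 8 + 1           /* busy-column bit vector       */
                      + 4u * (size_t)n * ints       /* panel status arrays          */
                      + (size_t)n * ints            /* most distant busy descendant */
                      + 3u * (size_t)n * ints;      /* supernode storage layout     */

        size_t local  = 8u * (size_t)n * ints       /* panel/column DFS arrays      */
                      + 2u * (size_t)n * w * ints   /* U segment positions, fills   */
                      + (size_t)n * w * dbls        /* sparse accumulator (SPA)     */
                      + (size_t)(t + b) * w * dbls; /* scratch for BLAS calls       */

        return shared + (size_t)P * local;
    }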
To put this in perspective, Table 11 compares the working storage requirement with the actual LU storage. The last two columns report the amount of working storage as a fraction of the total LU storage (in Megabytes), for 1 and 8 threads, respectively. It is clear that for P = 8 the working storage requirement can be comparable to the LU storage for small problems; for large problems, the working storage is typically 10% to 20% of the LU storage. Matrix 13 is exceptionally bad: it is a matrix of medium size for which the required working storage exceeds the LU storage. Since we would not use multiple processors on the small problems anyway, the overall working storage requirement is quite small.

Table 11: Working storage requirement as compared with the storage needed for L and U. The blocking parameter settings are w = 8, t = 100, and b = 200.
7.5 A PRAM model to predict optimal speedup
Given a matrix with a fixed column ordering, we want to establish a performance model to estimate the maximum speedup attainable by our algorithm, and indeed to determine the limitations of algorithms based on partitioning a matrix by columns and using a column as a scheduling unit. Because of the various precedence constraints in the algorithm, some parts of the work must be finished before other parts can start. Thus the completion time of the parallel algorithm is constrained by the amount of work that must be done serially, i.e., the critical path. Our objective here is to give a lower bound on the parallel completion time.
In the model we make the following simplifying assumptions: (1) the work only includes floating-point operations, and each floating-point operation takes one unit of time; (2) there is an infinite number of processors, so whenever a task is ready there is a free processor to execute it immediately. Figure 20 shows the part of the DAG associated with a particular panel p.
Each T_mod and T_div is an indivisible task, carried out sequentially on one processor. Clearly, T_div cannot start until all the T_mod's have finished. By examining the precedence relations between these two types of tasks, we can determine the runtime of T_panel(p) on processor P. We try to schedule these tasks as early as possible, in order to derive the minimum parallel execution time.
We first look at the tasks associated with one particular panel p, as shown in Figure 20. Let EFT(i) denote the earliest finish time of the i-th updating supernode, and compare two ways of ordering the k T_mod tasks: Sched-A performs them in increasing order of EFT, while Sched-B uses some other order. Suppose Sched-B begins with a supernode i whose EFT(i) is later than EFT(1). That means processor P will be idle during the period LAG := EFT(i) - EFT(1). Thus the amount of time to finish all the T_mod's will be at least LAG + Σ_{i=1..k} t_mod(p, i).
On the other hand, under Sched-A at least some T_mod(p, j) with j < i have already been scheduled during the period LAG. Hence the amount of work left after time EFT(i) is less than the work left under Sched-B, and Sched-A gives a finishing time no later than Sched-B. There are several points worth noting in this model. First, because of numerical pivoting, we do not know the computational DAG in advance of the factorization; rather, the DAG is built incrementally as the factorization proceeds. Also, the floating-point operations associated with all the tasks are calculated on the fly, so this model gives an a posteriori estimate. Secondly, for each panel computation, the scheduling method of Sched-A requires sorting the EFT's of all the descendant supernodes that will update this panel. The cost associated with this sorting is prohibitively high, and so this method cannot be used to schedule panel updates in practice. Nevertheless, it gives us an upper bound on the theoretically attainable speedup.
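The lower bound itself amounts to a longest-path computation over the task DAG. The generic sketch below is not the paper's simulator and ignores the per-panel Sched-A refinement; it only illustrates the principle that, with unlimited processors and unit-time flops, each task can start as soon as its latest predecessor finishes, so the earliest finish time of the last task bounds the parallel completion time from below.

    /* Earliest-finish-time (critical path) over a task DAG given in topological
     * order.  work[v] is the task's floating-point operation count; the edges
     * pred_ptr/pred_ind list each task's predecessors.  Returns the length of
     * the critical path, i.e., a lower bound on parallel completion time under
     * the PRAM assumptions (unit-time flops, unlimited processors).            */
    double critical_path(int ntasks, const double *work,
                         const int *pred_ptr, const int *pred_ind, double *eft)
    {
        double longest = 0.0;
        for (int v = 0; v < ntasks; v++) {        /* topological order assumed */
            double start = 0.0;
            for (int q = pred_ptr[v]; q < pred_ptr[v + 1]; q++)
                if (eft[pred_ind[q]] > start) start = eft[pred_ind[q]];
            eft[v] = start + work[v];
            if (eft[v] > longest) longest = eft[v];
        }
        return longest;   /* predicted speedup = total work / longest */
    }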
In our algorithm, two parameters control the task granularity: the panel size w determines the amount of work in a T_div task, and both w and the maximum supernode size maxsup determine the amount of work in a T_mod task. Any large supernode whose size exceeds maxsup (such as in a dense matrix) is divided into smaller ones so that they fit in cache. Table 12 reports the predicted speedups when varying w and maxsup. For a fixed value of maxsup, the simulated speedups decrease with increasing w. For sequential SuperLU we find empirically that the best choice of w is between 8 and 16, depending on the matrix and the architecture. In the parallel setting, a smaller w, say between 4 and 8, seems to give the best overall performance.
This embodies an interesting trade-off between available concurrency and per-processor efficiency.
We now compare the results when fixing w but varying maxsup. In relatively sparser matrices, such as matrices 1-10, the actual sizes of supernodes may be much smaller than maxsup, so the performance for such matrices is not very sensitive to maxsup. However, for larger and denser matrices, larger values of maxsup result in poorer speedup. Finally, we note that the speedups for small matrices are very low, even with small values of w and maxsup. Fortunately, for large matrices such as 13-21, the predicted speedups are greater than 20 when w = 8 and maxsup = 32. These matrices perform more than one billion floating-point operations in the factorization, and it is these matrices that require parallel processing power. The current column-oriented algorithm is well suited for most of the commercially popular SMPs, because the number of processors on these systems is usually below 20.
The height of the column etree can also be used as a crude prediction of the parallel performance.
The height of a node i is defined as

    height(i) = 1                                        if i is a leaf node,
    height(i) = 1 + max { height(j) : j ∈ child(i) }     otherwise.

The height of the etree is the height of the root, which represents the longest path in the etree. The computation of all the nodes along this path must be performed in succession; therefore, the length of this critical path constrains performance. The last column of Table 12 shows the height of the etree divided by the total number of nodes n in the etree. The larger height/n is, the larger the fraction of panels that must be factorized in pipelined manner, resulting in poorer parallelism and more synchronization. For example, height/n for matrices 1, 3, 15 and 21 is rather large, which is consistent with their relatively low predicted speedups. However, we must note that the etree height alone is not an accurate measure of parallelism: both the dense matrix (18) and a tridiagonal matrix have height/n = 1.00, but the former possesses much more concurrency than the latter.
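The height can be computed in one bottom-up pass over a postordered etree, as in the following sketch (a hypothetical helper; parent[] encodes the column etree, with parent[j] == n for a root).

    #include <stdlib.h>

    /* Height of the column etree: height(leaf) = 1, otherwise 1 + max over
     * children.  parent[] must come from a postordered etree so that every
     * child precedes its parent.  Returns the height of the (tallest) root. */
    int etree_height(int n, const int *parent)
    {
        int *h = malloc(n * sizeof(int));
        int root_height = 0;
        for (int j = 0; j < n; j++) h[j] = 1;        /* every node starts as a leaf */
        for (int j = 0; j < n; j++) {                /* postorder: child before parent */
            if (parent[j] < n) {
                if (h[j] + 1 > h[parent[j]]) h[parent[j]] = h[j] + 1;
            } else if (h[j] > root_height) {
                root_height = h[j];                  /* j is a root */
            }
        }
        free(h);
        return root_height;
    }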
The actual speedups achieved are much lower than the upper bounds predicted by the PRAM model (Figures 11 through 15). This is because the model does not capture the details of the machines and the implementation, such as cache behavior, synchronization, etc. However, we do see a similar shape of speedup curves. For example, the model predicts that matrices 15 and 21 have lower speedups compared with the other large matrices. In reality, these two matrices perform worse than the others. The poor performance is primarily due to two factors: (1) the column etree is tall, and contains substantial false dependencies; (2) the dynamic algorithm is needed to allocate memory for the supernodes (Section 5.3), because the static upper bound on the supernode storage is too large for these two problems (Table 6).
8 Conclusions
We have designed and implemented a parallel algorithm for shared memory multiprocessors of modest size. The efficiency of the algorithm has been demonstrated on several parallel machines. Figure 21 shows the speedups on 8 processors of three parallel machines. Figures 22 through 25 summarize the factorization rate in Megaflops for six large matrices, with increasing numbers of processors. We believe these large problems are the primary candidates to be solved on parallel machines. In fact, the largest one in our test suite takes a little more than 0.5 GBytes of memory, far less than most parallel machines offer. Our algorithm is expected to work well for even larger problems.
For a realistic problem arising from a 3-D flow calculation (matrix 20), our parallel algorithm achieves 23%, 33%, 25%, and 17% of peak floating-point performance on the 12-CPU Power Challenge, the 8-CPU Cray C90, the 16-CPU J90, and the 8-CPU AlphaServer 8400, respectively; the corresponding Mflop rates are 1002, 2583, 831 and 781. These are the fastest results for unsymmetric LU factorization on these powerful high-performance machines. Previous results showed much lower factorization rates, because the machines used were relatively slow and the computational kernel in the earlier parallel algorithms was based on Level 1 BLAS. The closest work is the parallel symmetric-pattern multifrontal factorization by Amestoy and Duff [1], also on shared memory machines. However, that approach may produce too many nonzeros and so be inefficient for unsymmetric-pattern sparse matrices.
Another contribution is the detailed performance analysis and modeling of the underlying algorithm. In particular, we identified the three main factors limiting parallel performance: (1) contention in accessing critical sections, (2) processors sitting idle due to pipeline waiting, and (3) the need to sacrifice some per-processor efficiency in order to gain more concurrency. Which factor plays the most significant role depends on the relative performance of integer and floating-point arithmetic in the underlying architecture.
We have developed a theoretical model to analyze our parallel algorithm and predict the optimally attainable speedup. When comparing the theoretical prediction (Table 12) with the actual speedups (Figure 21), we find that there is a discrepancy between the two. This is because our hypothetical machine and the optimal scheduling used in the model do not capture all the details of a real machine with real scheduling. Nevertheless, we do see similar behavior in the predicted and actual speedups: for the matrices with lower predicted speedups, such as 11, 15, 18 and 21, the actual speedups are also lower. The model is a useful tool to help identify the inherently sequential problems with bad column orderings. The model also suggests that the panel-wise parallel algorithm, although efficient on small-scale SMPs, cannot effectively utilize more than 50 processors.
We plan to expand this research in several directions. We will study a more scalable algorithm for larger parallel machines. This algorithm is likely to partition the matrix by both rows and columns, and to schedule blocks of submatrices onto processors. This will potentially increase parallelism and reduce the panel-update pipeline waiting time. In the framework of SuperLU, both serial and parallel, we will also investigate incomplete LU factorizations, which can be used as a class of preconditioners for unsymmetric sparse iterative solvers.

Table 17: Speedup, factorization time and Mflop rate on a 16-CPU Cray J90.
