We describe the improvements to the task scheduling for MUMPS, an asynchronous distributed memory direct solver for sparse linear systems. In the new approach, we determine, during the analysis of the matrix, candidate processes for the tasks that will be dynamically scheduled during the subsequent factorization. This approach signi cantly improves the scalability of the solver in terms of execution time and storage. By comparison with the previous version of MUMPS, we demonstrate the e ciency and the scalability of the new algorithm on up to 512 processors. Our test cases include matrices from regular 3D grids and irregular ones from real-life applications.
Introduction
We consider the direct solution of sparse linear systems on distributed memory computers. Two state-of-the-art codes for this task, MUMPS and SuperLU, have been extensively studied and compared by Amestoy, Du , L'Excellent and Li (2001b) . Speci cally, the authors show that on a large number of processors, the scalability of the multifrontal approach used by MUMPS Koster 2001a, Amestoy et al. 2001b ) with respect to computation time and use of memory could be improved. This observation is the starting point for this current work.
The solution of a linear system of equations using MUMPS consists of three phases. In the analysis phase, the matrix structure is analysed and a suitable ordering and data structures for an e cient factorization are produced. In the subsequent factorization phase, the numerical factorization is performed. The nal solve phase computes the solution of the system by forward and backward substitution using the factors that were just computed.
The numerical factorization is the most expensive of these three phases, and we now describe how parallelism is exploited in this phase. The task dependency graph of the multifrontal factorization is a tree, the so-called assembly tree. A node of this tree corresponds to the factorization of a dense submatrix (the frontal matrix), and an edge from one node to another describes the order in which the corresponding submatrices can be factorized. In particular, independent branches of the assembly tree can be factorized in parallel as the computations associated with one branch do not depend on those performed in the others. Furthermore, each node in the tree can itself be a source of parallelism. The ScaLAPACK library (Choi, Demmel, Dhillon, Dongarra, Ostrouchov, Petitet, Stanley, Walker and Whaley 1996) provides an e cient parallel factorization of dense matrices and is used for the matrix associated with the root of the assembly tree. But MUMPS o ers another possibility for exploiting parallelism for those nodes that are large enough. Such nodes can be assigned a master process during analysis that chooses, during numerical factorization, a set of slave processes to work on subblocks of the dense matrix. This dynamic decision about the slaves is based on the load of the other processors, only the less loaded ones are selected to participate as slaves.
In order to address the scalability issues, we have modi ed this task scheduling and the treatment of the assembly tree during analysis and factorization. We now give a brief description of these new modi cations to Version 4.1 of MUMPS (to which we sometimes refer as the old code or the previous version of MUMPS).
The objective of the dynamic task scheduling is to balance the workload of the processors at run time. However, two major problems arise from o ering too much freedom to the dynamic scheduling. In the previous version of MUMPS, a master process is free to choose its slaves among all available processes. Since this choice is taken dynamically during the factorization phase, we have to anticipate it by providing enough memory on every process for the corresponding computational tasks. Since typically not all processes are actually used as slaves (and, on a large number of processors, often only relatively few are needed), the prediction of the required workspace will be overestimated. Thus, the size of the problems that can be solved is reduced unnecessarily because of this di erence between the prediction and the allocation of memory by the analysis phase and the memory actually used during the factorization. Secondly, decisions concerning a node should take account of global information in the assembly tree to localize communication. For example, by mapping independent subtrees to disjoint sets of processors so that all data movements related to a subtree are performed within the set, we can improve locality of communication and increase performance.
With the concept of candidate processors, it is possible to guide the dynamic task scheduling and to address these issues. The concept originates in an algorithm presented by Pothen and Sun (1993) and has also been used in the context of static task scheduling for sparse Cholesky factorization (Henon, Ramet and Roman 2002) . In this paper, we show how it also extends e ciently to dynamic scheduling. For each node that requires slaves to be chosen dynamically during the factorization, we introduce a limited set of processors from which the slaves can be selected. While the master previously chose slaves from among all less loaded processors, the freedom of the dynamic scheduling is reduced so that the slaves are only chosen from among the candidates. This allows us to exclude all non-candidates from the estimation of workspace during the analysis phase and leads to a more realistic prediction of the workspace needed. Furthermore, the candidate concept allows us to structure the computation better since we can explicitly restrict the choice of the slaves to a certain group of processors and enforce for example a`subtreeto-subcube' mapping principle (George, Liu and Ng 1989) . (Throughout this paper, we assume that every processor has one single MPI process associated with it so that we can unambiguously identify a processor and a corresponding MPI process.)
We illustrate the bene ts of the new approach by tests using a number of performance metrics including execution time, memory usage, communication volume, and scalability. Our results demonstrate signi cant improvements for all these metrics, in particular when performing the calculations on a large number of processors.
The rest of this paper is organized as follows. In Section 2, we review brie y the general concepts of the multifrontal direct solution of sparse linear systems. We introduce the assembly tree as a model for the tasks and task dependencies. We describe in Section 3 the possibilities for exploiting parallelism. We then introduce, in Section 4, the concept of candidate processors. In Section 5, we give an overview of how the candidate concept ts into the scheduling algorithm and present the algorithmic details in Section 6. Section 7 gives an overview of the test problems used in this paper. The presentation of our experimental results begins with parameter studies and detailed investigations of the improved algorithms in Section 8. Afterwards, we present a systematic comparison of the previous with the new version of the code on regular grid problems and general matrices in Section 9. Finally, we discuss possible extensions of our algorithm in Section 10 and present our conclusions and a brief summary in Section 11.
Tasks and task dependencies in the multifrontal factorization
We consider the direct solution of large sparse systems of linear equations Ax = b on distributed memory parallel computers using multifrontal Gaussian elimination. For an unsymmetric matrix, we compute its LU factorization; if the matrix is symmetric, its LDL T factorization is computed.
The multifrontal method was initially developed for inde nite sparse symmetric linear systems (Du and Reid 1983) and was then extended to unsymmetric matrices (Du and Reid 1984) . Because of numerical stability, pivoting is required in these cases in contrast to symmetric positive de nite sparse systems where pivoting can be avoided. We are concerned with general unsymmetric and symmetric inde nite matrices in the following, for an overview of the multifrontal method for symmetric positive de nite systems we refer to Du and Reid (1983) , Du , Erisman and Reid (1986) , and Liu (1992) .
In this section, we describe the tasks arising in the factorization phase of a multifrontal algorithm. Speci cally, we investigate the work associated with the factorization of individual frontal matrices and the order in which these factorizations can be performed.
The so-called elimination tree Reid 1983, Liu 1990) represents the order in which the matrix can be factorized, that is, the order in which the unknowns from the underlying linear system of equations can be eliminated. For a dense matrix, the elimination tree is a chain and de nes a complete ordering of the eliminations. However, for a general sparse matrix, the de nition yields only a partial ordering which allows some freedom for the sequence in which pivots can be eliminated.
One central concept of all modern sparse direct solvers and, in particular, the multifrontal approach is to group (or amalgamate) columns with the same sparsity structure to create bigger supervariables or supernodes Reid 1983, Liu, Ng and Peyton 1993) in order to make use of e cient dense matrix kernels. We will discuss later the advantages and dangers of amalgamation in the context of a distributed memory multifrontal code. We mention here that it is common to relax the criterion for amalgamation and permit the creation of coarser supernodes with extra ll-in that, however, improve the performance of the factorization Reid 1983, Ashcraft and Grimes 1989) . The amalgamated elimination tree is called the assembly tree.
We now investigate more closely the work associated with the factorization of the frontal matrix at an individual node of the assembly tree. Frontal matrices are considered as dense matrices and we can make use of the e cient BLAS kernels and avoid indirect addressing, see for example Dongarra, Du , Sorensen and van der Vorst (1998) . Frontal matrices can be partitioned as shown in Figure 2 is computed and used to update later rows and columns of the overall matrix which are associated with the parent nodes. We call this Schur complement matrix the contribution block of the node. The notion of children nodes which send their contribution block to their parents leads to the following interpretation of the factorization process. When a node of the assembly tree is being processed, it assembles the contribution blocks from all its children nodes into its frontal matrix. Afterwards, the pivotal variables from the fully summed block are eliminated and the contribution block computed. The contribution block is then sent to the parent node to be assembled once all children of the parent (which are the siblings of the current node) have been processed.
We remark that possibly some variables cannot be eliminated safely from a frontal matrix because of possible numerical instability. In this case, their elimination will be delayed until stable pivots can be found. The corresponding fully summed rows and columns are added to the contribution block and are assembled at the parent node. The contribution block is then larger than was predicted during the analysis phase, and the data structure used in the factorization needs to be modi ed dynamically.
3 Parallelism in the multifrontal factorization
In the following, we identify di erent sources of parallelism in the multifrontal factorization and describe how these are exploited in MUMPS (Amestoy et al. 2001a ).
The di erent types of parallelism
In Section 2, we mentioned that the tasks of multifrontal Gaussian elimination for sparse matrices are only partially ordered and that the task dependencies are represented by the assembly tree. A pair of nodes where neither is an ancestor of the other can be factorized independently from each other, in any order or in parallel. Consequently, independent branches of the assembly tree can be processed in parallel, and we refer to this as tree parallelism or type 1 parallelism.
A fundamental concept for the complete static mapping of assembly trees from model grid problems, the subtree-to-subcube mapping, was given by George, Liu and Ng (1989) . This algorithm was then generalized for problems with irregular sparsity structure and unbalanced trees to the bin-pack mapping scheme by Geist and Ng (1989) and the proportional mapping approach by Pothen and Sun (1993) . We will describe these algorithms in Section 3.2.
It is obvious that, in general, tree parallelism can be exploited more e ciently in the lower part of the assembly tree than near the root node. Experimental results presented by Amestoy, Du and L'Excellent (2000) showed a typical speedup obtained from tree parallelism of less than ve on 32 processors. These results are related to the observation (Amestoy and Du 1993 ) that often more than 75% of the computations are performed in the top three levels of the assembly tree where tree parallelism is limited. For better scalability, additional parallelism is created from parallel blocked versions of the algorithms that handle the factorization of the frontal matrices.
The computation of the Schur complement of frontal matrices with a large enough contribution block can be performed in parallel using a Master-Slave computational model. The contribution block is partitioned and each part of it assigned to a slave. The master processor is responsible for the factorization of the block of fully summed variables and sends the triangular factors to the slave processors which then update their own share of the contribution block independently from each other and in parallel. We refer to this approach as type 2 parallelism and call these nodes type 2 nodes.
Furthermore, the factorization of the dense root node can be treated in parallel with ScaLAPACK (Choi et al. 1996) . The root node is partitioned and distributed to the processors using a 2D block cyclic distribution. We refer to this as type 3 parallelism. Note that a 2D distribution could also be used for frontal matrices other than the root node, but this is not exploited in MUMPS.
MUMPS performs the factorization of the pivot rows of a frontal matrix on a single processor. This can lead to performance problems, particularly if the frontal matrix has a large block of fully summed variables and only a relatively small contribution block. However, if the front size is big enough, it is possible to create arti cial type 2 parallelism by splitting the pivot block, see Amestoy et al. (2001a) for a discussion of splitting.
Parallel task scheduling: main principles
From the point of view of scheduling, the di erent types of parallelism vary in their degree of di culty. Apart from looking at each type of parallelism individually, it is also necessary to investigate their interaction. The main objectives of the scheduling approaches are to control the communication costs, and to balance the memory and computation between the processors. We describe in this section the techniques implemented in Version 4.1 of MUMPS which is described by Amestoy et al. (2000) and Amestoy et al. (2001a) , and which has been extensively tested and compared with SuperLU (Amestoy et al. 2001b) and WSSMP (Gupta 2002) . We also present the proportional mapping of Pothen and Sun (1993) from which we develop, in Section 4, our idea of the candidate-based scheduling that is used in the new version of MUMPS.
Geist-Ng mapping and layers in the assembly tree
We mentioned in Section 3.1 that, in general, only the lower part of an assembly tree can be exploited e ciently for tree parallelism. Our previous scheduling approach consists therefore of two phases. At rst, we nd the lower part of the assembly tree where enough tree parallelism can be obtained. Afterwards we process the remaining upper part of the tree additionally exploiting type 2 and type 3 parallelism.
The mapping algorithm by Geist and Ng (1989) allows us to nd a layer in the assembly tree so that the subtrees rooted at the nodes of this layer can be mapped onto the processors for a good balance with respect to oating-point operations. Processor communication is avoided by mapping each subtree completely to a single designated processor. We call the constructed layer L 0 . It marks the boundary between the lower part where scheduling exploits only tree parallelism (type 1), and the upper part where all three types of parallelism are used.
We consider the following top-down tree-processing approach (Geist and Ng 1989) . We take as a potential layer L 0 the root nodes (or the root node, for an irreducible matrix) of the assembly tree. We check whether the nodes can be mapped onto the processors so that the load on each processor is balanced up to a threshold. In general, this will not be the case, in particular if the number of processors used for the factorization is larger than the number of root nodes. We then modify the potential layer L 0 by replacing the node whose subtree has largest computational cost by its children. Again, we check if the mapping of the new layer is balanced up to the threshold, otherwise we repeat the previous substitution step for the node which has now the highest computational subtree costs, and so forth. The algorithm stops once the nodes in the potential layer L 0 allow a threshold-balanced mapping. Intuitively, we can think of the algorithm descending down the assembly tree, as illustrated in Figure 3 .1. As the nodes in one layer can be only processed if all their children, belonging to the lower layers, have already been treated, the layer partition not only represents dependency but also concurrency of the multifrontal factorization. An example is shown in Figure 3 
The proportional mapping of Pothen and Sun
The proportional mapping approach of Pothen and Sun (1993) represents an alternative approach to task scheduling in both regular and possibly irregular assembly trees. It consists of a recursive assignment of processors to subtrees according to their associated computational work.
The assembly tree is processed from top to bottom, starting with the root nodes. For each root node, we calculate the work associated with the factorization of all nodes in its subtree, and the available processors are distributed among the root nodes according to their weight. Each node thus gets its set of so-called preferential processors. The same partitioning is now repeated recursively. The processors that have been previously assigned to a node are now distributed among the children proportional to their weight given by the computational costs of their subtrees. The recursive partitioning stops once a subtree has only one processor assigned to it.
A main bene t of the proportional mapping is that communication is e ectively localized within the processors assigned to a given subtree, with the partitioning guided from a global point of view taking account of the weight of subtrees. An illustration of the proportional mapping algorithm is given in Figure 3 .3.
P 5 P 6 7 P P 8 P 1 P 2 P 3 P 4 P 5 P 6 7 P P 8 P 4 P 5 P 6 7 P P 8 P 1 P 2 P 3 P 1 P 2 7 P P 8 P 6 P 5 While the proportional mapping approach has not been previously used in MUMPS, we mention it here due to its central importance for other sparse linear solvers like PaStiX (Henon et al. 2002) and because the concept of candidate processors for type 2 parallel nodes presented later exploits this idea.
Dynamic task scheduling for type 2 parallelism
It is possible to extend the static mapping to the tasks arising in the Master-Slave computational model for the factorization of type 2 parallel nodes. However, the static mapping is performed during the analysis phase on the basis of estimated costs of computational work and communication. These estimates can be inaccurate if pivots have to be delayed for numerical reasons. For a better equilibration of the actual computational work at run time, both the number and the choice of the slaves of type 2 nodes are determined dynamically during factorization as described by Amestoy et al. (2000) and Amestoy et al. (2001a) . When the master of a type 2 node receives the symbolic information on the structure of the contribution blocks of the children, the slaves for the factorization are selected based on their current workload, the least loaded processors being chosen. The master then informs the processes handling the children nodes which slaves are participating in the factorization of the node so that they can send the entries in their contribution blocks directly to the appropriate slaves.
The previous version of MUMPS exploits type 2 parallelism above layer L 0 as follows.
If a node possesses a contribution block larger than a given threshold and the number of eliminated variables in its pivot block is large enough, then it will be declared a type 2 node and will be involved in the dynamic decision to schedule new activities. In the new version of MUMPS, we leave this concept principally unchanged; however, we restrict the freedom for the dynamic choice of the slaves. While, in the earlier algorithm, potentially every processor could be chosen as a slave during run time, in the new approach we restrict this selection to the candidates that have been chosen for a given node during the analysis phase. This is explained in detail in Section 4.2.
4 Combining the concept of candidates with dynamic task scheduling
The dynamic choice of the slaves of type 2 nodes during the factorization phase is an attempt to detect and adjust an imbalance of the workload between the processors at run time. It was shown to work very well on a small to medium (64) number of processors (Amestoy et al. 2001a , Amestoy et al. 2001b ). However, the straightforward extension of this technique to a large number of processors often o ers more freedom to the dynamic scheduling than can be exploited e ectively. In this section, we rst give a more detailed illustration of these shortcomings, and then propose as a solution an algorithm that exploits the concept of candidate processors.
Issues of dynamic scheduling
The rst issue of dynamic scheduling concerns the memory management. In MUMPS, the amount of memory needed for each processor is estimated during the analysis phase and is reserved as workspace for the factorization. Consequently, if every processor can be possibly taken as a slave of a type 2 node, then enough workspace has to be reserved, on each processor, during the analysis phase for the potential corresponding computational task. However, during the factorization, typically not all processors are actually used as slaves. This leads to a severe overestimation by the analysis phase of the required work space with the possible consequence of exhausting the memory available on the processors. Secondly, the choice of the slaves is completely local. When a type 2 node is to be processed, its master greedily takes the slaves that seem best to it; those processors that are less loaded (with respect to the number of oating-point operations) than itself at the time of the scheduling decision are selected as slaves. Thus, the decision about the slaves depends crucially on the instant when the master chooses the slaves (locality in time). Furthermore, no account is taken of other type 2 nodes in the tree that have to be processed (locality in space). Instead of sharing the available slaves so that other nodes can be processed in parallel, a master might decide to take all of them, hindering the work on the other type 2 nodes and the treatment of other branches of the assembly tree. Furthermore, it is not possible in this approach to guarantee any locality of communication and data movement as, in principle, every processor can work on any type 2 node in the assembly tree. However, controlling locality is of great importance for modern computer architectures, for example SMPs like the IBM SP, where, for an MPI programming model, data movement within the shared memory of a node is cheap compared to communication across nodes.
Candidate processors for type 2 parallel nodes
In the following, we present a concept of candidate processors that naturally addresses the issues raised in Section 4.1. For each type 2 node that requires slaves to be chosen dynamically during the factorization because of the size of its contribution block, we introduce a limited set of processors from which the slaves can be selected. While the master previously chose slaves from among all less loaded processors, slaves are now only chosen from this list of candidates. This e ectively allows us to exclude all non-candidates from the estimation of workspace during the analysis phase and leads to a tighter and more realistic estimation of the workspace needed. Secondly, we can expect a performance gain in cases as described in the previous section where greedy decisions of one type 2 master can no longer hinder processors from processing another node.
The candidate concept can be thought of as an intermediate step between full static and full dynamic scheduling. While we leave some freedom for dynamic decisions at run time, this is guided by static decisions about the candidate assignment during the analysis phase. We refer to Section 6.6 for a full description of the algorithmic details.
In Section 3.2.1, we described the layer structure of the assembly tree. As each layer of the assembly tree represents a view of concurrent execution, all type 2 nodes on the same layer are potential rivals for slave processors, see Section 4.1. By assigning the candidates to all type 2 nodes of a given layer simultaneously, we avoid isolated treatment of nodes and direct our candidate concept from a global view of complete layers.
The assignment and the choice of the candidate processors is guided using a proportional mapping as described in Sections 3.2.2 and 5.1. We partition the set of processors recursively, starting from the root, so that for each subtree there is a well de ned subset of preferential processors which guides the selection of the candidates.
With this approach, we achieve Locality of communication as we limit the communication to those processors belonging to the subtree. Independence of computation as we limit the interaction of the processing of one subtree with the treatment of another independent one.
Task mapping and task scheduling in MUMPS
In this section, we give a generic description of the algorithm used by Version 4.1 of MUMPS (Amestoy et al. 2001a , Amestoy et al. 2001b ) and discuss in general terms our improvements to it as they have been integrated into the new version. A detailed discussion of the key modi cations is given in Section 6. We speak, in the following, of task mapping when we refer to the assignment of master processors and candidates during the analysis phase, and of task scheduling when we refer to the dynamic choice of type 2 slaves during the factorization phase.
Task mapping algorithm during the analysis phase
We consider the task mapping during the analysis phase and compare the previous version with the new version of MUMPS. A rst major point to emphasize is the greater exibility and adaptivity of the new algorithm when mapping the upper part of the assembly tree (that is, above layer L 0 ). The former version, shown in Algorithm 1, performs a simple mapping of only the master nodes, while the new version, shown in Algorithm 2, treats the upper part layer-wise, mapping both master nodes and type 2 candidates. Using a layerwise approach we take better account of the task dependency that will control the later factorization phase and, by analysing the quality of mapping decisions taken on previous layers, we can try to correct problems by in uencing the mapping of the current layer. This adaptivity was conceptually impossible in the old mapping algorithm. The second contribution of the new algorithm is of course the added features. A very important feature is the candidate concept guided by a proportional mapping partition of the processors. Furthermore, we have added to the treatment of each layer a preprocessing step that performs amalgamations and node splitting. Moreover, we have improved the construction of layer L 0 for better memory scalability. Lastly, we treat memory imbalances due to type 2 node mapping using a post-processing step.
We now present in more detail the previous version of the task mapping (Algorithm 1) and compare it afterwards with the new one, Algorithm 2.
Algorithm 1 Old task mapping algorithm.
(1) Given the assembly tree of a sparse matrix A (2) Build and map initial layer L 0 (3) Decide type of parallelism for nodes in upper part of tree (4) Map master nodes of upper part of tree The starting point (1) of the original algorithm is the assembly tree that was constructed from the elimination tree of a given sparse matrix using basic amalgamation and node splitting. From this assembly tree, the algorithm constructs, in step (2), an initial layer L 0 following the Geist-Ng approach presented in Section 3.2.1 with the objective of balancing the work between the processors. Afterwards, it is decided for which nodes type 2 or type 3 parallelism is exploited (3), and nally the masters of all nodes above layer L 0 are mapped (4) with the objective of balancing the memory. The choice of the slave processors for the type 2 nodes is left entirely to the dynamic scheduler during factorization, see Section 5.2.
The starting point (1 0 ) of the new algorithm is the same assembly tree as for the old approach (1). In step (2 0 ), we calculate a variant of the proportional mapping as introduced in Section 3.2.2 and whose algorithmic description is given later in Section 6.1. For each Algorithm 2 New task mapping algorithm.
(1 0 ) Given the assembly tree of a sparse matrix A (2 0 ) Calculate relaxed proportional mapping, i.e. the preferential processors (3 0 ) Build and map modi ed initial layer L 0 current layer = 1 while there exist unmapped nodes on or above current layer do (4 0 ) Perform tree modi cations if necessary (5 0 ) Decide type of parallelism for the nodes on current layer (6 0 ) Map the tasks associated with the nodes on current layer current layer = current layer + 1 end while (7 0 ) Post-processing of the candidate selection to improve memory balance node in the assembly tree, we obtain a set of preferential processors that will guide the selection and mapping of the candidate processors in step (6 0 ).
Step (3 0 ) di ers from the corresponding step (2) in the old algorithm insofar as the constructed initial layer has one additional property. Not only can it be mapped so that the computational work is balanced between the processors, but we also control better the memory demands of the subtree roots, see Section 6.2 for the details.
Step (4 0 ) performs amalgamations and node splitting to improve the nodes of the current layer. In step (5 0 ), we decide which type of parallelism we exploit for the nodes of the current layer. Nodes are of type 2 when their contribution block is large enough, that is greater than a minimum block size. The list of tasks associated with the current layer includes the masters for the type 1 and type 2 nodes, and the type 2 candidates which are derived from the proportional mapping (2 0 ), see Section 6.1. For the task mapping, we use a list scheduling algorithm that is described in Section 6.4. The main di erence of the new mapping (6 0 ) from the old one (4) is that we now pre-assign candidate processors for the type 2 nodes while in the former version, every processor was a potential type 2 slave. The post-processing step (7 0 ) a ects mostly the LU factorization. Because the op-based equilibration of the mapping step (6 0 ) can lead to a particularly bad memory balance, we perform a remapping of the type 2 masters for improved memory balance, as described in Section 6.5.
Task scheduling during the factorization phase
In this section, we describe the task management of a processor during the factorization phase. Here, the old and the new version of the algorithm di er only in the way the type 2 slaves are chosen. In the previous algorithm, every processor was a potential slave for a type 2 node, whereas now only the candidates can be selected to work on the parallel update of the contribution block.
The task pool (1) of a processor can contain the following tasks: master of a type 1 node, master of a type 2 node, or slave of a type 2 node. MUMPS uses a stack as the data structure for the task pool; the processor adds new tasks (3) or extracts them from the pool (4), respectively. If, during the factorization, the task pool of the processor is empty, it will wait until it receives new tasks and then re-enter loop (2). If the processor works Algorithm 3 Dynamic task scheduling performed on each processor during the factorization.
(1) Given the task pool of one processor while (5) before it starts the elimination of the pivotal block (6). Otherwise, if the processor is a type 1 master or a type 2 slave, it begins directly with the pivot elimination or the contribution block update, respectively (6).
In the new version of the algorithm, only step (5) is modi ed to ensure that the type 2 slaves are selected from among the candidates allocated for the type 2 node. We give the details of the algorithm for choosing the slaves in Section 6.6.
Details of the improved task mapping and scheduling algorithms
After the general discussion, in Section 5, contrasting the task mapping and scheduling for the old and new versions of MUMPS, we now describe the key points of the new algorithm in detail.
The relaxed proportional mapping
We give below, in Algorithm 4, an algorithmic description of one step of the proportional mapping presented in Section 3.2.2. The preferential processors given to a node are distributed among its children according to their weights. Note that we can relax the strict proportional mapping by multiplying the number of preferential processors n a by a relaxation factor 1 in step (2).
In step (1), we calculate the relative costs c r (s) of a child s; s 2 fs 1 ; : : : ; s i g from the costs c(s) for the factorization of all nodes in the subtree rooted at s as c r (s) = c(s) P i k=1 c(s k ) :
(6.1)
From the relative weight of child s, we obtain its share of preferential processors in step (2) that can be relaxed by the factor . A fundamental property of the proportional Algorithm 4 One step of proportional mapping.
Given a node n with preferential processors p 1 ; : : : ; p n a (n) and children s 1 ; : : :; s i for each child s of n do
(1) Calculate relative costs c r (s) of child s, 0 c r (s) 1 (2) Calculate number of preferentials n a (s) = max (1; min f c r (s) n a (n); n a (n)g) for child s end for (3) Cyclic assignment of the preferential processors for all children s 1 ; : : :; s i mapping is that the preferential processors of a child s are a subset of the preferential processors of its parent node n. This is ensured because n a (s) n a (n) in (2) and we always choose from the preferential processors of the parents at step (3). Here, we also make sure that each child has at least one processor, even if its cost is negligible. After we have calculated the number of preferential processors for all children, in step (3) we distribute the processors p 1 ; : : : ; p n a (n) among the children. If the proportional mapping is strict ( = 1) and the number of preferential processors calculated from the relative weight c r (s) in equation (6.1) is an integer, each processor is assigned exactly once. Otherwise, and in particular if the proportional mapping is relaxed with > 1, processors can become preferential for more than one child. Consequently, large values of will dilute the strict partition of the processors so that in the extreme case, ! 1, all processors become preferential for each node.
The Geist-Ng construction of layer L 0
We now give an algorithmic description of the construction of the initial layer L 0 that extends the Geist-Ng approach presented in Section 3.2.1.
Algorithm 5 The Geist-Ng algorithm.
(1) Let L 0 contain all root nodes of the assembly tree Starting with a potential layer L 0 consisting of the root nodes of the assembly tree (1), we rst compute (2) a mapping of L 0 with the list scheduling heuristics described in Section 6.4. The former criterion for accepting the layer in step (3) demands that the load imbalance between the processors is smaller than a threshold. Here, the work associated with a node in L 0 is de ned as the costs for computing the factors of the subtree rooted at the node and can be estimated during the analysis phase. If the mapping of layer L 0 is not acceptable, then the node with the highest costs is eliminated from the layer and replaced by its children (4, 5). A new mapping is computed (6) with the same algorithm as in (2).
The main problem of the algorithm is that balancing the computational work does not necessarily imply balancing the memory. Consider a node with a very small number of pivots but a big contribution block. The costs for the factorization depend mainly on the size of the pivotal block and are small, while the memory required to stack the contribution block is large. In order to take care of such situations, we propose the following approach.
If a node with a large contribution block was in the upper part of the tree above L 0 , it could either be amalgamated with its parent or become a type 2 node, and, in both cases, the memory problems would vanish. Thus, by eliminating such nodes from layer L 0 and replacing them by their children, we control the memory required for the subtrees in addition to balancing the work on layer L 0 . The idea is to treat critical nodes in the latter part of the algorithm by moving them into the upper part of the tree.
Summarizing, we modify the criterion of acceptability (3) to demand that both the load imbalance for the mapping of L 0 is smaller than a threshold and that L 0 contains no nodes that would need to be amalgamated.
6.3 Choosing the number of candidates for a type 2 node
We now describe the role of the proportional mapping for the decision of which processors become candidates for a type 2 node. Our approach consists of two steps. For a given layer, we rst determine for each type 2 node the number of candidate processors. In a second step, we choose the candidates from the available processors. (Thus, for a given node n we determine rst an integer number n c (n) that determines how many candidate processors p 1 ; : : : ; p n c (n) have to be chosen in the second step.) The reason for this approach is the following. The selection of a candidate processor is conceptually similar to the selection of the master processors for the type 1 and type 2 nodes; all these are tasks that need to be mapped onto the set of processors. In Section 6.4, we describe the algorithm that we use to map the tasks associated with one layer in the assembly tree. By mapping the master and candidate processors together, we hope to obtain better load balancing.
Algorithm 6 Determining the number of candidates using the preferentials.
Given a layer in the assembly tree for each Type 2 node n with n a (n) preferential processors in the layer do
(1) Determine the number of candidates by n c (n) = n a (n).
end for (2) OPTIONAL: Redistribute the total number of candidates of the layer among the layer's type 2 nodes according to their relative weights.
We have experimented with two di erent ways for determining the number of candidates for a given type 2 node and describe these in Algorithm 6. In the rst approach, we select its preferential processors as candidates, thus setting the number of candidates equal to the number of preferentials. We emphasize that this approach is not equivalent to a relaxed proportional mapping as the candidates are only potential slaves for the factorization. In the second approach, we employ an additional post-processing step where we redistribute the candidates of the layer according to the relative weight of the nodes. As described in Section 3.2.2, the proportional mapping is calculated from the costs of complete subtrees, not individual nodes. So it might happen that a small node has a large number of preferentials because it is the root of a large subtree, while a relatively large node on the same layer has only a small number of preferentials. In order to correct this, we can reassign candidates from small type 2 nodes as candidates of large type 2 nodes on the same layer by the optional step in Algorithm 6.
Layer-wise task mapping
The algorithm that we use for the mapping of the tasks of each layer is a variant of the well known list scheduling algorithm (Hochbaum 1996) where we rst make a list of the tasks sorted by decreasing costs, and then maps the tasks in this order one after another to the processor that has the least work assigned so far. We remark that this heuristic can be proved to construct a schedule whose total makespan (that is, the time by which all jobs complete their processing) never exceeds twice the makespan of an optimal schedule (Hochbaum 1996).
As described in the previous sections, the tasks associated with a layer can include the following:
The masters of all type 1 and type 2 nodes. For each type 2 node, the number of candidate processors determined using the node's preferential processors, see Section 6.3. The type 3 parallel node.
In the case of layer L 0 , we employ the original list scheduling algorithm (Hochbaum 1996), however, for all upper layers L 1 ; L 2 ; : : : our algorithm is more complicated for two reasons. Firstly, we want to guide mapping decisions by the proportional mapping representing a global view of the tree. Secondly, we have to take care of constraints that arise either from explicit user-given limits on memory or work for each processor, or implicitly from the fact that any two candidate processors or any candidate and the master of a type 2 node have to be di erent from each other.
Algorithm 7 Generic mapping algorithm.
(1) Create an ordered task list while Task list not empty do (2) Extract the next task t i from the list (3) Make a preference list for the processors while Task t i not mapped to a processor do (4) Try to map t i to next processor from the preference list end while end while
The rst two steps (1) and (2) of Algorithm 7 are identical to the original list scheduling approach: we create a list of all tasks that have to be mapped on the layer, that is, the work of the type 1 node masters, of type 2 node masters, and of type 2 node candidates (which have been obtained from the proportional mapping, as described in Section 6.1).
This list is then ordered by decreasing costs and the tasks are mapped in the order that they appear in the list.
Steps (3) and (4) are the generalization of the idea of mapping to the least loaded processor. In order to take account of the proportional mapping, we can simply propose mapping the task on the least loaded of the preferential processors coming from the proportional mapping. However, this is actually too simple as the mapping constraints for type 2 nodes that we mentioned above have to be respected. Our solution is that we create a preference list containing all the processors, at rst the preferential ones ordered by decreasing workload and then the non-preferential ones ordered separately, also by decreasing workload. The rst processor in the preference list that doesn't violate the mapping constraints will be the one to which the task is mapped.
6.5 Post-processing of the assembly tree for an improved memory balance in the LU factorization . The mapping algorithm from Section 6.4 tries to balance the work between the processors. However, there is an important di erence between symmetric and unsymmetric factorization with respect to memory. In the LDL T factorization, the master of a type 2 node only holds the pivotal block whereas, in the LU factorization, the master stores the complete fully summed rows. The additional memory that a master requires for storing its part of the factors in the LU factorization (with respect to the LDL T factorization) is illustrated in Figure 6 .1. In both the LU and the LDL T factorization, a type 2 slave reserves space for a part of the L factor below the pivotal block as shown in Figure 6 .2. Thus, in the case of the LU factorization, the work equilibration can lead to memory imbalances if the same processor becomes master of several type 2 nodes. We propose the following simple remedy. After the whole tree is mapped with the objective of balancing the work, we use a post-processing step to correct obvious memory problems.
In Algorithm 8, we process the upper part of the assembly tree from the top down (1), as the type 2 nodes creating the biggest problems are often near a root of the tree. By Algorithm 8 Post-processing for better memory equilibration in the LU factorization.
(1) Process the type 2 nodes in the tree from the root downwards (2) For a node n with master p M (n) select candidate c (n) with smallest memory if memory imbalance can be improved by swapping p M (n) and c (n) then (3) Exchange the roles of master and candidate processor p M (n) , c (n) end if swapping a master processor with one of the candidates, we still guarantee the bene ts of the proportional mapping used in the candidate assignment, but locally improve the memory imbalance (steps 2 and 3).
6.6 The dynamic scheduling algorithm used at run time We describe, in Algorithm 9, the dynamic scheduling algorithm used in MUMPS for the mapping of the slaves of a type 2 node at run time (Amestoy et al. 2001a ) and show how the candidate concept in uences the original approach. Furthermore, we identify and describe the role of the algorithmic parameter that controls the minimum granularity for type 2 parallelism at run time. We denote this parameter by k max .
Algorithm 9 Dynamic choice of the slaves of a type 2 node. Given a type 2 node n with master processor p M (n) and children s 1 ; : : : ; s i (1) The masters of the children p M (s 1 ); : : :; p M (s i ) send symbolic data to p M (n) (2) p M (n) analyses its information concerning the load of all processors (3) p M (n) decides the partitioning of the frontal matrix of node n and chooses the slave processors p S 1 (n); : : :; p S j (n) (4) p M (n) informs all processors working on the children about the partition (5) The numerical data is sent directly to the slaves p S 1 (n); : : :; p S j (n)
Instead of rst assembling all numerical data on the master of the type 2 node and distributing it afterwards to the slaves, a two-phase assembly process is used. At step (1), the master receives only the integer data describing the symbolic structure of the front. At step (2), the master analyses the information on the workload of the other processors. Each processor is responsible for monitoring its own workload status and for broadcasting signi cant changes to all other processors so that everyone has accurate information on the overall progress of the factorization. At step (3), the master processor p M (n) selects the least loaded among all processors as slaves. As a general rule, all processors that are less loaded than p M (n) are chosen as slaves. Then a partition of the frontal matrix onto the slaves is calculated.
The following constraint on the number of slaves, n s , for a type 2 node selected during factorization is imposed. It must always satisfy n s max(b ncb k max c; 1);
where ncb denotes the number of rows in the contribution block. The parameter k max controls the maximum work of a type 2 slave and thus the maximum bu er size permitted for the factorization of a type 2 node. Once the slaves participating in the parallel update of the contribution block have been selected, they obtain the part of the symbolic information from the master p M (n) that is relevant for their work (4). Furthermore, they receive the corresponding numerical data from the processors working on the children (5). In the candidate-based scheduling approach, we modify step (3) so that the slaves are always chosen among the candidates provided for the node. At rst, we select all those candidates that are less loaded than the master processor. If the inequality (6.2) is not satis ed, additional candidates are chosen so that it holds. In order to be able to choose the slaves at factorization time among the candidates so that (6.2) is never violated, we must take care to provide enough candidates during analysis. If we provide only a minimum number of candidates so that (6.2) holds as equality, we enforce a static scheduling. In this case, all candidates must be selected as slaves during factorization. We have freedom for dynamic choices during the factorization only if we provide a number of candidates greater than ncb=k max . Consequently, the freedom o ered to the dynamic scheduling can be measured by the number of extra candidates given for a type 2 node. On the other hand, the larger the number of candidates for a given node, the closer we come to the case of fully dynamic scheduling with all the possible drawbacks discussed in Section 4.1. In the following experiments, we will see that the dynamic scheduling works most e ectively when k max is large and (6.2) does not impose a signi cant restriction. Because of overall memory constraints, the scope for increasing the parameter is limited; however, we will see that the better memory estimates from the candidate approach greatly increase the range from which k max can be chosen.
We remark that, in the case of the LDL T factorization, MUMPS precomputes a partition of the contribution block in order to guarantee that each of the slaves performs approximately the same amount of work (Amestoy et al. 2000) . As the frontal matrix is symmetric (and only the lower triangular part is stored), rows at the bottom of the frontal matrix are longer (and thus associated with more work) than rows at the top.
The test environment
In this section, we present the test matrices that we use to illustrate the behaviour of our algorithm. Speci cally, we consider in Section 7.1 matrices from regular grids and in Section 7.2 irregular ones from real-life applications. We mention that our set of regular grid problems includes those used by Amestoy et al. (2001b) which allows us to compare the performance of the new code with results already published. For our tests, we use both a CRAY T3E-900 (512 processors, 256 MBytes RAM and 900 peak MFlops per processor) and an SGI Origin 2000 (32 processors, 16 GBytes shared memory, 500 peak MFlops per processor). We consider di erent orderings including nested dissection from SPARSPAK (George and Ng 1984) and METIS (Karypis and Kumar 1998) , and Approximate Minimum Fill (Rothberg and Eisenstat 1998, Ng and Raghavan 1999).
Regular grid test problems
We consider a set of test matrices obtained from an 11-point discretization of the Laplacian on 3D grids of either cubic or rectangular shape, the grid sizes are reported in Table 7 .1. The set of problems is chosen as in Amestoy et al. (2001b) and is designed so that when the number of processors increases, the number of operations per processor in the LU factorization stays approximately constant when employing a nested dissection ordering (George and Ng 1984 In Tables 7.2 and 7 .3, we show the distribution of work for type 1 masters (T1), type 2 masters (T2M) and slaves (T2S), and the type 3 root node (T3). It can be seen that, when increasing the problem size and the number of processors used, the work of the type 2 slaves becomes a major part of the overall work. Thus, improving the mapping of the type 2 slaves through the candidate concept will have a great in uence on the overall performance of the factorization, in particular on larger problems.
LU
LDL T   Processors T1 T2M T2S T3 T1 T2M T2S T3  1  100  0  0 0 Processors T1 T2M T2S T3 T1 T2M T2S T3  1  100  0  0 0 100  0  0 0  2  88  0  0 12 88  0  0 12  4  84  1  3 12 84  1  3 12  8  49  5 34 12 49  3 36 12  16  25  5 58 12 25  2 61 12  32  16  4 68 12 16  2 70 12  48  14  4 70 12 12  2 74 12  64 10 4 74 12 10 1 77 12 Table 7 .3: Percentage distribution of work for 3D rectangular grid problems (nested dissection ordering).
General symmetric and unsymmetric matrices
The matrices described in this section all arise from industrial applications and include test matrices from the PARASOL Project (PARASOL 2002), the Rutherford-Boeing Collection (Du , Grimes and Lewis 1997) , and the University of Florida sparse matrix collection (Davis 2002 Table 7 .4: Matrix order, type, and number of entries for the irregular test matrices.
In Table 7 .4, we describe the characteristics of the test matrices arising from real life problems. In Table 7 .5, we show for the irregular problems on 64 processors the distribution of work for type 1 masters (T1), type 2 masters (T2M) and slaves (T2S), and the type 3 root node (T3). We see that the work distribution depends heavily on the ordering used. The AMF ordering produces assembly trees that are rich in type 2 parallelism; on the other hand, the root nodes are so small that type 3 parallelism cannot be exploited e ectively, in contrast to METIS. For all matrices apart from bbmat, the major part of the work is associated with the factorization of type 2 nodes, similar to the regular grid problems. Matrix  T1 T2M T2S T3 T1 T2M T2S T3   bbmat   43 3  54 0 
AMF METIS

Experimental investigation of algorithmic details
In this section, we study the in uence and scope of parameters in the algorithms used by Version 4.1 and by the new version of MUMPS. Furthermore, we present a detailed investigation of isolated parts of the improved algorithm by typical examples of phenomena that we have observed in our experiments.
The impact of k max on volume of communication and memory
We rst show the impact on the volume of communication and memory of the parameter k max that controls the minimum granularity of the type 2 parallelism.
Our test matrix is from Section 7.1 and comes from an 11-point discretization of the Laplacian on a cubic grid of order 46, ordered by nested dissection. Here, we perform an LU factorization on an SGI Origin 2000 with 16 processors. This platform is well suited for testing the k max parameter over a wide range of values because of its shared-memory architecture where a large amount of memory is available to all processors.
At rst, we study the behaviour of Version 4.1 of MUMPS and then compare it with the new code.
The two graphs in the upper row of Figure 8 .1 illustrate that with increasing k max , both the total volume of communication and the number of messages associated with dynamic scheduling decrease. If k max is small, the required minimum number of slaves for a type 2 node and the corresponding communication volume will be large. With increasing k max , a single type 2 slave might be authorized to work on larger parts of a contribution block and the minimum number of slaves required during factorization becomes smaller. With k max su ciently large and thus the guaranteed minimum number of slaves from inequality (6.2) being no longer a constraint, the dynamic scheduling can freely choose slaves among the least loaded processors. Thus, further increases in k max do not further reduce the volume of communication.
The graph in the left lower corner of Figure 8 .1 shows the increase in estimated and actually used memory with increasing k max , and the graph in the right lower corner shows the decomposition of the estimated memory into the space reserved for the communication bu ers, the LU factors, and the stack. As potentially every processor can be selected as a slave during the factorization and the memory predicted depends monotonically on k max , the prediction during the analysis phase will lead to an increasing gap between real and estimated memory as can be seen in the graph on the lower left. On the lower right, we see that the main contribution to the overestimation of the memory is the stack. As slaves stack their part of the contribution block until it can be received by the processors working on the parent of the node, the stack has to grow when k max increases. Furthermore, a single type 2 slave is authorized to work on larger parts of a contribution block.
When weighing memory estimation and communication volume against each other, the best value for k max is so that it reduces the memory overestimation but at the same time limits the communication volume su ciently. We now investigate the behaviour of the new candidate-based code on the same test matrix. Candidates are assigned without relaxation and layerwise redistribution, following the proportional mapping of the assembly tree. From the two graphs in the bottom row of Figure 8 .2 we observe the expected better estimation of memory. Compared to the corresponding graphs in Figure 8 .1, the growth of the gap between estimated and real memory is signi cantly smaller. As the type 2 slaves can only be chosen from the candidates, the non-candidates can be excluded thus making the estimation tighter and more realistic. Furthermore, the two graphs in the top row of for the dynamic scheduling, so that actually less parallelism is created and fewer slaves are chosen during factorization. Thus, if we want to reduce the communication volume in the new code, we are not obliged to increase k max substantially with the consequent drawback of overestimating the memory. Instead, we can choose k max relatively small and have the bene ts of a relatively realistic memory estimation together with a reduced communication volume.
The impact of k max on performance
In the following, we show the impact of the parameter k max on the factorization time, using as a test matrix an 11-point discretization of the Laplacian on a cubic grid of order 51, ordered by nested dissection. We perform an LU factorization on a CRAY T3E with 64 processors. (We have reduced the problem size from that in Table 7 .1 so that we have enough exibility with respect to memory for this parameter study.) Furthermore, because of limited memory and in order to separate the di erent algorithmic parameters, we use a candidate assignment without relaxation. For a study of the in uence of relaxation we refer to Section 8.3.
The CRAY T3E is well suited for providing reliable timing for performance measures because the processors are guaranteed to run in dedicated mode for a single task. On the other hand, the T3E has a distributed-memory architecture with a fairly small amount of memory per processor, so that we can vary the parameter k max only in a relatively small range compared to the range possible on the Origin 2000 which has a shared memory. From Figure 8 .3, we see that with increasing k max , the factorization time decreases in both versions of the code as the minimum number of slaves required during factorization gets smaller and the dynamic scheduler is free to decrease unnecessary parallelism. However, the previous version of MUMPS needs much more memory than the candidatebased version, and thus the exibility for increasing k max is more strictly limited.
Once k max is su ciently large, a further increase in k max shows no further improvements in performance. This corresponds to the results on the limited reduction in the volume of communication obtained in Section 8.1. We note that for the larger values of k max , the new version of the code performs better. We will con rm this observation by systematic studies on our set of test matrices in Section 9.
Modifying the freedom o ered to dynamic scheduling
We now investigate the behaviour of the new code when modifying the assignment of candidates. We study two di erent approaches. As described in Section 6.3, we can increase the number of candidates given to a node by increasing its number of preferentials through relaxing the proportional mapping. Furthermore, according to Algorithm 6, we can modify the candidate assignment for a given layer by an optional redistribution of the candidates that takes account of the weight of the nodes relative to each other. We rst compare the performance of the candidate assignment with and without layerwise redistribution. Afterwards, we show the impact of relaxation on the two assignment strategies. For our study, we use the same test case as in Section 8.2. Figure 8 .4 shows the factorization time of the new version of MUMPS for the candidate assignment with and without layer-wise candidate redistribution as a function of the minimum granularity. We cannot nd signi cant di erences in the behaviour of the two approaches. This example is representative of the results we have obtained on the complete set of test problems.
We now investigate the impact of relaxation on the volume of communication, memory, and performance. In Figures 8.5 and 8.6, the horizontal axis denotes the percentage relaxation factor. We present the behaviour of the new version of MUMPS for the candidate assignment with and without layer-wise candidate redistribution as a function of the relaxation.
The two graphs in Figure 8 .5 illustrate that, with increasing relaxation, both the total volume of communication and the number of messages related to dynamic scheduling increase because the exibility for choosing the slaves during factorization becomes greater. Likewise, the memory estimation grows with increasing relaxation, see Figure 8 .5. However, we do not observe a positive impact of relaxation on the performance of the algorithm; a possible interpretation is that, through the relaxation, we create additional parallelism that is not actually needed at run time. In general, we already have, without relaxation, enough freedom for a dynamic choice of the slaves. While this observation holds for all the experiments we have conducted, we are convinced that relaxation might show a positive impact on irregular problems from real-life applications. Unfortunately, the irregular problems available to us are not large enough to e ectively exploit parallelism on a large number of processors, and we plan to investigate this further in the future, see the remarks in Section 10.
In conclusion, we note that the candidate approach without layer-wise redistribution and without additional relaxation already o ers good results in our experiments. In the following, we focus therefore on the presentation of the results obtained with this algorithmic con guration.
Improved node splitting
Node splitting is useful when the pivot block is large relative to the size of the frontal matrix. It is discussed in detail by Amestoy et al. (2001a) and was introduced at the end of Section 3.1 and discussed brie y in Section 5.1. We now illustrate the additional capabilities of the new code for node splitting on the set of test matrices from Section 7.1. Those matrices are obtained from an 11-point discretization of the Laplacian on 3D cubic or rectangular grids with Approximate Minimum Fill (AMF) ordering and are described in Table 7 .1. The AMF ordering produces long and thin trees from the regular grid problems which we can use to illustrate the problems of the splitting criterion used in the previous version of the code. Splitting was done in a preprocessing step and before the layer structure of the assembly tree was known. In order to prevent useless splitting below layer L 0 where no type 2 parallelism is exploited, the algorithm authorized node splitting only up to a xed distance from the root node, where this distance depended only on the number of processors but not on the matrix. So it could happen that even though there were nodes in the upper part of the tree that should have been split for better performance, the splitting was not performed. The new algorithm incorporates the splitting systematically in the upper part of the tree. Once layer L 0 is known, we can authorize splitting everywhere in the upper part of the tree to create more parallelism if this is useful. We illustrate the additional splitting in Table 8 .1 for both cubic and rectangular grids. For each grid type, we show in the rst of the three columns the additional number of splittings of the new code and compare them to the total number of splittings (including the splittings already performed by Version 4.1 of the code) and the total number of nodes in the upper part of the tree after splitting in the second and third columns, respectively. For example, for the rectangular grid on 512 processors, and with the same splitting criteria, the new code performs 61 splittings in addition to those already done by the old code, so that altogether 107 splittings are performed, resulting in an assembly tree with 830 nodes. In other words, in this example, the previous version of MUMPS missed 57% of the possible splittings. We illustrate, in Table 8 .2, the properties and bene ts of the improved splitting in the case of the cubic grid of order 57 on 64 processors. The additional splitting slightly increases the number of assembly operations and also the average amount of memory per processor. However, it creates additional parallelism by augmenting the number of type 2 nodes. This signi cantly improves the performance of the factorization. Moreover, memory can be balanced better between the processors because of the additional type 2 parallelism.
Improved node amalgamation
Amalgamation, particularly amalgamation that potentially increases ll-in, can be considered the opposite of splitting and can be useful when the pivot block is relatively small. In this section, we illustrate the improvements we have made concerning node amalgamation by using our test examples from Table 8 .3: Improved amalgamation of the new code.
In Table 8 .3, we show, for both cubic and rectangular grids, the number of extra amalgamations the new code performed in layer L 0 of the Geist-Ng algorithm as described in Section 6.2 and the total number of extra amalgamations performed on all layers of the tree. We also give the total number of nodes in the upper part of the tree after all amalgamations have been performed. We emphasize that, in the new version, we use the same amalgamation criteria as in the previous version and show, in the table, the amalgamations that are performed in addition to those performed before. Amalgamation in the previous version was only possible between a parent node and its oldest child; the greater freedom in the new code allows many more amalgamations as can be seen in particular for the large test matrices on 512 processors. For example, for the cubic grid on 512 processors, the new code performed 119 additional amalgamations, that is, 119 amalgamations more than the old code with the same amalgamation criteria. Among the additional amalgamations of the new code are 77 for layer L 0 , so that the amalgamated assembly tree has 1371 nodes in the upper part.
We illustrate in Table 8 .4 the properties and bene ts of the improved amalgamation in the case of the cubic grid of order 46 on 16 processors. The additional amalgamation decreases the number of assembly operations and allows a better memory balance because the stacking of several large type 1 nodes can be avoided. 8.6 Post-processing for a better memory balance
On the CRAY T3E, we illustrate a shortcoming that we detected when testing an initial version of the candidate-based LU factorization. We consider the cubic grid problem of order 72 from Table 7 .1 ordered by nested dissection. We show the bene ts obtained by remapping the masters of type 2 nodes for better memory balance as described in Section 6.5 and conclude that the post-processing is also crucial for obtaining good speedup.
Looking at the rst rows (no postp.) of Table 8 .5, we see that the op-based equilibration of the scheduling algorithm leads to severe memory imbalance both in the estimated and the actual memory. In particular, the process needing the largest amount of (estimated) memory requests 179 megabytes, about 70% of the memory of the processor. In the last two rows of Table 8 .5, we show the memory statistics when the postprocessing is performed. We observe that the di erence between average and maximum values for both the estimated and actual memory are much reduced. This allows us to double the k max parameter for this test case and obtain better performance for the factorization. However, note that the estimate for the most loaded processor is more accurate without post-processing. This is because the di erences between memory estimation and actually used memory are mainly related to a processor being a type 2 candidate but not being chosen as a slave during factorization. Without post-processing, the major activity for the most loaded processor probably involves work associated with being master of several type 2 nodes, see Section 6.5. But after the post-processing, the processor exchanges its role as master with another and becomes a type 2 candidate so that its memory estimate can become less accurate.
Performance analysis
In the following, we compare the performance of the new MUMPS code with the previous version (Amestoy et al. 2001b ) on the complete set of test problems presented in Section 7. All these tests were performed on the CRAY T3E of the NERSC computing center at Lawrence Berkeley National Lab in Berkeley, California.
Nested dissection ordering
In this section, we use the test matrices from Table 7 .1 ordered by nested dissection.
We observe, from the results in Table 9 .1, that for up to 64 processors, the new version has similar performance to the good results of the previous version. However, when more processors are used and the matrices become larger, the new code performs signi cantly better. Looking at the results on 128, 256 and 512 processors, we note the greatly improved scalability of the candidate-based code. Table 9 .1: Performance of the old and new LU factorization (time in seconds on the CRAY T3E).
Another major advantage of the new candidate-based code is that it better estimates the memory used for the factorization. In Table 9 .2, we show the memory space for the LU factors of the old and the new version of MUMPS. We see that the candidate-based code signi cantly reduces the overestimation of the storage required, and that the gains increase with the matrix size and the number of processors. Table 9 .2: Space for the LU factors (number of reals 10 6 ).
The big gains of the new candidate-based code are a result of the individual improvements concerning splitting and amalgamation, reduced communication and the better locality of the computation as illustrated in Section 8. Furthermore, we need to decrease k max in the large problems for the old version of MUMPS because of memory.
This limits the performance as we saw in Section 8.2. On the other hand, we do not need to decrease k max in the candidate-based code as the tighter estimates stay within the memory available.
As all regular test matrices are symmetric, we can also compare the old with the new candidate-based LDL T factorization. The results presented in Table 9 .3 con rm those obtained for the LU factorization. The candidate-based code shows a much better performance in particular for the large problems on a large number of processors due to improved locality of communication and computation, and because of the bigger scope for increasing the k max parameter. Table 9 .3: Performance of the LDL T factorization (time in seconds on the CRAY T3E).
Note that, because of the improvements in the scalability of the new code, MUMPS now compares favourably to SuperLU on a large number of processors. (The factorization time for SuperLU on 128 processors and the same nested dissection ordering is 71:1 seconds for the cubic and 56:1 seconds for the rectangular grid (Amestoy et al. 2001b ).)
Approximate Minimum Fill (AMF) ordering
Recently a fairly large number of experiments have been conducted with several heuristics to reduce the ll-in (de ciency) during the elimination process Raghavan 1999, Rothberg and Eisenstat 1998) . The approximation of the de ciency used in our AMF code is based on the observation that, because of the approximate degree, we count variables twice that belong to the intersection of two elements adjacent to a variable in the current pivot list. This property of the approximate degree can be exploited to improve the estimation of the de ciency and the accuracy of the approximation proposed by Rothberg and Eisenstat (1998) .
The AMF ordering produces trees that are di cult to exploit in MUMPS. The upper part of the tree where type 2 and type 3 parallelism can be exploited is usually a long and thin chain. In Table 9 .4, we show the memory required to store the factors of the LU factorization for the di erent orderings. In the case of the cubic grid, the real space used by the factors is signi cantly larger than when using the nested dissection ordering. Furthermore, we note that for the rectangular grid, AMF actually needs the least space for the factors. However, the shape of the assembly tree still o ers less potential for parallelism and we expect the factorization time for AMF-ordered matrices to be considerably longer than for the case of nested dissection. This is con rmed by the results in Tables 9.5 Table 9 .6: Performance of the LDL T factorization (time in seconds on the CRAY T3E).
Analysis of the speedup for regular grid problems
We now summarize the results of the previous sections by presenting a comparison of the speedup on the 3D grid problems.
Let t j denote the time to execute a given job involving ops j oating point operations on j parallel processors. Then, we de ne the scaled speedup, S p , for p processors to be S p = t 1 =ops 1 t p =ops p :
In Figures 9.1 and 9.2, we show the scaled speedup for the matrices ordered by nested dissection and in Figures 9.3 and 9 .4 for the AMF ordering, respectively. 
Performance analysis on general symmetric and unsymmetric matrices
In this section, we compare the performance of the new mapping algorithm with the previous version on general symmetric and unsymmetric matrices. The main problem with this comparison is that our algorithm o ers the biggest performance gains only on a large number of processors. However, the unsymmetric matrices available to us are either too small to o er enough potential for scalability on more than 64 processors, or they are too large to do the analysis (which is performed on only one processor). This was already observed in the analysis of the scalability of both MUMPS and SuperLU (Amestoy et al. 2001b ) .
In order to compare the quality of the di erent orderings, we show the number of entries in the factors for the test matrices in Table 9 .7.
While it is not always the best ordering, METIS consistently provides a good overall performance with respect to the number of entries in the factors. Table 9 .8: Performance of old and new code on the irregular test matrices (factorization time in seconds on the CRAY T3E).
In Table 9 .8, we see that in general the new mapping algorithm performs similarly to the old one. As already noted, we would expect signi cant improvements on large matrices and on more than 64 processors. However, we notice some improvements for the AMF ordering on bbmat and g7jac200. However, since METIS generally provides better orderings, these improvements for AMF are not so relevant and only show the capacity of our algorithm to correctly handle irregular trees.
Perspectives and future work
In this section, we summarize the open questions that need further investigation.
In Section 8.3, we investigated the behaviour of the new code when modifying the assignment of candidates through relaxation and layer-wise redistribution. On the test cases that we have studied in the framework of this paper, these modi cations have not shown a positive e ect on the overall performance of the code. Still, there is an intuitive argument suggesting further experiments. The analysis phase tries to predict the actual factorization of the matrix and takes mapping decisions based on this symbolic factorization. However, there are cases where this approach might not be accurate enough; for example we do not take into account costs of communication between the processors as is done, for example, by the static scheduler of PaStiX (Henon et al. 2002) . Since, during factorization, the assembly tree is treated from bottom up, we might expect mapping problems to have more severe in uence towards the root of the tree. For this reason, we could decide to o er more freedom to dynamic scheduling near the root nodes so that unfortunate mapping decisions can be corrected dynamically there.
Furthermore, in Section 9.4 we have presented test results on a few large irregular test matrices from real life applications. We have already remarked that these matrices are still relatively small and do not o er enough sources of parallelism on a large number of processors. This study needs to be extended in order to be able to give reliable statements on the scalability of the new code also in real life applications. Finally, our candidate based approach can be extended to take account of the system architecture, for example with respect to non-uniform communication costs on machines consisting of SMP nodes. We can modify the task scheduling so that processors which require expensive communications are penalized so that the master-slave communication costs are reduced.
Summary and conclusions
Previous studies of MUMPS, a distributed memory direct multifrontal solver for sparse linear systems, indicated that its scalability with respect to computation time and use of memory could be improved. In this paper, we have presented a new task scheduling algorithm designed to address these problems. It consists of an approach that treats the assembly tree layer by layer and integrates tree modi cations, such as amalgamation and splitting, with the mapping decisions. As a major feature, we have introduced the concept of candidate processors that are determined during the analysis phase of the solver in order to guide the dynamic scheduling during the factorization.
We have illustrated key properties of the new algorithm by detailed case studies on selected problems. Afterwards, by comparison of the old with the new code on a large set of regular and irregular test problems, we have illustrated the main bene ts of the new approach. These include improved scalability on a large number of processors, reduced memory demands and a smaller volume of communication, and the easier handling of parameters relevant for the performance of the algorithm. Finally, we have discussed possible extensions of our algorithm, in particular with respect to its use on SMP architectures.
