Abstract. We develop and compare several ne-grained parallel algorithms to compute the Cholesky factorization of a sparse matrix. Our experimental implementations are on the Connection Machine, a distributed-memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special-purpose algorithms in which the matrix structure conforms to the connection structure of the machine, our focus is on matrices with arbitrary sparsity structure.
1. Introduction. 1.1. Data parallelism. Highly parallel, local memory computer architectures promise to achieve high performance inexpensively by assembling a large amount of simple hardware in a way that scales without bottlenecks. A data parallel programming model simpli es the programming of local memory parallel architectures by associating a processor with every data element in a computation (at least conceptually), thus hiding from the programmer the details of distribution of data and work to the processors.
Some major challenges come along with these promises. Communication is expensive relative to computation, so an algorithm must minimize communication. Locality is important in communication, so it pays to substitute communication with nearby processors for more general patterns where possible. The sequential programmer tunes the inner loop of an algorithm for high performance, but simple data parallel algorithms tend to have \everything in the inner loop" because a sequential loop over the data is typically replaced by a parallel operation. (From this point of view, the advantage of the more complicated of our two algorithms, Grid Cholesky, is that it removes the general-pattern communication from the inner loop.)
Algorithms for data parallel architectures must make di erent trade-o s than sequential algorithms: They must exploit regularity in the data. For e ciency on SIMD machines, they must also be highly regular in the time dimension. In some cases entirely new approaches may be appropriate; examples of experiments with such approaches include particle-in-box ow simulation, knowledge base maintenance 5], and the entire eld of neural computation 20] . On the other hand, the same kind of regularity in a problem or an algorithm can often be exploited in a wide range of architectures; therefore, many ideas from sequential computation turn out to be surprisingly applicable in the highly parallel domain. For example, block-oriented matrix operations are useful in sequential machines with hierarchical storage and conventional vector supercomputers 3]; we shall see that they are also crucial to e cient data parallel matrix algorithms.
1.2. Goals of this study. Data parallel algorithms are natural for computations on matrices that are dense or have regular nonzero structures arising from, for example, regular nite di erence discretizations. The main goal of this research is to determine whether data parallelism is useful in dealing with irregular, arbitrarily structured problems. Speci cally, we consider computing the Cholesky factorization of an arbitrary sparse, symmetric, positive de nite matrix. We will make no assumptions about the nonzero structure of the matrix besides symmetry. We will present evidence that arbitrary sparse problems can be solved nearly as e ciently as dense problems by carefully exploiting regularities in the nonzero structure of the triangular factor that come from the clique structure of its chordal graph.
A second goal is to perform a case study in analysis of parallel algorithms. The analysis of sequential algorithms and data structures is a mature and useful science that has contributed to sparse matrix computation for many years. By contrast, the study of complexity of parallel algorithms is in its infancy, and it remains to be seen how useful parallel complexity theory will be in designing e cient algorithms for real parallel machines. We will argue by example that, at least within a particular class of parallel architectures, asymptotic analysis combined with experimental measurement of parameters is accurate enough to be useful in choosing among alternative algorithms for a single fairly complicated problem.
1.3. Outline. The structure of the remainder of the paper is as follows. In Section 2 we review the de nitions we need from numerical linear algebra and graph theory, sketch the architecture of the Connection Machine, and present a timing model for a generalized data parallel computer that abstracts that architecture.
In Section 3 we present the rst of two parallel algorithms for sparse Cholesky factorization. The algorithm, which we call Router Cholesky, is based on a theoretically e cient algorithm in the PRAM model of parallel computation. We analyze the algorithm and point out two reasons that it fails to be practical, one having to do with communication and one with processor utilization.
In Section 4 we present a second algorithm, which we call Grid Cholesky. Grid Cholesky is a data parallel implementation of a supernodal, multifrontal method that draws on the ideas of Du and Reid 7] and Ashcraft et al. 1] . It improves on Router Cholesky by using a two-dimensional grid of processors to operate on dense submatrices, thus replacing most of the slow generally-routed communication of Router Cholesky with faster grid communication. It also solves the processor utilization problem by assigning di erent data elements to the working processors at di erent stages of the computation. We present an analysis and experimental results for a pilot implementation of Grid Cholesky on the Connection Machine.
The pilot implementation of Grid Cholesky is approximately as e cient as a dense Cholesky factorization algorithm, but is still slow compared to the theoretical peak performance of the machine. Several steps necessary to improve the absolute e ciency of the algorithm, most of which concern e cient Cholesky factorization of dense matrices, are described. Finally we draw some conclusions and discuss avenues for further research.
2. De nitions. The rst two subsections below summarize the linear algebra and graph theory needed to study sparse Cholesky factorization. Most of this material is covered in more detail by George and Liu 13, 23, 24] . The remainder of the section outlines the data parallel programming model and its implementation on the Connection Machine, and describes our parametric model of a data parallel computer.
2.1. Linear algebra. Let A = (a ij ) be an n n real, symmetric, positive de nite sparse matrix. There is a unique n n lower triangular matrix L = (l ij ) with positive diagonal such that A = LL T : This is the Cholesky factorization of A. We seek to compute L; with it we may solve the linear system Ax = b by solving Ly = b and L T x = y. We will discuss algorithms for computing L below. In general, L is less sparse than A. The nonzero positions of L that were zero in A are called ll or ll-in. For any matrix X, we write (X) to denote the number of nonzero elements of X.
The rows and columns of A may be symmetrically reordered so that the system solved is PAP T (Px) = Pb where P is a permutation matrix. We assume that such a reordering, chosen to reduce (L) and the number of operations required to compute L, has been done. We further assume that the structure of L has been determined by a symbolic factoring process. We ignore these preliminary computations in this study because the cost of actually computing L typically dominates. (In many cases, several identically structured matrices may be factored using the same ordering and symbolic factorization.) A study of the implementation of appropriate reordering and symbolic factorization procedures on data parallel architectures is in preparation 18] .
If the matrix A is such that its Cholesky factor L has no more nonzeros than the lower triangle of A, i.e. there is no ll, then A is a perfect elimination matrix. If PAP T is a perfect elimination matrix for some permutation matrix P, we call the ordering corresponding to P a perfect elimination ordering of A.
Let R and S be subsets of f1; : : :; ng. Then A(R; S) is the jRj jSj matrix whose elements are a r;s for r 2 R and s 2 S. (For any set S, we write jSj to denote its cardinality.) 2.2. Graph theory. 2.2.1. Vertex elimination. We associate two ordered, undirected graphs with the sparse, symmetric matrix A. First, G(A), the graph of A, is the graph with vertices f1; 2; : : :; ng and edges E(A) = f(i; j) j a ij 6 = 0g :
(Note that E(A) is a set of unordered pairs.) Next, we de ne the lled graph, G (A), with vertices f1; 2; : : :; ng and edges E (A) = f(i; j) j l ij 6 = 0g ; so that G (A) is G(L+L T ). The edges in G (A) that are not edges of G(A) are called ll edges. The output of a symbolic factorization of A is a representation of G (A). For every ll edge (i; j) in E (A) there is a path in G(A) from vertex i to vertex j whose vertices all have numbers lower than both i and j; moreover, for every such path in G(A) there is an edge in G (A) 28]. Consider renumbering the vertices of G (A). With another numbering, this last property may or may not hold. If it does, then the new ordering is a perfect elimination ordering of G (A); Liu 24] It is immediate that any two simplicial vertices are either independent or indistinguishable. A set of indistinguishable simplicial vertices thus forms a clique, though not in general a maximal clique. The equivalence relation of indistinguishability partitions the simplicial vertices into pairwise independent cliques. We call these the simplicial cliques of the graph.
Elimination trees.
A fundamental tool in studying sparse Gaussian elimination is the elimination tree. This structure was de ned by Schreiber 30] ; Liu 24] gives a survey of its many uses. Let A have the Cholesky factor L. The elimination tree T(A) is a rooted spanning forest of G (A) de ned as follows. If vertex u has a higher-numbered neighbor v, then the parent p(u) of u in T(A) is the smallest such neighbor; otherwise u is a root. In other words, the rst o -diagonal nonzero element in the u th column of L is in row p(u). It is easy to show that T(A) is a forest consisting of one tree for each connected component of G(A). For simplicity we shall assume in what follows that A is irreducible, so that vertex n is the only root, though our algorithms do not assume this.
There is a monotone increasing path in T(A) from every vertex to the root. If 2.2.4. Clique trees. The elimination tree describes a Cholesky factorization in terms of operations on single columns. A description in terms of operations on full blocks can yield algorithms with better locality of reference, which is an advantage either on a machine with a memory hierarchy (registers, cache, main memory, disk) or on a distributed-memory parallel machine. The Connection Machine falls into both of these categories. Describing symmetric factorization in terms of operations on full submatrices is the key idea in both Du and Reid's \multifrontal"algorithm 7] and the \supernodal" algorithm of Ashcraft, Grimes, Lewis, Peyton, and Simon 1] , which can be traced back to the so-called element model of factorization 15, 33] . A full submatrix of L is a clique in the chordal graph G (A). The clique structure of chordal graphs has been explored extensively in the combinatorial literature; representations of chordal graphs as trees of cliques date back at least to 1972 10] and continue to be used 16, 25, 26] .
A clique tree for matrix A is a tree whose nodes are sets that partition the vertices of G (A) into cliques, in such a way that all the vertices of a node N are indistinguishable simplicial vertices in the graph that results by deleting from G (A) all vertices in nodes that are proper descendants of N. An equivalent de nition is to think of starting with an elimination tree and collapsing vertices that are indistinguishable from their parents. (This de nition di ers slightly from that of Peyton 26] , whose tree nodes are overlapping maximal cliques of G (A).)
A clique which is a node in a clique tree is also called a supernode. If all vertices of G (A) in proper descendants of some supernode are deleted, the supernode becomes a simplicial clique in the resulting graph. The clique tree is sometimes called a supernode tree or supernodal elimination tree 2]. A matrix may have many di erent clique trees|indeed, the elimination tree itself is one. Our numerical factorization algorithm Grid Cholesky can actually use any clique tree; the symbolic factorization we describe in Section 4.1 uses a block Jess and Kees algorithm to compute a shortest possible clique tree.
2.3. The Connection Machine. 2.3.1. Architecture. The Connection Machine (model CM-2) is a local memory, SIMD parallel computer. The description we present here corresponds to the machine architecture presented by the assembly language instruction set Paris 34] . We programmed the CM in *lisp, which is compiled into Paris.
A full-sized CM has 2 16 = 65,536 processors, each of which can directly access 65,536 bits of memory. (Since this work was done, larger memories have become available.) The processors are connected by a communication network called the router, which is con gured by a combination of microcode and hardware to be a 16-dimensional hypercube.
The essential feature of the CM programming model is the parallel variable or pvar. A pvar is an array of data in which every processor stores and manipulates one element. The size of a pvar may be a multiple of the number of physical machine processors, p. If there are V times as many elements in the pvar X as there are processors, then (through microcode) each physical processor simulates V virtual processors; thus the programmer's view remains \one processor per datum." The ratio V is called the virtual processor (VP) ratio. The CM requires that V must be a power of two. Thus we will nd useful the notation d dx e e, meaning the smallest power of two not smaller than the real number x.
The geometry of each set of virtual processors (and its associated pvars) is also determined by the programmer, who may choose to view it as any multidimensional array with dimensions that are powers of two. The VP sets and their pvars are embedded in the machine (using Gray codes) so that neighboring virtual processors are simulated by the same or neighboring physical processors.
The Paris instruction set corresponds fairly well to the abstract model of data parallel programming that the CM attempts to present to the programmer, but it does not correspond closely to the actual hardware of the CM. Largely for this reason, it is hard to get high performance when programming in Paris or in a language that is compiled into Paris 31] . We shall go into this point in detail later. There are other ways to view and to program the hardware of the CM-2 that can provide better performance. These are just now becoming available to users, but were not when this work was done.
2.3.2. Connection Machine programming. Parallel computation on the CM is expressed through elementwise binary operations on pairs of pvars that reside in the same VP set|that is, have the same VP ratio and layout on the machine. (Optionally, one may specify a boolean mask that selects only certain virtual processors to be active.) These operations take time proportional to V , since the actual processors must loop over their simulated virtual processors. This remains true even when the set of selected processors is very sparse.
Interprocessor communication is expressed and accomplished in three ways, which we discuss in order of increasing generality but decreasing speed.
Communication with virtual processors at nearest neighbor grid cells is most e cient. A pvar may be shifted along any of its axes using this type of communication. The shift may be circular or end-o at the programmer's discretion.
A second communication primitive, scan, allows broadcast of data. For example, if x is a one-dimensional pvar with the value 1, 2, 3, 4, 5, 6, 7, 8] then a scan of x yields 1, 1, 1, 1, 1, 1, 1, 1]. Scans are implemented using the hypercube connections. The time for a scan of length n is linear in logn. Scans can also be used to broadcast along either rows or columns of a two-dimensional array. Scans that perform parallel pre x arithmetic operations are also available, but we do not use them.
Scans of subarrays are possible. In a segmented scan, the programmer speci es a boolean pvar, the segment pvar, congruent to x. The segments of x between adjacent T values in the segment pvar are scanned independently. Thus, for example, if we use the segment pvar T F F F T F F T] and x is as above, then a segmented scan returns 1, 1, 1, 1, 5, 5, 5, 8].
The third and most general form of communication allows a virtual processor to access data in the memory of any other virtual processor. These operations go by several di erent names even within the CM environment; we shall refer to them in terms already familiar in sparse matrix computation: gather and scatter.
A gather allows processors to read data in the memory of other processors. The CM term for a gather is pref!! (for the *lisp programmer) or get (for the Paris programmer). In a gather, three pvars are involved: the source, the destination, and the address. The address of the processor whose memory is to be read is taken from the integer address pvar. Suppose the source is the one-dimensional pvar 15; 14; 13;: ::; 2; 1; 0] and the address pvar is 0; 1; 2; 0; 1; 2;: ::; 0; 1; 2; 0]. Then the data stored in the destination is 15; 14; 13;15;14;13;: ::; 15; 14; 13; 15]. The Fortran-90 or Matlab statement that accomplishes this is is dest = source(address); it performs the assignment dest(i) source(address(i)) for 1 i length(dest).
A scatter allows processors to write data to the memory of other processors. The CM term for a scatter is *pset (for the *lisp programmer) or send (for the Paris programmer). Again the three pvars are a source, a destination, and an integer address. The Fortran-90 or Matlab version is is dest(address) = source, and the e ect is dest(address(i)) source(i) for 1 i length(source). In a scatter, when the address pvar has duplicate values, data from several source processors are sent to the same destination processor. (When this happens we say that a collision has occurred.) The programmer can select one of several di erent ways to combine colliding values by specifying a combining operator. For example, if the source and address are as above and the destination initially has the value 1; 1; : : :; 1] then after a scatter-with-add, the destination has the value 45; 40; 35;1;1; 1; :: :; 1]. The sum of elements source(j) such that address(j) = k is stored in dest(k) if there are any such elements; otherwise dest(k) is unchanged. Other combining operators include product, maximum, minimum, and \error". We have found combining scatter to be a powerful aid to data parallel programming. (Floating-point addition takes the same time as multiplication, .) In our model, execution time scales linearly with VP ratio, which is essentially correct for the Connection Machine. Therefore is proportional to VP ratio, and the other parameters are independent of VP ratio. In Table 1 , we give measured values for these parameters obtained by experiment on the CM-2. We observe that router times range over a factor of four depending on the number of collisions; it is possible to design pathological routing patterns that perform even worse than this. For any given pattern, gather usually takes just twice as long as scatter, presumably because it is implemented by sending a request and then sending a reply. In our approximate analyses, therefore, we generally choose a value of for scatter corresponding to the number of collisions observed, and model gather as taking 2 oating-point times.
3. Router Cholesky. Routine cdiv (j) divides the subdiagonal elements of column j by the square root of the diagonal element in that column, and routine cmod (j; i) modi es column j by subtracting a multiple of column i. This is called a left-looking algorithm because column j accumulates all necessary updates cmod (j; i) from columns to its left just before the cdiv (j) that completes its computation. By contrast, a right-looking algorithm would perform all the updates cmod (j; i) using column i immediately after the cdiv (i).
Now consider the elimination tree T(A). A given column (vertex) is modi ed only by columns (vertices) that are its descendants in the tree 30]. Therefore a parallel left-looking algorithm can compute all the leaf vertex columns at once.
procedure Router Cholesky (matrix A); for h 0 to height(n) do for each edge (i; j) with height(i) < height(j) = h pardo cmod (j; i) od; for each vertex j with height(j) = h pardo cdiv (j) od od end Router Cholesky;
Here height(j) is the length of the longest path in T(A) from vertex j to a leaf. Thus the leaves have height 0, vertices whose children are all leaves have height 1, and so forth. The outer loop of this algorithm works sequentially from the leaves of the elimination tree up to the root. At each step, an entire level's worth of cmod's and cdiv's are done.
A processor is assigned to every nonzero of the triangular factor (or, equivalently, to every edge and vertex of the lled graph G ). Suppose processor P ij is assigned to the nonzero that is initially a ij and will eventually become l ij . (If l ij is a ll, then a ij is initially zero; recall that we assume that the symbolic factorization is already done, so we know which l ij will be nonzero.) In the parallel cdiv (j), processor P jj computes l jj as the square root of its element, and sends l jj to processors P ij for i > j, which then divide their own nonzeros by l jj . In the parallel cmod (j; i), processor P ji sends the multiplier l ji to the processors P ki with k > j. Each such P ki then computes the update l ki l ji locally and sends it to P kj to be subtracted from l kj .
We call this a left-initiated algorithm because the multiplications in cmod (j; i) are performed by the processors in column i who then, on their own initiative, send these updates to a processor in column j. Each column i is involved in at most one cmod at a time because every column modifying j is a descendant of j in T(A), and the subtrees rooted at vertices of any given height are disjoint. Therefore each processor participates in at most one cmod or cdiv at each parallel step. If we ignore the time taken by communication (including the time to combine updates to a single P kj that may come from di erent P ki1 , P ki2 , : : :), then each parallel step takes a constant amount of time and the parallel algorithm runs in time proportional to the height of the elimination tree T(A).
CM implementation of Router Cholesky. To implement Router Cho-
lesky on the CM we must specify how to assign data to processors, and then describe how to do the communication in cdiv and cmod.
We use one (virtual) processor for each nonzero in the triangular factor L. We lay out the nonzeros in a one-dimensional array in column major order, as in many standard sequential sparse matrix algorithms 13]. This makes operations within a single column e cient because they can use the CM scan instructions. Each column is represented by a processor for its diagonal element followed by a processor for each sub-diagonal nonzero. The symmetric upper triangle is not stored. We can also think of this processor assignment as a processor for each vertex j of the lled graph, followed by a processor for each edge (i; j) with i > j.
Router Cholesky, like many data parallel algorithms, is pro igate of parallel variable storage. Each processor contains some working storage and the following pvars: In processor P ij , a pointer to P i;p(j) .
next update Pointer to next element this one may update.
(Recall that p(j) > j is the elimination tree parent of vertex j < n.)
At each stage of the sequential outer loop, each processor uses i ht and j ht to decide whether it participates in a cdiv or cmod. By comparing the local processor's i ht or j ht to the current value of the outer loop index, a processor can determine if its element is in a column that is involved in a cdiv or a cmod.
The cdiv uses a scan operation to copy the diagonal element to the rest of the active column.
The cmod uses a similar scan to copy the multiplier l ji down to the rest of column i. The actual update is done by a scatter-with-add, which uses the router to send the update to its destination.
To gure out where to send the update, each element maintains a pointer called next update to a later element in its row. The nonzero positions in each row are a connected subgraph of the elimination tree, and are linked together in this tree structure by the e parent pointers. Each nonzero updates only elements in columns that are its ancestors in the elimination tree. At each stage, next update is moved one step up the tree using the e parent pointers. The most time-consuming step is incrementing the next-update pointer; the router is used twice in this operation. The dominant term is the router term c 1 h. Notice that we do not explicitly count time for combining updates to the same element from di erent sources, since this is handled within the router and is thus included in .
To get a feeling for this analysis consider the model problem, a 5-point nite di erence mesh in two dimensions ordered by nested dissection 11]. If the mesh has k points on a side, then the graph is a k by k square grid, and we have n = k 2 , h = O(k), and (L) = O(k 2 logk). The number of arithmetic operations in the Cholesky factorization is O(k 3 ), in either the sequential or parallel algorithms. Router Cholesky's running time is O( k 3 logk=p). If we de ne performance as the ratio of the number of operations in the sequential algorithm to parallel time, we nd that the performance is O(p= logk) (taking to be a constant independent of p or k; this is approximately correct for the Connection Machine, although theoretically should grow at least with logp). This analysis points out two weak points of Router Cholesky. First, the performance on the model problem drops with increasing problem size. (This depends on the problem, of course; for a three-dimensional model problem a similar analysis shows that performance is O(p) regardless of problem size.) More seriously, the constant in the leading term of the complexity is proportional to the router time , because every step uses general communication.
This analysis can be extended to any two-dimensional nite element mesh with bounded node degree, ordered by nested dissection 17]. The asymptotic analysis is the same but the values of the constants will be di erent. we use only 48,608 of the 65,536 virtual processors.) We observed a running time of 53 seconds, of which about 41 seconds was due to gathers and scatters. Substituting into the analysis above (using = 200 since there were in general many collisions), we would predict router time c 1 h = 39 seconds and other time (c 2 + c 3 ) h = 1:5 seconds. This is not a bad t for router time; it is not clear why the remaining time is such a poor t, but the expensive square root and the data movement involved in the pointer updates contribute to it, and it seems that I/O may have a ected the measured 53 seconds.
The observation, in any case, is that router time completely dominates.
3.5. Remarks on Router Cholesky. Router Cholesky is too slow as it stands to be a cost-e ective way to factor sparse matrices. Each stage does two gathers and a scatter with exactly the same communication pattern. More careful use of the router could probably speed it up by a factor of two to ve. However, this would not be enough to make it practical; something more like a hundredfold improvement in router speed would be needed. The one advantage of Router Cholesky is the extreme simplicity of its code. It is no more complicated than the numeric factorization routine of a conventional sequential sparse Cholesky package 13]. It is interesting to note that column-oriented sparse Cholesky codes on MIMD message-passing multiprocessors 12, 14, 35] are more complex. They exploit MIMD capability to implement dynamic scheduling of the cmod and cdiv tasks. They allow arbitrary assignment of columns to processors and therefore are required to use indirect addressing of columns. Finally, they are written with low-level communication primitives, the explicit \send" and \receive." Router Cholesky's simplicity comes dearly. Flexibility in scheduling allows MIMD implementations to gain a modest performance advantage over any possible SIMD implementation. More important, we employ (L) virtual processors, regardless of the number of physical processors. It is essential that these virtual processors not all sit idle, consuming physical processor time slices, when there is nothing for them to do. As implemented by the Paris instruction set, they do sit idle.
We described Router Cholesky as a left-initiated, left-looking algorithm. In a right-initiated algorithm, processor P ij would perform the updates to l ij . In a rightlooking algorithm, updates would be applied as soon as the updating column of L was computed instead of immediately before the updated column of L was to be computed. Router Cholesky is thus one of four cousins. It is the only one of the four that maps operations to processors evenly; the other three alternatives require an inner sequential loop of some kind. All four versions require at least h router operations.
Grid Cholesky. In this section we present a parallel supernodal, multifrontal
Cholesky algorithm and its implementation on the CM. Multifrontal methods, introduced by Du and Reid 7] , compute a sparse Cholesky factorization by performing symmetric updates on a set of dense submatrices. We follow Liu 23] in referring to an algorithm that uses rank-1 updates \multifrontal" and the block version that uses rank-k updates \supernodal multifrontal." The idea of using block methods to operate on supernodes has been used in many di erent sparse factorization algorithms 1, 7] . Parallel supernodal or multifrontal algorithms have been used on MIMD message-passing and shared-memory machines 2, 6, 32].
The algorithm uses a two-dimensional VP set (which we call the \playing eld") to partially factor, in parallel, a number of dense principal submatrices of the partially factored matrix. By working on the playing eld, we may use the fast grid and scan mechanisms for all the necessary communication during the factorization of the dense submatrices. Only when we need to move these dense submatrices back and forth to the playing eld do we need to use the router. In this way we drastically reduce the use of the router: for the model problem on a k k grid we reduce the number of uses from h = 3k ? 1 to 2 log 2 k ? 1. The playing eld can also operate at a lower VP ratio in general because it does not need to store the entire factored matrix at once.
This method, like all multifrontal methods, is in essence an \out of core" method in that the Cholesky factor is kept in a data structure that is not referred to while doing arithmetic, all of which is done on dense submatrices. The novelty here is the factorization of many of these dense problems in parallel; the simultaneous exploitation of the parallelism available within each of the dense factorizations; the use of a two-dimensional grid of processors for these parallel dense factorizations; the use of the machine's router for parallel transfers from the matrix storage data structure; and the use of the combining scatter operations for parallel update of the matrix storage data structure.
4.1. The Grid Cholesky algorithm. 4.1.1. Block Jess and Kees reordering. First we describe an equivalent reordering of the chordal graph G = G (A) that we call the block Jess/Kees ordering. Block Jess/Kees is a perfect elimination ordering that has two properties that make it the best equivalent reordering for our purposes: It eliminates vertices with identical monotone neighborhoods consecutively, and it produces a clique tree of minimum height.
Our reordering eliminates all the simplicial vertices of G simultaneously as one major step. In the process, it partitions all the vertices of G into supernodes. Each of these supernodes is a clique in G , and is a simplicial clique when its component vertices are about to be eliminated. Each vertex is labeled with the stage, or major step number, at which it is eliminated. In more detail, the reordering algorithm is as follows.
procedure Reorder The cliques are the nodes of a clique tree whose height is h, one less than the number of major elimination steps. The parent of a given clique is the lowest-stage clique adjacent to the given clique.
The name \block Jess/Kees" indicates a relationship with an algorithm due to Jess and Kees 21] that nds an equivalent reordering for a chordal graph so as to minimize the height of its elimination tree. The original (or \point") Jess/Kees ordering eliminates just one vertex from each simplicial clique at each major step. (This is a maximum-size independent set of simplicial vertices.) Each step of point Jess/Kees produces one level of an elimination tree, from the leaves up. The resulting elimination tree has minimum height over all perfect elimination orders on G . Our block Jess/Kees eliminates all the simplicial vertices at each major step, producing a clique tree one level at a time from the leaves up. This ordering may not minimize the height of the elimination tree. However, as Blair and Peyton 4] have shown, it does produce a clique tree of minimum height over all perfect elimination orders on G .
Every vertex is included in exactly one supernode. We number the supernodes as 4.2. Multiple dense partial factorization. In order to make this approach useful, we need a fast dense matrix factorizations on two-dimensional VP sets. To that end, we discuss an implementation of LU decomposition without pivoting. (We use LU instead of Cholesky here because we can see no e cient way to exploit symmetry with a two-dimensional machine; moreover, LU avoids the square root at each step and so is a bit faster.) We analyzed and implemented two methods: a systolic algorithm that uses only nearest neighbor communication on the grid, and a rank-1 update algorithm that uses row and column broadcast. With either of these methods, all the submatrices A(K; K) corresponding to supernodes at a given stage are distributed about the two-dimensional playing eld simultaneously (each as a separate \baseball diamond"), and the partial factorization is applied to all the submatrices at once. We describe the rank-1 algorithm in terms of its e ect on a single submatrix A(K; K), with a supernode of size and a Schur complement of size . A single rank-1 update consists of a division to compute the reciprocal of the diagonal element, a scan down the columns to copy the pivot row to the remaining rows, a parallel multiplication to compute the multiplier for each row, another scan to copy the multiplier across each row, and nally a parallel multiply and subtract to apply the update. The number of rank-1 updates is , the size of the supernode.
An entire stage of partial factorizations is performed at once, using segmented scans to keep each factorization within its own \diamond." The number of steps on the playing eld at stage s is s , the maximum value of over all supernodes at stage s. Then a stage of rank-1 partial factorization takes time (c 3 + c 4 ) : Here c 3 is 2 s , and c 4 is proportional to s as well.
The relative cost of the various parts of the rank-1 update code are summarized below, for a complete factorization (that is, one in which = 0). The bookkeeping includes nearest-neighbor communication to move three one-bit tokens that control which processors perform reciprocals, multiplications, and so on at each step.
Scans (row and column broadcast): 79.7% News (moving the tokens):
5.5% Multiply (computing multipliers):
2.7% Divide (reciprocal of pivot element):
7.1% Multiply-subtract (Gaussian elimination):
4.8%
Unlike the rank-1 implementation, the systolic implementation never uses a scan; all communication is between grid neighbors. Thus its communication terms are proportional to rather than . This advantage is more than o set by the fact that 3 s + 2 s sequential iterations are necessary, while the rank-1 method only needs s .
4.2.1. Remarks on dense partial factorization. Theoretically, systolic factorization should be asymptotically more e cient as machine size and problem size grow without bound, because scans must become more expensive as the machine grows. Realistically, however, the CM happens to have 6 ; for a full factorization ( = 0) a sixfold decrease in communication time per step more than balances the threefold increase in number of steps, and so the systolic method is somewhat faster. But for a partial factorization the rank-1 algorithm is the clear winner. For example, for the two-dimensional model problem the average Schur complement size s is about 4 s , so the rank-1 code has an 11-to-1 advantage in number of steps. This more than makes up for the fact that scan is much slower than grid communication.
It is interesting to note that the only arithmetic that matters in a sequential algorithm, the multiply-subtract, accounts for only 1=21 of the total time in the rank-1 parallel algorithm. Moreover, only 1=3 of the multiply-subtract operations actually do useful work, since the active part of the matrix occupies a smaller and smaller subset of the playing eld as the computation progresses. This gives the code an overall e ciency of one part in 63 for LU and one part in 126 for Cholesky. We have found this to be typical in *lisp codes for matrix operations, especially with high VP ratios. The reasons are these: scan is slow relative to arithmetic; the divide and multiply operations occur on very sparse VP sets; and the VP ratio remains constant as the active part of the matrix gets smaller.
Signi cant performance improvements could come from several possible sources.
A more e cient implementation of virtual processors could improve performance substantially: the loop over V virtual processors should be restricted to those virtual processors that are active. Sometimes a determination that this is going to happen can readily be made at compile time. As an alternative, we could have rewritten the code; the VP set could shrink as the matrix shrinks, and the divide and the multiplies could be performed in a sparser VP set.
The scans could be sped up considerably within the hypercube connection structure of the CM. Ho and Johnsson 19] have developed an ingenious algorithm that takes O(b=d + d) time to broadcast b bits in a d-dimensional hypercube, in contrast to the scan which takes O(b + d). The copy scans could also be implemented so that each physical processor gets only one copy of the data rather than V copies.
More e cient use of the low-level oating-point architecture of the CM-2 is possible. The Paris instruction set hides the fact that every 32 physical processors share one vector oating-point arithmetic chip. Performing 32 oating point operations implies moving 32 numbers in bit-serial fashion into a transposer chip, then moving them in parallel into the vector chip, then reversing the process to store the result. While this mode of operation conforms to the one-processor-per-data-element programming model, it wastes time and memory bandwidth when only a few processors are actually active (such as computing multipliers or diagonal reciprocals in LU), since 32 memory cycles are required just to access one oating-point number.
This mode also requires intermediate results to be stored back to main memory, precluding the use of block algorithms 3] that could store intermediate results in the registers in the transposer. Thus the computation rate is limited by the bandwidth between the transposer and memory (about 3.5 giga ops for a 64K processor CM) instead of by the operation rate of the vector chip (about 27 giga ops total).
A more e cient dense matrix factorization can be achieved by treating each 32-processor-plus-transposer-and-vector-chip unit as a single processor, and representing 32-bit oating-point numbers \slicewise," with one bit per physical processor, so that they do not need to be moved bit-serially into the arithmetic unit. (Viewed this way, we were working with just 256 real processors!) Also, if virtualization is handled e ciently, we need only keep 256 processors busy. At the time this work was done (late in 1988) the tools for programming on this level were not easily usable. While CM Fortran allows this model, it does not allow scans and scatter with combining, so we still (early 1991) cannot use it. 4 .3. CM implementation of Grid Cholesky. Grid Cholesky uses two VP sets with di erent geometries: the matrix storage stores the nonzero elements of A and L (doing almost no computation), and the playing eld is where the dense partial factorizations are done. The top-level factorization procedure is just a loop that moves the active submatrices to the playing eld, factors them, and moves updates back to the main matrix. The stage at which j occurs in a supernode.
updates Working storage for sum of incoming updates. We use a scatter to move the active columns from matrix storage to the playing eld. The supernodes C are disjoint, but their neighboring sets adj(C) may overlap; that is, more than one Y C may be computing updates to the same element of L at the same stage. Therefore, we use scatter-with-add to move the partially factored matrix from the playing eld back to matrix storage. 4 .3.2. The playing eld. The second VP set, called the playing eld, is the two-dimensional grid on which the supernodes are factored. In our implementation it is large enough to hold all the principal submatrices for all maximal cliques at any stage, although it could actually use di erent VP ratios at di erent stages for more e ciency. Its size is determined as part of the symbolic factorization and reordering. The pvars used in this VP set are dense a
The playing eld for matrix elements.
update dest
The matrix storage location (processor) that holds this matrix element; an integer pvar array indexed by stage. as well as some boolean ags used to coordinate the simultaneous partial factorization of all the maximal cliques.
The processors of the playing eld compute LU factorizations by simultaneously doing rank-1 updates (see Section Section 4.2) of all the dense submatrices stored there, using segmented scans to distribute the pivot rows and columns within all submatrices at the same time. The number of rank-1 update steps is the size of the largest supernode at the current stage. The submatrices may be di erent sizes; each matrix only does as many rank-1 updates as the size of its supernode.
In order to use this procedure we need to nd a placement of all the submatrices A(K; K) for all the maximal cliques K at every stage. This is a two-dimensional bin packing problem. In order to minimize CM computation time, we want to pack these square arrays onto the smallest possible rectangular playing eld (whose borders must be powers of two). Optimal two-dimensional bin packing is in general an NP-hard problem, though various e cient heuristics exist 9]. Our experiments use a simple \ rst t by levels" heuristic. This layout is done during the sequential symbolic factorization, before the numeric factorization is begun. where c 6 and c 7 are constants (in fact c 6 = 2), and the subscript s indicates that the value of is taken in the playing eld VP set at stage s. The VP ratio in this VP set could be approximately the ratio of the total size of the dense submatrices at stage s to the number of processors, changing at each stage as the number and size of the maximal cliques vary. However in our implementation it is simply xed at the maximum of this value over all stages. Again, to get a feeling for this analysis let us consider the model problem, a vepoint nite di erence mesh on a k k grid ordered by nested dissection. For this problem n = k 2 , h = O(logk), and (L) = O(k 2 logk). The factorization requires O(k 3 ) arithmetic operations. Size of largest maximal clique C adj(C) at stage s. P ( C + C ) 2 Total area of all dense submatrices A(K; K) at stage s.
The VP ratio in matrix storage for the model problem is O( (L))=p = O(k 2 logk=p), so the matrix storage time is O(k 2 log 2 k=p). Our pilot implementation uses the same size playing eld at every stage. According to Table 2 , a playing eld of size d d25k 2 e e su ces if the problems can be packed in without too much loss. The VP ratio is O(k 2 =p). The sum over all stages of s is O(k) (in fact it is 3k+O(1)), so the playing eld time is O(k 3 =p). In summary, the total running time of Grid Cholesky for the model problem is O k 2 log 2 k p + k 3 p : Two things are notable about this: First, the performance, or ratio of sequential arithmetic operations to time, is O(p); the log k ine ciency of Router Cholesky has vanished. This is because the playing eld, where the bulk of the computation is done, has a lower VP ratio than the matrix storage structure. Second, and much more important in practice, the router speed appears only in the second-order term. This is because the playing eld computations are done on dense matrices with more e cient grid communication. This means that the router time becomes less important as the problem size grows, whether or not the number of processors grows with the problem size.
One way of looking at this analysis is to think of increasing both problem size and machine size so that the VP ratio remains constant. Then the model problem requires O(k) total parallel operations, but only O(log k) router operations. The analysis of the model problem carries through (with di erent constant factors) for any two-dimensional nite element problem ordered by nested dissection; a similar analysis carries through for any three-dimensional nite element problem. Thus, Grid Cholesky is a \scalable" parallel algorithm.
4.5. Grid Cholesky performance: Experiments. Here we present experimental results from a fairly small model problem, the matrix arising from the vepoint discretization of the Laplacian on a square 63 63 mesh, ordered by nested dissection. This matrix has n = 3,969 columns and 11,781 nonzeros (counting symmetric entries only once). The Cholesky factor has (L) = 86,408 nonzeros, a clique tree with h = 11 stages of supernodes, and takes 3,589,202 arithmetic operations to compute.
The matrix storage VP set requires 128K VPs. The xed-size playing eld requires 256 512 VPs. The results quoted here are from 8,192 processors, with oating-point coprocessors, of the machine at NASA. Both VP sets therefore had a VP ratio of 16. (A larger problem would need a higher VP ratio in the matrix storage than in the playing eld.)
We observed a running time of 6:13 seconds. Of this, 4:09 seconds was playing eld time (3:12 for the scans, 0:15 for nearest-neighbor moves of one-bit tokens, and 0:82 for local computation). The other 2:04 seconds was matrix storage time, consisting mostly of the four scatters at each stage. Our analytic model predicts playing eld time to be about 3k (2 + 4) PF . This comes to 4:0 seconds, which is in good agreement with experiment. The model predicts matrix storage time of about h 4 MS . This comes to between 1:5 and 4:7 seconds, depending on which value we choose for . In fact 3=4 of the router operations are scatters with no collisions, and the other 1=4 are scatter-with-add, typically with two to four collisions. The t to the model is therefore quite close. 4 .6. Remarks on Grid Cholesky. On this small example , Grid Cholesky is about 20 times as fast as Router Cholesky. It is, however, only running at :586 mega ops on 8K processors, which would scale to 4:68 mega ops on a full 64K machine. A larger problem would run somewhat faster, but it is clear that making Grid Cholesky attractive will require major improvements. Some of these were sketched in Section 4.2.1.
A rst question is whether Grid Cholesky is a router-bound code like Router Cholesky. For the small sample problem the relative times for router and non-router computations are as follows.
Moving data to the playing eld: 12% Factoring on the playing eld: 67% Moving Schur complements back to matrix storage: 21%
Evidently, the Grid Cholesky code is not router-bound for this problem. For larger (or structurally denser) problems this situation gets better still: for a machine of xed size, the time spent using the router grows like O(log 2 k) while the time on the playing eld grows like O(k 3 ) for a k k grid, as we showed above. If we solved the same problem on a full-sized 64K processor machine, the relative times would presumably be the same as above; but if we solved a problem 8 times as large the operation count would increase by a factor of about 22 while the number of stages, and router operations, would increase only by a factor of about 1.3. Next, we ask whether our use of the playing eld is e cient or not. The number of parallel elimination steps on the playing eld is given by h X s=1 s ; which is 177 for the example. On the playing eld of 131,072 processors this allows us to do 22:8 10 6 multiply-adds or 45:6 10 6 ops. The number of ops actually done is is 3:69 10 6 , so the processor utilization is 7:9%. There are several reasons for this loss of e ciency:
The algorithm does both halves of the symmetric dense subproblems (factor of 2); the implementation uses the same playing eld size at every level (factor of about 4=3); the architecture forces the dimensions of the playing eld to be powers of two (factor of about 4=3); the playing eld is not fully covered by active processors initially, and as the dense factorization progresses processors in the supernodes fall idle (factor of about 7=2). These e ects account for a factor of roughly 12.4, which is consistent with our measured result; 1=:079 = 12:6. In other words, every step must use all 131,072 virtual processors, but on average we only have work for about 10,500 of them. Thus, the Paris virtualization method cost us a factor of roughly (131; 072=10; 500) = 12:5.
A similar analysis shows that virtualization slows the use of the router by a factor of 7:6.
We summarize the reasons that our achieved performance is so far below the CM's peak:
Virtualization. We used 2 17 virtual processors. On average, we make use of 10,500 on the playing eld and 5,600 in matrix storage. In actuality, there are 256 physical ( oating-point) processors in the machine we used. The slow router. Communication costs for scans on the playing eld, using the built-in suboptimal algorithm. The SIMD restriction. This causes us to have to wait for the divides and multiplies. (Since there are very few active virtual processors during these tasks, most of this e ect could also be removed by e cient virtualization). The Paris instruction set, which makes reuse of data in registers impossible, thus exposing the low memory bandwidth of the CM. Table 3 gives an upper bound on the improvement that removal of each of these impediments to performance would provide. In each case, we hypothesize an \ideal" machine in which the corresponding cost is completely removed. Thus, for example, the statistics for the third row of the table are for a marvelous machine in which a router operation takes no time whatever. Clearly, good e ciency is possible, even on an SIMD machine with a router. The chief problems are the Paris virtualization method; the lack of a fast broadcast in the grid; and the memory-to-memory instruction set. All of these are peculiarities of Paris on the CM-2; none is fundamental to SIMD, data parallel computing. 5 . Conclusions. We have compared two highly parallel general-purpose sparse Cholesky factorization algorithms, implemented for a data parallel computer. The rst, Router Cholesky, is concise and elegant and takes advantage of the parallelism present in the elimination tree, but because it pays little attention to the cost of communication it is not practical.
We therefore developed a parallel algorithm, Grid Cholesky, that does most of its work with grid communication on dense submatrices. Analysis shows that the requirement for expensive general-purpose communication grows only logarithmically with problem size; even for modest problems the code is not limited by the CM router. Experiment shows that Grid Cholesky is about 20 times as fast as Router Cholesky for a moderately small sample problem.
Our pilot implementation of Grid Cholesky is not fast enough to make the Connection Machine cost-e ective for solving generally structured sparse matrix problems. We believe, however, that our experiments and analysis lead to the conclusion that a parallel supernodal, multifrontal algorithm can be made to perform e ciently on a highly parallel SIMD machine. We showed in detail that Grid Cholesky could run two to three orders of magnitude faster with improvements in the instruction set of the Connection Machine. of our pilot implementation from the 27-giga op theoretical peak performance of a 64K processor CM yawn somewhat less dauntingly.
Most of these improvements are below the level of the Paris virtual processor abstraction, which is to say below the level of the assembly-language architecture of the machine. Although TMC has recently released a low-level language called CMIS in which a user can program below the virtual-processor level, we believe that ultimately most of these optimizations should be applied by high-level language compilers. Future compilers for highly parallel machines, while they will support the data-parallel virtual processor abstraction at the user's level, will generate code at a level below that abstraction.
Although Grid Cholesky is more complicated than Router Cholesky, we are still able to use the data parallel programming paradigm to express it in a straightforward way. The high-level scan and scatter-with-add communication primitives substantially simpli ed the programming. The simplicity of our codes speaks well for this data parallel programming model. In summary, even though our pilot implementation is not fast, we are nonetheless encouraged | about the Grid Cholesky algorithm and about the potential of data parallel programming and highly parallel architectures for solving unstructured problems. We believe that future generations of highly parallel machines may allow e cient parallel programs for complex tasks to be written nearly as easily as sequential programs. To get to that point, there will have to be improvements in compilers, instruction sets, and router technology. Virtualization will have to be implemented without sacri cing e ciency.
We mention four avenues for further research. The rst is scheduling the dense partial factorizations e ciently. The tree of supernodes identi es a precedence relationship among the various partial factorizations. Our simple approach of scheduling these one level at a time onto a xed-size playing eld is not the only possible one. There is in general no need to perform all the partial factorizations at a single level simultaneously. It should be possible to use more sophisticated heuristics to schedule these factorizations onto a playing eld of varying VP ratio, or even (for the Connection Machine) onto a playing eld considered as a mesh of individual vector oating-point chips. Some theoretical work has been done on scheduling \rectangular" tasks onto a square grid of processors e ciently 8].
The second avenue is improving the time spent in the matrix storage VP set. Of course, as problems get larger this time becomes a smaller fraction of the total. At present matrix storage time is not very signi cant even for a small problem, but it will become more so as the playing eld time is improved.
Third, we mention the possibility of an out-of-main-memory version of Grid Cholesky for very large problems. Here the clique tree would be used to schedule transfers of data between the high-speed parallel disk array connected to the CM and the processors themselves.
Fourth and nally, we mention the possibility of performing the combinatorial preliminaries to the numerical factorization in parallel. Our pilot implementation uses a sequentially generated ordering, symbolic factorization, and clique tree. We are currently designing data parallel algorithms to do these three steps 18].
We conclude by extracting one last moral from Grid Cholesky. We nd it interesting and encouraging that the key idea of the algorithm, namely partitioning the matrix into dense submatrices in a systematic way, has also been used to make sparse Cholesky factorization more e cient on vector supercomputers 32] and even on workstations 29]. In the former case, the dense submatrices vectorize e ciently; in the latter, the dense submatrices are carefully blocked to minimize tra c between cache memory and main memory. We expect that more experience will show that many techniques for attaining e ciency on sequential machines with hierarchical storage will turn out to be useful for highly parallel machines.
