INTRODUCTION
In this paper we present a concurrent algorithm for factoring a dense matrix on a local-memory multiprocessor. The algorithm is based on Gaussian elimination with partial pivoting. and some earlier work using a hypercube simulator can be found. in 121. It should be noted that we treated the hypercube network as a complete graph in [2] , whereas the hypercube topology is exploited in our current work for global communication among the node processors. We have made substantial use of some ideas in previous work by Geist [5] . and Geist and Heath [71, and we assume that the reader is familiar with the arguments and results found there.
In [5] , arguments are presented to support implementing Gaussian elimination with partial pivoting in a row-oriented manner. That is, the rows of the matrix are distributed to each node processor, and the computation is performed row by row. One of the original motivations in considering the row-oriented factoring algorithm is that the resulting triangular systems can be conveniently solved on the multiprocessor. Recently, two column-oriented schemes for the solution of triangular systems are proposed by Romine and Ortega [ll] . and Li and Coleman [lo] . Since the performance of the column-oriented triangular solvers is comparable to the row-oriented schemes, she user can choose a different scheme under different context. Geist and Romine 181 have recently devised a variation of the partial pivoting scheme described in this paper. A few comments about their scheme are contained in section 4.
A disadvantage of the row-oriented factoring scheme is that determining the pivot element is difficult, because the rows are distributed among multiple processors. Geist's original solution involves using the (otherwise more-or-less idle) host to aid in determining the pivot, and he is able to effectively overlap the associated communication cost with the computation in his implementation of the factoring algorithm for a hypercube simulator. Subsequently, Geist and Heath [71 find that the performance of the same algorithm on an Intel iQsC hypercube can be quite different from the simulator results. In particular, because the host has limited buffer space. and the host-to-node communication is sequential and substantially slower than the node-to-node communication. it is no longer practical to involve the host in the process of pivot selection. In the solution proposed by Geist and Heath. a node processor is designated as the manager with the extra responsibility of determining the pivot and informing all other nodes of the pivot row number.
In addition, several distinct schemes can be used for global node-to-node communication. In [7] , three different implementations are discussed. One approach is to simply ignore the hypercube interconnection topology. Whenever a processing node needs to synchronize or communicate with another processing node, they will have to exchange messages explicitly. This approach was used in our earlier work when the factoring algorithm was implemented for a hypercube simulator [21. The timing results reported in 171 indicate that such a communication scheme is not the most efficient one for broadcasting a message to all nodes in the Intel hypercube network. The scheme recommended in [7] embeds a minimum spanning tree in the hypercube interconnection network. The root of the spanning tree will always be the processing node which initiates the broadcast. That is, if a message is to be broadcast to all others from a particular node. this node will send the message to its immediate descendants only. Each other node will wait to receive the message from a parent node and then send the received message to its immediate descendants. Using this scheme. the message being broadcast will be transmitted across each link of the spanning tree exactly once. and the maximum path length will be the height of the spanning tree. which is the dimension of the hypercube network being employed. In the algorithm to be proposed in section 2 of this manuscript, the latter approach is adopted in effecting the global node-to-node communication.
A potential weakness of the implementations in [5] and [7] is that they do not deal with the possible "unbalancing" of the computation that could be caused by an unfortunate sequence of pivot choices. Geist and Heath report in [7] that their experience suggests that the cost of this imbalance will normally be low, typically around 5% to 15% in execution time in the factorization phase. However. there are examples where it is much higher, We discuss this in more detail in section 2.
Our contribution is to show that Geist and Heath's general approach can be modified so that the node processor loads remain balanced. irrespective of the pivot sequence chosen. In addition, using some reasonable assumptions about the probability of pivoting occurring, a model for the communication costs of the algorithm is developed, along with an analysis of the computation performed in each node processor. This model is then used to derive the expected speed-up of the algorithm.
Finally, experiments using an Intel iPSG hypercube multiprocessor are presented in order to demonstrate that load balancing can be achieved economically. and the performance of the parallel factoring algorithm can be predicted by the model.
THEALGORITHM
We consider the problem of factoring a dense matrix A on a local-memory multiprocessor having p processing nodes, where p is assumed to be much less than n, the order of the system. The algorithm is based on a row variant of Gaussian elimination with partial pivoting. A serial version of the algorithm (in pseudo code) is given below. At the end of the execution of the algorithm, the coefficient matrix A is overwritten by the triangular factors. As usual with parallel algorithms, we would like to achieve a balanced distribution of work load and a low volume of data movement and communication. A uniform work load distribution and a low communication cost contribute directly to the speed-up, which is the ultimate goal of a concurrent algorithm.
Since there is no globally shared memory. the data must be distributed among the processing nodes in some way. typically either by rows or by columns. I n either care, there is a decision to be made concerning the way in which the rows or columns are mupped onto the processors. For example, block-mapping may be used, where the first n / p rows (or columns) are assigned to processor 1. the next n / p rows (or columns) are assigned to processor 2, and so on. Alternatively. wrapmapping may be used, where consecutive rows (or columns) are assigned to consecutive processors. with assignment "wrapping around" to processor 1 after a row (or column) is assigned to processor pTo reiterate. there are two issues: whether the data are distributed by rows or columns, and the way in which the rows (or columns) are mapped onto the processors.
Discussion about various mapping strategies can be found in 13.6.91. In the case of column-oriented Cholesky decomposition or column-oriented Gaussian elimination with partial pivoting, the work load distribution is statically determined by the initial data mapping. It was found in [3.6,9] that either wrap-mapping or reflection mapping is quite effective in this context. However, using a row-oriented scheme, such as that proposed in 151 and 171, with the pivot row dynamically chosen during the factorization process, the work load distribution is no longer dictated by the initial data mapping.
For example, irrespective of the way in which the rows are mapped onto the processors. it is possible that the first n/p pivot rows are chosen from the same processor, after which it would remain idle until the factorization was completed. In the worst case. the permutation in effect can turn an initial wrapmapping into a block-mapping. which can lead to a 50% increase in execution time [SI.
Of course, in general this is unlikely to occur in practice. Instead, as Geist argues, it is more likely that the pivots will be selected more or less evenly from all of the processors, and the work load will therefore remain roughly balanced. Nevertheless. the penalty can be significant, and can be quite substantial for an unfortunate sequence of pivot choices.
The algorithm proposed in this article eliminates this penalty by dynamically balancing the load. It does so by explicitly performing the row interchanges so that each processor node retains approximately the same number of uneliminated rows. We shall see that the modest amount of communication involved in performing the interchanges can be done without affecting the overall execution time very much.
An additional advantage of the dynamic load balancing scheme is that at the end of the numerical factorization, each processing node will have the rows of the triangular factors of a permuted form of A ready for the solution phase. Therefore, the triangular solution module can be designed independently by assuming that the rows of the triangular factors of the permuted A have been distributed among the nodes using a wrap-mapping scheme. This automatically eliminates the similar unbalanced load problem in the solution phase, and it is easier to organize the computation. For more details about the parallel solution of triangular systems of equations. the readers are referred to Romine and Ortega [ll], and Li and Coleman [lo] .
In the following description of the algorithm, the action of fan out refers to the spanning tree broadcast discussed in section 1, and fan in refers to the reverse communication from leaf nodes to the root of the spanning tree. Therefore. the fun in action is initiated by the leaf nodes -each leaf node sends the message to its parent, and each intermediate node will first wait to receive the messages from all its immediate descendants. combine or update the message, and then send the updated message to its parent.
Initialization
1. Compose and broadcast the mapping information to all nodes.
2.
Distribute the rows of the coefficient matrix to the nodes according to the mapping information.
Factorization
Repeat the following steps for each elimination stage:
1. Receive one pivot row number from a node processor.
2.
Update the permutation vector in which the pivoting sequence is saved.
NODE PB Initialization
1. Receive the mapping information from the host.
2.
Receive its share of rows from the host.
3.
Determine the maximum absolute value of the first column among the rows it owns, and fan in this value and its row number along a spanning tree rooted at the manager node. At each intermediate node, the (value, row number) pair with the maximum v d w is chosen among all received candidates and the local candidate. and this updated (value, r o w number) pair is sent to the parent.
Factorization (k -th step)
1. Fan out the pivot row number which is determined by the manager node along a spanning tree rooted at the manager.
2. Check whether the k th row and the pivot row are the same. If they are different but are both located in this node, interchange them (without communication cost). If either the k-th row or the pivot row is located in this node, interchange them via message passing. In any case, at most two nodes are involved in the interchange. the number of processing nodes, and log@ is the broadcast path length using a minimum spanning tree scheme.
5.

B
PERFORMANCE ANALYSIS
We shall provide an analytical performance model for the algorithm we proposed in section 2. Comparison will then be made with the actual performance of the program running on an Intel iPSC hypercube multiprocessor. We begin our analysis by identifying and relating various common performance measurements. As usual, our primary objective is to attain maximum speed-up. That is. given a p-processor machine, we would like to solve our problem in time that is as close as possible to a factor of p less than that needed to solve the same problem on a single processor version of the machine, using the best serial algorithm available. We assume that the single processor machine has adequate memory, presumably much more than that available to a single processor in the multiple processor configuration. We also assume that all processors in the machine have the same execution speed, and that, all processors are started simultaneously and stop at the same time as the process that h i s h e s last. It should be noted that the latter assumption does not impose any restriction on the model, because when it is not the case the difference for the processor which starts late or finishes earlier can be considered as idle time.
We adopt the following notation.
T, : execution time of the best serial program.
T : execution time of the parallel program running on p nodes.
Tu : the average computation time of a node process. T, : the average time spent by a node process in sendingheceiving messages. Ti : the average idle time of a node process.
We note that and T s P T > T u 3 - 
(3.2)
We shall assess an implementation of a parallel algorithm by its e&ciency and speedup, where It is now clear that , and T, = Ti = 0. speedup = p and efficiency = 1 iff T, = -Ts P When a node processor i s busy, it is either doing arithmetic computation or sending/receiving data. The idle time of a node processor can be caused by an unbalanced workload and/or by the transmission delays in passing messages. In particular, with the hypercube-like topology, messages tend to traverse different paths of different length and so exhibit varied latencies. We note that the computation and data communication on any individual processor must be carried out sequentially. and the execution time of a parallel program is determined by the process that finishes last.
COMMUNICATION MESSAGE COMPLEXITY
We first derive the average number of messages sent and received by an individual node processor during the process of numerical factorization. For concreteness, we consider solving a system of order n on a multiprocessor of p nodes, where p is much less than n .
For convenience. we assume that n is an integral multiple of p .
For the wrap-mapping scheme we have chosen to employ, each node processor will be allocated n / p rows of the coefficient matrix A . which are to be overwritten by the corresponding rows of the triangular factors of a permuted form of A . The map is composed in the host and sent to each node processor. For every elimination step, every node must participate in sending the local pivot candidate to the manager node and broadcasting the pivot row number and the pivot row to every node on the machine. Using the broadcast scheme based on a minimum spanning tree, a node must first receive the message from its immediate descendants (or parent) and then send the message to its parent (or immediate descendants). We note that a node of the spanning tree has cT= immediate descendants on average, where d = logzp = dimension of the hypercube ~ Thus we can approximate the total number of messages sent and received by one node as given below.
With the dynamic load balancing scheme, we have to account for the communication cost of permuting rows residing in two different nodes. Recall exchange will involve two nodes only. If we assume that this is necessary possible that the at every elimination stage. then each node will have to perform the exchange at least. n l p times. Since the future exchanges can not be predicted. it is possible that all of the ( p -1 ) nodes will exchange with a specific node for the first n -1 stages, resulting in an upper bound of (n -1) exchanges. We therefore have n n -< NodeSend,,., < n -1. with an average z + -P 2 2P and n n -< NodeRecv,,, d n -I. with an average * 2-+ -
(3.9) (3.10)
In addition, one node processor will send the chosen pivot row number to the host at each elimination stage. According to the algorithm we present in section 2. this is performed by the processor which owns the k -th row at the k -th stage of elimination. Since each node is allocated n / p rows, we have
From equations (3.5) to (3.11). we obtain an estimate of the total number of messages sent and received by each node processor on average, as shown below.
AvgSendndefct *(lS + 20-1 n + 1.5-tk -2u-1 AvgRecvdefb z ( 2 . 5 + u)n -0.5-n -u-2 (3.12) (3.13) P P
ESTIMATION OF EFFICIENCY
Our aim is to provide an analytical performance model for the parallel program running on the hypercube multiprocessor. From equations (3.1) to (3.4) . we see that the speed-up and e&iency can be determined given T,. the execution time of the best serial program, T, , the average node computation time, T, , the average node communication time, and T i , the average node idle time. Our task is therefore to provide estimates for T, , T, , T, , and Ti. (for each permuted row in a 200x200  matrix) . According to the measured data rate for Intel hypercube, the round-trip time required by one node to send a message of size 0.5 Kbytes to itself is 1700 psec, and the time for performing one: multiplicative floating-point operation on the node processor is 45 psec. We therefore use the following p value in our analytical model: Recall that the idle time of a node processor is either caused by an unbalanced work load or by the transmission delays in message passing. In approximating the average idle time, T i . we shall assume a perfect load balancing and estimate Ti by the transmission delays only. This would be the case if p is much less than n . Since the transmission delay is proportional to the message size, we first give the type and the average size of each message to be received by a node processor as below.
Pivot row number
During the course of computation. we assume that whenever the data is needed, it is at most d links away, where d is the dimension of the hypercube and hence the height of the broadcast spanning tree. Since the average idle time will be approximated in units of multiplicative operations. we need to determine the value of a. which is the ratio of the time for transmitting one floating-point number across one link to the time of one multiplicative operation. According to message timing experiments performed on the Intel iPSC hypercube, it is reasonable to assume a to be 1 for the range of problems in our numerical experiment. From equations (3.6). (3.8) and (3.10). we obtain 
I I
We can now estimate the a d y t i c a l efficiency as shown below. Our experiments were performed on an Intel iPSC hypercube multiprocessor. The algorithm is implemented in the C language. The maximum run time of an individual node processor is reported for each test problem. We note that the time reported does not include the time for initialization and data generation. Table 1 compares the results from performing Gaussian elimination with partial pivoting on randomly generated matrices and diagonally dominant matrices. Since row interchanges will not occur in the latter case, the difference in their respective performance indicates the effectiveness of our dynamic load balancing scheme. From the experimental results presented in Table 1 , we see that the increase of execution time is very modest for the test problems. In the following tables, "migrations" indicate the number of row interchanges between different nodes. Note that on a multiprocessor with p nodes, assuming the pivot row i s equally likely to reside on any of the p nodes, the probability of migrations occurring at every elimination stage is (1 -l/p 1. This explains why the number of migrations for the randomly generated matrices is approximately (1 -l/p)n .
Since the interchange of rows actually happens a large number of times for the test problems, the dynamic load balancing scheme appears to be very effective.
Geist and Romine [8]
have recently implemented a refined version of this laadbalancing scheme. Their scheme reduces the number of migrations and incurs some overhead in doing so. Whether a net reduction in execution time is achievable will be explored in a future report by Geist and Romine 181.
the hypercube efficiency for problems with n and p varied in a wide range. Since the limited memory on a single node processor prevents us from obtaining the serial time for a matrix of order much greater than 200. the comparison can be made for the test problems in the range of order 32 to 200 only. 
CONCLUSIONS
The dynamic load balancing scheme we have incorporated in our implementation of Gaussian elimination with partial pivoting for a local-memory multiprocessor is effective in maintaining a balanced load distribution throughout the factoring process with very modest communication cost. and the overall execution time is not affected very much for the test problems discussed in section 4.
The analytical model we developed in section 3 predicts the performance of the parallel factoring algorithm well when it is applied to an Intel hypercube multiprocessor. Since different hypercubes may have different execution and data rates, different start-up cost for communications. and send messages in packets of different size, the parameters a! and / 3 will have to be chosen specifically for the target machine when applying the model to a different hypercube.
Although recently Chamberlain [l] suggests another scheme to perform the roworiented Gaussian elimination with partial pivoting on a hypercube that does not involve actually interchanging the rows among processors. the dynamic load balancing scheme proposed here may be useful in other applications where interchange of rows or columns is necessary. For example, it is well known that column pivoting is necessary when applying Householder reduction to determine the numeric rank of a rectangular matrix [4] . The results in this paper suggest that the parallel Householder reduction of a full matrix can be efficiently implemented by distributing the columns to the processing nodes using a wrap-mapping scheme. With column pivoting, the columns of the matrix will have to be interchanged dynamically to maintain load balance, and a load balancing scheme similar to the one described in this paper can be used.
