Minimizing Communication Overhead for Matrix Inversion Algorithms on Hypercubes by Wang, Xiaodong & Roychowdhury, Vwani P.
Purdue University
Purdue e-Pubs
ECE Technical Reports Electrical and Computer Engineering
6-1-1995
Minimizing Communication Overhead for Matrix
Inversion Algorithms on Hypercubes
Xiaodong Wang
Purdue University School of Electrical and Computer Engineering
Vwani P. Roychowdhury
Purdue University School of Electrical and Computer Engineering
Follow this and additional works at: http://docs.lib.purdue.edu/ecetr
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact epubs@purdue.edu for
additional information.
Wang, Xiaodong and Roychowdhury, Vwani P., "Minimizing Communication Overhead for Matrix Inversion Algorithms on





Overhead for Matrix Inversion 
Algorithms on Hypercubes 
Xiaodong Wang and Vwani P. Roychowdhury2 
School of Electrical and Computer Engineering 
Purdue University 
West Lafayette, IN 47907, USA 
Abstract 
The mirin contribution of this report is the development of novel algorithms {that make efficient 
use of the communication system in distributed memory architectures with plrocessing elements 
interconnected by a hypercube network. These algorithms achieve almost optirr~al overlap of com- 
municatior~ delays by computation, leading to  a minimization of communicatioi~ overhead. Rigor- 
ous ana1yt:ical and numerical performance analysis of our parallel algorithms are presented as well. 
The parallel algorithm under study in this report is the parallel Gauss-Jordan matrix inversion 
algorithm. Many of the ideas introduced in this report, however, apply to other linear algebra 
algorithms as well. 
Parallel Gauss-Jordan matrix inversion algorithms on the hypercube multiprocessors have been 
extensively studied in the literature. Two common data partitioning strategies for matrix algo- 
rithms are row-wise partitioning and submatrix partitioning. It has been claimed that for the 
parallel G,J inversion algorithm, submatrix partitioning scheme exhibit commlinication overhead 
advantages; not shared by partitions limited to rows or columns. Most parallel algorithms proposed 
in the literature, however, do not attempt to overlap inter-processor communication by computa- 
tion. As zr result, the formula execution time=computation time + communication time is used 
to analyze the complexity of the parallel algorithm. However, during most of the communication 
time, the processors are actually idle, waiting for the data to arrive. 
Most c~3mmercially available parallel machines provide communication interrupt handling capa- 
bility. By utilizing this feature, we believe a lot of parallel mat,rix algorithms can be improved by 
overlapping interprocessor communication and computation. In this report we piropose and analyze 
new parallel GJ inversion algorithms under different data partitioning strategies, with or without 
partial pivoting. 
We first propose a new broadcasting algorithm on the hypercube muli,iprocessor for parallel GJ 
algorithm. This algorithm ensures that the data are sent out from the source and arrives at the 
destinations at  the earliest possible time. We then give the parallel GJ inversion algorithm using 
row partitioning. The strategy to overlap communication by computation is for each processor to 
compute arnd send out the data needed by the other processors as early as possible. We prove a 
lower bouind on the matrix size such that data transmission is flilly overlapped by computation. 
We also prove that the message length in the input buffer of each processor is art most 2. 
We also consider the algorithms under submatrix partitioning, with or without pivoting. We 
show that when submatrix partitioning is used, even when t,he communication is fully overlapped 
by compul;ation, the communication overhead is larger than when using row partitioning. Thus, we 
show that by minimizing communication overhead, the row partitioning scheme can indeed have 
better overall performance than the submatrix partitioning scheme. 
Finally we extend the idea of overlapping communication and computation to the parallel LU 
factorization algorithm. 

Cont erlt s 
1 Introcluction 1 
1.1 Assumptions on the Machine Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 
1.2 Algorithmic Communication Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 
1.3 Related Work on Parallel Gauss-Jordan Matrix Inversion Algorithm . . . . . . . . . 3 
1.4 M[otivation for a New GJ Inversion Algorithm . . . . . . . . . . . . . . . . . . . . . . 4 
1.4.1 Sequential GJ inversion algorithm with column interchanges . . . . . . . . . . 4 
1.4.2 A brief review of the subcube-grid parallel GJ inversion algoritl~m in [I] . . . 4 
1.5 Strategy for Overlapping of Communication and Computation . . . . . . . . . . . . 5 
2 Summ1ary of Results 7 
3 Algorithm I: Row Partitioning With Pivoting 9 
4 The SBT; Broadcasting Algorithm 11 
5 Analy:sis of Algorithm I 17 
6 Algorithm 11: Submatrix Partitioning, Without Pivoting 29 
7 Analysis of Algorithm I1 33 
8 Algorithm 111: Submatrix Partitioning, With Pivoting 37 
9 Scalablility Analysis 39 
9.1 The Isoefficiency Metric of Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 39 
9.2 The Isospeed Metric of Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 
10 LU Factorization 41 
11 Conclilding Remarks 45 

1 Intiroduction 
One of the challenges for large-scale parallel processing is to  minimize commnnication overhead. 
Although one can attempt to  reduce it via architectural innovations, physical reality dictates that 
remote access will always be significantly slower than local access. Software tech,niques for reducing 
overhead are therefore essential. 
An efficient approach to reducing communication overhead is to allow communication to overlap 
computation. This helps reducing overhead in two ways: First, when one process is waiting for 
data from a remote process, another ready process maybe scheduled for execution. Secondly, even 
a single process may wait for multiple data items simultaneously, and continue execution whenever 
any of the expected items arrive [32]. Research projects addressing overlap of communication and 
computation have been carried out a t  both hardware and software levels. Research prototypes of 
message driven architectures, such as the J-Machine [33] and Monsoon [34], expend a sigificant 
amount of hardware to integrate communication with compntation. The message driven execution 
style promotes the overlap of communication and computation, and exhibits impressive communi- 
cation performance. For the current message passing architectnres, a communication mechanism, 
active message, has been proposed [31]. This model minimizes the software overhead in message 
passing mirchines and achieves overlap of communication and computation. 
The approaches mentioned above deal with the communication overhead from a general engi- 
neering perspective. We, however, in this report, treat the problem of minimizing communication 
overhead from an algorithmic perspective. That is, given a machine model an~d a computational 
task, how can one develop an efficient parallel algorithm which achieves overlap of communication 
and compntation, and incurs minimum communication overhead. The technique used in this re- 
port is ana.lytic, which provides rigorous quantitative evaluation on the performance of the parallel 
algorithm. We consider the parallel Gauss-Jordan matrix inversion algorithm on message passing 
hypercube multiprocessors. 
1.1 Assumptions on the Machine Model 
Throughout this report, we make the following assumptions about the machine model: 
1. The parallel machine has binary-cube network with all-port communicati~on capability, i.e., 
the hardware supports simultaneous communication on all the channel:;. Machines with 
hypercube network that support all-port communicat~ion include CM-2, Chi-200 and nCUBE 
2. 
2. It is multiple instruction multiple data (MIMD) parallel machine. MIMI) machines include 
nCU:BE 2, iPSC, CM 2, Paragon, etc. 
3. The interprocessor communication is through DMA ~ont~rolled asynchronous message passing. 
Machines with this feature include nCUBE 2, iPSC, Paragon, etc. 
4. The software supports user specified interrupt handling. Intel iPSC, Intel Paragon and CM 
5 have this feature. 
1.2 Algorithmic Communication Model 
The most common cost model used in algorithm design for large-scale multiproc.essor assumes that 
the program alternates between computation and communication phases, and that communication 
requires time linear in the size of the message, plus a setup cost. Consider a simple scenario in 
which all the processors have the same computation and communication loads, and the program 
alternates between computation and communication phases in a synchronous manner. Thus, the 
time to  rurn a program is T = Tcompute + Tcommunicate and Tcommunicate = ~ 2 i ( t a  + twLi), where ta 
is the setulp cost, t, is the time per byte, L; is the ith message length, and Nc is the total number of 
communications. To achieve 90% of the peak processor performance, the algorithm must be tailored 
to  achieve a sufficiently high ratio of computation to  communication, i.e., Tcompute 2 9Tc,mmun;c,~e. 
A high performance network is required to  minimize the comm~inication time, a~nd it sits 90% idle 
1311. 
If com~nunication and computation are fully overlapped, the situation is very different. The time 
to run a program becomes T = max(Tco,p,te + Net,, xz1 t,,L;). Thus, to achieve high processor 
efficiency, the data transmission and computation time need only balance, anti the computation 
time need only swamp the setup overhead, i.e., Tcompute >> Nct, [31]. 
Next we introduce the communication model used this report,, and the concept of full overlap of 
communic,ation and computation. Assuming that at time TI, processor Pl starts sending a message 
of length vn to its adjacent processor P2 (i.e., Pl and P2 is directly connected by communication 
channel). .PI will first spend time t, to  set up the comrn~nicat~ion channel. The message will then 
take additional time twm to reach P2, i.e., the message arrives at P2 at  time Ifl + t, + t,m. PI 
can return to  computation after setting up the communicat~ion channel, i.e., at time Tl + t,. Now 
assume that a t  time T2, processor Pz needs to  read this message from PI. If this message has 
already been in the input buffer of P2, i.e., this message was sent out by Pl at  time TI such that 
Tl + (t, + tuna) 5 T2, then P2 can read this message without any delay. Here we assume it takes 
negligible time to  read the message in the input buffer. Ot,herwise, P2 will rernain idle until the 
message arrives in its input buffer. The idle time is = TI + (2, + t,m) - T2. Therefore the 
communic;~tion overhead of each processor contains two parts: t,he setup time, when the processor 
sets up the communication channel, and the idle time, when the processor is idle, waiting for 
message t c ~  arrive. 
In a pzrrallel computation process on a multiprocessor, if for every processor, the idle time is 
always zero throughout the process (that is, each processor is either doing computation, or setting 
up the communication channel, but never idle), then we say communication is  .fully overlapped by 
computatic~n in this process. Hence our objective is to accomplish this fill1 overlap of communication 
and comp~~tation, and at the same time, to  minimize the t,ot,al setup overhead. 
1.3 Rellated Work on Parallel Gauss-Jordan Matrix Inversion Algorithm 
Matrix operation, such as inversion and factorization, is a standard part of any linear algebra 
library. Such libraries are critical for the evolution of parallel computer into useful tools for large 
scale scientific computations. Two common sequential algorithms for dense matrix inversion are 
Gaussian :Elimination (GE) and Gauss-Jordan (GJ) Elimination. They have the same operation 
count (FLOPS) for inversion. It has been shown that  G J  is significantly more efficient in parallel 
than GE,  because parallel G J  has a more homogeneous work load distributilon and incurs less 
communication overhead [12]. Furthermore, G J  inversion is an in-place matrix illversion algorithm. 
Thus no additional memory is required as work place during the process of G J  elimination. 
Several parallel algorithms have been developed for implementing G J  elimin,ation with or with- 
out pivotiing on distributed-memory architectures. Some of t,hem are fine-grain algorithms devel- 
oped for hypercubes assumed t o  have little setup cost and high communicatioa bandwidth, e-g., 
the Caltech Hypercube [ l l ,  121. The others are medium-grain a lg~r i t~hms  develolped for hypercubes 
with substantial message passing latency, e.g., the commercially available Intel iPSC/l ,  iPSC/2, 
iPSC/860 and nCUBE 2 hypercubes [I, 171. In [I] some of tohe algorit,hms representing recent 
developments in both classes are reviewed. 
I t  was shown in [2] that  the G J  algorithm has essentially different properties when using col- 
umn interchanges instead of row interchanges for improving numerical stability. II more satisfactory 
bound for the residual norm can be derived for solutions obtained by G J  with column interchanges 
instead of row interchanges. In [I.] a medium-grain parallel algorithm for implementing G J  inversion 
with colurnn interchanges on a hypercube multiprocessor was proposed. The hypercube network 
was configured as a twedimensional subcube-grid to  support submatrix partitioinings. The authors 
claimed thlat for G J  inversion with column interchanges on hypercubes with substantial commu- 
nication latency, submatrix partitions had communication advantages not shared by partitions 
limited t o  rows or columns. In their algorithm, no attempt was made to  overlap communication 
and compiit a t  ion. 
Two popular optimization techniques applied to  parallel matrix computations for improving 
concurrence are the send- ahead strategy that  attempt to  communicate data =I early as possible, 
and the compute-and-send-ahead strategy that  computes in advance data which have t o  be com- 
municated [3, 4, 51. Their most important aim is to  avoid t,he processors idleness due t o  the data 
waiting tirne by overlapping communication and computation. In [3] these two optimization strate- 
gies were applied t o  parallel LU decomposition algorithms, with or without p,artial pivoting. In 
their implementation, they used the communication subroutine provided by the software, and it is 
not clear b.ow the overlap is achieved. No analytical performance analysis was provided in their pa- 
per. Their experimental results also showed the ~uperiorit~y of t,he square scattered decomposition 
strategy. 
1.4 Motivation for a New GJ Inversion Algorithm 
1.4.1 Sequential G J inversion algorithm with column interchanges 
We first give the sequential G J  algorithm with column interchanges for inverting an N x N matrix 
A = [a;,j:l. This algorithm performs in-place inversion. An array a l : ~  is used t o  record the 
. . permutation of the columns. Initially a; = z, z = 1,2,  ..., N. At the k-th step of G J  reduction, if 
the pivot column is found t o  be column pivot, then columns k and pivot need t o  be interchanged, 
and we record this interchange by swapping the value of a k  and aPiv,t. Note that  in the algorithm 
we actually do not interchange columns, but rather, we just address the permuted columns using 
the information saved in a .  In this way, some intermediate data movement can be saved. As a 
result, A is overwritten by a row and column permuted form of A-l .  Let B = [bi , j ]  = A - l ,  then 
after applying the algorithm t o  A, a ; j  = b k , r ,  where k = a;, and = j .  
for k = 1 t o  N 
pivot = k 
f o r j = k + l t o N  
if (ak,al ( > lak,a,,,,,l then pivot = j 
a k  ''pivot 
for j = 1 t o  N and j # k 
ak,aj = ak,aj/ak,ak 
for i = 1 t o  N and i # k 
for i = 1 t o  N and i # k 
ai,uk = -ai,ak /ak,ak 
a k , ~ k  = l / a k , U k  
1.4.2 A brief review of the subcube-grid parallel GJ inversion algorithm in [ I ]  
We next briefly review the G J  inversion algorithm proposed in [I] (Thereby we call this algorithm 
Algorithm 0). The matrix is partitioned into submatrices, and the data are distributed among the 
processors using full column and row wrap-mapping. Each column and each row of processors form a 
subcube. Within the main loop, there are three blocks of code corresponding t o  t,he communication 
of the pivot row; the selection and permutation of the pivot colnmn; and the G J  elimination on the 
processor'c~ share of matrix. 
A hypr:rcube of dimension d is configured as a yl x y2 = p = 2d snbcobe-grid. Each subcube 
column has dimension dl = log yl and each subcuberow has dimension d2 = logay2. dl and d2 differ 
at most by one for optimal performance. At each iteration, there are two types of communication 
required fix this algorithm: one to broadcast the pivot row and another to find and broadcast 
the pivot column. The former involves one-to-all broadcast within the subcube-column, and the 
latter involves permutations within the subcube-row. After every processor receives the appropriate 
pivot row tgegment, they each determine the local pivot element and the correspon~ding pivot column. 
Every processor has the pivot element and the corresponding segment of the final pivot column 
after permutations (recursive doubling) within the subcube-row. Since recursivr: doubling within a 
subcube of dimension d2 involves d2 steps of write-read pairs, this portion of the algorithm causes 
a synchronizaiion barrier for all processors. No processor can start computation before the entire 
communication is completed. Therefore this algorithm is implemented in a synchronous manner, 
i.e., in every iteration, a processor starts performing its computation only after the entire data 
communication is completed. 
Assuming that yl = y2 = &, then the communication of the pivot row amounts dl = logp 
one-hop transfers of a message of size N/&. The selection and perm~itat~ion f the pivot column 
amounts t80 d2 = i l o g p  one-hop exchanges of a message of size N / G .  Therefore, the total 
communication overhead for each processor incurred in this algorithm is T,, = N(dl + d2)(t, + 
N t w r )  = N logp(t, + twT). 4 
Since no communication is overlapped by c~mputat~ion, T, appears as a whole in the total 
execution time of the algorithm, i.e., T = T,, + T,, , where T,, is the total computation 
time of each processor. 
1.5 Strategy for Overlapping of Communication and Computation 
The stratr:gy for overlap of communication and cornp~t~at~ion s to let each processor to compute 
and send out the data needed by the other processors as early as possible, so thart these data would 
have arrived at the destinations when they are to be used. This is called compute-and-send-ahead 
strategy and we will elaborate on it later when we give the parallel GJ inversion algorithms. The 
best we cam do to this end is to make the idle time of each processor zero, and to make the total 
setup overhead of each processor as small as possible. 
In [3] t,he send-ahead and the compute-and-send-ahead st,rat,egies have been applied to the par- 
allel LU fzrctorization algorithm to achieve overlap of commllnication and computation. The main 
communication operation of the algorithm is broadcasting t,he pivot rows and the multipliers. Since 
the data i(3 expected to arrive at the destinations at  the earliest possible time, the data broadcast 
process is of the most importance and should be carefully designed. However, in their paper, the 
authors took the broadcast within a subcube (which sends data from one processor to all others 
within the subcube) as a communication primitive, and did not specify the detipils of this process. 
And no ar~alytical performance analysis was given for the algorithms in the paper. 
To accomplish full overlap of data transmission by computation, we need the high level languages 
for the parallel machines to  provide the interrupt handling Whenever a message arrives 
at the input buffer of a processor, the processor is interr~pt~ed from the current computation and 
starts executing the interrupt handling subroutine, which involves forwarding the message to some 
adjacent processors. After that the processor resumes the previous interrupted computation. This 
feature guarantees that the message is passed along the communication tree in the hypercube 
without delay and reaches the destinations at the earliest possible time. For example, for one-to-all 
broadcasting, when the message arrives at a processor, and if this processor is nclt a leaf node of the 
broadcasting tree, then this processor is interrupted from the current computation and forwards 
the message to the adjacent nodes on the hypercube which has not got the message, before it 
resumes tlie interrupted computation. This feature of interrupt handling is crucial for the overlap 
of communication and computation. Throughout this report we assume the underlying machine 
model has this feature. 
2 Surnmary of Results 
In this report, we show that by utilizing the interrupt handling capability of the parallel machine, a 
set of asyn~chronous algorithms can be designed such that the data transmission can be partially or 
fully overlapped by computation. We propose a new one-teal1 broadcast algori~thm on hypercube 
that helps to  achieve overlap of communication and computation for the parallel GJ inversion 
algorithm. 
We give the new parallel GJ inversion algorithms under different data partitioning strategies 
(i.e., row partitioning and submatrix partitioning), with or without partial pivoting. For each a lge  
rithm, we prove a lower bound on the matrix size such that data transmission is fully overlapped by 
computation, and the total communication overhead when full overlap is achieved. Our conclusion 
is that when the matrix size is large enough such that data transmission is fully overlapped by 
computation, ihe r o w  parlilioning slralegy has the lowest  commi~nica2ion overhead. We also give 
the scalabi~lity analysis for the algorithms discussed in this report. Furthermore, we also apply the 
idea of overlap communication and computation to  the parallel LU factorization algorithm. 
I submatrix partitioning 1 1  N(ts + it,%) logp I 
Partition Strategy 
row partitioning 
Table 1: ?'he communication overhead for the parallel GJ inversion algorithm under the row partitioning 
and submat;rix partitioning, when no overlap of communication and computation is attemped. The matrix 
size is N x N ,  t ,  is the communication startup overhead, t ,  is the data transmission rate, p is the number 
of prossors. 
Total Comm. Overhead 
N(ts + ttu N) log P 
Condition I Total Comm. Overhead I 
I submatrix, w/ pivoting I( i lv(1 + logp)ts 
Table 2: Slimmary of key theoretical results in this report, for the conditions of full overlap of communication 
by c~mputa~t ion,  and the total communication overhead of the algorit,hm when full overlilp is achieved, under 
the row or wbmatrix partitioning, with or without pivoting. 
Table I. lists the communication complexities for the parallel GJ inversion algorithm under the 
row partitioning and submatrix partitioning, where no overlap of comm~inicatioin and computation 
is attempted [I, 91. Table 2 summarizes some theoretical results in this report on the conditions of 
full overlap and the total communication overhead when frill overlay is achievedl. 
Figure 1 depicts some numerical results on the total commlinicat,ion overhead of different a lge  
rithms. Note that the vertical axis is log scaled. It is clear from this plot that overlapping data 
communication and computation can greatly reduce the communication overhead of the parallel 
algorithm. Furthermore, when the matrix size is large enough, row partitioning is superior to 
submatrix partioning. 
1 0. 
rubrnatrlx pariltlon. ovrrlap 
(wnhoul plvotlng) 
1 os 
row pariltlon. overlap 
100 200 300 400 500 600 4 0 0  
Meblx Slze N 
Figure 1: Comparison of the communication overheads of four parallel GJ matrix inversion algorithms 
under the row or submatrix partitioning, with or without overlap (Number of processoirs p=16). 
This report is organized as follows: In section 3 we give our parallel GJ inversion algorithm using 
row partitioning, with pivoting (Algorithm I). In section 4 we propose a new one-to-all broadcast 
algorithm on hypercube that helps to achieve overlap of communication and computation. In 
section 5, ,we give the performance analysis of Algorithm I. In section 6 we give Algorithm 11, which 
uses submatrix partitioning and without pivoting. The analysis of Algorithm I1 is given in section 
7. In section 8, we discuss Algorithm 111, which also uses submatrix partitioning but with pivoting. 
In section 9, we give the scalability analysis of our parallel algorithms. In sec:tion 10 we extend 
the idea of overlapping communication and computation tlo parallel LU factoirization algorithm. 
Section 11 concludes this report. 
3 Algorithm I: Row Partitioning With Pivoting 
In this section, we give our algorithm of parallel GJ inversion with column interchanges, using 
row partitioning. Suppose we want to invert an N x N matrix A on a d dimensional hypercube 
containing, p = 2d processors. Assume N = np. We partition the matrix by rows and use wrap- 
mapping to  distribute the rows of A among processors, i.e., rows k, p + k, 2p -t k, - . ., N - p + k 
are assigned to  processor Pk, 1 5 k 5 p. We will discuss how to determine the physical address of 
processor Pk on the hypercube in the next section. We use notation [k] to  represent [(k - 1) mod 
p ] + l ,  thus 15 [k] 5 p .  
When the matrix is partitioned by rows, each row is entirely within one pi.ocessor. Thus the 
search for the pivot element can be done by the single processor which has the pivot row, and the 
recursive tloubling is no longer needed. In the algorithm, during the k-th loop, each processor first 
gets the b-th pivot row of A and the pivot element from pro
c
essor P[k1, and then updates all the 
rows assigned to  it. In order to overlap the commlinication by cornputlation, the pivot row should 
be sent out at the earliest possible time, so that when a processor is to use it,, it would have already 
arrived at that processor. Since processor fik+l1 is assigned row (k + 1) of A, we let it update 
this row ant first during the k-th loop. Then it finds the pivot element and normalizes this row 
(cornpule-ahead). After that it broadcasts this (k + 1)st pivot row and the pivot element to all 
the other processors, before it resumes to update the other n - 1 rows using the k-th pivot row 
(send-aheird). By using this cornpule-and-send-ahead strategy, t,he (k+l)st  pivot row is transmitted 
through the communication channels while all the processors are still updating in the kth loop. 
Thus the overlap of communication and computation is achieved. 
Since the message may take several hops to  reach the destinations, it needs to  be routed to the 
next proc~:ssor without delay upon arrival at an intermediate processor. This is realized by a local 
message routing algorithm and the communication interrupt handling capability of the machine. By 
the local message routing algorithm, each processor knows where to forward the: incoming message 
to. We will discuss this algorithm in the next section. Whenever a message arri~ves at a processor, 
the processor is interrupted from the computation. By identifying the header of the message the 
processor knows which pivot row this message contains. It will then determine where to forward 
this messa,ge to. If it does not need to  forward it, it reslimes its compntation. C)therwise, it has to 
forward the message to some of its adjacent processors, and incurs a setup overhead oft , ,  before 
it resumes its computation. If after completing the k-tlh loop of updating, the (k + 1)st pivot row 
still has n'ot arrived at the processor, then the processor has t,o wait for it before it can start the 
(k + 1)st updating loop. 
The following is pseudo code for the kth step of the parallel GJ inversion algclrithm with column 
interchanges, using row partitioning: 
if ( P  =: Pik+ll) then 
read in the kth pivot row and pivot element; 
update the (k + 1)st row of A; 
findl the (k + 1)st pivot element; 
normalize the (k + 1)st pivot row; 
broadcast the (k + 1)st pivot row; 
updlate the rest (n - 1) rows of A; 
if ( P  =: P[k]) then 
updlate the (n - 1) rows of A (other than the kth row); 
if ( P  #: P[k+ll and P # Ppl) then 
read in the kth pivot row and pivot element; 
upd.ate the n rows of A; 
Intemrpt handling: 
read the incoming data message; 
forward the incoming message to  appropriate processors, if needed; 
4 The SBT; Broadcasting Algorithm 
In order t o  implement the conapuie-and-send-ahead strategy discussed in the previous section, we 
need to  construct a set of p broadcasting trees in the hypercube. At the kth broadcasting in 
Algorithm I,  the root of the tree is PIkJ. Furthermore, each broadcasting tn:e should have the 
following two features: 
1. P[k+l] and P[k] should be directly linked. 
As discussed in the previous section, the ( t  + 1)st row of matrix A is broardcast by processor 
P [k+ l ]  immediatedly after it is updated and normalized in the kth stage of the parallel G J  
inve1:sion algorithm, before P[k+l] starts updating the other rows using the: kth pivot row. To 
broadcast the (k + 1t)st a t  the earliest possible time, processor P[k+l] should receive the kth 
pivot row from processor PIk1 at  the earliest possible time. This implies that P[k+ll and Pik1 
should be adjacent on the hypercube. 
2. P[k+l] should be a leaf node in the kth broadcasting tree. 
After receiving the kth pivot row, P[k+l] should not be involved in forwardling the message t o  
somc: other processors, so that  its computation is not delayed by the startup time overhead, 
and the (k + 1)st row can be sent out a t  the earliest possible time. This implies that  qk+1] 
should be a leaf node in the t t h  broadcasting tree. 
We n e : ~ t  discuss how t o  determine the physical address of each processor on the hypercube. 
Since the logically consecutively numbered processors need t o  be physically adjacent, i.e., processors 
P[k+l ]  and P[k] be adjacent on the hypercube, we address the processors in a binary-reflected Gray 
code order with the starting address as 0 [7]. Let the d-bit code of p = 2d integers be Gray(d). 
The binary-reflected Gray code on d bits is recursively defined as follows. 
d d  Let G7 'a~(d)  = (gl ,gz,  - ' ., gtd-l, 9;d)- Then 
Alternatively, 
Gray(d+ 1) = (g$, gtll g i l ,  gi07 . ., g$l, g$o)- 
We use this binary-reflected Gray code mapping t o  associate each processor's logical address with 
its physicarl address, i.e., let the physical address of processor Pi be g;d. For the rest of the report, 
we will j u d  use Pi t o  denote the physical address g;d. 
In [7] the spanning binomial tree (SBT) of a d-cube rooted a t  node s is defined as follows. Let 
i be any node, and let c = i @ s, where @ denotes bitwise excl~lsive OR operation. Let q be such 
that  c, = 1 and c, = 0, Vm E M(c) = {q + 1, q + 2, . . . , d - I ) ,  and let q = - L if c = 0. The set 
M(c) is the set of leading zeros of c. Then in SBT(s), the children nodes and the parent node of a 
given node i are 
- 
children(i) = {(id-1 id-2 - . - i, . . . i o ) ) ,  Vm E M(c), 
It can be easily verified that  the children and parent functions defined above are consistent, i.e., 
that  node j is a child of node i iff node i is the parent of node j [7]. A node is a leaf node if it has 
no children, i.e., q = d - 1. 
Lemma 1. In SBT(s), any node i is at level H(s ,  i), where H(s,  i) is the Hamming distance between 
s and i. 'Inhe number of nodes at level 1 is , and the height of the tree is d. The number of 
children nodes of any node i is n, = d - 1 - q ,  and 0 5 n, 5 d - H(s ,  i). [7] 
At the Eth broadcasting in Algorithm I, the root of the SBT is And because of the binary- 
reflected Gray code mapping, PIk+ll is adjacent t o  P[k]  and t,herefore is a t  the first level of the 
SBT. However, with the SBT defined above, it is not guaranteed that  P [ k + l ]  is: a leaf node, since 
by definition, P[k+l] is a leaf node in SBT(PIk1) only if they differ by the high order (d - 1)st bit. 
On the hypercube, if node P and node PI differ only by the jt,h bit (and therefore they are 
directly linked), 0 < j < d, then we say P is the j t h  neighbor of PI. Suppose Plkl and P [ k + l ]  differ 
by the j t h  bit, we need t o  construct a SBT rooted a t  P [ k ]  and with P[k+ll as a lleaf node. 
Definition 1 A SBTj(s)  in a d-cube is a SBT rooted at node s and th,e jth neighbor node of s is 
a leaf nod(:. (Note the SBT(s) defined in [7] is therefore SBTd-1 ( s ) . )  
Definition 2 The shufle operation on a node i = (id-lid-z - .  -ilia) is defined as S(i) = 
. .  . 
(id-2id-3 - - - ilioid-1). The inverse shufle operation on i is defined as S-l(i) = (a0ad-lad-2 - .  . il).  
Moreover, s k ( i )  represents applying shufle operation k times on i, and F k ( i )  tpepresents applying 
inverse shaufle operation E times on i. 
Definition 3 The shufle operation on a graph G(V, E )  is defined as S(G) = G(S(V),S(E)),  
where S(br) = {S(i)lVi E V) and S ( E )  = {(S(i),S(j))lV(i, j )  E E). Similarly we define the 
inverse shufle operation on a graph S-'(G) = G(S-l(v) ,  S-'(E)). 
Lemma 2 Shufle and inverse shufle operations of a graph preserve the Hamming distance between 
nodes. The shufle operation sk maps every edge in dimension 1 to dimension ( I  + k) mod n. The 
inverse sh:ufle operation s - ~  maps every edge in dimension 1 to dimension ( I  - k) mod n. [7] 
Corollary 1 The topology of a graph remains unchanged under s h ~ ~ f l e  and inzlerse shufle opera- 
tions. [7] 
The ne~xt lemma shows that  SBTk(s) can be obtained by applying shuffle and inverse shuffle 
operation on SBTd-1 ( 9 ) .  
Figure 2: SBT(Gray(3)) is a family of 23 = 8 SBT's. The root node of the [i + 11-st SlBT is a leaf node of 
the i-th SBT a t  the first level. 
Lemma 31 
SBTk(s) = s - ( ~ - ' - ~ )  [ s B T ~ - ~  ( S ~ - ' - ~ ( S ) ) ]  .
Proof.  By Corollary 1 ,  s-(~-'-') [ S B T ~ - ~ ( S ~ - ' - ~ ( S ) ) ]  is a SBT rooted a t  s-(~-'-') [ s ~ - ' - ~ ( s ) ]  = 
s. Let i' be the leaf node a t  the first level of SBT~-' ( s~- ' -~(s ) ) ,  then by definition, i' differs with 
sd-'-'(s) by the (d-1)st bit. Let i be the corresponding node of i' in s-(~-'-') [ S B T ~ - ~ ( S ~ - ' - ~ ( S ) ) ]  , 
by Lemma 2, i and s  differ by the (d - 1) - ( d -  1 - k) = k  th  bit,. Therefore this SBT is SBTk(s). 
0 
The following theorem shows how to  the construct SBTk(s) directly, by spcxifying the parent 
and childr,en functions. 
Theorem 1 In SBTk(s), let q be such that c, = 1 and c ,  = 0, Vm E M(c) = + 1 mod p,  q + 2 
mod p,  - - - ,  k  mod p ) ,  and let q = k  - d  i f  c = 0.  Then  
Proof: Let a' = Sd-l-k ( s ) ,  i' = Sd-l-k ( i ) ,  c' = i' @ s'. Clearly c' = s ~ - ' - ~ ( c ) .  Let q' be such 
that  ci ,  = 1 and 8m = 0, Vrn E M ( c l )  = {ql+ 1,  q1+2, . . . , d-  1) .  Then by definit:ion, in SBTd-1(8'), 
the edges that connect i' and its children nodes are those in dimension d - 1, d -- 2,  - - - ,  q'+ 1. And 
the edge that  connects i' and its parent node is the one in dimension q'. Let q = q'- ( d-  1 - k )  mod 
d.  From tlhe definition of q' and the relation c' = Sd- l -k(c) ,  it is easy t o  verify that  q satisfies the 
constraints given in the theorem. By Lemma 3,  S B T k ( s )  is obtained by inverse shuffle SBTd- l (s ' )  
( d  - 1 - E )  times. Then by Lemma 2,  in S B T k ( s ) ,  the edges that connect i andl its children nodes 
are those in dimension d -  1- ( d - 1- k )  = k , k - 1  mod d ,  - - - ,  q l + l - ( d -  1 - - k )  mod d = q + 1  
mod d. And the edge that  connects i and its parent node is the one in dimension q. 
The next theorem shows the structure of SBTk ( s ) .  
Theorem 2 In  S B T k ( s ) ,  any node i is al level H ( s , i ) ,  where H ( s , i )  is the Hamming dislance 
between s and i. The number of nodes al level 1 is (:) , and the height of lhe lree is d. The 
number oj' children nodes of any node i is ( k  - q )  mod ( d  + 1).  The number of deaf nodes is 2d-1. 
Proof: The first three statements follow directly Lemma 1, Lemma 2 and Corollary 1. By 
Theorem I, in SBTk ( s )  node i is a leaf node if and only if the kth bit of i ( a k )  differs from the kth 
bit of s ( s , ~ ) .  For a given root s ,  there are 2d-1 nodes i such that ik # sk. Therefore there are 2d-1 
leaf nodes in S B T k ( s ) .  0 
Next vre give the construction of the p broadcasting trees used in Algorithm I. 
Definition 4 S B T ( G r a y ( d ) )  is a family of p = 2d SBT's: 
d d  d where G r e y ( d )  = ( g l ,  g2 ,  - - - , g2d-1,  9:d), SBT(;) = S B T ( ; ) ( ~ ~ ) ,  where and gi+ll difler by lhe 
j ( i ) - lh  bil. 
As an example, consider d = 3. We have 
and Figure 2 shows the corresponding SBT's of S B T ( G r a y ( 3 ) ) .  
The next theorem shows the structure of the set of trees, and will lead t o   the setup overhead 
comp1exit:y of the broadcasting algorithm. 
Theorem. 3 I n  S B T ( G r a y ( d ) ) ,  each node appears al lhe 1-lh level of ( ) S T , .  Moreover, 
each node appears as a leaf node in  2d-1 SBT7s .  
Proof: 
( :) s.5; 
level 1. 
By Theorem 2, in SBTk(s), any node i is at level H ( s ,  i). For any fixed i, there are 
such that H(s ,  i) = 1. And in the corresponding SBT's rooted at these nodes, i is at 
To prclve each node appears as a leaf node in 2d-1 SBT's, we use proof b;y induction on the 
hypercube dimension d. We use the first definition of binary-reflected Gray code. Induction hy- 
pothesis: :In SBT(gl,  92, . - . , g2d) or SBT(g2d, . . . ,g2, gl), any node i appears as leaf node in 2d-1 
SBT's. 
Base case: d = 1. The proposition is obviously true. 
Induction Step: Assume the proposition is true for dimension d, we show it is also true for 
dimension d + 1. 
For SBT(g1,92, - - - , g2d), let 
1, i and gm differ by the j(m)t,h bit,, 
d d  (i7 9m) = 
0, otherwise. 
where gm and g~,+~] differ by the j(m) th bit. 
And for SBT(g2d, - - - ,92, gl), let 
1, i and gm differ by the j1(m)t3h bit, 
0, otherwise. 
where g, and glm-l~ differ by the jl(m) th bit. 
Then Ily the induction hypothesis, 
For the d + 1 dimension, we have SBT(Ogl, - - - , og;, lg;, - - - ,  lgl), for any node Oi, we have 
Suppose g2d and gl differ by the j th  bit. If dd(i, g2d) = 1 then the j th  bits of i and g2d are 
different, so the j th  bits of i and gl are the same, and q5&(i,gl) = 0. Similarly, if dh(i, gl) = 1, 
2d+l 
@d(i, g2d) := 0. Therefore 1 - ($hd(i, g2d) + f$$(i, 91)) = 0. Thlls f$d+i (Oil 9;:') = zd. 
Therefore in S B T ( ~ ~ + ' ,  g;+l, . - - , g$:l), any node appears as leaf node 2d times. Using the similar 
argument, we can show that in SBT(~;&~~,  . - . , g,d+', g:"), any node appears as leaf node 2d times. 
The following corollary gives the setup overhead complexity of the broadcasting algorithm. 
Corollar:~ 2 In Algorithm I (row partitioning, with pivoting), when SBT(Gray(d)) i s  used t o  
broadcast the pivot rows,  the total  setup overhead of each processor i s  i ~ t , .  
Proof: Assume N  = np. There are totally N  one-to-all broadcasts, and SBT(Gray(d)) is used 
n  = N / p  times. By Theorem 3, in SBT(Gray(d)), each node appears as leaf node p/2  times. That 
is, for con.secutive p  broadcastings, each processor incurs setup overhead of i p t , .  Therefore the 
total setup overhead for each processor in the N  broadcasts is f $ = f ~ t . .  
5 Analysis of Algorithm I 
In this section we analyze the performance of the Algorithm I given in Section 3. The objective is 
to show the following: 
1. A sufficient condition for full overlap of computation and communication. 
2. The message queue length is bounded by a small number when full overlap is achieved. 
3. Numerical results on the communication overhead of Algorithm I for partial or full overlap. 
First consider the computation load of each processor at the Cth loop. The updating of each 
element nct on the Cth column involves one multiplication and one subtraction, and the updating 
of each element on the Cth column involves one division. For simplicity, we assume the time 
to  update each element of A is f .  For processor Pikl, since it has already normalized the Cth 
pivot row and found the pivot element in the previous (C - 1)st loop, it needs ,to update the rest 
(n - 1) rows it has using the Cth pivot row and pivot element,. We call this computational task 
TIl, and the corresponding computation time is TI, where TI = (n - l ) N  f .  For processor qk+l l ,  
its computational task in the Cth loop can be split into two parts, IIz and I13. 112 corresponds to 
updating the C + 1st row of A using the Cth pivot row, normalizing this row and finding the pivot 
element of the C + 1st pivot row. The time for updating is N f .  We assume it takes another N f 
to  find the pivot element and normalize this row. Therefore the time for 112 is Tz = 2N f .  After 
completing TI2 processor P [ k + l ]  immediately broadcasts the (C+ 1)st pivot row. Then it resumes to 
update the other (n - 1) rows using the Cth pivot row, we call this task TI3, and the corresponding 
time T3 = TI = (n - l ) N  f .  For all the other processors, the computational task during the Cth 
loop is 114, which corresponds to  updating all the n rows using the Cth pivoting row. The time for 
114 isT4 = n N f .  
Therefore the computational load of processor P at the Cth loop of Algorithim I is given by 
Tl = (n - l )Nf ,  P = P[k] 1 
= { T2 +T3=(n+1)Nf1  P=P[k+l ] ,  
T4 = nNf,  otherwise. 
Note there is computational load imbalance among the processors in the Cth loop. P[k+l] has 
the largest load and Pik] has the smallest load. By using the wrap-mapping to distribute the 
rows of A among processors, every processor becomes P[k1 (and P[k+l]) alternatively and thus the 
computatic~n loads are balanced over the whole process of GJ inversion. 
In orde~, t o  explain and analyze our parallel GJ inversion algorithm, we introduce a task schedul- 
ing diagram, which is instrumental in our analysis of the performance of the proposed algorithm 
and allows us to  conveniently formalize the notion of overlap commlinication and computation. A 
task scheduling diagram is a directed graph G(V, E) consisting of a node set V and an arc set E. 
Figure 3: Task scheduling diagram for Algorithm I (y  = 4 and N = 8). 
A node in G represents one of the four computational tasks ITl, IT2, IT3 and T I 4 .  An arc in G rep- 
resents precedence relations that  exist between two computational tasks, subject to  the condition 
that  certain amount of time is allocated to  each arc to  account for the actual co~mmunication time 
between the two computational tasks represented by the two leaf nodes of tohe arc. 
An example of a task scheduling diagram is given in Figure 3. It represents the parallel GJ 
inversion process of a 8 x 8 matrix using 4 processors. In general the task scheduling diagram G 
has the following properties: 
P r o p e r f y  1: T h e  nodes of G are ordered horizontally into p columns, each coliimn 1 corresponds 
t o  the updating operation sequence of processor Pi during the parallel GJ inversion. 
P r o p e r f y  2: Vertically G is constructed in N stages, each stage k corresponds t o  the concurrent 
updating olperations of all the processors at  the kth loop of the parallel GJ inversion. 
Proper ty  3: In the kth stage of G ,  the  node in column [k] is a Tl node; in column [k + I ]  is a 
T2 node followed by a T3 node; and the nodes in all other colllmns are T4 nodes. 
Proper ty  4: All nodes a t  the first stage of G has in-degree 1. At stages other than the first, a 
Tl or T3 ncde has in-degree 1 ,  and a T2 or T4 node has in-degree 2. 
Properly 5: A T2 node has out-degree p and the each of the other three type of nodes has 
out-degree 1, except in the last stage. 
As can be seen in Figure 3, there are two kinds of arcs in C: "vertical" arcs and "non-vertical" 
arcs. They represent two kinds of precedence constraints and communication delays. A vertical 
arc is an a,rc connecting two adjacent nodes in the same column in C. The precedence constraint 
it represents is that the computation in the tail node must precede that of the head node, i.e., a 
processor can start the ( h  + 1)st loop only after it completes the kth loop. The communication 
delay introduced by a vertical arc is the communication startup delay. As dliscussed earlier, a 
processor :may be interrupted by an incoming message and it may need to  forward the message 
to  other processors. This may introduce communication startup overhead and this overhead is 
represented by the vertical arc. A non-vertical arc is an arc from a TZ node to a T2 node or to  a 
T4 node at  the next stage in G. The precedence constraint it represent is that the computation in 
the head node can not start before it receives the message from the tail node, i.,e., a processor can 
not start t:he (k + 1)st loop before it gets the (k + 1)st pivot row from Pik+ll. The communication 
delay introduced by a non-vertical arc is the time between the message is sent out and the message 
reaches t hc: processor. 
Next we give an analytical model for the performance of Algoritthm I. First some notations are 
listed as following. 
The time processor P starts the kth loop. 
The time processor P completes the kth loop. 
The computation load at the kth loop of processor P. 
The time the kth pivot row is sent out. 
The time the kth pivot row arrives a t  P. 
The setup overhead of P at the kth stage. 
The time between the kth message is sent out by PIkl and the time it arrives 
a t  processor P ,  assuming no delay at  each intermediate procerssor. 
At stage k in G,  a TZ or a T4 node has two incoming arcs, the delay introduced by the vertical 
arc is sk(P:l and the delay introduced by the non-vertical arc is tk(p) .  From Section 4 we know that 
s k ( p )  is either 0 or t,, depending on whether or not P is a leaf node in the kth broadcasting tree. 
Let H(Pik1, P )  be the Hamming distance between the physical addresses of Pikl and P .  Then the 
message will take H(Plk1, P )  hops to  reach P from Plk1. Therefore t k (p )  = H(qk l ,  P)(1, + t,N). 
A TI node has one vertical incoming arc, which introdlices no delay, i.e., sk (p )  = 0. A T3 node has 
one vertical incoming arc, which introduces delay s k ( p )  = t,. 
Figure 4 illustrates the relationship among the parameters defined above, which can be formally 
stated by the following lemma. 
Lemma 4 We have Ihe following ~ecurrenl relationships: 
(k-1)st iteration 
- > tlrne 
-rk-ip) 
Idle time > 0 
Idle tlme = 0 
Figure 4: illustration of the relationship among the parameters defined above. 
Proof: These relationships are obvious from the definition of T ~ , ~ , ( P ) ,  T ~ ~ ~ ~ ~ ~ ~ ~ ( P )  and TLnd, 
and the t a k  scheduling graph, as illustrated in Figure 4. 0 
The next theorem states that a message can not be delayed at an intermediate processor during 
the broadcirst process in Algorithm I. 
Theorem. 4 In Algortihm I, when the SBTi broadcast a1gorith.m is used, then 
',k,riwe(~) = 'sk,nd + tk(p). 
Proof: We need t o  show that whenever a message arrives a t  an intermediate processor, the 
processor can forward this message immediately without delay. A delay is incurred if when the 
message arrives a t  the processor, the processor is setting up the communication channel for a 
previous nnessage. Then only after the processor finishes setting up the comn~unication channel 
for the prc:vious message can it start forward this message. In the worst case the message can be 
delayed for up t o  time t,. 
We have the following two simple observations: 
(1) If 2'bk,,;,,(~) - T&;,,(P) > t,, then message k will not be delayed a t  processor P ,  because 
when message k arrives a t  P ,  P has already finished setting up c~ommlinication clhannel for message 
k - 1, and therefore can forward message k immediately. 
(2) If P does not forward message k - 1, i.e., P is a leaf node of the (k - 1)st SBT, then as 
long as T~,,;,,(P) - T,k,T:,,(p) > 0, message k will not be delayed at processor P, since P can 
immediately forward message k. 
Assumc? H(P[k-l], P )  = m, since H(P[k], P[k-l]) = 1, we have either H(Plk1, P )  = m + 1 or 
H(P[k], P )  = m - 1. Suppose P[k] and P[k-l] differ by the j th  bit, if P and P[k-ll have the same 
j th  bit, th'en P and P[kJ have different j th  bits, and H(P[k], P )  = m + 1; otherwise P and P[k-ll 
have different j t h  bits, then P and P[k1 have the same j th  bit,, and H(P[k1, P )  = m - 1. 
Notice that 
The proof is by induction on k. For k = 1 obviously the message will not be delayed a t  any 
processor since there is no other previous message. Assume message (k - 1) is not delayed a t  any 
processor lJ ,  i-e., 
~ ~ - l  ( P )  = TL; + tk-'(P) = T,",;: + m(ts + t,N). arrtve 
We need tcl show message k also can not be delayed. We consider two cases: 
(1) H(lJ[k], P )  = m +  1. Then T:,,,~,,(P) -T;G;,,(P) 2 Tknd +(m+ l)(t ,  + t , ~ )  -T;;:,(P) = 
2(t, + t,N) + T2 > t,. Therefore message k will not be delayed at processor P. 
(2) H(Plkl,  P )  = m-  1. Then T:,.,~,(P) -T:G:,(P) >_ T:,~ +(m- l ) ( ts  + L N )  -T;,.$(P) = 
T2 > 0. Since P and Pik-ll have different j th  bits, P is a leaf node in the (k - 1:lst SBT, therefore 
P does not forward message (k - 1). So message k will not be delayed a t  processor P. 
Let T,,,,(P) be the total computation load of processor P 7  and Tcom,(P~~ be the total com- 
munication overhead of processor P ,  then 
It  can be seen in Figure 4 that  a t  the ( k  - 1)st iteration, if T:,.,~_(P) > then 
processor P will be idle for a time t : d ; f ( ~ )  = T:rr,,e ( P )  - T ~ ~ ~ , , ~ ~ ( P ) ;  otherwise ( P )  = 0 .  
If t fdle(p)  = 0 for all 2 5 k 5 N ,  then full overlap is achieved a t  P .  Now we can give a formal 
definition 13f full overlap of communication and computation. 
Definition 5 We say that the communication is fully overlapped by comp~rtation for processor P 
in Algoritlrm I if 
Tk-1 
comple t e (p )  2 ~:enci + t k ( p ) ,  2 5 k 5 N ,  
That is, the kth pivot row has already arrived at processor P when P completes computation of the 
k - 1st stage of the parallel GJ inversion algorithm. 
At the 'beginning of the algorithm, i.e., L = 1, all the processors except PI are idle until the first 
pivot row errrives, as illustrated in Figure 3. Let do(P) be the initial idle time of ;processor P ,  Since 
Pl first normalizes the first row of A and finds the pivot element, which takes t8ime N f ,  before it 
sends it out, then do(P) = H(P1, P) ( t s  + t ,N) + N f .  
Since our objective is t o  let the message arrive a t  all the processors before it is needed, another 
parameter of interest is the number of messages in the input buffer of a procesr;or a t  a particular 
time. 
Definition 6 Let Q k ( p )  be the message queue lenglh of processor P at stage L, then Q k ( p )  = 1, i f  
~,k,:fi + t k t l ( p )  5 T ~ ~ ~ ~ ~ ~ ~ ~ ~ ( P )  < Tktltl send + tk+ltl PI 1 
That is, afier processor P completes stage L and is to start stage C + 1, messc;rges for stage L + 
1, L + 2 ,  . . -, k + 1 hawe already arrived at P .  
The next theorem gives a sufficient condition for full overlap comm~lnication and computation 
in Algorithm I. 
Theorem, 5 When N 2 No, full overlap ofcommunication and computation cast be achieved, where 
No is the positive root of the quadratic equation 
ProoJ We rewrite the quadratic equation in the standard form: 
1 (f) N~ - (3f + 2tw logp) N - (-p 2 + 2logp)tS = 0. 
Obviously this equation has one positive root and one negative root. Let the positive root be No. 
When N 2; No, we have 
N 1 
(- - 3)Nf 2 apts + 2(ts + t w N )  logp. 
P 
Let A T ~ ( P )  = T ~ ~ ~ , , ~ ,  ( P )  - T : ~ ~ ~  - t k ( p ) ,  we will show t,hat A T ~ ( P )  2 0 for 2 < k 2 N and 
P E {PI, P2, - - - , Pp) - P k .  The proof is by induction. 
(1) Base case: k = 2, 
where T'(jD) - T2 
Therefore 
A T 2 ( p )  > ( n- 3 ) N f  -2( t ,+t , , ,N) logp-Nf 
N = (- - 3) N f - 2(ts + i!,,, N )  logp 
P 
> 0, 
when N 2 No. 
(2) Induction step: Assume the proposition is true for 2 5 1 < k .  We prove it is also true for 
l = L + l .  
By the induction hypothesis, TLtart(P) = T ~ ~ , ~ ~ , , , ( P )  + s l (P)?  for 2 5 1 < k.  Therefore 
-tk'-I ( P )  
1 = -NjC  + ( n  - l ) N f  - -pt, - 2(t, + t,N) logp 
2 
when N > No. Here we used the fact that if let r = I;], then 
And therefore 
We also used the fact that 
The next theorem states that the message queue length at each processor is lbounded by 2. 
Theorem 6 W h e n  N 2 No, lhe message queue length i s  al m.osl 2, i .e . ,  Q k ( P )  5 2. 
Proof: By contradiction. Otherwise there exist k and P such that Q k ( p )  = 1 2 3. Since 
N > No, I)y Theorem 5, communication is fully overlapped by computation. By the definition of 
the queue length Qk(P) ,  we have 
But when lt 2 3, 
1 
> (is + t.,,N) logp + Zpts. 
(1) and (2) contradicts each other. Therefore Qk(P)  5 2. 
Corollary 3 W h e n  N > No, lhe comnunical ion overhead of p rocessorP  in  Algorilhm I i s  do(P)+ 
+ts.  
Proof: At the beginning of the algorithm (k  = I ) ,  the processors other than P1 must wait to  get 
the first pivot row from PI,  since no computation can be done for these processors before getting 
the first pivot row. Thus a communication overhead (initial idle time) of do(P) is incurred. After 
that, since N 2 No by Theorem 5, the data transmission is fully overlapped by computation for 
Figure 5: Communication overheads of Algorithm I (Number of processortl p=16). 
all the processors during the rest of the updating process (for 2 5 k 5 N). Therefore the total 
idle time fi3r processor P in Algorithm I is do(P). By Corollary 2, the t,otal setup overhead of each 
processor in Algorithm I is qt, .  So the total communicat,ion overhead is &(P)  + Tt,. 
Next we give some simulation results on the parallel overheads of Algoritlhm I, obtained by 
numerically evaluating the recurrent relationships given in Lemma 4. The time is scaled to f, i.e., 
let f = 1. We use the machine parameter t, = 150 and t, = 3. These parameteirs are close to that 
of the nCIJBE 2 machine [28]. Since the communication overheads may be difrerent for different 
processors, the data depicted in the figure are the largest overheads among the processors. Figure 
5 depicts the communication overhead T,,,,vs the matrix size N when the machine size p = 16. 
Curve 1 is the total communication overheads; curve 2 is the total comm~inication overheads minus 
the initial delays; for the case of curve 3, we let each processor initially holcls the first row of 
A, such that all processors can start without the initial delay. It can be seer1 that when there 
is initial delay, when N > 260 the communication is "almost~" overlapped by computation, but 
not comp1t:tely until N > 460. When there is no initial delay, the comm~inication is completely 
overlapped by computation when N > 260. 
Figure 6 shows the communication overheads with different machine sizes. The initial delays 
are subtracted so that it is easy to see where full overlap is achieved. We can see that when full 
overlap is achieved, the communication overhead is independent of the machine size (if the initial 
delay is not considered), and is linearly increasing with the problem size (i.e., T, = $Nt,). 
Figure 6: Communication overheads of Algorithm I with different machine sizes. 

6 Algorithm 11: Submatrix Partitioning, Without Pivoting 
In this section, we apply the idea of overlap communication and computation to the parallel GJ 
matrix inversion algorithm using submatrix partitioning. The main advantages of the submatrix 
partitioni~ng are shorter message length, i.e., the message is of length instead of N; and shorter fi 
communic:ation propagation delay, i.e., each broadcasting tree is of height $ instead of d. However, 
the disadvantage is that the searching of pivot element can not be done by a single processor. 
Instead ,/E processors individually find their local pivot element and compare with each other to 
find the final pivot element through recursive doubling. As a result, at every iteration there is a 
synchroni,zaiion barrier among all the processors which makes it difficult to overlap communication 
and computation. In this section, we study the algorithm without pivoting. 
Let p = 2d be the number of processors in a d-cube. These processors arre configured as a 
two dimer~sional array of processors. Let p, and p, be the number of processor rows and columns 
respectivelly, where p, = p, = &i = 2d/2 when d is even, and p, = ,&i = 2(dS1)/21 pc = @ = 
2(d-1)/2 w:hen d is odd. The physical address of each proc.essor is determined by the 2-d Gray code 
mapping. Let d, = logp,, d, = logp,, and d = max(d,, d,). The d-bit Gray (code is Gray(d) = 
gi, - - 0 ,  s$). We use notation Pi, to denote the processor with coordinates (i, j )  in the processor 
array, where 1 < i 5 p, and 1 5 j < p,. Then the physical address of Pij is gfg$ It is easy to see 
that under this mapping strategy each column and each row proc,essors form i% subcube and the 
adjacent processors on the grid are directly connected. 
The matrix elements are distributed among the processors by full column and row wrap- 
mapping, i.e., matrix element a;j is placed in processor row I and column J accoirding to I = (i- 1) 
mod p, + :l and J = ( j  - 1) mod p, + 1. Thus 1 < I 5 p, and 1 5 J 5 p,. This mapping strategy 
improves the computation load balance when interchanges are performed [I]. In our implementa- 
tion, the objective is to achieve the overlap of communication and ~omput~ation. This full column 
and row wrap-mapping strategy will balance both the ~omplitat~ion a d communication loads. 
Under the submatrix partitioning strategy, there are two t,ypes of commani.cation required at 
every iteration: one to pass the pivot row and another t,o pass the multipliers. The multipliers are 
the matrix column that is to be set to zero in that step. These data are available in one column of 
processors and the multipliers must be passed to the remaining processor colunins. Therefore the 
communicirtion path for passing multipliers is across processor colnmns. 
The pivot row is contained initially in a single row of processors. The communication path for 
passing pi~eot row is across processors rows. Before sent out, each element of the pivot row need to 
be normalized by the pivot element. The pivot element is contained in the multipliers passed to 
the procesclor. Therefore the normalization and the broadcast of the pivot row can not start until 
the process~ors get the multipliers. 
We still use the cornpule-and-send-ahead strategy to achive the overlap of communication and 
computation, i.e., at the k-th iteration, the data elements of t,he k + 1st pivot row and multipliers 
are first updated and then sent out. Unlike in Algorithm I, where there is only one broadcast within 
the hypercube a t  every iteration, in Algorithm 11, there are concurrent broadcasts within subcubes 
a t  each iteration: first there are p, concurrent broadcasts of segments of multipliers within the 
subcube lows; then there are p, concurrent broadcasts of segments of the pivot row within the 
subcube columns. For simplicity, we assume p, = p, = fi. We use the notation [k] to represent 
(k-  1) mod p, + 1. Then at  the kth iteration, processors in the [k+ I]-st column and the [k+ I]-st 
row are roots of the broadcast trees. As an example, Figure 7 shows the data flow for Algorithm 
I1 where p = 16 and [k + 11 = 2. 
- d t i p l i e r  broadcastings within subcube rows 
- - - - -  pivot mw broedcestings within subcube columns 
Figure 7: Data flow of Algorithm 11. 
Let ( I ,  J )  be the coordinates for processor P in the processor array, where 1 5 I 5 p, and 
1 5 J 5 p,. The pseudo code for the kth step of the Algorithm I1 is as followinl:: 
if ( I  # [k + I.] and J # [ k+  I] then 
upd,ate all the elements it has using the corresponding  segment,^ of the 
kth pivot row and multipliers; 
Interrupt handling: 
If the message is from a processor in the same processor row, 
pass it t o  its children nodes in the row subcube, if there is any; 
If the message is from a processor in the same processor column, 
pass it t o  its children nodes in the column subcube, if there is any ; 
if ( I  = [k + 11 and J = [k + 11 then 
update the corresponding segment of the k + 1st column of A 
broa,dcast this column segment in the row subcnbe; 
uptlate the corresponding segment of the k + 1st row of A; 
and normalize it using the k + 1st pivot element; 
brcladcast this row segment and the pivot elelment in the column subcube; 
update the rest of the elements of A; 
if ( I  # [k + 11 and J = [k + 11 then 
update the corresponding segment of the k + 1st column of A; 
broadcast this column segment in the row subcube; 
uptlate the rest of the elements of A. 
Interrupt handling: 
If the message is from a processor in the same processor row, 
pass it to its children nodes in the row subcube, if there is any; 
if ( I  = [k + I.] and J # [k + 11 then 
update the corresponding segment of the k + 1st row of A; 
update the rest of the elements of A; 
Intt:rmpt handling: 
If the message is from a processor in the same processor coltlmn 
pass it to its children nodes in the column subcnbe, if there is any; 
normalize the corresponding segment of the k + 1st row A 
using the k + 1st pivot element; 
broadcast this row segment and the pivot element in the row subcube; 
In the above algorithm, whenever a message arrives at the processor, if the processor is doing 
computation, it will be interrupted and start executing the communication interrupt handling. If 
on the other hand, the processor has finished all the comp~itat~ion at a particular step and the 
message has not yet arrived, it will become idle until the message arrives, and then start executing 
the interrupt handling code before it goes to the next strep of the loop. 
The algorithm for each broadcast within a subcube-row or subcnbe-column is essentially the 
same as the broadcasting algorithm proposed in Section 4. Consider the i-th processor row con- 
d d  d d  sisting processors with physical addresses gi gl , gi g2, . . - , From Section 4 we know now to 
construct ar family of SBT7s: 
where in S B T ( ~ ) ,  is the root node and gf+l is a leaf node at the first level. 
A family of SBT7s in the i-th processor row can be obtained by making the l'ollowing transfor- 
mation on each node of each SBT in S B T ( G r a y ( d ) ) :  if the physical address of a node is gj, then 
replace it with gfgj, 1 5 j 5 2d.  Similarly a family of SBT7s in the j-th processor column can 
be obtained by the following transformation on each node of each SBT in SBT(Gray(d)):  if the 
physical address of a node is gi ,  then replace it with gig;, 1 5 i 5 2d. 
7 Analysis of Algorithm I1 
In this section we give the performance analysis of Algorithm 11. For simplici1;y we only consider 
d 
the case the hypercube dimension d is even, i.e., pc = p, = fi = 2 F .  Let Pij be the processor in 
the ith processor row and j th  processor column, 1 5 i 5 fi, 1 5 j 5 fi. The following notations 
are used in the analysis: 
The time Pij starts the Eth loop. 
The time Pij completes the kth loop. 
The computation load of Pij at the kth loop. 
The time to  update or normalize one segment of the multipliers or the pivot row elements. 
The time that the segment of kth row of A is sent by qklj. 
The time it takes for this segment of the kth row t,o reach Pij. 
The time that the segment of the kth column of A is sent by Pi[kl. 
The time it takes for this segment of the kt,h column to reach Pij. 
The setup overhead of Pij when br~adcast~ing the kt,h row segment. 
The setup overhead of Pij when broadcasting the kth column segment. 
The following theorem shows that the total setup overhead of A1gorit)hm I1 is twice that of 
Algorithm I. 
Theorem 7 In Algor i thm 11, the total  se tup overhead of each processor i s  N t , .  
Proof: In Algorithm 11, each processor involves N  one-to-all broadcasts in its subcube column 
and N  on(:-to-all broadcasts in its subcube row. By the construction of the SBT's, for consecu- 
tive fi broadcastings, each processor incurs setup overhead of ;@t,. Therefore the total setup 
overhead for each processor is 2 - i f i t ,  5 = Nt. .  
Each processor involves two broadcastings at every iteration. The situation, might occur that 
when a message containing the segment of the pivot row arrives at an intermediate processor, this 
processor is in the process of setting up the communication channel for a messiage containing the 
segment 01' the multipliers. Therefore the message can not he immediately pass over. Instead, it 
will be delayed until the processor finishes setting up the channel for the previous message. This 
situation can not happen in Algorithm I, as proved in Theorem 4. In t,he worst case, a message is 
delayed f o ~ ,  a time of t ,  at an intermediate processor. As a result,, we have 
Lemma li W e  have ihe following recurreni relaiionship: 
ma{~!dplete(~ij) + s.!(pij) + st(pij)y ~ : ~ ~ ~ - ~ ( ~ i ~ k ~ )  
+':(~lj) + '.!(~ij) 1 ~L:,,d-T(p[k]~) + f)(pij) + S:(P;~)}, l r #  t k ] ,  j # I k ] ,  
mar{T:zplete(p[klj) + s.!(P[~] j) + s:(qk] j )  , T & ~ ~ - ~ ( P [ ~ ~ [ ~ ~ )  
+t!(qk]j) + s.!(p[k]j)}> 1: = [ k ] ,  j # [k]  , 
ma{T:zplete ('i[k]) + s:(pi[kl) + s:(pi[k]) 7 T:end-T (+][k]) 
+'*(~i[k]) + s)(~i[k])}7 
T k - 1  
1 # P I ,  j = [k l ,  
conap~ete(~[k][k]) + s,k(P[kl[k]) + s!(p[k][k]) , i = [ k ] ,  j = [ k ] .  
We say that communication is fully overlapped by computation if 
Cart(pi j )  = ~ ' 1 , ~ ~  (pij) + st(pij) + s,"(pij) , 2 5 k S N ,  Vi,j. 
Theorem 8 W h e n  N 2 No, ihe communicaiion can be fully overlapped by comjouiaiion, where No 
i s  ihe posilive rooi of ihe following quadraiic equaiion: 
Proof: See Appendix. 
At the beginning of Algorithm 11, each processor except Pll must be initially idle until it got 
the multipliers and/or the first pivot row elements. Let rdo(Pij) be the initial delay of Pij. 
Corollary 4 When N 2 No, ihe communicaiion overhead of processor Pij :in Algon'ihm II i s  
&(Pij) + 1V1,. 
Next we give the simulation results on the parallel overhead of Algorithm 11, obtained by 
numerically evaluating the recurrent relationships give in Lemma 5 .  The parameters ( f ,  t, and t,) 
are the same as in Section 5. Figure 8 depicts the communicat~ion overhead Tco,, vs the matrix size 
N when the machine size p = 16. As a comparison, we put the corresponding curve of Algorithm 
I (curve I. in Figure 5) in the same figure. We can see that when the matrix size N is small, 
Algorithm 11 incurs less overhead than Algorithm I; but when N becomes large, Algorithm I incurs 
much less overhead than Algorithm 11. We can also see that in Algorithm I1 the: matrix size No for 
full overlap is smaller than that in Algorithm I. Figure 9 shows the communical;ion overheads with 
different nnachine sizes. We can see that when full overlap is achieved, the comm~unication overhead 
is independent of the machine size (if we neglect the initial delay), and linearly increases with the 
problem size (i.e., T,,,, = Nt,). 
Figure 8: Communication overheads of Algorithm 11 (Number of processor p=16). 
Figure 9: Communication overheads of Algorithm I1 witah different machine sizes. 

8 Algorithm 111: Submatrix Partitioning, With Pivoting 
We consider the parallel GJ inversion algorithm with column interchanges. !since the matrix is 
partitioned by submatrix, the searching of pivot element can not be done by a single processor. 
Instead, a row of processors each determine the local pivot element and then f ind  the pivot element 
through su cursive doubling. During the E-th iteration, the (E + 1)st pivot row elements can be 
computed ahead and sent ahead by a row of processors; but the multipliers cain not be computed 
before the processor gets the segment of the pivot row and finds out the local pivot element. In 
Algorithm. 111, we let each processor updates its share of the submatrix colurnn by column. As 
soon as the message containing the pivot row elements arrives, it will be interrupted and find the 
local pivot element. Then it will first update the column segment corresponding to this local pivot 
element and then participate permutation among the processors in the same subcube-row. After 
write-read pairs, each processor will get the pivot element and the corresponding segments of 
multiplier:;. Then it will nomarlize the pivot row elements using the pivot element. 
Therefbre at each iteration of Algorithm 111, each processor will be interrupted + 1 times: 
one for thc: broadcasting of the pivot row elements; $ for the permut,at,ions. We have the following 
theorem regarding the total setup overhead of Algorithm 111. 
Theorem 9 In Algorithm 111, the selup overhead for each processor is :(I+ lalgp)Nt,. 
Proofi The one-hal l  broadcastings of the pivot row elements within the. subcube-columns 
causes setup overhead of $Ni,. At each iteration, the perm~t~ation within th'e subcube-rows to 
find the pivot elements causes each processor a setup overhead of logpt,, and (,here are totally N 
iterations. So the overall setup overhead for each processor is $(I + log p) Nt, . 
Therefore Algorithm 111 incurs much larger setup overhead than Algorithm I and Algorithm 
11. Further more, Since the recursive doubling essentially causes synchronization barrier for all the 
processors, although we can use the interrupt handling capability to let the processor do some 
computation while waiting for the message, it is difficult t,o achieve full overlap. 
if ( I  = [E + 11) then 
upd:ste the corresponding segment of the (E + 1)st row of A 
broadcast this row segment in the subcube-column; 
update the rest of the submatrix of A column by c,olumn; 
if ( I  # [E + 11) then 
update the share of the submatrix of A column by column; 
Interrupt handling: 
if the message is from a processor in the subcube cohimn, then 
forward this message if needed; 
find the local pivot element; 
update the segment column corresponding to the local pivot,; 
write this updated segment to its first neighbor in the subcub e-raw; 
if the message is from the j th  neighbor in the subcube-row,then 
if j < $ then find the larger pivot element and send the corresponding segment to 
its j + 1-th neighbor; 
if j = $ then find the larger pivot element and the corresponding segment is 
the mu1 tiplier . 
9 Sca~lability Analysis 
The scalability of a parallel algorithm on a parallel architecture is a measure of its capability to 
effectively utilize the processors. Several metrics have been proposed in the literature to measure 
the scalability of algorithms and machine systems [28, 29, 301 In this section we adopt two metrics, 
namely, t:he i soef ic iency  metric [28] and the [30] metric to study the scalabi~lity of the parallel 
Gauss - Jordan matrix inversion algorithm proposed in this report. 
9.1 The Isoefficiency Metric of Scalability 
Let p be the number of processors, W be the problem size which is the total amount of computation 
to  be don'e, and To be the total parallel overhead of all the processors. Then the execution time 
on p processors Tp satisfies W + To = pTp. Speedup is defined as S = g. Efficiency is defined as 
E = S = . W = W -  1 
P PTP W+To - s' 
Gupta and Kumar defingd an isoefficiency function fE(p) to  measure the scalability of parallel 
algorithm and architecture combinations [28, 291, where fE(p) is the amount of work needed to 
maintain im constant efficiency E  when p processors are used. 
Theorem. 10 For ihe parallel G J  inversion algoriihm on a hypercube archiieciu.re, when ihe mair ix  
i s  parditior2ed as  submairices and ihere i s  n o  overlap beiween cornmunicaiion ans! compuialion ( i .e . ,  
Algoriihm O), ihen the i soe f i c i ency  function fE(p) = ( 1 0 ~ ~ ) ~ ) .  
Proof: When submatrix partitioning is used and if there is no overlapping between commu- 
nication and computation, the communication overhead for each processor is N(ts + t,%) logp. 
Therefore the total communication overhead is To = pN(t, + t, 5) logp. The work load of the 
GJ inversion algorithm is W = N~ f .  To keep the efficiency E  fixed, we have j ' ~ (p )  = W = KTo, 
where I< =I A. Therefore N~ f = I<pN(ts + i , $ )  logp, or f N 2 - Kt.,fllogpN - I<i,p logp = 0 
Solving for N we get N = @(fllogp).  Thus fE(p) = W = N3 f = ~ ( ~ g ( l o ~ ~ ) ~ ) .  O 
Theorem 11 For ihe parallel G J  inversion algoriihm on a hypercube archiieciurpe, when ihe mair ix  
i s  padi i ioned by rows  and when communicaiion i s  fully ot)erlapped by compuiaiion ( i .e . ,  Algorithm 
I), ihen ihe i soe f i c i ency  funclion fE(p) = 
Proof: When the matrix is partitioned by rows and when communication is fully overlapped by 
computation, the overhead for each processor is ~ N Z , .  Therefore the total overhead is To = ip~i, .  
Therefore N3 f = To keep the efficiency E  fixed, we have fE(p) = W = KT,, where I< = m. 
$KpNi3. Solving for N we get N = fi = Q(fi)). Thus fE(p) = W = N~ f = 0 
9.2 The Isospeed Metric of Scalability 
In [30] the scalabilty of a parallel algorithm-machine combination is defined based on the isospeed 
metric. The average speed is the achieved speed of the given compl~ting system divided by p, 
the number of processors. Let W be the amount of work of an algorithm when p  processors are 
employed, and W' be the amount of work of the algorithm when p' > p  processors are employed 
t o  maintain the average speed, then the scalabiliiy from sysiem size p  io system size p' of the 
W 
algorithm - machine combination is defined as: d ( p ,  p') = 3. The work W' is determined by the 
7 
isospeed constraint. 
Theorem 12 For Algoriihrn I, when ihe problem size is such ihai communicaiion is fully overlapped 
by compulaiion, ihen ihe scalabiliiy from sysiem size p  io system size p' (p' > p )  is d ( p ,  p') = J$. 
W ' W W ' 1 1 Proof: By the isospeed constraint, we have 5 = p ~ ,  or WTT; = wl+z., 01 -T- = 
I+#  I+%'. 
l p N t s  Lp'N't 
W - 
Therefore 5 = W" or + ~f = 2 4 .  Thus 5 = Therefore d ( p , p t )  = -& = $ (7$)3 = 
N f ;;r 
10 LlJ Factorization 
In this section, we apply the ideas of overlapping communication and computaction, t o  the parallel 
LU factor:ization algorithm, using the row partitioning strategy and column interchanges for partial 
pivoting. 
The ptgeudocode for the  sequential row oriented LU decomposition algorithm is given below: 
for k = 1  t o N - 1  
f o r i = k + l t o N  
lik = aiklakk 
f o r j = k + l t o N  
ai; = ai; - likakj 
The  j loop subtracts multipliers of the kth row of the current A from succeeding rows. At 
the kth stage of the algorithm, there are ( N  - k) row updatings, and the len,gth of each row t o  
be updated is ( N  - k). (Recall in the GJ inversion algorithm, a t  the ktrh stage, there are N row 
updatings., and each row is of length N.) 
When implemented in parallel, t o  keep the computational load balanced among the processors, 
we use a variation of the row wrapped storage - reflection wrapped storage. Here the rows are 
distributed as illustrated in Figure 10 for the case of p = 4 and N = 16. In general, the first p rows 
are distributed t o  the p processors in order, the next p rows are distributed in reverse order, and 
so on. We use the  notation PIk) t o  denote the processor that has the kth row. 
rows 4,5,12,13 rows 1,8,9,16 Irows 2,7,10,15 rn 
Figure 10: Reflection wrapped row storage. 
The  parallel LU factorization algorithm using row partitioning is similar t o  the Algorithm I in 
Section 3. During the k-th iteration, each processor first get,s the k-th pivot row of A and the pivot 
element f r ~ ~ m  processor Plk1, and then does the row updat,ings. To overlap communication and 
computatil~n, processor P{k+*) first updates the (k + 1)-st row of A dluing the k-th iteration. Then 
it finds the pivot element and normalizes this row (comp~rle-ahead). After that  it broadcasts this 
(k + 1)-st :pivot row and the  pivot element to  all the other processors, before it resumes t o  update 
the other rows using the k-th pivot row (send-ahead). 
Consider the  computation load of each processor at  the kth stage. Let p = ~ f j ,  then the 
computaticm load of processor P a t  the kth stage is 
Lemma (3 F o r  any 1 < K < N ,  and f o r  any i w o  processors  P  and P', 
Proof :  Without loss generality we assume P  = P{;) and P' = P I j ) ,  wheire 1 5 i < j 5 p. 
Consider :cZ1 T k ( p ) .  
If i # 1 and i # p, then 
Thus 
If i = I! then 
Therefore we have 
Similarly we can show that for any r 2 0, 
Therefore 
Unlike the parallel G J inversion algorithm, in the parallel LU decomposition algorithm, the 
computation load of each processor decreases as the cornputatmion proceeds. When the computation 
load decreases to a certain level, there will not be enough computation to  do to  compensate for the 
processor :idle time caused by data communication. Therefore full overlap of communication and 
computation can not be achieved throughout the whole computation process. Nevertheless, it can 
be shown 1;hat for a certain number c, where 0 < c < 1, in t,he parallel LU factorization algorithm, 
up to stage cN,  full overlap of communication and compr~t~ation can be achieved, given the matrix 
size N is large enough, as stated by the following theorem. 
Theorem 13 W h e n  N > No, up t o  the cNth  i teration,  where 0 < c < 1, full overlap of commu- 
nication and computation can be achieved, where No i s  the positive root of the  following quadratic 
equation. 
Proof: Using the above lemma, the proof is similar to the proof of Theorem 5 and therefore 
omitted here. 
Next we give the simulation results on the processor idle time vs the number of iterations for 
the parallel LU decomposition algorithm. The machine parameters are the same as in the previous 
sections. I:n Figure 11 the two curves correspond to N = 160 and N = 320 respectively, and p = 8. 
We can see for N = 160, up to k = 55, full overlap is achieved; while for N = 320, up to I: = 215, 
full overlap is achieved. 
Figure 11: The accumulated processor idle time vs. the number of iterations in the piwallel 
sition a1gor:lthm (p=8). 
Aau-ed ldls Tbne 
decompo- 
1 O O D O  -. 
m o o 0  
S O O O  
4 0 0 0  
D O 0 0  





0 5 0  L O O  1 5 0  P O D  2 5 0  ' 0 °  Uelmtlon 
11 Concluding Remarks 
Most parallel matrix algorithms proposed in the literature do not attempt t o  overlap inter-processor 
communication by computation. This leads t o  increased communication overhead, and might lead 
t o  conclusions that  are not valid when communication overhead is systematica.lly minimized. For 
example, it is claimed in [I, 31 that  the submatrix partitioning strategy for GJ algorithm is superior 
t o  the r o ~ ~ / c o l u m n  partitioning strategy because of lower communication overhead. We, however, 
show that  if communication and computation are efficiently overlapped, then tlne row partitioning 
strategy has a lower communicaiion overhead than the submatrix partitioning strategy. 
We first proposed a new broadcasting algorithm on the hypercube multiprocessor for parallel 
GJ algorikhm. This algorithm ensures that  the data  are sent out from the source and arrives at  
the destinations a t  the earliest possible time. We then gave the parallel GJ inversion algorithm 
using row partitioning. We prove a lower bound on the matrix size s~ich that  data  transmission 
is fully overlapped by computation. We also prove that  the message length in the input buffer of 
each processor is a t  most 2. 
We also consider the algorithms under submatrix partitioning, with or without pivoting. We 
show that  when submatrix partitioning is used, even when the communication is fully overlapped 
by compu~;ation, the communication overhead is larger than when using row partitioning. 
Our numerical simulation shows that when the algorithms proposed in this report is used, the 
parallel algorithms incurs much less communication overhead compared with tlie algorithm in [I], 
even when the communication is not completely overlapped by computation. Finally, we observe 
that  the icleas in this report can be applied to  other parallel algorithms as well, such as parallel LU 
factorization algorithm. 

Appen.dix: Proof of Theorem 8 
Proof: Let 
":('ij) = ( ' ~ ~ p l e t e ( ' i j )  + 8:(pij) + $(pij)) - ( T L ~ ~ - ~  ( ~ [ ~ ] j )  + t:(pij) + 8:(pij)), 
We will show AT$(P;~) 2 0 and AT;(P,,) 2 0, Vi, j ,  2 < k 5 N. The proof is by induction on L. 
(1) Base case: k=2. 
First consider the case i # 2 and j # 2. 
AT: (pij) 
= (T'tart(pij) + T1 (pij) + s:(pij)) - ( ~ t t ~ T t ( ~ 2 j )  + 2T' + tP(pij)) 
2 p.. = (T1(pij) - 2T') + s:(Pij) - (Etart(P2j) - 'Cta7t(pij)) - tr( 8 , )  
Similal-ly we can show A q ( P j j )  2 0 when i = 2 or j = 2. And similarly we can show 
AT;(P;,) 12 0 ,  Vi, j .  
(2) Indudion SZep: Assume the proposition is true for 2 5 1 5 k. We prove it is also true for 
1 = h + 1. We just show the case i # k + 1 and j # k + 1, and T,~A;-, = T k f l  send-c (qk+l : l [ k+l l )  + 
t f + l ( ~ [ ~ + ~ ] j )  + s ~ + ~ ( P ~ + ~ ~ ~ )  + T) .  The proofs for the other cases are similar. 
By the induction hypothesis, T i t a r t ( ~ i j )  = T ~ ~ , , , ~ ~ ~ ~ ( P , ~ )  + st(%) + 8)(Pi j ) ,  for 2 < I 5 k,  Vi, j. 
Similarly we can show that AT:+'(P;~) > 0, Vi, j. 
References 
[I] E. Cllu, A. George and D. Quesnel, "Parallel matrix inversion on a subcube-grid ", Parallel 
Computing, 19 ,  1993, pp.243-256. 
[2] T. J .  Dekker and W-Hoffmann, "Rehabilitation of the Gauss-Jordan algorithm " , Numer. 
Math., 54, 1989, pp.591-599. 
[3] M. Angelaccio and M. Colajanni, "Unifying and optimizing parallel linear algebra algorithms 
", I E E E  Transactions on Parrallel And Distributed Systems, Vo1.4, No.12, Dec 1993, pp.1382- 
1397. 
[4] J.M. Ortega, Introduction to Parallel and Vector Solution of Linear Systems, New York: 
Plenr~m, 1988. 
[5] J.M. Ortega and C.H. Romine, "The ijk forms of factorization methods 11. Parallel systems ", 
Parar'lel Computing, 7, 1988, pp.149-162. 
[6] S.L. Jlohnsson, "Minimizing the communication time for matrix multiplication on multiproces- 
sors ", Parallel Computing, 19, 1993, pp.1235-1257. 
[7] S.L. Johnsson and C-T Ho, "Optimum broadcasting and personalized cornmunication in hy- 
percubes ", IEEE Transactions on Computers, Vo1.38, No.9. Sept 1989, pp.1249-1268. 
[8] E. Chu and A. George, "A balanced sribmatrix merging algorithm for multiprocessor architec- 
tures ", Parallel Computing, 18, 1992, pp.1-10. 
[9] A.C. Elster and A.P. Reeves, "Block-matrix operations rising orthogonal trees ", Third Conf. 
on Hypercube Concurrent Computers and Applications, ACM, 1088, pp.15154-1561. 
[lo] A. Gerasoulis, I. Nelken, N. Missirlis and R. Peskin, "Implement,ing Gauss-Jordan on a hy- 
percube multiprocessor ", Third Conf. on Hypercube Concurrent Computers and Applications, 
ACM, 1988, pp.1569-1576. 
[ l l ]  G.C. Fox, W. Furmanski and D.W. Walker, "Optimal matrix algorithrr~s on homogeneous 
hypercubes ", Third Conf. on Hypercube Concurrent Comp~rters and Applications, ACM, 1988, 
pp.15154-1561. 
[12] P.G. Hipes and A. Kuppermann, "Gauss-Jordan inversion with pivoting on the Caltech Mark 
I1 hupercube ", Third Conf. on Hypercube Concurrent Comp~~ters  and Applications, ACM, 
1988, pp.1621-1634. 
[13] G.A. Geist and Michael T. Heath, "Matrix fact,orizat,ion on a hypercube multiprocessor " , 
Hypercube Multiprocessors 1986, SIAM, (Ed. M.T. Heat,h, 1986), pp.161-180. 
[14] Cleve! Moler, "Matrix computation on distributed memory rn~lt~iprocessors ", Hypercube Mul- 
tiproc:essors 1986, SIAM, (Ed. M.T. Heath, 1986), pp.181-195. 
[15] C.H. Romine, "The Parallel solution of triangular system on a hypercube " I ,  Hypercube Multi- 
processors 1987, SIAM, (Ed. M.T. Heath, 1987), pp.552-559. 
[16] S. Lakshmivarahan and Sudarshan .K. Dhall, "A lower bound on the communication complex- 
ity ol' solving linear tridiagonal systems on cube architectures ", Hypercube Muliiprocessors 
1987, SIAM, (Ed. M.T. Heath, 1987), pp.560-568. 
[17] Apostolos Gerasoulis and Izzy Nelken, "Scheduling linear algebra parallel algorithms on MIMD 
Architectures ", Proceedings of the 4th SIAM Conference of Parallel Proct:ssing for Scientific 
Comltuting, 1989, pp.68-95. 
1181 U. Mtlier, "A parallel partition method for solving banded syst,ems of linear equations", Parallel 
Com~~uiing , 2, 1985, pp.33-43. 
1191 S. Leiinart Johnsson, "Solving tridiagonal systems on ensemble architectures " , SIAM Journal 
of Sc~entific and Siaiistical Computing, Vo1.8, No.3, May 1987, pp.354-392. 
[20] M.D. Levin and D.J. Evans, "The inversion of matrices by the double-bordering algorithm on 
MIMII computers ", Parallel Computing, 17,  1991, pp.591-602. 
1211 Alan Geroge, Joseph W.H. Liu and Esmond Ng, "Communication results for parallel sparse 
Cholesky factorization on a hypercube ", Parallel Computing, 10,  1989, pp.287-298. 
[22] S. Ler~nart Johnsson, "Communication efficient basic linear algebra computations on hypercube 
architectures ", Journal of Parallel And Distributed Compvrting , 4, 1987, pp.133-172. 
[23] Robert F. Lucas, Tom Blank and Jerome J .  Tiemann, "A parallel solutioii method for large 
sparse! systems of equations ", IEEE Transactions on Computer-Aided Design, Vol.CAD-6, 
No.6, Nov 1987, pp.981-991. 
[24] Ching-Tien Ho, "Optimal broadcasting on SIMD hypercubes without indirect addressing ca- 
pabililty ", Journal of Parallel and Distributed Comp?rting, 13, 1991, pp.246-255. 
1251 Alan George, Michael T .  Heath, Joseph Liu and Esmond Ng, "Sparse Cholesky factorization 
on a local-memory multiprocessor ", SIAM Journal of Scientific and Statistical Compuiing, 
Vo1.9, No.2, Mar 1988, pp.327-340. 
[26] John IL. Gustafson, Gary R. Montry and Robert E. Benner, "Development of parallel methods 
for a 1024-processor hypercube ", SIAM Journal of Scientific and Statistical Compuiing, Vo1.9, 
No.4, Jul 1988, pp.609-638. 
[27] Anshul Gupta and Vipin Kumar, "Scalability of parallel algorithms for matrix multiplication 
", Te'chnical &port TR91-54, Computer Science Department, University of Minnesota. 
[28] Anshul Gupta and Vipin Kumar, "Scalability of FFT on parallel computers ", IEEE Trans- 
actio~ns on Parallel and Distributed Systems, Vo1.4, No.8, Aug 1993, pp.922-932. 
[29] Ananth Grama, Anshul Gupta and Vipin Kumar, "Isoefficiency measuring the scalability of 
parallel algorithms and architectures " , IEEE Parallel and Distributed Technology, Vol. 1, No.3, 
Aug1993, pp.12-21. 
[30] X-H Sun and D.T. ELover, "Scalability of parallel algorithm-Machine conlbinations ", IEEE 
Tran.sactions on Parallel and Distributed Systems, Vo1.5, No.6, Jun 1994, pp.599-613. 
[31] T.von Eicken, D.E. Culler, S.C. Goldstein and K.E. Schauser, "Active messages: a mechanism 
for integrated communication and computation ", Proc. of the 1Sth In2 '1 Symp. on Computer 
Archlitecture, Australia, May 1992, pp.256-266. 
[32] A. Gursoy and L.V. Kale, "Dagger: combining benefits of synchronoris and asynchronous 
communication styles " , Proc. of the 8th Int '1 Parallel Processing Symp., Mexico, 1994, pp.590- 
596. 
[33] W.Dirlly and et al., "Architecture of a message-driven processor ", Proc. of the 14th Annual 
Int '1 ,Symp. on Comp. Arch., June 1987, pp.189-196. 
[34] G.M. Papadopoulos and D.E. Culler, "Monsoon: an explicit token-store architecture ", Proc. 
of tht: 17th Annual Int'l Symp. on Comp. Arch., Seatitle, Washingtoon, May 1990. 
