Torus data flow for parallel computation of missized matrix problems  by Funderlic, R.E. & Geist, George A.
Torus Data Flow for 
Parallel Computation of Missized Matrix Problems 
R. E. Funderlic and George A. Geist 
Engineering Physics and Mathematics Division 
Oak Ridge National Laboratory 
Oak Ridge, Tennessee 37831 
ABSTRACT 
A data-flow approach is used to solve dense symmetric systems of equations on a 
torus-connected 2-D mesh of processors. A torus mapping of the matrix onto this 
processor array allows the Cholesky decomposition to be completed in 3n- 2 time 
steps using only n2/4 processors (less than half the number needed in previously 
reported results). New definitions for missized problems and parallel algorithm perfor- 
mance are given along with various time-step, efficiency, and processor utilization 
plots. 
1. INTRODUCTION 
We define a data-flow algorithm as a group of computational processes 
(nodes) where each process has a program of the form: 
wait for data 
compute 
send data 
if (finished) stop 
repeat 
Notice that this definition contains no implications about how synchroniza- 
tion is accomplished. In data flow, synchronization can be global, as in 
systolic arrays (e.g., [6]) and/or local to a particular node. Local synchroniza- 
tion can be as simple as the data arriving in order and immediately being used 
LINEAR ALGEBRA AND ITS APPLZCATZONS 77: 1499163 (1986) 149 
1.50 R. E. FUNDERLIC AND GEORGE A. GEIST 
by the processor, or it can be as complicated as the data arriving out of order 
and the processor having to wait until it has received and ordered enough 
data locally to carry out the computation. Depending on the computer 
architecture, global synchronization may be impossible or undesirable. We 
consider just this case, in which local synchronization is the only means used 
to ensure correct algorithm synchronization. Our approach to data flow 
follows and complements that of O’Leary and Stewart [9, lo]. We make the 
assumption, as they did, that we have a torus-connected processor array 
connected orthogonally (each processor is connected orthogonally only to its 
four nearest neighbors). Also, each processor is a CPU with some local 
memory. The O’Leary-Stewart approach follows more along the lines of the 
wavefront approach of S. Y. Kung [1984]. For other viewpoints see Dennis [3] 
and Davis and Keller [2]. 
We use the Cholesky algorithm as a paradigm and focus on the problems 
that arise when there are more computational nodes than processors. The 
algorithms we chose to work with have fine granularity. For matrix problems 
there can be one computational node for each matrix element, and typically 
several nodes are associated with each processor. When one processor has to 
handle several nodes simultaneously, resource contention becomes a problem. 
At a particular instant each of the nodes on a processor will be either 
(1) active, i.e. executing on the processor, (2) ready and waiting for the 
processor to become available, (3) wuiting for data to arrive, or (4) dead, i.e., 
finished for the remainder of the job. Note that at a particular instant only 
one node can be active on a processor. This leads to a scheduling problem. 
The ready nodes can be executed in any order, but some scheduling al- 
gorithms may lead to better performance than others. We investigated three 
different scheduling algorithms based on wave number and order of data 
arrival and found the execution time varied about 6% between them. Details 
of these three orderings are given in Section 4. 
In Section 2 we discuss the inherent parallelism in the Cholesky algorithm. 
In Section 3 we discuss several ways of mapping matrices onto processor 
arrays. We will focus particularly on mapping the entire and the lower 
triangular part of matrices into a torus configuration. Also introduced in this 
section is the idea of a problem being missized with respect to an optimal 
number of time steps. In Section 4 an implementation of the Cholesky 
algorithm is detailed which requires only n2/4 processors to perform the 
factorization in 3n -2 time steps. Section 5 gives results from a portable 
data-flow simulator we have written. Included in these results are several plots 
of processor utilization during the solution of the problem. A comparison 
between the optimal parallel performance and our algorithm is graphed. 
Finally, two different performance measures are plotted. 
TORUS DATA FLOW 151 
for k=l to xl loop 
scp-t: dk,kI- s&&&D; 
for i+k+ 1 to P 100p 
cdiv: di,k&&k~&k~ 
end loop; 
for $=k+l to n loop 
for &k+ 1 to n loop 
for j+k+l to n loop 
elim: ali jl+=ali jl-ali&+& j% 
end loop; 
end loop; 
end loop; 
FIG. 1. Submatrix Cholesky. 
2. PARALLELISM IN THE CHOLESKY ALGORITHM 
Dongarra, Gustavson, and Karp [4] have analyzed various ways for 
implementing Gaussian elimination and have given implications for several 
architectures. For Gaussian elimination there are three nested loops which 
can be permuted to give six possible loop variations. George, Heath, and Liu 
[5] have characterized these six loop methods for self-scheduling a pool of 
tasks for the Cholesky algorithm. These six are variations of either TOW, 
column, or submatrix Cholesky. The submatrix version of the Cholesky 
algorithm, Figure 1, is particularly well suited for data flow since its wave- 
front implementation conveniently parallelizes all three nested loops. 
In order to analyze data-flow algorithms in a simple fashion, it is conve- 
nient to assume that all processors that have a node ready to execute compute 
together and take the same amount of time. This computation along with the 
necessary sending of data is called a time step. Though in practice the 
situation is much more complicated, this is a good approximation to actual 
performance as long as the sending of data dominates the time step or the 
time required to perform a square root, a division, or an elimination is 
approximately the same. 
As was mentioned, submatrix Cholesky, as given in Figure 1, has consider- 
able parallelism. At the kth stage of the algorithm we are left with a 
submatrix to decompose, made up of the last n - k rows and n - k columns. 
In order to determine the optimal parallel speed, we will assume we have an 
ideal parallel computer. This ideal computer can perform as many parallel 
152 R. E. FUNDERLIC AND GEORGE A. GEIST 
S D 
El D E Takes 3 steps for each pass except last 
s 
I D L 
D 
--I Takes n passes to complete E 
. 
. 
3n-2 Time Steps is Minimal 
FIG. 2. Submatrix-Cholesky parallelism. 
operations as desired (see Figure 2). A square root must first be performed 
(S), then all the 2( n - k) divisions (D), and finally all the (n - k)’ elimina- 
tions (E). Thus each stage except the last takes three time steps, and the last 
stage takes one time step. Therefore 3n - 2 time steps are optimal, taking into 
consideration the synchronization dictated by a loop version of the submatrix 
Cholesky algorithm. An important point to notice is that by taking advantage 
of symmetry, fewer processors can be used, but the optimal speed is still 
3n - 2 time steps. 
O’Leary and Stewart have shown that if the number of processors is the 
same as the number of matrix elements, then their data-flow implementation 
of Cholesky also takes only 3n - 2 time steps (see Figure 3). 
For missized problems the number of processors is typically much less 
than the number of matrix elements. In this case we are so far removed from 
the ideal parallel setting that it is more realistic to look at speedup. Speedup is 
the ratio of the sequential time steps to the parallel time steps. Sequentially it 
takes n3/6 + 0(n2) time steps to perform any of the six variations of the 
symmetric Cholesky algorithm. Unless otherwise specified, we will ignore 
TORUS DATA FLOW 153 
n=3 p-3 
Sl - - - Dl - - - DI 
- - - Dl - - - El - 
- - - - - - Dl - - 
- _ _. - - -. _ _- - 
- 32 El - - E2 - - - 
_ El - - E2 El - _- E2 
- - - 
- - s3 
311-2 Time Steps 
FIG. 3. Wavefront implementation of Cholesky on matched problem. 
terms other than the highestorder ones. Since the minimal parallel speed is 
3n - 2 time steps, the optimal speedup is n2/18. If we have two computers 
working lOO?& of the time, the optimal speed we could achieve is (n3/6)/2 
time steps. Similarly if we have p2 processors, where p2 << n2, then the 
optimal speed, i.e., minimal number of time steps, that can be achieved is 
( ns//p2)/6. 
3. MAPPING M&SIZED PROBLEMS 
There are many ways to map a large matrix onto a small number of 
processors. Block, reflected, and torus mappings, depicted in Figure 4, are 
among possible mappings (see, e.g., [lo]). In block mapping the matrix is 
partitioned into submatrices, each of which is then mapped onto a processor. 
We think of cutting the problem into p2 smaller problems, each assigned to a 
processor. This is probably the most widely used method of approaching 
missized problems. In reflected mapping, adjacent elements of the matrix are 
assigned to adjacent processors in the processor array. At the edge of the 
processor array the mapping is reflected back across the array. In this case, 
we think of folding the matrix in both dimensions until it fits on the processor 
array. In the torus mapping, adjacent elements are again assigned to adjacent 
processors. Unlike the reflected mapping, when the edge of the processor 
array is reached, the torus mapping wraps around to the opposite edge. Thus 
we think of wrapping the matrix around the processor array in both dimen- 
sions. 
154 R. E. FUNDERLIC AND GEORGE A. GEIST 
BLOCK REFLECTED TORUS 
cui fold -P 
FIG. 4. Some matrix-to-processors mappings. 
We chose to study the torus mapping because it displays an interesting 
data-flow pattern. The torus data movement is best at balancing workloads as 
long as communication requires the same order of time as a computation [S]. 
Data flow around the processor array until the job is complete. In addition, 
tori are easily embedded in the now popular hypercube architecture. For 
symmetric missized problems the right side of Figure 5 depicts a useful 
Block 
311-2 Time Steps for Both Block and Torus Mappings 
FIG. 5. Matrix and processor arrays minimally matched: 3n -2 time steps for 
both block and torus mappings. 
TORUS DATA FLOW 155 
triangular mapping onto a torus array of processors. Mapping the lower 
triangular part of a matrix onto a processor array can be done with any of the 
methods mentioned above. 
When block mapping is used, obtaining the minimal number of time steps 
(3 n - 2) requires n( n + 1)/2 processors as depicted in Figure 5. Our simula- 
tor led us to realize that the minimal number of time steps can be achieved 
with many fewer processors if a torus mapping is used. Minimality is achieved 
if n = 2p; i.e., only n2/4 processors are required to factor the matrix in 
3 n - 2 time steps. 
The mappings shown in Figure 5 suggest a new definition of matrices and 
processors being matched. We say that a problem is minimally matched if 
the minimum number of processors necessary to achieve the minimal number 
of time steps is used to solve a given parallel algorithm. An important point 
here is that the definition of a data-flow algorithm includes for each node a 
send command. This implies that a data-flow algorithm includes the way the 
problem is mapped onto the processor array. A nice feature of the O’Leary- 
Stewart [lo] approach is that the node programs are invariant under a variety 
of mappings. 
Notice that both diagrams in Figure 5 depict minimally matched prob- 
lems. The number of processors needed for each differs because a different 
algorithm was used for each of them. One uses a block mapping and the other 
uses a torus mapping. 
4. IMPLEMENTATION OF THE ALGORITHM 
Perhaps the most important purpose of the paper is to indicate a simple 
implementation of the Cholesky method that takes the minimal 3n -2 time 
steps with fewer than n( n + 1)/2 processors. Brent and Luk [l] have given a 
systolic array that requires n(n + 1)/2 processors to achieve 3n - 2 time 
steps. The O’Leary and Stewart [lo] implementation can be thought of as 
having a program for each matrix element on a processor. Let the lower 
triangular matrix elements be torus mapped onto a processor array as in 
Figure 5. Then removing Figure 6 gives a similar implementation of our 
algorithm. 
Another implementation of this method, which is the one we simulated, 
has only one program on each processor. This program does dynamic load 
balancing on the nodes for which the processor is responsible. The load 
balancing occurs for two reasons. First, the processor does not try to execute 
nodes that are not receiving information. Thus a processor that is responsible 
for a large number of nodes does not waste time context-switching to the 
156 R. E. FUNDERLIC AND GEORGE A. GEIST 
init: 
sqrt: 
cdiv: 
elim: 
FIG. 6. 
Node (I. J> 
k:=O; 
loop 
k:-k+l; 
if k-1 and k-J then 
await0 
a:-sqrtca); 
send(a:south) 
finis: 
elsif k-J then 
await(an:north); 
a:-a/an; 
send(an:south) (a:east); 
finis: 
else 
await(an:north) (aw:west); 
a:-=8 - an*aw: 
send(an:south) (aw:east): 
end if: 
end loop; 
Program for Cholesky decomposition. 
many nodes that are stuck in await states. Second, the processor executes 
nodes in the order the data arrive, or as closely to this order as possible. By 
executing nodes in this order the wavefronts tend to move at near their 
maximum rate across the matrix. 
Before getting into the details of the method, we will give some back- 
ground information on the message structure and processor queues. Each 
message that is sent from processor to processor contains the destination node 
ID, which is equivalent to the row and column indices of the matrix element 
the datum will be applied to, the wave number of the data, and finally the 
datum itself, which in our case is a floating-point number. The processor 
maintains two queues, the wait queue and the ready queue. All incoming 
messages are placed in the wait queue. The processor executes those nodes 
whose numbers are in the ready queue in some user-defined order. The three 
orders we tried were top to bottom (order of arrival), lowest wave number 
first, and highest wave number first. 
We use two routines to implement the method. The first routine searches 
the wait queue for data needed for executing a node. Notice that knowing the 
node ID and wave number is sufficient for determining this condition. Once a 
TORUS DATA FLOW 157 
, p-3 
I n-40 
d I I 
FIG. 7. Processor utilization for fnll matrix using torus mapping. 
node is found, it is placed along with the data and wave number onto the 
ready queue. The second routine is the node subprogram. This routine 
searches the ready queue from top to bottom, executing all those nodes which 
both are listed and have a wave number one greater than the previous wave 
number seen by this node. The same “if” tests illustrated in Figure 6 are used 
to determine which of the three operations will be performed on this matrix 
element. 
5. GRAPHICAL RESULTS FROM SIMULATOR 
This final section gives some graphs depicting results of our data-flow 
Cholesky simulation. We first give utilization graphs and then graphs depict- 
ing nearness to optimality. The section concludes with an efficiency graph 
and an additional performance graph. 
Figure 7 shows the processor utilization for a ‘40 x 40 matrix mapped onto 
a 3 ~3 processor array using torus mapping. Almost immediately all 9 
158 
a - 
: 
3 
2 
2 
0 L 
0 loo0 2aoo 3ooo 
TIME STEPS 
FIG. 8. Processor utilization for triangular matrix using torus mapping. 
6 
6 
6 
I- 
LOWER TRIANGLE 
I 
p-3 
n-100 
I 
10,cKlo 
TIME STEPS 
FIG. 9. Processor utilization for larger matrix. 
1.0 
0.E 
0.6 
. 
0.4 
02 
0 
I I 
STEPS - 8 n3/pz+ . . . 
------------- --- 
I 
I I 
1060 2066 
t&p2 
159 
0 
FIG. 10. Full matrix: comparison with optimal performance. 
I 
LOWER TRIANGLE 
0.8 - 
STEPS = a n3/p2+ . . . 
0.6 
a 
0.4 
0.2 - _------ ---_ 
0’ I I 
0 1000 20@3 3000 
n31p2 
FIG. 11. Triangular matrix: comparison with optimal performance. 
160 R. E. FUNDERLIC AND GEORGE A. GEIST 
TABLE 1 
PARALLEL TIME STEPS 
P=3 P=4 P=5 
N Steps N Steps N Steps 
1 1 
2 4 
3 7 
4 10 
5 13 
6 16 
7 22 
8 28 
9 34 
10 45 
11 55 
12 65 
13 80 
14 94 
15 109 
16 130 
17 151 
18 172 
19 200 
20 229 
21 256 
22 292 
23 329 
24 365 
25 409 
26 454 
27 500 
28 554 
29 609 
30 664 
40 1458 
50 2739 
100 20233 
1 1 1 1 
2 4 2 4 
3 7 3 7 
4 10 4 10 
5 13 5 13 
6 16 6 16 
7 19 7 19 
8 22 8 22 
9 28 9 25 
10 34 10 28 
11 40 11 34 
12 46 12 40 
13 57 13 46 
14 68 14 52 
15 78 15 58 
16 88 16 70 
17 103 17 79 
18 118 18 91 
19 133 19 102 
20 146 20 113 
21 167 21 126 
22 188 22 141 
23 209 23 156 
24 230 24 170 
25 258 25 185 
26 286 26 204 
27 314 27 225 
28 343 28 246 
29 378 29 267 
30 415 30 288 
TORUS DATA FLOW 161 
processors are kept constantly busy, and they continue to be fully utilized 
until more than half the time steps are finished. When the factorization nears 
completion, the processor array alternates between being fully used and 
partially used. This behavior continues until the very end of the factorization. 
Figure 8 shows the processor utilization for the lower triangle of a 40X40 
matrix mapped onto the same 3 X 3 processor array using torus mapping. 
Several noteworthy observations can be made. First, the triangular implemen- 
tation of Cholesky takes about 2 the number of time steps of full mapped 
Cholesky. Because of resource contention the ratio, 5, is slightly higher than 
the ratio, $, that occurs in the sequential comparisons. Second, a long stretch 
of full processor utilization is not observed as it was for the full mapped case. 
Instead the pulsing behavior exists from the beginning to the end of the 
factorization. At first we believed that this was a characteristic ending to this 
problem and the triangular case did not last long enough to show full 
processor utilization. To test this idea we ran a triangular mapped 100 X 100 
matrix. The results in Figure 9 show that a long period of full processor 
utilization is not achieved and that the pulsing behavior is apparently 
characteristic of the factorization of a triangular mapped matrix. 
Given the above performance, we sought to determine how the number of 
time steps for the two torus mappings compare with the optimal parallel 
performance (see Figures 10 and 11). From our simulator for a given n3/p” 
100 
80 
0 
FIG. 12. 
I I I I 
0 20 40 60 80 100 
n 
Efficiency curve for lower-triangle mapping. 
162 R. E. FUNDERLIC AND GEORGE A. GEIST 
we obtain the number of time steps s(n, p) (see Table 1). Then a is obtained 
from s = an3/p2 and is plotted. Notice how the value of a appears to be 
approaching the optimal values of i and i respectively. Though Figures 10 
and 11 were obtained for fixed p (p = 3), the graphs are consistent with the 
several p’s we tried. 
Figure 12 shows the efficiency of triangular mapped Cholesky. In this 
graph efficiency is defined as the optimal number of parallel time steps 
divided by the actual number of parallel time steps. The optimal number of 
parallel time steps for the Cholesky algorithm described is (n3/6 + n2/2 
+ n/3)/p’. For a given processor array the efficiency of our algorithm 
improves as the order of the matrix increases. For matrices of reasonable size, 
efficiencies over 95% are observed. One problem with this definition of 
efficiency is that optimal efficiency always occurs when p = 1 (the sequential 
case) because this efficiency is based on processor utilization. 
In order to get a better feel for the improvement of performance due to 
parallelism as the number of processors increases to its optimum, we defined a 
measure based on the peak theoretical speed of 3n - 2 time steps as described 
100 
60 
b0 
“iE 
(si”. 
P 
,8 40 
20 
0 
FULL MATRIX 
p- 10 
I I I 
0.4 0.6 0.6 1.0 
p2/n2 
FIG. 13. Algorithm performance with respect to optimal speed. 
TORUS DATA FLOW 163 
in Section 2. This measure is defined as the peak theoretical speed divided by 
the actual parallel speed, both evaluated in parallel time steps. Figure 13 
shows the performance of the two algorithms plotted against p2/n2, which is 
the ratio of processors to computational nodes. We see that the torus 
triangular mapping improves much faster than the torus square mapping even 
though the processor utilization is better in the square case. This points out 
the difference between optimizing with respect to processor utilization and 
with respect to execution speed. 
This paper is based on the data-flow approach of Dianne O’Leary and 
Pete Stewart. We thank them for their suggestions and encouragement. 
The simulation program that produced the results of Section 5 was coded 
by James Meyer@, and undergraduate 1984 Oak Ridge Associated Universi- 
ties Research Participant. 
REFERENCES 
1 R. Brent and F. T. Luk, Computing the Cholesky factorization using a systolic 
architecture, Report TR82521, Dept. of Computer Science, Cornell Univ., 
Ithaca, N.Y., 1982. 
2 A. Davis and R. Keller, Data flow program graphs, IEEE Computer 15:26-41 
(1982). 
3 J. Dennis, Data flow supercomputers, IEEE Computer 13:48-56 (1980). 
4 J. J. Dongarra, F. G. Gustavson, and A. Karp, Implementing linear algebra 
algorithms for dense matrices on a vector pipeline machine, SIAM Reo. 26:91-112 
(1984). 
5 J. A. George, M. T. Heath, and J. L’ m, Parallel Cholesky factorization on a 
multiprocessor, to appear. 
6 H. T. Kung, Why systolic architectures?, IEEE Computer 15:37-46 (1982). 
7 S. Y. Kung, On supercomputing with systohc/wavefront array processors, IEEE 
Proc. 74:867-884 (1984). 
8 D. P. O’Leary, personal communication, 1984. 
9 D. P. O’Leary and G. W. Stewart, On the mapping problem for parallel 
implementation of matrix factorizations, to appear. 
10 D. P. O’Leary and G. W. Stewart, Data-flow algorithms for parallel matrix 
computations, Comm. ACM 28:841-853 (1985). 
Rweioed 30 November 1984; revised 28 October 1985 
