Partitioning PDE Computations: Methods and Performance Evaluation by Houstis, Catherine E. et al.
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
1986 
Partitioning PDE Computations: Methods and Performance 
Evaluation 
Catherine E. Houstis 
Elias N. Houstis 
Purdue University, enh@cs.purdue.edu 
John R. Rice 
Purdue University, jrr@cs.purdue.edu 
Report Number: 
86-614 
Houstis, Catherine E.; Houstis, Elias N.; and Rice, John R., "Partitioning PDE Computations: Methods and 
Performance Evaluation" (1986). Department of Computer Science Technical Reports. Paper 532. 
https://docs.lib.purdue.edu/cstech/532 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 
PARTITIONING PDE COMPUTATIONS:
METHODS AND PERFORMANCE EVALUAnON
Catherine E. Houstis*
Department of Electrical Engineering
Elias N. Houstis**
John R. Rice**





We consider modeling, predicting and evaluating the perfonnance of methods for solving
PDEs in parallel architectures. We have developed a method for coarse grain partition-
ing of computations for parallel architectures and we apply it to three PDE applications:
(a) Cholesky factorization, (b) spline collocation, and (c) an application complete from
processing text input to plotting the PDE solution. OUf partitioning method is oriented to
minimizing interprocessor communication and we review some "unifonn" architectures
and models of their communication. We apply this method to the three applications
implemented on the FLEX/32 multicomputer. We review the architecture on the
FLEX/32 and the results of applying the partitioning method to computation running on
the FLEX/32. We observe that the FLEX/32 does not have any communication
bottleneck and probably will not suffer substantial perfonnance degradation if the pro-
cessor speeds are increased by a factor of 10. Our partitioning method works reasonably
well even here where communication costs are negligible. The coarse grain structure of
two of these applications is not highly parallel and we observe speedups of about k/2 for
k processors. The other application is highly parallel and we observe optimal speedups
for any number of processors as the problem size increases.
'loThis research supported in part by NSF grant DMC-8508684.
*"'This research supponed in pan by AFOSR granl 84·0385.
This paper will appear in the January 1987 issue of Parallel Computing.
-2-
1. MODELING THE PERFORMANCE OF ALGORITHM I ARCHITEC-
TUREPAIRS
We consider a Partial Differential Equation (POE) application A to be a computa-
tion with four properties: processing requirements, memory requirements, communica-
tion requirements and precedence (or synchronization) between the subcomputations.
Visualize the com'putation broken into computational modtdes which are nodes of a pre-
cedence graph for the computations. Note the processing and memory requirements at
each node of the graph. Note the communication requirements along each link or edge
or tbe graph. This annotated graph is called G(A). Observe tbat links representing
communication may need to he added to G(A) even though they are redundant as rar
as precedence is concerned.
We consider a machine to have three components: processing elements, memory
elements and communication paths (an interconnection network). Similarly, the
machine can be represented by an annotated graph G(M). In this paper, we consider
machines tha.t have a set of identical processors (which may have local memories), a set
of identical common memories, and a queueing delay function which represents the
communication performance characteristics of the machine. Intuitively, one may con·
sider these as machines with a rather uniform architecture, one that scaJes simply with
the number of processors and memories.
Assume that the graph G(A) is given and that an appropriate choice of probJem
size and granularity has been used in partitioning A to create G(A). Assume also that
G(M) is given. In general, we want to map G(A) onto G(M) so the computational
modules (nodes) of G(A) are associated with processors in G(M) and communication
links (edges) or G(A) are associated with data paths in G(M). We assume here that the
parallelism or G(A) is no larger than that or G(M). Normally the number or nodes in
G(A) is much larger than the number or processors in G(M) and the computation is
8cheduled by assigning the computational modules to processors. This produces a new
graph G' (A). The three graphs, G(A), G' (A) and G(M) along with their associated
parameters (e.g. ~peed of processors, amount of arithmetic at a Dede) are a model of the
PDE computation A to be carried out on machine M. One key question is how well
this model predicts the performance of typical algorithm/architecture pairs. A second
key question is how to map G(A) into G(M) so as to obtain G' (A).
In general, the mapping problem has three somewhat independent steps:
1. Reduce the parallelism or the application to that or the machine.
2. Schedu~e the computationa.l modules so that the application runs efficiently.
3. Given steps I and 2 are done, imhed the application into the machine.
The mapping prohlem is known to be NP complete [JENN77] and to find the optimal
mapping for any of these three steps is also NP complet.e. We propose to use fast,
heuristic algorithms here which are intended to produce good mappings at low cost.
- 3 -
Due to the lack of space, we discuss the parallelism reduction in depth elsewhere
[HOUS86], and assume here that this step is not needed. Given that steps 1 and 2 are
done, step 3 is Cairly easy for the class of machines considered here. Thus, this paper
concentrates on the problem of scheduling tbe application so tbat it runs efficiently and
on evaluating the performance of these schedules.
Our overall plan for making performance evaluations is as follows. We first devise
fast, heuristic algoritbms wbicb carry out steps 1, 2 and 3 above. Note that the order
of steps 1 and 2 is not necessarily fixed, we assume in the discussions that step 1 pre-
cedes step 2 so as to simplify the discussion but this is not an essential ingredient to
our approach. OUf heuristics use communication delay models for "uniform" architec-
tures in order to simplify the complexity of the mapping problem. Once we have the
application mapped onto the machine and scheduled to provide good efficiency, we then
run the program on an actual machine (we use a FLEX/32 multiprocessor) or use a
simulation package, (for example, SIM:ONIFUn85]) to llexecute" a "compressed" revi-
sion of the program and to estimate its performance.
The performance of a distributed system depends very much on how well the
architecture and tbe algorithms are matched IKLEIN85]. A methodology for predicting
multiprocessor performance from the systems point of view and the RP3 architecture is
presented in [NORT85J. The problem of mapping has been studied in [BOKH81J and
[BERM84,85] for the finite element macbine and CHlP architectures, respectively. In
[WILL83J various objective criteria necessary for assigning processes to processors are
examined and tested in a distributed environment. A different approach for modeling
communications complexity of parallel algoritbms is presented in IGANN84J. Our
approach to mapping POE compntations onto parallel macbines (i.e. step 2 of the map-
ping process) is described in IHOUS83j and our approach to compressing the parallelism
is described in [HOUS86J.
In tbis paper we give models of three POE applications (Section 2) and discuss how
to model communications for various machine architectures (Section 3). We then
describe the FLEX/32 architecture and its performance (Section 4) and apply our map-
ping algorithm to the three POE applications (Section 5). The observed performance
for these applications is discussed (Section 6) and compared with predictions Crom the
model. Some conclusions are stated in Section 7.
2. APPLICATION MODELING WITH THREE EXAMPLES
For th~ modeling of an application, we assume an initial partitioning of the com-
putation into a numher of communicating compntational modules. If the communica-
tion paths among the modules are not known a priori (i.e. real time applications) then
the partition is modeled by a stochastic AND-EOR directed graph SG(A). In [HOUS84]
a stochastic a~alysis is applied to reduce SG(A} into a deterministic graph G(A) which
- 4-
comprises the input to the mapping model.
IC the communication paths among modules are known a priori then we use com-
plexity analysis and/or simulation to obtain values for the parameters of the computa-
tion graph G(A). In the case of simulation, we use the data flow language SIMON
{FUJI85! and coucurrent C aud FORTRAN languages {FLEX85) to specify the compu-
tation modules and their communication and synchronization requirements. In order to
execute SIMON programs we use our non-shared memory multiprocessor simulator
{FUJI85). Concurrent C or FORTRAN programs arc run on FLEX/32 multiprocessor
system. Next, we consider three scientific computations to be used for the evaluation
of our methodology.
2.1 Appliclltion I: Cholesky decomposition of symmetric mlltrl••s
We consider the parallelizaUon of Cholesky decomposition of symmetric matrices.
Figure 1 presents the computation of each module in the example of a parallel Chale--
sky decomposition algorithm [O'LEA85]. Figure 2 depicts the corresponding graph
G(A) and the values of the various workload parameters (node processing time and
hlocking time, communication traffic among nodes) ohtained hy setting SIMON's
switching delay to zero. The bloeking lime or algorithm synchronization delay or a








1F(j " N) PUT (OUT...EAST.X):
IF(; "N) PUT (OUT_SOUTH,X):
BREAK:
END




IF Ij " N) PUT (OUT_EAST.X):
IF (i "N) PUT (OUT-.SOUTH.Y):
BREAK:
END
ELSE IF (K=j) THEN
BEGIN
GET (IN_EAST,Y):
IF (j " N) PUT (OUT...EAST,Y):







IF (i ~ N) PUT (OUT_SOUTH,Y);
IF (j ~ N) PUT (OUT_EAST,Z);
END
END ·.vHILE
Figure 1. Computation of node (i,j) in SIMON data flow language for a parallel
Cholesky decomposition algorithm.
A matrix is partitioned into blocks so that the maximum degree or para.llelism is
equal to the number or processors. 1£ smaller blocks are used, then the parallelism
reduction would be applied but we do not consider tbis case bere. Tbi, application pro-
vides an example where the parallelism is mo,Uy less tban maximum and tbus wbere
the speedup should be considerably less than tbe number or processors. See [O'LEA86]
for furtber discussions. That paper sbows tbat optimal speedup is acbievable witb a






m = (24, 30, 30, 25, 27, 31, 37, 32, 27,34,38, 3g, 21, 28, 35, 45)
Module blocking times
.b. = (a, 4, 14, 24, 15,27,37,48,33,45,61,66,51,64,76,88)
Figure 2. Precedcnce graph G(A) or the parallel Cholcsky decomposition algorithm
for a 4 by 4 block decomposition of a symmet.ric matrix. The assumed
numbering of modules is specified in each node. The communication traffic
is indicated as a weight on the links of the graph.
2.2 Application II: Spline collocation method. for elliptic PDEs
We consider t.he numerica.l solution of an elliptic PDE problem defined by the
second order elliptic operator
Lu = au" + 2/1u,y +')1Iyy +6u, +, uy+(U = r,
on ncm2 subject to boundary conditions
Bu = 0 on the boundary or n.
We use the line collocation method based on cuhic splines studied in lHOUS84!,
[VANA85] to discretize the above PDE problem. For simplicity, we consider the
HelmhoJtz equation with constant coefficients and an iterative method corresponding to
SOR for solving the underlying linear equations. This discretization is defined by the
block equations
Alx = (I-w) Alx-wA2y+wF,
Q2y =(1-w)Qzy+wQIx-wG.
where




Q2 = diag (8,8, ... , 8) with 8=trid (1,4,1),
A2 = P-AI, QI = P-Q2,
P = permutation matrix of the ordering of unknowns,
X = soJut.ion vector in horizontal mesh lines.
y =solution vector in vertical mesh lines.
W = overrelaxation parameter
FIG = consta.nt matrices from the right side r.
The discretization (2.1) can be considered as the application of a set of equations
to a data set. For example, the first block 01 equat.ions (2.la) is applied to data associ-
ated with the horizontal mesh lines (see Fig. '3a) and the (2.lb) equations are applied to
data related to the vertical mesh lines (see Fig. 3b). One partition technique is to
divide the data sets (mesh lines) into part:;;, instead of the equations, and assume each
part is assigned to a single processor. Figure 4 depict.s the parallel spline collocat.ion
algorithm based on the above data set decomposition for a two processor system. The
generalization is straight forward. In Section 6 we present the performance of this
parallel algorithm for a. shared memory architecture.
2.3. Application m. Large Bcale PDE computation.
This PDE application is complete from the description of a problem in a very high
level language through the graphical display 01 the computed solut.ion. Figure 5 shows
a schematic diagram with the steps 01 the computation, it has degree 01 parallelism 6.
Not shown in Figure 5 are vertical data communication paths between corresponding
nodes on the lan-in/lan-out 01 the tree structure in the center. The PDE problem
solved is a general one on a Don-rectangular domain. A multifront method is used
based on a nested dissection partition of the domain. Gauss elimination is used to
eliminate unknowns in the interior of each domain. The numbers shown at each node
are the "computation units" for a particular instance of this problem corresponding to
using a 20 by 20 mesh, a finite element method with cubic basis functions, and a 40 by
40 plotting grid. The domain boundary has 3 pieces and, at the lourth step, all but one
processor are working on the interior of the domain. A computational unit here is
about 1000 arithmetic operations plus associated memory and control operations.
Figure 6 show. the same graph with the additional communication paths. These
paths transfer LU factorizations of subbJocks of the linear systems from the initial solve
phase to the back substitution phrase. The numbers shown are data transfers required
in units or 1000 words.
- 8-







I \IX N/2 xN XN+II
.. . . . .
JXo xI XN/ 2·- ,11 I
___t "---..-t-.....Processor] data - L Processor IT data
(B)
Figure 3. Partitions 01 the mesh line data 01 {2.I} into (A) hori20ntal mesh lines and









FACTOR T FACTOR S
I -e~




SOLVE Alxl=Rll SOLVE Al~=RlD
:r "'l:
FORM Ql~ ~ FORMQ1~
FORM Q211 FORM Q2rn




I SOLVE Q2YD R2n•n........
I:
Figure 4. Schematic of lhe parallel spline collocation algorithm and iteration (2.1).
The decomposition ror two processors is shown based on alterna.tely using





Process 3 Boundary Pieces




Create 6 Frontal Are""
Interior Assembly + Elimination
Merger and Elimination












Tabulate Answer for Plot
Tabulate Flow Field
Create Plot Dat.a Structure
Compule Contours, Colors
for 6 Views
Figore 5. 'Annotated graph G(A) Cor Application ill Cor 6 processors. The numbers









Figure 6. The graph 01 Figure 5 showing additional communication edges and the
number or units to be communicated along each edge.
- 12 -
3. COMMUNICATION MODELING IN MULTIPROCESSOR
ARCHITECTURES
Our methodology for reducing the initial interconnection graph G(A) is based on
minimizing its communication cost. Thus, it is crucial to be able to predict the com-
munication cost or G(A) running on a specific muJtiprocessor system architecture. We
consider architectures Cor which there is a realistic simple model of the communication
cost at the message passing level. Such performance models of multiprocessor systems
are given in IMAR83), IKRU83!. The performance measure is a delay function which
provides the expected queueing delay a message suffers in traversing the interconnection
network or the system. Queueing delay is computed as a function of the system's com-
munication interconnection network utilization.
Let
c;j=total traffic (# of messages) between modules mi and mi'
kIP =processors assigned to computational modules,mj,mj
and
dk,Q = interprocessor distance in the interconnection network
of the considered architecture,
then the interprocessor communicatr"o7f. cost or queueing delay is given by
D = D(u)
where u = u(Cij, dk,~ I interconnection network characterization) denotes the utilization.
When two modules are assigned to different processors their utilization of the intercon-
nection network is computed as follows:
c·· .dkfu = I,l I
C.T
where C = capacity of interconnection network and T = time rrame during which each
parallel processor is running.
The delay obtained is only approximate because or the various simplifying assump-
tions involved in the analytical modeling of the performance characteristics (queueing,
utilization) of the multiprocessor system. In all or the models we consider, two assump-
tions are made: (a) every processor accesses the interconnection network at the same
rate, and (b} every processor accesses the various common memory modules with the
same probability. These assumptions lead to a queueing model which is mathemati-
cally tractable. Intuitively, these assumptions imply that the communication reqnire-
ments of G(A) are nearly uniform. rr G(A) is one of many applications running in the
same system, these assumptions may be realistic even it the algorithm has somewhat
- 13-
non~unirorm communication patterns. A precise investigation on this question is under
way.
We consider the performance analysis of five different multiprocessor parallel archi-
tectures: (1) Siogle bus, common memory, (2) Single bus, distributed common memory
modules, (3) Multiple bus, distributed common memory modules, (4) Omega intercon-
nectioD, distributed memory modules, (5) Omega. interconnection processor to processor
rnuJUprocessor system.
3.1 Single bUB common memory architecture
In this architecture, (Figure 7(i)), k processors (Pi) are connected to aD external
common memory (eM) via a global bus (GB). Each P; has a local memory (LM;) con-
nected to its own local bus (LB;). The system has been analyzed in 1MAR83) as a
lImachine repairman" problem. For given bus utilization level, say ull the average
queueing delay per information transfer unit is given by D,(u,) =(k-u.lp,)/u,. The
parameter PI characterizes the communication of the architecture and it is found by
solving the nonlinear algebraic equation.
Pk(p,)-1 k..u, = P ( ) where Pk(p,) =I;p/k!/(k-J)!.
k PI j=O
The workload characterization of the application is given by Pp = pd2 and represents
the ra.tio Ap/ll where Ap is each Pi'S access rate to the bus which in turn depends on the
communication pattern of G(A) and IIp is the average memory transfer requirement
which depends on the average message length.
3.2 Single bus distributed memory architecture
This architecture (Figure 7(ii)), is obtained from the previous one hy distributing
the common memory to each processor. The local memory of each Pi is logically
divided in private memory PMj and common memory CM i and accessed by a double
port.
This system has been analyzed in IMAR83]. The average queueing delay per infor-
maHon transfer unit as a funelion of the bus utilization is given by
k-u /p p
D,(u,) = u: z where p, = ...!:L
Pp+1
k . k'
where Pk(p,) = .I;pJ (k-··)I
J=O J .
• 1'1·
3.3 Multiple bus distributed common memory modules architecture
A multiple bus distributed common memory modules architecture is shown in Fig-
ure 7(iii). The Pi'S have their own LMj's and communicate via common memory
modules. Multiple global busses are used by the Pi's to access CMi's. Such a. system is
referred to as a kxmxb system where k is the number DC Pi's, m is the number oC
eM;'s and b is the number of global busses used. We assume tbat k > m > b.
This system has been analyzed in [MAR821 using several approximate models
based on a Markov chain approach..The model we are adopting here provides pl!rCor.
mance characteristics which are a lower bound to the characteristics oC the actual sys.
tern.
The average queueing delay per inCormation transCer unit is given by
where Pa = Pp
ua =
k k-ik-j ! n rr l
k Ps (i-I)" _I I
where Pk(Ps) = I; ---:k-_,l',~~k-!,'::k~,,.-. -





I;jPj(9) + bI; [Pb(j + b)Pm_b(O-2b-j + m)]
_ j=1 j=O 0
b I 9 b '
L;Pj(O)+ I;(PbO +b)Pm_b(O-2b-j + mJl
j= I j=O
>0
Pj(Ol = Pj(H) +Pj_,(O-j) + ... +p,(I-j) +Po(i-j)
wit.h initial condit.ions
Pj(O) = 0 0 < j
PoW) =0 0 > 0
Pj(O) = 1 j > 0
3.4 Omega Interconnection distrIbuted memory modules
In Figure 7(iv), an example oC an Omega interconnected architecture is shown
where the 9mega network is drawn for k = 8 processors in the system, and 2x2
switches.
This system has been analyzed by [KRU83], as t.he interconnection network of the
ultracomputer {KRU83]. The same analysis applies to a general class of multistage
- 15-
Archlteeture Delay function















~ 1-_..-_-.__~G~'o~b~'I~b"~'.. , __r
~
(iii) Multiple bus distributed common memory















D = 10 k(t +t m2(I-i/n}p )+(m-i)t,










Figure 1. The delay function and workload parameter (pp) as a function of ntiliza-
tion (u) and parameter (p) for Cour shared memory architectures with k
. processors. In case (iii) m = # or common memories, b =#: or common
busses and k ~ m > b. In case (iv) nxn is the size or the switch, tr =
transit time or a switch, tc = cycle time or a. switch, m = #: or packets in a.
msg; p =average # of msgs entered by each P;/eyele time.
int.erconnection networks called Banyan networks [KRU83J. Some or these are the
Omega, Delta or indirect cube interconnection networks. In our case, we have con-
sidered oxu switches where the capacity of the buffers is infinite. Then the average
delay per message in traversing the interconnection network is simply its delay per




Dip,) = log"k(t,+t, ( ) ) + (m-I)t,
2 1- mp,
and
where k =number of processors in the system
t r = transit. delay of a switch
te = cycle Hme of a switrh
n =size of switch (nxn)
m = number of packets per message
P.. =avera.ge number of messages entered by each Pi per cycle time
An exact calculation of D-t(P4) is possible since P4. = u/k and can be directly substituted
into the delay funct.ion.
3.6 Omega Interconnection PE to PE archItecture
In every architecture considered so Car we have dk" = 1. Figure 8 shows an archi-
tecture for which dk,Q ¢. 1. The organization of the processors and memories differs
from that in Figure 7(iv) in that each memory is associated with each processor forming
a single element called the processing element (PE). Thus a PE to PE configuration i.
shown in Figure 8. Note that this time the distance between the processors is
dk,1 = 1,2. For the average queueing delay per message through tbe omega intercon-
nection a formula is given which is the same as in architecture (3.4) but it is stated in
more general terms by INOR85J and is as follows:
u I
D =2( ) xmxstx(I--)
l--u u
where u = utilization of the interconnection by a processor per cycle time
n:- = Dumber of packets per message
si = nnmher of network stages
nXD =size or switch












Figure 8. The delay function for nonshared memory int.erconnection networks u =
utilization of the interconnection by a processor per cycle time, m
number of packets per message, st = number of network stages, nxn -
size of switch.
- 18-
net.works, delta, omega, indirect cube.
4. THE FLEX/32 ARCIDTECTURE AND PERFORMANCE
The FLEX/32 is a collection of computers that can each execute independent
instruction streams and can separately access memory. The memory is shared and dis-
tributed in several modules as illustrated in Figure g. There are several choices of
boards and they can be mixed.
The shared memory computers communicate by using the common memory. The
intercommunication hardware for FLEX/32 is depicted in Figure 10 and consists of two
common busses and ten local busses. Common busses are nonmultiplexed, synchronous
with a clock rate of 12 Mhz, and with address width and data width of 32 bils. The
communication mechanism is implemented by the Common Control Card (Gee) and
the Common Access Cards (CAG). The GAG· card provides a local bus to common
interfaces, contains 64 to 512 kbytes of common memory and hardware for interprocess
synchronization. The ecc card also provides a local bus to common bus interface and
arbitrates conditional critical region communication. Only one CCC is needed per
machine, additional GGG's allow the FLEX/32 to he partitioned into completely
independent machines.
4.2 The Purdue FLEX/32 configuration
The department or Computer Science at Purdue Unhrersity operates a FLEX/32
multiprocessor system with 7 computers based on National Semiconductor N,S 32032
processors each supported by floating point unit. Six computers have 1 Mbyte or local
memory and one has 4 Mbytes. The system has 6 CAG cards with 512 Kbytes of com-
mon memory each. Each computer can operate in serial mode under UNIX and in mule
ticomputing or parallel mode under the MMOS operating system IFLEX85!. The sys-
tem is programmed in Concurrent C and Concurrent FORTRAN 77.
4.2 Purdue FLEX/32 system perf'ormance
Some parameters that have been used to characterize the perrormance or~
machines [FOX85) are:
k = number or processors (or computers)
tcomm = time to communicate one word between processors
til =time for single precision Doating point calculation
t d1'9 =time for double precision 80ative point calculation
tint. =time for integer calculation
text. = time to read or write one word from external devices
The values of these parameters ror the Purdue FLEX/32 are given in Table 1. No








8MI 8M2 8M3 Shared
Modul
Intercommunication Ha.rdware
-fP3lPI I P5 I
I LMII I LM31 I LMSI
P2 I ~ P6 I
I LM21 I LM4 1 I LM61
-
Inlerraces 10 I/O and Peripherals S
Figure 9. . Schemalic or lhe FLEX/32 archileclure. There may he up to len ,hared
memory modules and twenty processor boards.
- 20-
precision calculations.
The FLEX/32 processors (NS 32032) were compared with VAX 780 single proces-
sor computer. It appears that each FLEX/32 computer based on NS 32032 is 1/3 of
VAX 780 Cor double precision floating point calculations and 1/2.65 Cor integer arith-
metic. This agrees closely with comparisons to the VAX 780 by other NS32032 based
machines we have. The LINPACK subroutine DCHDC for Cholesky decomposition for
symmetrie matriees was also used to eompare the performance of the FLEX/32-
NS32032 processor and the VAX 780. Table 2 gives the performance of LINPACK
double precision routiDe in C UDder FLEX/32-UNlX and MMOS. The data indicate
that MMOS is slightly faster. Table 2 shows that a Fortran implementation of
DCHDC is slightly more efficient than the C implementation. Table 2 also shows that
a single processor on the FLEX/32 runs DCHDC faster thaD expected when compared
to a VAX 780. The ratios or execution times a.verage about 2.1 while the ratios or pro-
cessor speeds vary Crom 2.6 to 3.3. Once again we see that predicting performance or
even arithmetic dominated applications cannot be done reliably just by using processor
speeds.
Extensive experiments have been performed to measure the communication cost
through the common memory. For the F;;ldue FLEX/32 configuration communication
delay appears to be eompletely independent of the number of processors that are active
DO matter what they are doing.
·21·
••• C> COMMON BUS *1- .. -" " " """ < < < COMMON BUS *2" " ... "
C>- .. ., -.., .., .., ..,
'" '" '" '"::> ::> ::> ::>
Ol Ol Ol Ol
~ ~ ~
-'<
" " " "0 0 0 0-' -' -' -'
Figure 10. FLEX/32 intercommunication hardware. The common bus operates at a











operands in shared memory
operands in local memory
operands in shared memory
operands in local memory
IIwrite" operations
llread" operat.ions
Table 1. Hardware parameters for the Purdue FLEX!32.
N UNlX-C MMOS-C UNlX/MMOS UNlX-Fi7 FLEX32{VAX780
32 0.62 0.58 1.063 0.58 2.32
64 3.88 3.52 1.103 3.63 2.18
96 12.08 10.74 1.125 11.15 2.18
128 27.18 24.16 1.125 25.33 2.21
160 51.83 45.72 1.134 48.18 2.16
192 87.92 77.34 1.136 82.56 l.Q9
224 137.72 120.94 1.138 125.32 2.05
256 204.63 178.4S 1.146 IP~.28 2.08
228 285.33 ~51.~e 1.137 ~68.2E 2.15
320 390.33 342.96 1.138 367.02 2.16
Table 2. ExecutioL times in seccods on a single processor of tbe LINPACK subrou-
ti:.,·:! DeHDe for Cboieskj' ciecompodtio!J of s)'i"..metrk matrices or order
N. Cclum::::~ I, 2 ~~d 3 stJW the effect of running C programs under the
two operating systems, UNIX a.nd :MMOS. Column 4 shows the time fa. 8
Fortran 77 version and the last column gives the ratio of execution time
for the same program on a VAX 780.
6. MAPPING COMPUTATION MODELS TO ARCHITECTURES
We now consider how to obtain tbe schedule graph G' (A) from the original graph
G(A). V:.: use a b~nristj~ algorithm based on the minimization of communication cost.
This cost is estimated from the data about communicat.ioD in G(A} aDd the averag-e
communicaf.ion deia)" mode:h: such as given in Section 2. The details of this algorithm
are given in !HOUS83j a.nd we briefly summarize it here for comp!eteness. There are
numerous constraints implicit in the mapping problem which complicate a complete
mathematical formulation considera.bIy. However, experience shows that this algorithm
runs f""t and tOo run t.ime is normally linear in the size of G(A) and the numher of pro-
cessors. The reduction can be viewed as a clustering process tha.t tries to minimize
communr"cation Urne for data and computatr"on variable" b)' clust.ering modules together.
This clustering must satisry the roHowing constrainl.s:
(i) reaource constraints:
. Every computational module must fit into the memory assigned.
- Datablocks must fit into the memory "'<signed.
. Com"putations must have enough processor time.
- 23·
(ii) parallelism constraint:
. Parallel computational modules can not he clustered, they are assigned ~,o
different processors.
(iii) artificial constraint:
. Processing time on each processor is limited to T (a parameter).
Tbe time frame parameter T is used implicitly to calculate tbe utilization (u) of tbe
interconnection network of the machine M due to inlermodule communication traffic.
The reduction of T forces more and more processors to be used.
We give tbe results of applying our metbod to map Applications I and 3 to tbe
FLEX/32 architecture. Application 2 is not considered because there is an obvious
mapping for it that gives optimal speedup.
6.1 The clustering algorithm
The solution of the reductjoD problem for a specific time frame T is achieved by
the rollowing heuristic clustering strategy:
Start
Assign one processor pCT computation module
Assign data blocks to lclosest' memory
Iteration
1. Select a pair or computation modules ror possible merger into one processor
which gives maximum reduction in objective runction (communication
cost).
2. Ir no constraint.s are violated, t.hen merges the two modules, otherwise
remove the pair rrom list or eligible pairs.
Experience suggests that this heuristic strategy "solves" the reduction problem in
approximately linear time. Ir the graph G(A) can be scpamt.f>d into disjoints sub-
graphs by synchronization nodes (nodes with parallelism = I), then the subgraphs
are treated separately and the resulting subclusters joined in arbitrary order. The
input to tbe clustering algorithm is as follows:
·24 -
Algorithm spedfication
the applicatioo graph G(A). Recall that this graph explicitly cootains requir.,.
ments tor processing, memory and communication plus the precedence or the
computation modules. From this information one may easily derive the syn-
chronization delay or blocking times of tbe modules.
Architecture characteristics
communication models (delay (unction)
interconnection network bandwidth
G(M) if diflerent than G(A)
Resource constraint
time Crame parameter T
6.2 Mapping application I to the FLEX/32 architecture
The Cholesky parallel algorithm [O'LEA85j was implcmented in Coocurrent C
IFLEX85j for block symmetric matrices. The size of each block is denoted by N' and
the number of blocks is k2 where k represents the number of active processors. The
graph of the applications is depicted in Figure 2 Cor 4 by 4 b1o.ck matrix. The
FLEX/32 communication delay function is tcomm • k *ci,i where Ci,j is the number oC
units or data transferred between modules i and j. Tahle 3 gives the mappings
obtained versus k. Figure 11 shows the clusters Cor the 5 by S case.
This applicat.ion allows us to test t.he claim t.hat the heurist.ic ma,pping algorithm
has run time proportional to the number or nodes and processors. In this case the




















k Cluster #1 Cluster #2 Cluster #3 Cluster #4 Cluster #5 Cluster #6
2 1,2 3,4
3 1,2,3 4,7,8,9 5,6
4 1,2,3,4,15,16 5,9,11,12,13 6,7,8 10,14
5 1,2,3,4,5 6,11,16,18,21, 7,8,9,10 12,18,10,20,22 13,14,15
23,24,25
6 1,2,3,4,5, 7,13,19,25,31 8,9,10,11,12 14,20,26,28,32 15,16,17,18 21,27,33
6,22,23,24 20,30 34,35,36
Table 3. The clustering of the nodes of G(A) produced to obtain G' (A) by tbe heuris-
tic algorithm. For k processors the nodes are numbered using the same pat-
tern as in Figure 2 and then assigned to c1ust.ers as indicated in this tabl~.
5.3. Mapping application ill to the FLEX/32 architecture
The application of the heuristic mapping algorithm to the graph G(A) shown in
Figures 5 and 6 gives the clusters shown in Figure 12. Each node of G(A) is replaced
by the cluster number to which it belongo in G' (A). The interlacing of the clusters
makes the display such as in Figure 11 too complex. The time to con~pute this c1uster~
iog is 5.484 seconds.
5.4 On the beh avior of the mapping algorithm
When the time frame parameter T changes, different clusterings in G(A) can be
obtained. Let
TPAR = the shortest time frame Cor which all allocated processors can run the
application A in parallel,
TMIN = shortest time frame Cor which the application A can run under the
imposed resource utilization constraints.
Our observlltion has been t.hat T MlN -:;f: T pAR when the intermodular communication
cost is very low. For example, in the case of Cholesky decomposition (Figure 2) there
are ten different c1usterings which a.re summarized in Figure 13. Normally,
TMIN = TPAR and the mapping algorithm produces a unique solution. If we define as
an optimum clnstering for G(A) as the one with minimum elapsed time and minimum
communication cost then it is easy to verify the Collowing observations.
Lemma: If TMlN ¢ TpAR then the optimum clustering is the one that
corresponds to time trame T = TPAR or the one with minimum number of clus-
ters.
- 26-
Figure 11. The clusters ohtained hy the mapping ot G(A) into G' (A) tor the 5 by 5
Cholesky computation with N=240. The predicted elapsed time is [94







Figure 12. Partition oC the graph G(A) oC Applications ill into 6 clusters using the
heuristic algorithm with T=66.37. The numbers at tbe nodes indicate to
which cluster that node belongs. Clusters 2, 3 and 6 are also indicated by
the heavy lines.
• 28 -
For the different clusterings G' (A) or Figure 13, the elapsed time versus the time frame
T are plotted in Figure 14. These results were obtained using simulation on the
FLEX/32.
This observation is supported by our experimental results. Table 4 shows the
effeet of using more processors than required for Application I (the 4 by 4, N=240 case).
Communication is 50 Cast in the FLEX/32 compared to computation that one sees little
effect in actuality. We artificially increased the communication requirements by a fac-
tor of 100 and those results are shown in column 3 of Table 4. Then the adverse effect
oC having too many processors is evident.
The current implementation of the graph reduction pbase is capable, by using an
iterative procedure, to identify all possible solutions and the breakpoints as in Figure
13. Also, it is possible to estimate the elapsed time provided we are able to predict tbe
blocking time of eacb module in G(A) due to the precedence of computations. The
workload analysis based on SIMON or FLEX/32 provides this inrorrnation. It turns
out that TpAR is a. close upper bound of the elapsed time when the blocking time of
each module is incorporated as part or the processing time of the modules. We believe
that this approach to predicting the elapsed time is reasonable since the blocking time
is solely an attribute of the application algorithm. The importance of the elapsed time
is twofold. (a) For the same application the performance of different multiprocessor
system architectures can be compared by comparing their elapsed times. The a.dvan-
tage of parallel processing can also be investigated by comparing the elapsed times on a
parallel architecture system to a uniprocessor system. (b) Given different initial parti-
tions G(A) of the same application A, their performance can he compared by looking at
the elapsed time of the different partitions running on the same multiprocessor system.
6. PERFORMANCE OF APPLICATION/ARCmTECTURE PAIRS
In this section we present actual performance data for the three applications on
the FLEX/32. We analyze the speedup obtained and correlate this with the computa-
tion structure and with the predicted values from our allocation algorithm.
6.1 ApplicatIon I/FLEX32 pair
Consider an N by N matrix and the Cholesky decomposition described in Section
2.1 and assume that is to be implemented using k processors. One way is to partition
the ma.trix into k2 blocks. Then the computational complexity of this algorithm is st
least O(N3/k) and the communication complexity is a.t least O(N2/k). This algorithm
was implemented in concurrent C and ran on the FLEX/32 using different values of N
and k. The mapping was determined by the allocation algorithm assuming a linear
delay model t"rnrnxkxc;,j' Tahle 5 shows the speedup and efficiency obtained. The
entry for k=8 was obtained by partial simulation. Efficiency is the speedup obtained






I~ I'" I~ I~= 6 I I0 I ~- I I I0 I I I~ I I'" 4 I'" I I I I IE I I I= I I I I IZ 2 I I I I II I I I I
I I I I I
0 I I I I I
0 20 40 60 80 100 120 140 160 200
TMIN TpAR
Time Frame Parameter, T
Figure 13 The number of clusters obtained Cor different values of the time frame








Elapsed time with communications





Table 4. Effect 01 having more processors than required lor the parallelism 01 an al'"
plication in the case of Cholesky decomposition with a 4 by 4 system and
N=240 (see Figure 2).
- 30-
divided by k. It is worth noting that the communication complexity is :,,:,,duced with
respect in a factor dose to the speedup.
Number 01 Elapsed Speedup Efficiency
active processors Time
1 408046
2 347.10 1.18 .50
3 276.08 1048 040
4 228.67 1.70 .45
5 104.36 2.1 042
6 160.54 204 040
8 102.38 4.0 040
Table 5. Elapsed time, speedup and efficiency oHained lor Application I on the
FLEX/32.
One should not expect speedup 01 about k (or efficiency almost 1) for thi, applica-
tion. As ,een in Figure 2, during the computation tbe parallelism of G(A) grow,
steadily to its max imum and then decreases steadily back to 1. Thus a course grained
partitioning such as we use should use, on the average, about balr the processors so we
expect the efficiency to be about 0.5 which corresponds well with the observed results.
In order to study the effect of communication cost on the computation we have
artificially increased it by a factor of 10 and 100. This increases the computational
time for a 4x4 Cholesky decomposition graph N=240 and 4 active processors as fol-
lows:
Communication Elapsed Ratio to
Cost lime normal case
Normal 228.67
10 times normal 277.30 1.2
100 times normal 747.67 2.7
One of the objectives 01 our methodology is to predict the perlormance of some
algorithm/architecture pairs. It appears that TpAR caD be used for this purpose. Table




k Elapsed time Speedup Speedup
2 316 1.29 1.18
3 243 1.68 1.48
4 162 2.52 1.79
5 159 2.57 2.10
6 140 2.90 2.40
Table 6. Predicted and measured speedup. The predicted time is TPAR trom the al-
location a.lgorithm, the measured time from Table 5.
6.2 Application II/FLEX32 pair
A straight forward estimation of the computational complexity DC the spline collo-
cation method indicates that it requires O(N2) arithmetic operations per iteration where
liN is the size or the partition in the x and y directions. Ideally, distributing the work
among k processors reduces the computational complexity in a parallel algorithm to
O(N2jk). The performance curve in Figure 14 indicates ideal speedup for large N.
Table 7 presents the timings obtained with the FLEX/32 with a single processor and
the VAX 780. These data indicate that the VAX 780 is about twice as fast as a
FLEX/32 processor. This agrees with t.he data in Table 2 and supports the conjecture
that the VAX 780 is not as much faster than FLEX/32 processor as the raw speeds of
the processors would suggest.
- 32-
Speed-up=6







, , , I , N
20 40 60 80 100 120 140 160 180 200 220
Number of Lines in One Direction
Figure 14 Measured speedup for Application II OD the FLEX/32. The rough nature
or the curve ror 6 processors is due to variations Crom properties or neigh-
boring values of N (e.g. at N=174, 176 and 180).
- ~
N FLEX-UNIX FLEX-MMOS VAX-UNIX FLEX-MMOSIVAX-UNIX
12 0.30 0.24 0.08 3.00
24 1.15 0.82 0.32 2.56
48 4.18 3.18 1.38 2.30
96 15.70 12.52 5.85 2.14
120 23.85 19.54 9.32 2.09
150 38.03 30.52 14.20 2.15
180 52.83 46.61' 20.93 2.23
210 72.2g 58.29' 28.43 2.05
216 78.65 63.43' 29.Q2 2.12
Ii: predi·:ted ,'alues
Table 7. Time to execut.e one iteration on a single processor computer under different
operating systems. The computation is one itera.tion DC the spline colloca-
tion method (see equation 2.1) for an N by N mesh.
We obsene t.hat the computatio"1a! complexity and communication complexity are
hoth O(N'/k) here. For the FLEX/32 the ratio
t"mm/tdW = 6.6/31.5 = 0.21
is quite sma.ll which is necessary to obtain the high efficiency observed here. For the
particular case N =180 we have the following results
k = number of processors
speedup
1 2 3 4 5 6
1 2.00 2.47 3.88 4.77 5.65
The speedup curves as shown in Figure 14 are typical of those seen in other appli-
cations and for other machines. This same data is redisplayed in Figure 15, we plot the
number of processors times the elapsed execution time for one itera.tion or this compu-
tation. If we had perfect speedup, then these curves would be constants. We observe
that they have a slight rise, almost linear. In particular, each of these has risen about
3.5 at k =6. This suggests that
Elapsed time = C,l'-'"/k + C, = 0.00135 N'/k + 0.75k
which gives.
Speedup = k(l + (k-1)C,/(C,N'))
Thus there is a very plausible fixed overhead or Cz in the elapsed time per processor.
The significance of this behavior is that C2 is an overhead independent of N and thus
·34·
one that becomes less significant as the problem size increases.
6.3 Application m/FLEX32 pair
Application m reqnire, 252.86 ,econd, to rnn on a single FLEX/32 processor. A
brier examination or Figure 6 ,ngge,t, that a snbstantial ,peednp i, possible ror 6 pro-
cessors, but somewhat less than 6 is expected. Tber~ is a drop in parallelism in the
center or the grap" (where tbe mnltirrontal metbod merges) which involve, substantial
computation and the "create plot data structure" node is se.quential and sizable. When
this application is run with 6 processors on the FLEX/32 using the clustering or Figure
12, it takes 72,('.3 seconds. This is a speedup or 3.34.
An analysi, or Figure 5 shows that the maximum possible speedup i, 3.92 ir one
assumes that communication is instant.aneous. Recall that our heuristic algorithm does
not attempt to minimize elapsed time directly, so the speedup obtained is reasonable.
The time frame parameter T used to achieve the clustering is 66.37. This gives
the predicted speedlip of 252.86/66.37 = 3.81 which is between maximum possible and
the observed value.
7. CONCLUSIONS
We see that the heuristic partitioning algorithm works well even ror a machine like
the FLEX/32 where communication costs are negligible compared to computational
costs. The applications considered have a high correlation between communication and
computational costs which is not surprising, this is probably rairly common. The seven
processor ii"LEX/32 at Purdue exhibits no communication degradation even with aU
processors continually accessing shared memory. We expect this situation to continue
ir the number or processors is increased to the one cabinet limit (20 processors). Our
projection is that the FLEX/32 boards can be npgrade,-: with processor te:: times raster
than the current ones without adversely impacting most computations. We are
surprised to see on two unrelated, arithmetic dominated applications that the FLEX/32
runs haIr as rast as a VAX 780 even though the VAX processors are 3 times raster.
Optimal speedup requires that the degree or paral1elism be maximal most or the
time. Our approach only exploits coarse grain parallelism and two or the applications
can use both coarse and fine grain parallelism. Even so, we observe good speedup or
about k/2 ror k processors. The third application (spline collocation) naturally has high
parallelism and, Cor any number oC processors, we observe optimal speedup as the prof>.
lem size increases. In this case the lack of perfect speedup seems to be due to a small
constant "start up cost" per processor. This small cost has a large apparent effect on
traditional "speedup versus size" graphs.
- 35-
N=220 }




20 ====-=~N~I~2~0======~~ ~-_-_-_-_-] 3.6






Figure 15. Plot or k times elapsed computing time tor Application IT tor N=60, 120,
180 and 220. Note that the difference trom a constant in these curves at




















Berman, F. and Snyder, L., "On mapping parallel algorithms in parallel
architectures," Prot:. Internal. Con/. Parallel Processing, (lgS4) 307-30Q.
Berman, F., Goodrich, M" Koelbel, C., Robinson, ill, \V. J., and Showell,
K., Ilprep_p: A mapping preprocessor for chip computers/' Proc. Inter.
Cont. Parallel Processing, (IU85) 731-733.
Bokhari, Shamid, H., "On the Mapping Prohlem," IEEE Compute.., C-30,
(IUSI) 207-214.
Flexible computer corporation, FLEXI9~ Multicomputer system overview,
Doc. No. 030-0000-002, Second Edition, June (IU85).
Fox, G, The performance of the Caltecb Hypercube in scientific calcula-
tions, Caltech Technical Report, April 5 (IU85).
Fujimoto, R. M., liThe SIM:ON simulation and development system," Sum-
mer Computer Simulation Conference, IgS5 (Univ. of Utah).
Gannon, D. and J. Von Rosendale, On the comml:nication complexity of
parallel numerical algorithms, IEEE Trans. Computer, Vol. C-33, No. 12,
pp. llSo-llg4, Decemher IgS4.
Houstis, Catherine E., IIAllocation of real time application in distributed
systems," (lgS4) submitted for puhlication.
Houstis, C. E., Houstis, E. N., a.nd Rice, J., "Partitioning and Allocation of
PDE Computation to Distributed Systems," PDE Software: Modules Inter-
/ace. and Sy.!em., Edited by B. Enguist and T. Smedsaas, North Holland,
(IU83), 67-S5.
Houstis, E. N. and Rice, J. R., Heuristic methods for reducing the par_~l1el­
ism in computational graphs, Technical Report, Computer Science, Purdue
University, 19S6.
Hwang, Kai, and Briggs, Faye, Computer Architecture and Parallel Pro-
ce...·ng, McGraw-Hill, tgS4.
Jenny, C. J., uProcess partitioning on distributed systems," Digest of paper
Nationat Telecammunica!io.. Con/erence, (IU77) 31:1-31:10.
Kleinrock, Leonard, "Distributed systems," Communications ACM, 28,
(IUS5), 1200-1213.
Kruskal, C., and Snir, M., liThe performance of multistage interconnection
nets for multiprocessing," IEEE Tra ... Comput"" (lg82), IOgl-IOgS.
Marsan, M. A., and Gerla, M., lIMarkov models for multiple bus multipro-
cessor systems," IEEE Tran•. Compute.., C-32, (IUS3) 2311-248.
Marsan, M. A. Balbo, G., and Conte, G., "Comparative performance
analysis of single bus multiprocessor architectures," IEEE Trans. Comput.
ers, C-31, (IU83) 1170-11Ul.
Norton, Alan and G. F. Pfister, "A methodology for prediCting multipro-






O'Leary, D. P. and G. W. Stewart, "Data-flow algorithms ror parallel
matr;x computations," Communication ACM, 28, (1085) 84()'853.
O'Leary, D. P. and G. W. Stewart, "Assignment and scheduling in parallel
matrix ractorization" LiD. Alg. Appl., (1086), to appear.
Williams, E. A. I "Assigning processes to processors in distributed systems,"
Pro,. Int,rnat. Conf. Paratld Pro""ing, (1083) 404-406.
