Performance Evaluation Modeling for Distributed Computing by Houstis, Catherine E. et al.
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
1986 
Performance Evaluation Modeling for Distributed Computing 
Catherine E. Houstis 
Elias N. Houstis 
Purdue University, enh@cs.purdue.edu 
John R. Rice 
Purdue University, jrr@cs.purdue.edu 
Report Number: 
86-576 
Houstis, Catherine E.; Houstis, Elias N.; and Rice, John R., "Performance Evaluation Modeling for 
Distributed Computing" (1986). Department of Computer Science Technical Reports. Paper 495. 
https://docs.lib.purdue.edu/cstech/495 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 
PERFORMANCE EVALUATION MODELING FOR
DISTRIBUTED COMPUTING
Catherine E. Houstis
Department of Electrical Engineering
Elias N. Houstis
John R. Rice
Department of Computer Science
Purdue University
West Lafayette, Indiana 47907
ABSTRACT 
We present a methodology for evaluating the performance of application programs
on distributed computing systems. An application A is represented by an annotated
graph G(A) giving its requirements for processing, memory and communication plus the
precedence between computation modules. Machines are represented by a similar graph
G(M), and the methodology is to (a) map G(A) into G(M), (b) compress the application
code to a much simpler one that uses the same resources, and (c) use a simulation
package on the compressed code. Step (a) is subdivided into three parts: (a1) reduction of
the parallelism, (a2) scheduling the computational modules to (nearly) minimize
communication costs, and (a3) actual layout of the resulting graph into G(M). The two
technical problems addressed here are (1) using communication delay models to simplify G(M)
and the step (a3), and (2) scheduling the modules (step (a2)). Our communication models
apply well to "uniform" architectures; we explicitly consider the following five: single
bus and common memory, single bus and distributed memory, multiple bus and
distributed common memory, omega interconnection and distributed common memory, and
omega interconnection and distributed memory.














1. INTRODUCTION 
In a previous paper [HOUS83] we considered the problem of mapping partial
differential equation (PDE) computations into parallel machine architectures. We
continue this study here, but with no particular involvement of PDE computations. We
now consider the problem for a special, but widely used, class of architectures where we
can obtain better results more simply. The architecture class we consider is where the
machine has a somewhat homogeneous nature (this is made more specific later). To set
the context, we briefly outline the general problem of mapping an application onto a
machine and our approach to this problem.
We consider an application A to be a computation with four properties: processing
requirements, memory requirements, communication requirements and precedence (or
synchronization) between the subcomputations. Visualize the computation broken into
computational modules which are nodes of a precedence graph for the computations.
Note the processing and memory requirements at each node of the graph. Note the
communication requirements along each link or edge of the graph. This annotated
graph is called G(A). Note that links representing communication may need to be
added to G(A) even though they are redundant as far as precedence is concerned.
We consider a machine to have three components: processing elements, memory
elements and communication paths (an interconnection network). Similarly, the
machine can be represented by an annotated graph G(M). In this paper, we consider
the class of machines that can be modeled by a set of identical processors (which may
have local memories), a set of identical common memories, and a queueing delay
function which represents the communication performance characteristics of the machine.
Five specific examples from this class are considered here. Intuitively, one may
consider these as machines with a rather uniform architecture, one that scales simply with
the number of processors and memories.
In general, the mapping problem has three somewhat independent steps:
1. Reduce the parallelism of the application to that of the machine.
2. Schedule the computational modules so that the application runs efficiently.
3. Given steps 1 and 2 are done, imbed the application into the machine.
The mapping problem is known to be NP-complete [JENN77], and finding the optimal
mapping for any of these three steps is also NP-complete. We propose fast, heuristic
algorithms here which are intended to produce good mappings at low cost. Due to the
lack of space, we discuss the first step in depth elsewhere [HOUS86]; a brief comment is
given at the end about our approach. Given that steps 1 and 2 are done, step 3 is
fairly easy for the class of machines considered here (see Section 4). Thus, this paper
concentrates on the problem of scheduling the application so that it runs efficiently on
one of the machines considered.
Our overall plan for making performance evaluations is as follows. We first devise





























of steps 1 and 2 is not necessarily fixed; we assume in the discussions that step 1
precedes step 2 so as to simplify the discussion, but this is not an essential ingredient of
our approach. We use communication delay models for "uniform" architectures in
order to reduce the complexity of the mapping problem. Once we have the application
mapped onto the machine and scheduled to provide good efficiency, we then take
the application code and "compress" it to a code that takes approximately the same
resources. For example, the following loop
    for i = 1 to 100
        for j = 1 to 20
            a = (i + j) * b
            x = 3. * cos(3.14 * a) + (a - b) * log(i + j + Imax)
            c(i,j) = x ** 2 + a/b
    end
might be replaced by
    for i = 1 to 2000
        dummy = 1. + 2. + 3. + 4. + 5. + 6. + 7. + 8. + 9. + 10. + 11. + 12.
        dummy = sin(a) + sin(b)
    end
as there are 12 arithmetic operations and 2 function evaluations in the loop body. Finally, we use a
simulation package (for example, SIMON [FUJI85]) to "execute" the compressed code
and estimate its performance.
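The compression step can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' tool: the `LoopProfile` record, the `compress` function and the choice of `sin` as the stand-in function are assumptions made here. Given the total trip count and the per-iteration operation counts, it emits a flat loop with the same number of arithmetic operations and function evaluations.

```python
from dataclasses import dataclass


@dataclass
class LoopProfile:
    """Hypothetical summary of a loop nest (names are illustrative)."""
    iterations: int      # total trip count of the loop nest
    arithmetic_ops: int  # arithmetic operations per iteration
    function_calls: int  # elementary function evaluations per iteration


def compress(profile: LoopProfile) -> str:
    """Return pseudocode for a flat loop with the same operation counts."""
    calls = profile.function_calls
    # Joining f function calls with '+' already costs f - 1 additions.
    adds_in_calls = max(calls - 1, 0)
    remaining = profile.arithmetic_ops - adds_in_calls
    lines = [f"for i = 1 to {profile.iterations}"]
    if remaining > 0:
        # n additions need n + 1 operands.
        terms = " + ".join(f"{t}." for t in range(1, remaining + 2))
        lines.append(f"    dummy = {terms}")
    if calls > 0:
        evals = " + ".join(f"sin(x{t})" for t in range(1, calls + 1))
        lines.append(f"    dummy = {evals}")
    lines.append("end")
    return "\n".join(lines)


# The loop of the example: 100 * 20 iterations, 12 arithmetic operations,
# 2 function evaluations per iteration.
compressed = compress(LoopProfile(iterations=100 * 20, arithmetic_ops=12, function_calls=2))
print(compressed)
```

The sketch assumes there are at least as many arithmetic operations as are needed to join the function calls; a real tool would also have to account for memory references and loop overhead.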
The performance of a distributed system depends very much on how well the
architecture and the algorithms are matched [KLEIN85]. A methodology for predicting
multiprocessor performance from the systems point of view and the RP3 architecture is
presented in [NORT85]. The problem of mapping has been studied in [BOKH81] and
[BERM84,85] for the finite element machine and CHiP architectures, respectively. In
[WILL83] various objective criteria necessary for assigning processes to processors are
examined and tested in a distributed environment. A different approach for modeling
the communication complexity of parallel algorithms is presented in [GANN86].
2. MODELING PERFORMANCE OF ALGORITHMS/ARCHITECTURE 
PAIRS 
Assume that the graph G(A) is given and that an appropriate choice of problem
size and granularity has been used in partitioning A to create G(A). Assume also that
G(M) is given. In general, we want to map G(A) onto G(M) so the computational
modules (nodes) of G(A) are associated with processors in G(M) and the communication
links (edges) of G(A) are associated with data paths in G(M). We assume that the
parallelism of G(A) is no larger than that of G(M) (see the discussion in Section 4).
Normally the number of nodes in G(A) is much larger than the number of processors in
G(M) and the computation is scheduled by assigning the computational modules to
processors. This produces a new graph G'(A). We choose G'(A) to be efficient in the sense
that the communication costs of the computation are low. Experience shows that this
criterion provides schedules which are also good in the sense of short elapsed execution
times.
We assume that the machine is "uniform" in nature and richly endowed with
communication paths. This assumption holds for many architectures, including the five specific
cases considered here in detail. This assumption has two effects. First, it provides
simple algorithms to carry out the final layout step of mapping G'(A) into G(M) (see the
discussion in Section 4). Second, it allows us to model the communication behavior of
the machine by known formulas that are practical to use in determining G'(A). In this
section we introduce an example showing how G(A) appears and then present the
communication delay models for the following architectures: (1) single bus and common
memory, (2) single bus and distributed memory, (3) multiple bus and distributed
common memory, (4) omega interconnection and distributed common memory, and (5)
omega interconnection and distributed memory.
2.1 Application modeling 
For the modeling of an application, we assume an initial partitioning of the
computation into a number of communicating computational modules. If the communication
paths among the modules are not known a priori (i.e., real time applications), then
the partition is modeled by a stochastic AND-EOR directed graph SG(A). This graph
is specified by the following parameters:
    $m_i$ = processing time of module $m_i$
    $d_i$ = processing time of data block $d_i$
    $p_{ij}$ = probability of control transfer from $m_i$ to $m_j$
    $a_{ij}$ = communication processing time between $m_i$ and $m_j$
In [HOUS84] a stochastic analysis is applied to reduce SG(A) into a deterministic graph
G(A) which comprises the input to the mapping model.
In case the communication paths among modules are known a priori, a
dataflow language [FUJI85] is employed to specify the computation modules and their
communication and synchronization requirements. We use the language SIMON
[FUJI85] from the simulator for non-shared memory multiprocessor systems. Figure 1
presents the computation of each module in the example of a parallel Cholesky
decomposition algorithm [O'LEA85]. Figure 2 depicts the corresponding graph G(A) and the
values of the various workload parameters (node processing time and blocking time,
communication traffic among nodes) obtained by setting SIMON's switching delay to




















the module must wait for its inputs before its computation can start.
    K := 0;
    WHILE (TRUE) BEGIN
        K := K + 1;
        IF (K = i and K = j) THEN
        BEGIN
            ...
        END
        ELSE IF (K = j) THEN
        BEGIN
            ...
        END
        ELSE IF (K = i) THEN
        BEGIN
            ...
        ELSE BEGIN
            ... (OUT_SOUTH, Y);
            ...
        END
    END WHILE
Figure 1. Computation of node (i,j) in the SIMON dataflow language for a parallel
Cholesky decomposition algorithm.
Module processing time:
    m = (24, 30, 30, 25, 27, 31, 37, 32, 27, 34, 38, 30, 21, 28, 35, 45)
Module blocking time:
    (0, 4, 14, 24, 15, 27, 37, 48, 33, 45, 61, 66, 51, 64, 76, 88)
Figure 2. Precedence graph G(A) of the parallel Cholesky decomposition algorithm for 
a 4x4 symmetric matrix. The assumed numbering of modules is specified in 
each node. The communication traffic is indicated as a weight on the links 








2.2 Communication Modeling in Multiprocessor Architectures
Our methodology for reducing the initial interconnection graph G(A) is based on
minimizing its communication cost. Thus, it is crucial to be able to predict the
communication cost of G(A) running on a specific multiprocessor system architecture. We
consider architectures for which there is a realistic simple model of the communication
cost at the message passing level. Such performance models of multiprocessor systems
are given in [MAR83], [KRU83]. The performance measure is a delay function which
provides the expected queueing delay a message suffers in traversing the interconnection
network of the system. Queueing delay is computed as a function of the utilization of the
system's communication interconnection network.
Let
    $c_{i,j}$ = total traffic (number of messages) between modules $m_i$ and $m_j$,
    $k, l$ = processors assigned to computational modules $m_i$, $m_j$,
and
    $d_{k,l}$ = interprocessor distance in the interconnection network
    of the considered architecture,
then the interprocessor communication cost or queueing delay is given by a delay function $D(u)$,
where $u = u(c_{i,j}, d_{k,l}, \text{interconnection network characterization})$ denotes the utilization.
When two modules are assigned to different processors, their utilization of the
interconnection network is computed from $c_{i,j}$, $d_{k,l}$, $C$ and $T$,
where C = capacity of the interconnection network and T = time frame during which each
parallel processor is running.
The delay obtained is only approximate because of the various simplifying
assumptions involved in the analytical modeling of the performance characteristics (queueing,
utilization) of the multiprocessor system. In all of the models we consider, two
assumptions are made: (a) every processor accesses the interconnection network at the same
rate, and (b) every processor accesses the various common memory modules with the
same probability. These assumptions lead to a queueing model which is mathematically
tractable. Intuitively, these assumptions imply that the communication
requirements of G(A) are nearly uniform. If G(A) is one of many applications running in the
same system, these assumptions may be realistic even if the algorithm has somewhat
non-uniform communication patterns. A precise investigation of this question is under
way.
We consider the performance analysis of five different multiprocessor parallel
architectures: (1) Single bus, common memory, (2) Single bus, distributed common memory


























modules, (3) Multiple bus, distributed common memory modules, (4) Omega
interconnection, distributed memory modules, (5) Omega interconnection processor to processor
multiprocessor system. The queueing delay functions of these systems depend on a
number of parameters. Some of these parameters are (a) the number of processors in
the system, (b) the number of memory modules, (c) the number of busses or switch size
of the interconnection, (d) the distribution of message size in the system, (e) the message
generation rate, (f) the memory access time, and (g) the number of network stages.
Next, we present the queueing delay functions of these multiprocessor parallel
architectures and the way they have been applied to the mapping algorithm.
(1) Single bus common memory architecture
In this architecture (Figure 3(i)), k processors ($P_i$) are connected to an external
common memory (CM) via a global bus (GB). Each $P_i$ has a local memory ($LM_i$)
connected to its own local bus ($LB_i$). The system has been analyzed in [MAR83] as a
"machine repairman" problem. For a given bus utilization level, say $u_1$, the average
queueing delay per information transfer unit is given by $D_1(u_1) = (k - u_1/\rho_1)/u_1$. The
parameter $\rho_1$ characterizes the communication of the architecture and it is found by
solving the nonlinear algebraic equation
$$u_1 = \frac{P_k(\rho_1) - 1}{P_k(\rho_1)}, \qquad \text{where } P_k(\rho_1) = \sum_{j=0}^{k} \rho_1^{j} \, \frac{k!}{(k-j)!}.$$
The workload characterization of the application is given by $\rho_p = \rho_1/2$ and represents
the ratio $\lambda_p/\mu$, where $\lambda_p$ is each $P_i$'s access rate to the bus, which in turn depends on the
communication pattern of G(A), and $1/\mu$ is the average memory transfer requirement,
which depends on the average message length.
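The following Python sketch shows one way the relations for architecture (1) might be evaluated numerically. It rests on a reading of the extraction-damaged formulas as $u_1 = (P_k(\rho_1) - 1)/P_k(\rho_1)$ with $P_k(\rho) = \sum_{j=0}^{k} \rho^j k!/(k-j)!$ and $D_1 = (k - u_1/\rho_1)/u_1$; that reading, the function names and the use of bisection are assumptions of this sketch, not something the paper prescribes.

```python
from math import factorial


def P(k: int, rho: float) -> float:
    """P_k(rho) = sum_{j=0}^{k} rho^j * k! / (k - j)!"""
    return sum(rho**j * factorial(k) / factorial(k - j) for j in range(k + 1))


def utilization(k: int, rho: float) -> float:
    """Bus utilization u = (P_k(rho) - 1) / P_k(rho)."""
    p = P(k, rho)
    return (p - 1.0) / p


def solve_rho(k: int, u: float, steps: int = 200) -> float:
    """Solve utilization(k, rho) = u for rho by bisection, for 0 < u < 1."""
    lo, hi = 0.0, 1.0
    while utilization(k, hi) < u:  # utilization increases with rho
        hi *= 2.0
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if utilization(k, mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def delay(k: int, u: float, rho: float) -> float:
    """Average queueing delay per information transfer unit."""
    return (k - u / rho) / u


k, u = 4, 0.8  # illustrative values: 4 processors, bus utilization 0.8
rho = solve_rho(k, u)
print(f"rho = {rho:.4f}, delay = {delay(k, u, rho):.4f}")
```

Bisection is used only because the utilization is monotone in $\rho$, which makes the root easy to bracket; any scalar root finder would serve equally well.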
(2) Single bus distributed memory architecture
This architecture (Figure 3(ii)) is obtained from the previous one by distributing
the common memory to each processor. The local memory of each $P_i$ is logically
divided into private memory $PM_i$ and common memory $CM_i$ and accessed by a double
port.
This system has been analyzed in [MAR83]. The average queueing delay per
information transfer unit as a function of the bus utilization is given by
$$D_2(u_2) = \frac{k - u_2/\rho_2}{u_2},$$
where the workload parameter $\rho_p$ is related to $\rho_2$ as indicated in Figure 3(ii).












(3) Multiple bus distributed common memory modules
A multiple bus architecture with distributed common memory modules is shown in Figure 3(iii). The $P_i$'s
have their own $LM_i$'s and communicate via common memory modules. Multiple global
busses are used by the $P_i$'s to access the $CM_i$'s. Such a system is referred to as a $k \times m \times b$
system, where k is the number of $P_i$'s, m is the number of $CM_i$'s and b is the number of
global busses used. We assume that $k \ge m > b$.
This system has been analyzed in [MAR82] using several approximate models
based on a Markov chain approach. The model we are adopting here provides
performance characteristics which are an upper bound to the characteristics of the actual
system.
The average queueing delay per information transfer unit is given by
$$D_3(u_3) = \frac{k - u_3/\rho_3}{u_3},$$
where $\rho_3$ is obtained from the workload parameter $\rho_p$ and from the probabilities $p_j(\cdot)$
of the Markov chain model, which are defined by a recurrence with given initial conditions.
(4) Omega interconnection distributed memory modules
In Figure 3(iv), an example of an Omega interconnected architecture is shown,
where the Omega network is drawn for k = 8 processors in the system and 2x2
switches.
This system has been analyzed in [KRU83] as the interconnection network of the
Ultracomputer. The same analysis applies to a general class of multistage
interconnection networks called Banyan networks [KRU83]. Some of these are the





1 o










Figure 3. The delay function and workload parameter ($\rho_p$) as a function of utilization
(u) and parameter ($\rho$) for four shared memory architectures with k processors,
including (iii) multiple bus distributed common memory and (iv) Omega
interconnection distributed common memory. In case (iii), m = number of common
memories, b = number of common busses and $k \ge m > b$. In case (iv), $n \times n$ is the
size of the switch, $t_t$ = transit time of a switch, $t_c$ = cycle time of a switch,
m = number of packets in a message, and p = average number of messages entered by
each $P_i$ per cycle time.
considered buffered $n \times n$ switches where the capacity of the buffers is infinite. Then the
average delay per message in traversing the interconnection network is simply its delay
per stage times the number of stages ($\log_n k$) in the interconnection,
where
    k = number of processors in the system
    $t_t$ = transit delay of a switch
    $t_c$ = cycle time of a switch
    n = size of switch ($n \times n$)
    m = number of packets per message
    $p_4$ = average number of messages entered by each $P_i$ per cycle time
An exact calculation of $D_4(p_4)$ is possible since $p_4 = u/k$ can be directly substituted
into the delay function.
(5) Omega Interconnection PE to PE Architecture
In every architecture considered so far $d_{k,l} = 1$. Figure 4 shows an architecture
for which $d_{k,l} \neq 1$. The organization of the processors and memories differs from that
in Figure 3(iv) in that each memory is associated with a processor, forming a single
element called the processing element (PE). Thus a PE to PE configuration is shown in
Figure 4. Note that this time the distance between the processors is $d_{k,l} = 1, 2$. For the
average queueing delay per message through the omega interconnection, a formula is
given which is the same as in architecture (4), but it is stated in more general terms by
[NORT85] and is as follows:
where
    u = utilization of the interconnection by a processor per cycle time
    m = number of packets per message
    st = number of network stages
    $n \times n$ = size of switch
The above formula is applicable for the general class of Banyan interconnection networks.




























Figure 4. The delay function for nonshared memory interconnection networks: u =
utilization of the interconnection by a processor per cycle time, m =
number of packets per message, st = number of network stages, $n \times n$ =
size of switch.
3. CLUSTERING TO OPTIMIZE COMMUNICATION COSTS
We now consider how to obtain the schedule graph G'(A) from the original graph
G(A). We use a heuristic algorithm based on the minimization of communication cost.
This cost is estimated from the data about communication in G(A) and the average
communication delay models such as those given in Section 2. The details of this algorithm
are given in [HOUS83] and we briefly summarize it here for completeness. There are
numerous constraints implicit in the mapping problem which complicate a complete
mathematical formulation considerably. However, experience shows that this algorithm
runs fast and the run time is normally linear in the size of G(A) and the number of
processors. The reduction can be viewed as a clustering process that tries to minimize
communication time for data and computation variables by clustering modules together.








This clustering must satisfy the following constraints:
(i) resource constraints:
- Every computational module must fit into the memory assigned.
- Data blocks must fit into the memory assigned.
- Computations must have enough processor time.
(ii) parallelism constraint:
- Parallel computational modules cannot be clustered; they are assigned to different
processors.
(iii) artificial constraint:
- Processing time on each processor for application A is limited to T (a parameter).
It is worth noting that the time frame parameter T is used implicitly to calculate the
utilization (u) of the interconnection network of the machine M due to intermodule
communication traffic of the application A. Reducing T forces more and more
processors to be used.
3.1 Clustering algorithm 
The solution of the reduction problem for a specific time frame T is achieved by
the following heuristic clustering strategy:
Start
    Assign one processor per computation module
    Assign data blocks to 'closest' memory
Iteration
    1. Select a pair of computation modules for possible merger into one processor
    which gives maximum reduction in objective function (communication cost).
    2. If no constraints are violated, then merge the two modules; otherwise
    remove the pair from the list of eligible pairs.
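The strategy above can be sketched as a greedy loop in Python. This is a simplified illustration rather than the authors' implementation: the communication cost of a merger is taken to be the raw traffic between the two clusters (the paper uses queueing delay models), memory constraints are omitted, and the names `cluster`, `times`, `traffic` and `parallel` are hypothetical.

```python
def cluster(times, traffic, parallel, T):
    """Greedy clustering sketch.

    times:    {module: processing time}
    traffic:  {(i, j): number of messages exchanged between modules i and j}
    parallel: set of (i, j) pairs that must stay on different processors
    T:        time frame limiting the processing time per processor
    """
    clusters = {m: {m} for m in times}  # start: one processor per module
    owner = {m: m for m in times}       # module -> cluster id
    rejected = set()                    # cluster pairs found infeasible

    def gain(a, b):
        # Traffic that becomes local if clusters a and b are merged.
        return sum(c for (i, j), c in traffic.items()
                   if {owner[i], owner[j]} == {a, b})

    def feasible(a, b):
        merged = clusters[a] | clusters[b]
        if sum(times[m] for m in merged) > T:          # time frame constraint
            return False
        return not any(i in merged and j in merged     # parallelism constraint
                       for i, j in parallel)

    while True:
        ids = sorted(clusters)
        candidates = [(gain(a, b), a, b)
                      for x, a in enumerate(ids) for b in ids[x + 1:]
                      if frozenset((a, b)) not in rejected]
        candidates = [c for c in candidates if c[0] > 0]
        if not candidates:
            break
        _, a, b = max(candidates)          # pair with maximum cost reduction
        if feasible(a, b):
            clusters[a] |= clusters.pop(b)
            for m in clusters[a]:
                owner[m] = a
        else:
            rejected.add(frozenset((a, b)))  # remove pair from eligible list
    return sorted(sorted(c) for c in clusters.values())


# Small made-up example: four modules in a chain, modules 2 and 3 parallel.
result = cluster(
    times={1: 10, 2: 10, 3: 10, 4: 10},
    traffic={(1, 2): 5, (2, 3): 1, (3, 4): 4},
    parallel={(2, 3)},
    T=25,
)
print(result)
```

This version rescans all cluster pairs on every iteration, so it does not attain the approximately linear running time reported in the text; it is meant only to make the select, check and merge cycle concrete.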
Experience suggests that this heuristic strategy "solves" the reduction problem in
approximately linear time. The input to the clustering algorithm is as follows:
- the application graph G(A). Recall that this graph explicitly contains
requirements for processing, memory and communication plus the precedence of the
computation modules. From this information one may easily derive the syn-























Architecture characteristics
- communication models (delay function)
- interconnection network bandwidth
- G(M) if different than G(A)
Resource constraint
- time frame parameter T
The output of the clustering algorithm for the Cholesky decomposition graph of
Figure 2 is shown in Figures 5-7. Figure 5 is for a 4 by 4 matrix and T = 464, while
Figures 6 and 7 are for a 5 by 5 matrix and use time frame parameters T = 541 and
473, respectively.
Processor    Processor      Modules/data blocks
number       utilization    allocated
1            78.80%         1, 2, 3, 4, 11, 12
2            38.20%         5, 9, 13
3            47.07%         6, 7, 8
4            89.94%         10, 14, 15, 16
Figure 5. Mapping of computation modules for the Cholesky decomposition algorithm to
architecture 2 with 17 processors for time frame T = 464. The utilization
of each processor is given.
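As a rough check of the time frame constraint for the mapping of Figure 5, the sketch below sums the module processing and blocking times of Figure 2 for each processor and compares the total with T = 464. The paper does not state how the utilization column is computed, so the quantity evaluated here (processing plus blocking time divided by T) is an assumption, and it is not expected to reproduce the reported percentages, which presumably also reflect communication delays.

```python
# Module processing and blocking times as listed with Figure 2 (modules 1..16).
m = [24, 30, 30, 25, 27, 31, 37, 32, 27, 34, 38, 30, 21, 28, 35, 45]
b = [0, 4, 14, 24, 15, 27, 37, 48, 33, 45, 61, 66, 51, 64, 76, 88]

# Allocation of modules to processors as listed in Figure 5.
allocation = {
    1: [1, 2, 3, 4, 11, 12],
    2: [5, 9, 13],
    3: [6, 7, 8],
    4: [10, 14, 15, 16],
}
T = 464  # time frame parameter


def load(modules):
    """Processing plus blocking time of the modules placed on one processor."""
    return sum(m[i - 1] + b[i - 1] for i in modules)


for proc, modules in allocation.items():
    total = load(modules)
    # The artificial constraint requires total <= T on every processor.
    print(f"processor {proc}: load {total}, fraction of T {total / T:.2%}")
```

Under this assumption every processor stays within the time frame, with processor 4 the most heavily loaded, which agrees qualitatively with the ordering in the figure.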







Figure 6. The solution of the clustering algorithm for T = 541 and the Cholesky
decomposition algorithm for a 5x5 matrix. The module processing plus blocking time
requirements, in the order of nodes specified, are 24, 34, 44, 54, 60, 42, 59,
74, 85, 90, 00, 70, 99, 195, 120, 78, 98, 117, 139, 145, 91, 180, 129, 151, 173.
Figure 7. The solution of the clustering algorithm for T = 473 for the application con- 








3.2 On the Solution of the Graph Reduction Problem
When the time frame T changes, a different clustering in G(A) might be obtained,
as indicated in Figures 6 and 7. Throughout we denote by
$T_{PAR}$: the shortest time frame for which all allocated processors can run the
application A in parallel,
$T_{MIN}$: the shortest time frame for which the application A can run under the imposed
resource utilization constraints.
Our observation has been that $T_{MIN} \neq T_{PAR}$ when the intermodular communication
cost is very low. For example, in the case of Figure 2 there are ten different clusterings,
which are summarized in Figure 8. Normally, $T_{MIN} = T_{PAR}$ and the clustering
algorithm produces a unique solution. If we define an optimum clustering for G(A) as the
one with minimum elapsed time and minimum communication cost, then it is easy to
verify the following observation.
Lemma: If $T_{MIN} \neq T_{PAR}$ then the optimum clustering is the one that corresponds to
time frame $T = T_{PAR}$ or the one with the minimum number of clusters.
For the different clusterings G'(A) of Figure 8, the elapsed time versus the time frame
T is plotted in Figure 9. These results were obtained using the SIMON simulator.
The current implementation of the graph reduction phase is capable, by using an
iterative procedure, of identifying all possible solutions and the breakpoints as in Figure 8.
Also, it is possible to estimate the elapsed time provided we are able to predict the
blocking time of each module in G(A) due to the precedence of computations. The
workload analysis based on SIMON provides this information. It turns out that $T_{PAR}$
is a close upper bound of the elapsed time when the blocking time of each module is
incorporated as part of the processing time of the modules. We believe that this
approach is reasonable since the blocking time is solely an attribute of the application
algorithm. The importance of the elapsed time is twofold. (a) For the same
application, the performance of different multiprocessor system architectures can be compared
by comparing their elapsed times. The advantage of parallel processing can also be
investigated by comparing the elapsed time of an application running on a parallel
architecture system to that on a uniprocessor system. (b) Given different initial partitions G(A)
of the same application A, their performance can be compared by looking at the
elapsed time of the different partitions running on the same multiprocessor system.
4. REDUCTION OF PARALLELISM AND LAYOUT 
In general, the degree of parallelism of a computation can be much larger than the
number of processors available in G(M). We may either reduce the parallelism of G(A)
initially or reduce the parallelism of G'(A) after the clustering. It is not a priori clear
which is the best strategy, but similar techniques can be used in either case. For this
purpose a number of heuristics have been devised. The analysis and performance of
these algorithms are reported elsewhere [HOUS86]. We illustrate the nature of these





















Figure 8. The number of clusters obtained for different values of time frame T for
the Cholesky decomposition application (4x4 matrix).
Figure 9. Elapsed time and average transmission delay of the different G'(A)'s ob-








algorithms by giving the simplest and the most promising.
Heuristic H1: Merge small pairs of parallel modules
- Start with all parallel modules in G'(A) (output of the clustering algorithm).
- Apply the clustering algorithm, after relaxing the parallelism constraint and ordering
the pairs according to the sum of their processing times.
- Stop when the parallelism is reduced to that of G(M).
The motivation for this heuristic is that it makes use of an existing algorithm, it is fast,
and we hope it results in a small increase of the elapsed time.
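A minimal Python sketch of the merging order used by this first heuristic is given below. It keeps only the ordering rule (always merge the pair with the smallest sum of processing times) and ignores the time frame and communication cost that the full clustering algorithm would also check; the function name and the example data are hypothetical.

```python
import heapq


def reduce_parallelism(cluster_times, processors):
    """Merge the two smallest parallel clusters until at most `processors` remain.

    cluster_times: {cluster name: processing time}
    processors:    number of processors available in G(M), assumed >= 1
    """
    heap = [(t, [name]) for name, t in cluster_times.items()]
    heapq.heapify(heap)
    while len(heap) > processors:
        t1, a = heapq.heappop(heap)   # the two clusters with the smallest times
        t2, b = heapq.heappop(heap)   # form the pair with the smallest sum
        heapq.heappush(heap, (t1 + t2, sorted(a + b)))
    return sorted((names, t) for t, names in heap)


# Four parallel clusters must be placed on two processors.
merged = reduce_parallelism({"A": 5, "B": 3, "C": 8, "D": 2}, processors=2)
print(merged)
```

Using a heap keeps each merger at logarithmic cost, in line with the stated aim that the heuristic be fast.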
Heuristic H2: Merge small pairs in sets with maximum parallelism.
- Determine all sets of modules with maximum parallelism.
- Merge the pairs with the smallest sum of processing times.
- Repeat the first two steps until the maximum parallelism is acceptable.
In this case there is significant initial algorithmic cost (but still linear in the size of the
graph). The cost per repetition is small and the parallelism is reduced rapidly. One
can equally well merge the pairs which most decrease the communication cost.
The layout problem is often taken to mean the whole process of mapping G(A) to
G(M) or substantial portions of it. In our approach we have obtained G'(A), which
(a) has the same or less parallelism than G(M), and (b) has nearly minimal
communication cost. There can still be substantial difficulties in mapping G'(A) into G(M) if the
machine has a low level of interconnection (e.g., as in a ring or grid of processors).
However, the class of architectures considered in this paper tends to have a high level of
interconnection, so the final layout is not normally difficult. See [BERM84] for a more
general discussion of the layout problem. We present simple layout algorithms for the
five multiprocessor architectures described in Section 2.
1. Single bus, common memory
Assign modules to processors in any manner consistent with G'(A).
2. Single bus, distributed memory
Multiple bus, distributed memory
Omega interconnection, distributed common memory
Assign modules to processors so that data sets associated with particular modules
are assigned to memories associated with the corresponding processors.
3. Omega interconnection, distributed memory
Assign one module M1 and its corresponding data set to a processor P1 and its
corresponding memory. Assign the module (and corresponding data set) that is
nearest to M1 to the processor (and corresponding memory) that is the nearest
neighbor of P1. Repeat until all modules are assigned.
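The nearest-neighbor rule for the Omega interconnection with distributed memory can be sketched as follows. This is one possible reading, under assumptions not stated in the text: nearness between modules and between processors is supplied as nested dictionaries (smaller values mean nearer), nearness is always measured from the first module M1 and the first processor P1, and each processor receives at most one module. The function name nearest_neighbor_layout and its parameters are hypothetical.

```python
def nearest_neighbor_layout(module_dist, proc_dist, m1, p1):
    """Greedy layout: the k-th nearest module to m1 goes to the k-th nearest processor to p1.

    module_dist[a][b] and proc_dist[p][q] are assumed nearness measures
    (for example communication cost and hop count); smaller means nearer.
    """
    if len(module_dist) > len(proc_dist):
        raise ValueError("more modules than processors")
    layout = {m1: p1}
    free_modules = set(module_dist) - {m1}
    free_procs = set(proc_dist) - {p1}
    while free_modules:
        # Ties are broken by name so the result is deterministic.
        m = min(free_modules, key=lambda x: (module_dist[m1][x], x))
        p = min(free_procs, key=lambda q: (proc_dist[p1][q], q))
        layout[m] = p
        free_modules.remove(m)
        free_procs.remove(p)
    return layout
```

A data set follows its module, so the returned mapping also fixes the memory assignment. A variant that measures nearness from the most recently placed module rather than from M1 would be an equally plausible reading of "repeat".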
4. References
[BERM84] Berman, F., Snyder, L., "On mapping parallel algorithms in parallel architectures,"
Proceedings of the International Conference on Parallel Processing, 1984, pp. 307-309.
[BERM85] Berman, F., Goodrich, M., Koelbel, C., Robinson, III, W. J., Showell, K.,
"Prep-p: A mapping preprocessor for chip computers," Proceedings of the
International Conference on Parallel Processing, 1985, pp. 731-733.
[BOKH81] Bokhari, Shahid H., "On the Mapping Problem," IEEE Transactions on
Computers, Vol. C-30, No. 3, 1981, pp. 207-214.
[FUJI85] Fujimoto, R. M., "The SIMON simulation and development system,"
Summer Computer Simulation Conference, 1985 (Univ. of Utah).
[GANN86] Gannon, D. and J. Van Rosendale, "On the communication complexity of
parallel numerical algorithms," IEEE Trans. on Computers, to appear.
[HOUS86] "Partitioning PDE algorithms: Methods and Performance Evaluations,"
Parallel Computing, 1986, to appear.
[HOUS84] Houstis, Catherine E., "Allocation of Real Time Application in Distributed
Systems," submitted for publication to the IEEE Transactions on Computers, 1984.
[HOUS83] Houstis, C. E., Houstis, E. N., Rice, J., "Partitioning and Allocation of PDE
Computation to Distributed Systems," PDE Software: Modules, Interfaces
and Systems, Edited by B. Engquist and T. Smedsaas, North Holland, 1983,
pp. 67-85.
[HWAN84] Hwang, Kai, Briggs, Faye, Computer Architecture and Parallel Processing,
McGraw-Hill, 1984.
[JENN77] Jenny, C. J., "Process Partitioning on Distributed System," Digest of
papers, National Telecommunications Conference, 1977, pp. 31:1-31:10.
[KLEI85] Kleinrock, Leonard, "Distributed Systems," Communications of the ACM, Vol.
28, Number 11, 1985, pp. 1200-1213.
[KRUS83] Kruskal, C., Snir, M., "The performance of multistage Interconnection Nets
for multiprocessing," IEEE Transactions on Computers, Vol. C-32, No. 12,
1983, pp. 1091-1098.
[MARS83] Marsan, M. A., Gerla, M., "Markov models for Multiple Bus Multiprocessor
Systems," IEEE Transactions on Computers, Vol. C-32, No. 3, 1983, pp.
238-248.
[MARS83] Marsan, M. A., Balbo, G., Conte, G., "Comparative Performance Analysis of
Single Bus Multiprocessor Architectures," IEEE Transactions on Computers,
Vol. C-31, No. 3, 1983, pp. 1178-1191.
[NORT85] Norton, Alan and G. F. Pfister, "A methodology for Predicting Multiprocessor
Performance," Proceedings of the International Conference on Parallel
Processing, 1985, pp. 772-778.
[O'LEA85] O'Leary, D. P. and G. W. Stewart, "Data-flow algorithms for parallel
matrix computations," Communications of the ACM, Vol. 28, 1985, pp. 840-853.
[WILL83] Williams, E. A., "Assigning Processes to Processors in Distributed Systems,"
Proceedings of the International Conference on Parallel Processing, 1983, pp.
404-406.