The algorithm mapper: A system for modeling and evaluating parallel applications/architecture pairs by Houstis, C. E. et al.
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
1989 
The algorithm mapper: A system for modeling and evaluating 
parallel applications/architecture pairs 
C. E. Houstis 
Elias N. Houstis 
Purdue University, enh@cs.purdue.edu 
John R. Rice 
Purdue University, jrr@cs.purdue.edu 




Houstis, C. E.; Houstis, Elias N.; Rice, John R.; Samartzis, S. M.; and Alexandrakis, D.L., "The algorithm 
mapper: A system for modeling and evaluating parallel applications/architecture pairs" (1989). Department 
of Computer Science Technical Reports. Paper 728. 
https://docs.lib.purdue.edu/cstech/728 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 
A USER GUIDE TO THE ALGORITHM MAPPER
THE ALGORITHM MAPPER: A SYSTEM FOR MODELING AND EVALUATING
PARALLEL APPLICATIONS/ARCHITECTURE PAIRS
















E.N. Houstis**, J.R. Rice**







We present a methodology for evaluating the performance of application programs
on distributed computing systems. An application A is represented by an annotated
graph G(A) giving its requirements for processing, memory and communication,
plus the precedence between computation modules. Machines are represented by a
similar graph G(M), and the methodology is to map G(A) into G(M). The mapping
problem is subdivided into three steps: (s1) a reduction of the parallelism of G(A) to
that of G(M), (s2) scheduling the computational modules to (nearly) minimize
communication costs, and (s3) actual layout of the resulting graph into G(M). The two
technical problems addressed here are (1) using communication delay models to
simplify G(M) and step (s3), and (2) scheduling the modules (step (s2)). Our
communication models apply well to "uniform" architectures; we explicitly consider the
following four: single bus and common memory, single bus and distributed memory,
multiple bus and distributed common memory, and Banyan interconnection and
distributed common memory.
* This research was supported by NSF grant DMC-8508684A1 and ESPRIT Project 1588. Most of this was
completed while she was with the Electrical Engineering Department of Purdue University.
** This research was supported by ARO grant DAAG29-83-K-0026 and AFOSR grant 84-0385.
1. INTRODUCTION
Parallel processor systems are efficiently utilized when the computations they are assigned
can be performed in parallel and are mapped in such a way as to maximize their speed up.
Such systems can be interconnected in a variety of ways, which can be roughly classified as
shared memory and non-shared memory. In shared memory systems, processors usually have
their own local memory and communicate by contending for common resources, such as an
interconnection network and shared memory. Shared memory can be either distributed among
the processors in the form of shared memory modules, or it can be common. The
interconnection network along with the shared memory can be regarded as the system's
communication network. Interconnection networks are commonly from the class of multiple
bus systems, ranging from the simplest configuration of a single bus up to the highest
bandwidth configuration of a crossbar switch. The class of Banyan networks is also common.
In the case of the multiple bus interconnection, the distance between the processors is
clearly one, since the interconnection provides a direct connection from every processor to
every other processor (through the common memory). In the Banyan case, the distance between
processors can also be regarded as one, since on average the routes between processors use
the same number of intermediate switches.
In non-shared memory architectures, a variety of interconnections exist. For example,
processors can be placed at the nodes of a grid or at the nodes of a cube, etc. In these
cases, the distance between processors is not necessarily one; it depends on the routes
chosen between processors and on the geometry of the architecture.
A number of methodologies exist in the literature which address the problem of mapping
computations to parallel systems. They can be divided into two categories: (a) the methods
that assume implicitly or explicitly that the distance between processors is one and (b) the
methods that assume that this distance is different from one. Our methodology explicitly
assumes the distance between processors is one and thus we review mainly such methodologies.
We first state the mapping problem. We consider an application A to be a computation
with four properties: processing requirements, memory requirements, communication
requirements and precedence (or synchronization) between the subcomputations. We visualize
the computation broken into computational modules which are nodes of a precedence graph for
the computations. We note the processing and memory requirements at each node of the graph.
We note the communication requirements along each link or edge of the graph. This annotated
graph is called G(A).
We consider a machine to have three components: processing elements, memory elements
and communication paths (an interconnection network). Similarly, the machine can be
represented by an annotated graph G(M).
In general, the mapping problem has three somewhat independent steps:
1. Schedule the computational modules so that the application runs efficiently.
2. Reduce the parallelism of the application to that of the machine.
3. Given that Steps 1 and 2 are done, embed the application into the machine.
We apply our methodology to several applications, mainly numerical or real-time.
2. MODELING APPLICATION/ARCHITECTURE PAIRS
2.1 Review of Existing Modeling Methodologies
Most of the existing algorithms have addressed Step 1: [CHU 80], [CHU 87], [EFE 82],
[HAES 80], [GYLY 76]. In general, the approaches used can be classified as graph theoretic,
integer programming and heuristic methods.
In [CHU 80], a graph theoretic and an integer programming approach are used to solve the
scheduling or allocation problem, which is defined as the assignment of M modules to N
identical processors so that interprocessor communication is minimized. The communication
network connecting the processors is not described explicitly and the solution implicitly
assumes a shared memory architecture.
In [CHU 87], a heuristic approach to task allocation for distributed systems is presented.
The issues discussed are very similar to what we have examined. The objective function in
[CHU 87] is the minimization of the maximum processor loading. This has produced a number
of differences in their analysis and solution of the problem. They are also not concerned
with parallel module execution.
A heuristic approach to module allocation by minimizing interprocessor communication
subject to a load balancing constraint is suggested in [EFE 82]. This approach clusters
modules into processors and contains a mechanism to solve Step 2 of the mapping problem as
follows: N processors are assumed, and the modules are clustered heuristically into possibly
N + K clusters. A second heuristic, a module reassignment algorithm, is to divide the K
clusters among the N processors when possible, by balancing their load.
In [HAES 80], a graph theoretic approach is used. N modules in the application graph are
initially associated with N available processors and are regarded as resident modules. The
algorithm then divides the rest of the graph into N clusters, which are centered at the
resident modules, by minimizing communication between processors. In [GYLY 76], two module
clustering algorithms are used to search for "eligible" pairs of modules, eligible in the
sense that when they are assigned to the same processor, the greatest possible interprocessor
communication is eliminated. This "fusion" continues until all eligible pairs are fused. The
implicit assumption here is that the number of processors is equal to the number of clusters
obtained. A shared memory architecture is also implicitly assumed, since the distance between
processors is not considered.
For shared memory architectures, the approaches based on graph theoretic methods [CHU 80],
[JENN 82] and integer programming methods [CHU 80] result in fairly complex algorithms,
which prohibits their use for applications with large graphs. The heuristic approaches
[EFE 82], [GYLY 76] are more promising and simple to use.
In all of the above models, the interprocessor communication is measured in terms of the
amount of data transferred among modules assigned to different processors. These models fail
to incorporate the performance characteristics of the parallel system. In our model, data
transfers are assigned a communication cost which includes the queueing delay incurred by
communicating processors. This requires the performance analysis of the parallel system's
architecture. The resulting performance measure is the queueing delay versus utilization of
the communication network. Two different classes of interconnection networks have been
examined, the Banyan and the multiple bus. We have experimentally shown that queueing delay
considerably affects the allocation decision; thus our approach incorporates the system's
architecture into the mapping heuristics. The use of performance models simplifies the
system's involvement in the problem. Moreover, it provides the means of examining the
performance of application/architecture pairs. The third step in the mapping problem for
shared memory architectures is simplified, since the distance between processors is one: it
is an arbitrary assignment of the allocation of modules obtained in Step 2 to the system's
processors, provided that memory requirements are satisfied.
The approach we propose for shared memory architectures is computationally simple,
applies to large and general graphs, and can easily be extended to non-shared memory
architectures. Moreover, we have examined a few small problems (Cholesky decomposition) where
we know the optimum schedule and we see that our algorithm produces this optimum schedule.
The extension of this work to non-shared memory architectures is underway.
3. METHODOLOGY FOR SHARED MEMORY ARCHITECTURES
We have given three steps for the solution of the mapping problem. We deal primarily
with Step 1 and Step 2, since Step 3 simplifies when shared memory architectures are used. In
Step 1, a heuristic algorithm is used to schedule the application computation modules and data
blocks into parallel clusters. The algorithm minimizes queueing delay among processors by
assigning module pairs with the most communication to the same processor, provided (a) they
do not have to be executed in parallel, (b) they do not overutilize the processor, and (c) they
fit into its local memory. The output of this step is a number of clusters of modules that has
the same parallelism as the application. Step 2 is the reduction of the clusters obtained in
Step 1 to the number of available processors in the system. This is necessary mainly for two
reasons. First, the application's parallelism in general can be much higher than the number of
available processors. Second, there are architectures in which the number of processors can be
adjusted to equal the number of clusters obtained. Multiple bus interconnection architectures
present such a possibility. In the Banyan interconnection system, the number of processors can
be increased (or decreased) only by powers of 2. Thus if the number of clusters obtained is not
a power of 2, then it is cost effective to use a number of processors equal to the highest
power of 2 that is less than the number of clusters. In this case, Step 2 is unavoidable.
Step 3 is an arbitrary assignment of the clusters obtained in Step 2 to the system's
processors, provided memory constraints are satisfied.
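The Step 1 heuristic described above can be illustrated with a minimal Python sketch. It is
a simplification under stated assumptions, not the ALLOC implementation: it ranks links by raw
traffic instead of queueing delay, makes a single pass over the links instead of restarting the
search after each merge, and treats the time frame as the processor capacity. The function name
`greedy_cluster` and all parameter names are hypothetical.

```python
def greedy_cluster(exec_time, mem, comm, parallel, time_frame, mem_cap):
    """Greedy clustering sketch (hypothetical simplification of Step 1).

    exec_time[i], mem[i]: processing time and memory need of module i
    comm: dict {(i, j): traffic between modules i and j}
    parallel: set of frozenset({i, j}) pairs that must run in parallel
    time_frame: processing capacity of one processor (assumed constraint b)
    mem_cap: local memory size of one processor (assumed constraint c)
    """
    n = len(exec_time)
    clusters = [{i} for i in range(n)]   # start with one cluster per module
    owner = list(range(n))               # owner[m] = index of m's cluster
    # Visit links from heaviest to lightest traffic.
    for (i, j), _traffic in sorted(comm.items(), key=lambda kv: -kv[1]):
        ci, cj = owner[i], owner[j]
        if ci == cj:
            continue                     # already on the same processor
        # (a) modules that must run in parallel cannot share a processor
        if any(frozenset((a, b)) in parallel
               for a in clusters[ci] for b in clusters[cj]):
            continue
        merged = clusters[ci] | clusters[cj]
        # (b) do not overutilize the processor
        if sum(exec_time[m] for m in merged) > time_frame:
            continue
        # (c) the merged cluster must fit into local memory
        if sum(mem[m] for m in merged) > mem_cap:
            continue
        clusters[ci], clusters[cj] = merged, set()
        for m in merged:
            owner[m] = ci
    return [c for c in clusters if c]


example = greedy_cluster(
    exec_time=[10, 10, 10, 10],
    mem=[1, 1, 1, 1],
    comm={(0, 1): 5, (1, 2): 9, (2, 3): 1, (0, 2): 3},
    parallel={frozenset((0, 1))},
    time_frame=25,
    mem_cap=10,
)
```

In this toy input, modules 1 and 2 are fused first because they share the heaviest link;
module 0 stays apart because it must run in parallel with module 1, and module 3 stays apart
because a third module would exceed the assumed capacity.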
The mapping problem requires the modeling of the application and of the parallel system.
The computations of an application are assumed to be partitioned and they are modeled by a
precedence directed logic graph. This graph can be stochastic when the data flow in the
application is not known a priori, as is often the case in real time applications, or it can
be deterministic, as is often the case in numerical applications. Next we describe both a
stochastic and a deterministic graph.
3.1 A Data Flow Graph Representation
We denote by G = (M, L) a directed, parallel, cyclic and weighted AND-EOR logic graph.
Throughout, we assume that the partition of each application is represented by such a graph.
Let $M = \{m_i,\ i = 1, 2, \ldots, N_p\}$ be a set of weights corresponding to each program
(module) of the graph. Each $m_i$ represents the execution time requirement of the i-th
program. For each pair of programs (i, j), we denote by $p_{ij}$ the probability of passing
control from i to j and by $\gamma_{ij}$ the expected amount of data transferred. Data
transfers are measured in ITUs (Information Transfer Units), which are the smallest units of
information whose queueing delay due to communication can be determined. Each link in the
graph is associated with the weights
$L = \{l_{ij} = (\gamma_{ij}, p_{ij});\ i, j = 1, \ldots, N_p,\ i \neq j\}$. Note that L also
represents the precedence graph for the application. If program i passes information to
program j, then i precedes j in the execution. Information may flow in both directions between
a pair of program modules; we assume the application is such that no infinite loops exist.
Note that G is a stochastic graph.
The synchronization requirements of the partitioned application are defined in terms of an
AND or EOR logic in the I/O of each node. It is worth noticing that an AND logic on the
output links of a node indicates the potential parallel execution of the modules associated
with these links. Figure 1 presents an example of a stochastic model that includes internal
data blocks referenced by each module. These internal data blocks are associated with the code
and are not part of the data from the application. We want to determine the total processing
time requirements of the application. For this we apply a Markov chain analysis [LOW 73] to
the stochastic graph G and transform it into a deterministic graph G'. The transformation G'
is defined by (M', L') where
$$M' = \{\, m'_i = f_i m_i,\ i = 1, \ldots, N_p \,\},$$
$$L' = \{\, l'_{ij} = f_i p_{ij} \gamma_{ij} + f_j p_{ji} \gamma_{ji} \ \text{if } i > j; \quad l'_{ij} = 0 \ \text{if } i \le j, \ \text{for } i, j = 1, 2, \ldots, N_p \,\}.$$
For each program i, $f_i$ is the number of times it is executed, $m'_i$ indicates its total
processing time requirement, while $l'_{ij}$ is the total amount of ITUs transferred between
programs i and j.
The parallelism of the application (represented by AND logic) is implemented in matrix form
by the precedence matrix $\Delta = \{\delta_{ij};\ \delta_{ij} = 1$ if programs i and j can be
executed in parallel; $\delta_{ij} = 0$ otherwise$\}$. The degree of parallelism of the
application is the maximum number of program modules that can be executed in parallel. Our
methodology uses the parallelism as assigned by the user. We call this the assigned degree of
parallelism; it can be less than the actual degree of parallelism. The assigned degree of
parallelism may be interpreted as the amount of parallelism that the user wishes to maintain
in the computation. The information in the $\Delta$ matrix is in the G graph and can be easily
obtained from it. Notice that the matrices L', M' and $\Delta$ constitute part of the input to
the allocation algorithms to be described.
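As an illustration of the transformation from G to G', the following Python sketch computes M'
and L' from the stochastic weights. It assumes the reading $m'_i = f_i m_i$ and
$l'_{ij} = f_i p_{ij} \gamma_{ij} + f_j p_{ji} \gamma_{ji}$ for i > j, and it takes the
execution counts $f_i$ as given (in the text they come from the Markov chain analysis, which is
not reproduced here). The function name `to_deterministic` is hypothetical.

```python
def to_deterministic(f, m, p, gamma):
    """Build (M', L') from execution counts f, module times m,
    control-transfer probabilities p[i][j] and data amounts gamma[i][j].

    Assumes f was already obtained from a Markov chain analysis.
    """
    n = len(m)
    # Total processing time of module i: executed f[i] times, m[i] each time.
    m_prime = [f[i] * m[i] for i in range(n)]
    # Total ITUs exchanged between i and j, stored in the lower triangle only.
    l_prime = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # j < i, i.e. the case i > j
            l_prime[i][j] = (f[i] * p[i][j] * gamma[i][j]
                             + f[j] * p[j][i] * gamma[j][i])
    return m_prime, l_prime


# Two modules: module 0 runs once, module 1 runs twice.
m_prime, l_prime = to_deterministic(
    f=[1, 2],
    m=[3, 4],
    p=[[0.0, 1.0], [0.5, 0.0]],
    gamma=[[0.0, 10.0], [4.0, 0.0]],
)
```

Here module 1 has total processing time 2 x 4 = 8, and the single link carries
2 x 0.5 x 4 + 1 x 1 x 10 = 14 ITUs in total.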
3.2 Performance Analysis of the Parallel System Architecture
Performance models of parallel multiprocessor systems are used to derive the queueing
delay processors incur in communicating among themselves. This delay is due to two factors:
(a) accessing the common interconnection network and (b) accessing the common memory. The
architectures considered are:
Single bus and common shared memory system.
Single bus and distributed shared memory system.
Multiple bus and distributed memory system.
Banyan interconnection and distributed shared memory system.

Figure 1. An example of a stochastic data flow graph model.
A comparative performance analysis of these architectures has been performed, and in all
cases the interconnection network queueing delay versus utilization, D(u), has been computed.
The performance analysis leads to a number of performance measures. The main variable is the
average number of active processors, AP, in the system, i.e., processors doing computation in
their local memory. From the AP, D(u) can be computed as discussed in Section 4.2.3. The
speed up, S, of the application running on the machine must be bounded as follows:
$$1 \le S \le AP.$$
3.3 Performance Measures of the Algorithm Mapper
The algorithm mapper produces schedules of modules to processors and a number of
performance measures, which relate to the system use and the efficient execution of the
application. The main measures computed are directly related to the analytical systems model.
The first is the average processor utilization $u_p$. With $k$ the number of processors in the
system and $u_i$ the utilization of processor i (taken over the set of all module indices in
G(A) assigned to that processor),
$$u_p = \frac{1}{k} \sum_{i=1}^{k} u_i.$$
Moreover, the average processor utilization $u_p$ relates to the active processors in the
system AP as follows:
$$u_p = \frac{AP}{k}.$$
Our second performance measure is the speed up S, defined in terms of $T_{SEQ}$ = execution
time of the application by a single processor (sequential execution) and $T_{REAL}$ =
execution time of the application by the parallel system. We define
$$S = T_{SEQ} / T_{REAL}.$$
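A small worked example of these measures, as a Python sketch. The utilizations and times below
are made-up illustrative numbers, not results from the report, and the helper name
`performance_measures` is hypothetical. Utilizations are expressed as fractions of the time
frame.

```python
def performance_measures(utilizations, t_seq, t_real):
    """Return (AP, u_p, S) from per-processor utilizations and run times."""
    k = len(utilizations)      # number of processors in the system
    ap = sum(utilizations)     # average number of active processors
    u_p = ap / k               # average processor utilization, u_p = AP / k
    speedup = t_seq / t_real   # S = T_SEQ / T_REAL
    return ap, u_p, speedup


# Hypothetical four-processor run.
ap, u_p, speedup = performance_measures([0.5, 0.75, 0.25, 0.5], t_seq=90.0, t_real=60.0)
```

With these numbers AP = 2.0, $u_p$ = 0.5 and S = 1.5, which respects the bound
$1 \le S \le AP$ stated in Section 3.2.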
4. THE ALGORITHM MAPPER RESOURCE ALLOCATION SYSTEM
The algorithm mapper software system is composed of three main parts: (a) a preprocessor,
which takes the partitioned application and produces the information about its graphical
representation and the input data to the mapping heuristics, (b) the heuristic algorithms for
the mapping problem, and (c) a user friendly interface. The interface is interactive and
displays the input and output of the mapping heuristics on a SUN workstation employing color
graphics. The algorithm mapper can be a part of a parallel machine operating system, namely
the scheduler, and can be applied at load time. It presupposes a partition of the application,
which we have obtained by applying mathematical techniques to redesign (when necessary) the
application's computation with the objective of parallelizing it. Compiler techniques are also
applicable when a known computation is investigated for parallelism.
4.1 The Allocation Algorithm
In Step 1, we formulate and solve the module allocation problem. It is stated formally as
a constrained minimization problem as follows:
Input: (a) An application which consists of communicating program modules and data blocks.
(b) Specifications of a given distributed system: processor speeds, memory module sizes
and D(u) characterizing the interconnection network queueing delay vs. utilization.
Problem: Allocate the application modules and data, in order to minimize the queueing delays
due to interprocessor communications, into (1) clusters of modules allocated to individual
processors and (2) clusters of data allocated to memory modules. This is subject to the
(a) Distributed System Constraints:
(i) size of memory modules,
(ii) processor utilization capacity,
(b) Application Constraints:
(i) a fixed time allowed for executing the application,
(ii) program modules that can be executed in parallel will be executed in parallel.
Note that the objective of minimum processing time will be achieved as a result of parallel
processing and minimizing the queueing delays due to interprocessor communications.
The details of the mathematical formulation of the allocation algorithm can be found in
[HOUS 88a,b].
4.1.1 The output of the allocation algorithm
[Table 1: program listing not legible in the source scan. Recoverable headings: ALLOCATION
SOLUTION AND WORKLOAD STATISTICS (allowed real time for processing is 70.0 time units);
PROCESSOR UTILIZATION (id, total (%) utilization, proc (%) utilization, modules assigned);
APPLICATION'S COMMUNICATION & PROCESSING COST (in time units); APPLICATION'S COMMUNICATION
REQUIREMENTS; memory module utilization and blocks assigned.]
Table 1. Output of the allocation algorithm; the variable id is the index of the program
module or data block.
A sample output of the algorithm is given in Table 1. The output includes the total (%)
utilization of each processor, which is the processing time plus communication time of the
modules assigned to a processor.
When the time frame T changes, a different clustering of G(A) is obtained. Ideally, we
would like to find the shortest time frame T for which the application can be run. Let
$T_{PAR}$ = the shortest time frame for which the machine can run the application A in
parallel.
From the output of the allocation algorithm, we can compute the average number of active
processors, AP (see Section 3.3), as follows:
$$AP = \sum_{i=1}^{k} u_i.$$
We can obtain $T_{PAR}$ and $T_{REAL}$ from the algorithm and then the following bounds hold
4.2 The Algorithm Mapper System
The algorithm mapper is a software system which maps any application to a parallel shared
memory architecture system. It is made up of a preprocessor, the heuristic allocation
algorithm called ALLOC, and a user friendly graphical interface.
The allocation algorithm was initially implemented in Pascal [STEV 82] for a single time
frame and a hypothetical system queueing delay function. All input data were assumed known.
No preprocessor or user interface was available. The code was inefficient and did not
completely solve the mapping problem.
The present allocation algorithm ALLOC is implemented in the language C with a variable
time frame. A library of performance functions has been added for four multiprocessor
architectures, namely (a) single bus with shared common memory, (b) single bus with
distributed shared memory modules (two port memory), (c) multiple bus with shared memory
modules, and (d) processors and shared memory with a Banyan interconnection. In the case of
a multiple bus with shared memory module architecture, the number of busses can be set equal
to one, thus obtaining a single bus with distributed shared memory architecture, or set equal
to the number k of processors, thus obtaining a crossbar interconnection of processors with
distributed shared memory.
The time frame T is varied. Initially a large value of T is used, which decreases until
$T_{PAR}$ is obtained, i.e., the shortest time frame for which the number of clusters obtained
equals the application parallelism in Step 1 or, after the parallelism reduction of Step 2,
equals the number of available processors in the system.
After an allocation has been obtained and the clusters formed, the execution of the
application is simulated in order to obtain $T_{REAL}$, i.e., the actual real time required to
execute the application on the architecture. This simulation routine is also in the C program.
4.2.1 The algorithm mapper preprocessor
The preprocessor PALLOC is an interface between the user and the allocation program
ALLOC which automatically creates its input data file. The preprocessor is necessary because
the amount of data needed by ALLOC is very large, especially for large graphs.
PALLOC is written in the C programming language and currently runs on a DEC VAX-11/780
computer under the Berkeley operating system (4.3 BSD). All application programs for
PALLOC are written in C. There is also a version of the preprocessor for applications written
in FORTRAN.
PALLOC uses either an abstraction of the actual application or an instrumented version of
it or a combination. In any case, the application must be partitioned into subroutines
corresponding to the program modules to be used in the parallel implementation. Further, the
data communicated between these modules must be explicitly specified as to destination and
size (in bytes). The execution time of the code in a program module can either be specified
explicitly or the actual code is compiled and timed during a sequential execution of the
application. In summary, PALLOC determines the execution times and total communication of
modules from this special version of the application. The details about the features of the
input to PALLOC can be found in [HOUS 88b].
4.2.2 The algorithm mapper graphics user interface
AllocTool is a graphical interface to the allocation algorithm. It helps the user to specify
the computation graph, to enter the required data for the algorithm, and to display the results
in a graphical form. In general, for a specific application, the user has to do the following
steps to use the allocation algorithm:
(a) Run AllocTool (which is trivial).
(b) Draw the application data flow graph.
(c) Specify the various data that are required for the algorithm.
(d) Run the allocation algorithm and display the results.
(e) (Optional) Store the application description in a file for later use.
Steps (b) and (c) can be replaced by loading an application data file previously prepared.
Figure 2. Sample output of the graphics user interface of AllocTool. Colors are used in the
large CANVAS FRAME (center) to indicate the processor assignment (or numbers
for black and white workstations). Processor utilization is shown at the upper right,
the control PANEL at the upper left.
AllocTool uses the SunView library routines and should work on any Sun workstation. In
the following paragraphs, we use the terminology of the SunView library when referring to
windows and specific items within these windows. These terms from SunView are in ITALICS.
The tool is composed of two basic FRAMES (windows). The first one is the control
PANEL, which controls the functions of the tool, and the second is a frame containing a CANVAS
window for images and a small PANEL on the top for diagnostic messages. In [HOUS 88b],
we describe the operation of these windows.
4.2.3 Shared memory architecture models
Four different shared memory architectures (see Section 3.2) are analyzed and their
queueing delay functions D(u) presented. In all cases the overall system organization is the
same and it is introduced first.
The systems are composed of k processors and k memory modules (although we are
assuming that the number of processors is the same as the number of memory modules, the
same analysis can be applied when the number of processors is less than the number of memory
modules). Each processor has its own private memory module, where the program and the data
are stored. If processor i wants to communicate with processor j, it prepares a message and
sends it to memory module j where it can be accessed by processor j.
The performance analysis of these architectures is documented in detail in [HOUS 87],
[HOUS 88b]. The main performance measure obtained is the average number of active
processors, AP. We have also obtained the average queueing delay per message, D(u), and the
average utilization of the communication network. We summarize these results below.
System 1: Single bus and shared common memory architecture.
Set $\rho = \lambda/\mu$, where $\lambda$ is the message generation rate of a processor and
$1/\mu$ is the average length, both from an exponential distribution; then
$$AP_1 = \frac{\sum_{j=0}^{k} \rho^j \frac{k!}{(k-j)!} - 1}{\rho \sum_{j=0}^{k} \rho^j \frac{k!}{(k-j)!}}$$
and
$$u = \rho \times AP_1 / 2, \qquad D_1(u) = (k - AP_1)/u.$$
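The System 1 expression can be evaluated numerically with a short Python sketch. It assumes
that the formula for $AP_1$ reads as the ratio
$(\sum_{j=0}^{k} \rho^j k!/(k-j)! - 1) / (\rho \sum_{j=0}^{k} \rho^j k!/(k-j)!)$, which is the
form one obtains for a finite-source single-server queue; the function names are hypothetical,
and the utilization u is passed in by the caller instead of being derived here.

```python
from math import factorial


def active_processors_single_bus(k, rho):
    """AP_1 for k processors sharing one bus, with rho = lambda / mu."""
    # s = sum_{j=0..k} rho^j * k! / (k - j)!
    s = sum(rho ** j * (factorial(k) // factorial(k - j)) for j in range(k + 1))
    return (s - 1.0) / (rho * s)


def queueing_delay(k, ap, u):
    """D(u) = (k - AP) / u, with the utilization u supplied by the caller."""
    return (k - ap) / u


ap_light = active_processors_single_bus(k=4, rho=0.01)  # light traffic
ap_heavy = active_processors_single_bus(k=4, rho=1.0)   # heavy traffic
```

For k = 1 the expression collapses to $1/(1 + \rho)$, which gives a quick sanity check; as
$\rho$ grows, more processors wait for the bus and $AP_1$ falls.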












System 2: Single bus and distributed shared memory architecture.
$$D_2(u) = \frac{(k - AP_2)/(1 - \rho)}{u}.$$
System 3: Multiple bus and distributed shared memory modules architecture. It is of order
k x m x b, where k is the number of processors, m the number of memories and b
the number of busses.
Set $\rho = \lambda/\mu$ and define the array $P_j(l)$ by
$$P_j(l) = P_j(l - j) + P_{j-1}(l - j) + \cdots + P_1(l - j) + P_0(l - j)$$
with initial conditions
$$P_j(l) = 0, \quad l < j,$$
$$P_0(l) = 0, \quad l > 0,$$
$$P_0(0) = 1.$$
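The recurrence for the array $P_j(l)$ can be sketched in Python as follows. The sketch assumes
the initial conditions $P_j(l) = 0$ for $l < j$, $P_0(l) = 0$ for $l > 0$ and $P_0(0) = 1$;
the last of these is a reading of a damaged line and should be treated as an assumption. Under
these conditions the recurrence coincides with the count of partitions of the integer l into
exactly j parts. The function name `partition_array` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def partition_array(j, l):
    """P_j(l) = P_j(l-j) + P_{j-1}(l-j) + ... + P_0(l-j)."""
    if j < 0 or l < j:
        return 0                      # P_j(l) = 0 for l < j
    if j == 0:
        return 1 if l == 0 else 0     # P_0(0) = 1 (assumed), P_0(l) = 0 for l > 0
    return sum(partition_array(i, l - j) for i in range(j + 1))


# P_2(4): the partitions 3+1 and 2+2.
p_2_4 = partition_array(2, 4)
# P_3(6): the partitions 4+1+1, 3+2+1 and 2+2+2.
p_3_6 = partition_array(3, 6)
```

Memoization keeps the evaluation cheap, since the same entries $P_i(l - j)$ are reused by many
terms of the sums that follow.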
Then set
b-l I-b
:E jPI(/) + b :Efp,U + b)Pm-b(/ - 2b - j + m)J
j 1 j:{J~l= b-l I-b 1;;:0
:E pP) + :Efp,U + b)Pm-b(/ - 2b - j + m)]
j_1 j-'J
and
[ ~ , . ]]-1k-l ,-J1 +:E pH 1:;- II ~i"ljcfJ J. /_1
, k1k.-i ,
P=" '-i.....:.Il~7Ip,fori=ltok-1.
k ~Pi! j-o }
We then have

$$D_3(u) = (k - u/\rho)/u.$$
System 4: Banyan switch and distributed shared memory architecture.
Set $\rho = \lambda/\mu$ and
k 1 i·p.=p . k II _I_
I 0Ck-i)!Pj_1C(j)
{
i(2k O.5i 1.5) . I
2(}: _ I) I >
where cU) ~flo~(i),fl(i)=$(j,-I(i))and $(i)= I i = 1.
Then we have
$$AP_4 = \sum_{i=1}^{k} i P_i.$$
With $u(x) = \max(x, 0)$, let

$$D_4(u) = (k + L\rho)/(k - L).$$
4.3 A Parallel Implementation of the Algorithm Mapper
The allocation algorithm runs initially for a large value of the time frame T, and
subsequently the value of T is decreased until the number of clusters equals the parallelism of
the graph in Step 1, or equals the number of processors in Step 2 (see the discussion at the
end of Section 1). The smallest value of T for which Step 1 or Step 2 is completed is
$T_{PAR}$. Note that in each iteration of the algorithm, the allocation heuristic is executed,
producing a number of clusters and a schedule of modules for the current value of T. When T
changes, the output of the allocation algorithm changes as well. Moreover, the output of one
iteration is not used in the next iteration. This attribute makes the algorithm a suitable
candidate for parallel execution by means of the multisection extension of the bisection
algorithm for locating zeros of functions. See [MIRA 69], [RICE 71] for further details. We
have implemented a parallel version using the Sequent, a parallel machine which houses (in the
configuration we used at the Computer Science Department at Purdue) 20 processors.
Any number $k \le 20$ of processors can be used. Our particular version of the parallel
algorithm mapper is as follows.
1. Estimate T as best one can.
2. Take as initial interval [a,b] = [T/2, T], hoping that the guess is good enough for this
interval to contain $T_{PAR}$. Set $T_0 = 0$.
3. Divide [a,b] into k - 1 equal subintervals by the values
$T_i = a + (i - 1)(b - a)/(k - 1)$, i = 1, 2, ..., k.
4. Assign the i-th processor the value $T_i$ and run the allocation program with it.
5. Let $[T_j, T_{j+1}]$ be the interval such that (a) $T_{j+1}$ gives the smallest number of
clusters, and (b) the number of clusters for $T_j$ is larger than that of $T_{j+1}$. If
$j \ge 1$, take [a,b] to be $[T_j, T_{j+1}]$. If no such j exists because $T_1$ already gives
the smallest number of clusters, take [a,b] to be $[\max(T_0, 2T_1 - T_k), T_1]$. If
$b - a \le \epsilon$ = convergence tolerance, then go to 6, otherwise go to 3.
6. Set T = a and produce output. If the number of clusters is less than the parallelism
of the application, then find that interval $[T_l, T_{l+1}]$ where l is the smallest index
where the number of clusters exceeds those of T. Set $T_0 = T_l$, $a = T_l$, $b = T_k$ and go
to 3.
A little analysis shows that this algorithm has a speed up of $\log_2(k + 1)$, which is
worthwhile for small values of k.
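Steps 1 to 5 of the multisection search can be sketched sequentially in Python. This is an
illustration under stated assumptions and not the Sequent implementation: the k evaluations
that would run on separate processors are done in a loop, the allocation program is replaced
by a caller-supplied function `count_clusters(T)` that is assumed to be non-increasing in T,
Step 6 is omitted, and the sketch returns the final bracketing interval [a, b] instead of
choosing T = a. All names are hypothetical.

```python
def multisection(count_clusters, t_guess, k, eps=1e-3, max_iter=100):
    """Bracket the smallest T giving the fewest clusters, using k >= 2 points."""
    a, b = t_guess / 2.0, t_guess          # Step 2: initial interval [T/2, T]
    t0 = 0.0                               # Step 2: lower limit T_0
    for _ in range(max_iter):
        # Step 3: k equally spaced values covering [a, b].
        ts = [a + i * (b - a) / (k - 1) for i in range(k)]
        # Step 4: one independent allocation run per value (parallel in the text).
        counts = [count_clusters(t) for t in ts]
        fewest = min(counts)
        first = counts.index(fewest)       # first value reaching the minimum
        if first == 0:
            # The smallest tested T already gives the minimum: extend to the left.
            a, b = max(t0, 2.0 * ts[0] - ts[-1]), ts[0]
        else:
            # Step 5: the cluster count drops between ts[first-1] and ts[first].
            a, b = ts[first - 1], ts[first]
        if b - a <= eps:
            break
    return a, b


# Stand-in for the allocation program: 4 clusters once T reaches 37, else 8.
lo, hi = multisection(lambda t: 4 if t >= 37.0 else 8, t_guess=100.0, k=5)
```

Each pass shrinks the bracket by a factor of k - 1 using k independent evaluations, which is
what makes the evaluations easy to spread over k processors.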
4.4 Time Complexity and Optimality of the Algorithm Mapper
The time complexity of the allocation algorithm is approximately proportional to the
number of links in the application graph. This is a consequence of the search for the links
that carry high communication and the formation of lists of such links until a merge can be
performed. After each merge, the same search starts over again. In Table 2, timing data from
three representative applications are shown. These data strongly suggest that the time
complexity is no worse than linear in the number of links in G(A).
Application        Number of links    Time     Ratio
Real time                 27           1.18    22.9
PDE collocation           77           4.1     18.8
Robot arm                230          21.1     10.9
Table 2. Experimental data on the complexity of the allocation algorithm. Time for execution
is given in seconds on a VAX 11/780 along with the ratio of links to time.
A symmetric and computationally simple partition of a Cholesky decomposition application
is used to test the optimality of the algorithm mapper. A (4 x 4) partition of the application
graph is discussed in detail in Section 5.1. The degree of parallelism of the graph is four and
four clusters were obtained for System 2 with four processors. By using simulation, T_PAR has
been shown to be the minimum possible elapsed time for this application (see Figure 4).
Moreover, the processor utilizations were well balanced (see Table 3).
5. EXAMPLE STUDIES USING THE ALGORITHM MAPPER
Several applications have been tested to confirm the analysis and heuristics.
These applications are (1) a Cholesky decomposition algorithm for the solution of linear
equations, (2) a partial differential equation (PDE) problem, using the collocation method, and
(3) the solution of the Newton-Euler equations for the mechanical movement of a robotic elbow
manipulator. In all three cases, a partitioning of the application is given. The partitions are
from previous work: PDE collocation application, [HOUS 87]; Cholesky decomposition,
[OLEA 85]; robot elbow application, [KASH 85].
5.1 Cholesky Decomposition Application
We consider the parallelization of the Cholesky algorithm for the factorization of symmetric
matrices [OLEA 85]. We initially used a data flow language SIMON [FUJI 85] to specify the
computational modules and their communication and synchronization requirements. We then
executed these programs using the SIMON non-shared memory multiprocessor simulator.
Figure 3 shows the graph G (A) and the values of the various workload parameters (node
processing time and blocking time, communication traffic among nodes) obtained by setting
SIMON's switching delay to zero (i.e., assuming there are no communication delays). The
blocking time or algorithm synchronization delay of a module is the time that the module must
wait for its inputs before its computation can start. The processing time is the time to execute
the code in a module. These times are given in the vectors b and m of Figure 3. The application
was also run by our preprocessor and similar input data for the allocation were obtained.
The matrix is partitioned into blocks so that the maximum assigned degree of parallelism
is equal to the number of processors available, so we have as many processors as the assigned
degree of parallelism of the application. A single bus multiprocessor and distributed shared
memory architecture is used. Thus only Step 1 of the mapping problem needs to be taken. We
have run this application for various assigned degrees of parallelism (number of processors),
but we report here only on the case of assigned degree of parallelism four. Table 3 shows the
output of the allocation algorithm. Note that no internal datablocks are necessary in this
application, since no reference to data is necessary. By default the number of internal
datablocks is set equal to one per module, of size zero.
Figure 3. Precedence graph G (A) of the parallel Cholesky decomposition algorithm for a 4
by 4 block decomposition of a symmetric matrix. The numbering of modules is
specified in each node and the processing times and blocking times are given in
this order. The communication traffic is indicated as a weight on the links of the
graph. (Module processing times m = (24, 30, 30, 24, 27, 31, 32, 27, 34, 38, 39,
21, 28, 34, 45); Module blocking times b = (0, 4, 14, 24, 15, 27, 37, 48, 33, 45,
61, 66, 51, 64, 76, 88).)
From Table 3 we see that T_PAR = 250, T_seq = 622 (139+152+170+161), and T_REAL = 336.
The processor utilization u_p^i is (total cluster processing time)/(time frame T), for example
u_p^1 = 139/250 = 0.556. The number AP of active processors is calculated from Σ u_p^i to be
2.488. The speed up is S = T_seq/T_REAL = 1.851. We see that the following relationship holds.
1 ≤ S = T_seq/T_REAL = 1.851 ≤ T_seq/T_PAR = 2.488 ≤ AP = 2.488
When the time frame T is decreased below T_PAR, the result is more and more clusters. We
have used simulation to verify (see Figure 4) that the minimum elapsed time of this application
does indeed occur at T = T_PAR. See [HOUS 87] for more details.
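The arithmetic behind these quantities can be reproduced with a few lines of Python. The sketch below uses only the values quoted in the text (cluster processing times 139, 152, 170, 161, T_PAR = 250, T_REAL = 336); the function name performance_summary is an illustrative choice and not part of the algorithm mapper.

```python
def performance_summary(cluster_times, t_par, t_real):
    """Return (T_seq, per-cluster utilizations, AP, S) from cluster processing times."""
    t_seq = sum(cluster_times)                         # all work done on one processor
    utilizations = [c / t_par for c in cluster_times]  # u_p^i = cluster time / time frame
    ap = sum(utilizations)                             # average number of active processors
    speed_up = t_seq / t_real                          # S = T_seq / T_REAL
    return t_seq, utilizations, ap, speed_up


# Values quoted in the text for the 4 x 4 Cholesky case.
t_seq, u, ap, s = performance_summary([139, 152, 170, 161], t_par=250, t_real=336)
print(t_seq, [round(x, 3) for x in u], round(ap, 3), round(s, 3))

# Relationship stated in the text: 1 <= S <= T_seq/T_PAR <= AP
# (a small tolerance absorbs floating-point rounding in the sum).
assert 1 <= s <= t_seq / 250 <= ap + 1e-9
```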
Table 3. Output of a 4 x 4 case of the Cholesky decomposition application of Figure 3.
Figure 4. Simulation results for the elapsed time and average transmission delay of the
different G'(A)'s obtained for the Cholesky 4 x 4 decomposition application.
5.2 PDE Collocation Application
The PDE problem solved is a general one on a non-rectangular domain. A multifront
method is used based on a nested dissection partition of the domain. Gauss elimination is used
to eliminate unknowns in the interior of each domain. The numbers shown at each node are the
'computation units' for a particular instance of this problem corresponding to using a 20 by 20
mesh, a finite element method with cubic basis functions, and a 40 by 40 plotting grid. The
domain boundary has 3 pieces and, at the fourth step, all but one processor are working on the
interior of the domain. A computational unit here is about 1000 arithmetic operations, plus
associated memory and control operations. The complete graph as displayed by the algorithm
mapper is shown in Figure 5 (the communication has been scaled by a factor of 1000). We
again use a single bus multiprocessor and distributed shared memory architecture with the
number of processors equal to the assigned degree of parallelism of the application. Thus only
Step 1 of the mapping problem needs to be taken. In Figure 6 the output from the algorithm
mapper of the allocation algorithm is shown. All modules having the same number are
allocated to the same processor.
Table 4 shows another form of the output of the algorithm mapper system. No internal




Figure 5. The complete graph of the PDE collocation application as displayed by the
algorithm mapper system. This is at the input stage; the communication along edges
is shown and the computational modules numbered.
Figure 6. The output of the algorithm mapper system for the PDE collocation application.
The allocation of computational modules to processors is indicated by the




Table 4. Additional output of the algorithm mapper system for the PDE collocation
application.
Similar calculations can be made from Table 4 as with the previous application. Thus
T_PAR = 70, T_seq = 252.587, T_REAL = 71 and
AP = .512 + .6968 + .2375 + .5 + .4513 + .554 + .6567 = 3.608. The speed up is 3.54 and we
have
1 ≤ S = T_seq/T_REAL = 3.54 ≤ T_seq/T_PAR = 3.6 ≤ AP = 3.608
See [HOUS 87] for more details.
5.3 Robotic Elbow Manipulator Application
In [KASH 85] a partition of a robot elbow manipulator computation is given at the equation
level, i.e., the computational modules represent the solution of an equation. We use this
partition with slight modifications. Figure 7 shows the precedence graph of this application;
the numbers assigned to the modules identify them. Modules are organized on different levels
and modules on the same level can be executed in parallel. The execution times of modules are
given in [KASH 85] in msec for an Intel 8087 processor. The partition in [KASH 85] requires
little communication among modules and communication is not considered there. We have
modified this by including in the execution time t_e of a module both the processing time t_x
and the communication time t_y, that is, t_e = t_x + t_y. We also assume that synchronization
delay is included in t_x. If a module has several outgoing links, we distribute the communication
time t_y uniformly among them. We have modified this application to obtain three applications
with different behaviors. We take the execution time t_e to be constant and divide it into
communication and computation so that the values of the computation/communication ratio
r = t_x/t_y are 1/10, 1/1 and 10/1. The latter corresponds closely to the original application.
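The split of a fixed execution time into computation and communication follows directly from t_e = t_x + t_y and r = t_x/t_y, which give t_x = t_e r/(1 + r) and t_y = t_e/(1 + r). A minimal sketch is shown below; the function name, the example value t_e = 110 and the handling of modules without outgoing links are illustrative assumptions, not taken from the report.

```python
from fractions import Fraction


def split_execution_time(t_e, r, out_links):
    """Split t_e into (t_x, t_y, t_y per outgoing link) for ratio r = t_x / t_y."""
    t_x = t_e * r / (1 + r)   # processing share
    t_y = t_e - t_x           # communication share, equal to t_e / (1 + r)
    # Assumption: a module with no outgoing links carries no per-link traffic.
    per_link = t_y / out_links if out_links else 0
    return t_x, t_y, per_link


# Hypothetical module with t_e = 110 time units and two outgoing links.
for r in (Fraction(1, 10), Fraction(1, 1), Fraction(10, 1)):
    t_x, t_y, per_link = split_execution_time(110, r, out_links=2)
    print(f"r = {r}: t_x = {t_x}, t_y = {t_y}, per link = {per_link}")
```

Fractions are used only to keep the illustration free of rounding; floating-point ratios work the same way.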
We vary the computation/communication ratio r of this application to study its effect and
to show various properties of the allocation algorithm. Usually a fine partition requires more
communication than a coarse partition, but at the same time the number of modules is increased
or decreased, respectively. By varying r, we change the partition grain without affecting the
number of modules. Let c_p be the number of clusters obtained by the algorithm in Step 1; then
in general c_p is greater than or equal to the parallelism of the application.
For this application, a comparative study has been performed for two architectures, the
multiple bus and Banyan switch, both with distributed shared memory.
Figure 7. The precedence graph of the 105 computational modules of the robot elbow
manipulator. The maximum assigned degree of parallelism is 11.
Schedules for the Multiple Bus and Distributed Shared Memory Architecture
This application, shown in Figure 7, is allocated to a k x m x b multiple bus and distributed
shared memory (see Section 4.2.3) architecture. We choose the number of memory
modules (m) and processors (k) to be equal. A 13 x 13 x 1 and a 13 x 13 x 8 system are used.
The number c_p of parallel clusters obtained is equal to the number of processors in the system,
as indicated in Figure 10.
Sample schedules of the results are illustrated in Figures 8 and 9. We give a schedule of
assigned modules to each processor, the processor utilization u_p^i (i = 1, 2, ..., 13), and the
optimal T_PAR obtained. We also give for each processor the total processor utilization, which
includes both the processing and communication delay times required by all modules assigned
to the same processor.
In Figure 10, we present a summary of the statistics of schedules produced by the algorithm
mapper for the multiple bus system architecture. We have used u_t to be the average total
processor utilization. In [HOUS 88b], details on every schedule as illustrated in Figures 8 and 9
are included.
Comparing the results in Figure 10, we observe for the cases r = 1 and r = 1/10 that T_PAR is
shorter for the 8 bus system than for the 1 bus system. This is expected, since the 8 bus system
has higher bandwidth and thus less delay. In addition, in each case different processor
schedules have been obtained. This is the effect that queueing delay has on the scheduling. In
the case of r = 10/1, queueing delay does not play a significant role in scheduling and the 1 bus
and 8 bus systems have identical schedules. We have measured the utilization for this case and
we find that it is low. Thus, for low utilization values, there is no significant difference in the
queueing delay for 1 and 8 bus systems. We note that processor utilizations are low due to the
parallelism constraint between modules; we cannot assign more modules to increase their
utilization.
In Figure 11, a schedule is shown which is based on minimizing the amount of communication
among modules without assigning a cost to it, i.e., the queueing delay D (u) is zero. A
different schedule is produced as compared to the same r = 1/10 case for the 1 bus system in
Figure 8. The significant queueing delay in the 1 bus system also increases T_PAR substantially.
Thus queueing delay has a direct effect on scheduling.
System: 13 x 13 x 1, r = 1/10, T_PAR = 4054
Processor   Total (%)     Utilization   Modules
id          utilization   (%)           assigned
 1          10.88          3.05         1 27 28 61 66 67 72 92 98 104
 2          18.94          4.26         2 7 24 36 37 42 48 49 54 78 87
 3          14.82          4.62         3 9 58 64 70 81 85 96
 4          15.86          2.06         4 10 16 46
 5          15.86          2.06         5 11 17 47
 6          15.86          2.17         6 12 18 60
 7          16.29          2.15         8 14 15 29
 8          10.16          2.24         13 25 43 57
 9          18.93          5.67         19 20 21 22 23 26 31 32 34 35
10          18.93          5.40         30 39 45 51 55 75 79 83 89 94 100
11           9.80          5.76         38 44 50 73 74 82 88 93 99 102 105
12          19.29          7.80         40 41 52 53 76 77 80 84 90 91 95 100
13          15.31          5.25         56 59 62 65 68 71 86 97 103
Figure 8. Schedule of module assignments of the robot elbow manipulator for the 13 x 13 x 1
multiple bus architecture and low computation/communication ratio, r = 1/10.
System: 13 x 13 x 1, r = 10/1, T_PAR = 3304
Processor   Total (%)     Utilization   Modules
id          utilization   (%)           assigned
 1          12.97         12.38         1 27 30 61 67
 2          42.08         40.16         2 6 12 18 37 49 60 87
 3          89.98         88.03         3 9 39 41 45 51 53 75 79 83 89 91 94 100
 4          26.67         25.31         4 10 16 46
 5          26.67         25.31         5 11 17 47
 6          67.46         65.20         7 8 14 15 24 29 36 42 48 54 78
 7          25.25         24.76         13 43 57 63 69
 8          70.90         69.60         19 20 21 22 23 26 31 32 33 34 35
 9          34.31         33.29         25 28 55 66 72 92 98 104
10          71.10         70.70         38 44 50 73 74 82 88 93 99 102 105
11          74.85         74.00         40 52 76 77 80 84 90 95 101
12          65.36         64.37         56 59 62 65 68 71 86 97 103
13          51.73         51.17         58 64 70 81 85 96
Figure 9. Schedule of module assignments of the robot elbow manipulator for the 13 x 13 x 1
multiple bus system architecture and high computation/communication ratio,
r = 10/1.
Multiple Bus Architecture (k x m x b)
             13 x 13 x 1        13 x 13 x 8
r = 1        T_PAR = 2206       T_PAR = 2125
             u_t = 54.41        u_t = 53.56
             u_p = 40.82        u_p = 42.38
             c_p = 13           c_p = 13
r = 1/10     T_PAR = 4054       T_PAR = 3881
             u_t = 15.45        u_t = 15.10
             u_p = 4.03         u_p = 4.21
             c_p = 13           c_p = 13
r = 10/1     T_PAR = 3304       T_PAR = 3304
             u_t = 50.71        u_t = 50.67
             u_p = 49.56        u_p = 49.56
             c_p = 13           c_p = 13
Figure 10. Summary of performance statistics of 13 processor schedules for the multiple bus
architecture (1 bus or 8 busses system).
System: none, r = 1/10, T_PAR = 3880
Processor   Total (%)     Utilization   Modules
id          utilization   (%)           assigned
 1           8.78          2.34         1 13 25 43 57 63 69
 2          16.38          4.45         2 7 24 36 37 42 48 49 54 78 87
 3          13.12          4.82         3 9 58 64 70 81 85 96
 4          13.37          2.15         4 10 16 46
 5          13.37          2.15         5 11 17 47
 6          13.39          2.27         6 12 18 60
 7          13.73          2.25         8 14 15 29
 8          16.69          5.92         19 20 21 22 23 26 31 32 33 34 35
 9          11.24          6.04         27 28 61 66 67 72 73 92 93 98 99 100 101 102 103 104 105
10          25.97          6.20         30 39 45 51 55 75 79 82 83 89 94
11          15.76          3.11         38 44 50 74 88
12          20.24          8.05         40 41 52 53 76 77 80 84 90 91 95
13          12.55          5.06         56 59 62 65 68 71 86 97
Figure 11. Schedule of module assignments of the robot elbow manipulator. The queueing
delay is set to zero and a low computation/communication ratio, r = 1/10, is used.
Schedules: Banyan Switch and Distributed Shared Memory Architecture
The results of applying the allocation algorithm to this application with the Banyan switch
and distributed shared memory architecture are summarized in Figure 12. A sample output is
shown in Figure 13. Twelve or thirteen parallel clusters are obtained. Our observation has been
that for symmetric or almost symmetric application graphs, the number of clusters
obtained is equal to the assigned degree of parallelism of the application. When the application
graph does not possess any symmetry, like the one here, the number of clusters may be greater
than the graph's assigned degree of parallelism. This is due to the allocation algorithm's
mechanism of module merging, which must satisfy the parallelism and time frame constraints at
the same time. This is a limitation of our heuristic algorithm, since it does not exhaust all
possible schedules.
Banyan switch, 8 processor system
r = 1        T_PAR = 2212
             u_t = 53.85
             u_p = 38.27
             c_p = 12

Figure 12. Summary of performance statistics of cluster schedules for the Banyan system
before parallelism reduction is applied.
System: 8 processor, Banyan switch, r = 10/1, T_PAR = 3306
Cluster     Total (%)     Utilization   Modules
id          utilization   (%)           assigned
 1          12.95         12.37         1 27 30 61 67
 2          42            40.14         2 6 12 18 37 49 60 87
 3          89.88         87.99         3 9 39 41 45 51 53 75 79 83 89 91 94 100
 4          26.62         25.30         4 10 16 46
 5          26.61         25.30         5 11 17 47
 6          67.36         65.17         7 8 14 15 24 29 36 42 48 54 78
 7          25.23         24.75         13 43 57 63 69
 8          70.83         69.57         19 20 21 22 23 26 31 32 33 34 35
 9          34.27         33.27         25 28 55 66 72 92 98 104
10          71.05         70.67         38 44 50 73 74 82 88 93 99 102 105
11          74.79         73.96         40 52 76 77 80 84 90 95 101
12          65.30         64.34         56 59 62 65 68 71 86 97 103
13          51.69         51.14         58 64 70 81 85 96
Figure 13. Schedule of module assignments of the robot elbow manipulator for the Banyan
switch architecture and high computation/communication ratio, r = 10/1.
5.4 Reduction of Parallelism in the Robot Application
Here we investigate the possibility of using the algorithm mapper system's allocation
algorithm to reduce the parallelism of an application. This corresponds to Step 2 of the mapping
problem as described in Section 1. We use the robot application and the Banyan switch
architecture. The number of parallel clusters obtained in Figure 12 is twelve or thirteen, and we
now assume that only an 8 processor system is available; thus we need to reduce the number
of clusters to 8 from 12 or 13. To reduce parallelism we use the same heuristic allocation
algorithm with a simple modification: we eliminate the parallelism constraint. We use as the
input to the heuristic algorithm for Step 2 the clusters obtained in Step 1. Each of these clusters
is regarded as a single module and the communication between clusters forms the communication
between modules. Thus the graph output from Step 1 is the input to Step 2, except that the
parallelism constraints are removed. The communication cost is found as in Step 1. Since
parallelism between the modules of the new graph is not a constraint, it is always feasible to
cluster the modules into a predetermined number of processors (in this case 8) by adjusting
the time frame parameter T appropriately. The results for the three cases of r used in Step 1 are
summarized in Figure 15. A sample output of the algorithm is shown in Figure 14. Note that
module 1 of Figure 14 corresponds to cluster 1 of Figure 13, and so on for the rest of the
modules. We also note that processor utilizations are much higher than in Step 1.
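The construction of the Step 2 input graph can be illustrated as follows. This is a sketch under assumptions rather than the algorithm mapper's own code: the dictionary-based graph representation and the function name collapse_to_clusters are hypothetical, and the traffic between two clusters is assumed to be the sum of the traffic on all module links that cross between them.

```python
from collections import defaultdict


def collapse_to_clusters(links, cluster_of):
    """Build the cluster-level graph used as input to the parallelism reduction.

    links:      {(src_module, dst_module): traffic} for the original graph
    cluster_of: {module: cluster id} as produced by Step 1
    """
    reduced = defaultdict(float)
    for (src, dst), traffic in links.items():
        c_src, c_dst = cluster_of[src], cluster_of[dst]
        if c_src != c_dst:
            # Only traffic that crosses a cluster boundary survives; links inside
            # a cluster become internal to the new single module.
            reduced[(c_src, c_dst)] += traffic
    return dict(reduced)


# Toy example with four modules merged into two clusters in Step 1.
links = {(1, 2): 5.0, (1, 3): 2.0, (2, 4): 1.0, (3, 4): 4.0}
cluster_of = {1: "A", 2: "A", 3: "B", 4: "B"}
print(collapse_to_clusters(links, cluster_of))
```

Each cluster then plays the role of one module in the reduced graph, and the merging heuristic can be rerun on it without the parallelism constraint.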
The time frame T_PAR may increase or decrease when the parallelism is reduced. If
communication dominates the work, then T_PAR should be smaller, because with the parallelism
reduction less communication is required. One sees this to be the case when r = 1/10
(compare Figures 12 and 15). If computation dominates the work, then T_PAR should be larger
because there are fewer processors to do the computation. One sees this to be the case when
r = 10/1 (compare Figures 12 and 15). In an intermediate case where communication and
computation are of similar amounts, the effect on T_PAR is unclear. One sees for this particular
application that T_PAR is increased slightly (compare Figures 12 and 15).
System: 8 processor, Banyan switch, r = 10/1, T_PAR = 3706
Processor   Total (%)     Utilization   Modules
id          utilization   (%)           assigned
 1          74.25         73.10         1 8
 2          83.97         80.94         2 4 5
 3          80.18         78.49         3
 4          89.75         87.81         6 9
 5          88.92         88.06         7 11
 6          63.38         63.04         10
 7          58.25         57.40         12
 8          46.11         45.62         13
Figure 14. Parallelism reduction for the schedules shown in Figure 13 for the robot
application and Banyan switch architecture. The number of clusters is reduced from 13
to 8.
Banyan switch, 8 processor system

Figure 15. Summary of performance statistics of processor schedules for the Banyan switch
after parallelism reduction is applied.
For comparison purposes, we also reduce the 13 cluster schedule obtained for the 13
processor multiple bus architectures to 8 clusters for an 8 processor system. The results are
summarized in Figure 16. Detailed schedules are reported in [HOUS 88b].
5.5 Performance Evaluation of Applications/Architecture Pairs
We now compare the performance of the Banyan switch architecture and multiple bus
architecture for the robot elbow manipulator application. Three versions of the application are
considered with computation/communication ratio values of r = 1/1, 1/10 and 10/1. Both
architectures have distributed shared memory and 8 processors; the multiple bus architectures
also have 1 or 8 busses. The primary comparison is on the basis of T_PAR, the shortest parallel
execution time. We also show (1) the average total processor utilization u_t, which includes
both processor time and queueing waits for communication, (2) the speed up, and (3) the
efficiency [SIEG 82] = (speed up)/k.
Figure 16 shows that the smallest T_PAR value and the highest processor utilizations were
obtained for r = 1/10, as one expects. The 8 x 8 x 8 multiple bus architecture has a higher
bandwidth than the Banyan network, and the schedule for an 8 x 8 x 8 multiple bus architecture
has the best performance. This demonstrates the usefulness of the mapping methodology in
matching applications to architectures. Speed ups and efficiency factors for the 8 processor
systems are presented in Figure 17.




r = 1        T_PAR = 2529      u_t = 73.68

r = 1/10     T_PAR = 1278      T_PAR = 1207

r = 10/1     T_PAR = 3706.3    T_PAR = 3804

Figure 16. Performance comparison of multibus and Banyan architectures for the robot elbow
manipulator application. Three values of the computation/communication ratio r
are used. All architectures have 8 processors.
Ratio r      Speed up                          Efficiency
             Banyan   Multiple bus             Banyan   Multiple bus
r = 1/1      4.630    (1 bus)     4.836        .578     (1 bus)     .604
                      (8 busses)  4.836                 (8 busses)  .604
r = 1/10     1.665    (1 bus)     1.763        .208     (1 bus)     .220
                      (8 busses)  1.877                 (8 busses)  .234
r = 10/1     5.744    (1 bus)     5.596        .718     (1 bus)     .699
                      (8 busses)  5.596                 (8 busses)  .699
Figure 17. Comparison of speed ups and efficiency for the 8 processor Banyan switch and
multiple bus architectures.
From data in [KASH 85] the elapsed (sequential) time in a uniprocessor system, T_seq, can
be calculated. In a uniprocessor system the communication cost is zero, so the sequential time
depends heavily on the ratio r of computation to communication. Then, for each value of r we
have considered, we get T_seq as shown in Figure 18.
r = 1        T_seq = 11709
r = 1/10     T_seq = 2128
r = 10/1     T_seq = 21281
Figure 18. Sequential elapsed time of the application for values of r.
In Figures 19 and 20 speed ups and efficiencies are given for the multiple bus architectures
under the assumption that the number of processors equals the number of parallel clusters, i.e.,
after Step 1 of the algorithm is performed. This data is derived directly from the schedules in
Section 5.3. Note that for r = 1/10 and the one or eight bus system, the speed up is actually
less than 1, which indicates that the parallel system does worse than a single processor system.
Thus, a partition where the communication is ten times the computation is the worst partition
of the three we have examined. This shows how our methodology helps us evaluate the various
partitions of an application.
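As a worked check, the speed up T_seq/T_PAR and the efficiency (speed up)/k can be recomputed from values already given: T_seq from Figure 18 and T_PAR from Figures 10 and 16. The helper name below is an illustrative choice; only the numbers come from the text.

```python
def speed_up_and_efficiency(t_seq, t_par, k):
    """Speed up S = T_seq / T_PAR and efficiency S / k for k processors."""
    speed_up = t_seq / t_par
    return speed_up, speed_up / k


# Banyan switch, r = 1/1, 8 processors (T_seq from Figure 18, T_PAR from Figure 16).
s_banyan, e_banyan = speed_up_and_efficiency(11709, 2529, 8)

# 13 x 13 x 1 multiple bus, r = 1/10 (T_seq from Figure 18, T_PAR from Figure 10).
s_bus, e_bus = speed_up_and_efficiency(2128, 4054, 13)

print(round(s_banyan, 3), round(e_banyan, 3))
print(round(s_bus, 3), round(e_bus, 3))
```

The first pair agrees with the Banyan entries of Figure 17 up to rounding in the last digit, and the second speed up is below 1, as noted for the r = 1/10 case.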
Ratio r      Speed up      Efficiency

Figure 19. Speed up and efficiency data for the schedules of Figure 10 for the 13 x 13 x 1
multiple bus and distributed shared memory architecture.

Figure 20. Speed up and efficiency data for the schedules of Figure 10 for the 13 x 13 x 8
multiple bus and distributed shared memory architecture.
We illustrate the performance analysis further for those schedules (see Figure 16) where
the assigned degree of parallelism has been reduced. The speed up S and average number of
active processors, AP, are shown in Figures 21 and 22. Note that the bounds discussed in
Sections 3.2 and 4.1.1 hold. In the cases where communication is limited, i.e., r = 10/1 and
r = 1/1, its cost is also negligible, thus S = AP. When r = 1/10, the communication cost
in terms of queueing delay is no longer negligible, and it affects the T_PAR results. Thus
S = T_seq/T_PAR is lower than in the previous two cases. By definition, communication delay is
not included in AP, thus AP ≥ S as stated previously and observed in both Figures 21 and 22.
We may also calculate S using T_REAL. For example, for the schedule of Figure 9 we obtain
AP = 6.4428, T_seq = 21281, T_PAR = 3304, T_REAL = 6113 and the following relationships.
1 ≤ S = T_seq/T_REAL = 3.481 ≤ T_seq/T_PAR = 6.440 ≤ AP = 6.4428
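The value of AP for the schedule of Figure 9 can be recovered by summing the Utilization (%) column of that figure, which also makes the chain of bounds easy to check. The snippet below is a small verification sketch; the variable names are illustrative and the numbers are those printed in Figure 9 and in the paragraph above.

```python
# Utilization (%) column of Figure 9, processors 1 to 13.
utilization_pct = [12.38, 40.16, 88.03, 25.31, 25.31, 65.20, 24.76,
                   69.60, 33.29, 70.70, 74.00, 64.37, 51.17]
t_seq, t_par, t_real = 21281, 3304, 6113

ap = sum(utilization_pct) / 100      # average number of active processors
s_real = t_seq / t_real              # speed up measured against T_REAL
s_par = t_seq / t_par                # speed up measured against T_PAR

# Chain of bounds stated in the text: 1 <= S <= T_seq/T_PAR <= AP.
bounds_hold = 1 <= s_real <= s_par <= ap
print(round(ap, 4), round(s_real, 3), round(s_par, 3), bounds_hold)
```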
Ratio r      8 x 8 x 1 system          8 x 8 x 8 system
             AP       T_seq/T_PAR      AP       T_seq/T_PAR
r = 10/1     5.5969   5.5968           5.5969   5.596
r = 1/1      4.8351   4.835            4.8351   4.836
r = 1/10     2.4144   1.763            2.3133   1.877
Figure 21. Speed up and active processors for the schedules of Figure 16 for the multiple bus
and distributed shared memory architecture. The assigned degree of parallelism
has been reduced from 13 to 8 in these schedules.
Ratio r      AP       T_seq/T_PAR
r = 10/1     5.7446   5.744
r = 1/1      4.6310   4.630
r = 1/10     1.9658   1.665
Figure 22. Speed up and active processors for the schedules of Figure 15 for Banyan switch
and distributed shared memory architecture. The assigned degree of parallelism
has been reduced from 13 to 8 in the schedules.
6. SUMMARY
We have formulated the mapping problem and described our algorithm mapper system and
methodology. Four architectures have been considered and three realistic applications have
been evaluated for them. These applications were a Cholesky decomposition algorithm, a PDE
collocation solution and a robot arm manipulation computation. The four architectures
considered are: (1) single bus and shared common memory, (2) single bus and distributed shared
memory, (3) multiple bus and distributed shared memory, and (4) Banyan switch and distributed
shared memory. The allocation methodology has made use of the parallel architecture
performance models to assign a cost to communication between parallel processors. This cost is
the queueing delay in communicating messages between modules assigned to different
processors. The allocation algorithm used is based on a merging heuristic that minimizes
communication between processors in assigning parallel modules to different processors. The
approach has also been used to evaluate various partitions of a single application.
We have evaluated the performance of application/architecture pairs. Different partitions
of the same application may "fit" better on different architectures. Our methodology provides
the means for this evaluation. We have made extensive experimentation with the robot arm
application with 105 modules. We see a dramatic improvement in execution speed up using
even a suboptimal allocation method such as ours. In some small systems and applications, we
are able to demonstrate optimality of results, but we do not expect this in general. The parallel
architectures we have studied have shared memory and our current approach depends on this











Abraham, S.G. and Davidson, E.S., "Task Assignment Using Network Flow
Methods for Minimizing Communication in n-Processor Systems", Technical
Report, CSRD Rpt. No. 598, Center of Supercomputing Research and Development,
National Center of Supercomputing Applications, University of Illinois at
Urbana-Champaign, 1986.
Allen, A.O., Probability, Statistics and Queueing Theory, Academic Press, 1978.
Batcher, K.E., "The Flip Network in STARAN", Int'l Conf. on Parallel Processing,
1976, pp. 65-71.
Batra, D.P., Architectural Implications of Problem Partitioning for Distributed
Processor Systems, Ph.D. Thesis, Computer Science Department, Northwestern
University, Evanston, Illinois, 1978.
Bukles, B.P. and Hardin, D.M., "Partitioning and Allocation of Logical
Resources in a Distributed Computing Environment", Tutorial: Distributed System
Design, IEEE Computer Society, 1979.
Chu, W.W., Holloway, L.J., Lan, M.T. and Efe, K., "Task Allocation in Distri-
buted Data Processing", Computer, 1980, pp. 57-69.
Chu, W.W. and Lan, Lance M-T., "Task Allocation and Precedence Relations for
Distributed Real-Time Systems", IEEE Trans. Computer Engineering, Vol. C,
1987, pp. 667-679.
Efe, K., "Heuristic Models of Task Assignment Scheduling in Distributed Systems",
Computer, Vol. 15, 1982, pp. 50-56.
Fujimoto, R.M., "The SIMON Simulation and Development System", Summer
Computer Simulation Conference, 1985.
[GANN 86] Gannon, D. and Van Rosendale, J., "On the Communication Complexity of
Parallel Numerical Algorithms", IEEE Trans. Computers, (to appear).
[GILB 87] Gilbert, J.R. and Zmijewski, E., "A Parallel Graph Partitioning Algorithm for a
Message-Passing Multiprocessor", Technical Report, TR 87-803, Department of
Computer Science, Cornell University, Ithaca, N.Y., 1987.
[GYLE 76] Gylys, V.B. and Edwards, J.A., "Optimal Partitioning of Workload for
Distributed Systems", Proceedings Compcon, 1976, pp. 353-357.
[GOTT 83] Gottlieb, A., et al., "The NYU Ultracomputer-Designing an MIMD Shared
Memory Parallel Computer", IEEE Trans. Computers, C-33, 1984, pp.
1180-1194.
[HAES 80] Haessig, K. and Jenny, C.J., "Partitioning and Allocation Computational Objects
in Distributed Computing Systems", Proc. of IFIP Congress, 1980, pp.
593-598.
[HOUS 81] Houstis, C.E., Houstis, E.N. and Rice, J.R., "Partitioning and Allocation of PDE
Computation to Distributed Systems", in: B. Engquist and T. Smedsaas, eds.,
PDE Software: Modules Interfaces and Systems, North-Holland, Amsterdam,
1981, pp. 67-85.
[HOUS 82] Houstis, C.E., "Software Partitioning in a Distributed Environment", Technical
Report, College of Engineering, University of South Carolina, 1982.
[HOUS 84] Houstis, E.N., Rice, J.R. and Vavalis, E.A., "Spline Collocation Methods for
Elliptic Partial Differential Equations", in: R. Vichnevetsky and R.S. Stepleman,
eds., Advances in Computer Methods for Partial Differential Equations V,
IMACS, Rutgers University, 1984, pp. 191-194.
[HOUS 87a] Houstis, C.E., Houstis, E.N. and Rice, J.R., "Partitioning PDE Computations:
Methods and Performance Evaluation", J. Parallel Computing, Vol. 5, 1987,
pp. 141-163.
[HOUS 87b] Houstis, C.E., "Allocation of Real-Time Applications to Distributed Systems",
Int'l Conf. Parallel Processing, 1987, pp. 863-866.
[HOUS 87c] Houstis, C.E., "Distributed Processing Performance Evaluation", Third
International Conference on Data Communication Systems and Their Performance,
L.F.M. de Moraes, E. de Souza e Silva and L.F.G. Soares, eds., Rio de Janeiro,
Brazil, 1987, pp. 391-406.
[HOUS 87d] Houstis, C.E. and Aboelaze, M., "The Mapping of Applications to Multiple
Bus and Banyan Interconnected Multiprocessor Systems: A Case Study",
Supercomputing, 1987, pp. 514-543.
[HOUS 88a] Houstis, C.E., "Allocation of Real Time Applications to Distributed Systems",
under review in the IEEE Trans. on Software Engineering.
[HOUS 88b] Houstis, C.E., Houstis, E.N., Rice, J.R., Samartzis, S.M. and Alexandrakis, D.L.,
"The Algorithm Mapper: A System for Modeling and Evaluating Parallel
Applications/Architecture Pairs", Technical Report CSD-TR-793, Computer
Science Department, Purdue University, West Lafayette, IN 47907, August
1988.
[HWAN 84] Hwang, K. and Briggs, F., Computer Architecture and Parallel Processing,
McGraw-Hill, New York, 1984.
[GYLY 76] Gylys, V.B. and Edwards, J.A., "Optimal Partitioning of Workload for
Distributed Systems", Proceedings Compcon, 1976, pp. 353-357.
[JENN 77] Jenny, C.J., "Process Partitioning on Distributed Systems", Digest of Papers,
NTC 1977, pp. 31:1-31-10.
[JENN 82] Jenny, C.J., "On the Placement of Files and Processes in a System With
Distributed Intelligence", Proceedings of International Zurich Seminar on Digital
Communications, 1982, pp. B1.1-8.
[KASH 85] Kashara, H. and Narita, S., "Parallel Processing of Robot-Arm Control
Computation on a Multiprocessor System", IEEE J. Robotics Automation, Vol. RA-1,
1985, pp. 104-113.
[KLEI 85] Kleinrock, L., "Distributed Systems", Comm. ACM, Vol. 28, 1985, pp.
1200-1213.
[KRUS 83] Kruskal, C. and Snir, M., "The Performance of Multistage Interconnection Nets
for Multiprocessing", IEEE Trans. Computers, Vol. C-32, 1983, pp. 1091-198.
[LAWR 75] Lawrie, D., "Access and Alignment of Data in an Array Processor", IEEE
Trans. Computers, Vol. C-24, 1975, pp. 1145-1155.
[LI 81] Li, H., "The Impact of Process Intercommunication on the Global Bus
Architecture", IEEE Proc. Real-Time Systems, 1981, pp. 29-31.
[LOW 73] Lowe, T.C., "Analysis of an Information System Model With Transfer
Penalties", IEEE Trans. Computers, Vol. C-22, 1973, pp. 269-280.
[MA 81] Ma, P., Lee, E.Y.S. and Tsuchiya, M., "On the Design of a Task Allocation




[MARS 83] Marsan, M.A., Balbo, G. and Conte, G., "Comparative Performance Analysis
of Single Bus Multiprocessor Architectures", IEEE Trans. Computers, Vol. C-31,
1983, pp. 1179-1191.
[MARS 83] Marsan, M.A. and Gerla, M., "Markov Models for Multiple Bus Multiprocessor
Systems", IEEE Trans. Computers, Vol. C-32, 1983, pp. 239-248.
[MIRA 69] Miranker, W.L., "Parallel Methods for Approximating the Root of a Function",
IBM J. Res. Develop., Vol. 13, 1969, pp. 297-301.
[NORT 85] Norton, A. and Pfister, G.F., "A Methodology for Predicting Multiprocessor
Performance", Proc. Int'l Conf. Parallel Processing, 1985, pp. 772-778.
[OLEA 85] O'Leary, D.P. and Stewart, G.W., "Data-Flow Algorithms for Parallel Matrix
Computations", Comm. ACM, 28, 1985, pp. 840-853.
[OLEA 87] O'Leary, D.P. and Stewart, G.W., "Assignment and Scheduling in Parallel
Matrix Factorization", Linear Algebra Appl., 77, 1986, pp. 275-300.
[PATE 79] Patel, J.H., "Processors-Memory Interconnections for Multiprocessors", Proc.
6th Ann. Symp. Computer Arch., 1979, pp. 168-177.
[RICE 71] Rice, J.R., "Matrix Representations of Nonlinear Equation Iterations - Application
to Parallel Computation", Math. Comp., Vol. 25, 1971, pp. 639-647.
[SARK 86] Sarkar, V. and Hennessy, J., "Compile-Time Partitioning and Scheduling of
Parallel Programs", Proceedings of the SIGPLAN 1986 Symposium on Compiler
Construction, ACM, 1986, pp. 17-36.
[SIEG 78] Siegel, H.J. and Smith, H.D., "Study of Multistage SIMD Interconnection Networks",
5th Annual Symposium on Computer Architecture, 1978, pp. 223-229.
[SIEG 78] Siegel, H.J., McMillen, R.J. and Mueller, P.T., Jr., "A Survey of Interconnection
Methods for Reconfigurable Parallel Processing Systems", Int'l Conf. on
Parallel Processing, 1978, pp. 9-17.
[SIEG 82] Siegel, L., Siegel, H.J. and Swain, P.H., "Performance Measures for Evaluating
Algorithms for SIMD Machines", IEEE Trans. Software Engineering, Vol.
SE-8, 1982, pp. 319-331.
[SIEG 85] Siegel, H., Interconnection Networks for Large-Scale Parallel Processing,
Heath and Company, 1985.
[STEV 82] Stevens, R.M., A Pascal Program Which Partitions Programs for a Multiprocessor
System, Masters Thesis, Electrical and Computer Engineering, University





Stone, H.S., "Multiprocessor Scheduling With the Aid of Network Flow Algorithms",
IEEE Trans. Software Engineering, Vol. SE-3, 1977, pp. 85-93.
Stone, H.S. and Bokhari, S.H., "Control of Distributed Processes", Computer,
Vol. 11, 1978, pp. 971-976.
William, E.A., "Assigning Processes to Processors in Distributed Systems",
IEEE Conf. Parallel Processing, 1983, pp. 404-406.
