A User Guide to the algorithm mapper: A system for modeling and evaluating parallel applications/architecture pairs by Houstis, C. E. et al.
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
1988 
A User Guide to the algorithm mapper: A system for modeling and 
evaluating parallel applications/architecture pairs 
C. E. Houstis 
Elias N. Houstis 
Purdue University, enh@cs.purdue.edu 
John R. Rice 
Purdue University, jrr@cs.purdue.edu 




Houstis, C. E.; Houstis, Elias N.; Rice, John R.; Samartizis, S. M.; and Alexandrakis, D.L., "A User Guide to 
the algorithm mapper: A system for modeling and evaluating parallel applications/architecture pairs" 
(1988). Department of Computer Science Technical Reports. Paper 679. 
https://docs.lib.purdue.edu/cstech/679 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 









A USER GUIDE TO THE ALGORITHM MAPPER:
A SYSTEM FOR MODELING AND EVALUATING PARALLEL
APPLICATIONS/ARCHITECTURE PAIRS
C.E. Houstis*, E.N. Houstis**, J.R. Rice**,
S.M. Samartzis**, and D.L. Alexandrakis**
Computer Science Department
Purdue University




We present a methodology for evaluating the performance of application programs
on distributed computing systems. An application A is represented by an annotated
graph G(A) giving its requirements for processing, memory and communication,
plus the precedence between computation modules. Machines are represented by a
similar graph G(M), and the methodology is to map G(A) into G(M). The mapping
problem is subdivided into three steps: (s1) a reduction of the parallelism of G(A) to
that of G(M), (s2) scheduling the computational modules to (nearly) minimize
communication costs, and (s3) the actual layout of the resulting graph into G(M). The two
technical problems addressed here are (1) using communication delay models to
simplify G(M) and step (s3), and (2) scheduling the modules (step (s2)). Our
communication models apply well to "uniform" architectures; we explicitly consider the
following four: single bus and common memory, single bus and distributed memory,
multiple bus and distributed common memory, and Banyan interconnection and
distributed common memory.
* This research was supported by NSF grant DMC-85080684At.
** This research was supported by ARO grant DAAG29-83-K-0026 and AFOSR grant 84-0385.
1. INTRODUCTION
2. MODELING APPLICATION/ARCHITECTURE PAIRS
2.1 Review of Existing Modeling Methodologies
2.2 Comparison of Modeling Methodologies
3. METHODOLOGY FOR SHARED MEMORY ARCHITECTURES
3.1 A Stochastic Data Flow Graph Representation
3.2 A Deterministic Data Flow Representation
3.3 Performance Analysis of the Parallel System Architecture
3.4 Performance Measures of the Algorithm Mapper
4. THE ALGORITHM MAPPER RESOURCE ALLOCATION SYSTEM
4.1 The Allocation Algorithm
4.1.1 Parameters of the application
4.1.2 Parameters of the architecture
4.1.3 The allocation model
4.1.4 The heuristic allocation algorithm
4.1.5 The output of the allocation algorithm
4.1.6 Analysis of the use of the time frame in the allocation algorithm
4.2 The Algorithm Mapper Simulation System
4.2.1 The algorithm mapper preprocessor
4.2.2 The algorithm mapper graphics user interface
4.2.3 Shared memory architecture models
4.3 A Parallel Implementation of the Algorithm Mapper
4.4 Time Complexity and Optimality of the Algorithm Mapper
5. EXAMPLE STUDIES USING THE ALGORITHM MAPPER
5.1 Cholesky Decomposition
5.2 PDE Collocation Application
5.3 Robotic Elbow Manipulator Application
5.4 Reduction of Parallelism in the Robot Application

I. Description of the Allocation Program
II. Instructions for Using the Preprocessor
III. Error Messages
1. INTRODUCTION
Parallel processor systems are efficiently utilized when the computations they are assigned
can be performed in parallel and are mapped in such a way as to maximize their speed up.
Such systems can be interconnected in a variety of ways, which can be roughly classified as
shared memory and non-shared memory. In shared memory systems, processors usually have
their own local memory and communicate by contending for common resources, such as an
interconnection network and shared memory. Shared memory can be either distributed among
the processors in the form of shared memory modules, or it can be common. The
interconnection network along with the shared memory can be regarded as the system's communication
network. Interconnection networks are commonly from the class of multiple bus systems, ranging
from the simplest configuration of a single bus up to the highest bandwidth configuration of a
crossbar switch. The class of Banyan networks is also common. In the case of the multiple bus
interconnection, the distance between the processors is clearly one, since the interconnection
provides a direct connection from every processor to every other processor (through the
common memory). In the Banyan case, the distance between processors can also be regarded as
one, since on the average the routes between processors use the same number of intermediate
switches.
In non-shared memory architectures, a variety of interconnections exist. For example, pro-
cessors can be placed at the nodes of a grid or at the nodes of a cube, etc. In these cases, the
distance between processors is not necessarily one and it depends on the routes chosen between
processors and the geometry of the architecture.
A number of methodologies exist in the literature which address the problem of mapping
computations to parallel systems, and they can be divided into two categories: (a) the methods
that assume implicitly or explicitly that the distance between processors is one and (b) the
methods that assume that this distance is different from one. Our methodology explicitly
assumes the distance between processors is one and thus we review mainly such methodologies.
We first state the mapping problem. We consider an application A to be a computation
with four properties: processing requirements, memory requirements, communication require-
ments and precedence (or synchronization) between the subcomputations. We visualize the
computation broken into computational modules which are nodes of a precedence graph for the
computations. We note the processing and memory requirements at each node of the graph.
We note the communication requirements along each link or edge of the graph. This annotated
graph is called G(A).
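As an illustration of the annotated graph G(A), the following is a minimal sketch of one possible in-memory representation. The class and field names (`Module`, `AppGraph`, `processing`, `memory`, `comm`) are hypothetical and do not come from the report; they only mirror the four properties named above: processing and memory requirements on the nodes, communication requirements on the edges, and precedence given by the edge direction.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """One node of G(A): a computational module and its resource needs."""
    name: str
    processing: float  # processing requirement (time units)
    memory: float      # memory requirement (storage units)


@dataclass
class AppGraph:
    """Annotated precedence graph G(A); all names here are illustrative."""
    modules: dict = field(default_factory=dict)  # name -> Module
    comm: dict = field(default_factory=dict)     # (src, dst) -> data transferred

    def add_module(self, name, processing, memory):
        self.modules[name] = Module(name, processing, memory)

    def add_edge(self, src, dst, data):
        # An edge src -> dst means src precedes dst and sends `data` units to it.
        if src not in self.modules or dst not in self.modules:
            raise KeyError("both endpoints must already be modules of the graph")
        self.comm[(src, dst)] = data

    def predecessors(self, name):
        """Modules that must finish before `name` can start."""
        return sorted(s for (s, d) in self.comm if d == name)


g = AppGraph()
g.add_module("a", processing=4.0, memory=10.0)
g.add_module("b", processing=2.0, memory=5.0)
g.add_module("c", processing=3.0, memory=8.0)
g.add_edge("a", "c", data=100.0)
g.add_edge("b", "c", data=40.0)
```

Here module `c` has the two predecessors `a` and `b`, so `a` and `b` are candidates for parallel execution.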
We consider a machine to have three components: processing elements, memory elements
and communication paths (an interconnection network). Similarly, the machine can be
represented by an annotated graph G(M).
In general, the mapping problem has three somewhat independent steps:
1. Schedule the computational modules so that the application runs efficiently.
2. Reduce the parallelism of the application to that of the machine.
3. Given Steps 1 and 2 are done, embed the application into the machine.
We apply our methodology to several applications, mainly numerical or real-time.
2. MODELING APPLICATION/ARCHITECTURE PAIRS
2.1 Review of Existing Modeling Methodologies
Most of the existing algorithms have addressed Step 1: [CHU 80], [CHU 87], [EFE 82],
[HAES 80], [GYLY 76]. In general, the approaches used can be classified as graph theoretic,
integer programming and heuristic methods.
In [CHU 80], a graph theoretic and an integer programming approach are used to solve the
scheduling or allocation problem, which is defined as the assignment of M modules to N
identical processors so that interprocessor communication is minimized. The communication
network connecting the processors is not described explicitly and the solution implicitly assumes a
shared memory architecture.
In [CHU 87] a heuristic approach to task allocation for distributed systems is presented.
The issues discussed are very similar to what we have examined. The objective function in
[CHU 87] is the minimization of the maximum processor loading. This has produced a number
of differences in their analysis and solution of the problem. They are also not concerned with
parallel module execution.
A heuristic approach to module allocation by minimizing interprocessor communication
subject to a load balancing constraint is suggested in [EFE 82]. This approach clusters modules
into processors and contains a mechanism to solve Step 2 of the mapping problem as follows: N
processors are assumed and the modules are clustered heuristically into possibly N + K clusters.
A second heuristic, a module reassignment algorithm, is to divide the K clusters among the N
processors when possible, by balancing their load.
In [HAES 80], a graph theoretic approach is used. N modules in the application graph are
initially associated with N available processors and are regarded as resident modules. The
algorithm then divides the rest of the graph into N clusters, which are centered at the resident
modules, by minimizing communication between processors. In [GYLY 76], two module
clustering algorithms are used to search for "eligible" pairs of modules, eligible in the sense that
when they are assigned to the same processor, the greatest possible interprocessor
communication is eliminated. This "fusion" continues until all eligible pairs are fused. The
implicit assumption here is that the number of processors is equal to the number of clusters
obtained. A shared memory architecture is also implicitly assumed, since the distance between
processors is not considered.
For non-shared memory architectures, the work in [BERM 84, 85], [BOKH 81], [WILL
84] provides heuristic algorithms for matching an application to a parallel architecture. In
[BERM 84], a software system called Prep-P is described; it provides the user with a transparent
interface to the parallel machine, a preprocessor for the parallel application and heuristics for
mapping the application to a grid architecture. In [BOKH 81], the number of processors in the
system is assumed equal to the number of modules in the application and, using this as a
starting point, one heuristically solves the problem of maximizing the number of pairs of
communicating modules that fall onto pairs of directly connected processors. In [WILL 83], modules are
assigned to heterogeneous processors using heuristics; queueing disciplines for the processors
are decided based on these assignments. In all of the non-shared memory algorithms,
communication delay depends to a great extent on the number of links (number of processors) in the
path between the communicating processors. Thus, optimal routing between processors is an
integral part of the mapping algorithms.
2.2 Comparison of Modeling Methodologies
For shared memory architectures, the approaches based on graph theoretic methods [CHU
80], [HAES 82] and integer programming methods [CHU 80] result in fairly complex
algorithms, which prohibits their use for applications with large graphs. The heuristic approaches
[EFE 82], [GYLY 76] are more promising and simple to use.
In all of the above models, the interprocessor communication is measured in terms of the
amount of data transferred among modules assigned to different processors. These models fail
to incorporate the performance characteristics of the parallel system. In our model, data
transfers are assigned a communication cost which includes the queueing delay incurred by
communicating processors. This requires the performance analysis of the parallel system's
architecture. The resulting performance measure is the queueing delay versus utilization of the
communication network. Two different classes of interconnection networks have been examined,
the Banyan and multiple bus. We have experimentally shown that queueing delay considerably
affects the allocation decision; thus our approach incorporates the system's architecture into
the mapping heuristics. The use of performance models simplifies the way the system enters into
the problem. Moreover, it provides the means of examining the performance of
application/architecture pairs. The third step in the mapping problem for shared memory
architectures is simplified, since the distance between processors is one, and it is an arbitrary
assignment of the allocation of modules obtained in Step 2 to the system's processors, provided
that memory requirements are satisfied.
In non-shared memory architectures, the communication cost is the system routing delay
due to the various paths followed by information until it reaches its destination, and this is
accounted for in [BERM 84] and [WILL 83]. The approach used in [BERM 84] is useful, but
computationally very complex when one considers general and arbitrarily large application
graphs. In [WILL 83], the approach is simple, but one does not have enough information about
how close it is to the optimum.
The approach we propose for shared memory architectures is computationally simple,
applies to large and general graphs, and can easily be extended to non-shared memory
architectures. Moreover, we have examined a few small problems (Cholesky decomposition) where
we know the optimum schedule and we see that our algorithm produces this optimum schedule.
The extension of this work to non-shared memory architectures is underway.
3. METHODOLOGY FOR SHARED MEMORY ARCHITECTURES
We have given three steps for the solution of the mapping problem. We deal primarily
with Step 1 and Step 2, since Step 3 simplifies when shared memory architectures are used. In
Step 1, a heuristic algorithm is used to schedule the application computation modules and data
blocks into parallel clusters. The algorithm minimizes queueing delay among processors by
assigning module pairs with the most communication to the same processor, provided (a) they
do not have to be executed in parallel, (b) they do not overutilize the processor, and (c) they fit
into its local memory. The output of this step is a number of clusters of modules that has the
same parallelism as the application. Step 2 is the reduction of the clusters obtained in Step 1 to
the number of available processors in the system. This is necessary mainly for two reasons:
first, the application's parallelism in general can be much higher than the number of available
processors. Second, there are architectures in which the number of processors can be adjusted
to equal the number of clusters obtained. Multiple bus interconnection architectures offer
this possibility. In the Banyan interconnection system, the number of processors can be
increased (or decreased) only by powers of 2. Thus if the number of clusters obtained is not a
power of 2, then it is cost effective to use a number of processors equal to the largest power
of 2 that is less than the number of clusters. In this case, Step 2 is unavoidable. Step 3 is an arbitrary
assignment of the clusters obtained in Step 2 to the system's processors, provided memory
constraints are satisfied.
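The Banyan case above can be made concrete with a short sketch. It assumes the reading that the processor count is the largest power of 2 not exceeding the number of clusters from Step 1; the function name `banyan_processor_count` is hypothetical.

```python
def banyan_processor_count(num_clusters: int) -> int:
    """Largest power of 2 that does not exceed num_clusters."""
    if num_clusters < 1:
        raise ValueError("need at least one cluster")
    # bit_length() - 1 is the exponent of the highest set bit.
    return 1 << (num_clusters.bit_length() - 1)


# With 11 clusters, 8 processors are used and Step 2 must merge
# the 11 clusters down to 8.
processors = banyan_processor_count(11)
```

When the cluster count is already a power of 2 the function returns it unchanged, and no reduction is needed.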
The mapping problem requires the modeling of the application and of the parallel system.
The computations of an application are assumed to be partitioned and they are modeled by a
precedence directed logic graph. This graph can be stochastic when the data flow in the
application is not known a priori, as is often the case in real-time applications, or it can be
deterministic, as is often the case in numerical applications. Next we describe both a stochastic
and a deterministic graph.
3.1 A Stochastic Data Flow Graph Representation
We denote by $G = (M, L)$ a directed, parallel, cyclic and weighted AND-EOR logic
graph. Throughout, we assume that the partition of each application is represented by such a
graph. Let $M = \{m_i;\ i = 1, 2, \ldots, N_p\}$ be a set of weights corresponding to each program
(module) of the graph. Each $m_i$ represents the execution time requirement of the i-th program.
For each pair of programs $(i, j)$, we denote by $p_{ij}$ the probability of passing control from i to j
and by $\gamma_{ij}$ the expected amount of data transferred. Data transfers are measured in ITUs
(Information Transfer Units), which are the smallest units of information whose queueing delay due to
communication can be determined. Each link in the graph L is associated with the weights
$\{l_{ij} = (\gamma_{ij}, p_{ij});\ i, j = 1, \ldots, N_p,\ i \ne j\}$. Note that L also represents the precedence graph for
the application. If program i passes information to program j, then i precedes j in the
execution. Information may flow in both directions between a pair of program modules; we assume
the application is such that no infinite loops exist.
The synchronization requirements of the partitioned application are defined in terms of an
AND or EOR logic in the I/O of each node. It is worth noticing that an AND logic on the
output links of a node indicates the potential parallel execution of the modules associated with these
links. Figure 1 presents an example of a stochastic model that includes internal data blocks
referenced by each module. These internal data blocks are associated with the code and are not
part of the data from the application.
3.2 A Deterministic Data Flow Representation
We want to determine the total processing time requirements of the application. For this
we apply a Markov chain analysis [LOW 73] to the stochastic graph G and transform it into a
deterministic graph G'. The transformation G' is defined by (M', L') where

$$M' = \{\, m'_i = f_i m_i;\ i = 1, \ldots, N_p \,\},$$

$$L' = \{\, l'_{ij};\ l'_{ij} = 0 \ \text{if}\ i = j,\ \text{for}\ i, j = 1, 2, \ldots, N_p \,\}.$$

For each program i, $f_i$ is the number of times it is executed, $m'_i$ indicates its total processing
time requirement, while $l'_{ij}$ is the total amount of ITUs transferred between programs i and j.
The parallelism of the application (represented by AND logic) is implemented in matrix form
by the precedence matrix $\Delta = \{\delta_{ij};\ \delta_{ij} = 1$ if programs i and j can be executed in parallel;
$\delta_{ij} = 0$ otherwise$\}$. The degree of parallelism of the application is the maximum number of
program modules that can be executed in parallel. Our methodology uses the parallelism as
assigned by the user. We call this the assigned degree of parallelism; it can be less than the
actual degree of parallelism. The assigned degree of parallelism may be interpreted as the
amount of parallelism that the user wishes to maintain in the computation. The information in
the $\Delta$ matrix is in the G graph and can be easily obtained from it. Notice that the matrices L',
M' and $\Delta$ constitute part of the input to the allocation algorithms to be described.
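The Markov chain step can be illustrated with a small sketch. It assumes that the visit counts $f_i$ are the expected numbers of visits to each module in an absorbing chain whose transition probabilities are the $p_{ij}$, with any missing row probability read as termination; under that assumption $f$ satisfies $f = e + fP$, where $e$ marks the start module. The function name `expected_visits` and the three-module example are hypothetical. Only $m'_i = f_i m_i$ is computed, since that is the relation stated in the text.

```python
from fractions import Fraction


def expected_visits(p, start=0):
    """Solve f = e_start + f P for the expected visit counts f_i.

    p[i][j] is the probability of passing control from module i to module j;
    a row that sums to less than 1 terminates with the remaining probability.
    """
    n = len(p)
    # Augmented system (I - P)^T f = e_start, kept exact with fractions.
    a = [[Fraction(int(i == j)) - Fraction(p[j][i]) for j in range(n)]
         + [Fraction(int(i == start))] for i in range(n)]
    for col in range(n):  # Gauss-Jordan elimination
        # Raises StopIteration if the chain never terminates (singular system).
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][n] for i in range(n)]


half = Fraction(1, 2)
# Module 0 always calls 1; module 1 loops back to 0 or moves on to 2.
P = [[0, 1, 0],
     [half, 0, half],
     [0, 0, 0]]
m = [3, 5, 2]                               # execution time per visit
f = expected_visits(P)                      # expected visits per module
m_prime = [fi * mi for fi, mi in zip(f, m)]  # m'_i = f_i * m_i
```

In this example modules 0 and 1 are each executed twice on average, so their total processing requirements double.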
Figure 1. An example of a stochastic data flow graph model.
3.3 Performance Analysis of the Parallel System Architecture
Performance models of parallel multiprocessor systems are used to derive the queueing
delay processors incur in communicating among themselves. This delay is due to two factors:
(a) accessing the common interconnection network and (b) accessing the common memory





Single bus and common shared memory system.
Single bus and distributed shared memory system.
Multiple bus and distributed memory system.





A comparative performance analysis of these architectures has been performed, and in all cases
the interconnection network queueing delay versus utilization, D(u), has been computed.
The performance analysis leads to a number of performance measures. The main variable is the
average number of active processors, AP, in the system, i.e., processors doing computation in
their local memory. From AP, D(u) can be computed as discussed in Section 4.2.3. The
speed up, S, of the application running on the machine must be bounded as follows:

$$1 \le S \le AP.$$
3.4 Performance Measures of the Algorithm Mapper
The algorithm mapper produces schedules of modules to processors and a number of
performance measures, which relate to the system use and the efficient execution of the application.
The main measures computed are directly related to the analytical systems model. The first is




the number of processors in the system.
the set of all module indices in G (A) assigned to processor j,




$$u_p = \frac{1}{K} \sum_{i=1}^{K} u_p^i.$$
Moreover, the average processor utilization $u_p$ relates to the number of active processors in the
system, AP, as follows:

$$AP = \sum_{i=1}^{K} u_p^i = K\, u_p.$$
Our second performance measure is the speed up S, defined in terms of $T_{SEQ}$ = execution time of
the application by a single processor (sequential execution) and $T_{REAL}$ = execution time of the
application by the parallel system. We define

$$S = T_{SEQ} / T_{REAL}.$$
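The measures of this section can be computed directly from per-processor utilizations. The sketch below assumes that AP is the sum of the processor utilizations (as used in Section 4.1.6) and that utilizations are given as fractions rather than percentages; the function name and the sample numbers are hypothetical.

```python
def performance_measures(utilizations, t_seq, t_real):
    """Return (average utilization u_p, active processors AP, speed up S)."""
    k = len(utilizations)
    ap = sum(utilizations)   # average number of active processors
    u_p = ap / k             # average processor utilization
    speedup = t_seq / t_real
    return u_p, ap, speedup


u_p, ap, s = performance_measures([0.5, 0.75, 0.25, 0.5], t_seq=90.0, t_real=60.0)
within_bounds = 1.0 <= s <= ap   # the bound 1 <= S <= AP of Section 3.3
```

With these numbers the four processors keep two of them busy on average, and the speed up of 1.5 lies inside the bound.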
4. THE ALGORITHM MAPPER RESOURCE ALLOCATION SYSTEM
The algorithm mapper software system is composed of three main parts: (a) a preprocessor,
which takes the partitioned application and produces the information about its graphical
representation and the input data to the mapping heuristics, (b) the heuristic algorithms for the
mapping problem, and (c) a user friendly interface. The interface is interactive and displays the
input and output of the mapping heuristics on a SUN workstation employing color graphics.
4.1 The Allocation Algorithm
In Step 1, we formulate and solve the module allocation problem. It is stated formally as
a constrained minimization problem as follows:
Input: (a) An application which consists of communicating program modules and data blocks.
(b) Specifications of a given distributed system: processor speeds, memory module sizes
and D(u) characterizing the interconnection network queueing delay vs. utilization.
Problem: Allocate the application modules and data into (1) clusters of modules allocated to
individual processors and (2) clusters of data allocated to memory modules, in order to
minimize the queueing delays due to interprocessor communications. This is subject to the
(a) Distributed System Constraints:
(i) size of memory modules,
(ii) processor utilization capacity,
(b) Application Constraints:
(i) a fixed time allowed for executing the application,
(ii) program modules that can be executed in parallel will be executed in parallel.
Note that the objective of minimum processing time will be achieved as a result of parallel
processing and minimizing the queueing delays due to interprocessor communications.
Next, we define variables and parameters of this problem and formulate it mathematically.
4.1.1 Parameters of the application
$N_p$ = Number of program modules in the application,
$N_D$ = Number of data blocks in the application,
$L'$ = Matrix representation of the intermodule communications in the application,
    $l'_{ij}$ = total number of ITUs transferred between the i-th module and the j-th module,
    $l'_{ij}$ = 0 if both modules are allocated to the same processor,
$m'_i$ = Total execution time of the i-th module (in TUs),
$t_i$ = Code storage for the i-th module,
$d_i$ = Number of ITUs in the i-th data block,
$f_i$ = Number of times the i-th module is executed,
$\Delta$ = Matrix representation of the parallelism in the application,
    $\delta_{ij} = 1$ if the i-th and the j-th module may be executed in parallel,
    $\delta_{ij} = 0$ otherwise,
$H$ = Matrix representation of the data references in the application,
    $h_{ij} = 1$ if the i-th module is referencing data from the j-th data block,
    $h_{ij} = 0$ otherwise,
$H'$ = Matrix representation of the data communication in the application,
    $h'_{ij} = f_i h_{ij} d_j$ = total number of ITUs referenced by the i-th module from the j-th
    data block.
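The definition $h'_{ij} = f_i h_{ij} d_j$ can be illustrated with a few lines of code. This is a minimal sketch with made-up numbers; the function name `data_traffic` is hypothetical.

```python
def data_traffic(f, h, d):
    """Build H' where h'[i][j] = f[i] * h[i][j] * d[j].

    f[i]    number of times module i is executed
    h[i][j] 1 if module i references data block j, else 0
    d[j]    number of ITUs in data block j
    """
    return [[f[i] * h[i][j] * d[j] for j in range(len(d))]
            for i in range(len(f))]


f = [2, 1]                 # module 0 runs twice, module 1 once
h = [[1, 0, 1],
     [0, 1, 1]]            # which module references which block
d = [10, 20, 5]            # block sizes in ITUs
h_prime = data_traffic(f, h, d)
```

Block 2 is referenced by both modules, so it contributes traffic in both rows of `h_prime`, while blocks 0 and 1 each have a single referencing module.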
4.1.2 Parameters of the architecture
The annotated graph G(M) is the data structure with the following elements. In this case the
machine is completely homogeneous, so no actual graph is needed in its representation.
$K$ = Number of processors of the machine,
$\mu_l^c$ = Storage capacity of the private memory of the l-th processor for storing code,
$\mu_l^d$ = Storage capacity of the private memory of the l-th processor for storing data,
$\eta_l$ = Permitted utilization of the l-th processor,
$C$ = Capacity of the interconnection network,
$D(u)$ = Queueing delay versus utilization characteristic of the interconnection network.
4.1.3 The allocation model
The mathematical formulation requires further quantities to express the allocation:
$Q$ = Matrix representation of the assignment of program modules,
    $q_{il} = 1$ if the i-th module is allocated to the l-th processor,
    $q_{il} = 0$ otherwise,
$X$ = Matrix representation of the assignment of data blocks,
    $x_{il} = 1$ if the i-th data block is allocated to the l-th processor,
    $x_{il} = 0$ otherwise,
$R$ = Auxiliary matrix,
    $r_{ij} = 1 - \sum_l q_{il} q_{jl}$, which is 0 if the i-th and j-th modules are allocated to the
    same processor and 1 otherwise,
$S$ = Auxiliary matrix,
    $s_{ij} = 1 - \sum_l q_{il} x_{jl}$, which is 0 if the i-th module and the j-th data block are
    assigned to the same processor and 1 otherwise.
The three times associated with the computations of the l-th processor are:

$$T_e = \text{Program execution time} = \sum_{i=1}^{N_p} m'_i\, q_{il},$$

$$T_q = \text{Queueing delay time due to communications} = \sum_{i=1}^{N_p} q_{il} \sum_{j=1}^{N_p} r_{ij}\, l'_{ij}\, d,$$

$$T_d = \text{Queueing delay due to data block references} = \sum_{i=1}^{N_p} q_{il} \sum_{j=1}^{N_D} s_{ij}\, h'_{ij}\, d,$$

where d is the average queueing delay per ITU in the interconnection network (see Section
4.1.4 for details of the computation of d).
The architecture constraints for the l-th processor are expressed in terms of these quantities as
follows:

$$\text{Module storage:} \quad \sum_{j=1}^{N_p} t_j\, q_{jl} \le \mu_l^c,$$

$$\text{Data storage:} \quad \sum_{j=1}^{N_D} d_j\, x_{jl} \le \mu_l^d.$$
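To make the allocation model concrete, the sketch below evaluates the three times and the two storage constraints for one processor from given assignment matrices. It takes $r_{ij} = 1 - \sum_l q_{il} q_{jl}$ and $s_{ij} = 1 - \sum_l q_{il} x_{jl}$ as defined above and treats the delay per ITU, `d`, as a given constant; the function names and all numbers are hypothetical.

```python
def processor_times(l, q, x, m, lp, hp, d):
    """Return (T_e, T_q, T_d) for processor l.

    q[i][l] = 1 if module i is on processor l; x[j][l] likewise for data block j.
    m[i] is m'_i, lp[i][j] is l'_ij, hp[i][j] is h'_ij, d is the delay per ITU.
    """
    n_p, n_d, k = len(q), len(x), len(q[0])
    # r[i][j] = 1 when modules i and j sit on different processors.
    r = [[1 - sum(q[i][c] * q[j][c] for c in range(k)) for j in range(n_p)]
         for i in range(n_p)]
    # s[i][j] = 1 when data block j is not local to module i's processor.
    s = [[1 - sum(q[i][c] * x[j][c] for c in range(k)) for j in range(n_d)]
         for i in range(n_p)]
    t_e = sum(m[i] * q[i][l] for i in range(n_p))
    t_q = sum(q[i][l] * r[i][j] * lp[i][j] * d
              for i in range(n_p) for j in range(n_p))
    t_d = sum(q[i][l] * s[i][j] * hp[i][j] * d
              for i in range(n_p) for j in range(n_d))
    return t_e, t_q, t_d


def storage_ok(l, q, x, t, blocks, mu_code, mu_data):
    """Check the module storage and data storage constraints for processor l."""
    code = sum(t[j] * q[j][l] for j in range(len(q)))
    data = sum(blocks[j] * x[j][l] for j in range(len(x)))
    return code <= mu_code and data <= mu_data


q = [[1, 0], [1, 0], [0, 1]]          # modules 0 and 1 on processor 0
x = [[1, 0], [0, 1]]                  # block 0 on processor 0, block 1 on 1
m = [6, 10, 2]
lp = [[0, 5, 3], [0, 0, 4], [0, 0, 0]]
hp = [[20, 0], [0, 10], [0, 8]]
times0 = processor_times(0, q, x, m, lp, hp, d=0.5)
times1 = processor_times(1, q, x, m, lp, hp, d=0.5)
```

On processor 0 the traffic between modules 0 and 1 is free because they share a processor, while module 1 pays for its references to the remote block 1.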
In addition to these constraints on individual processors, there is also the parallel processing
constraint represented by $\Delta$: two modules executed in parallel cannot be assigned to the same
processor. More explicitly, if $\delta_{ij} = 1$ then $\sum_l q_{il} q_{jl} = 0$. This assures the exploitation of the
parallelism in the application.
Note that this model of the computation is not completely realistic. It does not include the
algorithmic synchronization delays of the program modules. As discussed in Section 4.1.6,
after the allocation is made one must make a small additional computation to estimate the actual
execution time of the algorithm.
4.1.4 The heuristic allocation algorithm
The heuristic algorithm used for the allocation problem involves an artificial parameter T,
called the time frame:
T = total time allowed for execution of any processor (in TUs).
This introduces an additional constraint into the problem, namely, for the l-th processor

$$T_e + T_q + T_d \le \eta_l T.$$

Thus the algorithm has the following input/output:
INPUT: $T,\ m'_i,\ t_i,\ d_j,\ N_p,\ N_D,\ L',\ \Delta,\ H',\ \mu_l^c,\ \mu_l^d,\ \eta_l,\ C,\ K,\ D(u)$
OUTPUT: $Q$ = assignment matrix of the program modules,
$X$ = assignment matrix of the data blocks,
$N_{opt}(T)$ = optimal number of clusters of program modules and data blocks,
$u_p^l$ = utilization of the l-th processor, $l = 1, 2, \ldots, N_{opt}(T)$.
We now outline the heuristic algorithm. The first step is to select a value of the time
frame T. It must not be too small, otherwise the application cannot execute in such a short
time. If T is too large, then all the program modules are placed in a few processors and the
computation is not as distributed as it might be. The heuristic algorithm has five phases as
follows:
1. Initialization
Start by allocating each module to a different processor ($r_{ii} = 1$, $r_{ij} = 0$, $i \ne j$). Search the
H' matrix by row to determine if any data block is referenced by a single module. If so, make
it local to the processor by entering the data block in the processor's private memory list and by
deleting this entry from the H' matrix.
2. List pairs of program modules eligible for merging
Search the L' matrix and locate all program module pairs eligible for merging, that is,
$\delta_{ij} = 0$, and choose those with the maximum number of ITUs transferred. If no such pair
exists, stop.
3. Find a pair of modules for merging
For each pair $(p_i, p_j)$ in the list from Phase 2 calculate:
(a) Total amount of ITUs to be communicated had $p_i$ and $p_j$ been merged (intermodule
communications plus external data block accesses):

$$V_{ij} = \sum_{k=1}^{N_p} (l'_{ik} + l'_{jk}) + \sum_{r \ne \text{local}} (h'_{ir} + h'_{jr}).$$

(b) Interconnection network utilization due to $V_{ij}$:

$$u(i,j) = \frac{V_{ij}}{C \times T}.$$

(c) Interconnection network queueing delay due to $V_{ij}$:

$$d(i,j) = D(u(i,j)).$$

The function D(u) depends on the machine architecture; see Section 4.2.3.
(d) Processor utilization had $p_i$ and $p_j$ been merged:

$$U_p(i,j) = \frac{1}{T} \Big[ m'_i + m'_j + \sum_{r \ne \text{local}} (h'_{ir} + h'_{jr})\, d(i,j) + \sum_{k=1}^{N_p} (l'_{ik} + l'_{jk})\, d(i,j) \Big].$$

If $U_p(i,j) > \eta_l$ then $p_i$, $p_j$ are ineligible for merging; delete them from the list. If all pairs are
ineligible go to Phase 2, otherwise find the pairs for which $U_p(i,j) \le \eta_l$, if any. Then select the
pair whose merging allows the maximum number of data blocks to be assigned to private
memory. If there is more than one, then select the pair which yields the lowest processor
utilization. If there is still more than one, pick one pair at random.
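Steps (b) through (d) and the selection rule of Phase 3 can be sketched as follows. The report's D(u) comes from the architecture models of Section 4.2.3, which are not reproduced here, so the sketch substitutes a textbook M/M/1 waiting time purely as a stand-in; `mm1_delay`, `merged_utilization`, `pick_pair` and the sample values are all hypothetical. Ties that survive both criteria are broken by list order here rather than at random.

```python
def mm1_delay(u, service_time=0.01):
    """Stand-in for D(u): M/M/1 mean waiting time per ITU (not the report's model)."""
    if not 0.0 <= u < 1.0:
        raise ValueError("network utilization must lie in [0, 1)")
    return service_time * u / (1.0 - u)


def merged_utilization(v_ij, exec_time, capacity, time_frame, delay=mm1_delay):
    """Processor utilization if a pair with external traffic v_ij were merged."""
    u = v_ij / (capacity * time_frame)   # step (b): network utilization
    d = delay(u)                         # step (c): queueing delay per ITU
    return (exec_time + v_ij * d) / time_frame   # step (d)


def pick_pair(candidates, eta):
    """candidates: list of (pair, utilization, data_blocks_made_local).

    Keep pairs within the permitted utilization, prefer the most data blocks
    made local, then the lowest utilization.
    """
    ok = [c for c in candidates if c[1] <= eta]
    if not ok:
        return None
    return min(ok, key=lambda c: (-c[2], c[1]))[0]


util = merged_utilization(v_ij=100.0, exec_time=12.0, capacity=10.0, time_frame=20.0)
choice = pick_pair([((0, 1), 0.95, 3), ((0, 2), 0.60, 1), ((1, 2), 0.70, 1)], eta=0.9)
```

In the example the pair (0, 1) would free the most data blocks but exceeds the permitted utilization, so the choice falls to (0, 2), which has the lower utilization of the two remaining pairs.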
4. Merge a pair of modules
Merge the module pair $(p_i, p_j)$ from Phase 3 by adding $p_j$ to the list of modules of
processor n containing $p_i$: $m'_n = m'_n + m'_j$, $r_{nj} = r_{jn} = 1$. Delete $p_j$ as follows:
for $k < i$ and $k \ne j$, $l'_{ki} = l'_{ki} + l'_{kj} + l'_{jk}$, and
for $k > i$ and $k \ne j$, $l'_{ik} = l'_{ik} + l'_{jk} + l'_{kj}$.
Further, examine the H' matrix to find whether any data blocks are made local by this merging and,
if any, add them to the private memory of processor n and delete them from the H' matrix.
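The bookkeeping of the merge can be sketched on a small traffic matrix. For simplicity this sketch stores L' as a full directed matrix instead of the triangular form used above, and it returns a new matrix with the absorbed module removed; the function name `merge_modules` is hypothetical.

```python
def merge_modules(lp, i, j):
    """Fold module j into module i and drop j from the traffic matrix.

    lp[a][b] is the number of ITUs sent from module a to module b.
    Traffic between i and j becomes local and disappears.
    """
    n = len(lp)
    out = [row[:] for row in lp]
    for k in range(n):
        if k not in (i, j):
            out[i][k] += lp[j][k]   # what j sent to k is now sent by i
            out[k][i] += lp[k][j]   # what k sent to j now goes to i
    out[i][i] = 0
    keep = [k for k in range(n) if k != j]
    return [[out[a][b] for b in keep] for a in keep]


lp = [[0, 5, 3],
      [0, 0, 4],
      [0, 0, 0]]
merged = merge_modules(lp, 0, 1)   # modules 0 and 1 become one cluster
```

After the merge the 5 ITUs between modules 0 and 1 no longer use the network, and the merged cluster sends 3 + 4 = 7 ITUs to the remaining module.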
5. Reset the ineligible pairs
After a merge has been completed, a new graph has resulted with one less node and fewer
links (as computed in Phase 4). Reset all ineligible pairs of modules (from previous merge) to
eligible and return to Phase 2.
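The five phases can be put together in a compact sketch. It is a simplified illustration under stated assumptions, not the ALLOC program: data blocks and memory limits are left out, only pairs with nonzero traffic are considered, ties are broken by index rather than at random, a single permitted utilization `eta` applies to every processor, and `delay` is any caller-supplied stand-in for D(u). All names are hypothetical.

```python
def allocate(m, lp, delta, time_frame, capacity, eta, delay):
    """Greedy clustering of modules; returns a list of clusters (module indices)."""
    n = len(m)
    clusters = [[i] for i in range(n)]          # Phase 1: one module per cluster
    load = list(m)                              # execution time per cluster
    traffic = [[lp[a][b] + lp[b][a] for b in range(n)] for a in range(n)]

    def parallel(a, b):
        return any(delta[i][j] for i in clusters[a] for j in clusters[b])

    def utilization(a, b):
        others = [k for k in range(len(clusters)) if k not in (a, b)]
        ext = sum(traffic[a][k] + traffic[b][k] for k in others)
        u = ext / (capacity * time_frame)
        return (load[a] + load[b] + ext * delay(u)) / time_frame

    ineligible = set()
    while True:
        # Phase 2: eligible pairs, heaviest traffic first.
        cands = [(traffic[a][b], a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))
                 if traffic[a][b] > 0 and not parallel(a, b)
                 and (a, b) not in ineligible]
        if not cands:
            return clusters
        _, a, b = max(cands)
        # Phase 3: reject the pair if the merged processor would be overloaded.
        if utilization(a, b) > eta:
            ineligible.add((a, b))
            continue
        # Phase 4: merge cluster b into cluster a.
        clusters[a] += clusters[b]
        load[a] += load[b]
        for k in range(len(clusters)):
            if k not in (a, b):
                traffic[a][k] += traffic[b][k]
                traffic[k][a] = traffic[a][k]
        del clusters[b], load[b], traffic[b]
        for row in traffic:
            del row[b]
        ineligible.clear()                      # Phase 5: reset ineligible pairs


m = [4, 3, 5, 2]
lp = [[0, 0, 30, 0], [0, 0, 10, 0], [0, 0, 0, 20], [0, 0, 0, 0]]
delta = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]  # 0 and 1 run in parallel
toy_delay = lambda u: 0.01 * u / (1.0 - u)      # hypothetical D(u)
loose = allocate(m, lp, delta, 20.0, 10.0, 0.8, toy_delay)
tight = allocate(m, lp, delta, 20.0, 10.0, 0.5, toy_delay)
```

With the looser utilization limit the chain 0, 2, 3 collapses into one cluster and module 1 stays apart because it must run in parallel with module 0; with the tighter limit the second merge is rejected and three clusters remain.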
4.1.5 The output of the allocation algorithm
A sample output of the algorithm is given in Table 1. The output includes the total (%)
utilization of each processor which is the processing time plus communication time of modules
assigned to a processor.
ALLOCATION SOLUTION AND WORKLOAD STATISTICS
number of modules is 41
estimated elapsed time is 70.0, bus capacity is 151515.2 time units
allowed real time for processing is 70.0 time units
*** PROCESSOR UTILIZATION ***
[per-processor rows: id, processor utilization (%), assigned module ids]
delay per item 0.000007
*** APPLICATION'S COMMUNICATION REQUIREMENTS *** (in time units)
[per-processor rows: id, instruction blocks and data blocks with size and utilization, assigned ids]
*** there are no common data blocks ***
Table 1. Output of the allocation algorithm; the variable id is the index of the program module or data
block.
4.1.6 Analysis of the use of the time frame in the allocation algorithm
When the time frame T changes, a different clustering of G(A) is obtained. Ideally, we
would like to find the shortest time frame T for which the application can be run. Let $T_{PAR}$ =
the shortest time frame for which the machine can run the application A in parallel.
At the end of Step 1, $T_{PAR}$ is the shortest time frame for which the application is allocated
to as many processors as its degree of parallelism. In Step 2, $T_{PAR}$ is the shortest time frame for
which the number of clusters obtained equals the number K of processors available. Thus $T_{PAR}$
is the time frame for which optimal clustering is obtained.
As mentioned earlier, the model of the computation does not include the time for delays
due to the algorithmic synchronization of the application. That is, the time one program waits
on its predecessors to finish is not included. The time frame $T_{PAR}$ includes only the times for
execution, communication and data reference within a cluster. Synchronization delays are an
attribute of the applications and their effect can be easily computed once the allocation is made.
One merely "executes" the algorithm abstractly using the information in G(A). The resulting
execution time is denoted by $T_{REAL}$.
From the output of the allocation aIgoritlun, we can compute the average number of active
processors, AP, (see Section 3.3) as follows
AP = \sum_{i=1}^{K} u_i^{P}
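As a minimal sketch of this sum, assuming that the per-processor utilization terms are read from the ALLOC output as percentages (as in the utilization listing of Table 1), AP could be accumulated as follows. The function name and the array layout are hypothetical and are not part of ALLOC.

```c
#include <stddef.h>

/* Average number of active processors, AP, taken as the sum of the
   per-processor utilization terms. util_percent[i] is assumed to hold
   the utilization of processor i+1 as a percentage (0..100). */
double average_active_processors(const double *util_percent, size_t K)
{
    double ap = 0.0;
    size_t i;

    for (i = 0; i < K; i++)
        ap += util_percent[i] / 100.0;   /* convert percent to a fraction */
    return ap;
}
```

With K processors the result lies between 0 and K; a value close to K indicates that almost all processors are active almost all the time.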
We can obtain TPAR and TREAL from the algorithm and then the following bounds hold
4.2 The Algorithm Mapper System
The algorithm mapper is a software system which maps any application to a shared paral-
lel memory architecture system. It is made up of a preprocessor, the heuristic allocation algo-
rithm called ALLOC, and a user friendly graphical interface.
The allocation algorithm was initially implemented in Pascal [STEV 82] for a single time
frame and a hypothetical system queueing delay function. All input data were assumed known.
No preprocessor or user interface was available. The code was inefficient and did not
completely solve the mapping problem.
The present allocation algorithm ALLOC is implemented in the language C with a variable
time frame. A library of performance functions has been added for four multiprocessor
architectures, namely
(a) Single bus with shared common memory,
(b) Single bus with distributed shared memory modules (two port memory),
(c) Multiple bus with shared memory modules,
(d) Processors and shared memory with a Banyan interconnection.
In the case of a multiple bus with shared memory module architecture, the number of busses can
be set equal to one, thus obtaining a single bus with distributed shared memory architecture or
set equal to the number K of processors, thus obtaining a crossbar interconnection of processors
with distributed shared memory.
The time frame T is varied. Initially a large value of T is used, which is decreased until TPAR
is obtained, i.e., the shortest time frame for which the number of clusters obtained equals the
application parallelism in Step 1 or, after the parallelism reduction of Step 2, equals the number
of available processors in the system.
After an allocation has been obtained and the clusters formed, the execution of the application
is simulated in order to obtain TREAL, i.e., the actual real time required to execute the
application on the architecture. This simulation routine is also in the C program.
4.2.1 The algorithm mapper preprocessor
The preprocessor PALLOC is an interface between the user and the allocation program
ALLOC which automatically creates its input data file. The preprocessor is necessary because
the amount of data needed by ALLOC is very large, especially for large graphs.
PALLOC is written in the C programming language and currently runs on a DEC
VAX-11/780 computer under the Berkeley operating system (4.3 BSD). All applications programs for
PALLOC should be written in C. There is also a version of the preprocessor for applications
written in FORTRAN.
PALLOC uses either an abstraction of the actual application or an instrumented version of
it or a combination. In any case, the application must be partitioned into subroutines
corresponding to the program modules to be used in the parallel implementation. Further, the
data communicated between these modules must be explicitly specified as to destination and
size (in bytes). The execution time of the code in a program module can either be specified
explicitly or the actual code is compiled and timed during a sequential execution of the
application. In summary, PALLOC determines the execution times and total communication of
modules from this special version of the application. We now give more details about the
features of the input to PALLOC.
3. finis()
This must be the last statement of the main program and signals the final computa-
tions and creation of the output files.
Examples





{ initialize data ... }
process(module1, 1, ptr1)
cobegin()
process(module2, 1, .1, ptr2)
process(module3, 1, .5, ptr3)




The preprocessor maintains a table of virtual clocks, one for each process (program
module), and an incremental "stop watch" variable clock_ for the currently executing process.
Together these determine the virtual time for the current module.
There are two ways to alter the virtual timer while executing (note that only clock_ should
be altered).
0.1. Increment the clock manually. Insert lines like
clock_ += execute_time;
into a subroutine, and compile it with 'cc' (C compiler). For example, a hardware complex
multiply module could be modeled as
extern unsigned int clock_;
typedef struct {float re, im;} COMPLEX;
cmult(a, b, c) /* A=B*C, a is a pointer to A, etc. */
COMPLEX *a, *b, *c;
{ int tm = 892;
clock_ += tm;
}
where tm = 892 is the number of virtual nanoseconds needed to obtain the result. tm
could also be a parameter to cmult() if execution time variations need to be modeled.
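The variant with tm as a parameter might look like the sketch below. It is written in ANSI C rather than the older style used above, it also performs the multiplication, and it defines clock_ locally as a stand-in so that the fragment compiles on its own; in a PALLOC build clock_ would remain an extern supplied by the preprocessor. The name cmult_timed is hypothetical.

```c
/* Stand-in for the preprocessor's "stop watch" variable (assumption:
   with PALLOC this would be declared extern, as in the example above). */
unsigned int clock_ = 0;

typedef struct { float re, im; } COMPLEX;

/* A = B*C, charging a caller-supplied cost of tm virtual nanoseconds. */
void cmult_timed(COMPLEX *a, const COMPLEX *b, const COMPLEX *c,
                 unsigned int tm)
{
    COMPLEX r;

    r.re = b->re * c->re - b->im * c->im;
    r.im = b->re * c->im + b->im * c->re;
    *a = r;           /* assign last so that a may alias b or c */
    clock_ += tm;     /* only clock_ is altered, as required */
}
```

A caller can then pass 892 for the hardware multiplier modeled above and a larger value for, say, a software implementation, without writing a second module.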
The user writes for each module a different subroutine (which might call other subroutines,
but NOT subroutines that are modules). When a module needs to pass information to another
module, the PALLOC statement send(pid, nbytes) is used, where pid is the identifying number
of the destination module and nbytes is the number of bytes sent. A module receives information
by using the PALLOC statement receive(message) to indicate acceptance of information
from any other module. The preprocessor maintains a table for each module with all the
necessary data, which is updated as the preprocessor executes.
In any module the following can be used:
General PALLOC Statements
1. send(pid, nbytes) -- in order to send nbytes to pid module, where pid is the number
of the destination module,
2. receive(messages) -- in order to receive information from any other module,
3. getpid() -- this returns the number of the current module at run time,
4. clock_on()
clock_off() -- used for execution time control; see below under Clock Control.
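The report does not describe the layout of the per-module table, so the following is only a sketch of the kind of bookkeeping a send(pid, nbytes) call could trigger. All names here (module_record, record_send, set_current_module, bytes_between, total_bytes_sent) are hypothetical and are not PALLOC identifiers; module numbers are assumed to lie in the range 0 to MAX_MODULES-1.

```c
#define MAX_MODULES 64

/* Hypothetical per-module record of outgoing communication. */
struct module_record {
    unsigned long bytes_sent;             /* total bytes sent by this module */
    unsigned long messages_sent;          /* number of send calls            */
    unsigned long bytes_to[MAX_MODULES];  /* bytes sent to each destination  */
};

static struct module_record table_[MAX_MODULES];
static int current_pid_ = 0;              /* module currently executing */

/* Select the module whose record subsequent sends are charged to. */
void set_current_module(int pid)
{
    if (pid >= 0 && pid < MAX_MODULES)
        current_pid_ = pid;
}

/* Record that the current module sends nbytes to module pid.
   Returns 0 on success and -1 if pid is out of range. */
int record_send(int pid, unsigned long nbytes)
{
    struct module_record *r = &table_[current_pid_];

    if (pid < 0 || pid >= MAX_MODULES)
        return -1;
    r->bytes_sent += nbytes;
    r->messages_sent += 1;
    r->bytes_to[pid] += nbytes;
    return 0;
}

unsigned long bytes_between(int src, int dst) { return table_[src].bytes_to[dst]; }
unsigned long total_bytes_sent(int src)       { return table_[src].bytes_sent; }
```

Totals of this kind, accumulated over a sequential run, are the sort of communication data that label the edges of G(A).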
Main Program Constructs
After writing the code for the various modules, the user must explicitly specify the
sequence in which the modules are going to be executed. Two constructs are provided by the
preprocessor: sequential and parallel. They are controlled by the main program of the
application. The purpose of the main program is to initialize the data and specify the module
sequence. The following statements are provided:
1. process(name, pid, prob, parmptr)
where name - is a pointer to the program module to be executed,
pid - is the number of the current module,
prob - is the probability with which information is being sent from the current
module to the successor module,
parmptr - is a pointer to a structure which contains data to be passed to the
successor module.
This tells the preprocessor that this 'name' is one module of the computation. The
pid must be assigned by the user; this is provided so that the user has better control.
2. cobegin()
coend()
These specify that the modules bracketed are to be executed in parallel.
0.2. Increment the clock automatically subject to start and stop signals. Write the process
routine without any reference to clock_ (except through two controls described below) and
compile it with the C compiler 'simcc' instead of the usual 'cc'. This modified C compiler
generates its code by looking at the "usual" assembly code and inserting an increment
to clock_ for each VAX instruction. Timing data is taken from a table of (VAX
instruction, execution time) pairs. The default table has execution time = 0 nsec for all
instances, but this can be altered easily by supplying another table. The clock is initially
turned off; it can be started with a call to clock_on() and turned off again with a call to
clock_off().
The 'simcc' compiler command line option '-i' can be used to force the clock to be turned
off initially.
Actually, there are two routines named clock_on() and clock_off(). Calls to parameterless
subroutines are compiled into a single instruction like
calls $0,_clock_off
where the value of the starting address of _clock_off (i.e., clock_off()) is to be supplied by the
linker loader. When ccsf (the component of simcc which inserts increments to clock_) sees the
above instruction, it simply copies the following instructions until it sees
calls $0,_clock_on
at which point it begins to insert increments to clock_ as if the "intervening" code was not
present. Note that the clock on/off switch retains its setting across subroutine boundaries, since
it acts solely as a textual modification. A final hazard is the compiler's removal of dead
(unreachable) code; e.g., do not place these switches after an infinite loop with no exit.
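The textual nature of the switch can be illustrated with a small sketch that walks an assembly listing and adds up the virtual time that would be charged. This is not the ccsf code: the function name, the table layout and the timing values used in the example are hypothetical, and the sketch assumes that the two calls instructions themselves are not charged.

```c
#include <string.h>

/* One (instruction, execution time) pair of a timing table. */
struct timing {
    const char  *opcode;   /* e.g. "addl2" */
    unsigned int nsec;     /* virtual nanoseconds charged per occurrence */
};

/* Sum the virtual time of a listing. The calls to _clock_off/_clock_on
   act purely as textual switches: instructions between an off and the
   next on are skipped. clock_running gives the initial switch setting.
   Opcodes missing from the table are charged 0 nsec. */
unsigned long virtual_time(const char **listing, int n,
                           const struct timing *table, int ntab,
                           int clock_running)
{
    unsigned long total = 0;
    int i, t;

    for (i = 0; i < n; i++) {
        size_t len;

        if (strcmp(listing[i], "calls $0,_clock_off") == 0) { clock_running = 0; continue; }
        if (strcmp(listing[i], "calls $0,_clock_on") == 0)  { clock_running = 1; continue; }
        if (!clock_running)
            continue;

        len = strcspn(listing[i], " \t");      /* opcode ends at first blank */
        for (t = 0; t < ntab; t++) {
            if (strlen(table[t].opcode) == len &&
                strncmp(listing[i], table[t].opcode, len) == 0) {
                total += table[t].nsec;
                break;
            }
        }
    }
    return total;
}
```

Because the scan is purely sequential over the text, a clock_off() placed in one subroutine also suppresses timing in whatever code follows it in the listing, which is the behavior across subroutine boundaries noted above.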
The automatic timing scheme is not without faults, such as:
1. Dependence on the simulator's host machine for the simulated instruction set model.
While the VAX instruction set is rich and varied, there are many operations which require
more than one instruction to implement. It is then difficult to formulate an appropriate
timing model for new hardware operations.
2. Invariant instruction execution times. We should be able to give different weights to
instructions using register access, memory access, indirect memory access, autoincrements,
etc., and to ignore certain "bookkeeping" instructions such as stack pushes before a
function call or arithmetic type conversions enforced by the C compiler.
3. Compiler dependence. The C compiler optimizations do not always generate high quality
code from readable programs. For example, unnecessary address calculations may be
performed while stepping through an array; this problem is correctable by using pointer
variables, but such programs are more difficult to produce.
4.2.2 The algorithm mapper graphics user interface
AllocTool is a graphical interface to the allocation algorithm. It helps the user to specify
the computation graph, to enter the required data for the algorithm, and to display the results in
a graphical form. In general, for a specific application, the user has to do the following steps to
use the allocation algorithm:
(a) Run AllocTool (which is trivial).
(b) Draw the application data flow graph.
(c) Specify the various data that are required for the algorithm.
(d) Run the allocation algorithm and display the results.
(e) (Optional) Store the application description in a file for later use.
Steps (b) and (c) can be replaced by loading an application data file previously prepared.
AllocTool uses the SunView library routines and should work on any Sun workstation. In
the following paragraphs, we use the terminology of the SunView library when referring to
windows and specific items within these windows. These terms from SunView will be in
ITALICS.
The tool is composed of two basic FRAMES (windows). The first one is the control
PANEL, which controls the functions of the tool, and the second is a frame containing a CANVAS
window for images and a small PANEL on the top for diagnostic messages. In the following
paragraphs, we describe the operation of these windows.
Message PANEL
The message PANEL is the subwindow at the top of the second FRAME. It has two lines
that display diagnostic and error messages. The first line has three mouse images that represent
the mouse left, middle and right buttons. On the right of each image is displayed a short text
describing the function that the respective button will perform in the current state of the tool. If
no text is displayed for a button, that means no function is assigned to this button.
The second line is for general purpose diagnostic messages.
Main PANEL
This PANEL is composed of several PANEL ITEMS, which we describe starting from the
top and going down.
Dir ITEM. This is a TEXT item and is used to display the current directory. When the tool
starts, the directory is assumed to be the current user directory; afterwards the user can change
to any directory by changing the text in the dir ITEM. To change the text, click
the left mouse button on the text, erase any unwanted characters using the backspace, and
complete the new directory. If the directory is not a valid one, no error message is displayed, but
when the tool tries to use this directory one gets an error message "directory not found".
File ITEM. This is also a TEXT item and can be used in the same way as the dir ITEM
described above. The file ITEM is used to enter the name of a file that contains data for an
application after using AllocTool. These data can be loaded (using the load BUTTON), edited
and rerun through the allocation algorithm. Also one can specify in the file ITEM the name of
the file where the current application data are stored (using the store BUTTON). All the files are
in the directory specified by the dir ITEM.
Load BUTTON. This is a BUTTON used to load the data for the application from a
specified file (file ITEM). These data have been produced using AllocTool. They contain data
for the image of the application graph, as well as data for the various parameters of the alloca-
tion algorithm. If the file specified does not exist, then an error message is displayed. During
the time the tool is loading the data the cursor is transformed to an hourglass and no other
operation is possible in the tool.
To initiate the loading process, one clicks the left mouse button when the pointer is inside
the BUTTON area.
Store BUTTON. This is also a BUTTON and is used to trigger the store process to place
the current application data in the file specified in the file ITEM. Data are stored for both the
image of the computation graph and the algorithm parameters. If the file used already exists (is
not a new file), the tool asks the user to confirm overwriting of the file.
Quit BUTTON. This is used to exit from AllocTool. The current application data are not
saved by default anywhere, so to save them, one must use the store BUTTON.
Histo BUTTON. This BUTTON displays a processor utilization histogram after an
execution of the algorithm has been completed.
Help BUTTON. (not implemented yet).
Edit BUTTON. One can edit the file specified in the file ITEM using this BUTTON.
Param BUTTON. Using this BUTTON, one enters or changes the values for several
parameters of the allocation algorithm. A popup window appears on the screen that displays the
parameter names and their current values. The left mouse button selects the parameter to
change, and one can enter the new value in the value field for this parameter. When finished,
select the done BUTTON from this window and the new values are stored. Note that the win-
dow is a blocking one; all the input from the mouse or the keyboard is directed to this window.
Nothing else in the screen can accept input simultaneously.
Clear BUTTON. This BUTTON clears the image of the application graph from the
CANVAS (if any) and initializes the internal database of the tool, so a new application can be started
(entered or loaded).
Redraw BUTTON. This redraws the image of the data flow graph of the current applica-
tion.
Exec BUTTON. This BUTTON executes the allocation algorithm. The tool produces an
output file which is fed to the allocation program to get a result file. The first file is named
current_file.alloc and the second current_file.res, where current_file is the file name currently
specified in the file ITEM. While execution takes place, the cursor is transformed to an hourglass
and no other function is possible. After the completion of the program, the resulting allocation
is displayed on the application graph. Each node now has a label specifying the processor
number to which it has been assigned. If the workstation supports color, different processors are
represented by different colors. NOTE: It is important to make sure that all the required data
are given in order for the allocation algorithm to work correctly.
Grid CHOICE ITEM. This is a CHOICE ITEM, that is, an item which has values from a
small set, here ON or OFF. If the grid is ON, a grid is displayed on the CANVAS in order to
assist the drawing of the application data flow graph. If the grid is OFF, the grid disappears.
Label CHOICE ITEM. This is also a CHOICE ITEM like grid and its values (SHOW or
HIDE) control showing or hiding the labels on the edges of the data flow graph.
Show Canvas Window BUTTON - Hide Canvas Window BUTTON. This is a single BUT-
TON, but the label string changes every time it is used. The use is the obvious one, to show
and hide the canvas window where all the graphics is displayed.
Command CHOICE ITEM - Object CHOICE ITEM. These two CHOICE ITEMS are used
for drawing. In order to make the graph, one selects commands and objects. (Not all the
possible combinations are valid.) The data flow graph is composed of four different objects:
circles, (that represent computation blocks)
squares, (that represent memory blocks)
edges, (that represent the communication paths = graph edges)
numbers, (that are the labels on the edges)
The mouse and mouse buttons are used for the selection. Most of the operations are performed
by pressing and releasing a mouse button at a specific point in the CANVAS window. The
following commands are used to create a graph.
ADD
This adds some object to the current graph. The add command is not valid for number
objects; there is another way of assigning the labels for the graph edges.
To add a circle (computation block) after selecting ADD, move the cursor in the canvas
subwindow and point to the location where the center of the circle is to be. Then click the left
mouse button and the new computation block appears on the canvas. One can enter a square
(memory block) the same way, but the point the cursor points to is the top left corner of the
square.
To add an edge (line) one specifies a source node and a destination node. Do that by
pointing and clicking the left mouse button inside the image of the source node, and the right
button inside the image of the destination node. It is important to enter the source node first. If
the edge already exists, then an error message is displayed. This produces a straight line edge.
If the edge is to be a broken line, then the intermediate points must be inserted in order,
using the middle mouse button. So, first insert the source using the left mouse button, then the
intermediate points (if any) using the middle mouse button, and finally the destination using the
right mouse button.
DELETE
The delete command is used to remove objects from the graph using the mouse. To select
a node for deletion, just click inside the node image. If a node with edges is deleted, all the
incident edges are deleted also. To select an edge for deletion, specify the source and
destination nodes. Thus, this command deletes an object simply by selecting it. The delete command,
like add, does not apply to number objects.
MOVE
This command is used to relocate an object. It is not valid for edges as they follow the
moves of the nodes they are attached to.
To move an object, first select it (the selection procedure is the same as in delete). Then
point to the new location and click using the middle mouse button. One can repeatedly move
the same object without selecting it again, since the tool remembers the last selection. To move
a number object (edge label), select the edge as in the delete command.
DRAG
The drag command is a "continuous" version of move; the selected object follows the
track of the mouse on the CANVAS. An object is selected to drag as in the move command. Hold
down the middle mouse button and drag the pointer around the CANVAS to an appropriate new
location, then release the mouse button. The object image is relocated to this point.
ALIGN
The align command is applicable only to graph nodes and makes graphs with some nodes
horizontally or vertically aligned. The horizontal alignment of the nodes is important because
vertical position is used to infer the parallelism of the graph, essential information for the allo-
cation algorithm. Computation blocks that can be executed in parallel must be drawn in the
same horizontal level. The grid choice item is supplied to aid in this. Vertical alignment is for
appearance (symmetry) purposes. The nodes are selected using the left mouse button. The first
node selected is assumed to be the "pivot" node; it represents the start of the alignment. All
the other nodes are aligned relative to it. After selecting the nodes, click the middle mouse
button for vertical alignment or the right mouse button for horizontal alignment. Both circles and
squares are aligned using the same function, so selecting either type will do.
DATA
The data command is a facility to enter data about objects into the data flow graph. Most
of these data are used by the allocation algorithm, and some by the tool itself. One can enter
data only for edges and nodes. When data is the current command, one selects an object and a
popup blocking subwindow appears. It has named data fields and one can specify values as
with the param BUTTON.
An edge is selected in the same way as for adding an edge (see add). The subwindow that
appears has the following fields:
Source node, the source node for the edge,
Destination node, the destination node,
Probability, the probability that execution of the source node is followed by execution of
the destination node,
Communication, the amount of communication (in ITU's) that is transferred on this edge.
The last two fields are data for the allocation algorithm and must be entered by the user. The
first two are displayed for convenience to identify the edge and should not be edited. After
completing any changes in the subwindow, click the done BUTTON and the subwindow
disappears. The new values are now stored in the tool's database.
The window for a computation block has the following fields:
Label, the label in the circle,
Radius, the radius of the circle,
Level, the parallelism level number,
Processor, the number of the processor assigned to this block by the allocation algorithm,
Initial Probability, the probability of this block to be the first one in the computation,
Expected Time, the estimated time for this block to complete execution,
Instruction Block Size, the number of bytes for the code of the instruction block.
Except for the radius and level, which are inferred from the graph image data, the rest are data
for the allocation algorithm. The user should specify them prior to execution. The processor
number is an output variable computed by the allocation algorithm.
The window for a memory block has the following fields:
Label, the label of the memory block,
Size, the width of the square,
Level, the level in the graph,
Processor, the number of the processor where this block is assigned,
Datablock Size, the size of the data (memory) block.
Of these fields, only the last one must be entered by the user and it is used in the algorithm.
All data fields for the data command contain initial default values. These are, in general,
not realistic values.
A sample output of the graphics user interface is shown in Figure 2. The numbers of the
graph nodes are the processor numbers to which the modules are allocated. Processor utilization
is also shown graphically.
4.2.3 Shared memory architecture models
Four different shared memory architectures (see Section 3.3.3) are analyzed and their
queueing delay functions D(u) presented. In all cases the overall system organization is the
same and it is introduced first.
The systems are composed of k processors and k memory modules (although we are
assuming that the number of processors is the same as the number of memory modules, the
same analysis can be applied when the number of processors is less than the number of memory
modules). Each processor has its own private memory module, where the program and the data
are stored. If processor i wants to communicate with processor j, it prepares a message and
sends it to memory module j where it can be accessed by processor j. We assume
asynchronous communication, so any processor can be in any one of three states:
Figure 2. Sample output of the graphics user interface of AllocTool. Colors are used in the
large CANVAS FRAME (center) to indicate the processor assignment (or numbers
for black and white workstations). Processor utilization is shown at the upper right,
the control PANEL at the upper left.
1. The processor is executing a program in its local memory,
2. The processor is sending a message to another processor,
3. The processor is blocked waiting for the interconnection network to deliver a mes-
sage to another processor.
Processors in the first state are considered active processors doing useful work. Processors are
considered to be in the first state (executing) even when they are idle waiting for data. We do
not account for the time taken by processors to read messages, since this is considered to be part
of the program executed by the processor.
We assume that the time between the generation of messages is an exponentially
distributed random variable with mean 1/\lambda, and the length of the message is an exponentially
distributed random variable with mean 1/\mu. We also assume that an access request from processor i
is directed to memory module j with probability p_{ij} = 1/k, i.e., the communication pattern is
homogeneous. When a processor sends a message to another processor, it stops executing its
program. Then, if the interconnection network can establish a path from the source processor to
the destination memory, it does so instantaneously with no delay and the processor begins to
send its message. When the message is completed, the processor returns to its active state. If
there is contention at the interconnection network or destination module, the processor is put in
a queue, "blocked", waiting for the contention to be resolved before transmitting its message.
The performance analysis of these architectures is documented in detail in [HOUS 87],
[HOUS 88]. The main performance measure obtained is the average number of active
processors, AP. We have also obtained the average queueing delay per message, D(u), and the
average utilization of the communication network. We summarize these results below.
System 1: Single bus and shared common memory architecture.
Set \rho = \lambda/\mu; then
AP_1 = \frac{\sum_{j=0}^{k} \rho^j \frac{k!}{(k-j)!} - 1}{\rho \sum_{j=0}^{k} \rho^j \frac{k!}{(k-j)!}}
and
A three processor example of this system is shown in Figure 3 and D_1(u) is plotted in
Figure 4.
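For readers who want numbers for System 1, the following sketch evaluates AP_1 under the assumption that the garbled expression above reads AP_1 = (S - 1)/(\rho S) with S = \sum_{j=0}^{k} \rho^j k!/(k-j)!. This reading is a reconstruction from the scanned text, and the function name is hypothetical.

```c
/* Average number of active processors for System 1 (single bus, shared
   common memory), under the reconstructed formula
       AP_1 = (S - 1) / (rho * S),  S = sum_{j=0}^{k} rho^j * k!/(k-j)!
   Requires k >= 1 and rho > 0. */
double ap_single_bus(int k, double rho)
{
    double term = 1.0;   /* j = 0 term: rho^0 * k!/k! */
    double S = 1.0;
    int j;

    for (j = 1; j <= k; j++) {
        /* rho^j * k!/(k-j)! = previous term * rho * (k - j + 1) */
        term *= rho * (double)(k - j + 1);
        S += term;
    }
    return (S - 1.0) / (rho * S);
}
```

Building each term from the previous one avoids computing factorials explicitly. As \rho tends to 0 the value approaches k (all processors active), and for k = 1 it reduces to 1/(1 + \rho).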
Figure 3. Single bus and shared common memory architecture with 3 processors P_i and 3
private memories PM_i.
Figure 4. Delay vs. utilization D_1(u) of the interconnection network of System 1 of Figure 3.








P j~ U -k)!
n = AP_2 \rho/(1 - \rho),
D_2(u) = (k - AP_2/(1 - \rho))/u.
A three processor example of this system is shown in Figure 5 and D_2(u) is plotted in Figure 6.
Figure 5. Single bus and distributed shared memory architecture with 3 processors P_i, 3
private memories PM_i, and 3 common memories CM_i.
Figure 6. Delay vs. utilization D_2(u) of the interconnection network of System 2 of Figure 5.
System 3: Multiple bus and distributed shared memory modules architecture. It is of order
k x m x b where k is the number of processors, m the number of memories and b
the number of busses.
Set \rho = \lambda/\mu and define the array P_j(l) by
P_j(l) = P_j(l - j) + P_{j-1}(l - j) + ... + P_1(l - j) + P_0(l - j)
with initial conditions
P_j(l) = 0,  l < j
P_0(l) = 0,  l > 0
pi/) = 1 j ;c, 0
Then set
b-l I-b
L jPj(l) + b LfpbU + b)Pm-b(l- 2b - j + m)]
R - j=1 j=O 1 "2: 0
}-'1- bi l-b
L p/I) + LfpbU + b)Pm-b(l- 2b - j + m)]
j=l j=!J
and
[ '-'~ k l '-j ]]-'I + L ph -.. II P,'j=O J! 1=1
• k,k-i I
P ="" k-i_" n R-:-lp for i=l to k-l.




D_3(u) = (k - u/\rho)/u.
Figure 7 illustrates this system and D_3(u) is plotted in Figure 9 for a system with k = m = 8,
b = 1 or 8.
Figure 7. Multiple bus and distributed shared memory architecture of order k x m x b.





I~ -c-"k-'-,!cc i A_1_]-1~-:O(k-i)!Pj=leU) ,Po= k! jPj=P o pk II(k - O! j-I
{
i(2k - O.5i - 1.5)
2(k 1)
where e(j) = flo& (i), fl(i) = $(f,-1 (i)) and $(i) ~ I
Then we have
AP_4 = \sum_{i=1}^{k} i P_i
With u(x) = max(x, 0), let
L = \sum_{i=1}^{k} u(i - e(i)) P_i,
then we have
u = 1 - P_0
and
D_4(u) = (k + L\rho)/(k - L)





Figure 8. Schematic of the Banyan switch and distributed shared memory architecture.
Figure 9. Delay vs. utilization D(u) of the interconnection networks of Systems 3 and 4.
System 3 is illustrated with two choices (1 and 8) for the number of busses.
4.3 A Parallel Implementation of the Algorithm Mapper
The allocation algorithm runs initially for a large value of the time frame T and
subsequently the value of T is decreased until the number of clusters equals the parallelism of the
graph in Step 1, or equals the number of processors in Step 2 (see the discussion at the end of
Section 1). The smallest value of T for which Step 1 or Step 2 is completed is TPAR. Note that
in each iteration of the algorithm, the allocation heuristic is executed producing a number of
clusters and a schedule of modules for the current value of T. When T changes, the output of the
allocation algorithm changes as well. Moreover, the output of one iteration is not used in the
next iteration. This attribute makes the algorithm a suitable candidate for parallel execution using
the multisection extension of the bisection algorithm for locating zeros of functions. See
[MIRA 69], [RICE 71] for further details. We have implemented a parallel version using the
Sequent, a parallel machine which houses (in the configuration we used at the Computer
Science Department at Purdue) 20 processors.
Any number k \le 20 of processors can be used. Our particular version for the parallel
algorithm mapper is as follows.
1. Estimate T as best one can.
2. Take as initial interval [a,b] = [T/2, T], hoping that the guess is good enough for this
interval to contain TPAR. Set T_0 = 0.
3. Divide [a,b] into k - 1 equal subintervals by the values
T_i = a + (i - 1)(b - a)/(k - 1), i = 1, 2, ..., k.
4. Assign the ith processor the value T_i and run the allocation program with it.
5. Let [T_j, T_{j+1}] be the interval such that (a) T_{j+1} gives the smallest number of clusters,
and (b) the number of clusters for T_j is larger than that of T_{j+1}. If j \ne 1, take [a,b]
to be [T_j, T_{j+1}]. If j = 1, take [a,b] to be [max(T_0, 2T_1 - T_2), T_1]. If b - a \le \epsilon =
convergence tolerance, then go to 6, otherwise go to 3.
6. Set T = a and produce output. If the number of clusters is less than the parallelism
of the application, then find that interval [T_l, T_{l+1}] where l is the smallest index
where the number of clusters exceeds those of T. Set T_0 = T_l, a = T_l, b = T_k and go
to 3.
A little analysis shows that this algorithm has speed up of log2(k + 1), which is worthwhile for
small values of k.
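The core of the procedure, steps 3 to 5, can be sketched as a multisection search. The sketch below is a simplification and not the Sequent implementation: it omits the interval extension of steps 5 and 6, it evaluates the sample points in a sequential loop where the parallel version gives each point to its own processor, and it assumes that the number of clusters does not increase as T grows. The function toy_clusters is a made-up stand-in for one run of the allocation program, and all names are hypothetical.

```c
/* Toy stand-in for the allocation program: 100 time units of work are
   packed into clusters that hold at most T time units each (T > 0). */
int toy_clusters(double T)
{
    int n = 1;
    while (n * T < 100.0)
        n++;
    return n;
}

/* Multisection search for the smallest time frame T in [a,b] with
   clusters(T) <= target. Assumes clusters(a) > target,
   clusters(b) <= target, k >= 3 sample points per round, and eps well
   above the floating point resolution of a and b. The returned value
   satisfies clusters(T) <= target and lies within eps of the smallest
   such T. */
double multisection(int (*clusters)(double), int target,
                    double a, double b, int k, double eps)
{
    while (b - a > eps) {
        double h = (b - a) / (k - 1);   /* spacing of the k sample points  */
        int first_ok = k - 1;           /* b itself is known to be feasible */
        int i;

        /* Interior points a + i*h; the end points are already classified. */
        for (i = 1; i < k - 1; i++) {
            if (clusters(a + i * h) <= target) {
                first_ok = i;
                break;
            }
        }
        if (first_ok < k - 1)
            b = a + first_ok * h;       /* first feasible sample point */
        a = a + (first_ok - 1) * h;     /* last infeasible sample point */
    }
    return b;
}
```

For example, multisection(toy_clusters, 4, 20.0, 40.0, 8, 1e-6) narrows [20, 40] down to a value just above 25, the smallest T for which the toy problem fits into four clusters. Each round shrinks the interval by a factor of k - 1, against a factor of 2 for bisection, which is where the logarithmic speed up comes from.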
4.4 Time Complexity and Optimality of the Algorithm Mapper
The time complexity of the allocation algorithm is approximately proportional to the
number of links in the application graph. This is a consequence of the search for the links that
carry high communication and the formation of lists of such links until a merge can be
performed. After each merge, the same search starts over again. In Table 2, timing data from
three representative applications are shown. These data strongly suggest that the time
complexity is no worse than linear in the number of links in G(A).
Application        Number of links    Time    Ratio
Real time                 27           1.18    22.9
PDE collocation           77           4.1     18.8
Robot arm                230          21.1     10.9
Table 2. Experimental data on the complexity of the allocation algorithm. Time for execution
is given in seconds on a VAX 11/780 along with the ratio of links to time.
A symmetric and computationally simple partition of a Cholesky decomposition application
is used to test the optimality of the algorithm mapper. A (4 x 4) partition of the application
graph is discussed in detail in Section 5.1. The degree of parallelism of the graph is four and
four clusters were obtained for System 2 with four processors. By using simulation, T_PAR has
been shown to be the minimum possible elapsed time for this application (see Figure 13).
Moreover, the processor utilizations were well balanced (see Figure 12).
5. EXAMPLE STUDIES USING THE ALGORITHM MAPPER
Several applications have been tested to confirm the analysis and heuristics. All except
one are actual computations for specific applications. The exception is an application whose
graph is stochastic and has been used in [BATR 78] to illustrate that a merging heuristic will
perform satisfactorily for any general stochastic graph. In [BATR 78], a similar allocation
algorithm was applied to this application, but it was not actually programmed and a number of
mistakes have been made because of limited testing. We give an account of this in [HOUS 88] and
it is not described further here.
The three real applications tested are (1) a Cholesky decomposition algorithm for the
solution of linear equations, (2) a partial differential equation (PDE) problem, using the
collocation method, and (3) the solution to the Newton-Euler equations for the mechanical
movement of a robotic elbow manipulator. In all three cases, a partitioning of the application
is given and the algorithm mapper system is used to obtain an allocation. The partitions are
from previous work: PDE collocation application, [HOUS 87]; Cholesky decomposition, [OLEA 85];
robot elbow application, [KASH 85].
5.1 Cholesky Decomposition Application
We consider the parallelization of the Cholesky algorithm for the factorization of symmetric
matrices. Figure 10 shows the computation of each module [OLEA 85]. We initially used a
data flow language SIMON [FUJI 85] to specify the computational modules and their
communication and synchronization requirements. We then executed these programs using the
SIMON non-shared memory multiprocessor simulator. Figure 11 shows the graph G(A) and
the values of the various workload parameters (node processing time and blocking time,
communication traffic among nodes) obtained by setting SIMON's switching delay to zero (i.e.,
assuming there are no communication delays). The blocking time or algorithm synchronization
delay of a module is the time that the module must wait for its inputs before its computation
can start. The processing time is the time to execute the code in a module. These times are
given in the vectors b and m of Figure 11. The application was also run by our preprocessor
and similar input data for the allocation were obtained.
The matrix is partitioned into blocks so that the maximum assigned degree of parallelism
is equal to the number of processors available, so we have as many processors as the assigned
degree of parallelism of the application. A single bus multiprocessor and distributed shared
memory architecture is used. Thus only Step 1 of the mapping problem needs to be taken. We
have run this application for various assigned degrees of parallelism (number of processors); we
only report on the case of assigned degree of parallelism four here. Figure 12 shows the output
of the allocation algorithm. Note that no internal datablocks are necessary in this application,
since no reference to data is necessary. By default the number of internal datablocks is set
to one per module, of size zero.





IF (K = i and K = j) THEN
BEGIN
X := sqrt(X);
IF (j < N) PUT (OUT_EAST,X);
IF (i < N) PUT (OUT_SOUTH,X);
BREAK
END

IF (j < N) PUT (OUT_EAST,X);
IF (i < N) PUT (OUT_SOUTH,Y);
BREAK;
END

IF (j < N) PUT (OUT_EAST,Y);

IF (i < N) PUT (OUT_SOUTH,Y);
IF (j < N) PUT (OUT_EAST,Z);
END
END WHILE
Figure 10. Computation of node (i,j) in the SIMON data flow language for a parallel block
Cholesky algorithm. The operations sqrt(X), X/Y, etc. are on block submatrices of
the matrix to be factored.
Figure 11. Precedence graph G(A) of the parallel Cholesky decomposition algorithm for a 4
by 4 block decomposition of a symmetric matrix. The numbering of modules is
specified in each node and the processing times and blocking times are given in
this order. The communication traffic is indicated as a weight on the links of the
graph. (Module processing times m = (24, 30, 30, 24, 27, 31, 32, 27, 34, 38, 39,
21, 28, 34, 45); module blocking times b = (0, 4, 14, 24, 15, 27, 37, 48, 33, 45,
61, 66, 51, 64, 76, 88).)
From Figure 12 we see that T_PAR = 250, T_seq = 622 (139+152+170+161), and
T_REAL = 336. The processor utilization u_p is (total cluster processing time)/(time frame T), for
example u_p = 139/250 = 0.556. The number AP of active processors is calculated from the u_p to
be 2.488. The speed up is S = T_seq/T_REAL = 1.851. We see that the following relationship
holds.
1 ≤ S = T_seq/T_REAL = 1.851 ≤ T_seq/T_PAR = 2.488 ≤ AP = 2.488
When the time frame T is decreased below T_PAR, the result is more and more clusters. We
have used simulation to verify (see Figure 13) that the minimum elapsed time of this application
does indeed occur at T = T_PAR. See [HOUS 87] for more details.
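The quantities quoted above can be recomputed directly from the four cluster processing
times. The following Python lines are only an arithmetic check under the definitions given in
the text; the variable names are illustrative and not part of the algorithm mapper.

```python
cluster_times = [139, 152, 170, 161]   # processing time of each of the four clusters (Figure 12)
t_par = 250                            # time frame T_PAR
t_real = 336                           # elapsed time T_REAL

t_seq = sum(cluster_times)                         # sequential time T_seq
utilization = [c / t_par for c in cluster_times]   # per-processor utilization
ap = sum(utilization)                              # average number of active processors AP
speed_up = t_seq / t_real                          # S = T_seq / T_REAL

print(t_seq, round(ap, 3), round(speed_up, 3))     # prints 622 2.488 1.851
```

Since AP is the sum of the utilizations, it equals T_seq/T_PAR here, which is why the last two
terms of the inequality coincide.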
Figure 12. Output of a 4 x 4 case of the Cholesky decomposition application of Figures 10
and 11.
Figure 13. Simulation results for the elapsed time and average transmission delay of the
different G'(A)'s obtained for the Cholesky 4 x 4 decomposition application.
5.2 PDE Collocation Application
This PDE application is complete from the description of a problem in a very high level
language through the graphical display of the computed solution. Figure 14 shows a schematic
diagram with the steps of the computation; it has an assigned degree of parallelism of 6. Not
shown in Figure 14 are vertical data communication paths between corresponding nodes on the
fan-in/fan-out of the tree structure in the center. The PDE problem solved is a general one on a
non-rectangular domain. A multifront method is used based on a nested dissection partition of
the domain. Gauss elimination is used to eliminate unknowns in the interior of each domain.
The numbers shown at each node are the 'computation units' for a particular instance of this
problem corresponding to using a 20 by 20 mesh, a finite element method with cubic basis
functions, and a 40 by 40 plotting grid. The domain boundary has 3 pieces and, at the fourth
step, all but one processor are working on the interior of the domain. A computational unit here
is about 1000 arithmetic operations, plus associated memory and control operations.
Figure 15 shows the same graph with the additional communication paths. These paths
transfer LU factorizations of subblocks of the linear systems from the initial solve phase to the
back substitution phase. The numbers shown are data transfers required in units of 1000 words.
The complete graph as displayed by the algorithm mapper is shown in Figure 16 (the
communication has been scaled by a factor of 1000). We again use a single bus multiprocessor and
distributed shared memory architecture with the number of processors equal to the assigned degree
of parallelism of the application. Thus only Step 1 of the mapping problem needs to be taken.
In Figure 17 the output of the allocation algorithm from the algorithm mapper is shown.




Figure 14. Annotated graph G(A) for the PDE application. The numbers at the nodes are
units of computation to be done. Step labels in the diagram include: Process 3
Boundary Pieces; Create 6 Frontal Areas; Tabulate Answer for Plot; Tabulate Flow
Field; Create Plot Data Structure; Compute Contours, Color for 6 Views.
Figure 15. The graph of Figure 14 showing additional communication edges and the number
of units to be communicated along each edge.
Figure 18 shows another form of the output of the algorithm mapper system. No internal
data blocks exist in this application, thus datablocks by default are set to equal one per module
of size zero.
Figure 16. The complete graph of the PDE collocation application as displayed by the
algorithm mapper system. This is at the input stage, the communication along edges
is shown and the computational modules numbered. At the upper left is the
message PANEL discussed in Section 4.2.2.
Figure 17. The output of the algorithm mapper system for the PDE collocation application.
The allocation of computational modules to processors is indicated by the
numbering of the nodes. Nodes with the same number are allocated to the same
processor.
Figure 18. Additional output of the algorithm mapper system for the PDE collocation
application.
Similar calculations can be made from Figure 18 as with the previous application. Thus
T_PAR = 70, T_seq = 252.587, T_REAL = 71 and
AP = .512 + .6968 + .2375 + .5 + .4513 + .554 + .6567 ≈ 3.608. The speed up is 3.54 and we
have
1 ≤ S = T_seq/T_REAL = 3.54 ≤ T_seq/T_PAR = 3.608 ≤ AP = 3.608
See [HOUS 87] for more details.
5.3 Robotic Elbow Manipulator Application
In [KASH 85] a partition of a robot elbow manipulator computation is given at the equation
level, i.e., the computational modules represent the solution of an equation. We use this
partition with slight modifications. Figure 19 shows the precedence graph of this application;
the numbers assigned to the modules identify them. Modules are organized on different levels
and modules on the same level can be executed in parallel. The execution times of modules are
given in [KASH 85] in msec for an Intel 8087 processor. The partition in [KASH 85] requires
little communication among modules and communication is not considered there. We have
modified this by including in the execution time, t_e, of a module, both the processing time t_x
and communication time t_y, that is t_e = t_x + t_y. We also assume that synchronization delay is
included in t_x. If a module has several outgoing links, we distribute the communication time t_y
uniformly among them. We have modified this application to obtain three applications with
different behaviors. We take the execution time t_e to be constant and divide it into
communication and computation so that the values of the computation/communication ratio
r = t_x/t_y are 1/10, 1/1 and 10/1. The latter corresponds closely to the original application.
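The split of a fixed execution time t_e into t_x and t_y for a given ratio r, and the uniform
distribution of t_y over the outgoing links, can be written out as follows. This Python fragment
is a sketch under the definitions above (t_e = t_x + t_y, r = t_x/t_y); the function name and
the sample value of t_e are hypothetical.

```python
def split_execution_time(t_e, r, out_links):
    """Return (t_x, t_y, t_y per outgoing link) for one module.

    t_e       -- total execution time of the module (held constant)
    r         -- computation/communication ratio t_x / t_y
    out_links -- number of outgoing links of the module
    """
    t_y = t_e / (1 + r)          # follows from t_e = t_x + t_y and t_x = r * t_y
    t_x = t_e - t_y
    per_link = t_y / out_links if out_links > 0 else 0.0
    return t_x, t_y, per_link


# A hypothetical module with t_e = 110 and two outgoing links, for the three ratios used.
for r in (1 / 10, 1.0, 10.0):
    print(r, split_execution_time(110.0, r, 2))
```

Because t_e is unchanged, varying r moves work between computation and communication
without altering the number of modules or the shape of the graph.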
We vary the computation/communication ratio r of this application to study its effect and
to show various properties of the allocation algorithm. Usually a fine partition requires more
communication than a coarse partition, but at the same time, the number of modules is increased
or decreased respectively. By varying r, we change the partition grain without affecting the
number of modules. Let cp be the number of clusters obtained by the algorithm in Step 1; then
in general cp is greater than or equal to the parallelism of the application.
For this application, a comparative study has been performed for two architectures, the
multiple bus and Banyan switch, both with distributed shared memory.
Figure 19. The precedence graph of the 105 computational modules of the robot elbow
manipulator. The maximum assigned degree of parallelism is 11.
Schedules for the Multiple Bus and Distributed Shared Memory Architecture
This application, shown in Figure 19, is allocated to a k x m x b multiple bus and distributed
shared memory (see Section 4.2.3) architecture. We choose the number of memory
modules (m) and processors (k) to be equal. A 13 x 13 x 1 and a 13 x 13 x 8 system are used.
The number cp of parallel clusters obtained is equal to the number of processors in the system.
Outputs of the heuristic algorithm, for different values of the computation/communication ratio r
and the two multiple bus systems, are shown in Figures 20 and 21. We give a schedule of
assigned modules to each processor, the processor utilization u_p and the optimal time frames
T_PAR obtained. We also give the total processor utilization, which includes both the processing
and communication delay times required by all modules assigned to the same processor.
Comparing Figures 20 and 21, we observe for the cases r = 1/1 and 1/10 that T_PAR is shorter
for the 8 bus system (Figure 21a,b) than for the 1 bus system (Figure 20a,b). This is expected,
since the 8 bus system has higher bandwidth and thus less delay. In addition, in each case
different processor schedules are obtained. This is the effect that queueing delay has on the
scheduling. In the case of r = 10/1, queueing delay does not play a significant role in
scheduling and the 1 bus (Figure 20c) and 8 bus (Figure 21c) systems have identical schedules. We
have measured the utilization for this case and we find that it is in an area (see Figure 9) where
there is no significant difference in the queueing delay for 1 and 8 bus systems. We note that
processor utilizations are low due to the parallelism constraint between modules.
In Figure 22, a schedule is shown which is based on minimizing the amount of communication
among modules (the H' matrix) without assigning a cost to it, i.e., the queueing delay
D(u) is zero. A different schedule is produced as compared to the same r = 1/10 case for the 1
bus system in Figure 20. The significant queueing delay in the 1 bus system also increases
T_PAR substantially. If we compare Figure 22 with the r = 1/10 case for the 8 bus system in
Figure 21b, then one observes that the same schedule has been obtained. The value of T_PAR is
practically unchanged. This is due to the fact that a 13 x 13 x 8 system is practically a crossbar
switch and queueing delay is almost non-existent. One should not forget that if there are
modules with communication among them, they are scheduled into the same processor unless
they are parallel modules. This eliminates most of the high communication between modules.
System: 13 x 13 x 1, r = 1/1, T_PAR = 2206
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 51.13 45.33 1 57 61 63 66 67 69 72 92 98 104
2 70.31 68.67 2 37 49 59 65 71 73 86 87 93 97 99 100 101 102 103 105
3 50.50 36.49 3 7 9 24 36 42 48 54 78
4 35.34 20.85 4 10 16 46
5 35.34 20.85 5 11 17 47
6 36.35 21.98 6 12 18 60
7 36.60 21.76 8 14 15 29
8 51.65 38.30 13 41 43 53 56 62 68 91
9 71.26 57.34 19 20 21 22 23 26 31 32 33 34 35
10 83.36 64.59 25 30 40 52 76 77 80 84 90 95
11 51.77 32.41 27 38 44 50 74 88
12 58.79 46.69 28 55 58 64 70 81 85 96
13 75.02 55.52 39 45 51 75 79 82 83 89 94
Figure 20a. Schedule of module assignments of the robot elbow manipulator for the 13x13x1
multiple bus architecture and moderate computation/communication ratio, r = 1/1.
System: 13 x 13 x 1, r = 1/10, T_PAR = 4054
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 10.88 3.05 1 27 28 61 66 67 72 92 98 104
2 18.94 4.26 2 7 24 36 37 42 48 49 54 78 87
3 14.82 4.62 3 9 58 64 70 81 85 96
4 15.86 2.06 4 10 16 46
5 15.86 2.06 5 11 17 47
6 15.86 2.17 6 12 18 60
7 16.29 2.15 8 14 15 29
8 10.16 2.24 13 25 43 57
9 18.93 5.67 19 20 21 22 23 26 31 32 34 35
10 18.93 5.40 30 39 45 51 55 75 79 83 89 94 100
11 9.80 5.76 38 44 50 73 74 82 88 93 99 102 105
12 19.29 7.80 40 41 52 53 76 77 80 84 90 91 95 100
13 15.31 5.25 56 59 62 65 68 71 86 97 103
Figure 20b. Schedule of module assignments of the robot elbow manipulator for the 13x13x1
multiple bus architecture and low computation/communication ratio, r = 1/10.
System: 13 x 13 x 1, r = 10/1, T_PAR = 3304
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 12.97 12.38 1 27 30 61 67
2 42.08 40.16 2 6 12 18 37 49 60 87
3 89.98 88.03 3 9 39 41 45 51 53 75 79 83 89 91 94 100
4 26.67 25.31 4 10 16 46
5 26.67 25.31 5 11 17 47
6 67.46 65.20 7 8 14 15 24 29 36 42 48 54 78
7 25.25 24.76 13 43 57 63 69
8 70.90 69.60 19 20 21 22 23 26 31 32 33 34 35
9 34.31 33.29 25 28 55 66 72 92 98 104
10 71.10 70.70 38 44 50 73 74 82 88 93 99 102 105
11 74.85 74.00 40 52 76 77 80 84 90 95 101
12 65.36 64.37 56 59 62 65 68 71 86 97 103
13 51.73 51.17 58 64 70 81 85 96
Figure 20c. Schedule of module assignments of the robot elbow manipulator for the 13x13x1
multiple bus system architecture and high computation/communication ratio,
r = 10/1.
System: 13 x 13 x 8, r = 1/1, T_PAR = 2125
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 58.87 58.35 1 61 66 67 72 73 87 92 93 98 99 100 101 102 103 104 105
2 44.27 30.12 2 27 37 41 49 53 91
3 81.15 62.35 3 9 39 45 51 75 79 82 83 89 94
4 32.92 21.65 4 10 16 46
5 32.92 21.65 5 11 17 47
6 34.00 22.82 6 12 18 60
7 74.58 55.76 7 8 14 15 24 29 36 42 48 54 78
8 61.06 52.24 13 43 56 59 62 65 68 71 86 97
9 70.36 59.53 19 20 21 22 23 26 31 32 33 34 35
10 27.30 22.12 25 57 63 69
11 57.89 48.47 28 55 58 64 70 81 85 96
12 76.96 64.71 30 40 52 76 77 80 84 90 95
13 44.00 31.29 38 44 50 74 88
Figure 21a. Schedule of module assignments of the robot elbow manipulator for the 13x13x8
multiple bus architecture and moderate computation/communication ratio,
r = 1/1.
System: 13 x 13 x 8, r = 1/10, T_PAR = 3881
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 8.86 2.34 1 13 25 43 57 63 69
2 16.53 4.45 2 7 24 36 37 42 48 49 54 78 87
3 13.22 4.82 3 9 58 64 70 81 85 96
4 13.51 2.15 4 10 16 46
5 13.51 2.15 5 11 17 47
6 13.53 2.27 6 12 18 60
7 13.88 2.25 8 14 15 29
8 16.83 5.93 19 20 21 22 23 26 31 32 33 34 35
9 11.30 6.04 27 28 61 66 67 72 73 92 93 98 99 100 101 102 103 104 105
10 26.23 6.21 30 39 45 51 55 75 79 82 83 89 94
11 15.92 3.12 38 44 50 74 88
12 20.40 8.06 40 41 52 53 76 77 80 84 90 91 95
13 12.65 5.06 56 59 62 65 68 71 86 97
Figure 21b. Schedule of module assignments of the robot elbow manipulator for the 13x13x8
multiple bus architecture and low computation/communication ratio, r = 1/10.
System: 13 x 13 x 8, r = 10/1, T_PAR = 3304
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 12.96 12.38 1 27 30 61 67
2 42.02 40.16 2 6 12 18 37 49 60 87
3 89.83 88.03 3 9 39 41 45 51 53 75 79 83 89 91 94 100
4 26.63 25.31 4 10 16 46
5 26.63 25.31 5 11 17 47
6 67.40 65.20 7 8 14 15 24 29 36 42 48 54 78
7 25.24 24.76 13 43 57 63 69
8 70.87 69.60 19 20 21 22 23 26 31 32 33 34 35
9 34.28 33.29 25 28 55 66 72 92 98 104
10 71.08 70.70 38 44 50 73 74 82 88 93 99 102 105
11 74.82 74.00 40 52 76 77 80 84 90 95 101
12 65.33 64.37 56 59 62 65 68 71 86 97 103
13 51.72 51.17 58 64 70 81 85 96
Figure 21c. Schedule of module assignments of the robot elbow manipulator for the 13x13x8
multiple bus architecture and high computation/communication ratio, r = 10/1.
System: none, r = 1/10, T_PAR = 3880
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 8.78 2.34 1 13 25 43 57 63 69
2 16.38 4.45 2 7 24 36 37 42 48 49 54 78 87
3 13.12 4.82 3 9 58 64 70 81 85 96
4 13.37 2.15 4 10 16 46
5 13.37 2.15 5 11 17 47
6 13.39 2.27 6 12 18 60
7 13.73 2.25 8 14 15 29
8 16.69 5.92 19 20 21 22 23 26 31 32 33 34 35
9 11.24 6.04 27 28 61 66 67 72 73 92 93 98 99 100 101 102 103 104 105
10 25.97 6.20 30 39 45 51 55 75 79 82 83 89 94
11 15.76 3.11 38 44 50 74 88
12 20.24 8.05 40 41 52 53 76 77 80 84 90 91 95
13 12.55 5.06 56 59 62 65 68 71 86 97
Figure 22. Schedule of module assignments of the robot elbow manipulator. The queueing
delay is set to zero and a low computation/communication ratio, r = 1/10, is used.
Schedules: Banyan Switch and Distributed Shared Memory Architecture
The results of applying the allocation algorithm to this application with the Banyan switch
and distributed shared memory architecture are shown in Figure 23a,b,c. Twelve or thirteen
parallel clusters are obtained. Our observation has been that for symmetric application graphs
or almost symmetric graphs, the number of clusters obtained is equal to the assigned degree of
parallelism of the application. When the application graph does not possess any symmetry, like
the one here, the number of clusters may be greater than the graph's assigned degree of
parallelism. This is due to the allocation algorithm's mechanism of module merging, which must
satisfy the parallelism and time frame constraints at the same time. This is a limitation of our
heuristic algorithm, since it does not exhaust all possible schedules.
System: 8 processor, Banyan switch, r = 1/1, T_PAR = 2212
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 49.54 45.20 1 57 61 63 66 67 69 72 92 98 104
2 74.22 72.99 2 37 49 59 65 71 73 82 86 87 93 97 99 102 103 105
3 40.67 26.44 3 6 9 12 18 60
4 31.62 20.79 4 10 16 46
5 31.62 20.79 5 11 17 47
6 71.65 53.56 7 8 14 15 24 29 36 42 48 54 78
7 28.82 22.60 13 30 43 56 62 68
8 67.58 57.18 19 20 21 22 23 26 31 32 33 34 35
9 60.56 52.20 25 39 45 51 75 79 83 89 94 100
10 46.79 32.32 27 38 44 50 74
11 55.60 46.55 28 55 58 64 70 81 85 96
12 87.66 78.64 40 41 52 53 76 77 80 84 90 91 95 101
Figure 23a. Schedule of module assignments of the robot elbow manipulator for the Banyan
switch architecture and moderate computation/communication ratio, r = 1/1.
System: 8 processor, Banyan switch, r = 1/10, T_PAR = 3883
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 8.79 2.34 1 13 25 43 57 63 69
2 16.39 4.45 2 7 24 36 37 42 48 49 54 78 87
3 13.22 4.82 3 9 58 64 70 81 85 96
4 13.38 2.15 4 10 16 46
5 13.38 2.15 5 11 17 47
6 13.40 2.27 6 12 18 60
7 13.75 2.25 8 14 15 29
8 16.71 5.92 19 20 21 22 23 26 31 32 33 34 35
9 11.24 6.04 27 28 61 66 67 72 73 92 93 98 99 100 101 102 103 104 105
10 26.00 6.20 30 39 45 51 55 75 79 82 83 89 94
11 15.77 3.11 38 44 50 74 88
12 20.26 8.05 40 41 52 53 76 77 80 84 90 91 95
13 12.56 5.06 56 59 62 65 68 71 86 97
Figure 23b. Schedule of module assignments of the robot elbow manipulator for the Banyan
switch architecture and low computation/communication ratio, r = 1/10.
System: 13 processor, Banyan switch, r = 10/1, T_PAR = 3306
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 12.95 12.37 1 27 30 61 67
2 42 40.14 2 6 12 18 37 49 60 87
3 89.88 87.99 3 9 39 41 45 51 53 75 79 83 89 91 94 100
4 26.62 25.30 4 10 16 46
5 26.61 25.30 5 11 17 47
6 67.36 65.17 7 8 14 15 24 29 36 42 48 54 78
7 25.23 24.75 13 43 57 63 69
8 70.83 69.57 19 20 21 22 23 26 31 32 33 34 35
9 34.27 33.27 25 28 55 66 72 92 98 104
10 71.05 70.67 38 44 50 73 74 82 88 93 99 102 105
11 74.79 73.96 40 52 76 77 80 84 90 95 101
12 65.30 64.34 56 59 62 65 68 71 86 97 103
13 51.69 51.14 58 64 70 81 85 96
Figure 23c. Schedule of module assignments of the robot elbow manipulator for the Banyan
switch architecture and high computation/communication ratio, r = 10/1.
5.4 Reduction of Parallelism in the Robot Application
Here we investigate the possibility of using the algorithm mapper system's allocation
algorithm to reduce the parallelism of an application. This corresponds to Step 2 of the mapping
problem as described in Section 1. We use the robot application and the Banyan switch
architecture. The number of parallel clusters obtained in Figure 23a,b,c is twelve or thirteen, and
we now assume that only an 8 processor system is available and thus we need to reduce the
number of clusters to 8 from 12 or 13. To reduce parallelism we use the same heuristic
allocation algorithm with a simple modification: we eliminate the parallelism constraint. We use
as the input to the heuristic algorithm for Step 2 the clusters obtained in Step 1. Each of these
clusters is regarded as a single module and the communication between clusters forms the
communication between modules. Thus the graph output from Step 1 is the input to Step 2, except
that the parallelism constraints are removed. The communication cost is found as in Step 1.
Since parallelism between the modules of the new graph is not a constraint, it is always feasible
to cluster the modules into a predetermined number of processors (in this case 8), by adjusting
appropriately the time frame T parameter. The results for the three cases of r used in Step 1 are
presented in Figure 24a,b,c. Note that module 1 of Figure 24 corresponds to cluster 1 of Figure
23 and so on for the rest of the modules. We also note that processor utilizations are much
higher than in Step 1.
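The construction of the Step 2 input from the Step 1 output can be illustrated with a small
Python sketch. It assumes the module-level traffic is available as a dictionary keyed by module
pairs and that Step 1 supplies a module-to-cluster map; both data layouts, the function name
`collapse_clusters`, and the sample numbers are hypothetical and not taken from the algorithm
mapper.

```python
def collapse_clusters(traffic, cluster_of):
    """Treat each Step 1 cluster as one module and sum the traffic between clusters.

    traffic    -- {(u, v): amount} communication from module u to module v
    cluster_of -- {module: cluster id} allocation produced by Step 1
    """
    reduced = {}
    for (u, v), amount in traffic.items():
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            continue                     # traffic inside a cluster is no longer a link
        reduced[(cu, cv)] = reduced.get((cu, cv), 0) + amount
    return reduced


# Four modules placed in two clusters by a hypothetical Step 1 run.
traffic = {(1, 2): 5, (2, 3): 7, (1, 3): 2, (3, 4): 4}
cluster_of = {1: "A", 2: "A", 3: "B", 4: "B"}
print(collapse_clusters(traffic, cluster_of))
```

No precedence information is carried over into the reduced graph, which reflects the removal
of the parallelism constraint in Step 2.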
The time frame T_PAR may increase or decrease when the parallelism is reduced. If
communication dominates the work, then T_PAR should be smaller. One sees this to be the case
when r = 1/10 (compare Figures 23b and 24b). If computation dominates the work, then T_PAR
should be larger because there are fewer processors to do the computation. One sees this to be
the case when r = 10/1 (compare Figures 23c and 24c). In an intermediate case where
communication and computation are of similar amounts, the effect on T_PAR is unclear. One
sees for this particular application that T_PAR is increased slightly (compare Figures 23a and 24a).
System: 8 processor, Banyan switch, r = 1/1, T_PAR = 2528
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 43.34 39.54 1
2 64.94 63.86 2
3 82.49 59.51 3 4 5
4 62.68 46.86 6
5 75.82 65.44 7 9
6 59.12 50.02 8
7 40.93 28.27 10 11
8 62.66 54.76 12
Figure 24a. Parallelism reduction for the schedule shown in Figure 23a for the robot
application and Banyan switch architecture. The number of clusters is reduced from 12
to 8.
System: 8 processor, Banyan switch, r = 1/10, T_PAR = 1278
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 80.13 28.30 1 8
2 76.15 34.92 2 9 11
3 78.42 23.50 3 4
4 66.31 15.86 5 6
5 43.17 20.16 7
6 72.18 26.86 10
7 70.24 27.81 12
8 40.13 19.17 13
Figure 24b. Reduction of parallelism for the schedule shown in Figure 23b for the robot
application and Banyan switch architecture. The number of clusters is reduced from
13 to 8.
System: 8 processor, Banyan switch, r = 10/1, T_PAR = 3706
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 74.25 73.10 1 8
2 83.97 80.94 2 4 5
3 80.18 78.49 3
4 89.75 87.81 6 9
5 88.92 88.06 7 11
6 63.38 63.04 10
7 58.25 57.40 12
8 46.11 45.62 13
Figure 24c. Parallelism reduction for the schedule shown in Figure 23c for the robot
application and Banyan switch architecture. The number of clusters is reduced from 13
to 8.
For comparison purposes, we also reduce the 13 cluster schedule obtained for the 13
processor multiple bus architectures to 8 clusters for an 8 processor system. The results are
shown in Figures 25 and 26. Again, the processor utilizations are increased substantially and the
T_PAR values decreased.
System: 8 x 8 x 1, r = 1/1, T_PAR = 2421
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 82.02 78.66 1 13
2 89.42 72.26 2 8
3 71.30 54.71 3
4 79.46 58.01 4 5 6
5 65.54 48.93 7
6 61.79 52.23 9
7 89.07 76.18 10 12
8 50.84 42.53 11
Figure 25a. Reduction of parallelism for the schedule shown in Figure 20a for the robot
application and the multiple bus architecture. The assigned degree of parallelism is
reduced from 13 to 8.
System: 8 x 8 x 1, r = 1/10, T_PAR = 1207
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 89.60 33.44 1 12
2 56.68 14.31 2
3 84.05 22.44 3 4
4 70.06 14.23 5 6
5 48.02 7.23 7
6 57.31 19.05 8
7 87.81 49.40 9 10 11
8 42.88 16.27 13
Figure 25b. Reduction of parallelism for the schedule shown in Figure 20b for the robot
application and the multiple bus architecture. The assigned degree of parallelism is
reduced from 13 to 8.
System: 8 x 8 x 1, r = 10/1, T_PAR = 3804
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 72.34 71.22 1 8
2 81.81 78.86 2 4 5
3 78.12 76.47 3
4 87.44 85.56 6 9
5 86.64 85.79 7 11
6 61.75 61.42 10
7 56.76 55.92 12
8 44.93 44.45 13
Figure 25c. Reduction of parallelism for the schedule shown in Figure 20c for the robot
application and the multiple bus architecture. The assigned degree of parallelism is
reduced from 13 to 8.
System: 8 x 8 x 8, r = 1/1, T_PAR = 2421
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 82.00 78.66 1 13
2 89.32 72.26 2 8
3 71.20 54.71 3
4 79.34 58.01 4 5 6
5 65.44 48.93 7
6 61.73 52.23 9
7 89.00 76.18 10 12
8 50.79 42.53 11
Figure 26a. Reduction of parallelism for the schedule shown in Figure 21a for the robot
application and the multiple bus architecture. The assigned degree of parallelism is
reduced from 13 to 8.
System: 8 x 8 x 8, r = 1/10, T_PAR = 1134
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 80.11 28.30 1 8
2 56.81 15.23 2 12
3 84.99 24.21 3 7
4 46.46 7.37 4
5 69.94 15.15 5 6
6 84.27 41.92 9 10
7 54.74 10.66 11
8 56.74 22.16 13
Figure 26b. Reduction of parallelism for the schedule shown in Figure 21b for the robot
application and the multiple bus architecture. The assigned degree of parallelism is
reduced from 13 to 8.
System: 8 x 8 x 8, r = 10/1, T_PAR = 3804
Processor id   Total utilization (%)   Utilization (%)   Modules assigned
1 72.34 71.22 1 8
2 81.81 78.86 2 4 5
3 78.12 76.47 3
4 87.44 85.56 6 9
5 86.64 85.79 7 11
6 61.75 61.42 10
7 56.76 55.92 12
8 44.93 44.45 13
Figure 26c. Reduction of parallelism for the schedule shown in Figure 21c for the robot
application and the multiple bus architecture. The assigned degree of parallelism is
reduced from 13 to 8.
5.5 Performance Evaluation of Applications/Architecture Pairs
We now compare the performance of the Banyan switch architecture and multiple bus
architecture for the robot elbow manipulator application. Three versions of the application are
considered with computation/communication ratio values of r = 1/1, 1/10 and 10/1. Both
architectures have distributed shared memory and 8 processors; the multiple bus architectures
also have 1 or 8 busses. The primary comparison is on the basis of T_PAR, the shortest parallel
execution time. We also show (1) the average total processor utilization u_p, which includes
both processor time and queueing waits for communication, (2) the speed up, and (3) the
efficiency [SIEG 82] = (speed up)/k.
Figure 27 shows that the smallest T_PAR value and the highest processor utilizations were
obtained for r = 1/10, as one expects. The 8 x 8 x 8 multiple bus architecture has a higher
bandwidth than the Banyan network, and the schedule for an 8 x 8 x 8 multiple bus architecture
has the best performance. This demonstrates the usefulness of the mapping methodology in
matching applications to architectures. Speed ups and efficiency factors for the 8 processor
systems are presented in Figure 28.




r = 1/1:   Banyan T_PAR = 2529;    multiple bus (1 bus) u_p = 73.68
r = 1/10:  Banyan T_PAR = 1278;    multiple bus (1 bus) T_PAR = 1207
r = 10/1:  Banyan T_PAR = 3706.3;  multiple bus T_PAR = 3804
Figure 27. Performance comparison of multibus and Banyan architectures for the robot elbow
manipulator application. Three values of the computation/communication ratio r
are used. All architectures have 8 processors.
Ratio r      Speed up                          Efficiency
             Banyan    Multiple bus            Banyan    Multiple bus
r = 1/1      4.630     (1 bus) 4.836           .578      (1 bus) .604
                       (8 busses) 4.836                  (8 busses) .604
r = 1/10     1.665     (1 bus) 1.763           .208      (1 bus) .220
                       (8 busses) 1.877                  (8 busses) .234
r = 10/1     5.744     (1 bus) 5.596           .718      (1 bus) .699
                       (8 busses) 5.596                  (8 busses) .699
Figure 28. Comparison of speed ups and efficiency for the 8 processor Banyan switch and
multiple bus architectures.
From data in [KASH 85] the elapsed (sequential) time in a uniprocessor system, T_seq, can
be calculated. In a uniprocessor system the communication cost is zero, so the sequential time
depends heavily on the ratio r of computation to communication. For each value of r that we
have considered, we get T_seq as shown in Figure 29.
r = 1/1      T_seq = 11709
r = 1/10     T_seq = 2128
r = 10/1     T_seq = 21281
Figure 29. Sequential elapsed time of the application for values of r.
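Because t_e is held constant while r = t_x/t_y varies, the computation share of every module
is r/(1 + r) of its execution time, so one would expect T_seq to scale by the same factor. The
short Python check below tests that reading against Figure 29. The total execution time
`total_te` is not given in the text; it is inferred here from the r = 10/1 entry, which is an
assumption of this sketch, and the function name is illustrative.

```python
def sequential_time(total_te, r):
    """Computation part of the total execution time for ratio r = t_x / t_y."""
    return total_te * r / (1 + r)


total_te = 21281 * 11 / 10   # inferred from T_seq at r = 10/1 (assumption, not from the text)

for label, r, reported in [("1/1", 1.0, 11709), ("1/10", 0.1, 2128), ("10/1", 10.0, 21281)]:
    predicted = sequential_time(total_te, r)
    print(label, round(predicted), reported)
```

Under this assumption the predicted values agree with Figure 29 to within about five time
units.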
In Figures 30 and 31 speed ups and efficiencies are given for the multiple bus architectures
under the assumption that the number of processors equals the number of parallel clusters, i.e.,
after Step 1 of the algorithm is performed. This data is derived directly from the schedules in
Section 5.3. Note that for r = 1/10 and the one or eight bus system, the speed up is actually
less than 1, which indicates that the parallel system does worse than a single processor system.
Thus, a partition where the communication is ten times the computation is the worst partition
of the three we have examined. This shows how our methodology helps us evaluate the various
partitions of an application.
[Table with columns "Speed up" and "Efficiency"; entries not legible in the source.]
Figure 30. Speed up and efficiency data for the schedules of Figure 20a,b,c for the 13 x 13 x 1

[Table entries not legible in the source.]
Figure 31. Speed up and efficiency data for the schedules of Figure 21a,b,c for the 13 x 13 x 8
multiple bus and distributed shared memory architecture.
We illustrate the performance analysis further for those schedules (see Figures 25 and 26)
where the assigned degree of parallelism has been reduced. The speed up S and average number
of active processors AP are shown in Figures 32 and 33. Note that the bounds discussed in
Sections 3.3 and 4.1.6 hold. In the cases where communication is limited, i.e., r = 10/1 and
r = 1/1, its cost is also negligible, thus S = AP. When r = 1/10, the communication cost
in terms of queueing delay is no longer negligible, and it affects the TPAR results. Thus
S = Tseq/TPAR is lower than in the previous two cases. By definition, communication delay is
not included in AP, thus AP ≥ S as stated previously and observed in both Figures 32 and 33.
We may also calculate the speed up using TREAL. For example, for the schedule of Figure 20c we
obtain AP = 6.4428, Tseq = 21281, TPAR = 3304, TREAL = 6113 and the following relationships:
1 ≤ S = Tseq/TREAL = 3.481 ≤ Tseq/TPAR = 6.440 ≤ AP = 6.4428
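The chain of bounds can be checked numerically from the four quantities quoted for the schedule of Figure 20c. This is only a sketch; the function name and its return convention are assumptions made for illustration.

```python
def check_bounds(t_seq, t_par, t_real, ap):
    """Check 1 <= Tseq/TREAL <= Tseq/TPAR <= AP and return both ratios."""
    s_real = t_seq / t_real   # speed up measured against TREAL
    s_par = t_seq / t_par     # speed up measured against TPAR
    holds = 1 <= s_real <= s_par <= ap
    return holds, s_real, s_par


# Values quoted in the text for the schedule of Figure 20c.
holds, s_real, s_par = check_bounds(21281, 3304, 6113, 6.4428)
print(holds, round(s_real, 3), round(s_par, 3))
```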
Ratio r      8 x 8 x 1 system         8 x 8 x 8 system
             AP        Tseq/TPAR      AP        Tseq/TPAR
r = 10/1     5.5969    5.5968         5.5969    5.596
r = 1/1      4.8351    4.835          4.8351    4.836
r = 1/10     2.4144    1.763          2.3133    1.877
Figure 32. Speed up and active processors for the schedules of Figures 25 and 26 for the
multiple bus and distributed shared memory architecture. The assigned degree of
parallelism has been reduced from 13 to 8 in these schedules.
Ratio r      AP        Tseq/TPAR
r = 10/1     5.7446    5.744
r = 1/1      4.6310    4.630
r = 1/10     1.9658    1.665
Figure 33. Speed up and active processors for the schedules of Figure 24 for Banyan switch
and distributed shared memory architecture. The assigned degree of parallelism
has been reduced from 13 to 8 in the schedules.
6. SUMMARY
We have formulated the mapping problem and described our algorithm mapper system and
methodology. Four architectures have been considered and three realistic applications have
been evaluated for them. These applications were a Cholesky decomposition algorithm, a PDE
collocation solution and a robot arm manipulation computation. The four architectures con-
sidered are: (1) single bus and shared common memory, (2) single bus and distributed shared
memory, (3) multiple bus and distributed shared memory, and (4) Banyan switch and distributed
shared memory. The allocation methodology has made use of the parallel architecture perfor-
mance models to assign a cost to communication between parallel processors. This cost is the
queueing delay in communicating messages between modules assigned to different processors.
The allocation algorithm used is based on a merging heuristic that minimizes communication
between processors in assigning parallel modules to different processors. The approach has also
been used to evaluate various partitions of a single application.
We have evaluated the performance of application/architecture pairs. For example, we see
that a high bandwidth parallel system will not always perform better for a particular application
than a lower bandwidth system. Different partitions of the same application may "fit" better on
different architectures. Our methodology provides the means for this evaluation. We have
experimented extensively with the robot arm application with 105 modules. We see a
dramatic improvement in execution speed up using even a suboptimal allocation method such as
ours. In some small systems and applications, we are able to demonstrate optimality of results,
but we do not expect this in general. The parallel architectures we have studied have shared











Abraham, S.G. and Davidson, E.S., "Task Assignment Using Network Flow
Methods for Minimizing Communication in n-Processor Systems", Technical
Report, CSRD Rpt. No. 598, Center of Supercomputing Research and Develop-
ment, National Center of Supercomputing Applications, University of Illinois at
Urbana-Champaign, 1986.
Allen, A.O., Probability, Statistics and Queueing Theory, Academic Press, 1978.
Batcher, K.E., "The Flip Network in STARAN", Int'l Conf. on Parallel Pro-
cessing, 1976, pp. 65-71.
Batra, D.P., Architectural Implications of Problem Partitioning for Distributed
Processor Systems, Ph.D. Thesis, Computer Science Department, Northwestern
University, Evanston, Illinois, 1978.
Berman, F. and Snyder, L., "On Mapping Parallel Algorithms in Parallel Archi-
tectures", Int'l Conf. Parallel Processing, 1984, pp. 307-309.
Berman, F., Goodrich, M., Koelbel, C., Robinson, III, W.J. and Showell, K.,
"Prep-P: A Mapping Preprocessor for Chip Computers", Int'l Conf. Parallel
Processing, 1985, pp. 731-733.
Berman, F. and Haden, P., "A Comparative Study of Mapping Algorithms for
an Automated Parallel Programming Environment", Computer Science Techni-
cal Report Number CS-088, University of California at San Diego, 1987.
Bokhari, Shahid H., "On the Mapping Problem", IEEE Trans. Computers,
Vol. C-30, 1981, pp. 207-214.
[BUKL 79] Bukles, B.P. and Hardin, D.M., "Partitioning and Allocation of Logical
Resources in a Distributed Computing Environment", Tutorial: Distributed Sys-
tem Design, IEEE Computer Society, 1979.
[CHU 80] Chu, W.W., Holloway, L.J., Lan, M.T. and Efe, K., "Task Allocation in Distri-
buted Data Processing", Computer, 1980, pp. 57-69.
[CHU 87] Chu, W.W. and Lan, Lance M.-T., "Task Allocation and Precedence Relations for
Distributed Real-Time Systems", IEEE Trans. Computer Engineering, Vol. C,
1987, pp. 667-679.
[EFE 82] Efe, K., "Heuristic Models of Task Assignment Scheduling in Distributed Sys-
tems", Computer, Vol. 15, 1982, pp. 50-56.
[FUJI 85] Fujimoto, R.M., "The SIMON Simulation and Development System", Summer
Computer Simulation Conference, 1985.
[GANN 86] Gannon, D. and Van Rosendale, J., "On the Communication Complexity of
Parallel Numerical Algorithms", IEEE Trans. Computers, (to appear).
[GILB 87] Gilbert, J.R. and Zmijewski, E., "A Parallel Graph Partitioning Algorithm for a
Message-Passing Multiprocessor", Technical Report, TR 87-803, Department of
Computer Science, Cornell University, Ithaca, N.Y., 1987.
[GYLE 76] Gylys, V.B. and Edwards, J.A., "Optimal Partitioning of Workload for Distri-
buted Systems", Proceedings Compcon, 1976, pp. 353-357.
[GOTT 83] Gottlieb, A., et al., "The NYU Ultracomputer-Designing an MIMD Shared
Memory Parallel Computer", IEEE Trans. Computers, C-33, 1984, pp.
1180-1194.
[HAES 80] Haessig, K. and Jenny, C.J., "Partitioning and Allocation Computational Objects
in Distributed Computing Systems", Proc. of IFIP Congress, 1980, pp.
593-598.
[HOUS 81] Houstis, C.E., Houstis, E.N. and Rice, J.R., "Partitioning and Allocation of PDE
Computation to Distributed Systems", in: B. Engquist and T. Smedsaas, eds.,
PDE Software: Modules, Interfaces and Systems, North-Holland, Amsterdam,
1981, pp. 67-85.
[HOUS 82] Houstis, C.E., "Software Partitioning in a Distributed Environment", Technical
Report, College of Engineering, University of South Carolina, 1982.
[HOUS 84] Houstis, E.N., Rice, J.R. and Vavalis, E.A., "Spline Collocation Methods for
Elliptic Partial Differential Equations", in: R. Vichnevetsky and R.S. Steple-




IMACS, Rutgers University, 1984, pp. 191-194.
[HOUS 87a] Houstis, C.E., Houstis, E.N. and Rice, J.R., "Partitioning PDE Computations:
Methods and Performance Evaluation", J. Parallel Computing, Vol. 5, 1987,
pp. 141-163.
[HOUS 87b] Houstis, C.E., "Allocation of Real-Time Applications to Distributed Systems",
Int'l Conf. Parallel Processing, 1987, pp. 863-866.
[HOUS 87c] Houstis, C.E., "Distributed Processing Performance Evaluation", Third Inter-
national Conference on Data Communication Systems and Their Performance,
L.F.M. de Moraes, E. de Souza e Silva and L.F.G. Soares, eds., Rio de Janeiro,
Brazil, 1987, pp. 391-406.
[HOUS 87d] Houstis, C.E. and Aboelaze, M., "The Mapping of Applications to Multiple
Bus and Banyan Interconnected Multiprocessor Systems: A Case Study",
Supercomputing, 1987, pp. 514-543.
[HOUS 88] Houstis, C.E., "Allocation of Real Time Applications to Distributed Systems",
Under review in the IEEE Trans. on Software Engineering.
[HWAN 84] Hwang, K. and Briggs, F., Computer Architecture and Parallel Processing,
McGraw-Hill, New York, 1984.
[GYLY 76] Gylys, V.B. and Edwards, J.A., "Optimal Partitioning of Workload for Distri-
buted Systems", Proceedings Compcon, 1976, pp. 353-357.
[JENN 77] Jenny, C.J., "Process Partitioning on Distributed Systems", Digest of Papers,
NTC 1977, pp. 31:1-31:10.
Jenny, C.J., "On the Placement of Files and Processes in a System With Distri-
buted Intelligence", Proceedings of International Zurich Seminar on Digital
Communications, 1982, pp. Btl-S.
[KASH 85] Kashara, H. and Narita, S., "Parallel Processing of Robot-Arm Control Compu-
tation on a Multiprocessor System", IEEE J. Robotics Automation, Vol. RA-1,
1985, pp. 104-113.
[KLEI 85] Kleinrock, L., "Distributed Systems", Comm. ACM, Vol. 28, 1985, pp.
1200-1213.
[KRUS 83] Kruskal, C. and Snir, M., "The Performance of Multistage Interconnection Nets
for Multiprocessing", IEEE Trans. Computers, Vol. C-32, 1983, pp. 1091-1098.
[LAWR 75] Lawrie, D., "Access and Alignment of Data in an Array Processor", IEEE







Li, H., "The Impact of Process Intercommunication on the Global Bus Archi-
tecture", IEEE Proc. Real-Time Systems, 1981, pp. 29-31.
Lowe, T.C., "Analysis of an Information System Model With Transfer Penal-
ties", IEEE Trans. Computers, Vol. C-22, 1973, pp. 269-280.
Ma, P., Lee, E.Y.S. and Tsuchiya, M., "On the Design of a Task Allocation
Scheme for Time-Critical Applications", IEEE Proc. Real-Time Systems, 1981,
pp. 121-126.
[MARS 83] Marsan, M.A., Balbo, G. and Conte, G., "Comparative Performance Analysis
of Single Bus Multiprocessor Architectures", IEEE Trans. Computers, Vol. C-31,
1983, pp. 1179-1191.
Marsan, M.A. and Gerla, M., "Markov Models for Multiple Bus Multiprocessor
Systems", IEEE Trans. Computers, Vol. C-32, 1983, pp. 239-248.
Miranker, W.L., "Parallel Methods for Approximating the Root of a Function",
IBM J. Res. Develop., Vol. 13, 1969, pp. 297-301.
[NORT 85] Norton, A. and Pfister, G.F., "A Methodology for Predicting Multiprocessor
Performance", Proc. Int'l Conf. Parallel Processing, 1985, pp. 772-778.
[OLEA 85] O'Leary, D.P. and Stewart, G.W., "Data-Flow Algorithms for Parallel Matrix
Computations", Comm. ACM, 28, 1985, pp. 840-853.
[OLEA 87] O'Leary, D.P. and Stewart, G.W., "Assignment and Scheduling in Parallel
Matrix Factorization", Linear Algebra Appl., 77, 1986, pp. 275-300.
[PATE 79] Patel, J.H., "Processors-Memory Interconnections for Multiprocessors", Proc.
6th Ann. Symp. Computer Arch., 1979, pp. 168-177.
[RICE 71] Rice, J.R., "Matrix Representations of Nonlinear Equation Iterations - Applica-
tion to Parallel Computation", Math. Comp., Vol. 25, 1971, pp. 639-647.
[SARK 86] Sarkar, V. and Hennessy, J., "Compile-Time Partitioning and Scheduling of
Parallel Programs", Proceedings of the SIGPLAN 1986 Symposium on Com-
piler Construction, ACM, 1986, pp. 17-36.
[SIEG 78] Siegel, H.J. and Smith, H.D., "Study of Multistage SIMD Interconnection Net-
works", 5th Annual Symposium on Computer Architecture, 1978, pp. 223-229.
[SIEG 78] Siegel, H.J., McMillen, R.J. and Mueller, P.T., Jr., "A Survey of Interconnec-
tion Methods for Reconfigurable Parallel Processing Systems", Int'l Conf. on
Parallel Processing, 1978, pp. 9-17.
[SIEG 82] Siegel, L., Siegel, H.J. and Swain, P.H., "Performance Measures for Evaluating







SE-8, 1984, pp. 319-331.
Siegel, H., Interconnection Networks for Large-Scale Parallel Processing,
Heath and Company, 1985.
Stevens, R.M., A Pascal Program Which Partitions Programs for a Multipro-
cessor System, Masters Thesis, Electrical and Computer Engineering, University
of South Carolina, 1982.
Stone, H.S., "Multiprocessor Scheduling With the Aid of Network Flow Algo-
rithms", IEEE Trans. Software Engineering, Vol. SE-3, 1977, pp. 85-93.
Stone, H.S. and Bokhari, S.H., "Control of Distributed Processes", Computer,
Vol. 11, 1978, pp. 971-976.
William, E.A., "Assigning Processes to Processors in Distributed Systems",
IEEE Conf. Parallel Processing, 1983, pp. 404-406.
8. APPENDICES
APPENDIX I: Description of the Allocation Program
1. Main Program -- driver routine
It starts by calling the 'entry' procedure, which reads the data for the problem, and then
calls 'calcmats', which initializes working matrices. Then it calls 'fpart', which implements one
iteration of the algorithm with the initial time frame (T) that was given as input to the program.
Depending on the value of 'mode' (input), it will call 'iterative', which calls 'fpart' repeatedly
for various values of T by decrementing or increasing the time until it finds an interval [ta,tb]
such that |tb - ta| ≤ tolerance (see input) and the minimum number of processors has been
obtained. Finally it will create the file "graph.g" for graphics software.
Then there is a parallelism reduction (optional, by setting 'mode' in input) for the case of
Banyan networks and multiple bus architectures. In this case we can specify the number of
processors available in the network by providing this information (see 'kvalue' = # of pro-
cessors available). For example, if the algorithm gives, say, 13 clusters but we can have only 8,
then by setting kvalue = 8, the parallelism reduction will reduce the number of clusters to 8 by
deleting all the conflicts and repeatedly calling 'fpart' until we get 8 clusters with the minimum
time frame T.
2. fpart
This subroutine implements the allocation algorithm. After initializing the working
matrices needed for the current iteration, it finds any datablocks that can initially be made local
to a processor (calls to the 'findlocal' and 'makelocal' subroutines).
Then it finds processor pairs that have the largest interprocessor traffic and creates a list of
these as candidates for merger ('findcands'). It selects the candidate pairs that cause the largest
amount of data to be made local ('bestcands'), and then selects the candidate pair that yields the
lowest processor loading for merger ('selcand').
If the processor loading for this pair is not too high, it merges those two; otherwise it puts
this pair "on hold" so that it will not select it again ('putonhold'). The above is repeated until
we exhaust all the possible candidates for merging.
Finally it calls 'calcutil', which calculates processor utilizations, and then 'printsolution',
which prints out the solution found (see output).
3. calcmats
It calculates and initializes working clusters and matrices.
4. reconstruct
It is called from procedure 'optimize' before we start the parallelism reduction phase. By
extracting the information from the current matrices that contain the data from the previous
solution, we create new matrices for the new graph that will be input to the algorithm. The new
graph is constructed in the following way: For each processor we create a node. For each node
we get the processing and communication with the other nodes. In the new graph, we delete the
conflicts so we can have a new merge.
5. allocate
The purpose of this procedure is to efficiently pack memory blocks into memory modules.
Memblocks is the set of blocks that must be put into memory modules, mmc is the memory
module capacity, and mat contains the sizes of the memory blocks. The procedure first calculates
a first guess of how many memory modules it will take to contain the memory blocks.
This number is tried and if there is no 'success', then the number is incremented until the
memory blocks fit into the memory modules allocated.
The procedures which sort and pack the memory blocks require a different data structure
than this procedure. Therefore the data must be converted before these routines can be used.
On the first successful allocation the results are printed.
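The guess-and-increment loop of 'allocate' can be sketched as follows. Two assumptions are made here and are not taken from the program: the first guess is the total block size divided by the module capacity, rounded up, and the packing test is a simple first-fit-decreasing pass that stands in for the recursive 'pack' procedure described in item 13.

```python
import math


def first_fit_decreasing(sizes, num_modules, capacity):
    """Stand-in packing test (NOT the recursive 'pack' procedure).

    Returns a list of block-index lists, one per module, or None on failure.
    """
    space = [capacity] * num_modules
    assignment = [[] for _ in range(num_modules)]
    order = sorted(range(len(sizes)), key=lambda b: sizes[b], reverse=True)
    for block in order:
        for module in range(num_modules):
            if sizes[block] <= space[module]:
                space[module] -= sizes[block]
                assignment[module].append(block)
                break
        else:
            return None  # this block fits in no module
    return assignment


def allocate(sizes, mmc):
    """Increase the number of modules until all memory blocks fit."""
    if any(size > mmc for size in sizes):
        raise ValueError("a memory block exceeds the module capacity")
    # Assumed first guess: total size over capacity, rounded up.
    guess = max(1, math.ceil(sum(sizes) / mmc))
    while True:
        packing = first_fit_decreasing(sizes, guess, mmc)
        if packing is not None:
            return guess, packing
        guess += 1
```

For example, three blocks of size 6 with a module capacity of 10 start from a guess of two modules, which fails, and succeed with three.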
6. bestcands
This procedure is passed a list of candidates for merger by the main program. The func-
tion is to prune the list by allowing only the candidates that will provide the greatest amount of
datablocks to be made local to remain in the list.
The procedure creates a test matrix "testmat" for each of the candidate pairs which
represents what the datablock reference matrix would be if that candidate pair were merged.
Then the test matrix is passed to the 'findlocal' procedure which returns the total "amount" of
data that could be made local. This number is compared to a "best" thus far and the candidate
pair is left in the list only if it is greater than or equal in value.
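The pruning step can be sketched as below. The representation is an assumption made for illustration: the datablock reference matrix is a list of Boolean rows (one per process, one column per datablock), merging a pair is modeled by OR-ing the two rows, and the sketch keeps only the candidates that tie for the greatest amount.

```python
def find_local_amount(omega, sizes):
    """Total size of the datablocks referenced by exactly one process."""
    amount = 0
    for block, size in enumerate(sizes):
        if sum(1 for row in omega if row[block]) == 1:
            amount += size
    return amount


def merged_matrix(omega, i, j):
    """Test matrix: what omega would look like if processes i and j merged."""
    test = [row[:] for row in omega]
    test[i] = [a or b for a, b in zip(omega[i], omega[j])]
    test[j] = [False] * len(omega[j])
    return test


def best_cands(candidates, omega, sizes):
    """Keep only the candidate pairs that make the most data local."""
    best = -1
    kept = []
    for i, j in candidates:
        amount = find_local_amount(merged_matrix(omega, i, j), sizes)
        if amount > best:
            best, kept = amount, [(i, j)]
        elif amount == best:
            kept.append((i, j))
    return kept, best
```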
7. calcutil
The purpose of this procedure is to calculate the processor utilization of all the processors
left after partitioning. This is accomplished by calculating tx, which is the total number of
ITU's of information transferred through the interconnection scheme. From this the
'delayperitu' can be found using the delay graph data. Data for four different delay functions are stored.
The processor loading can then be calculated. The processor utilization is then simply
100*loading/info.tau (where info.tau is the allowed real time for processing, i.e., T).
If, in the course of calculating utilizations, a single process results in greater than 100 per-
cent utilization, an error message is printed telling what process is the problem. This can hap-
pen if the allowed real time for processing is lower than what is required for that single process
to run in a processor alone.





8. graph
This function takes a list of coordinate pairs which represent points on a graph. Then, given an
'abscissa' value, it returns a calculated ordinate value based on a linear interpolation between the
appropriate coordinate pairs.
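A minimal sketch of such a piecewise linear interpolation follows. It assumes the coordinate pairs are sorted by strictly increasing abscissa, and it clamps to the end values outside the tabulated range; the clamping is an assumption, since the behavior outside the range is not described here. The sample points are invented for illustration, apart from the first pair (0.0, 1.0) that the INPUT section requires.

```python
def graph(points, abscissa):
    """Linearly interpolate the ordinate at 'abscissa'.

    points: list of (x, y) pairs sorted by strictly increasing x.
    """
    if not points:
        raise ValueError("at least one coordinate pair is required")
    if abscissa <= points[0][0]:
        return points[0][1]          # clamp below the first point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= abscissa <= x1:
            return y0 + (y1 - y0) * (abscissa - x0) / (x1 - x0)
    return points[-1][1]             # clamp above the last point


# Hypothetical delay graph: delay per ITU as a function of traffic.
delay_points = [(0.0, 1.0), (100.0, 2.0), (200.0, 6.0)]
print(graph(delay_points, 150.0))
```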
9. findcands
This procedure searches through a matrix b and creates a list of process pair candidates for
merger. Numcands is the number of candidates found and "candlist" is the list of candidate
pairs.
The values in the b matrix represent communication between processes. The processes
which have the 'largest' amount of interaction are the ones which are chosen for merger. The
matrix is searched and a list of pairs with the largest interaction is built. A process pair is not
eligible for merger if it is "on hold", if it is in the "conflict" matrix, or if one of the
processes is no longer "active". (Note: A pair being "on hold" means that this pair has already
been examined and rejected.) A pair (i,j) is in conflict if conflict(i,j) = true, and a process is
no longer 'active' if it has already been merged.
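The search can be sketched as follows. The data layout is assumed for illustration: b, onhold and conflict are square lists of lists indexed from 0, active is a list of Booleans, and the interaction of a pair is taken to be b[i][j] + b[j][i].

```python
def find_cands(b, active, onhold, conflict):
    """Return the eligible process pairs with the largest interaction."""
    n = len(b)
    largest = 0
    candlist = []
    for i in range(n):
        for j in range(i + 1, n):
            if not (active[i] and active[j]):
                continue                      # one process already merged
            if onhold[i][j] or conflict[i][j]:
                continue                      # pair is not eligible
            traffic = b[i][j] + b[j][i]       # assumed symmetric measure
            if traffic > largest:
                largest, candlist = traffic, [(i, j)]
            elif traffic == largest and traffic > 0:
                candlist.append((i, j))
    return candlist
```

The number of candidates ('numcands' in the description) is simply the length of the returned list.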
10. findlocal
This procedure scans the matrix "omega" and composes a list of datablocks that are refer-
enced by a single process. The elements of this list are candidates to be made local to that par-
ticular process. "Amount" refers to the total amount of data made local if the entire list were
made local.
11. makelocal
This procedure takes the datablocks in "list" and makes them local by removing them
from the "omega" matrix and placing them in the appropriate area of the datablock partition r.
The datablocks are also removed from the common datablock set.
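Taken together, 'findlocal' and 'makelocal' might look like the sketch below. The containers are assumptions: omega is a list of Boolean rows (process by datablock), the datablock partition is a list of sets with one set per process, and the common datablock set is a Python set.

```python
def find_local(omega, sizes):
    """List (process, datablock) pairs for blocks used by a single process."""
    local = []
    amount = 0
    for block in range(len(sizes)):
        users = [p for p, row in enumerate(omega) if row[block]]
        if len(users) == 1:
            local.append((users[0], block))
            amount += sizes[block]
    return local, amount


def make_local(omega, local, partition, common):
    """Move each listed datablock out of omega into its owner's partition."""
    for process, block in local:
        omega[process][block] = False
        partition[process].add(block)
        common.discard(block)


omega = [[True, True, False], [False, True, True]]
partition = [set(), set()]
common = {0, 1, 2}
local, amount = find_local(omega, [10, 20, 30])
make_local(omega, local, partition, common)
```

After this call only datablock 1, which both processes reference, remains in the common set.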
12. merge
Performs all the computations necessary to merge a process pair.
13. pack
This is a recursive procedure that attempts to pack items into boxes. It keeps rearranging
the items until all fit into the given boxes or until it determines that there is no possible way to
do it. If impossible, then the Boolean variable 'success' is set to 'false'.
There are three parts to an itemlist: 'items.id', which is the vector which contains the
integers that identify the items; 'items.size', which is the vector that specifies the sizes of the
items; and 'items.num', which is the number of items in the list. 'Numboxes' specifies the
number of boxes that are going to be used and 'space' is a vector that specifies the amount of
space in each box. 'Packlist' is an array of sets which represents the partition of items that have
been packed. The procedure operates as follows: upon being called, the procedure tries to put
the item on the bottom of the itemlist in a box; if it will not fit, it tries the next box and so on.
If it cannot locate a box with enough space to place the item, it will return success=false to the
calling program. However, if a space is found, then the item is removed from the list and put
into the packlist, the amount of space required for the item is subtracted from the space vector,
and the number of items is decreased by one. At this point, the pack procedure is called again
to pack the remaining items. If it returns success=false, then the item is taken back from the
packlist, the amount of space subtracted from the space vector is replaced, and the number of
items is incremented by one, thus restoring the original status. The procedure then tries the
next box for this item. If no other box can be found to hold it, success=false is returned.
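The recursion described above is a backtracking search. A minimal sketch follows, assuming the itemlist is held as two parallel Python lists whose last element is the "bottom" item, and that 'packlist' is a list of sets with one set per box; 'items.num' is implicit in the list length.

```python
def pack(item_ids, item_sizes, space, packlist):
    """Try to place every remaining item in some box; backtrack on failure.

    On success the item lists are consumed; on failure they are restored.
    """
    if not item_ids:
        return True                       # nothing left to pack
    item = item_ids.pop()                 # bottom of the itemlist
    size = item_sizes.pop()
    for box in range(len(space)):
        if size <= space[box]:
            space[box] -= size
            packlist[box].add(item)
            if pack(item_ids, item_sizes, space, packlist):
                return True
            # Undo the placement and try the next box.
            packlist[box].discard(item)
            space[box] += size
    item_ids.append(item)                 # restore the original status
    item_sizes.append(size)
    return False


ids, sizes = [1, 2, 3, 4, 5, 6], [2, 2, 3, 4, 4, 5]   # sorted smallest first
space, packlist = [10, 10], [set(), set()]
success = pack(ids, sizes, space, packlist)
```

In this example a greedy first-fit pass would leave the last item of size 2 unplaced, so the search has to undo earlier placements before it succeeds.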
14. putonhold
The purpose of this procedure is to mark certain candidate pairs so that they cannot be
selected for merger. This is done simply by setting to "true" the position in the Boolean
'onhold' matrix that corresponds to the given processor pair for all pairs in the candidate list
(candlist).
A list of candidate pairs is put on hold when it is discovered that none of the candidates
would yield a processor loading that is low enough for a merger. Putting on hold prevents
reselection of the same candidates. (Note: Upon each actual merger, all candidates on hold are
released because a merger will cause the processor loading calculations to be lessened due to
less interconnection usage. Thus a candidate pair that failed the test for merger in one instance,
might pass after other processes have been merged. This is handled as part of the merge pro-
cedure.)
15. selcand
The purpose of this procedure is to make a final selection from the list of candidates
passed to it by selecting the candidate pair that will yield the lowest processor loading if
merged. This is accomplished by first calculating tx, which is the total number of ITU's of
information transferred through the interconnection scheme. With this number, the 'delayperitu'
is found by utilizing the graph function and the delay graph data. The 'delayperitu' is necessary
in the formula to calculate loading. The predicted processor loading tij (in TU's) of each pro-
cessor pair is then calculated. After calculation the value is stored in the vector 'tijvec'. As the
'tij' values are calculated, they are compared to a 'lowest' value; when a new value is lower,
the value of 'lowest' changes
and that pair is remembered as having the lowest value by storing a pointer in the variable
'bestcand'.
It is this variable 'bestcand' which represents its final selection when it returns to the cal-
ling program. The Boolean variable 'processorloadingisnottoohigh' is set as an indicator that
the lowest predicted processor loading is an acceptable value. The 'lowest' is compared to the
value of 'eta*tau', that is, the maximum allowed processor utilization times the allowed real
time for processing. If the 'lowest' value is within this range, then the merge would be
acceptable and 'processorloadingisnottoohigh' would be set to 'true'.
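The selection and the acceptance test can be sketched as below. The loading formula itself is not given in this appendix, so the sketch assumes the predicted loadings have already been computed and stored in 'tijvec'; the function signature and the returned tuple are illustrative only.

```python
def sel_cand(candlist, tijvec, eta, tau):
    """Pick the candidate pair with the lowest predicted loading.

    Returns (pair, lowest, processorloadingisnottoohigh).
    """
    lowest = float("inf")
    bestcand = None
    for index, tij in enumerate(tijvec):
        if tij < lowest:
            lowest = tij
            bestcand = index          # remember the position of the best pair
    not_too_high = bestcand is not None and lowest <= eta * tau
    pair = candlist[bestcand] if bestcand is not None else None
    return pair, lowest, not_too_high
```

With eta = 0.9 and tau = 100, a lowest predicted loading of 80 TU's is accepted, because it does not exceed eta*tau = 90.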
16. sort
The purpose of this procedure is to bubblesort items before sending them to the pack pro-
cedure. The pack procedure is much more efficient timewise if it selects the
largest items to pack first. The vector 'items.id' contains the numbers that identify the
memory modules that are to be sorted, 'items.size' is a vector that contains the corresponding
sizes of the given memory modules, and 'items.num' is the number of items in the list. Note
that this procedure orders the items from smallest to largest; this is the way it should work, as
the pack procedure selects items from the bottom of the list first.
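A bubble sort over the two parallel vectors could be sketched as follows; the use of Python lists for 'items.id' and 'items.size' and the early exit when a pass makes no swap are choices made for this illustration.

```python
def bubble_sort_items(ids, sizes):
    """Sort both vectors in place, smallest size first (largest at the bottom)."""
    n = len(sizes)
    for end in range(n - 1, 0, -1):
        swapped = False
        for k in range(end):
            if sizes[k] > sizes[k + 1]:
                # Swap the sizes and keep the identifiers aligned with them.
                sizes[k], sizes[k + 1] = sizes[k + 1], sizes[k]
                ids[k], ids[k + 1] = ids[k + 1], ids[k]
                swapped = True
        if not swapped:
            break                     # already in order


ids, sizes = [1, 2, 3, 4], [5, 2, 6, 3]
bubble_sort_items(ids, sizes)
```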
17. math library routines
This is the matmath (matrix mathematics) module. It is a collection of matrix operation











sumelements (this is a function)
18. createdatagraph
Find the (x,y) coordinates for each process for input to a graphics program.
19. entry
Reads input; for a complete description, see the INPUT section.
20. printsolution
The purpose of this procedure is to print out the processor numbers and utilizations and to
allocate and print the required instruction memory modules and data memory modules for each
given processor. This procedure uses the processor partition "s" and the datablock partition
"r" which were developed by the algorithm to serve this purpose. The cluster "s" is scanned to
find nonempty sectors; these represent groups of processes that are to be executed by the same
processor. When one is found, a processor number is assigned and is printed along with the
expected processor utilization. Then the associated instruction blocks are partitioned and assigned
memory modules by the "allocate" procedure. It also causes the memory module assignment
to be printed along with the efficiency of usage. The set of data blocks that are local to that
processor is then given a similar treatment.
After all of the processors and their associated information have been printed, the data
blocks that could not be made local to any given processor are partitioned and their memory
module assignments and efficiencies are printed out. At this point the program is finished. For
a complete description of the output, see the OUTPUT section.
INPUT
The input of the allocation program is read exactly in the following sequence.
A.
0. printing level: (flag 111 - 000) [integer]
first flag (001): print info (debugging)
second flag (010): entry will print information asking input
third flag (100): print info (debugging)
n.b. could have any combination of the above
1. the number of processes (n)
2. the number of data blocks (k)
3. the allowed real-time in TU's (time frame), for the complete execution of the appli-
cation program T
4. the maximum allowed processor utilization (eta), (suggested to be 0.9)
5. the maximum number of ITU's that can be transferred in the allowed real time (C)
(the maximum information handling capacity - C)
6. the instruction memory module size (immc)
7. the data memory modules size (dmmc)
B. The instruction block size vector (x) reads n real numbers (one for each instruction block).
C. The data block sizes vector (y) reads k real numbers (one for each data block).
D. The module expected execution time vector (a) reads n real numbers (one for each pro-
cess).
E. The module entrance probability vector (e) reads n real numbers (one for each process).
F. The module transition probability matrix (P) and the inter-module information (ITU's)
transfer per control transition matrix (lambda) in the following way: For each module (i),
supply the number of modules that could follow this i-th module and then for each one of
them, enter the module number, the probability that control will pass to this module and




so two modules follow the first module: module 3 with probability 1.0 and 180 ITU's passed
to it, and module 5 with probability 0.5 and 120 ITU's passed to it.
G. The control data of the EXOR graph (Pp): for each module (i) supply the number of
modules that could follow the i-th module, the module number for each one, and the pro-
bability of transfer, e.g., input for 1st module (i = 1):
1 3 1.0
so 1 module follows module 1, which is module 3 with probability 1.0.
H. The node-conflict table which indicates the modules not to be assigned to the same proces-
sor (conflict). (Note: if any two modules are meant to be run in parallel, then they cannot
be run by the same processor, so they are in conflict.) The input is given in the following
manner: For each module, give the number of modules that are in conflict with the chosen
module and then give the numbers of these modules, e.g., input for 3rd module (i = 3):
2
5 7
so 2 modules (5 and 7) are in conflict with module 3.
I. The module-data block reference matrix (h). For each module, give the number of data
blocks referenced by this module and the datablock number for each one of them, e.g.,
input for the 3rd module (i = 3):
2
1 3
so 2 datablocks (1 and 3) are referenced by module 3.
K. The number of co-ordinate pairs needed to describe the delay graph (if no such pairs of





(Note: First pair 'must' be (0.0, 1.0).)
L. The following data are also needed:
1. dtsize: initial time is decremented by this dtsize.
2. tolerance: used by delay functions.
3. utilizationbound: suggested 100.
4. interflag: choose one delay function (out of 4).
(a) delay function 1
(b) delay function 2
(c) delay function 3







7. t-tolerance: how close the iteration will come to the best solution.
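The count-then-list layout used for the node-conflict table (item H) can be read with a few lines of code. The sketch below is an illustration only, not the 'entry' procedure itself: it assumes whitespace-separated tokens, 1-based module numbers, and that a conflict entered for one module also applies to the other module of the pair.

```python
def read_conflicts(tokens, n):
    """Read the conflict table for n modules from an iterator of tokens.

    Returns an (n+1) x (n+1) Boolean matrix; row and column 0 are unused.
    """
    conflict = [[False] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        count = int(next(tokens))          # number of modules in conflict with i
        for _ in range(count):
            j = int(next(tokens))
            conflict[i][j] = True
            conflict[j][i] = True          # assumed symmetric
    return conflict


# Seven modules; only module 3 lists conflicts (with modules 5 and 7).
text = "0\n0\n2\n5 7\n0\n0\n0\n0\n"
conflict = read_conflicts(iter(text.split()), 7)
```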
OUTPUT
The output of the allocation program is as follows:
(1) ALLOCATION SOLUTION AND WORKLOAD STATISTICS
number of modules is
estimated elapsed time is T and capacity is C, time units
allowed real time for processing is (initial T), time units
(2)
id memory module data blocks
instruction size utilization
4. Statistics on processors utilization, (total utilization for every processor = processing
+ communication, and utilization of each one due to processing only). Also printed











5. Statistics on application's communication requirements for each processor.
*** APPLICATION'S COMMUNICATION REQUIREMENTS ***
(in ITU's)
id data referenced interprocessor
via interconnection data transfer
id memory inter- interprocessor total cluster total
module commu- communication processing processor




delay per ITU: __
6. Statistics on the common blocks assigned. IF there are common data blocks the out-
put is:
id memory module common data blocks
instruction size utilization
-- --
ELSE the output is:
"there are no common datablocks"
APPENDIX II: Instructions For Using the Preprocessor
This section summarizes the use of the preprocessor on one particular system.
Filenames and directories may vary on your system.
After designing the algorithms to be simulated, create the main program and the
functions you are going to use in one or several files. The purpose of the main program
is only to initialize the data you may need and specify the sequence in which the modules
are going to run. The last statement of the main program should be a call to 'finis()' so
that the preprocessor will create the input file for the allocation program in the format that
is read. If you set the printflag, it will also print output explaining in detail the data
collected. This is also useful for debugging purposes.
Next, create the files parameters.c and a makefile. The easiest approach is to copy
the files parameters.c and sample.make from directory ~/alloc/pre/src and modify them to
suit the allocation and directory. The make command is then invoked to compile the
simulation program.
Automatic instruction timing is controlled by 'clock_on()' and 'clock_off()' and the
'simcc' command line option '-T file'. The T file contains a list of VAX instructions and
supposed execution times; a complete list with 1 microsecond times is stored in
~/alloc/simcc/VAXinstructs. The default is zero msec for all instructions (if option -T is not
used). See also
In the directory ~/alloc/pre/sample there are some examples of using the preprocessor.
One is the Cholesky decomposition.
A list of error messages appears in Appendix III.
APPENDIX III: Error Messages







error >>> create: invalid process id #
you will get this error when the argument 'pid' to
process(code, pid, prob, parmptr)
is: pid < 1 or pid > NPROCESSES.
By default NPROCESSES is 100 in the preprocessor.
If you need more, just edit the file src.c and then run 'make'.
error >>> create: process id # already used
this means that you have already created a process with id = #,
but a process id must be unique
error >>> cobegin: nested cobegin's not allowed
you can't have nested cobegin's; probably you forgot a coend
error >>> coend: must follow a 'cobegin'
probably you forgot a cobegin
error >>> coend: cobegin must be closed by a coend
probably you forgot to close the last cobegin with a coend
error >>> numbering of processes must be sequential
the numbering of processes must start at process #1 and
continue without any gaps up to process #maxprocess
