Node to processor allocation for large grain data flow graphs in throughput-critical applications by Cardany, John Paul
Calhoun: The NPS Institutional Archive
Theses and Dissertations Thesis Collection
1994-06
Node to processor allocation for large grain data flow
graphs in throughput-critical applications
Cardany, John Paul
Monterey, California. Naval Postgraduate School
http://hdl.handle.net/10945/30831
NAVAL POSTGRADUATE SCHOOL 
Monterey, California 
THESIS 
NODE TO PROCESSOR ALLOCATION FOR LARGE 
GRAIN DATA FLOW GRAPHS IN 
THROUGHPUT-CRITICAL APPLICATIONS 
John P. Cardany 
June 1994 
Thesis Advisor: Shridhar B. Shukla 
Approved for public release; distribution is unlimited. 
DUDLEY KNOX LIBRARY 
NAVAL POSTGRADUATE SCHOOL 
MONTEREY CA 93943-5101 
Unclassified 
SECURITY CLASSIFICATION OF THIS PAGE 
REPORT DOCUMENTATION PAGE 
Naval Postgraduate School 
ADDRESS (City, State, and ZIP Code) 
Monterey, CA 93943-5000 
TITLE (Include Security Classification) 
Node to Processor Allocation for Large Grain Data Flow Graphs in Throughput-Critical Applications (U) 
12. PERSONAL AUTHOR(S) 
Cardany, John Paul 
DATE OF REPORT (Year, Month, Day): June 1994 
16. SUPPLEMENTARY NOTATION 
The views expressed in this thesis are those of the author and do not reflect the official policy or position of 
the Department of Defense or the U.S. Government. 
19. ABSTRACT (Continue on reverse if necessary and identify by block number) 
This thesis describes the issues involved in node allocation for a Large Grain Data Flow (LGDF) model 
used in Navy signal processing applications. In the model studied, nodes are assigned to processors based on 
load balancing, communication/computation overlap, and memory module contention. Current models using 
the Revolving Cylinder (RC) technique for LGDF graph analysis do not adequately address node allocation. 
Thus, a node to processor allocation component is added to a computer simulator of an LGDF graph model. It 
is demonstrated that the RC technique, when proper node allocation is taken into account, can improve overall 
throughput as compared to the First-Come-First-Served (FCFS) technique for high 
communication/computation costs. 
20. DISTRIBUTION/AVAILABILITY OF ABSTRACT: Unclassified/Unlimited 
21. ABSTRACT SECURITY CLASSIFICATION: Unclassified 
22a. NAME OF RESPONSIBLE INDIVIDUAL 
Shukla, Shridhar B. 
Approved for public release; distribution is unlimited 
Node to Processor Allocation for Large Grain Data Flow Graphs in 
Throughput-Critical Applications 
by 
Lieutenant, United States Navy 
B.S., University of Washington, 1987 
Submitted in partial fulfillment of the 
requirements for the degree of 




NAVAL POSTGRADUATE SCHOOL 
June 1994 
Michael A. Morgan, Chairman 
Department of Electrical and Computer Engineering 
ABSTRACT 
This thesis describes issues involved in node allocation for a Large Grain Data Flow 
(LGDF) model used in Navy signal processing applications. In the model studied, nodes 
are assigned to processors based on load balancing, communication/computation overlap, 
and memory module contention. Current models using the Revolving Cylinder (RC) 
technique for LGDF graph analysis do not adequately address node allocation. Thus, a 
node to processor allocation component is added to a computer simulator of an LGDF 
graph model. It is demonstrated that the RC technique, when proper node allocation is 
taken into account, can improve overall throughput as compared to the First-Come-First-Served 
(FCFS) technique for high communication/computation costs. 




TABLE OF CONTENTS 
I. INTRODUCTION 
A. BACKGROUND 
B. THESIS SCOPE AND CONTRIBUTION 
C. THESIS ORGANIZATION 
II. ISSUES IN ALLOCATION OF NODES 
A. PROBLEMS WITH CURRENT ALLOCATION 
B. ISSUES ADDRESSED 
1. Load Balancing 
2. Overlap 
3. Memory Contention 
C. WRAP-AROUND 
III. ALGORITHM FOR NODE ALLOCATION 
A. OVERLAP 
B. WRAP-AROUND 
IV. RUN-TIME PERFORMANCE 
A. PERFORMANCE METRICS 
B. RESULTS 
V. CONCLUSION 
A. FURTHER RESEARCH 
APPENDIX A: NODE ALLOCATION PROGRAM 
APPENDIX B: PROGRAM USER'S MANUAL 
APPENDIX C: SAMPLE INPUT DATA FILES 
APPENDIX D: SAMPLE RUN OF PROGRAMS 
LIST OF REFERENCES 
INITIAL DISTRIBUTION LIST 

I. INTRODUCTION 
The Revolving Cylinder (RC) technique [Ref. 1] was developed as an attempt to 
enhance throughput over the First-Come-First-Served (FCFS) technique for dispatching 
nodes for communication intensive applications. A computer programmed simulator based 
on the Department of the Navy's AN/UYS-2 Digital Signal Processing System, also 
known as the Enhanced Modular Signal Processor (EMSP) [Ref. 2], was developed to 
evaluate the RC techniques with respect to such machines. In this thesis, a node to 
processor allocation component has been added to the simulator. 
A. BACKGROUND 
Large Grain Data Flow (LGDF) graphs are particularly suited to describing 
applications where large amounts of data are generated and require predictable, periodic 
processing. Thus, LGDF graphs are often used to model signal processing applications 
with specific throughput requirements. LGDF graph execution can be carried out using a 
balance of compile-time and run-time decisions in order to achieve the most efficient 
throughput. Digital signal processing (DSP) applications lend themselves easily to 
compile-time analysis because DSP applications are very specific in the computation 
required for each node [Ref. 3]. The AN/UYS-2 programs use large grain data flow 
execution as their run-time environment and thus can be modeled using an LGDF graph 
representation. 
For an LGDF graph receiving periodic input data, FCFS cannot provide uniform 
throughput under high loads because the nodes receiving external data become ready 
independent of other nodes in the graph and thus the nodes higher in the graph become 
ready before the lower nodes in the graph. This results in system congestion and causes a 
decrease in throughput. The RC technique adds graph dependencies to the nodes in the 
graph, thus reducing or eliminating this congestion to ensure a more uniform throughput. 
The FCFS scheduling technique places nodes into the system based on when the 
nodes are ready. Thus FCFS cannot benefit from compile-time efforts in scheduling nodes, 
nor does it bind nodes to specific processors for execution. In previous applications of the 
RC technique, graph dependencies were added at compile-time based on node allocation 
that was performed randomly. Performance with this random allocation, however, was 
poorer than that provided by FCFS. Thus, in order to ensure that RC facilitates better 
performance than FCFS, it is necessary to modify the generation of graph dependencies 
using the RC technique based on the node to processor allocation. 
B. THESIS SCOPE AND CONTRIBUTION 
This thesis describes an algorithm for allocation of nodes to processors for an LGDF 
graph. A real application modeled as an LGDF graph is studied, based on a signal 
correlator graph representing an actual application running on the AN/UYS-2. Results are 
generated using the node allocation program as well as previously developed software, and 
comparisons are made between the First-Come-First-Served (FCFS) technique and the 
Revolving Cylinder (RC) technique. 
C. THESIS ORGANIZATION 
Chapter II describes the issues involved in node allocation for improving the 
performance of the LGDF. Included are the problems existing with current allocation 
methods and the issues addressed as a result of these deficiencies. Chapter III gives a 
description of the algorithms used in the node allocation program as they relate to the issues 
in Chapter II. Chapter IV presents the analysis of data generated from several scheduling 
methods. Chapter V summarizes the results, presents conclusions drawn from the data 
analysis, and provides topics of further study. 
II. ISSUES IN ALLOCATION OF NODES 
There are several issues relating to the task of node allocation. In the model discussed 
in this thesis, nodes are assigned to processors based on several factors, such as load 
balancing, overlap of communication and computation, and contention between nodes for 
memory modules. Each of these is important and a delicate balance between these factors 
must be accomplished in order to achieve maximum utilization of the processors. 
A. PROBLEMS WITH CURRENT ALLOCATION 
Node allocation in the general sense refers to the binding of nodes to specific 
processors for execution based on certain criteria. Allocation is separate from scheduling, 
which refers to determining the time at which the node executes on the processor to which 
it is allocated. Without proper node allocation, the processors cannot execute at their most 
efficient level, and throughput for the data flow graph is reduced as a result. To 
demonstrate this, the programs in Appendix A and [Ref. 4] were used on a test data flow 
graph, illustrated in Figure 2.1, to allocate nodes and simulate the data flow graph. 
The graph shows two input/output processors and 13 nodes. Even numbered nodes 
were assumed to have two times the number of execution cycles as odd numbered nodes. 
Each individual queue's produce, consume, write, and read amounts were considered 
equal; however these values differed over different queues. These are the values shown on 
the queues in Figure 2.1. The queue capacity was equal to eight times the queue threshold. 
The simulation was run with three processors and no setup or breakdown latency for the 
nodes was assumed. In addition, the scheduler latency was zero and the communication 
time for one word was five cycles. The simulation was run first without node allocation, 
i.e., the nodes are assigned to processors without regard for satisfying the criteria 
described above, and then with proper node allocation. In the first case, the nodes were 
allocated dynamically at run-time based on which node was ready and which processor was 
free. In the second case, the nodes were allocated statically at compile-time based on load 
balancing, queue contention, and memory module contention. The results are compared in 
the graph of Figure 2.2. Note the lower utilization rate of the execution unit of the 
processor for the simulation without node allocation as well as the lower throughput. 
Figure 2.1. Test Data Flow Graph 
Figure 2.2. Improvement With Allocation Over No Allocation 
B. ISSUES ADDRESSED 
1. Load Balancing 
In order to ensure the processors are being fully utilized, it is important to ensure 
that the nodes executing across processors are balanced with respect to execution and/or 
communication times. Since the emphasis of the node allocation algorithms is based upon 
maximization of the execution unit utilization, load balancing for the processors will focus 
mainly on the execution time of the nodes. Load balancing is achieved by statically 
assigning nodes to processors based on execution times of the nodes, attempting to 
maintain the same number of execution cycles per processor. 
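The balancing goal described above can be illustrated with a short sketch. It is not the thesis's allocation program; it applies a common greedy heuristic (longest execution time first, each node going to the least-loaded processor) as one plausible way to keep execution cycles per processor roughly equal. The function name `balance_nodes` and the cycle counts in the usage note are hypothetical.

```python
import heapq

def balance_nodes(exec_cycles, num_processors):
    """Greedy sketch: give each node (longest first) to the least-loaded processor.

    exec_cycles: dict mapping node id -> execution cycles (illustrative values).
    Returns (allocation, loads); allocation[p] lists the nodes bound to processor p.
    """
    # Heap of (current load, processor index); the smallest load pops first.
    heap = [(0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    allocation = [[] for _ in range(num_processors)]
    # Decreasing execution cycles; ties broken by node id so the result is deterministic.
    for node, cycles in sorted(exec_cycles.items(), key=lambda kv: (-kv[1], kv[0])):
        load, p = heapq.heappop(heap)
        allocation[p].append(node)
        heapq.heappush(heap, (load + cycles, p))
    loads = [0] * num_processors
    for load, p in heap:
        loads[p] = load
    return allocation, loads
```

For example, with 13 nodes on three processors where even-numbered nodes take twice as many cycles as odd-numbered ones (assumed here to be 200 and 100 cycles), `balance_nodes({n: 200 if n % 2 == 0 else 100 for n in range(1, 14)}, 3)` yields per-processor loads of 700, 600, and 600 cycles.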
2. Overlap 
Overlap of communication and computation is important to the LGDF model of 
computation. The system contains both a control unit and an execution unit per processor. 
It is desirable to utilize both of these units in a way that permits use of the execution unit to 
the fullest extent possible. This is achieved by overlap of communication and computation. 
There are two conditions which must be met for nodes to overlap sufficiently such that the 
execution unit is utilized to the fullest extent possible. For two nodes j and j+1, where 
node j executes on the processor before node j+1, the following two conditions should 
exist: 
$$\mathrm{execution}_j \ge \mathrm{setup}_{j+1} \tag{1}$$
and 
$$\mathrm{breakdown}_j \le \mathrm{execution}_{j+1} \tag{2}$$
Ideally, perfect overlap of communication and computation is desired, such as 
that shown in Figure 2.3. 
Figure 2.3. Ideal Communication/Computation Overlap [Ref. 2] 
Figure 2.4. Typical Communication/Calculation Overlap [Ref. 2] 
In Figure 2.3, it is assumed that node 0 has been executing for some time before 
node 1 is assigned. Here there are no idle or blocked cycles on the processor, since nodes 
can progress immediately from input (setup) to execution to output (breakdown). Note that 
both condition (1) and condition (2) are met for all nodes and the execution unit is operating 
continuously. This, however, is not always the case in reality, as shown in Figure 2.4. 
In Figure 2.4, there is contention for the execution unit since node 2 has 
completed input but cannot progress to the execution unit because node 1 is still executing. 
This results in blocked cycles until node 1 has finished executing. In addition, idle cycles 
may also exist, such as above, where node 1 has finished breakdown but node 2 is still 
executing and does not yet require the control unit. It is desirable to limit the blocked and 
idle times to maximize overlap wherever possible. In addition, a situation may also occur 
where there is poor overlap, such as that in Figure 2.5. 
Figure 2.5. Poor Communication/Computation Overlap [Ref. 2] 
In this figure node 2 has not completed setup after node 1 finishes execution, and 
node 1 must therefore wait for access to the control unit, creating idle cycles on the 
execution unit. In addition, node 1's breakdown is longer than node 2's execution. This 
results in additional idle cycles, since node 2 must break down and node 3 must set up 
before the execution unit is utilized again. 
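The two overlap conditions can be checked mechanically for any pair of consecutive nodes. The following is a minimal sketch under the assumption that each node is described by three cycle counts; the `Node` class and `overlap_conditions` function are illustrative names, not part of the thesis software.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    setup: int       # control-unit cycles spent reading inputs
    execution: int   # execution-unit cycles
    breakdown: int   # control-unit cycles spent writing outputs

def overlap_conditions(prev, nxt):
    """Return (cond1, cond2) for node `prev` running just before node `nxt`.

    cond1 is condition (1): execution of prev >= setup of nxt.
    cond2 is condition (2): breakdown of prev <= execution of nxt.
    """
    cond1 = prev.execution >= nxt.setup
    cond2 = prev.breakdown <= nxt.execution
    return cond1, cond2
```

With made-up values, `overlap_conditions(Node("a", 10, 50, 20), Node("b", 30, 40, 5))` returns `(True, True)`, the situation in which both units can be kept busy, while a successor with a 60-cycle setup and a 10-cycle execution fails both conditions.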
3. Memory Contention 
The memory modules are the representation for the system memory [Ref. 1]. 
Each processor must address memory modules to transfer data to or from a node during a 
read or write operation, respectively. Each queue in the data flow graph is assigned a 
memory module either by the user or arbitrarily by the scheduler. Only one processor can 
access a given memory module at any one time. It is possible, however, for a processor to 
be accessing a memory module (either reading or writing) while another processor is 
attempting to utilize the same memory module. This is memory contention. Thus the 
processor which is attempting to access the memory module must wait until the memory 
module is free. This delays the completion of the graph and affects throughput. Memory 
contention can be reduced or avoided by ensuring that sufficient memory modules exist to 
fulfill the requirements of all the queues of the graph. Alternatively, queues can be mapped 
on the available memory modules such that this contention is minimized. 
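One way to picture the mapping of queues onto memory modules is as a graph-colouring problem: queues that may be accessed at the same time by different processors should not share a module. The sketch below uses a simple greedy colouring; it is an illustration only, not the mapping program used in the thesis, and the function name, queue names, and the `conflicts` input are assumptions.

```python
def assign_memory_modules(queues, conflicts):
    """Greedy sketch: give conflicting queues different memory modules.

    queues: list of queue names.
    conflicts: iterable of (queue_a, queue_b) pairs that may be accessed concurrently.
    Returns a dict mapping queue name -> memory module index (0, 1, 2, ...).
    """
    neighbours = {q: set() for q in queues}
    for a, b in conflicts:
        neighbours[a].add(b)
        neighbours[b].add(a)
    module_of = {}
    # Place the most-constrained queues first; ties broken by name.
    for q in sorted(queues, key=lambda q: (-len(neighbours[q]), q)):
        used = {module_of[n] for n in neighbours[q] if n in module_of}
        m = 0
        while m in used:
            m += 1
        module_of[q] = m
    return module_of
```

With four queues where `q1`, `q2`, and `q3` all conflict pairwise, the sketch uses three modules for them and lets `q4` share module 0, which matches the idea that enough modules can remove contention entirely.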
C. WRAP-AROUND 
Wrap-around is a technique used to maximize the overlap as permitted by the RC 
approach by statically 'wrapping' the breakdown time of the last node to the idle or blocked 
time at the head of a cylinder. An example will better illustrate this principle. Figure 2.5 is 
a static representation of a cylinder with three nodes on a single processor. 











Figure 2.5. Cylinder Without Wrap-Around 
Both control unit and execution unit are shown. Note that because node j+1's setup 
time is shorter than node j's execution time, blocked cycles result on the control unit. If 
node j+2's breakdown time is sufficiently short such that node j+2's breakdown time can 
be placed in the blocked cycle time, the cylinder length is reduced and the number of 
blocked cycles is reduced, increasing control unit utilization. The resultant cylinder is 
shown in Figure 2.6. 
Note that the iteration index of node j+2's breakdown has changed to indicate that the 
breakdown is now from a previous iteration. The goal of wrap-around is to attempt to 
shorten the length of the cylinder by an amount equal to the length of the breakdown time 
of the last node without extending the length of the execution unit. 








Figure 2.6. Cylinder With Wrap-Around 
In general, for one or two nodes j (and j+1) executing on a processor where node j 
executes before node j+1, wrap-around is possible if: 
$$\mathrm{setup}_{j+1} + \mathrm{breakdown}_j + \mathrm{breakdown}_{j+1} \le \mathrm{execution}_j + \mathrm{execution}_{j+1}$$
as long as at least condition (1) is satisfied. 
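A worked check of the two-node condition may help. The sketch below assumes each node is given as a dictionary of cycle counts; the function name and the numbers in the usage note are hypothetical.

```python
def can_wrap_two_nodes(first, second):
    """Two-node wrap-around test; nodes are dicts of cycle counts.

    Each dict has 'setup', 'execution', and 'breakdown' keys.
    """
    # Condition (1) must hold: the second node's setup fits under the first's execution.
    if first["execution"] < second["setup"]:
        return False
    # Control-unit work that has to fit alongside the two executions.
    control = second["setup"] + first["breakdown"] + second["breakdown"]
    busy = first["execution"] + second["execution"]
    return control <= busy
```

For instance, with a first node of 50 execution and 20 breakdown cycles and a second node of 30 setup, 40 execution, and 25 breakdown cycles, the control-unit total is 30 + 20 + 25 = 75 cycles against 50 + 40 = 90 execution cycles, so wrap-around is possible. Raising the second node's breakdown to 45 cycles gives 95 > 90 and the test fails.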
For three or more nodes on a processor, the general case becomes more complicated, 
because there is a potential for the third node's setup time to occur during the second 
node's execution time. In this case, wrap-around is dependent on which condition(s) listed 
above is (are) satisfied. 
Let there exist nodes j, j+1, j+2, and j+N, where j is the first node, j+1 is the second 
node, j+2 is the third node, and j+N is the last node on a processor with N nodes. For 
exactly three nodes on a processor, j+N and j+2 are synonymous. There are three cases 
for wrap-around: 
Case 1: only condition (1) is satisfied. Figure 2.7 illustrates this case. Here, since 
node j's breakdown is greater than node j+1's execution time, node j+2's setup time cannot 
be overlapped with node j+1's execution time. Thus, wrap-around is possible if: 
$$\mathrm{setup}_{j+1} + \mathrm{breakdown}_{j+N} \le \mathrm{execution}_j$$




Figure 2.7. Wrap-Around (Case 1) 
Case 2: only condition (2) is satisfied. Figure 2.8 illustrates this. For this condition 



















Figure 2.8. Wrap-Around (Case 2) 
Case 3: both conditions (1) and (2) are satisfied. This case is shown in Figure 2.9. For this 
case wrap-around is possible if: 
$$\mathrm{setup}_{j+1} + \mathrm{breakdown}_j + \mathrm{setup}_{j+2} + \mathrm{breakdown}_{j+N} \le \mathrm{execution}_j + \mathrm{execution}_{j+1}$$












Figure 2.9. Wrap-Around (Case 3) 
III. ALGORITHM FOR NODE ALLOCATION 
This chapter discusses the particular node allocation algorithm that addresses the issues 
discussed in the previous chapter within the concept of the LGDF model. Initial allocation 
of the nodes to processors is accomplished by the user, taking into account proper load 
balancing. The remaining issues are handled by the algorithms discussed below. 
A. OVERLAP 
Overlap is accomplished by first taking each processor individually and scheduling the 
node with the greatest execution time first. This algorithm is illustrated in Figure 3.1. The 
nodes are then scheduled on each processor with regard to overlap as in Figure 3.2. 
for i = 1 to total_number_of_processors 
for processor Pi 
j = 1 
while node j != NULL 
if execution j <= execution j+1 







Figure 3.1. Execution Cycle Scheduling Algorithm 
In the overlap algorithm, the second node on the processor is initially compared to the 
first node. If the setup of the second node is less than the execution time of the first node, 
then overlap can occur and the second node is scheduled after the first node. 
for i = 1 to total_number_of_processors 
for processor Pi 
j = 1 
k = j+1 
schedule 
while node j != NULL 
while execution j < setup k 
k = k+1 
end while 
temp = node j+1 
node j+1 = node k 





Figure 3.2. Overlap Algorithm 
If this condition is not true, the following node on the processor is then compared to 
the first node and this process continues until a suitable node is found or until all nodes on 
the processor have been checked. If all nodes on the processor have been checked and 
none are found suitable, or if a node has been found which meets the conditions, the node 
is scheduled and this node is then compared to the remaining nodes. This process 
continues until all nodes on the processor have been exhausted. This scheduling method is 
performed on each processor in turn. It is assumed that since the nodes are initially 
scheduled in decreasing order of execution that the breakdown of the previous node will 
likely be less than the execution time of the next node. 
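The ordering procedure described for a single processor can be sketched as follows. This is an interpretation of the prose, not a transcription of the program in Appendix A: nodes are first sorted by decreasing execution time, and then each position is filled with the first remaining node whose setup does not exceed the preceding node's execution. Leaving the existing next node in place when no node qualifies is an assumption, as are the function name and the tuple layout.

```python
def order_for_overlap(nodes):
    """Order one processor's nodes for overlap.

    nodes: list of (name, setup, execution, breakdown) tuples.
    Returns a new list in the proposed execution order.
    """
    # First pass: greatest execution time first (sorted() is stable for ties).
    order = sorted(nodes, key=lambda n: -n[2])
    for j in range(len(order) - 1):
        # Find the first later node whose setup fits under node j's execution.
        for k in range(j + 1, len(order)):
            if order[k][1] <= order[j][2]:
                order[j + 1], order[k] = order[k], order[j + 1]
                break
        # If nothing fits, the node already at position j+1 stays (assumption).
    return order
```

As a usage example with made-up cycle counts, nodes A (setup 5, execution 100), B (setup 120, execution 80), and C (setup 20, execution 60) are first sorted as A, B, C; B's setup is too long to hide under A's execution, so C is moved up and the final order is A, C, B.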
B. WRAP-AROUND 
The wrap-around algorithm is shown in Figure 3.3. For each processor, the 
breakdown time of the last node is taken and summed with the setup time of the second 
node and the breakdown time of the first node. This sum is compared to the sum of the 
execution times of the first and second nodes. If the sum of the setup and breakdown times 
does not exceed the sum of the execution times, the last node breakdown time can be 
wrapped around. There are several other conditions which can also occur. Typically, for 
three or more nodes scheduled on a processor, it is possible for the setup time of the third 
node to occur during the execution time of the second node after the breakdown of the first 
node. In this case, the setup time of the third node is also summed with the setup time of the 
second node and the breakdown times of the first node and the last node. 
for i = 1 to total_number_of_processors 
for processor Pi 
j = 1 
if breakdown j + setup j+1 + breakdown j+N <= execution j + execution j+1 
Figure 3.3. Wrap-Around Algorithm 
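The per-processor test described here can be sketched in a few lines. The function below is illustrative only: the name `can_wrap_around`, the dictionary layout, and the `third_setup_overlaps` flag (which the caller sets when the third node's setup falls inside the second node's execution) are assumptions, and the sketch does not cover the remaining special cases.

```python
def can_wrap_around(nodes, third_setup_overlaps=False):
    """Wrap-around test for one processor's ordered node list.

    nodes: ordered list of dicts with 'setup', 'execution', 'breakdown' cycle counts.
    """
    if len(nodes) < 2:
        return False  # this sketch only handles two or more nodes
    first, second, last = nodes[0], nodes[1], nodes[-1]
    # Control-unit work: first breakdown + second setup + last breakdown.
    control = first["breakdown"] + second["setup"] + last["breakdown"]
    if third_setup_overlaps and len(nodes) >= 3:
        control += nodes[2]["setup"]  # third node's setup is added as well
    return control <= first["execution"] + second["execution"]
```

With three hypothetical nodes whose (setup, execution, breakdown) values are (10, 50, 20), (15, 40, 10), and (30, 30, 20), the basic sum is 20 + 15 + 20 = 55 cycles against 90 execution cycles, so the last breakdown can be wrapped; including the third node's setup gives 85, which still fits.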
IV. RUN-TIME PERFORMANCE 
This chapter describes the results for use of the revolving cylinder algorithm. The 
programs used for generation of the results are fully described in Appendix B. Figure 4.1 
is a graphical summary of the programs and their related inputs and outputs. 
A. PERFORMANCE METRICS 
The performance evaluations for the RC technique were generated using an actual 
application graph called a correlator [Ref. 4]. This graph is illustrated in Figure 4.2. The 
RC technique that was analyzed was the start after finish (SAF) technique. The results 
from this technique were compared to an FCFS scheduling algorithm. Simulations were 
performed on cylinders generated for both wrap-around and non wrap-around techniques. 
Several initial assumptions were made for the RC cylinders. The scheduler latency, 
node setup and breakdown latency, and instruction size were assumed to be zero. The 
read, write, produce, consume, and threshold amounts for an individual queue were 
assumed to be equal. The queue capacity was calculated as eight times the queue threshold. 
Nodes were manually allocated to processors based on load balancing and minimizing 
queue contention; that is, no processor would simultaneously access the same queue for 
reading and writing. As many memory modules as necessary to completely eliminate 
memory module contention were then assigned to processors. The number of memory 
modules required was based on the static representation of the cylinder generated by the 
scheduler and mapping programs. Eight processors were used in the system and the node 
to processor allocation was identical throughout the simulations. 
Figure 4.1. Revolving Cylinder Program Summary 
Figure 4.2. Correlator Graph [Ref. 5] 
B. RESULTS 
Figures 4.3 and 4.4 illustrate the normalized maximum throughput for the correlator 
versus the ratio of communication cycles to computation cycles. The communication costs 
used for the mapping were varied from 3 to 23 cycles to transfer one word of data from a 
processor to memory. These correspond to communication/computation ratios of 0.1 to 
0.77, respectively. The theoretical minimum average input period was used as the 
normalizing factor. This normalizing factor was calculated by taking the inverse of the 
ideal cylinder calculation for one instance of the graph and multiplying by $1 \times 10^6$. The 
$1 \times 10^6$ factor is necessary since maximum throughput is given by the simulator in instances 
per megacycle. 
Figure 4.3. Normalized Maximum Throughput vs. 
Communication/Computation (No Wrap) 
(Horizontal axis: Ratio of Communication Cycles to Computation Cycles.) 
The 'mapper' points listed in the legend represent the maximum theoretical throughput 
for the compile-time representation of the cylinder. This value is obtained by taking the 
inverse of the end time of activities obtained by the map program multiplied by $1 \times 10^6$. In 
Figure 4.4, there are two representations of the mapper. The first is a 'flat' cylinder. Each 
static cylinder slice of the graph ends at a different time, represented as a number of cycles. 
The 'flat' cylinder takes the greatest end time of all cylinder slices and uses that value as the 
average end time of the graph. This means if a cylinder slice ends before this average end 
time, idle cycles may be added to the execution unit, thereby decreasing the calculated 
throughput. The 'jagged' cylinder, however, takes into account each individual cylinder 
slice end time, and uses the average of the end times over all cylinder slices as the average 
end time. Thus, maximum throughput for the 'jagged' cylinder is greater than the 'flat' 
cylinder. 
Figure 4.4. Normalized Maximum Throughput vs. 
Communication/Computation (With Wrap) 
(Horizontal axis: Ratio of Communication Cycles to Computation Cycles.) 
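The difference between the 'flat' and 'jagged' mapper values comes down to which end time is inverted. A minimal sketch, assuming the per-slice end times are available as a list of cycle counts (the function name and the sample numbers are hypothetical):

```python
def mapper_throughput(slice_end_times, jagged=False):
    """Theoretical throughput in instances per megacycle.

    slice_end_times: end time of each cylinder slice, in cycles.
    jagged=False uses the greatest slice end time ('flat' cylinder);
    jagged=True uses the average over all slices ('jagged' cylinder).
    """
    if jagged:
        end_time = sum(slice_end_times) / len(slice_end_times)
    else:
        end_time = max(slice_end_times)
    # Inverse of the end time, scaled by 1e6 to give instances per megacycle.
    return 1e6 / end_time
```

For slice end times of 900, 1000, and 1100 cycles, the flat value is 1e6 / 1100 (about 909 instances per megacycle) and the jagged value is 1e6 / 1000 = 1000. Since the average never exceeds the maximum, the jagged value is never lower than the flat one.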
In both Figure 4.3 and Figure 4.4, note that as communication costs increase, SAF 
results in better throughput than FCFS. This is due to the ability in SAF to map the nodes 
to minimize contention [Ref. 2]. 
Since the node to processor allocation was identical throughout the simulations, it was 
desirable to see if different allocation at various communication costs would have an effect 
on throughput. A separate node to processor allocation was tried for 15 and 20 cycles to 
transfer one word of data from a processor to memory. The allocation of nodes was 
modified only slightly, i.e., only one node was allocated to a different processor. These 
points are indicated in Figures 4.3 and 4.4 as 'New Map'. It is clear for both wrap and no 
wrap cases that the revolving cylinder values (SAF and mapper) are affected by slight 
changes in the node allocation. 
Figures 4.5 and 4.6 represent the normalized response time and the coefficient of 
variation of normalized response time for both the no wrap and wrap cases, respectively. 
The normalizing factor used in Figure 4.5 is the number of execution cycles required for 
the completion of one iteration of the critical path of the graph. Note that the response time 
for SAF (both no wrap and wrap cases) is lower than FCFS at high communication costs. 
Note also that although SAF no wrap has a slightly better response time than with wrap at 
high communication costs, modifying the node to processor allocation (New Map) has a 
significant effect on the no wrap case. Thus, it is possible to improve the response times 
for both cases by changing the node allocation. 
The coefficient of variation represented in Figure 4.6 is a measured comparison 
between the response times of all graph instances to the average response time. The lower 
this number, the closer the measured response times are to the average [Ref. 2]. SAF with 
wrap appears to have the best overall performance as measured by coefficient of variation 
throughout the range of communication costs. Again, however, modification of the node 
to processor allocation significantly affects the results, thus indicating that coefficient of 




Figure 4.5. Normalized Response Time vs. Communication/Computation 
Figures 4.7, 4.8 and 4.9 represent normalized maximum throughput for 
communication costs of 3 cycles, 5 cycles, and 15 cycles versus load. Load in this case is 
based on fractional multiples of the maximum throughput case (1.0 in the figure). These 
multiples correspond to a range of graph input from severe lack of input data to overflow of 
data. From these figures, SAF results in slightly better overall throughput at higher graph 
loads versus FCFS. Although SAF no wrap performs better than SAF with wrap at low 
communication costs, SAF with wrap achieves a higher overall throughput over SAF no 





Figure 4.6. Coefficient of Variation vs. Communication/Computation 
Figure 4.9. Normalized Maximum Throughput vs. Load (15 Cycles/Word) 
(0.50 Communication/Computation) 
Figures 4.10 through 4.15 illustrate normalized response time versus load and 
coefficient of variation versus load for the same communication costs as the three previous 
figures. From these figures, SAF is shown to have the best overall response time and 
lowest coefficient of variation throughout the range of load. 
--a-FCFS J 6.5 +-----------1 
" 6 t-t================~=:!===~=~=~ ~=O=il:="='~= :; -6 5.5 .Sa 5 ~ § 4.5 j~~:::~~~~~~;~~~~ t~ 4 '" ::; 3.5 ~~ 3 ~ H 2; .§:g 1.5 U 1 
0.5 -o 
0.4 0.6 0.7 0.' 1.2 
Lo,d 
Fi gure 4.10. Normalized Response Time \'s. Load (3 Cyclcs/Word) 
(0.10 Communication/Computation) 
[Figures: Normalized Response Time vs. Load; series: SAF - No wrap, SAF - With wrap]
Coefficient of Variation vs. Load (3 Cycles/Word) 
(0.10 Communication/Computation) 
[Figures: Coefficient of Variation vs. Load; series: FCFS, SAF - No wrap, SAF - With wrap]
This thesis described the issues involved in node allocation and described a program
implemented to resolve those issues. An addition to the RC technique, wrap-around, was
also analyzed as an improvement to the compile-time implementation of the graph.
A revolving cylinder technique, start-after-finish (SAF), was studied and compared to the
First-Come-First-Served (FCFS) technique for a large grain data flow graph model. It was
demonstrated that RC provides overall better throughput than FCFS, particularly at high
communication costs. In addition, it was shown that the RC technique is sensitive to
cylinder mapping, especially at high communication costs. Thus, it is important in the
analysis of the RC technique to optimize the mapping for each instance of communication cost.
A. FURTHER RESEARCH 
Several initial assumptions were made for the graph model that could
be removed in future work.
1. The number of instructions for each node was assumed to be zero. Analysis
should be conducted with variable instruction lengths.
2. Scheduler latency was also assumed to be zero. This quantity should also be
varied and its effect on the RC technique studied.
3. Since the RC results were sensitive to cylinder mapping, it would be desirable to
find an optimum cylinder mapping for each level of communication cost. From this, a
heuristic could be developed such that an extra program module could be added to the
existing programs to perform this task automatically.
APPENDIX A. NODE ALLOCATION PROGRAM 
// LIEUTENANT JOHN P. CARDANY, U.S. NAVY
// 20 APRIL 1994
// NAVAL POSTGRADUATE SCHOOL
// ADVISORS: PROFESSORS SHRIDHAR SHUKLA AND AMR ZAKY











cout << "\n\nLARGE GRAIN DATA FLOW NODE TO PROCESSOR SCHEDULING PROGRAM\n\n";












// LIEUTENANT JOHN CARDANY, U.S. NAVY
// 20 APRIL 1994
// NAVAL POSTGRADUATE SCHOOL
// ADVISORS: PROFESSORS SHRIDHAR SHUKLA AND AMR ZAKY






#define newln '\n'
private: 




































long setup, breakdown;






// Class Constructor
node_alloc();
// Number of Nodes in the System
// Number of Queues in the System
// Number of Processors in the System
// Fixed Scheduler Latency
// Fixed Setup time (a)
// Communication Time for One Word of Information
// Setup and Breakdown Times Per Node
// System Queues
// System Nodes
// Matrix to store Node structures
// Matrix to store ordered Node Structures
// Node order matrix
// Function to load timing information into the system
void define_times();
// Function to read number of processors into the system
void read_processor_file();
// Function to load queue data into the system
void read_queue_file();
// Function to load the node data into the system
void change_node_file();
// Function to Order the Nodes and Create the ORDER.IN File
void order_nodes();
// Function to calculate the unused execution cycles
void calc_unused_exe_cycles();
// Function to implement wrap-around
void wrap_around();
// Function to Create cylinder file
void make_cylinder_file(long);
// Function to print processor statistics
void generate_processor_stats(int,long,long,long,long,long,long,long);
// Function to print cylinder statistics
void generate_statistics(long,long,long,long,long,long,long,long);







// LIEUTENANT JOHN P. CARDANY, U.S. NAVY
// 28 April 1994
// NAVAL POSTGRADUATE SCHOOL
// ADVISORS: PROFESSORS SHRIDHAR SHUKLA AND AMR ZAKY










number_of_nodes = 0;
number_of_queues = 0;
number_of_processors = 0;
comm = 0; 





cout << "\nFixed Setup Time (cycles) :";
cin >> fixed_setup;
cout << "\nWord Communications Time (cycles) :";
cin >> comm;
if ( fixed_setup < 0 || comm < 0 )
{
cerr << "\nInvalid Communication Time\n";
exit(0);
}
// Function to Read the Processor File
void 




if ( !processor_input_file )
{
cerr << "\nCannot Open file PROCS.IN\n";
exit(0);
}
processor_input_file >> number_of_processors;
cout << "\nNumber of Processors: " << number_of_processors << "\n\n";
processor_input_file.close();
}





queue_input_file.open( "QUEUES.IN" );
if ( !queue_input_file )
{
cerr << "\nCannot open file QUEUES.IN\n";
exit(0);
}
queue_input_file >> number_of_queues;
for ( int cnt = 0; cnt < number_of_queues; cnt++ )
{
queue_input_file >> queue[ cnt ].queue_id;
queue_input_file >> queue[ cnt ].source_node;
queue_input_file >> queue[ cnt ].sink_node;
queue_input_file >> queue[ cnt ].write_amount;
queue_input_file >> queue[ cnt ].read_amount;
if ( queue[ cnt ].queue_id <= 0 )
{
cerr << "\nInvalid Queue ID or Wrong Quantity\n";
exit(0);
}
if ( queue[ cnt ].write_amount < 0 || queue[ cnt ].read_amount < 0 )
{
cerr << "\nInvalid Parameter for Queue: " << setw(6);
cerr << queue[ cnt ].queue_id << endl;
exit(0);
}
for ( int cntq = 0; cntq < cnt; cntq++ )
{
if ( queue[ cntq ].queue_id == queue[ cnt ].queue_id )
{
cerr << "\nDuplicated Queue ID: " << setw(6);
}
}










node_input_file.open( "NODES.IN" );
if ( !node_input_file )
{





int n_id, i_size, p_type;
unsigned long s_time, e_time, b_time; // s=setup, b=breakdown, e=execution
node_input_file >> number_of_nodes;
node_output_file << number_of_nodes << newln << newln;
for ( int cnt = 0; cnt < number_of_nodes; cnt++ )
{
node_input_file >> n_id >> i_size >> s_time >> e_time >> b_time >> p_type;
}
cerr << "\nInvalid Node ID or Wrong Quantity\n";
exit(0);
if ( s_time < 0 || e_time < 0 || b_time < 0 || i_size < 0 )
{
cerr << "\nInvalid Parameter for Node " << setw(6) << n_id << endl;
exit(0);
long setup = 0;
long breakdown = 0;
for ( int cntq = 0; cntq < number_of_queues; cntq++ )
{
if ( queue[ cntq ].source_node == n_id )
{
breakdown += ( comm * queue[ cntq ].write_amount );
}
if ( queue[ cntq ].sink_node == n_id )
{
setup += ( comm * queue[ cntq ].read_amount );
}
}
setup += fixed_setup;
breakdown += fixed_setup;
s_time = setup;
b_time = breakdown;
node_output_file << n_id << setw(4) << i_size << setw(8) << s_time << setw(12) <<
e_time << setw(13) << b_time << setw(14) << p_type << newln;
node_input_file.close();
node_output_file.close();
system("mv NODES.IN NODES_IN.ORG");
system("mv TEMP.OUT NODES.IN");




int loop, count, tail_index, swap_index;
int SWAPPED, T_MOVE;
node_type *Head_ptr, *Curr_ptr, *Tail_ptr;
node_type TEMP_NODE;
ifstream node_input_file;
node_input_file.open("NODES.IN");
ofstream order_output_file;
order_output_file.open("ORDER.IN");
node_input_file >> number_of_nodes;









node[cnt].start_time = 0;
for ( int i = 0; i < number_of_processors; i++ ) // Place the nodes in two 2x2 matrices
{
count = 0;
for ( int j = 0; j < number_of_nodes; j++ )
{
if ( node[j].proc_type == i+1 )
{
order[count][i] = node[j];
order[count][i].nodes_per_processor = count+1;
new_order[count][i] = node[j];
new_order[count][i].nodes_per_processor = count+1;
count++;
order[count][i].node_id = NULL;
new_order[count][i].node_id = NULL;
// Order Nodes in decreasing Exe time
for ( int j = 0; j < number_of_processors; j++ )
{
int node_index = 0;
SWAPPED = 1;
while ( SWAPPED )
{
SWAPPED = 0;
for ( i = 0; i < number_of_nodes; i++ )
{




TEMP_NODE = order[i][j]; // then swap nodes
order[i][j] = order[i+1][j];
order[i+1][j] = TEMP_NODE;
SWAPPED = 1; // and set a flag
}
}
// Order nodes by comparing Exe and Setup times
for ( j = 0; j < number_of_processors; j++ )
{
int node_index = 0;
T_MOVE = 0;
Head_ptr = &order[node_index][j];
Tail_ptr = &order[node_index+1][j];
Curr_ptr = Tail_ptr;
if ( Tail_ptr->node_id == NULL ) // only one node
{
order_output_file << Head_ptr->node_id << setw(8) << Head_ptr->start_time
<< newln;
}
while ( Tail_ptr->node_id != NULL ) // check all nodes on a processor
{
SWAPPED = 0;
T_MOVE = 0;
while ( Head_ptr->exe_time < Tail_ptr->setup_time )
tail_index = swap_index + 2;
Tail_ptr = &order[tail_index][j]; // point to next node to check condition again
T_MOVE = 1; // set flag to indicate tail ptr was moved
swap_index++;
// swap Tail_ptr and Curr_ptr to put tail node in position after head node
if ( T_MOVE && Tail_ptr->node_id != NULL )
{
TEMP_NODE.node_id = Curr_ptr->node_id;
TEMP_NODE.instr_size = Curr_ptr->instr_size;
TEMP_NODE.setup_time = Curr_ptr->setup_time;
TEMP_NODE.exe_time = Curr_ptr->exe_time;
TEMP_NODE.breakdown_time = Curr_ptr->breakdown_time;
TEMP_NODE.proc_type = Curr_ptr->proc_type;
TEMP_NODE.start_time = Curr_ptr->start_time;
Curr_ptr->node_id = Tail_ptr->node_id;
Curr_ptr->instr_size = Tail_ptr->instr_size;
Curr_ptr->setup_time = Tail_ptr->setup_time;
Curr_ptr->exe_time = Tail_ptr->exe_time;
Curr_ptr->breakdown_time = Tail_ptr->breakdown_time;
Curr_ptr->proc_type = Tail_ptr->proc_type;
Curr_ptr->start_time = Tail_ptr->start_time;
Tail_ptr->node_id = TEMP_NODE.node_id;
Tail_ptr->instr_size = TEMP_NODE.instr_size;
Tail_ptr->setup_time = TEMP_NODE.setup_time;
Tail_ptr->exe_time = TEMP_NODE.exe_time;
Tail_ptr->breakdown_time = TEMP_NODE.breakdown_time;
Tail_ptr->proc_type = TEMP_NODE.proc_type;
Tail_ptr->start_time = TEMP_NODE.start_time;
SWAPPED = 1; // set flag to indicate nodes swapped
Curr_ptr->start_time = (Head_ptr->setup_time + Head_ptr->start_time);
// Nodes were not swapped, only two nodes in array
if ( node_index == 0 ) // node is head ptr, put node in new array
{
order_output_file << Head_ptr->node_id << setw(8) << Head_ptr->









new_order[node_index][j].start_time = Head_ptr->start_time;
// Put node after head node into order file and array
order_output_file << Curr_ptr->node_id << setw(8) << Curr_ptr->start_time <<
new_order[node_index+1][j].node_id = Curr_ptr->node_id;
new_order[node_index+1][j].instr_size = Curr_ptr->instr_size;
new_order[node_index+1][j].setup_time = Curr_ptr->setup_time;
new_order[node_index+1][j].exe_time = Curr_ptr->exe_time;
new_order[node_index+1][j].breakdown_time = Curr_ptr->breakdown_time;
new_order[node_index+1][j].proc_type = Curr_ptr->proc_type;
new_order[node_index+1][j].start_time = Curr_ptr->start_time;
if ( Tail_ptr->node_id != NULL && !SWAPPED ) // Nodes were not swapped.
{ // no node matches requirement
// so sked node after head node
Tail_ptr->start_time = (Head_ptr->setup_time + Head_ptr->start_time);
if ( node_index == 0 )
{











new_order[node_index][j].proc_type = Head_ptr->proc_type;
new_order[node_index][j].start_time = Head_ptr->start_time;
order_output_file << Tail_ptr->node_id << setw(8) << Tail_ptr->start_time <<
new_order[node_index+1][j].node_id = Tail_ptr->node_id;




new_order[node_index+1][j].proc_type = Tail_ptr->proc_type;
new_order[node_index+1][j].start_time = Tail_ptr->start_time;
else // last node to be scheduled
{
if ( Curr_ptr->node_id != NULL && !SWAPPED ) // sked last node in the array
{
Curr_ptr->start_time = (Head_ptr->setup_time + Head_ptr->start_time);
order_output_file << Curr_ptr->node_id << setw(8) << Curr_ptr->start_time << newln;
new_order[node_index+1][j].node_id = Curr_ptr->node_id;
new_order[node_index+1][j].instr_size = Curr_ptr->instr_size;
new_order[node_index+1][j].setup_time = Curr_ptr->setup_time;
new_order[node_index+1][j].exe_time = Curr_ptr->exe_time;
new_order[node_index+1][j].breakdown_time = Curr_ptr->breakdown_time;
new_order[node_index+1][j].proc_type = Curr_ptr->proc_type;
new_order[node_index+1][j].start_time = Curr_ptr->start_time;
node_index++;
Head_ptr = &order[node_index][j];
Tail_ptr = &order[node_index+1][j];










node_type *Head_ptr, *Tail_ptr, *Curr_ptr;
long unused_exe_cycles = 0;
long total_unused_exe_cycles = 0;
for ( int j = 0; j < number_of_processors; j++ )
{
int node_index = 0;
Head_ptr = &new_order[node_index][j];
Tail_ptr = &new_order[node_index+1][j];
Curr_ptr = Tail_ptr;
unused_exe_cycles += Head_ptr->setup_time;
while ( Tail_ptr->node_id != NULL )
{
int swap_index = node_index;
if ( Head_ptr->exe_time < Tail_ptr->setup_time )
{
}
if ( Head_ptr->breakdown_time > Tail_ptr->exe_time )
{
int tail_index = swap_index + 2;
Tail_ptr = &order[tail_index][j];
unused_exe_cycles += Head_ptr->breakdown_time - Curr_ptr->exe_time +




Head_ptr = &new_order[node_index][j];
Tail_ptr = &new_order[node_index+1][j];
Curr_ptr = Tail_ptr;
}
if ( Tail_ptr->node_id == NULL && Head_ptr->node_id != NULL )
{ // last node. add breakdown
unused_exe_cycles += Head_ptr->breakdown_time;
}
total_unused_exe_cycles += unused_exe_cycles;
unused_exe_cycles = 0;
cout << "Total Unused execution cycles are: " << total_unused_exe_cycles << " cycles"
<< newln;




// Initialize values
long Largest_cyl_time = 0, total_slice_times = 0;
long idle_exe_cycles = 0, blocked_exe_cycles = 0;
long idle_ctrl_cycles = 0, blocked_ctrl_cycles = 0;
long blocked_proc_ctrl = 0, blocked_proc_exe = 0;
long idle_proc_exe = 0, idle_proc_ctrl = 0;
long total_idle_proc_exe = 0, total_idle_proc_ctrl = 0;
long total_blocked_proc_exe = 0, total_blocked_proc_ctrl = 0;
long exe_packing = 0, ctrl_packing = 0;
node_type *Head_ptr, *Tail_ptr, *Next_ptr;
ofstream cylTime_output_file;
cylTime_output_file.open("CYL TIMES.OUT");
if ( !cylTime_output_file )
{





if ( !slice_output_file )
{
cerr << "\nCannot Open file slice_time.out\n";
exit(0);
}
// calculate the number of nodes on each processor
for ( int k = 0; k < number_of_processors; k++ )
{
int index = 0;
Head_ptr = &new_order[index][k];
while ( Head_ptr->node_id != NULL )
{
index++;
Head_ptr = &new_order[index][k];
new_order[0][k].nodes_per_processor = index;
for ( int j = 0; j < number_of_processors; j++ )
{
Head_ptr = &new_order[0][j]; // point to first node
Head_ptr->start_time = 0;
int numNodes = Head_ptr->nodes_per_processor;
long cyl_time = Head_ptr->setup_time;
blocked_proc_ctrl = 0;
blocked_proc_exe = 0;
int FLAG = 0, STUFFED = 0, PUSHED = 0;
if ( numNodes == 1 ) // Only one node on processor
{
cyl_time += (Head_ptr->exe_time + Head_ptr->breakdown_time);
if ( Head_ptr->breakdown_time + Head_ptr->setup_time < Head_ptr->exe_time )
{
Head_ptr->end_time = cyl_time - Head_ptr->breakdown_time;
cyl_time -= Head_ptr->breakdown_time;
}
else {
for ( int i = 1; i < numNodes; i++ ) // More than one node
{
Tail_ptr = &new_order[i][j]; // point to next node
Tail_ptr->start_time = cyl_time;
Next_ptr = &new_order[i+1][j];
// Several conditions are possible which modify the way the blocked cycles are calculated
// Condition 1
if ( PUSHED && Next_ptr->node_id != 0 )
{
Head_ptr = &new_order[i][j];
Tail_ptr = &new_order[i+1][j];
++i;
Next_ptr = &new_order[i+1][j];
cyl_time += Head_ptr->setup_time;
Tail_ptr->start_time = cyl_time;
PUSHED = 0;
// PUSHED => A node's breakdown
// was greater than another node's
// exe time
// Condition 2
if ( !FLAG && !PUSHED ) // FLAG =>
{




cyl_time += Head_ptr->exe_time;
blocked_ctrl_cycles += Head_ptr->exe_time - Tail_ptr->setup_time;
blocked_proc_ctrl += Head_ptr->exe_time - Tail_ptr->setup_time;
cyl_time += Tail_ptr->setup_time;
blocked_exe_cycles += Tail_ptr->setup_time - Head_ptr->exe_time;
blocked_proc_exe += Tail_ptr->setup_time - Head_ptr->exe_time;
// Condition 3
if ( !STUFFED && !PUSHED ) // STUFFED => breakdown of a node and setup of
{ // next node occur during exe of another node
Head_ptr->end_time = cyl_time + Head_ptr->breakdown_time;
// Condition 4
if ( Head_ptr->breakdown_time < Tail_ptr->exe_time && !PUSHED )
{
cyl_time += Tail_ptr->exe_time;
FLAG = 0;
Tail_ptr->end_time = cyl_time + Tail_ptr->breakdown_time;
// Condition 5
if ( (Head_ptr->breakdown_time + Next_ptr->setup_time) < Tail_ptr->exe_time
&& Next_ptr->node_id != 0 )
{
FLAG = 1;
STUFFED = 1;
Next_ptr->start_time = cyl_time - Tail_ptr->exe_time + Tail_ptr->breakdown_time;
blocked_ctrl_cycles += Tail_ptr->exe_time - Head_ptr->breakdown_time -
Next_ptr->setup_time;





if ( Next_ptr->node_id != 0 )
{
STUFFED = 1;
FLAG = 1;
Next_ptr->start_time = cyl_time - Tail_ptr->exe_time + Tail_ptr->breakdown_time;
blocked_exe_cycles += Head_ptr->breakdown_time + Next_ptr->setup_time - Tail_ptr->exe_time;






cyl_time += blocked_proc_exe;
Tail_ptr->end_time += blocked_proc_exe;
if ( Next_ptr->node_id == 0 && PUSHED )
{
cyl_time += (Tail_ptr->setup_time + Tail_ptr->exe_time);
Tail_ptr->end_time = cyl_time + Tail_ptr->breakdown_time;
}







PUSHED = 1;
cyl_time += Head_ptr->breakdown_time;
Tail_ptr->end_time = cyl_time + Tail_ptr->breakdown_time;
cyl_time = Tail_ptr->end_time;
blocked_exe_cycles += Head_ptr->breakdown_time - Tail_ptr->exe_time;
blocked_proc_exe += Head_ptr->breakdown_time - Tail_ptr->exe_time;
Head_ptr = &new_order[i][j];
if ( numNodes != 1 && !PUSHED )
{
cyl_time += Tail_ptr->breakdown_time;
}
// Check for wrap-around condition
Head_ptr = &new_order[0][j]; // Point to first node on processor
Next_ptr = &new_order[1][j]; // Point to second node
if ( (Tail_ptr->breakdown_time + Head_ptr->breakdown_time + Next_ptr->setup_time)
<= (Head_ptr->exe_time + Next_ptr->exe_time)
&& Head_ptr->nodes_per_processor != 1 && !STUFFED && !PUSHED )
{
Tail_ptr->end_time = Head_ptr->setup_time + Next_ptr->setup_time +
Tail_ptr->breakdown_time;
cyl_time -= Tail_ptr->breakdown_time;
}




blocked_ctrl_cycles -= Tail_ptr->breakdown_time;
blocked_proc_ctrl -= Tail_ptr->breakdown_time;
else
Head_ptr->end_time = Tail_ptr->end_time + Head_ptr->breakdown_time;




node_type *Third_ptr = &new_order[2][j];
if ( (Tail_ptr->breakdown_time + Head_ptr->breakdown_time + Next_ptr->setup_time
+ Third_ptr->setup_time <= Head_ptr->exe_time + Next_ptr->exe_time) &&
Head_ptr->nodes_per_processor != 1 )
{
Tail_ptr->end_time = Head_ptr->setup_time + Next_ptr->setup_time +
Tail_ptr->breakdown_time;
cyl_time -= Tail_ptr->breakdown_time;
if ( Next_ptr->setup_time + Tail_ptr->breakdown_time < Head_ptr->
{










Head_ptr->end_time = Tail_ptr->end_time + Head_ptr->breakdown_time;
Third_ptr->start_time = Head_ptr->end_time;
if ( PUSHED )
{
Tail_ptr->end_time = Head_ptr->setup_time + Next_ptr->setup_time +
Tail_ptr->breakdown_time;





total_slice_times += cyl_time; // add all cylinder times
slice_output_file << cyl_time << endl;
total_blocked_proc_ctrl += blocked_proc_ctrl;
total_blocked_proc_exe += blocked_proc_exe;
cylTime_output_file << j+1 << setw(8) << cyl_time << endl;
Head_ptr = &new_order[0][j];
exe_packing = 0, ctrl_packing = 0;
for ( int p = 0; p < numNodes; p++ ) // Calculate exe and control unit packing
{
Head_ptr = &new_order[p][j];
exe_packing += Head_ptr->exe_time;
ctrl_packing += Head_ptr->setup_time + Head_ptr->breakdown_time;
// Calculate idle cycle times
idle_proc_exe = cyl_time - exe_packing - blocked_proc_exe;
idle_proc_ctrl = cyl_time - ctrl_packing - blocked_proc_ctrl;
total_idle_proc_exe += idle_proc_exe;
total_idle_proc_ctrl += idle_proc_ctrl;
generate_processor_stats(j+1, cyl_time, exe_packing, ctrl_packing, idle_proc_exe,
idle_proc_ctrl, blocked_proc_exe, blocked_proc_ctrl);
slice_output_file.close();
// Find the largest end time for all processors for "flat" cylinder
for ( int m = 0; m < number_of_processors; m++ )
{
int index = 0;
Head_ptr = &new_order[index][m];




if ( Largest_cyl_time < Head_ptr->end_time )
{
Largest_cyl_time = Head_ptr->end_time;
}
cylTime_output_file << endl << endl << Largest_cyl_time << endl;
Head_ptr = &new_order[index][m];
}
// Find idle time for "flat" cylinder






idle_exe_cycles, blocked_exe_cycles, total_idle_proc_exe, total_idle_proc_ctrl,
total_slice_times);
cylTime_output_file.close();






ofstream cylinder_output_file;
cylinder_output_file.open("cylinder.out");
if ( !cylinder_output_file )
{
cerr << "\nCannot Open file cylinder.out\n";
exit(0);
}
for ( int j = 0; j < number_of_processors; j++ ) // Print out node order
{
Head_ptr = &new_order[0][j];
int processorNodes = Head_ptr->nodes_per_processor;
cylinder_output_file << newln << processorNodes << endl << endl;
for ( int i = 0; i < processorNodes; i++ )
{
Head_ptr = &new_order[i][j];
cylinder_output_file << setw(7) << Head_ptr->node_id;
cylinder_output_file << setw(12) << Head_ptr->start_time;




cylinder_output_file << newln << newln << Largest_cyl_time << endl;
cylinder_output_file.close();
// Function to print individual processor statistics
void
node_alloc::generate_processor_stats(int procNum, long cyl_time, long exe_packing, long
ctrl_packing, long idle_proc_exe, long idle_proc_ctrl, long blocked_proc_exe, long
blocked_proc_ctrl)
{
ofstream processor_stats_file;
processor_stats_file.open("proc_stats.out", ios::app);
if ( !processor_stats_file )
{
cerr << "\nCannot Open file proc_stats.out\n";
exit(0);
}
processor_stats_file << "PROCESSOR UTILIZATION\n\n";
processor_stats_file << "NUMBER OF PROCESSORS: ";
processor_stats_file << setw(4) << number_of_processors << endl << endl;
processor_stats_file << "CYCLES PER WORD : ";
processor_stats_file << setw(4) << comm << endl << endl;
double ctrl_util_rate = (double)ctrl_packing/cyl_time*100.0;
double ctrl_idle_rate = (double)idle_proc_ctrl/cyl_time*100.0;
double ctrl_blocked_rate = (double)blocked_proc_ctrl/cyl_time*100.0;
double exe_util_rate = (double)exe_packing/cyl_time*100.0;
double exe_idle_rate = (double)idle_proc_exe/cyl_time*100.0;
double exe_blocked_rate = (double)blocked_proc_exe/cyl_time*100.0;
processor_stats_file << "PROCESSOR NUMBER : ";
processor_stats_file << setw(4) << procNum << endl;
processor_stats_file << "\nCONTROL UNIT UTILIZATION\n\n";
processor_stats_file.setf(ios::fixed);
processor_stats_file.setf(ios::showpoint);
processor_stats_file << "BEST CYLINDER PACKING (CONTROL TIME) : ";
processor_stats_file << setw(12) << ctrl_packing << endl << endl;
processor_stats_file << "END TIME OF ACTIVITIES : ";
processor_stats_file << setw(12) << cyl_time << endl << endl;
processor_stats_file << "Control Unit Utilization Rate : ";
processor_stats_file << setw(6) << setprecision(1);
processor_stats_file << ctrl_util_rate << "%\n";
processor_stats_file << "Control Unit Idle Rate : ";
processor_stats_file << setw(6) << setprecision(1);
processor_stats_file << ctrl_idle_rate << "%\n";
processor_stats_file << "Control Unit Blockage Rate : ";
processor_stats_file << setw(6) << setprecision(1);
processor_stats_file << ctrl_blocked_rate << "%\n\n\n\n\n";
processor_stats_file << "EXECUTION UNIT UTILIZATION\n\n";
processor_stats_file << "BEST CYLINDER PACKING (EXECUTION TIME) : ";
processor_stats_file << setw(12) << exe_packing << endl << endl;
processor_stats_file << "END TIME OF ACTIVITIES : ";
processor_stats_file << setw(12) << cyl_time << endl << endl;
processor_stats_file << "Execution Unit Utilization Rate : ";
processor_stats_file << setw(6) << setprecision(1);
processor_stats_file << exe_util_rate << "%\n";
processor_stats_file << "Execution Unit Idle Rate : ";
processor_stats_file << setw(6) << setprecision(1);
processor_stats_file << exe_idle_rate << "%\n";
processor_stats_file << "Execution Unit Blockage Rate : ";
processor_stats_file << setw(6) << setprecision(1);
processor_stats_file << exe_blocked_rate << "%\n\n\n";
processor_stats_file << endl;
// Function to print cylinder statistics
void
node_alloc::generate_statistics(long Largest_cyl_time, long idle_ctrl_cycles, long
blocked_ctrl_cycles, long idle_exe_cycles, long blocked_exe_cycles, long
total_idle_proc_exe, long total_idle_proc_ctrl, long total_slice_times)
{
long exe_cyl_packing = 0;
long ctrl_cyl_packing = 0;
node_type *Head_ptr;
ofstream statistics_output_file;
statistics_output_file.open("cyl_stats.out");
if ( !statistics_output_file )
{
cerr << "\nCannot Open file cyl_stats.out\n";
exit(0);
}
statistics_output_file << "PROCESSOR UTILIZATION\n\n";
statistics_output_file << "NUMBER OF PROCESSORS: ";
statistics_output_file << setw(4) << number_of_processors << endl << endl;
statistics_output_file << "CYCLES PER WORD : ";
statistics_output_file << setw(4) << comm << endl << endl;
for ( int j = 0; j < number_of_processors; j++ )
{
Head_ptr = &new_order[0][j];
int processorNodes = Head_ptr->nodes_per_processor;
for ( int i = 0; i < processorNodes; i++ ) // Calculate exe and ctrl unit packing per
{ // processor
Head_ptr = &new_order[i][j];
exe_cyl_packing += Head_ptr->exe_time;
ctrl_cyl_packing += Head_ptr->setup_time + Head_ptr->breakdown_time;
// Calculate values for "jagged" cylinder
long avg_slice_time = total_slice_times/number_of_processors;
long best_exe_packing = exe_cyl_packing/number_of_processors;
long best_ctrl_packing = ctrl_cyl_packing/number_of_processors;
long avg_ctrl_idle = (idle_ctrl_cycles/number_of_processors) - best_ctrl_packing;
long avg_ctrl_blocked = blocked_ctrl_cycles/number_of_processors;
long avg_exe_idle = (idle_exe_cycles/number_of_processors) - best_exe_packing;
long avg_exe_blocked = blocked_exe_cycles/number_of_processors;
long avg_ctrl_idle_jag = (total_idle_proc_ctrl/number_of_processors) -
best_ctrl_packing;
if ( avg_ctrl_idle_jag < 0 )
avg_ctrl_idle_jag = 0;
long avg_exe_idle_jag = (total_idle_proc_exe/number_of_processors) -
best_exe_packing;
if ( avg_exe_idle_jag < 0 )
avg_exe_idle_jag = 0;
// Calculate "flat" and "jagged" cylinder statistics
double exe_util_rate = (double)best_exe_packing/Largest_cyl_time*100.0;
double ctrl_util_rate = (double)best_ctrl_packing/Largest_cyl_time*100.0;
double ctrl_idle_rate = (double)avg_ctrl_idle/Largest_cyl_time*100.0;
double ctrl_blocked_rate = (double)avg_ctrl_blocked/Largest_cyl_time*100.0;
double exe_idle_rate = (double)avg_exe_idle/Largest_cyl_time*100.0;
double exe_blocked_rate = (double)avg_exe_blocked/Largest_cyl_time*100.0;
double ctrl_idle_rate_jag = (double)avg_ctrl_idle_jag/avg_slice_time*100.0;
double ctrl_util_rate_jag = (double)best_ctrl_packing/avg_slice_time*100.0;
double ctrl_blocked_rate_jag = (double)avg_ctrl_blocked/avg_slice_time*100.0;
double exe_util_rate_jag = (double)best_exe_packing/avg_slice_time*100.0;
double exe_idle_rate_jag = (double)avg_exe_idle_jag/avg_slice_time*100.0;
double exe_blocked_rate_jag = (double)avg_exe_blocked/avg_slice_time*100.0;
statistics_output_file << "\n\n\nCONTROL UNIT UTILIZATION\n\n";
statistics_output_file.setf(ios::fixed);
statistics_output_file.setf(ios::showpoint);
statistics_output_file << "BEST CYLINDER PACKING (CONTROL TIME) : ";
statistics_output_file << setw(12) << best_ctrl_packing << endl << endl;
statistics_output_file << "END TIME OF ACTIVITIES ('FLAT' CYLINDER) : ";
statistics_output_file << setw(12) << Largest_cyl_time << endl << endl;
statistics_output_file << "Control Unit Utilization Rate : ";
statistics_output_file << setw(6) << setprecision(1);
statistics_output_file << ctrl_util_rate << "%\n";
statistics_output_file << "Control Unit Idle Rate : ";
statistics_output_file << setw(6) << setprecision(1);
statistics_output_file << ctrl_idle_rate << "%\n";
statistics_output_file << "Control Unit Blockage Rate : ";
statistics_output_file << setw(6) << setprecision(1);
statistics_output_file << ctrl_blocked_rate << "%\n\n\n";
statistics_output_file << "END TIME OF ACTIVITIES ('JAGGED' CYLINDER): ";
statistics_output_file << setw(12) << avg_slice_time << endl << endl;
statistics_output_file << "Control Unit Utilization Rate : ";
statistics_output_file << setw(6) << setprecision(1);
statistics_output_file << ctrl_util_rate_jag << "%\n";
statistics_output_file << "Control Unit Idle Rate : ";
statistics_output_file << setw(6) << setprecision(1);
statistics_output_file << ctrl_idle_rate_jag << "%\n";
statistics_output_file << "Control Unit Blockage Rate : ";
statistics_output_file << setw(6) << setprecision(1);
statistics_output_file << ctrl_blocked_rate_jag << "%\n\n\n\n\n\n";
statistics_output_file << "EXECUTION UNIT UTILIZATION\n\n";
statistics_output_file << "BEST CYLINDER PACKING (EXECUTION TIME) : ";
statistics_output_file << setw(12) << best_exe_packing << endl << endl;
statistics_output_file << "END TIME OF ACTIVITIES ('FLAT' CYLINDER) : ";
statistics_output_file << setw(12) << Largest_cyl_time << endl << endl;
statistics_output_file << "Execution Unit Utilization Rate : ";
statistics_output_file << setw(6) << setprecision(1);
statistics_output_file << exe_util_rate << "%\n";
statistics_output_file << "Execution Unit Idle Rate : ";
statistics_output_file << setw(6) << setprecision(1);
statistics_output_file << exe_idle_rate << "%\n";
statistics_output_file << "Execution Unit Blockage Rate : ";
statistics_output_file << setw(6) << setprecision(1);
statistics_output_file << exe_blocked_rate << "%\n\n\n";
statistics_output_file << "END TIME OF ACTIVITIES ('JAGGED' CYLINDER): ";
statistics_output_file << setw(12) << avg_slice_time << endl << endl;
statistics_output_file << "Execution Unit Utilization Rate : ";
statistics_output_file << setw(6) << setprecision(1);
statistics_output_file << exe_util_rate_jag << "%\n";
statistics_output_file << "Execution Unit Idle Rate : ";
statistics_output_file << setw(6) << setprecision(1);
statistics_output_file << exe_idle_rate_jag << "%\n";
statistics_output_file << "Execution Unit Blockage Rate : ";
statistics_output_file << setw(6) << setprecision(1);
statistics_output_file << exe_blocked_rate_jag << "%\n\n";
statistics_output_file << flush;
// Function to reorder the ORDER.IN file sequentially




int i, SWAPPED; 
orderJILtype TEMP_NODE; 
ifstream order_input_file;






ofstream order_output_file;
order_output_file.open("TEMP_ORDER.IN");




// Order Nodes in order of increasing start time
SWAPPED = 1;
while ( SWAPPED )
{
SWAPPED = 0;
for ( i = 0; i < number_of_nodes; i++ )
{
if ( ORDER[i].start_time > ORDER[i+1].start_time ) // Reorder nodes
{
TEMP_NODE = ORDER[i];
ORDER[i] = ORDER[i+1];
ORDER[i+1] = TEMP_NODE;
SWAPPED = 1;
// Put reordered nodes into output file
for ( i = 0; i <= number_of_nodes; i++ )
{
if ( ORDER[i].node_id != NULL )
{ 




system("mv ORDER.IN ORDER_IN.ORG");
system("mv TEMP_ORDER.IN ORDER.IN");
system("mv NODES.IN NODES_SNB.IN");
system("mv NODES_IN.ORG NODES.IN");
// end of program
APPENDIX B: PROGRAM USER'S MANUAL
I. NODE SCHEDULING PROGRAM
This section describes a Large Grain Data Flow node-to-processor scheduling program
(referred to as SCHEDULE) which provides a detailed node-to-processor scheduling of a
data flow graph using the model described in [Ref. 4]. The program uses a two
dimensional array to represent the revolving cylinder and generates the order in which the
nodes should enter the system, based on input data files and data provided by the user. The
program also determines whether the breakdown time of the last node on a processor can be
'wrapped-around' to provide an accurate modeling of the revolving cylinder. This mapping is
only concerned with the arithmetic processors and the program nodes. Therefore, input and
output nodes and the input/output processors described in [Ref. 4] are not included in this
scheduling program or associated data files. This program must be run prior to executing the
mapping program discussed in Section II. This program begins execution with the command
'schedule'.
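The wrap-around test can be sketched as follows, based on the condition that appears in the Appendix A listing: the breakdown of the last node may wrap around when the last node's breakdown, the first node's breakdown and the second node's setup together fit within the execution times of the first two nodes. The struct and function names are hypothetical, and the sketch leaves out the STUFFED and PUSHED flags that the listing also checks.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-node timing record (all times in cycles).
struct NodeTimes {
    long setup_time;
    long exe_time;
    long breakdown_time;
};

// Returns true when the last node's breakdown can be wrapped around to
// overlap the execution of the first two nodes on the same processor.
// Simplification: the STUFFED/PUSHED flags from the listing are not modeled.
bool last_breakdown_can_wrap(const std::vector<NodeTimes>& nodes_on_processor)
{
    if (nodes_on_processor.size() < 2) return false;  // needs more than one node

    const NodeTimes& head = nodes_on_processor[0];
    const NodeTimes& next = nodes_on_processor[1];
    const NodeTimes& tail = nodes_on_processor.back();

    return tail.breakdown_time + head.breakdown_time + next.setup_time
           <= head.exe_time + next.exe_time;
}
```

When the test holds, the listing shortens the cylinder time for that processor by the breakdown time of the last node.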
A. USER INTERFACE
The following inputs and options are available to the user:
1. SCHEDULER LATENCY TIME
A number which abstractly represents the time it takes the scheduler to change the
state of its local memory when amounts on a queue are modified due to node input or
output.
2. COMMUNICATION TIME FOR ONE WORD
This is the time to transmit one word of data between a memory module and a
processor.
B. INPUT FILES
1. Input File: NODES.IN
This file contains the initial node information required for mapping. The number
of nodes parameter is an individual element. The remaining parameters exist for each node
in the graph.
a. Number of Nodes
This is the total number of nodes in the data flow graph.
b. Node ID
This is the node identifier number.
c. Instruction Size
This is the node instruction size parameter in words.
d. Setup Time
This is the node setup time in cycles.
e. Execution Time
This is the node execution time parameter in cycles.
f. Breakdown Time
This is the node breakdown time parameter in cycles.
g. Processor Type
This is the processor number that the node will be assigned to.
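As an illustration of this layout, the sketch below reads a NODES.IN style stream: a node count followed by six whitespace-separated fields per node, in the order listed above (the same order the Appendix A listing uses when reading the file). The `NodeRecord` struct and `read_nodes` function are hypothetical names, and the error handling is simplified compared with the program.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical record mirroring the per-node fields of NODES.IN.
struct NodeRecord {
    int  node_id;
    int  instr_size;      // words
    long setup_time;      // cycles
    long exe_time;        // cycles
    long breakdown_time;  // cycles
    int  proc_type;       // processor the node is assigned to
};

// Reads the node count, then one record per node. Stops early and returns
// the records read so far if the stream is malformed or truncated.
std::vector<NodeRecord> read_nodes(std::istream& in)
{
    std::vector<NodeRecord> nodes;
    int count = 0;
    if (!(in >> count) || count < 0) return nodes;

    for (int i = 0; i < count; ++i) {
        NodeRecord n{};
        if (!(in >> n.node_id >> n.instr_size >> n.setup_time
                 >> n.exe_time >> n.breakdown_time >> n.proc_type)) {
            break;
        }
        nodes.push_back(n);
    }
    return nodes;
}
```

For example, passing a `std::istringstream` holding `2`, then `1 0 5 100 7 1`, then `2 0 3 50 4 2` yields two records, the second with an execution time of 50 cycles on processor 2.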
2. Input File: QUEUES.IN 
This file contains the initial queue information required for scheduling. The
number of queues parameter is an individual element. The remaining parameters exist for
each queue in the graph.
a. Number of Queues
This is the total number of queues in the data flow graph.
b. Queue ID
This is the queue identifier number.
c. Source Node
This is the node ID for the node at the tail of the queue.
d. Sink Node
This is the node ID for the node at the head of the queue.
e. Write Amount
This is the queue write amount parameter in words.
f. Read Amount
This is the queue read amount parameter in words.
3. Input File: PROCS.IN
This file is fully described in Section II. The only data taken from this file is the
number of processors parameter.
C. OUTPUT FILES 
Many output files are created for input to the mapping program. 
1. Output File: ORDER.IN 
This file is the mapping order of the nodes. The mapping occurs in the order the 
nodes are listed. 
a. Node ID
This is the node identifier of the next node to enter the system.
b. Time into System
This is the time when the node will be available to be mapped. Normally,
all nodes will have a time of '0' which means all nodes are available to be mapped
simultaneously from the start time.
2. Output File: NODES_SNB.OUT
This file is similar in format to the NODES.IN file but also contains the calculated
values of setup and breakdown for the nodes in the system based on the user input. This
file is not used by the mapper program; it is for user information only.
3. Output File: cylinder.out
This file is a representation of the mapping of the cylinder. It is in the same
format as the file 'cylinder.dat' which is described fully as an input to the synchronization
arc generator (SAG) program; however, this file takes into account the possibility of
'wrap-around' of the breakdown of the last node on a processor. The name of this file
must be changed to 'cylinder.dat' before using it for input to the SAG program.
4. Output File: cyl_stats.out
In this file are several percentages to express the efficiency of the mapping. Two
sets of statistics are given. In the first, the largest completion time over all processors is
computed and all processors are assumed to run to this time ("flat cylinder"). The statistics
are then computed over the total processor-time required by the mapping. This is the
largest completion time over all processors multiplied by the number of processors. In the
second set ("jagged cylinder"), each processor completion time is calculated individually
and the statistics computed for each processor; the average is then taken of the individual
processor statistics.
a. Control Unit and Execution Unit Utilization Rate 
This refers to the total percentage of processor-time that the specified unit 
(control or execution) is performing useful work, either input or output for the control unit 
or execution for the execution unit. 
b. Control Unit and Execution Unit Blockage Rate 
This refers to the total percentage of processor-time that the specified unit 
(control or execution) is blocked, i.e., the unit has completed the specific task, but the node 
cannot switch to the other unit as the other unit is currently busy.
c. Control Unit and Execution Unit Idle Rate
This refers to the total percentage of processor-time that the specified unit
(control or execution) has no node assigned.
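The difference between the "flat cylinder" and "jagged cylinder" statistics can be illustrated with a short sketch. The Python fragment below is not taken from the scheduler program; the function names and the per-processor (busy time, completion time) input layout are assumptions made only for illustration.

```python
from typing import List, Tuple

# Each entry is (busy_time, completion_time) for one processor, in cycles.
# This input layout is hypothetical; the scheduler's internal data may differ.
ProcTimes = List[Tuple[int, int]]


def flat_utilization(procs: ProcTimes) -> float:
    """Utilization when every processor is assumed to run to the largest completion time."""
    longest = max(done for _, done in procs)
    total_processor_time = longest * len(procs)
    return 100.0 * sum(busy for busy, _ in procs) / total_processor_time


def jagged_utilization(procs: ProcTimes) -> float:
    """Average of the per-processor utilizations, each over its own completion time."""
    rates = [100.0 * busy / done for busy, done in procs]
    return sum(rates) / len(rates)
```

For two processors with (busy, completion) times of (80, 100) and (40, 50) cycles, the flat figure is 60 percent and the jagged figure is 80 percent, since the shorter processor is no longer charged for the time it waits on the longer one.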
5. Output File: proc_stats.out 
The processor statistics are outlined in this file. The statistic listings are
essentially the same as for 'cyl_stats.out', except that the statistics are computed over one
processor vice an average over all processors. Each processor is also treated as a 'jagged'
slice, that is, no attempt is made to find the greatest completion time of all processor slices;
the statistics are calculated based on the final completion time for each individual processor.
II. LGDF MAPPING PROGRAM 
This section describes a Large Grain Data Flow mapping program (referred to as MAP)
which provides a detailed multiprocessor mapping of a data flow graph using the model
described in [Ref. 4]. The program is time driven. As events are scheduled to occur, the
event with the lowest time stamp will set the next time flag. When this flagged time occurs,
all nodes are checked for the next event to occur. A set of lists track which nodes are in the
various states of processing. This mapping is only concerned with the arithmetic
processors and the program nodes. Therefore, input and output nodes and the input/output
processors described in [Ref. 4] are not included in this mapping program or associated
data files. This program must be run prior to executing the synchronization arc generator
program or the simulator program discussed in Sections III and IV, respectively. This
program begins execution with the command 'map'.
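The time-driven behavior described above, in which the pending event with the lowest time stamp determines the next time to be processed, can be sketched as follows. This is an illustrative fragment only and not the MAP source; the function name, the (time, label) event representation, and the use of a binary heap are assumptions.

```python
import heapq
from typing import List, Tuple


def run_events(events: List[Tuple[int, str]]) -> List[Tuple[int, List[str]]]:
    """Process (time, label) events in time order, grouping events that share a time stamp."""
    heap = list(events)
    heapq.heapify(heap)
    trace = []
    while heap:
        next_time = heap[0][0]  # the lowest time stamp sets the next time flag
        due = []
        # Collect every event that falls on the flagged time.
        while heap and heap[0][0] == next_time:
            due.append(heapq.heappop(heap)[1])
        trace.append((next_time, due))
    return trace
```

In this sketch, events with equal time stamps are returned in label order because the heap compares whole tuples; the actual program may order simultaneous events differently.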
A. USER INTERFACE 
The following inputs and options are available to the user:
1. SCHEDULER LATENCY TIME
COMMUNICATION TIME FOR ONE WORD
These inputs were fully discussed in Section I.
2. INTERACTIVE INTERFACE 
The user can select whether or not to use the interactive interface. The interface 
will allow the user to see the current state of the system at any time. Also, the user can 
adjust the operation of the system by manipulating nodes which are waiting to begin 
processing. 
B. INPUT FILES
1. Input File: NODES.IN
Input File: QUEUES.IN
These files were also previously described in Section I.
2. Input File: CHAINS.IN 
This file contains the initial chain information required for mapping. The number
of chains parameter is an individual element. The remaining parameters exist for each chain
in the graph. Note that this file is required to exist, or execution will fail. If there are no
chains, then simply have '0' as the only entry in the file.
a. Number of Chains
This is the total number of chains in the system. 
b. Chain ID 
This is the chain identifier number. 
c. Chained Nodes 
The node IDs for the nodes included in the chain are listed in the order of 
chaining. A '0' is used to identify the end of the node list for the chain. 
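The CHAINS.IN layout described above (a chain count, followed for each chain by its ID and a node list terminated by '0') could be read as in the following sketch. The whitespace-separated token layout and the function name are assumptions; the fragment is not part of the MAP program.

```python
from typing import Dict, List


def parse_chains(text: str) -> Dict[int, List[int]]:
    """Read a chain count, then for each chain its ID and a 0-terminated node list."""
    tokens = [int(t) for t in text.split()]
    count, pos = tokens[0], 1
    chains: Dict[int, List[int]] = {}
    for _ in range(count):
        chain_id = tokens[pos]
        pos += 1
        nodes = []
        while tokens[pos] != 0:  # a '0' marks the end of the node list
            nodes.append(tokens[pos])
            pos += 1
        pos += 1  # skip the terminating '0'
        chains[chain_id] = nodes
    return chains
```

A file containing only '0' yields an empty dictionary, which corresponds to the no-chains case noted above.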
3. Input File: PROCS.IN 
The following information describes the hardware configuration.
a. Number of Arithmetic Processors
This is the total number of arithmetic processors in the system.
b. Processor Type
The processor type is listed for the number of processors in the system. For
example, if there are three processors, the numbers 1, 2, and 3 will be listed in a single
column.
4. Input File: ORDER.IN 
This file is the mapping order of the nodes. The mapping occurs in the order the 
nodes are listed. This file can be created manually by the user or can be generated using the 
scheduler program. This file is fully described as an output to the scheduler program in
Section I. 
C. OUTPUT FILES
Many output files are created for complete information on the mapping.
1. Output Files: CON_EXE.OUT, CON_UNIT.OUT, EXE_UNIT.OUT
These three files provide an exact mapping of the nodes on the processors. The
events occurring at a specific time and the nodes involved are depicted. A key to the
markings is listed in each file. File 'CON_EXE.OUT' provides a complete mapping file,
file 'EXE_UNIT.OUT' is a mapping of the execution units only and the file
'CON_UNIT.OUT' is a mapping of the control units only. These output file listings do
not take into account 'wrap-around' of the last node's breakdown time.
In each file are several percentages to express the efficiency of the mapping. An
important note about the statistics is that they are computed over the total processor-time
required by the mapping. This is the time to complete the mapping multiplied by the
number of processors. The percentages are therefore essentially an average of the
individual processor rates.
a. Processor Utilization Rate
This refers to the total percentage of processor-time that a processor is
performing some activity in either the control unit or the execution unit.
b. Processor Idle Rate
This refers to the total percentage of processor-time that a processor is not
performing any activity.
c. Control Unit and Execution Unit Utilization Rate
This refers to the total percentage of processor-time that the specified unit
(control or execution) is performing useful work, either input or output for the control unit
or execution for the execution unit.
d. Control Unit and Execution Unit Blockage Rate
This refers to the total percentage of processor-time that the specified unit
(control or execution) is blocked, i.e., the unit has completed the specific task, but the node
cannot switch to the other unit as the other unit is currently busy.
e. Control Unit and Execution Unit Idle Rate
This refers to the total percentage of processor-time that the specified unit
(control or execution) has no node assigned.
2. Output File: SUMMARY.OUT 
This file summarizes the number of processors in particular states at any given
time. The event times in the three previous mapping files will match with this file. The
processor utilization percentages are displayed.
3. Output Files: NODES.OUT, PROCS.OUT, CHAINS.OUT 
These three files provide extremely detailed data on specific nodes, processors, 
and chains. The lines are well described within the output listings. Most of the items can 
be cross-referenced to other files. 
4. Output File: cylinder.dat 
This file is a representation of the mapping of the cylinder. It is described fully 
as an input to the synchronization arc generator (SAG) program. The inclusion of this file 
is to provide the data necessary to run SAG based on the mapping generated by this 
program without any adjustments. 
D. SELECTION OF THE USER INTERFACE OPTION 
The selection of the user interface option will allow the user to observe and
interactively change the mapping as it progresses. However, once the mapping has
progressed past an event, it is not possible to go back and make a change. The interactive
interface is very descriptive. The user can view many aspects of the system and make
many changes during any pause. Selecting the 'CONTINUE WITH NEXT EVENT' option will
allow the mapping to continue. To discontinue the use of the interactive interface, select the
'CHANGE INTERRUPT STRATEGY' followed by 'CONTINUE TO CONCLUSION'
followed by 'CONTINUE WITH EVENT' options. This will allow the mapping to
complete.
III. LGDF SYNCHRONIZATION ARC GENERATOR
This section describes a Large Grain Data Flow model synchronization arc generator
program (referred to as SAG). This program acts as a preprocessor to the simulator
program (SIM). Its purpose is to modify the input files to SIM to be able to analyze the
revolving cylinder (RC) method as described in [Ref. 4]. SAG makes extensive use of
linked lists. SAG is started with the command 'generate'.
A. USER INTERFACE 
The user has a choice of one of two arc generation techniques in SAG. Both 
techniques are described fully in [Ref. 4].
1. Start After Finish (SAF)
This selection will determine the synchronization arcs based on the start after
finish technique.
2. Start After Start (SAS)
This selection will determine the synchronization arcs based on the start after start
technique.
B. INPUT FILES 
1. Input File: nodes.dat 
This file is a tabular listing which completely describes the nodes of a data flow 
graph. The number of nodes parameter is an individual element. The remaining 
parameters exist for each node in the graph. 
a. Number of Nodes
This is the total number of nodes in the data flow graph. This initializes the
counters necessary to read in the node data.
b. Node ID 
This is the node identifier number, which must be unique for each node in
the system. Do not use '0' as a node ID.
c. Node Type
This identifies the type of node. This type defines how the node will be
handled in the programs.
(1) node type = 0: normal node
(2) node type = 1: input node
(3) node type = 2: output node
d. Instruction Size
This is the node instruction size parameter in words.
e. Execution Time
This is the node execution time parameter in cycles.
f. Setup Time
This is the node setup time parameter in cycles.
g. Breakdown Time 
This is the node breakdown time parameter in cycles. 
h. Required Processor Type 
This is the type of processor required by the node. A listing of '100' 
identifies an input/output processor. 
i. Alternate Processor Type 
This is the alternate processor type to be used if the required processor type 
is unavailable. In most cases, the alternate is the same as the required processor type.
j. Memory Module Assignment 
This is the memory module assignment for the node if the user defined 
memory assignment option is chosen. 
k. Node Priority 
This is the assignment priority associated with the node if the user defined
priority option is chosen. A lower number represents a higher priority.
2. Input File: queues.dat
This file is a tabular listing which completely describes the queues of a data flow 
graph. The number of queues parameter is an individual element. The remaining 
parameters exist for each queue in the graph. 
a. Number of Queues 
This is the total number of queues in the system. This initializes the 
counters necessary to read in the queue data. 
b. Queue ID 
This is the queue identifier number, which must be unique for each queue in
the system. Do not use '0' as a queue ID.
c. Queue Type
This identifies the type of queue. The type defines how the queue will be
handled in the programs.
(1) queue type = 0: data queue
(2) queue type = 1: input queue
(3) queue type = 2: output queue
(4) queue type = 3: synchronization arc
d. Source Node
This is the node ID for the node at the tail of the queue.
e. Sink Node
This is the node ID for the node at the head of the queue. 
f. Write Amount 
This is the queue write amount parameter in words. 
g. Read Amount 
This is the queue read amount parameter in words.
h. Produce Amount
This is the queue produce amount parameter in words.
i. Consume Amount
This is the queue consume amount parameter in words.
j. Threshold Amount
This is the queue threshold amount parameter in words.
k. Initial Length
This is the queue initial length parameter in words.
l. Capacity
This is the queue capacity parameter in words.
m. Memory Module Assignment
This is the memory module assignment for the queue if the user defined
memory assignment option is chosen.
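The queue parameters listed above can be collected into a single record, as in the following sketch. It assumes that each queue occupies one whitespace-separated line with the twelve fields in the order given in this section, which is consistent with the sample queues.dat listing in this thesis; the class and function names are hypothetical and are not taken from SAG or SIM.

```python
from dataclasses import dataclass
from enum import IntEnum


class QueueType(IntEnum):
    DATA = 0
    INPUT = 1
    OUTPUT = 2
    SYNC_ARC = 3


@dataclass
class Queue:
    queue_id: int
    queue_type: QueueType
    source: int
    sink: int
    write: int
    read: int
    produce: int
    consume: int
    threshold: int
    initial_length: int
    capacity: int
    memory_module: int


def parse_queue(line: str) -> Queue:
    """Convert one record (assumed to be twelve integers on a line) into a Queue."""
    fields = [int(f) for f in line.split()]
    if len(fields) != 12:
        raise ValueError("expected 12 fields, found %d" % len(fields))
    if fields[0] == 0:
        raise ValueError("'0' must not be used as a queue ID")
    fields[1] = QueueType(fields[1])  # rejects queue types outside 0..3
    return Queue(*fields)
```

For example, the record `7 0 105 107 4096 4096 4096 4096 4096 0 32768 1` parses as a data queue from node 105 to node 107 with a capacity of 32768 words.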
3. Input File: machine.cfg
This file defines the system hardware configuration.
a. Number of Memory Modules
This is the number of memory modules to be modeled in the simulator.
b. Number of Input / Output Processors
This is the number of input / output (I/O) processors to be modeled in the
simulator. Normally there is only one I/O processor.
c. Number of Arithmetic Processors
This is the number of arithmetic processors in the system.
d. Processor Types
This is a list of the types of processors defined, with the number of elements
in the list equal to the number of processors, excluding I/O processors which are
automatically defined as '100'. If synchronization arcs without nodes bound to processors
are desired, the user should enter a '0' for each element. If, however, the user desires
synchronization arcs generated with nodes bound to processors, each element should
correspond to a processor type. For example, if there are three processors, the numbers 1,
2, and 3 should be listed in a column.
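The two cases for the processor type list can be summarized in a small sketch. The function below is hypothetical and only reproduces the pattern of the three-processor example given above; it is not part of SAG or SIM.

```python
from typing import List


def processor_type_entries(num_arithmetic: int, bound: bool) -> List[int]:
    """Return one machine.cfg processor-type entry per arithmetic processor."""
    if bound:
        # Mirrors the three-processor example in the text: 1, 2, 3, ...
        return list(range(1, num_arithmetic + 1))
    # A '0' for each element requests arcs without nodes bound to processors.
    return [0] * num_arithmetic
```

With three arithmetic processors, the bound case gives 1, 2, 3 and the unbound case gives 0, 0, 0, each written one entry per line in the file.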
4. Input File: cylinder.dat 
This file is a representation of the mapping of nodes on the processors. If an 
analysis of a cylinder with no wrap-around is desired, this file will be generated by the 
external mapping program (MAP). If an analysis of a cylinder with wrap-around is 
desired, this file is generated by the scheduler program, after the filename is modified from
'cylinder.out'.
a. Number of Nodes on a Processor
For each arithmetic processor in the system, the number of nodes which
used that processor is given. Following the node total, the following data is provided for
each node on the given processor:
(1) Node ID
(2) The node start time on the processor
(3) The node finish time on the processor
b. Cylinder Size
Following the listing of the nodes, the time to complete the cylinder slice is 
given. This is equal to the longest processor busy time of all the processors in the system. 
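The layout of 'cylinder.dat' described above could be read as in the following sketch. It assumes whitespace-separated integer tokens and that the number of arithmetic processors is known from the hardware configuration file; the function name and return structure are hypothetical.

```python
from typing import List, Tuple

NodeSlot = Tuple[int, int, int]  # (node ID, start time, finish time)


def parse_cylinder(text: str, num_processors: int) -> Tuple[List[List[NodeSlot]], int]:
    """Read the per-processor node listings, then the trailing cylinder size."""
    tokens = [int(t) for t in text.split()]
    pos = 0
    slices: List[List[NodeSlot]] = []
    for _ in range(num_processors):
        count = tokens[pos]  # number of nodes which used this processor
        pos += 1
        nodes = []
        for _ in range(count):
            node_id, start, finish = tokens[pos:pos + 3]
            nodes.append((node_id, start, finish))
            pos += 3
        slices.append(nodes)
    cylinder_size = tokens[pos]  # time to complete the cylinder slice
    return slices, cylinder_size
```

The returned list has one entry per arithmetic processor, in file order, so the start and finish times can be compared against the cylinder size when checking a mapping by hand.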
C. OUTPUT FILES 
Many output files are created for complete information on the mapping.
1. Output File: queues.dat
This file has the same format as described previously for 'queues.dat'.
However, synchronization arcs have been appended to the end of the file as determined by
this program. This adjusted 'queues.dat' file may now be used by the simulator to analyze
the revolving cylinder (RC) scheduling techniques.
2. Output File: oqueues.dat
This file is a copy of the original 'queues.dat'. Since this program modifies the
file 'queues.dat', this file will allow for easy recovery back to the original graph
description, prior to the addition of synchronization arcs.
3. Output File: indexcyl.out
This file provides the same information as the 'cylinder.dat' file. In addition, the
appropriate index for the node is provided as described in [Ref. 4].
4. Output File: tokens.out
This file lists important information about the synchronization arcs, including the
source node, sink node, initial length (number of tokens), threshold amount, consume
amount, and produce amount.
5. Temporary File: rqueues.tmp
This file is a temporary file created during execution which will provide no useful
information to the user.
IV. LGDF SIMULATOR 
This section describes a simulator (referred to as SIM) for a Large Grain Data Flow
model described in [Ref. 4]. SIM is an event-driven program that makes extensive use of
linked lists. SIM is started with the command 'simulate'.
A. USER INTERFACE 
There are many inputs and options available to the user. They are presented below 
exactly as they appear in the program. 
1. COMMENT LINE 
This is a comment which will be displayed at the head of the data set in the
statistics file to enable the user to easily distinguish the file output. Results from successive
executions of SIM can be dumped to a single file without losing track of the data sets.
2. THE INSTANCE NUMBER TO START GATHERING RESULTS 
This is the input instance of the graph to start gathering throughput and utilization 
results from the simulation. 
3. THE INSTANCE NUMBER TO TERMINATE THE SIMULATION 
This is the output instance, which when completed, will tenninate the simulation. 
4. SCHEDULER LATENCY TIME (cycles) 
This is the scheduler latency for any queue variations in the scheduler internal
memory. This could be the time taken by the scheduler to manipulate its internal data
structures.
5. COMMUNICATION TIME FOR ONE WORD (cycles) 
This is the time to transmit one word of data between a memory unit and a
processor across the data transfer network. 
6. DATA RATE OPTION 
Two options are available:
a. User Defined
The user will be prompted for further input of the time interval which will
pass after the input data for one graph iteration are entered into the system until the input
data for the next graph iteration are entered into the system. The prompt seen by the user
is: ENTER THE DATA PERIOD BEFORE THE NEXT GRAPH ITERATION (cycles).
b. Maximum Throughput
The simulator will generate data for consecutive graph iterations to ensure
that the input queue is constantly filled. This will drive the machine at its maximum
throughput. This effectively permits the user to determine the upper bound in the input data
rate for the given configuration.
7. MEMORY MAPPING OPTIONS 
Two options are available: 
a. User Defined Mapping
This option will map nodes and queues to memory modules as defined in
the nodes.dat and queues.dat files.
b. Arbitrary Mapping
The simulator will arbitrarily assign nodes and queues to memory modules. 
8. NODES ON READY LIST OPTION 
Two options are available: 
a. Only One Node Instance can be on Ready List 
Only one instance of a node can be maintained on the ready list at any given 
time. 
b. Multiple Node Instances can be on Ready List
Multiple instances of a node can be maintained on the ready list at any given 
time. However, only one instance of the node can be processing. 
9. NODES EXECUTION PRIORITY OPTIONS 
Several options are available to place nodes in the ready list:
a. No Priority
Nodes are executed on a First-Come-First-Served (FCFS) basis, i.e.,
according to the order in which they are ready.
b. User Defined
The node priorities are as defined in the file 'nodes.dat'. This allows the
user to designate critical nodes to be assigned to a processor immediately when data is
available.
c. Shortest Execution Time First
A ready node with a shorter execution time will be assigned before a ready
node with a longer execution time.
d. Longest Execution Time First
A ready node with a longer execution time will be assigned before a ready
node with a shorter execution time. 
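The four priority options amount to four different sort keys on the ready list, which the following sketch illustrates. The record fields, option names, and function name are assumptions made for illustration; SIM itself is described as using linked lists, and its tie-breaking rules are not stated here.

```python
from typing import List, NamedTuple


class ReadyNode(NamedTuple):
    node_id: int
    ready_time: int  # time at which the node became ready, in cycles
    exec_time: int   # node execution time, in cycles
    priority: int    # user defined priority; a lower number is a higher priority


def order_ready_list(nodes: List[ReadyNode], option: str) -> List[ReadyNode]:
    """Order ready nodes under one of the four options (hypothetical option names)."""
    keys = {
        "fcfs": lambda n: n.ready_time,
        "user": lambda n: n.priority,
        "shortest": lambda n: n.exec_time,
        "longest": lambda n: -n.exec_time,
    }
    # sorted() is stable, so nodes that tie keep their original relative order.
    return sorted(nodes, key=keys[option])
```

Only the ordering of the ready list changes between options; the node parameters themselves are the same in every case.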
B. INPUT FILES 
Three input files are required by the simulator.
1. Input File: nodes.dat
The contents of this input file are described fully as an input to the
synchronization arc generator program in Section III.
2. Input File: queues.dat
The contents of this input file are described fully as an input to the
synchronization arc generator program in Section III.
3. Input File: machine.cfg
The contents of this input file are described fully as an input to the
synchronization arc generator program in Section III.
C. OUTPUT FILES 
Three output files are generated by this program.
1. Output File: starts.out
This is a listing of the graph instance and the start times of those instances being
measured.
2. Output File: endtimes.out
This is a listing of the graph instance and end times of those instances being
measured.
3. Output File: stats.out
This file summarizes the data from a given simulation. The same information is
displayed to the standard output upon program completion. Note that this file is an
appended file, so additional simulation results are added to the end of the file which enables
easier comparison of multiple tests. The following data is provided:
a. COMMENT
The comment line input by the user.
b. DATA RATE OPTION
The number corresponding to the choice made at the start of the program.
c. MEMORY MAPPING OPTION
The number corresponding to the choice made at the start of the program.
d. NODES ON READY LIST OPTION
The number corresponding to the choice made at the start of the program.
e. NODES EXECUTION PRIORITY OPTION
The number corresponding to the choice made at the start of the program.
f. START INSTANCE 
The data flow graph instance where measurements were started. 
g. END INSTANCE 
The data flow graph instance which tenninated the simulation. 
h. SCHEDULER LATENCY TIME (cycles) 
The scheduler latency for any queue adjustment in the scheduler internal
memory.
i. COMMUNICATION TIME FOR ONE WORD (cycles)
The communication time to transfer one word of data between a memory
module and a processor.
j. ITERATION DATA PERIOD (cycles) 
The time differential for the input of consecutive data flow graph iterations, 




k. PROCESSOR DATA 
For each processor in the system, the following data is provided: 
(1) ID - the processor identifier
(2) TYPE - the processor type (100 refers to I/O processors)
(3) UTILIZATION - the overall processor utilization rate
(4) EXECUTION - the utilization rate of the execution unit
(5) EXE/UTIL - the amount of execution as part of the overall utilization
(6) UTIL-EXE - the amount of communication not overlapped with execution
l. AVERAGE PROCESSOR UTILIZATION
The average amount over all arithmetic processors (excluding I/O) of
processor utilization during the period measurements are taken.
m. AVERAGE PROCESSOR EXECUTION 
The average amount over all arithmetic processors (excluding I/O) of 
execution unit utilization during the period measurements are taken. 
n. AVERAGE EXECUTION / UTILIZATION RATE
The average amount over all arithmetic processors (excluding I/O) of
execution unit utilization as a portion of total utilization during the period measurements are
taken.
o. AVERAGE NON-OVERLAPPED COMMUNICATION RATE
The average amount over all arithmetic processors (excluding I/O) of
communication not overlapped with execution during the period measurements are taken.
p. NORMALIZED DATA RATE
The rate of input into the system compared to the optimum execution
completion time. A value of '0' means the optimum throughput option was chosen.
q. SIMULATION TIME (cycles)
The total time in cycles for the simulation to run to completion.
r. AVERAGE RESPONSE TIME (cycles)
The average length of time over all measured graph instances for a graph
instance to complete.
s. AVERAGE THROUGHPUT (Instances per Megacycle)
The average number of data flow graphs to be completed per million cycles
during the time measurements are taken.
t. INSTANCE LENGTH STANDARD DEVIATION
The standard deviation of the completion time of the measured instances.
u. COEFFICIENT OF VARIATION
The instance length standard deviation divided by the average response time.
v. I/O COMMUNICATION TIME FOR ONE GRAPH INSTANCE
The required communication time (in cycles) for one data flow graph
instance which occurs on the I/O processors.
w. I/O CALCULATION TIME FOR ONE GRAPH INSTANCE
The required calculation time (in cycles) for one data flow graph instance
which occurs on the I/O processors.
x. NODE COMMUNICATION TIME FOR ONE GRAPH
INSTANCE
The amount of communication time (in cycles) which is related to the graph
nodes (setup latency, breakdown latency, and the time to load the instruction), excluding
that which occurs on the I/O processors.
y. QUEUE COMMUNICATION TIME FOR ONE GRAPH
INSTANCE
The amount of communication time (in cycles) which is related to the graph
queues (reading and writing data), excluding that which occurs on the I/O processors.
z. COMMUNICATION TIME FOR ONE GRAPH INSTANCE
The required communication time (in cycles) of one data flow graph
instance. This does not include the communication time and control time for input and
output nodes and queues.
aa. CALCULATION TIME FOR ONE GRAPH INSTANCE
The required calculation time (in cycles) of one data flow graph instance.
This does not include the calculation time for input and output nodes.
ab. IDEAL CYLINDER COMMUNICATION OF ONE INSTANCE
The amount of communication time (in cycles) which would be equally
divided among the arithmetic processors. This does not include the communication which
occurs on the I/O processors.
ac. IDEAL CYLINDER CALCULATION OF ONE INSTANCE
The amount of calculation time (in cycles) which would be equally divided
among the arithmetic processors. This does not include the calculation which occurs on the
I/O processors.
ad. COMMUNICATION/CALCULATION RATIO 
The ratio of the communication time to the computation time for one 














































1 0 101 16384 16384
2 0 102 16384 16384
3 101 103 16384 16384
4 102 104 16384 16384
5 103 105 16384 16384
6 104 106 16384 16384
7 105 107 4096 4096
8 106 108 4096 4096
9 107 109 4096 4096
10 108 110 4096 4096
11 109 112 4096 4096
12 109 113 4096 4096
13 110 111 4096 4096
14 111 112 4096 4096
15 111 114 4096 4096
16 112 115 4096 4096
17 113 116 4 4
18 114 116 4 4
19 115 117 2052 2052
20 116 117 4 4
21 117 118 513 513
22 117 120 513 513
23 118 119 513 513
24 119 0 513 513




101 50000 o 6 0 
102 50000 o 3 0 
103 150000 o 7 0 
104 150000 o 8 0 
105 I()(j()(Kl o 4 0 
106 1()(j()(KJ o 5 0 
107 1000000 o I 0 
108 50000 o I 0 
109 4()(j()(KJ o 7 0 
110 100<= o 2 0 
III 4()(j()(KJ o 8 0 
112 75000 o 2 0 
113 I()(j()(KlO o 3 0 
114 1000000 o 4 0 
115 1000000 o 5 0 
116 50000 o 8 0 
117 800<Kl0 o 6 0 
118 50000 o 6 0 
119 lOO<KlO o 7 0 
120 1()(j()(KJ 0 8 o 8 0 
I 0 o 100 100 6 0 




1 1 1 101 16384 16384 16384 16384 16384 0 131072 11
2 1 1 102 16384 16384 16384 16384 16384 0 131072 4
3 0 101 103 16384 16384 16384 16384 16384 0 131072 7
4 0 102 104 16384 16384 16384 16384 16384 0 131072 5
5 0 103 105 16384 16384 16384 16384 16384 0 131072 8
6 0 104 106 16384 16384 16384 16384 16384 0 131072 6
7 0 105 107 4096 4096 4096 4096 4096 0 32768 1
8 0 106 108 4096 4096 4096 4096 4096 0 32768 3
9 0 107 109 4096 4096 4096 4096 4096 0 32768 10
10 0 108 110 4096 4096 4096 4096 4096 0 32768 2
11 0 109 112 4096 4096 4096 4096 4096 0 32768 2
12 0 109 113 4096 4096 4096 4096 4096 0 32768 3
13 0 110 111 4096 4096 4096 4096 4096 0 32768 9
14 0 111 112 4096 4096 4096 4096 4096 0 32768 1
15 0 111 114 4096 4096 4096 4096 4096 0 32768 4
16 0 112 115 4096 4096 4096 4096 4096 0 32768 8
17 0 113 116 4 4 4 4 4 0 3200 2
18 0 114 116 4 4 4 4 4 0 3200 8
19 0 115 117 2052 2052 2052 2052 2052 0 16416 6
20 0 116 117 4 4 4 4 4 0 3200 7
21 0 117 118 513 513 513 513 513 0 4104 7
22 0 117 120 513 513 513 513 513 0 4104 9
23 0 118 119 513 513 513 513 513 0 4104 7
24 2 119 2 513 513 513 513 513 0 4104 2
25 2 120 2 513 513 513 513 513 0 4104 3
APPENDIX D: SAMPLE RUN OF PROGRAMS 
This appendix outlines the general procedure for running a simulation session with the 
programs in this thesis and [Ref. 4]. Input and output files will be indicated in bold.
Commands required to run programs will be indicated in italics. Consult the user's
manual (Appendix B) for detailed descriptions of the program input and output files. Note
that all output files are opened for writing (except stats.out) during program execution.
This means that file names must be modified when running successive iterations of the
same program or data will be lost.
1. Modify the NODES.IN, QUEUES.IN, PROCS.IN, and CHAINS.IN files
for the graph to be analyzed.
2. Run the node allocation program by typing schedule and entering the proper input
data. This program will generate the ORDER.IN file and the cylinder.out file.
3. Run the mapper program by typing map and entering the proper input data. This
will generate the cylinder.dat file and other descriptive output files.
There are now several different steps to be taken, depending on whether it is desired to
analyze the FCFS or RC technique and dependent on whether wrap-around or no
wrap-around is desired. Each of the techniques and variations will be discussed separately.
A. FCFS
Modify the nodes.dat, queues.dat, and machine.cfg files for the graph to be
analyzed. The cylinder.dat file must also be present for the program to run, although it
is not necessary to be concerned about the wrap-around or no wrap-around option since the 
simulator does not use this file for FCFS. Run the simulator program by typing simulate 
and enter the proper input data. 
B. RC TECHNIQUES 
For the various RC techniques, synchronization arcs must be generated before the
simulator program is run.
1. Start-after-start (SAS) or start-after-finish (SAF) without binding
nodes to processors.
The machine.cfg file must contain a zero for each processor assigned.
a. No wrap-around
Use the cylinder.dat file from the mapper program.
b. Wrap-around
The cylinder.out file must be renamed to cylinder.dat. Ensure the existing
cylinder.dat file is renamed first, or it will be overwritten.
Run the synchronization arc program by typing generate. Select the technique
desired. If it is desired to generate another set of synchronization arcs for a different
technique (e.g., SAF arcs are generated, and SAS is now desired) the queues.dat file
must be renamed (to something appropriate, e.g., queues.SAF) and the oqueues.dat
file renamed to queues.dat. The generate program can now be run for the new technique.
2. Start-after-start (SAS) or start-after-finish (SAF) with binding
nodes to processors.
For this technique, the machine.cfg file must now contain a number for each
processor assigned (i.e., 1, 2, etc.). The same rules apply as above for generation of arcs
for wrap-around and no wrap-around techniques.
C. SIMULATIONS 
For the simulations it is important to maintain the proper input files. Ensure that the
queues.dat file matches the appropriate cylinder.dat file (i.e., wrap or no wrap) and that
the machine.cfg file corresponds to the desired binding or no binding condition. As an
example, say the input files were originally named (after generating the synchronization
arcs) queues.SAFnW (no wrap), queues.SAFbnW (bound, no wrap),
queues.SAFW (with wrap), queues.SAFbW (bound, with wrap), cylinder.W
(with wrap), cylinder.nW (no wrap), machine.cfgb (bound), and machine.cfgnb
(not bound). In order to run a simulation for the no wrap, non-bound configuration, the
files queues.SAFnW, cylinder.nW, and machine.cfgnb must be renamed to
queues.dat, cylinder.dat, and machine.cfg. The simulator program can now be run
by typing simulate. Remember that the three files named above must be renamed again in
order to simulate new RC techniques.
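
The bookkeeping in this example can be sketched as a small staging routine. Several points are assumptions rather than part of the thesis software: the function names are hypothetical, the unbound with-wrap queue file is assumed to be named queues.SAFW, and the saved files are copied rather than renamed so that each saved set stays intact between runs.

```python
import shutil
from pathlib import Path


def saved_names(wrap, bound, technique="SAF"):
    """Map each simulator input to its saved name for one configuration.

    Follows the naming example in the text; queues.<technique>W for the
    unbound, with-wrap case is an assumed name.
    """
    suffix = ("b" if bound else "") + ("W" if wrap else "nW")
    return {
        "queues.dat": f"queues.{technique}{suffix}",
        "cylinder.dat": "cylinder.W" if wrap else "cylinder.nW",
        "machine.cfg": "machine.cfgb" if bound else "machine.cfgnb",
    }


def stage(wrap, bound, workdir=".", technique="SAF"):
    """Copy the saved files for one configuration onto the names simulate reads."""
    base = Path(workdir)
    mapping = saved_names(wrap, bound, technique)
    missing = [src for src in mapping.values() if not (base / src).is_file()]
    if missing:
        raise FileNotFoundError(", ".join(missing))
    for target, src in mapping.items():
        shutil.copyfile(base / src, base / target)
    return mapping
```

Calling stage(wrap=False, bound=False) reproduces the no wrap, non-bound case above, after which simulate can be typed as usual. Because copies are used, no renaming back is needed before staging another configuration.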
LIST OF REFERENCES 
1. Shukla, S.B., Little, B.S., and Zaky, A., "A Compile-Time Technique for
Controlling Real-Time Execution of Task-Level Data Flow Graphs," presented at
the 1992 International Conference on Parallel Processing, St. Charles, Illinois.
2. Cross, D.M., Usefulness of Compile-Time Restructuring of LGDF Programs in
Throughput-Critical Applications, Master's Thesis, Naval Postgraduate School,
Monterey, California, September 1993.
3. Bell, R.A., A Compile-Time Approach For Chaining and Execution Control in
the AN/UYS-2 Parallel Signal Processing Architecture, Master's Thesis, Naval
Postgraduate School, Monterey, California, June 1992.
4. Cross, D.M., Shukla, S.B., and Zaky, A., Revolving Cylinder Analysis: A
Technique for Restructuring of Large Grain Data Flow Graphs Representing
Throughput-Critical Applications, Naval Postgraduate School Technical Report
NPS-EC-93-015, September 1993.
5. Akin, C., Efficient Scheduling of Real Time Compute-Intensive Periodic
Graphs on Large Grain Data Flow Multiprocessor, Master's Thesis, Naval
Postgraduate School, Monterey, California, March 1993.
INITIAL DISTRIBUTION LIST 
1. Defense Technical Information Center
Cameron Station 
Alexandria, VA 22314-6145 
2. Dudley Knox Library, Code 52 
Naval Postgraduate School 
Monterey, CA 93943-5101 
3. Chairman, Code EC
Department of Electrical and Computer Engineering 
Naval Postgraduate School 
Monterey, CA 93943-5121 
4. Prof. Shridhar B. Shukla, Code EC/Sh
Department of Electrical and Computer Engineering
Naval Postgraduate School 
Monterey, CA 93943-5121 
5. Prof. Amr Zaky, Code CS/Za
Department of Computer Science 
Naval Postgraduate School 
Monterey, CA 93943-5118 
6. Mr. David Kaplan 
Naval Research Laboratory 
4555 Overlook Avenue, SW 
Washington, D.C. 20375-5000 
7. Mr. Richard Stevens 
Naval Research Laboratory 
4555 Overlook Avenue, SW 
Washington, D.C. 20375-5000 
8. Mr. Paul J. Hays 
Mail Stop 473 
Langley Research Center 
Information Systems Division
Hampton, VA 23681-0001
9. LT John P. Cardany, USN
113 Mervine St. 
Monterey, CA 93940-6205 
10. Mr. W. John Pohl 
1710 Springbill Ct. 




