A comparison of multiprocessor scheduling methods for iterative data flow architectures by Storch, Matthew
NASA Contractor Report 189730
A Comparison of Multiprocessor
Scheduling Methods for Iterative
Data Flow Architectures
Matthew Storch
University of Illinois
Urbana, Illinois
Grant NAG 1-613
February 1993
National Aeronautics and
Space Administration
Langley Research Center
Hampton, Virginia 23681-0001
(NASA-CR-189730) A COMPARISON OF
MULTIPROCESSOR SCHEDULING METHODS
FOR ITERATIVE DATA FLOW
ARCHITECTURES (Illinois Univ.)
22 p
G3/33
N93-19107
Unclas
0148124
https://ntrs.nasa.gov/search.jsp?R=19930009918 2020-03-17T07:16:45+00:00Z

TABLE OF CONTENTS
1. Introduction
1.1. Purpose
1.2. Assumptions
2. Terminology
2.1. ATAMM Terminology
2.2. Parhi and Messerschmitt
3. Comparison of ATAMM Scheduling to Parhi and Messerschmitt Scheduling
3.1. Fully Static Scheduling
3.1.1. Classes of Schedules
3.1.2. Fully-Static Schedules
3.2. Comparison of memory requirements
4. Other Related Work
4.1. Range Chart Scheduling
4.2. Optimal Processor Assignment for Pipeline Computations
5. Conclusion
6. Future Work
6.1. Theoretical Work
6.2. Implementation Work
References
LIST OF FIGURES
Figure 1. (a) An AMG. (b) Execution of the AMG. (c) TGP diagram. Data packet numbers are shown in parentheses.
Figure 2. An AMG with initial tokens.
Figure 3. (a) Cyclo-static schedule. (b) Optimally-static schedule. (c) Improved cyclo-static schedule with period equal to TBO_ALB.
Figure 4. The graph of Figure 2(a) unfolded by a factor of 3.
A Comparison of Multiprocessor Scheduling Methods
for Iterative Data Flow Architectures
1. Introduction
1.1. Purpose
This paper provides a qualitative comparison between the Algorithm To
Architecture Mapping Model (ATAMM) [1] and three related works [2, 5, 7], with the
primary focus on [2]. The problem domain is non-preemptive scheduling of iterative,
large-grain data flow graphs as may typically be found in signal processing applications.
The purpose of this paper is fourfold: to resolve differences in terminology used by the
various authors, to highlight the similarities and differences between ATAMM and the
other models, to point out the relative features and limitations of the various approaches,
and to suggest possible directions for future ATAMM research.
1.2. Assumptions
All schedules in this paper are assumed to be multiprocessor schedules, so schedule
will be used as shorthand for multiprocessor schedule.
Unless otherwise stated all observations regarding ATAMM behavior are for a graph
running in steady-state.
2. Terminology
2.1. ATAMM Terminology
In order to facilitate a comparative discussion of the ATAMM and Optimum
Unfolding scheduling strategies, the terminology used in the two strategies will be
introduced and contrasted in this section. Some familiarity with both ATAMM and [2] is
assumed; the terms used in both works are introduced not to give precise definitions but
rather to provide a mapping between the two models. The terminology used by ATAMM
will be covered first, and then it will be shown how the definitions used in [2] relate.
A dataflow graph in ATAMM is called an algorithm marked graph (AMG). Two
other graph types are used in the ATAMM design system, namely the node marked graph
(NMG) and the computational marked graph (CMG). While these graphs are
fundamental to the ATAMM model and ATAMM design procedure, they are primarily a
means by which data packet injection interval and node firings are controlled. The effects
of the NMG and CMG, such as controlling TBO, ensuring that buffers are not
overwritten, and ensuring steady-state operation, are important, but the exact manner in
which the NMG and CMG are used to achieve those effects need not be reviewed to
compare ATAMM schedules with [2], [5], and [7].
An AMG consists of nodes representing large-grain computations and directed
edges from one node to another (not necessarily distinct) node, which represent the flow
of data and thus indicate temporal precedence constraints. A source is a special type of
node that has no incoming edges but nonetheless produces tokens at a fixed rate. A sink
is also a special type of node that has no outgoing edges, and which therefore produces no
token when it fires. One or more initial tokens indicating the presence of initial data may
be placed on any edge before the graph begins execution. The successor node of an edge
will use the nth token from the predecessor to produce the (n+d)th token of the successor,
where d is the number of initial tokens. The nth packet of data consists of the nth token
produced by each node excluding the sinks, which never produce any tokens.
The time between the completion of two consecutive packets is called TBO (time
between outputs). Part of a design procedure may be to achieve a certain desired or
target TBO. When a graph is either simulated or run on actual hardware, the actual TBO
may vary slightly or even oscillate if injection is not controlled properly. Since this paper
is primarily concerned with operation at a theoretically perfect steady-state, from here on
TBO should be read as target TBO unless otherwise stated. In an injection-controlled
environment such as ATAMM, TBO is necessarily equal to the injection interval for the
graph to run in steady-state. The smallest achievable TBO for any number of processors
is called TBO_ALB, where the subscript indicates absolute lower bound. In [1], TBO_ALB is
shown to be the maximum time per token of any directed circuit in the AMG. Let C_i be
the ith directed circuit (numbered arbitrarily), T(C_i) be the sum of execution times of the
nodes in C_i, and M(C_i) be the number of initial tokens in C_i; then
$$\mathrm{TBO}_{ALB} = \max_i \frac{T(C_i)}{M(C_i)} \qquad (1)$$
The smallest achievable TBO for a given number of processors is TBO_LB. Let
TCE be the sum of all node execution times and R be the number of processors; then
$$\mathrm{TBO}_{LB} = \max\left\{ \mathrm{TBO}_{ALB},\ \frac{TCE}{R} \right\} \qquad (2)$$
If there is no directed circuit (recurrent loop) in the AMG, then T(C) = 0 and
TBO_ALB = 0. There is another factor which may limit TBO_ALB. If it is assumed that nodes
cannot be multiply instantiated (footnote 1), then TBO_ALB will be either the result of Equation 1 or the
largest node execution time, whichever is larger. Such an assumption is made in [5].
However, ATAMM [3] directly allows for multiple instantiations, and a similar effect is
achieved in [2] by unfolding the DFG. Unfolding is a transform that takes a data flow
graph G and an unfolding factor J as input and constructs a new graph which contains J
copies of each node of G. The J copies of a node A of G correspond to J consecutive
instantiations of A, and edges are added to the unfolded graph so as to enforce all of the
intra-packet and inter-packet data dependencies. See [2] for the unfolding algorithm and
additional discussion.
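The two bounds can be evaluated mechanically once the directed circuits are known. The following is a minimal sketch, not part of ATAMM itself: the function names, the explicit list of circuits, and the three-node graph are hypothetical, and integer execution times are assumed so that exact fractions can be used.

```python
from fractions import Fraction


def tbo_alb(circuits, node_times, allow_multiple_instantiation=True):
    """Equation 1: maximum over directed circuits of T(C_i) / M(C_i).

    circuits   -- list of (nodes_in_circuit, initial_tokens) pairs
    node_times -- dict mapping node name to integer execution time
    """
    bound = Fraction(0)  # no directed circuit: TBO_ALB = 0
    for nodes, tokens in circuits:
        if tokens <= 0:
            raise ValueError("a circuit without an initial token can never fire")
        circuit_time = sum(node_times[n] for n in nodes)
        bound = max(bound, Fraction(circuit_time, tokens))
    if not allow_multiple_instantiation:
        # Assumption made in [5]: the longest node also bounds TBO.
        bound = max(bound, Fraction(max(node_times.values())))
    return bound


def tbo_lb(circuits, node_times, processors, **kwargs):
    """Equation 2: max(TBO_ALB, TCE / R)."""
    tce = sum(node_times.values())
    return max(tbo_alb(circuits, node_times, **kwargs), Fraction(tce, processors))


# Hypothetical graph: the circuit A -> B -> A holds one initial token.
times = {"A": 3, "B": 2, "C": 6}
loops = [(["A", "B"], 1)]
print(tbo_alb(loops, times))                                      # 5
print(tbo_alb(loops, times, allow_multiple_instantiation=False))  # 6
print(tbo_lb(loops, times, processors=1))                         # 11
print(tbo_lb(loops, times, processors=3))                         # 5
```

With one processor the TCE/R term dominates; with three processors the circuit bound of Equation 1 takes over.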
The execution of a node on some data packet is referred to as an instantiation of
that node. A node is said to be multiply instantiated if there exists an instant of time in
which the node code is operating on two distinct data packets simultaneously. Although
instantiation is primarily a software term, it can apply to hardware as well. Note that for
our purposes a pipelined hardware multiplier which is in the process of computing several
results in its different pipeline stages is in essence multiply instantiated. Two even more
closely related hardware examples are a multiple-issue CPU with multiple identical
functional units, and a supercomputer with multiple vector units on each CPU.
A schedule is periodic with period TBO if and only if the following property is
met. If a node fires at time t, then the next time that the node will fire is exactly t+TBO.
Both ATAMM and [2] are concerned only with graphs which operate in a periodic
fashion. If a graph executes in a periodic fashion it is said to have reached steady-state.
There may be a length of time when a graph first begins executing that it is not in
steady-state, in which case a transient condition exists.
A total graph play diagram, or TGP, is a graphical snapshot showing the node
activity at steady-state for exactly one TBO interval of time (footnote 2). The earliest-created packet
is usually numbered 1, the next earliest is numbered 2, and so forth. An example AMG is
shown in Figure 1(a), a feasible schedule is given as Figure 1(b), and the resulting TGP is
shown in Figure 1(c).
For a complete introduction to ATAMM and a preliminary system implementation
see [3].
Footnote 1: See next paragraph for a definition of multiply instantiated.
Footnote 2: See [1] or [3] for an alternate but equivalent definition.
Figure 1. (a) An AMG. (b) Execution of the AMG. (c) TGP diagram. Data packet
numbers are shown in parentheses.
2.2. Parhi and Messerschmitt
The terminology used in [2] corresponds quite closely to that of ATAMM. The
counterpart to the ATAMM AMG is referred to simply as a dataflow graph, or DFG.
Like an AMG, a DFG consists of nodes and edges. However, there are no sources or
sinks; every node is part of at least one directed cycle, which is defined in the usual graph-
theoretical sense. A register is analogous to an initial token and specifies both initial data
and delay. The "initial data" property is not explicitly stated in [2], but is implicit because
a DFG must have at least one token in every loop to start up and reach steady-state. As
[2] provides no notation for specifying "full" versus "empty" registers, it must be assumed
that all registers are initially full.
One ATAMM data packet corresponds to one iteration of a DFG. TBO_ALB is
referred to in [2] as the iteration bound, or T_∞, but the more general TBO_LB has no analog
since [2] is not concerned with running with fewer processors than is required to achieve
the iteration bound. A processor schedule in which the actual iteration period is equal to
the iteration bound is said to be a rate-optimal schedule. Iteration number is equivalent to
packet number.
Throughout the remainder of this paper terms from both ATAMM and [2] will be
used as is appropriate.
3. Comparison of ATAMM Scheduling to Parhi and Messerschmitt
Scheduling
3.1. Fully Static Scheduling
3.1.1. Classes of Schedules
Three terms which are used in [2] but not in the referenced ATAMM literature are
overlapped schedule, fully-static schedule, and cyclo-static schedule. The following
definitions are consistent with those presented in [2]. A schedule is overlapped if any
node of packet (footnote 3) N+1 fires before all nodes of packet N have completed. The strategies
used in ATAMM, [2], and [5] may create overlapped schedules. A periodic schedule is
fully-static if all instantiations of any given node are scheduled on the same processor. A
periodic schedule is cyclo-static if the following condition is met. If any given node is
instantiated on processor P_k for packet p then it is instantiated on processor P_((k+K) mod N) for
packet p+1, where K is some integer constant and N is the number of processors. A
schedule must be periodic if it is to be either fully- or cyclo-static. These terms are more
rigorously defined in [2].
Footnote 3: The phrase "node of packet N" is not strictly consistent with the definitions of node and packet;
technically the phrase should be "node instantiation that produces a token of packet N". However, the
former phrase will be used for brevity when the meaning is unambiguous.
There exist classes of scheduling which impose constraints weaker than cyclo-static.
General periodic schedules may or may not be cyclo-static. That is, the function which
maps node iterations to processors may not be linear modulo the number of processors, as
is necessary for a cyclo-static schedule. [2] and [5] are concerned only with compile-time
schedules, whereas ATAMM scheduling is done at run-time, although scheduling
performance is predicted at design time. With regard to the question of which processor
executes which node iteration, run-time assignment of nodes to processors can be
unpredictable (footnote 4). For example, consider Figure 1(c). In the current ATAMM
implementations, at the time instant t when both A(2) and B(1) complete, the two processors
enter a race condition, the winner of which will run whichever of B(2) or C(1) has higher
priority. The node priority is only used in node to processor assignment, but all nodes are
executed as soon as they are enabled. Thus a general periodic schedule will result and
time performance and periodicity are not adversely affected.
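The three schedule classes can be told apart by looking only at the node-to-processor mapping over consecutive packets. The sketch below is illustrative: the function name and the sample mappings are hypothetical, the schedule is assumed to be periodic already, K is read as a single displacement shared by all nodes, and only the finite window of packets supplied is examined.

```python
def classify(assignment, num_processors):
    """Classify a periodic schedule from its node-to-processor mapping.

    assignment -- dict mapping node name to the list of processor indices
                  used for packets 0, 1, 2, ... (a finite observation window)
    """
    displacements = set()
    for procs in assignment.values():
        for current, following in zip(procs, procs[1:]):
            # Processor shift between consecutive packets, modulo N.
            displacements.add((following - current) % num_processors)
    if displacements <= {0}:
        return "fully-static"      # every node stays on one processor (K = 0)
    if len(displacements) == 1:
        return "cyclo-static"      # one constant shift K for every node
    return "general periodic"      # mapping is not linear modulo N


print(classify({"A": [0, 0, 0], "B": [1, 1, 1]}, 3))        # fully-static
print(classify({"A": [0, 1, 2, 0], "B": [1, 2, 0, 1]}, 3))  # cyclo-static
print(classify({"A": [0, 1, 0, 2]}, 3))                     # general periodic
```

Under this reading a fully-static schedule is simply the cyclo-static case with K = 0, which is why the first test is applied before the second.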
3.1.2. Fully-Static Schedules
In [2], it is shown that for any DFG there exists a fully-static rate-optimal
schedule, given adequate resources. However, it is crucial to note that Parhi and
Messerschmitt achieve a schedule which is fully-static with respect to the unrolled DFG,
not the original DFG, although curiously this fact is not directly stated in [2]. As far as
the original DFG is concerned, the schedule that their unfolding-based algorithm creates is
actually cyclo-static. For example, consider Figures 2 and 3(a), both taken from [2],
which show a DFG and the resulting rate-optimal schedule. To see why Parhi and
Messerschmitt did not specifically address this issue, note that under the assumption that
nodes may be multiply instantiated (or equivalently for purposes of this argument, that the
graph may be unfolded), it will not be possible to create a fully-static schedule if the time
of the largest node is greater than TBO_ALB. Clearly any two or more distinct instantiations
of the same node which are running at the same time instant must be on two distinct
processors. In this paper, schedules that are fully-static except for violations caused by
node execution times that are greater than TBO_ALB will be referred to as optimally-static.
An optimally-static schedule is technically cyclo-static, but it is also as fully-static as it can
be; hence the use of a new term. Also note that even if no node has execution time greater
than TBO_ALB, to achieve a fully-static rate-optimal schedule may require more processors
than for a cyclo-static rate-optimal schedule (footnote 5). Again consider the example shown in
Figures 2 and 3(a), which were taken from Figure 7 of [2]. The DFG is shown in Figure
2. Figure 3(a) repeats the schedule shown in [2]; this schedule is the minimal-processor
rate-optimal schedule and is cyclo-static with respect to the graph of Figure 2. Note that
this schedule is as static as it can be, given the 3 processor limitation. Figure 3(b) shows
that in this case the addition of 1 extra processor allows for an optimally-static schedule.
Footnote 4: Historically, no effort was made to show that ATAMM architectures exhibited fully-static or cyclo-static
behavior (although they are periodic), so ATAMM schedules might be placed in this category. However,
in the future it is planned that ATAMM schedules may be made fully-static through the addition of
processor/node constraints.
Not only is the schedule in Figure 3(a) not fully-static with respect to the graph of
Figure 2, it is not even periodic with respect to a period of length TBO_ALB; its actual
period is three times longer than TBO_ALB. A modified version of the schedule that is
periodic with period TBO_ALB is shown in Figure 3(c).
Figure 2. An AMG with initial tokens.
Footnote 5: This follows easily from the second point, column 2, p. 355, of [5].
Figure 3. (a) Cyclo-static schedule. (b) Optimally-static schedule. (c) Improved cyclo-static
schedule with period equal to TBO_ALB. [Each panel plots node instantiations, labeled by
packet number, on processors 1 to 3 against time.]
The primary achievement of [2] is to show that any DFG can be scheduled
rate-optimally in a fully-static manner with respect to the unfolded version of the DFG. Before
effort is made to show that ATAMM schedules can be made to be optimally-static with
respect to the original graph, it seems wise to consider the importance of optimally-static
scheduling from an engineering point of view. Such an engineering viewpoint was not
adopted in [2] due to the fact that determining a fully-static (or optimally-static) rate-
optimal schedule is an NP-complete problem [6], as pointed out in [2].
Nonetheless, the optimally-static property is of practical as well as theoretical
interest. If a node can be executed on any processor, as is the case in current ATAMM-based
architectures, then every processor requires a copy of the code for that node. Such
replication is a strength if a high degree of fault-tolerance is desired [3], but is wasteful if
memory conservation is a major concern.
Both optimally-static and cyclo-static scheduling have a communication delay and
bandwidth usage advantage over dynamic scheduling if the processors are connected via a
non-bus (i.e. point-to-point) network. For a dynamic schedule, it can easily be the case
that any node will run on many different processors, and potentially on all processors at
different points in time. Thus, depending on the implementation of course, if a copy of
each node (footnote 6) is kept on every processor on which it may run (which in general may be all
processors), then graph updates such as making a token available on an incoming edge
must be broadcast to all processors. This is not a problem for current ATAMM
implementations because the communication network is always a bus.
If an optimally-static schedule is known, then of course the processor (footnote 7) on which
each node will run is a known constant, in which case the potentially high overhead of
broadcasting on a non-bus network is eliminated. In the case of a cyclo-static schedule,
the processor for each node is by definition not necessarily a constant, but nonetheless the
processor which will run the next instantiation of a node is known (if the schedule is
determined before run-time) or can be computed on the fly (if dynamic scheduling is
used). The destination processor for graph maintenance messages is known and
broadcasting is not necessary.
3.2. Comparison of memory requirements
To achieve rate-optimal operation, there must be a mechanism for exploiting
sufficient pipeline concurrency. In [2] this is achieved through overlapped schedules and
optimum unfolding. In ATAMM the mechanisms are overlapped schedules and multiple
node instantiations.
Footnote 6: A copy of a node might typically include the code for the node plus graph information such as lists of
incoming and outgoing arcs, and buffer space for incoming tokens waiting to be used.
Footnote 7: Of course, if the computation time of a node is greater than TBO, then it is actually the set of processors
on which the node will run that is known.
Although memory requirements were not of concern in [2] due to its theoretical
point of view, it is nonetheless interesting to compare the memory requirements of
optimum unfolding to a dynamic scheduling system typified by ATAMM. One measure
for this comparison is the peak amount of memory required for node instantiations; unit
memory requirements for each node will be assumed. Another measure is the required
number of buffers on the edges of a data flow graph.
In [2], nodes are not multiply instantiated so the node measure is easy to compute;
it is simply JN, where J is the unfolding factor and N is the number of nodes in the DFG.
From the unfolding algorithm it can also be seen that the number of edges and hence
minimal number of buffers (footnote 8) is also JN. Thus from the memory point of view every node is
multiply instantiated J times whether it "needs" to be or not, and every edge is buffered J
times whether it "needs" to be or not. Whether or not a node needs to be multiply
instantiated or multiply buffered is a function of its computation time and TBO.
Due to its use of injection control ATAMM will require no more than the
minimum number of instantiations for each node, and relatively few buffers. First consider
multiple node instantiations. Let A be a node and T(A) be the execution time of the node;
then from [3]
$$\{\text{number of instantiations}\}(A) = \left\lceil \frac{T(A)}{\mathrm{TBO}} \right\rceil \qquad (3)$$
It is easy to see that ⌈T(A)/TBO⌉ is a lower bound on the number of instantiations.
In steady-state exactly one instance of A must complete every TBO interval. Each
instance of A takes T(A) units of computing time to complete execution. Hence,
{number of instantiations}(A) ≥ T(A)/TBO, and the lower bound follows immediately.
The fact that ⌈T(A)/TBO⌉ is also an upper bound is less obvious, but assuming that the
graph is periodic in TBO (footnote 9), this bound follows from the fact that each instantiation of a
node begins at the same time offset from the beginning of a TBO interval.
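The claim that the ceiling in Equation 3 is both a lower and an upper bound can be checked numerically for a strictly periodic node. The sketch below is only an illustration under stated assumptions: integer time units, one firing exactly every TBO, and hypothetical function names.

```python
def instantiations_needed(exec_time, tbo):
    """Equation 3: ceil(T(A) / TBO), computed with integer arithmetic."""
    return -(-exec_time // tbo)


def peak_concurrency(exec_time, tbo, offset=0, periods=50):
    """Largest number of instances of one node alive at the same instant.

    The node fires at offset, offset + TBO, offset + 2*TBO, ... and each
    instance occupies the half-open interval [start, start + exec_time).
    """
    starts = [offset + k * tbo for k in range(periods)]
    # Concurrency can only rise at a start time, so sampling there suffices.
    return max(
        sum(1 for s in starts if s <= t < s + exec_time)
        for t in starts
    )


print(instantiations_needed(7, 3), peak_concurrency(7, 3))  # 3 3
print(instantiations_needed(6, 3), peak_concurrency(6, 3))  # 2 2
print(instantiations_needed(2, 3), peak_concurrency(2, 3))  # 1 1
```

Because every instance starts at the same offset within its TBO interval, the simulated peak never exceeds the ceiling, which is the upper-bound half of the argument.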
Footnote 8: Each edge requires at least one buffer to hold a token from the time it is generated by the predecessor
node until the time when it is used by the successor node. It is possible to assume that the successor node
maintains this buffer rather than the edge, but the buffer must exist somewhere.
Footnote 9: A proof that the ATAMM strategy leads to periodic execution is given in [1] for graphs which contain at
most one token per edge and which are run without multiple node instantiations. The proof is presently
being extended to show that graph plays remain periodic even when the above restrictions are lifted.
Turning to the number of buffers required for a single edge, the upper bound is
given by
$$\text{number of buffers} \le \left\lceil \frac{TT}{\mathrm{TBO}} \right\rceil \qquad (4)$$
where TT is the total packet computation time as defined in [1]. TT represents the
maximum length of time the graph may take to fully process a packet. Thus just before a
packet completes, ⌈TT/TBO⌉ new packets may have started, and by research now in
progress (footnote 10), Equation 4 is a valid upper bound.
Optimum unfolding will need to unfold the graph at least
$$J = \left\lceil \frac{\text{largest node execution time}}{\mathrm{TBO}} \right\rceil \qquad (5)$$
times, and most probably many more times than that. In the worst case the unfolding
factor may be exponential in the number of arcs with registers (footnote 11). Therefore J copies of
each node will be made. J copies of each edge will also be made, and so J buffers will be
required. Under ordinary circumstances we could expect the number of instantiations
required by (3) and the number of buffers required by (4) to be considerably less than the
number required by quantity (5), at least for most nodes/edges.
The DFG of Figure 2 provides an illustrative example of when optimum unfolding
does and does not cause extra nodes and edges to be instantiated. From the schedule of
Figure 3(a), node A must be instantiated 3 times concurrently, but B only needs to be
singly instantiated. The optimally unfolded version of Figure 2 is given in Figure 4. This
graph contains 3 instances of node A, 3 instances of node B, and three copies of each
edge. All 3 instances of A are required, but only 1 instance at a time of B is needed.
Although this example is relatively kind to optimum unfolding in that an
exponential unfolding factor is not required, it is still seen that optimum unfolding requires
considerably more than the minimum amount of memory in terms of node instantiations
and edge buffers.
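The node-memory comparison in this section can be put side by side in a few lines. This is a sketch under assumptions: the execution times are hypothetical values chosen only so that A needs three concurrent copies and B needs one (the actual times of Figure 2 are not reproduced here), unit memory per node copy is assumed, and the least-common-multiple function follows the description of the unfolding factor given in the footnote on the exponential worst case.

```python
from functools import reduce
from math import gcd


def atamm_node_copies(node_times, tbo):
    """Per-node copies under Equation 3: ceil(T(A) / TBO)."""
    return {node: -(-time // tbo) for node, time in node_times.items()}


def unfolded_node_copies(node_times, unfolding_factor):
    """Optimum unfolding makes J copies of every node, needed or not."""
    return {node: unfolding_factor for node in node_times}


def lcm_of_loop_registers(registers_per_loop):
    """Least common multiple of the register counts of all loops."""
    return reduce(lambda a, b: a * b // gcd(a, b), registers_per_loop, 1)


# Hypothetical execution times and TBO (integer time units).
times = {"A": 6, "B": 2}
tbo = 2
print(atamm_node_copies(times, tbo))        # {'A': 3, 'B': 1}
print(unfolded_node_copies(times, 3))       # {'A': 3, 'B': 3}
print(lcm_of_loop_registers([2, 3, 5, 7]))  # 210
```

The last line hints at how quickly the factor can grow: four hypothetical loops with 2, 3, 5, and 7 registers already force 210 copies of every node and edge.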
Footnote 10: Private communication with Robert L. Jones, NASA Langley Research Center, Hampton, VA, Summer
1992.
Footnote 11: Remark 7.2 of [2] implies this fact. Clearly, ordering the loops in an optimally unfolded DFG is not
exponential in anything, unless the number of loops or the number of nodes in some loop is exponential
with respect to some quantity. The latter possibility is the case. The worst-case number of nodes in a loop
of the unfolded graph is linearly proportional to the unfolding factor J, which is equal to the least common
multiple of the number of registers in all loops, a quantity which can be exponential in the number of
loops (and thus edges) of the original graph.
Figure 4. The graph of Figure 2 unfolded by a factor of 3.
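For readers without [2] at hand, the transform can be sketched as follows. This uses the formulation commonly given for unfolding (copy i of the source feeds copy (i + w) mod J of the destination through floor((i + w) / J) registers, where w is the register count of the original edge); [2] should be consulted for the authors' own statement. The two-node graph is hypothetical and is not claimed to be the graph of Figure 2.

```python
def unfold(edges, factor):
    """Unfold a data flow graph by the given factor J.

    edges  -- list of (source, destination, registers) triples
    return -- list of ((source, i), (destination, j), registers) triples,
              where i and j index the J copies of each node
    """
    unfolded = []
    for src, dst, registers in edges:
        for i in range(factor):
            j = (i + registers) % factor
            unfolded.append(((src, i), (dst, j), (i + registers) // factor))
    return unfolded


# Hypothetical loop: A feeds B directly, B feeds A through 2 registers.
graph = [("A", "B", 0), ("B", "A", 2)]
for edge in unfold(graph, 3):
    print(edge)
```

Each original edge yields J edges, and the registers on those J edges add up to the register count of the original edge, so the total number of initial tokens is preserved.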
4. Other Related Work
4.1. Range Chart Scheduling
An alternative method of scheduling DFGs is given in [5]. The primary tool used
to determine a schedule in [5] is the range chart. A range chart is equivalent to a TGP
diagram with the execution time of each node extended by the float time of that node. As
a result, the point of view taken in [5] is very similar to that of ATAMM, which is based
on the concept of TGP. Like ATAMM but unlike optimum unfolding as presented in [2],
the range chart technique is applicable to data flow graphs which may or may not contain
loops; there is no requirement that every node be in some loop. One difference between
[5] and recent ATAMM work is that [5] assumes the longest node execution time to be an
additional lower bound on TBO_ALB.
Two specific problems are tackled in [5], and both are solved with essentially the
same heuristic. The first problem is to minimize the number of processors required to
schedule a data flow graph, given a fixed TBO value greater than or equal to TBO_ALB',
where the prime indicates the additional constraint of longest node execution time on
TBO_ALB. The second problem is to find a fully-static schedule that minimizes TBO for P
processors, where P is greater than or equal to the number required for a fully-static
schedule at TBO_ALB'.
The heuristic for the processor minimization problem is presented first, and
operates as follows. The range chart is computed and a node is chosen as the reference
node. The choice of reference node is a weak spot in the algorithm, as the performance of
the heuristic is dependent on the choice of reference node, and the only recommendation
that the authors make is that the reference node be "carefully chosen" [5]. The algorithm
enters phase 1 and for each node does the following:
1. A pointer to the current level is maintained for each time unit in the TGP. Initially this
pointer is set to 1 for all time units.
2. The node with least scheduling range is chosen for processing.
3. The start time of the node within the TGP is chosen so that the node runs within its
scheduling range and occupies the lowest possible level at each time unit. Note that
since levels do not necessarily correspond to processors in this step, a single node may
occupy more than one level. The pointers are updated to reflect the new first available
level of each time unit.
After phase 1 the algorithm sorts the nodes in decreasing order of execution time,
and then enters phase 2, executing the following steps for each node:
1. A pointer to the current level is maintained for each time unit in the TGP. Initially
this pointer is set to 1 for all time units. In this phase a level will correspond to a
processor.
2. The first node from the sorted list is removed and assigned to the first level that has
available all the time slots that the node requires. The pointers are updated for the
time units now occupied by the node.
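Phase 2 amounts to a first-fit placement of time-fixed nodes onto levels. The sketch below is a simplified reading of the steps above rather than the algorithm of [5] itself: the function name and example nodes are hypothetical, start times are taken as already fixed by phase 1, time is in integer units, and sets of occupied time units stand in for the per-time-unit level pointers.

```python
def assign_levels(nodes, tbo):
    """First-fit assignment of time-fixed nodes to levels (processors).

    nodes  -- dict mapping node name to (start, duration), start times fixed
    tbo    -- length of the TGP in time units
    return -- dict mapping node name to a level number starting at 1
    """
    levels = []      # levels[k] holds the time units occupied on level k + 1
    placement = {}
    # Longest nodes first, as in the sorted list of phase 2.
    for name, (start, duration) in sorted(nodes.items(), key=lambda kv: -kv[1][1]):
        if duration > tbo:
            raise ValueError("a node longer than TBO cannot stay on one level")
        # Time units wrap around the end of the TGP.
        needed = {(start + t) % tbo for t in range(duration)}
        for index, occupied in enumerate(levels):
            if not (needed & occupied):
                occupied |= needed
                placement[name] = index + 1
                break
        else:
            levels.append(set(needed))
            placement[name] = len(levels)
    return placement


# Hypothetical TGP of length 6 with four nodes already fixed in time.
fixed = {"A": (0, 4), "B": (4, 2), "C": (2, 3), "D": (5, 2)}
print(assign_levels(fixed, 6))  # {'A': 1, 'C': 2, 'B': 1, 'D': 2}
```

Node D illustrates the wrap-around: it starts at time 5 and finishes in time unit 0 of the next interval, so it cannot share level 1 with A.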
The algorithm for the TBO-minimization problem uses the above algorithm as a
subroutine. The TBO-minimization algorithm operates as follows. The number of
available processors P is given, and TBO is initially guessed to be TBO_ALB'. The
processor-minimization algorithm is run, and if in either phase the algorithm needs more
than P levels, it is aborted and restarted with a new TBO guess one time unit longer than
the previous guess. It is not explicitly mentioned in [5], but this linear search for the
minimum feasible TBO could be changed to a binary search by making use of the
knowledge that the number of nodes is an upper bound on the number of processors
required (recall that multiple instantiation is not allowed in [5]). In this way the running
time would be increased logarithmically rather than linearly from the time of the
processor-minimization algorithm.
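The difference between the two search strategies can be sketched as follows. Everything here is an assumption made for illustration: levels_needed is a hypothetical stand-in for a run of the processor-minimization heuristic, tbo_max is a hypothetical search limit at which the schedule is known to fit, and the binary search is only valid if feasibility is monotonic in TBO, which a heuristic does not guarantee.

```python
def min_tbo_linear(levels_needed, processors, tbo_start, tbo_max):
    """Linear search: lengthen TBO one time unit at a time."""
    for tbo in range(tbo_start, tbo_max + 1):
        if levels_needed(tbo) <= processors:
            return tbo
    return None


def min_tbo_binary(levels_needed, processors, tbo_start, tbo_max):
    """Binary search, assuming feasibility never reverts as TBO grows."""
    if levels_needed(tbo_max) > processors:
        return None
    low, high = tbo_start, tbo_max
    while low < high:
        middle = (low + high) // 2
        if levels_needed(middle) <= processors:
            high = middle        # feasible: try a shorter TBO
        else:
            low = middle + 1     # infeasible: TBO must be longer
    return low


def stand_in(tbo):
    """Hypothetical heuristic result: 20 time units of work packed perfectly."""
    return -(-20 // tbo)


print(min_tbo_linear(stand_in, 3, 5, 20))  # 7
print(min_tbo_binary(stand_in, 3, 5, 20))  # 7
```

Both searches return the same TBO here, but the binary version calls the heuristic a logarithmic number of times in the width of the search interval.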
Given the beneficial properties of fully-static scheduling as discussed in Section
3.1.2, it may be desired to develop an optional additional node-binding step in the
ATAMM design procedure for achieving fully-static (or cyclo-static if that is all that is
desired) operation. The node-binding step assumes a TGP with float times has been
constructed (footnote 12). Nodes can then be assigned to processors in one of three ways.
1. For sufficiently small graphs, an exhaustive search using branch-and-bound or other
spcedup technique(s) could be used
. A heuristic algorithm, such as the one presented in [5] could be used; this is perhaps
tile best alternative tbr large graphs. As pointed out in [5], two interdependent
decisions need to be made in order to assign nodes to processors: the nodes need to
be fixed in time, and bin-packed onto processors. The heuristic in [5] first makes the
time decision for all nodes, then makes the processor decision for all nodes. While this
seems to be a reasonably intuitive approach, there is no obvious reason why a different
approach cannot be tried, such as fixing a node in time, assigning it to a processor, and
repeating the procedure for each node. This "one node at a time" approach is likely to
be the one a human would use if attempting to create the schedule by inspection, as in
alternative 3 below.
A heuristic algorilluu will require tile construction and maintenance of a range chart,
which can be done either with existing ATAMM code that computes a TGP with node
tloats, o," with the range chart construction algorithm provided in [5]. A comparison
of the running time of these two methods would be interesting but is beyond the scope
of this paper.
. The user can be relied upon to assign nodes to processors by inspection. This
procedure could be SUl_ported in the ATAMM Design Tool_° through the addition of a
resource display that allows nodes to be placed on processors one at a time in
sequence. Alternatively, the user could be shown an initial display, such as one node
per processor, and then be allowed to change the processor on which a node is
running, as long as the resulting schedule is feasible. Node assignment should be an
easy task as the float available to a node is dynamically updated in the Design Tool.
As nodes are moved around, and if desired by the user, a node could be fixed in time
thus eliminating its float. Currently, the ATAMM design procedure uses control edges
as a mechanism for controlling placement of a node within its float. Control edges do
not allow arbitrary placement of a node within its float time, but nonetheless in a bin
packing situation the only meaningful times to start a node are upon completion of
another node (also, in a data flow architecture the only event which can start a non-
source node is the finish of some other node).
12. Recall that such a TGP is equivalent to the scheduling-range chart of [5], so if the performance of the
currently-used ATAMM algorithm for finding the TGP and floats ever proves unsatisfactory, a rough time
complexity comparison could be made between it and the algorithm used to compute range charts in [5].
The range-chart algorithm would then offer an alternative if its performance was shown to be better.
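
The "one node at a time" approach described in alternative 2 can be illustrated with a minimal Python sketch. It is not the heuristic of [5] and not ATAMM code. The names `Node`, `occupied`, and `bind_nodes` are hypothetical, time is assumed to be measured in integer units, and every node is assumed to run no longer than the iteration period (no multiple instantiation). As a deliberate simplification, each node's float is treated as fixed, whereas in a real TGP fixing one node in time changes the float of its neighbors. Candidate start times are limited to the node's earliest start and the completion times of nodes already placed, following the observation that these are the only meaningful start times in a bin-packing situation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    duration: int    # execution time in integer time units (assumed <= period)
    earliest: int    # earliest start time, as read from the TGP
    float_time: int  # float: node may start in [earliest, earliest + float_time]


def occupied(start, duration, period):
    """Time slots, modulo the iteration period, used by one non-preemptive run."""
    return {(start + k) % period for k in range(duration)}


def bind_nodes(nodes, period):
    """Fix each node in time and assign it to a processor, one node at a time."""
    processors = []    # one set of occupied time slots per processor
    schedule = {}      # node name -> (start time, processor index)
    finish_times = []  # completion times of nodes already placed
    for node in sorted(nodes, key=lambda n: n.float_time):  # least float first
        latest = node.earliest + node.float_time
        # Candidate starts: the earliest start, then completions of placed nodes.
        candidates = [node.earliest] + sorted(
            t for t in set(finish_times) if node.earliest < t <= latest
        )
        choice = None
        for start in candidates:
            slots = occupied(start, node.duration, period)
            for index, used in enumerate(processors):
                if not slots & used:  # first processor with no conflict
                    choice = (start, index)
                    break
            if choice is not None:
                break
        if choice is None:  # nothing fits: open a new processor
            processors.append(set())
            choice = (node.earliest, len(processors) - 1)
        start, index = choice
        processors[index] |= occupied(start, node.duration, period)
        schedule[node.name] = choice
        finish_times.append(start + node.duration)
    return schedule, len(processors)


# Hypothetical three-node example with an iteration period of 5 time units.
nodes = [Node("A", 2, 0, 0), Node("B", 1, 0, 4), Node("C", 2, 2, 0)]
schedule, count = bind_nodes(nodes, period=5)
```

With these made-up numbers, node B uses its float to start when C completes, so all three nodes share one processor. A production version would need to recompute floats after each placement, as the Design Tool is described as doing.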
4.2. Optimal Processor Assignment for Pipeline Computations
Another paper that addresses the scheduling of iterative data flow graphs is [7], but
in this work a different point of view is taken. In ATAMM, [2], and [5], it was assumed
that nodes may be neither split nor joined -- that the data flow graph expresses whatever
concurrency is and is not available in the problem. In [7], a fundamentally different
assumption is made that it is possible to speed up a node by allocating multiple processors
to it, and that a function is known for each node, either through analysis or (more likely)
through experimentation, that maps the number of available processors for the node to
execution time for the node. In most cases, it is expected that assigning more processors
to a node will generally decrease the node execution time, although it is possible that
assigning more processors may actually increase execution time. A further assumption is
that the input data flow graphs do not have initial tokens or recurrence loops.
The purpose of [7] is to set up and solve the mathematically well-defined response
time optimization problem. Given the above assumptions, the problem is as follows.
Given an upper bound on the total number of processors that may be used, and an
allowable upper bound on the resulting TBO, choose an allocation of processors to nodes of the
data flow graph such that the execution time (i.e. TBIO) of the graph is minimized.
[7] also defines an analogous throughput optimization problem. Again given an
upper bound on the total number of processors available, and an upper bound on TBIO,
the problem is to find an allocation of processors to nodes such that TBO is minimized.
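
To make the two problem statements concrete, the following Python sketch enumerates every allocation for a small linear pipeline. It is a brute-force illustration only, not the dynamic programming algorithm of [7], which handles trees and series-parallel graphs. The function names and the timing table are hypothetical. The sketch assumes a simple chain of stages, for which TBO is taken to be the time of the slowest stage and TBIO the sum of the stage times, with communication costs ignored.

```python
from itertools import product


def allocations(stage_times, total_processors):
    """Yield (allocation, TBO, TBIO) for every allocation within the budget.

    stage_times[i][p - 1] is the execution time of stage i on p processors.
    """
    ranges = [range(1, len(times) + 1) for times in stage_times]
    for alloc in product(*ranges):
        if sum(alloc) <= total_processors:
            times = [t[p - 1] for t, p in zip(stage_times, alloc)]
            # Chain assumption: TBO = slowest stage, TBIO = sum of stages.
            yield alloc, max(times), sum(times)


def minimize_tbio(stage_times, total_processors, tbo_bound):
    """Response time problem: smallest TBIO subject to TBO <= tbo_bound."""
    feasible = [(tbio, alloc)
                for alloc, tbo, tbio in allocations(stage_times, total_processors)
                if tbo <= tbo_bound]
    return min(feasible, default=None)


def minimize_tbo(stage_times, total_processors, tbio_bound):
    """Throughput problem: smallest TBO subject to TBIO <= tbio_bound."""
    feasible = [(tbo, alloc)
                for alloc, tbo, tbio in allocations(stage_times, total_processors)
                if tbio <= tbio_bound]
    return min(feasible, default=None)


# Hypothetical two-stage pipeline; each row lists times on 1, 2, 3 processors.
stage_times = [[8, 4, 3], [6, 3, 2]]
best_response = minimize_tbio(stage_times, 4, tbo_bound=6)
best_throughput = minimize_tbo(stage_times, 4, tbio_bound=10)
```

Each function returns a pair of the optimized value and the allocation that achieves it, or `None` when no allocation satisfies the bound. Exhaustive enumeration grows exponentially with the number of stages, so it is useful only for checking small cases.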
Note the similarity of the above two problems to the TBO versus TBIO tradeoff in
ATAMM operating points, but also note that in this case, unlike in ATAMM, the choice
of "operating point" is a matter of how many processors are assigned to each node.
For highly-specific, precisely-stated mathematical optimization problems such as
those in [7], it is dangerous to make comparisons with other problems since often a
seemingly minor difference in problem statement or the assumptions can lead to quite a
different problem. Nonetheless an attempt will be made to relate the throughput
optimization problem to the ATAMM model. In the ATAMM model there is no way to
reduce the execution time (TBIO) of a task by using additional processors, so it is very
difficult to relate the response time problem to ATAMM. However, for the types of data
flow graphs considered in [7], namely trees and series-parallel graphs, it is always possible
(because these graphs have no loops) to increase throughput by allowing for additional
multiple node instantiations. So, given the appropriate type of data flow graphs, it is
possible to use the throughput optimization algorithm of [7] to compute the number of
processors to assign to each node, i.e. the number of multiple instantiations allowed for
each node. Before the algorithm can be used, it is necessary to determine the processor
count to execution time mapping, but in the ATAMM model this function is always linear:

    Effective execution time = (Node execution time) / (Number of instantiations)

This "execution time" is not TBIO. Instead, it is necessary to view n consecutive
iterations of a node as a single "task", where n = Number of instantiations.
While a mapping between the problem domains of ATAMM and [7] can be made
as above, it is of little practical value since the dynamic programming algorithm in [7] is
designed to handle arbitrary processor to execution time functions, and so is overkill in
terms of both conceptual and time complexity for the special case of a linear execution
time function as results from ATAMM throughput speedup by multiple instantiation.
5. Conclusion
The primary objective of [2] is to show that for any DFG it is possible to find a
rate-optimal schedule that is fully-static with respect to the unfolded graph and cyclo-
static with period J * TBO_LB with respect to the original graph. Because it may be
necessary to unfold a DFG an exponential number of times, the cyclo-static period may be
quite large.
Algorithm 7.1 of [2], which runs in exponential time, constitutes the proof that a
schedule with the desired properties can be found. As a side effect of this algorithm, an
upper bound on the number of processors required to execute the schedule is found.
If (and only if) the number of processors used is not an issue, the determination of
a rate-optimal schedule is trivial. For each node A, simply assign P_A = ceil(T(A)/TBO_LB)
processors to A and allow P_A concurrent instantiations of A. With such an assignment, the
graph will never enter a resource-limited mode, and will execute at TBO = TBO_LB [1].
Furthermore the schedule will be optimally-static since a node is never executed on a
processor not originally assigned to it, and no more processors are assigned to a node than
are required by that node.
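
The trivial assignment can be worked through numerically. The Python sketch below reads the per-node rule as P_A = ceil(T(A)/TBO_LB), taking T(A) to be the execution time of node A; that reading, the function names, and the sample numbers are all assumptions made for illustration. Times are assumed to be positive integers so that the ceiling can be computed exactly.

```python
def processors_for_rate_optimal(exec_times, tbo_lb):
    """Return {node: P_A} with P_A = ceil(T(A) / TBO_LB), in integer arithmetic."""
    return {name: -(-t // tbo_lb) for name, t in exec_times.items()}


def effective_times(exec_times, instantiations):
    """Effective execution time = node execution time / number of instantiations."""
    return {name: exec_times[name] / instantiations[name] for name in exec_times}


exec_times = {"A": 5, "B": 2, "C": 7}  # hypothetical node execution times
tbo_lb = 3                             # hypothetical lower bound on TBO

p = processors_for_rate_optimal(exec_times, tbo_lb)
eff = effective_times(exec_times, p)
total = sum(p.values())
```

For these made-up values the rule gives 2, 1, and 3 processors for A, B, and C, six in total, and every effective execution time is at most TBO_LB, which is what allows the graph to sustain TBO = TBO_LB. The total is generally not the minimum number of processors, since processors are never shared between nodes.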
The difficult aspect of the problem is attempting to minimize the number of
processors required for an optimally-static rate-optimal schedule. A nearly identical
problem is discussed in [5], the difference being that multiple instantiations are not
allowed since [5] uses the strict definition of fully-static; the problem is found to be NP-
complete as expected. As is done in [5], it is wise to directly attack, either through
heuristics or exhaustive search, the problem of optimizing the number of processors
required to achieve desired execution properties, such as a rate-optimal TBO and
optimally-static processor assignment.
6. Future Work
The following is a summary of recommendations for possible future work for the
ATAMM project, some of which have already been mentioned in various places in this
document.
6.1. Theoretical Work
• Formally prove that a graph executing under ATAMM rules will eventually reach
steady-state.
• Formally prove an upper bound on the number of transients that occur from the time
when a graph begins execution until steady-state is reached. (This is a stronger
version of the previous item.)
• Since required memory space appears to be a major advantage of ATAMM vis-a-vis
optimum unfolding, additional analysis, both theoretical and through empirical examples,
could be used to prove this point.
6.2. Implementation Work
• Develop an optional additional node-binding step in the design procedure for achieving
fully-static (or cyclo-static if that is all that is desired) operation, as discussed in
Section 4.1.
• Evaluate the node to processor assignment strategy by simulation.
References
[1] S. Som, J. W. Stoughton, and R. R. Mielke, "Strategies for Concurrent
Processing of Complex Algorithms in Data Driven Architectures," NASA
Contractor Report 187450, Grant NAG1-683, October 1990.
[2] K. K. Parhi and D. G. Messerschmitt, "Static Rate-Optimal Scheduling of Iterative
Data-Flow Programs via Optimum Unfolding," IEEE Trans. Computers, vol. 40,
pp. 178-195, February 1991.
[3] R. Mielke, J. Stoughton, S. Som, R. Obando, M. Malekpour, and B. Mandala,
"Algorithm to Architecture Mapping Model (ATAMM) Multicomputer Operating
System Functional Specification," NASA Contractor Report 4339, Cooperative
Agreement NCC1-136, November 1990.
[4] R. Mielke, J. Stoughton, S. Som, "Modeling and Optimum Time Performance for
Concurrent Processing," NASA Contractor Report 4167, Grant NAG1-683,
August 1988.
[5] Sonia M. Heemstra de Groot, Sabih H. Gerez, Otto E. Herrmann, "Range-Chart-
Guided Iterative Data-Flow Graph Scheduling," IEEE Trans. Circuits and
Systems, vol. 39, no. 5, pp. 351-364, May 1992.
[6] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the
Theory of NP-Completeness, San Francisco, CA: Freeman, 1979.
[7] D. M. Nicol, R. Sinha, A. N. Choudhury, B. Narahari, "Optimal Processor
Assignment for Pipeline Computations," NASA Contractor Report 189550,
Contract No. NAS1-18605, October 1991.

REPORT DOCUMENTATION PAGE (Form Approved, OMB No. 0704-0188)
1. Agency Use Only: (leave blank)
2. Report Date: February 1993
3. Report Type and Dates Covered: Contractor Report, 6/29-8/14/92
4. Title and Subtitle: A Comparison of Multiprocessor Scheduling Methods for
Iterative Data Flow Architectures
5. Funding Numbers: G NAG1-613; WU 586-03-11
6. Author(s): Matthew Storch
7. Performing Organization Name(s) and Address(es): University of Illinois,
The Department of Computer Science, Illinois Computing Laboratory for Aerospace
Systems and Software (ICLASS), Urbana, Illinois 61801
8. Performing Organization Report Number: UIUCDCS-R-92-1776; UILU-ENG-92-1756
9. Sponsoring/Monitoring Agency Name(s) and Address(es): National Aeronautics and
Space Administration, Langley Research Center, Hampton, VA 23681-0001
10. Sponsoring/Monitoring Agency Report Number: NASA CR-189730
11. Supplementary Notes:
Langley Technical Monitor of Grant: Kathryn A. Smith
Langley Technical Monitor for this supplemental research: Paul J. Hayes
12a. Distribution/Availability Statement: Unclassified - Unlimited; Subject Category 33
12b. Distribution Code:
13. Abstract (Maximum 200 words):
A comparative study is made between the Algorithm to Architecture Mapping Model
(ATAMM) and three other related multiprocessing models from the published
literature. The primary focus of all four models is the non-preemptive
scheduling of large-grain iterative data flow graphs as required in real-time
systems, control applications, signal processing, and pipelined computations.
Important characteristics of the models such as Injection Control, Dynamic
Assignment, Multiple Node Instantiations, Static Optimum Unfolding, Range-Chart
Guided Scheduling, and Mathematical Optimization are identified. The models from
the literature are compared with the ATAMM for performance, scheduling methods,
memory requirements, and complexity of scheduling and design procedure.
14. Subject Terms: Iterative data flow, multiprocessing, static scheduling,
dynamic scheduling
15. Number of Pages: 21
16. Price Code: A03
17. Security Classification of Report: Unclassified
18. Security Classification of This Page: Unclassified
19. Security Classification of Abstract: Unclassified
20. Limitation of Abstract: UL
NSN 7540-01-280-5500
