Strategies for concurrent processing of complex algorithms in data driven architectures by Stoughton, John W. et al.
NASA CONTRACTOR REPORT 187450
Strategies for Concurrent Processing
of Complex Algorithms in Data
Driven Architectures
Sukhamoy Som, John W. Stoughton,
and Roland R. Mielke
Old Dominion University
Norfolk, Virginia
Grant NAG1-683
OCTOBER 1990
(NASA-CR-187450) STRATEGIES FOR CONCURRENT
PROCESSING OF COMPLEX ALGORITHMS IN DATA
DRIVEN ARCHITECTURES Final Report, May 1988
- Aug. 1989 (Old Dominion Univ.) 178 p
CSCL 09C
N91-20395
Unclas
0003_61
National Aeronautics and
Space Administration
Langley Research Center
Hampton, Virginia 23665-5225
https://ntrs.nasa.gov/search.jsp?R=19910011082 2020-03-19T18:32:58+00:00Z
TABLE OF CONTENTS
PAGE
LIST OF TABLES ........................................... iii
LIST OF FIGURES ........................................... iv
LIST OF SYMBOLS ........................................... viii
EXECUTIVE OVERVIEW ........................................ xi
CHAPTER
1. INTRODUCTION .......................................... 1
1.0 Preface .......................................... 1
1.1 Background ....................................... 1
1.2 Problem Representation by the ATAMM Model ......... 5
1.3 Objectives and Organization of Dissertation ....... 13
2. PERFORMANCE MODEL ..................................... 17
2.0 Introduction ..................................... 17
2.1 Performance Measures ............................. 17
2.2 Marked Graph Characteristics ..................... 20
2.3 Graph Theoretic Performance Bounds ............... 30
2.4 Resource Requirements ............................ 36
2.5 Summary .......................................... 53
3. ALGORITHM TRANSFORMATION .............................. 55
3.0 Introduction ..................................... 55
3.1 Algorithm Transformation Guidelines .............. 55
3.2 Performance Improvements by Transformation ........ 62
3.3 Implementation of Periodicity by Transformation... 72
3.4 Structural Changes in Algorithm by Transformation. 80
3.5 Summary .......................................... 93
4. ATAMM OPERATING POINT DESIGN .......................... 95
4.0 Introduction ..................................... 95
4.1 Characteristics of Operating Point ............... 95
4.2 Operating Point Design ........................... 101
4.3 Test Results ..................................... 111
4.4 Summary .......................................... 146
5. CONCLUSION ............................................ 147
LIST OF REFERENCES ......................................... 153
APPENDIX................................................... 156
LIST OF TABLES
TABLE                                                   PAGE
4.1   Comparison of Results for Test 1 ................ 119
4.2   Comparison of Results for Test 2 ................ 127
LIST OF FIGURES
FIGURE                                                  PAGE
1.1   Algorithm marked graph for discrete system equation ................ 9
1.2   ATAMM node marked graph model ...................................... 12
1.3   ATAMM computational marked graph model for discrete system equation  14
1.4   ATAMM model components ............................................. 15
2.1   An algorithm for flight simulation plan ............................ 21
2.2   Example algorithm marked graph ..................................... 23
2.3   Example of node and process circuits ............................... 26
2.4   Computational marked graph for the AMG ............................. 28
2.5   Example of recursion and parallel path circuits .................... 29
2.6   Modified algorithm marked graph for Figure 1.1 ..................... 32
2.7   Algorithm marked graph for illustration of SGP and SRE ............. 42
2.8   SGP and SRE ........................................................ 43
2.9   Total graph play and total resource envelope for TBO = 2 ........... 50
2.10  Single resource envelope and total resource envelope for TBO = 3 ... 52
3.1   Transformed algorithm marked graph in Application 1 ................ 60
3.2   Computational marked graph for the transformed AMG ................. 61
3.3   AMG for illustration of Application 2 .............................. 68
3.4   SRE and TRE for TBO = 2 ............................................ 69
3.5   Transformed AMG for Figure 3.3 ..................................... 70
3.6   For the AMG transformed by control place 1, SRE and TRE for TBO = 2  71
3.7   For the transformed AMG with all control places, SRE and TRE for TBO = 2  73
3.8   Injection control by Application 3 ................................. 75
3.9   Example AMG for illustration of Application 4 ...................... 78
3.10  SGP and TGP for TBO = 2 ............................................ 79
3.11  Transformed AMG and total graph play for TBO = 3 ................... 81
3.12  AMG A1 and transformed AMG A2 ...................................... 83
3.13  Algorithm 1, Algorithm 2, and Algorithms 1 and 2 are combined by dummy transitions  84
3.14  AMG for the linear time invariant system ........................... 85
3.15  Transformed AMG for the linear time invariant system ............... 88
3.16  An AMG with a large transition T and T is decomposed in N parallel transitions  91
3.17  AMG before decomposition of B and B is decomposed .................. 94
4.1   ATAMM operating point characteristics .............................. 98
4.2 (a)  AOP characteristics under specific transformations .............. 102
4.2 (b)  The strategies for AOP design under resource constraints ........ 106
4.3   SGP and TGP for TBO = 2 ............................................ 107
4.4   TRE for TBO = 3 in Step 4 and TRE for TBO = 4 in Step 6 ............ 109
4.5   SGP and TGP for TBO = 2 ............................................ 110
4.6   Transformed AMG for Steps 5 and 6 .................................. 112
4.7   ATAMM operating points for the example algorithm marked graph ...... 113
4.8   The testbed ATAMM data flow architecture ........................... 115
4.9   AMG for Test 1 and transformed AMG for Test 1 ...................... 117
4.10  Simulation results for the AMG in Test 1 ........................... 120
4.11  Simulation results for the transformed AMG in Test 1 ............... 121
4.12  Experimental results for the AMG in Test 1 ......................... 122
4.13  Experimental results for the transformed AMG in Test 1 ............. 123
4.14  AMG for Test 2 and transformed AMG for Test 2 ...................... 125
4.15  Simulation results for the AMG in Test 2 ........................... 128
4.16  Simulation results for the transformed AMG in Test 2 ............... 129
4.17  Experimental results for the AMG in Test 2 ......................... 130
4.18  Experimental results for the transformed AMG in Test 2 ............. 131
4.19  For Test 3, AMG and SRE ............................................ 132
4.20  Simulation results for AOP of Step 3 in Test 3 ..................... 134
4.21  Simulation results for AOP of Strategy A in Test 3 ................. 135
4.22  AMG for Test 4 and SRE for the AMG of Test 4 ....................... 136
4.23  For the transformed AMG, SRE and SGP ............................... 138
4.24  TGP for transformed AMG ............................................ 139
4.25  Simulation results for AOP of Step 3 in Test 4 ..................... 140
4.26  Simulation results for AOP of Strategy A in Test 4 ................. 141
4.27  Simulation results for AOP of Step 3 in Test 5 ..................... 142
4.28  Simulation results for AOP of Strategy A in Test 5 ................. 143
4.29  Simulation results for AOP of Strategy B in Test 5 ................. 144
4.30  Simulation results for AOP of Strategy C in Test 5 ................. 145
LIST OF SYMBOLS
SYMBOL    DESCRIPTION
AOP       ATAMM Operating Point
AMG       Algorithm Marked Graph
ATAMM     Algorithm To Architecture Mapping Model
b         Section Number of SGP and Data Packet Number
Ci        ith Directed Circuit
CMG       Computational Marked Graph
CC        Computing Capacity
CE        Computing Effort
DR        Data Ready
FUN       Functional Unit
G         An Algorithm Marked Graph
GC        A Computational Marked Graph
G_M       Modified Algorithm Marked Graph for G
GIM       Global Memory
SGP       Single Graph Play
GM        Graph Manager
IE        Input Buffer Empty
IF        Input Buffer Full
MAMG      Modified Algorithm Marked Graph
M(Ci)     Number of Tokens in Circuit i
NMG       Node Marked Graph
OE        Output Buffer Empty
OF        Output Buffer Full
P         Process Time
Pj        Place j
Pi        ith Directed Path
PC        Process Complete
PR        Process Ready
SRE       Single Resource Envelope
r         Read Time
R         Number of Computing Resources
[symbol illegible]  Largest Peak Value of TRE for any TBI > TBO_LB
[symbol illegible]  Peak Value of SRE
RU        Resource Utilization
tj        Transition j
T(Ci)     Total Transition Times in Ci
T(Pi)     Total Transition Times in Pi
TBI       Input Data Injection Interval
TBO       Time Between Outputs
TBOop     TBO at the Operating Point
TBO_ALB   Absolute Lower Bound for TBO
TBO_LB    Lower Bound for TBO
TBIO      Time Between Input and Output
TBIO_ALB  Absolute Lower Bound for TBIO
TBIO_LB   Lower Bound for TBIO
TGP       Total Graph Play
TRE       Total Resource Envelope
[symbol illegible]  Total Backward Computation
TC        Total Computation
TCE       Total Computing Effort
TFC       Total Forward Computation
TFCE      Total Forward Computing Effort
TT        Task Time
TT_ALB    Absolute Lower Bound for TT
TT_LB     Lower Bound for TT
W         Write Time
EXECUTIVE OVERVIEW
The purpose of this report is to document research to develop
strategies for concurrent processing of complex algorithms in data
driven architectures performed under Grant NAG1-683 during the period
May 1988 to August 1989. In this overview, the problem domain is
described, the motivation for this research is explained, and a
summary of research activities is presented. The detailed
description of the investigation is taken from the doctoral
dissertation by Dr. Sukhamoy Som entitled "Performance Modeling and
Enhancement for the ATAMM Data Flow Architecture".
During earlier grant periods, a computational model called the
Algorithm To Architecture Mapping Model (ATAMM) was formulated for
mapping large-grain, decision-free algorithms to a multicomputer data
flow architecture. Major applications are expected to be real-time
implementation of control and signal processing algorithms where
performance is required to be highly predictable and fault tolerant.
Of interest is the periodic execution of algorithms. For our
purposes, an algorithm is expressed as a directed graph where vertices
(nodes) represent algorithm operations and edges represent data sets
or signals. Large-grain refers to the assumption that the time
required to perform algorithm operations is large compared to the time
required to move data from one node to another. Decision-free refers
to the absence of data dependent paths in the algorithm graph
representation. The architecture is assumed to consist of two to
twenty functional units or resources each having a capability of
processing, communication, and memory. The resources share a common
global memory which is centralized or distributed. The coordination
of resources in relation to data and control flow is directed by a
graph manager. The graph manager also is centralized or distributed.
Assignment of a functional unit to a specific algorithm node is made
by the graph manager according to ATAMM rules and a priority ordering
of algorithm nodes. All assignments are non-preemptive for minimum
communication cost. In a specific hardware setting, the graph
manager, global memory, and functional unit activities together form
the ATAMM Multicomputer Operating System or AMOS.
The ATAMM model is important because it specifies criteria for
a multicomputer operating system to achieve predictable and highly
fault tolerant performance, and it creates a platform for
investigating different algorithm decompositions and implementation
strategies in a hardware independent context. In earlier reports, the
use of the ATAMM model is described for determining analytically
performance bounds and developing an operating strategy for optimum
time performance. In addition, the construction of an ATAMM defined
data flow architecture and development of simulation and analysis
tools are reported. During the present grant period, research is
carried out for performance modeling and performance enhancement for
the ATAMM data flow architecture. In order to have a predictable
performance, it is necessary that assignment of algorithm nodes to
functional units be as much priority independent as possible. This is
done to avoid the priority inversion problem. Even for small run-time
variations of communication delays and execution times, a
low priority algorithm node may be enabled before a high priority
algorithm node. As the assignment is non-preemptive, this may
completely change the graph execution pattern and resource
requirements. In order to overcome this problem, it is suggested that
the operating system (AMOS) transform the algorithm graph and control
input data injection interval so that a functional unit always is
available for every enabled algorithm node. In other words, even if
priority inversion changes the order of execution of algorithm nodes,
graph execution patterns and resource requirements will not be changed
drastically. Two performance measures, TBIO and TBO, are defined for
periodic processing of algorithms. TBIO is an indicator of computing
speed for an algorithm. TBO is a measure of the time interval between
algorithm outputs, and the inverse of TBO indicates throughput. The
time performance (TBIO, TBO) and the number of required resources
define an operating point for AMOS. If enough functional units are
available, optimum TBIO and TBO can be achieved. However, if a
limited number of resources is available, one must increase either TBO
or TBIO, or a combination of both. Two key methods for shifting the
operating point are control of the input injection interval and
transformation of the algorithm graph. Transformation of the
algorithm graph is achieved by adding dummy nodes (transitions) and
control edges (places) as described below. A dummy node is an
algorithm node which implements an identity operation and requires
zero time. It is used as a buffer to provide additional storage space
for the output of an algorithm node. A dummy node is a pure memory
operation and does not require a resource. A control edge is an
algorithm edge which imposes a precedence relation between two algorithm
nodes but does not imply data dependency. This type of edge is used
to delay the execution of a node. Thus, predictable performance is
achievable even if the number of functional units decreases to 1. An
ATAMM simulator and experiments on a three resource testbed provide
verification of performance modeling and graph transformation methods.
The use of brand names in this report is for completeness, and
does not indicate NASA endorsement.
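The effect of a control edge on resource demand, as described in this overview, can be illustrated with a small sketch. The four-node graph, the node names, and the transition times below are hypothetical and are not taken from the report. The helper names `asap_schedule` and `peak_demand` are illustrative, and the sketch assumes every node starts as soon as all of its predecessors have finished.

```python
def asap_schedule(durations, preds):
    """Earliest start time of each node when resources are unlimited."""
    start = {}

    def visit(node):
        if node not in start:
            start[node] = max(
                (visit(p) + durations[p] for p in preds.get(node, ())),
                default=0,
            )
        return start[node]

    for node in durations:
        visit(node)
    return start


def peak_demand(durations, start):
    """Largest number of nodes executing at the same time."""
    events = []
    for node, s in start.items():
        if durations[node] > 0:  # zero-time dummy nodes need no resource
            events.append((s, 1))
            events.append((s + durations[node], -1))
    events.sort()  # at equal times a release (-1) sorts before a start (+1)
    peak = busy = 0
    for _, delta in events:
        busy += delta
        peak = max(peak, busy)
    return peak


# Hypothetical graph: a feeds b and c, which both feed d.
durations = {"a": 1, "b": 2, "c": 2, "d": 1}
original = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
# Control edge b -> c: a precedence relation with no data dependency.
with_control = {"b": ["a"], "c": ["a", "b"], "d": ["b", "c"]}

start_orig = asap_schedule(durations, original)
start_ctrl = asap_schedule(durations, with_control)
print(peak_demand(durations, start_orig), start_orig["d"] + durations["d"])
print(peak_demand(durations, start_ctrl), start_ctrl["d"] + durations["d"])
```

In this toy case the control edge lowers the peak demand from two units to one while lengthening the completion time from 4 to 6 time units, which is the kind of trade between resources and time performance that the overview describes.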
CHAPTER ONE
INTRODUCTION
1.0 Preface
The Algorithm To Architecture Mapping Model (ATAMM) is a new graph
theoretic model from which the rules for data and control flow in a
homogeneous, multicomputer, data flow architecture may be defined
[1, 2]. The subject of this dissertation is the investigation of
concurrent processing in such an ATAMM defined architecture for
large-grain, decision-free algorithms. Performance modeling,
performance enhancement, and the development of operating strategies
for periodic execution of such algorithms are the key research
objectives. Chapter One is an introduction to ATAMM and a discussion
of the motivation behind the research. Background for the ATAMM model
and this research is presented in Section 1.1. The computational
problem representation by the ATAMM model is presented in Section
1.2. The objectives and organization of this dissertation are
described in Section 1.3.
1.1 Background
The principles of computer architecture design historically have
been based upon the von Neumann organization [3]. These principles
have led to architectures consisting of a single computer in which low
level machine language instructions perform simple operations on
elementary operands, and centralized, sequential control of
computation is employed. Despite the fact that electronic components
are becoming increasingly faster, the desired computer performance has
always been much more than that which is obtainable with the von
Neumann organization. Advances in solid state technology alone
are not expected to be enough to produce computers to meet the
computational needs of the future. There is a growing agreement that
the next (fifth) generation of computers will be based upon non-von
Neumann structures.
Recently, a number of new computer architectures have been
proposed from which a number of computer systems have been built [3].
The need for new computer architectures has been motivated mainly by
three objectives. First, there is the desire to increase computer
performance through the use of concurrency. Second, there is the
desire to more fully exploit very large scale integration (VLSI) in
the design of computers. Third, there is interest in new programming
methods which facilitate the mapping of algorithms onto
architectures. These ideas suggest a decentralized computer
architecture in which a number of independent computers are to work
together. These independent computers, each having a capability for
processing, communication, and memory, can be as large as a
geographically distributed mainframe computer or as small as
microcomputers on a single VLSI chip. Unfortunately, strategies for
interconnecting and programming such architectures based upon von
Neumann principles have not evolved. It appears that von Neumann
organization principles are not adequate to address the complex issues
of scheduling, coordination, and communication.
Strategies for control of computations on decentralized computer
architectures can be classified broadly as control flow, demand
driven, and data driven. In control flow computers, explicit flows of
control cause the execution of instructions. In demand driven
architectures, the execution of operations is triggered by the
requirements of outputs or results. In data driven architectures
(also known as data flow computers), the availability of operands
triggers the execution of operations. Data flow architectures are the
primary interest of this research because of their suitability for
concurrent processing of complex algorithms.
A useful mathematical tool for modeling execution of complex
algorithms on a data flow decentralized architecture is the Petri
net. Petri nets were first developed in 1962 by Carl Petri [4], and
later were identified as a useful analysis tool in the work of Holt
and Commoner [5]. A comprehensive treatment of Petri nets is
presented in [6]. One problem with the Petri net model is that it
tends to be too complicated to analyze. An important subclass of
Petri net is the marked graph where each place has exactly one
incoming and one outgoing arc. Marked graphs can be used to model the
processing of decision-free algorithms [7]. Properties such as
liveness, safeness, and reachability can be achieved for marked graph
models [6]. Procedures also exist for expanding and reducing marked
graphs while preserving these properties [8]. These graph features
are suitable for modeling the succession of single events such as data
and status conditions. In this dissertation, the marked graph is used
as a modeling tool for data driven computations.
The data flow concept has already attracted the attention of a great
many researchers, and a number of data flow computers have been built
[9]. However, only a few researchers have tried to develop a
theoretical model for evaluating computation in a data driven
architecture [10]. These models do not appear to be adequate to
address the complex issues of scheduling, coordination, and
communication. Therefore, the performance of algorithms is often
unpredictable and hardware dependent in these data flow computers.
There is a need for a simple, but effective, model for data
driven computations in order to investigate the relative merits of
different algorithm decompositions and implementation strategies in a
hardware independent context. Ongoing research at Old Dominion
University has led to the development of a new marked graph model for
describing data and control flow associated with the execution of
algorithms in data flow architectures [2]. The model is identified by
the acronym ATAMM, which represents Algorithm To Architecture Mapping
Model [11]. Specifications derived from the model lead directly to
the description of a data flow architecture, which will be called the
ATAMM data flow architecture henceforth. The availability of the
ATAMM model is important for at least three reasons. First, it
provides a context in which to investigate algorithm decomposition
strategies without the need to specify a particular ATAMM data flow
architecture. Second, the model identifies the data flow and control
dialogue required of any ATAMM data flow architecture which implements
the algorithm. Third, the model provides a basis for analytically
calculating performance bounds and developing a methodology for
improvement in performance.
The problem domain addressed by the ATAMM data flow architecture
and this research consists of decision-free, large-grain, complex
algorithms which are assumed to be executed periodically in a
multicomputer environment. The algorithms are assumed to require
large computations which would include such computations as matrix
addition, multiplication, etc. The anticipated multicomputer
environment is assumed to consist of two to twenty identical computers
or functional units each having a capability of processing,
communication and memory. The primary reason for such assumptions is
the objective of implementing control and signal processing algorithms
in next generation multicomputer architectures for real time
applications on future spacecraft. The granularity level of the
algorithm decomposition is kept high to avoid communication
bottlenecks as observed in many fine-grain data flow architectures
[12]. The range of functional units is suggested due to the
large-grained aspect of the algorithm decomposition. Of interest is
the definition of a performance model so that the performance of the
algorithms can be evaluated and improved. Also an operating procedure
is needed for obtaining predictable performance with respect to
available computing elements.
1.2 Problem Representation by the ATAMM Model
The ATAMM model consists of a set of Petri net marked graphs
which incorporate general specifications of communication and
processing associated with each computational event in a data flow
architecture. In this section, the computational problem is
represented by the ATAMM model. First, a detailed description
of the problem context is stated. This is followed by the definition
of the ATAMM model consisting of the algorithm marked graph, the node
marked graph, and the computational marked graph. Some familiarity
with Petri nets [6] and marked graphs [13] is assumed.
A problem description normally results in the definition of a
function given by the triple (X, Y, F), where X represents the set of
admissible inputs, Y the set of admissible outputs, and F: X -> Y the
rule of correspondence which unambiguously assigns exactly one element
from Y to each element of X. Associated with a computational problem
is one or more algorithms. An algorithm is an explicit mathematical
statement, expressed as an ordered set of primitive operations, which
explains how to implement the rule of correspondence F. A primitive
operation is a complex computation. Matrix multiplication and
addition are examples of primitive operations. In general, a given
problem can be decomposed by several different primitive operator
sets. Also, for a given primitive operator set, there are often
different orderings of primitive operations which can be specified to
carry out the problem. Of special interest are algorithm
decompositions in which two or more primitive operations can be
performed concurrently. For such decompositions, the potential exists
for decreasing the computational time required to solve the problem by
increasing the computational resources which implement the primitive
operations.
The hardware environment for executing the decomposed algorithms
is assumed to consist of R identical computers or functional units
(FUN's), where R has a value in the range of two to twenty. These
computers or functional units are also denoted by the terms
"computing element" or "resource". Each functional unit is a
processor having local memory for program storage and temporary input
and output data containers. Each functional unit can execute any
algorithm primitive operation. The functional units share a common
global memory (GIM), which may be either centralized or distributed.
The coordination of functional units in relation to data and control
flow is directed by the graph manager (GM). The graph manager also
may be centralized or distributed. Output created by the completion
of a primitive operation is placed into global memory only after the
output data containers have been emptied. That is, outputs must be
consumed as inputs to successor primitive operations before allowing
new data to fill the output locations. Assignment of a functional
unit to a specific algorithm primitive operation is made by the graph
manager only when all inputs required by the operation are available
in global memory and a functional unit is available.
An algorithm marked graph (AMG) is a marked graph which
represents a specific algorithm decomposition. Transitions and places
are represented as vertices and directed edges respectively. Vertices
of the algorithm marked graph are in a one-to-one correspondence with
each occurrence of a primitive operation. The transition times
represent the computation times of the respective primitive
operations. The algorithm marked graph contains an edge (i, j)
directed from vertex i to vertex j if the output of vertex i is an
input for vertex j. Edge (i, j) is marked with a token if an output
from vertex i is available as an input to vertex j. By the rules of
the marked graph, the computation of a vertex can only be done when
all the incoming edges have a token on them. When constructing an
algorithm marked graph, vertices (transitions) are displayed as
circles, and edges (places) are displayed as directed line segments
connecting appropriate vertices. The presence of a token on an edge
is indicated by a solid dot placed on the edge. Source transitions
and sink transitions for input and output signals are represented as
squares. Sources for constants are not usually included in the
algorithm marked graph; however, triangles are used for this purpose
when necessary.
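The token and firing rule just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the report's implementation: the class name `MarkedGraph` and its methods are hypothetical, transition times are ignored, and firing is treated as instantaneous.

```python
from dataclasses import dataclass, field


@dataclass
class MarkedGraph:
    """Marked graph in which each edge (i, j) is a place holding tokens."""
    tokens: dict = field(default_factory=dict)  # (i, j) -> token count

    def add_edge(self, i, j, tokens=0):
        self.tokens[(i, j)] = tokens

    def incoming(self, v):
        return [e for e in self.tokens if e[1] == v]

    def outgoing(self, v):
        return [e for e in self.tokens if e[0] == v]

    def enabled(self, v):
        # A vertex may compute only when every incoming edge holds a token.
        return all(self.tokens[e] > 0 for e in self.incoming(v))

    def fire(self, v):
        if not self.enabled(v):
            raise ValueError(f"vertex {v!r} is not enabled")
        for e in self.incoming(v):
            self.tokens[e] -= 1  # consume one token per incoming edge
        for e in self.outgoing(v):
            self.tokens[e] += 1  # deposit one token per outgoing edge


# Hypothetical three-vertex chain a -> b -> c with data waiting on (a, b).
g = MarkedGraph()
g.add_edge("a", "b", tokens=1)
g.add_edge("b", "c", tokens=0)
b_enabled_before = g.enabled("b")
c_enabled_before = g.enabled("c")
g.fire("b")
```

After `g.fire("b")` the token has moved from edge (a, b) to edge (b, c), so vertex c becomes enabled.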
To illustrate the construction of an algorithm marked graph,
consider the problem of computing the output of a discrete linear,
time invariant system given a sequence of inputs to the system. Let
the system be described by the state equation
x(k) = Ax(k-l) + Bu(k)
and the output equation
y(k) = Cx(k),
where x is a p-vector, u is an m-vector, and y is an r-vector. The
primitive operations are defined as matrix multiplication and vector
addition, and the natural algorithm decomposition resulting from the
state equation description is selected. The algorithm marked graph
for this decomposed algorithm is shown in Figure 1.1. The initial
marking indicates that initial condition data are available.
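As a rough illustration of this decomposition, the sketch below evaluates the state and output equations using only the two primitive operations named above. The matrices, the input sequence, and the function names are made up for illustration, and the exact vertex layout of Figure 1.1 is not reproduced; the sketch only assumes the natural split into two matrix-vector products, one vector addition, and one output product.

```python
def mat_vec(M, v):
    """Primitive operation: matrix-vector multiplication."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vec_add(a, b):
    """Primitive operation: vector addition."""
    return [x + y for x, y in zip(a, b)]


def step(A, B, C, x_prev, u):
    """One period: x(k) = A x(k-1) + B u(k), y(k) = C x(k)."""
    ax = mat_vec(A, x_prev)  # independent of the next product,
    bu = mat_vec(B, u)       # so the two could run concurrently
    x = vec_add(ax, bu)
    y = mat_vec(C, x)
    return x, y


# Hypothetical system with p = 2 states, m = 1 input, r = 1 output.
A = [[0.5, 0.0], [0.0, 0.5]]
B = [[1.0], [2.0]]
C = [[1.0, 1.0]]
x = [0.0, 0.0]  # initial condition data
outputs = []
for u in ([1.0], [1.0], [1.0]):
    x, y = step(A, B, C, x, u)
    outputs.append(y[0])
print(outputs)
```

The products `A x(k-1)` and `B u(k)` have no data dependency on each other, which is the kind of concurrency an algorithm marked graph is meant to expose.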
[Figure 1.1. Algorithm marked graph for discrete system equation.]
The algorithm marked graph is a useful tool for representing
decomposed algorithms and for displaying data flow within an
algorithm. However, the algorithm marked graph does not display
procedures that a computing structure must manifest in order to
perform the computing task. In addition, the issues of control, time
performance, and resource management are not apparent in this graph.
These important aspects of concurrent processing are included in the
ATAMM model through the definition of two additional graphs. The node
marked graph (NMG) is defined to model the execution of a primitive
operation. The computational marked graph (CMG), obtained from the
AMG and the NMG by a set of construction rules, integrates both the
algorithm requirements and the computing environment requirements into
a comprehensive graph model. These additional marked graphs are
defined below.
The node marked graph (NMG) is a Petri net representation of the
performance of a primitive operation by a functional unit. Three
primary activities: reading (r) of input data from global memory,
processing (p) of input data to compute output data, and writing (w)
of output data to global memory, are represented as transitions
(vertices) in the NMG. Data and control flow paths are represented as
places (edges), and the presence of signals is denoted by tokens
marking appropriate edges. The conditions for firing the process and
write transitions of the NMG are as defined for a general Petri net,
while the read transition has one additional condition for firing. In
addition to having a token present on each incoming signal edge, a
functional unit must be available for assignment to the primitive
operation before the read node can fire. Once assigned, the
functional unit is used to implement the read, process, and write
operations before being returned to a queue of available functional
units. The initial marking for an NMG consists of a single token in
the Process Ready place. The NMG model is shown in Figure 1.2.
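The extra firing condition on the read transition, and the way a functional unit is held through read, process, and write, can be sketched as follows. This is a simplified illustration: the function names and the `FU0` label are hypothetical, the internal NMG places are not modeled, and the three activities are only logged rather than timed.

```python
from collections import deque


def try_read(inputs_full, free_units):
    """Fire the read transition if possible; return the assigned unit or None.

    Read needs a token on every incoming signal edge AND a free functional unit.
    """
    if all(inputs_full) and free_units:
        return free_units.popleft()
    return None


def complete_node(unit, free_units, log):
    """The assigned unit performs read, process, and write, then is released."""
    for activity in ("read", "process", "write"):
        log.append((unit, activity))
    free_units.append(unit)  # back to the queue of available units


free_units = deque(["FU0"])
log = []
first = try_read([True, True], free_units)   # inputs ready, unit free
second = try_read([True, True], free_units)  # inputs ready, no unit left
complete_node(first, free_units, log)
```

Here the second request cannot start even though its inputs are ready, because the only functional unit stays assigned until the write activity completes.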
A computational marked graph (CMG) is constructed from the AMG
and the NMG by the following rules:
1) Source and sink nodes in the algorithm marked graph are
represented by source and sink nodes in the CMG.
2) Nodes corresponding to primitive operations in the algorithm
marked graph are represented by NMG's in the CMG.
3) Edges in the algorithm marked graph are represented by edge
pairs, one forward directed for data flow and one backward
directed for control flow, in the CMG.
The forward directed edge goes from predecessor write transition
to successor read or sink transition. This forward edge is also shown
as part of the NMG where it is the OF and IF edge of the predecessor
and successor respectively. The backward directed edge goes from
successor read transition to predecessor read or source transition.
This backward edge is also shown as part of the NMG where it is the OE and
IE edge of the predecessor and successor respectively. The initial
marking for the edge pair consists of a single token in the forward
directed place if data are available, or a single token in the
backward directed place if data are not available.
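The construction rules above can be sketched as a small function. This is an illustration under assumptions, not the report's procedure: the name `build_cmg`, the dictionary layout, and the place labels `data` and `control` are hypothetical; the Process Ready place is omitted because its arcs are not spelled out in the text here; and a sink node is assumed to act as its own transition for both edges of a pair.

```python
def build_cmg(kinds, edges):
    """Build CMG transitions and places from an AMG description.

    kinds: node -> "source" | "sink" | "op"
    edges: (i, j) -> True if data are initially available on the AMG edge
    Returns (transitions, places), where places maps
    (from_transition, to_transition, label) -> initial token count.
    """
    transitions = []
    places = {}
    for node, kind in kinds.items():
        if kind == "op":
            # Rule 2: a primitive operation becomes an NMG (read, process, write).
            r, p, w = f"{node}.read", f"{node}.process", f"{node}.write"
            transitions += [r, p, w]
            places[(r, p, "DR")] = 0
            places[(p, w, "PC")] = 0
        else:
            # Rule 1: source and sink nodes are carried over unchanged.
            transitions.append(node)

    def emitter(n):  # transition that produces data at node n
        return f"{n}.write" if kinds[n] == "op" else n

    def consumer(n):  # read transition of node n, or the node itself
        return f"{n}.read" if kinds[n] == "op" else n

    # Rule 3: each AMG edge becomes a forward (data) and backward (control) place.
    for (i, j), available in edges.items():
        places[(emitter(i), consumer(j), "data")] = 1 if available else 0
        places[(consumer(j), consumer(i), "control")] = 0 if available else 1
    return transitions, places


# Hypothetical AMG: source u -> operation n1 -> sink y, with no data available yet.
kinds = {"u": "source", "n1": "op", "y": "sink"}
edges = {("u", "n1"): False, ("n1", "y"): False}
transitions, places = build_cmg(kinds, edges)
```

Each edge pair holds exactly one token in total, on the data place when data are available and on the control place otherwise, matching the initial marking described above.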
[Figure 1.2. ATAMM node marked graph model. NMG edge labels: IF, Input Buffer Full; IE, Input Buffer Empty; DR, Data Read; PC, Process Complete; PR, Process Ready; OE, Output Buffer Empty; OF, Output Buffer Full.]
The play of the CMG proceeds according to the following graph
rules:
1) A node is enabled when all incoming edges are marked with a
token. An enabled node fires by encumbering one token from
each incoming edge, delaying for some specified transition
time, and then depositing one token on each outgoing edge.
2) A source node and a sink node fire when enabled without
regard for the availability of a functional unit.
3) A primitive operation is initiated when the read node of an
NMG is enabled and a functional unit is available for
assignment to the NMG. A functional unit remains assigned to
an NMG until completion of the firing of the write node of
the NMG.
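The play rules above can be approximated by a small event-driven sketch for a single data packet. Several simplifications are assumed: read, process, and write are lumped into one duration per node, source and sink nodes are left out, the backward control edges are not modeled, and ready nodes are served in arrival order. The function name `play_once` and the example graph are hypothetical.

```python
import heapq


def play_once(durations, preds, num_units):
    """Return {node: (start, finish)} for one data packet through the graph."""
    if num_units < 1:
        raise ValueError("at least one functional unit is required")
    remaining = {n: len(preds.get(n, ())) for n in durations}
    succs = {n: [] for n in durations}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
    ready = sorted(n for n, count in remaining.items() if count == 0)
    running = []  # heap of (finish_time, node)
    free = num_units
    now = 0
    schedule = {}
    while ready or running:
        # Assign free functional units to enabled nodes (non-preemptive).
        while ready and free > 0:
            n = ready.pop(0)
            free -= 1
            schedule[n] = (now, now + durations[n])
            heapq.heappush(running, (now + durations[n], n))
        # Advance to the next completion and release its functional unit.
        now, done = heapq.heappop(running)
        free += 1
        for s in succs[done]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    return schedule


# Hypothetical graph: a feeds b and c, which both feed d.
durations = {"a": 1, "b": 2, "c": 2, "d": 1}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
two_units = play_once(durations, preds, 2)
one_unit = play_once(durations, preds, 1)
print(two_units["d"], one_unit["d"])
```

With two functional units the last node finishes at time 4, and with one unit it finishes at time 6, because node c must wait for the unit held by node b.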
In order to illustrate the construction of a computational marked
graph, the CMG corresponding to the algorithm marked graph of Figure
1.1 is shown in Figure 1.3. The computational marked graph is useful
because it clearly displays the data and control flow which must occur
in any hardware implementation of the algorithm, and because it
provides a hardware independent context in which to evaluate algorithm
performance.
The complete ATAMM model consists of the algorithm marked graph,
the node marked graph, and the computational marked graph. A
pictorial display of this model is shown in Figure 1.4. ATAMM model
characteristics are described in detail in the Appendix.
1.3 Objectives and Organization of Dissertation.
The behavior and performance for periodic execution of complex
algorithms in the ATAMM data flow architecture is investigated in this
dissertation. The problem domain consists of large-grain,
decision-free algorithms. The major research objectives are
threefold. First, a performance model is established. Second, rules
for transformation of algorithms for performance enhancement and
reduction of computing element requirements are identified. Third,
operating strategies are developed for optimum time performance and
for sub-optimum time performance under limited availability of
computing elements.
[Figure 1.3. Computational marked graph for the algorithm marked graph of Figure 1.1.]
[Figure 1.4. ATAMM model components.]
The dissertation is organized in five chapters and an appendix.
In the Appendix ATAMM model characteristics, some of which are used in
this dissertation, are described in detail. Definitions of the
computing environment, performance measures, and evaluation of
performance bounds and resource requirements are presented in Chapter
Two. In Chapter Three, algorithm transformations for improving
performance, and methods for enforcing desired resource envelope and
inducing structural changes in algorithm marked graphs are described.
The definition and characterization of an operating point design
procedure, and the results of simulations are presented in Chapter
Four. Finally, conclusions from this research and future research
topics are presented in Chapter Five.
CHAPTER TWO
PERFORMANCE MODEL
2.0 Introduction
A performance model for the ATAMM (Algorithm To Architecture
Mapping Model) data flow architecture is described in this chapter.
The objective is to determine computing speed, throughput capacity and
resource (computing element) need for implementing decision-free
large-grain algorithms on the ATAMM data flow architecture. The
computing environment and performance measures are defined in Section
2.1. In Section 2.2, characteristics of marked graphs, which are
needed to establish the performance model, are described. Graph
theoretic lower bounds for the time performance of algorithm marked
graphs operated in the ATAMM data flow architecture are established in
Section 2.3. Resource needs are predicted and performance bounds in
the presence of resource limitations are evaluated in Section 2.4. A
summary of the chapter is presented in Section 2.5.
2.1 Performance Measures
The importance of the ATAMM model is that it provides a hardware
independent context in which to investigate the performance of
decomposed algorithms as long as the architecture obeys the rules of
the CMG. It is assumed that a decomposed algorithm is implemented in
an ATAMM data flow architecture containing R identical resources or
functional units. Each functional unit is capable of performing any
of the primitive operations whose sequence defines the decomposition.
The tokens on the CMG indicate the data and control flow that must
occur in any hardware implementation of the algorithm. Consider a CMG
in some initial marking. A task is defined when, for a given input
data packet, the CMG proceeds through all its markings and returns to
its initial marking. Equivalently, a task is the sequence of
computations defined by the AMG operations on a given input data set.
Task output occurs when a corresponding output data token is deposited
at the output sink node. It should be noted that task output and task
completion do not always coincide. In many iterative signal
processing algorithms, computations are required to generate initial
conditions for the next iteration which often occur after the output
has been calculated. For control and signal processing applications,
tasks are repeated periodically with new input data sets (data
packets). New tasks are begun when new data sets are injected as
input tokens from the input source node at a finite interval of time
so that computing time and resource needs are identical for all data
sets. Of interest is the relationship of concurrency to performance
for repeated inputs.
Computational concurrency occurs in two ways. First, several
transitions of the task may be performed on an individual data set
simultaneously. This type of concurrency is termed parallel
concurrency because it is the result of inherent parallelism in the
algorithm. Parallel concurrency has a direct effect on task computing
speed. It is limited by the number of transitions that can
be performed simultaneously for the given task and by the number of
functional units available to perform the transitions. Second,
transitions of the task belonging to different data sets can be
performed simultaneously in the computing system. This type of
concurrency is referred to as pipeline concurrency because the task is
repeated for successive data sets, like a pipeline. This type of
concurrency has a direct effect on throughput. Throughput is limited
by the capacity of the graph to accommodate additional data sets and
by the number of functional units available to implement the algorithm
periodically.
Three performance measures, TBIO, TT, and TBO, are now defined
for concurrent processing of complex algorithms in ATAMM data flow
architectures. TBIO and TT are indicators of computing speed for a
task and thus reflect the degree of parallel concurrency. TBO is a
measure of time interval between task outputs. The inverse of TBO
indicates throughput, and thus reflects the degree of pipeline
concurrency.
Definition 2.1: TBIO. The performance measure TBIO (time between
input and output) is the elapsed computing time between a task input
and the corresponding task output.
Definition 2.2: TT. The performance measure TT (task time) is the
elapsed computing time between a task input and the completion of all
computation associated with that task input.
Definition 2.3: TBO. The performance measure TBO (time between
outputs) is the elapsed computing time between successive task outputs
when the graph is operating periodically at steady state.
To illustrate, an algorithm marked graph for an aircraft flight
simulation is shown in Figure 2.1. SI is the input source
representing flight plan data. SO is the output sink representing
moving map and flight instruments data. Transitions of the graph
represent activities. Places represent data dependency or precedence
relation. Tokens on places are initial tokens representing initial
condition data. As an example, transition 3 represents inertial
navigation computation and requires ten time units for processing.
Time units associated with transitions are relative and are measured
with respect to a reference. Transition 7 (zero processing time) is
used to combine outputs of the coordinate transform computation
(moving map) and the auto-pilot computation (control for flight
instruments). TBIO is the time to produce the outputs in SO for
flight plan data. TT is the time to finish all processing for a task
input. TBIO and TT need not be the same for all problems although
they are related. TBO is the time between arrival of successive
output tokens in the output data sink when the algorithm is executed
periodically at steady state.
2.2 Marked Graph Characteristics
Marked graphs, a class of Petri nets, are used as a device for
expressing the ATAMM model. A marked graph is viewed as a directed graph
where the vertices are the transitions and the edges are directed
places. In this section, the concepts of path and circuit for the marked
graph are developed. Only directed paths and circuits are of interest
to this dissertation. Unless otherwise stated, a path or a circuit of a
marked graph should always be understood to be a directed path or a
directed circuit respectively. Some properties of the marked graph
which are needed to establish a performance model are stated. Also,
circuits of the CMG are classified. Let ti and Pi denote
transition i and place i respectively.
[Figure 2.1. Algorithm marked graph for an aircraft flight simulation.]
Definition 2.4: Directed Path. A directed path in a marked graph is
a finite alternating sequence of distinct transitions and distinct
directed places with the following property. The sequence begins and
ends with transitions and every place originates from the immediate
predecessor transition and ends on the immediate successor transition
in that sequence.
To illustrate, the sequence SI, P1, t1, P2, t2, P3, t3, P4, and SO
is a directed path in Figure 1.1. But the sequence t1, P2, t2, P6,
t4, P5, t2, P3, and t3 is not a directed path in Figure 1.1 as
transition 2 appears twice in that sequence.
Definition 2.5: Directed Circuit. A directed circuit in a marked
graph is the same as a directed path except that beginning and end
transitions are the same in a directed circuit.
To illustrate, the sequence t2, P6, t4, P5, and t2 is a directed
circuit in Figure 1.1.
Definition 2.6: Parallel Paths. Parallel paths are directed paths
which have identical beginning and ending transitions; however, all
other transitions and places on all directed paths are distinct.
In Figure 2.2, the sequence t1, P2, t2, P3, t3, P4, t4, P5, and
t5 and the sequence t1, P6, t6, P8, and t5 are parallel paths.
Definition 2.7: Group of Paths. A group of paths is a finite number
of directed paths from a marked graph.
To illustrate, the sequences t2, P7, t7, P9, t4 and t1, P6, t6,
P8, t5 form a group of paths in Figure 2.2.
[Figure 2.2. Algorithm marked graph containing parallel paths.]
Definition 2.8: Path Length. The length of a directed path in a
marked graph is defined to be the summation of all the times for
transitions in that directed path.
Definition 2.9: Circuit Length. The length of a directed circuit in
a marked graph is defined to be the summation of all the times for
transitions in that directed circuit.
Definition 2.10: Critical Path. The critical path among a group of
paths is the one which has the highest path length.
This definition of critical path is identical to the one used in
task scheduling [14, 15] and project management [16, 17].
To illustrate, let T(i) stand for the time of the ith
transition. In Figure 1.1, let T(1) = 4, T(2) = 1, T(3) = 5, T(4)
= 6, T(SI) = 0, and T(SO) = 0. Then, the directed circuit t2,
P6, t4, P5, and t2 has length 7. The directed path used to illustrate
Definition 2.4 has length 10. The directed path SI, P1, t1, P2, t2,
P6, and t4 has length 11. These two directed paths form a group of
paths. In that group of paths, the directed path from SI to t4 is
the critical path. It is to be noted that there can be more than one
critical path in a group of paths.
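The lengths quoted in this illustration can be checked with a short Python sketch. The transition times are the ones assumed in the text; the function names and the list-of-transitions representation of a path (places omitted, since only transitions carry time) are assumptions made here for illustration.

```python
# Transition times from the illustration; source and sink take zero time.
T = {"SI": 0, "t1": 4, "t2": 1, "t3": 5, "t4": 6, "SO": 0}


def path_length(transitions, times=T):
    """Definitions 2.8 and 2.9: sum of the times of the transitions on the
    path or circuit (a circuit lists its closing transition only once)."""
    return sum(times[t] for t in transitions)


def critical_paths(group, times=T):
    """Definition 2.10: the path(s) of greatest length in a group of paths."""
    longest = max(path_length(p, times) for p in group)
    return [p for p in group if path_length(p, times) == longest]


path_to_SO = ["SI", "t1", "t2", "t3", "SO"]  # path used for Definition 2.4
path_to_t4 = ["SI", "t1", "t2", "t4"]
circuit = ["t2", "t4"]                       # circuit t2, P6, t4, P5, t2

lengths = (path_length(circuit), path_length(path_to_SO), path_length(path_to_t4))
critical = critical_paths([path_to_SO, path_to_t4])
```

Because `critical_paths` returns a list, it also covers the case noted above in which a group has more than one critical path.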
Property 2.1. The critical path length of a group of paths is the
lowest possible time to move tokens from the input of the beginning
transition to the output of the end transition on all directed paths
of that group.
This is a property of the critical path known from critical-path
scheduling [14] and project management [17]. In the context of a
marked graph, as the token has to move through all the transitions of
the directed path in order to reach the output of the end transition
from the input of the beginning transition, the minimum time required
is the length of the directed path. Considering all the directed
paths of the group, the lowest possible time to move tokens on all
directed paths from the input of the beginning transition to the
output of the end transition is the critical path length.
Property 2.2. With unlimited resources, tokens always take time equal
to critical path length to complete the move from the input of the
beginning transition to the output of the end transition on all
directed paths of the group.
This is another property of the critical path known from task
scheduling [14] and project management [17]. In the context of the
marked graph, with unlimited resources, a transition can always be
fired as soon as it is enabled by input data. Therefore, the lowest
possible time can actually be achieved. Hence, the critical path
length is the time to move all tokens from the input of the beginning
transition to the output of the end transition.
Directed circuits are created in the computational marked graph
in four different ways. They are node, process, recursion and
parallel path circuits. Formal definitions of each kind of directed
circuit are presented below along with examples.
Definition 2.11: Node Circuit. This is a directed circuit in the
CMG which is the only internal directed circuit of an NMG.
To illustrate, the sequence tr, PDR, tp, PPC, tw, PPR, and tr is
a node circuit in the ATAMM node marked graph model of Figure 1.2.
One such node circuit in the CMG of Figure 1.3 is shown in Figure
2.3(a). This is the node circuit of transition 1 in the AMG of Figure
1.1. Node circuits always have one token, as described in the
Appendix.
[Figure 2.3. Example of node and process circuits. (a) Node circuit in the NMG of transition 1. (b) Process circuit between transition 2 and transition 3.]
Definition 2.12: Process Circuit. This is a directed circuit in the
CMG which is formed each time an NMG or source is linked to another
NMG or sink. The backward directed place from successor read or sink
transition to predecessor read or source transition, along with
forward directed places from predecessor to successor create the
process circuit.
A process circuit of Figure 1.3 is shown in Figure 2.3(b). This
process circuit is formed when node marked graphs of transition 2 and
3 are linked. Process circuits always have one token as described in
the Appendix.
Definition 2.13: Parallel Path Circuit. This is a directed circuit
in the CMG which is created by any two parallel paths in the AMG. The
circuit is formed by the forward directed places through the NMGs of
one directed path and backward directed places from the successor read
to the predecessor read transition from the NMGs of the other
directed path.
To illustrate, the CMG of Figure 2.2 is shown in Figure 2.4. The
parallel paths of the AMG form parallel path circuits in the CMG. One
such parallel path circuit is shown in Figure 2.5(a). This circuit is
created by two parallel paths in Figure 2.2 between transition 1
and transition 5.
Definition 2.14: Recursion Circuit. This is a circuit in the
CMG which is created due to a directed circuit in the algorithm marked
graph.
To illustrate, the recursion circuit of Figure 1.3 is shown in
Figure 2.5(b). The directed circuit t2, P6, t4, P5, and t2 in Figure
1.1 translates itself into a recursion circuit in the CMG of
Figure 1.3. Directed circuits are created in the AMG mainly due to a
recursion in computation and hence the corresponding circuits in the
CMG are called recursion circuits.
[Figure 2.4. Computational marked graph for the algorithm marked graph of Figure 2.2.]
[Figure 2.5. Examples of (a) a parallel path circuit and (b) a recursion circuit.]
2.3 Graph Theoretic Performance Bounds
The process of algorithm decomposition imposes bounds on the
amount of parallel concurrency and pipeline concurrency possible in a
given problem. If sufficient computing resources are available,
operation at these bounds can be achieved. In this section, graph
theoretic lower bounds on three performance measures are established
for decomposed algorithms to be operated in ATAMM data flow
architectures. These lower bounds are only a function of the
algorithm marked graph and the node marked graph. Therefore,
performance cannot be improved beyond these bounds by increasing the
number of resources. The remainder of this section is devoted to
developing lower bounds for these performance measures.
Let G denote an algorithm marked graph representing a decomposed
algorithm. The lower bound for TBIO is the shortest time required for
a data token from the data input source to propagate through the graph
to the data output sink. Similarly the lower bound for TT is the
shortest time required to complete all computing activity initiated by
the injection of a data token from the input source. These shortest times
are the actual performance times when only a single data set is
present in the graph during any time interval (no pipeline
concurrency), and as many computing resources as are required are
available (maximum parallel concurrency). Under these operating
conditions, lower bounds for TBIO and TT are calculated by identifying
certain longest paths in a graph obtained from the algorithm marked
graph. This new graph, called the modified algorithm marked graph
GM, is defined and then used to determine lower bounds for TBIO and
TT.
Definition 2.15: Modified Algorithm Marked Graph. Let Pi be a
place of G, directed from transition tr to transition ts, which
contains a token of the initial marking. The modified algorithm
marked graph GM is obtained from the graph G by the following
construction rules:
1) Place Pi is deleted from G.
2) A new place, Pi1, directed from the data input
source to transition ts, is added to G.
3) A new output sink Si, different from all other
output sinks, and a new place Pi2, directed from
transition tr to Si, are both added to G.
4) The above rules are repeated for each place of G
containing a token of the initial marking.
Example: The recursion problem of Figure 1.1 is used to generate a
modified algorithm marked graph as shown in Figure 2.6. Only place 5
from transition 4 to 2 has an initial token in the algorithm marked
graph of Figure 1.1. According to rule 1, place 5 is deleted. A new
place 5-1 is inserted from data input source to transition 2 by rule
2. Rule 3 is then used to generate a new output sink (S5) and a new
place 5-2 as shown in Figure 2.6. As there are no more places with
initial tokens, this completes the procedure to generate a modified
algorithm marked graph.
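The construction rules of Definition 2.15 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the tuple representation of a place as (name, from, to, initial tokens), the function name, and the generated names "P5-1", "P5-2", and "S5" are choices made here. The place list for Figure 1.1 is inferred from the paths and circuits quoted in the text.

```python
def modify_amg(places, source="SI"):
    """Apply rules 1 to 4 of Definition 2.15 to a list of places, each given
    as (name, from_transition, to_transition, initial_tokens)."""
    modified = []
    for name, src, dst, tokens in places:
        if tokens == 0:
            modified.append((name, src, dst, 0))  # unmarked places are kept
        else:
            # Rule 1: the marked place itself is dropped.
            # Rule 2: a new place from the data input source to dst.
            modified.append((f"{name}-1", source, dst, 0))
            # Rule 3: a new dedicated sink, fed by a new place from src.
            modified.append((f"{name}-2", src, f"S{name[1:]}", 0))
    return modified


# Places of Figure 1.1 as inferred from the text; only P5 is initially marked.
AMG = [
    ("P1", "SI", "t1", 0),
    ("P2", "t1", "t2", 0),
    ("P3", "t2", "t3", 0),
    ("P4", "t3", "SO", 0),
    ("P5", "t4", "t2", 1),
    ("P6", "t2", "t4", 0),
]

GM = modify_amg(AMG)
```

For this input the only change is that P5 is replaced by a place from SI to t2 and a place from t4 to the new sink S5, matching the example above.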
[Figure 2.6. Modified algorithm marked graph for Figure 1.1.]
Theorem 2.1: Graph Theoretic Lower Bound for TBIO. Let Pi be the
ith directed path in GM from the data input source to the data
output sink, and let T(Pi) denote the sum of transition times for
transitions contained in Pi. Then,
TBIOLB = Max {T(Pi)},
where the maximum is taken over all paths Pi between the data input
source and the data output sink in graph GM.
Proof. T(Pi) is the length of path Pi; therefore, Max {T(Pi)}
is the length of the critical path from the data input source to the
data output sink. From the properties of the critical path [14, 17],
TBIOLB = Max {T(Pi)}. This completes the proof.
Theorem 2.2: Lower Bound for TT. Let Pi be the ith directed path
in GM from the data input source to any output sink, and let T(Pi)
denote the sum of transition times of transitions contained in Pi.
Then,
TTLB = Max {T(Pi)},
where the maximum is taken over all paths Pi in graph GM.
Proof. By the construction rules for graph GM, a task is initiated
with an input from the data input source, and is completed when all
output sinks have accepted tokens. Therefore, TT is the time which
elapses from injection of input tokens to the arrival of a token at
the last fired output sink. Let T(Pj) = Max {T(Pi)}, among all
Pi in GM. Pj is the longest path among all paths from the
data input source SI to any output sink. Therefore, Pj is the
critical path among all paths from the data input source to any output
sink. Hence, by the properties of the critical path [14, 17], TTLB
= T(Pj) = Max {T(Pi)}, where the maximum is over all paths Pi in
GM. This completes the proof.
To illustrate the application of Theorem 2.1 and Theorem 2.2,
TBIOLB and TTLB are computed for the algorithm marked graph shown
in Figure 1.1. For this example, the following transition times are
assumed: T(1) = 4, T(2) = 1, T(3) = 5, and T(4) = 6. The modified
algorithm marked graph corresponding to Figure 1.1 is shown in Figure
2.6. The modified algorithm marked graph contains two paths directed
from the data input source SI to the data output sink SO. Path
P1 is the sequence t1, P2, t2, P3, and t3 with T(P1) = 10. Path P2
is the sequence t2, P3, and t3 with T(P2) = 6. Since T(P1) > T(P2),
path P1 determines the lower bound for TBIO and TBIOLB = 10. The
modified algorithm marked graph contains two additional directed paths
from the data input source SI to the output sink S5. Path P3 is the
sequence t1, P2, t2, P6, and t4 with T(P3) = 11. Path P4 is the
sequence t2, P6, and t4 with T(P4) = 7. Since T(P3) is the highest,
path P3 determines the lower bound for TT and TTLB = 11.
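The two bounds in this example can be reproduced by a longest-path search on the modified graph. The sketch below assumes GM is acyclic (every directed circuit of G carries an initial token, so the construction of Definition 2.15 breaks it); the adjacency dictionary and the function name are assumptions made here, with the edges taken from the paths listed above.

```python
from functools import lru_cache

# Transition times from the example; source and sinks take zero time.
T = {"SI": 0, "t1": 4, "t2": 1, "t3": 5, "t4": 6, "SO": 0, "S5": 0}

# Successors in the modified graph GM of Figure 2.6, as read from the text.
SUCC = {
    "SI": ("t1", "t2"),
    "t1": ("t2",),
    "t2": ("t3", "t4"),
    "t3": ("SO",),
    "t4": ("S5",),
    "SO": (),
    "S5": (),
}


@lru_cache(maxsize=None)
def longest_from(node, sink):
    """Largest sum of transition times over directed paths node -> sink,
    or None when the sink cannot be reached from node."""
    if node == sink:
        return T[node]
    tails = [longest_from(nxt, sink) for nxt in SUCC[node]]
    tails = [tail for tail in tails if tail is not None]
    return T[node] + max(tails) if tails else None


# Theorem 2.1: critical path from SI to the data output sink SO.
TBIO_LB = longest_from("SI", "SO")
# Theorem 2.2: critical path from SI to any output sink of GM.
TT_LB = max(longest_from("SI", sink) for sink in ("SO", "S5"))
```

The memoized recursion visits each node once per sink, so it avoids listing every path explicitly, which matters for graphs larger than this four-transition example.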
Next a lower bound for the performance measure TBO may be
determined. Let G be an algorithm marked graph representing a
decomposed algorithm. It is assumed that the operating conditions for
G are set to maximize pipeline concurrency. That is, data tokens are
continuously available at the data input source, and as many computing
resources as needed can be called to perform primitive operations.
The graph G is executed periodically and TBOLB is the shortest time
possible between successive outputs.
Theorem 2.3: Graph Theoretic Lower Bound for TBO. Let GC be a
computational marked graph and let Ci be the ith directed circuit
in GC. The notation T(Ci) denotes the sum of transition times of
transitions contained in Ci, and M(Ci) denotes the number of
tokens contained in Ci. Then,
TBOLB = Max {T(Ci) / M(Ci)},
where the maximum is taken over all directed circuits in GC. The
circuits which determine TBOLB will be called critical circuits of
the CMG.
Proof. Without loss of generality, let tf be the output transition
in GC so that an output is produced each time tf completes
firing. TBOLB is then the minimum firing period of transition
tf. By the consistency property described in the Appendix, GC is
consistent so that all transitions of GC fire periodically with minimum
period TBOLB. It is shown in [18] (pp. 58-60) that the minimum firing
period of each transition of a marked graph is given by Max
{T(Ci) / M(Ci)}, where the maximum is taken over all directed
circuits Ci in GC. Therefore, the theorem follows.
The computational marked graph shown in Figure 1.3 is used to
illustrate Theorem 2.3. The CMG contains many directed circuits.
However, the recursion circuit which contains all NMG nodes of
transitions 2 and 4 has only one token and maximizes the ratio
T(Ci) / M(Ci). Therefore, the shortest time possible between
successive outputs in this graph is TBOLB = 7.
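The ratio test of Theorem 2.3 is easy to sketch once the circuits are known. The following Python fragment is illustrative only: the circuit list is a partial, assumed enumeration (the four node circuits and the recursion circuit), the read and write transitions are taken to contribute zero time, and the names are hypothetical. A full evaluation would have to enumerate every directed circuit of the CMG, including the process circuits.

```python
from fractions import Fraction


def tbo_lower_bound(circuits):
    """Theorem 2.3: maximum of T(Ci) / M(Ci) over the supplied circuits.
    Each circuit is given as (name, total_transition_time, token_count)."""
    ratios = {name: Fraction(time, tokens) for name, time, tokens in circuits}
    bound = max(ratios.values())
    critical = [name for name, ratio in ratios.items() if ratio == bound]
    return bound, critical


# Assumed partial circuit list for Figure 1.3, with T(1) = 4, T(2) = 1,
# T(3) = 5, T(4) = 6 and one token on each circuit.
CIRCUITS = [
    ("node circuit t1", 4, 1),
    ("node circuit t2", 1, 1),
    ("node circuit t3", 5, 1),
    ("node circuit t4", 6, 1),
    ("recursion circuit t2-t4", 1 + 6, 1),
]

TBO_LB, CRITICAL = tbo_lower_bound(CIRCUITS)
```

Using `Fraction` keeps the ratios exact, so ties between critical circuits are detected reliably when token counts differ from one.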
2.4 Resource Requirements
The performance bounds of the last section assume availability of
a resource for each transition to fire when enabled. Therefore, graph
theoretic performance bounds are absolute bounds provided sufficient
resources are available to meet the firing requirements. However, for
insufficient resources, performance cannot reach the graph-theoretic
bounds. The number of resources (R) of an ATAMM data flow
architecture imposes bounds on performance of an algorithm marked
graph. In this section, characteristics of resource usage, maximum
resource requirement, and resource imposed performance bounds are
investigated. Formal definitions of computation, graph execution, and
resource requirements are stated. Definitions and results are
illustrated with examples.
Definition 2.16: TC. Total Computation (TC) is the sum of all
transition times of an algorithm marked graph.
Definition 2.17: TFC. Total Forward Computation (TFC) is the sum of
all transition times that appear in the forward paths from the data
input source to the data output sink of the modified algorithm marked
graph.
Definition 2.18: TBC. Total Backward Computation (TBC) is the sum of
all transition times that do not appear in the forward paths from the
data input source to the data output sink of the modified algorithm
marked graph.
Lemma 2.1. TC is the sum of TFC and TBC of an algorithm marked graph.
Proof. With the notation of Definitions 2.16, 2.17, and 2.18,
transitions which constitute TFC and TBC are mutually exclusive and
collectively exhaustive of all transitions of the algorithm marked
graph. Hence, the sum of all transition times of the algorithm marked
graph equals the sum of transition times for both transitions on the
forward paths and not on the forward paths from the data input source
to the data output sink of the modified algorithm marked graph.
Therefore, TC equals the sum of TFC and TBC. This completes the
proof.
Definition 2.19: Computer Time. A unit of Computer Time is defined
to indicate one functional unit available over one unit of time.
To illustrate, if two functional units are used for three units
of time, six units of computer time are used.
Definition 2.20: Computing Capacity (T). Computing Capacity (CC) is
the total available units of computer time over an interval of time T.
To illustrate, for a time interval of T, the computing capacity
of an ATAMM data flow architecture with R functional units is given by
R * T. Thus CC (T) = R * T.
Definition 2.21: Computing Effort (T). Computing Effort (CE) is the
total used units of computer time over an interval of time T.
To illustrate, for a time interval of T and R functional units,
let Ti be the number of time units the ith functional unit is
used. Then Ti * 1 = Ti units of computer time is the computing
effort due to the ith resource in interval T. Thus the computing
effort due to R resources is given by
CE(T) = \sum_{i=1}^{R} T_i
units of computer time.
Lemma 2.2. For any number of functional units and any interval of
time, computing effort is always less than, or equal to, computing
capacity.
Proof. With the notation of Definitions 2.20 and 2.21,
CC(T) = R * T
CE(T) = \sum_{i=1}^{R} T_i,
where Ti is the number of time units the ith functional unit was
used in time interval T. So Ti cannot be more than T [15], and summing
over the R functional units gives CE(T) <= R * T. Hence,
CE(T) <= CC(T). This completes the proof.
Definition 2.22: Resource Utilization (T). The Resource Utilization
(RU) of functional units over a time interval T is given by the ratio
of computing effort to computing capacity over that time interval.
Thus,
RU (T) = CE (T) / CC (T).
Lemma 2.3. Resource Utilization (RU) over a time interval T is always
greater than, or equal to, zero but less than, or equal to, 1.
Proof. By definition, resource utilization is a ratio of computing
effort to capacity. With the notation of Definitions 2.20 and 2.21,
Ti >= 0 and T > 0. So CE(T) >= 0. CC(T) = R * T > 0 as the ATAMM data
flow architectures must have at least one functional unit. So RU(T) >=
0. Also, as CE(T) <= CC(T), RU(T) <= 1. This completes the proof.
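Definitions 2.20 to 2.22 and the bounds of Lemmas 2.2 and 2.3 can be illustrated numerically. The sketch below is not part of the report: the function names are hypothetical, and the busy-time list (one entry Ti per functional unit) is an assumed input format.

```python
def computing_capacity(R, T):
    """Definition 2.20: CC(T) = R * T units of computer time."""
    return R * T


def computing_effort(busy_times):
    """Definition 2.21: CE(T) is the sum of the busy times Ti."""
    return sum(busy_times)


def resource_utilization(busy_times, T):
    """Definition 2.22: RU(T) = CE(T) / CC(T), after checking the
    conditions used in the proofs (R >= 1, T > 0, 0 <= Ti <= T)."""
    R = len(busy_times)
    if R < 1 or T <= 0:
        raise ValueError("need at least one functional unit and T > 0")
    if any(not 0 <= Ti <= T for Ti in busy_times):
        raise ValueError("each busy time Ti must satisfy 0 <= Ti <= T")
    return computing_effort(busy_times) / computing_capacity(R, T)


# Two functional units used for three units of time: six units of computer time.
effort_full = computing_effort([3, 3])
ru_full = resource_utilization([3, 3], T=3)

# Hypothetical partially loaded case: three units over an interval of four.
ru_partial = resource_utilization([3, 1, 2], T=4)
```

The input checks make the premise of Lemma 2.2 explicit: once every Ti is confined to the interval, the utilization cannot leave the range from 0 to 1.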
Definition 2.23: Total Computing Effort (TCE). TCE is defined to be
the computing effort required to execute once all transitions of an
algorithm marked graph.
Lemma 2.4. TCE equals TC units of computer time.
Proof. With the notation of Definitions 2.16, 2.21, and 2.23,
TCE = CE(T) = \sum_{i=1}^{R} T_i = TC
units of computer time, as the total computation to execute all
transitions of the AMG once is TC. This completes the proof.
Definition 2.24: Total Forward Computing Effort (TFCE). TFCE is
defined to be the computing effort required to execute once all
transitions on forward paths from the data input source to the data
output sink of the modified algorithm marked graph.
Lemma 2.5. TFCE equals TFC units of computer time.
Proof. The proof is similar to that of Lemma 2.4.
With the above definitions and lemmas regarding computation of a
task, it is now intended to establish resource imposed bounds on the
computing time of a task. The following two theorems state the
minimum possible value of TT and TBIO for an ATAMM data flow
architecture of R resources.
Theorem 2.4: Minimum TT for R Resources. The minimum value of TT for
an algorithm marked graph operated with R resources is always greater
than, or equal to, TCE / R.
Proof. TT is the computing time to complete all computation
associated with a task input. For a time interval of TT, the
computing capacity of R resources is R * TT. The total computation
for any task input is the execution of all transitions of the
algorithm marked graph once and hence, equals TC. The corresponding
computing effort is TCE. By Lemma 2.2, R * TT >= TCE, or TT >= TCE / R
[19]. This completes the proof.
Theorem 2.5: Minimum TBIO for R Resources. The minimum value of TBIO
for an algorithm marked graph operated with R resources is always
greater than, or equal to, TFCE / R.
Proof. TBIO is the computing time to generate data output for a
task. For a time interval of TBIO, the computing capacity of R
resources is given by R * TBIO. In order to generate data output, all
transitions on all the forward paths from the data input source to the
data output sink in the modified algorithm marked graph must be
executed once. The computation involved is TFC and the corresponding
computing effort is TFCE. By Lemma 2.2, R * TBIO >= TFCE [19], or
TBIO >= TFCE / R. This completes the proof.
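As a worked illustration of Theorems 2.4 and 2.5, the sketch below evaluates the resource imposed bounds for the graph of Figure 1.1 with the transition times used earlier in this chapter. The choice of this graph, the set of forward transitions (t1, t2, t3, read from the paths between SI and SO in the modified graph), and the function name are assumptions of this sketch.

```python
from fractions import Fraction

# Transition times assumed earlier for Figure 1.1.
T = {"t1": 4, "t2": 1, "t3": 5, "t4": 6}
# Transitions on forward paths from SI to SO in the modified graph.
FORWARD = {"t1", "t2", "t3"}

TC = sum(T.values())                                 # Definition 2.16
TFC = sum(T[t] for t in FORWARD)                     # Definition 2.17
TBC = sum(T[t] for t in T if t not in FORWARD)       # Definition 2.18


def resource_bounds(R):
    """Return (TCE / R, TFCE / R): the lower bounds of Theorems 2.4 and 2.5
    for R resources, using TCE = TC and TFCE = TFC (Lemmas 2.4 and 2.5)."""
    if R < 1:
        raise ValueError("at least one resource is required")
    return Fraction(TC, R), Fraction(TFC, R)


tt_min_1, tbio_min_1 = resource_bounds(1)
tt_min_2, tbio_min_2 = resource_bounds(2)
```

With a single resource the bound on TT is 16, which exceeds the graph theoretic bound TTLB = 11 found earlier, so the resource limit dominates. With two resources the bound drops to 8 and the graph theoretic bound is the tighter one.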
Two graph execution features (SGP and TGP) and two hardware usage
measures (SRE and TRE) are now defined for predicting resource
requirements. SGP describes the execution of transitions of the
algorithm marked graph for a single data packet. SRE is the
description of the resource usage to process one data packet. TGP and
TRE are the graph execution description and resource usage envelope
when the algorithm marked graph is executed repeatedly and
periodically.
Definition 2.25: SGP. SGP (single graph play) is a drawing depicting
beginning, duration, and end of execution for each transition of the
task when operated for a single data packet.
Definition 2.26: TGP. TGP (total graph play) is a drawing depicting
beginning, duration, and end of execution for each transition of each
algorithm input at steady state when the AMG is executed periodically
with an input data injection interval of TBO.
Definition 2.27: SRE. SRE (single resource envelope) is an envelope
of resource usage by a single data packet between the time of
algorithm input and the completion of all computation associated with
that algorithm input.
Definition 2.28: TRE. TRE (total resource envelope) is an envelope
of resource usage to execute the graph at steady state with input
period TBO.
Definition 2.29: Construction of SGP and SRE. SGP and SRE are
generated by firing every transition in the algorithm marked graph at
the earliest possible moment assuming unlimited resources and a single
task input. Graph play is generated by depicting execution of all
transitions in every time interval. Symbols (<, >) are used to show
the beginning and the end of execution for a transition respectively.
The resource usage envelope is obtained by counting the number of
computing resources used during each time interval.
Example. Consider the algorithm marked graph of Figure 2.7.
Transitions 1, 2, and 4 have duration of one time unit. Transitions
3, 5, and 6 have duration of two time units. The graph is played
according to Definition 2.29 and the SGP is shown in Figure 2.8(a).
The need for resources is the same as the number of active transitions
in each time interval. The SRE is computed by counting the number of
resources used in each time interval and is shown in Figure 2.8 (b).
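The construction of Definition 2.29 can be sketched as an earliest-start schedule followed by a count of active transitions per time unit. The durations below are the ones stated in the example, but the precedence relation PREDS is hypothetical: the structure of Figure 2.7 is not recoverable here, so the resulting numbers illustrate the method and are not a reproduction of Figure 2.8. Integer transition times and an acyclic precedence relation for a single task are also assumed.

```python
# Durations from the example; PREDS is a hypothetical precedence relation.
DURATION = {1: 1, 2: 1, 3: 2, 4: 1, 5: 2, 6: 2}
PREDS = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [4, 5]}


def single_graph_play(duration, preds):
    """SGP: (begin, end) of every transition when each one fires at the
    earliest possible moment with unlimited resources and a single input."""
    start = {}

    def begin(t):
        if t not in start:
            start[t] = max((begin(p) + duration[p] for p in preds[t]), default=0)
        return start[t]

    return {t: (begin(t), begin(t) + duration[t]) for t in duration}


def single_resource_envelope(sgp):
    """SRE: number of transitions executing in each unit interval [k, k+1)."""
    horizon = max(end for _, end in sgp.values())
    return [sum(1 for b, e in sgp.values() if b <= k < e) for k in range(horizon)]


SGP = single_graph_play(DURATION, PREDS)
SRE = single_resource_envelope(SGP)
```

The length of the SRE list is the task time of this hypothetical graph, and its sum equals the total computation TC, since every busy time unit is counted exactly once.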
[Figure 2.7. Algorithm marked graph for illustration of SGP and SRE.]
[Figure 2.8. (a) SGP. (b) SRE. Both panels are drawn against time and divided into sections (data packets) numbered 0, 1, and 2.]
Now suppose the algorithm is executed periodically. Assume that
the input data injection interval is long enough so that every data
packet executes the graph as the SGP and needs resources over the task
time as given by the SRE. As a result, the algorithm is executed with
an input period equal to output period TBO. The total resource
envelope (TRE) is to be determined then by adding the resource needs
of the concurrently p_sed data packets. The total graph play
(TGP) is generated by drawing the execution of transitions from all
the concurrently processed data packets. It is shown in the following
two theorems that TRE and TGP are periodic with period TBO. If SRE
and SGP are divided from the beginning in sections of TBO time units,
these sections are shown to be the contributions from the consecutive
concurrent data packets towards a period of TRE and TGP. As an
example, the SGP and SRE of Figure 2.8 are divided in sections of TBO = 2
time units. Section as well as data packet numbers are represented by
the integer variable b. To illustrate, data packet 2 has been
injected two time units before data packet 1. Moreover, transitions 3
and 2 for data packet 0, transitions 5 and 4 for data packet 1 and
transition 6 for data packet 2 are executed concurrently at steady
state requiring a total of five resources. This will be later
illustrated in detail after Theorems 2.6 and 2.7 are developed.
Theorem 2.6. When the algorithm marked graph is operated periodically
for input period TBO with all data packets requiring resource
envelopes identical to SRE, the total resource envelope at steady
state is periodic with period TBO and one period of TRE is generated
by the summation of sections of SRE of width TBO as follows.
Let SRE(x) represent the resource envelope for a single task
input, where SRE(x) = 0 for x ≥ TT. Let the origin of the time axis (t)
at steady state be the injection of a data packet. Let TRE(t) be the
value of the total resource requirement at time t. Let b represent the
concurrently processed data packets at time t. A period of TRE(t) is
then given by

$$\mathrm{TRE}(t) = \sum_{b} \mathrm{SRE}(t + b \cdot \mathrm{TBO}),$$

where $0 \le t < \mathrm{TBO}$ and $0 \le b < \lceil \mathrm{TT}/\mathrm{TBO} \rceil$.
Proof. By the rules of operation, data packets are injected and
outputs are generated at the interval of TBO at steady state.
Consider three consecutive data packets P, Q, and R injected at
t = K * TBO, (K+1) * TBO, and (K+2) * TBO respectively, where K is a
positive integer. Let d be a time unit in which the total resource
requirement is desired, and let s denote the time between d and the time of
the previous data packet injection. Suppose d is a time between the
injection of data packets P and Q. Thus K * TBO ≤ t < (K+1) * TBO,
and s = t - (K * TBO). TRE(t) in this interval is made of SREs due
to data packet P and previous data packets whose computations are
completed after P has started. As all data packets have resource
envelopes identical to SRE of duration TT, any data packet which is
injected TT or more time units before P has no effect on TRE in this
interval. Consequently, the total number of concurrently processed
data sets creating TRE(t) in this interval is given by ⌈TT / TBO⌉.
Hence, let the range of b be 0 ≤ b < ⌈TT / TBO⌉, where b is an
integer. TRE(t) for the time interval between P and Q is then the
summation of the resource requirements for these concurrently
processed data packets. Let b = 0 identify task input P, whose
contribution to TRE(t) is SRE(s). The data packet which has started
TBO time units before P will contribute SRE(s + TBO) and is
identified by b = 1. In general, a data packet which is injected
b * TBO time units before P is identified by the data packet number b
and contributes SRE(s + b * TBO) to TRE(t). Therefore, summing
SRE(s + b * TBO) over the entire range of b for the concurrently
processed data packets will give the corresponding TRE(t). The data
packet corresponding to the largest b may contribute to TRE(t) for
only a partial interval. As SRE(x) = 0 for x ≥ TT,
SRE(s + b * TBO) properly represents the contribution due to the data
packet corresponding to the largest b. Therefore, TRE(t) at d
between P and Q is given by the following equation,
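
$$\mathrm{TRE}(t) = \sum_{b} \mathrm{SRE}(s + b \cdot \mathrm{TBO}) = \sum_{b} \mathrm{SRE}(t - K \cdot \mathrm{TBO} + b \cdot \mathrm{TBO}) \qquad (2.4.1)$$

where $K \cdot \mathrm{TBO} \le t < (K+1) \cdot \mathrm{TBO}$ and $0 \le b < \lceil \mathrm{TT}/\mathrm{TBO} \rceil$.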
Now let d be a time unit t + TBO from the origin. As d now is a time
unit between data packet injections Q and R, s = (t+TBO) - (K+1)*TBO.
By similar arguments as before,
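
$$\begin{aligned}
\mathrm{TRE}(t + \mathrm{TBO}) &= \sum_{b} \mathrm{SRE}(s + b \cdot \mathrm{TBO}) \\
&= \sum_{b} \mathrm{SRE}((t + \mathrm{TBO}) - (K+1) \cdot \mathrm{TBO} + b \cdot \mathrm{TBO}) \\
&= \sum_{b} \mathrm{SRE}(t - K \cdot \mathrm{TBO} + b \cdot \mathrm{TBO}) \\
&= \mathrm{TRE}(t),
\end{aligned}$$

from equation (2.4.1). Thus, TRE(t) is periodic with period TBO.
Hence, it is sufficient to specify TRE(t) for one period only; let s
= t, or K = 0. Modifying equation (2.4.1) we get,

$$\mathrm{TRE}(t) = \sum_{b} \mathrm{SRE}(t + b \cdot \mathrm{TBO}),$$

where $0 \le t < \mathrm{TBO}$ and $0 \le b < \lceil \mathrm{TT}/\mathrm{TBO} \rceil$.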
Thus, one period of TRE(t) is generated by the summation of the
sections of SRE (x) of width TBO, starting from x = 0. The sections
are identified by the corresponding value of b. This completes the
proof.
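The summation of Theorem 2.6 can be checked numerically with a short Python sketch. The function name `total_resource_envelope` and the list representation of SRE (entry k is the resource need in the interval (k, k+1)) are assumptions made for illustration. The two envelopes used are written out from the per-section resource usage that the text reports for Figures 2.8 and 2.10.

```python
import math


def total_resource_envelope(sre, tbo):
    """One period of TRE: TRE(t) = sum over b of SRE(t + b * TBO).

    sre is a list of integer resource needs per unit time interval;
    SRE(x) is taken as 0 for x >= TT = len(sre).
    """
    sections = math.ceil(len(sre) / tbo)  # range of b: 0 <= b < ceil(TT / TBO)
    tre = [0] * tbo
    for b in range(sections):
        for t in range(tbo):
            x = t + b * tbo
            if x < len(sre):  # beyond TT the envelope contributes nothing
                tre[t] += sre[x]
    return tre


# SRE of Figure 2.8 (sections (1,2), (1,2), (1,1), (1,0)) with TBO = 2
print(total_resource_envelope([1, 2, 1, 2, 1, 1, 1], 2))
# SRE of Figure 2.10(a) (sections (1,1,2), (4,3,3), (1,0,0)) with TBO = 3
print(total_resource_envelope([1, 1, 2, 4, 3, 3, 1], 3))
```

Because every entry of SRE lands in exactly one slot of the period, the entries of one period of TRE always sum to the same total as SRE.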
Theorem 2.7. When the algorithm marked graph is operated periodically
for input period TBO with all data packets executing the AMG as SGP,
total graph play at steady state is periodic with period TBO and one
period of TGP is generated by the overlapping of sections of SGP of
width TBO as follows.
Let SGP(x) represent the graph play for a single task input
where 0 ≤ x < TT. Let the origin of the time axis (t) at steady state be
the injection of a data packet. Let TGP(t) be the total graph play
at time t. Let b represent the concurrently processed data packets at
time t. A period of TGP(t) is then given by

$$\mathrm{TGP}(t) = \sum_{b} \mathrm{SGP}(t + b \cdot \mathrm{TBO}),$$

where $0 \le t < \mathrm{TBO}$ and $0 \le b < \lceil \mathrm{TT}/\mathrm{TBO} \rceil$.
Proof. The proof is similar to Theorem 2.6 with one exception.
Unlike SRE, sections of SGP of width TBO represent portions of graph
play for successive data packets which overlap to form TGP at steady
state. Hence, instead of adding sections of SGP, one period of TGP
should be constructed by overlapping sections of SGP with each section
being identified separately by the value of b. If two values of b are
i and i+1, it means data packet i+1 is injected TBO time units before
data packet i. This completes the proof.
Example. One period of TGP and TRE is constructed for the AMG of
Figure 2.7 according to Theorems 2.6 and 2.7 with an input period TBO
of two time units. SGP and SRE of Figure 2.8 are divided in sections
of width two time units as shown in Figure 2.8 by the dotted
lines. Figure 2.9 shows the TGP and TRE for an input period TBO of 2.
Time t is any time when a new data packet is injected at steady
state. In the TGP, the superscript of a transition indicates the value
of b (data packet number). Data packet 1 is injected TBO time units
before data packet 0. 1^(0) and 5^(1) represent the execution of
transitions 1 and 5 for data packets 0 and 1 respectively in Figure
2.9(a). The TGP indicates that 5^(1) begins after the completion of
1^(0). As in SGP, the arrow symbols (<, >) indicate the beginning and
end of execution of a transition respectively. In Figure 2.9(a),
transitions 3^(0), 5^(1), and 6^(2) have started in this period but
did not end. Similarly, 3^(1), 5^(2), and 6^(3) have been completed
in this period but did not start in it. The resource usage in the
four sections of SRE in order of increasing b is (1, 2), (1, 2),
(1, 1), and (1, 0). One period of TRE is calculated by adding the
four sections of SRE. The total resource need in one period of TRE is
(4, 5) as shown in Figure 2.9(b). It is to be noted that TRE could
also have been calculated from TGP by counting the number of active
transitions in each time interval.
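The overlapping of SGP sections in this example can be reproduced with a small Python sketch. As an assumption, the start times in `SCHEDULE` are inferred from the description of Figure 2.9 above (which transitions start or finish within the period) rather than taken from the figure, and `total_graph_play` is a hypothetical name.

```python
# Hypothetical sketch: fold a single graph play into one period of the TGP.
# Each entry is labeled (transition, b), where b is the data packet number.

def total_graph_play(schedule, tbo):
    """schedule maps transition -> (start, duration) in integer time units.

    Returns a list with one entry per unit interval of the period; each
    entry is the sorted list of (transition, b) pairs executing in it.
    """
    period = [[] for _ in range(tbo)]
    for transition, (start, dur) in schedule.items():
        for x in range(start, start + dur):
            b, t = divmod(x, tbo)  # section number b, offset t within the period
            period[t].append((transition, b))
    return [sorted(slot) for slot in period]


# Durations from the example of Figure 2.7; start times inferred (assumption).
SCHEDULE = {1: (0, 1), 2: (1, 1), 3: (1, 2), 4: (3, 1), 5: (3, 2), 6: (5, 2)}
TGP = total_graph_play(SCHEDULE, 2)
for offset, active in enumerate(TGP):
    print(offset, active)
```

Counting the entries in each interval of the result gives one period of TRE, which is the alternative calculation mentioned at the end of the example.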
Lemma 2.6. Computing effort in one period of TRE is TCE at steady
state when the algorithm marked graph is operated periodically with an
input period of TBO.
Proof. As the algorithm marked graph is operated periodically,
computing effort in every period is the same. Computing effort in a
period TBO of TRE will equal TCE as one task output is generated in
every TBO time units. This completes the proof.
Lemma 2.7. Resource Utilization (RU) in one period (TBO) of TRE is
given by {TCE / (R * TBO)}.
[Figure 2.9. For TBO = 2, (a) Total graph play. (b) Total resource envelope.]
Proof. By Lemma 2.6, computing effort in one period (TBO) of TRE is
TCE. Computing capacity in the TBO time interval is R * TBO. By
definition then, resource utilization is {TCE / (R * TBO)}. This
completes the proof.
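Lemmas 2.6 and 2.7 translate directly into a few lines of Python. The sketch below is illustrative only: `resource_utilization` is a hypothetical name, and the choice of R = 5 (the peak of the TRE period (4, 5) obtained for Figure 2.9) is an assumption made to have a concrete case.

```python
def resource_utilization(tre_period, resources):
    """RU = TCE / (R * TBO) over one period of TRE.

    tre_period lists the resource need in each unit interval of one
    period, so its length is TBO and its sum is TCE (Lemma 2.6).
    """
    if resources < max(tre_period):
        raise ValueError("R is smaller than the peak of TRE")
    tbo = len(tre_period)
    tce = sum(tre_period)
    return tce / (resources * tbo)


# One period of TRE for Figure 2.9 is (4, 5); assume R equals its peak.
ru = resource_utilization([4, 5], resources=5)
print(ru)
```

With the period (4, 5) the computing effort is 9 and the capacity of five resources over two time units is 10, so the utilization is nine tenths.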
Example. Consider the SRE as shown in Figure 2.10(a) with TT = 7, TC
= 15 (ignore the dotted lines). The peak of SRE is 4, which indicates
that the ATAMM data flow architecture requires at least four
functional units to process the task according to the SRE in seven
time units. Let TBO = 3. Tasks are initiated and outputs are
generated at the interval of three time units with all having
identical SRE at steady state. TRE is calculated from Theorem 2.6.
Dividing SRE from the beginning in sections of width TBO, as in Figure
2.10(a) with the dotted lines, (1, 1, 2), (4, 3, 3), and
(1, 0, 0) are the contributions of three overlapping task inputs to a
period of TRE. Adding the three sections of SRE, a period of TRE is given
by (6, 4, 5) and is shown in Figure 2.10(b). The computing effort in
three time units of TRE is 15 as claimed by Lemma 2.6. Since the peak
of TRE is 6, a minimum of six functional units is required to operate
an algorithm marked graph with the SRE of Figure 2.10(a) and TBO = 3. By
Lemma 2.7, resource utilization (RU) for six functional units is given
by {15 / (6 * 3)} = .833.
With the help of the above lemmas, the resource imposed bound on TBO
is established in the following theorem.
Theorem 2.8: Minimum TBO for R Resources. The minimum value of TBO
for an algorithm marked graph operated periodically with R resources
is always greater than, or equal to, TCE / R.
Proof. By Theorem 2.6, the total resource envelope is periodic. By
Lemma 2.6, the computing effort needed in period TBO is TCE. The
computing capacity for a time interval of TBO is R * TBO. By Lemma 2.2,
R * TBO ≥ TCE. Hence, TBO ≥ TCE / R. This completes the proof.
[Figure 2.10. (a) Single resource envelope. (b) Total resource envelope for TBO = 3.]
Corollary 2.8.1. The minimum value of resource requirements (R) for a
desired TBO is bounded by ⌈TCE / TBO⌉ when the graph is
operating periodically at steady state.
Proof. As TBO ≥ TCE / R, it follows that R ≥ TCE / TBO. Since R is
an integer, R ≥ ⌈TCE / TBO⌉. This completes the proof.
Example. Consider the algorithm marked graph of Figure 1.1 and the
corresponding modified algorithm marked graph of Figure 2.6. Let T(1)
= 4, T(2) = 1, T(3) = 5, and T(4) = 6. The sum of all transition
times is 16. Hence, TC = 16. TFC and TBC are calculated from the
modified algorithm marked graph. Transitions 1, 2, and 3 appear in
the forward paths from S_I to S_O. Therefore, TFC = T(1) + T(2) +
T(3) = 10. As only transition 4 does not appear in any of the forward
paths from data input source to data output sink, TBC = T(4) = 6.
Also, TFC and TBC add up to TC. If only two functional units are
available, the minimum values of TT, TBIO, and TBO are 8, 5, and 8
respectively. For a TBO of 7, the minimum R is ⌈TCE / TBO⌉ = 3.
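The two resource bounds used in this example, Theorem 2.8 and Corollary 2.8.1, can be evaluated with a minimal Python sketch. The function names are hypothetical, and the value TCE = 16 is assumed to equal the sum of the transition times given above.

```python
import math
from fractions import Fraction


def min_tbo(tce, resources):
    """Theorem 2.8: TBO >= TCE / R (returned as an exact fraction)."""
    return Fraction(tce, resources)


def min_resources(tce, tbo):
    """Corollary 2.8.1: R >= ceil(TCE / TBO) for integer TCE and TBO."""
    return math.ceil(Fraction(tce, tbo))


TCE = 4 + 1 + 5 + 6  # T(1) + T(2) + T(3) + T(4), assumed equal to TCE
print(min_tbo(TCE, 2))        # bound on TBO with two functional units
print(min_resources(TCE, 7))  # bound on R for a desired TBO of 7
```

Exact fractions are used instead of floating point so that the ceiling is not affected by rounding when TCE / TBO is close to an integer.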
2.5 Summary
The computing environment and performance measures in the ATAMM
data flow architecture are established. Graph time performance is
expressed by time between input and output (TBIO), task time (TT), and
time between outputs (TBO). The modified algorithm marked graph is
defined to compute lower bounds for TT and TBIO. Lower bounds for the
performance measures are calculated analytically from the modified
algorithm marked graph and the computational marked graph with the
assumption that a functional unit is available for every enabled
transition to fire. The availability of a limited number of
functional units is then considered. The modified algorithm marked
graph is used to distinguish between forward computation (TFC) and
backward computation (TBC) and to establish their relation to total
computation (TC). Computing capacity, computing effort, and resource
utilization are defined. The range of values for performance measures
are established assuming that the ATAMM data flow architecture has
only R functional units. The execution of the algorithm marked graph for a
single task input and for periodically injected data packets is defined in
terms of SGP and TGP respectively. The requirements of functional units to
process a single task input and periodically injected data packets are
expressed by SRE and TRE respectively.
Resource utilization is defined; construction rules for SGP and SRE are
defined; and properties of TRE are described. Methodologies for
generating TRE and TGP are established. All definitions and results
are illustrated with examples.
CHAPTER THREE
ALGORITHM TRANSFORMATION
3.0 Introduction
The lower bounds for performance measures of an algorithm marked
graph are developed in Chapter Two. One of the two remaining
important problems concerning performance measures is considered in
Chapter Three. Of interest is the potential of transforming an
algorithm marked graph, with or without decomposition, in order to
decrease lower bounds for performance. Investigation is also carried
out to use transformations to reduce resource requirements, enforce
periodicity in execution, and provide structural changes in the
algorithm marked graph. All required transformation techniques,
including an investigation of their usefulness and limitations, are
described in this chapter. Algorithm transformation techniques are
defined and elaborated in Section 3.1. Applications of algorithm
transformations for performance improvements and reduction of resource
requirements are discussed in Section 3.2. A steady state periodic
execution of algorithm marked graphs is realized in Section 3.3.
Structural changes of algorithm marked graphs are considered in
Section 3.4. A summary of the chapter is presented in Section 3.5.
3.1 Algorithm Transformation Guidelines
The aim of this section is to define algorithm transformation
techniques and illustrate their significance. Algorithm
transformation is defined to be a process to change some features of
an algorithm marked graph while preserving its equivalence in
computations. In other words, algorithm transformations produce a new
AMG which is equivalent to the original AMG but better in some
respect. The primary objectives are to improve time performance and
lower resource requirements through algorithm transformation.
Therefore, algorithm transformation techniques which can lower
critical path length, lower time per token for the critical circuit of
the CMG, lower resource requirements, and enforce periodicity in the
execution of the AMG are of great interest. A formal definition of
equivalency of two algorithm marked graphs and algorithm
transformation techniques are stated and explained below.
Definition 3.1: Equivalency of Two Algorithm Marked Graphs. Two
algorithm marked graphs are equivalent if they map any set of input
variables into the same set of output variables and produce an
identical output sequence for an input sequence.
Definition 3.1 specifies the allowable transformations. An
algorithm marked graph can be transformed as long as the new AMG is
input-output equivalent with the old one. It is to be noted that if
the computations of transitions and data dependency among the
transitions of the original AMG are not altered, the transformed AMG
will remain input-output equivalent with the original AMG.
Definitions 3.2 through 3.5 describe four transformation techniques
which are based on this observation.
Definition 3.2: Control Place. A control place is any place in the
algorithm marked graph whose deletion generates an equivalent
algorithm marked graph.
A control place is an artificial place in the sense it is not
necessary for the correctness of an algorithm. A control place
imposes a precedence relation between two transitions. The control
place needs to be initialized by an initial token if it creates a
circuit in the algorithm marked graph. The designer inserts a control
place in the algorithm marked graph to delay the firing of a
transition. All places in the AMG other than control places will be
called active places henceforth. If broadcasting is used to transmit
data between transitions, insertions of control places are not going
to change read and write times of transitions. Also, control places
need not transmit data vectors; therefore they can be implemented at
very low communication cost. Thus for analysis purposes, insertion of
control places in an AMG will be assumed not to increase read and
write times of transitions.
Definition 3.3: Dummy Transition. A dummy transition is any
transition in the algorithm marked graph which is not required for
executing a primitive operation.
A dummy transition is a redundant transition in the sense that it
is not required for computation. However, it can be used to control
operation or improve performance. All transitions other than dummy
transitions will be called active transitions henceforth. A dummy
transition can act as a buffer to provide storage for the output of
any transition. Such buffers will be shown to be needed at times when
the algorithm marked graph is operated periodically. A dummy
transition can be used to combine input or output data vectors in
order to create single input or output vectors respectively. Another
application of a dummy transition is as a delay operator for holding
firing of one, or a group of, transitions. Read and write times for
the NMG of a dummy transition depend on implementation and data
length, but should be less than, or equal to, read or write times of an
active transition of equal data length respectively. A dummy
transition has zero process time when it is used as a buffer; it has a
very small process time when it is used for combining data vectors. A
dummy transition as a delay operator has a process time corresponding
to the amount of delay needed. As operations are restricted to large-
grain algorithms, read and write times are expected to be
significantly smaller than the process time of an active transition.
Thus for analysis purposes, a dummy transition will be assumed to have
zero time when it is used as a buffer or for combining data vectors.
Also, it will be assumed that a dummy transition for applications
other than a delay operator does not require a resource because a
resource is required to implement such a dummy transition for a very
short time. A dummy transition for delay application has not been
explored in detail in this dissertation, but poses an interesting
topic for future research.
Definition 3.4: Predefined Token. A predefined token is any initial
token on a place of the algorithm marked graph.
A predefined token indicates the presence of precomputed initial
data or initial control. A predefined token is necessary at times for
execution of the task and for forward flow of data.
Definition 3.5: Decomposition of a Transition. Decomposition of a
transition in the AMG is to replace the transition by an equivalent
marked graph of a group of transitions.
The transition decomposition of Definition 3.5 is to distribute
the computation of a transition among a group of transitions in order
to reduce the original transition time. This is important because
large transition times are major contributors to critical path length
and time per token of critical circuits. It should also be noted that
the decompositions of transitions are not always reasonable or
possible due to added communication cost, higher resource
requirements, and transition characteristics. Serial, or a
combination of serial and parallel, decomposition of a transition
tends to decrease TBOLB significantly while TBIOLB does not
improve much and can even increase due to added serial communication
time. In those cases, a proper decomposition is dependent upon the
relative importance of TBO and TBIO. Pure parallel decomposition of
transitions decreases both TBOLB and TBIOLB.
Subsequent sections of this chapter will develop a theoretical
basis for the applications of control places, dummy transitions,
initial tokens, and decomposition. A software program, called Ttime
[20], will be used for determining lower bounds for TBO, TT, and
TBIO. This program constructs the CMG from the specified AMG to
determine TBOLB. Two examples are presented to illustrate the
transformation of an AMG through the use of control places and dummy
transitions.
Example. Consider the algorithm marked graph of Figure 2.2. The
corresponding CMG is shown in Figure 2.4. A transformed AMG and
corresponding CMG are shown in Figures 3.1 and 3.2 respectively. A
dummy transition of zero time is used as a buffer between transitions
1 and 6. The AMG's of Figures 2.2 and 3.1 are equivalent as they
produce the same output sequence for identical input sequences. The
dummy transition provides an additional storage space for the output
of transition 1, which is to be used as an input of transition 6.
Without this dummy transition, transition 1 can fire only once before
transition 6 fires; however, with the dummy transition, transition 1
can fire again before transition 6 fires. Application of this
transformation will be described later.
An example of transformation by control places is shown in
Figures 3.3 and 3.5. Control places delay firing of selective
transitions and therefore modify SRE and TRE. The dummy transition is
used again as a buffer. Improvement due to this transformation will
be described later.
3.2 Performance Improvements by Transformation
Applications of dummy transitions and control places for
improving time performance and reduction of resource requirements are
discussed in this section. New results are stated in Applications 1
and 2. Application 1 describes how dummy transitions can reduce
TBOLB of an AMG to the largest time/token among the process and
recursion circuits. Application 2 describes how the SRE of an AMG can
be modified to give a lower peak TRE through the use of control
places.
Application 1. This is an application where a dummy transition is
used as a buffer. A dummy transition can provide storage space for
the output of a transition. This can increase the firing rate of
transitions as ATAMM does not allow firing of an active transition
unless its outputs are read by successor transitions. In terms of the
CMG, a dummy transition can increase the number of tokens in the
circuits of a CMG created by parallel paths in the AMG. This is the
basis for Theorem 3.1.
Theorem 3.1: Reduction of TBOLB to the Largest Time Per Token Among
the Process and Recursion Circuits by Dummy Transition. Any AMG can
be transformed by using dummy transitions as buffers so that

$$\mathrm{TBO}_{\mathrm{LB}} = \max_{i} \frac{T(C_i)}{M(C_i)} \qquad (3.2.1)$$

where T(C_i) and M(C_i) denote the sum of transition times and the
number of tokens contained in C_i of the CMG respectively. Circuit
C_i is a process or recursion circuit.
Proof. There are four kinds of circuits in a CMG, as mentioned in
Section 2.2. They are node circuits, process circuits, recursion
circuits, and parallel path circuits. Theorem 2.3 has proved equation
(3.2.1) when C_i is any directed circuit of the CMG. From ATAMM
model characteristics, as described in the Appendix, both node and
process circuits always have only one token. Also, the sum of
transition times for a process circuit is always greater than, or
equal to, that of its corresponding node circuit, as process
circuits include the successor read transition. Consequently, the
largest time/token ratio of process circuits is always greater than,
or equal to, the largest time/token ratio of node circuits. The
remaining task is to show that the time/token ratio for circuits in a
CMG due to parallel paths in the AMG can be reduced sufficiently to
make them insignificant in determining TBOLB. Consider any two
parallel paths Pi and Pj of the AMG which begin and end at
transitions S and E respectively. Consider the parallel path circuit
in the CMG created by forward directed places (for data flow) from NMG
transitions of path Pi and backward directed places (for control
flow) from NMGs of path Pj. Each of these backward directed places
has a token in the initial marking. The number of such backward
directed places is one more than the number of transitions on path
Pj, excluding transitions S and E. Inserting a dummy transition of
zero time on path Pj will increase the number of tokens in this
circuit by one. As this dummy transition does not have any time, it
cannot increase the T(C_i) of this circuit or any other. Hence, the
time/token ratio of this circuit will decrease while not increasing
the time/token ratio of any other circuit. By inserting more dummy
transitions on path Pj, the time/token ratio for this circuit can be
arbitrarily reduced. If the time/token ratio for this circuit is
greater than the largest time/token ratio from process or recursion
circuits, dummy transitions can be used to reduce the time/token ratio
to a value lower than, or equal to, the largest time/token ratio among
process or recursion circuits without increasing the time/token ratio
of any other circuit. Following this procedure, sufficient dummy
transitions may be added so that the time/token ratio for any parallel
path circuit in the CMG is smaller than, or equal to, the largest
time/token ratio among process or recursion circuits. The procedure
is guaranteed to terminate as dummy transitions, when used as buffers,
never increase the time/token ratio of any circuit. This completes
the proof.
Example. Consider again the AMG of Figure 2.2. The corresponding CMG
is drawn in Figure 2.4 assuming zero time for read and write
transitions. Therefore, TBOLB is 3. There is no recursion circuit
in the AMG. The largest time/token ratio among all process circuits
is 2 and the largest time/token ratio among node circuits is 2.
However, the largest time/token ratio among all directed circuits is 3
due to two parallel path circuits as shown in Figure 2.4. For both of
these circuits, parallel paths in the AMG start and end in transitions
1 and 5 respectively. Let t_i denote transition i and p_j denote
place j. Path Pj for both circuits is the forward path t_1, p_6,
t_6, p_8, and t_5. Path Pi for the two parallel path circuits is t_1, p_2,
t_2, p_3, t_3, p_4, t_4, p_5, and t_5, and t_1, p_2, t_2, p_7, t_7, p_9, t_4, p_5,
and t_5 respectively. Both of these circuits have two tokens from
backward directed places from the NMG transitions of path Pj, as
shown in the CMG. Now the AMG is transformed by inserting a dummy
transition on path Pj as shown in Figure 3.1. The corresponding CMG
is shown in Figure 3.2. The number of tokens on each of the parallel path
circuits is now 3 and therefore the time/token ratio is 2. The time/token ratio
for any other circuit does not increase as the dummy transition has
zero time. The largest time/token ratio over all directed circuits is
now 2. Hence, TBOLB for the AMG of Figure 3.1 is 2, and
transformation by a dummy transition has improved throughput
performance.
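The arithmetic behind this example, and behind the termination argument in the proof of Theorem 3.1, can be sketched in Python. The sketch assumes that each zero-time dummy transition adds exactly one token to a parallel path circuit without adding time, as the proof states. The circuit time of 6 is inferred from the reported ratio of 3 with two tokens, and the function names are hypothetical.

```python
import math
from fractions import Fraction


def time_per_token(circuit_time, tokens):
    """Time/token ratio T(C) / M(C) of a directed circuit."""
    return Fraction(circuit_time, tokens)


def dummies_needed(circuit_time, tokens, target_ratio):
    """Smallest number of zero-time dummy transitions (one extra token
    each) that brings a parallel path circuit's ratio to target or below."""
    required_tokens = math.ceil(Fraction(circuit_time) / Fraction(target_ratio))
    return max(0, required_tokens - tokens)


CIRCUIT_TIME = 6  # inferred: ratio 3 with 2 tokens (an assumption)
TARGET = 2        # largest ratio among process circuits in the example

n = dummies_needed(CIRCUIT_TIME, tokens=2, target_ratio=TARGET)
print(n, time_per_token(CIRCUIT_TIME, 2 + n))
```

Under these assumptions a single dummy transition is enough, which agrees with the transformed AMG described in the example.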
Application 2. This is an application to demonstrate a procedure for
reducing resource requirements. Control places and dummy transitions
are the two transformation techniques which are used. Suppose that
all the data sets of an AMG require a resource envelope, as given by
SRE, and data sets are injected at the interval of TBO time units.
The total resource envelope will then be given by TRE and the peak
value of TRE will be the required number of functional units. From
Chapter Two, TRE is periodic and one period of TRE is made by
additions of sections of SRE of width TBO. This immediately leads to
the possibility that the peak value of TRE might be lowered by
adjusting the shape of SRE if the peak value of TRE is more than the
minimum requirement ⌈TCE/TBO⌉. SRE can be modified by
delaying active transitions selectively with the help of control
places. This may or may not lead to an increase in TTLB (thereby
duration of SRE) or TBIOLB depending on the "float" of delayed
active transitions. Float is the amount of time an active transition
can be delayed without increasing TBIOLB and TTLB.
A desired result is to modify SRE without increasing TBIOLB and
TTLB to achieve TBOLB with a minimum number of resources.
Unfortunately, this problem is equivalent to a class of scheduling
problems which is known to be NP-complete [12]. Thus, SRE must be
modified heuristically by control places. Judicious insertion of
control places may reduce the resource requirement for the same
TBOLB, but perhaps at the expense of TBIOLB. A control place is
useful if it can reduce resource requirements by delaying transitions
with float or by sacrificing parallel concurrency to some extent.
Lastly, insertion of control places in the AMG can create dominant
parallel path circuits in the corresponding CMG which are made
insignificant following the procedure of Application 1.
The methodology for lowering the resource requirement is now
stated. First, construct SRE and TRE for the AMG at the specified TBO.
The peak value of TRE is the resource requirement for an input data
injection interval of TBO. If the peak value of TRE is more than
⌈TCE/TBO⌉, heuristically modify SRE by transforming
the AMG with control places with as small an increase in TBIOLB and
TTLB as possible. Make all dominant parallel path circuits created
by control places insignificant by adding dummy transitions. An
example is given below to illustrate Application 2.
Example. Consider the algorithm marked graph of Figure 3.3. From the
AMG, TCE = 12, TBOLB = 2, and TBIOLB = TTLB = 6. The minimum
resources to achieve TBOLB are ⌈TCE / TBOLB⌉ = 6. SRE is shown
in Figure 3.4. Adding sections of SRE of width TBOLB, a period of
TRE is computed and is shown in Figure 3.4. The peak value of TRE is
9. Hence, nine functional units are required for implementing this
AMG for optimum time performance. As the minimum resource requirement
for TBOLB is 6, Application 2 is considered. The AMG is transformed
heuristically, as shown in Figure 3.5. The dotted lines are control
places 1 through 4. Ignore control places 2, 3, and 4 initially. The
justification of control place 1 is as follows. It is noted that
transition 5 is the only transition which has a float in the AMG.
Transition 5 can be delayed up to two time units without delaying the
output. Considering section 1 of SRE as shown in Figure 3.4,
transition 5 should be delayed one time unit so that the peak value of
TRE is reduced to 8. This is accomplished by control place 1. The
modified SRE and TRE are shown in Figure 3.6. Unfortunately, control
place 1 creates a parallel path circuit among transitions 1, 4, and 5
whose time/token ratio is more than 2. The time/token ratio of this
circuit is made less than 2 by inserting a dummy transition on the
place between transitions 1 and 5. Now consider section 2 of SRE as
shown in Figure 3.6. It contributes (4, 1) to a period of TRE. In
order to reduce the peak value of TRE, a more equal distribution of
transitions among the time intervals (t, t+1) and (t+1, t+2) of TRE
is needed. Control places 2, 3, and 4 do this job at the expense of
increasing TBIOLB and TTLB by one time unit. The SRE and TRE of
the fully transformed AMG of Figure 3.5 are shown in Figure 3.7. Now
only six functional units are required, which is the minimum for a
TBOLB of 2. It is to be noted that the maximum utilization of
resources may not be achievable by use of control places in all cases
unless the AMG is turned into a complete chain.
[Figure 3.4. For Figure 3.3, (a) SRE. (b) TRE for TBO = 2.]
[Figure 3.6. For the AMG transformed by control place 1, (a) SRE. (b) TRE for TBO = 2.]
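The effect of delaying a transition with float on the peak of TRE can be illustrated with a small Python sketch. The schedule below is entirely hypothetical (it is not the AMG of Figure 3.3, whose envelope values are not fully recoverable from the text), and all names are assumptions. Transition "c" is taken to have one time unit of float, so delaying it does not lengthen the task.

```python
def sre_from(schedule):
    """SRE as a list: entry k is the number of transitions active in (k, k+1)."""
    task_time = max(start + dur for start, dur in schedule.values())
    sre = [0] * task_time
    for start, dur in schedule.values():
        for k in range(start, start + dur):
            sre[k] += 1
    return sre


def tre_from(sre, tbo):
    """One period of TRE obtained by adding sections of SRE of width TBO."""
    tre = [0] * tbo
    for x, need in enumerate(sre):
        tre[x % tbo] += need
    return tre


def peak_and_minimum(schedule, tbo):
    """Return (peak of TRE, ceil(TCE / TBO)) for a schedule."""
    sre = sre_from(schedule)
    tce = sum(sre)
    return max(tre_from(sre, tbo)), -(-tce // tbo)  # integer ceiling


# Hypothetical schedule: transition -> (start, duration).
original = {"a": (0, 2), "b": (2, 2), "c": (2, 1), "d": (4, 1)}
# A control place delays "c" by one time unit, within its assumed float.
delayed = dict(original, c=(3, 1))

print(peak_and_minimum(original, 2))
print(peak_and_minimum(delayed, 2))
```

In this toy case the delay moves one unit of work from the crowded half of the period to the lighter half, so the peak falls to the lower bound while the task time stays the same. In general the choice of which transition to delay remains heuristic, as noted above.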
3.3 Implementation Of Periodicity By Transformation
This section describes a procedure for enforcing periodicity in
the execution of an algorithm marked graph for successive data sets.
It is desired that performance and resource needs be identical for all
data sets for two reasons. First, input data should not experience a
waiting time on the critical path of a task so that TBIOLB is
achieved for all data sets. Second, the resource envelopes for all
data sets should be identical so that the total resource need can be
predicted. It will be shown in Applications 3 and 4 of this section
that by controlling input data injection and transforming the AMG by
dummy transitions, periodicity can be realized in the execution of the
AMG. The need and methodology for injection control of input data is
explained in Application 3. Application 4 describes the conditions
for operating an AMG periodically with each data packet having
identical resource envelopes.
Figure 3.7. For the transformed AMG with all control places, (a) SRE.
(b) TRE for TBO = 2.

Application 3. When presented with continuously available input data
sets, the natural behavior of a data flow architecture results in
operation where new data sets are accepted as rapidly as the available
resources and the input transition of the AMG permit. From Chapter
Two, the output of the AMG cannot be generated at a higher rate than
1/TBOLB or R/TCE. Therefore, if the data sets are continuously
available, they experience a waiting time inside the architecture
which increases TBIO from TBIOLB. That is, the architecture will
naturally operate at high levels of pipeline concurrency with the
possible loss of capability for achieving high levels of parallel
concurrency. This will result in performance characterized by high
throughput rates, but relatively poor task computing speed. In many
control and signal processing applications, it is important to achieve
both a high throughput rate and high task computing speeds.
Therefore, it is necessary to control the injection rate of data sets
so that input data never wait on the critical path. The input data
injection interval must always be greater than, or equal to, TBOLB
and it should be such that all task inputs always have a resource
available to fire transitions on the critical path to the data output
sink. This can be accomplished either by adjusting the time for the
source transition or by the construction shown in Figure 3.8. It is
not always easy to adjust the source transition time as this will be
the sampling interval of sensors in a real system. All that is
required is to limit the rate at which new input data are presented
to the CMG. This is done in Figure 3.8 by adding a dummy transition
in a directed circuit with the data input source. The predefined
token on the directed circuit is for initialization. The dummy
transition imposes a minimum delay of D time units between inputs. D
is chosen to be the designer specified TBO.
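The effect of the injection-control circuit can be illustrated with a minimal Python sketch. It assumes that data sets arrive in time order and that the dummy transition of time D simply forces at least D time units between successive injections. The function name and the sample arrival times are hypothetical.

```python
def injection_times(arrival_times, d):
    """Times at which data sets actually enter the graph.

    arrival_times: non-decreasing times at which data sets become
    available at the source (assumed input format).
    d: duration of the dummy transition placed in a directed circuit
    with the data input source.
    """
    injected = []
    for t in arrival_times:
        if injected:
            # The circuit token returns only d units after the last injection.
            t = max(t, injected[-1] + d)
        injected.append(t)
    return injected

# Continuously available data (all ready at time 0) with D = TBO = 2.
continuous = injection_times([0, 0, 0, 0], 2)
# Sparse arrivals are delayed only when they come closer than D apart.
sparse = injection_times([0, 5, 6], 2)
```

With continuously available data the injections settle to a constant interval D, which is the constant TBI assumed later in the operating point design.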
Figure 3.8. Injection control by Application 3. (The circuit contains
a dummy transition of time D.)

Application 4. It is necessary that all data sets have the same
resource envelope so that the total resource requirement can be
predicted. Also at steady state, it is desirable that all data sets
require resource envelopes identical to SRE as SRE can be modified to
lower the peak value of TRE as described in Application 2. In order
to achieve such a resource envelope, all transitions of the AMG should
fire as soon as there is a token on every input place. The first step
is to control the data injection interval as discussed in Application
3. If this condition is satisfied, then it can be guaranteed that a
data token never waits on the critical path from the data input source
to the data output sink for all data sets. Hence, TBIOLB is
achieved for all data sets. Secondly, the resource envelope for a
data set of an AMG at steady state may not be identical to the SRE
even though injection is controlled, for the following reason.
Whenever there are parallel paths in the algorithm marked graph, the
transitions on non-critical paths of the algorithm marked graph will
have a float associated with them. The float of a transition is the
time by which a transition can be delayed without increasing TBIOLB
and TTLB. If there is not enough storage space for previous data,
transitions in the AMG with float may not fire even though all the
input places have tokens. The reason is that one or more output
places of the transition contain previous data. This will change the
steady state resource envelope from the SRE. One way to prevent this
from happening is to use control places to eliminate all floats from
the AMG. However, this may not always be possible as any control
place has to be generated from the completion of execution for a
transition. Also, use of control places may require dummy transitions
to prevent TBOLB from increasing, which will make the AMG more
complex. A better way of enforcing SRE for all data sets
is to use dummy transitions as buffers in the output of transitions
with float which need more storage space for previous data. The
position and number of dummy transitions can be determined from TGP
based on SGP. As the input injection interval is greater than, or
equal to, TBOLB, SRE should be enforced for the injection interval
of TBOLB. This will also guarantee SRE for all data sets with any
higher injection interval. The reason is that transitions are
executed at a lower rate for a higher injection interval and the need
for storage space at the output of floating transitions will be
lower. The detailed procedure is stated below.
Construct the TGP based on SRE for TBO = TBOLB. Locate all
transitions with float and identify their corresponding task input
number. By inspection of TGP, check whether all the successors of a
floating transition for the previous task inputs have fired before the
floating transition fires. If not, the floating transition needs
dummy transitions as buffers at its output. The number of required
dummy transitions equals the number of previous task inputs for which
at least one of the successor transitions has not fired at the time of
firing of the floating transition.
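The counting rule of this procedure can be sketched in Python. The sketch assumes that the SGP is given as a table of firing start times measured from the injection of a data set, and that task input j was injected j*TBO time units before task input 0. A successor that starts at exactly the same instant as the floating transition is treated as having fired, which is an assumption about ties and is not stated in the text. The function name and the start times are hypothetical.

```python
def dummy_buffers_needed(start, successors, floating, tbo):
    """Count the dummy transitions needed at the output of `floating`.

    start: dict mapping transition -> firing start time in the SGP.
    successors: transitions fed by the floating transition.
    tbo: injection interval used to build the TGP.
    """
    fire_time = start[floating]          # firing of floating(0)
    count = 0
    j = 1                                # previous task input number
    # A successor of task input j starts j*tbo earlier than in the SGP.
    while any(start[s] - j * tbo > fire_time for s in successors):
        count += 1
        j += 1
    return count

# Hypothetical SGP start times (assumption): transition 5 floats.
start = {1: 0, 2: 1, 3: 3, 4: 4, 5: 1}
buffers_tbo2 = dummy_buffers_needed(start, [4], 5, 2)
buffers_tbo4 = dummy_buffers_needed(start, [4], 5, 4)
```

The loop stops at the first previous task input whose successors have all fired, because older task inputs fire their successors even earlier.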
Example. Consider the algorithm marked graph of Figure 3.9. From the
AMG, TBOLB = 2 and TBIOLB = TTLB = 5. Only transition 5 has a
float of two time units. SGP and TGP for TBO = TBOLB = 2 are shown
in Figure 3.10. Task input 1 has started TBOLB before task input 0,
and task input 2 has started another TBOLB before task input 1. The
successor of floating transition 5 is transition 4. Another
predecessor of transition 4 is transition 3. Notice from the TGP that
4(2) has started before 5(0); 3(1) begins with 5(0). As 4(1) is
executed after 3(1) in the SGP, 4(1) has not started before
5(0). Hence, one dummy transition is needed at the output of
transition 5 to store 5(1) so that 5(0) can fire according to the
SGP. Otherwise, the firing of 5(0) will be delayed as the NMG model
of a transition does not allow the firing of a transition unless the
output buffer is empty. The transformed AMG is shown in Figure
3.11(a). The TGP for TBO = 3 is shown in Figure 3.11(b). Transition
5 no longer needs a dummy transition in the output for enforcing SRE.
Hence, the transformed AMG of Figure 3.11(a) enforces SRE for TBO
equal to both 2 and 3.

Figure 3.9. Example AMG for illustration of Application 4.
Figure 3.10. (a) SGP. (b) TGP for TBO = 2.
Figure 3.11. (a) Transformed AMG. (b) Total graph play for TBO = 3.

3.4 Structural Changes In Algorithm by Transformation
The transformations considered so far try to preserve the
original structure of an algorithm marked graph. In certain
conditions that may not be possible, or desirable. For example, it is
possible to improve TBOLB of linear time invariant systems by
modifying the state equations. In this section, three kinds of
structural changes of algorithms are considered in Applications 5
through 7. Application 5 explains how multiple input-output
algorithms or a group of algorithms can be combined into a single
input-output algorithm. This is necessary because the analysis tools
developed in this dissertation are based on single input-output
algorithms. Improvement of throughput by modifying the state
equations of linear time-invariant systems is demonstrated in
Application 6. The linearity property of state equations is used in
developing this technique and hence it may not be applicable for other
graphs, in general. Application 7 considers the parallel
decomposition of transitions as a way of improving performance.
Application 5. The performance model of Chapter Two considers only
single input and single output algorithms. The addition of dummy
transitions provides a way of converting multiple input-output
algorithms or a number of algorithms into one single input-output
algorithm. A dummy transition is used to combine input data vectors
or output data vectors. All the inputs are synchronized and fed to
the dummy transition at the same rate. Performance is evaluated from
the combined algorithm which represents the total task. Two examples
are shown in Figures 3.12 and 3.13. In Figure 3.12, AMG A1 has two
inputs and two outputs. It is transformed into a single input-output
algorithm A2 by dummy transitions. Figure 3.13 shows how dummy
transitions can be used to combine two algorithms into one algorithm.
Figure 3.12. (a) AMG A1. (b) Transformed AMG A2. (Dummy transitions
combine the inputs and the outputs.)

Application 6. This is an application of increasing throughput of
linear time invariant systems by increasing the number of tokens in
the circuit. Linear time invariant systems are described by the state
equations as stated below.

     x(k) = Ax(k-1) + Bu(k)
     y(k) = Cx(k) + Du(k)                                   (3.4.1)

where x is the state vector, y is the output vector, and u is the
input vector. A, B, C, and D are time-invariant system matrices. The
corresponding algorithm marked graph is shown in Figure 3.14.
Usually, Ax(k-1) is the most time consuming computation in the AMG.
In such a system, the recursion circuit determines the TBOLB. It is
shown that it is possible to reduce the time/token ratio of this
recursion circuit by doubling the number of tokens so that TBOLB is
improved to the largest time/token ratio of the process circuits in
the CMG. This is useful if decomposition is not desirable and TBOLB
needs to be reduced approximately to the largest transition time of
the AMG. The methodology for reducing the time/token ratio of the
recursion circuit is expressed below by the statement and proof of
Theorem 3.2 with the assumption that A * () is the largest transition
in the AMG representing the state equation.
Theorem 3.2. It is possible to improve TBOLB to the largest
time/token ratio of the process circuits of a linear time invariant
system by reducing the time/token ratio of the recursion circuit by
doubling the number of tokens in the recursion circuit.
Proof. The theorem is proved by construction. Assuming A * ()
(transition 4) to be the largest transition of Figure 3.14, TBOLB is
determined from the recursion circuit. Application 1 has shown that
any AMG can be transformed so that TBOLB is determined by only
process circuits and recursion circuits. Thus, the statement of
Theorem 3.2 will be proved if the AMG for the state equation can be
transformed so that the time/token ratio of the recursion circuit is
smaller than that of the largest process circuit. Let the state
equation represent a 1-input, m-output, and n-element state vector
system. The dimensions of A, B, C, and D are then (n, n), (n, 1),
(m, n), and (m, 1) respectively. Now

     x(k)   = Ax(k-1) + Bu(k);
     x(k-1) = Ax(k-2) + Bu(k-1);
     x(k)   = A{Ax(k-2) + Bu(k-1)} + Bu(k).

It follows from the linearity of the system that

     x(k) = (A * A)x(k-2) + (A * B)u(k-1) + Bu(k).

Let A * A = E and A * B = F. Then,

     x(k) = Ex(k-2) + Fu(k-1) + Bu(k).                      (3.4.2)

Notice that E has the same dimensions as A, and F has the same
dimensions as B. Therefore, the amounts of computation for Ax(k-1)
and Ex(k-2) are the same, as are those for Fu(k-1) and Bu(k).
However, if equation (3.4.2) is used
instead of equation (3.4.1) for representing a linear time-invariant
system, the recursion circuit has two initial tokens as x(k) is
generated from x(k-2). The new AMG based on equation (3.4.2), and the
original output equation, is shown in Figure 3.15. The dummy
transitions are inserted to act as buffers so that transitions are not
blocked from firing by output buffers that are not yet empty. T1,
T2, and T3 are predefined tokens. T1 = F * u(k-1), T2 = E * x(k-2),
and T3 = x(k-1). Let k = 1, 2, 3... and the initial state vector be
x(0). Therefore, the first input and output are u(1) and y(1)
respectively. That is, u(s) = 0 for s equal to zero or negative.
Therefore, the initial values of T1, T2, and T3 correspond to k
= 1. Hence, the initial values of T1 and T3 are T1 = F * u(0) =
0 and T3 = x(0). From (3.4.2),

     T2 = Ex(k-2) = x(k) - Fu(k-1) - Bu(k).
88
4-*
t-
O
0
C
I"-
89
Therefore, the initial value of T 2 is given by x(1) - Fu(0) -
Bu(1). As u(0) = O, the initial value of T2 = x(1) - Bu(1). Hence,
it follows from the equation (3.4.1) that the initial value of T2 =
Ax(0) + Bu(1) - Bu(1) = Ax(0). Therefore, all the initial values of
the predefined tokens can be calculated from the initial state
vector. The recursion circuit now consists of transitions 2 and 4 and
there are two tokens in that circuit. _ne ccmputation level of
transition 4 has not changed, although that of transition 2 has
doubled. Thus, the new time/token ratio of the recursion circuit is
T(4)/2 + T(2), where T(4) and T(2) are the times for transition 4 and
2 of the original algorithm marked graph. Assuming T (4) is much
greater than T(2), the TBOLB of the new algorithm marked graph of
Figure 3.15 is given by the process circuit of transition 4 whose
time/token ratio is the same as in Figure 3.14.
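The construction used in the proof can be checked numerically. The following Python sketch compares the one-step recursion (3.4.1) with the two-step recursion (3.4.2) for a single-input system, using the initial token values derived above (T1 = 0, T2 = Ax(0), T3 = x(0)). The helper names and the sample matrices are hypothetical, and B is stored as a plain list because the system has one input.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    """Product of a matrix and a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]

def add(*vectors):
    """Element-wise sum of equally long vectors."""
    return [sum(parts) for parts in zip(*vectors)]

def scale(v, s):
    return [s * e for e in v]

def one_step(A, B, x0, u):
    """x(k) = A x(k-1) + B u(k), with u[0] = 0 standing for u(0)."""
    xs = [x0]
    for k in range(1, len(u)):
        xs.append(add(matvec(A, xs[k - 1]), scale(B, u[k])))
    return xs

def two_step(A, B, x0, u):
    """x(k) = E x(k-2) + F u(k-1) + B u(k) with E = A*A, F = A*B."""
    E = matmul(A, A)
    F = matvec(A, B)
    xs = [x0]
    # k = 1 uses the predefined tokens: T2 = A x(0) and T1 = F u(0) = 0.
    xs.append(add(matvec(A, x0), scale(F, u[0]), scale(B, u[1])))
    for k in range(2, len(u)):
        xs.append(add(matvec(E, xs[k - 2]), scale(F, u[k - 1]), scale(B, u[k])))
    return xs

# Hypothetical two-state system (assumption, not from the report).
A = [[1, 2], [0, 1]]
B = [1, 1]
x0 = [1, 0]
u = [0, 1, 2, 3, 4]          # u(0) = 0, then u(1)..u(4)
states_one = one_step(A, B, x0, u)
states_two = two_step(A, B, x0, u)
```

Both recursions produce the same state sequence, while in the second one each x(k) depends on x(k-2), which is why the recursion circuit can hold two tokens.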
Application 7. This application establishes a method for finding the
maximum level of parallel decomposition of a transition in an AMG for
the best computing speed of the transition. Decomposition reduces
process times of transitions; unfortunately, it also increases the
communication cost due to an increase in the number of transitions and
places in the graph. Therefore, computing speed is improved with
decompositions up to a certain level. For the lowest process time,
transitions are decomposed uniformly. The maximum level of
decomposition of the transition is determined from the condition for
the fastest completion of the computation represented by the original
transition.

Figure 3.16. (a) An AMG with a large transition T. (b) T is
decomposed in N parallel transitions.

Let T be the computation time of a transition which can be
decomposed in parallel arbitrarily without changing T. Let this
transition be decomposed into N equal parallel transitions as shown in
Figure 3.16. Each Ti is T/N. The time to complete the total
computation (A) for T in the worst case is then given by

     A = r + T/N + C0 + w.                                  (3.4.3)

r and w are the read and write times to complete reading and writing
of data for all Ti transitions. When this set of N transitions is
computing T, some other transitions of the AMG may be concurrently
processed. C0 is the time required by each functional unit to
receive data from the transitions of the rest of the AMG during the
computing of T. C0 is assumed to be independent of N and i. Any
data are assumed to be broadcast to all functional units by a
transmission medium. It is assumed that one data packet can be
broadcast at a time to all functional units. It is also assumed
that total transmission time for output data for all N transitions
together does not change with N. The worst case value of read and
write time for all N transitions together can then be expressed by the
following equation:

     r + w = C1 + N*L*C2 + C3,                              (3.4.4)

where C1 is the time that the transmission medium has to be used to
serve the rest of the AMG during the read and write operations for N
transitions of T. C1 is assumed to be independent of N. C2 is
the average access time for the transmission medium and L is the
number of times a functional unit has to access the transmission
medium for computing a transition. C3 is the time to transmit
output data over the transmission medium for all N transitions
together and is assumed to be independent of N. Therefore, from
(3.4.3) and (3.4.4),

     A = T/N + C0 + C1 + N*L*C2 + C3.

For minimizing A, dA/dN = 0 and d2A/dN2 must be positive. Now

     dA/dN   = -(T/N^2) + (L*C2);
     d2A/dN2 = 2 * (T/N^3).

As T and N are always positive, d2A/dN2 is positive. Setting
dA/dN = 0,

     0 = -(T/N^2) + (L*C2);
     N = (T / (L*C2))^0.5.

As N has to be an integer and higher N will mean higher communication
cost,

     N = floor[(T / (L*C2))^0.5].                           (3.4.5)

Also as N >= 2 for any decomposition,

     T >= 4 * L * C2.                                       (3.4.6)
Thus knowing C2, which is an architecture dependent parameter, the
minimum value of T for decomposition can be evaluated from (3.4.6).
Equation (3.4.5) provides the maximum level of decomposition.
Example. Let T be the processing time for transition B in an AMG as
shown in Figure 3.17. Suppose B can be arbitrarily decomposed in
parallel. Let T = 10, C2 = 0.25 and L = 2. As T > (4*2*0.25 = 2), B
can be decomposed to improve performance. Let B be decomposed in N
transitions in parallel. Hence, N <= floor[(10/(2*0.25))^0.5] = 4.
In order to maintain process time for computation T reasonably higher
than communication time for large granularity, a level of
decomposition less than, or equal to, half the maximum level is
assumed to be appropriate in this example. Thus N is chosen
to be 2. The decomposed transition B is shown in Figure 3.17.
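The arithmetic of this example can be reproduced with a short Python sketch of equations (3.4.5) and (3.4.6) and of the completion time A. The function names are hypothetical, and the constants C0, C1, and C3 are set to zero here only because the text gives no values for them; they do not affect the optimum since they are independent of N.

```python
import math

def max_decomposition(t, l, c2):
    """Equation (3.4.5): N = floor((T / (L*C2))^0.5)."""
    return math.floor(math.sqrt(t / (l * c2)))

def worth_decomposing(t, l, c2):
    """Equation (3.4.6): decomposition requires T >= 4*L*C2."""
    return t >= 4 * l * c2

def completion_time(t, n, l, c2, c0=0.0, c1=0.0, c3=0.0):
    """A = T/N + C0 + C1 + N*L*C2 + C3 (c0, c1, c3 default to 0 as an assumption)."""
    return t / n + c0 + c1 + n * l * c2 + c3

# Values from the example: T = 10, L = 2, C2 = 0.25.
n_max = max_decomposition(10, 2, 0.25)
times = {n: completion_time(10, n, 2, 0.25) for n in (1, 2, 3, 4)}
```

With these values the completion time falls steadily from N = 1 to N = 4, so the choice N = 2 made in the example trades some speed for larger granularity.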
Figure 3.17. (a) AMG before decomposition of B (T = 10). (b) B is
decomposed.

3.5 Summary
Applications of algorithm transformation are discussed in this
chapter and transformation techniques are defined. Improvements of
TBOLB are achieved by dummy transitions. Resource requirements may
be lowered by control places and dummy transitions. Input data
injection is controlled by a predefined token and a dummy transition.
Periodicity in the resource envelope is enforced by dummy
transitions. The methodology for transforming algorithms into a single
input-output algorithm is described. The TBOLB of linear
time-invariant systems is improved by predefined tokens. Lastly,
parallel decomposition of transitions is considered to illustrate the
trade-off between decreased granularity and increased communication
cost.
CHAPTER FOUR
ATAMM OPERATING POINT DESIGN
4.0 Introduction
The ATAMM operating point (AOP) describes the specification of
the input data injection interval (latency), resource requirements and
the time performance of an algorithm marked graph operated on an ATAMM
data flow architecture. The design of operating points based on the
number of resources of the ATAMM data flow architecture is
investigated in this chapter. The methodology is demonstrated through
examples, simulations, and experiments. Properties of the ATAMM
operating point under the allowable transformations and implementation
strategies are discussed in Section 4.1. In Section 4.2, the AOP design
methodology is developed. The performance model, transformation
techniques and the AOP design methodology are verified by simulations
and experiments on test algorithms in Section 4.3. A summary of the
chapter is presented in Section 4.4.
4.1 Characteristics of Operating Point
The ATAMM operating point is the parameter set (TBI, R, TBIO, TT,
and TBO) for an algorithm execution where TBI is the input data
injection interval (latency) and R is the minimum number of resources
required by the ATAMM data flow architecture. The design problem is
to specify an operating point for executing an AMG in the ATAMM data
flow architecture which achieves optimum time performance with a
minimum number of computing resources. Unfortunately, this problem is
equivalent to a class of scheduling problems which is known to be NP
complete [12]. Thus, no methodology is known for obtaining an
optimum solution which is better than enumerating all possible
solutions and then choosing the best one. However, it is possible to
develop a procedure for generating sub-optimal solutions. This is the
objective of this chapter. The design objective is to determine an
operating point given the number of resources, and to provide the
guidelines for generating a new operating point should the number of
resources change. Also, the expected time performance for TBIO and TT
should remain the same with any input data injection interval greater
than that of the operating point as long as the number of resources
is not decreased. The following properties are assumed in the
operating point design:
a) Input data from the source are injected into the ATAMM data
flow architecture at a constant rate, and hence the time
between successive inputs (TBI) is always the same.
b) For all input data of the task, TBIO = TBIOLB and TT =
TTLB.
c) Each data set requires a resource usage envelope identical to
SRE.
All of these properties are realized by the use of Applications 3
and 4 of Section 3.3. These properties are needed for achieving the
best task computing speed for all task inputs and to accurately
predict resource requirements. As stated in Application 3, the time
between successive data inputs (TBI) is adjusted to be greater than,
or equal to, TBOLB so that input data never wait on the critical
path to the data output sink. The algorithm marked graph is
transformed as in Application 4 so that the resource envelope for each
task input is SRE. The design procedure must determine the allowable
range of TBI so that the ATAMM data flow architecture has sufficient
resources to meet the resource requirements of all task inputs. Let
Rmin be the peak value of SRE. Therefore, any task input requires
at least Rmin resources to meet properties b and c. Let Rmax be
the largest peak value of TRE for any TBI >= TBOLB. Hence, with
Rmax or more functional units, any ATAMM data flow architecture can
execute the AMG while achieving TTLB and TBIOLB for any injection
interval greater than, or equal to, TBOLB. It is to be noted that
TBI and TBO are the same for any AMG at steady state. Finally, let
the number of resources of the ATAMM data flow architecture be denoted
by R.
The operating point for various numbers of resources can be
displayed on a graph of TBO versus TT. Every point in the graph is
associated with a value of TBIO and R. From Chapter Two, TT >= TCE/R
and TBO >= TCE/R. Also TBI and, hence, TBO need not be increased
beyond TT as Rmax = Rmin on the TBO = TT line. Therefore, the AOP
is expected to lie in a triangular area of the graph determined by the
number of functional units of the ATAMM data flow architecture. The
characteristics of the operating point are shown in Figure 4.1.
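The triangular area can be expressed as three inequalities. The Python sketch below tests whether a candidate (TBO, TT) pair lies in that area for R resources, assuming the bounds TT >= TCE/R and TBO >= TCE/R from Chapter Two and TBO <= TT. The function name and the sample values are hypothetical.

```python
def in_operating_region(tbo, tt, tce, r):
    """True if (TBO, TT) lies in the triangle allowed for R resources.

    The triangle is bounded by TT >= TCE/R, TBO >= TCE/R, and the
    TBO = TT line (TBO never needs to exceed TT).
    """
    bound = tce / r
    return tt >= bound and tbo >= bound and tbo <= tt

# Sample values (assumption): TCE = 12 and R = 8, so TCE/R = 1.5.
inside = in_operating_region(2, 6, 12, 8)
too_fast = in_operating_region(1, 6, 12, 8)      # TBO below TCE/R
past_diagonal = in_operating_region(7, 6, 12, 8)  # TBO above TT
```

Membership in the triangle is only a necessary condition; whether a point inside it is actually reachable depends on the graph and on the transformations applied.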
Figure 4.1. ATAMM operating point characteristics. (The AOP lies in
the shaded area for R resources.)

Let the problem be specified by an algorithm marked graph. Let
the best possible performance under the rules of operating point
design be defined as the absolute lower bounds for the time
performance. Formal definitions of the absolute lower bounds for TT,
TBIO, and TBO are now stated.
Definition 4.1: Absolute Lower Bound for TBIO. The absolute lower
bound for TBIO (TBIOALB) is defined to be the lowest TBIOLB for
the algorithm marked graph with or without any transformations.
Definition 4.2: Absolute Lower Bound for TT. The absolute lower
bound for TT (TTALB) is defined to be the lowest TTLB for the
algorithm marked graph with or without any transformations.
Definition 4.3: Absolute Lower Bound for TBO. The absolute lower
bound for TBO (TBOALB) is defined to be the lowest TBOLB with or
without any transformations.
Let the transformation be restricted such that only dummy
transitions (of zero time) and control places (with no initial token)
are used for transforming the algorithm marked graph. Theorems are
now described to determine the absolute lower bounds under the above
transformations.
Theorem 4.1. The absolute lower bound for TBIO is equal to the lower
bound without any transformations.
Proof. Control places can create new paths in an algorithm marked
graph but do not alter existing paths in the AMG. Dummy transitions
of zero time increase the number of transitions on a path in the AMG
but do not increase the path length. Therefore, any path in the
original AMG is also a path in the transformed AMG with equal path
length. The critical path from the data input source to the data
output sink in the MAMG of the original algorithm marked graph is also
a path from the data input source to the data output sink in the MAMG
of the transformed AMG. Hence, TBIOLB of any transformed AMG under
the stated transformations cannot be lower than that of the original
one. Therefore, the TBIOALB of an algorithm marked graph is
determined by the TBIOLB of the AMG without any transformations.
This completes the proof.
Theorem 4.2. The absolute lower bound for TT is equal to the lower
bound without any transformations.
Proof. The proof is similar to that of Theorem 4.1. However, TTLB
is determined by the critical path among all paths from the data input
source to any output sink in the MAMG. By the arguments of Theorem
4.1, this critical path in the MAMG of the original AMG is also
present with equal path length in the MAMG of the transformed AMG.
Thus, TTLB cannot be reduced by transformation with dummy
transitions (zero time) and control places (no initial token). Hence
the TTALB of an AMG is determined by the TTLB of the AMG without
any transformations. This completes the proof.
Theorem 4.3. The absolute lower bound for TBO is equal to the largest
time/token ratio among the process and recursion circuits in the CMG
of the original algorithm marked graph without any transformations.
Proof. Theorem 3.1 has proved that the TBOLB of an algorithm marked
graph can be reduced to the largest time/token ratio of the process
and recursion circuits by transforming with dummy transitions of zero
time. Also, the time/token ratios of process and recursion circuits
are not going to increase as long as dummy transitions do not require
computer time. Control places, on the other hand, can create new
parallel path circuits in the CMG but do not change the time/token
ratio value of the circuits in the CMG of the original AMG.
Therefore, the lowest TBOLB, and hence TBOALB, is determined by the
largest time/token ratio among the process and recursion circuits in
the CMG of the original AMG. This completes the proof.
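The bound of Theorem 4.3 is a maximum over circuit time/token ratios, which can be sketched in Python as follows. The sketch assumes each process or recursion circuit is described by the list of its transition times and its token count. The function names and the sample circuits are hypothetical and do not come from any figure in this report.

```python
def time_token_ratio(transition_times, tokens):
    """Sum of the transition times on a circuit divided by its tokens."""
    return sum(transition_times) / tokens

def tbo_absolute_lower_bound(circuits):
    """Largest time/token ratio among the given circuits.

    circuits: iterable of (transition_times, tokens) pairs covering the
    process and recursion circuits of the original graph.
    """
    return max(time_token_ratio(times, tokens) for times, tokens in circuits)

# Hypothetical circuits (assumption): two process circuits with one
# token each and one recursion circuit holding two tokens.
circuits = [([2], 1), ([1], 1), ([1, 1, 1], 2)]
tbo_alb = tbo_absolute_lower_bound(circuits)
```

Parallel path circuits are deliberately left out of the input, since the theorem states that dummy transitions of zero time can always bring their ratios down to this bound.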
Any operating point will have TBIO, TT, and TBO values greater
than, or equal to, those specified by the respective absolute lower
bounds. Figure 4.2(a) displays the characteristics of the operating
point when designed with only dummy transitions (zero time) and
control places (no initial token). Any operating point resides in the
area BVWH. The point B corresponds to the operating point which
achieves the absolute lower bounds for TBIO, TT, and TBO. Lines BV
and BH represent operating points which achieve the absolute lower
bounds in task computing speeds (TT and TBIO) and the output interval
(TBO) respectively. With the specified transformations, TTLB cannot
be more than TC. Any operating point on line HW has TTLB = TC,
which indicates the absence of any parallel concurrency. Point W is
characterized by TTLB = TBOLB = TC and represents complete
sequential operation with no concurrency. ATAMM is most appropriate
for problems which require both parallel and pipeline concurrency. It
is assumed that TBIOLB and TTLB are achieved for any TBI greater
than, or equal to, the data injection interval at the operating
point. Therefore, the minimum resource requirement at any operating
point is the greatest peak value of TRE for any TBI >= TBOop, where
TBOop is the data output interval and the input data injection
interval at the operating point.
Figure 4.2(a). AOP characteristics under specific transformations.
(The AOP resides in the shaded area.)

4.2 Operating Point Design
Let the problem be specified by an algorithm marked graph for
which the ATAMM operating point is to be determined. The only
allowable algorithm transformations are dummy transitions of zero time
and control places. Predefined tokens and decomposition will not be
considered for operating point design. The AOP design consists of six
steps. These steps are described in the remainder of this section.
The operating points are determined corresponding to different numbers
of resources for the algorithm marked graph of Figure 3.3 to
illustrate each step as it is presented.
Step 1. Construct the CMG from the AMG. Determine lower bounds and
absolute lower bounds for TBIO, TT, and TBO for the AMG. If TBOLB
is greater than TBOALB, transform the AMG with dummy transitions to
achieve TBOALB, as in Application 1 of Section 3.2. Determine Rmax
and Rmin. If Rmax > [TCE/TBOALB], heuristically transform
the AMG with control places and dummy transitions to reduce Rmax
without increasing TBIOLB, TTLB, and TBOLB, as in Application 2 of
Section 3.2. Determine new Rmax and Rmin values. Lower bounds of
performance for the resultant AMG are also the absolute lower bounds
for TT, TBIO, and TBO under the specified transformations.
From the AMG of Figure 3.3, TBIOLB = 6, TTLB = 6, TBOLB = 2.
Also TBIOALB = 6, TTALB = 6, and TBOALB = 2. SRE and TRE
corresponding to TBO = 2 are shown in Figure 3.4. Checking all TBI >=
2, Rmax = 9. The AMG of Figure 3.3 is now transformed heuristically
to lower Rmax without increasing TBIOLB, TTLB, and TBOLB, as
described in Application 2 of Section 3.2. The transformed AMG is
shown in Figure 3.5 (ignore control places 2, 3, and 4). SRE and TRE
corresponding to TBI = TBOLB = 2 are shown in Figure 3.6 for the
resultant AMG. By checking all TBI >= 2, it is determined that Rmax
= 8, Rmin = 4.
Step 2. Choose a convenient transition firing rule. A rule to
determine when an enabled transition in the CMG fires must be
specified in the graph manager. The rule usually used is that enabled
transitions fire when computing resources are available. If
contention exists, such as when there are more enabled transitions
than computing resources, firing occurs according to a priority
ordering of the transitions. For the algorithm marked graph of Figure
3.5, the highest to lowest priority ordering of the transitions is
chosen as (11, 10, 9, 7, 8, 5, 6, 4, 3, 2, 12, and 1).
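The firing rule of this step can be sketched as a small Python function. It assumes the graph manager is handed the set of currently enabled transitions and the number of free functional units, and it resolves contention with the priority ordering quoted above. The function name and the sample inputs are hypothetical; this is an illustration of the rule, not the graph manager of the ATAMM architecture.

```python
# Highest to lowest priority ordering chosen for Figure 3.5.
PRIORITY = (11, 10, 9, 7, 8, 5, 6, 4, 3, 2, 12, 1)

def select_to_fire(enabled, free_units, priority=PRIORITY):
    """Pick the enabled transitions that fire now.

    enabled: iterable of enabled transition numbers.
    free_units: number of functional units currently available.
    Returns the chosen transitions, highest priority first.
    """
    rank = {t: i for i, t in enumerate(priority)}
    ordered = sorted(enabled, key=rank.__getitem__)
    return ordered[:max(free_units, 0)]

# Contention: four enabled transitions but only two free units.
fired = select_to_fire({1, 2, 5, 8}, 2)
# No contention: every enabled transition fires.
all_fired = select_to_fire({3, 12}, 5)
```

When resources are plentiful the rule reduces to firing every enabled transition, so the ordering only matters under contention.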
Step 3. If R >= Rmax functional units are available, operate at TBI
= TBOALB. Use Applications 3 and 4 of Section 3.3 to adjust TBI to
TBOALB and to transform the AMG by dummy transitions in order to
realize SRE as the resource envelope for all task inputs. Eliminate
all unnecessary dummy transitions. The operating point time
performance is the absolute lower bound values for TBIO, TT, and TBO.
The AMG can also be operated for any TBI >= TBOALB while maintaining
TBIO and TT at absolute lower bound values. When R < Rmax,
determine the operating point from one of the following strategies:
Strategy A: Strategy A is applicable when Rmax > R >= Rmin.
Preserve TBIO and TT at their respective absolute
lower bounds at the expense of increasing TBI and
TBO above TBOALB.
Strategy B: Strategy B is applicable when
Rmax > R >= [TCE/TBOALB]. Preserve TBO
at its absolute lower bound at the expense of
increasing one, or both, of TBIOLB and TTLB.
Strategy C: Strategy C is applicable when Rmax > R >= 1. The
operating point is determined by first following
Strategy B so that Rmax > R >= Rmin, and then
increasing TBI above TBOALB. The strategy tries
to minimize performance degradation in TBIO, TT, and
TBO from their respective absolute lower bound
values.
These three strategies of the AOP design under resource
constraints are illustrated in Figure 4.2(b). Strategy A maintains TT
and TBIO at their respective absolute lower bound values and reduces
pipeline concurrency to lower resource requirements. Strategy B
reduces resource requirements by decreasing parallel concurrency,
resulting in higher lower bounds for one or both of TBIO and TT.
Strategy C sacrifices both pipeline and parallel concurrency to some
extent to lower resource requirements.
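The ranges of R quoted for Step 3 and for Strategies A, B, and C can
be checked mechanically. The sketch below is a hypothetical helper,
not part of the design procedure itself; it reads the bracket in
[TCE/TBOALB] as a ceiling and the lower limits of the ranges as
inclusive, which is an interpretation of the text. The ranges
overlap, so the function lists every option whose range contains R;
choosing among them remains a design trade-off.

```python
from math import ceil


def applicable_strategies(r, r_max, r_min, tce, tbo_alb):
    """List the design options whose stated range of R contains r."""
    if r >= r_max:
        return ["Step 3"]          # operate at the absolute lower bounds
    options = []
    if r >= r_min:
        options.append("A")        # Rmax > R >= Rmin
    if r >= ceil(tce / tbo_alb):
        options.append("B")        # Rmax > R >= ceil(TCE/TBOALB)
    if r >= 1:
        options.append("C")        # Rmax > R >= 1
    return options


# Example AMG of Figure 3.5: Rmax = 8, Rmin = 4, TCE = 12, TBOALB = 2.
for r in (8, 6, 5, 4):
    print(r, applicable_strategies(r, 8, 4, 12, 2))
```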
If the ATAMM data flow architecture has eight or more functional
units, the algorithm marked graph of Figure 3.5 can be operated at
TBIO = TT = 6 and TBO = 2 by adjusting TBI = 2 using Application 3 of
Section 3.3. SGP and TGP corresponding to TBI = 2 are shown in Figure
4.3, which suggests that no new dummy transitions are required to
enforce SRE and SGP. Resource utilization over a period TBO is given
by TCE/(R*TBO) = 12/16 = .75.
Step 4. Execute this step if Strategy A is appropriate. Increase TBI
to TBOop such that TBOop is the lowest time interval between
overlapping SRE's for which the peak value of TRE is less than, or
equal to, R for all TBI ≥ TBOop. TBOop is guaranteed to lie in the
range ⌈TCE/R⌉ ≤ TBOop ≤ TTALB. Operate at TT = TTALB, TBIO =
TBIOALB, and TBO = TBI = TBOop using Application 3 of Section 3.3.
TBIOALB and TTALB are also achieved for any TBI > TBOop.
Assume the ATAMM data flow architecture has five functional
units. As Rmin = 4, Strategy A can be applied. Following Strategy
A, it is found that TBOop = 3. Overlapping of SRE's for TBI = 3 is
shown in Figure 4.4(a). The operating point is given by TT = TBIO = 6
and TBI = TBO = 3, and RU(TBO) = 12/(5*3) = .8.

Figure 4.2(b). The strategies for AOP design under resource
constraints.

Figure 4.3. (a) SGP. (b) TGP for TBO = 2.
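The range quoted in Step 4 narrows the search for TBOop before any
SRE's are overlapped. A minimal sketch follows, assuming the bracket
denotes a ceiling and that both limits are inclusive; the function
name tbo_op_range is hypothetical.

```python
from math import ceil


def tbo_op_range(tce, r, tt_alb):
    """Range stated in Step 4: ceil(TCE/R) <= TBOop <= TTALB."""
    return ceil(tce / r), tt_alb


# Five functional units, TCE = 12, TTALB = 6 for the example AMG.
low, high = tbo_op_range(tce=12, r=5, tt_alb=6)
print(low, high)          # 3 6
assert low <= 3 <= high   # TBOop = 3 from the text lies in the range
```

Here the lower limit is already attained, so TBOop = 3 is the best
value that five functional units can support.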
Step 5. Execute this step if Strategy B is appropriate.
Heuristically transform the AMG to reduce Rmax using control places,
as in Application 2 of Section 3.2. Maintain TBOLB at TBOALB by
using dummy transitions. A good heuristic is to reduce Rmin
significantly. There is a guaranteed solution at TTLB = TC,
TBIOLB = TFC, and TBOLB = TBOALB by transforming the AMG into a
complete chain. Eliminate all unnecessary dummy transitions. Operate
the transformed AMG for TBI = TBOALB = TBO, TT = TTLB, and TBIO =
TBIOLB using Applications 3 and 4 of Section 3.3.
Suppose the ATAMM data flow architecture has six resources. TCE
= 12 units of computer time. As R ≥ ⌈TCE/TBOALB⌉ = 6,
Strategy B can be applied. Rmax is reduced to 6 by control places
2, 3, and 4 as shown in Figure 3.5. New SRE and TRE for TBI = 2 are
shown in Figure 3.7. The peak value of TRE is 6, and TTLB = TBIOLB =
7. By checking all TBI ≥ 2 for this AMG, it is found that Rmax = 6
and Rmin = 3. SGP and TGP for the transformed AMG are shown in
Figure 4.5. Only transition 5 has a float associated with it. The
successor of transition 5 is transition 11. By inspection of the TGP,
transition 5(1) fires before transition 11(2), which is impossible
in an ATAMM unless there is a buffer between transitions 5 and 11.
Hence one dummy transition is required between transitions 5 and 11,
as shown in Figure 4.6, to enforce SRE as the resource envelope for
all task inputs. The operating point is given by TT = TBIO = 7 and
TBO = TBI = 2; RU(TBO) = 1.

Figure 4.4. (a) TRE for TBO = 3 in Step 4. (b) TRE for TBO = 4 in
Step 6.

Figure 4.5. (a) SGP. (b) TGP for TBO = 2.
Step 6. Execute this step if Strategy C is appropriate. Transform
the AMG by Strategy B until Rmax > R ≥ Rmin and then increase TBI
to determine TBOop, as in Strategy A.
Let R = 4. The AMG is transformed by Strategy B as described in
Step 5. Now Rmax = 6 and Rmin = 3. As R is within the range of
Rmax and Rmin, the operating point can be determined by increasing
TBI as in Strategy A. Increasing TBI gives TBOop = 4. Overlapping of
SRE's and TRE for TBI = 4 are shown in Figure 4.4(b). The operating
point is given by TT = TBIO = 7 and TBI = 4. Adjust TBI to 4 for the
AMG of Figure 4.6 to implement the operating point. RU(TBO) = .75.
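The resource utilization figures quoted in Steps 3 through 6 all
follow from RU(TBO) = TCE/(R*TBO) with TCE = 12. The short sketch
below recomputes them; the function name and the list layout are
illustrative choices, while the values of R and TBO are those derived
in the text.

```python
def resource_utilization(tce, r, tbo):
    """RU(TBO) = TCE / (R * TBO), as used in Steps 3 through 6."""
    return tce / (r * tbo)


TCE = 12  # units of computer time for the example AMG

# (step, R, TBO) for the four operating points derived in the text
operating_points = [("Step 3", 8, 2), ("Step 4", 5, 3),
                    ("Step 5", 6, 2), ("Step 6", 4, 4)]

for step, r, tbo in operating_points:
    print(f"{step}: R = {r}, TBO = {tbo}, "
          f"RU = {resource_utilization(TCE, r, tbo):.2f}")
```

The operating point of Step 5 is the only one that keeps all of its
resources busy over a full period (RU = 1).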
These operating points for the AMG of Figure 3.5 are shown in
Figure 4.7. Operating point B is the only operating point which
achieves the absolute lower bounds for TT, TBIO, and TBO and is
achieved in Step 3. OPA, OPB, and OPC are the operating points
developed by Strategies A, B, and C respectively.

Figure 4.7. ATAMM operating points for the example algorithm marked
graph.

4.3 Test Results
The performance model, transformation techniques, and the ATAMM
operating point design procedures are tested by simulations and
experiments. Simulations on the test algorithms are done by a
software simulator developed to simulate the execution of an algorithm
in the ATAMM environment [21]. The input parameters for the simulator
are the algorithm marked graph including all NMG transition times, the
number of resources, and a priority ordering for the transitions of
the AMG. The input data injection interval is controlled by adjusting
the source transition time. The simulator detects all events
associated with the execution of transitions for each task input and
writes them to a graph diagnostic file. The analyzer is a program
developed to analyze this graph diagnostic file [21]. The two features
of the analyzer used in this dissertation are the node activity
display and the input/output display. The node activity display shows
the execution of transitions as a function of time. The input/output
display shows TBI, TBO, and TBIO for each task input and also plots
these quantities as a function of time. Detailed information about
the simulator and the analyzer is found in [21]. Another useful
program developed is called Ttime, which determines the lower bounds
for TT, TBIO, and TBO in an algorithm marked graph by constructing the
CMG and MAMG [20].
A testbed is developed to run test algorithms in the ATAMM
environment [20]. The ATAMM data flow architecture consists of three
functional units with a distributed global memory and graph manager.
Figure 4.8 shows the architecture. Functional units are realized by
IBM Personal Computer AT's. Functional units communicate with each
other over an Ethernet communication bus. In addition, another IBM PC
AT, which implements the source and sink transitions of the AMG, is
connected to the Ethernet bus. This IBM PC AT is used to begin and
end the execution of the test algorithm and to generate a graph
diagnostic file recording all events during the execution of the AMG.
At the present stage, the source transition time cannot be adjusted to
control the injection rate, and this rate is always equal to a small
write time. Thus, it is not possible to check the entire ATAMM
operating point design procedure on the testbed. However, two
experiments are carried out to show the effect of dummy transitions in
improving TBOLB and the use of control places to reduce resource
requirements. The analyzer is used to determine the performance of
the test algorithm from the graph diagnostic file. Detailed
information about the testbed can be found in [20].

Figure 4.8. The testbed ATAMM data flow architecture.
Five test algorithms are chosen to test the design procedure,
performance model, and transformation techniques on algorithms with a
wide range of structural characteristics. Execution of all five
algorithms was simulated, but only two algorithms were actually
implemented on the testbed, mainly due to the resource limitations and
the inability to control the input data injection interval. The
results are stated and analyzed for each test algorithm execution in
the following discussion.
Test 1. The primary objective of this test is to show the use of a
dummy transition as a buffer in reducing the time/token ratio of a
parallel path circuit. Experimental time performance is also compared
with the theoretical time performance predicted by the performance
model. The test AMG and a transformed test AMG are shown in Figure
4.9(a) and (b) respectively. The purpose of the dummy transition is
to reduce the time/token ratio of the parallel path circuit for the
parallel paths between transitions 1 and 3 in Figure 4.9(a) so that
TBOLB is improved to the time/token ratio of the largest process
circuit. All the transition times are expressed in seconds. Priority
orderings from highest to lowest in the test AMG and transformed test
AMG are (3, 2, 1) and (4, 3, 2, 1) respectively. The dummy
transition is implemented as an active transition of zero process
time. Read and write times of the transitions are assumed to be 220
ms and 255 ms for simulation and theoretical performance evaluation
(these communication times were measured for the testbed in [20] for
two functional units). Lower bounds for TBIO and TBO are calculated
for both the test AMG and the transformed test AMG. It is assumed in
simulations and experiments that no resource is needed to implement a
dummy transition. Both AMG's are executed and simulated for two
functional units, which is the maximum resource requirement to
achieve TBOLB and TBIOLB in either case. Although experimental
and simulated time performance are expected to be TBIOLB and
TBOLB, there can be some differences for the following reasons.
The simulated performance measures are always a little higher than the
theoretical expected performance. This is due to lost clock cycles in
assigning transitions to resources and the fact that even a dummy
transition requires a resource, though only for a small
duration. Experimental time performance values are higher in some
cases than the theoretical expected time performance due to one or
more of the following reasons. First, Ethernet cannot implement more
than one read or write operation at the same time. Second, as the
dummy transition is nonideal, it requires a resource. Third, read and
write times for NMG transitions were measured with no contention,
which does not hold when a number of transitions try to communicate at
the same time. Fourth, there is a slight increase in actual process
times for transitions due to interrupts from other functional units.
Experimental and simulation results for both AMG's are presented in
Figures 4.10 through 4.13 and compared with theoretical performance
lower bounds in Table 4.1. The node activity display shows the
execution of transitions with time in the order of transition numbers,
with transition 1 being the lowest. TBI, TBO, and TBIO of the
input/output display are to be divided by 100 to convert all times
to seconds. From the input/output display there is a significant gain
in TBO by the transformation. Performance varies very little with
task inputs. From the table, it can be seen that TBOLB is improved
from 13.17s to 8.695s by the dummy transition. It can also be seen
that the experimental and simulated performances are very close to the
theoretical lower bounds of performance, except for the TBO of the
transformed test AMG. This is primarily due to the fact that the read
of transition 3 and that of the dummy transition in Figure 4.9(b)
cannot occur at the same time. Also, as there are only two resources
with the priority of transition 1 being the lowest, no new task input
will be accepted until the operation of the dummy transition is
completed. All other results are as expected.

Figure 4.9. (a) AMG for Test 1. (b) Transformed AMG for Test 1.

TABLE 4.1
COMPARISON OF RESULTS FOR TEST 1

                    Experimental     Simulation       Theoretical
                    Results (s)      Results (s)      LB's (s)
Algorithms          Av.     Av.      Av.     Av.
                    TBO     TBIO     TBO     TBIO     TBOLB   TBIOLB
AMG for Test 1      13.13   16.41    13.28   18.53    13.17
Transformed AMG     9.23    18.43    9.1     18.53    8.695
for Test 1
Test 2. This test illustrates the use of a control place to reduce
resource requirements (peak of TRE) while maintaining TBOLB. Also,
theoretical and experimental time performances are compared. The test
AMG and the transformed AMG are shown in Figures 4.14(a) and 4.14(b)
respectively. The test AMG of Figure 4.14(a) requires three resources
to operate at TBIOLB and TBOLB. The AMG is transformed as shown
in Figure 4.14(b), which achieves TBOLB with only two resources at
the expense of increasing TBIOLB (assuming that no resources are
required for the dummy transition). All the transition times are
expressed in seconds. Priority orderings from highest to lowest for
the AMG's of Figures 4.14(a) and 4.14(b) are (4, 2, 3, 1) and
(5, 3, 4, 2, 1) respectively. Read and write times for each NMG
transition were measured in [20] to be 0.275s and 0.31s respectively
for three resources. The test AMG of Figure 4.14(a) and the
transformed AMG of Figure 4.14(b) are run on the testbed and simulated
with three and two resources respectively. Experimental and
simulation results are described in Figures 4.15 through 4.18 and
compared with theoretical lower bounds in Table 4.2. In Figures 4.15
through 4.17, TBI, TBIO, and TBO are divided by 100 to get time in
seconds. The times in the input/output display of Figure 4.18 are
divided by 18.2 to get time in seconds. It can be observed that the
transformed AMG achieves almost the same TBO as the original AMG;
however, TBIO is increased by nearly the time for transition 3 of
Figure 4.14(a) in the experiment and simulation. The differences in
experimental results from theoretical lower bounds for both the AMG's
are primarily due to the nonideal dummy transition and Ethernet
communication problems, as described in Test 1. The difference in the
simulation results from the theoretical expected performance is mainly
due to lost clock cycles in assigning transitions to resources and due
to nonideal dummy transitions. The experimental performance for the
transformed AMG unexpectedly went through a wide variation initially.
One probable reason is the lack of proper injection control, which may
cause the communication software (for implementing Ethernet
communication) to be unpredictable. All other results are as
expected.

Figure 4.14. (a) AMG for Test 2. (b) Transformed AMG for Test 2.
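The gap between measured performance and the theoretical lower
bounds can be quantified directly from the tabulated values. The
sketch below uses the experimental entries for the transformed AMG's
in Tables 4.1 and 4.2 together with the lower bound of 8.695s quoted
for Test 1; the helper name percent_above is an assumption. The
results agree with the percentages quoted in Chapter Five.

```python
def percent_above(measured, bound):
    """Percentage by which a measured time exceeds its lower bound."""
    return 100.0 * (measured - bound) / bound


# Transformed AMG for Test 2 (Table 4.2), times in seconds.
print(round(percent_above(5.16, 4.70), 2))   # experimental TBO:  9.79
print(round(percent_above(9.81, 9.40), 2))   # experimental TBIO: 4.36

# Transformed AMG for Test 1: experimental TBO against TBOLB.
print(round(percent_above(9.23, 8.695), 2))  # 6.15
```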
TABLE 4.2
COMPARISON OF RESULTS FOR TEST 2

                    Experimental     Simulation       Theoretical
                    Results (s)      Results (s)      LB's (s)
Algorithms          Av.     Av.      Av.     Av.
                    TBO     TBIO     TBO     TBIO     TBOLB   TBIOLB
AMG for Test 2      5.00    8.25     4.98    8.36     4.86    8.255
Transformed AMG     5.16    9.81     5.13    9.56     4.70    9.4
for Test 2

Test 3. This is a simulation of the execution of a test algorithm
shown in Figure 4.19(a) to check the ATAMM operating point design
procedure. Let T = 1000 time units. The read and write times of the
NMG transitions are assumed to be zero. Then TBIOLB = 4T, TTLB =
5T, and TBOLB = 3T. No further improvement of TBOLB is possible
as it is determined by the time/token ratio of the recursion circuit.
Hence, TBIOALB = 4T, TTALB = 5T, and TBOALB = 3T. SRE is shown
in Figure 4.19(b). By checking all TBO ≥ TBOALB, Rmax = 3
and Rmin = 2. Also TC = 8T, and TCE = 8T units of computer time. As
⌈TCE/TBOALB⌉ = 3, Rmax cannot be improved any further
and Strategies B and C cannot be applied. So if R ≥ 3, the ATAMM
operating point is determined by Step 3 as TBI = 3T, TBIO = 4T, TT =
5T, and TBO = 3T for all task inputs. As there are no floating
transitions, Application 4 is not required. For R = 2, Strategy A of
Step 4 in the ATAMM operating point design determines TBI = 4T, TBIO =
4T, TT = 5T, and TBO = 4T for all task inputs. The AMG execution at
the operating points determined by Steps 3 and 4 is simulated and the
results are described in Figures 4.20 and 4.21 respectively. The
achieved time performance in simulation is very close to the predicted
theoretical time performance of the ATAMM operating point design. In
the simulation of the operating point given by Step 3, TBI = 3.02T is
used instead of 3T because TBOALB is slightly higher in the
simulation due to lost clock cycles.

Figure 4.19. For Test 3, (a) AMG. (b) SRE.
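The search for TBOop in Strategy A can be sketched by overlapping
copies of the SRE shifted by multiples of TBI and reading off the
peak of the resulting TRE. The code below is an illustration under
stated assumptions: time is measured in whole units of T, the function
names are hypothetical, and the SRE for Test 3 is taken to be 1, 2, 2,
2, 1 busy resources over [0, 5T). That shape is inferred from the
axis labels of Figure 4.19(b) and from TCE = 8T; it is not given
numerically in the text.

```python
def tre_peak(sre, tbo):
    """Peak of the total resource envelope (TRE) for period tbo.

    sre[k] is the number of busy resources during [k, k+1) after a
    task input is injected; successive inputs are tbo units apart.
    """
    folded = [0] * tbo
    for k, busy in enumerate(sre):
        folded[k % tbo] += busy   # steady-state overlap of shifted SRE's
    return max(folded)


def find_tbo_op(sre, r, tbo_alb):
    """Smallest TBI >= tbo_alb whose TRE peak stays <= r from there on."""
    horizon = len(sre)            # SRE's no longer overlap beyond this
    ok = [tre_peak(sre, t) <= r for t in range(tbo_alb, horizon + 1)]
    for i in range(len(ok)):
        if all(ok[i:]):
            return tbo_alb + i
    return None                   # r is below the peak of a single SRE


# Assumed SRE for Test 3, in units of T.
SRE = [1, 2, 2, 2, 1]
print(sum(SRE))                                 # TCE in units of T
print([tre_peak(SRE, t) for t in range(3, 6)])  # peaks for TBI = 3T..5T
print(find_tbo_op(SRE, r=3, tbo_alb=3))
print(find_tbo_op(SRE, r=2, tbo_alb=3))
```

Under the assumed SRE the peaks over TBI = 3T, 4T, 5T are 3, 2, 2,
which is consistent with Rmax = 3 and Rmin = 2, and the search returns
3 for R = 3 and 4 for R = 2, matching the operating points of Steps 3
and 4.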
Test 4. The algorithm of Test 4 is a subsystem of a Space
Surveillance System and is described in Figure 4.22(a) (ignore the
dotted line). Let T = 100 time units. The read and write times of
NMG transitions are assumed to be zero. Then, TBIOLB = TTLB =
TBIOALB = TTALB = 18T and TBOLB = TBOALB = 10T. SRE is shown
in Figure 4.22(b). By checking all TBI ≥ TBOALB, Rmax = 4
and Rmin = 3. Now TCE = 25T units of computer time. As
⌈TCE/TBOALB⌉ = 3, it may be possible to lower Rmax to 3.
A control place is placed from transition 5 to 3 for that purpose, as
shown by the dotted line in Figure 4.22(a). The new SRE is shown in
Figure 4.23(a). It was checked by the Ttime program that TBIOLB,
TTLB, and TBOLB were unchanged by the control place. By checking
all TBI ≥ 10T, Rmax = 3 and Rmin = 2. Hence, Strategies B and C
of the ATAMM operating point design are not appropriate as Rmax will
always be equal to or more than 3. For R ≥ 3, Step 3 of the ATAMM
operating point design determines TBI = 10T and TBIO = TT = 18T for
all task inputs. For R = 2, Strategy A of the ATAMM operating point
design determines TBI = 17T, TBO = 17T, and TBIO = TT = 18T. The
graph play for a single task and the total graph play for TBO = 10T
are shown in Figures 4.23(b) and 4.24 respectively. By inspection of
TGP, no dummy transition is required to enforce SGP and SRE. The
executions at the operating points determined by Steps 3 and 4 are
simulated and the results are described in Figures 4.25 and 4.26
respectively. The achieved time performance in simulation is very
close to the predicted time performance of the ATAMM operating point
design.

Figure 4.22. (a) AMG for Test 4. (b) SRE for the AMG of Test 4.
Figure 4.23. For the transformed AMG, (a) SRE. (b) SGP.

Figure 4.24. TGP of the transformed AMG for TBO = 10T.

Test 5. Execution of the algorithm marked graph in Figure 3.3 is
simulated for all the operating points developed in Section 4.2. All
the process times for the transitions of the AMG are multiplied by T
(T = 1000 time units) in the simulation. The read and write times of
the NMG transitions are assumed to be zero. The results of the
simulation for the operating points of Steps 3 through 6 are described
in Figures 4.27 through 4.30 respectively. It is to be noted that the
TBI's used in the simulation for the operating points in Steps 4
through 6 are slightly higher than the value predicted in the ATAMM
operating point design. The reason is, again, a slight increase in
the transition times of the AMG in the simulation due to the time
needed to assign transitions to resources.
4.4 Summary
A new term, the ATAMM operating point (AOP), is defined to
express all the parameters of an algorithm execution in the ATAMM data
flow architecture. The characteristics of an AOP are explored for
finite resources and under specified transformations. The absolute
lower bounds for performance measures are defined. TBIOALB,
TTALB, and TBOALB are determined under transformations by control
places and dummy transitions. A procedure is developed for operating
point design given the number of functional units. The performance
model and the use of dummy transitions and control places for
improving time performance and resource requirements are illustrated
through experiments and simulations. The ATAMM operating point design
methodology is checked by simulations on test algorithms.
CHAPTER FIVE
CONCLUSION
Performance modeling and enhancement for concurrent processing in
the ATAMM data flow architecture have been the primary thrust of this
research. Several key results are achieved in that respect. First, a
performance model is developed to determine performance of an
algorithm executed periodically in the ATAMM data flow architecture.
Second, algorithm transformation techniques are identified and their
applications are illustrated in improving time performance and
resource (computing element) requirements. Third, an ATAMM operating
point design procedure is developed to specify time performance and
input data injection control for periodic execution of an algorithm on
an ATAMM data flow architecture. Significant results in these three
areas have been discussed. Finally, future research topics are
suggested.
The starting point of this research has been to define the
computing environment and performance measures for the periodic
execution of algorithms in the ATAMM data flow architecture. The
architecture is assumed to have R identical computers, or functional
units, and executes algorithms according to the rules of ATAMM. These
computers, or functional units, are also denoted by the terms resource
and computing element. The performance of an algorithm is measured by
the time between input and output (TBIO), task time (TT), and time
between outputs (TBO). Graph theoretic and resource imposed bounds
are developed for these performance measures. Also, the graph
execution pattern and resource requirements are defined through SGP,
SRE, TGP, and TRE. These results establish a new model for evaluating
performance of algorithms in a hardware independent context as long as
the architecture obeys the rules of ATAMM. Hence, it is now possible
to compare the relative merits of different algorithm decompositions
with respect to performance and resource requirements for the ATAMM
data flow architecture.
The performance model enables the user to identify the cause of
performance limitations. It is observed that the critical circuits of
the CMG and the critical paths of the MAMG are the determining factors
for the graph theoretic lower bounds of time performance. Also, the
total resource requirement (the peak value of TRE) is determined by
the shape of the resource envelope (SRE) and TBO. Hence, it may be
possible to enhance performance or reduce resource requirements by
transforming the algorithm marked graph while maintaining its
equivalency. Algorithm transformation techniques are identified which
can be used to improve time performance or aid resource envelope
modification. Transformation of an AMG may, or may not, involve
decomposition of transitions. This research has concentrated on two
of the transformation techniques, namely dummy transitions and control
places. Concentration on these techniques is due to their wide range
of applications, ease of implementation, and negligible increase in
communication time by transformation. The most important contribution
of this research is the application of dummy transitions, which
provide storage space for the output of transitions. Dummy transitions
have made parallel path circuits in the CMG insignificant for
determining TBOLB. Thus, it is now possible to use control places and
dummy transitions together to change the SRE without increasing TBOLB.
Dummy transitions can improve TBOLB by reducing the time/token ratio
of dominant parallel path circuits. Another application of dummy
transitions is to enforce the SRE as the resource envelope for all
task inputs. Hence, it is now possible to enhance the throughput of
an algorithm execution in the ATAMM data flow architecture. Also, the
algorithm marked graph can be transformed according to the resource
capability of the architecture or to make the resource need for
periodic operation predictable.
The ATAMM operating point (AOP) design procedure uses the
knowledge of the performance model and algorithm transformation to
specify an operating point for executing an algorithm in the ATAMM
data flow architecture. The only transformations used for the AOP
design are dummy transitions as buffers and control places. The AOP
design describes the procedure to achieve the absolute lower bound of
time performance under these transformations. It proposes three
strategies corresponding to sacrificing pipeline concurrency, parallel
concurrency, and a combination of both to meet the limited
availability of resources. Pipeline and parallel concurrency can be
reduced by reducing the input data injection rate or by transforming
the AMG to modify the shape of SRE respectively. Although the design
procedure is partially heuristic because of the NP completeness of the
problem, it allows the user to make a trade-off between pipeline and
parallel concurrency for limited availability of resources.
Test algorithms are simulated by a PC-based simulator [21] to
validate the ATAMM operating point design procedure. The read and
write times of transitions are assumed to be zero. Process times of
transitions are on the order of hundreds of clock cycles to keep the
algorithms at a large-grain level. This order of transition times is
appropriate as the simulator takes less than ten clock cycles for
assigning transitions to resources. Dummy transitions and control
places are realized as regular active transitions (of zero process
time) and active places respectively. It is assumed that a dummy
transition does not require a resource. Simulated performance of
algorithms is always very close to that predicted by the AOP design
(within 2.1% for TBIO and within 5.8% for TBI and TBO). One
significant observation is that the proper input data injection
interval in the simulation is slightly higher than that predicted by
the AOP design (within 5.8%). These differences between theoretical
and simulated results are mainly due to a slight increase in
transition times by the unaccounted clock cycles in assigning
transitions to resources.
Test algorithms are executed on a testbed ATAMM data flow
architecture [20] to verify the performance model and the use of dummy
transitions and control places for transformation of algorithms.
Dummy transitions and control places are implemented as active
transitions of zero process time and active places respectively. Read
and write times for the transitions in the experiments are assumed to
be those measured in [20]. The largest process time among the
transitions of the test algorithm is kept at least ten times higher
than read or write times to maintain algorithms at the large-grain
level. The performance model is verified as experimental time
performances are close to theoretical time performances (within 4.4%
for TBIO and within 9.8% for TBO). The use of dummy transitions for
making parallel path circuits insignificant is verified in Test 1.
The TBO of the transformed AMG in Test 1 is determined by the
time/token ratio of the largest process circuit (experimental TBO is
6.15% more). A control place and a dummy transition together in Test
2 have reduced the total resource requirement from 3 to 2 while
maintaining the change in TBO within 3%. The larger difference
between the experimental and theoretical results compared to the
simulation can be attributed mainly to two reasons. First,
implementing a dummy transition as an active transition has a much
greater effect in the testbed. The dummy transition requires read and
write times in the experiments and hence requires a resource for a
considerable amount of time, contrary to the assumption. Second, as
pointed out in [20], Ethernet cannot implement concurrent read or
write operations. This fact is not taken into account in the
measurement of read and write times. The experimental results suggest
that a better method of implementing a dummy transition and a more
accurate communication model for read and write times are necessary.
There are several topics that can be the subject of future
research. On the theoretical side, the following problems need
attention. In order to properly decompose an algorithm, a specific
definition of large granularity is needed corresponding to the
communication time of an ATAMM data flow architecture. The first step
is to develop a general and more accurate model for read and write
times. The use of dummy transitions of finite time, control places
with initial tokens, and predefined tokens in performance improvement
and reduction of resource requirements needs to be explored.
Experiments and simulations have shown that the proper input data
injection interval is slightly higher than the predicted value. This
observation and the possibility of slight variation in transition
times suggest that automatic injection control may be necessary.
Execution of multiple AMG's or AMG's with multiple input and output
transitions provides a complex, but interesting, topic of future
research. Finally, the performance of algorithms with conditional
data flow needs to be analyzed. On the implementation side, realizing
dummy transitions as buffers in the functional unit or graph manager,
a better technique for measuring communication times, a fully
automated ATAMM operating point design procedure, and transformations
of algorithms by dummy transitions and control places in real time are
useful topics for future research.
LIST OF REFERENCES
1.  J. W. Stoughton and R. R. Mielke, "Petri-Net Model for Concurrent
    Processing of Complex Algorithms," Proceedings of Government
    Microcircuit Applications Conference, San Diego, CA, November
    1986.
2.  R. R. Mielke, John W. Stoughton, and Sukhamoy Som, "Modeling and
    Performance Bounds for Concurrent Processing," Proceedings of the
    8th International Conference on Distributed Computing Systems,
    San Jose, CA, June 1988.
3.  J. Tiberghien, New Computer Architectures, Academic Press,
    London, 1984.
4.  C. Petri, "Kommunikation mit Automaten," Ph.D. Dissertation,
    University of Bonn, Bonn, West Germany, 1962.
5.  A. Holt and F. Commoner, Events and Conditions, Applied Data
    Research, NY, 1970.
6.  J. L. Peterson, Petri Net Theory and the Modeling of Systems,
    Englewood Cliffs, NJ, Prentice Hall, 1981.
7.  Tadao Murata, "Synthesis of Decision-Free Concurrent Systems
    for Prescribed Resources and Performance," IEEE Transactions on
    Software Engineering, pp. 525-530, November 1980.
8.  T. Murata and J. Koh, "Reduction and Expansion of Live and Safe
    Marked Graphs," IEEE Transactions on Circuits and Systems, vol.
    CAS-27, pp. 68-70, January 1980.
9.  T. Agerwala and Arvind, "Data Flow Systems," Computer, pp. 10-13,
    February 1982.
10. Tadao Murata, "Relevance of Network Theory to Models of
    Distributed/Parallel Processing," Journal of Franklin Institute,
    pp. 41-49, 1980.
11. R. R. Mielke, John W. Stoughton, and Sukhamoy Som, "Modeling and
    Optimum Time Performance for Concurrent Processing," NASA
    Contractor Report, Grant NAG1-683, August 1988.
12. M. Granski, I. Koren, and G. Silberman, "The Effect of Operation
    Scheduling on the Performance of a Data Flow Computer," IEEE
    Transactions on Computers, vol. 36, pp. 1019-1029, September
    1987.
13. Tadao Murata, "Circuit Theoretic Analysis and Synthesis of Marked
    Graphs," IEEE Transactions on Circuits and Systems, vol. 24, pp.
    400-405, July 1977.
14. E. G. Coffman, Computer and Job-Shop Scheduling Theory, pp.
    190-194, John Wiley & Sons, NY, 1976.
15. E. G. Coffman, Jr. and P. J. Denning, Operating System Theory,
    Prentice-Hall, Inc., NJ, 1973.
16. K. G. Lockyer, An Introduction to Critical Path Analysis,
    Pitman Publishing Limited, London, 1969.
17. J. J. Moder and C. R. Philips, Project Management with CPM and
    PERT, pp. 63-83, Van Nostrand Reinhold, NY, 1964.
18. T. Murata, "Modeling and Analysis of Concurrent Systems,"
    Handbook of Software Engineering, C. Vick and C. Ramamoorthy
    Editors, pp. 39-63, Van Nostrand Reinhold, 1984.
19. Dennis B. Gannon and John Van Rosendale, "On the Impact of
    Communication Complexity on Design of Parallel Numerical
    Algorithms," IEEE Transactions on Computers, vol. 33, pp.
    1180-1191, December 1984.
20. W. R. Tymchyshyn, "ATAMM Multicomputer System Design," Master's
    Thesis, Old Dominion University, Norfolk, VA, August 1988.
21. R. Obando, "Software Tools for Performance Evaluation of
    Concurrent Processing," Master's Thesis, Old Dominion
    University, Norfolk, VA, August 1987.
22. R. Agrawal and H. V. Jagadish, "Partitioning Techniques for
    Large-Grained Parallelism," IEEE Transactions on Computers,
    vol. 37, pp. 1627-1634, December 1988.
23. S. H. Bokhari, "Partitioning Problems in Parallel, Pipelined, and
    Distributed Computing," IEEE Transactions on Computers, vol. 37,
    pp. 48-57, January 1988.
24. Z. Cvetanovic, "The Effect of Problem Partitioning, Allocation,
    and Granularity on the Performance of Multiple-Processor
    Systems," IEEE Transactions on Computers, vol. 36, pp. 421-432,
    April 1987.
25. R. Johnsonbaugh and T. Murata, "Additional Methods for Reduction
    and Expansion of Marked Graphs," IEEE Transactions on Circuits
    and Systems, vol. CAS-28, pp. 1009-1014, October 1981.
26. Dan I. Moldovan and Jose A. B. Fortes, "Partitioning and Mapping
    Algorithms into Fixed Size Systolic Arrays," IEEE Transactions on
    Computers, vol. C-35, pp. 1-12, January 1986.
27. C. V. Ramamoorthy and Gary S. Ho, "Performance Evaluation of
    Asynchronous Concurrent Systems Using Petri Nets," IEEE
    Transactions on Software Engineering, vol. 6, pp. 440-449,
    September 1980.
28. S. Seshu and M. Reed, Linear Graphs and Electrical Networks,
    Addison-Wesley Publishing Co., Inc., 1961.
29. M. Sowa and T. Murata, "A Data Flow Computer Architecture with
    Program and Token Memories," IEEE Transactions on Computers, pp.
    940-948, November 1986.
30. V. Srini, "An Architectural Comparison of Dataflow Systems,"
    Computer, pp. 68-88, March 1986.
31. J. W. Stoughton and R. R. Mielke, "Petri Net Model for Analysis
    of Concurrently Processed Complex Algorithms," Proceedings of
    Southeastcon Conference, March 1986.
32. J. W. Stoughton and R. R. Mielke, "Strategies for Concurrent
    Processing of Complex Algorithms," Proceedings of Workshop on
    Future Directions in Computer Architecture and Software, Army
    Research Office, May 1986.
33. M. N. S. Swamy and K. Thulasiraman, Graphs, Networks, and
    Algorithms, John Wiley & Sons Publication, NY, 1981.
34. D. F. Vrsalovic, D. P. Siewiorek, Z. Z. Segall, and E. F.
    Gehringer, "Performance Prediction and Calibration for a Class of
    Multiprocessors," IEEE Transactions on Computers, vol. 37, pp.
    1353-1365, November 1988.
35. P. M. Kogge, The Architecture of Pipelined Computers, Advanced
    Computer Science Series, McGraw-Hill, NY, 1981.
36. H. Tokuda, C. W. Mercer, Y. Ishikawa, T. E. Marchok, "Priority
    Inversions in Real-Time Communication," Proceedings of the
    Real-Time Systems Symposium, Santa Monica, California, December 5-7,
    1989.
APPENDIX
This appendix is an excerpt from [11]. The ATAMM model is
studied analytically to determine important graph operating
characteristics. First, a state description is developed which expresses
the next graph marking as a function of the present marking and a vector
indicating which transition is to be fired. Then the
marked graph properties of reachability, liveness, and safeness are
considered for the computational marked graph (CMG). Two excellent papers
by Murata [13, 18] on properties of marked graphs are the sources for much
of the material presented in this appendix.
Let G be a marked graph consisting of m places and n
transitions. The m-vector M_k denotes the marking vector for G
resulting from the firing of some sequence of k transitions. The
following two definitions are necessary to develop the state
description of the CMG.
Definition A.1: Complete Incidence Matrix. The complete incidence
matrix for a marked graph G is the (n x m) matrix A = [a_ij] having
rows corresponding to transitions and columns corresponding to places
and where
\[
a_{ij} =
\begin{cases}
+1\ (-1) & \text{if place } j \text{ is incident at transition } i \text{ and directed out of (into) the transition,} \\
0 & \text{if place } j \text{ is not incident at transition } i.
\end{cases}
\]
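As an illustration of Definition A.1, the following minimal sketch builds the complete incidence matrix from a list of places. The representation of a place as a (source transition, destination transition) pair, the function name, and the three-transition ring used as data are assumptions made for this example only.

```python
def incidence_matrix(n_transitions, places):
    """Build the (n x m) complete incidence matrix A = [a_ij].

    places[j] = (src, dst) means place j is directed out of transition src
    and into transition dst (a hypothetical encoding for this sketch).
    """
    A = [[0] * len(places) for _ in range(n_transitions)]
    for j, (src, dst) in enumerate(places):
        A[src][j] += 1  # place j is directed out of transition src
        A[dst][j] -= 1  # place j is directed into transition dst
    return A


# Hypothetical ring: p0: t0 -> t1, p1: t1 -> t2, p2: t2 -> t0.
RING = [(0, 1), (1, 2), (2, 0)]
A = incidence_matrix(3, RING)
```

Each column of A then holds exactly one +1 and one -1, reflecting the fact that every place of a marked graph has one input transition and one output transition.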
Definition A.2: Elementary Firing Vector. An elementary firing
vector u_k is an n-vector having all zero entries except for the
ith component, which is 1, denoting that transition i is the kth
transition to fire in some transition firing sequence.
To gain insight into the state equation description, it is helpful
to consider the firing of transition k. If a_ki = -1 (+1), place i
is an input (output) place to transition k. Therefore, transition k
is enabled if M(i) = 1 for each input place. When transition k fires,
one token is removed from each input place and one token is added to
each output place. These observations lead to the following next
state description for a marked graph.
Property A.1: Next State Description. For a marked graph G with
present marking vector M_{k-1} and elementary firing vector u_k, the
next marking vector is given by
\[
M_k = M_{k-1} + A^T u_k .
\]
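The next state description can be exercised on a small example. The sketch below assumes a hypothetical three-transition ring whose incidence matrix is written out by hand; the function names are illustrative only. Enabling is tested as "at least one token in every input place", which coincides with M(i) = 1 when the graph is safe.

```python
# Rows are transitions, columns are places (Definition A.1), for a
# hypothetical ring: p0: t0 -> t1, p1: t1 -> t2, p2: t2 -> t0.
A = [
    [1, 0, -1],
    [-1, 1, 0],
    [0, -1, 1],
]


def is_enabled(A, marking, i):
    """Transition i is enabled when every input place (a_ij = -1) holds a token."""
    return all(marking[j] >= 1 for j, a in enumerate(A[i]) if a == -1)


def fire(A, marking, i):
    """Return M_k = M_{k-1} + A^T u_k, where u_k selects transition i."""
    if not is_enabled(A, marking, i):
        raise ValueError(f"transition {i} is not enabled in marking {marking}")
    # A^T u_k is row i of A written as a column, so add it place by place.
    return [m + a for m, a in zip(marking, A[i])]


M0 = [0, 0, 1]       # one token in p2, so only t0 is enabled
M1 = fire(A, M0, 0)  # the token moves from p2 to p0
M2 = fire(A, M1, 1)  # the token moves from p0 to p1
```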
The next state description can be used to express the graph
marking resulting from the application of sequences of elementary
firing vectors. This is done in the next definition and property.
Definition A.3: Firing Count Vector. Let (u_1, u_2, ..., u_d) be a
sequence of elementary firing vectors taking a marked graph G from an
initial marking M_0 to a destination marking M_d. The firing count
vector x_d for this firing sequence is defined by
\[
x_d = \sum_{k=1}^{d} u_k .
\]
Property A.2: State Equation Description. For a marked graph G with
initial marking vector M_0, the marking vector resulting from the
application of an elementary firing vector sequence
(u_1, u_2, ..., u_d) is given by
\[
M_d = M_0 + A^T x_d .
\]
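A short sketch of the state equation follows. It assumes the same kind of hypothetical three-transition ring as a test case, with firing sequences given as lists of transition indices; the function names are illustrative. Note that the state equation by itself does not check whether each transition in the sequence was enabled when it fired.

```python
# Hypothetical ring: p0: t0 -> t1, p1: t1 -> t2, p2: t2 -> t0.
A = [
    [1, 0, -1],
    [-1, 1, 0],
    [0, -1, 1],
]


def firing_count_vector(n_transitions, sequence):
    """x_d: entry i counts how many times transition i appears in the sequence."""
    x = [0] * n_transitions
    for i in sequence:
        x[i] += 1
    return x


def apply_state_equation(A, m0, x):
    """Return M_d = M_0 + A^T x_d."""
    n = len(A)
    return [m0[j] + sum(A[i][j] * x[i] for i in range(n)) for j in range(len(m0))]


M0 = [0, 0, 1]
x_partial = firing_count_vector(3, [0, 1])    # t0 then t1
M_partial = apply_state_equation(A, M0, x_partial)
x_cycle = firing_count_vector(3, [0, 1, 2])   # every transition once
M_cycle = apply_state_equation(A, M0, x_cycle)
```

Firing every transition of the ring once returns the graph to its initial marking, which anticipates the consistency property discussed later in this appendix.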
Using the state description of a marked graph as a basis, the
property of reachability is investigated. Necessary and sufficient
conditions for a CMG marking vector to be reachable from an initial
marking are established, and it is shown that the number of tokens
contained in any directed circuit of the CMG is invariant under
transition firings.
Definition A.4: Reachability. A marking M_d is reachable from an
initial marking M_0 if there exists a sequence of elementary firing
vectors that transforms M_0 to M_d.
The following definition is required to state the reachability
conditions for a CMG.
Definition A.5: Fundamental Circuit Matrix. Let T be a tree of a
connected marked graph G. The set of (m-n+1) circuits, each uniquely
formed by appending one cotree edge to the tree, is called the set of
fundamental circuits of G for tree T [28]. The fundamental circuit
matrix for G for tree T is the (m-n+1) x (m) matrix B_f = [b_ij]
having rows corresponding to fundamental circuits and columns
corresponding to places, and where b_ij is determined by the
following rules:
\[
b_{ij} =
\begin{cases}
+1\ (-1) & \text{if place } j \text{ is contained in f-circuit } i \text{ and the place and circuit directions agree (disagree),} \\
0 & \text{if place } j \text{ is not contained in f-circuit } i.
\end{cases}
\]
Property A.3: Reachability in the CMG. In a computational marked
graph G, a marking M_d is reachable from an initial marking M_0 if
and only if B_f M_d = B_f M_0, where B_f is a fundamental circuit
matrix for G.
Proof. It is shown in [13] (Theorem 3) that the property is true for
marked graphs containing no token-free directed circuits. By the
construction rules for the CMG, directed circuits occur in exactly
four ways. First, each NMG consists of a directed circuit which
contains an initial marking token in the Process Ready place. Second,
a directed circuit is formed each time an NMG is linked to another
NMG. Since one of the two linking places contains an initial marking
token and both places are contained in the circuit, this circuit is
never token free. Third, directed circuits exist in the CMG
corresponding to interconnected feedforward paths in the algorithm
marked graph. Each such circuit contains one or more backward
directed control edges containing one initial marking token. Fourth,
directed circuits exist in the CMG corresponding to directed circuits
in the algorithm marked graph. Each such circuit contains exactly one
forward directed edge containing one initial marking token which
represents initial condition data. Therefore, the CMG contains no
token-free directed circuits and the property follows.
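The reachability test of Property A.3 reduces to comparing two small vectors. The sketch below assumes a hypothetical two-transition marked graph (not a full CMG) whose fundamental circuit matrix is written out by hand for the tree T = {p0}; it also assumes that candidate markings are non-negative integer vectors and that the initial marking leaves no directed circuit token free.

```python
# Hypothetical graph: p0: t0 -> t1, p1: t1 -> t0, p2: t1 -> t0.
# With tree T = {p0}, the cotree edges p1 and p2 give m - n + 1 = 2
# fundamental circuits: {p0, p1} and {p0, p2}, all directions agreeing.
B_F = [
    [1, 1, 0],
    [1, 0, 1],
]


def circuit_token_counts(B, marking):
    """Return the vector B M, one entry per fundamental circuit."""
    return [sum(b * m for b, m in zip(row, marking)) for row in B]


def satisfies_reachability_condition(B, m0, md):
    """Check B_f M_d = B_f M_0 (Property A.3)."""
    return circuit_token_counts(B, m0) == circuit_token_counts(B, md)


M0 = [0, 1, 1]             # tokens in both backward places, t0 enabled
AFTER_T0 = [1, 0, 0]       # marking obtained by firing t0 once
NOT_REACHABLE = [1, 1, 0]  # would put two tokens in circuit {p0, p1}
```

Here `satisfies_reachability_condition(B_F, M0, AFTER_T0)` holds, while the check fails for `NOT_REACHABLE` because the first circuit would gain a token.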
As a direct consequence of the reachability property of the CMG,
it can be shown that the number of tokens in any directed circuit is
constant. This characteristic is stated as Property A.4.
Property A.4: Token Count Invariance. In a CMG, the number of tokens
contained in a directed circuit is invariant under transition firing.
Proof. Consider a directed circuit C of a CMG. The entries in the
row of a circuit matrix B corresponding to C are +1 in columns
representing edges in C and are 0 otherwise. If M is a marking
vector, the component of BM corresponding to C is equal to the number
of tokens in directed circuit C under marking M. Therefore, if M_d is any
marking reachable from an initial marking M_0, it follows from
Property A.3 that BM_d = BM_0. That is, the number of tokens in
directed circuit C under initial marking M_0 is equal to the number
of tokens under any marking M_d reachable from M_0. This completes
the proof.
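Token count invariance can be observed directly by firing transitions and recounting the tokens in each directed circuit. The sketch below assumes a hypothetical marked graph with three transitions and four places, encoded as (source transition, destination transition) pairs; the circuit names and the firing sequence are illustrative only.

```python
# Hypothetical graph: p0: t0 -> t1, p1: t1 -> t2, p2: t2 -> t0, p3: t1 -> t0.
PLACES = [(0, 1), (1, 2), (2, 0), (1, 0)]
# Directed circuits listed by the indices of the places they contain.
CIRCUITS = {"ring": [0, 1, 2], "feedback": [0, 3]}


def fire(places, marking, t):
    """Remove one token from each input place of t and add one to each output place."""
    new = list(marking)
    for j, (src, dst) in enumerate(places):
        if dst == t:
            if marking[j] < 1:
                raise ValueError(f"transition {t} is not enabled")
            new[j] -= 1
        if src == t:
            new[j] += 1
    return new


def circuit_counts(circuits, marking):
    """Number of tokens in each directed circuit under the given marking."""
    return {name: sum(marking[j] for j in idx) for name, idx in circuits.items()}


def counts_along(places, circuits, marking, sequence):
    """Record the circuit token counts before and after every firing."""
    history = [circuit_counts(circuits, marking)]
    for t in sequence:
        marking = fire(places, marking, t)
        history.append(circuit_counts(circuits, marking))
    return history


M0 = [0, 0, 1, 1]
HISTORY = counts_along(PLACES, CIRCUITS, M0, [0, 1, 2, 0, 1])
```

Every entry of `HISTORY` is the same dictionary, so both circuits keep exactly one token throughout the firing sequence.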
Next, liveness and a closely related property called consistency
are considered. It is shown that the CMG is live and consistent.
Definition A.6: Liveness. A marked graph G is said to be live for a
marking M if, for all markings reachable from M, it is possible to
fire any transition of G by progressing through some transition firing
sequence.
Property A.5: Liveness in the CMG. The computational marked graph is
live for all appropriate initial marking vectors.
Proof. It is shown in [18] (Property 2) that a marked graph G is live
for a marking M if, and only if, G contains no token-free directed
circuits in marking M. As stated in the proof of Property A.3, for
all appropriate initial markings M_0, the CMG contains no token-free
directed circuits. Therefore, the property follows.
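The liveness criterion used in this proof (no token-free directed circuits) can be checked mechanically. One way, sketched below under the assumption that places are given as (source transition, destination transition) pairs, is to keep only the token-free places as edges between transitions and test whether that subgraph is acyclic. The cycle test used here is Kahn's topological-sort algorithm, which is a choice made for this sketch and not a procedure taken from the report.

```python
def is_live(n_transitions, places, marking):
    """Return True if the marking leaves no token-free directed circuit."""
    successors = {i: [] for i in range(n_transitions)}
    indegree = [0] * n_transitions
    for (src, dst), tokens in zip(places, marking):
        if tokens == 0:  # only token-free places can form a token-free circuit
            successors[src].append(dst)
            indegree[dst] += 1
    # Kahn's algorithm: repeatedly remove transitions with no token-free input.
    ready = [i for i in range(n_transitions) if indegree[i] == 0]
    removed = 0
    while ready:
        i = ready.pop()
        removed += 1
        for j in successors[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                ready.append(j)
    # Transitions that could not be removed lie on a token-free circuit.
    return removed == n_transitions


# Hypothetical ring: p0: t0 -> t1, p1: t1 -> t2, p2: t2 -> t0.
RING = [(0, 1), (1, 2), (2, 0)]
LIVE = is_live(3, RING, [0, 0, 1])      # the single circuit holds a token
DEADLOCK = is_live(3, RING, [0, 0, 0])  # the circuit is token free
```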
Definition A.7: Consistency. A marked graph G is said to be
consistent if there exists a marking M and a transition firing
sequence S from M back to M such that every transition occurs at least
once in S.
Property A.6: Consistency in the CMG. A connected computational
marked graph G is consistent. In addition, each transition of G
occurs an equal number of times in a firing sequence from a marking M
back to M.
Proof. From Property A.2, if a CMG is consistent then there exists a
marking M_d = M_0 and a firing count vector x_d > 0 such that
A^T x_d = 0. The converse is also true. The incidence matrix for a
marked graph G is an (n x m) matrix A. If G is connected, then it is
known [28] that the rank of A is n-1, and thus the null space of A^T
has dimension one. It is observed that each row of A^T has one (+1),
one (-1), and all remaining terms are zero (0). Therefore, if C_j
denotes the jth column of A^T, it follows that
\[
\sum_{j=1}^{n} C_j = 0 .
\]
Thus, there exists a vector x_d = [k k ... k]^T, k > 0, which
uniquely satisfies A^T x_d = 0. This completes the proof.
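The argument of this proof can be checked numerically. The sketch below assumes a hypothetical connected marked graph with three transitions and four places, with its incidence matrix written out by hand, and verifies that a firing count vector with equal entries satisfies A^T x_d = 0 while an unequal one does not.

```python
# Incidence matrix (rows: transitions, columns: places) of a hypothetical
# graph with p0: t0 -> t1, p1: t1 -> t2, p2: t2 -> t0, p3: t1 -> t0.
A = [
    [1, 0, -1, -1],
    [-1, 1, 0, 1],
    [0, -1, 1, 0],
]


def transpose_times(A, x):
    """Return the m-vector A^T x."""
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]


def returns_to_same_marking(A, x):
    """True when A^T x = 0, i.e. the firing counts x leave the marking unchanged."""
    return all(v == 0 for v in transpose_times(A, x))


EQUAL_COUNTS = returns_to_same_marking(A, [2, 2, 2])    # x_d = [k k k]^T, k = 2
UNEQUAL_COUNTS = returns_to_same_marking(A, [1, 2, 1])  # t1 fired once too often
```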
The final graph property considered in this section is safeness.
This property is first defined and then it is shown that a CMG is
safe.
Definition A.8: Safeness. A marked graph G is said to be safe for
marking M if, for all markings reachable from M, no place contains
more than one token.
Property A.7: Safeness in the CMG. The computational marked graph is
safe for all appropriate initial marking vectors.
Proof. By Property A.4, the token count for each directed circuit of
the CMG is invariant under transition firing. Therefore, it is
sufficient to show that each edge of the CMG belongs to at least one
directed circuit containing a single token. By the construction rules
for the CMG, all CMG edges can be classified into two groups: NMG edges
and linking edges. NMG edges occur in groups of three and always form
a directed circuit containing one token. Linking edges occur in
pairs, one forward directed and one backward directed, and also form a
directed circuit with the forward directed edges of the NMG. One of
the linking edges, but not both, always contains one token while the
forward directed edges of the NMG contain no tokens. Therefore, each
edge of the CMG is contained in a directed circuit with one token, and
the property follows.
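For a small graph, safeness can also be confirmed by brute force: enumerate every marking reachable from the initial marking and check that no place ever holds more than one token. The sketch below does this with a breadth-first search over a hypothetical marked graph (not a full CMG). The encoding of places as (source transition, destination transition) pairs, the function names, and the search limit are assumptions made for this example.

```python
from collections import deque

# Hypothetical graph: p0: t0 -> t1, p1: t1 -> t2, p2: t2 -> t0, p3: t1 -> t0.
PLACES = [(0, 1), (1, 2), (2, 0), (1, 0)]


def enabled(places, marking, t):
    """Transition t is enabled when each of its input places holds a token."""
    return all(marking[j] >= 1 for j, (_, dst) in enumerate(places) if dst == t)


def fire(places, marking, t):
    """Move one token out of each input place of t and into each output place."""
    return tuple(m - (dst == t) + (src == t) for m, (src, dst) in zip(marking, places))


def reachable_markings(n_transitions, places, m0, limit=10_000):
    """Breadth-first enumeration of all markings reachable from m0."""
    seen = {tuple(m0)}
    queue = deque(seen)
    while queue:
        marking = queue.popleft()
        for t in range(n_transitions):
            if enabled(places, marking, t):
                nxt = fire(places, marking, t)
                if nxt not in seen:
                    if len(seen) >= limit:
                        raise RuntimeError("search limit reached; graph may be unbounded")
                    seen.add(nxt)
                    queue.append(nxt)
    return seen


def is_safe(n_transitions, places, m0):
    """True if no reachable marking puts more than one token in any place."""
    return all(max(m) <= 1 for m in reachable_markings(n_transitions, places, m0))


REACHABLE = reachable_markings(3, PLACES, (0, 0, 1, 1))
SAFE = is_safe(3, PLACES, (0, 0, 1, 1))
```

Every place of this example lies on a directed circuit holding a single token, so the search finds only three markings and none of them exceeds one token per place, in line with the argument of the proof.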
Report Documentation Page

1. Report No.: NASA CR-187450
2. Government Accession No.:
3. Recipient's Catalog No.:
4. Title and Subtitle: Strategies for Concurrent Processing of Complex
   Algorithms in Data Driven Architectures
5. Report Date: October 1990
6. Performing Organization Code:
7. Author(s): Sukhamoy Som, John W. Stoughton, and Roland R. Mielke
8. Performing Organization Report No.:
9. Performing Organization Name and Address:
   Department of Electrical and Computer Engineering
   Old Dominion University
   Norfolk, Virginia 23529-0246
10. Work Unit No.: 590-32-31-01
11. Contract or Grant No.: NAG1-683
12. Sponsoring Agency Name and Address:
   National Aeronautics and Space Administration
   Langley Research Center
   Hampton, Virginia 23665-5225
13. Type of Report and Period Covered: Contractor Report
14. Sponsoring Agency Code:
15. Supplementary Notes:
   Langley Technical Monitor: Paul J. Hayes
   Final Report
   May 1988 - August 1989
16. Abstract:
   This research report is concerned with performance modeling and performance
   enhancement for periodic execution of large-grain, decision-free algorithms
   in data flow architectures. Applications include real-time implementation of
   control and signal processing algorithms where performance is required to be highly
   predictable. The mapping of algorithms onto the specified class of data flow
   architectures is realized by a marked graph model called ATAMM (Algorithm To
   Architecture Mapping Model). Performance measures and bounds are established.
   Algorithm transformation techniques are identified for performance enhancement and
   reduction of resource (computing element) requirements. A systematic design
   procedure is described for generating operating conditions for predictable
   performance both with and without resource constraints. An ATAMM simulator is
   used to test and validate the performance prediction by the design procedure.
   Experiments on a three resource testbed provide verification of the ATAMM model and
   the design procedure.
17. Key Words (Suggested by Author(s)):
   Data flow computers
   Algorithm to architecture mapping
   Petri net
   Control and signal processing
18. Distribution Statement:
   Unclassified - Unlimited
   Subject Category 33
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of pages: 177
22. Price: A09

NASA FORM 1626 OCT 86