Strategies for concurrent processing of complex algorithms in data driven architectures by Stoughton, John W. & Mielke, Roland R.
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
COLLEGE OF ENGINEERING & TECHNOLOGY
OLD DOMINION UNIVERSITY
NORFOLK, VIRGINIA 23529
STRATEGIES FOR CONCURRENT PROCESSING OF
COMPLEX ALGORITHMS IN DATA DRIVEN ARCHITECTURES
By
John W. Stoughton, Principal Investigator
Roland R. Mielke, Co-Principal Investigator
Sukhamoy Som, Graduate Research Assistant
Rodrigo Obando, Graduate Research Assistant
Robert Tymchyshyn, Graduate Research Assistant
Progress Report
For the period May 16, 1987 to May 15, 1988
Prepared for the
National Aeronautics and Space Administration
Langley Research Center
Hampton, VA 23665
Under
Research Grant NAG-1-683
Mr. Paul J. Hayes, Technical Monitor
ISD-Information Processing Technology Branch
(NASA-CR-181329) STRATEGIES FOR CONCURRENT
PROCESSING OF COMPLEX ALGORITHMS IN DATA
DRIVEN ARCHITECTURES Progress Report, 16 May
1987 - 15 May 1988 (Old Dominion Univ.)
CSCL 09B G3/61
Unclas
June 1988
https://ntrs.nasa.gov/search.jsp?R=19890002035 2020-03-20T05:46:09+00:00Z

Submitted by the
Old Dominion University Research Foundation
P. O. Box 6369
Norfolk, Virginia 23508
June 1988

DISCLAIMER
The use of brand names in this document is for completeness only and
does not imply NASA endorsement.

STRATEGIES FOR CONCURRENT PROCESSING OF COMPLEX
ALGORITHMS IN DATA DRIVEN ARCHITECTURES
By
John W. Stoughton¹, Roland R. Mielke², Sukhamoy Som³,
Rodrigo Obando⁴ and Robert Tymchyshyn⁵
ABSTRACT
The purpose of this report is to document research to develop stra-
tegies for concurrent processing of complex algorithms in data driven archi-
tectures. The problem domain consists of decision-free algorithms having
large-grained, computationally complex primitive operations. Such are often
found in signal processing and control applications. The anticipated multi-
processor environment is a data flow architecture containing between two and
twenty computing elements. Each computing element is a processor having
local program memory, and which communicates with a common global data mem-
ory. A new graph theoretic model called ATAMM which establishes rules for
relating a decomposed algorithm to its execution in a data flow architecture
is presented. The ATAMM model is used to determine strategies to achieve
optimum time performance and to develop a system diagnostic software tool.
In addition, preliminary work on a new multiprocessor operating system based
on the ATAMM specifications is described.
¹Associate Professor, Department of Electrical & Computer Engineering, Old
Dominion University, Norfolk, Virginia 23529.
²Professor, Department of Electrical & Computer Engineering, Old Dominion
University, Norfolk, Virginia 23529.
³Graduate Research Assistant, Department of Electrical & Computer
Engineering, Old Dominion University, Norfolk, Virginia 23529.
⁴Graduate Research Assistant, Department of Electrical & Computer
Engineering, Old Dominion University, Norfolk, Virginia 23529.
⁵Graduate Research Assistant, Department of Electrical & Computer
Engineering, Old Dominion University, Norfolk, Virginia 23529.

TABLE OF CONTENTS
DISCLAIMER
ABSTRACT
I.0 INTRODUCTION
II.0 RESEARCH OVERVIEW
    II.1 Modeling and Performance
    II.2 Diagnostic Tool Development
    II.3 Testbed Development
III.0 OPTIMUM TIME PERFORMANCE
    III.1 Introduction
    III.2 ATAMM Model Development
    III.3 Model Characteristics
    III.4 Performance Analysis
    III.5 Strategy For Optimum Time Performance
IV.0 DIAGNOSTIC TOOL DEVELOPMENT
    IV.1 Analyzer Development
        IV.1.1 Introduction
        IV.1.2 Prototype and its Communication Events
        IV.1.3 Graph Manager Diagnostic Routines
        IV.1.4 Sequential Account for Concurrent Processing
        IV.1.5 Analyzer Program
        IV.1.6 Measurement of TBIO, TBO, TBI
        IV.1.7 Concurrency Measurement
        IV.1.8 General Statistics
        IV.1.9 Graph Simulation/Analyzer
        IV.1.10 Output of the Graph Simulation/Analyzer
V.0 EXPERIMENTAL RESULTS
    V.1 Introduction
    V.2 Graphs with Parallel Paths
        V.2.1 Simulation
        V.2.2 Analysis of Output Data
        V.2.3 Minimum Number of Resources for Maximum Performance
        V.2.4 Graphs with Iterative Loops
        V.2.5 Simulation
        V.2.6 Analysis of Output Data
    V.3 Performance Factors
VI.0 FURTHER RESEARCH
VII.0 REFERENCES
TABLES
FIGURES
APPENDIX A: National Aerospace and Electronics Conference Paper
APPENDIX B: Distributed Computing Systems Conference Paper
LIST OF TABLES
Results from first experiment, first priority assignment
Results from the first experiment, second priority assignment
Results from first experiment, third priority assignment
Results from second experiment, first priority assignment
Results from second experiment, second priority assignment
Performance factors for graph of Section 4.1
Performance factors for graph of Section 4.2
LIST OF FIGURES
1 Algorithm marked graph for discrete system equation
2 ATAMM node marked graph model
3 ATAMM computational marked graph model for discrete system equation
4 ATAMM model components
5 Modified algorithm graph for Figure 1
6 Operating strategy implementation
7 Algorithm graph for design example
8 Computational marked graph for design example
9 Graph play with TBO=3 and unlimited functional units
10 Resource utilization envelope for design example
11 Graph play with TBO=4 and no control edges
12 Resource envelope overlay diagram with TBO=3
13 Resource envelope overlay diagram with TBO=3.5
14 Resource envelope overlay diagram with TBO=4.0
15 Example algorithm graph performance analysis summary
16 Performance margin for example algorithm
17 Prototype block diagram
18 Prototype communications dialog
19 A sample FIPSO file
20 Analyzer information flow
21 Analyzer node activity display
22 Analyzer functional unit display
23 Analyzer input/output display
24 Analyzer concurrency display
25 Graph simulation/analyzer information flow
26 Graph with parallel paths
27 CMG using single node model
28 Circuit to obtain TBOLB
29 Path to obtain TBIOLB
30 Graph description and simulation control file used for the first experiment
31 Graph with iterative loops
32 CMG of the graph in Figure 31 using single node model
33 Graph description file for the second experiment

I.0 INTRODUCTION
The purpose of this report is to document research to develop strategies
for concurrent processing of complex algorithms in data driven
architectures. The problem domain consists of decision-free algorithms having
large-grained, computationally complex primitive operations. The
anticipated multiprocessor environment is assumed to contain between two and
twenty computing elements for concurrent execution of the various primitive
operations. Each computing element or functional unit is a processor having
local memory for program storage and temporary input and output data
containers. The functional units have a common global data memory, and
functional unit activity is coordinated by a graph manager. The global memory
and graph manager may be either centralized or distributed. The authors
have proposed a new graph theoretic model to provide a basis for
establishing rules for relating a decomposed algorithm to its execution in a data
flow environment. The model is identified by the acronym ATAMM, which
represents Algorithm To Architecture Mapping Model. The availability of the
ATAMM model is important because it provides a context in which to
investigate algorithm decomposition strategies, it provides a basis for predicting
and improving time performance, and it identifies the data flow and control
flow required of any data flow architecture which implements the algorithm.
During an earlier grant period, May 16, 1986 to May 15, 1987, the
authors formulated the ATAMM model for representing the implementation of a
decomposed algorithm in a data flow architecture. In addition, a simulation
tool was developed to display data flow and control flow for algorithms
operating according to the ATAMM rules. During the present grant period,
May 16, 1987 to May 15, 1988, the ATAMM model was used to determine
analytically performance bounds for task computational time and system
throughput time. An operating strategy which achieves optimum time performance
was developed. In addition, a new diagnostic software tool was developed for
use with the simulation tool. The diagnostic tool monitors detailed system
operation and displays global system performance indicators and measures.
Also, a new multiprocessor operating system based on the ATAMM
specifications is being constructed to validate the ATAMM rules and to provide a
testbed for further experimentation. It is the purpose of this report to
provide a detailed description of the research performed during the present
grant period.
In Section II, an overview of research performed during the period May
16, 1987 to May 15, 1988 is presented. This overview consists of summaries
of work to develop strategies for optimum time performance, diagnostic
software tools, and a testbed operating system. In Section III, the development
of strategies for optimum time performance is described. The new diagnostic
software tools are explained and illustrated in Section IV.
Recommendations for continuing and future research are briefly outlined in Section VI.
Two papers describing recent research efforts are included as
appendices.
II.0 RESEARCH OVERVIEW
In this section, a summary of research activity conducted during the
period May 16, 1987 through May 15, 1988 is presented. A more detailed
description of this work, as well as illustrative examples, is given in the
following sections and the appendices.
II.1 Modeling and Performance
The development of a new graph theoretic model for describing data and
control flow associated with the execution of large-grained algorithms in a
special distributed computing environment is presented. The model is iden-
tified by the acronym ATAMM which represents Algorithm To Architecture
Mapping Model. The purpose of such a model is to provide a basis for
establishing rules for relating an algorithm to its execution in a multi-
processor environment. Specifications derived from the model lead directly
to the description of a data flow architecture. The availability of the
ATAMM model is important for at least three reasons. First, it provides a
context in which to investigate algorithm decomposition strategies without
the need to specify a specific computer architecture. Second, the model
identifies the data flow and control dialog required of any data flow archi-
tecture which implements the algorithm. Third, the model provides a basis
for calculating analytically performance bounds for computing speed and
throughput capacity.
The problem domain of the ATAMM model consists of decision free algo-
rithms with computationally complex primitive operations which are assumed
to be implemented in a dedicated data flow environment. The algorithms are
such as may be found in (but not limited to) large scale signal processing
and control applications. The anticipated multiprocessor environment is
assumed to consist of two to twenty processing elements for concurrent
execution of the various algorithm primitives.
The development of new computer architectures based upon distributed,
multiprocessor organizations [1], [2] is motivated mainly by the requirement
for increased speed and greater throughput capability in complex signal
processing applications [3]. Recent advances in the production of
high-density microelectronics [4] have made possible the construction of parallel
architectures consisting of identical, special purpose computing elements
[5]. A number of models for describing the behavior of algorithms in this
setting have been developed [6] - [8]. However, these models represent only
the data flow and do not adequately display the complex issues of communi-
cation and control flow which must occur in any realization of the model.
For this reason, it has been difficult to investigate how to effectively
match the decomposition and scheduling of algorithms to the structure and
control of parallel architectures. The importance of better understanding
the relationship between algorithms and architectures is only now becoming
recognized [9].
A new model useful for understanding the relationship between
decomposed algorithms and data flow architectures has been presented. Named
ATAMM for Algorithm To Architecture Mapping Model, the model consists of
Petri net marked graphs called the algorithm marked graph, the node marked
graph, and the computational marked graph. After establishing that the
computational marked graph is live, safe and consistent, graph time
performance measures of time between input and output (TBIO), task time (TT), and
time between outputs (TBO) are defined. Then lower bounds for the
performance measures are calculated analytically from the modified algorithm
graph and the computational marked graph. A design strategy for achieving
optimum time performance is proposed and illustrated with a design example.
II.2 Diagnostic Tool Development
Although the ATAMM model is not complicated in principle, the execution
of a system modelled with it becomes hardly tractable as both the number
of nodes and the number of resources increase. Therefore, it is necessary to
have diagnostic tools to explore the execution of a given algorithm. One of
the important parameters to observe is concurrency. Concurrency
is a measure of the number of resources that work at the same time for a
specified length of execution of an algorithm. Other parameters include
TBIO (Time Between Input and Output), TBO (Time Between Outputs), and TBI
(Time Between Inputs). These parameters refer to the time performance of
the system: the elapsed time between when input data is read and its
corresponding output data is written (TBIO), the time elapsed between
repetitive output writings (TBO), and the time elapsed between repetitive
input data readings (TBI). Other necessary measurements are the time the
system takes and the different states it goes through to reach steady
state.
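As a small illustration of these definitions, the sketch below computes TBIO, TBO, and TBI from recorded input-read and output-write times. The function name `time_measures` and the sample timestamps are hypothetical and are not taken from the Analyzer; the sketch assumes that packet i of the input list corresponds to packet i of the output list.

```python
def time_measures(input_times, output_times):
    """Return (TBIO, TBO, TBI) lists for matched data packets.

    input_times[i]  : time at which input data packet i is read
    output_times[i] : time at which the corresponding output is written
    """
    if len(input_times) != len(output_times):
        raise ValueError("each input packet needs a matching output time")
    # TBIO: elapsed time between reading an input and writing its output.
    tbio = [out - inp for inp, out in zip(input_times, output_times)]
    # TBO: elapsed time between successive output writings.
    tbo = [b - a for a, b in zip(output_times, output_times[1:])]
    # TBI: elapsed time between successive input readings.
    tbi = [b - a for a, b in zip(input_times, input_times[1:])]
    return tbio, tbo, tbi


# Hypothetical timestamps for four data packets (arbitrary time units).
reads = [0, 3, 6, 9]
writes = [7, 10, 13, 16]
tbio, tbo, tbi = time_measures(reads, writes)
print("TBIO:", tbio)
print("TBO: ", tbo)
print("TBI: ", tbi)
```

In steady state the TBO and TBI lists would be expected to settle to constant values, which is one simple way to spot the end of the start-up transient in such a record.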
The Analyzer, a computer program, provides measurement of the items
listed above. The input to the program is a file containing a sequential
account of the execution of a concurrent system. It displays the activity
of the individual nodes of a graph. This display is drawn on a common time
axis for easy reading of the concurrent execution of nodes. An alternate
display plots the activity of the resources versus time. The
program also displays concurrency as a function of time, which is
called the Total Resource Utilization Envelope. For individual data packets,
the program displays the values of TBIO, TBO and TBI. It also reports
general statistics of the transitions per node. This program is primarily
intended for detailed post-execution analysis of the execution of an
algorithm.
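To make the idea of concurrency versus time concrete, the following sketch builds a step-function envelope from a list of busy intervals, one per resource assignment. It is only an illustration under assumed inputs: the functions `utilization_envelope` and `average_concurrency` and the interval data are hypothetical, and the Analyzer's actual file format and algorithms are not reproduced here.

```python
def utilization_envelope(busy_intervals):
    """Return [(time, busy_count), ...] describing a step function.

    busy_intervals: list of (start, end) pairs with start < end, one pair
    per period during which some resource is working.
    """
    events = []
    for start, end in busy_intervals:
        events.append((start, +1))   # a resource becomes busy
        events.append((end, -1))     # a resource becomes free
    # Tuples sort by time first; at equal times -1 sorts before +1.
    events.sort()
    envelope = []
    busy = 0
    for time, change in events:
        busy += change
        if envelope and envelope[-1][0] == time:
            envelope[-1] = (time, busy)   # merge events at the same instant
        else:
            envelope.append((time, busy))
    return envelope


def average_concurrency(busy_intervals, t_start, t_end):
    """Average number of busy resources over the window [t_start, t_end]."""
    total_busy_time = sum(
        max(0, min(end, t_end) - max(start, t_start))
        for start, end in busy_intervals
    )
    return total_busy_time / (t_end - t_start)


# Hypothetical activity record: three resource assignments.
intervals = [(0, 4), (1, 3), (3, 5)]
envelope = utilization_envelope(intervals)
average = average_concurrency(intervals, 0, 5)
print(envelope)
print(average)
```

The envelope gives the instantaneous number of working resources, while the average over a window corresponds to concurrency measured "for a specified length of execution" as described above.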
Another computer program, the Graph Simulation/Analyzer, provides not
only simulation of the execution of an algorithm but also analysis of data
immediately after execution. It generates the sequential files containing
firings of transitions in the CMG (Computational Marked Graph) to be analyzed
by the Analyzer, the program described above. It also generates files with
average values of TBO, TBI and TBIO. The simulation module has been
improved so that it may include random variables as the values of the
transitions in the CMG. It accepts as input an ASCII file containing a
description of the topology of a graph, transition time assignments, priority
assignment, initial marking, number of resources, etc.
II.3 Testbed Development
A multiprocessor operating system has been developed based on the ATAMM
specifications. It is the third prototype system to have been built in the
past two years. The motivation for this is to give further credibility to
ATAMM through system validation and to provide a testbed for experimentation.
This discussion is divided into three design phases. In the system
partitioning phase, the ATAMM model is divided into logical components. Combined,
these logical components must fully represent the ATAMM description. The next
phase is the hardware mapping, in which the logical components are mapped
into a target architecture. Necessary inter-module communications and
control dialogue paths must also be specified. The multiprocessor operating
system implementation is the final design phase and will be referred to
briefly.
Three logical components have been isolated in the ATAMM partition: the
Graph Manager (GM), Functional Unit (FUN), and Global Memory (GLM). The
Graph Manager is responsible for implementing the state transitions of the
processes. It must monitor all token movement within the CMG required to
determine the fireability of a process. When a process can fire, the Graph
Manager must assign the first available Functional Unit to that process.
The Functional Unit will then execute all three NMG transitions for that
particular process. It must also, via interrupt, report all important token
movement within the NMG to the Graph Manager. As a Functional Unit can be
assigned to any process, it must also have the code available for the
computation of every process in the AMG. The Global Memory is the final logical
component in the partition and is responsible for storing data associated
with all Output Full edges in the CMG. Because of this, it must have a
communications path to all Functional Units for both the reading and writing
of data.
The three prototype multiprocessor operating systems previously
mentioned have all had different hardware mappings. Each new mapping was
guided by observations made in the development of the previous mapping.
In the current mapping all three logical components are distributed within
each hardware module. The hardware modules chosen are IBM PC/AT's and are
connected on an Ethernet Local Area Network. This mapping presents two
advantages over the previous two in which the logical components were not
completely distributed. First, the redundancy of all logical components
provides a greater degree of fault tolerance. Secondly, a reduction of
inter-module communications, the major bottleneck in multiprocessor design,
is expected as the logical components all reside in the same hardware
module.
The final step in the design process is to develop a multiprocessor
operating system to implement the logical components as designated by the
hardware mapping. In addition to the hardware modules, a Sink/Source node
module was designed for the system initialization and monitoring. It is
also responsible for injecting input data into the system and for receiving
output data. The resulting multiprocessor has been successfully developed
and is currently undergoing tests for ATAMM validation. Initial results are
positive and all tests should be completed by the end of August.
III.0 OPTIMUM TIME PERFORMANCE
III.1 Introduction
The development of a new graph theoretic model for describing the
relation between a decomposed algorithm and its execution in a data flow
environment is presented. Performance measures of computing speed and
throughput capacity are defined. Lower bounds for these performance
measures are established. In Subsection III.2 of this report, the modeling
process to describe algorithms in data flow architectures, ATAMM, is
presented. The model consists of three Petri net marked graphs called the
algorithm marked graph (AMG), the node marked graph (NMG), and the
computational marked graph (CMG). In Subsection III.3, the operating
characteristics of these graphs are investigated. A state variable description is
presented and used to establish the graph properties of reachability,
liveness and safeness. Time performance measures for concurrent processing are
defined in Subsection III.4. The ATAMM model is used as the basis for
calculating analytically lower bounds for these performance measures. Then
in Subsection III.5, an operating strategy which achieves optimum time
performance is developed. Several examples are presented to illustrate these
concepts.
III.2 ATAMM Model Development
In this subsection the ATAMM model to describe concurrent processing of
a decomposed algorithm is presented. The model consists of a set of Petri
net marked graphs which incorporate general specifications of communication
and processing associated with each computational event in a data flow
architecture. First, a detailed description of the problem context is
stated. This is followed by the definition of the ATAMM model consisting of
the algorithm marked graph, the node marked graph, and the computational
marked graph. Some familiarity with Petri nets [10] and marked graphs [11]
is assumed in this presentation.
The problems of interest are decision-free, computationally complex
problems as are often found in signal processing and control applications.
A problem description normally results in the definition of a function given
by the triple (X,Y,F). The set X represents the set of admissible inputs,
the set Y represents the set of admissible outputs, and F:X->Y is the rule
of correspondence which unambiguously assigns exactly one element from Y to
each element of X. Associated with a computational problem are one or more
algorithms. An algorithm is an explicit mathematical statement, expressed
as an ordered set of primitive operations, which explains how to implement
the rule of correspondence F. In general, a given problem can be decomposed
by several different primitive operator sets. Also, for a given primitive
operator set, there are often different orderings of primitive operations
which can be specified to solve the problem. Of special interest are
algorithm decompositions in which two or more primitive operations can be
performed concurrently. For such decompositions, the potential exists for
decreasing the computational time required to solve the problem by
increasing the computational resources which implement the primitive operations.
The hardware environment for executing the decomposed algorithms is
assumed to consist of R identical processors or functional units (FUNs),
where R has a value in the range of two to twenty. This range of resources
is suggested for practical reasons due to the large-grained aspect of the
algorithm decomposition and the need to maintain small communication times
relative to process times. Each FUN is a processor having local memory for
program storage and temporary input and output data containers. Each FUN can
execute any algorithm primitive operation. The FUNs share a common global
memory (GLM) which may be either centralized or distributed. The
coordination of FUNs in relation to data and control flow is directed by the graph
manager (GRM). The GRM also may be centralized or distributed. Output
created by the completion of a primitive operation is placed into global
memory only after the output data containers have been emptied. That is,
outputs must be consumed as inputs to successor primitive operations before
allowing new data to fill the output locations. Assignment of a functional
unit to a specific algorithm primitive operation is made by the GRM only
when all inputs required by the operation are available in global memory and
a functional unit is available.
An algorithm marked graph is a marked graph which represents a specific
algorithm decomposition. Vertices of the algorithm graph are in one-to-one
correspondence with the occurrences of primitive operations. The
algorithm graph contains an edge (i,j) directed from vertex i to vertex j if the
output of primitive operation i is an input for primitive operation j. Edge
(i,j) is marked with a token if an output from primitive operation i is
available as an input to primitive operation j. When constructing an
algorithm graph, vertices (primitive operations) are displayed as circles, and
edges (input-output signals) are displayed as directed line segments
connecting appropriate vertices. The presence of a token on an edge is
indicated by a solid dot placed on the edge. Source transitions and sink
transitions for input and output signals are represented as squares. Sources for
constants are not usually included in the algorithm marked graph; however,
triangles are used for this purpose when necessary.
To illustrate the construction of an algorithm marked graph, consider
the problem of computing the output of a discrete linear system given a
sequence of inputs to the system. Let the system be described by the state
equation
    x(k) = Ax(k-1) + Bu(k)
and output equation
    y(k) = Cx(k),
where x is a p-vector, u is an m-vector, and y is an r-vector. The primitive
operations are defined as matrix multiplication and vector addition, and the
natural algorithm decomposition resulting from the state equation
description is selected. The algorithm marked graph for this decomposed algorithm
is shown in Fig. 1. The initial marking indicates that initial condition
data are available.
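The decomposition of the state and output equations into primitive operations can be sketched in a few lines of code. This is an illustration only: the helper names `mat_vec`, `vec_add`, and `run_system` and the numerical matrices are hypothetical, and the code runs the primitives sequentially rather than on separate functional units.

```python
def mat_vec(matrix, vector):
    """Primitive operation: multiply a matrix by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def vec_add(left, right):
    """Primitive operation: add two vectors of equal length."""
    return [a + b for a, b in zip(left, right)]


def run_system(A, B, C, x0, inputs):
    """Return the outputs y(1), y(2), ... for the given input sequence."""
    x = x0                        # initial condition data, x(0)
    outputs = []
    for u in inputs:
        ax = mat_vec(A, x)        # A x(k-1)
        bu = mat_vec(B, u)        # B u(k); does not depend on ax, so the
                                  # two products could run concurrently
        x = vec_add(ax, bu)       # x(k) = A x(k-1) + B u(k)
        outputs.append(mat_vec(C, x))   # y(k) = C x(k)
    return outputs


# Hypothetical system with p = 2 states, m = 1 input, r = 1 output.
A = [[1, 1], [0, 1]]
B = [[0], [1]]
C = [[1, 0]]
ys = run_system(A, B, C, x0=[0, 0], inputs=[[1], [1], [1]])
print(ys)
```

Each call to `mat_vec` or `vec_add` corresponds to one vertex of the algorithm graph, and the reuse of `x` from one pass of the loop to the next corresponds to the edge that carries the initial condition token.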
The algorithm marked graph is a useful tool for representing decomposed
algorithms and for displaying data flow within an algorithm. However, the
algorithm graph does not display the procedures needed to carry out a computing task. In
addition, the issues of control, time performance, and resource management are
not apparent in this graph. These important aspects of concurrent process-
ing are included in the ATAMM model through the definition of two additional
graphs. The node marked graph (NMG) is defined to model the execution of a
primitive operation. The computational marked graph, obtained from the AMG
and the NMG by a set of construction rules, integrates both the algorithm
requirements and the computing environment requirements into a comprehensive
graph model. These additional marked graphs are defined in the following.
The NMG is a Petri net representation of the performance of a primitive
operation by a functional unit. Three primary activities, reading of input
data from global memory, processing of input data to compute output data,
and writing of output data to global memory, are represented as transitions
(vertices) in the NMG. Data and control flow paths are represented as
places (edges), and the presence of signals is denoted by tokens marking
appropriate edges. The conditions for firing the process and write
transitions of the NMG are as defined for a general Petri net, while the read
transition has one additional condition for firing. In addition to having a
token present on each incoming signal edge, a functional unit must be
available for assignment to the primitive operation before the read node can
fire. Once assigned, the functional unit is used to implement the read,
process, and write operations before being returned to a queue of available
FUNs. The initial marking for an NMG consists of a single token in the
"process ready" place. The NMG model is shown in Fig. 2.
A computational marked graph (CMG) is constructed from the AMG and the
NMG by the following rules.
1. Source and sink nodes in the algorithm marked graph are represented
by source and sink nodes in the CMG.
2. Nodes corresponding to primitive operations in the algorithm marked
graph are represented by NMGs in the CMG.
3. Edges in the algorithm marked graph are represented by edge pairs,
one forward directed for data flow and one backward directed for
control flow, in the CMG. The initial marking for the edge pair
consists of a single token in the forward-directed place if data
are available, or a single token in the backward-directed place if
data are not available.
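The three construction rules can be sketched as a small routine that turns an AMG description into a CMG description. The sketch makes several assumptions that are not part of the report: the function name `build_cmg`, the dictionary layout, and the node names are hypothetical, and each NMG is kept as a single labeled node, so its internal read, process, and write transitions and its "process ready" place are not expanded.

```python
def build_cmg(amg_nodes, amg_edges):
    """Apply construction rules 1-3 to an AMG description.

    amg_nodes: dict mapping node name -> "source", "sink", or "op"
    amg_edges: list of (tail, head, data_available) triples
    Returns (cmg_nodes, cmg_places).
    """
    cmg_nodes = {}
    for name, kind in amg_nodes.items():
        # Rule 1: source and sink nodes carry over unchanged.
        # Rule 2: each primitive operation is replaced by an NMG.
        cmg_nodes[name] = "NMG" if kind == "op" else kind

    cmg_places = {}
    for tail, head, data_available in amg_edges:
        # Rule 3: every AMG edge becomes a forward (data) place and a
        # backward (control) place; exactly one of the pair holds a token.
        cmg_places[(tail, head, "data")] = 1 if data_available else 0
        cmg_places[(head, tail, "control")] = 0 if data_available else 1
    return cmg_nodes, cmg_places


# Hypothetical AMG: a source feeding one primitive operation and a sink.
amg_nodes = {"u": "source", "f": "op", "y": "sink"}
amg_edges = [("u", "f", False), ("f", "y", False)]
cmg_nodes, cmg_places = build_cmg(amg_nodes, amg_edges)
for place, tokens in cmg_places.items():
    print(place, tokens)
```

Because each edge pair always carries exactly one token, a producer cannot deposit new data until its consumer has returned the control token, which mirrors the requirement that output containers be emptied before they are refilled.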
The play of the CMG proceeds according to the following graph rules.
1. A node is enabled when all incoming edges are marked with a token.
An enabled node fires by encumbering one token from each incoming
edge, delaying for some specified transition time, and then
depositing one token on each outgoing edge.
2. A source node and a sink node fire when enabled without regard for
the availability of a FUN.
3. A primitive operation is initiated when the read node of an NMG is
enabled and a FUN is available for assignment to the NMG. A FUN
remains assigned to an NMG until completion of the firing of the
write node of the NMG.
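The graph rules above can be exercised with a short timed simulation. The sketch below is a simplification under stated assumptions, not the simulation tool described in this report: the function `play` and the example graph are hypothetical, each NMG is collapsed into a single timed node so that read, process, and write are not modeled separately, every node is assumed to have at least one incoming place, and priority among competing nodes is simply the order of the `nodes` dictionary.

```python
import heapq


def play(nodes, places, fun_count, stop_time):
    """Play a single-node-model CMG; return a list of (start, end, node).

    nodes:  dict name -> (kind, duration), kind in {"source", "sink", "op"}
    places: dict (tail, head) -> initial token count
    """
    tokens = dict(places)          # work on a copy of the initial marking
    free_funs = fun_count
    pending = []                   # heap of (finish_time, node)
    log = []
    now = 0
    while now <= stop_time:
        for name, (kind, duration) in nodes.items():
            incoming = [p for p in tokens if p[1] == name]
            if not all(tokens[p] > 0 for p in incoming):
                continue           # rule 1: node is not enabled
            if kind == "op" and free_funs == 0:
                continue           # rule 3: no FUN is available
            for p in incoming:
                tokens[p] -= 1     # encumber one token per incoming edge
            if kind == "op":
                free_funs -= 1     # FUN stays assigned until completion
            heapq.heappush(pending, (now + duration, name))
            log.append((now, now + duration, name))
        if not pending:
            break                  # nothing is firing and nothing can start
        now, first = heapq.heappop(pending)
        finished = [first]
        while pending and pending[0][0] == now:
            finished.append(heapq.heappop(pending)[1])
        for name in finished:
            for p in tokens:
                if p[0] == name:
                    tokens[p] += 1     # deposit one token per outgoing edge
            if nodes[name][0] == "op":
                free_funs += 1         # return the FUN to the free pool
    return log


# Hypothetical graph with two parallel paths: src -> a, b -> c -> snk.
nodes = {"src": ("source", 0), "a": ("op", 2), "b": ("op", 3),
         "c": ("op", 1), "snk": ("sink", 0)}
edges = [("src", "a"), ("src", "b"), ("a", "c"), ("b", "c"), ("c", "snk")]
places = {}
for tail, head in edges:
    places[(tail, head)] = 0       # forward place: no data available yet
    places[(head, tail)] = 1       # backward place: output container empty
log_two = play(nodes, places, fun_count=2, stop_time=4)
log_one = play(nodes, places, fun_count=1, stop_time=4)
print(log_two)
print(log_one)
```

With two FUNs the parallel operations `a` and `b` start together, while with one FUN the second must wait for the first to release its functional unit, which is the kind of resource effect the performance analysis is meant to capture.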
In order to illustrate the construction of a computational marked
graph, the CMG corresponding to the algorithm marked graph of Fig. 1 is
shown in Fig. 3. The computational marked graph is useful because it
clearly displays the data and control flow which must occur in any hardware
implementation of the model process, and because it provides a hardware
independent context in which to evaluate process performance.
The complete ATAMM model consists of the algorithm marked graph, the
node marked graph, and the computational marked graph. A pictorial display
of this model is shown in Fig. 4. In the next subsection, important operating
characteristics of the ATAMM model are investigated.
III.3 Model Characteristics
In the previous subsection, a marked graph model consisting of the AMG,
the NMG, and the CMG is defined as a means to describe concurrent processing
of decomposed algorithms. In this subsection the ATAMM model is studied
analytically to determine important graph operating characteristics. First,
a state description which expresses the next graph marking as a function of
the present marking and a vector indicating which transition is to be fired
is developed. Then, the marked graph properties of reachability, liveness,
and safeness are considered for the CMG. Two excellent papers by Murata
[11], [12] on properties of marked graphs are the source for much of the
material presented in this subsection.
Let G be a marked graph consisting of m places and n transitions. The
m-vector Mk denotes the marking vector for G resulting from the firing of
some sequence of k transitions. The following two definitions are necessary
to develop the state description of the CMG.
Definition 1: Complete Incidence Matrix. The complete incidence matrix for
a marked graph G is the (n x m) matrix A = [a_ij] having rows corresponding to
transitions, columns corresponding to places, and where

$$a_{ij} = \begin{cases} +1\ (-1) & \text{if place } j \text{ is incident at transition } i \text{ and directed out of (into) the transition,} \\ 0 & \text{if place } j \text{ is not incident at transition } i. \end{cases}$$
Definition 2: Elementary Firing Vector. An elementary firing vector uk is
an n-vector having all zero entries except for the ith component which is 1
denoting that transition i is the kth transition to fire in some transition
firing sequence.
To gain insight into the state equation description, it is helpful to
consider the firing of transition k. If a_ki = -1 (+1), place i is an input
(output) place to transition k. Therefore, transition k is enabled if
M(i) = 1 for each input place. When transition k fires, one token is
removed from each input place and one token is added to each output place.
These observations lead to the following next state description for a marked
graph.
Property 1: Next State Description. For a marked graph G with present
marking vector M_{k-1} and elementary firing vector u_k, the next marking vector
is given by

$$M_k = M_{k-1} + A^T u_k.$$
The next state description can be used to express the graph marking
resulting from the application of sequences of elementary firing vectors.
This is done in the next definition and property.
Definition 3: Firing Count Vector. Let (u_1, u_2, ..., u_d) be a sequence of
elementary firing vectors taking a marked graph G from an initial marking M_0
to a destination marking M_d. The firing count vector x_d for this firing
sequence is defined by

$$x_d = \sum_{k=1}^{d} u_k.$$
Property 2: State Equation Description. For a marked graph G with initial
marking vector M_0, the marking vector resulting from the application of
elementary firing vector sequence (u_1, u_2, ..., u_d) is given by

$$M_d = M_0 + A^T x_d.$$
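As an illustration of Property 1 and Property 2, the following Python sketch builds the complete incidence matrix for a small marked graph and applies the next state and state equation descriptions. The three-transition ring used as the example, and all function names, are assumptions made for illustration only; they do not appear in this report.

```python
def incidence_matrix(n_transitions, places):
    """Build the (n x m) complete incidence matrix A.

    `places` lists one (from_transition, to_transition) pair per place.
    A[i][j] is +1 if place j is directed out of transition i and -1 if it
    is directed into transition i, following Definition 1.
    """
    A = [[0] * len(places) for _ in range(n_transitions)]
    for j, (src, dst) in enumerate(places):
        A[src][j] += 1  # place j is an output place of transition src
        A[dst][j] -= 1  # place j is an input place of transition dst
    return A


def is_enabled(A, marking, i):
    """Transition i is enabled when every input place holds a token."""
    return all(marking[j] >= 1 for j, a in enumerate(A[i]) if a == -1)


def fire(A, marking, i):
    """Next state description: M_k = M_{k-1} + A^T u_k with u_k = e_i."""
    if not is_enabled(A, marking, i):
        raise ValueError(f"transition {i} is not enabled")
    return [m + a for m, a in zip(marking, A[i])]


def state_equation(A, m0, x):
    """State equation description: M_d = M_0 + A^T x_d."""
    n = len(A)
    return [m0[j] + sum(A[i][j] * x[i] for i in range(n))
            for j in range(len(m0))]


# Hypothetical ring t0 -> t1 -> t2 -> t0 with one token on the place t2 -> t0.
ring = [(0, 1), (1, 2), (2, 0)]
A = incidence_matrix(3, ring)
m0 = [0, 0, 1]
m1 = fire(A, m0, 0)   # token moves to the place t0 -> t1
m2 = fire(A, m1, 1)   # token moves to the place t1 -> t2
# Firing t0 then t1 corresponds to the firing count vector [1, 1, 0].
assert m2 == state_equation(A, m0, [1, 1, 0])
```

Firing every transition of the ring once, that is x_d = [1, 1, 1], returns the marking to M_0, which anticipates the consistency property discussed later in this subsection.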
Using the state description of a marked graph as a basis, the property
of reachability is investigated. Necessary and sufficient conditions for a
CMG marking vector to be reachable from an initial marking are established,
and it is shown that the number of tokens contained in any directed circuit
of the CMG is invariant under transition firings.
Definition 4: Reachability. A marking Md is reachable from an initial
marking M0 if there exists a sequence of elementary firing vectors that
transforms M0 to Md.
The following definition is required to state the reachability conditions
for a CMG.
Definition 5: Fundamental Circuit Matrix. Let T be a tree of a connected
marked graph G. The set of (m-n+1) circuits, each uniquely formed by
appending one cotree edge to the tree, is called the set of fundamental
circuits of G for tree T [13]. The fundamental circuit matrix for G for
tree T is the (m-n+1) x m matrix Bf = [b_ij] having rows corresponding to
fundamental circuits, columns corresponding to places, and where

$$b_{ij} = \begin{cases} +1\ (-1) & \text{if place } j \text{ is contained in f-circuit } i \text{ and the place and circuit directions agree (disagree),} \\ 0 & \text{if place } j \text{ is not contained in f-circuit } i. \end{cases}$$
Property 3: Reachability in the CMG. In a computational marked graph G, a
marking Md is reachable from an initial marking M0 if and only if Bf Md =
Bf M0, where Bf is a fundamental circuit matrix for G.
Proof. It is shown in [11] (Theorem 3) that the property is true for marked
graphs containing no token-free directed circuits. By the construction
rules for the CMG, directed circuits occur in exactly four ways. First,
each NMG consists of a directed circuit which contains an initial marking
token in the "process ready" place. Second, a directed circuit is formed
each time an NMG is linked to another NMG. Since one of the two linking
places contains an initial marking token and both places are contained in
the circuit, this circuit is never token free. Third, directed circuits
exist in the CMG corresponding to interconnected feedforward paths in the
algorithm marked graph. Each such circuit contains one or more backward-
directed control edges containing one initial marking token. Fourth,
directed circuits exist in the CMG corresponding to directed circuits in the
algorithm marked graph. Each such circuit contains exactly one forward-
directed edge containing one initial marking token representing initial
condition data. Therefore, the CMG contains no token-free directed circuits
and the property follows.
As a direct consequence of the reachability property of the CMG, it can
be shown that the number of tokens in any directed circuit is constant.
This characteristic is stated as Property 4.
Property 4: Token Count Invariance. In a CMG, the number of tokens contained
in a directed circuit is invariant under transition firing.
Proof. Consider a directed circuit C of a CMG. The entries in the row of a
circuit matrix B corresponding to C are ±1 in columns representing edges in
C and are 0 otherwise. If M is a marking vector, the component of BM
corresponding to C is equal to the number of tokens in directed circuit C
under marking M. Therefore, if Md is any marking reachable from an initial
marking M0, it follows from Property 3 that BMd = BM0. That is, the number
of tokens in directed circuit C under initial marking M0 is equal to the number
of tokens under any marking Md reachable from M0. This completes the proof.
Next, liveness and a closely related property called consistency are
considered. It is shown that the CMG is live and consistent.
Definition 6: Liveness. A marked graph G is said to be live for a marking
M if, for all markings reachable from M, it is possible to fire any transition
of G by progressing through some transition firing sequence.
Property 5: Liveness in the CMG. The computational marked graph is live
for all appropriate initial marking vectors.
Proof. It is shown in [12] (Property 2) that a marked graph G is live for a
marking M if and only if G contains no token-free directed circuits in marking
M. As stated in the proof of Property 3, for all appropriate initial
markings M0, the CMG contains no token-free directed circuits. Therefore,
the property follows.
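The liveness condition quoted in this proof can be checked mechanically: a token-free directed circuit exists exactly when the subgraph formed by the unmarked places contains a directed cycle. The Python sketch below tests this with a depth-first search. The function names and the two-transition example (one forward and one backward place, as in a CMG edge pair) are assumptions chosen for illustration.

```python
def has_token_free_circuit(n_transitions, places, marking):
    """Return True if some directed circuit carries no token.

    `places` lists one (from_transition, to_transition) pair per place and
    `marking[j]` is the token count of place j. Only unmarked places are
    kept, then a depth-first search looks for a directed cycle.
    """
    succ = [[] for _ in range(n_transitions)]
    for (src, dst), tokens in zip(places, marking):
        if tokens == 0:
            succ[src].append(dst)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    colour = [WHITE] * n_transitions

    def visit(t):
        colour[t] = GREY
        for nxt in succ[t]:
            if colour[nxt] == GREY:
                return True  # back edge closes a token-free circuit
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[t] = BLACK
        return False

    return any(colour[t] == WHITE and visit(t)
               for t in range(n_transitions))


def is_live(n_transitions, places, marking):
    """Live if and only if no directed circuit is token free."""
    return not has_token_free_circuit(n_transitions, places, marking)


# Hypothetical edge pair between transitions 0 and 1:
# place 0 is the forward (data) place, place 1 the backward (control) place.
pair = [(0, 1), (1, 0)]
assert is_live(2, pair, [0, 1])       # control token present: live
assert not is_live(2, pair, [0, 0])   # no token on the circuit: deadlock
```

The recursive search is adequate for graphs of the size discussed here; a very large graph would call for an iterative traversal.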
Definition 7: Consistency. A marked graph G is said to be consistent if
there exists a marking M and a transition firing sequence S from M back to M
such that every transition occurs at least once in S.
Property 6: Consistency in CMG. A connected computational marked graph G
is consistent. In addition, each transition of G occurs an equal number of
times in a firing sequence from a marking M back to M.
Proof. From Property 2, if a CMG is consistent, then there exists a marking
Md = M0 and a firing count vector xd > 0 such that A^T xd = 0. The converse
is also true. The incidence matrix for a marked graph G is an (n x m)
matrix A. If G is connected, then it is known [13] that the rank of A is n-1,
and thus the null space of A^T has dimension one. It is observed that
each row of A^T has one (1), one (-1), and all remaining terms are (0).
Therefore, if Cj denotes the jth column of A^T, it follows that

$$\sum_{j=1}^{n} C_j = 0.$$

Thus, there exists a vector xd = [k k ... k]^T, k > 0, which uniquely satisfies
A^T xd = 0. This completes the proof.
The final graph property considered in this section is safeness. This
property is first defined, and then it is shown that the CMG is safe.
Definition 8: Safeness. A marked graph G is said to be safe for marking M
if, for all markings reachable from M, no place contains more than one
token.
Property 7: Safeness in the CMG. The computational marked graph is safe
for all appropriate initial marking vectors.
Proof. By Property 4, the token count for each directed circuit of the CMG
is invariant under transition firing. Therefore it is sufficient to show
that each edge of the CMG belongs to at least one directed circuit contain-
ing a single token. By the construction rules for the CMG, all CMG edges
can be classified into two groups, NMG edges and linking edges. NMG edges
occur in groups of three and always form a directed circuit containing one
token. Linking edges occur in pairs, one forward directed and one backward
directed, and also form a directed circuit with the forward directed edges
of the NMG. One of the linking edges, but not both, always contains one
token while the forward directed edges of the NMG contain no tokens. There-
fore, each edge of the CMG is contained in a directed circuit with one
token, and the property follows.
III.4 Performance Analysis
The importance of the ATAMM model is that it establishes a context in
which to investigate the performance of decomposed algorithms in multipro-
cessor data flow architectures. In this subsection, performance measures
indicating computing speed and throughput capacity are defined. Bounds for
these quantities are calculated analytically from the algorithm marked graph
and the computational marked graph. This information is essential for efficiently
matching algorithm decompositions with architecture implementations.
The work presented in this subsection is an interesting application and
extension of recent investigations of the performance of Petri Nets [14],
[15] and marked graphs [16].
It is assumed that a decomposed algorithm is implemented in a multiprocessor
architecture containing R computing resources or functional units.
Each functional unit is capable of performing any of the primitive operations
whose sequence defines the decomposition. A computational task consists
of completing the algorithm for one frame of data and is initiated
when an input data token from the source node is encumbered. Task output
occurs when a corresponding output data token is deposited at the output
sink node. A task is completed when all computing associated with the task
is completed. It should be noted that task output and task completion do
not always coincide. In many iterative signal processing algorithms, computing
to generate initial conditions for the next iteration often occurs
after an output has been calculated. Task completion is usually indicated
in the AMG or CMG by the return of the graph to some steady-state initial
marking. To facilitate measurement of throughput capacity, it is assumed
that tasks are repeated periodically with new input data sets. New data
sets are available continuously as input tokens from the input source node.
Included in this problem class are iterative algorithms where the present
task requires as inputs data from previous task calculations.
Concurrency in this problem setting occurs in two ways. First, different
functional units may perform simultaneously several primitive operations
belonging to a single task. This type of concurrency is referred to as
vertical concurrency. Vertical concurrency has a direct effect on task
computing speed. It is limited by the number of primitive operations that
can be performed simultaneously in a given algorithm decomposition, and by
the number of functional units available to perform the primitive operations.
Second, different functional units may perform simultaneously
primitive operations belonging to different tasks sequentially input to the
computing system. Called horizontal concurrency, this type of concurrency
has a direct effect on throughput capacity. It is limited by the capacity
of the graph to accommodate additional task inputs, and by the number of
functional units available to implement the tasks. In the following it is
shown that the process of algorithm decomposition imposes bounds on the
amount of vertical concurrency and horizontal concurrency possible in a given
problem. If sufficient computing resources are available, operation at
these bounds can be achieved. If the number of computing resources is limited,
the bounds cannot be reached simultaneously and trade-offs between the
amount of vertical concurrency and horizontal concurrency are possible.
Three performance measures for concurrent processing are defined. The
first two parameters, TBIO and TT, are indicators of computing speed and
reflect the degree of vertical concurrency. The third parameter, TBO, is a
measure of throughput capacity and thus reflects the degree of horizontal
and vertical concurrency.
Definition 9: TBIO. The performance measure TBIO is the computing time
which elapses between a task input and the corresponding task output.
Definition 10: TT. The performance measure TT is the computing time which
elapses between a task input and the completion of all computation associated
with that task.
Definition 11: TBO. The performance measure TBO is the computing time
which elapses between successive task outputs when the graph is operating
periodically in steady-state.
The remainder of this section is devoted to developing lower bounds for
these performance measures.
Let G denote an algorithm marked graph representing a decomposed
algorithm. The lower bound for TBIO is the shortest time required for a
data token from the data input source to propagate through the graph to the
data output sink. Similarly, the lower bound for TT is the shortest time
required to complete all computing activity initiated by the injection of a
token from the data input source. These shortest times are the actual performance times
when only a single task is active in the graph during any time interval (no
horizontal concurrency), and as many computing resources as are required are
available (maximum vertical concurrency). Under these operating conditions,
lower bounds for TBIO and TT are calculated by identifying certain longest
paths in a graph obtained from the algorithm marked graph. This new graph,
called the modified algorithm graph GM, is defined and then used to
determine lower bounds for TBIO and TT.
Definition 12: Modified Algorithm Graph. Let pi be a place of G, directed
from transition tr to transition ts, which contains a token of the initial
marking. The modified algorithm graph GM is obtained from the graph G by
the following construction rules.
1. Place pi is deleted from G.
2. A new place pi1, directed from the data input source to transition
ts, is added to G.
3. A new output sink si, different from all other output sinks, and a
new place pi2, directed from transition tr to si, are added to G.
4. The above rules are repeated for each place of G containing a token
of the initial marking.
Lower bounds for TBIO and TT are presented in Theorem 1 and Theorem 2
respectively.
Theorem 1: Lower Bound for TBIO. Let Pi be the ith directed path in GM
from the data input source to the data output sink, and let T(Pi) denote the
sum of transition times for transitions contained in Pi. Then,

$$\mathrm{TBIO}_{LB} = \max_i \{T(P_i)\},$$

where the maximum is taken over all paths Pi in graph GM.
Proof. Without loss of generality, let tf be the last transition in all
paths Pi directed from the data input source to the data output sink. Transition
tf is enabled when each input place for tf contains a token. Since
by assumption a computing resource is available, tf fires as soon as it
becomes enabled. Let pq be the last input place for tf to acquire a token,
and let tg be the input transition for place pq. Continuing this labeling
procedure results in a backward path construction process. This process is
repeated, first at tg, and then at each succeeding transition until the data
input source is reached, identifying a path Pj. By the construction process
for the path, it is clear that T(Pj) = Max {T(Pi)}, where the maximum is
over all paths Pi in GM. It is also clear that TBIOLB can be no shorter
than T(Pj) so that TBIOLB ≥ T(Pj). Since a computing resource is available
when each transition in Pj is enabled, the time between input and corresponding
output can be no longer than T(Pj) so that TBIOLB ≤ T(Pj). Therefore,
TBIOLB = T(Pj) = Max {T(Pi)}, where the maximum is over all paths Pi
in GM. This completes the proof.
Theorem 2: Lower Bound for TT. Let Pi be the ith directed path in GM from
the data input source to any output sink and let T(Pi) denote the sum of
transition times of transitions contained in Pi. Then,

$$\mathrm{TT}_{LB} = \max_i \{T(P_i)\},$$

where the maximum is taken over all paths Pi in graph GM.
Proof. By the construction rules for graph GM, a task is initiated when
input data tokens are input from the data input source, and is completed
when all output sinks have accepted tokens. Therefore, TT is the time which
elapses from injection of input tokens to the arrival of a token at the last
fired output sink. Let T(Pt) = Max {T(Pi)}, Pi in GM, be the longest path
time of paths from the data input source sI to any output sink, say st.
Since a token must reach sink st before a task is completed, it follows that
TTLB ≥ T(Pt). Since a resource is available for each transition to fire
when enabled, and since Pt is the longest path in GM, it also follows that
TTLB ≤ T(Pt). Therefore, TTLB = T(Pt) = Max {T(Pi)}, where the maximum is
over all paths Pi in GM. This completes the proof.
To illustrate the application of Theorem 1 and Theorem 2, TBIOLB and
TTLB are computed for the algorithm graph shown in Fig. 1. For this example,
the following transition times are assumed: T(1) = 4, T(2) = 1, T(3) =
5, and T(4) = 6. The modified algorithm graph corresponding to Fig. 1 is
shown in Fig. 5. The modified algorithm graph contains two paths directed
from the data input source sI to the data output sink sO. Path P1 consists
of edge set {1, 2, 3, 4} with T(P1) = 10, and path P2 consists of edge set
{5-1, 3, 4} with T(P2) = 6. Therefore, since T(P1) > T(P2), path P1 determines
the lower bound for TBIO and TBIOLB = 10. The modified algorithm
graph contains two additional directed paths from the data input source sI
to the output sink s5. Path P3 consists of edge set {1, 2, 6, 5-2} with
T(P3) = 11, and path P4 consists of edge set {5-1, 6, 5-2} with T(P4) =
7. Since T(P3) > T(P1) > T(P4) > T(P2), path P3 determines the lower bound
for TT and TTLB = 11.
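The longest-path calculation of Theorem 1 and Theorem 2 can be sketched in Python as follows. Fig. 1 and Fig. 5 are not reproduced here, so the edge endpoints in `EDGES` are an assumption inferred from the path edge sets and path times quoted in this example; the node labels `"sI"`, `"sO"`, `"s5"` and the function name are likewise illustrative. Only the transition times are taken directly from the text.

```python
from functools import lru_cache

# Transition times from the text: T(1) = 4, T(2) = 1, T(3) = 5, T(4) = 6.
TIMES = {1: 4, 2: 1, 3: 5, 4: 6}

# Assumed modified algorithm graph: edge label -> (tail node, head node).
# Edge 5 (the initially marked place from transition 4 to transition 2)
# has been split into 5-1 and 5-2 as described in Definition 12.
EDGES = {
    "1": ("sI", 1),
    "2": (1, 2),
    "3": (2, 3),
    "4": (3, "sO"),
    "6": (2, 4),
    "5-1": ("sI", 2),
    "5-2": (4, "s5"),
}


def longest_path_time(edges, times, source, sink):
    """Largest sum of transition times over directed paths source -> sink.

    Returns None when the sink cannot be reached. The graph must be
    acyclic, which holds for a modified algorithm graph because every
    initially marked place has been removed.
    """
    succ = {}
    for tail, head in edges.values():
        succ.setdefault(tail, []).append(head)

    @lru_cache(maxsize=None)
    def best(node):
        if node == sink:
            return 0
        reachable = [b for b in (best(n) for n in succ.get(node, []))
                     if b is not None]
        if not reachable:
            return None
        # Sources and sinks carry no transition time.
        return times.get(node, 0) + max(reachable)

    return best(source)


tbio_lb = longest_path_time(EDGES, TIMES, "sI", "sO")
tt_lb = max(longest_path_time(EDGES, TIMES, "sI", sink)
            for sink in ("sO", "s5"))
print(tbio_lb, tt_lb)  # prints: 10 11
```

Under the assumed topology the sketch reproduces TBIOLB = 10 and TTLB = 11 as stated above.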
Next a lower bound for the performance measure TBO is presented. Let G
be a computational marked graph representing a decomposed algorithm. It is
assumed that operating conditions for G are set to maximize horizontal con-
currency. That is, data tokens are continuously available at the data input
source, and as many computing resources as needed can be called to perform
primitive operations. With these conditions, the graph plays periodically
in steady-state, and TBOLB is the shortest time possible between successive
outputs.
Theorem 3: Lower Bound for TBO. Let G be a computational marked graph and
let Ci be the ith directed circuit in G. The notation T(Ci) denotes the sum
of transition times of transitions contained in Ci, and M(Ci) denotes the
number of tokens contained in Ci. Then,

$$\mathrm{TBO}_{LB} = \max_i \{T(C_i)/M(C_i)\},$$

where the maximum is taken over all directed circuits in G.
Proof. Without loss of generality, let tf be the output transition in G so
that an output is produced each time tf completes the firing. Then TBOLB is
the minimum firing period of transition tf. By Property 6, G is consistent
so that all transitions of G fire periodically with minimum period TBOLB.
It is shown in [12] (pp. 58-60) that the minimum firing period of each transition
of a marked graph is given by Max {T(Ci)/M(Ci)}, where the maximum is
taken over all directed circuits Ci in G. Therefore, the theorem follows.
The computational marked graph shown in Fig. 3 is used to illustrate
Theorem 3. This CMG contains many directed circuits. However, the directed
circuit which contains all NMG nodes of transitions 2 and 4 contains only
one token and maximizes the ratio T(Ci)/M(Ci). Therefore, the shortest time
possible between successive outputs in this graph is TBOLB = 7. In the next
subsection, a strategy for achieving optimum time performance is investigated.
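A brute-force evaluation of the bound in Theorem 3 can be sketched in Python by enumerating the simple directed circuits of a small graph. The example graph below is not the CMG of Fig. 3: it is a simplified stand-in, assumed for illustration, in which each NMG is collapsed to a single transition with a one-token self-loop (the "process ready" place), each internal algorithm edge becomes a forward and backward place pair, and source and sink nodes are omitted. The edge endpoints are inferred from the path lists of the preceding example, and all names are hypothetical.

```python
def simple_circuits(places):
    """Yield each simple directed circuit once, as a list of place indices.

    `places` holds (from_transition, to_transition, tokens) triples with
    integer transition labels. Each circuit is rooted at its smallest
    transition, so it is reported exactly once. Exhaustive depth-first
    enumeration; suitable for small graphs only.
    """
    out = {}
    for j, (src, _dst, _tokens) in enumerate(places):
        out.setdefault(src, []).append(j)

    def extend(start, node, path, visited):
        for j in out.get(node, []):
            nxt = places[j][1]
            if nxt == start:
                yield path + [j]
            elif nxt > start and nxt not in visited:
                yield from extend(start, nxt, path + [j], visited | {nxt})

    for start in sorted(out):
        yield from extend(start, start, [], {start})


def tbo_lower_bound(places, times):
    """Max over directed circuits of T(C) / M(C)."""
    best = 0.0
    for circuit in simple_circuits(places):
        tokens = sum(places[j][2] for j in circuit)
        if tokens == 0:
            raise ValueError("token-free circuit: the graph is not live")
        # Each place on a circuit leaves a distinct transition of it.
        total_time = sum(times[places[j][0]] for j in circuit)
        best = max(best, total_time / tokens)
    return best


TIMES = {1: 4, 2: 1, 3: 5, 4: 6}
PLACES = [
    (1, 2, 0), (2, 1, 1),   # edge 1 -> 2: no data yet, control token present
    (2, 3, 0), (3, 2, 1),   # edge 2 -> 3
    (2, 4, 0), (4, 2, 1),   # edge 2 -> 4
    (4, 2, 1), (2, 4, 0),   # feedback edge 4 -> 2: initial data available
    (1, 1, 1), (2, 2, 1), (3, 3, 1), (4, 4, 1),  # collapsed NMG self-loops
]
print(tbo_lower_bound(PLACES, TIMES))  # prints: 7.0
```

In this simplified graph the circuits through transitions 2 and 4 carry one token and 7 time units, which agrees with the value TBOLB = 7 quoted above.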
III.5 Strategy for Optimum Time Performance
A model describing decomposed algorithms for implementation in a distributed
data flow architecture is described in Subsections III.2 and III.3,
and performance measures are defined in Subsection III.4. An important
problem remaining is to develop an operating strategy for the ATAMM model
which achieves optimum time performance with a minimum number of computing
resources. Unfortunately, this problem is equivalent to a class of scheduling
problems which is known to be NP-complete. Thus, no efficient algorithm
is known for obtaining an optimum solution; in the worst case, known methods
are essentially no better than enumerating all possible solutions and then
choosing the best one. However, an important
suboptimal operating strategy which achieves optimum time performance, but
possibly requires more than the minimum number of computing resources, has
been developed. This strategy is presented and illustrated by example in
this subsection.
When presented with continuously available input data sets, the natural
behavior of a data flow architecture results in operation where new data
sets are accepted as rapidly as the available resources permit. That is,
the architecture naturally operates at high levels of horizontal concurrency
with the possible loss of capability for achieving high levels of vertical
concurrency. This results in performance characterized by high throughput
rates, TBO = TBOLB, but relatively poor task computing speed so that TBIO >>
TBIOLB and TT > TTLB. In many signal processing and control applications,
it is important to achieve both high throughput rate and high task computing
speeds. Often, designers are willing to provide extra hardware to realize
optimum time performance. The suboptimal operating strategy presented in
this section results in performance having the following characteristics.
1. When R ≥ RMax, operation achieves TBIOLB, TTLB, and TBOLB. RMax is
computed in implementing the strategy, and represents the minimum
number of resources which insures maximum horizontal concurrency
and maximum vertical concurrency under this strategy.
2. When RMax > R ≥ RMin, operation achieves TBIOLB and TTLB, but TBO
≥ TBOLB. The strategy preserves task computing speed or vertical
concurrency at the expense of throughput rate or horizontal concurrency.
RMin is also computed in implementing the strategy, and
represents the minimum number of resources needed to maintain
vertical concurrency with limited horizontal concurrency.
3. When RMin > R ≥ 1, operation continues but performance degrades so
that TBIO ≥ TBIOLB, TT ≥ TTLB, and TBO ≥ TBOLB.
Implementation of the operating strategy is illustrated in Fig. 6. All
that is required is to limit the rate at which new input data are presented
to the CMG. This is accomplished by adding a control transition connected
in a directed circuit with the data input source. The control transition
imposes a minimum delay of D time units between inputs. Delay D is chosen
according to the following rule:

$$D = \begin{cases} \mathrm{TBO}_{LB}, & R \ge R_{Max} \\ \mathrm{TBO}_{Min}, & R_{Max} > R \ge R_{Min} \\ \mathrm{TCE}, & R_{Min} > R \ge 1. \end{cases}$$

TCE denotes the total computing effort required to complete the task, and
TBOMin, RMax, and RMin are computed as part of the strategy design procedure.
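The delay rule can be written as a small Python function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and TBOMin is supplied as a table indexed by R because its values come from the design procedure rather than from a formula. The numbers in the usage lines are those reported for the example algorithm graph later in this subsection.

```python
def input_delay(r, r_max, r_min, tbo_lb, tbo_min, tce):
    """Minimum delay D between inputs for R = r functional units.

    tbo_min maps each R with r_min <= R < r_max to TBOMin(R).
    """
    if r < 1:
        raise ValueError("at least one functional unit is required")
    if r >= r_max:
        return tbo_lb       # full horizontal and vertical concurrency
    if r >= r_min:
        return tbo_min[r]   # vertical concurrency kept, throughput reduced
    return tce              # one task at a time


# Values for the example algorithm graph:
# RMax = 5, RMin = 3, TBOLB = 3, TBOMin(4) = 3.5, TBOMin(3) = 4, TCE = 10.
params = dict(r_max=5, r_min=3, tbo_lb=3, tbo_min={4: 3.5, 3: 4}, tce=10)
for r in (6, 5, 4, 3, 1):
    print(r, input_delay(r, **params))
```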
The operating strategy design process consists of five steps. These
steps are presented and explained in the remainder of this subsection. An
operating strategy is developed for the example algorithm graph shown in
Fig. 7 to illustrate each step as it is presented.
Step I. Choose a convenient transition firing rule. A rule to determine
when an enabled transition in the CMG fires must be specified. A natural
rule is to specify that enabled transitions fire when a computing resource
is available. If conflict exists, such as when there are more enabled
transitions than computing resources, then firing occurs according to a
priority ordering of the transitions. For the example algorithm graph, the
highest to lowest priority ordering of the transitions is chosen as
(5,4,3,7,2,6,1).
Step 2. Determine TBOLB. The performance bound TBOLB is found from the
computational marked graph by application of Theorem 3. The CMG corresponding
to the example algorithm graph is shown in Fig. 8. The directed circuit
identified in this figure contains 6 transition time units and 2 tokens,
and maximizes the ratio T(Ci)/M(Ci) for all directed circuits. Therefore,
TBOLB = 3.
Step 3. Determine the resource utilization envelope of a single task
required for maximum vertical concurrency at steady-state with TBO = TBOLB.
The purpose of this step is to determine the number of computing resources
required as a function of time to achieve maximum vertical concurrency in a
single task. The envelope is determined by playing the graph assuming
unlimited resources and an input period of TBOLB until steady-state operation
is reached. The resource utilization envelope is obtained by counting the
number of computing resources used for a single task during each time
interval. The play of the example algorithm graph under these conditions is
shown in Fig. 9, and the resulting resource utilization envelope is shown in
Fig. 10.
Step 4. Stabilize the resource utilization envelope by adding control
places as necessary. If the time between inputs to the CMG is increased
above TBOLB, the resource utilization envelope may change from that observed
in Step 3. Since knowledge of the envelope is required to calculate the
number of required resources, additional places are appended to the AMG and
the CMG to freeze the shape of the envelope. For example, the play of the
example algorithm graph of Fig. 8 with an injection time of 4 is shown in
Fig. 11. At this slower injection rate, transition 6 fires one time unit
earlier. To prevent time movement of transition 6, a control place directed
from transition 2 to transition 6 is added. This place prevents the firing
of transition 6 until transition 2 has completed firing. Thus the resource
utilization envelope computed for an input period of TBOLB is the envelope
for all input periods TBO ≥ TBOLB.
Step 5. Compute RMax, RMin, and TBOMin(R) using the resource utilization
envelope. RMax is determined by overlaying resource utilization requirements,
each delayed by TBOLB with respect to the previous one, as shown in
Fig. 12 for the example. RMax is equal to the largest resource requirement
during any time interval within the steady state operating period. RMin is
the minimum number of resources necessary to insure maximum vertical concurrency
with no horizontal concurrency. This number is equal to the maximum
resource requirement indicated in the resource utilization envelope for
a single task. For the example problem, RMax = 5 and RMin = 3. The value
of TBOMin for each resource number R between RMax and RMin inclusive, is
determined by increasing the delay between overlapping resource utilization
envelopes until the maximum resource requirement is R. TBOMin is the smallest
input delay to produce this resource requirement. For the example, the
calculations of TBOMin for R = 4 and R = 3 are illustrated in Fig. 13 and
Fig. 14 respectively. The results of these calculations are TBOMin(4) = 3.5
and TBOMin(3) = 4.
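The overlay calculation of Step 5 can be sketched in Python for an envelope sampled at unit time steps. The envelope values below are made up for illustration and are not the envelope of Fig. 10, so the printed numbers do not correspond to the example problem; the function names are also assumptions. Fractional delays such as 3.5 would require sampling the envelope on a finer time grid.

```python
def peak_demand(envelope, period):
    """Largest steady-state resource requirement when copies of a
    single-task envelope are overlaid, each delayed by `period` steps.

    envelope[t] is the number of resources one task uses during step t.
    """
    if period < 1:
        raise ValueError("period must be at least one time step")
    demand = [0] * period
    for t, used in enumerate(envelope):
        demand[t % period] += used  # tasks started period steps apart overlap
    return max(demand)


def tbo_min(envelope, resources, tbo_lb):
    """Smallest input period >= tbo_lb whose peak demand fits `resources`."""
    if resources < max(envelope):
        raise ValueError("fewer resources than RMin: envelope cannot be kept")
    period = tbo_lb
    while peak_demand(envelope, period) > resources:
        period += 1
    return period


# Hypothetical single-task envelope and lower bound (not from Fig. 10).
envelope = [1, 3, 2, 2, 1, 1]
tbo_lb = 3

r_max = peak_demand(envelope, tbo_lb)  # overlay at period TBOLB
r_min = max(envelope)                  # one task at a time
print(r_max, r_min)                    # prints: 4 3
print({r: tbo_min(envelope, r, tbo_lb) for r in range(r_min, r_max + 1)})
# prints: {3: 5, 4: 3}
```

The search terminates because once the period reaches the envelope length no two tasks overlap and the peak demand equals RMin.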
The performance of the example algorithm graph is summarized in Fig.
15. Optimum time performance of TBIOLB = TTLB = 7 and TBOLB = 3 is achieved
for R ≥ RMax = 5. At R = 4, TBIO and TT remain at the optimum values and
TBOMin increases to 3.5. At R = 3, TBIO and TT again remain at the optimum
values and TBOMin increases to 4. For values of R below RMin, time performance
generally degrades. However, in this example TBIO and TT remain at 7
for R = 2, while TBOMin increases to 6. Finally, at R = 1, performance
degrades to TBIO = TT = TBO = TCE = 10. Another perspective of algorithm
performance is shown in Fig. 16. This figure displays throughput rate,
(1/TBO), as a function of the number of functional units R. The peak height
of each bar indicates the maximum throughput rate which can be achieved with
the indicated number of processors. The bars also indicate more clearly
that operation at any throughput rate less than maximum is possible for a
given number of functional units. This design procedure is easily applied
to much larger algorithm graphs more representative of actual signal
processing and control problems.
IV.0 DIAGNOSTIC TOOL DEVELOPMENT
IV.1 Analyzer Development
IV.1.1 Introduction
Concurrent processing is the capability of a computer system to execute
two or more tasks at the same time. For example, a processor may execute a
given computation at the same time that an I/O coprocessor performs an I/O
operation. There are new computer architectures that organize processors in a
parallel fashion requiring customized algorithms to take advantage of the
parallelism of the systems. However, the models developed to describe these
architectures do not adequately model the issues of scheduling, coordination,
and communication (Ref. 17). On the other hand, the strategy proposed by
Stoughton and Mielke (Ref. 17-19) addresses these particular issues. The
strategy uses timed Petri nets (Ref. 20) to model processor behavior for each
computational node of an algorithm graph.
Detailed data are needed to evaluate and study the performance of a
concurrent processing system. Data such as the function of concurrency with
respect to time can be investigated. Therefore, a sophisticated evaluation of
the concurrent system can be performed. To achieve this objective, it is
indispensable that data, such as when the processing of a data packet is
initiated and when it is terminated, be available. Performance measures such
as TBIO or TBO can be obtained from global information such as when an input
is read by the graph and when its corresponding output is written. This kind
of information can be obtained from an outside observer which monitors the
system. The best information the system is able to provide is the firing of
every transition of every node during execution. With these data, a more
comprehensive study of a concurrent processing system can be done. Although
the system itself is used to provide the information, it does not affect the
performance of the system due to the relatively low speed communication channels
used in the prototype. Another method to probe the system should be
devised if high speed communication channels are to be used. This chapter
describes an analyzer system that yields the required evaluation. In Subsection
IV.1.2, a prototype system and its communication events will be closely
examined. What the Diagnostic Routines do in the Graph Manager and their
effect in the overall performance is contained in Subsection IV.1.3. How
information of internal events is recorded is presented in Subsection IV.1.4.
In Subsection IV.1.5, generalities of the Analyzer program are examined,
including what information is input to it and what is obtained as output data.
In Subsections IV.1.6, IV.1.7 and IV.1.8, how the Analyzer program processes
this output data to generate measurement information is presented. These
measurements include TBIO (Time between Input and Output), TBI (Time between
Inputs), TBO (Time between Outputs), concurrency of the computing system and
general average process times. In the last two Subsections IV.1.9 and
IV.1.10, a different tool is presented. This tool integrates the simulation
of the system and the analyzer in one program.
IV.1.2 Prototype and Its Communication Events
A prototype of a concurrent processing system was developed. It was
used to prove some of the theories of the graph representation of such systems
and to establish a basis for comparison of the simulation program to its
hardware counterpart. The block diagram of the prototype, which was origi-
nally presented in (Ref. 17), is shown in Fig. 17.
The system consists of several S-100 units using Intel 8088 micropro-
cessors. Each unit has I/O boards to communicate with the external world as
well as 32k of random access memory (RAM). For test purposes, there are three
units acting as processing elements or Functional Units, one as the Graph
Manager and one that serves as Global Memory. The communication between them
is made through serial ports (Standard RS-232). An IBM Personal Computer XT
(IBM-PC/XT) is used to communicate with the Graph Manager. A communication
channel can be set through the Graph Manager to the Functional Units and the
Global Memory.
The Graph Manager is designed to monitor the graph execution and is
itself controlled by the data flow in the system. The Graph Manager keeps a
record of the places in the graph as well as which functional unit is per-
forming which process node. The Graph Manager "schedules" the assignment of a
functional unit to a process node according to the priority of the nodes,
functional units available and the process nodes that can be fired.
A serial communication link is set between the Graph Manager and every
Functional Unit. A link is also set between the Global Memory and every
Functional Unit. Serial communication between the IBM-PC/XT and the Graph
Manager is used for initialization, and for controlling and monitoring of the
system.
When a node is found that can be fired, i.e., its input places are
full and its output places are empty (the latter requirement applies to the
single node model only), that node is assigned to a Functional Unit; i.e., the
node is fired. To fire a node, a communications protocol is initiated between
the Graph Manager and an available Functional Unit, as shown in Figure 18. This
protocol begins with the code word D for Do; it is followed by a Task Number,
the Input places or labels, and the Output places. This communication event is
called Assign Task. This information, which is given to the Functional Unit,
is taken from the graph data that are in the Graph Manager's memory. In this
step a task or a node is said to be assigned to a Functional Unit.
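The firing rule and the priority-based assignment described above can be sketched as follows. This is only an illustration in Python, not the Graph Manager's actual code: the `Node` record, the `places` dictionary, the message tuple layout, and the convention that a larger number means a higher priority are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    number: int
    priority: int   # assumed convention: larger value = higher priority
    inputs: list    # labels of the node's input places
    outputs: list   # labels of the node's output places


def can_fire(node, places):
    """Single-node-model rule: every input place full, every output place empty."""
    return (all(places[p] for p in node.inputs)
            and not any(places[p] for p in node.outputs))


def schedule(nodes, places, free_units):
    """Pair fireable nodes with free Functional Units, highest priority first.

    Returns hypothetical Assign Task messages of the form
    ('D', task number, input places, output places, unit).
    """
    ready = sorted((n for n in nodes if can_fire(n, places)),
                   key=lambda n: n.priority, reverse=True)
    # zip() stops when either the ready nodes or the free units run out.
    return [('D', n.number, n.inputs, n.outputs, unit)
            for n, unit in zip(ready, free_units)]
```

With one free unit and two fireable nodes, only the higher-priority node receives an Assign Task message; the other waits for the next scan.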
The next piece of communication done between the assigned Functional Unit
and the Graph Manager is the acknowledgement from the Functional Unit that the
input places have been read (Acknowledge Input). The Functional Unit reads
the data from the Global Memory using another protocol before Acknowledge
Input is sent to the Graph Manager.
Processing of data starts as soon as the input data are acknowledged by
the Functional Unit. When processing is done, the unit informs the Graph
Manager that the process is finished and that it is ready to
place the output data in the Global Memory. The token information of the
output places of the associated node is examined and it is verified that the
output places are empty (the latter event is true only for the triple node
model). The code for Outputs Empty is sent to the Functional Unit that is
working on that node.
The data is written to the output places once the Functional Unit has
clearance for writing. The Graph Manager is informed when the output is
written and the Functional Unit is freed; i.e., the Functional Unit is in a
wait state until the next task is assigned to it.
IV.1.3 Graph Manager Diagnostic Routines
The entire concurrent processing system is accessible to the Graph Man-
ager; therefore, the Graph Manager is the most suitable subsystem to inform
the outside world of what events are taking place in the concurrent system.
In order to keep a proper time record of the different events in the
graph, an internal real-time clock is started simultaneously with the exe-
cution of the graph. As each event is recorded, the clock is read to register
the time at which the event is taking place.
There are five different events that are recorded. These correspond to
the communication events previously mentioned:
1 (F) Firing of a process node and binding to a Functional Unit.
      "Assign Task".
2 (I) Input places read by the Functional Unit (process node).
      "Acknowledge Input".
3 (P) Process done by the Functional Unit (process node).
4 (S) Output places empty. "Enable Outputs".
5 (O) Output places written by the Functional Unit (process node).
      "Acknowledge Output".
It should be noted that after a node and a Functional Unit have been
assigned to each other, they cannot be distinguished from each other. They
become one entity, and this is the only time when either one, the node or the
Functional Unit, is considered active.
Every event is recorded in the following format:
T{clock}N{node number}{event}[functional unit number]
where {event} can be any of the following letters:
F (The node fires),
I (The input data is read),
P (The process is done),
S (The output places are empty), and
O (The output data is written).
The parameter [functional unit number] is only written when {event} is
equal to 'F.' To simplify the reading of the file, commas are inserted
between every letter and number. The output file of the Simulation program
(Ref. 21) does not require such an adjustment since it is already
provided with commas.
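As an illustration of the record format, a single event record could be decoded as sketched below. The exact placement of the commas is not reproduced in this section, so the pattern simply tolerates an optional comma at every letter/number boundary; the function name `parse_record` and the returned tuple layout are assumptions made for the example.

```python
import re

# T{clock}N{node number}{event}[functional unit number]
# Optional commas allow both the raw and the comma-separated form (assumed layout).
RECORD = re.compile(r"T,?(\d+),?N,?(\d+),?([FIPSO]),?(\d+)?")


def parse_record(text):
    """Return (clock, node, event, unit); unit is None when none is recorded."""
    match = RECORD.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a FIPSO record: {text!r}")
    clock, node, event, unit = match.groups()
    return int(clock), int(node), event, (int(unit) if unit else None)
```

For instance, `parse_record("T120N3F2")` gives clock 120, node 3, event F and functional unit 2, while a record for any other event carries no unit number.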
Any probe that is installed in a system for testing purposes introduces
some error in the reading. The probe used here is no exception to the rule
and, in order to minimize the error, a special interrupt driven routine was
written. The diagnostic routines use a buffer to write the information of
every graph event. This buffer is accessed every time the real-time clock is
incremented, and if the serial port to the IBM-PC is ready to send a character,
this routine sends the next character in the buffer. If there are no charac-
ters in the buffer or the serial port is not ready, the routine just increments
the internal clock and exits without further action. To avoid the time
that it would take to write the commas to the output, a post-processor program
was written that inserts the commas in their proper places. Due to the low
speed communication channels, this scheme is good enough to minimize any delay
introduced in the system by these Diagnostic Subroutines.
IV.1.4 Sequential Account for Concurrent Processing
All the events that are reported in the format explained in Subsection
IV.1.3 are captured in a file that becomes what has been called the "ticker
tape". This file contains all the necessary information to analyze the per-
formance of the system. This file is called the FIPSO file because it
accounts for Firing, Input, Process, OutStat and Output of every node in the
graph. OutStat is the "enable outputs" signal sent by the Graph Manager to
the Functional Unit as shown in Fig. 18. A sample of a FIPSO file is pre-
sented in Fig. 19.
If the time between two events is desired, the difference between their
clock values has to be computed. If the number of computers that were
working at the same time during a certain interval is requested, the
procedure to obtain this number is much more complex, but not impossible.
With this kind of information, the encumbering and depositing of tokens
can be monitored, although there is no direct information about these parti-
cular events. Knowing the graph topology, the depositing of tokens is done
when a node writes data to its output places. The tokens are encumbered when
a specific node is fired. Although it is not obvious, any type of event can
be registered with this information. Getting the information can be a complex
job but with the help of a specialized program this can be done rather
easily.
IV.1.5 Analyzer Program
The Analyzer is a program that reads FIPSO files and obtains different
data from the execution of the given graph (see Fig. 20). The data is
processed to obtain such information as TBO and TBIO.
The file is read and the information is placed in a two-dimensional
array, which for convenience is also called the FIPSO array. This array has
fields defined as follows:
           Clock    Node 1   Node 2   Node 3   ...
Event #1   [ ]      [ ]      [ ]      [ ]
Event #2   [ ]      [ ]      [ ]      [ ]
Event #3   [ ]      [ ]      [ ]      [ ]
   :        :        :        :        :
The clock field contains the value of the clock at the time of the event.
The node field contains a code that indicates the event the node is in and, if
any, which functional unit is working on it.
The primary display of this program shows the activity of every node in
the graph (see Fig. 21). The display is actually several plots aligned in
time, i.e., all of them sharing the same time axis. In this way the activity
of every node can be compared with the rest. For example, it can be deter-
mined if several nodes were active at the same time. Another display shows
the activity of every functional unit instead of the nodes (see Fig. 22).
Among other data, the concurrency of the system can be extracted at any inter-
val in time or for the entire graph execution. In this manner, there is a
display of the concurrency as a function of time. Other data are obtained and
are explained in detail in the following sections.
IV.1.6 Measurement of TBIO, TBO, TBI
To measure TBIO, TBO, and TBI of the system, there is the need to know
which are the input and output nodes of the system. Since this cannot be
reliably extracted from the obtained information, these are parameters that
have to be supplied beforehand to calculate the desired data. After the
program determines which nodes are the input and the output of the system, it
proceeds to search in the matrix for occurrence of
1) When input data is read by the input node, and
2) When output data is written by the output node.
These times are recorded in another matrix for further use. Every time an
output is written by the output node the time from its corresponding input is
calculated and stored in the same array.
After every output has been recorded, TBI and TBO are calculated. For
TBI, this is done starting from the last input entry and going down to the
second input entry, substituting the ith entry by the difference of the ith
entry and the (i-1)st entry. Calculation of TBO is done similarly, except that
the output data is used instead of the input data. This output difference
calculation may be expressed by
$$ t_{O_i} = t_{O_i} - t_{O_{i-1}} \quad \text{for } i = n, n-1, n-2, \ldots, 2 $$
where n is the number of outputs. The input difference calculation is simi-
larly performed by
$$ t_{I_i} = t_{I_i} - t_{I_{i-1}} \quad \text{for } i = n, n-1, n-2, \ldots, 2 $$
where n is the number of inputs.
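The backward differencing described above can be sketched as follows. This is an illustration only, assuming that the input and output times have already been extracted into two lists and that the k-th output corresponds to the k-th input (data packets leave the graph in the order they entered); the function names are hypothetical.

```python
def differences_in_place(times):
    """Replace entry i by times[i] - times[i-1], for i = n down to 2 (1-based).

    Sweeping from the last entry backwards means every subtraction still
    sees the original value of its predecessor. The first entry keeps its
    absolute time.
    """
    for i in range(len(times) - 1, 0, -1):
        times[i] = times[i] - times[i - 1]
    return times


def measure(input_times, output_times):
    """Return (TBIO, TBI, TBO) as lists of time differences."""
    # Assumption: the k-th output is paired with the k-th input.
    tbio = [t_out - t_in for t_in, t_out in zip(input_times, output_times)]
    tbi = differences_in_place(list(input_times))[1:]
    tbo = differences_in_place(list(output_times))[1:]
    return tbio, tbi, tbo
```

Sweeping forward instead would overwrite a predecessor before it is used, which is why the calculation runs from i = n down to 2.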
The display yields such information as when the system reached steady
state (see Figure 23). When TBI, TBO, and TBIO do not change from one data
packet to the next, the system is said to be in steady state.
IV.1.7 Concurrency Measurement
Concurrency is the property associated with the capability of a computing
system of executing two or more tasks at the same time. The concurrency
function or what has lately been called the "Resource Utilization Envelope"
can be measured or displayed in a rather simple fashion.
To obtain the concurrency information, the FIPSO array is swept in its
two dimensions. The array is swept along the "event" rows and along the
clock and nodes columns (see Subsection IV.1.4). At every row in the array,
every node is checked for activity and the sum of all active nodes is obtained
for that time or row. This is done for every row in the array and the
function of number of resources vs. time is plotted on the screen. This is
the Concurrency Display (see Fig. 24).
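The row-by-row sweep can be sketched as below. The sketch works on a list of (clock, node, event) tuples rather than on the two-dimensional FIPSO array, and it assumes that a node counts as active from its F event until its O event, in line with the remark in Subsection IV.1.3 that a node and its Functional Unit are active only while bound to each other. The function name is hypothetical.

```python
def concurrency_profile(events):
    """Count active nodes after each recorded event.

    events: iterable of (clock, node, event) tuples in time order.
    Returns one (clock, active_count) pair per event row.
    """
    active = set()
    profile = []
    for clock, node, event in events:
        if event == 'F':        # node fired and bound to a Functional Unit
            active.add(node)
        elif event == 'O':      # output written, Functional Unit freed
            active.discard(node)
        profile.append((clock, len(active)))
    return profile
```

Plotting the second element of each pair against the first gives a step function of the kind shown in the Concurrency Display.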
There is a value that is also obtained. It is called Computing Power
(CP). This value is equal to the area under the curve of the Concurrency
Display or the "Resource Utilization Envelope." The unit of this figure is
"computer-seconds". The "Resource Utilization" can be obtained by
$$ RU = \frac{CP \times 100}{n \times T_E} $$
where RU is Resource Utilization (%), CP is Computing Power (computer-
seconds), n is the number of resources (computers) and $T_E$ is Execution Time
(seconds). These two quantities can be obtained for the entire execution or
for a portion of it. The interval over which the evaluation is made is
defined by the user.
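A minimal sketch of the two quantities follows, assuming the concurrency function is available as a list of (clock, active_count) pairs forming a step function, and taking the execution time as the span of that list. The function names are hypothetical, and restricting the evaluation to a user-defined interval is not handled here.

```python
def computing_power(profile):
    """Area under the step function given as (clock, active_count) pairs."""
    return sum(count * (t_next - t)
               for (t, count), (t_next, _) in zip(profile, profile[1:]))


def resource_utilization(profile, n_resources):
    """RU (%) = CP * 100 / (n * TE), with TE taken as the span of the profile."""
    execution_time = profile[-1][0] - profile[0][0]
    return computing_power(profile) * 100 / (n_resources * execution_time)


# Two resources, busy for a combined 110 time units over a 70-unit run.
profile = [(0, 1), (10, 2), (20, 2), (50, 1), (70, 0)]
cp = computing_power(profile)            # 1*10 + 2*10 + 2*30 + 1*20 = 110
ru = resource_utilization(profile, 2)    # 110 * 100 / (2 * 70), about 78.6 %
```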
A table showing percentages of numbers of resources concurrently used
with respect to the execution time is displayed. Thus the maximum possible
concurrency and its percentage with respect to the execution time can be
determined.
IV.1.8 General Statistics
The different transition times have an exact value in the original simu-
lation program (Ref. 21). However, in a hardware implementation there are
some variations in these transition times. For example, a memory reading may
take a longer or shorter time than expected.
There is a menu option that allows the user to get the average transition
time for any node. The only parameter supplied is the node number. The
program will scan the FIPSO array and calculate the average time to read the
input data, process the data, wait for output place clearance and write the
output data for the node indicated.
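The scan for one node can be sketched as follows. The phase names (`read`, `process`, `wait`, `write`) and the (clock, node, event) tuple layout are assumptions for this example; each phase is taken as the time between two consecutive events of the same node (F to I, I to P, P to S, S to O).

```python
from collections import defaultdict

# Consecutive event pairs and the (hypothetical) name of the phase they delimit.
PHASES = {('F', 'I'): 'read', ('I', 'P'): 'process',
          ('P', 'S'): 'wait', ('S', 'O'): 'write'}


def average_phase_times(events, node):
    """Mean duration of each phase for one node.

    events: (clock, node, event) tuples in time order.
    Returns a dict mapping phase name to average duration.
    """
    durations = defaultdict(list)
    previous = None  # (clock, event) of this node's most recent record
    for clock, n, event in events:
        if n != node:
            continue
        if previous is not None and (previous[1], event) in PHASES:
            durations[PHASES[(previous[1], event)]].append(clock - previous[0])
        previous = (clock, event)
    return {name: sum(times) / len(times) for name, times in durations.items()}
```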
In a hardware implementation of this concurrent system, the different
computers that serve as resources or functional units may have different main
clocks, or can be totally different computers and of course have some differ-
ences in the time that they would take to either read, process or write data.
This provides a way to obtain average time values of the activities in the
system for any given node.
IV.1.9 Graph Simulation/Analyzer
The Analyzer program is an invaluable tool for the analysis of the FIPSO
file of a single simulation. If the need for exploring the effect of param-
eter variation arises, additional program support is needed. This program is
called the Graph Simulation/Analyzer program. This program controls the simu-
lation and, immediately after execution, analyzes its data to obtain the
desired result or reading. Sometimes only a certain number of values are
required to be analyzed and then this specialized program is ideal for auto-
mated or batch simulation and execution analysis. An overview of its features
is presented in this subsection.
The Graph Simulation/Analyzer program contains basically the same simu-
lation kernel as the original Simulation program (Ref. 21). It has been
modified to allow the use of random variables as transition times.
The original Simulation program is not only a simulator but also a graph
creator, i.e., the graph need not be defined when the program is called, but
can be defined by the use of graphics commands. The Graph Simulation/Analyzer
needs to be supplied with a graph description and simulation control (GDSC)
file (see Fig. 25). This is a text file that can be created with any pure
ASCII word processor and the command syntax can be found in the manual of the
program in the appendix of this thesis.
The main purpose of this new program is to "schedule" a series of simu-
lations of a graph, change parameters, and collect specific output data such
as ATBIO (Average TBIO in steady state) or the usual FIPSO files. One of the
advantages over the former simulation program is that most of the program
setup can be in the GDSC file or, in short, the graph file. In this way,
setting up a simulation can be as quick as loading the graph file and typing a
few keystrokes. One of the disadvantages is that the execution of the graph
cannot be seen graphically. The only parameters that can be observed are the
clock and the number of outputs from the graph. Even the clock can be
suppressed from updating to reduce screen update overload. A notable differ-
ence with respect to the former simulation is the capability of adding random
variables to the different transition times in a graph. The range of vari-
ation is specified by the user in the graph file.
The new program is suitable to integrate a design tool for the concurrent
processing systems under study. The automatic control of the simulation
routine makes the program ideal to find, through iterations, some optimum
performance parameters for a given graph.
The program provides on-line context-sensitive help. At every stage
where user intervention is expected, the key F1 can be typed and a window
appears providing specific explanation of what the user may do at this part of
the program. The help window information can be as simple as the statement of
the purpose of the menu option or examples to illustrate the possible choices.
This program is expected to be changed in the future and to undergo a
series of enhancements. This is the reason it was written in C language, a
flexible and simple, yet powerful and easy-to-maintain language. The program
can be easily expanded or modified to meet the future demands of the ongoing
research.
IV.1.10 Output of the Graph Simulation/Analyzer
The Graph Simulation/Analyzer program generates only four kinds of files.
These are Average Time Between Input and Output (ATBIO), Average Time Between
Inputs (ATBI), Average Time Between Outputs (ATBO) and the FIPSO files. The
"average" files collect data that is calculated once the system has reached
the steady state. The computation of the steady state values is done by the
use of a running average, in the following manner:
1- An average is computed for the first six outputs (TBIO, TBI or TBO) and
stored in an average array.
2- The first of those outputs is then discarded and the seventh output is
taken to form the next six outputs.
3- Another average is computed for the new six outputs and is stored in
the next location of the average array.
4- This procedure is applied until there are no more outputs left to work
with.
5- The next step is to find which of the computed averages is within
+/- 1% of its predecessor.
6- An overall average is calculated beginning with this predecessor up to
the last average and this is the ATBIO, ATBI or ATBO.
The FIPSO files are obtained the usual way; that is, from the recording
of every event, every event code is translated to text and the FIPSO file is
created. The contents of this file can be examined in the Analyzer as explained
in the preceding sections.
There are some instances when, although the steady state has been
reached, the program will print "N/SS" (Non-Steady State) instead of the
value sought. This usually occurs because the running average has too few
outputs to work with and the reaching of steady state is hidden in one of the
averages, i.e., the +/- 1% is too restrictive to detect it. Another error
message that can be given is "N/EO," meaning "Not Enough Outputs." The
reason for this message is that there are fewer than nine outputs to work with,
which makes it difficult to calculate the average.
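The six-step running-average procedure and the two messages can be sketched as follows. This is an interpretation of the description above rather than the program's source: it assumes that the first average found within +/- 1% of its predecessor marks the start of steady state, and the function name and constants are hypothetical.

```python
WINDOW = 6          # outputs per running average (from the text)
TOLERANCE = 0.01    # +/- 1 %
MIN_OUTPUTS = 9     # fewer outputs than this yields "N/EO" (from the text)


def steady_state_average(values):
    """values: successive TBIO, TBI or TBO measurements.

    Returns the steady-state average, or "N/EO" / "N/SS".
    """
    if len(values) < MIN_OUTPUTS:
        return "N/EO"
    # Steps 1-4: running averages over a sliding window of six values.
    averages = [sum(values[i:i + WINDOW]) / WINDOW
                for i in range(len(values) - WINDOW + 1)]
    # Step 5: look for an average within +/- 1 % of its predecessor.
    for k in range(1, len(averages)):
        if abs(averages[k] - averages[k - 1]) <= TOLERANCE * abs(averages[k - 1]):
            # Step 6: overall average from that predecessor to the last average.
            tail = averages[k - 1:]
            return sum(tail) / len(tail)
    return "N/SS"
```

Because each running average mixes six outputs, a short run can reach steady state in the raw values while consecutive averages still differ by more than 1%, which is the situation in which "N/SS" is printed.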
The method of running averages is adequate to find when the graph reaches
steady state. However, it requires many graph outputs which may create a
great time burden in terms of simulation time. These computation factors
depend on the number of nodes of the graph, execution time and number of
resources available.
V.0 EXPERIMENTAL RESULTS
V.1 Introduction
This section presents the use of the Analyzer and the Graph Simu-
lation/Analyzer programs to evaluate the performance of two different graphs.
In Subsection V.2, a graph with parallel paths is investigated. TBOLB and
TBIOLB are calculated and a simulation of the system is performed. Analysis
of the output data is used to obtain the minimum number of resources necessary
to obtain maximum performance regardless of priority assignment. Subsection
V.2.4 is dedicated to investigating a graph with iterative loops. The same
data are obtained as in Subsection V.2. Subsection V.3 presents two
performance factors based upon TBOLB and TBIOLB.
V.2 Graphs With Parallel Paths
Graphs with parallel paths are important due to the possibility of high
concurrency in the execution of tasks. Fig. 26 presents an example of a graph
with three parallel paths. This example is used to illustrate the calculation
of TBOLB and TBIOLB.
The first step to calculate TBOLB and TBIOLB is to choose a Node Marked
Graph. The Single-Node model is selected because the resulting CMG is dead-
lock free. The second step is to obtain the CMG for the given graph. This is
shown in Fig. 27. The third step is to obtain the circuit that takes the
longest time to execute in order to obtain TBOLB. Fig. 28 shows the
circuit with the longest time per token. TBOLB is equal to 1065 time units.
As the fourth step, the path from the input to the output of the graph with
the longest time has to be located. This is shown in Fig. 29. This time is
TBIOLB and is equal to 2240 time units. The fifth step is to calculate the
data injection rate which is controlled by the input source node. The time
that has to be associated with this node is equal to the inverse of the input
injection rate. To obtain the effective input rate to the graph, it is neces-
sary to consider the input read time of the input node. The source node will
fire when a token is placed at its control edge. This is done when the input
read time of the input node is over. Therefore, the source node write time is
equal to
$$ \text{Write time} = \mathrm{TBOLB} - \text{Input read time (Input Node)}. $$
The effective input rate to the graph is
$$ IR = \frac{1}{\mathrm{TBOLB} - t_{IN1}} $$
where IR is the input rate, and $t_{IN1}$ is the input read time of node 1.
Since TBOLB is 1065 and $t_{IN1}$ is 140, the source node write time is 925.
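The arithmetic for the source node can be checked with a short sketch. The function name is hypothetical; the values are those quoted for this example (TBOLB of 1065 and an input read time of 140 time units), and the rate is taken as the inverse of the time associated with the source node, as stated above.

```python
def source_write_time(tbolb, input_read_time):
    """Write time of the source node = TBOLB - input read time of the input node."""
    return tbolb - input_read_time


# Parallel-path example: TBOLB = 1065, input read time of node 1 = 140.
write_time = source_write_time(1065, 140)   # 925 time units
source_rate = 1 / write_time                # data packets per time unit
```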
V.2.1 Simulation
The simulation is performed with the calculated data for all possible
numbers of resources. The simulation is executed for one resource, two
resources and so on, up to seven resources. The data is input to the Graph
Simulation/Analyzer by means of a Graph Description and Simulation Control
file. The simulation is stopped when the graph has processed fifteen data
packets. The GDSC file used to simulate the example is presented in Fig. 30.
Average TBO, Average TBIO and the FIPSO files are gathered for every
simulation cycle. Resource Utilization (RU) and maximum number of resources
concurrently used are obtained from these files.
The simulation was also run for another priority assignment. The former
priority assignment tries to output as many data packets as possible; the
latter tries to load the graph to its maximum before an output is written.
The first priority assignment has its highest priorities toward the output of
the graph, i.e., the closer to the output the higher the priority. In this
way, the highest priority in the graph is to process and output data. The
system tries to output data as soon as possible. The second priority assign-
ment tries to input as much data as it can before data is output. The closer
a node is to the input of the graph the higher is its priority.
V.2.2 Analysis of Output Data
The Graph Simulation/Analyzer and Analyzer data are tabulated in Tables 1
and 2. The computing power is about the same for every case since it is the
total computing power required for processing fifteen data packets. The
resource utilization decreases with the increase of number of resources. The
resource utilization is almost the same for one and two resources. For three
and more resources the resource utilization decreases more drastically for a
change of one resource. For every resource added to the system the resource
utilization is reduced by about ten percent.
TBOLB is closely achieved using more than four resources. The small
difference is due to the overhead time introduced by the Graph Manager, or the
Simulation, in the scanning and firing of the nodes of the graph. TBIOLB is
obtained using more than two resources. Again, the difference with respect to
the calculated value is due to the scanning of the graph.
This value of TBOLB was obtained for two different priority assignments.
The value of TBOLB is not calculated based on priority assignment but on the
transition times in the circuits of the graph. If it is obtained for a given
number of resources, it should be maintained regardless of the priority
assignment for at least the same number of resources.
The maximum number of resources used concurrently is five. After five
resources there is no effect on adding resources except to lower the resource
utilization. This graph can be executed at its optimum performance with five
resources.
V.2.3 Minimum Number of Resources for Maximum Performance
Two important values are observed in Table 1. These are the minimum
number of resources necessary to obtain TBOLB and the minimum number of
resources necessary to obtain TBIOLB. TBOLB is attained for at least five
resources and TBIOLB is attained for at least three resources. The minimum
number of resources for maximum performance is five since with this number of
resources TBOLB and TBIOLB are obtained. This minimum number of resources
coincides with the maximum concurrency in the graph. This value has been
obtained, by theoretical means, by the ODU research team and has been called
Rmax.
It is important to test if this minimum number of resources is indepen-
dent of priority assignment. The simulation of this graph was run for five
resources and for every possible priority assignment. It turned out that the
maximum performance was obtained for every priority assignment. This test
method is not recommended as a common practice since it requires too many
hours of simulation execution. It was done here as an exercise, to test
that this minimum number of resources is independent of priority
assignment for this example.
The minimum number of resources at which the TBIOLB is preserved is not
priority independent. A priority assignment can be found at which, for this
number of resources, TBIO is higher than TBIOLB. Table 3 shows the results for
the same graph with a different priority assignment than the last two. The
minimum number of resources at which TBIOLB is preserved is four instead of
three as in the last two examples of priority assignments.
It should be noted that the first two simulations performed in the graph
did not require more than thirty minutes, making the use of the Graph Simu-
lation/Analyzer and the Analyzer a viable method to evaluate the performance
of a given algorithm graph.
V.2.4 Graphs with Iterative Loops
Graphs with iterative loops belong to another class of graphs that is
important to the ongoing research. These kinds of algorithm graphs are found
primarily in applications such as digital signal processing or control sys-
tems, where data from predecessor cycles are needed for computation of a
present data packet. Figure 31 presents an example of a graph with iterative
loops.
The Single-Node model is also used in this example to model the nodes in
the graph. Figure 32 shows the resulting CMG, using the Single-Node model, of
the graph.
The circuit with the longest time per token in the CMG is located in
either of the iterative loops, nodes 2 and 5, or nodes 3 and 6. Since there
is only one token in the circuit, the value of TBOLB is 960 time units. The
effective write time of the input source is equal to TBOLB less the read time
of the input node. The value of the write time of the source node is 890 time
units.
Following the procedure described in Section 2.8, nodes 5 and 6 are
eliminated to calculate TBIOLB. The value of TBIOLB is equal to the sum of
the times from the input source to the output sink. This value is 1600 time
units.
V.2.5 Simulation
The simulation is performed with the calculated data for all possible
numbers of resources. The simulation is executed for one resource, two
resources and so on, up to six resources. The data is input to the Graph
Simulation/Analyzer by means of a Graph Description and Simulation Control
file. The simulation is stopped when the graph has processed fifteen data
packets. The GDSC file used for this example is presented in Fig. 33.
Average TBI, Average TBO, Average TBIO and the FIPSO files are gathered
for every simulation cycle. Resource Utilization (RU) and maximum number of
resources used concurrently are obtained from these files.
The simulation was run for two priority assignments. This difference in
priority assignments was explained in Subsection V.2.1.
V.2.6 Analysis of Output Data
The Graph Simulation/Analyzer and Analyzer data are tabulated in Tables 4
and 5. Rmax is equal to three for this graph with iterative loops. Both TBO
and TBIO degrade for numbers of resources less than Rmax. This is different
from the case of the example of Subsection V.2 in which only TBO degrades
below Rmax (in the mentioned example TBIOLB is also attained for one and two
resources below Rmax). For the first priority assignment TBIOLB is still
obtained for two resources, but for the second it degrades. This behavior
indicates that, for this graph, TBIO is priority dependent below Rmax.
There is a difference of ten or eleven time units between ATBI and ATBO
which is not expected since ATBI and ATBO should be equal for the conditions
of the simulation. There is also an increase in the average of TBIO with
respect to ATBIO for two resources in the first priority assignment. A more
detailed observation of the execution in the Analyzer reveals that the differ-
ence between TBO and TBI is being added to TBIO at every data packet. Every
time a data packet is injected in the graph, it takes ten more time units to
arrive at the output than the preceding data packet. This is the reason for
ATBIO being much higher than expected. The reason for the difference between
ATBI and ATBO can be observed in the Analyzer. The critical circuit, nodes
two and five, takes more time than calculated due to the scanning of the nodes
in the graph. This increase is directly applied to TBO, but TBI remains
the same as was calculated theoretically.
The source write time was incremented to 900 and the simulation was run
again. The results are as expected: ATBI is 975, ATBO is 975, and ATBIO is
1620 for Rmax.
The increase in the source write time is an experimental adjustment to
obtain the best possible performance. This yields an expression for a lower
bound TBO adjusted to compensate for system overhead during the execution:
$$ \mathrm{TBOLB_A} = \mathrm{TBOLB} + E $$
where $\mathrm{TBOLB_A}$ is the adjusted lower bound for TBO, and E is the adjustment
factor obtained from the simulation of the graph, or in the case of a hardware
system, the one obtained by executing the graph.
It should be noted that this adjustment factor, E, was not necessary for
the example of Subsection V.2. The two graphs of the examples are from two
different classes of graphs. The graph of Subsection V.2 belongs to a
class that has its input circuit directly "coupled" to the critical circuit
(the circuit with the longest time per token in the CMG). Two circuits are
coupled when they have a transition in common. The graph of this section is
from a class that has its input circuit "uncoupled" from the critical cir-
cuit, i.e., they are connected through other circuits in the graph. The graph
of Subsection V.2 is not as sensitive to variations in the time of the critical
circuit as the graph of this section. Since this subject is not in the scope
of this thesis, there will be no further analysis of these classes.
Without the help of the Simulation and the Analyzer, this adjustment
could not be made in such a short period of time. These adjustments sometimes
can be predicted, but the Analyzer is a required tool to discover these real-
ization differences in performance.
V.3 Performance Factors
There is a need for a performance factor that is independent of absolute
time, so that graphs can be classified by their performance. The absolute time
in a given graph is not as critical as the amount of time each node takes
relative to the others. If every transition time in any of the graphs
evaluated in this chapter is multiplied by a constant, the resulting graph has
the same critical circuit as the original graph. The only difference is in the
absolute value of the computation times. If the appropriate injection rate is
applied at the input, the same resource utilization is obtained.
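The scaling argument can be checked numerically. The following is a minimal sketch, assuming a graph can be summarized as a set of circuits, each with a list of transition times and a token count. The circuit names, the times, and the helper functions are hypothetical and are not taken from the simulator described in this report.

```python
from fractions import Fraction

# Hypothetical circuits: transition times around each circuit and the
# number of tokens it holds. The values are illustrative only.
CIRCUITS = {
    "input": {"times": [70, 210, 40], "tokens": 1},
    "loop": {"times": [140, 420, 80, 70, 210, 40], "tokens": 1},
}


def time_per_token(circuit):
    """Total transition time around a circuit divided by its token count."""
    return Fraction(sum(circuit["times"]), circuit["tokens"])


def critical_circuit(circuits):
    """Name of the circuit with the longest time per token."""
    return max(circuits, key=lambda name: time_per_token(circuits[name]))


def scaled(circuits, factor):
    """Copy of the circuits with every transition time multiplied by factor."""
    return {
        name: {"times": [factor * t for t in c["times"]], "tokens": c["tokens"]}
        for name, c in circuits.items()
    }


if __name__ == "__main__":
    for factor in (1, 3, 10):
        graph = scaled(CIRCUITS, factor)
        name = critical_circuit(graph)
        print(factor, name, time_per_token(graph[name]))
```

Multiplying every time by the same constant multiplies every time-per-token value by that constant, so the ordering of the circuits, and therefore the critical circuit, does not change.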
The TBO performance factor (PF_{TBO}) is obtained by

    PF_{TBO} = TBO_{LB} / TBO_m

where TBO_m is the measured TBO of the system.
The TBIO performance factor (PF_{TBIO}) is obtained by

    PF_{TBIO} = TBIO_{LB} / TBIO_m

where TBIO_m is the measured TBIO of the system.
It should be noted that the maximum possible value of these factors is
1.0. The values of the performance factors for the graphs of Sections 4.1 and
4.2 are presented in Tables 6 and 7, respectively.
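As a worked illustration of the two ratios, the sketch below evaluates them for the graph with iterative loops. The lower bounds (960 and 1600) are the values quoted in the header of the graph description file for the second experiment, and the measured values (975 and 1620) are the adjusted simulation results quoted earlier in this chapter. The function name is hypothetical.

```python
def performance_factor(lower_bound, measured):
    """Ratio of a theoretical lower bound to the measured value."""
    if measured <= 0:
        raise ValueError("measured value must be positive")
    return lower_bound / measured


# Values quoted in this report for the graph with iterative loops.
TBO_LB, TBIO_LB = 960, 1600
ATBO_MEASURED, ATBIO_MEASURED = 975, 1620

pf_tbo = performance_factor(TBO_LB, ATBO_MEASURED)
pf_tbio = performance_factor(TBIO_LB, ATBIO_MEASURED)

if __name__ == "__main__":
    print(f"PF_TBO  = {pf_tbo:.6f}")
    print(f"PF_TBIO = {pf_tbio:.6f}")
```

Both ratios come out just under 1.0, as expected when the measured times sit slightly above their lower bounds.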
VI. FURTHER RESEARCH
During the grant period, the ATAMM model was used as the basis for
analytically determining bounds for task computational time and system
throughput. An operating strategy which achieves optimum time performance was
developed. In addition, a new diagnostic tool was developed with which to
evaluate performance and functional unit behavior. The diagnostic tool
provides monitoring of detailed system operations and displays global system
performance indicators and measures.
Continuation of the present effort includes the development of a new
multicomputer test bed. The operating system and communication processes are
to obey the ATAMM model and to exhibit a completely distributed graph manager
operating system. The operating system is to allow for continuously assigned
functional units. This system is to be composed of personal computers
communicating over a local area network.
The ongoing research has established ATAMM as a viable basis for the
specification of data flow multicomputer systems. Further research should
proceed in several directions. An outline of these activities is presented
below.
1. Fault Tolerance. Because the ATAMM model inherently allows
   continuous assignment of the functional units, an ATAMM defined
   multicomputer system is evidently soft-failing with respect to
   hardware failure. That is, the algorithm may be expected to
   continue executing, though with degraded performance, when
   functional resources are eliminated. However, additional effort needs
   to be directed towards recovery strategies associated with errors in
   the data. One applicable method is triple modular redundancy (TMR),
   which involves the triplication of the node processes and majority
   voting. The TMR strategy needs to be investigated with respect to
   both performance and the preservation of the ATAMM model.
2. An important part of the ATAMM research program is to enhance the
   understanding of the relationship between performance measures such
   as TBIO, TBO, and TT and the algorithm graph characteristics and the
   availability of functional resources. On the basis of recent
   observations, research is to be directed toward the improvement of
   the performance measures as a result of modifying the algorithm
   graph by the addition of nonexecutable features such as control
   edges and "dummy" nodes. Present investigations suggest that these
   graph augmentations may alter and improve certain aspects of
   performance without changing the underlying algorithm.
3. Overhead. Research should be continued toward the refinement of the
   node marked graph (NMG) representation. This refinement should
   better model the time associated with communication overhead and
   other system overhead in relation to node process time. A goal of
   this modeling would be to determine limits on algorithm
   decomposition in view of graph complexity and increased
   communication overhead.
4. Advanced Hardware. An appropriate step in the ATAMM development is
   the infusion of the processing rules into advanced technology
   multicomputer hardware for avionic or space-borne applications. An
   appropriate environment would include VHSIC technology such as the
   MIL-STD-1750A processor as the processing element.
5. Theoretical Advancements. So far the ATAMM model has been used
   under somewhat restricted conditions. Further research should
   include such issues as multiple graphs, nonhomogeneous functional
   units, reliability, fault recovery strategies, and system
   architecture which takes advantage of the ATAMM model.
VII. REFERENCES
1. P. Treleaven, D. Brownbridge and R. Hopkins, "Data-driven and
   demand-driven computer architecture," Computing Surveys, vol. 14,
   pp. 93-143, March 1982.
2. V. Srini, "An architectural comparison of dataflow systems," Computer,
   pp. 68-88, March 1986.
3. W. Rheinboldt, "Report of the panel on future directions in
   computational mathematics, algorithms, and scientific software,"
   Sponsored by NSF Grant DMS-85-3483, SIAM, 1985.
4. T. Longo, G. Herzog and D. Maxwell, "A fast single chip 1750A CPU and
   compatible support components in VHSIC-size CMOS technology,"
   Proceedings of the Government Microcircuit Applications Conference,
   pp. 317-320, 1986.
5. W. Wehner, W. Everhart, S. Shankar and K. Stalsberg, "A VHSIC
   architecture for highly parallel image understanding," Proceedings of
   the Government Microcircuit Applications Conference, pp. 117-120,
   November 1986.
6. M. Sowa and T. Murata, "A data flow computer architecture with program
   and token memories," IEEE Transactions on Computers, vol. 31,
   pp. 820-824, September 1982.
7. K. Kavi, B. Buckles and U. Narayan Bhat, "A formal definition of data
   flow graph models," IEEE Transactions on Computers, vol. 35,
   pp. 940-948, November 1986.
8. M. Granski, I. Koren and G. Silberman, "The effect of operation
   scheduling on the performance of a data flow computer," IEEE
   Transactions on Computers, vol. 36, pp. 1019-1029, September 1987.
9. L. Jamieson, H. Siegel, E. Delp and A. Whinston, "The mapping of
   parallel algorithms to reconfigurable parallel architectures,"
   Proceedings of Future Directions in Computer Architecture and
   Software, D. Agrawal Ed., ARO Contract DAAG29-8-D-0100, pp. 147-154,
   May 1986.
10. J. Peterson, Petri Net Theory and the Modeling of Systems, Englewood
    Cliffs, N.J.: Prentice-Hall, 1981.
11. T. Murata, "Circuit theoretic analysis and synthesis of marked graphs,"
    IEEE Transactions on Circuits and Systems, vol. CAS-24, no. 7,
    pp. 400-405, July 1977.
12. T. Murata, "Modeling and analysis of concurrent systems," Handbook of
    Software Engineering, C. Vick and C. Ramamoorthy Editors, pp. 39-63,
    Van Nostrand Reinhold, 1984.
13. S. Seshu and M. Reed, Linear Graphs and Electrical Networks,
    Addison-Wesley Publishing Co., Inc., 1961.
14. J. Sifakis, "Performance evaluation of systems using nets," Net Theory
    and Applications, W. Brauer Editor, pp. 307-319, Springer-Verlag, 1979.
15. C. Ramamoorthy and G. Ho, "Performance evaluation of asynchronous
    concurrent systems using Petri nets," IEEE Transactions on Software
    Engineering, vol. 6, pp. 440-449, September 1980.
16. T. Murata, "Synthesis of decision-free concurrent systems for
    prescribed resources and performance," IEEE Transactions on Software
    Engineering, vol. 6, pp. 525-530, November 1980.
17. J. W. Stoughton and R. R. Mielke, "Strategies for concurrent processing
    of complex algorithms," Proceedings on Future Directions in Computer
    Architecture and Software, Army Research Office, Charleston, SC, May
    1986.
18. J. W. Stoughton and R. R. Mielke, "Petri net model for analysis of
    concurrently processed complex algorithms," Proceedings of IEEE
    Southeastcon 1986, Richmond, VA, March 1986.
19. J. Stoughton and R. Mielke, "Petri net model for concurrent processing
    of complex algorithms," Proceedings of the Government Microcircuit
    Applications Conference, San Diego, California, pp. 11-14, November
    1986.
20. T. Murata, "Use of resource-time product concept to derive a
    performance measure of timed Petri nets," Proceedings of Midwest
    Symposium on Circuits and Systems, vol. 28, pp. 407-410, August 1985.
21. K. Jackson, R. Tymchyshyn, R. Mielke and J. Stoughton, "Simulation
    software for concurrent processing," Proceedings of the IEEE
    Southeastcon, Tampa, Florida, pp. 82-86, April 1987.
TABLES
[Table body largely illegible in the scanned source. The surviving column
appears to list computing power by number of resources; the only clearly
legible entry is 49856 for 4 resources.]
TABLE 1. Results from first experiment, first priority assignment.
GRAPH WITH PARALLEL PATHS
PRIORITY ASSIGNMENT 1 2 [illegible] 4 5

RESOURCES  COMPUTING    RESOURCE     AVERAGE      AVERAGE      MAXIMUM
           POWER        UTILIZATION  TBO          TBIO         CONCURRENCY
1          [illegible]  [illegible]  3243.0       3235.0       1
2          50068        [illegible]  [illegible]  2582.0       [illegible]
3          [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
4          [illegible]  68.50%       [illegible]  2265.0       4
5          50002        57.34%       1083.0       2265.0       5
6          50002        47.78%       1083.0       2265.0       5
7          [illegible]  [illegible]  [illegible]  2265.0       5

TABLE 2. Results from first experiment, second priority assignment.
PRIORITY ASSIGNMENT [illegible]

RESOURCES  COMPUTING    RESOURCE     AVERAGE      AVERAGE      MAXIMUM
           POWER        UTILIZATION  TBO          TBIO         CONCURRENCY
1          [illegible]  [illegible]  3243.0       [illegible]  1
2          [illegible]  [illegible]  [illegible]  3243.0       2
3          [illegible]  [illegible]  1324.0       2346.8       3
4          49592        68.56%       [illegible]  2273.0       4
5          [illegible]  57.35%       1083.0       2273.0       5
6          [illegible]  [illegible]  [illegible]  [illegible]  5
7          [illegible]  [illegible]  1083.0       2273.0       5

TABLE 3. Results from first experiment, third priority assignment.
GRAPH WITH ITERATIVE LOOPS

RESOURCES  COMPUTING    RESOURCE     AVERAGE      AVERAGE      AVERAGE      MAXIMUM
           POWER        UTILIZATION  TBI          TBO          TBIO         CONCURRENCY
1          [illegible]  [illegible]  [illegible]  2594.0       2589.0       1
2          [illegible]  97.24%       1294.0       1301.0       1614.0       2
3          [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  3
4          [illegible]  64.20%       964.0        975.0        1679.0       3
5          [illegible]  51.36%       964.0        975.0        1679.0       3
6          [illegible]  42.80%       964.0        975.0        1679.0       3

Table 4. Results from second experiment, first priority assignment.
GRAPH WITH ITERATIVE LOOPS
PRIORITY 5 6 1 [illegible] 3 4

RESOURCES  COMPUTING    RESOURCE     AVERAGE      AVERAGE      AVERAGE      MAXIMUM
           POWER        UTILIZATION  TBI          TBO          TBIO         CONCURRENCY
1          [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
2          39319        [illegible]  1294.0       1300.3       [illegible]  2
3          39261        85.97%       961.0        971.0        1671.5       3
4          [illegible]  64.47%       [illegible]  [illegible]  [illegible]  [illegible]
5          [illegible]  [illegible]  [illegible]  971.0        1671.5       3
6          [illegible]  42.98%       961.0        971.0        1671.5       3

Table 5. Results from second experiment, second priority assignment.
[Table heading illegible in the scanned source.]

Resources  PF_TBO       PF_TBIO
1          0.328399     0.692426
2          0.654579     0.867544
3          0.814843     0.988962
4          0.937500     0.988962
5          [illegible]  [illegible]
6          [illegible]  0.988962
7          [illegible]  [illegible]

Table 6. Performance factors for graph of Section 4.1.
PERFORMANCE FACTORS FOR GRAPH WITH ITERATIVE LOOPS

Resources  PF_TBO    PF_TBIO
1          0.370084  0.617999
2          0.737893  0.991325
3          0.984615  0.987654
4          0.984615  0.987654
5          0.984615  0.987654
6          0.984615  0.987654

Table 7. Performance factors for graph of Section 4.2.
FIGURES
Figure 1. Algorithm marked graph for discrete system equation.
NMG EDGE LABELS
IF  Input Buffer Full
IE  Input Buffer Empty
DR  Data Read
PC  Process Complete
PR  Process Ready
OE  Output Buffer Empty
OF  Output Buffer Full
Figure 2. ATAMM node marked graph model.
Figure 3. ATAMM computational marked graph model for discrete system equation.
Figure 4. ATAMM model components.
(Diagram labels: Algorithm, Computing Environment, Algorithm Directed Graph,
Petri Net Theory, Algorithm Marked Graph, Node Marked Graph, and a truncated
label beginning "Computational".)
Figure 5. Modified algorithm graph for Figure 1.
Figure 6. Operating strategy implementation.
Figure 7. Algorithm graph for design example.
Figure 8. Computational marked graph for design example.
Figure 9. Graph play with TBO=3 and unlimited functional units.
Figure 10. Resource utilization envelope for design example.
Figure 11. Graph play with TBO=4 and no control edges.
Figure 12. Resource envelope overlay diagram with TBO=3.
Figure 13. Resource envelope overlay diagram with TBO=3.5.
Figure 14. Resource envelope overlay diagram with TBO=4.0.
Figure 15. Example algorithm graph performance analysis summary.
(Axes: TBIO horizontal, TBO vertical; curves labeled R=1 through R=5.)
[Figures 16 through 18: captions are not legible in the scanned source.
Legible labels include "Throughput (arbitrary units)", "Graph Manager",
"Functional Unit" and "Global Memory".]
Record           Record description
T,72,N,1,F,1     <-- Node 1 is fired at time 72 and assigned to FU1
T,116,N,1,I      <-- Node 1 read the input places at time 116
T,249,N,1,P      <-- Node 1 finished the process at time 249
T,250,N,1,S      <-- Node 1 got clearance to output data at time 250
T,276,N,1,O      <-- Node 1 wrote the output data at time 276
T,278,N,2,F,2    <-- Node 2 is fired at time 278 and assigned to FU2
T,322,N,2,I      <-- Node 2 read the input places at time 322
T,323,N,1,F,3    <-- Node 1 is fired at time 323 and assigned to FU3
T,367,N,1,I      <-- Node 1 read the input places at time 367
T,455,N,2,P      <-- Node 2 finished the process at time 455
T,456,N,2,S      <-- Node 2 got clearance to output data at time 456
T,482,N,2,O      <-- Node 2 wrote the output data at time 482
T,485,N,3,F,4    <-- Node 3 is fired at time 485 and assigned to FU4
T,500,N,1,P      <-- Node 1 finished the process at time 500
T,501,N,1,S      <-- Node 1 got clearance to output data at time 501
T,527,N,1,O      <-- Node 1 wrote the output data at time 527
T,661,N,3,I      <-- Node 3 read the input places at time 661
T,663,N,2,F,5    <-- Node 2 is fired at time 663 and assigned to FU5
T,707,N,2,I      <-- Node 2 read the input places at time 707
T,708,N,1,F,1    <-- Node 1 is fired at time 708 and assigned to FU1
T,752,N,1,I      <-- Node 1 read the input places at time 752
T,840,N,2,P      <-- Node 2 finished the process at time 840
T,841,N,2,S      <-- Node 2 got clearance to output data at time 841
T,867,N,2,O      <-- Node 2 wrote the output data at time 867
T,885,N,1,P      <-- Node 1 finished the process at time 885
T,886,N,1,S      <-- Node 1 got clearance to output data at time 886
T,912,N,1,O      <-- Node 1 wrote the output places at time 912
T,1190,N,3,P     <-- Node 3 finished the process at time 1190
T,1191,N,3,S     <-- Node 3 got clearance to output data at time 1191
T,1295,N,3,O     <-- Node 3 wrote the output places at time 1295
Figure 19. A sample FIPSO file.
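The record descriptions in Figure 19 suggest a simple comma-separated layout: a time, a node number, an event code (F fired, I inputs read, P process finished, S output clearance, O output written), and a functional unit number on F records. The following is a minimal sketch of a reader for such records, assuming exactly that layout. The function names and the first-in, first-out pairing of F and O records for a node are assumptions made for illustration and do not describe the Analyzer's actual implementation.

```python
from collections import defaultdict

# Event codes as described in the Figure 19 record descriptions.
EVENTS = {
    "F": "fired and assigned to a functional unit",
    "I": "read the input places",
    "P": "finished the process",
    "S": "got clearance to output data",
    "O": "wrote the output data",
}


def parse_record(line):
    """Split one record into (time, node, event, functional_unit_or_None)."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) < 5 or fields[0] != "T" or fields[2] != "N":
        raise ValueError(f"unrecognized record: {line!r}")
    time, node, event = int(fields[1]), int(fields[3]), fields[4]
    if event not in EVENTS:
        raise ValueError(f"unknown event code in record: {line!r}")
    fu = int(fields[5]) if event == "F" else None
    return time, node, event, fu


def firing_spans(lines):
    """Pair each F record with the next O record of the same node.

    Returns a list of (node, functional_unit, fired_time, written_time).
    Pairing is first-in, first-out per node, which is an assumption.
    """
    pending = defaultdict(list)  # node -> list of (fu, fired_time)
    spans = []
    for line in lines:
        time, node, event, fu = parse_record(line)
        if event == "F":
            pending[node].append((fu, time))
        elif event == "O":
            start_fu, start = pending[node].pop(0)
            spans.append((node, start_fu, start, time))
    return spans


# The first twelve records of the sample file in Figure 19.
SAMPLE = [
    "T,72,N,1,F,1", "T,116,N,1,I", "T,249,N,1,P", "T,250,N,1,S",
    "T,276,N,1,O", "T,278,N,2,F,2", "T,322,N,2,I", "T,323,N,1,F,3",
    "T,367,N,1,I", "T,455,N,2,P", "T,456,N,2,S", "T,482,N,2,O",
]

if __name__ == "__main__":
    for node, fu, start, end in firing_spans(SAMPLE):
        print(f"node {node} on FU{fu}: {start} to {end} ({end - start} time units)")
```

On these twelve records the sketch reports two completed firings; the second firing of node 1 (started at time 323) is still open because its O record comes later in the file.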
Figure 20. Analyzer information flow.
Figure 21. Analyzer Node Activity Display.
Figure 22. Analyzer Functional Unit Display.
(Legible screen text: menu items Assigned FU's, Input/Output, Toggle display,
Split cursor, Merge cursors, Factor (cursor), Define window, Restore window,
Node Statistics, Concurrency, Quit; status line "Number of events: 543
Execution time: 17436".)
Figure 23. Analyzer Input/Output Display.
(Legible screen text: a TBI, TBO, TBIO listing per output packet and the
status line "Number of events: 543 Execution time: 17436". The legible rows
appear to alternate between TBI = TBO = 1067 and TBI = TBO = 1099, with
TBIO = 2273.)
Figure 24. Analyzer Concurrency Display.
Figure 25. Graph Simulation/Analyzer information flow.
(Diagram labels: Graph Description & Simulation Control File; MS-DOS Personal
Computer; outputs ATBIO, ATBI, ATBO, timed events, concurrency, etc.)
Figure 26. Graph with parallel paths.
Figure 27. CMG using Single Node model.
Figure 28. Circuit to obtain TBO_{LB}.
Figure 29. Path to obtain TBIO_{LB}.
# Simulation of a graph with parallel paths
# It is simulated with a range of resources
# and with two different priority assignments
Nodes 7
Sources 1
Sinks 1
Places 10
Resources -1                  # From 1 to 7 resources
Priority [partly illegible]   # Alternate assignment [partly illegible]
Tokens 1                      # Data available at the input node
Input 1                       # The input node is node 1
Output 5                      # The output node is node 5
Times                         # Global time assignment
Read 70                       # These time assignments are for all
Process 210                   # nodes in the graph. They can be
Write 40                      # overridden later on.
Node 1
Inputs 1
Outputs 2 7
Time                          # Local time assignment
Read 140                      # These time assignments override
Process 420                   # the global time assignments
Node 2
Inputs 2
Node 3
Inputs 3
Outputs 4
Node 4
Inputs 4 10
Outputs 5
Figure 30. Graph Description and Simulation Control file used for the first
experiment.
Time                          # Local time assignment
Read 140                      # These time assignments override
Process 420                   # the global time assignments
Write 80                      #
Node 5
[illegible]
Node 6
Time                          # Local time assignment
Read [illegible]              # These time assignments override
Process 420                   # the global time assignments
Write 80                      #
Node 7
Inputs 9
Outputs 10
Source 1
Outputs 1
Time
Write [illegible]
[illegible]
Read 70
End
# Source output write time is TBO(LB) - [illegible]
# Write time = 1065 - [illegible]
# This ends the Graph Description file
Figure 30. (Continuation).
Figure 31. Graph with iterative loops.
(Legible node labels: Source; read 70; read 140; process 420; write 40;
write 80.)
Figure 32. CMG of the graph in Figure 31 using Single Node Model.
# Simulation of a graph with iterative loops
# It is simulated with a range of resources
# and with two different priority assignments
Graph Graph with iterative loops TBO(LB) = 960 TBIO(LB) = 1600
Nodes 6
Sources 1
Sinks 1
Places 9
Resources -1                  # From 1 to 6 resources
Priority 4 3 2 1 6 5          # Alternate assignment 5 6 1 2 3 4
Tokens 1 [illegible] 9        # Data available at the input node
                              # and in outputs of the iterative loops
Model Single
Input 1                       # The input node is node 1
Output 4                      # The output node is node 4
                              # Global time assignment
Read 70                       # These time assignments are for all
Process 210                   # nodes in the graph. They can be
Write 40                      # overridden later on.
Node [illegible]
Outputs 2
Node 2
Outputs 3 6
Node 3
Inputs [illegible] 9
Outputs 4 8
Figure 33. Graph Description File for the second experiment.
Time                          # Local time assignment
Process 420                   # the global time assignments
Write 80                      #
Node 4
[illegible]
Node 5
Time                          # Local time assignment
Read [illegible]              # These time assignments override
Process 420                   # the global time assignments
Write 80                      #
Node 6
Outputs 9
Source 1
Outputs 1
Time
Write 890
Sink [illegible]
Time
Read 70
# Source output write time is TBO(LB) - [illegible]
# Write time = 960 - 70
End
# This ends the Graph Description File
Figure 33. (Continuation).
APPENDIX A
THE ATAMM PROCEDURE MODEL FOR CONCURRENT PROCESSING OF
LARGE GRAINED CONTROL AND SIGNAL PROCESSING ALGORITHMS
Presented at
National Aerospace and Electronics Conference
Dayton, Ohio
May 1988
THE ATAMM PROCEDURE MODEL FOR CONCURRENT PROCESSING OF LARGE
GRAINED CONTROL AND SIGNAL PROCESSING ALGORITHMS
John W. Stoughton and Roland R. Mielke
Department of Electrical and Computer Engineering
Old Dominion University
ABSTRACT
An overview is presented of a model for describing data and control flow
associated with the execution of large-grained, decision-free algorithms in
a special distributed computer environment. Identified by the acronym ATAMM,
which represents Algorithm-To-Architecture Mapping Model, the model provides
a basis for relating an algorithm to its execution in a dataflow
multicomputer environment. The ATAMM model features a marked graph Petri-net
description of the algorithm behavior with regard to both data and control
flow. The model provides an analytical basis for calculating performance
bounds on throughput characteristics which are demonstrated in the paper.
INTRODUCTION
The development of new computer architectures based upon distributed,
multiprocessor organizations is motivated mainly by the requirement for
increased speed and greater throughput capability in complex signal
processing applications [1]-[3]. With the advent of high-density
microelectronics, construction of parallel architectures consisting of
identical, special purpose computing elements is now a reality [4],[5]. A
number of models for describing the behavior of algorithms in this setting
have been developed [6]-[8]. However, these models represent only the data
flow and do not adequately display the complex issues of communication and
control flow which must occur in any realization. Thus, it has been difficult
to investigate how to effectively match the decomposition and scheduling of
algorithms to the structure and control of parallel architectures. The
importance of better understanding the relationship between algorithms and
architectures is only now becoming recognized [9].
This paper presents an overview of a graph theoretic model for describing
both data and control flow associated with the concurrent processing of large
grained algorithms in a special distributed computer environment. This model
is identified by the acronym ATAMM which represents Algorithm to Architecture
Mapping Model.
The ATAMM model is important for three reasons. First, the model provides a
hardware independent context in which to investigate the relative merits of
different algorithm decomposition and implementation strategies. Second, the
model defines the data flow and control flow which must be manifested by any
dataflow computer architecture implementing the decomposed algorithm. Third,
the model provides an analytical basis for performance evaluation.
The problem domain of the ATAMM model consists of large-grained,
decision-free algorithms with computationally complex primitive operations
which are assumed to be implemented in a dedicated distributed dataflow
environment. The algorithms are such as [illegible in the scanned source].
The multicomputer environment might consist of two to twenty processing
elements composed of VHSIC technology.
ATAMM MODEL DEVELOPMENT
The composition of the algorithms of interest may be such that two or more
operations can be performed concurrently. Thus, the potential exists for
decreasing the computational time required to execute the algorithm by
increasing the computational resources which process the large grained
primitive operations.
The hardware environment (Figure 1) for executing the decomposed algorithms
is assumed to consist of R identical processors or functional units (FUNs)
where R has a value in the range of two to twenty. This range of resources is
suggested for practical reasons due to the large-grained aspect of the
algorithm decomposition and the need to maintain small communication times
relative to process times. Each FUN is a processor having local memory for
program storage and temporary input and output data containers. Each FUN can
execute any algorithm primitive operation. The FUNs share a common global
memory (GLM) which may be either centralized or distributed. The coordination
of FUNs in relation to data and control flow is directed by the graph manager
(GRM). The GRM also may be centralized or distributed. Transaction rules
provide that output created by the completion of a primitive operation is
placed into global memory only after the output data containers have been
emptied. That is, outputs must be consumed as inputs to successor primitive
operations before allowing new data to fill the output locations. Assignment
of a functional unit to a specific algorithm primitive operation is made by
the GRM only when all inputs required by the operation are available in
global memory and a functional unit is available.
The algorithm to be executed has its data flow represented in a directed
graph termed the algorithm directed graph (ADG). The ADG provides a
description of the operand data flow and operation sequence required by the
algorithm decomposition. Vertices of the ADG are in a one-to-one
correspondence with each occurrence of a primitive operation. The ADG
contains an edge (i,j) directed from vertex i to vertex j if the output of
primitive operation i is an input operand for primitive operation j. When
constructing an algorithm graph, vertices or nodes (primitive operations) are
displayed as circles, and edges (input-output signals) are displayed as
directed line segments connecting appropriate vertices. Sources and sinks for
input and output signals are represented as squares. Sources from constants
are not [illegible] included in the algorithm graph; however, [illegible] are
used for this purpose when necessary.
To illustrate, consider the computations for the state equation

    x(k) = Ax(k-1) + Bu(k)

and output equation

    y(k) = Cx(k)

where x is a p-vector, u is an m-vector, y is an r-vector, and A and B are
constant matrices. The primitive operations are defined as matrix
multiplication and vector addition. The algorithm directed graph for this
algorithm decomposition is shown in Figure 2. Note that each edge is labeled
with the corresponding operands and the nodes are labeled to indicate the
associated computational operation.
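The edge rule for the ADG can be made concrete with the state and output equations given here. The following is a minimal sketch, assuming one vertex per matrix multiplication and one for the vector addition, plus a source for u(k) and a sink for y(k). The vertex names and the helper function are hypothetical labels chosen for illustration.

```python
# Edge (i, j) is present when the output of operation i is an input
# operand of operation j. Vertex names are hypothetical labels.
EDGES = [
    ("source_u", "mul_B"),  # u(k)
    ("mul_B", "add"),       # B u(k)
    ("mul_A", "add"),       # A x(k-1)
    ("add", "mul_A"),       # x(k), used as x(k-1) on the next iteration
    ("add", "mul_C"),       # x(k)
    ("mul_C", "sink_y"),    # y(k)
]


def predecessors(edges):
    """Map each vertex to the list of vertices that supply its operands."""
    preds = {}
    for tail, head in edges:
        preds.setdefault(tail, [])
        preds.setdefault(head, []).append(tail)
    return preds


if __name__ == "__main__":
    for vertex, operands in sorted(predecessors(EDGES).items()):
        print(f"{vertex:9s} <- {operands}")
```

The addition vertex has two independent suppliers, the A and B multiplications, which is where concurrent execution becomes possible in this decomposition. The edge from the addition back to the A multiplication is the recursion in the state equation.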
Petri-nets have been established as an appropriate model for describing
systems defined by some sequence of events. Without argument, the algorithm
directed graph satisfies this general aspect. Further, since computers need
to communicate and be controlled by the occurrence of certain events, the
Petri-net makes a suitable theoretical vehicle for the ATAMM model. Certain
physical characteristics of the class of problems under consideration lead to
a simplified Petri-net representation. (For a formal description of Petri-net
features, the reader is referred to references [10-12].)
Considering the data flow in an algorithm directed graph, the execution of a
primitive operation is preconditioned on the availability of input signals
(or operands). This process may be directly modeled by a Petri-net
"transition" which is "enabled" for "firing" when input "places" to the
transition are marked with "tokens". Because the signal or data availability
is a binary condition, it is appropriate that the tokens are limited to the
set {0,1} in order to associate places (conditions) to transitions (events)
in a binary way. A Petri-net having such restricted input and output
functions is called an ordinary Petri-net. The interpretation of places in
the system model developed here is the availability of a signal. That is, the
absence of a token indicates the absence of a data signal, and the presence
of a token indicates the availability of a data signal. Petri-nets having
such restricted markings are called safe or one-bounded Petri-nets. Finally,
the assumption is made that the algorithms under consideration contain no
conflict or decision making such as "if-then-else" or "do-while" statements,
thus limiting the Petri-net places to having one input transition and one
output transition. This class of restricted Petri-nets is called marked
graphs. Therefore, the Petri-nets used in this report are ordinary, safe
marked graphs.
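The two structural restrictions described above can be expressed as simple checks. The following is a minimal sketch, assuming a net is given as a dictionary of places, each listing its input transitions, output transitions, and current token count. The data layout and function names are hypothetical. Note that the marking check looks only at one given marking, whereas safeness is a property of every reachable marking.

```python
def is_marked_graph(places):
    """True when every place has exactly one input and one output transition."""
    return all(
        len(p["inputs"]) == 1 and len(p["outputs"]) == 1
        for p in places.values()
    )


def marking_is_binary(places):
    """True when every place in this marking holds either 0 or 1 token."""
    return all(p["tokens"] in (0, 1) for p in places.values())


# A two-transition loop: decision-free, so it is a marked graph.
LOOP = {
    "p1": {"inputs": ["t1"], "outputs": ["t2"], "tokens": 1},
    "p2": {"inputs": ["t2"], "outputs": ["t1"], "tokens": 0},
}

# A place feeding two transitions models a choice ("if-then-else"),
# which the marked graph restriction excludes.
CHOICE = {
    "p1": {"inputs": ["t1"], "outputs": ["t2", "t3"], "tokens": 1},
}

if __name__ == "__main__":
    print(is_marked_graph(LOOP), marking_is_binary(LOOP))
    print(is_marked_graph(CHOICE))
```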
Limiting the model to consideration of decision-free algorithms is done
because the resulting marked graph models are better understood than general
Petri-nets and hold the potential for the development of performance bounds
for concurrent processing strategies.
An algorithm marked graph (AMG) is a marked graph which represents a specific
algorithm decomposition and is identical in topology to the corresponding
algorithm directed graph. The AMG represents the first component in the
development of the ATAMM model. The construction rules and symbols are the
same as the ADG except that the edges are marked with tokens to represent the
availability of data. That is, edge (i,j) is marked with a token if an output
from primitive operator i is available as an input to primitive operator j.
The presence of a token on an edge is indicated by a solid dot placed on the
edge. The vertices correspond to transitions which may fire after being
enabled by the availability of all input data tokens. The decomposed state
equation represented in Figure 2 is also used to illustrate the AMG. It
should be noted that the initial conditions for the recursion are represented
by tokens on the loop edges.
The AMG is a useful tool for representing decomposed algorithms and for
displaying data flow within an algorithm. However, the AMG does not identify
procedures that a computing structure must manifest in order to perform the
computing task.
Algorithm requirements and the computing environment may now be integrated
into a comprehensive Petri-net model to complete the ATAMM model. The model
consists of a Petri-net marked graph called the computational marked graph
(CMG). The CMG displays the data flow and control flow required to implement
a decomposed algorithm in a multiprocessor data flow computer architecture.
Before defining this model, it is helpful to define an intermediate graph
called the node marked graph (NMG) [13].
A NMG is a Pearl--netrepresentation of the
computing behavior of a FUN for each AMG operation.
l'hrce primary activities,reading of input data from
global memory, processingof innut data to compute an
output, and .,¢ritingofoutput data to global memory, are
representedas transitions(vertices)in the NMG. "Data
and controlflow paths are represented as places (edgesL
and the presence of signalsis notated by tokens marking
appropriateedges. The conditions for firingthe process
and write transitionsof the NMG are as defined for a
general Petri-net. while the read transition has one
additional condition for firing. In addition to Itavintt a
token present on each incoming signal edge, a functioTaal
unit must be available for assignment to the primitive
operation before the read node can _re. Once assigned.
the functional unit is used to implement the read.
process, and write operations before being returned to a
queue of Available FUNs.
The NMG of interest in this paper requires control
signals, indicating that empty data containers are
available to receive new output, as input edges to the read
transition. Therefore, initiation of the node operation
requires not only the availability of input data and a
functional unit, but also the availability of empty output
data containers in global memory. This model is shown
in Figure 3.
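The three conditions for initiating a node operation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `try_initiate` and `release`, the boolean lists, and the use of a `deque` as the queue of available FUNs are hypothetical and do not come from the report.

```python
from collections import deque


def try_initiate(inputs_full, outputs_empty, fun_queue):
    """Return the FUN assigned to the node, or None if the read
    transition cannot fire yet (hypothetical helper)."""
    # Every input edge must hold data and every output container must be empty.
    ready = all(inputs_full) and all(outputs_empty)
    if ready and fun_queue:
        return fun_queue.popleft()  # the FUN leaves the queue of available FUNs
    return None


def release(fun, fun_queue):
    """After the write transition fires, the FUN rejoins the queue."""
    fun_queue.append(fun)
```

In this sketch a node whose output container is still full leaves the queue untouched, which mirrors the requirement that a FUN is assigned only when the read transition can actually fire.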
A computational marked graph (CMG) is
constructed from a particular AMG and the NMG
according to the following rules.
1. Source and sink nodes in the AMG are represented
by source and sink nodes in the CMG.
2. Nodes corresponding to primitive operations in
the AMG are represented by NMGs in the CMG.
3. Edges in the AMG are represented by edge
pairs, one forward directed for data flow and one
backward directed for control flow, in the CMG.
The play of the CMG proceeds according to the
following graph rules.
1. A node is enabled when all incoming edges are
marked with a token. An enabled node fires by encumbering
one token from each incoming edge, delaying for
some specified transition time, and then depositing one
token on each outgoing edge.
2. A source node and a sink node fire when
enabled without regard for the availability of a FUN.
3. A node operation is initiated when the read
node of an NMG is enabled and when a FUN is available
for assignment to the NMG and thus fires the read node.
A FUN remains assigned to an NMG until completion of
the firing of the write node of the NMG. Supervision of
this logical assignment of the FUN is managed by the
GRM.
For illustration, the CMG corresponding to the
algorithm graph of Figure 2 is shown in Figure 4. The
CMG is useful because it clearly displays the data and
control flow which must occur in any hardware
implementation of the model process and because it provides a
hardware independent context in which to evaluate
process performance.
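The enabling and firing rule can be sketched as a small untimed token game. The sketch below is an assumption-laden illustration, not the report's implementation: edges are written as `(from_node, to_node)` tuples, the marking is a dictionary from edge to token count, the transition delay is ignored, and the two-node example models a single AMG edge as a forward data edge plus a backward control edge.

```python
def enabled(node, marking, edges):
    """A node is enabled when every incoming edge holds at least one token."""
    return all(marking[e] > 0 for e in edges if e[1] == node)


def fire(node, marking, edges):
    """Fire an enabled node: remove one token from each incoming edge and
    deposit one token on each outgoing edge (delay is not modeled)."""
    if not enabled(node, marking, edges):
        raise ValueError(f"node {node!r} is not enabled")
    new_marking = dict(marking)
    for e in edges:
        if e[1] == node:
            new_marking[e] -= 1
        if e[0] == node:
            new_marking[e] += 1
    return new_marking


# One data edge a -> b and its control edge b -> a (hypothetical example).
edges = [("a", "b"), ("b", "a")]
start = {("a", "b"): 0, ("b", "a"): 1}   # the output container is empty
after_a = fire("a", start, edges)        # data is now waiting for b
after_b = fire("b", after_a, edges)      # b consumes it and frees the container
```

The token count around the two-edge circuit stays at one throughout, which is the circuit invariance property that marked graph theory relies on.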
The ATAMM model consists of the algorithm
marked graph, the node marked graph, the
computational marked graph, and the data flow
architecture. A pictorial description of the ATAMM
model is shown in Figure 5.
ATAMM MODEL GRAPH CHARACTERISTICS
The theoretical analysis of the CMG from the
standpoint of marked graph theory is beyond the scope of
this paper and may be found in [14]. However, several
properties are noted below for clarity.
Let the CMG be a marked graph consisting of m
places and n transitions. The m-vector $M_k$ is the
marking vector resulting from the firing of some sequence
of k transitions. It may be shown that the number of
tokens contained in any directed circuit of the CMG is
invariant under transition firings.
The CMG is live for all appropriate initial
marking vectors. That is, the CMG is live for a marking M
if, for all markings reachable from M, it is possible to
fire any transition of the CMG by progressing through some
transition firing sequence.
The CMG is said to be consistent. That is, there
exists a marking M and a transition firing sequence from
M back to M such that every transition occurs at least
once. In addition, each transition of the CMG occurs an equal
number of times in a firing sequence from a marking M
back to M.
The CMG is said to be safe for marking M if, for
all markings reachable from M, no place contains more
than one token.
PERFORMANCE MEASURES
In this section, performance measures indicating
computing speed and throughput capacity are defined.
Bounds for these quantities are calculated analytically
from the AMG and CMG. This information is essential
for efficiently matching algorithm decompositions with
architecture implementations. The work presented in
this section is an extension of recent investigations of the
performance of Petri-nets [15], [16] and marked graphs
[17].
Assume that R FUNs are available for the
algorithm execution. A computational task is initiated
when an input data token from the source node is
encumbered. Task output occurs when a corresponding
output data token is deposited at the output sink node.
A task is completed when all computing associated with
the task is completed. However, task output and task
completion do not always coincide, as may occur in
iterative signal processing algorithms in which initial
conditions for the next iteration are computed after an
output has been produced. [...] To facilitate
measurement of throughput capacity, it is assumed that
tasks are initiated periodically with new input data sets.
New data sets are available continuously as input tokens
from the input source node. Included in this problem
class are iterative algorithms in which the present task
requires as inputs data from previous task iterations.
Concurrency, at any instant, falls into one of two
categories. On one hand, different functional units
(FUNs) may be performing simultaneously several
primitive operations belonging to a particular task within
the graph. This type of concurrency is referred to as
vertical concurrency and has a direct effect on task
computing speed. It is limited by the number of
primitive operations that can be performed
simultaneously in a given algorithm decomposition, and
by the number of FUNs available. The second type of
concurrency, horizontal concurrency, relates to FUNs
which may be operating on different input tasks within
the graph. This type of concurrency has a direct effect on
throughput capacity. It is limited by the capacity of the
graph to accommodate additional task inputs, and by the
number of functional units available to implement the
tasks. In the following, it is shown that the process of
algorithm decomposition imposes bounds on the amount
of vertical concurrency and horizontal concurrency
possible. If sufficient computing resources are available,
these bounds can be achieved. If the number of
computing resources is limited, the bounds cannot all be
reached simultaneously, and tradeoffs between the
amount of vertical concurrency and horizontal
concurrency are possible.
Three performance measures for concurrent
processing are defined. The performance measure TBIO
is the computing time which elapses between a task input
and the corresponding task output. The performance
measure TT is the computing time which elapses between
a task input and the completion of all computation
associated with that task. The performance measure
TBO is the computing time which elapses between
successive task outputs when the graph is operating
periodically in steady-state. The first two parameters,
TBIO and TT, are indicators of computing speed and
thus reflect the degree of vertical concurrency. The third
parameter, TBO, is a measure of throughput capacity
and thus reflects the degree of horizontal concurrency
when compared to TT.
Lower bounds of these measures may now be
outlined, and may be found in detail in [14]. Consider an
AMG representing a decomposed algorithm. The lower
bound for TBIO is the shortest time required for a data
token from the data input source to propagate through
the graph to the data output sink. Similarly, the lower
bound for TT is the shortest time required to complete
all computing activity initiated by the injection of a data
token from the data input source. These shortest times
are the actual performance times when only a single task
is in the graph during any time interval (no
horizontal concurrency) and as many computing
resources as are required are available (maximum vertical
concurrency). Under these operating conditions, lower
bounds for TBIO and TT are calculated by identifying
certain longest paths in a graph obtained from the
algorithm marked graph. This new graph, called the
modified algorithm marked graph, MAMG, is defined
and then used to determine lower bounds for TBIO and
TT.
The construction of the modified AMG proceeds
by the following rules. Let $p_i$ be a place of the AMG,
directed from transition $t_r$ to transition $t_s$, which
contains a token of the initial marking. Then the
MAMG may be obtained from the original AMG by
1. Deleting place $p_i$ from the AMG;
2. Adding place $p_{i1}$, directed from the data input
source to transition $t_s$;
3. Adding a new output sink $s_i$ different from all
other output sinks, and a new place $p_{i2}$, directed from
transition $t_r$ to $s_i$; and
4. Repeating 1-3 for each place of the AMG
containing a token of the initial marking.
Let $P_i$ be the ith directed path in the MAMG
from the data input source to the output sink. The lower
bound for TBIO is defined as
\[ TBIO_{LB} = \max_i \{ T(P_i) \}, \]
where the maximum is taken over all paths $P_i$ in the
MAMG and $T(P_i)$ denotes the sum of transition times for
transitions contained in $P_i$.
Let $P_i$ be the ith directed path in the MAMG
from the data input source to any output sink. The lower
bound for TT is defined as
\[ TT_{LB} = \max_i \{ T(P_i) \}, \]
where $T(P_i)$ denotes the sum of transition times of
transitions contained in $P_i$, and the maximum is taken
over all paths $P_i$ in the MAMG.
To illustrate, $TBIO_{LB}$ and $TT_{LB}$ are computed
for the AMG shown in Figure 2 for which the following
transition times are assumed: T(1)=4, T(2)=1, T(3)=5,
and T(4)=6. The MAMG is shown in Figure 6. It may
be easily shown that $TBIO_{LB}=10$ and $TT_{LB}=11$.
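The construction rules and the two longest-path bounds can be checked with a short Python sketch. The graph topology used here is an assumption inferred from the state equation x(k) = Ax(k-1) + Bu(k), y(k) = Cx(k), since the figures are not legible: node 1 is taken as B*u, node 2 as the vector addition, node 3 as C*x, and node 4 as A*x, with the initially marked place directed from node 4 to node 2. The function names are hypothetical.

```python
def build_mamg(edges, marked, source):
    """Apply rules 1-4: each initially marked place (t_r, t_s) is deleted and
    replaced by a place source -> t_s and a place t_r -> a new sink."""
    mamg = [e for e in edges if e not in marked]
    new_sinks = []
    for i, (t_r, t_s) in enumerate(marked, start=1):
        sink = f"sink_{i}"
        mamg.append((source, t_s))
        mamg.append((t_r, sink))
        new_sinks.append(sink)
    return mamg, new_sinks


def longest_path(edges, times, node, goal):
    """Largest sum of transition times on a directed path node -> goal,
    or None if goal is unreachable. Assumes the graph is acyclic."""
    if node == goal:
        return times.get(node, 0)
    best = None
    for u, v in edges:
        if u == node:
            tail = longest_path(edges, times, v, goal)
            if tail is not None:
                total = times.get(node, 0) + tail
                best = total if best is None else max(best, total)
    return best


# Assumed topology: 1 = B*u, 2 = vector add, 3 = C*x, 4 = A*x.
amg = [("src", 1), (1, 2), (2, 3), (3, "out"), (2, 4), (4, 2)]
marked = [(4, 2)]                    # place holding the initial condition
times = {1: 4, 2: 1, 3: 5, 4: 6}     # T(1)..T(4) from the text

mamg, new_sinks = build_mamg(amg, marked, "src")
tbio_lb = longest_path(mamg, times, "src", "out")
tt_lb = max(longest_path(mamg, times, "src", s) for s in ["out"] + new_sinks)
print(tbio_lb, tt_lb)  # 10 11
```

Under this assumed topology the sketch reproduces the values quoted in the text: the path through nodes 1, 2, 3 gives 10, and the path through nodes 1, 2, 4 to the added sink gives 11.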
A lower bound for the performance measure TBO
is now determined from the CMG representing a
decomposed algorithm. It is assumed that operating
conditions are set to maximize horizontal concurrency.
That is, data tokens are continuously available at the
data input source, and as many computing resources as
needed can be called to perform primitive operations.
With these conditions, the graph plays periodically in
steady-state, and $TBO_{LB}$ is the shortest time possible
between successive outputs. Let $C_i$ be the ith directed
circuit in the CMG. The notation $T(C_i)$ denotes the sum
of transition times of transitions contained in $C_i$, and
$M(C_i)$ denotes the number of tokens contained in $C_i$.
Then,
\[ TBO_{LB} = \max_i \{ T(C_i) / M(C_i) \}, \]
where the maximum is taken over all directed circuits in
the CMG.
The CMG in Figure 4 has many directed circuits.
However, the directed circuit which contains all NMG
nodes of transitions 2 and 4 contains only one token and
maximizes the ratio $T(C_i)/M(C_i)$. Therefore, the
shortest time possible between successive outputs in this
graph is $TBO_{LB}=7$.
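The circuit-ratio bound can be illustrated by brute-force enumeration of directed circuits on a small graph. The sketch below is a simplification and not the report's CMG: each primitive operation is collapsed to a single transition carrying its full transition time, every AMG edge becomes a data edge plus a control edge, and the initial token placement is an assumption (one token on the loop data edge and one on the control edge of every empty container). The function name `max_cycle_ratio` is hypothetical.

```python
def max_cycle_ratio(edges, times):
    """Maximum of T(C)/M(C) over all simple directed circuits.
    edges: list of (from_node, to_node, tokens); times: node -> time."""
    best = 0.0

    def walk(start, node, visited, t_sum, m_sum):
        nonlocal best
        for u, v, tokens in edges:
            if u != node:
                continue
            if v == start:                      # the circuit closes here
                m = m_sum + tokens
                if m == 0:
                    raise ValueError("token-free circuit: graph is not live")
                best = max(best, t_sum / m)
            elif v not in visited:
                walk(start, v, visited | {v}, t_sum + times[v], m_sum + tokens)

    for s in times:
        walk(s, s, {s}, times[s], 0)
    return best


times = {1: 4, 2: 1, 3: 5, 4: 6}
edges = [
    (1, 2, 0), (2, 1, 1),   # data edge 1 -> 2 and its control edge
    (2, 3, 0), (3, 2, 1),   # data edge 2 -> 3 and its control edge
    (2, 4, 0), (4, 2, 1),   # data edge 2 -> 4 and its control edge
    (4, 2, 1), (2, 4, 0),   # loop data edge (initial condition) and control
]
tbo_lb = max_cycle_ratio(edges, times)
```

With these assumptions the critical circuit passes through transitions 2 and 4 with a single token, giving a ratio of 7, in agreement with the value stated above. Exhaustive enumeration is only practical for small graphs.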
STRATEGY FOR OPTIMUM TIME PERFORMANCE
Of interest is the development of an operating
strategy for the ATAMM model which achieves optimum
time performance with a minimum number of computing
resources. Unfortunately, this problem is equivalent to a
class of scheduling problems which is known to be
NP-complete. Thus, no algorithm is known for
obtaining an optimum solution which is substantially
better than enumerating all possible solutions and then
choosing the best one. However, a suboptimal operating
strategy which achieves optimum time performance, but
possibly requires more than the minimum number of
computing resources, has been developed and is
illustrated in this section.
When presented with continuously available input
data sets, the natural behavior of a data flow architecture
results in operation where new data sets are accepted as
rapidly as the available resources permit. That is, the
architecture naturally operates at high levels of
horizontal concurrency with the possible loss of capability
for achieving high levels of vertical concurrency. This
results in performance characterized by high throughput
rate, $TBO=TBO_{LB}$, but relatively poor task computing
speed so that $TBIO \gg TBIO_{LB}$ and $TT \gg TT_{LB}$. In
many signal processing and control applications, it is
important to achieve both high throughput rate and high
task computing speeds. The suboptimal operating
strategy presented in this section results in performance
having the following characteristics.
1. When $R \geq R_{Max}$, operation achieves $TBIO_{LB}$,
$TT_{LB}$, and $TBO_{LB}$. $R_{Max}$ is computed in
implementing the strategy and represents the minimum number of
resources which insures maximum horizontal concurrency
and maximum vertical concurrency under this strategy.
2. When $R_{Max} > R \geq R_{Min}$, operation achieves
$TBIO_{LB}$ and $TT_{LB}$, but $TBO > TBO_{LB}$. The strategy
preserves task computing speed or vertical concurrency at
the expense of throughput rate or horizontal concurrency.
$R_{Min}$ is also computed in implementing the strategy, and
represents the minimum number of resources needed to
maintain vertical concurrency with limited horizontal
concurrency.
3. The rate at which new data is presented to the
CMG must be limited. This is accomplished by adding a
control transition connected in a directed circuit with the
data input source. The control transition imposes a
minimum delay of D time units between inputs. Delay D
is chosen according to the following rule:
\[ D = \begin{cases} TBO_{LB}, & R \geq R_{Max} \\ TBO_{Min}, & R_{Max} > R \geq R_{Min} \\ TCE, & R_{Min} > R \geq 1 \end{cases} \]
TCE denotes the total computing effort required to
complete the task, and $TBO_{Min}$, $R_{Max}$, and $R_{Min}$ are
computed as part of the operating strategy design
procedure.
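The rule for choosing the input delay D can be written as a small lookup. This is a minimal sketch under stated assumptions: the function name `input_delay` and the dictionary `tbo_min` (mapping a resource count R to its $TBO_{Min}$ value) are hypothetical, and the boundary cases are taken as inclusive, so that R equal to $R_{Max}$ already yields $TBO_{LB}$.

```python
def input_delay(r, r_max, r_min, tbo_lb, tbo_min, tce):
    """Minimum delay D between task inputs for r available FUNs.
    tbo_min maps each r with r_min <= r < r_max to TBO_Min(r)."""
    if r < 1:
        raise ValueError("at least one functional unit is required")
    if r >= r_max:
        return tbo_lb        # full horizontal and vertical concurrency
    if r >= r_min:
        return tbo_min[r]    # vertical concurrency kept, throughput reduced
    return tce               # tasks are processed one after another
```

For instance, with the hypothetical values r_max=4, r_min=3, tbo_lb=3, tbo_min={3: 4} and tce=12, the call `input_delay(3, 4, 3, 3, {3: 4}, 12)` returns 4.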
The operating strategy design process consists of
five steps. These steps are presented and explained in
the remainder of this section. An operating strategy is
developed for the example algorithm graph shown in
Figure 7 to illustrate each step as it is presented.
Step 1. Choose a convenient transition firing rule. For
the example algorithm graph, the highest to lowest
priority ordering of the transitions is chosen as
(5,4,3,7,2,6,1).
Step 2. Determine $TBO_{LB}$. The CMG corresponding to
the example algorithm graph is shown in Figure 8. The
directed circuit identified in this figure contains 6
transition time units and 2 tokens, and maximizes the
ratio $T(C_i)/M(C_i)$ for all directed circuits. Therefore,
$TBO_{LB}=3$.
Step 3. Determine the resource utilization envelope of a
single task required for maximum vertical concurrency at
steady-state with $TBO = TBO_{LB}$ under the assumption
of unlimited resources. The play of the example
algorithm graph under these conditions is shown in
Figure 9, and the resulting resource utilization envelope
is shown in Figure 10.
Step 4. Stabilize the resource utilization envelope by
adding control places as necessary. If the time between
inputs to the CMG is increased above $TBO_{LB}$, the
resource utilization envelope may change from that
observed in Step 3. Since knowledge of the envelope is
required to calculate the number of required resources,
additional places are appended to the marked graph
to freeze the shape of the envelope. For example,
the play of the example algorithm graph of Figure 7 with
an injection time of 4 is shown in Figure 11. At this
slower injection rate, transition 6 fires one time unit
earlier. To prevent time movement of transition 6, a
control place directed from transition 2 to transition 6 is
added. This place prevents the firing of transition 6 until
transition 2 has completed firing. Thus the resource
utilization envelope computed for an input period of
$TBO_{LB}$ is the envelope for all input periods
$TBO > TBO_{LB}$.
Step 5. Compute $R_{Max}$, $R_{Min}$, and $TBO_{Min}(R)$ using
the resource utilization envelope. $R_{Max}$ is determined
by overlaying resource utilization requirements, each
delayed by $TBO_{LB}$ with respect to the previous one, as
shown in Figure 12 for the example. $R_{Max}$ is equal to
the largest resource requirement during any time interval
within the steady state operating period. $R_{Min}$ is the
minimum number of resources necessary to insure
maximum vertical concurrency with no horizontal
concurrency, and is equal to the maximum
resource requirement indicated in the utilization
envelope for a single task. For the example problem,
$R_{Max}=4$ and $R_{Min}=3$. The value of $TBO_{Min}$ for each
resource number R between $R_{Max}$ and $R_{Min}$ inclusive is
determined by increasing the delay between overlapping
resource utilization envelopes until the maximum
resource requirement is R. $TBO_{Min}$ is the smallest input
delay to produce this resource requirement. For the
example, the calculations of $TBO_{Min}$ for R=3 are
illustrated in Figure 13. The results of these calculations
are $TBO_{Min}(3)=4$.
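The overlay calculation of Step 5 can be sketched as follows. The envelope values are hypothetical: they were chosen only so that the sketch reproduces the example results ($R_{Max}=4$, $R_{Min}=3$, $TBO_{Min}(3)=4$) and were not read from Figure 10. The sketch also assumes integer time units and integer input periods; the function names are illustrative.

```python
def peak_demand(envelope, period):
    """Largest number of FUNs busy in any time unit when a new task
    is injected every `period` time units (steady state)."""
    folded = [0] * period
    for t, need in enumerate(envelope):
        folded[t % period] += need   # overlay copies shifted by the period
    return max(folded)


def tbo_min(envelope, r, tbo_lb):
    """Smallest integer input period >= tbo_lb whose peak demand is <= r."""
    if r < max(envelope):
        raise ValueError("r is below R_Min; vertical concurrency is lost")
    period = tbo_lb
    while peak_demand(envelope, period) > r:
        period += 1
    return period


envelope = [1, 3, 2, 1, 1]   # hypothetical FUNs in use per time unit
TBO_LB = 3
r_min = max(envelope)                    # single task, no overlap
r_max = peak_demand(envelope, TBO_LB)    # overlay at the fastest input rate
```

The loop in `tbo_min` always terminates, because once the period reaches the envelope length the copies no longer overlap and the peak demand equals the single-task maximum.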
The performance degradation as a function of R of
the example algorithm graph is summarized in Figure 14
which shows the throughput rate or performance margin as a
function of R. Note that for the example, no
improvement in throughput is available for $R > R_{Max}$.
The ATAMM model has been demonstrated to be
a useful graph theoretic model for describing data and
control flow associated with the execution of large
grained, decision free algorithms in a special distributed
computer environment. The ATAMM model has been
shown to provide an analytical basis for calculating
performance bounds on throughput characteristics and
suboptimum performance behavior. The ATAMM model
leads directly to the communication and data flow
specifications for a data flow architecture and thus
becomes the basis of design for these structures.
ACKNOWLEDGEMENT
The paper is based on research work which was
supported in part by NASA Langley Research Center
under Grant NAG-1-683.
1. P. Treleaven, D. Brownbridge and R. Hopkins,
"Data-driven and demand-driven computer architecture,"
Computing Surveys, vol. 14, pp. 93-143, March 1982.
2. V. Srini, "An architectural comparison of
dataflow systems," Computer, pp. 68-88, March 1986.
3. W. Rheinboldt, "Report of the panel on future
directions in computational mathematics, algorithms,
and scientific software," sponsored by NSF Grant
DMS-63-3453, SIAM, 1985.
4. T. Lento, G. Herzog and D. Maxwell, "A fast
single chip 1750A CPU and compatible support
components in VHSIC-size CMOS technology,"
Proceedings of the Government Microcircuit Applications
Conference, pp. 317-3[...], 1986.
5. W. Wehner, W. Everhart, S. Shankar and
[...], "A VHSIC architecture for highly parallel
image understanding," Proceedings of the Government
Microcircuit Applications [...].
6. M. Sowa and T. Murata, "A data flow
computer architecture with program and token
memories," IEEE Transactions on Computers, vol. 31,
pp. 820-824, September 1982.
7. K. Kavi, B. Buckles and U. Narayan Bhat, "A
formal definition of data flow graph models," IEEE
Transactions on Computers, vol. 35, pp. 940-948,
November 1986.
8. M. Granski, I. Koren and G. Silberman, "The
effect of operation scheduling on the performance of a
data flow computer," IEEE Transactions on Computers,
vol. 36, pp. [...], 1987.
9. L. Jamieson, H. Siegel, E. Delp and
A. Whinston, "The mapping of parallel algorithms to
reconfigurable parallel architectures," Proceedings of
Future Directions in Computer Architecture and
Software, D. Agrawal Ed., ARO Contract
DAAG29-81-D-0100, pp. 147-154, May 1986.
10. J. Peterson, Petri Net Theory and the
Modeling of Systems, Englewood Cliffs, N.J.:
Prentice-Hall, 1981.
11. T. Murata, "Circuit theoretic analysis and
synthesis of marked graphs," IEEE Transactions on
Circuits and Systems, vol. 24, pp. 400-405, July 1977.
12. T. Murata, "Modeling and analysis of
concurrent systems," Handbook of Software Engineering,
C. Vick and C. Ramamoorthy Editors, pp. 39-63, Van
Nostrand Reinhold, 1984.
13. J. Stoughton and R. Mielke, "Petri net model
for concurrent processing of complex algorithms,"
Proceedings of the Government Microcircuit Applications
Conference, pp. 11-14, November 1986.
14. R. Mielke, J. Stoughton, and S. Som,
"Modeling and performance bounds for concurrent
processing," 8th International Conference on Distributed
Computing Systems, San Jose, CA, June 13-17, 1988.
15. J. Sifakis, "Performance evaluation of systems
using nets," Net Theory and Applications, W. Brauer
Editor, pp. 307-319, Springer-Verlag, 1979.
16. C. Ramamoorthy and G. Ho, "Performance
evaluation of asynchronous concurrent systems using
Petri-nets," IEEE Transactions on Software
Engineering, vol. 6, pp. 440-449, September 1980.
17. T. Murata, "Synthesis of decision-free
concurrent systems for prescribed resources and
performance," IEEE Transactions on Software
Engineering, vol. 6, pp. 523-530, November 1980.
Figure 1. Representative ATAMM Architecture
Figure 2. Algorithm Marked Graph - Example 1
Figure 3. ATAMM Node Marked Graph Model. Place labels: IF Input Buffer Full; IE Input Buffer Empty; DR Data Read; PC Process Complete; PR Process Ready; OE Output Buffer Empty; OF Output Buffer Full.
Figure 4. Computational Marked Graph - Example 1
Figure 6. Modified AMG - Example 1
Figure 7. Algorithm Marked Graph - Example 2
Figure 8. Computational Marked Graph - Example 2
Figure 9. Graph Play With TBO=3 and Unlimited Functional Units (horizontal axis: Time)
Figure 10. Resource Utilization Envelope - Example 2 (horizontal axis: Time)
Figure 11. Graph Play With TBO=4 W/O Control Edges
Figure 12. Resource Envelope Overlay Diagram - TBO=3.0 (horizontal axis: Time)
Figure 13. Resource Envelope Overlay Diagram - TBO=4.0 (horizontal axis: Time)
Figure 14. Performance Margin - Example 2 (horizontal axis: No. of Processors)
APPENDIX B
MODELING AND PERFORMANCE BOUNDS FOR CONCURRENT PROCESSING
Presented at
International Conference on Distributed Computing Systems
San Jose, California
June 1988
MODELING AND PERFORMANCE BOUNDS FOR CONCURRENT PROCESSING
Roland R. Mielke, John W. Stoughton and Sukhamoy Som
Department of Electrical and Computer Engineering
Old Dominion University
Norfolk, Virginia
ABSTRACT
The development of a new graph theoretic
model for describing the relation between a
decomposed algorithm and its execution in a
multiprocessor environment is presented. Called
ATAMM, the model consists of a set of Petri net
marked graphs which incorporates the general
specifications of a data flow architecture. The model
is useful for representing decision-free algorithms
having large-grained, computationally complex
primitive operations. Performance measures of
computing speed and throughput capacity are defined.
The ATAMM model is used to develop analytically
lower bounds for these quantities.
The development of a new graph theoretic model for
describing data and control flow associated with the
execution of large-grained algorithms in a special
distributed computing environment is presented. The
model is identified by the acronym ATAMM which
represents Algorithm To Architecture Mapping Model.
The purpose of such a model is to provide a basis for
establishing rules for relating an algorithm to its execution
in a multiprocessor environment. Specifications derived
from the model lead directly to the description of a data
flow architecture. The availability of the ATAMM model
is important for at least three reasons. First, it provides a
context in which to investigate algorithm decomposition
strategies without the need to specify a specific computer
architecture. Second, the model identifies the data flow
and control dialog required of any data flow architecture
which implements the algorithm. And third, the model
provides a basis for calculating analytically performance
bounds for computing speed and throughput capacity.
The problem domain of the ATAMM model consists
of decision free algorithms with computationally complex
primitive operations which are assumed to be implemented
in a dedicated data flow environment. The algorithms are
such as may be found in (but not limited to) large scale
signal processing and control applications. The
anticipated multiprocessor environment is assumed to
consist of two to twenty processing elements for concurrent
execution of the various algorithm primitives.
The development of new computer architectures
based upon distributed, multiprocessor organizations [1],
[2] is motivated mainly by the requirement for increased
speed and greater throughput capability in complex signal
processing applications [3]. Recent advances in the
production of high density microelectronics have made
possible the construction of parallel architectures
consisting of identical, special purpose computing elements
[4], [5]. A number of models for describing the behavior of
algorithms in this setting have been developed [6]-[8].
However, these models represent only the data flow and do
not adequately display the complex issues of
communication and control flow which must occur in any
realization of the model. For this reason, it has been
difficult to investigate how to effectively match the
decomposition and scheduling of algorithms to the
structure and control of parallel architectures. The
importance of better understanding the relationship
between algorithms and architectures is only now
becoming recognized [9].
In Section II of the paper, the modeling process to
describe algorithms in data flow architectures, ATAMM, is
presented. The model consists of three Petri net marked
graphs called the algorithm marked graph (AMG), the
node marked graph (NMG), and the computational
marked graph (CMG). In Section III, time performance
measures for concurrent processing are defined. The
model is then used for calculating
analytically lower bounds for these performance measures.
An example is presented to illustrate these concepts, and
the results of experimental runs on actual multiprocessor
hardware are reported.
II. ATAMM MODEL DEVELOPMENT
In this section the ATAMM model to describe
concurrent processing of decomposed algorithms is
presented. The model consists of a set of Petri net marked
graphs which incorporate general specifications of
communication and processing associated with each
computational event in a data flow architecture. First, a
detailed description of the problem context is stated. This
is followed by the definition of the ATAMM model
consisting of the algorithm marked graph, the node
marked graph, and the computational marked graph.
Some familiarity with Petri nets [10] and marked graphs
[11] is assumed in this presentation.
The problems of interest are decision-free,
computationally complex problems as are often found in
signal processing and control applications. A problem
description normally results in the definition of a function
given by the triple (X,Y,F). The set X represents the set
of admissible inputs, the set Y represents the set of
admissible outputs, and F:X->Y is the rule of
correspondence which unambiguously assigns exactly one
element from Y to each element of X. Associated with a
computational problem is one or more algorithms. An
algorithm is an explicit mathematical statement, expressed
as a [...] set of primitive operations, which explains
how to implement the rule of correspondence F. In
general, a given problem can be decomposed by several
different primitive operator sets. Also, for a given
primitive operator set, there are often different orderings
of primitive operations which can be specified to carry out
the problem. Of special interest are algorithm
decompositions in which two or more primitive operations can be
performed concurrently. For such decompositions, the
potential exists for decreasing the computational time
required to solve the problem by increasing the
computational resources which implement the primitive operations.
CH2541-1/88/0000/0538$01.00 © 1988 IEEE
The hardware environment for executing the
decomposed algorithms is assumed to consist of R identical
processors or functional units (FUNs), where R has a value
[...] to maintain small communication times relative to process
times. Each FUN is a processor having local memory for
program storage and temporary input and output data
containers. Each FUN can execute any algorithm
primitive operation. The FUNs share a common global
memory (GLM) which may be either centralized or
distributed. The coordination of FUNs in relation to data
and control flow is directed by the graph manager (GRM).
The GRM also may be centralized or distributed. Output
created by the completion of a primitive operation is
placed into global memory only after the output data
containers have been emptied. That is, outputs must be
consumed as inputs to successor primitive operations
before allowing new data to fill the output locations.
Assignment of a functional unit to a specific algorithm
primitive operation is made by the GRM only when all
inputs required by the operation are available in global
memory and a functional unit is available.
An algorithm marked graph is a marked graph which
represents a specific algorithm decomposition. Vertices of
the algorithm graph are in a one-to-one correspondence
with each occurrence of a primitive operation. The
algorithm graph contains an edge (i,j) directed from vertex
i to vertex j if the output of primitive operation i is an
input for primitive operation j. Edge (i,j) is marked with a
token if an output from primitive operator i is available as
an input to primitive operator j. When constructing an
algorithm graph, vertices (primitive operations) are
displayed as circles, and edges (signals) are
displayed as directed line segments connecting appropriate
vertices. The presence of a token on an edge is indicated
by a solid dot placed on the edge. Source transitions and
sink transitions for input and output signals are
represented as squares. Sources for constants are not
usually included in the algorithm marked graph; however,
triangles are used for this purpose when necessary.
To illustrate the construction of an algorithm
marked graph, consider the problem of computing the
output of a discrete linear system given a sequence of
inputs to the system. The system is described by the
state equation
x(k) = Ax(k-1) + Bu(k)
and output equation
y(k) = Cx(k),
where x is a p-vector, u is an m-vector, and y is an
r-vector. The primitive operations are defined as matrix
multiplication and vector addition, and the natural
algorithm decomposition resulting from the state equation
description is selected. The algorithm marked graph for
this decomposed algorithm is shown in Figure 1. The
initial marking indicates that initial condition data are
available.
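The decomposition into matrix multiplication and vector addition can be made concrete with a short Python sketch of one iteration. The function names and the use of plain nested lists for matrices are illustrative assumptions. The two products B*u(k) and A*x(k-1) do not depend on each other, so they are the operations that could be assigned to different functional units at the same time.

```python
def mat_vec(m, v):
    """Primitive operation: matrix times vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def vec_add(p, q):
    """Primitive operation: vector addition."""
    return [a + b for a, b in zip(p, q)]


def iterate(A, B, C, x_prev, u):
    """One task: x(k) = A x(k-1) + B u(k), y(k) = C x(k)."""
    bu = mat_vec(B, u)        # independent of the next product
    ax = mat_vec(A, x_prev)   # consumes the initial-condition token
    x = vec_add(ax, bu)
    y = mat_vec(C, x)
    return x, y


# Small hypothetical system: p = 2 states, m = 1 input, r = 1 output.
A = [[1, 0], [0, 1]]
B = [[1], [2]]
C = [[1, 1]]
x1, y1 = iterate(A, B, C, [0, 0], [1])
x2, y2 = iterate(A, B, C, x1, [1])
```

The value x(k) returned by one call becomes x(k-1) for the next call, which corresponds to the token carried on the loop edge of the algorithm marked graph.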
Figure 1. Algorithm marked graph for discrete system equation.
The algorithm marked graph is a useful tool for
representing decomposed algorithms and for displaying
data flow within an algorithm. However, the algorithm
graph does not display procedures that a computing
structure must manifest in order to perform the computing
task. [...] and resource management are not apparent.
These important aspects of concurrent processing are
included in the ATAMM model through the definition of
two additional graphs. The node marked graph is defined
to model the execution of a primitive operation; the
computational marked graph, obtained from the AMG and
the NMG by a set of construction rules, integrates both
the algorithm requirements and the computing
environment requirements into a comprehensive graph
model. These additional marked graphs are defined in the
following.
A node marked graph (NMG) is a Petri net representation of
the performance of a primitive operation by a functional
unit. Three primary activities, reading of input data from
global memory, processing of input data to compute
output data, and writing of output data to global memory,
are represented as transitions (vertices) in the NMG.
Data and control flow paths are represented as places
(edges), and the presence of signals is notated by tokens
marking appropriate edges. The conditions for firing the
process and write transitions of the NMG are as defined for
a general Petri net, while the read transition has one
additional condition for firing. In addition to having a
token present on each incoming signal edge, a functional
unit (FUN) must be available for assignment to the primitive
operation before the read node can fire. Once assigned, the
functional unit is used to implement the read, process, and
write operations before being returned to a queue of
available FUNs. The initial marking for an NMG consists
of a single token in the "process ready" place. The NMG
model is shown in Figure 2.
Figure 2. ATAMM node marked graph model.
Legend: IF = Input Buffer Full, IE = Input Buffer Empty,
DR = Data Read, PC = Process Complete, PR = Process Ready,
OE = Output Buffer Empty, OF = Output Buffer Full.
A computational marked graph (CMG) is constructed
from the AMG and the NMG by the following rules.
1. Source and sink nodes in the algorithm marked
graph are represented by source and sink nodes in the
CMG.
2. Nodes corresponding to primitive operations in
the algorithm marked graph are represented by NMGs in
the CMG.
3. Edges in the algorithm marked graph are
represented by edge pairs, one forward directed for data
flow and one backward directed for control flow, in the
CMG. The initial marking for the edge pair consists of a
single token in the forward-directed place if data are
available, or a single token in the backward-directed place
if data are not available.
The play of the CMG proceeds according to the
following graph rules.
1) A node is enabled when all incoming edges are
marked with a token. An enabled node fires by
encumbering one token from each incoming edge, delaying
for some specified transition time, and then depositing one
token on each outgoing edge.
2) A source node and a sink node fire when enabled
without regard for the availability of a FUN.
3) A primitive operation is initiated when the read
node of an NMG is enabled and a FUN is available for
assignment to the NMG. A FUN remains assigned to an
NMG until completion of the firing of the write node of
the NMG.
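The play rules above can be tried out with a small untimed token game. The sketch below is a simplification under stated assumptions, not the ATAMM simulator: the class name TokenGame, the place names, and the exact wiring of the single read/process/write node are hypothetical, and transition delays are ignored. It only shows that a node fires when every incoming place is marked, and that the read node additionally needs a free FUN, which is returned when the write node fires.

```python
class TokenGame:
    """Untimed play of a marked graph with a shared pool of FUNs."""

    def __init__(self, places, marking, free_funs, acquire, release):
        # places: name -> (from_node, to_node); marking: name -> tokens
        self.places = dict(places)
        self.marking = {p: marking.get(p, 0) for p in self.places}
        self.free_funs = free_funs
        self.acquire = set(acquire)   # nodes that take a FUN when firing
        self.release = set(release)   # nodes that return a FUN when firing

    def inputs(self, node):
        return [p for p, (_, dst) in self.places.items() if dst == node]

    def outputs(self, node):
        return [p for p, (src, _) in self.places.items() if src == node]

    def enabled(self, node):
        marked = all(self.marking[p] > 0 for p in self.inputs(node))
        has_fun = node not in self.acquire or self.free_funs > 0
        return marked and has_fun

    def fire(self, node):
        if not self.enabled(node):
            raise RuntimeError(f"{node} is not enabled")
        for p in self.inputs(node):
            self.marking[p] -= 1      # encumber one token per input place
        for p in self.outputs(node):
            self.marking[p] += 1      # deposit one token per output place
        if node in self.acquire:
            self.free_funs -= 1
        if node in self.release:
            self.free_funs += 1


def single_node_example(free_funs):
    """One primitive operation between a source and a sink (assumed wiring)."""
    places = {
        "in": ("src", "read"), "in_ack": ("read", "src"),
        "dr": ("read", "process"), "pc": ("process", "write"),
        "pr": ("write", "read"),
        "out": ("write", "sink"), "out_ack": ("sink", "write"),
    }
    initial = {"in_ack": 1, "pr": 1, "out_ack": 1}
    return TokenGame(places, initial, free_funs, {"read"}, {"write"})


game = single_node_example(free_funs=1)
for node in ["src", "read", "process", "write", "sink"]:
    game.fire(node)
```

With free_funs=0 the source still fires, but the read node never becomes enabled, which corresponds to rules 2) and 3).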
In order to illustrate the construction of a
computational marked graph, the CMG corresponding to
the algorithm marked graph of Figure 1 is shown in
Figure 3. The computational marked graph is useful
because it clearly displays the data and control flow which
must occur in any hardware implementation of the model
process, and because it provides a hardware independent
context in which to evaluate process performance.
Figure 3. Computational marked graph for discrete system equation.
The complete ATAMM model consists of the
algorithm marked graph, the node marked graph, and the
computational marked graph. A pictorial display of this
model is shown in Figure 4. In the next section, time
performance characteristics of the ATAMM model are
investigated.
Figure 4. ATAMM model components.
III. PERFORMANCE BOUNDS
The importance of the ATAMM model is that it
establishes a context in which to investigate the
performance of decomposed algorithms in multiprocessor
data flow architectures. In this section, performance
measures indicating computing speed and throughput
capacity are defined. Bounds for these quantities are
calculated analytically from the algorithm marked graph
and the computational marked graph. This information is
essential for efficiently matching algorithm decompositions
with architecture implementations. The work presented in
this section is an interesting application and extension of
recent investigations of the performance of Petri nets [12],
[13] and marked graphs [14].
It is assumed that a decomposed algorithm is
implemented in a multiprocessor architecture containing R
computing resources or functional units. Each functional
unit is capable of performing any of the primitive
operations whose sequence defines the decomposition. A
computational task is initiated when an input data token
from the source node is encumbered. Task output occurs
when a corresponding output data token is deposited at
the output sink node. A task is completed when all
computing associated with the task is completed. It
should be noted that task output and task completion do
not always coincide. In many iterative signal processing
algorithms, computing to generate initial conditions for the
next iteration often occurs after an output has been
calculated. Task completion is usually indicated in the
AMG or the CMG by the return of the graph to some
steady-state initial marking. To facilitate measurement of
throughput capacity, it is assumed that tasks are repeated
periodically with new input data sets. New data sets are
available continuously as input tokens from the input
source node. Included in this problem class are iterative
algorithms where the present task requires as inputs data
from previous task calculations.
Concurrency in this problem setting occurs in two
ways. First, different functional units may perform
simultaneously several primitive operations belonging to a
single task. This type of concurrency is referred to as
vertical concurrency. Vertical concurrency has a direct
effect on task computing speed. It is limited by the
number of primitive operations that can be performed
simultaneously in a given algorithm decomposition, and by
the number of functional units available to perform the
primitive operations. Second, different functional units
may perform simultaneously primitive operations
belonging to different tasks sequentially input to the
computing system. This type of concurrency is referred to
as horizontal concurrency. Horizontal concurrency has a
direct effect on throughput capacity. It is limited by the
capacity of the graph to accommodate additional task
inputs, and by the number of functional units available to
implement the tasks. In the following it is shown that the
process of algorithm decomposition imposes bounds on the
amount of vertical concurrency and horizontal concurrency
possible in a given problem. If sufficient computing
resources are available, operation at these bounds can be
achieved. If the number of computing resources is limited,
the bounds cannot be achieved, and only reduced amounts
of vertical concurrency and horizontal concurrency are
possible.
Three performance measures for concurrent
processing are defined. The first two parameters, TBIO
and TT, are indicators of computing speed and thus reflect
the degree of vertical concurrency. The third parameter,
TBO, is a measure of throughput capacity and thus
reflects the degree of horizontal concurrency.
Definition 1: TBIO. The performance measure TBIO is
the computing time which elapses between a task input
and the corresponding task output.
Definition 2: TT. The performance measure TT is the
computing time which elapses between a task input and
the completion of all computation associated with that
task.
Definition 3: TBO. The performance measure TBO is the
computing time which elapses between successive task
outputs when the graph is operating periodically in
steady-state.
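As a small illustration of the three definitions, the measures can be read directly off a log of event times. The function name performance_measures and the timestamps below are hypothetical and chosen only for illustration; they are not measurements from the report.

```python
def performance_measures(input_times, output_times, completion_times):
    """Return per-task TBIO, per-task TT, and successive-output TBO values.

    Entry k of each list is the time at which task k was input, produced
    its output, or finished all of its computing.
    """
    tbio = [o - i for i, o in zip(input_times, output_times)]
    tt = [c - i for i, c in zip(input_times, completion_times)]
    tbo = [b - a for a, b in zip(output_times, output_times[1:])]
    return tbio, tt, tbo


# Illustrative steady-state log for three periodic tasks.
tbio, tt, tbo = performance_measures(
    input_times=[0, 7, 14],
    output_times=[10, 17, 24],
    completion_times=[11, 18, 25],
)
```

In this log TT exceeds TBIO for every task, which is the case noted earlier where task completion occurs after task output.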
The remainder of this section is devoted to developing
lower bounds for these performance measures.
Let G denote an algorithm marked graph
representing a decomposed algorithm. The lower bound
for TBIO is the shortest time required for a data token
from the data input source to propagate through the graph
to the data output sink. Similarly, the lower bound for
TT is the shortest time required to complete all computing
activity initiated by the injection of a data token from the
data input source. These shortest times are the actual
performance times when only a single task is active in the
graph during any time interval (no horizontal
concurrency), and when as many computing resources as are
needed are available (maximum vertical concurrency).
The lower bounds for TBIO and TT are calculated by
identifying certain longest paths in a graph obtained from
the algorithm marked graph. This new graph, called the
modified algorithm graph G_M, is defined and then used to
determine lower bounds for TBIO and TT.
Definition 4: Modified Algorithm Graph. Let p_i be a place
of G, directed from transition t_r to transition t_s, which
contains a token of the initial marking. The modified
algorithm graph G_M is obtained from the graph G by the
following construction rules.
1. Place p_i is deleted from G.
2. A new place p_i1, directed from the data input
source to transition t_s, is added to G.
3. A new output sink s_i, different from all other
output sinks, and a new place p_i2 directed from
transition t_r to s_i, are added to G.
4. The above rules are repeated for each place of G
containing a token of the initial marking.
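The construction rules of Definition 4 can be sketched as a short graph transformation. The data layout (a dictionary mapping a place name to its tail transition, head transition, and a flag for an initial token), the function name, and the "-1"/"-2" and "s_" naming scheme are assumptions made for illustration. The sample graph uses edge labels inferred from the worked example given later in this section, with hypothetical transition names t1 to t4.

```python
def modified_algorithm_graph(places, source):
    """Apply the rules of Definition 4 to every initially marked place.

    places: name -> (t_r, t_s, marked), a place directed from t_r to t_s.
    Returns name -> (tail, head) for the modified graph.
    """
    modified = {}
    for name, (t_r, t_s, marked) in places.items():
        if not marked:
            modified[name] = (t_r, t_s)              # unmarked place is kept
            continue
        # Rule 1: the marked place itself is dropped.
        modified[f"{name}-1"] = (source, t_s)        # Rule 2: source -> t_s
        modified[f"{name}-2"] = (t_r, f"s_{name}")   # Rule 3: t_r -> new sink
    return modified


# Assumed layout: place 5 carries the initial-condition token.
G = {
    "1": ("s_I", "t1", False),
    "2": ("t1", "t2", False),
    "3": ("t2", "t3", False),
    "4": ("t3", "s_O", False),
    "5": ("t4", "t2", True),
    "6": ("t2", "t4", False),
}
G_M = modified_algorithm_graph(G, source="s_I")
```

Because every initially marked place is cut, the modified graph of a live marked graph contains no directed circuits, so longest paths in it are well defined.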
Lower bounds for TBIO and TT are presented in
Theorem 1 and Theorem 2 respectively.
Theorem 1: Lower Bound for TBIO. Let P_i be the ith
directed path in G_M from the data input source to the
data output sink, and let T(P_i) denote the sum of
transition times for transitions contained in P_i. Then,
TBIO_LB = Max { T(P_i) },
where the maximum is taken over all paths P_i in graph
G_M.
Proof. Without loss of generality, let t_f be the last
transition in all paths P_i directed from the data input
source to the data output sink. Transition t_f is enabled
when each input place for t_f contains a token. Since by
assumption a computing resource is available, t_f fires as
soon as it becomes enabled. Let p_q be the last input place
for t_f to acquire a token, and let t_g be the input transition
for place p_q. Continuing this labeling procedure results in
a backward path construction process. This process is
repeated, first at t_g, and then at each succeeding transition
until the data input source is reached, identifying a path
P_j. By the construction process for the path, it is clear
that T(P_j) = Max { T(P_i) }, where the maximum is over
all paths P_i in G_M. It is also clear that TBIO_LB can be
no shorter than T(P_j) so that TBIO_LB ≥ T(P_j). Since a
computing resource is available when each transition in P_j
is enabled, the time between input and corresponding
output can be no longer than T(P_j) so that
TBIO_LB ≤ T(P_j). Therefore, TBIO_LB = T(P_j) = Max
{ T(P_i) }, where the maximum is over all paths P_i in G_M.
This completes the proof.
Theorem 2: Lower Bound for TT. Let P_i be the ith
directed path in G_M from the data input source to any
output sink, and let T(P_i) denote the sum of transition
times of transitions contained in P_i. Then,
TT_LB = Max { T(P_i) },
where the maximum is taken over all paths P_i in graph
G_M.
Proof. By the construction rules for graph G_M, a task is
initiated when input data tokens are input from the data
input source, and is completed when all output sinks have
accepted tokens. Therefore, TT is the time which elapses
from injection of input tokens to the arrival of a token at
the last fired output sink. Let T(P_t) = Max { T(P_i) }, P_i in
G_M, be the longest path time of paths from the data input
source s_I to any output sink, say s_t. Since a token must
reach sink s_t before a task is completed, it follows that
TT_LB ≥ T(P_t). Since a resource is available for each
transition to fire when enabled, and since P_t is the longest
path in G_M, it also follows that TT_LB ≤ T(P_t). Therefore,
TT_LB = T(P_t) = Max { T(P_i) }, where the maximum is
over all paths P_i in G_M. This completes the proof.
To illustrate the application of Theorem 1 and
Theorem 2, TBIO_LB and TT_LB are computed for the
algorithm graph shown in Figure 1. For this example, the
following transition times are assumed: T(1) = 4,
T(2) = 1, T(3) = 5, and T(4) = 6. The modified
algorithm graph corresponding to Figure 1 is shown in
Figure 5. The modified algorithm graph contains two
paths directed from the data input source s_I to the data
output sink s_O. Path P_1 consists of edge set {1, 2, 3, 4}
with T(P_1) = 10, and path P_2 consists of edge set {5-1, 3,
4} with T(P_2) = 6. Therefore, since T(P_1) > T(P_2), path
P_1 determines the lower bound for TBIO and TBIO_LB =
10. The modified algorithm graph contains two additional
directed paths from the data input source s_I to the output
sink s_5. Path P_3 consists of edge set {1, 2, 6, 5-2} with
T(P_3) = 11, and path P_4 consists of edge set {5-1, 6, 5-2}
with T(P_4) = 7. Since T(P_3) > T(P_1) > T(P_4) > T(P_2),
path P_3 determines the lower bound for TT and
TT_LB = 11.
Figure 5. Modified algorithm graph for Figure 1.
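The path sums in this example can be checked with a short longest-path computation. The sketch below assumes a layout of the modified algorithm graph inferred from the edge sets listed in the text (edges 1, 2, 3, 4, 5-1, 5-2, 6); the node names t1 to t4, the dictionaries, and the function name are hypothetical. Source and sink nodes are given zero transition time.

```python
from functools import lru_cache

# Transition times from the example; sources and sinks take no time.
TIMES = {"s_I": 0, "t1": 4, "t2": 1, "t3": 5, "t4": 6, "s_O": 0, "s_5": 0}

# Assumed edges of the modified algorithm graph: label -> (tail, head).
EDGES = {
    "1": ("s_I", "t1"), "2": ("t1", "t2"), "3": ("t2", "t3"),
    "4": ("t3", "s_O"), "5-1": ("s_I", "t2"), "6": ("t2", "t4"),
    "5-2": ("t4", "s_5"),
}


def longest_path_time(edges, times, source, sink):
    """Largest sum of transition times over directed paths source -> sink.

    Returns None when the sink cannot be reached. The graph must be acyclic.
    """
    successors = {}
    for tail, head in edges.values():
        successors.setdefault(tail, []).append(head)

    @lru_cache(maxsize=None)
    def best(node):
        if node == sink:
            return times[node]
        tails = [best(nxt) for nxt in successors.get(node, [])]
        tails = [t for t in tails if t is not None]
        return times[node] + max(tails) if tails else None

    return best(source)


tbio_lb = longest_path_time(EDGES, TIMES, "s_I", "s_O")
tt_lb = max(longest_path_time(EDGES, TIMES, "s_I", sink)
            for sink in ("s_O", "s_5"))
```

Under these assumptions tbio_lb evaluates to 10 and tt_lb to 11, in agreement with Theorem 1 and Theorem 2 as applied above.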
Next a lower bound for the performance measure
TBO is presented. Let G be a computational marked
graph representing a decomposed algorithm. It is assumed
that operating conditions for G are set to maximize
horizontal concurrency. That is, data tokens are
continuously available at the data input source, and as
many computing resources as needed can be called to
perform primitive operations. With these conditions, the
graph plays periodically in steady-state, and TBO_LB is
the shortest time possible between successive outputs.
Theorem 3: Lower Bound for TBO. Let G be a
computational marked graph and let C_i be the ith directed
circuit in G. The notation T(C_i) denotes the sum of
transition times of transitions contained in C_i, and M(C_i)
denotes the number of tokens contained in C_i. Then,
TBO_LB = Max { T(C_i) / M(C_i) },
where the maximum is taken over all directed circuits in
G.
Proof. Without loss of generality, let t_f be the output
transition in G so that an output is produced each time t_f
completes firing. Then TBO_LB is the minimum firing
period of transition t_f. It is shown in [15] that
the minimum firing period of each transition of a marked
graph is given by Max { T(C_i) / M(C_i) }, where the
maximum is taken over all directed circuits C_i in G.
Therefore, the theorem follows.
The computational marked graph shown in Figure 3
is used to illustrate Theorem 3. This CMG contains many
directed circuits. However, the directed circuit which
contains all NMG nodes of transitions 2 and 4 contains
only one token and maximizes the ratio T(C_i) / M(C_i).
Therefore, the shortest time possible between successive
outputs in this graph is TBO_LB = 7.
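The circuit ratio of Theorem 3 can be evaluated by brute force on a small graph. The sketch below is a simplification under stated assumptions: each primitive operation is collapsed to a single transition carrying its whole transition time (the read, process, and write nodes of the NMG are not modeled separately), the graph layout and the names t1 to t4 are inferred from the example rather than taken from Figure 3, and all function names are hypothetical. Edge pairs follow construction rule 3 for the CMG.

```python
from fractions import Fraction


def edge_pairs(amg_edges):
    """Expand each AMG edge into a forward data place and a backward
    control place, marked according to whether data are available."""
    places = []
    for tail, head, data_available in amg_edges:
        places.append((tail, head, 1 if data_available else 0))
        places.append((head, tail, 0 if data_available else 1))
    return places


def simple_circuits(places):
    """Yield every simple directed circuit as a list of place indices."""
    nodes = sorted({n for tail, head, _ in places for n in (tail, head)})
    rank = {n: i for i, n in enumerate(nodes)}
    out = {n: [] for n in nodes}
    for idx, (tail, _, _) in enumerate(places):
        out[tail].append(idx)

    def extend(start, node, path, seen):
        for idx in out[node]:
            nxt = places[idx][1]
            if nxt == start:
                yield path + [idx]
            elif nxt not in seen and rank[nxt] > rank[start]:
                yield from extend(start, nxt, path + [idx], seen | {nxt})

    for start in nodes:   # each circuit is found from its lowest-ranked node
        yield from extend(start, start, [], {start})


def tbo_lower_bound(places, times):
    """Max over circuits of (sum of transition times) / (token count)."""
    best = Fraction(0)
    for circuit in simple_circuits(places):
        total_time = sum(times[places[i][0]] for i in circuit)
        tokens = sum(places[i][2] for i in circuit)
        if tokens == 0:
            raise ValueError("token-free circuit: the graph is not live")
        best = max(best, Fraction(total_time, tokens))
    return best


TIMES = {"s_I": 0, "t1": 4, "t2": 1, "t3": 5, "t4": 6, "s_O": 0}
AMG = [("s_I", "t1", True), ("t1", "t2", False), ("t2", "t3", False),
       ("t3", "s_O", False), ("t2", "t4", False), ("t4", "t2", True)]
tbo_lb = tbo_lower_bound(edge_pairs(AMG), TIMES)
```

In this simplified model the circuits through transitions 2 and 4 each hold one token and have total time 1 + 6 = 7, so tbo_lb equals 7, matching the value stated above. Enumerating all simple circuits is only practical for small graphs.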
The optimum time performance for this example
algorithm is described by the following characteristics.
The algorithm accepts an input and issues an output every
7 time units. Each input requires a total of 11 time units
of processing, and an output is issued 10 time units after
the input is accepted. It can be shown by simulation that
3 functional units are required to achieve this performance.
The addition of more functional units will not improve the
computing speed or throughput rate for this algorithm
decomposition.
IV. CONCLUSION
A new model useful for understanding the
relationship between decomposed algorithms and data flow
architectures has been presented. Named ATAMM for
Algorithm To Architecture Mapping Model, the model
consists of Petri net marked graphs called the algorithm
marked graph, the node marked graph, and the
computational marked graph. Time performance measures
of time between input and output (TBIO), task time
(TT), and time between outputs (TBO) were defined.
Then lower bounds for the performance measures were
calculated analytically from the modified algorithm graph
and the computational marked graph. An example to
illustrate these concepts was presented.
Simulation tools and an actual hardware prototype
have been developed to test and validate the ATAMM
model. The simulation software package [16] consists of a
PC-based computer model of the CMG. Algorithms are
entered to the package by specifying the algorithm marked
graph, and simulation output consists of a graphical
display of the movement of tokens. An accompanying
diagnostic software package [17] automatically computes
and displays performance measures and other performance
data. A hardware prototype [18] has also been constructed
to validate the ATAMM operating rules and to provide a
benchmark for testing the simulation software. The
architecture is shown in Figure 6 and is one of several
candidates which could be used to perform concurrent
operations according to the ATAMM rules. A primary
motivation for this particular design was the availability of
hardware. The system consists of S-100 crates having an
Intel 8088 CPU card, multiple serial I/O channels, and
32K memory. An IBM/XT is used to host the system and
to download algorithm graph descriptions to the system.
A number of decomposed algorithms, including those
presented here, have been investigated using these tools.
Figure 6. Prototype hardware configuration for ATAMM
validation (IBM PC/XT host, 9600 baud links).
Continuing research is designed to generalize the
ATAMM model and is focused in three main areas. The
present model assumes that all functional units are
identical and that each is able to perform all primitive
operations. An important extension is to model the
situation where there are two or more different groupings
of processors where each group is able to perform only a
subset of the required primitive operations. The present
model represents only decision-free algorithms. Another
important extension is to develop the capability to admit
algorithms containing data-dependent branching points.
Finally, methods for achieving optimum time performance
are being studied in the context of the ATAMM model.
ACKNOWLEDGEMENT
The work reported here was supported in part by the
NASA Langley Research Center under Grant NAG-1-683.
REFERENCES
[1] P. Treleaven, D. Brownbridge and R. Hopkins,
"Data-driven and demand-driven computer architecture,"
Computing Surveys, Vol. 14, pp. 93 - 143, March 1982.
[2] V. Srini, "An architectural comparison of
dataflow systems," Computer, pp. 68 - 88, March 1986.
[3] W. Rheinboldt, "Report of the panel on future
directions in computational mathematics, algorithms, and
scientific software," sponsored by NSF Grant
DMS-85-3,183, SIAM, 1985.
[4] T. Longo, G. Herzog and D. Maxwell, "A fast
single chip 1750A CPU and compatible support
components in VHSIC-size CMOS technology,"
Proceedings of the Government Microcircuit Applications
Conference, pp. 317 - 320, 1986.
[5] W. Wehner, W. Everhart, S. Shankar and
K. Stalsberg, "A VHSIC architecture for highly parallel
image understanding," Proceedings of the Government
Microcircuit Applications Conference, pp. 117 - 120,
November 1986.
[6] M. Sowa and T. Murata, "A data flow computer
architecture with program and token memories," IEEE
Transactions on Computers, Vol. 31, pp. 820 - 824,
September 1982.
[7] K. Kavi, B. Buckles and U. Narayan Bhat, "A
formal definition of data flow graph models," IEEE
Transactions on Computers, Vol. 35, pp. 940 - 948,
November 1986.
[8] M. Granski, I. Koren and G. Silberman, "The
effect of operation scheduling on the performance of a data
flow computer," IEEE Transactions on Computers, Vol.
36, pp. 1019 - 1029, September 1987.
[9] L. Jamieson, H. Siegel, E. Delp and A. Whinston,
"The mapping of parallel algorithms to reconfigurable
parallel architectures," Proceedings of Future Directions in
Computer Architecture and Software, D. Agrawal Ed.,
ARO Contract DAAG29-81-D-0100, pp. 147 - 154, May
1986.
[10] J. Peterson, Petri Net Theory and the Modeling
of Systems, Englewood Cliffs, N.J.: Prentice-Hall, 1981.
[11] T. Murata, "Circuit theoretic analysis and
synthesis of marked graphs," IEEE Transactions on
Circuits and Systems, Vol. 24, pp. 400 - 405, July 1977.
[12] J. Sifakis, "Performance evaluation of systems
using nets," Net Theory and Applications, W. Brauer,
Editor, pp. 307 - 319, Springer-Verlag, 1979.
[13] C. Ramamoorthy and G. Ho, "Performance
evaluation of asynchronous concurrent systems using Petri
nets," IEEE Transactions on Software Engineering, Vol. 6,
pp. 440 - 449, September 1980.
[14] T. Murata, "Synthesis of decision-free
concurrent systems for prescribed resources and
performance," IEEE Transactions on Software
Engineering, Vol. 6, pp. 525 - 530, November 1980.
[15] T. Murata, "Modeling and analysis of concurrent
systems," Handbook of Software Engineering, C. Vick and
C. Ramamoorthy, Editors, pp. 39 - 63, Van Nostrand
Reinhold, 1984.
[16] K. Jackson, R. Tymchyshyn, R. Mielke and
J. Stoughton, "Simulation software for concurrent
processing," Proceedings of the IEEE Southeastcon
Conference, pp. 82 - 86, April 1987.
[17] R. Obando, "Simulation software for
performance evaluation of concurrent processing," Master's
Thesis, Old Dominion University, Norfolk, Virginia,
October 1987.
[18] J. Stoughton and R. Mielke, "Petri net model for
concurrent processing of complex algorithms," Proceedings
of the Government Microcircuit Applications Conference,
pp. 11 - 14, November 1986.

