Simulator for heterogeneous dataflow architectures by Malekpour, Mahyar R.
NASA Contractor Report 191545
Simulator for Heterogeneous Dataflow Architectures
Mahyar R. Malekpour
Lockheed Engineering & Sciences Company
Hampton, Virginia
Contract NAS1-19000
September 1993
National Aeronautics and
Space Administration
Langley Research Center
Hampton, Virginia 23681-0001
(NASA-CR-191545) SIMULATOR FOR
HETEROGENEOUS DATAFLOW
ARCHITECTURES Report, 1 Jun. 1991 -
31 Aug. 1992 (Lockheed Engineering
and Sciences Corp.) 56 p
G3/33
N94-15797
Unclas
0187789
https://ntrs.nasa.gov/search.jsp?R=19940009324 2020-06-16T21:30:35+00:00Z

Table of Contents
1. Introduction
2. Overview of ATAMM
2.1 Model Components
2.2 Performance Measures and Bounds
2.3 Control Edges
3. Simulator Implementation Issues
3.1 Target Hardware Architecture
3.2 Implementing ATAMM
3.3 Generic State Diagram of the AMOS
3.4 Event-Driven
3.5 Simulation of Graphs with Variable Node Latencies
3.6 Simulation of Graphs with Static Node to Processor Assignments
3.7 Simulation of Multiple Graphs
3.8 Graph Entry, Simulator, Analysis, and AIE Tools
4. Simulator Design and Development
4.1 Object-Oriented Programming
4.2 Programming Environment and Language
4.3 Objects and their Relationships
4.4 Simulator-Kernel
4.5 Algorithm Graphs
4.6 Processor-Group
4.6.1 Graph-Manager
4.6.2 Functional Units
4.6.3 Functional Unit State Diagram Description
4.6.4 FU Lists
4.6.5 Local-Networks
4.7 Global-Networks
4.8 TBO/TBIO and Ensemble TBO/TBIO
4.9 System
4.10 The Input and Output File Formats
4.11 How to Use the Simulator
5. Case Studies and Experimental Results
5.1 Case Study 1
5.2 Case Study 2
6. Summary
References
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Acronyms

ADM      Advanced Development Model
AIE      ATAMM Integrated Environment
AMG      Algorithm Marked Graph
AMOS     ATAMM Multicomputer Operating System
ATAMM    Algorithm To Architecture Mapping Model
CMG      Computational Marked Graph
ENS      Ensemble
FDT      Fire/Data/Time
FU       Functional Units
GRF      Graph
GVSC     Generic VHSIC Spaceborne Computer
NMG      Node Marked Graph
OOP      Object-Oriented Programs
PI-Bus   Parallel Interprocessor Bus
SGP      Single Graph Play
TBI      Time Between successive Inputs
TBIO     Time Between Input and corresponding Output
TBO      Time Between Outputs
TCE      Total Computing Effort
TGP      Total Graph Play
VHSIC    Very High Speed Integrated Circuit
1. Introduction
The Algorithm To Architecture Mapping Model (ATAMM) is a Petri net based model
capable of describing the periodic execution of large-grained, data-independent algorithm graphs
on multiprocessor architectures. ATAMM provides a description of the data flow and control
flow necessary to provide for the predictable execution of an algorithm in real-time.
The objective of this research is to develop a software simulator capable of simulating the
execution of a graph on a given system under the ATAMM rules. The purpose of the simulator is
to enable a study of the behavior and performance of both heterogeneous and homogeneous
multicomputer dataflow systems prior to the availability of hardware prototypes. This simulator
is able to assist with the development of ATAMM-based architectures and the investigation of
theories concerning the ATAMM model. This simulator is user-friendly and flexible to permit
examining different attributes of a generic system. The simulator also provides the means to
identify an architecture by specifying different parameters of the system in order to evaluate the
periodic execution of an algorithm on a given hardware system. Evaluation of the simulator is
conducted through several case studies.
Section 2 of this report is an overview of ATAMM. Performance measures are also
defined in Section 2. The implementation issues of this new simulator, which will hereafter be
referred to as the Heterogeneous ATAMM Simulator or simply as the Simulator, are discussed in
Section 3. The design and development of the Simulator are presented in Section 4. Case studies
and simulation results of example algorithm graphs are presented in Section 5. This report
concludes, in Section 6, with a discussion of ongoing and future research to expand the model to a
broader class of multiprocessor architectures.
The use of brand names is for completeness and does not imply NASA endorsement.
2. Overview of ATAMM
2.1 Model Components
ATAMM is designed to model the control, scheduling, and communication issues for
computational algorithms accepting periodic input data and generating periodic output data [1].
ATAMM models data-driven real-time algorithms which may be represented by data-independent
directed graphs. The nodes of the graph are assumed to be of sufficient computational complexity
to warrant parallel execution. The target hardware system has previously consisted of a set of
homogeneous processors. This Simulator, however, is intended to support the extension of
ATAMM to heterogeneous processors.
The model consists of a set of Petri net marked graphs [2, 3, 4] which combine the
functions of an algorithm with the necessary computing activities. The Algorithm Marked Graph
(AMG), the Node Marked Graph (NMG), and the Computational Marked Graph (CMG)
constitute the three components of the ATAMM. The AMG represents a specific decomposition
of the functional computation requirements. The AMG, as illustrated by the example in Figure 1,
uses nodes (circles) to represent blocks of code or processes which are to be executed and edges
(directed line segments) to represent data dependencies between the nodes. Each AMG node is
executed to completion before another node may be scheduled on the same processor. A token
(solid dot) on an edge represents the presence of a single data packet. All edges may have a pool
of buffers and can accommodate more than one token at a time. A node consumes one token from
each of its input edges when it fires (begins execution) and deposits one token on each of its
output edges when it completes execution. Source and sink transitions for input and output
signals are represented as rectangles.
Figure 1. An example Algorithm Marked Graph.
Figure 2. An example Node Marked Graph.
(Edge labels in Figure 2: Input Empty, Process Ready, Output Empty, Output Available.)
The Node Marked Graph (NMG), illustrated in Figure 2, is a representation of the
execution of an AMG node by a processor. Three primary activities associated with execution of
an AMG node, reading of input data (R), processing of input data to generate output data (P),
and writing of output data (W), are incorporated in the NMG. A recent enhancement of the
model [4, 5] allows m tokens on the Process Ready edge, which permits m simultaneous
instantiations of the node to be executed in parallel on different processors with different data
packets. The n tokens on the Output Empty edge indicate that the predecessor AMG node can be
instantiated up to n times before an output is consumed by the successor node. The value of n is
always greater than or equal to m. The values of n and m are determined by a graph analysis
procedure and are typically different for each AMG node. Tokens on the Output Available edge
indicate the presence of data on the edge.
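
The AMG firing rule described earlier in this section (one token consumed from each input edge when a node fires, one token deposited on each output edge when it completes) can be sketched in a few lines of Python. This is an illustrative sketch only: the names `Edge`, `Node`, `enabled`, `fire`, and `complete` are assumptions made for the example and are not taken from the Simulator source.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """A directed AMG edge; `tokens` counts the data packets it holds."""
    tokens: int = 0


@dataclass
class Node:
    """An AMG node with its input and output edges (hypothetical layout)."""
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

    def enabled(self) -> bool:
        # A node may fire only when every input edge holds at least one token.
        return all(edge.tokens > 0 for edge in self.inputs)

    def fire(self) -> None:
        # Firing (start of execution) consumes one token per input edge.
        if not self.enabled():
            raise RuntimeError("node is not enabled")
        for edge in self.inputs:
            edge.tokens -= 1

    def complete(self) -> None:
        # Completion deposits one token on each output edge.
        for edge in self.outputs:
            edge.tokens += 1


a, b, out = Edge(tokens=1), Edge(tokens=2), Edge()
node = Node(inputs=[a, b], outputs=[out])
node.fire()
node.complete()
print(a.tokens, b.tokens, out.tokens)  # 0 1 1
```

After one firing, edge `a` is empty, so the node is no longer enabled even though `b` still holds a token.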
The Computational Marked Graph (CMG), illustrated in Figure 3 (for the AMG of Figure
1 and for the simple case of m = 1 for all nodes), is constructed by replacing each AMG node with
its NMG and replacing each AMG edge with an edge pair, consisting of a forward directed edge
representing data flow and a backward directed edge representing control flow. As both a
graphical and mathematical model, the CMG is useful for determining the performance bounds as
well as the data and control flow required for a hardware implementation.
Two types of concurrency are possible when executing an algorithm decomposition as
specified by the CMG. First, several nodes of the dataflow graph without data interdependency
may be simultaneously performed on the same data packet. This is referred to as parallel
concurrency because it is the result of inherent parallelism in the graph [6]. The amount of
parallel concurrency depends on the number of parallel paths in the algorithm decomposition as
well as the number of available resources. Second, several nodes of the dataflow graph may be
simultaneously performed on different data packets. This happens when new data packets are
accepted for execution before the completion of computation of previous data packets. This
simultaneous processing of different data packets is referred to as pipeline concurrency [6]. This
type of concurrency has a direct effect on throughput. The amount of pipeline concurrency
depends on the number of available resources as well as the structure of the AMG.
2.2 Performance Measures and Bounds
The two primary performance measures for a graph are the steady-state Time Between
Outputs (TBO) and the Time Between Input and corresponding Output (TBIO). TBO is the
elapsed computing time between successive algorithm outputs. Therefore, the inverse of the
steady-state value of TBO is a measure of throughput in data packets per unit time. The
lower bound, TBOlb, and hence the upper bound on throughput, is determined by the algorithm
graph and the number of available resources. The algorithm-imposed TBOlb is determined by the
largest time per token of all directed circuits in the CMG [6]. In graphs with recurrent circuits,
TBOlb is determined by the time per token of the largest recurrent circuit in the CMG. The
second bound on TBO is imposed by the availability of resources [6] and is given by the ratio of
TCE over R where TCE (Total Computing Effort) is the summation of all the node latencies of a
CMG and is the time required for all graph nodes to execute a single data packet. R is the
number of resources. For instance, the TBO of the AMG of Figure 1, which has no recurrent
circuit, is limited only by the number of available resources. TBIO is defined as graph latency,
which is the time for a single data packet to progress from source to sink. The algorithm-imposed
lower bound, TBIOlb, is determined by the critical path from source to sink. However, the TBIO
is a function of TBO and is determined by analyzing the algorithm graph and considering the
number of resources.
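
The two bounds on TBO described above can be combined in a small calculation. The sketch below assumes that the effective lower bound is the larger of the circuit-imposed bound and the resource-imposed bound TCE/R; the function name `tbo_lower_bound` and the `(circuit_time, tokens)` input layout are hypothetical and chosen only for illustration.

```python
def tbo_lower_bound(circuits, node_latencies, resources):
    """Return a lower bound on TBO.

    circuits:       list of (total circuit time, tokens in circuit) pairs,
                    one per directed circuit of the CMG (assumed input layout)
    node_latencies: latency of every graph node; their sum is TCE
    resources:      number of available resources, R
    """
    tce = sum(node_latencies)              # Total Computing Effort
    resource_bound = tce / resources       # bound imposed by resources
    # Largest time per token over all directed circuits (0 if none given).
    circuit_bound = max((time / tokens for time, tokens in circuits), default=0.0)
    return max(circuit_bound, resource_bound)


# One circuit of total time 8 holding 2 tokens; three nodes with TCE = 12.
print(tbo_lower_bound([(8, 2)], [3, 4, 5], resources=2))  # 6.0 (resource bound)
print(tbo_lower_bound([(8, 2)], [3, 4, 5], resources=4))  # 4.0 (circuit bound)
```

With two resources the resource bound dominates; adding resources beyond three leaves the circuit bound of 4 time units as the limit.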
To achieve a desired TBO for a given algorithm graph, ATAMM requires that the input
data to the algorithm graph be supplied at the steady-state TBO rate. Therefore, the injection
rate, defined as the Time Between successive Inputs (TBI), and TBO are synonymous at the
steady-state and are used interchangeably.
Other performance measures are speedup and resource utilization. Speedup for a
homogeneous processor system is defined as the ratio of TCE over TBO. Resource utilization for
a homogeneous processor system [6], U, is defined by
\[ U = \frac{TCE}{TBO \cdot R} \]
where R is the number of available resources, and
\[ TBO \ge \frac{TCE}{R}, \qquad 0 < U \le 1. \]
The speedup and resource utilization may similarly be defined for the heterogeneous processor
configurations.
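
As a worked illustration of the homogeneous-system definitions above, the following sketch evaluates speedup (TCE over TBO) and resource utilization (TCE over the product of TBO and R). The function names and the sample numbers are assumptions made for the example, not values from the report.

```python
def speedup(tce, tbo):
    """Speedup for a homogeneous system: TCE / TBO."""
    return tce / tbo


def utilization(tce, tbo, resources):
    """Resource utilization U = TCE / (TBO * R).

    Rejects TBO values below the resource-imposed bound TCE / R,
    since those would give U > 1.
    """
    if tbo * resources < tce:
        raise ValueError("TBO is below the resource bound TCE / R")
    return tce / (tbo * resources)


# Hypothetical graph: TCE = 12 time units, TBO = 4, R = 4 resources.
print(speedup(12, 4))         # 3.0
print(utilization(12, 4, 4))  # 0.75
```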
2.3 Control Edges
A control edge is an AMG edge which imposes an artificial data dependency between two
AMG nodes [6]. The control edges are used either to alter node schedules to eliminate needless
concurrency or to improve resource utilization.
3. Simulator Implementation Issues
3.1 Target Hardware Architecture
The generic heterogeneous architecture considered is displayed in Figure 4. This generic
heterogeneous architecture consists of a number of processor groups that in turn are composed of
a number of resources or functional units (FU), which are the actual processing units, and a
number of local networks. Although the functional units and the local networks within each
processor group are assumed to be homogeneous, the different processor groups are not required
to have similar characteristics. In other words, a heterogeneous system is realized by groups of
processors with different characteristics that communicate with each other over the global
network.
The Advanced Development Model (ADM) [7, 8] and the Generic VHSIC Spaceborne
Computer (GVSC) [8] are typical architectures which have been the primary targets of ATAMM
implementations. These systems consist of four identical MIL-STD-1750A functional units that
communicate over a Parallel Interprocessor bus (PI-bus), as shown in Figure 5, and a MIL-STD-
1553B communication module that is also connected to the PI-bus and serves as the front-end of
the system. The 1553B is essentially a 1750A with less memory and the 1553B interface. These
are examples of heterogeneous systems with two groups of processors where one group has four
resources (four 1750As) while the other has only one resource (one 1553B) and they
communicate over a global network, the PI-bus. However, previous ATAMM implementations
on these hardware systems modeled only the behavior of the homogeneous set [9] of 1750A
processors. The new simulator described herein could support the modeling of the more general
heterogeneous architecture.
Figure 4. Architecture modeled by the Heterogeneous Simulator.
(Diagram labels: Algorithm Graphs; Processor Group 1, Processor Group 2, ..., Processor Group n;
Global Network; and, within Processor Group i, Local Network, Gateway, Graph Manager, and
Available List.)
Figure 5. Layout of the ADM and GVSC systems.
(Diagram labels: four 1750A units and one 1553B on the PI-bus; a MicroVAX or RS 6000 for
compile, download, and debug; a communication link; and an IBM PC-AT for control, data in/out,
graph status, fault insertion, and display of results.)
3.2 Implementing ATAMM
Systems implementing the ATAMM consist of four logical components: the graph
manager, the global memory, a set of functional units, and the communication bus [9]. The graph
manager is responsible for ensuring that the overall system operates according to the ATAMM
rules. The functional unit is the logical component that executes all three node marked graph
(NMG) transitions of each algorithm operation. When a read transition of the CMG is
enabled, the graph manager assigns a functional unit from the list of available functional units to
execute the corresponding algorithm node. If there are additional enabled nodes, the graph
manager assigns them, according to priority, to the subsequent resources in the available list. The
graph manager updates the marking of the CMG using status information reported by the
functional units. The input and output data corresponding to each AMG node are stored in the
global memory. In the context of ATAMM, the memory is considered to be logically global to all
functional units. However, in a real system, the global memory may be either centralized or
distributed. The functional unit communicates with the graph manager to update the status of the
CMG, and with the global memory to read and write data. The communications between the
graph manager, the global memory, and functional units are asynchronous and are carried out by
means of a communication bus. To synchronize movement of tokens in the CMG and to arbitrate
among different functional units, it is assumed that only one functional unit communicates with
the graph manager at any one time. This is accomplished by the means of a semaphore.
Therefore, the functional unit that possesses the semaphore has control of the communication bus
and can communicate with the graph manager and update the status of the CMG. In this regard,
the communication bus and semaphore are often used interchangeably.
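
The assignment step performed by the graph manager (enabled nodes are handed, in priority order, to functional units taken from the available list) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `assign_enabled_nodes`, the `(priority, node)` tuples, and the convention that a smaller number means higher priority are all hypothetical.

```python
from collections import deque


def assign_enabled_nodes(enabled, available):
    """Assign enabled nodes to functional units from a FIFO available list.

    enabled:   list of (priority, node_id); smaller value = higher priority
               (an assumed convention)
    available: deque of functional-unit IDs, oldest entry first
    Returns a dict mapping node_id -> functional-unit ID.
    """
    assignments = {}
    for _priority, node in sorted(enabled):
        if not available:
            break  # remaining nodes wait until a functional unit is free
        assignments[node] = available.popleft()
    return assignments


free_units = deque([4, 1])  # FU 4 has been idle the longest
result = assign_enabled_nodes([(2, "B"), (1, "A"), (3, "C")], free_units)
print(result)  # {'A': 4, 'B': 1}
```

Node C stays unassigned in this example because the available list is exhausted after two assignments.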
Thus far, ATAMM implementations have only considered systems with a single
semaphore and a single communication bus. One of the purposes of this Simulator is to explore
systems with multiple semaphores. In order to ensure that all functional units have an identical
copy of the graph data structure, a functional unit grabs the semaphore before changing the graph
data structure. In a distributed system, the updated graph data structure is transmitted to all
functional units by a broadcast, and only then does the functional unit release the semaphore for
other communications.
The graph manager and global memory may be distributed among all the functional units.
This distribution of activities has the advantage of increasing the number of functional units in the
system and at the same time improving the potential for achieving a higher degree of fault
tolerance to processor failure. Also, a distributed global memory eliminates the need for shared
memory among functional units.
The integration of the graph manager with the operating system constitutes the ATAMM
Multicomputer Operating System (AMOS). The resource list, global memory, and the algorithm
marked graph provide the necessary support to AMOS. An AMOS-controlled architecture
consisting of personal computers has been developed and tested to validate the ATAMM rules
[10, 11]. In this testbed, a centralized graph manager and centralized global memory are utilized.
Other testbeds with increased functionality, the ADM and the GVSC, utilize a distributed graph
manager and distributed global memory.
3.3 Generic State Diagram of the AMOS
The generic state diagram of the AMOS is shown in Figure 6. The AMOS is composed of
six states: Idle, Reading, Processing, Writing, Grab-Semaphore, and Graph-Manager. Other
implementations of ATAMM have included other states such as Testing [7, 8]. Initially, all
functional units start in the Idle State. A functional unit remains in this state until either its
identification number (ID) appears at the top of the resource list, which is a First-In-First-Out list
of available functional units, or it receives a message indicating that a node has been assigned to it
by another functional unit acting as graph manager. When idle with its ID at the top of the
resource list, the functional unit monitors the status of the CMG until a read transition of an
algorithm node becomes enabled. Once an enabled read node is identified, the functional unit
attempts to acquire the semaphore which makes it the active graph manager of the system. It then
assigns a node to itself, consumes one token from each input edge of the algorithm node, updates
the CMG marking, and removes itself from the available list.
Before progressing to the next state, Reading, the functional unit examines the algorithm
graph and assigns other enabled nodes to the subsequent functional units in the available list. It
notifies other functional units via fire-messages, updates the CMG accordingly, broadcasts the
updated graph data structure, and then releases the semaphore. This broadcast is termed a "Fire"
broadcast. Assigning other enabled nodes to idle functional units while holding the semaphore, is
an enhancement to the GVSC AMOS that reduces the communication overhead.
Figure 6. AMOS state diagram.
(Diagram labels: Idle, Reading, Writing, Graph Manager; Fire Message, Enabled Node, Fire or
Data, Completed, Grab Successful.)
The "Fire" broadcast contains the updated version of the CMG, the updated resource list,
and the ID of the functional units processing the AMG nodes. This broadcast, as well as the other
broadcast discussed next, provides the status information necessary for the graph manager to
maintain the status of the CMG. When the graph manager is distributed, this communication is
especially important to ensure that all individual graph managers contain the same CMG marking.
Upon detecting a fire message in the Idle State, the functional unit transits to the Reading
State where it reads the input data in preparation for node execution. The functional unit then
migrates to the Processing State where it performs the task represented by the algorithm node.
The functional unit remains in the Processing State until the node operation is complete. Then,
the functional unit attempts to undergo another state transition to the Writing State by grabbing
the semaphore. In the Writing State it updates the CMG, writes the output data, and broadcasts
the updated information to other functional units. This broadcast, termed a "Data" broadcast,
provides the updated CMG and the output data of the node to the other functional units. The
functional unit then goes to the Graph-Manager State. Now that the functional unit holds the
semaphore and is the active graph manager, it attempts to fire as many nodes as possible prior to
releasing the semaphore. Since the operation of the system is asynchronous, the graph manager
must generally be interrupt driven.
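
The six AMOS states named in this section can be captured as a small table-driven state machine. The sketch below is inferred from the prose only: the event names (`fire_message`, `enabled_node`, `grab_to_fire`, and so on) are hypothetical, and the return from the Graph-Manager State to Reading or Idle is an assumption, since the exact edges of the state diagram are not spelled out in the text.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    GRAB_SEMAPHORE = auto()
    GRAPH_MANAGER = auto()
    READING = auto()
    PROCESSING = auto()
    WRITING = auto()


# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (State.IDLE, "fire_message"): State.READING,
    (State.IDLE, "enabled_node"): State.GRAB_SEMAPHORE,
    (State.GRAB_SEMAPHORE, "grab_to_fire"): State.GRAPH_MANAGER,
    (State.GRAB_SEMAPHORE, "grab_to_write"): State.WRITING,
    (State.GRAPH_MANAGER, "self_assigned"): State.READING,    # assumption
    (State.GRAPH_MANAGER, "nothing_assigned"): State.IDLE,    # assumption
    (State.READING, "data_read"): State.PROCESSING,
    (State.PROCESSING, "completed"): State.GRAB_SEMAPHORE,
    (State.WRITING, "data_broadcast"): State.GRAPH_MANAGER,
}


def step(state, event):
    """Return the next state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = State.IDLE
for event in ["fire_message", "data_read", "completed",
              "grab_to_write", "data_broadcast", "nothing_assigned"]:
    state = step(state, event)
print(state.name)  # IDLE
```

The example walks one functional unit through a full node execution that begins with a fire message and ends back in the Idle State.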
The CMG and resource list in the global memory of a functional unit can be updated while
in any state by "Fire" or "Data" broadcasts from other functional units. The "Fire" and "Data"
broadcasts not only provide the communication necessary for the integrity of overall system
operation, but also the means to analyze the system performance. By labeling, time tagging, and
storing information about each broadcast, such as the event (Fire and Data), the node number,
and functional unit ID, the token movement within the CMG, as well as functional unit activity
can be reconstructed. Other measurements such as TBIO, TBO, and functional unit utilization and
concurrency may also be extracted.
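
A hedged sketch of how such measurements might be extracted from a time-tagged broadcast log is given below. The record layout (`time`, `event`, `node`, `fu`) follows the fields named in the paragraph above, but the `Broadcast` type, the function names, and the sample log are hypothetical; the actual file format is not assumed here.

```python
from typing import NamedTuple


class Broadcast(NamedTuple):
    time: int
    event: str  # "Fire" or "Data"
    node: str
    fu: int


def tbo_samples(log, output_node):
    """Intervals between successive Data broadcasts of the output node."""
    times = sorted(b.time for b in log
                   if b.event == "Data" and b.node == output_node)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def busy_time(log):
    """Total Fire-to-Data time per functional unit."""
    started, busy = {}, {}
    for b in sorted(log):  # NamedTuples sort by time first
        key = (b.fu, b.node)
        if b.event == "Fire":
            started[key] = b.time
        elif b.event == "Data" and key in started:
            busy[b.fu] = busy.get(b.fu, 0) + b.time - started.pop(key)
    return busy


log = [
    Broadcast(0, "Fire", "A", 1), Broadcast(4, "Data", "A", 1),
    Broadcast(4, "Fire", "B", 2), Broadcast(6, "Fire", "A", 1),
    Broadcast(9, "Data", "B", 2), Broadcast(10, "Data", "A", 1),
    Broadcast(10, "Fire", "B", 2), Broadcast(15, "Data", "B", 2),
]
print(tbo_samples(log, "B"))  # [6]
print(busy_time(log))         # {1: 8, 2: 10}
```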
3.4 Event-Driven
The previous ATAMM simulator [9] was clock-driven in the sense that the system-clock
of the simulator was incremented by one tick at a time. Simulation of algorithm graphs proved to
be slow and time consuming. To speed up the simulation process, the system-clock of the
Simulator must be incremented by more than one tick without violating the timing constraints of
the system. Since the Simulator has the full knowledge of the overall system, it can determine the
exact time of occurrence of the next event and thus increment the system-clock accordingly. In
this regard, the Simulator defined herein is event-driven. Since, in general, the next event will
take place after an interval of one or more system-clock ticks, the event-driven
Simulator is expected to be considerably faster.
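
The difference between the two clocking schemes can be illustrated with a minimal event loop. This is a generic sketch using a priority queue, not the Simulator's actual scheduling code; the function names and the sample events are assumptions.

```python
import heapq


def run_event_driven(events):
    """Process (time, name) events in time order.

    The clock jumps directly to the time of the next event. Returns the
    processing order and the number of clock updates that were needed.
    """
    queue = list(events)
    heapq.heapify(queue)
    clock, updates, order = 0, 0, []
    while queue:
        time, name = heapq.heappop(queue)
        if time != clock:
            clock = time  # advance by as many ticks as required in one step
            updates += 1
        order.append((clock, name))
    return order, updates


def clock_driven_ticks(events):
    """Ticks a one-tick-at-a-time simulator needs to reach the last event."""
    return max(time for time, _name in events)


events = [(40, "Data A"), (3, "Fire A"), (40, "Fire B"), (95, "Data B")]
order, updates = run_event_driven(events)
print(updates, clock_driven_ticks(events))  # 3 95
```

For these four events the event-driven loop updates the clock three times, whereas a clock-driven loop would step through 95 individual ticks.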
3.5 Simulation of Graphs with Variable Node Latencies
The previous ATAMM simulator [9] simulated graphs with fixed node latencies. Since
algorithm graphs representing real applications may not have fixed latencies, it is desirable to be
able to simulate graphs with variable node latencies. This is accomplished by representing the
timing latency of AMG nodes by statistical functions. The Simulator then determines the actual
latency of an AMG node during the simulation process, for every input data packet, by executing
the appropriate statistical function representing the AMG node. When the AMG nodes have
variable latencies and upon multiple instantiations of nodes, it is possible that the data packets
produced by the nodes may arrive out of order. To enforce firing of AMG nodes at the proper
time with the appropriate data packet, the data packets are tagged to guarantee correctness of the
CMG marking.
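
One possible way to use packet tags to restore ordering is a reorder buffer that releases packets strictly in tag order. The report does not describe the Simulator's mechanism, so the sketch below is an assumption offered only to make the idea concrete; the class name `OrderedEdge` and its methods are hypothetical.

```python
class OrderedEdge:
    """Holds tagged packets and releases them in increasing tag order."""

    def __init__(self):
        self.next_tag = 0   # tag of the next packet allowed through
        self.pending = {}   # tag -> packet, for packets that arrived early

    def deposit(self, tag, packet):
        """Store a packet; return the packets that can now be released."""
        self.pending[tag] = packet
        released = []
        while self.next_tag in self.pending:
            released.append(self.pending.pop(self.next_tag))
            self.next_tag += 1
        return released


edge = OrderedEdge()
first = edge.deposit(1, "packet-1")   # arrives early, must wait
second = edge.deposit(0, "packet-0")  # unblocks both packets
print(first, second)  # [] ['packet-0', 'packet-1']
```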
Specific statistical functions are included in the Simulator and additional functions may be
inserted. The Delta function represents the fixed node latency and is assumed to be a positive
value. Using the Delta function, the Simulator defaults to the fixed node latency case. The
Uniform Distribution function requires a lower bound and an upper bound. The Gaussian
function requires a mean and a standard deviation. The Discrete function requires an input file
where the discrete values for each input data packet are stored. The Exponential function
requires a mean value.
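
The five latency functions listed above can be sketched with Python's standard `random` module. This is an illustration under stated assumptions: the factory names are hypothetical, and the Discrete function reads from an in-memory list here as a stand-in for the input file the report describes.

```python
import random


def delta(value):
    """Fixed latency; the value must be positive."""
    if value <= 0:
        raise ValueError("Delta latency must be positive")
    return lambda rng: value


def uniform(low, high):
    return lambda rng: rng.uniform(low, high)


def gaussian(mean, std_dev):
    # Note: a Gaussian sample can be negative; a real model would need
    # a policy for that case (not specified in the report).
    return lambda rng: rng.gauss(mean, std_dev)


def exponential(mean):
    # random.expovariate takes the rate, which is 1 / mean.
    return lambda rng: rng.expovariate(1.0 / mean)


def discrete(values):
    """One stored latency per input data packet (list stands in for a file)."""
    iterator = iter(values)
    return lambda rng: next(iterator)


rng = random.Random(0)  # seeded for repeatable runs
node_latency = discrete([3, 7])
samples = [node_latency(rng), node_latency(rng)]
print(delta(5)(rng), samples)  # 5 [3, 7]
```

Each factory returns a function that is called once per input data packet, which mirrors the per-packet latency determination described in Section 3.5.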
3.6 Simulation of Graphs with Static Node to Processor Assignments
The previous implementations of the ATAMM targeted homogeneous architectures [3, 7,
8] where all nodes of the algorithm graphs are mapped to and executed on all identical functional
units of a system. However, it may not be practical or necessary to always have a fully redundant
system. In some real systems, memory constraint is a limiting factor. In other systems, functional
units may have different characteristics from one processor group to another. By partitioning the
algorithm graph into groups of nodes and assigning each group to a different processor group, the
same performance as the fully redundant system (a single processor group) may be achieved. A
proper partitioning of the graph can minimize interprocessor communication overhead and
increase throughput.
Analysis of the AMG reveals that it is often possible to group some of the nodes into
separate sets and statically preassign each set to different processor groups to get equivalent
performance. In the static assignment of nodes to processor groups, execution of the sets is
assumed to be confined to the functional units to which they are assigned. However, in a fully
redundant system where all nodes are assigned to all functional units of a single processor group,
these sets may appear as patterns that migrate from processor to processor.
To accommodate for the static assignment of nodes to processor groups, this Simulator is
designed so that each processor group is independent of other groups. The assigned nodes are
encapsulated within each processor group and are internally managed by the group.
3.7 Simulation of Multiple Graphs
While simulating multiple independent graphs, it is often necessary to phase the graphs
with respect to one another and to simulate them in a predefined sequence. The phasing and
sequencing of algorithm graphs requires certain dependencies among them. These dependencies
are imposed by the introduction of control edges that connect the sources of different graphs
together. However, due to the nature of the phasing and sequencing problems, these control
edges must be dealt with separately in the Simulator. To handle these control edges, the
Simulator starts the phasing process of a source as soon as an input control edge becomes active.
This corresponds to performing an OR operation on the control edges. The Simulator then fires
the source after the specified delay interval.
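
The OR behavior described above (phasing starts as soon as any input control edge becomes active, and the source fires after the specified delay) amounts to a short calculation. The function name and the use of `None` for a control edge that never becomes active are assumptions made for this sketch.

```python
def source_fire_time(activation_times, delay):
    """Time at which a phased source fires.

    activation_times: time at which each input control edge becomes
                      active, or None if it never does
    delay:            phasing delay interval for this source
    Returns None if no control edge ever becomes active.
    """
    active = [t for t in activation_times if t is not None]
    if not active:
        return None
    # OR of the control edges: the earliest activation starts the phasing.
    return min(active) + delay


print(source_fire_time([12, None, 7], delay=5))  # 12
```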
3.8 Graph Entry, Simulator, Analysis, and AIE Tools
The relationship of the Simulator with the other ATAMM tools is shown in Figure 7. As
shown in the figure, the input to the Simulator is a graph (GRF) file; graph files have ".grf"
extensions. The Simulator output is a Fire/Data/Time (FDT) file; FDT files have ".fdt"
extensions. The GRF file contains the algorithm marked graph and the setup information about
the Simulator, e.g., the number of groups of processors and number of functional units in each
group type. The FDT file is a collection of time-tagged events which provide a means of
evaluating the results of algorithm graph execution. Basic information in the FDT file includes the
time of occurrence of each event, name of the event, node identifier, node color, and functional
unit identifier. The formats of the GRF and FDT files are discussed in Section 4.10. The GRF file
is the output of the Graph Entry software tool developed to draw a graph and define attributes of
the nodes and the edges. The FDT file serves as the input to the Analysis Tool [12] which
graphically displays algorithm and resource activities and provides automatic and user-interactive
performance assessment. To smooth out the transition of the Graph Entry output into the
Simulator and the Simulator output into the Analysis Tool, an ATAMM Integrated Environment
(AIE) was proposed to integrate these ATAMM tools.
Figure 7. Flow of information among the ATAMM tools.
4. Simulator Design and Development
The development of the Simulator is presented in this section. This Simulator allows the
study of the behavior of algorithms in heterogeneous dataflow architectures operating in real-time
based on ATAMM. The Simulator permits an architecture-independent study of behavior and
performance of a system prior to the availability of a hardware prototype.
4.1 Object-Oriented Programming
Object-oriented programming lends itself to modeling different parts of a complex entity
and the relationship among its parts. The objects can be defined and developed separately to
ensure privacy of data, reusability, and readability. This also makes maintenance and debugging
more manageable and systematic. Further discussions of OOP are provided in Appendix F and in
Reference [9].
4.2 Programming Environment and Language
The implementation of the Simulator requires a powerful programming language and
software environment. The Simulator is written in the C++ programming language. The main
reasons are: 1) it is an object-oriented language with multiple inheritance and thus is a good
system programming language; 2) it provides good data structures, control flow primitives, and a
rich set of operators; and 3) it is compatible with Microsoft Windows. The Simulator is
developed in the Microsoft Windows environment because of its object-oriented programming
capabilities including message passing and a vast library of graphics routines, especially the
windowing capabilities. Other Microsoft Windows environment features include the capability to
run more than one application in parallel, permitting the user to run more than one instance of the
Simulator at the same time. This provides a means to run and compare two or more
simulations simultaneously. As another example, the Simulator, the Graph Entry, and the
Analysis tools can be running concurrently, allowing an easier transition between them.
(Microsoft Windows is a trademark of Microsoft Corporation.)
The objects are defined and developed separately to ensure privacy of data, reusability,
and readability. This makes maintenance and debugging more manageable and systematic [9].
Every object that directly interacts with the user has its own independent window which allows
the display of different windows to be viewed at the same time.
4.3 Objects and their Relationships
The main logical components or objects of the Simulator are, in part, a result of the
ATAMM. Since the ATAMM is a set of rules by which an algorithm graph can be mapped to an
architecture, the three main classes of objects are Graph-Manager, Graph, and Processor-Group.
The Processor-Group object consists of a set of functional units and, hence, the FU-List object
and the Functional Unit object (within the FU-List object) are introduced. Any system has some
means of communication among its components; thus the Network object evolved. A
management mechanism for arbitration among these objects is provided by the Simulator-Kernel
object [9]. Interconnection among these and other entities is portrayed in Figure 8.
4.4 Simulator-Kernel
The Simulator-Kernel provides, manages, and simulates the multitasking environment
where the functional units can operate without conflict. This object is the operating system for
the Simulator and the heart of this software. The arbitration among different objects is enforced
in a non-preemptive manner, where every object is given enough time to accomplish its task. This
is easily realized by employing object-oriented programming methodology [9].
Figure 8. Interconnection of objects.
The Simulator-Kernel object has a number of child objects including Processor-Group,
Network, Algorithm Graph, and a System-Clock object. The Simulator-Kernel passes full control
over the system to a constituent, specifically to a Processor-Group or to a Network object, and by
doing so, suspends itself. Upon completion of its task, the target object returns control back to the
Simulator-Kernel along with the anticipated time of the next event in that object. Transfer of
control is accomplished through the message passing capability of object-oriented programming.
Upon execution of all objects, the Simulator-Kernel updates the System-Clock appropriately to
indicate the time of occurrence of the next event in the entire system. Since the Simulator has the
full knowledge of the system, it is aware of the timing and nature of the next event. If, however,
the time of occurrence of the next event is beyond the upper bound of all events, the Simulator
stops the simulation process and provides an error message with indications of the probable
causes. This process continues for all objects, in an orderly fashion, until simulation of the graph
is complete.
The order in which the objects are invoked is as follows. First, the Processor-Group
object, described in the following section, is invoked. It then passes control to the Graph
Manager and, subsequently, to the Functional Units via the FU-Lists object. Second, the
Network object is invoked to carry out its communication task. The Network object, described in
Section 4.7, in turn, passes control to its child objects. The Processor-Group and the Network
objects have the same behavior as the Simulator-Kernel toward their constituents. Finally, the
System-Clock is appropriately updated. The hierarchy of passing control to the lowest level
objects, child objects, is also portrayed in Figure 8.
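The control-passing cycle described in this section can be sketched in a few lines. The sketch below is written in Python for brevity and is not the Simulator's C++ code; the names Component, Kernel, step, and upper_bound, as well as the event times, are hypothetical and chosen only for illustration. It shows each child running to completion, reporting the time of its next event, and the kernel advancing the clock directly to the earliest reported time.

```python
class Component:
    """Hypothetical stand-in for a Processor-Group or Network object."""

    def __init__(self, name, event_times):
        self.name = name
        # Times at which this component has work to do (illustrative data).
        self.pending = sorted(event_times)

    def step(self, now):
        """Handle every event due at `now`; return the next event time or None."""
        while self.pending and self.pending[0] <= now:
            self.pending.pop(0)
        return self.pending[0] if self.pending else None


class Kernel:
    """Non-preemptive kernel: invoke each child in order, then advance the clock."""

    def __init__(self, children, upper_bound):
        self.children = children
        self.upper_bound = upper_bound  # assumed bound on any event time
        self.clock = 0

    def run(self):
        visited = [self.clock]
        while True:
            # Each child runs to completion and reports its next event time.
            reported = [child.step(self.clock) for child in self.children]
            reported = [t for t in reported if t is not None]
            if not reported:
                return visited  # no events remain: the simulation is complete
            next_event = min(reported)
            if next_event > self.upper_bound:
                raise RuntimeError("next event lies beyond the event upper bound")
            self.clock = next_event  # jump straight to the next event
            visited.append(self.clock)


kernel = Kernel([Component("Processor-Group", [0, 5, 12]),
                 Component("Network", [3, 12])], upper_bound=100)
clock_values = kernel.run()  # [0, 3, 5, 12]
```

Because the clock jumps from one event time to the next, idle intervals cost nothing to simulate, which is the usual motivation for an event-driven design.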
Thus far, the functionality of the Simulator-Kernel from an internal information viewpoint
was described. Another functional aspect of this object is its central role with respect to user
interactions. The Simulator-Kernel object and all other objects that require user interactions have
their own independent windows through which information may be passed and displayed. For
these objects, the terms object and window are used interchangeably.
Figure 9. Simulator-Kernel.
For user interactions, the Simulator-Kernel provides a set of push buttons in its window,
Figure 9. Some of these push buttons contain a sublayer of selections. The top level selections
are for informative purposes while the sublayer selections perform an operation. For example, the
second layer of the "Processors" and "Networks" push buttons are the "+" and "-" push buttons
that allow the user to increase and decrease the number of these objects, respectively. The speed
of the simulation can be adjusted through the "Speed" button to turbo, fast, medium, or slow.
The "Open..." button allows the user to open a GRF file and to load the algorithm marked graphs
for simulation. The "Discard..." button lets the user specify the number of initial data packets that
are to be discarded. The number of discarded data packets corresponds to the data packets prior
to reaching the steady state. This number is important in calculating the TBO, TBIO, ensemble
TBO, and ensemble TBIO points where the ensemble values are defined as the average values.
The "Duration..." button defines the duration of the simulation process as a number of data
packets. The "TBO/TBIO" and "Ensemble" toggle keys let the user set up the
simulator for calculating the TBO and TBIO points or the ensemble TBO and ensemble TBIO
points. The "Run" and "Stop" toggle buttons allow the user to initiate and terminate the
simulation process. When calculating "TBO/TBIO" points, the Simulator prompts for an output
FDT file name. When calculating "Ensemble" TBO and TBIO points, the Simulator prompts for
the number of ensemble points desired. The "Pause" and "Resume" toggle buttons pause and
resume the simulation process, respectively. All windows have a help option where the "Help"
buttons invoke the appropriate help files for specific guidance concerning window functions. The
"About" button invokes the signature and displays the general information about the Simulator.
This Simulator only operates in the simplex mode.
The Simulator keeps track of clock ticks, number of events, and number of data packets
into and out of the graph. It also reports the current status of these activities for user's
information upon receiving control of the system via the "System" window. The speed of
simulation may be adjusted to turbo, fast, medium, or slow at any time. This provision is
provided for animation purposes where the simulation of the graph is carried out at the desired
pace. Since this window is the heart of this software, existence of other windows depends on its
existence, i.e., closing this window results in termination of the Simulator.
4.5 Algorithm Graphs
The Algorithm Graph object of the Simulator is a set of objects that are connected
together by a set of linked lists. The objects that constitute the Algorithm Graph object are the
nodes and the edges. The node object has three variations, representing the nodes, the sources,
or the sinks of the algorithm graphs. The edge object has two variations, representing the data or
the control edges of the algorithm graphs. These objects and their interrelations represent the
algorithm marked graphs. The input algorithm marked graph files provided by the Graph Entry
tool, discussed in Section 4.10, convey the necessary information about these objects.
When loading an input file, the Algorithm Graph object scans the input file and upon
detecting a node or an edge, creates a new instance of the appropriate object and sends a message
to the object to read its own data and initialize itself. The Algorithm Graph object then inserts the
object into the proper linked lists. The linked lists that represent the algorithm graph are a linked
list of node and a linked list of edge objects. Each node object has, in turn, two linked lists of
edge objects, one for the input edges and the other for the output edges. Each edge object has
two linked lists of edge objects, one for the output edges of its initial node and the other for the
input edges of its terminal node. The source object has two additional linked lists of edges, one
for the input control edges and the other for the output control edges of the source. These
control edges that connect the sources of the algorithm graphs together are for phasing and
sequencing purposes and require special treatment by the Simulator. The data structure of the
algorithm graphs, as portrayed by the Algorithm Graph object, is depicted in Figure 10.
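The linked-list organization described above can be illustrated with a small sketch. It is written in Python for brevity and is not the Simulator's code. The link names follow the fields visible in Figure 10 (Next-Node, Input-Edge, Output-Edge, Next-Edge, Next-Input-Edge, Next-Output-Edge), while the class names, the add_node and add_edge methods, the name lookup table, and the three-node example graph are assumptions made for illustration.

```python
class Edge:
    def __init__(self, name, initial, terminal):
        self.name = name
        self.initial = initial        # node that writes to this edge
        self.terminal = terminal      # node that reads from this edge
        self.next_edge = None         # next edge in the graph-wide edge list
        self.next_output_edge = None  # next output edge of the same initial node
        self.next_input_edge = None   # next input edge of the same terminal node


class Node:
    def __init__(self, name):
        self.name = name
        self.next_node = None    # next node in the graph-wide node list
        self.input_edge = None   # head of this node's input-edge list
        self.output_edge = None  # head of this node's output-edge list


class AlgorithmGraph:
    def __init__(self):
        self.first_node = None
        self.first_edge = None
        self.nodes = {}  # name lookup: a convenience, not part of the report

    def add_node(self, name):
        node = Node(name)
        node.next_node, self.first_node = self.first_node, node
        self.nodes[name] = node
        return node

    def add_edge(self, name, initial_name, terminal_name):
        initial, terminal = self.nodes[initial_name], self.nodes[terminal_name]
        edge = Edge(name, initial, terminal)
        # Push the new edge onto the front of the three lists it belongs to.
        edge.next_edge, self.first_edge = self.first_edge, edge
        edge.next_output_edge, initial.output_edge = initial.output_edge, edge
        edge.next_input_edge, terminal.input_edge = terminal.input_edge, edge
        return edge


def walk(head, link):
    """Collect the names along a singly linked list, following attribute `link`."""
    names = []
    while head is not None:
        names.append(head.name)
        head = getattr(head, link)
    return names


graph = AlgorithmGraph()
for node_name in ("N1", "N2", "N3"):
    graph.add_node(node_name)
graph.add_edge("E1", "N1", "N2")
graph.add_edge("E2", "N1", "N3")
graph.add_edge("E3", "N2", "N3")

outputs_of_n1 = walk(graph.nodes["N1"].output_edge, "next_output_edge")  # ['E2', 'E1']
inputs_of_n3 = walk(graph.nodes["N3"].input_edge, "next_input_edge")     # ['E3', 'E2']
```

Threading each edge through several lists at once lets the input edges of a node, the output edges of a node, or all edges of the graph be traversed without searching.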
4.6 Processor-Group
To model and simulate a heterogeneous architecture, the Processor-Group object is
designed to represent a generic system where different attributes of the system can be tailored to
match a particular architecture. Since every Processor-Group object represents a homogeneous
system, two or more of these objects characterize a heterogeneous system. In a heterogeneous
system, different Processor-Groups may have different characteristics, e.g., number of functional
units, test time, and speeds; but all functional units within a Processor-Group object share similar
characteristics. The Functional Unit object is designed so that it can undertake any or all tasks
represented by the input AMG. In this regard, the sources, the sinks, and the nodes of the AMG
are treated equally. In this Simulator, the number of Processor-Groups, Functional Units, and
Networks is limited only by the availability of memory.
The objects that constitute the Processor-Group object and their relationships are
portrayed in Figure 8. The Processor-Group object treats its constituents in the same manner as
its parent object, the Simulator-Kernel. The Processor-Group passes control to the Graph-
Manager object which, in turn, passes control to the Functional Units (within FU lists) to carry
out the execution of the AMG nodes assigned to this Processor-Group.
Figure 10. Portion of an example graph and its data structure.
(Node records N1 and N3 carry Next-Node, Input-Edge, and Output-Edge links; edge records E1
through E4 carry Next-Edge, Next-Input-Edge, and Next-Output-Edge links.)
Through Processor-Group's window, the number of Functional Units can be specified to
match a particular architecture such as that shown in Figure 11. The submenu of the "FU" menu
selection increases or decreases the number of Functional Units by selecting "+" or "-",
respectively. The upper bound on the number of Functional Units within a Processor-Group
object can be specified via the "FU Limit" push button. The upper bound of the number of
Functional Units is the maximum number of resources during the simulation process. If the total
number of Functional Units in a Processor-Group object is less than the upper bound, the
Simulator creates, during the run time, as many Functional Units as necessary to carry out its
operation without violating the upper bound restriction. The relative speed of a Processor-Group
object compared to other Processor-Groups can be specified by the "Speed" submenu. The
relative speed of a Processor-Group object can be decreased by "+" and increased by "-". The
"Help" button invokes the appropriate help file where specific guidance for the Processor-Group
window is provided. A push button is provided for a future selectable "Test Time" to simulate
self-testing by the Functional Units. However, the specific use of self-test is not yet implemented.
Figure 11. Processor Group.
4.6.1 Graph-Manager
The graph manager is responsible for ensuring that the overall system operates according
to the ATAMM rules. The Graph-Manager object, representing the graph manager of ATAMM,
updates and monitors the status of the CMG. When a read transition of this graph is enabled, the
Graph-Manager assigns a Functional Unit from the list of available Functional Units to perform
the corresponding algorithm node according to priority if more than one node is enabled. If there
are additional enabled nodes, the Graph-Manager assigns them to the subsequent Functional Units
in the available list. The Graph-Manager updates the marking of the CMG using status
information reported by the Functional Units.
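The assignment step can be sketched as follows, in Python for brevity. The enabling test uses the rule stated in Section 4.6.3 (every input edge holds a token with the appropriate tag and every output edge has buffer room); reading "empty buffer" as "at least one free slot" is an interpretation. The class and function names, the convention that a smaller priority value means higher priority, and the example data are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    capacity: int                               # buffer size of the edge
    tokens: list = field(default_factory=list)  # tags of the tokens now queued


@dataclass
class ReadNode:
    name: str
    priority: int  # smaller value = higher priority (an assumption)
    inputs: list
    outputs: list

    def enabled(self, tag):
        """Every input edge holds a token with `tag`; every output edge has room."""
        has_data = all(tag in edge.tokens for edge in self.inputs)
        has_room = all(len(edge.tokens) < edge.capacity for edge in self.outputs)
        return has_data and has_room


def assign(nodes, idle_fus, tag):
    """Pair enabled nodes, highest priority first, with idle units in list order."""
    ready = sorted((n for n in nodes if n.enabled(tag)), key=lambda n: n.priority)
    return [(node.name, fu) for node, fu in zip(ready, idle_fus)]


full_edge = Edge(capacity=1, tokens=[0])  # output buffer already occupied
n1 = ReadNode("N1", 2, [Edge(1, [1])], [Edge(1)])
n2 = ReadNode("N2", 1, [Edge(1, [1])], [Edge(1)])
n3 = ReadNode("N3", 0, [Edge(1, [1])], [full_edge])  # blocked by its output edge

assignments = assign([n1, n2, n3], ["FU1", "FU2", "FU3"], tag=1)
# [('N2', 'FU1'), ('N1', 'FU2')]
```

N3 has the best priority value in this example but is not enabled, so the first idle unit goes to N2 and the next to N1.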
Since the Graph-Manager object is part of the Processor-Group object, it only keeps track
of the AMG nodes that are assigned to the Processor-Group object by a linked list of source,
node, and sink objects. Although the source and the sink objects have a lot in common with the
node objects, they also have some differences. For instance, the source objects must deal with the
special source control edges and the sink objects must keep track of the output data packets.
Therefore, the source and the sink objects are stored in separate linked lists from the node objects
to keep their operations separate and to speed up the simulation process. The data structure of
the algorithm graphs, as portrayed by the Graph-Manager object, is depicted in Figure 12.
Upon updating the CMG, if necessary, the Graph-Manager broadcasts the updated
information to other Graph-Managers. The necessity of broadcasting part or all of the updated
CMG depends on the partitioning of the nodes of the AMG. If dependencies exist among the
AMG nodes of the Graph-Managers or if an AMG node is assigned to multiple Graph-Managers,
then whenever one Graph-Manager is updated, part or all of the updated information ought to be
shared with other Graph-Managers. Since the Graph-Manager object has knowledge of the
system, it is also responsible for creating Functional Units at run time, based on need without
violating the upper bound limitation of the Processor-Group object.
The Graph-Manager object displays information about the graph and the status of the
Functional Units in the Processor-Group object. This information mainly consists of the count
and names of the sources, nodes, and sinks that are assigned to the Processor-Group object, and
the content of the idle and busy Functional Unit lists of the Processor-Group object.
4.6.2 Functional Units
The Functional Unit object is designed to carry out the tasks represented by the AMG
nodes. The Functional Unit object, therefore, does not distinguish between the sources, the sinks,
and the nodes of the AMG. To carry out execution of an AMG node of any kind, the Functional
Unit must be assigned a node to execute. The assignment of an AMG node to the Functional Unit
is accomplished by the Functional Unit that currently holds the semaphore and is the active graph
manager of the system. The active graph manager, a Functional Unit, can assign an AMG node to
itself or another Functional Unit. A Functional Unit becomes the active graph manager when it is
in the Writing State or when it is both in the Idle State and at the top of the list of available
Functional Units. The active graph manager possesses the semaphore and is the only Functional
Unit that can talk over the Network while other Functional Units listen. To grab the semaphore,
the Functional Unit may have to compete with others. The semaphore is granted based on the
specified protocol of the defined architecture. Sections 4.6.5 and 4.7 discuss the communication
network protocols.
Figure 12. An example graph and the node data structure for one group of nodes.
To complete execution of the AMG node, the attached Functional Unit goes through a
sequence of states as depicted in Figure 6 for the AMOS. These states define the operating
system characteristics of the ATAMM Multicomputer Operating System (AMOS) and, thus, the
state diagram of the Functional Units. This state diagram is described in the next Section.
Through its window, the Functional Unit object displays information about its current status such
as current state, the name of the assigned AMG node, and the number of the current data packet.
4.6.3 Functional Unit State Diagram Description
Idle
When idle, the Functional Unit awaits a fire-message indicating an AMG node is assigned
to it for execution. It also continuously scans the Idle-List of available Functional Units to
determine whether it is at the top of the list. When it finds itself at the top of the list and still idle,
it scans the CMG for enabled read nodes. A CMG read node is enabled when every one of its
input edges has a token with the appropriate tag and all of its output edges have an empty
buffer. If there are enabled CMG read nodes, it attempts to grab the semaphore to become the
active graph manager. Upon receiving a fire-message, the Functional Unit migrates to the
Reading State.
Grab Semaphore 1
In this state the Functional Unit attempts to establish a communication link with other
Functional Units. After establishing a communication link and grabbing a semaphore, the
Functional Unit becomes the active graph manager of the system and moves to the Graph
Manager State. Otherwise, it goes to the Idle State.
Graph Manager
Being the active graph manager, the Functional Unit assigns the CMG read nodes to the
idle Functional Units in the Idle-List. It sends fire-messages to the appropriate Functional Units,
possibly including itself; moves the assigned Functional Units from the Idle-List to the Busy-List
of Functional Units; updates the CMG and broadcasts the updated information to others. After
the "Fire" broadcast, it releases the semaphore. The Functional Unit then migrates to the Idle
State.
Reading
The Reading State represents the activity of reading the input data. The reading of input
data is accomplished by consuming one token from each input edge with the appropriate token
tag. After reading the node's input data, the Functional Unit progresses to the Processing State.
Processing
In this state, the Functional Unit executes the task represented by the node. The duration
of this state is represented by the process time of the node. However, when simulating graphs
with variable node times, the duration of this State is computed on the fly by calling the
appropriate statistical function that represents the node. Upon completion, it progresses to the
Grab Semaphore 2 State.
Grab Semaphore 2
To write the generated output data on the output edges, the Functional Unit must grab the
semaphore and become the active graph manager of the system. It remains in this state and
competes for the semaphore until it is granted.
Writing
After becoming the active graph manager, the Functional Unit migrates from the Busy-List
to the Idle-List of Functional Units. It then writes the output data on the output edges of the
nodes and updates the CMG accordingly. The writing of output data is accomplished by inserting
one token on each output edge registering the tag associated with it. The updated information is
broadcast to other Functional Units via the "Data" broadcast. Before releasing the semaphore, it
goes to the Graph Manager State.
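The state descriptions above can be summarized as a transition table. The sketch below, in Python for brevity, encodes only the transitions stated in this section. The event labels (for example "semaphore_granted") and the run helper are hypothetical names introduced for illustration and do not appear in the Simulator.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    GRAB_SEMAPHORE_1 = auto()
    GRAPH_MANAGER = auto()
    READING = auto()
    PROCESSING = auto()
    GRAB_SEMAPHORE_2 = auto()
    WRITING = auto()


# (current state, event) -> next state. Event names are illustrative labels.
TRANSITIONS = {
    (State.IDLE, "fire_message"): State.READING,
    (State.IDLE, "top_of_idle_list_and_node_enabled"): State.GRAB_SEMAPHORE_1,
    (State.GRAB_SEMAPHORE_1, "semaphore_granted"): State.GRAPH_MANAGER,
    (State.GRAB_SEMAPHORE_1, "semaphore_denied"): State.IDLE,
    (State.GRAPH_MANAGER, "semaphore_released"): State.IDLE,
    (State.READING, "read_done"): State.PROCESSING,
    (State.PROCESSING, "process_done"): State.GRAB_SEMAPHORE_2,
    (State.GRAB_SEMAPHORE_2, "semaphore_granted"): State.WRITING,
    (State.GRAB_SEMAPHORE_2, "semaphore_denied"): State.GRAB_SEMAPHORE_2,
    (State.WRITING, "write_done"): State.GRAPH_MANAGER,
}


def run(events, start=State.IDLE):
    """Apply a sequence of events and return the list of states visited."""
    state, trace = start, [start]
    for event in events:
        state = TRANSITIONS[(state, event)]
        trace.append(state)
    return trace


# One node execution in which the first attempt to grab the semaphore fails.
trace = run(["fire_message", "read_done", "process_done", "semaphore_denied",
             "semaphore_granted", "write_done", "semaphore_released"])
state_names = [state.name for state in trace]
```

In this trace the Functional Unit returns to the Idle State by way of the Graph Manager State, matching the rule that a writing unit acts as graph manager before it releases the semaphore.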
4.6.4 FU Lists
The FU-Lists object manages the Functional Units and the Idle-List and Busy-List of
Functional Units within a Processor-Group object. It creates and destroys Functional Units and
moves them between the Idle-List and Busy-List upon receiving appropriate messages from the
Graph-Manager object. It also keeps track of the number of Functional Units in the Processor-
Group object. This object was created to facilitate the management of the Functional Units
objects.
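A minimal sketch of this bookkeeping, in Python for brevity, is given below. It assumes that a new Functional Unit is created only when the Idle-List is empty and the Processor-Group's upper bound has not been reached, and that a released unit joins the end of the Idle-List. The report does not state the insertion position, and the class and method names are hypothetical.

```python
class FULists:
    """Idle and busy lists of Functional Units for one Processor-Group (illustrative)."""

    def __init__(self, initial_count, limit):
        if initial_count > limit:
            raise ValueError("initial count exceeds the FU limit")
        self.limit = limit
        self.idle = [f"FU{i}" for i in range(1, initial_count + 1)]
        self.busy = []

    @property
    def count(self):
        """Total number of Functional Units, idle or busy."""
        return len(self.idle) + len(self.busy)

    def acquire(self):
        """Move an idle unit to the busy list, creating one if the limit allows."""
        if not self.idle:
            if self.count >= self.limit:
                return None  # nothing idle and the upper bound is reached
            self.idle.append(f"FU{self.count + 1}")
        fu = self.idle.pop(0)
        self.busy.append(fu)
        return fu

    def release(self, fu):
        """Return a unit from the busy list to the end of the idle list."""
        self.busy.remove(fu)
        self.idle.append(fu)


lists = FULists(initial_count=1, limit=2)
first = lists.acquire()   # existing unit "FU1"
second = lists.acquire()  # "FU2" is created at run time
third = lists.acquire()   # None: the upper bound of 2 is reached
lists.release(first)
```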
4.6.5 Local-Networks
The Local-Network object is envisioned to manage the arbitration of local semaphores
among the Functional Units and to provide a means of establishing communication with the
Global-Network object. Although all implementations of ATAMM have considered only a single
semaphore, the Local-Network and Global-Network objects are intended to explore systems with
multiple semaphores and a hierarchy of semaphores. In this regard, the Local-Network is a child
of the Global-Network.
Nonetheless, this Simulator assists in the development of theories regarding the ATAMM
under ideal conditions. Networks do not exist under ideal conditions. Due to lack of time, the
Local-Network object is not yet implemented. In this regard, the communication latency is zero
and the simulation is performed under the ideal condition. However, the system is still limited to
a single semaphore to ensure the integrity of the CMG markings.
4.7 Global-Networks
The Global-Network object is envisioned to manage the arbitration of global semaphores
among the different Processor-Group objects and to provide a means of establishing
communication with the Local-Network objects. For the reasons stated in Section 4.6.5, the
Global-Network object is not yet implemented. The communication latency among the
Processor-Group objects is also zero and the simulation is performed under the ideal condition.
The single semaphore mentioned earlier is global throughout the system and ensures the integrity
of the CMG markings.
4.8 TBO/TBIO and Ensemble TBO/TBIO
The Ensemble object is designed to calculate the TBO, TBIO, ensemble TBO, and
ensemble TBIO points. During the simulation process, the time when a data packet is injected
into an algorithm graph and the time when the same data packet exits the algorithm graph are
recorded. This information is then used to calculate the TBO and TBIO points for all data
packets. Through the Ensemble's window, the TBO and the TBIO points are plotted as shown in
Figure 13. This process continues until the TBO and TBIO points of all data packets are
determined. However, if ensemble TBO and ensemble TBIO points are desired after each
simulation of the algorithm graphs, the TBO and the TBIO points are averaged for each
simulation to calculate the ensemble (or average) TBO and ensemble (or average) TBIO points,
respectively, and only these averages are recorded and plotted. This process continues until all
ensemble points are determined.
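The calculation can be sketched as follows, in Python for brevity. The sketch assumes that a TBIO point is the interval between a packet's injection and its exit, and that a TBO point is the interval between the exits of successive packets. Measuring TBO only between retained packets is a design choice of this sketch. The function names and the sample times are hypothetical and are not Simulator output.

```python
def tbo_tbio(inject_times, exit_times, discard=0):
    """Return (TBO points, TBIO points) after dropping the first `discard` packets."""
    tbio = [out - inp for inp, out in zip(inject_times, exit_times)][discard:]
    kept_exits = exit_times[discard:]
    tbo = [later - earlier for earlier, later in zip(kept_exits, kept_exits[1:])]
    return tbo, tbio


def ensemble(points):
    """Ensemble value of one simulation: the average of its points."""
    return sum(points) / len(points)


# Illustrative times for four data packets; the first packet is discarded
# because it precedes the steady state.
inject = [0, 25, 50, 75]
exits = [60, 90, 115, 140]
tbo, tbio = tbo_tbio(inject, exits, discard=1)
ensemble_tbo = ensemble(tbo)    # 25.0
ensemble_tbio = ensemble(tbio)  # 65.0
```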
Figure 13. The TBO and TBIO plots.
Through the menu options of the Ensemble window, the calculated TBO and TBIO points
along with their averages can be stored in an ensemble (ENS) file for future references by the
"Save..." option. Ensemble files have ".ens" extensions and are described in Section 4.10. It is
also possible to print the plotted points as depicted by this window. The "Average" option gives
the averages of the points and the "Grid" option draws a grid along the x-axis and the y-axis for
better visualization of the plotted diagrams. The "Scale Down" option allows resizing of the
plotted diagram to the desired scale.
4.9 System
The System object is created to display the status of the system. While the simulation is in
progress, the System-Clock and name of the output FDT file are displayed. Continuous display of
the System-Clock gives an indication of the speed and duration of the simulation process.
4.10 The Input and Output File Formats
The input algorithm marked graph files provided by the Graph Entry tool are a set of node
and edge objects and the information about the relationships among them. The format of the
input graph (GRF) file generated by the Graph Entry tool is given in Appendix A.
The format of the output FDT file as generated by the Simulator is defined in Appendix B.
For details on the meaning and significance of each element, please refer to documents provided
with the ATAMM Analysis tool [12]. An example of an FDT file is provided in Appendix C.
The computed TBO, TBIO, ensemble TBO, ensemble TBIO points, and their averages are
stored in the output ensemble (ENS) files. Two examples of the ENS files are provided in
Appendices D and E. Appendix D represents the TBO and TBIO points for a single simulation
and Appendix E lists the ensemble TBO and ensemble TBIO points for each of 12 simulations and
the ensemble (average) for all simulations.
4.11 How to Use the Simulator
To simulate an algorithm graph, the algorithm graph must first be generated by use of the
Graph Entry tool. The graph must be drawn and its attributes such as read, process, and write
times of the nodes; node function (for variable node latencies); node assignment to groups of
processors; buffer sizes and initial tokens of the edges; and injection time and sequencing of the
sources must be defined. The algorithm graphs can then be loaded into the Simulator. The
Simulator extracts the necessary information from the GRF file and sets up the system
accordingly. It is also possible to specify the system attributes through the Simulator's objects.
The procedure to simulate an algorithm graph is shown in Table 1 as well as in the help files
provided by the Simulator software.
1. "Open..." an existing graph,
2. create as many "Processor-Group" objects as necessary and design these
   objects to fit your specifications (this information could also be provided
   by a GRF file),
3. "Discard..." as many data packets as necessary,
4. select "TBO/TBIO" or "Ensemble",
5. specify "Duration..." of the simulation process (this information is also
   provided by a GRF file and as a sink attribute),
6. set the "Speed" of the simulation process, and
7. "Run" the Simulator. When finished, the Simulator will prompt accordingly.
8. To exit the Simulator, either double click on the system menu button of the
   Simulator's window or choose the exit option in its system menu.
Table 1. Simulator Execution Procedures.
5. Case Studies and Experimental Results
In this section, two case studies are presented as a demonstration of the application
capabilities of the Simulator in studying the behavior of algorithm graphs under the ATAMM
rules. These case studies are conducted and presented in a manner that typically would take the
user of the Simulator through the procedural steps for creating algorithm graphs and evaluating
the desired system. An example graph referred to as Intermediate 1 (Interl.grf) and depicted in
Figure 14 is considered for all case studies. The first case study is a homogeneous simulation of
the Interl.grf graph. The second case study is a heterogeneous simulation of the Interl.grf graph
that demonstrates capabilities and features of the Simulator in static assignment of nodes to
different groups of processors.
Figure 14. The Interl.grf graph.
5.1 Case Study 1
This case study is primarily conducted for validating the results of the simulation with the
theoretical predictions and compliance with a previous simulator [9]. All nodes execute on a
single Processor-Group. The timing latencies of the nodes in Interl.grf are shown in Figure 14.
For this case study the read time and write time of the nodes are assumed to be zero time units for
the ideal simulation of the graph. The Single Graph Play (SGP) and the Total Graph Play (TGP),
[5, 6], for four resources of this graph are shown in Figure 15. The TGP of Figure 15 is the
modified TGP of the graph after adding a control edge from node N3 to node N4.
After loading the Interl.grf file, the Simulator-Kernel window's caption bar is updated and
reflects the name of the file loaded, as shown in Figure 9. The Processor-Group windows are also
updated to reflect the specified system, Figure 11. Results of the simulation of the graph are then
analyzed by the Analysis Tool [12] and are shown in Figure 16. Analysis of the results of the
simulation of the graph reveals compliance with the theoretical prediction where TBO equals 25,
as depicted in Figure 15.
5.2 Case Study 2
The static assignment of nodes and heterogeneous capabilities of the Simulator are studied
here. In this case study, the nodes N1 and N2 and the Source are assigned to one
Processor-Group with two functional units. Nodes N3, N4, N5, and the Sink are assigned to another
Processor-Group with two functional units. This partition of nodes among Processor-Groups is
consistent with the modified TGP of Figure 15 and should result in the same TBO and TBIO
performance as for Case Study 1. Analysis of the results reveals that the same performance as the
previous case study is achieved. Figure 17 is the task and resource activity display and the
cursors mark a time interval corresponding to the TGP of the graph.
Figure 15. SGP and Modified TGP.
Figure 16. The task and resource activities for Case Study 1. The spacing
between the vertical cursors is 25 time units.
Figure 17. The task and resource activities for Case Study 2. The spacing
between the vertical cursors is 25 time units.
6. SUMMARY
A Simulator is developed to simulate the execution of algorithm marked graphs in
accordance with the ATAMM rules. Whereas previous ATAMM simulators assume that all
algorithm graph nodes are executed on a homogeneous set of functional units, this new Simulator
enables groups of graph nodes to execute on different processor groups, where each processor
group may represent a different type of functional unit. Thus, a heterogeneous architecture may
be simulated. The Simulator is based on object-oriented programming and is event-driven to
accelerate simulation speed. It provides the simulation functions in an ATAMM Integrated
Environment, which also includes a Graph Entry tool for describing graphs for simulation, a
Design Tool for analyzing and altering a graph to obtain desired performance, and an Analysis
Tool for playing back the results of a simulation. Test cases show that the simulator accurately
executes the ATAMM rules for both a heterogeneous architecture and a homogeneous
architecture, which is a special case for only one processor group.
References
[1] John W. Stoughton and Roland R. Mielke, "The ATAMM Procedure Model for Concurrent
Processing of Large Grained Control and Signal Processing Algorithms," [...],
121, May 1988.
[2] R. R. Mielke, John W. Stoughton, and S. Som, "Modeling and Performance Bounds for
Concurrent Processing," NASA CR 4167, Grant NAG1-683, August 1988.
[3] Asa M. Andrews, Robert L. Jones, Paul J. Hayes, and Harry F. Benz, "Simulator for
Enhanced ATAMM Multiprocessing," AIAA Computing in Aerospace 8: A Collection of
Technical Papers, Vol. 2, 542, October 21-24, 1991.
[4] R. L. Jones, P. J. Hayes, A. M. Andrews, S. Som, John W. Stoughton, and R. R. Mielke,
"Enhanced ATAMM for Increased Throughput Performance of Multicomputer Data Flow
Architectures," Proceedings of the IEEE NAECON 91, Vol. 1, 238, May 1991.
[5] Sukhamoy Som, R. Mielke, R. Obando, J. Stoughton, P. J. Hayes, and R. L. Jones,
"Throughput Enhancement by Multiple Concurrent Instantiations in the ATAMM Data Flow
Architecture," Proceedings of the ISMM International Symposium on Computer
Applications in Design, Simulation, and Analysis, Las Vegas, NV, 71, March 1991.
[6] S. Som, J. W. Stoughton, and R. R. Mielke, "Strategies for Concurrent Processing of
Complex Algorithms in Data Driven Architectures," NASA CR 187450, Final Report, Grant
NAG1-683, October 1990.
[7] P. J. Hayes, R. L. Jones, H. F. Benz, A. M. Andrews, J. W. Stoughton, R. R. Mielke, M. R.
Malekpour, and P. R. Appleget, "VHSIC Multiprocessor Implementation of the ATAMM
Strategy," GOMAC91/1991 Digest of Papers, 521, November 1991.
[8] R. Mielke, J. Stoughton, S. Som, R. Obando, M. Malekpour, and B. Mandala, "Algorithm to
Architecture Mapping Model (ATAMM) Multicomputer Operating System Functional
Specification," NASA CR 4339, Cooperative Agreement NCC1-136, November 1990.
[9] Mahyar R. Malekpour, John W. Stoughton, and Roland R. Mielke, "Simulator for
Concurrent Processing Data Flow Architectures," NASA CR 189604, Cooperative
Agreement NCC1-136, March 1992.
[10] S. Som, J. W. Stoughton, and R. R. Mielke, "Performance Prediction, Simulation, and
Measurement for Real-Time Computing in a Class of Data Flow Architectures,"
Proceedings of the ISMM International Conference on Computer Applications in Design, Simulation, and
Analysis, ACTA Press, pp. 64-68, New Orleans, March 5-7, 1990.
[11] W. R. Tymchyshyn, "ATAMM Multicomputer System Design," Master's Thesis, Old
Dominion University, Norfolk, Virginia, August 1988.
[12] Robert Jones, John Stoughton, and Roland Mielke, "ATAMM Analysis Tool," NASA CR
187625, Cooperative Agreement NCC1-136, October 1991.
Appendix A
Format of Graph Description in Graph-Entry Output File
Note 1. Italics with underline are for information
Note 2. Only italic is for a choice or decision
Note 3. All Times are Positive Long Integer Values
Note 4. All Locations 'X Y' are in the range 1..100
Version 2.0.13
System_Max_CPUs         -- Number of CPUs allowed in the system, range 1..32
Current_Number_CPUs     -- Initial number of CPUs, range 1..System_Max_CPUs
Max_Index               -- Max Indexes for the Operating Point Table, range 1..10
Current_Index           -- Initial Index for the run, range 1..Max_Index
Max_Number_Groups       -- Number of Heterogeneous Groups
Max_Nodes               -- Max Number of Nodes in all Graphs
Max_Arcs                -- Max Number of Arcs in all Graphs
Max_Sources             -- Max Number of Sources
Max_Sinks               -- Max Number of Sinks
Self_Test_Time
Display_CPU_Number      -- Used to Display a certain configuration of a Graph
Display_Index_Number    -- Display which index configuration
Selected_Group
Show_All_Objects        -- Display for Enabled or Disabled Control Arcs
Origin.X                -- Used to size the Graph Window
Origin.Y                -- Used to size the Graph Window
Right_Bottom.X          -- Used to size the Graph Window
Right_Bottom.Y          -- Used to size the Graph Window
Grid_Status             -- Grid Display ON or OFF
Heterogeneous           -- True / False Flag for Heterogeneous System Simulations
Number_CPUS_Group       -- (Array[1..Max_Number_Groups] of Integers)
Object_Type             -- (NODE, SOURCE, SINK, ARC)
LOOP
if Object_Type = NODE then
    Node_Graph_Number
    Block_Index                 -- unique for all blocks in all graphs
    Node_Number
    Node_Name
    Node_Mode                   -- (SIMPLEX, DUPLEX, TMR)
    Node_User_File_Name
    Node_Priority
    Node_Instantiations(1..System_Max_CPUs, 1..Max_Index)
    Node_Read_Time
    Node_ProcessTime            -- (Mean Value of Process Time)
    Node_Write_Time
    Node_Color
    Node_Number_Inputs
    Node_Number_Outputs
    Node_Random_Function        -- (A, B, C, etc.)
    Node_LowerProcessTimeBound  -- (Smallest Possible Bound on Process Time)
    Node_UpperProcessTimeBound  -- (Largest Possible Bound on Process Time)
    Node_ProcessType            -- (Boolean array[1..Max_Number_Groups] Heterogeneous)
    Node_SubGraph_File_Name     -- If node has a subgraph
    Node_Location               -- (X Y) Coordinates
end if
if Object_Type = SOURCE then
    Source_Graph_Number
    Block_Index
    Source_Number
    Source_Name
    Source_Mode
    Source_Priority             -- Graph Priority(?)
    Source_TBI(1..System_Max_CPUs, 1..Max_Index)
                                -- Time Between Inputs (TBI)
    Source_Number_DataPackets   -- Number of Data Packets for each Source Edge
    Source_Write_Time
    Source_ProcessType          -- (Boolean array[1..Max_Number_Groups] Heterogeneous)
    Source_Location             -- (X Y) Location of the Source
end if
if Object_Type = SINK then
    Sink_Graph_Number
    Block_Index
    Sink_Number
    Sink_Name
    Sink_Mode
    Sink_Read_Time
    Sink_Number_DataPackets     -- Number of Data Packets Received at Sink
    Sink_ProcessType            -- (Boolean array[1..Max_Number_Groups] Heterogeneous)
    Sink_Location               -- (X Y) Location of the Sink
end if
if Object_Type = ARC then
    Edge_Number
    Edge_Type                   -- (CONTROL, DATA)
    Edge_Initial_Type           -- (SOURCE_TYPE, NODE_TYPE, SINK_TYPE)
    Edge_Initial                -- Number of Initial
    Edge_Initial_String         -- Name of the Node, Source, Sink
    Edge_Initial_Block_Index    -- Block Index of the Initial Block
    Edge_Initial_Parm_Number    -- Position in procedure call (0 if CONTROL)
    Edge_Terminal_Type          -- (SOURCE_TYPE, NODE_TYPE, SINK_TYPE)
    Edge_Terminal
    Edge_Terminal_String        -- Name of the Node, Source, Sink
    Edge_Terminal_Block_Index   -- Block Index of the Terminal Block
    Edge_Terminal_Parm_Number   -- Position in procedure call (0 if CONTROL)
    if Edge_Type = DATA then
        Edge_Data_Type          -- TBD Either a File_Name or Data_Type Name
        Edge_Size               -- TBD Whether or not to include
        Edge_Tagging_Rule       -- Data Packet Distance
    else
        Edge_Tagging_Rule(1..System_Max_CPUs, 1..Max_Index)
        if Edge_Terminal_Type = SOURCE_TYPE then
            Edge_Delay(1..System_Max_CPUs, 1..Max_Index)
                                -- Firing Delay for Terminal
            Edge_Selector(1..System_Max_CPUs, 1..Max_Index)
                                -- Output edge selection for token
        end if
    end if
    Edge_Initial_Tokens(1..System_Max_CPUs, 1..Max_Index)
                                -- Separated by <CR>
    Edge_Tokens_Limit(1..System_Max_CPUs, 1..Max_Index)
                                -- Arc not enabled if size = 0
    Edge_Max_Buffers
    Edge_Number_Joints
    Edge_Joint(1..Max_Number_Joints)    -- X Y coordinates
end if
REPEAT UNTIL <EOF>
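
The range constraints stated in the header comments and in Note 4 can be checked mechanically once a graph description has been read. The following is a minimal sketch, assuming the header values have already been parsed into a Python dictionary keyed by the field names above; the physical layout of the file is not specified here, and the helper names `check_header` and `check_location` are hypothetical.

```python
def check_header(header):
    """Return a list of violations of the documented header ranges."""
    problems = []
    if not 1 <= header["System_Max_CPUs"] <= 32:
        problems.append("System_Max_CPUs must be in 1..32")
    if not 1 <= header["Current_Number_CPUs"] <= header["System_Max_CPUs"]:
        problems.append("Current_Number_CPUs must be in 1..System_Max_CPUs")
    if not 1 <= header["Max_Index"] <= 10:
        problems.append("Max_Index must be in 1..10")
    if not 1 <= header["Current_Index"] <= header["Max_Index"]:
        problems.append("Current_Index must be in 1..Max_Index")
    return problems


def check_location(x, y):
    """Note 4: every 'X Y' location lies in the range 1..100."""
    return 1 <= x <= 100 and 1 <= y <= 100


# Illustrative values only; they are not taken from a real graph file.
good = {"System_Max_CPUs": 8, "Current_Number_CPUs": 4,
        "Max_Index": 3, "Current_Index": 1}
bad = {"System_Max_CPUs": 40, "Current_Number_CPUs": 4,
       "Max_Index": 3, "Current_Index": 5}

print(check_header(good))
print(check_header(bad))
```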
Appendix B
Format of the FDT File Generated by the Simulator
//The FIELDS to be read from an FDT event
Fields = 5
TIME "%lu"
EVENT "%s"
TASK "%s"
COLOR "%d"
RESOURCE "%s"
//The possible EVENTS that can be found in the FDT file
//{FIRE, DATA, RUN, HALT, EVENT}
Events = 10
NodeRead     >FIRE
NodeProcess
NodeWrite
NodeIdle     >DATA
FU_Test      >FIRE
FU_EndTest   >DATA
SourceWrite  >FIRE
SourceIdle   >DATA
SinkRead     >FIRE
SinkIdle     >DATA
//The possible ACTIVITIES that can be found in the FDT file
Activities = 4
Process      >NodeProcess
ReadWrite    >SourceWrite,SinkRead,NodeRead,NodeWrite
Test         >FU_Test
Idle         >NodeIdle,SourceIdle,SinkIdle,FU_EndTest
//The clock resolution of time tags in clock ticks per second
//Clock = 1000000
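
The activity table above groups the ten FDT events into four activities. One way a playback tool might use it is sketched below, assuming the table is held as a Python dictionary and inverted into an event-to-activity lookup. The names `ACTIVITIES`, `EVENT_TO_ACTIVITY`, and `activity_of` are hypothetical and are not part of the FDT format.

```python
# Activity -> events, transcribed from the ACTIVITIES table of the FDT header.
ACTIVITIES = {
    "Process": ["NodeProcess"],
    "ReadWrite": ["SourceWrite", "SinkRead", "NodeRead", "NodeWrite"],
    "Test": ["FU_Test"],
    "Idle": ["NodeIdle", "SourceIdle", "SinkIdle", "FU_EndTest"],
}

# Inverted lookup: event name -> activity name.
EVENT_TO_ACTIVITY = {
    event: activity
    for activity, events in ACTIVITIES.items()
    for event in events
}


def activity_of(event):
    """Return the activity an FDT event belongs to (KeyError if unknown)."""
    return EVENT_TO_ACTIVITY[event]


print(len(EVENT_TO_ACTIVITY))
print(activity_of("NodeWrite"))
```

The four activities together cover all ten events declared by `Events = 10`, so every event line in an FDT file maps to exactly one activity.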
Appendix C
FDT File Example
//Simulator Version 3.0, Output FDT File
//Graph file name: E:\SIM3\inter1.grf
25 SourceWrite Source 1 Proc1-4
25 SourceIdle Source 1 Proc1-4
25 NodeRead N1 1 Proc1-3
25 NodeProcess N1 1 Proc1-3
35 NodeWrite N1 1 Proc1-3
35 NodeIdle N1 1 Proc1-3
35 NodeRead N4 1 Proc1-2
35 NodeProcess N4 1 Proc1-2
35 NodeRead N3 1 Proc1-1
35 NodeProcess N3 1 Proc1-1
35 NodeRead N2 1 Proc1-4
35 NodeProcess N2 1 Proc1-4
45 NodeWrite N3 1 Proc1-1
45 NodeIdle N3 1 Proc1-1
50 SourceWrite Source 1 Proc1-3
50 SourceIdle Source 1 Proc1-3
50 NodeRead N1 1 Proc1-1
50 NodeProcess N1 1 Proc1-1
60 NodeWrite N1 1 Proc1-1
60 NodeIdle N1 1 Proc1-1
60 NodeRead N4 1 Proc1-3
60 NodeProcess N4 1 Proc1-3
60 NodeRead N3 1 Proc1-1
60 NodeProcess N3 1 Proc1-1
65 NodeWrite N4 1 Proc1-2
65 NodeIdle N4 1 Proc1-2
65 NodeRead N2 1 Proc1-2
65 NodeProcess N2 1 Proc1-2
70 NodeWrite N3 1 Proc1-1
70 NodeIdle N3 1 Proc1-1
75 NodeWrite N2 1 Proc1-4
75 NodeIdle N2 1 Proc1-4
75 NodeRead N5 1 Proc1-1
75 NodeProcess N5 1 Proc1-1
75 SourceWrite Source 1 Proc1-4
75 SourceIdle Source 1 Proc1-4
75 NodeRead N1 1 Proc1-4
75 NodeProcess N1 1 Proc1-4
85 NodeWrite N5 1 Proc1-1
85 NodeIdle N5 1 Proc1-1
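
Each event line carries the five fields declared in Appendix B (TIME, EVENT, TASK, COLOR, RESOURCE). A minimal sketch of reading such lines follows, assuming the fields are separated by whitespace and that task and resource names contain no spaces. The names `FdtEvent`, `parse_fdt`, and `node_intervals` are hypothetical, and the sample is a short subset of the listing above (the events for N2 and N4 are omitted).

```python
from collections import namedtuple

FdtEvent = namedtuple("FdtEvent", "time event task color resource")


def parse_fdt(text):
    """Parse FDT event lines, skipping blank lines and // comments."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        time, event, task, color, resource = line.split()
        events.append(FdtEvent(int(time), event, task, int(color), resource))
    return events


def node_intervals(events):
    """Pair each NodeRead with the following NodeIdle of the same task
    on the same resource, giving (task, resource, start, end) tuples."""
    started = {}
    intervals = []
    for ev in events:
        key = (ev.task, ev.resource)
        if ev.event == "NodeRead":
            started[key] = ev.time
        elif ev.event == "NodeIdle" and key in started:
            intervals.append((ev.task, ev.resource, started.pop(key), ev.time))
    return intervals


SAMPLE = """\
//Simulator Version 3.0, Output FDT File
25 SourceWrite Source 1 Proc1-4
25 SourceIdle Source 1 Proc1-4
25 NodeRead N1 1 Proc1-3
25 NodeProcess N1 1 Proc1-3
35 NodeWrite N1 1 Proc1-3
35 NodeIdle N1 1 Proc1-3
35 NodeRead N3 1 Proc1-1
35 NodeProcess N3 1 Proc1-1
45 NodeWrite N3 1 Proc1-1
45 NodeIdle N3 1 Proc1-1
"""

intervals = node_intervals(parse_fdt(SAMPLE))
for task, resource, start, end in intervals:
    print(task, resource, start, end)
```

For this subset, node N1 occupies Proc1-3 from time 25 to 35 and node N3 occupies Proc1-1 from time 35 to 45.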
Appendix D
ENS File Example for TBO & TBIO for 12 Data
Packets of a Single Simulation
//Simulator Version 3.0, TBO/TBIO Points
//Graph file name: E:\SIM3\inter1.grf
Number of TBO/TBIO points at Sink: 12
TBO TBIO
85.00 60.00
30.00 65.00
25.00 65.00
25.00 65.00
25.00 65.00
25.00 65.00
25.00 65.00
25.00 65.00
25.00 65.00
25.00 65.00
25.00 65.00
25.00 65.00
TBO/TBIO Averages:
30.42 64.58
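
The listed averages are consistent with a plain arithmetic mean over all 12 points. The short check below assumes that interpretation; the variable names are chosen only for this example.

```python
# The 12 TBO and TBIO points listed in the ENS file example.
tbo = [85.0, 30.0] + [25.0] * 10
tbio = [60.0] + [65.0] * 11


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


tbo_avg = round(mean(tbo), 2)    # 365 / 12
tbio_avg = round(mean(tbio), 2)  # 775 / 12
print(tbo_avg, tbio_avg)
```

Rounded to two decimals, the means are 30.42 and 64.58, matching the TBO/TBIO averages in the file.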
Appendix E
ENS File Example of Ensemble TBO and Ensemble TBIO
Points for 12 Simulations
//Simulator Version 3.0, TBO/TBIO Ensemble Points
//Graph file name: E:\SIM3\inter1.grf
Number of TBO/TBIO Ensembles at Sink: 12
TBO TBIO
30.00 64.00
30.00 64.00
30.00 64.00
30.00 64.00
30.00 64.00
30.00 64.00
30.00 64.00
30.00 64.00
30.00 64.00
30.00 64.00
30.00 64.00
30.00 64.00
TBO/TBIO Ensemble Averages:
30.00 64.00
Appendix F
Object-Oriented Programming
The following is quoted from [9] because of its importance in the development of this
Simulator.
"Structured programming flourished because it was efficient in terms of human resources.
Building and testing programs in discrete pieces enabled large applications to be developed in less
time with fewer bugs than their non-structured counterparts. In addition, the run-time impact of
structuring becomes less evident as a program grows in size. Object-oriented programming
extends structured programming by encapsulating both data and their associated functions.
In traditional procedural languages like C or Pascal, the programmer defines data
structures and writes functions and procedures to operate on the data. Although normally a
correspondence exists between which functions operate on which types of data, most procedural
languages offer no formal support for this correspondence; it is entirely the programmer's
responsibility to manage such an abstraction.
In an object-oriented approach, both data and operations that work with that data are
combined into a single logical unit known as an object. Dividing a program into objects
encompassing both data and operations makes the program more closely represent the logical
design that is being implemented. As a result, object-oriented programs are generally easier to
understand and maintain than procedural programs.
Object-oriented programming is merely the art of breaking a program down and
organizing it. In the case of structured programs, the primary concern is what the program is
doing. A structured program is based on operations. When writing object-oriented programs, the
program is organized around data types and their associated operations. It is a significant change
in perspective; instead of functional hierarchies, there are data hierarchies. Programming in an
object-oriented language involves creating objects and sending them commands or messages to do
things.
Object-oriented programs are based on four concepts: classes, objects, methods, and
inheritance. A class is similar to a Pascal RECORD. It describes an overall structure for any
number of types based upon it. The main difference between a class and a record is that a class
combines data fields (called instance variables) and procedures and functions (called methods)
that act upon the data.
An object is a variable of a class. All objects derived from a class are considered members
of that class and share similar characteristics of that class.
Methods are procedures and functions encapsulated in a class or object. Calling a method
is referred to as passing a message to an object. Object-oriented programs do most of their
works by sending messages to objects.
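
The four concepts named above can be illustrated with a short sketch. It is written in Python purely for illustration; the classes `Block` and `Node`, their fields, and the firing rule are hypothetical and do not describe the Simulator's actual class design.

```python
class Block:
    """A class: combines data (instance variables) with methods."""

    def __init__(self, name):
        self.name = name      # instance variable
        self.tokens = 0       # instance variable

    def receive(self, count=1):
        """A method: invoked by sending a message to the object."""
        self.tokens += count
        return self.tokens


class Node(Block):
    """Inheritance: Node reuses Block's data and methods and adds its own."""

    def __init__(self, name, process_time):
        super().__init__(name)
        self.process_time = process_time

    def fire(self, now):
        """Consume one token and return the completion time, or None
        if no token is available."""
        if self.tokens == 0:
            return None
        self.tokens -= 1
        return now + self.process_time


n1 = Node("N1", process_time=10)   # an object: a variable of class Node
n1.receive()                       # message handled by the inherited method
done = n1.fire(now=25)
print(done)
```

Here `n1.receive()` is answered by a method inherited from `Block`, while `n1.fire(...)` is answered by a method that only `Node` defines.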

REPORT DOCUMENTATION PAGE (Form Approved, OMB No. 0704-0188)
1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE
September 1993
3. REPORT TYPE AND DATES COVERED
Contractor Report (6/1/91-8/31/92)
4. TITLE AND SUBTITLE
Simulator for Heterogeneous Dataflow Architectures
5. FUNDING NUMBERS
C NAS1-19000
WU 586-03-11-31
6. AUTHOR(S)
Mahyar R. Malekpour
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)
Lockheed Engineering & Sciences Company
144 Research Drive
Hampton, VA 23666
8. PERFORMING ORGANIZATION REPORT NUMBER
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)
National Aeronautics and Space Administration
Langley Research Center
Hampton, VA 23681-0001
10. SPONSORING/MONITORING AGENCY REPORT NUMBER
NASA CR-191545
11. SUPPLEMENTARY NOTES
Langley Technical Monitor: Paul J. Hayes
12a. DISTRIBUTION / AVAILABILITY STATEMENT
Unclassified - Unlimited
Subject Category 33
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words)
A new simulator is developed to simulate the execution of an algorithm graph in accordance
with the Algorithm to Architecture Mapping Model (ATAMM) rules. ATAMM is a Petri Net
model which describes the periodic execution of large-grained, data-independent dataflow
graphs and which provides predictable steady state time-optimized performance. This
simulator extends the ATAMM simulation capability from a homogeneous set of resources, or
functional units, to a more general heterogeneous architecture. Simulation test cases show
that the simulator accurately executes the ATAMM rules for both a heterogeneous architecture
and a homogeneous architecture, which is the special case for only one processor type. The
simulator forms one tool in an ATAMM Integrated Environment which contains other tools for
graph entry, graph modification for performance optimization, and playback of simulations for
analysis.
14. SUBJECT TERMS
Simulation software, dataflow architecture, Petri Nets, concurrent
processing, multiprocessing
15. NUMBER OF PAGES
55
16. PRICE CODE
A04
17. SECURITY CLASSIFICATION OF REPORT
Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE
Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT
Unclassified
20. LIMITATION OF ABSTRACT
UL
NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18
298-102
