A static data flow simulation study at Ames Research Center by Howard, Lauri S. & Barszcz, Eric
NASA Technical Memorandum 89434 
A Static Data Flow Simulation 
Study at Ames Research Center 
Eric Barszcz and Lauri S. Howard 
June 1987 
National Aeronautics and 
Space Administration 
https://ntrs.nasa.gov/search.jsp?R=19870013768 2020-03-20T10:38:35+00:00Z
Eric Barszcz, 
Lauri S. Howard, Ames Research Center, Moffett Field, California 
Ames Research Center 
Moffett Field, California 94035 
ABSTRACT 
Demands in computational power, particularly in the area of computational fluid
dynamics (CFD), have led NASA Ames Research Center to study advanced computer
architectures. One architecture being studied is the static data flow architecture
based on research done by Jack B. Dennis at Massachusetts Institute of Technology.
To improve understanding of this architecture, a static data flow simulator, written
in Pascal, has been implemented for use on a Cray X-MP/48. A matrix multiply and a
two-dimensional fast Fourier transform (FFT), two algorithms used in CFD work at
Ames Research Center, have been run on the simulator. Execution times can vary by a
factor of more than 2 depending on the partitioning method used to assign
instructions to processing elements. Service time for matching tokens has proved to
be a major bottleneck. Loop control and array address calculation overhead can
double the execution time. The best sustained MFLOPS rates were less than 50% of the
maximum capability of the machine.
INTRODUCTION 
Demands in computational power at NASA Ames Research Center, particularly in
the area of computational fluid dynamics (CFD), have led Ames Research Center to
study advanced computer architectures. One architecture studied is the static data
flow architecture based on research done by Jack B. Dennis at Massachusetts
Institute of Technology (MIT) (ref. 1). At Ames Research Center, a static data flow
machine is being simulated and evaluated with respect to CFD.
The simulator is written in Pascal and executes on a Cray X-MP/48. The Cray 
permits larger problems than are typically simulated, allowing better evaluation of 
the architecture. The data flow machine presented in this paper is a modification 
of the architecture presented in reference 1. It incorporates a network connecting 
array memory (AM) modules and a separate network connecting processing elements 
(PE). 
The paper contains: an overview of data flow, a description of the architec- 
ture simulated, information about the simulator itself, discussion on the generation 
and partitioning of code, justification of different service times used for token 
matching, results of simulation runs on a matrix multiply and two-dimensional fast 
Fourier transform (FFT), and a discussion of problem areas for these algorithms. 
DATA FLOW 
In a data flow architecture, instruction execution is determined by the pres- 
ence of data, not a program counter. As soon as all data for an instruction is 
present, it is executed without regard for its position in the program. 
A data flow program is represented by a data flow graph indicating the data 
dependencies. The graph is composed of two parts, nodes and directed arcs. Nodes 
represent instructions to be executed and arcs represent data dependencies between 
nodes. During execution of a data flow program, a node "fires," consuming input
data and generating a result. Tokens carry copies of the result along output arcs
to dependent nodes. A node is enabled or ready to fire when there are tokens on all
input arcs. For a more complete description see reference 1.
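The firing rule above can be illustrated with a minimal Python sketch. It is not the memorandum's simulator: the `Node` class, the `run` function, and the three-node example graph are hypothetical names chosen for illustration, and acknowledge signals are ignored here.

```python
from collections import deque


class Node:
    """One instruction node of a data flow graph (illustrative only)."""

    def __init__(self, op, n_inputs, dests):
        self.op = op                        # function applied to the operands
        self.operands = [None] * n_inputs   # one slot per input arc
        self.dests = dests                  # list of (node_name, input_port)

    def enabled(self):
        # A node may fire only when every input arc holds a token.
        return all(value is not None for value in self.operands)


def run(nodes, initial_tokens):
    """Deliver tokens until none remain; return results of nodes with no output arcs."""
    pending = deque(initial_tokens)         # each token: (node_name, port, value)
    outputs = {}
    while pending:
        name, port, value = pending.popleft()
        node = nodes[name]
        node.operands[port] = value
        if node.enabled():
            result = node.op(*node.operands)             # fire: consume the inputs
            node.operands = [None] * len(node.operands)
            if not node.dests:
                outputs[name] = result
            for dest_name, dest_port in node.dests:      # one token per output arc
                pending.append((dest_name, dest_port, result))
    return outputs


# Hypothetical graph computing (a + b) * (a - b) for a = 5, b = 3.
graph = {
    "add": Node(lambda x, y: x + y, 2, [("mul", 0)]),
    "sub": Node(lambda x, y: x - y, 2, [("mul", 1)]),
    "mul": Node(lambda x, y: x * y, 2, []),
}
tokens = [("add", 0, 5), ("sub", 0, 5), ("add", 1, 3), ("sub", 1, 3)]
result = run(graph, tokens)
print(result)  # {'mul': 16}
```

No program counter appears anywhere in the sketch: the order in which `add` and `sub` fire depends only on when their tokens arrive.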
There are two approaches to the implementation of instruction level data 
flow. One, static data flow, allows at most one token on an arc at any given 
instant (ref. 2). The other, dynamic tagged token data flow, allows multiple tokens 
on an arc (refs. 3-5). In this study, static data flow is implemented using
instructions bound to processing elements at compile time. Acknowledge signals are
sent from the consumer to the producer after an instruction has fired.
An instruction contains opcode, enable and reset counts, operands and destination
addresses. The enable count records the number of outstanding operands. Space
for operands is allocated inside the instruction. The acknowledge signals prevent
operands being overwritten. The reset count indicates the number of destination
addresses. The destination addresses contain locations of instructions that receive
either a copy of the result or an acknowledge signal (ref. 1).
MACHINE ARCHITECTURE 
Static data flow is implemented on a variety of machine architectures. In this 
study, the machine architecture has four basic elements: PE, AM modules, PE network 
and AM network. It is a modification of the architectures proposed in references 1 
and 6. The simulator allows up to 1024 PE in the machine. A group of four process- 
ing elements share an AM module. Array Memory modules store array values and are 
used for I/O. Input values are placed in AM prior to execution of the data flow 
graph. Output values are extracted from AM after the simulation. The PE network 
provides communication among PE while the AM network does the same for AM modules. 
The networks are hybrid packet switched Omega networks (refs. 7 and 8). The PE and
networks are assumed to have a 50-nsec cycle time. Figure 1 gives an overview of
the machine architecture.
Processing Element 
Each PE is composed of an update/matching unit, fetch unit, functional units, 
token generator unit, send and receive units, an instruction store, an enabled 
instructions queue, and a port to the local AM module (fig. 2).
The update/matching unit accepts tokens from the receive unit and the token
generator unit. The instruction associated with the token's address is determined,
and its enable and reset counts are fetched from the instruction store. If data is
present, it is stored in the appropriate operand slot in the instruction at this
time. The enable count is decremented and checked. When the enable count reaches
zero, the instruction is enabled. The address of the instruction is placed in the
enabled instructions queue, and the reset count is stored in place of the enable
count.
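The update/matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Instruction` and `Token` classes, the field names, and the example counts are hypothetical, and port 0 is taken to mean an acknowledge signal, following the packet format given in the Networks section.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instruction:
    opcode: str
    enable_count: int      # tokens still outstanding before the instruction fires
    reset_count: int       # value restored once the instruction is enabled
    operands: list = field(default_factory=lambda: [None, None])


@dataclass
class Token:
    address: int                  # instruction address in the instruction store
    port: int = 0                 # assumed: 0 = acknowledge, 1..n = operand slot
    data: Optional[float] = None


def update_matching(store, enabled_queue, token):
    """Process one token as the update/matching unit is described to do."""
    instr = store[token.address]
    if token.port != 0:                       # data signal: store the operand
        instr.operands[token.port - 1] = token.data
    instr.enable_count -= 1                   # decrement and check
    if instr.enable_count == 0:
        enabled_queue.append(token.address)   # hand the address to the fetch unit
        instr.enable_count = instr.reset_count


# Hypothetical instruction waiting for two operands; after firing it will also
# wait for one acknowledge, so its reset count is 3 in this example.
store = {7: Instruction("FMUL", enable_count=2, reset_count=3)}
enabled_queue = deque()
update_matching(store, enabled_queue, Token(7, port=1, data=2.0))
update_matching(store, enabled_queue, Token(7, port=2, data=4.0))
print(list(enabled_queue), store[7].enable_count)  # [7] 3
```

Each call handles exactly one token, which is why the service time of this unit bounds how quickly instructions can become enabled.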
The fetch unit removes addresses of enabled instructions from the enabled
instructions queue. The unit then fetches the opcode and operands from instruction
store. The opcode is partially decoded and a packet containing the opcode, operands
and instruction address is passed to the appropriate functional unit or AM port.
The functional units are composed of two floating-point multipliers, a 
floating-point adder, and an arithmetic and logic unit (ALU). Each of the floating- 
point multipliers is assumed to be capable of 2.5 MFLOPS and the floating-point 
adder capable of 5 MFLOPS. (The Weitek chips WTL 1064 and WTL 1065 could be used as 
the multiplier and adder, respectively.) The ALU handles all other operations. All 
operations are done using 64-bit operands. Results and the associated instruction
addresses are sent from the functional units to the token generator unit.
The AM port is the only link between the PE and AM. All AM requests are sent 
to the port from the fetch unit. When a request is satisfied, the result with its 
instruction address is sent to the token generator unit. 
The token generator unit takes a result and fetches the associated destination 
address list. One token is created for each destination address. If the address is 
a local instruction, the token is passed to the update/matching unit. Otherwise,
the token is passed to the send unit. 
The send and receive units are the interface between PE and PE network. The 
send unit places tokens on the PE network and the receive unit removes them. When a 
token is removed from the network, it is passed to the update/matching unit. 
Array Memory 
Data flow does not have the classical concept of data stored in writeable 
memory. Instead, all variables and arrays are treated as values on tokens which are 
generated, never modified, and then consumed. The large amount of data on array 
tokens makes this difficult to implement. One approach is presented in refer- 
ence 9. In this study, AM is used to store array values indefinitely. 
Read and write nodes are the two node types that deal with the AM. Whenever a
value is needed, the physical location is calculated and passed to one of these
nodes. The read or write request is then sent to the local AM module.
All AM requests pass through the AM port. Each request has an opcode to
describe the request, a physical location, and an associated instruction address. A
write request also includes data. The high-order bits of the physical address are
used for the AM module address. Each module can determine whether or not the
address being referenced is local to the module.
Local requests are immediately satisfied while remote requests cause packets to
be generated for the AM network. Remote writes are nonblocking; the associated
instruction address is immediately sent to the token generator unit while the packet
is waiting to be placed on the AM network. The writes do not cause read/write races
because the AM modules use I-Structure Storage (ref. 3). Remote reads are blocking;
a copy of the request is placed in a delay queue until a response returns from a
remote AM module. When the response arrives, the request is removed from the delay
queue. The data with its associated instruction address is then sent to the token
generator unit.
The way AM is structured and connected has several advantages. First, only one
copy of an array is needed. Second, the interface between PE and AM is very
simple. There are only two requests the PE can make and both pass through the
AM port. Third, with multiple PE sharing an AM module, the AM network is smaller
than the PE network and so has a smaller network delay. Fourth, having all remote
AM requests handled on a separate network reduces the traffic on the PE network.
Networks 
Both the PE network and the AM network are hybrid packet-switched omega
networks. They are composed of 2-by-2 router nodes with 16-bit-wide data paths. Each
input port can buffer a 96-bit packet. A packet header contains 32 bits of
information broken into 10 bits for PE identification, 12 bits for instruction
identification, 2 bits for operand port selection, and 8 bits to carry computation
error codes (e.g., divide by zero). Packets, which consist of a header and data or
just a header, are called data and acknowledge signals, respectively. A data signal
contains a header with a nonzero port selection and 64 bits of data. An acknowledge
signal is a header with a port selection of zero. Since the data paths are 16 bits
wide, a packet is broken into slices. The first slice, called the destination
slice, contains network routing information. The other slices of a packet follow
the destination slice contiguously through the network. It is possible to move a
slice between nodes every 50 nsec.
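The header layout and slicing can be made concrete with a short Python sketch. The field widths (10, 12, 2, and 8 bits) and the 16-bit slice width come from the text; the order of the fields inside the 32-bit header and the function names are assumptions made for illustration.

```python
PE_BITS, INSTR_BITS, PORT_BITS, ERR_BITS = 10, 12, 2, 8   # widths from the text
SLICE_BITS = 16                                           # data path width


def pack_header(pe, instr, port, err=0):
    """Pack the four fields into 32 bits (field order is an assumption)."""
    assert 0 <= pe < (1 << PE_BITS)
    assert 0 <= instr < (1 << INSTR_BITS)
    assert 0 <= port < (1 << PORT_BITS)
    assert 0 <= err < (1 << ERR_BITS)
    return (pe << 22) | (instr << 10) | (port << 8) | err


def unpack_header(header):
    """Recover (pe, instr, port, err) from a packed header."""
    return (header >> 22) & 0x3FF, (header >> 10) & 0xFFF, (header >> 8) & 0x3, header & 0xFF


def to_slices(header, data=None):
    """Split a packet into 16-bit slices, destination slice first.

    A data signal carries a 64-bit data word (passed here as an integer bit
    pattern); an acknowledge signal is the header alone.
    """
    if data is None:
        bits, width = header, 32
    else:
        bits, width = (header << 64) | data, 96
    return [(bits >> shift) & 0xFFFF for shift in range(width - SLICE_BITS, -1, -SLICE_BITS)]


header = pack_header(pe=5, instr=100, port=1)
print(unpack_header(header))          # (5, 100, 1, 0)
print(len(to_slices(header, 0)))      # 6 slices for a data signal
print(len(to_slices(header)))         # 2 slices for an acknowledge signal
```

With these widths a data signal occupies six slices and an acknowledge signal two, so an unobstructed data signal needs six 50-nsec cycles to cross one link.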
Conflicts can arise when packets travel through the network. A node conflict
occurs when two packets in a node request the same output line simultaneously.
Round robin selection is used to determine which packet advances. A line conflict
occurs when a packet requests a line currently in use. It is resolved by making the
packet wait until the line is idle. A block conflict occurs when a packet
requests an input port already buffering another packet. Block conflicts are
resolved when the destination slice of the buffered packet moves to another node.
At this time, the blocked packet can start to advance. Each slice in the packet
will occupy the location in the buffer held by the equivalent slice in the previous
packet.
The method used to move packets through the network is similar to Virtual
Cut-Through (ref. 7). The difference is that block conflicts are resolved using
partial cut throughs. A packet can advance any time the line is not busy and the
receiving port does not contain a destination slice for another packet. This has its
greatest advantage when the network is lightly loaded. At best, it is about six times
faster than regular packet switching for transmission of data signals. In the worst
case, it is the same as normal packet switching.
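A rough timing model helps show where a factor of about six can come from. The Python sketch below is an assumption-laden illustration, not the simulator's network model: it supposes that regular packet switching buffers all slices of a packet at every hop, that an unblocked cut-through packet advances one slice per 50-nsec cycle in a pipeline, and that no conflicts occur. The function names are hypothetical.

```python
CYCLE_NS = 50   # time to move one slice between nodes (from the text)


def store_and_forward_ns(hops, n_slices):
    """Every hop waits for the whole packet before forwarding it."""
    return hops * n_slices * CYCLE_NS


def cut_through_ns(hops, n_slices):
    """Slices follow the destination slice in a pipeline, with no blocking."""
    return (hops + n_slices - 1) * CYCLE_NS


DATA_SLICES = 6   # 96-bit data signal over 16-bit paths
for hops in (2, 10, 100):
    slow = store_and_forward_ns(hops, DATA_SLICES)
    fast = cut_through_ns(hops, DATA_SLICES)
    print(hops, slow, fast, round(slow / fast, 2))
```

Under these assumptions the ratio grows with path length and approaches the number of slices in a data signal, six, as an upper bound; for short paths the advantage is smaller.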
SIMULATOR 
The simulator is written in Pascal and executes on a Cray X-MP/48. It can
simulate between 1 and 1024 PE in powers of 2. The data flow graph can be
partitioned to use any number of PE between 1 and 1024. A typical simulation takes
30 to 760 Cray CPU sec. The ratio of simulated machine time to Cray time ranges from
1:25,000 to 1:750,000 depending on the number of PE simulated and the data flow
graph.
Besides the ability to change the number of PE simulated, the following machine
attributes can be varied:
1. Number of PE associated with each AM module.
2. Size of the local Array Memory modules.
3. Number and type of functional units.
4. Service time and maximum queue length for all units in the PE.
5. Cycle times for PE and networks.
6. Data path widths between nodes in the networks.
The simulator reads the machine attributes and checks for initial values to
load into AM. It then loads the partitioned data flow graph into instruction
store. A token is placed in the update/matching queue of PE zero. Execution of the
data flow graph commences when its root node is enabled by the token.
During execution of the data flow graph, any of 30 machine opcodes can be
executed. Real results are computed to check the correctness of the graph. The
simulator does not handle computation errors; if one occurs, the simulation
aborts. Three other opcodes are used solely by the AM modules for remote read/write
requests.
All I/O is handled through AM. Input is achieved by placing values in AM
before execution of the data flow graph. Results for output are placed in AM and
extracted after the simulation is complete.
BENCHMARKS 
Initially, ARC3D (ref. 10) and large eddy simulation (LES) (ref. 11) were
considered as potential benchmarks. After the simulator was written, it was
estimated one complete simulation of ARC3D would take 625 hr of Cray X-MP/48 time. At
this point, it was noted that groups of benchmarking kernels already exist. Kernels
representative of CFD work done at Ames Research Center have been assembled for the
Numerical Aerodynamic Simulation (NAS) project (ref. 12). The data flow
characteristics of the kernels were analyzed. Because of their high degree of
parallelism, the matrix multiply and two-dimensional FFT were chosen as benchmarks.
These benchmarks can be expected to use a static data flow machine more efficiently
than ARC3D and LES.
Two data flow graphs exist for each benchmark. A fully unrolled version of the
matrix multiply calculates all inner products for the result matrix in parallel. A
partially unrolled version calculates the inner products for a single column
simultaneously. For the two-dimensional FFT, a fully unrolled version operates on all
dual node pairs in all columns in parallel. A partially unrolled version operates
on all dual node pairs in a single column simultaneously. Because of the symmetry
in the two-dimensional FFT algorithm, machine code is generated to perform forward
FFTs on columns only.
The large data flow graphs combined with human error make the use of code
generators necessary. High degrees of parallelism allow the code generators to
replicate and link many small sections of code. The generators also allow easy
variation in problem size. Output from the generators is partitioned for use in the
simulator.
Partitioning 
The partitioning of instruction nodes among PE is an important and difficult
problem. Utility of the machine is determined by the node assignments. Since
optimal partitioning is difficult, an acceptable solution is an automatic
partitioner that runs fast and produces execution times close to, or better than, a
hand partitioning for the same data flow graph.
Hand partitioning of the benchmarks is done by looking at the overall structure
of the data flow graph and assigning large blocks of replicated code to different
PE. This is accomplished by the code generators assigning nodes as the machine code
is created. In contrast, automated partitioners have no knowledge of the overall
structure.
Two automated partitioners were developed at Ames Research Center. Each parti- 
tioner makes two passes through the data flow graph. On the first pass, a depth 
first traversal is performed to gather information about the bottom of the graph and 
propagate it upward. The second pass is a breadth first traversal using the
information acquired in the first pass to make intelligent decisions about node
assignments. The two partitioners assign nodes to PE at different times. The early
binding partitioner assigns nodes as soon as possible while the late binding
partitioner delays assignment as long as possible.
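The two-pass structure can be sketched in Python as follows. The memorandum does not say what information the first pass gathers or how the second pass chooses a PE, so this sketch substitutes stand-ins: node height (longest path to the bottom of the graph) for the gathered information and a least-loaded rule for the assignment. It assigns each node as soon as it is visited, which resembles early binding only loosely. The function name and the example graph are hypothetical, and the graph is assumed to be acyclic.

```python
from collections import deque


def partition(successors, roots, n_pe):
    """Two-pass partitioning sketch: depth first to gather, breadth first to assign."""
    # Pass 1: depth first traversal; heights propagate upward from the sinks.
    height = {}

    def depth_first(node):
        if node not in height:
            height[node] = 1 + max((depth_first(s) for s in successors[node]), default=-1)
        return height[node]

    for root in roots:
        depth_first(root)

    # Pass 2: breadth first traversal; nodes on longer paths are placed first,
    # each on the PE that currently holds the fewest nodes (stand-in heuristic).
    load = [0] * n_pe
    assignment = {}
    queue = deque(sorted(roots, key=lambda n: -height[n]))
    seen = set(queue)
    while queue:
        node = queue.popleft()
        pe = min(range(n_pe), key=lambda p: load[p])
        assignment[node] = pe
        load[pe] += 1
        for succ in sorted(successors[node], key=lambda n: -height[n]):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return assignment, height


# Hypothetical graph: a and b feed c, and c feeds d.
successors = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
assignment, height = partition(successors, ["a", "b"], n_pe=2)
print(height)      # {'d': 0, 'c': 1, 'a': 2, 'b': 2}
print(assignment)  # {'a': 0, 'b': 1, 'c': 0, 'd': 1}
```

A late binding variant would defer the `assignment[node] = pe` decision until more of the node's neighbors had been seen; that refinement is not attempted here.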
The automated partitioners are relatively fast and work fairly well. The late 
binding partitioner partitioned a 7000 node graph in 17 sec of Cray time. For the 
matrix multiply benchmark, the execution times of the automatically partitioned code 
(late binding) versus hand partitioned code averaged 6.9% slower for the partially 
unrolled version and 41.7% faster for the fully unrolled version. 
SERVICE TIME 
Each PE with its two multipliers and one adder has a potential floating point
capability of 10 MFLOPS. However, if it is not possible to keep the functional
units busy, an execution rate of 10 MFLOPS will not be achieved. Analysis done
prior to simulation indicates the update/matching unit has the greatest influence on
the megaflop rating. Taking into account the service time required for each token
and the average number of destination addresses per instruction, the update/matching
unit cannot supply enabled instructions fast enough to keep the functional units
busy.
The floating point units can accomplish 2 additions and 2 multiplications every
400 nsec. Each arithmetic operation requires a pair of operands. Thus, at least
one pair of operands must be supplied every 100 nsec to maintain full use of the
floating point hardware. Since each token contains at most one operand, the
update/matching unit must process at least one token every 50 nsec to maintain full
use.
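The arithmetic behind the 50-nsec requirement can be checked with a few lines of Python. The unit rates and cycle time are taken from the text; the variable names are chosen here for illustration.

```python
MULTIPLIERS = 2
MULTIPLIER_MFLOPS = 2.5    # per multiplier, from the text
ADDER_MFLOPS = 5.0         # single adder, from the text
CYCLE_NS = 50              # PE cycle time, from the text

peak_mflops = MULTIPLIERS * MULTIPLIER_MFLOPS + ADDER_MFLOPS
ns_per_operation = 1000.0 / peak_mflops      # 1 MFLOPS = one operation per 1000 nsec
ns_per_token = ns_per_operation / 2          # two operands, one per token
cycles_per_token = ns_per_token / CYCLE_NS

print(peak_mflops)        # 10.0
print(ns_per_operation)   # 100.0
print(ns_per_token)       # 50.0
print(cycles_per_token)   # 1.0
```

The result is the 1 cycle/token service time that the following paragraph examines.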
To achieve a service time of 1 cycle/token, the update/matching unit must be
pipelined and the instructions stored in banks of 25-nsec RAM. The 25-nsec RAM is
necessary since the update/matching unit and the fetch unit must access operands in
the same 50-nsec clock cycle. In the first 25 nsec, the update/matching unit reads
the enable and reset counts and stores any data. At the same time, the fetch unit
reads an opcode. In the second 25 nsec, the updated enable count from a previous
token is stored by the update/matching unit and the fetch unit reads an operand.
Service times closer to 2 or 4 cycles/token are more reasonable. The update/ 
matching unit must still be pipelined, but slower and cheaper RAM can be used. As 
the service time increases, the average number of destination addresses per instruc- 
tion becomes more important (demonstrated in tables 1 and 2 based on one PE). 
The benchmarks average between 3 and 3.5 destination addresses per instruction.
This limits the floating point capability to a maximum of 6.7 MFLOPS. Work
has been done at MIT to reduce the number of acknowledge signals used in the data
flow graphs (ref. 13). This lowers the average number of destination addresses per
instruction, reducing the work load on the update/matching unit. However, as the
number of acknowledge signals is lowered, the number of potentially enabled
instructions decreases because of the loss of pipelining in the data flow graph. In
any case, the floating-point units are not fully utilized.
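One way to read the 6.7 MFLOPS figure is sketched below in Python. The model is an inference from the numbers in the text, not a formula stated in the memorandum: it assumes every fired instruction produces one token per destination address, every token costs one update/matching service time, and every instruction is a floating-point operation. The function name is hypothetical.

```python
CYCLE_NS = 50        # PE cycle time, from the text
PEAK_MFLOPS = 10.0   # two multipliers plus one adder, from the text


def matching_limited_mflops(service_cycles, dests_per_instruction):
    """Upper bound on MFLOPS for one PE under the assumptions stated above."""
    ns_per_instruction = service_cycles * CYCLE_NS * dests_per_instruction
    return min(PEAK_MFLOPS, 1000.0 / ns_per_instruction)


for service_cycles in (1, 2, 4):
    for dests in (3, 3.5):
        rate = matching_limited_mflops(service_cycles, dests)
        print(service_cycles, dests, round(rate, 1))
```

With a service time of 1 cycle/token and 3 destination addresses the bound evaluates to about 6.7 MFLOPS, matching the figure quoted above; the other combinations are this model's extrapolations and have not been checked against tables 1 and 2.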
SIMULATION RUNS 
The matrix multiply is simulated by multiplying a 16 by 8 matrix by an 8 by 
4 matrix. The two-dimensional FFT operates on a 16 by 16 array of values. Both 
benchmarks are run using partially unrolled and fully unrolled versions. Table 3 
contains the number of nodes in each version. The benchmarks are simulated using 
1 to 32 PE by powers of 2. For the matrix multiply, hand and automated partitioners 
are used. For the two-dimensional FFT, only automated partitioners are used. Each 
partitioning is run with update/matching service times of 1, 2, and 4 cycles/token.
RESULTS 
Figure 3 gives the speedup curves for the partially unrolled matrix multiply 
when partitioned using hand and automated partitioners. Hand-partitioned code 
performs better until a large number of PE are simulated. The crossover is due to 
the hand partitioner partitioning across all PE and ignoring increased network 
delay. In contrast, the automated partitioners take network delay into
consideration when partitioning. Code generated by the early binding partitioner
averaged 13.6% slower than hand partitioned code. Code generated by the late binding
partitioner averaged 6.9% slower than hand partitioned code.
Figure 4 gives the speedup curves for the fully unrolled matrix multiply when
partitioned using hand and automated partitioners. Automatically partitioned code
performed better than hand partitioned code. Early binding code averaged 34.7%
faster than hand partitioned code. Late binding code averaged 41.7% faster than
hand partitioned code (in one case, it was 56.0% faster).
In both versions of the matrix multiply, there is a kink in the graph for the 
early binding partitioner at eight PE. This is the first point at which multiple AM
modules are simulated. Read requests are delayed in the second AM module for a significant
amount of time while waiting for responses. No attempt is made to make data local 
to a particular AM module. Although the number of array accesses does not change, 
as more AM modules are simulated, the delay queue in each is shorter. 
Figures 5 through 8 show the execution times for each version of the benchmarks
simulated using update/matching unit service times of 1, 2, and 4. For small
numbers of PE, the completion time was proportional to the service time of the
update/matching unit. As the number of PE increases, the effect of service time is
reduced as network delay becomes the dominant factor.
Two other problems that relate directly to CFD performance on a static data 
flow architecture are loop control and array address calculations. Both of these 
affect the execution time of the program. The fully unrolled version of the matrix 
multiply is about 3 times faster than the partially unrolled version for an update/ 
matching unit service time of 4 clock cycles (figs. 5 and 6). This speedup is due 
to the reduction in loop control overhead and location calculations for array 
elements. 
Loop Control 
A time/space tradeoff is involved when a loop is unrolled. Loop unrolling
decreases execution time but increases the data flow graph size. This tradeoff must 
be considered carefully. Loop control grows linearly with the information passed 
into the loop whereas loop unrolling often grows nonlinearly. This is demonstrated 
by the matrix multiply. 
The matrix multiply is an O(N^3) algorithm. For a square order 100 matrix
multiply, there are approximately 2 million nodes in the fully unrolled data flow
graph versus 49 nodes in the fully rolled version. Complete unrolling is
prohibitive for problems of this size. For smaller problems such as the 16 by 8 times
8 by 4, unrolling is a practical approach to execution. In this case, the partially
unrolled version and the fully unrolled version almost balance, 1334 versus
1366 nodes, respectively. However, the fully unrolled version has megaflops ratings
2 to 3.5 times higher than the partially unrolled version.
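The "approximately 2 million" figure can be reproduced by counting only the floating-point nodes of a fully unrolled multiply, as in this Python sketch. The function name is hypothetical, and the count deliberately omits read, write, and any other nodes, which the memorandum does not itemize.

```python
def floating_point_nodes(rows, inner, cols):
    """Multiply and add nodes in a fully unrolled (rows x inner) by (inner x cols) product."""
    multiplies = rows * cols * inner          # one per term of each inner product
    adds = rows * cols * (inner - 1)          # summing `inner` terms takes inner - 1 adds
    return multiplies + adds


print(floating_point_nodes(100, 100, 100))   # 1990000
print(floating_point_nodes(16, 8, 4))        # 960
```

For the 16 by 8 times 8 by 4 benchmark this accounts for 960 of the 1366 nodes reported for the fully unrolled graph; the remainder would be nodes of other kinds.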
The node overhead for loop control can be calculated in the following manner.
Let k be the number of items requiring a merge/switch pair for loop control
(ref. 1); then the maximum number of overhead instructions is
$$3k + \sum_{i=0}^{\lceil \log_4 k \rceil - 1} 4^i + 1$$
where 3k represents the merge/switch pairs and fanout nodes. The summation
represents an upper bound on the fanout of the conditional. Log4 is used because
there is a maximum fanout of 4. The conditional is represented by 1. Therefore, loop
control is O(k), since it can be shown that the summation reduces to approximately
k/3 by noting that it can be rewritten as
$$\sum_{i=0}^{\lceil \log_4 k \rceil - 1} 4^i = \frac{4^{\lceil \log_4 k \rceil} - 1}{3} \approx \frac{k}{3}$$
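The overhead bound can be evaluated numerically with the Python sketch below. It rests on an interpretation rather than on a formula quoted from the memorandum: the conditional is assumed to be distributed through a tree of fanout nodes with at most four outputs each, so the tree contributes a geometric sum of powers of 4. The function names are hypothetical.

```python
def fanout_tree_nodes(k, max_fanout=4):
    """Nodes in a fanout tree deep enough to reach k destinations (assumed model)."""
    levels = 0
    while max_fanout ** levels < k:
        levels += 1
    # Geometric series 1 + 4 + ... + 4**(levels - 1).
    return (max_fanout ** levels - 1) // (max_fanout - 1)


def loop_control_overhead(k):
    """3k for merge/switch pairs and fanout nodes, the fanout tree, 1 for the conditional."""
    return 3 * k + fanout_tree_nodes(k) + 1


for k in (1, 4, 16, 64):
    print(k, fanout_tree_nodes(k), loop_control_overhead(k))
```

For k equal to a power of 4 the tree term is exactly (k - 1)/3, so the total stays close to 10k/3 and grows linearly in k.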
Given the overhead for loop control, one wants to keep the ratio of productive
work to overhead as high as possible. The fully rolled matrix multiply is a good
example where the number of nodes involved in loop control overwhelms the desired
computation. Loop control accounts for 65% of the nodes in the data flow graph.
Only two nodes in the graph are floating-point instructions. The ratio of
floating-point operations to overhead is 1:7.5.
Loops should be unrolled where feasible to reduce overhead (ref. 14). How
loops are unrolled deserves careful consideration. In general, areas of heaviest
computation are the first to be unrolled. But it is unlikely they can be
completely unrolled for real CFD problems. If loop invariant values did not need the
control structure around them, the amount of loop overhead could be cut down.
Results presented in reference 15 show the execution time decreasing by a third when
loop invariants are held on tokens which are not consumed when the instruction
fires.
Array Address Calculations
Another problem facing CFD applications is the software calculation of array
element addresses. In many cases, the overhead for address calculations overshadows
the computation in which the elements are used. If the computation is unrolled, the
addresses are calculated at compile time, thereby producing a significant saving in
execution time.
In the matrix multiply case, the fully expanded version calculates all
locations at compile time. In the fully rolled version, array element locations are
calculated at run time. Each location calculation involves the execution of four
nodes. For the square order 100 case, the fully unrolled version executes a total
of 2 million nodes. The fully rolled version would execute over 8 million nodes for
location calculations alone.
Software calculations are eliminated if each memory module has an array map and
addresses are calculated in hardware. This is possible because the locations of
arrays are known at compile time, allowing the generation of one map which is
replicated across memory modules. When a node asks for an element of a particular
array, the address is calculated in hardware inside the AM module.
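The "over 8 million" figure for the order 100 case follows from a short count, sketched here in Python. The four nodes per location calculation come from the text; the assumptions that each multiply reads two array elements and that each result element is written once are made for this illustration, as is the function name.

```python
NODES_PER_LOCATION = 4   # nodes executed per location calculation (from the text)


def address_calculation_nodes(n):
    """Nodes executed for run-time location calculations in an order-n multiply."""
    operand_reads = 2 * n ** 3     # assumed: two array reads per multiply
    result_writes = n ** 2         # assumed: one write per result element
    return NODES_PER_LOCATION * (operand_reads + result_writes)


print(address_calculation_nodes(100))   # 8040000
```

Under these assumptions the address arithmetic alone costs about four times as many node executions as the roughly 2 million nodes of the fully unrolled graph.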
CONCLUSION 
Static data flow shows good speedup curves for the benchmarks over a limited
number of PE. However, von Neumann multiprocessors have demonstrated similar
speedups over the same numbers of PE. The expected gain from using a data flow
architecture did not appear. The simulations proved sensitive to the partitioning
method used and the service time for the update/matching unit. The late binding
partitioner achieved an average speedup of 41.7% over hand partitioning for the
fully unrolled version of the matrix multiply. For small numbers of PE, doubling
the update/matching service time doubled the completion time. As the number of PE
increases, the network delay becomes the dominant factor. A service time of one
cycle per token will be expensive to implement and does not guarantee full
utilization of the 10 MFLOPS floating-point capability. The best sustained megaflops
rates were less than 50% of peak speed and trailed off as the number of processors
increased, similar to the observations for von Neumann machines.
Loop control overhead and array address calculation are important issues in the
execution time of CFD applications. Both cause degradation of the machine's
performance. The partially unrolled two-dimensional FFT had an average execution
time 14% longer than the fully unrolled version. The partially unrolled matrix
multiply had an average execution time 198% longer than the fully unrolled version.
However, for real CFD problems it is not feasible to fully unroll loops because of
the size of the data flow graph. Therefore, time/space tradeoffs must be carefully
considered when coding and compiling real CFD applications.
The static data flow machine studied here does not enjoy a significant advan- 
tage over current von Neumann multiprocessors. Using VLSI technology, the update/ 
matching unit and networks can be improved. The automatic partitioners show that 
good speedup can be achieved without human intervention. Using locality of data may 
further increase the speedup. Modified loop control will reduce the overhead
associated with loop invariants. Hardware calculation of array addresses will
decrease the execution time. As the hardware and software mature, static data flow
may become a more attractive alternative in the future.
REFERENCES 
1. Dennis, J. B.: Data Flow Supercomputers. Computer, vol. 13, no. 11, Nov.
1980, pp. 48-56.
2. Dennis, J. B.; Gao, Guang-Rong; and Todd, K. W.: A Data Flow Supercomputer. 
Computation Structures Group Memo 213, MIT Laboratory for Computer Science, 
Cambridge, MA, March 1982. 
3. Arvind; and Culler, D. E.: Tagged Token Dataflow Architecture. TR 229, Mass.
Inst. of Tech., MIT Laboratory for Computer Science, Cambridge, MA, July
1983.
4. Gurd, J.; and Watson, I.: Preliminary Evaluation of a Prototype Dataflow 
Computer. Information Processing 83, IFIP, Elsevier Science Publishers 
B.V., North-Holland Publishing Co. (Amsterdam), 1983, pp. 545-551. 
5. Yuba, T.; Shimada, T.; Hiraki, K.; and Kashiwagi, H.: Sigma-1: A Dataflow 
Computer for Scientific Computation. Computer Physics Communications, 
vol. 37, Elsevier Science Publishers B. V., North-Holland Publishing Co. 
(Amsterdam), 1985, pp. 141-148. 
6. Gostelow, K. P.; and Thomas, R. E.: Performance of a Simulated Dataflow
Computer. IEEE Transactions on Computers, vol. C-29, no. 10, Oct. 1980,
pp. 905-919.
7. Kermani, P.; and Kleinrock, L.: Virtual Cut-Through: A New Computer
Communication Switching Technique. Computer Networks, vol. 3, Elsevier Science
Publishers, North-Holland Publishing Co. (Amsterdam), 1979, pp. 267-286.
8. Siegel, H. J.: Interconnection Networks for Large-Scale Parallel Processing.
Lexington Books, D.C. Heath and Company, 1985.
9. Dennis, J.; and Gao, Guang-Rong: Maximum Pipelining of Array Operations on 
Static Data Flow Machine. Computation Structures Group Memo 233-1, Mass. 
Inst. of Tech. Laboratory for Computer Science, Cambridge, MA, Sept. 1984. 
10. Pulliam, T. H.: Solution Method in Computational Fluid Dynamics. Lecture 
Notes, NASA Ames Research Center, Jan. 1986. 
11. Rogallo, R. S.: Numerical Experiments in Homogeneous Turbulence. NASA 
TM 81315, 1981. 
12. Bailey, D. H.; and Barton, J. T.: The NAS Kernel Benchmark Program. NASA 
TM 86711, 1985. 
13. Ackerman, W. B.: Program and Machine Structure for Incredible Fast
Computation. Preliminary Draft, Computation Structures Group, MIT Laboratory for
Computer Science, Cambridge, MA.
14. Ackerman, W. B.: Efficient Implementation of Applicative Languages. Ph.D. 
Thesis, Mass. Inst. of Tech., 1984. 
15. Shimada, T.; Hiraki, K.; Nishida, K.; and Sekiguchi, S.: Evaluation of a 
Prototype Data Flow Processor of the Sigma-1 for Scientific Computations. 
Proc. 13th Int. Symp. on Comp. Arch., 1986, pp. 226-234. 
TABLE 1.- THREE ADDRESSES/INSTRUCTION
    4      600 ns      1.67
    2      300 ns      3.33
    1      150 ns      6.67
TABLE 2.- THREE AND ONE-HALF ADDRESSES/INSTRUCTION
    4      700 ns      1.43
    2      350 ns      2.86
    1      175 ns      5.71
TABLE 3.- NODE AND FLOATING POINT OPERATION COUNTS
                    MATRIX MULTIPLY             1/4 2D FFT
                  FULLY      PARTIALLY      FULLY      PARTIALLY
                  UNROLLED   UNROLLED       UNROLLED   UNROLLED
NO. NODES
FL. PT. OPS.
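The numeric entries of Tables 1 and 2 appear to follow a simple rule, which the Python sketch below checks. It assumes that the first column is a service time in cycles, that the second column equals addresses per instruction times service time times a 50 ns cycle, and that the third column is the reciprocal of that time in millions per second. The 50 ns cycle and this reading of the columns are inferences from the tabulated numbers, not definitions given with the tables, and the function names are hypothetical.

```python
CYCLE_NS = 50.0  # assumed cycle time, inferred from the tabulated values


def time_ns(addresses_per_instruction, service_cycles, cycle_ns=CYCLE_NS):
    """Time in ns under the assumed rule: addresses x cycles x cycle time."""
    return addresses_per_instruction * service_cycles * cycle_ns


def rate_millions_per_second(t_ns):
    """Reciprocal of a time given in ns, expressed in millions per second."""
    return 1000.0 / t_ns


def table_rows(addresses_per_instruction, service_times=(4, 2, 1)):
    """Rebuild (service time, time in ns, rate) rows for one table."""
    rows = []
    for cycles in service_times:
        t = time_ns(addresses_per_instruction, cycles)
        rows.append((cycles, t, round(rate_millions_per_second(t), 2)))
    return rows


table1 = table_rows(3.0)  # three addresses/instruction
table2 = table_rows(3.5)  # three and one-half addresses/instruction

for row in table1 + table2:
    print(row)
```

Under these assumptions the sketch reproduces every numeric entry of both tables, which supports the inferred rule but does not confirm it.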
Figure 1.- Static data flow machine architecture. [Diagram labels: PE network, AM network.]
Figure 2.- Processing element block diagram. [Diagram labels: send unit, token gen. unit, rec. unit, instruction store, update unit, enabled instructions, fetch unit.]
Figure 3.- Matrix multiply (partially unrolled, all partitioners,
service time = 4). [Plot: horizontal axis, number of PE's (1 to 32); vertical axis
scaled 1 to 32; curves for hand part., early binding, and late binding partitioners.]
Figure 4.- Matrix multiply (fully unrolled, all partitioners, service time = 4).
[Plot: horizontal axis, number of PE's (1 to 32); vertical axis scaled 1 to 32;
curves for hand part., early binding, and late binding partitioners.]
Figure 5.- Matrix multiply (partially unrolled, late binding partitioner,
various service times). [Plot: horizontal axis, number of PE's (1 to 32); vertical
axis scaled 0 to 6000; curves for update/fetch service times of 1/2, 2/3, and 4/3.]
Figure 6.- Matrix multiply (fully unrolled, late binding partitioner,
various service times). [Plot: horizontal axis, number of PE's (1 to 32); vertical
axis ticks up to 6000; curves for update/fetch service times of 1/2, 2/3, and 4/3.]
Figure 7.- Two dimensional FFT (partially unrolled, late binding partitioner,
various service times). [Plot: horizontal axis, number of PE's (1 to 32); vertical
axis scaled 0 to 6000; curves for update/fetch service times of 1/2, 2/3, and 4/3.]
Figure 8.- Two dimensional FFT (fully unrolled, late binding partitioner,
various service times). [Plot: horizontal axis, number of PE's (1 to 32); vertical
axis scaled 0 to 6000; curves for update/fetch service times of 1/2, 2/3, and 4/3.]
Report Documentation Page
1. Report No.: NASA TM-89434
2. Government Accession No.:
3. Recipient's Catalog No.:
4. Title and Subtitle: A Static Data Flow Simulation Study at Ames Research Center
5. Report Date: June 1987
6. Performing Organization Code:
7. Author(s): Eric Barszcz and Lauri S. Howard
8. Performing Organization Report No.: A-87129
9. Performing Organization Name and Address: Ames Research Center, Moffett Field, CA 94035
10. Work Unit No.: 505-01-00
11. Contract or Grant No.:
12. Sponsoring Agency Name and Address: National Aeronautics and Space Administration, Washington, DC 20546
13. Type of Report and Period Covered: Technical Memorandum
14. Sponsoring Agency Code:
15. Supplementary Notes: Point of Contact: Eric Barszcz, Ames Research Center, M/S 233-14, Moffett Field, CA 94035, (415) 694-6014 or FTS 464-6014
16. Abstract:
Demands in computational power, particularly in the area of computational
fluid dynamics (CFD), have led NASA Ames Research Center to study advanced
computer architectures. One architecture being studied is the static data
flow architecture based on research done by Jack B. Dennis at Massachusetts
Institute of Technology. To improve understanding of this architecture, a
static data flow simulator, written in Pascal, has been implemented for use on
a Cray X-MP/48. A matrix multiply and a two-dimensional fast Fourier transform
(FFT), two algorithms used in CFD work at Ames Research Center, have been
run on the simulator. Execution times can vary by a factor of more than 2
depending on the partitioning method used to assign instructions to processing
elements. Service time for matching tokens has proved to be a major bottleneck.
Loop control and array address calculation overhead can double the
execution time. The best sustained MFLOPS rates were less than 50% of the
maximum capability of the machine.
17. Key Words (Suggested by Author(s)): Static data flow; Computational fluid dynamics; Loop control; Array address calculation; Simulator
18. Distribution Statement: Unclassified-Unlimited; Subject Category - 62
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of pages: 24
22. Price: A02
For sale by the National Technical Information Service, Springfield, Virginia 22161
NASA FORM 1626 OCT 86
