An architecture for a loosely-coupled parallel processor by Keller, Robert M. & Patil, Suhas
AN ARCHITECTURE FOR A LOOSELY-COUPLED 
PARALLEL PROCESSOR
Robert M. Keller 
Gary Lindstrom 
Suhas Patil
UUCS - 78 - 105
October 1978
Department of Computer Science
University of Utah
Salt Lake City, Utah 84112
This work was supported in part by grants DCR-74-21822, MCS-77-09269 and
MCS-78-03832 from the National Science Foundation.
Abstract: An architecture for a large (e.g. 1000 processor) parallel
computer is presented. The processors are loosely-coupled, in the sense
that communication among them is fully asynchronous, and each processor
is generally not unduly delayed by any immediate need for specific data
values. The network supporting this communication is tree shaped,
with the individual processors connected at leaf nodes. The machine
executes a graphical version of applicative Lisp. The program
execution model is demand-driven, with a special deferred interpretation
for dotted pair evaluation, termed "lenient cons". Opportunities for
concurrency arise in the parallel evaluation of arguments to strict
operators, i.e. those known to require evaluation of their full set of
arguments. Such opportunities are exploited by exporting function
application tasks to neighboring processor nodes in the tree, subject to
a hierarchical notion of load balancing. Locality of task allocation and
communication is a key objective of the machine. An integrated design
toward that end is presented, combining language issues, firm semantic
foundations, and anticipated hardware technologies.
Keywords and phrases: applicative programming, architecture,
concurrency, data flow, demand-driven, lenient cons, Lisp, locality,
loosely-coupled, packet switching, parallelism, reduction machine,
tagged architecture.
CR categories: 6.21, 4.22, 4.12, 4.32
CONTENTS
1. Introduction
2. Language Issues
3. Basic Architecture
4. Communication Network
5. Locality
6. Information Flow
7. Machine Language
8. Program Execution
9. Task Evaluation
10. Word Format
11. Representative Operators
12. Function Closures and the Operator apply
13. forward Chaining
14. Processor Architecture
15. Load Balancing
16. Comparison with Related Machines
17. Conclusions and Future Research
Figures:
1. Form of the physical architecture of the loosely-coupled
   parallel processor
2. Graph representation and initial datablock for
   sample main program M
3. Graph representation and codeblock representation
   of the consequent of a production
4. Tree summation example
5. One possible snapshot of the program of Figure 4
6. Overall task processing flow
7. Evaluate/propagate for ordinary task type
8. Evaluate/propagate for invoke task type
9. Distribute/notify processing
10. Evaluate/propagate for car, cdr task types
11. Illustration of forward chaining
12. Simple example of function closures
References
1. INTRODUCTION

The architecture of highly-parallel machines has received increased
attention from researchers over the past decade. At first, because of
their novelty, workers were content with proposing elaborate machine
architectures without giving great consideration to how such machines
would ultimately be programmed to exploit their available computational
power. Experience with Illiac IV, Star-100, etc. has shown this to be
a mistake. Indications are that programming languages deserve consideration
at the earliest stages of architectural conception. Included in such
considerations are issues such as storage management and task management.

This paper describes considerations for what might be called a
loosely-coupled architecture. This term was used in [Arden and Berenbaum 75]
in discussing memory management trade-offs in multi-processor systems.
We use it to denote a machine which potentially incorporates a large
number (say 1000) of processors which can function independently to a
large extent, but which can effectively communicate with one another when
necessary. Furthermore, we require that the computations being supported
are not tied to the structure of the machine at the program level. A
corollary of this architectural concept is that the system is easily
expandable, there being no logical dependence on the number of processors.
Such expandability is further enhanced by the particular physical organization
to be described. Additionally, through the use of a packet switching
intercommunication network, the system can be seen to have many of the
features attending reconfigurable systems (cf. [Reddi and Feustel 78]).

The architecture presented here was influenced by work reported
in [Dennis and Misunas 74] and [Arvind and Gostelow 77] on data flow
machines. Our machine architecture attempts to bring internal communication
costs within the machine to a more manageable level by taking advantage
of locality of reference. The communication network in our machine plays
the role of the arbitration and distribution network of the Dennis
data-flow machine. However, the processing units which assemble instructions
and initiate information flow are more like the processors of Arvind and
Gostelow. Even though the architecture of our machine has a tree-like
structure, it is not a "recursive architecture" in the sense of [Davis 78].
Our system has in common with those cited in this paragraph the desire
to integrate architectural and language considerations. This is one
of the ways it differs from superficially similar systems, such as Cm*
[Swan, et al. 77]. These similarities and differences will be further
reviewed in Section 16.

Our architecture is currently in the development stage. We present
in this paper some of the major philosophical decisions which are influencing
us, along with an execution model for a subset of the ultimate machine
language.
2. LANGUAGE ISSUES

Heretofore, research on highly-parallel machines seems to have
predominately emphasized numerical, rather than symbolic, computations. We
feel that further investigation of the latter is merited. The possibility of
such applications has been alluded to before, e.g. [Hearn 76]. Presently we
are choosing Lisp as a target language for our architecture. We would
like to present arguments in further defense of this choice. The first is that
there is a substantial community of Lisp users who are seeking the higher
computing speeds which a parallel processing computer can give. We
believe that the problem of the acceptance of a new architecture
will be substantially solved if Lisp can be supported on the computer,
since that choice would not involve acceptance of a new language.

Secondly, we feel that Lisp, possibly with some advice on programming
style, can be much better matched to the power of a loosely-coupled
system than other languages. For example, extensive transformation of
Fortran programs is done to make effective use of the Illiac IV, e.g.
[Lamport 74]. Consequently, the connection between object and source
programs is obscured, and debugging is affected adversely. We feel
that the object language of our machine can be made reasonably close to
a usable subset of Lisp.

Furthermore, Lisp, with some minor modifications, such as lenient cons
discussed later (cf. [Friedman and Wise 76], [Henderson and Morris 76]), seems
to include all opportunities for exploitation of concurrency that proposed data
flow languages do. It also seems to provide more, e.g. concurrent operations
on tree or graph data structures during the latter's creation,
and natural ways for dealing with conceptually infinite structures.

Finally, even if full Lisp proves to be too difficult to support
efficiently, in our attempt to design a machine for it, we will
gain valuable experience about the inherent difficulties in supporting
such languages on a loosely-coupled computer.

It may seem that catering to Lisp would have the effect of excluding
most of the potential users of other data-flow machines, e.g. those
interested in large numerical computations, as users of our machine.
It is our hope that such users will approach our design with an open mind.
We believe, for several reasons, that our machine can compete with others
in the numerical computation domain. First, although our evaluator is
different, other machines are likely to incur very similar mechanization
problems, making the execution speeds similar for the same underlying
computation, independent of source language used. Secondly, numerical
computations, e.g. large Fortran programs, can be mechanically translated
into Lisp. There are known case studies, e.g. [Fateman 73], where the
Lisp version actually runs faster, even when iteration is replaced with
recursion.
3. BASIC ARCHITECTURE

Figure 1 shows the physical arrangement of components in our machine.
The internal nodes of the tree structure are bi-directional communication
units, thus combining the attributes of the arbiter and distribution
units of the Dennis machine along with additional balancing functions.
Processing units are attached to the machine as leaf nodes. The leaf
nodes are not necessarily equidistant from the root node of the tree.
One might expect, for example, special-purpose units, of which there
are relatively few, to be closer to the root node, for enhanced accessibility
and utilization. Although the figure shows a binary tree, and the discussion
in this paper makes that assumption for simplicity, technology considerations
suggest that a 4-ary or 8-ary tree might be more appropriate.

A general processing unit is roughly the size of a conventional
micro-computer, but its architecture is substantially different. It is
able to carry out local computation, particularly with respect to assembly
and dissemination of information, and to initiate actions for fetching
information from other nodes of the tree. It will be able to execute single
program tasks, and create tasks in response to the execution of invoke
(procedure application) operations, which may then be executed either in
the local processing unit or in another processing unit.

The primary memory of the system is distributed among the processing
units. Each processing unit has immediate access to that segment of
memory located within it. It also has access, through the communication
network, to the segments of memory located at other processing units.
Even though the memory is distributed among the processing units,
there is only one unified logical address space. Given the address of a
datum, any node in the machine is able to logically access it directly.
The internal nodes of the communication network are responsible for any
required physical routing of addresses and data. Access to auxiliary
memory and other forms of external communication take place through
special-purpose leaf processors.
4. COMMUNICATION NETWORK

The communication network is designed to help the machine to take
advantage of locality of information flow, thereby reducing communication
costs which often tend to be high in data-flow oriented machines. It is
also responsible for distributing the computing load among available
processing units.

In the data-flow machine of Dennis, the arbitration and distribution
networks are disjoint, and any piece of information which needs to be sent
from one instruction cell to another needs to traverse the entire depth of
these networks, even if the cells are physically close neighbors. By combining
the arbitration and distribution functions, we can cut down the distance
information needs to travel in such cases.

In our machine, information first travels up the tree towards the
root node until it comes to a node from which the destination cell is
reachable by going down the tree, then it proceeds down the tree until it
finally reaches the desired destination cell. Thus, for sending or receiving
information from neighboring cells, it is not necessary for the information
to travel the entire depth of the tree. Relatively local data-flow therefore
takes less time and improves the overall communication cost of the computation.
Furthermore, another important consequence of combining the arbitration and
distribution networks is that the traffic congestion at the narrow ends of
these networks is reduced, enabling the communication network to handle a
higher volume of data.

A second function of the communication network is to provide a reasonably
balanced distribution of the computing load. Such a function is not required
in the Dennis machine, as the latter does not attempt to allocate tasks
dynamically (i.e. cell addresses are fixed at compile time). Each node of
our communication network periodically obtains monitoring signals from
its subordinates, which indicate their current utilizations. When such
signals indicate a sufficiently unbalanced state, the node can cause the
transfer of uninitiated tasks from one subtree to the other (see Section
15).
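
As an illustration of the up-then-down routing described in this section, the
following Python sketch computes the path a packet would take between two
leaf processors. It is only a sketch under stated assumptions: the tree is
taken to be a complete binary tree, nodes are numbered heap-style (root = 1,
children of k are 2k and 2k+1), and the names route and hops are hypothetical;
none of this numbering is prescribed by the design.

```python
def route(src_leaf, dst_leaf, depth):
    """Path of node numbers from one leaf to another.

    Assumption: complete binary tree of the given depth, heap numbering,
    with leaf i stored at node 2**depth + i.
    """
    a = (1 << depth) + src_leaf
    b = (1 << depth) + dst_leaf
    up, down = [a], [b]
    # Climb from both ends until the two climbs meet at the lowest node
    # from which the destination is reachable by going down.
    while a != b:
        a //= 2
        b //= 2
        up.append(a)
        down.append(b)
    # Upward leg, then the downward leg reversed (meeting node not repeated).
    return up + down[-2::-1]


def hops(src_leaf, dst_leaf, depth):
    """Number of links traversed by the packet."""
    return len(route(src_leaf, dst_leaf, depth)) - 1


if __name__ == "__main__":
    depth = 3  # 8 leaf processors
    print(route(0, 1, depth), hops(0, 1, depth))  # neighboring leaves
    print(route(0, 7, depth), hops(0, 7, depth))  # opposite sides of the root
```

Under these assumptions, neighboring leaves exchange information in 2 hops,
whereas only leaves in different halves of the tree pay the full 2 x depth
hops that every transfer would incur if it had to traverse the entire depth
of the network.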
5. LOCALITY

One of the most important concepts of our architecture is to improve
performance by exploiting locality of information flow. Locality of
reference is an established concept for program execution, which should
therefore be exploitable within data-flow computations. Locality will
be enhanced by the fact that functions are apt to reference their arguments
repeatedly. Secondly, repeated global references to the same data will
become localized by a caching effect which results from the implementation
of such references. The latter will be further discussed in Section 12,
dealing with the apply operator.

If computations which interact heavily with one another are allocated
space in such a way that they are a shorter average distance apart in
the nodes of the communication network, the overall time spent in information
flow will be reduced. It is important to note that even if it is not
possible to allocate space for a new computation in the address space
of the same leaf, the correctness of the overall computation will be
maintained, even though the speed of the computation may be degraded.
This is a consequence of the uniformly accessible address space.

In designing a highly-parallel machine, one must be careful that costs
involved in creating and communicating with new tasks do not outweigh
the speed advantage gained from overlapped execution of these tasks.
Consequently, our design prescribes that all computation local to a procedure
body (i.e. exclusive of calls to other procedures) will usually be done
within one processing unit. Hence, the global structure does not seek
gains from parallelism on the level of, say, an arithmetic expression
(although this could be done within the processing unit itself if desired),
but rather from inter-procedure concurrency.

Another anticipated effect which will contribute to locality might
be called the seeding effect. As shall be seen, when a task A in execution
creates a second task B, the latter may be allocated its storage in any
of the processing units in which there is sufficient space. Since B
may cause the creation of other tasks C1, C2, ..., Cn, locality is
enhanced if the storage for the latter is allocated in processing units
near to that of B in the tree. Hence, even if B is a long distance from A,
thus incurring a major communication cost between the two, this cost may
be balanced out by the lower costs of communicating between B and
C1, C2, ..., Cn. Hence, this seeding effect creates a tradeoff in
resolving a choice of how far away a created task should be placed. It
also demonstrates the possibility of a certain amount of re-localization
in recovering from bad task-placement decisions by the system. For
example, even if B is placed in a congested area, the storage from
completing tasks near B can be reclaimed to provide more space for
6. INFORMATION FLOW

The characterization of information flow within the machine is very
dependent on the conceptual level being considered. For example, at the
task level, we are concerned with the flow of operands between tasks.
In particular, our system permits demand driven computation at this level.
In contrast, the machines of Dennis, Arvind and Gostelow, and Davis are all
data-driven machines, in that an instruction never asks for data to be sent
to it. Instead, it waits for data to be sent to it, and when all pieces of
data are received, it initiates computation whose results are then sent to
all other designated instructions. In the demand-driven scheme, a procedure
may actively seek additional pieces of data after it has demanded and
received some initial pieces of data. This topic will be further discussed
in subsequent sections.

At the communication network level, we find the information flow
separated into the flow of tasks (which are invoke instructions), operands
(single data words), and blocks (multiple data words). All such pieces
of information are accompanied by additional routing information in the
form of destination addresses, etc. All information is transmitted through
the communication network by packet switching (or store-and-forward)
as opposed to line switching. The latter type of switching is not used
because of the potential congestion incurred by tying up long paths through
the network.

A node of the communication network communicates to its parent through a
traditional form of handshaking. However, for block transfers, a burst mode
of communication is used in which the handshaking occurs only before and





7. MACHINE LANGUAGE

Our machine executes a compiled version of Lisp as its machine language.
We avoid syntactic issues by using a parallel program graph, such as
described in [Keller 77], instead of the conventional list representation
of Lisp programs. For sake of definiteness, we refer to the graphical
language as Flow-Graph Lisp (FGL). FGL allows us to clearly display
the data flow between operators and thus potential concurrency within programs.

The equivalent of procedure calls, including recursive ones, is
provided in FGL through graph productions, which specify how a
programmer-defined operator (the antecedent of the production) is to be
replaced by a program graph (the consequent of the production).

FGL also supports lenient cons, which allows the machine to exploit
concurrency which it could not with conventional strict cons [Friedman and
Wise 78]. For the current presentation, iteration is implemented by
recursion, in the manner of [McCarthy 63]. This automatically gives the
same concurrency-detection effect as "look-ahead" processors, which
"unfold" iterations to achieve concurrency [Keller 75].

For sake of this presentation, let us suppose that data structures
are trees, with the integers and nil as atoms. Boolean values may be
implemented by interpreting nil as false, and any non-nil value as
true. The program consists of a network of operators which are functions
on trees. For simplicity, we do not discuss input of trees. Rather, we
assume them to be resident at the beginning of the computation. Our trees
are represented using an appropriate network of cons operators and atoms.
In summary, the program and all of its data are represented as one
network in the machine, in a manner not too different from conventional
representations of graphs in a linearly-addressable memory.

To cause a result to be printed, a demand is generated at some print
node in the network. This causes propagation of the demand to the operator
feeding the print, which in turn eventually causes the value of that operator
to be evaluated and printed.

Evaluation consists of a combination of transmutations to the graph and
operations which produce new values from others. In this sense, we have a
reduction machine a la [Berkling 75], executing a reduction language a la
[Backus 73]. By using graphs rather than strings, we can avoid much of the
combinatorial explosion which takes place in purely string-oriented machines.
Figures 2, 3, and 4 give examples of programs in FGL. In Figure 2,
there is a main program M. M calls a recursive procedure g, the graph of
which is presented in Figure 3. In each figure we give the graph
representation and the corresponding "code block" representation (see
Section 8). The parenthetic labels on the graph indicate the correspondence
between the two. Intuitively, g(n) "computes" the infinite sequence
    n  n+1  n+2  n+3  ...
In the context of the main program, the value printed is the third element
(caddr) of the sequence with n = 0.
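
To make the role of lenient cons in this example concrete, the following
Python sketch defers evaluation of both components of a pair until they are
demanded, so that the conceptually infinite sequence produced by g can be
built and only its third element evaluated. This is a minimal sketch under
assumptions: the class LenientCons and the definition of g as
cons(n, g(n+1)) are our reading of the intuitive description above, not a
transcription of Figure 3, and sequential Python thunks stand in for the
machine's demand propagation.

```python
class LenientCons:
    """A pair whose components are evaluated only when first demanded."""

    def __init__(self, car_thunk, cdr_thunk):
        self._thunks = [car_thunk, cdr_thunk]  # deferred computations
        self._values = [None, None]            # filled in once demanded
        self._ready = [False, False]           # analogous to a ready flag

    def _demand(self, i):
        if not self._ready[i]:
            self._values[i] = self._thunks[i]()
            self._ready[i] = True
            self._thunks[i] = None  # the code is replaced by its value
        return self._values[i]

    def car(self):
        return self._demand(0)

    def cdr(self):
        return self._demand(1)


def g(n):
    # Assumed definition: the sequence n, n+1, n+2, ... as a lenient list.
    return LenientCons(lambda: n, lambda: g(n + 1))


def caddr(pair):
    # Third element: car of the cdr of the cdr.
    return pair.cdr().cdr().car()


if __name__ == "__main__":
    print(caddr(g(0)))  # prints 2
```

With a strict cons, the call g(0) would never terminate, since each pair
would require its cdr, g(n+1), to be fully evaluated before being built.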
A second program, which sums a tree of integers, is shown in Figure 4.
This example uses a strict operator, add, to cause the creation of instances
of operators which can be evaluated concurrently. Figure 5 shows a possible
snapshot of the program during its application to a specific tree.
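
The opportunity for concurrency offered by a strict operator can be sketched
as follows in Python. Because add requires both of its arguments, the two
recursive summations may proceed at the same time. The sketch makes several
assumptions of its own: trees are nested 2-tuples with integer leaves, one
thread is started per internal node purely for illustration, and the function
name tree_sum is hypothetical; the machine itself would instead export invoke
tasks to other processing units.

```python
import threading


def tree_sum(tree):
    """Sum the integer leaves of a binary tree given as nested 2-tuples."""
    if isinstance(tree, int):
        return tree  # an atom is its own sum

    left, right = tree
    box = {}

    def sum_left():
        box["left"] = tree_sum(left)

    # Both arguments of the strict add are needed, so evaluate the left
    # subtree in another thread while this thread evaluates the right one.
    worker = threading.Thread(target=sum_left)
    worker.start()
    right_sum = tree_sum(right)
    worker.join()
    return box["left"] + right_sum


if __name__ == "__main__":
    print(tree_sum(((1, 2), (3, (4, 5)))))  # prints 15
```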
In the next sections we describe, in more detail, program storage, task
execution, typical operators, graph expansion via the special invoke operator,
and forward chaining, which is a key idea in implementing lenient cons and our
particular form of procedures. We do not discuss storage reclamation here,
as it is an issue still under investigation.
8. PROGRAM EXECUTION

All storage is allocated in blocks. Blocks make storage management
more efficient, and are consistent with trying to keep the locality
of a computation contained within one processing unit. A block is either
a data block or a code block. The words of a data block are initially code
and literals. The former gradually get changed to data during execution.
A code block is copied as the source of initial code to be stored in a
newly allocated data block. The contents of a code block form a linear
representation of an FGL program graph.

The copying of code blocks may be contrasted with approaches such
as that in [Patil 67], which interpret a pure code block without copying.
The approach taken here is more effective in keeping references local to
a processing unit. It also reduces the amount of word fetching required
during actual task processing.

The words in a data block correspond roughly to data values which
may eventually appear on the output arcs of operator nodes in the
program graph. Initially however, instead of containing data, a word
contains the instruction code representation of the corresponding
operator, along with the local addresses of words corresponding to its input
arcs, i.e. the sources of its operands. We assume here for simplicity
that each operator has only one output arc, although such arcs may
fan out as necessary.
In addition to specifying the input arcs of its operands, an
instruction may be accompanied by notifiers, which are addresses of
operators which have this operator's output arc as one of their input
arcs. These could conceivably be set dynamically, but in this presentation
we have elected to have them set in the initial code. Again, Figures 2, 3, and
4 give examples of code blocks corresponding to program graphs. Further
information on interpreting these blocks is given in subsequent sections.

By keeping data blocks reasonably small, say 256 words, and by using
only addresses relative to the start of the block in the code, the operation
code and necessary set of operand and notifier addresses can be accommodated
within a reasonable word size, say 48 bits. For references across blocks,
which therefore involve global addresses, we provide some special operators,
to be described subsequently. By dividing the physical memory into blocks
and allocating on block boundaries only, a paging effect, which simplifies
storage management, is readily obtained.
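
The arithmetic behind the suggested sizes can be checked with a short Python
sketch. A 256-word block needs 8 bits per block-relative address, so a 48-bit
word has room for an operation code plus several operand and notifier
addresses. The particular split used below (8 bits of operation code followed
by five 8-bit address fields) is purely an assumption for illustration, as
are the names pack and unpack; the actual word format is the subject of a
later section.

```python
BLOCK_WORDS = 256                            # block size suggested in the text
WORD_BITS = 48                               # word size suggested in the text
ADDR_BITS = (BLOCK_WORDS - 1).bit_length()   # 8 bits address 256 words
OPCODE_BITS = 8                              # assumption, not from the paper
ADDR_FIELDS = (WORD_BITS - OPCODE_BITS) // ADDR_BITS  # 5 address fields fit


def pack(opcode, addrs):
    """Pack an opcode and ADDR_FIELDS local addresses into one integer."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode out of range")
    if len(addrs) != ADDR_FIELDS:
        raise ValueError("expected %d addresses" % ADDR_FIELDS)
    word = opcode
    for a in addrs:
        if not 0 <= a < BLOCK_WORDS:
            raise ValueError("address must be block-relative")
        word = (word << ADDR_BITS) | a
    return word


def unpack(word):
    """Inverse of pack: recover (opcode, [addresses])."""
    addrs = []
    for _ in range(ADDR_FIELDS):
        addrs.append(word & (BLOCK_WORDS - 1))
        word >>= ADDR_BITS
    addrs.reverse()
    return word, addrs


if __name__ == "__main__":
    w = pack(3, [1, 2, 250, 0, 7])
    print(ADDR_BITS, ADDR_FIELDS, w < (1 << WORD_BITS), unpack(w))
```

A global address, by contrast, must name a block anywhere in the machine and
so would not fit in such a field, which is consistent with the use of special
operators for references across blocks.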
9. TASK EVALUATION

The loosely-coupled aspect of task evaluation is achievable through
a task list organization, which allows many processors to partake in the
evaluation of tasks, i.e. particular instances of operators with their
associated data. The task list is decomposed into two separate lists
which may be served independently. These are:

    demand list: contains addresses of operators for which evaluation is
                 to be attempted.

    result list: contains addresses of operators, along with their
                 corresponding values after evaluation.

At this stage of development, the recommended priority of service is
result first, then demand. The reasoning here is that result values
generally enable successful evaluation of tasks, while demand generally
creates more tasks. These lists are further divided and distributed to
individual processing units by the communication network, which takes into
account the current processor load distribution. Only invoke
instructions will be considered for distribution, for it is only these which
might profitably be executed in another processing unit, due to the
communication cost incurred in getting them there. Hence, the invoke list is
a sub-list of the demand list, containing only invoke instructions.

Figures 6 through 11 show the organization of the task
evaluation mechanism. The flow diagrams are to be interpreted in an
informal sense, and are less akin to conventional flowcharts than they are
indicative of data flow, with tasks as data.

The following brief narrative will aid in the understanding of the
flow diagrams. Initially, the address of the word which will produce the
"main result" is put on the demand list. The word itself is then fetched.
It is evaluated, if possible. If not, then demand is propagated to its
arguments by placing their addresses on the demand list.
Once evaluated, a result value replaces the coded operator as ready
data. Via the result list, any notifiable operators awaiting this
result as an argument are then notified by putting them on the demand
list to be retried. We notice that all demanded operators remain accessible
until they become ready as data, either through:
(1) being on the demand list,
or (2) being referenced by a notifier of an accessible operator,
or (3) being referenced by the "forwarding address" of an accessible
operator.
Forms of evaluation other than pure demand evaluation can thus be
supported by judicious setting of "d-bits" and advanced placement on
the demand list.
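The narrative above may be easier to follow with a small executable sketch. The Python below is a simplified, single-processor model under our own assumptions: the names Word, OPS, and evaluate are hypothetical, only two arithmetic operators are provided, and the result list is folded into the loop rather than kept as a separate list.

```python
from collections import deque
import operator

OPS = {"add": operator.add, "mul": operator.mul}  # illustrative operators only


class Word:
    """A word is either ready data or a coded operator awaiting evaluation."""

    def __init__(self, op=None, args=(), value=None):
        self.ready = op is None      # r bit: set when the word holds a datum
        self.demanded = False        # d bit
        self.op, self.args, self.value = op, list(args), value
        self.notifiers = []          # addresses of operators awaiting this word


def evaluate(memory, main):
    # Derive notifier fields from argument fields (a compiler would emit them).
    for addr, word in memory.items():
        if not word.ready:
            for a in word.args:
                memory[a].notifiers.append(addr)
    memory[main].demanded = True
    demand = deque([main])
    while demand:
        addr = demand.popleft()
        word = memory[addr]
        if word.ready:
            continue
        if all(memory[a].ready for a in word.args):
            # Evaluate: the result value replaces the coded operator.
            word.value = OPS[word.op](*(memory[a].value for a in word.args))
            word.ready = True
            for n in word.notifiers:       # notify operators awaiting the result
                if memory[n].demanded:
                    demand.append(n)
        else:
            # Propagate demand to arguments that are not yet data.
            for a in word.args:
                arg = memory[a]
                if not arg.ready and not arg.demanded:
                    arg.demanded = True
                    demand.append(a)
    return memory[main].value
```

Note that an operator which cannot yet be evaluated leaves the demand list but stays reachable through the notifier fields of its arguments, which corresponds to case (2) above.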
10. WORD FORMAT
A word in a data block may begin as a code word and later be changed
to a datum as the computation proceeds, corresponding to the evaluation
of the operator represented by that code word. The ready bit (r bit)
in each case is set when the word does contain a datum. It may be set
initially in some words, to provide initialized literals.
A datum can either be an atom, in which case it contains a literal value,
or it can be a pair pointer. In the latter case, it is the global address
of a pair. A pair consists of two consecutive words within some block,
each of which is either a datum or a forward operator. The purpose of
the latter will be described subsequently.
There are several other formats for data which extend the above, such
as representing lists in contiguous space, chains of pointers, etc., as used
in [Bawden, et al. 77]. These will not be discussed here for brevity.
All global addresses are represented as
B.R
where B is the base address of a data block and R the local address
of a word within the block. The advantage of this scheme is that once
the word in question has been referenced, the processor will usually
need access to other words in the block and can gain it using only their
local addresses.
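As a rough illustration of the B.R addressing scheme, the following Python sketch separates the block lookup (by base address B) from the word lookup (by local address R). The names GlobalAddress and Store, and the use of a dictionary of lists as memory, are our own assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalAddress:
    base: int  # B: base address of a data block
    rel: int   # R: local address of a word within the block

    def __str__(self):
        return f"{self.base}.{self.rel}"


class Store:
    """Hypothetical memory: a mapping from base address to a block of words."""

    def __init__(self):
        self.blocks = {}

    def allocate(self, base, words):
        self.blocks[base] = list(words)

    def block(self, addr):
        # Locating the block is the (potentially expensive) global step.
        return self.blocks[addr.base]

    def fetch(self, addr):
        return self.block(addr)[addr.rel]
```

Once a block has been located through one global reference, further words in it can be reached by indexing the block with local addresses alone, which is the advantage the text describes.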
The following fields will always be present in a code word:
d bit: set to indicate that its ultimate data value has been demanded
op: operation code.
The following fields may or may not be present, depending on the
nature of the particular operation code:
as: local addresses of arguments to the operator
ns: notifiers, i.e. local addresses of notifiee operators
->B.R: where B.R is a global address, which is either:
    a forwarding address, which is used with a forward operator, or
    a fetch address, which is used with a fetch operator, or
    a pointer to a code block (in which case R = 0), which
    is used with an invoke operator.
The presence of the demand bit in a code word allows support of a
demand driven evaluation strategy. In this strategy, no operator is
evaluated unless it produces some value known to be essential to the
computation. Aside from the obvious potential efficiency gain, another
advantage of this approach is that it provides a natural means of deciding
whether and when to trigger the invocation of a defined function, which
requires the allocation of a storage block.
The use of bits to direct the processor to interpret a given word
as data, instruction, etc. exemplifies the "tagged architecture" approach
[Feustel 73]. Adopting this approach allows us to keep open all of
its attendant options as the design progresses.
11. REPRESENTATIVE OPERATORS
The repertoire of operators includes the Lisp operators car, cdr,
cons, atom, eq, if-then-else (cond), etc. Of these, all but the first
three are called ordinary, as they operate purely within the data block.
The first three are called special, because they can cause data transfer
between blocks.
In contrast to conventional Lisp, we have elected to make cons
a lenient operator. That is, it has a "result" even if one of its
arguments has not yet been computed. This can be argued to increase the
asynchrony of a computation and hence improve the utilization of a parallel
processing system on which it may be run, cf. [Friedman and Wise 76].
A consequence of the lenience of cons is that, in our implementation,
cons is not really an operator at all, but rather just a pair of data,
namely, its arguments.
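The effect of lenience can be imitated in a conventional language by letting a pair hold unevaluated cells. The Python sketch below is only an analogy under our own assumptions: the class Cell and the functions cons, car, and cdr are hypothetical stand-ins, and a callable plays the role of the code word that would occupy each half of the pair.

```python
class Cell:
    """A word that starts as an unevaluated operator and becomes a datum."""

    def __init__(self, compute):
        self._compute = compute  # stands in for the coded operator
        self.ready = False       # r bit
        self.value = None

    def demand(self):
        if not self.ready:
            self.value = self._compute()
            self.ready = True
            self._compute = None
        return self.value


def cons(head, tail):
    # Lenient cons does no work: the "result" is just the pair of its arguments.
    return (head, tail)


def car(pair):
    return pair[0].demand()


def cdr(pair):
    return pair[1].demand()
```

In this model a consumer can take the car of a pair while the cdr is still uncomputed, so producer and consumer need not wait on each other.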
Some other special operators, which do not appear in the program
graph, are used to effect the necessary transfers of data between procedures,
and other housekeeping operations. These are ident, forward, fetch, locptr,
and invoke.
The operators ident, forward, and fetch all have the nature of
identity functions. The distinction is as follows: ident has a local
argument and local notifiers. It is used mainly for increased fan-out
when there are more notifiers for a word than can fit in a single word;
fetch has a global argument and one or more local notifiers; forward
has one local argument and one global forwarding address. The latter is
set when a demand is issued to the corresponding fetch. All cons pairs
compile as two consecutive forward operators, or literals. The operator locptr
is used to generate global pointers to cons pairs.
The following discussion describes the compilation of an invoke:
[graph of the application of f not reproduced]
where f is a programmer-defined symbol, compiles as:
    invoke  ->f  nz1 ... nzj
    forward ax1 ->?
    forward ax2 ->?
    ...
    forward axk ->?
where ->f is the address of f's code block, the ax_i are local arguments,
the nz_i are local notifiees, and the ?'s are set when the forwards are
demanded.
The data block corresponding to f begins with:
    forward au ->x
    fetch ->(x+1) ny1 ...
    fetch ->(x+2) ny2 ...
    ...
    fetch ->(x+k) nyk ...
where x is the address of the invoke, u is the local word which will contain
the result to be delivered by the invoke, and ny_i ... are the notifiers
of the i-th parameter of f. Following creation of the data block, demand
propagates to the forward in the data block for f.
12. FUNCTION CLOSURES AND THE OPERATOR apply
An important aspect of Lisp programming is the manipulation of functions
as data values. While we do not envision supporting run-time creation of
function definitions, we do accommodate the formation and manipulation of
function closures (records combining compiled code pointers with
environments for their ultimate application; i.e., FUNARGs). This will permit
not only the programming of functionals (function-valued functions) on our
machine, but also provides a form of shared values, thereby relieving the
need to exhaustively parameterize functions.
We assume that our programs are block compiled. That is, the program
consists of a set of symbolically named function definitions that are
compiled as a group. Within these "top-level" function definitions, there may
be some number of nested function definitions of the following form:
In Lisp:
    (FUNCTION
      (LAMBDA (<bound variables>)
        <body>))
In FGL: [graph not reproduced; its labels are (bound variables),
(global values), and a region annotated "denotes graph of <body>"]
Such forms create closure values at run-time. Each combines the entry
point for the nested function's compiled code with an environment pointer
which references the currently executing activation of the immediately
surrounding function definition. Thus global (i.e. "free", or non-local)
variable occurrences within the nested function are bound statically to
refer to the matching declaration (i.e. parameter) binding at the place of
the closure's creation.
For completeness, we include:
In Lisp:
    (FUNCTION F)
In FGL: [graph not reproduced]
where F is the symbolic name of a function. This makes the semantics of
function application more uniform, and syntactically distinguishes between
the function F and any parameter F that may be accessible. Note, however,
that the environment pointer in such a closure is superfluous, since a named
function may not contain any occurrences of variables global to it.
A closure value may be passed as a function argument, returned as a
function value, conditionally selected, etc. until ultimately it is applied
via the operator apply, akin to the APPLY function of Lisp:
[FGL graph not reproduced: an apply node with inputs labeled (closure)
and (arguments)]
Observe that all function calls in our source language could be expressed in
APPLY notation through the following transformation:
    (F α1 ... αk)  =>  (APPLY (FUNCTION F) α1 ... αk)
However, we retain the option of the direct function call notation (and the
invoke opcode supporting it) for expressive convenience and run-time
efficiency.
These constructs are compiled as follows (see Fig. 12 for examples):
Construct 1: Function closures.
In Lisp:
    (FUNCTION <f>)
In FGL: [graph not reproduced; its label is (global values)]
We use the opcode locptr to generate a full-address (i.e. "global")
pointer to a cons pair representing the closure. The car of the pair is the
keyword atom FUNARG, while the cdr is a pseudo-opcode dummy with a code
pointer to <f> as its argument. Thus the car of a closure may be computationally
inspected at run-time, but since dummy causes a run-time error if executed,
the cdr of the closure is inspectable only by apply. Note that the global
pointer S.β to the closure as built by locptr contains the closure's
environment pointer directly in S.
Construct 2: Nested functions.
In Lisp:
    (LAMBDA (<bound variables>) <body>)
In FGL: [graph not reproduced; its labels are (bound variables) and
(global values)]
Each function definition is compiled into a separate code block to
minimize code copying at function application time. (Note that if nested
functions were compiled "in-line", their code would still need to be copied
when applied, since several applications of that particular closure value
may occur.) Within each function's code, special "pseudo-parameter" fetch
opcodes are compiled for each variable accessed globally from within its
definition. Observe that such fetches are compiled even for global variables
accessed only at deeper nested function levels.
Any global variable occurrence is thus connected at run-time through a
sequence of fetch opcodes, one per level of textual function nesting, from its
containing activation record to its binding as a bona fide parameter at some
outer level. The S.β pointers of the global fetches are bound in two stages:
the β is fixed at compile time (with complete security), and the S is fixed
at application time to be the closure's environment pointer S.
Thus, in the same sense that the activation record's dynamic (i.e. calling)
link is redundantly represented in each parameter fetch, its static link is
redundantly represented in each global fetch. The fetch opcode offers
sufficient space for such full addresses, and the design provides uniform fetch
processing in both cases with less memory contention (as might arise if the
static and dynamic links were put into a single header word in the activation
record).
An alternative accessing scheme for globals would be to replace this
"bucket brigade" approach and provide direct fetch linkage from occurrence
to binding levels. Although such a scheme might offer faster access in
certain cases, we consider it to be less desirable for two reasons. First,
the compiled code would need to be adapted to contain two-dimensional addresses
(i.e. [static level, offset], as is customary in Algol-like language
implementation), with the added application time set-up activity. Secondly, a
potentially valuable caching effect would be lost along global fetch sequences.
Given our concern for exploiting locality on this machine, we feel that the
latter concern will be economically dominant.
Construct 3: Function applications.
In Lisp:
    (APPLY <closure-form>
           <arg1-form>
           ...
           <argk-form>)
In FGL: [graph not reproduced; its labels are closure and arguments]
The apply operator is compiled in a manner similar to that for invoke,
but with the closure being an operator argument (as opposed to the
arguments, which are compiled using forwards as per invoke).
The actions taken by the apply opcode are viewed as a slight extension of
the invoke opcode, with the added activities of global fetch set-up and
argument count checking. When demand reaches an apply operator, it propagates
immediately to the apply's first argument. Upon receipt of the necessary
closure value for this argument, the apply task becomes an invoke task and is
moved to the invoke list.
13. forward  CHAINING
The narrative in Section 9 does not discuss special attention paid
to various operators, e.g. forward. The handling of such operators
is the essence of both the procedure linkage mechanism and the successful
handling of lenient cons.
When an operator is evaluated, it is replaced with a value. At this
time, the presence of any notifiers is noted and the corresponding operators
are put on the demand list. These operators can then access the data as
an operand.
No use is to be made of the argument part of the contents of operators
over-written by forward. Instead, a special forward chaining technique is
required for consistent handling of lenient cons. If the operator being
replaced is a forward, the data also replaces the contents of a forwarding
address which may be present. This process is repeated, until an operator
containing no forwarding address is encountered. The need for this technique
can be seen by the following argument:
Figure 11a shows parts of three data blocks as part of a state. Notice
that X1 and X2 can both potentially request the same value, namely the
value of U, which is not yet ready (nor demanded). When the first demand
on Z is generated, as indicated in Figure 11a, the forwarding address in
Z is set to X1 and U is demanded.
Suppose meanwhile that demand is generated on X2, which in turn results in
a second demand on Z. Since a forwarding address has already been stored in Z,
there is insufficient room for a second. (Even if two could be stored, there
might be three demands generated, etc.) Since we know that X1 is to receive the
result of Z, we store forward in Z, as in Figure 11c. When U is finally
notified, any data stored over a forward will be stored over the contents of the
word specified by its forwarding address, according to the distribute/notify
phase of the evaluation algorithm.
Although we used car to motivate the above example, we mention that
similar treatment is given to cdr and fetch (when used for global
value linkage).
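The chaining rule stated above (data stored over a forward is also stored over the word named by its forwarding address, repeatedly, until a word with no forwarding address is reached) can be sketched as follows. This Python fragment is an illustration under our own assumptions: the names Word and store_result, the word labels w0, w1, w2, and the early exit on an already-ready word are hypothetical, and the sketch does not model which word receives which forward in Figure 11.

```python
class Word:
    """Hypothetical code word: an opcode plus an optional forwarding address."""

    def __init__(self, op, forwarding=None):
        self.op = op
        self.forwarding = forwarding
        self.ready = False
        self.value = None


def store_result(memory, addr, value):
    """Store value over the word at addr, then follow forwarding addresses."""
    written = []
    while addr is not None:
        word = memory[addr]
        if word.ready:
            break  # assumption: stop at data, which also guards against cycles
        # Only a forward carries a forwarding address worth following.
        next_addr = word.forwarding if word.op == "forward" else None
        word.op, word.forwarding = None, None
        word.ready, word.value = True, value
        written.append(addr)
        addr = next_addr
    return written
```

Because each forward holds a single forwarding address, several requesters of one value are served by linking them into a chain rather than by widening the word.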
We do not go into great detail here on the organization of individual
general processing units. As described in Section 9, each unit selects tasks
from its demand list. While on this list, a task is represented by its
address in memory. This word is fetched and if not presently ready as data,
an attempt is made to evaluate it. For ordinary tasks, this normally
entails reference to one or more additional words in the memory; hence
a fetch of these words occurs. Since each of them might reside in the
physical memory of any processing unit, fetching may involve transmission of
words through the communication network. In order that the processor need not be
idle while such a fetch is taking place, we provide for buffering a set
of such tasks while their operands are being assembled. We call such a
buffer a staging area. It is conceptually similar to a conventional
pipeline, except that order of task execution is unimportant, all
essential ordering being explicit in the program graph. The size of the
staging area is chosen to maintain reasonably good utilization of the
function units within the processing unit, which carry out the actual
operations once the task leaves the staging area. Of course, each
function unit could itself be pipelined, depending on economic advantages
which would accrue due to a particular application load. Design of





15. LOAD BALANCING
Load balancing occurs through the redistribution of tasks from the
invoke list of one processing unit to that of another. This is a separate,
but topologically compatible, function of the communication network from
the routing of operand data.
By the load at a processing unit, we mean the number of tasks on the
segment of the invoke list at that unit. In a similar manner, we can define
the load at any node of the communication network to be the sum of the loads
at its leaves, divided by the number of its leaves as a normalizing factor.
Again, to simplify the explanation, we are assuming that the communication
network is a binary tree. Each node of the communication network
maintains lower and upper limits, L and U, on the loads of its immediate
descendants. If the load of one is above U and that of the other below L,
it attempts to shift tasks from the invoke list of the overloaded descendant
to that of the underloaded one. If loads of both its descendants are above
U, this will be communicated to its parent (if any), so that the latter may
try to shift some of the load to one of its descendants having load less
than L. In this way, the balancing function is distributed throughout the
communication network, with each node thereof applying the same balancing
strategy.
The effectiveness of the balancing scheme relies on the loosely-coupled
aspect of the system. That is, no task is bound to a particular processor
until storage is allocated for it.
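The balancing rule for a binary communication tree can be sketched in Python as below. This is a minimal model under our own assumptions: the class names Leaf and Node, the non-negative limits low and high (standing for L and U), and the policy of moving one task at a time from the most loaded leaf to the least loaded leaf are hypothetical; the text does not specify how many tasks are shifted or which leaves are chosen.

```python
class Leaf:
    """A processing unit, represented only by its segment of the invoke list."""

    def __init__(self, tasks=()):
        self.invoke_list = list(tasks)

    def leaves(self):
        return [self]

    def load(self):
        return len(self.invoke_list)


class Node:
    """A node of the (binary) communication tree with limits low (L), high (U)."""

    def __init__(self, left, right, low, high):
        self.left, self.right = left, right
        self.low, self.high = low, high

    def leaves(self):
        return self.left.leaves() + self.right.leaves()

    def load(self):
        # Sum of the loads at the leaves, normalized by the number of leaves.
        leaves = self.leaves()
        return sum(leaf.load() for leaf in leaves) / len(leaves)

    def balance(self):
        """Shift tasks between descendants; return True if both exceed U."""
        heavy, light = self.left, self.right
        if heavy.load() < light.load():
            heavy, light = light, heavy
        while heavy.load() > self.high and light.load() < self.low:
            src = max(heavy.leaves(), key=Leaf.load)
            dst = min(light.leaves(), key=Leaf.load)
            dst.invoke_list.append(src.invoke_list.pop())
        # A True result is what would be communicated to the parent node.
        return self.left.load() > self.high and self.right.load() > self.high
```

Since each node consults only the loads of its two immediate descendants, the same routine can run independently at every level of the tree.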
16. COMPARISONS WITH RELATED MACHINES
It is easiest to understand the relation between the machine architecture
presented here and the architecture of the data-flow computer proposed in
[Dennis and Misunas 74] by folding the latter through the center of its
instruction cells and functional units in such a way that the arbitration
network overlaps the distribution network. Our general processing units
then play the role of the instruction cell blocks, and our communication
network performs the function of both arbitration and distribution networks.
Furthermore, our architecture may offer improved performance because data would
not often have to travel as far to get from a source cell to a destination cell.
As in the machine proposed in [Arvind and Gostelow 77], the
machine proposed here uses micro-computers to do the processing. However,
we feel that the communication network used in our machine is
superior to the one in that machine. The communication bus structure of the
former machine may cause intolerable delays in transmitting information from
one processing unit to another, a fact that may prove to be a great
impediment to the success of the machine.
The DDM-1 [Davis 78] is a very different kind of machine than the
one proposed here. Its hierarchical structure seems to impose certain
constraints on the creation of new computations and on the flow of information
in the machine. For example, when a processing element creates a
task, the latter must be placed either in the space of the processor
carrying out the application or in the space of a subordinate
processor, even if the subordinates are crowded for space and the machine
has other processors which have plenty of free space. This problem does
not occur in our machine, due to the construction of the communication
network, the uniformity of the address space, and our notion of load
balancing.
Some tree-structured reduction language machines that have been
proposed are fundamentally different in their operation when compared
with the machine presented here. In these machines, the expressions
that need to be evaluated are mapped directly onto the physical tree
of the machine. In our machine, such expressions would not be mapped
onto the communication tree; instead they would be mapped via parallel
program graphs into the address space of the machine, and would reside
in the memory space of one or more processing units of the machine.
A common feature of all of the above architectures is that they
are data-driven rather than demand-driven, as ours is. One might be
led to think that the latter presents some additional overhead. However,
closer examination of the other architectures may reveal that some
form of ready-acknowledge signalling is taking place when it comes to
transmission of data via storage words. This is, in fact, a special
case of demand-driven computation, in which the demand for an operand
is equated with readiness of its recipient. We exploit the flexibility
of the general case, to obtain advantages in deciding when to invoke
procedures. It is also clear that the demand-driven feature is a necessity
in supporting lenient cons. On the other hand, it is also clear that
demand-driven computation can be engineered on the other architectures
by treating demands as data, but this seems to be cumbersome.
Although at the physical level the Cm* computer [Swan, et al. 77]
may appear similar to our machine, the two are quite different on account
of their underlying mechanism of program execution. In Cm*, parallel
processing is based on the concept of interacting sequential processes
that run on conventional processors (PDP-11), while our machine embodies
an evaluation scheme for the FGL language and is capable of directly
evaluating data-flow graphs and applicative expressions. Our evaluation
scheme, language, and overall organization have been developed in an
integrated fashion as parts of one functioning system.
17. CONCLUSIONS AND FUTURE RESEARCH
We have stated our feeling that machine architectures should be
developed with greater attention paid to ultimate programmability. As an
example, we discussed principles for a loosely-coupled architecture and the
use of Lisp as a language well-suited for such a machine. We sketched in
some detail the internal representation of programs in our machine and
the execution of programs on it.
Our implementation seems to be the first detailed one presented for
Lisp programs on a parallel machine. An implementation has been
described qualitatively in [Friedman and Wise 78]. However, their work
relates mainly to the issues associated with colonel versus
sergeant tasks, the latter being distinguished from the former as
tasks whose evaluation may never be actually required, but which
provide a potentially useful way of employing otherwise idle processors.
In contrast, all tasks in the machine described here are of the colonel
variety, whose existence may be traced to certain strict operators,
such as add in the tree sum example. Hence such issues have not
been of immediate concern here. On the other hand, subtle details,
such as the need for forward chaining, have been discovered in the
course of designing our evaluator. How such subtleties interact with
an implementation which does support sergeant tasks remains a topic for
future investigation.
The ideas presented here were derived after considering many
possible alternatives. It is, of course, possible that we may elect
to return to one or more of these alternatives after more experience
in programming the machine has been gained. A simulator for the evaluation
model has been written in Pascal to assist in such a venture.
Many important details remain to be investigated. These include not
only the necessary support for the language described here in terms of
storage reclamation and scheduling, but extension of the language to
allow other features as well. We are currently contemplating how to best
introduce a distributed heap for more efficient long-term data storage.
We must decide how to deal with other features of Lisp, such as prog,
upon which many programmers have learned to rely. A related issue is
whether indeterminate computations should be supported, as there are some
indications that they permit efficiency gains not otherwise achievable
[Keller 78]. The usefulness of applicative programs in allowing graceful
backup when a processing unit fails also remains to be explored. Thus
many issues, at levels from detailed processor construction to more
fundamental language problems, await us.
ACKNOWLEDGEMENTS
Comments by Al Davis, Milos Ercegovac, and Mark Franklin, as well as
encouragement from Jack Dennis, are appreciated.
The authors express their thanks to Kathy Burgi, Jodie Doyle, Karen Evans,
Lujuana Fornelius, and Mary Ann Kleinert for their assistance in preparing
the manuscript.
O  Leaf node: either a general processing unit (with memory),
   special processing unit, or interface to external I/O.




0 forward a2 ->x
1 fetch ->(x+1) n3 n7
2 locptr a3 n0
3 forward a1 ->?
4 forward a5 ->?
5 invoke -kj n4
6 forward a7 ->?
7 addl a1 n6
Figure 3 Graph representation and code block representation of the
consequent of a production. x is the global address of the
invoke operator which creates the corresponding data block.
? indicates pointer fields which are set on demand of this
word. locptr is an operator which generates the global
address of the word it references.
(DE SUM (TREE) (COND 
((NULL TREE) 0)
((ATOM TREE) TREE)
















forward a2 ->x
fetch ->(x+1) n3 n6 n14
cond a3 a4 a5 n0
null a1 n2
r 0
cond a6 a14 a7 n2
atom a1 n5
add a8 a11 n5
invoke ->sum n7
forward a10 ->?
car a14 n9
invoke ->sum n7
forward a13 ->?
cdr a14 n12
ident a1 n5 n10 n13
Figure 4 Tree summation example: Lisp code; consequent of production
defining SUM; compiled code.
Figure 5 One possible snapshot of the program of Figure 4 during 
its computation on a tree.
[Figure 6 input label: initial task addresses (d bits of tasks already set)]
Figure 6 Overall task processing flow. Asterisk denotes "sequence of".
The evaluate/propagate box for different task types is expanded
in Figures 7, 8 and 10. The distribute/notify box is expanded
in Figure 9.












Figure 7 Evaluate/propagate for ordinary task type.









Figure 8 Evaluate/propagate for invoke task type.











[Figure 9 labels: (T, V) forwarded value; (task address, value);
(T, V) notifiee tasks]
Figure 9 Distribute/notify processing.









[Flowchart not reproduced. Recoverable box text:
let X be task address, let Y be argument location;
fetch contents of Y, setting d bit if not already data;
let Z be the address in Y, let W be Z (car) or Z+1 (cdr);
contents of W already data? (yes / no);
W's forwarding address set? (no);
propagated task, address Y;
evaluated task.]













[Figure 11 panels not reproduced: successive states of the words X1, X2,
Y1, Y2, Z, and U as the first and second demands arrive.]
Figure 11 Illustration of forward chaining. (r and d denote ready
and demand bits, respectively.)
Lisp code: (DE ADDK (K) (FUNCTION (LAMBDA (J) (ADD J K))))
FGL code:





forward a2 ->x
fetch ->(x+1)

forward a3 ->x
fetch ->(x+1) n3
fetch ->(£+1) n3
add a1 a2 n0
Figure 12 Simple example of function closures: "Currying" the operator
add to have a bound second argument. (x denotes the dynamic
link, and £ denotes the static link, both bound at invoke time.)
REFERENCES
[Arden and Berenbaum 75] B. W. Arden and A. D. Berenbaum. A multi­
microprocessor computer system architecture. Operating systems review,
9, 6, 114-121 (Nov. 1975).
[Arvind and Gostelow 77] Arvind and K. P. Gostelow. A computer capable 
of exchanging processors for time. Proc. IFIP '77, 849-853 (1977).
[Backus 73] J. Backus. Programming language semantics and closed 
applicative languages. Proc. ACM Symp. on Principles of Programming 
Languages (1973), 71-86.
[Bawden, et al. 77] A. Bawden et al. Lisp machine progress report. MIT
AI Memo No. 444 (August 1977).
[Berkling 75] K. J. Berkling. Reduction languages for reduction machines. 
Second Annual Meeting of Computer Architecture (1975), 133-138.
[Davis 78] A. L. Davis. The architecture and system method of DDM-1:
A recursively-structured data driven machine. Proceedings of the Fifth Annual 
Symposium on Computer Architecture (1978).
[Dennis and Misunas 74] J. B. Dennis and D. P. Misunas. A preliminary 
architecture for a basic data flow processor. Proc. 2nd Annual Symposium 
on Computer Architecture, 126-132 (Dec. 1974).
[Fateman 73] R. J. Fateman. Reply to an editorial. ACM SIGSAM Bulletin,
No. 25, 9-11 (March 1973).
[Feustel 73] E. A. Feustel. On the advantages of tagged architecture.
IEEE Trans. on Computers, C-22, 7, 644-656 (July 1973).
[Friedman and Wise 76] D. P. Friedman and D. S. Wise. CONS should not 
evaluate its arguments, in Michaelson and Milner (eds.), Automata,
Languages, and Programming, 257-284, Edinburgh University Press (1976).
[Friedman and Wise 78] D. P. Friedman and D. S. Wise. Aspects of 
applicative programming for parallel processing. IEEE Trans. C-27,
4, 289-296 (April 1978).
[Hearn 76] A. C. Hearn. Symbolic computation. Proc. CERN School of 
Computing, 201-211 (Sept. 1976).
[Henderson and Morris 76] P. Henderson and J. H. Morris, Jr. A lazy 
evaluator. Proc. 3rd ACM Conference on Principles of Programming Languages, 
95-103 (Jan. 1976).
[Keller 75] R. M. Keller. Look-ahead processors. Computing Surveys,
7, 4, 177-195 (Dec. 1975).
[Keller 77] R. M. Keller. Semantics of parallel program graphs. University 
of Utah, Department of Computer Science, Tech. Rept. UUCS-77-110 (July 1977).
[Keller 78] R. M. Keller. An approach to determinacy proofs. University 
of Utah, Department of Computer Science, Tech. Rept. UUCS-78-102 (March 1978).
[Lamport 74] L. Lamport. The parallel execution of DO loops. CACM,
17, 2, 83-93 (Feb. 1974).
[McCarthy 63] J. McCarthy. Towards a mathematical science of computation. 
Proc. IFIP '62, 21-28 (1963).
[Patil 67] S. Patil. An abstract parallel-processing system. M.S.
Thesis. MIT, Department of Electrical Engineering (June 1967).
[Reddi and Feustel 78] S. S. Reddi and E. A. Feustel. A restructurable 
computer system. IEEE Trans. on Computers, C-27, 1, 1-20 (Jan. 1978).
[Swan, et al. 77] R. J. Swan, S. H. Fuller, and D. P. Siewiorek. Cm* -
A modular, multi-microprocessor. AFIPS Conference Proc., 46, 637-644
(June, 1977).
