The torus: an exercise in constructing a processing surface by Martin, Alain J.
The Torus:
An Exercise in Constructing a Processing Surface
Alain J. Martin
Computer Science Department
California Institute of Technology
Proceedings of the Second Caltech Conference on VLSI, January 1981
Philips Research Laboratories
5600 MD Eindhoven 
The Netherlands 
Abstract. A "Processing Surface" is defined as a large, dense, and
regular arrangement of processor and storage modules on a two-dimensional
surface, e.g. a VLSI chip. A general method is described for distributing
parallel recursive computations over such a surface. Scope rules enforcing
the "locality" of variables and procedure parameters are introduced in the
programming language. These rules and a particular interconnection of the
modules on the surface make it possible to transmit parameter and variable
values between modules without using extraneous communication actions.
The choice of the Processing Surface topology for binary recursive
computations is discussed and a torus-like topology is chosen.
0. INTRODUCTION 
Let us call a "Processing Surface" a large, dense, regular arrangement
of processor and storage modules on a two-dimensional surface, e.g. a VLSI
chip. How can a computation be distributed over such a surface? What are
the arrangements of the modules on the surface best suited for a certain
class of computations?
We propose to explore this problem in the following direction. In such
an environment, an action on a variable differs in complexity (in terms of
the number of elementary steps necessary to perform the action) depending
on the distance between the processor module performing the action and the
storage module containing the variable. We want to reflect this issue at
the programming level by introducing scope rules defining the distance
between the program component where a variable is declared, and the program
components where the variable can be used.
Since we expect intense communications between the program components,
we expect assignments of the form x:=y where x and y belong to two
adjacent components (this assignment can take the form of a procedure call
or a pair of matching communication actions) to occur as frequently as
assignments between variables of the same component. In most distributed
systems, the first type of assignment is an order of magnitude more complex
than the second one. We consider this hidden discrepancy between equivalent
actions unacceptable. We will show that it is possible to define some
locality rule for the program variables, and to organize the processor and
storage modules on the surface such that no discrepancy of this sort
appears. In such a case, the Processing Surface is said to be
"continuous".
Furthermore, since for instance inverting a 2*2 matrix does not
require as much parallelism as inverting a 1000*1000 matrix, the potential
parallelism of an algorithm should not be fixed beforehand (e.g. by the
number of available processors) but should be determined dynamically
according to the needs of the computation. The component actions of a
computation should be created and destroyed as the computation proceeds, and
should be automatically distributed over the available modules.
1. THE GENERAL METHOD
The general method we use has been described in [1]. We shall recall it
briefly.
The component actions of a computation - the "nodes" - are regarded
as the vertices of a graph - the "computation graph" - which grows and
shrinks during the computation. An edge - a "channel" - between two
nodes means that one of the two, say node A, has created the other, say
node B, by a procedure call, and that A and B communicate directly
with each other. A is the "father" of B, and B is a "son" of A.
Thanks to a parallel procedure call, a father may create several sons
simultaneously. The father/son relation defines a partial ordering of the
nodes, and all nodes that are not relatively ordered can be performed in
parallel.
A computation graph grows and shrinks through a given finite
"implementation graph", whose vertices - the "cells" - represent the
available modules, and the edges - the "links" - the communication
possibilities between modules. Each node is mapped on a cell, and each
channel on a link.
Hence, each cell may have to accommodate an unbounded number of nodes.
Since a cell represents a very small number of sequential automata (in most
cases, one), the activities of all nodes simultaneously present in a cell
have to be sequentialized in some way. But such a sequentialization may
introduce deadlock. The main result of [1] is to prove that the nodes of a
cell can be interleaved without introducing deadlock provided that the grain
of interleaving be correctly chosen. The solution is very simple in that it
does not require any particular knowledge about the nodes or the
implementation graph nor complicated scheduling.
In this paper we shall consider a special class of computations,
namely recursive computations. For this class of computations we shall
describe how to implement a continuous Processing Surface, and we shall
propose a torus-like topology for the implementation graph.
2. RECURSION
Much has been said about the use of recursion for parallel programming.
The reader is referred to the abundant literature on this subject.
For the sake of simplicity, we shall restrict ourselves to one of the most
usual recursive methods, namely "divide-and-conquer" (also called "recursive
doubling"). Divide-and-conquer algorithms are particularly interesting in
that they produce binary trees as computation graphs. Binary trees are
regular structures and each node has an outdegree of two, which is
interesting in view of their mapping onto a two-dimensional surface.
Parallelism is introduced only by calling two procedures "in parallel".
The possibility of further increasing parallelism by pipelining the
parameters will not be mentioned although it can easily be added.
Nevertheless this class of algorithms is large enough (in particular numerical
algorithms) for the exercise to be realistic.
Since a node is created by a procedure call, a node is a procedure
instance with its own program counter, and its set of variables and
parameters. The following rules define the "locality" of variables and
parameters.
. The unit of locality is the node: a variable declared inside a node
is local to that node.
. A variable local to a node A is a neighbour for all son nodes
of A.
Since the father/son relation between nodes is not transitive, the
locality or neighbourhood of a variable with respect to a node is not
transitive either: if a node P1 calls a node P2 which calls a node P3,
a variable local to P1 is neighbour for P2, but not for P3.
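The two rules above can be illustrated by a minimal Python sketch. The `Node` class and the helper functions are hypothetical names introduced only for this illustration; they are not part of the programming language discussed in the paper.

```python
class Node:
    """A procedure instance; only its father link matters for locality."""
    def __init__(self, name, father=None):
        self.name = name
        self.father = father

def is_local(owner, node):
    # A variable declared inside `owner` is local to `owner` only.
    return owner is node

def is_neighbour(owner, node):
    # A variable local to `owner` is a neighbour for the sons of `owner`.
    return node.father is owner

def within_reach(owner, node):
    # True when a variable declared in `owner` is local or neighbour for `node`.
    return is_local(owner, node) or is_neighbour(owner, node)

# The chain used in the text: P1 calls P2, which calls P3.
p1 = Node("P1")
p2 = Node("P2", father=p1)
p3 = Node("P3", father=p2)
```

With this chain, `within_reach(p1, p2)` is true while `within_reach(p1, p3)` is false, which is the non-transitivity stated above.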
Three types of parameters are used:
. An input parameter is used to "import" a parameter value into a son node,
by an assignment of the actual parameter value to the formal parameter
variable.
. An output parameter is used to "export" a value from a son node to its
father by an assignment of the formal parameter value to the actual
parameter variable.
. A reference parameter is used both to import and to export, but by a
process of substitution, or "aliasing": the formal parameter replaces the
actual parameter in the son node (it is another name for the same variable).
In the case of the input and the output parameters, the formal
parameter is local to the son.
In the case of the reference parameter, the formal parameter has the
same locality as the actual parameter. The formal parameter is thus
not local to the son.
Assume that the value x of a variable is to be imported from a father node
P1 into a son node P2. Either an input or a reference mechanism can be
used. Assume now that x is to be passed again from P2 to a son node P3.
If x was passed from P1 to P2 as an input parameter, x will be
local to P3 if it is passed as an input parameter from P2 to P3, and
neighbour to P3 if it is passed as a reference parameter from P2 to P3.
But if x was passed from P1 to P2 as a reference parameter, x will
neither be local nor neighbour to P3, whether it be passed as an input or as a
reference parameter from P2 to P3.
(In the case where a value is to be exported from a son node to a father node,
exactly the same differences hold according to whether it is passed as an
output or a reference parameter.)
Hence, the locality or neighbourhood of a reference parameter with
respect to a node is not transitive whereas that of an input or output
parameter is. But when a value x is passed as an input or an output
parameter from node P to node Q, by definition x is copied from the
storage area of P into the storage area of Q. No copying is necessary
when x is passed as a reference parameter.
The repetitive transport of values via global variables and reference
parameters could be used in its full generality, but we propose to restrict
its use by the following "locality rule".
Locality rule: An action of a node involves only variables and
parameters that are local and/or neighbour for the node.
(Whether global variables should be used at all is doubtful. They have been
included for the sake of completeness.) We shall see that this locality rule
permits the implementation of a continuous Processing Surface.
4. IMPLEMENTATION OF A CONTINUOUS PROCESSING SURFACE
Definition: A Processing Surface is said to be "continuous" when any
action performed on the surface involves only variables
that are directly accessible to the processor performing
the action, i.e. accessible by elementary read or write
operations.
Hence, if we succeed in implementing a continuous surface, we shall
have suppressed any form of extraneous communication action for accessing
variables.
According to the general method, we know that if node N1 is mapped
on cell C1, a son node N2 of N1 is mapped on a neighbour cell C2 of
C1. For node N1 to be mapped on C1 means that the local variables and
parameters of N1 must be allocated in the storage module associated with
C1, and the same for N2 relative to C2. Let M1 and M2 be the
storage modules associated with C1 and C2, respectively. According to
the locality rule, any action of N2 may involve variables located in
M1 and M2. The set (M1, M2) is called the "locality area" of N2.
In the case where the computation graph is a tree, the locality area of a
node consists of at most two elements.
As a direct consequence of the locality rule and of the definition of
a continuous Processing Surface, the Processing Surface is continuous if
the property C(N) holds for any node N.
C(N): any action of N is performed by a processor directly connected
to the two storage modules of the locality area of N.
We shall describe a strategy for placing the processor and storage
modules on the implementation graph, and for distributing the actions and
the variables of the nodes over the processor and storage modules, such that
C(N) holds for any node N.
This strategy thus implements a continuous Processing Surface.
1) The placing strategy is directly suggested by the property C(N).
A storage module is placed at each vertex, and a processor module at
each edge of the implementation graph.
(See an example in fig. 1.) Hence, each processor has direct access to two
storage modules, and each storage module is shared by as many processors as
the degree of the vertex where it is placed.
2) Assume that C(F) holds for a node F. For instance, F has been
created in cell 2 of fig. 1(a); its local variables are in M2, its
neighbour variables in M1, and its actions are processed by P12 (see
fig. 2).
Fig. 1. (a) implementation graph; (b) processor and storage placement
Assume that at some stage in the computation of F two son nodes R and
D (for right and down) of F are to be created in cells 3 and 4,
respectively. The locality areas of R and D must then be (M2, M3) and
(M2, M4), respectively (see fig. 2).
This means that C(R) and C(D) will hold if and only if R and D are
processed by P23 and P24, respectively. Upon reaching the procedure
calls of R and D in the procedure body of F, P12 must transmit the
creation of R and D to P23 and P24.
Since, by construction, P12, P23, and P24 share a common store, namely
M2, the transmission of procedure calls is a simple and local action:
P12 adds the names of R and D to the lists - located in M2 - of
nodes to be processed by P23 and P24, respectively.
A processor switches from one node to the other upon a procedure call
in the same way as in a multiprogramming system a processor switches from
one process to another upon a P-operation on a zero semaphore. We shall not
describe the implementation in more detail.
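The placing strategy and the transmission of procedure calls through a shared store can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the dictionary layout of a storage module, and the three-link fragment of the implementation graph (cell 2 linked to cells 1, 3 and 4, as the example in the text implies) are all hypothetical and say nothing about the actual implementation.

```python
from collections import defaultdict, deque

def place_modules(edges):
    """One storage module per vertex, one processor module per edge."""
    storage = {}
    processors = {}
    for u, v in edges:
        for cell in (u, v):
            # Each store keeps, per attached processor, a list of nodes to run.
            storage.setdefault(cell, {"name": f"M{cell}",
                                      "pending": defaultdict(deque)})
        processors[f"P{u}{v}"] = (u, v)
    return storage, processors

def call_sons(storage, shared_cell, sons):
    """Record each son in the shared store, on the list of its processor."""
    for processor, node in sons.items():
        storage[shared_cell]["pending"][processor].append(node)

# Assumed fragment of an implementation graph: cell 2 linked to cells 1, 3, 4.
storage, processors = place_modules([(1, 2), (2, 3), (2, 4)])

# Processors with direct access to M2: one per link incident to cell 2.
sharers_of_m2 = sorted(p for p, cells in processors.items() if 2 in cells)

# P12, running F, hands R to P23 and D to P24 by writing into M2 only.
call_sons(storage, 2, {"P23": "R", "P24": "D"})
```

In this sketch the call is a plain write into a store that the calling processor already accesses directly, which is the point made above: no separate communication action is involved.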
Hence, if C(F) holds for a node F, C(R) and C(D) hold for the
two son nodes of F. Observe that the above strategy is independent of the
topologies of the implementation graph and of the computation tree.
The root node P of the computation tree is created by the "environment"
of the computation. At least one cell of the implementation graph - a root
cell - is connected to the environment. It is easy to map P onto a root
cell in such a way that C(P) holds.
Fig. 2. Locality areas
5. THE CHOICE OF THE IMPLEMENTATION GRAPH
We look for a finite implementation graph such that 1) an arbitrary
binary tree can be mapped onto it without knowing the sizes of the tree and
of the graph, 2) the nodes of the tree are optimally spread over the cells
of the graph.
Because of 1), we aim at "simulating" an infinite graph on a finite
one. Let us assume that we could indeed construct an infinite implementation
graph, which graph would we choose? Since we are looking for graphs that can
be represented in the plane by regular and dense structures, we are bound to
choose between the three regular tessellations of the plane, which are the
square, the triangular, and the hexagonal tessellations. (Although the
infinite binary tree is regular, it is not dense, because it grows
exponentially and therefore cannot be represented with minimal constant
edge lengths.)
We have chosen the square tessellation, although the hexagonal is also
interesting. We shall first discuss the problems of mapping a binary tree
onto an infinite grid. We shall then simulate the infinite grid on a finite
grid.
6. THE INFINITE GRID AS AN IMPLEMENTATION GRAPH
An infinite grid is a graph such that: for i ≥ 0 and j ≥ 0, vertex
(i, j) is connected with vertex (i+1, j) and vertex (i, j+1).
The mapping of a binary tree on the grid is obvious. The root of the
tree is mapped onto vertex (0, 0). If a node is mapped on vertex (i, j),
then its right son R is mapped on vertex (i, j+1), and its down son D
is mapped on vertex (i+1, j). When an exponential structure (the binary tree)
is mapped on a quadratic one (the grid) a congestion problem is created: vertex
(i, j) of the grid may have to accommodate up to (i+j)!/(i! j!) nodes of the
binary tree simultaneously.
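The congestion can be made concrete with a short Python sketch. A node of the complete binary tree is identified here by its path from the root, a sequence of "R" (right) and "D" (down) calls; this encoding and the function names are assumptions made for the illustration. Every path with i down calls and j right calls ends on vertex (i, j), so the number of nodes of depth i+j landing on that vertex is the binomial coefficient of i+j over i.

```python
from collections import Counter
from itertools import product
from math import comb

def grid_position(path):
    """Vertex reached from (0, 0): each 'D' increments i, each 'R' increments j."""
    return (path.count("D"), path.count("R"))

def occupancy(depth):
    """Number of tree nodes of the given depth mapped on each grid vertex."""
    return Counter(grid_position(p) for p in product("RD", repeat=depth))

# At depth 4 the 16 nodes share only the 5 vertices of the diagonal i + j = 4.
occ = occupancy(4)
worst = max(occ.values())   # load of the most congested vertex, here (2, 2)
```

The counts on a diagonal follow a row of Pascal's triangle, so the middle vertices carry most of the load while the end vertices carry a single node each.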
7. THE STRAIGHT TORUS
The problem now is to simulate the infinite grid on a finite one. For
reasons of symmetry we choose a square grid of M*M cells. (We shall return to
this choice later.) The first solution consists in connecting cell (x, y) of
the finite grid (0 ≤ x, y < M) to the cells:
(x, (y+1) mod M)
and ((x+1) mod M, y).
This amounts to connecting with each other the corresponding elements of the
first and last columns, and those of the first and last rows. The volume obtained
is topologically similar to a torus.
Consider an arbitrary cell (i, j) of the infinite grid and the cell (x, y)
of the finite grid on which it is mapped. According to the above connecting rule,
we have:
R: i = x + k*M
   j = y + l*M     (k ≥ 0, l ≥ 0)
This relation describes the tiling of the infinite grid by square tiles of size
M*M: if (i, j) are the coordinates of a cell of the infinite grid, then (x, y)
are its coordinates in the tile (k, l) (see fig. 3).
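A minimal Python sketch of this mapping, reading the connecting rule as x = i mod M and y = j mod M, with the tile indices obtained as the integer quotients. The function name `straight_torus` is hypothetical.

```python
def straight_torus(i, j, M):
    """Return ((x, y), (k, l)) such that i = x + k*M and j = y + l*M."""
    k, x = divmod(i, M)
    l, y = divmod(j, M)
    return (x, y), (k, l)

# Vertex (7, 2) of the infinite grid on a 3*3 torus.
cell, tile = straight_torus(7, 2, 3)

# A right step on the infinite grid follows the torus link (x, (y+1) mod M).
(x, y), _ = straight_torus(7, 2, 3)
right_cell, right_tile = straight_torus(7, 3, 3)
```

Here vertex (7, 2) falls on cell (1, 2) of tile (2, 0), and its right neighbour (7, 3) falls on cell (1, 0) of tile (2, 1), i.e. the step wraps around the row and enters the next tile.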
The congestion problem can be solved in the following way. Consider the
infinite grid. When a vertex is occupied by a node N of the computation tree,
no other node is accepted by the vertex until N and the subtree attached to N
have terminated their activity. It is easy to prove that this cannot lead to
deadlock on the infinite grid. But this solution cannot be used in a
straightforward manner for the torus without danger of deadlock. Assume that a cell of
the torus is occupied by the node N1, and a new node N2 is not accepted by
the cell. It may occur that N2 belongs to the subtree of N1. This would be
a deadlock. To avoid it, for each node of the computation tree, it is recorded to which tile
the node belongs. When a cell is occupied by a node N1, it may refuse a node
N2 only if N2 belongs to the same tile as N1. (If two nodes contending for the
same cell belong to the same tile, they come from the same vertex of the
infinite grid, so it is impossible that one belongs to the subtree of the other.)
8. THE PROPAGATION PATTERN
Assume that all cells and all nodes in the cells have similar behaviours,
and that the propagation speeds are similar in all directions even in the case of
an asynchronous implementation. Then we can say that in a phase of homogeneous
expansion or contraction of the computation, there is a front wave of active
nodes which are located at a maximum distance from the root, i.e. on a diagonal
i + j = K of the infinite grid, which we shall call the "active diagonal".
At step K of the computation, the complete computation tree contains
2**(K-1) active nodes (the leaves). But at step K of the computation, at
most K(K+1)/2 cells of the infinite grid can be active, and if the strategy
for reducing congestion is applied, at most K: the cells of the active
diagonal. As a consequence, the 2**(K-1) leaf nodes cannot be active
simultaneously; their activities have to be sequentialized. The hypothesis of
homogeneous expansion and contraction then does not strictly hold anymore
because not all cells on a diagonal have to accommodate the same number of leaf
nodes, and therefore the contraction of the computation will not start in all
cells at the same time. But it is an acceptable approximation.
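The gap between the exponential number of leaves and the number of available cells can be checked with a few lines of Python. The function names are hypothetical; the formulas are the ones quoted in the text.

```python
def leaf_count(K):
    """Active leaves of the complete computation tree at step K."""
    return 2 ** (K - 1)

def reachable_cells(K):
    """Upper bound on the cells of the infinite grid active at step K."""
    return K * (K + 1) // 2

def first_step_exceeding(bound):
    """Smallest step K at which the leaves outnumber bound(K)."""
    K = 1
    while leaf_count(K) <= bound(K):
        K += 1
    return K

# Leaves outnumber the K cells of the active diagonal from step 3 onwards,
# and all K(K+1)/2 reachable cells from step 5 onwards.
step_vs_diagonal = first_step_exceeding(lambda K: K)
step_vs_triangle = first_step_exceeding(reachable_cells)
```

So sequentialization of leaf activities becomes unavoidable after only a handful of steps, whatever the size of the surface.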
Fig. 3 .  
Fig. 3 shows that the active diagonal of the infinite grid is mapped on
at most two diagonals of the finite grid, i.e. at most M cells out of the
M*M are active.
(Algebraically, for a given value of i + j, there are at most two values of
x + y (0 ≤ x + y < 2M - 1) fulfilling R, namely:
(i + j) mod M
and, if (i + j) mod M < M - 1: (i + j) mod M + M.)
Hence, if the active diagonal approximation is correct, the straight torus
topology leads to a poor distribution of the computation over the Processing
Surface.
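The poor distribution can be observed directly by enumerating the cells hit by an active diagonal. The sketch below assumes the straight-torus mapping x = i mod M, y = j mod M; the function names are hypothetical.

```python
def straight_cells_on_diagonal(K, M):
    """Cells of the M*M straight torus hit by the diagonal i + j = K."""
    return {(i % M, (K - i) % M) for i in range(K + 1)}

def diagonals_used(K, M):
    """Distinct values of x + y among those cells."""
    return sorted({x + y for x, y in straight_cells_on_diagonal(K, M)})

# On a 4*4 torus, the 16 vertices of the diagonal i + j = 15 share 4 cells.
cells_4 = straight_cells_on_diagonal(15, 4)

# On a 3*3 torus, the diagonal i + j = 7 lands on the two finite diagonals
# x + y = 1 and x + y = 4; the diagonal i + j = 8 lands on x + y = 2 only.
sums_7 = diagonals_used(7, 3)
sums_8 = diagonals_used(8, 3)
```

However long the active diagonal grows, the number of distinct cells it occupies never exceeds M.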
9. THE TWISTED TORUS
Obviously, the drawback of the straight torus is caused by the symmetry
of the tiling of fig. 3 around the axis i = j. We can destroy the symmetry
by shifting the tiling by one position and in one direction, as shown by fig. 4.
Now we see that for the same active diagonal, more diagonals of the finite grid
are occupied. In fact, it can be proved that the distribution of the active
diagonal over the finite grid is now optimal: if the active diagonal contains no
more than M*M nodes, no two nodes of the active diagonal are mapped on the same
cell of the torus.
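The optimality claim can be checked numerically under one assumed form of the twist. In the sketch below, crossing a tile boundary in the j direction shifts x by one position, i.e. x = (i + l) mod M with l = j div M and y = j mod M. Both the direction of the shift and the function names are assumptions made for the illustration, not a formula taken from the paper.

```python
def twisted_cell(i, j, M):
    """Assumed single twist: each wrap-around in j shifts x by one position."""
    l, y = divmod(j, M)
    return ((i + l) % M, y)

def twisted_cells_on_diagonal(K, M):
    """Cells hit by the K + 1 vertices of the diagonal i + j = K."""
    return [twisted_cell(i, K - i, M) for i in range(K + 1)]

# On a 4*4 twisted torus the 16 vertices of the diagonal i + j = 15
# are mapped on 16 distinct cells.
cells = twisted_cells_on_diagonal(15, 4)
distinct = len(set(cells))

# Chains: constant i visits every cell, constant j stays in one column.
horizontal = {twisted_cell(0, j, 4) for j in range(16)}
vertical = {twisted_cell(i, 0, 4) for i in range(16)}
```

Under this assumed twist a chain with constant i visits all M*M cells, while a chain with constant j stays within a single column of M cells.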
This tiling corresponds to the tiling relation:
Fig. 4. 
10. THE DOUBLY TWISTED TORUS
The same result could have been reached by using a rectangular straight
torus of M*P cells where M and P are relatively prime.
The difference is that in a twisted torus, a horizontal chain of nodes, i.e.
the succession of nodes with constant i, is mapped on a cycle containing
all cells of the torus, i.e. on a cycle of length M*M. On the rectangular
torus, such a structure is mapped on a cycle of length M (one row of the
torus). In both cases a vertical chain (constant j) is mapped on the cells
of only one column.
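The remark about relatively prime dimensions can be checked with a small Python sketch. It assumes the straight mapping x = i mod M, y = j mod P on a rectangular torus; the function name is hypothetical.

```python
from math import gcd

def rectangular_cells_on_diagonal(K, M, P):
    """Cells of an M*P straight torus hit by the diagonal i + j = K."""
    return {(i % M, (K - i) % P) for i in range(K + 1)}

# 3*4 torus, gcd(3, 4) = 1: the 12 vertices of i + j = 11 use all 12 cells.
coprime_cells = rectangular_cells_on_diagonal(11, 3, 4)

# 4*6 torus, gcd(4, 6) = 2: the 24 vertices of i + j = 23 use only 12 cells.
shared_factor_cells = rectangular_cells_on_diagonal(23, 4, 6)
```

When M and P are relatively prime, the pair (i mod M, i mod P) identifies i modulo M*P, which is why no two vertices of a diagonal of at most M*P vertices collide.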
In view of certain degenerate binary trees, which reduce to a chain of only
right or left procedure calls, it could be interesting to twist the torus in
both directions in such a way that a vertical chain is also mapped on all
cells of the torus.
To avoid reintroduction of the symmetry, the torus must be twisted in
opposite directions in the two dimensions (e.g. +1 for the rows, and -1 for
the columns).
The fact that the corresponding tiling relation:
i = x + k*M - l
j = y + l*M + k
has no solution for (i, j) = (M(q+1) - p, pM + q) means that such
a tiling does not represent a "plane" surface. By "plane" we mean that if, on the
infinite grid, point B is reached from point A by r horizontal steps
and s vertical ones, B is also reached from A by any permutation of
these steps. This is no longer true for this doubly twisted torus.
This is shown by the following counter-example. Consider the 3*3 doubly
twisted torus of fig. 5(a). From point A, one horizontal step (indicated
in fig. 5 by a dotted path), followed by one vertical step (indicated in
fig. 5 by a dashed path) leads to point B (fig. 5(b)). From point A,
one vertical step followed by one horizontal step leads to point C (fig.
5(c)). Points B and C are different.
( a )  (b) (c) 
Fig. 5. 
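The counter-example can be reproduced in Python under an assumed link rule for the doubly twisted torus: a horizontal step leaving the last column re-enters at column 0 one row further (+1), and a vertical step leaving the last row re-enters at row 0 one column back (-1). The rule, the choice of the starting cell A = (2, 2), and the function names are assumptions for this sketch; the figure itself is not available in the text.

```python
def right(cell, M):
    """Horizontal step; wrapping around a row shifts x by +1."""
    x, y = cell
    if y < M - 1:
        return (x, y + 1)
    return ((x + 1) % M, 0)

def down(cell, M):
    """Vertical step; wrapping around a column shifts y by -1."""
    x, y = cell
    if x < M - 1:
        return (x + 1, y)
    return (0, (y - 1) % M)

def chain(start, step, M):
    """Cells visited by M*M successive steps of the same kind."""
    cells, cell = set(), start
    for _ in range(M * M):
        cells.add(cell)
        cell = step(cell, M)
    return cells

M = 3
A = (2, 2)
B = down(right(A, M), M)    # horizontal step, then vertical step
C = right(down(A, M), M)    # vertical step, then horizontal step
```

With these assumed links B is (1, 0) and C is (0, 2), so the two orders of steps end on different cells, while both a purely horizontal and a purely vertical chain visit all nine cells.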
This drawback is only significant if one wants to implement computation
graphs other than trees. In a tree, there is only one path between two
points. If one wants to maintain the planarity of the torus, one must look
for tessellations of the plane that are not square, and yet still use the
double twist. Two are given in fig. 6. The first one is due to Carlo Sequin
[2].
Fig. 6. 
11. CONCLUSION
A method has been proposed to construct highly parallel and distributed
systems where the basic hardware building blocks are whole processor and
storage modules, and the basic software building block is the procedure.
The main aspects of the method are the following.
First, on such a Processing Surface the location of variables relative
to the processors using them is a relevant factor. Scope rules have therefore
been introduced in the programming language, which allow the programmer to
determine the "distance" between the variables or procedure parameters and
the actions where they are used.
Second, since intense communications between adjacent modules are
expected, we have attempted to smooth away the discontinuity in variable
access caused by the boundary between storage modules. For this purpose,
the access to distant variables has been limited to neighbour variables by
a "locality rule". Furthermore the processor and storage modules have been
arranged in such a way that no extraneous communication procedure is needed
to "move" variable values over a storage module boundary. The result is
called a continuous Processing Surface.
Third, by using a "boundary-less" topology for the surface (here, a
torus), the automatic diffusion of a divide-and-conquer computation through
the surface leads to an optimal spreading of the load over the modules.
The programmer need not know the actual number of modules, and no
complicated scheduling is required.
12. HISTORY AND ACKNOWLEDGEMENTS
The first torus machine was built at the beginning of 1979 at
Philips Research Laboratories. It is a twisted torus of 36 cells. Each
cell consists of two INTEL chips (one processor with a 1K byte ROM, and
one 256 byte RAM). It is not a Processing Surface but a network of machines
communicating by explicit message exchanges.
Acknowledgement is due to W.J. Lippmann and G.A. Slavenburg for their
invaluable cooperation during the construction of this machine. Without their
hardware and software competence, it would never have been completed within
such a short term. The fact that it was completed within 3 months is also
a consequence of the regularity of the structure. Acknowledgement is also
due to C.S. Scholten for several valuable comments on the first paper on
the subject [3], and to Alan Davis whose comments on the manuscript led
to many improvements.
REFERENCES
[1] A.J. Martin: "A Distributed Implementation Method for Parallel
Programming." Proceedings IFIP Congress 80 - October 1980.
[2] C.H. Sequin: "Doubly Twisted Torus Networks for VLSI Processor
Arrays." December 3, 1980 - draft.
[3] A.J. Martin: "A Distributed Architecture for Parallel Recursive
Computations." Philips. AJM18 - September 1979.
