Implementing direct, spatially isolated problems on transputer networks by Ellis, Graham K.
NASA Technical Memorandum 101297 
ICOMP-88-14 




Implementing Direct, Spatially Isolated 
Problems on Transputer Networks 
Graham K. Ellis 
Institute for Computational Mechanics in Propulsion 




LEWIS RESEARCH CENTER 
https://ntrs.nasa.gov/search.jsp?R=19880018412 2020-03-20T05:41:29+00:00Z
IMPLEMENTING DIRECT, SPATIALLY ISOLATED PROBLEMS ON TRANSPUTER NETWORKS 
Graham K. Ellis*
Institute for Computational Mechanics in Propulsion
Lewis Research Center
Cleveland, Ohio 44135
SUMMARY 
Parametric studies have been performed on transputer networks of up to 40
processors to determine how to implement and maximize the performance of the
solution of problems where no processor-to-processor data transfer is required
for the problem solution (spatially isolated).
Two types of problems were investigated in this study: a computationally
intensive problem where the solution required the transmission of 160 bytes of
data through the parallel network, and a communication intensive example that
required the transmission of 3 Mbytes of data through the network. This data
consists of solutions being sent back to the host processor and not
intermediate results for another processor to work on.
Studies were performed on both integer and floating-point transputers.
The floating-point transputer features an on-chip floating-point math unit and
offers approximately an order of magnitude performance increase over the
integer transputer on real valued computations.
The results indicate that a minimum amount of work is required on each
node per communication to achieve high network speedups (efficiencies). The
floating-point processor requires approximately an order of magnitude more work
per communication than the integer processor because of the floating-point
unit's increased computing capability.
INTRODUCTION 
With the advent of multiple-instruction, multiple-data (MIMD) stream
parallel processors comes the problem of distributing the problem of interest
onto a network of processors for solution. This paper discusses some techniques
that can be used for implementing direct, spatially isolated problems on
transputer networks.
A direct problem is one that requires a known number of iterations to
solve the problem (i.e., not iterative). The term spatially isolated refers to
the class of problems that can be divided such that no data are required from
any other processor for the problem solution. This, however, does not preclude
the necessity of distributing and collecting initial data and final answers on
the transputer network.
Direct, spatially isolated problems are investigated because of their
simplicity when compared with iterative, data coupled problems such as matrix
inversion. By using a simple problem, the need to perform dynamic load
balancing to keep all of the processors in the network busy and the extra
programming required for intelligent network communication can be eliminated.
*Senior Research Associate (work funded under Space Act Agreement C99066G).
In general, there are two approaches to distributing a problem onto a
parallel processing network. The first is to let each processor perform the same
computations but on different data. The other method is to distribute the
computation itself over the network so that each processor performs only part of
the numerical computation. The first method will be investigated in this
paper.
Two types of direct, spatially isolated problems are studied in this paper:
a computationally intensive problem and a communication intensive problem. The
computationally intensive problem is a definite integral that is used to compute
an approximation to pi (ref. 1). The integral is evaluated using the rectangle
rule. The communication intensive problem is a mapping of the complex plane
used to visually search for complex roots of polynomials (private communication
with Alan Palazzolo of Texas A&M). The communication limited problem occurs
because of the size of the results of the computation. The computed data are
quite large and consequently cause a communication bottleneck in the network.
This paper first provides an introduction to the hardware and software
features of the transputer and then the programming and performance of the
computationally intensive and the communication intensive problems are discussed.
A working knowledge of Occam is assumed.
HARDWARE 
A transputer is a microprocessor containing memory, serial links that
allow point-to-point connection of networks of transputers and an optional
floating-point math unit on a single VLSI chip. The transputer is a Reduced
Instruction-Set Computer (RISC) which gives high-performance rates for many
types of computations.
The IMS T414 transputer contains 2 kbytes of 50 ns on-chip RAM and four
bidirectional serial links that can operate at 5, 10, and 20 Mbits/sec data
transfer rates. The 20-MHz version of the T414 is capable of approximately
10 MIPS and 100 kflops. The IMS T800 transputer contains 4 kbytes of 50 ns
on-chip RAM, four bidirectional serial links at 5, 10, and 20 Mbits/sec and an
on-chip floating-point unit. A 20-MHz T800 is capable of approximately
10 MIPS and 1.5 Mflops (ref. 2). A block diagram of the T800 is shown in
figure 1.
The transputer's links are autonomous Direct Memory Access (DMA) engines
that upon initialization can transfer data without any processor intervention.
The bandwidth of the microprocessor bus is such that all four serial links
(both input and output) can operate at 20 Mbits/sec concurrently.
The transputer system used in this study consists of a transputer
host/development system that runs on a PC AT compatible. The development board
contains a 15-MHz T414 processor with 2 Mbytes RAM. The transputer network
consists of 40 20-MHz transputers with 256 kbytes DRAM per processor and a
transputer based medium performance graphics board. The graphics board
supports a resolution of 512 by 512 pixels and 256 simultaneous colors.
Simulations using either 40 T414 or 40 T800 transputers were performed to
determine the advantage of the on-chip floating-point math unit.
The link speed for all the tests was set at 10 Mbits/sec. The network was
wired as a 10 processor by 4 processor torus. This architecture is convenient
because many other architectures of interest can be mapped onto a torus.
Examples of other architectures that map onto a 10- by 4-torus are a
two-dimensional mesh, ring, pipeline, and hypercubes of order 1, 2, or 3.
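The mapping of a pipeline or ring onto the torus can be illustrated with a short Python sketch. The row-major node numbering and the function names are assumptions made for illustration only; the paper does not give the wiring list. Each node uses exactly four links, and a snake-like walk through the rows visits every node once using only torus links.

```python
# Sketch only: node numbering (row-major, 0..39) is an assumption, not the paper's wiring.
COLS, ROWS = 10, 4

def torus_neighbours(node, cols=COLS, rows=ROWS):
    """Return the four link partners (west, east, north, south) of `node`."""
    r, c = divmod(node, cols)
    west = r * cols + (c - 1) % cols
    east = r * cols + (c + 1) % cols
    north = ((r - 1) % rows) * cols + c
    south = ((r + 1) % rows) * cols + c
    return [west, east, north, south]

def pipeline_order(cols=COLS, rows=ROWS):
    """Snake through the rows so that consecutive nodes are always torus neighbours."""
    order = []
    for r in range(rows):
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend(r * cols + c for c in columns)
    return order
```

Because the last node of the walk is also a neighbour of the first (through the wrap-around link), the same ordering closes into a ring.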
SOFTWARE 
All of the software for the performance tests in this paper was written
in occam. Occam was developed to easily implement communication and
concurrency (ref. 3). Occam can be used to describe the structure of a network or
system in terms of point-to-point communication channels. It is also used to
program the individual processors in the network.
The software was developed using the INMOS Transputer Development System
(TDS). The TDS is an integrated package consisting of an editor, occam
compiler, linker, network configurer, and other development tools (ref. 4). The
version used for the benchmarks was occam TDS, BETA2, D700C.
Occam allows a system to be described in terms of a collection of
concurrent processes (ref. 5). A process performs a sequence of actions and
terminates. Concurrent processes can only communicate with each other and with
peripheral devices through point-to-point channels. Occam programs are built
from three primitive processes (ref. 6).
Assignment
    v := e      assign expression e to variable v
Output
    c ! e       output expression e through channel c
Input
    c ? v       input from channel c to variable v
Constructors are used to combine primitive processes into larger processes.
The sequential constructor, SEQ, causes its components to be executed one after
another. A SEQ process terminates when the last component under the SEQ
terminates. The parallel construct, PAR, causes its components to be executed
concurrently. If a PAR is specified on a single processor, the processes are
time-sliced according to a round-robin scheduler built into the transputer
hardware. A PAR process terminates only after all of the components under the
PAR have terminated. The alternative construct, ALT, chooses one component
process for execution. If more than one component process is enabled to be
selected, only one will be selected. The selection of the process is
arbitrary. An ALT process terminates when the selected component terminates.
The transputer supports two priority levels for its operations. All
parallel processes running at a low priority get time sliced according to the
on-chip hardware scheduler built into the transputer. A low-priority time
slice is 5120 cycles long (approximately 1 ms) (ref. 7). A high-priority
process will run without interruption until it finishes or has to wait for
channel communication. The PRI PAR statement is used to prioritize a process.
Only two processes can appear under a PRI PAR statement. The first process
appearing after the PRI PAR statement is run at high priority. To run more







where a(), b(), c() and d() are occam procedures. Procedures a(), b() and c()
running under the PAR construct are considered to be a single process.
The code fragment listed above means processes a(), b(), and c() will all
run at high priority until they deschedule for communication or they
terminate. Each of the high-priority processes will run preemptively until it
deschedules. High-priority processes do not get time-sliced. The exact
ordering of the processes appearing in a PRI PAR is that the last lexically
appearing process will schedule first, and then the others will be queued up as
they appear top-to-bottom. In the example listed above, the high-priority process
list will appear as:
When none of the processes a(), b(), and c() can proceed, process d() gets
scheduled. If any of the processes in the high-priority process list gets
scheduled, it will interrupt the process d().
Any programs written using the PAR or PRI PAR constructs should not depend 
on the order that the processes are queued in process list for proper opera- 
tion; however, knowing how the occam compiler generates the code can be helpful 
when maximum performance is required. 
Channel communication is self-synchronizing. Communication occurs when 
one process outputs to a channel and another process inputs from it. If either 
the sender or receiver is not ready, the other process waits until both
processes are ready to continue. The programmer does not have to explicitly
specify the synchronization; it is performed automatically just by using the
input and output primitives presented previously. There is no implicit buffering of
channel communications. Asynchronous channel communication is not directly 
supported. 
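The self-synchronizing behaviour described above can be imitated in Python for illustration. The following is a minimal sketch, assuming one sender and one receiver per channel as in occam; the `Channel` class and `demo` function are hypothetical names, and Python threads stand in for occam processes.

```python
import queue
import threading

class Channel:
    """Point-to-point, unbuffered channel: send() returns only after recv() has taken the value."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def send(self, value):          # plays the role of occam's  c ! e
        self._slot.put(value)
        self._ack.get()             # block until the receiver has the value

    def recv(self):                 # plays the role of occam's  c ? v
        value = self._slot.get()
        self._ack.put(None)         # release the waiting sender
        return value

def demo():
    """Show that the sender stays blocked until the receiver is ready."""
    c = Channel()
    log = []

    def producer():
        c.send(42)
        log.append("sent")

    t = threading.Thread(target=producer)
    t.start()
    t.join(timeout=0.1)             # nobody has received yet, so the sender cannot finish
    blocked = t.is_alive()
    value = c.recv()
    t.join()
    return blocked, value, log
```

The acknowledgement queue is what removes the buffering: without it, `send` would return as soon as the value was stored, which is not how an occam channel behaves.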
COMPUTATION INTENSIVE EXAMPLE 
The computationally intensive problem to be solved on a transputer network 
is a small sized problem with very little data transfer through the network. 
The problem is the Pi Program (ref. 1). It computes an approximation to pi by 
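using the rectangle rule to approximate the following definite integral:
$$\pi = \int_0^1 f(x)\,dx$$
where
$$f(x) = \frac{4}{1 + x^2}$$
and the rectangle rule states:
$$R_n(f) = h \sum_i f(x_i)$$
where
$$h = \frac{1}{n}, \qquad x_i = \left(i - \tfrac{1}{2}\right) h$$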
In order to implement the pi program on a network of transputers, first a 
suitable architecture must be chosen. Unlike many multiprocessors whose archi- 
tecture is fixed, transputers can be wired in any configuration supported with 
four connections per processor. 
Since the programmer must develop all of the communication routing algo- 
rithms for the network, it is currently more convenient to implement a simple, 
regular architecture and use simple communication procedures instead of a more 
complicated scheme. Additionally, the DMA link engines allow data to be piped 
through a network of processors with little performance penalty. 
The problem will be implemented on a pipeline of transputers. A pipeline 
is used because it is very easy to implement the required communication buffers 
for the network data transfer. Simulations will be run for 1, 8, 16, 32, and
40 processors. Both T414 and T800 versions of the transputer will be tested. 
As a baseline, the pi program was implemented on a single transputer (the 
development board) using only SEQ processes. This version gives a datum from 
which the network computations can be compared. The listing of the single pro- 
cessor sequential version of the pi program is given in appendix A. 
The next step was to simulate the desired network of transputers on a sin- 
gle transputer using the PAR construct. One of the advantages of Occam is the 
ability to simulate the network configuration on a single processor. The dis- 
cussion for the simulated parallel network and the actual parallel network are 
combined since they are conceptually identical. The listing of the single pro- 
cessor parallel version of the pi program is presented in appendix B and the 
network version is listed in appendix C. 
As previously mentioned, a pipeline of processors (or processes for the 
single processor case) is used because of the simple communication protocol 
required. The current generation of software tools for transputers is such 
that the programmer must explicitly define and program the processor-to- 
processor data routing procedures. The pipeline and communication buffers are 
shown in figure 2. 
The single processor sequential implementation of the rectangular
integration is quite simple. The occam code fragment for the rectangle rule
computation is shown below.
SEQ i = 0 FOR number.intervals
  SEQ
    xi := ((REAL32 TRUNC i) - 0.5 (REAL32)) * delta.x
    sum := sum + (delta.x * (4.0 (REAL32) / (1.0 (REAL32) + (xi * xi))))
where
    number.intervals    the total number of intervals to use for the integration
    delta.x             the width of each interval in the integration
    xi                  temporary storage
    sum                 the value of the integral
    i                   loop counter
Note the strict data typing occam requires. This strict typing ensures that
every occam expression is type-consistent.
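For readers without an occam compiler, the same rectangle rule computation can be sketched in Python. The function name is hypothetical, and the sample point is taken as the midpoint `(i + 0.5) * delta_x` for `i` counted from 0; that indexing is an assumption made here so that every sample lies inside [0, 1].

```python
import math

def pi_rectangle(number_intervals):
    """Midpoint rectangle rule for the integral of 4/(1 + x^2) over [0, 1]."""
    delta_x = 1.0 / number_intervals
    total = 0.0
    for i in range(number_intervals):
        xi = (i + 0.5) * delta_x          # midpoint of interval i (assumed indexing)
        total += delta_x * (4.0 / (1.0 + xi * xi))
    return total

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, abs(pi_rectangle(n) - math.pi))
```

The printed error shrinks roughly with the square of the interval width, which is the expected behaviour of the midpoint form of the rule.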
The obvious method for distributing the integration onto a network of N
processors is to divide the interval [0, 1] into N equal segments and let
each processor work on a subset of the interval. The real work for
implementing the parallel solution of the pi problem is in writing the
communication routines for the network.
Since the number of processors is known at compile time, each processor
can be assigned a unique number that can be used to determine the start and end
values for the interval on each processor. The computation required to compute
a local interval [u, v] is as follows:
where
    n    processor number (numbers start at 0)
    c    total number of intervals on [0, 1]
    N    total number of processors in network
Note that the number c/N is a constant. Because of this, the host
computer can compute it once and send it to every processor. The value c/N is
really the number of local intervals to be computed on each processor.
The formulation for the rectangle rule requires the multiplier value
h = 1/c for the proper function evaluation. Therefore, the only values
that must be output to the network to start the simulation are the
value for h and the value for c/N.
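The bookkeeping for the local intervals can be sketched in Python as follows. This is one reading that agrees with the compute-process loop, which starts at `processor.number * local.intervals` and runs for `local.intervals` steps; the inclusive bounds, the function names, and the requirement that N divides c evenly are assumptions.

```python
def local_range(n, c, N):
    """First and last interval index [u, v] for processor n (0-based, inclusive bounds assumed)."""
    cn = c // N                 # c/N, assumed to divide evenly
    u = n * cn
    v = u + cn - 1
    return u, v

def start_values(c, N):
    """The two values the host sends to start a run: h = 1/c and c/N."""
    return 1.0 / c, c // N
```

With c = 1000 and N = 8, processor 0 handles intervals 0 to 124 and processor 7 handles 875 to 999, so the eight ranges tile [0, 1] without gaps or overlap.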
INPUT BUFFER PROCESS 
To distribute the work onto the network, some sort of input buffer routine
must be written. Since the data needs to propagate down the pipe, the obvious
scheme is to read the data from the host and pass it to the local computation
process and also to the next processor (process) in the network. The data can
be sent from the input buffer to the local compute process and the next
processor in parallel. This also increases the performance of the network since the
serial links can operate autonomously without processor intervention once the
initial communication is set up. The channel set-up time for a communication
is approximately 1 μsec (20 machine cycles) (ref. 8). The code fragment
showing the input buffer routine is given below:
SEQ
  in ? delta.x; local.intervals
  PAR
    to.local.compute ! delta.x; local.intervals
    to.next.processor ! delta.x; local.intervals
where
    delta.x              the width of each interval in the integration
    local.intervals      the number of intervals to be computed on this processor
    in                   the input channel
    to.local.compute     the channel to the computing process on this node
    to.next.processor    the channel to the input buffer on the adjacent processor
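The forwarding scheme can be tried out on an ordinary workstation with the Python sketch below. It is an illustration under stated assumptions: `queue.Queue` objects stand in for channels (they are buffered, unlike occam channels), threads stand in for processes, and the names `input_buffer` and `run_pipeline` are hypothetical.

```python
import queue
import threading

def input_buffer(inp, to_local_compute, to_next_processor):
    """Read the start-up values once, then pass them to both destinations in parallel."""
    message = inp.get()                        # occam: in ? delta.x; local.intervals
    senders = [
        threading.Thread(target=channel.put, args=(message,))
        for channel in (to_local_compute, to_next_processor)
    ]
    for s in senders:                          # occam: PAR of the two outputs
        s.start()
    for s in senders:
        s.join()

def run_pipeline(num_nodes, delta_x, local_intervals):
    """Chain num_nodes input buffers; return what each local compute process receives."""
    links = [queue.Queue() for _ in range(num_nodes + 1)]    # links[k] feeds node k
    local = [queue.Queue() for _ in range(num_nodes)]
    nodes = [
        threading.Thread(target=input_buffer, args=(links[k], local[k], links[k + 1]))
        for k in range(num_nodes)
    ]
    for t in nodes:
        t.start()
    links[0].put((delta_x, local_intervals))   # the host starts the run
    for t in nodes:
        t.join()
    return [q.get() for q in local]
```

Calling `run_pipeline(4, 0.001, 250)` delivers the same pair of values to all four local compute queues; the copy forwarded by the last node is simply left unread in this sketch.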
COMPUTE PROCESS 
The compute process is similar to the sequential program version; the
local start and stop points [u, v] are used instead of the whole interval
[0, 1]. The whole process consists of reading in delta.x and the number of
local intervals from the input buffer process, and then performing the required
computations. The partial sum is then sent to the output buffer process. The
code fragment showing the computation process is given below:
SEQ
  from.input.process ? delta.x; local.intervals
  sum := 0.0 (REAL64)
  SEQ i = (processor.number * local.intervals) FOR local.intervals
    SEQ
      xi := ((REAL64 TRUNC i) - 0.5 (REAL64)) * delta.x
      sum := sum + (delta.x * (4.0 (REAL64) / (1.0 (REAL64) + (xi * xi))))
  to.output.process ! sum
where
    delta.x               the width of each interval in the integration
    local.intervals       the number of intervals to be computed on this processor
    from.input.process    the input channel from this node's input buffer
    to.output.process     the channel to the output buffer process on this node
    i                     loop counter
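A Python sketch of the same per-node computation shows that the partial sums from all processors add up to the full integral. The function names are hypothetical, and the midpoint is again taken as `(i + 0.5) * delta_x` for `i` counted from 0, which is an assumption of this sketch.

```python
import math

def compute_process(processor_number, delta_x, local_intervals):
    """Partial sum over the block of intervals owned by one processor."""
    start = processor_number * local_intervals
    total = 0.0
    for i in range(start, start + local_intervals):
        xi = (i + 0.5) * delta_x              # midpoint of interval i (assumed indexing)
        total += delta_x * (4.0 / (1.0 + xi * xi))
    return total

def network_pi(total_intervals, num_processors):
    """Sum the partial results of every processor, as the host would."""
    delta_x = 1.0 / total_intervals
    local_intervals = total_intervals // num_processors   # c/N, assumed to divide evenly
    return sum(
        compute_process(n, delta_x, local_intervals)
        for n in range(num_processors)
    )
```

Each call to `compute_process` depends only on its own processor number and the two start-up values, which is what makes the problem spatially isolated.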
OUTPUT BUFFER PROCESS 
Once the local data has been computed, the partial sum needs to be sent
back to the host so it can be combined with the results from the other
processors. The output buffer needs to be able to read data from both the local
compute process and the adjacent processor's output node. Since the order in
which the results appear on the network is unknown, the occam ALT construct is
used. The code fragment showing the output buffer routine is shown below:
WHILE TRUE
  ALT
    from.local.compute ? local.sum
      link.out ! local.sum
    from.adjacent.processor ? sum
      link.out ! sum
where
    local.sum                  temporary storage for the partial sum from the local
                               compute process
    sum                        temporary storage for the partial sum from the
                               adjacent node's output buffer process
    from.local.compute         the channel from the local compute node
    from.adjacent.processor    the channel from the adjacent node's output buffer
                               process
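Python has no direct equivalent of ALT, so the sketch below imitates it by polling two queues and forwarding whichever has data. This is an illustration only: the `expected` argument is an assumption added so the loop can terminate (the occam version runs forever), and polling is a stand-in for the hardware-supported ALT.

```python
import queue
import time

def output_buffer(from_local_compute, from_adjacent_processor, link_out, expected):
    """Forward `expected` values from either source to link_out, in arrival order."""
    forwarded = 0
    while forwarded < expected:
        progressed = False
        for source in (from_local_compute, from_adjacent_processor):
            if forwarded == expected:
                break
            try:
                value = source.get_nowait()   # take a value only if one is ready
            except queue.Empty:
                continue
            link_out.put(value)
            forwarded += 1
            progressed = True
        if not progressed:
            time.sleep(0.001)                 # nothing ready on either input; wait briefly
    return forwarded
```

In a pipeline, node k would expect one value from its own compute process plus one from every node further down the pipe.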
HOST PROCESS 
The host process starts the simulation by sending out the desired delta x
and number of local intervals. The host process then waits for the results
from the network. Since the number of processors in the network is known, the
number of data packets received is recorded and when all packets have been
received from the network, the program terminates. The occam code fragment for
the host process is listed below:
out ! h; cn
replys := 0
total := 0.0 (REAL64)
WHILE replys < num.processors
  SEQ
    in ? partial.sum
    total := total + partial.sum
    replys := replys + 1
where
    out               the output channel to the first processor in the pipeline
    h                 (1/number of total intervals) on [0, 1]
    cn                c/N, the number of local intervals on each processor
    replys            the number of replies received from the network
    num.processors    number of processors in network
    total             the approximate value for pi
    partial.sum       a single processor's contribution to total
    in                the input channel from the first node in the pipeline
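The host's collect-and-count loop can be sketched in Python with worker threads standing in for network nodes. This is a simplified illustration: all workers write to one shared result queue instead of relaying through a pipeline, the names `node` and `host` are hypothetical, and the midpoint indexing `(i + 0.5) * h` is an assumption.

```python
import math
import queue
import threading

def node(processor_number, h, cn, results):
    """One network node: compute the local partial sum and send it back."""
    start = processor_number * cn
    partial = sum(
        h * 4.0 / (1.0 + ((i + 0.5) * h) ** 2)
        for i in range(start, start + cn)
    )
    results.put(partial)

def host(total_intervals, num_processors):
    """Start the workers, then add partial sums until every node has replied."""
    h = 1.0 / total_intervals
    cn = total_intervals // num_processors     # c/N, assumed to divide evenly
    results = queue.Queue()
    workers = [
        threading.Thread(target=node, args=(n, h, cn, results))
        for n in range(num_processors)
    ]
    for w in workers:
        w.start()
    replys = 0                                  # spelling follows the occam fragment
    total = 0.0
    while replys < num_processors:
        total += results.get()                  # blocks until some node replies
        replys += 1
    for w in workers:
        w.join()
    return total
```

As in the occam version, the host needs no knowledge of which node a reply came from; it only counts replies.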
PERFORMANCE 
The pi program was evaluated on 1, 8, 16, 32, and 40 transputers. For
each case the number of total intervals for the integration was varied from
10^3 to 10^7 in powers of 10. The first performance tests were to determine the
optimum priorities for computation and communication.
The computation times were obtained using the Occam TIMER statement to
read the on-chip timer. The low-priority timer has a resolution of 64 μsec
per tick.
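Converting two timer readings into an elapsed time is simple arithmetic, sketched below in Python. The 64 μsec tick comes from the text; treating the timer value as a 32-bit integer that wraps around is an assumption of this sketch, as is the function name.

```python
TICK_USEC = 64            # low-priority timer resolution quoted in the text
WORD_MASK = 0xFFFFFFFF    # assumes the timer value is a wrapping 32-bit word

def elapsed_seconds(start_ticks, stop_ticks):
    """Elapsed time between two timer readings, tolerant of one wrap-around."""
    ticks = (stop_ticks - start_ticks) & WORD_MASK
    return ticks * TICK_USEC / 1e6
```

For example, 15625 ticks correspond to exactly one second, and a run shorter than one tick cannot be resolved at all, which matters for the smallest interval counts.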
The recommended method of programming transputer networks is to assign a 
high priority to communication and low priority to computation (ref. 9). To
verify this, three different process priority configurations were tested. Each 
case uses three processes that run in parallel on each node. The three cases 






compute()

input.buffer()
output.buffer()
All processes low-priority:
PAR
  input.buffer()
  output.buffer()
  compute()
The results for the communication tests for a T800 floating-point
transputer network are shown in figures 3 to 5. The optimum case is to assign
priority to communication. The reason that prioritizing communication increases
network performance is that more processors can be kept busy. By interrupting
processing to set up a data transfer, other processors in the network receive
data to work on rather than waiting until the adjacent processor finishes its
computations. Another reason that it helps to prioritize communication is that
the transputer links, once initialized, can transfer data without processor
intervention. By prioritizing the data transfer, concurrent computation and
communication can occur on a single transputer.
Given that priority should be given to communication, the pi program
benchmark was run on both T414 and T800 networks. Both 32-bit and 64-bit math
versions were tested. The results from these tests for the T800 floating-point
processors are shown in figures 6 and 7. A comparison of the floating-point
performance of the T414 and T800 shows that the T800 has approximately an order
of magnitude increase in performance over the T414 for 32-bit computations.
There is approximately a 40 times speedup on the T800 over the T414 on 64-bit
floating-point computations. This can be reconciled by the fact that the T800
has a 64-bit floating-point unit built into the hardware and the T414 must
build 64-bit numbers out of 32-bit operands, which typically takes four times as
many operations as 32-bit computations.
Note, however, that for a small number of total intervals for the integral
computation the network performance is faster for 8 processors when
compared to 40 processors. The extra communication time for distributing a small
work load over 40 processors causes a decrease in performance relative to the
8 processor case, where less communication time is spent distributing the work
and each processor has more data to work on.
Normally the speedup of a network is defined as:
$$\text{Speedup} = \frac{\text{Solution time for 1 processor}}{\text{Solution time for } N \text{ processors}} = \frac{t_1}{t_n}$$
or often the speedup can be normalized to express efficiency. One hundred
percent efficiency means there is no overhead for communication on the network.
$$\text{efficiency (\%)} = \frac{t_1 (100)}{t_n (N)}$$
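The two definitions translate directly into code. The Python sketch below uses hypothetical function names, and the timings in the usage note are made-up illustrative values, not measurements from this study.

```python
def speedup(t1, tn):
    """Speedup: single-processor solution time divided by N-processor solution time."""
    return t1 / tn

def efficiency_percent(t1, tn, n_processors):
    """Speedup normalized by the number of processors, expressed in percent."""
    return 100.0 * t1 / (tn * n_processors)
```

For instance, a hypothetical single-processor time of 40.0 s and a 40-processor time of 1.25 s would give a speedup of 32 and an efficiency of 80 percent.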
Since the development board being used is a 15-MHz T414 it is not
meaningful to compare the development board results with the 20-MHz T414 and T800
network simulations. The first method of generating single node timings for
speedup computations was to use the development board and a single T800. This
should not be as fast as a single T800 running the problem because of network
overhead.
In order to get results for a single processor, a B004 development board
was modified to use a 20-MHz T800 floating-point transputer. Also, the
modified B004 development board changed the number of wait states on the external
memory from five-cycle at a 15-MHz clock to three-cycle at a 20-MHz clock which
equates to going from a 330-ns memory cycle to a 150-ns memory cycle. This
modification still makes the comparison to the network nodes difficult since
they use four-cycle memory (200-ns cycle). Another complicating factor is that
the transputer loader used for these tests does not load the on-chip RAM on the
development board so the 50-ns on-chip memory cannot be used and memory access
times are not the same as the processors on the network. The times for the
development board T800 could be expected to be at most 25 percent faster than
would be expected using a four-cycle external memory.
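The memory cycle figures quoted above follow from the cycle count and the clock rate. The short Python sketch below reproduces that arithmetic; the function name is hypothetical, and reading the 25 percent bound as the ratio of the 150-ns and 200-ns cycle times is an interpretation, not a statement from the source.

```python
def memory_cycle_ns(cycles, clock_mhz):
    """External memory cycle time in nanoseconds for a given cycle count and clock."""
    return cycles * 1000.0 / clock_mhz

original_board = memory_cycle_ns(5, 15)    # about 333 ns (quoted as 330 ns)
modified_board = memory_cycle_ns(3, 20)    # 150 ns
network_node = memory_cycle_ns(4, 20)      # 200 ns

# Fraction by which the modified board's memory cycle is shorter than a network node's.
reduction = 1.0 - modified_board / network_node
```

Here `reduction` evaluates to 0.25, which matches the 25 percent upper bound for a program limited entirely by external memory access.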
The results for the 20-MHz T800 development board, a 15-MHz T414
development board, a single 20-MHz T800 network node (using a 15-MHz T414 as a host)
and a PC/AT 80286/80287 at 8 MHz are shown in figure 8. The single-node
network T800 was about 33 percent faster than the T800 on the development node.
This is probably due to the loader not taking advantage of the on-chip memory
on the development board. The use of the on-chip memory should make the T800
on the development board run faster since there is no network overhead for a
single processor. The T414 shows performance approximately an order of
magnitude slower than the T800. The 80286/80287 chip set at 8 MHz is approximately
50 percent slower than the T414 at 15 MHz.
Speedup and efficiency computations using the single T800 on the development
board as t1 and the network times as tn are shown in figures 9 and 10.
Figure 9 shows the speedup of the network for 8 to 40 processors. For fewer
than 10000 intervals, the 8-processor network is the fastest. Figure 10 shows
the same data as figure 9, except that the speedup has been normalized by the
number of processors to express efficiency. The difficulty in using figure 10
is that execution time must be computed from the plot. It is difficult to tell
how much faster, if at all, the 8-processor case is when compared to the
40-processor case for 1000 intervals.
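The relationship between the two plots, and the extra step needed to get an execution time back out of an efficiency plot, can be sketched in Python as follows. The timings used are hypothetical values in arbitrary units, chosen only to make the arithmetic easy to follow; they are not measurements from this study.

```python
def speedup(t1: float, tn: float) -> float:
    """Speedup: single-processor time divided by network time."""
    return t1 / tn


def efficiency(t1: float, tn: float, n: int) -> float:
    """Efficiency: speedup normalized by the number of processors."""
    return speedup(t1, tn) / n


def time_from_efficiency(t1: float, eff: float, n: int) -> float:
    """Recover the network execution time from an efficiency value read off a plot."""
    return t1 / (eff * n)


# Hypothetical timings in arbitrary units; not data from the report.
t1_example = 100.0
tn_example = 4.0
n_example = 40

s = speedup(t1_example, tn_example)
e = efficiency(t1_example, tn_example, n_example)
tn_back = time_from_efficiency(t1_example, e, n_example)
```

Comparing two network sizes from an efficiency plot therefore requires one such conversion per curve, which is why the speedup plot is easier to read for that purpose.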
Instead of just measuring t1/tn for speedups, another method of quantifying
speedup is to split the solution time into communication and processing
components. This can be used to determine the maximum possible network
performance as a function of the ratio of communication time to processing
time. The equation assumes that there is no sequential (serial) part of the
program to slow down the network; that is, the problem is perfectly
parallelizable. The speedup equation is written in terms of how long it takes
to implement a sequential problem on a parallel network. The following
equation is presented in reference 10.
$$\text{Speedup} = \frac{T_p}{T_c + \left(\frac{T_p}{N}\right)} \tag{10}$$

where

Tp   processing time for one processor
Tc   communication time for concurrent solution
N    the number of processors in the network

The speedup equation (10) can be rewritten as

$$\text{Speedup} = \frac{N}{N\left(\frac{T_c}{T_p}\right) + 1} \tag{11}$$
This formulation of the speedup equation, however, does not compare the
parallel communication time, Tc, with the parallel processing time. The
processing time, Tp, is for a single processor while Tc is the communication
time over the distributed network.
Another method of presenting the speedup equation is to base it on the
time to implement a parallel problem on a sequential network (ref. 11). In
this case, the following equation can be used for speedup:

$$\text{Speedup} = \frac{N\,T_p'}{T_c' + T_p'} \tag{12}$$

where

Tc'  communication time for network solution
Tp'  processing time per node on the network solution
N    the number of processors in the network

Rewriting equation (12) yields:

$$\text{Speedup} = \frac{N}{\left(\frac{T_c'}{T_p'}\right) + 1} \tag{13}$$
This equation presents the proper parallel communication time divided by
parallel processing time and gives the correct speedup ratios. The serial
processing time in this case is also assumed to be zero.
The ideal case is zero network communication time. In this case, the
speedup is merely the number of processors. For maximum performance, the ratio
(Tc/Tp) should be minimized. A plot of speedup as a function of the number of
processors is shown in figure 11.
Obviously, the Tc/Tp ratio can be minimized either by minimizing Tc, using
a processor with high-speed node-to-node data transfers, or by maximizing Tp,
using slow processors or a large number of operations per communication.
Since the T800 performs floating-point arithmetic with exceptional speed,
considerable work needs to be allocated to each node between communications
for maximum performance. A reasonable goal seems to be a Tc/Tp ratio of at
most 0.001 (i.e., 1000(Tc) = Tp).
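The effect of the communication-to-processing ratio on speedup can be checked numerically. The Python sketch below evaluates the serial-reference form, Tp / (Tc + Tp/N), its rewritten form, and the parallel-ratio form, N / ((Tc'/Tp') + 1), for a 40-processor network. The function names are illustrative, and the sample ratios other than 0.001 are assumptions chosen for illustration.

```python
def speedup_serial_reference(tp: float, tc: float, n: int) -> float:
    """Speedup = Tp / (Tc + Tp/N), with Tp measured on one processor."""
    return tp / (tc + tp / n)


def speedup_from_serial_ratio(ratio: float, n: int) -> float:
    """The same quantity rewritten as N / (N*(Tc/Tp) + 1)."""
    return n / (n * ratio + 1.0)


def speedup_from_parallel_ratio(ratio: float, n: int) -> float:
    """Speedup = N / ((Tc'/Tp') + 1), both times measured per node on the network."""
    return n / (ratio + 1.0)


N = 40
goal_ratio = 0.001

ideal = speedup_from_parallel_ratio(0.0, N)                 # zero communication time
at_goal = speedup_from_parallel_ratio(goal_ratio, N)        # just under N
comm_equals_compute = speedup_from_parallel_ratio(1.0, N)   # speedup drops to N/2
```

At the suggested ratio of 0.001 the 40-processor speedup is about 39.96, which is within a fraction of a percent of the ideal value of 40.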
NETWORK COMMUNICATION LIMITED PROBLEM 
The problem tested on the network to cause a communication bottleneck was
a simple mapping of a region of the complex plane (private communication with
Alan Palazzolo of Texas A&M). The problem discretizes a two-dimensional region
of the complex plane and evaluates a specified polynomial at every discretized
point. Based on the quadrant of the complex plane in which the function value
lies, one of four colors is assigned to that point and plotted. Where four
colors intersect, the polynomial has a root. This is a rather crude approach,
but the problem is interesting in that there is immediate feedback on how
effective the network implementation is because of the real-time graphics
display of the complex plane mapping. The communication bottleneck occurs
because of the volume of data that has to be sent through the network to the
graphics board for display.
Since no complex math libraries are available in Occam, simple complex
add, multiply, and polynomial evaluation routines were developed. The
polynomial evaluation routine uses Horner's rule for increased accuracy and
performance (ref. 9). The complex math routines are listed in appendix D.
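As an illustration of the kind of routines described, the following Python sketch builds complex add and multiply from pairs of real numbers and uses them in a Horner's rule evaluation. It is not a transcription of the Occam routines in appendix D; the function names, the (real, imaginary) tuple representation, and the highest-order-first coefficient ordering are all assumptions made for this example.

```python
def cadd(a, b):
    """Add two complex numbers held as (real, imaginary) tuples."""
    return (a[0] + b[0], a[1] + b[1])


def cmul(a, b):
    """Multiply two complex numbers held as (real, imaginary) tuples."""
    return (a[0] * b[0] - a[1] * b[1], a[0] * b[1] + a[1] * b[0])


def horner(coeffs, z):
    """Evaluate a polynomial at z by Horner's rule.

    coeffs lists the complex coefficients from the highest order term down
    to the constant term (an assumed ordering), so an order-n polynomial
    needs only n complex multiplies and n complex adds.
    """
    result = (0.0, 0.0)
    for c in coeffs:
        result = cadd(cmul(result, z), c)
    return result


# f(z) = z^2 + 1, which has roots at +i and -i.
f_coeffs = [(1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
at_root = horner(f_coeffs, (0.0, 1.0))
at_other = horner(f_coeffs, (2.0, 3.0))
```

Because the work per point grows linearly with the polynomial order, the order is a convenient knob for changing the amount of computation per transmitted pixel.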
In order to implement the root visualization program on a network of
transputers, a suitable architecture must first be chosen. Unlike many
multiprocessors whose architecture is fixed, transputers can be wired in any
configuration supported with four connections per processor.
Since the programmer must develop all of the communication routing
algorithms for the network, it is currently more convenient to implement a
simple, regular architecture and use simple communication procedures instead
of a more complicated scheme. Additionally, the DMA link engines allow data to
be piped through a network of processors with little performance penalty.
The architecture chosen for the root visualization is a pipeline. The
communication buffers required for data transfer on a pipeline are only input
and output buffers. An example of these along with the pipeline is shown in
figure 2. If the data protocol for the network is chosen with some
forethought, arbitrary length pipes can be built and tested with only a
software configuration parameter change.
The complex root mapper works as follows:
(1) Prompt user for region of complex plane to map
(2) Discretize specified region
(3) Compute f(z)
(4) Find quadrant of f(z)
(5) Send data to graphics board for plotting a color at point z
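Steps (2) through (4) can be sketched sequentially in Python as shown below. This is an illustrative reconstruction and not the Occam program of appendix E: the quadrant numbering, the handling of values lying exactly on an axis, and all function names are assumptions, and Python's built-in complex type stands in for the hand-written complex routines.

```python
def poly(coeffs, z: complex) -> complex:
    """Evaluate a polynomial at z (coefficients listed highest order first)."""
    result = 0j
    for c in coeffs:
        result = result * z + c
    return result


def quadrant(w: complex) -> int:
    """Color index 0..3 for the quadrant containing w (axis points go to the
    non-negative side; this boundary rule is an assumption)."""
    if w.real >= 0:
        return 0 if w.imag >= 0 else 3
    return 1 if w.imag >= 0 else 2


def map_region(coeffs, x_min, y_min, dx, dy, width, height):
    """Return (i, j, color) for every pixel, column by column."""
    points = []
    for i in range(width):
        for j in range(height):
            z = complex(x_min + i * dx, y_min + j * dy)
            points.append((i, j, quadrant(poly(coeffs, z))))
    return points


# f(z) = z on a 2 by 2 grid around the origin: the four pixels surrounding
# the root at z = 0 receive four different colors.
pixels = map_region([1, 0], -1.0, -1.0, 2.0, 2.0, 2, 2)
```

Each pixel produces three integers regardless of the polynomial, which is why the data volume sent to the graphics board stays fixed while the computation per pixel can be varied.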
The graphics routines used in this test are documented in references 12
and 13. The source code for the root visualization program is listed in
appendix E.
Unlike the definite integral problem, there is a large volume of data
flowing through the network. For each data point to be plotted, the
x-coordinate, y-coordinate, and color must be sent to the display board. The
pipeline is used to compute the data, and the computed data is sent back to
the host processor. The host processor then sends the computed data to the
graphics board. A block diagram of the network is shown in figure 12. As with
the integration problem, the network architecture was chosen for its ease of
implementation. A more reasonable choice for a communication-bound problem
would be to take advantage of as many links as possible on each transputer.
One possible suggestion is shown in figure 13.
Since the size of the data required for plotting on the screen is
constant, the order of the polynomial is varied to change the
communication-to-computation ratio on the network. All 40 processors were used
for each simulation. The order of the polynomial was varied from 1 to 80 to
see how much computation was required to overcome the communication
bottleneck.
As with the first example, a sequential version of the program was
implemented. Unlike the first problem, two processors were required for the
sequential version: the computation node and the graphics board.
The next step is to determine how to divide the problem for concurrent
solution on the network. Since the polynomial evaluation is direct, a known
number of computations will be performed for each point in the complex plane.
Dividing the two-dimensional region into strips for each processor is a simple
method of distributing work and was the method used.
There are 40 processors and the screen resolution is 512 by 512 pixels, so
not every processor can have an equal number of pixels to compute. The number
of pixel lines (columns of pixels) each processor gets is determined as
follows (integer division):

$$\text{Number of columns} = \left\lfloor \frac{\text{Screen width}}{\text{Number of processors}} \right\rfloor$$

and the number of processors that will have (number of columns + 1) columns is

screen width mod number of processors

with the remainder getting "number of columns" columns.
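For the 512-pixel screen and 40 processors used here, the rule gives 12 columns per processor with 32 processors taking one extra column. A short Python sketch of the split follows; which processors receive the extra column, and the use of half-open (start, stop) ranges, are assumptions made for illustration, as are the function names.

```python
def columns_per_processor(screen_width: int, num_processors: int) -> list:
    """Number of pixel columns assigned to each processor."""
    base = screen_width // num_processors    # "number of columns"
    extra = screen_width % num_processors    # processors that get base + 1
    # Assumption: the first `extra` processors in the pipe take the extra column.
    return [base + 1 if p < extra else base for p in range(num_processors)]


def column_ranges(screen_width: int, num_processors: int) -> list:
    """Half-open (x_start, x_stop) column range for each processor."""
    ranges = []
    start = 0
    for count in columns_per_processor(screen_width, num_processors):
        ranges.append((start, start + count))
        start += count
    return ranges


counts = columns_per_processor(512, 40)
ranges = column_ranges(512, 40)
```

The resulting imbalance is at most one column per processor, that is, 13 columns against 12.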
The host processor decides how to distribute the work load. The dx and dy
for the complex plane are computed along with the corresponding coordinates
for the plane region of interest. Each processor gets sent a copy of the
following:

    out ! [x.min, y.min]; [dx, dy]; [x.start, x.stop]; power+1 :: coeffs

where

x.min      the minimum real value of the two-dimensional region
y.min      the minimum imaginary value of the two-dimensional region
dx         the spacing between the x pixels
dy         the spacing between the y pixels
x.start    the local starting x coordinate for this processor
x.stop     the local stop value for the x coordinate
power      the order of the polynomial to evaluate
coeffs     the coefficients of the polynomial to evaluate
INPUT BUFFER 
The input buffer for each node in the network is slightly different from
that of the integration example presented earlier. Instead of propagating a
single data set through the network, all of the data for the root solver
originates from the host processor. The input buffer must be able to read the
data from the host and decide whether the data is to be used locally or passed
on to the next processor.
The input buffer on each processor decides where to send the data it
receives by sending the first packet of data it gets to its own local node and
any other data it receives to the next node in the network. This is done by
setting a flag the first time data is received. The code that performs this
function is shown below:
    local := FALSE
    WHILE TRUE
      SEQ
        in ? coords; du; columns; size::coeffs
        IF
          NOT local
            SEQ
              out ! coords; du; columns; size::coeffs
              local := TRUE
          TRUE     -- every later packet is passed on to the next node
            through ! coords; du; columns; size::coeffs
where

coords[0]    the minimum real value of the two-dimensional region
coords[1]    the minimum imaginary value of the two-dimensional region
du[0]        the spacing between the x pixels
du[1]        the spacing between the y pixels
columns[0]   the local starting x coordinate for this processor
columns[1]   the local stop value for the x coordinate
size         the number of coefficients sent (the order of the polynomial
             plus one)
coeffs       the coefficients of the polynomial to evaluate
Note that the input buffer routine is in an infinite loop. This will actually
deadlock the network after all the required data has been sent; however, since
it is on a network node, it does not cause any problems and the coding is
easier than trying to terminate after some known number of data sets has been
received. The host processor is the one which must terminate properly because
if it deadlocks, the host transputer will have to be rebooted.
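The keep-the-first, forward-the-rest rule means that the host only has to send the work packets in pipeline order, with no addressing information in the packets. The Python simulation below illustrates this behavior sequentially. It is a sketch rather than the Occam implementation: the class and function names and the packet contents are hypothetical, and it models only the routing decision, not the concurrent channel communication.

```python
class InputBuffer:
    """Input buffer of one pipeline node: keep the first packet, forward the rest."""

    def __init__(self):
        self.local = False         # mirrors the 'local' flag in the Occam fragment
        self.local_packet = None   # packet handed to this node's compute process

    def receive(self, packet):
        """Return the packet to pass to the next node, or None if it was kept."""
        if not self.local:
            self.local_packet = packet
            self.local = True
            return None
        return packet


def send_from_host(pipeline, packets):
    """Feed packets in at the head of the pipeline, one after another."""
    for packet in packets:
        for node in pipeline:
            packet = node.receive(packet)
            if packet is None:
                break


pipeline = [InputBuffer() for _ in range(4)]
send_from_host(pipeline, ["work-0", "work-1", "work-2", "work-3"])
received = [node.local_packet for node in pipeline]
```

The k-th packet sent by the host ends up at the k-th node, so the pipe length can change without any change to the node code.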
COMPUTE PROCESS 
The compute process on the network nodes must read in the initialization
data from the input buffer process and decode the information to determine
what pixels to work on. Because of the way the link transfers occur, it is
better to have a single long data transfer through a link rather than several
short ones. The data transfer size chosen to send to the output buffer and
ultimately to the host was a whole column (512) of pixels. The data format
chosen was to send an array of 3(512) that contained the following
information:

i -- screen x coordinate in Integer Device Coordinates (IDC)
j -- screen y coordinate in IDC
color -- color of the pixel at (i, j)

The size of each data transfer becomes (512)(3)(32 bits), which is 49 152 bits
or 6144 (6K) bytes.
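The block size and the total data volume for a full 512 by 512 screen follow directly from this format. A short Python check is given below; the names are illustrative, and the integer width is left as a parameter so that the effect of a narrower integer on block size can be seen with the same function.

```python
PIXELS_PER_COLUMN = 512
VALUES_PER_PIXEL = 3      # i, j, and color
SCREEN_COLUMNS = 512


def block_bytes(bits_per_value: int) -> int:
    """Bytes in one column block for a given integer width."""
    return PIXELS_PER_COLUMN * VALUES_PER_PIXEL * bits_per_value // 8


block_32 = block_bytes(32)                    # bytes per column block
total_32 = block_32 * SCREEN_COLUMNS          # bytes for the full screen
total_32_mbytes = total_32 / (1024 * 1024)    # the same volume in Mbytes
```

With 32-bit integers each column is a 6144-byte block, and the 512 columns together account for the 3 Mbytes of data that must pass through the network.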
Two additional data transfer protocols were implemented to try to minimize
the data transfer times through the network. The first was to change the
32-bit integers used above into 16-bit integers. This is possible since all of
the data representations required can be stored in 16 bits. The second scheme
was to encode the graphics information in a run-length (RL) format. Two 32-bit
words were used to store the following information:

16 bits: i coordinate
16 bits: j coordinate
16 bits: number of pixels to color
8 bits: color to use for pixels
8 bits: byte for direction control

The control byte uses the two least significant bits (LSB) to encode whether
to increment or decrement the pixel drawing in the x- or y-direction. The
coding is as follows:

direction    LSB
One change for the RL encoded blocks that has not been implemented is to
pack multiple run-length encoded blocks into a single large block for data
transfers. At present each RL packet is sent separately, so the number of
channel writes for this case is significantly larger than for the other two
cases.
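The two-word run-length packet can be illustrated with Python's struct module. This is a sketch under stated assumptions and not the report's Occam encoding: the field order within the two words, the big-endian byte order, the function names, and the value used for the control byte are all assumed for the example, since the coding of the direction bits is not reproduced here.

```python
import struct

# Assumed layout: 16-bit i, 16-bit j, 16-bit run length, 8-bit color,
# 8-bit direction control, big-endian, for 8 bytes (two 32-bit words) total.
RUN_FORMAT = ">HHHBB"


def pack_run(i: int, j: int, count: int, color: int, control: int) -> bytes:
    """Pack one run of same-colored pixels into two 32-bit words."""
    return struct.pack(RUN_FORMAT, i, j, count, color, control)


def unpack_run(block: bytes) -> tuple:
    """Recover (i, j, count, color, control) from a packed run."""
    return struct.unpack(RUN_FORMAT, block)


def encode_column(i: int, colors: list, control: int = 0) -> list:
    """Run-length encode one pixel column: one packet per stretch of one color."""
    runs = []
    start = 0
    for j in range(1, len(colors) + 1):
        if j == len(colors) or colors[j] != colors[start]:
            runs.append(pack_run(i, start, j - start, colors[start], control))
            start = j
    return runs


# A six-pixel column with three color runs becomes three 8-byte packets.
runs = encode_column(7, [0, 0, 0, 1, 1, 2])
decoded = [unpack_run(r) for r in runs]
```

A column containing many color changes produces many small packets, which illustrates how the number of channel writes can grow even though the total number of bytes shrinks.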
The results for the root-solver are given in the performance section
below.
OUTPUT PROCESS 
The output buffer process is similar to the output buffer process of the
integration problem. Either of two channels is scanned using the ALT statement
in Occam. Any input received is sent to the adjacent node.
One addition for some of the tests was a number of internal buffers,
placed between the compute and output processes and between the link and the
output process, which were used to see if they affected performance.
PERFORMANCE 
All performance tests were run using a pipeline of either 40 T414 or 40
T800 transputers. Tests were performed for 1, 3, and 5 output buffers on each
processor in the arrangement shown in figure 14. Polynomials of order 1, 4, 5,
10, 20, 40, and 80 were tested for each configuration.
Figure 15 shows the results for the 32-bit integer (6 kbyte block size)
transfer protocol on a T414 network. The results show only a moderate
performance gain for the 5-buffer case and only for the polynomial of order
10. For all other orders, the performance is approximately the same regardless
of the number of buffers used.
The 16-bit (3 kbyte block size) transfer protocol results for a T414
network are shown in figure 16. The results for this case are the same for
every buffer configuration. There is no advantage for the 5-buffer case with a
polynomial of order 10 as there was with the 32-bit transfer protocol. The
times for both the 32-bit and 16-bit transfers were nearly identical.
The results show a linear increase in solution time for polynomials of
order greater than 20. Since the amount of data transferred is constant, these
problems are limited by the computational speed of the processors rather than
the speed limitations of the link data transfers.
The results for the run-length encoded data for a T414 network are shown
in figure 17. The performance using this protocol is approximately 30 percent
slower than either the 16-bit or 32-bit transfer protocol. The performance
degradation is due to the number of data transfers required. The total volume
of data is less than the 3 Mbytes required for the 32-bit protocol; however,
the number of channel writes increases because each RL block encodes only one
color.
The transputer requires only 1 µsec (20 cycles) to set up a channel
communication for any size data transfer. RL encoding a whole scan-line
instead of a single color might increase the performance compared with the
straight RL encoding.
The comparison involves the communication time to send a block of data
versus the time to compress the block, send a smaller amount of data, and
decompress it. This test has not yet been performed.
The network of T800's was tested and the results are shown in figure 18
for the 32-bit integer data transfer protocol. The plot for this case is
completely horizontal for polynomials from order 1 to 80. The solution time
also does not change for 1, 3, or 5 return buffers. The extra floating-point
performance offered by the T800 requires a considerable number of computations
for every communication (Tc/Tp < 0.001) in order to achieve reasonable network
speedups. The current solution on the T800 network is still communication
bound. Additional work causes no degradation in network performance.
SUMMARY 
Parametric studies were performed to determine how to best implement
direct, spatially isolated problems on transputer networks. Both
computationally intensive and communication intensive problems were studied.
The results indicate that the computation time per processor should exceed the
communication time per processor by at least 1000 times for reasonable network
performance.
APPENDIX A - SINGLE PROCESSOR SEQUENTIAL VERSION OF THE PI PROGRAM
+----------------------------------------------------------------------------+
FILE: PI-SEQ.LIS   SIZE: 1345 bytes | SAVED: Tue Jul 12 13:38:36 1988   PAGE: 1
**List of Fold** single transputer implementation
**List of File** PI.tsr
**List all lines
**Excluding : NO LIST folds
{{{
PROC pi.program(CHAN OF ANY keyboard, screen)
  #USE "\tdsiolib\userio.tsr"
  {{{  variables
  INT interval :
  INT count :
  INT dummy :




  {{{  timer variables
  TIMER time :
  INT start.time :
  INT stop.time :
  }}}
  {{{  misc vars and consts
  VAL r1.0 IS 1.0 (REAL64) :
  VAL r0.0 IS 0.0 (REAL64) :
  INT any :
  }}}
  SEQ
    write.full.string(screen, "Enter the number of intervals : ")
    read.echo.int(keyboard, screen, count, dummy)
    newline(screen)
    count := 1000
    sum := r0.0
    delta.x := r1.0 / (REAL64 TRUNC count)
    time ? start.time
    SEQ i = 0 FOR count
      SEQ
        xi := ((REAL64 TRUNC i) - 0.5 (REAL64)) * delta.x
        sum := sum + (delta.x * (4.0 (REAL64) / (r1.0 + (xi * xi))))
    time ? stop.time
    write.full.string(screen, "Time : ")
    write.int(screen, stop.time MINUS start.time, 0)
    write.full.string(screen, " low-priority ticks ")
    newline(screen)
    write.full.string(screen, "Approx. value for PI : ")




APPENDIX B - SINGLE PROCESSOR PARALLEL VERSION OF THE PI PROGRAM
FILE: PI-PAR.LIS   SIZE: 4980 bytes | SAVED: Tue Jul 12 13:39:06 1988   PAGE: 1
**List of Fold** single transputer implementation, PARallel
**List of File** PI2.tsr
**List all lines
**Excluding : NO LIST folds
PROC pi.program(CHAN OF ANY keyboard, screen)
  #USE "\tdsiolib\userio.tsr"
  {{{  constants
  VAL r1.0 IS 1.0 (REAL32) :
  VAL r0.0 IS 0.0 (REAL32) :
  }}}
  {{{  CHAN definitions
  VAL num.processors IS 40 :
  [num.processors]CHAN OF ANY to.network, from.network :
  [num.processors]CHAN OF ANY to.compute, from.compute :
  }}}
  {{{  PIPE buffer PROCs
  {{{  PROC input.buffer
  PROC input.buffer(CHAN OF ANY in, link.out, local.out)
    INT total.intervals :
    INT local.intervals :
    SEQ
      in ? total.intervals; local.intervals
      PAR
        link.out ! total.intervals; local.intervals
        local.out ! total.intervals; local.intervals
  :
  }}}
  {{{  PROC return.buffer
  PROC return.buffer(CHAN OF ANY local.in, link.in, out,
                     VAL INT process.number)
    REAL32 partial.sum :
    INT loops :
    INT proc.num :
    SEQ
      loops := 0
      WHILE loops < ((num.processors - process.number))
        ALT
          local.in ? partial.sum
            SEQ
              loops := loops + 1
              out ! process.number; partial.sum
          link.in ? proc.num; partial.sum
            SEQ
              loops := loops + 1
              out ! proc.num; partial.sum
  :
  }}}
  }}}
  {{{  END-OF-PIPE buffer PROCs
  {{{  PROC end.input.buffer, End-of-pipe
  PROC end.input.buffer(CHAN OF ANY in, local.out)
    INT total.intervals :
    INT local.intervals :
    SEQ
      in ? total.intervals; local.intervals
      local.out ! total.intervals; local.intervals
  :
  }}}
  {{{  PROC end.return.buffer, End-of-pipe
  PROC end.return.buffer(CHAN OF ANY local.in, out, VAL INT process.number)
    REAL32 partial.sum :
    SEQ
      local.in ? partial.sum
      out ! process.number; partial.sum
  :
  }}}
  }}}
  {{{  PROC compute
  PROC compute(CHAN OF ANY in, out, VAL INT process.number)
    {{{  variables
    INT total.intervals :
    INT local.intervals :
    REAL32 delta.x :
    REAL32 sum :
    REAL32 xi :
    }}}
    {{{  computation
    SEQ
      in ? total.intervals; local.intervals
      sum := r0.0
      delta.x := r1.0 / (REAL32 TRUNC total.intervals)
      SEQ i = (process.number * local.intervals) FOR local.intervals
        SEQ
          xi := ((REAL32 TRUNC i) - 0.5 (REAL32)) * delta.x
          sum := sum + (delta.x * (4.0 (REAL32) / (r1.0 + (xi * xi))))
      out ! sum
    }}}
  :
  }}}
  {{{  PROC sink
  PROC sink(CHAN OF ANY in, REAL32 total)
    REAL32 result :
    INT replys :
    INT processor :
    SEQ
      replys := 0
      total := r0.0
      WHILE replys < num.processors
        SEQ
          in ? processor; result
          total := total + result
          replys := replys + 1
          {{{  COMMENT write statements
          ...
          :::A COMMENT FOLD
          {{{  write statements
          write.full.string(screen, "total = ")
          write.real32(screen, total, 2, 6)
          write.full.string(screen, " reply from processor: ")





  {{{  timer variables
  TIMER time :
  INT start.time :
  INT stop.time :
  }}}
  {{{  main program
  {{{  variables
  INT any :
  INT dummy :
  INT count :
  REAL32 sum :
  }}}
  SEQ
    {{{  print statements
    write.full.string(screen, "Enter the number of intervals per processor: ")
    read.echo.int(keyboard, screen, count, dummy)
    newline(screen)
    }}}
    {{{  main program
    time ? start.time
    PAR
      to.network[0] ! count * num.processors; count
      {{{  pipe nodes
      PAR i = 0 FOR (num.processors - 1)
        PAR
          input.buffer(to.network[i], to.network[i+1], to.compute[i])
          compute(to.compute[i], from.compute[i], i)
          return.buffer(from.compute[i], from.network[i+1], from.network[i], i)
      }}}




                from.compute[num.processors-1], num.processors - 1)
        end.return.buffer(from.compute[num.processors-1],
                          from.network[num.processors-1], num.processors - 1)
      }}}
      sink(from.network[0], sum)
    time ? stop.time
    }}}
    {{{  print statements
    write.full.string(screen, "Time : ")
    write.int(screen, stop.time MINUS start.time, 0)
    write.full.string(screen, " low-priority ticks ")
    newline(screen)
    write.full.string(screen, "Approx. value for PI : ")
    write.real32(screen, sum, 2, 6)
    newline(screen)
    read.char(keyboard, any)
    }}}
  }}}
APPENDIX C - NETWORK VERSION OF THE PI PROGRAM
FILE: PI-EXE.LIS   SIZE: 2890 bytes   PAGE: 1
**List of Fold** network example
**List of File** PI3.tsr
**List all lines
**Excluding : NO LIST folds
PROC pi.example(CHAN OF ANY keyboard, screen)
  #USE "\tdsiolib\userio.tsr"
  #USE "procnum.tsr"
  {{{  CHAN definitions
  {{{  channel addresses
  VAL link0.in IS 4 :
  VAL link1.in IS 5 :
  VAL link2.in IS 6 :
  VAL link3.in IS 7 :
  VAL link0.out IS 0 :
  VAL link1.out IS 1 :
  VAL link2.out IS 2 :
  VAL link3.out IS 3 :
  }}}
  CHAN OF ANY to.network, from.network :
  PLACE to.network AT link2.out :
  PLACE from.network AT link2.in :
  }}}
  {{{  PROC sink
  PROC sink(CHAN OF ANY in, REAL32 total)
    INT replys :
    REAL32 partial.sum :
    SEQ
      replys := 0
      total := r0.0
      WHILE replys < num.processors
        SEQ
          in ? partial.sum
          total := total + partial.sum
          replys := replys + 1
          {{{  COMMENT write statements
          :::A COMMENT FOLD
          {{{  write statements
          write.full.string(screen, "partial sum: ")
          write.real32(screen, total, 2, 6)
          newline(screen)
          }}}
          }}}
  :
  }}}
  {{{  timer variables
  TIMER time :
  INT start.time :
  INT stop.time :
  }}}
  {{{  main program
  {{{  variables
APPENDIX C - Continued. 
+------------------------------------------------------------------- + 
FILE: PI-EXE.LIS   SIZE: 2890 bytes | SAVED: Thu Jun 30 13:10:50 1988   PAGE: 2
  INT any :
  INT dummy :
  INT count :
  REAL32 total :
  }}}
  SEQ
    {{{  print statements
    write.full.string(screen, "Enter number of intervals for each processor: ")
    read.echo.int(keyboard, screen, count, dummy)
    newline(screen)
    }}}
    {{{  main program
    time ? start.time
    to.network ! count * num.processors; count   -- send init data to network
    sink(from.network, total)
    time ? stop.time
    }}}
    {{{  print statements
    write.full.string(screen, "Network PRI PAR communication case ")
    newline(screen)
    write.full.string(screen, "20MHz. T800C ")
    newline(screen)
    newline(screen)
    write.full.string(screen, "Total number of network processors: ")
    write.int(screen, num.processors, 0)
    newline(screen)
    write.full.string(screen, "Total number of intervals : ")
    write.int(screen, num.processors * count, 0)
    newline(screen)
    newline(screen)
    write.full.string(screen, "Time : ")
    write.int(screen, stop.time MINUS start.time, 0)
    write.full.string(screen, " low-priority ticks ")
    write.full.string(screen, " Time : ")
    write.real32(screen, (REAL32 ROUND (stop.time MINUS start.time)) /
                         15625.0 (REAL32), 2, 4)
    write.full.string(screen, " seconds")
    newline(screen)
    newline(screen)
    write.full.string(screen, "Approximate value for PI : ")
    write.real32(screen, total, 2, 6)
    newline(screen)
    read.char(keyboard, any)
    }}}
APPENDIX C - Continued. 
FILE: PI-PROG.LIS  SIZE: 6293 bytes  SAVED: Thu Jun 30 13:11:08 1988  PAGE: 1
**List of Fold** network example
**List of File** PI3P.tsr
**List all lines
**Excluding : NO LIST folds
{{{ SC pipe
:::A 4 10
{{{F pipe
:::F pp.tsr
PROC pipe(CHAN OF ANY from.left, to.left,
                      to.right, from.right,
          VAL INT process.number)
  #USE "procnum.tsr"
  CHAN OF ANY input.to.compute, compute.to.output :
  {{{ PROC input.buffer
  PROC input.buffer(CHAN OF ANY link.in, link.out, local.out)
    INT total.intervals, local.intervals :
    SEQ
      link.in ? total.intervals; local.intervals
      PAR
        link.out ! total.intervals; local.intervals
        local.out ! total.intervals; local.intervals
  :
  }}}
  {{{ PROC return.buffer
  PROC return.buffer(CHAN OF ANY local.in, link.in, link.out)




      local.in ? local.sum
      link.out ! local.sum
      link.in ? sum
      link.out ! sum
  }}}
  {{{ PROC compute
  PROC compute(CHAN OF ANY in, out)
    {{{ variables
    INT total.intervals :
    INT local.intervals :
    REAL32 delta.x :
    REAL32 sum :
    REAL32 xi :
    }}}
    SEQ
      in ? total.intervals; local.intervals
      sum := r0.0
      delta.x := r1.0 / (REAL32 TRUNC total.intervals)
      SEQ i = (process.number * local.intervals) FOR local.intervals
        SEQ
          xi := ((REAL32 TRUNC i) - 0.5 (REAL32)) * delta.x
          sum := sum + (delta.x * (4.0 (REAL32) / (r1.0 + (xi * xi))))
      out ! sum
  :
  }}}




    input.buffer(from.left, to.right, input.to.compute)
    return.buffer(compute.to.output, from.right, to.left)
}}}
...F code
:::A 1 2
:::F pp.dcd
...F descriptor
:::A 1 4
:::F pp.dds
...F link
:::A 1 9
:::F pp.dlk
}}}
VAL B003pairs IS 1 :
{{{ CHAN definitions
{{{ channel addresses
VAL link0.in IS 4 :
VAL link1.in IS 5 :
VAL link2.in IS 6 :
VAL link3.in IS 7 :
VAL link0.out IS 0 :
VAL link1.out IS 1 :
VAL link2.out IS 2 :
VAL link3.out IS 3 :
}}}
CHAN OF ANY dummy1, dummy2 :
[8 * B003pairs]CHAN OF ANY to, from :
}}}
-- pipeline of processors, architecture f(b003pairs)
PLACED PAR
  {{{ B003pairs for pipe
  PLACED PAR j = 0 FOR (B003pairs - 1)
    PLACED PAR
      VAL i IS (8 * j) :
      PROCESSOR i T8
        PLACE to[i] AT link1.in :
        PLACE from[i] AT link1.out :
        PLACE to[i+1] AT link2.out :
        PLACE from[i+1] AT link2.in :
        pipe(to[i], from[i], to[i+1], from[i+1], i)
      VAL k IS (8 * j) + 1 :
      PROCESSOR k T8
        PLACE to[k] AT link3.in :
        PLACE from[k] AT link3.out :
        PLACE to[k+1] AT link1.out :
        PLACE from[k+1] AT link1.in :
        pipe(to[k], from[k], to[k+1], from[k+1], k)
      VAL l IS (8 * j) + 2 :
      PROCESSOR l T8
        PLACE to[l] AT link0.in :
        PLACE from[l] AT link0.out :
        PLACE to[l+1] AT link2.out :
        PLACE from[l+1] AT link2.in :
        pipe(to[l], from[l], to[l+1], from[l+1], l)
      PLACED PAR m = 0 FOR 2
        VAL n IS (((8 * j) + 3) + m) :
        PROCESSOR n T8
          PLACE to[n] AT link3.in :
          PLACE from[n] AT link3.out :
          PLACE to[n+1] AT link2.out :
          PLACE from[n+1] AT link2.in :
          pipe(to[n], from[n], to[n+1], from[n+1], n)
      VAL o IS (8 * j) + 5 :
      PROCESSOR o T8
        PLACE to[o] AT link3.in :
        PLACE from[o] AT link3.out :
        PLACE to[o+1] AT link1.out :
        PLACE from[o+1] AT link1.in :
        pipe(to[o], from[o], to[o+1], from[o+1], o)
      VAL p IS (8 * j) + 6 :
      PROCESSOR p T8
        PLACE to[p] AT link0.in :
        PLACE from[p] AT link0.out :
        PLACE to[p+1] AT link2.out :
        PLACE from[p+1] AT link2.in :
        pipe(to[p], from[p], to[p+1], from[p+1], p)
      VAL q IS (8 * j) + 7 :
      PROCESSOR q T8
        PLACE to[q] AT link3.in :
        PLACE from[q] AT link3.out :
        PLACE to[q+1] AT link0.out :
        PLACE from[q+1] AT link0.in :
        pipe(to[q], from[q], to[q+1], from[q+1], q)
  }}}
  {{{ B003pair end-of-pipe
  VAL j IS (B003pairs - 1) :
  PLACED PAR
    VAL i IS (8 * j) :
    PROCESSOR i T8
      PLACE to[i] AT link1.in :
      PLACE from[i] AT link1.out :
      PLACE to[i+1] AT link2.out :
      PLACE from[i+1] AT link2.in :
      pipe(to[i], from[i], to[i+1], from[i+1], i)
    VAL k IS (8 * j) + 1 :
    PROCESSOR k T8
      PLACE to[k] AT link3.in :
      PLACE from[k] AT link3.out :
      PLACE to[k+1] AT link1.out :
      PLACE from[k+1] AT link1.in :
      pipe(to[k], from[k], to[k+1], from[k+1], k)
    VAL l IS (8 * j) + 2 :
    PROCESSOR l T8
      PLACE to[l] AT link0.in :
      PLACE from[l] AT link0.out :
      PLACE to[l+1] AT link2.out :
      PLACE from[l+1] AT link2.in :
      pipe(to[l], from[l], to[l+1], from[l+1], l)
    PLACED PAR m = 0 FOR 2
      VAL n IS (((8 * j) + 3) + m) :
      PROCESSOR n T8
        PLACE to[n] AT link3.in :
        PLACE from[n] AT link3.out :
        PLACE to[n+1] AT link2.out :
        PLACE from[n+1] AT link2.in :
        pipe(to[n], from[n], to[n+1], from[n+1], n)
    VAL o IS (8 * j) + 5 :
    PROCESSOR o T8
      PLACE to[o] AT link3.in :
      PLACE from[o] AT link3.out :
      PLACE to[o+1] AT link1.out :
      PLACE from[o+1] AT link1.in :
      pipe(to[o], from[o], to[o+1], from[o+1], o)
    VAL p IS (8 * j) + 6 :
    PROCESSOR p T8
      PLACE to[p] AT link0.in :
      PLACE from[p] AT link0.out :
      PLACE to[p+1] AT link2.out :
      PLACE from[p+1] AT link2.in :
      pipe(to[p], from[p], to[p+1], from[p+1], p)
    VAL q IS (8 * j) + 7 :
    PROCESSOR q T8
      PLACE to[q] AT link3.in :
      PLACE from[q] AT link3.out :
      pipe(to[q], from[q], dummy1, dummy2, q)
  }}}
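The listings in this appendix split the numerical integration of 4/(1 + x*x) over [0, 1] into equal blocks of intervals, one block per processor, and add the partial sums on the way back to the host. The following is a minimal sequential sketch of that decomposition in Python rather than occam, so it can be run without a transputer network. It is an illustration under stated assumptions, not a transcription: the function names are hypothetical, the arithmetic is double precision rather than REAL32, and the sample point is taken at the interval midpoint (i + 0.5), whereas the listing as printed uses (i - 0.5).

```python
import math


def partial_sum(process_number, local_intervals, total_intervals):
    """Sum the contribution of one processor's block of intervals."""
    delta_x = 1.0 / total_intervals
    first = process_number * local_intervals
    total = 0.0
    for i in range(first, first + local_intervals):
        xi = (i + 0.5) * delta_x          # assumption: midpoint of interval i
        total += delta_x * (4.0 / (1.0 + xi * xi))
    return total


def pi_estimate(num_processors, local_intervals):
    """Add the partial sums, as the return path of the pipeline does."""
    total_intervals = num_processors * local_intervals
    return sum(
        partial_sum(p, local_intervals, total_intervals)
        for p in range(num_processors)
    )


if __name__ == "__main__":
    estimate = pi_estimate(num_processors=8, local_intervals=1000)
    print(estimate, abs(estimate - math.pi))
```

Because each block depends only on its processor number and the two interval counts, no processor needs data from another, which is the spatially isolated property the report studies.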
APPENDIX D - OCCAM COMPLEX MATH PROCEDURES 
FILE: COMPLEX.LIS  SIZE: 967 bytes  SAVED: Thu Jun 30 13:15:44 1988  PAGE: 1
**List of Fold** math
**List of File** math.tsr
**List all lines
**Excluding : NO LIST folds
{{{ PROC cmplx.mult
PROC cmplx.mult(REAL64 x1, y1, x2, y2)
  REAL64 t.x, t.y :
  SEQ
    t.x := (x1 * x2) - (y1 * y2)
    t.y := (x1 * y2) + (x2 * y1)
    x1 := t.x -- return results in 1st 2 parameters
    y1 := t.y
:
}}}
{{{ PROC cmplx.add
PROC cmplx.add(REAL64 x1, y1, x2, y2)
  SEQ
    x1 := x1 + x2
    y1 := y1 + y2
:
}}}
{{{ PROC cmplx.poly
PROC cmplx.poly(REAL64 x, iy, VAL INT n, []REAL64 coeffs)
  -- compute value of complex polynomial using Horner's rule
  REAL64 t.x, t.y :
  INT a :
  SEQ
    t.x := x
    t.y := iy
    cmplx.mult(t.x, t.y, coeffs[n], coeffs[n])
    SEQ i = 1 FOR (n - 1)
      SEQ
        a := n - i
        cmplx.add(t.x, t.y, coeffs[a], coeffs[a])
        cmplx.mult(t.x, t.y, x, iy)
    cmplx.add(t.x, t.y, coeffs[0], coeffs[0])
    x := t.x
    iy := t.y
:
}}}
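The procedures above evaluate a polynomial at a complex point with Horner's rule, carrying the real and imaginary parts in separate variables. The following is a small Python sketch of the same idea. It is the textbook recurrence for real coefficients, which is an assumption on my part: the occam listing passes each coefficient as both arguments of the complex helper, so this sketch is not a line-by-line transcription. The names `cmplx_mult` and `horner` are illustrative.

```python
def cmplx_mult(x1, y1, x2, y2):
    """Return (x1 + i*y1) * (x2 + i*y2) as a (real, imaginary) pair."""
    return (x1 * x2) - (y1 * y2), (x1 * y2) + (x2 * y1)


def horner(coeffs, x, y):
    """Evaluate sum(coeffs[k] * z**k) at z = x + i*y.

    coeffs[k] is the real coefficient of z**k (assumption: real coefficients).
    """
    n = len(coeffs) - 1
    t_x, t_y = coeffs[n], 0.0
    for a in range(n - 1, -1, -1):
        t_x, t_y = cmplx_mult(t_x, t_y, x, y)   # multiply running value by z
        t_x += coeffs[a]                        # add the next real coefficient
    return t_x, t_y


if __name__ == "__main__":
    # z*z + 1 has a root at z = i, so both parts vanish there.
    print(horner([1.0, 0.0, 1.0], 0.0, 1.0))
```

Horner's rule needs one complex multiplication and one addition per coefficient, so the cost of evaluating a pixel grows linearly with the order of the polynomial.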
APPENDIX E - ROOT VISUALIZATION PROGRAM
FILE: ROOT-EXE.LIS  SIZE: 11803 bytes  SAVED: Thu Jun 30 13:13:10 1988  PAGE: 1
**List of Fold** root.test
**List of File** roottest.tsr
**List all lines
**Excluding : NO LIST folds
PROC root.test(CHAN OF ANY keyboard, screen)
#USE "\tdsiolib\userio.tsr"
#USE "procnum.tsr"
{{{ channel constants
VAL link0.in IS 4:
VAL link1.in IS 5:
VAL link2.in IS 6:
VAL link3.in IS 7:
VAL link0.out IS 0:
VAL link1.out IS 1:
VAL link2.out IS 2:
VAL link3.out IS 3:
}}}
...F TGT graphics routines, PARTIAL -- link2 NO LIST
...
:::F TGP.tsr
{{{ pipeline channel definitions -- link3
CHAN OF ANY to.pipe, from.pipe :
PLACE to.pipe AT link3.out :
PLACE from.pipe AT link3.in :
}}}
{{{ global variables
REAL32 x.min, x.max, y.min, y.max :
INT order :
[100]REAL64 coeffs :
}}}
{{{F PROC distribute.work
:::F PROC02.tsr
PROC distribute.work(CHAN OF ANY out, INT power)
  {{{ variables
  INT block.size :
  INT remainder :
  INT x.start :
  INT x.stop :
  REAL64 dx :
  REAL64 dy :
  }}}
  INT any :
  SEQ
    block.size := screen.width / num.processors
    remainder := screen.width \ num.processors
    x.stop := block.size
    {{{ column size = block.size + 1
    SEQ i = 0 FOR remainder
      SEQ
        {{{ COMMENT write statements
        :::A COMMENT FOLD
        {{{ write statements
        write.full.string(screen, "*n*cx.start = ")
        write.int(screen, x.start, 0)
        write.full.string(screen, " x.stop = ")
        write.int(screen, x.stop, 0)
        }}}
        }}}
        out ! [x.min, y.min]; [dx, dy]; [x.start, x.stop]; (power+1)::coeffs
        x.start := x.stop + 1
        x.stop := x.start + (block.size)
    }}}
    keyboard ? any
    {{{ column size = block.size
    SEQ i = 0 FOR (num.processors - remainder)
      SEQ
        {{{ COMMENT write statements
        :::A COMMENT FOLD
        {{{ write statements
        write.full.string(screen, "*n*cx.start = ")
        write.int(screen, x.start, 0)
        write.full.string(screen, " x.stop = ")
        write.int(screen, x.stop, 0)
        }}}
        }}}
        out ! [x.min, y.min]; [dx, dy]; [x.start, x.stop]; (power+1)::coeffs
}}}F
{{{F PROC return.buffer
:::F PROC03.tsr
PROC return.buffer(CHAN OF ANY in, out, reply)
  {{{ COMMENT case: c.color.line.16
  :::A COMMENT FOLD
  {{{ case: c.color.line.16
  INT replys :
  INT size :
  INT any :
  [725][3]INT16 scan.lines :
  SEQ
    replys := 0
    WHILE replys < screen.width
      SEQ
        in ? size::scan.lines
        out ! c.color.line.16; size::scan.lines
        reply ? any
        replys := replys + 1
  }}}
  }}}
  {{{ COMMENT case: c.RL.line
  :::A COMMENT FOLD
  {{{ case: c.RL.line
  INT rows :
  INT replys :
  INT size :
  INT any :
  [20]BYTE scan.lines :
  SEQ
    rows := 0
    WHILE rows < num.processors
      SEQ
        in ? size::scan.lines
        IF
          size = 0
            rows := rows + 1
          TRUE
            SKIP
        out ! c.RL.line; size::scan.lines
        reply ? any
        replys := replys + 1
  }}}
  }}}
  {{{ case: c.color.line
  INT replys :
  INT size :
  INT any :
  [725][3]INT scan.lines :
  SEQ
    replys := 0
    WHILE replys < screen.width
      SEQ
        in ? size::scan.lines
        out ! c.color.line; size::scan.lines
        reply ? any
        replys := replys + 1
  }}}
:
}}}F
{{{ MAIN program
VAL mywin IS 0 :
{{{ variables
TIMER time :
INT start.time, stop.time :
INT size :
INT reply :
INT kchar :
}}}
SEQ
  init.graphics()
  {{{ COMMENT init coeffs
  :::A COMMENT FOLD
  {{{ init coeffs
  order := 8
  coeffs[0] := 0.75 (REAL64)
  coeffs[1] := -0.25 (REAL64)
  coeffs[2] := 1.25 (REAL64)
  coeffs[3] := -2.0 (REAL64)
  coeffs[4] := 1.0 (REAL64)
  }}}
  {{{ coeffs for case1.dat -- order 88
  -- (assignments to coeffs[0] through coeffs[81] are illegible in this copy of the listing)
  coeffs[82] := 7.982842253949253044E-0031 (REAL64)
  coeffs[83] := 2.950261391660566535E-0033 (REAL64)
  coeffs[84] := 3.514343332711036982E-0034 (REAL64)
  coeffs[85] := 8.394511312395237207E-0037 (REAL64)
  coeffs[86] := 9.856712543206136081E-0038 (REAL64)
  coeffs[87] := 1.142896632201841234E-0040 (REAL64)
  coeffs[88] := 1.322886039443583949E-0041 (REAL64)
  }}}
  {{{ prompt user for screen coordinate data
  write.full.string(screen, "Enter Real min: ")
  read.echo.real32(keyboard, screen, x.min, kchar)
  write.full.string(screen, "Enter Real max: ")
  read.echo.real32(keyboard, screen, x.max, kchar)
  write.full.string(screen, "Enter Imaginary min: ")
  read.echo.real32(keyboard, screen, y.min, kchar)
  write.full.string(screen, "Enter Imaginary max: ")
  read.echo.real32(keyboard, screen, y.max, kchar)
  write.full.string(screen, "Enter order of polynomial: ")
  read.echo.int(keyboard, screen, order, kchar)
  }}}
  time ? start.time -- after user inputs data ....
  {{{ init windows/viewport
  set.window.2d(x.min, y.min, x.max, y.max, mywin)
  set.viewport.2d(0.0 (REAL32), 0.0 (REAL32), r1.0, r1.0, mywin)








  }}}
  {{{ distribute work and read results
  PAR
    distribute.work(to.pipe, order)
    return.buffer(from.pipe, to.graphic, from.graphic)
  }}}
f init. graphics ( ) 
  time ? stop.time
  {{{ write results
  write.full.string(screen, "Low priority ticks: ")
  write.int(screen, (stop.time MINUS start.time), 0)
  keyboard ? reply
  }}}
}}}
APPENDIX E - Continued.
FILE: ROOT-PRG.LIS  SIZE: 19688 bytes  SAVED: Thu Jun 30 13:14:28 1988  PAGE: 1
**List of Fold** pipe program
**List of File** pipe00.tsr
**List all lines
**Excluding : NO LIST folds
{{{ SC pipe node
:::A 3 10
{{{F pipe node
:::F pipe.tsr
PROC pipe(CHAN OF ANY from.left, to.left,
                      to.right, from.right, dummy)
  {{{F PROC input
  :::F PROC00.tsr
  PROC input(CHAN OF ANY in, out, through)
    {{{ variables
    [100]REAL64 coeffs :
    [2]REAL64 du :
    [2]REAL32 coords :
    [2]INT columns :
    INT size :
    }}}
    BOOL local :
    SEQ
      local := FALSE
      WHILE TRUE
        SEQ




                out ! coords; du; columns; size::coeffs
                local := TRUE
            TRUE
              through ! coords; du; columns; size::coeffs
  }}}F
  {{{F PROC output -- replicated ALT
  :::F PROC01.tsr
  PROC output([2]CHAN OF ANY in, CHAN OF ANY out)
    {{{ COMMENT round-robin ALT : requires [2] CHAN OF ANY in
    :::A COMMENT FOLD
    {{{ round-robin ALT : requires [2] CHAN OF ANY in
    INT size :
    --[725][3]INT scan.lines :
    --[725][3]INT16 scan.lines :
    --[200]BYTE scan.lines :
    INT count :
    SEQ
      count := 0
      WHILE TRUE
        SEQ
          ALT i = count FOR (SIZE in)
            in[i\(SIZE in)] ? size::scan.lines
              SEQ
                out ! size::scan.lines
                count := count PLUS 1
    }}}
    }}}
    {{{ "regular" ALT
    INT size :
    [725][3]INT scan.lines :
    --[725][3]INT16 scan.lines :
    --[200]BYTE scan.lines :
    INT count :
    SEQ
      count := 0
      WHILE TRUE
        SEQ
          ALT i = 0 FOR (SIZE in)
            in[i] ? size::scan.lines
              out ! size::scan.lines
  }}}F
  {{{ COMMENT PROC output -- non-replicated ALT
  :::A COMMENT FOLD
  {{{ PROC output -- non-replicated ALT
  PROC output(CHAN OF ANY in1, in2, CHAN OF ANY out)
    {{{ COMMENT round-robin ALT : requires [2] CHAN OF ANY in
    :::A COMMENT FOLD
    {{{ round-robin ALT : requires [2] CHAN OF ANY in
    INT size :
    --[725][3]INT scan.lines :
    --[725][3]INT16 scan.lines :
    --[200]BYTE scan.lines :
    INT count :
    SEQ
      count := 0
      WHILE TRUE
        SEQ
          ALT i = count FOR (SIZE in)
            in[i\(SIZE in)] ? size::scan.lines
              SEQ
                out ! size::scan.lines
                count := count PLUS 1
    }}}
    }}}
    {{{ ALT
    INT size :
    [725][3]INT scan.lines :
    --[725][3]INT16 scan.lines :
    --[200]BYTE scan.lines :
    INT count :
    SEQ
      count := 0
      WHILE TRUE
        SEQ
          ALT
            in1 ? size::scan.lines
              out ! size::scan.lines
            in2 ? size::scan.lines
              out ! size::scan.lines
    }}}
  }}}
  }}}
  {{{F PROC compute
  :::F PROC.tsr
  PROC compute(CHAN OF ANY in, out)
    {{{ COMMENT case: c.color.line.16
    :::A COMMENT FOLD
    {{{ case: c.color.line.16
    {{{ PROC cmplx.mult
    PROC cmplx.mult(REAL64 x1, y1, x2, y2)
      REAL64 t.x, t.y :
      SEQ
        t.x := (x1 * x2) - (y1 * y2)
        t.y := (x1 * y2) + (x2 * y1)
        x1 := t.x -- return results in 1st 2 parameters
        y1 := t.y
    :
    }}}
    {{{ PROC cmplx.add
    PROC cmplx.add(REAL64 x1, y1, x2, y2)
      SEQ
        x1 := x1 + x2
        y1 := y1 + y2
    :
    }}}
    {{{ PROC cmplx.poly
    PROC cmplx.poly(REAL64 x, iy, VAL INT n, []REAL64 coeffs)
      -- compute value of complex polynomial using Horner's rule
      REAL64 t.x, t.y :
      INT a :
      SEQ
        t.x := x
        t.y := iy
        cmplx.mult(t.x, t.y, coeffs[n], coeffs[n])
        SEQ i = 1 FOR (n - 1)
          SEQ
            a := n - i
            cmplx.add(t.x, t.y, coeffs[a], coeffs[a])
            cmplx.mult(t.x, t.y, x, iy)
        cmplx.add(t.x, t.y, coeffs[0], coeffs[0])
        x := t.x
        iy := t.y
    :
    }}}
    {{{ variables and constants
    INT size :
    [100]REAL64 coeffs :
    [2]REAL32 coords :
    [2]REAL64 du :
    [2]INT columns :
    {{{ define color registers
    VAL black IS 0 : -- define some color register numbers
    VAL red IS 31 :
    VAL green IS 47 :
    VAL blue IS 63 :
    VAL yellow IS 79 :
    INT color :
    }}}
    VAL screen.height IS 512 :
    }}}
    SEQ
      --out ! [x.min, y.min]; [dx, dy]; [x.start, x.stop]; power::coeffs
      in ? coords; du; columns; size::coeffs
      x := (REAL64 coords[0]) + (du[0] * (REAL64 TRUNC columns[0]))
      {{{ compute the rows
      SEQ i = columns[0] FOR ((columns[1] - columns[0]) + 1)
        SEQ
          y := REAL64 coords[1] --y.min
          {{{ compute the column
          SEQ j = 0 FOR screen.height
            SEQ
              x.old := x
              y.old := y
              cmplx.poly(x, y, size - 1, coeffs)
              {{{ compute quadrant, assign color
              IF
                x >= 0.0 (REAL64)
                  IF
                    y >= 0.0 (REAL64)
                      color := yellow
                    TRUE
                      color := red
                TRUE
                  IF
                    y >= 0.0 (REAL64)
                      color := green
                    TRUE
                      color := blue
              }}}
              x := x.old
              y := y.old
              scan.lines[j][0] := INT16 i
              scan.lines[j][1] := INT16 j
              scan.lines[j][2] := INT16 color
              y := y + du[1]
          }}}
          x := x + du[0]
          out ! screen.height::scan.lines
      }}}
    }}}
    }}}
    {{{ COMMENT case: c.RL.line
    ...
    :::A COMMENT FOLD
    {{{ case: c.RL.line
    {{{ PROC cmplx.mult
    PROC cmplx.mult(REAL64 x1, y1, x2, y2)
      REAL64 t.x, t.y :
      SEQ
        t.x := (x1 * x2) - (y1 * y2)
        t.y := (x1 * y2) + (x2 * y1)
        x1 := t.x -- return results in 1st 2 parameters
        y1 := t.y
    :
    }}}
    {{{ PROC cmplx.add
    PROC cmplx.add(REAL64 x1, y1, x2, y2)
      SEQ
        x1 := x1 + x2
        y1 := y1 + y2
    :
    }}}
    {{{ PROC cmplx.poly
    PROC cmplx.poly(REAL64 x, iy, VAL INT n, []REAL64 coeffs)
      -- compute value of complex polynomial using Horner's rule
      REAL64 t.x, t.y :
      INT a :
      SEQ
        t.x := x
        t.y := iy
        cmplx.mult(t.x, t.y, coeffs[n], coeffs[n])
        SEQ i = 1 FOR (n - 1)
          SEQ
            a := n - i
            cmplx.add(t.x, t.y, coeffs[a], coeffs[a])
            cmplx.mult(t.x, t.y, x, iy)
        cmplx.add(t.x, t.y, coeffs[0], coeffs[0])
        x := t.x
        iy := t.y
    :
    }}}
    {{{ variables and constants
    INT size :
    [100]REAL64 coeffs :
    [2]REAL32 coords :
    [2]REAL64 du :
    [2]INT columns :
    {{{ define color registers
    VAL black IS 0 : -- define some color register numbers
    VAL red IS 31 :
    VAL green IS 47 :
    VAL blue IS 63 :
    VAL yellow IS 79 :
    INT color :
    }}}
    VAL screen.height IS 512 :
    }}}
    {{{ abbreviations
    VAL bpw IS 4 :
    [4]INT line.buffer :
    [4 * bpw]BYTE pixel.buffer RETYPES line.buffer : -- done this way for alignment
    INT16 i.16 RETYPES [pixel.buffer FROM 0 FOR 2] :
    INT16 j.16 RETYPES [pixel.buffer FROM 2 FOR 2] :
    INT16 count RETYPES [pixel.buffer FROM 4 FOR 2] :
    BYTE colour IS pixel.buffer[6] :
    BYTE control IS pixel.buffer[7] :
    }}}
    SEQ
      control := #01 (BYTE) -- + y direction
      --out ! [x.min, y.min]; [dx, dy]; [x.start, x.stop]; power::coeffs
      in ? coords; du; columns; size::coeffs
      {{{ compute the rows
      x := (REAL64 coords[0]) + (du[0] * (REAL64 TRUNC columns[0]))
      SEQ i = columns[0] FOR ((columns[1] - columns[0]) + 1)
        INT j :
        INT color.start :
        SEQ
          y := REAL64 coords[1] --y.min
          j := 0
          {{{ compute the column
          WHILE j < screen.height
            SEQ
              {{{ get color for starting point
              x.old := x
              y.old := y
              cmplx.poly(x, y, size - 1, coeffs) -- get first color for looping
              {{{ compute quadrant, assign color
              IF
                x >= 0.0 (REAL64)
                  IF
                    y >= 0.0 (REAL64)
                      color := yellow
                    TRUE
                      color := red
                TRUE
                  IF
                    y >= 0.0 (REAL64)
                      color := green
                    TRUE
                      color := blue
              }}}
              x := x.old
              y := y.old
              }}}
              {{{ init variables for this column
              i.16 := INT16 i -- starting point
              j.16 := INT16 j
              count := 0 (INT16) -- init count
              color.start := color
              colour := BYTE color
              }}}
              WHILE (color = color.start) AND (j < screen.height)
                SEQ
                  {{{ increment variables
                  y := y + du[1] -- increment by dy
                  j := j + 1
                  count := count + (1 (INT16))
                  }}}
                  {{{ get color for next y
                  x.old := x
                  y.old := y
                  cmplx.poly(x, y, size - 1, coeffs)
                  {{{ compute quadrant, assign color
                  IF
                    x >= 0.0 (REAL64)
                      IF
                        y >= 0.0 (REAL64)
                          color := yellow
                        TRUE
                          color := red
                    TRUE
                      IF
                        y >= 0.0 (REAL64)
                          color := green
                        TRUE
                          color := blue
                  }}}
                  x := x.old
                  y := y.old
                  }}}
              out ! 8::pixel.buffer -- this won't work for d700d
          }}}
          out ! 0::pixel.buffer --signal end of column
      }}}
    }}}
    {{{ case: c.color.line
    {{{ PROC cmplx.mult
    PROC cmplx.mult(REAL64 x1, y1, x2, y2)
      REAL64 t.x, t.y :
      SEQ
        t.x := (x1 * x2) - (y1 * y2)
        t.y := (x1 * y2) + (x2 * y1)
        x1 := t.x -- return results in 1st 2 parameters
        y1 := t.y
    :
    }}}
    {{{ PROC cmplx.add
    PROC cmplx.add(REAL64 x1, y1, x2, y2)
      SEQ
        x1 := x1 + x2
        y1 := y1 + y2
    :
    }}}
    {{{ PROC cmplx.poly
    PROC cmplx.poly(REAL64 x, iy, VAL INT n, []REAL64 coeffs)
      -- compute value of complex polynomial using Horner's rule
      REAL64 t.x, t.y :
      INT a :
      SEQ
        t.x := x
        t.y := iy
        cmplx.mult(t.x, t.y, coeffs[n], coeffs[n])
        SEQ i = 1 FOR (n - 1)
          SEQ
            a := n - i
            cmplx.add(t.x, t.y, coeffs[a], coeffs[a])
            cmplx.mult(t.x, t.y, x, iy)
        cmplx.add(t.x, t.y, coeffs[0], coeffs[0])
        x := t.x
        iy := t.y
    :
    }}}
    {{{ variables and constants
    INT size :
    [100]REAL64 coeffs :
    [2]REAL32 coords :
    [2]REAL64 du :
    [2]INT columns :
    {{{ define color registers
    VAL black IS 0 : -- define some color register numbers
    VAL red IS 31 :
    VAL green IS 47 :
    VAL blue IS 63 :
    VAL yellow IS 79 :
    INT color :
    }}}
    VAL screen.height IS 512 :
    }}}
    SEQ
      --out ! [x.min, y.min]; [dx, dy]; [x.start, x.stop]; power::coeffs
      in ? coords; du; columns; size::coeffs
      x := (REAL64 coords[0]) + (du[0] * (REAL64 TRUNC columns[0]))
      {{{ compute the rows
      SEQ i = columns[0] FOR ((columns[1] - columns[0]) + 1)
        SEQ
          y := REAL64 coords[1] --y.min
          {{{ compute the column
          SEQ j = 0 FOR screen.height
            SEQ
              x.old := x
              y.old := y
              cmplx.poly(x, y, size - 1, coeffs)
              {{{ compute quadrant, assign color
              IF
                x >= 0.0 (REAL64)
                  IF
                    y >= 0.0 (REAL64)
                      color := yellow
                    TRUE
                      color := red
                TRUE
                  IF
                    y >= 0.0 (REAL64)
                      color := green
                    TRUE
                      color := blue
              }}}
              x := x.old
              y := y.old
              scan.lines[j][0] := i
              scan.lines[j][1] := j
              scan.lines[j][2] := color
              y := y + du[1]
          }}}
          x := x + du[0]
          out ! screen.height::scan.lines
  }}}F
  {{{F PROC thru.buffer
  :::F PROC04.tsr
  PROC thru.buffer(CHAN OF ANY local.in, local.out)
    INT size :
    [725][3]INT scan.lines :
    --[725][3]INT16 scan.lines :
    --[200]BYTE scan.lines :
    WHILE TRUE
      SEQ
        local.in ? size::scan.lines
        local.out ! size::scan.lines
  :
  }}}F
CHAN OF ANY to.compute, from.compute : 
[2]CHAN OF ANY to.buffer : 
[2]CHAN OF ANY thru : 
PRI PAR 
PAR 





      output(thru, to.left)
      --output(from.right, from.compute, to.left)
    compute(to.compute, from.compute)
}}}F
...F code
:::A 1 2
:::F pipe.dcd
...F descriptor
:::A 1 4
:::F pipe.dds
...F link
:::A 1 9
:::F pipe.dlk
}}}
{{{ SC b007 board
:::A 3 10
{{{F b007 board
:::F b007.tsr
PROC graphics(CHAN OF ANY to, from, load.link)
  #USE "\d700c\graphlib\b007lib.tsr"
  B007(to, from)
}}}F
...F code
:::A 1 2
:::F b007.dcd
...F descriptor
:::A 1 4
:::F b007.dds
...F link
:::A 1 9
:::F b007.dlk
}}}
{{{ CHAN definitions
{{{ channel addresses
VAL link0.in IS 4 :
VAL link1.in IS 5 :
VAL link2.in IS 6 :
VAL link3.in IS 7 :
VAL link0.out IS 0 :
VAL link1.out IS 1 :
VAL link2.out IS 2 :
VAL link3.out IS 3 :
}}}
CHAN OF ANY dummy1, dummy2, b007.boot :
[8 * B003pairs]CHAN OF ANY to, from, dummy :
CHAN OF ANY to.graphics, from.graphics :
}}}
-- pipeline of processors, architecture f(b003pairs)
PLACED PAR
  {{{ B003pairs for pipe
  PLACED PAR j = 0 FOR (B003pairs - 1)
    PLACED PAR
      VAL i IS (8 * j) :
      PROCESSOR i T8
        PLACE to[i] AT link1.in :
        PLACE from[i] AT link1.out :
        PLACE to[i+1] AT link2.out :
        PLACE from[i+1] AT link2.in :
        pipe(to[i], from[i], to[i+1], from[i+1], dummy[i])
      VAL k IS (8 * j) + 1 :
      PROCESSOR k T8
        PLACE to[k] AT link3.in :
        PLACE from[k] AT link3.out :
        PLACE to[k+1] AT link1.out :
        PLACE from[k+1] AT link1.in :
        pipe(to[k], from[k], to[k+1], from[k+1], dummy[k])
      VAL l IS (8 * j) + 2 :
      PROCESSOR l T8
        PLACE to[l] AT link0.in :
        PLACE from[l] AT link0.out :
        PLACE to[l+1] AT link2.out :
        PLACE from[l+1] AT link2.in :
        pipe(to[l], from[l], to[l+1], from[l+1], dummy[l])
      PLACED PAR m = 0 FOR 2
        VAL n IS (((8 * j) + 3) + m) :
        PROCESSOR n T8
          PLACE to[n] AT link3.in :
          PLACE from[n] AT link3.out :
          PLACE to[n+1] AT link2.out :
          PLACE from[n+1] AT link2.in :
          pipe(to[n], from[n], to[n+1], from[n+1], dummy[n])
      VAL o IS (8 * j) + 5 :
      PROCESSOR o T8
        PLACE to[o] AT link3.in :
        PLACE from[o] AT link3.out :
        PLACE to[o+1] AT link1.out :
        PLACE from[o+1] AT link1.in :
        pipe(to[o], from[o], to[o+1], from[o+1], dummy[o])
      VAL p IS (8 * j) + 6 :
      PROCESSOR p T8
        PLACE to[p] AT link0.in :
        PLACE from[p] AT link0.out :
        PLACE to[p+1] AT link2.out :
        PLACE from[p+1] AT link2.in :
        pipe(to[p], from[p], to[p+1], from[p+1], dummy[p])
      VAL q IS (8 * j) + 7 :
      PROCESSOR q T8
        PLACE to[q] AT link3.in :
        PLACE from[q] AT link3.out :
        PLACE to[q+1] AT link0.out :
        PLACE from[q+1] AT link0.in :
        pipe(to[q], from[q], to[q+1], from[q+1], dummy[q])
  VAL j IS (B003pairs - 1) :
  PLACED PAR
    VAL i IS (8 * j) :
    PROCESSOR i T8
      PLACE to[i] AT link1.in :
      PLACE from[i] AT link1.out :
      PLACE to[i+1] AT link2.out :
      PLACE from[i+1] AT link2.in :
      pipe(to[i], from[i], to[i+1], from[i+1], dummy[i])
    VAL k IS (8 * j) + 1 :
    PROCESSOR k T8
      PLACE to[k] AT link3.in :
      PLACE from[k] AT link3.out :
      PLACE to[k+1] AT link1.out :
      PLACE from[k+1] AT link1.in :
      pipe(to[k], from[k], to[k+1], from[k+1], dummy[k])
    VAL l IS (8 * j) + 2 :
    PROCESSOR l T8
      PLACE to[l] AT link0.in :
      PLACE from[l] AT link0.out :
      PLACE to[l+1] AT link2.out :
      PLACE from[l+1] AT link2.in :
      pipe(to[l], from[l], to[l+1], from[l+1], dummy[l])
    PLACED PAR m = 0 FOR 2
      VAL n IS (((8 * j) + 3) + m) :
      PROCESSOR n T8
        PLACE to[n] AT link3.in :
        PLACE from[n] AT link3.out :
        PLACE to[n+1] AT link2.out :
        PLACE from[n+1] AT link2.in :
        pipe(to[n], from[n], to[n+1], from[n+1], dummy[n])
    VAL o IS (8 * j) + 5 :
    PROCESSOR o T8
      PLACE to[o] AT link3.in :
      PLACE from[o] AT link3.out :
      PLACE to[o+1] AT link1.out :
      PLACE from[o+1] AT link1.in :
      pipe(to[o], from[o], to[o+1], from[o+1], dummy[o])
    VAL p IS (8 * j) + 6 :
    PROCESSOR p T8
      PLACE to[p] AT link0.in :
      PLACE from[p] AT link0.out :
      PLACE to[p+1] AT link2.out :
      PLACE from[p+1] AT link2.in :
      pipe(to[p], from[p], to[p+1], from[p+1], dummy[p])
    VAL q IS (8 * j) + 7 :
    PROCESSOR q T8
      PLACE to[q] AT link3.in :
      PLACE from[q] AT link3.out :
      PLACE b007.boot AT link0.out :
      pipe(to[q], from[q], dummy1, dummy2, b007.boot)
  }}}
  {{{ graphics board
  PROCESSOR 999 T8
    PLACE to.graphics AT link1.in :
    PLACE from.graphics AT link1.out :
    PLACE b007.boot AT link0.in :
    graphics(to.graphics, from.graphics, b007.boot)
  }}}
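The compute procedures in this appendix color each pixel by the quadrant in which the polynomial value falls, and the run-length (c.RL.line) variant sends one short record per run of equal colors in a column rather than one record per pixel. The following is a minimal Python sketch of those two steps, offered as an illustration and not as the original code. The color register numbers are taken from the listing; the function names and the (start row, run length, color) tuple layout are assumptions made for the example.

```python
# Color register numbers as defined in the listing.
BLACK, RED, GREEN, BLUE, YELLOW = 0, 31, 47, 63, 79


def quadrant_color(re, im):
    """Pick a color from the signs of the real and imaginary parts."""
    if re >= 0.0:
        return YELLOW if im >= 0.0 else RED
    return GREEN if im >= 0.0 else BLUE


def run_length_encode(colors):
    """Collapse a column of colors into (start_row, run_length, color) runs."""
    runs = []
    j = 0
    while j < len(colors):
        start = j
        # Advance while the color is unchanged and rows remain in the column.
        while j < len(colors) and colors[j] == colors[start]:
            j += 1
        runs.append((start, j - start, colors[start]))
    return runs


if __name__ == "__main__":
    column = [quadrant_color(1.0, y) for y in (-2.0, -1.0, 0.0, 1.0)]
    print(run_length_encode(column))
```

Near a root the quadrant changes rapidly, so runs are short there and long elsewhere; the amount of data sent per column therefore depends on the picture content rather than only on the screen height.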
REFERENCES 
1. Babb, R.G., II, ed.: Programming Parallel Processors. Addison-Wesley,
1987.
2. Homewood, M., et al.: The IMS T800 Transputer. IEEE Micro, vol. 7,
no. 5, Oct. 1987, pp. 10-26.
3. May, D.; and Taylor, R.: OCCAM - An Overview. Microprocessors and
Microsystems, vol. 8, no. 2, Mar. 1984, pp. 73-79.
4. Transputer Development System User Manual. INMOS Corp., Colorado Springs,
CO.
5. May, D.; and Shepherd, R.: The Transputer Implementation of OCCAM. Fifth
Generation Computer Systems 1984, Elsevier North Holland, 1984, pp. 533-541.
6. May, D.: OCCAM. SIGPLAN Notices, vol. 18, no. 4, Apr. 1983, pp. 69-79.
7. T414 Engineering Data. INMOS Corp., Colorado Springs, CO.
8. Atkin, P.: Performance Maximization. Technical Note 17, INMOS Corp.,
Colorado Springs, CO.
9. Carnahan, B.; Luther, H.A.; and Wilkes, J.O.: Applied Numerical Methods.
John Wiley and Sons, 1969.
10. Danial, A.; and Watson, J.: Iterative Finite Element Solver on Transputer
Networks. Lewis Structures Technology 1988, Vol. 1 - Structural Dynamics,
NASA CP-3003-VOL-1, 1988, pp. 113-123.
11. Gustafson, J.L.; Montry, G.R.; and Benner, R.E.: Development of Parallel
Methods for a 1024-Processor Hypercube. SIAM J. Scientific Stat. Comput.,
vol. 9, no. 4, July 1988, pp. 609-638.
12. Ellis, G.K.: Two-Dimensional Graphics Tools for a Transputer Based
Display Board. NASA TM-100820, 1988.
13. Ellis, G.K.: User Manual for the Two-Dimensional Transputer Graphics
Toolkit. NASA TM-100974, 1988.
IMS T800 







FIGURE 1. - BLOCK DIAGRAM OF A T800 FLOATING POINT TRANSPUTER. 
1000 PROCESSORS 
NUMBER OF INTERVALS 
FIGURE 3. - PI PROGRAM USING PRIORITIZED COMPUTATION. 








1 11111111 1 I IIIIIII 1 11111111 1 1111'1'1 1 11111111 1 1  lllllJ 
lo2 lo3 lo4 lo5 106 107 108 
NUMBER OF INTERVALS 
FIGURE 5. - PI PROGRAM USING ALL LOW-PRIORITY 




102 103 104 105 106 lo7 108 
NUMBER OF INTERVALS 
FIGURE 4. - PI PROGRAM USING PRIORITIZED COMPUTATION, 






NUMBER OF INTERVALS 
FIGURE 6. - PI PROGRAM USING PRIORITIZED COMMUNICA- 
TION. 64-BIT MATH. AND T800 FLOATING-POINT 
PROCESSORS. 
FIGURE 7. - PI PROGRAM USING PRIORITIZED COMMUNICA- 










102 103 104 105 106 107 108 
NUMBER OF INTERVALS 
FIGURE 8. - SINGLE PROCESSOR PERFORIANCE SOLVING PI 
















w 2 .6 







102 103 lo4 lo5 lo6 lo7 108 lo2 lo3 lo4 105 lo6 lo7 108 
NUMBER OF INTERVALS NUMBER OF INTERVALS 
FIGURE 9. - PI PROGRAM SPEEDUP USING T800 FLOATING- FIGURE 10. - PI PROGRAM NETWORK SOLUTION EFFICIENCY 


























. 1 1 1 1 1 1 1 1 1  1 1 1 1 1 1 1 1 1  1 1 1 1 1 1 1 1 1  1 1 1 1 1 1 1 1 1  1 1 IIIII~I I I I I I ~  
lo2 lo3 104 lo5 lo6 lo7 I O ~  
NUMBER OF INTERVALS 
INTERVALS 
r + 103  lo4 105 
0 1 0  2 0  3 0  40  50  
NUMBER OF PROCESSORS 
FIGURE 11. - SPEEDUP FOR P I  PROGRAM. 
FIGURE 12. - NETWORK USED FOR ROOT VISUALIZATION PROBLEM. 
(Diagram labels: processor pipeline, graphics board.) 
FIGURE 13. - SUGGESTED HIGH-BANDWIDTH NETWORK TO TAKE ADVANTAGE OF AS MANY 
TRANSPUTER LINKS AS POSSIBLE. 
(Diagram labels: PC host, B004 development, B007 graphics board.) 
FIGURE 15. - ROOT VISUALIZATION TEST PERFORMANCE USING 
(Horizontal axis: order of polynomial.) 
FIGURE 16. - ROOT VISUALIZATION TEST PERFORMANCE USING 
16-BIT TRANSFER PROTOCOL. 
(Horizontal axis: order of polynomial, 0 to 100.) 
FIGURE 17. - ROOT VISUALIZATION TEST PERFORMANCE USING 
(Horizontal axis: order of polynomial, 0 to 100; curves keyed by number of 
buffers.) 
FIGURE 18. - ROOT VISUALIZATION TEST PERFORMANCE USING 
32-BIT TRANSFER PROTOCOL AND T800 PROCESSORS. 
(Horizontal axis: order of polynomial, 0 to 100.) 
NASA FORM 1626 OCT 86 
*For sale by the National Technical Information Service, Springfield, Virginia 22161 
NASA 
National Aeronautics and Space Administration 
Report Documentation Page 
1. Report No.: NASA TM-101297, ICOMP-88-14 
2. Government Accession No. 
3. Recipient's Catalog No. 
4. Title and Subtitle: Implementing Direct, Spatially Isolated Problems 
on Transputer Networks 
5. Report Date: August 1988 
6. Performing Organization Code 
7. Author(s): Graham K. Ellis 
8. Performing Organization Report No.: E-4278 
9. Performing Organization Name and Address: 
National Aeronautics and Space Administration 
Lewis Research Center 
Cleveland, Ohio 44135-3191 
10. Work Unit No.: 505-63-1B 
11. Contract or Grant No. 
12. Sponsoring Agency Name and Address: 
National Aeronautics and Space Administration 
Washington, D.C. 20546-0001 
13. Type of Report and Period Covered: Technical Memorandum 
14. Sponsoring Agency Code 
15. Supplementary Notes: 
Graham K. Ellis, Senior Research Associate at the Institute for Computational 
Mechanics in Propulsion, NASA Lewis Research Center (work funded under Space Act 
Agreement C99066G). 
16. Abstract 
Parametric studies have been performed on transputer networks of up to 40 
processors to determine how to implement and maximize the performance of the 
solution of problems where no processor-to-processor data transfer is required 
for the problem solution (spatially isolated). Two types of problems were 
investigated in this study. A computationally intensive problem where the 
solution required the transmission of 160 bytes of data through the parallel 
network, and a communication intensive example that required the transmission 
of 3 Mbytes of data through the network. This data consists of solutions being 
sent back to the host processor and not intermediate results for another 
processor to work on. Studies were performed on both integer and floating-point 
transputers. The floating-point transputer features an on-chip floating-point 
math unit and offers approximately an order of magnitude performance increase 
over the integer transputer on real valued computations. The results indicate 
that a minimum amount of work is required on each node per communication to 
achieve high network speedups (efficiencies). The floating-point processor 
requires approximately an order of magnitude more work per communication than 
the integer processor because of the floating-point unit's increased computing 
capability. 
17. Key Words (Suggested by Author(s)): Parallel processing; Transputer; 
Performance calculation 
18. Distribution Statement: Unclassified - Unlimited; Subject Category 61 
19. Security Classif. (of this report): Unclassified 
20. Security Classif. (of this page): Unclassified 
21. No of pages: 58 
22. Price*: A04 
