Randomized PRAM Simulation Using T9000 Transputers by Czech, Zbigniew & Mikanik, Wojciech
Citation for published version
Czech, Zbigniew and Mikanik, Wojciech  (1996) Randomized PRAM Simulation Using T9000
Transputers.   Technical report. UKC, University of Kent, Canterbury, UK
The parallel random access machine PRAM is the most commonly used generalpurpose
machine model for describing parallel computations Unfortunately the PRAM model is not phy
sically realizable since on large machines a parallel shared memory access can only be accompli
shed at the cost of a signicant time delay A number of PRAM simulation algorithms have been
presented in the literature The algorithms allow execution of PRAM programs on more realistic
parallel machines In this paper we study the randomized simulation of an EREW exclusive read
exclusive write PRAM on a module parallel computer MPC The simulation is based on utili
zing universal hashing The results of our experiments performed on the MPC built upon Inmos
T			 transputers throw some light on the question whether using the PRAM model in parallel
computations is practically viable given the present state of technology
Key words Parallel computing PRAM model randomized PRAM simulation algorithms
module parallel computer universal hashing T			 transputer
1 Introduction

The parallel random access machine (PRAM) is the most commonly used general-purpose machine model for describing parallel computations. The PRAM consists of a set of processors, where each processor is a random access machine (RAM). All processors share the memory and communicate through it. The PRAM is relatively easy to program because one does not need to allocate storage within a distributed memory or specify interprocessor communication. Unfortunately, the PRAM model is not physically realizable, since on large machines a parallel shared memory access can only be accomplished at the cost of a significant time delay.

A number of PRAM simulation algorithms have been presented in the literature (for a survey, see the paper by Harris listed in the References). These algorithms allow PRAM programs to be executed on more realistic parallel machines. Among several types of such machines is a fully connected parallel computer called a module parallel computer (MPC). The MPC consists of a set of RAM processors. Each processor of the MPC has an associated memory module and is connected via communication links to all other processors. A memory module operates sequentially, responding to only one data access request at a time.

In this paper we study the randomized simulation of an EREW (exclusive read, exclusive write) PRAM on an MPC. The simulation is based on universal hashing. The results of our experiments, performed on an MPC built from Inmos T9000 transputers, throw some light on the question of whether using the PRAM model in parallel computations is practically viable, given the present state of technology.
The remainder of the paper is organized as follows. In Section 2 we describe the PRAM model. Section 3 defines the MPC. Section 4 presents some theoretical results regarding the randomized PRAM simulation. In Section 5 we describe the architecture of the Parsys SN9500 parallel computer, which served as the platform for our experiments. In Section 6 the PRAM simulators which have been designed and implemented are discussed. Section 7 presents a matrix multiplication algorithm used for the purpose of simulation. In Section 8 the experiments which were conducted are described. Section 9 concludes the paper. The Appendix contains the results of the experiments.

* Research supported by the Polish Committee for Scientific Research under the grant BKRAu.
† Research supported by the European Committee under the Tempus programme, grant IMG PL, and by the Polish Committee for Scientific Research under the grant BKRAu.
2 PRAM model

The PRAM consists of $n$ processors $P_0, P_1, \dots, P_{n-1}$ and a shared memory of $m$ variables (Fig. 1). The processors operate synchronously: no processor may proceed with instruction $i+1$ until all have finished instruction $i$.

Figure 1: The PRAM model of computation.

In every step of the PRAM each processor executes a private RAM instruction. In particular, each processor may read a variable from the shared memory into its local memory, write a variable from its local memory to the shared memory, or perform some internal computation (e.g., addition, multiplication, a boolean operation, etc.) on the variables contained in its local memory. It is assumed that the execution of each instruction takes unit time. Depending on whether various processors may access the same memory location in a given step or not, the following variants of the PRAM model are distinguished:

- the exclusive read, exclusive write (EREW) PRAM, in which at most one processor may read or write to a particular variable;
- the concurrent read, exclusive write (CREW) PRAM, in which multiple processors may read from a particular variable but at most one processor may write to a particular variable;
- the concurrent read, concurrent write (CRCW) PRAM, in which multiple processors may read or write to any variable.

There is also a further classification of the CRCW PRAM model based on the write conflict resolution strategy, which specifies what is written when more than one processor writes to a particular variable in a given step. For more details regarding this classification, see the literature listed in the References.

3 Module parallel computer

A module parallel computer (MPC) consists of $n$ RAM processors, each of which has an associated memory module. A memory module is a collection of variables. Every processor may access every memory module via a fully connected network linking the processors (see Fig. 2). It is assumed that an access takes constant time. The memory modules, however, are sequential devices, i.e., all access requests that arrive at a memory module in a given step are processed one at a time. This can result in delays when several requests are directed to the same module.

Figure 2: The module parallel computer.
4 Randomized PRAM simulation

By a simulation of a machine $M_1$ on a machine $M_2$ we understand an algorithm that allows an instruction from $M_1$ to be executed on $M_2$. Our goal is to simulate an EREW PRAM on a more realistic parallel machine, namely an MPC. The basic problem which must be solved by the simulation algorithm concerns the memory management, and it can be formulated as follows. Consider an $(n, m)$-PRAM, i.e., a PRAM with $n$ processors and $m$ shared memory locations, which is to be simulated on an MPC with $n$ memory modules, so that each memory module holds $m/n$ memory locations. Suppose that in a given MPC step each processor issues a memory request. Then in the best case each request goes to a different module and all requests may be serviced in $O(1)$ time (recall that the communication time between the MPC processors is constant). In the worst case, however, all $n$ requests may be directed to the same memory module and will be serviced in time linear in $n$. The problem of memory management is how to map the logical addresses of the PRAM into the physical addresses of the MPC, distributed over its $n$ memory modules, such that the amount of module contention is minimized given any set of $n$ requests which are to be serviced.
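
The best and worst cases can be made concrete with a small Python sketch. It assumes, purely for illustration, a fixed interleaved mapping in which address `a` lives in module `a mod n`; the names `service_time` and `interleaved` are hypothetical and do not come from the paper.

```python
from collections import Counter

def service_time(addresses, module_of):
    """Time units needed to serve one step: the longest module queue."""
    queues = Counter(module_of(a) for a in addresses)
    return max(queues.values(), default=0)

n = 4
interleaved = lambda a: a % n          # assumed fixed mapping, for illustration

# Best case: the n requests fall into n different modules.
best = service_time(range(n), interleaved)
# Worst case: addresses 0, n, 2n, ... all fall into module 0.
worst = service_time(range(0, n * n, n), interleaved)
```

With any fixed mapping an adversarial access pattern of this kind exists, which is the motivation for choosing the mapping at random.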
One approach to this problem is based on universal hashing, as introduced by Carter and Wegman. During a simulation the MPC processors apply a hash function $h$ chosen randomly from a class of universal hash functions $H$. The function $h$ is used to distribute the logical addresses of the PRAM among the memory modules of the MPC. It is expected that in every simulation step the function $h$ will spread the requests evenly among the memory modules of the MPC, regardless of the memory access patterns of the PRAM. The class of universal hash functions is defined as follows.

Definition 1. Let $c \in \mathbb{R}$, $k \in \mathbb{N}$, $m \in \mathbb{N}$, $n \in \mathbb{N}$. A class of hash functions $H \subseteq \{h \mid h : \{0,\dots,m-1\} \to \{0,\dots,n-1\}\}$ is $c$-strongly $k$-universal if for all pairwise distinct $a_1,\dots,a_k \in \{0,\dots,m-1\}$ and all $b_1,\dots,b_k \in \{0,\dots,n-1\}$

$$|\{h \in H : h(a_i) = b_i \text{ for } 1 \le i \le k\}| \le c\,|H|/n^k.$$
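
For very small parameters the condition of the definition can be checked by brute force. The Python sketch below is illustrative only: the names `is_strongly_universal` and `polynomial_class` are hypothetical, hash functions are represented as value tables, and the example class contains all polynomials of degree below $k$ over $Z_m$ for a prime $m$ (with $n = m$).

```python
from itertools import permutations, product

def is_strongly_universal(H, m, n, k, c):
    """Brute-force test of the c-strongly k-universal condition.

    H is a list of value tables: h[a] is the image of address a.
    """
    for a in permutations(range(m), k):          # pairwise distinct a_1..a_k
        for b in product(range(n), repeat=k):    # arbitrary b_1..b_k
            hits = sum(all(h[x] == y for x, y in zip(a, b)) for h in H)
            if hits * n**k > c * len(H):
                return False
    return True

def polynomial_class(m, k):
    """All polynomials of degree < k over Z_m (m prime), as value tables."""
    return [
        tuple(sum(coef * x**i for i, coef in enumerate(coefs)) % m
              for x in range(m))
        for coefs in product(range(m), repeat=k)
    ]

G = polynomial_class(5, 2)                       # 25 linear polynomials mod 5
constants = [tuple([v] * 5) for v in range(5)]   # constant functions only
```

The class of constant functions passes the test for $k = 1$ but fails for $k = 2$, since two distinct addresses always collide under a constant function.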

While simulating the PRAM we assume that the $i$th processor of the MPC runs the same program as the $i$th processor of the PRAM. The shared memory of the PRAM is divided among the $n$ memory modules of the MPC in such a way that memory module $M_j$, $0 \le j < n$, contains all PRAM addresses $a$, $0 \le a < m$, for which $h(a) = j$. The details of the simulation can be described as follows.

Initialization. Choose $h \in H$ at random and store $h$ in every processor of the MPC.

Step-by-step simulation. For the logical address $a_i$ generated by processor $P_i$ of the MPC, apply $h$ to $a_i$ and send the request to memory module $M_{h(a_i)}$. A memory module $M_j$, $0 \le j < n$, collects all requests for variables in $M_j$ and serves them sequentially. When all requests are served, the next PRAM step is simulated.
Given the above scheme, the question arises how efficient the simulation is, or how long the queues of requests in front of each memory module are. Since the PRAM processors operate synchronously, all memory requests issued in a particular step must be serviced before the simulation of the next step can begin. Therefore our objective is to minimize the length of the longest queue of requests in front of the memory modules (recall that all these requests are serviced sequentially), as it bounds the efficiency. To study it in more detail we need to define some parameters describing the length of the queues. Let $S = \{a_1,\dots,a_p\} \subseteq \{0,\dots,m-1\}$ be a set of addresses of arbitrary cardinality $p$ and let $h \in H$. Define

$$R_{\max}(h,S) = \max_{0 \le j < n} |\{a \in S : h(a) = j\}|,$$
$$R_{\max}(S) = \sum_{h \in H} R_{\max}(h,S)/|H|,$$
$$R^p_{\max} = \max\{R_{\max}(S) : S \subseteq \{0,\dots,m-1\},\ |S| = p\}.$$

$R_{\max}(h,S)$ is the length of the longest queue in front of any memory module when function $h \in H$ is used, $R_{\max}(S)$ is the expected value of $R_{\max}(h,S)$, and $R^p_{\max}$ is the worst case of that value taken with respect to all possible sets $S$.
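
For tiny parameters the three quantities can be computed exhaustively. The following Python sketch assumes, for illustration, the class of all functions $h(x) = ((a_1 x + a_0) \bmod m) \bmod n$ with $m = 7$ and $n = 3$; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

def r_max(h, S, n):
    """Longest queue when hash function h distributes the address set S."""
    return max(sum(1 for a in S if h(a) == j) for j in range(n))

def expected_r_max(H, S, n):
    """Average of r_max over the class H, i.e. h drawn uniformly from H."""
    return Fraction(sum(r_max(h, S, n) for h in H), len(H))

def worst_case_r_max(H, m, n, p):
    """Maximum expected queue length over all address sets of size p."""
    return max(expected_r_max(H, S, n) for S in combinations(range(m), p))

m, n = 7, 3                                   # small illustrative sizes
H = [lambda x, a0=a0, a1=a1: ((a1 * x + a0) % m) % n
     for a0, a1 in product(range(m), repeat=2)]
```

Exact rational arithmetic (`Fraction`) is used so that the averages over the class are not affected by rounding.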
Mehlhorn and Vishkin proved the following theorem.

Theorem 1. Let $H$ be a $c$-strongly $k$-universal class of functions from $\{0,\dots,m-1\}$ to $\{0,\dots,n-1\}$. Then

$$R^p_{\max} \le k + c\,p\,n\,(p/n)^k / k!$$

for all $p \in \mathbb{N}$.
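
To get a feeling for the bound, it can be evaluated numerically. The sketch below assumes that the bound of Theorem 1 reads $k + c\,p\,n\,(p/n)^k/k!$; the function name `queue_bound` and the parameter values ($c = 1$, $p = n = 8$) are illustrative choices, not values taken from the experiments.

```python
from math import factorial

def queue_bound(k, c, p, n):
    """Evaluate k + c*p*n*(p/n)**k / k! as read from Theorem 1."""
    return k + c * p * n * (p / n) ** k / factorial(k)

# Bound for p = n = 8 requests and c = 1, for several values of k.
bounds = {k: queue_bound(k, c=1, p=8, n=8) for k in range(1, 7)}
```

For $p = n$ the second term equals $c\,n^2/k!$, so it shrinks quickly as $k$ grows while the first term grows only linearly.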
Proof Let S be dened as before and let P
i
S be the probability that R
max
h S  i ie P
i
S 
jfh  H  R
max
h S  igjjHj Then P
p
   P
k



























S be the probability that at least k addresses of S are mapped onto memory module j






S      P
kn 





     a
k
g it holds
jfh  H  ha
l


























Mehlhorn and Vishkin introduced the following class $G$ of universal hash functions. Let $m$ be a prime and let

$$G = \{g \mid g(x) = (\sum_{i=0}^{k-1} a_i x^i) \bmod m,\ a_i \in \{0,\dots,m-1\} \text{ and } a_i \ne 0 \text{ for some } i > 0\}$$

be the set of all polynomials of degree at most $k-1$. Since for every $x_1,\dots,x_k, y_1,\dots,y_k \in \{0,\dots,m-1\}$ with $x_1,\dots,x_k$ pairwise distinct there is exactly one polynomial $g$ of degree at most $k-1$ with $g(x_i) = y_i$ for $1 \le i \le k$, it can be concluded that $G$ is 1-strongly $k$-universal. For the purpose of the simulation, class $G$ has to be modified into the form

$$H_1 = \{h \mid h(x) = g(x) \bmod n,\ g \in G\}.$$

Given an address $x$ of the PRAM, $g(x)$ can be interpreted as a global address in the MPC which corresponds to location $\lfloor g(x)/n \rfloor$ of module $h(x) = g(x) \bmod n$. Unfortunately, for polynomials of degree greater than 1 the mapping of PRAM addresses $x$ into their internal locations in memory modules is not one-to-one. In other words, several addresses can be mapped into the same location in a given module. We call these addresses synonyms. To handle this problem, a memory module maintains for each location $\lfloor g(x)/n \rfloor$ a table of pairs ($x$, data in PRAM location $x$) for all $x$ mapped to that location. Thus PRAM address $x$ is accessed by searching the table of synonyms associated with location $\lfloor g(x)/n \rfloor$.




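Theorem 2. If $H \subseteq \{h \mid h : \{0,\dots,m-1\} \to \{0,\dots,m-1\}\}$ is $c$-strongly $k$-universal and $r : \{0,\dots,m-1\} \to \{0,\dots,n-1\}$ is such that $|r^{-1}(j)| \le \lceil m/n \rceil$ for all $j$, $0 \le j < n$, then the class

$$\hat{H} = \{r \circ h : h \in H\}$$

is $\hat{c}$-strongly $k$-universal, where $\hat{c} = (n \lceil m/n \rceil / m)^k\,c$.

Combined with Theorem 1, this gives for the class $\hat{H}$ the bound

$$R^p_{\max} \le k + \hat{c}\,p\,n\,(p/n)^k / k!.$$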





 Parsys SN architecture
The two main components of the Parsys SN parallel computer are the Inmos T transputer
and the ST C	 or C	 for short packet routing device The T has much greater capabilities
than any of its predecessors from the transputer family Its peak performance is expected to be 
MIPS and  MFLOPS according to the Inmos specication of  MHz T with links running at
up to  Mbitssec in each direction The onchip virtual channel processor which operates in parallel
with the central processing unit allows physical links to be shared transparently by a large number of
virtual channels The packetization and multiplexing operations are implemented directly in hardware
The C	 allows to construct networks of very large number of fullyinterconnected Ts without
use of any routing software It has  bidirectional data links and two control links It also includes
a full    nonblocking crossbar switch enabling messages to be routed from any of its links to
any other link The C	 uses wormhole routing which minimizes communication latency because
the chip can start outputting a packet which is still being input The use of a crossbar switch allows
packets to be passed through all links at the same time The C	 can route packets of any length 

The SN contains ve C	s and up to  fullyinterconnected Ts Fig  Each data link
of each T is connected to one of the C	 routing devices Except for two of the Ts data
link  of each T is connected to C	
 link  to C	
 etc This means that every T
is connected to every other T via only one C	 The data links of the two Ts and of the
interface card are connected to the fth C	 which in turn is connected via four pairs of its data
links to each of the other routing devices 

6 PRAM simulators

Two kinds of simulators, called SIM1 and SIM2, have been designed and implemented in the occam language on the Parsys SN9500 parallel computer. The structure of SIM1 is similar to that of the MPC: each processor $P_i$, $i = 0, \dots, n-1$, of the MPC together with its memory module is simulated by a pair of occam processes placed on a single transputer.

The second simulator, SIM2, simulates the multithreaded module parallel computer (MMPC) shown in Fig. 4. Each processor of the MMPC executes $u$ threads $T_i^0, \dots, T_i^{u-1}$, $i = 0, \dots, n-1$. The number of threads $u$ is called the parallel slackness of the MMPC. The parallel slackness is introduced in order to maintain high utilization of the simulating processors by overlapping their local computations with global communication. More specifically, if one thread of the MMPC requests a shared memory access, then instead of the simulating processor remaining idle during the time the request is serviced, it can execute another thread.

Figure 3: The Parsys SN9500 parallel computer.
The structure of the simulator SIM2 reflects the structure of the MMPC. Namely, the $i$th transputer runs processes simulating the threads $T_i^0, \dots, T_i^{u-1}$ and a process simulating the memory module $M_i$.




As mentioned above two occam processes run on a single transputer in SIM A high priority
process called Memi
 i     n  
 simulates a module of the shared memory M
i
Fig a It
accepts memory access requests performs the appropriate operations and sends back to the requesting
process either a content of the specied memory location for reading or an acknowledge message
for writing The second low priority process called CPU i
 runs the program of the ith PRAM
processor Since all transputers used in SIM are fully connected each CPU i
 communicates with each
Memi
 directly via a pair of occam channels In a single step a CPU i
 may asynchronously perform
an arbitrary number of local operations followed up by at most one access to any of the memory
modules All the CPU i
s synchronize their work after each computation step as follows First the
CPU i
s i     n  
 send an appropriate message to CPU 
 and wait for a reply The CPU 

collects the messages from the CPU i
s and then broadcasts the signal to these processes enabling
them to resume their computations A frontend process running on an additional transputer of SIM
performs IO operations initiates the work of all processes and measures the time of computations
In simulator SIM2, the processes $\mathrm{CPU}[i]^0, \dots, \mathrm{CPU}[i]^{u-1}$, $i = 0, \dots, n-1$, executing the MMPC threads run on a single transputer (Fig. 5b). Furthermore, there is an MMU[i] process, which acts as a concentrator of memory requests from the CPUs. A similar role is played by a synchronization process Sync[i], which collects the synchronization messages of the CPUs on its transputer and transmits to them a signal received from Sync[0]. Most of the communication in SIM2 is carried out within the transputers; only the communications between processes placed on different transputers are effected through the transputer links. With the exception of the CPUs, all other processes have high occam priority. A special scheme is applied to ensure that the local computations of a CPU are not interrupted in favour of another CPU process.

Figure 4: The multithreaded module parallel computer.
7 Matrix multiplication algorithm

In order to measure the performance of the simulators described in the previous section, an EREW PRAM matrix multiplication algorithm has been implemented (see Fig. 6). Each processor $P_i$, $i = 0, \dots, n-1$, of the algorithm computes every $n$th row of the resultant matrix $C$, starting from row $i$. We assume that $a \le c$.
Example Let the constants dening the sizes of matrices A B and C be a  	 b   and c  
and let n   In the rst stage every processor P
i
 i     







 for k     
 see Fig a Clearly there are no memory access conicts




i  mod 
 Ci
i  mod 
 and so on Similarly no memory access conicts arise during




 for i 	 j read elements from dierent rows and








 and write to dierent
elements Ci
i  l mod 
 and Cj
j  l mod 
 l     	
 Fig b shows the memory access
pattern for l   lled circles mark elements that have been already computed and unlled ones the
elements to be evaluated In the next stage once all the values in row Ci

 have been evaluated
processor P
i
begins the computation of values Ci  n

 if row i  n exists In our example in this
stage there is enough work only for processor P
 




nish their work after




 respectively and the algorithm completes when P
 
computes
all the values in row C

  
The EREW PRAM matrix multiplication algorithm was implemented in occam (see Fig. 8). Note that the lines in Fig. 6 marked with (*) and (**) represent the following sequence of operations:

  (1) Register1 := A[row, k]
  (2) Register2 := B[k, col]
  (3) tmp := tmp + Register1 * Register2
  (4) C[row, col] := tmp

















































































Figure 5: The structure of processes running on the $i$th transputer of simulators (a) SIM1 and (b) SIM2; each line denotes a pair of occam channels.

  -- n: number of PRAM processors
  -- a, b, c: constants defining the sizes of matrices A, B and C
  A: array [0..a-1, 0..b-1] of float;
  B: array [0..b-1, 0..c-1] of float;
  C: array [0..a-1, 0..c-1] of float;
  parfor i := 0 to n - 1 do
    row := i;
    while row < a do
      for j := i to i + c - 1 do
        col := j mod c;
        tmp := 0;
        for k := 0 to b - 1 do
          tmp := tmp + A[row, k] * B[k, col];    (*)
        end for;
        C[row, col] := tmp;                      (**)
      end for;
      row := row + n;
    end while
  end parfor

Figure 6: The EREW PRAM matrix multiplication algorithm.
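
The claim that the algorithm is free of access conflicts can be checked mechanically. The Python sketch below is an illustration written for this purpose, not the occam implementation of the paper: it runs the $n$ processors in lockstep and raises an error if two of them touch the same cell in one step. The names `erew_matmul` and `check_exclusive` are hypothetical.

```python
def check_exclusive(cells):
    """Raise if two processors access the same cell in one step."""
    if len(set(cells)) != len(cells):
        raise ValueError("EREW conflict: a cell is accessed twice in one step")

def erew_matmul(A, B, n):
    """Multiply A (a x b) by B (b x c) as n lockstep PRAM processors would."""
    a, b, c = len(A), len(A[0]), len(B[0])
    C = [[0.0] * c for _ in range(a)]
    for base in range(0, a, n):                 # one stage per block of n rows
        active = [i for i in range(n) if base + i < a]
        for l in range(c):                      # processor i works on column (i + l) mod c
            tmp = {i: 0.0 for i in active}
            for k in range(b):
                check_exclusive([(base + i, k) for i in active])        # reads of A
                check_exclusive([(k, (i + l) % c) for i in active])     # reads of B
                for i in active:
                    tmp[i] += A[base + i][k] * B[k][(i + l) % c]
            check_exclusive([(base + i, (i + l) % c) for i in active])  # writes to C
            for i in active:
                C[base + i][(i + l) % c] = tmp[i]
    return C
```

In this model the column accesses are exclusive as long as the number of processors does not exceed the number of columns of $C$; with more processors than columns the check fails.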
Operations (1) and (2), which access the shared memory, were implemented using Load procedure calls with the local variables tmpa and tmpb in place of registers (see lines (a) and (b) in Fig. 8). Once the read request is completed, the Load procedure executes the code synchronizing the work of all CPUs of a simulator. Operation (4) specified above was implemented by lines (d) and (e) in Fig. 8. The Store procedure writes the value of its second parameter to the shared memory address defined by its first parameter and then synchronizes its work with the other CPUs.

8 Experiments

The goal of the experiments was to investigate the performance of the simulators SIM1 and SIM2. For the purpose of simulation the EREW PRAM matrix multiplication algorithm was used (see Sec. 7). The algorithm was executed on square matrices $A$, $B$ and $C$ of size $s \times s$ for several values of $s$. The simulators themselves and the matrix multiplication algorithm were implemented in occam. The experiments were carried out on the Parsys SN9500 parallel computer populated with $n$ T9000 transputers (an additional transputer ran a front-end process). T9000 Gamma silicon was used. The computation times were measured by making use of an internal high priority processor timer. Each execution time measurement was averaged over a series of experiments.

The timings of the sequential version of the matrix multiplication algorithm run on a single T9000 transputer are shown in Table 1 ($s$ defines the size of the matrices; Ave, $\sigma$, Max and Min denote the average execution time, the standard deviation, and the maximum and minimum execution times, respectively, among the experiments).
Experiments on SIM1

Two versions of the matrix multiplication algorithm were implemented. In the first one all the matrices $A$, $B$ and $C$ were located in the shared memory. In the second version only matrix $C$ was stored in the shared memory, whereas the matrices $A$ and $B$ were copied into the local memory of each transputer. As a result, the ratio of local memory accesses and computations to shared memory accesses was much higher in the second version.












































Figure 7: Memory access patterns for the EREW PRAM matrix multiplication algorithm (matrices $A$, $B$ and $C$).
For these two versions of the algorithm two series of experiments were conducted, in which polynomial hash functions $h$ of degree 1 and of a higher degree (a function of $\log n$), respectively, were applied (cf. the hash function classes defined in Section 4). Before each experiment new random coefficients of the polynomials were generated. The bound on $R^p_{\max}$ given in Section 4 indicates that the expected length of the longest queue is smaller if the degree of the polynomial hash function $h$ is higher. In such a case one can expect the simulation to be more efficient, as shorter queues of requests are serviced more quickly. On the other hand, the evaluation of a polynomial of higher degree is computationally more expensive, which adversely influences the efficiency. Therefore, in practice the degree of the polynomial hash function should be chosen as a compromise.
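
The evaluation cost side of this compromise is easy to quantify. The Python sketch below evaluates a polynomial modulo $m$ with Horner's rule and counts the multiplications; it assumes that the simulators evaluate the hash function in a comparable way, which the text does not state, and the name `horner_mod` is hypothetical.

```python
def horner_mod(coefs, x, m):
    """Evaluate sum(coefs[i] * x**i) mod m; also count the multiplications."""
    value, multiplications = coefs[-1] % m, 0
    for coef in reversed(coefs[:-1]):
        value = (value * x + coef) % m
        multiplications += 1
    return value, multiplications
```

With this scheme a polynomial of degree $d$ costs $d$ multiplications, $d$ additions and $d$ reductions modulo $m$ per memory request, so the per-request overhead grows linearly with the degree.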
The results of the tests for the first version of the algorithm are shown in Tables 2 and 3 and in graphs (a) and (b) of Fig. 9, which plot the speedup $T_1/T_s$, where $T_1$ is the execution time of the sequential version of the matrix multiplication algorithm on a single transputer and $T_s$ is the time of the PRAM simulation of the algorithm. As can be seen from graphs (a) and (b), the multiplication of matrices simulated on $n$ processors takes longer than on a single processor. The reason for this low performance is the small grain size of the computations. Namely, only two relatively cheap local floating-point operations on the matrix elements and a few fixed-point address operations are executed for three pairs of an expensive shared memory access and a global synchronization (cf. lines (a)-(e) in Fig. 8). It is worth noting that a matrix multiplication algorithm with such a small grain is a demanding test for the PRAM simulation.

Graphs (a) and (b) also show how the degree of the polynomial hash function influences the performance of the simulation. The first degree polynomial gives slightly shorter average execution times, although the times themselves are less regular and predictable. For example, for one of the matrix sizes the longest execution time in the first series of our experiments was more than twice as long as the longest time in the second series. The shorter average execution times obtained for the first degree hash function mean that the evaluation time of the function dominates the time of servicing the longer queues, which are likely to arise when this function is used, plus the time for dealing with synonyms.

  INT row:
  REAL tmp, tmpa, tmpb:
  SEQ
    row := i
    VAL OffsetB IS a * b:
    VAL OffsetC IS (a * b) + (b * c):
    WHILE row < a
      SEQ
        VAL RowOff IS row * b:
        SEQ j = i FOR c
          VAL col IS j REM c:
          SEQ
            tmp := 0.0 (REAL)
            SEQ k = 0 FOR b
              SEQ
                Load(RowOff + k, tmpa)                   -- (a)
                Load(((k * c) + OffsetB) + col, tmpb)    -- (b)
                tmp := (tmpa * tmpb) + tmp               -- (c)
            VAL AddressC IS ((row * c) + OffsetC) + col: -- (d)
            Store(AddressC, tmp)                         -- (e)
        row := row + n

Figure 8: The EREW PRAM matrix multiplication algorithm in occam.
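
The occam listing stores the three matrices one after another in a single linear PRAM address space. The Python sketch below restates that address arithmetic as read from the listing (the offsets $a \cdot b$ for $B$ and $a \cdot b + b \cdot c$ for $C$); the function names are hypothetical and the row-major layout is an interpretation of the listing.

```python
def address_a(row, k, a, b, c):
    """PRAM address of A[row, k]; A occupies addresses 0 .. a*b - 1."""
    return row * b + k

def address_b(k, col, a, b, c):
    """PRAM address of B[k, col]; B starts at offset a*b."""
    return a * b + k * c + col

def address_c(row, col, a, b, c):
    """PRAM address of C[row, col]; C starts at offset a*b + b*c."""
    return a * b + b * c + row * c + col
```

Under this layout every matrix element has a distinct address and the three matrices together fill a contiguous range of $ab + bc + ac$ addresses.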
The graphs c and d in Fig  illustrate the results of the experiments with the second version of
the algorithm in which only the resultant matrix C was stored in the shared memory Tables 	 and
 contain the corresponding measurements In that case the shared memory was accessed only once
after s oatingpoint multiplications and s oatingpoint additions Due to the greater grain size the
speedups achieved are much better than previously
For the matrices of size  and the hash function of degree  we measured the longer average
execution time than for the function of higher degree It was caused by an enormous execution time
of a single experiment " almost three times bigger than the average In that experiment the hash
function of the randomly generated coecients mapped almost all writing requests in every step of
the simulation into the same module of the shared memory
Experiments on SIM2

During the experiments on the simulator SIM2 only the polynomial hash function of degree 1 was used, for matrices of a single size. The results obtained are shown in Table 6. Contrary to our expectations, introducing parallel slackness by running a number of threads on each transputer resulted in only a slight improvement in the efficiency of the simulations: the speedup obtained on SIM2 was only a little higher than that obtained on SIM1 (cf. Tables 4 and 6).
Figure 9: Speedup versus $s$ for simulator SIM1 ($s \times s$ is the size of matrices $A$, $B$ and $C$): (a) and (b) matrices $A$, $B$, $C$ in the shared memory, $h$ is a polynomial of degree 1 and of higher degree, respectively; (c) and (d) only matrix $C$ in the shared memory, $h$ is a polynomial of degree 1 and of higher degree, respectively.

9 Conclusions

In this paper the problem of the randomized simulation of an EREW PRAM on an MPC was studied. Two kinds of simulators based on universal hashing were designed and implemented in the occam language. In the first simulator the number of simulating processors was equal to the number of programs of the PRAM, so that each processor ran a single program. In the second simulator parallel slackness was introduced by executing a number of computation threads on each simulating processor. Practical experiments on the simulators were conducted using the Parsys SN9500 parallel computer, with the matrix multiplication algorithm as a running example. The results of the experiments on the first simulator indicate that the PRAM simulation is still not efficient enough to be useful in practice. We found that the cost of the shared memory accesses (implemented, recall, in the fully connected transputer network via wormhole routing) is relatively high in comparison with the cost of the local computations. One of the reasons for this is that the messages exchanged during an access are small (an access request and its reply are each only several bytes long) and, according to the measurements presented by Hipperson, only roughly half of the peak T9000 link bandwidth is attained with messages of this size. The experiments show that the simulations in which the polynomial hash function of degree 1 is used are more efficient than those using a function of higher degree. This means that the evaluation time of the hash function dominates the time of servicing the longer queues, which are likely to arise when a lower degree polynomial is applied. Since the simulation is randomized by nature, the simulation times vary among the experiments, especially for the polynomial of degree 1. This is explicable, as the mapping of addresses among the memory modules of the MPC is then not as uniform as in the case of polynomials of higher degrees. Contrary to our expectations, introducing parallel slackness in the second simulator improved the efficiency of the simulations only to a small degree.
Acknowledgments

We wish to thank the Department of Computer Science, the University of Kent at Canterbury, Great Britain, for providing access to the parallel computing facilities.
References





 Carter JL and Wegman MN Universal classes of hash functions Journ Comput Syst Sci
   	!	

 Goldschlager LM A unied approach to models of synchronous parallel machines Journ ACM
 	  !
	
 Harris TJ A survey of PRAM simulation techniques ACM Computing Surveys   June
	 !

 Hipperson AM The global communications performance of fully interconnected T ne
tworks November 	 manuscript

 The T Transputer Hardware Reference Manual Inmos Ltd 

 Mehlhorn K and Vishkin U Randomized and deterministic simulations of PRAMs by parallel
machines with restricted granularity of parallel memories Acta Informatica  	 !	







    
   	 
	  	  
	 	 	 			 	
    	
 	  	 	






 	  		  	
  	  	 	
	   		 			 
	   	  
     
 				 	 	  
Table 2: Timings of the algorithm simulated on SIM1; matrices A, B and C in the shared memory; h: polynomial of degree 1.

Table 3: Timings of the algorithm simulated on SIM1; matrices A, B and C in the shared memory; h: polynomial of higher degree.

Table 4: Timings of the algorithm simulated on SIM1; only matrix C in the shared memory; h: polynomial of degree 1.

Table 5: Timings of the algorithm simulated on SIM1; only matrix C in the shared memory; h: polynomial of higher degree.

Table 6: Timings of the algorithm simulated on SIM2 (rows: matrices A, B, C in the shared memory; only matrix C in the shared memory); h: polynomial of degree 1.
