The parallel random access machine (PRAM) is the most commonly used general-purpose machine model for describing parallel computations. Unfortunately, the PRAM model is not physically realizable, since on large machines a parallel shared memory access can only be accomplished at the cost of a significant time delay. A number of PRAM simulation algorithms have been presented in the literature. These algorithms allow PRAM programs to be executed on more realistic parallel machines. In this paper we study the randomized simulation of an EREW (exclusive read, exclusive write) PRAM on a module parallel computer (MPC). The simulation is based on universal hashing. The results of our experiments, performed on an MPC built from Inmos T9000 transputers, shed some light on the question of whether using the PRAM model in parallel computations is practically viable given the present state of technology.
Introduction
The parallel random access machine (PRAM) is the most commonly used general-purpose machine model for describing parallel computations. The PRAM consists of a set of processors, where each processor is a random access machine (RAM). All processors share the memory and communicate through it. The PRAM is relatively easy to program, because one does not need to allocate storage within a distributed memory or specify interprocessor communication. Unfortunately, the PRAM model is not physically realizable, since on large machines a parallel shared memory access can only be accomplished at the cost of a significant time delay.
A number of PRAM simulation algorithms have been presented in the literature (for a survey see [4]). These algorithms allow PRAM programs to be executed on more realistic parallel machines. Among several types of such machines is a fully connected parallel computer called a module parallel computer (MPC). The MPC consists of a set of RAM processors. Each processor of the MPC has an associated memory module and is connected via communication links to all other processors. A memory module operates sequentially, responding to only one data access request at a time.
In this paper we study the randomized simulation of an EREW (exclusive read, exclusive write) PRAM on an MPC. The simulation is based on utilizing universal hashing. The results of our experiments performed on the MPC built upon Inmos T9000 transputers throw some light on the question whether using the PRAM model in parallel computations is practically viable given the present state of technology.
The remainder of the paper is organized as follows. In Section 2 we describe the PRAM model. Section 3 defines the MPC. Section 4 presents some theoretical results regarding the randomized PRAM simulation. In Section 5 we describe the architecture of the Parsys SN9500 parallel computer, which served as the platform for our experiments. In Section 6 the PRAM simulators which have been designed and implemented are discussed. Section 7 presents a matrix multiplication algorithm used for the purpose of simulation. In Section 8 the experiments which were conducted are described. Section 9 concludes the paper. The Appendix contains the results of the experiments.
PRAM model
An $(n, m)$-PRAM consists of $n$ RAM processors, $P_0, P_1, \ldots, P_{n-1}$, and a shared memory of $m$ locations, also called variables (see Fig. 1). The processors work synchronously, i.e. no processor will proceed with instruction $i + 1$ until all have finished instruction $i$. In every step of the PRAM, each processor executes a private RAM instruction. In particular, each processor may read a variable from the shared memory into its local memory, write a variable from its local memory to the shared memory, or perform some internal computation (e.g., addition, multiplication, boolean operation, etc.) on the variables contained in its local memory. It is assumed that the execution of each instruction takes unit time.
Figure 1: The PRAM model of computation (processors $P_0, P_1, \ldots, P_{n-1}$ connected to the shared memory)
Depending on whether various processors may access the same memory location on a given step or not, the following variants of the PRAM model are distinguished: the exclusive read, exclusive write (EREW) PRAM, in which at most one processor may read or write to a particular variable; the concurrent read, exclusive write (CREW) PRAM, in which multiple processors may read from a particular variable, but at most one processor may write to a particular variable; and the concurrent read, concurrent write (CRCW) PRAM, in which multiple processors may read or write to any variable. There is also a further classification of the CRCW PRAM model based on the write conflict resolution strategy, which specifies what is written when more than one processor writes to a particular variable on a given step. For more details regarding this classification see [8, 3, 1].
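To make the access rules concrete, the following minimal Python sketch classifies one synchronous step by the most restrictive PRAM variant that permits it. The function name `strictest_variant` and the representation of a step as a list of (processor, operation, address) triples are assumptions made for illustration only. The treatment of a simultaneous read and write of the same variable as not allowed under CREW is likewise an assumption of this sketch.

```python
from collections import defaultdict


def strictest_variant(step):
    """Return the most restrictive PRAM variant that allows this step.

    `step` is a list of (processor, op, address) triples with op in {"r", "w"};
    each processor is assumed to issue at most one access per step.
    """
    readers = defaultdict(int)
    writers = defaultdict(int)
    for _proc, op, addr in step:
        if op == "r":
            readers[addr] += 1
        elif op == "w":
            writers[addr] += 1
        else:
            raise ValueError(f"unknown operation {op!r}")

    addresses = set(readers) | set(writers)
    # EREW: at most one processor touches any given variable.
    if all(readers[a] + writers[a] <= 1 for a in addresses):
        return "EREW"
    # CREW: several readers are fine, but a written variable has exactly one
    # writer and (assumption of this sketch) no simultaneous readers.
    if all(writers[a] == 0 or (writers[a] == 1 and readers[a] == 0)
           for a in addresses):
        return "CREW"
    return "CRCW"
```

For example, two processors reading the same address in one step are classified as CREW, whereas two processors writing to the same address require a CRCW machine.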
Module parallel computer
A module parallel computer (MPC) consists of $n$ RAM processors, each of which has an associated memory module [7]. A memory module is a collection of variables. Every processor may access every memory module via a fully connected network linking the processors (see Fig. 2). It is assumed that an access takes constant time. The memory modules, however, are sequential devices, i.e. all access requests that arrive at a memory module in a given step are processed one at a time. This can result in memory contention, in which an access request is delayed because of a concurrent request to the same module.
By a simulation of machine $M_1$ on machine $M_2$ we understand an algorithm that allows an instruction of $M_1$ to be executed on $M_2$. Our goal is to simulate an EREW PRAM on a more realistic parallel machine, namely, an MPC. The basic problem which must be solved by the simulation algorithm concerns memory management, and it can be formulated as follows. Consider an $(n, m)$-PRAM which is to be simulated on an MPC with $n$ memory modules, so that each memory module will hold $m/n$ memory locations. Suppose that on a given MPC step each processor issues a memory request. Then, in the best case each request will go to a different module, and all requests may be serviced in $O(1)$ time (recall that the communication time between the MPC processors is constant). In the worst case, however, all $n$ requests may be directed to the same memory module, and will be serviced in $\Theta(n)$ time. The problem of memory management is how to map the logical addresses of the PRAM into the physical addresses of the MPC distributed over its $n$ memory modules such that the amount of module contention is minimized for any set of $n$ requests which are to be serviced.
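The best and worst cases can be illustrated with a short Python sketch. It assumes a fixed interleaved mapping, address mod $n$, which is a hypothetical choice made only for this illustration, and measures the longest queue formed at any module by one step of requests.

```python
from collections import Counter


def longest_queue(addresses, module_of):
    """Length of the longest request queue over all memory modules."""
    if not addresses:
        return 0
    return max(Counter(module_of(a) for a in addresses).values())


n = 16  # number of processors and of memory modules


def interleaved(a):
    """Hypothetical fixed mapping of address a to a module (illustration only)."""
    return a % n


best = list(range(n))              # addresses 0, 1, ..., n-1: one per module
worst = [i * n for i in range(n)]  # addresses 0, n, 2n, ...: all in module 0

print(longest_queue(best, interleaved))   # 1
print(longest_queue(worst, interleaved))  # 16
```

With any fixed mapping known in advance, such a worst-case request set exists, which is why the choice of mapping matters.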
One of the approaches to solving this problem is based on universal hashing, as introduced by Carter and Wegman [2]. During a simulation the MPC processors apply a hash function $h$ chosen randomly from a class of universal hash functions $H$. The function $h$ is used to distribute the logical addresses of the PRAM among the memory modules of the MPC. It is expected that on every simulation step the function $h$ will spread the requests evenly among the memory modules of the MPC regardless of the memory access patterns of the PRAM. The class of universal hash functions is defined as follows.
While simulating the PRAM we assume that the $i$th processor of the MPC runs the same program as the $i$th processor of the PRAM. The shared memory of the PRAM is divided among the $n$ memory modules of the MPC in such a way that memory module $M_j$, $0 \le j < n$, contains all PRAM addresses $a$, $0 \le a < m$, for which $h(a) = j$. The details of the simulation can be described as follows.
Initialization. Choose $h \in H$ at random and store $h$ in every processor of the MPC.
Step by step simulation. For the logical address $a_i$ generated by processor $P_i$ of the MPC, apply $h$ to $a_i$ and obtain the memory module index $b_i = h(a_i)$. Issue a request for variable $a_i$ stored at module $M_{b_i}$. A memory module $M_j$, $0 \le j < n$, collects all requests for variables in $M_j$ and serves them sequentially. When all requests are served, the next PRAM step is simulated.
Given the above scheme, the question arises how efficient the simulation is, or how long the queues of requests in front of each memory module are. Since the PRAM processors operate synchronously, all memory requests issued in a particular step must be serviced before the simulation of the next step can begin. Therefore our objective is to minimize the length of the longest queue of requests in front of the memory modules (recall that all these requests are serviced sequentially), as it bounds the efficiency. To study it in more detail we need to define some parameters describing the length of the queues. Let $S = \{a_1, a_2, \ldots, a_p\}$, $S \subseteq [0 \ldots m-1]$, be a set of addresses of arbitrary cardinality $p$, and let $h \in H$. Define
$$R_{\max}(h, S) = \max_{0 \le j < n} |\{a \in S : h(a) = j\}|$$
and
$$R^p_{\max} = \max_{S \subseteq [0 \ldots m-1],\ |S| = p} \ \frac{1}{|H|} \sum_{h \in H} R_{\max}(h, S).$$
$R_{\max}(h, S)$ is the length of the longest queue in front of any memory module when function $h \in H$ is used and set $S$ of addresses is issued by the processors. $\sum_{h \in H} R_{\max}(h, S) / |H|$ is the expected value of $R_{\max}(h, S)$, and $R^p_{\max}$ is the worst case of that value taken with respect to all possible sets $S$.
Proof. Let $S$ be defined as before and let $P_i(S)$ be the probability that $R_{\max}(h, S) \ge i$, i.e. $P_i(S) = |\{h \in H : R_{\max}(h, S) \ge i\}| / |H|$. Then $P_p \le \ldots \le P_k \le \ldots \le P_2 \le P_1 \le 1$ and
Let $P^k_j(S)$ be the probability that at least $k$ addresses of $S$ are mapped onto memory module $j$. Then we have $P_k(S) \le P^k_0(S) + P^k_1(S) + \ldots + P^k_{n-1}(S)$. Since $H$ is $c$-strongly $k$-universal, for a fixed set $\{a_1, a_2, \ldots, a_k\}$ it holds . . . $x_k$ pairwise distinct, there is at most one non-trivial polynomial $g$ of degree at most $k - 1$ with $g(x_i) = y_i$, $1 \le i \le k$, it can be concluded that $G$ is 1-strongly $k$-universal. For the purpose of the simulation, class $G$ has to be modified into the form
Given an address $x$ of the PRAM, $g(x)$ can be interpreted as a global address in the MPC, which corresponds to location $\lfloor g(x)/n \rfloor$ of module $h(x) = g(x) \bmod n$. Unfortunately, for polynomials of degree greater than 1, the mapping of PRAM addresses $x$ into their internal locations in memory modules is not one to one. In other words, several addresses can be mapped into the same location in a given module. We call these addresses synonyms. To handle this problem, a memory module maintains for each location $\lfloor g(x)/n \rfloor$ a table of pairs ($x$, data in PRAM location $x$) for all $x$ mapped to that location. Thus PRAM address $x$ is accessed by searching the table of synonyms associated with location $\lfloor g(x)/n \rfloor$.
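The addressing scheme and the synonym tables can be sketched in Python as follows. The sketch assumes that $g$ is a polynomial with random coefficients evaluated modulo a prime $p$ no smaller than the number of PRAM addresses; the class name `HashedModules`, its methods and this exact form of $g$ are assumptions made for illustration and are not taken from the text.

```python
import random


class HashedModules:
    """Shared memory spread over n modules by a random polynomial hash.

    Assumed form (hypothetical): g(x) = (a_0 + a_1 x + ... + a_d x^d) mod p,
    with p a prime no smaller than the number of PRAM addresses.
    """

    def __init__(self, n, p, degree, rng=None):
        rng = rng or random.Random()
        self.n = n
        self.p = p
        self.coeffs = [rng.randrange(p) for _ in range(degree + 1)]
        # modules[j][loc] is the synonym table {PRAM address: value}
        self.modules = [dict() for _ in range(n)]

    def g(self, x):
        # Horner's rule, highest-order coefficient first.
        acc = 0
        for a in reversed(self.coeffs):
            acc = (acc * x + a) % self.p
        return acc

    def place(self, x):
        """Return (module h(x) = g(x) mod n, location floor(g(x) / n))."""
        gx = self.g(x)
        return gx % self.n, gx // self.n

    def store(self, x, value):
        module, loc = self.place(x)
        self.modules[module].setdefault(loc, {})[x] = value

    def load(self, x):
        module, loc = self.place(x)
        # Search the synonym table of the location for address x.
        return self.modules[module].get(loc, {}).get(x)
```

A short usage example: after `mem = HashedModules(n=16, p=10007, degree=3)`, a call `mem.store(42, 3.5)` followed by `mem.load(42)` returns 3.5, even if other addresses share the same location, because each location keeps its own table keyed by the PRAM address.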
Carter and Wegman proved the following theorem, which makes it possible to assess the universality of class $H_1$.
Parsys SN9500 architecture
The two main components of the Parsys SN9500 parallel computer are the Inmos T9000 transputer and the ST C104 (or C104 for short) packet routing device. The T9000 has much greater capabilities than any of its predecessors from the transputer family. Its peak performance is expected to be 200 MIPS and 25 MFLOPS (according to the Inmos specification of the 50 MHz T9000), with links running at up to 100 Mbits/sec in each direction. The on-chip virtual channel processor, which operates in parallel with the central processing unit, allows physical links to be shared transparently by a large number of virtual channels. The packetization and multiplexing operations are implemented directly in hardware.
The C104 allows networks of a very large number of fully interconnected T9000s to be constructed without the use of any routing software. It has 32 bidirectional data links and two control links. It also includes a full 32 × 32 non-blocking crossbar switch, enabling messages to be routed from any of its links to any other link. The C104 uses "worm-hole routing", which minimizes communication latency, because the chip can start outputting a packet which is still being input. The use of a crossbar switch allows packets to be passed through all links at the same time. The C104 can route packets of any length [6].
The SN9500 contains five C104s and up to 32 fully interconnected T9000s (Fig. 3). Each data link of each T9000 is connected to one of the C104 routing devices. Except for two of the T9000s, data link 0 of each T9000 is connected to C104[0], link 1 to C104[1], etc. This means that every T9000 is connected to every other T9000 via only one C104. The data links of the two T9000s and of the interface card are connected to the fifth C104, which in turn is connected via four pairs of its data links to each of the other routing devices [5].
PRAM simulators
Two kinds of simulators, called SIM1 and SIM2, have been designed and implemented in the occam language on the Parsys SN9500 parallel computer. The structure of SIM1 is similar to that of the MPC (see Fig. 2). Each processor $P_i$ and memory module $M_i$, $i \in [0 \ldots n-1]$, is simulated by a single occam process, with both processes corresponding to a pair $(P_i, M_i)$ placed on a single transputer.
The second simulator, SIM2, simulates the multithreaded module parallel computer (MMPC) as shown in Fig. 4. The structure of the simulator SIM2 reflects the structure of the MMPC. Namely, the $i$th transputer of the simulator runs $u$ processes simulating the computation threads $T^0_i, T^1_i, \ldots, T^{u-1}_i$, and a process simulating a memory module $M_i$.
Simulator SIM1
As mentioned above, two occam processes run on each transputer in SIM1. A high-priority process called Mem[i], $i \in [0 \ldots n-1]$, simulates a module of the shared memory $M_i$ (Fig. 5a). It accepts memory access requests, performs the appropriate operations and sends back to the requesting process either the content of the specified memory location (for reading) or an acknowledge message (for writing). The second, low-priority process called CPU
Matrix multiplication algorithm
In order to measure the performance of the simulators described in the previous section, an EREW PRAM matrix multiplication algorithm has been implemented (see Fig. 6). Each processor $P_i$, $i \in [0 \ldots n-1]$, of the algorithm computes every $n$th row of the resultant matrix $C$ starting from row $i$. We assume that a c.
Example. Let (1) and (2) which access the shared memory were implemented by using the Load procedure calls and local variables tmpa and tmpb in place of registers (see lines (a) and (b) in Fig. 8 ).
Once the reading request is completed, the Load procedure executes the code synchronizing the work of all CPUs of a simulator. The line (4) specified above was implemented by lines (d) and (e) in Fig. 8. The Store procedure writes the value of its second parameter into the address of the shared memory defined by its first parameter, and then synchronizes its work with the other CPUs.
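The row distribution used by the algorithm (processor $P_i$ computes every $n$th row of $C$ starting from row $i$) can be sketched in Python as below. This is only an illustration of the work assignment: the processors are emulated one after another, the Load and Store procedures and the synchronization are not modelled, and the function names `rows_of` and `multiply_by_rows` are hypothetical.

```python
def rows_of(i, n, s):
    """Rows of C computed by processor i: every n-th row starting from row i."""
    return range(i, s, n)


def multiply_by_rows(A, B, n):
    """Multiply s x s matrices A and B with rows of C split among n processors."""
    s = len(A)
    C = [[0.0] * s for _ in range(s)]
    for i in range(n):                  # processors emulated one after another
        for r in rows_of(i, n, s):
            for c in range(s):
                acc = 0.0
                for k in range(s):
                    # one multiplication and one addition per pair of elements
                    acc += A[r][k] * B[k][c]
                C[r][c] = acc           # one write of the result element
    return C
```

With $n = 4$ processors and $s = 10$, processor 1 is assigned rows 1, 5 and 9, and every row of $C$ is assigned to exactly one processor.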
Experiments
The goal of the experiments was to investigate the performance of the simulators SIM1 and SIM2. For the purpose of simulation the EREW PRAM matrix multiplication algorithm was used (see Sec. 7). The algorithm was executed on square matrices $A$, $B$ and $C$ of size $s \times s$, where $s$ = 16, 32, 48, . . . , 96. The simulators themselves and the matrix multiplication algorithm were implemented in occam. The experiments were carried out on the Parsys SN9500 parallel computer populated with $n = 16$ T9000 transputers (an additional transputer ran a front-end process). The T9000 Gamma silicon was used, with a clock speed of 20 MHz and the data links configured to run at 100 Mbits/sec. The computation times were measured by making use of an internal high-priority processor timer incremented every 1 µs. Each execution time measurement was averaged over 20 experiments.
The timings of the sequential version of the matrix multiplication algorithm run on a single T9000 transputer are shown in Table 1 ($s$ defines the size of the matrices; Ave, $\sigma$, Max and Min denote the average execution time, the standard deviation, and the maximum and minimum execution time, respectively, among the 20 experiments).
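The four statistics reported in the tables can be computed from a series of measured times as in the following Python sketch. The function name `summarize` is hypothetical, and the use of the population form of the standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics


def summarize(times):
    """Average, standard deviation, maximum and minimum of a series of run times."""
    return {
        "Ave": statistics.mean(times),
        # Population standard deviation; an assumption of this sketch.
        "sigma": statistics.pstdev(times),
        "Max": max(times),
        "Min": min(times),
    }
```

A large gap between Max and Ave in such a summary points to a few unusually slow runs rather than a uniformly slow simulation.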
Experiments on SIM1
Two versions of the matrix multiplication algorithm were implemented. In the first one, all the matrices $A$, $B$ and $C$ were located in the shared memory. In the second version, only matrix $C$ was stored in the shared memory, whereas the matrices $A$ and $B$ were copied into the local memory of each transputer. As a result, the ratio of local memory accesses and computations to shared memory accesses was increased. We shall call this ratio the grain size. For these two versions of the algorithm, two series of experiments were conducted, in which polynomial hash functions $h$ of degree 1 and degree $\log n - 1 = 3$ were applied, respectively (cf. eqs. (1) and (2)). Before each experiment, new random coefficients of the polynomials were generated.
Equation (1) indicates that the expected length of the longest queue $R^p_{\max}$ is smaller if the degree of the polynomial hash function $h$ is higher, e.g. equal to 3. In such a case one can expect the simulation to be more efficient, as shorter queues of requests are serviced more quickly. On the other hand, the evaluation of a polynomial of higher degree is computationally more expensive and adversely influences the efficiency. Therefore in practice the degree of the polynomial hash function should be chosen as a compromise.
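One way to explore this compromise is a small Monte Carlo experiment, sketched below in Python. It estimates the mean longest queue for one fixed set of $n$ addresses over many randomly chosen polynomials of a given degree. The form of the hash function (a polynomial modulo a prime $p$, reduced modulo $n$), the prime 10007, the stride access pattern and all function names are assumptions made for this illustration; the sketch does not reproduce the measurements reported in this paper.

```python
import random
from collections import Counter


def poly_hash(coeffs, x, p, n):
    """h(x) = ((a_0 + a_1 x + ... + a_d x^d) mod p) mod n, by Horner's rule."""
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * x + a) % p
    return acc % n


def mean_longest_queue(addresses, degree, n, p, trials, rng):
    """Average, over random polynomials, of the longest module queue."""
    total = 0
    for _ in range(trials):
        coeffs = [rng.randrange(p) for _ in range(degree + 1)]
        load = Counter(poly_hash(coeffs, a, p, n) for a in addresses)
        total += max(load.values())
    return total / trials


rng = random.Random(0)
n, p = 16, 10007                       # p is an assumed prime modulus
addresses = [i * n for i in range(n)]  # a stride pattern of n requests
for degree in (1, 3):
    print(degree, mean_longest_queue(addresses, degree, n, p, 1000, rng))
```

Under Horner's rule a polynomial of degree $d$ costs $d$ multiplications and $d$ additions per address (plus the modular reductions), so the degree 3 function costs roughly three times as much to evaluate as the degree 1 function. Whether the shorter queues repay that cost depends on how expensive a queued memory access is on the target machine.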
The results of the tests for the first version of the algorithm are shown in Tables 2 and 3, and illustrated in Fig. 9 (graphs (a) and (b)). The graphs in Fig. 9 depict speedups defined as $S = T_1 / T_s$, where $T_1$ is the execution time of the sequential version of the matrix multiplication algorithm on a single transputer, and $T_s$ is the time of the PRAM simulation of the algorithm. As can be seen from graphs (a) and (b), the multiplication of matrices simulated on 16 processors lasts roughly 30 times longer than on a single processor. The reason for this low performance is the small grain size of the computations. Namely, only two (relatively cheap) local floating-point operations on the matrix elements and a few fixed-point address operations are executed for the three pairs of an (expensive) shared memory access and a global synchronization (cf. lines (a)-(e) in Fig. 8). (It is worth noting that the matrix multiplication algorithm with a small grain is a demanding test for the PRAM simulation.)
Graphs (a) and (b) also show how the degree of the polynomial hash function influences the performance of the simulation. The first degree polynomial gives slightly shorter average execution times, although the times themselves are less regular and predictable. For example, for the matrices of size 32 × 32 the longest execution time in the first series of our experiments was more than twice as long as the longest time in the second series. Those shorter average execution times obtained for the first degree hash function mean that the evaluation time of the function dominates the time of servicing longer queues which likely arise while this function is used plus the time for dealing with
Graphs (c) and (d) in Fig. 9 illustrate the results of the experiments with the second version of the algorithm, in which only the resultant matrix $C$ was stored in the shared memory (Tables 4 and 5 contain the corresponding measurements). In that case the shared memory was accessed only once after $s$ floating-point multiplications and $s$ floating-point additions. Due to the greater grain size, the speedups achieved are much better than previously.
For the matrices of size 32 × 32 and the hash function of degree 1, we measured a longer average execution time than for the function of higher degree. This was caused by the very long execution time of a single experiment, almost three times longer than the average. In that experiment the hash function with the randomly generated coefficients mapped almost all writing requests in every step of the simulation into the same module of the shared memory.
Experiments on SIM2
During the experiments on the simulator SIM2 only the polynomial hash function of degree 1 was used, for matrices of size 96 × 96. The results obtained are shown in Table 6. Contrary to our expectations, introducing parallel slackness by running a number of threads on each transputer resulted in only a slight improvement in the efficiency of the simulations. For example, for the matrices of size 96 × 96 the speedup of 5.045 on SIM1 increased to 5.887 on SIM2 (cf. Tables 4 and 6).
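As a quick check on the size of this improvement, the two speedups quoted above can be compared directly. The efficiency figures on the right assume that both simulators ran on the $n = 16$ transputers mentioned in the experimental setup, which the text does not state explicitly for SIM2.

$$\frac{5.887}{5.045} \approx 1.167, \qquad \frac{5.045}{16} \approx 0.315, \qquad \frac{5.887}{16} \approx 0.368$$

That is, multithreading improved the speedup by roughly 17%, while under the stated assumption the parallel efficiency stayed below 40% in both cases.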
Conclusions
In this paper the problem of the randomized simulation of an EREW PRAM on an MPC was studied. Two kinds of simulators based on universal hashing were designed and implemented in the occam language. In the first simulator the number of simulating processors was equal to the number of programs of the PRAM, so that each processor ran a single program. In the second simulator, parallel slackness was introduced by executing a number of computation threads on each simulating processor. Practical experiments on the simulators were conducted using the Parsys SN9500 parallel computer, with the matrix multiplication algorithm as a running example.
The results of the experiments on the first simulator indicate that the PRAM simulation is still not efficient enough to be useful in practice. We found that the cost of the shared memory accesses (implemented, as noted earlier, in the fully connected transputer network via worm-hole routing) is relatively high in comparison with the cost of the local computations. One of the reasons for this is that the messages exchanged during an access are small (an access request can be 5 or 9 bytes long, and a reply 1 or 5 bytes), and according to the measurements presented in [5] only roughly half of the peak T9000 link bandwidth is attained with messages of this size. The experiments show that the simulations in which the polynomial hash function of degree 1 is used are more efficient than those using the function of higher degree. This means that the evaluation time of the hash function dominates the time of servicing the longer queues which are likely to arise when a lower degree polynomial is applied. Since the simulation is randomized by nature, the simulation times vary among the experiments, especially for the polynomial of degree 1. This can be explained by the fact that the mapping of addresses among the memory modules of the MPC is then less uniform than in the case of polynomials of higher degree.
Contrary to our expectations, introducing parallel slackness in the second simulator improved the efficiency of the simulations only to a small degree.
Table 6: Timings of the algorithm simulated on SIM2 ($s = 96$; $h$: polynomial of degree 1)
