We study the e ect of varying the multithreading level of processors in work-optimal PRAM simulation algorithms on coated meshes. A coated mesh consists of a mesh connected routing machinery and P processor&memory pairs that form a coat on the routing machinery. The algorithms studied are based on greedy routing, sorting, improved virtual leveled network technique, combining queues method, and synchronization wave. Our results show that increasing the multithreading level considerably improves the simulation cost. The cost can be decreased below 5 routing steps per P simulated PRAM processors. In case of one algorithm, even costs 1:1 : : :2 are achieved.
Introduction
Work-optimal simulation of PRAM models means that a constant fraction of the aggregate power of processors can be given to (arbitrary) PRAM computations. We study the e ciency of ve PRAM simulation algorithms on a structure called coated mesh. This is interesting, since (a) the coated mesh (rigid de nition is given in Section 1.1) is a very realistic architectural model of parallel computation; (b) it allows work-optimal implementation of certain PRAM models by using Valiant's parallel slackness concept 15] (i.e., by using multithreading); (c) the PRAM models have been very popular platforms in parallel algorithm design 5]; and (d) although several work-optimal PRAM simulations have been developed 4, 9, 10, 11, 15] , relatively little attention has been paid on the exact simulation cost and the realisticness of proposed simulations.
The coated mesh is a mesh based routing machinery that is \coated" with processor&memory pairs. It is This work was nancially supported by the Academy of Finland. an attractive architectural model, since due to constant length connections it is truly scalable. As a mesh based construction, it is very regular and has a natural layout. The di erence compared to an ordinary mesh is that the mesh nodes are much more simpler in the coated mesh { basically, they just forward packets. Ordinary meshes cannot simulate PRAM models work-optimally because of insu cient routing capacity compared to the diameter 8]. The idea in introducing a P-processor coated mesh with diameter is to increase the routing capacity (from O(P)) to O(P ). By using Valiant's XPRAM framework, this allows work-optimal PRAM simulation.
The capacity is obtained by having O( ) routing machinery nodes per each physical processor. However, the work and hardware complexity of routing machinery nodes is seen negligible, since they are considerably simpler than processor&memory pairs. The same reasoning is used in the connection of p-direct butter y 15] that is analogous structure with the coated mesh. It uses the butter y instead of the mesh and has shorter logical diameter but longer wires and more irregular physical layout.
All algorithms, except one, that we study in this paper can be proved to be work-optimal. However, the presented proofs do not take increasing of the multithreading level (how many threads, or virtual PRAM processors, per each real processor) properly into consideration. In the following, we shall refer to the multithreading level simply by load, and call overloading the increasing of load. We show that overloading can be used to signi cantly improve the e ciency of all the studied simulation algorithms. This is due to the simulation algorithms having parts whose cost is independent of the load. Another major reason behind improving is that relative cost of routing phases decreases as the load increases. A routing machinery whose capacity is O( ) times larger than P enables, in principle, that random routing tasks complete in time O( +load). The relative cost of routing phase O( +load load ) thus decreases as the load increases. When load exceeds the routing machinery capacity (per processor), the relative cost can be expected to show asymptotic behavior.
Is high multithreading level an unrealistic assumption? Tera 3] and SB-PRAM 1, 2] support 128 and 1024 threads per each physical processor, respectively. Supporting even more is not unrealistic from hardware point of view, since a thread basically requires only few tens of words of memory. From algorithm design point of view, the answer is not as clear. However, the facts that several problems have work-optimal NC-class PRAM-solutions 5] and parallel algorithms are often developed in practice for large modeling problems 13], suggest that high multithreading level is not an unrealistic assumption. Work-optimal NC-class algorithms use polylogarithmic time. Since work-optimality is dened with respect to sequential work (often polynomial wrt. the size of problem), the multithreading level is clearly high. Besides, the threads can originate from several concurrently running PRAM applications.
Next in Sections 1.1 and 1.2, we de ne the coated mesh and the PRAM models, respectively. Then we give an overview of the ve PRAM simulation algorithms that work on coated meshes. In Section 3, we show how overloading improves their e ciency. A small comparison with work-optimal PRAM implementations proposed for the p-direct butter y is made in Section 4. We draw conclusions in Section 5 and propose some open problems.
Coated mesh
We call coated mesh a mesh based routing machinery (of shape
, which is coated with processor&memory pairs (see Figure 1) . The processors are attached only to the nodes on the surface of the routing machinery. Each node of the routing machinery has a constant amount of local memory for computation and for storing packets; has at most unidirectional links towards each neighboring node; is capable of forwarding at most one packet from each incoming edge (current size of output queue restricts forwarding) in unit time; has a bu er of size q packets for each outgoing edge (we assume only the \head" of each output bu er to reside at the receiving node); and is connected with bidirectional links to at most one processor per direction. We assume that each physical processor P i is attached to a memory module M i . The routing machinery we take to be either a p Q p Q mesh or a where the queues are combining queues { i.e., the edge queues can combine, without increasing the size of the packet, in unit time an incoming packet with another packet already in the queue and referring to the same shared memory location (both packets must be either read or write requests); there is enough memory attached to each edge to store the necessary information to do decombining (split a reply into two replies); and the queues can do the decombining.
The clockrate of the routing machinery might be higher than that of the processors (by time multiplexing the usage of a physical link, one can create more virtual links). The routing machinery could be toroidal like that used in the Tera machine 3]. This would shorten the length of expected route.
PRAM models
The PRAM models consists of N processors and a shared memory M of size m. Each processor has some local memory, and it knows its own unique identi er PID (Processor IDenti er). The PIDs are integers in the range 0; 1; : : :; N ? 1. We denote the processors with P 0 ; P 1 ; : : :; P N?1 . A step of a PRAM processor consists of a local operation, reading a shared memory location, or writing to a shared memory location. The phases of a step are executed synchronously, and each processor is assumed to nish the current step before any of the processors begins the next one. Here we discuss only the following PRAM models. EREW: Concurrent reading or writing of any shared memory cell is forbidden. However, each shared memory cell may be read and written during the same step.
Arbitrary CRCW : Concurrent read and write operations of the shared memory are allowed. If two or more processors write to the same memory location in a given step, then one of the values is selected arbitrarily to become the new value. Nothing is known about the selection. From now on, N denotes the number of PRAM processors, P the number of real processors, and Q the number of nodes in the routing machinery.
Simulation algorithms
The technique to simulate PRAM models on distributed memory machines is well-known 11, 15] . The shared memory of the PRAM is mapped into the memory modules of the coated mesh with a randomly chosen hash function h, and each real processor is assumed to simulate bN=Pc or dN=Pe PRAM processors. The only problem in the simulation is how to e ciently route read and write requests (caused by references to the shared memory) and replies in the routing machinery while preserving the atomic consistency of the PRAM.
In Sections 2.1 and 2.2, we give high level descriptions of two EREW simulation algorithms. In Section 2.3, we describe a sorting based CRCW simulation algorithm. Then, in Sections 2.4 and 2.5 we give two CRCW simulation algorithms, which are based on the combining queues method and on the virtual leveled network technique.
Two-phase EREW simulation
We do the simulation of an N-processor EREW PRAM on a P-processor coated mesh with Algorithm 2.1. It attaches two intermediate targets for each packet: One from pile of nodes attached to source and target processors. A pile of nodes related to some processor is a set of routing machinery nodes that form the closer half of an array of nodes perpendicular to the processor. The related processor is connected to one of those nodes. Each routing machinery node belongs to at most d piles.
We route packets from rst intermediate target to the second with a greedy routing algorithm R g that \cor-rects" the current coordinates of packets one by one in a predetermined order of axis (X, Y , Z). On arrival to the second intermediate target, the packets are forwarded greedily along removal edges to their target processor. Besides ordinary bidirectional connections, the coated mesh is assumed to have unidirectional removal edges from each routing machinery node towards all the processors whose pile it belongs. The algorithm uses queuing discipline Q ff (farthest-rst) to determine, which packet is sent rst in case there are many candidates.
We assume that each processor can receive all the packets destined to it during one simulation round, and all the nodes can store the packets injected to the routing machinery before actually beginning a routing phase. For simplicity, N=P is an integer in the following.
Algorithm 2.1 (2-phase EREW simulation) Each processor P i (i 2 f0; 1; : : :; P ? 1g) simulates PRAM processors P iN=P +j (j 2 f0; 1; : : :; N=P ? 1g). 3 ) PRAM processors, with high probability. By \high probability" we mean probability 1 ? N ?k for any xed k (the choice of k a ects o(1)). Expression \x routing steps per P simulated processors" tells the cost of each PRAM operation. In the following, we shall call this the relative cost of simulation algorithm G(N; P; d; q), or simply the relative cost.
A relative cost of 36, or more, is too high. A method to improve it is overloading { having N Q. The relative cost of injection phases (Phases 1 and 5) is almost independent of Q and N. For N Q their expected relative cost is close to 2 and with high probability it is less than 3. The synchronization phases (Phases 4 and 7) are independent of N { i.e., their relative cost decreases as N increases. However, due to di culties of exactly analyzing routing situations, where N Q, we provide experimental results in Section The purpose of having 4 bidirectional connections between neighboring nodes is to build a virtual acyclic directed graph (VDAG) from processors to memory modules and vice versa. Such a VDAG can be used to implement more sophisticated synchronization mechanism: synchronization wave 2, 11]. The idea is that when a source has sent all its packets on their way, it sends a synchronization packet. Synchronization packets from various sources push on the actual packets, and spread to all possible paths that the actual packets could go. When a (logical) node receives a synchronization packet from one of its inputs, it waits, until it has received a synchronization packet from all of its inputs, then it forwards the synchronization wave to all of its outputs. While waiting, the node forwards other packets. The synchronization wave may not bypass any actual packets and vice versa. When a synchronization wave sweeps over a VDAG based routing machinery, all nodes (and processors) receive exactly one synchronization packet via each input link and send exactly one via each output link.
The injection and removal edges together with R g routing protocol de ne a VDAG for both logical networks in a natural way. The VDAG nature of routing prevents deadlocks. The synchronization wave approach makes PRAM simulation more exible: The processors and memory modules may process arriving packets freely and the atomic consistency is preserved as long as the synchronization wave re ects between the memory modules and processors. Obviously, the processors can achieve a high utilization level. Only the di erences to Algorithm 2.1 are shown.
2. Route the packets to their target via a random node on target pile by using combining on route and storing information on nodes for decombining (when and where the combined packets came). 6. Repeat the routing process of Phase 2 in reverse order and do decombining on route. 7. Update registers. 
Improved virtual leveled network
In a (directed) leveled network, each directed edge connects a node on some level i to another node on level i + 1. A leveled network is assumed to consist of L levels. Mesh as such is not a leveled network, but it can be seen as a virtual leveled network 7]. Our reason to embed a virtual leveled network to a mesh is due to the routing results proved in 7] that can be used to derive PRAM simulation results. Notice that only the routing machinery is seen as a leveled network, not the whole coated mesh.
We used instead of . In fact, the depth of the above construction can be reduced by modifying the way packets enter from plateau s to plateau s + 1. Clearly, there is a unique path from a node on plateau 1 to any node on plateau 2d. We call R ivln a routing algorithm that routes packets from level to level accord-ing to above described improvement; keeps all the packets passing through (virtual) 
Experimental comparison
In our experiments, we used meshes of side length 10, 30, and 50 for the 3D coated mesh (Q = 1000, 27000, 125000 and P = 600, 5400, 15000), and 50 and 150 for the 2D coated mesh. We varied the overloading factor b = p Q=6 for the 3D coated mesh) was at most 200 in the 3D case, and at most 750 in the 2D case. Altogether, about 30 000 routing experiments were conducted on more than 300 di erent parameter value combinations of q, Q, b, and the simulation algorithm. For each experimented parameter combination, at least 30 experiments were made. The curves shown later are B ezier splines drawn via the measured points.
We measured the e ciency of simulation algorithm A in terms of measured relative simulation cost G(s d ; b; q; A). We made our experiments on randomly generated patterns (EREW), and patterns produced randomly on the basis of distributions of Figure 2 (CRCW). The distributions of Figure 2 are, of course, only a tiny fraction of all possible distributions but we believe them to be representative enough. A gray area means randomly chosen targets, and the white areas each represent an amount (proportional to the area) of read requests with a common target memory location. An expression like \16 5%" is a shorthand for 16 areas of size 5%. Some of the algorithms described in Section 2 have a phase where a global ensuring that all packets have reached their destination is required. We assume this to be implemented so that after T 1 routing steps a global check over the routing machinery is calculated, and if some packets are still in the routing machinery, the checking is repeated after Time T 1 should be chosen so that only a small fraction of typical routing tasks require second checking. We have found T 1 = \average routing time + 2 standard deviation of the routing time" to be a good measure. In the cost calculation, T 2 is simply the maximum measured routing time.
Experiments with bu er size
Proofs related to the algorithms described in Section 2 do not give good exact bounds for the bu er size. Some bu ers being occasionally full during the routing process does not decline the overall routing capacity signi cantly. One is tempted to believe that small bu er size is adequate in practice. On the other hand, it is not clear how big role the bu er size has in the cost of the whole construction. Figure 3 describes simulation experiments made with Algorithm 2.1. As can be seen, the curves for di erent bu er sizes almost overlap each other: Obviously bu ers of size 4 are adequate. Queues with higher capacity seem to make the routing to succeed faster, but the e ect is quite negligible. On the basis of Figure  3 , we conclude that the e ect of overloading factor b on relative simulation cost is much larger than that of bu er size q. Similar, observations were made for all other simulation algorithms. For simplicity, we mainly use q = 4 in the following. 
EREW simulation results

Algorithm 2.1
In our experiments with Algorithm 2.1, we assume that after the injection phase each mesh node has exactly b packets with randomly chosen targets. The routing itself was done with Q fifo and R g using the principle that the incoming queues of each node are processed in cyclic order (by selecting a random starting point each time), and only the rst packet of each queue is moved to the target queue, if there is room for that packet. We were only interested of Q fifo because of its simplicity. As can be seen in Figures 4 and 5 , the relative cost of Algorithm 2.1 can be pushed below 10 both for the 2D and 3D coated meshes. In fact at b = 40, we measured relative costs 4:9 and 8:8 for a CM(150 2 ; 4; 2) and a CM(30 3 ; 4; 2), respectively. We achieved even better results by increasing q. At q = b=2 and b = 120, relative costs 3:4 and 6:3 were achieved for a CM(50 2 ; q; 2) and a CM(10 3 ; q; 2), respectively. Our experiments suggest that by using only small bu ers and properly overloading the coated mesh, about 7{8-fold speedup over the results of 9] can be achieved (36 ?! 5 and 78 ?! 9).
By using larger bu ers, over 10-fold speedups seem achievable.
Overloading decreases the relative cost, since the relative cost of synchronization phases (Phases 4 and 7 of Algorithm 2.1) is approximately 2=b. The relative cost of injection phases is almost independent of b { the cost of them is approximately 2{3. The cost of routing depends on load and the diameter of routing machinery , and is approximately (c d load + )=load. When load is very high, the relative cost of synchronization phases is negligible and the cost of injection phases is just over 2. Therefore it seems that c 2 1 but the same does not seem to hold for c 3 .
Algorithm 2.2
In the test setting of Algorithm 2.2, the processors initially have load packets with randomly chosen targets. The two intermediate targets are chosen randomly as explained in Algorithm 2.2, and again queuing discipline Q fifo is used. Synchronization wave separates consecutive steps, and a PRAM step is taken to be completed when last part of the synchronization wave (related to that PRAM step) leaves the processors. That moment is used to measure the time to simulate a PRAM step. On a CM(10 3 ; 32; 4) we even measured relative cost 1:68 at point b = 180. Although in the 3D case, the cost decreases below 2, it seems not to freely approach 1.
The reason is that in the 3D case the routing machinery bandwidth turns out to be insu cient. . On average half of the packets produced by the other half of the processors cross the bisection. Therefore the bisection width should be at least 2ds
In the 2D case this holds but not in the 3D case. In fact, the bisection width argument suggests the relative cost of Algorithm 2.2 to converge towards 3=2 on CM(s 3 ; q; 4) { the bisection bandwidth should be somehow increased. 
CRCW simulation results
The purpose of sorting in Algorithm 2.3 is to transform CRCW style routing problem into EREW style problem. Therefore, we estimated the cost of Algorithm 2.3 directly on basis of EREW experiments made with Algorithm 2.1 (all except routing phases are deterministic). The cost of Algorithms 2.4 and 2.5 were estimated by conducting a set of experiments with patterns of Figure 2. Destination for packets were then chosen by rst generating a distribution and then assigning a target for each packet randomly according to the distribution. Basically, all the same assumptions concerning processing of queues are as in the previous section. Further, the routing machinery nodes are assumed to \precom-bine" packets with the same target { after or during the injection phase but before the routing phase. Comparison of the three CRCW simulation algorithms is made in Figures 9 and 10 by showing the curves for measured relative cost. The measured relative cost is now calculated as an average over all the experiments made with patterns of Figure 2 .
First impression of Figures 9 and 10 is that overloading clearly improves the e ciency. In the 2D case, the increase of b seems to provide a relatively modest im- We also observed that pattern A produces the worst result for almost all methods. Obvious conclusion is that the more we have packets to combine the faster we can solve the routing problem.
We conclude that overloading can be used to improve the e ciency of Algorithm 2.3 by 5{7-fold. In many cases, the improved virtual leveled network technique proves out to be better than the sorting based solution. However, the combining queues method seems to be even more e ective: It seems to cost about as much as an EREW simulation based on Algorithm 2.1! An obvious remaining question is that can one modify the combining queues method to work on Algorithm 2.2?
Summary of observations
The larger the bu ers the better the simulation time. In the EREW case Algorithm 2.2 is nearly optimal, but in the CRCW case all the algorithms leave a lot to hope for. Sorting based solution is too costly. The combining queues method works very well, but the simplicity of CM over CM + makes the virtual leveled network based approach more attractive.
Mesh versus butter y
The original stimulus to study the coated meshes was to nd alternatives for butter y based work-optimal PRAM constructions 1, 2, 12, 15]. In 9], we presented such constructions for the coated meshes, although the constant factors left a lot to hope for. The advantage of P-direct butter y over the coated meshes is a small graph-theoretic diameter O(logP), which also allows the overloading factor to be asymptotically smaller. However, the problem with the butter y is its physical layout { the largest physical distance between two nodes is of the same order for both interconnection types: Some connections between logical neighbors must necessarily be of length ( 3 p P= logP) in any 3D layout of P-direct butter y 9]. In fact, Thompson's VLSI areatime complexity result 14] implies wire length bound (P= log 2 P) for any 2D VLSI layout. Naturally, physical layouts for the butter y are not as regular as for the mesh.
On the other hand, in order to avoid some timing and load balancing issues one might have to operate on the condition of the length of longest wires. This would restrict the exploitable clockrate and/or scalability. The e ect of long wires can be reduced by scattering bu ers on long wires at regular intervals, and by pipelining the communication over long wires. This increases the logical diameter and at the same time changes the whole structure towards the mesh.
The coated meshes do not have such severe layout problems. Yet, given su cient amount of parallel slackness, the coated meshes can be made to simulate PRAM models with relatively small cost. In 2], the relative cost was made very close to 1 for the P-direct buttery by using many hardware improvements. Those improvements could also be applied to the coated meshes to further improve the e ciency.
Interesting questions that remain are the size of rout- Mn(P)/Bn(P) Figure 11 : Ratio of routing machinery sizes. Denote by Bn(P) = P(logP + 1) the number of routing machinery nodes in an ordinary P-direct buttery implementation. Respectively, denote by Mn(P) = (P=6) 1:5 the size of a P-processor 3D coated mesh routing machinery. Surprisingly, Mn(P) < Bn(P) for P 2 15 .
ing machinery, the hardware complexity of nodes and processors, and the amount of parallel slackness required. We would very much like to see a hardware comparison between the 3D coated mesh and the Pdirect butter y. The su cient parallel slackness Load was concluded to be approximately 10 3 p Q=6 for the 3D coated mesh. For Q = 10 6 , this is less than 200, and for P = 10 6 , it is approximately 650. Thus, even for a relatively large 3D coated mesh the amount of parallel slackness should not be a serious obstacle.
An interesting aspect is the number of nodes in the routing machinery (=extraneous hardware). If no attempt to integrate nodes is made, then the routing machinery of the P-direct butter y has P(logP +1) nodes, whereas the 3D coated mesh implementation has ? P 6 3 2 nodes. It is quite surprising to notice that the 3D coated mesh actually has less nodes for P 2 15 . Even, when P = 2 20 , the ratio of nodes is only 3.3:1 in favor of the butter y construction (see Figure 11) . Unfortunately, for the 2D coated mesh the corresponding ratios are clearly worse.
Conclusions
We have provided experimental evidence that EREW and CRCW PRAM models can be simulated on the 2D and 3D coated meshes with relative cost clearly below 10 { even below 5 { routing steps. In the EREW case, Algorithm 2.2 was found to be nearly optimal, giving relative costs below 2 { even relative cost 1:1 was achieved. Of the three CRCW simulationalgorithms we found the combining queues based algorithm to be the best in terms of the number of required routing steps. We leave as an open problem the question whether or not this holds when the hardware requirements (implied by the algorithms) are taken into account. We also would like to know whether the combining queues method can be beaten?
