Abstract
We present deterministic upper and lower bounds on the slowdown required to simulate an $(n,m)$-PRAM on a variety of networks. The upper bounds are based on a novel scheme that exploits the splitting and combining of messages. This scheme can be implemented on an $n$-node $d$-dimensional mesh (for constant $d$) and on an $n$-leaf pruned butterfly, and attains the smallest worst-case slowdown to date for such interconnections, namely $O(n^{1/d}(\log(m/n))^{1-1/d})$ for the $d$-dimensional mesh and $O(\sqrt{n\log(m/n)})$ for the pruned butterfly. In fact, the simulation on the pruned butterfly is the first PRAM simulation scheme on an area-universal network. Finally, we prove restricted and unrestricted lower bounds on the slowdown of any deterministic PRAM simulation on an arbitrary network, formulated in terms of the bandwidth properties of the interconnection as expressed by its decomposition tree.
1 Introduction
The problem of implementing a shared-memory abstraction on various distributed-memory parallel architectures has been intensively studied over the last decade. Generally, this problem has been referred to as the PRAM simulation problem and involves representing the $m$ cells of the PRAM shared memory (called variables) among the $n$ processor-memory nodes of the simulating machine in such a way that any $n$-tuple of distinct cells may be read or written efficiently. The time required to simulate one PRAM step is known as the slowdown of the simulation. A number of approaches to this problem, both probabilistic and deterministic, have been investigated for a variety of well-known architectures such as the complete interconnection, the mesh of trees, the butterfly, as well as a variety of expander-based architectures, among others. We will not attempt to summarize the extensive literature on this problem here but only quote those results that relate directly to our work, and refer the interested reader to [PPS94] for a recent and comprehensive summary of further work on this topic.

Building on earlier work of Upfal and Wigderson [UW87], Alt et al. [AHMP87] presented a deterministic scheme to simulate a PRAM with $n$ processors and $m$ variables (called an $(n,m)$-PRAM) on an $n$-node Module Parallel Computer (MPC), an architecture in which each node includes both a processor and a private memory module accessible only to that processor, and in which the nodes are connected by a crossbar that allows each node to transmit or receive one message per step. Their scheme employs the following copy-based method for the representation of the PRAM variables, which most of the deterministic simulation algorithms, including the present work, adopt. Specifically, each variable is represented by a set of copies, whose size $2c-1$ is logarithmically related to $n$ and $m$, and each copy consists of a value and a timestamp. The copies are distributed carefully among the memory modules of the simulating machine. To write a variable, at least $c$ of its copies are overwritten to reflect the intended value and the time of writing. To read a value, at least $c$ copies are inspected. This set of $c$ copies contains at least one of the copies most recently written, which is readily identifiable by virtue of its timestamp. Alt et al. show that for a suitable distribution of the copies among the nodes of the machine, any $n$-tuple of variables may be accessed (read or written) in $O(\log m)$ time.
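The copy-based protocol just described can be illustrated with a minimal sketch. The function names and the representation of a copy as a `(value, timestamp)` pair are our own illustrative choices, not part of the scheme of Alt et al.; the sketch only checks the intersection property, namely that any $c$ of the $2c-1$ copies include one written most recently.

```python
from itertools import combinations

def write(copies, chosen, value, now):
    """Overwrite the chosen copies (at least c of them) with (value, timestamp)."""
    for i in chosen:
        copies[i] = (value, now)

def read(copies, chosen):
    """Inspect the chosen copies and return the value with the latest timestamp."""
    value, _ = max((copies[i] for i in chosen), key=lambda vt: vt[1])
    return value

c = 3
copies = [(None, 0)] * (2 * c - 1)   # 2c-1 copies, each a (value, timestamp) pair
write(copies, chosen=(0, 1, 2), value="old", now=1)
write(copies, chosen=(2, 3, 4), value="new", now=2)

# Read through every possible set of c copies: each one must intersect
# the c copies touched by the last write.
results = {read(copies, subset) for subset in combinations(range(2 * c - 1), c)}
```

Since two sets of $c$ copies drawn from $2c-1$ always share an element, every read in the sketch returns the latest value regardless of which copies are inspected.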
The above scheme can be ported to an arbitrary network by simulating each MPC step using standard techniques such as routing and sorting. In particular, this approach yields a simulation with slowdown $O(n^{1/d}\log m)$ on an $n$-node $d$-dimensional mesh ($d = O(1)$) and a simulation with slowdown $O(\sqrt{n}\log m)$ on an $n$-leaf pruned butterfly, which are the interconnections that we consider in this paper. In [AHMP87], it was also observed that a simple PRAM simulation for the two-dimensional mesh with an optimal slowdown of $O(\sqrt{n})$ is indeed possible. Unfortunately, this simulation requires up to $n$ copies per variable, resulting in an unacceptable memory blow-up. Moreover, the method does not extend to higher-dimensional meshes and other interconnections.
Most of the deterministic simulations that appear in the literature, including those of this paper, rely on memory distributions that are built upon certain expander-based graphs whose existence can be proved, but for which no efficient construction is known. Recently, Pietracaprina et al. [PPS94, PP95] have studied deterministic simulations based entirely on explicitly constructible structures. By resorting to a complex hierarchical arrangement of constructible, mildly-expanding graphs, they achieve $O(\sqrt{n}\log n)$ slowdown on an $n$-node mesh for memories of size $O(n^{1.5})$, using $O(\log^{1.59} n)$ copies per variable. In this paper, our focus is on slowdown rather than constructiveness. By employing more powerful expander-based structures, we achieve a better slowdown than that of [PP95] at a lower level of redundancy (copies per variable).
It might appear, at least at first glance, that updating $c$ copies apiece for $n$ variables must involve the physical movement of $cn$ distinct packets across the entire network, which on the $\sqrt{n}\times\sqrt{n}$ mesh would require $\Omega(c\sqrt{n})$ time. In this paper, we devise a novel splitting/combining technique to circumvent this difficulty, based on the following idea. If a processor $p$ wishes to send the same packet to nodes $x$ and $y$ that are "distant" from $p$ but "close" to one another, then rather than dispatch a separate packet for each, it may be more efficient to dispatch a single message to some "intermediate" node $z$ close to both $x$ and $y$. At node $z$, the original packet is then made into two replicas which are forwarded to $x$ and $y$ separately. A careful implementation of this idea leads to the following result.
Theorem 1 For any $m \in 2^{O(n^{1/(d+1)})}$ there exists a scheme to simulate an $(n,m)$-PRAM on an $n$-node $d$-dimensional mesh ($d$ constant) with worst-case slowdown $O(n^{1/d}(\log(m/n))^{1-1/d})$, using $O(\log(m/n))$ copies per variable and $O((m/n)\log^3(m/n))$ storage per node.
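The splitting idea that precedes Theorem 1 can be made concrete with a toy count of link traversals on a two-dimensional mesh. The sketch below is only an illustration under our own assumptions (shortest-path routing, cost measured as the number of links crossed, and hypothetical coordinates for $p$, $x$, $y$ and $z$); it is not the routing algorithm of the simulation.

```python
def hops(a, b):
    """Number of mesh links on a shortest path between nodes a and b."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def direct_cost(p, targets):
    """Link traversals when p sends one separate packet per target."""
    return sum(hops(p, t) for t in targets)

def split_cost(p, z, targets):
    """Link traversals when p sends one packet to z, which replicates it."""
    return hops(p, z) + sum(hops(z, t) for t in targets)

# p is far from x and y, which are adjacent to each other.
p, x, y = (0, 0), (30, 30), (30, 31)
z = x                                  # intermediate node close to both targets
direct = direct_cost(p, [x, y])        # 60 + 61 link traversals
split = split_cost(p, z, [x, y])       # 60 + 0 + 1 link traversals
```

In this instance the single long trip to $z$ replaces two long trips, roughly halving the traffic crossing the mesh.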
In order to implement the splitting/combining strategy outlined above, the scheme relies on a recursive decomposition of the mesh and on efficient algorithms for $k$-sorting, where each processor initially holds $k$ packets, and for $k$-relation routing, where each processor sends and receives at most $k$ packets. Theorem 1 implies that our simulation scheme incurs a slowdown which is a factor $O((\log(m/n))^{1/d})$ smaller than the one obtainable by porting the MPC algorithm of [AHMP87] to an $n$-node $d$-dimensional mesh. We remark that the (exponential) upper bound on the memory size $m$ in Theorem 1 is imposed so that the cost of sequential bookkeeping operations such as local sorting or counting does not dominate the overall running time of the simulation algorithm. Such a bound on $m$ is not needed if a cost model which accounts for interprocessor communication only is adopted, as is customary for network algorithms [Kun93].
The $n$-leaf pruned butterfly [BB95] (described later) is an area-universal network that is a variant of Leiserson's fat-tree [Lei85]. Although quite different from the two-dimensional mesh in terms of the details of its structure, it is sufficiently similar in its bandwidth characteristics to support the key operations upon which our simulations rely with comparable efficiency. By providing novel sorting and routing primitives for this network, and by using its natural decomposition into subtrees, we are able to implement the above simulation scheme with the same slowdown achieved for the two-dimensional mesh, thereby obtaining the first PRAM simulation on an area-universal network. The result is stated in the following theorem.
Theorem 2 For any $m \in 2^{O(n^{1/3})}$ there exists a scheme to simulate an $(n,m)$-PRAM on an $n$-leaf pruned butterfly with worst-case slowdown $O(\sqrt{n\log(m/n)})$, using $O(\log(m/n))$ copies per variable and $O((m/n)\log^3(m/n))$ storage per node.
Lower bounds on the slowdown of PRAM simulations on bounded-degree networks have been presented in a number of studies [AHMP87, KU88, HB94]. All such bounds, however, apply to the entire class of such networks, and cannot be specialized to the characteristics of a given topology. For example, in [HB94] the authors show an $\Omega(\log^2(m/n)/\log\log(m/n))$ lower bound on the slowdown required to simulate a PRAM step on any bounded-degree network, which is too weak for our purposes, since a trivial $\Omega(n^{1/d})$ lower bound may easily be obtained for $d$-dimensional meshes based on diameter limitations. An $\Omega(\sqrt{n})$ bound holds for the pruned butterfly based on straightforward bandwidth considerations. In this paper, we present the first lower bound argument that takes into account the characteristics of the individual network. To capture the properties of the network topology, the bound exploits the notion of decomposition tree [BL84, Lei85], which provides a partition of the network into disjoint regions of limited bandwidth.
As in all previous works, the lower bound is proved under the point-to-point assumption, which requires that a processor updating a number of copies of a variable dispatch a separate message for each copy. When specialized to $d$-dimensional meshes and to the pruned butterfly, our lower bound technique yields the following results.

Theorem 3 Let $m \ge 16n$. For every $T \ge 2m/n$, there exists a $T$-step $(n,m)$-PRAM program whose point-to-point, on-line simulation requires time [...] on an $n$-leaf pruned butterfly.
Unfortunately, the point-to-point assumption upon which Theorems 3 and 4 and the other works in the literature rely precludes the splitting and combining of messages. As a consequence, the above lower bounds do not apply to our simulations directly. However, we are able to prove similar bounds in an unrestricted model by limiting the total level of redundancy used to represent the variables. Such bounds show that our simulations use an amount of redundancy which is only a doubly logarithmic factor higher than the minimum redundancy needed to achieve the same slowdown. Specifically, we have the following result.
[...]
The bound with $d = 2$ also applies to the pruned butterfly.
The rest of the paper is organized as follows. Section 2 discusses the distribution of the copies among the memory modules and the properties required of the graph representing the memory map. In Section 3, the simulation algorithm for the two-dimensional mesh is presented. The algorithm consists of two phases, copy selection and routing, which are described in Subsections 3.1 and 3.2, respectively. This scheme is extended to higher-dimensional meshes in Section 4 and to the pruned butterfly in Section 5. Section 6 shows how the space bounds quoted in Theorems 1 and 2 may be achieved. Section 7 presents the lower bound results discussed above. Section 8 concludes the paper with some final remarks and indicates future research directions.
2 Memory Organization
Consider the simulation of an $(n,m)$-PRAM on an $n$-node machine and suppose that each variable is replicated into $2c-1$ copies, for a suitable integer $c$. It is convenient to model the distribution of copies among the nodes of the machine by means of a bipartite graph $G = (U, V, E)$, where $U$ represents the set of variables, $V$ the set of processor-memory nodes of the simulating machine, and $2c-1$ edges connect each variable to the distinct nodes storing its copies. In the following we will denote by $E(S)$ the set of edges in $E$ incident on a set $S \subseteq U$. Note that there is a one-to-one correspondence between $E(S)$ and the set of all copies of variables in $S$.
Let $S \subseteq U$ and $F \subseteq E(S)$. When $F$ contains exactly $k$ edges incident on each $s \in S$ we call $F$ a $k$-bundle for $S$. Also, we denote by $\Gamma_F(S)$ the subset of $V$ reached by edges in $F$. A vertex $v \in V$ is said to be $q$-congested with respect to $F$ if more than $q$ edges in $F$ are incident on $v$. Finally, the congestion of $F$ is the maximum value $q$ such that there is a vertex in $V$ that is $q$-congested with respect to $F$.
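The definitions above translate directly into a few lines of code. In the following sketch, which is ours and purely illustrative, an edge set $F$ is represented as a set of `(variable, node)` pairs; the function names are hypothetical.

```python
from collections import Counter

def is_k_bundle(F, S, k):
    """True if F has exactly k edges on each variable of S and no other edges."""
    per_var = Counter(x for x, _ in F)
    return set(per_var) <= set(S) and all(per_var[x] == k for x in S)

def reached(F):
    """The set of nodes reached by the edges in F."""
    return {v for _, v in F}

def congested_nodes(F, q):
    """Nodes that are q-congested with respect to F (more than q incident edges)."""
    load = Counter(v for _, v in F)
    return {v for v, edges in load.items() if edges > q}

# Three variables, two selected copies each; node 0 stores one copy of each.
F = {("a", 0), ("a", 1), ("b", 0), ("b", 2), ("c", 0), ("c", 3)}
```

Here `F` is a 2-bundle for the three variables, and node 0, with three incident edges, is 2-congested but not 3-congested.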
Recall that our simulations adopt the majority protocol, which requires that at least $c$ copies be accessed in order to complete a read or a write. Equivalently, in graph-theoretic terms, if we wish to access a set $S$ of variables, then we must select a $c$-bundle for $S$. Since the congestion of a $c$-bundle models the maximum number of physical copies that have to be accessed sequentially by some individual node in the underlying machine, it is desirable to access a $c$-bundle with low congestion. The existence of $c$-bundles of low congestion is intimately related to the expansion properties of the graph $G$.

[...] $(\alpha, 2c-1, c, 1/(2c-1))$-generalized expander $G = (U, V, E)$ with $|U| = m$, $|V| = n$, $\alpha = \Theta(1)$ and $c = \Theta(\log(m/n))$.
The graph of Theorem 6 will govern the memory organization of our simulations. We shall see that such a graph has the desirable property that every set $S \subseteq U$ of size at most $n$ has a $c$-bundle of low congestion. Moreover, this $c$-bundle can be constructed efficiently. Let $S \subseteq U$ be the set of variables to be accessed. The simulation algorithms described in the next sections construct a $c$-bundle for $S$ starting from $E(S)$ and applying a sequence of whittling steps. Each whittling "prunes" the set of edges by selecting $c$ edges apiece for some of the variables in $S$, and discarding the remaining $c-1$. At the beginning of a whittling step, a variable is said to be alive if the $c$ edges for the variable have not been selected yet, and dead otherwise. The sequence terminates when all variables are dead, at which point we are left with the desired $c$-bundle. The variables to whittle at each step are chosen to ensure that the degree of the final $c$-bundle will not exceed a fixed congestion $q$, whose value will be specified later.
For $i \ge 1$, let $S_i \subseteq S$ denote the set of live variables and $F_i \subseteq E(S)$ the residual set of edges at the beginning of the $i$-th whittling step. Initially, $S_1 = S$ and $F_1 = E(S)$. Conceptually, the $i$-th whittling step identifies a set $W_i$ of "congested" nodes and selects $c$ edges apiece for as many live variables as possible without touching nodes in $W_i$. More formally, we say that $x \in S_i$ is confined to $W_i$ under $F_i$ if $x$ has $c$ or more copies in $F_i$ stored in nodes of $W_i$. In the $i$-th whittling step, for each $x \in S_i$ which is not confined to $W_i$ under $F_i$, we select $c$ edges not incident on $W_i$ and remove the remaining $c-1$ from $F_i$. This operation will be referred to as whittling of $S_i$ with respect to $W_i$.
Definition 2 Let $S \subseteq U$, $|S| \le n$. A $q$-whittling sequence of length $t$ for $S$ is a sequence $(S_1, F_1, W_1), (S_2, F_2, W_2), \ldots, (S_t, F_t, W_t)$ such that
- $S_1 = S$, $F_1 = E(S)$, and $W_1 \subseteq V$ is a set of at most $(2c-1)n/q$ nodes including all nodes that are $q$-congested with respect to $E(S)$;
- for $i > 1$, $S_i \subseteq S_{i-1}$ and $F_i \subseteq F_{i-1}$ are, respectively, the set of live variables and the set of residual edges left after whittling $S_{i-1}$ with respect to $W_{i-1}$, and $W_i$ is the set of $q$-congested nodes with respect to $F_i$;
- $S_t = W_t = \emptyset$ and $F_t$ is a $c$-bundle for $S$.
Note that in the above definition each $W_i$, with $i > 1$, contains only nodes that are $q$-congested with respect to $F_i$, while $W_1$ may include an additional (small) number of uncongested nodes. The rationale behind this asymmetry will become clear later in the paper. It is also easy to see that $F_t$, the final $c$-bundle for $S$, has congestion at most $q$. The following lemma characterizes the rate at which the variables "die" during a whittling sequence.
Lemma 1 Let $S$ be a set of at most $n$ variables, and $q \ge 4c/\alpha = \Theta(\log(m/n))$. Then, for any $q$-whittling sequence of length $t$, [...]

On the other hand, being confined to $W_{i-1}$, all variables in $S_i$ have a $c$-bundle in $W_{i-1}$, and $|S_i| \le |S_2| \le n/(2c-1)$; hence, by the expansion properties of the memory map,

$$|W_{i-1}| \ge \alpha c\,|S_i|. \qquad (2)$$
The bound for $S_i$ follows by combining Inequalities (1) and (2) and applying the inductive hypothesis.
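The whittling process of Definition 2 can be sketched sequentially as follows. This is a minimal illustration under our own assumptions, not the parallel algorithm developed in the next sections: the memory map is given as a dictionary from each variable of $S$ to the set of nodes holding its $2c-1$ copies, $W_1$ contains only the $q$-congested nodes (the definition allows a few extra ones), the $c$ edges kept for a variable are simply the first $c$ in sorted order, and the function names are hypothetical. On a map that does not expand well enough the process may stall, which the sketch reports as an error.

```python
from collections import Counter

def congested_nodes(residual, q):
    """Nodes with more than q incident edges in the residual edge set."""
    load = Counter(v for nodes in residual.values() for v in nodes)
    return {v for v, edges in load.items() if edges > q}

def whittle(copies, c, q):
    """Return (bundle, steps): c selected nodes per variable, and the
    number of whittling steps performed."""
    residual = {x: set(nodes) for x, nodes in copies.items()}  # F_i
    live = set(copies)                                         # S_i
    steps = 0
    while live:
        W = congested_nodes(residual, q)                       # W_i
        # A live variable is confined to W_i if c or more of its copies lie in W_i.
        dying = [x for x in live if len(residual[x] & W) < c]
        if not dying:
            raise RuntimeError("no progress: the memory map does not expand enough")
        for x in dying:
            residual[x] = set(sorted(residual[x] - W)[:c])     # keep c edges avoiding W_i
            live.discard(x)
        steps += 1
    return residual, steps

# c = 2 (three copies per variable), q = 1.
copies = {"a": {0, 1, 2}, "b": {0, 3, 4}, "c": {1, 5, 6}}
bundle, steps = whittle(copies, c=2, q=1)
```

In this example nodes 0 and 1 are initially congested, so variable `a` is confined and survives the first step; once `b` and `c` have released their copies on those nodes, `a` can be whittled in the second step.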
3 Simulation on the Mesh
We now consider the simulation of an $(n,m)$-PRAM on a $\sqrt{n}\times\sqrt{n}$ mesh. In the subsequent section we will sketch the (relatively minor) modifications required to extend the simulation to meshes of higher dimensions. For notational convenience we will assume that $n$ is a power of four. Each mesh node simulates the activities of a distinct PRAM processor. Specifically, a node comprises a processor, capable of executing a standard repertoire of logical and arithmetic operations, and a local memory directly accessible to that processor alone. Each processor-memory node is connected to its immediate neighbours in the mesh by means of bidirectional wires capable of transmitting a single $O(\log m)$-bit quantity per step.
Recall that the distribution of the PRAM variables (set $U$) among the memory modules of the mesh (set $V$) is governed by an $(\alpha, 2c-1, c, 1/(2c-1))$-generalized expander $G = (U, V, E)$ with $|U| = m$ and $|V| = n$, $\alpha < 1$ ($\alpha$ constant), and $c = \Theta(\log(m/n))$. It is assumed for the moment that each mesh node holds a copy of a read-only table that encodes the structure of the memory organization. In other words, the $i$-th entry of the table records the locations of the copies of the $i$-th PRAM variable. This naive representation of the memory map requires $O(m\log(m/n))$ storage per node. In Section 6, we will show how this table may be represented in a distributed fashion, taking only $O((m/n)\log^3(m/n))$ storage per node. Note that the total storage of the simulating machine will then be only a polylogarithmic factor away from the size of the PRAM memory.
We simulate a PRAM program by separately simulating its individual steps. Consider the simulation of an arbitrary PRAM step, and let $S \subseteq U$ denote the set of variables accessed in the step. We assume that each processor accesses a distinct variable (i.e., $|S| = n$), thus restricting our attention to the simulation of the EREW PRAM. Standard techniques can be employed to extend the simulation to the other PRAM variants with no performance loss (e.g., see [HB94]). We assume that the address of the variable referenced by the $i$-th PRAM processor during the step is made known to the corresponding mesh processor at the start of the simulation. Each mesh processor creates a packet, referred to as a v-packet, containing (i) the processor's id; (ii) the name of the PRAM variable it wishes to access and the type of operation (read/write); (iii) a data field for the value to be read/written; and (iv) a bit vector of length $2c-1$, whose entries correspond to the copies of the variable and are initially set to zero. Let $\hat{S}$ denote the set of v-packets created by the mesh processors. The simulation consists of the following two phases.
Copy Selection Phase A $c$-bundle for $S$ of congestion at most $q = 4c/\alpha$ is selected. The bit vector of each packet in $\hat{S}$ is set to encode which copies of the corresponding variable are in the $c$-bundle and which are not.

Access Phase For each variable in $S$, a copy of the corresponding v-packet is routed to each mesh node containing a selected copy of that variable. Each mesh node performs the accesses relative to the packets it receives and, in the case of reads, sends the accessed values back to the requesting processors.
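The four fields of a v-packet listed above can be pictured as a small record. The following sketch is only one possible layout; the class name, field names and helper functions are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VPacket:
    processor_id: int                 # (i) id of the requesting processor
    variable: int                     # (ii) name of the PRAM variable ...
    is_write: bool                    #      ... and type of operation
    data: Optional[int]               # (iii) value to be read or written
    selected: list = field(default_factory=list)  # (iv) one bit per copy

def make_v_packet(processor_id, variable, is_write, c, data=None):
    """Create a v-packet whose bit vector has 2c-1 entries, all zero."""
    return VPacket(processor_id, variable, is_write, data, [0] * (2 * c - 1))

def mark_selected(packet, copy_indices):
    """Record in the bit vector which copies belong to the c-bundle."""
    for i in copy_indices:
        packet.selected[i] = 1

pkt = make_v_packet(processor_id=7, variable=42, is_write=False, c=3)
initial_bits = list(pkt.selected)
mark_selected(pkt, [0, 2, 4])        # outcome of the copy selection phase
```

After copy selection, exactly $c$ bits of the vector are set, and the access phase routes one copy of the packet to the node holding each marked copy.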
In the two subsections that follow we will describe the implementation of these two phases. Most of our algorithms will involve the movement or manipulation of sets of packets. In turn, such activities are based on two algorithmic primitives: $k$-sorting and $k$-relation routing. Given a set of packets distributed among the processors of an $n'$-node mesh, so that each node holds at most $k$ packets, $k$-sorting is used to rearrange the packets so that node 1 holds the packets with the $k$ smallest keys, node 2 holds the next $k$ smallest keys, and so on. For any value of $k$ and any numbering of the nodes, this can be accomplished in $O(k(\log k + \sqrt{n'}))$ time on the two-dimensional mesh, where the term $k\log k$ arises from the need to initially sort the groups of $k$ keys locally at each node. We remark that in our simulation algorithm, $k$-sorting (with nonconstant $k$) is always applied for $k = O(\log(m/n))$ and $n' = \Theta(n/\log(m/n))$. Therefore, for values of $m$ within the bound stated in Theorem 1, the time for the initial local sorting will never dominate the overall running time. In other words, in our scenario $k$-sorting will always require $O(k\sqrt{n'})$ time. The $k$-relation routing problem involves routing a set of packets subject to the constraint that no node is the source or destination of more than $k$ packets. Again, this can be done in $O(k\sqrt{n'})$ time on an $n'$-node mesh. Both algorithms can be found in [Kun93].
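To fix the meaning of the two primitives, the following sequential reference describes what they compute, not how the mesh algorithms of [Kun93] compute it. The function names and data layout (one list of keys per node, packets as source-destination pairs) are our own illustrative assumptions.

```python
from collections import Counter

def k_sort(nodes, k):
    """Sequential reference for the result of k-sorting.
    nodes[j] is the list of keys held by node j (at most k each); in the
    result, node 0 holds the k smallest keys, node 1 the next k, and so on."""
    assert all(len(held) <= k for held in nodes)
    keys = sorted(key for held in nodes for key in held)
    return [keys[j * k:(j + 1) * k] for j in range(len(nodes))]

def is_k_relation(packets, k):
    """True if no node is the source or the destination of more than k packets.
    packets is a list of (source, destination) pairs."""
    sources = Counter(src for src, _ in packets)
    destinations = Counter(dst for _, dst in packets)
    return (max(sources.values(), default=0) <= k
            and max(destinations.values(), default=0) <= k)

sorted_nodes = k_sort([[9, 1], [4], [7, 3], []], k=2)
packets = [(0, 1), (0, 2), (1, 2), (3, 2)]   # node 2 receives three packets
```

The packet set above is a 3-relation but not a 2-relation, because node 2 is the destination of three packets.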
3.1 Copy Selection Phase
The copy selection is accomplished by performing the whittlings implied by a $q$-whittling sequence $\{(S_i, F_i, W_i) : 1 \le i \le t\}$, for some $t = O(\log n/\log\log(m/n))$, as defined in Section 2. Note that each whittling could be easily performed by representing the copies of the live variables as individual packets and then employing sorting, prefix and routing in order to select the packets relative to the copies to be included in the $c$-bundle. However, the $n$ variables would initially account for $(2c-1)n$ packets, and manipulating such a large set of packets can be expensive: for example, just sorting the packets would require $O(c\sqrt{n})$ time, which is too costly for our purposes. For this reason, the copy selection phase is broken into two stages. The first stage performs the first whittling step, using more complex but faster techniques. Since only a relatively small number of variables (less than $n/(2c-1)$) will participate in the second stage, the remaining whittlings can be implemented using the simple technique described above.
Stage 1 In the first stage, we perform the whittling of $S_1 = S$ with respect to a certain set $W_1$ (specified later) which includes all nodes that are $q$-congested with respect to $F_1 = E(S)$, with $q = 4c/\alpha$. In what follows, we will regard the mesh as being partitioned into $s$ disjoint submeshes of size $\sqrt{n/s}\times\sqrt{n/s}$, which we will call cells. The quantity $s$, which we assume to be a power of four greater than $\log c$, will be determined by the analysis.
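Recall that $\hat{S}$ denotes the set of v-packets relative to the $n$ variables to be accessed. The stage consists of the following steps.

1. Create a replica $\hat{S}_C$ of $\hat{S}$ in each cell $C$ of the mesh.

2. Within each cell $C$:
(a) Compute the degree of each v-packet in $\hat{S}_C$, that is, the number of copies of its variable stored in $C$, and determine whether the total degree of the cell is less than $q(n/s)$.
(b) If this is the case, generate a c-packet for each copy of a variable in $S$ residing in $C$.
(c) Determine how many competitors each c-packet has. A c-packet with $q$ or fewer competitors is deemed accessible, and inaccessible otherwise.

3. For those variables with $c$ or more accessible c-packets, select $c$ copies and update the bit-vector in the appropriate v-packet in $\hat{S}$ to reflect which copies have been selected. The bit-vectors of all other v-packets should remain unchanged, i.e. all bits should remain at zero.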
We can easily see that the first stage executes the whittling of $S$ with respect to a set $W_1$ which comprises all mesh nodes that store more than $q$ copies of variables in $S$ (i.e., the $q$-congested nodes), plus those nodes belonging to cells with total degree at least $q(n/s)$.
Therefore, the nodes in $W_1$ account for at least $q|W_1|$ copies. Since there are $(2c-1)n$ copies of variables, we must have $|W_1| \le (2c-1)n/q$, as required by the definition of $q$-whittling sequence.

For Step 1, we regard the mesh as recursively decomposed into quadrants, subquadrants and so on, down to submeshes of size $\sqrt{n/s}\times\sqrt{n/s}$, i.e. into individual cells. The replication of $\hat{S}$ in the individual cells reflects this recursive decomposition of the mesh. First, the set $\hat{S}$ is replicated in each of the four quadrants. This is achieved by having the four mesh nodes that occupy the same relative position in the four quadrants send each other a copy of the packets they hold. At this point, each node holds four packets and each quadrant holds a replica of $\hat{S}$. Each quadrant independently replicates its copy of $\hat{S}$ into its four constituent subquadrants in the same manner, and so on. At the start of the $i$-th iteration of this process, each node of a $\sqrt{n/4^{i-1}}\times\sqrt{n/4^{i-1}}$ submesh holds $4^{i-1}$ packets that it must send to three other nodes within this submesh. This is a simple routing pattern in which each node is the source and destination of $3\cdot 4^{i-1}$ packets (i.e., an instance of the $3\cdot 4^{i-1}$-relation routing problem), and so can be completed in $O(4^{i-1}\sqrt{n/4^{i-1}}) = O(2^{i-1}\sqrt{n})$ time. Summing over the $\log_4 s$ iterations needed to reach the cells, Step 1 takes $O(\sqrt{ns})$ time.

Step 2.a involves calculating the degree of each v-packet, hence requires $O(c)$ time per packet and $O(sc)$ time per node, since each node holds $s$ packets. The total cell degree for $C$ can easily be computed within each cell from the individual degrees in $O(\sqrt{n/s})$ time.
A straightforward implementation of Step 2.b, where each node generates the appropriate number of c-packets for each v-packet it holds, would result in an unbalanced distribution of the c-packets among the nodes, which would in turn make the execution of subsequent steps more expensive. Therefore, we resort to the following more careful implementation of this step. Within each cell $C$, partition $\hat{S}_C$ into degree classes $\hat{S}_C^{(i)}$ for $0 \le i < \lceil\log(2c-1)\rceil$, where $\hat{S}_C^{(i)}$ contains v-packets of variables with at least $2^i$ and at most $2^{i+1}-1$ copies in $C$.
(We ignore variables of degree zero and note that no v-packet can have degree greater than $2c-1$.) Redistribute the v-packets of each degree class so that each node of $C$ receives at most $\lceil |\hat{S}_C^{(i)}|\,s/n \rceil$ v-packets in Class $i$. This redistribution can be performed by first determining the destinations of packets in all classes using $\lceil\log(2c-1)\rceil$ prefix operations (one per degree class), and then invoking a routing step within $C$ in which each node is the source of at most $s$ packets and the destination of at most $\sum_i \lceil |\hat{S}_C^{(i)}|\,s/n \rceil \le s + \lceil\log(2c-1)\rceil$ packets. This requires overall time $O(\sqrt{n/s}\,\log c + \sqrt{ns}) = O(\sqrt{ns})$.
After the above redistribution of the v-packets, each node examines each v-packet it holds and generates the appropriate number of c-packets for the corresponding variable. Since each node receives no more than $\lceil |\hat{S}_C^{(i)}|\,s/n \rceil$ v-packets from the $i$-th degree class and each such packet may result in up to $2^{i+1}-1$ c-packets, it follows that each node may hold up to $\sum_i \lceil |\hat{S}_C^{(i)}|\,s/n \rceil (2^{i+1}-1) = O(q)$ c-packets, because the total degree of the cell is less than $q(n/s)$. Putting it all together, Step 2.b can be completed in $O(\sqrt{ns} + q)$ time.
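The balanced implementation of Step 2.b can be sketched as follows. This is a sequential illustration under our own assumptions: the class index of a v-packet is computed from its degree, and a round-robin assignment within each class stands in for the prefix computation and routing step; the function names are hypothetical.

```python
from collections import defaultdict

def degree_class(degree):
    """Index i such that 2**i <= degree <= 2**(i+1) - 1."""
    return degree.bit_length() - 1

def redistribute(degrees, cell_nodes):
    """degrees maps each v-packet with nonzero degree in the cell to that degree.
    Returns, for each node of the cell, the list of v-packets assigned to it."""
    classes = defaultdict(list)
    for packet, degree in sorted(degrees.items()):
        classes[degree_class(degree)].append(packet)
    assignment = [[] for _ in range(cell_nodes)]
    for members in classes.values():
        # The rank within the class plays the role of the prefix sum.
        for rank, packet in enumerate(members):
            assignment[rank % cell_nodes].append(packet)
    return assignment

degrees = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 3, "f": 5}
assignment = redistribute(degrees, cell_nodes=2)
```

With a cell of $n/s$ nodes, round-robin assignment gives each node at most $\lceil |\hat{S}_C^{(i)}|\,s/n \rceil$ packets of class $i$, which is the balance property used in the analysis.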
Turning to Step 2.c, we can count the competitors of each packet within each cell by first sorting the c-packets by target, to group competitors together, and then using parallel prefix to determine which packets are accessible and which are inaccessible. If a c-packet is deemed accessible, then the appropriate entry in the bit vector of the corresponding v-packet is set. Since each node initially holds $O(q)$ packets and the cell has diameter $\sqrt{n/s}$, this step requires $O(q\sqrt{n/s})$ time.
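The sort-then-count structure of Step 2.c can be illustrated sequentially. In the sketch below we assume that the competitors of a c-packet are the c-packets with the same target node, and that the count includes the packet itself; both points are our reading of the description above, and the function name and the triple representation of a c-packet are hypothetical.

```python
from itertools import groupby

def accessible_packets(c_packets, q):
    """c_packets is a list of (variable, copy_index, target_node) triples.
    Returns the set of (variable, copy_index) pairs deemed accessible."""
    by_target = sorted(c_packets, key=lambda p: p[2])    # the sorting step
    accessible = set()
    for _, group in groupby(by_target, key=lambda p: p[2]):
        group = list(group)
        if len(group) <= q:                              # the counting step
            accessible.update((x, i) for x, i, _ in group)
    return accessible

# Three c-packets compete for node 5, one is alone on node 7.
c_packets = [("a", 0, 5), ("b", 1, 5), ("c", 0, 5), ("a", 1, 7)]
with_q2 = accessible_packets(c_packets, q=2)
with_q3 = accessible_packets(c_packets, q=3)
```

On the mesh, the sort is a $k$-sorting with $k = O(q)$ and the group sizes are obtained by a parallel prefix over the sorted sequence.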
As for Step 3, we gather the accessibility information gained for each variable in different cells by "coalescing" the replicas of the v-packets created in Step 1. When two replicas of a v-packet are coalesced, a single v-packet results, whose bit-vector is obtained as the bitwise OR of the two bit-vectors associated with the replicas. The structure of the gathering process follows the reverse pattern of the replication process of Step 1, and is executed in the same time. At the end of the process, each processor checks the bit-vector of its v-packet. If the bit-vector contains $c$ or more 1-bits, then $c$ of these, chosen arbitrarily, are retained (corresponding to a $c$-bundle), while the others are reset to 0. Otherwise, all bits are reset to 0.
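The coalescing of Step 3 amounts to a bitwise OR followed by a threshold test. The sketch below is ours: it follows the rule given in the list of steps (a variable is selected when $c$ or more of its copies are accessible), takes "chosen arbitrarily" to mean the first $c$ set bits, and uses hypothetical function names.

```python
def coalesce(bit_vectors):
    """Bitwise OR of the bit-vectors carried by the replicas of one v-packet."""
    return [int(any(bits)) for bits in zip(*bit_vectors)]

def finalize(bit_vector, c):
    """Keep c of the 1-bits if at least c are set; otherwise clear the vector."""
    ones = [i for i, bit in enumerate(bit_vector) if bit]
    if len(ones) < c:
        return [0] * len(bit_vector)
    keep = set(ones[:c])              # arbitrary choice: the first c set bits
    return [int(i in keep) for i in range(len(bit_vector))]

# Replicas of the same v-packet coming back from three cells (c = 3).
replicas = [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 1]]
merged = coalesce(replicas)
selected = finalize(merged, c=3)
rejected = finalize([1, 0, 0, 0, 1], c=3)
```

A variable whose vector is cleared remains alive and takes part in the second stage.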
Stage 2 Since $|S_2| \le n/(2c-1)$, the live variables account for at most $n$ copies, i.e. at most one per mesh node. As a consequence, we can perform the remaining $t-1$ whittlings using standard 1-sorting, prefix and 1-routing primitives without exceeding our target time performance. In fact, in each whittling step the number of live variables decreases geometrically, thus we can execute the whittlings within smaller and smaller submeshes so that the cost of the aforementioned primitives also decreases geometrically.
Let $S_i$ denote the live variables at the beginning of the $i$-th whittling step, and recall that [...] (For concreteness, we will assume that $M_{i+1}$ occupies the lower left-hand corner of $M_i$.) Note that $M_2$ is the entire mesh. Stage 2 consists of $t-1$ iterations numbered, for convenience, from 2 to $t$. Iteration $i$ performs the $i$-th whittling step in the whittling sequence and is executed entirely within $M_i$. In particular, for $i \ge 2$, at the beginning of Iteration $i$ the v-packets corresponding to the live variables (set $S_i$) are distributed among the nodes of $M_i$, with at most one v-packet per node. At the end of the iteration, the v-packets corresponding to dead variables in $S - S_{i+1}$ are evenly distributed among the nodes outside $M_{i+1}$ (for notational convenience, we assume that $M_{t+1}$ is an "empty" mesh). Iteration $i$ is implemented as follows.
1. For each variable in $S_i$ create $2c-1$ c-packets and distribute the c-packets among the nodes of $M_i$, assigning at most one packet to each node.
2. For each c-packet, determine how many competitors it has. A c-packet with at most q competitors is said to be accessible and is said to be inaccessible otherwise.
3. For each variable $x$ in $S_i$ determine how many of the associated c-packets are accessible. If $c$ or more are accessible, set the bit-vector positions in $x$'s v-packet corresponding to the first $c$ accessible copies of that variable; otherwise reset all bits in the bit-vector for $x$ to zero. In the former case $x$ becomes dead, while in the latter $x$ is alive and will belong to $S_{i+1}$.
4. Delete all c-packets. Route the v-packets corresponding to the dead variables in $S_i$ to [...]

Once the sequence of whittlings has been completed, the information on which copies have been selected is encoded in the bit-vectors of the various v-packets that lie scattered among the nodes of the mesh, with at most 2 packets per node (at most one from Stage 1 and at most one from Stage 2). Finally, the v-packets are sent back to their originating processors in $O(\sqrt{n})$ time.
Combining the contributions of the two stages, we obtain the overall running time for the copy selection phase.

By combining Theorem 7 with Theorem 8 we establish that any step of an $(n,m)$-PRAM can be simulated with slowdown $O(\sqrt{n\log(m/n)})$ on a $\sqrt{n}\times\sqrt{n}$ mesh, using $O(\log(m/n))$ copies per variable. Section 6 shows how the read-only tables encoding the memory map held by each node may be replaced by a space-efficient, distributed representation using only $O((m/n)\log^3(m/n))$ storage per node, thus completing the proof of Theorem 1 as it relates to two-dimensional meshes.
4 Higher-Dimensional Meshes
The overall structure of the simulation scheme for the two-dimensional mesh described in Section 3 also applies to higher-dimensional meshes. Specifically, we adopt the same memory organization and pick the same values for the parameters $s$ and $q$. In what follows, we sketch how to implement the individual steps of the copy selection and access phases efficiently on an $n^{1/d}\times n^{1/d}\times\cdots\times n^{1/d}$ $d$-dimensional mesh (called, for short, an $n$-node $d$-mesh) with constant $d$. As for the case of the two-dimensional mesh, we recall that for $m \in 2^{O(n^{1/(d+1)})}$, $k = O(\log(m/n))$ and $n' = \Theta(n/\log(m/n))$, the $k$-sorting problem and the $k$-relation routing problem may both be solved in time $O(k(n')^{1/d})$ on an $n'$-node $d$-mesh [Kun93].
Let us first consider the first stage of copy selection, and interpret a cell $C$ as an $(n/s)$-node $d$-submesh.
Step 1 prescribes that the set $\hat{S}$ be replicated in each cell. We adopt the same recursive approach developed for $d = 2$ and perform the replication in substeps. At the $i$-th [...] Step 3 is accomplished in the same time by performing the routings of Step 1 in the reverse order. Let us finally consider Step 2. In this step, the nodes of each cell $C$ first determine whether the cell degree is less than $q(n/s)$ (Step 2.a). If this is the case, a balanced collection of c-packets relative to the copies of variables in $S$ residing in $C$ is generated (Step 2.b) and finally checked for accessibility (Step 2.c). The operations involved are $\lceil\log(2c-1)\rceil$ prefix computations as well as a constant number of $s$-sorting, $s$-relation routing and $q$-sorting operations, plus additional $O(q+s)$ work per node. By plugging in the chosen values for $s$ and $q$ we see that Step 2 of copy selection can be completed in time

$$O\left(sc + n^{1/d}s^{1-1/d} + q(n/s)^{1/d}\right) = O\left(n^{1/d}(\log(m/n))^{1-1/d}\right). \qquad (3)$$
Since this subsumes the cost of Steps 1 and 3, this expression also captures the cost of the entire first stage of copy selection. In the second stage, we perform the i-th whittling entirely within an $n/(2c-1)^{i-2}$-node d-submesh. As in the two-dimensional case, this whittling can be easily implemented by means of 1-sorting and parallel prefix. Since the size of the submeshes is geometrically decreasing, the overall running time is dominated by the time for i = 2, which is $O(n^{1/d})$. Hence, the entire copy selection algorithm can be completed within the time given by Equation 3.
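The domination claim for the second stage can be made explicit. As a sketch, assume that the i-th whittling costs at most $\alpha$ times the d-th root of the size of its submesh (the constant $\alpha$ is a name introduced here for the hidden constant, not a symbol of the original) and that $c \ge 2$, so that $2c - 1 \ge 3$. Summing over all whittlings then gives a geometric series:

```latex
\sum_{i \ge 2} \alpha \left( \frac{n}{(2c-1)^{i-2}} \right)^{1/d}
  \;=\; \alpha\, n^{1/d} \sum_{j \ge 0} (2c-1)^{-j/d}
  \;=\; \frac{\alpha\, n^{1/d}}{1 - (2c-1)^{-1/d}}
  \;\le\; \frac{\alpha\, n^{1/d}}{1 - 3^{-1/d}}
  \;=\; O\!\left(n^{1/d}\right)
  \qquad \text{for constant } d .
```

The infinite sum is an upper bound on the finite number of whittlings actually performed, and the denominator is a positive constant because d is constant.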
Finally, recall that Steps 1, 2.a, 3 of the access phase mirror Steps 1, 2.b, 3 of copy selection, hence they require time $O\left(n^{1/d} (\log(m/n))^{1-1/d}\right)$ altogether. The remaining steps can be realized as two instances of an $(s + q)$-relation routing in each cell plus $O(q)$ work per node. Therefore, the total time required by the access phase is again $O\left(n^{1/d} (\log(m/n))^{1-1/d}\right)$.
The above discussion establishes that any step of an $(n, m)$-PRAM with $m \in (n^{1/(d+1)})$ can be simulated with slowdown $O\left(n^{1/d} (\log(m/n))^{1-1/d}\right)$ on a d-dimensional mesh (with constant d) using $O(\log(m/n))$ copies per variable. In Section 6 we will show how the space requirements per node may be reduced to $O\left((m/n) \log^3(m/n)\right)$ storage per node. This will complete the proof of Theorem 1.
The Pruned Butterfly
An n-leaf fat-tree is a routing network whose coarse structure resembles that of an n-leaf binary tree. More specifically, leaves correspond to processing elements, non-leaf nodes represent clusters of routing switches and edges represent communication channels of bandwidth that increases from the leaves to the root. The first architecture of this kind was proposed by Leiserson [Lei85], and was later followed by a number of related networks differing in the detail of how components at different levels of the tree are interconnected. In this paper we adopt the pruned butterfly fat-tree of Bay and Bilardi [BB95], an example of which is illustrated in Figure 1. Each dotted ellipse in the figure identifies a cluster of routing switches that collectively correspond to a single node in the binary tree. The bundle of edges joining switches of a cluster to switches of its parent cluster constitutes a channel whose bandwidth equals the cardinality of the bundle. The depth of a cluster is the distance of its component switches from the root. At any given depth, the clusters are numbered from left to right, beginning at zero. Individual switches within each cluster are also numbered from left to right, beginning at zero.
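Assume that n is a power of four. Formally, an n-leaf pruned butterfly is a graph $G = (V, E)$ with vertex set
$$V = \left\{ \langle i, j, k \rangle : 0 \le i \le \log n,\ 0 \le j < 2^i,\ 0 \le k < \sqrt{n}\, 2^{-\lfloor i/2 \rfloor} \right\}.$$
With the above indexing, the j-th processor-memory node from left to right corresponds to vertex $\langle \log n, j, 0 \rangle$. For $0 \le i < \log n$, vertex $\langle i, j, k \rangle$ corresponds to the k-th switch of the j-th cluster at depth i of the tree. The set of edges E is defined as follows:
$$E = \bigcup_{1 \le i \le \log n,\ 0 \le j < 2^i} E_{ij},$$
where $E_{ij}$ contains the edges connecting switches in the j-th cluster at depth i to those in its parent cluster. When i is even, we let
$$E_{ij} = \left\{ (\langle i, j, k \rangle, \langle i-1, \lfloor j/2 \rfloor, k \rangle),\ (\langle i, j, k \rangle, \langle i-1, \lfloor j/2 \rfloor, k + \sqrt{n}\, 2^{-i/2} \rangle) : 0 \le k < \sqrt{n}\, 2^{-i/2} \right\}.$$
When i is odd, we let
$$E_{ij} = \left\{ (\langle i, j, k \rangle, \langle i-1, \lfloor j/2 \rfloor, k \rangle) : 0 \le k < \sqrt{n}\, 2^{-\lfloor i/2 \rfloor} \right\}.$$
Note that $|E_{ij}| = \sqrt{n}\, 2^{1 - \lceil i/2 \rceil}$; therefore channel bandwidths double every other level from the leaves to the top, ranging from 2 to $\sqrt{n}$.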
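The definition above can be checked mechanically on small instances. The following Python sketch builds the vertex set and the edge sets $E_{ij}$ for a given power of four n, so that the stated channel bandwidths can be verified directly. The function name `pruned_butterfly` and the encoding of vertices as tuples are illustrative choices made here, not part of the original construction.

```python
def pruned_butterfly(n):
    """Return (vertices, edges) for an n-leaf pruned butterfly, n a power of four.

    vertices is a set of triples (i, j, k); edges maps each cluster (i, j),
    i >= 1, to the set E_ij of (child, parent) vertex pairs.
    """
    log_n = n.bit_length() - 1
    if n < 1 or n != 1 << log_n or log_n % 2:
        raise ValueError("n must be a power of four")
    root_n = 1 << (log_n // 2)

    def width(i):
        # Switches per cluster at depth i: sqrt(n) * 2^(-floor(i/2)).
        return root_n >> (i // 2)

    vertices = {(i, j, k)
                for i in range(log_n + 1)
                for j in range(1 << i)
                for k in range(width(i))}

    edges = {}
    for i in range(1, log_n + 1):
        for j in range(1 << i):
            parent = (i - 1, j // 2)
            e_ij = set()
            for k in range(width(i)):
                # Every switch is wired to the same-numbered switch of its parent.
                e_ij.add(((i, j, k), parent + (k,)))
                if i % 2 == 0:
                    # Even depth: a second edge, offset by the cluster width.
                    e_ij.add(((i, j, k), parent + (k + width(i),)))
            edges[(i, j)] = e_ij
    return vertices, edges
```

For n = 16 the sketch yields 52 vertices, and the channels at depths 1 through 4 contain 4, 4, 2 and 2 edges respectively, in agreement with $|E_{ij}| = \sqrt{n}\, 2^{1 - \lceil i/2 \rceil}$.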
G is interpreted as an n-node machine by identifying the n processor-memory nodes with the n leaves, a routing switch with each internal vertex, and a wire capable of transmitting a single packet along its length in unit time with each edge. Moreover we assume that each routing switch is provided with an adder, so that parallel prefix computations may be completed efficiently. Intuitively, to route a message from leaf i to leaf j, the message is routed upwards in the tree to the cluster that is the least common ancestor of i and j and thence downwards to its destination.
This architecture has a number of interesting properties. For example, it is a subgraph of the butterfly network and it embeds the n-leaf mesh of trees architecture with constant dilation and load. Furthermore, the original bit-serial formulation of the pruned butterfly presented by Bay and Bilardi is area universal: the n-leaf pruned butterfly can be laid out in $O(n \log^2 n)$ area and can route any set of messages almost as efficiently as any circuit of similar area. (See [BB95] and the references contained therein for a fuller discussion of area-universality and the capabilities and properties of the pruned butterfly.) An important routing property of the n-leaf pruned butterfly is the following. Consider a collection of $k \le \sqrt{n}$ packets, stored one per node among the leaves, with the i-th packet residing at leaf $\langle \log n, s_i, 0 \rangle$ and destined to leaf $\langle \log n, d_i, 0 \rangle$. We refer to the collection as a wave if $s_1 < s_2 < \cdots < s_k$ and $d_1 < d_2 < \cdots < d_k$. In [BB95] it is shown that any wave can be easily routed in $O(\log n)$ time. Moreover, a sequence of t waves may be routed in a pipelined fashion in time $O(t + \log n)$.
Sorting and Routing on the Pruned Butterfly
In this subsection, we develop algorithms for k-sorting and k-relation routing on the pruned butterfly, which will be needed for the shared memory simulation.
Lemma 4 Any instance of k-sorting can be performed in $O(k(\log k + \sqrt{n}))$ time on the n-leaf pruned butterfly.
Proof: We will consider the input packets sorted when the k packets with the smallest keys are in the 0-th leaf, the next k smallest occupy the 1-st leaf, and so on. When k = 1, let $x_0, x_1, \ldots, x_{n-1}$ be the n-tuple of variables that we wish to sort, with $x_i$ residing at the i-th leaf. The bitonic sorting algorithm is structured as a sequence of $\log n$ merging phases. During the i-th phase, for $1 \le i \le \log n$, distinct pairs of sorted subsequences of length $2^{i-1}$ are merged into subsequences of length $2^i$. In turn, the i-th phase is made of (i, j)-stages, for $j = i-1, i-2, \ldots, 1, 0$. An (i, j)-stage executes as follows:
for all $0 \le k \le n-1$ do in parallel: if $[k]_j = 0$ then Oper$(k, k + 2^{j-1}, i, j)$.
In the above code, $[k]_j$ denotes the j-th bit in the binary representation of k, while Oper$(k, k + 2^{j-1}, i, j)$ denotes a compare-exchange operation applied to the variables $x_k$ and $x_{k+2^{j-1}}$. The "orientation" of the exchange depends on i and j. Note that the leaves containing these two variables fall within the same subtree of $2^j$ leaves. Thus, any (i, j)-stage can be performed within such subtrees. We now describe the implementation of an (i, j)-stage for the subtree with leaves $0, \ldots, 2^j - 1$, and note that the same algorithm can be executed simultaneously within each $2^j$-leaf subtree.
1. Transfer the values of $x_0, x_1, \ldots, x_{2^{j-1}-1}$ (residing at distinct leaves of the left subtree) to the right subtree, so that leaf $k + 2^{j-1}$ holds both $x_k$ and $x_{k+2^{j-1}}$, for $0 \le k \le 2^{j-1} - 1$.

Recall that a k-relation routing problem involves routing a set of packets from source to destination subject to the constraint that no node is the source or destination of more than k packets. We have:
Lemma 5 Any instance of the k-relation routing problem may be routed in $O(k(\log k + \sqrt{n}))$ time on the n-leaf pruned butterfly.
Proof: Let $\hat{S}$ denote the set of packets to be routed. The routing algorithm is made of the following steps:
1. Sort the packets in $\hat{S}$ by their destination among the n leaves of the pruned butterfly.
2. For $0 \le i < k\sqrt{n}$, route the packets whose rank in the sorted sequence is equal to $i \bmod k\sqrt{n}$.
By Lemma 4, Step 1 above requires time $O(k(\log k + \sqrt{n}))$. As for Step 2, it is easy to see that each of the $k\sqrt{n}$ routings is a wave. Therefore, all iterations can be pipelined and completed in time $O(k\sqrt{n})$. Thus, the entire routing algorithm can be completed in $O(k(\log k + \sqrt{n}))$ time.
□
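The claim that each routing of Step 2 is a wave can be checked on small inputs. The Python sketch below simulates the two steps sequentially: it sorts the packets by destination, places the packet of rank r at leaf r // k (the sorted layout assumed in the proof of Lemma 4), and groups the packets by rank modulo $k\sqrt{n}$. The function names and the representation of a packet as a (source, destination) pair are assumptions made for illustration; nothing here models the actual timing of the network.

```python
import math
import random

def is_wave(packets, n):
    """True if the (source, destination) pairs form a wave on n leaves."""
    if len(packets) > math.isqrt(n):
        return False
    ordered = sorted(packets)  # order the packets by source leaf
    sources = [s for s, _ in ordered]
    dests = [d for _, d in ordered]
    increasing = lambda xs: all(a < b for a, b in zip(xs, xs[1:]))
    return increasing(sources) and increasing(dests)

def split_into_waves(packets, n, k):
    """Group a k-relation into k*sqrt(n) routings, following Steps 1 and 2."""
    # Step 1: sort by destination; the packet of rank r then sits at leaf r // k.
    dests = sorted(d for _, d in packets)
    placed = [(rank // k, d) for rank, d in enumerate(dests)]
    # Step 2: the i-th routing takes the ranks congruent to i mod k*sqrt(n).
    period = k * math.isqrt(n)
    return [placed[i::period] for i in range(period)]

def random_k_relation(n, k, seed=0):
    """Hypothetical test input: k random permutations, so every leaf is the
    source of exactly k packets and the destination of exactly k packets."""
    rng = random.Random(seed)
    packets = []
    for _ in range(k):
        perm = list(range(n))
        rng.shuffle(perm)
        packets.extend(enumerate(perm))
    return packets
```

Two packets in the same group have ranks that differ by at least $k\sqrt{n} \ge k$, so they sit at different leaves after sorting and, since no destination receives more than k packets, they are bound for different destinations. This is why `is_wave` is expected to hold for every group returned by `split_into_waves`.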

The Simulation Algorithm
A closer look at the simulation algorithm devised for the two-dimensional mesh reveals that both the copy selection and access phases are implemented in terms of k-sorting, k-relation routing, prefix computations, and rely upon a recursive decomposition of the network into subnetworks of the same topology. Note that the pruned butterfly exhibits such a decomposition. Specifically, for n and $s \le n$ arbitrary powers of two, an n-leaf pruned butterfly can be decomposed into s $(n/s)$-leaf pruned butterflies. Since the complexity of routing and sorting are asymptotically the same for the pruned butterfly and the mesh, we conclude that any step of an $(n, m)$-PRAM can be simulated with slowdown $O\left(\sqrt{n \log(m/n)}\right)$ on an n-leaf pruned butterfly using $O(\log(m/n))$ copies per variable. The techniques of the next section show how the space requirement per node may be limited to $O\left((m/n) \log^3(m/n)\right)$.
6 Space-Efficient Simulations
The simulations presented in the paper are based on a memory organization whose structure is modelled by a bipartite graph $G = (U, V; E)$, with $|U| = m$, $|V| = n$, and where every vertex in U has degree $2c - 1 = \Theta(\log(m/n))$. This graph may be represented by means of a read-only table $T_G = [t_1, t_2, \ldots, t_m]$ consisting of m entries, where the i-th entry $t_i = (t_i(1), t_i(2), \ldots, t_i(2c-1))$ contains the addresses of the copies of the i-th variable. We call each such address an item. In this section, we show that such a table may be represented in a distributed fashion among the nodes of the simulating network, so that the maximum number of items stored per node is $O\left((m/n) \log^3(m/n)\right)$ and that any n-tuple of entries (corresponding to the variables to be accessed in the PRAM step) may be read in time proportional to the slowdown of the simulation step. We sketch the required techniques for the two-dimensional mesh, which are akin to those presented in [Her96], though somewhat simpler. The result extends immediately to the other interconnections considered in the paper.
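To make the shape of the table $T_G$ concrete, the following Python sketch builds one with m entries of $2c - 1$ items each. It assumes, purely for illustration, that the copies of each variable are placed on distinct nodes chosen uniformly at random and that an item is just a node index; the simulations themselves rely on a generalized expander, and a random graph is only a stand-in for one. The names `build_memory_map` and `copies_per_node` are hypothetical.

```python
import random

def build_memory_map(m, n, c, seed=0):
    """Return a table with m entries; entry i lists the 2c-1 nodes holding
    the copies of variable i (random placement, an assumption of this sketch)."""
    copies = 2 * c - 1
    if copies > n:
        raise ValueError("need at least 2c-1 nodes to hold distinct copies")
    rng = random.Random(seed)
    return [tuple(rng.sample(range(n), copies)) for _ in range(m)]

def copies_per_node(table, n):
    """Count how many copies each of the n nodes stores under the given table."""
    load = [0] * n
    for entry in table:
        for node in entry:
            load[node] += 1
    return load
```

Stored naively, the whole table has $m(2c - 1)$ items; the rest of this section is concerned with distributing them so that no node holds more than the stated bound.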
Let $n' \le n$ be a parameter to be fixed later. Partition the $\sqrt{n} \times \sqrt{n}$ mesh into $n/n'$ tiles of size $(m/n')c^2$ items per node. Within each node, the individual items are held in a static dictionary in order to facilitate retrieval.
Note that there are a total of $(n/n')(2c' - 1)$ copies of each entry and that to read an entry it suffices to read any one copy. Note also that the structure of the graph H can be represented by means of a read-only table $T_H$ of $m/b$ entries. This latter table is replicated and represented in every block in the network, with each node of each block holding $m/b^2$ entries of $T_H$.
To read an n-tuple of entries of $T_G$, each tile deals locally with the reads relating to its own nodes, independently of other tiles, by executing the following steps.
1. Generate a set S containing $2c' - 1$ numbered request packets $r_1(x), r_2(x), \ldots, r_{2c'-1}(x)$ for each referenced entry x. Packet $r_i(x)$ bears the name of the processor that generated it, the entry to which it refers, and the name of the block that contains the i-th copy of that entry within the tile in question.
5. Route each packet back to the node that generated it.
Notice that the $c' = 2c - 1$ selected packets relating to entry x are ultimately returned to the node that generated them, each bearing the value of a distinct item of that entry. In order to discover the locations of the various copies of the entries, which are needed to generate packets during Step 1, the nodes need to query $T_H$. Since each block maintains a private copy of this table and each block generates $b(2c' - 1)$ request packets, this operation can be accomplished in the same fashion as that outlined for Step 4 and has the same $O(bc')$ running time. Steps 3 and 5 involve $c'$-relation routing within an $n'$-node tile so these contribute $O\left(c' \sqrt{n'}\right)$ to the running time. As for Step 2, note that for each page of $T_G$ the number of entries referenced may be up to b, the page size. For a particular tile, let $P_i$ denote the set of pages where the number of referenced entries lies in the interval $[2^i, 2^{i+1})$. Clearly, $\sum_{i=0}^{\log b} 2^i |P_i| \le 2n'$. Since H is a generalized expander, it is possible to construct a $c'$-bundle for the pages in each $P_i$ that has degree $O\left((|P_i|/(n'/b))\, c'\right)$. Each edge in such a bundle corresponds to at most $2^{i+1}$ request packets, and so the total number of selected packets over all the $P_i$ is at most
Lower Bound
In this section, we prove a lower bound on the worst-case slowdown incurred when simulating a PRAM step on a processor network. Unlike previous approaches [AHMP87, KU88, HB94], which do not account for the network topology, we obtain a bound that is based on the bandwidth characteristics of the simulating network. As a result, while previous lower bounds were significant only for very powerful networks such as expanders, our lower bound can be specialized, yielding nontrivial results, to a broad family of topologies, including low-bandwidth ones such as d-dimensional meshes and the pruned butterfly. The bound is based on the notion of balanced decomposition tree [BL84], which provides a partition of the network into disjoint regions of known bandwidth. We first formulate the general lower bound in terms of such a decomposition, and then show how to specialize it to meshes and to the pruned butterfly.
Consider the simulation of an arbitrary $(n, m)$-PRAM program on an n-processor network N. For convenience, we assume that each PRAM step involves either the reading (read step) or the writing (write step) of some n-tuple of variables. The simulation must satisfy the following standard constraints, which are also required in the lower bounds quoted earlier.
The simulation must be on-line, in the sense that each PRAM step is made known to the simulation algorithm only after the simulation of previous read steps has been completed. Thus, read steps are simulated one-by-one according to the order specified by the PRAM program. Each read must succeed in accessing the correct (i.e. most recently written) value of the variable in question. Note that no restriction is placed on the execution of write operations.
The simulation must be point-to-point in the sense that a processor that wants to write a variable must dispatch a distinct message for each copy of the variable it wants to update.
Note that the point-to-point constraint rules out the splitting and combining techniques that are at the core of the simulations presented in this paper. However, at the end of the section, we modify the argument to obtain a non point-to-point lower bound, formulated in terms of the global space used to represent the PRAM memory, which applies to our upper bounds.
We assume that the simulating network N has a $[w_0, w_1, \ldots, w_{\log n}]$ balanced decomposition tree, as defined in [BL84]; that is, for any i, $0 \le i \le \log n$, N can be partitioned into $2^i$ disjoint i-regions, $R^{(i)}_1, \ldots, R^{(i)}_{2^i}$, where each i-region contains $\lceil n/2^i \rceil$ 1 processors and is connected to the rest of the network by at most $w_i$ edges. Clearly, every network has a balanced decomposition tree, for suitable values of the $w_i$'s.
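As an illustration of how the $w_i$ arise for a concrete network, the Python sketch below computes them by brute force for a two-dimensional mesh. It assumes one natural decomposition, obtained by halving the mesh recursively with alternating row and column cuts; this is a choice made here for illustration and not necessarily the decomposition of [BL84]. The function name `mesh_decomposition_widths` is hypothetical.

```python
def mesh_decomposition_widths(side):
    """Return [w_0, ..., w_log n] for a side x side mesh (n = side^2),
    where w_i is the largest number of edges leaving any i-region."""
    if side < 1 or side & (side - 1):
        raise ValueError("side must be a power of two")
    log_n = 2 * (side.bit_length() - 1)
    widths = []
    for i in range(log_n + 1):
        rows = side >> ((i + 1) // 2)  # region height after i alternating cuts
        cols = side >> (i // 2)        # region width after i alternating cuts
        crossing = {}
        for r in range(side):
            for c in range(side):
                here = (r // rows, c // cols)
                # Look at the edge going down and the edge going right.
                for r2, c2 in ((r + 1, c), (r, c + 1)):
                    if r2 < side and c2 < side:
                        there = (r2 // rows, c2 // cols)
                        if there != here:
                            crossing[here] = crossing.get(here, 0) + 1
                            crossing[there] = crossing.get(there, 0) + 1
        widths.append(max(crossing.values(), default=0))
    return widths
```

On a 4 x 4 mesh this gives [0, 4, 4, 5, 4]. In general an i-region of this decomposition is a rectangle with sides at most $\sqrt{2}\,\sqrt{n/2^i}$, so its boundary carries $O(\sqrt{n/2^i})$ edges.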
Definition 3 Let h and k be two integers, with $1 \le h, k \le \log n$, and let t be an arbitrary time step during the course of the simulation. For any shared variable $u \in U$, we define $r^t_{h,k}(u)$ to be the minimum, taken over all h-regions $R^{(h)}_j$, of the number of k-regions containing valid (i.e., most recently updated) copies of u that lie outside $R^{(h)}_j$ at the beginning of step t. (We assume $r^0_{h,k}(u) = 0$, for every h, k and u.) We also define the average redundancy at time t with respect to h and k as $r^t_{h,k} = \sum_{u \in U} r^t_{h,k}(u)/m$.
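A small computation may help in reading Definition 3. The Python sketch below evaluates the redundancy of a single variable and the average redundancy of a set of variables. It assumes that n is a power of two and that the i-regions are contiguous blocks of $n/2^i$ processors numbered from zero; both are simplifications introduced here, and the function names are hypothetical.

```python
def redundancy(copy_nodes, n, h, k):
    """r_{h,k}(u) for a variable whose valid copies sit on processors copy_nodes."""
    h_size, k_size = n >> h, n >> k
    best = None
    for j in range(1 << h):
        # k-regions hosting a valid copy that lies outside the j-th h-region.
        outside = {p // k_size for p in copy_nodes if p // h_size != j}
        best = len(outside) if best is None else min(best, len(outside))
    return best

def average_redundancy(valid_copies, n, h, k):
    """Average of r_{h,k}(u) over all variables; valid_copies[u] lists the
    processors holding the valid copies of variable u."""
    return sum(redundancy(c, n, h, k) for c in valid_copies) / len(valid_copies)
```

With n = 16, h = 1 and k = 2, a variable with valid copies on processors 0, 5 and 9 has redundancy 1 (taking the first half of the network as the h-region leaves only the copy on processor 9 outside), while a variable with all valid copies on processors 1, 2 and 3 has redundancy 0.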
The lower bound argument is similar in spirit to the ones in [AHMP87, KU88, HB94], namely, it relies on finding a sequence of PRAM steps which are "hard" to simulate. Such a sequence will contain a judicious mixture of write and read steps suitably chosen to expose a tradeoff of the following kind: unless the simulation devotes a sufficient amount of effort to each write step to ensure that the valid copies of the variables written are "nicely distributed" among different regions of the network, an adversary is always guaranteed to be able to find a read instruction that will be relatively expensive to simulate.
In the subsequent analysis, we will make use of the following technical lemma, whose proof is embedded in that of Lemma 1 in [PP97].

A lower bound on the complexity of a read operation as a function of the redundancy of the simulation scheme is proved in the following lemma.
Lemma 7 Fix an arbitrary time step t during the course of the simulation. For every h and k, with $1 \le h, k \le \log n$, at time t an adversary could issue a read step involving n distinct variables, whose simulation requires time at least $g_{h,k}(r^t_{h,k})$, where
$$g_{h,k}(r) = \begin{cases} 1 & \text{if } (2r, 2^k, n, m/2^{h+1}) \ge 2^{k-2}, \\[4pt] \dfrac{n}{4\left(w_h + w_k\,(2r, 2^k, n, m/2^{h+1})\right)} & \text{otherwise,} \end{cases}$$
and $r^t_{h,k}$ is the average redundancy at time t with respect to h and k.
Proof: Fix h and k and let $r = r^t_{h,k}$. The case $(2r, 2^k, n, m/2^{h+1}) \ge 2^{k-2}$ is trivial, hence we assume that $(2r, 2^k, n, m/2^{h+1}) < 2^{k-2}$. We will identify a set of $\Theta(n)$ variables all of whose valid copies are confined within a low-bandwidth portion of the network, and therefore are expensive to read by processors outside the region. Let $\hat{U} = \{u \in U : r^t_{h,k}(u) \le 2r\}$.
Clearly, $|\hat{U}| \ge m/2$ and there exists an h-region $R^{(h)}_{j_0}$ for which there are at least $m/2^{h+1}$ variables in $\hat{U}$ achieving their minimum redundancy with respect to $R^{(h)}_{j_0}$. Let $\hat{U}^{(h)}_{j_0} \subseteq \hat{U}$ be the set containing these variables. Note that since $(2r, 2^k, n, m/2^{h+1}) < 2^{k-2}$ we must have $m/2^{h+1} \ge n$ and, thus, $|\hat{U}^{(h)}_{j_0}| \ge n$. We distinguish between two cases, depending on whether r is less than 1/2 or not. If $r < 1/2$, then there exists an n-tuple of variables in $\hat{U}^{(h)}_{j_0}$ whose valid copies are all within $R^{(h)}_{j_0}$. Since $h \ge 1$ and so $R^{(h)}_{j_0}$ contains no more than n/2 processors, we can stipulate that n/2 variables of the n-tuple be read by processors outside $R^{(h)}_{j_0}$. At least one copy per variable must then be transmitted along the wires connecting the region with the rest of the network, therefore such a read instruction will require at least $n/(2w_h)$ time.
The case $r \ge 1/2$ is more involved. By plugging $r' = 2r$, $p = 2^k$, $n' = n$, and $m' = m/2^{h+1}$ into the statement of Lemma 6, we conclude that there are n variables in W whose expensive copies are all contained in at most $(2r, 2^k, n, m/2^{h+1})$ k-regions. Since $(2r, 2^k, n, m/2^{h+1}) < 2^{k-2}$, the union of $R^{(h)}_{j_0}$ and these k-regions contains no more than 3n/4 processors. Therefore we can stipulate that n/4 variables of the n-tuple be read by processors outside the union. The lemma follows by observing that reading these variables would take time at least
$$\frac{n}{4\left(w_h + w_k\,(2r, 2^k, n, m/2^{h+1})\right)},$$
and that the above term is strictly less than $n/(2w_h)$.
□
We observe that the function $g_{h,k}(r)$ defined in the above lemma is a non-increasing function of r.
The following lemma is similar to Lemma 7 in [HB94], and captures the contribution of the write steps to the running time in terms of the average redundancy.
Lemma 8 Consider an arbitrary time step t during the course of a point-to-point simulation, and let $r = r^t_{h,k}$. Then, t and r satisfy the following inequality:
$$t \ge \frac{rm}{2^h w_h}.$$
Proof: For each variable u, let $r_u$ denote the number of valid copies of u lying outside the h-region which contains the processor that most recently updated u (before time t). (Note that $r_u$ as well as $r^t_{h,k}(u)$, for any h and k, are equal to 0 if no processor wrote u before time t.) Under the point-to-point assumption, such a processor must have dispatched at least $r_u$ distinct messages that crossed the boundaries of its h-region. As a consequence, we have that a total of at least $\sum_{u \in U} r_u \ge rm$ messages must have crossed boundaries of h-regions, hence, there must be an h-region whose boundary was crossed by at least $rm/2^h$ distinct messages, which accounts for a total of at least $rm/(2^h w_h)$ time.
□
Theorem 9 For any $T \ge 2m/n$, there exists a T-step $(n, m)$-PRAM program, whose point-to-point, on-line simulation on an n-processor network with a $[w_0, w_1, \ldots, w_{\log n}]$ balanced decomposition tree, requires worst-case time
$$\Omega\left( T \max_{1 \le h, k \le \log n}\ \min_{r \ge 0} \left( g_{h,k}(r) + r\, \frac{n}{2^h w_h} \right) \right),$$
where $g_{h,k}(r)$ is the function defined in Lemma 7.
Proof: We construct a PRAM program with $\lfloor T/(2m/n) \rfloor$ batches of $2m/n$ instructions, each batch consisting of $m/n$ write steps that update all the variables, followed by $m/n$ read steps suitably chosen by the adversary according to Lemma 7.
Consider the simulation of one such batch, for some h and k, with $1 \le h, k \le \log n$. Let r be the maximum value of $r^t_{h,k}$ at the start (time t) of the simulation of any read step. By Lemma 8, the simulation of all the write steps requires time at least $rm/(2^h w_h)$. By Lemma 7, the simulation of each read step requires time at least $g_{h,k}(r)$. Hence, the simulation time for the batch is at least
$$\frac{m}{n} \left( g_{h,k}(r) + \frac{rn}{2^h w_h} \right).$$
The theorem follows by taking the minimum over all possible values of r and the maximum over all choices of h and k of the simulation time of a batch given above, and then by summing the contributions of the $\lfloor T/(2m/n) \rfloor$ batches. □
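The final summation over batches can be spelled out as follows. The only fact used beyond the per-batch bound is that $\lfloor x \rfloor \ge x/2$ for every $x \ge 1$, which applies here because $T \ge 2m/n$; the values of h, k and r are those fixed in the proof.

```latex
\left\lfloor \frac{T}{2m/n} \right\rfloor \cdot \frac{m}{n}
    \left( g_{h,k}(r) + \frac{rn}{2^h w_h} \right)
  \;\ge\; \frac{T}{4m/n} \cdot \frac{m}{n}
    \left( g_{h,k}(r) + \frac{rn}{2^h w_h} \right)
  \;=\; \frac{T}{4} \left( g_{h,k}(r) + \frac{rn}{2^h w_h} \right).
```

The rounding of the number of batches therefore costs at most a factor of 4, which is absorbed in the asymptotic notation.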
We are now ready to prove Theorem 3, stated in the Introduction, which specializes the general lower bound of Theorem 9 to the case of d-dimensional meshes (with d constant).
Proof of Theorem 3: Let us first concentrate on one-dimensional meshes (d = 1). We establish this case separately, by means of a simple, diameter-based argument as follows.
Using the chosen values of h, r and , we see that m 2 h+1 n 1=2 r (m=n) 8 log(m=n)= log log(m=n) (2 h+1 ) 8= log 8 (m=n)
where the simplification of the denominator relies on the facts that $2^{h+1} < 4$ and (4 ) 8= 8 4 (since 4 8 and $x^{1/x}$ is decreasing for $x > 2$). With this it is easy to establish that when $r < \bar{r}$, we have $(2r, 2^k, n, m/2^{h+1}) < 2^{k-2}$, which according to the definition of $g_{h,k}(r)$ in Lemma 7 implies that $g_{h,k}(r) = n$

The proof of Theorem 4, stated in the Introduction, is virtually identical to that of Theorem 3 for d = 2, and is omitted for brevity.
Recall that the simulations presented in this paper achieve high levels of efficiency by making crucial use of splitting and combining techniques. More specifically, a processor issuing a memory request generates a single variable packet for each subset of copies residing in a suitably sized region of the network. Once the variable packet is shipped within its destination region, it is split into multiple copy packets, destined to the individual copies of the variable. In this way, the cost of the "long leg" of the journey to access a copy is paid only once for all the copies residing within the same region.
Unfortunately, the point-to-point assumption made by our lower bound argument precludes the splitting and combining of messages destined to distinct copies, therefore Theorems 3 and 4 do not apply to our simulations. Note however that the argument uses this assumption only to establish a bound on the cost of write operations. As a consequence, we can prove a lower bound solely based on the cost of read operations, which holds in an unrestricted model where splitting/combining may occur. The lower bound, stated in Theorem 5 in the Introduction and proved below, is obtained by making sure that the average redundancy does not grow too large during the simulation. This can be achieved by establishing that the total amount of space used to represent the PRAM variables in the local memory modules can never exceed a fixed threshold mr.
Proof of Theorem 5: Consider the case of d-dimensional meshes. For d = 1, the bound can be trivially obtained through the same diameter-based argument employed in the proof of Theorem 3. Hence, assume d > 1. We consider a PRAM program that first executes $m/n$ write steps to update all the variables, and then executes $T - (m/n) = \Theta(T)$ read steps suitably chosen by the adversary. Since the average number of copies per variable is r, it is immediate to argue that, for every $1 \le k \le \log n$, at the beginning of each read step there are at least m/2 variables each of which has updated copies in at most 2r k-regions. By Lemma 6 this implies that there exist $\min\{2^k, (2r, 2^k, n, m/2)\}$ k-regions that contain all updated copies of at least n variables. If $(2r, 2^k, n, m/2) \le 2^{k-1}$, the adversary can require that n/2 such variables be read by processors outside the $(2r, 2^k, n, m/2)$ k-regions, which takes time $n / \left(w_k\,(2r, 2^k, n, m/2)\right) = n$
Let us fix k as the minimum index such that 2 k log(m=n) log log(m=n) and let $\bar{m}$ be the largest value of m for which the chosen value for $2^k$ is at most n. As in the proof of Theorem 3, we first consider the case $m \le \bar{m}$. In this case, we have $1 \le k \le \log n$. Moreover, it is easy to verify that (m=2n) 1=(2r) (log(m=n)= log log(m=n)) 2 +1 and $(2r, 2^k, n, m/2) \le 2^{k-1}$. By plugging the value for $2^k$ in the right-hand side of Equation 6 we see that the cost of each read operation is

The unrestricted lower bound for the pruned butterfly is obtained by setting d = 2 in the above calculations.
Theorem 5 shows that our simulations use an amount of redundancy which is only a doubly logarithmic factor higher than the minimum redundancy needed to achieve the same slowdown.
Conclusions
In this paper we have presented upper and lower bounds for the problem of simulating a shared memory abstraction on network-based machines such as d-dimensional meshes and the pruned butterfly. An interesting feature of our scheme is its generality. Indeed, the simulation algorithm relies on a recursive decomposition of the underlying network into subnetworks of the same topology, and employs a restricted set of basic primitives such as prefix, k-sorting and k-relation routing. As a consequence, the algorithm is efficiently portable to any other machine with a recursive structure and on which optimal algorithms for the above primitives are known. As for the lower bound, we have developed a generic, bandwidth-based argument that can be applied to any specific interconnection using the parameters of its decomposition tree.
Regarding the upper bound, it must be remarked that we make use of memory organizations based on generalized expanders. As mentioned in the Introduction, the explicit, deterministic construction of generalized expanders is a long-standing open question, although it can be shown that a random bipartite graph would exhibit the required expansion property with high probability. This limitation suffered by our scheme is shared by all other deterministic mesh-based schemes in the literature, with the exception of the scheme of [PPS94], which only applies to very small memory sizes ($m = O(n^{1.5})$) and exhibits a higher slowdown than ours.
Finally, the general lower bound presented in Section 7 is proved under the point-to-point assumption, which stipulates that packets sent to copies of a variable can neither be split nor combined. This constraint rules out the techniques that are at the core of the simulations presented in this paper, hence the bound does not apply to our algorithms directly. However, we have been able to modify the argument to obtain one which applies to our algorithms, by introducing an upper limit to the global space used to represent the PRAM variables. In particular, we are able to show that in order to match the slowdowns exhibited by our simulations, any deterministic scheme must use about the same amount of space to represent the variables. However, the search for a nontrivial, totally unrestricted lower bound for deterministic PRAM simulation on network-based machines remains a challenging open problem.
