Implementing Shared Memory on Mesh-Connected Computers and on the Fat-Tree  by Herley, Kieran T et al.
Information and Computation 165, 123–143 (2001)
doi:10.1006/inco.2000.3006, available online at http://www.idealibrary.com
Implementing Shared Memory on Mesh-Connected Computers
and on the Fat-Tree*
Kieran T. Herley
Department of Computer Science, University College Cork, Cork, Ireland
E-mail: k.herley@cs.ucc.ie
and
Andrea Pietracaprina and Geppino Pucci
Dipartimento di Elettronica e Informatica, Università di Padova, Via Gradenigo 6/a, 35131 Padova, Italy
E-mail: andrea@artemide.dei.unipd.it and geppo@artemide.dei.unipd.it
Received November 24, 1997
We present deterministic upper and lower bounds on the slowdown required to simulate an $(n,m)$-PRAM
on a variety of networks. The upper bounds are based on a novel scheme that exploits the
splitting and combining of messages. This scheme can be implemented on an n-node d-dimensional
mesh (for constant d) and on an n-leaf pruned butterfly and attains the smallest worst-case slowdown
to date for such interconnections, namely, $O(n^{1/d}(\log(m/n))^{1-1/d})$ for the d-dimensional mesh (with
constant d) and $O(\sqrt{n\log(m/n)})$ for the pruned butterfly. In fact, the simulation on the pruned butterfly
is the first PRAM simulation scheme on an area-universal network. Finally, we prove restricted and
unrestricted lower bounds on the slowdown of any deterministic PRAM simulation on an arbitrary
network, formulated in terms of the bandwidth properties of the interconnection as expressed by its
decomposition tree. © 2001 Academic Press
1. INTRODUCTION
The problem of implementing a shared-memory abstraction on various distributed-memory parallel
architectures has been intensively studied over the past decade. Generally, this problem has been referred
to as the PRAM simulation problem and involves representing the m cells of the PRAM shared memory
(called variables) among the n processor–memory nodes of the simulating machine in such a way
that any n-tuple of distinct cells may be read or written efficiently. The time required to simulate one
PRAM step is known as the slowdown of the simulation. A number of approaches to this problem, both
probabilistic and deterministic, have been investigated for a variety of well-known architectures such
as the complete interconnection, the mesh of trees, the butterfly, as well as a variety of expander-based
architectures, among others.
We will not attempt to summarize the extensive literature on this problem here but quote only
those results that relate directly to our work and refer the interested reader to [PPS94] for a recent
and comprehensive summary of further work on this topic. Building on earlier work of Upfal and
Wigderson [UW87], Alt et al. [AHMP87] presented a deterministic scheme to simulate a PRAM with n
processors and m variables (called an $(n,m)$-PRAM) on an n-node Module Parallel Computer (MPC),
an architecture in which each node includes both a processor and a private memory module accessible
only to that processor, and in which the nodes are connected by a crossbar that allows each node to
transmit or receive one message per step. Their scheme employs the following copy-based method
for the representation of the PRAM variables, which most of the deterministic simulation algorithms,
including the present work, adopt. Specifically, each variable is represented by a set of copies, whose
size $2c-1$ is logarithmically related to n and m, and each copy consists of a value and a time-stamp.
The copies are distributed carefully among the memory modules of the simulating machine. To write a
variable, at least c of its copies are overwritten to reflect the intended value and the time of writing. To
read a value, at least c copies are inspected. Since any two sets of c copies out of $2c-1$ intersect, the
set of copies read contains at least one of the copies most recently written, which is readily identifiable
by virtue of its time-stamp. Alt et al. show that for
a suitable distribution of the copies among the nodes of the machine, any n-tuple of variables may be
accessed (read or written) in $O(\log m)$ time.
* This research was supported, in part, by the EC ESPRIT Basic Research Project 9072 (Project GEPPCOM: Foundations of
GEneral Purpose Parallel COMputing). The results in this paper appeared in preliminary form in the Proceedings of the Third
Annual European Symposium on Algorithms (ESA’95), pp. 60–74, 1995.
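The copy-based protocol just described can be illustrated with a small sequential sketch. The class and function names below are hypothetical and are not taken from [AHMP87]; the sketch only models a single variable with $2c-1$ time-stamped copies and checks exhaustively that any read of c copies observes the most recent write of c copies.

```python
from itertools import combinations


class ReplicatedVariable:
    """One PRAM variable held as 2c-1 (value, timestamp) copies (illustrative model)."""

    def __init__(self, c, initial=0):
        self.c = c
        self.copies = [(initial, 0)] * (2 * c - 1)

    def write(self, value, time, indices):
        """Overwrite the copies at `indices`; at least c distinct copies are required."""
        if len(set(indices)) < self.c:
            raise ValueError("a write must update at least c copies")
        for i in indices:
            self.copies[i] = (value, time)

    def read(self, indices):
        """Inspect the copies at `indices` and return the value with the latest timestamp."""
        if len(set(indices)) < self.c:
            raise ValueError("a read must inspect at least c copies")
        value, _ = max((self.copies[i] for i in indices), key=lambda vt: vt[1])
        return value


def check_all_quorums(c):
    """Try every pair (write set, read set) of size c and confirm the read is fresh."""
    n_copies = 2 * c - 1
    for w in combinations(range(n_copies), c):
        for r in combinations(range(n_copies), c):
            var = ReplicatedVariable(c)
            var.write("old", 1, range(c))
            var.write("new", 2, w)
            if var.read(r) != "new":
                return False
    return True
```

The exhaustive check succeeds because two subsets of size c drawn from $2c-1$ copies always share at least one element.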
The above scheme can be ported to an arbitrary network by simulating each MPC step using standard
techniques such as routing and sorting. In particular, this approach yields a simulation with slowdown
$O(n^{1/d}\log m)$ on an n-node d-dimensional mesh ($d = O(1)$) and a simulation with slowdown
$O(\sqrt{n}\log m)$ on an n-leaf pruned butterfly, which are the interconnections that we consider in this
paper. In [AHMP87], it was also observed that a simple PRAM simulation for the two-dimensional
mesh with an optimal slowdown of $O(\sqrt{n})$ is indeed possible. Unfortunately, this simulation requires
up to n copies per variable, resulting in an unacceptable memory blowup. Moreover, the method does
not extend to higher-dimensional meshes and other interconnections.
Most of the deterministic simulations that appear in the literature, including those in this paper,
rely on memory distributions that are built upon certain expander-based graphs whose existence can
be proved, but for which no efficient construction is known. Recently, Pietracaprina et al. [PPS94,
PP95] have studied deterministic simulations based entirely on explicitly constructible structures. By
resorting to a complex hierarchical arrangement of constructible, mildly expanding graphs, they achieve
$O(\sqrt{n}\log n)$ slowdown on an n-node mesh for memories of $O(n^{1.5})$ size, using $O(\log^{1.59} n)$ copies per
variable. In this paper, our focus is on slowdown rather than constructiveness. By employing more
powerful expander-based structures, we achieve a better slowdown than that of [PP95] at a lower level
of redundancy (copies per variable).
It might appear, at least at first glance, that updating c copies apiece for n variables must involve the
physical movement of cn distinct packets across the entire network, which on the $\sqrt{n}\times\sqrt{n}$ mesh would
require $\Omega(c\sqrt{n})$ time. In this paper, we devise a novel splitting/combining technique to circumvent this
difficulty, based on the following idea. If a processor p wishes to send the same packet to nodes x and
y that are “distant” from p but “close” to one another, then rather than dispatch a separate packet for
each, it may be more efficient to dispatch a single message to some “intermediate” node z close to both
x and y. At node z, the original packet is then made into two replicas which are forwarded to x and y
separately. A careful implementation of this idea leads to the following result.
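The intuition behind splitting can be made concrete by counting link traversals on a two-dimensional mesh. The following sketch is only a back-of-the-envelope illustration under the assumption that packets follow shortest (Manhattan) paths; the function names are hypothetical and the hop count is a crude proxy for bandwidth use, not the running time analyzed in the paper.

```python
def hops(a, b):
    """Manhattan distance between two mesh nodes given as (row, column) pairs."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def separate_cost(p, x, y):
    """Total link traversals when p sends one packet to x and another to y."""
    return hops(p, x) + hops(p, y)


def split_cost(p, z, x, y):
    """Total link traversals when p sends one packet to z, which replicates it for x and y."""
    return hops(p, z) + hops(z, x) + hops(z, y)
```

For example, with p = (0, 0), x = (30, 30), y = (31, 31) and the intermediate node z = x, dispatching separately costs 122 link traversals, whereas splitting at z costs 62.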
THEOREM 1. For any $m \le 2^{\Theta(n^{1/(d+1)})}$ there exists a scheme to simulate an $(n,m)$-PRAM on an
n-node d-dimensional mesh (d constant) with worst-case slowdown $O(n^{1/d}(\log(m/n))^{1-1/d})$, using
$O(\log(m/n))$ copies per variable and $O((m/n)\log^5(m/n))$ storage per node.
In order to implement the splitting/combining strategy outlined above, the scheme relies on a recursive
decomposition of the mesh and on efficient algorithms for k-sorting, where each processor initially
holds k packets, and for k-relation routing, where each processor sends and receives at most k packets.
Theorem 1 implies that our simulation scheme incurs a slowdown which is a factor $O((\log(m/n))^{1/d})$
smaller than the one obtainable by porting the MPC algorithm of [AHMP87] to an n-node d-dimensional
mesh. We remark that the (exponential) upper bound on the memory size m in Theorem 1 is
imposed so that the cost of sequential bookkeeping operations such as local sorting or counting
does not dominate the overall running time of the simulation algorithm. Such a bound on m would not
be needed if a cost model accounting only for interprocessor communication were adopted, as is
customary for network algorithms [Kun93].
The n-leaf pruned butterfly [BB95] (described later) is an area-universal network that is a variant
of Leiserson’s fat-tree [Lei85]. Although quite different from the two-dimensional mesh in terms of
the details of its structure, it is sufficiently similar in its bandwidth characteristics to support the key
operations upon which our simulations rely with comparable efficiency. By providing novel sorting
and routing primitives for this network, and by using its natural decomposition into subtrees, we are
able to implement the above simulation scheme with the same slowdown as that achieved for the two-
dimensional mesh, thereby obtaining the first PRAM simulation on an area-universal network. The
result is stated in the following theorem.
THEOREM 2. For any $m \le 2^{\Theta(n^{1/3})}$ there exists a scheme to simulate an $(n,m)$-PRAM on an n-leaf
pruned butterfly with worst-case slowdown $O(\sqrt{n\log(m/n)})$, using $O(\log(m/n))$ copies per variable
and $O((m/n)\log^5(m/n))$ storage per node.
Lower bounds on the slowdown of PRAM simulations on bounded-degree networks have been
presented in a number of studies [AHMP87, KU88, HB94]. All such bounds, however, apply to the entire
class of such networks, and cannot be specialized to the characteristics of a given topology. For example,
in [HB94] the authors show an $\Omega(\log^2(m/n)/\log\log(m/n))$ lower bound on the slowdown required to
simulate a PRAM step on any bounded-degree network, which is too weak for our purposes, since a trivial
$\Omega(n^{1/d})$ lower bound may easily be obtained for d-dimensional meshes based on diameter limitations.
An $\Omega(\sqrt{n})$ bound holds for the pruned butterfly based on straightforward bandwidth considerations. In
this paper, we present the first lower bound argument that takes into account the characteristics of the
individual network. To capture the properties of the network topology, the bound exploits the notion of
decomposition tree [BL84, Lei85], which provides a partition of the network into disjoint regions of
limited bandwidth.
As in all previous works, the lower bound is proved under the point-to-point assumption, which
requires that a processor updating a number of copies of a variable dispatch a separate message for
each copy. When specialized to d-dimensional meshes and to the pruned butterfly, our lower bound
technique yields the following results.
THEOREM 3. Let $m \ge 16n$. For every $T \ge 2m/n$, there exists a T-step $(n,m)$-PRAM program, whose
[lower-bound expression not recoverable from the source]
on an n-node d-dimensional mesh (with d constant).
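THEOREM 4. Let $m \ge 16n$. For every $T \ge 2m/n$, there exists a T-step $(n,m)$-PRAM program, whose
[lower-bound expression not recoverable from the source; surviving fragment: $\ldots/\log\log(m/n),\ n^{2/3}\}\bigr)$]
on an n-leaf pruned butterfly.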
Unfortunately, the point-to-point assumption upon which Theorems 3 and 4 and the other works
in the literature rely precludes the splitting and combining of messages. As a consequence, the above
lower bounds do not apply to our simulations directly. However, we are able to prove similar bounds
in an unrestricted model by limiting the total level of redundancy used to represent the variables. Such
bounds show that our simulations use an amount of redundancy which is only a doubly logarithmic
factor higher than the minimum redundancy needed to achieve the same slowdown. Specifically, we
have the following result.
THEOREM 5. Let $m \ge 16n$. For every $T \ge 2m/n$ and every constant $\alpha \ge 1$, there exists a T-step
[remainder of the statement not recoverable from the source]
The bound with $d = 2$ also applies to the pruned butterfly.
The rest of the paper is organized as follows. Section 2 discusses the distribution of the copies
among the memory modules and the properties required of the graph representing the memory map. In
Section 3, the simulation algorithm for the two-dimensional mesh is presented. The algorithm consists
of two phases, copy selection and routing, which are described in Subsections 3.1 and 3.2, respectively.
This scheme is extended to higher-dimensional meshes in Section 4 and to the pruned butterfly in
Section 5. Section 6 shows how the space bounds quoted in Theorems 1 and 2 may be achieved.
Section 7 presents the lower bound results discussed above. Section 8 concludes the paper with some
final remarks and indicates future research directions.
2. MEMORY ORGANIZATION
Consider the simulation of an $(n,m)$-PRAM on an n-node machine and suppose that each variable is
replicated into $2c-1$ copies, for a suitable integer c. It is convenient to model the distribution of copies
among the nodes of the machine by means of a bipartite graph $G = (U, V, E)$, where U represents the
set of variables, V the set of processor–memory nodes of the simulating machine, and $2c-1$ edges
connect each variable to the distinct nodes storing its copies. In the following we will denote by E(S)
the set of edges in E incident on a set $S \subseteq U$. Note that there is a one-to-one correspondence between
E(S) and the set of all copies of variables in S.
Let $S \subseteq U$ and $F \subseteq E(S)$. When F contains exactly k edges incident on each $s \in S$ we call F a
k-bundle for S. Also, we denote by $\Gamma_F(S)$ the subset of V touched by edges in F. A vertex $v \in V$ is said
to be q-congested with respect to F if more than q edges in F are incident on v. Finally, the congestion
of F is the maximum value q such that there is a vertex in V that is q-congested with respect to F.
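These definitions translate directly into a few set operations. The sketch below is illustrative only: an edge set F is modeled as a Python set of (variable, node) pairs, and all function names are hypothetical.

```python
from collections import Counter


def is_k_bundle(F, S, k):
    """True if F has exactly k edges on each variable of S and no edges on other variables."""
    per_variable = Counter(u for (u, _) in F)
    return set(per_variable) <= set(S) and all(per_variable[s] == k for s in S)


def touched_nodes(F):
    """The subset of V touched by the edges of F (Gamma_F in the text)."""
    return {v for (_, v) in F}


def is_q_congested(F, v, q):
    """True if more than q edges of F are incident on node v."""
    return sum(1 for (_, w) in F if w == v) > q


def max_load(F):
    """Largest number of edges of F incident on a single node (0 for an empty F)."""
    loads = Counter(v for (_, v) in F)
    return max(loads.values(), default=0)
```

A node v is q-congested exactly when its load exceeds q, so no node is q-congested with respect to F when `max_load(F) <= q`.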
Recall that our simulations adopt the majority protocol, which requires that at least c copies be
accessed in order to complete a read or a write. Equivalently, in graph-theoretic terms, if we wish
to access a set S of variables, then we must select a c-bundle for S. Since the congestion of a
c-bundle models the maximum number of physical copies that have to be accessed sequentially by
some individual node in the underlying machine, it is desirable to access a c-bundle with low congestion.
The existence of c-bundles of low congestion is intimately related to the expansion properties of
the graph G. This motivates the following definition [HB94] that characterizes a class of graphs called
generalized expanders that make good memory organizations.
DEFINITION 1. A bipartite graph $G = (U, V, E)$ with $|U| = m$ and $|V| = n$, and with each node in
U having degree d, is a $(\lambda, d, c, \sigma)$-generalized expander if, for every $S \subseteq U$ such that $|S| \le \sigma n$ and
for every c-bundle F of S, $|\Gamma_F(S)| \ge \lambda c|S|$.
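On very small graphs the condition of Definition 1 can be verified by brute force, which may help in reading the quantifiers. The sketch below is illustrative and exponential in cost; the function name and the `adj` representation (a dictionary from each variable to the list of distinct nodes holding its copies) are assumptions of this example, and the caller supplies `max_set_size`, the largest integer not exceeding $\sigma n$.

```python
from itertools import combinations, product


def is_generalized_expander(adj, c, lam, max_set_size):
    """Check |Gamma_F(S)| >= lam * c * |S| for every S with |S| <= max_set_size
    and every c-bundle F of S (exhaustive search, tiny inputs only)."""
    variables = list(adj)
    for size in range(1, max_set_size + 1):
        for S in combinations(variables, size):
            # A c-bundle chooses c of the edges of each variable in S.
            per_variable = [list(combinations(adj[u], c)) for u in S]
            for bundle in product(*per_variable):
                touched = set().union(*bundle)
                if len(touched) < lam * c * size:
                    return False
    return True
```

For instance, two variables whose copies occupy disjoint node sets satisfy the condition with $\lambda = 1$, whereas two variables sharing the same three nodes do not.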
We say that a generalized expander is smooth if the maximum degree of any node in V is $\Theta(|E|/|V|)$.
Herley and Bilardi [HB94] have established the existence of certain generalized expanders using counting
techniques similar in spirit to those found in the earlier work of Upfal and Wigderson [UW87]. A
minor variation of this result (guaranteeing smoothness) is quoted below.
THEOREM 6. For every n and m, with $m \ge n$, there exists a smooth $(\lambda, 2c-1, c, 1/(2c-1))$-
generalized expander $G = (U, V, E)$ with $|U| = m$, $|V| = n$, $\lambda = \Theta(1)$, and $c = \Theta(\log(m/n))$.
The graph of Theorem 6 will govern the memory organization of our simulations. We shall see that
such a graph has the desirable property that every set $S \subseteq U$ of size at most n has a c-bundle of low
congestion. Moreover, this c-bundle can be constructed efficiently.
Let $S \subseteq U$ be the set of variables to be accessed. The simulation algorithms described in the next
sections construct a c-bundle for S starting from E(S) and applying a sequence of whittling steps. Each
whittling step “prunes” the set of edges by selecting c edges apiece for some of the variables in S and
discarding the remaining $c-1$. At the beginning of a whittling step, a variable is said to be alive if the
c edges for the variable have not been selected yet and dead otherwise. The sequence terminates when
all variables are dead, at which point we are left with the desired c-bundle. The variables to whittle at
each step are chosen to ensure that the congestion of the final c-bundle will not exceed a fixed value
q, which will be specified later.
For $i \ge 1$, let $S_i \subseteq S$ denote the set of live variables and $F_i \subseteq E(S)$ the residual set of edges at the
beginning of the ith whittling step. Initially, $S_1 = S$ and $F_1 = E(S)$. Conceptually, the ith whittling
step identifies a set $W_i$ of “congested” nodes and selects c edges apiece for as many live variables as
possible without touching nodes in $W_i$. More formally, we say that $x \in S_i$ is confined to $W_i$ under $F_i$
if x has c or more copies in $F_i$ stored in nodes of $W_i$. In the ith whittling step, for each $x \in S_i$ which
is not confined to $W_i$ under $F_i$, we select c edges not incident on $W_i$ and remove the remaining $c-1$
from $F_i$. This operation will be referred to as whittling of $S_i$ with respect to $W_i$.
DEFINITION 2. Let $S \subseteq U$, $|S| \le n$. A q-whittling sequence of length t for S is a sequence
$(S_1, F_1, W_1), (S_2, F_2, W_2), \ldots, (S_t, F_t, W_t)$ such that
• $S_1 = S$, $F_1 = E(S)$, and $W_1 \subseteq V$ is a set of at most $(2c-1)n/q$ nodes including all nodes that
are q-congested with respect to E(S).
• For $i > 1$, $S_i \subset S_{i-1}$ and $F_i \subset F_{i-1}$ are, respectively, the set of live variables and the set of
residual edges left after whittling $S_{i-1}$ with respect to $W_{i-1}$, and $W_i$ is the set of q-congested nodes with
respect to $F_i$.
• $S_t = W_t = \emptyset$ and $F_t$ is a c-bundle for S.
Note that in the above definition each $W_i$, with $i > 1$, contains only nodes that are q-congested with
respect to $F_i$, while $W_1$ may include an additional (small) number of uncongested nodes. The rationale
behind this asymmetry will become clear later in the paper. It is also easy to see that $F_t$, the final c-bundle
for S, has congestion at most q. The following lemma characterizes the rate at which the variables “die”
during a whittling sequence.
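A sequential sketch of a whittling sequence may help fix the ideas. It is a simplification under stated assumptions: every $W_i$, including $W_1$, is taken to be exactly the set of q-congested nodes, the memory map `adj` is a dictionary from each variable to the list of nodes holding its $2c-1$ copies, and the function name is hypothetical. On a graph without the required expansion the process may stall, in which case the sketch raises an error rather than looping.

```python
from collections import Counter


def whittle(adj, S, c, q, max_steps=100):
    """Return (bundle, t): the selected c nodes per variable and the sequence length t."""
    residual = {u: list(adj[u]) for u in S}    # F_i, grouped by variable
    live = list(S)                             # S_i
    bundle = {}
    for step in range(1, max_steps + 1):
        if not live:                           # S_t is empty: the sequence has length t
            return bundle, step
        load = Counter(v for nodes in residual.values() for v in nodes)
        congested = {v for v, k in load.items() if k > q}    # W_i
        for u in live:
            free = [v for v in residual[u] if v not in congested]
            if len(free) >= c:                 # u is not confined to W_i
                bundle[u] = residual[u] = free[:c]
        still_live = [u for u in live if u not in bundle]
        if len(still_live) == len(live):
            raise RuntimeError("no progress: the graph lacks the required expansion")
        live = still_live
    raise RuntimeError("step limit reached")
```

Because edges are only ever selected on nodes whose load is at most q at that moment, and loads never increase as edges are discarded, no node ends up with more than q selected copies.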
LEMMA 1. Let S be a set of at most n variables and $q \ge 4c/\lambda = \Theta(\log(m/n))$. Then, for any
q-whittling sequence of length t, $(S_1, F_1, W_1), (S_2, F_2, W_2), \ldots, (S_t, F_t, W_t)$, we have $|S_i| \le n/(2c-1)^{i-1}$
for $i \ge 1$. Therefore $t = O(\log n/\log\log(m/n))$.
Proof. The proof proceeds by induction on i. The basis $i = 1$ is trivial. For $i = 2$, consider the
set $S_2$ and assume that $|S_2| > n/(2c-1)$. Since all variables in $S_2$ are confined to $W_1$, we can choose
an arbitrary subset $S' \subseteq S_2$ of exactly $n/(2c-1)$ variables and a c-bundle $F' \subseteq F_1$ for $S'$, with
$\Gamma_{F'}(S') \subseteq W_1$. Then, by the expansion properties of our memory organization, we obtain
$$\frac{(2c-1)n}{q} \ge |W_1| \ge |\Gamma_{F'}(S')| \ge \lambda c|S'| = \frac{\lambda c n}{2c-1} > \frac{\lambda n}{2} > \frac{(2c-1)\lambda n}{4c} \ge \frac{(2c-1)n}{q},$$
a contradiction. Finally, suppose that $i \ge 3$, and note that all the edges in $F_{i-1}$ relative to the set $S - S_{i-1}$
of dead variables at the beginning of Step $i-1$ cannot be incident on nodes of $W_{i-1}$, since we never
pick edges for c-bundles out of those incident on q-congested nodes. Therefore, the congestion of nodes
in $W_{i-1}$ is entirely caused by copies of variables in $S_{i-1}$, whence
$$q\,|W_{i-1}| \le (2c-1)\,|S_{i-1}|. \qquad (1)$$
On the other hand, being confined to $W_{i-1}$, all variables in $S_i$ have a c-bundle in $W_{i-1}$, and $|S_i| \le |S_2| \le
n/(2c-1)$, hence, by the expansion properties of the memory map,
$$|W_{i-1}| \ge \lambda c|S_i|. \qquad (2)$$
The bound for $S_i$ follows by combining Inequalities (1) and (2) and applying the inductive hypothesis.
Indeed, $|S_i| \le |W_{i-1}|/(\lambda c) \le (2c-1)|S_{i-1}|/(q\lambda c) \le |S_{i-1}|/(2c-1)$, since $q\lambda c \ge 4c^2 > (2c-1)^2$. □
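To get a feeling for how short a whittling sequence is, one can evaluate the bound $|S_i| \le n/(2c-1)^{i-1}$ of Lemma 1 numerically. The helper below is a hypothetical illustration: it returns the first index t at which the bound drops below 1, so that $S_t$ is necessarily empty.

```python
def whittling_length_bound(n, c):
    """Smallest t with (2c-1)**(t-1) > n, i.e. with n / (2c-1)**(t-1) < 1."""
    t, power = 1, 1            # invariant: power == (2c-1)**(t-1)
    while power <= n:
        power *= 2 * c - 1
        t += 1
    return t
```

With n = 1024 and c = 3, for example, the bound guarantees that no variable is alive at the beginning of the sixth step.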
3. SIMULATION ON THE MESH
We now consider the simulation of an $(n,m)$-PRAM on a $\sqrt{n}\times\sqrt{n}$ mesh. In the next section
we will sketch the (relatively minor) modifications required to extend the simulation to meshes of
higher dimensions. For notational convenience we will assume that n is a power of four. Each mesh
node simulates the activities of a distinct PRAM processor. Specifically, a node comprises a processor,
capable of executing a standard repertoire of logical and arithmetic operations, and a local memory
directly accessible to that processor alone. Each processor–memory node is connected to its immediate
neighbors in the mesh by means of bidirectional wires capable of transmitting a single $O(\log m)$-bit
quantity per step.
Recall that the distribution of the PRAM variables (set U) among the memory modules of the mesh
(set V) is governed by a $(\lambda, 2c-1, c, 1/(2c-1))$-generalized expander $G = (U, V, E)$ with $|U| = m$
and $|V| = n$, $\lambda < 1$ ($\lambda$ constant), and $c = \Theta(\log(m/n))$. It is assumed for the moment that each
mesh node holds a copy of a read-only table that encodes the structure of the memory organization. In
other words, the ith entry of the table records the locations of the copies of the ith PRAM variable.
This straightforward representation of the memory map requires $O(m\log(m/n))$ storage per node.
In Section 6, we will show how this table may be represented in a distributed fashion, taking only
$O((m/n)\log^5(m/n))$ storage per node. Note that the total storage of the simulating machine will then
be only a polylogarithmic factor away from the size of the PRAM memory.
We simulate a PRAM program by separately simulating its individual steps. Consider the simulation
of an arbitrary PRAM step, and let $S \subset U$ denote the set of variables accessed in the step. We assume
that each processor accesses a distinct variable (i.e., $|S| = n$), thus restricting our attention to the
simulation of the EREW PRAM. Standard techniques can be employed to extend the simulation to
the other PRAM variants with no performance loss (e.g., see [HB94]). We assume that the address of
the variable referenced by the ith PRAM processor during the step is made known to the corresponding
mesh processor at the start of the simulation. Each mesh processor creates a packet, referred to as a
v-packet, containing (i) the processor’s id; (ii) the name of the PRAM variable it wishes to access and
the type of operation (read/write); (iii) a data field for the value to be read/written; and (iv) a bit vector
of length $2c-1$, whose entries correspond to the copies of the variable and are initially set to zero. Let
$\hat{S}$ denote the set of v-packets created by the mesh processors. The simulation consists of the following
two phases.
Copy selection phase. A c-bundle for S of congestion at most $q = 4c/\lambda$ is selected. The bit vector
of each packet in $\hat{S}$ is set to encode which copies of the corresponding variable are in the c-bundle and
which are not.
Access phase. For each variable in S, a copy of the corresponding v-packet is routed to each mesh
node containing a selected copy of that variable. Each mesh node performs the accesses relative to the
packets it receives and, in the case of reads, sends the accessed values back to the requesting processors.
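The four fields of a v-packet listed above can be pictured as a small record. The sketch below is illustrative only; the class, field, and function names are hypothetical, and the bit vector is modeled as a plain list of $2c-1$ zeros that the copy selection phase would later fill in.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VPacket:
    processor_id: int                 # (i) id of the requesting processor
    variable: int                     # (ii) name of the PRAM variable
    op: str                           # (ii) type of operation: "read" or "write"
    data: Optional[int] = None        # (iii) value to be written, or slot for the value read
    selected: List[int] = field(default_factory=list)   # (iv) one bit per copy


def make_vpacket(processor_id, variable, op, c, data=None):
    """Create a v-packet whose bit vector has 2c-1 entries, all initially zero."""
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    return VPacket(processor_id, variable, op, data, [0] * (2 * c - 1))
```

After copy selection, exactly c entries of `selected` would be set to 1 for each variable, marking the copies to be reached in the access phase.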
In the two subsections that follow we will describe the implementation of these two phases. Most of
our algorithms will involve the movement or manipulation of sets of packets. In turn, such activities are
based on two algorithmic primitives: k-sorting and k-relation routing. Given a set of packets distributed
among the processors of an $n'$-node mesh, so that each node holds at most k packets, k-sorting is used to
rearrange the packets so that node 1 holds the packets with the k smallest keys, node 2 holds the next k
smallest keys, and so on. For any value of k and any numbering of the nodes, this can be accomplished
in $O(k(\log k + \sqrt{n'}))$ time on the two-dimensional mesh, where the term $k\log k$ arises from the need to sort
the groups of k keys locally at each node initially. Note that in our simulation algorithm,
k-sorting (with nonconstant k) is always applied for $k = O(\log(m/n))$ and $n' = \Omega(n/\log(m/n))$.
Therefore, for values of m within the bound stated in Theorem 1, the time for the initial local sorting
will never dominate the overall running time. In other words, in our scenario k-sorting will always
require $O(k\sqrt{n'})$ time.
The k-relation routing problem involves routing a set of packets subject to the constraint that no node
is the source or destination of more than k packets. Again, this can be done in $O(k\sqrt{n'})$ time on an
$n'$-node mesh. Both algorithms can be found in [Kun93].
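The input/output behavior of the two primitives can be stated as a sequential reference model. The sketch below specifies only what the primitives compute, not the mesh algorithms of [Kun93]; nodes are numbered from 0 rather than 1, and the function names are hypothetical.

```python
from collections import Counter


def k_sort(node_packets, k):
    """node_packets[j] lists the keys at node j (at most k each).
    Returns the target layout: node 0 gets the k smallest keys, node 1 the next k, ..."""
    if any(len(keys) > k for keys in node_packets):
        raise ValueError("some node holds more than k packets")
    ordered = sorted(key for keys in node_packets for key in keys)
    return [ordered[j * k:(j + 1) * k] for j in range(len(node_packets))]


def is_k_relation(packets, k):
    """packets is a list of (source, destination) pairs; True if no node is the
    source or the destination of more than k packets."""
    sources = Counter(s for s, _ in packets)
    destinations = Counter(d for _, d in packets)
    return (all(count <= k for count in sources.values())
            and all(count <= k for count in destinations.values()))
```

For example, three nodes holding the keys [5, 1], [4], and [3, 2] end up holding [1, 2], [3, 4], and [5] after 2-sorting.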
3.1. Copy Selection Phase
The copy selection is accomplished by performing the whittlings implied by a q-whittling sequence
$\{(S_i, F_i, W_i) : 1 \le i \le t\}$, for some $t = O(\log n/\log\log(m/n))$, as defined in Section 2. Note that each
whittling could be performed easily by representing the copies of the live variables as individual packets
and then employing sorting, prefix, and routing in order to select the packets corresponding to the copies
to be included in the c-bundle. However, the n variables would initially account for $(2c-1)n$ packets, and
manipulating such a large set of packets can be expensive: for example, just sorting the packets would
require $O(c\sqrt{n})$ time, which is too costly for our purposes. For this reason, the copy selection phase is
broken into two stages. The first stage performs the first whittling step, using more complex but faster
techniques. Since only a relatively small number of variables (at most $n/(2c-1)$) will participate in the
second stage, the remaining whittlings can be implemented using the simple technique described above.
Stage 1. In the first stage, we perform the whittling of $S_1 = S$ with respect to a certain set $W_1$
(specified later) which includes all nodes q-congested relative to $F_1 = E(S)$, with $q = 4c/\lambda$. In what
follows, we will regard the mesh as being partitioned into s disjoint submeshes of size $\sqrt{n/s}\times\sqrt{n/s}$,
which we will call cells. The quantity s, which we assume to be a power of four greater than $\log c$, will
be determined by the analysis.
Recall that $\hat{S}$ denotes the set of v-packets corresponding to the n variables to be accessed. The stage
consists of the following steps.
1. Create a replica $\hat{S}_C$ of $\hat{S}$ in each cell C of the mesh.
2. Independently within each cell C do the following:
(a) Determine the degree of each packet in $\hat{S}_C$ with respect to C, that is, the total number of
copies of the corresponding variable which reside within C. Compute the total degree of all packets in
the cell as the sum of the individual degrees.
(b) If the total cell degree is less than $q(n/s)$, then generate a set of c-packets $\hat{R}_C$ containing
one c-packet for each copy of a variable in S residing within C. Each c-packet contains the name of the
variable and the node in C storing the copy. This node is called the target of the c-packet. Furthermore,
c-packets with the same target are said to be competitors, while c-packets relative to the same variable
are said to be companions. Note that a c-packet may have companions belonging to several cells. Finally,
we call $\hat{R}$ the set obtained as the union over all cells C of the sets $\hat{R}_C$.
(c) Determine how many competitors each c-packet has. A c-packet with q or fewer competitors
is deemed accessible, and inaccessible otherwise.
3. For those variables with c or more accessible c-packets, select c copies and update the bit vector
in the appropriate v-packet in $\hat{S}$ to reflect which copies have been selected. The bit vectors of all other
v-packets remain unchanged, i.e., all their bits remain at zero.
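The logic of Steps 2 and 3 can be sketched sequentially, ignoring how packets move on the mesh. This is a simplified model under stated assumptions: `adj` maps each accessed variable to the nodes holding its copies, `cell_of` maps each node to its cell, `nodes_per_cell` stands for n/s, and a c-packet is counted among the packets competing for its own target, so that a target is accessible exactly when it receives at most q c-packets. All names are hypothetical.

```python
from collections import Counter, defaultdict


def stage1_select(adj, cell_of, nodes_per_cell, c, q):
    """Return {variable: list of c selected nodes} for the variables settled in Stage 1."""
    # Step 2a: total degree of each cell (number of requested copies it stores).
    cell_degree = Counter(cell_of[v] for nodes in adj.values() for v in nodes)
    # Step 2b: c-packets are generated only in cells whose total degree is below q*(n/s).
    c_packets = [(u, v) for u, nodes in adj.items() for v in nodes
                 if cell_degree[cell_of[v]] < q * nodes_per_cell]
    # Step 2c: a c-packet is accessible if at most q c-packets share its target.
    per_target = Counter(v for _, v in c_packets)
    accessible = defaultdict(list)
    for u, v in c_packets:
        if per_target[v] <= q:
            accessible[u].append(v)
    # Step 3: variables with at least c accessible copies get c of them selected.
    return {u: accessible[u][:c] for u in adj if len(accessible[u]) >= c}
```

Variables missing from the returned dictionary correspond to v-packets whose bit vectors stay at zero and which are left for the second stage.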
We can easily see that the first stage executes the whittling of S with respect to a set $W_1$ which
comprises all mesh nodes that store more than q copies of variables in S (i.e., the q-congested nodes),
plus those nodes belonging to cells with total degree at least $q(n/s)$. Therefore, the nodes in $W_1$ account
for at least $q|W_1|$ copies. Since there are $(2c-1)n$ copies of variables, we must have $|W_1| \le (2c-1)n/q$,
as required by the definition of q-whittling sequence.
LEMMA 2. Stage 1 can be implemented on the mesh in time $O(sc + \sqrt{ns} + q\sqrt{n/s})$.
Proof. Step 1 is executed as follows. Partition the $\sqrt{n}\times\sqrt{n}$ mesh into four $\sqrt{n/4}\times\sqrt{n/4}$
submeshes (quadrants), and each of these in turn into four submeshes of size $\sqrt{n/16}\times\sqrt{n/16}$
(subquadrants), and so on until we have tessellated the mesh into submeshes of size $\sqrt{n/s}\times\sqrt{n/s}$, i.e., into
individual cells. The replication of $\hat{S}$ in the individual cells reflects this recursive decomposition of the
mesh. First, the set $\hat{S}$ is replicated in each of the four quadrants. This is achieved by having the four
mesh nodes that occupy the same relative position in the four quadrants send each other a copy of the
packets they hold. At this point, each node holds four packets and each quadrant holds a replica of $\hat{S}$.
Each quadrant independently replicates its copy of $\hat{S}$ into its four constituent subquadrants in the same
manner, and so on. In general, at the beginning of the ith iteration each node of a $\sqrt{n/4^{i-1}}\times\sqrt{n/4^{i-1}}$
submesh holds $4^{i-1}$ packets that it must send to three other nodes within this submesh. This is a simple
routing pattern in which each node is the source and destination of $3\cdot 4^{i-1}$ packets (i.e., an instance
of the $3\cdot 4^{i-1}$-relation routing problem), and so can be completed in $O(4^{i-1}\sqrt{n/4^{i-1}}) = O(\sqrt{n}\,2^{i-1})$
time. Since there are $(1/2)\log s$ iterations in all, we see that the time required to complete Step 1 is
$O(\sum_{i=1}^{(1/2)\log s}\sqrt{n}\,2^{i-1}) = O(\sqrt{ns})$.
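The geometric sum behind the bound for Step 1 can be checked on concrete values. The helper below is a hypothetical illustration that assumes n and s are powers of four and drops the constants hidden by the O-notation: iteration i is charged $\sqrt{n}\,2^{i-1}$, and the total is compared against $\sqrt{ns}$.

```python
from math import isqrt


def replication_cost(n, s):
    """Sum of sqrt(n) * 2**(i-1) over the (1/2) log2(s) replication iterations."""
    iterations = 0
    while 4 ** iterations < s:
        iterations += 1
    if 4 ** iterations != s:
        raise ValueError("s must be a power of four")
    side = isqrt(n)                     # exact when n is a power of four
    return sum(side * 2 ** (i - 1) for i in range(1, iterations + 1))
```

With n = 1024 and s = 64 the three iterations cost 32 + 64 + 128 = 224, which is below $\sqrt{ns} = 256$; in general the sum equals $\sqrt{n}(\sqrt{s} - 1)$.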
Step 2a involves calculating the degree of each variable packet, hence requires O(c) time per packet
and O(sc) time per node, since each node holds s packets. The total cell degree for C can easily be
computed within each cell from the individual degrees in $O(\sqrt{n/s})$ time.
A straightforward implementation of Step 2b, where each node generates the appropriate number of
c-packets for each v-packet it holds, would result in an unbalanced distribution of the c-packets among
the nodes, which would in turn make the execution of subsequent steps more expensive. Therefore, we
resort to the following more careful implementation of this step. Within each cell C, partition $\hat{S}_C$ into
degree classes $\hat{S}_C^{(i)}$ for $0 \le i \le \lceil\log(2c-1)\rceil$, where $\hat{S}_C^{(i)}$ contains v-packets of variables with at least
$2^i$ and at most $2^{i+1}-1$ copies in C. (We ignore variables of degree zero and note that no v-packet
can have degree greater than $2c-1$.) Redistribute the v-packets of each degree class so that each node
of C receives at most $\lceil|\hat{S}_C^{(i)}|s/n\rceil$ v-packets in Class i. This redistribution can be performed by first
determining the destinations of packets in all classes using $\lceil\log(2c-1)\rceil$ prefix operations (one per
degree class) and then invoking a routing step within C in which each node is the source of at most
s packets and the destination of at most
$\sum_{i=1}^{\lceil\log(2c-1)\rceil}\lceil|\hat{S}_C^{(i)}|s/n\rceil = O(s + \log c) = O(s)$ packets. This
requires overall time $O(\sqrt{n/s}\log c + \sqrt{ns}) = O(\sqrt{ns})$.
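The partition into degree classes amounts to grouping variables by the position of the leading bit of their degree. A minimal sketch, with hypothetical names and with the per-cell degrees supplied as a dictionary:

```python
def degree_class(degree):
    """Index i such that 2**i <= degree <= 2**(i+1) - 1 (degree must be positive)."""
    if degree < 1:
        raise ValueError("degree-zero variables belong to no class")
    return degree.bit_length() - 1


def partition_by_degree(degrees):
    """Group variables by degree class, ignoring variables with no copy in the cell."""
    classes = {}
    for variable, degree in degrees.items():
        if degree == 0:
            continue
        classes.setdefault(degree_class(degree), []).append(variable)
    return classes
```

Since all degrees within Class i differ by less than a factor of two, spreading each class evenly over the nodes of the cell also balances the number of c-packets generated per node.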
After the above redistribution of the v-packets, each node examines each v-packet it holds and
generates the appropriate number of c-packets for the corresponding variable. Since each node receives
no more than $\lceil|\hat{S}_C^{(i)}|s/n\rceil$ v-packets from the ith degree class and each such packet may result in up to
$2^{i+1}-1$ c-packets, it follows that each node may hold up to
$\sum_{i=1}^{\lceil\log(2c-1)\rceil}\lceil|\hat{S}_C^{(i)}|s/n\rceil(2^{i+1}-1)$ c-packets.
Now, since
$\sum_{i=1}^{\lceil\log(2c-1)\rceil}|\hat{S}_C^{(i)}|(2^{i+1}-1) \le 2|\hat{R}_C| < 2q(n/s)$ by assumption, we can conclude that each
node generates $O(q + c) = O(q)$ c-packets. Putting it all together, Step 2b can be completed in
$O(\sqrt{ns} + q)$ time.
Turning to Step 2c, we can count the competitors of each packet within each cell by first sorting
the c-packets by target, to group competitors together, and then using parallel prefix to determine
which packets are accessible and which are inaccessible. If a c-packet is deemed accessible, then the
appropriate entry in the bit vector of the corresponding v-packet is set. Since each node initially holds
O(q) packets and the cell has diameter $\sqrt{n/s}$, this step requires $O(q\sqrt{n/s})$ time.
As for Step 3, we gather the accessibility information gained for each variable in different cells
by “coalescing” the replicas of the v-packets created in Step 1. When two replicas of a v-packet are
coalesced, a single v-packet results, whose bit-vector is obtained as the bitwise OR of the two bit-
vectors associated to the replicas. The structure of the gathering process follows the reverse pattern of
the replication process of Step 1, and is executed in the same time. At the end of the process, each
processor checks the bit-vector of its v-packet. If the bit-vector contains at least $c$ 1-bits, then $c$ of
these, chosen arbitrarily, are retained (corresponding to a c-bundle), while the others are reset to 0.
Otherwise, all bits are reset to 0.
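A minimal sketch of the coalescing rule of Step 3, with bit-vectors modelled as Python lists of 0/1. The names `coalesce` and `finalize` are hypothetical, the threshold is read as "$c$ or more set bits" (matching Step 3 of Stage 2), and the arbitrary choice of $c$ bits is taken to be the first $c$.

```python
def coalesce(bits_a, bits_b):
    """Merge two replicas of a v-packet: bitwise OR of their bit-vectors."""
    return [a | b for a, b in zip(bits_a, bits_b)]

def finalize(bits, c):
    """Keep c of the set bits if at least c are set, otherwise clear all bits."""
    ones = [pos for pos, bit in enumerate(bits) if bit]
    if len(ones) < c:
        return [0] * len(bits)
    keep = set(ones[:c])  # arbitrary choice: the first c set positions
    return [1 if pos in keep else 0 for pos in range(len(bits))]
```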
By combining the complexities of all the steps, we conclude that the running time of the first stage
is $O(sc + \sqrt{ns} + q\sqrt{n/s})$. ∎
Following the completion of the first stage the v-packets are back to their originating processors and
encode a c-bundle for all variables except for the set $S_2$ of variables confined to $W_1$. Since $|W_1| \le (2c-1)n/q$, by Lemma 1 it follows that $|S_2| \le n/(2c-1)$. Stage 2 will select a c-bundle for $S_2$ by
performing all the remaining whittlings of the q-whittling sequence.
Stage 2. Since $|S_2| \le n/(2c-1)$, the live variables account for at most $n$ copies, i.e., at most
one per mesh node. As a consequence, we can perform the remaining $t-1$ whittlings using standard
1-sorting, prefix, and 1-routing primitives and without exceeding our target time performance. In fact,
in each whittling step the number of live variables decreases geometrically, and thus we can execute the
whittlings within smaller and smaller submeshes so that the cost of the aforementioned primitives also
decreases geometrically.
Let $S_i$ denote the live variables at the beginning of the $i$th whittling step, and recall that by Lemma 1
we have $|S_i| \le n/(2c-1)^{i-1}$. We define a sequence of submeshes $M_2, M_3, \ldots, M_t$ that are nested one
inside the other, where $M_i$ is a $\sqrt{n/(2c-1)^{i-2}} \times \sqrt{n/(2c-1)^{i-2}}$ submesh. (For concreteness, we will
assume that $M_{i+1}$ occupies the lower left-hand corner of $M_i$.) Note that $M_2$ is the entire mesh. Stage 2
consists of $t-1$ iterations numbered, for convenience, from 2 to $t$. Iteration $i$ performs the $i$th whittling
step in the whittling sequence and is executed entirely within $M_i$. In particular, for $i \ge 2$, at the beginning
of Iteration $i$, the v-packets corresponding to the live variables (set $S_i$) are distributed among the nodes of
$M_i$, with at most one v-packet per node. At the end of the iteration, the v-packets corresponding to dead
variables in $S - S_{i+1}$ are evenly distributed among the nodes outside $M_{i+1}$ (for notational convenience,
we assume that $M_{t+1}$ is an "empty" mesh). Iteration $i$ is implemented as follows.
1. For each variable in $S_i$ create $2c-1$ c-packets and distribute the c-packets among the nodes
of $M_i$, assigning at most one packet to each node.
2. For each c-packet, determine how many competitors it has. A c-packet with at most $q$ competitors
is said to be accessible and is said to be inaccessible otherwise.
3. For each variable $x$ in $S_i$ determine how many of the associated c-packets are accessible. If $c$
or more are accessible, set the bit-vector positions in $x$'s v-packet corresponding to the first $c$ accessible
copies of that variable; otherwise reset all bits in the bit-vector for $x$ to zero. In the former case $x$
becomes dead while in the latter $x$ is alive and will belong to $S_{i+1}$.
4. Delete all c-packets. Route the v-packets corresponding to the dead variables in $S_i$ to distinct
nodes of $M_i - M_{i+1}$, while those corresponding to variables that remain alive are routed to distinct
nodes of $M_{i+1}$.
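The logic of one iteration (Steps 1 to 3) can be sketched sequentially as below, ignoring all routing. The function `whittle_step` and its data layout are hypothetical; the sketch assumes that the competitors of a c-packet are the other c-packets of live variables aimed at the same node.

```python
from collections import Counter

def whittle_step(copies, live, q, c):
    """One whittling iteration, without the mesh routing.

    copies[x] lists the nodes holding the 2c-1 copies of variable x.
    Returns (selected, still_live): selected[x] holds the positions of the
    first c accessible copies of each variable that dies in this step.
    """
    # Step 1: one c-packet per copy of every live variable.
    load = Counter(node for x in live for node in copies[x])
    selected, still_live = {}, []
    for x in live:
        # Step 2: a packet is accessible if it has at most q competitors.
        accessible = [pos for pos, node in enumerate(copies[x])
                      if load[node] - 1 <= q]
        # Step 3: c or more accessible copies make the variable dead.
        if len(accessible) >= c:
            selected[x] = accessible[:c]
        else:
            still_live.append(x)
    return selected, still_live
```

Calling the function repeatedly on `still_live` mimics the sequence of whittlings; Step 4 (moving v-packets into or out of $M_{i+1}$) has no counterpart in this sequential sketch.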
It is easy to see that the above steps perform the whittling of $S_i$ with respect to $W_i$. Note that at most
$|S_i| \le n/(2c-1)^{i-1}$ variables die during Iteration $i$ and that at most $|S_{i+1}| \le n/(2c-1)^i$ variables remain alive
at its conclusion. Since $M_i$ contains $n/(2c-1)^{i-2}$ nodes and $M_{i+1}$ contains $n/(2c-1)^{i-1}$ nodes, if
$c \ge 3$ there are always enough nodes in $M_i - M_{i+1}$ and in $M_{i+1}$ to implement the last step of Iteration $i$.
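LEMMA 3. Stage 2 can be implemented on the mesh in time $O(\sqrt{n})$.
Proof. Iteration $i$ requires a constant number of prefix, 1-sorting, and 1-relation routing operations on $M_i$,
which all require time $O(\sqrt{n/(2c-1)^{i-2}})$. Summing over the iterations gives a geometric series,
$\sum_{i=2}^{t} O(\sqrt{n/(2c-1)^{i-2}}) = O(\sqrt{n})$. ∎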
Once the sequence of whittlings has been completed, the information on which copies have been
selected is encoded in the bit-vectors of the various v-packets that lie scattered among the nodes of the
mesh, with at most 2 packets per node (at most one from Stage 1 and at most one from Stage 2). Finally,
the v-packets are sent back to their originating processors in $O(\sqrt{n})$ time.
Combining the contributions of the two stages, we obtain the overall running time for the copy
selection phase.
THEOREM 7. A c-bundle of congestion $O(\log(m/n))$ for an arbitrary set of $n$ variables can be selected
in time $O(\sqrt{n \log(m/n)})$ on the mesh.
Proof. Choose $s = q = 4c/\lambda = \Theta(\log(m/n))$ and note that for $m \le 2^{\Theta(n^{1/3})}$, as required in Theorem
1, $sc = O(\sqrt{n \log(m/n)})$. Then, the theorem follows by adding up the complexities of Stage 1 and
Stage 2 given in Lemmas 2 and 3, respectively. ∎
3.2. Access Phase
After copy selection, the bit-vectors of the v-packets in $\hat{S}$ encode a c-bundle of congestion at most $q$
for the set of variables to be accessed. The actual access is performed using a protocol similar to Stage 1
of copy selection.
1. Let $s = q = 4c/\lambda = \Theta(\log(m/n))$. Create a replica $\hat{S}_C$ of $\hat{S}$ in each cell $C$ of $n/s$ nodes of the
mesh.
2. Independently within each cell $C$ do the following:
(a) Generate a set $\hat{R}_C$ of c-packets containing one c-packet for each selected copy of a variable in
$S$ residing within $C$. Now, each c-packet contains the identity of the node in $C$ storing the copy, plus
all the fields of the corresponding v-packet, with the exception of the bit-vector.
(b) Route each c-packet in $\hat{R}_C$ to its target within $C$.
(c) Each node in C performs the memory accesses associated with the received c-packets. In case
of writes, the value of the copy is set to the one carried by the c-packet and time-stamped with the current
PRAM step. In case of reads, the data field of the c-packet is loaded with the value and time-stamp of
the referenced copy.
(d) Route c-packets carrying read requests back to the nodes where they were generated. For each
read request, the data field of the corresponding v-packet is loaded with the data field of the c-packet
carrying the most recently updated time-stamp.
3. Complete the read accesses by carrying back to each originating node the (replica of the) v-packet
carrying the most recent time-stamp.
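The time-stamping rule of Steps 2c to 3 can be illustrated with a small sequential sketch. The dictionary `memory`, the function names, and the convention that an unwritten copy carries time-stamp -1 are assumptions made for the example only.

```python
def write_copies(memory, var, selected_nodes, value, step):
    """Step 2c for a write: store value and time-stamp in each selected copy."""
    for node in selected_nodes:
        memory[(var, node)] = (value, step)

def read_copies(memory, var, selected_nodes):
    """Steps 2c to 3 for a read: return the value with the latest time-stamp."""
    # Assumption: a copy that was never written has time-stamp -1.
    found = [memory.get((var, node), (None, -1)) for node in selected_nodes]
    value, _stamp = max(found, key=lambda pair: pair[1])
    return value
```

With $2c-1$ copies per variable, any two sets of $c$ copies share at least one copy, so a read that reaches $c$ copies always sees the most recent write that also reached $c$ copies.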
THEOREM 8. The memory accesses relative to the c-bundle determined by the copy selection phase
can be performed in time $O(\sqrt{n \log(m/n)})$ on the mesh.
Proof. Steps 1, 2a, and 3 of the access phase have, respectively, the same structure as Steps 1,
2b, and 3 of the first stage of copy selection, hence they can be completed in $O(\sqrt{n \log(m/n)})$ time
using the same ideas. Both Steps 2b and 2d involve a $k$-relation routing (with $k = s + q$), hence they
require $O((s+q)\sqrt{n/s}) = O(\sqrt{n \log(m/n)})$ time altogether. Finally Step 2c can be completed in
$O(q) = O(\log(m/n))$ time by individual nodes. The theorem follows by combining the complexities
of the individual steps. ∎
By combining Theorem 7 with Theorem 8 we establish that any step of an $(n, m)$-PRAM can be
simulated with slowdown $O(\sqrt{n \log(m/n)})$ on a $\sqrt{n} \times \sqrt{n}$ mesh, using $O(\log(m/n))$ copies per
variable. Section 6 shows how the read-only tables encoding the memory map held by each node may
be replaced by a space-efficient, distributed representation using only $O((m/n) \log^5(m/n))$ storage per
node, thus completing the proof of Theorem 1 as it relates to two-dimensional meshes.
4. HIGHER-DIMENSIONAL MESHES
The overall structure of the simulation scheme for the two-dimensional mesh described in
Section 3 also applies to higher-dimensional meshes. Specifically, we adopt the same memory organization
and pick the same values for the parameters $s$ and $q$. In what follows, we sketch how to implement
the individual steps of the copy selection and access phases efficiently on an $n^{1/d} \times n^{1/d} \times \cdots \times n^{1/d}$
$d$-dimensional mesh (called, for short, an $n$-node $d$-mesh) with constant $d$. As for the case of the
two-dimensional mesh, we note that for $m \le 2^{\Theta(n^{1/(d+1)})}$, $k = O(\log(m/n))$, and $n' = \Omega(n/\log(m/n))$, the
$k$-sorting problem and the $k$-relation routing problem may both be solved in time $O(k(n')^{1/d})$ on an
$n'$-node $d$-mesh [Kun93].
Let us first consider the first stage of copy selection and interpret a cell $C$ as an $(n/s)$-node $d$-submesh.
Step 1 prescribes that the set $\hat{S}$ be replicated in each cell. We adopt the same recursive approach
developed for $d = 2$ and perform the replication in substeps. At the $i$th substep, each $(n/2^{d(i-1)})$-node
$d$-submesh replicates its copy of $\hat{S}$ in each of its $2^d$ component $(n/2^{di})$-node $d$-submeshes. Such
replication involves an instance of the $2^{di}$-relation routing problem on an $n/2^{d(i-1)}$-node $d$-mesh and
can therefore be completed in $O(2^{(d-1)i} n^{1/d})$ time. Hence, the total time required by Step 1 is
$$O\Big(\sum_{i=1}^{(\log s)/d} 2^{(d-1)i}\, n^{1/d}\Big) = O\big(n^{1/d} s^{1-1/d}\big).$$
Step 3 is accomplished in the same time by performing the routings of Step 1 in the reverse order. Let
us finally consider Step 2. In this step, the nodes of each cell $C$ first determine whether the cell degree is
less than $q(n/s)$ (Step 2a). If this is the case, a balanced collection of c-packets relative to the copies of
variables in $S$ residing in $C$ is generated (Step 2b) and finally checked for accessibility (Step 2c). The
operations involved are $\lceil \log(2c-1) \rceil$ prefix computations as well as a constant number of $s$-sorting,
$s$-relation routing, and $q$-sorting operations, plus an additional $O(q + s)$ work per node. By plugging
in the chosen values for $s$ and $q$ we see that Step 2 of copy selection can be completed in time
$$O\big(sc + n^{1/d} s^{1-1/d} + q(n/s)^{1/d}\big) = O\big(n^{1/d} (\log(m/n))^{1-1/d}\big). \qquad (3)$$
Since this subsumes the cost of Steps 1 and 3, this expression also captures the cost of the entire first stage
of copy selection. In the second stage, we perform the $i$th whittling entirely within an $n/(2c-1)^{i-2}$-node
$d$-submesh. As in the two-dimensional case, this whittling can be easily implemented by means
of 1-sorting and parallel prefix. Since the size of the submeshes is geometrically decreasing, the overall
running time is dominated by the time for $i = 2$, which is $O(n^{1/d})$. Hence, the entire copy selection
algorithm can be completed within the time given by Equation (3).
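As a check on Equation (3), the three terms can be evaluated for the chosen parameters. The derivation below is a sketch that writes $L$ for $\log(m/n)$ (a shorthand introduced here) and uses $s = q = \Theta(L)$ and $c = \Theta(L)$; its last line shows where the restriction on $m$ enters.

```latex
\begin{align*}
L &= \log(m/n), \qquad s = q = \Theta(L), \qquad c = \Theta(L),\\
sc &= \Theta(L^2),\\
n^{1/d} s^{1-1/d} &= \Theta\big(n^{1/d} L^{1-1/d}\big),\\
q\,(n/s)^{1/d} &= \Theta\big(L \cdot n^{1/d} L^{-1/d}\big) = \Theta\big(n^{1/d} L^{1-1/d}\big),\\
L^2 \le n^{1/d} L^{1-1/d} &\iff L^{1+1/d} \le n^{1/d} \iff L \le n^{1/(d+1)}.
\end{align*}
```

Thus the $sc$ term is absorbed, up to constant factors, when $\log(m/n) = O(n^{1/(d+1)})$, which is the regime $m \le 2^{\Theta(n^{1/(d+1)})}$; for $d = 2$ this gives the $n^{1/3}$ exponent used in the proof of Theorem 7.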
Finally, recall that Steps 1, 2a, and 3 of the access phase mirror Steps 1, 2b, and 3 of copy selection,
hence they require time $O(n^{1/d} (\log(m/n))^{1-1/d})$ altogether. The remaining steps can be realized as two
instances of an $(s+q)$-relation routing in each cell plus $O(q)$ work per node. Therefore, the total time
required by the access phase is again $O(n^{1/d} (\log(m/n))^{1-1/d})$.
The above discussion establishes that any step of an $(n, m)$-PRAM with $m \le 2^{\Theta(n^{1/(d+1)})}$ can be
simulated with slowdown $O(n^{1/d} (\log(m/n))^{1-1/d})$ on a $d$-dimensional mesh (with constant $d$) using
$O(\log(m/n))$ copies per variable. In Section 6 we will show how the space requirements per node may
be reduced to $O((m/n) \log^5(m/n))$. This will complete the proof of Theorem 1.
5. THE PRUNED BUTTERFLY
An n-leaf fat-tree is a routing network whose coarse structure resembles that of an n-leaf binary
tree. More specifically, leaves correspond to processing elements, non-leaf nodes represent clusters
of routing switches, and edges represent communication channels of bandwidth that increases from
the leaves to the root. The first architecture of this kind was proposed by Leiserson [Lei85] and was
later followed by a number of related networks differing in the detail of how components at different
levels of the tree are interconnected. In this paper we adopt the pruned butterfly fat-tree of Bay and
Bilardi [BB95], an example of which is illustrated in Fig. 1. Each dotted ellipse in the figure identifies
a cluster of routing switches that collectively correspond to a single node in the binary tree. The bundle
of edges joining switches of a cluster to switches of its parent cluster constitutes a channel whose
bandwidth equals the cardinality of the bundle. The depth of a cluster is the distance of its component
switches from the root. At any given depth, the clusters are numbered from left to right, beginning
at zero. Individual switches within each cluster are also numbered from left to right, beginning at
zero.
Assume that $n$ is a power of four. Formally, an $n$-leaf pruned butterfly is a graph $G = (V, E)$ whose
vertices are indexed as follows:
$$V = \big\{\langle i, j, k\rangle : 0 \le i \le \log n,\ 0 \le j < 2^i,\ 0 \le k < \sqrt{n}\, 2^{-\lfloor i/2 \rfloor}\big\}.$$
With the above indexing, the $j$th processor–memory node from left to right corresponds to vertex
$\langle \log n, j, 0\rangle$. For $0 \le i < \log n$, vertex $\langle i, j, k\rangle$ corresponds to the $k$th switch of the $j$th cluster at depth
$i$. The edge set is
$$E = \bigcup_{i=1}^{\log n}\ \bigcup_{j=0}^{2^i-1} E_{ij},$$
where $E_{ij}$ contains the edges connecting switches in the $j$th cluster at depth $i$ to those in its parent
cluster. When $i$ is even, we let
$$E_{ij} = \big\{(\langle i, j, k\rangle, \langle i-1, \lfloor j/2 \rfloor, k\rangle),\ (\langle i, j, k\rangle, \langle i-1, \lfloor j/2 \rfloor, k + \sqrt{n}\, 2^{-i/2}\rangle) : 0 \le k < \sqrt{n}\, 2^{-i/2}\big\}.$$
When $i$ is odd, we let
$$E_{ij} = \big\{(\langle i, j, k\rangle, \langle i-1, \lfloor j/2 \rfloor, k\rangle) : 0 \le k < \sqrt{n}\, 2^{-\lfloor i/2 \rfloor}\big\}.$$
Note that $|E_{ij}| = \sqrt{n}\, 2^{1-\lceil i/2 \rceil}$; therefore channel bandwidths double every other level from the leaves
to the top, ranging from 2 to $\sqrt{n}$.
FIG. 1. A pruned butterfly with 16 leaves.
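The indexing above can be turned into a small constructor that is convenient for checking the channel bandwidths on concrete instances. This is a sketch under the definitions just given; the function name `pruned_butterfly` and the dictionary layout are choices made for the example.

```python
import math

def pruned_butterfly(n):
    """Build the vertex set and the channels E_ij of an n-leaf pruned butterfly.

    Vertices are triples (i, j, k): depth, cluster index, switch index.
    channels[(i, j)] lists the edges from cluster j at depth i to its parent.
    """
    log_n = n.bit_length() - 1
    assert 4 ** (log_n // 2) == n, "n must be a power of four"
    root = math.isqrt(n)

    def width(i):
        # Number of switches per cluster at depth i: sqrt(n) * 2^(-floor(i/2)).
        return root >> (i // 2)

    vertices = {(i, j, k)
                for i in range(log_n + 1)
                for j in range(2 ** i)
                for k in range(width(i))}
    channels = {}
    for i in range(1, log_n + 1):
        for j in range(2 ** i):
            edges = []
            for k in range(width(i)):
                edges.append(((i, j, k), (i - 1, j // 2, k)))
                if i % 2 == 0:
                    # Even depth: a second edge to the shifted parent switch.
                    edges.append(((i, j, k), (i - 1, j // 2, k + width(i))))
            channels[(i, j)] = edges
    return vertices, channels
```

For $n = 16$ the construction yields 16 leaves and channel bandwidths 4, 4, 2, 2 at depths 1 through 4, in agreement with the formula for $|E_{ij}|$.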
G is interpreted as an n-node machine by identifying the n processor–memory nodes with the n
leaves, a routing switch with each internal vertex, and a wire capable of transmitting a single packet
along its length in unit time with each edge. Moreover we assume that each routing switch is provided
with an adder, so that parallel prefix computations may be completed efficiently. Intuitively, to route a
message from leaf i to leaf j , the message is routed upwards in the tree to the cluster that is the least
common ancestor of i and j and thence downwards to its destination.
This architecture has a number of interesting properties. For example, it is a subgraph of the butterfly
network and it embeds the n-leaf mesh of trees architecture with constant dilation and load. Furthermore,
the original bit-serial formulation of the pruned butterfly presented by Bay and Bilardi is area-universal:
the $n$-leaf pruned butterfly can be laid out in $O(n \log^2 n)$ area and can route any set of messages almost
as efficiently as any circuit of similar area. (See [BB95] and the references contained therein for a fuller
discussion of area-universality and the capabilities and properties of the pruned butterfly.) An important
routing property of the $n$-leaf pruned butterfly is the following. Consider a collection of $k \le \sqrt{n}$ packets,
stored one per node among the leaves, with the $i$th packet residing at leaf $\langle \log n, s_i, 0\rangle$ and destined to
leaf $\langle \log n, d_i, 0\rangle$. We refer to the collection as a wave if $s_1 < s_2 < \cdots < s_k$ and $d_1 < d_2 < \cdots < d_k$.
In [BB95] it is shown that any wave can be easily routed in $O(\log n)$ time. Moreover, a sequence of $t$
waves may be routed in a pipelined fashion in time $O(t + \log n)$.
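The wave condition is easy to state as a predicate. In this sketch a packet is a hypothetical `(source_leaf, destination_leaf)` pair, and the packets are indexed by increasing source before the destinations are compared.

```python
import math

def is_wave(packets, n):
    """Check the wave condition for packets given as (source, destination) pairs."""
    if len(packets) > math.isqrt(n):
        return False  # a wave has at most sqrt(n) packets
    ordered = sorted(packets)  # index the packets by increasing source leaf
    sources = [s for s, _ in ordered]
    dests = [d for _, d in ordered]

    def strictly_increasing(seq):
        return all(a < b for a, b in zip(seq, seq[1:]))

    return strictly_increasing(sources) and strictly_increasing(dests)
```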
5.1. Sorting and Routing on the Pruned Butterfly
In this subsection, we develop algorithms for $k$-sorting and $k$-relation routing on the pruned butterfly,
which will be needed for the shared-memory simulation.
LEMMA 4. Any instance of $k$-sorting can be performed in $O(k(\log k + \sqrt{n}))$ time on the $n$-leaf
pruned butterfly.
Proof. We will consider the input packets sorted when the $k$ packets with the smallest keys are in
the 0th leaf, the next $k$ smallest occupy the 1st leaf, and so on. Our sorting strategy is based on an
adaptation of Batcher's bitonic sorting algorithm to handle the case where $k = 1$. The generalization to
larger values of $k$ is standard and can be obtained by first sorting the sequence of input packets at each
node (possibly padded with extra dummy packets with key $= \infty$ to bring its length to $k$) in $O(k \log k)$
time and then replacing each constant-time compare–exchange operation in the algorithm for $k = 1$
with an $O(k)$-time merge–split operation [Knu73].
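A merge–split operation takes two sorted lists of $k$ keys and leaves the $k$ smallest at one node and the $k$ largest at the other. A minimal sketch (the function name `merge_split` is hypothetical) using a linear-time merge:

```python
import heapq

def merge_split(low, high):
    """Merge two sorted k-lists and split the result into lower and upper halves."""
    k = len(low)
    merged = list(heapq.merge(low, high))  # linear-time merge of sorted inputs
    return merged[:k], merged[k:]
```

Replacing every compare–exchange of a sorting network for single keys by this operation sorts blocks of $k$ keys per node, which is the generalization referred to in the proof.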
When $k = 1$, let $x_0, x_1, \ldots, x_{n-1}$ be the $n$-tuple of variables that we wish to sort, with $x_i$ residing at
the $i$th leaf. The bitonic sorting algorithm is structured as a sequence of $\log n$ merging phases. During
the $i$th phase, for $1 \le i \le \log n$, distinct pairs of sorted subsequences of length $2^{i-1}$ are merged into
subsequences of length $2^i$. In turn, the $i$th phase is made of $(i, j)$-stages, for $j = i-1, i-2, \ldots, 1, 0$.
An $(i, j)$-stage executes as follows:
    for all $0 \le k \le n-1$ do in parallel
        if $[k]_j = 0$ then Oper$(k, k + 2^{j-1}, i, j)$.
In the above code, $[k]_j$ denotes the $j$th bit in the binary representation of $k$, while Oper$(k, k + 2^{j-1}, i, j)$
denotes a compare–exchange operation applied to the variables $x_k$ and $x_{k+2^{j-1}}$. The "orientation"
of the exchange depends on $i$ and $j$. Note that the leaves containing these two variables fall within
the same subtree of $2^j$ leaves. Thus, any $(i, j)$-stage can be performed within such subtrees. We now
describe the implementation of an $(i, j)$-stage for the subtree with leaves $0, \ldots, 2^j - 1$ and note that
the same algorithm can be executed simultaneously within each $2^j$-leaf subtree.
1. Transfer the values of $x_0, x_1, \ldots, x_{2^{j-1}-1}$ (residing at distinct leaves of the left subtree) to the right
subtree, so that leaf $k + 2^{j-1}$ holds both $x_k$ and $x_{k+2^{j-1}}$, for $0 \le k \le 2^{j-1} - 1$.
2. Perform Oper$(k, k + 2^{j-1}, i, j)$ at leaf $k + 2^{j-1}$, for $0 \le k \le 2^{j-1} - 1$.
3. Transfer the updated values of $x_0, x_1, \ldots, x_{2^{j-1}-1}$ back into the leaves of the left subtree.
Clearly, Step 2 can be completed in $O(1)$ time, since Oper involves a simple compare–exchange
operation. Steps 1 and 3 have a similar structure and involve the routing of the $2^{j-1}$ values stored at the
leaves of the left (resp., right) $2^{j-1}$-leaf subtree to the leaves of its sibling subtree. We may decompose
this routing into $\lceil 2^{j-1}/2^{\lfloor j/2 \rfloor} \rceil = O(\sqrt{2^j})$ waves which may then be routed in a pipelined fashion in
$O(\sqrt{2^j})$ time. Hence, any $(i, j)$-stage can be completed in $O(\sqrt{2^j})$ time.
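For reference, the phase and stage structure of bitonic sorting for $k = 1$ can be sketched sequentially as below. The indexing is the sketch's own convention: in stage $(i, j)$ position $k$ is paired with position $k + 2^j$ whenever bit $j$ of $k$ is 0, and the orientation is read from bit $i$ of $k$. The inner loop over $k$ stands for the parallel step, and no routing on the pruned butterfly is modelled.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with Batcher's bitonic network."""
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    log_n = n.bit_length() - 1
    for i in range(1, log_n + 1):        # phase i builds sorted runs of length 2^i
        for j in range(i - 1, -1, -1):   # stage (i, j) pairs elements 2^j apart
            for k in range(n):           # conceptually executed in parallel
                if (k >> j) & 1 == 0:
                    partner = k + (1 << j)
                    ascending = ((k >> i) & 1) == 0
                    # Compare-exchange with the orientation fixed by bit i of k.
                    if (a[k] > a[partner]) == ascending:
                        a[k], a[partner] = a[partner], a[k]
    return a
```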
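Now, recall that the $i$th merging phase of the sorting algorithm consists of a sequence of $(i, j)$-stages,
each taking $O(\sqrt{2^j})$ time, and that there are $\log n$ phases. Since the stage costs within a phase form a
geometric series, the $i$th phase takes $O(\sqrt{2^i})$ time, and the total running time for $k = 1$ is
$$\sum_{i=1}^{\log n} O\big(\sqrt{2^i}\big) = O(\sqrt{n}).$$ ∎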
Recall that a k-relation routing problem involves routing a set of packets from source to destination
subject to the constraint that no node is the source or destination of more than k packets. We have:
LEMMA 5. Any instance of the $k$-relation routing problem may be routed in $O(k(\log k + \sqrt{n}))$ time
on the $n$-leaf pruned butterfly.
Proof. Let $\hat{S}$ denote the set of packets to be routed. The routing algorithm is made of the following
steps:
1. Sort the packets in $\hat{S}$ by their destination among the $n$ leaves of the pruned butterfly.
2. For $0 \le i < k\sqrt{n}$, route the packets whose rank in the sorted sequence is equal to $i \bmod k\sqrt{n}$.
By Lemma 4, Step 1 above requires time $O(k(\log k + \sqrt{n}))$. As for Step 2, it is easy to see that each of
the $k\sqrt{n}$ routings is a wave. Therefore, all iterations can be pipelined and completed in time $O(k\sqrt{n})$.
Thus, the entire routing algorithm can be completed in $O(k(\log k + \sqrt{n}))$ time. ∎
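The grouping used in Step 2 can be sketched as follows. The function `wave_classes` is hypothetical; it assumes that after Step 1 the packet of rank $r$ sits at leaf $\lfloor r/k \rfloor$, as in the sorted layout of Lemma 4, and it returns, for each residue class, the (leaf, destination) pairs routed together.

```python
import math

def wave_classes(dests_per_leaf, k):
    """Group packets into the k*sqrt(n) classes routed in Step 2.

    dests_per_leaf[v] lists the destinations of the packets starting at leaf v.
    """
    n = len(dests_per_leaf)
    period = k * math.isqrt(n)
    # Step 1: sort all packets by destination (here: a plain sequential sort).
    ranked = sorted(d for leaf in dests_per_leaf for d in leaf)
    classes = [[] for _ in range(period)]
    for rank, dest in enumerate(ranked):
        # After sorting, the packet of rank r is held by leaf r // k.
        classes[rank % period].append((rank // k, dest))
    return classes
```

Within a class, consecutive ranks differ by $k\sqrt{n}$, so the holding leaves are strictly increasing; since no leaf is the destination of more than $k$ packets, the destinations are strictly increasing as well, which is the wave condition.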
5.2. The Simulation Algorithm
A closer look at the simulation algorithm devised for the two-dimensional mesh reveals that both
the copy selection and access phases are implemented in terms of k-sorting, k-relation routing, prefix
computations and rely upon a recursive decomposition of the network into subnetworks of the same
topology. Note that the pruned butterfly exhibits such a decomposition. Specifically, for $n$ and $s \le n$
arbitrary powers of two, an $n$-leaf pruned butterfly can be decomposed into $s$ $(n/s)$-leaf pruned butterflies.
Since the complexity of routing and sorting are asymptotically the same for the pruned butterfly and the
mesh, we conclude that any step of an $(n, m)$-PRAM can be simulated with slowdown $O(\sqrt{n \log(m/n)})$
on an $n$-leaf pruned butterfly using $O(\log(m/n))$ copies per variable. The techniques of the next section
show how the space requirement per node may be limited to $O((m/n)(\log(m/n))^5)$, thus completing the
proof of Theorem 2.
6. SPACE-EFFICIENT SIMULATIONS
The simulations presented in the paper are based on a memory organization whose structure is
modelled by a bipartite graph $G = (U, V, E)$ with $|U| = m$, $|V| = n$, and where every vertex in
$U$ has degree $2c - 1 = \Theta(\log(m/n))$. This graph may be represented by means of a read-only table
$T_G = [t_1, t_2, \ldots, t_m]$ consisting of $m$ entries, where the $i$th entry $t_i = (t_i(1), t_i(2), \ldots, t_i(2c-1))$
contains the addresses of the copies of the $i$th variable. We call each such address an item. In this
section, we show that such a table may be represented in a distributed fashion among the nodes of the
simulating network, so that the maximum number of items stored per node is $O((m/n) \log^5(m/n))$ and
that any $N$-tuple (with $N \ge n$) of entries (corresponding to the variables to be accessed in the PRAM
step) may be read in time proportional to the slowdown of the simulation step. (The need to access
an $N$-tuple rather than an $n$-tuple will be discussed later.) We sketch the required techniques for the
two-dimensional mesh, which are akin to those presented in [Her96], though somewhat simpler. The
result extends immediately to the other interconnections considered in the paper.
Let $n' \le n$ be a parameter to be fixed later. Partition the $\sqrt{n} \times \sqrt{n}$ mesh into $n/n'$ tiles of size
$\sqrt{n'} \times \sqrt{n'}$. Each tile will contain a complete copy of $T_G$ distributed as follows. Partition $T_G$ into $m/b$
pages of $b$ entries each, and partition each tile into $n'/b$ blocks each of size $\sqrt{b} \times \sqrt{b}$. Within each
tile, replicate and distribute the $m/b$ pages among the $n'/b$ blocks that make up that tile according
to a smooth $(\lambda, 2c'-1, c', 1/(2c'-1))$-generalized expander $H = (U_H, V_H, E_H)$ such that $|U_H| = m/b$,
$|V_H| = n'/b$, $c' = 2c - 1$ and where parameters $\lambda < 1$, $c = \Theta(\log(m/n))$ are as defined in
Section 2. The maximum number of pages mapped to any individual block is $O((m/n')c')$, which
amounts to $O((m/n')bc^2)$ items in all. The items mapped to a particular block are distributed evenly
among the nodes of the block, with $O((m/n')c^2)$ items per node. Within each node, the individual items
are held in a static dictionary in order to facilitate retrieval.
Note that there are a total of $(n/n')(2c'-1)$ copies of each entry and that to read an entry it suffices
to read any one copy. Note also that the structure of the graph $H$ can be represented by means of a
read-only table $T_H$ of $m/b$ entries. This latter table is replicated and represented in every block in the
network, with each node of each block holding $m/b^2$ entries of $T_H$.
To read an $N$-tuple of entries of $T_G$, each tile deals locally with the reads relating to its own nodes,
independently of other tiles, by executing the following steps. (It is assumed that each node handles
$N/n$ entries.)
1. Generate a set $S$ containing $2c'-1$ numbered request packets $r_1(x), r_2(x), \ldots, r_{2c'-1}(x)$ for each
referenced entry $x$. Packet $r_i(x)$ bears the name of the processor that generated it, the entry to which it
refers, and the name of the block that contains the $i$th copy of that entry within the tile in question.
2. Select a subset $S'$ of the packets that contains $c'$ packets $r'_1(x), \ldots, r'_{c'}(x)$ per referenced entry such
that the number of selected packets relating to any individual block is $O((N/n)bc')$.
3. Route each packet in $S'$ to the appropriate block, ensuring that the number of packets routed to
any individual node is $O((N/n)c')$.
4. Within each block, circulate the packets around a Hamiltonian cycle. (For the pruned butterfly,
use an Eulerian cycle.) As a packet, say $r'_i(x)$, visits a node, check whether that node contains a copy of
entry $t_x$. If so, load a copy of item $t_x(i)$ into the packet.
5. Route each packet back to the node that generated it.
Note that the $c' = 2c - 1$ selected packets relating to entry $x$ are ultimately returned to the node that
generated them, each bearing the value of a distinct item of that entry.
In order to discover the locations of the various copies of the entries, which are needed to generate
packets during Step 1, the nodes need to query $T_H$. Since each block maintains a private copy of this
table and each block generates $(N/n)b(2c'-1)$ request packets, this operation can be accomplished in
the same fashion as that outlined for Step 4 and has the same $O((N/n)bc')$ running time. Steps 3 and
5 involve $((N/n)c')$-relation routing within an $n'$-node tile so these contribute $O((N/n)c'\sqrt{n'})$ to the
running time.
As for Step 2, note that for each page of $T_G$ the number of entries referenced may be up to $b$, the page
size. For a particular tile, let $P_i$ denote the set of pages where the number of referenced entries lies in the
interval $[2^i, 2^{i+1})$. Clearly, $\sum_{i=0}^{\log b} 2^i |P_i| \le 2(N/n)n'$. Since $H$ is a generalized expander, it is possible
to construct a $c'$-bundle for the pages in each $P_i$ that has degree $O((|P_i|/(n'/b))c')$. Each edge in such
a bundle corresponds to at most $2^{i+1}$ request packets, and so the total number of selected packets over
all the $P_i$ is at most $\sum_{i=0}^{\log b} 2^{i+1}(|P_i|/(n'/b))c' = O((N/n)bc')$. The algorithmic techniques required to
perform the selection include straightforward combinations of sorting and parallel prefix akin to those
employed during the second stage of the copy selection process of Section 3, and this step also has a
running time of $O((N/n)c'\sqrt{n'})$.
Thus, the overall running time is $O((N/n)(b + \sqrt{n'})c')$. Recall that in our intended application,
namely the reading of the addresses of variable copies during copy selection and the access phases
of the algorithm of Section 3, each mesh node generates $O(s) = \Theta(\log(m/n))$ such lookup requests.
Thus, $N = O(sn)$, so by choosing $b = \sqrt{n'} = \sqrt{n/\log^3(m/n)}$ and $c' = O(\log(m/n))$, this running
time simplifies to $O(\sqrt{n \log(m/n)})$. Moreover, the distributed representation of the memory map $T_G$
requires $O((m/n')c'(2c-1)) = O((m/n) \log^5(m/n))$ storage per node, while the representation of $T_H$
requires an additional amount of $O((m/b^2)c') = O((m/n) \log^4(m/n))$ storage per node. Hence, the
total storage requirement per node is $O((m/n) \log^5(m/n))$.
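The parameter choice can be verified by direct substitution. The following sketch writes $L$ for $\log(m/n)$ (a shorthand introduced here) and uses $N = O(nL)$, $c' = \Theta(L)$, $2c - 1 = \Theta(L)$, and $b = \sqrt{n'}$ with $n' = n/L^3$.

```latex
\begin{align*}
\frac{N}{n}\,\big(b + \sqrt{n'}\big)\,c'
  &= O\big(L \cdot \sqrt{n/L^{3}} \cdot L\big) = O\big(\sqrt{nL}\big),\\
\frac{m}{n'}\,c'\,(2c-1)
  &= O\Big(\frac{m L^{3}}{n} \cdot L \cdot L\Big) = O\Big(\frac{m}{n}\,L^{5}\Big),\\
\frac{m}{b^{2}}\,c'
  &= O\Big(\frac{m L^{3}}{n} \cdot L\Big) = O\Big(\frac{m}{n}\,L^{4}\Big).
\end{align*}
```

The first line matches the slowdown of the simulation itself, and the storage for $T_G$ dominates that for $T_H$ by a factor of $L$.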
7. LOWER BOUND
In this section, we prove a lower bound on the worst-case slowdown incurred when simulating a
PRAM step on a processor network. Unlike previous approaches [AHMP87, KU88, HB94], which do
not account for the network topology, we obtain a bound that is based on the bandwidth characteristics
of the simulating network. As a result, while previous lower bounds were significant only for very
powerful networks such as expanders, our lower bound can be specialized, yielding nontrivial results,
to a broad family of topologies, including low-bandwidth ones such as d-dimensional meshes and the
pruned butterfly. The bound is based on the notion of balanced decomposition tree [BL84], which
provides a partition of the network into disjoint regions of known bandwidth. We first formulate the
general lower bound in terms of such a decomposition and then show how to specialize it to meshes
and to the pruned butterfly.
Consider the simulation of an arbitrary $(n, m)$-PRAM program on an $n$-processor network $\mathcal{N}$. For
convenience, we assume that each PRAM step involves either the reading (read step) or the writing
(write step) of some $n$-tuple of variables. The simulation must satisfy the following standard constraints,
which are also required in the lower bounds quoted earlier.
• The simulation must be online, in the sense that each PRAM step is made known to the simulation
algorithm only after the simulation of previous read steps has been completed. Thus, read steps are
simulated one-by-one according to the order specified by the PRAM program. Each read must succeed
in accessing the correct (i.e., most recently written) value of the variable in question. Note that no
restriction is placed on the execution of write operations.
• The simulation must be point-to-point in the sense that a processor that wants to write a variable
must dispatch a distinct message for each copy of the variable it wants to update.
Note that the point-to-point constraint rules out the splitting and combining techniques that are at the
core of the simulations presented in this paper. However, at the end of the section, we modify the
argument to obtain a non-point-to-point lower bound, formulated in terms of the global space used to
represent the PRAM memory, which applies to our upper bounds.
We assume that the simulating network $\mathcal{N}$ has a $[w_0, w_1, \ldots, w_{\log n}]$ balanced decomposition tree,
as defined in [BL84]; that is, for any $i$, $0 \le i \le \log n$, $\mathcal{N}$ can be partitioned into $2^i$ disjoint $i$-regions,
$R^{(i)}_1, \ldots, R^{(i)}_{2^i}$, where each $i$-region contains $\lceil n/2^i \rceil \pm 1$ processors and is connected to the rest of the
network by at most $w_i$ edges. Clearly, every network has a balanced decomposition tree, for suitable values
of the $w_i$'s.
DEFINITION 3. Let $h$ and $k$ be two integers, with $1 \le h, k \le \log n$, and let $t$ be an arbitrary time step
during the course of the simulation. For any shared variable $u \in U$, we define $r^t_{h,k}(u)$ to be the minimum,
taken over all $h$-regions $R^{(h)}_j$, of the number of $k$-regions containing valid (i.e., most recently updated)
copies of $u$ that lie outside $R^{(h)}_j$ at the beginning of step $t$. (We assume $r^0_{h,k}(u) = 0$, for every $h$, $k$, and
$u$.) The average redundancy at time $t$ with respect to $h$ and $k$ is
$$r^t_{h,k} = \frac{1}{m} \sum_{u \in U} r^t_{h,k}(u).$$
The lower bound argument is similar in spirit to the ones in [AHMP87, KU88, HB94]; namely,
it relies on finding a sequence of PRAM steps which are “hard” to simulate. Such a sequence will
contain a judicious mixture of write and read steps suitably chosen to expose a tradeoff of the following
kind: unless the simulation devotes a sufficient amount of effort to each write step to ensure that the
valid copies of the variables written are “nicely distributed” among different regions of the network, an
adversary is always guaranteed to be able to find a read instruction that will be relatively expensive to
simulate.
In the subsequent analysis, we will make use of the following technical lemma, whose proof is
embedded in that of Lemma 1 in [PP97].
LEMMA 6 [PP97]. Consider a fixed partition of the network into $p$ disjoint regions and a set of $m'$
PRAM variables, such that, for each variable, there are at most $r' \ge 1$ distinct regions containing valid
copies of the variable. Then, for any $n' \le m'$, there exists a set of $n'$ variables whose valid copies are
all stored in memory modules residing in at most








A lower bound on the complexity of a read operation as a function of the redundancy of the simulation
scheme is proved in the following lemma.
LEMMA 7. Fix an arbitrary time step $t$ during the course of the simulation. For every $h$ and $k$, with
$1 \le h, k \le \log n$, at time $t$ an adversary could issue a read step involving $n$ distinct variables, whose
simulation requires time at least $g_{h,k}(r^t_{h,k})$, where
$$g_{h,k}(r) = \begin{cases} 1 & \text{if } \Phi(2r, 2^k, n, m/2^{h+1}) \ge 2^{k-2}, \\[4pt] \dfrac{n}{4\big(w_h + w_k \Phi(2r, 2^k, n, m/2^{h+1})\big)} & \text{otherwise,} \end{cases}$$
and $r^t_{h,k}$ is the average redundancy at time $t$ with respect to $h$ and $k$.
Proof. Fix h and k and let r D r th;k . The case 8(2r; 2k; n;m=2hC1) ‚ 2k¡2 is trivial, hence we
assume that 8(2r; 2k; n;m=2hC1) < 2k¡2. We will identify a set of 2(n) variables all of whose valid
copies are confined within a low-bandwidth portion of the network, and therefore are expensive to read
by processors outside the region. Let ˆU D fu 2 U : r th;k(u) • 2rg. Clearly, j ˆU j ‚ m=2 and there
exists an h-region R(h)j0 for which there are at least m=2
hC1 variables in ˆU achieving their minimum
redundancy with respect to R(h)j0 . Let ˆU
(h)
j0 µ ˆU be the set containing these variables. Note that since
8(2r; 2k; n;m=2hC1) < 2k¡2 we must have m=2hC1 ‚ n and, thus, j ˆU (h)j0 j ‚ n. We distinguish between
two cases, depending on whether r is less than 1=2 or not.
If r < 1=2, then there exists an n-tuple of variables in ˆU (h)j0 whose valid copies are all within R
(h)
j0 .
Since h ‚ 1 and so R(h)j0 contains no more than n=2 processors, we can stipulate that n=2 variables of
the n-tuple be read by processors outside R(h)j0 . At least one copy per variable must then be transmitted
along the wires connecting the region with the rest of the network; therefore such a read instruction will
require at least n=(2wh) time.
The case $r \ge 1/2$ is more involved. Fix an arbitrary subset $W$ of $\hat U^{(h)}_{j_0}$ containing exactly $m/2^{h+1}$
variables. Each variable $u \in W$ may have a number of valid copies within $R^{(h)}_{j_0}$ plus at most $2r$ valid
copies scattered among $k$-regions external to $R^{(h)}_{j_0}$ (call them expensive copies). By plugging $r' = 2r$,
$p = 2^k$, $n' = n$, and $m' = m/2^{h+1}$ into the statement of Lemma 6, we conclude that there are $n$
variables in $W$ whose expensive copies are all contained in at most $\Phi(2r, 2^k, n, m/2^{h+1})$ $k$-regions.
Since $\Phi(2r, 2^k, n, m/2^{h+1}) < 2^{k-2}$, the union of $R^{(h)}_{j_0}$ and these $k$-regions contains no more than $3n/4$
processors. Therefore we can stipulate that $n/4$ variables of the $n$-tuple be read by processors outside
the union. The lemma follows by observing that reading these variables would take time at least
\[
\frac{n}{4\left(w_h + w_k\,\Phi(2r, 2^k, n, m/2^{h+1})\right)},
\]
and that the above term is strictly less than $n/(2w_h)$. $\square$
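As a minimal sketch of how the bound of Lemma 7 is evaluated, the following Python function computes $g_{h,k}(r)$ from its ingredients. The value of $\Phi(2r, 2^k, n, m/2^{h+1})$ is taken as an input (`phi`) rather than computed, since $\Phi$ comes from Lemma 6; the function name, parameter names, and numbers are illustrative assumptions, not notation from the paper.

```python
def read_lower_bound(phi, k, n, w_h, w_k):
    """Evaluate g_{h,k}(r) as defined in Lemma 7.

    phi : value of Phi(2r, 2^k, n, m/2^(h+1)), supplied by the caller
    k   : level of the k-regions (there are 2^k of them)
    n   : number of processors
    w_h : bandwidth across the boundary of an h-region
    w_k : bandwidth across the boundary of a k-region
    """
    if phi >= 2 ** (k - 2):
        # Trivial case: any read step takes at least one time unit.
        return 1.0
    # n/4 variables must cross wires of total bandwidth w_h + w_k * phi.
    return n / (4.0 * (w_h + w_k * phi))


# Hypothetical parameters: n = 1024 processors, k = 6, w_h = 16, w_k = 4.
bound = read_lower_bound(phi=3, k=6, n=1024, w_h=16, w_k=4)
```

With these made-up values the non-trivial branch applies, and the result stays below $n/(2w_h)$, in line with the final remark of the proof.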
We observe that the function $g_{h,k}(r)$ defined in the above lemma is a non-increasing function of $r$.
The following lemma is similar to Lemma 7 in [HB94] and captures the contribution of the write
steps to the running time in terms of the average redundancy.
LEMMA 8. Consider an arbitrary time step $t$ during the course of a point-to-point simulation, and
let $r = r^t_{h,k}$. Then, $t$ and $r$ satisfy the inequality
\[
t \ge \frac{r\,m}{2^h w_h}.
\]
Proof. For each variable $u$, let $r_u$ denote the number of valid copies of $u$ lying outside the $h$-region
which contains the processor that most recently updated $u$ (before time $t$). (Note that $r_u$, as well as $r^t_{h,k}(u)$
for any $h$ and $k$, is equal to 0 if no processor wrote $u$ before time $t$.) Given the point-to-point assumption,
such a processor must have dispatched at least $r_u$ distinct messages that crossed the boundaries of its
$h$-region. As a consequence, a total of at least $\sum_{u \in U} r_u \ge rm$ messages must have crossed
boundaries of $h$-regions; hence, there must be an $h$-region whose boundary was crossed by at least
$rm/2^h$ distinct messages, which accounts for a total of at least $rm/(2^h w_h)$ time. $\square$
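The averaging step at the end of this proof can be illustrated with a short Python sketch. Given hypothetical per-variable counts $r_u$ and a hypothetical assignment of each variable's most recent writer to one of the $2^h$ $h$-regions, it checks that the busiest region boundary is crossed by at least the average number of messages and converts that count into a time bound by dividing by $w_h$. All names and numbers below are illustrative assumptions.

```python
from collections import Counter


def write_time_bound(r_u, writer_region, h, w_h):
    """Lower-bound the time spent on writes, following the proof of Lemma 8.

    r_u           : dict variable -> copies outside the writer's h-region
    writer_region : dict variable -> index (0 .. 2^h - 1) of that h-region
    h, w_h        : region level and bandwidth of an h-region boundary
    """
    crossings = Counter()
    for u, copies in r_u.items():
        # Point-to-point: one boundary-crossing message per outside copy.
        crossings[writer_region[u]] += copies
    busiest = max(crossings.values(), default=0)
    total = sum(r_u.values())
    # Pigeonhole: the busiest of the 2^h regions carries at least the average.
    assert busiest * 2 ** h >= total
    return busiest / w_h


r_u = {"a": 3, "b": 1, "c": 2, "d": 2}
writer_region = {"a": 0, "b": 1, "c": 1, "d": 3}
t_min = write_time_bound(r_u, writer_region, h=2, w_h=2)
```

In this toy instance the busiest boundary carries 3 messages, which is at least the average of $8/4 = 2$ per region used in the lemma.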
THEOREM 9. For any $T \ge 2m/n$, there exists a $T$-step $(n, m)$-PRAM program, whose point-to-point,











\[
\cdots\; g_{h,k}(r) + \frac{r\,n}{2^h w_h} \bigr\}\bigr\}\bigr),
\]
where $g_{h,k}(r)$ is the function defined in Lemma 7.
Proof. We construct a PRAM program with $\lfloor T/(2m/n) \rfloor$ batches of $2m/n$ instructions, each batch
consisting of $m/n$ write steps that update all the variables, followed by $m/n$ read steps suitably chosen
by the adversary according to Lemma 7.
Consider the simulation of one such batch, for some $h$ and $k$, with $1 \le h, k \le \log n$. Let $r$ be
the maximum value of $r^t_{h,k}$ at the start (time $t$) of the simulation of any read step. By Lemma 8, the
simulation of all the write steps requires time at least $rm/(2^h w_h)$. By Lemma 7, the simulation of each







The theorem follows by taking the minimum over all possible values of $r$ and the maximum over all
choices of $h$ and $k$ of the simulation time of a batch given above, and then by summing the contributions
of the $\lfloor T/(2m/n) \rfloor$ batches. $\square$
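The accounting behind this summation can be written out as follows. This is a sketch assembled from Lemmas 7 and 8 and the monotonicity of $g_{h,k}$, with constants not optimized; it assumes that each of the $m/n$ read steps of a batch costs at least $g_{h,k}(r)$, where $r$ is the maximum redundancy defined in the proof, and the names $T_{\text{batch}}$ and $T_{\text{total}}$ are introduced here only for illustration.

```latex
\begin{align*}
T_{\text{batch}} &= \Omega\!\left(
   \underbrace{\frac{r\,m}{2^h w_h}}_{\text{writes (Lemma 8)}}
   + \underbrace{\frac{m}{n}\, g_{h,k}(r)}_{\text{reads (Lemma 7)}} \right), \\
T_{\text{total}} &\ge \left\lfloor \frac{T}{2m/n} \right\rfloor T_{\text{batch}}
   = \Omega\!\left( T \left( g_{h,k}(r) + \frac{r\,n}{2^h w_h} \right) \right),
\end{align*}
```

where the last step uses $\lfloor T/(2m/n) \rfloor \ge Tn/(4m)$, which holds because $T \ge 2m/n$.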
We are now ready to prove Theorem 3, stated in the Introduction, which specializes the general lower
bound of Theorem 9 to the case of d-dimensional meshes (with d constant).
Proof of Theorem 3. Let us first concentrate on one-dimensional meshes ($d = 1$). We establish this
case separately, by means of a simple, diameter-based argument as follows. Consider a PRAM program
consisting of $T$ steps where in odd steps a processor $v$ updates a variable $u$ and in even steps all other
processors read $u$. Such a sequence requires $\Omega(Tn)$ time to be simulated on the linear array since distinct
pairs of consecutive write and read steps must be simulated in disjoint time intervals, because of the
online hypothesis, and in each such pair the newly written value of $u$ must travel a distance $\Theta(n)$.
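A minimal sketch of this counting, under the assumption (made only for the sketch) that a value advances at most one node per time step along the linear array; the function name is hypothetical.

```python
def linear_array_bound(n, T):
    """Diameter-based lower bound for simulating T steps on an n-node array.

    In each write/read pair the new value must reach every other node, so
    it travels at least as far as the array end farther from the writer.
    """
    # Even the most favourable writer position (the middle of the array)
    # leaves a distance of about n/2 to the farther end.
    min_travel = min(max(v, n - 1 - v) for v in range(n))
    pairs = T // 2  # one write step followed by one read step
    return pairs * min_travel


steps = linear_array_bound(n=8, T=10)
```

The returned quantity grows as $\Theta(Tn)$, matching the order of the bound stated above.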
Consider now the case $d > 1$. A natural halving process of an $n$-node $d$-dimensional mesh generates
a balanced decomposition tree with $w_i = \Theta((n/2^i)^{1-1/d})$, for $0 \le i \le \log n$. Define
\[
\Delta = \frac{\log(m/n)}{\log\log(m/n)},
\]
and fix $h$ and $k$ as the minimum indices such that $2^h \ge \Delta$ and $2^k \ge \Delta^{(2d-1)/(d-1)}$. Since $m \ge 16n$ and
$d > 1$, we have $\Delta \ge 2$, hence $h, k \ge 1$. Let $\bar m$ be the largest value of $m$ for which the chosen value for
$2^k$ is at most $n$ (note that this also implies $2^h \le n$). We first prove the lower bound under the assumption




$= \Omega\bigl(n^{1/d}\,\Delta^{1-1/d}\bigr)$ for $r \ge \bar r$. (4)










where the simplification of the denominator relies on the facts that $2^{h+1} < 4\Delta$ and $(4\Delta)^{8/\Delta} \le 8^4$ (since
$4\Delta \ge 8$ and $x^{1/x}$ is decreasing for $x \ge e$). With this it is easy to establish that when $r < \bar r$, we have
$\Phi(2r, 2^k, n, m/2^{h+1}) < 2^{k-2}$, which according to the definition of $g_{h,k}(r)$ in Lemma 7 implies that
\[
g_{h,k}(r) = \frac{n}{4\left(w_h + w_k\,\Phi(2r, 2^k, n, m/2^{h+1})\right)}.
\]
Substituting for $w_h$, $w_k$, and $\Phi$, this simplifies to















Using the chosen values for $2^h$, $2^k$ and the facts that $g_{h,k}(r)$ is non-increasing in $r$ and $(m/(2^{h+1}n))^{1/(2\bar r)} =
\Omega(\log^8(m/n)) = \Omega(\Delta^8)$, we have




for $r < \bar r$. (5)









Straightforward but tedious calculations show that our choice of h and k yields the best possible bound.
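The choice of $\Delta$, $h$, and $k$ made earlier in this proof can be sketched numerically as follows. The sketch assumes base-2 logarithms (consistent with $\Delta \ge 2$ for $m \ge 16n$) and takes the constant hidden in $w_i = \Theta((n/2^i)^{1-1/d})$ to be 1; the function names and the sample values of $m$, $n$, and $d$ are hypothetical.

```python
import math


def choose_levels(m, n, d):
    """Return (Delta, h, k) as chosen in the proof of Theorem 3 for d > 1.

    Assumes base-2 logarithms and m >= 16 * n, so that Delta >= 2.
    """
    ratio = math.log2(m / n)
    delta = ratio / math.log2(ratio)
    # Minimum indices with 2^h >= Delta and 2^k >= Delta^((2d-1)/(d-1)).
    h = math.ceil(math.log2(delta))
    k = math.ceil(math.log2(delta ** ((2 * d - 1) / (d - 1))))
    return delta, h, k


def mesh_bandwidth(n, i, d):
    """w_i = Theta((n / 2^i)^(1 - 1/d)); the hidden constant is taken as 1."""
    return (n / 2 ** i) ** (1 - 1 / d)


# Hypothetical instance: two-dimensional mesh, n = 2^10, m = 2^26.
delta, h, k = choose_levels(m=2 ** 26, n=2 ** 10, d=2)
w_h = mesh_bandwidth(2 ** 10, h, 2)
```

For this instance $\log(m/n) = 16$, so $\Delta = 4$, and the minimum indices are $h = 2$ and $k = 6$ (since $\Delta^3 = 64$). The ceiling of a floating-point logarithm is adequate for a sketch but may need care for values very close to a power of two.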

















and the theorem follows. $\square$
Note that the argument used to prove Theorem 9 does not exploit the fine-grained structure of the
interconnection but depends solely on the bandwidth distribution, as captured by the decomposition tree.
Consequently, we get the same specialized version of the lower bound for networks of different topologies
which have similar decomposition trees. An example is provided by the $n$-leaf pruned butterfly that
has the same decomposition tree (up to constant factors) as the two-dimensional mesh, although the
two topologies are very different. Hence, the proof of Theorem 4, stated in the Introduction, is virtually
identical to that of Theorem 3 for $d = 2$, and is omitted for brevity.
Recall that the simulations presented in this paper achieve high levels of efficiency by making a
crucial use of splitting and combining techniques. More specifically, a processor issuing a memory
request generates a single variable packet for each subset of copies residing in a suitably sized region
of the network. Once the variable packet is shipped within its destination region, it is split into multiple
copy packets, destined to the individual copies of the variable. In this way, the cost of the “long leg” of
the journey to access a copy is paid only once for all the copies residing within the same region.
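The splitting step just described can be sketched in a few lines of Python. This is an illustration only, not the routing algorithm of the paper: the data layout, the function names, and the partition into regions of four consecutive nodes are all assumptions made for the example.

```python
from collections import defaultdict


def variable_packets(copies, region_of):
    """Group the copies of one variable into one variable packet per region.

    copies    : list of node ids holding a copy of the variable
    region_of : function mapping a node id to the id of its region
    """
    packets = defaultdict(list)
    for node in copies:
        packets[region_of(node)].append(node)
    return dict(packets)


def split(packet_nodes):
    """Inside the destination region, split into one copy packet per copy."""
    return [("copy_packet", node) for node in packet_nodes]


# Hypothetical layout: 16 nodes, regions of 4 consecutive nodes each.
copies = [1, 2, 3, 9, 14, 15]
packets = variable_packets(copies, region_of=lambda node: node // 4)
long_leg_messages = len(packets)  # one per region holding copies
copy_packets = [p for nodes in packets.values() for p in split(nodes)]
```

Here six copies are reached with only three long-leg messages, one for each region that holds at least one copy; combining of the replies would proceed symmetrically.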
Unfortunately, the point-to-point assumption made by our lower bound argument precludes the splitting
and combining of messages destined to distinct copies; therefore Theorems 3 and 4 do not apply
to our simulations. Note however that the argument uses this assumption only to establish a bound on
the cost of write operations. As a consequence, we can prove a lower bound solely based on the cost of
read operations, which holds in an unrestricted model where splitting/combining may occur. The lower
bound, stated in Theorem 5 in the Introduction and proved below, is obtained by making sure that the
average redundancy does not grow too large during the simulation. This can be achieved by establishing
that the total amount of space used to represent the PRAM variables in the local memory modules can
never exceed a fixed threshold $mr$.
Proof of Theorem 5. Consider the case of $d$-dimensional meshes. For $d = 1$, the bound can be
trivially obtained through the same diameter-based argument employed in the proof of Theorem 3.
Hence, assume $d > 1$. We consider a PRAM program that first executes $m/n$ write steps to update all
the variables and then executes $T - (m/n) = \Theta(T)$ read steps suitably chosen by the adversary. Since
the average number of copies per variable is $r$, it is easy to see that, for every $1 \le k \le \log n$, at the
beginning of each read step there are at least $m/2$ variables each of which has updated copies in at
most $2r$ $k$-regions. By Lemma 6 this implies that there exist $\min\{2^k, \Phi(2r, 2^k, n, m/2)\}$ $k$-regions that
contain all updated copies of at least $n$ variables. If $\Phi(2r, 2^k, n, m/2) \le 2^{k-1}$, the adversary can require





























and let $\bar m$ be the largest value of $m$ for which the chosen value for $2^k$ is at most $n$. As in the proof of
Theorem 3, we first consider the case $m \le \bar m$. In this case, we have $1 \le k \le \log n$. Moreover, it is easy
to verify that $(m/2n)^{1/(2r)} \ge (\log(m/n)/\log\log(m/n))^{2\alpha+1}$ and $\Phi(2r, 2^k, n, m/2) \le 2^{k-1}$. By plugging

























The unrestricted lower bound for the pruned butterfly is obtained by setting $d = 2$ in the above
calculations. $\square$
Theorem 5 shows that our simulations use an amount of redundancy which is only a doubly logarithmic
factor higher than the minimum redundancy needed to achieve the same slowdown.
8. CONCLUSIONS
In this paper we have presented upper and lower bounds for the problem of simulating a shared-
memory abstraction on network-based machines such as d-dimensional meshes and the pruned butterfly.
An interesting feature of our scheme is its generality. Indeed, the simulation algorithm relies on a
recursive decomposition of the underlying network into subnetworks of the same topology, and employs
a restricted set of basic primitives such as prefix, k-sorting, and k-relation routing. As a consequence, the
algorithm is efficiently portable to any other machine with a recursive structure and on which optimal
algorithms for the above primitives are known. As for the lower bound, we have developed a generic,
bandwidth-based argument that can be applied to any specific interconnection using the parameters of
its decomposition tree.
Regarding the upper bound, it must be remarked that we make use of memory organizations based on
generalized expanders. As mentioned in the Introduction, the explicit, deterministic construction
of generalized expanders is a long-standing open question, although it can be shown that a random
bipartite graph would exhibit the required expansion property with high probability. This limitation
of our scheme is shared by all other deterministic mesh-based schemes in the literature, with
the exception of the scheme of [PPS94], which applies only to very small memory sizes ($m = O(n^{1.5})$)
and exhibits a higher slowdown than ours.
Finally, the general lower bound presented in Section 7 is proved under the point-to-point assumption,
which stipulates that packets sent to copies of a variable can be neither split nor combined. This constraint
rules out the techniques that are at the core of the simulations presented in this paper, hence the bound
does not apply to our algorithms directly. However, we have been able to modify the argument to obtain
one which applies to our algorithms, by introducing an upper limit to the global space used to represent
the PRAM variables. In particular, we are able to show that in order to match the slowdowns exhibited
by our simulations, any deterministic scheme must use about the same amount of space to represent
the variables. However, the search for a nontrivial, totally unrestricted lower bound for deterministic
PRAM simulation on network-based machines remains a challenging open problem.
ACKNOWLEDGMENTS
The authors thank Gianfranco Bilardi for some helpful discussions on sorting on the pruned butterfly and the anonymous
referee who provided a number of valuable comments and suggestions on the original version of the manuscript.
REFERENCES
[AHMP87] Alt, H., Hagerup, T., Mehlhorn, K., and Preparata, F. P. (1987), Deterministic simulation of idealized parallel
computers on more realistic ones, SIAM J. Comput. 16(5), 808–835.
[BB95] Bay, P., and Bilardi, G. (1995), Deterministic on-line routing on area-universal networks, J. ACM 42(3), 614–640.
[BL84] Bhatt, S. N., and Leighton, F. T. (1984), A framework for solving VLSI graph layout problems, J. Comput. System
Sci. 28(2), 300–342.
[Her96] Herley, K. T. (1996), Representing shared data on distributed-memory parallel computers, Math. Syst. Theory 29,
111–156.
[HB94] Herley, K. T., and Bilardi, G. (1994), Deterministic simulations of PRAMs on bounded-degree networks, SIAM J.
Comput. 23(2), 276–292.
[KU88] Karlin, A. R., and Upfal, E. (1988), Parallel hashing: An efficient implementation of shared memory, J. ACM 35(4),
876–892.
[Knu73] Knuth, D. E. (1973), “The Art of Computer Programming. Vol. 3. Sorting and Searching,” Addison–Wesley, Reading,
MA.
[Kun93] Kunde, M. (1993), Block gossiping on grids and tori: Deterministic sorting and routing match the bisection bound,
in “Proceedings, 1st European Symposium on Algorithms” (T. Lengauer, Ed.), LNCS 726, pp. 272–283,
Springer-Verlag, Berlin.
[Lei85] Leiserson, C. E. (1985), Fat-trees: Universal networks for hardware-efficient supercomputing, IEEE Trans. Comput.
C-34(10), 892–901.
[PP95] Pietracaprina, A., and Pucci, G. (1995), Improved deterministic PRAM simulation on the mesh, in “Proceedings,
22nd International Colloquium on Automata, Languages and Programming” (Z. Fülöp and F. Gécseg, Eds.), LNCS
944, pp. 372–383, Springer-Verlag, Berlin.
[PP97] Pietracaprina, A., and Pucci, G. (1997), The complexity of deterministic PRAM simulation on distributed memory
machines, Theory Comput. Syst. 30(3), 231–247.
[PPS94] Pietracaprina, A., Pucci, G., and Sibeyn, J. F. (1994), Constructive deterministic PRAM simulation on a
mesh-connected computer, in “Proceedings, 6th ACM Symposium on Parallel Algorithms and Architectures,”
pp. 248–256; SIAM J. Comput., to appear.
[UW87] Upfal, E., and Wigderson, A. (1987), How to share memory in a distributed system, J. ACM 34(1), 116–127.
