How to emulate shared memory  by Ranade, Abhiram G.
JOURNAL OF COMPUTER AND SYSTEM SCIENCES 42, 307-326 (1991)
How to Emulate Shared Memory* 
ABHIRAM G. RANADE 
Department of Computer Science, 
Yale University, New Haven, Connecticut 06520 
Received March 23, 1988; revised June 16, 1990 
We present a simple algorithm for emulating an N-processor CRCW PRAM on an N-node 
butterfly. Each step of the PRAM is emulated in time O(log N) with high probability, using 
FIFO queues of size O(1) at each node. The only use of randomization is in selecting a hash
function to distribute the shared address space of the PRAM onto the nodes of the butterfly. 
The routing itself is both deterministic and oblivious, and messages are combined without the 
use of associative memories or explicit sorting. As a corollary we improve the result of 
Pippenger by routing permutations with bounded queues in logarithmic time, without the 
possibility of deadlock. Besides being optimal, our algorithm has the advantage of extreme 
simplicity and is readily suited for use in practice. © 1991 Academic Press, Inc.
1. INTRODUCTION 
Of the models proposed for designing parallel algorithms the most attractive are 
the ones based on shared memory. The most general shared memory models in the 
literature, the concurrent-read concurrent-write parallel random-access machines 
(CRCW PRAM) allow an arbitrary number of processors to read or write a com- 
mon memory location in one time step. Complex communications operations, 
broadcast and multicast, for example, can be implemented in one step. Abstracting 
complex communications patterns into unit steps greatly simplifies the tasks of 
designing algorithms and writing programs. For this reason, CRCW PRAM models 
are favored over weaker abstract machine models, for which most, if not all, of the 
programming effort is spent synchronizing the movement of data. 
Unfortunately, it is unlikely that CRCW PRAMS will ever faithfully model any 
real parallel machine. A real parallel computer will most likely consist of a sparse 
network of a large number of processors. For the network to scale in size, we 
require that the complexity of the individual processors be independent of the num- 
ber of processors in the network. More specifically, by a realistic parallel computer 
* A preliminary version of this paper appeared in “Proceedings, 28th Annual Symposium on Foundations of Computer Science, Los Angeles, CA, Oct. 12-14, 1987,” IEEE Computer Society Press,
Washington, DC, 1987. This work was supported by the Office of Naval Research under Grant N00014-86-K-0564. Author’s present address: Department of Electrical Engineering and Computer Science,
University of California, Berkeley, CA 94720.
we mean a network of N processors, with each processor connected to no more 
than a fixed number (say 4) of processors. Each processor in this network has its 
own local memory, and processors communicate by sending messages over links to 
neighboring processors. 
How can we reconcile the convenience of CRCW PRAMS with the limitations of 
a real computer? The only alternative is to emulate a CRCW PRAM on a real 
network. Such an emulation has two components: 
• Address map: mapping the address space of the PRAM onto the N
memory modules of the network,
• Message routing: routing memory requests (read/write) from processors
to distant memory locations and data from the location back to the processors.
Once we have fixed our address map, each memory access is accomplished by send- 
ing a message from the processor requesting the access to the processor holding the 
memory location. 
Three measures determine the efficiency of an emulation. The first is time, the 
number of steps on the network to emulate one step of the PRAM. The second is 
queue size, the amount of additional hardware per processor required to hold 
messages in queues while in transit. The third factor is the complexity of managing 
the queues at each processor: a first-in first-out queue is less complicated than a 
priority queue, or a queue requiring associative lookups. A simple queueing 
strategy is clearly preferable to one requiring complex operations. 
Because the diameter of any bounded-degree network on N nodes must be at
least Ω(log N), this is clearly a lower bound on the time to emulate one step of a
CREW PRAM. Under fairly general assumptions, Karlin and Upfal [8] and, independently, Alt, Hagerup, Mehlhorn, and Preparata [2] show that any deterministic
emulation must take at least Ω(log² N/log log N). The simpler problem of routing
permutations (i.e., each processor requests to send a message to a unique processor)
has also been extensively studied. If the routing scheme used is deterministic and
oblivious (the route of each message is completely determined by the source and
destination), Borodin and Hopcroft [3] give a worst-case lower bound of Ω(√N)
for permutation routing.
A number of emulations have been developed in recent years [2, 7, 8, 12, 19, 20].
The best known deterministic strategy for emulating an N-node CRCW PRAM 
with M shared variables on an N-node bounded-degree graph is that of Herley and 
Bilardi [7] and takes time O(log N log M/log log N). Better time bounds are 
obtained with randomized routing schemes [1, 18, 21]. Using random hash
functions, a randomized routing strategy, and the Reif-Valiant [17] probabilistic
sorting scheme, Karlin and Upfal [8] presented a probabilistic emulation on an
N-node butterfly. The algorithm is always guaranteed to work, and with probability
at least 1 - 1/N emulates any PRAM step in time O(log N). However, the queues
are required to have size Ω(log N), and must be built as priority queues, which is
expensive. 
HOW TO EMULATE SHARED MEMORY 309 
1.1. Main Result 
This paper presents a probabilistic emulation whose time complexity is O(log N) 
and queue size is O(1). We note that this is the first emulation of CRCW PRAMS 
with bounded queue size. The queues are first-in first-out, the simplest possible. We 
adapt the random hash functions of Karlin and Upfal [8] for the address map. In 
contrast, however, our routing scheme on the butterfly is completely deterministic. 
Thus our scheme only requires O(log² N) random bits, rather than O(N log N), as
in [8]. In fact, the routing scheme is also oblivious, which is rather surprising.
Besides being optimal with respect to time complexity, our emulation has the 
advantage of extreme simplicity. To summarize, we state the main result below: 
THEOREM 1. One step of an N-processor CRCW PRAM can be emulated on an 
N-processor butterfly in time O(log N) with probability at least (1 - 1/N). The queue
size at each processor is O(1).
A minor variant of our scheme can also be used for routing permutations. 
Pippenger [13] showed how to route permutations on a butterfly with bounded 
queue size in O(log N) time. His scheme allowed a small probability of deadlock. 
We obtain a deadlock-free solution for permutation routing as a simple corollary 
of our routing scheme. In comparison to [13], our routing scheme as well as the
accompanying analysis is considerably simpler. 
Our scheme also extends to other networks besides the butterfly and can be used 
to emulate fetch-and-add [6] and related instructions. The work done on this is 
described in Section 6. 
1.2. Overview 
The key idea in this paper is a simple and optimal strategy for scheduling move- 
ment of messages. This simplifies combining requests that access the same memory 
location and also enables the use of constant sized queues (Section 2). Section 3 
presents the emulation formally. Sections 4 and 5 analyze the routing scheme and 
the likelihood of large delays. Section 5 also estimates the constant factors and 
describes the application to permutation routing on the butterfly. Section 6 
mentions ways in which this work has already been extended. 
In Appendix A we deal with some of the technical complexities associated with 
the address map used in this paper and also in [8]. In Appendix B we extend our
results to the emulation of entire PRAM programs. 
2. HOW TO COMBINE MESSAGES
Suppose that several processors wish to read the same memory location at the 
same time step. Each processor sends a request to the memory in which the location lies along some path in the network. Suppose that the paths of these requests
intersect to form a tree, as in Fig. 1.
FIG. 1. Message paths to a common location form a tree. (Legend: requesting processors; module holding location; network link; message path.)
There is, however, no need to send more than one request along any branch of 
this tree. A request is simply forced to wait at each tree node until (i) another 
request to the same destination arrives on the other input to the node, and the node 
combines the two and forwards the result along the tree, or (ii) the node determines 
that no future request arriving on the other input will have the same destination. 
How does a node determine that no future message will request a particular 
memory location? The key idea is that for every edge in the butterfly the messages 
corresponding to a given PRAM step can be transmitted in the sorted order of their 
destinations. Once the sorted order is established on the edges coming into a node, 
it can be guaranteed inductively on the outgoing edges. 
Figure 2 shows a snapshot of nodes in the network. Each node receives messages 
along two incoming edges and places them into the corresponding FIFO queues. At 
each step the node compares the destination addresses of the messages at the head 
of each queue. The message with the smaller destination address is transmitted 
along the appropriate outgoing edge. If both messages at the heads of the queues 
are destined for the same location, they are combined and only one sent out. 
Finally, if only one queue has a message waiting and the other queue is empty, no
message is sent out. (If the message were sent, the next message to appear in the
other queue could conceivably have a smaller destination, potentially violating the
sorting requirement.)

FIG. 2. Combining messages by merging streams.
In our snapshot of Fig. 2, node A first transmits the message destined for location 
35. It then waits until a message arrives into the top queue, which happens to be 
to location 48. Then it again compares the messages at the head of the two queues 
and discovers that both messages are destined for location 48 and can be combined.
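The merging rule described above can be sketched as follows. This is a minimal illustration, not the paper's formal algorithm (which is given in Section 3): messages are reduced to bare destination addresses, the queues are unbounded, and the function name `merge_step` and the example addresses in the comment are hypothetical.

```python
from collections import deque


def merge_step(top, bottom):
    """One step of a combining switch on two FIFO queues of destinations.

    Each queue is assumed to be sorted from head to tail. Returns the
    destination sent out in this step, or None if the switch must wait.
    """
    if not top or not bottom:
        # An empty queue might still deliver a smaller destination later,
        # so sending now could break the sorted order on the output.
        return None
    if top[0] < bottom[0]:
        return top.popleft()
    if bottom[0] < top[0]:
        return bottom.popleft()
    # Both heads name the same location: combine them into one message.
    bottom.popleft()
    return top.popleft()


# Hypothetical run: top = deque([40]), bottom = deque([35, 48]) sends 35,
# then 40, then waits until something arrives on the top queue.
```

Because each input stream is sorted and the switch always forwards the smaller head, the output stream is sorted as well, which is the inductive step the text relies on.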
2.1. Reply Routing 
How do we return the data to all requesting processors? The reply message, upon 
reading the data, returns backward along each edge of the tree and reaches every 
requesting processor. For the backrouting we only need to store two direction bits 
at each node. The bits say whether the request came along the top branch, the 
bottom one, or along both. Since messages are kept sorted by the address of the 
memory location they accessed even as they return, replies at each node arrive in 
the same order as that in which the requests were sent out. Therefore, the direction 
bits can be stored in a 2-bit-wide FIFO queue. This simple idea is more efficient 
than the associative memories proposed in [6]. 
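A small sketch of the direction-bits idea, under the assumption that replies reach a node in exactly the order in which the requests left it. The class name `ReplySwitch`, the bit encoding, and the string payloads are illustrative choices, not taken from the paper.

```python
from collections import deque

TOP, BOTTOM = 0b10, 0b01  # hypothetical encoding; both bits set means "both"


class ReplySwitch:
    """Stores two direction bits per forwarded request in a FIFO queue."""

    def __init__(self):
        self.direction_bits = deque()

    def record_request(self, from_top, from_bottom):
        # Called when a (possibly combined) request is forwarded.
        bits = (TOP if from_top else 0) | (BOTTOM if from_bottom else 0)
        self.direction_bits.append(bits)

    def route_reply(self, reply):
        # Replies come back in request order, so the head of the FIFO
        # always belongs to the reply being handled.
        bits = self.direction_bits.popleft()
        out = []
        if bits & TOP:
            out.append(("top", reply))
        if bits & BOTTOM:
            out.append(("bottom", reply))
        return out
```

A combined request records both bits, so its single reply is replicated onto both branches without any lookup by address.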
2.2. Ghost Messages 
Keeping message streams sorted has one deficiency. Consider Fig. 2 again. Node
B cannot transmit the message it holds for location 25 until it is sure that it will
not receive a message for a smaller address. When A transmits the message for loca- 
tion 35 on the top link, it can convey this information to B by sending a ghost 
message labelled 35. As soon as B receives the ghost message, it knows that future 
messages along that link must be destined for locations greater than 35. Therefore, 
at the next time step B can forward the message destined for address 25. 
Ghost messages are somewhat similar to the Null messages proposed by Chandy
and Misra [5] for distributed simulation. Ghost messages simply notify a node of
the minimum address to which subsequent messages can be destined. Ghosts are
not used for any other purpose; they “keep the system going.” The next section
precisely specifies the mechanisms for transmitting ghosts. This simple idea turns 
out to be powerful enough to yield our main result. 
3. THE EMULATION 
The bounded-degree network used for emulating CRCW PRAMs is the butterfly,
also called the FFT network (Fig. 3). The number of nodes in a butterfly with n + 1
levels is $N = (n + 1)2^n$, and we use this to emulate an N-processor PRAM. For
simplicity, we consider only the emulation of read instructions. Write instructions 
are handled similarly. 
Each node in the butterfly is assigned a unique number (c, r), where $0 \le c \le n$,
$0 \le r \le 2^n - 1$, and (c, r) is the binary representation obtained by concatenating
FIG. 3. A Butterfly network with n = 3. 
the binary representations of c and r. Node (c, r) is said to belong to level c. Node
(c, r) is connected to nodes (c + 1, r) and $(c + 1, r \oplus 2^c)$, for $0 \le c < n$,
where $\oplus$ denotes bitwise exclusive or. In a single step each node can send a
message of length O(n) to each of its neighbors.
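The connectivity just described can be sketched in a few lines. The sketch assumes that the cross edge leaving level c flips bit c of the row number, which is the usual butterfly convention; the function names are hypothetical.

```python
def num_nodes(n):
    """Number of nodes in a butterfly with n + 1 levels: (n + 1) * 2**n."""
    return (n + 1) * 2 ** n


def forward_neighbors(c, r, n):
    """Nodes at level c + 1 adjacent to node (c, r).

    Assumption: the cross edge from level c flips bit c of the row.
    """
    if not (0 <= c < n and 0 <= r < 2 ** n):
        raise ValueError("need 0 <= c < n and 0 <= r < 2**n")
    return [(c + 1, r), (c + 1, r ^ (1 << c))]
```

For n = 3 this gives 32 nodes, and node (2, 5) is joined to (3, 5) by the straight edge and to (3, 1) by the cross edge.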
Each node in the butterfly has a processor, a memory module, and a small num- 
ber of switches (up to six), each with up to two inputs and two outputs. Each input 
into a switch has a queue that can hold at most b messages. The algorithm works 
for any value of $b \ge 2$. Each switch also has a 2-bit-wide FIFO queue called the
direction bits queue. The length of the direction queue is bounded by the running 
time of the algorithm, which is shown to be O(n) with high probability. Hence 
queues of O(n) bits are sufficient. This requires no more storage than O(1)
messages. 
3.1. The Address Map 
Let M denote the size of the shared memory in the PRAM, addressed 
$0, \ldots, M - 1$. Following [8], shared memory location x is mapped into the memory
of the butterfly node whose number is h(x) = g(x) mod N, where g(x) is chosen at
random from the class G of ζ-universal hash functions [4, 12]:

$$G = \Big\{\, g \;\Big|\; g(x) = \Big(\sum_{0 \le i < \zeta} a_i x^i\Big) \bmod P \,\Big\}.$$

Each $a_i$ is chosen at random from $\{0, 1, \ldots, P - 1\}$, and $P > \max(M, nN)$ is a fixed
prime.¹ The number ζ is later shown to be O(log N). Thus the number of random
bits required is O(log M log N). 
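A minimal sketch of such an address map, assuming (as the later reference to a polynomial g of degree ζ - 1 suggests) that g is a random polynomial with ζ coefficients evaluated modulo the prime P. The names `make_g` and `make_address_map` are hypothetical, and the coefficient list is passed in explicitly so the behaviour is reproducible.

```python
import random


def make_g(coeffs, P):
    """Return g(x) = (sum of coeffs[i] * x**i) mod P, using Horner's rule."""
    def g(x):
        acc = 0
        for a in reversed(coeffs):
            acc = (acc * x + a) % P
        return acc
    return g


def make_address_map(zeta, P, N, rng=None):
    """Draw zeta random coefficients from {0, ..., P-1}; return (g, h)."""
    rng = rng or random.Random()
    coeffs = [rng.randrange(P) for _ in range(zeta)]
    g = make_g(coeffs, P)

    def h(x):
        # Butterfly node that stores shared location x.
        return g(x) % N

    return g, h
```

Each coefficient costs about log P random bits, so ζ coefficients account for the O(log M log N) bits quoted in the text when ζ is O(log N) and P is polynomial in M.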
For simplicity the address map used here only distributes the shared variables of 
the PRAM among the modules in the butterfly, rather than mapping them onto 
individual locations within each memory. Since h(x) - y = 0 will have O(log N) 
solutions in general for some y, the memory module y mod N will receive O(log N) 
shared variables, forcing each module to have length O(log N). Appendix A shows 
how this problem can be overcome. 
3.2. Message Structure 
To access a PRAM location x, a processor sends a message to module h(x), and 
the message returns with the required data. Each message has three fields: tag, type, 
and data. The tag for our message is (g(x), x), i.e., the number obtained by con- 
catenating the number g(x) with x. The tag is used by the algorithm as a priority 
to decide which messages to transmit earlier. The tag also encodes h(x) = g(x) 
mod N and can be used to determine the message path.² The type field is one of
request, ghost, or EOS, described below. 
Each node issues an end-of-stream message of type EOS immediately after it 
issues a memory-access message of type request. In case a node does not access 
shared memory in a given PRAM step, it only inserts an EOS. The tag field of an 
end-of-stream message is interpreted as ∞. An end-of-stream message arriving at a
switch notifies the switch that no more requests will be issued to it from the corre- 
sponding incoming link, for the corresponding PRAM step. 
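The message layout can be sketched as a small record. Here the concatenated tag (g(x), x) is modelled as a Python tuple, which compares lexicographically in the same way as the concatenated number; the field name `kind` stands in for the paper's type field, and `initial_stream` is a hypothetical helper.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

REQUEST, GHOST, EOS = "request", "ghost", "EOS"
EOS_TAG = (math.inf, math.inf)  # an end-of-stream tag is read as infinity


@dataclass(frozen=True)
class Message:
    tag: Tuple[float, float]    # (g(x), x), used as the priority
    kind: str                   # REQUEST, GHOST, or EOS
    data: Optional[int] = None  # filled in when memory is accessed


def initial_stream(g, x=None):
    """Messages a node issues for one PRAM step.

    A node that accesses location x issues a request followed by an EOS;
    a node that does not access shared memory issues only the EOS.
    """
    stream = []
    if x is not None:
        stream.append(Message((g(x), x), REQUEST))
    stream.append(Message(EOS_TAG, EOS))
    return stream
```

Since the EOS tag exceeds every request tag, a stream built this way is already sorted by tag.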
3.3. Message Path 
Each request message traverses a path from its source processor to the destina- 
tion module and back. This happens in six phases. In phases 1, 3, and 5 the 
messages traverse the butterfly in the backward direction and in phases 2, 4, and 
6, in the forward direction (Fig. 4). Consider a message issued by processor (c, r ) 
for PRAM location x stored in memory module h(x) = (c’, r’). In the first phase, 
the message is directed from node (c, r) to node (0, r ). In phase 2, the message 
follows the unique (forward) path in the butterfly from node (0, r) to node (n, r’). 
This path can be determined by looking at the least significant n bits of g(x), which 
is a part of the tag. In phase 3, the message moves backward along row r’ and 
reaches the node (c’, r’), where it acquires the required data from the memory 
module. It continues to move backward through the row until it reaches node 
(0, r’). In the last three phases, the message traverses its path in the reverse direc- 
tion and returns to the node that initiated the request. The datum requested has 
finally arrived. 
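The forward path of phase 2 can be sketched as bit fixing on the row number. As before, this assumes that the cross edge leaving level c flips bit c of the row; `phase2_path` is a hypothetical name, and the destination row is taken as the least significant n bits of g(x).

```python
def phase2_path(r, r_dest, n):
    """Nodes visited from (0, r) to (n, r_dest) along the unique forward path.

    Assumption: at level c the message takes the cross edge exactly when
    bit c of the current row differs from bit c of the destination row.
    """
    path = [(0, r)]
    row = r
    for c in range(n):
        bit = 1 << c
        if (row ^ r_dest) & bit:
            row ^= bit  # take the cross edge, correcting bit c
        path.append((c + 1, row))
    return path
```

The route depends only on the source row and the destination row, which is what makes the routing oblivious once the address map is fixed.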
¹ It suffices to use P > M [14, 15], but the present choice gives substantially better constants.
² It is possible to use (h(x), x) as the tag, as was done in [14]. However, the current choice gives
better constants.
FIG. 4. Logical network. (Phases 1 through 6 occupy columns 0 to n, n to 2n, ..., 5n to 6n; legend: switch, memory module, processor.)
For convenience, the routing mechanism is described in terms of the logical 
routing network (logical network for short) of Fig. 4 instead of the butterfly. The 
correspondence between the two is clear. Each butterfly node does the work of 6 
switches in the logical network.³ The logical network has $(6n + 1)2^n$ switches
organized in 6n + 1 columns. These columns are numbered 0 through 6n from the 
input side; the rows are numbered as in the butterfly. A switch in column c and row 
r is numbered (c, r). 
Since there is a unique (forward) path in the butterfly for any pair of nodes, it 
follows that the path traversed by each request is oblivious (once the address map 
is fixed). 
LEMMA 1. For any processor-memory pair (c, r), (c', r'), there is a unique path
in the logical network that starts at (c, r), passes through (c', r'), and ends at
(c, r). Furthermore, the sequence of nodes traversed in phases 4, 5, and 6 is the
reverse of the sequence for phases 3, 2, and 1, respectively.
3.4. Scheduling Message Movement in Phases 1 and 2 
The algorithm begins with the upper queues in levels 0 through n, each holding 
at most one request followed by an EOS message. During the execution, the algo- 
rithm maintains the following invariant. Along any link in the network, messages 
are transmitted in increasing order of the tag. The queueing discipline in each queue 
is FIFO; as a result, the algorithm ensures that throughout the execution of the 
algorithm, the messages in each queue are arranged from head to tail in the order 
of increasing tag. Initially the invariant is satisfied trivially. 
³ Except that nodes in column 0 of the butterfly do the work of four switches and those in column n
do the work of three.
At each step, a switch examines the messages at the heads of its queues. If any 
of these queues are empty, then the switch does nothing. Otherwise, it selects the 
message with the smallest tag as the candidate to be transmitted. In the case of the 
switch in level 0, it selects the message at the head of the only queue. In other levels, 
in case the messages at the heads have the same tag, these messages are combined
into a single message. For phase 2, the output on which the selected message is to 
be sent is determined by the appropriate bit of the tag. The selected message is sent 
forward only if the queue that it must enter contains fewer than b messages at the 
beginning of the step. Thus, every queue is guaranteed to always hold no more than 
b messages. 
To prevent queues from becoming empty, whenever a switch selects a message for 
transmission, it sends a ghost message with the same tag along all of its other out- 
put edges. Ghost messages are sent even if the selected message is itself not sent 
because of a queue being full. The tag of the ghost message provides the switch on 
the next level with a lower bound on the tags of the messages that it will receive 
in the future. Ghost messages are discarded if the receiving queue has no space. 
Otherwise, ghost messages are handled by the receiving switch in a manner somewhat similar to the handling of request messages. Like a request message, a ghost
message can be selected for transmission if it is at the head of its queue and has a 
smaller tag than the tags of messages in all the other queues. Ghost messages that 
are not selected are immediately destroyed. A selected ghost message is sent out 
along all output edges. As mentioned earlier, if a receiving queue does not have 
space, it discards the ghost message. 
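The scheduling rule for one switch can be sketched as below. This is a simplified reading under several assumptions that the text does not settle: EOS messages and the direction bits are left out, downstream queue space is modelled as a list of booleans, a ghost whose tag equals that of a selected request is simply absorbed, and messages are plain `(tag, kind)` pairs. The name `switch_step` and its parameters are hypothetical.

```python
from collections import deque

REQUEST, GHOST = "request", "ghost"


def switch_step(queues, route, out_has_space):
    """One step of a switch; returns {output index: (tag, kind)} emitted.

    queues: input deques of (tag, kind), each sorted by tag from the head.
    route(tag): output index a request with this tag must take.
    out_has_space[o]: True if the queue behind output o holds < b messages.
    """
    if any(len(q) == 0 for q in queues):
        return {}  # some input is empty: do nothing this step
    tag = min(q[0][0] for q in queues)
    selected = [q for q in queues if q[0][0] == tag]
    is_request = any(q[0][1] == REQUEST for q in selected)
    # Ghosts at a head that were not selected are destroyed immediately.
    for q in queues:
        if q[0][0] != tag and q[0][1] == GHOST:
            q.popleft()
    emitted = {}
    if is_request:
        out = route(tag)
        if out_has_space[out]:
            for q in selected:  # equal tags are combined into one message
                q.popleft()
            emitted[out] = (tag, REQUEST)
        else:
            for q in selected:  # the request waits; ghosts never wait
                if q[0][1] == GHOST:
                    q.popleft()
        ghost_outputs = [o for o in range(len(out_has_space)) if o != out]
    else:
        for q in selected:
            q.popleft()
        ghost_outputs = list(range(len(out_has_space)))
    # A ghost with the selected tag goes to the remaining outputs, even if
    # the request itself was blocked; a full receiving queue drops it.
    for o in ghost_outputs:
        if out_has_space[o]:
            emitted[o] = (tag, GHOST)
    return emitted
```

The point of the sketch is the asymmetry between the two kinds of message: a blocked request stays at the head of its queue, while a ghost is either forwarded at once or discarded, so ghosts never occupy queue space for more than one step.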
For each request that passes through a switch, two bits are stored in the direction 
bits queue. These bits indicate whether the request arrived into the switch from the 
top link or the bottom link, or was the result of the combination of messages that 
arrived on both links. 
We summarize the properties of the routing algorithm in the following lemmata. 
LEMMA 2. Throughout the execution, each queue in the network holds messages
from head to tail sorted by increasing order of tags. Each switch sends out messages
in the increasing order of tags.
LEMMA 3. A switch in level i will hold messages in each of its queues at time i. 
After step i it will send out a message on each of its outputs every step (unless the 
receiving queues do not have space) until it transmits an EOS message. 
The proofs are by easy induction on the level. We also need a characterization 
of ghost messages: 
LEMMA 4. A ghost message will never wait at any switch. 
The lemma follows by definition, since ghost messages are discarded if they are 
not immediately transmitted. 
3.5. Operation of Phases 3 and 4 
In these phases each message moves along a single row. In phase 3 the data slot 
of each request is filled by accessing the memory. 
3.6. Operation of Phases 5 and 6 
Message movement is also scheduled in phases 5 and 6, using tags to keep 
messages sorted. Ghost messages are also used, as in phases 1 and 2, to prevent 
queues from becoming empty. Lemmata 2-4 are applicable. For deciding the path 
of a message, however, switches in phases 5 and 6 use the direction bits queue, 
rather than the tag bits. For each reply message (not a ghost or EOS) two bits are 
removed from the head of the direction bits queue. These bits are used to decide 
whether to transmit the message on the upper link or the lower link, or whether to 
replicate the message and send it on both links. This is summarized by the follow- 
ing lemma. 
LEMMA 5. The ith request selected for transmission in a switch in phases 5 or 6
is the reply to the ith request selected for transmission in the corresponding switch in
phase 2 or 1, respectively.
Proof. The lemma is obviously true for switches in level 4n. Assume that it is 
true for switches in level 4n + i; i.e., each switch in level 4n + i selects the replies to 
requests in the same order as that in which they were transmitted in switch 2n - i. 
Since the direction bits queue is accessed FIFO, the replies are returned on exactly 
the inputs on which they arrived. As a result, requests arrive into the input queues 
of the switches in level 4n + i+ 1 in ascending order of tag, and these are the replies 
to the messages sent in the corresponding switches in level 2n - i- 1. But the 
switches in level 4n + i + 1 also select messages for transmission by picking the 
smallest at the head of the queues. Thus the replies to messages sent by switches in 
level 2n - i - 1 are also selected in the order in which they were transmitted. Thus 
the lemma follows by induction. ∎
4. MESSAGE POLARIZATION AND DELAY 
We show that whenever message delivery takes a long time, there exists a long 
polarized sequence of messages. The next section shows that long polarized sequen- 
ces are highly unlikely. The notion of a polarized sequence is similar to that of the 
delay sequence of [8, 18] and the critical path of [1].
A path S in the logical network is a sequence (S(i)) of switches with the 
property that, for every i, the switch S(i) is connected to switch S(i + l), and the 
switches S(i), S(i+ 1) are distinct. Note that not all switches along a path need be 
distinct, only adjacent pairs. A path which originates in column 0 and ends in 
column 6n is called an input-output path. 
DEFINITION 1. Let $M = m_1, m_2, \ldots, m_\alpha$ be a sequence of messages such that
$\mathrm{tag}(m_i) < \mathrm{tag}(m_{i+1})$ for $1 \le i < \alpha$. Suppose that S is a path in the network such that
during delivery of all messages, message $m_i$ passes through the $j_i$th switch along S,
i.e., through switch $S(j_i)$. Suppose, finally, that $j_i \le j_{i+1}$ for all i, $1 \le i < \alpha$. Then the
message sequence M is said to be polarized along S.
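The definition can be checked mechanically. The sketch below assumes the reading in which tags strictly increase along the sequence while the switch positions along S never decrease; the function name `is_polarized` and its argument layout are hypothetical.

```python
def is_polarized(tags, positions):
    """Check the polarization condition for a message sequence.

    tags[i] is the tag of the ith message; positions[i] is the index along
    the path S of the switch through which that message passes.
    """
    if len(tags) != len(positions):
        raise ValueError("need one position per message")
    tags_increase = all(a < b for a, b in zip(tags, tags[1:]))
    positions_ordered = all(p <= q for p, q in zip(positions, positions[1:]))
    return tags_increase and positions_ordered
```

Two messages may touch the path at the same switch, which is why the position test is not strict.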
The main result of this section is the following theorem, which relates the emula- 
tion time and queue size to the length of polarized sequences. 
THEOREM 2. Suppose that a set of memory requests takes time $6n + \delta$. If each
queue is of size b, then there exists an input-output path $\Gamma$ of length $6n + 2\delta/b$ and
a sequence M of $\delta$ distinct request messages which is polarized along $\Gamma$.⁴
Before we prove this theorem, we need a characterization of the algorithm in 
terms of how messages delay one another. 
LEMMA 6. Suppose a message m waits in a switch s at time t. Then one of the
following is true.
1. There exists a message $m'$ that was transmitted out of s at time t with
$\mathrm{tag}(m') < \mathrm{tag}(m)$.
2. At time t, the queue in the switch $s'$ in the next level is full and holds
messages $m'_1, \ldots, m'_b$ with $\mathrm{tag}(m'_1) < \cdots < \mathrm{tag}(m'_b) < \mathrm{tag}(m)$.
3. At time t some queue in s was empty and $t - \mathrm{level}(s) \le 0$.
In the first case, we say that $m'$ m-delays m in s at t, and in the second, $m'_1$ through
$m'_b$ b-delay m in $s'$ at t.
The lemma is a direct consequence of Lemmata 2 and 3. 
4.1. Constructing Long Polarized Sequences 
We present an incremental construction for the polarized sequence (Fig. 5). The 
first message in the sequence is the one that was delayed most. Informally, suc- 
cessive messages in the sequence can be thought of as being responsible for the 
delay of the preceding messages in the sequence. Our construction identifies a 
message sequence $m_1, \ldots, m_v$. The first $\delta$ elements of this, i.e., $m_1, \ldots, m_\delta$, form the
polarized sequence. We use auxiliary sequences $s_1, \ldots, s_v$, $m'_1, \ldots, m'_v$, and $t_0, t_1, \ldots, t_v$
of switches, messages, and times, respectively, to facilitate the discussion. We start
with $m'_1$ being the (request) message that was not delivered until step $t_0 = 6n + \delta$.
In general, given $m'_i$ and $t_{i-1}$, we show how the sequences can be extended. If $m'_i$
is not a ghost, then we set $m_i = m'_i$. If $m'_i$ is a ghost, we follow it back until we reach
the switch in which $m'_i$ was created from some $m_i$. In either case we follow $m_i$ back
⁴ Throughout this paper we assume for simplicity that y divides x whenever we write x/y. This
assumption will not be valid in general, but our results can be extended easily. This affects only the
constant factors.
FIG. 5. Polarized sequence. (Legend: path S; message path; columns run from 0 to 6n.)
until time $t_i$, when it was forced to wait in some switch $s_i$. The next message in the
sequence is identified by using Lemma 6. If $m_i$ was m-delayed by $m'$ in $s_i$, we set
$m'_{i+1} = m'$. Suppose $m_i$ was b-delayed by $m''_1, m''_2, \ldots, m''_b$ in $s'$. Then we set
$m_{i+j} = m'_{i+j} = m''_j$, $s_{i+j} = s'$, and $t_{i+j} = t_i$, for $j = 1, \ldots, b - 1$, and $m'_{i+b} = m''_b$. If some
queue in $s_i$ was empty at $t_i$, or if $t_i = 0$, we terminate the construction.
The incremental construction extends each sequence by 1 element or by b 
elements, depending upon whether there was an m-delay or a b-delay. We apply the 
construction until a total of $\delta/b$ b-delays are encountered, or the construction
terminates itself.⁵
Define the lag of a message m in switch s at time t as lag(m, t) = t-level(s). The 
lag is a lower bound on the amount of time the message waited in queues before 
step t. The key observation is that during each step of the construction the lag of 
the messages reduces by 1 or 2. 
LEMMA 7. Consider an incremental step in which we start with $m'_i$. Then:
1. Suppose $m_i$ was m-delayed. Then
$$\mathrm{lag}(m'_i, t_{i-1}) = \mathrm{lag}(m_i, t_i) + 1 = \mathrm{lag}(m'_{i+1}, t_i) + 1.$$
2. Suppose $m_i$ was b-delayed. Then
$$\mathrm{lag}(m'_i, t_{i-1}) = \mathrm{lag}(m_i, t_i) + 1 = \mathrm{lag}(m'_{i+b}, t_i) + 2.$$
Proof. Since there is no waiting between $t_{i-1}$ and $t_i + 1$, we get $\mathrm{lag}(m'_i, t_{i-1}) =
\mathrm{lag}(m_i, t_i + 1)$. But since $m_i$ waits at $t_i$, we have $\mathrm{lag}(m_i, t_i) + 1 = \mathrm{lag}(m_i, t_i + 1) =
\mathrm{lag}(m'_i, t_{i-1})$. For m-delays, we know that $m_i$ and $m'_{i+1}$ are in the same switch
⁵ In the earlier version of the paper, the construction was terminated after n b-delays. The present
version gives better constants for the time and queue size and is adapted from Leighton, Maggs, and
Rao [10].
at $t_i$ and hence must have identical lags. For b-delays, we get $\mathrm{lag}(m_i, t_i) =
\mathrm{lag}(m'_{i+b}, t_i) + 1$, since $m'_{i+b}$ is on the next level. ∎
LEMMA 8. The length v of the sequence $m_1, \ldots, m_v$ is at least $\delta$.
Proof. Let j denote the number of incremental steps used in the construction,
of which $f \le \delta/b$ involved b-delays, and the remaining $j - f$ involved m-delays.
Suppose $f = \delta/b$. We know that each b-delay adds b elements, and thus $v \ge fb = \delta$.
Else, we have $f < \delta/b$. Then we know that the construction was terminated
because some queue was found empty, or $t_v = 0$. By Lemma 3, a queue in level i can
only have been empty before time i. Thus in either case we get $\mathrm{lag}(m_v, t_v) \le 0$. We
know that $\mathrm{lag}(m'_1, t_0) \ge \delta$. By applying Lemma 7 j times we get $\mathrm{lag}(m'_1, t_0) -
\mathrm{lag}(m_v, t_v) = (j - f) + 2f = j + f$. Thus $j + f \ge \delta$. But $v \ge j + f(b - 1) \ge j + f \ge \delta$,
assuming $b \ge 2$. ∎
LEMMA 9. Consider the path starting from $s_1$ and passing through $s_2, \ldots, s_v$ in that
order such that the segment between $s_{i-1}$ and $s_i$ consists of the path of $m'_i$. The total
length is at most $6n + 2\delta/b$.
Proof. The path has at most $\delta/b$ forward edges. Since it goes back at most 6n
levels, its total length is at most $6n + 2\delta/b$. ∎
We now prove Theorem 2. 
Proof of Theorem 2. The switches and the messages belonging to the polarized
sequence are obtained by taking the first $\delta$ elements of the sequences $m_1, \ldots, m_v$ and
$s_1, \ldots, s_v$. The sequence of tags is $\mathrm{tag}(m_1), \ldots, \mathrm{tag}(m_\delta)$. This is in decreasing order by
construction. The polarization path is obtained from Lemma 9. This has length at
most $6n + 2\delta/b$, as required. To complete the proof we observe that all $m_i$ are real
messages, i.e., not EOS or ghost messages, since they delay other messages as well
as wait in queues. ∎
5. LARGE DELAYS ARE UNLIKELY 
THEOREM 3. For every $k_1$, there exists a constant $k_2$ independent of N such that
every message is routed in time $k_2 n$ with probability at least $1 - N^{-k_1}$, with
$\zeta = (k_2 - 6)n$, and queue size $b \ge 2$.
Proof. We estimate the probability that the time required is more than $k_2 n$. In
all such events, we can find a polarized sequence with $\delta = (k_2 - 6)n$ messages. Consider a polarized sequence in which the messages have tags $z_1, \ldots, z_\delta$. First we count
the number of ways in which tags can be selected so that the corresponding 
messages form a polarized sequence. Then we estimate the probability of the 
occurrence of each acceptable tag choice. 
A polarized sequence can be constructed by choosing a path $\Gamma$, choosing
switches where messages touch, choosing the source processors for the touching 
FIG. 6. Number of ways of choosing rows of source processor and destination module of message 
m, is 2’2”-‘= 2”. 
messages, and choosing the values of g(x,) for each chosen message, where xi is the 
PRAM address for the ith message. This completely determines the tags, since we 
know the PRAM address requested by each processor. The number of possible tag 
choices can therefore be counted as follows. 
1. The path r can originate at any switch in column 6n and consists of
$6n + 2f$ displacements, of which $f = \delta/b$ are forward. Each forward or backward
displacement can be along any of two edges. Thus the total number of possible choices
is at most $2^{n}\binom{6n+2f}{f}2^{6n+2f} = 2^{7n+2f}\binom{6n+2f}{f}$.
2. We must choose $\delta$ not necessarily distinct switches on r where the
messages touch. Since there are $6n + 2f$ switches the total number of choices is at
most $\binom{6n+2f+\delta}{\delta}$. Let $s_i$ denote the ith chosen switch starting from the origin of r.
3. Let $m_i$ be the message touching $s_i$ and accessing PRAM location $x_i$. We
describe how to choose the source processors and $g(x_i)$ for each $m_i$. Given that $s_i$
lies on the path of $m_i$, the row in which $m_i$ originated and the row in which module
$h(x_i)$ lies can together be chosen in $2^n$ ways (Fig. 6). This is independent of what
phase $s_i$ belongs to. The column in which $m_i$ originated can be chosen in $n + 1$
ways. Note that this fixes $x_i$ and also the least significant n bits of $g(x_i)$. Because
of the polarization property, $g(x_1) \ge g(x_2) \ge \cdots \ge g(x_\delta)$. Thus the leading bits of
$g(x_i)$ must be chosen as a nonincreasing sequence.
Since there are $P/2^n$ choices for each, the total number of choices for all the leading
bits is $\binom{P/2^n+\delta}{\delta}$. Thus the total number of choices is
$$\Bigl(\prod_{i=1}^{\delta} 2^{n}\Bigr)\Bigl(\prod_{i=1}^{\delta}(n+1)\Bigr)\binom{P/2^n+\delta}{\delta} = N^{\delta}\binom{P/2^n+\delta}{\delta}.$$
The total number of tag choices that give a polarized sequence is at most
$$2^{7n+2f}\binom{6n+2f}{f}\binom{6n+2f+\delta}{\delta}\,N^{\delta}\binom{P/2^n+\delta}{\delta}.$$
Fixing the tags is equivalent to fixing $g(x_i)$ for $\zeta = \delta$ distinct $x_i$. We know that
for $1 \le i \le \zeta$ and $0 \le x_i, y_i < P$, there is at most one polynomial g of degree $\zeta - 1$
over the field $Z_P$ that satisfies these choices. Since the total number of polynomials
is $P^{\zeta}$, the probability that a given polarized sequence occurs is $1/P^{\delta}$.
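The uniqueness fact used here can be checked on a toy instance. The following is a minimal sketch, not part of the paper's construction: the prime P = 13 and the three sample points are illustrative assumptions, and `interpolate_mod_p` and `evaluate` are hypothetical helper names. It builds the polynomial of degree less than $\zeta$ through $\zeta$ points with distinct $x_i$ by Lagrange interpolation over $Z_P$.

```python
def evaluate(coeffs, x, p):
    """Evaluate a polynomial (coefficients lowest degree first) at x over Z_p."""
    value = 0
    for a in reversed(coeffs):          # Horner's rule
        value = (value * x + a) % p
    return value

def interpolate_mod_p(points, p):
    """Coefficients (lowest degree first) of the polynomial of degree
    < len(points) over Z_p through the given (x, y) points.
    Assumes p is prime and the x values are distinct modulo p."""
    coeffs = [0] * len(points)
    for i, (xi, yi) in enumerate(points):
        basis = [1]                      # running product of (x - xj), j != i
        denom = 1                        # running product of (xi - xj), j != i
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            # Multiply basis by (x - xj): new[k] = basis[k-1] - xj * basis[k].
            basis = [(shifted - xj * same) % p
                     for shifted, same in zip([0] + basis, basis + [0])]
            denom = denom * (xi - xj) % p
        scale = yi * pow(denom, -1, p) % p   # modular inverse (Python 3.8+)
        for k, b in enumerate(basis):
            coeffs[k] = (coeffs[k] + scale * b) % p
    return coeffs

# Illustrative values only: P = 13, zeta = 3 constraints g(x_i) = y_i.
P = 13
points = [(1, 5), (2, 3), (7, 11)]
g = interpolate_mod_p(points, P)
```

Of the $P^{\zeta}$ polynomials with $\zeta$ coefficients, exactly one passes through the chosen points, which is where the factor $1/P^{\zeta}$ in the probability comes from.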
Let Pr(t) denote the probability that the total delivery time is at least t. Then
$$\Pr(6n+\delta) \le 2^{7n+2f}\binom{6n+2f}{f}\binom{6n+2f+\delta}{\delta}\,N^{\delta}\binom{P/2^n+\delta}{\delta}\,\frac{1}{P^{\delta}}. \tag{1}$$
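By choosing $b \ge 2$ we get $f \le \delta$ and thus $\binom{6n+2f}{f} \le \binom{6n+2f+\delta}{\delta} \le \binom{6n+3\delta}{\delta}$. By using
$\binom{n}{r} < (ne/r)^{r}$ we can bound the binomial coefficients in (1).
By choosing $P > nN$ and $\delta = kn$, where k is large enough, we get
$$\Pr(6n + kn) \le \left(\tfrac{1}{2}\right)^{kn} < N^{-k_1}.$$
Thus for arbitrary $k_1$, there exists $k_2$ independent of N such that the probability of
requiring at least $k_2 n$ time for routing is less than $N^{-k_1}$. ∎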
In the above we used $\zeta = \delta$. It suffices to have $\zeta = O(n)$ independent of $\delta$, but it
needs a more complex analysis [9, 14, 15], and it also leads to larger constant
factors.
5.1. Constant Factors 
We can use inequality (1) to estimate performance for different buffer sizes: 
b = 2-With probability at least 1 - 1/N, a single step is emulated in fewer than
87n < 87 log N steps.
b = t&With probability at least 1 - l/N, a single step is emulated in fewer 
than 28n < 28 log N steps. 
b = XL-With probability at least 1 - l/N, a single step is emulated in fewer 
than 23n < 23 log N steps. 
b = ∞-With probability at least 1 - 1/N, a single step is emulated in fewer
than 15n < 15 log N steps. For this we need to use f = 0 and also further simplify
the inequality as follows. First note that the number of ways of choosing the
polarization path is at most $2^{3n}$, since we only need to specify the nodes it touches
in columns 0, 3n, and 6n. Second, we only need to consider nodes in columns n
through 2n and 4n through 5n for points of contact.
The numbers mentioned above can be improved somewhat by using a more 
detailed analysis. On the basis of simulation experiments using random PRAM 
instructions, the time to emulate a single instruction is found to be approximately 
12 log N, with queues of size 2 [15, 161. 
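To see how inequality (1) turns into concrete constants, its right-hand side can be evaluated numerically in the log domain. The sketch below is illustrative only and rests on several assumptions that are not stated in this section: it reads (1) as $2^{7n+2f}\binom{6n+2f}{f}\binom{6n+2f+\delta}{\delta}N^{\delta}\binom{P/2^n+\delta}{\delta}/P^{\delta}$, takes $N = (n+1)2^n$, and uses $P = nN + 1$ as a stand-in for a prime slightly larger than $nN$ (only the magnitude of P matters here). The function names are hypothetical.

```python
import math

def log2_binom(top, bottom):
    """log2 of the binomial coefficient C(top, bottom), via lgamma so that
    non-integer arguments such as P / 2**n + delta are accepted."""
    return (math.lgamma(top + 1) - math.lgamma(bottom + 1)
            - math.lgamma(top - bottom + 1)) / math.log(2)

def log2_delay_bound(n, b, delta):
    """log2 of the right-hand side of inequality (1), under the assumptions
    N = (n + 1) * 2**n and P = n * N + 1 (both are assumptions of this sketch)."""
    N = (n + 1) * 2**n
    P = n * N + 1
    f = delta / b                        # number of forward displacements
    leading = P / 2**n                   # choices for the leading bits of g(x)
    log2_N = n + math.log2(n + 1)
    return ((7 * n + 2 * f)
            + log2_binom(6 * n + 2 * f, f)
            + log2_binom(6 * n + 2 * f + delta, delta)
            + delta * log2_N
            + log2_binom(leading + delta, delta)
            - delta * math.log2(P))

# Example: n = 1000, b = 2, delta = 81n (total time 6n + delta = 87n).
n = 1000
log2_N = n + math.log2(n + 1)
bound_at_87n = log2_delay_bound(n, 2, 81 * n)
bound_at_16n = log2_delay_bound(n, 2, 10 * n)
```

Under these assumptions the bound is vacuous for small $\delta$ and drops below $1/N$ once $\delta$ is a large enough multiple of n; for small n the lower-order terms matter, so the asymptotic constants need not hold there.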
5.2. Permutation Routing
Suppose processor i wants to send a message to processor $\pi(i)$, where $\pi$ is a
permutation on {0, ..., N - 1}. We use the six-phase scheme discussed above. In the
first three phases, the message from processor i is moved as if it were headed to
location i. In the last three phases, instead of using the direction bits and returning
to processor i, the message is moved to $\pi(i)$. For all six phases, (g(i), i) is used as
tag. The construction of the polarized sequence and the analysis presented above
are applicable with minor modifications.
COROLLARY 1. Suppose every processor i is to send a message to processor $\pi(i)$,
where $\pi$ is a permutation. Then for every $k_1$, there exists a constant $k_2$ independent
of N such that every message is routed in time $k_2 n$ with probability at least $1 - N^{-k_1}$
and with queue size $b \ge 2$.
6. EXTENSIONS 
Our algorithm can easily be extended to emulate concurrent write instructions 
with combining operators (i.e., if several processors seek to write to a common 
location, their values are combined together using some operator, e.g., addition) 
and instructions like fetch-and-add [6]. In fact, Ranade, Bhatt, and Johnsson
[15, 16] propose a more powerful instruction called the Multiprefix and show how
it can be emulated with little extra hardware.
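The intended semantics of these instructions can be stated as a sequential reference model. The sketch below is only that: it says what one PRAM step should compute, not how the butterfly combines messages in flight. The function names, the dictionary representation of memory, and the choice of request order as the serialization order for fetch-and-add are assumptions made for illustration.

```python
import operator

def combining_write(memory, requests, combine=operator.add):
    """One concurrent-write step: requests is a list of (address, value).
    Values aimed at the same address are combined with `combine`, and the
    combined value is what gets written."""
    combined = {}
    for address, value in requests:
        if address in combined:
            combined[address] = combine(combined[address], value)
        else:
            combined[address] = value
    memory.update(combined)
    return memory

def fetch_and_add(memory, requests):
    """One fetch-and-add step: each request (address, increment) receives the
    value it would see if the requests were served one at a time, here in
    request order (an assumption; any serial order is a valid outcome)."""
    results = []
    for address, increment in requests:
        old = memory.get(address, 0)
        results.append(old)
        memory[address] = old + increment
    return results

# Three processors write; two of them target location 3.
written = combining_write({}, [(3, 1), (3, 2), (4, 5)])
# Two processors increment location 5, one increments the empty location 7.
shared = {5: 10}
fetched = fetch_and_add(shared, [(5, 1), (5, 2), (7, 4)])
```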
Leighton, Maggs, Ranade, and Rao [9] have used the ideas in this paper to 
build a general paradigm for developing probabilistic algorithms for routing and 
sorting (also see [lo, 151). In particular, they show how to route permutations 
and sort using constant sized queues in time proportional to the diameter for 
multidimensional meshes, shuffle-exchange networks, hypercubes, butterflies, and 
cube-connected-cycles networks. The paradigm also leads to the construction of 
area- and volume-universal networks [11].
APPENDIX A 
Memory Module Sizes and Local Addressing 
The address map of Section 3.1 specifies only the butterfly memory in which a 
particular PRAM location is to be placed. It does not specify the location within 
the memory that will hold the address. The map also has the drawback that it 
might cause too many PRAM locations, as many as $\zeta = O(n)$, to get mapped onto
the same memory. This is not acceptable if the size of the shared memory is small, 
i.e., if M= kN, for some constant k. Here we briefly sketch a scheme in which each 
shared memory location is assigned a unique location in the butterfly memory, and 
further, the memory required at each node is O(M/N). 
Given the polynomial coefficients $a_i$, define the hash address of x to be
$$a(x) = \Bigl(\Bigl(\sum_{0 \le i < \zeta} a_i x^{i}\Bigr) \bmod P\Bigr) \bmod M.$$
The hash address a(x) is interpreted as a global address in the butterfly, i.e., it
corresponds to location $\lfloor a(x)/N \rfloor$ of module $h(x) = a(x) \bmod N$, assuming M is a
multiple of N. The function a thus maps the PRAM location x in the range
0, ..., M - 1 onto a butterfly address a(x) in the range 0, ..., M - 1. We call this
region of the butterfly memory the hash table area. The butterfly location a(x) cannot
in general be used to store the contents of PRAM location x, because as many
as $\zeta$ PRAM locations may have the same hash address. To handle this, we use the
butterfly memory beyond address M - 1 as an overflow area.
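As a small illustration of the address map, the sketch below evaluates the polynomial by Horner's rule and splits the hash address into a module number and a local location. The coefficient values, the prime P = 101, and the sizes M = 16 and N = 4 are made-up numbers for the example, and the function names are hypothetical.

```python
def hash_address(x, coeffs, P, M):
    """a(x) = ((sum of coeffs[i] * x**i) mod P) mod M, with coeffs[i] = a_i."""
    value = 0
    for a in reversed(coeffs):           # Horner's rule, reduced mod P each step
        value = (value * x + a) % P
    return value % M

def locate(x, coeffs, P, M, N):
    """Return (module, local location) for PRAM address x, assuming N divides M:
    module h(x) = a(x) mod N and local location floor(a(x) / N)."""
    a = hash_address(x, coeffs, P, M)
    return a % N, a // N

# Illustrative parameters only.
coeffs = [3, 1, 4]                       # a_0, a_1, a_2
P, M, N = 101, 16, 4
module, local = locate(5, coeffs, P, M, N)
```

Several PRAM addresses can share one hash address, which is why the hash table entry cannot hold the data itself and the overflow area described next is needed.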
Every location a(x) in the hash table area points to a group of locations in the 
overflow area. These locations hold the PRAM locations that get mapped into a(x).
Each location in the overflow area holds pairs of the form (x, data in PRAM 
location x). Thus PRAM location x is accessed by searching through the overflow 
locations associated with the hash table location a(x). We now describe how to 
allocate overflow area for each hash table location. 
A.1. Preliminary Scheme
In this scheme every location a(x) in the hash table is assigned $\zeta$ locations in the
overflow area whether or not there are $\zeta$ PRAM locations mapped onto it. These
$\zeta$ locations consist of $\zeta/(n+1)$ locations from each memory in the row of the butterfly
to which a(x) belongs. Thus if location a(x) is assigned memory locations i
through $i + \zeta/(n+1) - 1$ in each memory in its row, then a(x) is set to i.
Figure 7 shows the memory layout for some row in the butterfly, for the case
$n + 1 = 3$, $M/N = 2$, and $\zeta = 24$. Thus locations 2 through 9 are allotted
to location 0 of memory $M_1$, locations 10 through 17 to location 1 of $M_1$, locations
18 through 25 to location 0 of $M_2$, etc. The starred squares indicate the locations
that are actually used. The figure shows that four shared memory locations are
mapped onto location 0 of $M_1$, one onto location 1, and two onto location 0
of $M_2$.
FIG. 7. Memory layout for a row in the butterfly. Local addresses 0 and 1 of memories $M_1$, $M_2$, and $M_3$ form the hash table and hold the pointers 2 and 10, 18 and 26, and 34 and 42, respectively; the remaining local addresses form the overflow area.
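The pointer values in this example follow a simple pattern, sketched below. The formula is inferred from the numbers quoted for Figure 7 rather than stated in the text, so it should be read as one consistent way to lay out the groups; the function name and the zero-based memory index are assumptions.

```python
def overflow_group(memory_index, local_address, n_plus_1, m_over_n, zeta):
    """Local address range (first, last) of the overflow group assigned to the
    hash table entry at `local_address` of memory number `memory_index`
    (0-based) in a row. The same range is reserved in every memory of the row.
    Assumes zeta is a multiple of n + 1."""
    group = zeta // n_plus_1                  # cells per entry in each memory
    entry = memory_index * m_over_n + local_address
    first = m_over_n + entry * group          # overflow starts after the hash table
    return first, first + group - 1

# Parameters of the Figure 7 example: n + 1 = 3, M/N = 2, zeta = 24.
groups = {(m, a): overflow_group(m, a, 3, 2, 24)
          for m in range(3) for a in range(2)}
```

With these parameters every memory in the row needs 2 hash table cells plus 6 groups of 8 overflow cells, which is the wasted space that the improved scheme of Section A.2 removes.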
Memory access. In order to access shared memory location x, it is sufficient to
search $\zeta/(n+1)$ locations in every memory in the row of a(x). Which locations to
search is indicated by a(x). This only requires a minor modification to the access
scheme described in Section 3.3: the location a(x) is read in phase 3, and the search
through the overflow area takes place in phase 4. Thus, at each node in phase 4, the
$\zeta/(n+1)$ locations specified by a(x) are searched. Eventually the message reaches a
node that holds PRAM location x, at which point the data field of the message is
updated. The message movement is not affected, since each memory needs to be
accessed a constant number $\zeta/(n+1)$ of times per message.
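The access procedure can be sketched sequentially as follows. This is a simplified model under stated assumptions: a row is a list of memories, each memory is a Python list whose first M/N cells hold pointers and whose remaining cells hold (address, data) pairs or None, and the helpers `build_row`, `store`, and `lookup` are hypothetical names. The real scheme performs the search as the message passes through the nodes of the row in phase 4.

```python
def build_row(memories, hash_entries, group):
    """Empty row: each memory has `hash_entries` pointer cells followed by an
    overflow area with one group of `group` cells per hash entry of the row."""
    size = hash_entries + memories * hash_entries * group
    row = [[None] * size for _ in range(memories)]
    for m in range(memories):
        for local in range(hash_entries):
            row[m][local] = hash_entries + (m * hash_entries + local) * group
    return row

def store(row, memory_index, local_address, x, data, group):
    """Place (x, data) in the first free cell of the entry's overflow group."""
    start = row[memory_index][local_address]      # pointer read from a(x)
    for memory in row:
        for cell in range(start, start + group):
            if memory[cell] is None:
                memory[cell] = (x, data)
                return True
    return False                                   # group is full

def lookup(row, memory_index, local_address, x, group):
    """Search the group of the entry in every memory of the row for address x."""
    start = row[memory_index][local_address]
    for memory in row:                             # one visit per memory
        for cell in memory[start:start + group]:
            if cell is not None and cell[0] == x:
                return cell[1]
    return None

# Figure 7 sizes: 3 memories per row, 2 hash entries each, groups of 8 cells.
row = build_row(3, 2, 8)
store(row, 1, 0, 77, "value of x = 77", 8)
found = lookup(row, 1, 0, 77, 8)
```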
A.2. Improved Scheme 
Since each hash table entry holds a pointer to a location in the overflow area, we 
do not need the empty spaces in the overflow area. The memory required per row 
is now proportional to the number of PRAM locations that get mapped into it. 
Since the hash function is randomly chosen, the following theorem easily follows. 
THEOREM 4. For any constant f, there exists a constant c (independent of N) such
that the probability of mapping more than cMn/N PRAM locations into any row of
the butterfly is less than $N^{-f}$.
Thus with extremely high probability it is sufficient to have a total memory of 
O(Mn/N) in every row. Thus O(M/N) memory is sufficient in every module, as 
desired. 
A.3. Address Computation 
The hash function has $\zeta = O(n)$ coefficients, and just storing all the coefficients on
each processor requires O(n) memory. However, this is not necessary. A node in
level i need only hold coefficients $i\zeta/(n+1)$ through $(i+1)\zeta/(n+1) - 1$. The
polynomial evaluation can be pipelined, requiring only an additional O(n) cycles.
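One way to picture the pipelined evaluation is to let a request carry a partial sum and the current power of x from level to level, with each level adding the terms for its own coefficients. The paper does not spell out the mechanism, so the sketch below is an assumption about how it could work; the function names and the sample coefficients are illustrative, and $\zeta$ is assumed to be a multiple of n + 1.

```python
def split_coefficients(coeffs, levels):
    """Give level i the coefficients i*zeta/levels .. (i+1)*zeta/levels - 1,
    where zeta = len(coeffs) is assumed to be a multiple of `levels`."""
    chunk = len(coeffs) // levels
    return [coeffs[i * chunk:(i + 1) * chunk] for i in range(levels)]

def pipelined_eval(x, per_level, P):
    """Evaluate sum(a_i * x**i) mod P by passing (partial, power) through the
    levels in order; each level touches only its own coefficients."""
    partial, power = 0, 1
    for level_coeffs in per_level:       # the request moves one level at a time
        for a in level_coeffs:
            partial = (partial + a * power) % P
            power = power * x % P
    return partial

# Illustrative values: zeta = 6 coefficients spread over n + 1 = 3 levels.
coeffs = [3, 1, 4, 1, 5, 9]
P = 101
per_level = split_coefficients(coeffs, 3)
value = pipelined_eval(7, per_level, P)
```

Each level stores only $\zeta/(n+1)$ coefficients and does a constant amount of work per request, so successive requests can follow one another through the levels.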
APPENDIX B 
Emulating Sequences of Instructions 
Our results can be extended to the emulation of multi-instruction programs, as 
done by Karlin and Upfal [8]. For emulating multiple instructions, the same
protocol is used, but now we must guard against the possibility that a particular
instruction might not complete in the allotted time. If this happens, a new hash
function h is chosen, all the variables are sent to their new locations, and the emulation
process resumes. We must also consider the time required to initialize the hash 
table area as described in Section A.2. 
THEOREM 5. Consider an N-processor CRCW PRAM with M shared memory
locations, M polynomial in N. Then an arbitrary T instruction program, $T \ge M/N$,
can be emulated on an N-node butterfly in time $O(T \log N)$ with probability tending
to 1 as N and/or T tend to $\infty$. Further the size of the memory required at each
butterfly node is O(M/N).
Proof. We first count the number of PRAM addresses that get mapped into
each entry of the new hash table. This can be done by using concurrent write
operations with binary addition as the combining operation. By doing a prefix over the
hash table we can determine the region in the overflow area associated with each
bucket. Finally each address is moved to its new place. Each of these steps can be
done in time $O((M/N) \log N)$ with high probability. Because M is polynomial in N,
we see from Theorem 3 that any PRAM instruction completes in time $c \log N$ with
probability at least $1 - M^{-f}$ for every f and some c independent of M. Thus the
interval between successive rehashing operations is at least $M^{f}$ with probability 1/2.
Thus with high probability no more than $8T/M^{f}$ rehashing operations are needed.
The time required for these is $O((1 + 8T/M^{f})(M/N) \log N)$. Because $T \ge M/N$, the
total emulation time is $O(T \log N)$ with high probability. ∎
The restriction $T \ge M/N$ is not serious because M/N instructions are required just
to access all the M locations. 
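The three rehashing steps in the proof (count per bucket, prefix over the counts, move each address) can be sketched sequentially as below. On the butterfly the counting would use combining writes and the prefix would be computed in parallel; this version only shows what each step produces. The function name, the flat overflow array, and the sample hash function are assumptions for illustration.

```python
from itertools import accumulate

def rebuild_hash_table(addresses, new_hash, table_size):
    """Lay out `addresses` (a list) in an overflow area grouped by bucket.
    Returns (counts, starts, overflow)."""
    # Step 1: count the addresses mapped to each bucket
    # (concurrent writes combined with addition in the parallel setting).
    counts = [0] * table_size
    for x in addresses:
        counts[new_hash(x)] += 1
    # Step 2: an exclusive prefix sum gives the start of each bucket's region.
    starts = [0] + list(accumulate(counts))[:-1]
    # Step 3: move every address to the next free slot of its region.
    overflow = [None] * len(addresses)
    next_free = list(starts)
    for x in addresses:
        bucket = new_hash(x)
        overflow[next_free[bucket]] = x
        next_free[bucket] += 1
    return counts, starts, overflow

# Illustrative run: 10 addresses, 4 buckets, a made-up hash function.
counts, starts, overflow = rebuild_hash_table(
    list(range(10)), lambda x: (3 * x + 1) % 4, 4)
```

Because the regions are packed back to back, the overflow area uses exactly one cell per stored address, matching the space bound of the improved scheme.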
ACKNOWLEDGMENTS 
I am immensely grateful to Sandeep Bhatt for his untiring help in writing this paper. The presentation 
in this paper has also benefitted considerably from discussions with Tom Leighton, Bruce Maggs, and
Satish Rao. I thank Lennart Johnsson for many discussions and support, David Greenberg and 
M. T. Raghunath for critical comments, and Tom Leighton for pointing out an error in an earlier 
version and for motivating Appendix A. 
REFERENCES 
1. R. ALELIUNAS, Randomized parallel communication, in “Proceedings, ACM SIGACT-SIGOPS 
Symposium on Principles of Distributed Computing,” August 1982, pp. 60-72.
2. H. ALT, T. HAGERUP, K. MEHLHORN, AND F. P. PREPARATA, Simulation of idealized parallel com- 
puters on more realistic ones, Preliminary Report, 
3. A. BORODIN AND J. E. HOPCROFT, Routing, merging and sorting on parallel models of computation, 
in “Proceedings, of STOC 82,” pp. 338-344, 1982. 
4. J. L. CARTER AND M. N. WEGMAN, Universal classes of hash functions, J. Comput. System Sci. 18
(1979), 143-154.
5. K. M. CHANDY AND J. MISRA, Distributed simulation: A case study in design and verification of 
distributed programs, IEEE Trans. Software Engrg. SE-5, No. 5 (September 1979), 440-452.
6. A. GOTTLIEB, B. D. LUBACHEVSKY, AND L. RUDOLPH, Coordinating large numbers of processors, 
in “Proceedings, 1981 International Conference on Parallel Processing, 1981.” 
7. K. T. HERLEY AND G. BILARDI, “Deterministic Simulations of PRAMS on Bounded Degree 
Networks,” Technical Report 88-951, Department of Computer Science, Cornell University, 
November 1988. 
8. A. KARLIN AND E. UPFAL, Parallel hashing-An efficient implementation of shared memory, in
“Proceedings, ACM Annual Symposium on Theory of Computing, 1986,” pp. 160-168. 
9. T. LEIGHTON, B. MAGGS, A. RANADE, AND S. RAO, Routing and sorting on fixed-connection 
networks, manuscript, October 1989. 
10. T. LEIGHTON, B. MAGGS, AND S. RAO, Universal packet routing algorithms, in “Proceedings, IEEE 
Annual Symposium on the Foundations of Computer Science, 1988.” 
11. C. E. LEISERSON, Fat-trees: Universal networks for hardware-efficient supercomputing, IEEE Trans.
Comput. C-34, No. 10 (October 1985), 892-901.
12. K. MEHLHORN AND U. VISHKIN, “Granularity of Parallel Memories,” Ultracomputer Note 59, 
New York University, October 1983. 
13. N. PIPPENGER, Parallel communication with limited buffers, in “Proceedings, IEEE Annual 
Symposium on Foundations of Computer Science, 1984,” pp. 127-136. 
14. A. G. RANADE, How to emulate shared memory, in “Proceedings, IEEE Symposium on the Founda- 
tions of Computer Science, 1987;” TR-578, Computer Science Department, Yale University, 1987. 
15. A. G. RANADE, “Fluent Parallel Computation,” Ph.D. thesis, Yale University, TR-663, Department 
of Computer Science, Yale University, 1988. 
16. A. G. RANADE, S. N. BHATT, AND S. L. JOHNSSON, The Fluent Abstract Machine, in “Proceedings, 
Fifth MIT Conference on Advanced Research in VLSI, March 1988,” pp. 71-94; TR-573, Depart- 
ment of Computer Science, Yale University. 
17. J. REIF AND L. VALIANT, A logarithmic time sort for linear size networks, J. Assoc. Comput. Mach. 
34, No. 1 (January 1987), 60-76.
18. E. UPFAL, Efficient schemes for parallel communication, in “Proceedings, ACM SIGACT-SIGOPS 
Symposium on Principles of Distributed Computing, August 1982,” pp. 55-59. 
19. E. UPFAL, A probabilistic relation between desirable and feasible models of parallel computation, in 
“Proceedings, ACM Annual Symposium on Theory of Computing, 1984,” pp. 258-265. 
20. E. UPFAL AND A. WIGDERSON, How to share memory in a distributed system, in “Proceedings, IEEE
Annual Symposium on the Foundations of Computer Science, 1984,” pp. 171-180. 
21. L. G. VALIANT AND G. J. BREBNER, Universal schemes for parallel communication, in “Proceedings, 
ACM Annual Symposium on Theory of Computing, 1981,” pp. 263-277. 
