We present a simple algorithm for emulating an N processor CRCW PRAM on an N node butterfly. Each step of the PRAM is emulated in time O(log N) with high probability, using FIFO queues of size O(1) at each node. The only use of randomization is in selecting a hash function to distribute the shared address space of the PRAM onto the nodes of the butterfly. The routing itself is both deterministic and oblivious, and messages are combined without the use of associative memories or explicit sorting. As a corollary we improve the result of Pippenger [8] by routing permutations with bounded queues in logarithmic time, without the possibility of deadlock. Besides being optimal, our algorithm has the advantage of extreme simplicity and is readily suited for use in practice.
Introduction
Concurrent-read concurrent-write parallel random access machines (CRCW PRAMs) allow an arbitrary number of processors to read or write a common memory location in one time step. Complex communication operations, such as broadcast and multicast, can be implemented in one step. The facility to succinctly express complex communication patterns greatly simplifies the task of both designing algorithms and writing programs. For this reason, the CRCW PRAM model is favored over weaker abstract models for which most, if not all, of the algorithmic and programming effort is spent synchronizing the movement of data.
Unfortunately, it is unlikely that CRCW PRAMs will ever faithfully model any real parallel machine. Any real parallel computer will most likely consist of a large number of small processors, each connected to a small number of other processors. For the network to scale in size, we require that the complexity of the individual processors be independent of the size (number of processors) of the network. More specifically, by a realistic parallel computer we mean a network of N processors, with each processor connected to no more than a fixed number (say 4) of processors. Each processor in this network has its own local memory, and processors communicate by sending messages over links to neighboring processors. Finally, each processor can accommodate only a fixed (constant, independent of N) number of messages at any time.
0272-5428/87/0000/0185$01.00 © 1987 IEEE
How can we reconcile the convenience of CRCW PRAMs with the limitations of a real computer? The only alternative is to emulate a CRCW PRAM on a real network. Such an emulation has two components:
• Message routing - Routing memory requests (read/write) from processors to distant memory locations, and data from the location back to the processors. Once we have fixed our address map, each memory access is accomplished by sending a message from the processor requesting the access to the processor holding the memory location.
Three measures determine the efficiency of an emulation. The first is time, the number of steps on the network needed to emulate one step of the PRAM. The second is queue-size, the amount of additional hardware per processor required to hold messages in queues while in transit. The third is the complexity of managing the queues at each processor: a first-in first-out queue is less complicated than a priority queue or a queue requiring associative lookups. A simple queueing strategy is clearly preferable to one requiring complex operations.
How much time must an emulation take? Because the diameter of any bounded-degree network on N nodes must be Ω(log N), this is clearly a lower bound on the time to emulate one step of a CRCW PRAM. It is shown in [2] that any deterministic emulation must take time at least Ω(log² N / log log N). For any deterministic routing scheme that is also oblivious (the route of each message is completely determined by the source and destination), Borodin and Hopcroft [3] give a worst-case lower bound of Ω(√N).
A number of emulations have been developed in recent years [2, 5, 6, 11, 12]. The best known deterministic strategy [2] for emulating an N node CRCW PRAM with M shared variables on an N node bounded-degree graph takes time O(log² N). Better time bounds are obtained with randomized routing schemes [1, 10, 13]. Using random hash functions, a randomized routing strategy and the Reif-Valiant [9] probabilistic sorting scheme, Karlin and Upfal [5] presented a probabilistic emulation on an N node butterfly. The time complexity of their emulation is O(log N) and the queue-size is O(log N). Their queues are required to be built as priority queues, which are expensive.
This paper presents a probabilistic emulation whose time complexity is O(log N) and whose queue-size is O(1). The queues are first-in first-out, the simplest possible. We adapt the random hash functions of Karlin and Upfal [5] for the address map. In contrast, however, our routing scheme on the butterfly is completely deterministic. In fact, the routing scheme is also oblivious, which is rather surprising. Thus our scheme only requires O(log² N) random bits, rather than O(N log N) as in [5]. Besides being optimal with respect to time complexity, our emulation has the advantage of extreme simplicity.
We also note that this is the first emulation of CRCW PRAMs with bounded queue-size. Pippenger [8] showed how to route permutations on a butterfly with bounded queue-size in O(log N) time. His scheme allowed a small probability of deadlock. We obtain a deadlock-free solution for permutation routing as a simple corollary of our routing scheme. In contrast to [8], our routing scheme, as well as the accompanying analysis, is considerably simpler.
To summarize, we state the main result below:
The key idea in this paper is a simple and optimal strategy for combining requests that access the same memory location (section 2). Section 3 presents the emulation. Sections 4 and 5 analyze the routing scheme and the likelihood of large delays.
The address map used in section 3 and in [5] has some drawbacks. Under certain conditions it does not distribute the PRAM address space uniformly over the different memory modules in the butterfly. Further, it only assigns PRAM locations to memory modules, ignoring the problem of where the location is stored within the module. We overcome these problems in appendix A. In appendix B we extend our results to emulation of entire PRAM programs.
Suppose that several processors request to read a common memory location. (Concurrent write requests can be handled similarly, but for the sake of simplicity we will only consider concurrent reads in this paper.) Suppose further that the routes of these messages intersect to form a tree, as in Figure 1. Each message moves along the directed path from its source to the destination. Different messages may, in general, traverse common portions at different times.
There is, however, no need to send more than one read request along any branch of this tree. If a request simply waits at each tree node until another similar request appears along the other incoming edge (unless it "knows" that future requests along that edge must be for different memory locations), then the two requests can be merged, and one forwarded along the tree. Of course, the reply message must return backwards along each edge of the tree so that each requesting processor receives a reply. To accomplish this we only need, at each node, to store two direction bits to direct the reply either along the top branch, the bottom branch, or along both. This simple idea is more efficient than the associative memories discussed in [7]. The idea of message combination is not new [4].
How do we know that no future message arriving at a node will request a particular memory location? The key idea is to keep the messages leaving each processor sorted by destination. Figure 2 shows a snapshot of processors in a network at some point in time. Each receives messages along two incoming edges and places them into the corresponding FIFO queues. At each step the processor checks the messages at the heads of the two queues and compares their destination addresses. The message with the smaller destination address is transmitted along the appropriate outgoing edge, and two direction bits are stored accordingly. If both messages are destined for the same location, one request is sent out. Finally, if only one queue has a message waiting and the other queue is empty, no message is sent out. (If the message were sent, the next message along the other edge could conceivably have a smaller destination, thus violating the sorting requirement.)
In our snapshot at time T, processor A in Figure 2 selects the message destined for location 35. Then it waits until the message to location 48 arrives, at which point it discovers that the messages at the heads of both the queues are to location 48, and can be combined. Keeping messages sorted by destination also simplifies the task of replicating the reply when it returns (section 3.4).
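The selection rule described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: a message is reduced to its destination address, each switch has a single output, and the name `switch_step` and the queue contents are hypothetical (they are not read off Figure 2).

```python
from collections import deque

def switch_step(top, bottom):
    """One step of a combining switch with two FIFO input queues.

    Returns (destination, direction_bits) for the request sent out, or None
    if the switch must wait. direction_bits = (reply_to_top, reply_to_bottom).
    """
    if not top or not bottom:
        # An empty queue could still deliver a smaller destination later,
        # so sending now might break the sorted order.
        return None
    a, b = top[0], bottom[0]
    if a == b:
        # Same location: combine the two requests into one.
        top.popleft()
        bottom.popleft()
        return a, (True, True)
    if a < b:
        top.popleft()
        return a, (True, False)
    bottom.popleft()
    return b, (False, True)

if __name__ == "__main__":
    top, bottom = deque([35]), deque([48])
    print(switch_step(top, bottom))  # the request for 35 goes first
    print(switch_step(top, bottom))  # top queue is empty: wait
    top.append(48)                   # the second request for 48 arrives
    print(switch_step(top, bottom))  # both heads are 48: combine
```

The waiting step in the middle is exactly the inefficiency that the ghost messages of the next subsection are meant to remove.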
Ghost messages
The simple idea of keeping message streams sorted has one deficiency. Consider Figure 2 again. At time T, processor B cannot transmit the message it holds for location 25, because it must ensure that it will not receive a message to a smaller location in the future. When A selects the message to location 35 for transmission on one link, it can convey this information to B by sending a ghost message labelled 35. As soon as B receives the ghost message, it knows that future messages along that edge must be destined for locations greater than 35. Therefore, at the next time step B can forward the message waiting in the lower queue.
Ghost messages simply notify a processor of the minimum location to which subsequent messages can be destined. Ghosts are not used for any other purpose; they "keep the system going." In section 3.5 we specify the mechanisms for transmitting ghosts precisely. This simple idea turns out to be powerful enough to yield our main result.
The emulation
The bounded degree network on which we emulate CRCW PRAMs is the butterfly (also called the FFT network). The number of nodes in a butterfly with n levels is N = n2^n, and we will use this to emulate an N processor PRAM. We assume that levels 0 and n are identified, so that the butterfly is wrapped around. Each node in the butterfly has a processor, a memory module and a small number (6 or 7) of switches, each with up to 2 inputs and 2 outputs. Each input into a switch has a queue that can hold at most b messages. We will specify b later, but it will be a constant, independent of N.
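As a small illustration of the wrapped butterfly, the sketch below computes the node count and the two forward neighbors of a node. The edge convention (the cross edge out of column c flips bit c of the row) and the function names are assumptions made for this example; the text does not fix a particular labelling.

```python
def butterfly_size(n):
    """Number of nodes N = n * 2^n in a wrapped butterfly with n levels."""
    return n * 2 ** n

def forward_neighbors(c, r, n):
    """Forward neighbors of node (column c, row r).

    Assumed convention: the straight edge keeps the row, the cross edge
    flips bit c of the row; column n wraps around to column 0.
    """
    nxt = (c + 1) % n
    return [(nxt, r), (nxt, r ^ (1 << c))]

if __name__ == "__main__":
    print(butterfly_size(3))
    print(forward_neighbors(2, 5, 3))
```

Every node has two forward and two backward edges, so the degree is 4 regardless of N, which matches the bounded-degree requirement stated in the introduction.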
The Address Map
Suppose that we wish to emulate a CRCW PRAM with an address space of size M. The address map h is chosen from the following class of hash functions:
$$h(x) = \Big(\Big(\sum_{0 \le i < r} a_i x^i\Big) \bmod P\Big) \bmod N$$
Each $a_i \in Z_P$ is chosen randomly, and the number P is a fixed prime no less than the size M of the PRAM address space. The number r will be specified later.
Message structure
In each step of the emulation each processor (c, r) accesses a PRAM location, say x, which is placed in memory module h(x) = (c', r'). To accomplish the memory access, the processor sends a message to module h(x), and the module returns the message to the processor with the required data.
Each message has 3 fields: tag, type, and data. The tag for our message is (h(x), x), i.e., the number obtained by concatenating the number h(x) of the memory module which contains the shared memory location x, with x itself. The type field is one of REQUEST, EOS or GHOST. We assume that each node issues an end-of-stream message of type EOS in the time step right after it issues a memory access message of type REQUEST. The tag field of an end-of-stream message is always ∞. We will keep the messages entering and leaving every switch sorted by the tag field. An end-of-stream message arriving at a switch thus notifies the switch that no more requests will be issued to it from the corresponding incoming edge. (For simplicity we assume that all requests are read requests. Write requests are handled similarly.)
Message path
Each REQUEST traverses a path from its source processor to the destination module and back. This happens in 6 phases. As seen in Figure 3, each phase is a traversal of the butterfly. In the first three phases, each message traverses the butterfly in the forward direction. In the first phase, the message issued at node (c, r) is directed to node (0, r). In Phase 2, the message follows the unique (forward) path in the butterfly from node (0, r) to node (0, r'). This path can be determined by looking at the appropriate bits of the message tag. In Phase 3, the message reaches the node (c', r'), where it acquires the required data from the memory module. It continues to move forward through the row until it again reaches node (0, r').
In the last three phases, the message traverses its path in the reverse direction and returns to the node that initiated the request. The data requested has finally arrived.
For convenience, we describe the routing mechanism in terms of the logical network of Figure 3 instead of the butterfly. The correspondence between the two is clear, and each butterfly node does the work of 6 switches in the logical network (except nodes in column 0, which have 7 switches). The logical network has (6n + 1)2^n switches organized in 6n + 1 columns. These columns are numbered 0 through 6n from the input side; the rows are numbered as in the butterfly. A switch in column c and row r is numbered (c, r).
Since there is a unique (forward) path in the butterfly for any pair of nodes, it follows that the path traversed by each request is oblivious (once the address map is fixed).
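The forward half of the route (phases 1 to 3) can be traced with a short sketch. It assumes a particular edge convention that the text does not specify: the forward edges of node (c, r) lead to column (c + 1) mod n, either in the same row or in the row with bit c flipped. The function name `request_path` is hypothetical.

```python
def request_path(src, dst, n):
    """Nodes visited in phases 1-3 by a request from src=(c, r) to dst=(c', r').

    Assumed convention: node (c, r) has forward edges to ((c+1) % n, r)
    and ((c+1) % n, r ^ (1 << c)).
    """
    (c, r), (_, r2) = src, dst
    path = [(c, r)]
    # Phase 1: straight edges along row r until column 0 is reached.
    while c != 0:
        c = (c + 1) % n
        path.append((c, r))
    # Phase 2: n steps; leaving column c, bit c of the row is set to bit c of r'.
    for _ in range(n):
        bit = 1 << c
        r = (r & ~bit) | (r2 & bit)
        c = (c + 1) % n
        path.append((c, r))
    # Phase 3: one full trip along row r', which passes through (c', r').
    for _ in range(n):
        c = (c + 1) % n
        path.append((c, r))
    return path

if __name__ == "__main__":
    forward = request_path((1, 5), (2, 2), 3)
    # Phases 4-6 retrace the same nodes in reverse order.
    full = forward + forward[-2::-1]
    print(forward)
    print(len(full))
```

The route depends only on the source and on h(x), never on other traffic, which is what makes the scheme oblivious once the hash function is fixed.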
Lemma 1 For any processor-memory pair (c, r), (c', r'), there is a unique path in the logical network that starts at (c, r), passes through (c', r') and ends at (c, r). Furthermore, the sequence of nodes traversed in phases 4, 5 and 6 is the reverse of the sequence for phases 3, 2 and 1 respectively.
How messages are kept sorted
Using the simple idea of section 2, each switch in the logical network ensures that the messages leaving it are sorted by the tag field. This guarantees that messages to a common memory location are combined as soon as possible.
How do we return the data to all requesting processors? Consider an arbitrary switch s5 in phase 5. Let s2 be the phase 2 switch in the same butterfly node as s5. For each request that passes through s2, two direction bits are stored, which are used by s5 to route the reply bearing the data. This can be done because of the crucial observation that replies arrive in s5 in the same order that requests were sent out of s2. More precisely:
With this observation, it is not necessary to store the identities of the sources; a FIFO queue of the direction bits is adequate. This also applies to the switches in phases 1 and 6. Thus, each butterfly node requires additional FIFO queues between the phase 2 (1) and phase 5 (6) switches. It will be shown that the total time required by the emulation algorithm is O(n) with high probability, hence queues of O(n) bits are sufficient. This requires no more storage than O(1) messages.
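A minimal sketch of the direction-bit mechanism follows. The class and method names are hypothetical, and a reply is represented by an arbitrary Python value; the only point illustrated is that a FIFO of bit pairs suffices when replies return in the order in which requests left.

```python
from collections import deque

class DirectionBits:
    """FIFO of direction bits shared by a forward switch and its reverse twin."""

    def __init__(self):
        self.bits = deque()

    def record(self, from_top, from_bottom):
        """Called by the forward switch each time it sends a request onward."""
        self.bits.append((from_top, from_bottom))

    def route_reply(self, reply):
        """Called by the reverse switch; returns (copy for top, copy for bottom)."""
        from_top, from_bottom = self.bits.popleft()
        return (reply if from_top else None,
                reply if from_bottom else None)

if __name__ == "__main__":
    fifo = DirectionBits()
    fifo.record(True, False)   # a request that came in on the top edge only
    fifo.record(True, True)    # a combined request from both edges
    print(fifo.route_reply("data A"))
    print(fifo.route_reply("data B"))
```

No source addresses are stored; each forwarded request costs two bits, which is the basis of the O(n)-bit bound stated above.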
How ghosts appear and disappear
Suppose that a switch s determines that the message m it has selected for transmission needs to be sent on only one of its outputs. Suppose further that the other output connects s to a switch s1 which has space in its input queue. Then, as described in section 2, a GHOST message is sent to s1 with its tag equal to the tag of m. Now, switch s1 is informed that no subsequent messages received from s can have a smaller tag.
This GHOST message might itself be forwarded by s1 if it has the smallest tag of all the messages at the heads of the queues in s1. If, however, the GHOST is not forwarded immediately, then it need not be retained by s1. This is because s must send s1 another message (GHOST or otherwise) at the next step, whose tag cannot be smaller. The information in the new message is bound to be at least as strong as that in an old GHOST.
Lemma 3 A GHOST will never wait at any switch.
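The receiving side of this rule can be sketched as follows. This is an illustration under simplifying assumptions, not the switch design of the paper: tags are plain integers assumed distinct (combining is omitted), a switch has a single output, and the names `InputQueue` and `step` are hypothetical.

```python
from collections import deque

class InputQueue:
    """FIFO of real messages plus at most one ghost, kept for one step only."""

    def __init__(self):
        self.real = deque()
        self.ghost = None

    def receive(self, tag, is_ghost):
        if is_ghost:
            self.ghost = tag
        else:
            self.real.append(tag)

    def head(self):
        # Queued real messages were sent before any ghost behind them.
        if self.real:
            return self.real[0], False
        if self.ghost is not None:
            return self.ghost, True
        return None

def step(upper, lower):
    """Forward the head with the smaller tag; returns (tag, is_ghost) or None."""
    heads = [upper.head(), lower.head()]
    result = None
    if heads[0] is not None and heads[1] is not None:
        # On equal tags prefer the real message (False sorts before True).
        winner = min((0, 1), key=lambda i: (heads[i][0], heads[i][1]))
        result = heads[winner]
        if not result[1]:
            (upper, lower)[winner].real.popleft()
    # A ghost that was not forwarded in this step is dropped, never queued.
    upper.ghost = lower.ghost = None
    return result

if __name__ == "__main__":
    upper, lower = InputQueue(), InputQueue()
    lower.receive(25, False)
    print(step(upper, lower))   # upper input is silent: must wait
    upper.receive(35, True)     # ghost: nothing below 35 will follow
    print(step(upper, lower))   # the request for 25 can now go
```

Because a ghost is either forwarded in the step it arrives or discarded, it never occupies queue space across steps, which is the content of Lemma 3.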
GHOSTs also help "keep the system going." More precisely, after the first time that any switch sends out a message, it will continue to send out messages (ghosts or otherwise) along all its output edges until the time at which it sends out the end-of-stream message. With this observation, we have the following lemma. 
Message polarization and delay
We show that whenever message delivery takes a long time, there exists a long polarized sequence of messages. In the next section, we show that long polarized sequences are highly unlikely to occur. The notion of a polarized sequence is similar to that of the delay sequence of [5, 10] and the critical path of [1].
A path S in the logical network is a sequence {S(i)} of switches with the property that, for every i, the switch S(i) is connected to switch S(i + 1), and the switches S(i), S(

Constructing long polarized sequences
We will first construct a sequence $\{(\mu'_i, t'_i, \psi'_i)\}$ such that $\mathrm{tag}(\mu'_i) > \mathrm{tag}(\mu'_{i+1})$. This will give us a path S and a (possibly short) sequence $\{\mu'_i\}$ that is polarized along S. Next, we augment the sequence to get a longer sequence polarized along the same path S.
To describe how we construct our initial sequence, suppose that a set of memory requests takes time $6n + \delta$.
In general, given $(\mu_i, t_i, \psi_i)$ with $\mathrm{lag}(\mu_i, t_i, \psi_i) > 0$ it is possible to construct $(\mu_{i+1}, t_{i+1}, \psi_{i+1})$ as follows. We follow $\mu_i$ back in time starting from $t_i$. If $\mu_i$ is a ghost or a combination and we reach the switch at which $\mu_i$ was created, then we continue following one of the messages which caused $\mu_i$ to be created. We continue this process until we reach a time at which an ancestor of $\mu_i$ was last delayed. Unless $\mathrm{lag}(\mu_i, t_i, \psi_i) \le 0$, we are certain to reach some $\psi'_i$ at which some $\mu'_i$ (which might be $\mu_i$ or an ancestor) is forced to wait at some time $t'_i$.
Thus there must exist $(\mu_{i+1}, t_{i+1}, \psi_{i+1})$ which delayed $(\mu'_i, t'_i, \psi'_i)$. But there is no waiting between $t_i$ and $t'_i + 1$. Thus we have $\mathrm{lag}(\mu_i, t_i, \psi_i) = \mathrm{lag}(\mu'_i, t'_i, \psi'_i) + 1$.
We shall first identify some $T_j$ along which the messages are polarized. We know that for all j, $|T_j| = 6n + 2b_j - \mathrm{column}(\psi_j)$, and we will ensure $b_j \le n$. Then we can always construct an input-output path T of length 8n that contains $T_j$. We need to consider two cases.
In the first case, $b_L \le n$. Assume that
We now show that the polarized messages identified above are all REQUESTs and distinct. Because each $(\mu_i, t_i, s_i)$, $i > 0$, delays $(\mu'_{i-1}, t'_{i-1}, s'_{i-1})$, it follows that $\mu_i$ cannot be of type EOS. But $\mathrm{tag}(\mu_i) = \mathrm{tag}(\mu'_i)$, thus $\mu'_i$ cannot be of type EOS. Also, $\mu'_i$ waits, and hence by Lemma 3 it cannot be a GHOST. Thus these messages must be REQUESTs. None of the messages waiting in queues can be GHOSTs. All are ahead of other messages, hence none can be EOSs. Distinctness follows because the tags are strictly sorted. ∎
Large delays are unlikely
Proof: The proof has two steps. (No attempt has been made here to obtain the smallest value for b, and the value obtained can be substantially improved.)
Step A: We first estimate the total number of times arbitrary sequences of r REQUESTs are T-polarized for some input-output path T of length 8n, over all possible choices of the hash function h. This can be done by counting the choices for T, the choices for switches on T where REQUESTs touch, the choices for touching REQUESTs (i.e. the choice of their source processors, which determines $x_i$, and the choice of the destination modules, which determines $h(x_i)$), and the hash functions consistent with these choices.
1. The path T can originate at any switch in column 6n, and consists of 8n displacements, n of which are forward. Each forward or backward displacement can be along any of two edges. Thus the total number of possible choices is $2^n \binom{8n}{n} 2^{8n} = 2^{9n} \binom{8n}{n}$.
2. Let $m_1, \ldots, m_r$ denote the sequence of messages that is T-polarized, with $m_i$ touching T at $s_i$. All $s_i$ can together be chosen in at most $\binom{8n+r}{r}$ ways.
4. The previous step fixed $h(x_i)$ for r distinct $x_i$. We know that for $1 \le i \le r$ and $0 \le x_i, y_i < P$, there is at most one polynomial p of degree $r - 1$ over the field
assuming $r = 8n$, and using $\binom{n}{k} \le (ne/k)^k$.
Step B: Let $N_{bn}$ be the number of hash functions for which the total routing time is at least bn. For each such function it is possible to find $b'n = bn - 6n$ polarized messages. But each subset of size r of these $b'n$ messages contributes to the above estimation. There are $\binom{b'n}{r}$ ways of choosing the subset. Thus
N6n e~n)~(6P)r e:)
Because the coefficients $a_i$ can each be chosen in P ways, there are a total of $P^r$ hash functions. Thus the probability of requiring time bn is
Thus for arbitrary $\ell$, there exists a constant b such that the probability of requiring at least bn time for routing is less than $N^{-\ell}$. ∎
Permutation Routing
These ideas can also be used in routing permutations with constant queue-size [8] , and without deadlocks.
Suppose processor i wants to send a message to processor π(i), where π is a permutation on {0, ..., N - 1}.
We use the 6 phase scheme discussed above. In the first 3 phases, processor i sends its message to location i. In the last 3
Memory access: In order to access shared memory location x, it is sufficient to search 8 locations in every memory in the row of a(x). Which locations to search is indicated by a(x). This only requires a minor modification to the access scheme described in section 3.3: the location a(x) is read in phase 3, and the search through the overflow area takes place in phase 4. Thus, at each node in phase 4, the 8 locations specified by a(x) are searched. Eventually the message reaches a node that holds PRAM location x, at which point the data field of the message is updated. The message movement is not affected, since each memory needs to be accessed a constant number (8) of times per message.
A.2 Improved Scheme
It is possible to show that the empty spaces in the layout of figure 6 can be eliminated. The memory required per row is now proportional to the number of PRAM locations that get mapped into it. This is bounded by the following theorem stated without proof. 
A.3 Address computation
The hash function has r = 8n coefficients, and just storing all the coefficients on each processor requires O(n) memory. However, this is not necessary. The node in column i need only hold coefficients 8i through 8i + 7. The polynomial evaluation can be pipelined, requiring O(n) cycles.
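The pipelined evaluation can be sketched as follows, with each column folding its 8 local coefficients into a running value that travels with the message. Several details are assumptions made for this example: the hash is taken to be a polynomial over Z_P reduced modulo N at the end, the partial state passed between columns is the pair (accumulator, current power of x), and the function names are hypothetical.

```python
import random

def make_hash(n, P, rng):
    """Draw 8n random coefficients in Z_P and group them 8 per column."""
    coeffs = [rng.randrange(P) for _ in range(8 * n)]
    return [coeffs[8 * i: 8 * i + 8] for i in range(n)]

def stage(local_coeffs, x, acc, xpow, P):
    """Work done at one column: fold in its 8 coefficients, a constant amount."""
    for a in local_coeffs:
        acc = (acc + a * xpow) % P
        xpow = (xpow * x) % P
    return acc, xpow

def pipelined_hash(columns, x, P, N):
    """Pass (acc, xpow) through columns 0..n-1; the mod N step is an assumption."""
    acc, xpow = 0, 1
    for local in columns:
        acc, xpow = stage(local, x, acc, xpow, P)
    return acc % N

if __name__ == "__main__":
    n, P = 3, 101                  # P is a prime no smaller than the address space
    N = n * 2 ** n
    columns = make_hash(n, P, random.Random(0))
    print([pipelined_hash(columns, x, P, N) for x in range(5)])
```

Each column stores only 8 coefficients and does a constant amount of arithmetic per message, so one traversal of the n columns completes the evaluation in O(n) steps.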
B Emulating sequences of instructions
Our results can be extended to the emulation of multi-instruction programs, as done by Karlin and Upfal [5].
For emulating multiple instructions, the same protocol is used, but now we must guard against the possibility that a particular instruction might not complete in the allotted time. If this happens, a new hash function h is chosen, all the variables are sent to their new locations, and the emulation process resumes. We must also consider the time required to initialize the hash
N) with high probability. Because M is polynomial in N, we see from theorem 3 that any PRAM instruction completes in time c log N with probability at least $1 - M^{-\ell}$ for every $\ell$ and some c independent of M. Thus the interval between successive rehashing operations is at least $M^{\ell}$ with probability 1/2.
Thus with high probability no more than $8T/M^{\ell}$ rehashing operations are needed. The time required for these is $O((1 + T/M^{\ell})(M/N) \log N)$. Because $T \ge M/N$, the total emulation time is O(T log N) with high probability. ∎
The restriction $T \ge M/N$ is not serious because M/N instructions are required just to access all the M locations.
