The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery.
To copy otherwise, or to republish, requires a fee and/or specific permission.
Thomas M. Stricker
where processors exchange messages among themselves, followed by a computation step, where the processors operate independently on their local data. In this example, a barrier synchronization between data-parallel phases keeps messages from different phases from being intermixed.
We call a barrier that involves a group of processors a subset barrier, and we say a processor participates in a barrier synchronization if it has to wait until all other processors in the subset for this barrier have reached the given barrier. To provide a basis for parallel program generators, we require that an arbitrary subset of processors can participate in a barrier. Further, disjoint subsets of processors can participate in different barriers in parallel. This property allows us, for example, to allocate disjoint subsets of processors to different data-parallel tasks, and then synchronize each of these data-parallel tasks independently and in parallel.
Since the mapping of processes to processors may not be known at compile time, we cannot rely on the identifiers (Ids) of each processor to identify different subset barriers. That is, if two processes P_a and P_b participate in one barrier B_1 and three other processes P_x, P_y, and P_z participate in another barrier B_2, then we cannot use the unique Ids of the processors that executed P_a, P_b, P_x, P_y, and P_z to determine if a barrier is reached (even if we map a single process onto each processor), since the barrier is placed in the programs before the mapping of processes to processors is known.
Similarly, barriers based on bit masks do not do justice to the range of programs that can be executed on a MIMD computer. Therefore, we introduce named barriers. Names do not have to be unique; the only restriction is that no two barriers with the same name can be active at the same time.
It is desirable to limit the amount of information that must be ex- Furthermore, we require that all processors participating in a named barrier use a symmetric protocol, that is, they execute the same code for synchronization. If one processor is to perform extra work (e.g., it determines that all processors have reached the barrier), then the barrier algorithm must dynamically identify one processor out of the subset of processors.
Algorithms and communication models
Barrier synchronization algorithms consist of two phases:
Phase 1: Determine that every participating processor has reached the barrier.
Phase 2: Inform every participating processor of the successful completion of Phase 1.
We say Phase 1 ends if one processor has reached Phase 2. Phase 2 can be empty if every processor can determine that every processor has reached the barrier. For any symmetric barrier synchronization protocol on a private memory MIMD system, each processor reaching a barrier must somehow signal this fact by sending some data to another processor. In general, a processor must send a message to at least one other participating processor, and at least one participating processor must be able to receive information about all of the others. The information contained in receiving a message is that one additional processor has reached the barrier. The algorithms described in this paper have the key property that only one processor (determined at runtime) needs to receive information from all other participating processors.
We introduce two communication models and develop symmetric subset barrier synchronization algorithms for these models. The algorithms share the following advantages:
- Each processor uses a constant number of message buffers.
Receiving such a message allows a processor to continue its computation.
The bounded buffer broadcast model, which provides for fast broadcasting and for fast and easy discarding of messages, is characterized by the following:
1. Each processor can broadcast a message to all other processors, and this message is guaranteed to arrive at its destination within some fixed time interval.
2. Messages are stored at every processor in FIFO (first in first out) order in a bounded number of buffers. If a message arrives but no buffer is free, the oldest message is retired to make space for the newly received message.
3. Only messages currently buffered at a processor can be read by this processor.
This model is fairly realistic (its reliance on a bounded number of resources (buffers) will be appreciated by any implementor).
Section 3 will provide an example of how this model can be implemented on a specific parallel system. Note that the bounded buffer broadcast model makes no assumptions on the number of buffer elements; a single buffer element is sufficient for Algorithm 1.
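The three properties of the bounded buffer broadcast model can be illustrated with a small simulation. This is only a sketch: the class name `BoundedBroadcastNode`, the function `broadcast`, and the `capacity` parameter are hypothetical names chosen for illustration, and delivery is modeled as instantaneous rather than "within some fixed time interval".

```python
from collections import deque


class BoundedBroadcastNode:
    """One processor with a bounded FIFO of received broadcast messages."""

    def __init__(self, capacity=1):
        # A deque with maxlen silently retires the oldest entry when full,
        # which mirrors property 2 of the model.
        self.buffers = deque(maxlen=capacity)

    def deliver(self, message):
        self.buffers.append(message)

    def readable(self):
        # Property 3: only currently buffered messages can be read.
        return list(self.buffers)


def broadcast(nodes, sender, message):
    """Property 1: the message reaches every processor except the sender."""
    for index, node in enumerate(nodes):
        if index != sender:
            node.deliver(message)
```

With a capacity of two, a third broadcast pushes the first message out of every receiver's buffers, so a late reader can no longer see it.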
Anonymous destination message passing
In the bounded buffer broadcast model, messages are discarded at the destination after being broadcast to all the processors in the system. In some communication systems, broadcasting a message and then removing it at the destination is a costly operation, so we also investigate a model that does not rely on broadcasting and discarding. In this model, a processor receives messages only after it has reached a barrier.
In the generic message passing model, each processor has a unique identifier, and this identifier is used to address the processor. A processor P_s sends a message to a destination processor P_d by receive:
Receiving is performed simultaneously with sending,
The "wired or" broadcast network is a circuit that implements an unbounded fan-in logical gate (i.e. a gate with an arbitrary number of inputs). The scalability problems of "wired or" networks are well known and have been discussed before. The current design of this unit (operating at 1 MHz) allows for at least 1024 processors.
User configurable cell names and message headers
A pathway through a particular cell starts in express mode (see Figure 2 (b)), which allows data to travel along with a minimal delay (0.2 µs). Each iWarp processor contains an address match CAM with four entries that can be set by the program. When a special word called a message header arrives over a pathway at a processor, the match CAM hardware automatically compares the message header to its four entries. If any of the entries matches the message header, the pathway is stopped, i.e. the message header is held in the pathway, and the computation agent is informed that a matched arrival has occurred (e.g. as in Figure 2 (a)). If no match between a match CAM entry and the header is detected, the message continues on to a "default" processor (see Figure 2 (b)). The "default" processor can be any neighbor of the current processor. For the purpose of barrier synchronization, the "default" processor at every processor is chosen so that the pathways build a unidirectional ring through all processors of the iWarp system.
When a matched arrival is signaled, the message passing system splits the pathway
The primitives
To implement the selective-send primitive on iWarp, the computation agent simply places a message header and data on the ring. The message circulates around the ring until one of the match CAMs in the ring matches the message header. Note that no message circulates forever because the match CAM of a sender is configured to match any message it has placed on the ring, since it must be prepared to receive messages with the same header from other processors participating in this barrier.
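The routing behaviour of a message placed on the ring can be sketched in a few lines. This is an illustrative model only, not iWarp code: `selective_send` and `match_cams` are hypothetical names, each processor's match CAM is modeled as a set of at most four headers, and the ring is modeled as consecutive indices.

```python
def selective_send(match_cams, sender, header):
    """Return (receiver, hops) for a message placed on a unidirectional ring.

    match_cams[p] is the set of headers (at most four) armed at processor p.
    The message skips every processor whose CAM does not match and stops at
    the first one that does, possibly the sender itself after a full lap.
    """
    ring_size = len(match_cams)
    for hops in range(1, ring_size + 1):
        position = (sender + hops) % ring_size
        if header in match_cams[position]:
            return position, hops
    # Only reachable if the sender did not arm its own CAM for this header.
    raise RuntimeError("no match CAM on the ring matches this header")
```

Because the sender arms its own CAM before sending, the loop always finds a match within one lap, which is the reason given above for why no message circulates forever.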
To implement the selective-receive primitive on iWarp, the computation agent writes the desired header to one of the match CAM entries, waits for a matching header to arrive, consumes the message header and data, and then joins the pathway so that a following message can proceed to the next processor in the ring.
we already know that every participating processor is waiting for a message. The processor that ends Phase 1 sends a "barrier complete" message, and all processors receiving the message forward it (unless they are the original sender) and then continue their computations.
The running time of this algorithm is proportional to the number of sequential receiving and sending steps. If n is the number of participating processors, then the number of sequential receiving steps for Phase 1 is n (one processor has to receive n messages).
In Phase 2 one message has to be forwarded sequentially around the entire ring, which costs n steps. The drawback of this method is that it unnecessarily forwards messages in Phase 1 and does not overlap the receiving of messages on the different processors in Phases 1 and 2. We will now discuss three ring-based algorithms that improve on this naive approach.
RING1
This algorithm eliminates the forwarding of messages in Phase 1 by using a tournament or combining approach. The processor that wins the tournament is the processor that ends Phase 1. Note that standard tree contraction algorithms are inappropriate here because they require bidirectional communication, which is not allowed in our model and is hard to provide for the barrier synchronization, given the limited number of logical channels available. In our tournament approach, a processor has two possibilities after receiving a message: (1) it continues to receive messages (wins the round) or (2) it forwards a message and waits for notification from the winning processor (loses the round). We choose the processor that ends Phase 1 to be the one with the highest Id (each processor has a unique integer Id) among those processors participating in the barrier. This means that a processor is only allowed to continue receiving messages as long as it does not know about another participating processor with a higher valued Id. Also, we must preserve the information about how many messages have been received by a processor. So in each message we include a counter that accumulates the number of messages the losing processors have received. A detailed outline of the tournament approach can be found in Algorithm 3 (Phase 1).
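The tournament idea can be made concrete with a small simulation. This is a simplified sketch and not Algorithm 3 itself: it assumes that all participants have already reached the barrier, that each participant knows the number of participants n, that Ids are unique, and that non-participating processors simply pass messages along (so they are left out). The function name `ring1_phase1` and the message layout are hypothetical.

```python
from collections import deque


def ring1_phase1(ids):
    """Simulate the tournament; ids[k] is the Id of the k-th participant
    in ring order. Returns (winning Id, receive steps per participant)."""
    n = len(ids)
    active = [True] * n    # still in the tournament (match CAM armed)
    absorbed = [0] * n     # initial messages accounted for at each processor
    receives = [0] * n     # number of receive steps per processor
    # Every participant places (position, sender Id, counter = 1) on the ring.
    in_flight = deque((k, ids[k], 1) for k in range(n))
    while in_flight:
        position, message_id, counter = in_flight.popleft()
        # The message travels downstream to the next processor still receiving.
        target = (position + 1) % n
        while not active[target]:
            target = (target + 1) % n
        receives[target] += 1
        if message_id > ids[target]:
            # Loses the round: forward the message, adding its own tally.
            active[target] = False
            in_flight.append((target, message_id, counter + absorbed[target]))
            absorbed[target] = 0
        else:
            # Wins the round (or sees its own message return): keep receiving.
            absorbed[target] += counter
            if absorbed[target] == n:
                return ids[target], receives
    raise RuntimeError("tournament ended without a winner")
```

In this model the counters are conserved: every initial message contributes one, and a loser hands its tally on, so only the processor with the highest Id can ever account for all n participants.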
RING2
This algorithm improves Phase 2 of algorithm RING1 by distributing the completion notification efficiently (in some tree-like fashion). The tree from Phase 1 cannot be reused because of the unidirectional communication network and the highly imbalanced nature of the tree (see e.g. Figure 3). Rather, the processor that finishes Phase 1, denoted PI, determines the set of participating processors over the course of Phase 1. During Phase 2, it sends one message to the nearest participating processor, and one message to the participating processor halfway around the ring. To apply this approach recursively, the messages sent by PI must include the set of participating processors. This requires extra work and means that the message length will be linear in the number of processors instead of logarithmic. In reality this works fine for small systems but may create problems for large systems.
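The recursive "nearest plus halfway" distribution can be sketched as follows. This is an assumption-laden illustration: the function name `ring2_notify`, the way the remaining set is split between the two recipients, and the notion of a "round" (one level of the recursion, ignoring that the two sends of one processor are sequential) are choices made here for clarity, not details taken from the paper.

```python
def ring2_notify(targets, round_number=1, schedule=None):
    """Map each processor in `targets` to the round in which it is notified.

    `targets` lists the participants still to be notified, in ring order
    downstream of the current sender. The sender notifies the nearest one and
    the one halfway through the list, and hands each of them the part of the
    list it is responsible for.
    """
    if schedule is None:
        schedule = {}
    if not targets:
        return schedule
    if len(targets) == 1:
        schedule[targets[0]] = round_number
        return schedule
    halfway = len(targets) // 2
    schedule[targets[0]] = round_number          # nearest participant
    schedule[targets[halfway]] = round_number    # participant halfway around
    # Each recipient repeats the scheme on its own share of the set.
    ring2_notify(targets[1:halfway], round_number + 1, schedule)
    ring2_notify(targets[halfway + 1:], round_number + 1, schedule)
    return schedule
```

For a 64-processor barrier (63 processors to notify) this scheme needs six rounds, which matches the logarithmic behaviour described above, and the lists passed along show why the message length grows with the number of participants.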
RING3
This algorithm gives an alternative approach to Phase 2 with a logarithmic message length, but with potentially decreased performance.
In summary, the time complexity of the barrier synchronization is in all common cases O(log N), which in general is the best we can hope for. The worst case complexity of Phase 1 is O(√N), which is provably the best we can expect for this tournament approach.
Note that the time complexity of Phase 1 is only dependent on the processors still in the tournament.
In Phase 2 we point out a tradeoff between simplicity, time complexity and message length. The simplest algorithm RING1 has time complexity O(n). The fastest algorithm RING2 has a time complexity O(log n), but its disadvantage is that it requires messages of length O(n). Algorithm RING3 offers a compromise by having an expected running time of O(log² n), using a random distribution, and sending messages of length O(log N).
4.3.1 Time complexity of Phase 1
The cost of transmitting a message along the ring is small compared to the cost of a selective-receive or a selective-send. So we are concerned only with the number of selective-receive and selective-send operations.
A clear lower bound on the time complexity for this problem is log n (because it is impossible to compute the "or" of n numbers in time o(log n)).
Note that the time complexity of Phase 1 is only dependent on which processors are still participating in the tournament approach.
Proof: For a detailed proof see [7]. So the best we can hope for is to find a numbering scheme where the longest subsequence is O(√N). But we would also like to find a numbering scheme that for commonly used subset barriers gives us a logarithmic number of rounds. Choosing the bit-reversal numbering scheme, we get a worst case performance of O(√N) and, for commonly used subsets, O(log N) performance.
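The bit-reversal numbering scheme (bitrev) is defined as follows:
Let x be the processor at the x-th ring position (numbered from zero).
Then the Id of processor x will be bitrev(x) with:
$$\mathrm{bitrev}(x) = \sum_{i=0}^{\log N - 1} x_i \, 2^{\log N - 1 - i}, \qquad \text{where } x = \sum_{i=0}^{\log N - 1} x_i \, 2^{i},\; x_i \in \{0, 1\}.$$
The bit-reversal numbering scheme for 16 processors is: 0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15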
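The numbering and the subsequence lengths that the worst case depends on can be checked with a short script. This is a sketch under assumptions: the helper names `bitrev` and `longest_increasing` are hypothetical, N is taken to be a power of two, and the longest increasing subsequence is computed with the standard patience-sorting method rather than anything from the paper.

```python
from bisect import bisect_left


def bitrev(x, bits):
    """Reverse the lowest `bits` bits of x."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (x & 1)
        x >>= 1
    return result


def longest_increasing(sequence):
    """Length of the longest strictly increasing subsequence."""
    tails = []  # tails[k] = smallest tail of an increasing run of length k + 1
    for value in sequence:
        slot = bisect_left(tails, value)
        if slot == len(tails):
            tails.append(value)
        else:
            tails[slot] = value
    return len(tails)


def bitrev_ids(bits):
    """Ids in ring order for a ring of 2**bits processors."""
    return [bitrev(x, bits) for x in range(1 << bits)]
```

Calling `bitrev_ids(4)` reproduces the 16-processor numbering listed above. Applying `longest_increasing` to the Ids, or to their negations for the decreasing case, gives the subsequence lengths that the worst-case analysis of Phase 1 is concerned with.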
Lemma 5 For any subset barrier with n participating processors there exists a numbering scheme such that the number of rounds of Phase 1 is at most: min{O(√N), n + 1}.
Proof: We will only give a proof sketch here. The detailed proof can be found in [7].
Let us consider the bit-reversal numbering scheme. If only n processors participate in a barrier, inc + dec can be at most n + 1.
So it is sufficient to show that the longest subsequences of the bit-reversal numbering scheme are bounded by 2 * √N. The longest increasing subsequence can be generated as follows: each member of the subsequence has log N bits (assume log N = 2 * l + 1). The first l + 1 bits form the sequence {0, 1, 2, 3, ...}, and the remaining bits are a mirror image of the first half. Notice that each member of this sequence is its own bit-reverse.
To find a lower bound on the message complexity of Phase 1, we assume that every processor sends at least one message and this message can only be received by processors which have already reached the barrier.
Lemma 8 A lower bound for the message complexity of an n processor barrier for Phase 1 is Ω(n * N).
Proof: Consider a barrier with n participating processors P_1, ..., P_n (numbered in ring order). Processor P_1 reaches the barrier first. No other processor has reached the barrier, so its message travels once around the ring. Processor P_2 reaches the barrier and sends its message. It travels until it reaches processor P_1, i.e. N - 1 wires. When processor P_i reaches the barrier, its message traverses at least N - i wires. So the total message complexity must be at least n * N - n²/2 - n/2. (For an example of such a situation see Figure 3.) □
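The arithmetic behind this bound can be written out as a short worked sum. It assumes, as in the proof, that the message of the i-th arriving processor crosses at least N - i wires, and it uses n ≤ N for the last step:

```latex
\sum_{i=1}^{n} (N - i)
  \;=\; nN - \frac{n(n+1)}{2}
  \;=\; nN - \frac{n^{2}}{2} - \frac{n}{2}
  \;\ge\; nN - \frac{n(N+1)}{2}
  \;=\; \frac{n(N-1)}{2}
```

The final expression grows proportionally to n * N, which is where the Ω(n * N) in Lemma 8 comes from.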
To bound the number of messages and the message complexity of Phase 1 of our algorithm we observe that every processor is sending one message initially, and every other message which is sent is only being forwarded. Any processor forwards a message only once, and all but one of the messages travel at most one time around the ring.
Note also that the number of messages sent and the way messages are sent only depends on how many processors participate in a certain barrier, and is independent of the number of simultaneous barriers. So the number of messages of Phase 1 of our algorithm is 2 * n plus an additive term which is necessary to reduce the time complexity, and the message complexity of Phase 1 of our algorithm is at most n * N plus an equivalent additive term. The additive term depends on the length of the timespan between when the first processor reaches the barrier and when the last processor reaches the barrier, i.e. how often the processor with the current highest Id has to do polling. If the timespan is sufficiently small, the step marked polling of Algorithm 3 will never be executed and the message complexity is n * N.
The number of messages of Phase 2 for RING1 is 1, so the message complexity of Phase 2 for RING1 is N. This is optimal.
For RING2 we send log n messages and have a message complexity of N * log n; and for RING3 we send up to log N messages and have a message complexity of up to N * log N, which is the price we pay for an improved time complexity.
Theorem 9 Let the timespan between the first processor's and the last processor's arrival at the barrier be t, the time for a message to travel once around the ring be t,, and the time to receive and send be t,. Then the message complexity of RING1 for an n processor barrier is at most:
and for algorithms RING2 and RING3 the message complexity is at most:
Note that this is close to optimal.
5 Performance measurements and evaluation
We measured the execution times of different implementations of barrier synchronization on iWarp 2. All coding (except the bit manipulation code of RING2) was done in C, and the programs were compiled with release 2.4.1 of the C compiler (this is a pre-release without the optimizer). All data are in µs (based on the 20 MHz clock of the production systems).
Notice that subset barriers incur a cost. For comparison, we coded a fast total barrier using logical channels. Such a total barrier can be implemented on a 64-processor iWarp system in about 60 µs using a ring, and in about 10 µs if more logical channels are used (to build a higher order network). But once we are committed to do subset barriers we can determine the information normally provided by the bitmasks (e.g. the processor that sends the "barrier complete" message) at no additional cost.
Of course the real value of subset barriers is that multiple barriers can be active in parallel.
We measured the cost of performing k = 64/2^i barriers in parallel, each with 2^i processors, for i = 1, 2, ..., 6. On the ring, performing multiple barriers in parallel added (for the considered subsets) at most 25% to the total execution time, whereas the time for the bounded buffer version grows linearly with the number of barriers. In general the ring-based synchronization algorithms are about 2-3 times faster than the broadcast-based synchronization algorithm. There is some variation depending on how processors are grouped into subsets; there is less interference between the different subsets if each subset happens to be a contiguous segment of the embedded ring.
If all but one processor have reached the barrier when the last processor arrives, our measurements verify that the only cost we incur is the cost to end Phase 1 and to perform Phase 2, which in broadcast-based methods is two broadcasts (125 µs) and in the ring-based methods is one message plus the cost for Phase 2 (305 µs).
The above comparison between broadcast-based and ring-based synchronization fails to draw attention to one major advantage of the broadcast-based schemes: All participating processors leave Phase 2 at the same time, and this feature can be used to synchronize the local timers on all processors to within 800 ns (the timer resolution is 400 ns). This is an extremely valuable feature which has simplified our measurement tasks significantly. 
Concluding remarks

