INTRODUCTION
A network of workstations (NOW) is an attractive platform for many parallel processing tasks, with networking technologies turning the raw computational power of a cluster of inexpensive components into a high-performance multiprocessing system [4, 7, 15, 22]. Specialized interconnections and adapters that allow for user-level access to shared data and synchronization are necessary for dramatic performance improvement in scalable parallel computers or NOWs [21]. All areas of fine-grained interaction need such improvements, but the focus of this paper is restricted to the area of barrier synchronization for fine-grained applications.
When a PE reaches a barrier, it loads its CAB hardware with the identifier of the barrier that it is awaiting. If done in a blocking mode, the CAB hardware activates an appropriate local signal, such as I/O wait, to block its PE until all PEs for that barrier have checked in (uniprocessing is assumed for each PE). If done in a nonblocking mode, a fuzzy barrier [9] is created so that a PE can do other work after check-in before explicitly waiting for barrier completion. The CAB hardware monitors the packets as they pass through it, detecting a match (using a bit-serial exclusive-OR) when the desired barrier reaches it. A bit-serial decrement of the count field of the packet is then performed.
The comparison and decrement operations within each CAB are done with only combinational logic in the loop itself. There are, of course, sequential components within the CAB hardware that must be able to control the loop data during the next clock cycle. As a result, each node in the ring adds only the delay of a single XOR gate to the ring latency, regardless of the clock speed. Following the decrement operation, the CAB hardware continues to monitor the packets as they pass through it until a barrier matches with a count of zero, at which time it unblocks its PE. Note that the last PE completing check-in is immediately unblocked. A cluster controller is responsible for placing active barriers on the ring, resetting counts when all PEs have proceeded from a barrier, and removing barrier packets when no longer needed. Figure 1 shows an example implementation of the CAB with barrier packets inserted (with CNT = MAX) or deleted by the cluster controller, which would typically be attached to one of the loop node PEs. The CAB at each PE is able to detect a match of the desired barrier (dbar) and the packet's barrier (PBAR), and to decrement the packet's count (CNT) or detect a count of zero. Note that the loop latency consists of only a single gate delay for each PE, plus the conductor's unavoidable time-of-flight delay between PEs. This configuration allows the performance of this new barrier mechanism to approach the performance of a fully connected hardware fan-in tree.
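The bit-serial match and decrement described above can be illustrated with a short software model. This is only a sketch: the function names are hypothetical, and it assumes the count field arrives least-significant bit first, which the text does not state. It shows why one XOR per bit suffices for both operations.

```python
def to_bits(value, width):
    """Return `value` as a list of bits, least-significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits for an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_match(packet_bits, wanted_bits):
    """Compare two equal-length bit streams one bit at a time using XOR."""
    mismatch = 0
    for p, w in zip(packet_bits, wanted_bits):
        mismatch |= p ^ w  # any differing bit latches a mismatch
    return mismatch == 0


def serial_decrement(count_bits):
    """Subtract 1 from an LSB-first count, one bit per step (assumed bit order)."""
    borrow = 1  # subtracting 1 is a borrow into the first bit
    out = []
    for bit in count_bits:
        out.append(bit ^ borrow)     # the in-loop data path is a single XOR
        borrow = borrow & (1 - bit)  # the borrow ripples only through 0 bits
    return out
```

Decrementing a zero count in this model wraps to all ones, which is the two's-complement encoding of -1.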
CIRCULATING ACTIVE BARRIER (CAB) SYNCHRONIZATION
The existing barrier synchronization mechanisms typically trade off cost and complexity with performance. An ideal barrier mechanism, however, would:
(1) allow for multiple barriers, each with potentially different sets of processors;
(2) have user-level access to avoid context-switching into the OS kernel;
(3) allow for rapid check-in with no contention in the typical case;
(4) have rapid execution resumption when all PEs have checked in; and
(5) be applicable to a wide variety of system architectures and topologies.
CAB meets these goals by using simple and inexpensive hardware with an integrated special-purpose bit-serial network between processors to take advantage of the simplicity of serial data operations while minimizing the latency such operations introduce. An added benefit of this simplicity is minimized cost of the replicated hardware at each PE, with somewhat more complexity in the shared cluster controller.
Barriers that are available for PE check-in are classified as "active" and are circulated around a ring of CAB hardware modules, each one attached to a single PE. Each active barrier is in a packet consisting of a barrier identifier and a count of the remaining PEs to check in at that barrier. A zero count for a packet is the "all-checked-in" indicator. When the cluster controller detects that a counter has become zero, a flag, $R_n$, is set to indicate that a barrier packet with a zero count is being circulated. When that zero again reaches the controller, the counter is reset to the $\mathrm{MAX}_n$ value adjusted for any PEs checked in at the new barrier. CAB modules starting with the final one to check in will see the resetting packet twice, and if any are already checked in for the next synchronization, CNT will become negative (2's complement) so that no early check-ins are lost. Note that any number of barrier counters is possible, limited only by the number of PBAR bits, and that any barrier may be used by any PE. For example, this structure allows nested barriers, which could be the result of nested loops.
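The circulation of a single active barrier packet can be sketched as a small simulation. This is a simplified model under stated assumptions: one packet, one hop per node visit, no cluster controller reset, and a hypothetical `arrival_hop` list giving the hop at which each PE reaches the barrier. None of these names come from the paper.

```python
def simulate_ring(arrival_hop, barrier_id=3):
    """Circulate one barrier packet until every PE has been released.

    arrival_hop[i] is the hop number at which PE i starts waiting (assumed input).
    Returns a dict mapping each PE to the hop at which it was unblocked.
    """
    n = len(arrival_hop)
    packet = {"pbar": barrier_id, "cnt": n}  # CNT starts at MAX = number of PEs
    checked_in = [False] * n
    released = {}
    hop = 0
    while len(released) < n:
        pe = hop % n  # the packet visits the nodes in ring order
        waiting = arrival_hop[pe] <= hop
        if waiting and packet["pbar"] == barrier_id:
            if not checked_in[pe]:
                packet["cnt"] -= 1  # check-in: decrement the count once
                checked_in[pe] = True
            if packet["cnt"] == 0 and pe not in released:
                released[pe] = hop  # a zero count unblocks the PE
        hop += 1
    return released
```

With `arrival_hop = [0, 0, 5, 0]`, PE 2 is the last to check in and is released on the same hop, and the remaining PEs are released within one further traversal of the loop.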
The CAB hardware must be accessible at the user level to allow rapid access, but it must be protected from access by an unauthorized PE. For example, more than one set of PEs may have independent sets of barriers on a particular cluster. Since the barrier identifiers are physically available to any PE on the cluster, the CAB hardware must protect the integrity of the system by mapping logical barrier IDs into authorized physical IDs. On some systems, external register-mapping may be used for fastest performance [10], but most systems can at least accommodate memory-mapping. The available barriers must be mapped into hardware in kernel mode, using physical barrier identifiers supplied by a barrier system call similar to that for acquiring and releasing UNIX ports. For performance considerations, however, their use must not involve context-switching into the kernel mode. The barrier server would also initiate generation and removal of circulating barriers through communication with the cluster controller. The PE that initializes the barrier is responsible for communicating with the cluster controller to insert a new barrier packet for circulation. Also, this PE must send the barrier number to all participating PEs to initialize their CAB modules using the general-purpose inter-PE communication network, which would also be used by error recovery protocols.
Although the proportionality constant is small because of the absence of sequential logic in the loop, the O(N) loop time results in unacceptable latencies for a large number of PEs, making a hierarchy of loops desirable. In this paper we limit our discussion to two levels of hierarchy, though extensions to additional levels are expected to be straightforward. Barriers that involve only PEs within a cluster would be handled by the mechanism already described. If PEs accessing the same barrier reside in different clusters, an intercluster loop, similar to the intracluster loop, will be needed. The intercluster part of each CC shown in Fig. 2 will look similar to a cluster's CAB module. A CC sets CNT to local PEs + 1 and detects when a global barrier's CNT = 1 within the local cluster, and then will check in at the appropriate global barrier, whose count was originally the number of clusters required to check in. When a CC sees a global-loop barrier with a count of zero, it will finish global check-in by circulating the matching local barrier with a zero count. Checked-in global barriers within a cluster will be simultaneously checked by serial comparators in parallel. For a two-level hierarchy, two queued comparators are most likely sufficient since a particular cluster is unlikely to be involved in more than two global barriers, and global barriers would normally be involved only if the number of PEs for a barrier exceeds the number of PEs in a cluster. As on the local loop, there are no serial components in the global loop outside of the global controller to minimize the latency.
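The two-level counting rule can be sketched as follows. This is an event-level model only, ignoring ring timing: the function name and the `checkin_clusters` event list are hypothetical, and each event simply names the cluster in which one PE checks in.

```python
def two_level_release_index(cluster_sizes, checkin_clusters):
    """Return the index of the check-in that completes the global barrier.

    cluster_sizes[c] is the number of participating PEs in cluster c.
    checkin_clusters lists, in order, the cluster of each PE check-in.
    Returns None if the barrier never completes.
    """
    # Each cluster controller starts its local count at local PEs + 1.
    local_cnt = [size + 1 for size in cluster_sizes]
    # The global count starts at the number of clusters that must check in.
    global_cnt = len(cluster_sizes)
    for index, cluster in enumerate(checkin_clusters):
        local_cnt[cluster] -= 1
        if local_cnt[cluster] == 1:  # all local PEs are in; the CC checks in globally
            global_cnt -= 1
            if global_cnt == 0:      # every cluster is in; local zero counts go out
                return index
    return None
```

The extra count of one in each local packet keeps the local barrier from reaching zero until the cluster controller has seen the global barrier complete.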
PERFORMANCE EVALUATION
If properly mapped [13], only Nodes/2 barriers must be accommodated in a loop, with each barrier capable of counting all nodes. Since the count requires lg Nodes bits (lg = log_2) and the barrier identifier requires lg(Nodes/2) bits, the optimized packet size for either a cluster or an intercluster ring which can accommodate all possible reachable barriers would be of bit length

$$2 \lg \mathit{Nodes} - 1. \qquad (1)$$
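As a quick check of the packet-size arithmetic, the following sketch adds the count bits and the identifier bits. The function name is hypothetical, and rounding up for node counts that are not powers of two is an assumption beyond what the text states.

```python
def optimum_packet_bits(nodes):
    """Bits for a count of all nodes plus an identifier for nodes/2 barriers."""
    count_bits = (nodes - 1).bit_length()   # ceil(lg nodes)
    barriers = (nodes + 1) // 2             # nodes/2 barriers, rounded up (assumption)
    id_bits = (barriers - 1).bit_length()   # ceil(lg(nodes/2))
    return count_bits + id_bits
```

For a power-of-two node count this reduces to 2 lg Nodes - 1; for example, 16 nodes need 4 count bits and 3 identifier bits.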
The packet length would be greater than the optimum if a parity bit or additional barrier identifier bits were added, making the maximum latency between 4 and 17% longer for 1-2 added bits for closely spaced PEs. The impact of adding one or two bits per packet on a network between workstations would be minimal since conductor latency, not bit-time, is the major limiting time factor.
When multiple clusters are involved, the minimum latency includes times spent in the local loop, (D + G)N/Z, in the local cluster controller, (C + B + 2)P, which includes one bit-time for mapping to the global loop, in the global loop, (I + G)Z, and in the global controller, (X + Y + 1)P. The minimum time is then

$$T_l = (D + G)\frac{N}{Z} + (C + B + 2)P + (I + G)Z + (X + Y + 1)P,$$

or if optimized with C = lg(N/Z), B = C - 1, X = lg Z, and Y = X - 1, the lowest time is

$$T_l = (D + G)\frac{N}{Z} + (I + G)Z + (2 \lg N + 1)P.$$

The expected time is $1.5\,T_l$, which if optimized is

$$T_e = 1.5\left[(D + G)\frac{N}{Z} + (I + G)Z + (2 \lg N + 1)P\right].$$

Similarly the worst-case time is
Figure 3 shows the expected synchronization time ($T_e$) based on two models: (a) a network of closely spaced single-PE workstations (SP-WS), and (b) a network of closely spaced multi-PE workstations (MP-WS: one cluster per workstation). The following values are assumed:
(1) packets are of the optimum size, 2 lg Nodes - 1, as described above, (2) the XOR gate of the CAB has a propagation delay of 1.5 ns,
The various parameters used in this section are described in Table I. The time from the last check-in until all PEs are notified is the release latency produced by the CAB synchronization mechanism. For barriers totally within a single cluster, the lowest time ($T_l$) involves one traversal of the packet around the loop, with check-in occurring just prior to the desired packet arriving at the "last-in" node. This time includes the time the packet spends in the cluster controller, (C + B + 1)P, since the entire packet must be inside the controller before a decision can be made as to whether to recirculate or remove a barrier (which takes one clock cycle). The traversal time outside of the controller, (D + G)N, includes the sum of all gate delays and inter-PE propagation times. The total lowest time is then

$$T_l = (D + G)N + (C + B + 1)P,$$

or if optimized via Eq. (1) with C = lg N and B = C - 1, the resulting time within the controller is 2P lg N, yielding

$$T_l = (D + G)N + 2P \lg N.$$

The expected latency waiting for the desired packet would be one-half the loop time. Once the desired packet is seen, it must make, at most, a complete loop traversal (with CNT = 0) before all PEs will be unblocked. Assuming the number of active barriers is small enough so that the FIFO in the cluster controller is empty, which is the typical case if the number of PEs per barrier is high, the mean expected time is then $T_e = 1.5\,T_l$, or if optimized,

$$T_e = 1.5\left[(D + G)N + 2P \lg N\right].$$
The normal longest time would involve a complete loop traversal before encountering the desired barrier, leading to double the lowest time. The very worst case would involve a situation with N/2 active barriers (two PEs per barrier), which is extremely unrealistic. In this hypothetical case, the time spent outside the controller is insignificant, since the limiting factor is transmitting the backed-up pack- 
the intercluster time is 10 ns for SP-WS and 5 ns for MP-WS, (5) the serial clock period is 2 ns, (6) only dedicated (single-task) PEs are involved in barriers, and (7) the number of active barriers is small enough so that only one packet is in a cluster controller.
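The multi-cluster latency terms can be evaluated numerically with the values listed above. The sketch below is illustrative only: the inter-PE delay D is not given in this section, so `D_NS` is a hypothetical placeholder, and the function and constant names are likewise assumptions. It sums the four stated terms, (D + G)N/Z, (C + B + 2)P, (I + G)Z, and (X + Y + 1)P, with the optimized field widths, and applies the stated factor of 1.5 for the expected time.

```python
import math

G_NS = 1.5                             # XOR gate delay (from the text)
P_NS = 2.0                             # serial clock period (from the text)
I_NS = {"SP-WS": 10.0, "MP-WS": 5.0}   # intercluster time (from the text)
D_NS = 5.0                             # inter-PE delay: hypothetical, not given here


def expected_time_ns(n, z, model, d_ns=D_NS):
    """Expected release latency in ns for n PEs split evenly over z clusters."""
    c = math.log2(n / z)   # local count bits, C = lg(N/Z)
    b = c - 1              # local barrier-identifier bits, B = C - 1
    x = math.log2(z)       # global count bits, X = lg Z
    y = x - 1              # global barrier-identifier bits, Y = X - 1
    local_loop = (d_ns + G_NS) * n / z
    local_ctrl = (c + b + 2) * P_NS
    global_loop = (I_NS[model] + G_NS) * z
    global_ctrl = (x + y + 1) * P_NS
    lowest = local_loop + local_ctrl + global_loop + global_ctrl
    return 1.5 * lowest
```

For example, `expected_time_ns(256, 16, "SP-WS")` evaluates the model for 256 PEs in 16 clusters; the result depends directly on the placeholder value chosen for D.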
As can be seen from Fig. 3, the clustering becomes very important as the number of PEs increases because of the O(N) loop-time outside the cluster controller. Each line represents a particular total number of PEs, so the number of PEs per cluster can be calculated by dividing by the number of clusters. Notice that most curves have an optimum clustering, producing a minimum synchronization time. This minimum depends on the type of system, but the slope is gentle enough so that a slightly "nonoptimum" choice would not be disastrous.
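The existence of an optimum clustering can be reproduced with a simple sweep over power-of-two cluster counts. This is a sketch under assumptions: the parameter values in the usage note are illustrative (the inter-PE delay in particular is hypothetical), the function names are not from the paper, and only cluster counts that divide N evenly are tried.

```python
import math


def expected_time_ns(n, z, d_ns, g_ns, i_ns, p_ns):
    """1.5 times the sum of the local and global loop and controller times."""
    c = math.log2(n / z)   # C = lg(N/Z), with B = C - 1
    x = math.log2(z)       # X = lg Z, with Y = X - 1
    local = (d_ns + g_ns) * n / z + (c + (c - 1) + 2) * p_ns
    global_ = (i_ns + g_ns) * z + (x + (x - 1) + 1) * p_ns
    return 1.5 * (local + global_)


def best_cluster_count(n, **delays):
    """Return the power-of-two cluster count with the lowest expected time."""
    candidates = [2 ** k for k in range(n.bit_length())]  # 1, 2, 4, ... up to n
    return min(candidates, key=lambda z: expected_time_ns(n, z, **delays))
```

With `best_cluster_count(1024, d_ns=5.0, g_ns=1.5, i_ns=10.0, p_ns=2.0)` the sweep selects 32 clusters, and neighboring choices such as 16 clusters cost only slightly more, in line with the gentle slope noted above.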
Rather than generating plots to observe the minimum synchronization time for a given configuration, it could prove useful to analytically determine the optimum clustering based on the number of PEs. Since

$$T_e = 1.5\left[(D + G)\frac{N}{Z} + (I + G)Z + (2 \lg N + 1)P\right]$$

when using the optimized packet sizes, the optimum number of clusters occurs when $T_e$'s derivative with respect to Z is zero:

$$\frac{dT_e}{dZ} = 1.5\left[-(D + G)\frac{N}{Z^2} + (I + G)\right] = 0, \quad\text{so that}\quad Z = \sqrt{\frac{(D + G)N}{I + G}}.$$

In order to demonstrate ease of implementation and the validity of the timing analysis presented previously, a CAB module with a four-bit barrier number and four-bit counter was simulated [2]. The CAB module was implemented using LSI Logic's G10 CMOS ASIC library, which has a 0.29-µm effective transistor channel length and is optimized for 3.3-V supply operation. The Synopsys Design Analyzer was used to synthesize the VHDL code for a single CAB module. In addition to a bit-serial adder, five state machines were defined for the CAB functions: equal detect, zero detect, bit counter, check-in, and processor wait. The digital logic for the module, generated by the Design Analyzer tool for the G10 library, consisted of eight D flip-flops and 51 simple combinational gates. Once the digital logic was generated, LSI Logic tools were used to generate the floorplan for a CAB module. The chip area required for a single CAB module was found to be quite small: 115 × 115 µm. From the CAB layout, a transistor-level HSPICE-compatible netlist with back-annotated parasitic layout capacitance and resistance was generated.
is currently a faculty member in the Department of Electrical Engineering and the Director of Graduate Studies in Computer Engineering at the University of Minnesota in Minneapolis. Previously, he worked as a research assistant at the Center for Supercomputing Research and Development at the University of Illinois, and as a development engineer at Tandem Computers Incorporated in Cupertino, CA. His main research interests are in computer architecture, parallel processing, and high-performance computing. He is a senior member of the IEEE Computer Society, a member of the ACM, and is a registered professional engineer.
JOHN RIEDL has been a member of the faculty of the computer science department of the University of Minnesota since March 1990. His research interests include collaborative systems, distributed database systems, distributed operating systems, and multimedia. At the 1988 Data
