Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate locally-accessible flag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory. We present a new scalable algorithm for spin locks that generates O(1) remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than a swap-with-memory instruction.
We also present a new scalable barrier algorithm that generates O(1) remote references per processor reaching the barrier, and observe that two previously-known barriers can likewise be cast in a form that spins only on locally-accessible flag variables. None of these barrier algorithms requires hardware support beyond the usual atomicity of memory reads and writes.
We compare the performance of our scalable algorithms with other software approaches to busy-wait synchronization on both a Sequent Symmetry and a BBN Butterfly. Our principal conclusion is that contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors.
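The local-spinning idea described above for spin locks can be illustrated with a list-based queue lock. The Python sketch below is an illustration under stated assumptions, not the authors' code: the names `AtomicRef`, `QNode`, `QueueLock`, and `run_demo` are hypothetical, the atomic swap and compare-and-swap are emulated with a `threading.Lock` because Python exposes no such instructions, and the use of compare-and-swap in `release` is an assumption made for brevity (the text above names only a swap-with-memory instruction). Each waiter spins on the `locked` flag of its own node, and the releasing thread ends that spin with a single write.

```python
import threading
import time


class AtomicRef:
    """Emulates an atomically updated word; the guard stands in for hardware atomicity."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()

    def swap(self, new):
        with self._guard:
            old, self._value = self._value, new
            return old

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False


class QNode:
    """Per-thread queue record; `locked` is the flag its owner spins on."""

    def __init__(self):
        self.next = None
        self.locked = False


class QueueLock:
    def __init__(self):
        self.tail = AtomicRef(None)

    def acquire(self, node):
        node.next = None
        node.locked = True
        pred = self.tail.swap(node)      # enqueue with one atomic swap
        if pred is not None:
            pred.next = node             # link behind the predecessor
            while node.locked:           # spin only on our own flag
                time.sleep(0)

    def release(self, node):
        if node.next is None:
            # No visible successor: try to mark the queue empty.
            if self.tail.compare_and_swap(node, None):
                return
            while node.next is None:     # a successor is still linking in
                time.sleep(0)
        node.next.locked = False         # single write ends the successor's spin


def run_demo(num_threads=4, iterations=100):
    """Increment a shared counter under the lock and return its final value."""
    lock = QueueLock()
    counter = [0]

    def worker():
        node = QNode()
        for _ in range(iterations):
            lock.acquire(node)
            value = counter[0]
            time.sleep(0)                # widen the race window on purpose
            counter[0] = value + 1
            lock.release(node)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Calling `run_demo(4, 100)` should return 400 if mutual exclusion holds. In this sketch the lock itself holds only the tail reference; each queue node belongs to the thread that is waiting.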
The existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides a case against so-called "dance hall" architectures, in which shared memory locations are equally far from all processors.
[...] Figure 7. Like the code in Figure 5, it spins on processor-specific, [...]
[Figure caption: Performance of selected barriers on the Butterfly.]
[...] will also perform well on these machines. It induces less network load and requires total space proportional to P rather than P log P, but its critical path is longer by a factor of about 1.5. It might conceivably be preferred over the dissemination barrier when sharing the processors of the machine among more than one application, if network load proves to be a problem.
J. M. Mellor-Crummey and M. L. Scott
Our tree-based barrier with wakeup flag should be the fastest algorithm on large-scale multiprocessors that use broadcast to maintain cache coherence (either in snoopy cache protocols [15] or in directory-based protocols with broadcast [7]). It requires only O(P) updates to shared variables in order to tally arrivals, compared to O(P log P) for the dissemination barrier. Its updates are simple writes, which are cheaper than the read-modify-write operations of a centralized counter-based barrier.
(Note, however, that the centralized barrier outperforms all others for modest numbers of processors.) The space needs of the tree-based barrier are lower than those of the tournament barrier (O(P) instead of O(P log P)), its code is simpler, and it performs slightly less local work when P is not a power of 2. Our results are consistent with those of Hensgen, Finkel, and Manber [19], who showed their tournament barrier to be faster than their dissemination barrier on the Sequent Balance multiprocessor.
They did not compare their algorithms against a centralized barrier because the lack of an atomic increment instruction on the Balance precludes efficient atomic update of a counter.
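For contrast with the log-depth barriers, a centralized counter-based barrier of the kind discussed above can be sketched as follows. This is a minimal sense-reversing sketch under stated assumptions, not code from the paper: `CentralBarrier` and `run_episodes` are hypothetical names, and a `threading.Lock` stands in for the atomic read-modify-write on the counter. Note that every waiter spins on the same shared `sense` flag, and that the number of participants is a single value.

```python
import threading
import time


class CentralBarrier:
    """Sense-reversing barrier built on one shared counter and one shared flag."""

    def __init__(self, participants):
        self.participants = participants   # the only value that depends on P
        self.count = participants
        self.sense = False
        self._guard = threading.Lock()     # stands in for an atomic decrement

    def wait(self, local_sense):
        """Block until all participants arrive; returns the caller's new sense."""
        local_sense = not local_sense
        with self._guard:
            self.count -= 1                # read-modify-write on the counter
            last = self.count == 0
        if last:
            self.count = self.participants # reset for the next episode
            self.sense = local_sense       # release everyone
        else:
            while self.sense != local_sense:   # all waiters spin on one flag
                time.sleep(0)
        return local_sense


def run_episodes(participants, episodes):
    """Return the episodes in which a thread left the barrier too early."""
    barrier = CentralBarrier(participants)
    arrived = [0] * episodes
    tally = threading.Lock()
    violations = []

    def worker():
        sense = False
        for e in range(episodes):
            with tally:
                arrived[e] += 1
            sense = barrier.wait(sense)
            if arrived[e] != participants:  # someone had not arrived yet
                violations.append(e)

    threads = [threading.Thread(target=worker) for _ in range(participants)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations
```

An empty list from `run_episodes(4, 20)` indicates that no thread passed a barrier episode before all four had arrived.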
The centralized barrier enjoys one additional advantage over all of the other alternatives: it adapts easily to differing numbers of processors. In an application in which the number of processors participating in a barrier changes from one barrier episode to another, the log-depth barriers will all require internal reorganization, possibly swamping any performance advantage obtained in the barrier itself. Changing the number of processors in a centralized barrier entails no more than changing a single constant.
[...] For the tree and tournament barriers, the number of network transactions per barrier is linear in the number of processors involved.
For the dissemination barrier, the number of network transactions is O(P log P), but still O(log P) on the critical path. No network transactions are due to spinning, so interconnect contention is not a problem.
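The O(P log P) total and O(log P) critical path for the dissemination barrier follow from its signalling pattern. The sketch below assumes the standard pattern in which, in round k, processor i signals processor (i + 2^k) mod P; the function names are hypothetical and the code only counts and checks signals rather than synchronizing real processors.

```python
def dissemination_schedule(p):
    """For each round k, list the partner that each processor i signals.

    Assumes the usual pattern partner = (i + 2**k) % p with ceil(log2(p)) rounds.
    """
    if p < 1:
        raise ValueError("need at least one processor")
    rounds = (p - 1).bit_length()          # integer form of ceil(log2(p))
    return [[(i + (1 << k)) % p for i in range(p)] for k in range(rounds)]


def total_signals(p):
    """Total signals per barrier episode: P signals in each of the rounds."""
    return sum(len(partners) for partners in dissemination_schedule(p))


def everyone_informed(p):
    """Check that after the last round each processor has heard from all others."""
    known = [{i} for i in range(p)]
    for partners in dissemination_schedule(p):
        snapshot = [set(s) for s in known]  # signals in a round are concurrent
        for i, j in enumerate(partners):
            known[j] |= snapshot[i]
    return all(len(s) == p for s in known)
```

For example, `dissemination_schedule(4)` yields two rounds, `[[1, 2, 3, 0], [2, 3, 0, 1]]`, and `total_signals(8)` is 24: eight signals in each of three rounds, with only three of them on any one processor's critical path.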
On "dance hall" machines, in which shared memory must always be accessed through a shared processor-memory interconnect, there is no way to eliminate synchronization-related interconnect contention in software. Nevertheless, the algorithms we have described are useful since they minimize memory contention and hot spots caused by synchronization.
The structure of these algorithms makes it easy to assign each processor's busy-wait flag variables to a different memory bank so that the load induced by spinning will be distributed evenly throughout memory and the interconnect, rather than being concentrated in a single spot. Both Cedar and the Ultracomputer include processor-local memory, but only for private code and data. The Monarch provides a small amount of local memory as a "poor man's instruction cache." In none of these machines can local memory be modified remotely.
We consider the lack of local shared memory to be a significant architectural shortcoming; the inability to take full advantage of techniques such as those described in this paper is a strong argument against the construction of dance hall machines. To assess the importance of local shared memory, we used our Butterfly 1 to simulate a machine in which all shared memory is accessed through the interconnection network. By flipping a bit in the segment register for the synchronization variables on which a processor spins, we can cause the processor to go out through the network to reach these variables (even though they are in its own memory), without going through the network to reach code and private data. This trick effectively flattens the two-level shared memory hierarchy of the Butterfly into a single-level organization similar to that of Cedar, the Monarch, or the Ultracomputer. Figure 22 compares the performance of the dissemination and tree barrier algorithms for one- and two-level memory hierarchies. All timing measurements in the graph were made with interrupts disabled, to eliminate any effects due to timer interrupts or scheduler activity. The bottom two curves are the same as in Figures 19 and 20. The top two curves show the corresponding performance of the barrier algorithms when all accesses to shared memory are forced to go through the interconnection network. When busy-waiting accesses traverse the interconnect, the time to achieve a barrier using the tree and dissemination algorithms increases linearly with the number of processors participating.
A least squares fit shows the additional cost per processor to be 27.8 µs and 9.4 µs, respectively.
For an 84-processor barrier, the lack of local spinning increases the cost of the tree and dissemination barriers by factors of 11.8 and 6.8, respectively.
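As a rough illustration of what the fitted slopes imply, the arithmetic below reads the reported per-processor costs as microseconds and assumes the added cost is purely proportional to P, with zero intercept. The text does not report the intercept of the fit, so these figures are an extrapolation and not measured values.

```latex
\begin{align*}
\Delta T(P) &\approx c\,P, \\
\Delta T_{\text{tree}}(84) &\approx 27.8\,\mu\text{s} \times 84 = 2335.2\,\mu\text{s}, \\
\Delta T_{\text{dissemination}}(84) &\approx 9.4\,\mu\text{s} \times 84 = 789.6\,\mu\text{s}.
\end{align*}
```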
In a related experiment, we measured the impact on network latency of executing the dissemination or tree barriers with and without local access to shared memory. The results appear in Table III. As in Table II, we probed the network interface controller on each processor to compare network latency of an idle machine with the latency observed during a 60-processor barrier. Table III shows that when processors are able to spin on shared locations locally, average network latency increases only slightly.
With only network access to shared memory, latency more than doubles.
Studies by Pfister and Norton [38] show that hot-spot contention can lead to tree saturation in multistage interconnection networks with blocking switch nodes and distributed routing control, independent of the network topology. A study by Kumar and Pfister [23] shows the onset of hot-spot contention to be rapid. Pfister and Norton argue for hardware message combining in interconnection networks to reduce the impact of hot spots. They base their argument primarily on anticipated contention for locks, noting that they know of no quantitative evidence to support or deny the value of combining for general memory traffic. Our results indicate that the cost of synchronization in a system without combining, and the impact that synchronization activity will have on overall system performance, is much less than previ-
