Abstract
Introduction
One way to speed up the execution of a program is to distribute the computation over a number of processors. This approach is viable as long as most processors are doing useful work; that is, as long as they are not wasting time waiting for data to become local and ready for use. In shared-memory multiprocessors one important way of allowing data to reside locally is to provide caches, over which some measure of consistency is maintained.
As the execution of the program becomes more fine grained, i.e., as the number of instructions between points of communication is reduced, the sensitivity to access delays on shared data increases. Stone introduces the performance measure of MSYPS: Millions of SYnchronizations Per Second [1]. If the amount of processing between synchronizations is comparable to the time to synchronize, then dividing the computation up further (making the granularity finer) will not produce commensurate speedup. Thus the speed of synchronization, or MSYPS, is a fundamental limit on the granularity of parallel programs, and therefore a limit on the number of processors that can cooperate on a single task.
We expect that future processors will be fast enough so that, even considering only speed-of-light limits, interconnection delays may be the equivalent of many instruction-issue times. Programs running on such cooperating processors will be limited in their parallelism by the time to synchronize on and exchange data. This limit can be ameliorated to a large extent by prefetching data, at least for networks that provide sufficient bandwidth and programs that are able to request data in advance [2]. But latency in accessing shared data cannot be entirely hidden: data written by another processor cannot be fetched before it is generated. Thus, multiprocessors need to provide efficient mechanisms for the acquisition of shared data. Such mechanisms must be effective for those paradigms of sharing most likely to be found in parallel applications. In this paper we look at synchronization and data exchange scenarios that we find to be important in parallel computing, and we analyze the performance of our hardware mechanisms and compare them to previous proposals.
One important scenario is this: In most applications access to a shared variable is restricted for a period of time to one of two cases. Either a process has exclusive read and write access to the variable, or one or more processes have read (but not write) access to the variable. A very common special case of sharing occurs when exactly two processes take turns reading and writing shared data. This situation is called pairwise sharing. In shared-memory multiprocessors the one-writer/many-readers paradigm of sharing may be enforced by a cache coherence protocol. Archibald and Baer describe and evaluate a number of protocols with this property [3].
A limitation with coherence protocols is that they only guarantee consistency for a single writable entity, and only for the duration of a single write operation. Most parallel applications require exclusive access to sets of related variables during a number of consecutive read and write operations. Enforcement of this sharing mechanism, called a critical section, generally requires special synchronization operations for efficient implementation. Many shared-memory multiprocessors use locks to enforce access to critical sections. Special atomic synchronization primitives such as Test&Set and Unset can be provided to acquire and release locks.
In addition to critical sections, there are numerous other paradigms for synchronizing processes and sharing data. In barrier synchronization [4] a number of processes may wish to guarantee that all have reached a specific point in their execution before any can proceed. As a second example, processes may wish to perform enqueue and dequeue operations in parallel on a queue whose entries represent separate units of work. Although numerous scenarios exist, it is arguable that only a very few of these would warrant special hardware support. Even for barrier synchronization, which is a primary candidate for hardware support, efficient software solutions exist [5]. In this paper, we restrict our attention to efficient support for critical sections.
In a cache-coherent system, the accessing of a critical section protected, for example, by a lock, results in at least three distinct latencies. First, a process must wait for the critical section to become free (i.e. the lock is unset). Second, the operation to acquire the lock requires time to traverse the interconnect. Finally, the process must issue and wait for requests that make the shared data local to the processor where the process is executing. As the granularity of sharing becomes finer, these delays may dominate the time to complete a task.
One mechanism that shows promise for reducing or eliminating all three of these latencies is the QOLB primitive [6]. In this paper we explore hardware support for locks, critical sections and data exchange using QOLB. Even though software solutions exist for this type of synchronization [7, 8], one of the main points of this paper is that there is a significant benefit in having such hardware support for reducing memory latency.
Specifically, we discuss the implementation of QOLB in a shared-memory multiprocessor using the Scalable Coherent Interface (SCI). SCI, designated as IEEE Standard for Scalable Coherent Interface (IEEE Std 1596-1992) [9], is designed to provide an efficient, cache-coherent, shared-memory model to a large number of processing nodes. Due to the similarity between QOLB and the SCI implementation of cache coherence, it is natural to extend that implementation to include QOLB, and such an extension is provided as an option to the base SCI protocol. This paper also discusses and analyzes this extension.
Previous work [6] has described the use of QOLB primarily for eliminating contention over the interconnect. The present work discusses QOLB's ability to reduce memory latency by (1) making common synchronization operations more efficient through the elimination of most traversals of the interconnect and (2) allowing shared data to be prefetched, especially as this relates to cooperating processes. Similarly, previous discussions of SCI [10, 11] do not focus on the implementation issues of QOLB or various performance enhancements, such as pairwise sharing.
The remainder of the paper is organized as follows. Section 2 presents work related to critical sections, software queues for locks and QOLB. It also provides a brief description of cache coherence operations in SCI. An extended discussion of QOLB hardware support and synchronization issues in SCI is presented in Section 3. The benefits of using QOLB for a specific example, the producer/consumer problem, are analyzed in Section 4. Section 5 provides a more general discussion of critical section latency for QOLB and software queuing methods. Finally, conclusions are presented in Section 6.
Previous work
Critical sections are used to restrict access to certain shared data to at most one process at a time. New processes must wait at the entrance to the critical section if another process is presently inside. In general there are two forms of waiting: active waiting and passive waiting. In the former case a lock is associated with the critical section, and this lock is repeatedly tested by processes wishing to acquire mutually exclusive access. In the passive waiting case there is a queue of waiting processes. Among other things, the advantage of a queue is that a process or processor could do useful work while waiting, and no redundant messages will be generated across the interconnect. Another advantage is that a queue can easily be made fair: once inside the queue, a process will not be overtaken indefinitely by other processes. On the other hand there is the overhead of inserting and removing processes from the queue.
Historically, Test&Set operations have been used to spin-wait actively, while semaphores have been used to wait passively. Since semaphores are shared variables, locks have usually been used to implement the critical sections of the semaphore operations. Hence most computer instruction sets include an atomic lock operation in order to efficiently implement wait queues. In many systems simple lock operations are the only hardware support available for implementing critical sections.
In a shared-memory multiprocessor, resources can be wasted if critical sections are not implemented carefully. Naive synchronization mechanisms for critical sections result in additional latencies in the critical path of an algorithm and in unproductive use of interconnect bandwidth.
In multiprocessors, the Test&Set operation may give better performance if implemented as a Test&Test&Set operation [12]. This primitive allows processes to spin-wait on a locally cached copy of a lock. When the lock is unset all shared copies are invalidated, after which exactly one process will succeed in setting the lock.
Unfortunately, while Test&Test&Set reduces spinning over the interconnect, contention for a lock may result in unfair allocation and large amounts of interconnect traffic if the lock is held only momentarily.
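The Test&Test&Set idea described above can be illustrated with a minimal Python sketch. This is a toy software model only: the class name TTSLock and the helper run_demo are hypothetical, a threading.Lock stands in for the hardware atomicity of Test&Set and Unset, and a plain read of the flag stands in for spinning on a locally cached copy.

```python
import threading
import time


class TTSLock:
    """Toy Test&Test&Set spin lock (hypothetical model, not a hardware primitive)."""

    def __init__(self):
        self._flag = False
        # Assumption: this mutex models the atomicity that hardware would provide.
        self._atomic = threading.Lock()

    def test_and_set(self):
        """Atomically set the flag and return its previous value."""
        with self._atomic:
            old = self._flag
            self._flag = True
            return old

    def acquire(self):
        while True:
            # "Test": spin on a plain read, modeling a spin on the cached copy.
            while self._flag:
                time.sleep(0)  # yield so the lock holder can make progress
            # "Test&Set": issue the atomic operation only when the lock looks free.
            if not self.test_and_set():
                return

    def unset(self):
        with self._atomic:
            self._flag = False


def run_demo(n_threads=3, iters=50):
    """Each thread increments a shared counter inside the lock."""
    lock = TTSLock()
    total = [0]

    def worker():
        for _ in range(iters):
            lock.acquire()
            total[0] += 1
            lock.unset()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Note that nothing in this sketch orders the waiters: whichever process wins the Test&Set after an Unset gets the lock, which is the source of the unfairness mentioned above.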
QOLB and hardware queues
The QOLB primitive [6] was designed to provide efficient hardware support for critical sections by allowing processes to build distributed hardware queues of waiters for cache lines. (The acronym QOLB, pronounced "Colby" and formerly called QOSB, stands for "Queue On Lock Bit.") A line is the memory entity across which coherency is maintained. By providing a direct implementation of a binary semaphore queue, QOLB can be used as a mechanism for efficient process synchronization. As a non-blocking operation QOLB can prefetch (i.e., make local) a line of data while a process performs useful work. Combining these two operations along with a simple software convention, QOLB becomes a synchronizing prefetch operation. That is, QOLB can be used to synchronize and fetch a line of data, allowing local tests to determine when the data has become available. For example, a line of data would migrate to the next process in the QOLB queue when the owner process (the process holding the lock) releases the line. If prefetched sufficiently in advance, a process could synchronize on and access shared data without experiencing delays from transactions over the interconnect.
[...] shadow copies on which local processes may spin. In SCI the queue is doubly linked, with memory keeping only a pointer to the head of the queue (every node has an implicit pointer to memory). The QOLB primitive can be used to define the three operations acquire(line), release(line) and prefetch(line). An example is given in Section 3. The acquire operation adds an entry for the requesting processor into the queue for the line. The release operation causes an owned line to migrate to the next processor in the queue, which then becomes the new owner. The prefetch operation attempts to make a line local, while freeing the processor to continue execution. By placing a lock in the desired line, and using Test&Set and Unset to acquire and release the lock, the above operations allow a process to enter, exit, and even prefetch the lock and data for a critical section. A lock is needed in addition to the cache states defining the ownership, because a line may be rolled out from a cache at unpredictable times.
Two software algorithms inspired by QOLB have been developed to minimize network contention for a lock. Each of these implements queues as software-maintained data structures. Anderson [8] presents a scheme to implement a queue as a circular array. Mellor-Crummey and Scott [7], on the other hand, chose to implement a queue as a linked list. Both algorithms succeed in reducing traffic across the interconnect to a constant number of traversals per lock access. However, neither allows a lock, or the data associated with a critical section, to be easily prefetched. A comparison of QOLB and the software queue techniques in the context of executing critical sections is presented in Section 5.
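To make the circular-array approach concrete, the following is a rough Python sketch in the spirit of an array-based queue lock. It is not taken from [8]; the names ArrayQueueLock, fetch_and_increment and run_demo are hypothetical, a threading.Lock models the atomic fetch-and-increment, and the sketch assumes that no more than n_slots processes contend at once.

```python
import threading
import time


class ArrayQueueLock:
    """Toy array-based queue lock: each waiter spins on its own slot."""

    def __init__(self, n_slots):
        self.n = n_slots
        self.can_enter = [False] * n_slots
        self.can_enter[0] = True          # the first ticket may enter at once
        self._next = 0
        # Assumption: this mutex models an atomic fetch-and-increment.
        self._atomic = threading.Lock()

    def fetch_and_increment(self):
        with self._atomic:
            ticket = self._next
            self._next += 1
            return ticket

    def acquire(self):
        slot = self.fetch_and_increment() % self.n
        while not self.can_enter[slot]:   # spin only on this waiter's slot
            time.sleep(0)
        self.can_enter[slot] = False      # reset the slot for later reuse
        return slot

    def release(self, slot):
        # Hand the lock to whoever holds the next ticket.
        self.can_enter[(slot + 1) % self.n] = True


def run_demo(n_threads=4, iters=50):
    lock = ArrayQueueLock(n_threads)
    total = [0]

    def worker():
        for _ in range(iters):
            slot = lock.acquire()
            total[0] += 1
            lock.release(slot)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Because each waiter watches a distinct array slot, a release touches only the slot of the next waiter, which is what keeps the interconnect traffic per lock access constant. The sketch also shows why prefetching is awkward: a process learns which slot to watch only after it has committed to a place in the queue.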
Scalable Coherent Interface (SCI)
SCI is an official IEEE standard that defines the physical and logical interfaces between modules (called nodes) in a shared-memory multiprocessor [10]. An SCI multiprocessor can contain up to 64K nodes, each node containing one or more processing elements (with cache), a memory module, a DMA adapter, or a combination of these. Communication between nodes is based upon the sending of messages of two types, requests and responses. A low-level logical layer defines arbitration and flow control such that messages are transported reliably and within a predefined time from a source node to a destination node.
SCI defines a chained-directory-based cache-coherence protocol [9]. Memory that can be coherently cached is divided into lines that are 64 bytes long. For each line there is a distributed directory that defines the set of nodes whose cache contains a copy of the line. The directory is implemented as a doubly-linked list of cache lines, called a sharing list. For each memory line there is state information and a pointer to the first cache line (the head) in the sharing list. For each cache line there is a forward and a backward pointer. In addition, cache lines contain state information describing the cached data as well as the position in the sharing list (head, mid, or tail). A sharing list is shown in figure 1, with the SCI cache states listed for each node.
The close correspondence between hardware queues and sharing lists makes SCI attractive for implementing QOLB. A major difference is that in a QOLB queue only one node has a valid copy of the line. All other copies are stale, although the QOLB implementation allows processes to test the state of their local copy to determine if the data has been made local.
In a sharing list, when there exists more than one valid copy of a memory line, it is legal to read but not write the contents of these copies. New readers and writers are prepended to the head of the sharing list by first going to memory, and then to the old head. The two request/response pairs needed in this case are shown schematically in the first two pictures of figure 2. In order for one copy to be modifiable it must be the only existing valid copy, including the copy in memory. The state of such a line is called exclusive. The SCI cache coherence protocol is based on write invalidation.
A head line can become exclusive by invalidating the rest of the sharing list. Figure 2 shows the steps involved in acquiring exclusive access to a line, specifically a line previously in exclusive mode. First a processor (node C) interrogates the directory for the line in main memory to determine the current head (node B). Second, the processor acquires the line from the head. If the processor needed a read-only copy, then the previous head would have been set to state valid. The case shown, however, is one where a writable copy is needed, so the old head is marked stale and the new head exclusive. Finally, the processor invalidates the rest of the sharing list, which in this case consists of a single processor (node A). In general, if a new writer appends to a long (read) sharing list, this last step may require a series of invalidations.
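The prepend-and-invalidate sequence described above can be traced with a small Python model. This is a deliberately simplified sketch, not the SCI protocol: the class SharingList, its lowercase state names, and the round_trips counter are hypothetical, each request/response pair is counted as one round trip, and a node is assumed not to be in the list when it issues a read or write.

```python
class SharingList:
    """Toy model of the sharing list for one memory line."""

    def __init__(self):
        self.head = None      # memory directory's pointer to the head node
        self.fwd = {}         # node -> next node toward the tail
        self.bwd = {}         # node -> previous node toward the head
        self.state = {}       # node -> simplified state name
        self.round_trips = 0  # request/response pairs issued so far

    def members(self):
        out, node = [], self.head
        while node is not None:
            out.append(node)
            node = self.fwd[node]
        return out

    def _prepend(self, node):
        assert node not in self.fwd, "toy model: node must not already be listed"
        old_head = self.head
        self.round_trips += 1             # ask memory for the current head
        if old_head is not None:
            self.round_trips += 1         # fetch the line from the old head
            self.bwd[old_head] = node
        self.fwd[node], self.bwd[node] = old_head, None
        self.head = node
        return old_head

    def read(self, node):
        self._prepend(node)
        self.state[node] = "valid"

    def write(self, node):
        old_head = self._prepend(node)
        self.state[node] = "exclusive"
        if old_head is None:
            return
        self.state[old_head] = "stale"
        victim = self.fwd[old_head]
        self.fwd[old_head] = None
        while victim is not None:         # invalidate the rest, one node at a time
            self.round_trips += 1
            nxt = self.fwd.pop(victim)
            self.bwd.pop(victim)
            self.state[victim] = "invalid"
            victim = nxt


def demo():
    line = SharingList()
    line.read("A")
    line.read("B")    # B is now the head, A the tail
    line.write("C")   # C prepends, marks B stale, invalidates A
    return line.members(), dict(line.state), line.round_trips
```

In this trace the write by C costs three round trips (memory, old head B, invalidation of A), and the count grows by one for every additional reader behind the old head, which is the series of invalidations mentioned above.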
SCI has optimized the important special case where exactly two processors share a copy of a line. Then the tail may acquire read and write privileges as well. The other (invalid) copy in a two element list is in the state stale. This allows for efficient pairwise sharing of a line by allowing the head and tail to read and write the most current copy of the line without ever accessing the memory directory. A more detailed account of pairwise sharing can be found in a technical report by the authors [13].
The SCI QOLB implementation
In order to implement the QOLB protocol, four extra cache states are needed in the SCI protocol. Three of these states are called USED, NEEDED and IDLE.
A line in state USED or NEEDED implies that the node is the QOLB owner of the line. NEEDED additionally indicates that there are nodes waiting to become the owner. The shadow lines of such nodes are denoted by the state IDLE. Lines in states USED and NEEDED are readable and writable, whereas lines in state IDLE are unreadable and unwritable. The fourth state, called EXQOLB, is used to annotate the state EXCLUSIVE with the information that it resulted from breaking down a QOLB queue. This was described in the discussion of cache line rollout.
The implementation of the different QOLB operations is given below. We assume that processes are not migrated. The implementation of QOLB under SCI differs slightly from previous specifications [6, 14].
QOLB(line)
A QOLB operation is implemented as a special nonblocking read operation on the specified line. The operation returns a boolean value: success or failure (0 and 1, respectively). This value is returned at once after a lookup in the cache. Success is returned only if there is a cached copy of the line, and this line is in state USED, NEEDED, or EXQOLB. In all other cases, including a cache miss, the returned value is failure. If a QOLB operation is issued on a memory line with a sharing list where the head is not in one of the four QOLB states, the sharing list is broken down, and the requester receives a copy in state USED. If another node issues a QOLB operation on the line, then it joins the list in state IDLE, and the node that previously was in state USED is changed to state NEEDED. New lines that are added after this will simply join the queue in state IDLE. The QOLB owner, which is in state NEEDED, is not affected by these new additions.
UnQOLB(line)
When the QOLB owner executes an UnQOLB operation the exclusive copy of the memory line is sent to the node that will become the new tail, which then changes from state IDLE to state NEEDED, and the old tail is invalidated. If the new tail is the only remaining node, it enters state USED instead of state NEEDED. If, after an UnQOLB operation, there are no other nodes in the list, then state is simply changed from USED or EXQOLB to EXCLUSIVE.
If a processor decides not to wait for a critical section any longer, it executes an UnQOLB operation which rolls the IDLE cache line out of the QOLB queue. If the process executing the UnQOLB operation gets a cache miss, or the line is cached in a non-QOLB state, the UnQOLB operation is a no-op.
ReQOLB(line)
This operation is implemented as a combined and atomic execution of UnQOLB and QOLB. A QOLB owner in state USED remains in this state as a result of the execution of a ReQOLB operation. When the owner is in state NEEDED, the new tail changes from state IDLE to NEEDED, and the line reenters the sharing list as a head in state IDLE. Note that this operation is not used in the library functions defined below. ReQOLB is most useful when there is a fixed set of processors cyclically sharing a resource, as will be demonstrated in Section 4.
Read and Write operations
The cache states NEEDED, USED and EXQOLB are both readable and writable states. All other read and write operations break down the QOLB list and leave the node with a cache line in the state EXQOLB.
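The state transitions given above for QOLB, UnQOLB and ReQOLB can be summarized in a small Python model. This is a sketch under simplifying assumptions, not the SCI implementation: the class QolbQueue is hypothetical, the line is assumed to arrive instantly, the owner is kept at index 0 of a list (the tail of the SCI list), and the EXQOLB breakdown path is not modeled. One transition is an assumption not stated in the text and is marked as such in the code.

```python
class QolbQueue:
    """Toy model of the QOLB queue for a single line."""

    OWNING = ("USED", "NEEDED", "EXQOLB")

    def __init__(self):
        self.queue = []   # queue[0] is the owner; the rest wait in arrival order
        self.state = {}   # node -> state name; absent means no cached copy

    def qolb(self, node):
        """Return True (success) only if the node already owns the line."""
        st = self.state.get(node)
        if st in self.OWNING:
            return True
        if st != "IDLE":                      # cache miss or non-QOLB state
            if not self.queue:
                self.state = {node: "USED"}   # other copies are broken down
                self.queue = [node]
            else:
                self.state[self.queue[0]] = "NEEDED"
                self.state[node] = "IDLE"
                self.queue.append(node)
        return False                          # failure is reported at once

    def unqolb(self, node):
        st = self.state.get(node)
        if st in ("USED", "NEEDED"):
            self.queue.pop(0)
            if self.queue:
                del self.state[node]          # old owner's copy is invalidated
                new_owner = self.queue[0]
                more = len(self.queue) > 1
                self.state[new_owner] = "NEEDED" if more else "USED"
            else:
                self.state[node] = "EXCLUSIVE"
        elif st == "IDLE":
            self.queue.remove(node)           # waiter rolls out of the queue
            del self.state[node]
            if len(self.queue) == 1:
                # Assumption (not stated in the text): with no waiters left,
                # the owner reverts from NEEDED to USED.
                self.state[self.queue[0]] = "USED"
        # Any other state, or a miss, is a no-op.

    def reqolb(self, node):
        if self.state.get(node) == "NEEDED":
            self.queue.append(self.queue.pop(0))
            self.state[node] = "IDLE"
            self.state[self.queue[0]] = "NEEDED"
        # An owner in state USED stays in USED.


def demo():
    q = QolbQueue()
    trace = [q.qolb("A"), q.qolb("A"), q.qolb("B"), q.qolb("C")]
    q.unqolb("A")                 # B becomes owner, with C still waiting
    after_first = dict(q.state)
    q.unqolb("B")                 # C becomes owner with no waiters
    q.unqolb("C")                 # no nodes remain in the list
    return trace, after_first, dict(q.state)
```

The first QOLB by node A returns failure even though A ends up owning the line, which reflects the rule that the result comes from the cache lookup alone; a second QOLB then reports success.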
In light of the above definitions of QOLB and UnQOLB, the acquire, release, prefetch, and initialize operations can now be specified as shown in listing 1. These operations should be library functions that users call in order to use QOLB correctly. The acquire operation is a modified spin-lock in which a process spins on the line as long as the QOLB operation returns failure. When the line has become local, a Test&Set operation is used to determine the state of the lock, which would be expected to be unset except in exceptional circumstances. If it is locked, then the QOLB queue must have broken down (or will soon), and the process must continue to spin on the line using QOLB. Note that a fallback scheme, necessary to guarantee forward progress, is not included here. The release operation simply unsets the lock and performs an UnQOLB on the line, allowing the next node in the queue to become the owner. Since the QOLB operation does not block the process while it waits for a line it can also be used alone as a prefetch operation, as shown below.
Pairwise sharing using QOLB
Both the pairwise sharing and QOLB features of SCI were intended to support efficient sharing of data. Pairwise sharing was incorporated specifically in recognition of the importance of efficient sharing between pairs of processors in fine-grained parallel applications. Both QOLB and pairwise sharing reduce the number of network traversals, thereby increasing the efficiency with which individual operations can be accomplished.
To enable the QOLB primitives to exploit pairwise sharing support, the obvious extension to QOLB is to allow the head of a two element list to be the QOLB owner. The other element in the list is then either in state IDLE (if it wants the line), or in state STALE (in case it has not indicated interest in acquiring the line). This alternative implementation of the QOLB queue is also supported in the SCI protocol.
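Returning to the acquire, release and prefetch library functions specified earlier in this section, one possible shape for them is sketched below in Python. This is a hypothetical illustration rather than the authors' listing: the class QolbLineModel is a software stand-in for the hardware queue, a threading.Lock models the atomicity of protocol actions, and, as in the text, no fallback scheme for a broken-down queue is included.

```python
import threading
import time


class QolbLineModel:
    """Toy, thread-safe stand-in for one line with a QOLB queue and a lock bit."""

    def __init__(self):
        self._mutex = threading.Lock()  # assumption: models protocol atomicity
        self._queue = []                # _queue[0] is the current owner
        self.lock_bit = False           # the lock stored in the line

    def qolb(self, me):
        with self._mutex:
            if me not in self._queue:
                self._queue.append(me)  # a miss enqueues the requester
                return False            # and reports failure at once
            return self._queue[0] == me

    def unqolb(self, me):
        with self._mutex:
            if me in self._queue:
                self._queue.remove(me)  # ownership passes to the next waiter

    def test_and_set(self):
        with self._mutex:
            old = self.lock_bit
            self.lock_bit = True
            return old

    def unset(self):
        with self._mutex:
            self.lock_bit = False


def acquire(line, me):
    while True:
        while not line.qolb(me):        # spin until the line is local
            time.sleep(0)
        if not line.test_and_set():     # expected to succeed in the normal case
            return
        # Lock found set: the queue must have broken down; keep spinning.


def release(line, me):
    line.unset()
    line.unqolb(me)


def prefetch(line, me):
    line.qolb(me)                       # non-blocking: just join the queue


def run_demo(n_threads=4, iters=50):
    line = QolbLineModel()
    total = [0]

    def worker(me):
        for _ in range(iters):
            acquire(line, me)
            total[0] += 1
            release(line, me)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

A process that knows it will need the critical section soon can call prefetch early and do other work; by the time it calls acquire, the spin on qolb may succeed on the first local test.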
Analysis of the producer/consumer problem
The producer/consumer problem is a very important example of pairwise sharing. In this section we describe a common scenario for the producer/consumer model, and show how QOLB makes it possible to exploit this synchronization method in fine-grained parallel execution.
The scenario is as follows: Producer P1 is alternately writing to buffers A and B. Consumer P2 is alternately reading buffers A and B after P1 writes them.
Synchronization must guarantee that the two processors proceed in lock-step, that is, P1 must not get so far ahead that it is acquiring a buffer it has produced but is not yet consumed. P2 must never catch up, that is, it must not consume buffers that have not yet been produced. In the model, P2 is not restricted from writing the buffer as well, providing for symmetry in the operation if desired. This provides a more powerful programming model than the pure producer/consumer model.
Each processor alternates between acquiring resources to enter a critical section and executing within a critical section. As soon as P1 reaches the end of a critical section it releases its lock, then immediately attempts to acquire the other lock. Listing 2 shows how a conventional producer/consumer exchange might occur.
Processor P1 begins execution with me=0, while processor P2 begins execution with me=1. Note that A.lock should be in a different cache line than the rest of the data structure A, since the lock will migrate to the processor not holding the lock. The owner identification, A.owner, could reside in the same cache line as A (and vice versa for B.lock and B.owner). Now consider the sequence of events that must occur when P1 tries to acquire B at the same time that P2 tries to acquire A.
1. [...]
2. P1 succeeds in acquiring B.lock, and now attempts to access the guarded data. Assuming that all the data (except the lock) are contained in a single cache line, an additional round-trip delay is required to access the data, which is resident in the cache of P2.
Thus four network traversal delays are necessary between the time a processor initiates its attempt to acquire the lock and the time at which it may access the guarded data. The situation is even worse if P1 and P2 differ significantly in the time they spend in the critical section. If P2 tries to acquire A.lock before P1 releases it, P2 fetches A.lock into its cache but cannot succeed. Now when P1 tries to release the lock, it must first invalidate the copy in P2 before releasing the lock. Upon invalidation of its copy, P2 must refetch a copy of A.lock from P1. Only after releasing A.lock can P1 fetch B.lock and acquire the lock.
Thus in this case the processor running behind, P1, must wait for six network traversals before it may enter the critical section. This delay can be reduced to four by recognizing that P1 need not wait for the unsetting of A.lock to complete before attempting to acquire B.lock. Thus a relaxed consistency model, such as release consistency [15], can limit the waiting time to only four network traversals.
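For orientation, the conventional two-buffer exchange analyzed above might look roughly like the following Python sketch. It is a hypothetical reconstruction built only from the description in the text (a lock and an owner field per buffer, with me=0 for the producer and me=1 for the consumer); the names Buffer, enter, leave and run_exchange are invented for illustration, and a threading.Lock stands in for a Test&Set lock.

```python
import threading
import time


class Buffer:
    def __init__(self):
        self.lock = threading.Lock()  # stands in for a Test&Set spin lock
        self.owner = 0                # whose turn it is: 0 = P1, 1 = P2
        self.data = None


def enter(buf, me):
    """Spin until the lock is held AND it is this process's turn."""
    while True:
        while not buf.lock.acquire(blocking=False):
            time.sleep(0)
        if buf.owner == me:
            return
        buf.lock.release()            # not our turn yet: back off and retry
        time.sleep(0)


def leave(buf, me):
    buf.owner = 1 - me                # hand the buffer to the other process
    buf.lock.release()


def run_exchange(n_items=20):
    bufs = [Buffer(), Buffer()]       # buffers A and B
    consumed = []

    def producer():                   # P1, me = 0
        for i in range(n_items):
            buf = bufs[i % 2]
            enter(buf, 0)
            buf.data = i
            leave(buf, 0)

    def consumer():                   # P2, me = 1
        for i in range(n_items):
            buf = bufs[i % 2]
            enter(buf, 1)
            consumed.append(buf.data)
            leave(buf, 1)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed
```

Every call to enter in this sketch corresponds to the point where, on a real machine, the lock line and then the data line must each be fetched across the interconnect, which is where the four (or six) traversal delays counted above arise.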
After the first iteration, the QOLB operation is initiated in advance, allowing the request to be overlapped with computation. Note that both processors continuously have QOLB requests pending for both lines, releasing a line only momentarily to explicitly allow it to migrate to the other processor's cache. Note also that the two internal infinite loops (that are exited through break statements) are specialized acquire functions that should preferably be hidden in a well documented library. This code exhibits the following features:
1. When a processor exits its critical section, it releases the line with a ReQOLB operation. After a single network traversal delay, the buffer has migrated to the other processor. If the two processors are operating in lockstep, only a single network traversal delay is lost due to synchronization delays. This demonstrates the prefetching capabilities of QOLB.
2. If one of the processors is running ahead, it may have to wait; but for the one running behind, the network latency in its critical path can be overlapped with processing. For example, if P1 is not able to keep ahead of P2, which completes its critical section sufficiently far in advance, P2 will spin-wait for a buffer to arrive from P1. However, P1 will find its buffer has already arrived when it exits a critical section and is ready to acquire one. Figure 3 shows a timeline demonstrating how the two processes naturally slip into phase so that the slower process never has to wait for its buffers to arrive, effectively experiencing zero synchronization latency, which is optimal for this case.
Figure 3: Producer/Consumer Timeline (processes P1 and P2)
In the pure producer/consumer model the producer can be given more opportunity to race ahead by providing additional buffers. The consumer, however, still must wait after attempting to acquire a buffer, so in the absence of prefetch, the network latency cannot be overlapped. In the QOLB example, the model can be fully symmetric, with each processor reading and writing each buffer, and the network latency does not lie on the critical path.
Comparison of QOLB with software queueing methods
The previous discussion highlights the benefits of QOLB by illustrating its use for an important case of process synchronization, namely, the producer/consumer problem. In the present section we will discuss another very important synchronization and data exchange scenario. The mechanism under study is the critical section, a general mechanism for synchronizing the exchange of shared data between processes that is prevalent in many parallel applications. The objective in this section is to look at the performance benefits of QOLB in this broader context.
The main alternatives to QOLB are schemes that provide support for critical sections chiefly "in software." In actuality, all schemes analyzed here use atomic read-modify-write primitives. In the absence of contention, a simple spin-lock employing the Test&Set primitive is sufficient to guarantee mutual exclusion. However, if a number of processes are competing for entry to a critical section then providing for queues in software may be more appropriate. The latter methods include algorithms by Anderson [8] and by Mellor-Crummey and Scott [7]. For all of these algorithms access to a critical section is controlled by a lock. In this section memory latency and interconnect bandwidth usage are compared for the SCI implementations of QOLB, Mellor-Crummey and Scott (MCS), and Anderson locks. For the case of pairwise sharing, where there is no contention for the lock, Test&Set spin-locks are also included. The main results are summarized in tables 5.2 and 2 at the end of the section. The reader is referred elsewhere for the details of the software queueing algorithms [7, 8].
Critical sections
A critical section consists of setting a lock, acquiring shared data, performing some computation, and releasing the lock. Any differences between synchronization mechanisms may show up in all but the computation portion. For fine-grained parallelism and with proper packing of data items, a single (cache) line of shared data may be all that is needed. For larger grained computations it is desirable that the latency of acquiring additional lines of shared data be overlapped with acquisition of the first line and/or with the critical section computation. Thus, for this discussion, only latency to acquire and release a lock and to acquire (i.e., make local) the first line of shared data are considered.
A process accessing a critical section may find the lock in one of two states, namely, idle or busy. Measuring idle lock performance yields information about the overhead of the locking mechanism, as well as the latency for acquiring free resources. In this situation it is important to include not only the cost of entering the critical section, but of leaving as well, since the requesting process will experience both latencies. In the second case, a busy critical section, time is measured from the point at which the lock is released until the next process has acquired both the lock and the first line of data. This latency is a good indication of how long it takes for a resource to change hands.
The important special case of pairwise sharing occurs when exactly two processes alternately access a critical section. In the SCI cache coherence protocol pairwise sharing has been optimized by removing the need for either process to access the directory for the line being shared, as discussed in Section 2. Since contention in a pairwise sharing situation is limited, a simple spin-waiting algorithm employing Test&Set can be used to provide the same or better performance than more complicated software schemes. For pairwise sharing, QOLB is compared against Test&Set as well as the [...]
If it is assumed that an application can statically allocate variables so as to place them in the memories of specific processors, then spinning using remote word primitives may potentially be performed locally by a particular process. Both of these implementations of Fetch&Φ operations are considered.
Latency calculations
Assuming that processors are relatively fast compared to memory and the interconnect, the time required to execute a critical section entry or exit procedure is dominated by the latency of the synchronization operations and accesses to shared data. To simplify the analysis it is also assumed that these operations consist solely of memory accesses and interconnect traversals.
Each processor is assumed to have two levels of cache, a fast first level (processor) cache and a slower second level cache. The latency of the processor cache is ignored. The large, second level cache is likely to be built using approximately the same speed DRAM as main memory. Hence, for simplicity, all memory accesses, including those to the second level cache, are assumed to require the same amount of time.
With the above restrictions the latency $t$ for accessing a critical section can be computed for a given sequence of operations by the equation $t = n_{msg} \cdot t_{msg} + n_{mem} \cdot t_{mem}$, where $n_{msg}$ and $n_{mem}$ are the number of one-way traversals of the interconnect by messages and the number of memory accesses, respectively, and $t_{msg}$ and $t_{mem}$ are the corresponding times for these events.
Thus, calculating the latency of an algorithm consists of counting the interconnect traversals and memory accesses that it requires. For example, a write operation (or cache coherent Fetch&Φ operation) on a word in a line that is not local to the processor requires the following sequence of operations: (1) the initial miss in the requester's second level cache, (2) a request to the directory for the line, (3) an access to the directory, (4) a response yielding the current owner of the line, (5) a request to this node, (6) an access to fetch the line, (7) a reply containing the desired data, and (8) an access to place the line in the requester's cache. This results in four traversals of the interconnect and four memory accesses. The QOLB algorithm essentially requires these operations (plus an additional memory access for the initial QOLB operation) to acquire an idle critical section. Anderson's algorithm requires three such write operations when coherent Fetch&Φ operations are used to acquire an idle critical section. Two significant improvements can be made to the lock algorithms. First, remote word primitives can be substituted for coherent line primitives. Although this analysis assumes that coherent line primitives are always used to acquire the data, either type of primitive can be used for synchronization. If Fetch&Φ operations are performed at memory then only two traversals of the interconnect and a single memory access are required. Second, acquiring a line in pairwise sharing mode eliminates the need to check the directory.
Thus, only two traversals of the interconnect and three memory accesses are required. The memory accesses include the initial miss in the cache, the access to the other processor's cache, and the actual write into the requester's cache. It should be noted that pairwise sharing speeds up only coherent line operations.
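The counting procedure can be sketched by tagging each step of a sequence as a message or a memory access. The eight-step list follows the coherent write enumerated above. The exact ordering of the remote-word and pairwise sequences is an assumption made for illustration; only their totals are stated in the text. All names are hypothetical.

```python
from collections import Counter

# Each step is tagged "msg" (one-way interconnect traversal) or "mem" (memory access).

# Coherent write to a non-local line, steps (1) through (8) of the text.
COHERENT_WRITE = ["mem", "msg", "mem", "msg", "msg", "mem", "msg", "mem"]

# Fetch-and-op performed at memory: request, access at memory, reply (assumed order).
REMOTE_FETCH_OP = ["msg", "mem", "msg"]

# Pairwise sharing: local miss, request to the other cache, access there,
# reply, write into the requester's cache (assumed order; no directory lookup).
PAIRWISE_WRITE = ["mem", "msg", "mem", "msg", "mem"]


def count_events(steps):
    """Return (n_msg, n_mem) for a list of tagged steps."""
    tally = Counter(steps)
    return tally["msg"], tally["mem"]


for name, steps in [
    ("coherent write", COHERENT_WRITE),
    ("remote fetch-and-op", REMOTE_FETCH_OP),
    ("pairwise write", PAIRWISE_WRITE),
]:
    n_msg, n_mem = count_events(steps)
    print(f"{name}: {n_msg} traversals, {n_mem} memory accesses")
```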
The equation for calculating lock algorithm latency presented above requires two types of inputs. The first are the numbers of one-way interconnect traversals and memory accesses, which can be found in table 5.2. The second are the latencies of typical traversals and memory accesses, which must be estimated.
While memory access time can be treated as a constant for most purposes, interconnect traversal latency depends on a number of important factors, including interconnect architecture, size, and loading. Thus, it is difficult to determine a precise value for $t_{msg}$. In the rest of our analysis we will assume only that it is greater than $t_{mem}$.
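Because only the ordering of the two times is assumed, a comparison can be expressed in units of the memory access time using the ratio r of traversal time to memory access time. The sketch below, with hypothetical function names, compares a coherent write (four traversals, four accesses) against a fetch-and-op performed at memory (two traversals, one access). The ratio of the two latencies is (4r + 4) / (2r + 1) = 2 + 2 / (2r + 1), so it falls from 8/3 at r = 1 toward 2 as r grows.

```python
def normalized_latency(n_msg, n_mem, r):
    """Latency in units of the memory access time, with r = t_msg / t_mem."""
    return n_msg * r + n_mem


def coherent_over_remote(r):
    """Ratio of coherent-write latency (4, 4) to remote fetch-and-op latency (2, 1)."""
    return normalized_latency(4, 4, r) / normalized_latency(2, 1, r)


# The analysis assumes r > 1; r = 1 is included only as the limiting case.
for r in (1, 2, 5, 10, 100):
    print(r, round(coherent_over_remote(r), 3))
```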
Analysis results
Before comparing the various lock and data fetch algorithms, it is beneficial to choose between the two types of synchronization primitives (i.e. coherent and remote word). This is especially easy in the non-pairwise cases. From tables 5.2 and 2, one sees that the remote word primitives clearly possess a performance advantage, because fewer interconnect traversals and memory accesses are required to complete a Fetch&Φ operation. Only the MCS algorithm for idle locks gives even similar results. Thus, for the remainder of the study, remote word operations will be used for these algorithms.
From the counts of traversals and memory accesses it is seen that QOLB has a significant performance advantage over the other lock algorithms: QOLB (with no prefetch) is about twice as fast as the Anderson and MCS schemes for idle critical sections and four to five times as fast for busy critical sections (i.e. lock transfer). If prefetching is possible, then QOLB's performance for idle locks improves by a much greater margin. The only latency involved is that for issuing the first QOLB, a second-level cache access. These results are very similar to those from a simulation study of a different architecture [14]. Table 2 also shows results for situations involving pairwise sharing. In this case, the SCI-QOLB implementation produces a sizable increase in performance compared to all of the other algorithms. This benefit is, in all but one case, somewhat smaller for those algorithms that use remote word operations. For situations where the application knows in advance that a pairwise sharing situation exists, simple spin-locks using Test&Set and Unset may be used. With only two processes actively contending for a lock, one of which will typically already possess the lock, Test&Set is quite efficient. In fact, Test&Set will outperform the other software queuing methods due to its low overhead. Test&Set and QOLB are approximately equal when a critical section is idle and prefetching is not used.
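A simple Test&Set spin-lock for the two-process case can be sketched as follows. This is an illustration only: Python has no hardware Test&Set, so atomicity is emulated with a `threading.Lock`, and the class and function names are hypothetical. The sketch shows the acquire and release logic, not the latency behavior discussed in the text.

```python
import threading
import time


class TestAndSetLock:
    """Spin lock built on an emulated atomic Test&Set and Unset."""

    def __init__(self):
        self._flag = False
        # Assumption: this guard stands in for the atomicity that hardware provides.
        self._atomic = threading.Lock()

    def test_and_set(self):
        """Atomically set the flag and return its previous value."""
        with self._atomic:
            old = self._flag
            self._flag = True
            return old

    def unset(self):
        """Clear the flag, releasing the lock."""
        with self._atomic:
            self._flag = False

    def acquire(self):
        # Spin until Test&Set reports that the flag was previously clear.
        while self.test_and_set():
            time.sleep(0)  # yield so the current holder can make progress

    def release(self):
        self.unset()


def run_pair(iterations=100):
    """Two threads increment a shared counter under the lock; return the total."""
    lock = TestAndSetLock()
    counter = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            counter[0] += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Calling `run_pair(100)` should return 200 if mutual exclusion holds, since each of the two threads performs 100 protected increments.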
QOLB's largest performance advantage is realized when there is no contention for the lock. In this case QOLB can prefetch and have data ready in the cache when the critical region is entered. Table 2 shows this as the two cases with a zero message count. In the same situation the MCS algorithm uses between four and eight interconnect traversals, depending upon other parameters.
To sum up this section, we conclude that QOLB hardware in the SCI protocol decreases by a large factor the number of interconnect traversals needed to acquire a lock and the related shared data. When there is contention for the data, QOLB uses only one interconnect traversal to fetch both the lock and the data. In the absence of contention, the number of traversals is zero if prefetch is possible, and two (pairwise sharing) or four (no pairwise sharing) when prefetch is not possible. This low number of traversals gives the programmer of a shared-memory SCI-based computer the possibility of getting useful work out of a large number of processors.
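The traversal counts stated in this conclusion can be collected into a small lookup. The function name and its parameters are hypothetical; the returned values are the ones given in the paragraph above.

```python
def qolb_traversals(contended, prefetch=False, pairwise=False):
    """Interconnect traversals for QOLB to acquire a lock and its shared data.

    contended: another processor holds the lock (lock transfer)
    prefetch:  the QOLB request could be issued in advance
    pairwise:  the line is held in pairwise sharing mode
    """
    if contended:
        return 1  # a single traversal delivers both lock and data
    if prefetch:
        return 0  # data is already in the cache on entry
    return 2 if pairwise else 4


print(qolb_traversals(contended=True))                   # 1
print(qolb_traversals(contended=False, prefetch=True))   # 0
print(qolb_traversals(contended=False, pairwise=True))   # 2
print(qolb_traversals(contended=False))                  # 4
```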
Summary
We have discussed and analyzed an implementation of the QOLB primitives based on the Scalable Coherent Interface. Since both QOLB and the base SCI cache coherence protocol are based on a linked-list directory, the starting point for implementing QOLB in SCI is very good. An SCI-based system can be implemented with some nodes supporting the QOLB option while others do not.
We have addressed a number of SCI-specific issues. Particular attention was paid in the implementation to ensuring that knowledge of QOLB is not needed by processes that do not wish to use that primitive. In addition, our definition of QOLB allows its implementation as a no-op.
We have compared the SCI-QOLB implementation to two software schemes [7, 8] . The performance of the software schemes is dependent on the coherence protocol of the hardware used, and for our comparison we have assumed they use the base SCI cache coherence protocol. QOLB reduces latency over these schemes by a factor of two to four, although much greater improvements are possible if QOLB is used both to synchronize on and prefetch the lock and shared data associated with a critical section.
By using the SCI cache-to-cache write transaction that moves a memory line copy from one cache to another, as well as the pairwise sharing optimization, the QOLB mechanism for the exchange of critical section ownership reduces latency to a small number of interconnect traversals and memory accesses. By using the doubly-linked SCI sharing list to queue waiters on a critical section, we also allow an enqueued processor to change its mind and leave the waiting queue before acquiring the lock.
We have also analyzed a producer/consumer situation. We demonstrated how QOLB can be used to prefetch shared data, completely eliminating interconnect latency. We have shown that such a situation exists when a producer has released the data before a consumer starts using it and the consumer has an outstanding prefetch on this data. In this case the consumer acquires the data without any network delay: it simply accesses its cache and finds that the data is already there. Such overlap of execution and data transport will typically happen when it is needed the most, that is, for critical path processes. When all predecessors in the dependency graph are completed, the critical path process (having issued a prefetch on the data) can immediately continue doing useful work, assuming the needed data is released before the last predecessor completes. If the program is such that processes along the critical path can be split up into yet smaller processes that communicate and prefetch using QOLB, a finer execution granularity may be achieved that could yield even higher performance.
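The producer/consumer overlap can be illustrated with a simple timing model. This is a sketch under stated assumptions, not the analysis of the paper: `prefetch_time` is taken to be the moment the consumer's request is queued at the producer, the data leaves the producer once it has been both requested and released, and `transfer_time` is the time for the data to reach the consumer's cache. All names and numbers are hypothetical.

```python
def consumer_stall(release_time, prefetch_time, need_time, transfer_time):
    """Time the consumer waits at need_time before the data is in its cache."""
    # Assumption: the transfer starts once the data is both released and requested.
    arrival = max(release_time, prefetch_time) + transfer_time
    return max(0.0, arrival - need_time)


# Producer releases early and the prefetch is outstanding: no network delay is seen.
print(consumer_stall(release_time=5.0, prefetch_time=0.0,
                     need_time=20.0, transfer_time=10.0))   # 0.0

# Producer releases late: part of the transfer is exposed as stall time.
print(consumer_stall(release_time=15.0, prefetch_time=0.0,
                     need_time=20.0, transfer_time=10.0))   # 5.0
```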
The results presented in this paper agree well with those of an earlier study [14] in which the performance benefits of QOLB were quantified for a bus-based architecture. The results demonstrate that there is much to be gained by providing hardware support for critical sections. While simulation studies of real programs might demonstrate more clearly the advantages to be gained by the use of QOLB, they still would be unable to assess the real value of QOLB since the algorithms were not designed to exploit such hardware support.
To demonstrate our theses, a model of a full SCI system and a C-compiler that can handle the QOLB primitives are under development. The compiler and SCI system are nearing completion. The next task will be to rewrite current applications and analyze their new performance.
In addition, QOLB provides opportunities for additional relaxation of the consistency model, which suggests further advantages. These issues are the subject of continuing study.
