Multiprocessor architects have begun to explore several mechanisms, such as prefetching, context-switching, and software-assisted dynamic cache coherence, which transform single-phase memory transactions in conventional memory systems into multiphase operations. Multiphase operations introduce a window of vulnerability in which data can be invalidated before it is used.
Introduction
One of the major thrusts of multiprocessor research has been the exploration of mechanisms that provide ease of programming, yet are amenable to cost-effective implementation.
To this end, a substantial effort has been expended in providing efficient shared memory for systems with large numbers of processors. Many of the mechanisms that have been proposed for use with shared memory, such as rapid context-switching, software prefetch, fast message-handling, and software-assisted dynamic cache coherence, enhance different aspects of multiprocessor performance; thus, combining them into a single architectural framework is a desirable goal.
This paper investigates such a unifying framework, and explores one consequence: the window of vulnerability. Although we have implemented the complete framework in the MIT Alewife machine [1], mechanisms can be mixed and matched; other multiprocessor designers may choose to implement a subset of this framework that suits their own needs.
Many of the mechanisms associated with shared memory attempt to address a central problem: access to global memory may require a large number of cycles. To fetch data through the interconnection network, the processor transmits a request, then waits for a response. The request may be satisfied by a single memory node, or may require the interaction of several nodes in the system. In either case, many processor cycles may be lost waiting for a response.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
ASPLOS V, 10/92, MA, USA. © 1992 ACM 0-89791-535-6/92/0010/0274...$1.50
In a traditional shared-memory multiprocessor, remote memory requests can be viewed as split-phase transactions, consisting of a request and a response. The time between request and response may be composed of a number of factors, including communication delay, protocol delay, and queueing delay. Since a simple single-threaded processor can typically make no forward progress until its requested data word arrives, it spins while waiting. When the data word arrives, the processor consumes the data immediately, possibly placing it in a local cache.
Rather than spinning, a processor might choose to do other useful work.
To tolerate long access latencies, architects have proposed a number of mechanisms such as prefetching, weak ordering, multithreading, and software-enforced coherence. All are variations on a central theme: they allow processors to have multiple outstanding requests to the memory system. A processor launches a number of requests into the memory system and performs other work while awaiting responses. This capability reduces processor idle time and allows the system to increase its utilization of the network.
The ability to handle multiple outstanding requests may be implemented with either polling or signaling mechanisms. Polling involves retrying memory requests until they are satisfied. This is the behavior of simple RISC pipelines which implement non-binding prefetch or context-switching through synchronous memory faults. Signaling involves additional hardware mechanisms that permit data to be consumed immediately upon its arrival.
Such signaling mechanisms would be similar to those used when implementing binding prefetch or out-of-order completion of loads and stores. This paper explores the problems involved in closing the window of vulnerability in polled, context-switching processors. While signaling leads to related approaches, a detailed discussion of these is beyond the scope of this paper.
In Figure 1, events on the lower line are associated with the processor, and events on the upper line are associated with the memory system. In the figure, a processor initiates a memory transaction (Initiate 1), and instead of waiting for a response from the memory system, it continues to perform useful work. During the course of this work, it might initiate yet another memory transaction (Initiate 2).
At some later time, the memory system responds to the original request (Response to Request 1). Finally, the processor completes the transaction (Access 1).
Since a processor continues working while it awaits responses from the memory system, it might not use returning data immediately. Such is the case in the scenario in Figure 1. When the processor receives the response to its second request (Response to Request 2), it is busy with some (possibly unrelated) computation. Eventually, the processor completes the memory transaction (Access 2).
Thus, we can identify three distinct phases of a transaction:
1. Request Phase - The time between the transmission of a request for data and the arrival of this data from memory.
2. Window of Vulnerability - The time between the arrival of data from memory and the initiation of a successful access of this data by the processor.
3. Access Phase - The period during which the processor atomically accesses and commits the data.
The window of vulnerability results from the fact that the processor does not consume data immediately upon its arrival. During this period, the data must be placed somewhere, perhaps in the cache or a temporary buffer. Note that a simple split-phase transaction can be seen as a degenerate multiphase transaction with zero cycles between response and access. The period between the response and access phases of a transaction is crucial to forward progress. Should the data be invalidated or lost due to cache conflicts during this period, the transaction is terminated before the requesting thread can make forward progress.
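As an illustration only, the three phases can be modeled as a small state machine. The sketch below is in Python; the names `Phase`, `Transaction`, `response_arrives`, `invalidate`, and `access` are hypothetical and are not taken from the Alewife implementation. It assumes that an invalidation or cache conflict matters only while the data sits in the window of vulnerability.

```python
from enum import Enum, auto


class Phase(Enum):
    REQUEST = auto()     # request sent, data not yet returned
    WINDOW = auto()      # data returned, not yet accessed
    ACCESS = auto()      # data accessed and committed
    TERMINATED = auto()  # data lost during the window


class Transaction:
    """Hypothetical model of one multiphase memory transaction."""

    def __init__(self, address):
        self.address = address
        self.phase = Phase.REQUEST

    def response_arrives(self):
        # Data arrives from memory: the window of vulnerability opens.
        if self.phase is Phase.REQUEST:
            self.phase = Phase.WINDOW

    def invalidate(self):
        # Assumption: only data waiting in the window can be lost.
        if self.phase is Phase.WINDOW:
            self.phase = Phase.TERMINATED

    def access(self):
        # The processor returns to consume the data; True on success.
        if self.phase is Phase.WINDOW:
            self.phase = Phase.ACCESS
            return True
        return False
```

In this model, a split-phase transaction corresponds to calling `access` immediately after `response_arrives`, which leaves no opportunity for `invalidate` to intervene.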
Closing the window of vulnerability involves ensuring forward progress for multiphase memory transactions. The consequences of lost data are more subtle and perilous than simple squandering of memory resources. The window of vulnerability allows scenarios in which processors repeatedly attempt to initiate transactions only to have them canceled during the window of vulnerability. In certain pathological cases, individual processors are prevented from making forward progress by cyclic thrashing situations. While such situations may be rare, they are as fatal as any other livelock or deadlock situation.
The window of vulnerability is also opened by another class of mechanisms. This class contains a number of mechanisms including fast I/O, interprocessor messages, synchronization primitives, and extensions of the memory system through software. When implementing such mechanisms, the successful completion of a spinning load or store to memory may depend on the execution of network interrupts. These asynchronous events must be able to fault an instruction which is in progress, thereby opening a window of vulnerability. The term high-availability interrupt is applied to such externally initiated pipeline interruptions.
Figure 2 illustrates this scenario with an architecture that supports fast message handling. In the figure, the processor is spinning while waiting to access a remote memory block. Several messages have entered the processor's input queue before the desired memory response. Consequently, the processor will not make forward progress unless a high-availability interrupt is invoked to process these messages.
This paper describes a framework that eliminates livelock problems associated with the window of vulnerability for systems with multiple outstanding requests and high-availability interrupts. The system keeps track of pending memory transactions in such a way that it can dynamically detect and eliminate pathological thrashing behavior. The framework consists of three major components: a small, associative set of transaction buffers that keep track of outstanding memory requests, an algorithm called thrashwait that detects and eliminates livelock scenarios that are caused by the window of vulnerability, and a buffer locking scheme that prevents livelock in the presence of high-availability interrupts.
Not all architects will need to implement the full gamut of mechanisms described in this paper. For this reason, we describe the different subsets of the framework and the mechanisms that each subset will support. In order to motivate the architectural framework that we propose, Section 2 presents examples of shared memory mechanisms. Section 3 then shows how the window of vulnerability can impede a system's forward progress. Section 4 explores several components of the framework, each of which provides part of the solution for ensuring forward progress. Section 4 concludes with a hybrid architecture that combines these components to implement all of the mechanisms.
The Window of Vulnerability
To describe the window of vulnerability, we consider the memory system as a black box that satisfies memory requests. While forward progress on the memory system side is important, it is beyond the scope of this paper. The window of vulnerability affects forward progress after the memory system has responded to a request. Consequently, when we say that a processor (or hardware thread) does or does not make forward progress, we are referring to properties of its local hardware and software, assuming that the remote memory system always satisfies requests.
To be more precise, a processor thread makes forward progress whenever it commits an instruction. Given a processor with precise interrupts, we can think of this as advancing the instruction pointer. A load or store instruction can be said to make forward progress if the instruction pointer is advanced beyond it.
Primary and Secondary Transactions
The distinction between primary and secondary transactions, in- They are, however, "upgraded" to primary status the moment a load or store attempts to access their data.
Memory models differ in the degree to which they require primary transactions to complete before the associated loads or stores commit. Sequentially consistent machines, for instance, require write transactions (associated with store instructions) to advance beyond the request phase before their associated threads make forward progress. Weakly-ordered machines, on the other hand, permit store instructions to commit before the end of the request phase. In a sense, the cache system promises to ensure that store accesses complete. Therefore, for weakly-ordered machines, write transactions have no window of vulnerability. In contrast, most memory models require a read transaction to receive a response from memory before committing the associated load instruction.
As tern response and the instant that context A is reenabled. During the window, the memory system causes block X to be invalidated from the processor's cache. Figure 4 shows the multi-node scenario that causes this invalidation. There are three processing nodes in the figure: node 1 is the node associated with the time-line in Figure 3; node 2 is the home node for block X; and node 3 is the node that causes the interference. Some time after the home node has serviced the request, node 3 issues a write request for block X to node 2. In response, node 2 transmits an invalidation message to node 1, waits for an acknowledgment message, and eventually transmits write permission to node 3. As a result, node 1 must repeat its read request when it reenables context A at the end of the time-line in Figure 3.
There is no reason to expect that node 3 will actually complete the write to block X before node 1 repeats its read request! If this is the case, it is possible for node 2 to invalidate block X in node 3 before the write is finished. Given an unfortunate coincidence in timing, this vicious cycle of invalidation or internode thrashing can continue forever. Our simulations indicate that this thrashing is an infrequent event, but it does happen at some point during the execution of most programs. Without a solution to the thrashing scenario, the system would livelock (effectively causing the machine to crash).
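The timing coincidence behind internode thrashing can be illustrated with a toy model. The sketch below is a deliberate simplification and not the Alewife protocol: it assumes that a node's copy of block X is invalidated at the instant the other node's copy arrives, ignores acknowledgment messages, and gives both nodes the same request latency and the same delay (`window`) between the arrival of data and the attempted access. The function name and its parameters are hypothetical.

```python
def first_to_complete(latency, window, offset, max_rounds=100):
    """Return 1 or 3 for the node that first completes an access, or None."""
    # Cycle at which each node's requested copy of block X arrives.
    # Node 3 issues its first request `offset` cycles after node 1.
    arrival = {1: latency, 3: offset + latency}
    for _ in range(max_rounds):
        # The node whose delayed access comes due first tries next.
        node = min(arrival, key=lambda n: arrival[n] + window)
        other = 3 if node == 1 else 1
        fill = arrival[node]
        access_time = fill + window
        # The copy is lost if the other node's copy arrived inside the window.
        if fill < arrival[other] <= access_time:
            arrival[node] = access_time + latency  # repeat the request
        else:
            return node
    return None
```

Under these assumptions, `first_to_complete(10, 30, 15)` finds no node completing within the round limit, because each retry lands inside the other node's window, whereas `first_to_complete(10, 0, 15)` returns 1: with a zero-size window the data is consumed on arrival.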
Severity of the Window of Vulnerability
This section substantiates our claim that the window of vulnerability poses a significant problem in shared memory architectures.
The Alewife simulator calculates the time between the instant that a data block becomes valid in a cache due to a response from memory and the first subsequent access to the cached data. The simulator measures this period of time only for the fraction of memory accesses that generate network traffic and are thus susceptible to the window. The sharp spike at zero cycles illustrates the role of context switching and high-availability interrupts in causing the window of vulnerability. The spike is caused by certain critical sections of the task scheduler that disable context switching, as described in Section 2. When context switching is disabled, a processor will spin-wait for memory accesses to complete, rather than attempting to tolerate the access latency by doing other work. In this case, the processor accesses the cache on the same cycle that the data becomes available. Such an event corresponds to a zero-size window of vulnerability. The window becomes a problem only when context switching is enabled or when high-availability interrupts interfere with memory accesses.
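The measurement described above can be sketched as a single pass over a trace of cache events. The trace format used here, a sequence of `(cycle, kind, block)` tuples with `kind` equal to `"fill"` or `"access"`, is an assumption made for illustration and is not the format of the Alewife simulator; `window_sizes` is a hypothetical helper.

```python
from collections import Counter


def window_sizes(events):
    """Cycles between each cache fill and the first later access to it."""
    pending = {}  # block -> cycle at which it became valid
    sizes = []
    # Order by cycle; within a cycle, process fills before accesses so
    # that a same-cycle access is counted as a zero-size window.
    for cycle, kind, block in sorted(events, key=lambda e: (e[0], e[1] != "fill")):
        if kind == "fill":
            pending[block] = cycle
        elif kind == "access" and block in pending:
            sizes.append(cycle - pending.pop(block))
    return sizes


trace = [
    (100, "access", "X"), (100, "fill", "X"),  # spin-wait: consumed at once
    (120, "fill", "Y"), (150, "access", "Y"),  # context switched away
    (160, "access", "Y"),                      # later access, not counted
]
histogram = Counter(window_sizes(trace))
```

Only the first access after each fill contributes a sample, matching the definition of the window as the time from arrival of the data to the first subsequent access.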
The window of vulnerability histogram in Figure 5 is qualitatively similar to other measurements made for a variety of programs and architectural parameters. The time between cache fill and cache access is usually short, but a small fraction of memory transactions always suffer from long windows of vulnerability. In general, both the average window size and the standard deviation increase with the number of contexts per processor. The window size and standard deviation also grow when the context switch time is increased. We have observed that high-availability interrupts cause the same type of behavior, although their effects are not quite as dramatic as the effect of multiple contexts.
For the purposes of our argument, it does not matter whether the window of vulnerability is large or small, common or uncom- We refer to this scheme as touchwait, because data blocks are held until the requesting context returns to "touch" them. Touchwait eliminates the livelock scenarios of the previous section, because the cache retains data blocks until the requesting context returns to access them.
instruction-data:
Thrashing between a remote instruction and its data yields a deadlock in the presence of locks. This occurs after a load or store instruction has been successfully fetched for the first time. Then, a request is sent for the data, causing a context-switch. When the data block finally returns, it replaces the instruction and becomes locked.
However, the data will not be accessed until after the processor refetches the instruction.
Primary-secondary deadlock is easily removed by recognizing that secondary transactions are merely hints; locking them is not necessary to ensure forward progress. Unfortunately, the remaining deadlocks have no obvious solution. Due to these deadlock problems, pure locking cannot be used to close the window of vulnerability.
Finally, the associative matching mechanism can permit contexts to access buffers that are locked by other contexts. Such accesses would have to be performed directly to and from the buffers in question, since placing them into the cache would effectively unlock them. This optimization is useful in a machine with medium-grained threads, since different threads often execute similar code and access the same synchronization variables.
Associative Locking

Thrashwait
Locking transactions prevents livelock by making data invulnerable during a transaction's window of vulnerability. In order to attack the window from another angle, we note that the window is eliminated when the processor is spinning while waiting for data: when the data word arrives, it can be consumed immediately. This observation does not seem to be useful in a machine with context-switching processors, since it requires spinning rather than switching. However, if the processors could context-switch "most of the time," spinning only to prevent thrashing, the system could guarantee forward progress. We call this strategy thrashwait (as opposed to touchwait). The trick in implementing thrashwait lies in dynamically detecting thrashing situations.
The thrashwait detection algorithm is based on an assumption that the frequency of thrashing is low. Thus, the recovery from a thrashing scenario need not be extremely efficient. Note that global accesses, which involve shared locations and the cache-coherence protocol, are distinguished here from local accesses, which are unshared and do not involve the network or the protocol. When the following criteria are true, the memory system detects a thrashing situation:
1. The context requests a global load or store that misses in the cache.
2. There is no associated transaction-in-progress state, because the transaction has completed.
3. The context's tried-once bit is set.
The fact that the tried-once bit is set indicates that this context has recently launched a primary transaction but has not successfully completed a global load or store in the interim. Thus, the context has not made forward progress. In particular, the current load or store request must be the same one that launched the original transaction. The fact that transaction-in-progress is clear indicates that the transaction had completed its request phase (data was returned).
Consequently, the fact that the access missed in the cache means that a data block has been lost. Once thrashing has been detected, the thrashwait algorithm requests the data for a second time and disables context-switching, causing the processor to wait for the data to arrive.
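The three detection criteria and the recovery step can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the Alewife hardware: the cache is reduced to a set of valid addresses, transaction-in-progress state to a second set, and the class and method names (`Context`, `ThrashwaitCache`, `fill`, `invalidate`, `access`) are hypothetical.

```python
class Context:
    """A hardware context with its tried-once bit."""

    def __init__(self):
        self.tried_once = False


class ThrashwaitCache:
    def __init__(self):
        self.valid = set()        # addresses currently valid in the cache
        self.in_progress = set()  # addresses with transaction-in-progress state

    def fill(self, addr):
        """Memory response: the request phase ends and the line becomes valid."""
        self.in_progress.discard(addr)
        self.valid.add(addr)

    def invalidate(self, addr):
        """The line is lost to an invalidation or a cache conflict."""
        self.valid.discard(addr)

    def access(self, ctx, addr):
        """Return 'hit', 'switch' (context-switch), or 'wait' (spin for data)."""
        if addr in self.valid:
            ctx.tried_once = False      # a global access completed
            return "hit"
        if addr in self.in_progress:
            return "switch"             # still in the request phase; poll later
        if ctx.tried_once:
            self.in_progress.add(addr)  # request the data a second time
            return "wait"               # thrashing detected: disable switching
        ctx.tried_once = True           # first attempt: launch the transaction
        self.in_progress.add(addr)
        return "switch"
```

A result of `"wait"` stands for the processor spinning with context-switching disabled, so that the next `fill` is followed immediately by a successful access.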
Multiple Primary Transactions
Systems requiring two primary transactions can be accommodated by providing two tried-once bits, one for instructions and the other for data. To see why a single bit is not sufficient, consider an instruction-data thrashing situation with a single tried-once bit. Assuming that a processor has successfully fetched the load or store instruction, it proceeds to send a request for the data, sets the tried-once bit, and switches contexts. When the data block finally arrives, it displaces the instruction; consequently, when the context returns to retry the instruction, it concludes that it is thrashing on the instruction fetch.
Context-switching will be disabled until the instruction returns, at which point the tried-once bit is cleared. Thus, the algorithm fails to detect thrashing on the data line.
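The single-bit failure described above, and the effect of keeping one tried-once bit per primary transaction, can be reproduced with a toy model. The sketch assumes a cache with a single line that the instruction and its data both map to, assumes that every launched request completes while the context is switched out, and clears a tried-once bit whenever the corresponding access succeeds. The function `attempts_until_commit` is hypothetical and exists only for this illustration.

```python
def attempts_until_commit(two_bits, max_attempts=20):
    """Return the attempt on which the load commits, or None if it never does."""
    line = "I"       # the one cache line; the instruction starts out resident
    tried_once = {}  # tried-once bits, keyed by bit name

    def bit(kind):
        # With a single bit, instruction and data share the same state.
        return kind if two_bits else "shared"

    for attempt in range(1, max_attempts + 1):
        for kind in ("I", "D"):  # pipeline order: instruction fetch, then data
            if line == kind:
                tried_once[bit(kind)] = False  # access succeeded
            elif tried_once.get(bit(kind), False):
                line = kind                    # thrashwait: spin, then consume
                tried_once[bit(kind)] = False
            else:
                tried_once[bit(kind)] = True   # launch request and switch away
                line = kind                    # block arrives while switched out
                break
        else:
            return attempt  # fetch and data access both completed
    return None
```

In this model, `attempts_until_commit(False)` never commits the load within the attempt limit, because the shared bit is cleared by the instruction fetch before the data access is retried, while `attempts_until_commit(True)` commits on the third attempt.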
The presence of two separate tried-once bits solves this prob- 
Elimination of Thrashing
The thrashwait algorithm identifies primary transactions that are likely to be terminated prematurely; that is, before the requesting thread makes forward progress. Assuming that there are no high-availability interrupts, thrashwait removes livelock by breaking the thrashing cycle. Thrashwait permits each primary transaction to be aborted only once before it disables the context-switching mechanism and closes the window of vulnerability.
In a system with multiple primary transactions, livelock removal occurs because primary transactions are ordered by the processor pipeline. A context begins execution by requesting data from the cache system in a deterministic order. Consequently, under worst-case conditions, when all transactions are thrashing, the processor will work its way through the implicit order, invoking thrashwait on each primary transaction in turn. Although a context-switch may flush its pipeline state, the tried-once bits remain, forcing a pipeline freeze (rather than a switch) when thrashing occurs.
Freedom From Deadlock
In this section, we prove that the thrashwait algorithm does not suffer from any of the deadlocks illustrated in Figure 6. We assume (for now) that a processor launches only one primary transaction at a time. Multiple primary transactions, which must complete to make forward progress, are allowed; multiple simultaneous transactions, which are caused by a system that presents several addresses to the memory system at once, are not allowed. At the end of the proof, we discuss a modification to the thrashwait algorithm that is necessary for handling multiple functional units and address buses.
The proof of the deadlock-free property proceeds by contradiction. We assume that the thrashwait algorithm can result in a deadlock. Such a deadlock must be caused by a cycle of pri- By definition, the head of a protocol arc is a transaction in its window of vulnerability, which is locked so that invalidations are deferred. The tail of a protocol arc is a transaction in its request phase, waiting for the invalidation to complete. Since a transaction in its request phase cannot be at the head of a protocol arc, protocol arcs cannot be linked together, thereby preventing a loop of protocol arcs.
Finally, the tail of a congruence arc cannot be linked to the head of a protocol arc due to another type conflict: the tail of a congruence arc must be a new transaction, while the head of a protocol arc is an existing transaction in its window of vulnerability.
Thus, deadlock loops cannot be constructed from combinations of protocol and congruence loops. The fact that congruence arcs and protocol arcs cannot combine to produce a loop contradicts the assumption that thrashwait can result in a deadlock, completing the proof.
The above proof of the deadlock-free property allows only one primary transaction to be transmitted simultaneously. In order to permit multiple functional units to issue several memory transactions at a time, the memory system must provide sufficient associativity to permit all such transactions to be launched. Also, if the memory system stalls the processor pipeline while multiple transactions are requested, then the processor must access a data word as soon as it arrives. These modifications prevent dependencies between simultaneous transactions and make sure that the window of vulnerability remains closed.
Thrashwait and High-Availability Interrupts
Despite its success in detecting thrashing in systems without high-availability interrupts, thrashwait fails to guarantee forward progress in the presence of such interrupts. This is a result of the method by which thrashwait closes the window of vulnerability: by causing the processor to spin. This corresponds to asserting the memory-hold line and freezing the pipeline. High-availability interrupts defeat this interlock by faulting the load or store in progress so that interrupt code can be executed. Viewing the execution of high-availability interrupt handlers as occurring in an independent "context" re- 
Associative Thrashwait (Partial Solution)
In an attempt to solve the problems introduced by high-availability interrupts, we supplement the thrashwait scheme with associative transaction buffers. As described in Section 4.2, transaction buffers eliminate restrictions on transaction launches. Further, instruction-data and high-availability interrupt thrashing are eliminated. This effect is produced entirely by increased associativity: since transactions are not placed in the cache during their window of vulnerability, they cannot be lost through conflict.
Thus, the associative thrashwait scheme with high-availability interrupts is only vulnerable to invalidation thrashing. The framework proposed in the next section solves this last remaining problem.
Associative Thrashlock
Now that we have analyzed the benefits and deficiencies of the components of our architectural framework, we are ready to present a hybrid approach, called associative thrashlock. This framework solves the problems inherent in each of the independent components.
Assume, for the moment, that we have a single primary transaction per context. As discussed above, thrashwait with associativity has a flaw. Once the processor has begun thrashwaiting on a particular transaction, it is unable to protect this transaction from invalidation during high-availability interrupts.
To prevent high-availability interrupts from breaking the thrashwait scheme, associative thrashlock augments associative thrashwait with a single buffer lock. This lock is invoked when the processor begins thrashwaiting, and is released when the processor completes any global access. Should the processor respond to a high-availability interrupt in the interim, the data will be protected from invalidation.
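The behavior of the single buffer lock can be sketched as follows. The sketch assumes that an invalidation arriving for a locked, filled buffer is deferred until the lock is released; the class `ThrashlockBuffers` and its methods are hypothetical names, and the transaction buffers are reduced to a dictionary from address to data.

```python
class ThrashlockBuffers:
    """Toy model of transaction buffers with one lock per processor."""

    def __init__(self):
        self.buffers = {}   # address -> data held in a transaction buffer
        self.locked = None  # the single locked address, if any
        self.deferred = []  # invalidations postponed by the lock

    def begin_thrashwait(self, addr):
        self.locked = addr  # lock invoked when thrashwaiting begins

    def fill(self, addr, data):
        self.buffers[addr] = data

    def invalidate(self, addr):
        """Return True if applied now, False if deferred by the lock."""
        if addr == self.locked and addr in self.buffers:
            self.deferred.append(addr)
            return False
        self.buffers.pop(addr, None)
        return True

    def global_access(self, addr):
        """Complete a global access; this releases the lock."""
        if addr not in self.buffers:
            return None
        data = self.buffers[addr]
        self.locked = None
        for pending in self.deferred:  # apply postponed invalidations
            self.buffers.pop(pending, None)
        self.deferred.clear()
        return data
```

Between `fill` and `global_access`, the processor may run a high-availability interrupt handler; in this model the locked data survives an `invalidate` that arrives in that interval.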
It is important to stress that this solution provides one lock per processor. The scheme avoids deadlock by requiring that all high-availability interrupt handlers:
1. make no references to global memory locations, and
2. return to the interrupted context.
These two software conventions guarantee that the processor will always return to access this buffer, and that no additional dependencies are introduced.
The transaction store is used as a small, fully-associative cache.
All contexts access the transaction store by address, and any context may access a transaction buffer with a matching address. The transaction store is completely integrated with the cache-coherence protocol; indeed, it is much like a multiprocessor victim cache [14]. Data may be transferred between transaction buffers and the cache, or the processor may access transaction buffers directly. In addition, special instructions permit the processor to initiate non-binding prefetches.
The transaction store has independent data paths to the processor, to memory, and to the network. A single module of associative match circuitry is shared by the processor, network and memory.
In addition to implementing the associative thrashlock framework, the transaction store has two additional benefits. First, since the transaction store explicitly records the state of outstanding transactions, it allows the Alewife cache-coherence protocol to be independent of network ordering.
Relaxing the constraint of in-order delivery is desirable because it permits systems to be built with networks that employ adaptive routing to avoid hotspots or bad connections.
Second, since context switching on Sparcle is a polling mechanism, contexts may retry memory accesses multiple times before the requested data word becomes available. The transaction store prevents redundant requests that could result from multiple retries by recording the state of all outstanding memory transactions. The same mechanism allows the requests from different contexts to the same cache line to be consolidated.
Signaling is less sensitive to remote access latency, but introduces additional hardware complexity. System parameters or philosophy determine whether polling, signaling, or a hybrid approach is most appropriate.
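The way a transaction store suppresses redundant requests from polled retries, and consolidates requests from different contexts, can be sketched as follows. This is an illustration only: `TransactionStore`, `request`, and `response` are hypothetical names, the store is modeled as a dictionary keyed by address, and buffer capacity is not modeled.

```python
class TransactionStore:
    """Toy model of request consolidation in a transaction store."""

    def __init__(self):
        self.pending = {}  # address -> set of contexts waiting on it
        self.sent = []     # addresses actually requested over the network

    def request(self, context, addr):
        """Return True if a network request was issued, False if consolidated."""
        if addr in self.pending:
            self.pending[addr].add(context)  # retry or second requester
            return False
        self.pending[addr] = {context}
        self.sent.append(addr)
        return True

    def response(self, addr):
        """Memory responded: return the contexts waiting on this address."""
        return self.pending.pop(addr, set())


store = TransactionStore()
store.request("A", 0x100)  # first attempt: goes to the network
store.request("A", 0x100)  # polled retry: no new request
store.request("B", 0x100)  # another context, same line: consolidated
```

After these three calls, `store.sent` holds a single entry, since only the first attempt reached the network.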
A multiprocessor could also avoid the window of vulnerability by eschewing the use of caches. In a system without caches, all memory requests could be serviced by distributed modules. By serializing transactions, memory modules would ensure both coherence and forward progress. However, such a system would have to provide extremely high bandwidth between processing nodes and memory modules in order to achieve high performance.
The associative thrashlock framework provides a solution to the window of vulnerability problem in a polled system. The framework allows the use of caches to reduce the bandwidth required from the interconnect, and permits processors to store just enough information to recreate the pipeline state of a context when necessary. Instead of closing the window of vulnerability by brute force, the Alewife architecture dynamically detects the situations that can lead to deadlock and livelock. Only when these relatively rare situations arise does the system close the window.
The fundamental architectural trade-off pits hardware expense and complexity against exceptional events that are uncommon, but potentially fatal.
