We also compare protocol performance by running eight benchmarks on 32-processor systems. Simulations show that Dir1SW+'s performance is comparable to that of more complex directory protocols.
The significant disparity in hardware complexity and the small difference in performance argue that Dir1SW+ may be a more effective use of resources. The small performance difference is attributable to two factors: the low degree of sharing in the benchmarks and Check-In/Check-Out (CICO) directives [9].

2. Compare pointer fields against a node ID.
3. Sequence through the pointers.
DiriNB protocols, i < n, use a replacement policy to select a victim when the (i + 1)st shared copy is requested. This policy, in turn, requires an additional mechanism.
The mechanisms for DirnNB protocols are slightly different because they can employ bit vectors instead of explicit pointers.
1. Decode node ID and test/set/clear bit in vector.
2. Sequence through bit vector.
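The two bit-vector mechanisms listed above can be illustrated with a minimal sketch. The class name, method names, and the node IDs in the usage lines are hypothetical and chosen only for illustration; the sketch models the operations in software and says nothing about how the hardware would implement them.

```python
class BitVectorEntry:
    """Hypothetical DirnNB-style directory entry: one presence bit per node."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        self.bits = 0  # bit k set means node k holds a copy

    def set(self, node_id: int) -> None:
        # Decode the node ID into a one-hot mask and set that bit.
        self.bits |= 1 << node_id

    def clear(self, node_id: int) -> None:
        # Decode the node ID and clear that bit.
        self.bits &= ~(1 << node_id)

    def test(self, node_id: int) -> bool:
        # Decode the node ID and test that bit.
        return bool((self.bits >> node_id) & 1)

    def sharers(self):
        """Sequence through the vector, yielding each node whose bit is set."""
        for node_id in range(self.num_nodes):
            if self.test(node_id):
                yield node_id


entry = BitVectorEntry(num_nodes=32)
for node in (3, 17, 30):
    entry.set(node)
entry.clear(17)
# Nodes that would each receive an invalidation on a write request.
invalidation_targets = list(entry.sharers())
```

Sequencing through the vector is the step that generates multiple invalidations, one per set bit.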
Dir1SW Mechanisms
The Dir1SW column of ...

Since much of this simplicity comes from pushing the complexity into software trap handlers, other hardware/software protocols, such as LimitLESS, share this advantage.
All DiriX protocols for i > 1 require the ability to sequence through either a set of pointers or a bit vector and send multiple invalidations. To implement this mechanism as an atomic sequence, all invalidations must be sent before receiving any other messages. Unfortunately, deadlock avoidance then becomes a major consideration.
If the maximum number of messages is bounded by a small constant, as in DiriNB, deadlock can be avoided with sufficient output buffering.
After memory, the next greatest cost is the directory datapath.
For the DiriX protocols other than DirnNB, the comparison of a node ID with i pointer fields requires either a wide datapath with i comparators or a sequential search. A DirnNB implementation will require an n-bit datapath and priority decoder.
By contrast, the Dir1SW state machine requires only a (log2 n)-bit datapath with the ability to increment, decrement, test for zero, and select the ALU result, the message source ID, or a small constant for writing into the pointer/counter.
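As a rough illustration of how little the pointer/counter datapath has to do, the following sketch models the operations named above: increment, decrement, test for zero, and a three-way select for the value written back. The class name, method names, and select labels are assumptions made for this example, and overflow handling is deliberately left out.

```python
class PointerCounter:
    """Hypothetical software model of one pointer/counter field for n nodes."""

    def __init__(self, num_nodes: int) -> None:
        # A (log2 n)-bit field, e.g. 5 bits for 32 nodes.
        self.width = max(1, (num_nodes - 1).bit_length())
        self.mask = (1 << self.width) - 1
        self.value = 0

    def is_zero(self) -> bool:
        return self.value == 0

    def update(self, select: str, alu_op: str = "inc",
               source_id: int = 0, constant: int = 0) -> None:
        # The ALU only needs to add or subtract one.
        alu = self.value + 1 if alu_op == "inc" else self.value - 1
        # Select the ALU result, the message source ID, or a small constant.
        chosen = {"alu": alu, "source": source_id, "const": constant}[select]
        self.value = chosen & self.mask


field = PointerCounter(num_nodes=32)
field.update("source", source_id=7)   # used as a pointer to node 7
pointer_value = field.value
field.update("const", constant=0)     # reset, then use as a counter
field.update("alu", alu_op="inc")
field.update("alu", alu_op="inc")
count_after_two = field.value
field.update("alu", alu_op="dec")
field.update("alu", alu_op="dec")
```

The same field serves as either a pointer or a counter; only the selected write-back source differs.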
The absence of sequencing in the Dir1SW mechanisms also allows a regular structure: in response to each message, the state associated with the cache block is read, modified, and written back, and optionally a single message is sent. Beyond its inherent simplicity, this regularity leads naturally to a pipelined ...

The base Dir1SW protocol described above performs as well as any feasible directory coherence protocol for programs that exactly follow the CICO programming model (see Section 2). However, rigidly adhering to this model is not possible or desirable for all programs.
This section examines several extensions to the Dir1SW protocol that improve its performance for programs that do not conform precisely to CICO. With one exception, these extensions use exactly the same mechanisms as base Dir1SW and require minor changes to the policy implemented in hardware and software. The new mechanism, which is very simple, sets the counter in a directory entry to the value 1.
Dir1SW+NPT: No Pairwise Traps
The base Dir1SW protocol traps to software whenever a CICO violation occurs; that is, whenever the directory receiving a Msg_Get message cannot immediately respond with the requested data.
However, the Dir1SW mechanisms permit directory hardware to send a single message to an arbitrary processor in response to a message from another processor. The NPT extension modifies the hardware policy to directly send an invalidation message and forward the block to the requesting processor when the Msg_Put message arrives.
This extension moves a common, but more complex policy from software to hardware, which may reduce execution time.
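The difference between the base policy and NPT can be sketched as a small event handler. This is only an illustration under stated assumptions: the state names, the generic message labels ("GET", "PUT", "INV", "DATA"), and the idea of reusing the pointer to remember the waiting requester are all hypothetical, and only requests for an exclusively held block are modeled. Each event sends at most one message, in keeping with the mechanisms described earlier.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    state: str = "idle"   # "idle", "exclusive", or "pending" (assumed names)
    pointer: int = -1     # owner when exclusive; waiting requester when pending


# Stands for invoking the software trap handler; not a network message.
TRAP = ("TRAP", None)


def handle(entry, msg, source, npt):
    """Process one incoming message; return the single action taken, if any."""
    if msg == "GET":
        if entry.state == "idle":
            entry.state, entry.pointer = "exclusive", source
            return ("DATA", source)
        if entry.state == "exclusive" and npt:
            # NPT: invalidate the current owner and remember the requester.
            owner = entry.pointer
            entry.state, entry.pointer = "pending", source
            return ("INV", owner)
        return TRAP  # base policy: the directory cannot respond immediately
    if msg == "PUT":
        if entry.state == "pending":
            # The block has come back; forward it to the waiting requester.
            entry.state = "exclusive"
            return ("DATA", entry.pointer)
        if entry.state == "exclusive" and source == entry.pointer:
            entry.state, entry.pointer = "idle", -1
            return None
    return TRAP


npt_entry = DirectoryEntry()
npt_trace = [handle(npt_entry, "GET", 1, npt=True),
             handle(npt_entry, "GET", 2, npt=True),
             handle(npt_entry, "PUT", 1, npt=True)]

base_entry = DirectoryEntry()
handle(base_entry, "GET", 1, npt=False)
base_result = handle(base_entry, "GET", 2, npt=False)
```

With `npt=True` the second request is resolved entirely by the handler; with `npt=False` the same request falls through to the trap.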
Dir1SW+RO1: One Shared Copy
The Dir1SW mechanisms permit a protocol to maintain either a pointer to a processor node or a counter. The base Dir1SW protocol maintains a pointer for exclusive copies and a count of shared copies. However, many of the shared-to-exclusive state transitions occur when only a single shared copy is outstanding: over 50% for 6 of the 8 applications, and over 85%
Methods
The measurements in this paper came from the eight explicitly parallel programs listed in Table 2.

... Figure 3 of Section 4.3.) For mp3d, however, NPT and CICO reduce execution time by 52% and 21%, respectively, by mitigating the effect of this program's unsynchronized sharing in its cell data structure.

To get more insight from the many numbers in Figure 2, we use an analysis of variance to characterize ...

Using CICO primitives as memory system directives affects sharing behavior and improves performance. Table 5 examines the effect on sharing behavior of using CICO check-ins to flush cache blocks (rather than allowing them to be replaced or invalidated). A check-in improves performance if it enables another processor to find a block at the directory instead of requiring additional messages be sent to other processors. The results show that check-in reduces the frequency of indirection by 45%-100%. A check-in hurts performance if the same processor is the next user of the block, which we found to occur in 6%-65% of the check-ins.

Table 5 caption: An indirection occurs when a processor cannot obtain a block from the directory, but must send messages to one or more processors. Column "Counter-productive check-ins" gives the fraction of check-ins for which the same processor is the next user of a checked-in block.
Together, NPT and CICO ran programs 19% faster, implying NPT makes CICO less important.
With NPT, CICO has a more modest impact on indirection to previously exclusive blocks (e.g., migratory data). Without NPT or CICO, migrating a block costs four network traversals and two traps. Adding NPT or CICO eliminates the traps, while CICO also reduces the network traversals to two. Thus, at best, adding CICO to NPT improves performance by a factor of two. In practice, the effect is much smaller, because programs do not spend much time migrating data.
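The migration costs quoted above can be turned into a back-of-the-envelope model. The sketch below is illustrative only: the function name is hypothetical, the 100-cycle network latency is the default used in the simulations, and the 200-cycle trap cost is a placeholder assumption, not a measured value.

```python
def migration_cost(network_latency, trap_cost, npt, cico):
    """Cycles to migrate one block, counting only traversals and traps."""
    # CICO lets the requester find the block at the directory.
    traversals = 2 if cico else 4
    # Either NPT or CICO removes both software traps.
    traps = 0 if (npt or cico) else 2
    return traversals * network_latency + traps * trap_cost


NETWORK_LATENCY = 100   # processor cycles (default network latency)
TRAP_COST = 200         # processor cycles (hypothetical, for illustration)

baseline = migration_cost(NETWORK_LATENCY, TRAP_COST, npt=False, cico=False)
npt_only = migration_cost(NETWORK_LATENCY, TRAP_COST, npt=True, cico=False)
npt_and_cico = migration_cost(NETWORK_LATENCY, TRAP_COST, npt=True, cico=True)
best_case_speedup = npt_only / npt_and_cico
```

Because the traps are already gone with NPT, the only remaining gain from CICO is halving the traversals, which is where the factor-of-two upper bound comes from, independent of the assumed trap cost.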
Finally, we would like to estimate how the effects of NPT and CICO vary from benchmark to benchmark. To do this, we calculate 90% confidence intervals assuming the residuals (the performance not explained by average effects) are normally distributed with mean zero. This calculation, explained further in the caption of ...

... at least, when no special mechanism handles read-only data.
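One plausible reading of the confidence-interval calculation described above is sketched here. It is an assumption, not the paper's stated procedure: the spread of the residuals is estimated with mean fixed at zero, and the interval is the average effect plus or minus the two-sided 90% normal quantile times that spread. The function name, the effect value, and the residuals are hypothetical inputs chosen for illustration, not measurements.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(effect, residuals, level=0.90):
    """Interval around `effect`, assuming zero-mean normal residuals."""
    # Standard deviation with the mean fixed at zero (an assumption).
    sigma = sqrt(sum(r * r for r in residuals) / len(residuals))
    # Two-sided quantile: 0.95 for a 90% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return effect - z * sigma, effect + z * sigma


# Hypothetical residuals for eight benchmarks (fractions of execution time).
residuals = [-0.04, 0.01, 0.03, -0.02, 0.05, -0.03, 0.02, -0.02]
low, high = confidence_interval(effect=0.10, residuals=residuals)
```

A wide interval relative to the average effect would indicate that the effect varies substantially from benchmark to benchmark.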
These conclusions do not seem to be sensitive to the key system parameters of network and directory latency. (We also measured runs with 64 processors, but do not report these results because they did not differ qualitatively.) Table 6 shows the normalized execution time results from varying interconnection network latency from 100 processor cycles (the default) to 400 cycles. A 400-cycle network slowed all protocols by about a factor of two, but it has little effect on the performance difference between Dir32NB and Dir1SW+.
Increasing the latency of a directory operation to 100 cycles approximates the effect of using an auxiliary processor, rather than a finite state machine, to perform directory operations. Increasing the directory cost from 10 to 100 cycles slowed the benchmarks by an average of 40% with no obvious trends favoring one protocol over another.
Although our results have immediate import, they also apply to future computers. These machines are moving toward large-scale (~1K processors) systems of fast microprocessors (~1 GIPS). The network latencies of these machines (measured in processor cycles) will be much larger than those of today's machines. The data in Table 6 for 400-cycle network latency shows that larger networks do not affect Dir1SW more than other protocols such as DirnNB (assuming that programs infrequently cause broadcasts). A perhaps more important implication of the data is that performance in machines with long network latencies is not sensitive to directory latency. This suggests that moving protocol sequencing to software running on a node's main processor, an auxiliary processor (as in the Intel Paragon), or a processor in the network interface may be practical [1]. The obvious drawback of this approach is that a processor sequences a protocol more slowly than a hardware finite state machine.
A secondary drawback is that slower directories increase directory contention. The data shows that increasing directory latency from 10 to 100 cycles degrades execution time by 15%. This degradation can be mitigated or reversed by reducing directory contention (e.g., with greater interleaving) and by using protocols that send fewer messages.
On the other hand, software sequencing offers many advantages and opportunities:
• System design time can be reduced because less hardware must be designed. In addition, field upgrades of protocols are possible. Thus, the design time and hardware for shared-memory machines could be similar to message-passing com-
