Abstract
We also compare protocol performance by running eight benchmarks on 32-processor systems. Simulations show that Dir1SW+'s performance is comparable to more complex directory protocols. The significant disparity in hardware complexity and the small difference in performance argue that Dir1SW+ may be a more effective use of resources. The small performance difference is attributable to two factors: the low degree of sharing in the benchmarks and Check-In/Check-Out (CICO) directives [9].
Keywords:
Shared-memory multiprocessors, memory systems, cache coherence, directory protocols, and hardware mechanisms.
Directory protocols are a technique used to implement cache coherence on large-scale shared-memory parallel computers [2]. Directory protocols logically associate a directory entry with each aligned block in main memory. This entry records whether the block is idle (no cached copies), one writable copy exists, or one or more read-only copies exist. We only consider write-invalidate protocols, which invalidate outstanding copies of a block in other processors when a processor wishes to write into it. To facilitate invalidations, each directory entry also contains logical pointers to some or all of the processor(s) that hold copies of the block. Agarwal et al. [2] use the notation DiriB to denote protocols that explicitly record up to i processors that share a block and rely on broadcast to invalidate copies when more than i processors share it. DiriNB denotes protocols that avoid broadcast by preventing more than i processors from sharing a block.
We examine DirnNB, Dir4B, Dir4NB, Dir1B, Dir1SW, and Dir1SW+. The Stanford DASH project and the IEEE Scalable Coherent Interface (SCI) implement DirnNB [13, 8]. DASH uses a bit vector pointing to a maximum of 16 clusters, while SCI uses a linked list whose head is stored in the directory. Dir1SW and Dir1SW+ (introduced in Section 3.6.3) count the readers so they can return the directory to idle when all readers return the block, thereby avoiding an unnecessary broadcast. Programmers or compilers can also produce more desirable sharing patterns by reasoning about the shared-memory communication in a program with the Check-In/Check-Out (CICO) programming model. Furthermore, CICO primitives also serve as memory system directives that improve performance. We review this approach, cooperative shared memory, in Section 2 [9].
Many directory protocols are complex and require considerable hardware, which reduces the attractiveness of shared-memory machines. A directory protocol's policy describes its response to program events, such as loads and stores, and the interactions among directories and caches on different processors. At the next lower level of abstraction, these policies are implemented with mechanisms (operations on directories, caches, and network interfaces) such as updating a directory pointer, replacing a cache block, and sending a point-to-point message. Describing a directory protocol at the mechanism level exposes disparities in protocol complexity that are not apparent at the policy level. Most protocols, for example, have policy transitions from many readers to one writer and from one writer to another writer. The shared-to-exclusive transition is more complex to implement than the exclusive-to-exclusive transition, and the difference becomes clear at the mechanism level. Most systems synthesize the shared-to-exclusive transition by sending a sequence of invalidate and acknowledgement messages. An implementation must (a) sequence through a large number of message sends, (b) count the acknowledgements, (c) ensure concurrent requests to the same directory entry are serialized, and (d) guarantee that the interaction of these messages with messages for the node's processor, cache, and other directory entries cannot cause network deadlock. By contrast, for an exclusive-to-exclusive transition, the directory only sends a single invalidation, which greatly simplifies these considerations.
The first contribution of this paper is to explore the complexity of Dir1B, Dir4B, Dir4NB, DirnNB [2], and Dir1SW [9] at the mechanism level of abstraction (Section 3). The mechanisms and mechanism sequencing of Dir1SW are significantly simpler than those of the other protocols because the shared-to-exclusive transition is not handled by hardware (MIT LimitLESS is more complex than Dir1SW, but much simpler than the other protocols). Dir1SW's mechanisms can also be used to implement a protocol with higher performance. We call the best extended protocol, described in Section 3.6.3, Dir1SW+.

The second contribution of this paper is a comparison of directory protocol performance that extends previous work in three ways (Section 4). First, our results come from executing billions, not millions, of instructions. Second, we evaluate performance with execution time, not event counts. Third, we present results for Dir1B, Dir4NB, Dir4B, DirnNB, Dir1SW, and Dir1SW+ together. Our simulations show that Dir1SW+'s performance is similar to more complex directory protocols for seven of eight benchmarks on a system of 32 processors (DirnNB performs better on mp3d due to unscalable, unsynchronized sharing). If this result holds for other applications and larger systems, the significant disparity in hardware complexity and the small difference in performance argue that Dir1SW+ may be a more effective use of resources. The small performance difference between Dir1SW+ and the more complex protocols is attributable to two factors. First, as Weber and Gupta's measurements show, the number of outstanding shared copies is typically close to one and rarely much greater [17]. This small amount of sharing means that directory protocols that track many outstanding copies provide functionality that is not fully utilized, and their additional hardware rarely improves performance. Second, CICO memory system directives reduce sharing even further.
After the principal results in Sections 3 and 4, Section 5 discusses the implications of technology trends and directions for future work, while Section 6 draws conclusions.

Cooperative Shared Memory

The CICO programming model gives programmers a way (1) to reason about the communication caused by shared-memory references and (2) to pass performance directives to the memory system. Neither the programming model nor the directives are specific to Dir1SW. Elsewhere, we demonstrate that the annotations can be used to improve program performance by increasing cache reuse and reducing program sharing [12]. This paper examines the effect on directory protocol behavior of using CICO annotations as memory system directives. We do not discuss the cooperative prefetch mechanism.
In CICO, programmers bracket uses of shared data with a check_out annotation marking the expected first use and a check_in annotation terminating the expected use of the data. In programs conforming to the model, processors coordinate access to exclusive (writable) cache blocks to avoid expensive invalidations. The primary effect of using CICO annotations as memory system directives is to have check_in's flush cache blocks back to memory.
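To make the annotations concrete, the sketch below shows how a program conforming to the model might bracket its use of a shared datum. The annotation names (check_out_X, check_out_S, check_in) and their rendering as stubbed C functions are assumptions made for illustration; the paper does not fix a particular syntax here.

```c
#include <stdint.h>

/* Hypothetical CICO annotations, stubbed out here; in a real system they
 * would issue Msg_Get/Msg_Put requests or expand to memory-system hints. */
static void check_out_X(volatile void *addr) { (void)addr; } /* expect exclusive (writable) use  */
static void check_out_S(volatile void *addr) { (void)addr; } /* expect shared (read-only) use    */
static void check_in(volatile void *addr)    { (void)addr; } /* done: flush block back to memory */

static volatile int64_t shared_counter;      /* shared datum, assumed block-aligned */

void update_counter(int64_t delta)
{
    check_out_X(&shared_counter);   /* expected first use: ask for an exclusive copy  */
    shared_counter += delta;        /* conforming accesses run at full hardware speed */
    check_in(&shared_counter);      /* terminate use: block returns to the directory  */
}

int64_t read_counter(void)
{
    check_out_S(&shared_counter);   /* read-only use: a shared copy suffices */
    int64_t v = shared_counter;
    check_in(&shared_counter);
    return v;
}
```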
The base Dir1SW protocol associates two state bits, a trap bit, and a pointer/counter with each block in memory. A directory entry can be in one of three states: Dir_X, Dir_S, and Dir_Idle. State Dir_X implies that the directory has given an exclusive copy of the block to the processor pointed to by the pointer/counter. State Dir_S implies that the directory has given out N shared copies, where N is the number in the pointer/counter. State Dir_Idle implies that the directory owns the only valid copy of the block. Figure 1 illustrates state transitions for the base Dir1SW protocol. Msg_GetX (Msg_GetS, respectively) is a message to the directory requesting an exclusive (shared) copy of a block. Msg_Put is a message relinquishing a copy. Processors send a Msg_GetX (Msg_GetS) message when a local program references a block that is not in the local cache or performs an explicit check_out. In the common case, a directory responds by sending the data. A processor sends a Msg_Put message on an explicit check_in or a cache replacement of the block.
Several state transitions in Figure 1 set a trap bit and trap to a software trap handler running on the directory processor (not the requesting processor), as in MIT LimitLESS [4]. The trap bit serializes traps from multiple references to a block. The software trap handler reads directory entries from the hardware, sends explicit messages to other processors to complete the request that trapped, and then restarts the program that faulted. Traps only occur on memory accesses that violate the CICO model, so programs conforming to this model run at full hardware speed. Traps on blocks in state Dir_X interact with one processor, while traps in state Dir_S must broadcast to recall all read-only copies. While broadcast cannot be used in infinitely large systems, it is acceptable in finite systems if the frequency of broadcast times the cost of a broadcast is small.
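The sketch below renders the common-case hardware transitions just described as C code. The state and message encodings, the field widths, and the decision to omit the data-reply path are assumptions made for illustration only; transitions that violate CICO set the trap bit and are completed by the software handler.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { DIR_IDLE, DIR_S, DIR_X } dir_state_t;   /* two state bits */
typedef enum { MSG_GETX, MSG_GETS, MSG_PUT } msg_t;

typedef struct {
    dir_state_t state;
    bool        trap;       /* trap bit: serializes traps on this block    */
    uint16_t    ptr_cnt;    /* pointer (Dir_X) or count of readers (Dir_S) */
} dir_entry_t;

/* Handle one incoming message in hardware.  Returns true if handled; false
 * means the trap bit was set and the software trap handler must finish the
 * request.  Sending the data reply to the requester is omitted here. */
bool dir_handle(dir_entry_t *e, msg_t msg, uint16_t src)
{
    switch (e->state) {
    case DIR_IDLE:
        if (msg == MSG_GETX) { e->state = DIR_X; e->ptr_cnt = src; return true; }
        if (msg == MSG_GETS) { e->state = DIR_S; e->ptr_cnt = 1;   return true; }
        break;
    case DIR_S:
        if (msg == MSG_GETS) { e->ptr_cnt++; return true; }   /* another reader     */
        if (msg == MSG_PUT) {                                  /* a reader checks in */
            if (--e->ptr_cnt == 0) e->state = DIR_IDLE;
            return true;
        }
        break;  /* Msg_GetX while shared: trap; handler recalls all read-only copies */
    case DIR_X:
        if (msg == MSG_PUT && e->ptr_cnt == src) {             /* owner checks in    */
            e->state = DIR_IDLE;
            return true;
        }
        break;  /* request while exclusive: trap; handler interacts with one processor */
    }
    e->trap = true;     /* CICO violation: invoke the software trap handler */
    return false;
}
```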
Directory Mechanisms
The hardware base of cache-coherent shared memory is similar to a message-passing machine. Each processor node contains a microprocessor, a cache, and a memory module. Nodes are connected with a fast point-to-point network. Shared memory differs because each memory module is addressed in a global address space and each processor node contains additional hardware to implement a directory protocol. Moreover, many directory protocols are complex and require considerable hardware, which reduces the attractiveness of shared-memory machines.
A directory protocol can be decomposed into three levels of abstraction. Policy describes the protocol's response to program events, such as loads and stores, and the interactions among directories and caches on different processors. At the next lower level of abstraction, policies are implemented with mechanisms (operations on directories, caches, and network interfaces) such as updating a directory pointer, replacing a cache block, and sending a point-to-point message. Mechanisms are further decomposed into primitive operations on a particular hardware implementation, which is the lowest level of abstraction. For example, Dir1SW requires a mechanism to increment the directory's pointer/counter and has a policy of incrementing this counter on a Msg_GetS message that finds a block in state Dir_S. This mechanism may, in turn, be implemented as an atomic sequence of primitive hardware operations that read, add one to, and write the counter.
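As a concrete rendering of this three-level decomposition, the increment mechanism might expand into the primitive sequence below (reusing dir_entry_t from the earlier sketch; the hw_* primitive names are purely illustrative, not the actual datapath interface).

```c
/* Primitive operations assumed to be supplied by the directory datapath. */
static uint16_t hw_read_ptr_cnt(const dir_entry_t *e)        { return e->ptr_cnt; }
static void     hw_write_ptr_cnt(dir_entry_t *e, uint16_t v) { e->ptr_cnt = v; }

/* Mechanism: increment the pointer/counter.  Policy invokes it when a
 * Msg_GetS finds the block in state Dir_S.  The read-add-write sequence
 * must appear atomic with respect to other operations on this entry. */
static void increment_counter(dir_entry_t *e)
{
    uint16_t v = hw_read_ptr_cnt(e);  /* primitive: read the field */
    v = v + 1;                        /* primitive: add one (ALU)  */
    hw_write_ptr_cnt(e, v);           /* primitive: write it back  */
}
```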
Policy and mechanisms can be implemented in either hardware or software. Most directory protocols implement both policy and mechanisms in hardware. However, both LimitLESS [4] and Dir1SW [9] implement policy with a combination of software and hardware.
Previous work has concentrated on developing new protocols, that is, policies. This section focuses, instead, on the mechanisms required to implement these protocols. Describing a directory protocol at the mechanism level exposes disparities in protocol complexity that are not apparent at the policy level. Most protocols, for example, have policy transitions from many readers to one writer and from one writer to another. When examined at the mechanism level, the shared-to-exclusive transition is clearly harder to implement than the exclusive-to-exclusive transition. Most systems synthesize the shared-to-exclusive transition by sending a sequence of invalidate and acknowledgement messages. An implementation must (a) sequence through a large number of message sends, (b) count the acknowledgements, (c) ensure concurrent requests to the same directory entry are serialized, and (d) guarantee that the interaction of these messages with messages for the node's processor, cache, and other directory entries cannot cause network deadlock. By contrast, for an exclusive-to-exclusive transition, the directory only sends a single invalidation, which greatly simplifies these considerations. By examining protocols' mechanisms, we can compare the cost and complexity of implementing different protocols and explore the appropriate boundary between hardware and software.
Message-Passing Hardware
All parallel machines provide message-passing mechanisms. Message-passing machines, such as the Intel Paragon, simply expose these mechanisms directly to the programmer. Shared-memory machines, such as the Stanford DASH and the Kendall Square KSR1, use these mechanisms to implement shared memory but hide the underlying mechanisms from the programmer. We believe that future shared-memory systems will expose the underlying message passing, as done in MIT Alewife [4]. Some statically-partitionable codes achieve maximum performance through explicit message passing. Agarwal et al. have demonstrated that other codes achieve better performance with a combination of shared memory and message passing than by using one or the other alone [11].
Consequently, we assume the base hardware includes support to explicitly send and receive messages. Messages contain a 4-bit message type and are sent to an explicitly-specified destination node p. The messages are large enough to contain at least one cache block and an address. The network interface is memory-mapped and resides on the memory bus. A limited DMA capability allows contiguous data to be fetched (stored) directly from (to) memory. When a message arrives at a destination node, it can either wait for an explicit receive operation (i.e., polling) or interrupt the processor and invoke a software trap handler.
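The sketch below shows how node software might drive such an interface. The register layout, addresses, and names are invented for illustration and do not describe any particular machine; only the capabilities listed above (typed messages, an address, block-sized payloads via DMA, and a polled receive path) are taken from the text.

```c
#include <stdint.h>

/* Hypothetical memory-mapped network-interface registers (illustrative only). */
#define NI_BASE        0xF0000000u
#define NI_REG(off)    (*(volatile uint32_t *)(uintptr_t)(NI_BASE + (off)))
#define NI_SEND_DEST   NI_REG(0x00)   /* destination node p                   */
#define NI_SEND_TYPE   NI_REG(0x04)   /* small message-type field             */
#define NI_SEND_ADDR   NI_REG(0x08)   /* block address carried by the message */
#define NI_SEND_DMA    NI_REG(0x0C)   /* source of block data (DMA fetch)     */
#define NI_SEND_GO     NI_REG(0x10)   /* write 1 to launch the message        */
#define NI_RECV_STATUS NI_REG(0x14)   /* nonzero: a message is waiting        */

/* Send one message carrying a cache block; the DMA engine fetches the
 * block's contents directly from memory. */
void ni_send(uint32_t dest, uint32_t type, uint32_t addr, const void *block)
{
    NI_SEND_DEST = dest;
    NI_SEND_TYPE = type;
    NI_SEND_ADDR = addr;
    NI_SEND_DMA  = (uint32_t)(uintptr_t)block;
    NI_SEND_GO   = 1;
}

/* Polled receive path; an implementation could instead interrupt the
 * processor and invoke a software trap handler. */
int ni_message_waiting(void)
{
    return NI_RECV_STATUS != 0;
}
```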
The network interface and routers constitute a significant fraction of a parallel machine's design. Our focus in this paper is supporting shared memory without greatly increasing the overall design effort.
General Directory Mechanisms
This section identifies the primary directory mechanisms needed to implement other protocols: DiriB, DiriNB, and DirnNB (collectively called DiriX). Rather than formally describing the protocols, we abstract these mechanisms from several recently proposed machines that use these protocols or minor variants of them. Where the published literature lacks details, we made reasonable design choices. We also concentrate exclusively on directory mechanisms, since these protocols require identical cache mechanisms.
The DiriX directory protocols require numerous additions to the underlying message-passing mechanisms, as Table 1 illustrates. The fundamental change is that some messages, based on the message type, invoke directory operations. The basic directory mechanisms are:
1. Send a single point-to-point message from a directory controller to a processor cache controller.
2. Read/write a pointer field.
3. Increment/decrement/zero a counter.
4. Test for counter equal to zero.
When DiriX protocols send invalidation messages, they must keep track of acknowledgements in order to maintain sequential consistency (or weaker models). Although a counter is not strictly required (one could invalidate a pointer at each acknowledgement and test for no valid pointers), a counter is far easier to implement.
In general, DiriX protocols also need the following mechanisms:
1. Identify valid pointer fields.
2. Compare pointer fields against a node ID.
3. Sequence through the pointers.
DiriNB protocols, i < n, use a replacement policy to select a victim when the (i+1)st shared copy is requested. This policy, in turn, requires an additional mechanism.
The mechanisms for DirnNB protocols are slightly different because they can employ bit vectors instead of explicit pointers:
1. Decode node ID and test/set/clear bit in vector.
2. Sequence through bit vector.
All DiriX protocols for i > 1 require the ability to sequence through either a set of pointers or a bit vector and send multiple invalidations.
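To make the sequencing burden concrete, here is a sketch of the shared-to-exclusive transition for a pointer-based DiriX protocol. The entry layout and helper routines are assumptions for illustration, and the sketch deliberately glosses over the hard parts discussed later under Design Cost: serializing concurrent requests to the entry and avoiding network deadlock while the acknowledgements are outstanding.

```c
#include <stdint.h>

#define MAX_PTRS 32   /* i explicit pointers (DirnNB would use a bit vector instead) */

typedef struct {
    int      num_valid;               /* number of valid pointer fields     */
    uint16_t ptrs[MAX_PTRS];          /* node IDs holding read-only copies  */
    int      acks_pending;            /* acknowledgements still outstanding */
} dirix_entry_t;

/* Assumed network-interface helper, stubbed here for illustration. */
static void send_invalidate(uint16_t node, uint32_t block_addr)
{
    (void)node; (void)block_addr;     /* would format and launch one message */
}

/* Shared-to-exclusive transition: sequence through the pointers, send one
 * invalidation per reader, and remember how many acknowledgements to expect.
 * A real controller must also serialize concurrent requests to this entry and
 * guarantee that waiting for acknowledgements cannot deadlock the network. */
void begin_shared_to_exclusive(dirix_entry_t *e, uint32_t block_addr)
{
    e->acks_pending = e->num_valid;
    for (int i = 0; i < e->num_valid; i++)
        send_invalidate(e->ptrs[i], block_addr);
}

/* Called on each acknowledgement; the exclusive copy can be granted (and the
 * single new owner recorded) once the count reaches zero. */
int ack_received(dirix_entry_t *e)
{
    return --e->acks_pending == 0;
}
```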
Dir1SW Mechanisms
The Dir1SW column of Table 1 lists the subset of directory mechanisms required by Dir1SW. Dir1SW requires mechanisms to update state, send a single message, and test and update a single pointer/counter field. However, because Dir1SW has only a single pointer/counter field, it does not need the sequencing logic used by DiriX (i > 1). Similarly, Dir1SW sends at most one message in response to an incoming request; protocol transitions requiring multiple messages are handled by software.
Design Cost
In our view, the ultimate measure of directory protocol complexity is design cost: how long a protocol takes to implement. Unfortunately, differences in design teams, tools, and project goals prevent any concrete comparison of design cost.
For this reason, this section considers indirect measures of design cost that arise from sequencing directory mechanisms. A key goal of Dir1SW was to reduce the cost and complexity of shared-memory hardware by using a protocol whose most frequent policy transitions can be implemented with simple, short sequences of mechanisms (e.g., a single invalidate message). More complex sequencing, involving many messages, is done by system software (trap handlers). Avoiding complex hardware sequencing eliminates the complexity that arises from transient states, from serializing new policy requests, and from avoiding network deadlock.
One indirect measure of protocol complexity that has some value is the number of state/event pairs that must be handled in hardware, where events can be messages or processor actions (e.g., loads and stores). This measure is useful because it quantifies the number of cases that the designer must consider and test for correctness. By this measure, Dir1SW is fundamentally simpler than any of the DiriX protocols (with the exception of Dir1NB) because it does not require sequencing, sends at most one message in response to any message, and requires only a simple datapath. Since much of this simplicity comes from pushing the complexity into software trap handlers, other hardware/software protocols, such as LimitLESS, share this advantage.
All DiriX protocols for i > 1 require the ability to sequence through either a set of pointers or a bit vector and send multiple invalidations. To implement this mechanism as an atomic sequence, all invalidations must be sent before receiving any other messages. Unfortunately, deadlock avoidance then becomes a major consideration. If the maximum number of messages is bounded by a small constant, as in DiriNB, deadlock can be avoided with sufficient output buffering. The directory controller simply waits until its output FIFO has room for i messages before sending the first. However, this is not a scalable solution for protocols that may send large numbers of messages, such as DiriB and DirnNB, since the maximum number of messages is proportional to system size.
(This solution can be used for any system with a fixed maximum size, provided each node has output buffering at least as large as this size.)
The alternative is to make this mechanism non-atomic and process incoming messages between sends. This facilitates deadlock avoidance; however, the sequencer's state becomes an additional, transient part of the cache block's state, greatly increasing the number of state/message interactions. In addition, multiple cache blocks may need to be sequenced simultaneously (in order to avoid deadlock), requiring some form of preemptive scheduling. Although this complexity can be managed, architects must expend considerable effort designing, building, and testing complex hardware rather than improving the performance of simpler hardware.
Manufacturing Cost
Comparing the manufacturing cost of mechanisms is relatively straightforward. Manufacturing cost is ultimately measured in dollars, but is commonly estimated with other measures such as transistor count, bits of memory, datapath width, etc. For directory protocols, the dominant cost is memory overhead: the number of bits of state stored per block of memory. All protocols need a small number of bits (e.g., 3 or 4) to represent the block's state. The DiriX protocols other than DirnNB require i pointers of log2 n bits each; DirnNB protocols require n bits. By contrast, Dir1SW requires only one log2 n-bit pointer/counter field. Consider a system that supports up to 1024 nodes and has 32-byte cache blocks. If we assume 4 bits can describe the state of each block, then Dir4NB incurs a 17% memory overhead (44 bits/256 bits), DirnNB incurs a 402% overhead, while Dir1SW incurs only a 5% overhead.
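For reference, the overhead arithmetic behind these figures, under the stated assumptions (1024 nodes, 32-byte blocks, 4 state bits, hence 10-bit node pointers), works out as follows:

\[
\begin{aligned}
\mathrm{Dir_4NB}:&\quad (4 + 4\log_2 1024)/256 \;=\; 44/256 \;\approx\; 17\%\\
\mathrm{Dir_nNB}:&\quad (4 + 1024)/256 \;=\; 1028/256 \;\approx\; 402\%\\
\mathrm{Dir_1SW}:&\quad (4 + \log_2 1024)/256 \;=\; 14/256 \;\approx\; 5\%
\end{aligned}
\]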
After memory, the next greatest cost is the directory datapath. For the DiriX protocols other than DirnNB, the comparison of a node ID with i pointer fields requires either a wide datapath with i comparators or a sequential search. A DirnNB implementation will require an n-bit datapath and priority decoder. By contrast, the Dir1SW state machine requires only a log2 n-bit datapath with the ability to increment, decrement, test for zero, and select the ALU result, the message source ID, or a small constant for writing into the pointer/counter.
The absence of sequencing in the Dir1SW mechanisms also allows a regular structure: in response to each message, the state associated with the cache block is read, modified, and written back, and optionally a single message is sent. Beyond its inherent simplicity, this regularity leads naturally to a pipelined implementation with increased throughput. While other schemes can also be pipelined, as, for example, in the Stanford DASH [14], the increased datapath complexity requires additional designer time that could otherwise be spent elsewhere.
Dir1SW+NPT: No Pairwise Traps
The base Dir1SW protocol traps to software whenever a CICO violation occurs, that is, whenever the directory receiving a Msg_Get message cannot immediately respond with the requested data. However, the Dir1SW mechanisms permit directory hardware to send a single message to an arbitrary processor in response to a message from another processor. The NPT extension modifies the hardware policy to directly send an invalidation message and then forward the block to the requesting processor when the Msg_Put message arrives. This extension moves a common, but more complex, policy from software to hardware, which may reduce execution time.
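One possible rendering of this policy change, building on the dir_handle() sketch above, is shown below. The handling of the transient (rewriting the pointer field to remember the requester while the invalidation is outstanding) and the helper routines are assumptions made for illustration, not details taken from the paper.

```c
/* Assumed helpers, stubbed for illustration. */
static void send_invalidate_to(uint16_t node) { (void)node; }  /* recall the exclusive copy  */
static void forward_block(uint16_t node)      { (void)node; }  /* send the data to this node */

/* NPT extension to dir_handle() above: a Msg_GetX that finds the block
 * exclusive no longer traps.  One possible encoding of the transient is to
 * rewrite the pointer field to the requester once the invalidation is sent;
 * the old owner's Msg_Put then triggers the forward.  This encoding is an
 * assumption, not the paper's implementation. */
bool dir_handle_npt(dir_entry_t *e, msg_t msg, uint16_t src)
{
    if (e->state == DIR_X && msg == MSG_GETX) {
        send_invalidate_to(e->ptr_cnt); /* hardware recalls the current owner's copy */
        e->ptr_cnt = src;               /* remember the requester; block stays Dir_X */
        return true;                    /* no pairwise trap taken                    */
    }
    if (e->state == DIR_X && msg == MSG_PUT && e->ptr_cnt != src) {
        forward_block(e->ptr_cnt);      /* old owner checked in: forward data onward */
        return true;                    /* block remains exclusive at the requester  */
    }
    return dir_handle(e, msg, src);     /* all other cases as in the base protocol   */
}
```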
Dir1SW+RO1: One Shared Copy
The Dir1SW mechanisms permit a protocol to maintain either a pointer to a processor node or a counter.
The base Dir1SW protocol maintains a pointer for exclusive copies and a count of shared copies. However, many of the shared-to-exclusive state transitions occur when only a single shared copy is outstanding: over 50% for 6 of the 8 applications, and over 85% for 4 of them. An obvious extension of Dir1SW is to add a new state, Dir_S_one, that maintains a pointer to a single shared copy. The benefit of this change is that it reduces the number of traps that broadcast an invalidate to all processors.
Dir1SW+
This extension, called Dir1SW+, combines the changes from Sections 3.6.1 and 3.6.2. In this protocol, the trap handler is only invoked to broadcast an invalidation for a block that was shared by more than one processor.
Directory Performance
This section presents our experimental methods, compares the performance of Dir1SW variants, and compares the best Dir1SW variant with alternative protocols.
Methods
The measurements in this paper came from the eight explicitly parallel programs listed in Table 2 running on the Wisconsin Wind Tunnel (WWT), a virtual prototype for cache-coherent, shared-memory computers [15]. WWT runs parallel shared-memory programs on a parallel message-passing computer (a Thinking Machines CM-5) and uses a distributed, discrete-event simulation to concurrently calculate the programs' execution times on a proposed target machine.
The simulated parallel computer (the target system) used in this paper consists of 32 processor nodes. Requests at each node's directory are serviced in first-come-first-serve order, and queuing delay is included in the cost of a cache miss. Network topology and contention are ignored, and all messages are assumed to have a fixed latency. Table 3 lists the basic system parameter values.
Benchmarks

Dir1SW Variants
This section discusses the performance of Dir1SW variants on 32 processors for the benchmarks in Table 2. The variants are: no-pairwise-traps (NPT), one-shared-copy (RO1), and CICO directives. We examine all eight combinations of these extensions; Figure 2 presents execution times normalized to base Dir1SW. Since the normalized execution times are less than 1.0, the extensions all improve performance relative to base Dir1SW. However, the extensions affect the benchmarks by varying amounts. They matter little to water and tomcatv, for example, because both perform little communication relative to their computation. (The same effect is apparent for the other protocols compared in Figure 3 of Section 4.3.) For mp3d, however, NPT and CICO reduce execution time by 52% and 21%, respectively, by mitigating the effect of this program's unsynchronized sharing in its cell data structure.
To get more insight from the many numbers in Figure 2, we use an analysis of variance to characterize mean behavior. This aggregation is meaningful only if the eight benchmarks are representative of some interesting workload. Table 4 reports results, and the table's caption describes the analysis of variance method in more detail. The results show that most of the variation between runs is caused by the benchmarks themselves. Nevertheless, NPT and CICO caused statistically significant variation. The mean relative improvement from NPT was 16%, while CICO yielded 8%.

Using CICO primitives as memory system directives affects sharing behavior and improves performance. Table 5 examines the effect on sharing behavior of using CICO check_in's to flush cache blocks (rather than allowing them to be replaced or invalidated). (We also examined check_out's but found their effect to be small.) A check_in improves performance if it enables another processor to find a block at the directory instead of requiring additional messages to be sent to other processors. The results show that check_in reduces the frequency of indirections by 45%-100%. A check_in hurts performance if the same processor is the next user of the block, which we found to occur in 6%-65% of the check_in's.

Together, NPT and CICO ran programs 19% faster, implying NPT makes CICO less important. With NPT, CICO has a more modest impact on indirections to previously exclusive blocks (e.g., migratory data). Without NPT or CICO, migrating a block costs four network traversals and two traps. Adding NPT or CICO eliminates the traps, while CICO also reduces the network traversals to two. Thus, at best, adding CICO to NPT improves performance by a factor of two. In practice, the effect is much smaller, because programs do not spend much time migrating data.
Finally, we would like to estimate how the effects of NPT and CICO vary from benchmark to benchmark. To do this, we calculate 90% confidence intervals assuming the residuals (the performance not explained by average effects) are normally distributed with mean zero. This calculation, explained further in the caption of Table 4, yields [-22%, -10%] for NPT and [-14%, -2%] for CICO. With eight benchmarks and not-exactly-normally distributed residuals, our confidence intervals are best taken with a grain of salt.
In summary, NPT and CICO improve performance of almost all programs, while RO1 helps a little. Since NPT and RO1 use the same mechanisms as the base protocol, we incorporated them into a new protocol called Dir1SW+.
Comparison to Other Protocols
This section compares Dir1SW+ without and with CICO, denoted Dir1SW+ and Dir1SW+/CICO, against several other protocols. Figure 3 displays normalized execution time for the eight benchmarks running on 32 processors under several protocols. The principal result is that Dir1SW+ and Dir1SW+/CICO perform comparably to Dir32NB (well within 10%), even when Dir32NB uses CICO, except for mp3d with its unscalable, unsynchronized sharing. The data also show that Dir1NB is an unstable protocol, at least when no special mechanism handles read-only data.
These conclusions do not seem to be sensitive to the key system parameters of network and directory latency. (We also measured runs with 64 processors, but do not report those results because they did not differ qualitatively.) Table 6 shows the normalized execution time results from varying interconnection network latency from 100 processor cycles (the default) to 400 cycles. A 400-cycle network slowed all protocols by about a factor of two, but it has little effect on the performance difference between Dir32NB and Dir1SW+.
Increasing the latency of a directory operation to 100 cycles approximates the effect of using an auxiliary processor, rather than a finite state machine, to perform directory operations. Increasing the directory cost from 10 to 100 cycles slowed the benchmarks by an average of 40% with no obvious trends favoring one protocol over another. Finally, we looked at performance with larger values for both network and directory latency. With the slower network, increasing directory latency only decreased performance slightly (15%).
Discussion
While quantitative results are useful, it is important to step back and look at what they mean. The data show that:

- Memory system directives, such as CICO, can alter program behavior to make simple directory hardware more attractive.

- Elucidating the mechanisms underlying a coherence protocol can lead to new protocols that perform better without significantly increasing implementation complexity.

- For the system assumptions and benchmarks, most protocols performed similarly. The significant disparity in hardware complexity and the small difference in performance argue that Dir1SW+ may be a more effective use of resources.
Although our results have immediate import, they also apply to future computers. These machines are moving toward large-scale (≥ 1K processors) systems of fast microprocessors (≥ 1 GIPS). The network latencies of these machines (measured in processor cycles) will be much larger than in today's machines. The data in Table 6 for 400-cycle network latency show that larger networks do not affect Dir1SW+ more than other protocols such as DirnNB (assuming that programs infrequently cause broadcasts).
A perhaps more important implication of the data is that performance in machines with long network latencies is not sensitive to directory latency. This suggests that moving protocol sequencing to software running on a node's main processor, an auxiliary processor (as in the Intel Paragon), or a processor in the network interface may be practical [1]. The obvious drawback of this approach is that a processor sequences a protocol more slowly than a hardware finite state machine. A secondary drawback is that slower directories increase directory contention. The data show that increasing directory latency from 10 to 100 cycles degrades execution time by 15%. This degradation can be mitigated or reversed by reducing directory contention (e.g., with greater interleaving) and by using protocols that send fewer messages.
On the other hand, software sequencing offers many advantages and opportunities:
- System design time can be reduced because less hardware must be designed. In addition, field upgrades of protocols are possible. Thus, the design time and hardware for shared-memory machines could be similar to message-passing computers.

- Collecting information for performance monitoring is much easier.
Regrettably, we leave evaluation of these ideas to future work. Our benchmarks were written for small-scale systems. Running these programs on more than 32 or 64 processors exposes bottlenecks and yields poor speedups. We plan to use the CICO programming model [12] to construct programs that manage communication more effectively and use these programs to evaluate these ideas.
Conclusions
Shared memory offers many advantages, such as a uniform address space and referential transparency, that are difficult to replicate in today's massively-parallel, message-passing computers. The key to effective, scalable, shared-memory parallel computers is to address the software and hardware issues together. This paper explored the complexity of implementing directory protocols by examining their mechanisms: primitive operations on directories, caches, and network interfaces. We compared Dir1B, Dir4B, Dir4NB, DirnNB, Dir1SW, and Dir1SW+, and found that Dir1SW+'s performance is comparable to that of the more complex protocols. The small performance difference between Dir1SW+ and the more complex protocols is attributable to two factors: the small degree of sharing in programs and CICO directives. The significant disparity in hardware complexity and the small difference in performance argue that Dir1SW+ may be a more effective use of resources.
As network latencies increase, the performance effect of directory operation overhead decreases, which provides the opportunity to sequence directory operations in a processor rather than a state machine. This change, in turn, permits high-level directory operations that have the potential to hide more of the increased communication cost. Evaluating these alternatives for kiloprocessor systems will require new benchmarks and an evaluation platform that simulates more processors than current machines contain.
