Abstract: In high-performance general-purpose workstations and servers, the workload typically consists of both sequential and parallel applications. Shared-bus shared-memory multiprocessors can be used to speed up the execution of such workloads. In this environment, the scheduler takes care of load balancing by allocating a ready process to the first available processor, thus producing process migration. Process migration and the persistence of private data in different caches produce an undesired form of sharing, named passive sharing. The copies due to passive sharing produce useless coherence traffic on the bus, and coping with this problem may represent a challenging design problem for these machines. Many protocols use smart solutions to limit the overhead of maintaining coherence among shared copies. None of these studies treats passive sharing directly, although some indirect effect is present while dealing with the other kinds of sharing. Affinity scheduling can alleviate this problem, but this technique does not adapt to all load conditions, especially when the effects of migration are massive. We present a simple coherence protocol that eliminates passive sharing using information from the compiler that is normally available in operating system kernels. We evaluate the performance of this protocol and compare it against other solutions proposed in the literature by means of enhanced trace-driven simulation. We evaluate the complexity in terms of the number of protocol states, additional bus lines, and required software support. Our protocol further limits the coherence-maintaining overhead by using information about access patterns to shared data exhibited in parallel applications.
INTRODUCTION
Shared-bus shared-memory multiprocessors are a suitable platform to speed up the execution of general-purpose workloads, thus overcoming the intrinsic limitations of uniprocessor systems [61]. The shared-bus architecture is the straightforward approach to connecting several processors having private caches by means of a simple interconnection network. Coherency units have to be added in order to maintain a consistent view of the shared memory for each processor [24], [34], [36].
When scaling up this architecture, the bus soon becomes the bottleneck of the system. In fact, the bus easily saturates, not only because of the traditional bus traffic, which is present even in cache-based uniprocessors, but also because of the traffic induced by cache coherency. Part of this traffic can be avoided by using an adequate coherence protocol. Indeed, the performance and the scalability of this architecture mainly depend on the coherency strategy and workload features. In particular, the scalability is heavily dependent on how well the coherency strategy adapts to the memory access patterns exhibited by running applications [75].
Several studies have analyzed the access patterns [19], [16], [17], [52], which depend to a great extent on the application itself, highlighting the following main categories of sharing: active sharing, false sharing, and passive sharing. Active sharing involves data actually shared among processes [33], [1]. False sharing occurs when several different processors access separate data stored in the same memory block. This problem is due to a mismatch between the sharing granularity of the application and the coherence granularity of the cache [70], [23]. Passive sharing occurs when a private data block is replicated in more than one cache as a consequence of the migration of the owner process [48], [51], [52].
Passive sharing greatly influences the performance of a multiprocessor running general-purpose workloads composed of both sequential and parallel applications under a multitasking operating system. In previous works, we showed that, in this kind of workload, passive sharing has a substantial weight on the overall performance of the system for all coherence protocol schemes [52] , [29] . In a write-update coherence scheme, the number of write transactions, due to passive sharing, may be as high as 80 percent of the total write transactions for workloads consisting of only sequential programs that run concurrently. This value does not decrease below 65 percent when a parallel application is added to the same workload [52] . In these cases, passive sharing may take up 40 percent and 30 percent of the bus bandwidth, respectively. This high traffic induced designers to adopt write-invalidate schemes that partially reduce, but do not eliminate, that overhead.
After the elimination of this overhead, the reconsideration of write-update schemes is worthwhile [77] .
Process migration induces not only the generation of passive sharing, but also the scarce reuse of cached copies. Mogul and Borg found cache reload overheads of up to 8 percent of the execution time [78] . Process migration cannot be avoided if the system has a scheduler that automatically balances the load among processors [36] . Even special scheduling strategies like cache-affinity [42] , [57] , [71] , [74] cannot avoid process migration in all load conditions. In fact, the scheduler could be forced to reduce affinity in order to have a running process on each processor. Indeed, the probability of generating passive copies increases as the time interval between the instant in which the process is suspended from execution and its next rescheduling on a different processor decreases. This interval is statistically small when the number of ready processes is comparable to the number of processors. In this scheduling situation, the cache affinity policy fails.
Different solutions for coherence protocols have been proposed [59], [67], [68]. We have examined these solutions, grouping them in the following categories. Write-Update (WU) protocols distribute the write operation involving a shared copy by using the intrinsic broadcast nature of the shared bus. Write-Invalidate (WI) protocols maintain coherency by invalidating remote copies upon each write operation on a shared copy. Hybrid (HY) and Hybrid Adaptive (HA) protocols use some kind of switching strategy between a WU and a WI behavior. Selective protocols (SE) try to cope with some special problems that affect the performance of the system. None of the above protocols explicitly treats the problem of passive sharing, although some indirect effect is present while dealing with other kinds of sharing. For example, protocols that minimize the overhead of "migratory sharing" [13], [60], a special case of active sharing, also partially reduce passive sharing effects.
We propose a cache coherence protocol, named PSCR (Passive Shared Copy Removal), to eliminate passive sharing in throughput-oriented multiprocessors. The simple idea of our solution is to locally invalidate a cached copy belonging to a process private area as soon as the same block is fetched by another processor. The information about which data are private can be produced easily by the compiler, is normally available in modern kernels, and is used, for example, by the memory management mechanisms of multitasking environments. PSCR can simply use this information without adding extra memory to the cache and without modifying programs. The protocol has a reduced complexity since it has only five states and needs only one additional bus line compared to the MOESI protocol scheme [63]. We are not aware of other solutions that explicitly eliminate the overhead due to private data accesses. The selective invalidation mechanism allows PSCR to gain the benefits of an update mechanism in shared-bus architectures. To show the effectiveness of PSCR, we evaluated its performance against other protocols either presented in the literature or used in commercial multiprocessors. Process migration shows up in a multitasking environment. For this reason, we consider general-purpose workloads, which are the usual workloads for the platform under study, instead of purely parallel workloads, which are more common in protocol evaluations [60], [26], [19]. The first of these workloads consists of typical sequential programs like Unix system commands, utilities, and user applications. This scenario simulates the execution of a shell script. In the other two workloads, a parallel application is added to this basic workload to model a situation in which the user may want to run parallel applications along with the other programs. The selected applications (MP3D and Cholesky) belong to the SPLASH suite [54].
In this scenario, other kinds of user-generated sharing are also present along with passive sharing.
The performance evaluation has been carried out by using the "Trace Factory" environment [29], [50]. Trace Factory permits the generation of a combined workload in which the concurrent execution of several applications is simulated and also includes the most influential activities of the kernel, namely virtual memory, scheduling, and system calls. Moreover, the simulation is then performed using an enhanced trace-driven technique that solves some limitations [30] of classical trace-driven simulation. Trace-driven simulation offers a good trade-off between speed and accuracy when the performance evaluation's target is the memory hierarchy and processor interconnection subsystem [72].
The protocol sensitivity to architectural parameters such as several processor/bus speed combinations, cache block sizes, set associativity, and bus width has also been analyzed. As for the scheduler, we have also explored conditions in which the processor affinity is high.
The protocol performance can be further enhanced if the compiler is able to extract information about the access patterns to shared items. In this case, the evaluation results showed an improvement when certain blocks belonging to shared areas are treated as if they were private, invalidating them in advance.
The main approaches to achieving cache coherency in bus-based multiprocessors and some solutions that may have effects on passive sharing are reviewed in Section 2. Section 3 presents the new coherence protocol to treat and avoid passive sharing. Section 4 discusses the methodology used to evaluate the performance of PSCR and six other protocols. In Section 5, we present simulation results and compare PSCR against the others for various case studies. In Section 6, the complexity of PSCR is analyzed and compared with that of the other protocols. Finally, the conclusions are drawn in Section 7.
RELATED WORK
In this section, we recall the main approaches to achieving cache coherency in bus-based multiprocessors and highlight some solutions that may have effects on passive sharing. This list is not intended to be exhaustive. The classification of some solutions is problematic since the approach may fall into more than one class.
It is now clear that many factors influence the overhead introduced by cache coherence. First, the access patterns [33], [9], [58] vary for different data elements in the same application and for different applications. Moreover, data are allocated to memory blocks that may in turn exhibit different aggregations of the original access patterns. In addition, when general-purpose workloads are considered, access patterns generated by migrating processes induce passive sharing. Second, the access patterns to data may change during the program execution. This suggests the introduction of some kind of adaptive behavior into coherence schemes.
Third, different choices for the architectural or operating system parameters may weigh differently on the coherence overhead. For instance, large block sizes may introduce penalties due to false sharing [70], [23]; scheduler policies like cache-affinity [57] may favor the reuse of cached copies and limit passive sharing; and variable mapping influences the coherence overhead [77].
Write-Update and Write-Invalidate Protocols
First solutions coped with coherency by using some kind of clever static strategy: updating or invalidating remote copies upon a write operation involving a shared copy. The Write Invalidate (WI) class consists of those protocols that invalidate remote copies upon a write operation involving a shared copy. In this class, early protocols were Write-Once [31] , Synapse [25] , Illinois [47] , Berkeley [40] , RB (Read Broadcast) [53] , and EIP (Efficient Invalidation Protocol) [7] . The Write Update (WU) class consists of those protocols that update remote copies upon a write operation involving a shared copy: Dragon [43] , Firefly [65] , and RST (Reduced State Transition) [49] . A first evaluation of most of these protocols can be found in [6] . Two new WU protocols have been defined for two special bus-based machines: on-chip multiprocessor [64] and bus-based COMA [41] .
A first attempt to standardize protocols yielded the MOESI class of protocols in order to implement them on a common platform [63]. MESI is a MOESI class protocol, based on Goodman's Write-Once 4-state protocol [31]. It is implemented in most of the commercial high-performance microprocessors like AMD K5 and K6, the PowerPC series, the SUN UltraSparc, SGI R10000, Intel Pentium, Pentium Pro, Pentium II, and Merced.
Hybrid Protocols
The performance of a protocol depends on the access patterns exhibited by running programs and, therefore, neither WU nor WI is the best choice for all programs [19], [75], [76]. Eggers and Katz introduced two metrics called "write-run length" (WRL) and "external rereads" (XRR) to characterize access patterns to shared data [19], [20]. The first metric is the number of write operations issued by a given processor to a memory block before another processor accesses that block. (A "write-run" is the sequence of write references, possibly interleaved with read references.) The second metric indicates the number of processors that execute read operations on a block between two consecutive write runs. A natural use of these statistics is to select the better coherence strategy between WI and WU for a given application. A long write-run and a low XRR value suggest that a write-invalidate coherence protocol should be chosen. The cost of the initial misses (caused by invalidation and indicated by the XRR value) is balanced by the large amount of bus traffic saved because all of the subsequent write operations can execute locally. A large number of external rereads and a short write-run indicate that a write-update strategy would be convenient. Of course, the cost of misses and updates plays a decisive role in the strategy selection.
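The two metrics can be made concrete with a small sketch that extracts them from the access trace of a single memory block. The function name `write_run_stats`, the `(processor, operation)` trace format, and the choice to report the final, still-open run are assumptions made for illustration; they are not taken from the cited studies.

```python
def write_run_stats(trace):
    """Compute (WRL, XRR) pairs for ONE memory block.

    trace: iterable of (proc, op) tuples, op being "R" or "W".
    A write run is opened by a write, continues while only the same
    processor touches the block, and ends at the first access by another
    processor. XRR counts the distinct other processors that read the
    block before the next write run starts.
    Assumption: the last run of the trace is reported as well, with the
    readers seen so far.
    """
    stats = []         # one (WRL, XRR) pair per write run
    writer = None      # processor that owns the current/last write run
    wrl = 0            # number of writes counted in that run
    readers = set()    # other processors that read after the run
    run_open = False   # True until another processor touches the block

    for proc, op in trace:
        if op == "W":
            if run_open and proc == writer:
                wrl += 1                           # the run continues
                continue
            if writer is not None:
                stats.append((wrl, len(readers)))  # close the previous run
            writer, wrl, readers, run_open = proc, 1, set(), True
        elif writer is not None and proc != writer:
            run_open = False                       # foreign read ends the run
            readers.add(proc)

    if writer is not None:
        stats.append((wrl, len(readers)))
    return stats


trace = [(0, "W"), (0, "W"), (0, "R"), (0, "W"),   # run of 3 writes by P0
         (1, "R"), (2, "R"), (1, "R"),             # reread by P1 and P2
         (1, "W"), (0, "W")]                       # two runs of length 1
print(write_run_stats(trace))
```

In this trace, the first run (three writes followed by rereads from two processors) is the kind of pattern where the relative cost of misses and updates decides between WI and WU, as discussed above.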
The large variations of WRL and XRR statistics among the programs suggested the introduction of hybrid protocols (HY). Some proposed protocols start with WU strategy, but switch to WI as soon as a long write-run is encountered or predicted. Others change their behavior for each program, page, or block and dynamically for the same block. Some protocols use a centralized approach to invalidate the remote copies: The writer processor broadcasts the invalidate command. In other protocols, remote caches decide autonomously to invalidate local copies.
The RWB (Read Write Broadcast) protocol [53] is one of the first protocols to exhibit a hybrid nature. After a first write-through on a shared copy, the protocol starts to invalidate.
Karlin et al. proposed the "Competitive Snooping" algorithm, upon which a protocol can be implemented. The protocol switches dynamically between the WU and WI modalities when the cumulative cost of sending updates equals the cost (invalidation threshold) that would be incurred if data had to be read [39]. They proved that, for any sequence of operations, the overhead of their algorithm is within a constant factor of the minimum required for that sequence.
Eggers and Katz compared a variant of Competitive Snooping, called SR (Snoopy Reading) [21], with standard protocols. In SR, on a write to a shared copy, the write operation is broadcast on the bus and a counter (initialized to three) in the writer's cache is decremented. When the counter reaches zero, the other copies of the block are invalidated and the counter is reset to three. This protocol has the advantage of using a number of coherence bus cycles less than twice the number used by the optimal off-line protocol. The authors found good improvement over Firefly for two out of four traces. Berkeley and RB performed better than SR on the other traces.
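The counter mechanism of SR can be illustrated with a toy model of a single block. This is only a sketch of the rule as summarized above: the class name `SnoopyReadingBlock` and its bookkeeping are hypothetical, and the model deliberately leaves the counter untouched on remote reads, since the summary does not say how such accesses affect it.

```python
class SnoopyReadingBlock:
    """Toy model of ONE memory block under the Snoopy Reading rule."""

    THRESHOLD = 3  # initial counter value stated in the text

    def __init__(self):
        self.sharers = set()   # processors whose cache holds a valid copy
        self.counters = {}     # per-cache counter for this block
        self.bus_updates = 0   # write broadcasts placed on the bus

    def read(self, proc):
        """Load (or keep) a copy of the block in proc's cache."""
        self.sharers.add(proc)
        self.counters.setdefault(proc, self.THRESHOLD)

    def write(self, proc):
        """Write by proc; returns the coherence action that was taken."""
        self.read(proc)                       # the writer must hold a copy
        if self.sharers == {proc}:
            return "local"                    # not shared: no bus traffic
        self.bus_updates += 1                 # broadcast the written word
        self.counters[proc] -= 1
        if self.counters[proc] == 0:
            self.sharers = {proc}             # invalidate all other copies
            self.counters = {proc: self.THRESHOLD}
            return "update+invalidate"
        return "update"


block = SnoopyReadingBlock()
block.read(0)
block.read(1)
actions = [block.write(0) for _ in range(5)]
print(actions, block.bus_updates)
```

With two sharers and five consecutive writes by the same processor, only the first three writes reach the bus; the remaining ones execute locally because the other copy has been invalidated.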
Archibald introduced EDWP (Efficient Distributed Write Protocol) [8], which is similar to the Competitive Snooping protocol. In this case, the invalidation threshold is fixed at three, but the decision to invalidate a copy is delayed until all counters for that block reach the value of zero (this is detected by means of the shared line). The author shows that this protocol has an improved performance compared to previous HY protocols.
Veenstra and Fowler introduced and evaluated three hybrid protocols that choose the WU or WI modality statically per-page, statically per-block, or dynamically per-reference, respectively [75]. They found that HY protocols substantially reduce the cost of memory references for most of the programs studied. None of the programs receives a significant additional benefit from using a dynamic hybrid protocol compared to the per-block static hybrid protocol, for cache block sizes smaller than 64 bytes.
Gee and Smith proposed a variation of EDWP named "Update-Once" [26]. They evaluated the protocol over a wide range of traces and architectural parameters and compared the performance against a large set of protocols. They used four WI protocols (Write-Once, Illinois, Berkeley, and Full-MOESI-Invalidate), three WU protocols (Dragon, Firefly, and MOESI-Update), and one HY (EDWP). Update-Once has an invalidation threshold of one, and results show that it yields the highest average performance over the set of protocols.
Another HY protocol is AXP, which is the protocol of the Alpha 21064 [15]. The system is configured with two levels of caches that guarantee that if a block is present in the L1 cache, then it is also present in the L2 cache. In this protocol, invalidation is performed on a strictly local criterion. Upon an update, if the block is present in both caches, the protocol updates the copy in L2 and invalidates the copy in L1. If the copy is only present in L2, this copy is invalidated. This local invalidation mechanism is also referred to as the "drop rule" [76]. In this case, the invalidation threshold is two. Veenstra and Fowler evaluated AXP, finding that it does not perform better than Illinois [76].
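The drop rule can be sketched as a function applied by one cache hierarchy each time it observes a remote update. The function name `on_remote_update` and the representation of each cache level as a set of block addresses are assumptions for illustration; local accesses that would reload the block into L1 between two updates are not modeled.

```python
def on_remote_update(l1, l2, block):
    """Apply the drop rule when a remote update for `block` is observed.

    l1, l2: sets of block addresses cached at each level (inclusive
    hierarchy, so every block in l1 is also in l2). Both sets are
    modified in place; the return value names the action taken.
    """
    if block in l1:
        l1.discard(block)      # the L2 copy is updated, the L1 copy dropped
        return "update L2, invalidate L1"
    if block in l2:
        l2.discard(block)      # copy only in L2: drop it
        return "invalidate L2"
    return "no copy"


l1, l2 = {7}, {7, 9}
first = on_remote_update(l1, l2, 7)
second = on_remote_update(l1, l2, 7)
third = on_remote_update(l1, l2, 7)
print(first, "|", second, "|", third)
```

Two consecutive remote updates with no intervening local access remove the block from both levels, which is why the invalidation threshold is two.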
Adaptive Hybrid Protocols
Adaptive Hybrid (AH) protocols dynamically switch between WU and WI policies or are pattern-sensitive, modifying the basic protocol behavior to manage the necessary coherence operations adaptively.
Cox and Fowler introduced an Adaptive protocol with Migratory Sharing Detection (that we refer to as AMSD) for bus-based systems [13]. Migratory sharing is characterized by the exclusive use of data for a long time interval. Typically, the control over these data migrates from one process to another [33], [60]. The protocol identifies migratory-shared data dynamically in order to reduce the cost of moving them. The implementation is an extension of a common MESI protocol. The four basic states are augmented with three new states and an additional bus line is required. The authors evaluated the protocol by considering only the accesses to shared data and by excluding accesses to synchronization variables, private data, and instructions. They observed that "treating private data as though they were migratory would reduce the cost of process migration." Stenström et al. also introduced a detection mechanism of migratory sharing for a directory-based write-invalidate protocol [60], with a solution that is very similar to AMSD. The migratory detection mechanism has also been applied to a directory-based competitive-update protocol [32], [46]. This competitive-update protocol is an extension of Competitive Snooping [39] to directory-based protocols.
Veenstra and Fowler evaluated some variants of the AXP that extend the basic behavior in two directions. One improvement (AXP+) is on the managing of migratory data [33] , and the other (AXPa) on the adaptive behavior. They found that the migratory optimization [13] , [60] influences HY protocol performance, more than choosing an optimal threshold for switching from WU to WI modality. The application of migratory optimization on the adaptive behavior (AXPa+) does not introduce improvements to the adaptive behavior extension (AXPa).
Anderson and Karlin introduced two protocols that change their behavior not only on a per-block basis but also dynamically for the same block [5]. Such protocols are motivated by the large variability of WRL observed for different programs and for different blocks of the same programs. These two protocols are based on the Snoopy Reading protocol. The adaptive behavior is achieved by using an invalidation threshold for each block whose value is adjusted after each write-run, whereas the Snoopy Reading protocol uses a constant invalidation threshold. In the RW (Random Walk) protocol, the block threshold is initially zero. The threshold is incremented if the block experiences a WRL not greater than the SR invalidation threshold. It is decremented in the opposite case. The LTS (Last Three Samples) protocol approximates the mean value of the WRL distribution using the last three WRL samples. They compared the performance of these two protocols against Illinois and Dragon. The performance results were "in all cases closer to better of WI and WU" [5].
Protocols for General-Purpose Workloads and Selective Protocols
When a multiprocessor is used to speed up a workload that includes sequential independent applications, coherence overhead due to passive sharing is the factor that most influences the performance [52], [29]. The system kernel, concurrently executing on several processors, and, possibly, parallel applications also generate coherence overhead due to data sharing. Figs. 1 and 2 show the percentage of bus write operations due to passive and other shared copies in the case of the Dragon protocol and two different workloads. The effects of passive sharing are significant in both cases. A first protocol that tried to eliminate the overhead induced by passive sharing was UCR (Useless Copy Removal) [48]. In this case, each cache locally invalidates unused copies as soon as another cache fetches that data block. Cache copies are classified as unused when the copy is not used by the process currently running on the corresponding processor. Since a systematic detection of this condition requires high hardware overhead to store the process identifier for each block, an approximation involving only one extra bit per block was introduced.
The evaluation employing synthetic workloads shows a performance improvement both over RST and Dragon protocol [50] .
Skeppstedt and Stenström introduced a variation of the Censier and Feautrier protocol [11], named "private-detection technique" [56], by inserting a simple heuristic. The protocol classifies blocks as effectively private by observing the transaction sequence involving the block. The results show improvement over the basic protocol. The technique has been evaluated only for CC-NUMA machines.
Prete et al. describe a technique, named USCR (Useless Shared Copy Removal) [51], similar to UCR, but in this case the selective invalidation is applied to data accesses that the processor signals to be private. USCR applied to the Dragon and Berkeley protocols is then evaluated on parallel applications, showing substantial improvement for USCR-Dragon but not for USCR-Berkeley over their respective basic versions. On these applications, USCR-Dragon also exhibits a higher performance than USCR-Berkeley does. The reason is that the UCR strategy cannot recover the effect of undesirable invalidations on actively shared copies. This was also a good indication of the fact that, once passive sharing is eliminated, WU protocols may continue to be a viable strategy compared to WI ones.
Other mechanisms have been presented to minimize the coherence overhead due to invalidations: DSI [44] eliminates invalidation messages in a directory-based multiprocessor by automatically invalidating its local copy before a conflicting access by another processor. Cheong and Veidenbaum proposed a solution based on the combination of compile-time reference tagging and individual invalidation of potential stale copies only when referenced [12] . Carter et al. used an update time-out mechanism to invalidate replicas of copies that have not been accessed recently upon receipt of an update [10] .
Finally, many protocols have been proposed to solve the problem of false sharing copies [70] , [23] : SB (Sub-Block) protocol [4] , DSB (Dynamic Sub-Block) protocol, FSB (Fixed Sub-Block) protocol [38] , and WIP (Word Invalidate Protocol) [66] , [69] . Other techniques exist to cope with this problem, including padding [23] , program transformations [37] , separate cache block allocation [70] , data privatization [77] , and the adoption of some relaxed memory consistency models [18] .
THE PROTOCOL
None of the above protocols (except UCR and USCR by the same authors) explicitly addresses the problem of passive sharing, although some indirect effect is present while dealing with other kinds of sharing. The reason why those protocols are not effective is briefly explained below. WU protocols have the worst behavior since they update passive copies until dropped because of replacement, generating a huge amount of unnecessary traffic. WI protocols typically invalidate passive copies on the first write. Thus, they avoid part of the useless traffic, but this indiscriminate invalidation does not allow us to take full advantage of the possible remote use of actively shared copies. Therefore, the more these protocols are effective in treating passive sharing, the more they lose by invalidating actively shared copies. HY and AH protocols have a certain delay in recognizing passive copies and they limit the possible benefits of avoiding some extra update operation. Selective and Adaptive Hybrid protocols that are able to detect migratory sharing may produce some positive effects on passive sharing.
Description of the Idea
The selective invalidation mechanism allows PSCR to eliminate passive sharing and to gain the benefits of an update mechanism in bus-based architectures. The idea consists of invalidating the copies belonging to private data areas of a process as soon as they are fetched by another processor. We refer to such blocks as P-blocks, while S-blocks are blocks belonging to code or shared data area.
PSCR ensures that a P-block is never involved in a write transaction. When a cache miss involves a P-block, the only other copy, possibly left by the migrated process in a remote cache, will be immediately invalidated. In this way, private data blocks are gradually forced to "follow" the owner process in its migration in order to cause no further coherence-related activity.
Basic Hardware and Software Support
Our approach is both hardware- and software-based. We suppose that private data are allocated into separate memory pages at compile time. At loading time, the memory management unit uses an extra bit (P-bit) for each page descriptor to indicate whether the current page belongs to private data. Compilers and the operating system kernels of multiprogrammed environments normally perform this activity in order to manage virtual memory. In this way, no extra run-time software support is required with respect to what is normally present. Moreover, no extra information is necessary in program code. In Section 5.6, we discuss an advanced strategy at software level that can increase the protocol performance. In Section 6.3, we also discuss the compiler effort in detecting private data.
The hardware implementation is quite simple: the processor uses a dedicated line of the processor-cache bus to signal when a memory operation involves private data (a P-block) of the process executing on that processor. We suppose that the shared bus can provide the following bus transactions:
Read-block transaction: The cache loads a copy of a memory block that it does not hold yet. If the copy is furnished by a cache, we mean cache-to-cache read-block transaction, while if the copy is furnished by the memory, we mean memory-to-cache read-block transaction.
Write transaction: The cache broadcasts the contents of a single location on the shared bus.
Update-block transaction: The cache writes back an entire copy that has to be destroyed.
In the case of a miss, the cache broadcasts the P-block/S-block information on the common bus by means of a line (L1) during the read-block transaction. If the transaction involves a P-block and a remote cache holds a copy of that block, the copy is immediately invalidated.
A second line (L2, handled by the "listening" caches) is required for two mutually exclusive purposes:
. during a read-block or a write transaction involving an S-block (L1 OFF), to indicate that a copy is resident in at least one remote cache, so that the state of the loaded or newly written copy must be set to one of the shared states (see below). Other protocols use a line (named shared-line) for the same purpose.
. during a read-block transaction involving a P-block (L1 ON), to indicate that a dirty copy is resident in a remote cache, so that the state of the newly loaded block must be set to Private-Dirty (see below).
A cached copy may be in four valid states:
Private-Clean (PC): Only one memory block copy exists and it is consistent with main memory.
Private-Dirty (PD): Only one memory block copy exists, but it has been modified and is no longer consistent with main memory. In case of replacement, the cache must first update main memory.
Shared-Clean (SC): Several memory block copies may exist; they are identical but may be inconsistent with main memory.
Shared-Dirty (SD): Several memory block copies may exist (one SD and the others SC); they are identical but are not consistent with main memory. The cache with the SD copy must update main memory if the SD copy has to be destroyed for replacement.
Furthermore, we have the Invalid (I) state, to which a P-block is set after a local invalidation (see below).
Activities Due to Local Processor Operations
The description of a specific coherence protocol must generally consider two independent aspects: the type of access (read/write) requested by the local processor and the cache condition (hit/miss). In the case of the PSCR protocol, a further distinction needs to be made between operations on P-blocks and S-blocks. Therefore, eight different cases should be considered. Of course, for two of them (read hit on either kind of block), no coherence action is needed. As for the remaining six cases, we provide a detailed description, with the help of the state diagrams shown in Fig. 3.
Write hit on P-block: The cached copy is updated. If the copy is PC, its state is changed to PD. If it is already PD, no state transition is necessary.
Write hit on S-block: The cached copy is updated. If the copy is PC, its state is changed to PD. If it is already PD, no state transition is necessary. If the copy is either in SC or in SD state, a bus write-transaction is used to update both the main memory and the copies that may exist in remote caches. Because of this transaction, if L 2 line is not active (the block is no longer shared), the copy state changes to the corresponding private state (SC to PC, SD to PD).
Read miss on P-block: First, a cache block may have to be chosen for replacement. If the victim block is in either PD or SD state, an update-block transaction is used to write back the modified block into the main memory. Afterwards, v I line is activated and a read-block transaction is used to load the missing block. The block is loaded in PD state if v P line is activated by a remote cache during this transaction; otherwise, it is loaded in PC state. Finally, the cache supplies the processor with the contents of the involved location.
Read miss on S-block: First, a replacement phase may be necessary as in the previous case. During the read-block transaction needed to load the missing block, v I line is not activated. The block is loaded in SC state if v P line is activated by a remote cache during this transaction; otherwise, it is loaded in PC state. Finally, the cache supplies the processor with the contents of the involved location.
Write miss on P-block: First, a replacement phase may be necessary, as above. During read-block transaction, v I line is activated. The block is loaded in PD state and the pertinent location is updated.
Write miss on S-block: First, a replacement phase may be necessary as above. During read-block transaction, v I line is not activated. The block is loaded in SC state if the v P line is activated by a remote cache during this transaction; otherwise, it is loaded in PD state. Finally, the pertinent location is updated and, if the copy is SC, a write transaction is performed.
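The cases above can be summarized as a small transition function. The following Python sketch is only an illustration under stated assumptions, not the authors' implementation: a miss is modeled as a copy in state I, the replacement phase is omitted, and the function name local_access, its parameters, and the transaction labels are hypothetical.

```python
from enum import Enum


class State(Enum):
    I = "I"
    PC = "PC"
    PD = "PD"
    SC = "SC"
    SD = "SD"


def local_access(state, op, is_p_block, shared_line):
    """Return (next_state, bus_transactions) for a local processor access.

    state       -- State of the local copy; State.I models a miss (assumption)
    op          -- "read" or "write"
    is_p_block  -- True for a P-block, False for an S-block
    shared_line -- True if a remote cache activates the v_P line
    """
    bus = []
    if state is State.I:
        # Miss: a read-block transaction loads the block.
        # The v_I line is activated only for P-blocks.
        bus.append("read-block(vI=on)" if is_p_block else "read-block(vI=off)")
        if op == "read":
            if is_p_block:
                return (State.PD if shared_line else State.PC), bus
            return (State.SC if shared_line else State.PC), bus
        # Write miss.
        if is_p_block:
            return State.PD, bus
        if shared_line:
            bus.append("write")  # update the remote copies and main memory
            return State.SC, bus
        return State.PD, bus
    if op == "read":
        return state, bus  # read hit: no coherence action
    if state in (State.PC, State.PD):
        return State.PD, bus  # write hit on a private copy
    # Write hit on a shared copy (SC or SD): bus write transaction.
    bus.append("write")
    if not shared_line:
        # The block is no longer shared: move to the matching private state.
        return (State.PC if state is State.SC else State.PD), bus
    return state, bus
```

For example, local_access(State.SC, "write", False, False) yields the PC state together with one write transaction, which corresponds to the write hit on an S-block whose remote copies have disappeared.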
Snooping Activities
During each read-block or write transaction, each listening cache checks whether it holds a copy of the memory block involved. If so, it operates as follows:
In the case of a read-block transaction, two different cases can happen:
. If the transaction involves an S-block (line v I off), the cache activates v P line. If the state is private (PC or PD), then it is changed to shared (PC to SC, PD to SD). If the initial state was PC, PD, or SD, the cache disables the main memory and supplies the data.
. If the transaction involves a P-block (line v I on), the copy state is set to Invalid. If the cache has a PD copy of the block itself, v P line is also activated. Finally, the cache disables the main memory and supplies the data.
In the case of a write transaction on an S-block (whether the copy state is SC or SD), the cache updates the copy and no state transition is performed.
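The snooping rules can likewise be sketched as a function of the local copy state and the observed transaction. This is a minimal Python illustration under stated assumptions: the function name snoop and the tuple layout are hypothetical, and, because the text does not say whether the v P line is raised during a write transaction, that field is left as None in that case.

```python
def snoop(state, transaction, v_i):
    """Reaction of a listening cache that holds a valid copy of the block.

    state       -- "PC", "PD", "SC", or "SD"
    transaction -- "read-block" or "write"
    v_i         -- True if the v_I line is on (the block is a P-block)

    Returns (next_state, raises_v_p, supplies_data).
    """
    if transaction == "write":
        # Write on an S-block: the copy is updated, no state transition.
        # Whether v_P is raised here is not stated in the text (None).
        return state, None, False
    if not v_i:
        # Read-block on an S-block: signal sharing, private becomes shared.
        next_state = {"PC": "SC", "PD": "SD"}.get(state, state)
        supplies = state in ("PC", "PD", "SD")
        return next_state, True, supplies
    # Read-block on a P-block: local invalidation of the passive copy.
    # v_P is raised only when the invalidated copy was dirty (PD).
    return "I", state == "PD", True
```

Note that the P-block branch is the step that removes passive sharing: the stale copy left behind by a migrated process is invalidated as soon as the block is fetched elsewhere.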
METHODOLOGY
The methodology used in our analysis is based on trace-driven simulation [22], [62], [35], [50], [72]. To ensure accuracy, the kernel activities that most affect the performance are simulated. Memory references include both user and kernel references, and they are produced "on-demand." Three kernel activities are simulated: system calls, process scheduling, and virtual-to-physical address translation. Reference bursts, due to system calls, affect performance, interrupting the locality of the memory reference stream of the running process. Virtual-to-physical address translation may change program localities that influence the number of "intrinsic-interference" (or "conflict") misses caused by interferences among several accesses in the same cache set. Process scheduling influences the process migration and, as a consequence, passive sharing.
We used the Trace Factory environment [29] to achieve the flexibility needed to perform complex evaluations. The approach used in this environment is to produce a source trace (a sequence of memory references, system-call positions, and synchronization events in the case of parallel programs) by means of a tracing tool (Tangolite [30], in the evaluation carried out in this paper). Trace Factory then models the execution of complex workloads by combining multiple source traces and simulating system calls, process scheduling, and virtual-to-physical translation. Finally, Trace Factory produces the references (target trace) furnished as input to a memory-hierarchy simulator [50].
Trace Factory generates references according to the on-demand policy: It produces a new reference when the simulator requests one so that the timing behavior imposed by the memory subsystem conditions the reference production [28]. It simulates system calls by including synthetically generated memory reference bursts. Process management is modeled by simulating a scheduler that dynamically assigns a ready process to a processor. The process scheduling is driven by time-slice for uniprocess applications, while it is driven by time-slice and synchronization events for parallel applications. Virtual-to-physical address translation is modeled by mapping sequential virtual pages into nonsequential physical pages.
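The last modeling step, mapping sequential virtual pages into nonsequential physical pages, can be illustrated with a short sketch. The seeded random permutation, the 4,096-byte page size, and the names make_page_map and translate are assumptions made for illustration; the text does not specify how Trace Factory selects physical pages.

```python
import random


def make_page_map(num_pages, seed=0):
    """Map virtual page i to a physical frame via a seeded permutation (assumed policy)."""
    frames = list(range(num_pages))
    random.Random(seed).shuffle(frames)
    return dict(enumerate(frames))


def translate(vaddr, page_map, page_size=4096):
    """Translate a virtual address; the offset inside the page is preserved."""
    vpage, offset = divmod(vaddr, page_size)
    return page_map[vpage] * page_size + offset
```

Because consecutive virtual pages generally land on unrelated frames, two addresses that never conflicted in a virtually indexed view may fall into the same cache set after translation, which is the effect on conflict misses mentioned in the methodology.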
In our simulations, kernel references and bursts are modeled gathering statistics from a set of traces distributed by Carnegie Mellon University and obtained on an Encore Multimax (shared-bus multiprocessor) machine [73]. As for the bursts, we collected statistics regarding their length and interburst distance. An evaluation of this methodology has been carried out in [50], [29]. As for the scheduler, input parameters are the time slice in terms of number of references and the process-scheduling policy ("cache affinity" or "random").
Workload Characteristics
Our goal is to evaluate and compare coherence protocol performance on a general-purpose multiprocessor workstation. Thus, we used the previous technique to generate three nontrivial real workloads, named UniP, Mix1, and Mix2. We considered 60 million references for each workload. UniP consists of 30 typical sequential programs such as system commands, utilities, and user applications. We selected some typical Unix commands (awk, cp, dd, du, lex, rm, and ls) with different command-line options, three utility programs (cjpeg, djpeg, and gzip), a network application (telnet), and a user application (msim, the multiprocessor simulator used in this work). In a typical situation, various users may run different system commands and ordinary applications. To take into account that users can launch the same program at different times, we traced some commands in shifted execution sections: initial (beg) and middle (mid). Table 1 shows our source traces in terms of: 1) number of distinct (unique) blocks the program uses; 2) code, data-read, and data-write access percentages; 3) number of system calls.
In Mix1 and Mix2, a parallel application is added to the basic UniP workload. The parallel application generates a number of processes equal to 50 percent of the machine processors. Since the access pattern to shared data of parallel applications influences multiprocessor performance significantly, we considered two parallel programs with different sharing behavior, MP3D and Cholesky, both from the SPLASH suite [54]. MP3D simulates rarefied hypersonic flow; the generated trace relates to a case of 10,000 molecules and 20 time steps. Cholesky factorizes a sparse positive definite matrix, using the homonymous method. For Cholesky, we generated the trace using a 1,806 × 1,806 matrix with 30,284 nonzero elements coming from the Boeing/Harwell sparse matrix test set (bcsstk14) as input. Table 2 summarizes the multiprocess-application trace statistics. The write-run figures show that: 1) MP3D exhibits coarse-grained sharing, since the average write-run length varies from 5.89 to 8.18; and 2) Cholesky exhibits medium-grained sharing, having an average write-run length from 4.75 to 5.10.
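To make the write-run figures concrete, the sketch below measures write-run lengths on a toy trace. It assumes the usual definition of a write-run (a sequence of writes to a block by one processor, not interrupted by any access from another processor); the trace format and the function name write_run_lengths are hypothetical and are not taken from the tracing tools used in the paper.

```python
from collections import defaultdict


def write_run_lengths(trace):
    """Return the lengths of all write-runs found in the trace.

    trace -- iterable of (processor, op, block) with op in {"r", "w"}
    """
    owner = {}                 # block -> processor of the current run
    length = defaultdict(int)  # block -> writes counted in the current run
    runs = []
    for proc, op, block in trace:
        if block in owner and owner[block] != proc:
            # An access by another processor terminates the current run.
            if length[block] > 0:
                runs.append(length[block])
            del owner[block]
            length[block] = 0
        if op == "w":
            owner[block] = proc
            length[block] += 1
    # Runs still open at the end of the trace are counted as well.
    runs.extend(n for n in length.values() if n > 0)
    return runs
```

On the trace [(0, "w", "A"), (0, "w", "A"), (0, "r", "A"), (0, "w", "A"), (1, "r", "A"), (1, "w", "A"), (0, "w", "A")] the function reports runs of length 3, 1, and 1: the read by processor 1 closes the first run, while the read by processor 0 does not.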
The target trace characteristics resulting from a simulation performed in the reference case study (Section 5.1) are summarized in Table 3. Comparing the write-run statistics of Tables 2 and 3, we can observe that the write-run lengths of the target traces are quite high, even much higher than in the source traces. This is due to process migration. Therefore, this aspect strongly motivates the introduction of kernel modeling in the evaluation of such multiprocessors with this kind of workload.
Multiprocessor Simulator Characteristics
The multiprocessor simulator [50] used in our analysis characterizes a shared-bus multiprocessor in terms of CPU, cache, and bus parameters. Our simulator models a simple processor architecture. The target of our evaluation is to show how the memory hierarchy is influenced by the choice of an adequate coherence protocol. We have not modeled some memory-latency hiding techniques that do not modify coherence protocol behavior and that are currently used in modern processors.
The CPU parameters are the clock cycle, the minimal number of clock cycles for a read/write operation, and the temporal distribution of the memory accesses. We describe this distribution in terms of the maximum number (M) of references per time interval and the probability that this interval contains exactly 0, 1, 2, ..., M memory references. That time interval is a fixed number of CPU clock cycles.
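As an illustration of this parameterization, the sketch below represents the distribution as a list of M + 1 probabilities and draws the number of references issued in one interval. The probability values in the usage note and the function names are hypothetical; they are not the values used in the paper.

```python
import random


def expected_refs(probs):
    """Mean number of references per interval; probs[k] = P(exactly k references)."""
    return sum(k * p for k, p in enumerate(probs))


def sample_refs(probs, rng):
    """Draw the number of references (0..M) issued in one time interval."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With M = 2 and the made-up probabilities [0.5, 0.3, 0.2], expected_refs gives 0.7 references per interval, and sample_refs(probs, random.Random(0)) returns one of 0, 1, or 2.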
The cache parameters are cache size, block size, associativity, and the access time for read/write operation. The cache block replacement policy is LRU (Least Recently Used). The simulator models a multiprocessor having a relaxed memory consistency (processor consistency [27], [2], [34]). This is implemented by allowing the buffering of write transactions. Finally, the bus parameters are the bus width and the number of CPU clock cycles for each kind of transaction: write, invalidation, update-block, and memory-to-cache and cache-to-cache read-block. Table 4 reports the values of CPU, cache, and bus parameters for the reference case study (Section 5.1). We suppose that the main memory supports write buffering. In this way, the cost of a write transaction is equal to the cost of an invalidation transaction.
The simulator can generate a number of statistical values, such as: miss ratio, number of write transactions, invalidation, update-block, and memory-to-cache and cache-to-cache read-block per memory operation, and bus utilization ratio. In the discussion below, we shall focus on the Global System Power metric (GSP) [6] that represents the number of processors of an ideal machine that does not have delay in accessing memory: $\mathrm{GSP} = \sum U_{cpu}$, where $U_{cpu} = (T_{cpu} - T_{delay}) / T_{cpu}$. $T_{cpu}$ is the time needed to execute the workload, and $T_{delay}$ is the total CPU delay time due to waiting for memory operation completion. We use this metric instead of execution time since we do not execute a single program, in our simulations, but a combination of portions of programs. At the same time, the workload characteristics (such as the number of processes of the parallel application) change as the number of processors changes. In this condition, GSP gives the necessary comparability when the performance evaluation requires varying the number of processors and other system parameters.
Another metric that we used to show the effectiveness of each protocol in achieving coherence at reduced traffic overhead is Processor/Bus Efficiency (PBE): $\mathrm{PBE} = \mathrm{GSP} / \mathrm{BUR}$, where BUR is the bus utilization ratio and its value ranges between 0 and 1. A high value of this number indicates that a protocol is exploiting the bus more effectively to get a given value of GSP. For example, if two protocols have the same value of GSP, but have different bus utilization ratios, then the protocol having the lower BUR and, thus, higher PBE, is using the bus bandwidth more effectively.
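Both metrics are simple to compute once the per-processor times are known. The sketch below assumes that GSP is the sum over the processors of the utilization (T_cpu - T_delay) / T_cpu and that PBE is GSP divided by BUR; the function names and the numbers in the usage note are hypothetical.

```python
def gsp(t_cpu, t_delay):
    """Global System Power: sum of per-processor utilizations.

    t_cpu   -- per-processor time needed to execute the workload
    t_delay -- per-processor delay spent waiting for memory operations
    """
    return sum((tc - td) / tc for tc, td in zip(t_cpu, t_delay))


def pbe(gsp_value, bur):
    """Processor/Bus Efficiency: GSP obtained per unit of bus utilization."""
    if not 0.0 < bur <= 1.0:
        raise ValueError("BUR must lie in (0, 1]")
    return gsp_value / bur
```

For four processors with t_cpu = [100, 100, 100, 100] and t_delay = [20, 20, 50, 50], gsp returns about 2.6, that is, the machine behaves like 2.6 ideal processors; at a bus utilization of 0.65, pbe is about 4.0.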
All the statistics regarding single processor performance are averaged over the total number of processors.
PERFORMANCE EVALUATION
Our goal is to show the effectiveness of the PSCR protocol under various architectural parameters and scheduling policies. The considered workloads are those previously introduced: Unip, Mix1, and Mix2. The performance of PSCR is compared against the performance of six other protocols: Dragon, Berkeley, MESI, Competitive Snooping, Update-Once, and AMSD. These protocols belong to different classes, that is, they use different strategies to obtain cache coherency. The selection is motivated as follows:
. In the WU class, the most widely used protocols are Dragon and Firefly; however, Dragon is usually found to perform slightly better.
. In the WI class, we evaluated two protocols: MESI, the most widely used protocol, and Berkeley, which is more often evaluated in the literature.
. In the HY class, Competitive Snooping and Update-Once have a good performance over a wide range of applications and employ different strategies to switch from WU to WI.
. In the AH class, the most promising protocol for treating passive sharing is AMSD.
Reference Case Study
As a starting point of our evaluation, and for an in-depth discussion, we have chosen the case study of a machine having a 64-bit data bus width and a 64-byte cache block size. Each processor has a 256-Kbyte, direct-access private cache. This case study is generic and representative of the various instances that we simulated (Fig. 4). In Table 4, we report CPU, cache, and bus parameter values. The simulations, related to the three workloads described above, yield the following results for the Global System Power. The GSP of PSCR is at least 40 percent higher than that of the other protocols. Moreover, PSCR scales up better, making it possible to connect more processors on the same bus. As expected, Dragon has the worst behavior in terms of both absolute performance (GSP) and scalability. Excluding PSCR, AMSD is the best performing protocol. Its behavior is close to or better than that of the other four protocols (Competitive Snooping, Update-Once, Berkeley, and MESI), for all workloads. The behavior of these four protocols exhibits workload sensitivity.
The good performance of PSCR is mainly due to the lower number of bus transactions and, hence, lower global traffic on the shared bus, which is the bottleneck of the system (Fig. 5) . The reduced bus traffic minimizes the latency of processor operations.
In addition, PSCR exploits the available bus bandwidth better, that is, for a given percentage of bus utilization, PSCR delivers a higher GSP in comparison to all other protocols (Fig. 6). This is also evident from the behavior of Dragon and the bus utilization of PSCR. For example, in Fig. 5 for the Unip workload, Dragon reaches 80 percent for 11 processors and starts to saturate (Fig. 4). Instead, PSCR reaches 80 percent for 20 processors, but it still has the possibility to scale up. These two advantages of PSCR (reduced global traffic and better exploitation of the available bus bandwidth) can be further explained by showing details on the quantity and type of bus transactions actually needed by each protocol. This is shown in the following graphs (Figs. 7, 8, 9, and 10), and discussed below.
The reduction of coherency-related operations (write and invalidation transactions) can result in a real advantage only if the number of read-block transactions (due to a higher miss rate) does not increase.
Miss and write handling introduce a different bus cost and latency for the processors. Miss operations due to read operations may introduce a delay when the processor or other units have to wait for the operation termination in order to continue the execution. As for the write operations (in both hit/miss conditions), they can be managed in an asynchronous way because of write buffering, i.e., the processor can start working on the next operation even though the current one has not actually been completed. This implies that write operations do not involve idle time for the CPU directly, although they may significantly affect miss cost because they cause increased bus traffic. Moreover, it has to be considered that the amount of data to be transferred on the shared bus is significantly lower in the case of a write transaction than in the case of a read-block transaction.
PSCR protocol has both a low miss rate (Fig. 7) and a limited number of write transactions (see Fig. 8 ). These two aspects are strongly related to each other in that PSCR selective invalidation strategy only eliminates the useless copies of P-blocks without causing further unnecessary misses.
Berkeley, AMSD, MESI, and Update-Once exhibit a lower number of coherence-related bus operations (invalidations in some cases, writes in others) than PSCR, but at the cost of an increased miss rate. This effect is due to the invalidation strategy. In case of Berkeley and AMSD, the invalidation strategy causes a miss increase, whose consequences are limited because these protocols heavily employ the cheaper cache-to-cache read-block transaction (Fig. 10) . This fact explains the performance differences among Berkeley, AMSD, MESI, and Update-Once that have similar values of miss ratio. Competitive Snooping has a limited number of write transactions, like PSCR, and a good behavior as for the misses, thus it obtains an intermediate behavior between PSCR and all the other protocols (except Dragon). Finally, Dragon is greatly penalized by the high number of write transactions.
The introduction of parallel applications (Mix1 and Mix2 workloads) penalizes the global performance of each protocol (except Dragon), because of the overhead required to keep the active shared copies coherent (Figs. 4, 7, 8, and 9). The differences in terms of write percentage and write-run length between Mix1 and Mix2 cause different overhead (Figs. 7, 8, and 9) and performance (Fig. 4) for all protocols (except Competitive Snooping). In the case of Dragon, this phenomenon is not observable, because of the saturation of the bus, which starts from a low number of processors. On the contrary, the reuse of active shared copies causes a GSP increase for Dragon in case of Mix1 and Mix2.
Influence of Cache Structure
In the second step of our analysis, we shall examine the behavior of the protocols in the case of a 2-way and a 4-way set-associative cache. The other parameters have the same values as in the previous case. For all protocols, the simulation yielded the following results:
. a decrease of the miss rate (and, as a consequence, of the number of read-block transactions);
. an increase of the number of write or invalidation transactions.
The miss rate decrease is essentially due to the higher associativity, which offers more caching alternatives for blocks sharing the same cache set. The increase of the number of write or invalidation transactions is caused by the high number of shared copies, in turn due to the longer lifetime of cached blocks.
PSCR exhibits the highest increment in the GSP values as the associativity increases. This can be explained as follows: The bus traffic results from the sum of two components: 1) number of read-block transactions originated by miss conditions, and 2) number of coherence actions (write and invalidation transactions).
Misses have three independent sources: misses due to newly accessed blocks, capacity and conflict misses, and invalidation misses. The higher associativity causes a reduction of capacity and conflict misses and consequently enhances the effects of coherence-related bus traffic on global performance. Furthermore, the increase of associativity generally produces an increase of coherence-related activity since a larger number of shared copies can be involved in write operations. For this reason, the protocols that generate the lowest total number of coherence-related bus actions yield good results in terms of GSP.
We do not report all the sets of graphs since they are qualitatively similar to those obtained in the reference case study. To summarize the results, we observed from GSP graphs that protocols exhibit a good scalability until the system reaches a critical number of processors. For each protocol, beyond this critical point it does not make sense to attach more processors on the bus. We tried to give a quantitative estimation of this critical point by defining it as the point at which the slope of the GSP graph is 70 percent of the initial slope. Fig. 11 shows the critical point for direct-access (reference case) and 2-way set-associative caches. Fig. 12 shows the GSP of each protocol at the critical point in the case of direct-access, 2-way, and 4-way set-associative caches. Generally, we can observe that both scalability and processing power furnished by the machine greatly increase when switching from one to two ways. This increment is limited when switching from two-way to four-way caches.
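The critical-point definition can be made operational on sampled data. The sketch below is one possible reading, under stated assumptions: slopes are estimated by finite differences between consecutive samples, the critical point is reported as the processor count at the start of the first segment whose slope is at most 70 percent of the initial slope, and the sample values in the usage note are invented for illustration.

```python
def critical_point(procs, gsp_values, fraction=0.7):
    """Return the first processor count at which the GSP slope has dropped
    to `fraction` of the initial slope, or None if it never does.

    procs      -- increasing processor counts
    gsp_values -- GSP measured at each processor count
    """
    slopes = [
        (gsp_values[i + 1] - gsp_values[i]) / (procs[i + 1] - procs[i])
        for i in range(len(procs) - 1)
    ]
    if not slopes:
        return None
    threshold = fraction * slopes[0]
    for i, slope in enumerate(slopes):
        if slope <= threshold:
            return procs[i]
    return None
```

With the invented samples procs = [4, 8, 12, 16, 20] and gsp_values = [3.8, 7.4, 10.2, 12.0, 12.8], the initial slope is 0.9 per processor and the first segment at or below 0.63 starts at 12 processors, so critical_point returns 12.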
We have analyzed the performance in the cases of 64, 128, 256, and 512-kbyte, two-way set associative caches. In all the experiments, PSCR had the best performance, compared with the other protocols. The performance slightly increases for all the protocols as the cache size increases in a way similar to the case of the increased associativity. In the same way, we have found a decrease of the miss rate and an increase of the number of shared copies, due to the longer lifetime of cached blocks. The lower miss ratio produces benefits on the performance of all protocols.
As the cache size is decreased, we noticed that Dragon exhibits a small increase in its performance. For a 64-kbyte cache, its performance is also comparable with WI protocols, especially when the true sharing is higher (Mix1 and Mix2 workloads). As remarked before, and known from the literature, this is due to the high update traffic of Dragon. For smaller cache sizes, the copy eviction due to the limited cache size is equivalent to a simple mechanism of copy invalidation that eliminates the remote copies not accessed for long time intervals (i.e., that exhibit a long WRL).
Finally, we analyzed the performance as the block size is varied (Fig. 13). For Unip workload, PSCR exhibits a slight GSP improvement as the block size increases from 64 bytes to 128 bytes. For a 256-byte block size, GSP values are slightly lower than in the 64-byte case. Dragon's GSP has almost the same value for each block size. WI protocols are clearly penalized in the 256-byte case, reducing their performance to the Dragon level. Competitive Snooping distinguishes its behavior compared to WI protocols in the case of 128 and 256-byte block size. In the case of Mix1 and Mix2 workloads and 256-byte block size, the behavior of WI protocols becomes even worse than that of Dragon.
(Figure caption fragment) We represent data relevant only to those protocols having write transactions. The introduction of a parallel application (Mix1 and Mix2) increases the number of write transactions. In the case of Dragon, this does not happen since this protocol is more sensitive to passive-sharing effects, which increase as the number of processes (for a given number of processors).
The above results can be explained as follows: The increase of block size naturally causes a miss ratio decrease, but a longer time to handle the miss itself. In the case of Dragon and PSCR protocols, these two phenomena have opposite but balanced weight on the performance for the 64- and 128-byte cases. For the 256-byte case, the longer time to load the block weighs more heavily on the GSP. The increased miss cost penalizes the behavior of WI protocols greatly as the block size increases. Competitive Snooping adapts its behavior based on the ratio between miss and write cost, thus achieving a better performance than WI protocols.
Influence of the Memory Latency
The value of memory latency considered in the evaluations presented above is somewhat low relative to current and future machines [26]. For this reason, we also considered a system with larger memory latency (30 cycles, against the 6 cycles for a memory access in the reference case). In this case, for all protocols, we found that a 128-byte block size yields a better performance than a 64-byte one, due to the higher cost of memory access. Therefore, we shall focus on this case in the following discussion. Bus timings are shown in Table 5.
The simulation results are presented in reuse the shared copies, thus obtaining a miss ratio decrease (Fig. 7) . 
Influence of Bus Width
In the case of larger bus widths (128 to 256 bits), we found again that the block size that yielded an optimal performance is 128 bytes for all protocols.
All protocols provide better performance and a wider linear range. This is evident from the higher GSP and, consequently, from the higher critical point values (Fig. 15). The timing values are reported in Table 6.
756 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 10, NO. 7, JULY 1999
Fig. 11. The critical point for direct-access and 2-way set-associative cache. PSCR has the highest GSP as the associativity increases.
Fig. 12. The GSP of PSCR against all other protocols, at the critical point for direct-access, 2-way, and 4-way caches. For each workload and protocol, the GSP at the critical point is reported. PSCR exhibits a higher performance increase than the other protocols as the number of ways increases. Dragon has a behavior quite independent of the number of ways.
Fig. 13. The GSP of PSCR against all other protocols, at the critical point for 2-way set-associative caches with 64, 128, and 256-byte block size. PSCR is always the best performing protocol. This is due to the Write-Update strategy on S-blocks. Dragon has a behavior quite independent of the block size.
The difference between PSCR and the other protocols becomes lower since the cost of the read-block transaction decreases, whereas the cost of the other transactions keeps constant. For this reason, the protocols based on nonselective invalidation are less penalized than in the previous situations. In particular, in the case of workloads consisting of parallel applications and a 256-bit bus width, Berkeley and AMSD approach the PSCR performance. In this case, it is clear that Dragon does not take advantage of the increased bus width.
Influence of the Scheduling Policy
As noted in Section 1, process migration allows the system to have load balancing in multiprocessor systems without requiring particular effort from the programmer. Process migration has two negative effects on global performance: 1) a peak of misses due to the loading of the working set of the new process; and 2) the generation of passive shared copies, which happens when a migrated process is rescheduled in a short time on a different processor.
The scheduling strategies based on cache-affinity [42] , [57] , [71] , [74] can reduce both effects. In fact, in this case, each process is preferably rescheduled on the same processor on which it was previously executed so that part of its working set is still resident in cache.
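The difference between the two scheduling policies can be illustrated with a small selection routine. This is a hypothetical sketch, not the scheduler used in the simulations: the function name pick_process, the data layout, and the fallback to the first ready process when no process has affinity for the free processor are all assumptions.

```python
import random


def pick_process(ready, last_cpu, cpu, policy, rng):
    """Choose the ready process to run on the processor `cpu`.

    ready    -- list of ready process identifiers (non-empty)
    last_cpu -- dict: process identifier -> processor it last ran on
    policy   -- "affinity" or "random"
    rng      -- random.Random instance used by the random policy
    """
    if policy == "affinity":
        # Prefer a process whose working set may still be in this cache.
        for pid in ready:
            if last_cpu.get(pid) == cpu:
                return pid
        # No such process: the chosen process migrates (assumed fallback).
        return ready[0]
    return rng.choice(ready)
```

With ready = [3, 5, 7] and last_cpu = {3: 1, 5: 0, 7: 2}, the affinity policy gives process 5 to processor 0, whereas a processor that none of them ran on receives process 3, which then migrates.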
In this paragraph, we discuss the results of simulations in which a cache-affinity scheduling strategy is used along with the PSCR protocol. The system performance diagrams (Fig. 16) show that, for a low number of processors, Dragon appears to obtain the highest benefits from this solution, and its performance approaches the PSCR values.
However, the cache-affinity technique cannot provide good results in all workload conditions. In particular, it seems to be inefficient when the effects of process migration become more relevant. Indeed, while the number of cache misses due to context switch remains roughly constant, the coherence overhead induced by passive shared copies depends on the interval between the instant in which a process is suspended from execution and its subsequent resumption. The ordinary cache replacement activity also progressively eliminates possible passive shared copies and, therefore, if a process is suspended for a long time interval, the effects of process migration on coherence overhead are drastically reduced. This time interval statistically decreases when the number of ready-to-run processes is comparable to the number of processors (Fig. 17) . In this case, the probability that a process could be rescheduled on the same processor where it was previously executed also decreases and this is the main reason for the failure of the cache-affinity scheduling strategy and for the drop in the Dragon performance when the processor number roughly equals half the number of processes running in the system (Fig. 16) .
In systems where a cache-affinity scheduling policy is implemented, the adoption of the PSCR protocol can provide relevant benefits, because it drastically reduces the passive sharing that is still present.
In addition, from Fig. 17, we can expect more relevant passive sharing effects in Unip workload compared to Mix1/Mix2 for a given number of processors. This is due to the smaller mean resumption distance, as already observed. Comparing the behavior of Dragon for these three workloads in Fig. 16, we can again infer the same results since, in the case of Unip, this protocol reaches its saturation point for a number of processors lower than in Mix1/Mix2. Finally, from Fig. 16, we notice again the potential of Write-Update schemes like Dragon, which can achieve better performance than Write-Invalidate once passive shared copies are somehow reduced.
Enhancing the PSCR Performance
As noted in Section 2, the write-run statistics of a parallel application can be used to decide the threshold for switching from WU to a WI policy. In this case study, we shall show how the performance of PSCR can be improved, compared with its basic version, by using a per-block WU or WI policy based on the mean WRL (Write-Run Length) dynamically experienced by that block.
In this enhanced version of PSCR, we have supposed that the protocol treats those S-blocks that dynamically have a WRL higher than a given (possibly statically detected) threshold value WRL_TH as if they were P-blocks. Table 7 shows the new values of GSP obtained in the case of Mix1 workload and 21 processors, when WRL_TH assumes different values.
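The per-block decision can be sketched as follows. The class below is a hypothetical illustration: it assumes that the mean WRL of a block is the average length of its completed write-runs, and the class name, method names, and threshold value are not taken from the paper.

```python
class WrlTracker:
    """Track the mean write-run length (WRL) observed for each block."""

    def __init__(self, wrl_th):
        self.wrl_th = wrl_th  # threshold WRL_TH
        self.total = {}       # block -> sum of completed run lengths
        self.count = {}       # block -> number of completed runs

    def record_run(self, block, length):
        """Register a completed write-run of the given length."""
        self.total[block] = self.total.get(block, 0) + length
        self.count[block] = self.count.get(block, 0) + 1

    def mean_wrl(self, block):
        """Mean WRL of the block, or 0.0 if no run has been observed yet."""
        if block not in self.count:
            return 0.0
        return self.total[block] / self.count[block]

    def treat_as_p_block(self, block):
        """True if the S-block should follow the P-block (invalidation) rules."""
        return self.mean_wrl(block) > self.wrl_th
```

With a threshold of 4.0, a block that experienced runs of 6 and 8 writes (mean 7.0) would be handled like a P-block, while a block with a single run of 2 writes would keep the write-update treatment of S-blocks.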
COMPLEXITY
In the following, the implementation cost and complexity of PSCR is estimated against the other protocols based on three parameters: 1) number of logical states (i.e., bits in the status field of each block), 2) type of bus transaction, and 3) number of required bus signals. Table 8 summarizes the features of PSCR and the other protocols used for comparison. PSCR also needs additional hardware and software support that we describe in detail below. We shall not treat details regarding a specific bus implementation.
Number of Logical States
For all the coherence protocols, we have examined the original papers about them (see Section 2). In particular, for the Competitive Snooping protocol, each block has an associated counter which is decremented whenever a write transaction involving the block is observed in order to invalidate the block when the maximum allowed number of write transactions is reached. The size of such counter depends on the ratio between read-block transaction and write transaction time costs. Table 8 presents the minimum and maximum values for the configurations examined in the present paper. The number of logical states is the sum of three components: states of the counter, states of the basic WU scheme (in our case the Dragon protocol), plus the invalid state.
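The counter mechanism of Competitive Snooping can be sketched as below. This is a simplified illustration under stated assumptions: the counter limit is taken as the integer ratio between the read-block and write transaction costs, a local access is assumed to reset the counter (the text above does not describe the reset), and the class name and cost values are hypothetical.

```python
class SnoopedCopy:
    """Cached copy that self-invalidates after too many remote writes."""

    def __init__(self, read_block_cost, write_cost):
        # Assumed sizing rule: number of writes whose total cost
        # matches the cost of reloading the block.
        self.limit = max(1, read_block_cost // write_cost)
        self.counter = self.limit
        self.valid = True

    def observe_remote_write(self):
        """Called when a write transaction on this block is snooped."""
        if not self.valid:
            return
        self.counter -= 1
        if self.counter == 0:
            self.valid = False  # switch from update to invalidation

    def local_access(self):
        """Assumption: a local reference shows the copy is in use and resets the counter."""
        if self.valid:
            self.counter = self.limit
```

Under this sketch, a copy with read_block_cost = 20 and write_cost = 5 survives three consecutive remote writes and is invalidated by the fourth unless the local processor touches it in between. The counter contributes `limit` states to the total number of logical states counted in Table 8.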
Bus Transactions and Signals
PSCR, MESI, Dragon, Update-Once, and Competitive Snooping make use of three different kinds of bus transaction: read-block (to fetch a missed block), write (to update multiple cached copies), and update-block (to write back dirty copies when they need to be destroyed for replacement).
Berkeley does not adopt the write transaction, but needs an invalidation transaction to destroy remote copies in the case of write operation on a shared copy. Two different kinds of bus operation are employed to fetch a missed block: read-shared (analogous to the Dragon read-block) on read miss, and read-for-ownership, which invalidates all remote copies, in the case of write miss. A proper bus signal (RS/RO) makes the distinction between the two kinds of transaction.
AMSD uses both an explicit invalidation transaction, as in Berkeley, and a local invalidation mechanism for blocks deemed migratory.
In the case of MESI protocol, a special situation happens when a miss involves a modified remote copy. Since the shared state is not split in shared-clean and shared-dirty, the memory should be updated before the transition to shared state of these copies. Cache designers have adopted different solutions to manage this situation. We have implemented this solution: On a read miss, the remote cache aborts the bus transaction, then it writes back the copy to the main memory, and allows the originating cache to retry the bus transaction. Both copies will become shared. On a write miss, the cache starts a "Read With Intent To Modify" bus operation by means of RWITM line. The remote cache writes back the copy to the memory and invalidates the local copy. The originating cache saves the copy in Modified state and performs the write locally.
PSCR adopts the same bus transactions as Dragon, and it also needs a bus signal (v I ) to allow a slave cache to behave differently during a read-block transaction.
As for the bus signals, all protocols need at least the Data Intervention (DI) bus signal. The DI signal is used by a slave cache to substitute for main memory during a read-block transaction when it holds a dirty copy (to guarantee coherence) or a private clean copy (to improve efficiency). When this signal is raised by a remote cache, the cache-to-cache read-block transaction is used, which is usually faster because of the lower cache latency compared with memory latency.
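As a minimal sketch of the general DI rule just described, the following Python fragment decides whether a read-block transaction is served by a remote cache or by main memory. The state labels and function names are hypothetical, since each protocol names its states differently, and the sketch does not cover protocol-specific uses of the DI line.

```python
# Hypothetical labels for the state of a slave cache's copy of the requested block.
DIRTY = "dirty"
PRIVATE_CLEAN = "private_clean"
SHARED_CLEAN = "shared_clean"
INVALID = "invalid"


def raises_di(copy_state: str) -> bool:
    """A slave raises DI for a dirty copy (coherence) or a private clean copy (speed)."""
    return copy_state in (DIRTY, PRIVATE_CLEAN)


def read_block_supplier(slave_states) -> str:
    """Return "cache" if any slave raises DI, otherwise "memory"."""
    return "cache" if any(raises_di(s) for s in slave_states) else "memory"
```

For instance, `read_block_supplier([INVALID, DIRTY])` selects the cache-to-cache transfer, while `read_block_supplier([SHARED_CLEAN, INVALID])` falls back to main memory.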
We assumed that the MESI protocol uses the DI line to abort a transaction, and the RWITM line to signal the "Read With Intent To Modify" read-block transaction.
Dragon, Competitive Snooping, Update-Once, AMSD, and PSCR use the SH signal during a bus transaction to notify that a cache holds a copy of the pertinent block, which therefore has to be considered shared. As we have seen above, in the PSCR protocol, this function is accomplished by the v P line, which is also used to signal the data transfer of dirty P-blocks fetched by a remote processor.
The migratory detection mechanism of AMSD needs to signal that the copy is migratory during the bus transaction used to respond to read misses, write misses, and invalidation requests. An additional line (M) is necessary for signaling the migratory condition.
Finally, Competitive Snooping needs two signals (EII, Enable-to-Invalidate-In, and EIO, Enable-to-Invalidate-Out), used to implement the distributed arbitration scheme that elects the cache in which the counter of write transactions is decremented [39].
As we can see in Table 8, the complexity of PSCR is comparable to that of the other coherence protocols examined in the present paper. In particular, PSCR has one more logical state than MESI, Dragon, and Berkeley, and fewer states than AMSD, Competitive Snooping, and Update-Once. It uses the same number of bus transactions as all the other examined protocols. Finally, it needs the same number of additional bus signals as AMSD, one more than Dragon, Berkeley, MESI, and Update-Once, and one fewer than Competitive Snooping.
Additional Hardware and Software Support for PSCR
As introduced in Section 3.2, our approach is both hardware- and software-based. In the simplest implementation, the extra hardware consists of an extra bit (P-bit) for each page descriptor in the TLB and a signal on the processor-cache bus. As for the software, we assume that the compiler can organize data in such a way that the kernel or the run-time support can mark the pages containing private data. This is normally accomplished by the compiler and the kernel in order to manage the virtual memory. Thus, the only additional support required is the extra wire between processor and cache. In particular, no additional complexity is required to manage dynamically allocated memory, since the allocation function usually needs to specify whether the data should be private or shared. In the case of multithreaded programming environments, in which all data are placed in a shared space, an additional effort from the compiler could be helpful. First, data allocated on the private stack are easily detectable. The situation changes for private data allocated in the shared space, or when shared data exhibit long write-runs.
In this case, as observed in Section 5.6, marking those data as private would improve the global performance. This can be done by means of a compiler tool that detects which variables might profitably be treated as private [77]. The technique can be profiling based, data-flow based [3], or based on static analysis, as successfully used by Skeppstedt and Stenström [55], and Mowry [45].
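The hardware support described at the beginning of this section, a P-bit in each TLB page descriptor that is forwarded to the cache along with the translated address, can be sketched in Python as follows. The class names, the `map_page` and `translate` methods, and the 4 KB page size are hypothetical and serve only to illustrate the idea. A real TLB would also handle misses and replacement.

```python
from dataclasses import dataclass
from typing import Tuple

PAGE_SIZE = 4096  # hypothetical page size, chosen only for the example


@dataclass(frozen=True)
class PageDescriptor:
    frame: int    # physical frame number
    p_bit: bool   # True if the kernel marked the page as holding private data


class Tlb:
    """Toy TLB that returns the P-bit together with the physical address."""

    def __init__(self) -> None:
        self._entries = {}

    def map_page(self, vpn: int, frame: int, private: bool) -> None:
        """Install a descriptor, as the kernel would when setting up a page."""
        self._entries[vpn] = PageDescriptor(frame, private)

    def translate(self, vaddr: int) -> Tuple[int, bool]:
        """Return (physical address, P-bit); the P-bit models the extra wire."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        desc = self._entries[vpn]
        return desc.frame * PAGE_SIZE + offset, desc.p_bit
```

In this sketch the cache would receive the second element of the returned pair on the extra processor-cache signal and could treat the block as private or shared accordingly.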
Low Level Optimizations of PSCR
In our presentation of PSCR, we assumed that some low-level optimizations could be integrated with the basic hardware. In particular, in the case of write transactions operating on an SD copy, memory updating can be avoided (Fig. 3). A bus signal can be used to disable memory during this operation, for example the previously described DI signal. This optimization can be useful in two cases: to reduce the number of transactions involving the shared memory, or in implementations that do not employ a memory write buffer.
We also notice that the L 1 and L 2 signals are used in distinct temporal intervals: L 1 specifies the type of memory block, whereas slave units answer by means of L 2 . Thus, L 1 and L 2 can be implemented by using a single multiplexed line.
CONCLUSIONS
We have analyzed the behavior of the PSCR protocol as a function of cache organization, memory latency, bus width, and scheduling policy. The proposed protocol has been compared against six solutions based on completely different handling policies concerning shared copies: exclusive write-update (Dragon), exclusive write-invalidate (Berkeley and MESI), or dynamically switching between the two (Update-Once, Competitive Snooping) or using an adaptive detection of migratory copies (AMSD). It has been shown that the proposed protocol represents a good solution for a shared-memory shared-bus multiprocessor in all the cases under consideration.
The improvement is particularly useful in two cases: 1) when the read cost is higher than the write cost, as is the case with the high memory latency of current systems; 2) when the block size is increased.
Multiprocessors represent a significant percentage of recent architectural solutions for workstations. These machines are rarely used to speed up a single parallel application; rather, they are employed as servers to achieve a higher throughput by running multiple processes simultaneously. These machines typically use a Unix-like multitasking operating system. Based on these observations, we used three distinct workload models to evaluate the performance of the proposed protocol: the first workload only includes Unix commands and single-process applications, whereas the others also include parallel applications with different sharing patterns (coarse/medium grain).
TABLE 8 Complexity of Different Coherence Protocols
In the target system of our analysis, load balancing is obtained by allowing process migration, which distributes the computation workload among the available processing units. This migration generates passive shared copies, which induce a significant amount of coherence overhead. We showed that our protocol eliminates this overhead by operating directly during the read-block transaction that follows a miss, without any significant effect on the hardware complexity. On the other hand, WI protocols do not succeed in eliminating this overhead. Once passive sharing is eliminated, WU protocols may be a viable strategy compared to WI ones. The selective invalidation mechanism, adequately combined with its basic write-update mechanism, allows PSCR to gain the benefits of updating in bus-based architectures. We are not aware of other approaches that explicitly eliminate the overhead due to private data accesses.
The proposed solution can also be successfully employed in systems including a cache-affinity scheduler which, as many authors have stated, does not always succeed in eliminating process migration and its effects in all workload conditions. PSCR performance can be further enhanced by using compilation techniques that recognize long write-runs on shared data.
ACKNOWLEDGMENTS
The work described in this paper has been carried out with the financial support of the Italian Ministero dell'Università e della Ricerca Scientifica e Tecnologica (MURST), in the framework of the MOSAICO (Design Methodologies and Tools of High Performance Systems for Distributed Applications) Project. We thank Professor Dan Siewiorek of Carnegie Mellon University for the initial material used to validate the simulation environment. We are particularly grateful to Gianpaolo Prina and Luigi Ricciardi for their contribution to the multiprocessor simulation design, the useful discussions about the strategy to limit the passive sharing problem, and the preliminary evaluation of the coherence protocol. Pierfrancesco Foglia contributed significantly to the validation of the performance evaluation methodology. Thanks to Steve Herrod at Stanford University for providing and helping with TangoLite. Our discussions with Professor Ali Hurson, Professor Veljko Milutinovic, and Professor Per Stenström helped improve this article considerably. Finally, we thank the anonymous referees, whose valuable work helped improve the quality of this paper.
