Cache coherence schemes that dynamically adapt to memory referencing patterns have been proposed to improve coherence enforcement in shared-memory multiprocessors. By using only runtime information, however, these existing schemes are incapable of looking ahead in the memory referencing stream. We present a combined hardware-software strategy that uses the predictive capability of the compiler to select updating or invalidating for each write reference. To determine the potential performance improvement that can be achieved with this optimization, three different levels of compiler capabilities are examined. Simulations using memory traces show that with an ideal compiler, this optimization can potentially reduce the miss ratio by 0.4 to 15 percent compared to an invalidating-only scheme, while reducing the generated network traffic by 13 to 94 percent compared to an updating-only scheme. In addition, this optimization can potentially reduce the miss ratio by up to 13 percent, while reducing the generated network traffic by up to 92 percent, compared to a dynamic adaptive scheme. Furthermore, performance can be potentially improved even with a compiler capable of performing only imprecise array subscript analysis and no interprocedural analysis.
Introduction
In a shared memory multiprocessor, private data caches have been shown to be quite effective in reducing the average delay required to access shared memory. However, since the caches themselves are not shared, multiple copies of the same memory location can be resident in several different caches simultaneously, which introduces the well-known cache coherence problem [20]. The simplest solution to the cache coherence problem is to make all shared writable locations non-cacheable and thereby bypass the cache for all references to these locations. Although this approach eliminates the problem of accessing stale data by caching only private and read-only blocks, it produces high average memory delays. Two other solutions to the cache coherence problem allow shared-writable data to be cached and use either updating or invalidating to prevent access to stale data.
With an invalidating scheme, coherence is maintained by forcing all cached copies of a shared block to be invalidated whenever a processor modifies a location within that block. This invalidating prevents access to stale data by forcing a miss the next time the block is referenced by any other processor. With an updating scheme, on the other hand, coherence is ensured by distributing the new value of the shared location to all caches with a copy of the block whenever a write to that block occurs. Several studies evaluating the performance of updating and invalidating schemes [2, 13, 15, 20, 26] have shown that, in general, the scheme that produces the best performance is dependent on the programs' sharing patterns. In some cases, updating outperforms invalidating, while in other cases the reverse is true [13, 17]. Thus, neither one of these strategies alone can be considered the best coherence enforcement scheme. To combine the best aspects of both approaches, hybrid schemes attempt to adapt to the current memory referencing pattern by choosing to update or invalidate the other copies of shared data depending on how they are currently being used [4].
In general, adaptive schemes such as EDWP [4], Competitive [17], and the scheme by Wilson et al. [31] use a variety of dynamic heuristics for determining when to switch between updating and invalidating. These schemes typically rely on the broadcasting capability of a single shared bus, making it difficult to extend them to large-scale multiprocessors. In recent years, a number of coherence schemes that allow the coexistence of updating and invalidating have been proposed and implemented for large-scale multiprocessors. The scheme proposed by Andrews et al. [3] provides the supporting hardware for updating and invalidating within the interconnection network. In addition, it recognizes the potential for the compiler to select updating instead of invalidating. The scheme proposed by Goshe and Simhadri [14] uses only run-time information to choose between updating and invalidating. In distributed shared memory multiprocessor systems, Willow [7], Munin [6], and DASH [19] provide the hardware to support updating and invalidating, and they acknowledge the capability of the compiler to select either one of these strategies. Furthermore, the scheme implemented on the Galactica Net [31] uses a combined software and hardware approach for selecting updating or invalidating at run-time.
To expand on these existing adaptive coherence schemes, this paper examines the potential of using compile-time optimization in conjunction with a coherence directory to select updating or invalidating for each write reference only when necessary. This optimization attempts to reduce the traffic between processors and memory modules by delaying coherence enforcement as much as possible. In addition, this optimization determines for each write reference whether updating or invalidating will produce the lowest network traffic, and it marks each reference accordingly. Passing this static marking information to the coherence hardware is accomplished by distinguishing three different types of write operations at compile-time. The required hardware for supporting this compiler optimization is no more complicated than a standard directory scheme [9].
The performance of this compile-time optimization will vary depending on the precision of the data-flow analysis the compiler is capable of performing. Branches, procedure calls, and complex array subscripts force any compile-time analysis mechanism to be conservative. These conservative decisions may reduce the performance benefit of using this type of analysis. Instead of implementing this mechanism in an actual compiler, we use a trace-based simulation methodology to determine the potential performance improvement that can be achieved with this optimization. Hence, this study examines whether or not it is worthwhile to implement this scheme in an actual compiler. Nonetheless, a related algorithm that identifies specific read operations [25] has been implemented in the Parafrase-2 [28] parallelizing compiler. Consequently, we are confident that the algorithm described in this paper could be implemented in an actual compiler.
The compile-time optimization is presented in Section 2 along with a discussion of the hardware and software implementation implications. The simulation methodology used to evaluate this scheme is presented in Section 3. The simulation results are presented in Section 4, and Section 5 summarizes our conclusions.
Compiler-Assisted Adaptive Enforcement
This section describes the compiler support required to choose between invalidating and updating based on the estimated costs associated with each strategy. To statically choose updating or invalidating for each write operation, the compiler can determine the cost of each strategy to minimize the miss ratio, the total communication traffic, the average memory latency (which itself is a function of the miss ratio and communication traffic), or to optimize some other relevant performance metric. In this study, the estimated cost of each coherence strategy is based on minimizing the communication traffic.
Switching from one strategy to the other on the basis of the estimated communication cost is similar to the idea behind the Competitive scheme [17]. In this new scheme, however, the selection is made statically at compile-time rather than dynamically at run-time. This compile-time selection of updating or invalidating eliminates the need for a counter or some other additional hardware mechanism as is required in the Competitive [17], the EDWP [4], and the Galactica Net [31] schemes. Furthermore, this compile-time optimization does not require a sophisticated network [3, 14] to maintain cache coherence. It requires only a coherence directory to keep track of the processors with a valid copy of each block. There are several variations of directories [5, 9, 10, 16], any one of which can be used with this compiler optimization. This study uses a directory structure similar to Censier and Feautrier's [9] in which each memory module is associated with its own directory.
As a result of this compiler optimization, three different types of write instructions, wrt-up, wrt-inv, and wrt-only, are distinguished. A wrt-inv instruction instructs the supporting hardware to use an invalidating strategy to enforce cache coherence. A wrt-up instruction, however, uses an updating coherence enforcement strategy. A wrt-only instruction instructs the underlying hardware to ignore cache coherence since the compile-time analysis guarantees that the value newly written by a wrt-only operation will not be used by any processor other than the writer. The following subsections describe the supporting hardware and the three separate passes of the compile-time analysis required to generate these three different types of write instructions.
Hardware Requirements
This section describes the coherence directory that uses the information provided by the marking algorithm. Processors are assumed to be connected to the globally shared memory modules via a multistage interconnection network. A positive acknowledgement mechanism is used to ensure consistency and to ensure reception of block updates when an updating scheme is used. Table 1 summarizes the types of transactions between the processors and the memory modules.
The state of a block in a cache is encoded by using two bits, valid and dirty. The state of a block in memory is encoded by using a single stale bit and p valid bits, where p is the number of processors in the system. The valid bit in the cache indicates whether the memory reference will cause a cache hit or a cache miss. The dirty bit in the cache indicates whether or not the processor has modified the block. A set stale bit in the memory indicates that the memory copy of the block is stale. As shown in Table 2, a cached block can be in any of the following three states:
Read-Only: the block has not been modified and it may be shared by two or more processors.
Modified: the block has been modified by the processor and more than one modified copy of the block may exist.
Invalid: an up-to-date copy of the block must be fetched from memory.
Table 3 shows the corresponding states for a block in the globally shared memory. Depending on 
Compiler Marking Algorithm
Given the coherence directory described in the previous section, the compiler marking algorithm consists of three steps: 1) classifying blocks into those that need coherence enforcement, and those that do not; 2) determining which specific write references can never cause coherence problems; 3) selecting the coherence enforcement strategy that produces the lowest network traffic for each of the remaining writes.
Classifying blocks
Using symbolic data dependence analysis, a restructuring compiler can classify the blocks referenced by a program as private, shared read-only, or shared-writable. References to private blocks can never cause coherence problems since these blocks are referenced by only a single processor. Because shared read-only blocks are never written, they also can never become incoherent. However, references to shared-writable blocks can cause coherence problems. These blocks are referenced by two or more processors, at least one of which writes the block. The following steps of the marking algorithm will exploit the fact that it is not necessary to enforce coherence for all memory accesses by considering only references to shared-writable blocks.
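The three-way classification can be illustrated on a memory trace, which is how the simulations in this study approximate the compiler analysis. The following is only a minimal sketch: the trace format `(processor, op, block)` and the function name `classify_blocks` are assumptions made for illustration, not part of the scheme itself.

```python
from collections import defaultdict

def classify_blocks(trace):
    """Classify each block as private, shared-read-only, or shared-writable.

    trace: iterable of (processor_id, op, block) tuples, op in {"R", "W"}.
    This tuple format is a hypothetical stand-in for a real trace record.
    """
    accessors = defaultdict(set)  # block -> set of processors that touch it
    written = set()               # blocks written at least once
    for proc, op, block in trace:
        accessors[block].add(proc)
        if op == "W":
            written.add(block)

    classes = {}
    for block, procs in accessors.items():
        if len(procs) == 1:
            classes[block] = "private"           # one processor only
        elif block in written:
            classes[block] = "shared-writable"   # shared and written
        else:
            classes[block] = "shared-read-only"  # shared, never written
    return classes
```

Only blocks reported as `shared-writable` need to be considered by the later marking passes.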
Marking coherence-write references
Depending on how a shared-writable block is accessed (i.e. read or written), and the sequence of consecutive references by different processors to the block, coherence enforcement may or may not be necessary. For example, since consecutive writes to the same word in a block by a single processor repeatedly modify the word's contents, coherence enforcement can be delayed until the first read of that word by another processor. The purpose of this compiler pass is to identify all of the write references that actually require coherence enforcement. These references are then marked coh-wrt. All remaining writes are marked as wrt-only to indicate that coherence enforcement is not necessary for that write. Each coh-wrt reference will be marked as wrt-inv or wrt-up in the third pass of the compile-time analysis.
The algorithm used to identify the coh-wrt references assumes a weakly-ordered consistency model [12] in which a program is divided into parallel and serial sections. The serial sections are always executed by $P_0$, and the parallel sections are executed by two or more processors. For any two consecutive sections of the program, S1 and S2, that are not both serial sections, writes are marked as coh-wrt if they satisfy one of the following two conditions.
1. If a write in S1 is followed by a read in S2, then the write in S1 is marked as coh-wrt; or
2. If the write in S1 is followed by a write and a read in S2, then the write in S2 is marked as coh-wrt.
Since serial sections are always executed by processor $P_0$, coherence enforcement is not necessary in serial to serial transitions. In a parallel to serial, serial to parallel, or parallel to parallel transition, however, the newly written value in section S1 may or may not be used by some other processor.
To ensure correct program execution, the write in S1 must enforce coherence when condition (1) is true. Furthermore, due to the consistency model, a write followed by a read in the same section of code are both guaranteed to be executed by the same processor. However, if there is a possibility of having multiple modified copies of the block in different caches (i.e. when condition (2) is true), the write preceding the read must enforce coherence. This coherence enforcement ensures that in the event of a block replacement following the write but before the read, only one processor (i.e. the one that last modified the block contents) updates the memory.
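One possible reading of the two marking conditions can be sketched as follows. The section representation, the function name `mark_coherence_writes`, and two interpretive choices are assumptions made for this illustration: only the last write to a word in S1 is considered, and condition (2) is taken to apply when S2 writes the word before first reading it. Writes that are not returned would be marked wrt-only.

```python
def mark_coherence_writes(sections):
    """Return the set of (section_index, ref_index) pairs marked coh-wrt.

    sections: list of (kind, refs); kind is "serial" or "parallel",
    refs is an ordered list of (op, word) with op in {"R", "W"}.
    """
    marked = set()
    for i in range(len(sections) - 1):
        kind1, refs1 = sections[i]
        kind2, refs2 = sections[i + 1]
        if kind1 == "serial" and kind2 == "serial":
            continue  # both sections run on P0: no enforcement needed

        # Assumption: only the last write to each word in S1 matters.
        last_write_s1 = {}
        for k, (op, word) in enumerate(refs1):
            if op == "W":
                last_write_s1[word] = k

        for word, k1 in last_write_s1.items():
            last_write_s2 = None
            for k2, (op, w) in enumerate(refs2):
                if w != word:
                    continue
                if op == "W":
                    last_write_s2 = k2
                    continue
                # First read of the word in S2.
                if last_write_s2 is None:
                    marked.add((i, k1))                 # condition (1)
                else:
                    marked.add((i + 1, last_write_s2))  # condition (2)
                break
    return marked
```

For example, a serial section that writes `x` twice followed by a parallel section that reads `x` marks only the second write, which matches the idea of delaying enforcement until the value can actually be observed.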
Selecting the coherence enforcement strategy
After the completion of the second pass, all writes that need to enforce coherence have been marked coh-wrt. In this third and final pass, the coh-wrt references are marked as either wrt-up or wrt-inv. The best choice for each coh-wrt to minimize the total network traffic depends on how many processors have a cached copy of the block before it is written, and how many subsequently read the block. For example, if $n_b$ is the number of processors that cache a copy of the block between the $(j-1)$st coh-wrt and the $j$th coh-wrt, and one of these processors performs the $j$th coh-wrt, $(n_b - 1)$ invalidation messages must be sent when using invalidation. Similarly, sending the new value to all of the processors with a copy of the block requires $(n_b - 1)$ messages when updating is used. With the coherence directory described in Section 2.1, the network cost of sending the invalidation messages will be slightly lower than the cost of sending the update messages since, by not including the new data value, invalidation messages are slightly smaller than update messages. However, depending on what happens after the write, updating may actually produce less total network traffic.
In particular, if invalidating was used for the $j$th coh-wrt, and if $n_a$ is the number of processors that read the block between the $j$th and the $(j+1)$st coh-wrt, then $(n_a - 1)$ of these processors must generate a miss request to fetch the new value, assuming that the writing processor is one of the $n_a$ processors that read the block. If updating was used, however, only those processors that did not already have a copy of the block before the $j$th coh-wrt will generate miss requests. The remaining processors already have the updated value and so do not generate any network traffic. Thus, updating may produce less total traffic than invalidating for a given coh-wrt by reducing the number of miss requests. To determine which type of coherence enforcement strategy to use for each coh-wrt, the compiler must estimate the actual cost, in terms of network traffic, produced by selecting either wrt-inv or wrt-up.
There are three different types of transactions that contribute to the cost of selecting wrt-inv for a given coh-wrt. First, the writing processor must send a message to the directory requesting exclusive access to the block. Since this request is eventually acknowledged, the network cost of this transaction is $2A$ bytes, where $A$, and the following parameters, were defined in Table 1. The directory must send invalidation messages to each of the processors with a copy of the block, except the writing processor itself, requiring $(n_b - 1)(2A)$ bytes. Finally, servicing the misses produced by reading the block after the write generates $[(n_a - 1)(2A + Db) + (4A + 2Db)]$ bytes, assuming that the writing processor is not one of the $n_a$ processors that read the block. Thus, the total network traffic generated by selecting wrt-inv for the $j$th coh-wrt is
$$T_j(\text{wrt-inv}) = 2A + (n_b - 1)(2A) + \left[(n_a - 1)(2A + Db) + (4A + 2Db)\right].$$
If updating is selected for this coh-wrt instead of invalidating, $(A + D)$ bytes are required to write-through the new value from the processor to the directory, and $(n_b - 1)(2A + D)$ bytes are generated from sending the updated value to all of the remaining $(n_b - 1)$ processors with a valid copy of the block. If the estimated number of processors that read the block after the write is the same or greater than the estimated number of processors with a copy of the block before the write, that is, if $n_a \ge n_b$, we assume that all $n_b$ processors must reread the block. Since these processors have already been updated with the new value, they do not generate any miss service requests. The additional $(n_a - n_b)$ processors that read the block after the write will produce $(n_a - n_b)$ misses, with each miss generating $(2A + Db)$ bytes of network traffic. If $n_a < n_b$, however, we find that the processors that read the block after the write tend to be a different set of processors than those that had copies of the block before the write. Consequently, we assume that each of these $n_a$ processors will produce a miss request. The total cost of selecting wrt-up for the $j$th coh-wrt then can be estimated as
$$T_j(\text{wrt-up}) = \begin{cases} (A + D) + (n_b - 1)(2A + D) + (n_a - n_b)(2A + Db), & n_a \ge n_b, \\ (A + D) + (n_b - 1)(2A + D) + n_a(2A + Db), & n_a < n_b. \end{cases}$$
Before the compiler can calculate any of these network costs, it must estimate the degree of sharing before and after each write; that is, it must estimate the values of $n_b$ and $n_a$. One simple model to estimate sharing divides the variables referenced by the program into scalars and arrays.
With the Doall parallelism model assumed in this paper, each parallel task typically accesses a unique subset of the elements of each shared array. Thus, during a parallel section of the program, the degree of sharing for an array element is predicted to be 1. Shared scalars that are referenced during parallel sections of the program typically are referenced by every processor so that each processor will have a copy of the variable. The degree of sharing for scalar variables referenced in parallel sections thus is estimated to be $p$, where $p$ is the total number of processors. Sequential program sections are always executed by a single processor, $P_0$, so that the degree of sharing for all variables referenced during a sequential section is estimated to be 1. Variables not referenced between coh-wrt's have a degree of sharing of 0. The trace-driven simulations described below do not distinguish between scalars and arrays. Consequently, we estimate $n_b$ and $n_a$ in the simulations by counting the number of reads made to the block between coh-wrt's. The maximum value of $n_a$ and $n_b$ then is set to be $p$ if at least one read is made during a parallel section, and 1 if all the reads occur in sequential sections.
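The scalar/array sharing model described above reduces to a small decision function. The sketch below is illustrative only; the function name and its parameters are hypothetical, and it covers the compiler-side model rather than the read-counting estimate used in the trace-driven simulations.

```python
def estimate_degree_of_sharing(section_kind, is_scalar, referenced, p):
    """Estimate how many processors hold a copy of a variable.

    section_kind: "serial" or "parallel"
    is_scalar:    True for a shared scalar, False for an array element
    referenced:   False if the variable is not referenced between coh-wrt's
    p:            total number of processors
    """
    if not referenced:
        return 0   # not referenced between coh-wrt's
    if section_kind == "serial":
        return 1   # sequential sections run on a single processor
    # Parallel section: scalars are assumed to be cached by every processor,
    # while each array element is assumed to be touched by one task.
    return p if is_scalar else 1
```

The value returned for the references before a coh-wrt gives the estimate of n_b, and the value for the references after it gives the estimate of n_a.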
After estimating the values of $n_a$ and $n_b$, the compiler determines the type of the write operation for each coh-wrt as follows. First, if either $n_a$ or $n_b$ is predicted to be zero, the write reference will be executed using a wrt-inv instruction to avoid unnecessary updates. Otherwise, the compiler calculates the cost of choosing wrt-up or wrt-inv using the above equations. If the total cost of updating is less than the cost of invalidating, the coh-wrt will be executed using a wrt-up instruction to enforce coherence using updating. If the opposite is true, however, the coh-wrt will be executed using a wrt-inv instruction. The penalty for choosing the wrong type of instruction for a coh-wrt is simply that more network traffic will be generated than is necessary. Correct program execution is assured in either case. All other writes to shared-writable blocks, and all writes to private blocks, will be executed using the wrt-only instruction. This instruction generates no network traffic since it modifies only the local cache.
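The selection step can be sketched in a few lines. The cost expressions below are transcribed from the prose description of the three wrt-inv terms and the three wrt-up terms; the function names are hypothetical. The defaults A = 8 and D = 4 bytes follow the packet sizes given in the simulation methodology. Treating Db as D times a block size of b words, and resolving a tie in favor of wrt-inv, are both assumptions.

```python
def inv_cost(n_b, n_a, A=8, D=4, b=1):
    """Estimated bytes of traffic if wrt-inv is chosen."""
    request = 2 * A                      # exclusive-access request, acknowledged
    invalidations = (n_b - 1) * (2 * A)  # invalidate every other cached copy
    misses = (n_a - 1) * (2 * A + D * b) + (4 * A + 2 * D * b)
    return request + invalidations + misses

def up_cost(n_b, n_a, A=8, D=4, b=1):
    """Estimated bytes of traffic if wrt-up is chosen."""
    write_through = A + D                # new value sent to the directory
    updates = (n_b - 1) * (2 * A + D)    # new value sent to other copies
    # Readers beyond the prior sharers miss; if n_a < n_b all readers miss.
    miss_count = (n_a - n_b) if n_a >= n_b else n_a
    return write_through + updates + miss_count * (2 * A + D * b)

def choose_write(n_b, n_a, **params):
    """Pick the instruction for one coh-wrt."""
    if n_b == 0 or n_a == 0:
        return "wrt-inv"                 # avoid unnecessary updates
    if up_cost(n_b, n_a, **params) < inv_cost(n_b, n_a, **params):
        return "wrt-up"
    return "wrt-inv"                     # assumption: ties go to wrt-inv
```

With one-word blocks, four sharers before the write and four readers after it give an estimated 72 bytes for updating against 164 bytes for invalidating, so wrt-up is chosen. With 32 prior sharers and a single later reader, the estimates are 652 bytes and 552 bytes, so wrt-inv is chosen.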
Cache Block Size
A cache block size greater than a single word has the advantage of allowing the cache to exploit spatial locality, but the disadvantage of false sharing. To accommodate block sizes larger than a single word in the marking algorithm, the compiler needs to know which variables share the same cache block. If this information is known, the following additional condition is required in the second pass of the marking algorithm:
For any two consecutive sections of code, S1 and S2, if the write in S1 and the write in S2 are to the same block (but possibly to different words in the block), mark the write in S1 as coh-wrt.
With a cache block size greater than a single word, multiple caches may have a valid copy of the block such that a different word of that block is modified in each of the caches. This additional condition is necessary to ensure correctness.
To take full advantage of the compile-time optimization, an alternative solution which does not require any additional condition in the marking algorithm is to associate one valid bit and one dirty bit with each word within the block. Then with a wrt-up or wrt-inv, coherence can be enforced on each word of the block [11].
Procedures and library functions
Subroutines can seriously disrupt the ability of compiler-assisted coherence schemes to perform accurate memory disambiguation, causing the coherence scheme to overinvalidate or overupdate cache entries [1, 21, 23]. Similarly, procedure calls can restrict the ability of a compiler to accurately perform the optimization presented in the previous section. For marking write references as coh-wrt, the compiler analyzes each procedure separately as if it were a complete program. The only additional requirement is that the compiler must ensure that the last write to a memory block before a procedure call or return is performed with a write instruction that needs coherence enforcement (i.e. coh-wrt). Then if there is any read reference within the procedure, coherence is already enforced and the block copies are consistent.
This method of optimizing each procedure separately is also useful when using precompiled library routines. The optimization can be applied to each routine when initially compiled, and the benefits of the optimization will be exploited by any program that calls the routine. Thus, most of the performance advantage of the optimization can be obtained without having access to the library source code. While it appears that this independent compilation technique could mark an unnecessarily large number of writes as coh-wrt, the simulation results shown in Section 4 indicate that procedure calls are not a major problem when using this one-procedure-at-a-time compilation strategy.
Simulation Methodology
A trace-driven simulation methodology is used to study the performance of this compiler optimization in a shared-memory multiprocessor with 32 processors. The Alliant compiler [27] is used to automatically generate parallel assembly code from Fortran source code. An emulator is used to simulate the parallel execution of the program and to produce a trace of the memory addresses generated by each of the 32 processors. These traces are completely interleaved into a single trace such that during the execution of a parallel section of the program, an address generated by processor i is followed by an address generated by processor (i+1) modulo 32. During the execution of sequential phases of the program, processor $P_0$ generates all of the memory references. In an actual system, timing differences between the processors due to cache misses, network and memory contention, and synchronization delays may produce a different ordering of the references, but this interleaving represents a valid ordering of memory accesses.
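The round-robin interleaving of the per-processor traces within a parallel section can be sketched as below. The function name is hypothetical, and the handling of traces with unequal lengths (a processor that has run out of references is simply skipped) is an assumption, since the text does not say how that case is treated.

```python
from itertools import zip_longest

_DONE = object()  # sentinel marking an exhausted per-processor trace

def interleave_traces(per_processor_traces):
    """Merge traces so processor i's reference is followed by processor i+1's.

    per_processor_traces: list of lists, index = processor number.
    """
    merged = []
    # Each round takes one reference from every processor, in order.
    for one_round in zip_longest(*per_processor_traces, fillvalue=_DONE):
        merged.extend(ref for ref in one_round if ref is not _DONE)
    return merged
```

For a sequential section the list holds a single trace for the one executing processor, so the merged result is that trace unchanged.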
To simulate the effect of the compiler-assisted scheme, the memory trace is examined to classify each block as private, read-only, or shared-writable. Another two passes through the trace use the marking algorithm described in Section 2 to mark the write references to shared-writable blocks as wrt-up, wrt-inv, or wrt-only. The marked trace then drives the cache simulator to determine the miss ratio, the cache-memory network traffic, and other relevant parameters.
To focus only on the performance of the compiler optimizations, a fully associative data cache with a random replacement policy is used in each of the processors. All instruction references are ignored since they can never cause any coherence problems. Since we are interested in the effect of coherence enforcement only on data references, synchronization variables are not considered in these simulations. To study the effect of block replacements, we have varied the cache size from 1K to 2K to 4K words, and finally to an infinite cache. Furthermore, 1-word and 4-word cache block sizes are used to investigate the impact of false sharing.
The processors are connected to the memory modules via a packet-switched multistage interconnection network, such as an Omega network [18], using $\log_2 p$ stages of 2-by-2 switches. Network traffic from a processor to the memory, such as a miss service request or write-back data, uses the forward network, while traffic from the memory to a processor, such as an invalidation command or fetched data, uses the separate reverse network. Both the forward and reverse networks use 32-bit data paths. Each packet between the memory modules and the processors requires a minimum of two words (A = 8 bytes): one word for the source and destination module numbers plus a code for the operation type, and another word for the actual memory block address. An additional word (D = 4 bytes) is used for fetching and writing data or for sending data updates. The total network traffic produced by each type of memory reference was shown in Table 1.
The performance of the compiler-assisted adaptive scheme is compared with the performance of an invalidating-only, an updating-only, and a dynamic adaptive scheme. The last three coherence schemes use only run-time information to detect coherence. In addition, the dynamic adaptive scheme (which is described below) uses run-time information regarding the number of consecutive writes to a block of memory to select either updating or invalidating coherence enforcement. To keep track of the number of consecutive writes to a block, a counter is associated with each block in the memory.
All of these protocols use the state assignments and the transactions described in Section 2.1 with some modifications. With the invalidating-only scheme, a Read-Only block signifies that the block may be shared and all of the cached copies must be invalidated before a write operation to this block is allowed. The state of the block changes to Modified as a result of this write operation. The Modified state guarantees that the processor has the only valid copy of the block. A block in the Modified state can be read and written by the same processor. A block in the Invalid state causes a cache miss. A cache miss may require a write-back if the block is in the Modified state in another processor's cache. Also, invalidation of all cached copies may be required if exclusive access to the block is requested.
The updating-only scheme is based on a write-through policy in which every write operation distributes the new value of the shared memory location to all caches with a valid copy, as well as to the memory. Therefore, with this protocol, a block can never be in the Modified state due to the distribution of the new value on every write.
For the dynamic adaptive scheme, coherence enforcement is similar to the updating-only protocol as long as the number of consecutive writes to the same block with no intervening cache misses (read or write) is less than a predefined threshold value. When the number of consecutive writes to a block reaches this threshold value, all cached copies of the block, except for the writing processor's, are invalidated. The counter associated with that memory block then is reset and updating is again selected for enforcing coherence. This dynamic adaptive scheme is our directory-based variation of the bus-based EDWP [4] and Competitive [17] schemes.
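The per-block counter behavior of the dynamic adaptive scheme can be modeled as follows. This is a simplified sketch rather than the simulator used in the study: the class and method names are hypothetical, `on_write` is assumed to model a write hit, and resetting the counter on any miss is one interpretation of "consecutive writes with no intervening cache misses".

```python
class AdaptiveDirectoryEntry:
    """Directory state for one memory block under the dynamic scheme."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counter = 0      # consecutive writes since the last miss
        self.sharers = set()  # processors holding a valid copy

    def on_miss(self, proc):
        """A read or write miss fetches the block and breaks the write run."""
        self.counter = 0      # assumption: any miss resets the count
        self.sharers.add(proc)

    def on_write(self, proc):
        """Handle a write hit; return the coherence action taken."""
        self.sharers.add(proc)
        self.counter += 1
        if self.counter >= self.threshold:
            self.sharers = {proc}  # invalidate every copy except the writer's
            self.counter = 0       # reset; updating is selected again
            return "invalidate"
        return "update"
```

With a threshold of 3, three uninterrupted writes by one processor produce the actions update, update, invalidate, after which only the writer remains in the sharer set.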
Test programs
The programs used in this performance evaluation, shown in Table 4, have a total of more than 55 million memory references and have significant differences in block sharing characteristics. The arc3d, flo52, and trfd programs are taken from the Perfect benchmark suite [8]. Arc3d analyzes a three-dimensional fluid flow and flo52 analyzes the transonic flow past an airfoil. The trfd program is a quantum mechanical simulation of a two-electron integral transformation that uses a series of matrix multiplications. Simple24 is a hydrodynamics and heat flow problem using a 24-by-24 grid. The pic program is an electromagnetics application that models the movement of charged particles using a particle-in-cell approach, and lin125 is the Linpack benchmark with a 125-by-125 element matrix. Shallow is a weather prediction program [29]. Values designated by 0 are less than 0.05 percent.
Simulation Results

Ideal Compiler Performance
The primary goal of using the compiler optimization is to adapt to the current memory referencing pattern by examining the past and near-future references and thereby reduce the number of unnecessary block updates and invalidates. This section examines how well the compiler optimization performs with an ideal compiler which is capable of performing perfect memory disambiguation and perfect interprocedural analysis.
Compiler Marking Statistics
The first two columns in Table 5 show the total number of write references for the test programs, and the percentage of total references that are writes. The last three columns show the results of applying the compiler marking algorithm presented in Section 2 with perfect memory disambiguation and perfect interprocedural analysis. The "coherence writes to shared-writable blocks" column is the percentage of the total number of write references that require other cached copies to be either updated or invalidated. "Write-only" references to shared-writable blocks are accesses that are marked as wrt-only during the compiler marking and require no coherence enforcement. The private writes column shows the percentage of the total number of write references that are made to private data. The private writes are also marked as wrt-only during compiler marking.
As shown in Table 5, only a small fraction of the total references to shared-writable blocks are write references and, except for shallow, less than 20 percent of these references require updating or invalidating. Table 6 shows the fraction of memory references which access the directory with respect to the updating and the invalidating coherence enforcement strategies. As shown in the first five columns of this table, the compiler-assisted scheme accesses the directory less frequently than the dynamic adaptive scheme and the invalidating-only scheme. This behavior is due to the large fraction of references that can be marked as wrt-only by the compiler. The wrt-only references access the directory only if there is a cache miss. Similarly, columns 7-11 of this table show that with the compiler-assisted scheme fewer of the memory references access the directory than with the updating-only and with the dynamic adaptive scheme due to the deferral of updating by the wrt-only references.
Note that with the dynamic scheme, as the threshold value increases, more of the memory references access the directory to update the memory and/or other caches than to invalidate other caches, since greater threshold values increase the total number of references that use updating to enforce coherence. Furthermore, the total number of directory accesses (i.e., the number of directory accesses when invalidating is selected plus the number when updating is selected) with the compiler-assisted scheme is greater than that with the invalidating-only scheme for most of the programs. This difference is due to the additional directory accesses for performing the write-through when the updating policy is chosen. While this appears to increase the communication overhead, the simulation results show that, in general, it is not as disruptive as might be expected.
Effect of Coherence Strategies on the Miss Ratio
The miss ratios produced by invalidating-only, updating-only, the compiler-assisted scheme, and the dynamic adaptive scheme are compared in Figure 3 as the cache size varies. When using updating-only, the miss ratio decreases as the cache size increases for most of the programs, since the larger cache size reduces the number of block replacements [22]. For trfd and lin125, however, the miss ratio remains constant as the cache size varies. For trfd, this behavior is due to the small number of unique blocks referenced throughout this program's execution. For lin125, on the other hand, this behavior is due to the high percentage of reads to shared-writable blocks.
The miss ratio when using the invalidating-only scheme, shown in Figure 3, behaves in the same manner as with updating-only as the cache size increases. However, a comparison of the miss ratios produced by invalidating and updating indicates that, for most of the programs, invalidating generates a significantly higher miss ratio than updating. Updating produces a lower miss ratio since, after the initial reference to a shared-writable block, the block remains in the cache until it is replaced due to finite cache size effects. With invalidating, however, a block remains in the cache only until it is replaced due to the finite cache size or invalidated by a coherence action. Thus, invalidating produces more misses than updating due to the invalidation of blocks that are re-referenced by any of the processors that had a valid copy of the block before invalidation. These additional misses are eliminated with updating by sending block updates on every write to all processors with a valid copy of the shared-writable block.
With an infinite cache, the miss ratios of updating and invalidating are usually lower than those with a finite cache size. With the dynamic adaptive scheme, however, the miss ratio may actually be higher with an infinite cache than with a finite cache due to a more frequent selection of invalidating. In this dynamic scheme, the counter associated with each block in memory is incremented by consecutive write references with no intervening cache misses. Invalidating is then selected when the counter reaches the threshold value. With an infinite cache size there is a lower probability of cache misses than with finite cache sizes due to the absence of block replacements. Hence, the counter is more likely to reach the threshold value with an infinite cache than with a finite cache, so invalidating is selected more frequently.
Note that the miss ratios of invalidating-only and the dynamic adaptive scheme with a threshold value of one are the same with an infinite cache size. This equality is due to the low threshold value, which causes invalidation of all other caches on alternate write operations.
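The counter mechanism of the dynamic adaptive scheme described above can be illustrated with a minimal sketch for a single block. The function name `dynamic_policy`, the event encoding, the check-before-increment order, and the reset of the counter after an invalidation are all assumptions made for illustration; they are chosen so that a threshold of one invalidates on alternate writes, as stated in the text, and they are not taken from the simulated hardware.

```python
def dynamic_policy(events, threshold):
    """Return the coherence action chosen for each write to one block.

    events    -- sequence of "write" or "miss" strings (hypothetical encoding)
    threshold -- number of consecutive writes, with no intervening miss,
                 after which invalidating is selected
    """
    count = 0          # per-block counter kept with the block in memory
    decisions = []
    for event in events:
        if event == "miss":
            # An intervening cache miss breaks the run of consecutive writes.
            count = 0
        elif event == "write":
            if count >= threshold:
                decisions.append("invalidate")
                count = 0  # assumption: the counter restarts after invalidating
            else:
                decisions.append("update")
                count += 1
    return decisions


# Threshold of one: updating and invalidating alternate.
print(dynamic_policy(["write"] * 4, threshold=1))
# A miss between writes delays the switch to invalidating.
print(dynamic_policy(["write", "miss", "write", "write", "write"], threshold=2))
```

Under these assumptions, fewer misses (as with an infinite cache) leave longer uninterrupted runs of writes, so the counter reaches the threshold more often and invalidating is selected more frequently.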
For the ideal compiler-assisted scheme, Figure 3 indicates that varying the cache size has an effect on the miss ratio similar to that observed with the invalidating-only and updating-only schemes. However, with the compiler-assisted scheme, the miss ratio is 0.4 to 15 percent lower than with invalidating-only, and 0.1 to 13 percent lower than with the dynamic adaptive schemes. This improvement is partly due to the compiler marking of wrt-only references, which reduces the number of invalidation misses (i.e., cache misses that would not have occurred if updating had been used instead of invalidating). In addition, the compiler-assisted scheme allows the processors to take better advantage of temporal locality than the other two schemes by enforcing coherence with invalidating less frequently than the invalidating-only and the dynamic adaptive schemes do. The miss ratio of the compiler-assisted scheme is 0.01 to 0.55 percent higher than that of the updating-only scheme. The miss behavior of the compiler-assisted scheme closely approximates that of updating since the compiler frequently selects updating (see Table 5). Apparently, the programs' referencing patterns are such that the estimated cost of updating is frequently less than the cost of invalidating. The small difference between the miss ratios of the updating-only and the compiler-assisted schemes is due to the occasional selection of invalidating by the compiler.
Effect of Coherence Strategies on Network Traffic
In addition to the miss ratio, the performance of the coherence schemes is compared with respect to the total network traffic. For arc3d and flo52, the total network traffic with updating increases as the cache size increases, as shown in Figure 4. This increase is due to the sending of block updates on every write to a shared location. For lin125 and trfd, the total network traffic is constant as the cache size increases. For trfd, the constant network traffic is due to the small number of unique blocks referenced throughout this program's execution. For lin125, this behavior is due to the constant miss ratio as well as the high fraction of read references to shared-writable blocks. The total network traffic when using invalidating-only for simple24, arc3d, pic, shallow, and flo52 decreases as the cache size increases since, with a cache size greater than 1K, it becomes more likely for the caches to hold copies of needed blocks. The invalidating network traffic for trfd and lin125, on the other hand, is independent of the cache size, again due to the small number of unique blocks referenced throughout the program execution and the high fraction of read references to shared-writable blocks.
For flo52, pic, arc3d, and simple24, the network traffic for the dynamic adaptive scheme is similar to that of the invalidating scheme as the cache size varies. For trfd and lin125, despite the high miss ratio with an infinite cache, the total network traffic is lower than with a finite cache. This behavior is due to the frequent selection of invalidating and the elimination of unnecessary block updates.
There are two main reasons why the total network traffic of the compiler-assisted scheme behaves in the same manner as that of the invalidating-only scheme. First, with the compiler optimization, up to 80 percent of the write references to shared-writable blocks are marked as wrt-only, and thus do not generate any invalidating or updating traffic. Second, this scheme can switch from updating to invalidating when the communication cost of updating is higher than that of invalidating. As a result, the compiler-assisted scheme produces network traffic 13 to 94 percent below that of the updating-only scheme and 0 to 92 percent below that of the dynamic adaptive scheme. In addition, for most of the applications, the compiler-assisted scheme generates less traffic than any of the other schemes, even the invalidating-only scheme.
Overall, a comparison of all schemes with respect to total network traffic shows that the invalidating-only and the dynamic adaptive schemes generate less traffic than updating-only. However, the compiler-assisted scheme generates lower network traffic than updating, invalidating, or the dynamic adaptive scheme. This behavior is due to its ability to mark write references as wrt-only, and its ability to switch from updating to invalidating when the cost of updating is higher than that of invalidating. One exceptional program, trfd, produces network traffic 2 times higher with the compiler-assisted scheme than with invalidating-only, but 5 times lower than with updating, and about the same network traffic as the dynamic adaptive scheme.
Effect of Procedures and Imprecise Memory Disambiguation
The previous section showed that with an ideal compiler, the compiler-assisted adaptive scheme produces the low miss ratios of an updating-only scheme while reducing the total network traffic to below that produced by the updating-only, the invalidating-only, or the dynamic adaptive scheme. This section examines the performance of the marking optimization with the different compiler technologies summarized in Table 7. Except for the ideal compiler, all of these compilers have imprecise memory disambiguation in that they can resolve memory addresses only to the name of the array instead of to the address of each individual element within the array. This imprecise disambiguation is simulated with the address traces by mapping all references to the elements of a single array into a single name. Furthermore, the realistic inv and realistic up compilers can track all memory references across subroutine boundaries, while the simple inv and simple up compilers are limited to performing the marking on one procedure at a time, as described in Section 2.4. The simple and realistic compilers may be forced to default to either updating or invalidating independent of the sharing pattern due to imprecise memory disambiguation or the lack of interprocedural analysis. To study the effect of each default strategy on performance, we have examined the simple and realistic compilers with updating as the default in one case and invalidating as the default in the other whenever the compiler has to be conservative.
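The trace transformation used to simulate imprecise disambiguation, in which every element reference collapses to its array name, can be sketched as follows. The array table format, the helper name `make_name_mapper`, and the choice to leave addresses outside any array unchanged are hypothetical; the text does not describe how the mapping was implemented.

```python
import bisect


def make_name_mapper(arrays):
    """Build a function that maps a word address to its array name.

    arrays -- list of (name, base_address, length_in_words) tuples describing
              non-overlapping arrays (a hypothetical symbol table)
    """
    table = sorted((base, base + length, name) for name, base, length in arrays)
    bases = [entry[0] for entry in table]

    def to_name(address):
        # Find the last array whose base is <= address.
        i = bisect.bisect_right(bases, address) - 1
        if i >= 0 and address < table[i][1]:
            return table[i][2]   # every element of the array shares one name
        return address           # assumption: other references keep their address

    return to_name


to_name = make_name_mapper([("A", 100, 10), ("B", 200, 5)])
trace = [100, 109, 110, 204]
print([to_name(a) for a in trace])
```

After this mapping, two writes to different elements of the same array are indistinguishable to the marking algorithm, which is what forces the non-ideal compilers to be conservative.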
The miss ratios produced by different levels of compiler capabilities and the two default coherence enforcement strategies, updating and invalidating, are compared in Figure 3. For arc3d, flo52, pic, trfd, simple24, and shallow, when we choose to default to invalidating, the realistic inv compiler produces more misses than the ideal compiler. This behavior is due to the compiler producing more wrt-inv than wrt-up references [22]. On the other hand, when we choose updating instead of invalidating for the default protocol, Figure 3 shows no significant difference between the ideal compiler's and the realistic up compiler's miss ratios. While we expect more misses with the realistic compiler than with the ideal compiler, for lin125 the reverse is true. That is, the ideal compiler produces more misses than the realistic compiler. With the realistic compiler, the imprecise memory disambiguation changes the estimated values of the n_a and n_b parameters, and thus a coh-wrt that would have been a wrt-inv with the ideal compiler may be a wrt-up with the non-ideal compiler, and vice versa. This variation has caused a more frequent selection of updating for the lin125 program with the realistic compiler than with the ideal compiler. Nevertheless, defaulting to updating again produces a lower miss ratio than defaulting to invalidating.
In addition to the miss ratio, Figure 4 shows that for simple24, shallow, flo52, arc3d, and pic, the realistic compilers produce higher network traffic than the ideal compiler. For lin125, the network traffic is equal to that generated by the ideal compiler. For trfd, however, the network traffic produced by the realistic inv compiler is lower than that with the ideal compiler. This behavior again is due to the marking of some of the wrt-up references in the ideal case as wrt-inv in the non-ideal case. Comparing the network traffic generated by the two realistic compilers indicates that for simple24, arc3d, lin125, pic, and shallow, defaulting to updating generates lower network traffic than defaulting to invalidating. For trfd and flo52, however, the network traffic generated by the realistic inv compiler is lower than that generated by the realistic up compiler.
These figures show that, in five out of seven cases, marking only on the array name and defaulting to updating is as effective as the ideal compiler in producing miss ratios that are lower than those of the invalidating-only scheme with network traffic lower than that of the updating-only scheme. For trfd, the network traffic, which is higher than that of the invalidating-only scheme but lower than that of the updating-only scheme, is compensated by a miss ratio that is near that of the updating-only scheme. For flo52, however, using the invalidating-only scheme seems to produce better performance than using this compiler-assisted scheme.
A comparison of the dynamic adaptive scheme and the compiler-assisted approach that defaults to updating indicates that in five out of seven programs, the compiler-assisted approach outperforms the dynamic approach. That is, both the miss ratio and the network traffic of these programs are lower than those generated by the dynamic adaptive scheme.
Figures 3 and 4 also show that for all of the test programs, the lack of interprocedural analysis, i.e., with the simple inv and simple up compilers, causes little performance degradation beyond that of the imprecise memory disambiguation. In examining the programs, it appears that the procedures typically perform some action that would require coherence enforcement (i.e., marking as coh-wrt) even if the compiler had perfect interprocedural analysis. As a result, performing the marking optimization separately for each procedure is almost as effective as optimizing the entire program as a single unit by using perfect interprocedural analysis. An actual compiler would probably generate a higher miss ratio and more network traffic than these simulation results indicate. Nevertheless, the performance of this optimized scheme will be no worse than that of the invalidating scheme with respect to the miss ratio and no worse than that of updating-only with respect to the network traffic.
Block Size Effect
This section examines the performance effect of a block size greater than a single word with the compiler marking. Table 8 shows that with a block size of 4 words and a cache size of 4K words, the ideal and realistic compilers are capable of marking up to 82 and up to 78 percent of writes as wrt-only, respectively. Figure 5 shows the effect of changing the block size on the performance of three representative programs. With a cache block size of 4 words, the performance of the other four programs is similar to that of these three applications. Since the compiler-assisted scheme performs better in 5 out of 7 programs when the default is updating, the realistic compiler used in this experiment defaults to updating. As shown in Figure 5, for arc3d the larger block is quite effective in reducing the miss ratio for all of the coherence schemes, but at the expense of higher network traffic. The network traffic is increased due to the additional words that are fetched on a cache miss. For flo52, the miss ratios of all coherence schemes except the invalidating-only scheme also benefit from the spatial locality of the larger block. With the invalidating-only scheme, the benefit of spatial locality is entirely neutralized by the additional cache misses that are generated due to false sharing. For lin125, the invalidating-only scheme, the dynamic adaptive scheme with a threshold value of 1, and the compiler-assisted scheme with the realistic compiler suffer from the false sharing effect as well. From these figures, we can see that the performance of the compiler-assisted scheme is affected in the same way as that of the other coherence schemes as the block size is increased.
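The false sharing effect mentioned above can be reproduced with a small sketch that counts misses under an invalidating-only policy. It assumes infinite caches, two processors, and a made-up trace in which one processor writes word 0 while the other reads word 1; the function name `count_misses` and the trace format are hypothetical and do not correspond to the simulator used for the reported results.

```python
def count_misses(trace, block_size, num_procs=2):
    """Count cache misses for an invalidating-only scheme with infinite caches.

    trace      -- list of (processor, op, word_address), op is "r" or "w"
    block_size -- number of words per cache block
    """
    caches = [set() for _ in range(num_procs)]  # blocks held by each processor
    misses = 0
    for proc, op, address in trace:
        block = address // block_size
        if block not in caches[proc]:
            misses += 1
            caches[proc].add(block)
        if op == "w":
            # A write invalidates the block in every other cache.
            for other in range(num_procs):
                if other != proc:
                    caches[other].discard(block)
    return misses


# Processor 0 writes word 0 and processor 1 reads word 1, three times over.
trace = [(0, "w", 0), (1, "r", 1)] * 3
print(count_misses(trace, block_size=1))  # words fall in separate blocks
print(count_misses(trace, block_size=4))  # both words share one block
```

With one-word blocks only the two initial misses occur, while with four-word blocks every read by processor 1 misses because the preceding write invalidated the shared block, even though the two processors never touch the same word.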
Relationship to Software Prefetching
Software prefetching attempts to insert prefetch instructions well ahead of the actual memory reference to reduce the number of cache misses [24, 30]. While the goal of the algorithm described in Section 2.2 is to use updating or invalidating to reduce the coherence overhead, the simulation results have shown that this scheme is also capable of reducing the miss ratio. The miss ratio is reduced primarily by selecting updating rather than invalidating for some of the write references. While prefetching mechanisms reduce the number of cache misses at the expense of higher network traffic, this scheme is capable of reducing both the miss ratio and the network traffic.
Conclusion
This paper demonstrates that there is a trade-off between updating and invalidating cache coherence enforcement schemes. Updating tends to produce fewer cache misses than invalidating, but at the expense of higher network traffic. Invalidating, on the other hand, reduces communication overhead at the cost of higher miss ratios than updating. Therefore, neither of these schemes can be considered an optimal strategy for maintaining coherence. This work confirms previous studies [4, 13] that have shown that combining updating and invalidating is an effective approach for maintaining consistency.
While other adaptive coherence enforcement schemes [17, 4, 31] select either updating or invalidating dynamically using only run-time information, the compiler-assisted approach presented in this paper selects updating or invalidating statically at compile-time. By estimating the communication cost of updating and invalidating for each write reference and then choosing the less expensive scheme, this compiler optimization takes advantage of both past and near-future information about referencing patterns. This optimization also eliminates the additional hardware required by run-time strategies for choosing between updating and invalidating.
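The per-write selection summarized above can be sketched as a small decision function. The cost estimates are taken as inputs because the cost model itself is defined in Section 2; the function name `mark_write`, its parameters, the tie-breaking in favor of updating, and the use of `None` to represent a write the compiler cannot analyze are assumptions made for illustration only.

```python
def mark_write(needs_coherence, cost_update=None, cost_invalidate=None,
               default="wrt-up"):
    """Choose a marking for one write reference.

    needs_coherence -- False if no other cached copy must be updated or
                       invalidated (such writes are marked wrt-only)
    cost_update     -- estimated communication cost of updating, or None
    cost_invalidate -- estimated communication cost of invalidating, or None
    default         -- marking used when the compiler must be conservative
    """
    if not needs_coherence:
        return "wrt-only"
    if cost_update is None or cost_invalidate is None:
        # Imprecise analysis: fall back to the default protocol.
        return default
    # Assumption: ties are resolved in favor of updating.
    return "wrt-up" if cost_update <= cost_invalidate else "wrt-inv"


print(mark_write(False))                                  # private or write-only data
print(mark_write(True, cost_update=3, cost_invalidate=5))
print(mark_write(True, cost_update=8, cost_invalidate=5))
print(mark_write(True, default="wrt-inv"))                # conservative default
```

Because the marking is fixed at compile time, no run-time counter or threshold logic is needed to choose between the two policies.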
The performance of this compiler-assisted approach was compared to an invalidating-only scheme, to an updating-only scheme, and to a dynamic adaptive scheme in a large-scale multiprocessor environment using trace-driven simulations. These simulations show that up to 82 percent of the writes with an ideal compiler, and up to 78 percent of the writes with a non-ideal compiler, do not require any coherence enforcement. As a result, this compiler marking can reduce the coherence communication overhead. In addition, by appropriately choosing updating or invalidating for writes that do require coherence enforcement, the simulations show that this compiler-assisted adaptive approach is capable of producing the low miss ratios of an updating-only scheme while reducing the total network traffic to below that produced by either updating or invalidating alone. The simulation results also show that the miss ratio and the network traffic generated by this compiler-assisted approach are, in general, below those of a dynamic adaptive scheme. Since this compiler-assisted approach has access to high-level sharing information, it is able to adapt to the current referencing pattern of a program better than a dynamic adaptive scheme.
Although the lack of interprocedural analysis and imprecise memory disambiguation were expected to disrupt the performance of this compiler-assisted approach, the simulations show that these are not major problems. A compiler with imprecise memory disambiguation, no interprocedural analysis, and updating as the default protocol produces, in general, lower miss ratios than the invalidating-only and the dynamic adaptive schemes. Similarly, the non-ideal compiler generates lower network traffic than the updating-only and the dynamic adaptive schemes. Thus, we conclude that compile-time selection of an updating or invalidating coherence enforcement strategy could be an effective means of improving memory performance in a shared-memory multiprocessor.
