Efficiently maintaining cache coherence is a major problem in large-scale shared memory multiprocessors. Hardware directory coherence schemes have very high memory requirements, while software-directed schemes must rely on imprecise compile-time memory disambiguation. Recently proposed dynamically tagged directory schemes allocate pointers to blocks only as they are referenced, which significantly reduces their memory requirements, but they still allocate pointers to blocks that do not need them. We present two compiler optimizations that exploit the high-level sharing information available to the compiler to further reduce the size of a tagged directory by allocating pointers only when necessary. Trace-driven simulations are used to show that the performance of this combined hardware-software approach is comparable to other coherence schemes, but with significantly lower memory requirements. In addition, these simulations suggest that this approach is less sensitive to the quality of the memory disambiguation and interprocedural analysis performed by the compiler than software-only coherence schemes.
Introduction
Private data caches in shared memory multiprocessors can significantly reduce the average time to access global memory, but they also introduce a consistency problem since multiple copies of a memory location can be resident in several different caches simultaneously. Snoopy coherence schemes [15, 20, 31] can efficiently ensure that all processors use the most up-to-date copy of each shared variable, but the shared bus can become a serious performance bottleneck. Replacing the bus with a multistage interconnection network [16, 23, 32] reduces the bottleneck, but it compounds the need for the caches by increasing the delay, and it exacerbates the coherence problem by eliminating the snooping medium.
Hardware coherence schemes that dynamically determine which memory operations need coherence actions [3, 4, 7, 19] have access to memory addresses only as the program generates them. Since it is impossible for the hardware to predict how the blocks will be shared, these schemes must track the state and sharing characteristics of every memory block referenced by the program. The number of memory bits needed to store this information can be enormous. The compiler-directed schemes [10, 12, 24, 28, 35], on the other hand, have an advantage over the hardware schemes in this area since they may need only a few state bits per cache block, such as in the fast, selective invalidation scheme [10]. While the version control scheme [12] and the time-stamp scheme [28] require more local memory in each processor to hold the version or time-stamp numbers, this storage still can be significantly smaller than that required by the dynamic hardware-only schemes.
Another advantage of the static compile-time coherence detection schemes is that by making each processor responsible for maintaining coherence in its own cache using self-invalidation, no interprocessor communication is required. The hardware directory schemes, on the other hand, send many messages between the directories and the processors which increases the congestion in the interconnection network and thereby increases the average memory delay. Finally, the static schemes are able to exploit some of the high-level sharing information available to the compiler to optimize cache coherence enforcement, such as ignoring coherence on private and read-only memory locations.
The primary advantage of the hardware coherence schemes is their perfect memory disambiguation. By tracking the actual memory addresses, these schemes can invalidate only those blocks that are actually stale, assuming that they save enough state information. Several studies [1, 26, 29] have indicated that, due to its reduced network traffic, an ideal implementation of a compiler-based coherence scheme with perfect memory disambiguation can have performance comparable to, and in some cases better than, that of a full hardware directory. Of course, it is not possible to perfectly disambiguate all memory references at compile-time, in which case the compiler-based coherence schemes will overinvalidate cache blocks. The amount of this overinvalidation is dependent on the characteristics of the application program, and on the quality of the data dependence analysis performed by the compiler. In addition, unlike the compiler-based schemes, the hardware schemes make coherence enforcement completely transparent to procedure calls and subroutines. These differences between compiler-based and hardware directory coherence schemes are summarized in Table 1 .
In the conventional hardware directories [3, 4, 7, 19], pointer resources are statically associated with each block in the main memory, which ties the total number of pointers to the size of the memory. Recently proposed dynamically tagged directories [9, 17, 26, 30] take advantage of the observation that only blocks that are actually cached in one or more processors need to be allocated pointers. In these tagged directories, pointers are dynamically associated with memory blocks using an address tag field only as the blocks are moved from the memory to a cache. With these schemes, the number of pointers is proportional to the size of the data caches, which are significantly smaller than the main memory. These dynamic directories substantially reduce the number of memory bits needed to ensure coherence while maintaining the advantages of the traditional hardware schemes. Unfortunately, they do not exploit the high-level sharing information available to the compiler.
In this paper, we present two compiler optimizations that can be used in conjunction with a directory coherence scheme to further reduce the memory requirements of the directory itself without reducing performance. The first optimization, block partitioning, simply divides all memory blocks referenced by a program into two categories: those blocks that can never become incoherent, and those that need coherence enforcement. This optimization can be used with any hardware directory to reduce the directory size by eliminating the memory bits needed to store sharing information for those blocks that can never become incoherent. The second compiler optimization, delayed allocation marking, is a significant extension of this idea of combining hardware and software coherence enforcement. It uses the predictive power of the compiler to delay the allocation of a coherence pointer as long as possible. This optimization requires a dynamically tagged directory [9, 17, 26, 30], but because the allocation of pointers is delayed, the pointers are in use for a shorter period of time, and thus can be reused more frequently. This reuse allows the size of the directory to be reduced by a factor of approximately 50 to 100 while increasing the average memory delay by less than a few percent.
The two compiler optimizations are presented in Section 2 along with a discussion of the hardware and software implementation implications. A parallelism model in which the granularity of parallelism is the independent iterations of a Do loop is used as an example to explain the details of the compiler optimizations. The trace-driven simulation methodology described in Section 3 is used to analyze the performance of this combined hardware-software coherence scheme. Section 4 examines the ideal performance improvement possible with these optimizations, as well as the potential degradation due to imprecise memory disambiguation and poor interprocedural analysis. Section 5 compares the performance of this compiler-assisted directory with the traditional hardware directory schemes [3, 4, 7, 19] and with the software-directed version control scheme [12] . The results and conclusions are summarized in the last section.
Compiler Assistance for Coherence Directories
The two compiler optimizations presented in this section can be used to reduce the directory size needed to maintain coherence. The block partitioning optimization can be used with any hardware directory. With this optimization, the compiler determines which memory blocks never may become incoherent and assigns these blocks to a region of memory that does not maintain any sharing information. The delayed allocation marking optimization, on the other hand, is used in conjunction with a dynamically tagged directory. It extends the block partitioning optimization to mark individual memory references, instead of memory locations, as needing a coherence pointer allocated or not. This finer granularity of marking can take advantage of the potentially long periods when a block is only read by allocating pointers only when they are needed.
This compiler-assisted approach to coherence exploits the high-level information about future memory accesses available to the compiler, but not available to hardware directory schemes alone. In addition, unlike software-only coherence schemes, this approach can fall back on the perfect memory disambiguation of the directory when the compiler is unable to determine a block's memory referencing characteristics. As a result, this combined approach exploits some of the advantages of both hardware and software coherence schemes.
Block Partitioning
Block partitioning exploits the fact that it is necessary to maintain pointer information only for references to shared-writable blocks. These blocks are ones that are referenced by two or more processors, at least one of which writes the block. References to private blocks can never cause coherence problems since these blocks are referenced by only a single processor. Because shared read-only blocks are never written, they also can never become incoherent. Several studies [2, 5, 13, 25, 36] have shown that from 30 to more than 90 percent of all blocks referenced by a program may be private or read-only, and thus would not need to be allocated pointers.
Examples of read-only data blocks include all instruction blocks, preinitialized lookup tables, and constants. Temporary variables created within parallel tasks, such as those needed to maintain local loop counts and other intermediate results, are examples of private blocks. In addition, if the compiler preschedules the execution of specific loop iterations onto specific processors, it may be able to identify blocks that are accessed by only a single processor throughout the execution of the program and mark them as private. Compiler optimizations and transformations [18] may be able to further reduce the number of shared-writable blocks. If the access patterns for a block cannot be determined by the compiler because they are hidden by a procedure call or because of the complexity of the array subscript analysis, for instance, the compiler must conservatively assume that the block is shared-writable.
Using symbolic data dependence analysis, a restructuring compiler such as Cedar/KAP [14] can classify the blocks into one of these three types (i.e., shared-writable, private, or shared read-only) based on the sharing patterns it detects. All private and shared read-only blocks then are marked safe_blk to indicate that the directory can be completely bypassed when these blocks are referenced. No coherence checking will be performed and no pointers will be allocated for references to these blocks. Shared-writable blocks are marked coh_blk to ensure that the directory is accessed to maintain coherence, and that a pointer is allocated when the block is referenced. Table 2 shows how the processor interacts with its data cache and the memory when using this block partitioning compiler optimization.
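The classification rule above can be sketched in a few lines of Python. This is only an illustration, not the Cedar/KAP implementation: the BlockAccess summary, its field names, and the unknown flag are hypothetical stand-ins for whatever sharing information the dependence analysis actually produces.

```python
from dataclasses import dataclass, field


@dataclass
class BlockAccess:
    """Hypothetical compile-time summary of one memory block."""
    readers: set = field(default_factory=set)   # processors that may read it
    writers: set = field(default_factory=set)   # processors that may write it
    unknown: bool = False   # pattern hidden, e.g. by a procedure call


def classify_block(acc):
    """Return 'safe_blk' for private or shared read-only blocks, else 'coh_blk'."""
    if acc.unknown:
        return "coh_blk"        # conservative: assume shared-writable
    procs = acc.readers | acc.writers
    if len(procs) <= 1:
        return "safe_blk"       # private: a single processor references it
    if not acc.writers:
        return "safe_blk"       # shared read-only: never written
    return "coh_blk"            # shared-writable: needs the directory
```

A block read by processors 0 and 1 and written by processor 1 is classified coh_blk, while the same block with no writers is classified safe_blk.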
Delayed Allocation Marking
In the traditional hardware directory schemes [3, 4, 7, 19] , memory bits needed to point to processors with a cached copy of a block are statically bound to each block in memory when the machine is built so that the total number of pointer bits, and thus the size of the directory, is proportional to the size of the main memory. In the tagged directories [9, 17, 26, 30] , however, pointers are dynamically bound to the blocks as they are referenced using an address tag within the pointer. As a result, the size of the tagged directory is proportional to the data cache size. If there are more requests for pointers than there are available pointers in the directory pointer cache, one is freed by invalidating the cached copy of the block to which it points, or by overflowing a pointer to a secondary memory. These tagged directories can significantly reduce the number of memory bits dedicated to coherence, but they still may waste some pointer resources by allocating pointers to blocks when they are not needed.
The purpose of allocating a pointer when a block is referenced is to ensure that some future reference to the block by a different processor can find the current copy of the block, or can invalidate all cached copies. Delayed allocation marking exploits the predictive power of the compiler to allocate a pointer to a current reference only when there will be a future reference to the block that will need to perform some coherence action. As in block partitioning, delayed allocation marking does not allocate pointers to private and read-only blocks since they can never become incoherent, but unlike block partitioning, it also takes advantage of the opportunity to mark shared-writable blocks as not needing pointers during the periods when they are essentially read-only. For example, with block partitioning, a data structure that is written only a few times, but read many times, would be marked coh_blk and would have coherence information allocated throughout the program. With delayed allocation marking, however, pointers would be allocated only for writes, and for the reads on different processors immediately preceding these writes. All of the read operations would access the directory to force a writeback, if needed, but no pointers would be allocated during the periods when the block was read-only. To mark each individual memory reference as needing a pointer allocated or not, the compiler must be able to detect a specific sequence of events that cause a shared-writable block to become incoherent (stale).
To simplify the following discussion of how a compiler can perform this marking, a parallelism model is assumed in which independent iterations of a Do loop are the tasks to be executed concurrently. Some form of barrier synchronization is necessary at the parallel loop boundaries to ensure program correctness, and all reads and writes must be completed before the processors can proceed across the barrier. With minor adjustments, this marking optimization can be readily adapted to other parallelism models and synchronization schemes. With this model, a block B can become stale within processor P i's cache with the following sequence of events [11]:
E 0: Processor P i reads block B zero or more times.
E 1: Processor P i reads block B.
E 2: A processor reassignment occurs (synchronization point).
E 3: Processor P j (i≠j) reads B zero or more times, and writes it at least once.
E 4: A processor reassignment occurs (synchronization point).
Event E 0 or E 1 moves the initial copy of block B into processor P i 's data cache. Event E 2 is a parallel loop boundary that allows a different processor to write to block B in E 3 . The loop semantics guarantee that all reads and writes to block B in event E 3 will be performed by processor P j , and thus will be coherent. After the next parallel loop boundary at E 4 , processor P i could attempt to read its copy of block B, which is now stale. To prevent P i from accessing this stale copy, the write in E 3 must invalidate P i 's copy of block B before it is referenced again. This invalidation requires that a pointer must be allocated in the pointer cache to point to block B in processor P i during or before E 1 . With block partitioning, a pointer is allocated the first time P i references block B during E 0 or E 1 , but with delayed allocation marking, the allocation of the pointer is always delayed until E 1 . This delay in allocating the pointer reduces the amount of time that the pointer is active, potentially freeing it up for use by other processor-block combinations.
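As an illustration of the stale-copy sequence just described, the following Python sketch scans a dynamic reference trace and reports which cached copies can become stale. It is a trace-level check of the event definition, not the compiler algorithm. The trace format (a list of epochs separated by barriers, each a list of (processor, op, block) tuples) and the function name are assumptions made for this example.

```python
from collections import defaultdict


def stale_copies(epochs):
    """Return the (processor, block) pairs whose cached copy can become stale.

    epochs: list of epochs (the code between two barriers); each epoch is a
    list of (processor, op, block) tuples with op equal to "rd" or "wr".
    """
    holders = defaultdict(set)   # block -> processors that cached it earlier
    stale = set()
    for epoch in epochs:
        touched = defaultdict(set)
        for proc, op, block in epoch:
            touched[block].add(proc)
            if op == "wr":
                # A write by P_j after a barrier: every copy fetched earlier
                # by some P_i != P_j is stale after the next barrier.
                stale |= {(p_i, block) for p_i in holders[block] if p_i != proc}
        # Barrier: copies fetched in this epoch survive into later epochs.
        for block, procs in touched.items():
            holders[block] |= procs
    return stale
```

Each pair returned corresponds to a read that would need a pointer allocated; reads that never appear in the result could proceed without consuming a directory pointer.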
The processor-memory interaction for delayed allocation marking is shown in Table 3 . Read and write references to blocks marked safe_blk completely bypass the directory and never allocate pointers for the referenced blocks. Two different read operations are used for blocks marked coh_blk. During a block's read-only phase, all reads are performed with the rd_no_ptr operation. This operation accesses the directory to force a write-back if necessary, deallocates any pointer currently bound to the block, and does not allocate another one for the current reference. During other program phases, reads are performed with a rd_w_ptr operation that also accesses the directory to enforce coherence, and forces the allocation of a pointer to the processor making the reference to the block. Writes to shared-writable blocks access the directory and allocate a pointer so that future reads to the block can find the current copy.
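The effect of the memory operations described above on pointer usage can be sketched with a toy directory model. This is a simplified illustration built only from the prose description, not a transcription of Table 3: cache contents are not modeled, the class and method names other than the operation names are hypothetical, and directory messages are merely recorded in a list.

```python
class ToyDirectory:
    """Tracks pointer allocation only; cache contents are not modeled."""

    def __init__(self):
        self.pointers = {}    # block -> set of processors pointed to
        self.owner = {}       # block -> processor holding a modified copy
        self.messages = []    # (kind, processor, block) sent by the directory

    def _force_writeback(self, block):
        owner = self.owner.pop(block, None)
        if owner is not None:
            self.messages.append(("writeback", owner, block))

    def safe_access(self, proc, block):
        """safe_blk reads and writes bypass the directory entirely."""

    def rd_no_ptr(self, proc, block):
        self._force_writeback(block)
        self.pointers.pop(block, None)   # release any pointer bound to the block

    def rd_w_ptr(self, proc, block):
        self._force_writeback(block)
        self.pointers.setdefault(block, set()).add(proc)

    def coh_write(self, proc, block):
        # Invalidate every other copy the directory knows about.
        for other in sorted(self.pointers.get(block, set()) - {proc}):
            self.messages.append(("invalidate", other, block))
        self.pointers[block] = {proc}
        self.owner[block] = proc

    def pointers_in_use(self):
        return sum(len(procs) for procs in self.pointers.values())
```

In this model, a block that is written once and then read by many processors with rd_no_ptr holds no pointers during its read-only phase; a pointer is consumed again only when a rd_w_ptr or a write is issued.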
The algorithm for determining whether or not to allocate a pointer for each memory access is shown in Figure 1 , and a specific implementation of step 3 of this algorithm is presented in Section 2.4. The result of applying this algorithm to a parallel program is that all private and read-only blocks will be referenced without accessing the directory, and without having pointers allocated to them. In addition, for shared-writable blocks, only the read in event E 1 will be allocated a pointer. The other reads to the block will not be allocated pointers, and thus will not consume limited directory resources. 
Hardware Requirements
The two compiler optimizations need some simple hardware support to signal to the directory how the blocks are partitioned and how the individual memory references are marked. This section discusses the hardware required to support each compiler optimization.
Hardware support for block partitioning
When using block partitioning, the processors' data cache controllers and the directories must be able to distinguish blocks marked safe_blk from those marked coh_blk. As in the Cedar system [23], a high-order address bit can divide the address space into two portions to distinguish blocks that need a pointer from those that do not. Alternatively, different segments can be allocated for the two block types, or page table entries may be individually marked. For blocks marked safe_blk, no information needs to be stored in the directory. Thus, the size of the directory can be reduced so that coherence information is stored only for blocks marked coh_blk. This block partitioning can easily be used with standard processor instruction sets, such as the memory accessing instructions of an off-the-shelf microprocessor, since the directory determines whether or not to allocate coherence information based only on a block's address. Another approach is to design the processor with special memory referencing instructions to access the different types of blocks.
Hardware support for delayed allocation marking
The delayed allocation marking optimization requires a dynamically tagged directory since pointers are allocated to blocks only as they are needed, regardless of a block's memory address. There are several variations of these tagged directories, any one of which can be used with this compiler optimization. Four of these directories are described below to give an idea of their complexity.
The pointer cache tagged directory [26] allocates to each memory block exactly the number of pointers it needs from a cache of pointers in each memory module. Up to p entries, where p is the number of processors, can have the same address tag so that a fully associative search of the pointer cache can return from zero to p pointers to different processors for the same block. Access to these pointers is pipelined so that i cycles are required to fetch i pointers with the same address tag. With this pipelining, associative pointer matching can be overlapped with other required operations, such as the sending of invalidation messages. A fully associative memory is expensive to implement, but, due to the sharing characteristics of many programs, a set associative implementation still can offer good performance [26] .
If there are more requests for pointers than there are entries in the pointer cache, a free pointer entry is created by evicting an active pointer from the pointer cache. The replacement policy, such as random or least recently used, determines which pointer will be evicted, while the eviction strategy determines what is done with the evicted pointer. For the invalidate on overflow eviction approach, an invalidation message is sent to the processor pointed to by the evicted pointer entry. After the processor acknowledges the message, the pointer can be reused by another block/processor pair. Alternatively, the overflow to memory eviction approach writes the evicted pointer to a special buffer in main memory.
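The overflow handling described above can be sketched as follows, assuming a least recently used replacement policy and the invalidate on overflow eviction strategy. The class is a hypothetical software model; it ignores the associative hardware, message latencies, and acknowledgments, and simply records which processor-block pairs would receive an invalidation.

```python
from collections import OrderedDict


class PointerCache:
    """Toy pointer cache: LRU replacement, invalidate on overflow."""

    def __init__(self, entries):
        self.entries = entries          # number of pointer entries available
        self.lru = OrderedDict()        # (block, proc) -> True, oldest first
        self.invalidations = []         # evicted (block, proc) pairs

    def allocate(self, block, proc):
        key = (block, proc)
        if key in self.lru:
            self.lru.move_to_end(key)   # already allocated: refresh its age
            return
        if len(self.lru) >= self.entries:
            victim, _ = self.lru.popitem(last=False)   # least recently used
            self.invalidations.append(victim)          # invalidate that copy
        self.lru[key] = True

    def lookup(self, block):
        """Return the processors currently pointed to for this block."""
        return [p for (b, p) in self.lru if b == block]
```

With only two entries, allocating pointers for ("A", 0), ("A", 1), and then ("B", 0) evicts the pointer to processor 0's copy of A, so that copy would be invalidated before the entry is reused.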
The number of address tag bits in each pointer cache entry is $\log_2 m$, where $m$ is the number of blocks in each memory module, and the number of bits required to point to a particular processor is $\log_2 p$. Two extra bits are required for the pointer valid and the exclusive state bits, and each block in the data caches requires two state bits. No additional coherence bits are required for the blocks in the main memory modules since sharing information is maintained only when the blocks are actually cached. The total number of bits dedicated to coherence functions in the pointer cache scheme is
$$p\left[\,r(\log_2 m + \log_2 p + 2) + 2c\,\right],$$
where $r$ is the number of pointer entries in each pointer cache, and $c$ is the number of blocks in each data cache. The actual cost of implementing this scheme is higher than indicated by this bit count since there is additional logic needed for the associative search of the tags, and to automatically perform the pointer overflowing or invalidation.
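The bit count for the pointer cache scheme follows directly from the per-entry costs given above: each of the p memory modules holds r entries of log2 m + log2 p + 2 bits, and each of the p data caches holds c blocks with two state bits each. The short Python function below evaluates this total. The function name and the example values of m, r, and c are illustrative assumptions, not parameters taken from the simulations.

```python
def pointer_cache_bits(p, m, r, c):
    """Total coherence bits for the pointer cache scheme.

    p: processors (and memory modules); m: blocks per memory module;
    r: pointer entries per pointer cache; c: blocks per data cache.
    """
    tag_bits = (m - 1).bit_length()     # log2 m, rounded up
    proc_bits = (p - 1).bit_length()    # log2 p, rounded up
    entry_bits = tag_bits + proc_bits + 2   # tag + pointer + valid + exclusive
    return p * (r * entry_bits + 2 * c)


# Hypothetical configuration: 32 processors, 2**20 blocks per module,
# 256 pointer entries per module, 512 blocks per data cache.
print(pointer_cache_bits(32, 2**20, 256, 512))
```

Because the total grows with r and c rather than with m, shrinking the pointer cache reduces the coherence storage almost proportionally, which is the quantity the compiler optimizations target.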
The tag cache directory [30] is similar to the pointer cache in that the pointers are stored in each memory module in a special pointer cache. These two schemes differ primarily in the organization of the pointer cache. Each pointer in the tag cache scheme consists of an address tag plus $n$ processor pointers, as in the n-pointer scheme [3]. For blocks that are simultaneously shared by more than $n$ processors, and thus need more than $n$ pointers, there is a second-level tag cache that uses the $p+1$ bit pointers of the full distributed directory. If this second-level cache overflows, the scheme invalidates a pointer entry already in use and reuses it for another block. This tagged directory requires $p\left[\,r_1(\log_2 m + n\log_2 p + 2) + r_2(\log_2 m + p + 2) + 2c\,\right]$ total bits for coherence, where $r_1$ and $r_2$ are the number of entries in the first- and second-level tag caches, respectively. Again, this bit count ignores the cost of the associative matching logic.
The LimitLESS directory implemented for the Alewife machine [9] uses a structure similar to the tag cache for the first level of pointers, but it generates a processor interrupt when an entry needs more than n pointers. An interrupt service routine then emulates the function of a p+1-bit full directory to maintain the complete sharing information. This approach handles the common case directly in hardware, while using the flexibility of the software interrupt to handle the more infrequent overflow case. The cost of this scheme is similar to the other tagged pointer schemes.
The coarse vector tagged directory [17] dynamically allocates pointers consisting of $\log_2 m$ address tag bits, a valid bit, a dirty bit, $v$ pointer bits, and a mode bit. If only a few processors request a copy of a block, the mode bit is reset to indicate that the $v$ pointer bits are to be interpreted as $\lfloor v/\log_2 p \rfloor$ pointers to individual processors. When more than $\lfloor v/\log_2 p \rfloor$ processors try to share a copy of a block, the mode bit is set to group the processors into $p/g$ clusters, with $g$ processors in each cluster. The $i$th bit of the $v$ pointer bits is turned on whenever any processor from cluster $i$ requests a copy of the indicated block, and invalidation messages are sent to all processors within a cluster. To ensure that at least one full processor pointer can fit into the $v$ bits, it is necessary to have $v = \max(p/g, \log_2 p)$, thus the total number of coherence bits needed for this scheme is $r_{cv}\,p\left[\log_2 m + \max(p/g, \log_2 p) + 3\right] + 2cp$. As before, this bit count ignores the cost of the associative tag compare logic.
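The two interpretations of the pointer bits in the coarse vector directory, and its bit count, can be illustrated with the sketch below. The assignment of consecutive processor numbers to the same cluster is an assumption made for this example, since the mapping is not specified here, and the function names are hypothetical.

```python
def coarse_vector_targets(sharers, p, g):
    """Return (mode_bit, processors that receive invalidations).

    sharers: set of processors holding a copy; p: processors; g: cluster size.
    Assumes processor q belongs to cluster q // g.
    """
    ptr_bits = max(1, (p - 1).bit_length())   # log2 p bits per exact pointer
    v = max(p // g, ptr_bits)                 # pointer bits per entry
    max_exact = v // ptr_bits                 # floor(v / log2 p) exact pointers
    if len(sharers) <= max_exact:
        return False, sorted(sharers)         # mode bit reset: exact pointers
    clusters = {q // g for q in sharers}      # mode bit set: one bit per cluster
    return True, [q for q in range(p) if q // g in clusters]


def coarse_vector_bits(p, m, r_cv, c, g):
    """Total coherence bits: r_cv * p * (log2 m + v + 3) + 2 * c * p."""
    v = max(p // g, max(1, (p - 1).bit_length()))
    return r_cv * p * ((m - 1).bit_length() + v + 3) + 2 * c * p
```

With p = 32 and g = 4, a single sharer is tracked exactly, but two sharers in clusters 0 and 2 cause invalidations to be sent to all eight processors in those clusters, which illustrates the extra traffic the coarse mode trades for fewer bits.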
In addition to one of these dynamically tagged directories, the delayed allocation marking compiler optimization requires new memory referencing instructions. The safe_rd and safe_wrt instructions are used to reference private and read-only blocks (i.e., blocks marked safe_blk) without allocating pointers. The rd_no_ptr instruction is used for a shared-writable block's read-only phase, while the rd_w_ptr instruction is used to read a shared-writable block and simultaneously allocate a pointer. The coh_write instruction accesses the pointer cache and allocates a pointer for all writes to shared-writable blocks.
Instead of defining new memory referencing instructions, the marking of individual memory references can be done by sacrificing a single address bit to indicate whether a read operation is a rd_w_ptr, or a rd_no_ptr. Each shared-writable block is actually assigned to two different addresses, one with this special address bit turned on, and one with it off. The address decoding hardware ignores this address bit in accessing the actual memory modules, but the directory decodes it to determine whether or not to allocate a pointer to the block for this reference. This approach is not as elegant as defining new memory instructions, but it does allow the use of this compiler optimization with an off-the-shelf processor instruction set.
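The address-bit alternative can be sketched as a pair of encode and decode helpers. The choice of bit 31 as the marking bit is purely a hypothetical example; any address bit that the memory modules are wired to ignore would serve.

```python
ALLOC_BIT = 1 << 31   # hypothetical position of the sacrificed address bit


def encode_read(addr, allocate_pointer):
    """Form the address the compiler emits for a read of a shared-writable block."""
    if allocate_pointer:
        return addr | ALLOC_BIT     # acts as rd_w_ptr
    return addr & ~ALLOC_BIT        # acts as rd_no_ptr


def decode_read(addr):
    """Directory side: recover the real address and the allocation request."""
    return addr & ~ALLOC_BIT, bool(addr & ALLOC_BIT)
```

Both encoded forms of an address decode to the same memory location, so the memory modules see a single block while the directory sees the marking.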
Compiler implementation
The compiler algorithm for determining which memory references must be allocated pointers was shown in Figure 1 . The key to implementing this algorithm is efficiently detecting the sequence of events E 1 -E 4 for every unique shared-writable block B referenced by the program. This section describes a software state machine for detecting this sequence. It is intended to be applied during the last pass of the compiler as the final code is being generated.
A state vector in the compiler's symbol table, consisting of a single current state integer and two pointers to instruction nodes in the parse tree, is associated with each block referenced by the program that has been classified coh_blk by an earlier pass of the compiler. Recall that these blocks are shared-writable, and, in the worst case, the compiler will conservatively assume that all blocks fall into this category. The current state value is initialized to E start , and the instruction pointers are initialized to null.
Starting at the beginning of the program, the compiler sequentially examines each read and write instruction, and each parallel loop boundary. For the memory block being referenced by the current instruction, the compiler updates the block's state vector according to the state transition table shown in Table 4 . The state names correspond to the events that have been detected so far. Whenever the block is in state E 3 and the current instruction is a parallel loop boundary, the read instruction pointed to by the rd_ptr variable of the block's state vector is marked as being a rd_w_ptr instruction. As a result, a pointer will be allocated to the block for each processor that ultimately executes this instruction. After all instructions in the program have been examined, all read instructions that have not been marked rd_w_ptr can safely be marked rd_no_ptr. 
1. This table shows the next state and the action given the current state and the current instruction.
2. curr_instr is a pointer to the instruction node in the parse tree containing the current instruction.
3. last_rd is a pointer to the instruction node in the parse tree containing the last read instruction to this block.
4. rd_ptr is a pointer to the instruction node in the parse tree containing a read instruction to this block that may need to be marked rd_w_ptr.
Procedures and library functions
Subroutines and procedure calls can seriously disrupt the ability of the software-directed coherence schemes to perform accurate memory disambiguation causing them to overinvalidate cache entries. Similarly, procedure calls can restrict the ability of a compiler to accurately perform the block partitioning and delayed allocation marking optimizations. For the best performance, the compiler must be able to perform interprocedural analysis when looking for the sequence of events E 1 -E 4 . For some programming languages such as Fortran, it may be relatively easy to perform this interprocedural analysis and continue the marking optimizations across procedure boundaries, while in other languages such as C, it may be more difficult.
If the interprocedural analysis is too complex, it may be possible to eliminate the procedure calls by copying the entire procedure body in place of the call, similar to an assembler macro expansion. To perform this in-lining, the compiler must match formal and actual parameters, and must appropriately rename variables that are local to the procedure. Unfortunately, this approach may not work for procedures with complex call nesting, and it is not possible to in-line a recursive procedure. In addition, the expansion of the call can significantly increase the size of the generated code which can severely reduce the hit ratio in the instruction cache. It actually may be easier to perform procedure in-lining at the assembly language level, but the sharing information required by the compiler optimizations is most readily accessible at the source code level. In spite of these limitations, there are cases in which in-lining is the most effective means of handling the procedure calls.
When in-lining is inappropriate and interprocedural analysis is ineffective, the compiler optimizations may be applied to each procedure independently. In this case, the block partitioning algorithm must be conservative in deciding whether or not a block is shared-writable. For instance, a variable that is only read within the procedure, but is passed as a parameter, must be conservatively marked shared-writable since it may be written outside of the procedure. On the other hand, if the procedure is called from within a parallel loop, all local variables allocated on its private stack can be marked as private.
For the delayed allocation marking optimization, the compiler analyzes each procedure separately as if it were a complete program. The only additional requirement is that the compiler must ensure that the last read of a memory block before a procedure call or return is performed with a read instruction that allocates a pointer to the block (i.e., rd_w_ptr). Then if there is any coherence action required due to memory references within the procedure, the pointer is already allocated. If the pointer is not needed, its allocation does no harm other than wasting some pointer resources. In the state transition table shown in Table 4, procedures can be handled simply by adding an additional column for the case when the current instruction is a call or a return. When one of these instructions is encountered, all read instructions pointed to by all of the blocks' rd_ptr variables are marked as rd_w_ptr to ensure that a pointer is allocated before the call or return is executed. This action is performed regardless of the value of a block's current state variable. The state vectors of all of the blocks are reinitialized so that the marking optimization starts fresh for the next instruction after the call. After optimizing the main program, each procedure is separately optimized.
This method of optimizing each procedure separately is also useful when using precompiled library routines. The optimizations can be applied to each routine when it is initially compiled, and the benefits of the optimizations will be exploited by any program that calls the routine. Thus, most of the performance advantage of the optimizations can be obtained without having access to the library source code. While it appears that this independent compilation technique could allocate an unnecessarily large number of pointers, the simulation results shown in Section 4.2 indicate that procedure calls are not a major problem when using this one-procedure-at-a-time compilation strategy.
Simulation Methodology
This section describes the trace-driven simulation methodology and the machine model used to compare the performance and the directory size requirements of the block partitioning and the delayed allocation marking compiler optimizations with no compiler marking, and to examine the effects of imprecise memory disambiguation and interprocedural analysis. In addition, the memory overhead requirements and performance of the pointer cache are compared with several other directory schemes and with the version control software-directed scheme. In all of these simulations, it is assumed that no context switching or task migration takes place.
Machine Model
These simulations use a shared memory multiprocessor with p=32 processors connected to an equal number of memory modules through a multistage interconnection network. The pointer cache coherence scheme [26] is used as an example of a tagged directory, although the compiler optimizations apply equally well to the other tagged directories [9, 17, 30]. The Alliant compiler is used to automatically generate parallel assembly code from Fortran source code, which then is converted into equivalent assembly code for a multiprocessor emulator. This emulator simulates the parallel execution of the program and generates an interleaved memory reference trace for all of the processors. Parallel loop iterations are statically scheduled across the p processors so that processor 0 executes loop iterations 1,1+p,1+2p,..., processor 1 executes iterations 2,2+p,2+2p,..., and so on. Synchronization variables are never cached with the data. Since instructions are only read, they never become incoherent; thus all instruction references are marked safe_blk and all instructions are assumed to be cached in a separate infinite instruction cache. The long execution time required to perform the simulations limited the size of the application programs' data sets. To maintain a realistic relationship between the size of the data set and the size of the data cache, a data cache of c=8 kbytes is used in each of the 32 processors.
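The static loop scheduling described above can be illustrated with a short sketch. The function name `iterations_for` and the 100-iteration loop are hypothetical and chosen only for the example.

```python
def iterations_for(proc, p, n_iters):
    """Return the 1-based loop iterations executed by processor `proc`.

    Processors are numbered from 0, so processor 0 gets iterations
    1, 1+p, 1+2p, ... and processor 1 gets 2, 2+p, 2+2p, ...
    """
    return list(range(proc + 1, n_iters + 1, p))


p = 32
first = iterations_for(0, p, 100)   # iterations for processor 0
second = iterations_for(1, p, 100)  # iterations for processor 1
```

With p=32 and a 100-iteration loop, processor 0 executes iterations 1, 33, 65, and 97, and processor 1 executes iterations 2, 34, 66, and 98.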
The processor-memory interconnection network consists of $\log_j p$ switch stages, where $j$ is the number of inputs and outputs in each switch. Since 2-by-2 switches are used in this system, $j=2$. Network delays are modeled [22] as $T_{net} = \log_2 p \,(1 + f_{contention})$ cycles, where $f_{contention}$ is a contention factor that depends on $\lambda$ and $m$, $\lambda$ is the probability that a packet arrives at each input to the network during each cycle, and $m$ is the number of packets per message. This model assumes that each switch introduces one cycle of delay when the network is unloaded. As more messages are generated, $\lambda$ and $f_{contention}$ increase to model the additional queuing delays in the switches.
A command or message between a processor and a pointer cache in one of the memory modules requires one cycle to set up the message plus $T_{net}$ cycles to propagate through the interconnection network. The main memory requires $T_{mem}=4$ cycles to access the first word of a cache block, plus $(b-1)$ additional cycles to access the remaining $(b-1)$ words. A read miss of a block that is not exclusive in another processor takes a total of $T_{miss} = (1+T_{net}) + [T_{mem} + (b-1) + T_{net}]$ cycles to send the miss service request message and to receive the $b$ words in the cache. If the requested block is exclusive in another cache, it first must be written back to memory, adding $T_{w\_back} = (1+T_{net}) + [T_{net} + 1 + (b-1)]$ cycles to $T_{miss}$. A read hit requires one cycle. A write hit to a block in the exclusive state can be serviced directly by the cache in one cycle. A write miss to a block that is exclusive in another processor requires that processor to write the block back to memory, taking a total of $T_{miss} + T_{w\_back}$ cycles.
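As a worked instance of these delay formulas, the following sketch evaluates them for an unloaded network. The function names are hypothetical, and taking the contention factor as zero (so that the network delay is 5 cycles for p=32) is an assumption made only for the example.

```python
import math

T_MEM = 4  # cycles to access the first word of a block


def t_miss(t_net, b):
    """Read miss to a block that is not exclusive elsewhere."""
    return (1 + t_net) + (T_MEM + (b - 1) + t_net)


def t_w_back(t_net, b):
    """Extra cycles to write back a block held exclusively by another cache."""
    return (1 + t_net) + (t_net + 1 + (b - 1))


def read_miss_delay(t_net, b, exclusive_elsewhere):
    """Total read-miss delay, including a writeback when one is needed."""
    delay = t_miss(t_net, b)
    if exclusive_elsewhere:
        delay += t_w_back(t_net, b)
    return delay


# Unloaded network with p=32 processors and 2-by-2 switches:
# log2(32) = 5 stages, one cycle per stage, contention factor taken as 0.
t_net_unloaded = int(math.log2(32))
plain_miss = read_miss_delay(t_net_unloaded, b=1, exclusive_elsewhere=False)
miss_with_writeback = read_miss_delay(t_net_unloaded, b=1, exclusive_elsewhere=True)
```

Under these assumptions a one-word block costs 15 cycles on a plain read miss and 27 cycles when a writeback is needed first.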
Both write hits to blocks in the shar_RO state, and write misses to blocks that are in the shar_RO state in other processors, require the directory to invalidate the other cached copies of the block before giving exclusive access to the requesting processor. Fetching from the pointer cache the $i$ pointers to processors with a cached copy of the block requires $i$ cycles, which are overlapped with the generation and sending of the invalidation messages. Because of this pipelining, it takes $T_{miss} + T_{inv}(i)$ cycles to service these write requests, where $T_{inv}(i) = (d+1)T_{mem} + i + 2T_{net}$ cycles, and $d$ is the number of pointers for this block that have been overflowed to memory. Notice that one additional network delay is required in the invalidation operation to receive the last invalidation acknowledgement back at the pointer cache directory. Since the invalidation commands are pipelined, the other $(i-1)$ invalidation acknowledgements are overlapped and do not directly add to the total delay. If the invalidate on overflow approach is used, all pointers are always cached and $d=0$. If any memory request causes the pointer cache to overflow, an additional $T_{inv}(1)$ cycles are needed to evict a pointer using the invalidation approach. If the overflow to memory approach is used, however, there is no additional delay since writes from the pointer cache to the main memory are buffered.
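The invalidation delay can be evaluated in the same way. This is a minimal sketch with hypothetical function names; the network delay of 5 cycles assumes an unloaded 32-processor network and is not a value reported in the text.

```python
T_MEM = 4  # cycles to access the first word of a block


def t_miss(t_net, b):
    """Miss service time when no writeback is needed."""
    return (1 + t_net) + (T_MEM + (b - 1) + t_net)


def t_inv(i, d, t_net):
    """Cycles to invalidate i cached copies, with d pointers overflowed to memory."""
    return (d + 1) * T_MEM + i + 2 * t_net


def write_to_shared_delay(i, d, t_net, b):
    """Write hit or miss to a block that is shar_RO in i other caches."""
    return t_miss(t_net, b) + t_inv(i, d, t_net)


t_net = 5  # assumed unloaded network delay for p=32
# Invalidate on overflow keeps every pointer cached, so d = 0.
three_copies = write_to_shared_delay(i=3, d=0, t_net=t_net, b=1)
# Evicting one pointer on a pointer cache overflow costs T_inv(1).
eviction_penalty = t_inv(1, 0, t_net)
```

Because only the last acknowledgement adds a network crossing, each extra cached copy adds a single cycle to the delay in this model.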
The processors used in these simulations can have multiple outstanding memory requests so that some of the memory delay can be overlapped with other processing. This type of processor requires the use of lockup-free data caches [21, 33, 34] which could increase the complexity of the processors and potentially increase contention in the interconnection network. However, to directly compare the memory performance of the different coherence schemes, the average memory delay is used as the figure of merit. This delay is insensitive to the fact that memory references may be overlapped with other processing activities. On a miss service request, the total memory delay includes the time required to send the miss service request to the memory, plus the time required for the pointer cache directory to perform any needed writeback operations. Similarly, the delay of any memory operation that initiates an invalidation sequence includes the time required to receive all of the invalidation acknowledgements back at the directory. The average memory delay is calculated by summing the delays generated by each memory operation in all of the processors and dividing by the total number of memory references generated by the program to give the average memory delay in cycles per reference.
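The figure of merit can be computed as in the following sketch. The per-processor lists of delays are made-up illustrative values, not simulation results, and the function name is hypothetical.

```python
def average_memory_delay(delays_per_processor):
    """Average delay in cycles per reference over all processors.

    delays_per_processor: one list per processor holding the delay, in
    cycles, of each memory reference that processor issued.
    """
    total_delay = sum(sum(delays) for delays in delays_per_processor)
    total_refs = sum(len(delays) for delays in delays_per_processor)
    return total_delay / total_refs


# Illustrative values only: two processors, five references in total.
example = [[1, 1, 15], [1, 27]]
avg = average_memory_delay(example)
```

Note that the sum runs over all references from all processors, so a processor that issues more references contributes proportionally more to the average.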
Test Programs
The test programs used in this performance evaluation, shown in Table 5 , have a total of more than 45 million memory references and have significant differences in block sharing characteristics. The arc3d, flo52, and trfd programs are taken from the Perfect benchmark suite [6] with slight changes in the problem size and loop iteration counts. Arc3d analyzes a three-dimensional fluid flow and flo52 analyzes the transonic flow past an airfoil. The trfd program is a quantum mechanical simulation of a two-electron integral transformation that uses a series of matrix multiplications. Simple24 is a hydrodynamics and heat flow problem using a 24-by-24 grid. The pic program is an electrodynamics application that models the movement of charged particles using a particle-in-cell approach, and lin125 is the Linpack benchmark with a 125-by-125 element matrix.
To determine the memory referencing characteristics shown in this table, the memory traces were examined and the number of processors accessing each block was counted. The private blocks are those that were referenced by the same processor throughout the program's execution, with no references by any other processor. The shared-writable blocks were referenced by two or more different processors, at least one of which wrote the block. Finally, the shar_RO blocks were referenced by more than one processor, but were never written. Only shared-writable blocks can cause coherence problems, and, as shown in this table, fewer than 40 percent of the unique blocks referenced by arc3d, pic, and flo52 are shared-writable, and fewer than half of their total references are made to these blocks. Most of their references are to private and read-only blocks, and thus do not force any coherence actions. In contrast, more than 78 percent of the blocks referenced by simple24, trfd, and lin125 are shared-writable, although only trfd and lin125 have more than half of their references to these blocks. Furthermore, Table 6 shows the fraction of all references that request exclusive access to a block that is currently in the shar_RO state in one or more data caches. These references force the directory to invalidate all cached copies of the block to maintain coherence. This table also shows the average number of invalidations generated by each of these references. The relatively small average values for arc3d, pic, simple24, and flo52 indicate that in most cases when a block in the shar_RO state is changed to exclusive access, few copies of it are cached so that few invalidation messages need to be sent. The trfd and lin125 programs are the exceptions, with many cached copies that need to be invalidated. Since the fraction of total references that force invalidations is small for these two programs, however, the absolute number of invalidations remains small.
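The block classification described above can be sketched as a single pass over a trace. The trace format, a list of (processor, operation, block) tuples with "r" and "w" operations, is a hypothetical simplification of the actual trace files.

```python
from collections import defaultdict


def classify_blocks(trace):
    """Classify each block as 'private', 'shar_RO', or 'shared-writable'.

    trace: iterable of (processor, op, block) with op in {"r", "w"}.
    """
    accessors = defaultdict(set)  # block -> processors that referenced it
    written = set()               # blocks written at least once
    for proc, op, block in trace:
        accessors[block].add(proc)
        if op == "w":
            written.add(block)

    classes = {}
    for block, procs in accessors.items():
        if len(procs) == 1:
            classes[block] = "private"          # one processor only
        elif block in written:
            classes[block] = "shared-writable"  # shared, written at least once
        else:
            classes[block] = "shar_RO"          # shared, never written
    return classes


trace = [
    (0, "r", "x"), (0, "w", "x"),  # only processor 0 touches x
    (0, "r", "y"), (1, "r", "y"),  # y is read by two processors
    (0, "w", "z"), (1, "r", "z"),  # z is written and shared
]
classes = classify_blocks(trace)
```

In this small trace, x is private even though it is written, since no other processor references it.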
The widely different sharing characteristics of these test programs provide a good indication of the performance that can be expected from the compiler optimizations.
Performance of Compiler Optimizations
The primary goals of using the compiler marking optimizations are to reduce the absolute number of pointers allocated, and to reduce the time that pointers are active, in order to reduce the size of the pointer cache directories without increasing the average memory delay. This section examines how small the directories can be made with an ideal implementation of the optimizations. It also examines how the memory performance is affected by imprecise compile-time memory disambiguation and by the lack of interprocedural analysis. These simulations are performed using the trace-driven methodology and the machine model described in the previous section. The compiler optimizations are not actually implemented in the compiler, but the effects of different levels of compiler technology are simulated using the address traces, as described below.
Ideal Compiler Performance
To simulate the effect of a compiler performing block partitioning, the memory referencing patterns of each processor in the trace are examined to classify each block as private, read-only, or shared-writable. The private and read-only blocks are marked safe_blk, while the shared-writable blocks are marked coh_blk. Another pass through the trace uses the delayed allocation marking algorithm of Figure 1 to mark all of the read references to shared-writable blocks that must be allocated a pointer. After the trace is annotated for block partitioning and delayed allocation marking, it drives a multicache timing simulator that uses this marking information to determine the average memory delay for the given configuration. This technique simulates the performance of an ideal compiler that has perfect memory disambiguation and perfect interprocedural analysis. Table 7 shows the resulting percentage of references made to each type of block for block partitioning, as well as the percentage of read references that must be allocated a pointer when using delayed allocation marking. For arc3d, pic, simple24, and flo52, more than 50 percent of all their memory references are to blocks marked safe_blk and thus will not be allocated pointers. In addition, even though fewer than 6 percent of the total references for lin125 are to blocks marked safe_blk, nearly 87 percent of all its references can be marked rd_no_ptr to reduce the total number of pointers that need to be allocated. Fewer than 8 percent of its references ever need to be allocated a pointer.
Since a real compiler cannot know the exact address of every memory reference, and because subroutines may hide some of the addressing information, the following simulation results represent a best bound on the performance of these two compiler-marking optimizations. A fully associative pointer cache is used to compare the performance differences due to only the two compiler marking strategies without any potential artifacts from using set associativity. (A 4-way associative pointer cache implementation is used in Section 5 to compare the performance of the pointer cache with compiler marking to other coherence schemes.) The invalidate on overflow approach is used to obtain a free pointer when the pointer cache is full.
The average memory delay for no marking, block partitioning, and delayed allocation marking as the size of the pointer cache is varied from 1/32 of the size of the data cache (s=0.03125) to twice the size of the data cache (s=2) is shown in Figure 2 . With no compiler optimization and a small pointer cache, many invalidations are needed to reuse the limited number of pointers as processors reference different blocks. As the size of the pointer cache increases, normal data cache replacements deallocate enough pointers so that there is almost always a free pointer available when one is needed.
For programs such as arc3d, flo52, and pic that have a large number of references to private and read-only blocks (see Tables 5 and 7 ), relatively few references need to be allocated pointers so that block partitioning can significantly reduce the pointer cache size needed to maintain small memory delays. Since a substantial number of reads to shared-writable blocks in these programs can be marked rd_no_ptr, delayed allocation marking can further reduce the memory delay for a given pointer cache size by reducing the number of pointer overflows. In addition, delayed allocation marking reduces the number of messages between the processors and the pointer caches, causing a corresponding decrease in the average delay due to the lower network traffic.
Even though the simple24 program has a large fraction of its references to private and read-only blocks, most of its memory delay comes from references to shared-writable blocks. This behavior limits the effectiveness of block partitioning for this program. However, many of the read references to the shared-writable blocks can be marked rd_no_ptr to spread out the demand for pointers. As a result, delayed allocation marking can substantially reduce the pointer cache size needed for small memory delays in this program. Since the trfd program references a very small number of unique memory addresses, a pointer is almost always available when one is needed, with or without compiler-marking. In contrast, the lin125 program accesses a much larger number of blocks, most of which are shared-writable. Since most of its references are to these shared-writable blocks, block partitioning has almost no effect on its performance. Delayed allocation marking, however, can improve its performance for a given pointer cache size since there are long periods when many of the read references do not need to be allocated pointers.
It appears that, in programs with many references to private and read-only variables, an ideal implementation of block partitioning can substantially reduce the size of the pointer directory while maintaining good performance. Delayed allocation marking builds on the benefits of block partitioning to further reduce the number of pointers needed by exploiting the potentially long read-only periods of shared-writable blocks. In many cases, delayed allocation marking can also improve the performance for a given pointer cache size by reducing the network traffic through fewer pointer allocation requests.
Procedures and Imprecise Memory Disambiguation
The previous section showed that the compiler marking optimizations can significantly reduce the size of the pointer cache directory when the compiler can perfectly disambiguate all memory references, and can track all references across procedure boundaries. This section compares the ideal performance of delayed allocation marking to no marking, and to two nonideal compilers. Both of these nonideal compilers have imprecise memory disambiguation in that they can resolve memory addresses only to the name of the array instead of to the address of each individual element within the array. This imprecise disambiguation is simulated by mapping all references to the elements of a single array into a single name. Furthermore, one of the compilers can track all memory references across subroutine boundaries, while the other is limited to performing the optimizations on one procedure at a time, as described in Section 2.4. Figure 3 shows that, except for pic and simple24, the compiler with imprecise memory disambiguation and perfect interprocedural analysis performs almost as well as the ideal compiler. Perfect disambiguation is not critical because in most of the test programs, all of the elements of an array have roughly similar access patterns so that they all tend to need pointers allocated in roughly the same pattern. Thus, allocating pointers based only on the array name in many cases can be almost as effective as allocating pointers based on the addresses of the individual elements of the array. An actual compiler could probably implement a more sophisticated data dependence analysis than this simple array name-only approach, so that the actual performance would be between these two values. Nevertheless, it is encouraging to see that the compiler optimizations still perform well even with limited memory disambiguation capabilities. 
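The array-name-only disambiguation can be mimicked on a trace as in the sketch below. Representing each address as an (array name, index) pair is a hypothetical trace format chosen for illustration.

```python
def coarsen_to_array_names(trace):
    """Map every element reference to the name of its array.

    trace: list of (processor, op, (array_name, index)) tuples.
    Returns the same references with the address reduced to array_name.
    """
    return [(proc, op, array) for proc, op, (array, _index) in trace]


def written_names(coarse_trace):
    """Names that the imprecise compiler must treat as written."""
    return {name for _proc, op, name in coarse_trace if op == "w"}


trace = [
    (0, "r", ("A", 1)),
    (1, "w", ("A", 2)),  # a write to one element of A
    (0, "r", ("B", 1)),
]
coarse = coarsen_to_array_names(trace)
dirty = written_names(coarse)
```

After coarsening, the single write to one element of A makes the whole array appear written, so reads of any other element of A are handled as conservatively as reads of the written element.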
Figure 3 also shows that for all of the programs, the lack of interprocedural analysis causes little performance degradation beyond that of the imprecise memory disambiguation. In examining the programs, it appears that the procedures typically perform some action that would require a pointer to be allocated anyway, even if the compiler had perfect interprocedural analysis. As a result, performing the optimizations separately for each procedure is almost as effective as optimizing the entire program using perfect interprocedural analysis.
Comparison with Other Coherence Schemes
While the previous simulations show that the compiler marking optimizations can significantly reduce the size of the pointer cache directory without increasing the average memory delay, it is important to compare the performance and implementation requirements of this combined hardware-software approach with that of current cache coherence schemes. Using the same trace-driven simulation methodology, the performance and the memory requirements of the pointer cache are compared with several hardware directory schemes [3, 4, 7, 19] , and with the software-directed version control scheme [12] in the same shared memory multiprocessor. These different schemes represent a wide range of performance and memory overhead trade-offs and provide a solid basis for comparing the proposed approach to existing coherence schemes. Delayed allocation marking is not used for the traditional directory schemes since, as discussed earlier, this optimization cannot reduce the size of these directories. Block partitioning can be used with these traditional directories, however, and this optimization is used in the following simulations. Since the version control scheme updates version numbers only for blocks that are actually written, the compiler marking algorithms do not apply to this scheme.
The other dynamically tagged directories [9, 17, 30] also could benefit from the compiler marking strategies. They differ from the pointer cache scheme [26] primarily in the organization of the pointers in the directory. These other schemes are not included in this study since all of these tagged directories should have approximately the same performance with roughly the same memory requirements when using compiler marking.
More Memory Delay Models
All of the traditional directories use essentially the same protocol to maintain coherence so that the primary performance difference between the schemes is the time required to perform the invalidations. The p+1-bit full distributed directory [7] maintains in each memory module p valid bits and a single exclusive bit for each block in the module, where p is the number of processors. If the exclusive bit is reset, up to p valid bits may be set to indicate which processors have a cached copy of the block. If the exclusive bit is set, a single valid bit will be set to point to the processor that has the only copy of the block. Since the directory has exact information about which processors have a copy of which blocks, the number of invalidation messages required with this directory is the same as the number of cached copies of a block.
To reduce the memory requirements, the broadcast directory [4] maintains only two state bits for each block in the memories and the caches. This approach significantly reduces the number of bits needed for coherence, but it requires the directory to broadcast all of its invalidation messages to all of the processors. Broadcasts may be inexpensive in a bus-based system, but they can be very time-consuming in a system with a multistage interconnection network since broadcasts typically must be implemented as a sequence of p individual messages. These additional messages also will increase the network congestion, further increasing the invalidation time.
The n-pointers plus broadcast scheme [3] compromises between the number of pointers and the need to broadcast. In this scheme, n pointers are associated with each block in memory to point to the first n processors that request a copy of the block. If more than n processors need a block, the scheme resorts to broadcasting. Another scheme [19] reduces the size of the directory without requiring broadcasts by maintaining a linked list from the directory to each of the processors having a copy of a block. When an invalidation message is sent from the memory, every node on the list must be visited sequentially.
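The difference in invalidation traffic among the first three directories can be summarized in a small sketch. The scheme labels and function name are hypothetical, and a broadcast is counted as p individual messages, as is typical for a multistage network.

```python
def invalidation_messages(scheme, i, p, n=None):
    """Number of invalidation messages sent for i cached copies.

    scheme: "full" (p+1-bit directory), "broadcast" (two-bit directory),
            or "n_pointers" (n pointers plus broadcast).
    p: number of processors; n: pointers per block for "n_pointers".
    """
    if scheme == "full":
        return i                    # exact sharer information
    if scheme == "broadcast":
        return p                    # no sharer information at all
    if scheme == "n_pointers":
        return i if i <= n else p   # broadcast once the pointers run out
    raise ValueError(f"unknown scheme: {scheme}")


p = 32
few = [invalidation_messages(s, 2, p, n=4)
       for s in ("full", "broadcast", "n_pointers")]
many = [invalidation_messages(s, 6, p, n=4)
        for s in ("full", "broadcast", "n_pointers")]
```

With two cached copies and n=4, the n-pointers scheme matches the full directory; with six copies it falls back to the broadcast cost.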
For each of these directory schemes, Table 8 shows the number of cycles needed to invalidate i cached copies of a shared block when exclusive access to the block is requested by a processor. In the multiprocessor architecture used in these simulations, processors can communicate only via the shared memory, requiring each invalidation command to cross the network twice in the linked list schemes. In systems in which the memory is distributed among the processors, only one network crossing is needed, but this architecture is substantially different from the model examined in this study, and it is not considered further. Both a singly-linked list directory and a doubly-linked list directory are included for completeness, although it is expected that only the doubly-linked approach would be used in practice. The doubly-linked directory simplifies cache block replacements over the singly-linked directory, and it provides a mechanism for recovering the list in the event of a processor failure. Invalidations still may be propagated from only one end of the list due to the potential race condition if they are simultaneously propagated from both ends [19] . To compare the potential range of performance of the linked list schemes, however, these simulations assume that invalidations are propagated from both ends in the doubly-linked directory, and from one end in the singly-linked directory [8] .
In the version control software-directed scheme, $T_{miss}$ cycles are required to service both read and write misses. At the end of every parallel loop, the version numbers of every variable that may have had a new version created in that loop must be incremented, which requires three cycles per updated entry. The range of performance of the version control scheme is estimated using three different levels of compiler technology, as summarized in Table 9. The simplistic compiler has imprecise memory disambiguation and maintains one version number for each array. With this compiler, a write to any element of the array creates a new version of the entire array. Furthermore, this compiler cannot track variable names across subroutine boundaries so that the entire data cache must be invalidated at the entry and exit points of each subroutine.
The other extreme of compiler performance assumes an ideal compiler with perfect memory disambiguation and perfect interprocedural analysis. This compiler maintains a unique version number for each element of every array, and it never invalidates the caches at subroutine boundaries. It models the best possible performance of the version control scheme, but it is probably impossible to implement. The realistic compiler compromises between these two extremes with imprecise This additional time is included in the average memory delay calculated for this configuration.
Memory Overhead
To compare the memory requirements of the different coherence schemes, the memory overhead is defined to be the ratio of the total number of bits dedicated to coherence functions to the total number of data bits in both the main memories and the data caches. The total number of data bits in the system is $D = pbw(m+c)$, where there are $p$ processors, $b$ words per block, $w$ bits per word, $m$ blocks in each of the $p$ memory modules, and $c$ blocks in each of the $p$ caches. If $N_x$ is the number of bits dedicated to coherence functions for a particular coherence scheme, the corresponding overhead is $O_x = N_x/D$. Notice that the overhead can exceed 100 percent since there could be more bits dedicated to maintaining coherence than are actually used to store useful data.
The (p+1)-bit full directory has two state bits per cache block plus (p+1) bits for each block in the memory modules. The total number of bits dedicated to coherence in this scheme is $N_1 = p[2c + (p+1)m]$.
The broadcast directory maintains only two state bits for each block in the memories and the caches, for a total of $N_2 = 2p(m+c)$ bits. Each pointer in the n-pointers plus broadcast scheme requires $\log_2 p$ bits plus a broadcast bit, each cache block requires two state bits, and an exclusive bit is needed for each block in the memory, giving $N_3 = p[2c + m + m(1 + n\log_2 p)]$ bits. The total number of coherence bits required in the singly-linked list scheme is $N_4 = p[3c + 2m + (c+m)\log_2 p]$ since each pointer in the memory and cache blocks requires $\log_2 p$ bits, plus an extra bit to point back to memory. In addition, two state bits are used in each cache block, and an exclusive bit is needed for each memory block. The doubly-linked list scheme requires $N_5 = p[3c + 2m + 2(c+m)\log_2 p]$ bits since two pointers are needed in each cache block and in each memory block. The pointer cache scheme requires $\log_2 m$ bits for the address tag, $\log_2 p$ bits for the processor pointer, two state bits for each of the $r$ pointers in the pointer cache (three state bits if the overflow to memory approach is used), and each cache block needs two state bits, for a total of $N_6 = p[r(\log_2 m + \log_2 p + 2) + 2c]$ bits.
With the version control scheme, the local memory for storing the version numbers in each processor requires $vl$ bits, where $v$ is the width of the version number and $l$ is the total number of entries. Each cache block requires $v$ bits for its version number, a single valid bit, and one dirty bit for each word in the block ($b$ total dirty bits). These dirty bits ensure that the version number for the block is updated only after every word in the block has been written, indicating that a new version of the entire block has been created. The total number of coherence bits then is $N_7 = p[vl + c(v + 1 + b)]$.
Using these bit counts, Table 10 shows the memory overhead for the different coherence schemes normalized to the number of blocks in the data cache, c. The ratio of the number of pointer cache entries in each memory module, r, to the number of blocks in each data cache is s=r/c, and k=m/c is the ratio of the number of blocks in memory, m, to the number of blocks in the data caches. The number of version number entries when using the ideal compiler in the version control scheme, l, is the number of unique memory locations referenced by the program. For the simplistic and realistic compilers, l is reduced by mapping all elements of a single array to one version number.
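The bit counts can be turned into overhead figures with a short calculation. The sketch below is not the source of Table 10: the function names are hypothetical, the full-directory count is derived from the prose description (two state bits per cache block plus p+1 bits per memory block), and the example configuration (p=32, b=1, w=32, an 8-kbyte cache of 2048 one-word blocks, k=256, s=1) is assembled from parameter values quoted elsewhere in the paper.

```python
import math


def data_bits(p, b, w, m, c):
    """Total data bits in all memories and caches: D = p*b*w*(m + c)."""
    return p * b * w * (m + c)


def full_directory_bits(p, m, c):
    # Derived from the prose: 2 state bits per cache block, p+1 per memory block.
    return p * (2 * c + (p + 1) * m)


def broadcast_bits(p, m, c):
    return 2 * p * (m + c)


def pointer_cache_bits(p, m, c, r):
    # Tag (log2 m) + processor pointer (log2 p) + 2 state bits per entry,
    # plus 2 state bits per cache block.
    return p * (r * (math.log2(m) + math.log2(p) + 2) + 2 * c)


def overhead(coherence_bits, p, b, w, m, c):
    """Memory overhead: coherence bits divided by data bits."""
    return coherence_bits / data_bits(p, b, w, m, c)


p, b, w = 32, 1, 32
c = 2048        # 8 kbytes of one-word (4-byte) blocks
m = 256 * c     # k = m/c = 256
r = c           # s = r/c = 1

o_full = overhead(full_directory_bits(p, m, c), p, b, w, m, c)
o_bcast = overhead(broadcast_bits(p, m, c), p, b, w, m, c)
o_ptr = overhead(pointer_cache_bits(p, m, c, r), p, b, w, m, c)
```

For this configuration the pointer cache overhead evaluates to roughly 0.34 percent, the broadcast directory to 6.25 percent, and the full directory to slightly more than 100 percent, which illustrates how the overhead can exceed the amount of data stored.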
Key: m = number of blocks in each memory module. c = number of blocks in each data cache. k = m/c = ratio of memory size to data cache size. p = number of processors and number of memory modules. b = number of words in each cache block. w = number of bits per word. n = number of pointers in the n-pointer directory. r = number of entries in each pointer cache. s = r/c = ratio of pointer cache size to data cache size. v = number of bits in each version number. l = number of version numbers in each processor.
Since the pointer cache would be built out of fast, expensive memory components, while the other directory schemes probably could be built out of slow, less expensive components, this memory overhead model does not provide a precise indication of the differences in implementation cost of the various schemes. In addition, this model ignores the cost of the associative compare logic in the pointer cache, and the additional logic needed in the version control scheme to update and compare the version numbers. However, the model does provide a good comparison of the relative memory utilization efficiencies of the different schemes. The amount of memory used to store coherence information is important since it can be a significant portion of the total cost of implementing the coherence mechanism.
Performance Differences
The average memory delay for all memory references in a program is used to compare the performance of the different coherence schemes, and the normalized memory overhead is used to compare their memory requirements. For a fair comparison of a realistic pointer cache implementation, a 4-way set associative pointer cache is used with invalidation of evicted pointers when the pointer cache overflows. A random replacement policy is used in both the 4-way associative pointer caches and the fully associative data caches. The word size is w=32 bits with p=32 processors, and the data cache block size is b=1 word. We find similar results to those reported below when the block size is four words or fewer, and that the performance seriously degrades for larger block sizes [26] .
Typical cache memory sizes are in the range of 64K words to 256K words, and a typical memory module may contain from 2M words to 16M words. Thus, typical values of k=m/c, which is the ratio of the number of blocks in each memory module to the number of blocks in each data cache, are in the range of 8 to 256. The data cache size again is c=8 kbytes in each of the 32 processors to maintain a realistic relationship between the cache size and the application programs' data set sizes. While this cache size may seem small, we note that the absolute memory latency will tend to decrease with a larger cache, but the relative performance of the different schemes will remain approximately the same since the cache invalidation patterns for all of the coherence schemes will change in roughly the same way. In addition, because the definition of memory overhead is normalized to the size of the data cache, c, the overhead of the various schemes is quite insensitive to changes in c. For example, with k=256 and s=1, the overhead of the pointer cache increases from 0.34 percent to only 0.42 percent as the size of each data cache increases from 8 kbytes to 1 Mbyte.
In the version control scheme, the caches must be cleared and all version numbers must be reset whenever a version number overflows. Larger version numbers will overflow less often, and thus improve performance, but they also require more memory than smaller version numbers. Both 8- and 16-bit version numbers were simulated and little difference was found in their performance, indicating that 8-bit version numbers are sufficient for the programs tested. Consequently, these simulations were performed with 8-bit version numbers.
The average delay and the memory overhead for k=16 and k=256 are compared for the different coherence schemes in Table 11, both with and without the relevant compiler optimizations. All of the traditional hardware directories were simulated without any compiler optimizations, and again with the block partitioning optimization. With this optimization, the memory overhead of the directories is reduced since coherence information is maintained only for shared-writable blocks. The memory delay is the same with or without this optimization for these traditional directories, however, since the same number of coherence actions is performed in both cases. Table 11 also shows the pointer cache performance and memory overhead without any compiler optimizations, and with the delayed allocation marking optimization. (Recall that this optimization also incorporates the block partitioning optimization.) The memory overhead for a given configuration of the pointer cache remains the same with or without the compiler optimization since evicted pointers are invalidated and thus do not use any additional storage space. With the delayed allocation marking optimization, however, the pointer cache will produce a much lower average memory delay for a given hardware cost than without the optimization. Conversely, with this compiler optimization, the number of memory bits needed to store the pointer information for these test programs can be reduced by a factor of 64 while increasing the average memory delay by only approximately 1 to 8 percent (in flo52 and simple24, respectively).
For all of the programs except trfd and lin125, the pointer cache with s=1/2 and the delayed allocation marking optimization performs better than any of the other directory schemes. When executing the trfd program, the pointer cache has slightly worse performance than the other directory schemes due to the 4-way associativity of the pointer cache. Similarly, due to this limited associativity, the lin125 program executing with the pointer cache has its average memory delay approximately the same as the other directories. The pointer cache can perform slightly better than the other directory schemes since it takes less time to retrieve the pointers from the cache than from the main memory, and because the delayed allocation marking optimization reduces the network traffic by reducing the number of pointers allocated.
Compared to the full directory scheme with the block partitioning compiler optimization, the pointer cache with the delayed allocation marking optimization reduces the number of bits dedicated to coherence by a factor of about 10 to 150 when k=16, up to a factor of about 100 to 1000 when k=256, while maintaining comparable or better memory delays. The 2-bit broadcast directory has the lowest memory requirements of any of the traditional directory schemes, but its performance is generally poor due to the network traffic generated by its broadcasts. The pointer cache with compiler marking always performs better than the 2-bit scheme while using one to two orders of magnitude fewer memory bits to store pointer information. The n-pointer plus broadcast scheme has generally good performance with n=2 due to the sharing characteristics of these programs, but it also has very high memory overhead compared to the pointer cache.
The performance of both the singly-linked list scheme and the doubly-linked list scheme is worse than that of the pointer cache because they must sequentially follow the sharing list whenever they invalidate blocks. Due to the relatively small number of copies of a block that are cached at one time, propagating invalidations from both ends of the list improves performance only slightly over propagating from only one end. The trfd and lin125 programs are the exceptions since they tend to have more copies that need to be invalidated (see Table 6). In addition, even with the block partitioning compiler optimization, the singly-linked scheme requires about 10 to 40 times more memory than the pointer cache when k=16, up to about 100 to 600 times more memory when k=256. The doubly-linked scheme requires 70 percent more memory than the singly-linked scheme.
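A rough way to see why invalidating from both ends helps only slightly when few copies are cached is to count the sequential hops along the sharing list. The function below is a simplified model that assumes each hop costs one step and ignores network contention; it is not the delay model used in the simulations, and its name is hypothetical.

```python
def sequential_steps(num_copies, both_ends=False):
    """Sequential hops needed to invalidate every copy on a sharing list."""
    if both_ends:
        # Two invalidation chains proceed in parallel and meet in the middle.
        return (num_copies + 1) // 2
    return num_copies


for copies in (1, 2, 8):
    print(copies, sequential_steps(copies), sequential_steps(copies, both_ends=True))
```

With one or two cached copies the two-ended walk saves at most a single hop, while with eight copies it saves four. This is consistent with the larger benefit observed for programs that share blocks more widely.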
The memory requirements of the version control scheme with the simplistic and the realistic compilers are less than a factor of 10 larger than those of the pointer cache since these two version control configurations maintain only one version number per array. The performance of the simplistic compiler tends to be poor, however, since it clears all of the caches at every subroutine boundary. The realistic compiler can have slightly better performance because it can look beyond subroutine boundaries, but it overflows the version numbers frequently since all references to a single array are mapped to a single number. In addition, because it tracks references across subroutines, it sees more version numbers that need updating at the end of the parallel loops, which can increase its average delay. In fact, for some of the programs, the realistic compiler actually performs worse than the simplistic compiler because of the additional time required to do this updating, plus the additional version number overflows it causes.
The performance of the version control scheme using the ideal compiler is comparable to the directory schemes, and in some cases is better due to its lower network traffic. It has approximately 2 to 30 times greater memory requirements than the pointer cache, however, since it has one version number for each element in every array. It may be possible to reduce the local memory requirements in each processor by caching the version numbers, but a missing number would be treated as a cache miss, which would lower its performance, and the evicted entries still would consume main memory space. Thus, it appears that for the programs tested, combining the compiler optimizations with a tagged directory produces the best performance with the lowest memory overhead.
Effect of Faster Processors
The simulations up to this point have assumed a basic one-way memory delay of $T_{\text{delay}} = T_{\text{net}} + T_{\text{mem}}$ cycles, where $T_{\text{net}} = 5(1 + f_{\text{contention}})$ and $T_{\text{mem}} = 4$ cycles. Because the data cache access time is $T_{\text{cache}} = 1$ cycle, the ratio of the memory delay with no network contention to the cache access time is $k_L = T_{\text{delay}} / T_{\text{cache}} = 9$. As processor cycle times decrease due to improvements in technology, it is reasonable to expect that the cache access time also could be reduced by the same amount. Thus, with faster processors, the memory delay will be relatively longer compared to the cache access time, increasing the ratio $k_L$.
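The baseline latency ratio can be checked with a few lines of arithmetic. The function and parameter names below are illustrative; the constants are the ones stated above, and the second call simply assumes a cache access time that is halved while the memory delay is unchanged.

```python
def latency_ratio(t_net_base=5, t_mem=4, t_cache=1, f_contention=0.0):
    """Ratio of one-way memory delay to data cache access time."""
    t_net = t_net_base * (1 + f_contention)   # network delay, scaled by contention
    t_delay = t_net + t_mem                   # one-way memory delay
    return t_delay / t_cache


print(latency_ratio())             # 9.0, the baseline with no contention
print(latency_ratio(t_cache=0.5))  # 18.0, cache twice as fast, same memory delay
```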
With larger values of $k_L$, it may be expected that the software-controlled coherence schemes will tend to have the best performance since the directory-based schemes must contend with this longer delay when communicating between the processors and the directories. Figure 4 shows how the average delay for all data references changes as the ratio of memory delay to data cache access time varies from $k_L = 4.5$ to $k_L = 576$. The simulations show that the version control scheme with the realistic compiler tends to perform worse than the 4-way set associative pointer cache because the imprecise memory disambiguation of the version control scheme causes it to overinvalidate the data caches. The performance degradation due to the additional misses of this overinvalidation becomes even more pronounced as $k_L$ increases. The hardware-based pointer cache, on the other hand, perfectly disambiguates all addresses, making it slightly less sensitive to changes in $k_L$.
For the arc3d, pic, simple24, and lin125 programs, the ideal version control scheme outperforms the pointer cache because it requires no interprocessor communication, and because it invalidates only those cache entries that are actually stale. The performance of the pointer cache is slightly better than the ideal version control scheme for the trfd program with small values of $k_L$, but as $k_L$ increases, the reduced network traffic causes the ideal version control scheme to improve relative to the pointer cache. For flo52, the pointer cache has the lowest average latency, but the difference diminishes as $k_L$ increases.
It appears, then, that the more accurate cache invalidations provided by the hardware-based scheme are more important to maintaining low delay than the reduced network traffic of the realistic software-based scheme, even with very long relative memory delays. The perfect memory disambiguation of the ideal version control scheme, combined with its lower network traffic, ultimately may have the best performance, but it is probably impossible to achieve this level of compiler technology.
Scalability
Another aspect to consider in comparing these coherence schemes is how well they scale as the number of processors in the system is increased. Since the full directory scheme maintains a pointer for each processor, its performance should scale adequately, but, referring to the memory overhead models in Table 10, its memory overhead grows as $O(p)$. The memory overhead of the 2-bit broadcast scheme is fixed, independent of the number of processors, but the additional messages needed for the broadcasts as p increases will seriously degrade its performance. The overhead of the version control scheme also is fixed since each processor maintains its own coherence information. Since there is no interprocessor communication with this scheme, its performance should scale well.
One of the major advantages of the linked list scheme is its ability to scale gracefully with additional processors. Its memory overhead grows as $O(\log_2 p)$ since the number of bits needed in each pointer for the processor number field is $\log_2 p$, and the number of pointers is independent of p. For the same reason, the memory overhead of the pointer cache also grows as $O(\log_2 p)$. The size of the address tag field for each pointer in the pointer cache is fixed by the number of blocks in each memory module, which is independent of the number of processors. Since the memory overhead of the pointer cache is significantly lower than that of the linked list scheme, and its performance is generally better, it would appear to be the best choice of directory schemes in terms of scalability.
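The difference between linear and logarithmic growth can be made concrete with a small calculation. The sketch assumes that a full directory stores one presence bit per processor for each block, and that a pointer's processor number field needs the smallest whole number of bits able to name p processors. The function names are hypothetical, and the complete overhead models of Table 10 are not reproduced here.

```python
def full_directory_bits_per_block(p):
    # Assumption: one presence bit per processor for every memory block.
    return p


def processor_field_bits(p):
    # Smallest number of bits that can encode processor numbers 0 .. p-1.
    return max(1, (p - 1).bit_length())


for p in (32, 256, 1024):
    print(p, full_directory_bits_per_block(p), processor_field_bits(p))
```

Going from 32 to 1024 processors multiplies the per-block cost of the full directory by 32, while the processor number field in each pointer only grows from 5 to 10 bits.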
Conclusions
Both software and hardware coherence schemes have been proposed for maintaining coherence in large-scale shared memory multiprocessors with multistage interconnection networks. Since hardware directory schemes monitor actual run-time memory addresses, they can precisely disambiguate all references, and can transparently track execution across subroutine boundaries. However, they require large amounts of memory to store the processor pointers, and they do not have access to the high-level sharing information that can be used to reduce the number of pointers that must be saved. Software-directed coherence schemes can exploit all of the high-level information, but, by relying on imprecise compile-time memory disambiguation, they overinvalidate cache entries and can have difficulties preserving temporal locality across subroutine boundaries.
The compiler-assisted hardware directory presented in this paper combines some of the advantages of both the hardware and software coherence strategies, while eliminating many of their disadvantages. The compiler marking optimizations can determine which blocks actually need to be allocated pointers, and when they need to be allocated. When used in conjunction with a directory that dynamically allocates pointers, this use of the high-level sharing information can further reduce the directory size by reducing the total number of pointers that need to be allocated, and by reducing the time that pointers are active. Unlike software-only coherence schemes, this compiler-assisted scheme still can use the full power of the directory when the compiler is unable to determine the precise sharing characteristics of a particular block.
As summarized in Table 12, the pointer cache directory performs as well as any of the current directory schemes while using only a small fraction of the memory that the other directory schemes need to store the pointer information. The memory overhead of the software-directed version control scheme with imprecise memory disambiguation is less than a factor of 10 times greater than that of the pointer cache, but the pointer cache produces lower memory delays due to its perfect memory disambiguation. The ideal version control scheme ultimately can have the best performance due to its low network traffic, but it requires significantly more memory to store the additional version numbers, and it is probably impossible to actually implement perfect memory disambiguation in a compiler.
The software-based version control coherence scheme is more sensitive to the precision of the memory disambiguation and interprocedural analysis performed by the compiler than this compiler-assisted approach since any imprecision causes the software-only approach to err on the conservative side and over-invalidate data cache blocks. In the compiler-assisted pointer cache, however, any imprecision in memory disambiguation or interprocedural analysis by the compiler simply causes more pointers to be allocated than are actually necessary. Thus, the pointer cache approach can tolerate a lower level of compiler performance in data dependence analysis and interprocedural analysis without sacrificing the application program's run-time memory performance.
The pointer cache approach scales well as the number of processors is increased since its memory overhead grows only as $O(\log_2 p)$, and because it does not broadcast any invalidations. In addition, as faster processors are used so that the memory latency appears to increase, the performance of the pointer cache generally adapts as well as the software-only coherence schemes. While we used a loop-based parallelism model with barrier synchronization to demonstrate these compiler optimizations, we point out that the technique for assisting a tagged directory with these compiler optimizations is applicable to a much broader class of parallelism models.
