Trace-driven simulations of numerical Fortran programs are used to study the impact of the parallel loop scheduling strategy on data prefetching in a shared memory multiprocessor with private data caches. The simulations indicate that to maximize memory performance it is important to schedule blocks of consecutive iterations to execute on each processor, and then to adaptively prefetch single-word cache blocks to match the number of iterations scheduled. Prefetching multiple single-word cache blocks on a miss reduces the miss ratio by approximately 5 to 30 percent compared to a system with no prefetching. In addition, the proposed adaptive prefetching scheme further reduces the miss ratio while significantly reducing the false sharing among cache blocks compared to nonadaptive prefetching strategies. Reducing the false sharing causes fewer coherence invalidations to be generated, and thereby reduces the total network traffic. The impact of the prefetching and scheduling strategies on the temporal distribution of coherence invalidations also is examined. It is found that invalidations tend to be evenly distributed throughout the execution of parallel loops, but that they tend to be clustered when executing sequential program sections. The distribution of invalidations in both types of program sections is relatively insensitive to the prefetching and scheduling strategy.
Introduction
Several studies [1, 6, 8, 15] have suggested that small cache blocks should be used in shared memory multiprocessors with private data caches to reduce the performance impact of false sharing. This type of sharing occurs when processors are not actually sharing data, but are sharing multiword cache blocks due to the placement of data within the memory modules. Another study [10] has suggested that false sharing can be reduced by scheduling several consecutive parallel loop iterations to execute on each processor using the stripmining program transformation. To extend this previous work, this paper examines how data prefetching can be used to improve memory performance in a shared memory multiprocessor with a multistage interconnection network [9, 14, 19] , taking into consideration how the parallel task scheduling strategy interacts with the prefetching strategy. This study concentrates on numerical Fortran programs since these large-scale parallel systems frequently are used to execute this type of program. Although these programs have a relatively simple control structure, they represent an important class of applications due to their widespread use.
Several different strategies that can be used to schedule independent iterations of a parallel loop on to the processors of a shared memory multiprocessor are described in Section 2. This section also describes hardware and software data prefetching techniques, and discusses how this prefetching may actually reduce memory performance by polluting the caches. To reduce this cache pollution, Section 3 proposes an adaptive prefetching scheme in which the number of cache blocks fetched on a miss is dynamically adapted to match the loop scheduling strategy. The machine model, simulation methodology, and test programs used in this study are described in Section 4. Section 5 presents simulation results that demonstrate the impact of the scheduling strategy on the exploited parallelism, the miss ratio, the interconnection network traffic, the data sharing among processors, and the cache pollution. This section also examines the impact of the scheduling strategy on the temporal distribution of invalidations needed to maintain cache coherence. Finally, the last section summarizes the results and conclusions.
Prefetching and Scheduling
The delay to receive the first word of a cache block on a miss in a shared memory multiprocessor with a multistage interconnection network can be significantly greater than that of a uniprocessor, and this delay may be unpredictable due to contention within the network switches and the memory modules. The data prefetching effect of large cache blocks could be very helpful in this type of system to hide this long memory delay, but the parallel task scheduling strategy can interact strongly with the choice of cache block size to significantly affect the spatial locality within the processors' data caches. The scheduling strategy determines which parallel tasks will be executed on which processors, and can be classified as either static or dynamic, depending on when the task assignment decisions are made. With static scheduling, the assignment of tasks to processors is predetermined at compile-time or at load-time. Since each processor knows which tasks it will execute based on its processor number and the task identifiers, there is essentially no run-time overhead to increase the task execution times. If a compiler can accurately predict the execution times of all of the tasks, and then evenly distribute them among the processors, the computational load can be perfectly balanced to minimize the execution time.
Unfortunately, subroutine calls and conditional statements within a task may alter its expected execution time. Furthermore, unpredictable events, such as cache misses, page faults, and interprocessor communication delays, add another level of nondeterminism to the task execution time making it difficult to maintain good load balancing with static scheduling. If there could be a large variance in task execution time due to these effects, dynamic scheduling, also called self-scheduling, may be used to distribute the task scheduling decisions to each of the processors at run-time. With a dynamic approach, each processor assigns work to itself only when it is idle. The time required to access the common pool of waiting tasks introduces some run-time overhead, but the improved load balancing may compensate for this additional delay.
Parallel Loop Scheduling Strategies
Since the body of a loop is executed many times, a large portion of the parallelism available in a program may be exploited by simultaneously executing independent loop iterations on different processors [21] . For example, a Fortran doall loop is one in which each iteration of the loop is independent of the other iterations; that is, in this type of loop, there are no dependences between iterations. The independent tasks to be scheduled in this case are the individual instantiations of the iterations, each with a unique value of the loop index. These independent iterations of the loop can be arbitrarily complex, containing other nested loops and subroutine calls, for instance.
With static scheduling each processor determines which iterations it will execute based on its processor number. For example, if the iterations are to be evenly distributed across the processors, processor 0 will execute iterations 1, p+1, 2p+1, etc., where p is the number of available processors. With dynamic self-scheduling, on the other hand, idle processors assign iterations to themselves at run-time by accessing a shared iteration-counter variable. This shared variable indicates which iteration number should be executed by the next available processor. To prevent more than one processor from executing the same iteration, access to this variable must be limited to one processor at a time using an appropriate synchronization mechanism.
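The two assignment rules described above can be sketched as follows. This is a minimal illustration rather than the emulator used in this study: the names static_cyclic, IterationCounter, and self_schedule are hypothetical, Python threads stand in for processors, and a lock stands in for the unspecified synchronization mechanism that protects the shared iteration-counter.

```python
import threading

def static_cyclic(n_iters, p):
    """Iterations (numbered from 1) run by each processor with static scheduling, c = 1."""
    # Processor 0 gets 1, p+1, 2p+1, ...; processor 1 gets 2, p+2, ...
    return [list(range(proc + 1, n_iters + 1, p)) for proc in range(p)]

class IterationCounter:
    """Shared iteration-counter guarded by a lock (hypothetical helper)."""

    def __init__(self, n_iters):
        self._next = 1
        self._n = n_iters
        self._lock = threading.Lock()

    def grab(self):
        """Return the next unexecuted iteration, or None when the loop is finished."""
        with self._lock:  # only one processor may update the counter at a time
            if self._next > self._n:
                return None
            i = self._next
            self._next += 1
            return i

def self_schedule(n_iters, p):
    """Dynamic self-scheduling: each idle worker assigns the next iteration to itself."""
    counter = IterationCounter(n_iters)
    done = [[] for _ in range(p)]

    def worker(proc):
        while (i := counter.grab()) is not None:
            done[proc].append(i)  # stand-in for executing iteration i

    threads = [threading.Thread(target=worker, args=(proc,)) for proc in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

The static assignment is fully determined by the processor number, while the dynamic assignment depends on run-time timing; only the property that every iteration runs exactly once is guaranteed.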
Iterations from a loop can be grouped such that several consecutive iterations are assigned to execute on a processor as a complete task. The loop blocking factor, or chunk size, c, is the number of consecutive iterations executed by a single processor. With static scheduling using a loop blocking factor of c, iterations 1 to c will execute on processor 0, iterations c+1 to 2c will execute on processor 1, and so on. Single-chunk static scheduling assigns $c = \lceil N/p \rceil$ consecutive iterations to each processor at the start of each parallel loop, where N is the total number of iterations to be executed by the p processors. With this scheduling strategy, the last processor will execute $c' = N - \lfloor (N-1)/c \rfloor c$ iterations since N may not be evenly divisible by p. With dynamic scheduling, the loop blocking factor determines how many iterations each processor will execute between accesses to the shared iteration-counter. When a processor allocates the next block of iterations to itself, it increments the iteration-counter by c so that the next processor will begin executing with the (c+1)st next iteration. Small values of c tend to increase the scheduling overhead in dynamic scheduling since more accesses to the single shared iteration-counter are required, but these small chunk sizes provide better load balancing by ''filling-in'' idle processor time. Larger values of c decrease the scheduling overhead at the expense of poorer load balancing.
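As a small worked illustration of chunked static scheduling, the sketch below computes which iterations each processor receives. The function names are hypothetical. It assumes that the single-chunk factor N/p is rounded up, and that chunks beyond the first p wrap around to processor 0 again, a detail the description above leaves open.

```python
def static_chunks(n_iters, p, c):
    """Assign chunks of c consecutive iterations (numbered from 1) to p processors."""
    work = [[] for _ in range(p)]
    for k, start in enumerate(range(1, n_iters + 1, c)):
        # Chunk k goes to processor k mod p (wrap-around is an assumption).
        work[k % p].extend(range(start, min(start + c, n_iters + 1)))
    return work

def single_chunk(n_iters, p):
    """Single-chunk static scheduling with c = ceil(N / p)."""
    c = (n_iters + p - 1) // p  # integer ceiling of N / p
    return static_chunks(n_iters, p, c)

def last_chunk_size(n_iters, c):
    """Size of the final, possibly short, chunk: N - floor((N - 1) / c) * c."""
    return n_iters - ((n_iters - 1) // c) * c
```

For N = 10 and p = 4 this gives c = 3, so three processors receive three iterations each and the last processor receives the single remaining iteration.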
Guided self-scheduling [20] combines a dynamic scheduling strategy with a variable loop blocking factor in an attempt to compromise between the low scheduling overhead of large values of c, and the good load balancing of smaller values of c. In its simplest form, the number of iterations allocated to a processor at time $T_i$ is $\lceil R_i/p \rceil$, where $R_i$ is the number of iterations remaining to be executed at time $T_i$. This approach schedules large blocks of iterations near the beginning of a loop's execution to reduce the scheduling overhead. As fewer iterations remain towards the end of the execution of the loop, smaller blocks of iterations are scheduled to provide good load balancing. A variation of this scheduling strategy, called factoring, assigns smaller blocks of iterations to the processors at the start of the loop to provide potentially better load balancing [11].
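The shrinking chunk sizes produced by the simplest form of guided self-scheduling can be listed with a few lines of code. This sketch assumes that the allocation is the number of remaining iterations divided by p, rounded up; gss_chunks is a hypothetical name.

```python
def gss_chunks(n_iters, p):
    """Successive chunk sizes handed out by the simplest form of GSS."""
    remaining = n_iters
    sizes = []
    while remaining > 0:
        c = (remaining + p - 1) // p  # ceil(remaining / p), assumed rounding
        sizes.append(c)
        remaining -= c
    return sizes
```

For example, a loop of 10 iterations on 2 processors is handed out in chunks of 5, 3, 1, and 1 iterations, so the scheduling overhead is concentrated at the end of the loop where fine-grained balancing matters most.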
A barrier synchronization operation is required at the end of each parallel loop to ensure that no processor begins executing tasks that follow the parallel loop until all of the iterations of the current loop have been completed. Processors check in at a barrier to indicate that there are no more iterations to be executed, and that they are waiting for the remaining processors to finish executing their current tasks. After all of the processors have checked in, the barrier is released, allowing the processors to begin executing the next set of available tasks.
Data Prefetching
Private data caches can reduce the average time to access memory by amortizing the cost of a miss over several references to the same and nearby memory locations, thereby exploiting temporal and spatial locality, respectively. Prefetching, on the other hand, masks the memory delay by overlapping it with other processing. Hardware prefetching is a function of the cache block size and of the fetch size. The cache block size, b, is the number of consecutive memory words updated or invalidated as a single unit. It is the smallest unit of memory for which unique address tag information is maintained in the caches. The fetch size, f , is the number of blocks moved from the main memory to the cache on a miss, making the total number of words transferred on a miss equal to fb.
In uniprocessors, increasing the size of the cache blocks while maintaining the fetch size at f =1 block frequently can reduce the miss ratio. This reduction occurs because many programs exhibit the property of spatial locality in which memory addresses physically near each other in the memory space are likely to be referenced within a short span of time [25] . Assuming that the total number of words that can be stored in a cache is fixed, there is an optimum block size that minimizes the miss ratio. Increasing the block size beyond this optimum size tends to cause the miss ratio to increase since words that are still being actively referenced must be evicted to make room for the newly fetched block. It has been pointed out that due to the time required to transfer large cache blocks, the block size that minimizes the average memory delay typically is smaller than the block size that minimizes the miss ratio [22, 26] . The performance impact of the larger blocks can be reduced by using lockup-free caches [13, 24] , although the temporal clustering of misses may reduce the effectiveness of these complex fetch strategies [23] .
An alternative to the implicit hardware prefetching of multiword cache blocks is to have the fetch size, f , greater than one block so that the hardware can explicitly prefetch multiple blocks. For instance, the cache controller can be designed to fetch a fixed number of consecutive blocks on every miss [25] , or to prefetch extra blocks only on a miss for a long vector operand [7] . While this latter approach improves the miss ratio for these vectors, it ignores the potential improvement from prefetching scalars and arrays that do not vectorize. Another technique is to use explicit software prefetching where the compiler inserts special memory prefetch instructions in loop iteration i to begin fetching the data needed in iteration i+j [3, 12] . The prefetch distance, j, is determined by the compiler, and may be different for different loops.
False Sharing
Both implicit and explicit prefetching can interact with the parallel task scheduling strategy to produce different data sharing patterns in a multiprocessor, and can thereby affect the system's memory performance. When scanning through an array, for instance, the loop increment used to access the array elements between subsequent iterations is called the stride. A loop stride of one causes each iteration of the loop to access consecutive elements of the array. When the loop iterations are statically scheduled with a chunk size of c=1, these consecutive array elements will be accessed by iterations executing on different processors. False sharing of cache blocks occurs when the elements of the array are allocated to consecutive memory locations and the cache blocks are large enough to contain more than a single array element. With a loop stride of one, iteration i executing on processor j reads array element i, while iteration i+1 executing on processor j+1 reads element i+1. Elements i and i+1 may be allocated to the same cache block so that both processors j and j+1 will cache a copy of the same block, even though they are not actually sharing any data.
This sharing of cache blocks is not necessarily harmful as long as the processors only read the array. With an invalidation-based coherence protocol [4] , however, a processor must request exclusive access to an entire block before it can write to any word within the block. This request for exclusive access will force the coherence mechanism to send invalidation messages to all processors with a cached copy of the block, even though those processors may not actually be referencing the particular word being written. These processors then will generate cache misses when they attempt to rereference the elements of the array used by the iterations assigned to them. In the worst case, every write to a falsely shared block will cause all cached copies to be invalidated, and every read to the block will cause a cache miss.
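The interaction between chunk size and block size can be made concrete with a small model. The sketch below is illustrative only: it assumes a stride-one loop in which iteration i touches only array element i, an array that starts on a block boundary, and chunks assigned to processors in wrap-around order. The function names are hypothetical.

```python
def owner_static(i, p, c):
    """Processor that runs iteration i (numbered from 0) with static chunks of size c."""
    return (i // c) % p

def falsely_shared_blocks(n_iters, p, c, b):
    """Count cache blocks of b words that are touched by more than one processor."""
    sharers = {}
    for i in range(n_iters):
        block = i // b  # element i lives in block i // b (aligned array assumed)
        sharers.setdefault(block, set()).add(owner_static(i, p, c))
    return sum(1 for procs in sharers.values() if len(procs) > 1)
```

With 64 iterations on 8 processors and 4-word blocks, a chunk size of c=1 leaves all 16 blocks shared by several processors, while c=8 or single-word blocks leave none shared in this model.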
Cache Pollution
If the extra data loaded into a cache by an implicit or explicit prefetch operation is referenced before it is evicted, the miss ratio will tend to improve. If the prefetched blocks are evicted from the cache before they are referenced, however, they are said to pollute the cache. Two types of cache pollution can be identified: 1) cache overflow pollution, and 2) false sharing pollution. Cache overflow pollution is a consequence of the finite cache size. It occurs in both uniprocessor systems and in multiprocessor systems when a block is prefetched into the cache, and then is replaced by another block on a cache miss before the previously prefetched block is referenced.
False sharing pollution is similar to overflow pollution in that a block is prefetched into the cache but is not referenced before being evicted. With false sharing pollution, however, the block is invalidated in the cache due to a cache coherence invalidation command caused by another processor that is attempting to write to a word in the block other than the word that caused the block to be initially moved into the processor's cache. Thus, false sharing pollution is independent of the size of the data cache and is a consequence only of memory block sharing in a multiprocessor system. Both types of cache pollution can reduce memory performance by consuming the limited bandwidth between the processors and the memory modules, and by prematurely replacing active blocks in the cache that are still being referenced by the processor. In addition, the time required to perform the invalidations and the evictions can directly increase the average delay to access memory. Consequently, for the best performance it is important to keep both types of cache pollution to a minimum.
Adaptive Prefetching
A close examination of the cache eviction and invalidation patterns in the program traces described in the next section revealed that the cache pollution tended to be minimized when the number of iterations scheduled to execute on a processor matched the number of single-word blocks fetched from memory. For example, the loop blocking factor with GSS initially is $c = \lceil N/p \rceil$, where N is the loop iteration count. As the loop's execution progresses, the loop blocking factor is reduced so that fewer consecutive iterations are assigned to a single processor. The cache evictions and invalidations produced by the memory traces with GSS showed that the cache pollution was minimized when the loop blocking factor for GSS happened to match the number of blocks fetched from memory on a miss. This behavior makes intuitive sense because consecutive iterations frequently reference consecutive memory blocks; that is, stride-one memory accesses are common.
Thus, when multiple consecutive blocks are fetched into the cache, and several consecutive iterations are scheduled to execute on that processor, blocks that are prefetched are likely to be referenced before being evicted.
This observation suggests that an adaptive prefetching strategy may allow the processors to more consistently take advantage of this alignment effect. In particular, it may be possible to improve the memory performance by matching the loop blocking factor, or chunk size, to the cache fetch size [16] . However, fixing the loop blocking factor to match the fetch size may unbalance the distribution of iterations to processors, which would reduce the overall parallelism and thereby increase the execution time. GSS ensures a balanced distribution of iterations to processors, but it has a variable loop blocking factor which only occasionally would match the fetch size. To accommodate these conflicting requirements, an adaptive strategy is proposed that dynamically adjusts the cache fetch size in each processor to match the number of iterations executing on that processor.
In particular, each processor has a fetch count register that determines how many single-word cache blocks will be fetched on a miss. For single-chunk scheduling, $c = \lceil N/p \rceil$ consecutive iterations are assigned to each processor, and the fetch count register is set once at the start of the parallel loop to $r_f$ = min(max_fetch_size, c). The minimum operation sets the largest fetch size to max_fetch_size to prevent the processor from fetching an unreasonably large number of blocks when N/p is large. With GSS, each processor sets its fetch count register to $r_f$ = min(max_fetch_size, $\lceil R_i/p \rceil$), where $R_i$ is the number of iterations remaining to be executed when the processor assigns the next block of iterations to itself. Thus, with this adaptive prefetching approach, the number of single-word cache blocks fetched by a processor on a miss during a parallel section of the program will match the number of consecutive iterations being executed on that processor. At the beginning of a sequential section of the program, the fetch count register can be set to any value that maximizes the memory performance in that section.
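The fetch count rule can be written out directly. The sketch below assumes that the chunk sizes are rounded up, and uses the max_fetch_size of eight blocks that is adopted later for the simulations; the function names are hypothetical.

```python
MAX_FETCH_SIZE = 8  # blocks; the value used in the simulations of this study

def fetch_count_single_chunk(n_iters, p, max_fetch=MAX_FETCH_SIZE):
    """Fetch count register for single-chunk scheduling: min(max_fetch_size, ceil(N/p))."""
    c = (n_iters + p - 1) // p
    return min(max_fetch, c)

def fetch_count_gss(remaining, p, max_fetch=MAX_FETCH_SIZE):
    """Fetch count register with GSS: min(max_fetch_size, ceil(R_i/p))."""
    return min(max_fetch, (remaining + p - 1) // p)
```

With p=8 processors, a 24-iteration loop under single-chunk scheduling sets the register to 3, a 1000-iteration loop saturates it at 8, and under GSS the register shrinks toward 1 as the remaining iteration count falls.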
It should be noted that this adaptive matching strategy optimizes prefetching for unit-stride loops in which the order of memory accesses matches the layout of data in the memories. For strides that are greater than one, some of the words prefetched with this technique will be superfluous, but this technique will do no worse than a fixed-size prefetching strategy. Similarly, if the access order does not match the order in which data elements are stored in memory, this adaptive prefetching will not prefetch the proper cache blocks. For instance, if the data are stored in row-major order, but are accessed in column-major order, then the prefetching operation will not be helpful. This prefetching may actually be counterproductive if the prefetching pollutes the cache, or if it consumes limited memory bandwidth that could have been used by another processor. Again, however, this adaptive approach will perform no worse than a nonadaptive prefetching approach. In fact, the simulations presented in Section 5 suggest that this adaptive approach does work well even with the mixture of strides and access orders used in the loops in these test programs.
Simulation Methodology
A trace-driven simulation methodology is used to compare the memory referencing performance of different scheduling and prefetching strategies in a multiprocessor with p processors connected to a shared memory via a multistage interconnection network. Parallel loops within application programs written in Fortran are found automatically using the Alliant compiler [18]. The compiler produces parallel assembly code which is executed by a multiprocessor emulator using static scheduling with c=1, single-chunk scheduling with $c = \lceil N/p \rceil$, or GSS. The emulator produces a trace of all addresses generated by each of the p processors. Processor 0 generates all addresses produced during sequential program sections. During the execution of parallel loops, a separate trace is produced for each of the p processors. These p individual traces are perfectly interleaved to produce a single trace in which an address generated by processor i is followed by an address generated by processor i+1. This interleaving is only one of many possible interleavings that could be produced in an actual system, but it emphasizes the effects of false sharing on the cache coherence mechanism. Thus, this interleaving of the memory addresses produced by the processors provides a strong test of the impact of prefetching and loop scheduling on the multiprocessor system's memory performance.
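The perfect interleaving of the per-processor traces amounts to a round-robin merge, as sketched below. How the emulator treats traces of unequal length is not stated here, so this sketch simply skips processors whose traces are exhausted; that choice, and the name interleave, are assumptions.

```python
from itertools import zip_longest

def interleave(traces):
    """Round-robin merge of per-processor address traces into (processor, address) pairs."""
    end = object()  # sentinel marking an exhausted trace
    merged = []
    for step in zip_longest(*traces, fillvalue=end):
        for proc, addr in enumerate(step):
            if addr is not end:  # skip processors with no more references (assumption)
                merged.append((proc, addr))
    return merged
```

In the merged trace, each reference by processor i is immediately followed by a reference from the next processor that still has references, which places accesses to neighboring array elements by different processors as close together in time as possible.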
Machine Model
A multicache simulator is driven by the interleaved memory trace to determine the relevant performance metrics, such as the global miss ratio, the network traffic generated between the processors and the memories, and the cache pollution. To limit the total execution time of the simulations, the input data to the application programs was reduced in size. In addition, to provide a reasonable relationship between the size of the processors' data caches and the size of the input data sets, each of the eight processors used an 8 kbyte data cache. These data caches were fully associative to focus on the effects of data sharing between processors without the possibility of set associativity conflicts within a single data cache.
The simulator allows the cache block size to be set independently of the fetch size. On a cache miss, f blocks of b words each are moved from the memory to the cache. Since each word is 4 bytes (32 bits), 4fb bytes are fetched per miss. The write-back strategy used by the cache coherence mechanism writes a single b-word block back from the cache to memory, regardless of the fetch size. Since previous studies have suggested the use of small cache blocks [1, 6, 8, 15] , these simulations use b=1 word per block. The number of blocks fetched on a miss, f , is a free parameter when using the nonadaptive prefetching strategies. For the adaptive scheme, the fetch count register is set as described in Section 3 during the execution of the parallel sections of the program, with a max_fetch_size of eight blocks. During sequential sections of the program, the number of blocks fetched on a miss is fixed at a value specified at the start of the simulation.
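The separation of block size and fetch size can be illustrated with a toy single-processor cache model. This is a sketch and not the multicache simulator used in this study: the class name FetchCache is hypothetical, LRU replacement is assumed because the replacement policy is not stated, and every miss is charged the full 4fb bytes even when some of the fetched blocks are already cached.

```python
from collections import OrderedDict

WORD_BYTES = 4  # each word is 4 bytes

class FetchCache:
    """Fully associative cache with b-word blocks that fetches f blocks per miss."""

    def __init__(self, capacity_blocks, f, b=1):
        self.capacity = capacity_blocks
        self.f = f
        self.b = b
        self.blocks = OrderedDict()  # block number -> True, oldest first (LRU assumed)
        self.misses = 0
        self.bytes_fetched = 0

    def access(self, word_addr):
        """Reference one word; return True on a hit and False on a miss."""
        block = word_addr // self.b
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        self.misses += 1
        self.bytes_fetched += WORD_BYTES * self.f * self.b
        # Insert the f consecutive blocks, demanded block last so it is most recent.
        for blk in range(block + self.f - 1, block - 1, -1):
            self.blocks[blk] = True
            self.blocks.move_to_end(blk)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return False

def miss_count(addresses, f, capacity_blocks=2048):
    """Misses and bytes fetched for a word-address trace (2048 one-word blocks = 8 kbytes)."""
    cache = FetchCache(capacity_blocks, f)
    for a in addresses:
        cache.access(a)
    return cache.misses, cache.bytes_fetched
```

For a stride-one scan of 16 words with b=1, a fetch size of f=1 gives 16 misses and f=4 gives 4 misses, while the number of bytes fetched is 64 in both cases.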
All instructions are stored in a separate instruction cache, and all instruction references are ignored in these simulations. In addition, since the primary focus of this study is the impact of loop scheduling strategies on hardware data prefetching, all accesses to synchronization variables are ignored. It is assumed that the barrier synchronization operation required at the end of parallel loops is performed using a special synchronization bus. This bus also is used to distribute the next available iteration values to the processors when the loop iterations are dynamically scheduled.
A (p+1)-bit distributed full directory scheme [4] is used to maintain cache coherence in this system. This coherence mechanism associates p valid bits and one exclusive bit with each block in memory. If the exclusive bit is reset, valid bit i is used to indicate that processor i has a copy of the block in the shared read-only state. Up to p processors may have a copy of the block in this state. When a processor attempts to write to a shared block, it must request exclusive access to the block from the directory. Before granting this exclusive access, the directory will send invalidation messages to all processors with a cached copy of the block, as indicated by the valid bits. The exclusive bit then is set to indicate that one processor, identified by one of the valid bits, has the single copy of the block in its cache in the exclusive state. Since this processor has the only copy of the block, it can read or write the block as it pleases. Before another processor can write to this same block, the first processor will be requested by the directory to write the latest value back to the main memory. Processors use two state bits per cache block to maintain the current state of each block they have cached.
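The directory bookkeeping described above can be sketched as a small state model. The class below is an illustration under stated assumptions, not the simulator's implementation: the name Directory is hypothetical, a read by another processor of an exclusive block is assumed to downgrade the block to the shared state with the former owner keeping its copy, and a write to a block held exclusively by another processor is counted as one invalidation.

```python
class Directory:
    """Full directory with p valid bits and one exclusive bit per memory block."""

    def __init__(self, p):
        self.p = p
        self.valid = {}      # block -> set of processors whose valid bit is set
        self.exclusive = {}  # block -> True when a single processor holds it exclusively
        self.invalidations = 0

    def read(self, proc, block):
        holders = self.valid.setdefault(block, set())
        if self.exclusive.get(block, False) and proc not in holders:
            # Owner writes the block back; both copies become shared (assumption).
            self.exclusive[block] = False
        holders.add(proc)

    def write(self, proc, block):
        """Grant exclusive access; return the number of invalidations sent."""
        holders = self.valid.setdefault(block, set())
        if self.exclusive.get(block, False) and holders == {proc}:
            return 0  # already the exclusive owner, no directory request needed
        others = holders - {proc}
        self.invalidations += len(others)
        self.valid[block] = {proc}
        self.exclusive[block] = True
        return len(others)
```

A write to a block that three processors hold in the shared read-only state, issued by one of the three, produces two invalidations; a write by the only sharer produces none, although the request to the directory is still required.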
In this simulation study, a 32-bit wide multistage interconnection network is used to connect the p processors to the p memory modules. Messages on this network are packetized with each packet requiring a minimum of two words (eight bytes). The first word in each packet contains an operation code to indicate the function of the message. In addition, this word contains the addresses of the source and destination modules. The second word of each packet contains the memory address of the word being referenced. Additional words are appended to the packets to transfer the data read from and written to the memories by the processors. The network is divided into a forward network to carry the traffic from the processors to the memory modules, and a separate reverse network to handle the traffic from the memories to the processors. All messages between the processors and the memories are acknowledged to ensure consistency using a weak-ordering consistency model [5] . The steps required for each type of memory reference, along with the corresponding network traffic, are shown in Table 1 . 
Test Programs
The parallel memory traces used to drive the simulator were generated from six numerical Fortran programs. The size of the input data sets and the outer loop counts were reduced so that the entire application program could be simulated in a reasonable amount of time. The pic program uses a particle-in-cell technique to model the movement of charged particles in an electrodynamics application. A 24-by-24 element grid is used in the simple24 program's simulation of a hydrodynamics and heat flow problem. The trfd program uses a series of matrix multiplication operations to simulate a quantum mechanical model of a two-electron transformation. Flo52 analyzes an airfoil with transonic flow, while arc3d is a three-dimensional fluid flow problem. The lin125 program is a version of the Linpack benchmark using a 125-by-125 element array.
The memory referencing characteristics of the test programs are summarized in Table 2 for p=8 processors, a cache block size of b=1 word, a fetch size of f=1 block, and a static scheduling strategy with a loop blocking factor of c=1. Each of the memory blocks referenced by the programs was classified as private, shared-writable, or shared read-only by examining the number of processors that referenced each block in the address traces. Blocks that were referenced by the same single processor throughout the program's execution are classified as private. A block that was read by at least two processors and written by at least one is classified as shared-writable. Shared read-only blocks were never written, but were read by two or more processors. Since the table does not show the statistics for these read-only blocks, the percentages do not sum to 100. These different programs produce a wide variety of sharing behavior which provides for a broad test of the memory performance when using the different scheduling strategies.
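The block classification can be sketched as a single pass over a trace. The trace format of (processor, block, is_write) tuples and the name classify_blocks are hypothetical. As a simplification, any block that is touched by two or more processors and written at least once is labeled shared-writable here, which is slightly broader than the definition given above.

```python
def classify_blocks(trace):
    """Label each block as 'private', 'shared-writable', or 'shared read-only'.

    trace is an iterable of (processor, block, is_write) tuples.
    """
    readers, writers = {}, {}
    for proc, block, is_write in trace:
        (writers if is_write else readers).setdefault(block, set()).add(proc)
    labels = {}
    for block in set(readers) | set(writers):
        procs = readers.get(block, set()) | writers.get(block, set())
        if len(procs) == 1:
            labels[block] = "private"           # one processor throughout
        elif writers.get(block):
            labels[block] = "shared-writable"   # several processors, at least one write
        else:
            labels[block] = "shared read-only"  # several readers, never written
    return labels
```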
The last column of Table 2 shows the fraction of all references that request exclusive access to a block that is currently in the shared read-only state in one or more data caches. These references force the directory to invalidate all cached copies of the block to maintain coherence. Figure 1 shows a histogram of the fraction of these references that generate i invalidations. The large values around i=1 and 2 indicate that in most cases when a block in the shared read-only state is changed to exclusive access, few copies of it are cached so that few invalidation messages need to be sent. The trfd and lin125 programs are the exceptions with many cached blocks that need to be invalidated. The absolute number of invalidations remains small, however, since, as shown in the last column of Table 2 , the fraction of the references that force invalidations is small. The i=0 case occurs when a processor attempts to write to a block in the shared read-only state that is only in its own cache. An invalidation is not needed in this case, but since the processor does not know if it has the only cached copy, it must first request exclusive access from the directory. Similar histogram results have been reported elsewhere for a variety of programs [2, 27] . 
Interaction of Scheduling and Prefetching
The different memory referencing characteristics of these test programs produce a wide range of memory performance in this shared memory multiprocessor. This section presents the results of the simulations showing how the prefetch strategy interacts with the parallel loop scheduling strategy to affect the exploited parallelism, the cache block sharing characteristics, the miss ratio, the network traffic, the cache pollution, and the temporal clustering of invalidations.
Effect of Scheduling on Parallelism
The loop scheduling strategy can have a significant effect on the parallelism exploited by the system since it determines the distribution of work to the processors. When a loop is scheduled such that c consecutive iterations will be executed on each processor, it is possible for c to be too large so that the iterations will not be evenly distributed across the processors. For instance, if the number of iterations in a parallel loop, N, is smaller than the product of the number of processors and the chunk size, pc, some of the processors will not be assigned any iterations, and will remain idle. Less severe load imbalances still may occur even when N>pc, but c is large. To produce the shortest overall execution time, and to maintain the highest level of parallelism, it is important to balance the load across all of the processors.
One measure of parallelism in a program is the execution time speedup ratio $S_p = T_1/T_p$, where $T_1$ is the execution time on a single processor, and $T_p$ is the execution time using p processors. The values of $T_1$ and $T_p$ can be skewed by the specific implementation details of an actual machine, thereby making this metric an unreliable indicator of the parallelism available in a program. Additionally, since the simulations in this study do not measure the actual execution time, it is not possible to measure the execution time speedup of the test programs. As an approximation of the parallelism in these test programs, the memory parallelism with p processors is defined to be $M_p = R_1/R_p$, where $R_1$ is the total number of memory references generated by the program when running on a single processor, and $R_p$ is the number of references generated by processor 0 when executing the program with p processors. This parallelism approximation is analogous to the speedup since processor 0 executes all of the sequential portions of the program, as well as its share of parallel loop iterations, and thus is the processor that generates the most memory references. In terms of execution time, it is reasonable to expect that this processor will take the longest time to complete execution and will be the limiting execution path determining the overall execution time. In any case, this memory parallelism value provides a reasonable approximation of the actual parallelism exploited by a particular scheduling strategy. The memory parallelism of the test programs is shown in Table 3 for static scheduling, single-chunk scheduling, and guided self-scheduling (GSS). It is apparent that all three scheduling strategies exploit approximately the same amount of parallelism, and thus assign approximately the same load to each processor.
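A brief numerical illustration of the memory parallelism metric follows. The reference counts are invented for the example and are not taken from the test programs, and the helper assumes that the parallel references are divided evenly among the processors and that the total number of references does not change when the program is parallelized.

```python
def memory_parallelism(refs_single, refs_proc0):
    """M_p = R_1 / R_p, with R_p the reference count of processor 0."""
    return refs_single / refs_proc0

def proc0_refs(seq_refs, par_refs, p):
    """References issued by processor 0: all sequential ones plus an even share of the rest."""
    return seq_refs + par_refs / p

# Hypothetical program: 1000 sequential and 7000 parallel references, p = 8.
example = memory_parallelism(1000 + 7000, proc0_refs(1000, 7000, 8))
```

In this invented case processor 0 issues 1875 references, so the memory parallelism is 8000/1875, or about 4.27, well below p=8 because of the sequential references.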
Sharing Behavior
As an indication of how widely blocks are shared among processors in these programs, the passive and active sharing statistics are defined and measured. The passive sharing statistic shows the number of processors that ever access a cache block during the execution of the program, averaged over all of the blocks. It is defined to be

$$\frac{1}{n}\sum_{i=1}^{n} a_i,$$

where $a_i$ is the total number of processors that ever access block $i$ during the execution of the program, and $n$ is the total number of unique blocks referenced by the program. This statistic provides an indication of how many processors ever touch a cache block throughout the course of the program's execution. The active sharing statistic, on the other hand, shows the average number of copies of a block that are actually cached in all of the processors when the block is referenced by any processor. It indicates how many processors are actively sharing the same block at the same time. This statistic is defined to be

$$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{R_i}\sum_{r=1}^{R_i} q_{i|r}\right),$$

where $q_{i|r}$ is the number of copies of block $i$ that are cached in all of the processors when $r$ is a memory reference to block $i$, and $R_i$ is the total number of references made to block $i$ by all of the processors. The total number of unique blocks referenced by the program again is $n$.
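The two sharing statistics can be sketched in a few lines. The sketch assumes that the active statistic averages the copy counts over the references to each block and then averages over the blocks; the data layout (dictionaries keyed by block number) and the sample values are hypothetical.

```python
def passive_sharing(accessors):
    """accessors maps block id -> set of processors that ever accessed the block."""
    n = len(accessors)
    return sum(len(procs) for procs in accessors.values()) / n


def active_sharing(copies_at_ref):
    """copies_at_ref maps block id -> list holding, for each reference to the
    block, the number of cached copies present when that reference was made.

    Assumption: average per block first, then average over the n blocks.
    """
    n = len(copies_at_ref)
    return sum(sum(q) / len(q) for q in copies_at_ref.values()) / n


# Illustrative data: block 0 migrates among four processors but is rarely
# cached in more than one of them at a time; block 1 is touched by two.
accessors = {0: {0, 1, 2, 3}, 1: {0, 1}}
copies_at_ref = {0: [1, 1, 2, 1], 1: [1, 1]}

passive = passive_sharing(accessors)       # (4 + 2) / 2 = 3.0
active = active_sharing(copies_at_ref)     # (1.25 + 1.0) / 2 = 1.125
```

The sample data reproduces the qualitative pattern in which the passive statistic is considerably larger than the active statistic.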
The passive sharing statistics shown in Table 4 for the test programs are larger than the corresponding active sharing statistics, which indicates that over the course of a program's execution, blocks are referenced by many more processors than the number that actually have a copy of the same block cached at the same time. That is, although a block may migrate to many different caches in the course of a program's execution, it tends to be present in relatively few caches when it is referenced. This type of sharing pattern occurs when blocks are read by only a few processors between writes, and it helps to explain the shape of the invalidation histograms in Figure 1.
Recall that for the adaptive prefetching scheme, the number of blocks fetched on a miss is set to match the scheduling strategy only during parallel sections of the program. During sequential sections, the number of blocks fetched is set to the value of f shown in the right-hand side of the table. Thus, in this table, and in all of the following tables, the parameter f is the number of blocks fetched on any miss for the nonadaptive prefetching schemes, while it is the number of blocks fetched on a miss only during sequential sections of the program for the adaptive scheme. During the parallel sections with the adaptive scheme, the number of blocks fetched potentially is different for each parallel loop, independent of f .
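The role of the parameter f in the adaptive and nonadaptive schemes can be summarized with a minimal sketch. The function name and arguments are hypothetical, and the sketch assumes that the adaptive scheme fetches exactly as many single-word blocks as there are consecutive iterations in the current chunk.

```python
def blocks_to_fetch(adaptive, in_parallel_loop, chunk_size, f):
    """Number of single-word blocks fetched on a miss.

    adaptive         -- True for the adaptive scheme, False for fixed prefetching
    in_parallel_loop -- True while executing a parallel loop
    chunk_size       -- consecutive iterations currently scheduled on this processor
    f                -- fixed fetch count (used everywhere by the nonadaptive
                        schemes, and only in sequential sections by the adaptive one)
    """
    if adaptive and in_parallel_loop:
        return max(1, chunk_size)   # assumed one-to-one match with the chunk size
    return f


# Adaptive scheme: f matters only in sequential sections.
assert blocks_to_fetch(True, True, chunk_size=16, f=4) == 16
assert blocks_to_fetch(True, False, chunk_size=16, f=4) == 4
# Nonadaptive scheme: f is used on every miss.
assert blocks_to_fetch(False, True, chunk_size=16, f=4) == 4
```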
Considering this explanation of the parameter f, Table 4 also shows how the number of blocks prefetched on a miss interacts with the loop scheduling strategy to affect the active sharing of cache blocks. As the number of single-word blocks fetched on a miss increases (i.e., as f increases), the value of the active sharing statistic tends to increase for the static scheduling strategy since many of the loops in these programs scan through memory using stride-one accesses, and the array elements are sequentially distributed in memory. With these accesses, consecutive locations in memory are accessed by consecutive iterations of the loop, which, with the static scheduling strategy, are executed on different processors. Thus, fetching multiple consecutive cache blocks increases the false sharing. Both the single-chunk scheduling strategy and GSS reduce the amount of false sharing compared to static scheduling with c=1, and thus reduce the active sharing statistic value, since they both schedule several consecutive iterations to execute on the same processor. The multiple consecutive blocks prefetched on a miss are more likely to be referenced by the prefetching processor when using these two scheduling strategies than when using static scheduling due to the order of memory accesses in the consecutive loop iterations. The adaptive prefetching scheme with both GSS and single-chunk scheduling shows the lowest values for the active sharing statistic, indicating that these schemes do the best job of matching the data prefetched to the data that will be referenced by the iterations assigned to execute on each processor. The adaptive scheme also shows the smallest increase in sharing as f increases since with adaptive prefetching this parameter affects only the sequential sections of a program. How these changes in sharing statistics affect the actual memory performance is discussed in the next section.
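The interaction between stride-one accesses, prefetching, and the scheduling strategy can be illustrated with a simplified model. The sketch assumes that iteration k references only single-word block k, that static scheduling with c=1 places iteration k on processor k mod p, and that every iteration is treated as a potential miss point; none of these simplifications come from the simulator itself.

```python
def owner_static(k, p):
    """Static scheduling with c = 1: consecutive iterations on different processors."""
    return k % p


def owner_chunk(k, p, n_iters):
    """Single-chunk scheduling: ceil(N/p) consecutive iterations per processor."""
    chunk = -(-n_iters // p)
    return k // chunk


def useful_prefetch_fraction(n_iters, f, owner):
    """Fraction of the extra blocks prefetched on a miss (blocks k+1 .. k+f-1)
    that are referenced by iterations running on the same processor as k."""
    useful = total = 0
    for k in range(n_iters):
        for d in range(1, f):
            if k + d < n_iters:
                total += 1
                useful += owner(k + d) == owner(k)
    return useful / total if total else 1.0


N, P, F = 64, 4, 4
static_frac = useful_prefetch_fraction(N, F, lambda k: owner_static(k, P))
chunk_frac = useful_prefetch_fraction(N, F, lambda k: owner_chunk(k, P, N))
```

In this model none of the extra prefetched blocks are useful to the fetching processor under static scheduling with c=1, whereas about 90 percent are useful under single-chunk scheduling; the blocks that are not useful are candidates for false sharing.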
Memory Performance
There are several factors that affect the memory performance of a shared memory multiprocessor. The global miss ratio, which is the total number of cache misses produced by all of the processors divided by the total number of references, provides a good indication of how well the caches capture the locality of the application programs. It also provides a first-order approximation of the expected memory delay. A measure of the false sharing among the processors is the average number of invalidations generated by each reference requesting exclusive access to a block from the directory. This statistic is the expected value of the type of histogram shown in Figure 1.
Another factor affecting memory performance is the total network traffic between the caches and the memory modules due to misses, invalidations, and write-backs. The number of bytes required for each type of memory transaction was shown in Table 1. These simulations count the total number of bytes transferred on both the forward and reverse networks for all memory operations. This total is divided by the total number of memory references generated by the program to report the network traffic as the average number of bytes transferred per memory reference. For instance, suppose a processor reads a memory location that misses in its cache. Servicing this miss will require a minimum of 20 bytes to be transferred on the network if f=1 and b=1. If this block is exclusive in another processor's cache, the processor with the exclusive copy must first write it back to memory. This write-back will increase the network traffic by another 20 bytes. Thus, one read miss can produce from 20 to 40 bytes of network traffic depending on the sharing status of the missed block. This example illustrates that there may not be a simple relationship between the miss ratio and the average network traffic, making both of them important performance metrics.
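The read-miss example above can be turned into a short calculation showing how two runs with the same miss ratio can differ in network traffic. The 20-byte costs are those quoted in the example for f=1 and b=1; the reference and miss counts are invented for illustration, and invalidation traffic and the other transaction types of Table 1 are deliberately ignored.

```python
MISS_BYTES = 20        # read miss serviced from memory (f = 1, b = 1)
WRITEBACK_BYTES = 20   # extra traffic when another cache holds the block exclusively


def bytes_per_reference(total_refs, clean_misses, exclusive_misses):
    """Average network bytes per memory reference for read misses only.

    clean_misses     -- misses to blocks not held exclusively elsewhere
    exclusive_misses -- misses to blocks that must first be written back
    """
    total_bytes = (clean_misses * MISS_BYTES
                   + exclusive_misses * (MISS_BYTES + WRITEBACK_BYTES))
    return total_bytes / total_refs


# Both runs have a 5% miss ratio (50 misses in 1000 references).
traffic_all_clean = bytes_per_reference(1000, clean_misses=50, exclusive_misses=0)
traffic_half_excl = bytes_per_reference(1000, clean_misses=25, exclusive_misses=25)
# traffic_all_clean == 1.0 and traffic_half_excl == 1.5 bytes per reference
```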
Since a primary purpose of this study is to evaluate the effectiveness of hardware prefetching with different loop scheduling strategies, it is important to provide an indication of how much cache pollution is generated. The invalidation cache pollution is measured by counting the number of times any block is invalidated in a cache due to a coherence action before the block has been referenced in that cache, and then dividing the total count for all of the processors' caches by the total number of memory references. That is, if $v_i$ is the number of cache blocks prefetched into processor $i$'s cache that are invalidated before being referenced, the invalidation cache pollution is

$$\frac{1}{R}\sum_{i=1}^{p} v_i,$$

where $R$ is the total number of references generated by the $p$ processors. Thus, the invalidation cache pollution is the number of prefetched blocks invalidated in the caches before being referenced, per memory reference. Similarly, the cache overflow pollution is the average number of prefetched but unreferenced blocks each memory reference causes to be evicted from the cache due to the finite size of the cache.
For example, a cache invalidation pollution statistic of 0.25 means that, on the average, each memory reference forces 0.25 blocks that were prefetched into processors' caches to be invalidated before the prefetched blocks were referenced by the processors. Or, equivalently, an invalidation pollution statistic of 0.25 can be interpreted as meaning that, on the average, every fourth memory reference will cause an unreferenced block to be invalidated in a cache. A cache overflow pollution statistic of 0.25, on the other hand, means that, on the average, every fourth memory reference will evict a block from a cache to make room for the new blocks being fetched by the current reference, before the prefetched blocks were ever referenced by the processor.
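The bookkeeping behind the two pollution statistics can be sketched as follows. The class and method names are hypothetical; the sketch only tracks which prefetched blocks are still unreferenced in each cache and is not a cache simulator.

```python
class PollutionCounter:
    """Counts prefetched blocks that leave a cache before being referenced."""

    def __init__(self, p):
        self.unreferenced = [set() for _ in range(p)]  # prefetched, not yet used
        self.invalidated = [0] * p   # per-processor count of unused blocks invalidated
        self.evicted = [0] * p       # per-processor count of unused blocks evicted
        self.refs = 0                # total references by all processors

    def prefetch(self, proc, block):
        self.unreferenced[proc].add(block)

    def reference(self, proc, block):
        self.refs += 1
        self.unreferenced[proc].discard(block)

    def invalidate(self, proc, block):
        if block in self.unreferenced[proc]:
            self.unreferenced[proc].remove(block)
            self.invalidated[proc] += 1

    def evict(self, proc, block):
        if block in self.unreferenced[proc]:
            self.unreferenced[proc].remove(block)
            self.evicted[proc] += 1

    def invalidation_pollution(self):
        return sum(self.invalidated) / self.refs

    def overflow_pollution(self):
        return sum(self.evicted) / self.refs


def demo():
    """Four references, one unused block invalidated, one unused block evicted."""
    c = PollutionCounter(p=2)
    c.reference(0, 10)            # miss on block 10 ...
    c.prefetch(0, 11)             # ... also brings in blocks 11 and 12
    c.prefetch(0, 12)
    c.reference(0, 11)            # block 11 is used
    c.reference(1, 12)            # processor 1 writes block 12 ...
    c.invalidate(0, 12)           # ... invalidating the unused copy in cache 0
    c.prefetch(1, 14)
    c.evict(1, 14)                # unused block 14 is evicted from cache 1
    c.reference(1, 13)
    return c.invalidation_pollution(), c.overflow_pollution()
```

In this small scenario both statistics equal 0.25, matching the interpretation given above that, on average, every fourth reference causes one unreferenced prefetched block to be lost.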
As shown in Table 5 , for all of the programs except trfd, the cache pollution due to false sharing and due to the finite cache size tends to increase for all of the scheduling strategies as the number of blocks fetched on a miss increases. The trfd program has no overflow pollution since its data set is small enough to fit in the data caches, and its lack of false sharing pollution indicates that every prefetched block is referenced at least once before being invalidated. For the other programs, the static scheduling strategy shows the greatest increase in both types of cache pollution due to the placement of data in the memory, as described in the previous section. As expected, the adaptive prefetching strategy tends to have the lowest value of both types of cache pollution since it more precisely matches the data prefetched to the memory references generated by the loop iterations assigned to the processors.
Adaptive prefetching with the single-chunk scheduling strategy tends to have slightly higher cache overflow pollution than adaptive prefetching with GSS since the average number of blocks fetched on a miss with single-chunk scheduling tends to be slightly larger than with GSS. That is, the average number of consecutive iterations scheduled on a processor is slightly smaller with GSS than with the single-chunk approach since the loop blocking factor is fixed with the single-chunk approach, but it continually decreases with GSS as the loop is executed. Since the adaptive prefetching strategy matches the number of blocks fetched to the number of iterations scheduled, fewer blocks tend to be fetched on average when using GSS than when using single-chunk scheduling, thereby reducing the cache overflow pollution GSS generates. The false sharing pollution is roughly the same for both scheduling strategies with adaptive prefetching, however, since they both produce similar sharing patterns.

Table 6 shows that the miss ratios tend to improve for all of the scheduling and prefetching strategies as the number of blocks prefetched on a miss increases, and that the network traffic decreases along with the miss ratio. These trends indicate that the programs are able to exploit the additional spatial locality introduced by the prefetching. In the adaptive schemes, the changes in the miss ratio and the network traffic as the parameter f varies in the table are due only to changes in the sequential sections of the program, since the number of blocks fetched in the parallel sections is determined by the scheduling strategy. Thus, these tables show that prefetching multiple blocks during sequential program sections frequently can further reduce the miss ratios and the network traffic over that provided by prefetching only during the parallel loops (i.e., when f=1).
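The difference between the fixed blocking factor of single-chunk scheduling and the decreasing chunk sizes of GSS, discussed above in connection with overflow pollution, can be sketched as follows. The sketch assumes the commonly described GSS rule in which each scheduling step hands out the ceiling of the remaining iterations divided by p; the function names are hypothetical.

```python
def gss_chunks(n_iters, p):
    """Chunk sizes handed out by guided self-scheduling, assuming each
    scheduling step assigns ceil(remaining / p) consecutive iterations."""
    chunks = []
    remaining = n_iters
    while remaining > 0:
        size = -(-remaining // p)   # ceiling division
        chunks.append(size)
        remaining -= size
    return chunks


def single_chunks(n_iters, p):
    """Chunk sizes for single-chunk scheduling: one block of ceil(N/p)
    consecutive iterations per processor (the last may be shorter)."""
    size = -(-n_iters // p)
    chunks = []
    remaining = n_iters
    while remaining > 0:
        chunks.append(min(size, remaining))
        remaining -= chunks[-1]
    return chunks


gss = gss_chunks(100, 4)        # [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]
single = single_chunks(100, 4)  # [25, 25, 25, 25]
```

With adaptive prefetching, each chunk size also is the number of single-word blocks fetched on a miss while that chunk executes, so the shrinking GSS chunks fetch fewer blocks per miss toward the end of the loop.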
Table 7 shows that the number of invalidations required to maintain coherence increases as f increases due to the increase in false sharing. These additional invalidations have relatively little effect on the network traffic or on the miss ratios due to the relatively small fraction of all memory references that force invalidations, as shown in the last column of Table 2 . This increasing number of invalidations can significantly increase the average memory delay, however, since the directory must wait for each processor with a cached copy of the block being invalidated to acknowledge the invalidation [17] . The adaptive prefetching scheme tends to produce the smallest number of invalidations, and thus will reduce the average memory delay when compared to the nonadaptive schemes.
To summarize these tables, it should be noted that increasing the number of single-word blocks fetched on a miss tends to improve the miss ratios and to reduce the network traffic for all of the scheduling strategies. For static scheduling with c=1, however, this prefetching produces higher cache pollution and a greater number of invalidations than the other scheduling schemes, which then can produce greater average memory delays. The differences in network traffic, miss ratios, and cache pollution between GSS and single-chunk scheduling are quite small. In general, GSS tends to generate a slightly greater number of invalidations than single-chunk scheduling due to its slightly higher active sharing of blocks.
Adaptive prefetching, in which the number of blocks fetched during parallel loops matches the number of consecutive iterations scheduled on the processors, produces approximately the same performance when applied to either the single-chunk scheduling strategy or to GSS. The exception to this generalization is the lin125 program in which GSS outperformed the single-chunk strategy. This exception occurs because the referencing behavior of this program happens to coincide better with the assignment of iterations produced by GSS than by the single-chunk schedule. Since the adaptive scheme tends to prefetch only those blocks likely to be referenced by the iterations scheduled on the processors, it generally further reduces the miss ratios and the network traffic compared to the fixed prefetching schemes. The adaptive scheme also produces significantly lower cache pollution than the fixed schemes, and it generates significantly fewer invalidation messages. Both of these factors will cause the adaptive scheme to produce a lower average memory delay than the nonadaptive schemes since less time will be spent waiting for acknowledgements to invalidation messages, and less time will be spent evicting blocks from the cache.
Temporal Distribution of Invalidations
Another indication of the impact of the prefetching and scheduling strategies on memory performance is how they change the distribution of coherence invalidation messages as a function of time. If invalidations tend to cluster together in time, there will be many invalidation messages traversing the network simultaneously, which will increase the probability of messages interfering with each other in the network. This interference may increase the time a processor must wait for the invalidations to be acknowledged. In addition, this increased network traffic may force delays in the memory transactions of other processors not involved with the invalidation, thereby increasing the average memory delay for all of the processors. If invalidations tend to be uniformly distributed in time, however, their impact on network performance will tend to be minimized.
The temporal distribution of invalidations is determined by counting the number of coherence invalidations generated by each memory reference at "time" $t_j$, where $t_j$ is the number of references that have been issued by processor $j$ since the beginning of the current program section. A program section is defined to be either a parallel loop, or the sequential code between parallel loops. Thus, a new section begins at the start and at the end of each parallel loop. Since each processor in each section can have a different total number of references, the time basis for each section (i.e., the horizontal axis of a histogram) is normalized to 100. The numbers of invalidations generated by the (normalized) $i$th reference in each of the $p$ processors then are added together to produce a histogram of the total number of invalidations generated by all processors within that section. These normalized histograms for each section are summed together to produce two composite histograms showing the time distribution of the invalidations separately for the parallel and sequential sections of the entire program. This normalization process allows a direct comparison of the fraction of all invalidations that are generated within a given period of time in each section.
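The histogramming process described above can be sketched in a few functions. The rule used to map a reference index onto the normalized time axis (integer scaling of the index by the number of bins) is an assumption made for illustration, as are the function names and the sample counts.

```python
def normalized_histogram(invalidations, bins=100):
    """Histogram for one processor in one program section.

    invalidations[i] is the number of invalidations generated by the i-th
    reference issued by the processor since the section began. The reference
    index is rescaled so that every section spans the same number of bins.
    """
    hist = [0] * bins
    n = len(invalidations)
    for i, count in enumerate(invalidations):
        hist[i * bins // n] += count   # assumed binning rule
    return hist


def composite(histograms):
    """Element-wise sum of equally sized histograms (over processors and sections)."""
    return [sum(column) for column in zip(*histograms)]


def cumulative_percent(hist):
    """Percentage of all invalidations generated up to and including each bin."""
    total = sum(hist)
    running = 0
    curve = []
    for count in hist:
        running += count
        curve.append(100 * running / total)
    return curve


# Two processors with different reference counts, using 4 bins for brevity.
h0 = normalized_histogram([1, 0, 0, 2, 0, 0, 0, 1], bins=4)   # [1, 2, 0, 1]
h1 = normalized_histogram([0, 1, 0, 0], bins=4)               # [0, 1, 0, 0]
curve = cumulative_percent(composite([h0, h1]))               # [20.0, 80.0, 80.0, 100.0]
```

A cumulative curve of this kind with a slope of roughly one-to-one corresponds to invalidations that are evenly distributed in time.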
To define this histogramming process more precisely, let $x_{i,s,j}$ be the number of invalidations generated by memory reference $i$ in processor $j$ during section $s$ of the program. The memory reference counter $i$ is reset to zero at the beginning of each new section of the program. The histogram of invalidations for processor $j$ executing program section $s$ is then normalized to 100 by rescaling the reference counter $i$ so that the references issued by that processor within the section span the range from 0 to 100.

The temporal distributions of cache invalidations for the test programs are shown in Figure 2, along with the fraction of all invalidations produced in the parallel and sequential sections in each program. In this figure, the solid line shows the total percentage of invalidations that have been generated at the given time for the sequential sections of each program, while the dotted line shows the distribution of invalidations for the parallel sections of the given program. The slope of these lines indicates how evenly distributed the invalidations are throughout each program section. For instance, the slope of the dotted line for the arc3d, pic, simple24, and lin125 programs is roughly one-to-one, indicating that the first x percent of the memory references generated by the processors during the execution of the parallel loops produce approximately the first x percent of the invalidations. Thus, when executing parallel loops, invalidations tend to be evenly distributed throughout the loop's execution. This uniform distribution suggests that active data sharing among the processors is continuous throughout the parallel loop's execution. The trfd program shows a similarly smooth distribution of invalidations during the first 75 percent of the references in the parallel loops, followed by a sudden burst of invalidations due to a sudden change in its data sharing pattern. The flo52 program, on the other hand, shows distinct phases of invalidations. In this program, there appear to be periods of strictly local computation followed by exchanges of data, which produce the bursts of invalidations.
In contrast to the distribution of invalidations in parallel sections of a program, the distribution in sequential sections appears to be much more bursty, as shown by the solid lines in Figure 2. In all of the programs except pic, more than 50 percent of the invalidations in the sequential sections are generated before 50 percent of the memory references have been issued. In fact, in trfd, the first 16 percent of the program's data references during sequential execution generate 50 percent of the invalidations. In lin125, 50 percent of the invalidations are generated by the first 30 percent of the data references. The exception to this burstiness of invalidations during sequential program sections is the simple24 program, in which all of the invalidations are relatively evenly distributed throughout the program. This burstiness of invalidations during sequential program sections is intuitively reasonable since the single processor executing a sequential section must invalidate blocks cached in other processors before it can write to the blocks. After it has invalidated a block and moved it into its own cache, no other processors will attempt to invalidate the block until the next parallel section. Thus, the burst of invalidations tends to occur at the beginning of the sequential sections. During parallel sections, however, blocks are continuously read and written by different processors so that invalidations are more evenly distributed in time.
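Statements of the form "the first 16 percent of the references generate 50 percent of the invalidations" can be read directly from a normalized histogram. The following sketch shows one way to compute such a figure; the function name and the two sample histograms are hypothetical and do not correspond to any of the test programs.

```python
def time_to_fraction(hist, fraction=0.5):
    """Smallest percentage of normalized time by which at least `fraction`
    of all invalidations in the histogram have been generated."""
    total = sum(hist)
    if total == 0:
        raise ValueError("histogram contains no invalidations")
    running = 0
    for k, count in enumerate(hist):
        running += count
        if running >= fraction * total:
            return 100 * (k + 1) / len(hist)


# Ten-bin examples: a front-loaded (bursty) section and a uniform one.
bursty = [8, 1, 1, 0, 0, 0, 0, 0, 0, 0]
uniform = [1] * 10

bursty_half = time_to_fraction(bursty)     # 10.0: half of the invalidations in the first 10%
uniform_half = time_to_fraction(uniform)   # 50.0: half of the invalidations at the midpoint
```

A value well below 50 indicates invalidations clustered at the beginning of a section, as observed for most sequential sections, while a value near 50 indicates an even distribution.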
The percentages shown below the plots in Figure 2 indicate that almost all of the invalidations are produced during parallel program sections in the arc3d, pic, and lin125 programs. Perhaps not coincidentally, these programs also show the smoothest distribution of invalidations during these parallel sections. Most of the invalidations in flo52 are generated during the sequential sections, and its parallel sections have a correspondingly burstier distribution of invalidations. The trfd and simple24 programs evenly split their total invalidations between the parallel and sequential sections, although the sequential invalidations in trfd still tend to be quite bursty.
While it is not shown in these figures, it was found that the different data prefetching and loop scheduling strategies have very little effect on the temporal distribution of invalidations in the sequential program sections. This insensitivity is not surprising since the different scheduling strategies only affect the execution of the parallel sections of the program. In these parallel sections, increasing the number of blocks fetched on a miss tends to shift the curves slightly to the right indicating that more of the invalidations tend to occur near the end of the parallel loops. This effect is very small, however. Thus, although Table 7 showed that the different scheduling strategies can produce significant differences in the total number of invalidations that they cause to be generated, they have a very small effect on the temporal distribution of these invalidations within the parallel loops. The additional invalidations produced by the increased false sharing when using different scheduling strategies continue to be evenly distributed throughout the execution of a parallel loop. Put another way, the changes in data sharing patterns produced by scheduling and prefetching changes tend to be uniformly distributed throughout a parallel loop's execution.
Conclusion
Trace-driven simulations of numerical Fortran programs were used to examine how hardware data prefetching interacts with the parallel loop scheduling strategy to affect memory performance in a shared memory multiprocessor. The results suggest that to maximize memory performance when scheduling independent loop iterations in this type of multiprocessor, it is important to schedule multiple consecutive iterations to execute on the processors. When multiple consecutive iterations are scheduled on each processor, prefetching multiple single-word blocks on a miss reduces the miss ratio by approximately 5 to 30 percent compared to a system with no prefetching.
The adaptive prefetching strategy presented in this paper tends to further reduce the miss ratio compared to no prefetching, and it tends to reduce the network traffic compared to the nonadaptive prefetching schemes. The network traffic is reduced since this adaptive scheme fetches only those cache blocks that are likely to be referenced by the iterations executing on each processor. The corresponding reduction in false sharing produced by this adaptive prefetching then causes fewer invalidation messages to be generated than those generated by fixed prefetching, which thereby reduces the network traffic. In addition, the average memory delay in a multiprocessor system using this adaptive prefetching will be reduced compared to using a nonadaptive prefetching scheme since the round-trip time to generate and acknowledge an invalidation can be quite long, and the adaptive scheme produces fewer invalidation messages. Also, fetching only those blocks likely to be referenced reduces the number of premature block evictions and replacements due to the finite size of the cache. This more efficient use of the limited cache space will further improve the average memory delay.
The memory performance appears to be relatively insensitive to whether the specific scheduling strategy is guided self-scheduling (GSS), or whether a single chunk of iterations is assigned to each processor at the start of each parallel loop. What is important, however, is to adaptively prefetch single-word cache blocks on a miss to match the number of iterations scheduled on each processor at each scheduling step. The choice between GSS or single-chunk scheduling should be based on the run-time scheduling overhead, and on the load balancing capabilities of the two schemes. While it is important to coordinate the prefetching with the parallel loop scheduling strategy, it also is desirable to fetch a large number of blocks on a miss during the execution of a sequential section of a program. Since there is no false sharing to affect the prefetching when executing a sequential section, the prefetching during these sections should be limited only by the overflow pollution this prefetching causes within the data caches.
This study also examined the temporal distribution of invalidations in both sequential and parallel sections of the test programs. It was found that invalidations tend to be evenly distributed throughout the execution of parallel loops, suggesting that data sharing tends to occur continuously during the execution of parallel sections of a program. In contrast, invalidations produced during sequential sections tend to be much more bursty with the invalidations clustering typically at the beginning of the sequential sections. In addition, the temporal distribution of invalidations appears to be relatively insensitive to the data prefetching strategy and to the parallel loop scheduling strategy.
In summary, the parallel loop scheduling strategy can have a significant impact on the memory performance of a shared memory multiprocessor by altering the block sharing characteristics of a program. Prefetching multiple single-word cache blocks on a miss can help improve the memory performance, but care must be taken to match the prefetching strategy with the loop scheduling strategy. The adaptive prefetching scheme proposed in this paper is one possible mechanism for coordinating data prefetching and parallel loop scheduling to reduce the miss ratios compared to no prefetching, and to reduce the network traffic compared to a nonadaptive prefetching scheme. 
