Software prefetching and locality optimizations are techniques for overcoming the speed gap between processor and memory. In this paper, we evaluate the impact of memory trends on the effectiveness of software prefetching and locality optimizations for three types of applications: regular scientific codes, irregular scientific codes, and pointer-chasing codes. We find that for many applications, software prefetching outperforms locality optimizations when there is sufficient memory bandwidth, but locality optimizations outperform software prefetching under bandwidth-limited conditions. The break-even point (for 1 GHz processors) occurs at roughly 2.5 GBytes/sec on today's memory systems, and will increase on future memory systems. We also study the interactions between software prefetching and locality optimizations when applied in concert. Naively combining the techniques provides robustness to changes in memory bandwidth and latency, but does not yield additional performance gains. We propose and evaluate several algorithms to better integrate software prefetching and locality optimizations, including a modified tiling algorithm, padding for prefetching, and index prefetching.
INTRODUCTION
Current microprocessors spend a large percentage of execution time on memory access stalls, even with large on-chip caches. Since processor speeds are growing at a greater rate than memory speeds, we expect memory access costs to become even more important in the future. Computer architects have been battling this memory wall problem [46] by designing ever larger and more sophisticated caches. Although caches are extremely effective, they are not the complete solution. Other techniques are required to fully address the memory wall problem.
This research was supported in part by NSF Computer Systems Architecture grant CCR-0093110 and NSF CAREER Awards CCR-0000988 and ASC-9625531.
Two promising approaches for improving memory performance are software prefetching and locality optimizations. The first executes explicit prefetch instructions to begin loading data from memory to cache. As long as prefetching begins early enough and the data is not evicted prior to its use, memory access latency can be completely hidden. However, as processor throughput improves due to memory latency tolerance, memory bandwidth use is increased since prefetching increases memory traffic. In comparison, locality optimizations use compiler or run-time transformations to change the computation order and/or data layout of a program to increase the probability that it accesses data already in cache. If successful, both average memory latency and bandwidth usage are reduced, since there will be fewer main memory accesses.
Both software prefetching and locality optimizations have been studied in isolation. In this paper, we examine how well each approach works for three types of data-intensive applications. Our evaluation uses a single unified environment to enable a meaningful comparison. A primary focus of our work is to compare the importance of latency tolerance provided by prefetching and latency reduction provided by locality optimizations on future high-performance memory systems. In addition, our work also investigates the interactions of software prefetching and locality optimizations when applied in concert. The contributions of this paper are as follows:
We compare the efficacy of software prefetching and locality optimizations for three types of data-intensive applications.
We quantify the impact of bandwidth and latency scaling in future memory systems on the relative effectiveness of software prefetching and locality optimizations.
We examine the performance of integrated software prefetching and locality optimizations, then propose and evaluate several enhancements to increase their combined effectiveness.
In the rest of this paper, we will look at three memory access patterns, examine software prefetching and locality optimizations for each type of application, present experimental evaluations for each application, develop improved algorithms, and finally, discuss related work and conclude.
MEMORY ACCESS PATTERNS
The types of software prefetching and locality optimizations which may be applied depend on the type of memory access pattern made by a program. We begin by presenting three important types of memory access patterns.
Affine Array Accesses
The most basic memory access pattern is affine (linear) accesses to multidimensional arrays. For instance, consider the Jacobi code in Figure 1, typically used in multigrid solvers for partial differential equations (PDEs). The value of a point in A is calculated as the average of the values of neighboring points in all three dimensions of B. This stencil pattern is repeatedly applied to each point of A, resulting in a smoother solution. All array accesses are affine because array subscripts are combinations of loop index variables with constant coefficients and additive constants. In practice, the subscripts usually have no coefficients and use only small additive constants. These programs are also called regular codes because their memory access patterns are so regular and well defined.
Affine array accesses are common in dense-matrix linear algebra and finite-difference PDE solvers, as well as database scans and image processing. A major feature of affine array accesses is that they allow memory access patterns to be entirely computed at compile time, assuming array dimension sizes are known. This allows both software prefetching and locality transformations to be calculated precisely at compile time.
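Because the body of Figure 1 is not reproduced in this text, the following is a minimal C sketch of the kind of 3D Jacobi sweep described above. The array names A and B follow the text; the grid size `N`, the six-neighbor average, and the function name `jacobi3d` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N 8  /* assumed grid size per dimension (illustrative only) */

/* One Jacobi sweep: each interior point of A becomes the average of the
   six face neighbors of the same point in B. Every subscript is a loop
   index plus a small constant, so all accesses are affine. */
void jacobi3d(double A[N][N][N], double B[N][N][N])
{
    assert(A != B);  /* Jacobi reads the old grid and writes a separate one */
    for (size_t k = 1; k < N - 1; k++)
        for (size_t j = 1; j < N - 1; j++)
            for (size_t i = 1; i < N - 1; i++)
                A[k][j][i] = (B[k][j][i - 1] + B[k][j][i + 1] +
                              B[k][j - 1][i] + B[k][j + 1][i] +
                              B[k - 1][j][i] + B[k + 1][j][i]) / 6.0;
}
```

Given `N`, a compiler can enumerate every address this loop nest touches, which is what makes compile-time prefetch scheduling and tiling possible for such codes.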
Indexed Array Accesses
Another memory access pattern is called indexed array accesses because the main data array is accessed through a separate index array whose values are unknown at compile time. For example, consider the molecular dynamics code in Figure 1, which calculates forces between pairs of atoms in a molecule. Accesses to the index array E are affine, striding through the array sequentially. However, accesses to arrays X and Y are indexed by the contents of E. Such programs are also called irregular because their memory accesses are not fixed. Irregular accesses typically make it difficult to keep data in cache, resulting in many cache misses and low performance.
Indexed array accesses arise in several scientific application domains as computational scientists attempt more complex simulations. In computational fluid dynamics, meshes for modeling large problems are sparse to reduce memory and computation requirements. In N-body solvers which arise in astrophysics and molecular dynamics, for example, data structures are irregular because they model the positions of particles and their interactions. Unfortunately, those irregular computations have poor temporal and spatial locality, and do not utilize processor caches efficiently. Unlike applications with affine accesses, compile-time transformations alone cannot improve locality because the values of the index array are not known at compile time.
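The indexed pattern can be sketched in C as follows. The array names E, X, and Y follow the text, but the pair layout of E, the placeholder arithmetic, and the function name `pair_forces` are assumptions; the loop is not a real force calculation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interaction list: E holds 2*npairs atom indices, so pair p
   couples atoms E[2*p] and E[2*p + 1]. */
void pair_forces(size_t npairs, const int *E, const double *X, double *Y)
{
    assert(E != NULL && X != NULL && Y != NULL);
    for (size_t p = 0; p < npairs; p++) {
        int a = E[2 * p];        /* affine: E is read sequentially      */
        int b = E[2 * p + 1];
        double f = X[a] - X[b];  /* indexed: the address depends on E   */
        Y[a] += f;               /* placeholder update, not real physics */
        Y[b] -= f;
    }
}
```

Only the reads of E are predictable at compile time. Which elements of X and Y are touched, and in what order, is decided by the run-time contents of E.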
Pointer-Chasing Accesses
Finally, we also consider pointer programs which dynamically allocate memory and use pointer-based data structures such as linked lists, n-ary trees, and other graph structures. For example, consider the list traversal code in Figure 1 which creates a singly-linked list using a data structure node containing a pointer to the next element on the list. To traverse this list, the program must determine the pointer value stored in each node.
As with indexed array accesses, programs utilizing pointers have irregular memory access patterns that cannot be determined at compile time. Additionally, the next node cannot be traversed until the pointer stored in the current node is found, sequentializing memory accesses. These programs are thus known as pointer-chasing codes. Pointer-chasing codes occur in many application domains, including scientific programs which use advanced data structures.
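A minimal C sketch of list creation and traversal is shown below. The structure name `node` follows the text; the helper names and the use of caller-provided storage instead of heap allocation are simplifying assumptions.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int          value;
    struct node *next;   /* address of the next element on the list */
};

/* Link n nodes from caller-provided storage into a singly-linked list
   (a stand-in for dynamic allocation, to keep the sketch short). */
struct node *list_link(struct node *storage, size_t n)
{
    assert(storage != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        storage[i].value = (int)i;
        storage[i].next  = (i + 1 < n) ? &storage[i + 1] : NULL;
    }
    return &storage[0];
}

/* The load of p->next must complete before the address of the following
   node is known, so cache misses along the chain are serialized. */
long list_sum(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```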
SOFTWARE PREFETCHING
Software prefetching relies on the programmer or compiler to insert explicit prefetch instructions into the application code for memory references that are likely to miss in the cache. At run time, the inserted prefetch instructions bring the data into the processor's cache in advance of its use, thus overlapping the cost of the memory access with useful work in the processor. In this section, we briefly describe the software prefetching techniques previously proposed for prefetching different types of memory references.
Affine Array Prefetching
To perform software prefetching for affine array references commonly found in scientific codes, the well-known compiler algorithm for inserting prefetches proposed by Mowry and Gupta [32] is followed. In this algorithm, locality analysis is used to determine which array references are likely to suffer cache misses. The cache-missing memory references are then isolated by performing loop unrolling and loop peeling transformations. Finally, prefetch instructions are inserted for the isolated cache-missing references. Each inserted prefetch is properly scheduled such that there exists ample time between the initiation of the prefetch and the consumption of the data by the processor (known as the prefetch distance) to overlap the latency of the memory access.
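The effect of these steps on a simple reduction loop can be sketched as follows. This is a hand-written illustration, not output of the cited algorithm: it assumes GCC or Clang (for `__builtin_prefetch`), 8-byte doubles with the 32-byte L1 block used later in this paper, and an arbitrary prefetch distance. It also guards each prefetch with a bounds check where a compiler would typically peel off a separate epilogue loop.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_DOUBLES 4   /* assumed: 32-byte cache block / 8-byte double */
#define PD_LINES     8   /* assumed prefetch distance, in cache lines    */

double sum_prefetched(const double *a, size_t n)
{
    assert(a != NULL);
    double sum = 0.0;
    size_t i = 0;

    /* Prologue: start fetching the first PD_LINES cache lines. */
    for (size_t l = 0; l < PD_LINES && l * LINE_DOUBLES < n; l++)
        __builtin_prefetch(&a[l * LINE_DOUBLES], 0, 3);

    /* Steady state, unrolled by one cache line so that one prefetch is
       issued per line (the reference that misses), not per element. */
    for (; i + LINE_DOUBLES <= n; i += LINE_DOUBLES) {
        if (i + PD_LINES * LINE_DOUBLES < n)
            __builtin_prefetch(&a[i + PD_LINES * LINE_DOUBLES], 0, 3);
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }

    for (; i < n; i++)   /* remainder iterations */
        sum += a[i];
    return sum;
}
```

In practice the prefetch distance would be derived from the memory latency and the cost of one loop iteration rather than fixed as a constant.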
Indexed Array Prefetching
Indexed array accesses, of the form A(B(i)), are common in irregular scientific codes. The prefetch algorithm for indexed array accesses, originally proposed in [30], is similar to the algorithm for affine array prefetching. The main difference lies in how prefetch requests are scheduled. In affine array prefetching, each prefetch is scheduled early enough to tolerate the latency of a single cache miss. For indexed array references, the memory indirection between the index array and data array requires more sophisticated prefetch scheduling. If both the index array and the data array references miss in the cache, then the memory latency of two serialized cache misses, rather than just one, must be tolerated. Hence, the prefetch algorithm must schedule the prefetch for the index array access two cache miss times prior to the iteration that consumes the data, and schedule the prefetch for the data array one cache miss time prior to the iteration that consumes the data.
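The two-level schedule can be sketched in C as below, assuming GCC or Clang for `__builtin_prefetch`. The constant `PD` (the number of iterations that cover one cache miss time) and the function name are assumptions, and the sketch prefetches on every iteration for clarity rather than once per cache line.

```c
#include <assert.h>
#include <stddef.h>

#define PD 16  /* assumed: iterations needed to cover one cache miss time */

double gather_sum(const double *A, const int *B, size_t n)
{
    assert(A != NULL && B != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 2 * PD < n)   /* index array: two miss times ahead */
            __builtin_prefetch(&B[i + 2 * PD], 0, 3);
        if (i + PD < n)       /* data array: one miss time ahead   */
            __builtin_prefetch(&A[B[i + PD]], 0, 3);
        sum += A[B[i]];
    }
    return sum;
}
```

Note that computing the data prefetch address requires an ordinary load of `B[i + PD]`; that load is expected to hit in cache only because the index prefetch for it was issued `PD` iterations earlier.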
Pointer-Chasing Prefetching
Prefetching for pointer-based data structures is challenging due to the memory serialization effects associated with traversing pointer structures. The memory operations performed for array traversal can issue in parallel because individual array elements can be referenced independently. In contrast, the memory operations performed for pointer traversal must dereference a series of pointers, a purely sequential operation. The sequentiality of pointer chasing prevents conventional prefetching techniques from overlapping cache misses suffered along a pointer chain, thus limiting their effectiveness.
Jump pointer prefetching [41, 25] is a promising approach for addressing the pointer-chasing problem. In jump pointer prefetching, additional pointers are inserted into a dynamic data structure to connect non-consecutive link elements. These "jump pointers" allow prefetch instructions to name link elements further down the pointer chain (i.e., a prefetch distance away) without sequentially traversing the intermediate links. Consequently, prefetch instructions can overlap the fetch of multiple link elements simultaneously by issuing prefetches through the memory addresses stored in the jump pointers. Figure 2 illustrates a while loop that has been instrumented with jump pointer prefetching. Jump pointer prefetching, however, cannot prefetch the first prefetch distance number of link elements in a linked list because there are no jump pointers that point to these early nodes. To enable prefetching of early nodes, jump pointer prefetching can be extended with prefetch arrays [21]. In this technique, an array of prefetch pointers is added to every linked list to point to the first prefetch distance number of link elements. Hence, prefetches can be issued through the memory addresses in the prefetch arrays before traversing each linked list to cover the early nodes, as illustrated in Figure 2.

    // Prefetching a linked list using
    // prefetch arrays and jump pointers
    for (i = 0; i < PD; i++) {
        prefetch(the_list.prefetch_array[i]);
    }
    ptr = the_list.head;
    while (ptr->next) {
        prefetch(ptr->jump);
        ...
        ptr = ptr->next;
    }

Figure 2: Prefetch array and jump pointers code.
In addition to inserting the prefetch instructions, both jump pointer prefetching and prefetch arrays require insertion of code to create and maintain the prefetch pointers as the data structure is modified (not shown in Figure 2).
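One simple way to create the prefetch pointers is to recompute all of them in a single pass over the list, as in the following C sketch. The type and function names, the fixed distance `PD`, and the rebuild-from-scratch policy are assumptions; an incremental scheme that updates pointers on each insert or delete would avoid the full pass but is more involved.

```c
#include <assert.h>
#include <stddef.h>

#define PD 4  /* assumed prefetch distance, in list nodes */

struct jnode {
    int           value;
    struct jnode *next;
    struct jnode *jump;   /* node PD links ahead, or NULL near the tail */
};

struct jlist {
    struct jnode *head;
    struct jnode *prefetch_array[PD];  /* the first PD nodes of the list */
};

/* Recompute every jump pointer and the prefetch array in one traversal. */
void jlist_install(struct jlist *l)
{
    assert(l != NULL);
    for (size_t i = 0; i < PD; i++)
        l->prefetch_array[i] = NULL;

    struct jnode *lag = l->head;   /* trails p by PD links once i >= PD */
    size_t i = 0;
    for (struct jnode *p = l->head; p != NULL; p = p->next, i++) {
        p->jump = NULL;
        if (i < PD) {
            l->prefetch_array[i] = p;
        } else {
            lag->jump = p;
            lag = lag->next;
        }
    }
}
```

The extra `jump` field and the per-list array are the storage that later sections charge to pointer prefetching as added memory traffic.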
LOCALITY OPTIMIZATIONS
Software prefetching tries to hide memory latency while retaining the original program structure. Another alternative is to reduce memory costs by changing the computation order and data layout of the program at compile and run times. These locality optimizations try to improve data locality, the ability of an application to reuse data in the cache [45]. Reuse may be in the form of temporal locality, where the same cache line is accessed multiple times, or spatial locality, where nearby data is accessed together on the same cache line. Previous researchers have developed many locality optimizations. In this section we consider optimizations for the three types of data-intensive applications which we are investigating.
Tiling for Affine Accesses
Programs with affine array accesses are the easiest for compilers to apply locality optimizations to, since memory access patterns can be fully analyzed at compile time. One useful program transformation is tiling (or blocking), which uses loop transformations to form small blocks of loop iterations which are executed together to exploit data locality [43, 45]. Figure 3 demonstrates how the 3D Jacobi code can be tiled.
By rearranging the loop structure so that the data accessed by the innermost loops can fit in cache (due to fewer iterations), tiling allows reuse to be exploited on all the tiled dimensions [38].
A major problem with tiling is that limited cache associativity may cause data in a tile to be mapped onto the same cache lines, even though there is sufficient space in the cache. Conflict misses will result, causing tile data to be evicted from cache before they may be reused [23]. This effect is shown in Figure 4. Previous research found that tile size selection and array padding can be applied to avoid conflict misses in tiles. Tile-size-selection algorithms carefully select tile dimensions tailored to individual array dimensions so that no conflicts occur [11]. Array padding expands leading array dimensions, increasing the range of non-conflicting tile shapes [36] and improving the performance of tiled codes over a range of problem sizes [35, 38]. In this paper, we apply a combination of both algorithms to tile both 2D linear algebra and 3D PDE solvers [37, 38].
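Since the body of Figure 3 is not reproduced in this text, the following C sketch shows one common way to tile a 3D Jacobi sweep: the two inner dimensions are blocked and the outer dimension sweeps through each block. The grid size `N`, the tile dimensions `TI` and `TJ`, and the function names are assumptions chosen for illustration, not the tile sizes selected by the algorithms cited above.

```c
#include <assert.h>
#include <stddef.h>

#define N  10  /* assumed grid size per dimension        */
#define TI 4   /* assumed tile width  (i, contiguous)    */
#define TJ 3   /* assumed tile height (j)                */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Same result as an untiled sweep; only the iteration order changes.
   For each (jj, ii) tile, the k loop walks through planes so that the
   TJ x TI slices of neighboring planes are reused while still cached. */
void jacobi3d_tiled(double A[N][N][N], double B[N][N][N])
{
    assert(A != B);
    for (size_t jj = 1; jj < N - 1; jj += TJ)
        for (size_t ii = 1; ii < N - 1; ii += TI)
            for (size_t k = 1; k < N - 1; k++)
                for (size_t j = jj; j < min_sz(jj + TJ, N - 1); j++)
                    for (size_t i = ii; i < min_sz(ii + TI, N - 1); i++)
                        A[k][j][i] = (B[k][j][i - 1] + B[k][j][i + 1] +
                                      B[k][j - 1][i] + B[k][j + 1][i] +
                                      B[k - 1][j][i] + B[k + 1][j][i]) / 6.0;
}
```

The innermost loop now runs for at most `TI` iterations, which is the property that later makes prefetch startup overhead noticeable in tiled code.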
Reordering for Indexed Accesses
Index arrays arise in scientific applications such as sparse mesh PDE solvers and molecular dynamics codes, where the access pattern is determined at run time. Unfortunately, index arrays cause data to be accessed in an irregular manner, making spatial locality (reuse of data on a cache line) unlikely when the data is larger than the cache.
Researchers have discovered recently that run-time data and computation transformations can improve the locality of irregular computations [1, 13, 28, 29]. Because computations are typically commutative, loop iterations can be safely reordered to bring accesses to the same data closer together in time. Data layout can also be transformed so that data accesses are more likely to be to the same cache line. These compiler and run-time transformations can be automated using an inspector-executor approach developed for message-passing machines [12].
In this paper, we apply efficient partitioning algorithms to the input data to bring reuse closer together, then follow up by lexicographically sorting loop iterations based on data access patterns, using algorithms specified elsewhere [17, 18]. Our partitioning algorithm works by viewing data accesses as a graph, then applying a series of graph coarsening passes where connected data is put into the same cluster. The resulting programs achieve much better use of processor caches.
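The second step, lexicographic sorting of loop iterations, can be sketched in C as follows. The sketch covers only the sort and omits the graph partitioning pass; the `pair` representation of an iteration and the function names are assumptions, and the reordering is valid only when the loop's iterations commute.

```c
#include <assert.h>
#include <stdlib.h>

struct pair { int a, b; };   /* one loop iteration touches data a and b */

/* Order iterations by first index, then by second index. */
static int pair_cmp(const void *x, const void *y)
{
    const struct pair *p = x, *q = y;
    if (p->a != q->a)
        return (p->a > q->a) - (p->a < q->a);
    return (p->b > q->b) - (p->b < q->b);
}

/* Inspector step: reorder the iteration list once, before the main
   (executor) loop runs over it many times. */
void reorder_iterations(struct pair *iters, size_t n)
{
    assert(iters != NULL || n == 0);
    qsort(iters, n, sizeof iters[0], pair_cmp);
}
```

After sorting, iterations that touch the same first datum run back to back, so that datum is likely still in cache when it is needed again.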
Memory Allocation For Pointers
Pointer-based programs frequently suffer from poor locality. They are also notoriously difficult to analyze and transform because of their reliance on pointers and heap-allocated recursive data structures. Researchers recently have developed cache-conscious heap allocation and transformation techniques to improve locality for pointer-based programs [3, 9]. Algorithms include run-time tree optimization routines which place parent nodes with child nodes for improved locality, and coloring when placing tree nodes to avoid conflict with the root.
Figure 3: Tiled 3D Jacobi example.
Of particular interest is CCMALLOC, a customized memory allocator which allocates memory in a location near to a user-specified address. A heuristic which proved effective reserves space for future allocation requests when allocating new blocks of data [9]. Using this memory allocator, multiple members of a linked list are thus more likely to be in adjacent memory locations. Not only does this take advantage of hardware prefetching of long cache lines, but cache line utilization increases and fragmentation is reduced, decreasing the probability that useful cache lines will be evicted from cache. In this paper, we applied this optimization to our pointer-chasing benchmark codes.
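The idea of allocating near a given address, with space reserved for later requests, can be illustrated with the toy allocator below. It is not CCMALLOC and does not reproduce its interface: the fixed-size static pool, the names `alloc_near`, `cnode`, and `block`, and the block size are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_NODES 4     /* assumed: list nodes per cache block */
#define MAX_BLOCKS  64    /* assumed pool capacity               */

struct cnode { int value; struct cnode *next; };
struct block { struct cnode slot[BLOCK_NODES]; size_t used; };

static struct block pool[MAX_BLOCKS];
static size_t blocks_used;

/* Allocate a node in the same block as `near` when that block still has
   a free slot. Otherwise open a fresh block, which implicitly reserves
   its remaining slots for later requests made near the new node.
   `near` must be NULL or a pointer previously returned by alloc_near. */
struct cnode *alloc_near(const struct cnode *near)
{
    if (near != NULL) {
        size_t b = (size_t)((const char *)near - (const char *)pool)
                   / sizeof(struct block);
        assert(b < blocks_used);
        if (pool[b].used < BLOCK_NODES)
            return &pool[b].slot[pool[b].used++];
    }
    if (blocks_used == MAX_BLOCKS)
        return NULL;              /* pool exhausted */
    struct block *nb = &pool[blocks_used++];
    nb->used = 1;
    return &nb->slot[0];
}
```

A list built by passing the previous node as the hint ends up with consecutive nodes packed into the same block, which is the layout property the text attributes to cache-conscious allocation.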
EXPERIMENTAL EVALUATION
This section evaluates the performance of software prefetching and locality optimizations independently and in concert. We first describe our experimental methodology. Then, we compare software prefetching and locality optimizations under different memory bandwidths and latencies, and finally, we study their combination.
Methodology
Our experimental evaluation employs 7 benchmarks, representing the 3 classes of data-intensive applications described in Section 2. Table 1 lists the benchmarks along with their problem sizes and memory access patterns.
The first three applications in Table 1
Table 1: Benchmark summary.
For each application, we applied software prefetching and locality optimizations by hand, first in isolation, then in combination. We followed the algorithms described in Sections 3 and 4, applying the appropriate algorithm to each application given its memory access pattern. We then measured the performance of the optimized codes on a detailed architectural simulator.
Our simulator is based on the SimpleScalar tool set [2] and models a 1 GHz 4-way issue dynamically-scheduled processor. The simulator models all aspects of the processor including the instruction fetch unit, the branch predictor, register renaming, the functional unit pipelines, and the reorder buffer. In addition, our simulator also models the memory system in detail. We assume a split 8-Kbyte direct-mapped L1 cache with 32-byte cache blocks, and a unified 256-Kbyte 4-way set-associative L2 cache with 64-byte cache blocks. Although the caches are small, they match the input data sets required for simulation.
To study the sensitivity of our software prefetching and locality optimization techniques to available memory bandwidth, we modified the SimpleScalar simulator to model bus contention across the L2-memory bus, varied the L2-memory bus bandwidth between 1 and 64 Gbytes/sec (note that a bandwidth of 1 Gbyte/sec is equivalent to the processor loading one byte per cycle), and varied the L2-memory latency from 80 to 640 cycles. These parameters capture characteristics of future architectures, where processors will be much faster than large DRAM memories. For all experiments, transfers across the L1-L2 bus incur a 7-cycle latency, and experience no contention.
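As a worked check of the unit equivalence stated above, assume that 1 Gbyte means 10^9 bytes and that the bus has no protocol overhead (both are assumptions, not stated in the text). With bandwidth B and clock frequency f = 1 GHz, the bytes delivered per processor cycle and the time to transfer one 64-byte L2 block are:

```latex
\[
\frac{B}{f} = \frac{10^{9}\ \text{bytes/sec}}{10^{9}\ \text{cycles/sec}} = 1\ \text{byte/cycle},
\qquad
t_{\text{block}} = \frac{64\ \text{bytes}}{B/f} =
\begin{cases}
64\ \text{cycles} & \text{at } B = 1\ \text{Gbyte/sec},\\
1\ \text{cycle}   & \text{at } B = 64\ \text{Gbytes/sec}.
\end{cases}
\]
```

Under these assumptions, at the low end of the simulated range a single block transfer occupies the bus for a time comparable to the 80-cycle base latency, while at the high end the transfer time is negligible.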
Varying Memory Bandwidth
We evaluate the performance of software prefetching and locality optimizations under memory bandwidth scaling. In Figure 5, we plot execution time along the y-axis, and vary memory bandwidth from 1 to 64 Gbytes/sec along the x-axis, keeping memory latency fixed at 80 cycles. Each execution-time bar is broken down into memory stall, software overhead, and computation components. Groups of bars represent the original version of each program, and versions optimized with either software prefetching, locality optimization, or both. In this section, we focus only on applying the techniques in isolation. Later in Section 5.4, we will examine the techniques in combination.
For the affine array and indexed array benchmarks, both software prefetching and locality optimizations provide significant performance gains, improving performance on average by 46% and 42%, respectively, over unoptimized codes.
Comparing the techniques, we see two major differences. First, software prefetching suffers more overhead due to prefetch and related address computation instructions. Tiling incurs some overhead for the extra levels of loops, but this is minimal. We do not include preprocessing overhead for data reordering because it can be amortized over many loop iterations [18].
Second, the relative effectiveness of software prefetching and locality optimizations in eliminating memory stalls depends on available memory bandwidth. At high memory bandwidths, software prefetching eliminates practically all memory stalls since the memory system can sustain the simultaneous memory requests necessary to hide all the memory latency. As memory bandwidth is reduced, memory requests must serialize, thus software prefetching loses its effectiveness. In contrast, locality optimizations reduce memory latency, and hence, memory traffic. This makes them highly effective at low bandwidths where reduced traffic pays off. However, locality optimizations cannot eliminate all memory stalls, so they achieve a lower maximum performance compared to software prefetching. Consequently, for the 5 array-based benchmarks, software prefetching outperforms locality optimizations at high memory bandwidths, while locality optimizations outperform software prefetching at low memory bandwidths.
To quantify this effect, Table 2 reports the memory bandwidths at which software prefetching and locality optimizations achieve equal performance. Memory systems providing memory bandwidths higher than this equi-performance bandwidth favor software prefetching, while those providing lower memory bandwidths favor locality optimizations. For an 80-cycle memory latency corresponding to the data in Figure 5, Table 2 shows the average equi-performance bandwidth is 2.54 Gbytes/sec. Such a large equi-performance bandwidth underscores the importance of latency reduction techniques, and implies future memory systems must provide significant memory bandwidth before prefetching can outperform locality optimizations on these data-intensive applications.
Table 2: Equi-performance bandwidths for 80, 160, 320, and 640-cycle memory latencies. The last column reports the average over the 5 benchmarks. All memory bandwidths are in Gbytes/sec.
For the pointer-chasing benchmarks, locality optimization outperforms software prefetching at all memory bandwidths. This is due to three factors. First, pointer prefetching incurs high software overhead to create and manage jump pointers. Software overhead in Health and MST is 131% and 70%, respectively, of the busy component. In contrast, CCMALLOC memory allocation incurs no measurable overhead. Second, the traversal loops in our pointer-chasing codes are short, particularly for MST, and do not provide sufficient work under which to hide memory latency. Hence, software prefetching cannot eliminate all memory stalls. Finally, pointer prefetching requires jump pointer storage that increases the cache miss rate and memory bandwidth consumption, making the optimized code even more data-intensive than the original code. As Figure 5 shows, software prefetching in Health and MST performs worse than the original code at low memory bandwidths.
Varying Memory Latency
Each version of a program (original, prefetch, optimized, both) is displayed in a separate graph. Once again, we focus on applying the techniques in isolation, leaving a discussion of the combined techniques to Section 5.4.
Not surprisingly, execution time for all program versions increases as we scale memory latency. For the affine array and indexed array benchmarks, software prefetching effectively hides the increasing memory latencies given sufficient memory bandwidth. In contrast, locality optimizations suffer performance degradation as memory latencies grow; however, they still enjoy the benefit of reduced traffic at low memory bandwidths. As a result, software prefetching outperforms locality optimizations at high memory bandwidths, while locality optimizations outperform software prefetching at low memory bandwidths for all the memory latencies we simulated. Table 2 shows equi-performance bandwidths generally increase with memory latency. Consequently, on future systems with high memory latencies, greater memory bandwidth will be required before software prefetching demonstrates a performance advantage over locality optimizations.
For the pointer-chasing benchmarks, locality optimization outperforms software prefetching at all memory latencies and bandwidths. The same reasons given in Section 5.2 for the reduced effectiveness of software prefetching on pointer-based data structures explain locality optimization's performance advantage at higher memory latencies.
Combined Techniques
This section evaluates software prefetching and locality optimizations in combination. We created combined versions of our benchmarks in the following manner. For the affine array benchmarks, we applied software prefetching to the innermost tiled loops. For the indexed array and pointer-chasing benchmarks, software prefetching and locality optimizations modify distinct parts of the code. Hence, for these programs, we simply merge the modified portions of the software prefetching and locality optimization program versions. Results are reported in Figures 5, 6, and 7, under "Pref+Opt."
In Figure 8 we also summarize the average performance of each version of the program relative to memory bandwidth and latency. Performance is first normalized relative to the original program (with bandwidth of 1 Gbyte/sec and latency of 80 cycles), then averaged over all programs for each memory bandwidth or latency. Simulations show results vary depending on memory bandwidth and latency.
Software prefetching is very sensitive to available memory bandwidth. When bandwidth is very low, software prefetching increases overhead without reducing memory costs. The combined algorithm thus performs slightly worse than locality optimizations alone. In comparison, when memory latencies are very high, combining software prefetching and locality optimizations usually yields better performance than applying either one alone. As Figure 8 shows, combining is much better than prefetching, and only slightly better than locality optimizations.
Under certain conditions, combining techniques encounters high overhead. For the affine array benchmarks, tiling significantly reduces the number of iterations in the innermost loop. When prefetching is applied to these short tiled loops, the software pipeline startup overhead incurred by prefetching becomes significant, reducing the amount of memory latency hidden. This effect is apparent in the high CPU overhead in the "Pref+Opt" versions of Matmult and Jacobi in Figure 5. Combining also inherits the overheads from both software prefetching and tiling, further reducing its performance relative to software prefetching alone.
For pointer-chasing benchmarks, combining always underperforms CCMALLOC memory allocation alone at low memory bandwidths. The extra jump pointers and prefetch arrays required for pointer prefetching increase the demand for memory bandwidth, thus partially negating the reduced traffic benefits achieved by CCMALLOC memory allocation in the combined version. The combined version also underperforms CCMALLOC memory allocation at high memory bandwidths in MST. As described previously, software prefetching for the short list traversal loops in MST is ineffective; hence, combining software prefetching with CCMALLOC memory allocation only adds overhead without reducing memory stalls.
Finally, because combining exploits both latency tolerance and latency reduction, it is less sensitive to variations in memory bandwidth and latency than either technique in isolation. Robust performance is valuable when bandwidth and latency parameters on the target system are not available to the compiler, or when the compiler must produce a single optimized code for heterogeneous systems.
ALGORITHM ENHANCEMENTS
In addition to evaluating the effects of memory bandwidth and latency scaling on performance, our simulations also point out a number of ways to enhance both software prefetching and locality optimizations.
Tiling and Prefetching
One problem with combining tiling and software prefetching naively is the high startup overhead from prefetching short tiled loops. We can improve performance by modifying the tiling algorithm to select tiles with more iterations in the innermost loop. Our tiling heuristic uses the Euclidean GCD algorithm [11, 37] to generate a series of non-conflicting tile shapes. Although tiles with a square aspect ratio typically achieve the best cache utilization, we can bias the selection towards taller tiles with a greater height-to-width aspect ratio. Such tall tiles have more iterations in their innermost loop compared to square tiles, thus reducing startup overheads when used in combination with software prefetching. Table 3 reports both square and tall tile sizes for our 3 affine array benchmarks. Figure 9 presents the tall-tile results with and without prefetching for Matmult, Jacobi, and RedBlack, and compares them to the corresponding square-tile results from Figure 6. Notice that tall tiles and square tiles alone achieve similar performance. However, when combined with software prefetching, tall tiles significantly reduce the short-loop overheads suffered at high bandwidths when using square tiles, matching the performance of software prefetching alone from Figure 6. These simulation results demonstrate that tall tiles allow us to fully exploit the benefits of software prefetching and tiling simultaneously.
Padding for Software Prefetching
While software prefetching can hide memory latency given sufficient memory bandwidth, conflict misses on prefetched data can degrade or even completely eliminate benefits. In our experiments, we found that prefetching for affine array codes may require array padding, particularly if the set associativity of the L2 cache is low. The problem is that for some applications and problematic data sizes, severe conflict misses may result, with all prefetched data being mapped to the same cache lines. This problem is especially acute for affine accesses to arrays whose dimensions are near a multiple of the cache size, since adjacent array elements will conflict in cache.
Compilers can avoid this problem and pad arrays to avoid prefetch conflicts [36, 37], even if loops are actually tiled. The approach we employ is to treat the distance between the prefetch data and the actual data as the "height" of a tile, with the variable references determining the tile width.
Compiler analysis can then use the Euclidean GCD algorithm to determine whether cache conflicts will occur within this tile, padding leading array dimensions until conflicts are eliminated [37, 38]. This ensures that prefetched data will be able to stay in cache until they are used by the processor. Figure 10 presents experiments demonstrating the utility of combining array padding with prefetching. Versions of the program Jacobi were created with and without both padding and prefetching. We used a 2-way associative L2 cache in these simulations for the purpose of illustration, since 4-way caches can eliminate conflicts in Jacobi but not in more complicated programs. We chose a problem size of 256 × 256 × 8. These power-of-two problem sizes occur frequently in multigrid codes due to the need to use a series of meshes of increasing granularity. Based on the prefetch distance, our Euclidean algorithm chose to pad the array to 313 × 256 × 8 to eliminate conflicts.
Simulation results show that Jacobi experiences many cache conflicts which array padding can eliminate, improving performance. Software prefetching alone does not help, since conflicts evict prefetched cache lines before their use. Once padding is applied, prefetching can improve performance beyond that achieved by padding alone.
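The idea of treating the prefetch distance as a tile height can be sketched as follows. This is a simplified model under stated assumptions, not the compiler analysis used in our experiments: the cache is direct-mapped and measured in elements, each array reference is identified by a column offset (for example -1, 0, +1 for a stencil), and a brute-force search over pad amounts stands in for the Euclidean GCD analysis. The function names and the numbers in the example are hypothetical and do not correspond to the Jacobi configuration above.

```python
from itertools import combinations


def circ_dist(x, y, m):
    """Distance between two positions on a circle of circumference m."""
    d = abs(x - y) % m
    return min(d, m - d)


def has_prefetch_conflict(ld, cache_elems, col_offsets, prefetch_dist):
    """True if the windows of prefetched-but-unused data belonging to two
    references overlap in cache. Each reference keeps `prefetch_dist`
    elements live, starting at its column offset times the leading dimension."""
    starts = [(off * ld) % cache_elems for off in col_offsets]
    return any(circ_dist(a, b, cache_elems) < prefetch_dist
               for a, b in combinations(starts, 2))


def pad_leading_dim(ld, cache_elems, col_offsets, prefetch_dist):
    """Smallest padded leading dimension >= ld with no prefetch conflicts."""
    for pad in range(cache_elems + 1):
        if not has_prefetch_conflict(ld + pad, cache_elems,
                                     col_offsets, prefetch_dist):
            return ld + pad
    raise ValueError("no conflict-free padding found for these parameters")


# Illustrative numbers only: columns of 1024 elements in a 2048-element cache,
# three stencil columns, and a prefetch distance of 64 elements.
padded = pad_leading_dim(1024, 2048, (-1, 0, 1), 64)
print("padded leading dimension:", padded)
```

With an unpadded leading dimension of 1024, the columns at offsets -1 and +1 map to the same cache location, so each would evict the other's prefetched data; the search pads until their live windows are a full prefetch distance apart.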
CCMALLOC and Prefetching
Software prefetching for pointer-chasing codes suffers high overhead to create and manage jump pointers. However, jump pointers may not be necessary when prefetching is combined with CCMALLOC memory allocation. Since intelligent allocation places link nodes contiguously in memory, prefetch instructions can access future link nodes by indexing, just as for affine array accesses. This approach, which we refer to as index prefetching, was originally pro- Results are shown in Figure 11.
The upper portion of Figure 11 compares index prefetching to the original versions with prefetch arrays and CCMALLOC allocation alone as well as in combination, assuming a memory latency of 80 cycles. The data shows that index prefetching indeed eliminates most of the software overheads incurred by prefetch arrays. As a result, index prefetching outperforms all other optimized versions at high memory bandwidths for both Health and MST. Index prefetching performs slightly worse than CCMALLOC allocation alone though, especially in MST, due to prefetching of link nodes that are conditionally accessed, increasing memory bandwidth consumption.
While index prefetching reduces software overheads, it is not as effective in eliminating memory stalls as prefetch arrays for Health. In Health, many link nodes are deleted and re-inserted into linked lists frequently. Contiguous allocation, and hence index prefetching, for such dynamic lists is useless since the layout of link nodes becomes random after a few delete-insert operations. As the upper portion of Figure 11 shows for Health, index prefetching hides less memory latency than prefetch arrays due to frequent delete-insert operations. At larger memory latencies, the increased memory stalls outweigh the reduced software overheads, and combining CCMALLOC and prefetch arrays naively outperforms index prefetching.
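The sensitivity of index prefetching to list churn can be illustrated with a small model. This is a sketch under assumptions, not the Health benchmark or our simulator: each node occupies a fixed slot in a contiguous pool, the list is represented by the sequence of slots in traversal order, and an index prefetch issued while visiting the node in slot s targets slot s + dist. A prefetch counts as useful only if the node visited dist steps later actually lives in that slot. The names `useful_fraction` and `churn`, the pool size, and the number of operations are all hypothetical.

```python
import random


def useful_fraction(order, dist):
    """Fraction of index prefetches that fetch the node actually visited
    `dist` steps later. `order[t]` is the slot of the t-th node traversed."""
    n = len(order) - dist
    hits = sum(1 for t in range(n) if order[t + dist] == order[t] + dist)
    return hits / n


def churn(order, ops, rng):
    """Apply `ops` delete-insert operations: unlink a random node and relink
    it at a random list position. The node keeps its memory slot, so only
    the traversal order changes."""
    order = list(order)
    for _ in range(ops):
        node = order.pop(rng.randrange(len(order)))
        order.insert(rng.randrange(len(order) + 1), node)
    return order


rng = random.Random(0)          # fixed seed so the run is repeatable
fresh = list(range(1000))       # contiguous allocation, traversal in slot order
churned = churn(fresh, 5000, rng)

print("useful prefetches, fresh list:  ", useful_fraction(fresh, 4))
print("useful prefetches, churned list:", useful_fraction(churned, 4))
```

For the freshly allocated list every index prefetch is useful, while after heavy churn almost none are, since the traversal order no longer follows slot order. Prefetch arrays store explicit pointers and therefore do not depend on layout in this way, at the cost of the software overhead discussed above.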
RELATED WORK
Our work is most similar to Saavedra et al. [42], which evaluated unimodular transformations, tiling, and software prefetching for matrix multiply. Mowry et al. [33] also evaluated software prefetching and tiling for two scientific applications. In comparison, this paper focuses on memory trends and quantifies their impact on software prefetching and locality optimizations. Prior work has considered a single technology point only. Furthermore, we examine 3 classes of applications requiring different types of optimizations to study the memory trend effects in a broader context. We also propose enhancements to address problems that arise when combining techniques. Finally, compared to [42], which used a cache simulator to evaluate performance, we use detailed execution-driven simulation of a modern processor.
Although relatively little work has compared software prefetching and locality optimizations, a large body of work has studied the techniques in isolation. Software prefetching for affine array accesses has been studied in 31
CONCLUSION
Several conclusions can be drawn from our work. First, the relative effectiveness of software prefetching and locality optimizations depends on available memory bandwidth. For our array-based benchmarks, software prefetching outperforms locality optimizations at high memory bandwidths, while locality optimizations outperform software prefetching at low memory bandwidths. The equi-performance bandwidth is 2.5 GBytes/sec on today's memory systems, but will increase as memory latencies increase in the future. However, locality optimizations outperform software prefetching for the pointer-chasing benchmarks at all memory bandwidths and latencies due to the reduced effectiveness of prefetching for pointer-based data structures.
Second, combining software prefetching and locality optimizations inherits the merits of both techniques. Combining yields better performance than either software prefetching or locality optimizations alone when memory latency is very high, since it exploits both. Combining is also more robust to changes in memory system parameters than either latency tolerance or latency reduction techniques in isolation. However, naively combining techniques does not outperform the best choice amongst software prefetching and locality optimizations alone at all bandwidths and latencies.
Finally, the combined effectiveness of software prefetching and locality optimizations can be enhanced through new algorithms. For affine array benchmarks, tall-tile selection reduces prefetch startup overheads, allowing combining to outperform software prefetching and locality optimizations alone for practically all memory bandwidths and latencies. Also, padding can remove conflicts between prefetched data for affine array benchmarks, and is crucial when prefetching for problem sizes that suffer from cache conflicts. For pointer-chasing benchmarks, combining index prefetching and CCMALLOC memory allocation can reduce prefetch overheads, but this is not effective when a large number of link nodes cannot be contiguously allocated, as in Health, or when CCMALLOC allocation already gets most of the gain, as in MST.
With current processor speeds, maintaining memory bandwidths of 1-4 GBytes/sec is probably achievable. The most relevant simulation results are thus those with bandwidth towards the low end. As processors become faster, the memory wall will grow, reducing available memory bandwidth relative to processor speed. Locality optimizations should thus become more important. Similarly, as processor speeds increase, memory latencies are likely to increase past 80 cycles, making the results of our simulations at higher memory latencies more relevant.
Architects might switch to processor-in-memory (PIM) architectures to increase memory bandwidth dramatically. For on-chip data, available memory bandwidth will be more like that towards the high end of our simulations. Our experiments show such PIM systems should benefit significantly from software prefetching. However, even PIM systems will require locality optimizations to reduce accesses to off-chip data.
ACKNOWLEDGMENTS

