Data prefetching is an effective approach to addressing the memory latency problem. While a few processors have implemented hardware-based data prefetching, the majority of modern processors support data-prefetch instructions and rely on compilers to automatically insert prefetches. However, most prefetching schemes in commercial compilers suffer from two limitations: (1) the source code must be available before prefetching can be applied, and (2) these prefetching schemes target only loops with statically-known strided accesses. In this study, we broaden the scope of software-controlled prefetching by addressing the above two limitations. We use profiling to discover strided accesses that frequently occur during program execution but are not determinable by the compiler. We then use the strides discovered to insert prefetches into the executable directly, without the need for re-compilation. Performance evaluation was done on an Alpha 21264-based system with a 64KB data cache and an 8MB secondary cache. We find that even with such large caches, our technique offers speedups ranging from 3% to 56% in 11 out of the 26 SPEC2000 benchmarks. Our technique has been incorporated into Pixie and Spike, two products in Compaq's Tru64 Unix.
INTRODUCTION
Memory latency is one of the major obstacles that limit the performance of modern microprocessors. In many recent machines, it is quite common to take 100 cycles or more to fetch a word from memory, during which hundreds of instructions can potentially be executed on these multiple-issue machines. Even worse, the gap between processor and memory speeds is expected to grow in future processor generations.
Prefetching is a promising technique to cope with the memory latency problem, and can be controlled by hardware or software. Hardware-based schemes [9, 20, 28] typically look for patterns of previous accesses to predict future behavior at run-time, while software-based schemes [7, 25, 26] rely on either the programmer or the compiler to insert explicit prefetch instructions. Since software-controlled prefetching requires minimal hardware support, many microprocessors today already support prefetch instructions.
Most commercial machines rely on compilers to automatically insert prefetches by analyzing the source code. These compilers (e.g., the IBM [5] and HP [30] compilers) typically prefetch memory references in loops whose addresses have static strides. While this approach works reasonably well, it has two limitations. First, prefetching cannot be applied when the source code is not available for compilation. This is particularly a problem in optimizing commercial software, libraries, or legacy codes. Second, the scope of prefetching is largely constrained by the fact that the stride size must be known at compile time. In applications with features like dynamic memory allocation, pointers, and indirect references, there is still a non-trivial number of cache misses caused by references with strides that are not captured by the compiler.
In this study, we address the above two limitations with a technique called profile-guided post-link stride prefetching, which serves as a complement to compiler-inserted prefetching. It is based on the observation that although many strides are not compile-time constants, they are in fact highly predictable given the profile information collected from past runs. The strides predicted via profiling are then used to insert prefetches into the executable directly, without the need for re-compilation.
Our technique has been incorporated into two existing post-link profiling/optimization tools for the Alpha. More specifically, we extend Pixie [34], which originally instruments executables for control-flow profiling, to collect stride-related information. We also extend Spike [10, 12] to insert prefetches according to the stride profile collected. Performance results on an Alpha 21264 system (a DS20E) demonstrate that our technique speeds up SPEC2000 benchmarks that are already aggressively optimized by as much as 56%.
An Overview
The rest of this paper is organized as follows. We begin in Section 2 by performing two case studies. In Section 3, we describe how our technique is implemented using Pixie and Spike. Section 4 presents an experimental evaluation of our technique on an Alpha 21264 system using the SPEC2000 benchmarks. Section 5 discusses related work, and finally, we conclude in Section 6.
CASE STUDIES
In this section, we discuss the strided references exhibited in two SPEC2000 benchmarks: mcf and equake. They are selected because they are the two benchmarks in their corresponding groups (i.e. integer and floating point) that benefit the most from stride prefetching. In addition, their access patterns are relatively easy to understand without the need to look at the entire program.
Mcf
Mcf is an application used for single-depot vehicle scheduling. It is the integer benchmark in SPEC2000 that suffers the most from data cache misses. The DCPI [3] tool reports that 26% of the total stall time in mcf (running on a DS20E system, which will be described in detail in Section 4.1) happens in the while-loop shown in Figure 1(a). It traverses a list structure via two pointers, arcin and tail, as shown in Figure 1(b). Profiling the values of these two pointers reveals that arcin and tail have strides of -144 bytes and -96 bytes, respectively, throughout the entire execution (see footnote 1). These strides originate from the fact that both arcs and nodes are allocated in memory via calloc(), and that the list structure is not changed once it is created. With these strides, we can easily determine any future pointer values down the list from the current ones.
Footnote 1: The magnitude of these two strides depends on both the compiler and the run-time system.
Figure 2: Abstract version of an important loop nest in the SPEC2000 benchmark equake. Panel (c): layout in memory.
Equake
Equake is a program for simulating the propagation of elastic waves in large basins. The core computation lies in a sparse matrix multiplication routine, which contains the nested loop structure shown in Figure 2(a). Our focus is on the three-dimensional matrix A[ARCHmatrixlen][3][3], whose cache misses account for 70% of the total stall time. Since this matrix is dynamically allocated, the compiler treats it as an array of ARCHmatrixlen pointers, each pointing to an array of three pointers, each in turn pointing to an array of three doubles, as pictured in Figure 2
PROFILE-GUIDED POST-LINK STRIDE PREFETCHING
Before discussing the details of our technique, we first give an overview of it. Figure 3 shows the three major steps involved, including both the functionalities and the software tools used. They are explained below:
Instrumentation: The first step is to instrument the given executable (i.e. foo in Figure 3) for profiling the strides at the loads that are likely to cause cache misses. We enhance Pixie to perform this new type of instrumentation in addition to the existing control-flow instrumentation.
Stride Profiling: The next step is stride profiling, where we run the instrumented executable (foo.pixie) with a training input set. It generates a profile (foo.sprof) of the strides found at the loads selected in the first step.
Prefetch Insertion: The final step takes the stride profile and inserts stride prefetches into the executable (foo.pf) accordingly. A prefetch-insertion module has been added to Spike, including various optimizations that maximize the benefit of stride prefetching.
Having seen the overview, we now discuss the above three steps in greater depth.
Instrumentation
Footnote 2: Again, the magnitude of these two strides depends on the compiler and run-time system.
Figure 4: Examples of a naive instrumentation and an optimized instrumentation for stride profiling.
The first step is to decide which loads in the executable need to be instrumented for stride profiling. Of course, the simplest way is to instrument all loads. However, doing this is expensive in terms of the overhead of stride profiling, and is also unnecessary since previous research shows that cache misses tend to be generated by only a small subset of static loads in the program [1, 27]. Thus, we use the following two heuristics to decide whether a load should be instrumented:
Not-scalar: Since references to scalars or small aggregates rarely miss in the cache, they can be ignored for prefetching. On Alpha, these references are typically made off the global pointer ($gp) or the stack pointer ($sp).
Not-compiler-prefetched: If a load is already prefetched by the compiler, it does not need to be considered for stride prefetching. Thus, loads of this type are also excluded from instrumentation.
While the above heuristics always accelerate stride profiling (as fewer loads are instrumented), their performance impact can be mixed, depending on the miss rates of the loads that are ignored for stride profiling. Later in Section 4.2.3, we will measure their performance impact.
Once we decide which loads are important, we are ready to perform the actual instrumentation. A naive approach is to do a one-to-one instrumentation, like the pseudo assembly code shown in Figure 4(a), where one instrumentation call is inserted for each load. Nevertheless, the example also illustrates the following two opportunities for optimization.
First, it is not necessary to profile the effective address of the load: profiling the value of the base register is already sufficient to detect the address strides (since the offset will remain the same). An implication of this is that loads that share the same base register can also share the instrumentation call, and a natural place to put the instrumentation call is immediately after the point where the base register is defined. For this purpose, we enhance Pixie to find the definition points of registers in a flow graph, using an algorithm [31] resembling static single assignment (SSA) [2, 13].
Second, multiple instrumentation calls located at the same place can be combined into a single instrumentation call. In our example, applying these two optimizations together reduces the number of instrumentation calls from four to one, as shown in Figure 4(b). In general, as we will show in Section 4.2.2, these optimizations are very effective at reducing both the number of instrumentation calls and the profiling time.
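The two optimizations can be illustrated with a minimal sketch. It assumes each selected load has already been annotated with its base register and with the program point that defines that register; the tuple layout, the function name `plan_instrumentation`, and the example labels are hypothetical and are not taken from Pixie.

```python
from collections import defaultdict

def plan_instrumentation(loads):
    """Group loads into shared instrumentation calls.

    loads: iterable of (load_id, base_reg, def_point) tuples, where def_point
    names the place right after the base register is defined (hypothetical
    representation, not Pixie's internal one).
    Returns {def_point: {base_reg: [load_id, ...]}}; each key is ONE call.
    """
    calls = defaultdict(lambda: defaultdict(list))
    for load_id, base_reg, def_point in loads:
        # Optimization 1: loads with the same base register share one probe.
        # Optimization 2: probes at the same program point are merged.
        calls[def_point][base_reg].append(load_id)
    return {point: dict(regs) for point, regs in calls.items()}

# Four loads, two base registers, both defined at the same place.
example = [
    ("ld1", "r1", "B1.top"),
    ("ld2", "r1", "B1.top"),
    ("ld3", "r2", "B1.top"),
    ("ld4", "r2", "B1.top"),
]
plan = plan_instrumentation(example)
```

With a naive one-to-one scheme the example would need four calls; here `plan` holds a single call that records the values of both base registers.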
Stride Profiling
In this step, we run the instrumented program with a training input set in order to profile the address strides of the loads selected. Similar to the way that strides are detected in Chen and Baer's hardware prefetching scheme [9], a stride S is recognized if it appears twice in a row (i.e. $\mathit{Address}_i - \mathit{Address}_{i-1} = S = \mathit{Address}_{i+1} - \mathit{Address}_i$). When there is more than one stride, we record up to 10 of them per load, which we found to be sufficient in most cases. Each stride is associated with three pieces of information: its value (including sign), its frequency, and the average run-length of that stride. The run-length is defined as the number of consecutive instances of a given load that share the same stride. As we will discuss in Section 3.3, the run-length is useful for determining how far ahead a load should be prefetched.
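The following sketch shows one way the per-load statistics described above could be computed. It works offline on a recorded address trace rather than inside an instrumentation call, and several details are assumptions made for illustration: zero strides are ignored, a run of n equal deltas is counted as n + 1 load instances, and once 10 distinct strides are recorded any new stride is dropped. The function name `profile_strides` is hypothetical.

```python
from itertools import groupby

def profile_strides(addresses, max_strides=10):
    """Return {stride: (frequency, average_run_length)} for one load.

    addresses: the sequence of addresses (or base-register values) observed
    at a single static load.
    """
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    stats = {}  # stride -> [frequency, number_of_runs, total_run_length]
    for stride, group in groupby(deltas):
        n = sum(1 for _ in group)       # consecutive occurrences of this delta
        if n < 2 or stride == 0:        # a stride must appear twice in a row
            continue
        if stride not in stats:
            if len(stats) >= max_strides:
                continue                # table full: drop new strides (assumption)
            stats[stride] = [0, 0, 0]
        entry = stats[stride]
        entry[0] += n                   # how often the stride was observed
        entry[1] += 1                   # one more run
        entry[2] += n + 1               # n deltas span n + 1 load instances
    return {s: (freq, total / runs) for s, (freq, runs, total) in stats.items()}

# A pointer that mostly moves by -144 bytes, with one irregular episode.
trace = [1000, 856, 712, 568, 5000, 5008, 4000, 3856, 3712]
profile = profile_strides(trace)
```

For this trace the only recognized stride is -144, seen in two runs of four and three instances.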
As a tradeoff between accuracy and speed, stride profiling can be either complete or sampled. Complete profiling collects statistics from the entire execution without any skipping. In contrast, sampled profiling collects statistics from parts of the execution, with other parts skipped. We have examined two approaches to sampled profiling. The first is a periodic one, where statistics collection is turned on and off repeatedly during the entire execution. The second approach collects statistics for only the first N instances of each load being profiled, where N is reasonably large yet is still small compared to the total execution count. While the first approach works better for programs with different phases in the execution that lead to different strides, we found that the software overhead of turning statistics collection on and off is so substantial that there is no significant advantage to using periodic sampling. Thus, we focus on evaluating the second approach when we compare sampled profiling against complete profiling later in Section 4.2.4.
Prefetch Insertion
In the final step, Spike reads the stride profile and inserts prefetches into the executable. There are three subtasks that Spike performs in this step: (i) determining the stride to be used for prefetching if multiple strides are detected for a particular load, (ii) computing the prefetching distance for each strided load, and (iii) minimizing the number of prefetches that need to be inserted. They are further discussed in the rest of this section.
Choosing from Multiple Strides
When multiple strides are detected for a given load, Spike has to decide which one should be used. Figure 5 shows for the SPEC2000 benchmarks the distribution of the most frequent stride vs. the less frequent ones for each static load that exhibits strided accesses. On average, the most frequent stride happens over 90% of the time for integer and around 70%
This requires one free register (i.e. t in Figure 6(c)) and four instructions in addition to the prefetch itself. Since finding free registers at post-link time is challenging in terms of ensuring correctness, and we do not want to introduce spilling code to free up registers, we adopt the static approach in this study. Nevertheless, according to Figure 5, this would still allow us to cover a majority of strided accesses.
Computing the Prefetching Distance
Once the stride is determined, the next step is to compute the prefetching distance, that is, the number of iterations to prefetch ahead. A general formula for computing the prefetching distance [26] is $D = \lceil l / w \rceil$, where $l$ is the expected miss latency and $w$ is the estimated amount of computation between two consecutive references in cycles. For a strided load L, $w$ can be approximated by the average body length of the innermost loop that encloses L. We have enhanced Spike to compute the average loop body length based on the control-flow feedback collected via regular Pixie profiling. For the miss latency $l$, we conservatively assume it to be an L2-cache miss, thereby ensuring sufficient time to bring in the data. Since $w$ is now in terms of the average number of instructions executed in a loop, we also multiply $l$ by the instructions-per-cycle (IPC). In our experiments, $l$ and the IPC were assumed to be 100 cycles and 1.4, respectively.
Figure 6: Two approaches to handling multiple strides.
Another consideration in deciding the prefetching distance is the run-length $R$ of a stride. If $R$ is significantly greater than the ideal prefetching distance $D = \lceil l / w \rceil$, the actual prefetching distance is simply $D$. However, if $R$ is comparable to $D$, something less than $D$ should be used instead. The rationale is that when we prefetch N references ahead, the first N out of the R references in the run will not be prefetched. Thus, the prefetching distance should not be so large that many references at the beginning of the run are not covered. In the case where $R$ is comparable to $D$, the actual prefetching distance is set to $R/2$, the mid-point of the run, in an effort to balance prefetch coverage against prefetch timeliness.
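The distance computation can be sketched as below, using the values quoted in the text (l = 100 cycles, IPC = 1.4). The text does not say exactly when R counts as "comparable" to D; the sketch assumes a cutoff of R < 2D, which makes the rule equivalent to taking the smaller of D and R/2. That cutoff and the function name `prefetch_distance` are assumptions for illustration, not Spike's actual logic.

```python
import math

def prefetch_distance(loop_body_insns, run_length,
                      miss_latency_cycles=100, ipc=1.4):
    """Number of iterations to prefetch ahead for one strided load.

    loop_body_insns: average instructions executed per iteration (w).
    run_length: average run-length R of the chosen stride.
    """
    # Convert the latency to instructions so it matches the units of w.
    latency_insns = miss_latency_cycles * ipc
    ideal = math.ceil(latency_insns / loop_body_insns)   # D = ceil(l / w)
    # Assumed cutoff: when R < 2 * D, fall back to the mid-point of the run.
    return min(ideal, max(1, int(run_length) // 2))

long_run = prefetch_distance(loop_body_insns=30, run_length=1000)
short_run = prefetch_distance(loop_body_insns=30, run_length=6)
```

With a 30-instruction loop body the ideal distance is 5 iterations; a long run keeps that value, while a run of only 6 references is prefetched 3 iterations ahead so that half of the run is still covered.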
Prefetch Minimization
The final step is to optimize away prefetches that are redundant because there are already prefetches to the same cache lines. This type of redundancy typically occurs when nearby fields in a structure or consecutive array elements are separately prefetched. Our prefetch minimization algorithm consists of four steps, which are explained below using the example shown in Figure 7:
1. Prefetches that share the same base register are potentially redundant if their offsets are close enough. Hence, we first logically group prefetches at the point where their base registers are defined. For instance, all six prefetches shown in Figure 7(a) are considered together for minimization at the definition point of R1 (i.e. R1 ← R1 + 8). If the base register has multiple possible definitions, the confluence point of these definitions will be used instead (the confluence points are essentially where the φ-functions in SSA are located).
2. Prefetches grouped at the same register definition point are then classified into two types, "must" or "may" prefetches, according to the likelihood that the prefetch will be executed between that definition point and the end of the program. "Must" prefetches are those that are certain to be executed, while "may" prefetches are those that are uncertain. For instance, with respect to the definition point of R1, four of the six prefetches are "must" prefetches, while prefetch 32(R1) and prefetch 1024(R1) are "may" prefetches. Both "must" and "may" prefetches are computed via a backward data-flow analysis.
3. We then compute the maximum possible number of cache lines spanned by "must" prefetches. With this information, we can find the minimum set of prefetches that span the same cache lines. For example, the four "must" prefetches in Figure 7(a) target addresses between 16(R1) and 118(R1), spanning at most three 64-byte cache lines. These four prefetches can hence be reduced to three: prefetch 16(R1), prefetch 80(R1), and prefetch 118(R1), which cover the same set of lines. Note that "may" prefetches are not considered in this step because we do not want to introduce additional overhead by executing prefetches that are actually not needed in the original program.
4. Finally, any "may" prefetches whose data addresses are already covered by the span of "must" prefetches can be removed. Thus, in the example, prefetch 32(R1) is eliminated but prefetch 1024(R1) is not. The ultimately optimized code is shown in Figure 7(b), where either one or two prefetches are saved in execution, depending on the direction of the if-then-else statement.
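Steps 3 and 4 can be approximated by the sketch below for a single base register. It assumes the "must"/"may" classification has already been done, treats each prefetch as touching only the line containing its address, and relies on the fact that two addresses at most one line size apart lie in the same or adjacent cache lines regardless of how the base register is aligned. The middle "must" offsets in the example (48 and 96) are made up, since only the endpoints 16 and 118 are given in the text; `minimize_prefetches` is a hypothetical name and not Spike's routine.

```python
def minimize_prefetches(must, may, line_size=64):
    """Return (kept_must_offsets, kept_may_offsets) for one base register."""
    must = sorted(set(must))
    if not must:
        return [], sorted(set(may))
    lo, hi = must[0], must[-1]
    # Step 3: walk the span in line-sized steps and always include the end.
    candidate = list(range(lo, hi, line_size)) + [hi]
    kept = candidate if len(candidate) < len(must) else must

    def covered(offset):
        # Covered if it equals a kept offset, or sits between two kept
        # offsets that are at most one cache line apart.
        if offset in kept:
            return True
        return any(a <= offset <= b and b - a <= line_size
                   for a, b in zip(kept, kept[1:]))

    # Step 4: drop "may" prefetches already covered by the "must" span.
    kept_may = [off for off in sorted(set(may)) if not covered(off)]
    return kept, kept_may

kept_must, kept_may = minimize_prefetches(must=[16, 48, 96, 118],
                                          may=[32, 1024])
```

For this input the four "must" prefetches shrink to offsets 16, 80, and 118, and only the "may" prefetch at offset 1024 survives, which matches the outcome described for Figure 7.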
EXPERIMENTAL EVALUATION
We evaluated profile-guided stride prefetching on an Alpha 21264-based system. The experimental framework is described in Section 4.1, and the performance results are discussed in Section 4.2.
Framework
The test bed of our experiments was a DS20E Alpha workstation [11], equipped with a 667-MHz 21264 processor [23] and 2GB of main memory. The Alpha 21264 is an out-of-order superscalar machine which can execute up to four instructions per cycle. The memory hierarchy is the component most relevant to our study, and hence is summarized in Table 1.
We used the entire SPEC2000 suite [17] as our benchmarks. The training data sets were used to generate the stride profile, while the reference data sets were used for performance measurement. For most benchmarks, the median execution time of five runs was reported. However, galgel and sixtrack required more runs due to the relatively large variances observed in their execution times.
The benchmarks were first compiled using the standard Compaq compiler version 6.3 with -O5 optimizations under Tru64 Unix V5.1. At this optimization level, the compiler inserts prefetches which are estimated to be beneficial. The benchmarks were then further optimized by Spike to improve their I-cache performance. Thus, the baseline of our experiments was the best that we could get prior to stride prefetching.
Results
We first report the overall performance improvement due to profile-guided stride prefetching, followed by the overhead of stride profiling. Next, we measure the performance impact of the two heuristics used in instrumentation, as well as that of prefetch minimization. Finally, we demonstrate the effectiveness of sampled stride profiling.
Performance of Stride Prefetching
Our first set of results is shown in Figure 8, where the execution time with stride prefetching is normalized to that of the baseline. The first observation made from Figure 8 is that stride prefetching is more effective on the floating-point side than on the integer side. This is somewhat expected, as floating-point codes typically contain more regular data accesses and are more loop intensive. Nevertheless, we still see over 4% speedups in four out of the 12 integer benchmarks. The performance gains on the floating-point side are quite substantial: six programs are sped up by at least 10%, with equake up by 56%. Although we do suffer slowdowns in a few benchmarks, they are all quite mild (at most 3% in the case of twolf). On average, stride prefetching speeds up the integer benchmarks by 2% and the floating-point benchmarks by 9%.
To understand the performance results in greater depth, we used Atom-based cache simulation [35] to estimate the data traffic between the D-cache and the L2 cache. The Figure 9 as they do not generate any requests to the L2 cache). Three benchmarks (eon, gcc, gzip) are missing in Figure 9 owing to a bug in Atom that prevents it from instrumenting the spiked version of these three programs. Fortunately, as we see in Figure 8, these three benchmarks are of less interest since stride prefetching has little impact on their performance. Figure 9 illustrates that the performance benefit we see in Figure 8 is due to the successful conversion of load misses into prefetches by stride prefetching. This is most noticeable in cases like mcf, applu, and equake. Figure 9 also shows that stride prefetching increases the total traffic by 10% or less; the only two exceptions are twolf and facerec. Overall, bandwidth does not appear to be a serious problem here.
Overhead of Stride Profiling
Recall from Figure 4 in Section 3.1 that there are two approaches to instrumenting programs for stride profiling: naive or optimized. Figure 10 compares the number of static instrumentation points generated by each approach. The optimized approach is very effective at reducing the number of instrumentation points; the number is reduced by half or more in most cases. This advantage is a consequence of sharing instrumentation calls across loads that have the same base registers and of merging instrumentation calls at the same program points. Figure 11 shows how much profiling time is actually saved by this reduction in instrumentation calls. The profiling time is expressed as a percentage of the time taken to run the uninstrumented program with the same training input. We first note that the stride profiling overhead is of the same order as that of some other software-based value profilers [6], which typically slow down the program by 10 to 30 times. Optimized instrumentation is very helpful to floating-point benchmarks: it cuts down their profiling time by nearly two-thirds on average. Nevertheless, it is less effective on the integer side. In fact, optimized instrumentation performs a little worse than naive instrumentation in crafty, gap, gzip, and perlbmk. A possible reason for this is that putting instrumentation at the definition point of the load's base register (as done in optimized instrumentation) may introduce additional overhead on other paths that go through the definition point. For these cases, other means like sampling become more important for reducing profiling overhead, as we will discuss in Section 4.2.4.
Performance Impact of Instrumentation Heuristics and Prefetch Minimization
Figure 10: Number of static instrumentation points with naive and optimized instrumentation.
In Section 3.1, we introduced two heuristics for selecting critical loads to profile: not-scalar and not-compiler-prefetched. While they always reduce profiling time, their performance impact is less clear. Hence it is measured and shown in Figure 12. For each benchmark, the first three bars (from left to right) correspond to the following three scenarios: (i) all loads were instrumented, (ii) only not-scalar loads were instrumented, and (iii) only not-scalar and not-compiler-prefetched loads were instrumented. To isolate the performance impact of these heuristics, we disabled prefetch minimization (described in Section 3.3.3) in the first three bars. The last bar corresponds to adding prefetch minimization to scenario (iii). Note that the last bar is equivalent to the case already shown in Figure 8.
Figure 12 shows that little performance is lost with the not-scalar and not-compiler-prefetched heuristics. In fact, they do improve performance in a few benchmarks, most prominently in galgel. This is because these heuristics avoid adding stride prefetches for loads that tend to hit in the cache or those that are already prefetched by the compiler.
Comparing the third and fourth bars in Figure 12 also reveals the effectiveness of prefetch minimization. While the average performance gain is small (1% for the floating-point benchmarks), it is particularly effective in eon, applu, and galgel.
Effectiveness of Sampled Stride Profiling
Our final set of results demonstrates the effectiveness of sampled stride profiling. This is the approach where we collect stride statistics for only the first 10,000 occurrences of each of the loads selected. The purpose of this experiment is to find out whether profiling time can be substantially reduced by sampling while maintaining sufficient accuracy so that the performance gains are mostly preserved.
A comparison of the overhead of complete and sampled stride profiling is shown in Figure 13. As we can see, sampling speeds up the profiling process by 58% for integer and by 41% for floating-point. With sampling, the slowdowns of stride profiling are 2 to 15 times, which is quite encouraging given that the slowdowns of software-based value profilers [6] are typically 10 to 30 times.
The performance impact of sampled stride profiling is shown in Figure 14. The good news is that sampling does not degrade performance in most cases. The only exception is mcf, where about one-tenth of the strided loads found by stride profiling are different between complete and sampled stride profiling. Overall, sampling only the first 10,000 references works reasonably well for SPEC2000.
RELATED WORK
Prefetching has been an active area of research since it was first introduced. Meanwhile, many machines have adopted some form of prefetching, mostly in software [5, 15, 23, 30] with a couple in hardware [18, 37]. One way to categorize prefetching schemes is based on the data access patterns that they target. Hence, they can be broadly classified as sequential [33], strided [9, 14, 19], streamed [16, 21, 32], data-dependency based [28], pointer based [22, 25, 29], and address-correlation based [8, 20, 24]. Since our focus is on strided accesses, we compare our scheme with the other approaches to stride prefetching in the rest of this section.
Hardware-based stride prefetching has been studied by a number of researchers [9, 14, 19]. The general idea is to use a hardware table to remember, for each load executed, the last address loaded $A_{i-1}$ and the difference, say $S$, between $A_{i-1}$ and $A_{i-2}$ (the second-to-last address loaded). Later on, when the same load is seen again with a new address $A_i$, a prefetch for the address $A_i + S$ will be launched if $A_i - A_{i-1}$ equals $S$, provided that the load's information has not been displaced from the table. The table entry for that load is then updated with the new address and the latest stride computed.
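The table-based idea can be modeled in software with the following sketch. It is a simplification made for illustration: the table is an unbounded dictionary keyed by the load's program counter, so entries are never displaced, and the class and method names are hypothetical rather than taken from any of the cited designs.

```python
class StrideTable:
    """Toy model of a per-load stride prediction table."""

    def __init__(self):
        # load PC -> (last address, last stride or None if not yet known)
        self.entries = {}

    def access(self, pc, addr):
        """Record one execution of the load at `pc`.

        Returns the address to prefetch, or None if no stride is confirmed.
        """
        prefetch = None
        stride = None
        entry = self.entries.get(pc)
        if entry is not None:
            last_addr, last_stride = entry
            stride = addr - last_addr
            # Launch a prefetch only when the new stride repeats the old one.
            if last_stride is not None and stride == last_stride:
                prefetch = addr + stride
        # Update the entry with the new address and the latest stride.
        self.entries[pc] = (addr, stride)
        return prefetch

table = StrideTable()
predictions = [table.access(0x40, a) for a in (100, 108, 116, 200)]
```

The first two accesses only train the entry, the third confirms a stride of 8 and predicts address 124, and the irregular fourth access produces no prefetch.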
Comparing hardware-based vs. software-based stride prefetching (such as our scheme), the advantages of the hardware-based approach are that it requires no re-compilation or post-link optimizations, and it also poses no instruction over-
