Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors since they fail to issue prefetches early enough (particularly for non-sequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism to allow it to get far ahead without polluting the cache. To hide the latency of non-sequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction-prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach results in speedups ranging from 9.4% to 18.5% (13.3% on average) over the original execution time on an out-of-order superscalar processor, which is more than double the average speedup of the best existing schemes (6.5%). This is accomplished by hiding an average of 71% of the original instruction stall time, compared with only 36% for the best existing schemes. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, that the compiler can limit the code expansion to less than 10% on average, and that our scheme is robust with respect to variations in miss latency and bandwidth.
Introduction
Memory latency is a key performance bottleneck in modern microprocessor-based systems. The relative importance of memory latency is expected to increase as the gap between processor and memory speeds continues to grow, and as wider-issue processors increase the effective performance penalty of each cycle of latency. While techniques for coping with data access latency have received considerable attention, it is also important to address the latency of fetching instructions. Although instruction cache hierarchies are an essential first step toward coping with this problem, they are not a complete solution. For example, a study conducted by Maynard et al. [7] demonstrates that many commercial applications suffer from relatively large instruction cache miss rates (e.g., over 20% in an 8KB cache) due to their large instruction footprints and poor instruction localities. To further tolerate this latency, one attractive technique is to automatically prefetch instructions into the cache before they are needed.
Previous Work on Instruction Prefetching
There has been a long history of research on instruction prefetching. We will begin by discussing and then quantitatively evaluating four of the most promising techniques that have been proposed to date, all of which are purely hardware-based: next-N-line prefetching [10, 11], target-line prefetching [12], wrong-path prefetching [8], and Markov prefetching [3].
Before we begin our discussion, we briefly introduce some prefetching terminology. The coverage factor is the fraction of original cache misses that are prefetched. A prefetch is unnecessary if the line is already in the cache (or is currently being fetched), and is useless if it brings a line into the cache which will not be used before it is displaced. An ideal prefetching scheme would provide a coverage factor of 100% and would generate no unnecessary or useless prefetches. In addition, the timeliness of when prefetches are launched is also crucial. The prefetching distance is the elapsed time between when the prefetch is initiated and when the prefetched instruction is used. The prefetching distance should be large enough to fully hide the cache miss latency, but not so large that the line is likely to be displaced by other accesses before it can be used (i.e. a useless prefetch).
As its name implies, the idea behind next-N-line prefetching [10, 11] is to prefetch the N sequential lines following the one currently being fetched by the CPU. A larger value of N tends to increase the prefetching distance, but also increases the likelihood of polluting the cache with useless prefetches. The optimal value of N depends on the line size, the cache size, and the behavior of the application itself. To increase the likelihood that these prefetched sequential lines will be used, the hardware can postpone launching a prefetch until the current instruction falls within a specified distance (called the fetch-ahead distance) of the end of its line [12]. Next-N-line prefetching captures sequential execution as well as control transfers where the target falls within the next N lines. It is usually included as part of other more complex instruction prefetching schemes, and based on our experiments, it accounts for most of the performance benefit of these schemes.
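The address generation described above can be illustrated with a minimal Python sketch. It is only a model of the idea, not the hardware evaluated in this paper: the function names are hypothetical, the 32-byte line size is taken from the simulated configuration, and expressing the fetch-ahead distance in bytes is an assumption made for illustration.

```python
LINE_SIZE = 32  # bytes per cache line (the simulated configuration)


def next_n_line_prefetches(fetch_addr, n, line_size=LINE_SIZE):
    """Return the line-aligned addresses of the N lines that follow
    the line containing fetch_addr."""
    current_line = fetch_addr - (fetch_addr % line_size)
    return [current_line + i * line_size for i in range(1, n + 1)]


def should_launch(fetch_addr, fetch_ahead_distance, line_size=LINE_SIZE):
    """Fetch-ahead throttle (assumed to be measured in bytes): launch
    prefetches only once the current instruction is within
    fetch_ahead_distance bytes of the end of its line."""
    bytes_to_end = line_size - (fetch_addr % line_size)
    return bytes_to_end <= fetch_ahead_distance
```

For instance, `next_n_line_prefetches(0x1004, 2)` yields the two lines starting at `0x1020` and `0x1040`; raising `n` increases the prefetching distance at the cost of more potentially useless prefetches.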
One limitation of next-N-line prefetching is that it does not prefetch control transfer targets which do not fall within the N fall-through lines. To address this limitation, Smith and Hsu [12] proposed target-line prefetching which uses a prediction table to record the address of the line which most recently followed a given instruction line, thus enabling hardware to prefetch targets whenever an entry is found in this table. They observed that combining target-line prefetching with next-1-line prefetching produced significantly better results than either technique alone.
Rather than relying on a history table to predict likely target addresses, Pierce and Mudge [8] proposed a scheme called wrong-path prefetching which combines next-N-line prefetching with always prefetching the target of control transfers with static target addresses (including procedure calls, conditional and unconditional branches). Hence for conditional branches, both the target and fall-through lines will always be prefetched. However, since target addresses cannot be determined early, this scheme only outperforms next-N-line prefetching when a conditional branch is initially untaken but later taken (assuming that enough time has passed in between to hide the latency of fetching the target line, but not so much time that the line has been displaced). Their results indicated that wrong-path prefetching performed slightly better than next-1-line prefetching on average.
Joseph and Grunwald [3] proposed Markov prefetching which is applicable to both instruction and data cache misses. This mechanism correlates the current cache miss address with the next miss address and stores this information in a miss-address prediction table using the current miss address as the index. Multiple predicted addresses can be associated with a given miss address. Upon a cache miss, prefetches are issued for these predicted addresses. The Joseph and Grunwald study focused primarily on data cache misses, and did not compare Markov prefetching with techniques designed specifically for prefetching instructions.
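To illustrate the miss-address prediction table used by Markov prefetching, the following Python sketch records, for each miss address, the miss addresses that most recently followed it. This is a simplified model rather than the design of Joseph and Grunwald: the class name, the limit of two successors per entry, the most-recent-first ordering, and the unbounded table size are all assumptions made for illustration.

```python
from collections import OrderedDict, defaultdict


class MarkovPrefetcher:
    """Toy miss-address prediction table indexed by the current miss address."""

    def __init__(self, max_successors=2):
        self.max_successors = max_successors    # assumed per-entry capacity
        self.table = defaultdict(OrderedDict)   # miss addr -> recent successors
        self.last_miss = None

    def on_miss(self, miss_addr):
        """Record the transition last_miss -> miss_addr, then return the
        predicted next-miss addresses for miss_addr (most recent first)."""
        if self.last_miss is not None:
            successors = self.table[self.last_miss]
            successors.pop(miss_addr, None)
            successors[miss_addr] = True         # most recent goes last
            while len(successors) > self.max_successors:
                successors.popitem(last=False)   # evict the oldest successor
        self.last_miss = miss_addr
        return list(self.table.get(miss_addr, ()))[::-1]
```

Note that a prediction is only produced when a miss actually occurs, which is one reason such a table has difficulty launching prefetches early.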
Finally, we note that while a previous study by Xia and Torrellas [13] considered instruction prefetching for codes where the layout has already been optimized using profiling information, we focus only on techniques which do not require changes to the instruction layout in this study.
Performance of Existing Instruction Prefetching Techniques
To quantify the performance benefits and limitations of the four prefetching techniques described above, we implemented each of them within a detailed, cycle-by-cycle simulator which models an out-of-order four-issue superscalar processor based on the MIPS R10000 [14]. We model a two-level cache hierarchy with split 32 KB, two-way set-associative primary instruction and data caches and a unified 1 MB, four-way set-associative secondary cache. Both levels use 32-byte lines. The penalty of a primary cache miss that hits in the secondary cache is at least 12 cycles, and the total penalty of a miss that goes all the way to memory is at least 75 cycles (plus any delays due to contention, which is modeled in detail). To provide better support for instruction prefetching, we further enhanced the primary instruction cache relative to the R10000 as follows: we divide it into four separate banks, and we add an eight-entry victim cache [4] and a 16-entry prefetch buffer [3]. Further details on our experimental framework will be presented later in Section 5.
Table 1 summarizes the parameters used throughout our experiments for each of the prefetching schemes. These parameters were chosen through experimentation in an effort to maximize the performance of each scheme. All schemes effectively include next-2-line prefetching. We do not use the fetch-ahead distance mechanism [12] to throttle back prefetching. When a target is to be prefetched, we prefetch two consecutive lines starting at the target address.
Figure 1 shows the performance impact of each prefetching scheme on a collection of seven non-numeric applications (which are described in more detail later in Section 5). We show three different versions of next-N-line prefetching (where N = 2, 4, and 8) in Figure 1, along with the original case without prefetching (O) and the case with a perfect instruction cache (P).
Each bar represents execution time normalized to the case without prefetching, and is broken down into three categories explaining what happened during all potential graduation slots. The bottom section (Busy) is the number of slots when instructions actually graduate, the top section (I-Miss Stall) is any non-graduating slots that would not occur with a perfect instruction cache, and the middle section (Other Stall) is all other slots where instructions do not graduate.
We observe from Figure 1 that despite significant differences in complexity and hardware cost, the various prefetching schemes offer remarkably similar performance, with no single scheme clearly dominating. Perhaps surprisingly, the best performance is achieved by either next-4-line or next-8-line prefetching in all cases except perl; even in perl, next-4-line prefetching is still within 1% of the best case. The reason for this is that the bulk of the benefit offered by each of these schemes is due to prefetching sequential accesses. Finally, we see in Figure 1 that these schemes are hiding no more than half of the stall time due to instruction cache misses. Through a detailed analysis of why these schemes are not more successful (further details are presented later in Section 6.1), we observe that although the coverage is generally quite high, the real problem is the timeliness of the prefetches, i.e. prefetches are not being launched early enough to hide the latency. Hence there is significant room for improvement over these existing schemes.
Our Solution
To hide instruction cache miss latency more effectively in modern microprocessors, we propose and evaluate a new fully-automatic instruction prefetching scheme whereby the compiler and the hardware cooperate to launch prefetches earlier (therefore hiding more latency) while at the same time maintaining high coverage and actually reducing the impact of useless prefetches relative to today's schemes. Our approach involves two novel components. First, to enable more aggressive sequential prefetching without polluting the cache with useless prefetches, we introduce a new prefetch filtering hardware mechanism. Second, to enable more effective prefetching of non-sequential accesses, we introduce a novel compiler algorithm which inserts explicit instruction-prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance. Our experimental results demonstrate that our scheme provides significant performance improvements over existing schemes, eliminating roughly 50% or more of the latency that had remained with the best existing scheme.
This paper is organized as follows. We begin in Section 2 with an overview of our approach, and then present further details on the architectural and compiler support in Sections 3 and 4. Sections 5 and 6 present our experimental methodology and our experimental results, and finally we conclude in Section 7.
Cooperative Instruction Prefetching
We begin this section with a high-level overview of how our prefetching scheme works. To make our approach concrete, we also present some examples illustrating how prefetches are inserted.
Overview of the Prefetching Algorithm
As we mentioned earlier, the key challenge in designing a better instruction prefetching scheme is to be able to launch prefetches earlier, i.e. to achieve a larger prefetching distance. Let us consider the sequential and non-sequential portions of instruction streams separately.
Prefetching Sequential Accesses
Since the addresses within sequential access patterns are trivial to predict, they are well-suited to a purely hardware-based mechanism such as next-N-line prefetching. To get far enough ahead to fully hide the latency, we would like to choose a fairly large value for N (e.g., N = 8 in our experiments). However, the problem with this is that larger values of N increase the probability of overshooting the end of the sequence and polluting the cache with useless prefetches. For example, next-8-line prefetching performs worse than next-4-line prefetching for four cases in Figure 1 (perl, porky, postgres, and skweel) due to this effect.
The ideal solution would be to prefetch ahead aggressively (i.e. with a large N) but to stop once the end of the sequence is reached. Xia and Torrellas [13] proposed a mechanism for doing this which involves having software explicitly mark the likely end of a sequence with a special bit. In contrast, we achieve a similar effect using a more general prefetch filtering mechanism which automatically detects and discards useless prefetches before they have a chance to pollute the instruction cache. We will explain how the prefetch filter works in detail later in Section 3.3.1, but the basic idea is to use two-bit saturating counters stored in the secondary cache tags to dynamically detect cases where lines have been repeatedly prefetched into the primary instruction cache but were not accessed before they were displaced (i.e. useless prefetches). When prefetches for such lines subsequently arrive at the secondary cache, they are simply dropped. One advantage of our approach is that it adapts to the dynamic branching behavior of the program, rather than relying on static predictions of likely control flow paths. In addition, our filtering mechanism is equally applicable to non-sequential as well as sequential prefetches.
Prefetching Non-Sequential Accesses
In contrast with sequential access patterns, purely hardware-based prefetching schemes are far less successful at prefetching non-sequential instruction accesses early enough. Wrong-path prefetching does not attempt to predict the target address of a given branch early, but instead hopes that the same branch will be revisited sometime in the not-too-distant future with a different branch outcome. Both target-line and Markov prefetching rely on building up history tables to predict addresses to prefetch along control targets. However, if a control transfer is encountered for the first time or if its entry has been displaced from the finite history table, then its target will not be prefetched. Perhaps more importantly, even if a valid entry is found in the history table, it is often too late to fully hide the latency of prefetching the target since the processor is already accessing the line containing the branch.
To overcome these limitations, we rely on software rather than hardware to launch non-sequential instruction prefetches early enough. To avoid placing any burden on the programmer, we use the compiler to insert these new instruction-prefetching instructions automatically. As we describe in further detail later in Section 4, our compiler algorithm moves prefetches back by a specified prefetch-scheduling distance while being careful not to insert prefetches that would be redundant with either next-N-line prefetching or other software instruction prefetches. Since many control transfers within procedures have targets within the N lines covered by our next-N-line prefetcher, the bulk of the instructions inserted by our compiler algorithm are for prefetching across procedure boundaries. Hence, although it is an oversimplification, one could think of our scheme as being primarily hardware-based for intraprocedural prefetching, and primarily software-based for interprocedural prefetching.
While direct control transfers (i.e. ones where the target address is statically known) are handled in a straightforward way by our algorithm, indirect jumps require some additional support in order for software to generate the target addresses early. We consider two separate cases of indirect jumps: procedure returns, and all other indirect jumps. Since procedure return addresses can be easily predicted through the use of a return address stack [5], we simply use a special prefetch instruction which implicitly uses the top of the return address stack as its argument. To predict the target addresses of other indirect jumps, we use a hardware structure called an indirect-target table which records past target addresses of individual indirect jump instructions, and which is indexed using the instruction addresses of indirect jumps themselves.
A prefetch instruction designed to prefetch the target of an indirect jump i conceptually stores the instruction address of i, which is then used to index the indirect-target table to retrieve the actual target addresses to prefetch. (Note that an indirect-target table is considerably smaller than the tables used by either target-line or Markov prefetching since it only contains entries for active indirect jumps other than procedure returns.)
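The behavior of the indirect-target table can be sketched in a few lines of Python. This is an illustrative model only: the class and method names are hypothetical, and the capacity of three targets per entry and the most-recent-first replacement are assumptions rather than parameters stated here.

```python
class IndirectTargetTable:
    """Toy indirect-target table, indexed by the address of an indirect jump."""

    def __init__(self, targets_per_entry=3):
        self.targets_per_entry = targets_per_entry  # assumed capacity
        self.entries = {}                           # jump address -> targets

    def record(self, jump_pc, target):
        """Called when the indirect jump at jump_pc resolves to target."""
        targets = self.entries.setdefault(jump_pc, [])
        if target in targets:
            targets.remove(target)
        targets.insert(0, target)                   # most recent first
        del targets[self.targets_per_entry:]        # drop the oldest targets

    def predict(self, jump_pc):
        """Called by a prefetch naming jump_pc: return the addresses to prefetch."""
        return list(self.entries.get(jump_pc, []))
```

A prefetch for an indirect jump that has never executed finds no entry and therefore prefetches nothing, while one that names a frequently executed jump prefetches all of its recorded targets.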
While the advantage of software-controlled instruction prefetching is that it gives us greater control over issuing prefetches early, the potential drawbacks are that it increases the code size and effectively reduces the instruction fetch bandwidth (since the prefetch instructions themselves consume part of the instruction stream). Fortunately, our experimental results demonstrate that this advantage outweighs any disadvantages.
Examples of Prefetch Insertion
To make our discussion more concrete, Figure 2 contains three examples of how different types of prefetches are inserted. We assume the following in these examples: a cache line is 32 bytes long; an instruction is four bytes long (hence one cache line contains eight instructions); hardware next-8-line prefetching is enabled; and the prefetch-scheduling distance is 20 instructions.
Figure 2(a) shows two procedures, main() and foo(), where main() contains five basic blocks (labeled A through E). Two prefetches have been inserted at the beginning of basic block A: one targeting block E, and the other targeting procedure foo(). There is no need to insert software prefetches for blocks B, C or D at A since they will already be handled by next-8-line prefetching. The prefetch targeting E is inserted in block A rather than in block C in order to guarantee a prefetching distance of at least 20 instructions. Although there are two possible paths from A to foo() (i.e. A→B→D→foo() and A→C→D→foo()), the compiler inserts only a single prefetch of foo() in A (rather than inserting one in A and one in B) because (i) A dominates both paths, and (ii) the compiler determines that these prefetched instructions are not likely to be displaced by other instructions fetched along the path A→B→D→foo().
Figure 2(b) shows an example of prefetching return addresses. The prefetches in procedures bar() and foo() get their addresses from the top of the return address stack, i.e. 3204 and 1004, respectively. Finally, Figure 2(c) shows an example where a prefetch is inserted to prefetch the target address of the indirect jump at address 8192 before the actual target address is known (i.e. the value of register R has not been determined yet). Hence the prefetch has 8192 as its address operand to serve as an index into the indirect-target table. Three target addresses are predicted for this indirect jump, and all of them will be prefetched.
Architectural Support
Our prefetching scheme requires new support from the architecture. In this section, we describe how we extend the instruction set architecture, the impact that these new instructions have on the pipeline, and the new hardware that we add to the memory system (including the prefetch filter).
Extensions to the Instruction Set Architecture
Without loss of generality, we assume a base instruction set architecture (ISA) similar to the MIPS ISA [6]. Within a 32-bit MIPS instruction, the high-order six bits contain the opcode. For the jump-type instructions which implement static procedure calls, the remaining 26 bits contain the low-order bits of the target word address. We will use this same instruction format as our starting point.
There are many ways to encode our new instruction-prefetch instructions, and Figure 3(a) shows just one of the possibilities. An opcode is designated to identify instruction-prefetch instructions. In contrast with the standard jump-type instruction format, we assume that 24 bits (bits 2 through 25) contain information for computing the prefetch address(es), and that bits 1 and 0 indicate one of the four prefetch types. The prefetch type pf_d stores a single prefetch address in a format similar to a MIPS jump address. The only difference is that since the lower two bits are ignored, it effectively encodes a 16-byte-aligned address. The pf_c type is a compact format which encodes two target addresses within the 24-bit field in the form of offsets between the target address lines and the prefetch instruction line itself (again, a single offset bit represents 16 bytes); each offset is 12 bits wide. The remaining two types are for prefetching indirect targets: pf_r is for procedure returns, and pf_i is for general indirect-jump targets. A pf_r prefetch does not require an argument since it implicitly uses the top of the return address stack as its address. A pf_i prefetch encodes the word offset between itself and the indirect-jump instruction that it is prefetching. To look up the
[Figure 3(a): Adding instruction prefetches to the ISA]
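The bit layout described above (a six-bit opcode, a 24-bit address field in bits 2 through 25, and a two-bit type in bits 1 and 0) can be made concrete with a small Python sketch. Several details below are assumptions made purely for illustration and are not specified in the text: the opcode value, the numeric codes assigned to the four types, the treatment of the two 12-bit pf_c offsets as signed values, taking the upper four address bits of a pf_d target from the program counter (as a MIPS jump does), and the 32-byte alignment of the pf_c base.

```python
PF_OPCODE = 0x33                     # hypothetical opcode value
PF_D, PF_C, PF_R, PF_I = 0, 1, 2, 3  # hypothetical type codes (bits 1..0)
LINE_SIZE = 32                       # bytes per cache line
UNIT = 16                            # one offset/address unit is 16 bytes


def encode(pf_type, payload):
    """Pack the opcode, a 24-bit payload (bits 2..25), and a 2-bit type."""
    assert 0 <= payload < (1 << 24)
    return (PF_OPCODE << 26) | (payload << 2) | pf_type


def sign_extend(value, bits):
    """Interpret a bits-wide field as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_targets(word, pc):
    """Return the prefetch addresses for pf_d and pf_c words. pf_r and
    pf_i need the return address stack or indirect-target table, so
    this sketch returns None for them."""
    pf_type = word & 0x3
    payload = (word >> 2) & 0xFFFFFF
    if pf_type == PF_D:
        # 24 bits at 16-byte granularity; upper 4 bits assumed from the PC.
        return [(pc & 0xF0000000) | (payload << 4)]
    if pf_type == PF_C:
        base = pc - (pc % LINE_SIZE)  # line holding the prefetch instruction
        first = sign_extend(payload >> 12, 12)
        second = sign_extend(payload & 0xFFF, 12)
        return [base + first * UNIT, base + second * UNIT]
    return None
```

Under these assumptions a pf_d word can name any 16-byte-aligned address within the current 256 MB region, while each pf_c offset reaches roughly 32 KB on either side of the prefetch instruction.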
Impact on the Processor Pipeline
Many recent processors have implemented instructions for data prefetching [2, 9, 14]. With respect to pipelining, our instruction prefetches differ in two important ways from data prefetches: (i) the pipeline stage in which the prefetch address is known, and (ii) the computational resources consumed by the prefetches. Figure 3(b) contrasts the pipeline for data prefetches in the MIPS R10000 [14] with the pipeline for our instruction prefetches in an equivalent machine. As we see in Figure 3(b), the prefetch address of a pf_d instruction prefetch (the most commonly used type) is known immediately after the Decode stage (the other types of instruction prefetches would require some additional time), while the address for a data prefetch is not known until it is computed in the Address Calculate stage. Hence a pf_d instruction prefetch can be initiated two cycles earlier than a data prefetch. In addition, since instruction prefetches do not go through the latter three pipeline stages of a data prefetch (instead they are handled directly by the hardware instruction prefetcher after they are decoded), they do not contend for processor resources including functional units, the reorder buffer, register file, etc. In effect, the instruction prefetches are removed from the instruction stream as soon as they are decoded, thereby having minimal impact on most computational resources.
Extensions to the Memory Subsystem
Figure 4(a) shows our memory subsystem (only the instruction fetching components are displayed). The I-prefetcher is responsible for generating prefetch addresses and launching prefetches to the unified L2 cache for both hardware-initiated and software-initiated prefetching. Prefetch-address generation involves simple extraction of prefetch addresses from pf_d prefetches, adding constant offsets to the current program counter (for next-N-line prefetching and pf_c prefetches), or retrieving prefetch targets from some hardware structures (for pf_r and pf_i prefetches). The I-prefetcher will not launch a prefetch to the L2 cache if the line being prefetched is already in the primary instruction cache (I-cache) or if a fetch or prefetch for the same line address is already outstanding. The auxiliary structures shown in Figure 4(a) include the return address stack and the indirect-target table used by pf_r and pf_i prefetches, respectively. These structures are not necessary if these two types of prefetches are not implemented.
Prefetch Filtering Mechanism
The prefetch filter sits between the I-prefetcher and the L2 cache to reduce the number of useless prefetches. In addition, a prefetch bit is associated with each line in the I-cache to remember whether the line was prefetched but not yet used, and a two-bit saturating counter value is associated with each line in the L2 cache to record the number of consecutive times that the line was prefetched but not used before it was replaced. The prefetch filtering mechanism works as follows. When a line is fetched from the L2 cache to the I-cache, both the prefetch bit and the saturating counter value are reset to zero. When a line is prefetched from the L2 cache to the I-cache, its prefetch bit is set to one and its saturating counter does not change. When a prefetched line is actually used by a fetch, its prefetch bit is reset to zero. When a prefetched line l in the I-cache is replaced by another line, then if the prefetch bit of line l is set, its saturating counter is incremented (unless it has already saturated, of course); otherwise, the counter is reset to zero. When the prefetch filter receives a prefetch request for line l, it will either respond normally if the counter value is below a threshold T, or else it will drop the prefetch and send a "prefetch canceled" signal to the processor if the counter has reached T (in our experiments, T = 3). Figure 4(b) shows an example of how the prefetch filtering mechanism works, and Figure 5 summarizes the states and transitions of the prefetch bit and the saturating counter for a particular cache line.
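The state transitions described above can be summarized in a short Python model. It is a behavioral sketch only, not the hardware implementation: the class and method names are hypothetical, dictionaries stand in for the per-line bits held in the I-cache and L2 tags, and the caller is assumed to invoke `evict` only for lines that are currently resident in the I-cache.

```python
THRESHOLD = 3      # T: prefetches are dropped once the counter reaches T
COUNTER_MAX = 3    # largest value of a two-bit saturating counter


class PrefetchFilter:
    """Behavioral model of the prefetch bit and saturating counter."""

    def __init__(self):
        self.counter = {}        # L2 side: line -> saturating counter
        self.prefetch_bit = {}   # I-cache side: resident line -> prefetch bit

    def demand_fetch(self, line):
        """A normal fetch brings the line in: reset both bit and counter."""
        self.prefetch_bit[line] = 0
        self.counter[line] = 0

    def prefetch(self, line):
        """Return True if the prefetch proceeds, False if it is dropped."""
        if self.counter.get(line, 0) >= THRESHOLD:
            return False                 # "prefetch canceled"
        self.prefetch_bit[line] = 1      # counter is left unchanged
        return True

    def use(self, line):
        """A fetch hits a resident line: clear its prefetch bit."""
        if line in self.prefetch_bit:
            self.prefetch_bit[line] = 0

    def evict(self, line):
        """The line is replaced in the I-cache: update its L2 counter."""
        if self.prefetch_bit.pop(line):
            # Prefetched but never used: one more useless prefetch.
            self.counter[line] = min(self.counter.get(line, 0) + 1, COUNTER_MAX)
        else:
            self.counter[line] = 0
```

With T = 3, a line must be prefetched and evicted unused three consecutive times before further prefetches of it are dropped, and a single demand fetch of the line re-enables prefetching.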
Compiler Support
The compiler is responsible for automatically inserting prefetch instructions into the executable. Note that since prefetch insertion is most effective if it begins after the code is otherwise in its final form, this new pass occurs fairly late in the compilation: perhaps at link time, or in our case, we implemented it as a binary rewrite tool. The goal of the compiler is to schedule prefetches to achieve high coverage and satisfactory prefetching distances while at the same time minimizing the static and dynamic instruction overhead. Hence our compiler algorithm has two major phases: prefetch scheduling and prefetch optimization. Figure 6 shows a pseudo-code representation of our prefetch scheduling algorithm. After generating an initial prefetch schedule, the compiler then performs the four optimization passes described below, using the running example in Figure 7. A complete implementation of this algorithm was used throughout our experiments.
Pass 1: Combining Prefetches at Dominators. This pass boosts prefetches that have been attached to a basic block b in the prefetch scheduling phase to b's nearest dominator (other than b itself) if the boosting is not harmful (it is harmful when the boosted prefetches will displace other useful instructions from the cache before b is referenced). After this boosting process, the compiler can combine some prefetches at dominators. For example, Figure 7(b) shows the result of combining the two prefetches of line y into one after boosting prefetches from basic blocks D, E, and F into their dominator C.
Pass 2: Eliminating Unnecessary Prefetches. A prefetch instruction targeting a line l is unnecessary if l resides in the I-cache on all possible paths reaching the prefetch instruction. To eliminate unnecessary prefetch instructions, the compiler estimates which lines reside in the I-cache at each prefetch instruction using an algorithm similar to the one for computing available expressions in classical code optimization [1]. In our case, the gen set of a basic block b is the set of lines fetched or prefetched by b while the kill set is the set of lines displaced by b. In our example, since line z will definitely be in the I-cache when we enter basic block C regardless of whether we came from A or B, the prefetch of line z in C is unnecessary and therefore is eliminated, as shown in Figure 7(c).
Pass 4: Hoisting Prefetches. Finally, the compiler hoists prefetches scheduled inside a loop up to the nearest basic block that dominates but is not part of the loop, if the prefetches do not need to be re-executed at every iteration (which may not be the case if each iteration can access a large volume of instructions). In some cases, a pre-header block will be created for the loop to hold the hoisted prefetches. For example, in Figure 7(e), a pre-header C' is created to immediately precede the header (i.e. C) of the loop containing C, D, E, and F to hold the hoisted pf_c prefetch. While this optimization does not reduce the code size, it can reduce the number of dynamic prefetches.
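The available-expressions style analysis used in Pass 2 can be sketched as a standard forward "must" dataflow problem in Python. This is an illustration under stated assumptions, not the implementation used in our experiments: the function name and the small control-flow graph in the usage note are hypothetical, and prefetches are assumed to sit at the start of their basic block so that only the set of lines available on block entry matters.

```python
def available_lines(blocks, preds, gen, kill, entry):
    """Return, for each basic block, the set of lines guaranteed to be
    in the I-cache on entry along every path (a forward must-analysis):
        IN[b]  = intersection of OUT[p] over all predecessors p of b
        OUT[b] = gen[b] | (IN[b] - kill[b])
    """
    universe = set().union(*gen.values())
    # Optimistic start: everything is available except at the entry block.
    in_sets = {b: set() if b == entry else set(universe) for b in blocks}
    out_sets = {b: gen[b] | (in_sets[b] - kill[b]) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            pred_outs = [out_sets[p] for p in preds[b]]
            if b == entry or not pred_outs:
                new_in = set()
            else:
                new_in = set.intersection(*pred_outs)
            new_out = gen[b] | (new_in - kill[b])
            if new_in != in_sets[b] or new_out != out_sets[b]:
                in_sets[b], out_sets[b] = new_in, new_out
                changed = True
    return in_sets
```

As a usage example, consider a hypothetical graph in which an entry block S branches to A and B, both of which fall into C, with `gen = {"S": set(), "A": {"z", "a"}, "B": {"z", "b"}, "C": {"c"}}` and empty kill sets. The analysis reports that line z is available on entry to C, so a prefetch of z placed in C would be eliminated, whereas lines a and b are each available on only one path and their prefetches would be kept.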
Experimental Framework
We performed our experiments on seven non-numeric applications which were chosen because their relatively large instruction footprints result in poor instruction cache performance. These applications are described in Table 2, and all of them were run to completion. We performed detailed cycle-by-cycle simulations of our applications on a dynamically-scheduled, superscalar processor similar to the MIPS R10000 [14]. Our simulator models the rich details of the processor including the pipeline, register renaming, the reorder buffer, branch prediction, branching penalties, speculative instruction fetching (including incorrect execution paths), the memory hierarchy (including tag, bank, and bus contention), etc. Table 3 shows the parameters used in our model for the bulk of our experiments (we vary the latency and bandwidth later in Section 6.6). As shown in Table 3, we enhanced the memory subsystem in a few ways relative to the R10000 to provide better support for instruction prefetching, e.g., we added an eight-entry victim cache [4] and a 16-entry prefetch buffer [3]. Our prefetch buffer is similar to the one used in the Markov prefetching study [3], with the only difference being that when an entry is forced out of this buffer, we place it in the instruction cache rather than dropping it. Hence anything that enters the prefetch buffer eventually enters the instruction cache in our model; its primary purpose is to delay filling the instruction cache to help avoid cache conflicts. We compiled each application as a "nonshared" executable with -O2 optimization using the standard MIPS C compilers under IRIX 5.3. We implemented our compiler algorithm as a standalone pass which reads in the MIPS executable and modifies the binary.
However, since we did not have access to a complete set of binary rewrite utilities, we tightly integrated our compiler pass with our simulator so that rather than physically generating a new executable, we instead pass a logical representation of the new binary to the simulator which it can then model accurately. For example, the simulator fetches and executes all of the new instruction prefetches as though they were in a real binary, and it remaps all instruction layouts and addresses to correspond to what they would be in the modified binary. Hence we truly emulate the physical insertion of prefetches at the expense of decreased simulation speed.
two key components of our scheme: prefetch filtering and software-initiated prefetching. We then measure the impact of varying the prefetch-scheduling distance used by the compiler, and of our compiler's prefetch optimizations, on the code size and performance. We also quantify the impact of varying cache latencies and bandwidths on the performance of our scheme. Finally, we justify the hardware cost of cooperative prefetching.
Performance of the Basic Cooperative Prefetching Scheme
Our basic cooperative prefetching scheme includes compiler-inserted pf_d and pf_c prefetches, hardware-based next-8-line prefetching, and prefetch filtering. No pf_r or pf_i prefetches (and hence none of the hardware structures they require) are used. A prefetch-scheduling distance of 20 instructions is used for all applications. Figure 8 shows the performance impact of cooperative instruction prefetching. For each application, we show two cases: the bar on the left is the best previously-existing prefetching scheme (seen earlier in Figure 1), and the bar on the right is cooperative prefetching (C). As we see in Figure 8, our cooperative prefetching scheme offers significant speedups over existing schemes (6.4% on average) by hiding a substantially larger fraction of the original instruction cache miss stall times (71% on average, as opposed to an average reduction of 36% for the best existing schemes).
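As a rough consistency check, the 6.4% figure can be related to the average speedups over the original execution time quoted elsewhere in this paper (13.3% for cooperative prefetching and 6.5% for the best existing schemes). The sketch below assumes, which the text does not state, that each speedup is defined as a ratio of execution times minus one and that the ratio of the averages approximates the average ratio. The variable names are illustrative.

```python
# Average speedups over the original execution time, as quoted in the paper.
speedup_cooperative = 0.133    # cooperative prefetching
speedup_best_existing = 0.065  # best existing schemes

# Assumed definition: speedup = T_original / T_scheme - 1, so the speedup of
# cooperative prefetching relative to the best existing scheme is the ratio
# of the two (1 + speedup) factors, minus one.
relative_speedup = (1 + speedup_cooperative) / (1 + speedup_best_existing) - 1

relative_speedup_percent = round(relative_speedup * 100, 1)
print(f"relative speedup: {relative_speedup_percent}%")
```

Under these assumptions the result rounds to the 6.4% quoted above.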
To understand the performance results in greater depth, Figure 9 shows a metric which allows us to evaluate the coverage, timeliness, and usefulness of prefetches all on a single axis. This figure shows the total I-cache misses (including both fetch and prefetch misses) normalized to the original case (i.e., without prefetching) and broken down into the following four categories. The bottom section is the number of fetch misses that were not prefetched (this accounts for 100% of the misses in the original case, of course). The next section (Late Prefetched Misses) is where a miss has been prefetched, but the prefetched line has not returned in time to fully hide the miss (in which case the instruction fetcher stalls until the prefetched line returns, rather than generating a new miss request). The Prefetched Hits section is the most desirable case, where a prefetch fully hides the latency of what would normally have been a fetch miss, converting it into a hit. Finally, the top section is useless prefetches, which bring lines into the cache that are not accessed before they are replaced. Figure 9 shows that both cooperative prefetching and the best existing prefetching schemes achieve large coverage factors, as indicated by the small number of unprefetched misses. The main advantage of our scheme is that it is more effective at launching prefetches early enough. This is demonstrated in Figure 9 by the significant reduction in late prefetched misses, the bulk of which have been converted into prefetched hits. We also observe in Figure 9 that both cooperative prefetching and existing schemes experience a certain amount of cache pollution, since the sum of the bottom three sections of the bars adds up to over 100%. However, the prefetch filtering mechanism used by cooperative prefetching helps to reduce this problem, thereby resulting in a smaller total for the bottom three sections than the best existing scheme in all of our applications.
In addition, Figure 9 shows another benefit of prefetch filtering: it dramatically reduces the number of useless prefetches. The reduction in total useless prefetches ranges from a factor of 2.4 in perl to a factor of 10.6 in tcl; on average, cooperative prefetching has achieved a sixfold reduction in useless prefetching.
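The four-way breakdown used in Figure 9 can be sketched as a small classifier. The function names, the boolean event encoding, and the event counts below are hypothetical and chosen only to illustrate the metric; they are not measurements from the paper.

```python
from collections import Counter


def classify(prefetched, accessed, arrived_in_time):
    """Assign one I-cache line event to one of the four Figure 9 categories."""
    if not prefetched:
        return "unprefetched miss"
    if not accessed:
        return "useless prefetch"  # line was replaced before being used
    return "prefetched hit" if arrived_in_time else "late prefetched miss"


def breakdown(events, original_misses):
    """Count events per category and normalize to the no-prefetching miss count."""
    counts = Counter(classify(*event) for event in events)
    normalized = {name: n / original_misses for name, n in counts.items()}
    return counts, normalized


# Hypothetical events: (prefetched, accessed, arrived_in_time)
events = (
    [(False, True, False)] * 1    # misses that were never prefetched
    + [(True, True, False)] * 2   # prefetched, but the line returned too late
    + [(True, True, True)] * 8    # prefetch fully hid the miss
    + [(True, False, False)] * 3  # prefetched lines that were never accessed
)
counts, normalized = breakdown(events, original_misses=10)

# The bottom three categories exceeding 100% of the original misses
# indicates cache pollution, as discussed in the text.
bottom_three = (counts["unprefetched miss"] + counts["late prefetched miss"]
                + counts["prefetched hit"]) / 10
```

With these made-up counts, the bottom three sections sum to 110% of the original misses, the situation the text attributes to cache pollution.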
Adding Prefetches for Procedure Returns and Indirect Jumps
Having seen the success of our basic cooperative prefetching scheme, we now evaluate the performance benefit of extending it to include the indirect prefetches (i.e., pf_r and pf_i prefetches for procedure returns and indirect jumps, respectively). Figure 10 shows the performance of five variations of cooperative prefetching: the basic scheme (C); the basic scheme plus pf_r prefetches (SR); the basic scheme plus using hardware to prefetch the top three addresses on the stack at each procedure return (HR); and two cases which include the basic scheme plus pf_i prefetches (SI and BI). Both schemes SR and HR use a 12-entry return address stack. While scheme HR has no instruction overhead, scheme SR has better control over the prefetching distance via compiler scheduling. Scheme SI uses a 1 KB, 2-way set-associative indirect-target table where each entry holds up to four target addresses; scheme BI uses a 16 KB, 4-way set-associative indirect-target table with 16 targets per entry.
As we can see in Figure 10, the marginal benefit of supporting indirect prefetches is quite small for these applications. Part of the limitation is that only a relatively small fraction (roughly 15%) of the remaining misses that are not handled by our basic scheme are due to either procedure returns or indirect jumps, and therefore the potential for improvement is small. In addition, since some indirect jumps can have a fairly large number of possible targets (e.g., more than eight, as we observe in perl and gcc), prefetching all of these targets could result in cache pollution. Prefetching indirect jump targets may become more important in applications where they occur more frequently (e.g., object-oriented programs that make heavy use of virtual functions, or applications that use shared libraries). Although two of our applications are written in C++ (porky and skweel), they rarely use virtual functions. Since our applications show little benefit from pf_r and pf_i prefetches, we do not use them in the remainder of our experiments.
Importance of Prefetch Filtering and Software Prefetching
Two components of the cooperative prefetching design contribute to its performance advantages: prefetch filtering and compiler-inserted software prefetching. To isolate the contributions of each component, Figure 11 shows their performance individually as well as in combination. The relative importance of prefetch filtering versus compiler-inserted prefetching varies across the applications: in tcl, prefetch filtering is more important, and in postgres, compiler-inserted prefetching is more important. In all cases, the best performance is achieved when both techniques are combined, and in all but one case this results in a significant speedup over either technique alone. Intuitively, the reason for this is that the benefits of prefetch filtering (i.e., avoiding cache pollution) and software prefetching (i.e., issuing non-sequential prefetches early enough) are orthogonal. Hence both components of our design are clearly important for performance and are complementary in nature.
[Figure caption fragment: The y-axis of (a) is normalized to the number of instructions in the original executable.]
Impact of Prefetching Optimizations
To evaluate the effectiveness of the compiler optimizations in reducing the number of prefetches, we measured their impact both on code size and performance. Figure 12(a) shows the number of static prefetches remaining as each optimization pass is applied incrementally, normalized to the original code size. Without any optimization (U), the code size can be bloated by over 40%. Combining prefetches at dominators (D) dramatically reduces the prefetch count by more than half in all applications except postgres. Eliminating unnecessary prefetches and compressing prefetches further reduces the prefetch count by a moderate amount. (Prefetch hoisting has no effect on the static prefetch count, and therefore is not shown in Figure 12(a).) Altogether, the prefetch optimizations limit the prefetch count to only 9% of the original code size on average. Figure 12(b) shows the impact of these optimizations on performance. As we see in this figure, combining prefetches at dominators results in a noticeable performance improvement in several cases (e.g., gcc, perl, and tcl). The other optimizations have a negligible performance impact. In fact, prefetch compression and hoisting sometimes degrade performance by a very small amount by changing the order in which prefetches are launched.
Varying the Prefetch-Scheduling Distance
A key parameter in our prefetch scheduling compiler algorithm is the prefetch-scheduling distance (i.e., SCHED_DIST in Figure 6). When choosing a value for this parameter, we must consider the following tradeoffs: we would like the parameter to be large enough to hide the expected miss latency, but setting the parameter too high can increase the code size (since more prefetches must be inserted to cover a larger number of unique incoming paths) and increase the likelihood of polluting the cache. In our experiments so far, we have used a prefetch-scheduling distance of 20 instructions, which is roughly equal to the product of the expected IPC (about 1.6) and the primary-to-secondary miss latency (about 12 cycles). To determine the sensitivity of cooperative prefetching to this parameter, we varied the prefetch-scheduling distance across a range of five values from 12 to 28 instructions, and measured the resulting impact on both code size and performance (shown in Figures 13(a) and 13(b), respectively). As we observe in Figure 13(a), increasing the prefetch-scheduling distance can result in a noticeable increase in the code size. Fortunately, even with a prefetch-scheduling distance as large as 28 instructions, the compiler is still able to limit the code expansion to less than 11% on average, due to the optimizations discussed in the previous section. In contrast, the performance offered by cooperative prefetching is less sensitive to the prefetch-scheduling distance, as we see in Figure 13(b). While tcl enjoys a 6% speedup as we increase this parameter from 12 to 28 instructions, the other applications experience no more than a 2% fluctuation in performance across this range of values. Hence we observe that performance is not overly sensitive to this parameter.
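The rule of thumb behind the baseline distance of 20 instructions can be written out as a short calculation. The function name is hypothetical, and rounding the product up to a whole number of instructions is an assumption; the text only says the distance is "roughly equal" to the product.

```python
import math


def estimate_scheduling_distance(expected_ipc, miss_latency_cycles):
    """Instructions expected to execute while one miss is outstanding.

    Rounding up to a whole instruction count is an assumption of this sketch.
    """
    return math.ceil(expected_ipc * miss_latency_cycles)


# Baseline values from the text: IPC of about 1.6, latency of about 12 cycles.
baseline_distance = estimate_scheduling_distance(1.6, 12)
print(baseline_distance)
```

The product is about 19.2 instructions, which rounds up to the 20 used in the experiments. This is only a starting estimate: the compiler settings actually used elsewhere in the paper (12 and 28 instructions for the 6-cycle and 24-cycle latencies) were chosen by the authors and do not follow this product exactly.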
Impact of Latency and Bandwidth Variations
We now consider the impact of varying miss latencies and available bandwidth between the primary and secondary caches on the performance of cooperative prefetching. Recall that in our experiments so far, the primary-to-secondary miss latency has been 12 cycles (plus any delays due to contention). Figure 14 shows the performance of next-4-line and cooperative prefetching when this parameter is decreased to 6 cycles and increased to 24 cycles. (Note that the compiler's prefetch-scheduling distance was set to 12 and 28 instructions, respectively, for the 6-cycle and 24-cycle cases.) As we see in Figure 14, cooperative prefetching still performs well under both latencies, and results in even larger improvements as the latency grows. In the 24-cycle case, cooperative prefetching results in an average speedup of 24.4%, which is double the average speedup of next-4-line prefetching (12.2%).
Turning our attention to bandwidth, recall that our experiments so far have assumed a bandwidth of 32 bytes/cycle between the primary instruction cache and the secondary cache. Figure 15 shows the impact of decreasing this bandwidth to 16 bytes/cycle, and increasing it to unlimited bandwidth. (Note that the C32 case, cooperative prefetching with the original bandwidth of 32 bytes/cycle, is included on the same axis simply as a point of comparison.) There are two things to note from Figure 15. First, we see in Figure 15(a) that while reducing the bandwidth does degrade the performance of cooperative prefetching somewhat (from an average speedup of 13.3% to 12.5%), the overall performance gain still remains high. Hence cooperative prefetching can achieve good performance with realistic amounts of bandwidth. (Note that this bandwidth includes servicing data cache misses as well.) Second, in Figure 15(b) we observe that increasing the bandwidth beyond 32 bytes/cycle does not significantly improve the performance of cooperative prefetching (the average speedup only increases from 13.3% to 13.7%). Therefore cooperative prefetching is not bandwidth-limited, and it is more likely that it is limited by other factors (e.g., cache pollution, achieving a sufficient prefetching distance, etc.).
Cost Effectiveness
Having demonstrated the performance advantages of cooperative prefetching, we now focus on whether the additional hardware support is cost effective. One alternative to cooperative prefetching would be to simply increase the cache sizes by a comparable amount. (Note that this is overly simplistic since the primary cache sizes are often limited more by access time than the amount of silicon area available.) For our baseline architecture, the additional storage necessary to support basic cooperative prefetching is 640 bytes at the level of the primary I-cache (128 bytes for the prefetch bits used by prefetch filtering, and 512 bytes for the prefetch buffer), and 8 KB for the 2-bit saturating counters added to the L2 cache. (We do not count the storage for prefetching indirect jumps because they are not used in basic cooperative prefetching.) Figure 16 compares the performance of a 32 KB I-cache with cooperative prefetching with that of three larger I-caches, ranging from 64 KB to 256 KB, without prefetching. It is encouraging that the average speedup achieved by cooperative prefetching (13.3%) is greater than that obtained by doubling the cache size from 32 KB to 64 KB (10.8%), despite the substantially higher hardware cost of the larger cache. In addition, cooperative prefetching outperforms the 128 KB I-cache in three of the seven applications, and is within 2% of the performance with a 256 KB I-cache in five cases. Overall, cooperative prefetching appears to be a more cost-effective method of improving performance than simply increasing the I-cache size.
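The storage figures above can be tallied in a few lines. The sketch below uses only the byte counts stated in the text; comparing the total against the 32 KB added by doubling the I-cache is an illustrative simplification that counts raw bytes and ignores differences in cost between primary-cache and L2 storage. The variable names are hypothetical.

```python
KB = 1024

# Storage added by basic cooperative prefetching, as stated in the text.
prefetch_bits_bytes = 128      # prefetch bits used by prefetch filtering
prefetch_buffer_bytes = 512    # prefetch buffer
l2_counter_bytes = 8 * KB      # 2-bit saturating counters in the L2 cache

primary_level_bytes = prefetch_bits_bytes + prefetch_buffer_bytes
total_bytes = primary_level_bytes + l2_counter_bytes

# Raw storage added when the I-cache is doubled from 32 KB to 64 KB.
doubling_bytes = 64 * KB - 32 * KB

# Fraction of the doubling cost (raw bytes only; an illustrative comparison).
fraction_of_doubling = total_bytes / doubling_bytes
print(primary_level_bytes, total_bytes, round(fraction_of_doubling, 2))
```

By this raw count, the added storage is roughly a quarter of what doubling the I-cache requires, and most of it sits at the L2 level rather than in the primary cache.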
Conclusions
To overcome the disappointing performance of existing instruction prefetching schemes on modern microprocessors, we have proposed and evaluated a new prefetching scheme whereby the hardware and software cooperate as follows: the hardware performs aggressive next-N-line prefetching combined with a novel prefetch filtering mechanism to get far ahead on sequential accesses without polluting the cache, and the compiler uses a novel algorithm to insert explicit instruction-prefetch instructions into the executable to prefetch non-sequential accesses. Our experimental results demonstrate that our scheme significantly outperforms existing schemes, eliminating 50% or more of the latency that had remained with the best existing scheme. This reduction in latency translates into a 13.3% average speedup over the original execution time on a state-of-the-art superscalar processor, which is more than double the 6.5% speedup achieved by the best existing scheme, and much closer to the maximum 20% speedup (for these applications and this architecture) in the ideal instruction prefetching case. These improvements are the result of launching prefetches earlier (thereby hiding more latency), while at the same time dramatically reducing the cache-polluting effects of useless prefetches. Given these encouraging results, we advocate that future microprocessors provide instruction-prefetch instructions along with the prefetch filtering mechanism.
