Compiler-controlled speculation has been shown to be effective in increasing the instruction level parallelism (ILP) found in non-numeric programs. However, the extent to which speculatively scheduled code affects the instruction and data caches is not clear. In particular, the amount of time spent resolving cache misses may be significant enough to prevent the more aggressive speculation models from attaining their best potential performance. The objective of this paper is to quantify these effects using aggressive speculation models.
Introduction
Instruction scheduling is the process used by the compiler to re-order instructions in an effort to minimize program execution time. Since instruction scheduling is NP-Hard, heuristics are used to approximate the best schedule. One common approach to scheduling is to perform list scheduling using greedy heuristics to approximate a globally optimal schedule [1]. Regardless of the scheduling heuristics, instructions are ordered based upon some priority mechanism. At each cycle, the instructions with the highest priority that have resolved all dependences and meet the issue requirements of the processor are scheduled.
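The cycle-by-cycle selection described above can be sketched as follows. This is a minimal illustration, not the IMPACT scheduler: the `Instr` record, its fields, and the priority values are hypothetical, and the sketch assumes an acyclic dependence graph in which every predecessor appears in the input list.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    # Hypothetical instruction record used only for this sketch.
    name: str
    latency: int = 1
    priority: int = 0
    preds: list = field(default_factory=list)  # names this instruction depends on


def list_schedule(instrs, issue_width):
    """Greedy list scheduling; returns a dict mapping name -> issue cycle."""
    by_name = {i.name: i for i in instrs}
    issue_cycle = {}
    remaining = list(instrs)
    cycle = 0
    while remaining:
        # Ready means every predecessor has issued and its result is available.
        ready = [
            i for i in remaining
            if all(
                p in issue_cycle and issue_cycle[p] + by_name[p].latency <= cycle
                for p in i.preds
            )
        ]
        # Highest priority first; the stable sort keeps program order on ties.
        ready.sort(key=lambda i: -i.priority)
        for i in ready[:issue_width]:
            issue_cycle[i.name] = cycle
            remaining.remove(i)
        cycle += 1
    return issue_cycle


example = [
    Instr("ld", latency=2, priority=3),
    Instr("add", priority=2, preds=["ld"]),
    Instr("mul", priority=1),
    Instr("st", priority=0, preds=["add"]),
]
schedule = list_schedule(example, issue_width=2)
```

In this example the independent `mul` fills the second issue slot of cycle 0, while `add` must wait for the two-cycle `ld`.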
The implementation of a scheduler is straightforward if list scheduling is applied only within basic blocks. Unfortunately, there is insufficient instruction level parallelism (ILP) available within basic blocks of non-numeric benchmarks to fully utilize the functional units of wide issue superscalar and VLIW architectures [2, 3, 4]. Therefore global scheduling techniques such as trace scheduling [5] and superblock scheduling [6] have been proposed to permit greater scheduling and optimization freedom beyond basic block boundaries. Using these techniques, the program is divided into a set of traces or superblocks that represent frequently executed paths. These traces or superblocks contain multiple basic blocks and as a result can contain multiple conditional branches. When building a dependence graph for a trace or superblock, control dependence arcs are added from conditional branches to subsequent instructions. In order to gain additional scheduling freedom beyond the natural basic block boundaries found within these traces or superblocks, the compiler must remove some of these control dependence arcs. This permits speculation of instructions past conditional branches, thus the term compiler-controlled speculation.
*Roger Bringmann is now with QMS, Inc. †Scott Mahlke is now with HP Labs.
When an instruction is speculated above a branch, it is executed regardless of the direction taken by the branch. As such, the speculated instruction could introduce instruction cache (Icache) and data cache (Dcache) effects that were not present in the original unscheduled program. If these effects are significant, much of the performance potential of aggressive speculation may be lost. This makes it critical that processor and computer system designers understand the requirements of the speculation models being used by the compiler if they are to balance the potential performance of the compiled code with the cache implementation.
The next section briefly describes the static speculation models used in the experiments. Section 3 discusses the expected cache effects. Section 4 presents experimental results showing how these speculation models affect various configurations of instruction and data caches. Finally, concluding remarks are given in Section 5.
Scheduling Models
In order to gain greater scheduling freedom, instructions must be allowed to speculate above conditional branches found within a trace or superblock. In some cases, speculation of these instructions can introduce a scheduling error that can cause unexpected program termination. An example of such a scheduling error would be scheduling a divide before a branch that was implicitly preventing a divide-by-zero.
2. Ignore Errors - assumes that the likelihood of a scheduling error is small and will therefore speculate a non-excepting form of the instruction. As a result, any scheduling errors are hidden. This model requires non-excepting forms of each potentially excepting instruction that is speculated [2].
3. Resolve Errors - speculates instructions that could cause a scheduling error but assumes that the processor has some mechanism to resolve the error. Three examples of speculation models that fall into this category are boosting [7], sentinel scheduling [8] and write-back suppression based upon the speculation model and the processor support.
As shown in Figure 1, the existing speculation models can be categorized into three classes based upon these decisions.
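The divide-by-zero scheduling error mentioned above, and the way a non-excepting instruction hides it, can be illustrated with a small sketch. All function names are hypothetical, and the value returned by the non-excepting divide on a zero divisor (0 here) is an arbitrary assumption; real hardware may produce any value, since the result is discarded on that path.

```python
def original(n, d):
    # The branch guards the divide, so the divide never sees d == 0.
    if d == 0:
        return 0
    return n // d


def speculated_excepting(n, d):
    # The divide is hoisted above the branch: it now executes on both paths
    # and traps when d == 0, which is a scheduling error.
    q = n // d
    if d == 0:
        return 0
    return q


def div_nonexcepting(n, d):
    # Hypothetical non-excepting divide: returns an arbitrary value (0 here)
    # instead of raising when the divisor is zero.
    return n // d if d != 0 else 0


def speculated_ignore_errors(n, d):
    # Same hoisting, but using the non-excepting form. The garbage result
    # is never used on the path where d == 0.
    q = div_nonexcepting(n, d)
    if d == 0:
        return 0
    return q
```

The last variant behaves like the original program for every input, which is why this class needs a non-excepting form of each speculated instruction.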
Each of the speculation models used for experimentation has characteristics that permit a different degree of scheduling freedom. As such, they are expected to introduce varying Icache and Dcache effects. In order to evaluate the speculation models fairly, all benchmarks are aggressively optimized with superblock techniques [6]. Each of the speculation models is then used to schedule the optimized code. This paper does not cover any of the resolve errors speculation models since their recovery processes are not directly comparable to the other models.
No Speculation Model
This model provides a baseline for the typical Icache and Dcache effects that occur without code speculation. As a result, this model provides the best scheduling results that are attainable with no additional Icache and Dcache effects introduced from compile-time scheduling.
Restricted Speculation Model
Restricted speculation (formerly called restricted code percolation [2]) assumes that correct program execution is always required as defined by the avoid errors class. Using this model, the compiler can only speculate instructions that will never cause an exception. The conservative definition of potentially excepting instructions used by this model assumes that if there is any way that an instruction could cause an exception, it will be classified as potentially excepting and may not be speculated. As such, this definition ignores the cases where the context in which the instruction is used can sometimes indicate whether the instruction may cause an exception. This conservative model prevents speculation of any memory instructions, integer divide and remainder, and all floating point instructions. This model functions as the low end of the speculation models. The advantage of this model is that it does not introduce any Dcache effects as a result of speculating memory instructions. The only Dcache effects introduced beyond those of the no speculation model are a direct result of the increased register pressure created by the more aggressively speculated code.
Safe Speculation Model
Safe speculation expands the scheduling freedom of restricted speculation by using program analysis to identify potentially excepting instructions that can never cause scheduling errors or will introduce no new scheduling errors. An example of such a potentially excepting instruction would be a load from an array. If the locations that the load may access can be proven to be within the declared array bounds, then the load is a safe load. This model also falls under the avoid errors classification. The advantage of this speculation model is that it requires no special hardware support in the processor, unlike the resolve errors and ignore errors classes, and does not have the inherent risks associated with the ignore errors class. The safe speculation model results reported in this paper are based on the inter-procedural and intra-procedural analysis algorithms reported in [10].
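The difference between the restricted and safe models can be sketched as a pair of legality checks. This is only an illustration: the opcode names are hypothetical, and the bounds test stands in for the much more involved analysis of [10], assuming the analysis has already produced a closed range of possible array indices.

```python
# Hypothetical opcode groups; the real instruction set is given in Table 1.
MEMORY_OPS = {"load", "store"}
INT_DIV_OPS = {"div", "rem"}
FP_OPS = {"fadd", "fsub", "fmul", "fdiv"}


def is_potentially_excepting(opcode):
    """Conservative, context-free classification used by restricted speculation."""
    return opcode in MEMORY_OPS or opcode in INT_DIV_OPS or opcode in FP_OPS


def is_safe_load(index_min, index_max, array_length):
    """A load is safe when every index it may use lies within the declared bounds."""
    return 0 <= index_min and index_max < array_length


def may_speculate(opcode, model, proven_safe=False):
    """Return True if the instruction may be moved above a branch."""
    if model == "restricted":
        return not is_potentially_excepting(opcode)
    if model == "safe":
        # Safe speculation additionally admits instructions proven harmless.
        return (not is_potentially_excepting(opcode)) or proven_safe
    raise ValueError(f"unknown model: {model}")
```

For instance, a load whose index is known to lie in 0..9 of a 10-element array is rejected by the restricted check but accepted by the safe check.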
General Speculation Model
General speculation (formerly called general code percolation) falls under the ignore errors classification. It requires a non-excepting form of every potentially excepting instruction that is desirable to speculate [2]. Thus, if a potentially excepting instruction is to be speculated, it will be replaced by its non-excepting form. Potentially excepting instructions are typically load instructions, integer divide and remainder, and all floating point instructions. The results on general speculation reported in this paper are based upon a full model where every potentially excepting instruction has a non-excepting counterpart in the instruction set.
Figure 2: The most important loop in cccp scheduled using no speculation model.
compiler-controlled speculation. To accomplish this goal, scheduled code examples from two benchmarks are presented. These code examples were chosen because they show extreme cache effects due to speculation. The instruction opcodes and their descriptions for the examples are given in Table 1. The examples were scheduled with the no speculation and general speculation models using an eight-issue superscalar processor that has uniform functional units and the instruction latencies of the HP PA-RISC 7100 (see Table 5). The Icache and Dcache block sizes were 64 bytes each.
Icache Effects
Speculating instructions above branches moves them from less frequently executed paths to more frequently executed paths. As such, the instruction working set is increased, which should result in more Icache requests and subsequently more Icache misses. The first example benchmark, cccp, is used to show the expected Icache effects. To accomplish this, the most frequently executed loop within cccp (found in the rescan function) was used. Based upon profile information, the IMPACT superscalar optimizer decided to unroll this loop three times. Tables 2 and 3 respectively show the scheduled code for the no speculation and general speculation models. As these tables show, none of the branches from Table 2 have been delayed in Table 3. In addition, the schedule was reduced from 20 cycles for the no speculation model to 8 cycles for the general speculation model. It should also be noted that scheduling with the no speculation model provided insufficient freedom to schedule more than 5 instructions in any cycle for the 8-issue processor.
While none of the branches in the general speculation schedule were issued later than in the no speculation schedule, the location of the branches within the Icache blocks did change, as shown by Tables 4 and 5. The most notable difference is that branch instruction 59 is located in block 2 of the no speculation Icache layout while it is located in block 3 of the general speculation Icache layout. As a result, there is one additional Icache block before the branch. If branch 59 were infrequently taken, this might not increase Icache misses since both the no speculation and general speculation loops are contained within only 4 Icache blocks. However, as Table 2 shows, this branch is taken 6192 times. This means that the working set of the taken path of branch 59 contains one more Icache block in the general speculation schedule than in the no speculation schedule. The increased working set of this taken branch increases the chance of mapping conflicts with other important Icache blocks. As such, the more aggressive schedule gains its advantages at the cost of a greater risk of Icache misses.
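The block-layout argument can be made concrete with a small sketch. It assumes fixed 4-byte instructions and a loop that starts on a block boundary; both are assumptions of this illustration rather than statements from the tables. The slot positions used for the branch are hypothetical and were chosen only to reproduce the block 2 versus block 3 placement described above.

```python
BLOCK_BYTES = 64   # Icache block size used in the examples
INSTR_BYTES = 4    # assumption: fixed 32-bit instructions


def icache_block(slot):
    """Block number (counting from 1) holding the instruction at a given slot.

    `slot` is the zero-based position of the instruction in the laid-out loop,
    which is assumed to begin on a block boundary.
    """
    return (slot * INSTR_BYTES) // BLOCK_BYTES + 1


def blocks_fetched_before_exit(branch_slot):
    """Blocks that must be resident to run from the loop top to a taken branch."""
    return icache_block(branch_slot)


# Hypothetical slots: speculation moves instructions ahead of the branch,
# pushing it from slot 30 to slot 35 and therefore across a block boundary.
no_spec_blocks = blocks_fetched_before_exit(30)
general_blocks = blocks_fetched_before_exit(35)
```

With 16 instructions per block, moving a frequently taken branch by only a few slots is enough to add a whole block to the working set of its taken path.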
Dcache Effects
Speculating load instructions above branches moves them from less frequently executed paths to more frequently executed paths. This will not only have effects on the Icache, but will also increase the frequency with which the load requests are made. As such, the data working set is increased, which should result in more Dcache requests and subsequently more Dcache misses. The second benchmark, compress, is used to show the expected Dcache effects. To accomplish this, the most frequently executed loop within compress (found in the compress function) was used. Based upon profile information, the IMPACT superscalar optimizer decided to unroll this loop three times. Tables 6 and 7 respectively show the scheduled code for the no speculation and general speculation models. As the tables show, the no speculation model used 37 cycles while the general speculation model required only 18 cycles. It should also be noted that scheduling with the no speculation model provided insufficient freedom to schedule more than 6 instructions in any cycle for the 8-issue processor.
Table 3 shows the increased execution frequency of the six speculated loads from the general speculation schedule of this loop. By speculating a load above a particular branch, the memory reference patterns of the control flow paths reached from that branch are altered. Depending upon the cache configuration, this could introduce more Dcache conflicts. For example, by speculating load instruction 163 above branch 159 in Table 7, the memory reference pattern of the paths reached by the taken path of this branch has been altered. Based upon the increased execution frequency of load number 163, and the resultant change in memory reference patterns, Dcache miss rates caused by this load could increase. Due to speculation of other loads and the change in their memory reference patterns, the to-
Experimental Evaluation
This section will quantify the effects that increasing levels of scheduling freedom can have on instruction and data caches. The speculation models used in the experiments from least aggressive to most aggressive are no speculation, restricted speculation, safe speculation and general speculation. 
Methodology
Compiler support for each of the speculation models has been implemented in the IMPACT-I C compiler. The IMPACT-I compiler is a prototype optimizing compiler designed to generate efficient code for VLIW and superscalar processors [2]. The benchmarks used in this study are the 14 non-numeric programs shown in Table 4. The benchmarks consist of 5 non-numeric programs from the SPECint92 suite and 9 other commonly used non-numeric programs. Each of the benchmarks was aggressively optimized with superblock techniques [6] and scheduled using the four speculation models, varying the processor issue width from 1 to 8 instructions per cycle.
[Figure: The most important loop in compress scheduled using general speculation model.]
The processor model used in this study is an in-order [...] branch delay slot, and the instruction set of the HP PA-RISC processor. The instruction latencies assumed are those of the HP PA-RISC 7100 (see Table 5). For each machine configuration, the program execution times are derived from execution driven simulations of the benchmarks in Table 4. During the simulations, the issue widths were varied from 1 to 8 based upon the processor model that the code was scheduled for. Dynamic branch prediction was assumed using a 1024 entry direct mapped BTB with a 2 bit counter and a 2 cycle misprediction penalty. A perfect Dcache was used when measuring the Icache effects and a perfect Icache was used when measuring the Dcache effects.
The cache configurations used for the experiments are given in Table 6.
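To make the difference between the direct mapped and 2-way set associative configurations concrete, the following is a minimal cache model. It is a sketch, not the simulator used in the experiments: the LRU replacement policy is an assumption, and only hits and misses are tracked.

```python
class Cache:
    """Set associative cache model that counts accesses and misses."""

    def __init__(self, size_bytes, block_bytes=64, ways=1):
        self.block_bytes = block_bytes
        self.ways = ways
        self.num_sets = size_bytes // (block_bytes * ways)
        # Each set holds block numbers, least recently used first.
        self.sets = [[] for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, addr):
        """Access one address; returns True on a hit."""
        block = addr // self.block_bytes
        lines = self.sets[block % self.num_sets]
        self.accesses += 1
        if block in lines:
            lines.remove(block)
            lines.append(block)  # mark as most recently used
            return True
        self.misses += 1
        if len(lines) == self.ways:
            lines.pop(0)  # evict the least recently used block (assumed policy)
        lines.append(block)
        return False

    def miss_rate(self):
        return self.misses / self.accesses if self.accesses else 0.0


# Two addresses 4096 bytes apart conflict in a 4K direct mapped cache
# but can coexist in a 4K 2-way set associative cache.
direct_mapped = Cache(4096, ways=1)
two_way = Cache(4096, ways=2)
for addr in [0, 4096, 0, 4096]:
    direct_mapped.access(addr)
    two_way.access(addr)
```

The alternating pattern misses on every access in the direct mapped cache, while the 2-way cache misses only on the first touch of each block.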
Results
The sheer volume of data produced by the simulations made it impossible to present the individual benchmark results in this paper. In an effort to be more concise, the results presented in the subsequent figures are generated by computing the arithmetic mean of speedups for each speculation model, cache size and issue rate. Speedup was computed by dividing the execution time of the respective benchmark using the no speculation model at issue 1 with a 4K direct mapped Icache and Dcache by the execution time of the same benchmark using the specified speculation model at the specified cache size and issue rate.
Icache Performance Results
Figure 8 shows the performance results for direct mapped caches for the extreme speculation models: no speculation and general speculation. The first thing to observe from this figure is that the curves for the no speculation model show very little change regardless of the issue rate. In particular, there was an increase of only 0.35 IPC (16.9%) at issue 8 from a 4K to a 64K Icache. In contrast, the curves for the general speculation model showed a noticeable increase from the lower issue rates to the higher issue rates. In particular, there is an increase of 0.36 IPC (20.9%) at issue 2, 0.77 IPC (29.3%) at issue 4, and 1.05 IPC (31.5%) at issue 8. Thus, the benefits from larger cache sizes are more pronounced as the issue rate increases. Finally, the performance for all speculation models stabilized with a 64K Icache.
Figure 9 shows the performance results for 2-way set associative caches for the no speculation and general speculation models. By comparing this figure to Figure 8, it is clear that there is little advantage in higher associativities with Icaches larger than 8K regardless of the issue rate or speculation model. Even at the lowest cache sizes, general speculation was only able to show a 6 percent speedup at 8-issue using 2-way set associative Icaches over direct mapped Icaches.
Figure 10 shows the comparative Icache results for all of the scheduling models at issue-1 and issue-8. As the figure shows, there is no significant performance advantage in using any of the aggressive speculation models for a single issue processor. Since only one instruction can be issued per cycle, the only potential slots that can be filled in the schedules of the integer benchmarks are branch and load delay slots. Therefore, there is very little opportunity to improve the performance of the benchmarks through more aggressive speculation. As a result of little speculation, only minor Icache effects are observed.
In contrast to the single issue performance, there is a clear advantage in using more aggressive speculation models at 8-issue. The no speculation model shows a 13.1 percent improvement between 4K and 64K Icaches. The restricted speculation model shows an 18.0 percent improvement, the safe speculation model shows a 21.4 percent improvement and the general speculation model shows a 24.5 percent improvement over the same cache configurations. Thus, while the cache size was only a minor impediment to performance with lower issue rates, it is clearly a larger impediment to performance with higher issue rates for more aggressive speculation models. However, this set of benchmarks was not able to benefit from Icaches larger than 64K.
One additional point should be noted from the 8-issue results shown in Figure 10. The most aggressive speculation model's performance ranged from only 8.9 to 11.8 percent higher than safe speculation. Thus, safe speculation has great potential since it requires no special processor support that could potentially lead to slower clock rates. Also, it introduces none of the risks that result from ignoring scheduling errors, as general speculation does.
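The speedup figures quoted in these results follow the definition given under Results: each benchmark's time under the baseline configuration (no speculation, issue 1, 4K direct mapped caches) divided by its time under the configuration of interest, averaged arithmetically over the benchmarks. A minimal sketch of that computation follows; the benchmark names and execution times are invented for illustration and are not measurements from the paper.

```python
def speedup(base_time, time):
    """Speedup of one benchmark relative to the baseline configuration."""
    return base_time / time


def mean_speedup(base_times, times):
    """Arithmetic mean of per-benchmark speedups.

    Both arguments map benchmark name -> execution time and must share keys.
    """
    ratios = [speedup(base_times[name], times[name]) for name in base_times]
    return sum(ratios) / len(ratios)


# Hypothetical execution times (arbitrary units), for illustration only.
baseline = {"bench_a": 100.0, "bench_b": 300.0}
configured = {"bench_a": 50.0, "bench_b": 100.0}
result = mean_speedup(baseline, configured)
```

Note that the mean is taken over the per-benchmark ratios, not over the raw execution times, so a long-running benchmark does not dominate the result.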
Analysis of Icache Results
To more fully understand the performance results, the Icache behavior is broken down in Tables 7 and 8. Table 7 contains the absolute number of read requests and read misses as well as the miss rate for each of the benchmarks in the base case. The numbers from Table 8 represent the read requests and read misses as a percentage of the totals presented in the final row of Table 7.
As Table 8 shows, the more aggressive speculation models tend to reduce the number of Icache read requests. This can be explained by how the simulator's fetch model works. The fetch model fills buffers equivalent to twice the issue rate of the processor in an effort to provide the processor with the issue-width number of instructions at each cycle. Thus, each cycle, the fetch unit fetches a block of instructions to fill the fetch buffer. Any instructions that cannot be placed into the fetch buffer will be discarded and potentially fetched again the next cycle. Since the more aggressive speculation models have more independent instructions each cycle to choose from, the compiler is better able to group independent instructions together and reduce interlocks. As such, more instructions can be issued each cycle, which reduces the need to re-fetch the same cache block repeatedly.
As Table 8 shows, even though the number of read requests decreased, the absolute miss rates increased for both 4K and 64K from the least aggressive speculation models to the most aggressive speculation models. In particular, there was a 1 percent increase in the miss rate from no speculation to general speculation. There was practically no change in the Icache miss rates with 64K Icaches since the Icache was sufficiently large to hold the working set for all speculation models. While the miss rate for general speculation at 8-issue with a 64K cache is only 1.5 percent lower than the miss rate with a 4K cache, the performance was 24.5 percent higher. Thus, even a small increase in the miss rate can significantly impact the performance for the more aggressive speculation models. The impact on performance would be even more pronounced if the cache miss latency was greater than the simulated 12 cycles.
Table 9: Icache misses for the no speculation and general speculation models of the cccp loop example at issue 8 (2-way set associative, 4K Icache).
The cccp loop example shown in Tables 4 and 5 can be used to illustrate the reasons for the increase in the miss rate with the 4K Icache. Table 9 shows the Icache misses caused by the first instruction in each Icache block. The misses caused by the instruction at the start of the loop are represented with Icache block 1. There was only a negligible difference in the miss rates for the two speculation models in this block. Icache blocks 2 and 4 decreased their cache misses from the no speculation model to the general speculation model. Icache block 3 showed a significant increase in Icache misses. Most of these misses can be attributed to migration of the misses from Icache blocks 2 and 4 to Icache block 3 due to the small 4K Icache. However, even after considering the migration of misses, there was an overall increase in misses for the loop of 16.34 percent, which is attributable to the additional Icache block before the frequently taken branch number 59 in the Icache layout for the general speculation model.
Dcache Performance Results
Figure 11 shows the performance results for direct mapped Dcaches for the extreme speculation models. The first thing to observe from this figure is that the curves for the no speculation model show a much smaller increase in performance than general speculation at the same issue rates. In particular, there was an increase of only 0.51 IPC (27.5%) at issue 8 from a 4K to a 64K Dcache while the general speculation model showed an increase of 1.32 IPC (45.4%).
In contrast to the Icache results, the performance for the general speculation model still demonstrates a noticeable [...]
Figure 11: Dcache effects for no speculation and general speculation models.
Figure 12 shows the performance results for 2-way set associative Dcaches for the no speculation and general speculation models. By comparing this figure to Figure 11, it is clear that higher associativity significantly benefits the smaller Dcaches. In particular, general speculation showed a 19 percent improvement in performance at 8-issue for a 2-way set associative 4K Dcache over a direct mapped 4K Dcache. The no speculation model showed a 14 percent improvement in performance at the same cache configurations. Both speculation models showed some performance improvement with higher associativity when using Dcaches as large as 128K. Thus, higher associativity can be better used to offset the limitations of smaller Dcaches than those of smaller Icaches.
Figure 13 shows the comparative Dcache results for all of the scheduling models at issue 1 and issue 8. As the figure shows, there is no significant performance advantage in using any of the aggressive speculation models for a single issue processor. However, at issue 8, there is a clear advantage in using the more aggressive speculation models. An increase in the Dcache size from 4K to 64K using the no speculation model resulted in a performance improvement of 13.8 percent while the restricted speculation model showed an increase of 16.3 percent. Safe speculation increased performance by 20.6 percent and general speculation increased performance by 24.6 percent over the same region.
While there was no performance advantage from increasing the Icache beyond 64K, this was not the case with the Dcache. The no speculation model improved its performance to 21.5 percent higher than 4K with perfect Dcaches. Restricted speculation improved to 26.2 percent higher than 4K. Safe speculation improved to 30.1 percent higher and general speculation improved to 35.9 percent higher. Thus, small Dcaches have been shown to be a significant impediment to the potential performance of more aggressive speculation models at higher issue rates.
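One way to see why cache misses weigh more heavily on aggressively scheduled code is a simple stall model. The sketch below assumes a blocking cache in which every miss adds the full 12 cycle latency to execution time; that additive model and the miss count are assumptions of this illustration. The schedule lengths of 20 and 8 cycles per iteration are those reported for the cccp loop.

```python
MISS_LATENCY = 12  # simulated cache miss latency, in cycles


def total_cycles(schedule_cycles, misses, miss_latency=MISS_LATENCY):
    """Execution time when each miss stalls the processor for the full latency."""
    return schedule_cycles + misses * miss_latency


def stall_fraction(schedule_cycles, misses, miss_latency=MISS_LATENCY):
    """Share of execution time spent waiting on cache misses."""
    stall = misses * miss_latency
    return stall / (schedule_cycles + stall)


# Ten iterations of the cccp loop with one hypothetical miss:
# 20 cycles per iteration without speculation, 8 with general speculation.
no_spec = stall_fraction(schedule_cycles=10 * 20, misses=1)
general = stall_fraction(schedule_cycles=10 * 8, misses=1)
```

Under these assumptions the same single miss costs a larger share of the run time in the shorter general speculation schedule, and any additional misses caused by speculation widen the gap further.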

Analysis of Dcache Results
To more fully understand the performance results, the Dcache behavior is broken down in Tables 10 and 11. Table 10 contains the absolute number of read requests and read misses as well as the miss rate for each of the benchmarks in the base case. The numbers from Table 11 represent the read requests and read misses as a percentage of the totals presented in the final row of Table 10. Table 11 shows that the Dcache accesses increase with the more aggressive speculation models. This is caused by an increase in the working set size resulting from speculation of additional load instructions.
The decrease in the miss rate from the less aggressive to the more aggressive speculation models is misleading since [...] Tables 6 and 7 can be used to illustrate the reasons for the increases in Dcache misses. Table 12 shows the Dcache misses generated by the load instructions in the no speculation and general speculation codes based upon a 4K Dcache. It can be seen from this data that there were moderate to significant increases in Dcache misses from the no speculation case to the general speculation case. By comparing the increased Dcache miss rates for load instructions 163, 183 and 203 with their respective increases in execution frequency given in Table 3, it is apparent that the increase in miss rates for these loads was not constrained by their increase in execution frequency. Other speculative loads actually caused further Dcache misses for these loads. In addition, the non-speculated load instructions 158, 178 and 198 also showed an increase in Dcache misses that is attributable to other speculated loads.
Conclusions
This paper has presented experimental results for four compiler-controlled speculation models over a variety of issue rates and cache configurations. The results indicate that the more aggressive speculation models create larger instruction and data working sets. As such, processor designers need to ensure that cache configurations can tolerate the increased working set if they expect to attain the best performance from aggressive speculation models. These experiments have shown that increasing the Icache and Dcache from 4K to 64K resulted in a performance increase of approximately 26 percent for the general speculation model at issue 8. Additionally, the results indicate that 2-way set associativity beneficially reduces misses for Dcaches up to 128K. In contrast, 2-way set associativity was only beneficial for Icaches up to 8K.
While small Icaches and Dcaches can significantly limit the potential performance of more aggressive speculation models, there is still an advantage in using the more aggressive speculation models at higher issue rates even if the cache configuration is held constant. Even though some of the potential advantages of the more aggressive speculation models are negated by the higher miss rates, the loss was not sufficient to offset the performance advantages. In particular, general speculation at issue 8 was 63.6 percent faster than no speculation with the same 4K cache configuration and issue rate. Safe speculation was 50.2 percent faster and restricted speculation was 9.6 percent faster. When using a 64K cache, general speculation was 80 percent faster than no speculation. Safe speculation was 61.1 percent faster and restricted speculation was 14.3 percent faster. The improvements in performance were almost identical between the experiments that used a perfect Icache and varied the Dcache and those that used a perfect Dcache and varied the Icache. Thus, aggressive speculation affects the Icache and the Dcache in a similar fashion.
