Abstract
Introduction
Dependences between instructions limit the instruction execution rate of a typical superscalar processor to an average of only about 1.7 to 2.1 instructions per cycle (IPC) [3]. Speculative execution and multithreading are two techniques that have been introduced to extend the limits of instruction-level parallelism. Some recently proposed processors that incorporate these techniques include the multiscalar architecture [2], the trace processor (TP) [13], the superthreading architecture [17, 18], the multiprocessor-on-a-chip (MOAC) [10], and the superspeculative processor (SSP) [7]. The multiscalar and trace processor architectures advocate a wide-issue multi-threaded approach, while the MOAC incorporates multiple separate processors on a single chip. The superthreaded processor is a hybrid of superscalar and multithreaded architectures that speculates on control dependences while resolving data dependences at runtime. The TP and SSP, on the other hand, speculate on both control and data dependences, while the MOAC incorporates only data speculation [16].
To speculate beyond control and data dependences, Lipasti et al. [5, 6] introduced the concept of value locality, which is the likelihood that a previously-seen value will recur repeatedly within a storage location.
This locality is a measure of how often an instruction regenerates a value that it has produced before. Lipasti et al. discovered that the values produced by an instruction are actually very regular and predictable.
Tyson and Austin [19] further found that 29% of the load instructions in the SPECint benchmarks and 44% of the loads in the SPECfp benchmarks reload the same value as the last time the load was executed.
This value locality allows processors to predict the actual data values that will be produced by instructions before they are executed.
Several techniques have been proposed to improve value prediction accuracy. These include a history-based predictor, a stride-based predictor, a hybrid predictor [21], and a context-based predictor [12]. All of these schemes work at the level of a single instruction, and try to predict the next value that will be produced by an instruction based on the previous values already generated. Since these schemes try to cache as large a history of values as possible, they require large hardware tables on the processor die.
The scope of all these techniques can be too limited, however, and the values predicted can be wrong.
By determining actual values instead of simply predicting them, the processor could throw away redundant work and simply jump directly to the next task. For example, the dynamic instruction reuse proposed by Sodani and Sohi [14] saves the input and output register values for each instruction to allow the execution of the instruction to be skipped when the current input values match a previously cached set of values.
We observe, however, that the inputs and outputs of a chain of instructions are highly correlated. Thus, a natural coarsening of the granularity for value reuse is the basic block. A basic block can be viewed as a superinstruction that has some set of inputs and produces some set of live output values. Using the basic block as the prediction and reuse unit may save hardware compared to previous instruction-level reuse and prediction schemes in addition to reducing execution time.
In this work, we investigate the input and output value locality of basic blocks to determine their predictability and their potential for reuse [4] . In the following experiments, the basic block boundaries are determined dynamically at run-time. The upward-exposed inputs of each basic block, as well as its live outputs, are stored in a new hardware mechanism called the block history buffer. The processor uses these stored values to determine the output values a basic block will produce the next time it is executed. If the current inputs to a block are found to be the same as the last time the block was executed, all of the instructions in the block can be skipped. We call this technique block reuse in contrast to instruction reuse [14] .
In order to prevent the register outputs that are dead after a block's execution from occupying limited block history buffer resources, and to prevent dead outputs from poisoning a block's value locality, we use the compiler to mark dead register outputs, and pass this information to the hardware. Our simulation results show that block reuse can boost performance by 1% to 14% over existing 4-issue superscalar processors with reasonable hardware assumptions.
In the remainder of the paper, Section 2 defines and quantifies the concepts of input and output value locality for basic blocks. Section 3 describes the idea of block reuse, the hardware implementation of the block history buffer, and evaluates the performance potential of block reuse. Section 4 studies the impact of different compiler optimizations on basic block value locality and block reuse. Related work is described in Section 5 and Section 6 concludes the paper.
Input and Output Value Locality of Basic Blocks
Each instruction in a program belongs to a basic block, which is a sequence of instructions with a single entry and a single exit point. Instructions within a basic block are correlated in that some inputs to an instruction may be produced by previous instructions within the same block. An input that is not produced within the same block is called an upward-exposed input. The set of all upward-exposed inputs composes the input set of a basic block. This set includes both registers and memory references. When a basic block is executed a second time and the set of input values is the same as the last time the block was executed, we say that this block is demonstrating block-input value locality. Block-output value locality is defined similarly. However, some values produced inside a basic block may not be needed by the following blocks, since they may be either unused or overwritten by the following blocks in the execution path. These types of outputs are termed dead outputs, similar to the concept of a dead definition in a compiler. All outputs that are used outside a basic block are called its live outputs. The output value locality of a block refers only to its live outputs. Instructions also have input and output value locality [5, 6]. The input and output value locality of a block that has only a single instruction is the same as that instruction's value locality.
We use the terms input and output value locality in later discussions to refer to block input and output value locality.
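The definitions above can be illustrated with a small sketch. It is not the authors' implementation: the `Instr` record, the `dead` annotation field, and the register names are hypothetical, and only register operands are modeled (memory inputs and outputs are omitted).

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record used only for this illustration."""
    dests: Tuple[str, ...]           # registers written by the instruction
    srcs: Tuple[str, ...]            # registers read by the instruction
    dead: Tuple[str, ...] = ()       # assumed compiler mark: dests dead after the block


def block_io(block):
    """Return (upward-exposed register inputs, live register outputs)."""
    defined = set()   # registers already written inside the block
    dead = set()      # written registers currently marked dead
    inputs = []       # upward-exposed inputs, in first-use order
    for ins in block:
        for reg in ins.srcs:
            # A read is upward-exposed only if no earlier instruction
            # in the same block produced the register.
            if reg not in defined and reg not in inputs:
                inputs.append(reg)
        for reg in ins.dests:
            defined.add(reg)
            dead.discard(reg)        # a new write supersedes an older dead mark
        dead.update(ins.dead)
    return inputs, sorted(defined - dead)


example_block = [
    Instr(dests=("r3",), srcs=("r1", "r2"), dead=("r3",)),  # r3 is a temporary
    Instr(dests=("r4",), srcs=("r3", "r1")),                # r3 produced in-block
    Instr(dests=("r1",), srcs=("r4", "r5")),
]
example_inputs, example_outputs = block_io(example_block)
```

In this example the block reads `r1`, `r2` and `r5` from outside, while `r3` is excluded from the live outputs because it is marked dead.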
In this study, we construct basic blocks and their input and output sets dynamically at runtime as discussed in Section 3. We store up to four sets of input and output values for a block from its previous four executions. The accordingly. The value locality corresponding to depth-n inputs or outputs is called depth-n input or output value locality. All programs are compiled with the GCC compiler using the -O2 flag.
Characteristics of Basic Block Inputs and Outputs
A basic block can consist of an arbitrary number of instructions, although typical values range between 1 and 25. Table 1 shows the average number of instructions in a basic block for a collection of the SPEC benchmarks and a GNU utility program. The corresponding cumulative execution frequencies are shown in Figure 1. We see that for 5 of the 9 programs, approximately 70% of the blocks have no more than 5 instructions. For 6 out of the 9 programs, 90% of the blocks have fewer than 15 instructions. For most programs, roughly 10% of the basic blocks have only 1 instruction, and fewer than 5% have more than 20 instructions. For Ijpeg and Ear, however, about 15% to 20% of the blocks have more than 20 instructions.
Since most of the basic blocks are not very large, we expect to see relatively few inputs and outputs for each block. As shown in Figure 2, roughly 90% of the blocks have fewer than 4 upward-exposed register inputs and fewer than 4 memory inputs for all programs except Ear. We have modified the GCC compiler to mark the dead register outputs in each instruction using the SimpleScalar [1] instruction annotation tool. The hardware interprets this information to exclude the marked registers from the set of live outputs of each basic block. From this analysis, we find that the number of live register outputs in a block tends to be slightly larger than the number of inputs, as shown in Figure 2.
Table 2: Arithmetic and execution frequency weighted means of the number of inputs and outputs for basic blocks.
Ideal Value Locality of Basic Blocks
The relatively small numbers of inputs and outputs in a basic block provide the basis for the block history buffer mechanism to store this information (described in Section 3.1). If the input values of basic blocks tend to recur, a considerable amount of redundant work can be avoided by simply retrieving the stored output values when the input values match, and thereby avoiding the need to re-execute all of the instructions in the block.
How repetitive or determinable are a block's input values? To answer this question, we studied the behavior of 8 SPEC benchmark programs and 1 GNU utility using the SimpleScalar tool set [1] . We first assume unlimited hardware resources and record all of the input and output values for all basic blocks.
The basic blocks themselves are constructed on the fly at run-time with a value history of depth one to four stored for each block. This information is adequate to summarize the overall determinability. The depth-n input value locality for each block is calculated as the number of times a block finds the same input values in the depth-n input history table divided by the number of times the block is executed. Then the overall depth-n input value locality is weighted by the block's execution frequency. The overall depth-n output value locality is obtained in a similar fashion. The overall block value locality for the programs tested is shown in Figures 3 and 4.
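The depth-n calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulator used in the study: the trace format (pairs of a block identifier and its input values) and the function name are hypothetical, and the history simply keeps the input sets of the previous n executions of each block.

```python
from collections import defaultdict, deque


def input_value_locality(trace, depth=1):
    """Return (per-block locality, execution-frequency-weighted overall locality).

    trace: iterable of (block_id, input_values) in execution order (assumed format).
    """
    history = defaultdict(lambda: deque(maxlen=depth))  # last `depth` input sets
    hits = defaultdict(int)
    execs = defaultdict(int)
    for block_id, values in trace:
        values = tuple(values)
        execs[block_id] += 1
        if values in history[block_id]:
            hits[block_id] += 1
        history[block_id].append(values)   # oldest entry is dropped automatically

    per_block = {b: hits[b] / execs[b] for b in execs}
    total = sum(execs.values())
    # Weighting each block's locality by its execution count.
    overall = sum(per_block[b] * execs[b] for b in execs) / total if total else 0.0
    return per_block, overall


example_trace = [
    ("A", (1, 2)), ("A", (1, 2)), ("A", (3, 4)), ("A", (1, 2)),
    ("B", (0,)), ("B", (0,)),
]
depth1_blocks, depth1_overall = input_value_locality(example_trace, depth=1)
depth2_blocks, depth2_overall = input_value_locality(example_trace, depth=2)
```

With a depth of one, the last execution of block A misses because the intervening inputs `(3, 4)` displaced `(1, 2)`; a depth of two recovers that hit. Output value locality could be computed the same way from a trace of live output values.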
We find that the depth-1 input value locality varies from 2.21% to 41.44%, and the depth-1 output value locality ranges between 3.09% and 51.63%. Ear has the worst basic block value locality for both inputs and outputs. In this program, most of the blocks with high execution frequencies are large with many register inputs and possibly some memory inputs. Furthermore, its loops often update induction variables within frequently executed basic blocks. As a result, these blocks tend to have low input locality.
From the relatively small differences between the input and output value locality numbers, we may infer that most basic blocks produce repeated outputs only when they have repeated inputs. However, these differences may indicate an opportunity to predict the output values of a basic block for speculative execution.
Increasing the history depth tends not to produce a significant increase in the input or output value locality, except for M88Ksim, which is a processor simulator with a well-defined input domain (a fixed instruction set). For the other programs, the set of inputs to a basic block has a large domain, so that even a tiny change in any input values will cause the basic block to lose its value locality. As a result, individual basic blocks tend to exhibit either very good value locality or almost none at all. A depth-one history is sufficient to capture most of the essential value locality behavior of a block for most of the tested programs.
Consequently, if the goal is simply to identify redundant basic block executions, a history depth of one is adequate.
Determinable Locality
The unlimited resource assumption of the previous subsection will not help us to understand the potential benefits of exploiting basic block input and output value locality. We thus assume that the number of input Table 1) and consequently have the lowest miss rates in the block history buffer. For large programs, such as Perl and Go, the miss rates are 20.50% and 28.63% when the buffer is small, and 4.51% and 6.37% when the buffer is as large as 4K entries. Observe that a buffer size of 2K entries is sufficient to cover the block execution window for most programs. Even Go, which has 8969 unique basic blocks, has a miss rate of only 11.28% with the 2K buffer. Hence, we decide to use the 2048-entry configuration in all of the subsequent experiments.
Sources of Block Value Locality
Each basic block in a program is responsible for a simple task. Thus, if the compiler can do extensive analysis and optimization to eliminate the redundancy, we would expect few redundant tasks. In our experiments, all test programs are compiled with the -O2 optimization flag, which activates constant propagation and common subexpression elimination [9] . These optimizations remove most of the redundant tasks, and, in fact, we see that most basic blocks in major loop nests have relatively poor value locality.
We find that the majority of the basic blocks that have good value locality belong to one of the following cases:
1. Preparing for a function call. It has been observed that many functions are called repetitively with the same parameters [15, 11] . Since the calling convention is predetermined for a particular instruction set architecture (ISA), the basic blocks that prepare for a call tend to exhibit good value locality.
2. Function prologs. Basic blocks in the prolog portion of a function process the parameters, adjust the stack pointer, and store callee-saved registers. Since a function is very likely to be called from the same call-site repetitively, the values for the stack pointer and callee-saved registers may frequently repeat. As a result, these basic blocks tend to have good value locality.
From the above list, we can see that the basic blocks that are related to function calls are among the most likely to exhibit value locality. Consequently, a more efficient convention for function calls may be necessary to remove more redundancy from programs. Sophisticated interprocedural analysis is required to remove the redundancy related to the global variables, which is beyond the reach of current compiler technologies and is part of our future work.
The Performance Potential of Basic Block Reuse
Good input value locality for a basic block provides opportunities to improve the performance of a processor. The instruction value prediction table in a superscalar processor could be replaced with a block history buffer (BHB) that can be used for both value prediction and block reuse. Specifically, when the current input values to a basic block are identical to those stored in the BHB, the stored output values can be passed to the inputs of the next basic block to be executed, thereby allowing the processor to skip the execution of all of the instructions in the current block.
Table 3: Average run-length of input locality flow and average task redundancy for basic blocks.
The average run-length with uninterrupted input locality ranges between 1.15 and 3.65 basic blocks, but the average task redundancy (TR) varies from 1.70 to 18.33 instructions, as shown in Table 3. The average size of the basic blocks involved in a run is larger than the average size of all basic blocks shown in Table 1. Wordcount, however, is a short program that repetitively executes several switch statements, which makes it consist of many small basic blocks, as shown in Figure 1. As a result, the average size of basic blocks in the run is actually smaller than the overall average block size for Wordcount. The other programs typically have TR values of around 4-9 instructions. The average TR for a locality flow is large for floating point programs like Alvinn and Ear, although Ear exhibits little input value locality.
If the task redundancy in a program is not large enough, skipping the execution of the basic blocks cannot offset the time required to access the BHB and update the processor state. Figure 8 long to one-instruction basic blocks. Thus, the benefit of block reuse cannot be large for this program. Ear has very low input locality, and the total number of instructions that are skippable is less than 3% , which means block reuse will not be effective for Ear, either. For the other programs, skippable instructions that belong to basic blocks of 3 or more instructions comprise 5% to 28% of the total number of instructions executed. Skipping the execution of these blocks may compensate for the time required to interrogate the BHB and the data cache, and the time required to update the processor state, to thereby provide a performance benefit.
Hardware Implementation
To evaluate the potential performance benefit of block reuse, we propose one possible design. instruction in a basic block commits, the BHB is updated. Figure 9 shows the processor model we use.
Basic blocks are constructed dynamically using the following algorithm:
1. Any instruction after a branch is identified as the entry point of a new block. The first instruction of a program is the entry point of a block automatically. Note that subroutine calls and returns are treated exactly as any other type of branch instruction.
2. Executing a branch instruction marks the end of a basic block.
Reg-Out, 4 Mem-In, and 2 Mem-Out fields, the total space occupied is around 248KB, which is smaller than a typical level-2 cache in state-of-the-art processors.
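The two block-construction rules listed above can be sketched in a few lines. This is only an illustration: the trace format (pairs of a program counter and an `is_branch` flag) and the function name are assumptions, and calls and returns are simply flagged as branches, as the first rule states.

```python
def split_into_blocks(trace):
    """Partition a dynamic instruction trace into basic blocks.

    trace: iterable of (pc, is_branch) in execution order (assumed format).
    Returns a list of blocks, each a list of PCs.
    """
    blocks = []
    current = []
    for pc, is_branch in trace:
        # The first instruction of the program, or any instruction that
        # follows a branch, starts a new block because `current` is empty.
        current.append(pc)
        if is_branch:
            # Rule 2: executing a branch ends the current block.
            blocks.append(current)
            current = []
    if current:
        # Instructions after the last branch form a final, unterminated block.
        blocks.append(current)
    return blocks


example_trace = [
    (0x100, False), (0x104, False), (0x108, True),   # block ending in a branch
    (0x200, False), (0x204, True),                   # branch target starts a block
    (0x100, False),
]
example_blocks = split_into_blocks(example_trace)
```

Because blocks are delimited by executed branches only, a block identified this way is keyed by the PC of its first instruction, and the same static code may appear in more than one dynamic block if it is entered at different points.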
When an instruction is fetched, the BHB is queried. If this instruction matches an entry for a block in the BHB, the current input values to this basic block are compared with the buffered values when the instruction reaches the issue stage, i.e., when all of its operands are ready. When any entry in the Mem-In field of a basic block is valid, the data cache must be accessed. If the access produces a hit, the value from the data cache is compared with the buffered values. If the cache access is a miss, the memory contents are assumed to be different and value locality is lost. Note that during this comparison process, the processor continues its normal execution. Thus, the execution time that can be saved by block reuse needs to offset the time required for comparison to produce any speedup.
The hardware collects the input and output values of the basic blocks dynamically. When an instruction is executed, the input mask bits for all logical input registers are set, and the appropriate output mask bits are set for the block's live output registers. Note that the registers that are live at the end of the basic block have been previously marked by the compiler. The memory input and output fields are used in a first-come-first-served manner, and the full/empty-bit is set when any entry is taken. If the output mask bit is set for a register that the current instruction is trying to read, this read is not an upward-exposed input.
In this case, the input mask is left unchanged. Also, if a load instruction finds that the address it is trying to read already resides in the Mem-Out field, the load is not upward-exposed. Consequently the memory input field is left untouched.
When the BHB determines that all of the instructions in the block are redundant and can be skipped, it will perform one of the two following actions depending on the type of exception processing desired.
For precise exceptions, the instructions are issued as in normal processing. They are marked as completed when they reserve reorder buffer entries, which prevents them from consuming any functional unit resources. Note that store instructions actually access the cache when they commit.
For imprecise exceptions, the branch target stored in the Next Block field for the block is retrieved from the BHB and used as the next PC. This effectively skips the entire block of instructions.
If the input values stored in the BHB do not match those in the processor's current state, or if there is no entry for this block in the BHB, the processor core will take control and issue the instructions to the functional units for normal execution. The processor core will continue to update the BHB whenever an instruction in a block commits.
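The lookup, compare, and skip sequence described in this subsection can be summarized with a simplified software model. It is a sketch under several assumptions rather than the proposed hardware: the class and field names are hypothetical, the buffer is an unbounded dictionary instead of a fixed number of entries, only a depth-one history is kept, the data cache is modeled as a plain address-to-value map, and stored memory outputs are written immediately instead of at commit.

```python
from dataclasses import dataclass


@dataclass
class BHBEntry:
    """Hypothetical contents of one block history buffer entry."""
    reg_in: dict    # upward-exposed register inputs: name -> value
    mem_in: dict    # memory inputs: address -> value
    reg_out: dict   # live register outputs: name -> value
    mem_out: dict   # memory outputs: address -> value
    next_pc: int    # entry point of the following block


class BlockHistoryBuffer:
    def __init__(self):
        self.entries = {}   # block entry PC -> BHBEntry (depth-one history)

    def commit(self, pc, entry):
        """Record the inputs and outputs of a block that has committed."""
        self.entries[pc] = entry

    def try_reuse(self, pc, regs, dcache):
        """Return the next PC if the block can be skipped, else None."""
        entry = self.entries.get(pc)
        if entry is None:
            return None                       # no history for this block
        if any(regs.get(r) != v for r, v in entry.reg_in.items()):
            return None                       # a register input changed
        for addr, value in entry.mem_in.items():
            # A data cache miss is treated as a mismatch, as in the text.
            if addr not in dcache or dcache[addr] != value:
                return None
        regs.update(entry.reg_out)            # apply the stored live outputs
        dcache.update(entry.mem_out)
        return entry.next_pc


bhb = BlockHistoryBuffer()
bhb.commit(0x400, BHBEntry(reg_in={"r1": 5}, mem_in={0x1000: 7},
                           reg_out={"r2": 12}, mem_out={}, next_pc=0x420))
regs = {"r1": 5, "r2": 0}
dcache = {0x1000: 7}
reused_pc = bhb.try_reuse(0x400, regs, dcache)
```

When `try_reuse` returns `None`, the model corresponds to the case in which the processor core executes the block normally and later updates the buffer through `commit`.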
Compiler Support
Registers are often used to store intermediate results for all kinds of operations in the programs. However, these intermediate results are seldom used outside the basic blocks that produce them. Results that are produced within one basic block but never used in the following basic blocks are dead outputs and should be excluded from the blocks' live outputs. Although hardware could be used to distinguish the dead outputs within the scope of a few consecutive basic blocks in the instruction execution window [20] , it would be unrealistic for the hardware to identify all the outputs that are never used in the subsequent execution paths. The compiler, however, can achieve this task using data flow analysis.
The GCC compiler identifies all dead registers in its flow analysis step and saves this information in the REG_NOTE field of its RTX structure. However, this information is no longer accurate after register allocation. We added another flow analysis step right before the assembly code is generated to obtain correct REG_NOTEs. Then we modified GCC's assembly code generation step to encode dead register information in each instruction's annotation field [1]. The block history buffer can interpret this annotation field to identify the register number for each dead register output. While dead register outputs of a block are common, dead memory outputs are rare. Consequently, we chose not to mark dead memory outputs at all so that all memory outputs are considered live at the end of a basic block.
For each loop in a program, there is typically one, or at most a few, variables that take on a regular sequence of values. These variables include basic and general induction variables, for instance. For the basic blocks containing instructions to update these induction variables, some of the blocks' inputs and outputs will always be changing. Since these changes are regular, they can be captured by the hardware with the assistance of the compiler. The compiler can identify the induction variables within each basic block and pass on this information to the hardware. In turn, the hardware, such as a block history buffer, can use this information to determine the actual values of these induction variables each time the basic blocks are re-executed. Furthermore, the induction variables could be excluded when we study the value locality of basic blocks. This extended study, however, is beyond the scope of this paper and is part of our future work.
Indirect Memory Referencing
In load-store architectures, memory addresses change only when the corresponding input registers used to calculate the addresses also change. Therefore, if the register inputs to a basic block differ, then the memory addresses calculated from these registers will also differ. Furthermore, recall that the BHB checks the contents of the data cache as well as the addresses being referenced. Consequently, even if the user program uses multiple levels of pointers, the BHB still detects the repetition of block inputs correctly.
Simulation Methodology
The block history buffer (BHB) can be implemented in various formats. Since our purpose is to illustrate the potential of a novel mechanism, we restrict our attention to evaluating only the proposed design, instead of comparing different design options. We use execution-driven simulations to investigate the performance potential that could be obtained by using the BHB to skip around the execution of all of the instructions in a basic block with repeating inputs. We modified the SimpleScalar Tool Set [1] for all of our experiments.
The SimpleScalar processor has an extended MIPS-like instruction set architecture with modified versions of the GCC compiler (version 2.6.2), the gas assembler, and the gld loader. 
Performance Results
To obtain a coarse upper bound on the performance benefit of the block history buffer mechanism, the simulations assumed that it takes one cycle to query the BHB plus another cycle to update the registers and data cache. Also, each entry in the BHB can store any number of input and output values, but it is (Figure 8 and Table 3 ) which together produce speedup values for these programs between 1.15 and 1.37.
We next test the sensitivity of these speedup results to the number of fields available in each entry of the BHB. We choose to evaluate five cases based on the cumulative distributions of the number of block inputs and live outputs. For example, Figure 2 showed that 75% of all basic blocks have fewer than four register inputs, four live register outputs, three memory inputs, and two memory outputs. Thus, this configuration is used for the 75-percentile case. Similarly, 90% of all basic blocks have fewer than four register inputs, five live register outputs, four memory inputs and two memory outputs, and so forth. Table 4 shows the hardware configurations tested with the corresponding speedups shown in Figure 11 . Note that, since the hardware configurations for the 75, 80 and 85-percentile cases are the same, only three cases are actually compared here. We see that the performance improves gradually for all of the programs as the number of input and output values that can be stored increases. For the 95-percentile configuration, the speedup values are between 1.01 and 1.16 with a typical value of 1.10, which is close to the unlimited case.
Since each entry in the BHB records more than one register number and value pair, the time required to check the BHB and update the processor state may be longer than the 2 cycles assumed above. Figure 12 shows the speedup obtained when varying the total time in cycles required to access the BHB and data cache, and to update the processor state. Here each entry in the BHB can hold 4 register inputs, 5 register outputs, 4 memory inputs, and 2 memory outputs, which corresponds to the 90-percentile case. We can see that the performance potential of the BHB is not overly sensitive to the time required to interrogate the BHB and the data cache when a block is entered and to then update the processor state. For example, even if the delay takes 5 cycles, the speedup of block-reuse is still about 1.03 to 1.09, with a typical value of 1.06. This robust performance potential occurs because of the relatively large amount of time saved when a block's execution is skipped compared to the BHB overhead delays.
Table 4: Hardware settings to cover different basic block input and output requirements (in number of entries).
Impact of Compiler Optimizations
The programs used in the previous experiments were all compiled with GCC's -O2 optimization level.
This optimization level does not include loop-unrolling or function inlining, which could possibly change the size of the basic blocks, as well as their input and output value localities. Loop-unrolling is a compiler optimization that makes the loop body larger by merging a few iterations of the original loop into a single iteration. Thus, fewer total iterations are executed, but each iteration is larger. The function-inlining optimization reduces the number of function calls by replacing the call with a copy of the function body. This change eliminates the call and return instructions, as well as the register-saving and restoring overhead.
Effects on Basic Block Value Locality
Figures 13 to 19 show the changes that occur when applying each optimization individually, and together.
When inlining is used, there are no significant changes in the basic block sizes for most of the programs.
Similarly, there is little change in the corresponding sizes of the input and output sets, with the exception of Ijpeg and Alvinn. For Ijpeg, the weighted average size of the basic blocks reduces considerably while the number of register inputs per block increases by 30%. The number of memory inputs, live register outputs, and memory outputs drops by as much as 15%. For Alvinn, the number of live register outputs drops by 40% while the other metrics remain approximately the same. These results indicate that, if function-inlining is adopted for a loop-enclosed function call, the average number of upward-exposed inputs and live outputs for some of the most heavily executed blocks could be reduced since more data dependences are now resolved within each basic block. However, even with all these differences in the basic characteristics, inlining has little effect on the blocks' input and output localities, as shown in Figures 18 and 19.
With loop-unrolling, the average basic block size, and the number of register inputs, live register outputs, memory inputs and memory outputs increase significantly for most programs. Since loop-unrolling merges a few iterations of the original loop into one iteration, the total number of upward-exposed inputs and live outputs could be reduced compared to the sum of the number of inputs and outputs of the iterations before merging. In fact, this effect is prominent for Alvinn and Ear, where the average number of register inputs and live register outputs dropped by a large margin. The input and output value locality also improve considerably for Alvinn, Compress, Ear, M88Ksim, and Wordcount. The value locality does not change significantly for the remainder of the programs, though.
When both optimizations are performed simultaneously, the changes brought about by unrolling appear to overshadow the changes from inlining. Since these two optimizations are based on heuristics, they do not always produce the same effect for different programs. However, a coarse generalization is that loop-unrolling tends to increase the average number of memory outputs, the average size, and the input and output value locality of basic blocks. We can see that function-inlining has almost no effect on the program execution time for At a BHB-delay of 2 cycles, the block-reuse scheme is able to improve the performance of the programs produced when using either of the compiler optimizations or their combination. The speedup produced by block-reuse is very significant for Alvinn, Go, Ijpeg, Li, M88Ksim, and Perl. Although function-inlining and loop-unrolling cannot speed up M88Ksim and Go without the BHB, most likely due to the increased difficulty of instruction scheduling for the larger basic blocks, the increased opportunity for block-reuse more than compensates for this performance loss when both optimizations are applied with the BHB.
Effects on the Performance of Block Reuse
When the BHB delay is increased from 2 to 5 cycles in Figure 21, the speedups produced by block reuse decrease slightly. However, the basic patterns in Figure 20 are still preserved. All of the programs except Ear show some performance improvement. Block reuse cannot reduce the execution time to below the O2-only case for M88Ksim in this situation, though.
While the input value locality improves by 12% to 28% for Alvinn, Compress, M88Ksim and Wordcount when loop-unrolling is applied, the addition of block reuse shows only 4% to 6% performance improvement for Alvinn, Compress, and Wordcount. This indicates that the improved value locality produced by loop-unrolling mostly occurs in small basic blocks for these programs. For M88Ksim, the 18% improvement in input value locality due to loop-unrolling leads to a 20% reduction in its execution time, however.
We conclude that the performance impacts of loop-unrolling and function-inlining are program-dependent and, due to their heuristic nature, are difficult to predict.
Related Work
Several techniques have been proposed to dynamically reuse values produced by instructions or to issue and execute portions of programs at a granularity coarser than a single instruction. These include dynamic instruction reuse [14] , the block-structured instruction set architecture [8] , the trace processor [13] , and memory renaming [19] .
Dynamic instruction reuse [14] stores the input operands and the output result of each instruction to eliminate the need to re-execute an instruction when its operands are the same as the last time the instruction was executed. This approach was introduced to make use of the squashed speculative execution of instructions due to branch mispredictions. Three reuse schemes were evaluated: 1) reuse based on operand values, 2) reuse based on operands' register numbers, and 3) reuse based on the register numbers and dependence chains. Reuse based on operand values was shown to be the most successful scheme.
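To make the operand-value scheme concrete, the following is a minimal sketch of a reuse table indexed by instruction address. It is an illustration only, not the design evaluated in [14]: the class name `ReuseBuffer`, the unbounded dictionary, and the single saved operand set per instruction are all assumptions made for brevity.

```python
from typing import Callable, Dict, Tuple


class ReuseBuffer:
    """Hypothetical operand-value reuse table, indexed by instruction address."""

    def __init__(self) -> None:
        # pc -> (operand values seen last time, result produced last time)
        self.entries: Dict[int, Tuple[Tuple[int, ...], int]] = {}
        self.hits = 0
        self.misses = 0

    def execute(self, pc: int, operands: Tuple[int, ...],
                compute: Callable[..., int]) -> int:
        """Return the saved result if the operands repeat, else execute."""
        entry = self.entries.get(pc)
        if entry is not None and entry[0] == operands:
            self.hits += 1          # same operands as last time: reuse
            return entry[1]
        self.misses += 1            # first execution or operands changed
        result = compute(*operands)
        self.entries[pc] = (operands, result)
        return result
```

In this sketch, an instruction is re-executed only when it has no entry or when its operand values differ from the ones recorded at its previous execution.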
Our block reuse mechanism is essentially an extension of the instruction reuse approach to the basic block level, but our study did not save any values that were produced by speculation. Only committed instructions can update the BHB. Hence, our approach is conservative and could be enhanced with greater use of speculation.
Melvin and Patt [8] introduced a new instruction set architecture (ISA) based on basic blocks called the block-structured ISA. This ISA relies on the compiler to identify and merge basic blocks. All inter-block and intra-block data dependences are determined by the compiler and marked as such in the instructions.
All instructions in a basic block are packed together to issue as a single unit. Only results that are live upon exit of a basic block update the general purpose registers. Register and memory writes are always delayed until the basic block is retired. Register reads always take the values currently in the register file upon entry to the block. This ISA supports speculative execution within basic blocks as well as across blocks. However, this ISA did not incorporate any block-level reuse as described in this paper.
The trace processor [13] uses a trace cache to store the instructions in a trace. A trace does not necessarily end at a basic block boundary. The trace processor also uses hardware to detect dependences [20] dynamically and to identify both registers that are used locally and those used across traces. The scope of this register identification is limited to the traces that have been dispatched to the trace processing units and are awaiting execution. Hence, the detection logic does not have to consider register outputs that are used after the currently executing traces. The technique evaluated in this study, on the other hand, always uses the basic block as the fundamental unit for value prediction and reuse. The trace processor still executes every instruction in the program, while our block reuse approach skips all blocks whose execution is determined to be redundant.
Tyson and Austin [19] proposed dynamic memory renaming to better resolve the problem of issuing load instructions ahead of stores before the address calculation of a store is completed. Their approach uses a value file to record the values fetched and written by each load and store instruction respectively.
This mechanism also identifies the relationship between each load/store pair via a load/store cache. When a load or a store instruction is executed, it reserves an entry in the load/store cache. A load queries the load/store cache for the store instruction that produces the result it will fetch. If the corresponding store instruction has completed execution, the value stored in the value file will be forwarded to the corresponding load. If the store instruction is still being executed, a functional unit ID will be returned. Since the predicted relationship between load and store pairs can be wrong, all load and store instructions still need to be executed to verify the prediction. This scheme operates only on memory operations and functions at the instruction level. Our block reuse mechanism addresses both register and memory operations at the basic block level. Again, once the block history buffer identifies a redundant execution of a basic block, all instructions in the block are skipped. No verification is necessary.
Conclusion
Speculation and reuse have been shown to be successful in improving processor performance, while value prediction has been shown to make these two approaches even more successful. Current prediction and reuse approaches use the instruction as the base unit. In this paper, we have extended these ideas to the granularity of the basic block and found that they are still applicable.
Our experiments using a subset of the SPEC benchmarks and the SimpleScalar Tool Set show that basic blocks have varying degrees of predictable input and output values. The depth-1 input value locality of basic blocks ranges between 2.21% and 41.44%, while the depth-1 output value locality varies from 3.15% to 51.63%. We also find that 90% of the basic blocks have fewer than 4 register inputs, 5 live register outputs, 4 memory inputs and 2 memory outputs, while 4 register inputs, 4 live register outputs, 3 memory inputs and 2 memory outputs are sufficient to cover the requirements of 85% of the basic blocks.
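As an illustration of how a depth-1 locality figure of this kind could be computed from an execution trace, the sketch below counts the fraction of dynamic block executions whose values match those seen at the previous execution of the same block. The function name `depth1_locality`, the trace format, and this exact definition are assumptions for illustration and may differ from the measurement used in the experiments.

```python
from typing import Dict, Hashable, Iterable, Tuple


def depth1_locality(trace: Iterable[Tuple[int, Hashable]]) -> float:
    """Fraction of block executions that repeat the block's previous values.

    `trace` yields (block_address, values) pairs in execution order, where
    `values` is a hashable snapshot such as a tuple of input values.
    """
    last: Dict[int, Hashable] = {}   # block address -> most recent values
    repeats = 0
    total = 0
    for block_addr, values in trace:
        total += 1
        if block_addr in last and last[block_addr] == values:
            repeats += 1
        last[block_addr] = values    # depth 1: keep only the latest values
    return repeats / total if total else 0.0
```

The same routine applies to inputs or outputs, depending on which values are placed in the trace.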
The relatively high input and output value locality of basic blocks, as well as their limited numbers of inputs and outputs, provides the basis for our approach of applying reuse techniques at the basic block level. We proposed a hardware mechanism called the block history buffer to record the input and output values of basic blocks and thereby identify blocks with repetitive inputs. The execution of the instructions within basic blocks with repetitive upward-exposed inputs is redundant and can be skipped. We call this scheme block reuse. Simulation results showed that a 2048-entry block history buffer with enough input and output fields to cover the requirements of 90% of the basic blocks produced miss rates below 7%.
Block reuse with this block history buffer can improve performance for the tested programs from 1% to 14% with an overall average improvement of 9% when using reasonable hardware assumptions.
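The behavior of the block history buffer summarized above can be sketched in software as follows. This is a simplified model under stated assumptions, not the hardware organization studied in this paper: the direct-mapped indexing, the single saved input set per entry, the dictionary representation of register and memory values, and all names (`BlockHistoryBuffer`, `lookup`, `commit`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Location name (register or memory address) -> value. Assumed encoding.
Values = Dict[str, int]


@dataclass
class BHBEntry:
    tag: int          # starting address of the basic block
    inputs: Values    # upward-exposed inputs recorded at the last commit
    outputs: Values   # live outputs produced with those inputs


class BlockHistoryBuffer:
    def __init__(self, num_entries: int = 2048) -> None:
        self.num_entries = num_entries
        self.table: List[Optional[BHBEntry]] = [None] * num_entries

    def _index(self, block_addr: int) -> int:
        # Assumed direct-mapped indexing on the block address.
        return block_addr % self.num_entries

    def lookup(self, block_addr: int, state: Values) -> Optional[Values]:
        """Return saved outputs if the block's inputs repeat, else None."""
        entry = self.table[self._index(block_addr)]
        if entry is None or entry.tag != block_addr:
            return None                      # miss: block not recorded
        if all(state.get(loc) == val for loc, val in entry.inputs.items()):
            return entry.outputs             # hit: execution can be skipped
        return None                          # inputs changed: must execute

    def commit(self, block_addr: int, inputs: Values, outputs: Values) -> None:
        """Record a block's values once its instructions have committed."""
        self.table[self._index(block_addr)] = BHBEntry(
            block_addr, dict(inputs), dict(outputs))
```

On a hit, the saved outputs would be written to the register file and memory in place of executing the block; on a miss, the block executes normally and `commit` records its values.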
We conclude that exploiting basic block input value locality by skipping over redundant computation in a basic block has the potential to produce moderate performance improvements in the types of programs tested in this study. Hardware implementation details are needed to determine if the actual cost of a mechanism such as our proposed block history buffer is commensurate with the performance realized.
