The fact that instructions in programs often produce repetitive results has motivated researchers to explore various techniques, such as value prediction and value reuse, to exploit this behavior. Value prediction improves the available Instruction-Level Parallelism (ILP) in superscalar processors by allowing dependent instructions to be executed speculatively after predicting the values of their input operands. Value reuse, on the other hand, tries to eliminate redundant computation by storing the previously produced results of instructions and skipping the execution of redundant instructions. Previous value reuse mechanisms use a single instruction or a naturally formed instruction group, such as a basic block, a trace, or a function, as the reuse unit. These naturally-formed instruction groups are readily identifiable by the hardware at run-time without compiler assistance. However, the performance potential of a value reuse mechanism depends on its reuse detection time, the number of reuse opportunities, and the amount of work saved by skipping each reuse unit. Since larger instruction groups typically have fewer reuse opportunities than smaller groups, but they provide greater benefit for each reuse-detection process, it is very important to find the balance point that provides the largest overall performance gain. In this paper, we propose a new mechanism called sub-block reuse. Sub-blocks are created by slicing basic blocks either dynamically or with compiler guidance. The dynamic approaches use the number of instructions, numbers of inputs and outputs, or the presence of store instructions to determine the sub-block boundaries. The compiler-assisted approach slices basic blocks using data-flow considerations to balance the reuse granularity and the number of reuse opportunities. The results show that sub-blocks, which can produce up to 36% speedup if reused properly, are better candidates for reuse units than basic blocks. 
Although sub-block reuse with compiler assistance has a substantial and consistent potential to improve the performance of superscalar processors, this scheme is not always the best performer. Sub-blocks restricted to two consecutive instructions demonstrate surprisingly good performance potential as well. 
Introduction
The resources available in superscalar processors, such as the instruction issue bandwidth and the function units, are often substantially underutilized due to limits in the amount of instruction-level parallelism (ILP) that is available in application programs [19] , and due to the limited size of the instruction lookahead window. At the same time, however, researchers have found that a substantial amount of value locality, which refers to the recurrence of values in any storage location inside a processor or its memory [20] , exists in application programs. The existence of this value locality suggests that, even though resources are being underutilized, processors may be performing a significant amount of redundant work [9, 15, 20, 28, 31] .
To exploit this program behavior for better processor performance, numerous value reuse and value prediction mechanisms have been proposed [3, 8, 15, 20, 21, 25, 27, 29, 33, 34] . Value prediction attempts to improve instruction-level parallelism by trying to issue instructions speculatively before the actual operands become available. Value reuse, on the other hand, attempts to avoid redundant execution by buffering instructions' previous input and output values. When the processor detects that a reuse unit, e.g., an instruction or a group of instructions, is about to be re-executed using these previously saved input values, their execution can be skipped entirely.
In order to actually skip the execution of instructions, the processor needs to make sure that the new input values of the reuse unit are exactly the same as the previously saved values. This match of values requires that: 1) the new input values of the reuse unit are up-to-date; and 2) the new input values are the same as the previously saved values. The first requirement determines the earliest time at which the old and new input values can be compared. A match of the values then allows all of the instructions in the reuse unit to be skipped. The reuse unit must be large enough so that the execution time saved by reuse can offset this reuse detection latency. Examples of reuse units include a single instruction [29, 31], a basic block [14, 15, 16], a trace [10, 26], and a function [22, 25].
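The two requirements above can be illustrated with a minimal software sketch. The `ReuseEntry` layout, the `try_reuse` function, and the `ready` set are hypothetical names introduced only for illustration; they do not describe the hardware organization of any of the cited mechanisms.

```python
from dataclasses import dataclass


@dataclass
class ReuseEntry:
    """Saved inputs and outputs of one reuse unit (hypothetical layout)."""
    inputs: dict   # location name -> value seen at the previous execution
    outputs: dict  # location name -> value produced at the previous execution


def try_reuse(entry, state, ready):
    """Return True and apply the saved outputs if the unit can be skipped.

    state: current values of registers or memory locations.
    ready: set of locations whose current values are up to date.
    """
    # Requirement 1: every input must be up to date before it is compared.
    if any(loc not in ready for loc in entry.inputs):
        return False
    # Requirement 2: every input must equal the previously saved value.
    if any(state.get(loc) != val for loc, val in entry.inputs.items()):
        return False
    # Both requirements hold: skip execution and write the saved results.
    state.update(entry.outputs)
    return True
```

In this sketch, a unit whose inputs are not yet ready is simply not reused; a real processor could instead wait, which is the source of the reuse detection latency discussed above.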
Reuse at the single-instruction level keeps the inputs and results of previously executed instructions in a hardware buffer and tries to skip the execution of instructions that are subsequently re-executed with the same inputs. In a pipelined superscalar processor, the execution of an instruction could take as little as one cycle, which limits the effectiveness of instruction-reuse for these low-latency instructions. Reuse at very large granularities, such as a trace [10] , introduces the problem of insufficient opportunities to reuse results, since more instructions typically require more input values to repeat in order for these instructions to be skipped. Hence, a reuse granularity between an instruction and a trace could be very attractive.
A natural choice is the basic block, which has been shown to provide substantial speedups [14] when used as the reuse unit. However, there are several limitations that arise when using basic blocks as the reuse unit. First, basic blocks potentially could be very large so that they may not fit in the fixed hardware table that stores the input and output values. Second, the boundaries of basic blocks are determined by branch instructions, which means that the basic block reuse scheme [15] essentially slices the instruction sequence using branch instructions. This slicing scheme may not be the optimal choice. On the other hand, since single instructions encounter repeated inputs more frequently than basic blocks [15, 20], there must be a better scheme to slice the instruction sequence so that the reuse opportunities and the reuse granularity are optimally balanced to generate the best overall performance improvement.
This paper extends the previous work on basic block value reuse to sub-block reuse. The performance trade-offs inherent in using reuse units ranging in size from a single instruction to an entire basic block are evaluated. A compiler-based basic-block partitioning algorithm is proposed in an attempt to construct reuse units that can consistently balance reuse opportunities with the performance improvements obtained by skipping the execution of a group of instructions.
In the remainder of the paper, Section 2 describes the related work. Section 3 introduces the proposed sub-block reuse mechanism. Section 4 describes the simulation framework, while Section 5 compares the actual performance obtained when using the various sub-block reuse units in a superscalar processor. Future work is summarized in Section 6 and Section 7 concludes the paper.
Background and Related Work
The concept of value locality was defined by Lipasti and Shen [20] as the likelihood of value recurrence in a particular storage area in a processor. A direct implication of value locality is that some instructions produce repetitive results in different execution instances. Lipasti and Shen [20] showed that 22% to 75% of the instructions in selected SPEC92 and SPEC95 benchmark programs produce repetitive results. For the SPEC95 benchmark programs, Sazeides and Smith [28] found that:
More than 50% of the static instructions generate only one value throughout a program's execution.
More than 90% of the static instructions generate fewer than 64 values.
More than 50% of the dynamic instructions generate fewer than 64 values.
Gonzalez et al [11] and Gabbay et al [8] studied the ideal performance of value prediction with unlimited value buffers and 100% prediction accuracy. Their results indicate that a considerable amount of execution time potentially could be saved if a processor can fully utilize value locality.
Value Prediction Mechanisms
Value locality can be exploited to predict the results produced by instructions to break dependence chains speculatively [20] . This prediction will allow the processor to issue the dependent instructions earlier than it can without value prediction. The last-value predictor and the stride-based predictor produced about 15% speedup using a PowerPC processor model [20] . Burtscher and Zorn [2] found that storing four previous values for each load instruction in a 21-kilobyte load value predictor could generate 12.5% average speedups. Rychlik et al [27] proposed a detailed implementation of a value-speculative PowerPC 604 processor. They also proposed the dynamic classification scheme to dynamically select the most appropriate value predictor for each instruction to improve prediction accuracy. Nakra et al [24] proposed the context-based value predictor to associate each prediction with the actual program execution path and value history. However, their results did not show a significant improvement over other value prediction schemes.
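As a rough illustration of the two simplest predictors mentioned above, the following sketch models a last-value predictor and a stride-based predictor in software. The class names and the unbounded, untagged tables are assumptions made for brevity; the sketch omits confidence counters and finite table sizes and does not reproduce the configurations evaluated in the cited studies.

```python
class LastValuePredictor:
    """Predicts that an instruction produces the same value as last time."""

    def __init__(self):
        self.table = {}  # instruction address -> last value produced

    def predict(self, pc):
        return self.table.get(pc)  # None when there is no history

    def update(self, pc, value):
        self.table[pc] = value


class StridePredictor:
    """Predicts the last value plus the most recently observed stride."""

    def __init__(self):
        self.table = {}  # instruction address -> (last value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, value):
        if pc in self.table:
            last, _ = self.table[pc]
            self.table[pc] = (value, value - last)
        else:
            self.table[pc] = (value, 0)  # first observation: assume stride 0
```

With a zero stride, the stride predictor behaves like the last-value predictor, which is why the two are often discussed together.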
Wang and Franklin [34] proposed a two-level predictor in which up to n previous values are stored for each instruction, together with a confidence counter for each. They also combined this two-level predictor with the stride-based predictor to form a hybrid predictor. Choi et al [17] use the term output value locality for the repetition behavior that causes instructions to produce the same value, and the term operand value locality for the repetition of input operand values. Their Speculative Result Cache (SRC) is indexed and tagged by the available values of the operand registers to exploit the operand value locality. However, this scheme is very sensitive to the misprediction penalty due to its potentially low prediction accuracy.
Value Reuse Techniques
In addition to predicting values, several techniques have been proposed to dynamically reuse values produced by instructions [4, 7, 12, 10, 15, 22, 25, 29]. Harbison [12] proposed a value cache for stack machines to reuse results. Richardson [25] also proposed the result cache to skip the execution of long-latency instructions, such as floating point division. Citron et al [4] extended the result cache idea to multimedia applications. Dynamic instruction reuse [29] applies the reuse idea to integer programs.
All of these schemes have one aspect in common, i.e., they store the input operands and the output result of each instruction to eliminate the need for re-executing an instruction when its operands are the same as the last time the instruction was executed. However, the dynamic instruction reuse approach was introduced to make use of the speculative execution of instructions that are squashed due to branch mispredictions.
Sodani and Sohi [30] analyzed the differences between instruction-level value prediction and value reuse. Their results showed that 84-97% of redundancy in programs could be reused and that the performance obtained by value prediction is sensitive to the way branches with value-speculative operands are handled. Although instruction reuse typically captures less redundancy than instruction-level prediction, it may perform equally well because it validates results early, it is non-speculative, and it reduces the branch misprediction penalty.
Basic-block reuse [14, 15] was proposed to extend the value reuse concept to the granularity of a basic block. Speedups of up to 15% for a collection of SPEC92 and SPEC95 benchmark programs were produced. Context-based basic block reuse produced only a 1% to 7% additional improvement with a considerable amount of additional hardware investment [13] .
Gonzalez et al [10] extended value reuse to the trace level by grouping instructions with repetitive inputs to form dynamic traces. Speedups of up to 80% with ideal assumptions were obtained for the SPECint95 benchmark programs. However, this scheme appears to be infeasible to implement in hardware. The Dynamic Trace Memorization (DTM) approach proposed by da Costa et al [7] uses a hardware Memo Table to
In 1968, Michie [22] proposed to decompose functions into two components. The "rule" component would consist of a function's procedural computation, while the "rote" component could be determined directly using a lookup table. A function then could be evaluated either by rule or by rote, or by using a blend of the two. The rule versus rote decision would be handled transparently to the program. To increase performance as the program continued to run, each evaluation by rule would add an entry to the rote table.
This idea was later extended to function-level reuse by Richardson [25] . However, this simple function reuse mechanism, which distinguishes the rote part of a function only by a function's name and parameter values, cannot be effectively applied to programs written in languages that support global variables and pointers.
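The rule-versus-rote idea, and the limitation with global state noted above, can be sketched in a few lines. The `memo` decorator, the `config` dictionary, and the `scaled` function are hypothetical names used only for illustration; this is a software analogy, not the mechanism of [22] or [25].

```python
def memo(rule):
    """Wrap a function so it is evaluated by rote when possible."""
    rote = {}  # parameter values -> previously computed result

    def evaluate(*args):
        if args in rote:
            return rote[args]   # evaluation by rote (table lookup)
        result = rule(*args)    # evaluation by rule (actual computation)
        rote[args] = result     # each evaluation by rule extends the table
        return result

    evaluate.rote = rote
    return evaluate


config = {"scale": 2}  # stands in for a global variable


@memo
def scaled(x):
    # The result depends on global state that is not part of the lookup key.
    return config["scale"] * x
```

If `scaled(3)` is evaluated once and `config["scale"]` is then changed, a later `scaled(3)` still returns the old result, because the rote table is keyed only by the function and its parameter values. This is the kind of hidden input that makes simple function reuse unsafe in the presence of global variables and pointers.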
Molina et al [23] proposed a mechanism to exploit operand value locality by dynamically removing redundant execution in the hardware. In this mechanism, value buffers are tagged with operands and operation types, but indexed by the program counter of the instruction. Different entries in the value buffer are also linked based on data dependence information to provide reuse-chaining [23] . Connors and Hwu [6] proposed to rely completely on the compiler to identify reusable regions. Input and output values of these code regions are then saved in the hardware for value reuse. In contrast, the sub-block reuse scheme presented in the next section relies on the compiler only for exposing more reuse opportunities to the hardware.
Sub-Block Reuse
A basic block can be viewed as a superinstruction that has some set of upward-exposed inputs and produces some set of live output values [14, 15] . This superinstruction can be reused like a single instruction with the assistance of special hardware, as explained in the following.
Basic Block Reuse
In order for a basic block to be reused, a hardware mechanism that can store the set of input and output values is required. This hardware mechanism should be capable of identifying the upward-exposed input and live output registers and memory locations. In addition, it should be capable of sustaining the dynamic nature of computer programs. Thus, this mechanism should be able to: 1) capture the basic blocks that are inherent at the source code level; 2) minimize the set of input and output values that must be stored; and 3) detect the cases where basic block reuse would change the program semantics.
The Block History Buffer (BHB) proposed in [14, 15] meets all of these requirements. As shown in Figure 1, the Reg-In and Reg-Out fields in the BHB contain a mask that identifies the input and output registers that are needed by the corresponding basic block (the location of the bit indicates the register number). The mask bits are set as the basic block is executed for the first time such that the bit that corresponds to an input register is set in the Reg-In field only if this register is a source register in the instruction and the bit in the Reg-Out field corresponding to this register is not set. In a load-store architecture, memory addresses are calculated directly using register values. Thus, the memory address will change only if the register values used for address calculation change. Hence, the Mem-In and Mem-
For a 32-bit architecture, each entry in the Reg-In and Reg-Out fields requires 4 bytes. The mask will also be 4 bytes since there are only 32 architectural registers. Each entry in the Mem-In and Mem-Out fields requires 12 bytes (4 bytes each for PC, address, and data). In [14, 15], the proposed size of the BHB is 4 entries for Reg-In, 5 entries for Reg-Out, 4 entries for Mem-In, and 2 entries for Mem-Out. Together with a 4-byte tag and a 2-byte next-block field, one BHB entry requires 112 bytes.
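The rule for setting the Reg-In and Reg-Out mask bits can be sketched as follows. The function name `build_masks` and the `(dest, srcs)` instruction encoding are assumptions made for illustration. The sketch also marks every written register in Reg-Out and does not filter out dead values, so it is a simplification of the mechanism described in [14, 15].

```python
def build_masks(block):
    """Compute Reg-In and Reg-Out bit masks for one pass over a block.

    block: list of (dest, srcs) pairs, where dest is a register number
    (or None) and srcs is a tuple of source register numbers.
    Bit i of a mask corresponds to register i.
    """
    reg_in = 0
    reg_out = 0
    for dest, srcs in block:
        for r in srcs:
            # A source register is an input of the block only if no earlier
            # instruction in the block has already written it.
            if not (reg_out >> r) & 1:
                reg_in |= 1 << r
        if dest is not None:
            reg_out |= 1 << dest
    return reg_in, reg_out


# Example with made-up register numbers:
#   r3 = r1 op r2 ; r4 = r3 op r1 ; r3 = r4 op r5
example = [(3, (1, 2)), (4, (3, 1)), (3, (4, 5))]
masks = build_masks(example)
```

For the example, the inputs are r1, r2, and r5, while r3 and r4 are produced inside the block and therefore never appear in the Reg-In mask.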
Sub-Block Reuse
Since basic blocks come in many different sizes, and have differing numbers of inputs and outputs, a fixed hardware mechanism, such as the BHB, cannot be guaranteed to have the appropriate size for all of the different basic blocks that may exist in a program. When using basic blocks as reuse units, the BHB must discard any basic block that requires a wider BHB entry than is available in the hardware. In [14, 15] , up to 25% of the basic blocks were excluded from being reuse candidates simply because their sizes did not fit into an already very wide BHB.
On the other hand, using basic blocks as the reuse unit is equivalent to using branch instructions to slice the static instruction sequence into the reuse units. This slicing approach ignores the data-flow information inherent to all computer programs. Since value reuse is a concept closely related to data flow, it should be natural that we also take into account the data-flow information in the slicing process.
To avoid the potential exponential growth of instruction combinations, the branch instructions should still be used to perform a preliminary slicing of the instruction sequence, which means further slicing of instruction sequences will be based on basic blocks. We call the instruction sequences resulting from basic block slicing sub-blocks. In addition to their potential data-flow-aware advantages, these sub-blocks have manageable and controllable sizes with only a limited number of inputs and outputs. Hence, a sub-block should be a more feasible reuse unit than an entire basic block.
Slicing Basic Blocks Using Data Flow Information
Appropriately slicing basic blocks is not entirely straightforward since it may have both positive and negative impacts on the overall reuse opportunity and potential performance gain. In general, splitting a basic block into smaller sub-blocks requires fewer instructions to have simultaneously repeating input values, which typically translates to more reuse opportunities. However, these smaller blocks also mean skipping fewer instructions for each reusable block, which makes it more difficult to amortize the cost of reuse detection. Our goal is to find the balance point that achieves maximum performance for value reuse.
Potential Effects of Basic-Block Slicing
If not done carefully, slicing basic blocks may reduce the total amount of available value locality. Some intrablock dependences could be converted to interblock dependences due to slicing, for instance, which reduces the chances for value reuse. Furthermore, some dead outputs are converted to live outputs since the last use of the value may be delayed to the next block. This change has a negative impact on value reuse since it now requires more values to repeat in order for the same sequence of instructions to exhibit value locality.
For example, splitting the basic block between instructions 3 and 4 in Figure 2 makes the inputs to instruction 4 upward-exposed in the second sub-block after splitting, although these inputs were intrablock dependences before splitting. In addition, the result produced by instruction 3, which is dead at the end of the basic block without slicing, becomes a live output for the first sub-block since the consumer of this result, instruction 5, is in a different sub-block after slicing. Furthermore, when multiple instructions in a basic block depend on a single upward-exposed input, careless slicing could produce two sub-blocks with the same upward-exposed input stored twice, which again could mean fewer chances for block reuse. An example of this effect is shown in Figure 3. Before splitting, the upward-exposed input is stored in one BHB entry and used by both instructions 3 and 5. After splitting, however, the same value
On the other hand, slicing basic blocks into sub-blocks can have positive effects. For instance, the block shown in Figure 4 has two essentially separate dependence graphs for two independent tasks. The computation for one task may have good input and output value locality, but the computation for the other task may not. Combining the computation for both tasks into one basic block reduces the possibility for block reuse. In this case, splitting the basic block at the right spot may actually expose more value reuse opportunities. Splitting the basic block between instructions 3 and 4 in Figure 4 does not create any more upward-exposed inputs or live-outputs with the two different dependence graphs now in different sub-blocks. If either of the sub-blocks has better value locality than the two combined, we will have more opportunities for value reuse.
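The positive and negative effects of a split can be made concrete with a small sketch that, for a given split position, reports which intra-block values become exposed between the sub-blocks and which block inputs would be saved twice. The function names, the `(dest, srcs)` encoding, and the register numbers in the example are all hypothetical; the example block is not the code of Figures 2 through 4.

```python
def upward_exposed(block):
    """Registers read in the block before any instruction in it writes them."""
    written, exposed = set(), set()
    for dest, srcs in block:
        exposed |= {r for r in srcs if r not in written}
        if dest is not None:
            written.add(dest)
    return exposed


def split_effects(block, k):
    """Describe the effect of splitting block before its k-th instruction."""
    first, second = block[:k], block[k:]
    written_first = {d for d, _ in first if d is not None}
    in_first = upward_exposed(first)
    in_second = upward_exposed(second)
    # Values produced in the first sub-block and consumed in the second:
    # former intra-block dependences that are now exposed.
    crossing = in_second & written_first
    # Original block inputs needed by both sub-blocks: saved twice.
    duplicated = (in_second - written_first) & in_first
    return crossing, duplicated


# Two independent dependence chains (made-up registers):
block = [
    (10, (1, 2)), (11, (10, 3)), (12, (11, 1)),  # first chain
    (20, (4, 5)), (21, (20, 6)),                 # second chain
]
good = split_effects(block, 3)  # split between the two chains
bad = split_effects(block, 2)   # split inside the first chain
```

Splitting between the two chains exposes nothing new, whereas splitting inside the first chain exposes one intermediate value and stores one block input twice.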
The Dependence-Directed Block Partitioning (DDBP) Policy
We propose the Dependence-Directed Block Partitioning (DDBP) policy to split blocks along the dependence edges that will minimize the negative impacts of slicing, such as those shown in Figures 2 and 3.
While this policy could be implemented dynamically at run-time, we developed a compiler implementation to save hardware costs.
Based on the observation that a block with more upward-exposed inputs and live outputs is likely to have less value locality than blocks with fewer inputs and outputs, our objective should be to minimize the overall number of upward-exposed inputs and live outputs after splitting. This objective reduces the partitioning task to the classic min-cut graph-partitioning problem. The direct requirement of this objective is that we should try to reduce the exposure of values produced for intra-block consumption. As the name suggests, values produced for intra-block consumption are typically unneeded after the execution of the current block. Hence, exposing these values brings unnecessary hardware costs. Naturally, the top priority of our algorithm should be to minimize the number of exposed values that are produced for intra-block consumption. This is equivalent to minimizing the number of dependence edges that remain between the sub-blocks produced when a larger block is split.
As shown in Figure 5 , the algorithm has three phases:
Phase I (steps 1-7 in Figure 5 ): The algorithm tries to take care of two cases:
1. The dependence graph of a basic block could contain multiple Directed Acyclic Graphs (DAGs), as shown in Figure 4 . In this case, the optimal split point would produce sub-blocks containing the separate DAGs. These split points are identified as those having zero crossing dependence edges.
2. For blocks with a single DAG, the best split point is the one that minimizes the number of dependence edges between sub-blocks. There may be multiple candidate split points in this case, which will be resolved in the next two phases.
Phase II (step 8 in Figure 5): We choose as a split point the instruction that minimizes the number of doubly-saved upward-exposed inputs (see Figure 3) if multiple candidate instructions meet the requirement of the first phase. This phase helps to reduce the total number of upward-exposed inputs that need to be saved in the BHB after splitting.
Phase III: The instruction that best balances the sizes of the resulting sub-blocks is chosen as a split point to maintain a proper grain size if the previous two phases fail to find a unique split point.
1 Scan the basic block bottom up, from the last instruction to the first.
2 Number the instruction being scanned $k$, where $k$ is the number of unscanned instructions remaining in the basic block. Decrease $k$ by 1; place this instruction into the yet-to-be-scanned queue.
3 If the instruction at the head of the yet-to-be-scanned queue depends on only one source instruction inside the block, simply number the instruction it depends on $k$. Decrease $k$ by 1, and append the corresponding source instruction to the yet-to-be-scanned queue.
4 If the head of the yet-to-be-scanned queue depends on two instructions that are both in the basic block, and one of the source instructions has only upward-exposed inputs, number this source instruction $k$, and the other source instruction $k-1$. Append both instructions to the yet-to-be-scanned queue in this order. Otherwise, maintain the original order of the source instructions. Decrease $k$ by 2. This step keeps source instructions with upward-exposed inputs close to their dependent instructions to identify potential split points where the number of crossing dependence edges is one.
5 If all inputs to the instruction at the head of the yet-to-be-scanned queue are upward-exposed, mark this instruction a candidate for starting a sub-block, since this is where a new dependence tree starts.
6 Return to Step 1 if there are any unscanned instructions left in the block.
7 Order the instructions by the assigned sequence number.
8 If the data dependence graph of this basic block is a single tree, start searching for a split point where the number of crossing dependence edges is the smallest. The search proceeds bottom up and stops at instruction 3. This split-point instruction should not have any upward-exposed input that is the same as one of its predecessors. We stop at 3 instructions since sub-blocks with fewer than 3 instructions are not likely to provide much performance improvement for block reuse.
In addition, the split points are marked using the above algorithm after the compiler's instruction scheduling pass using the compiler's previously derived data dependence information. This reordering means that our DDBP scheme may interfere with the instruction scheduling phase of the compiler. Hence, our algorithm maintains the original instruction order selected by the compiler when there is no opportunity to reduce the edges between two resulting sub-blocks (step 4 in Figure 5).
In the example shown in Figure 6, the algorithm starts with instruction 6 in the queue of instructions to be scanned. Since instruction 6 is the last instruction, and since it depends on instruction 5, we assign instruction 6 the sequence number 6 and append instruction 5 to the queue. Next, we dequeue instruction 6 and start scanning instruction 5. Since instruction 3, which is one of the sources for instruction 5, has one upward-exposed input, we assign instruction 3 the sequence number 4 and append it to the yet-to-be-scanned queue followed by instruction 4. We then dequeue instruction 5, and start scanning instruction 3. We find that instruction 3 has only upward-exposed inputs, and therefore mark it one of the candidates for splitting the basic block. We then move on to instruction 4. Since both inputs of instruction 4 (instructions 1 and 2) have upward-exposed inputs, we maintain the original order by queuing instruction 2 before instruction 1. The new sequence numbers indicate that the new instruction order is 1,2,4,3,5,6.
Since instruction 3 is the only one that has been marked as a split point, we split the basic block into two sub-blocks at this point. This split produces one sub-block containing instructions 1, 2, and 4, and another containing instructions 3, 5, and 6.
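The renumbering walk in this example can be reproduced with a short sketch. It covers only the numbering and reordering steps; candidate marking and the later phases are omitted. The function names, the dictionary encoding of the dependence graph, and the tie-breaking details are assumptions, and the dependence edges below are inferred from the description of Figure 6 rather than taken from the figure itself.

```python
from collections import deque


def ddbp_order(deps):
    """Renumber instructions by walking dependences bottom up.

    deps maps each instruction (numbered in program order) to the list of
    instructions inside the same block that produce its inputs.
    """
    number = {}
    k = len(deps)
    while len(number) < len(deps):
        # Start from the last instruction that has not been numbered yet.
        start = max(i for i in deps if i not in number)
        number[start] = k
        k -= 1
        queue = deque([start])
        while queue:
            head = queue.popleft()
            # In-block producers, latest first (original order, bottom up).
            srcs = sorted((s for s in deps[head] if s not in number),
                          reverse=True)
            # A producer fed only by upward-exposed inputs is numbered
            # first so that it stays next to its consumer.
            leaf = [s for s in srcs if not deps[s]]
            if len(srcs) == 2 and len(leaf) == 1:
                srcs = leaf + [s for s in srcs if s not in leaf]
            for s in srcs:
                number[s] = k
                k -= 1
                queue.append(s)
    return sorted(deps, key=number.get)


def crossing_edges(order, deps, pos):
    """Count dependence edges that cross a split before order[pos]."""
    first = set(order[:pos])
    return sum(1 for c in order[pos:] for p in deps[c] if p in first)


# Dependences inferred from the text: 6 <- 5, 5 <- {3, 4}, 4 <- {1, 2}.
deps = {1: [], 2: [], 3: [], 4: [1, 2], 5: [3, 4], 6: [5]}
new_order = ddbp_order(deps)
```

Under these assumptions, `new_order` is `[1, 2, 4, 3, 5, 6]`, and splitting it into two three-instruction sub-blocks leaves a single crossing dependence edge, compared with three crossing edges for the same split position in the original order.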
In the example shown in Figure 7 , instructions 1, 2 and 4 form one dependence graph, while instructions 3 and 5 form another separate graph. Thus, it will be best to split the block into two different sub-blocks containing the two separate graphs. The DDBP algorithm when applied to this basic block starts with instruction 5, and traces the dependence backwards to mark instruction 3 as the start of a new sub-block. This split reorders instructions 3 and 4.
As a result of the reordering caused by applying the DDBP algorithm, the source instructions are brought closer to their dependent instructions. This reordering could potentially reduce the instruction level parallelism (ILP) if the processor performs only in-order issue, or if the instruction window is small. Since our processor model is an out-of-order superscalar processor, this negative effect of instruction reordering is expected to be minimal. Furthermore, our subsequent experiments found that an instruction window with 64 instructions is large enough to compensate for any negative effects of reordering for the programs tested.
Algorithm Complexity
The DDBP algorithm scans the instructions in a basic block one-by-one by following the dependence chain backwards to search for optimal slicing points. For a load-store instruction set architecture, there will be no more than two sources for each node in the dependence graph, which corresponds to an instruction. Hence, the complexity at the node level is $O(1)$. If Steps 1-7 in Figure 5 find the optimal split point, the overall complexity is $O(n)$, where $n$ is the number of instructions in a basic block. If Steps 1-7 fail to find a unique slicing point, Step 8 scans the basic block for a second time to find a slicing point that best balances the sizes of the resulting sub-blocks. As a result, in the worst case, the algorithm scans all the instructions twice. Thus, the complexity of this algorithm is $2 \cdot O(1) \cdot O(n)$, which is simply $O(n)$.
Theoretically, the best slicing points for basic blocks should be able to capture all of the instructions that are skippable at run-time, which is the same as the number of skippable instructions in instruction reuse (see Section 3.5), and to maximize performance gain by skipping these instructions. However, since the number of consecutively skippable instructions in a basic block, as well as the start and end points of this instruction sequence, changes dynamically at run time, it is virtually impossible to determine the optimal performance gain. Hence, the upper bound of what an algorithm can achieve remains to be determined. In this paper, we choose to evaluate the algorithm's effectiveness by comparing its performance with other dynamic slicing policies.
Incorporating Profile Information
The compiler first marks the candidate split points using only the data dependence graph information. To prevent the block splitting algorithm from producing blocks that are too small to be reused effectively, we further require that any block must have more than three instructions before we attempt to split it.
The compiler also splits all of the blocks that would not fit into the limited number of fields available in the processor's BHB entry. In addition, we use profiling information to decide whether a block can benefit from splitting. If a basic block is relatively large (e.g., 10 instructions) and its input and output set fits a BHB entry, the compiler needs to know whether a split is necessary before it applies the DDBP algorithm. Thus, we measure the amount of input value locality for each block using a profiling run.
We experimented with splitting basic blocks with less than 25%, 50%, 75%, 90%, and 100% of input value locality [15] and collected the fraction of skippable instructions shown in Figure 8. The amount of input value locality is collected using profile runs with BHBs that have an unlimited number of entries.
We can see that the overall fraction of skippable instructions reaches a peak when we choose to split all basic blocks that have less than 90% input value locality, although the differences are quite small between various selections. However, the choice of 100% is obviously a bad one for Compress and Ijpeg. In the 100% case, the compiler effectively requires a block to be strictly repeating in order for it not to be split. The small differences between the choices indirectly suggest that basic blocks have either very good or very poor input value locality. Using the data in Figure 8, we empirically determine that blocks should not be split unless they have less than 90% input value locality.
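The resulting compile-time decision can be summarized in a small sketch. The function name, its parameters, and the order in which the conditions are checked are assumptions; only the 90% threshold, the three-instruction minimum, and the rule that blocks not fitting a BHB entry are split come from the text.

```python
def should_split(num_instructions, fits_bhb_entry, input_locality,
                 threshold=0.90, min_size=3):
    """Decide whether the compiler should try to split a basic block.

    input_locality is the profiled fraction of executions in which the
    block's inputs repeated (0.0 to 1.0).
    """
    # Blocks must have more than three instructions to be split at all
    # (assumed here to take precedence over the other conditions).
    if num_instructions <= min_size:
        return False
    # Blocks whose inputs and outputs overflow a BHB entry are always split.
    if not fits_bhb_entry:
        return True
    # Otherwise, split only blocks with poor profiled input value locality.
    return input_locality < threshold
```

For example, a ten-instruction block that fits a BHB entry and shows 95% input value locality would be left intact, while the same block with 40% locality would be passed to the DDBP algorithm.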
Dynamic Block-Slicing Policies
Figure 8: The fraction of skippable instructions obtained when using different percentages of input value locality to determine when to split basic blocks using the DDBP algorithm.
As an alternative approach to data-flow based block slicing, we also propose three dynamic policies for splitting basic blocks into sub-blocks, namely, the Fit-Or-Split (FOS) policy, the Restrained Instruction Count (RIC) policy, and the Break-At-Store (BAS) policy. We then conduct an exhaustive evaluation of all four policies to determine the best approach to partition basic blocks for block reuse.
The Fit-Or-Split (FOS) Policy
Under the FOS policy, we still determine block boundaries by identifying branch instructions. (Note that an existing block identified at run-time must be split if there is a subsequent branch instruction that jumps to the middle of it.) However, we create a new entry in the BHB (see Figure 1), i.e., we split a block into two sub-blocks, if the addition of the next instruction to the current basic block will overflow any of the fields in one BHB entry. Since large and irregular basic blocks typically have a large number of upward-exposed inputs and live outputs, creating new sub-blocks when the BHB entry overflows essentially splits these basic blocks by the predetermined number of inputs and outputs. The width of the BHB entry is crucial since it directly affects the size of the sub-blocks that can be built and, in turn, the granularity of the instruction groups available for value reuse. As shown in Table 1, a BHB entry with 5 register inputs, 6 register outputs, 4 memory inputs, and 3 memory outputs is sufficient to cover 95% of the blocks for the benchmarks tested. Splitting basic blocks using this policy controls the hardware cost of the BHB and limits the reuse-detection latency. However, this policy does not directly
The Restrained Instruction-Count (RIC) Policy
While the FOS policy splits blocks when an additional instruction would overflow the width of a BHB entry, the Restrained Instruction Count (RIC) policy limits the size of basic blocks by setting an upper bound on the number of instructions allowed in a block. With this policy, a block ends when the number of instructions reaches the prespecified limit or when a branch instruction is encountered.
Although this splitting policy cannot remove the uncertainty in the number of fields needed in a BHB entry, it does place an upper-bound on the amount of hardware needed. For example, a five-instruction sub-block can have at most five register outputs and ten register inputs.
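As a rough illustration of the RIC policy and of the hardware bound it implies, the sketch below cuts an opcode stream at a fixed instruction count or at a branch, and computes the worst-case number of register fields. The opcode names in `BRANCH_OPS`, the function names, and the assumption of at most two source registers and one destination register per instruction are ours.

```python
BRANCH_OPS = {"beq", "bne", "j", "jr", "jal"}  # assumed opcode names


def slice_ric(opcodes, limit):
    """Cut an opcode stream into sub-blocks of at most `limit` instructions,
    also ending a sub-block at every branch instruction."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    blocks, current = [], []
    for op in opcodes:
        current.append(op)
        if op in BRANCH_OPS or len(current) == limit:
            blocks.append(current)
            current = []
    if current:                 # trailing instructions with no terminator
        blocks.append(current)
    return blocks


def worst_case_register_fields(limit, srcs_per_instr=2, dsts_per_instr=1):
    """Upper bound on (register inputs, register outputs) of one sub-block,
    assuming every source is upward-exposed and every result is live."""
    return limit * srcs_per_instr, limit * dsts_per_instr
```

For a limit of five instructions the bound evaluates to ten register inputs and five register outputs, matching the example in the text.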
Splitting basic blocks by using the instruction count does not directly address the issue of exposing more reuse opportunity, but it does directly control the number of instructions available to skip for each reuse detection process. One to ten instructions are evaluated as the upper bound on block size for this study since about 65% to 95% of the basic blocks in the programs tested have fewer than 10 instructions, as shown in Figure 9. Grouping more instructions usually means less opportunity for reuse. Hence, this scheme may help to identify the practical upper limit where no more instructions should be added to an existing group.
When the number of instructions allowed in each block is set to one, we essentially have the Instruction Reuse scheme [29]. Although instruction reuse is implemented differently from block reuse (see Section 4.1.1), we still treat it as a special case of the RIC policy in our evaluation of the performance potential of these approaches in Section 5.
Figure 9: Distribution of basic block sizes, weighted by execution frequency, for the programs evaluated in this study.

The Break-At-Store (BAS) Policy
The appearance of a store instruction in the instruction sequence typically indicates the completion of a basic computation task [5], although register spilling and stack-related store operations could hide a basic computation. Hence, store instructions could provide a reasonable marker to signal the end of a sub-block. In order to maintain the non-speculative nature of sub-blocks, branch instructions are still used to end a block if there are no store instructions in the block. Since some basic blocks do not contain store instructions, as shown in Figure 10, only 15% to 43% of the basic blocks in the programs evaluated in this study are actually split using this policy.
Figure 10: The distribution of the number of memory outputs from the basic blocks in the SPECint95 benchmarks.
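A minimal sketch of the Break-At-Store policy follows, assuming a sub-block ends at each store and that a branch still ends the block when no store closes it first. The opcode sets and the function name `slice_bas` are hypothetical placeholders.

```python
STORE_OPS = {"sw", "sh", "sb"}                 # assumed store opcode names
BRANCH_OPS = {"beq", "bne", "j", "jr", "jal"}  # assumed branch opcode names


def slice_bas(opcodes):
    """Cut an opcode stream into sub-blocks that end at a store or a branch."""
    blocks, current = [], []
    for op in opcodes:
        current.append(op)
        if op in STORE_OPS or op in BRANCH_OPS:
            blocks.append(current)
            current = []
    if current:                 # trailing instructions with no terminator
        blocks.append(current)
    return blocks
```

A basic block without stores passes through unchanged, which is consistent with only a fraction of the basic blocks being split under this policy.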
The Performance Potential of Sub-Block Reuse
as the BHB entry gets wider. For the RIC policy, the number of skippable instructions strictly decreases when more instructions are allowed in a block, except for Compress and M88Ksim. Compress favors the RIC-10 policy when RIC is used while M88Ksim showed a minimum with RIC-5. The causes of this non-monotonically decreasing behavior for these two programs remain to be studied further. The overall trend, however, shows that the reuse opportunity tends to decrease when the number of instructions in the reuse unit increases since more instructions all must have their inputs repeat simultaneously.
The DDBP scheme is able to capture significantly more reuse opportunities than the compiler-formed basic blocks since the DDBP scheme splits the large basic blocks into manageable sub-blocks when they do not fit into the BHB entry. Note that the DDBP scheme has more skippable instructions than the FOS-4542 policy, which uses the same BHB width, but splits the basic blocks when one BHB entry cannot hold the current basic block. This difference shows that our heuristic for minimizing the number of dependence edges between the sub-blocks does help to expose more reuse opportunities than simply splitting the basic blocks blindly. In general, the DDBP policy has the third-highest fraction of skippable instructions, after RIC-1 and RIC-2. However, while these latter two policies have a good chance of reusing previous instruction results, they are likely to produce only a limited performance benefit since the number of instructions that can be skipped is small (only one or two instructions) for each reuse detection process.
The fraction of skippable instructions for the BAS policy is only slightly smaller than that of the DDBP policy, indicating that this alternative to end sub-blocks is potentially a good choice. However, its fraction of skippable instructions belonging to sub-blocks of two or fewer instructions is typically larger than that of the DDBP policy, which may undermine the actual performance obtained with this scheme.
The RIC-1 policy is equivalent to the dynamic instruction reuse scheme and captures the largest fraction of skippable instructions. As indicated previously, though, the number of skippable instructions provides only a hint about the potential performance of the block reuse scheme. The actual performance is determined by the benefit achieved when skipping the instructions that produce repeated results, which is evaluated in Section 5.
Processor Model and Experimental Framework
In this section, we describe the processor model and the simulation tools that are used to evaluate and compare the performance obtained with each of the sub-block reuse policies. Implementation details of the value reuse schemes also are presented.
Processor Model
All of our experiments are based on the SimpleScalar Tool Set 2.0 distribution [1]. The SimpleScalar Tool Set has its own MIPS-like instruction set and its own distribution of the GCC 2.6.3 compiler, the GAS assembler, and the GLD loader. Our processor simulator is based on SimpleScalar's detailed out-of-order superscalar processor simulator.
The base processor is a four-issue superscalar with 64 RUU [32] the various reuse-mechanisms into this pipeline is described in the following subsections.
Instruction Reuse
Although instruction reuse is treated as a special case of sub-block reuse under the RIC policy (see Section 3.4.2), it is implemented differently from the other RIC mechanisms. We implemented the reuse-by-operand-values scheme proposed by Sodani and Sohi [29]. The reuse buffer required for this mechanism is shown in Figure 16. It is organized like a direct-mapped cache with 32768 entries. This size is chosen so that the reuse buffer requires a die area similar to that of the BHBs evaluated with the other block splitting mechanisms. When an instruction is fetched, the reuse buffer is queried and the saved inputs are retrieved. When all of the operands to this instruction are ready, the saved inputs are compared with the actual processor state. If the inputs repeat, the issue and execution stages of the pipeline are skipped for this instruction. The time it takes to compare the inputs is assumed to be 1 cycle. Since the execution of an instruction may take just 1 cycle, and the operands may not be ready before the instruction is to be issued normally, the instruction may not actually be skipped. However, when the processor's issue width is not sufficient to cover all of the ready instructions, all of the not-issued-but-ready instructions actually could be skipped. This skipping process is propagated to all of the instructions in the reorder buffer. The implementation in [29] allows the update of the reuse buffer by speculative instructions executed down a predicted branch path. We chose not to implement this feature, however, in order to fairly compare the single-instruction reuse scheme with the other reuse mechanisms.
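The lookup and update behavior of such a reuse buffer can be sketched functionally as below. This is not the simulator's implementation: the class name, the modulo indexing on the instruction address, and the tuple layout of an entry are assumptions made for illustration, and pipeline timing is not modeled.

```python
class ReuseBuffer:
    """Direct-mapped buffer holding one (pc, operands, result) record per slot."""

    def __init__(self, num_entries=32768):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index(self, pc):
        # Illustrative indexing; a real design would likely drop alignment bits.
        return pc % self.num_entries

    def lookup(self, pc, operands):
        """Return the saved result if this instruction repeats its operand
        values, otherwise None (the instruction must execute normally)."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc and entry[1] == operands:
            return entry[2]
        return None

    def update(self, pc, operands, result):
        """Record the outcome of a non-speculatively executed instruction."""
        self.entries[self._index(pc)] = (pc, operands, result)
```

A hit requires both the address tag and every operand value to match, so two instructions that map to the same slot simply evict one another.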
Sub-Block and Block Reuse
The sub-block and block reuse schemes both use a direct-mapped Block History Buffer (BHB), as shown in Figure 1. Each BHB entry saves the upward-exposed inputs and live outputs of each instruction block, along with the address of the next block that should be executed when the current block has repeated inputs. The BHB has 4096 entries. Each entry can contain 4 register inputs, 5 register outputs, 4 memory inputs, and 2 memory outputs. Note, however, that different widths are evaluated for the FOS policy in Section 3.4.1. This configuration is chosen since it is sufficient to cover the needs of 90% of the basic blocks without splitting (see Table 1) [15]. The reuse detection time for the BHB is assumed to be 2 cycles (longer latency options were investigated in [13, 14, 15]).
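One way to picture a BHB entry with these field limits is the data-structure sketch below. The field names, the dictionary representation, and the modulo indexing are assumptions for illustration; only the field counts and the number of entries come from the text.

```python
from dataclasses import dataclass, field

# Field limits and buffer size stated in the text.
MAX_REG_IN, MAX_REG_OUT, MAX_MEM_IN, MAX_MEM_OUT = 4, 5, 4, 2
BHB_ENTRIES = 4096


@dataclass
class BHBEntry:
    block_addr: int                 # address of the first instruction (tag)
    next_block_addr: int            # block to execute after a successful reuse
    reg_inputs: dict = field(default_factory=dict)   # register -> saved value
    reg_outputs: dict = field(default_factory=dict)  # register -> saved value
    mem_inputs: dict = field(default_factory=dict)   # address -> saved value
    mem_outputs: dict = field(default_factory=dict)  # address -> saved value

    def fits(self):
        """True if the block's inputs and outputs fit in one BHB entry."""
        return (len(self.reg_inputs) <= MAX_REG_IN
                and len(self.reg_outputs) <= MAX_REG_OUT
                and len(self.mem_inputs) <= MAX_MEM_IN
                and len(self.mem_outputs) <= MAX_MEM_OUT)


def bhb_index(block_addr, num_entries=BHB_ENTRIES):
    """Direct-mapped slot for a block (illustrative modulo indexing)."""
    return block_addr % num_entries
```

A block whose entry fails the `fits` check is one that has to be split before it can be recorded.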
Block reuse is implemented as follows:
1. When the first instruction of a block is fetched, the BHB is accessed. If the BHB produces a miss, all instructions in the block are executed normally.
2. If the BHB access produces a hit, and the inputs for the block are available, the inputs saved in the corresponding BHB entry and the actual processor state are compared. If the inputs of a block include memory reads, the data cache is accessed to compare the actual values.
3. If all of the values match, the entire block is skipped. This skipping process is done by replacing the skipped instructions in the reorder buffer with special operations that update the processor state. For example, each register output corresponds to a register-write operation with the saved output value. This approach guarantees that precise interrupts are preserved. Note that comparing memory inputs involves cache accesses. If any of the cache accesses produce a miss, it is assumed that the input values will differ and the entire block must be executed.
4. During this reuse-detection process, the processor pipeline continues to flow as normal. Thus, the number of instructions that are actually skipped may be fewer than the number of instructions in the block. In the worst case, if the comparison process takes longer than the time required to execute the block, no instructions are actually skipped. However, since this checking process is done in parallel with the normal execution, it incurs no penalty. This partial skipping effect is modeled in detail in our simulator.
5. When the BHB entry of the current block is retrieved, the block beginning at the address saved in the next-block field also is accessed. The input value comparison process for this next-block is done in parallel, but delayed by one cycle, with the value comparison of the current block. Thus, if both the current block and the next block are reused, the value comparison of the next block appears to take only 1 cycle. This scheme is similar to the reuse chaining mechanism described in [10]. In addition, since the value comparison of the subsequent block happens one cycle later, no additional ports for the register file or cache are required.
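The functional part of steps 1 to 3 can be sketched as follows. The sketch ignores all timing, so the overlap with normal execution (step 4) and the chained lookup of the next block (step 5) are not modeled. The BHB is abstracted as a dictionary keyed by block address, and the entry layout and function name are hypothetical.

```python
def try_block_reuse(bhb, block_addr, regs, memory, cached):
    """Return the next block address if the block is reused, else None.

    bhb    : {block_addr: entry}, where an entry is a dict with the keys
             'reg_in', 'mem_in', 'reg_out', 'mem_out', and 'next_block'
    regs   : {register: value}, the current register state
    memory : {address: value}, the current memory state
    cached : set of addresses currently present in the data cache
    """
    entry = bhb.get(block_addr)
    if entry is None:                       # step 1: BHB miss, execute normally
        return None
    for reg, saved in entry["reg_in"].items():      # step 2: compare registers
        if regs.get(reg) != saved:
            return None
    for addr, saved in entry["mem_in"].items():     # step 2: compare memory
        # A cache miss is treated as a mismatch, as described in step 3.
        if addr not in cached or memory.get(addr) != saved:
            return None
    regs.update(entry["reg_out"])           # step 3: write the saved outputs
    memory.update(entry["mem_out"])
    return entry["next_block"]
```

Because the outputs are written only after every input has matched, a failed comparison leaves the processor state untouched and the block simply executes as usual.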
Compiler Support
The GCC compiler is modified to mark the dead register outputs of each basic block [14, 15]. It also marks the potential split points using the DDBP algorithm described in Section 3.3.2. After the compiler's flow-analysis and instruction scheduling stages, the dependence information is available and the instruction order is determined. The potential split points are identified in this phase and the corresponding instructions are marked. However, the dead register information is lost after the register allocation stage. Thus, we rerun the flow-analysis pass after register allocation to obtain the most up-to-date dead-register information. This information is passed to the simulator using SimpleScalar's instruction annotation tool.
Hardware Complexity
In a 32-bit processor, each entry of the BHB uses 112 bytes (see Section 3.1) so that a 4096-entry BHB requires 448KB. This size is comparable to the level-two cache of a typical microprocessor. Since only 36 out of the 112 bytes in each entry must be compared to determine reuse opportunities, the access time of the BHB can be within 2-4 cycles if very wide comparison logic is used. In [14, 15], latencies of 2 to 6 cycles were evaluated. These results showed that speedups of up to 9% could be obtained using a BHB latency of 5 cycles. In this study, we assumed a BHB access time of 2 cycles.
The single-instruction reuse scheme requires 20 bytes per entry for the reuse buffer. Twelve of the twenty bytes must be compared to determine reuse opportunities. For a fair comparison with the block reuse mechanism, we implemented a reuse buffer of 32768 entries, which takes 640KB of total storage. However, we still assumed a one-cycle access time, as was done in [29].
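The storage figures quoted in this section follow directly from the entry counts and entry sizes. The short check below reproduces them; the helper name `buffer_size_kb` is ours.

```python
def buffer_size_kb(num_entries, bytes_per_entry):
    """Total storage of a buffer in kilobytes (1 KB = 1024 bytes)."""
    return num_entries * bytes_per_entry / 1024


bhb_kb = buffer_size_kb(4096, 112)           # Block History Buffer
reuse_buffer_kb = buffer_size_kb(32768, 20)  # single-instruction reuse buffer

print(f"BHB: {bhb_kb:.0f} KB, reuse buffer: {reuse_buffer_kb:.0f} KB")
```

The two structures therefore occupy 448KB and 640KB respectively, which is the basis for treating them as comparable in cost.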
Test Programs
The 
Performance Evaluation
In this section, we evaluate the performance of the different reuse mechanisms using the simulated superscalar processor described in Section 4.1. We use percent speedup as the metric to compare the performance of the different mechanisms. Speedup is calculated by dividing the execution time (in simulated cycles) obtained without any value reuse mechanism by the execution time of the program when using one of the value reuse mechanisms. Percent Speedup then is $100 \times (\mathit{Speedup} - 1)$.
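As a small worked example of this metric, percent speedup is 100 times the speedup minus one, where speedup is the baseline cycle count divided by the cycle count with reuse. The function name and the cycle counts used below are made up for illustration.

```python
def percent_speedup(base_cycles, reuse_cycles):
    """Percent speedup of a run with value reuse over the baseline run."""
    speedup = base_cycles / reuse_cycles
    return 100.0 * (speedup - 1.0)


# Hypothetical cycle counts: 136 cycles without reuse, 100 cycles with reuse.
print(round(percent_speedup(136, 100), 2))
```

With these illustrative numbers the speedup is 1.36, which corresponds to a 36% speedup.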
Performance Comparison of the Different Slicing Policies
The speedups obtained by the different value reuse mechanisms are shown in Figures 17 and 18. We can see from these results that the DDBP scheme performs the best for Compress and Go. For Li, it trails the Break-At-Store (BAS) and Basic Block cases by less than 1%. For Perl, the RIC-2 mechanism performs slightly better than the DDBP mechanism. However, the RIC-2 mechanism performs the best for M88Ksim (26% speedup), Gcc (19% speedup), Perl (17% speedup), and Vortex (36% speedup).
The BAS case performs the best for Ijpeg, exceeding the DDBP scheme by 4%. For the RIC policy, the speedup decreases as the number of instructions in a block increases from 2 to 10, except for Compress and Ijpeg, which favor the RIC-10 mechanism. The single-instruction reuse scheme performs poorly due to the small amount of time that can be saved for each reuse detection, although the opportunities for reuse are abundant. Generally, the sub-block reuse scheme performs better than the basic-block reuse scheme [14, 15], which uses only branch instructions to signal the end of reuse units. Among the variations of the sub-block reuse mechanisms, RIC-2 performs the best. The BAS policy performs relatively well considering that store instructions can be easily detected at run-time. It ranks first for Ijpeg and Li, and second for Vortex. Although its performance for the other programs is not very impressive, it is always better than the basic-block reuse scheme.
While the fraction of skippable instructions [13] reflects how often values can be reused and how often groups of instructions can be skipped, the actual speedup achieved depends on more complex interactions. For instance, the level of Instruction-Level Parallelism (ILP) in the skippable blocks and the configuration of the memory hierarchy both play important roles. If the ILP in a skippable block is high, the total execution time of the skippable block will be relatively short. Thus, the benefit of skipping the block will be less significant. In addition, most of the reuse mechanisms need to access memory to verify inputs. Consequently, the unavailability of a cache port or the occurrence of a cache miss would prevent a block from being skipped. To quantify the impact of these effects on performance, the percentage of instructions that are actually skipped during execution is shown in Figures 19 and 20. In these figures, we see that the patterns closely follow the speedup results except for the instruction-reuse case, which skips the most instructions, but delivers only mediocre speedups.
Analysis of Results
A reuse unit granularity of two or three instructions performs surprisingly well in our experiments due to the relatively large number of opportunities available to reuse units of these sizes. The FOS policy, M88Ksim, its performance is consistently among the top three performers. Figure 21 shows the execution-time weighted mean speedups [18] produced by the different block-slicing policies. Using this weighted mean, we can see that the RIC-2 scheme performs the best, followed closely by DDBP. Other schemes that offer good performance include the BAS and RIC-3. This comparison also shows that the RIC policies generally perform better than the FOS policies.
In summary, the experiments show that sub-block reuse can moderately improve the performance of superscalar processors with only a limited hardware investment. This new mechanism performs better than a reuse mechanism that relies entirely on the basic blocks naturally formed by the compiler. Although their performance has a larger variance across benchmark programs, sub-blocks formed by the RIC-2 policy deliver the best average performance (as measured by the execution time weighted mean).
The DDBP policy, which slices basic blocks using the data-flow information, offers more consistent but slightly lower performance. Hence, other compiler-based partitioning algorithms should be explored to take advantage of the data flow information.
Future Work
Intuitively, partitioning algorithms that use data flow information, such as the DDBP algorithm proposed in this work, should demonstrate better performance for sub-block reuse than approaches that blindly partition instructions into reuse units. Our simulation results show that using this information does not always produce the best performance, however, suggesting that other compiler algorithms should be explored further. On the other hand, sub-blocks with only a few instructions may not contain sufficient data flow information to significantly impact performance, suggesting that an entirely dynamic approach may be appropriate. No definitive conclusion can be drawn, though, without understanding the root causes of reusability in programs. Several outstanding issues concerning sub-block reuse remain, including the following:
Understanding the relationship between the reusable blocks and the overall data flow information, and how this information interacts with the instruction set architecture.
Estimating the best performance of sub-block reuse to determine how much additional performance could be achieved through reuse compared to what the mechanisms evaluated in this work are able to achieve.
Developing global-level compiler algorithms to further exploit sub-block reuse; for instance, using program structure information to estimate the likelihood of reuse for different blocks and different partitioning approaches.
Extending block and sub-block reuse to block value prediction.
Conclusion
The existence of value locality shows that a significant amount of processor resources is wasted since different execution instances of the same instruction may produce repetitive and regular results. Value reuse is one of the promising approaches to exploit this value locality in programs. Value reuse can be performed at different granularities. In this paper, we proposed the sub-block reuse mechanism that uses a reuse unit granularity lying between a single instruction and a basic block. Choosing the best policy to form the sub-blocks is not straightforward, however, since there is a fundamental tension between the size of the reuse unit and the likelihood that all inputs to the unit will repeat.
This study examined the sub-block granularity that maximizes the net performance gain that can be obtained in a superscalar processor by exhaustively studying reuse units ranging from a single instruction to an entire basic block. This comparison is done by slicing the compiler's naturally-formed basic blocks into sub-blocks using several different policies. We find that the dynamic but simple slicing policies, namely the Fit-Or-Split (FOS), Break-At-Store (BAS), and Restrained-Instruction-Count (RIC) policies, can deliver surprisingly good performance gains, even though these policies produce arbitrarily ragged boundaries at the edges of the reuse units. However, slicing basic blocks without data flow information does lead to inconsistent performance across programs. In contrast, the proposed Dependence-Directed Block Partitioning (DDBP) algorithm partitions compiler-produced basic blocks into appropriate sub-blocks along the heuristically determined dependence edges using data flow information. Our simulation results on programs from the SPECint95 benchmark suite show that this intelligent generation of reuse units produces consistently impressive performance gains when compared to block slicing mechanisms that rely on partitioning along intuitive natural boundaries. This intelligent approach is not always the best performer, however. Sub-blocks produced using the RIC-2 policy demonstrate the best performance potential for many of the programs we simulated, although their performance was less consistent across programs than that of the compiler-based mechanism.
