Abstract: Instruction level parallelism is one of the basic ways of increasing the performance of current processors. ALU chaining (chain technique) and load value prediction have been proposed for improving instruction level parallelism. Specifically, ALU chaining aims to reduce data dependence. However, it cannot do this when the instruction being depended upon is a load instruction. Load value prediction is an effective method for reducing load delay, but current predictors do not deliver good performance because some predict only a few load instructions and others have poor prediction accuracy. In this work, we propose a two directional address renaming load value predictor that renames load instruction addresses into a data address and a store instruction address to increase the number of predictable load instructions and improve the prediction accuracy. This method is designed to be installed on current load value predictors. We combine the proposed load value predictor with ALU chaining to further improve superscalar processor performance. Experimental results show that the proposed load value predictor improved performance by about 3.54% on its own and by about 5.79% when combined with ALU chaining.
Introduction
Instruction level parallelism (ILP) remains one of the best ways to improve processor performance, even in this modern era of multi-core technology and general-purpose computing on graphics processing units (GPGPUs). One reason for this is that multi-cores and GPGPUs are constructed on superscalar processors, the performance of which depends on ILP. Moreover, GPGPUs and multi-cores require large hardware and consume a lot of power, so not all systems, particularly smaller embedded systems, can be equipped with these bigger processors. We therefore need to revisit ILP to determine ways of improving the performance of the superscalar processor.
True data dependence is one of the main obstacles to improving ILP. One method for reducing data dependence is ALU chaining, which chains the ALUs and bypasses the result from one ALU to others (Tomita et al., 1986; Sasaki et al., 2005, 2006; Ogata et al., 2007; Yao et al., 2009). With this method, two data-dependent instructions can be executed in parallel.
However, when the instruction being depended upon is a load instruction, ALU chaining cannot be used. This case, called load delay, is a kind of data dependency. It occurs because the load instruction computes the data address in the execution stage and yields the load value only after the execution stage, by accessing the data memory. ALU chaining, in contrast, needs the load value to be available in the execution stage. This is how load delay limits the performance of ALU chaining. In this paper, we target a general CPU with five pipeline stages (fetch stage: FE, decode stage: DE, execution stage: EX, memory access stage: MA, write back stage: WB).
Load value prediction is an effective method for reducing the load delay. In this method, the load value predictor stores the load value history and uses it to predict the load value when the load instructions are fetched again. Several load value predictors have been proposed, including the last value predictor (Lipasti and Shen, 1996; Burtscher and Aom, 1999), the Stride predictor (Sazeides and Smith, 1997), and the differential finite context method (DFCM) predictor (Goeman et al., 2001). However, the accuracy of the last value predictor is not good, which results in a number of mispredictions that cause serious problems such as wasted cycles and power. The DFCM and Stride predictors prioritise prediction accuracy and therefore predict just a few load instructions, so they also cannot yield good performance. Thus, for the load value predictor, we need to not only increase the number of predictable load instructions but also improve the prediction accuracy.
In this work, we propose a high performance load value predictor and combine it with ALU chaining to improve the superscalar processor performance.
Specifically, we propose a two directional address renaming load value predictor that renames load instruction addresses as a data address and a store instruction address. Our focus is the relationship between the data address and load instruction address, and between the store instruction address and the load instruction address, and we add two kinds of address for the prediction. In this way, we aim to increase both the number of the predictable load instructions and the prediction accuracy.
The proposed method is designed for the current load value predictor, upon which it is installed, and we combine the proposed load value predictor with ALU chaining to improve the superscalar processor performance.
Our contributions in this work are as follows:
• we propose a two directional address renaming method for load value prediction to increase the number of predictable load instructions and improve the prediction accuracy
• we combine the load value prediction with ALU chaining to reduce the load delay
• we demonstrate that combining load value prediction with ALU chaining can improve IPC performance.
The rest of this paper is organised as follows. Section 2 describes related work on the load value prediction and scheduling for ALU chaining. In Section 3, we describe the proposed load value prediction and hardware configuration and use an example to demonstrate the action of the proposed predictor. In Section 4, we combine the proposed predictor with ALU chaining and in Section 5, we present the experimental results. We conclude in Section 6 with a brief summary and a mention of future work.
Related work

Load value prediction
Action of load value prediction
Load value prediction is a method of predicting the load value in order to reduce the load delay (Lipasti and Shen, 1996; Burtscher and Aom, 1999; Sazeides and Smith, 1997; Goeman et al., 2001; Sato, 2001). A basic load value predictor that utilises a table consisting of tag and load value information is shown in Figure 1. The tag contains the load instruction address and the load value information contains information on the history of the load value, etc. The predictor performs two actions: prediction and update.
• Prediction: when a load instruction is fetched, the predictor is accessed by the load instruction address (load addr). If the load instruction address matches the predictor tag, the load value is obtained from the stored information in the table, and if it does not, the load value cannot be predicted.
• Update: when a load instruction is committed, the predictor is accessed by the load instruction address. If the load instruction address matches the predictor tag, the load value information is updated, and if it does not, the load instruction address and load value information are registered in a new entry of the load value predictor.
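The prediction and update actions above can be sketched as a small tagged table. This is only an illustration, not the hardware of Figure 1: the class name, the direct-mapped indexing, the 2,048-entry default, and the use of the last committed value as the load value information are all assumptions made for the example.

```python
class BasicLoadValuePredictor:
    """Direct-mapped table of (tag, load value information) entries.

    Assumption: the load value information is simply the last committed value.
    """

    def __init__(self, num_entries=2048):
        self.num_entries = num_entries
        self.table = [None] * num_entries  # each slot holds (tag, value) or None

    def _index(self, load_addr):
        # Assumption: word-aligned instruction addresses; low bits select the slot.
        return (load_addr >> 2) % self.num_entries

    def predict(self, load_addr):
        """Return the predicted value, or None when the tag does not match."""
        entry = self.table[self._index(load_addr)]
        if entry is not None and entry[0] == load_addr:
            return entry[1]
        return None

    def update(self, load_addr, load_value):
        """On commit: refresh a matching entry or register a new one."""
        # A direct-mapped slot is overwritten in both cases, so one write suffices.
        self.table[self._index(load_addr)] = (load_addr, load_value)


predictor = BasicLoadValuePredictor()
first_try = predictor.predict(0x00400100)   # nothing registered yet
predictor.update(0x00400100, 42)            # the load commits with value 42
second_try = predictor.predict(0x00400100)  # the tag now matches
```

In this sketch the first fetch cannot be predicted, while the fetch after the commit returns the stored value.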
Related work on load value prediction
Several load value predictors for reducing load delay have been proposed.
• Last value predictor (Lipasti and Shen, 1996; Burtscher and Aom, 1999): the last value predictor is the simplest load value predictor and uses the last load value as the predicted result. It requires little hardware; however, its prediction accuracy is not high.
• Stride method (Sazeides and Smith, 1997): this method uses a history of load values to find a stride and then predicts the load value on the basis of the stride pattern. One variation of the stride decision method takes the difference between the last value and the second-last value as the stride and adds it to the last load value.
• DFCM (Goeman et al., 2001) : this method, which is an extension of the Stride method, uses several stride histories to isolate changing patterns and predict the load data value.
• Two-hop address renaming (Sato, 2001) : this method maintains the relationship between a store and a load instruction by ensuring they retain the same address. A load and a store instruction to the same address are linked by renaming the data address and the stored value is then forwarded to the load instruction.
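As a concrete instance of the Stride method in the list above, the following sketch keeps the last value and the stride per load instruction and predicts their sum. It is a simplified illustration rather than the predictor of Sazeides and Smith (1997): the class name, the unbounded dictionary in place of a fixed-size table, and the 32-bit wraparound arithmetic are assumptions.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit load values


class StridePredictor:
    """Per-load entry of [last_value, stride]; stride is None until two values are seen."""

    def __init__(self):
        self.entries = {}

    def predict(self, load_addr):
        entry = self.entries.get(load_addr)
        if entry is None or entry[1] is None:
            return None  # no stride known yet
        return (entry[0] + entry[1]) & MASK32

    def update(self, load_addr, value):
        entry = self.entries.get(load_addr)
        if entry is None:
            self.entries[load_addr] = [value, None]
        else:
            # Stride = difference between the new value and the previous one.
            entry[1] = (value - entry[0]) & MASK32
            entry[0] = value


stride_predictor = StridePredictor()
stride_predictor.update(0x00400200, 100)
after_one_value = stride_predictor.predict(0x00400200)   # stride still unknown
stride_predictor.update(0x00400200, 108)
after_two_values = stride_predictor.predict(0x00400200)  # 108 + 8
```

A last value predictor corresponds to the special case in which the stride is always zero.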
ALU chaining scheduling
The chain technique was first defined and used in QA-2 in Tomita et al. (1986), and several scheduling methods designed for use with this technique were subsequently proposed (Sasaki et al., 2005, 2006; Ogata et al., 2007; Yao et al., 2009).
• Grouping method (Sasaki et al., 2005, 2006): in the globally asynchronous locally synchronous (GALS) processor, the scheduling groups the fetched instructions and checks the dependence in the group.
• Dependence matrices table method (DMP) (Ogata et al., 2007; Yao et al., 2009): the scheduling uses a matrix table to maintain the data dependency and wakeup-select the issuable instructions.
In this paper, we propose a two directional address renaming load value predictor and re-design ALU chaining scheduling to combine ALU chaining with the predictor.
Two directional address renaming load value predictor
Prediction configuration and prediction flow
Our two directional address renaming load value predictor renames the load instruction address into a data address and a store instruction address, which are then used for predicting the load value. The predictor consists of four tables: the load indexed data and store table (LIDST), the store indexed data table (SIDT), the data indexed store table (DIST), and the data indexed value table (DIVT).
• LIDST is indexed by load instruction address (load addr). It keeps the data address (data addr), which is the address of the load value, the address of the store instruction (store addr), which is used to store the load value, and the value that is used for prediction.
• DIST is indexed by data address and keeps the store address. It is used to link the load instruction to the store instruction by referring to the same address.
• SIDT is indexed by store address and keeps the data address. It is used to obtain the data address from the store instruction.
• DIVT is indexed by data address and keeps the store value.
The two directional address renaming load value predictor is installed on the Stride or the DFCM predictor. When using Stride, the LIDST keeps the last value data and stride value, and when using DFCM, the LIDST keeps the stride history, last data address, etc. The two directional address renaming load value predictor is called 2D-Stride when using the Stride predictor and 2D-DFCM when using the DFCM predictor.
Here, we show the 2D-Stride prediction flow. Figure 2 shows the configuration of 2D-Stride, where the data address is predicted by the Stride predictor.
In the prediction, the load instruction address is renamed as a data address and a store instruction address.
1 The LIDST is accessed by using the load instruction address to yield the store instruction address and data address.
2 The SIDT is accessed by using the yielded store address to yield the data address.
3 The DIVT is then accessed by using the yielded data address to yield the data value. A data address yielded from the LIDST takes priority over one yielded from the SIDT.
Table 1 shows the details of the 2D-Stride table size, where c-2bit is a confidence bit (discussed in the next subsection) and u-bit is the use bit (i.e., the entry is used).
Prediction confidence estimation
Mispredictions in load value prediction cause a heavy penalty. To increase the prediction accuracy and reduce the penalty, we add a confidence estimation (Jacobson et al., 1996) that uses a 2-bit saturating counter to the proposed load value predictor. The counter is incremented when the prediction is correct and decremented otherwise. When the counter is 00 or 01, the confidence is low and the prediction results are not used. When the counter is 10 or 11, the confidence is high and the predicted results are used.
The 2-bit saturating counter is kept in the LIDST. The 2D-Stride predictor has one saturating counter, and in the 2D-DFCM predictor, each pattern has one saturating counter.
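The behaviour of the 2-bit counter can be illustrated with a few lines of code. This is a sketch only: the class name and the initial value of 00 for a new entry are assumptions, since the text does not state how the counter is initialised.

```python
class ConfidenceCounter:
    """2-bit saturating counter: values 0..3 (binary 00..11)."""

    def __init__(self, value=0):
        # Assumption: a new entry starts at 00 (lowest confidence).
        self.value = value

    def record(self, correct):
        """Increment on a correct prediction, decrement otherwise, saturating at 0 and 3."""
        if correct:
            self.value = min(3, self.value + 1)
        else:
            self.value = max(0, self.value - 1)

    def is_high(self):
        """High confidence means 10 or 11; 00 and 01 are low."""
        return self.value >= 2


counter = ConfidenceCounter()
history = []
for outcome in [True, True, True, True, False, False]:
    counter.record(outcome)
    history.append((counter.value, counter.is_high()))
```

With this sequence the counter reaches high confidence after two correct predictions, saturates at 11, and needs two consecutive mispredictions to fall back to low confidence.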
Action example
In this subsection, we present three kinds of examples using the 2D-Stride method with the proposed predictor: (a) updating the predictor, (b) prediction using the store instruction address, and (c) prediction using the data address.
Sample code is shown in Figure 3. The program flow is divided into two directions (1 and 2) by a branch. SW (store word) and SB (store byte) are store instructions, and LW (load word) is a load instruction. Figure 4 shows the predictor being updated when a store or a load instruction is committed. Here, we assume that the branch is never taken and execution goes to direction 1, where register $1 is 0X10EF34A2 and $2 is 0X8000F2B4. In i1, $1 is the register that keeps the data to be stored and $2 keeps the data address. In i5, $2 also keeps the data address. The update includes the store instruction update and the load instruction update:
Load value predictor update
1 Store instruction update
When a store instruction is committed, the relationships between the store instruction address and the data address, between the data address and the store instruction address, and between the data value and the data address are registered. In the example shown in Figure 3, store instruction i1 is committed and the store data is 0X10EF34A2.
• DIST registers store instruction address i1 using data address 0X8000F2B4 as the index.
• SIDT registers data address 0X8000F2B4 using store instruction address i1 as the index.
• DIVT registers the data value 0X10EF34A2 using data address 0X8000F2B4 as the index. The data value is divided into four parts, with each field keeping one byte.
2 Load instruction update
When a load instruction is committed, the predictor accesses the DIST to yield the store address and to register the relationship between the load instruction address and the store instruction address.
In the example shown here, load instruction i5 is committed and the data address is 0X8000F2B4.
• DIST is accessed to yield store instruction address i1 by using data address 0X8000F2B4 as the index.
• LIDST registers store instruction address i1 in the entry whose tag is load instruction address i5. When the stride value cannot be predicted, as here, the data value is not registered and the most recent data address 0X8000F2B4 is registered.
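The store and load updates of this example can be traced with a small software model of the four tables. The sketch below is illustrative only: the class and method names, the use of unbounded dictionaries instead of fixed-size tagged tables, the numeric instruction addresses chosen for i1 and i5, and the big-endian order of the four DIVT byte fields are assumptions, and the stride and confidence fields of the LIDST are omitted.

```python
class TwoDirectionalTables:
    """Software model of LIDST, DIST, SIDT, and DIVT (update path only)."""

    def __init__(self):
        self.lidst = {}  # load instruction address -> {"store_addr", "data_addr"}
        self.dist = {}   # data address -> store instruction address
        self.sidt = {}   # store instruction address -> data address
        self.divt = {}   # data address -> four one-byte value fields

    def commit_store(self, store_pc, data_addr, value):
        """Register the three relationships created by a committed store."""
        self.dist[data_addr] = store_pc
        self.sidt[store_pc] = data_addr
        # Assumption: the 4-byte value is split most significant byte first.
        self.divt[data_addr] = list(value.to_bytes(4, "big"))

    def commit_load(self, load_pc, data_addr):
        """Link a committed load to the store that last wrote its data address."""
        entry = self.lidst.setdefault(load_pc, {"store_addr": None, "data_addr": None})
        entry["store_addr"] = self.dist.get(data_addr)  # None if no store is known
        entry["data_addr"] = data_addr                  # most recent data address


I1, I5 = 0x00400010, 0x00400020  # hypothetical instruction addresses for i1 and i5

tables = TwoDirectionalTables()
tables.commit_store(I1, 0x8000F2B4, 0x10EF34A2)  # store instruction i1 commits
tables.commit_load(I5, 0x8000F2B4)               # load instruction i5 commits
```

After these two commits, the LIDST entry of i5 holds both directions of the renaming: the store instruction address i1 and the data address 0X8000F2B4.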
Prediction using store instruction address
Here, we assume that the loop in Figure 3 is executed, where instructions i1 and i5 are executed repeatedly and registers $1 and $2 are set to 0X35D490AB and 0X8000F2BC, respectively, by the program. Figure 5 shows the prediction using the store instruction address.
Store instruction update
When a store instruction is committed, the store instruction information is updated. In this example, the store data held in $1 is 0X35D490AB.
• DIST registers store instruction address i1 using data address 0X8000F2BC as the index.
• SIDT registers data address 0X8000F2BC using store instruction address i1 as the index.
• DIVT registers the data value 0X35D490AB using data address 0X8000F2BC as the index. The data value is divided into four parts, with each field keeping one byte.
Prediction
When a load instruction is fetched, the predictor uses the load instruction address to yield the store instruction address in the LIDST, uses the store instruction address to yield the data address in the SIDT, and uses the data address to yield the data value in the DIVT.
• LIDST is accessed by using the tag of load instruction i5 and yields the store instruction address i1.
• SIDT is then accessed by using the tag of store instruction address i1 to yield the data address 0X8000F2BC.
• Finally, DIVT is accessed by the data address, and the four one-byte value fields are combined into the 4-byte data (0X35D490AB), which is output as the prediction result. Because the data address is not yet set in the LIDST, the data address yielded from the SIDT is used.
In addition, when the prediction is correct, the difference from the most recent data address (8) is registered as the stride value and 0X8000F2BC is registered as the most recent data address.
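The three table lookups of this prediction can be written as a short chain. The sketch is a simplification under stated assumptions: the function name is hypothetical, each table is a plain dictionary, the LIDST is reduced to a map from load instruction address to store instruction address, the numeric addresses for i1 and i5 are invented, and the DIVT byte fields are assumed to be stored most significant byte first. The values are those of this example (store data 0X35D490AB at data address 0X8000F2BC).

```python
def predict_via_store_address(load_pc, lidst, sidt, divt):
    """LIDST -> SIDT -> DIVT lookup; returns None as soon as one table misses."""
    store_pc = lidst.get(load_pc)    # load instruction address -> store instruction address
    if store_pc is None:
        return None
    data_addr = sidt.get(store_pc)   # store instruction address -> data address
    if data_addr is None:
        return None
    fields = divt.get(data_addr)     # data address -> four one-byte fields
    if fields is None:
        return None
    # Combine the four byte fields into the 4-byte predicted value.
    return int.from_bytes(bytes(fields), "big")


I1, I5 = 0x00400010, 0x00400020  # hypothetical instruction addresses for i1 and i5

lidst = {I5: I1}
sidt = {I1: 0x8000F2BC}
divt = {0x8000F2BC: [0x35, 0xD4, 0x90, 0xAB]}

predicted = predict_via_store_address(I5, lidst, sidt, divt)
```

Because the store commits before the load is fetched again, the SIDT already points at the new data address, so the load value can be predicted even though the address differs from the previous iteration.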
Prediction using data address
Here, we hypothesise that the loop execution of Figure 3 is run, where instructions i1 and i5 are executed repeatedly and registers $1 and $2 are set to 0X28DBA162 and 0X8000F2C4, respectively, by the program. Figure 6 shows the prediction using data address.
Store instruction update
When a store instruction is committed, the store instruction information is updated.
• DIST updates the relationship between data address 0X8000F2C4 and store instruction address i1.
• SIDT updates the relationship between store instruction address i1 and data address 0X8000F2C4.
• DIVT registers data value 0X28DBA162 and address 0X8000F2C4.
First time load prediction: using store instruction address
When a load instruction is fetched, the predictor uses the load instruction address to yield the store instruction address in LIDST, uses the store instruction address to yield the data address in SIDT, and uses the data address to yield the data value in DIVT. This method is the same as the prediction using the store instruction address.
When the difference between the current data address and the most recent data address equals the registered stride, the data address can be used the next time the load instruction is fetched, and the predicted address is the most recent data address plus the stride.
Prediction: using data address
When the load instruction is fetched again, the data address is used to perform the prediction.
First, the DIVT is used to perform the data value prediction: when the data address exists in the DIVT, the four field values are combined into the data value. When both the store instruction address and the data address exist in the LIDST, the prediction that uses the data address takes priority.
• SIDT is then accessed by using the tag of store instruction address i1 to yield the data address.
• Finally, DIVT is accessed by the data address, and the four one-byte value fields are combined into the 4-byte data (0X28DBA162), which is used as the prediction result.
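The priority between the two directions can be sketched as follows. This is an illustration under assumptions, not the exact hardware behaviour: the function and field names are hypothetical, the stride-predicted data address is computed as the most recent data address plus the stride, and falling back to the store direction when the DIVT has no entry for the stride-predicted address is an assumption the text does not spell out. Confidence estimation is omitted. The values are those of this example (data 0X28DBA162 at 0X8000F2C4, previous data address 0X8000F2BC, stride 8).

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit addresses


def read_divt(divt, data_addr):
    """Combine the four one-byte fields of a DIVT entry, or return None on a miss."""
    fields = divt.get(data_addr)
    return None if fields is None else int.from_bytes(bytes(fields), "big")


def predict_load_value(load_pc, lidst, sidt, divt):
    entry = lidst.get(load_pc)
    if entry is None:
        return None
    # Direction 1 (priority): data address predicted from the LIDST stride fields.
    if entry.get("last_addr") is not None and entry.get("stride") is not None:
        value = read_divt(divt, (entry["last_addr"] + entry["stride"]) & MASK32)
        if value is not None:
            return value
    # Direction 2: store instruction address -> SIDT -> data address.
    store_pc = entry.get("store_pc")
    if store_pc is not None and store_pc in sidt:
        return read_divt(divt, sidt[store_pc])
    return None


I1, I5 = 0x00400010, 0x00400020  # hypothetical instruction addresses for i1 and i5

lidst = {I5: {"store_pc": I1, "last_addr": 0x8000F2BC, "stride": 8}}
sidt = {I1: 0x8000F2C4}
divt = {0x8000F2C4: [0x28, 0xDB, 0xA1, 0x62]}

predicted = predict_load_value(I5, lidst, sidt, divt)
```

In this example both directions lead to the same data address; they differ when another store has written the location or when no store is linked to the load, which is where keeping both relationships helps.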
Superiority of proposed method and comparison with two-hop
The advantages of the proposed method are related to how the flow of action is executed.
1 The predictor can make a prediction even if the store address does not exist. Hence, the number of load instructions whose values can be predicted is increased.
2 The prediction accuracy is improved by using the two kinds of address and the confidence estimation.
Here, we provide a brief comparison with the two-hop method.
Figure 7 Example of incorrect prediction by two-hop method
The two-hop predictor has a greater application range than the one-hop, but it can only keep the relationship between the load instruction and the first executed store instruction referring to the same address. Hence, when the relationship is changed by some other store instruction, the predictor cannot perform accurate predictions. An example is shown in Figure 7 . We assume that direction-1 is selected on the branch and then store instruction i1 and load instruction i2 are linked at first. At the next iteration, the load value of i2 can be predicted correctly if direction-1 is selected. However, the load value of i2 is incorrectly predicted if direction-2 is selected in spite of store instruction i3 and load instruction i2 referring to the same memory address because the link of the load instruction is limited to only one store instruction. In addition, load value cannot be predicted if the load instruction is not linked prior to the execution of the store instruction.
To overcome these problems, we designed our two directional address renaming to keep both the relationship between the load instruction and the data address and the relationship between the load instruction and the store instruction.
Combining the data value predictor with ALU chaining
Overview of ALU chain scheduling
Load value prediction is an effective method for reducing the load delay and ALU chaining is an effective method for reducing the data dependence. By combining load value prediction and ALU chaining, the data dependence in which the depended-upon instruction is a load instruction can be reduced. We extended the chaining scheduling, which uses a dependency map (DMP) to store the data dependency information and an issue map (IM) to issue the instructions. The hardware configuration is shown in Figure 8.
The dependency map consists of a load bit, a dependency bit, and two dependency numbers for the two source operands. Here, the load bit means the instruction is a load instruction and the dependency bit means the instruction depends on older instructions. Each dependency number keeps the number of the older instruction on which the source operand depends.
Example of combining ALU chaining and load value prediction
In ALU chaining scheduling, we add a prediction bit (Pred) for keeping the load value prediction information. When a load instruction is fetched, the LIDST is accessed, and if the load instruction exists in the LIDST, the Pred is set to 1. It is also set to 1 when the instruction depends on a load instruction that can be predicted. Figure 8 shows the chain scheduling for the load value predictor in greater detail, where L is load bit, DepB is dependency bit, and DepNo is dependency number. The first source of i1 depends on i0, and because i0 does not depend on any instructions, i0 and i1 can be executed in parallel by using ALU chaining.
In this example, load instruction i2 can be predicted, so the load bit and the prediction bit of its entry are both set to 1. ADD instruction i4 depends on load instruction i2, but i2 is predicted by load value prediction. These two instructions can therefore be executed in parallel by using load value prediction. In the dependency map, the prediction bit is set to 1, which means the instruction can be run even if it depends on the load instruction.
Load instruction i6 cannot be predicted, so the prediction bit of i6 is 0. Instruction i7, which depends on i6, cannot be issued.
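The role of the prediction bit in this example can be modelled with a small dependency map. The sketch below is a simplification and not the scheduler of Figure 8: the names are hypothetical, the issue map is not modelled, and the only rule encoded is the one stated above, namely that an instruction is held back when one of its source operands comes from a load whose prediction bit is 0.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DmpEntry:
    load_bit: bool                               # L: the instruction is a load
    pred_bit: bool                               # Pred: a predicted load value is available
    dep_no: Tuple[Optional[int], Optional[int]]  # DepNo: producer of each source operand

    @property
    def dep_bit(self):
        """DepB: the instruction depends on at least one older instruction."""
        return any(d is not None for d in self.dep_no)


def make_entry(dmp, is_load, hits_lidst, dep_no):
    """Set Pred for a predictable load or for a consumer of a predictable load."""
    depends_on_predicted_load = any(
        d is not None and dmp[d].load_bit and dmp[d].pred_bit for d in dep_no
    )
    pred = (is_load and hits_lidst) or depends_on_predicted_load
    return DmpEntry(is_load, pred, dep_no)


def waits_for_load(dmp, number):
    """True when a source operand comes from a load that could not be predicted."""
    return any(
        d is not None and dmp[d].load_bit and not dmp[d].pred_bit
        for d in dmp[number].dep_no
    )


dmp = {}
dmp[0] = make_entry(dmp, False, False, (None, None))  # i0: no dependence
dmp[1] = make_entry(dmp, False, False, (0, None))     # i1: first source depends on i0
dmp[2] = make_entry(dmp, True, True, (None, None))    # i2: load found in the LIDST
dmp[4] = make_entry(dmp, False, False, (2, None))     # i4: ADD depending on i2
dmp[6] = make_entry(dmp, True, False, (None, None))   # i6: load not in the LIDST
dmp[7] = make_entry(dmp, False, False, (6, None))     # i7: depends on i6

blocked = [n for n in sorted(dmp) if waits_for_load(dmp, n)]
```

In this model only i7 is held back, while i4 can proceed together with the predicted load i2.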
Evaluation
We installed the two-directional address renaming method on the DFCM (2D-DFCM) and Stride predictor (2D-Stride) to evaluate the performance of the proposed method. The last value predictor was also installed for the purpose of comparison. Performance improvement by combining ALU chaining and load value predictor was also evaluated.
Evaluation index
We used the following evaluation indices as well as the prediction confidence to compare the prediction performances.
• prediction rate = (predicted load instruction number/load instruction number)
• prediction correct rate = (correct prediction load instruction number/actual load instruction number)
• prediction accuracy rate = (correct prediction load instruction number/predicted load instruction number)
• prediction accuracy rate of high confidence = (correct prediction load instruction number of high confidence/predicted load instruction number of high confidence load instruction).
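The four indices above are simple ratios of instruction counts, as the following sketch shows. The function name and the counts in the usage example are invented for illustration and are not measured results; the sketch also assumes that the load instruction number and the actual load instruction number both denote the total number of executed load instructions.

```python
def prediction_metrics(loads, predicted, correct, hc_predicted, hc_correct):
    """Compute the four evaluation indices from raw instruction counts."""

    def ratio(numerator, denominator):
        # Guard against empty runs instead of dividing by zero.
        return numerator / denominator if denominator else 0.0

    return {
        "prediction_rate": ratio(predicted, loads),
        "prediction_correct_rate": ratio(correct, loads),
        "prediction_accuracy_rate": ratio(correct, predicted),
        "high_confidence_accuracy_rate": ratio(hc_correct, hc_predicted),
    }


# Illustrative counts only (not taken from the experiments).
example = prediction_metrics(
    loads=1000, predicted=800, correct=600, hc_predicted=500, hc_correct=475
)
```

The example makes the difference between the indices visible: a predictor can have a high prediction accuracy rate simply because it predicts few loads, which is why the prediction correct rate is reported as well.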
Instructions per cycle (IPC) was used as the key index for measuring the processor performance.
Evaluation conditions
We applied the proposed predictor using the SimpleScalar Tool set (Burger and Austin, 1997) with SPECint2000 benchmarks to evaluate the performance. Table 2 shows the processor configuration. We used two types of window: a small one with a 32-entry instruction window, a 16-entry LSQ, and a max issue instruction number of 4, and a big one with a 64-entry instruction window, a 32-entry LSQ, and a max issue instruction number of 16. The other hardware is the default size of SimpleScalar. We set the comparison targets, namely the last value predictor, the Stride predictor, and DFCM, to 2,048 entries, so in terms of hardware size, the last value predictor is 17 KB, Stride is 33 KB, and DFCM is 163 KB. The proposed predictor has four tables, each of which contains 1,024 entries, so the 2D-Stride is 42 KB and the 2D-DFCM is 110 KB.
Seven benchmarks from SPECint2000 are used. Input data are from the references listed in Table 3 . In terms of instruction executions, we skip the first 100 million instructions and run a subsequent 100 million instructions.
Evaluation results
Prediction accuracy
Figures 9 to 12 show the prediction rate, prediction correct rate, prediction accuracy rate, and prediction accuracy rate of high confidence, respectively.
In Figure 9, almost all load instructions are predicted by last value prediction, whereas the prediction rates of Stride and DFCM are about 50% and 40% on average. DFCM and Stride aim to increase the accuracy, which reduces the number of predictable instructions. Hence, increasing the number of predictable load instructions is important for Stride and DFCM. In 2D-Stride and 2D-DFCM, the prediction rate increased to about twice that of Stride and DFCM. For the benchmarks bzip2, gzip, mcf, and vpr, the prediction rate was nearly 100% in 2D-Stride and 2D-DFCM.
Figure 10 shows that the last value predictor has the lowest prediction correct rate in several benchmarks. This is because the last value predictor predicts a large number of load instructions. However, the 2D-Stride and 2D-DFCM yield a higher prediction correct rate than the three conventional predictors, indicating that the 2D-Stride and 2D-DFCM are superior to these other types.
Figure 11 shows that the DFCM has the highest prediction accuracy rate. This is because its number of predicted instructions is the smallest. Therefore, here we use a confidence for the prediction. Figure 12 shows the results of the five predictors when predictions are performed using only high confidence. We found that 2D-Stride and 2D-DFCM had a similar prediction accuracy rate with high confidence. This means that the number of correctly predicted load instructions is larger with 2D-Stride and 2D-DFCM, while the number of mispredicted load instructions is not smaller.
IPC performance improvement
Of the conventional methods, the DFCM method is considered the best. We therefore used it in our comparison with the three proposed methods. Also included in the comparison is the combination with ALU chaining. Figure 13 shows the IPC performance using the small window. In the case with no ALU chaining, the 2D-Stride predictor had a performance improvement of about 1.43% compared to the case with no load value predictor. The case with ALU chaining had a performance improvement of about 1.15% compared to without ALU chaining. In the case in which ALU chaining was used, the 2D-DFCM predictor had a performance improvement of about 2.99% compared to the case in which the load value predictor was not used.
Figure 14 shows the IPC performance using the big window. The 2D-Stride predictor had a performance improvement of about 3.54% compared to without using the load value predictor, and the ALU chaining had a performance improvement of about 5.08% compared to without using ALU chaining. In the case of using ALU chaining, the 2D-DFCM predictor had an average performance improvement of about 5.79% compared to without using the load value predictor.
Conclusions
In this paper, we presented a two directional address renaming method and used it on the current load value predictor to improve prediction accuracy rate and predictable load instruction numbers. We also combined ALU chaining with the proposed load value predictor. The results show that 2D-DFCM had the best performance, about 19.84%, and that it improved by about 5.79% when using large windows by combining with ALU chaining. A more detailed comparison of the proposed method with the two-hop load value predictor will be the focus of our future work.
