Abstract-Value prediction is a technique that breaks true data dependences by predicting the outcome of an instruction and speculatively executes its data-dependent instructions based on the predicted outcome. As the instruction fetch rate and issue rate of processors increase, the potential data dependences among instructions issued in the same cycle also increase. Value prediction and speculative execution become critical to keep the issue rate high. Unfortunately, most of the proposed value prediction schemes focused only on the accuracy of the prediction. They have yet to consider the bandwidth required to access the value prediction tables. In this paper, we focus on the bandwidth issues of the value prediction. We propose augmenting the trace cache [19] , [26] (which was proposed to provide the required fetch bandwidth for wide-issue ILP processors) with a copy of the predicted values and moving the generation of those predicted values (which require accessing the value prediction tables) from the instruction fetch stage to a later stage, e.g., the writeback stage. Such a change will allow "selective value prediction," i.e., only those instructions which require value prediction will access the value prediction tables. It can significantly reduce the bandwidth requirement of value prediction tables. We also use a dynamic classification scheme to steer predictor updates to behavior-specific tables (such as last-value, stride, two-level, etc.). A relatively even split among such table accesses further moderates the bandwidth requirement of those tables.
INTRODUCTION
To achieve high performance on wide-issue superscalar processors, it is essential to maintain high instruction fetch rate and issue rate and to exploit as much instruction-level parallelism (ILP) as possible. Instruction fetch rate is mostly limited by the changing of control flow direction and its subsequent nonconsecutive instruction layout in the instruction cache. To overcome such problems, many dynamic branch predictors [3], [11], [18], [33] and the trace cache have been proposed [4], [5], [19], [22], [24], [26]. As the instruction fetch rate of processors increases from using such schemes, potential data dependences among instructions issued in the same cycle will also increase. There are two kinds of data dependences: true data dependences and false data dependences (or name dependences). False data dependences can be eliminated by using hardware and software techniques such as register renaming. True data dependences, however, remain a serious problem for wide-issue processors. Recently proposed schemes using value prediction provide an effective way to address this problem [2], [6], [7], [9], [14], [15], [16], [17], [21], [27], [28], [29], [30], [31], [32]. By predicting the result value of an instruction, the processor can proceed to speculatively execute its data-dependent instructions. These studies show value prediction can provide rather accurate values for many programs [16], [27], [30], [32] and good performance gain can be obtained [2], [7], [9], [21], [28], [29].
However, most studies assume all fetched instructions could access the value prediction table simultaneously in the instruction fetch stage. This means that, in wide-issue superscalar processors which allow 8 to 16 instructions per clock cycle, we may need to perform several branch predictions and 8 to 16 simultaneous accesses to the value prediction table in each cycle. This requires very expensive hardware because it needs very high bandwidth branch and value prediction tables with many read and write ports to accommodate so many simultaneous accesses. Most prior value prediction studies are thus focused only on the predictor algorithms and their potential performance gain. Some studies, such as Gabbay and Mendelson [7], suggest using multiple interleaved banks with a fast distribution network to implement the value prediction table. However, potential bank conflicts may exist because several instructions from different basic blocks may try to access the same bank simultaneously [15]. Another important drawback of the previously proposed schemes is that most of them assume the value prediction table is accessed in the instruction fetch stage. This requires all instructions to access the value prediction table, regardless of whether they need a predicted value or not, because instructions are not yet decoded in the fetch stage.
In this paper, we focus on the bandwidth issues of the value prediction for wide-issue architectures [14], [15]. In particular, we focus on the processors with trace cache because such architectures incorporate multiple branch predictions and speculative execution in a very effective way [4], [5], [11], [24]. We propose augmenting the trace cache with a copy of the predicted values for data-dependent instructions. These predicted values are organized the same way as their corresponding instructions in the trace cache. Hence, they can be accessed in parallel and avoid the problem of having to access a centralized value prediction table. Actual accesses to the value prediction table for the predicted values are moved from the instruction fetch stage to a later stage, such as the writeback stage. One significant benefit of such a move is that actual data-dependent instructions are known in the later stages. Hence, only those dependent instructions need to access the value prediction table and a significant reduction in the bandwidth requirement of the value prediction table can be achieved. Another advantage is that the latency between the table update and the next value prediction can be reduced because they can be carried out in the same stage (e.g., writeback stage), instead of value prediction in the instruction fetch stage and the predictor update in a later stage as in other schemes.
Any good existing value predictor can be used with such a scheme. In this paper, we use a dynamic classification scheme similar to the one proposed by Rychlik et al. [27]. Such a dynamic classification scheme steers predictor updates to several behavior-specific tables (such as last-value, stride, two-level, etc.). The traffic to those tables is thus distributed, further reducing the bandwidth requirement for each particular value prediction table. Such a scheme is inherently more effective than arbitrary banking schemes with a centralized value prediction table. The trace cache fits nicely with such dynamic classification schemes because of its ability to store other relevant information in the trace cache. Our simulation results show that, using such a structure with realistic read/write ports to value prediction tables, good performance can be achieved. We also compare the performance to that obtained assuming unlimited table bandwidth.
The rest of the paper is organized as follows: Section 2 describes recent related work on the trace cache and the value prediction schemes. Section 3 summarizes the value predictors used in our proposed scheme. Section 4 presents the needed changes to the microarchitecture for the proposed scheme using value prediction with dynamic classification. Section 5 describes our evaluation methodology. In Section 6, we present and discuss our simulation results. Finally, Section 7 offers some conclusions.
RELATED WORK
In order to supply as many instructions as the processor instruction issue width, the trace cache supplies instructions by caching logically contiguous instructions, called traces, dynamically at runtime. Many studies show that the trace cache is a more effective mechanism to supply multiple basic blocks per clock cycle than most conventional instruction fetch mechanisms using multiple branch prediction and realignment hardware [4] , [5] , [19] , [22] , [24] , [25] , [26] . The trace cache moves the complexity out of the instruction fetch/issue stages to the trace cache fill stage and is mostly out of the critical path in the instruction execution. Rakvic et al. proposed a completion-time multiple-branch predictor to reduce the complexity and the latency of the instruction fetch stage [22] .
In the case of value prediction, several schemes have been proposed. They include last value prediction [16] , stride prediction [6] , context-based prediction [30] , [31] , [32] , and hybrid value prediction [27] , [30] , [32] . Last value prediction predicts the result value of an instruction based on its most recently generated value. A stride predictor predicts the value by adding the most recent value to the difference of the two most recently produced values. This difference is called the stride. Context-based predictors predict the value based on the repeated pattern of the last several values observed. FCM (Finite Context Method) [30] and two-level [32] predictors belong to this category.
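As a minimal illustration of the two simplest schemes described above, the following sketch computes a last value prediction and a stride prediction from a list of previously produced values. The function names and the example sequences are hypothetical and are not taken from the cited work; a hardware predictor would keep only the last value and the stride rather than a full history.

```python
def predict_last(history):
    """Last value prediction: repeat the most recently generated value."""
    return history[-1]


def predict_stride(history):
    """Stride prediction: most recent value plus the last observed stride."""
    stride = history[-1] - history[-2]  # difference of the two most recent values
    return history[-1] + stride


# Hypothetical value sequences produced by one static instruction.
loop_counter = [4, 8, 12, 16]   # constant stride of 4
constant = [7, 7, 7]            # stride of 0, so both schemes agree

next_counter = predict_stride(loop_counter)
next_constant = predict_last(constant)
```

A sequence with a stride of zero is handled equally well by both functions, which is why a classification step can assign such instructions to the cheaper last value table.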
Each of the predictors mentioned above shows good performance for certain data value sequences, but bad for others. Therefore, some hybrid predictors which combine several predictors are proposed. Wang and Franklin proposed a hybrid predictor which combines a stride predictor and a two-level value predictor [32]. Rychlik et al. [27] , [28] combine a last value predictor, a stride predictor, and an FCM predictor. The choice of a predictor for each instruction is guided by a dynamic classification mechanism. They show that the prediction rate is not a good indicator of performance because over 30 percent of the value predictions in the SPECint programs have no effect in enhancing the performance [28] . They also noticed that updating the value prediction tables after the real values are produced, instead of an ideal immediate update after each prediction table access, can result in a lower prediction rate, hence, a lower performance. It is because the subsequent instructions may use the stale prediction values before they are updated [15] , [21] . Calder et al. examine selective value prediction schemes based on the confidence counter value and the instruction type [2] .
In spite of those studies on value prediction, there are few studies focusing on multiple value predictions, which are required on wide-issue superscalar processors. Gabbay and Mendelson have shown the limitation of multiple accesses to the value prediction table using a realistic implementation [7] . They proposed a highly interleaved prediction table with a fast distribution network to support high instruction issue rate.
A HYBRID VALUE PREDICTOR WITH DYNAMIC CLASSIFICATION
In this section, we describe the hybrid value predictor used in our study. It includes a last value predictor (Last), a stride predictor (Stride), a two-level value predictor (Two-level), and a dynamic classification scheme. An FCM predictor can also be used instead of a two-level value predictor.
Last Value Predictor
The last value predictor uses a prediction table which contains a tag, the last value, and a confidence counter in each entry [16]. The prediction table is indexed by the address of an instruction. The predictor stores the last value produced by the instruction and uses the value as the predicted next result if the value of its confidence counter is greater than a threshold. The confidence counter is incremented by one if the prediction is correct and decremented by one if not. We set the threshold value of the confidence counter to two.
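The behavior described above can be sketched as follows. This is only an illustration under stated assumptions: the class name, the direct-mapped index and tag split, the saturation limit of three, and the choice to treat "stored value equals the actual result" as a correct prediction are all assumptions made for the example; only the entry fields and the threshold of two come from the text.

```python
class LastValuePredictor:
    THRESHOLD = 2   # from the text: predict only if the counter is greater than 2
    MAX_CONF = 3    # assumption: the counter saturates at 3

    def __init__(self, entries=4096):
        self.entries = entries
        self.table = {}  # index -> [tag, last_value, confidence]

    def _split(self, pc):
        # Assumption: low-order bits index the table, the rest form the tag.
        return pc % self.entries, pc // self.entries

    def predict(self, pc):
        index, tag = self._split(pc)
        entry = self.table.get(index)
        if entry is None or entry[0] != tag:
            return None                      # no entry for this instruction
        if entry[2] > self.THRESHOLD:
            return entry[1]                  # confident: predict the last value
        return None

    def update(self, pc, actual):
        index, tag = self._split(pc)
        entry = self.table.get(index)
        if entry is None or entry[0] != tag:
            self.table[index] = [tag, actual, 0]   # allocate a fresh entry
            return
        if entry[1] == actual:
            entry[2] = min(entry[2] + 1, self.MAX_CONF)
        else:
            entry[2] = max(entry[2] - 1, 0)
        entry[1] = actual                    # always remember the last value
```

With these assumptions, an instruction must repeat the same value several times before a prediction is issued, and a single changed result withdraws the prediction until confidence is rebuilt.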
Stride Predictor
The 
Two-Level Value Predictor
The two-level value predictor consists of two tables: a value history table (VHT) and a pattern history table (PHT) [32]. Each instruction is directly mapped to a VHT entry using its address. A VHT has four fields: tag, LRU, four distinct data values, and its value history pattern. The data value field stores up to four distinct values generated by the instruction. The LRU field keeps track of the order in which the four distinct data values were last observed. The value history pattern records the pattern of the last few outcomes of the instruction. Each pattern is composed of a series of binary coded indices into the four data value fields. This value history pattern is used to index the PHT that has four up-down saturating counters corresponding to the data values field. The counter is incremented or decremented by one according to the prediction result. Prediction is made only if the maximum counter value is greater than or equal to a threshold value, which is set to two.
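A simplified software model of this organization is sketched below. It is not the implementation of [32]: the class name, the history length of two outcomes, the saturation limit of three, the omission of tags, and the policy of incrementing the counter of the observed value while decrementing the other three are assumptions made to keep the example short. The four data values, the LRU replacement, the pattern-indexed counters, and the threshold of two follow the description above.

```python
class TwoLevelPredictor:
    THRESHOLD = 2   # from the text
    MAX_COUNT = 3   # assumption: counters saturate at 3
    HISTORY = 2     # assumption: the pattern records the last two outcomes

    def __init__(self):
        self.vht = {}   # pc -> {"values": [...], "lru": [...], "pattern": (...)}
        self.pht = {}   # pattern -> four up-down counters

    def _entry(self, pc):
        return self.vht.setdefault(pc, {"values": [], "lru": [], "pattern": ()})

    def predict(self, pc):
        e = self._entry(pc)
        counters = self.pht.get(e["pattern"])
        if counters is None:
            return None
        best = max(range(4), key=lambda i: counters[i])
        if counters[best] >= self.THRESHOLD and best < len(e["values"]):
            return e["values"][best]
        return None

    def update(self, pc, actual):
        e = self._entry(pc)
        if actual in e["values"]:
            slot = e["values"].index(actual)
        elif len(e["values"]) < 4:
            e["values"].append(actual)
            slot = len(e["values"]) - 1
        else:
            slot = e["lru"][0]               # replace the least recently seen value
            e["values"][slot] = actual
        if slot in e["lru"]:
            e["lru"].remove(slot)
        e["lru"].append(slot)                # slot is now the most recently seen
        counters = self.pht.setdefault(e["pattern"], [0, 0, 0, 0])
        for i in range(4):
            if i == slot:
                counters[i] = min(counters[i] + 1, self.MAX_COUNT)
            else:
                counters[i] = max(counters[i] - 1, 0)
        # Shift the observed slot index into the value history pattern.
        e["pattern"] = (e["pattern"] + (slot,))[-self.HISTORY:]
```

Under these assumptions, an alternating sequence such as 1, 2, 1, 2 becomes predictable after a few repetitions, although neither a last value nor a stride predictor can capture it.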
Dynamic Classification
The classification is done dynamically with the help of the classification table. Fig. 1 shows the process of dynamic classification of our hybrid predictor. When the value predictor encounters an instruction for the first time, it forwards the instruction to the classification table (see Fig. 2b). The Initial state changes to the Transient state. The result value (value1) is stored in the value field and the class is set to Unknown type. Next, when the instruction is in the Transient state, a stride value (stride1) is calculated by subtracting the old value (value1) in the value field from the new result value (value2) of the instruction. The value2 and the stride1 are stored in the value field and the stride field in the classification table, respectively. The state is changed to the Classified state. If an instruction is in the Classified state, another stride (stride2) is calculated. If stride2 is equal to stride1 and its value is zero, the prediction type of the instruction is set to Last type, and the instruction is subsequently predicted by using the Last predictor. If stride2 is equal to stride1 and its value is not zero, the type of the instruction is set to Stride type. If stride1 is not the same as stride2, the instruction is classified as Two-level type. Once the prediction type of the instruction is set, the instruction is removed from the classification table. The value is predicted by using the corresponding prediction table specified in the prediction type. If the value of the confidence counter becomes zero after several mispredictions, the predictor can no longer accurately predict a value for the instruction. In this case, the instruction is removed from the predictor table and the type of prediction is reset to Unknown type. This instruction needs to be reclassified using the classification table.
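The state transitions described above can be summarized in a short sketch. The class and function names, the dictionary representation of an entry, and the example value sequences are hypothetical; the states, the stored value and stride fields, and the three resulting types follow the text. Reclassification after the confidence counter reaches zero is not modeled.

```python
class ClassificationTable:
    def __init__(self):
        self.entries = {}   # pc -> {"state", "value", "stride"}

    def observe(self, pc, result):
        """Feed one result; return the prediction type once set, else 'Unknown'."""
        e = self.entries.get(pc)
        if e is None:
            # Initial -> Transient: remember value1, type stays Unknown.
            self.entries[pc] = {"state": "Transient", "value": result, "stride": None}
            return "Unknown"
        if e["state"] == "Transient":
            # Transient -> Classified: store value2 and stride1.
            e["stride"] = result - e["value"]
            e["value"] = result
            e["state"] = "Classified"
            return "Unknown"
        # Classified state: compare stride2 with stride1.
        stride2 = result - e["value"]
        if stride2 != e["stride"]:
            kind = "Two-level"
        elif stride2 == 0:
            kind = "Last"
        else:
            kind = "Stride"
        del self.entries[pc]    # a typed instruction leaves the classification table
        return kind


def classify(values, pc=0x100):
    """Run a sequence of results for one instruction through a fresh table."""
    table = ClassificationTable()
    kind = "Unknown"
    for v in values:
        kind = table.observe(pc, v)
    return kind
```

Three results are enough to set a type in this sketch, which matches the single value and single stride kept per entry.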
There are several differences between our proposed hybrid predictor and the one used in Rychlik et al. [27] . First, for a classified type, our predictor does not need another table because the type is stored in the trace cache (see Section 4.1). Second, to classify an instruction, Rychlik et al. store three distinct values in the classification table, but we only store one value and one stride. Finally, our scheme always tries to predict the value, whereas, in their scheme, when the instruction is removed from the FCM predictor, the instruction is no longer predicted. 
PROPOSED SCHEME
In this section, we describe the needed changes to the microarchitecture for our proposed scheme and a hybrid value predictor using a dynamic classification mechanism. Fig. 2a shows the main components of our proposed scheme in conjunction with a trace cache. Fig. 2b shows a hybrid value predictor with dynamic classification. A value prediction buffer (VPB) is associated with the trace cache. The VPB has the same organization as the trace cache, i.e., each instruction in a trace cache block has a corresponding entry in the corresponding VPB block.
Microarchitecture
Each VPB entry contains a copy of the predicted value of the corresponding instruction in the trace cache block with its prediction type and a valid bit. Augmenting the trace cache with VPB provides a very simple mechanism to access multiple predicted values without the complexity of having to access both the branch prediction tables and the value prediction tables in the instruction fetch stage, as in other proposed schemes. It also avoids potential bank conflicts among the multiple accesses to the value prediction table. As opposed to the previous schemes, the actual value prediction is moved from the instruction fetch stage to the writeback stage (see Section 4.2). We use a hybrid predictor with a dynamic classification scheme (Section 3). It contains a last value predictor, a stride predictor, a two-level predictor, and a classification table for dynamic classification. We assume that each predictor table has limited read ports and write ports. A value predictor may predict either source operands or the results of the instructions [17]. Since our proposed scheme moves the value prediction out of the instruction fetch stage, it is easier to implement schemes based on the result prediction. The rest of the paper thus assumes the result prediction is used.
As there could be up to 16 instructions accessing and updating the prediction tables and we assume, realistically, that each prediction table has limited read ports and write ports, a queue is provided in front of each prediction table (see Fig. 2b ). We fix each queue size to 32. Because the predicted values are only advisory and the chance of a queue being full is small if the queue size is chosen properly, to simplify the hardware design, we assume that further accesses to a prediction table are dropped when its queue becomes full. The predicted values in VPB may become stale when this happens.
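The queueing policy described above can be modeled in a few lines. The class name, the method names, and the assumption that each port serves one queued request per cycle are introduced here for illustration only; the queue size of 32 and the policy of dropping requests when the queue is full come from the text.

```python
from collections import deque


class PredictorQueue:
    def __init__(self, capacity=32, ports=1):
        self.capacity = capacity    # queue size used in the text
        self.ports = ports          # assumption: one request served per port per cycle
        self.pending = deque()
        self.dropped = 0

    def enqueue(self, request):
        """Accept a request unless the queue is full; full queues drop it."""
        if len(self.pending) >= self.capacity:
            self.dropped += 1       # the VPB entry for this request may go stale
            return False
        self.pending.append(request)
        return True

    def drain_one_cycle(self):
        """Serve at most `ports` requests in one cycle, oldest first."""
        served = []
        for _ in range(min(self.ports, len(self.pending))):
            served.append(self.pending.popleft())
        return served
```

Counting `dropped` in such a model gives a direct measure of how often predicted values in the VPB could become stale for a given queue size and port count.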
According to the prediction type, which is determined by the classification table and is stored in the VPB entry, a request is sent to update the prediction table with the result that its instruction just produced. It also produces a predicted value which is collected to make the trace block and then sent to the corresponding VPB entry after the prediction table lookup. These updates continue to be delivered to the VPB even if the trace fetch is stalled due to a trace cache miss, a branch misprediction, or an instruction cache miss.
Instruction Pipeline Design
There are many ways to implement an instruction pipeline which incorporates the value prediction scheme proposed above. It is beyond the scope of this paper to come up with an optimal design. However, to facilitate our simulation study on the performance of the proposed scheme, we assume the instruction pipeline for the proposed architecture as follows.
In the instruction fetch stage, the predicted result values from the VPB are read at the same time as the instructions are fetched from the trace cache. These predicted result values are inserted into the reservation stations of the data-dependent instructions through the fetch buffer (see Fig. 2a) after the instructions are decoded in the dispatch stage. When all the operands of an instruction are available in the reservation station (regardless of whether the operand values are predicted or not) and there is an available functional unit, the instruction is scheduled and executed. As in most value prediction schemes which require such speculative execution, the reorder buffer is used to keep both speculative and nonspeculative results. After the actual results are produced, the value prediction is verified in the writeback stage. When comparing the actual results with the predicted values, if the prediction is correct, the execution is continued without any interruption. If the prediction is incorrect, the execution of the data-dependent instructions which have already been issued will be invalidated and reissued with the correct values. The actual results and the correctness of the prediction are inserted into the queues according to their prediction types. They will then be used to update the value prediction tables and to obtain the next predicted values.
When a predictor gets a request from the queue, it will process the request similarly to other value prediction schemes. The predictor updates the state and the confidence counter of the corresponding entry in the prediction table according to the produced result. If the previous prediction is incorrect, it may need to store the new value in the table according to the prediction type. As soon as it updates the prediction table, it will predict the next value according to the prediction type and send the result to VPB.
Selective Prediction
Not every instruction in the trace block needs a predicted value. Only the instructions which have data-dependent instructions need a predicted value to break the data dependences. Integer and ALU operations (we assume they have one clock cycle latency) need predicted values to break the data dependences among instructions in the same trace block so they can be issued in the same cycle. In the case of multiple-cycle instructions (such as load instructions), data-dependent instructions issued simultaneously or in later cycles will need their predicted values. The fill unit of the trace cache is in charge of constructing trace blocks from the executed instructions. It is usually not in the critical path of the instruction execution. It can easily mark the instructions which require value prediction. Because load instructions are multiple-cycle instructions which most likely will have data-dependent instructions in the same or later trace blocks, we always predict their result values. Fig. 3 shows an example of selective prediction (we assume eight instructions in a trace cache block and all the instructions are single-cycle ALU instructions). In this example, among eight instructions in a trace cache block, three instructions (i, j, m) have data-dependent instructions. Only the result values of these instructions need to be predicted to issue their data-dependent instructions k, l, n, o. Using this selective prediction, we can reduce the number of accesses to the value prediction tables by 19 percent on average. It allows the predictor resources to be used more efficiently and effectively.
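The marking step performed by the fill unit can be illustrated with the sketch below. The tuple encoding of an instruction, the function name, and the register assignments in the example block are invented for illustration; they only mirror the dependence pattern described for Fig. 3 (i, j, and m have dependents k, l, n, and o) and do not reproduce the figure itself.

```python
def mark_for_prediction(block):
    """Return the names of instructions whose results need a predicted value.

    Each instruction is (name, dest_register, source_registers, is_load).
    """
    marked = []
    for pos, (name, dest, srcs, is_load) in enumerate(block):
        needs = is_load                      # loads are always predicted
        for _, later_dest, later_srcs, _ in block[pos + 1:]:
            if dest in later_srcs:
                needs = True                 # a later instruction in the block reads it
                break
            if later_dest == dest:
                break                        # overwritten before any later read
        if needs:
            marked.append(name)
    return marked


# Hypothetical eight-instruction trace block of single-cycle ALU operations.
# Registers r20 and above stand for values produced outside the block.
example_block = [
    ("i", "r1", ["r20"], False),
    ("j", "r2", ["r21"], False),
    ("k", "r3", ["r1"], False),
    ("l", "r4", ["r2"], False),
    ("m", "r5", ["r22"], False),
    ("n", "r6", ["r5"], False),
    ("o", "r7", ["r5"], False),
    ("p", "r8", ["r23"], False),
]

marked = mark_for_prediction(example_block)
```

In this example only three of the eight results would generate requests to the value prediction tables; the other five instructions never access them.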
EVALUATION METHODOLOGY
In this section, we describe our performance evaluation methodology. First, we explain the machine model used in our simulation study. Then, we describe the simulation set up and the benchmark programs used.
Machine Model
The instruction fetch and issue width is either 8 or 16. The baseline processor is an out-of-order superscalar processor using the register update unit (RUU) [1], [12] with the parameters shown in Table 1. The RUU is the unit which combines the instruction window for instruction issue and the reorder buffer for instruction commit. The RUU has 256 entries and the load-store queue has 64 entries.
To support the fetch width needed for this configuration, the processor uses a 4-way set associative trace cache with 2,048 sets (i.e., a capacity of 512KB with 32-bit instructions), as in [19] , and a hybrid branch predictor. The trace cache has 16 instructions per block. We use a trace selection mechanism which stops at a maximum of 16 instructions or at any indirect jump or return instruction. There is no limit on the number of conditional branch instructions in each trace cache block [24] . We use a large 512KB instruction cache to keep the bandwidth demand for value prediction high. The hybrid branch predictor consists of a 16K-entry gshare predictor [3] with 14 history bits and a 16K-entry bimodal predictor [18] . This hybrid branch predictor is accessed multiple times in one clock cycle to obtain multiple branch prediction results (instead of using a more complicated true multiple-branch prediction [11] , [19] ).
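The stated capacities follow directly from the parameters above, as the short check below shows. The variable names are introduced for the example; the only assumption beyond the text is that a 32-bit instruction occupies four bytes and that only instruction storage is counted.

```python
# Trace cache: 4 ways x 2,048 sets, 16 instructions per block, 32-bit instructions.
ways = 4
sets = 2048
instructions_per_block = 16
bytes_per_instruction = 4   # 32-bit instructions

trace_cache_bytes = ways * sets * instructions_per_block * bytes_per_instruction
trace_cache_kb = trace_cache_bytes // 1024

# gshare: 14 history bits index a table of 2**14 entries.
gshare_entries = 2 ** 14
```

The result is 512KB of instruction storage for the trace cache, and the 14 history bits address exactly the 16K entries of the gshare table.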
The parameters used in the value predictor are listed in Table 2 . The conventional hybrid predictor is the one which combines a stride predictor and a two-level value predictor proposed by Wang and Franklin in [32] . The value predictor in our proposed scheme has a 4K-entry Last 1K-entry Classification table. These table sizes are derived from the dynamic classification rates of the predicted instructions (see Section 6.5). The overall table size of our predictor is roughly 7 percent larger than the size of an 8K-entry conventional hybrid predictor when we count the total number of bits needed.
Evaluation Method
In this study, we include our proposed value predictor and the trace cache on the sim-outorder simulator in the SimpleScalar tool set 3.0 [1] . The SPECint95 and SPEC2000 benchmark suites used in our simulation are listed in Table 3 . The table shows the input, the number of instructions executed, the percentage of register-write instructions that need value prediction, the IPC of the baseline architectures, and the trace cache miss rates for each program. A few of the input sets for the SPEC95 programs were modified slightly to control the simulation time. The input sets for the SPEC2000 programs were also carefully modified to reduce the simulation time while maintaining similar characteristics that the programs demonstrate when they are executed using the original input sets [13] .
There are two ways to compute the trace cache miss rate: the correct-path miss rate and the all-path miss rate [20]. The correct-path miss rate is computed only along the correctly speculated paths. The all-path miss rate is computed along all execution paths, including both correctly and incorrectly speculated paths. In general, the correct-path miss rate is lower than the all-path miss rate. In this paper, we use the all-path miss rate.
Some of the miss rates in our study appear to be higher than those in [19], [26]. One reason is the multiple-branch prediction scheme we use. We implemented multiple-branch prediction by accessing the branch predictors sequentially multiple times, whereas [19], [26] use the path-based next trace prediction or a true multiple-branch predictor [11], [19]. In the case of gcc and go, the branch misprediction rate using the path-based next trace prediction is less than half of the misprediction rate of the sequential multiple-branch prediction scheme [11]. Hence, their trace cache miss rate is lower. As our study is focused on the bandwidth of the value predictor, not on the trace cache performance, we use a simpler sequential multiple-branch prediction scheme in our simulation.
In order to show the effectiveness of our proposed value prediction scheme, we compare the performance of the following configurations:
Base: Baseline processor architectures (as shown in Table 1)
Conv: Base + conventional hybrid value predictor (as shown in Table 2)
Band: Base + our proposed hybrid value predictor with dynamic classification
The access to the prediction tables in Conv and Band is limited to one read/write port unless noted. The instruction pipeline of our machine model is based on SimpleScalar with fetch, dispatch, issue, execution, writeback, and commit stages [1]. The value predictor updates the prediction tables with the computed value after the writeback stage. In the SimpleScalar machine model used in this study, the verification of predicted values is done in the writeback stage. Hence, there is no need for an additional stage in the pipeline to verify and update the predicted values. As a result, there is no compulsory penalty for value mispredictions [28], [29]. The only penalty caused by wrongly predicted values is a potential structural hazard when reissued instructions compete for limited hardware resources with other instructions, such as the instruction window and functional units. However, it has been shown that these structural hazards tend to have a small effect on performance [29]. Based on these considerations, we use a misprediction penalty of one cycle [16], [21].
PERFORMANCE ANALYSIS
In this section, we first examine the performance gain of the predictors with one read/write port. We then evaluate the impact of varying the number of read/write ports. We also investigate the performance of the hybrid predictors, the scope of selective prediction, the characteristics of the dynamic classification, queue size, and the performance impact of the predictor table size and trace cache size.
Speedup
Fig. 4 shows the speedup of Conv and Band with one read/write port over Base. For an 8-issue processor, the average speedups are 4.3 percent for Conv and 7.0 percent for Band and, for a 16-issue processor, the average speedups are 3.8 percent for Conv and 9.0 percent for Band. In Conv, instructions may conflict with each other when they are trying to access the prediction table simultaneously. As a result, those instructions which are not granted access to the table cannot obtain predicted values. This will decrease the overall performance. For a 16-issue processor, the speedups of Band could be much higher than those of Conv because the pressure on the bandwidth of the prediction table is higher in a 16-issue processor than in an 8-issue. Band produces a particularly large increase in performance compared to Conv for compress, li, m88ksim, and mcf. One of the main reasons is that the trace cache miss rates of these programs are at or below 10 percent (see Table 3). Band requires trace cache hits to obtain the predicted values from VPB.
Sensitivity to the Number of Read/Write Ports
Fig. 5 shows the speedup over the baseline architecture for Conv as the number of read/write ports of the prediction tables is varied from unlimited to 1. We can see that the performance of Conv is very sensitive to the number of read/write ports. The speedup decreases consistently when the number of read/write ports is decreased for all the benchmark programs except parser on the 8-issue Conv-2. The amount of the performance degradation ranges from 6.9 percent for the 8-issue to 10.7 percent for the 16-issue when the number of read/write ports changes from unlimited to one port. Sixteen-issue processors have larger performance degradation due to their higher bandwidth requirement on value predictors.
Fig. 6 shows the speedup over the baseline architecture for Band as the number of read/write ports of the prediction tables is varied from unlimited to 1. From Fig. 6, we can see that the performance of Band is not as sensitive to the number of read/write ports. The variation in speedup is around 0.2 percent on average for the 8-issue and 0.4 percent for the 16-issue. This is because our proposed scheme can access all the prediction values from the copy stored in the value prediction buffer in the instruction fetch stage. The value prediction tables are accessed only during their updates and the actual value prediction in the writeback stage. Those updates and value predictions are distributed among different value predictors with much reduced traffic (see Section 6.5). Some anomalies are observed; for example, the speedup of 8-issue Band-4 is better than Band-U for li and 16-issue Band-1 is better than the other configurations for compress. There are complex reasons for these anomalies, such as misprediction rate, update delay, trace cache miss, and queue overflow [15], [28].
Performance of the Hybrid Value Predictors
In order to evaluate the performance of the hybrid value predictors, we measure the correct prediction rate, the misprediction rate, and the prediction coverage. The correct prediction rate is the percentage of instructions which predict the values correctly out of all instructions which perform value predictions (i.e., all register-write instructions). The misprediction rate is the percentage of instructions which predict the wrong values out of all instructions which perform value predictions. Prediction coverage is defined as the percentage of instructions which attempt value prediction among all instructions with register writes. Therefore, prediction coverage is the sum of the correct prediction rate and the misprediction rate. The prediction coverage does not add up to 100 percent because some register-write instructions do not attempt to predict values (e.g., those with confidence counter value below the threshold and those with trace cache misses). The correct prediction rate has the greatest impact on the overall performance because correct predictions break true data dependences and expose more instruction-level parallelism. Not all correct predictions actually produce a performance improvement, though. For instance, correctly predicting a value for an instruction which has no pending dependent instructions in the instruction window does not produce any performance improvement. As a result, there is not necessarily a proportional relationship between the speedup and the correct prediction rate. Mispredictions, on the other hand, have an adverse impact on performance since instructions that have been previously issued with an incorrectly predicted value must be squashed and reissued with a correct value. Fig. 7 shows the correct prediction rates and the misprediction rates for Conv and Band.
We use the prediction tables with unlimited ports to see the performance difference of the two hybrid value predictors without the effect of nonprediction and nonupdate due to the limitation of read/write ports.
The average correct prediction rates are 42 percent for Conv and 33 percent for Band. The misprediction rates are 9 percent for Conv and 8 percent for Band. Therefore, the average prediction coverage is 51 percent for Conv and 41 percent for Band. The prediction coverage of Band is lower than that of Conv because Band cannot predict a value when there is a trace cache miss or a queue overflow. Also, in Band, instructions which have no data-dependent instructions in the same trace cache block do not predict values (see Fig. 9). Another reason is that the dynamic classification mechanism needs time to classify the prediction type. During the classification phase, it cannot predict values.
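The relationship between the three metrics can be checked with a small sketch. The class and field names below are illustrative assumptions, and the instruction counts are made-up values chosen only to reproduce the reported average rates; they are not measured data from the simulations.

```python
from dataclasses import dataclass


@dataclass
class PredictionCounts:
    """Hypothetical per-predictor counters (names are assumptions)."""
    register_writes: int  # all register-write instructions (common denominator)
    correct: int          # instructions whose predicted value was correct
    mispredicted: int     # instructions whose predicted value was wrong

    def correct_rate(self) -> float:
        return 100.0 * self.correct / self.register_writes

    def misprediction_rate(self) -> float:
        return 100.0 * self.mispredicted / self.register_writes

    def coverage(self) -> float:
        # Coverage counts every instruction that attempted a prediction,
        # i.e., correct predictions plus mispredictions.
        return self.correct_rate() + self.misprediction_rate()


# Illustrative counts scaled to match the reported averages.
conv = PredictionCounts(register_writes=1000, correct=420, mispredicted=90)
band = PredictionCounts(register_writes=1000, correct=330, mispredicted=80)

print(conv.coverage(), band.coverage())
```

With these counts, the coverage evaluates to 51 percent for Conv and 41 percent for Band, matching the sums of the rates given in the text.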
Prediction Table Lookups
In Section 6.1, we saw that the performance of Conv decreases as the number of read/write ports of the prediction table is reduced. Fig. 8 shows the percentage of prediction table lookups among the predictable instructions (i.e., register-write instructions) during the instruction fetch for Conv with limited ports. It shows that the number of prediction table lookups drops sharply as the number of read/write ports is reduced. Instructions that cannot look up the tables do not obtain predicted values.
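The effect of port limits on lookups can be illustrated with a minimal sketch. It assumes a simplified model in which each cycle issues some number of lookup requests and at most `ports` of them are served, with the rest going unpredicted. The function name and the per-cycle demand values are hypothetical and are not taken from the simulator.

```python
from typing import Optional, Sequence


def lookup_rate(demand_per_cycle: Sequence[int], ports: Optional[int]) -> float:
    """Percentage of requested lookups that are served.

    ports=None models an unlimited number of read/write ports.
    """
    requested = sum(demand_per_cycle)
    if requested == 0:
        return 100.0
    if ports is None:
        served = requested
    else:
        # At most `ports` lookups can be served in any one cycle.
        served = sum(min(d, ports) for d in demand_per_cycle)
    return 100.0 * served / requested


# Hypothetical number of register-write instructions fetched in each cycle.
demand = [6, 2, 8, 0, 4]

for ports in (None, 4, 2, 1):
    print(ports, lookup_rate(demand, ports))
```

For this made-up demand pattern, the served fraction falls from 100 percent with unlimited ports to 70, 40, and 20 percent with four, two, and one port, respectively.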
In our proposed scheme, not all register-write instructions need to look up the prediction tables. Only the load instructions (and multicycle latency instructions) and the instructions that have data-dependent instructions in the same trace cache block need value prediction. Fig. 9 shows the prediction table lookup rates among all register-write instructions. The percentage of those instructions that require value prediction table lookup is 78 percent on average. The percentage of register-write instructions that have no data-dependent instruction in the same trace cache block is 19 percent on average. This allows us to eliminate a further 19 percent of the register-write instructions from accessing the tables (in addition to non-register-write instructions, which are also eliminated). The performance of our scheme is closely tied to the miss rate of the trace cache because we cannot access the VPB when there is a trace cache miss. The average percentage of nonlookups due to trace cache misses is 2 percent. The percentage of missing value predictions due to such nonlookups is below 1 percent for most programs. However, programs such as gcc, go, vortex, and vpr have about 5 percent nonlookups because their trace cache miss rates are relatively high.
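The selection rule described above can be sketched as a filter over one trace cache block. The `Instr` record, the register names, and the rule that a later write to the same register ends the search for consumers are assumptions made for illustration; the hardware mechanism itself is not specified at this level of detail here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    """Hypothetical instruction record (field names are assumptions)."""
    dest: Optional[str]          # destination register, None if no register write
    srcs: Tuple[str, ...] = ()   # source registers
    long_latency: bool = False   # load or multicycle latency instruction


def needs_lookup(block: List[Instr]) -> List[bool]:
    """Mark the instructions in one trace cache block that need value prediction."""
    result = []
    for i, ins in enumerate(block):
        if ins.dest is None:
            result.append(False)   # not a register-write instruction
            continue
        if ins.long_latency:
            result.append(True)    # loads and multicycle instructions always qualify
            continue
        has_consumer = False
        for later in block[i + 1:]:
            if ins.dest in later.srcs:
                has_consumer = True   # true data dependence inside the block
                break
            if later.dest == ins.dest:
                break                 # register overwritten before any use (assumed rule)
        result.append(has_consumer)
    return result


block = [
    Instr("r1", ("r2",), long_latency=True),  # load
    Instr("r3", ("r1",)),                     # consumed by the next instruction
    Instr("r4", ("r3",)),                     # no consumer in this block
    Instr(None, ("r5",)),                     # no register write
    Instr("r6", ()),                          # overwritten before use
    Instr("r6", ("r7",)),                     # consumed by the next instruction
    Instr("r8", ("r6",)),                     # no consumer in this block
]
print(needs_lookup(block))
```

Only the instructions marked `True` would access the value prediction tables in this model; the rest are filtered out before they consume any table bandwidth.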
Dynamic Classification
Fig. 12 shows the distribution of the prediction types in the dynamic instruction stream using the proposed predictor. The left bar of each program (Band) shows the dynamic classification types for the proposed hybrid value predictor with one read/write port. The right bar (Band-Ideal) assumes unlimited read/write ports, queues, and trace cache size. Band-Ideal also assumes prediction tables and The percentage of the overflow is rather high in this case because the prediction table has only one read/write port. Each prediction table can only remove one access request from each queue in each cycle. As a result, access requests stay longer in the queue, especially in the queue for Two-level (see Fig. 13). This produces a high overflow rate.
In Band-Ideal, on average, 35 percent of the instructions are classified as Two-level. The percentages of the instructions classified as Last and Stride are 26 percent and 13 percent, respectively. Twenty-six percent of the instructions are classified as Unknown. The percentage of the Stride type differs considerably between Band and Band-Ideal (especially for compress, ijpeg, m88ksim, and mcf) because the delayed updates caused by the long queue leave more stale values in the VPB and the prediction tables. Hence, a Stride type could be misclassified as a Two-level or an Unknown type after several mispredictions. The performance degradation due to such stale values can be reduced by using techniques such as speculative update [15], [21].
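The interaction between a single read/write port and queue overflow can be illustrated with a toy model of one request queue. The function name, the queue capacity, and the bursty arrival pattern are all assumptions chosen for illustration; they do not reflect the simulated configuration.

```python
from typing import Sequence, Tuple


def simulate_queue(arrivals: Sequence[int], capacity: int, ports: int = 1) -> Tuple[int, float]:
    """Toy model of one prediction-table request queue.

    Each cycle, new requests arrive; those that do not fit are counted as
    overflows. The table then removes at most `ports` requests from the queue.
    Returns (overflowed requests, mean end-of-cycle occupancy).
    """
    occupancy = 0
    overflowed = 0
    occupancy_sum = 0
    for n in arrivals:
        accepted = min(n, capacity - occupancy)
        overflowed += n - accepted           # requests dropped because the queue is full
        occupancy += accepted
        occupancy -= min(ports, occupancy)   # the table drains up to `ports` requests
        occupancy_sum += occupancy
    mean = occupancy_sum / len(arrivals) if arrivals else 0.0
    return overflowed, mean


# Hypothetical bursty arrivals: four requests every third cycle.
arrivals = [4, 0, 0, 4, 0, 0, 4, 0]

print(simulate_queue(arrivals, capacity=4, ports=1))
print(simulate_queue(arrivals, capacity=4, ports=4))
```

In this toy pattern, one port leaves requests waiting across bursts, so two of the twelve requests overflow and the mean occupancy is 2.125, whereas four ports drain each burst in the cycle it arrives and nothing overflows. Requests that wait longer also delay the corresponding table updates, which is the source of the stale values discussed above.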
Queue Size
In the 16-issue configuration with one read/write port, the average queue sizes are 7.0 for Last, 0.18 for Stride, 18.6 for Two-level, and 2.5 for Classification (Fig. 13). The average queue size for Two-level is high. As mentioned in Section 6.5, this is due to limited read/write ports of the prediction table. The queue size for Last in m88ksim is high because the instructions with Last type tend to cluster together in a short time period.
The Performance Impact of Predictor Table Size and Trace Cache Size
It is quite obvious that the speedup decreases for most programs when the size of the prediction tables decreases. Fig. 15 shows the performance of the trace cache with 128KB, 256KB, and 512KB, respectively, for Conv and Band in the 16-issue configuration. As expected, the speedup decreases for most programs when the size of the trace cache decreases because our scheme is more sensitive to the trace cache miss rate as described earlier.
CONCLUSIONS
In this paper, we examine the bandwidth issues of value prediction on wide-issue processor architectures. In particular, we focus on processors with a trace cache because such architectures incorporate multiple branch predictions and speculative execution in a very effective way. We propose augmenting the trace cache with a copy of the predicted values for data-dependent instructions. This allows predicted values to be accessed easily and avoids the problem of having to access a centralized value prediction table. Actual accesses to the value prediction tables are moved from the instruction fetch stage to a later stage, such as the writeback stage. One significant benefit of such a move is that the actual data-dependent instructions are known in the later stages; hence, only those instructions that need value prediction access the value prediction tables. A significant reduction in the bandwidth requirement of the value prediction tables can thus be achieved. We use a hybrid value predictor with a dynamic classification scheme which distributes predictor updates to several behavior-specific tables. This further reduces the bandwidth requirement of each individual value prediction table. Such a scheme is inherently more effective than arbitrary banking schemes on a centralized value prediction table. The trace cache fits nicely with such dynamic classification schemes because relevant information can be stored in the trace cache. Our simulation results show that our scheme consistently outperforms the conventional hybrid predictor when a realistic number of read/write ports to the value prediction tables is used.
[32] K. Wang and M. Franklin, "Highly Accurate Data Value Predictions Using Hybrid Predictor," Proc.
