Abstract-Executing multiple threads has proved to be an effective solution to partially hide latencies that appear in a processor. When a thread is stalled because a long-latency operation is being processed, like a memory access or a floating-point calculation, the processor can switch to another context so that another thread can take advantage of the idle resources. However, fetch stall conditions caused by a branch predictor delay are not hidden by current SMT fetch designs, causing a performance drop due to the absence of instructions to execute.
I. INTRODUCTION

EXPLOITING instruction level parallelism (ILP) and thread level parallelism (TLP) is a commonly accepted technique to achieve high performance in current computer systems. Superscalar processors take advantage of ILP by executing multiple instructions of a single program during each cycle. To achieve this, accurate branch prediction mechanisms should be used to feed the execution engine with enough instructions, mainly from correct paths. Simultaneous multithreaded processors (SMT) [1], [2] go one step further by exploiting TLP, i.e. executing multiple instructions from multiple programs (threads) during each cycle. Multithreading adds pressure to the branch predictor, increasing the total number of predictor accesses per cycle.
Branch predictors consist of one or more large tables that store prediction information. Each cycle, they must provide a prediction in order to keep on fetching instructions to feed the execution core. However, as feature sizes shrink and wire delays increase, it becomes infeasible to access large memory structures in a single cycle [3], [4]. This implies that only small branch prediction tables can generate a prediction every cycle. However, the low accuracy of small branch predictors degrades processor performance. This forces superscalar processors to use big and accurate branch predictors in combination with mechanisms for tolerating their access latency [5]-[8].
Branch predictor latency also affects SMT considerably. The problem gets even worse because there are several threads trying to access the shared branch predictor. If the branch predictor is unable to give a response in one cycle, multiple threads could be stalled at the fetch stage for several cycles.
In this paper, we evaluate the impact of the branch predictor latency in the context of SMT processors. We show that the increased access latency of branch predictors degrades the overall processor performance. However, reducing the predictor size is not a solution, since the lower latency does not compensate for the lower accuracy of a smaller predictor. We also evaluate the effect of varying the number of branch predictor access ports. Since the SMT model evaluated in this paper can fetch from two different threads each cycle, the branch predictor needs two access ports. Using a single port yields a lower access latency, but the processor performance is degraded because there are not enough predictions to feed the fetch engine. On the other hand, using four ports allows more predictions to be generated, but the increased latency also reduces performance. In this context, mechanisms for tolerating the branch predictor access latency can provide a worthwhile performance improvement.
We show that decoupling the branch predictor from the instruction cache access, as proposed in [6], is helpful for tolerating the branch predictor delay on SMT. Our SMT decoupled fetch model uses per-thread fetch target queues (FTQ) to decouple the branch predictor. Branch predictions are stored in the FTQs, and later used to drive the instruction cache while new predictions are being generated. We also evaluate some prediction policies, aimed at obtaining a better utilization of the contents of each FTQ. These techniques increase the latency tolerance of the branch predictors, being especially useful for hiding the high latency of a 4-port branch predictor. The decoupled scheme, combined with the ability to predict from four different threads, allows a 4-port branch predictor to achieve a performance similar to an ideal 1-cycle latency branch predictor.
Finally, we propose an inter-thread pipelined branch predictor design. A pipelined branch predictor can provide a prediction each cycle in spite of its access latency. However, in order to achieve accurate branch prediction, the information generated by the previous prediction should be used to generate a new prediction. This forces the use of in-flight information and recovery mechanisms, like in [9], which increases fetch engine complexity. Our proposal is to interleave prediction requests from different threads each cycle. Although a particular thread must wait for its previous prediction before starting a new one, a different thread can start a new prediction. This allows a branch prediction to be completed each cycle at a relatively low complexity. Using the inter-thread pipelining technique, a 1-port branch predictor can achieve a performance close to a 4-port branch predictor, while requiring 9 times less chip area.
0000-0000/00$00.00 © 2003 IEEE
The remainder of this paper is organized as follows. Section II reviews previous related work. Our experimental methodology is described in Section III. Section IV analyzes the impact of branch predictor delay on SMT processors. Section V shows the effect of varying the number of branch predictor access ports. In Section VI we show a decoupled design of an SMT fetch engine and evaluate some prediction policies. Section VII describes our inter-thread pipelined branch predictor design. Finally, Section VIII presents our concluding remarks.
II. RELATED WORK
The increase in processor clock frequency and the slower wires in modern technologies prevent branch prediction tables from being accessed in a single cycle [3], [7]. In recent years, considerable research effort has been devoted to finding solutions for this problem.
A first approach to overcome the branch predictor access latency is to increase the length of the basic prediction unit. Long prediction units can feed the execution engine with instructions during several cycles, hiding the latency of the following prediction. Since predicting a single basic block per cycle is not enough to achieve this, some multiple branch predictors have been proposed in the literature. The FTB [6] extends the classic concept of BTB, storing fetch blocks which are only ended by strongly biased taken branches. Thus, an FTB prediction ignores biased not taken branches, enlarging the prediction unit. The next stream predictor [10] takes this advantage one step further, ignoring all not taken branches. The use of path correlation allows the stream predictor to provide accurate prediction of large consecutive instruction streams.
The next trace predictor [11] also tries to enlarge the basic prediction unit by using instruction traces. A trace is a fragment of the dynamic instruction flow, potentially containing multiple basic blocks, which are stored in a trace cache [12], [13]. The main advantage of the trace predictor is its ability to predict beyond taken branches. However, the maximum trace size is physically limited by the trace cache implementation. Having longer traces means storing a smaller number of traces in the trace cache, reducing its potential performance. In practice, instruction streams are longer than traces [10], being more tolerant to access latency at a lower cost and complexity.
Decoupling branch prediction from the instruction cache access [6] is a helpful mechanism to take advantage of large basic prediction units. The branch prediction mechanism generates requests which are stored in a fetch target queue (FTQ) and used to drive the instruction cache. When fetching instructions is not possible due to instruction cache misses or resource limitations, the branch predictor can still make new predictions, which are stored in the FTQ. Therefore, the presence of an FTQ makes it less likely for the fetch engine to stay idle due to the branch predictor access latency.
Prediction overriding [7] is a different approach for tolerating the predictor access latency. This mechanism provides two predictions, a first prediction coming from a fast and small branch predictor, and a second prediction coming from a slower, but more accurate predictor. The first prediction is used while the second one is still being calculated. Once the second prediction is obtained, it overrides the first one if they differ, discarding the wrong speculative work done based on the first prediction. Such a recovery mechanism is complex, because those instructions fetched using the initial prediction should not be squashed if they will be fetched again using the new prediction [6] . A similar mechanism is used in the Alpha EV6 [5] and EV8 [8] processors, where a multi-cycle latency branch predictor overrides a fast and simple line predictor [14] .
Another promising idea to tolerate the branch predictor access latency is pipelining the branch predictor [9], [15]. Using a pipelined predictor, a new prediction can be started each cycle. However, this is not trivial, since the result of a branch prediction is needed to start the next prediction. Therefore, a branch prediction can only use the information available in the cycle it starts, which has a negative impact on prediction accuracy. In-flight information could be taken into account when a prediction is generated, as described in [9], but this also involves an increase in the complexity of the fetch engine.
III. SIMULATION SETUP
We use a trace-driven version of the SMTSIM [16] simulator. Our simulation tool allows wrong path execution by having a separate basic block dictionary which contains information of all static instructions. We have modified the fetch stage of the simulator, by dividing it into two different stages: a prediction and a fetch stage.
The baseline processor configuration is shown in Table I . The third column indicates whether the resource is replicated per thread or shared among all the threads. We provide simulation results obtained by the FTB fetch architecture [17] , using a gskew conditional branch predictor [18] , as well as the stream fetch architecture [10] . Both evaluated fetch architectures use the ICOUNT.2.8 [19] fetch policy (up to 8 instructions from up to 2 threads). ICOUNT gives priority to threads according to the number of instructions in the decode, rename and dispatch stages of the processor, prioritizing threads with the fewest number of active instructions in the pipeline. Table II shows the workloads used in our simulations, which are composed of benchmarks selected from the SPECint2000 benchmark suite. We have chosen benchmarks with a high instruction-level parallelism because our study is focused on the fetch engine architecture. In order to simulate the effect of different numbers of threads in the evaluated mechanisms, we use workloads including 2, 4, 6, and 8 threads. Benchmarks were compiled on a DEC Alpha AXP-21264 using Compaq's C/C++ compiler with '-O2' optimization level. Additionally, code layouts were optimized using the spike tool [20] . We fed spike with profile data obtained executing the train input set. Due to the large simulation time of SPECint2000 benchmarks, we collected traces of the most representative 300 million instruction slice, following the idea presented in [21] . These traces were collected executing the ref input set.
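The ICOUNT.2.8 selection described above can be illustrated with a minimal sketch. It is not the simulator's code: the function name, the `available` input, the tie-break by thread id, and the rule that the first thread takes as many instructions as it can before the second one fills the remainder are all assumptions made for illustration.

```python
def icount_select(in_flight, available, max_threads=2, max_instr=8):
    """Pick the fetch threads for one cycle under an ICOUNT-style policy.

    in_flight: dict thread_id -> instructions currently in the decode,
               rename and dispatch stages.
    available: dict thread_id -> instructions the thread could supply this
               cycle (hypothetical input for this sketch).
    Returns a list of (thread_id, instructions_fetched), highest priority first.
    """
    # Fewest in-flight instructions first; ties broken by thread id (assumed).
    order = sorted(in_flight, key=lambda t: (in_flight[t], t))
    plan, budget = [], max_instr
    for t in order:
        if len(plan) == max_threads or budget == 0:
            break
        n = min(available[t], budget)
        if n > 0:
            plan.append((t, n))
            budget -= n
    return plan
```

For example, with in-flight counts `{0: 12, 1: 3, 2: 7, 3: 20}` and every thread able to supply 8 instructions except thread 1, which can supply 5, the sketch fetches 5 instructions from thread 1 and 3 from thread 2.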
IV. MOTIVATION
This section shows the impact of the branch predictor access latency on the performance of an SMT processor. We explore the tradeoff between having fast branch predictors with a low accuracy, and having more accurate predictors with a larger access time.
A. Technology Constraints and Branch Predictor Delay
We have measured the access time for the branch prediction structures evaluated in this paper using the CACTI 3.0 tool [22] , a detailed wire and transistor structure model of cache memories. We modified CACTI to model tagless branch predictors, and to work with setups expressed in bits instead of bytes.
Data we have obtained corresponds to a 0.10 µm process. For translating the access time from nanoseconds to cycles, we assumed an aggressive 8 fan-out-of-four (FO4) delays clock period, that is, a 3.47 GHz clock frequency as reported in [3]. It is claimed in [23] that 8 FO4 delays is the optimal clock period for integer benchmarks in a high performance processor implemented in 0.10 µm technology. Figure 1 shows the prediction table access time obtained using CACTI. We have measured the access time for 2-bit counter tables ranging from 8 to 64K entries. We have also measured the access time for an FTB and a stream predictor ranging from 32 to 4K entry tables. These tables are assumed to be 4-way associative because direct mapped tables provide a poor performance. Besides, 2-way associative tables do not require a lower number of cycles to be accessed than 4-way associative ones, but they have a lower accuracy. Data in Figure 1 is presented for branch predictors using 1, 2, and 4 access ports.
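The conversion from nanoseconds to cycles can be sketched as follows. The sketch assumes the common rule of thumb that one FO4 delay is roughly 360 ps per micron of feature size; this constant is not stated in the text, but it reproduces the 3.47 GHz figure quoted above. The function names and the example access time are hypothetical.

```python
import math

# Assumed rule of thumb (not from the text): FO4 delay ~ 360 ps per um.
FO4_PS_PER_UM = 360.0

def clock_period_ps(feature_um=0.10, fo4_per_cycle=8):
    """Clock period in picoseconds for a given FO4 budget per cycle."""
    return fo4_per_cycle * FO4_PS_PER_UM * feature_um

def access_cycles(access_time_ns, period_ps=None):
    """Whole cycles needed to cover an access time given in nanoseconds."""
    if period_ps is None:
        period_ps = clock_period_ps()
    # The small epsilon keeps an exact multiple of the period from rounding up.
    return math.ceil(access_time_ns * 1000.0 / period_ps - 1e-9)
```

Under these assumptions the clock period is 288 ps, so an illustrative 1.0 ns table access would take 4 cycles.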
This data shows that devoting multiple cycles to access the branch predictor is unavoidable. Even single-cycle conditional branch predictors, composed of a small 2-bit counters table, should use a multi-cycle FTB to predict the target address of branch instructions. On the other hand, although an FTB with a size below the evaluated range could be accessed in a single cycle, its poor prediction accuracy will be more harmful for the processor performance than the increased latency of a larger FTB.
In order to analyze the tradeoff between fast but inaccurate and slow but accurate predictors, we have chosen two setups for the two evaluated branch predictors: a 0.5KB setup and a 32KB setup. The access latency of each predictor depends on the number of access ports used. In addition, we have explored a wide range of history lengths for the gskew predictor, as well as DOLC index configurations [10] for the stream predictor, and selected the best one found for each setup. The four evaluated predictor setups are shown in Table III.
B. Impact of Branch Predictor Delay on SMT Performance
As discussed previously, the branch predictor delay is a key topic in an SMT processor fetch engine design. On the one hand, a fast but small branch predictor causes a high number of mispredictions which degrades processor performance. On the other hand, a big and accurate branch predictor requires multiple cycles to be accessed, which also degrades performance. In this section we explore this tradeoff. Figure 2 shows the prediction accuracy of the four evaluated branch predictor setups. Clearly, small 0.5KB predictors provide the worst prediction accuracy. The exception is the 2-thread workload. It contains two benchmarks (gzip and bzip2) with few static branches, which allows a small branch predictor to predict branches accurately. The larger number of total static branches in the rest of workloads increases aliasing, degrading the accuracy of the 0.5KB predictors. Moreover, the larger number of threads also increases aliasing, degrading even more the accuracy of these predictors. Nevertheless, the 32KB predictors provide an accuracy over 95% for all the evaluated workloads due to their larger size.
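The aliasing effect mentioned above can be illustrated with a minimal sketch. It assumes a tagless table indexed only by the low-order bits of the word-aligned branch address; the evaluated predictors also use history or path information in their index, so this is a simplification, and the function name is hypothetical.

```python
from collections import Counter

def aliased_branches(branch_pcs, table_entries):
    """Count static branches that share a table entry with another branch.

    Assumption: the table is tagless and indexed by (pc >> 2) modulo the
    number of entries, i.e. by low-order bits of a 4-byte aligned address.
    """
    def index(pc):
        return (pc >> 2) % table_entries

    usage = Counter(index(pc) for pc in set(branch_pcs))
    return sum(n for n in usage.values() if n > 1)
```

With the same set of static branches, a smaller table leaves more branches sharing entries, and adding threads adds static branches, which raises the count further.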
According to this data, a 32KB predictor should be used. This predictor requires two access ports in order to generate two predictions each cycle, since the ICOUNT.2.8 fetch policy can fetch from two different threads each cycle. However, such a predictor needs 4 cycles to be accessed, which could degrade processor performance. An alternative is to use a smaller 0.5KB predictor, which is faster but also less accurate. Figure 3 shows a comparison of the performance achieved by ideal 32KB predictor setups, i.e. with 1-cycle latency, and realistic predictor setups: 32KB predictors with 4-cycle latency and 0.5KB predictors with 2-cycle latency.
In the 2-thread workload, the 2-cycle latency 0.5KB predictors provide a performance better than the 4-cycle latency 32KB predictors. This is caused by the good prediction accuracy achieved by the 0.5KB predictors for this workload. In the rest of workloads, its higher speed enables the 0.5KB FTB to achieve a better performance than the 4-cycle 32KB FTB, but the 4-cycle latency 32KB stream predictor achieves a better performance than the smaller one. This happens because streams are longer than FTB fetch blocks [10] , making the stream predictor more latency tolerant. Nevertheless, all the realistic setups achieve a worse performance than the ideal case, which means that there is room for improvement. In the remainder of this paper we describe techniques to tolerate the branch predictor access latency, trying to reach the ideal performance. For the purpose of brevity, we only show data for the 32KB branch predictor setups, since the 0.5KB predictors behave in a similar way.
V. A FIRST APPROACH: VARYING THE NUMBER OF PORTS
A first solution to alleviate the effect of the branch predictor latency is modifying the number of access ports. The branch predictors evaluated in the previous section use two access ports because the fetch policy used by the processor is ICOUNT.2.8 [19] , which can fetch from two different threads each cycle. We assume that these predictions are stored in intermediate buffers and can be used in following cycles, while new predictions are being generated.
Reducing the number of access ports involves a reduction in the predictor access latency. Therefore, it is interesting to explore the tradeoff between the lower latency of a single access port and the loss of the second prediction in each access. On the other hand, increasing the number of ports involves an increment in the access latency. Despite being slower, a 4-port predictor can provide 4 predictions each cycle, helping to hide the access latency. Figure 4 shows the performance achieved by the 32KB predictor setups using 1, 2, and 4 access ports. Data using 2 ports is the same data shown in Figure 3. The first observation in this figure is that reducing the number of ports harms the processor performance. The reduction in the access time does not compensate for losing the ability to make two predictions in each access. The second observation is that increasing the number of access ports achieves a performance similar to the faster 2-port predictors, and even a higher performance in some cases. This is caused by the ability to make four predictions in each access. The exception is the 2-thread workload, where a maximum of two predictions can be generated each cycle.
Nevertheless, there is still room for improvement. The slowdown of the different configurations varying the number of ports against the ideal 2-port 32KB predictors shown in Figure 3 ranges from 19% to 56% using the FTB, and from 9% to 17% using the stream predictor. In the following sections we present additional mechanisms to increase the latency tolerance. We show data for the two 32KB predictor setups using 1, 2, and 4 ports, as well as their corresponding access latency.
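The port tradeoff discussed in this section can be summarized with a simple bound. The sketch below assumes a blocking predictor (a new access starts only when the previous one finishes) in which each access yields at most one prediction per port and at most one prediction per thread; the function name is hypothetical and the bound ignores FTQ or buffer occupancy.

```python
def peak_predictions_per_cycle(ports, latency, threads):
    """Upper bound on sustained predictions per cycle, blocking predictor.

    Each access takes `latency` cycles and produces at most
    min(ports, threads) predictions, one per port and one per thread.
    """
    return min(ports, threads) / latency
```

For the 2-port 32KB predictor with its 4-cycle latency, the bound is 0.5 predictions per cycle with two or more threads. For a 2-thread workload, four ports give the same bound as two ports at equal latency, which is consistent with the exception noted above.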
VI. A DECOUPLED SMT FETCH ENGINE
A solution to alleviate the effect of the branch predictor latency is decoupling the fetch address generation from the fetch address consumption, as proposed in [6] . Our proposal of decoupled SMT fetch engine is shown in Figure 5 . For each thread, a fetch target queue (FTQ) stores predictions done by the branch predictor, which are consumed later by the fetch unit. It is important to note that, while the branch predictor and the fetch unit are shared among all the threads, there is a separate FTQ for each thread. 
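The per-thread FTQ organization can be sketched as a set of bounded queues between a shared producer (the branch predictor) and a shared consumer (the fetch unit). This is a minimal illustration, not the evaluated design: the class and method names are hypothetical, and a prediction is represented simply as a (start address, length) pair.

```python
from collections import deque

class DecoupledFetch:
    """Per-thread fetch target queues between predictor and fetch unit."""

    def __init__(self, num_threads, ftq_entries=4):
        self.ftq_entries = ftq_entries
        self.ftqs = [deque() for _ in range(num_threads)]

    def can_predict(self, thread):
        # The predictor may only fill an FTQ that has a free entry.
        return len(self.ftqs[thread]) < self.ftq_entries

    def push_prediction(self, thread, start_pc, length):
        # Producer side: called when a prediction for `thread` completes.
        if not self.can_predict(thread):
            return False
        self.ftqs[thread].append((start_pc, length))
        return True

    def pop_fetch_request(self, thread):
        # Consumer side: the fetch unit drives the instruction cache with the
        # oldest queued prediction, or gets None when the queue is empty.
        return self.ftqs[thread].popleft() if self.ftqs[thread] else None
```

Because the two sides only communicate through the queues, the predictor can keep filling a thread's FTQ while the fetch unit is stalled, and the fetch unit can keep draining it while the next prediction is still in progress.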
A. Branch Prediction Policies
The throughput obtained by an SMT is influenced by the quality of the fetched instructions. For this reason, a fetch policy is used to select which threads are allowed to fetch instructions to the instruction queues [19]. However, if the FTQ of a thread selected to fetch in a cycle is empty, this thread cannot be fetched in this cycle. Instead, the highest-priority thread that has an available FTQ entry will be selected to fetch. Thus, the decision taken by the fetch policy is subordinated to the availability of predictions for every thread.
We have examined the use of branch prediction policies to select which thread(s) is the most suitable to use the branch predictor every cycle. It is important to distinguish a prediction policy from a fetch policy. A fetch policy is applied to select which FTQ will be used by the fetch unit. In contrast, a prediction policy selects which FTQ will be filled with a new prediction. Thus, the prediction policy determines the rate at which every FTQ is filled, while the fetch policy determines the rate at which it is emptied.
We define several branch prediction policies that prioritize threads according to different criteria:
RR: Priority among threads is given in a circular way (Round Robin), among those threads that have an available entry in the FTQ. This policy is blind, in the sense that it does not follow any special criteria to give priority among threads.
ICOUNT: In this case, the same policy used for giving fetch priority is used for giving prediction priority. The aim of this policy is to add a new entry to the FTQ that will be used to fetch in the same cycle.
FTQ COUNT: This policy tries to keep all FTQs nonempty. Priority is given to the thread with the fewest occupied entries in its FTQ. The reasoning behind this prediction policy is the following: if all FTQs are full, the fetch policy will be able to select the highest-priority thread(s) to fetch without penalty. Otherwise, if the FTQ of a thread selected to fetch is empty, the efficiency of the fetch policy is penalized.
IF COUNT: This policy extends the idea followed by the previous one. In this case, the criterion to select which thread should be predicted is not the number of fetch blocks ready to be fetched. Instead, this policy considers the total number of instructions ready to be fetched, i.e. the sum of lengths of all predictions stored in the FTQ.
The ICOUNT prediction policy tries to add a new FTQ entry for the thread which will be used to fetch. The main disadvantage of this policy is that it does not take into account the amount of information stored in each FTQ. It is possible for the ICOUNT prediction policy to select a thread having a full FTQ. In this case, it would be more beneficial to select a different thread with an empty FTQ, probably avoiding a fetch stall in a later cycle. Both FTQ COUNT and IF COUNT prediction policies take into account the amount of information stored in the FTQs. The IF COUNT policy is more complex, but it has the additional advantage of knowing the total number of instructions stored in each FTQ, which is more precise than only knowing the total number of stored predictions.
Figure 6 shows the performance achieved by the four prediction policies using 4-entry FTQs. Data is shown for branch predictors with 1, 2, and 4 access ports. As can be expected, the IF COUNT prediction policy obtains the best performance results. However, the main observation from this figure is that the four policies provide a similar performance. There is little difference between the IF COUNT prediction policy and the blind, but simpler, Round Robin scheme. This happens because, although the predictor is used in a different way, all the evaluated policies are able to correctly balance the contents of the FTQs. Therefore, to avoid an increase in the fetch engine complexity, we have chosen to use the simple Round Robin prediction policy.
B. Performance Results
Figure 7 shows performance results using a decoupled fetch with a 4-entry FTQ per thread. The prediction policy used is the Round Robin scheme, so these results correspond to those shown in the RR columns of Figure 6. For all workloads, decoupling the fetch yields a performance improvement over the results shown in the previous section (Figure 4). The FTQs allow the branch predictor to work at a different rate than the instruction cache, introducing requests in the queues even while the fetch engine is stalled due to cache misses or resource limitations. The information contained in the FTQs can be used to drive the instruction cache during the following branch predictor accesses, hiding its access latency.
Table IV shows the speedup achieved by decoupling the branch predictor for the four evaluated workloads, as well as the average speedup. The best average speedups are achieved by the 4-port predictors. This makes sense because they have a higher latency, and thus they can benefit the most from a decoupled scheme. The speedup is especially high in the 2-thread workload, since increasing the number of access ports does not provide any benefit. It is also interesting to note that the FTB achieves higher speedups than the stream predictor. The stream fetch engine uses longer prediction units than the FTB fetch architecture [10]. Therefore, the use of an FTQ is more beneficial for the FTB fetch architecture than for the stream fetch engine, since a larger prediction unit makes it easier to hide the branch predictor access latency.
The last column of Table IV shows the average slowdown of the decoupled scheme against the ideal 1-cycle 32KB predictors using 2 access ports and a coupled fetch engine (shown in Figure 3). There is little room for improvement in the 4-port predictors, since their ability to make four predictions combined with the FTQ allows the fetch engine to hide the access latency. The stream predictor using 2 access ports also achieves a performance close to the ideal 1-cycle predictor due to the longer size of instruction streams. However, the rest of setups can still be improved. This is especially true for the FTB using a single access port, which achieves half the performance of the ideal FTB. Therefore, additional mechanisms to tolerate the branch predictor access latency can still achieve further performance gain.
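As a recap of the four prediction policies defined in Section VI-A, the sketch below expresses each one as a priority ordering over threads. It is an illustration under stated assumptions rather than the evaluated implementation: the function name and data layout are hypothetical, ties are broken by thread id, and the restriction to threads with a free FTQ entry (which the text states for RR) is applied to every policy here.

```python
def prediction_priority(policy, ftqs, icount, last_rr=-1, ftq_entries=4):
    """Return the thread ids that may receive a prediction, best first.

    ftqs:    list indexed by thread id; each item lists the lengths (in
             instructions) of the predictions queued for that thread.
    icount:  in-flight instruction counts, as used by the ICOUNT fetch policy.
    last_rr: thread served last by the Round Robin policy.
    """
    n = len(ftqs)
    # Assumption: only threads with a free FTQ entry are considered.
    eligible = [t for t in range(n) if len(ftqs[t]) < ftq_entries]
    if policy == "RR":
        def key(t):
            return (t - last_rr - 1) % n   # rotate after the last served thread
    elif policy == "ICOUNT":
        def key(t):
            return icount[t]               # same criterion as the fetch policy
    elif policy == "FTQ_COUNT":
        def key(t):
            return len(ftqs[t])            # fewest queued predictions first
    elif policy == "IF_COUNT":
        def key(t):
            return sum(ftqs[t])            # fewest queued instructions first
    else:
        raise ValueError(policy)
    return sorted(eligible, key=key)
```

The example in the test data shows how FTQ COUNT and IF COUNT can disagree: a thread holding three short predictions ranks last under FTQ COUNT but ahead of a thread holding two long predictions under IF COUNT.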
VII. AN INTER-THREAD PIPELINED BRANCH PREDICTOR
Pipelining is a technique that allows the branch predictor to provide a prediction each cycle [9], [15]. The branch predictor access is divided into several stages, the last of which provides the final branch outcome. With this technique, large branch predictors can be used without impacting performance due to their large access time. However, pipelining the branch predictor implies that the next prediction is initiated before the current one has finished. Therefore, the next prediction cannot use the information generated by the current one (including the next fetch address), which can harm prediction accuracy. Figure 8 shows different implementations of a branch predictor. Figure 8.a shows a non-pipelined branch predictor. A branch prediction begins only when the previous one has finished. Hence, some bubbles are introduced in the pipeline. Figure 8.b shows a pipelined branch predictor as proposed in [9]. In order to initiate a new prediction without waiting for the previous one (@1), the new fetch address (@2') is calculated by using speculative dynamic information of pending predictions, as well as by predicting future decode information. When a prediction is finished, the target address obtained (@2) is compared with the predicted target address (@2') used for generating the following prediction. If both target addresses are different, wrong speculative instructions should be discarded, resuming the instruction fetch with the correct prediction (@2).
Although using in-flight information and decoding prediction like in [9] allows accurate latency tolerant branch prediction, it also involves a high increase in the fetch engine complexity (predict the presence of branches, maintain in-flight prediction data, support recovery mechanisms, etc). To avoid this complexity, we propose a pipelined branch predictor implementation for SMT. Our proposal interleaves branch prediction requests for each thread, providing a prediction each cycle despite the access latency. Each thread can initiate a prediction as long as there is no previous pending prediction for this thread. Thus, if there are enough threads being executed in the processor, the branch predictor latency can be effectively hidden. Figure 8.c shows the behavior of our inter-thread pipelined branch predictor. This example employs a 2-cycle 1-port branch predictor and a 2-thread workload. Each cycle, a new prediction for a different thread can be initiated. Therefore, although each individual prediction takes 2 cycles, the branch predictor provides a prediction each cycle. If the branch predictor latency were higher, there would be bubbles in the predictor pipeline. Nevertheless, decoupling the fetch helps to relax this effect. Moreover, as more threads are executed, the pipeline is always filled with new prediction requests, so branch predictor latency is totally masked. Figure 9 shows performance results using an inter-thread pipelined branch predictor in a decoupled SMT fetch engine. The main observation is that the performance of all the evaluated setups is similar. Table V shows the IPC speedups achieved by using a decoupled inter-thread pipelined branch predictor over the decoupled fetch with a non-pipelined branch predictor. As stated in the previous section, there was little room for improving the 4-port predictors.
Since decoupled branch predictors with four access ports can efficiently hide the access latency, pipelining them has little impact on processor performance. On the contrary, the single port predictors achieve an important performance improvement by using our inter-thread pipelining technique. The 1-port stream predictor achieves an average 12% speedup, while the 1-port FTB achieves a 48% speedup.
These results show that the inter-thread pipelined branch predictor is an efficient solution to tolerate the branch predictor latency on SMT. The last column of Table V shows the average slowdown of the decoupled inter-thread pipelined predictors against the ideal 1-cycle 32KB predictors using 2 access ports and a coupled fetch engine (shown in Figure 3). It is clear that all the evaluated setups almost achieve the performance of an ideal 1-cycle predictor. Therefore, the combination of an inter-thread pipelined predictor design and a decoupled SMT fetch engine constitutes a fetch unit that is able to exploit the high accuracy of large branch predictors without being penalized for their large access latency.
Finally, an important conclusion that can be drawn from these results is that an inter-thread pipelined branch predictor with a single access port provides a performance close to a branch predictor using four access ports, even if it is pipelined. A reduction in the number of access ports involves a large reduction in the chip area devoted to the branch predictor. According to data collected using CACTI, our 32KB 1-port inter-thread pipelined design requires 3 times less area than a similar predictor using 2 access ports, or even 9 times less area than a similar predictor using 4 access ports. This high reduction in the required chip area is another worthwhile contribution of our proposal.
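The throughput argument behind the inter-thread pipelined predictor can be checked with a small cycle-level sketch. It models only a 1-port pipelined predictor in which a thread may start a new prediction once its previous one has completed; the round-robin choice among ready threads, the function name, and the omission of FTQ back-pressure and mispredictions are assumptions of this sketch.

```python
def issue_rate(threads, latency, cycles=1000):
    """Predictions started per cycle on a 1-port inter-thread pipelined predictor.

    A thread that starts a prediction at cycle c cannot start another one
    until cycle c + latency. At most one prediction starts per cycle.
    """
    ready_at = [0] * threads   # first cycle at which each thread may start
    started, nxt = 0, 0
    for cycle in range(cycles):
        for i in range(threads):
            t = (nxt + i) % threads        # round-robin scan (assumed)
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency
                started += 1
                nxt = (t + 1) % threads
                break
    return started / cycles
```

With 2 threads and a 2-cycle predictor the sketch starts one prediction every cycle, matching the behavior described for Figure 8.c, while a single thread on the same predictor reaches only one prediction every two cycles. With a 4-cycle predictor, 2 threads leave bubbles in the pipeline and 4 or more threads fill it.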
VIII. CONCLUSIONS
Current technology trends prevent branch predictors from being accessed in a single cycle. In this paper, we show that the branch predictor access latency is a performance limiting factor for SMT processors. We provide data for two state-of-the-art branch predictors, the FTB and the stream predictor, with a 32KB hardware budget. Although the predictor latency degrades the potential performance, reducing its size is not a solution, since the lower prediction accuracy degrades the performance even more.
We evaluate the possibility of modifying the number of access ports. Since our front-end model is able to fetch from two different threads each cycle, it makes sense that the branch predictor uses two access ports. Using a single port reduces the access latency, but only one prediction can be generated each time, which degrades performance. On the other hand, using four access ports increases the access latency, also degrading performance, although generating four predictions each time partially compensates for this degradation. In general, our results show that techniques for reducing the impact of branch predictor access delay on SMT can provide a worthwhile performance improvement.
We propose to decouple the SMT fetch engine to tolerate the branch predictor latency. Decoupling the fetch generation from the fetch consumption allows the branch predictor to work autonomously, generating predictions even when threads are stalled for any reason (cache misses, instruction dependencies, or resource constraints). A decoupled SMT fetch provides larger improvements for those predictors with a higher latency, so the 4-port predictors benefit the most from this technique, achieving speedups ranging from 3% to 34%. We also propose an inter-thread pipelined branch predictor for SMT. Our design maintains a high prediction generation throughput with a low complexity. Each cycle, a new branch prediction is initiated, but only from a thread that is not waiting for a previous prediction. Using this technique, the 4-port predictors achieve only small speedups, since the decoupled scheme already allows them to obtain a performance close to that of an ideal single-cycle predictor. However, our inter-thread pipelining mechanism allows the 1-port predictors to achieve important speedups over decoupled non-pipelined predictors, ranging from 3% to 63%. Moreover, inter-thread pipelined 1-port predictors are able to achieve a performance close to 4-port predictors, while reducing the required chip area by a factor of nine.
In summary, SMT tolerates the latency of memory and functional units by issuing and executing instructions from multiple threads in the same cycle. The techniques presented in this paper extend the latency tolerance of SMT to the processor front-end. Thus, multi-cycle branch predictors and instruction fetch mechanisms can be used without affecting SMT performance.
