Introduction
Speculative multithreading (SpMT) is emerging as an effective mechanism for exploiting thread-level parallelism (TLP) from non-numeric programs. SpMT processors allow multiple threads to execute in parallel in the presence of ambiguous control and data dependences, and recover when a dependence violation is detected.
Control and data speculation are the essence of SpMT execution models. Both inter-thread data dependences and inter-thread control predictions at run time play crucial roles in the performance of an SpMT system. Therefore, it is important to characterize the execution behavior of threads carefully in order to understand performance better and to evaluate the effectiveness of the compiler and the hardware. Most SpMT processors [5, 10, 12, 14] rely on the compiler to generate suitable threads that exploit the parallelism available in the programs. The SpMT compiler should partition the program such that inter-thread data dependences are minimized during execution. Threads are spawned and executed speculatively, and inter-thread control misspeculation leads to squashing of threads. Therefore, the compiler should expose the more predictable branches at the thread boundaries while hiding the less predictable branches inside the threads.
We have developed an SpMT compiler [2] to partition sequential programs into threads for running on SpMT processors. Our work differs from earlier work on SpMT compilation [11, 15, 17] in a number of ways. Most of the earlier work [11, 17] primarily targets loop-level parallelism, whereas our compiler exploits parallelism from the non-loop regions as well.
In this paper we use a simulation-based environment to evaluate the performance of the non-numeric programs partitioned by our SpMT compiler. To understand the SpMT system performance better and to evaluate the effectiveness of our compiler algorithms, it is essential to analyze in detail the run-time behavior of the inter-thread data dependences and control predictions of the generated threads. Our detailed study shows that our compiler models the inter-thread data and control dependences effectively and minimizes both in most of the programs. Moreover, our study reveals the difference in access patterns between inter-thread memory data and inter-thread register data, and indicates further opportunities for improving the SpMT compiler and the hardware. Our characterization of the data and control dependences of the multithreaded programs helps in understanding program behavior in the SpMT execution model and indicates ways to improve the system further.
The rest of the paper is organized as follows. In Section 2, we review the concepts of speculative multithreading and discuss the important issues related to the performance of the SpMT system. We present an overview of our SpMT compiler framework in Section 3. The experimental methodology and the evaluation are presented in Section 4. Finally, we conclude in Section 5.
Speculative multithreading
[10], and CMP [5, 11]. Speculative multithreading enables parallelization of applications despite any uncertainty about the (control or data) dependences that may exist between the parallel threads. The hardware speculates on dependences, and recovers whenever a speculation is found to be incorrect. This allows the SpMT compiler to speculate optimistically, thereby improving performance. Below we review some of the important aspects of SpMT.
Importance of considering interthread data dependences
Perhaps the most important factor affecting the SpMT ...
SpMT processor parameters used for our evaluation are shown in Figure 2.
...tions. Otherwise, a great deal of squashing and re-execution of threads would take place. Therefore, it is important to assess how effectively our compiler reduces inter-thread control mispredictions. Our compiler tries to reduce inter-thread control mispredictions by spawning non-speculative threads from control-independent points and by spawning speculative threads along the likely paths in the program.
In Table 1, we present the branch prediction statistics. Table 1 shows that in all benchmarks except perimeter, the inter-thread branch prediction accuracies are significantly higher than both the intra-thread and the overall branch prediction accuracies. This indicates the success of our compiler in exposing the more predictable branches at thread boundaries.
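As a minimal sketch of how the three accuracies compared above could be derived, assume the simulator logs every dynamic conditional branch together with a flag saying whether it lies at a thread boundary and whether it was predicted correctly. The record layout and the function name are hypothetical and are not taken from our simulator.

```python
from collections import namedtuple

# Hypothetical log record: one entry per dynamic conditional branch.
#   inter_thread: True if the branch sits at a thread boundary
#   correct:      True if the branch was predicted correctly
Branch = namedtuple("Branch", ["inter_thread", "correct"])


def prediction_accuracies(branches):
    """Return (inter, intra, overall) accuracy in percent.

    An entry is None when the corresponding class has no branches.
    """
    def pct(hits, total):
        return 100.0 * hits / total if total else None

    branches = list(branches)
    inter = [b for b in branches if b.inter_thread]
    intra = [b for b in branches if not b.inter_thread]
    return (
        pct(sum(b.correct for b in inter), len(inter)),
        pct(sum(b.correct for b in intra), len(intra)),
        pct(sum(b.correct for b in branches), len(branches)),
    )
```

Keeping the two classes separate is what makes the comparison meaningful: a partitioning that exposes predictable branches at thread boundaries shows up as an inter-thread accuracy above the overall figure.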
In Table 1, we see that equake, mcf, twolf, health, and tsp have significant proportions of inter-thread branches. In equake and health, the percentage of inter-thread branches is higher than that of intra-thread branches, and these branches also have very high prediction accuracies. This is because both programs spend more than 90% of their time in loop-centric threads, and the inter-thread branches consist mainly of loop-terminating branches, resulting in high inter-thread branch prediction accuracies. The inter-thread branches in mcf, vpr, and mst also consist mostly of loop-terminating branches.
In crafty, perimeter, and treeadd, some of the branches on which the speculative threads depend are included inside another thread. In treeadd and perimeter there are no loops, and the programs work by making recursive function calls. There are multiple function calls after a single conditional branch, and a speculative thread is spawned for every function call from before the conditional branch.
Therefore, in this case there are successive speculative threads without any intervening conditional branch. From Figure 3, we see that in all three programs the speculative threads achieve good speedup; therefore, the intra-thread branches on which the speculations are made are likely to have high prediction accuracies as well.
In perimeter, the inter-thread branch prediction accuracy is lower than the intra-thread branch prediction accuracy. This is because of the poor overall branch prediction accuracy of the program. However, the percentage of inter-thread branches is only 2.8% and, as explained above, not all branches on which the speculations are based are exposed at the thread boundaries.
Characterization of inter-thread data dependences
In this subsection we characterize the dynamic data dependence behavior of the programs and correlate it with program performance. Various statistics of the inter-thread dynamic register and memory data dependences for 4 PEs are shown in Table 2.
For example, from Table 2 we see that crafty has an average of 11.54 inter-thread register dependences. Of these, 3.34 requests are made to inter-thread registers whose values are not yet available. Of this average of 3.34, only 0.56 unresolved register values are correctly predicted, and the consumer instructions have to stall for the remaining 2.78 inter-thread register dependences.
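The breakdown for crafty can be checked with a few lines of arithmetic. The sketch below uses only the three averages quoted above; the function and variable names are illustrative, and the "resolved" figure is simply derived as the total minus the unresolved requests.

```python
def stalling_dependences(unresolved, predicted):
    """Unresolved dependences whose values were not correctly predicted."""
    return round(unresolved - predicted, 2)


# crafty, 4 PEs: average inter-thread register dependences (from Table 2)
total = 11.54       # all inter-thread register dependences
unresolved = 3.34   # value not yet available when requested
predicted = 0.56    # unresolved, but correctly value-predicted

resolved = round(total - unresolved, 2)                # derived: value already available
stalls = stalling_dependences(unresolved, predicted)   # consumer has to stall

print(f"resolved={resolved:.2f} unresolved={unresolved:.2f} stalls={stalls:.2f}")
```

Only the last category, 2.78 of the 11.54 dependences, actually delays the consumer thread.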
Table 2: Inter-thread average dynamic register and memory dependence statistics for 4 PEs
From Table 2 we see that the programs that achieve high speedup do not have many inter-thread data dependences that cause stalls. For example, the average number of inter-thread register dependences causing stalls is 2.78 for crafty and 2.92 for vpr. These values are not high, considering that the average dynamic thread sizes of crafty and vpr are 93.5 and 80.1, respectively. Similarly, mst, which has the highest parallelism, does not have any inter-thread register dependences that can cause stalls. On the other hand, in health the average number of inter-thread register dependences that cause stalls is 1.99. This is a significant number, considering that the dynamic thread size of health is only 8.9, and it is also evident from the small speedup of health in Figure 3. tsp also has very many unresolved register dependences, resulting in a small speedup. In tsp, the unresolved register dependences are due to long-latency multiplication and floating-point operations.
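One way to make this comparison explicit is to normalize the stalling dependences by the thread size. The sketch below does so for the three programs whose figures are quoted above; it assumes that thread size is measured in dynamic instructions, and the "stalls per 100 instructions" metric is our illustration rather than a column of Table 2.

```python
# (average stalling inter-thread register dependences, average dynamic thread size)
# for 4 PEs, as quoted in the text. Thread size is assumed to be in instructions.
stats = {
    "crafty": (2.78, 93.5),
    "vpr": (2.92, 80.1),
    "health": (1.99, 8.9),
}


def stalls_per_100_instructions(stalls, thread_size):
    """Stalling dependences normalized to a thread of 100 instructions."""
    return 100.0 * stalls / thread_size


density = {
    name: stalls_per_100_instructions(stalls, size)
    for name, (stalls, size) in stats.items()
}

# Print from the least to the most stall-dense program.
for name, value in sorted(density.items(), key=lambda kv: kv[1]):
    print(f"{name:8s} {value:5.1f}")
```

Under this normalization crafty and vpr come out at roughly 3 stalling dependences per 100 instructions, whereas health comes out at roughly 22, which matches the qualitative argument that its dependences are significant relative to its small threads.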
From Table 2 we find that most programs have more register dependences causing stalls than memory dependences causing stalls. However, even a small number of memory dependences can cause more stalls if they result in cache misses. For example, in mcf, although the number of inter-thread dependences is not high, the cache miss rates are very high for both the intra-thread and the inter-thread memory dependences. Also, the data value prediction accuracy is higher for memory data than for register data. This is because the programs often load the same data from memory, which is easier to predict than the register values.
While building the inter-thread data dependence model, the compiler tries to limit the number of unresolved inter-thread data dependences to at most 10. Adding up the unresolved register and memory data dependences, we can see that the value does not exceed 10 except for tsp. This validates our data dependence modeling.
In order to evaluate the inter-thread register dependence patterns, we measure the span of the inter- ...
In Figure 4(a), we see that for mcf, health, mst, and tsp the resolved register dependences come either from the immediate predecessor or from threads that are more than ten dynamic threads apart. All of these programs are loop-centric, and for such programs most of the resolved register dependences are loop-carried dependences coming from the last iteration. For all other benchmarks, the resolved register dependences are spread more or less uniformly among the predecessor threads.
In Figure 4(b), we see that more than 50% of the unresolved register dependences are due to the immediate predecessor. In the loop-based programs mcf, health, mst, and tsp, all the unresolved dependences come from the last iteration. Since our compiler and SpMT model support out-of-order thread spawning and execution [2], in a 4 PE SpMT processor there can be unresolved dependences due to threads that are more than 4 threads apart. However, in all programs except twolf and perimeter, nearly 100% of the unresolved dependences are due to threads that are at most 4 threads away. This indicates that there are no stalls due to out-of-order execution.
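A distance distribution of this kind could be tallied as in the following sketch. It assumes each dynamic dependence is recorded as a pair of sequence numbers (producer thread, consumer thread) in sequential program order. The bucket boundaries at 1, 4, and 10 follow the distances discussed in the text; the intermediate bucket labels and all names are hypothetical.

```python
from collections import Counter


def distance_bucket(distance):
    """Map a producer-to-consumer distance (in dynamic threads) to a bucket label."""
    if distance == 1:
        return "1"       # immediate predecessor
    if distance <= 4:
        return "2-4"     # within the number of PEs
    if distance <= 10:
        return "5-10"
    return ">10"


def distance_distribution(dependences):
    """Return {bucket: percentage of dependences}.

    dependences: iterable of (producer_seq, consumer_seq) pairs, where the
    producer precedes the consumer in sequential order.
    """
    counts = Counter(distance_bucket(c - p) for p, c in dependences)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bucket: 100.0 * n / total for bucket, n in counts.items()}
```

Running the tally separately for resolved and unresolved dependences, and separately for register and memory data, would yield the four views that the discussion compares.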
Figure 4: Distribution of register data dependences in terms of distances in dynamic threads
In Figure 5(a), we see that in all benchmarks except tsp, more than 50% of the resolved memory dependences access threads that are more than 10 threads apart. As with register dependences, in the programs mcf, health, mst, and tsp all the resolved memory dependences come from the previous iterations or from threads that are far away.
In Figure 5(b), we see that the unresolved memory dependences span a larger distance than the unresolved register dependences, which indicates the existence of unresolved memory dependences due to out-of-order execution. Except for health, most of the unresolved memory dependences come from nearby predecessors. As with unresolved register dependences, in mst and tsp all unresolved memory dependences come from the immediate predecessor. In the other programs, the register and memory dependence patterns are quite different overall.
Conclusions
Speculative multithreading is emerging as an important parallelization method for non-numeric programs. Judicious partitioning of a sequential program into threads is necessary to exploit parallelism in SpMT processors. We have developed an SpMT compiler for partitioning sequential programs into multiple threads. The inter-thread data dependences and control dependences are extremely important in achieving good speedup. The compiler should therefore model the inter-thread data and control dependences as accurately as possible and generate threads such that the dependences are minimized.
In this paper we used a simulation-based environment to evaluate the performance of the non-numeric programs partitioned by our SpMT compiler framework. We studied the inter-thread control and data dependences of the programs and used them to analyze the performance. Our study shows that the compiler models the inter-thread data and control dependences effectively and minimizes both in most of the programs. However, the lack of parallelism in the presence of long-latency operations and the different access patterns of memory and register dependences indicate possible improvements for the compiler. Our characterization of the data and control dependences of the multithreaded programs helps in understanding program behavior in the SpMT execution model and indicates ways to improve the system further.
