Clustered Multithreading for Speculative Execution

Rangsipan Marukatat

Doctor of Philosophy
Institute for Computing Systems Architecture
School of Informatics
University of Edinburgh




Abstract

This thesis introduces the use of hierarchy and clusters in multithreaded execution,
which allows several fragments of an application to be specifically optimised and ex-
ecuted by clusters of thread processing units (TPUs) as orchestrated by compile-time 
analysis. Our multithreaded architecture is a network of homogeneous thread pro-
cessing units. Additional features were proposed, aiming at dynamic clustering of the TPUs throughout the entire program execution, as well as minimal hardware support for speculative execution. The architecture executes a subset of the MIPS instruction
set augmented with multithreaded instructions. A multithreaded compilation system 
was implemented, which focuses on high-level or front-end transformation from se-
quential C programs to multithreaded ones. 
Empirical studies were conducted on benchmarks containing two types of program 
structures: loops and conditional branches. Coarse-grained control speculation enables 
simultaneous execution of several sub-problems such as loops, each of which could in 
turn be executed by multiple threads. Strategies were proposed for allocating TPU re-
sources to these sub-problems and evaluated in simulations. Significant speedups were 
observed in the performance of multithreaded loop execution, and could be further 
improved by the application of control speculation. 
Acknowledgements 
I would like to thank my supervisor, D.K. Arvind, for his guidance and support through-
out my study at the University of Edinburgh. The simulator and the compiler I used in 
my research were implemented thanks to help and suggestions from Alastair Patrick,
Christoffer Arvidsson, and many others I talked to via the SUIF mailing lists. 
I shared the office with Fang, Shun, and Grigon, whose company I appreciated.
My landlady, Deborah, and her Giant Schnauzers, Tavi and Talen, were always great friends and made me feel at home. I am also grateful for the friendship of many Thai friends, especially P'Noi, P'Fa, P'Ting, and Jay.
The biggest thanks go to my mom and dad for their love, support, and patience. 
Also, my little brother, Peune, who was studying in France, always sent jokes to me and our
parents. I am very grateful to our relatives and friends in Thailand who looked after 
our parents while we were away. 
My study was sponsored by a Thai Government Scholarship. 
Declaration 
I declare that this thesis was composed by myself, that the work contained herein is 
my own except where explicitly stated otherwise in the text, and that this work has not 
been submitted for any other degree or professional qualification except as specified. 
Table of Contents 
1 Introduction
    1.1 Thesis Overview
    1.2 Thesis Organisation

2 Literature Survey
    2.1 Thread Creation
        2.1.1 Dynamic Approach
        2.1.2 Static Approach
    2.2 Thread Initialisation
        2.2.1 Register Context
        2.2.2 Branch Predictor
    2.3 Thread Retirement
        2.3.1 Master/Slave Model
        2.3.2 Predecessor/Successor Model
    2.4 Inter-thread Data Communication
    2.5 Synchronisation
    2.6 Thread-Level Speculation
        2.6.1 Control Speculation
        2.6.2 Register Speculation
        2.6.3 Memory Speculation
    2.7 Hierarchical Organisation and Clusters
    2.8 Other Techniques
    2.9 Chapter Summary

3 The Multithreaded Processor Architecture and The Compiler
    3.1 Hierarchical Multithreaded Execution
    3.2 Description of the Architecture
        3.2.1 Global Thread Control Unit (GTCU)
        3.2.2 Thread Issue Unit (TIU)
        3.2.3 Local Thread Control Unit (LTCU)
        3.2.4 Register File
        3.2.5 Speculative Buffer
        3.2.6 Inter-thread Communication Unit
    3.3 Multithreaded Instructions
        3.3.1 Multithreaded Instructions Group 1
        3.3.2 Multithreaded Instructions Group 2
        3.3.3 Multithreaded Instructions Group 3
        3.3.4 Multithreaded Instructions Group 4
    3.4 The Multithreaded Processor Simulator
        3.4.1 Simulator Framework
        3.4.2 Limitations
    3.5 The Multithreaded Compiler
        3.5.1 Compiler Implementation
        3.5.2 Compilation Process
    3.6 Discussion

4 Multithreaded Loop Execution
    4.1 Multithreaded Loop Transformations
        4.1.1 Simple Loops
        4.1.2 Loops with Multiple Exits
        4.1.3 Register Communication
    4.2 Performance Evaluation
        4.2.1 Benchmarks
        4.2.2 Results and Discussions
        4.2.3 Summary
    4.3 Chapter Summary

5 Multithreaded Control-Speculative Execution
    5.1 Transformations for Control Speculation
        5.1.1 Single-Path Speculation
        5.1.2 Dual-Path Speculation
        5.1.3 Nested Speculation
    5.2 Performance Evaluation
        5.2.1 Benchmarks
        5.2.2 Results and Discussions
        5.2.3 Summary
    5.3 Chapter Summary

6 Conclusions
    6.1 Thesis Summary
    6.2 Discussion and Future Works
        6.2.1 Multithreaded Architecture
        6.2.2 Multithreaded Compiler
        6.2.3 Applications
    6.3 Conclusion

A Examples of Control-Flow Graphs
    A.1 heapsort
    A.2 164.gzip






List of Figures 
1.1 SMT and CMP architectures (reproduced from [44])
1.2 The system overview
1.3 An example of program partitioning
1.4 The clustering of TPUs during program execution
2.1 Two-level multithreaded models
3.1 The target multithreaded architecture
3.2 State transitions for (W, U) in a register
3.3 Retirement actions in the hierarchical-speculative execution
3.4 Hierarchical multithreaded execution
3.5 An overview of the simulator
3.6 State transitions of a participating entity
3.7 An overview of the compilation process
4.1 An outline of the loop transformation
4.2 Loop structure in SUIF IR, (a) before and (b) after loop expansion
4.3 Multithreaded loop generated by Loop-Transformer-1
4.4 Diagram of the multithreaded loop in operation
4.5 Store/load synchronisation in Loop-Transformer-1
4.6 Multithreaded loop generated by Loop-Transformer-2
4.7 Nested loop execution in speculative mode
4.8 Transformed loop using memory communication
4.9 Transformed loop using register communication
4.10 Diagram of register communication for register $70
4.11 Speedup of multithreaded programs with cluster size ranging from 2 to 16 TPUs, in steps of 2
4.12 A saturation point being reached at cluster size = 4
4.13 Speedup of multithreaded versions of DA and RJ8
4.14 Speedup of multithreaded versions of U_21
4.15 An example of nested multithreading
4.16 RIEavg of the multithreaded programs shown in Figure 4.11
4.17 Speedup of recycling multithreaded execution after loop peeling
4.18 RIEavg graphs after loop peeling
4.19 Standard deviations of the RIE bars in Figure 4.18
4.20 Speedup of recycling multithreaded execution after loop unrolling and loop peeling (continued in Figure 4.21)
4.21 Speedup of recycling multithreaded execution after loop unrolling and loop peeling (continued from Figure 4.20)
4.22 Loop chunking for multithreaded execution on 4 TPUs
4.23 Speedup of non-recycling multithreaded execution after loop chunking
4.24 Speedup of nested-multithreaded programs with and without optimisation (loop chunking)
4.25 Speedup of multithreaded programs being sequentially executed
4.26 Speedup of multithreaded programs with fork penalty
4.27 Performance of one-level multithreaded programs (continued in Figure 4.28)
4.28 Performance of one-level multithreaded programs (continued from Figure 4.27)
5.1 An example of a control-flow graph
5.2 The control-flow graph in Figure 5.1 after code replication
5.3 An outline of the transformation for speculative execution
5.4 Branch structure in the SUIF intermediate representation
5.5 Code generated by Spec-Transformer-1, THEN path is predicted
5.6 Memory communication in Spec-Transformer-1
5.7 Code generated by Spec-Transformer-2
5.8 Register communication
5.9 Sample nest of branches for Figure 5.10
5.10 Code generated for nested speculation
5.11 Handling of data dependencies in nested branches
5.12 Modified Livermore kernels (continued in Figure 5.13)
5.13 Modified Livermore kernels (continued from Figure 5.12)
5.14 Synthetic benchmark SYN_1
5.15 Synthetic benchmark SYN_2
5.16 Synthetic benchmark SYN_3
5.17 Synthetic benchmark SYN_4
5.18 Synthetic benchmark SYN_5
5.19 Synthetic benchmark SYN_6
5.20 Synthetic benchmark SYN_7
5.21 Speedup of speculative programs (Cindep policy)
5.22 A comparison of 2 cluster allocation policies for non-speculative programs
5.23 A comparison of 4 cluster allocation policies for speculative programs
5.24 A comparison of 4 cluster allocation policies for the nested speculation in SYN_5
5.25 An outline of control-independent execution in SYN_2
5.26 Speedup after CSP and CI are performed (total TPUs = 24)
5.27 Speedup after CSP and CI are performed (total TPUs = 8, 12)
5.28 Best performance from CSP, CI, and CSP+CI
5.29 Results from the lookahead speculation
5.30 Speedup of multithreaded execution, with and without concurrent speculation
5.31 Speedup of speculative programs after the outer loop is optimised
5.32 Loop unrolling and code motion being applied to the outer loop
5.33 Synthetic benchmarks with unbalanced control structures
5.34 Speedup of the speculative execution in SYN_UB_2
A.1 CFG of the heap-sorting function
A.2 CFG of the heap-sorting function after code replication (1)
A.3 CFG of the heap-sorting function after code replication (2)
A.4 CFG of procedure deflate_fast
A.5 Handling of function calls inside procedure deflate_fast
A.6 CFG of procedure inflate_block
A.7 Completely-nested branches in procedure inflate_block
B.1 Speedup of non-speculative programs (with GTCU delay = 0, 1, and 2 time units)
B.2 Speedup of speculative programs (with GTCU delay = 0, 1, and 2 time units)
List of Tables 
2.1 Categories of thread-level data speculation
3.1 Examples of pseudo-functions
3.2 Multithreaded Instructions Group 1 (continued in Table 3.3)
3.3 Multithreaded Instructions Group 1 (continued from Table 3.2)
3.4 Multithreaded Instructions Group 2
3.5 Multithreaded Instructions Group 3 (continued in Table 3.6)
3.6 Multithreaded Instructions Group 3 (continued from Table 3.5)
3.7 Multithreaded Instructions Group 4
3.8 Probing Instructions
4.1 Order of commit and retirement
4.2 Benchmark description and general statistics
4.3 Parameters for the simulated multithreaded architecture
4.4 Multithreading overheads
4.5 Details of parallelisable loops in the benchmarks
4.6 Details of parallelisable nested loops
5.1 Overheads of multithreaded speculative execution
5.2 Description and general statistics of synthetic benchmarks
5.3 Average sequential execution time (per invocation)
5.4 Contribution of individual loop to the overall program execution




Chapter 1

Introduction

There is a recent trend in multiprocessor architectures towards multithreading. Threads
are streams of instructions with each one having its own program counter and regis-
ter space. Whether the threads share memory space and other resources depends on 
the particular architecture and its implementation. A number of research groups have 
proposed architectural models which can be divided into two broad groups: Simultane-
ous Multithreading (SMT) and Chip Multiprocessing (CMP). The SMT-based model 
[3, 44, 45, 71] is built on a traditional wide-issue superscalar processor, which issues 
instructions from multiple threads to any available functional unit (FU) as the pro-
cessor's resources are shared. The CMP-based model [26, 32, 58, 66, 70], which is analogous to a traditional tightly-coupled multiprocessor, statically partitions a single chip into multiple thread processing units (TPUs), each comprising a number of functional units. The partitioning of computational resources (i.e. FUs) in the SMT and CMP architectures is displayed in Figures 1.1(a) and (b), respectively.
Much effort has also been devoted to developing compilers for the multithreaded 
architectures, notably for CMPs [11, 32, 40, 53, 54, 66, 72, 79]. Unlike SMTs, which 
can exploit thread-level and instruction-level parallelism dynamically and interchangeably (i.e. in the absence of thread-level parallelism, an SMT would dedicate its resources to instruction-level parallelism), CMPs rely heavily on the compilers to extract thread-level parallelism and typically apply conventional optimisations to further exploit instruction-level parallelism. Because of this, however, CMP architectures are relatively simple to design and optimise compared to SMT ones.

[Figure 1.1: SMT and CMP architectures (reproduced from [44])]
1.1 Thesis Overview 
This thesis proposes a framework that organises the multithreaded execution on a 
CMP-based architecture into multiple layers or hierarchy. The main ideas are: 
• Distributed program analysis allows one to focus on classes of compilation techniques, as well as the resource requirements for each sub-problem, bearing in mind the overall constraints of the architecture.

• Hierarchical thread management alleviates the workload of overseeing and managing all threads in the global scope. Instead, groups of individual threads, corresponding to sub-problems, are mapped to clusters of TPUs and managed locally.

• Dynamic clustering of the TPUs enables resource allocation to be adjusted to the specific requirements of the sub-problems during the program execution.
Hierarchical program partitioning is employed. Firstly, a program is divided into a (small) number of subsystems which are, for example, paths of conditional branches or outermost loops. These can be repeatedly decomposed into finer subsystems; eventually, the innermost or deepest ones are individual threads. Clusters of TPUs are allocated to the program partitions, and their sizes depend on the inherent parallelism in those partitions. To enable this, a CMP-based architecture is provided with the ability to construct and manage clusters at run-time, as dictated by the compile-time analysis. The interface that conveys commands and inquiries between the compiler and the architecture is a set of special instructions added to the standard MIPS instruction set [22, 37]. An overview of the framework is shown in Figure 1.2.
[Figure 1.2: The system overview]

Figure 1.3 depicts an example of program partitioning. The hierarchy is managed through master/slave relationships between threads, i.e. a cluster which is manipulated
by the master thread is allocated to a collection of slave threads, while each thread in 
the collection could, in turn, be the master thread of another cluster, and so on. The 
number of TPUs required in each cluster is determined at compile-time. At run-time, 
there could be both independent threads and collections of threads running on TPUs or 
clusters of TPUs. There are two levels of resource competition: (1) the master threads 
compete for the available TPUs in order to form clusters; and (2) the threads within the 
group allocated to a cluster compete for the available TPUs within the cluster. 
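To make the two levels of competition concrete, the following minimal C sketch uses nested OpenMP parallelism as a software analogy. This is an illustration under assumed names only, not the thesis's mechanism: the architecture forms clusters with special instructions (Chapter 3), not with OpenMP. Here the outer team stands for master threads acquiring clusters, and each inner team for slave threads sharing one cluster's TPUs.

    #include <omp.h>
    #include <stdio.h>

    /* Analogy only: outer parallel region ~ master threads competing for
       clusters; inner regions ~ slave threads competing for the TPUs
       within one cluster. */
    int main(void) {
        omp_set_nested(1);                       /* enable two-level parallelism */
        #pragma omp parallel num_threads(2)      /* two "clusters"               */
        {
            int cluster = omp_get_thread_num();
            #pragma omp parallel num_threads(4)  /* four "TPUs" per cluster      */
            {
                printf("cluster %d, slave %d\n", cluster, omp_get_thread_num());
            }
        }
        return 0;
    }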
[Figure 1.3: An example of program partitioning]

The assignment of clusters of TPUs to collections of threads is illustrated by analogy with the assignment of clusters of FUs to threads of code in the SMT model. Both share an underlying idea that the resource partitioning and assignment are performed dynamically throughout the program execution rather than fixed in the hardware. In the SMT model (Figure 1.1(a)), the FUs are virtually clustered and de-clustered by
threads on a cycle-by-cycle basis. In other words, multiple threads compete for the FUs 
in each cycle. The number of FUs used by each thread depends on the instruction-level 
parallelism and the availability of resources, both of which are exposed at run-time. 
In our model (Figures 1.4(a) and (b)), multiple collections of threads are gener -
ated to execute program partitions and compete for the TPUs. The number of TPUs 
executing each program partition depends on the thread-level parallelism predicted at 
compile-time and the availability of resources known at run-time. Figure 1.4(a) dis-
plays snapshots of the program execution shown in Figure 1.4(b). Clusters {1, 2, 3, 4, 5} are allocated to collections of threads {1, 2, 3, 4, 5}, respectively. At cycle 1, there
are 3 program partitions being executed simultaneously, one by a cluster of 2 TPUs 
and the rest by a single TPU each. During the execution of each program partition, 
multiple threads may reuse a TPU since they may be spawned and retired at different
cycles. The threads spawned concurrently can only compete for the available TPUs in 
the cluster allocated. At cycle 4, cluster 1 is still active while cluster 2 and cluster 3 
have freed their TPUs which are grabbed by cluster 4. At cycle 8, only cluster 5 is 
active, using all the available TPUs.
An advantage of dynamic cluster allocation is in the utilisation of TPU resources by 
various sub-problems in the program. For instance, if a non-speculative and a speculative loop are to be executed in parallel, a small number of TPUs should be dedicated to the speculative loop while the rest are reserved for the other computation. This approach differs from other clustered multithreaded architectures (e.g. [21, 38, 47, 78]), in which clusters are allocated statically, as shown in Figure 1.4(c). Within the clusters, the resource partitioning could be in either the SMT [38, 47] or the CMP [78] style.
The main contribution of this thesis is the experimental evaluation of hierarchical 
multithreading in a framework consisting of a simulated multithreaded architecture 
and a compiler. The focus is on two types of program structures: loops and conditional 
branches. Loops are potential sources of parallelism and their nesting structures fit 
well with the hierarchy. Control speculation is a well-known method for exposing 
parallelism in programs although the speculative execution is not guaranteed to be 
useful. Based on the experimental results, significant program speedups were achieved 
by loop parallelisation, and could be further improved by control speculation. 
[Figure 1.4: The clustering of TPUs during program execution. (a) Snapshots of the pool of TPUs; (b) our approach; (c) conventional approach]
1.2 Thesis Organisation 
The remaining chapters are summarised as follows: 
Chapter 2 reviews issues concerning multithreaded execution such as (1) creation, 
initialisation, and retirement of individual threads; (2) interaction between threads 
such as communication, synchronisation, and thread-level speculation; and (3) 
their collective relationship in clusters and hierarchy. 
Chapter 3 describes the multithreaded architecture, which is based on a CMP processor similar to the Superthreaded architecture [68, 69, 70]. It was enhanced to support hierarchical execution, control speculation, register synchronisation and forwarding, and novel multithreaded instructions. The multithreaded compiler, implemented using the SUIF package [84], is also described. It takes advantage of a well-defined intermediate representation to recognise and transform loops and conditional branches in sequential programs for multithreaded execution.
Chapter 4 presents examples of multithreaded loop execution. Transformation rou-
tines implemented in the compiler are described, followed by experimental re-
sults and discussion. 
Chapter 5 presents examples of multithreaded control-speculative execution. It de-
scribes how programs are transformed and executed. Strategies used to partition 
programs for control speculation and to allocate resources are explained. Exper-
imental results are presented and discussed. 
Chapter 6 summarises and discusses the main findings of this research with sugges-
tions for future work. 
Chapter 2 
Literature Survey 
The key ideas in multithreaded execution are as follows:

• The creation, initialisation, and retirement of individual threads.

• The interaction between threads, essentially the inter-thread communication, synchronisation, and thread-level speculation.

• The collective relationship, such as hierarchical organisation and clustering.
We examine these ideas in some well-known multithreaded architectures, such as 
Single-Program Speculative Multithreading (SPSM) [18], Superthreaded [68, 69, 70], 
Stanford Hydra [31, 32, 53, 54], CMU STAMPede [65, 66, 67], Multiscalar [12, 26, 
35, 72], Trace processors [58, 59, 60], UPC Speculative Multithreaded [45, 46, 47], 
and Dynamic Multithreading (DMT) [3]. SPSM, Superthreaded, Hydra, and STAM-
Pede combine various software and hardware techniques. Multiscalar relies heavily on 
the hardware although compiler assistance is still needed. On the other hand, Trace, 
Speculative Multithreaded, and DMT are solely hardware-based. 
2.1 Thread Creation 
2.1.1 Dynamic Approach 
UPC Speculative Multithreaded, DMT, and Trace Processors use different criteria to 
extract multiple threads from sequential programs. 
The UPC Speculative Multithreaded architecture detects loops at run-time and generates threads
to execute the loop iterations concurrently. In [45, 46],  a single fetch stream mecha-
nism was implemented. Instructions are fetched from the same program counter and 
broadcast to all the threads. In their follow-up work [47], a loop trace was introduced 
to support multiple control-flows. Each entry in the loop trace is a sequence of the 
predicted branch directions that defines a particular control-flow. 
DMT creates threads at procedural and loop boundaries. An after-call thread exe-
cutes the instruction at the static address after the call, while the parent thread enters 
the procedure body. Likewise, an after-loop thread starts its execution at the static 
address after the loop. Although this lookahead technique exploits coarse-grained par-
allelism, it suffers from poor resource utilisation. Because threads are spawned in the 
reverse program order, the most recently-created threads are the earliest ones to retire. 
The oldest threads, which are typically further away from the main execution point, hold resources for a longer period before retiring. To solve this, an adaptive thread predictor assigns priority to threads using the lookahead distance and history patterns. The threads with higher priority will pre-empt the ones with lower priority.
Trace processors construct traces from the dynamic instruction stream. The trace 
size is restricted to 16 instructions, or even shorter if any call indirect, jump indirect, or 
return instruction is encountered. Traces are stored in the trace cache. The next-trace 
predictor [36] predicts the next instruction sequence and looks in the trace cache. If 
the trace is found, it is fetched and sent to the processing unit. Otherwise, the trace is 
constructed by fetching from the instruction cache. 
2.1.2 Static Approach 
SPSM supports the master/slave model. The program execution starts with the main 
thread. It forks new threads which are ahead of itself in the program order. The threads 
are merged when the main thread reaches the starting address of the future thread and 
the future thread encounters a suspend instruction. After merging, the main thread 
resumes the execution after the suspend. SPSM is unaware of the actual resources at 
run-time. Depending on the hardware implementation, a thread may or may not be 
successfully forked. Hence the correct program execution must be preserved whether 
each code region is executed by the main thread (fork fails) or a future thread (fork 
succeeds). 
The Superthreaded compiler partitions a program into threads and each thread into 
four pipeline stages. Continuation variables such as loop index variables are computed 
in the first stage as they are needed for sparking a new thread. The next stage computes 
target store addresses and forwards them to the successors for run-time checking of data dependencies. The main computation and data communication are performed in the following stage. Finally, the thread synchronises and commits data to the data
cache before retiring. Their thread allocation policy is to delay forking until the next 
thread processing unit is available, while the current thread continues after the fork 
instruction without stalling. Because of this, the Superthreaded's performance is likely 
to be sensitive to the workload distribution among threads. 
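The four-stage pipelining can be pictured with a short C sketch. This is not the Superthreaded compiler's actual output; the run-time primitives are hypothetical stubs, declared here only to mark the stages.

    /* Hypothetical primitives standing in for Superthreaded's run-time
       support; the names are illustrative, not its real interface. */
    extern void fork_next_thread(int next_index);   /* spark the successor    */
    extern void forward_store_address(void *addr);  /* announce a target store */
    extern int  compute(int i);
    extern void sync_and_commit(void);              /* synchronise and commit  */
    extern int  a[];

    void iteration_thread(int i)
    {
        fork_next_thread(i + 1);       /* stage 1: continuation (loop index)     */
        forward_store_address(&a[i]);  /* stage 2: target-store-address          */
        a[i] = compute(i);             /* stage 3: computation and communication */
        sync_and_commit();             /* stage 4: commit to the data cache      */
    }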
Hydra supports two types of parallel threads: subroutine (after-call) threads, and 
loop iteration threads. The subroutine threads are created automatically at run-time 
when procedure calls are encountered. However compiler support is needed to identify 
potential loops and perform source-to-source transformation for speculative parallel 
execution. Threads are manipulated at run-time by software exception handlers which 
are implemented in the speculative coprocessor. STAMPede's approach is very similar 
to Hydra's. A program is partitioned into units of execution, epochs, at compile-time 
and the software handling routines manage threads at run-time. 
Unlike SPSM, Superthreaded, Hydra, or STAMPede, Multiscalar is biased toward 
extensive hardware support for inter-task register communication, and control and data 
speculation. However, it still relies on the compiler to analyse the control-flow graph 
of a program and use heuristics to group basic blocks into tasks. A task descriptor is 
generated for each task to indicate its boundary, a list of possible successor tasks for 
the run-time control-flow speculation, and the inter-task data dependence information. 
In Hydra and STAMPede, the partial ordering between threads or epochs can be 
controlled by the compiler, by passing the thread/epoch number as an argument to the 
fork routine. In Multiscalar, the task identification number is read from the task de-
scriptor. In Superthreaded, since a new thread only starts on the next thread processing 
unit in the uni-directional ring, the thread ordering is implicitly known by the order of 
the thread processing units and the head thread pointer. 
2.2 Thread Initialisation 
When a new thread is sparked on a processing unit, its program counter is set to the 
address at which it will start execution. Local components in the processing unit, such as
register file and branch predictor, are initialised as described next. 
2.2.1 Register Context 
The most common approach is to copy the current register values from the parent's 
register file to its child's [3, 18, 45, 46, 58, 60]. Based on dataflow definitions given in
[2, 4], a register carries a live-in value at the beginning of the child thread if the child 
thread reads from this register before any writes to it. Also, since this register carries 
a live-in value to the child, it is considered to carry a live-out value from the parent. 
At the time of forking, some registers may not yet be available. There are two ways to 
handle this: 
• Enforce synchronisation in the child thread until the values are produced and forwarded from the parent.

• Use value prediction techniques to speculate the live-in values.
To enforce synchronisation in the child thread, the compiler may explicitly insert 
synchronisation primitives such as barrier or blocking receive before the instructions 
that consume the live-in values. In Multiscalar [12], a create mask is read from the task 
descriptor, which identifies all registers that may be written during the task execution. 
The task also receives an accum mask from its parent, which is the accumulation of the 
create masks of all the active predecessors. It will block if it tries to use the registers 
indicated in the accum mask whose values have not yet been received. 
Architectures that opt for the live-in value speculation include DMT, UPC Specula-
tive Multithreaded, and Trace processors. DMT allows the child thread to speculatively 
copy all the current values from its parent at the spawning point. Because the looka-
head policy spawns threads which are further away from the current execution point, it 
might incur a high misprediction rate, particularly for after-loop threads. On the other 
hand, there is often false data dependence due to register saves and restores in the pro- 
cedure call sequence. Value prediction for after-call threads is likely to be beneficial. 
Their experiments on the Spec95 benchmarks showed significant prediction accuracy; however, most benchmarks performed better when only after-call threads were allowed.
The UPC group [45,46] uses execution history from an iteration table to determine 
register predictability. The hardware initialises predictable live-in registers for a new 
thread by inserting add $R, $R, stride instructions at the beginning of the dynamic 
instruction stream. Unpredictable registers are mapped to the live-in register file. The 
child thread will stall if it tries to read those registers, until they are forwarded from 
the parent. 
In the Trace processor, before a trace is stored in the trace cache, it is preprocessed 
in the hardware by identifying local, live-in, and live-out values. When the trace is 
fetched and started on a processing unit, it receives predictable live-in values from the 
value predictor, whereas unpredictable values are obtained from the global register file 
during the trace execution. 
Finally, in Krishnan and Torrellas [39], when a new thread is initialised on a pro-
cessing unit, some registers in the local register file are invalidated while the rest (with 
existing values) are reused by the new thread. 
2.2.2 Branch Predictor 
There are at least three options for initialising the local branch predictor: 
1. Copy the branch history table from the parent thread. This approach incurs a 
higher initialisation overhead than the other two. A study by Marcuello and Gon-
zalez [49] showed that it gave a very close performance to the gshare predictor 
in the single-threaded execution, which predicts a branch by using the combined 
history of all the recent ones. The branch address and the combined history are 
exclusively-ORed (XORed) to form an index for accessing the prediction table in the gshare; a small sketch of this index computation follows the list.
2. Use the current state of the branch history table as it was left by the previous thread executing on this processing unit. This option could reduce prediction accuracy due to more arbitrary branch correlation between the previous and the current threads. Marcuello and Gonzalez [49] also showed that this option suffered at least a 10% performance degradation.
3. Initialise the branch history table to some fixed values, such as 0. With this approach, early branches in the thread have no memory of the previous execution. As the thread proceeds, the branch history is built up for later branches. An experiment by Akkary [3] showed that this scheme performed as well as the gshare predictor in the single-threaded execution.
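For reference, a minimal C sketch of the gshare indexing mentioned in option 1; the table size, history length, and counter width are arbitrary choices for illustration, not parameters from [49].

    #include <stdint.h>

    #define HIST_BITS  12                     /* history length: assumed value */
    #define TABLE_SIZE (1u << HIST_BITS)

    static uint8_t  pht[TABLE_SIZE];          /* 2-bit saturating counters     */
    static uint32_t ghr;                      /* combined (global) history     */

    /* gshare index: branch address XORed with the combined history */
    static uint32_t gshare_index(uint32_t pc) {
        return ((pc >> 2) ^ ghr) & (TABLE_SIZE - 1);
    }

    int predict_taken(uint32_t pc) {
        return pht[gshare_index(pc)] >= 2;    /* counters 2 and 3 mean taken   */
    }

    void train(uint32_t pc, int taken) {
        uint8_t *c = &pht[gshare_index(pc)];
        if (taken) { if (*c < 3) ++*c; } else { if (*c > 0) --*c; }
        ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (TABLE_SIZE - 1);
    }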
2.3 Thread Retirement 
Multithreaded execution can be broadly categorised into master/slave and predeces-
sor/successor models. Conditions as to when and how threads in these models retire, 
update program's state, or handle exceptions are different, as described next. 
2.3.1 Master/Slave Model 
In this model, the master thread maintains the state of the program. It forks slave 
threads to execute instructions which are ahead of itself in the program order. At some 
point, e.g. when a slave completes its execution, it will be merged into the master 
thread. The merge action typically induces an effect as if the slave's execution has been 
performed by the master itself. For instance, in the SPSM architecture, the register 
values updated by the slave are copied back to the master's register file. The master 
also receives the updated program counter and consequently resumes the execution 
after the last instruction executed by the slave. An exception raised by the slave will 
be delayed and handled after it has been merged into the master. 
2.3.2 Predecessor/Successor Model 
In this model, a sequential order of active threads is maintained. The head thread 
which is the first thread in the list represents the current state of the program. It is 
usually the only non-speculative thread while the others could be speculative. When 
the head thread finishes its execution and retires, the next thread in the order list be-
comes the new head thread and its state becomes the current program state. Generally, 
if a thread causes an exception, it will stall until it becomes the head thread. Then 
the instructions before the one that raised the exception are retired and the exception 
handling is processed. As mentioned in [31], if the stalled thread is mispredicted and 
aborted, the exception should be discarded because it would not have occurred in the
sequential execution. 
Steffan et al. [65] use a software interface to emulate the predecessor/successor
model, which is called one-shot threading. Instead of relying on a centralised hardware 
structure, the identification of the oldest and the least speculative epoch is controlled 
by the software, by passing a homefree token. An epoch can be forced to block until it 
receives the homefree token. It can then commit the speculated results, pass the token 
to the next epoch, and retire. 
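A minimal sketch of this token-passing discipline, written with POSIX threads rather than STAMPede's actual software handlers; all names here are hypothetical.

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int homefree = 0;        /* number of the epoch allowed to commit */

    /* An epoch blocks until it holds the homefree token, commits its
       speculative results, then passes the token to its successor. */
    void commit_in_order(int epoch, void (*commit)(void))
    {
        pthread_mutex_lock(&lock);
        while (homefree != epoch)              /* wait for the token          */
            pthread_cond_wait(&cond, &lock);
        commit();                              /* least speculative: commit   */
        homefree = epoch + 1;                  /* pass the token on           */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
    }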
2.4 Inter-thread Data Communication 
Threads communicate data in the initialisation phase and during their execution. The 
communication between threads can be categorised as follows: 
• Producer-driven. Producers initiate the communication, such as register forwarding in Multiscalar.

• Consumer-driven. Consumers initiate the communication.

• Shared medium. Producers and consumers communicate via a shared medium such as global register files or shared memory.
The register communication in Multiscalar is local reads/distributed writes, i.e. an 
instruction reads a register value from the local register file and, if tagged with a for-
ward bit, propagates the value it produces to successor tasks. Vijaykumar [72] pro-
posed register communication scheduling techniques targeted at the Multiscalar archi-
tecture. He studied four strategies for register communication: End-send forwards all 
registers at the end of the task execution; Eager-send forwards a register every time 
it is modified; Last-send forwards a register after its last modification; and Spec-send 
forwards a register when there is a high probability that it will not be modified again. 
The first two strategies do not require any compiler support, whereas the others require 
dataflow analysis to determine the last modification of each register. Eager-send and 
Spec-send also involve squashing threads and re-forwarding the values. 
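The four strategies can be contrasted on a small fragment. The comments mark where each strategy would forward the value that, by assumption, the compiler keeps in a register r; the code itself is illustrative only.

    int task(int a, int b)
    {
        int r = a + b;  /* Eager-send: forward r after this write ...          */
        r = r * 2;      /* ... and again here (the earlier value must be
                           squashed in successors and re-forwarded)            */
                        /* Last-send: forward r exactly once, here, after its
                           last modification; Spec-send: forward here if a
                           further write is judged unlikely                    */
        return r;       /* End-send: forward all registers only at task end    */
    }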
Traces in the Trace processor communicate via a global register file. During the 
execution, the producer trace sends live-out values to global result buses, whereas the 
consumer reads from the global register file or monitors the buses. 
Superthreaded forwards memory data instead of registers. A thread computes tar- 
get store addresses and passes them to its successor. The successor will stall if it tries 
to load from these addresses before the data is made available. As soon as the pre-
decessor stores data in its own memory buffer, the data and the store address will be 
forwarded to the next thread. 
2.5 Synchronisation 
There are computations such as reduction operations in which the ordering of threads 
is irrelevant; however, only one thread should be allowed to update shared data at any 
time. This section focuses on two types of synchronisation to handle this situation: 
code locking and data locking. 
Code locking permits one thread at a time to execute the code inside the criti-
cal section. Common synchronisation techniques include mutex locks, condition
variables, and semaphores [14, 41]. The synchronisation variables used in all these 
techniques are stored in global registers or shared-memory areas. Architectures that 
support speculation may allow only non-speculative threads to execute the critical sec-
tion, as suggested in [68]. The restriction prevents speculative threads from impeding 
the non-speculative ones. 
At a fine-grained level, data locking enforces synchronisation on data items. A 
widely-used technique is multiple readers/single writer locks [14, 28, 50]. There are 
three variations to this scheme: reader preference, writer preference, and fair lock. 
All of them require readers to block until the current writer finishes. With a reader 
preference lock, once there are readers currently active, new readers that arrive can 
proceed even though there is a writer waiting. Conversely, with a writer preference 
lock, the current readers are suspended if a new writer arrives. In the case of a fair 
lock, new readers wait until earlier writers finish, while a new writer waits until both 
readers and writers before it finish. Furthermore, many multiprocessors support atomic read-modify-write operations such as test-and-set and fetch-and-op.

Table 2.1: Categories of thread-level data speculation

    value speculation                  register values; memory load values
    memory dependence speculation      ambiguous memory references
    register dependence speculation    inter-thread register communication
An alternative lock-free technique uses a pair of load-linked and store-conditional 
instructions [7, 57]. A thread executes a load-linked instruction to load an original 
value from a memory location, performs further computation, and tries to store a new 
value back using a store-conditional instruction. The load-linked approach does not 
prevent the other threads from loading the data or executing the critical code following 
it. However, only one thread will successfully store the new data back to the memory. 
The others whose store-conditionals failed may retry the computation. 
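The retry pattern can be sketched with C11 atomics; this is an illustration, not code from the cited works. atomic_compare_exchange_weak plays the role of the store-conditional: it fails, and the loop retries, if another thread updated the location since it was read. On MIPS-like targets a weak compare-and-swap is typically compiled to an ll/sc pair.

    #include <stdatomic.h>

    /* Lock-free accumulation into a shared counter. */
    void atomic_add(_Atomic long *shared, long delta)
    {
        long observed = atomic_load(shared);          /* "load-linked" read */
        while (!atomic_compare_exchange_weak(shared, &observed,
                                             observed + delta)) {
            /* failure refreshed 'observed'; recompute and retry */
        }
    }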
2.6 Thread-Level Speculation 
Thread-level control speculation enables threads to start execution before the condi-
tions on which they are dependent are resolved. On the other hand, thread-level data 
speculation enables threads to continue the execution in spite of data dependence be-
tween concurrent threads. It is further categorised as illustrated in Table 2.1. Value 
speculation speculates on register or memory load values. Memory dependence spec-
ulation conventionally speculates in the midst of ambiguous memory references. Fi-
nally, register dependence speculation assists inter-thread register communication. 
2.6.1 Control Speculation 
In SPSM, Superthreaded, Hydra, and STAMPede, thread-level control speculation is 
performed by the compilers. In Multiscalar, tasks and their associated task descriptors 
are generated at compile-time. At run-time, the global sequencer predicts the next task 
which is one of the possible successors indicated in the current task descriptor. 
Both the task predictor in Multiscalar and the trace predictor in Trace processors are 
based on path-based trace predictors proposed by Jacobson et al. [35, 36]. Clustered 
Speculative Multithreaded [47] uses a loop trace which is also adapted from Jacobson's 
to predict control flows of the loops containing multiple conditional branches. An 
adaptive thread predictor in DMT assigns priority to threads using criteria such as 
lookahead distances and global history. 
Misspeculation penalty at the thread level can be higher than in the case of individual branch prediction. Because the predicted branch is usually the last instruction
in the thread, it takes many cycles before the branch is finally resolved and the wrong 
thread is squashed. To keep the misspeculation penalty as low as possible, many ar-
chitectures and compiler techniques include low-confidence branches within the threads and expose high-confidence branches to the thread-level speculation. In practice, the
embedded branches may have even lower predictability than when they are predicted 
in the sequential execution. This is because the local branch predictors do not have a 
complete view of the continuous (global) dynamic instruction stream. 
The point where both paths of a conditional branch rejoin indicates the start of the 
control-independent path of that branch. Since the control-independent path will be 
executed regardless of the outcome of the branch, another thread can be launched to 
execute this path in parallel with the main and the control-speculated threads. The 
control-independent thread must also be treated as a speculative thread because it may 
still be data dependent on either path of the branch. This aspect of control indepen-
dence has been studied in detail by Rotenberg [59, 60]. 
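A small C fragment (illustrative only) makes the reconvergence point concrete:

    int reconverge(int cond, int a, int b)
    {
        int x;
        if (cond) x = a * 2;   /* speculated path                              */
        else      x = b + 1;   /* alternative path                             */
        return x + 7;          /* control-independent: executed either way,
                                  yet still data dependent on x, so a thread
                                  started here must remain speculative         */
    }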
On the other hand, there are works such as Threaded Multiple Path Execution (TME) [73] and Selective Dual Path Execution (SDPE) [33] that allow the execution of both
paths of the hard-to-predict branches. TME spawns threads to execute the less likely 
paths when there are fewer threads running than the available hardware contexts. SDPE 
investigates dual-path forking policies in detail. 
2.6.2 Register Speculation 
Well-known value speculation techniques in superscalars [42, 43, 63] are last value, 
stride, and context-based predictors. They are based on the history pattern seen by 
individual instruction operands. Nakra et al. [52] proposed path-based value predic-
tors to predict values along different control-flow paths. The idea of correlating the 
prediction history with control-flow traces is employed in multithreaded architectures 
[47, 48, 60]. These architectures achieve significant performance improvements by limiting the speculation to only high-confidence, live-in registers.
Register dependence speculation is performed in conjunction with register commu-
nication. It speculates whether a register is written for the last time in a thread. After 
the predicted point, the register communication hardware (or software) assumes that 
there is no further read-after-write dependence, caused by this register, from this thread 
to the others. A register forwarding strategy, Spec-send, proposed by Vijaykumar [72] 
speculatively forwards a register when it is unlikely to be further updated. An update 
probability is assigned by the compiler to each register in each basic block of a task, 
using profile information and data flow analysis. 
UPC Speculative Multithreaded predicts the number of writes to each register by 
each thread. Once a thread performs the predicted number of writes, it will forward the 
register to the next thread. Misprediction is detected when the number of actual writes 
exceeds the predicted number. 
2.6.3 Memory Speculation 
Each thread processing unit (or processor) is typically equipped with a private memory 
buffer or L1 cache to keep results from the thread execution. In the sequential control-
flow, RAW or read-after-write dependence occurs when an instruction reads a value 
which has been written by its predecessor; WAR or write-after-read dependence occurs 
when an instruction writes a new value to a memory location (or register) after the 
old one has been read by its predecessor; and WAW or write-after-write dependence 
occurs when an instruction writes a value to the same memory location (or register) 
as its predecessor. In the multithreaded execution, the multiple versions of memory 
data must be handled properly to honour the RAW, WAR, and WAW dependencies. 
Generally, a load must see the latest store to the same address (RAW rule) and should 
not be aware of stores to the same location by successor threads (WAR rule). It must 
be squashed and re-executed if it has read the wrong version of the data. Finally, 
concurrent threads perform write-back to the shared memory in the correct sequential 
order (WAW rule). 
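A small C fragment illustrating the three dependence classes on a single memory location:

    void dependences(int *a, int i, int x, int z, int *y)
    {
        a[i] = x;      /* S1: write                                          */
        *y   = a[i];   /* S2: read  -- RAW with S1: S2 must see the value x  */
        a[i] = z;      /* S3: write -- WAR with S2: S2 must not observe z;
                          WAW with S1: z must survive as the final value     */
    }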
Hydra and STAMPede allow threads to dynamically switch between speculative 
and non-speculative execution. A speculative region is marked by start_speculation and end_speculation instructions. Because a thread can store to the shared memory
when it is non-speculative, the compilers must ensure that store operations outside the 
speculative regions are safe. They use hardware to detect dependence violation and 
software to control recovery actions. Hydra relies on a snooping-bus-based mecha- 
nism. When a processor writes back to the next level shared memory (L2 cache), all 
the other processors watch the write bus to detect the violation. On the other hand, 
STAMPede extends invalidation-based cache coherence. When an epoch stores to a 
location that has been speculatively loaded, it sends invalidation signals to the con-
sumer epochs. The consumers detect the violation by comparing their sequence orders 
with the producer's. 
More complicated approaches include Address Resolution Buffer (ARB) [26, 27] 
and Speculative Versioning Cache (SVC) [30]. Both of them aggressively perform 
memory speculation, i.e. every load and store can be executed as soon as its address 
is known even if memory references in the preceding tasks are still unresolved. ARB
is a centralised structure. It keeps all versions of the data from all tasks, and conse-
quently suffers from limited bandwidth and long access delays. In contrast, SVC is 
a decentralised structure. The memory references are spread across multiple caches. 
Although it solves the problems in the ARB, the SVC incurs a lower hit rate and a larger
amount of communication between caches. 
2.7 Hierarchical Organisation and Clusters 
The M-Machine [23] has two levels of concurrency. As illustrated in Figure 2.1(a), 
V-threads share the same set of processing units and can be swapped in and out of the 
processors. A V-thread is composed of subthreads or H-threads which simultaneously 
execute on separate processing units. In contrast, the two-dimensional Superthreaded 
[68], as shown in Figure 2.1(b), has X-threads allocated to different processing units, 
each of which comprises of multiple resident Y-threads. Normal policies for context 
switching are round-robin and event-trigger (e.g. cache misses). A major advantage 
Chapter 2, Literature Survey 
	 24 
XO 	Xl 	X2 
V2 
V1 I1 
PUO PU1 PU2 
(a) M-Machine 
FY-11 Ed FY-11 
FPu l  PU1 PU2 
(b) Superthreaded 
Figure 2.1: Two-level multithreaded models 
of hierarchy in the M-Machine and the two-dimensional Superthreaded is in its ability 
to exploit more parallelism, by hiding the long latency of operations such as memory 
access and inter-thread communication. 
Zahran and Franklin [78] have proposed Hierarchical Multithreading (HMT). Their 
architecture is basically a network of Multiscalar processors. A program is partitioned 
into supertasks which are assigned to the Multiscalar nodes. The supertasks are further 
broken into tasks and assigned to processing units within the nodes. The HMT takes 
advantage of coarse-grained thread-level parallelism since the supertasks are typically 
far apart in the sequential control-flow order. In addition, control and data depen-
dencies between them are minimised in order to limit the amount of communication 
between the Multiscalar nodes. 
Simultaneous Subordinate Microthreading (SSMT) [15] employs a concept simi-
lar to interrupt handling. Events, such as branch mispredictions and cache misses, 
occurring as a result of a (primary) thread's execution automatically spark specialist 
Chapter 2. Literature Survey 	 25 
microthreads. The microthreads execute optimisation routines which are written in 
the internal machine format and stored on-chip. During the microthread initialisation, 
these routines are loaded into the decode/rename stage and issued simultaneously with 
the primary thread's instructions. Another example of using separate threads to handle 
exceptions is described in Zilles et al. [80]. Exception threads are sparked to fetch 
and execute exception handlers before the normal execution resumes. By fetching the 
exception handlers separately, the main threads need not squash the instructions fol-
lowing the ones that cause the exceptions, and are able to execute the independent ones 
in parallel with the exception handling. 
In Dorai and Yeung [17], foreground threads perform high-priority or critical computation, whereas background threads perform low-priority ones. They aimed at mak-
ing the background threads transparent or having almost no impact on the performance 
of the foreground threads. Hardware resources are divided into three classes: instruc-
tion slots, instruction buffers, and memories. Competition for each type of resources 
affects the performance of the foreground threads differently. For example, the fore-
ground threads are disrupted for only a single cycle if they lose out on instruction slots
such as fetch and functional units. However, they may be disrupted for several cycles 
if they lose out on instruction buffers. Although there is little contention for mem-
ory resources such as caches and branch prediction tables, interfering accesses by the 
background threads may cause performance degradation in the foreground threads. 
As an architecture is scaled up, the wire delays become a hurdle to the overall per-
formance. Because of this, there have been proposals to group multiple processing 
units into clusters [21, 38, 39, 47].  Programs are typically partitioned, either stati-
cally or dynamically, to exploit communication locality. In general, threads that cause 
frequent communication are allocated to processing units in the same clusters. 
2.8 Other Techniques 
A new dynamic resource allocation approach has been introduced in a-Coral architec-
ture [77]. It has a large register file and a program counter queue holding the states 
of all currently-active threads in the processor, both of which are centralised. New 
threads can be spawned until the program counter queue is full. Upon thread initiali-
sation, a segment of the shared register file is allocated to the thread. The size of the 
segment depends on the number of registers each thread requires, allowing flexibility 
in the resource management. However, a drawback of the centralised structures is poor 
scalability. 
2.9 Chapter Summary 
This chapter has investigated some of the fundamental issues in multithreaded exe-
cution. These include the creation, initialisation, and retirement of threads; the in-
teraction between concurrent threads including communication, synchronisation, and 
thread-level speculation; and hierarchical structures. Relevant software and hardware 
techniques were reviewed. Some of these have inspired our compiler and architecture 
designs in the forthcoming chapters. 
Chapter 3 
The Multithreaded Processor 
Architecture and The Compiler 
The target architecture is a CMP-based multithreaded processor which was inspired 
by the hardware simplicity of the Superthreaded model [70]. The initial design was pre-
sented in [5, 6, 34]. First, hierarchical multithreaded execution is described briefly in 
Section 3.1, followed by the architectural details in Section 3.2 which include novel 
features to support hierarchical execution, register synchronisation and forwarding, 
and speculation. The multithreaded instructions are described in Section 3.3, and the 
implementation of the multithreaded processor simulator in Section 3.4. Finally, the 
multithreaded compiler is described in Section 3.5. 
3.1 Hierarchical Multithreaded Execution 
As part of the master/slave execution model [18, 23, 68, 78], a thread can execute
a command to form a cluster of slave TPUs during program execution. Each slave 
thread, which runs on the slave TPU, could in turn form a cluster at the next level, 
and so on recursively. The master thread could free its slave TPUs by executing a 
command to release the cluster. Hence, clusters in our context are dynamic and logical 
entities. The thread processing units in a cluster are logically connected to each other 
in a uni-directional ring and operate in the predecessor/successor style [45, 65, 68]. 
Threads can be created or forked in two directions: the master thread forks a new 
slave in the vertical direction, while the slave thread forks the next one in the horizontal 
direction. When a thread forks a new thread, it becomes the parent of that new thread. 
If a master thread T0 vertically forks a slave thread T1, and T1 horizontally forks another 
slave thread T2, then the relationships between T0, T1, and T2 would be: 
• For master/slave relationships, T0 is the master of T1 and T2 (conversely, T1 and 
T2 are the slaves of T0). 
• For parent/child relationships, T0 is the parent of T1, and T1 is the parent of T2. Thus, T1 is the child of T0, and T2 is the child of T1, respectively.
As in the predecessor/successor model, the slaves retire and update the cluster's state, 
instead of the processor's state, in a sequential order. Since the cluster's state is main-
tained by the master thread, this is also equivalent to merger in the master/slave model. 
Upon merger, register values, program counter, and speculative results of the slaves 
are transferred to the master's. 
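To make the two relations concrete, the following C++ sketch models the scenario above; the struct and field names are illustrative only and are not part of the architecture.

#include <vector>

// A minimal model of the two relations, assuming the scenario above:
// T0 vertically forks T1, and T1 horizontally forks T2.
struct Thread {
    Thread* master = nullptr;          // master/slave relation
    Thread* parent = nullptr;          // parent/child relation
    std::vector<Thread*> slaves;       // maintained by the master
    std::vector<Thread*> children;     // maintained by the parent
};

int main() {
    Thread t0, t1, t2;
    // Vertical fork: T0 is both master and parent of T1.
    t1.master = &t0;  t1.parent = &t0;
    t0.slaves.push_back(&t1);  t0.children.push_back(&t1);
    // Horizontal fork: T2 keeps T0 as master, but T1 is its parent.
    t2.master = &t0;  t2.parent = &t1;
    t0.slaves.push_back(&t2);  t1.children.push_back(&t2);
    return 0;
}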
In order to incorporate this idea into the original design [5, 6, 34], additional fea-
tures were introduced in the Global Thread Control Unit (GTCU), Local Thread Con-
trol Units (LTCUs) and Speculative Buffers. Furthermore, additional multithreaded 
instructions were proposed to support hierarchical and speculative execution. 








Figure 3.1: The target multithreaded architecture. Each thread processing unit contains a communication unit, a speculative buffer, and a temporary buffer; the TPUs share the 1st level shared memory.
3.2 Description of the Architecture 
Figure 3.1 depicts an overview of the multithreaded architecture. The processor con-
sists of a number of identical Thread Processing Units (TPUs). At the start of program 
execution, the First Level Scheduler (FLS) fetches instructions from the central instruc-
tion cache and passes them to the instruction buffer of a head thread which, by default, 
always runs on TPU 0. 
3.2.1 Global Thread Control Unit (GTCU) 
As the architecture relies on static program partitioning, the thread sequence according to the sequential semantics has to be conveyed from the compiler to the hardware. Be-
sides controlling the retirement order of concurrent threads, the sequence information 
is needed for handling multiple versions of loads/stores in speculative execution. The 
Global Thread Control Unit (GTCU) was added to the original design, which main-
tains the relative order, by ascending sequence numbers, of all the active threads in the 
processor. If multiple threads have the same sequence number, then they are ordered 
by the time of creation, starting from the oldest. A sequence number is assigned to a 
thread either explicitly or implicitly, as described next. 
Explicit assignment. For a normal fork operation (frk instruction in Section 
3.3.1), a sequence number is given as an argument of the fork. 
Implicit assignment. In the cases of vertical and horizontal fork operations (yfrk 
and xfrk instructions in Section 3.3.2), a child is given the same sequence num-
ber as the parent's, and the master/slave relationship has priority over the par-
ent/child relationship. Thus, slave threads are inserted in the order list immedi-
ately after the master and following the parent/child relationship between them. 
The thread sequence is updated each time a new thread is forked or an existing 
thread retires, by receiving signals from the Local Thread Control Units (LTCUs). 
The GTCU also maintains a pointer to the head thread, which is generally the oldest 
running thread on the processor. 
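The resulting ordering rule can be illustrated with a small comparison function; the field names seq and birth below are assumptions made for the sketch, not names used by the GTCU.

// Relative order of two active threads in the GTCU's order list:
// ascending sequence number first; among equal sequence numbers,
// the older thread (earlier creation time) comes first.
struct ThreadEntry {
    int  seq;      // sequence number, explicit or inherited from the parent
    long birth;    // creation time, used to break ties
};

bool precedes(const ThreadEntry& a, const ThreadEntry& b) {
    if (a.seq != b.seq) return a.seq < b.seq;
    return a.birth < b.birth;
}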
3.2.2 Thread Issue Unit (TIU) 
The TIU decodes instructions and passes them to the corresponding execution units. 
Instructions are issued in-order from the instruction buffer but can be executed out-of-
order as soon as resources are available. The instruction-level parallelism is exposed by 
the compiler's instruction scheduling and optimisation techniques. Normal arithmetic 
and memory instructions are sent to ALUs and MUs respectively, while multithreaded 
instructions are sent to the LTCU. 
3.2.3 Local Thread Control Unit (LTCU) 
The LTCU executes the multithreaded instructions. It also maintains the following 
information: 
• Parent Address. It is set when a thread is initialised.
• Child Addresses Table. A child address is added to the table if the fork operation succeeds. As soon as the parent thread retires, its children will be notified to invalidate the parent address.
• Slave Addresses Table. This table is set when a thread successfully forms a slave cluster and is cleared when the cluster is released.
• Master Address. It is set for a slave thread to retain the address of its master.
Figure 3.2: State transitions for (W, U) in a register: (a) fork; (b) register communication
3.2.4 Register File 
A simple register synchronisation and forwarding mechanism was proposed. As a 
preliminary study, the register forwarding is restricted to only from parent to child 
threads. Each register in the local register file is associated with the following two bits:
• W bit. This bit is set for the child thread's register to enforce synchronisation 
until the register is forwarded from its parent. If the thread tries to read a register 
whose W bit is set, then it has to wait until the bit is turned off. If the thread 
writes this register for the first time (before any read), then the W bit is turned 
off since it no longer has to wait for the value forwarded from the parent. 
• U bit. This bit is set, prior to forking a new thread, for the parent thread's register 
whose value is unavailable to the child. 
When a new thread is initialised, the register values are copied from the parent's 
register file to the child's. Figure 3.2 (a) depicts the state transitions of a register from 
parent to child: (W(parent), U(parent)) → (W(child), U(child)). The initial state
of the child's register is set to either (1,0) or (0,0). The former implies that the register 
might be live-in; the latter implies that the register is dead-in. If the state of the parent's 
register is (1,1) or (0,1), then the child's state is set to (1,0) which indicates that the 
register might be live-out from the parent but its value is not yet available to the child. 
After the parent produces a value for the live-out register, it can be forwarded to 
the child. The forwarding operation resets the U and W bits in the parent's and child's 
registers, respectively. Figure 3.2 (b) depicts the state transitions of a register of the 
same thread due to the register communication: (W, U) -action-> (W, U). The thread can
forward a register to its children only if the state is (0,0) or (0,1), i.e. it is not waiting 
for that register itself. Upon receiving the values, the receivers set their corresponding 
W bits to 0. Finally, a set of live-out registers whose values have not yet been produced 
can be declared by executing uregs (see Section 3.3). This instruction sets the specified 
U bits to 1 and consequently enforces synchronisation in the successor threads when 
they try to read those registers. 
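The behaviour of the two bits can be summarised in a short sketch; the structure below is a software model of the mechanism, with illustrative names, rather than the hardware itself.

// One register with its W (wait) and U (unavailable) bits.
struct Reg {
    bool W = false;    // set: reads must wait for a forwarded value
    bool U = false;    // set: value not yet available to the child
    int  value = 0;
};

// Fork: the child inherits the parent's value; if the parent's U bit
// is set, the child starts in state (1,0), otherwise in (0,0).
Reg fork_copy(const Reg& parent) {
    Reg child;
    child.value = parent.value;
    child.W = parent.U;
    return child;
}

// Forwarding: legal only when the parent is not itself waiting (its
// state is (0,0) or (0,1)); it clears the parent's U bit and the
// child's W bit.
void forward(Reg& parent, Reg& child) {
    if (!parent.W) {
        parent.U = false;
        child.value = parent.value;
        child.W = false;
    }
}

// A first write by the child (before any read) also clears its W bit,
// since the forwarded value is no longer needed.
void child_write(Reg& child, int v) {
    child.value = v;
    child.W = false;
}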
3.2.5 Speculative Buffer 
The speculation is almost entirely controlled by the compiler. The hardware support is 
very simple, as described next. 
A thread can switch between non-speculative and speculative modes during its exe-
cution, in the same style as in STAMPede [65].  When the thread becomes speculative, 
it writes to the speculative buffer instead of to the shared memory. These stores are 
flushed to the memory when the thread commits. If the thread stops without commit-
ting these stores, then the speculative buffer is simply cleared. For a load operation, 
the thread should see the latest version of the data, as if the program were executed in sequential order. First, it checks the load address in its own buffer. If the address is not found, it looks up the predecessors' buffers. Finally, if the address is still not found in any of the predecessors' buffers, then it loads from the memory. The
information as to which threads are the predecessors of the current thread is obtained 
by scanning the thread order list maintained by the GTCU. If the dependency distance 
between threads is large, i.e. a thread is data dependent on a predecessor which is far 
ahead of itself in the order list, then the overhead of loading can be quite high. Com-
piler techniques such as loop unrolling [20] can reduce the dependency distance so that 
the thread is only data dependent on its immediate predecessor. 
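The resulting load path can be sketched as follows; the container types and names are stand-ins for the hardware buffers and the GTCU's order list, chosen for illustration.

#include <cstdint>
#include <unordered_map>
#include <vector>

using Addr = std::uint32_t;
using Word = std::uint32_t;

// Speculative stores of one thread, indexed by address.
struct SpecBuffer {
    std::unordered_map<Addr, Word> stores;
    bool lookup(Addr a, Word& out) const {
        auto it = stores.find(a);
        if (it == stores.end()) return false;
        out = it->second;
        return true;
    }
};

// preds holds the predecessors' buffers, nearest predecessor first,
// as obtained by scanning the GTCU's order list.
Word spec_load(Addr a, const SpecBuffer& own,
               const std::vector<const SpecBuffer*>& preds,
               const std::unordered_map<Addr, Word>& memory) {
    Word v;
    if (own.lookup(a, v)) return v;              // 1. own buffer
    for (const SpecBuffer* p : preds)            // 2. predecessors, in order
        if (p->lookup(a, v)) return v;
    auto it = memory.find(a);                    // 3. shared memory
    return (it != memory.end()) ? it->second : 0;
}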
Both misspeculation detection and recovery, however, are performed in software. Misspeculation is handled by aborting the wrong thread and starting a new one to execute the correct path.
In the non-speculative mode, the thread directly reads from and writes to the shared 
memory. The compiler determines whether a load/store operation is safe and chooses 
the execution mode accordingly, in order to guarantee the program correctness. For 
example, a thread can store its result in the speculative buffer and then load data from 
the shared memory, by switching from the speculative mode (before the store) to the 
non-speculative one (before the load). 
In the hierarchical execution, if the master thread is speculative, then its slaves 
should also run in the speculative mode. Our execution model expects the master 
to only fork slave threads to execute parts of the program which are logically ahead 
of itself. As shown in Figure 3.3, there is temporary storage inside the speculative
buffer, which maintains the cluster's state. When the slaves are merged into the master, 
their register and memory updates are collected as the master's temporary state (or the 
cluster's state). As soon as the clustered execution is completed and the cluster is freed, 
the temporary register values (including the program counter) will be transferred to the current register values, while the speculative stores from the slaves will be transferred to the master's speculative stores instead of being flushed to the shared memory.

Token	Description
1,2	The first slave is merged. Registers and speculative stores are saved to the master's temporary state.
3,4	The Nth slave is merged. Registers and speculative stores are saved to the master's temporary state.
5,6	Master releases the cluster.
5	Temporary register updates are transferred to the register file.
6	Temporary speculative stores are transferred to the speculative stores.
7	Master retires. Registers and speculative stores are saved to the temporary state of the higher-level master.
8	The highest-level master commits speculative stores to the shared memory.

Figure 3.3: Retirement actions in the hierarchical-speculative execution.
3.2.6 Inter-thread Communication Unit 
The inter-thread communication unit takes care of the signal transmission between 
the TPUs. It contains a signal buffer and, depending on the implementation, signal 
handlers for some particular signals. In the absence of the signal handlers, the signal 
transmission can be used as a synchronisation mechanism, for example, between the 
memory load/store operations. 
When a signal is transmitted, its number is written into the buffer of the target TPU. A thread that executes a wait-signal instruction checks its signal buffer and blocks until the signal has arrived. Then it either invokes the signal handling routine or simply continues its execution, after which the signal is removed from the buffer. A signal may be lost before it is processed if the buffer is full and an old signal is overwritten by a new one.
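A sketch of this buffer behaviour is given below; the capacity and the overwrite policy for a full buffer are assumptions consistent with the description above.

#include <cstddef>
#include <deque>

struct SignalBuffer {
    std::deque<int> buf;
    std::size_t capacity = 4;          // assumed buffer size

    // Non-blocking send: a full buffer drops its oldest entry,
    // which is how a signal can be lost before being processed.
    void send(int sig) {
        if (buf.size() == capacity) buf.pop_front();
        buf.push_back(sig);
    }

    // One poll of a blocking receive: true once the awaited signal
    // is present, removing it from the buffer; otherwise the thread
    // stays blocked and polls again.
    bool try_receive(int sig) {
        for (auto it = buf.begin(); it != buf.end(); ++it) {
            if (*it == sig) { buf.erase(it); return true; }
        }
        return false;
    }
};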
3.3 Multithreaded Instructions 
A subset of the standard MIPS instructions [22, 37] was augmented with multithreaded instructions. These can be categorised into four groups: basic instructions (Tables 3.2 and 3.3), auxiliary instructions (Table 3.4), instructions that support hierarchical execution (Tables 3.5 and 3.6), and instructions that support speculative execution (Table 3.7). In
the tables, $s and $d denote source and destination register operands, respectively; L 
is a label; and I is an integer value. Description and pseudo-code of each instruc-
tion are also provided in the tables. Pseudo-functions, other than the ones associated 
with the instructions, perform operations as indicated by the names of these functions 
(examples are listed in Table 3.1). 
Most multithreaded operations are guarded. The semantics of a guarded operation is to evaluate the guard operand: if the condition is true, then the instruction is executed; otherwise it is treated as a nop instruction. As with thread creation or forking in SPSM [18], the fork operation may succeed or fail depending on the availability of resources at run-time. Subsequent multithreaded instructions must therefore be executed under guarded conditions to preserve program correctness, particularly those involved in synchronisation and communication, as they could potentially cause deadlocks. In order to prevent premature program termination, a thread needs to check, on occasion, whether it has successfully forked the next one before it retires. Thus, the retirement operation is also guarded.
3.3.1 Multithreaded Instructions Group 1 
Tables 3.2 and 3.3 summarise the basic instructions for simple non-speculative multi-
threaded programs. They are modified from the preliminary proposal in [5, 34]. 
Table 3.1 Examples of pseudo-functions

Function
1	TPU& get_TPUs(num_TPUs)
	// Get available TPUs and return a pointer to these TPUs
2	bool TPU::avail()
	// Check whether the TPU is available
3	void thread_init(TPU&, sequence, label)
	// Initialise a new thread on the TPU
4	void cluster_init(master_thread, TPU&)
	// Initialise a new cluster
// Thread's operations
5	void thread::wait_signal(signal)
6	void thread::get_signal(signal)
7	void thread::save_pc(label)	// Set OPC = PC and PC = label.PC
8	void thread::restore_pc()	// Set PC = OPC
9	void thread::interrupted()
10	void thread::stop(merge)
11	void thread::commit()
12	void thread::set_head(thread)	// Nominate a new head thread
// Master thread's operations
13	void master_thread::pass_signal(slave_thread&, signal)
14	void master_thread::cluster_release()
15	void master_thread::cluster_abort()
Table 3.2 Multithreaded Instructions Group 1 (continued in Table 3.3)

Instruction	Description

frk $d, $s1, L	Fork a new thread to execute target label L. Return d = TRUE if successful. s1 is a sequence number associated with the new thread.

	op frk(op s1, op L) {
	    if (pt_TPUs = get_TPUs(1)) {
	        d = TRUE;
	        thread_init(pt_TPUs, s1, L);
	    }
	    else d = FALSE;
	    return d;
	}

stp $s1, $s2	If guard s1 is set, then stop. If it is the head thread, set thread s2 as the new head.

	void stp(op s1, op s2) {
	    if (s1) {
	        thisthread.set_head(s2);
	        thisthread.stop(merge = FALSE);
	    }
	}

sstp $s1, $s2	Wait for the synchronisation signal and pass it to thread s2. If guard s1 is set, stop. If it is the head thread, set s2 as the new head.

	void sstp(op s1, op s2) {
	    thisthread.wait_signal(SYNCH);
	    if (s1) {
	        thisthread.set_head(s2);
	        thisthread.stop(merge = FALSE);
	    }
	    s2.get_signal(SYNCH);
	}
Table 3.3 Multithreaded Instructions Group 1 (continued from Table 3.2)

Instruction	Description

psg $s1, $s2, I	If guard s1 is set, pass signal I to thread s2.

	void psg(op s1, op s2, op I)
	{ if (s1) s2.get_signal(I); }

wat $s1, I	If guard s1 is set, wait until signal I is received.

	void wat(op s1, op I)
	{ if (s1) thisthread.wait_signal(I); }

isg $s1, $s2, L	If guard s1 is set, interrupt the execution of thread s2. The interrupted thread jumps to label L.

	void isg(op s1, op s2, op L) {
	    if (s1) {
	        s2.save_pc(L);
	        s2.interrupted();
	    }
	}

mop $s1	If guard s1 is set, move the old program counter to the current program counter.

	void mop(op s1)
	{ if (s1) thisthread.restore_pc(); }
The frk instruction forks a new thread on an available TPU and returns TRUE, or returns FALSE if no TPU is available. The TPU address of the newly-forked thread is retrieved by executing cadr (see Table 3.4). An alternative design is to return the child's TPU address if the fork succeeds, or an INVALID value (e.g. -1) if it fails. However, we opted for the first approach because the values TRUE/FALSE are handy to use either as guards in the subsequent multithreaded instructions or as operands in conventional branch instructions (e.g. beqz).
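The resulting guard pattern, in the pseudo-code convention of Tables 3.1 to 3.3, looks roughly as follows; the label and variable names here are hypothetical.

succ = frk (seq, WORK_LABEL);     // succ = TRUE if a TPU was available
mychild = cadr ();                // address of the newly-forked thread
psg (succ, mychild, START);       // signal the child only if the fork succeeded
...                               // main computation
stp (succ, mychild);              // guarded retirement: a nop when succ = FALSE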
A thread can stop either with or without waiting for a synchronisation signal, by 
executing either the stp or sstp instructions. Before the thread stops, it will nominate 
a new head thread. The nomination is valid only if the nominating thread is currently pointed to by the head pointer. If the nominated thread is not active, then the GTCU will move the head pointer to the next thread in the order list.
The psg and wat instructions communicate signals. The psg is a non-blocking send 
while the wat is a blocking receive. The psg puts the signal number in the receiver's 
signal buffer. When the thread executes wat and receives the signal, it either performs 
a sequence of actions as specified by the signal handler or continues its execution if 
there is no handler for that signal. 
The isg instruction interrupts the execution of the target thread. Having been inter-
rupted, the thread saves its current program counter (PC) to the old program counter 
(OPC) before branching to the interrupt handling routine. The thread may invoke a 
default hardware procedure by sending a signal to itself and receiving that signal. Fi-
nally, the mop instruction can be inserted at the end of the interrupt. When the thread 
reaches mop, it copies the content of the OPC back to the PC and resumes its execution. 
Table 3.4 Multithreaded Instructions Group 2

Instruction	Description

adr $d	Return the address of this thread.
	op adr() { return thisthread.address; }

padr $d	Return the address of the parent of this thread.
	op padr() { return thisthread.parent.address; }

cadr $d	Return the address of the most recent child of this thread.
	op cadr() { return thisthread.children[last].address; }

hadr $d	Return the address of the head thread.
	op hadr() { return thisthread.head.address; }
3.3.2 Multithreaded Instructions Group 2 
The auxiliary instructions adr, padr, cadr, and hadr are summarised in Table 3.4. They do not have guard operands since the execution of these instructions has no side-effect on the state of the TPU or the processor. The cadr, as mentioned earlier, is used as a complement to frk and might be omitted in an alternative design. The adr, padr, and hadr can be replaced by additional software routines that keep track of the thread information, as was done in the preliminary work [34].
3.3.3 Multithreaded Instructions Group 3 
The instructions in Tables 3.5 and 3.6 support hierarchical multithreaded execution. 
They can be emulated by a sequence of the basic multithreaded instructions described 
earlier, at the expense of additional software thread manipulation costs. 
Table 3.5 Multithreaded Instructions Group 3 (continued in Table 3.6)

Instruction	Description

cform $d, $s1	Form a cluster of s1 slave TPUs. Return d = TRUE, if successful.

	op cform(op s1) {
	    if (pt_TPUs = get_TPUs(s1)) {
	        d = TRUE;
	        cluster_init(thisthread, pt_TPUs);
	    }
	    else d = FALSE;
	    return d;
	}

yfrk $s1, $d, L	Vertical fork. If guard s1 is set, fork a new thread on the first slave TPU, and return d = TRUE, if successful. The new thread executes label L.

	op yfrk(op s1, op L) {
	    if (s1 && pt_TPUs = thisthread.slave_TPUs[0].avail()) {
	        d = TRUE;
	        thread_init(pt_TPUs, thisthread.seq, L);
	    }
	    else d = FALSE;
	    return d;
	}

xfrk $s1, $d, L	Horizontal fork. If guard s1 is set, fork a new thread on the next slave TPU, and return d = TRUE, if successful. The new thread executes label L.

	op xfrk(op s1, op L) {
	    if (s1 && pt_TPUs = thisthread.next_TPU.avail()) {
	        d = TRUE;
	        thread_init(pt_TPUs, thisthread.seq, L);
	    }
	    else d = FALSE;
	    return d;
	}
Table 3.6 Multithreaded Instructions Group 3 (continued from Table 3.5)

Instruction	Description

crels $s1, $s2	Cluster release. Execute if guard s1 is set. If s2 = TRUE, send the synchronisation signal to the slaves and free the cluster when the signal returns. Otherwise, abort the slaves and release the cluster.

	void crels(op s1, op s2) {
	    if (s1) {
	        if (s2) {
	            thisthread.pass_signal(pt_slaves, SYNCH);
	            thisthread.wait_signal(SYNCH);
	        }
	        else thisthread.cluster_abort();
	        thisthread.cluster_release();
	    }
	}

xstp $s1, $s2	Similar to sstp. The synchronisation signal is passed to the next slave, or back to the master if this is the last active slave. s2 indicates whether the slave's state is merged into the master's.

	void xstp(op s1, op s2) {
	    thisthread.wait_signal(SYNCH);
	    if (s1) thisthread.stop(merge = s2);
	    thisthread.pt_next.get_signal(SYNCH);
	}
The cform instruction checks for available TPUs in the processor and reserves 
them as slave TPUs. A TPU is considered to be available if it is unclustered and there 
is no thread running on it. 
A thread can fork new threads in two directions: vertical fork (yfrk) and horizontal fork (xfrk). The yfrk is executed by the master thread to fork a child on the first slave TPU. The slave then executes xfrk to fork a successor thread on the next slave TPU. The slaves are inserted into the list maintained by the GTCU, after their master and in the order in which they are forked. Hence the sequence number is not given in either instruction. An assumption is that the master should only fork the slaves to execute program partitions which are encountered after the one executed by the master, according to the sequential semantics. Figure 3.4 (a) demonstrates the use of yfrk and xfrk, where the threads in level 1 execute the outer loop iterations and the ones in level 2 execute the inner loop iterations. The order of the currently running threads is depth-first or in-order: T0, T1, T11, T12, T2, T21, T22, T3, T4.
The master and slave threads synchronise by executing crels and xstp instructions, 
respectively. This is equivalent to merging in the SPSM model [18]. Unlike the SPSM, 
where the main thread's merging point is implicitly the starting address of the future 
thread, the master thread in the hierarchical model explicitly executes crels. It passes 
the synchronisation signal to the first active slave and waits until all the slaves have 
been retired. When the slave executes xstp, it waits for the synchronisation signal 
before merging its state into the master's temporary state. If it is the last active slave in 
the cluster, then it passes the signal back to the master. Otherwise, it passes the signal 
to the next slave. When the signal returns to the master, it frees the slave TPUs and 
transfers its temporary state to the current state. The execution resumes after the last 
instruction executed by the last slave (as if the slaves' execution has been sequentially 
performed by the master itself). If the slave cluster is aborted, then all the slaves are interrupted from their execution and stop.

Figure 3.4: Hierarchical multithreaded execution: (a) hierarchical fork; (b) hierarchical synchronisation
Figure 3.4 (b) demonstrates the synchronisation between the threads in Figure 3.4 
(a). The order of execution is the in-order traversal of the tree: T0 → T1 → T11 → T12 → T2 → T21 → T22 → T3 → T4. Since T11 and T12 execute the inner loop of the first outer loop iteration (executed by T1), they must be merged into T1 before the synchronisation signal is passed to the next outer loop iteration (executed by T2).
When T0 aborts its cluster {T1, T2, T3, T4}, T1 and T2 also abort their next-level clusters
before stopping. 
3.3.4 Multithreaded Instructions Group 4 
The instructions in the final group, as shown in Table 3.7, support speculative execu-
tion. The safe instruction switches between the non-speculative and the speculative 
modes. By default, a thread is non-speculative when it starts the execution. When it 
becomes speculative, all the store operations write to the speculative buffer instead of 
to the shared memory. 
Because the speculative buffer is cleared when the thread stops, it must explicitly 
execute cmmt to write to the memory if the speculation is correct. In the case of 
misspeculation, the guard operand of the cmmt can be set to FALSE. The thread will 
simply stop without committing the results from the speculation. 
The uregs and fregs instructions manipulate the U and W bits of the thread's registers. The registers whose corresponding bits are to be set are specified by a mask, which encodes base registers 0-31, and an offset, an integer to be multiplied by 32 (register number = base register number + 32 × offset). The uregs sets U bits to
TRUE which indicates that the corresponding registers are unavailable to the successor 
threads. The fregs instruction sets U bits to FALSE and forwards the register values 
to the thread's successors. Upon receiving the values, the successors will set their W 
bits to FALSE. If the thread executes fregs when it has no children, then the specified 
U bits are simply switched off. 
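The encoding can be made concrete with a small decode sketch; the function below plays the role of the decode() pseudo-function used in Table 3.7, with an assumed iteration order.

#include <vector>

// Decode a 32-bit mask and an offset into register numbers:
// register number = base register number + 32 * offset.
std::vector<int> decode_mask(unsigned mask, int offset) {
    std::vector<int> regs;
    for (int base = 0; base < 32; ++base)
        if (mask & (1u << base))
            regs.push_back(base + 32 * offset);
    return regs;
}

// Example: mask 0x5 (base registers 0 and 2) with offset 1
// selects registers 32 and 34.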
Table 3.7 Multithreaded Instructions Group 4

Instruction	Description

safe $s1, $s2	Execute if guard s1 is set. If s2 = TRUE, the following stores will write through to memory. Otherwise, the following stores will write to the speculative buffer.

	void safe(op s1, op s2)
	{ if (s1) thisthread.set_mem_access(s2); }

cmmt $s1	If guard s1 is set, wait for the synchronisation signal and commit speculative stores to memory.

	void cmmt(op s1) {
	    if (s1) {
	        thisthread.wait_signal(SYNCH);
	        thisthread.commit();
	    }
	}

uregs $s1, I($s2)	If guard s1 is set, set the U bits of the registers specified by mask I and offset s2 to TRUE.

	void uregs(op s1, op I, op s2) {
	    if (s1) {
	        while (r = decode(I, s2))
	            thisthread.regs[r].set_U(TRUE);
	    }
	}

fregs $s1, I($s2)	If guard s1 is set, forward the registers specified by mask I and offset s2 to the child threads.

	void fregs(op s1, op I, op s2) {
	    if (s1) {
	        while (r = decode(I, s2)) {
	            thisthread.regs[r].set_U(FALSE);
	            // forward register r to the children, clearing their W bits
	        }
	    }
	}
Figure 3.5: An overview of the simulator: the processor model, with separate instruction definitions (assembler and instruction evaluation), built on the simulator kernel, which in turn rests on the context-switching layer
3.4 The Multithreaded Processor Simulator 
A sequential processor simulator [55] was modified to handle the multithreaded processor architecture. The basic multithreaded features were implemented in [5]; the simulator was then enhanced considerably to reflect the details described in Sections 3.1, 3.2, and 3.3.
3.4.1 Simulator Framework 
The framework is based on a process-based, discrete-event simulator. Figure 3.5 de-
picts an overview of the simulator. It was implemented in C++ and can be divided into 
three layers: processor model, simulator kernel, and context switching. 
3.4.1.1 Simulator Kernel 
The simulator kernel is a general-purpose library for discrete-event simulation. It defines the class entity, from which the processor components are derived. Entities are divided into two types: (1) participating entities such as the First Level Scheduler (FLS), Thread Processing Units (TPUs), and Arithmetic and Logic Units (ALUs); and (2) non-participating entities such as register files.

Figure 3.6: State transitions of a participating entity
A participating entity may be in one of four states: passive, active, holding, or 
pending. Figure 3.6 depicts the transitions between these states. Solid lines denote ex-
plicit transitions made by function calls and dotted lines are implicit transitions made 
by the simulator kernel. Passive entities have no control over the progress of the sim-
ulation. They can be scheduled and placed in the pending queue (see below) by active 
entities. The active entities may change the state of the simulation, schedule passive 
entities, or reschedule themselves. Holding entities are ones waiting for the state of the 
simulation to meet certain conditions before activating. Pending entities are ones being 
scheduled to activate after certain times; they are held in the pending queue which is 
ordered by the activation time. Both holding and pending entities are passive. Once 
the current active entity deactivates and there is no other active entity, the simulator 
kernel searches for a new active entity. It first checks in the holding list. If no holding 
entity is able to activate in the current state of the simulation, the kernel picks the first 
entity in the pending queue and advances the simulation time. 
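One step of this search can be sketched as follows; the entity representation and queue types are illustrative simplifications of the kernel's actual structures.

#include <functional>
#include <queue>
#include <vector>

struct Entity {
    std::function<bool()> can_activate;   // holding condition
    std::function<void()> activate;       // behaviour when activated
};

struct Pending {
    long time;                            // scheduled activation time
    Entity* e;
    bool operator>(const Pending& o) const { return time > o.time; }
};

using PendingQueue =
    std::priority_queue<Pending, std::vector<Pending>, std::greater<Pending>>;

// Find the next active entity: prefer a holding entity whose condition
// is met; otherwise take the earliest pending entity and advance time.
void kernel_step(std::vector<Entity*>& holding, PendingQueue& pending,
                 long& now) {
    for (auto it = holding.begin(); it != holding.end(); ++it) {
        if ((*it)->can_activate()) {
            Entity* e = *it;
            holding.erase(it);
            e->activate();
            return;
        }
    }
    if (!pending.empty()) {
        Pending p = pending.top();
        pending.pop();
        now = p.time;                     // advance the simulation time
        p.e->activate();
    }
}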
3.4.1.2 Context Switching 
The context-switching layer is the foundation on which the rest of the simulator is built. Enti-
ties are derived from the base class context. Each context is an operating system thread. 
Mechanisms to maintain and switch between contexts are implemented in this layer. It 
is also the only layer that handles operating system functions. 
3.4.1.3 Processor Model 
The processor is modelled in the top layer. Its components are derived from the base 
class entity and the behaviours of these components are implemented using state tran-
sition functions. For example, the Thread Issue Unit (TIU) inside an active Thread 
Processing Unit (TPU) repeatedly performs the following operations: 
1. Get a new instruction.
2. If there is a buffer hit:
(a) Hold until the required registers are unlocked.
(b) Fetch the source registers and lock the destination register.
(c) Issue the instruction to (i.e. schedule) the Arithmetic and Logic Unit, Memory Unit, or Local Thread Control Unit.
(d) Reschedule itself, accounting for the issue cycle time.
3. If there is a buffer miss:
(a) Send a request to the First Level Scheduler and passivate.
However, components such as the buses and the bus interface are not simulated separately. It was assumed that the delay for a processor component to perform its task includes the bus delay if that component accesses the buses, and that bus contention is lumped into the contention for that component. In practice, the bus delay is also
affected by distances between the components. For example, in a large application 
where several clusters of TPUs are allocated to execute several program partitions, the 
TPUs in large clusters may be more scattered than in smaller ones. Our multithreaded 
execution models, which will be described in Chapters 4 and 5, permit communication 
between parent and child threads only. Assuming that they often (if not always) execute 
on neighbouring TPUs, the communication delay which includes the bus delay is set 
to be uniform. 
Delays are measured in terms of a number of time units. The absolute number is in itself less important, but the ratio between time delays should be realistic, i.e. correspond to the architectural assumptions being made; the actual execution time can be correctly estimated once a clock speed has been set. For example, a normal ALU instruction is
split into four operations: fetch instruction, read registers, execute, and write back. 
Thus, the delay is set to 4 time units. On the other hand, the delays of the auxiliary 
multithreaded instructions (Section 3.3.2) are expected to be short, as these instructions 
are frequently used to support the multithreaded execution. The delay of the First Level 
Scheduler (FLS) is proportional to its fetch bandwidth and the size of the instruction
buffer in the TPU. For instance, if the fetch bandwidth is 2 instructions per time unit 
and the TPU's buffer holds 10 instructions, then the FLS delay for processing a request 
from a TPU would be 5 time units. 
The Global Thread Control Unit (GTCU) was implemented as a non-participating 
entity in the simulator because it does not perform any action other than maintaining 
threads' information. Accesses to the GTCU are managed in a multiple-readers/single-
writer style as the thread sequence can be updated by only one thread at any time 
(experiments on the GTCU's access delay are reported in Appendix B). 
3.4.1.4 Instruction Definitions 
The instruction set module is defined separately from the processor model. An in-
struction definition is implemented for each opcode and has two main functions: syn-
tactic analysis and instruction evaluation. At the start of the simulation, an input file is 
parsed. For each instruction, syntactic analysis is performed, which includes checking 
the number and type of operands, and tagging the type of functional units required. 
During the simulation, instructions are executed by the ALUs, MUs, or LTCUs, by 
calling the evaluation methods of the appropriate instruction definitions. 
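This two-function interface might be organised as in the sketch below; the class and method names are assumptions for illustration, not the simulator's actual code.

#include <string>
#include <vector>

// One definition per opcode: syntactic analysis at parse time,
// evaluation when an ALU, MU, or LTCU executes the instruction.
struct InstructionDef {
    virtual ~InstructionDef() = default;
    // Check operand number/type and tag the functional unit required;
    // returns false on a syntax error.
    virtual bool parse(const std::vector<std::string>& operands) = 0;
    // Perform the instruction's effect during the simulation.
    virtual void evaluate() = 0;
};

struct AddDef : InstructionDef {
    bool parse(const std::vector<std::string>& operands) override {
        return operands.size() == 3;      // add $d, $s1, $s2
    }
    void evaluate() override {
        // d = s1 + s2 (register-file access elided in this sketch)
    }
};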
There is also a set of profiling instructions as listed in Table 3.8, which are the 
interface between the program being executed and the simulator. The prb.t instruction 
prints out the probe number and the current simulation time. It is useful for measuring 
the execution length of a program fragment. The prb.a and prb.ai instructions register 
an address in a lookup table. During the simulation, when a store (to shared memory) 
is executed, its target address is checked in the lookup table. If the address has been 
registered, it will be printed out along with the data to be stored. When a debug 
instruction is encountered by the current head thread, the processor's activities from 
that point onwards are printed out. Which information is reported depends on the level of debugging specified. These instructions do not have delays, but they may slightly affect the performance of some processor components such as the FLS, since they
are fetched into the instruction streams together with the application instructions and 
occupy space in the instruction buffers. 
Table 3.8 Probing Instructions

Instruction	Description
prb.t I	Return probe number I and the current time.
prb.a I($s1)	Register address I($s1) for profiling.
prb.ai L, I	Register address L + I for profiling.
debug I	Report the processor's activities as specified by switch I.
3.4.2 Limitations 
The simulator runs on a SUN Solaris platform. It reads an input file in the assembly 
format (ASCII) rather than the binary executable. This allows us to introduce and 
experiment with new multithreaded instructions without being restricted by the actual 
instruction set architecture (ISA) and its binary format. However, the simulator does 
not deal with OS or library calls. When these calls are encountered, they are treated as 
dummy instructions, i.e. no action is actually executed. 
Another limitation is that cache hits/misses for the instruction cache are currently not modelled. The cache is always large enough to accommodate the whole input file, which is loaded once at the start of the simulation. Similarly, the notion of a data cache is deliberately omitted. Instead, we refer to the first-level shared memory, which is also large enough to accommodate the whole program execution.
3.5 The Multithreaded Compiler 
Compilers for parallel and multithreaded programs traditionally comprise a front-end 
source-to-source paralleliser, and a back-end optimiser and code generator. Examples 
include SUIF [84], Agassiz [79], Polaris [11, 40], and PROMIS [62].
Most front-end parallelising compilers are language-independent and machine-
independent. Source programs written in FORTRAN, C, C++, or Java are parsed into 
universal intermediate representation (IR) format, where structures such as loops and 
arrays are reserved for parallelisation analyses and transformations. The discovery 
of parallelism and the extraction of useful information from the sequential code are 
difficult to perform fully automatically. Intervention from programmers is allowed, most commonly via an interactive user interface (e.g. SUIF, PROMIS) and the augmenta-
tion of source programs with preprocessor directives. The supplementary information 
can also be provided as external files, as in [29] and [61]. To minimise the degree 
of machine-dependence, thread manipulating code is typically inserted in the form of 
function calls which will be later linked to the target-specific multithreaded libraries. 
Output from the front-end compiler is fed into the back-end optimiser and code 
generator. High-level IR is broken down into low-level IR. Propagating high-level in-
formation, e.g. alias information and loop-carried data dependence, to the back-end 
can increase the efficiency of further analyses and optimisations. In SUIF, the infor-
mation is encapsulated in the annotations attached to SUIF IR objects. In Agassiz, an
assertion file is generated to identify the mapping between the front-end and the back-
end representations. The back-end compilers are often modified versions of commer-
cial compilers. For instance, Agassiz uses a modified GCC back-end and Polaris uses 
a modified SGI back-end. The multithreaded libraries are linked to produce the final
executable code. 
The Multiscalar compiler [72] focuses solely on low-level compilation. A GCC
front-end parses, optimises, and compiles a program down to assembly code. The 
Multiscalar compiler then performs task selection, schedules register communication, 
and annotates the assembly code with task information. It does not use high-level 
structures or information since the task selector processes control-flow graphs at the 
basic-block level. Furthermore, only register dependence and communication are han-
dled by the compiler, while memory dependence is handled by the hardware. 
3.5.1 Compiler Implementation 
A multithreaded compiler, threadsuif, was implemented, which analyses sequential programs written in C and automatically transforms them for multithreaded execution. The SUIF framework [84] was chosen due to its availability and the support provided by the distributor (Stanford SUIF Compiler Group). Given SUIF's modular construction and well-defined interface, new functions could be implemented and easily slotted into the compilation flow.
The SUIF compiler system includes a set of compiler passes that perform program analyses, optimisations, and transformations on the SUIF intermediate representation (IR). Each pass can be implemented as a separate program. The SUIF IR uses a language-independent abstract syntax tree (AST) representation in two levels:

• High-SUIF. In this level, the AST nodes are high-level control-flow structures such as TREE_IFs, TREE_FORs, or TREE_LOOPs.

• Low-SUIF. In this level, the high-level tree nodes are dismantled into lists of instructions.
Both levels of representation can co-exist. For instance, unstructured control-flow in high-SUIF code is represented by low-level branch and jump instructions. The compiler passes communicate by reading from and writing to SUIF files, where information such as results from the analyses is carried in the annotations attached to the SUIF objects. The SUIF packages [84] used were:
• basesuif (version 1.1.2). It is the base system for all other packages.

• baseparsuif (version 1.0.0.beta.2). This package includes libraries and passes for parallelisation and dependence analyses, loop transformations, and relevant optimisations.

• oldsuif (version 6.0.0). This package is a collection of libraries and passes that work on an earlier generation of the SUIF IR format.

• suifbuilder (version 1.0.0). It is an interface for generating SUIF code.

• suifvbrowser (version 1.0.0.beta.1). It is a graphical user interface (GUI).

• tcovsuif (version 2.0) [85]. This is a contributory package which incorporates profile information into the SUIF code.
The threadsuif package comprises three main modules:

• Multithreaded loop transformer (loopth).

• Multithreaded control-speculation transformer (specth).

• Multithreaded code generator (thgen).
3.5.1.1 Front-end Transformations 
The front-end transformers, loopth and specth, process code in the high-SUIF format. They recognise TREE_FOR and TREE_LOOP structures for the multithreaded loop transformation (Chapter 4), and TREE_IF structures for the control-speculation transformation (Chapter 5). The transformations are described in detail in those chapters.
3.5.1.2 Multithreaded Code Generation 
thgen generates the code targeted at the multithreaded architecture. It is a modi-
fied version of a MIPS code generator, mgen, in the oldsuif package. The output 
files from the front-end transformers are pre-processed by the SUIF passes including 
sw±ghnflew, oldsuif, and mexp. They reformat the code and prepare information 
such as register usage for thgen. The code generator works in three steps: 
1. Translate SUIF instructions into assembly ones and gather the register usage.
2. Allocate saved registers.
3. Determine the size of the stack frame and allocate temporary registers.
In our multithreaded model, each thread has a separate register file but they share 
the same memory space. Therefore, spilling registers to memory is avoided, assuming 
that there are always sufficient registers in the register file. Moreover, when a pro-
cedure call sequence is generated, the infinite-saved-registers option is used so that registers are not saved onto the stack, which is shared by all threads. If the program contains recursive function calls, a private stack should be allocated to each thread. The front-end transformers allocate private memory to threads in the form of arrays and structures which are indexed by the thread numbers, and which will be translated into the .data section in the assembly code.
3.5.2 Compilation Process 
An overview of the compilation process is illustrated in Figure 3.7. The analysis functionality in SUIF is quite simple and may not expose all the parallelism in the programs.
User hints, if needed, can be given in the source files using a pragma directive:

#pragma suif_annote <annotation name> <information>

When the source code is translated into SUIF IR, the hints will be written as SUIF annotations. Alternatively, annotations can be inserted directly into SUIF files via the graphical user interface, suifvbrowser.
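For example, a loop known to be independent might be marked as follows; the annotation name "do-all" matches the annotations the transformers look for (Section 4.1), but the exact placement and payload here are hypothetical.

/* A user hint marking the following loop as parallelisable. */
void vadd(int n, int* a, const int* b, const int* c) {
#pragma suif_annote do-all
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}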
The code is pre-processed by various distributed SUIF passes. First, porky handles 
classic optimisations including constant folding, forward propagation, copy propaga-
tion, and constant propagation. A new version of the C program is generated from the optimised code. It is next compiled with the gcc -a option, and executed to collect profiling statistics. tcovsuif then annotates tree nodes in the SUIF file with the basic block and line counts. The next step analyses high-SUIF structures, such as TREE_FORs, TREE_LOOPs, and TREE_IFs, and determines whether they present good opportunities for the multithreaded and/or speculative execution. Besides detecting parallelis-
able loops, skweel also applies standard transformations such as loop normalisation 
and loop interchange. 
Following the analyses, loopth and specth recognise the threadable structures and 
transform them. Cost models can be used to aid the transformation. Ideally, this step 
should be machine-independent. In practice, however, the transformers are aware of the underlying target architecture and its execution models. After the transformation
and code generation (by thgen), machine-specific optimisations can be applied, for 
example, instruction and register communication scheduling. 
Finally, the assembly output is supplied to the multithreaded simulator. Profile 
information is collected. It can be used to help improve the compiling options as well 
as to adjust the architectural assumptions such as the number of TPUs and ALUs. 
Figure 3.7: An overview of the compilation process: the C program with pragma directives is parsed into SUIF IR, classically optimised, transformed by the front-end passes loopth and specth (guided by the architecture configuration and cost analyses), compiled by thgen, and run on the target architecture (simulator)
3.6 Discussion 
A CMP-based multithreaded architecture has been described, which was enhanced to 
support hierarchical multithreaded execution, speculation, and register synchronisation 
and forwarding. The architecture executes the MIPS instruction set which was aug-
mented with multithreaded instructions. There are some similarities and differences 
between our clustered multithreaded architecture and previous ones, as described next. 
During program execution, a thread can dynamically allocate a cluster of slave 
TPUs to execute a program partition. The number of TPUs in the cluster can be spec-
ified in the cluster allocation command, which allows TPUs to be used in correspon-
dence to the parallelism in that program partition. This was inspired by the dynamic 
resource partitioning concept in Simultaneous Multithreading (SMT) [44]. Each
slave thread can also allocate a cluster of slave TPUs at the next level. 
Interaction between the master and the slave threads was adapted from architec-
tures such as SPSM, M-Machine, and Superthreaded [18, 23, 68, 78]. Although the 
slave threads can be related to subthreads in [23, 68, 78], the latter typically reside in 
the master TPU or share hardware resources owned by the master thread. Our slave 
threads, on the other hand, execute on their own TPUs and transfer results to the master 
thread after they retire. These results are collected as the cluster's state (or the mas-
ter's temporary state), and will become the master's current state once the cluster is 
released. This operation is similar to merger in SPSM [18]. An underlying assumption is that the slaves only execute program partitions which are encountered after the
one executed by the master, according to the sequential semantics. If the program par-
titions executed by the master and by the slaves are encountered in the reverse order, 
then the master's current state will be reset to the point as if its execution had not yet 
started. 
The slave threads in the same cluster operate in the predecessor/successor style 
[45, 65, 68] as follows. Since the slave TPUs are logically connected to each other in 
a uni-directional ring, a new slave thread is only forked on the next TPU in the ring. 
After finishing their execution, the slaves must synchronise and retire sequentially. 
Hardware support for speculative execution is very simple. Like STAMPede [65], 
threads can switch between non-speculative and speculative modes - they rely on the 
software to determine in which mode they should be during the execution. Misspec-
ulation detection and recovery are also performed in the software. Mechanisms are 
provided for handling a mispredicted thread which include: interrupting that thread; 
aborting the slave cluster, if it is a master thread; or switching off the merger flag in 
the slave retirement command, if it is a slave thread. Then, the thread can retire and a 
new one performs the correct execution. 
A simple register synchronisation and forwarding mechanism was added to the ar-
chitecture. The strategy is as follows: a parent thread first executes a command to set 
unavailable bits in the registers, prior to forking a new thread; as the new thread is 
initialised, wait bits in these registers will be automatically set, which enforces syn-
chronisation if it tries to read these registers before they are forwarded from the parent. 
The set of unavailable registers can be determined by dataflow analysis in the compiler. 
This idea was borrowed from Multiscalar [12], which declares a set of registers to be 
written by each task in a create mask, and passes this mask to a successor task as an 
accum mask; a task blocks when it tries to read the registers specified in the accum 
mask whose values have not yet become available. The Multiscalar hardware propagates
forwarded registers to all the processing units, whereas our multithreaded architecture 
only forwards registers from the parent to its children. 
Because the architectural support for multithreading is kept to the minimum, the 
onus is on the compiler to orchestrate the parallelism in programs and specify the 
execution. New multithreaded instructions were proposed to pass commands from 
the compiler to the architecture. At run-time, allocating clusters or forking threads 
are not guaranteed to be successful, depending on the availability of TPUs. In the 
case of clustering or forking failure, the program will be executed sequentially instead. 
Guarded execution is therefore a key feature in most of the multithreaded instructions. 
The main idea is to use the result from clustering or forking as guard operands in the 
subsequent multithreaded instructions, to ensure that the program is correctly executed 
in both sequential and multithreaded modes. 
The compiler implemented consists of source-to-source transformers for multi-
threaded loop execution and control speculation, and a code generator targeting the multithreaded architecture. The transformers are aware of the underlying architecture and its execution models, which have been summarised and discussed earlier in
the section. Then, the code generator generates MIPS instructions combined with the 
multithreaded ones. The multithreaded loop execution will be described in detail in 
Chapter 4, and the multithreaded control-speculative execution in Chapter 5. 
Chapter 4 
Multithreaded Loop Execution 
Loops are an important source of parallelism in sequential programs. Loop paralleli-
sation can be performed either statically or dynamically. In a dynamic approach the 
sequential loops are parallelised at run-time. A static approach, in contrast, transforms 
the loops at compile-time by inserting thread manipulation routines. Parallelisable 
loops can be either do-all loops, which contain no dependencies between iterations, or do-across loops otherwise. Techniques for testing data dependence can be found in
the literature [9, 19, 56, 75, 76].  A number of well-known loop optimisation techniques 
can be applied prior to the multithreaded transformation in order to expose more loop-
level parallelism. For instance, loop normalisation, loop skewing, and loop reversal, 
rearrange bounds and data dependency pattern in the loops, which enable further opti-
misations. Loop interchange switches the inner and the outer loops in a loop nest - a 
parallelisable loop can be moved outward to increase the granularity of the loop-level 
parallelism, or inward to prevent cache overflow. Loop fission separates sequential and 
parallelisable parts of a loop, by breaking a single loop into multiple smaller ones. It is 
also used to break large loops that do not fit into the cache. Loop fusion is the inverse 
of loop fission, which helps increase instruction-level parallelism in the loops. Loop 
unrolling can be applied for similar purposes, by replicating the body of the loops. 
Strip-mining and loop tiling improve memory locality by dividing the iterations into 
tiles and traversing between the tiles. Loop coalescing and loop collapsing transform 
a loop nest into a single loop, which eliminates the overheads of multiple loops and 
multi-dimensional array indexing. Loop peeling is usually performed in conjunction with the other optimisations, as it handles remnant iterations which are left over from applying other techniques.
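To illustrate one of these techniques, the sketch below applies loop fission to separate a loop-carried recurrence from independent work, leaving a do-all candidate; the arrays and function names are hypothetical.

/* Before fission: one loop mixes a recurrence with independent work. */
void before(int n, int* s, const int* x, int* y) {
    for (int i = 1; i < n; i++) {
        s[i] = s[i-1] + x[i];   /* loop-carried dependence */
        y[i] = 2 * x[i];        /* independent between iterations */
    }
}

/* After fission: the second loop is now a do-all candidate. */
void after(int n, int* s, const int* x, int* y) {
    for (int i = 1; i < n; i++)
        s[i] = s[i-1] + x[i];
    for (int i = 1; i < n; i++)
        y[i] = 2 * x[i];
}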
The multithreaded transformation is then performed after dependency and paral-
lelism analysis, and this is described next. 
4.1 Multithreaded Loop Transformations 
An overview of the compilation flow is displayed in Figure 4.1. The transformers de-
tect parallelisable loops in SUIF programs (the SUIF framework was described in Section 3.5); they are TREE_FORs and TREE_LOOPs attached with annotations "do-all" or "do-across". These loops can be detected in SUIF passes such as skweel, from pragma directives inserted in the source code, or via the SUIF graphical user interface (GUI). Each parallelisable loop is pre-processed before it is transformed into the multithreaded version. High-level TREE_FOR and TREE_LOOP constructs are dismantled into straight-line code using functions in the SUIF module, porky. Figure 4.2(a) shows a dismantled structure similar to those produced from porky, which is next expanded in preparation for the transformation (Figure 4.2(b)). The pre-processing pass
also analyses the loop and prepares the following information: 
• The number of available TPUs in the processor, which is fixed for all the loops 
in the program. 
(1) Detect parallelisable loops: SUIF packages (e.g. skweel) or user hints (annotations, directives, interactive GUI).
(2) Recognise parallelisable loops: extract dependency information and reformat the code layout.
(3) Transform.

Figure 4.1: An outline of the loop transformation
• The number of slave threads to execute the loop in parallel, which is obtained from the cost analysis.

• The lists of instruction pairs that may cause loop-carried dependence. Sources and sinks of the dependency edges are maintained in strlist[num_dep_pairs] and lodlist[num_dep_pairs], respectively.

• Exit points. The natural exit of a loop is after the continuation test, or BRK_LABEL in Figure 4.2(a). Other exit points may also be present inside the loop body.
Two loop transformation algorithms are described in Sections 4.1.1 and 4.1.2, respectively. LoopTransformer_1 transforms loops with only natural exits, while LoopTransformer_2 transforms loops with multiple exits. The appropriate algorithm is chosen automatically for each loop by the pre-processing routine.






Figure 4.2: Loop structure in SUIF IR, (a) before and (b) after loop expansion. In (a), the TREE_FOR / TREE_LOOP node contains TOP_LABEL: (loop body), CONT_LABEL: (loop continuation test), and BRK_LABEL within a TREE_NODE_LIST. In (b), the expanded layout adds PARENT_LABEL, CHILD_LABEL, EPILOGUE, PRE_ABORT_LABEL, and ABORT_LABEL around the loop body, continuation test, and BRK_LABEL.







 1   PRE_LOOP:
 2       int readsynch, csucc, xsucc, ysucc, merge;   // working variables
 3       int myself, mychild, myparent;               // working variables
 ...
 6       readsynch = csucc = ysucc = 0;               // initialise guards
 7       csucc = cform (NUM_SLAVES);                  // form a slave cluster
 8       myself = adr ();                             // get self's address
 9       guard[myself] = csucc;
10       ysucc = yfrk (guard[myself], TOP_LABEL);     // fork the first slave
11       mychild = cadr ();                           // get child's address
12       if ( ysucc ) {
*13          sstp (0, mychild);                       // send synchronisation signal to 1st slave
14           independent works                        // from code motion
15           crels ( ... );                           // release the cluster
16       } else {
17           independent works                        // the same copy as line 14
18       }
19   TOP_LABEL:                                       // start loop execution
20   PROLOGUE:
21       xsucc = 0;
22       myself = adr ();                             // get self's address
23       xsucc = xfrk (guard[myself], CHILD_LABEL);   // fork the next slave
24       mychild = cadr ();                           // get child's address
25   PARENT_LABEL:
26       original loop body
27       merge = 1;
 ...
30       readsynch = 0;                               // xfrk failed: switch off the wat guard
 ...
33   CHILD_LABEL:
34       readsynch = 1;
35       merge = 0;
36   CONT_LABEL:
37       loop continuation test                       // branch to TOP_LABEL if continue
38   EPILOGUE:
39       myself = adr ();                             // get self's address
40       xstp (guard[myself], merge);                 // retire slave thread
41   BRK_LABEL:                                       // natural exit

Figure 4.3: Multithreaded loop generated by Loop-Transformer-1
4.1.1 Simple Loops 
Thread manipulation code is inserted in each block of the reformatted loop (Figure
4.2(b)). Figure 4.3 shows the code outline of a multithreaded loop whose only exit
is at BRK_LABEL. Private variables can be allocated via arrays, e.g. guard[NUM_TPUS],
and indexed by the thread addresses. They may also live in registers, provided that the
number of registers is sufficient to prevent spilling to the shared memory.
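As a minimal sketch of this allocation scheme (adr() is the architecture's
get-own-address instruction; the stub exists only so the fragment compiles):

    #define NUM_TPUS 18               /* total TPUs, per Table 4.3     */

    static int adr(void) { return 0; }  /* stub for the adr instruction */

    int guard[NUM_TPUS];              /* one private slot per TPU      */

    void set_my_guard(int csucc) {
        int myself = adr();           /* this thread's TPU address     */
        guard[myself] = csucc;        /* private write: no other thread
                                         uses this index               */
    }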
Master Thread 
The execution starts at PRE_LOOP. The master thread attempts to form a slave clus-
ter (line 7) and stores the result in guard. If the operation succeeds, then the loop
will be executed by the slaves; otherwise, the master will execute the loop itself. Be-
cause the subsequent multithreaded instructions are guarded (e.g. lines 23 and 40),
they are treated as no-op instructions when executed by the master.

Independent computation before or after the loop can be inserted ahead of crels
(line 15) by code motion, and executed in parallel with the main loop ex-
ecution. When the master reaches crels, it sends a synchronisation signal to the
slaves and waits until they retire. It then transfers the temporary register updates made
by the slaves to its register file and frees the cluster. Since the program counter of the
last slave becomes the current program counter of the master, execution resumes at
BRK_LABEL (line 41), which is the exit point of the original loop.
Slave Threads 
The execution starts at TOP_LABEL (line 19). A new thread is forked (line 23) before
the current thread continues to execute the loop body. At the end of the execution,
the merge flag is set (line 27), indicating that the register updates by this thread will
be merged into the master's temporary state.

Figure 4.4: Diagram of the multithreaded loop in operation
If the fork is successful, then the current
thread waits to retire and passes the synchronisation signal to its child (line 40). Oth-
erwise, it will perform the loop continuation test and execute the next iteration itself. 
The child's execution begins at CHILD_LABEL (line 33). The loop continuation
test (line 36) is performed early to determine whether to start a new iteration, thereby
limiting the amount of speculative work done by the child thread. If the test fails, the
child only waits to synchronise with its parent and retires without merger. Figure 4.4
depicts the multithreaded loop in operation. The life cycles of the slave threads, except
the first one, which is spawned by the master (the first slave thread starts at TOP_LABEL),
start from
childhood, shown as the paths represented by the dotted lines. When a child thread
reaches PROLOGUE and executes the xfrk instruction, it becomes a parent and follows
the paths represented by the solid lines. The master thread follows the paths represented
by the dashed line.

Figure 4.5: Store/load synchronisation in Loop-Transformer-1. In iteration i,
psg (xsucc, mychild, signal_id) follows the store instruction; in iteration i + 1,
wat (readsynch, signal_id) precedes the load instruction.
Since every thread encountering an xstp is blocked until the signal is received
from its parent, the master thread executes sstp (line 13) in order to pass the signal
to the first slave prior to its own computation. This permits the slave TPUs to be
reused by multiple slave threads, which is called recycling execution. However, this is
impractical for nested loops. As a thread executing an outer loop iteration tries to pass
the signal to its slaves, it may itself be blocked waiting for the signal from its own parent
(since the signal is forced to pass around in the correct order, the thread will eventually
be unblocked without causing any deadlock). In such cases, the sstp instruction is
excluded since the execution is non-recycling. The non-recycling execution will be
discussed later in the chapter.
Data Dependence 
For each pair of dependent instructions in strlist and lodlist, synchronisation is
enforced by passing and waiting for a signal, as shown in Figure 4.5. The signal_id is
an integer value unique to each store/load pair; psg is a non-blocking send while wat is
a blocking receive. The execution of wat is guarded by readsynch, whose value is set
when a new thread is created (line 34 in Figure 4.3). If a thread has to execute the next
iteration itself, i.e. either xfrk fails or it is the master thread, then readsynch must
be switched off (line 30 in Figure 4.3), since the new iteration need not synchronise
its memory operations with the previous iteration executed by the same thread. On the
other hand, psg is guarded by xsucc, whose value is set if the thread successfully forks
a new thread.
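The pattern can be sketched in C as follows; psg and wat are the architecture's signal
instructions, stubbed here only to make the shape concrete, and signal id 1 stands for
this store/load pair's unique id.

    /* psg is a non-blocking send, wat a blocking receive (stubs) */
    static void psg(int guard, int dest, int signal_id) {}
    static void wat(int guard, int signal_id)           {}

    /* iteration i (producer): the store is followed by psg,
       guarded by xsucc (set when the fork succeeded)              */
    void producer(int *shared, int value, int xsucc, int mychild) {
        *shared = value;              /* store instruction          */
        psg(xsucc, mychild, 1);       /* signal the consuming thread */
    }

    /* iteration i + 1 (consumer): wat precedes the load,
       guarded by readsynch (set when the thread was newly created) */
    int consumer(const int *shared, int readsynch) {
        wat(readsynch, 1);            /* wait for producer's signal  */
        return *shared;               /* load instruction            */
    }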
4.1.2 Loops with Multiple Exits 
Loop-Transformer-2 (Figure 4.6) operates on loops with multiple exit points. At
present, it automatically handles break statements in C which are embedded in the
loop body. The thread manipulation code for PRE_LOOP, PARENT_LABEL, and CHILD_LABEL
is the same as before, with modifications made to PROLOGUE and EPILOGUE; the
additional PRE_ABORT_LABEL and ABORT_LABEL blocks are introduced to handle specu-
lation.
Speculation Handling 
Although the loop continuation is tested early when a new thread is spawned, the
iteration may still be invalid if a thread executing any previous iteration encounters an
exit point. Therefore, a newly-created thread has to turn off the safe flag and become
speculative (line 23). Subsequent stores by the thread will be buffered in its private
memory and only committed to the master thread's temporary buffer before it retires
(lines 39 and 40).

The original breaks in the source program were translated into jumps to BRK_LABEL
in the SUIF code. The targets of these jumps were changed to PRE_ABORT_LABEL by
the transformer. When the first thread encountering an exit branches to this label, it 
interrupts the child's execution (line 45) and commits its speculative stores and reg-
ister updates up to the break point (lines 46 and 47) before retiring. The subsequent 
threads are recursively interrupted (line 52) and retire without merger (line 53). If the 
interrupted thread governs the multithreaded execution of the inner loop, it also aborts 
its slave cluster (lines 44 and 51). 
Suppose that the following loop is transformed for the multithreaded execution: 
int a[5] = {1, 2, 3, 4, 5};
int b[5] = {6, 7, 8, 9, 10};
for (int i = 0; i < 5; i++) {
    if (i < 3)
        a[i] = ( (a[i] * 111) + (b[i] * 222) ) * 333;
    if (i > 0) break;
}
The compiler will generate 5 slave threads {T0, T1, T2, T3, T4} to execute the loop
iterations (with induction variable i = {0, 1, 2, 3, 4}, respectively). As T3 by-
passes the computation under the first condition and arrives at break before the others,
it will interrupt T4 and try to commit its result. The xstp and cmmt instructions force
the threads to wait until they are signalled by their predecessors. The next thread that
encounters break is T1. It interrupts T2, which, in turn, interrupts T3, which is blocked
at the cmmt instruction. Eventually, the threads will commit and retire in the correct
sequential order despite exiting the loop out-of-order, as shown in Table 4.1.









 1   PRE_LOOP:
 2       int readsynch, csucc, xsucc, ysucc, merge;   // working variables
 3       int myself, mychild, myparent;               // working variables
 ...                                                  // initialise guards and
 ...                                                  // PRE_LOOP as in Figure 4.3
19   TOP_LABEL:                                       // start loop execution
20   PROLOGUE:
21       xsucc = 0;
22       myself = adr ();                             // get self's address
*23      safe (guard[myself], 0);                     // become speculative
24       xsucc = xfrk (guard[myself], CHILD_LABEL);   // fork the next slave
25       mychild = cadr ();                           // get child's address
26   PARENT_LABEL:
 ...                                                  // loop body, CHILD_LABEL, and
 ...                                                  // CONT_LABEL as in Figure 4.3
 ...                                                  // branch to TOP_LABEL if continue
     EPILOGUE:
*39      if ( guard[myself] ) cmmt (merge);           // commit/discard speculative stores
40       xstp (guard[myself], merge);                 // retire slave thread
41       goto BRK_LABEL                               // natural exit
*42  PRE_ABORT_LABEL:
43       if ( guard[myself] ) {
44           crels (in_csucc, 0);                     // abort inner-level cluster
45           isg (xsucc, mychild, ABORT_LABEL);       // interrupt child's execution
46           cmmt (merge);                            // commit/discard (current thread)
47           xstp (guard[myself], merge);             // retire slave thread
48       }
 ...
     ABORT_LABEL:
51       crels (in_csucc, 0);                         // abort inner-level cluster
52       isg (xsucc, mychild, ABORT_LABEL);           // interrupt child's execution
53       xstp (guard[myself], 0);                     // retire without merger
     BRK_LABEL:                                       // natural exit

Figure 4.6: Multithreaded loop generated by Loop-Transformer-2
Table 4.1 Order of commit and retirement

Thread   Action    Code Executed (line in Figure 4.6)
T0       commits   line 39
T0       retires   line 40
T1       commits   line 46
T1       retires   line 47
T2       retires   line 53
T3       retires   line 53
T4       retires   line 53
Master Thread Execution 
As discussed in Section 4.1.1, the master thread suspends at the crels instruction.
Some slaves commit their speculative stores to the master's temporary buffer. The data
from this buffer is transferred to the working speculative buffer before the master thread
resumes its execution at BRK_LABEL. If the loop is in a nest, then the master commits
these results in the EPILOGUE of the outer loop iteration. For the outermost loop, the
results collected from all the threads in the system are committed after the master exits
at BRK_LABEL.
Data Dependence 
Data dependence between iterations is handled in much the same way as before. How-
ever, being speculative, each slave reads from the speculative buffer of its predecessor
instead of from the shared memory. The psg/wat instruction pair (Figure 4.5) ensures
that the consumer waits until the most recently-updated data is available. Figure 4.7
shows an example of multithreading in nested loops, which assumes that there is data
dependence between the outer loop iterations.

Figure 4.7: Nested loop execution in speculative mode

Following the thread order maintained
by the global thread control unit (GTCU), T2 retrieves the data from T1's buffer after
the synchronisation. Our transformation is applied to each loop separately. Therefore,
data dependence between iterations of different loops, such as between T1.2 and T2,
is not recognised by the transformer. In those cases, only one loop is chosen for the
multithreaded execution. 
Arbitrary Exits 
In the case of arbitrary exits other than break statements, the first thread reaching an
exit saves the target address before jumping to PRE_ABORT_LABEL. Instructions are in-
serted at the end of the aborting sequence, i.e. after line 47, to load and branch to the
target address after the execution is resumed by the master thread.
This transformation is also applicable to loops from the previous section. Because
no exit is found in the body of such loops, the PRE_ABORT_LABEL and ABORT_LABEL
blocks are never executed.
4.1.3 Register Communication 
Data dependence between threads can also be handled by register forwarding. This
approach requires dataflow analysis at the assembly level, after the register alloca-
tion phase. As a result, when this option is selected, the front-end transformers only
mark the places where data dependencies might occur. The actual instructions, uregs
and fregs, are inserted during the code generation phase. The multithreaded transfor-
mation of the following loop is considered next:
int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
int total = 0;
for (int i = 0; i < 10; i++)
    total = total + a[i] * 2;
Loop-carried data dependence exists due to the reads and writes of the variable total. 
Figure 4.8 shows the assembly code of the multithreaded loop when memory commu-
nication is used. The result of the summation (line 20) is stored in the shared memory 
and loaded by the next iteration. Synchronisations are required before the load (line 
18) and after the store (line 22). 
 2   L2.main:                      // TOP_LABEL
 3   L3.main:                      // PROLOGUE
 4       li    $75, 0              //
 5       adr   $76                 //
 6       muli  $11, $76, 4         //
 7       lai   $12, __S1.main, 0   //
 8       addu  $13, $12, $11       //
 9       lw    $73, 0($13)         // load guard value
10       xfrk  $73, $75, L5.main   //
11       cadr  $77                 //
12   L4.main:                      // PARENT_LABEL (loop body)
13       muli  $14, $69, 4         //
14       la    $15, 64($29)        //
15       addu  $24, $15, $14       // calculate address of a[i]
16       lw    $25, 0($24)         // load a[i]
17       muli  $80, $25, 2         // $80 = a[i] * 2
*18      wat   $79, 1              // synchronise load
19       lw    $8, 104($29)        // load total
20       add   $9, $8, $80         // $9 = total + $80
21       sw    $9, 104($29)        // store result
*22      psg   $75, $77, 1         // synchronise store
23       beqz  $75, L8.main        //
24   L10.main:                     // xfrk succeeds
25       j     L6.main             //
26   L8.main:                      // xfrk fails
27       li    $79, 0              // turn off wat's guard
28       j     L0.main             //
29   L5.main:                      // CHILD_LABEL
30       li    $79, 1              // turn on wat's guard
31   L0.main:                      // CONT_LABEL
32       addi  $69, $69, 1         //
33       li    $10, 10             //
34       bgt   $10, $69, L2.main   // continuation test
35   L6.main:                      // EPILOGUE
36       ...

Figure 4.8: Transformed loop using memory communication










 ...
 4       li    $45, 2              //
*5       uregs $73, 64($45)        // wait for $70 if it is not yet available
 ...
     L4.main:                      // PARENT_LABEL (loop body)
 9       muli  $14, $69, 4         //
10       la    $15, 64($29)        //
11       addu  $24, $15, $14       // calculate address of a[i]
12       lw    $25, 0($24)         // load a[i]
13       muli  $80, $25, 2         // $80 = a[i] * 2
14       ["synch_read"]            // annotation added by transformer
15       ["synch_write"]           // annotation added by transformer
16       add   $70, $70, $80       // $70 = $70 + $80
*17      fregs $75, 64($45)        // set $70 available and forward
 ...

Figure 4.9: Transformed loop using register communication
Figure 4.9 shows the assembly code of the same loop when register communication is
used. In our example, the source and the sink of the dependency edge point to the same
instruction (line 16). Lines 4, 5, and 17 are added after the code generation, as it is
only then that the register dependence (caused by $70) is known. The register identifier
of $70 is encoded as

    base register = 6
    offset        = 2                      (since 70 = 6 + 32 x 2)
    mask          = 2^6 = 0x00000040 or 64
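A small C sketch of this encoding (the helper name encode_reg is illustrative, not
part of the compiler):

    #include <stdio.h>
    #include <stdint.h>

    /* hypothetical helper mirroring the encoding above: a register
       number r is split as r = base + 32 * offset, and the base is
       turned into a one-hot 32-bit mask                             */
    static void encode_reg(int r, uint32_t *mask, int *offset) {
        int base = r % 32;            /* base register: 6 for $70    */
        *offset  = r / 32;            /* offset: 2 for $70           */
        *mask    = 1u << base;        /* 0x00000040 (64) for $70     */
    }

    int main(void) {
        uint32_t mask;
        int offset;
        encode_reg(70, &mask, &offset);
        printf("mask = 0x%08x (%u), offset = %d\n",
               (unsigned) mask, (unsigned) mask, offset);
        return 0;
    }

This matches the operands of uregs and fregs in Figure 4.9: 64 is the mask, and
register $45 holds the offset value 2.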
The uregs instruction is inserted before xfrk to enforce synchronisation if the child
thread tries to read $70 before it is available. The instruction is guarded by the same
condition as xfrk. After the register value is produced, it is forwarded to the next
thread by fregs, which is guarded by the result of the fork. Figure 4.10 depicts the
register communication between threads.













Figure 4.10: Diagram of register communication for register $70
4.2 Performance Evaluation 
This section reports results of executing sequential and multithreaded programs on our 
simulated architecture. The architectural details and compilation framework were de-
scribed in Chapter 3. Optimisations performed by SUIF prior to the multithreaded 
transformation are classic optimisations (constant folding, forward propagation, copy 
propagation, and constant propagation) and basic loop optimisations (loop normalisa-
tion, loop skewing, and loop reversal). Sequential programs were transformed using 
Loop-Transformer-1 and Loop-Transformer-2 described earlier. Techniques such 
as loop unrolling and loop peeling were also explored. 
4.2.1 Benchmarks 
The C versions of the Livermore kernels [82] were used as benchmarks in the experi-
ments. Each kernel is placed in a separate program which consists of three phases:
initialisation, main computation of the kernel, and verification. Each kernel is executed
repetitively enough to dominate the total execution time of the program.

The statistics for the Livermore kernels are summarised in Table 4.2. The parame-
ters K and I were taken from the full benchmark version [81] (the first set of DO-loop
spans), except matrix multiplication (U_21), which was scaled down to correspond to
the same workload as the other kernels. The last four columns present statistics of the
benchmarks from the sequential execution.
Table 4.2 Benchmark description and general statistics

                                                                           Dynamic Distribution (%)
Name  Kernel Description                  K    I       K * I    Instructions  Init   Main   Verify
A_1   Hydrodynamic code                   70   1,001   70,070   1,779,326     0.64   99.03  0.32
C_3   Inner product                       90   1,001   90,090   1,183,761     0.95   99.05  0.00
D_4   Banded linear equations             140  600     84,000   1,546,251     0.83   99.17  0.00
F_6   General linear recurrence equation  30   1,954   58,620   1,240,448     2.85   97.12  0.03
G_7   State equation                      40   995     39,800   2,616,272     3.08   96.70  0.22
H_8   Alternating direction, implicit     100  198     19,800   3,489,128     1.09   98.67  0.24
      integration code
I_9   An integration predictor            360  101     36,360   2,019,297     0.97   98.99  0.04
J_10  A difference predictor              340  101     34,340   2,649,431     1.18   98.35  0.46
L_12  First difference                    120  1,000   120,000  2,060,662     0.75   98.92  0.33
N_14  1-D particle-in-cell code           20   2,000   40,000   1,450,765     0.82   99.15  0.03
R_18  2-D explicit hydrodynamic code      20   495     9,900    3,035,697     1.33   98.32  0.35
U_21  Matrix multiplication               5    15,625  78,125   1,849,569     0.84   98.89  0.26
V_22  Planckian distribution procedure    70   1,001   70,070   2,415,445     0.79   98.96  0.25
Average                                                         2,102,773     1.24   98.56  0.20

K     : the number of times a kernel is executed.
I     : the number of iterations executed in a kernel execution.
K * I : the total number of iterations executed.
Table 4.3 Parameters for the simulated multithreaded architecture

Configuration               Sizes    Latencies (in time units)
instruction buffer (inst.)  10       ALU multiply             12
total TPUs                  18       ALU divide               20
ALUs/TPU                    2        ALU others               4
registers/TPU               120      MU load/store            4
                                     LTCU queries (group 2)   2
                                     LTCU others              4
                                     buffer hit               1
                                     buffer miss              5
Table 4.4 Multithreading overheads

Overheads                       Routine              Average Time (units)
Master's                        PRE_LOOP             50
Slave's: Loop-Transformer-1
  Fork / Initialisation         PROLOGUE and CHILD   50
  Retirement                    EPILOGUE             42
Slave's: Loop-Transformer-2
  Fork / Initialisation         PROLOGUE and CHILD   54
  Retirement                    EPILOGUE             50
                                PRE_ABORT            20
                                ABORT                12
Table 4.5 Details of parallelisable loops in the benchmarks

Program                     A_1  C_3  F_6  G_7  H_8  I_9  J_10  L_12  N_14  V_22
Loop-carried Dependence     no   yes  yes  no   no   no   no    no    no    no
Body Length (time units)^a  132  73   68   358  749  266  317   75    173   169

^a From sequential execution, ALUs = 2.
4.2.2 Results and Discussions 
The first experiment compared the performance of the multithreaded programs to their
sequential versions. The parameters used in the simulation are listed in Table 4.3. The
sequential programs had been optimised using classic optimisations and executed on
the architecture with the number of ALUs ranging from 1 to 4. It was observed that
most programs used at most 2 ALUs. Although some programs used 3 or 4 ALUs, the
utilisation of those extra ALUs was quite low. Thus, the number of ALUs per TPU in
the table is set to 2.
All the benchmarks except D_4, R_18, and U_21 contain one-level (un-nested) parallelis-
able loops. They were transformed into multithreaded code, for cluster sizes ranging
from 2 to 16, in steps of 2. Because the benchmarks have only natural exits, we tested
Loop-Transformer-2 by re-writing the loops using while (TRUE) instead of the orig-
inal for ( ... ), with a break once the loop index value exceeds the upper bound
(see the sketch below). The execution times of both versions were similar, and therefore
the results from the original loops are reported in Figure 4.11. The multithreading
overheads in Table 4.4 are the average execution times of the thread manipulation
routines, which were measured from the experiments in this section.
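The rewrite can be sketched as follows; body and n are illustrative stand-ins for the
kernel loop body and its bound:

    #define TRUE 1

    void body(int i);             /* stand-in for the original loop body */

    void rewritten(int n) {
        int i = 0;
        while (TRUE) {
            if (i >= n) break;    /* exit once the index exceeds the bound */
            body(i);
            i++;
        }
    }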
Figure 4.11: Speedup of multithreaded programs with cluster size ranging from 2 to 16
TPUs, in steps of 2 (Y-axis: speedup w.r.t. the sequential program; X-axis: no. of
slave TPUs)
Figure 4.12: A saturation point being reached at cluster size = 4; panel (b) shows
cluster size = 6
The loop bodies in C_3, F_6, and L_12 are fairly short. The ones in C_3
and F_6 also contain loop-carried data dependence. The multithreaded versions of
these benchmarks deliver little speedup over the sequential programs; they are even
worse when the cluster size is 2. In contrast, the multithreaded execution offers good
speedup in G_7, H_8, I_9, and J_10, and the loop bodies in these benchmarks are rea-
sonably large (see Table 4.5). The speedup generally levels off after 8-10 slave TPUs.
The slave TPUs are recycled among slave threads. Due to the inherent parallelism and
the execution pattern of the loops, after a certain point their performance will no longer
improve in spite of the increase in the number of TPUs. Figure 4.12 depicts an exam-
ple. As mentioned in Section 3.4.1.3, the simulated architecture assumes that the bus
delay is included in the delays of the other processor components such as ALUs, and
bus contention is lumped into the contention for these resources. Communication delay
was assumed to be uniform since communication is only permitted between parent
and child threads, which are likely to execute on neighbouring TPUs.
Table 4.6 Details of parallelisable nested loops (from outermost to innermost loops)

Program  Iterations                     Body Length (time units)^a
D_4      (3, 194)                       (17633, 67)
R_18     (5, 100) (5, 100) (5, 100)^b   (51116, 497) (75174, 720) (12586, 108)
U_21     (25, 25, 25)                   (72489, 2881, 95)

^a From sequential execution, ALUs = 2.
^b 3 sets of nested loops executed sequentially.
In the next experiment, multithreaded execution of the nested loops in D_4, R_18, and
U_21 was performed (the details of these loops are shown in Table 4.6). The speedup
of the multithreaded versions of D_4, R_18, and U_21 across different numbers of TPUs
is shown in Figures 4.13 and 4.14. The nested loop execution is labelled as follows:
N(2,4) indicates that 2 and 4 slave TPUs are allocated to the outer and the inner loops,
respectively. OUTER, MID, and INNER represent the multithreaded execution in the
outermost, middle, or innermost loops only. The total number of TPUs in the graphs
includes the master and the slave TPUs.
Figure 4.13: Speedup of multithreaded versions of D_4 and R_18















Figure 4.14: Speedup of multithreaded versions of U_21 (Y-axis: speedup w.r.t. the
sequential program; X-axis: total no. of TPUs; nested multithreading variants are
grouped by which loop is sequential: innermost, outermost, middle, or others)
For all the benchmarks, two-level multithreading yields no better performance than
one-level multithreading in the outermost or middle loops. In the case of one-level
multithreading in the innermost loops, it appears that the loop bodies in these bench-
marks are too small for the multithreaded execution to be beneficial. A drawback of
the multithreading method, as mentioned earlier, is that at the start of each iteration a
fork instruction is executed, and it is only successful if the next slave TPU is available.
A thread occupies a TPU, even though its execution has completed, until it receives
the synchronisation signal from its predecessor, allowing it to commit and free the
TPU. An example is shown in Figure 4.15.

Figure 4.15: An example of nested multithreading

In cluster {T4.1, T4.2}, the first slave thread
(T4.1) waits for the signal from its master (T4), which, in turn, awaits the signal from its
preceding master (T3). As a result, after a few iterations are executed in parallel, the
remaining ones are executed sequentially because no further thread is sparked.
One approach to this problem is to enhance the architecture, by differentiating the 
synchronisation between global and local levels so that the clusters can be managed 
fully independently from each other. The solution proposed in this thesis is to use 
compile-time techniques such as loop unrolling and loop peeling to improve perfor-
mance of the multithreaded programs. This approach was chosen as it does not require 
any alteration to the architectural design. In addition, since the current architecture 
permits threads to commit and retire one-by-one, it simplifies the study of control-
speculative execution which will be described in Chapter 5. 
4.2.2.1 Loop Peeling 
Loop peeling removes a small number of iterations from either the beginning or the end
of a loop and executes them separately. A common use is to remove data dependencies
caused by the first or the last few iterations of the main loop, allowing the main
loop to be further optimised and then parallelised. This section focuses on the main
parallelisable loop and examines further uses of loop peeling.

Figure 4.16: RIE_avg of the multithreaded programs shown in Figure 4.11 (cluster
sizes 2 to 16)
Figure 4.16 gives the average ratio of the instructions executed (RIE) by the master
TPU to those executed by the slave TPUs. The RIE indicates how well the master TPU is
utilised in comparison to the slave TPUs. In the multithreaded execution, each slave
TPU may be reused by multiple slave threads (recycling execution). From the graph,
the master TPU is utilised at less than 20% of the average utilisation of a slave TPU.
Exceptions are F_6 and N_14. In F_6, the multithreaded loop resides in a serial loop
(which is executed by the master) and its upper bound is not constant. In N_14, the
multithreaded loop is followed by a serial loop and they cannot overlap; however,
N_14 (d) gives the ratios after the number of instructions executed in the serial loop is
deducted from the total number of instructions executed by the master.
Due to the nature of the kernel code, while the slaves are executing the loop, little 
useful computation is left to the master. In the next experiment, early iterations of 
the loops were peeled prior to the multithreaded transformation. Once transformed, 
downward code motion is applied to allocate the peeled iterations to the master thread. 
If there are multiple exits from the loops, abort cluster instructions are inserted in the 
master's code ahead of those exits. The following variations were explored: 
• p.00 represents the original version of the multithreaded loop.

• p.05 represents the loop in which 5% of the iterations were peeled.

• p.10 represents the loop in which 10% of the iterations were peeled.

• p.20 represents the loop in which 20% of the iterations were peeled.
The percentage of iterations peeled is limited to 20% so that its sequential execution 
does not dominate the overall program execution. Due to the characteristic of the 
multithreaded loop in F_6, as mentioned earlier, it is excluded from the experiment. 
The resulting speedup is shown in Figure 4.17, and the RIE graphs and their standard
deviations are shown in Figures 4.18 and 4.19, respectively.

Figure 4.17: Speedup of recycling multithreaded execution after loop peeling (p.00,
p.05, p.10, p.20; Y-axis: speedup w.r.t. SEQ (p.00); X-axis: no. of slave TPUs)









































Figure 4.18: RIE_avg graphs after loop peeling (cluster sizes 2 to 16)

















Figure 4.19: Standard deviations of the RIE bars in Figure 4.18
Loop peeling improves the speedup of the multithreaded programs by up to 20%.
In N_14, there is no improvement at all, since the program's speedup is restricted
by the serial loop execution. Comparison of the RIE bars in Figures 4.16 and 4.18
reveals that the utilisation of the master TPU is substantially improved. However, as
more iterations are allocated to the master, e.g. 10% to 20%, and the cluster size
increases, the slave TPUs appear to be under-utilised in comparison to the master TPU.
Consequently, the program performance, in spite of some modest speedup, is limited
by the sequential execution of the master thread. The standard deviation of the RIE,
shown in Figure 4.19, indicates how the iterations or threads are distributed among the
slave TPUs. When there are fewer threads to distribute to the slave TPUs, i.e. p.10
and p.20, uneven workload becomes more visible, especially in those benchmarks
composed of larger threads such as H_8, I_9, and J_10.
4.2.2.2 Loop Unrolling 
Loop unrolling replicates the body of a loop. If a loop is unrolled n times, then the new
loop body contains n + 1 copies of the original loop body and the iteration step of the
new loop is multiplied by n + 1. It is a common technique to increase the size, and
therefore the instruction-level parallelism, of the loop body, which corresponds to the
thread size in the multithreaded execution. A sketch is given below.
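As a minimal illustration, the following unrolls a first-difference loop (in the style of
the L_12 kernel; the names are illustrative) 3 times, i.e. the b.x4 condition, so the body
holds 4 copies and the step becomes 4:

    /* y must hold n + 1 elements, as in the original loop */
    void first_difference_x4(double *x, const double *y, int n) {
        int i;
        /* unrolled main loop: 4 copies of the body, step multiplied by 4 */
        for (i = 0; i + 3 < n; i += 4) {
            x[i]     = y[i + 1] - y[i];
            x[i + 1] = y[i + 2] - y[i + 1];
            x[i + 2] = y[i + 3] - y[i + 2];
            x[i + 3] = y[i + 4] - y[i + 3];
        }
        for (; i < n; i++)            /* leftover iterations (peeled off) */
            x[i] = y[i + 1] - y[i];
    }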
First, the impact of loop unrolling on the recycling multithreaded execution is stud-
ied. The following conditions are explored:

• b.1 represents the original loop.

• b.x2. The loop is unrolled once.

• b.x4. The loop is unrolled 3 times.
• b.x8. The loop is unrolled 7 times.
5% of the total loop iterations, plus leftovers, are peeled (they are early iterations of
the loop) so that the remainder is an exact multiple of the unrolling factor plus one.
Exceptions are H_8, I_9, and J_10. In these benchmarks, the loops have only a few
iterations and the loop bodies are quite large; thus only the leftovers are peeled, so that
the master TPU does not execute more (original) loop iterations than those executed
by a slave TPU.
The graphs shown in Figures 4.20 and 4.21 demonstrate that a combination of loop
unrolling and loop peeling yields significant speedup for most benchmarks. The perfor-
mance gained in N_14 is limited by the serial loop execution in the program, whereas
the performance gained in N_14 (d) is more pronounced, as the execution time of the se-
rial loop is deducted from the total execution time (of both the sequential and the mul-
tithreaded programs) and therefore the speedup observed is due to the multithreaded
execution. The upper bound of the loop in F_6 is not constant; if the number of itera-
tions is less than the number of copies to be replicated, then the loop will be executed
sequentially. Therefore the performance drop in b.x8 is due to the increasing ratio
of the sequential execution to the multithreaded execution. Finally, H_8, I_9, and J_10
show little improvement because their original versions already performed well. More-
over, they had few loop iterations; as a result, the more the loops are unrolled, the less
loop-level parallelism is exploited.
Figure 4.20: Speedup of recycling multithreaded execution after loop unrolling and
loop peeling (b.1, b.x2, b.x4, b.x8; Y-axis: speedup w.r.t. SEQ (b.1); X-axis: no. of
slave TPUs; continued in Figure 4.21)
Figure 4.21: Speedup of recycling multithreaded execution after loop unrolling and
loop peeling (continued from Figure 4.20)
In the non-recycling multithreaded execution, multiple threads cannot reuse the
slave TPUs. It is more logical to allocate a chunk of iterations to each thread, where the
size of the chunk has an impact on loop-level parallelism. In the second experiment, the
benchmarks were optimised to fit the resource utilisation of the non-recycling model
and avoid any fork failure. For instance, if the loop in L_12, which comprises 1000
iterations, is to be executed by 4 TPUs, it will be unrolled 249 times to generate 4 chunks
of at most 250 iterations. These chunks are re-rolled, producing small loops similar to
the original one but with a conditional exit and with adjusted upper and lower bounds¹,
as illustrated in Figure 4.22. This was called loop chunking by Olukotun et al. [53].
The maximum number of iterations per chunk is ⌈(number of iterations) / (number of
TPUs)⌉. Then, a (full) chunk and leftover iterations can be jammed, peeled, and
allocated to the master TPU while the rest are distributed among the slaves.

¹In practice, the loop is never unrolled and re-rolled; the new loops are constructed by
modifying the original one.

Figure 4.22: Loop chunking for multithreaded execution on 4 TPUs
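A minimal sequential sketch of the chunking arithmetic, using the L_12 numbers
quoted above (the names are illustrative, and in the real scheme each re-rolled chunk
loop would be executed by a separate thread):

    #include <stdio.h>

    #define N_ITER 1000   /* total iterations, as in the L_12 example */
    #define N_TPUS 4      /* TPUs participating in the execution      */

    int main(void) {
        static double a[N_ITER], b[N_ITER];
        for (int i = 0; i < N_ITER; i++) b[i] = (double) i;

        /* maximum chunk size: ceiling(N_ITER / N_TPUS) = 250 here */
        int chunk = (N_ITER + N_TPUS - 1) / N_TPUS;

        /* each TPU t runs one re-rolled loop over its own range,
           with adjusted bounds and a conditional exit guarding the
           last (possibly short) chunk                               */
        for (int t = 0; t < N_TPUS; t++) {
            int lo = t * chunk;
            int hi = (lo + chunk < N_ITER) ? lo + chunk : N_ITER;
            for (int i = lo; i < hi; i++)
                a[i] = b[i] * 2.0;    /* stand-in loop body          */
        }
        printf("%f\n", a[N_ITER - 1]);
        return 0;
    }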
The optimisation is performed prior to the multithreaded transformation. The total
number of TPUs in the experiment, including the master and the slaves, is varied from
2 to 16. F_6 is excluded from the experiment because the number of iterations of its
multithreaded loop is not constant and is unknown at compile-time; the compiler would
let F_6 pass without any modification.
The results are shown in Figure 4.23. Reasonable speedup can be seen in all the
benchmarks, since each multithreaded version is specifically compiled to match
the number of TPUs available in the cluster. Because each TPU hosts only one thread
that performs its computation in parallel with the others, the execution time of a loop is
approximately the average execution time per thread, which corresponds to the amount
of computation in a chunk, plus the total delays between all the threads. When the
number of TPUs increases to a point where the chunks become too fine, or there are
insufficient iterations to allocate to every TPU, the program performance will no
longer improve. An example is H_8, in which the numbers of iterations per chunk when
there are {2, 4, 6, 8, 10, 12} TPUs participating in the execution are {50, 25, 17, 13,
10, 9}, respectively. The speedup rises significantly when 2-6 TPUs are used, as the
chunk size is reduced from 50 to 17 iterations. It then levels off when more than 6
TPUs are used, as the chunk size is almost unchanged. However, when the number of
TPUs increases to 14 (13 slaves plus a master), only 12 slave TPUs are actually used
because there are not enough iterations to allocate to the last one.
Comparing these results to the ones from the recycling model (Figures 4.20 and
4.21), particularly b.x8, shows that both versions give fairly similar speedup. Excep-
tions are G_7 and L_12, where the recycling model with loop unrolling performs
noticeably better than the non-recycling one with loop chunking. In the former, the new
unrolled iterations can be further optimised during the back-end compilation; for ex-
ample, some memory references are replaced by registers and repetitive address cal-
culations are eliminated. In the latter, the new loop iterations are rolled back and even
more instructions are added for checking and adjusting the loop bounds. Therefore,
the performance gained from applying loop chunking to the non-recycling execution
is due to the fact that the iterations are allocated to match the availability of resources,
thus eliminating the fork penalty. However, there is still the overhead of the chunking
added to each thread.
In the next experiment, loop chunking was applied in conjunction with nested mul-
tithreading in benchmarks R_18 and U_21. The multithreaded execution in the nested
loops is non-recycling. The benchmarks were prepared as described next.
Figure 4.23: Speedup of non-recycling multithreaded execution after loop chunking
(Y-axis: speedup w.r.t. SEQ; X-axis: no. of TPUs (master + slaves); N_14 and
N_14 (d) shown separately)
• For R_18,

  - {N(2,2), ..., N(2,7)} allocate 2 slave TPUs to the outermost loops while
    the innermost loops were chunked, with the number of TPUs ranging from
    2 to 7.

  - {N(3,2), ..., N(3,4)} allocate 3 slave TPUs to the outermost loops while
    the innermost loops were chunked, with the number of TPUs ranging from
    2 to 4.

• For U_21,

  - The innermost loops were always executed sequentially.

  - {N(2,2,1), ..., N(2,6,1)} allocate 2 slave TPUs to the outermost loops
    while the middle loop was chunked, with the number of TPUs ranging
    from 2 to 6.

  - {N(3,2,1), ..., N(3,4,1)} allocate 3 slave TPUs to the outermost loops
    while the middle loop was chunked, with the number of TPUs ranging
    from 2 to 4.
Figure 4.24 compares the results from before (Figures 4.13 and 4.14) and after the
optimisation. A fair improvement can be seen in both benchmarks, with the increase
in speedup ranging between 25% and 30%. In the case of U_21, the benefit of the
optimisation is less pronounced as more TPUs are used to execute the middle loop.
This is due to the fact that it comprises only 25 iterations, and the amount of work per
thread when more than 3 TPUs are used differs little, i.e. the numbers of iterations
executed by a thread when there are {2, 3, 4, 5, 6} TPUs participating in the execution
are {13, 8, 6, 5, 4}, respectively.








Figure 4.24: Speedup of nested-multithreaded programs with and without optimisation
(loop chunking); panels for R_18 (legend: N(2,x), N(2,x)-opt, N(3,x), N(3,x)-opt) and
U_21 (legend: N(2,x,1), N(2,x,1)-opt, N(3,x,1), N(3,x,1)-opt); Y-axis: speedup w.r.t.
SEQ; X-axis: total no. of TPUs
4.2.2.3 Cluster and Fork Penalties 
According to the multithreaded execution model, if the master thread fails to form a
slave cluster, then it has to execute the whole loop by itself. Although some instructions
whose guard values are zero can be bypassed, the performance of the transformed loop
executed sequentially may still be worse than that of the original loop. Once a cluster is
allocated, if a thread fails to fork a new slave, its penalty is to execute the next iteration
instead of retiring. Two multithreaded versions of each benchmark were prepared:

• MSEQ is the multithreaded program using cluster size 2.

• SSEQ is the multithreaded program using cluster size 1.
The total number of TPUs in the architecture is changed to 2. Because of this, cform
operations in MSEQ always fail. In contrast, those in SSEQ always succeed,
although the sole thread in the cluster always fails in xfrk. Hence both MSEQ and
SSEQ are always executed sequentially.
The performance displayed in Figure 4.25 indicates the worst case of the cluster and
fork penalties in the unoptimised multithreaded programs. Given the performance lost
from 100% of the cluster and fork failures, the average speedup over all benchmarks is
around 0.8. In C_3, F_6, and L_12, the speedup is only around 0.6, as they contain
very small loops. Optimistically, with a combination of loop unrolling and loop peel-
ing such as the b.x8 strategy (in Figures 4.20 and 4.21), the speedup of the recycling
multithreaded execution could be around 5 or higher if the cluster and fork operations
succeed, or closer to 1 if they all fail.
	
Figure 4.25: Speedup of multithreaded programs being sequentially executed
Loop chunking restructures the loop prior to the multithreaded transformation,
which allows the number of threads created to match the number of TPUs available.
However, in the next experiment, a loop is restructured so that n1 chunks are created,
but it is then transformed to be executed by n2 TPUs, where n2 < n1. The benchmarks
were prepared as follows:
prepared as follows: 
• crnpl.T6. The loop is restructured to create 6 chunks, and multithreaded trans-
formed using cluster size 11, 3, 51. 
• crnpl.T8. The loop is restructured to create 8 chunks, and multithreaded trans-
formed using cluster size 11, 3, 5, 71. 
• crnpl.T1O. The loop is restructured to create 10 chunks, and multithreaded trans-
formed using cluster size 11, 3,5,7, 91. 
The total number of TPUs actually used is equal to the cluster size plus one. In
cmpl.Tmatch, the loop is restructured and transformed such that the number of threads
created is equal to the number of TPUs, i.e. {2, 4, 6, 8, 10, 12, 14, 16}. These are the
same programs as the ones shown in Figure 4.23.
In Figure 4.26, the performance of cmpl.T6, cmpl.T8, and cmpl.T10 drops slightly
when there are more threads than the TPUs available. A common observation in all
benchmarks is that although both cmpl.T8 and cmpl.T10 suffer from the fork penalty
when they are given 6 TPUs, cmpl.T10 always performs better than cmpl.T8. This can
be explained by the fact that cmpl.T10 generates more threads, thus exposing more
loop-level parallelism when the slave TPUs are reusable. The loops in this experiment
are un-nested, which allows the masters to signal the slaves immediately after com-
pleting their execution. Even in nested loops, the execution can switch between recy-
cling and non-recycling; this depends on the lengths of the outer and the inner loop
bodies and the passing of the synchronisation signals at run-time. Therefore, the fork
penalty observed in this experiment is optimistic.
cmpl.Tmatch gives an upper bound of the multithreaded performance. Its approach
involves restructuring and multithreading a loop for every specific number of TPUs.
A loop that is restructured so that too few chunks are created offers little flexibility to
the multithreaded transformation and execution. Hence, an optimistic approach should
allow the pre-processor to create more chunks than the number of TPUs estimated
by the multithreaded transformer to be available. For example, from Figure 4.26, it
may be worth using the cmpl.T10 strategy if the loop can be pre-processed only once,
because the disadvantage of this program being executed by fewer than 10 TPUs is not
too severe, considering its speedup and the best one achieved by cmpl.Tmatch.
Figure 4.26: Speedup of multithreaded programs with fork penalty (cmpl.Tmatch,
cmpl.T6, cmpl.T8, cmpl.T10; Y-axis: speedup w.r.t. SEQ; X-axis: no. of TPUs
(master + slaves))
4.2.3 Summary 
The following conclusions can be drawn from the experiments. Firstly, because our
multithreaded execution relies on software thread manipulation, the thread size
should be large enough that the benefit gained from the multithreading outweighs
the overheads. Loop unrolling is employed for this very purpose. Furthermore, the
resource utilisation of the master TPU can be improved with the application of loop
peeling. The combination of both techniques achieved speedups of between 5 and 10
when the loops were unrolled 7 times and one-level multithreading was applied. Fig-
ures 4.27 and 4.28 summarise the performance of the one-level multithreaded programs.
A limitation of multithreading in nested loops was noted. There are several clusters
executing outer and inner loops simultaneously at different nest levels. In the current
system, a unique (synchronisation) signal can be received by only one thread at a time,
which allows that thread to commit, retire, and free its TPU. As a result, while the
signal is passed around in one cluster, the others repeatedly suffer from fork failures
since the TPUs cannot be recycled. Aggressive loop unrolling provides a solution to
this: chunks of iterations are generated to match the number of TPUs available and
allocated to individual threads. Speedups of between 4 and 5 were achieved after the
restructuring of the inner loops in the nests (as seen in Figure 4.24). However, this
technique compromises loop-level parallelism if too few threads are created while the
TPUs are reusable at run-time.
From the hardware perspective, increasing the number of TPUs allows more over-
lapping computation. However, there are points after which an increase in the number
of TPUs will no longer improve the program performance. A case for the recycling
execution is when the execution pattern of the loop reaches a point where a new thread
can reuse a TPU which has been freed by a previous thread, instead of using a new one.
For the non-recycling execution, which assigns a chunk of iterations to an individual
thread, the chunk size is inversely proportional to the number of TPUs participating in
the execution. If the chunks are already small, then adding an extra TPU will result in
even finer threads, which brings no further benefit to the multithreaded execution.
There are other compiler techniques which have not been explored. Because the data
cache is omitted from the simulation, techniques such as strip-mining or loop tiling,
which improve memory locality, were not considered. In addition, most benchmarks
contain small single loops, providing no opportunity for the application of loop fusion
or loop fission. Finally, loop coalescing and loop collapsing, which transform nested
loops into single-level ones, were not considered, as the multithreaded execution in
nested loops is one of the subjects examined in this research.
4.3 Chapter Summary 
Two multithreaded loop transformers were implemented using the SUIF framework.
One handles simple loops with only natural exits. The other handles loops with multiple
exits or whose upper bounds are unknown; such cases require the execution to be
speculative. Results from the preliminary experiments were reported and discussed.
Generally, the multithreaded programs deliver reasonable speedup with respect to the
sequential ones. Other traditional techniques, such as loop unrolling and loop peeling,
were also applied to improve the multithreaded performance.
Figure 4.27: Performance of one-level multithreaded programs (un-opt, p.20, b.x8;
note: for F_6, b.x6 is shown instead of b.x8; continued in Figure 4.28)
Figure 4.28: Performance of one-level multithreaded programs (un-opt, p.20, b.x8;
note: for F_6, b.x6 is shown instead of b.x8; continued from Figure 4.27)

Chapter 5

Multithreaded Control-Speculative Execution
Control-speculative execution permits either or both control-dependent paths of a
branch to be executed before the outcome of that branch is known. In the multi-
threaded execution, the speculated paths are typically executed by separate threads.
The choice of which path to speculate on is made using profile-based branch predic-
tion. The studies in [13, 24] revealed that most branches, especially the ones that can
take either direction with high probability, exhibit the same behaviour across different
program executions that use different input data. Strategies which rely on static pro-
gram analyses were studied in [8, 64]. Their findings were that a branch that chooses
between continuing or exiting a loop or a procedure is likely to take the continuation
path. Moreover, the path that does not contain function calls is more likely to be taken,
since most programs use conditional calls to handle exceptions, which rarely occur.
If a branch has low confidence, i.e. both paths are equally probable, then the specu-
lation may be omitted; alternatively, by employing dual-path speculation, both threads
are launched to execute speculatively. For branches with high prediction confidence,
single-path speculation forks only one thread to execute the more probable path. De-
ciding whether a branch has low prediction confidence based on the difference in prob-
abilities is subjective; it depends on a number of factors such as the prediction
accuracy, the misprediction penalty, and the resource availability.
Control speculation allows several program partitions to be executed simultane-
ously, each of which may, in turn, be executed by multiple threads. Empirical studies 
undertaken as part of this work prioritised concurrent program partitions and evaluated 
resource allocation strategies. In the next section, the transformation for multithreaded 
control-speculative execution is first explained. 
5.1 Transformations for Control Speculation 
The transformers process SUIF programs in which high-level TREE_IF nodes are
marked and dismantled into straight-line code, and low-level branch instructions are
recognised.¹ The following analyses are performed prior to the transformation.

The program is compiled procedure by procedure; for each one, a control-flow
graph (CFG) is constructed. The first node in the graph is always ENTRY and the last
one is either EXIT or RETURN, as shown in Figure 5.1. Dominator (or pre-dominator)
and post-dominator nodes of the branches are calculated [2, 4], as described next.
Given two nodes n1 and n2 in a CFG, n1 dominates n2 if every path from ENTRY to n2
goes through n1. Similarly, n2 post-dominates n1 if every path from n1 to EXIT goes
through n2. Based on these definitions, the control-flow from n1 to n2 is considered

¹A dismantled TREE_IF is arranged such that the original THEN path becomes the fall-through
path and the branch instead targets the original ELSE path. To maintain consistency with other low-level
structures, the fall-through path is called the ELSE path and the target of the branch is always the THEN
path.
backward if n2 dominates n1, and a branch is considered a forward branch if it is not
dominated by either of its targets.
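These dominator sets can be computed with the standard iterative dataflow algorithm
described in [2, 4]; the following is a minimal sketch on a five-node CFG (the node
numbering and the bitmask representation are illustrative assumptions, not the thesis's
implementation):

    #include <stdio.h>
    #include <stdint.h>

    #define N 5  /* nodes: 0 = ENTRY, edges 0->1, 1->2, 1->3, 2->4, 3->4 */

    int main(void) {
        int npred[N]    = {0, 1, 1, 1, 2};       /* predecessor counts */
        int preds[N][2] = {{0}, {0}, {1}, {1}, {2, 3}};
        uint32_t all = (1u << N) - 1, dom[N];

        dom[0] = 1u << 0;                        /* ENTRY dominates itself */
        for (int i = 1; i < N; i++) dom[i] = all;

        /* iterate dom(n) = {n} U (intersection of dom(p) over preds p)
           until a fixed point is reached                               */
        for (int changed = 1; changed; ) {
            changed = 0;
            for (int i = 1; i < N; i++) {
                uint32_t d = all;
                for (int p = 0; p < npred[i]; p++) d &= dom[preds[i][p]];
                d |= 1u << i;
                if (d != dom[i]) { dom[i] = d; changed = 1; }
            }
        }
        for (int i = 0; i < N; i++)
            printf("dom(%d) = 0x%x\n", i, (unsigned) dom[i]);
        return 0;
    }

Post-dominators are computed the same way on the reversed CFG, starting from EXIT.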
For each forward branch, parent and child regions represent boundaries at which 
speculation might be applied. The parent region is constructed by traversing the CFG 
upward, starting from the branch. Its dominators are added to the region until the first 
node which is not a dominator is reached or the re-convergent node of the previous 
branch has been included. On the other hand, traversing the CFG downward, starting 
from the branch, two child regions include nodes along THEN and ELSE paths. The 
construction of each region stops when the first post-dominator or the re-convergent 
node of that branch is reached. If a branch is found in a child region of another branch, 
then the parent region of the embedded branch is truncated so that only nodes in the 
enclosing child region are included. 
An example is shown in Figure 5.1. A node B represents a block of instructions as in
a basic block, but the branch instruction at the end of the block is represented separately
as an IF node.² The edge from IF(4) to B(1) represents backward control-flow. The for-
ward branches are IF(1), IF(2), and IF(3). Their dominators, post-dominators, parent
regions, and child regions are calculated as shown in the figure. As IF(2) is located
on a control-dependent path of IF(1), the parent region of IF(2) is truncated so that it
is within the child region of IF(1).

²In subsequent graphs in this thesis, only forward branches are represented separately.




Figure 5.1: An example of a control-flow graph. The figure annotates each forward
branch with its pre-dominators, post-dominators, parent region, and child regions.
For IF(1): pre-dominators {B(1)}; post-dominators {B(6), B(7), IF(3), B(10), IF(4),
B(11)}; parent region {B(1)}; child region 1 {B(2), B(4)}. For IF(2): pre-dominators
{B(1), IF(1), B(3)}; child region 1 {B(4)}. For IF(3): pre-dominators {B(1), IF(1),
B(6), B(7)}; child region 1 {B(8)}; child region 2 {B(9)}.
Table 5.1 Overheads of multithreaded speculative execution

Overheads                                         Average time units
Spec-Transformer-1: parent / child                30 / 44
Spec-Transformer-2: parent / child                40 / 36
Spec-Transformer-3: parent / child (per branch)   38 / 52
Branch probability is collected from sequential execution profiling and is added to
the SUIF files by tcovsuif. It gives two types of information:

1. The cumulative probability along a control-flow path until a branch is encoun-
   tered indicates whether that branch contributes significantly to the overall pro-
   gram execution. In the example in Figure 5.1, the cumulative probabilities of
   IF(1) and IF(3) are 1.0, but the cumulative probability of IF(2) is only 0.2.

2. The individual probability determines which direction to speculate. Both paths
   of the branch can be speculatively executed if they are equally probable.
Branches that are too fine for speculation are merged into the parent or child regions of
their neighbours, if possible. The criteria used to determine whether a branch
is too fine include the cumulative probability of that branch and the size of its
parent and child regions. The latter is compared with the speculation overheads in
Table 5.1, which are the average execution times of the thread manipulation routines in
the parent and the child threads (measured from the experiments in Section 5.2).
For a predicted branch, incoming control-flow from nodes other than the branch itself
to the child regions is diverted to new targets, by means of code replication which is
similar to tail duplication techniques in superblock or trace scheduling [10, 16, 19].
Outgoing control-flow from nodes other than the last one in the region is permitted from the
parent, but not from the child as the speculative execution can only be performed within 
the child region's boundary. 
In Figure 5.1, the child regions of IF(1) and IF(2) overlap, starting from
node B(4). Hence, B(4) is replicated and the control-flow from B(2) is directed to a
new node B(4'), as shown in Figure 5.2. The replication of B(6) is optional since B(6)
post-dominates both IF(1) and IF(2) but is not included in the child region of either
branch. However, by adding B(6') to the major path of IF(1), the size of the speculated
region can be increased to amortise the speculation overheads. Control-flow from
unconditional branch or jump instructions is handled in the same way. More examples
of code replication can be found in [51].
The final analysis involves extracting data dependency information from each pair 
of parent-and-child regions. Then, each predicted branch is transformed for the mul-
tithreaded control-speculative execution. An overview of the compilation flow is dis-
played in Figure 5.3. 
Figure 5.2: The control-flow graph in Figure 5.1 after code replication 
Determine predictable branches
  - tcovsuif
  - control-flow analysis
  - dependency analysis
  - user hints
Pre-process selected branches and code regions
  - extract information
  - reformat code layout
Transform
Figure 5.3: An outline of the transformation for speculative execution 
5.1.1 Single-Path Speculation 
In single-path speculation, a thread is forked to speculatively execute the predicted 
path of the branch. For instance, path { B(2), B(4'), B(6') } of the branch IF(1) in 
Figure 5.2 is chosen for the single-path speculation. 
The branch structure which has been reformatted for the transformation is shown
in Figure 5.4(b), from its original form in Figure 5.4(a). Figure 5.5 gives an example
of the transformed code when the THEN path is predicted. Only the lines marked with
an asterisk (*) are modified should the ELSE path be predicted; the alternative code
can be found in the comment section of those lines.





[Figure 5.4: Branch structure in the SUIF intermediate representation: (a) the original
TREE_NODE_LIST with the parent region, a TREE_IF (condition, THEN and ELSE paths),
and the post-dominating child region; (b) the reformatted layout with explicit labels
(PAR_VERIFY, PAR_RIGHT, PAR_WRONG, CH_PROLOGUE, ELSE_LABEL).]





     1   int guard, fsucc, pbra;                    // working variables
     ...                                            // working variables
     4   fsucc = frk (sequence_no, CH_PROLOGUE);    // fork a speculative thread
     5   mychild = cadr ();                         // get child's address
     6   guard = 0;                                 // indicate that this is parent thread
         { ... parent region ... }
         PAR_VERIFY:
    *9   if (condition) branch PAR_RIGHT;           // original branch instruction;
                                                    //   ''branch PAR_WRONG'' if ELSE is predicted
    *    goto PAR_WRONG;                            // ''goto PAR_RIGHT'' if ELSE is predicted
         PAR_RIGHT:
    13   if (fsucc) {
    14       psg (fsucc, mychild, sequence_no);     // pass signal to child
    15       sstp (fsucc, mychild);                 // parent synchronises and stops
    16   }
   *17   else goto THEN_LABEL;                      // ''goto ELSE_LABEL'' if ELSE is predicted
         PAR_WRONG:
    19   isg (fsucc, mychild, CH_ROLLBACK);         // interrupt child's execution
   *20   goto ELSE_LABEL;                           // ''goto THEN_LABEL'' if ELSE is predicted
    21   CH_PROLOGUE:
    22   guard = 1;                                 // indicate that this is child thread
    23   safe (guard, 0);                           // become speculative (unsafe)
    24   myparent = padr ();                        // get parent address
   *25   goto THEN_LABEL;                           // ''goto ELSE_LABEL'' if ELSE is predicted
         THEN_LABEL:
             { ... THEN path ... }
    27       pbra = 1;                              // post-dominating instructions excluded
         ELSE_LABEL:
             { ... ELSE path ... }
             pbra = 0;                              // post-dominating instructions included
    34   if (!guard) goto DONE_LABEL;               // if this is parent, exit
    35   wat (guard, sequence_no);                  // if this is child, wait for the signal
    37   cmmt (guard);                              // commit speculative stores
    38   goto DONE_LABEL;
    39   CH_ROLLBACK:
   *40       { abort slaves in THEN }               // or ELSE if it is predicted
         ...
    43   DONE_LABEL: if (pbra)                      // execute post-dominating code if excluded
    44       { ... post-dominating region ... }

Figure 5.5: Code generated by Spec-Transformer-1, THEN path is predicted
Parent Thread 
The parent thread speculatively forks a child (line 4) before continuing its execution in 
the parent region. As the target of the predicted branch is always THEN-LABEL (line 9), 
it is verified as follows: 
If the THEN path is predicted: 
• the prediction is correct if the branch is taken (branch PAR-RIGHT). 
• the prediction is wrong otherwise (goto PAR-WRONG). 
If the ELSE path is predicted: 
• the prediction is wrong if the branch is taken (branch PAR-WRONG). 
• the prediction is correct otherwise (goto PAR-RIGHT). 
At PAR_RIGHT, if the frk instruction succeeded, then the parent thread passes a
signal to its child (line 14) and stops (line 15). The parent retires only when it becomes
the head thread; therefore, the sstp instruction is used. If the frk failed, then the
parent has to execute the correct path itself (line 17) since no child thread was
spawned. In case of a wrong prediction, i.e. PAR_WRONG, the parent interrupts its
child's execution (line 19) and goes to the correct path (line 20).
Child Thread 
The child's execution starts at CH-PROLOGUE (line 21). Being a speculative thread, it 
turns off the safe flag (line 23) before jumping to the predicted path (line 25). After 
the path's execution, it waits for a signal from its parent (line 35). If the prediction is 
correct, then all the speculative stores will be committed (line 37) as soon as the signal 




    parent:  psg (fsucc, mychild, signal_id);       child:  wat (guard, signal_id);
                                                            load instruction

Figure 5.6: Memory communication in Spec-Transformer-1
is received. The child will also be appointed the next head thread and leave the branch 
structure at DONE-LABEL. 
There can be multithreaded loops along the speculated path. For a series of them, 
only the first loop is actually a speculative one since the master thread is always 
blocked when trying to pass the synchronisation signal to the slaves. Once the branch 
prediction is verified and the loop is on the correct path, the execution can resume and 
move on to the next loop. If the prediction is wrong, the child will be unblocked or 
interrupted from its current execution. It then jumps to CH-ROLLBACK and aborts the 
slave cluster before stopping. 
Data Dependence 
The data dependence between parent and child threads is handled in the same way as 
in the loop transformations. Memory communication for each dependent instruction 
pair is shown in Figure 5.6. Synchronisation between the parent and the child is en-
forced by passing and waiting for a signal. In the case of register communication (see
Section 4.1.3), a set of registers that may cause data dependencies is declared by a uregs
instruction which is inserted before the frk (line 4). These registers can be forwarded
to the child thread by an fregs instruction once they are available, as sketched below.
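A schematic of the parent-side sequence (register names are placeholders; only uregs,
frk, and fregs are the instructions named in this chapter):

    uregs { $r1, $r2 };                        /* declare registers that may cause      */
                                               /* dependencies; the child starts with   */
                                               /* their wait bits set                   */
    fsucc = frk (sequence_no, CH_PROLOGUE);    /* fork the speculative child            */
    /* ... parent computes $r1 and $r2 ... */
    fregs { $r1, $r2 };                        /* forward both registers in one         */
                                               /* instruction once they are available   */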
For a large number of data dependencies, the register communication would in
practice be less costly than the memory communication, for two reasons: no extra
instruction is needed in the child thread, and the parent can forward up to 32 registers
in one instruction. However, since register usage information is required, this has to
be done during the back-end compilation, possibly in conjunction with the instruction
scheduling, and should ensure that the child thread is not starved of registers.
Post-Dominating Region 
To increase the thread size, post-dominating instructions, or instructions below the 
re-convergent point, might be included in the child thread. A branch structure is sym-
metric if both paths exclude the post-dominating code, or include the same copy of the 
post-dominating code. The variable pbra is set, in the THEN and the ELSE paths, to 
either 0 (included) or 1 (excluded). It is checked by the thread that leaves the branch 
structure as to whether additional execution is required (lines 43 and 44). 
Control-Flow Breaks 
Control-flow breaks in the parent and child regions are handled in the following ways: 
1. Conditional and unconditional branches or jumps. Trivial branches that are 
not speculated might be included in the parent or child regions during the pre-
transformation analysis. If the branches are included in the child region, then 
their targets must also be inside the region. If they are included in the parent 
region but the targets are outside the region, then interrupt instructions (similar 
to line 19) are added. The child thread is aborted before the parent jumps to the 
outside targets. 
2. Procedure returns and program exits. These breaks are typically guarded by
conditional branches. They are only included in the parent region. Interrupt
instructions are inserted before the breaks to abort the child thread. Then the
parent exits the current procedure or stops the program execution.
3. Procedure calls. Only calls to non-recursive procedures are included. The
procedure should not contain instructions that may raise exceptions. As dependence
on data is not speculated, if the child consumes a value returned from a procedure
which is called by the parent, it has to wait until that value is available. As
the infinite-save-registers option is used in the back-end code generator, the values
passed to and from the procedures are saved in registers only. Procedure calls
might also be included in the child region, although this is avoided where possible.
4. Exceptions. The source code had been checked and modified to handle exceptions
before it was translated into SUIF IR. Instructions that may cause exceptions
are guarded by conditional branches. If a condition leading to an exception
occurs inside a procedure, then a unique value is returned. At the caller's site, a
conditional branch is also added to check whether the value is returned by an
exception or by normal execution. If the current site is the main procedure,
then the program execution stops. This is translated into SUIF IR as a series of
procedure returns and program exits guarded by conditional branches.
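A hedged C illustration of this guarding convention (safe_div and ERR_SENTINEL are
invented for exposition):

    #define ERR_SENTINEL (-99999)        /* assumed unique value for the exception path */

    int safe_div (int a, int b)
    {
        if (b == 0)
            return ERR_SENTINEL;         /* condition leading to an exception           */
        return a / b;                    /* normal execution                            */
    }

    /* at the caller's site */
    v = safe_div (x, y);
    if (v == ERR_SENTINEL)
        return ERR_SENTINEL;             /* propagate; main() would stop the program    */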
5.1.2 Dual-Path Speculation 
In contrast to the single-path speculation, the dual-path one does not predict the branch 
direction. Instead, threads are forked to speculatively execute both paths. The more 
probable one is forked earlier so that it can acquire any available TPU before the other. 
The parent thread also executes the code in the parent region and the conditional code 
in parallel with the child threads' execution. As soon as the branch direction is known, 
one of them will proceed while the other will be squashed. 
An example of code generated for dual-path speculation is shown in Figure 5.7. 
The transformer keeps two lists of dependent instruction pairs. Each of them is for data 
dependency between the parent and each child. Memory communication is handled in 
the same way as in the single-path speculation. Variables originally accessed by both 
paths are replicated to avoid the second speculative thread reading the value written by 
the first speculative thread. Otherwise, load operations in the second path are guarded
by a safe/unsafe switch indicating whether to load the safe version of the data from the
shared memory instead of searching through the speculative buffers (this applies to all
threads in the case of compound speculation).
If register communication is used, an fregs instruction broadcasts registers to both 
child threads at once. The registers forwarded from the parent will be received only if 
the wait bits in the child's registers are set to TRUE. Figure 5.8 is an example of register 
communication, in which both child threads are register dependent on different regis-
ters of the parent. The dependent registers are declared by a uregs instruction prior to 
each fork; however, the effect of uregs is cumulative. Thus, upon thread initialisation 
of the second child, both $rl and $r2 are unavailable. During the computation, the 
wait bit in $rl is automatically set to FALSE upon the second thread's write-back op-
eration. If $rl is forwarded from the parent before the write-back, then it is accepted 
but overwritten afterwards. Alternatively, $rl in the second path could be renamed. 





     1   int guard, T_fsucc, E_fsucc, pbra;             // working variables
     ...                                                // working variables
     4   guard = 1;                                     // inherited by child threads
     5   T_fsucc = frk (sequence_no, THEN_LABEL);       // fork 1st speculative thread
     6   T_mychild = cadr ();
     7   E_fsucc = frk (sequence_no, ELSE_LABEL);       // fork 2nd speculative thread
     8   E_mychild = cadr ();
     9   guard = 0;                                     // indicate that this is parent thread
         { ... parent region ... }
         PAR_VERIFY:
         if (condition) ...                             // original is ''branch THEN_LABEL''
         ...
    14   isg (T_fsucc, T_mychild, CH_ROLLBACK_THEN);    // interrupt child's execution in THEN
    15   psg (E_fsucc, E_mychild, sequence_no);         // pass signal to child in ELSE
    16   sstp (E_fsucc, E_mychild);                     // parent synchronises and stops
         ...
    19   isg (E_fsucc, E_mychild, CH_ROLLBACK_ELSE);    // interrupt child's execution in ELSE
    20   psg (T_fsucc, T_mychild, sequence_no);         // pass signal to child in THEN
         ...
         THEN_LABEL:
    25   pbra = 1;                                      // post-dominating instructions excluded
    26   safe (guard, 0);                               // initialise thread
             { ... THEN path ... }
         ELSE_LABEL:
    30   same as lines 25-26
             { ... ELSE path ... }
         ...
    35   if (!guard) goto DONE_LABEL;                   // if this is parent, exit
         wat (guard, sequence_no);                      // if this is child, wait for the signal
         ...
         cmmt (guard);                                  // commit speculative stores
         ...
         CH_ROLLBACK_ELSE:
             { abort slaves in ELSE }
    42   stp (guard, -1);                               // child stops
    43   CH_ROLLBACK_THEN:
    44       { abort slaves in THEN }
         ...
    48   { ... post-dominating region ... }

Figure 5.7: Code generated by Spec-Transformer-2




[Figure 5.8: Register communication. The parent executes uregs { $r1 } before forking
child-1 and uregs { $r2 } before forking child-2, then broadcasts both registers with
fregs { $r1, $r2 }. Child-1 is initialised with $r1 unavailable; child-2, because the
effect of uregs is cumulative, with both $r1 and $r2 unavailable.]
[Figure 5.9: Sample nest of branches for Figure 5.10, preceded by loop A.]
5.1.3 Nested Speculation 
The speculation is extended to nested branches. We focus on completely-nested branch 
structures. Interference from any other control-flow path was eliminated as a result of 
the code replication applied during the pre-transformation analysis. 
Figure 5.9 gives an example of nested branches, in which paths THEN(1), THEN(2),
and THEN(3) are predicted. The generated code (a skeleton is shown in Figure 5.10) is
similar to the one produced by Spec-Transformer-1 (or Spec-Transformer-2, in the
case of dual-path speculation), but with a few additional constraints.




     1   { ... }                                    // executed by outermost branch
     2   out_guard = 0;
     3   NEST_ID = sequence_no;                     // unique signal used in the nest
     4   int guard, fsucc, pbra;                    // local variables (per branch)
     ...                                            // local variables (per branch)
     7   same as Figure 5.5                         // fork a speculative thread
     8   PAR_VERIFY:
    *9   wat (out_guard, NEST_ID);                  // wait until all outer branches resolved
         ...
   *14   psg (fsucc, mychild, NEST_ID);             // pass signal to child
    15   sstp (fsucc, mychild);                     // parent synchronises and stops
    16
    17   else goto THEN_LABEL;                      // ''goto ELSE_LABEL'' if ELSE is predicted
    18   PAR_WRONG:
    19   isg (fsucc, mychild, CH_ROLLBACK);         // interrupt child's execution
   *20   psg (1, myself, NEST_ID);                  // deposit signal before proceeding
    21   goto ELSE_LABEL;                           // ''goto THEN_LABEL'' if ELSE is predicted
         CH_PROLOGUE:
    23   out_guard = 0;                             // switch off guard from outer branch
    24   ...                                        // initialise child thread
         ...
    30   CH_RESOLVE:
    31   if (!guard) goto DONE_LABEL;               // if this is parent, exit
         ...                                        // child waits for signal
    34   cmmt (guard);                              // commit speculative stores
   *35   psg (1, myself, NEST_ID);                  // signal itself
    36   goto DONE_LABEL;
    37   CH_ROLLBACK:
    38       { abort slaves in THEN }               // or ELSE if it is predicted
   *39   isg (in_fsucc, mychild, IN_CH_ROLLBACK);   // interrupt next thread in the nest
         ...
    41   DONE_LABEL:
         ...

Figure 5.10: Code generated for nested speculation
Firstly, the branches are resolved in sequential order. The parent thread is blocked 
(line 9) until it receives the signal from its own parent which speculates the previous 
or outer branch. Once the signal is received, it proceeds to evaluate the branch and 
pass the signal to its child if the speculation is correct (line 14) before stopping (line 
15). Due to the default forking operation, the parent inherits all the guards from its 
predecessors and passes them to the child. The child leaves its own guard on but 
switches off the others (it only has to switch off the parent's guard). If the parent 
thread encounters any control-flow break from its parent region, it has to wait until the 
outer branches are resolved. Thus, a wat instruction similar to line 9 is inserted in front 
of the break. 
Another constraint is how incorrect speculation is handled (line 18). A simple strat-
egy has been implemented, i.e. the child thread and all its successors are aborted (line 
39). The parent thread then executes the correct path (line 21). After the speculation is 
resolved, the thread that executes the correct path (either the parent or the child) will 
leave the current branch at DONE-LABEL (line 41) and arrive at CH-RESOLVE (line 30) of 
the outer branch. 
Data Dependence 
The handling of data dependence is slightly more complicated as the transformation 
of each branch is performed separately. An example is displayed in Figure 5.11(a). At 
run-time, the order of threads T1, T2, and T3 is maintained by the global thread control 
unit (GTCU). T3 should read A from T1's buffer (since there is no store to A's address
by T2). Synchronisation between the grandparent (T1) and grandchild (T3) is required
to ensure that the data retrieved by T3 is the correct version.
[Figure 5.11: Handling of data dependencies in nested branches: (a) T1 forks T2, which
forks T3, and T3 reads a value A stored by T1; (b) a copy instruction inserted in T2
relays A from T1 to T3.]
We opted to handle data dependencies and synchronisation on a parent/child
basis. A copy instruction is therefore inserted in T2's code, as shown in Figure 5.11(b),
to convey the data from T1 to T3. The same strategy is applied if register communication
is used in place of memory communication. Although simple to implement, a
drawback of this method is the introduction of artificial data dependencies.
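As a hedged sketch of the inserted relay (the signal identifier is an assumption), T2's
code gains a synchronised copy between a wat/psg pair:

    wat (guard, SIG_A);               /* wait until T1's store to A has been signalled */
    A = A;                            /* copy: load T1's version of A and store it     */
                                      /* into T2's own speculative buffer              */
    psg (fsucc, mychild, SIG_A);      /* T3 may now read A from its immediate parent   */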
Table 5.2: Description and general statistics of synthetic benchmarks

                                                              Dynamic Distribution (%)
  Name     Description                      Instructions      Init     Main      Verify
  SYN_1    Simple branch I                   3,316,589        0.34     99.66     0.00
  SYN_2    Simple branch II                  3,847,163        0.37     99.63     0.00
  SYN_3    A series of branches I            4,512,484        0.32     99.49*    0.19
  SYN_4    A series of branches II           4,516,761        0.37     99.63     0.00
  SYN_5    A nest of branches I              3,846,046        0.34     99.56     0.10
  SYN_6    A nest of branches II             4,481,544        0.35     99.65     0.00
  SYN_7    Branch in multithreaded loop      4,682,408        0.12     99.88     0.00
  Average                                    4,171,856        0.32     99.64     0.05

  *Sequential loop that computes Lrt3 is included
5.2 Performance Evaluation 
5.2.1 Benchmarks 
Benchmarks used in the experiments were synthesised from modified Livermore loops 
which were arranged in conditional branch structures. Table 5.2 displays the general 
statistics of the benchmarks collected from their sequential execution. Figures 5.12 to 
5.20 show the modified Livermore kernels (their average execution times are shown in
Table 5.3) and fragments of the benchmarks' source code and control-flow graphs. The 
simulator takes as its input the assembly code of the benchmarks. It takes around 25-30 
minutes to run a sequential program of 4.5 million dynamic instructions to completion, 
and up to 40-45 minutes for a multithreaded version of the same program (the overhead 
is due to the updating and searching of thread information in TPUs and GTCU). 
Table 5.3: Average sequential execution time (per invocation)

  Kernel        A_1      C_3      G_7       L_12
  Time Units    56613    28056    176000    39500
    // global variables
    int xA[501], xC[501], xG[501], xL[501], y[501], z[523], u[523];
    int r, t, q;
    int Lrt1, Lrt2, Lrt3, kLrt1, kLrt2;
    int L, LOOP, N, csum;

    void init ()
    {
        int k;
        for (k = 0; k <= 500; k++) {
            xA[k] = xG[k] = xL[k] = 0;
            xC[k] = 500 - k;
            y[k] = 1;
        }
        for (k = 0; k < 522; k++) {
            u[k] = k;
            z[k] = k + k;
        }
        r = 5; t = 2; q = 0;
    }

    void A_1 (int n1, int n2)                        // Loop A
    {
        for (int k = n1; k < n2; k++) {
            xA[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
        }
    }

    int C_3 (int n1, int n2, int mc1, int mc2)       // Loop C
    {
        rC = 0;
        for (int k = n1; k < n2; k++) {
            rC = rC + z[k] * xC[k];
            ...
        }
        ...
    }

Figure 5.12: Modified Livermore kernels (continued in Figure 5.13)
    void G_7 (int n1, int n2, int mg)                // Loop G
    {
        for (int k = n1; k < n2; k++) {
            xG[k] = u[k] + r * (z[k] + r * y[k]) +
                    t * (u[k + 3] + r * (u[k + 2] + r * u[k + 1]) +
                         t * (u[k + 6] + r * (u[k + 5] + r * u[k + 4])));
            xG[k] = xG[k] - mg;
        }
    }

    void L_12 (int n1, int n2, int ml1, int ml2)     // Loop L
    {
        for (int k = n1; k < n2; k++) {
            xL[k] = (ml1 * y[k + 1]) - (ml2 * y[k]);
        }
    }

Figure 5.13: Modified Livermore kernels (continued from Figure 5.12)
Four Livermore loops were used: L, A, and G are small, medium, and large fully-
parallelisable loops, respectively; C is a small loop with cross-iteration dependence.
The structures of the synthetic benchmarks provide opportunities for single-path, dual-
path, and nested speculation. The first six benchmarks can be divided into two groups:
{SYN_1, SYN_3, SYN_5} and {SYN_2, SYN_4, SYN_6}. The second group imitates the
first one, but it provides further opportunities to speculate on control-independent paths
of the branches (Section 5.2.2.2). Branch probabilities and loop sizes in the parent and
the child regions are varied between the benchmarks, in order to explore different
resource allocation strategies.

• SYN_1 and SYN_2 contain simple branch structures. In SYN_1, the loops inside
the branch structure (speculative) are bigger than the one dominating the branch
(non-speculative), while the opposite is the case in SYN_2.
    main ()
    {
        LOOP = 70; N = 501; csum = 0;
        for (L = 1; L <= LOOP; L++) {
            Lrt1 = (L + r) * t;
            Lrt2 = (L * r) + t;
            A_1 (0, N);
            if (...)                            // THEN (0.23)
                G_7 (1, N, ...);
            else                                // ELSE (0.77)
                G_7 (1, N, ...);
            for (k = 1; k < N; k = k+50)
                csum = csum + xA[k] + xG[k];
        }
    }

Figure 5.14: Synthetic benchmark SYN_1
• SYN_3 and SYN_4 contain a series of branch structures. They are used to test
whether the code generated by the transformers works correctly, i.e. the structures
must be handled one-by-one. Hence, the first structures and their threads must
be resolved before subsequent ones are speculated.

• SYN_5 and SYN_6 contain nests of branch structures. The loops inside the inner
branch structures (speculative) are the biggest ones in SYN_5, but the smallest
ones in SYN_6.

The last benchmark, SYN_7, contains a parallelisable loop, inside which is a branch
structure. At present, the loop transformers and the speculation transformers work
separately, and control dependence across loop iterations is neither recognised nor
handled by the loop transformers. Thus, an assumption being made is that the branches
inside the iterations must be independent of each other.





    main ()
    {
        LOOP = 70; N = 501;
        for (L = 1; L <= LOOP; L++) {
            Lrt1 = (L + r) * t;
            Lrt2 = (L * r) + t;
            A_1 (0, N);
            if ((xA[N-1] * Lrt1) > (xA[N/2] * Lrt2))    // THEN (0.23)
                L_12 (0, N-1, Lrt1, Lrt2);
            else                                        // ELSE (0.77)
                L_12 (0, N-1, Lrt2, Lrt1);
            G_7 (1, N, 0);
            for (k = 1; k < N; k = k+50)
                csum = csum + xA[k] + xG[k] + xL[k];
        }
    }

Figure 5.15: Synthetic benchmark SYN_2
    main ()
    {
        LOOP = 70; N = 501;
        for (L = 1; L <= LOOP; L++) {
            Lrt1 = (L + r) * t;
            Lrt2 = (L * r) + t;
            A_1 (0, N);
            if ((xA[N-1] * Lrt1) > (xA[N/2] * Lrt2))
                G_7 (1, N, Lrt1);
            else
                G_7 (1, N, Lrt2);
            Lrt3 = 0;
            for (k = 1; k < N; k++)
                Lrt3 = Lrt3 + 3 * xG[k] / (xA[k] + 1);
            if (Lrt3 > (Lrt1 * Lrt2))
                L_12 (0, N-1, r, t);
            else
                L_12 (0, N-1, t, r);
            for (k = 1; k < N; k = k+50)
                csum = csum + xA[k] + xG[k] + xL[k];
        }
    }

Figure 5.16: Synthetic benchmark SYN_3





    main ()
    {
        LOOP = 70; N = 501;
        for (L = 1; L <= LOOP; L++) {
            Lrt1 = (L + r) * t;
            Lrt2 = (L * r) + t;
            A_1 (0, N);
            if ((xA[N-1] * Lrt1) > (xA[N/2] * Lrt2))    // THEN (0.23)
                ...
            else                                        // ELSE (0.77)
                ...
            qC = C_3 (0, N, -Lrt1, -Lrt2);
            if ((qC/500) > (Lrt1*Lrt1 + Lrt2*Lrt2))     // THEN (0.74)
                L_12 (0, N-1, r, t);
            else                                        // ELSE (0.26)
                L_12 (0, N-1, t, r);
            for (k = 1; k < N; k = k+50)
                csum = csum + xA[k] + xG[k] + xL[k];
        }
    }

Figure 5.17: Synthetic benchmark SYN_4




    main ()
    {
        LOOP = 70; N = 501;
        for (L = 1; L <= LOOP; L++) {
            Lrt1 = (L + r) * t;
            Lrt2 = (L * r) + t;
            A_1 (0, N);
            if ((xA[N-1] * Lrt1) > (xA[N/2] * Lrt2)) {      // THEN (0.23)
                qC = C_3 (0, N, Lrt1, -Lrt2);
                if ((qC/800) > (Lrt1 * Lrt1 * Lrt2))
                    G_7 (1, N, Lrt1);
                else
                    G_7 (1, N, -Lrt1);
            }
            else {                                          // ELSE (0.77)
                qC = C_3 (0, N, -Lrt1, Lrt2);
                if ((qC/800) < (Lrt2 * Lrt2))
                    G_7 (1, N, -Lrt2);
                else
                    G_7 (1, N, Lrt2);
            }
            for (k = 1; k < N; k = k+50)
                csum = csum + xA[k] + xG[k];
        }
    }

Figure 5.18: Synthetic benchmark SYN_5





    main ()
    {
        LOOP = 70; N = 501;
        for (L = 1; L <= LOOP; L++) {
            Lrt1 = (L + r) * t;
            Lrt2 = (L * r) + t;
            A_1 (0, N);
            if ((xA[N-1] * Lrt1) < (xA[N/2] * Lrt2)) {
                qC = C_3 (0, N, Lrt1, 0);
                if ((qC % Lrt1) > (Lrt1 / r))
                    L_12 (0, N-1, r, 2*t);
                else
                    L_12 (0, N-1, t, 2*r);
                G_7 (1, N, Lrt1);
            }
            else {
                qC = C_3 (0, N, Lrt2, 0);
                if ((qC % Lrt2) > (Lrt2 / r))
                    L_12 (0, N-1, 2*r, t);
                else
                    L_12 (0, N-1, 2*t, r);
                G_7 (1, N, Lrt2);
            }
            for (k = 1; k < N; k = k+50)
                csum = csum + xA[k] + xG[k] + xL[k];
        }
    }

Figure 5.19: Synthetic benchmark SYN_6
    main ()
    {
        LOOP = 30; N = 51;
        for (L = 1; L <= LOOP; L++) {
            for (k = 0; k < N; k++) {
                xA[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);   // A_1
                kLrt1 = k * (L + r + t);
                kLrt2 = k + (L * r * t);
                if (xA[k] > (kLrt1 * kLrt2))                          // THEN (0.28)
                    qC = C_3 (0, 201, kLrt1, -kLrt2);
                else                                                  // ELSE (0.72)
                    qC = C_3 (0, 201, -kLrt1, kLrt2);
                xA[k] = xA[k] + qC;
            }
            for (k = 1; k < N; k = k+10)
                csum = csum + xA[k];
        }
    }

Figure 5.20: Synthetic benchmark SYN_7
5.2.2 Results and Discussions 
Calls to the Livermore procedures were inlined so that the main procedures (main ())
contain the complete Livermore loops. The loops in all the benchmarks, with the
exception of SYN_7, are unrolled 99 times and re-rolled to produce chunks of 100
iterations each; for loops A and C, the first chunks in each case contain 101 iterations
(sketched below). When a loop is multithreaded, the first chunk is allocated to the
master thread while the others are distributed to the slaves. The architectural parameters
used in the simulation are the same as those listed in Table 4.3, except that the total
number of TPUs is increased to 24. The probability threshold is set to 0.65, which
implies that if the more probable path of a branch is less confident than this threshold,
then the branch will be transformed for dual-path speculation. Since the synthetic
benchmarks are well-structured, the pre-transformation processing, which involves
control-flow analysis, region formation, and dependency analysis, is straightforward.
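In source terms, the chunking of loop A amounts to the sequence below (a sketch only:
the generated code dispatches these chunks to the master and slave threads rather than
calling them sequentially):

    A_1 (0, 101);       /* first chunk, 101 iterations: master thread     */
    A_1 (101, 201);     /* remaining 100-iteration chunks: slave threads  */
    A_1 (201, 301);
    A_1 (301, 401);
    A_1 (401, 501);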
The performance of multithreaded non-speculative and speculative programs was
compared. During the execution of control-independent loops in both cases, the slave 
TPUs are reusable. For the speculative execution of control-dependent loops, the 
reusability of the slave TPUs depends on when the (speculative) master threads receive 
synchronisation signals from their parents. A loop is control-independent of a branch
if it dominates or post-dominates that branch, and is control-dependent otherwise, as
expressed below.
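This definition translates directly into a predicate (Node, dominates, and postdominates
are assumed to be provided by the pre-transformation CFG analysis):

    int control_independent (Node *loop_header, Node *branch)
    {
        return dominates (loop_header, branch)
            || postdominates (loop_header, branch);
    }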
The first set of results is displayed in Figure 5.21. In both non-speculative (MULTI)
and speculative (SPEC) programs, the sizes of clusters executing the control-independent
loops range from 3 to 6 TPUs, whereas the sizes of those executing the control-dependent
loops are fixed at 3 TPUs. This allocation strategy (see Section 5.2.2.1) is
called CIndep, as more TPUs are given to the control-independent partitions.








[Figure 5.21: Speedup of speculative programs (CIndep policy). One panel per benchmark,
SYN_1 to SYN_6; Y-axis: speedup w.r.t. SEQ; X-axis: cluster size (master + slaves),
3 to 6; series: MULTI and the SPEC variants (S-, D-, NS-, ND-).]
For the speculative programs, the prefix "N-" refers to nested speculation, "S-" to
single-path speculation in spite of branches' low confidence, and "D-" to dual-path
speculation in cases of branches' low confidence. The speculative execution offers a
slight improvement over the non-speculative one. Single-path speculation in spite of
the branches' low probability causes frequent misprediction. Its penalty is the lost
opportunity of executing the correct paths in parallel with the parent threads' execution.
From the graphs, these lost opportunities appear to have little impact, since the
performance of S-SPEC and NS-SPEC in SYN_3 and SYN_5 is only marginally poorer
than that of D-SPEC and ND-SPEC in the same benchmarks.
5.2.2.1 Cluster Allocation 
The performance of the multithreaded non-speculative programs shown earlier is be-
low its true potential, i.e. the maximum speedup it could have achieved, given the total 
number of the TPUs available. The cluster allocation in these programs corresponds to 
the scheme used in their speculative counterparts. However, it is unfair when the na-
ture of the non-speculative execution is considered, i.e. the loops are executed one-by-
one. Because the control-dependent loops are executed after the branch directions are 
known, they can in fact reuse all the TPUs released by the control-independent loops. 
Figure 5.22 shows speedup of the non-speculative programs when all the loops are 
allocated the same number of TPUs ranging from 3 to 6. The increase in the speedup 
is significant when the number of TPUs matches the number of threads executing the 
loops. 
A similar scheme can be used in the speculative programs. The difference is that 
only the control-dependent loops on the non-speculative paths can reuse all the TPUs 
because they are executed after misprediction occurs and they are confirmed to be the 
[Figure 5.22: A comparison of 2 cluster allocation policies for non-speculative programs:
MULTI - CIndep loops vs. MULTI - All loops, one panel per benchmark. Y-axis: speedup
w.r.t. SEQ; X-axis: cluster size (master + slaves), 3 to 6.]
correct paths. Ideally, these loops should reuse the TPUs released from both control-
independent loops and the loops on the mispredicted paths. However, unless synchro-
nisation is added, the loops on the correct paths may try to form clusters before the 
ones on the wrong paths release theirs. If the cluster sizes are larger than the number 
of TPUs guaranteed to be available when the misprediction recovery starts, then these 
operations may be unsuccessful, causing the loops to be sequentially executed instead. 
Another cluster allocation strategy considers the contribution of each loop to the 
overall program execution. If multiple loops are executed concurrently, the one that 
contributes most to the total execution time should receive the largest number of TPUs. 
To calculate the amount of each loop's contribution, the cumulative probability along 
the control-flow path leading to that loop and its execution time are taken into account. 
For each loop $i$,

$$T_i = \text{cumulative probability}_i \times \text{sequential execution time}_i$$

$$\text{contribution}_i\ (\%) = \frac{T_i}{\sum_j T_j} \times 100$$
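As a consistency check, the SYN_1 entries in Table 5.4 follow from the per-invocation
times in Table 5.3 and the branch probabilities: $T_A = 1.0 \times 56613 = 56613$,
$T_{G,THEN} = 0.23 \times 176000 = 40480$, and $T_{G,ELSE} = 0.77 \times 176000 = 135520$.
Their sum is 232613, giving contributions of approximately 24%, 17%, and 58%
respectively, which matches the table.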
Four cluster allocation strategies are examined. They employ different criteria to
prioritise loops in the benchmarks. The highly-prioritised ones are given numbers
ranging from 4 to 6 TPUs³, whereas the others are always given only 3 TPUs. These
strategies are
• CIndep. The priority is given to the control-independent loops only.
• NonSPEC. The priority is given to the non-speculative loops. If multiple loops 
are executed at the same time, the prioritised one is usually control-independent. 
The other non-speculative loops executed individually are also prioritised. 
• Critical. If multiple loops are executed at the same time, then the priority is
given to the loop which contributes most to the overall program execution. The
contribution factor of each loop in the benchmarks is calculated and shown in
Table 5.4.
• All. The same number of TPUs (3-6) is allocated to all the loops. 
³All prioritised loops in a benchmark are given the same number of TPUs; for example, all of them
are given 6 TPUs.
Table 5.4: Contribution of individual loops to the overall program execution

  Benchmark   Loop { path }            %    Loop { path }            %
  SYN_1       A { control indep }     24
              G { THEN }              17    G { ELSE }              58
  SYN_2       A { control indep }     21    L { THEN }               3
              G { control indep }     65    L { ELSE }              11
  SYN_3       A { control indep }     21
              G { THEN }              15    L { THEN }               9
              G { ELSE }              50    L { ELSE }               6
  SYN_4       A { control indep }     19    C { control indep }      9
              G { THEN }              13    L { THEN }              10
              G { ELSE }              45    L { ELSE }               3
  SYN_5       A { control indep }     22
              C { THEN }               2    C { ELSE }               8
              G { THEN, THEN }        10    G { ELSE, THEN }        24
              G { THEN, ELSE }         6    G { ELSE, ELSE }        28
  SYN_6       A { control indep }     19
              C { THEN }               2    C { ELSE }               7
              L { THEN, THEN }        <1    L { ELSE, THEN }         8
              L { THEN, ELSE }         3    L { ELSE, ELSE }         2
              G { THEN }              14    G { ELSE }              45
The results are shown in Figures 5.23 and 5.24. Generally, each speedup bar in 
the graphs has 4 layers. The bottom layer is the minimum speedup achieved by one 
of the 4 strategies, and the other 3 layers are the successive improvements of the other 
strategies over the previous ones. 
[Figure 5.23: A comparison of 4 cluster allocation policies for speculative programs
(D-SPEC). Each speedup bar is layered by strategy (CIndep loops, NonSPEC loops,
Critical loops, All loops); one panel per benchmark; Y-axis: speedup w.r.t. SEQ;
X-axis: cluster size (master + slaves), 3 to 6.]






[Figure 5.24: A comparison of 4 cluster allocation policies for the nested speculation in
SYN_5, with NS-SPEC and ND-SPEC panels layered by strategy (CIndep loops, NonSPEC
loops, Critical loops). Y-axis: speedup w.r.t. SEQ; X-axis: cluster size (master + slaves),
3 to 6.]
From Figure 5.23, the All strategy performs best, followed by Critical, NonSPEC,
and CIndep. This is clearly seen when the prioritised loops are given at least 5 TPUs,
which allows all the threads to be successfully sparked. In SYN_6, NonSPEC performs
a little better than Critical. Both strategies prioritise loops A and G, which are the main
contributors to the total execution time. However, they have different views when
choosing between loops C and L, i.e. NonSPEC favours C while Critical favours L. If
both loops in question have little impact on the total execution time, it seems that the
non-speculative one should be favoured because its results are at least guaranteed to be
useful. For all the benchmarks, in general, the biggest improvement step comes from
the Critical strategy. In SYN_2, the cluster allocation by CIndep is identical to the one
by Critical; therefore, it appears in this figure (and Figure 5.21) that CIndep already
gives good speedup.
The speedup of nested speculation in SYN_5 is shown in the first graph of Figure
5.24. When there are loops from multiple nest levels being executed at the same time,
dual-path speculation holds back the performance improvement as it allows even more 
loops to compete for available TPUs. Although the total number of TPUs (24) seems 
to be sufficient, during the run-time, some cluster or fork operations are executed a 
little too early or too late in relation to the availability of resources. This happens es-
pecially when there are several multithreadable loops active simultaneously. The loop 
on the secondary path is often the last one attempting to form a cluster and thus it is 
most likely to fail. If the secondary path is correct, then the benefit from it having been 
partially executed is outweighed by the remaining execution being sequential. On the 
other hand, single-path speculation delays the less probable path until the mispredic-
tion recovery takes place, but it can gain more from the multithreaded execution on 
this path. It appears that in both single- and dual-path speculation, the Critical strategy 
performs slightly better than All as a result of fewer simultaneously-executed loops 
competing for the TPUs. 
In the second graph of Figure 5.24, either the inner or the outer branch in the nest is
speculated. IF(2) is always speculated because it is not handled in parallel with IF(1)
and IF(3). The prefix "I-" or "O-" indicates whether the inner or the outer branch
is chosen. Having learnt that the performance of CIndep is only, at best, as good
as NonSPEC's, it is excluded from the experiment. When there are sufficient TPUs to
execute several loops simultaneously, dual-path speculation yields higher speedup than
single-path speculation. Furthermore, outer-branch speculation yields higher speedup
than inner-branch speculation. This can be explained by the fact that the branch in
the deeper nest level is less likely to be encountered and is therefore less profitable to
speculate.
5.2.2.2 Control-Independent Execution 
In addition to speculating on control-dependent paths of a branch, another thread can
be launched to execute the code after those paths converge. Although the code is
to be executed regardless of the branch's direction, the thread as well as its children
and slaves are speculative because this program fragment may be on either path of
another branch. SYN_2, SYN_4, and SYN_6 are used for studying the impact of control-
independent execution. In SYN_6, loop G is control-dependent on the outer branch but
independent of the inner branches. The transformation is adapted from the one that
generates single-path speculative programs.

Figure 5.25 illustrates the four major sections in the control-independent (CI) and
the control-dependent, speculative (CSP) execution. These relate to the points where a
new thread is forked (PAR_PREDICT) and initialised (CH_PROLOGUE), and where the flow
of control is transferred from parent to child threads (PAR_VERIFY and CH_RESOLVE).

[Figure 5.25: An outline of control-independent execution in SYN_2, contrasting the
Control Independence (CI) and Control-Dependent Speculation (CSP) structures around
the points PAR_PREDICT, CH_PROLOGUE, PAR_VERIFY, and CH_RESOLVE.]
The order in which CI and CSP threads are forked may also affect the program perfor-
mance since the first thread can compete for the TPUs before the other. 
In Figure 5.26, two cluster allocation strategies, NonSPEC and Critical, are
employed. CI-CSP and CSP-CI indicate the order in which the CI and CSP threads are
forked. This order is not significant when a reasonably large number of TPUs is
present. Comparing the best results from the Critical scheme to the best results from
Figure 5.23, where only the CSP is performed, it appears that the CI technique further
boosts the program speedup.
[Figure 5.26: Speedup after CSP and CI are performed (total TPUs = 24). Panels: SYN_2,
SYN_4, SYN_6; series: CI-CSP and CSP-CI; Y-axis: speedup w.r.t. SEQ; X-axis: cluster
size (master + slaves), 3 to 6.]
[Figure 5.27: Speedup after CSP and CI are performed (total TPUs = 8, 12). Panels pair
the NonSPEC and Critical allocation policies, including SYN_2 with a total of 8 TPUs;
series: CI-CSP and CSP-CI; Y-axis: speedup w.r.t. SEQ; X-axis: cluster size
(master + slaves).]
The order of forking between CI and CSP threads has a visible impact when the
total number of TPUs decreases, as displayed in Figure 5.27. In SYN_2 and SYN_6,
the CI threads execute dominant loops whereas the CSP threads execute trivial loops.
Thus, it is more beneficial to allow the CI threads to acquire the TPUs first. In SYN_4,
it is the opposite case. The CI threads not only execute the trivial loops themselves, but
also fork threads to speculatively execute other trivial loops which are further ahead.
As a result, the CSP threads, which execute the most dominant loops, are hindered when
the CI-CSP policy is used.
Knowing that loops on the control-dependent paths of the branch in SYN_2 and the
inner branches in SYN_6 are the least dominant in the programs, we tested only the
CI technique but omitted the CSP one. The Critical strategy determines which loops
among those simultaneously executed should receive more TPUs. The results were
plotted against the best ones from CSP (due to the All strategy in Figure 5.23) and
CSP+CI (due to the Critical strategy in Figure 5.26), and shown in Figure 5.28. In both
benchmarks, the highest speedups were achieved by performing only control-independent
execution.
In spite of having a program structure similar to the one in SYN_4, the CI region in
SYN_3 (which dominates the first branch and post-dominates the second one) consumes
results from both control-dependent paths of the first branch. As data speculation
is not supported, this region can only be executed after the first branch is resolved,
but lookahead speculation can be performed by speculating the second branch (or
launching its CSP threads) immediately after the first one. However, the results in
Figure 5.29, when compared with those in Figure 5.23, show that earlier execution of
the lookahead paths, in this case, yields no further improvement since the loops on
these paths are very small.







[Figure 5.28: Best performance from CSP, CI, and CSP+CI. Series: CSP only, CSP + CI,
CI only; Y-axis: speedup w.r.t. SEQ; X-axis: cluster size (master + slaves), 3 to 6.]
[Figure 5.29: Results from the lookahead speculation in SYN_3, under the NonSPEC and
Critical allocation policies; series: S-SPEC and D-SPEC; Y-axis: speedup w.r.t. SEQ;
X-axis: cluster size (master + slaves), 3 to 6.]
5.2.2.3 Concurrent Speculation

Unique among the benchmarks, the branch in SYN_7 resides in the body of the outer
loop, which is multithreadable. Several instances of the branch can be speculated at
the same time as they are independent of each other. Additionally, neither the branch
nor the execution of its control-dependent paths causes premature exit from the outer
loop. In Figure 5.30, N-MULTI allows multithreading in both the outer and the inner
loops; O-MULTI allows multithreading in the outer loop; N-SPEC and O-SPEC are
their speculative versions, respectively. For the inner loop, loop chunking is performed
to create a maximum of 4 threads, each of which executes 50 iterations. The loop is
always given 4 TPUs, including the master and the slaves.
The speculation applied in the parallel loop iterations does not increase the program 
speedup over the non-speculative execution because, within each outer loop iteration, 
the parent region is very small compared to the (speculated) child region. As a result, 
there is little computation to perform in parallel with the speculative one and the pro-
gram suffers from the multithreading overheads involving both loop parallelisation and 
control speculation. 
Figure 5.32 illustrates the restructuring of the outer loop. It is unrolled 4 times,
followed by upward code motion so that the original parent regions of all the branches
in a new unrolled iteration are packed together. The branches are predicted at the start
of every outer loop iteration. There are several permutations in which the speculation
can be performed. In Figure 5.31, SPEC.1, SPEC.2, SPEC.3, and SPEC.4 speculate
the first 1, 2, 3, and 4 branches respectively, in the order that they are encountered
by the sequential flow of control. Furthermore, in order to restrict the TPU utilisation, …




















[Figure 5.30: Speedup of N-MULTI, O-MULTI, N-SPEC, and O-SPEC. Y-axis: speedup
w.r.t. SEQ; X-axis: cluster size (outer loop), 2 to 5.]

[Figure 5.31: Speedup of speculative programs after the outer loop is optimised, including
O-MULTI (opt) and SPEC.1 to SPEC.4. Y-axis: speedup w.r.t. SEQ; X-axis: cluster size
(outer loop), 2 to 5.]
[Figure 5.32: Loop unrolling and code motion being applied to the outer loop.]
Comparing the speedup of O-MULTI in Figure 5.30 and O-MULTI (opt) in Figure
5.31, the optimised program performs slightly worse. On the other hand, the speedup
of the speculative execution increases, particularly when at least 4 branches are
predicted. SPEC.4 and N-MULTI (in Figure 5.30) require a similar number of TPUs since
each thread executing an outer loop iteration is assisted by 4 other threads. A comparison
of speedup from both suggests that the available TPUs are still better used for the
loop parallelisation of the inner loop than for the speculation.
5.2.2.4 Path Selection 
In all the benchmarks considered so far, both paths of a branch contain identical sub-
structures or identical loops (with the same sizes but different parameters). The branch 
probability was sufficient for choosing a path to be speculatively executed in the case 
of single-path speculation. However, for an unbalanced control structure, the path with 
a much higher workload albeit lower probability could be more critical. 
Two synthetic benchmarks are displayed in Figures 5.33(a) and (b). Their control
structures are similar to that of SYN_1, but loop G in one path is replaced by loop
L. Loop parallelisation is the same as before, i.e. each loop is transformed for the
multithreaded execution of 5 threads (including a master and slaves), each of which
executes a maximum of 101 iterations. The contribution factor of each loop is also
calculated and shown in Table 5.5.

In SYN_UB_1, it is obvious that loop G is on the more probable path and dominates
the total execution time. Based on the previous observations (Section 5.2.2.1), it would
be beneficial to speculate on this path and allocate the largest number of TPUs to this
loop. In SYN_UB_2, although the ELSE path has the higher probability, the loop on
this path contributes the least to the total execution time. Figure 5.34 shows the results
[Figure 5.33: Synthetic benchmarks with unbalanced control structures: (a) SYN_UB_1
and (b) SYN_UB_2, each similar to SYN_1 but with loop G on one path replaced by
loop L.]
Table 5.5: Contribution of each loop in SYN_UB_1 and SYN_UB_2

  Benchmark    Loop { path }            %    Loop { path }     %
  SYN_UB_1     A { control indep }     28
               L { THEN }               5    G { ELSE }       67
  SYN_UB_2     A { control indep }     44
               G { THEN }              32    L { ELSE }       24
from the speculative execution in SYN_UB_2 as the following options are explored:

• Single-path speculation on the ELSE path (which contains loop L).

• Dual-path speculation (L+G).

• Single-path speculation on the THEN path (which contains loop G).
Since there can be at most 2 loops being executed at the same time, every loop 
is allocated 5 TPUs. It appears that when there are sufficient TPUs for all the loops, 
dual-path speculation yields the best speedup. In the case of single-path speculation, 
by executing loop G earlier instead of loop L, the speedup increases significantly and is 
almost as good as the result from dual-path speculation. However, when the number of
TPUs is reduced to 8, dual-path speculation gives the worst speedup. The speculation
on the THEN path (loop G) still gives better speedup than on the ELSE path (loop L),
but the difference in their performance is small.

[Figure 5.34: Speedup of the speculative execution in SYN_UB_2 for total TPU counts of
24 and 8. Series: speculation on L, L+G, and G; Y-axis: speedup w.r.t. SEQ; X-axis:
total number of TPUs.]
In this experiment, the order in which threads are forked in dual-path speculation is 
not significant. The reason is that loops L and G are both executed by multiple threads 
if there are 24 TPUs, or by single threads each if there are only 8 TPUs. 
5.2.3 Summary 
The effects of control speculation were studied using synthetic benchmarks comprising 
sets of parallelisable loops and conditional branches. First, by giving similar TPU allo- 
cation to the loops in both multithreaded non-speculative and speculative programs, the 
latter performed slightly better than the former. Because the control speculation per-
mits simultaneous execution of loops in several program fragments, the performance 
can be affected by poor resource allocation. 
Empirical studies on TPU allocation schemes were conducted. The contribution of
each individual loop to the overall program execution was computed using the
cumulative probability along the control-flow path until the loop is encountered, together
with its sequential execution time. Among concurrently-executed loops, favouring the
most dominant one (i.e. the Critical strategy) delivered the best or close-to-best speedups.
However, if none of those loops significantly contributed to the total execution time, 
better results were achieved by favouring the non-speculative one (i.e. NonSPEC strat-
egy). Allocating the same number of TPUs to every loop (i.e. All strategy) yielded the 
best speedup only if there were not too many loops competing for the TPUs. If both 
paths of a branch have significantly different workload, then the contribution factor of 
each path which had been calculated for the resource allocation purpose can be used 
to determine which one could benefit more from the speculative execution. 
Although multiple loops are executed simultaneously, they can be initialised by the
cluster allocation commands at different cycles; the loop which is most favoured by
the compile-time analysis may be the last one to acquire the TPUs. Furthermore, loops
whose execution has completed may free their TPUs a few cycles late. Since these
effects are unforeseen at compile-time, the compiler should keep the total resource
utilisation slightly below the total resource availability, in order to avoid cluster and/or
fork failures at run-time.
Performing speculation in multiple levels of nested branches at the same time could 
be detrimental to the program performance as a result of resource contention. At best, 
the speedup achieved was only as high as the result from outermost-branch speculation. 
The speculation in the deeper nest levels was less profitable due to the lower cumulative 
probabilities of the inner branches. 
Besides the speculative execution of control-dependent threads (CSP), code frag-
ments below the branches' re-convergences can be executed by control-independent 
threads (CI). The combination of both generally performed better than the use of only 
CSP. However, there are a few instances where it performed worse: when loops on the 
CI paths were too small and/or the CI threads predicted further trivial branches, while 
the CSP threads failed to allocate clusters to execute more critical program fragments. 
Finally, while multiple iterations of a loop were executed in parallel, predicting
the branches within those iterations provided improvement over the non-speculative
execution, provided there was sufficient parallel computation to offset the overheads of
both loop parallelisation and control speculation. If the loop in question is an outer loop
in a nest, the results so far suggest that the available TPUs were better used for the
multithreaded execution of the inner loop than for the control speculation.
5.3 Chapter Summary
Transformation modules for control-speculative execution were implemented using 
the SUIF framework. They support single-path, dual-path, and nested speculation. This
chapter also described the use of profile information and the pre-transformation analy-
sis. Experiments were conducted to study the effectiveness of the control speculation, 
its interaction with multithreaded loop execution, and cluster allocation strategies. The …

Chapter 6

Conclusions
6.1 Thesis Summary 
A framework has been proposed for multithreaded execution, which combines dis-
tributed program analyses, hierarchical thread management, and dynamic clustering of 
TPUs. The underlying idea is explained as follows. At compile-time, a program is re-
peatedly divided into sub-problems, each of which is specifically optimised and trans-
formed by a class of compilation techniques. The subsystems and their finer partitions 
are organised in a hierarchy with master/slave relationships between them. During 
run-time, the master threads attempt to allocate clusters of slave TPUs on which the 
slave threads execute. The dynamic cluster allocation enables the utilisation of TPUs 
to be adjusted to the sub-problems' requirements throughout the program execution. 
During the course of the research, a generic multithreaded architecture was mod-
elled and simulated, which was inspired by CMP-based architectures such as Su-
perthreaded. Enhancements were made to support hierarchy and dynamic cluster al-
location, with the TPUs being equipped with special units that manage the threads' 
parent/child and master/slave relationships. Furthermore, control speculation and reg-
ister forwarding mechanisms were introduced. The main focus of the thesis was on 
compiler-based thread manipulation and the interface between the compiler and the 
architecture. The compiler plays an important role in exposing parallelism and
orchestrating how programs will be executed on a relatively simple architecture. It re-
quires commands, inquiries, and feedback to be passed between these two layers via 
specially-proposed instructions augmented to the MIPS instruction set. In addition 
to the architectural design and simulation, a multithreaded compilation package was 
implemented as a part of the SUIF compiler system. The package is composed of 
front-end transformers for the multithreaded loop and control-speculative execution, 
and a target-machine code generator. 
With up to 16 TPUs, the multithreaded loop execution delivered speedup between 
5 and 10 when combined with loop unrolling and loop peeling. This was achieved 
by dispatching the iterations to threads one-by-one in single-level multithreading. For 
nested loops, chunks of iterations were dispatched in order to restrict per-thread ini-
tialisation, synchronisation and retirement overheads, particularly for the inner loops. 
Speedups of around 4 or 5 were achieved. However, when this was applied to single-
level multithreading, loop-level parallelism was compromised. 
In the presence of conditional branches, speculative execution of the control-dependent
paths boosted program speedup. The branches' post-dominating regions can
be included into the speculative paths, aided by code motion and multithreaded trans-
formation, in order to increase the thread granularity. Alternatively, when there is more 
parallelism to be exploited, control-independent threads can be launched to execute 
those regions. Speedup was generally further improved after both control-dependent 
speculation and control-independent execution were applied. 
As several master threads simultaneously execute program fragments (parallelisable
loops in our benchmarks), they compete for the available TPUs in order to allocate
slave clusters. Cluster allocation strategies affect program performance. When
multiple loops were executed concurrently, the best results were achieved by allotting
a greater number of TPUs to the dominant loops. Each loop's contribution to the
overall program execution time was calculated from its sequential execution profile
and the branch probability profile.
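A minimal sketch of this profile-weighted allotment follows, assuming each concurrent
loop carries a weight derived from its sequential execution profile scaled by the branch
probabilities; the proportional rounding scheme below is illustrative rather than the
exact strategy evaluated in Chapter 5.

    /* Allot total_tpus among nloops concurrent loops in proportion to
       their profiled weights (assumes nloops >= 1 and a positive sum). */
    void allot_tpus(int nloops, const double *weight, int total_tpus, int *tpus)
    {
        double sum = 0.0;
        for (int i = 0; i < nloops; i++)
            sum += weight[i];

        int given = 0;
        for (int i = 0; i < nloops; i++) {
            tpus[i] = (int)(total_tpus * weight[i] / sum);  /* proportional share */
            given += tpus[i];
        }

        int best = 0;               /* hand any leftover TPUs to the dominant loop */
        for (int i = 1; i < nloops; i++)
            if (weight[i] > weight[best])
                best = i;
        tpus[best] += total_tpus - given;
    }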
Finally, the speedups achieved suggest that the benefit of the control speculation 
augments the gains made by loop parallelisation. In the experiments, both multi-
threaded non-speculative and speculative programs were generated using the same 
compilation options, i.e. parameters regarding loop unrolling, loop peeling, or loop 
chunking were the same for both versions so that their performance could be fairly 
compared. One outstanding issue remains: given a number of TPUs unused by loop
parallelisation, should the compilation choices (including resource allocation) be
further explored to improve the loop parallelisation, rather than allocating those TPUs
to speculative execution? This is discussed in the next section, along with some
suggestions for future work.
6.2 Discussion and Future Work
6.2.1 Multithreaded Architecture 
Like other CMP-based architectures [26, 32, 58, 66, 70], our multithreaded architec-
ture is kept simple and relies heavily on the compiler to detect and exploit thread-level 
parallelism. A novel feature of the architecture is that clusters of TPUs are statically al-
located to program partitions at compile-time, and this information is communicated to 
the run-time system. The resource partitioning idea was inspired by SMT-based archi-
tectures [3, 44, 45, 71], but those SMTs rely on a complete run-time system, whereas
in our architecture commands are passed from the compiler to perform cluster alloca-
tion. The framework brings an advantage of the SMT philosophy to the CMPs,
i.e. during the execution of a program, TPUs can be used in proportion to the amount 
of thread-level parallelism in each program partition and/or the priority given to these 
partitions (e.g. a non-speculative partition may be given more TPUs than a speculative 
one, if they are executed in parallel). The other main features and restrictions in the 
architecture are discussed next. 
In practice, the location of the slave TPUs on the chip and the size of the clusters
would affect program performance differently. Large clusters are less likely to be
successfully allocated than smaller ones and their TPUs are more likely to be scat-
tered. The distance between slave TPUs, in reality, would impact the signal delays and 
communication between threads. The main data transmission is usually in the thread 
initialisation phase, where current register values are copied from the parent's regis-
ter file to the child's. This operation would be more expensive than in other systems 
[3, 45, 58, 60], where the architectures consist of processing units arranged in ring 
topology (the child thread typically starts on the next processing unit in the ring) and 
registers can be rapidly transferred between physically neighbouring units. Besides the 
register transfer during thread initialisation, our register forwarding is similar to Multi-
scalar [12]. However, unlike Multiscalar, which propagates the forwarded registers to 
all the processing units, our architecture only forwards registers from the parent to the 
child threads. Because physically neighbouring TPUs could be assigned to different 
logical clusters, and these clusters may execute program partitions that are independent 
of each other, propagating the registers to all the TPUs would result in unnecessarily 
Chapter 6. Conclusions 	 170 
high communication overhead. Issues such as VLSI implementation, hardware-level 
cluster allocation, and register communication mechanisms should be further studied. 
Hardware support for speculative execution is kept to a minimum. A speculative 
buffer was added in each TPU to keep results when a thread is in speculative mode, 
with a mechanism to retrieve the correct version of the data from its predecessor. A 
new mechanism was developed to manage the speculative buffers according to the hi-
erarchy of threads. Unlike other systems [32, 54, 65, 67], there is no hardware support
for misspeculation detection and recovery as these are managed in the software. Also, 
the memory hierarchy (including caches) and data speculation were not included in 
our framework. To integrate these features into the architecture would require further
investigation of how they would be organised around the hierarchy of threads.
Other well-studied features, such as local branch predictors, should also be included to 
complete the functionality of the multithreaded architecture. 
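The following sketch suggests how such a speculative buffer might behave on a load,
assuming a linear search of the local buffer followed by a walk along the predecessor
chain; the data layout and names are illustrative, not the simulated hardware.

    #include <stdint.h>
    #include <stddef.h>

    #define SPEC_ENTRIES 64

    struct spec_buffer {
        uint32_t addr[SPEC_ENTRIES];     /* speculatively written addresses  */
        uint32_t data[SPEC_ENTRIES];     /* the corresponding values         */
        int      used;
        struct spec_buffer *pred;        /* buffer of the predecessor thread */
    };

    /* Speculative load: prefer the youngest local version, then walk the
       predecessor chain; fall back to shared memory if no speculative
       version of the address exists. */
    uint32_t spec_load(const struct spec_buffer *b, uint32_t addr,
                       uint32_t (*mem_read)(uint32_t))
    {
        for (; b != NULL; b = b->pred)
            for (int i = b->used - 1; i >= 0; i--)
                if (b->addr[i] == addr)
                    return b->data[i];
        return mem_read(addr);
    }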
During the research, a restriction on the current architecture was noted: clusters do 
not operate entirely independently of each other. If there are several clusters simulta-
neously active, then only the one whose master TPU hosts the current head thread is 
able to reuse its slave TPUs. In the other clusters, the TPUs cannot be freed until the 
synchronisation signal is received by the master threads and passed on to the slaves. 
Our solution was to assign large chunks of program partitions to the slave threads, at 
the expense of compromising some parallelism. An alternative approach could em-
ploy multiple levels of synchronisation and allow each cluster to be operated using a 
unique signal. There are a few concerns with this idea. Firstly, a thread must distin-
guish the signal it uses as a master thread (e.g. when executing crels, it passes the 
signal to the slaves and waits until the signal returns) from the one it uses as a slave 
(e.g. when executing xstp, it waits for the signal from the master or the predecessor 
slave). As multiple threads from several clusters may commit to the shared memory 
and retire simultaneously, care must also be taken to ensure that the program semantics 
is preserved. 
Further improvements can be made to cluster formation and forking mechanisms. 
In this work, a thread checks for available TPUs and receives a success or failure value
before it proceeds to execute the next instruction. Superthreaded [68, 69, 70] employs
a different approach called delayed forking which always successfully spawns a new 
thread in spite of the delay. However, if there is no TPU available over a long pe-
riod, it may be better to let the current thread execute the code instead of waiting to 
spawn a new one. Alternatively, a time-out can be set for cluster formation and forking 
operations, with mechanisms that permit polling of available TPUs. 
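A sketch of this time-out alternative, assuming a hypothetical polling call and cycle
counter (all names here are illustrative):

    extern int  mt_poll_tpus(void);     /* non-blocking: number of free TPUs  */
    extern int  mt_try_fork(int cluster, void (*body)(int), int arg);
    extern long cycles(void);           /* current cycle count                */

    /* Try to fork within a cycle budget; if no TPU becomes available in
       time, the current thread executes the code inline instead. */
    void fork_or_inline(int cluster, void (*body)(int), int arg, long budget)
    {
        long start = cycles();
        while (cycles() - start < budget)
            if (mt_poll_tpus() > 0 && mt_try_fork(cluster, body, arg))
                return;                 /* fork succeeded within the budget  */
        body(arg);                      /* timed out: run inline             */
    }

Here the budget bounds how long the thread waits before falling back to inline
execution, trading potential parallelism for guaranteed progress.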
6.2.2 Multithreaded Compiler 
Like other CMP-based systems such as Hydra [31, 32, 53, 54], STAMPede [65, 66, 67],
or Superthreaded [68, 69, 70, 79], our multithreaded compiler was developed specif-
ically for the proposed architecture: it is aware of the execution models supported
and the restrictions in the architecture when generating multithreaded programs. In 
these compilers, program transformations are performed at the front-end, where high-
level program structures such as loops can be easily recognised. Loop parallelisation 
is a main feature presented in all the compilers. Multiple loop iterations are typically 
executed by multiple threads in parallel in a predecessor/successor style. However, 
threads in Hydra, STAMPede, and Superthreaded commit to the shared memory and 
retire in a sequential order. As each thread commits, it also updates the current state 
of the processor. In our system, slave threads (which also execute loop iterations in a 
predecessor/successor style) commit and update the cluster's state, maintained by the 
master thread, instead of the processor's state. This hierarchical thread management 
would fit in well with the multithreaded execution in nested loops, provided that mul-
tiple clusters can be operated independently, as discussed in the previous section. With 
the current solution to dispatch big chunks of inner-loop iterations to slave threads, 
nested multithreading only performed as well as one-level multithreading in the outer-
most loops. 
In addition to the loop parallelisation, our compiler also generates code for coarse-
grained control speculation. Like STAMPede's approach, the compiler inserts instruc-
tions to mark speculative regions in the program and threads can switch between non-
speculative and speculative execution. However, unlike STAMPede, misspeculation
detection and recovery actions are all managed by software routines. Furthermore,
because data speculation is not supported, threads are forced to wait until the
data they depend upon is made available. Our multithreaded compiler therefore has
to detect and work around data dependencies between threads (the benchmarks
used in the research have only a few data dependencies).
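The code below sketches what the compiler-inserted markers might look like at the C
level, with hypothetical spec_begin/spec_commit/spec_abort routines standing in for
the inserted instructions, and an explicit wait standing in for the stall on an
unresolved data dependence.

    extern void spec_begin(void);      /* marker: enter speculative mode      */
    extern void spec_commit(void);     /* prediction correct: keep results    */
    extern void spec_abort(void);      /* misspeculation: software recovery   */
    extern void wait_for(volatile int *flag);

    volatile int x_ready;              /* set by the producer thread          */
    int x;

    void speculative_child(int predicted, int actual)
    {
        spec_begin();
        wait_for(&x_ready);            /* no data speculation: stall on producer */
        int y = x + 1;                 /* speculative work on the chosen path    */
        if (predicted == actual)
            spec_commit();
        else
            spec_abort();              /* discard; recovery handled in software  */
        (void)y;
    }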
The multithreaded code generator is a modification of a MIPS code generator. Several
back-end analyses were therefore not specifically targeted at our multithreaded archi-
tecture. Because register usage is known only after register allocation is performed at
the back-end, register forwarding has to be handled separately from the multithreaded
transformation at the front-end. User specifications are needed to indicate which pro-
gram partitions will use register communication instead of the default memory com-
munication. This differs from Multiscalar's approach [72], in which both task selection
and register communication are performed at the assembly-code level. Consequently,
the Multiscalar compiler only needs to perform control-flow and data-flow analysis
once. Our compiler, on the other hand, has to perform the analysis in both the
front-end and the back-end compilation.
Besides the features mentioned above, there are still restrictions in the current com-
piler, as discussed next. The SUIF compiler system [84] was well suited for the im-
plementation of the compiler prototype because various specialised functions can be 
implemented separately and communicate via internal program representation and an-
notations. Basic structures, such as procedures, loops, and conditional branches, are 
easily recognised from the internal format. Hence, distributed program analysis and 
compilation would be well supported. More functions are still required to make the 
system fully automatic. At present, the compiler does not perform inter-procedural 
analysis and it relies on built-in SUIF functions to detect data dependencies (ones that 
are not detected by the compiler can be specified via a graphical user interface tool). 
However, in some areas such as embedded applications, where the compilation is per-
formed only once, the current semi-automatic system can still be useful provided that 
the code is specifically compiled and fully optimised to achieve high performance. 
For the multithreaded loop execution, heuristics or analytic models should be de-
veloped to estimate performance trends and determine a point where the benefit from 
loop-level parallelism peaks or reaches a plateau. After this point, further use of the 
control speculation could be worthwhile. Another use of heuristics or analytic mod-
els is in cluster allocation scheduling. As the optimal requirements for individual program
partitions can be estimated, schedules for the availability and the utilisation of TPU 
resources can then be determined for the entire application. One concern is the level
at which analytic models are used in the compiler. Costs estimated from the high-level
and the low-level internal representation might be very different, such that, after the 
transformation based on the front-end analysis, the final output programs may behave 
differently than expected. Feedback loops would be needed for the compilation pro- 
cess, as described in Chapter 3. Methods for mapping costs or results from the analysis 
in multiple levels should also be implemented. 
Another limitation of this work is the use of profile information: the programs being
profiled, analysed, and transformed always used the same input data and settings. More
insight could be gained by using a variety of benchmarks that execute different input 
data sets. Further work is required for the simulator to accommodate larger and more 
realistic benchmarks, and very importantly, the integration of operating system and 
library calls into the simulator package. 
6.2.3 Applications 
One type of program that can be tackled by the compiler and executed on the multi-
threaded architecture is the loop-based one. There can be data dependences between loop
iterations, which are detected by the SUIF compiler. In the compilation
flow described in Chapter 4, basic loop optimisations such as loop normalisation, loop 
skewing, and loop reversal were also performed by the SUIF compiler prior to the mul-
tithreaded loop parallelisation, to rearrange the bounds and data dependency patterns in the
loops. In the case of control speculation, the control structures need to be quite large. 
There should be a substantial amount of computation in both the parent threads (which
execute the code before conditional branches) and the child threads (which execute
the speculated paths of the branches). Furthermore, the amount of data dependence
between program partitions should be kept to a minimum. If possible, the program
partitions should be independent of each other.
Numeric programs such as the Livermore kernels [81, 82] used in the research would
fit well in the framework, and examples related to the synthetic benchmarks used in
Chapter 5 are scientific calculators. In particular, signal and image processing for
multimedia applications [25, 74], which are loop intensive (and the loop iterations are
largely independent), would benefit from this approach. In these applications, a num-
ber of threads could also perform several computations in parallel, communicating
with each other only occasionally.
6.3 Conclusion 
The main contribution of this thesis is the experimental evaluation of hierarchical mul-
tithreading in a framework consisting of a simulated multithreaded architecture and a 
compiler. Within the framework, fragments of a program can be specifically optimised 
and executed by clusters of thread processing units (TPUs) as orchestrated by compile-
time analysis. A multithreaded processor architecture has been proposed, which sup-
ports dynamic clustering of the TPUs and speculative execution. The transformation 
from sequential programs into multithreaded ones is performed in the compiler. The
focus was on multithreaded loop and control-speculative execution. Based on the ex-
perimental results, significant program speedups were achieved by loop parallelisation, 
and could be further improved by control speculation. 
Appendix A 
Examples of Control-Flow Graphs 
The experiments in Chapter 5 used synthetic benchmarks which are well-structured. 
However, in real applications, some control-flow graphs would need pre-processing 
before they can be transformed for multithreaded control-speculative execution. The 
benchmarks used for demonstration in this appendix are heapsort [1] and 164.gzip [83].
A.1 heapsort
This program performs heap-sorting on 2000 elements of an array in ascending
order. A control-flow graph (CFG) of the sorting function is shown in Figure A.1. This
benchmark was not used in the experiments in Chapter 5 because its control structures
are too fine, i.e. the child region of each branch except IF(3) contains an average of 3
instructions (counted at the C-code level). The child region of IF(3) contains a small
sequential loop which checks and swaps between elements of the array.
Figure A.1: CFG of the heap-sorting function
Assuming that code granularity is not an issue, the possible options for applying
control speculation to this CFG are:
Speculation on IF(3) 
In this case, the parent region of IF(3) may include all the predecessor nodes 
according to the forward control flow, which are node B(1) and structures IF(1) 
and IF(2). The child region is the loop containing IF(4) and IF(5). Since there 
is a BREAK in the parent region, the child thread must be aborted if the parent 
executes this instruction (see Section 5.1.1, the handling of control-flow breaks). 
Speculation on IF(1) and IF(2)
IF(2) is an incomplete sub-structure of IF(1), as their child regions over-
lap. A code replication technique can be applied so that each region has a
separate copy of node B(4) and structure IF(3), as shown in Figure A.2. There
is a BREAK in the child region of IF(2), which is not allowed because the child
thread cannot exit the loop before the speculation is resolved. Thus, it is replaced
by setting a flag cont to FALSE; a sketch of this rewrite appears after this list. This
flag is set to TRUE when a new iteration starts, and is evaluated either after the
speculation is resolved (at node B(13)) or along with the loop continuation test at
the end of that iteration (at node B(11)). The structures IF(1) and IF(3) form a
series of branches, while IF(2) is nested inside IF(1).
The structure IF(3) may also be replicated, as shown in Figure A.3, in order to
increase the size of the speculative threads. In this new CFG, IF(3) and its replica
can be speculated in conjunction with IF(1) and IF(2).
The nested-speculation template was described in Section 5.1.3, and tested in
the SYN_5 and SYN_6 benchmarks. Furthermore, the handling of a series of
speculative structures was tested in the SYN_3 and SYN_4 benchmarks.
Speculation on IF(4) and IF(5)
IF(4) and IF(5) form a series of branches which are embedded in a sequential
loop. IF(4) has only one child region, which has low confidence. This path may
be speculated if it contains a large program partition, as discussed in Section
5.2.2.4. Otherwise, B(6), which is a post-dominating, control-independent node,
may be executed instead. Control-independent execution was described and
evaluated in Section 5.2.2.2. On the other hand, single-path speculation can be
applied to IF(5).
However, if the loop is parallelisable, the TPU resources may be better
dedicated to the loop parallelisation than to the control speculation. This was
discussed in Section 5.2.2.3.
Speculation on IF(3) and IF(4)-IF(5) 
There is a loop boundary between the structure IF(3) and the series of IF(4) 
and IF(5). Thus, IF(3) and IF(4)-IF(5) do not fit into our nested-speculation 
template. An even more complicated case arises if the loop is parallelised and con-
current speculation (Section 5.2.2.3) is performed. However, these cases have not
been studied in the thesis; in such cases, the compiler only speculates on IF(3).
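As noted above, the following C sketch shows the BREAK-to-flag rewrite applied to
the child region of IF(2); the loop body and variable names are illustrative.

    /* The BREAK in the child region becomes an assignment to cont, so a
       speculative thread never exits the loop before the speculation is
       resolved; cont is re-tested with the loop continuation test. */
    void child_region(int n, const int *key)
    {
        int cont = 1;
        for (int i = 0; i < n && cont; i++) {
            cont = 1;                    /* set to TRUE as each iteration starts */
            if (key[i] == 0)             /* was: break;                          */
                cont = 0;
            if (cont) {
                /* remainder of the iteration runs only if not "broken" */
            }
        }
    }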








Figure A.2: CFG of the heap-sorting function after code replication (1)
Figure A.3: CFG of the heap-sorting function after code replication (2)
A.2 164.gzip
This benchmark was taken from the SPEC CPU2000 suite [83]. It performs data com-
pression and decompression. The size of the benchmark is too big for both the multi-
threaded compiler and the simulator. Therefore, it was only profiled using the test data
set, and the control-flow graph was constructed manually. Figures A.4 and A.6 show
the control-flow graphs of procedures deflate_fast and inflate_block, respectively.
1. Function deflate_fast
The control-flow graph of deflate_fast consists of a series of branches, some
of which are nested. To handle function calls, the compiler would inline
them, if possible, since it does not perform inter-procedural analysis. The func-
tions cannot contain instructions that raise exceptions or cause program exit;
otherwise, an exception-free version of them is generated. In the example,
longest_match contains assertion tests that cause the program to exit if the as-
sertions fail. Suppose that IF(1) is not speculated but is included in the parent
region of IF(3). Instead of exiting the program immediately, an invalid value
is returned from longest_match (see Figure A.5). This value is checked at the
caller's site after the call instruction; the child thread is then aborted and the par-
ent thread exits the program.
Similarly, ct_tally in the child region of IF(3) contains assertion tests. The com-
piler would normally avoid speculation on IF(3). Nevertheless, supposing that the
speculation is performed, a possible handling of ct_tally is shown in Figure A.5.
An extra condition may be added after the function call to check the validity of
the returned value. However, the program exit will be delayed until the specula-
tion is resolved, i.e. at node B(8).
Figure A.4: CFG of procedure deflate_fast
Figure A.5: Handling of function calls inside procedure deflate_fast
2. Function inflate_block
In Figure A.6, suppose that functions inflate_dynamic, inflate_stored, and in-
flate_fixed do not contain exception instructions, program exits, or calls to any
other functions. The branches IF(1) and IF(2) can therefore be speculated. How-
ever, while being speculative, a child thread that executes inflate_dynamic or
inflate_stored cannot exit the speculation scope, i.e. function inflate_block. The
SUIF compiler generates a temporary variable tmp to store a value returned from
inflate_dynamic, inflate_stored, or inflate_fixed. After the outermost branch IF(1)
is resolved, tmp can then be returned from inflate_block. Figure A.7 shows the
branch structures, which were arranged in a completely-nested form.
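A C sketch of this return rewrite is shown below. The function names follow 164.gzip,
but the bodies and the block_type parameter are illustrative; the point is that every
early RETURN is redirected into tmp and the single real return is deferred until IF(1)
is resolved.

    extern int inflate_dynamic(void);
    extern int inflate_stored(void);
    extern int inflate_fixed(void);

    int inflate_block(int block_type)        /* block_type is illustrative */
    {
        int tmp;                             /* compiler-generated temporary */
        if (block_type == 2)                 /* IF(1) */
            tmp = inflate_dynamic();
        else if (block_type == 0)            /* IF(2) */
            tmp = inflate_stored();
        else if (block_type == 1)
            tmp = inflate_fixed();
        else
            tmp = 2;                         /* bad block type */
        return tmp;                          /* single exit, after IF(1) resolves */
    }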




Figure A.6: CFG of procedure inflate_block
Figure A.7: Completely-nested branches in procedure inflate_block
Appendix B 
Global Thread Control Unit 
The Global Thread Control Unit (GTCU) is a central unit in the multithreaded pro-
cessor architecture. It maintains threads' information including relative order of all 
the active threads and a pointer to the head thread. The GTCU is accessed quite of-
ten during multithreaded execution, and even more frequently during (multithreaded) 
speculative execution (see Sections 3.2.1 and 3.4.1.3 for details). The following ex-
periment compares programs' performance when the access delay is set to 0, 1, and 2 
time units (all the results in Chapters 4 and 5 are based on zero delay). The benchmarks 
used are Livermore kernels from Chapter 4, which were transformed as follows: 
Non-speculative programs. 
The benchmarks were transformed using Loop-Transformer-1. Results from 
these programs, with the GTCU's access delay being set to zero, were shown in 
Figure 4.11. 
Speculative programs. 
The benchmarks were transformed using Loop-Transformer-2. These are the 
ones mentioned on page 84.
Both versions of the multithreaded programs generate a lot of threads during run-
time as the TPUs are reusable, and the speculative programs access the GTCU more 
often than the non-speculative ones due to the speculative load/store operations. Re-
sults are shown in Figures B.1 and B.2. Speedups of the non-speculative programs when
the GTCU delay is 0, 1, and 2 time units are very close. For the speculative programs, 
the difference in speedup is slightly more pronounced (when the delay is set to zero, 
the speculative programs give very similar speedup to the non-speculative ones). 
It appears that the delay in the GTCU has only a slight impact on program per-
formance because most accesses to the GTCU are for reading the thread sequence and
this unit is managed in a multiple-readers/single-writer style. To avoid contention and
long access delays in a centralised unit (for the multiple reads), a table could be imple-
mented with each entry being a copy of the thread sequence exclusively used by each
TPU, thus allowing multiple read operations to proceed in parallel. A write operation,
on the other hand, will lock the whole table as it needs to broadcast an update in one
entry to all the others. This should not cause a bottleneck in the system since each thread
updates the GTCU table only twice, i.e. when it is forked and when it retires, although
it may read from the GTCU table several times during its execution while performing
speculative load/store operations.
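A sketch of such a replicated table follows, assuming one copy of the thread sequence
per TPU and an ordinary lock serialising the (rare) writers; the pthread mutex merely
models the hardware serialisation, and all names are illustrative.

    #include <pthread.h>

    #define NTPUS      16
    #define MAXTHREADS 64

    struct gtcu {
        int seq[NTPUS][MAXTHREADS];   /* one copy of the thread order per TPU */
        pthread_mutex_t wlock;        /* serialises the (rare) writers        */
    };

    void gtcu_init(struct gtcu *g)
    {
        pthread_mutex_init(&g->wlock, NULL);
    }

    /* Read: each TPU consults only its own copy, so reads proceed in parallel. */
    int gtcu_read(const struct gtcu *g, int tpu, int pos)
    {
        return g->seq[tpu][pos];
    }

    /* Write (on fork or retire): lock the table and broadcast the update
       to every TPU's copy. Each thread writes only twice in its lifetime. */
    void gtcu_write(struct gtcu *g, int pos, int thread_id)
    {
        pthread_mutex_lock(&g->wlock);
        for (int t = 0; t < NTPUS; t++)
            g->seq[t][pos] = thread_id;
        pthread_mutex_unlock(&g->wlock);
    }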
Figure B.1: Speedup of non-speculative programs (with GTCU delay = 0, 1, and 2 time
units). Y-axis: speedup w.r.t. the sequential program; X-axis: number of slave TPUs.
Figure B.2: Speedup of speculative programs (with GTCU delay = 0, 1, and 2 time
units). Y-axis: speedup w.r.t. the sequential program; X-axis: number of slave TPUs.
Bibliography 
Alfred Aburto. FFP site. Naval Ocean Systems Center.
ftp://ftp.nosc.mil/pub/aburto.  
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Tech-
niques, and Tools. Addison-Wesley Publishing, 1986. 
Haitham Akkary. A Dynamic Multithreading Processor. PhD thesis, Electrical 
and Computer Engineering, Portland State University, 1998. 
Andrew W. Appel and Maia Ginsburg. Modern compiler implementation in C. 
Cambridge University Press, 1998. 
Christoffer Arvidsson. A multi-threaded architecture platform. Master's thesis, 
Division of Informatics, University of Edinburgh, September 1999. 
D. K. Arvind and R. Rangaswami. Asynchronous multithreaded processor cores 
for system level integration. In Proceedings of the Conference on Intellectual 
Property (IP' 99), pages 105-110, Edinburgh, November 1999. Miller Freeman. 
Jean Bacon. Concurrent Systems: An Integrated Approach to Operating Systems, 
Database, and Distributed Systems, chapter 9: Low-level mechanisms for process 
synchronization, pages 210-248. Addison Wesley, 1993. 
Thomas Ball and James R. Larus. Branch prediction for free. In Conference on 
Programming Language Design and Implementation, pages 300-313, June 1993. 






David Bernstein, Doron Cohen, and Hugo Krawczyk. Code duplication: an assist 
for global instruction scheduling. In Proceedings of the 24th Annual International 
Symposium on Microarchitecture (MICRO-24), pages 103-113, November 1991. 
William Blume, Rudolf Eigenmann, Keith Faigin, John Grout, Jay Hoeflinger, 
David Padua, Paul Petersen, Bill Pottenger, Lawrence Rauchwerger, Peng Tu, 
and Stephen Weatherford. Effective automatic parallelization with polaris. Inter-
national Journal of Parallel Programming, May 1995. 
Scott E. Breach, T. N. Vijaykumar, and Gurindar S. Sohi. The anatomy of the 
register file in a multiscalar processor. In Proceedings of the 27th Annual Inter-
national Symposium on Microarchitecture (MICRO-27), November 1994. 
Brad Calder, Dirk Grunwald, and Amitabh Srivastava. The predictability of li-
braries. Technical Report WRL Technical Note TN-50, Western Research Labo-
ratory, July 1995. 
Ben Catanzaro. Multiprocessor System Architectures, chapter 8: Multithread Pro-
gramming Facilities for Implementing Multithreaded Applications, pages 229-
268. SunSoft Press, 1994. 
Robert S. Chappell, Jared Stark, Sangwook P. Kim, Steven K. Reinhardt, and 
Yale N. Patt. Simultaneous Subordinate Microthreading (SSMT). In Proceedings 
of the 26th International Conference on Computer Architecture, pages 186-195, 
May 1999. 
William Y. Chen, Scott A. Mahlke, Nancy J. Warter, Sadun Anik, and Wen-
mei W. Hwu. Profile-assisted instruction scheduling. International Journal for 
Parallel Programming, 22(2):151-181, April 1994.
Gautham K. Dorai and Donald Yeung. Transparent threads: Resource sharing in 
SMT processors for high single-thread performance. In Proceedings of the 11th
International Conference on Parallel Architectures and Compilation Techniques 
(PACT 2002), Virginia, September 2002. 
Pradeep K. Dubey, Kevin O'Brien, Kathryn M. O'Brien, and Charles Bar-
ton. Single-program speculative multithreading (SPSM) architecture: Compiler-
assisted fine-grained multithreading. In Proceedings of the IFIP WG 10.3 Work-
ing Conference on Parallel Architectures and Compilation Techniques (PACT 
95), pages 109-121, June 1995. 
John R. Ellis. Bulldog: a compiler for VLIW architectures. MIT Press, 1986. 
Hesham El-Rewini and Hesham H. Ali. How many times should a loop be un-
rolled? In Proceedings of the 7th Intl. Conf. Parallel and Distributed Computing 
Systems, Las Vegas, October 1994. 
Keith I. Farkas, Norman P. Jouppi, and Paul Chow. Register file design consid-
erations in dynamically scheduled processors. Technical Report WRL Research 
Report 95/10, Western Research Laboratory, November 1995. 
Erin Farquhar and Philip Bruce. The MIPS Programmer's Handbook. Morgan 
Kaufmann, 1994. 
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew
Chang, Yevgeny Gurevich, and Whay S. Lee. The M-machine multicomputer. 
Technical Report AIM-1532, Laboratory of Computer Science, MIT, 1995. 
Joseph A. Fisher and Stefan M. Freudenberger. Predicting conditional branch 
directions from previous runs of a program. In Proceedings of the 5th Annual 
International Conference on Architectural Support for Programming Languages 
and Operating Systems (ASPLOS-V), pages 85-95, MA, USA, October 1992. 
James D. Foley, Andires van Dam, Steven K. Feiner, John F. Hughes, and Richard 
L. Phillips. Introduction to Computer Graphics. Addison-Wesley Publishing 
Company, 1994. 
Manoj Franklin. The Multiscalar Architecture. PhD thesis, University of 
Wisconsin-Madison, 1993. 
Manoj Franklin and Gurindar S. Sohi. ARB: A hardware mechanism for dynamic
reordering of memory references. IEEE Transactions on Computers, May 1996. 
Eric Freudenthal and Alan Gottlieb. Process coordination with fetch-and-
increment. In Proceedings of the 4th International Conference on Architec-
tural Support for Programming Languages and Operating Systems (ASPLOS-IV), 
pages 260-268, Santa Clara, CA, April 1991. 
Milind Girkar, Mohammad R. Haghighat, Paul Grey, Hideki Saito, Nicholas J. 
Stavrakos, and Constantine D. Polychronopoulos. Illinois-Intel multithreading 
library: Multithreading support for intel architecture based multiprocessor sys-
tems. Intel Technology Journal, Q1 Issue, February 1998.
Sridhar Gopal, T. N. Vijaykumar, James E. Smith, and Gurindar S. Sohi. Spec-
ulative versioning cache. In Proceedings of the 4th International Symposium on 
High-Performance Computer Architecture (HPCA 4), February 1998. 
Lance Hammond, Mark Willey, and Kunle Olukotun. Data speculation support 
for a chip multiprocessor. In Proceedings of the 8th ACM Conference on Archi-
tectural Support for Programming Languages and Operating Systems (ASPLOS-
VIII), San Jose, CA, October 1998.
Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, 
Michael Chen, and Kunle Olukotun. The Stanford Hydra CMP. IEEE MICRO,
pages 71-84, March-April 2000. 
Timothy Heil and J. E. Smith. Selective dual path execution. Internal techni-
cal report, Department of Electrical and Computer Engineering, University of 
Wisconsin-Madison, November 1996.
Japheth E. Hossell. Compiling Java byte code for multithreaded architecture.
Master's thesis, Division of Informatics, University of Edinburgh, September 
1999. 
Quinn Jacobson, Steve Bennett, Nikhil Sharma, and James E. Smith. Control 
flow speculation in multiscalar processors. In Proceedings of the 3rd Inter- 
national Symposium on High Performance Computer Architecture (HPCA 3), 
Texas, February 1997. 
Quinn Jacobson, Eric Rotenberg, and Jim Smith. Path-based next trace predic-
tion. In Proceedings of the 30th Annual International Symposium on Microarchi-
tecture (MICRO-30), December 1997. 
Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice Hall, 1992. 
Venkata Krishnan and Josep Torrellas. A clustered approach to multithreaded 
processors. In Proceedings of the International Parallel Processing Symposium, 
pages 627-634, March 1998. 
Venkata Krishnan and Josep Torrellas. A chip-multiprocessor architecture with 
speculative multithreading. IEEE Transactions on Computer, Special Issue on 
Multithreaded Architecture, September 1999. 
Jee Myeong Ku. The design of an efficient and portable interface between a paral-
lelizing compiler and its target machine. Master's thesis, Electrical Engineering, 
University of Illinois at Urbana-Champaign, 1995. 
Bil Lewis and Daniel J. Berg. Threads Primer: A Guide to Multithreaded Pro-
gramming, chapter 5: Synchronisation, pages 61-86. SunSoft Press, 1996. 
Mikko H. Lipasti and John Paul Shen. Exceeding the dataflow limit via value 
prediction. In Proceedings of the 29th Annual International Symposium on Mi-
croarchitecture (MICRO-29), December 1996. 
Mikko H. Lipasti. Value locality and speculative execution. PhD thesis, De-
partment of Electrical and Computer Engineering, Carnegie Mellon University, 
1997. 
Jack Lee-jay Lo. Exploiting thread-level parallelism on Simultaneous Multi-
threaded processors. PhD thesis, University of Washington, 1998.
Pedro Marcuello, Antonio Gonzalez, and Jordi Tubella. Speculative multi-
threaded processors. In Proceedings of the ACM International Conference on 
Supercomputing (ICS 98), Australia, 1998. 
Pedro Marcuello and Antonio Gonzalez. Control and data dependence specu-
lation in multithreaded processors. In Proceedings of the Workshop on Multi-
threaded Execution, Architecture, and Compilation (MTEAC 98), 1998. 
Pedro Marcuello and Antonio Gonzalez. Clustered speculative multithreaded 
processors. In Proceedings of the ACM International Conference on Supercom-
puting (ICS 99), Greece, 1999. 
Pedro Marcuello, Jordi Tubella, and Antonio Gonzalez. Value prediction for 
speculative multithreaded architectures. In Proceedings of the 32nd Annual In-
ternational Symposium on Microarchitecture (MICRO-32), November 1999. 
Pedro Marcuello and Antonio Gonzalez. A quantitative assessment of thread-
level speculation techniques. In Proceedings of the 1st International Parallel and 
Distributed Processing Symposium, Mexico, May 2000. 
John M. Mellor-Crummey and Michael L. Scott. Scalable reader-writer syn-
chronization for shared-memory multiprocessors. In Proceedings of the 3rd 
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 
pages 106-113, Williamsburg, Virginia, 1991. 
Frank Mueller and David B. Whalley. Avoiding unconditional jumps by code 
replication. In ACM SIGPLAN Conference on Programming Language Design 
and Implementation, pages 322-330, June 1992. 
Tarun Nakra, Rajiv Gupta, and Mary Lou Soffa. Global context-based value pre-
diction. In Proceedings of the 5th International Symposium on High-Performance 
Computer Architecture (HPCA 5), Florida, January 1999. 
Kunle Olukotun, Lance Hammond, and Mark Willey. Improving the performance
of speculatively parallel applications on the Hydra CMP. In Proceedings of the
ACM International Conference on Supercomputing (ICS 99), Rhodes, Greece, 
June 1999. 
Jeffrey Oplinger, David Heine, Shih-Wei Liao, Basem A. Nayfeh, Monica S. 
Lam, and Kunle Olukotun. Software and hardware for exploiting speculative 
parallelism with a multiprocessor. Technical Report CSL-TR-97-715, Stanford 
University Computer Systems Laboratory, February 1997. 
Alastair Patrick. A co-design environment for java programs targetting asyn-
chronous processors. Bachelor's thesis, Division of Informatics, University of 
Edinburgh, June 1999. 
William Pugh. A practical algorithm for exact array dependence analysis. Com-
munications of the ACM, 35(8):102-114, August 1992.
Martin Rinard. Effective fine-grain synchronization for automatically parallelized 
programs using optimistic synchronization primitives. In Proceedings of the 6th 
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 
pages 112-123, Las Vegas, NV, 1997. 
Eric Rotenberg, Quinn Jacobson, Yiannakis Sazeides, and Jim Smith. Trace pro-
cessors. In Proceedings of 30th Annual International Symposium on Microarchi-
tecture (MICRO-30), December 1997. 
Eric Rotenberg and Jim Smith. Control independence in trace processors. In 
Proceedings of the 32nd Annual International Symposium on Microarchitecture
(MICRO-32), November 1999. 
Eric Rotenberg. Trace Processors: Exploiting Hierarchy and Speculation. PhD 
thesis, University of Wisconsin-Madison, 1999. 
Radu Rugina and Martin Rinard. Design-driven compilation. In Proceedings of 
the International Conference on Compiler Construction, Genova, Italy, 2001. 
Hideki Saito, Nicholas Stavrakos, Steven Carroll, Constantine Polychronopou-
los, and Alex Nicolau. The design of the PROMIS compiler. In Lecture Notes in
Computer Science 1575. Springer Verlag, March 1999. 
Yiannakis Sazeides and James E. Smith. The predictability of data values. In 
Proceedings of the 30th Annual International Symposium on Microarchitecture 
(MICRO-30), December 1997. 
James E. Smith. A study of branch prediction strategies. In Proceedings of the 4th 
Annual International Symposium on Computer Architecture, volume SIGARCH 
Newsletter 9(3), pages 135-148, May 1981. 
J. Gregory Steffan, Christopher B. Colohan, and Todd C. Mowry. Architecture 
support for thread-level data speculation. Technical Report CMU-CS-97-188, 
School of Computer Science, Carnegie Mellon University, November 1997. 
J. Gregory Steffan and Todd C. Mowry. The potential for using thread-level 
data speculation to facilitate automatic parallelization. In Proceedings of the 4th 
International Symposium on High-Performance Computer Architecture (HPCA 
4), Las Vegas, February 1998. 
J. Gregory Steffan, Christopher B. Colohan, and Todd C. Mowry. Extend-
ing cache coherence to support thread-level data speculation on a single chip 
and beyond. Technical Report CMU-CS-98-171, School of Computer Science, 
Carnegie Mellon University, December 1998. 
Jenn-Yuan Tsai. Superthreading: Integrating Compilation Technology and Pro-
cessor Architecture for Cost-Effective Concurrent Multithreading. PhD thesis, 
Department of Computer Science, University of Illinois at Urbana-Champaign, 
1998. 
Jenn-Yuan Tsai, Zhenzhen Jiang, Zhiyuan Li, David J. Lilja, Xin Wang, Pen-
Chung Yew, Bixia Zheng, and Stephen J. Schwinn. Superthreading: Integrating 
compilation technology and processor architecture for cost-effective concurrent 
multithreading. Journal of Information Science and Engineering, Special Is-
sue on Compiler Techniques for High-Performance Computing, 14(1):205-222, 
March 1998. 
Jenn-Yuan Tsai, Jian Huang, Christoffer Amlo, David J. Lilja, and Pen-Chung
Yew. The superthreaded processor architecture. IEEE Transactions on Comput-
ers, Special Issue on Multithreaded Architectures and Systems, 48(9), September 
1999. 
Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous multi-
threading: maximizing on-chip parallelism. In Proceedings of the 22nd Annual 
International Symposium on Computer Architecture, Italy, June 1995. 
T. N. Vijaykumar. Compiling for the Multiscalar Architecture. PhD thesis, Uni-
versity of Wisconsin-Madison, 1998. 
Steven Wallace, Brad Calder, and Dean Tullsen. Threaded multiple path execution.
In Proceedings of the 25th International Symposium on Computer Architecture, 
June 1998. 
Alan Watt and Mark Watt. Advanced Animation and Rendering Techniques: The-
ory and Practice. ACM Press, 1992. 
Michael Wolfe. Optimizing Supercompilers for Supercomputers. Pitman Pub-
lishing, 1989. 
Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-
Wesley Publishing Company, 1996. 
Mark N. Yankelevsky and Constantine D. Polychronopoulos. α-Coral: A multi-
grain, multithreading processor architecture. In Proceedings of the International 
Conference on Supercomputing (ICS 01), pages 358-367, 2001. 
Mohamed M. Zahran and Manoj Franklin. A feasibility study of hierarchical 
multithreading. In International Parallel and Distributed Processing Symposium 
(IPDPS), April 2002. 
B. Zheng, J. Y. Tsai, B. Y. Zang, T. Chen, B. Huang, J. H. Li, Y. H. Ding, J. Liang, 
Y. Zhen, P. C. Yew, and C. Q. Zhu. Designing the Agassiz compiler for concur-
rent multithreaded architectures. In Workshop on Languages and Compilers for
Parallel Computing, 1999.
Craig B. Zilles, Joel S. Emer, and Gurindar S. Sohi. The use of multithreading for 
exception handling. In Proceedings of the 32nd Annual International Symposium 
on Microarchitecture (MICRO-32), 1999. 
The ASCI LFK benchmark code.
http://www.llnl.gov/asci_benchmarks/asci/limited/lfk/asci_lfk.html.
Livermore loops coded in C. http://www.netlib.org/benchmark/livermorec.
Standard Performance Evaluation Corporation (SPEC). 
http://www.specbench.org/.  
The SUIF 1.x compiler system. http://suif.stanford.edu/suif/suif1/.
TCOVSUIF 2.0. http://brass.cs.berkeley.edu/tcovsuif2.html.  
