Confidence Based Out-of-Order Renaming for Speculatively Multithreaded Processors by Malik, Kshitiz et al.
June 2006 UILU-ENG-06-2208
CRHC-06-04
CONFIDENCE BASED OUT-OF-ORDER 
RENAMING FOR SPECULATIVELY 
MULTITHREADED PROCESSORS
Kshitiz Malik, Kevin M. Woley, Samuel S. Stone, 
Mayank Agarwal, Vikram Dhar, Matthew I. Frank
Coordinated Science Laboratory
1308 West Main Street, Urbana, IL 61801
University of Illinois at Urbana-Champaign
REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704-0188
1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE
June 2006
3. REPORT TYPE AND DATES COVERED
4. TITLE AND SUBTITLE
Confidence Based Out-of-Order Renaming for Speculatively Multithreaded 
Processors
5. FUNDING NUMBERS
NSF CCR-0429711 
NSF EIA-0224453 
AMD Corp.
6. AUTHOR(S)
Kshitiz Malik, Kevin M. Woley, Samuel S. Stone, Mayank Agarwal,
Vikram Dhar, Matthew I. Frank
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 
Coordinated Science Laboratory 
University of Illinois
1308 W. Main St.
Urbana, IL 61801
8. PERFORMING ORGANIZATION
REPORT NUMBER
UILU-ENG-06-2208
(CRHC-06-04)
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)
National Science Foundation 
4201 Wilson Blvd.
Arlington, VA 22203
10. SPONSORING/MONITORING 
AGENCY REPORT NUMBER
11. SUPPLEMENTARY NOTES
12a. DISTRIBUTION/AVAILABILITY STATEMENT 
Approved for public release; distribution unlimited.
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words)
Speculatively multithreaded processors find parallelism by speculatively fetching and renaming dynamic flows of instructions from
(perhaps) widely separated parts of the program flow graph. These processors must handle inter-thread register dependences. The
approach followed here is to dynamically identify the consumers of inter-flow register mappings that will be (but have not yet been)
produced in a logically earlier thread and then to dynamically awaken those consumers as soon as the mapping they are waiting for is
produced.
The main contribution of this paper is the design and evaluation of the inter-thread register renaming and synchronization
mechanisms for a speculatively multithreaded processor. Our scheme is realizable, aggressive, and flexible and achieves speedups
within about 10% of those achievable by an oracle. We find that inter-thread synchronization mechanisms can and must use path
confidence information so that the producers of register mappings can awaken consumer instructions at just the right time, neither so
early that the producer is on a mispredicted branch path, nor so late as to add latency to the critical path. We also demonstrate that a
relatively straightforward predictor can find the set of consumer instructions that must wait without being overly conservative.
14. SUBJECT TERMS
computer systems organization, parallel processing, speculative multithreading
15. NUMBER OF PAGES 
14
16. PRICE CODE
17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED
18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED
19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED
20. LIMITATION OF ABSTRACT: UL
NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. 239-18 
298-102
Confidence Based Out-of-Order Renaming for Speculatively
Multithreaded Processors*
Kshitiz Malik, Kevin M. Woley, Samuel S. Stone, 
Mayank Agarwal, Vikram Dhar, Matthew I. Frank 
Electrical and Computer Engineering 
University of Illinois, Urbana-Champaign
June 9, 2006
Abstract
Speculatively multithreaded processors find parallelism by
speculatively fetching and renaming dynamic flows of instructions
from (perhaps) widely separated parts of the program flow graph.
These processors must handle inter-thread register dependences.
The approach followed in this paper is to dynamically identify the
consumers of inter-flow register mappings that will be (but have not
yet been) produced in a logically earlier thread and then to
dynamically awaken those consumers as soon as the mapping they are
waiting for is produced.
The main contribution of this paper is the design and evaluation of
the inter-thread register renaming and synchronization mechanisms
for a speculatively multithreaded processor that does not need
compiler support. Our scheme is realizable, aggressive, and flexible
and achieves speedups within about 10% of those achievable by an
oracle. We find that inter-thread synchronization mechanisms can and
must use path confidence information so that the producers of
register mappings can awaken consumer instructions at just the right
time, neither so early that the producer is on a mispredicted branch
path, nor so late as to add latency to the critical path. We also
demonstrate that a relatively straightforward predictor can find the
set of consumer instructions that must wait without being overly
conservative.
1 Introduction
Superscalar processors are profitable because they issue and execute
instructions out of order, but retire instructions in order, thus
providing high performance on a programming model that is easy to
reason about. However, superscalars fetch and rename instructions in
program order and cancel all instructions after a branch mispredict,
even when those instructions have done useful work. Speculatively
Multithreaded processors [13, 14, 6, 7, 4, 1, 11, 8, 9] are a
promising alternative because they retire instructions in order,
like a superscalar, but also fetch and rename instructions out of
order. This allows them to find instruction level parallelism across
widely separated regions of the program, including past multiple
branch mispredictions.
* University of Illinois, Center for Reliable and High Performance
Computing, Technical Report Number UILU-ENG-06-2208.
Although fetching out-of-order improves fetch efficiency, it
introduces complications because of dataflow that crosses thread
boundaries. In particular, there may be register and memory value
traffic between the in-order instructions that have not yet been
fetched and the out-of-order instructions that have been fetched
early by the speculative multithreading mechanism. That is,
producers and consumers of inter-thread value traffic sometimes need
to be synchronized. It turns out (empirically) that more often than
not in speculatively multithreaded systems, a program's value
traffic occurs either within a thread or from a producer instruction
that was fetched, renamed and executed before a particular thread
was even spawned [13]. Thus, many of the instructions that are
fetched out-of-order can be renamed and executed in the order they
are fetched, while others need to wait for portions of the
intermediate path to be fetched. Moreover, when inter-thread
register communication does need to be synchronized, the
synchronization is often on the critical path.
This paper investigates how aggressively (speculatively) the
processor should synchronize producer threads with the consumer
instructions that depend on them. Similar to many earlier
speculatively multithreaded systems [7, 1, 8], the system we
evaluate, which we call the PolyFlow Speculative Multithreaded
Processor, is based on a simultaneously multithreaded core, rather
than, for example, a chip multiprocessor. This means that in our
system we have the option of allowing producer instructions to
forward data to consumer instructions speculatively, rather than
waiting for all the branches before the producer instruction to
complete. We find that producers must forward data speculatively to
consumers to avoid adding latency to critical paths, but that the
speculative forwarding of data must be balanced with path confidence
information to prevent producers that are along mispredicted branch
paths from signaling consumers too early and thus causing the
consumers to also be canceled. The PolyFlow system is a Dynamic
Speculative Multithreaded system, in that it takes an unmodified
binary and converts it dynamically into threads. Thus, our register
synchronization mechanism must handle inter-thread register
dependences without any help from the compiler.
The contributions of this work include, first, the design and
analysis of an effective and realizable out-of-order register
renaming mechanism for dynamic speculative multithreading. Our
scheme supports out-of-order spawning and reconnect of flows,
communicates register information only point-to-point from a
predecessor to its immediate successor flow (rather than globally),
requires no selective reexecution, and yet achieves speedups within
about 10% of those achievable with an oracle that "knows" the
optimal time for producers to awaken consumers.
Second, we demonstrate that out-of-order register renaming requires
path confidence information to balance synchronization between
register producers and consumers. We find that register producers
must aggressively and speculatively release/awaken consumers to
avoid delays waiting for branches to complete and retire. The
producers must also, however, take branch confidence information
into account to avoid awakening consumers too early, with data from
along a mispredicted path. We present the design of an appropriate
confidence predictor and show how to use it to drive the register
renamer.
Third, we demonstrate that inter-thread register consumers can be
identified dynamically. Our system does not use a compiler to
identify consumer instructions, but rather derives this information
at runtime. While finding the last dynamic instance of a producer is
a difficult problem [7] that requires backward compiler analysis
[15], we find that the set of architectural registers that need to
be synchronized is highly predictable. Since identifying these
registers' consumer instructions (and their transitive dependents)
is a forward analysis, it can be performed at runtime in the front
end of the processor. We find that a predictor with 1 bit per
architectural register per spawn point is sufficient to allow us to
identify the consumer instructions that must wait for a producer
instruction from another flow.
The rest of this paper is structured as follows. The next section
gives a motivating example to explain, in rough terms, the problem
we are trying to solve and our solution, and discusses the
relationship of our work to earlier work in speculative
multithreading. Section 3 gives a more detailed description of our
design. In Section 4 we demonstrate that our aggressive renaming
scheme provides speedups within about 10% of those achievable with
an oracular confidence predictor driving the time at which producers
release consumers. Section 5 concludes.
2 Background
In this section, we describe the register synchronization problem
faced by all speculatively multithreaded systems. In our domain of
SMT-based speculative multithreading with a shared physical register
file, the register synchronization problem boils down to ensuring
that all instructions get source physical mappings produced by their
corresponding producer instructions, instead of some predecessor of
the producer. Hence, we treat register synchronization as the
problem of performing out-of-order renaming.
A speculatively multithreaded system needs to address three issues
regarding inter-thread register synchronization. First, the specific
instructions that consume inter-thread register data need to be
identified, and be forced to wait (i.e., their renaming must be
delayed) until the corresponding producer instruction has been
renamed. We call such instructions waitsFor instructions. Second,
waitsFor instructions need to be released, some time after the
producer instruction has been renamed. Release here refers to the
process of providing a waitsFor instruction the correct source
register mappings. If consumers are released too early (before the
correct producer instruction has been renamed) then the thread
containing the consumer instruction will need to be canceled. On the
other hand, if the consumers are released too late, the
synchronization cost will add to the critical path and slow down
program execution. Finally, if consumer instructions are identified
speculatively or released speculatively, there needs to be a
validation process that makes sure that each consumer got matched
with the correct producer instruction.
2.1 Example
Figure 1 shows an oversimplified pedagogical example intended to
clarify these three issues. In this example thread 0 has spawned
thread 1, and then thread 0 has entered a simple if-then-else
statement starting at instruction A1. The first issue is to identify
waitsFor instructions in thread 1, i.e., instructions which should
wait before they go through the rename process. In this case,
instruction D1 does not need to wait, because it can immediately
retrieve its local register alias table mapping for register Rx
(created by instruction Z1, and copied during the spawn process).
Likewise instruction D5 does not need to wait, because register Rz
has already been renamed locally by instruction D4. Instruction D2,
on the other hand, must wait for a mapping to be produced by either
instruction B1 or C1. Instruction D3 may or may not need to wait for
the mapping produced by instruction B2, depending on the direction
of branch A1. In Section 3 we demonstrate how to use the renamer to
derive a non-conservative set of instructions that should wait.
Figure 1: How aggressively thread 0 should wake up instructions D2
and D3 in thread 1 depends on the predictability of the branch at
A1. If the branch is highly predictable, then thread 1's
instructions should be awakened as soon as thread 0 predicts through
the branch, and reaches the potential reconnection point, marked
"reconnect," correctly renaming the intervening instructions in
either block B or C. If the branch at A1 is hard to predict then
thread 0 should wait until A1 is actually resolved before releasing
instructions D2 and D3 in thread 1.
The second question, of when to release the waiting instructions,
depends on whether the branch at A1 is highly predictable or not. If
the branch is highly predictable then as soon as the renamer
speculatively renames the predicted block (B or C), instructions D2
and D3 should be released. If, on the other hand, the branch at A1
is not predictable, then it will be better to make D2 and D3 wait
until the branch at A1 is resolved so as not to force a flush in
thread 1 because of a branch mispredict in thread 0. This represents
a fundamental trade-off between the benefit of multithreading
systems, that branch mispredictions in different threads can be
handled independently, and the cost of adding synchronization to the
critical path. We demonstrate in Sections 3 and 4 that the sweet
spot in this trade-off is releasing instructions aggressively and
speculatively, while using a path confidence predictor to gate
release.
Another part of releasing waitsFor instructions is identifying the
correct producer instruction from the dynamic instruction stream in
the predecessor thread. Doing this without any help from the
compiler is hard. Identifying the program counter of the last-writer
is not enough since the same PC may appear multiple times. We make
the observation that when the predecessor thread has renamed all its
instructions, which will be a short while after it reaches its final
PC, all last-writers in the predecessor have been renamed. We refer
to this specific release point in the dynamic instruction stream as
the potential reconnection point, or simply reconnection point.
The reconnection point may be a good time to send the physical
register mappings of unsafe registers (those for which the
predecessor holds a more recent mapping than the one copied to the
successor at spawn; see Section 3.2.1) to the successor thread, all
at once and in bulk, so that its waitsFor instructions can be
released. We call this scheme release-on-arrival (RoA). Note that
this release is speculative, since the predecessor may arrive at the
reconnection point on a bad path. Another option is to perform
release when the predecessor thread has retired all of its
instructions, which we call release-on-retirement (RoR). We show in
Section 4.1 that performing RoA while taking branch confidence into
account performs much better than RoR, in spite of RoA performing
release speculatively.
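The choice among these release policies can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not the PolyFlow hardware: the names Predecessor, Policy and should_release, the 0.9 threshold, and the idea of summarizing path confidence as a single number in [0, 1] are all hypothetical (the actual predictor is the subject of Section 3.3).

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    ROR = "release-on-retirement"
    ROA = "release-on-arrival"
    ROA_CONFIDENCE = "release-on-arrival gated by path confidence"


@dataclass
class Predecessor:
    # Hypothetical summary of the predecessor flow's state.
    reached_reconnection_point: bool  # all of its instructions are renamed
    all_instructions_retired: bool
    path_confidence: float            # assumed estimate in [0, 1]
    unresolved_branches: int          # branches renamed but not yet executed


def should_release(pred: Predecessor, policy: Policy,
                   threshold: float = 0.9) -> bool:
    """Return True if the successor's waitsFor instructions may be released."""
    if policy is Policy.ROR:
        return pred.all_instructions_retired
    if not pred.reached_reconnection_point:
        return False
    if policy is Policy.ROA:
        return True
    # Confidence-gated RoA: release speculatively when the path is likely
    # correct; otherwise wait until the outstanding branches have executed.
    return pred.path_confidence >= threshold or pred.unresolved_branches == 0
```

In this sketch a low-confidence predecessor delays release only until its outstanding branches execute, not until they retire, which mirrors the trade-off described above.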
The final question, of how to validate whether we identified all
waitsFor instructions (and did not miss any), is addressed in
Section 3. We augment each thread's register alias table with a set
of 4 bits per architectural register to track these inter-thread
register dependences.
2.2 Related Work
2.2.1 Dynamic Speculative Multithreaded Systems
Most dynamic speculatively multithreaded processors [1, 10] don't
perform explicit register synchronization. Instead, they make the
following assumption: inter-thread register dependences don't occur,
and even when they do, the values of architectural registers are not
changed by any predecessor instruction between the spawn and the
reconnection point.
When the above assumption is false, these processors use replay to
selectively re-execute waitsFor instructions and their transitive
dependents in the successor thread, which usually happens when
waitsFor instructions retire. Thus, previous dynamic speculative
multithreaded processors have proposed a combination of value
prediction and re-execution to solve the out-of-order renaming
problem. These systems have produced unique mechanisms that exploit
value prediction to get around the renaming problem. However, we
show in Section 4.1 that resolving inter-thread register dependences
when consumer waitsFor instructions retire hurts the performance of
a speculative multithreaded processor. Also, our goal is to develop
relatively simple hardware to handle out-of-order renaming. For
these reasons, value prediction backed by replay is not an
acceptable solution for us.
Skipper [2] is a dynamic out-of-order fetch processor that fetches
from a control-independent point in the program when it reaches a
hard-to-predict branch. The authors develop an efficient mechanism
to rename instructions out-of-order, and we use some of their
insights to identify and delay waitsFor instructions. However,
Skipper is intended as an add-on to a superscalar processor that
mostly fetches in-order, whereas our design is intended for a
speculative multithreaded machine where out-of-order fetch is the
norm, rather than the exception. For example, our renaming mechanism
needs to support out-of-order spawn and reconnect, which Skipper did
not investigate. Also note that Skipper performs release when the
hard-to-predict branch has resolved, whereas we use a more
aggressive confidence based release mechanism. Finally, Skipper's
dependence checking mechanism was conservative, in that it sometimes
signaled a violation even though a true dependence was not violated,
which worked well for their domain. We have found it important to
support a non-conservative checking mechanism that flags a
misprediction only if a violation has actually occurred.
2.2.2 Compiler-based Speculative Multithreaded 
Systems
While our goal is to implement register synchronization in a dynamic
system without compiler support, we leverage a number of insights
from previous work in compiler-based speculative multithreaded
processors which place explicit register send and receive
instructions in the binary.
The Multiscalar [13] used a compiler to both identify thread
boundaries, and to move producer instructions up in the code and
consumers (and their dependents) down in the code, as much as
possible [15]. The Implicitly Multi-Threaded (IMT) processor [8]
extended the Multiscalar to run on top of an aggressive SMT
processor with speculation. In this system producer instructions
could be fetched and executed along a bad path after a branch
mispredict, and thus sometimes release consumers too early, causing
some extra flushes. Our work builds on the IMT in three ways. First,
we support out-of-order spawn and reconnect of threads. Second, we
substantially reduce the probability of flushes caused by producers
releasing consumers early by gating the releases with branch
confidence information. Third, we have developed a complete system
for identifying the consumer instructions dynamically. Since our
system identifies release points dynamically, rather than relying on
the backward analysis that a compiler could give us, our release
points are somewhat conservative compared to those used on the IMT.
This last point is discussed further in Section 4.
The Stampede speculative multithreaded system extended the
Multiscalar compilation techniques to allow speculative movement of
release instructions above statically predicted branches [16], which
has interesting parallels to our proposal of performing speculative
release based on dynamic branch confidence. Dynamic confidence
predictions may give more accurate predictions than the profile
based confidence predictor used on Stampede. Further, our system
only needs to wait for low confidence branches to execute, rather
than retire, which also removes a considerable amount of latency
from critical paths. It is difficult to compare our system to a TLS
system implemented on a CMP, since our register synchronization is
much more tightly coupled.
3 Design
In this section we describe the out-of-order register renaming and
synchronization mechanism as implemented in PolyFlow. We start with
a broad description of PolyFlow's microarchitecture in Section 3.1.
We then provide a detailed view of the renaming mechanism in
Section 3.2. We conclude with a discussion of our path confidence
predictor in Section 3.3.
3.1 PolyFlow Microarchitecture
Figure 2 depicts PolyFlow's microarchitecture. PolyFlow is a
speculative multithreaded processor that dynamically spawns flows
from a single-threaded application. The overall organization is
similar to an SMT machine.
The following sections define a flow, and detail the 
actions which occur over the lifetime of a flow and their 
relationship to architectural components.
3.1.1 Flow State
A flow is a microarchitectural entity that represents some portion
of program execution. Each flow, like a thread, has a current
program counter (PC), a rename table (mapping architectural register
numbers to physical register numbers), and a reorder-buffer (ROB) of
instructions that have been fetched (and possibly completed) but not
yet retired. Similar to the POWER5 [12], PolyFlow has a linked-list
ROB that is dynamically shared among flows.
However, unlike a thread, a flow's state has a start pc (the program
counter of the flow's first instruction) and a pointer to the
successor flow (the next flow in program order). To enable
out-of-order renaming, each flow also has a Diverter Queue, and four
additional bits per RAT entry, which are described in Section 3.2.
The PolyFlow renamer extends the conventional renaming mechanism to
work on an out-of-order instruction stream. Each flow uses its own
register alias table (RAT) to rename its in-order instruction
stream. On dispatch, all flows insert instructions into a shared,
non-blocking scheduler, as well as into their own ROB. Thus,
intra-flow instruction dispatch in PolyFlow is similar to per-thread
dispatch in an SMT processor. As with all other speculatively
multithreaded systems, PolyFlow's speculative and out-of-order
memory system must allow flows to communicate and synchronize memory
operands in addition to register operands. Details of our approach
to building this memory system are outside the scope (and size
constraints) of this paper. Further enhancements to renaming in
PolyFlow are discussed in Section 3.2.
Figure 2: PolyFlow Microarchitecture
3.1.2 Flow Lifetime
The Flow Control Unit (FCU) manages the initiation 
(spawning), completion (reconnection), and removing 
(squashing) of individual flows.
Flow Spawn. We call the process in which one flow creates a new flow
a spawn. As instructions are fetched, the FCU identifies
control-independent points that could be spawned off. When the FCU
decides that a spawn would be profitable (using some heuristics), it
spawns a new flow on an available SMT context. While some
Speculative Multithreaded processors permit only the youngest thread
(in program order) to spawn new threads, PolyFlow's out-of-order
spawn policy allows any flow to spawn new flows. At spawn, the new
flow's start and current PCs are set to the PC of the instruction
that the FCU wishes to spawn to. An empty reorder-buffer queue is
allocated for the successor. The renaming actions that happen at
spawn are described in Section 3.2.
Note that flows are chained together in a sequence representing
sequential program order, shown in Figure 3. Each flow has a
successor flow, which is immediately next to it in program order.
When a flow spawns another flow, it inserts the new flow into the
sequence between itself and its former successor, as shown in
Figure 3 (b).
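The chaining of flows can be pictured as a singly linked list in program order. The following is a minimal sketch of that bookkeeping only; the class name Flow, the helper names spawn and program_order, and the PC values are hypothetical and do not appear in the PolyFlow design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Flow:
    start_pc: int
    successor: Optional["Flow"] = None  # next flow in program order


def spawn(parent: Flow, target_pc: int) -> Flow:
    """Create a flow starting at target_pc and link it right after parent."""
    child = Flow(start_pc=target_pc, successor=parent.successor)
    parent.successor = child
    return child


def program_order(head: Flow) -> List[int]:
    """Walk the successor chain and return the start PCs in program order."""
    pcs = []
    flow: Optional[Flow] = head
    while flow is not None:
        pcs.append(flow.start_pc)
        flow = flow.successor
    return pcs


# Flow 0 first spawns Flow 2, then spawns Flow 1, which lands between them.
flow0 = Flow(start_pc=0x100)
flow2 = spawn(flow0, 0x300)
flow1 = spawn(flow0, 0x200)
```

Because any flow may spawn, the new flow is always linked between the spawning flow and its former successor, so the chain stays in program order even when spawns happen out of order.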
Reconnection. When a flow's dynamic instruction stream reaches the
start pc of its successor flow, as in Figure 3 (c), the predecessor
flow can reconnect with its successor. At reconnection, the register
data flow between the predecessor and successor flows is evaluated
(Section 3.2) for correctness. If dependence violations are
discovered, reconnection fails and the successor flow is squashed.
Otherwise, reconnection succeeds. Successful reconnection
effectively combines the two flows into one logical flow, associated
with a single set of flow state depicted in Figure 3 (d). The
resulting flow has one PC, start PC, and pointer to its successor
flow. The reorder buffer of the combined flow is the concatenation
of the individual reorder buffers. The tail of the predecessor flow
ROB is pointed at the head of the successor flow ROB to build the
combined ROB. The rename state of the combined flow is derived in
Section 3.2. Note that reconnection can occur only once between any
two successive flows and does not need to occur in program order.
[Figure 3, panels (a) through (d). Legend: Retired Instructions,
Fetched Instructions, Unfetched Instructions.]
Figure 3: Flow Lifetime: Flows are represented as the portion of the
dynamic instruction stream. Figure (a) shows the state of two flows
prior to Flow 0's PC reaching a spawn point, represented by the dark
vertical bar. When Flow 0 reaches the spawn point, Figure (b), it
spawns Flow 1 which is between Flow 0 and Flow 2 in program order.
Figure (c) shows the state of the machine when Flow 0 reaches the
spawn PC of Flow 1. At this point, Flow 0 has no more instructions
to fetch. At the appropriate time, Flows 0 and 1 are reconnected,
shown in Figure (d). After reconnection, all of the instructions
once belonging to Flow 0 are now considered part of Flow 1, and the
appropriate resources of Flow 0 are freed. Note that only the first
flow, in program order, is allowed to retire instructions.
3.2 PolyFlow Register Renaming
The goal of our out-of-order renaming design is to support
inter-flow register communication in a speculative multithreading
system running on an SMT-like pipeline. We insisted that our
renaming design support out-of-order spawn and reconnect, because
ample previous work has indicated that it is important for
performance [4, 1, 9] (we have reconfirmed this in our system). The
renaming mechanism supports neither selective reexecution nor value
speculation; we deemed it too expensive to add either of these
features to support speculative multithreading. Finally, we are
careful in our design to make sure that all inter-flow communication
is point-to-point because we did not want to build global broadcast
buses into the rename unit.
The following sections describe the design of our 
inter-flow register renaming mechanism, beginning 
with a description of the additional state we associate 
with each flow (Section 3.2.1). We then describe how 
this additional state is updated throughout the lifetime 
of a flow (Section 3.2.2). Section 3.2.3 then describes 
our mechanism for identifying registers that are likely 
to be available to a particular flow at the time of spawn.
3.2.1 Register Renaming Flow State
Two additions to the SMT pipeline to support out-of-order renaming
are the flow rename state bits and the Diverter Queues. To detect
and avoid dependence violations we augment the RAT entry of each
architectural register in each flow with four bits. These bits are
described below and shown in Figure 4.
• Written: Indicates that the register has been written by its flow.
• Unsafe: Indicates that the flow has a more recent physical register
mapping of this architectural register than that copied to its
successor (upon the successor spawn).
• waitsFor: Registers marked with this bit are expected to be written
by a predecessor flow, i.e., the flow should not allow instructions
reading this register to execute.
• Eager: Set when the register has been read by its flow, and was not
previously marked waitsFor or Written.
The per-flow Diverter Queues (Figure 2) are used to hold those
instructions for which a correct architectural-to-physical register
mapping is currently unknown. The WaitsFor Predictor (Section 3.2.3)
predicts which architectural registers will be unavailable to a
newly spawned flow. In the spawned flow, the renamer delays the
execution of instructions that are dependent upon registers marked
waitsFor by placing them in a per-context Diverter Queue to await
renaming. All instructions not explicitly dependent upon waitsFor
registers are sent to the scheduler, including the transitive
dependents of diverted instructions.
The process of delaying the renaming of instructions that have
direct inter-flow dependences is called diversion. When the FCU has
determined that the diverted instructions in a flow can be safely
released, the instructions in the Diverter Queue of that flow are
renamed. The released instructions receive correct register mappings
from the previous flow for each waitsFor register. After an
instruction is released, it is sent to the scheduler.
3.2.2 Renaming Flow State Transitions
Flow Spawn. When a flow is spawned, the RAT of the predecessor is
copied to the RAT of the newly created successor flow and the four
flow state bitmaps are initialized. The successor's written and
eager bits are cleared for each architectural register. The waitsFor
bitmap is looked up in the WaitsFor Predictor (Section 3.2.3) and
logically OR-ed with the predecessor's current waitsFor bitmap. The
unsafe bitmap is inherited from the predecessor, and the
predecessor's unsafe bitmap is cleared.
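The spawn-time initialization above can be written out as a short sketch. It models each bitmap as a Python set of architectural register names rather than a hardware bit vector; the names FlowRenameState and spawn_rename_state are hypothetical, and the predictor lookup is assumed to be done by the caller and passed in as predicted_waits_for.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FlowRenameState:
    rat: Dict[str, int] = field(default_factory=dict)  # arch reg -> phys reg
    written: Set[str] = field(default_factory=set)
    unsafe: Set[str] = field(default_factory=set)
    waits_for: Set[str] = field(default_factory=set)
    eager: Set[str] = field(default_factory=set)


def spawn_rename_state(pred: FlowRenameState,
                       predicted_waits_for: Set[str]) -> FlowRenameState:
    """Build the successor's rename state and update the predecessor."""
    succ = FlowRenameState(
        rat=dict(pred.rat),                             # copy predecessor RAT
        written=set(),                                  # cleared
        eager=set(),                                    # cleared
        waits_for=predicted_waits_for | pred.waits_for, # prediction OR pred
        unsafe=set(pred.unsafe),                        # inherited
    )
    pred.unsafe.clear()  # predecessor's unsafe bitmap is cleared at spawn
    return succ
```

OR-ing in the predecessor's own waitsFor set means a register the predecessor is itself still waiting on stays marked in the new flow.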
6
// Instruction Source Renaming:
instr.src_phys := RAT.phys_reg[instr.src_arch]
RAT.eager[instr.src_arch] :=
RAT.eager[instr.src_arch] or 
(not RAT.written[instr.src_arch] and 
not RAT.waitsFor[instr.src_arch])
// Instruction Destination Renaming: 
instr.dest_phys := allocate_from_freelist()
RAT.phys_reg[instr.dest_arch] := 
instr.dest_phys
RAT.waitsFor[instr.dest_arch] := false 
RAT.written[instr.dest_arch] := true 
RAT.unsafe[instr.dest_arch] := true
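$$
\begin{aligned}
\mathit{written}_{comb} &= \mathit{written}_{pred} \cup \mathit{written}_{succ} \\
\mathit{unsafe}_{comb} &= \mathit{unsafe}_{pred} \cup \mathit{unsafe}_{succ} \\
\mathit{eager}_{comb} &= \mathit{eager}_{pred} \cup (\mathit{eager}_{succ} - \mathit{written}_{pred}) \\
\mathit{waitsFor}_{comb} &= \mathit{waitsFor}_{pred} - \mathit{written}_{succ}
\end{aligned}
$$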
Figure 6: Rules for combining the information sets of 
two flows, where the pred flow precedes, and is recon­
necting to, the succ flow to form a combined flow comb.
Figure 5: Renaming State Transitions. The instruction 
instr and rename-table RAT belong to the same flow. Each
field in the RAT is updated per instruction source and desti­
nation.
Instruction Rename. The instructions within a flow 
are seen in-order by the renamer. As they are renamed, 
the flow state bitmaps associated with the flow are up­
dated to reflect each register read and update. These 
actions are summarized in Figure 5 and described below.
When an instruction is renamed, the
architectural-to-physical mapping of each source register is looked up
in the flow's RAT and the eager bit associated with
each source register is updated. The eager bit is set
if the register is not currently marked written or waitsFor.
Each architectural source's waitsFor bit is checked, 
and if any are set the instruction is steered to the 
flow's Diverter Queue instead of dispatching to the 
scheduler.
If the instruction has a destination register, a new 
physical destination register is assigned from the free 
register list and the RAT architectural-to-physical map­
ping is updated as normal. The architectural desti­
nation register is marked both written and unsafe. 
The final rename action is to clear the destination register's
waitsFor bit. This indicates that any instruction
which reads this architectural register in the
future should not be diverted, as it will receive the correct
mapping.
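The rename actions of Figure 5, together with the steering decision, can be modeled as follows. This is a minimal sketch under stated assumptions: the `Flow` class, the `rename` function, the dictionary used to represent an instruction, the `None` placeholder for a not-yet-known source mapping, and the counter standing in for the free register list are all hypothetical and chosen only for illustration.

```python
from itertools import count


class Flow:
    """Hypothetical container for one flow's RAT, bitmaps and queues."""

    def __init__(self, rat, waits_for=(), first_free=200):
        self.rat = dict(rat)                  # arch reg -> phys reg
        self.written, self.unsafe, self.eager = set(), set(), set()
        self.waits_for = set(waits_for)
        self.free_list = count(first_free)    # stand-in for the free list
        self.diverter_queue, self.scheduler = [], []


def rename(flow, srcs, dest=None):
    """Rename one instruction, in program order, following Figure 5."""
    # Sources are handled before the destination, so an instruction such
    # as R1 := R1 + 1 sees the waitsFor bit from before its own write.
    pending = [s for s in srcs if s in flow.waits_for]
    src_phys = {}
    for s in srcs:
        # Assumption: a pending source has no usable mapping yet (None).
        src_phys[s] = None if s in pending else flow.rat[s]
        if s not in flow.written and s not in flow.waits_for:
            flow.eager.add(s)
    dest_phys = None
    if dest is not None:
        dest_phys = next(flow.free_list)
        flow.rat[dest] = dest_phys
        flow.waits_for.discard(dest)
        flow.written.add(dest)
        flow.unsafe.add(dest)
    instr = {"src_phys": src_phys, "dest": dest, "dest_phys": dest_phys,
             "pending": pending}
    # Instructions with a waitsFor source are diverted, all others dispatch.
    (flow.diverter_queue if pending else flow.scheduler).append(instr)
    return instr


f2 = Flow({"R1": 101, "R2": 103}, waits_for={"R1"})
a = rename(f2, ["R2"], dest="R3")   # eager read of R2, dispatched
b = rename(f2, ["R1"], dest="R1")   # reads a waitsFor register, diverted
c = rename(f2, ["R1"])              # reads b's result, dispatched
```

In this example the third instruction is a transitive dependent of a diverted instruction, yet it is sent to the scheduler because the write by the second instruction cleared the waitsFor bit of R1.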
Violation Detection. When the program counter of 
a flow has arrived at the start PC of its successor,
and all of the predecessor instructions have been re­
named, PolyFlow may try to reconnect the two flows 
(Figure 3 (c)). When reconnection is attempted, the bits 
associated with each architectural register are used to 
determine if a read-after-write violation has occurred 
between the two flows.
Since the predecessor's instructions are earlier in
program order, we are interested only in the register
writes which occurred between the point where it
spawned the successor flow and its final instruction.
This information is held in the predecessor's set of
unsafe bits. A violation can only occur if the successor
flow read from a register before it wrote to it, which
is captured by the successor's eager bits. The intersection
of these two bit sets represents the
architectural-to-physical mappings that the successor read
incorrectly.
To check whether the two flows can be correctly re­
connected we check the condition:
$$\mathit{unsafe}_{pred} \cap \mathit{eager}_{succ} = \emptyset$$
If the intersection of the predecessor's unsafe set 
and the successor's eager set is empty, then we have 
guaranteed that the successor accessed register map­
pings for architectural registers that were either (1) not 
modified between the spawn point and the rename 
point or (2) modified by the successor before the read 
(so the source register was renamed correctly). Note 
that correctly predicting the waitsFor set is the key 
to avoiding reconnection check failure, since those reg­
isters marked waitsFor will not be read from eagerly. 
A failed reconnection results in the squashing of the 
successor flow, and all flows which follow. The pre­
decessor flow resumes fetching the instructions which 
had belonged to the successor flow, as if the successor 
had never been spawned.
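The reconnection check is a single set intersection. The sketch below models the bitmaps as Python sets; the function names `reconnection_violations` and `can_reconnect` are hypothetical and used only for illustration.

```python
def reconnection_violations(unsafe_pred, eager_succ):
    """Registers the successor read eagerly that the predecessor wrote
    after the spawn. An empty result means reconnection may proceed."""
    return set(unsafe_pred) & set(eager_succ)


def can_reconnect(unsafe_pred, eager_succ):
    """True when the intersection of the two sets is empty."""
    return not reconnection_violations(unsafe_pred, eager_succ)


# Successor never read R1 eagerly, so the check passes.
print(can_reconnect({"R1"}, set()))
# Successor eagerly read R2, which the predecessor wrote after the spawn.
print(reconnection_violations({"R2"}, {"R2", "R5"}))
```

With bitmaps in hardware the same test is a bitwise AND followed by a zero check.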
Flow Reconnection. If reconnection is successful, 
we want to combine the two flows into one. The key 
point is that the resulting flow should appear as if there 
had never been two flows. For simplicity, we eliminate 
the predecessor flow and transform the successor flow 
into the combined flow. The combined flow will have 
the current PC of the successor flow, since this PC represents
the only instructions in the two flows which remain
to be fetched. The start PC of the combined flow is the start
PC of the predecessor flow, since this is the first PC
fetched by either flow in program order.
The information from each flow's RAT also needs 
to be merged. We make use of our register state bits 
to construct the resulting RAT. Since we will use the 
RAT of the successor flow as the base for the combined
flow's RAT, we need only copy from the predecessor
those mappings which were modified in the predecessor
and not the successor. All other register mappings
are either correct because they are unmodified in
both flows (and thus identical) or have only been mod­
ified in the successor. The result is an architecturally 
correct RAT, associated with the combined flow's cur­
rent state. The set of registers updated in the succes­
sor's RAT upon reconnection is:
$$\mathit{unsafe}_{pred} - \mathit{written}_{succ} = \mathit{updated}_{comb}$$
In the case of the four bitsets, we want the result to 
appear as if the two flows have never been separate. 
The details of combining the bitsets are given in Figure 6.
The eager and waitsFor bits are discussed
below for added clarity.
The eager set of the combined flow should represent
the set of architectural registers that would have
been read by the combined flow before they were written
by the combined flow. Thus, the eager set of the
combined flow will be the union of the eager set of
the predecessor flow and those eager registers of the
successor that were not also written by the predecessor
flow.
The waitsFor set of the combined flow should represent
the set of architectural registers for which the combined
flow is still unlikely to have the correct
mapping. Since every register mapping that the predecessor
flow is aware of has been communicated to
the successor flow (the RATs have been merged), the 
set cannot be larger than the set of registers that the 
predecessor flow was waiting for. However, some of 
the predecessor's waitsFor registers may have been 
redefined by the successor flow, in which case, they 
needn't be 'waited for'. Thus, the waitsFor set of the 
combined flow is the waitsFor set of the predeces­
sor flow less the written set of the successor flow.
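The RAT merge and the combining rules of Figure 6 can be written directly as set operations. The sketch below is an illustrative model only: flows are represented as plain dictionaries of Python sets, and the function name `combine_flows` and the register numbers in the example are assumptions made for the illustration.

```python
def combine_flows(pred, succ):
    """Merge a predecessor into its successor after a successful check."""
    # Mappings to copy from the predecessor: written there since the
    # spawn (unsafe) and not redefined by the successor.
    updated = pred["unsafe"] - succ["written"]
    rat = dict(succ["rat"])                    # successor's RAT is the base
    for reg in updated:
        rat[reg] = pred["rat"][reg]
    return {
        "rat": rat,
        "written": pred["written"] | succ["written"],
        "unsafe": pred["unsafe"] | succ["unsafe"],
        "eager": pred["eager"] | (succ["eager"] - pred["written"]),
        "waits_for": pred["waits_for"] - succ["written"],
    }


pred = {
    "rat": {"R1": 109, "R2": 110, "R3": 103, "R4": 104, "R5": 105, "R6": 106},
    "written": {"R1", "R2", "R6"},   # R6 was written before the spawn
    "unsafe": {"R1", "R2"},          # written after the spawn
    "eager": {"R3"},
    "waits_for": {"R4", "R5"},
}
succ = {
    "rat": {"R1": 120, "R2": 102, "R3": 103, "R4": 104, "R5": 121, "R6": 106},
    "written": {"R1", "R5"},
    "unsafe": {"R1", "R5"},
    "eager": {"R3", "R6"},
    "waits_for": {"R2"},
}
comb = combine_flows(pred, succ)
print(comb["rat"]["R2"], sorted(comb["eager"]), sorted(comb["waits_for"]))
```

In the example, R6 drops out of the combined eager set because the predecessor wrote it before the spawn, so the combined flow would not have read it before writing it. R5 drops out of the combined waitsFor set because the successor redefined it.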
The final step in reconnection is to process the Di­
verter Queues of the two flows. Each instruction in 
the predecessor's Diverter Queue will remain diverted. 
However, the instructions in the successor's Diverter 
Queue can potentially be renamed since the predeces­
sor may have generated the register mapping on which 
they are dependent. For each instruction in the suc­
cessor Diverter Queue, we look up the source archi­
tectural registers we were waiting for in the RAT. If 
those entries are still marked waitsFor in the RAT 
then the instruction remains diverted in the combined
flow. Otherwise, the instruction now has the source 
architectural-to-physical register mapping it was wait­
ing for and is thus renamed and dispatched to the 
scheduler.
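The processing of the successor's Diverter Queue can be sketched as a single pass over the queue. The instruction representation (a dictionary with `src_phys` and `pending` fields) and the function name `release_diverted` are hypothetical. The sketch also assumes that the RAT passed in holds the mapping each diverted instruction should observe, and it does not model the case in which the successor itself redefined a register after the diverted read.

```python
def release_diverted(diverter_queue, rat, waits_for):
    """Split the successor's Diverter Queue after a reconnection.

    Each instruction is a dict with 'src_phys' (arch -> phys or None)
    and 'pending' (the source registers it was waiting for).
    Returns (released, still_diverted).
    """
    released, still_diverted = [], []
    for instr in diverter_queue:
        if any(reg in waits_for for reg in instr["pending"]):
            still_diverted.append(instr)       # mapping still unknown
            continue
        src_phys = dict(instr["src_phys"])
        for reg in instr["pending"]:
            src_phys[reg] = rat[reg]           # fill in the awaited mapping
        released.append({**instr, "src_phys": src_phys, "pending": []})
    return released, still_diverted


queue = [
    {"op": "add", "src_phys": {"R1": None, "R2": 103}, "pending": ["R1"]},
    {"op": "sub", "src_phys": {"R4": None}, "pending": ["R4"]},
]
released, still = release_diverted(
    queue, rat={"R1": 109, "R2": 103, "R4": 104}, waits_for={"R4"})
print(len(released), len(still))
```

Released instructions would then be dispatched to the scheduler, while the remaining ones stay in the combined flow's Diverter Queue.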
3.2.3 WaitsFor Prediction
A dynamic out-of-order renaming system has to pre­
dict the set of registers that the predecessor flow will 
write between the spawn point and the reconnection to 
the successor flow. We call these registers the
spawn's waitsFor set. When a flow is spawned, it receives
a predicted waitsFor set from a hardware structure
called the WaitsFor Predictor.
A "spawn" in a Speculative Multithreaded proces­
sor can be uniquely identified by a pair of program 
counters: the PC of the instruction which triggered 
the spawn and the PC of the first instruction of the 
spawned flow. The WaitsFor Predictor has a table with 
one entry per spawn PC pair. The prediction returned 
for a spawn is a bitmask which represents, for each ar­
chitectural register, whether or not the register is ex­
pected to be written by the predecessor as it executes 
instructions between the spawner PC and spawned PC.
The key insight behind the WaitsFor Predictor is 
that, while the actual control flow path taken by the 
predecessor flow in going from the spawn point to 
the reconnect point may change dynamically, the set 
of registers that are written does not vary significantly.
The WaitsFor Predictor needs to be highly accurate:
false positives cause instructions in the successor
thread to wait unnecessarily, while false negatives
may cause dependence violations.
To decide which registers should be marked 
waitsFor, the predictor keeps a counter per architec­
tural register for each spawn PC pair. As the prede­
cessor thread fetches instructions, the unsafe bitmap 
(described earlier) keeps track of registers that should 
have been marked waitsFor in its successor. When 
the predecessor arrives at the reconnection point, the 
predictor is trained using the unsafe bitset as the true 
waitsFor set.
If a particular register was waitsFor, its corresponding
counter is incremented by a static upcount
value. Otherwise, the register's counter is decremented
by downcount. The next prediction compares
counter values against a threshold, to decide if a register
should be marked waitsFor. We present results
of three different WaitsFor Predictor configurations in
Section 4.2. It was observed that a predictor using
1-bit counters per register (with an upcount, downcount,
and threshold of 1) gives the best performance overall.
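The counter scheme above can be modeled as a small table keyed by the spawn PC pair. The sketch below is illustrative only. The class name `WaitsForPredictor`, the saturation of counters at `max_count`, the "greater than or equal" threshold comparison, and the empty prediction for a spawn pair that has never been trained are all assumptions, since the text does not specify them.

```python
class WaitsForPredictor:
    """Per spawn-pair table of per-register up/down counters (a sketch)."""

    def __init__(self, num_regs=32, upcount=1, downcount=1,
                 threshold=1, max_count=1):
        self.num_regs = num_regs
        self.upcount = upcount
        self.downcount = downcount
        self.threshold = threshold
        self.max_count = max_count      # 1 for 1-bit, 7 for 3-bit counters
        self.table = {}                 # (spawner_pc, spawned_pc) -> counters

    def predict(self, spawner_pc, spawned_pc):
        """Return the set of registers predicted to be waitsFor."""
        counters = self.table.get((spawner_pc, spawned_pc))
        if counters is None:
            return set()                # assumption: no history, no waiting
        return {r for r, c in enumerate(counters) if c >= self.threshold}

    def train(self, spawner_pc, spawned_pc, unsafe):
        """Train with the predecessor's unsafe set as the true waitsFor set."""
        counters = self.table.setdefault(
            (spawner_pc, spawned_pc), [0] * self.num_regs)
        for r in range(self.num_regs):
            if r in unsafe:
                counters[r] = min(self.max_count, counters[r] + self.upcount)
            else:
                counters[r] = max(0, counters[r] - self.downcount)


one_bit = WaitsForPredictor(num_regs=4)
one_bit.train(0x100, 0x200, {1})
one_bit.train(0x100, 0x200, {2})
print(one_bit.predict(0x100, 0x200))
```

With an upcount, downcount, and threshold of 1 and a counter maximum of 1, the prediction is simply the unsafe set observed at the previous instance of the spawn.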
3.2.4 Example
Figure 7 illustrates the use of rename bitsets as threads
get created, reconnect or get squashed, using machine
snapshots at five different times, A to E. In the beginning
F0 is the only active thread, with architectural
register R1 mapped to physical register 101, and R2
mapped to 103. Upon fetching and decoding past instruction
S2, it spawns the flow F2. When the instruction
S2 is renamed, F2 inherits its RAT from F0. In particular,
the mappings for R1 and R2 are copied over
from F0 to F2's RAT. In addition, as Section 3.2.2
describes, F0's unsafe vector is copied over to F2 and
cleared (although the unsafe vector of F2 is not functionally
useful, since F2 does not have a successor flow). The eager
set of F2 is initialized as being empty. Register R1
is predicted as waitsFor, and any instruction having R1
as one of its sources is diverted pending dataflow from
a predecessor.
Figure 7: An example illustrating the use of rename bitsets. Initially there is a single flow F0, which spawns flow F2
and then F1. Flow F1 reconnects successfully to F2. Subsequently F0 attempts to reconnect to the combined flow, signaling a
misreconnect. The progress in time is shown along the x-axis.
At time period B, flow F0 writes to register R1, marking
it as unsafe: if successor flow F2 were to use the
mapping of R1 copied over from F0 at spawn, it would
have used the incorrect mapping. However, since F2
had R1 marked as waitsFor, it correctly diverted any
instructions which read from R1 (note that the instruction
reading R1, marked in black, has in fact been diverted).
Next F0 spawns F1, which is predicted to have
an empty waitsFor set. Initialization of bit vectors proceeds
as for the previous spawn. In particular, the unsafe
set is copied over from F0 to F1, F0's unsafe set
is cleared, and F1's unsafe set gets R1. Note that register
R1 is unsafe for flow F1 since this flow has a more
recent mapping for R1 than the flow's successor, F2.
At time C, F0 writes to R2, and F1 writes to R1, causing
them to be marked unsafe in their respective flows.
Also, F2 writes to R1, causing R1 to be removed from its
waitsFor set, since any future instructions will get the
correct mapping for R1, and added to its unsafe set.
Upon reaching the reconnection point, flow F1 tries
to reconnect to F2. The reconnect checks are done, and
since F2 didn't eagerly read any register that was unsafe
in F1, the reconnect is successful. The new flow's
unsafe vector is the union of the two unsafe vectors,
which is R1. The new flow gets an empty waitsFor set,
since the waitsFor set of flow F2 was empty. Also note
that the instruction reading R1 that was diverted in F2
now undergoes undiversion and gets the correct physical
register mapping, 109. The combined flow makes
progress, and at time D, does an eager read of register
R2, based on the mapping inherited from F0. This mapping
is wrong, since R2 is marked unsafe in F0. Thus,
when F0 reaches the reconnection point at time E, reconnection
checks fail for register R2. An invalid merge
is signaled, and the successor thread formed from the
merge of F1 and F2 is flushed.
3.3 Path Confidence Prediction
Since it is possible for the predecessor flow to reach re­
connection along a misspeculated path, not all "suc­
cessful" reconnections result in a leap in forward 
progress. Up until reconnection, branch misspeculations
in the predecessor do not affect successor flows;
we can safely roll back the state of the flow to the
misspeculation point without affecting any other flows.
However, if a flow contains a misspeculated branch 
and has reconnected with another flow, then the com­
bined flow is the only flow we have to work with. We 
must roll back to the state at the time of the mispre­
dicted branch, even if the instructions previously asso­
ciated with the successor flow had no dependences on 
those instructions along the mispredicted path. This 
results in a loss of a significant amount of computation 
that could be avoided by delaying reconnection until the 
machine has a high probability of being on the predecessor's
correct path. To this end, we use a branch confidence 
predictor to estimate the likelihood that a flow contains 
unresolved and mispredicted branches.
Branch confidence predictors [5, 3] estimate the 
probability that a branch is predicted correctly. When a 
flow reaches the possible reconnection point, we would 
like to estimate the likelihood that all of its unresolved 
branches were predicted correctly. In other words, we 
need the cumulative confidence estimate for all unre­
solved branches in the flow. We call this cumulative 
estimate Path Unconfidence. A high value of path
unconfidence indicates uncertainty about the flow's
unresolved branches. When a flow fetches a branch and
predicts its direction, a confidence predictor provides
an estimate of how likely the branch is to be mispredicted,
which we call the branch unconfidence. This
value is added to the path unconfidence of the flow that 
fetched the branch. When branches resolve, the corre­
sponding flow's path unconfidence is decremented by 
the branch's unconfidence.
Path unconfidence is used to gate the reconnection 
process: when a flow arrives at the reconnection point, 
we allow it to reconnect to its successor only if its path 
unconfidence is below a certain threshold, called the 
Reconnection Threshold. Otherwise, the flow waits at 
the reconnection point until one of the following three
things happens:
First, if the flow executes a mispredicted branch, the 
flow recovers from the misspeculation as normal, con­
tinuing along its new path without affecting its suc­
cessor flow. Secondly, if the flow executes a branch 
that was correctly predicted which lowers the flow's 
path unconfidence below the reconnection threshold, 
the reconnection is allowed to proceed. Lastly, if a 
timeout number of cycles pass while waiting to recon­
nect, reconnection is triggered in spite of the current 
path unconfidence. Using a timeout value helps to im­
prove performance, since flows often have unresolved 
branches which are sitting in the Diverter Queue.
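The bookkeeping and gating described above can be sketched as a small state machine. The class name `ReconnectGate` and its method names are hypothetical, and the recovery of path unconfidence after a mispredicted branch is deliberately left out because the text does not spell it out.

```python
class ReconnectGate:
    """Tracks one flow's path unconfidence and gates its reconnection."""

    def __init__(self, threshold, timeout):
        self.threshold = threshold      # the Reconnection Threshold
        self.timeout = timeout          # cycles to wait before forcing it
        self.path_unconfidence = 0
        self.cycles_waiting = 0

    def on_branch_fetched(self, branch_unconfidence):
        """Add the branch's unconfidence when the branch is fetched."""
        self.path_unconfidence += branch_unconfidence

    def on_branch_resolved(self, branch_unconfidence):
        """Remove the branch's unconfidence once the branch resolves."""
        self.path_unconfidence -= branch_unconfidence

    def tick_at_reconnection_point(self):
        """Call once per cycle while waiting; True means reconnect now."""
        if self.path_unconfidence < self.threshold:
            return True
        self.cycles_waiting += 1
        return self.cycles_waiting >= self.timeout


gate = ReconnectGate(threshold=15, timeout=35)
gate.on_branch_fetched(10)
gate.on_branch_fetched(8)
first = gate.tick_at_reconnection_point()    # unconfidence 18, keep waiting
gate.on_branch_resolved(8)
second = gate.tick_at_reconnection_point()   # unconfidence 10, reconnect
print(first, second)
```

The timeout path covers the case where the remaining unresolved branches cannot resolve until reconnection releases them.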
To build confidence mechanisms for reconnect-gating,
we leveraged previous work in branch confidence
estimation, along with extensive experimentation
to determine the kind of predictors that work well in
this domain. This has resulted in a unique mechanism
for determining path unconfidence which we describe
next. However, providing detailed reasons for
our choices is beyond the scope of this paper.

Parameter              Value
Pipeline Width         8 instr/cycle
Branch Predictor       8Kbit gshare
Confidence Predictor   8Kbit JRS
Misprediction Penalty  at least 8 cycles
Reorder Buffer         1024 entries, dynamically shared
Functional Units       8 identical fully pipelined units
L1 I-Cache             8Kbytes, 2-way set assoc., 128 byte lines, 10 cycle miss
L1 D-Cache             8Kbytes, 4-way set assoc., 64 byte lines, 10 cycle miss
L2 Cache               512Kbytes, 8-way set assoc., 128 byte lines, 100 cycle miss

Figure 8: Pipeline parameters.
We use two different branch confidence estimators, 
working together: the enhanced JRS predictor [5, 3] 
with 4 bits per entry, and another estimator, which 
we call the Global Miss Distance Counter (GMDC). 
The GMDC contains one 4-bit counter per flow to 
keep track of the number of branches that have been 
fetched since the most recent mispredict was resolved. 
This estimator exploits the insight presented in [3] that 
branch mispredicts are often clustered together, and 
thus, branches fetched immediately after a mispredict 
should have a lower confidence.
We keep track of the path confidence from these two 
estimators separately, in two different registers, called 
JRS Path Unconfidence, and GMDC Path Unconfidence. 
The JRS unconfidence value assigned to a branch is the 
counter value read from the JRS predictor subtracted 
from 16. (Note that this implies that even the most con­
fident branches get an unconfidence value of 1.) This 
unconfidence value is added to the flow's JRS path un­
confidence. Similarly, the GMDC unconfidence value 
for a branch is the value of the flow's GMDC counter, 
subtracted from 16. To gate reconnect, we use two separate
reconnection thresholds, a JRS reconnection threshold,
and a GMDC reconnection threshold. Both the JRS
path unconfidence and the GMDC path unconfidence 
of a flow must be below their respective thresholds be­
fore the reconnection is allowed.
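The arithmetic behind the two unconfidence values can be made concrete with a short sketch. The names `branch_unconfidence`, `GMDC`, and `reconnect_allowed` are hypothetical, and saturating the GMDC counter at 15 is an assumption based on its 4-bit width.

```python
COUNTER_MAX = 15  # largest value of a 4-bit counter


def branch_unconfidence(counter):
    """Unconfidence of a branch: the counter value subtracted from 16."""
    if not 0 <= counter <= COUNTER_MAX:
        raise ValueError("counter must fit in 4 bits")
    return 16 - counter


class GMDC:
    """Per-flow count of branches fetched since the last resolved mispredict."""

    def __init__(self):
        self.count = 0

    def on_branch_fetched(self):
        # Assumption: the counter saturates rather than wrapping around.
        self.count = min(COUNTER_MAX, self.count + 1)

    def on_mispredict_resolved(self):
        self.count = 0


def reconnect_allowed(jrs_path, gmdc_path, jrs_threshold, gmdc_threshold):
    """Both path unconfidence values must be below their thresholds."""
    return jrs_path < jrs_threshold and gmdc_path < gmdc_threshold


print(branch_unconfidence(15), branch_unconfidence(0))
print(reconnect_allowed(14, 9, jrs_threshold=15, gmdc_threshold=10))
```

A counter at its maximum of 15 yields an unconfidence of 1, which is why even the most confident branches still contribute to the path unconfidence.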
4 Evaluation
Our experimental evaluation was performed on a sim­
ulator for the PolyFlow Speculative Multithreaded ar­
chitecture. The simulator executes a variant of the 
64-bit MIPS instruction set which does not have
any special instructions to support speculative
multithreading. The PolyFlow simulator is fully execution
driven. It not only simulates timing, but also executes
instructions out-of-order in the backend, writing results
to the register file out of program order. When an
instruction is retired, its results are compared against
an architectural simulator, and an error is signaled if
the results don't match. The PolyFlow simulator models
mispredicted instructions accurately, since the backend
treats good path and bad path instructions in exactly
the same way: both types of instructions execute
and write values to the physical register file, and we
use rename checkpoints to recover from branch
mispredictions. The simulator also renames instructions
out-of-order speculatively and uses the bitmap based
checking mechanism described earlier to track waitsFor
instructions and catch true dependence violations
in the presence of out-of-order spawns and reconnections.
(2.693) (1.571) (1.866) (1.385) (2.423) (1.706) (1.923) (1.300) (1.550) (1.949) (1.695) (2.490)
Figure 9: The impact of path confidence information. The y-axis shows percentage speedup of speculative multithreading
over a superscalar (base superscalar IPC shown in parentheses). The leftmost bar shows speedup of release at arrival given an
ideal (oracular) confidence predictor. The second bar shows a "hyper-aggressive" policy where release is performed whenever
we arrive at the potential reconnection point, disregarding confidence information. The middle bar shows the use of a realistic
confidence predictor with a threshold set at 15. The fourth bar shows performance degradation due to a non-aggressive policy
that releases only when there are no remaining unexecuted branches. The final bar shows the slowdown from an even less
aggressive policy that releases only when all branches have retired.
Many essential functions of the PolyFlow architec­
ture are topics of ongoing research, including the 
spawn policy, the memory system, and out-of-order 
branch and confidence prediction. For the purposes 
of this paper, we idealized these parts of the ma­
chine, so that we could focus on the performance ef­
fects of renaming and data-forwarding. Thus, the 
simulator uses oracular memory dependence predic­
tion. The spawn points we use are obtained from a 
control-independence analysis of program traces, and 
the spawn policy uses oracularly known distance met­
rics to decide which spawns are useful. Finally, we 
used branch direction and confidence predictions from 
GShare and JRS [5] predictors respectively, executing 
the program in-order (i.e., we did not model out-of­
order branch resolution while training these predic­
tors).
We simulate a very aggressive, 8-wide machine, run­
ning 8 threads, with the configuration given in Fig­
ure 8. The superscalar model that we use is capable 
of fetching a maximum of one taken branch per cy­
cle. In PolyFlow mode, the machine can fetch from 
two threads in a cycle, with a maximum of one taken 
branch per cycle per thread. The PolyFlow instruction 
fetch unit uses path confidence to prioritize among dif­
ferent threads, giving preference to threads that have a 
higher confidence value. However, note that the results
in Figure 10 use a round-robin fetch policy.
In Section 4.1 we demonstrate the performance im­
pact of forwarding data between flows speculatively, 
but only when we have high confidence in the spec­
ulation. In particular we find that our path-confidence 
predictor can achieve speedups over a base superscalar 
that are within 10% of an oracular system that synchro­
nizes at the "perfect" time. Section 4.2 demonstrates 
that our waitsFor predictor performance is also nearly 
ideal.
In the results presented here, we fast forwarded 
through the initialization phase of all benchmarks, and 
executed 100 million instructions after that. All the 
graphs that we present show the speedup of differ­
ent Speculative Multithreaded configurations over a 
superscalar. The absolute IPC numbers for the super­
scalar are shown below each benchmark name in Fig­
ure 9.
4.1 Speculative Data Forwarding
In this section, we look at the effects of speculative 
data-forwarding on the performance of a PolyFlow 
system. We use a perfect waitsFor predictor in these 
experiments. The results with a real waitsFor predictor 
are presented in Section 4.2. In Figure 9 we compare a 
variety of policies for selecting the time at which pro­
ducer threads release the consumer threads.
Most of these are Release-on-Arrival (RoA) policies, 
where inter-thread data forwarding (by releasing the 
waitsFor instructions) happens when the predecessor 
arrives at the potential reconnection point. The policies 
differ in how aggressive they are about assuming that 
we have arrived at the reconnection point along a good 
branch path instead of a mispredicted branch path. We 
model four different RoA policies:
• RoA, Perfect Branch Confidence: In this policy, 
data is forwarded from the predecessor to the suc­
cessor on good path arrival at the reconnection 
point. Good path arrival is determined oracularly. 
This configuration is the upper bound on the per­
formance of RoA.
• RoA, Path Confidence Threshold 15+10: This pol­
icy uses the path confidence predictor described in 
Section 3.3 to predict whether arrival at the recon­
nection point is good path or bad path. We used 
a JRS unconfidence threshold of 15, and a GMDC 
unconfidence threshold of 10.
A predecessor thread releases consumers in its
successor thread when the predecessor thread arrives
at the potential reconnection point and both
path unconfidence values (JRS and GMDC) are below
their respective thresholds. If the path unconfidence is too
high, the predecessor waits until either its branches
resolve, decreasing its path unconfidence, or 35
clock cycles elapse, whichever is earlier. At this
point, waiting consumer instructions in the successor
are released.
• RoA, Path Confidence Threshold Infinity: This 
policy does not use path confidence, and aggres­
sively forwards data from predecessor to succes­
sor immediately upon the predecessor's arrival at 
the potential reconnection point.
• RoA, Path Confidence Threshold Zero: This con­
figuration conservatively forwards data from pre­
decessor to successor only when all branches 
in the predecessor have resolved (completed 
execution). Thus, reconnection happens
non-speculatively.
The final configuration we evaluate is Release-on-Retirement
(RoR), which is even more conservative
than RoA with Path Confidence Threshold Zero. This
policy waits to forward from predecessor to successor
until the predecessor has retired all its instructions, and
therefore forwards completely non-speculatively.
Note, however, that RoR is somewhat less conservative
than a policy based on waiting until the consumer
instruction retired, as would happen in systems
that base their synchronization on full value speculation
with validation, and partial reexecution, at retirement
[1, 10].
Figure 9 demonstrates that forwarding data aggressively
and speculatively gives better performance than
forwarding it conservatively. Release-on-Retirement is
particularly bad, and results in a small slowdown over
the superscalar for some benchmarks. The other configurations,
RoA with a threshold of infinity, and RoA
with a threshold of 0, both perform significantly worse
than RoA with perfect confidence, although there is no
clear favorite between the two. For some benchmarks,
like twolf and bzip2, waiting until all branches in the
predecessor have resolved is better. For other benchmarks,
like vortex, forwarding data immediately upon
arrival is better.
RoA with a threshold of 15+10 performs better than 
the above two configurations, and comes close to RoA 
with perfect confidence. We have also found that the 
particular path confidence threshold that performs best 
varies from one benchmark to another, although the 
results presented here are with a fixed threshold. An 
adaptive algorithm that adjusts the threshold dynami­
cally would probably do even better.
Note that using confidence to gate the reconnect sig­
nal reduces performance for one benchmark (vortex), 
compared to forwarding data immediately upon ar­
rival. The reason for this is the confidence based fetch 
prioritization policy used in our simulations. Such a 
policy reduces the likelihood of bad path arrival at the 
reconnection point: threads that are low in confidence 
fetch fewer instructions, and thus, are less likely to ar­
rive at the reconnection point. With a different fetch 
prioritization policy, using branch confidence to gate 
data forwarding becomes more important. For exam­
ple, Figure 10 demonstrates that in a system with a 
round-robin fetch algorithm, a threshold of 15 always 
performs significantly better than thresholds of zero or 
infinity. In this case all benchmarks, including vortex, 
are helped by using branch confidence to gate synchro­
nization.
Figure 10: Using path confidence information to gate
aggressive synchronization is even more important
when the fetch policy is not biased toward more confident
paths. This graph shows speculative multithreading
speedups over superscalar with a (sub-optimal) round-robin
fetch policy. In this case the "hyper-aggressive" synchronization
policy never beats the confidence gated policy.
Recall that our Release-on-Arrival mechanism forwards
data from a predecessor thread to a successor
thread only after the predecessor has arrived at the
potential reconnection point. We wanted to understand
the performance cost of this design decision since
several compiler based speculative multithreading systems
[8, 16] have taken pains to release individual registers
at the earliest point where the compiler can prove
there will be no more modifications to that register.
Thus, we also performed experiments where we compared
our Release-on-Arrival policy with a completely
unrealizable oracle that can forward data from a producer
instruction to all consumer instructions as soon
as the dynamic producer instruction is renamed. Note 
that this may be considerably earlier than a compiler 
could place a release or send instruction, since we are 
working with the dynamic instruction stream rather 
than the static program. We found, nonetheless, that 
Release-on-Arrival was usually quite competitive with 
the unrealizable oracle. For 8 out of our 12 bench­
marks (bzip2, crafty, gap, gzip, parser, perlbmk, vor­
tex and vpr-route) the unrealizable oracle got less than 
10% additional speedup over that achieved by Release- 
on-Arrival. For the other four benchmarks (gcc, mcf, 
twolf, and vpr-place) there is room for improvement. 
The twolf benchmark, in particular, achieved an extra 
38% speedup (132% compared to RoA's 94%) over the 
superscalar when synchronization is performed by the 
unrealizable oracle.
4.2 WaitsFor Prediction
In this section, we evaluate the design space of waitsFor
prediction. For all the experiments in this section, we 
use Release-on-Arrival with perfect branch confidence 
as the data-forwarding strategy, so that we can focus 
on the performance of the waitsFor predictor.
Recall that the waitsFor predictor decides which in­
structions in the successor thread should be delayed. 
If the predictor fails to mark an instruction as waits­
For, a dependence violation and thread squash could 
happen. If the predictor marks instructions as waits­
For unnecessarily, instructions in the successor thread 
may be delayed waiting for synchronization that is not 
actually required.
Figure 11: A one bit up-down waitsFor predictor usually
provides speedups over a superscalar that are
within a few percent of those produced by an oracle
waitsFor predictor. The y-axis shows percentage
speedup of release-on-arrival with a variety of waitsFor predictors
over the aggressive superscalar. The total number of
spawner-spawnee pairs (predictor entries) is shown for each
benchmark along the x-axis. The leftmost bar shows an "oracle"
waitsFor predictor. The second bar shows a 1-bit predictor
that simply uses the unsafe set from the previous
instance of this spawn. The third bar shows a conservative
saturating "up-only" counter that gets set if a particular
register should ever have been made waitsFor in the past.
The rightmost bar shows a 3-bit counter.
We implemented a number of different waitsFor predictors,
with different values for upcount (U), downcount (D)
and threshold (T). We examine the performance
of three different predictors: a 1-bit predictor
that remembers the true waitsFor set from last
time (U=1, D=1, T=1); a saturating predictor that never
counts down (U=1, D=0, T=1); and a 3-bit predictor
(U=8, D=1, T=1). Their performance is shown in Figure
11, which also shows a perfect predictor for reference.
We find that except for one benchmark (vpr-place), the
amount of hysteresis in the waitsFor predictor does not
affect performance much.
For vpr-place, a 1-bit predictor that simply remembers
the waitsFor set from the last time works best, but
still loses about 13% of the speedup achieved by the
oracle predictor. The fact that the 1-bit predictor works
better than the "up-only" and 3-bit predictors indicates
that this application has a consumer on the critical path
that only occasionally needs to be synchronized. The
non-oracular waitsFor predictors conservatively mark
this consumer as waitsFor too often.
The total number of entries in the predictor, i.e., the 
total number of unique spawnerPC-spawnedPC pairs, 
is shown below each benchmark in Figure 11. We did 
not model size constraints and replacement policies for 
the waitsFor predictor.
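
The three predictor configurations compared above can be illustrated with
a minimal sketch. The update rule is an assumption, not a description of
the simulated hardware: each architectural register has a saturating
counter that is raised by U when the register turned out to need
synchronization on the last instance of the spawn, lowered by D otherwise,
and predicted waitsFor when the counter reaches T. The class name, method
names, and the max_count parameter are hypothetical. The table is an
unbounded dictionary keyed by (spawnerPC, spawnedPC), mirroring the fact
that size constraints and replacement were not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class WaitsForPredictor:
    """Hypothetical per-register up-down counter predictor (illustrative only)."""
    upcount: int      # U: added when the register needed synchronization
    downcount: int    # D: subtracted when it did not
    threshold: int    # T: predict waitsFor when counter >= T
    max_count: int    # assumed saturation value (1 for 1-bit, 7 for 3-bit)
    # (spawner_pc, spawned_pc) -> {register name: counter value}
    table: dict = field(default_factory=dict)

    def predict(self, spawner_pc, spawned_pc):
        """Return the set of registers predicted to be waitsFor."""
        counters = self.table.get((spawner_pc, spawned_pc), {})
        return {reg for reg, c in counters.items() if c >= self.threshold}

    def train(self, spawner_pc, spawned_pc, unsafe_regs):
        """Update counters with the registers that actually needed to wait."""
        counters = self.table.setdefault((spawner_pc, spawned_pc), {})
        for reg in set(counters) | set(unsafe_regs):
            c = counters.get(reg, 0)
            if reg in unsafe_regs:
                c = min(self.max_count, c + self.upcount)
            else:
                c = max(0, c - self.downcount)
            counters[reg] = c


def one_bit():
    return WaitsForPredictor(upcount=1, downcount=1, threshold=1, max_count=1)


def up_only():
    return WaitsForPredictor(upcount=1, downcount=0, threshold=1, max_count=1)


def three_bit():
    return WaitsForPredictor(upcount=8, downcount=1, threshold=1, max_count=7)
```

Under these assumptions, the 1-bit configuration forgets a register after
a single instance in which it did not need to wait, the up-only
configuration never forgets it, and the 3-bit configuration forgets it
only after seven consecutive such instances. This is the hysteresis that
the vpr-place result suggests can be too conservative for a consumer that
only occasionally needs synchronization.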
5 Conclusion
This paper describes the inter-flow register renaming 
and synchronization hardware of an aggressive spec­
ulatively multithreaded system. The system runs on 
top of a simultaneous-multithreading-like pipeline that 
can support up to 8 simultaneously active threads. 
Our system combines a novel set of features. First, 
our inter-flow renaming and synchronization scheme 
supports out-of-order spawn and reconnect of threads. 
Second, it aggressively and speculatively synchronizes 
to minimize latency added to the critical path. Third, 
we have designed a path confidence predictor that 
works particularly well to gate our synchronization 
scheme so that it is not too aggressive. Fourth, we have 
demonstrated that our path confidence gating mech­
anism gives us performance within 10% of an oracle 
that "magically" knows the perfect time to perform 
synchronization. Finally, we have demonstrated that a
straightforward prediction of a single bit per architectural
register allows us to near-optimally identify the
set of consumer instructions with no compiler support
at all.
We made several decisions early in the design pro­
cess. In particular, we decided to target our design 
at a tightly coupled (simultaneously multithreaded) 
style system rather than a CMP. Also, we decided 
that our system would run binaries "out of the box," 
dynamically discovering inter-thread synchronization 
points, rather than relying on a compiler to identify 
and reschedule synchronization instructions. We be­
lieve that the insights we have gained in our system 
may well apply in these two broader contexts.
In particular, we believe that our insights about the 
necessity of aggressive, speculative, and path confi­
dence gated synchronization will carry over to spec­
ulative multithreading systems built on top of CMPs. 
Our results also indicate that there may be some bene­
fit to be gained by identifying the last dynamic register 
producer along particular paths of the program, and 
we plan to investigate mechanisms, both dynamic and 
compiler based, to gather this information.
Acknowledgements
The work reported in this paper was supported in part 
by the National Science Foundation under grant CCR- 
0429711. Samuel Stone is supported by an NSF Grad­
uate Research Fellowship. Computational resources 
were supported by an equipment donation from AMD 
Corp., and the National Science Foundation under 
grant EIA-0224453.
References
[1] Haitham Akkary and Michael A. Driscoll. A dynamic 
multithreading processor. In 31st Int'l Symp. Microarchi­
tecture, pages 226-236, November 1998.
[2] Chen-Yong Cher and T. N. Vijaykumar. Skipper: a
microarchitecture for exploiting control-flow independence.
In MICRO 34, pages 4-15, 2001.
[3] Dirk Grunwald, Artur Klauser, Srilatha Manne, and
Andrew R. Pleszkun. Confidence estimation for speculation
control. In ISCA, pages 122-131, 1998.
[4] Lance Hammond, Mark Willey, and Kunle Olukotun.
Data speculation support for a chip multiprocessor. In
ASPLOS VIII, pages 58-69, October 1998.
[5] Erik Jacobsen, Eric Rotenberg, and James E. Smith.
Assigning confidence to conditional branch predictions. In
MICRO 29, pages 142-152, 1996.
[6] Venkata Krishnan and Josep Torrellas. A chip-multiprocessor
architecture with speculative multithreading.
IEEE Transactions on Computers, 48(9):866-880, 1999.
[7] Pedro Marcuello, Antonio González, and Jordi Tubella.
Speculative multithreaded processors. In Int'l Conf.
Supercomputing, pages 77-84, 1998.
[8] Il Park, Babak Falsafi, and T. N. Vijaykumar.
Implicitly-multithreaded processors. In ISCA-30, pages 39-51,
2003.
[9] Jose Renau, James Tuck, Wei Liu, Luis Ceze, Karin 
Strauss, and Josep Torrellas. Tasking with out-of-order 
spawn in TLS chip multiprocessors: microarchitecture 
and compilation. In 19th Int'l Conf. Supercomputing (ICS), 
pages 179-188, 2005.
[10] Eric Rotenberg and James E. Smith. Control independence
in trace processors. In International Symposium on
Microarchitecture, pages 4-15, 1999.
[11] Amir Roth and Gurindar S. Sohi. Speculative data-driven
multithreading. In HPCA 7, January 2001.
[12] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, 
and J. B. Joyner. POWER5 system microarchitecture. 
IBM Journal of Research and Development, 49(4/5), 2005.
[13] Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. 
Multiscalar processors. In ISCA 22, pages 414-425, June 
1995.
[14] J. Gregory Steffan and Todd C. Mowry. The potential 
for using thread-level data speculation to facilitate auto­
matic parallelization. In HPCA 4, pages 2-13, February 
1998.
[15] T. N. Vijaykumar. Compiling for the Multiscalar Archi­
tecture. PhD thesis, University of Wisconsin-Madison 
Computer Sciences Department, January 1998.
[16] Antonia Zhai, Christopher B. Colohan, J. Gregory Stef­
fan, and Todd C. Mowry. Compiler optimization 
of scalar value communication between speculative 
threads. In ASPLOS-X, October 2002.
