Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor by Dean M. Tullsen et al.
Exploiting Choice: Instruction Fetch and Issue on an Implementable
Simultaneous Multithreading Processor
Dean M. Tullsen*, Susan J. Eggers*, Joel S. Emer†, Henry M. Levy*,
Jack L. Lo*, and Rebecca L. Stamm†
*Dept of Computer Science and Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350
†Digital Equipment Corporation, HLO2-3/J3, 77 Reed Road, Hudson, MA 01749
Abstract
Simultaneous multithreading is a technique that permits multiple
independent threads to issue multiple instructions each cycle. In
previous work we demonstrated the performance potential of si-
multaneous multithreading, based on a somewhat idealized model.
In this paper we show that the throughput gains from simultaneous
multithreading can be achieved without extensive changes to a conventional
wide-issue superscalar, either in hardware structures or
sizes. We present an architecture for simultaneous multithreading
that achieves three goals: (1) it minimizes the architectural impact
on the conventional superscalar design, (2) it has minimal performance
impact on a single thread executing alone, and (3) it achieves
signiﬁcant throughput gains when running multiple threads. Our
simultaneous multithreading architecture achieves a throughput of
5.4 instructions per cycle, a 2.5-fold improvement over an unmod-
iﬁed superscalar with similar hardware resources. This speedup is
enhanced by an advantage of multithreading previously unexploited
in other architectures: the ability to favor for fetch and issue those
threads most efﬁciently using the processor each cycle, thereby
providing the “best” instructions to the processor.
1 Introduction
Simultaneous multithreading (SMT) is a technique that permits multiple
independent threads to issue multiple instructions each cycle
to a superscalar processor’s functional units. SMT combines the
multiple-instruction-issue features of modern superscalars with the
latency-hiding ability of multithreaded architectures. Unlike con-
ventional multithreaded architectures [1, 2, 15, 23], which depend
on fast context switching to share processor execution resources, all
hardware contexts in an SMT processor are active simultaneously,
competing each cycle for all available resources. This dynamic
sharing of the functional units allows simultaneous multithread-
ing to substantially increase throughput, attacking the two major
impediments to processor utilization — long latencies and limited
per-thread parallelism. Tullsen, et al., [27] showed the potential of
an SMT processor to achieve significantly higher throughput than
either a wide superscalar or a multithreaded processor. That paper
also demonstrated the advantages of simultaneous multithreading
over multiple processors on a single chip, due to SMT’s ability to
dynamically assign execution resources where needed each cycle.
Proceedings of the 23rd Annual International Symposium on
Computer Architecture, Philadelphia, PA, May, 1996
Those results showed SMT’s potential based on a somewhat idealized
model. This paper extends that work in four significant ways.
First, we demonstrate that the throughput gains of simultaneous
multithreading are possible without extensive changes to a conventional,
wide-issue superscalar processor. We propose an architecture that
is more comprehensive, realistic, and heavily leveraged off existing
superscalar technology. Our simulations show that a minimal im-
plementation of simultaneous multithreading achieves throughput
1.8 times that of the unmodiﬁed superscalar; small tuning of this
architecture increases that gain to 2.5 (reaching throughput as high
as 5.4 instructions per cycle). Second, we show that SMT need not
compromise single-thread performance. Third, we use our more
detailed architectural model to analyze and relieve bottlenecks that
did not exist in the more idealized model. Fourth, we show how
simultaneous multithreading creates an advantage previously unexploitable
in other architectures: namely, the ability to choose the
“best” instructions, from all threads, for both fetch and issue each
cycle. By favoring the threads most efﬁciently using the processor,
we can boost the throughput of our limited resources. We present
several simple heuristics for this selection process, and demonstrate
how such heuristics, when applied to the fetch mechanism, can
increase throughput by as much as 37%.
This paper is organized as follows. Section 2 presents our baseline
simultaneous multithreading architecture, comparing it with existing
superscalar technology. Section 3 describes our simulator and
our workload, and Section 4 shows the performance of the baseline
architecture. In Section 5, we examine the instruction fetch process,
present several heuristics for improving it based on intelligent
instruction selection, and give performance results to differentiate
those heuristics. Section 6 examines the instruction issue process in
a similar way. We then use the best designs chosen from our fetch
and issue studies in Section 7 as a basis to discover bottlenecks
for further performance improvement. We discuss related work in
Section 8 and summarize our results in Section 9.
This research was supported by ONR grants N00014-92-J-1395 and
N00014-94-1-1136, NSF grants CCR-9200832 and CDA-9123308, NSF
PYI Award MIP-9058439, the Washington Technology Center, Digital
Equipment Corporation, and fellowships from Intel and the Computer
Measurement Group.
[Figure 1 block diagram. Labeled blocks: Fetch Unit, PC, Instruction Cache, 8, Decode, Register Renaming, integer instruction queue, floating point instruction queue, integer registers, fp registers, int/ld-store units, fp units, Data Cache.]
Figure 1: Our base simultaneous multithreading hardware architecture.
2 A Simultaneous Multithreading Processor
Architecture
In this section we present the architecture of our simultaneous
multithreading processor. We show that the throughput gains provided
by simultaneous multithreading are possible without adding undue
complexity to a conventional superscalar processor design.
Our SMT architecture is derived from a high-performance,
out-of-order, superscalar architecture (Figure 1, without the extra
program counters) which represents a projection of current superscalar
design trends 3-5 years into the future. This superscalar proces-
sor fetches up to eight instructions per cycle; fetching is controlled
by a conventional system of branch target buffer, branch predic-
tion, and subroutine return stacks. Fetched instructions are then
decoded and passed to the register renaming logic, which maps
logical registers onto a pool of physical registers, removing false
dependences. Instructions are then placed in one of two instruc-
tion queues. Those instruction queues are similar to the ones used
by the MIPS R10000 [20] and the HP PA-8000 [21], in this case
holding instructions until they are issued. Instructions are issued to
the functional units out-of-order when their operands are available.
After completing execution, instructions are retired in-order, freeing
physical registers that are no longer needed.
Our SMT architecture is a straightforward extension to this con-
ventional superscalar design. We made changes only when neces-
sary to enable simultaneous multithreading, and in general, struc-
tures were not replicated or resized to support SMT or a multi-
threaded workload. Thus, nearly all hardware resources remain
completely available even when there is only a single thread in
the system. The changes necessary to support simultaneous
multithreading on that architecture are:
• multiple program counters and some mechanism by which the
fetch unit selects one each cycle,
• a separate return stack for each thread for predicting subroutine
return destinations,
• per-thread instruction retirement, instruction queue flush, and
trap mechanisms,
• a thread id with each branch target buffer entry to avoid predicting
phantom branches, and
• a larger register file, to support logical registers for all threads
plus additional registers for register renaming. The size of
the register file affects the pipeline (we add two extra stages)
and the scheduling of load-dependent instructions, which we
discuss later in this section.
Noticeably absent from this list is a mechanism to enable simultaneous
multithreaded scheduling of instructions onto the functional
units. Because any apparent dependences between instructions from
different threads are removed by the register renaming phase, a conventional
instruction queue (IQ) designed for dynamic scheduling
contains all of the functionality necessary for simultaneous
multithreading. The instruction queue is shared by all threads and an
instruction from any thread in the queue can issue when its operands
are available.
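The role of renaming in separating threads can be illustrated with a minimal sketch. It assumes one map table per thread over a single shared pool of physical registers; the class name `Renamer`, the initial register binding, and the in-order free list are hypothetical choices made for illustration, not details taken from the paper. Freeing registers at commit and handling an empty free list are omitted.

```python
class Renamer:
    """Per-thread map tables over one shared pool of physical registers.

    Illustrative only: the initial binding and free-list order are assumptions.
    """

    def __init__(self, num_threads, logical_regs=32, rename_regs=100):
        total = num_threads * logical_regs + rename_regs
        # Each thread's logical register r starts bound to its own physical register.
        self.map = [
            [t * logical_regs + r for r in range(logical_regs)]
            for t in range(num_threads)
        ]
        # The remaining physical registers form the shared free list.
        self.free = list(range(num_threads * logical_regs, total))

    def rename(self, thread, dest, srcs):
        """Return (physical dest, physical sources) for one instruction."""
        # Sources are looked up in the issuing thread's own map table.
        phys_srcs = [self.map[thread][s] for s in srcs]
        # The destination gets a fresh physical register from the shared pool.
        phys_dest = self.free.pop(0)
        self.map[thread][dest] = phys_dest
        return phys_dest, phys_srcs
```

Two threads that both write logical register 5 receive different physical registers, so a shared instruction queue that tracks only physical register names sees no dependence between them.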
We fetch from one program counter (PC) each cycle. The PC is
chosen, in round-robin order, from among those threads not already
experiencing an I cache miss. This scheme provides simultaneous
multithreading at the point of issue, but only fine-grain multithreading
of the fetch unit. We will look in Section 5 at ways to extend
simultaneous multithreading to the fetch unit. We also investigate
alternative thread priority mechanisms for fetching.
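The baseline selection rule can be sketched in a few lines. This is a simplified model, assuming the fetch unit remembers which thread fetched last and sees one I cache miss flag per thread; the function name and argument layout are hypothetical.

```python
def select_fetch_thread(last_thread, icache_miss):
    """Pick the next thread in round-robin order, skipping threads
    that currently have an outstanding I cache miss.

    last_thread: index of the thread that fetched most recently.
    icache_miss: list of booleans, one per hardware context.
    Returns the chosen thread index, or None if every thread is stalled.
    """
    n = len(icache_miss)
    for step in range(1, n + 1):
        candidate = (last_thread + step) % n
        if not icache_miss[candidate]:
            return candidate
    return None
```

With four contexts, `select_fetch_thread(0, [False, True, False, False])` skips the stalled thread 1 and selects thread 2.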
A primary impact of multithreading on our architecture is on the
size of the register file. We have a single register file, as thread-specific
logical registers are mapped onto a completely shared physical
register file by the register renaming. To support eight threads,
we need a minimum of 8*32 = 256 physical integer registers (for a
32-register instruction set architecture), plus more to enable register
renaming. Access to such a large register ﬁle will be slow, almost
certainly affecting the cycle time of the machine.
To account for the size of the register ﬁle, we take two cycles to
read registers instead of one. In the ﬁrst cycle values are read into
a buffer closer to the functional units. The instruction is sent to a
similar buffer at the same time. The next cycle the data is sent to a
functional unit for execution. Writes to the register ﬁle are treated
similarly, requiring an extra register write stage. Figure 2 shows
the pipeline modified for two-phase register access, compared to the
pipeline of the original superscalar.
The two-stage register access has several ramiﬁcations on our
architecture. First, it increases the pipeline distance between fetch
and exec, increasing the branch misprediction penalty by 1 cycle.
Second, it takes an extra cycle to write back results, requiring an
extra level of bypass logic. Third, increasing the distance between
queue and exec increases the period during which wrong-path instructions
remain in the pipeline after a misprediction is discovered
(the misqueue penalty in Figure 2). Wrong-path instructions are
those instructions brought into the processor as a result of a branch
misprediction. Those instructions consume instruction queue slots,
renaming registers and possibly issue slots, all of which, on an SMT
processor, could be used by other threads.
(a) Fetch | Decode | Rename | Queue | Reg Read | Exec | Commit
    misfetch penalty 2 cycles
    mispredict penalty 6 cycles
    register usage 4 cycle minimum
(b) Fetch | Decode | Rename | Queue | Reg Read | Reg Read | Exec | Reg Write | Commit
    misfetch penalty 2 cycles
    mispredict penalty 7 cycles
    misqueue penalty 4 cycles
    register usage 6 cycle minimum
Figure 2: The pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor, along with
some implications of those pipelines.
This pipeline does not increase the inter-instruction latency between
most instructions. Dependent (single-cycle latency) instructions
can still be issued on consecutive cycles, for example, as long
as inter-instruction latencies are predetermined. That is the case for
all instructions but loads. Since we are scheduling instructions a
cycle earlier (relative to the exec cycle), load-hit latency increases
by one cycle (to two cycles). Rather than suffer this penalty, we
schedule load-dependent instructions assuming a 1-cycle data latency,
but squash those instructions in the case of an L1 cache miss
or a bank conflict. There are two performance costs to this solution,
which we call optimistic issue. Optimistically issued instructions
that get squashed waste issue slots, and optimistic instructions must
still be held in the IQ an extra cycle after they are issued, until it is
known that they won’t be squashed.
The last implication of the two-phase register access is that there
are two more stages between rename and commit, thus increasing
the minimum time that a physical register is held by an in-ﬂight
instruction. This increases the pressure on the renaming register
pool.
We assume, for each machine size, enough physical registers
to support all active threads, plus 100 more registers to enable
renaming, both for the integer file and the floating point file; i.e., for
the single-thread results, we model 132 physical integer registers,
and for an 8-thread machine, 356. We expect that in the 3-5 year
time-frame, the scheme we have described will remove register file
access from the critical path for a 4-thread machine, but 8 threads
will still be a significant challenge. Nonetheless, extending our
results to an 8-thread machine allows us to see trends beyond the
4-thread numbers and anticipates other solutions to this problem. The
number of registers available for renaming determines the number
of instructions that can be in the processor between the rename stage
and the commit stage.
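The register file sizes quoted above follow from a simple sum: 32 logical registers per thread plus 100 renaming registers. The sketch below restates that arithmetic; the function name and parameter names are hypothetical.

```python
def physical_registers(threads, logical_per_thread=32, rename_regs=100):
    """Physical registers in one register file (integer or floating point):
    the logical registers of every thread plus the shared renaming pool."""
    return threads * logical_per_thread + rename_regs


# Sizes for the machine configurations discussed in the text.
sizes = {t: physical_registers(t) for t in (1, 4, 8)}
```

This yields 132 registers for one thread and 356 for eight, matching the figures in the text; the 4-thread value of 228 is implied by the same formula rather than stated in the paper.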
This architecture allows us to address several concerns about
simultaneous multithreaded processor design. In particular, this
paper shows that:
• Instruction scheduling is no more complex than on a dynamically
scheduled superscalar.
• Register file data paths are no more complex than in the superscalar,
and the performance implications of the register file
and its extended pipeline are small.
• The required instruction fetch throughput is attainable, even
without any increase in fetch bandwidth.
• Unmodified (for an SMT workload) cache and branch prediction
structures do not thrash on that workload.
• Even aggressive superscalar technologies, such as dynamic
scheduling and speculative execution, are not sufficient to take
full advantage of a wide-issue processor without simultaneous
multithreading.
We have only presented an outline of the hardware architecture
to this point; the next section provides more detail.
2.1 Hardware Details
The processor contains 3 floating point functional units and 6 integer
units; four of the six integer units also execute loads and stores. The
peak issue bandwidth out of the two instruction queues is therefore
nine; however, the throughput of the machine is bounded by the
peak fetch and decode bandwidths, which are eight instructions per
cycle. We assume that all functional units are completely pipelined.
Table 1 shows the instruction latencies, which are derived from the
Alpha 21164 [8].
We assume a 32-entry integer instruction queue (which han-
dles integer instructions and all load/store operations) and a 32-
entry ﬂoating point queue, not signiﬁcantly larger than the HP PA-
8000 [21], which has two 28-entry queues.
The caches (Table 2) are multi-ported by interleaving them into
banks, similar to the design of Sohi and Franklin [26]. We model
lockup-free caches and TLBs. TLB misses require two full memory
accesses and no execution resources. We model the memory subsystem
in great detail, simulating bandwidth limitations and access
conflicts at multiple levels of the hierarchy, to address the concern
that memory throughput could be a limiting condition for simultaneous
multithreading.

Instruction Class      Latency
integer multiply       8,16
conditional move       2
compare                0
all other integer      1
FP divide              17,30
all other FP           4
load (cache hit)       1
Table 1: Simulated instruction latencies

                        ICache      DCache   L2       L3
Size                    32 KB       32 KB    256 KB   2 MB
Associativity           DM          DM       4-way    DM
Line Size               64          64       64       64
Banks                   8           8        8        1
Transfer time           1 cycle     1        1        4
Accesses/cycle          var (1-4)   4        1        1/4
Cache fill time         2 cycles    2        2        8
Latency to next level   6           6        12       62
Table 2: Details of the cache hierarchy
Each cycle, one thread is given control of the fetch unit, chosen
from among those not stalled for an instruction cache (I cache) miss.
If we fetch from multiple threads, we never attempt to fetch from
threads that conflict (on an I cache bank) with each other, although
they may conflict with other I cache activity (cache fills).
Branch prediction is provided by a decoupled branch target buffer
(BTB) and pattern history table (PHT) scheme [4]. We use a 256-entry
BTB, organized as four-way set associative. The 2K x 2-bit
PHT is accessed by the XOR of the lower bits of the address and the
global history register [18, 30]. Return destinations are predicted
with a 12-entry return stack (per context).
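The PHT indexing described above can be sketched as follows. Only the table size (2K entries of 2-bit counters) and the XOR of address bits with global history come from the text. The rest is assumed for illustration: dropping the two low address bits for 4-byte instructions, an 11-bit history, counters that start weakly not-taken, and the class name `GsharePredictor`.

```python
PHT_ENTRIES = 2048  # 2K entries, each a 2-bit saturating counter


class GsharePredictor:
    """Sketch of a PHT indexed by (address bits XOR global history)."""

    def __init__(self, entries=PHT_ENTRIES):
        self.entries = entries
        self.pht = [1] * entries  # assumption: start weakly not-taken
        self.history = 0          # global history, kept to log2(entries) bits

    def _index(self, pc):
        # Assumption: instructions are 4 bytes, so the two low bits are dropped.
        return ((pc >> 2) ^ self.history) % self.entries

    def predict(self, pc):
        """Predict taken when the counter is in one of its two upper states."""
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter, then shift the outcome into the history."""
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        self.history = ((self.history << 1) | int(taken)) % self.entries
```

Because the history is part of the index, the same branch address can map to different counters depending on the outcomes of recent branches.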
We assume an efficient, but not perfect, implementation of dynamic
memory disambiguation. This is emulated by using only part
of the address (10 bits) to disambiguate memory references, so that
it is occasionally over-conservative.
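A partial-address comparison of this kind might look like the sketch below. The text gives only the number of bits compared (10); comparing the low-order bits, and the function name `may_conflict`, are assumptions made for illustration.

```python
ADDRESS_BITS_COMPARED = 10  # from the text; which 10 bits is an assumption


def may_conflict(load_addr, store_addr, bits=ADDRESS_BITS_COMPARED):
    """Conservatively report whether a load may depend on an earlier store.

    Only the low `bits` bits are compared, so two different addresses that
    share those bits are reported as a possible conflict (a false positive),
    while a true match is never missed.
    """
    mask = (1 << bits) - 1
    return (load_addr & mask) == (store_addr & mask)
```

For example, addresses 0x0400 and 0x0800 differ but share their low 10 bits, so the check reports a conflict and the load would wait unnecessarily. That is the over-conservative case the text mentions.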
3 Methodology
The methodology in this paper closely follows the simulation and
measurement methodology of [27]. Our simulator uses emulation-
based, instruction-level simulation, and borrows signiﬁcantly from
MIPSI [22], a MIPS-based simulator. The simulator executes un-
modiﬁed Alpha object code and models the execution pipelines,
memory hierarchy, TLBs, and the branch prediction logic of the
processor described in Section 2.
In an SMT processor a branch misprediction introduces wrong-
path instructions that interact with instructions from other threads.
To model this behavior, we fetch down wrong paths, introduce those
instructions into the instruction queues, track their dependences,
and issue them. We eventually squash all wrong-path instructions
a cycle after a branch misprediction is discovered in the exec stage.
Our throughput results only count useful instructions.
Our workload comes primarily from the SPEC92 benchmark
suite [7]. We use five floating point programs (alvinn, doduc, fpppp,
ora, and tomcatv) and two integer programs (espresso and xlisp)
from that suite, and the document typesetting program TeX. We
assign a distinct program to each thread in the processor: the
multiprogrammed workload stresses our architecture more than a parallel
program by presenting threads with widely varying program characteristics
and with no overlap of cache, TLB or branch prediction
usage. To eliminate the effects of benchmark differences, a single
data point is composed of 8 runs, each T * 300 million instructions
in length, where T is the number of threads. Each of the 8 runs uses
a different combination of the benchmarks.
We compile each program with the Multiﬂow trace scheduling
compiler [17], modified to produce Alpha code. In contrast to [27],
we turn off trace scheduling in the compiler for this study, for two
reasons. In our measurements, we want to differentiate between
useful and useless speculative instructions, which is easy with hardware
speculation, but not possible for software speculation with our
system. Also, software speculation is not as beneficial on an architecture
which features hardware speculation, and in some cases
is harmful. However, the Multiflow compiler is still a good choice
for our compilation engine, because of the high quality of the loop
unrolling, instruction scheduling and alignment, and other optimizations,
as well as the ease with which the machine model can be
changed. The benchmarks are compiled to optimize single-thread
performance on the base hardware.
4 Performance of the Base Hardware Design
In this section we examine the performance of the base architec-
ture and identify opportunities for improvement. Figure 3 shows
that with only a single thread running on our SMT architecture, the
throughput is less than 2% below a superscalar without SMT support.
The drop in throughput is due to the longer pipeline (described
in Section 2) used by the SMT processor. Its peak throughput is
84% higher than the superscalar. This gain is achieved with virtually
no tuning of the base architecture for simultaneous multithreading.
This design combines low single-thread impact with high speedup
for even a few threads, enabling simultaneous multithreading to
reap benefits even in an environment where multiple processes are
running only a small fraction of the time. We also note, however,
that the throughput peaks before 8 threads, and the processor uti-
lization, at less than 50% of the 8-issue processor, is well short of
the potential shown in [27].
We draw several conclusions about the potential bottlenecks of
this system as we approach 8 threads, aided by Figure 3 and Table 3.
Issue bandwidth is clearly not a bottleneck, as the throughput represents
a fraction of available issue bandwidth, and our data shows
that no functional unit type is being overloaded. We appear to
have enough physical registers. The caches and branch prediction
logic are being stressed more heavily at 8 threads, but we expect
the latency-hiding potential of the additional threads to make up for
those drops. The culprit appears to be one or more of the following
three problems: (1) IQ size: IQ-full conditions are common, 12
to 21% of cycles total for the two queues; (2) fetch throughput:
even in those cycles where we don’t experience an IQ-full condition,
our data shows that we are sustaining only 4.2 useful instructions
fetched per cycle (4.5 including wrong-path); and (3) lack of
parallelism: although the queues are reasonably full, we find fewer
than four out of, on average, 27 instructions per cycle to issue. We
expect eight threads to provide more parallelism, so perhaps we
have the wrong instructions in the instruction queues.
[Figure 3 plot. Y-axis: Throughput (Instructions Per Cycle), ticks 1 to 5. X-axis: Number of Threads, ticks 2, 4, 6, 8. Reference level marked: Unmodified Superscalar.]
Figure 3: Instruction throughput for the base hardware architecture.
The rest of this paper focuses on improving this base architecture.
The next section addresses each of the problems identified here with
different fetch policies and IQ configurations. Section 6 examines
ways to prevent issue waste, and Section 7 re-examines the improved
architecture for new bottlenecks, identifying directions for further
improvement.
5 The Fetch Unit: In Search of Useful Instructions
In this section we examine ways to improve fetch throughput without
increasing the fetch bandwidth. Our SMT architecture shares a
single fetch unit among eight threads. We can exploit the high level
of competition for the fetch unit in two ways not possible with
single-threaded processors: (1) the fetch unit can fetch from multiple
threads at once, increasing our utilization of the fetch bandwidth,
and (2) it can be selective about which thread or threads to fetch
from. Because not all paths provide equally useful instructions in a
particular cycle, an SMT processor can benefit by fetching from the
thread(s) that will provide the best instructions.
We examine a variety of fetch architectures and fetch policies
that exploit those advantages. Specifically, they attempt to improve
fetch throughput by addressing three factors: fetch efﬁciency, by
partitioning the fetch unit among threads (Section 5.1); fetch ef-
fectiveness, by improving the quality of the instructions fetched
(Section 5.2); and fetch availability, by eliminating conditions that
block the fetch unit (Section 5.3).
5.1 Partitioning the Fetch Unit
Recall that our baseline architecture fetches up to eight instructions
from one thread each cycle. The frequency of branches in typical
instruction streams and the misalignment of branch destinations
make it difficult to fill the entire fetch bandwidth from one thread,
even for smaller block sizes [5, 24]. In this processor, we can spread
the burden of filling the fetch bandwidth among multiple threads.
For example, the probability of finding four instructions from each
of two threads should be greater than that of finding eight from one
thread.

                                       Number of Threads
Metric                                 1       4       8
out-of-registers (% of cycles)         3%      7%      3%
I cache miss rate                      2.5%    7.8%    14.1%
  -misses per thousand instructions    6       17      29
D cache miss rate                      3.1%    6.5%    11.3%
  -misses per thousand instructions    12      25      43
L2 cache miss rate                     17.6%   15.0%   12.5%
  -misses per thousand instructions    3       5       9
L3 cache miss rate                     55.1%   33.6%   45.4%
  -misses per thousand instructions    1       3       4
branch misprediction rate              5.0%    7.4%    9.1%
jump misprediction rate                2.2%    6.4%    12.9%
integer IQ-full (% of cycles)          7%      10%     9%
fp IQ-full (% of cycles)               14%     9%      3%
avg (combined) queue population        25      25      27
wrong-path instructions fetched        24%     7%      7%
wrong-path instructions issued         9%      4%      3%
Table 3: The result of increased multithreading on some low-level
metrics for the base architecture.
In this section, we attempt to reduce fetch block fragmentation
(our term for the various factors that prevent us from fetching the
maximum number of instructions) by fetching from multiple threads
each cycle, while keeping the maximum fetch bandwidth (but not
necessarily the I cache bandwidth) constant. We evaluate several
fetching schemes, which are labeled alg.num1.num2, where alg is
the fetch selection method (in this section threads are always selected
using a round-robin priority scheme), num1 is the number of threads
that can fetch in 1 cycle, and num2 is the maximum number of
instructions fetched per thread in 1 cycle. The maximum number
of total instructions fetched is always limited to eight. For each
of the fetch partitioning policies, the cache is always 32 kilobytes
organized into 8 data banks; a given bank can do just one access per
cycle.
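The alg.num1.num2 labels can be decoded mechanically, as in the sketch below. The type `FetchScheme`, its field names, and the helper `parse_scheme` are hypothetical; only the meaning of the three fields and the eight-instruction cap come from the text.

```python
from typing import NamedTuple

MAX_FETCH = 8  # total instructions fetched per cycle is always limited to eight


class FetchScheme(NamedTuple):
    alg: str                # fetch selection method, e.g. "RR" for round-robin
    threads_per_cycle: int  # num1: threads that can fetch in one cycle
    insts_per_thread: int   # num2: maximum instructions per thread per cycle

    @property
    def max_total(self):
        """Upper bound on instructions fetched per cycle under this scheme."""
        return min(MAX_FETCH, self.threads_per_cycle * self.insts_per_thread)


def parse_scheme(name):
    """Split a label such as 'RR.2.4' into its three components."""
    alg, num1, num2 = name.split(".")
    return FetchScheme(alg, int(num1), int(num2))
```

For instance, `parse_scheme("RR.2.8")` describes two threads that may each supply up to eight instructions, but `max_total` still reports eight because of the cap.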
RR.1.8: This is the baseline scheme from Section 4. Each
cycle one thread fetches as many as eight instructions. The thread
is determined by a round-robin priority scheme from among those
not currently suffering an I cache miss. In this scheme the I cache is
indistinguishable from that on a single-threaded superscalar. Each
cache bank has its own address decoder and output drivers; each
cycle, only one of the banks drives the cache output bus, which is 8
instructions (32 bytes) wide.
RR.2.4, RR.4.2: These schemes fetch fewer instructions per
thread from more threads (four each from two threads, or two each
from four threads). If we try to partition the fetch bandwidth too
finely, however, we may suffer thread shortage, where fewer threads
are available than are required to fill the fetch bandwidth.
For these schemes, multiple cache addresses are driven to each
cache data bank, each of which now has a multiplexer before its
address decoder, to select one cache index per cycle. Since the cache
banks are single-ported, bank-conflict logic is needed to ensure that
each address targets a separate bank. RR.2.4 has two cache output
buses, each four instructions wide, while RR.4.2 has four output
buses, each two instructions wide. For both schemes, the total width
of the output buses is 8 instructions (identical to that in RR.1.8), but
additional circuitry is needed so each bank is capable of driving any
of the multiple (now smaller) output buses, and is able to select one
or none to drive in a given cycle. Also, the cache tag store logic
must be replicated or multiple-ported in order to calculate hit/miss
for each address looked up per cycle.
[Figure 4 plot. Y-axis: Throughput (IPC), ticks 0 to 5. X-axis: Number of Threads, ticks 1, 2, 4, 6, 8. Curves: RR.2.8, RR.4.2, RR.2.4, RR.1.8.]
Figure 4: Instruction throughput for the different instruction
cache interfaces with round-robin instruction scheduling.
Thus, the hardware additions are: the address mux; multiple ad-
dress buses; selection logic on the output drivers; the bank conﬂict
logic; and multiple hit/miss calculations. The changes required for
RR.2.4 would have a negligible impact on area and cache access
time. The changes for RR.4.2 are more extensive, and would be
more difﬁcult to do without affecting area or access time. These
schemes actually reduce the latency in the decode and rename stages,
as the maximum length of dependency chains among fetched instructions
is reduced by a factor of 2 and 4, respectively.
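The bank-conflict check needed by the multi-thread schemes can be sketched as below. The bank count and line size come from Table 2. The interleaving of consecutive lines across banks, the policy of skipping a conflicting thread in favor of the next one in priority order, and the function names are assumptions made for illustration.

```python
LINE_SIZE = 64  # bytes per cache line (Table 2)
NUM_BANKS = 8   # I cache banks (Table 2)


def bank_of(addr):
    """Bank holding the line for `addr`.
    Assumption: consecutive lines are interleaved across the banks."""
    return (addr // LINE_SIZE) % NUM_BANKS


def pick_conflict_free(fetch_addrs, max_threads):
    """Choose up to `max_threads` threads whose fetch addresses hit distinct banks.

    fetch_addrs: list of (thread id, fetch address) pairs in priority order.
    A thread whose bank is already claimed this cycle is skipped.
    """
    chosen, used_banks = [], set()
    for thread, addr in fetch_addrs:
        bank = bank_of(addr)
        if bank in used_banks:
            continue  # single-ported bank: only one access per cycle
        chosen.append(thread)
        used_banks.add(bank)
        if len(chosen) == max_threads:
            break
    return chosen
```

With candidates at addresses 0x000, 0x200 and 0x040, the first two map to bank 0 under this interleaving, so a two-thread scheme would fetch from the first and third.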
RR.2.8: This scheme attacks fetch block fragmentation without
suffering from thread shortage by fetching eight instructions more
flexibly from two threads. This can be implemented by reading an
eight-instruction block for each thread (16 instructions total), then
combining them. We take as many instructions as possible from the
first thread, then fill in with instructions from the second, up to eight
total. Like RR.2.4, two addresses must be routed to each cache
bank, then multiplexed before the decoder; bank-conflict logic and
two hit/miss calculations per cycle are necessary; and each bank
drives one of the two output buses. Now, however, each output
bus is eight instructions wide, which doubles the bandwidth out of
the cache compared to any of the previous schemes. This could be
done without greatly affecting area or cycle time, as the additional
bussing could probably be done without expanding the cache layout.
In addition, logic to select and combine the instructions is necessary,
which might or might not require an additional pipe stage. Our
simulations assume it does not.
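The combining step of RR.2.8 can be sketched as a simple merge. It assumes each thread's block has already been cut at its first taken branch or cache line boundary, so the lists hold only the instructions that thread can usefully supply this cycle; the function name is hypothetical.

```python
def combine_fetch_blocks(first, second, width=8):
    """Merge two per-thread fetch blocks into one fetch group.

    Takes as many instructions as possible from the first (higher priority)
    thread, then fills the remaining slots from the second, up to `width`.
    """
    from_first = first[:width]
    from_second = second[: width - len(from_first)]
    return from_first + from_second
```

If the first thread supplies only three instructions before a taken branch, the second thread fills the remaining five slots, so the full fetch width is still used.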
Figure 4 shows that we can get higher maximum throughput by
splitting the fetch over multiple threads. For example, the RR.2.4
scheme outperforms RR.1.8 at 8 threads by 9%. However, better
maximum throughput comes at the cost of a 12% single-thread
penalty; in fact, RR.2.4 does not surpass RR.1.8 until 4 threads.
The RR.4.2 scheme needs 6 threads to surpass RR.1.8 and never
catches the 2-thread schemes, suffering from thread shortage.
The RR.2.8 scheme provides the best of both worlds: few-
threads performance like RR.1.8 and many-threads performance
like RR.2.4. However, the higher throughput of this scheme puts
more pressure on the instruction queues, causing IQ-full conditions
at a rate of 18% (integer) and 8% (fp) with 8 threads.
With the RR.2.8 scheme we have improved the maximum
throughput by 10% without compromising single-thread perfor-
mance. This was achieved by a combination of (1) partitioning
the fetch bandwidth over multiple threads, and (2) making that partition
flexible. This is the same approach (although in a more limited
fashion here) that simultaneous multithreading uses to improve the
throughput of the functional units [27].
5.2 Exploiting Thread Choice in the Fetch Unit
The efﬁciency of the entire processor is affected by the quality of
instructions fetched. A multithreaded processor has a unique ability
to control that factor. In this section, we examine fetching policies
aimed at identifying the “best” thread or threads available to fetch
each cycle. Two factors make one thread less desirable than another.
The first is the probability that a thread is following a wrong path as
a result of an earlier branch misprediction. Wrong-path instructions
consume not only fetch bandwidth, but also registers, IQ space,
and possibly issue bandwidth. The second factor is the length of
time the fetched instructions will be in the queue before becoming
issuable. We maximize the throughput of a queue of bounded size
by feeding it instructions that will spend the least time in the queue.
If we fetch too many instructions that block for a long time, we
eventually fill the IQ with unissuable instructions, a condition we
call IQ clog. This restricts both fetch and issue throughput, causing
the fetch unit to go idle and preventing issuable instructions from
getting into the IQ. Both of these factors (wrong-path probability
and expected queue time) improve over time, so a thread becomes
more desirable as we delay fetching it.
We define several fetch policies, each of which attempts to improve
on the round-robin priority policy using feedback from other
parts of the processor. The ﬁrst attacks wrong-path fetching, the
others attack IQ clog. They are:
BRCOUNT: Here we attempt to give highest priority to those
threads that are least likely to be on a wrong path. We do this
by counting branch instructions that are in the decode stage, the
rename stage, and the instruction queues, favoring those with the
fewest unresolved branches.
MISSCOUNT: This policy detects an important cause of IQ
clog. A long memory latency can cause dependent instructions to
back up in the IQ waiting for the load to complete, eventually filling
the queue with blocked instructions from one thread. This policy
prevents that by giving priority to those threads that have the fewest
outstanding D cache misses.
ICOUNT — This is a more general solution to IQ clog. Here
priority is given to threads with the fewest instructions in decode,
rename, and the instruction queues. This achieves three purposes:
(1) it prevents any one thread from filling the IQ, (2) it gives highest
priority to threads that are moving instructions through the IQ most
efficiently, and (3) it provides a more even mix of instructions from
the available threads, maximizing the parallelism in the queues. If
cache misses are the dominant cause of IQ clog, MISSCOUNT
may perform better, since it gets cache miss feedback to the fetch
unit more quickly. If the causes are more varied, ICOUNT should
perform better.
[Figure 5 plot omitted: two panels, one for the 1.8 and one for the 2.8
fetch partitioning scheme, each plotting throughput (IPC, 0 to 5)
against number of threads (2, 4, 6, 8) for IQPOSN, ICOUNT,
MISSCOUNT, BRCOUNT, and RR.]
Figure 5: Instruction throughput for fetching based on several priority
heuristics, all compared to the baseline round-robin scheme.
The results for 1 thread are the same for all schemes, and thus not shown.
IQPOSN — Like ICOUNT, IQPOSN strives to minimize IQ clog
and bias toward efficient threads. It gives lowest priority to those
threads with instructions closest to the head of either the integer
or floating point instruction queues (the oldest instruction is at the
head of the queue). Threads with the oldest instructions will be
most prone to IQ clog, and those making the best progress will
have instructions farthest from the head of the queue. This policy
does not require a counter for each thread, as do the previous three
policies.
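The IQPOSN ordering can be sketched in a few lines. This is only an illustration under stated assumptions: the queues are modeled as Python lists of thread ids with index 0 as the head, the function name `iqposn_rank` is hypothetical, and breaking ties by thread id is an arbitrary choice made to keep the result deterministic.

```python
from typing import List


def iqposn_rank(int_queue: List[int], fp_queue: List[int],
                num_threads: int) -> List[int]:
    """Order threads from highest to lowest fetch priority.

    Each queue is a list of thread ids; index 0 is the head (oldest
    instruction). A thread's position is the index of its instruction
    closest to the head of either queue. Threads with nothing queued
    are treated as farthest from the head (an assumption of this sketch).
    """
    far = max(len(int_queue), len(fp_queue))
    closest = {t: far for t in range(num_threads)}
    for queue in (int_queue, fp_queue):
        for pos, tid in enumerate(queue):
            if pos < closest[tid]:
                closest[tid] = pos
    # Farthest-from-head first; ties fall back to thread id.
    return sorted(closest, key=lambda t: (-closest[t], t))


# Hypothetical queue contents for four threads.
order = iqposn_rank(int_queue=[0, 0, 1, 2], fp_queue=[1, 2], num_threads=4)
print(order)
```

Thread 3 has nothing queued and ranks first; threads 0 and 1 each sit at the head of a queue and rank last.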
As in any control system, the efficiency of these mechanisms is
limited by the feedback latency resulting, in this case, from feeding
data from later pipeline stages back to the fetch stage. For example,
by the time instructions enter the queue stage or the exec stage, the
information used to fetch them is three or (at least) six cycles old,
respectively.
Both the branch-counting and the miss-counting policies tend to
produce frequent ties. In those cases, the tie-breaker is round-robin
priority.
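The three counter-based policies share one selection step: pick the threads with the smallest counter and fall back to round-robin order on ties. A minimal sketch follows, assuming the counters are already available to the fetch stage as a dictionary; the function name and the `rr_start` parameter are hypothetical, and the feedback latency discussed above is not modeled.

```python
from typing import Dict, List


def select_fetch_threads(counts: Dict[int, int], rr_start: int,
                         num_threads: int, num_to_fetch: int) -> List[int]:
    """Pick threads with the smallest counter; break ties round-robin.

    counts[t] is a per-thread counter: unresolved branches (BRCOUNT),
    outstanding D cache misses (MISSCOUNT), or instructions in decode,
    rename, and the instruction queues (ICOUNT).
    rr_start is the thread plain round-robin would favor this cycle.
    """
    def rr_distance(t: int) -> int:
        # Position of thread t in this cycle's round-robin order.
        return (t - rr_start) % num_threads

    ranked = sorted(counts, key=lambda t: (counts[t], rr_distance(t)))
    return ranked[:num_to_fetch]


# Hypothetical ICOUNT snapshot for four threads, fetching from two (2.8).
icount = {0: 12, 1: 5, 2: 5, 3: 20}
chosen = select_fetch_threads(icount, rr_start=2, num_threads=4,
                              num_to_fetch=2)
print(chosen)
```

Threads 1 and 2 tie at five instructions; with round-robin currently favoring thread 2, it is listed first.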
Figure 5 shows that all of the fetch heuristics provide speedup
over round-robin. Branch counting and cache-miss counting provide
moderate speedups, but only when the processor is saturated
with many threads. Instruction counting, in contrast, produces more
significant improvements regardless of the number of threads.
IQPOSN provides similar results to ICOUNT, being within 4% at all
times, but never exceeding it.
The branch-counting heuristic does everything we ask of it. It
reduces wrong-path instructions, from 8.2% of fetched instructions
to 3.6%, and from 3.6% of issued instructions to 0.8% (RR.1.8 vs.
BRCOUNT.1.8 with eight threads). And it improves throughput
by as much as 8%. Its weakness is that the wrong-path problem
it solves is not large on this processor, which has already attacked
the problem with simultaneous multithreading. Even with the RR
scheme, simultaneous multithreading reduces fetched wrong-path
instructions from 16% with one thread to 8% with 8 threads.
Cache miss counting also achieves throughput gains as high as
8% over RR, but in general the gains are much lower. It is not
particularly effective at reducing IQ clog, as we get IQ-full conditions
12% of the time on the integer queue and 14% on the floating
point queue (for MISSCOUNT.2.8 with 8 threads). These results
indicate that IQ clog is more than simply the result of long memory
latencies.
                                  1 Thread      8 Threads
Metric                                         RR     ICOUNT
integer IQ-full (% of cycles)        7%        18%      6%
fp IQ-full (% of cycles)            14%         8%      1%
avg queue population                25         38      30
out-of-registers (% of cycles)       3%         8%      5%
Table 4: Some low-level metrics for the round-robin and
instruction-counting priority policies (and the 2.8 fetch
partitioning scheme).
The instruction-counting heuristic provides instruction throughput
as high as 5.3 instructions per cycle, 2.5 times the throughput of the
unmodified superscalar. It outperforms the best round-robin
result by 23%. Instruction counting is as effective at 2 and 4 threads
(in benefit over round-robin) as it is at 8 threads. It nearly eliminates
IQ clog (see IQ-full results in Table 4) and greatly improves
the mix of instructions in the queues (we are finding more issuable
instructions despite having fewer instructions in the two queues).
Intelligent fetching with this heuristic is of greater benefit than
partitioning the fetch unit, as the ICOUNT.1.8 scheme consistently
outperforms RR.2.8.
[Figure 6 plot omitted: two panels, one for the 1.8 and one for the 2.8
fetch partitioning scheme, each plotting throughput (IPC, 0 to 5)
against number of threads (1, 2, 4, 6, 8) for ITAG,ICOUNT,
BIGQ,ICOUNT, and ICOUNT.]
Figure 6: Instruction throughput for the 64-entry queue and early
I cache tag lookup, when coupled with the ICOUNT fetch policy.
Table 4 points to a surprising result. As a result of simultaneous
multithreaded instruction issue and the ICOUNT fetch heuristics, we
actually put less pressure on the same instruction queue with eight
threads than with one, having sharply reduced IQ-full conditions.
ICOUNT also reduces pressure on the register file (vs. RR) by keeping
fewer instructions in the queue.
BRCOUNT and ICOUNT each solve different problems, and
perhaps the best performance could be achieved from a weighted
combination of them; however, the complexity of the feedback
mechanism increases as a result. By itself, instruction counting
clearly provides the best gains.
Given our measurement methodology, it is possible that the
throughput increases could be overstated if a fetch policy simply
favors those threads with the most inherent instruction-level
parallelism or the best cache behavior, thus achieving improvements that
would not be seen in practice. However, with the ICOUNT.2.8 policy,
the opposite happens. Our results show that this scheme favors
threads with lower single-thread ILP, so its results include a higher
sample of instructions from the slow threads than either the
superscalar results or the RR results. If anything, then, the ICOUNT.2.8
improvements are understated.
In summary, we have identified a simple heuristic that is very
successful at identifying the best threads to fetch. Instruction counting
dynamically biases toward threads that will use processor resources
most efficiently, thereby improving processor throughput as well
as relieving pressure on scarce processor resources: the instruction
queues and the registers.
5.3 Unblocking the Fetch Unit
By fetching from multiple threads and using intelligent fetch heuristics,
we have significantly increased fetch throughput and efficiency.
The more efficiently we are using the fetch unit, the more we stand to
lose when it becomes blocked. In this section we examine schemes
that prevent two conditions that cause the fetch unit to miss fetch
opportunities, specifically IQ-full conditions and I cache misses.
The two schemes are:
BIGQ — The primary restriction on IQ size is not the chip area,
but the time to search it; therefore we can increase its size as long as
we don't increase the search space. In this scheme, we double the
sizes of the instruction queues, but only search the first 32 entries
for issue. This scheme allows the queues to buffer instructions from
the fetch unit when the IQ overflows.
ITAG — When a thread is selected for fetching but experiences a
cache miss, we lose the opportunity to fetch that cycle. If we do the
I cache tag lookups a cycle early, we can fetch around cache misses:
cache miss accesses are still started immediately, but only
non-missing threads are chosen for fetch. Because we need to have the
fetch address a cycle early, we essentially add a stage to the front of
the pipeline, increasing the misfetch and mispredict penalties. This
scheme requires one or more additional ports on the I cache tags,
so that potential replacement threads can be looked up at the same
time.
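The two schemes can be sketched as follows. This is an illustration only: the data layout (a queue as a sequence of dictionaries with a `ready` flag, a set of thread ids whose early tag lookup missed) and both function names are assumptions of this sketch, not structures described in the text. The 32-entry search window and 64-entry queue come from the BIGQ description above.

```python
from typing import List, Sequence, Set

SEARCH_WINDOW = 32   # entries examined for issue
QUEUE_SIZE = 64      # doubled queue size under BIGQ


def bigq_issue_candidates(queue: Sequence[dict]) -> List[dict]:
    """Return ready instructions among the first SEARCH_WINDOW entries.

    Entries beyond the window only buffer fetched instructions; they
    are not searched until older entries leave and they move forward.
    """
    return [ins for ins in queue[:SEARCH_WINDOW] if ins["ready"]]


def itag_select(priority_order: Sequence[int], will_miss: Set[int],
                num_to_fetch: int) -> List[int]:
    """Choose threads for fetch, skipping those whose early tag lookup
    reported an I cache miss.

    priority_order lists thread ids from most to least preferred
    (for example, as produced by a fetch heuristic such as ICOUNT).
    """
    hits = [t for t in priority_order if t not in will_miss]
    return hits[:num_to_fetch]


# Hypothetical 40-entry queue in which every other instruction is ready.
queue = [{"id": i, "ready": i % 2 == 0} for i in range(40)]
print(len(bigq_issue_candidates(queue)))

# Thread 3 is preferred but would miss, so threads 1 and 0 are fetched.
print(itag_select([3, 1, 0, 2], will_miss={3}, num_to_fetch=2))
```

The second call shows the "fetch around" behavior: the miss for thread 3 would still be started, but its fetch slot goes to the next thread in priority order.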
Although the BIGQ scheme improves the performance of the
round-robin scheme (not shown) by 1.5-2% across the board, Figure 6
shows that the bigger queues add no significant improvement to the
ICOUNT policy. In fact, they are actually detrimental for several thread
configurations. This is because the buffering effect of the big queue
scheme brings instructions into the issuable part of the instruction
queue that may have been fetched many cycles earlier, using priority
information that is now out-of-date. The results indicate that using
up-to-date priority information is more important than buffering.
These results show that intelligent fetch heuristics have made the
extra instruction queue hardware unnecessary. The bigger queue by
itself is actually less effective at reducing IQ clog than the ICOUNT
scheme. With 8 threads, the bigger queues alone (BIGQ,RR.2.8)
reduce IQ-full conditions to 11% (integer) and 0% (fp), while
instruction counting alone (ICOUNT.2.8) reduces them to 6% and
1%. Combining BIGQ and ICOUNT drops them to 3% and 0%.
Early I cache tag lookup boosts throughput as much as 8%
over ICOUNT. It is most effective when fetching one thread
(ICOUNT.1.8, where the cost of a lost fetch slot is greater). However,
it improves the ICOUNT.2.8 results no more than 2%, as the
flexibility of the 2.8 scheme already hides some of the lost fetch
bandwidth. In addition, ITAG lowers throughput with few threads,
where competition for the fetch slots is low and the cost of the longer
misprediction penalty is highest.
Issue                Number of Threads           Useless Instructions
Method            1     2     4     6     8     wrong-path  optimistic
OLDEST          2.10  3.30  4.62  5.09  5.29       4%          3%
OPT LAST        2.07  3.30  4.59  5.09  5.29       4%          2%
SPEC LAST       2.10  3.31  4.59  5.09  5.29       4%          3%
BRANCH FIRST    2.07  3.29  4.58  5.08  5.28       4%          6%
Table 5: Instruction throughput (instructions per cycle) for the issue
priority schemes, and the percentage of useless instructions
issued when running with 8 threads.
Using a combination of partitioning the fetch unit, intelligent
fetching, and early I cache tag lookups, we have raised the peak
performance of the base SMT architecture by 37% (5.4 instructions
per cycle vs. 3.9). Our maximum speedup relative to a conventional
superscalar has gone up proportionately, from 1.8 to 2.5 times the
throughput. That gain comes from exploiting characteristics of a
simultaneous multithreading processor not available to a
single-threaded machine.
High fetch throughput makes issue bandwidth a more critical
resource. We focus on this factor in the next section.
6 Choosing Instructions For Issue
Much as the fetch unit in a simultaneous multithreading processor
can take advantage of the ability to choose which threads to fetch,
the issue logic has the ability to choose instructions for issue. A
dynamically scheduled single-threaded processor may have enough
ready instructions to be able to choose between them, but with an
SMT processor the options are more diverse. Also, because we have
higher throughput than a single-threaded superscalar processor, the
issue bandwidth is potentially a more critical resource, so avoiding
issue slot waste may be more beneficial.
In this section, we examine issue priority policies aimed at
preventing issue waste. Issue slot waste comes from two sources,
wrong-path instructions (resulting from mispredicted branches) and
optimistically issued instructions. Recall (from Section 2) that we
optimistically issue load-dependent instructions a cycle before we
have D cache hit information. In the case of a cache miss or bank
conflict, we have to squash the optimistically issued instruction,
wasting that issue slot.
In a single-threaded processor, choosing instructions least likely
to be on a wrong path is always achieved by selecting the oldest
instructions (those deepest into the instruction queue). In a simul-
taneous multithreading processor, the position of an instruction in
the queue is no longer the best indicator of the level of speculation
of that instruction, as right-path instructions are intermingled in the
queues with wrong-path instructions.
The policies we examine are OLDEST FIRST, our default issue
algorithm up to this point; OPT LAST and SPEC LAST, which
issue optimistic and speculative instructions, respectively, only
after all others have been issued (a speculative instruction being,
more specifically, any instruction behind a branch from the same
thread in the instruction queue); and BRANCH FIRST, which issues
branches as early as possible in order to identify mispredicted
branches quickly. The default fetch algorithm for each of these
schemes is ICOUNT.2.8.
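The four issue policies can be viewed as different sort keys over the set of ready instructions. The sketch below is an illustration under stated assumptions: the `Instr` record and its fields are hypothetical, and using oldest-first as the secondary ordering within each class is an assumption of this sketch, since the text does not say how ties inside a class are resolved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Instr:
    age: int            # larger means older (deeper in the queue)
    optimistic: bool    # load-dependent, issued before hit/miss is known
    speculative: bool   # behind a branch from the same thread in the IQ
    is_branch: bool


# Each key sorts instructions so that preferred ones come first.
POLICIES: Dict[str, Callable[[Instr], tuple]] = {
    "OLDEST FIRST": lambda i: (-i.age,),
    "OPT LAST": lambda i: (i.optimistic, -i.age),
    "SPEC LAST": lambda i: (i.speculative, -i.age),
    "BRANCH FIRST": lambda i: (not i.is_branch, -i.age),
}


def choose_for_issue(ready: List[Instr], policy: str,
                     width: int) -> List[Instr]:
    """Return up to `width` ready instructions in policy order."""
    return sorted(ready, key=POLICIES[policy])[:width]


# Hypothetical ready set: an old optimistic instruction, a speculative
# one, and a young branch; two issue slots are available.
ready = [
    Instr(age=5, optimistic=True, speculative=False, is_branch=False),
    Instr(age=3, optimistic=False, speculative=True, is_branch=False),
    Instr(age=1, optimistic=False, speculative=False, is_branch=True),
]
for name in POLICIES:
    print(name, [i.age for i in choose_for_issue(ready, name, width=2)])
```

A single sort is used here for brevity; in hardware the alternate schemes would instead add passes to the IQ search.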
The strong message of Table 5 is that issue bandwidth is not yet
a bottleneck. Even when it does become a critical resource, the
amount of improvement we get from not wasting it is likely to be
bounded by the percentage of our issue bandwidth given to useless
instructions, which currently stands at 7% (4% wrong-path
instructions, 3% squashed optimistic instructions). Because we don't often
have more issuable instructions than functional units, we aren't able
to and don't need to reduce that significantly. The SPEC LAST
scheme is unable to reduce the number of useless instructions at
all, while OPT LAST brings it down to 6%. BRANCH FIRST
actually increases it to 10%, as branch instructions are often
load-dependent; therefore, issuing them as early as possible often means
issuing them optimistically. A combined scheme (OPT LAST and
BRANCH FIRST) might reduce that side effect, but is unlikely to
have much effect on throughput.
Since each of the alternate schemes potentially introduces mul-
tiple passes to the IQ search, it is convenient that the simplest
mechanism still works well.
7 Where Are the Bottlenecks Now?
We have shown that proposed changes to the instruction queues
and the issue logic are unnecessary to achieve the best performance
with this architecture, but that significant gains can be produced by
moderate changes to the instruction fetch mechanisms. Here we
examine that architecture more closely (using ICOUNT.2.8 as our new
baseline), identifying likely directions for further improvements.
In this section we present results of experiments intended to
identify bottlenecks in the new design. For components that are potential
bottlenecks, we quantify the size of the bottleneck by measuring the
impact of relieving it. For some of the components that are not
bottlenecks, we examine whether it is possible to simplify those
components without creating a bottleneck. Because we are
identifying bottlenecks rather than proposing architectures, we are no
longer bound by implementation practicalities in these experiments.
The Issue Bandwidth — The experiments in Section 6 indicate
that issue bandwidth is not a bottleneck. In fact, we found that even
an infinite number of functional units increases throughput by only
0.5% at 8 threads.
Instruction Queue Size — Results in Section 5 would, similarly,
seem to imply that the size of the instruction queues was not a
bottleneck, particularly with instruction counting; however, the schemes
we examined are not the same as larger, searchable queues, which
would also increase available parallelism. Nonetheless, the
experiment with larger (64-entry) queues increased throughput by less
than 1%, despite reducing IQ-full conditions to 0%.
Fetch Bandwidth — Although we have signiﬁcantly improved
fetch throughput, it is still a prime candidate for bottleneck status.
Branch frequency and PC alignment problems still prevent us from
fully utilizing the fetch bandwidth. A scheme that allows us to fetch
as many as 16 instructions (up to eight each from two threads)
increases throughput 8% to 5.7 instructions per cycle. At that point,
however, the IQ size and the number of physical registers each
become more of a restriction. Increasing the instruction queues to
64 entries and the excess registers to 140 increases performance
another 7% to 6.1 IPC. These results indicate that we have not yet
completely removed fetch throughput as a performance bottleneck.
Branch Prediction — Simultaneous multithreading has a dual
effect on branch prediction, much as it has on caches. While it
puts more pressure on the branch prediction hardware (see Table 3),
it is more tolerant of branch mispredictions. This tolerance arises
because SMT is less dependent on techniques that expose single-
thread parallelism (e.g., speculative fetching and speculative ex-
ecution based on branch prediction) due to its ability to exploit
inter-thread parallelism. With one thread running, on average 16%
of the instructions we fetch and 10% of the instructions we exe-
cute are down a wrong path. With eight threads running and the
ICOUNT fetch scheme, only 9% of the instructions we fetch and
4% of the instructions we execute are wrong-path.
Perfect branch prediction boosts throughput by 25% at 1 thread,
15% at 4 threads, and 9% at 8 threads. So despite the signiﬁcantly
decreased efﬁciency of the branch prediction hardware, simulta-
neous multithreading is much less sensitive to the quality of the
branch prediction than a single-threaded processor. Still, better
branch prediction is beneﬁcial for both architectures. Signiﬁcant
improvements come at a cost, however; a better scheme than our
baseline (doubling the size of both the BTB and PHT) yields only a
2% gain at 8 threads.
Speculative Execution — The ability to do speculative execution
on this machine is not a bottleneck, but we would like to know
whether eliminating it would create one. The cost of speculative
execution (in performance) is not particularly high (again, 4% of
issued instructions are wrong-path), but the beneﬁts may not be
either.
Speculative execution can mean two different things in an SMT
processor: (1) the ability to issue wrong-path instructions that can
interfere with others, and (2) the ability to allow instructions to issue
before preceding branches from the same thread. In order to
guarantee that no wrong-path instructions are issued, we need to delay
instructions 4 cycles after the preceding branch is issued. Doing this
reduces throughput by 7% at 8 threads, and 38% at 1 thread. Simply
preventing instructions from passing branches lowers throughput
by only 1.5% (vs. 12% for 1 thread). Simultaneous multithreading
(with many threads) benefits much less from speculative execution
than a single-threaded processor; it benefits more from the ability
to issue wrong-path instructions than from allowing instructions to
pass branches.
Memory Throughput — While simultaneous multithreading
hides memory latencies effectively, it is less effective if the problem
is memory throughput, since it does not address that problem. For
that reason, our simulator models memory throughput limitations at
multiple levels of the cache hierarchy, and the buses between them.
With our workload, we never saturate any single cache or bus, but in
some cases there are significant queueing delays for certain levels
of the cache. If we had infinite bandwidth caches (i.e., the same
cache latencies, but no cache bank or bus conflicts), the throughput
would only increase by 3%.
Register File Size — The number of registers required by this
machine is a very signiﬁcant issue. While we have modeled the
effects of register renaming, we have not set the number of physical
registers low enough that it is a significant bottleneck. In fact,
setting the number of excess registers to infinite instead of 100 only
improves 8-thread performance by 2%. Lowering it to 90 reduces
performance by 1%, and to 80 by 3%, and 70 by 6%, so there is
no sharp drop-off point. The ICOUNT fetch scheme is probably a
factor in this, as we've shown that it creates more parallelism with
fewer instructions in the machine. With four threads and fewer
excess registers, the reductions were nearly identical.
[Figure 7 plot omitted: throughput (IPC, 1 to 5) against number of
threads (1 to 5).]
Figure 7: Instruction throughput for machines with 200 physical
registers and from 1 to 5 hardware contexts.
However, this does not completely address the total size of the
register file, particularly when comparing different numbers of threads.
An alternate approach is to examine the maximum performance
achieved with a given set of physical registers. For example, if
we identify the largest register file that could support the scheme
outlined in Section 2, then we can investigate how many threads
to support for the best performance. The tradeoff arises because
supporting more hardware contexts leaves fewer (excess) registers
available for renaming. The number of renaming registers, however,
determines the total number of instructions the processor can have
in-flight. It is difficult to predict the right register file size that far
into the future, but in Figure 7 we illustrate this type of analysis
by finding the performance achieved with 200 physical registers.
That equates to a 1-thread machine with 168 excess registers or a
4-thread machine with 72 excess registers, for example. In this case
there is a clear maximum point at 4 threads.
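The register arithmetic behind this tradeoff is easy to tabulate. The sketch below assumes 32 architectural registers per hardware context, a value inferred from the two data points in the text (200 - 168 = 32 for one thread, and 200 - 72 = 128 = 4 x 32 for four); the constant and function names are hypothetical.

```python
ARCH_REGS_PER_THREAD = 32  # inferred from 200 - 168 for a single thread


def excess_registers(physical: int, threads: int) -> int:
    """Registers left for renaming after each context's architectural set."""
    return physical - ARCH_REGS_PER_THREAD * threads


# Tabulate the 200-register case for 1 to 5 hardware contexts.
for t in range(1, 6):
    print(f"{t} thread(s): {excess_registers(200, t)} excess registers")
```

Each added context removes another 32 registers from the renaming pool, which is why throughput can peak at an intermediate thread count.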
In summary, fetch throughput is still a bottleneck in our proposed
architecture. It may no longer be appropriate to keep fetch and issue
bandwidth in balance, given the much greater difficulty of filling
the fetch bandwidth. Also, register file access time will likely be a
limiting factor in the number of threads an architecture can support.
8 Related Work
A number of other architectures have been proposed that exhibit
simultaneous multithreading in some form. Tullsen, et al., [27]
demonstrated the potential for simultaneous multithreading, but did
not simulate a complete architecture, nor did that paper present
a specific solution to register file access or instruction scheduling.
This paper presents an architecture that realizes much of the potential
demonstrated by that work, simulating it in detail.
Hirata, et al., [13] present an architecture for a multithreaded
superscalar processor and simulate its performance on a parallel
ray-tracing application. They do not simulate caches or TLBs and
their architecture has no branch prediction mechanism. Yamamoto,
et al., [29] present an analytical model of multithreaded superscalar
performance, backed up by simulation. Their study models perfect
branching, perfect caches and a homogeneous workload (all threads
running the same trace). Yamamoto and Nemirovsky [28] simulate
an SMT architecture with separate instruction queues and up to four
threads. Gulati and Bagherzadeh [11] model a 4-issue machine with
four hardware contexts and a single compiler-partitioned register
file.
Keckler and Dally [14] and Prasadh and Wu [19] describe
architectures that dynamically interleave operations from VLIW
instructions onto individual functional units.
Daddis and Torng [6] plot increases in instruction throughput as
a function of the fetch bandwidth and the size of the dispatch stack,
a structure similar to our instruction queue. Their system has two
threads, unlimited functional units, and unlimited issue bandwidth.
In addition to these, Beckmann and Polychronopoulos [3],
Gunther [12], Li and Chu [16], and Govindarajan, et al., [10] all
discuss architectures that feature simultaneous multithreading, none of
which can issue more than one instruction per cycle per thread.
Our work is distinguished from most of these studies in our
dual goals of maintaining high single-thread performance and min-
imizing the architectural impact on a conventional processor. For
example, two implications of those goals in our architecture are
limited fetch bandwidth and a centralized instruction scheduling
mechanism based on a conventional instruction queue.
Most of these studies either model infinite fetch bandwidth (with
perfect caches) or high-bandwidth instruction fetch, each context
fetching from a private cache. However, Hirata, et al., and Daddis
and Torng both model limited fetch bandwidth (with zero-latency
memory), using round-robin priority, our baseline mechanism;
neither models the instruction cache, however. Gulati and Bagherzadeh
fetch from a single thread each cycle, and even look at thread
selection policies, but find no policy with improvement better than
intelligent round robin.
Also, only a few of these studies use any kind of centralized
scheduling mechanism: Yamamoto, et al., model a global instruc-
tion queue that only holds ready instructions; Govindarajan, et
al., and Beckmann and Polychronopoulos have central queues, but
threads are very restricted in the number of instructions they can
have active at once; Daddis and Torng model an instruction queue
similar to ours, but they do not couple that with a realistic model of
functional units, instruction latencies, or memory latencies. Gulati
and Bagherzadeh model an instruction window composed of four-
instruction blocks, each block holding instructions from a single
thread.
The M-Machine [9] and the Multiscalar project [25] combine
multiple-issue with multithreading, but assign work onto processors
at a coarser level than individual instructions. Tera [2] combines
LIW with ﬁne-grain multithreading.
9 Summary
This paper presents a simultaneous multithreading architecture that:
• borrows heavily from conventional superscalar design, requiring
little additional hardware support,
• minimizes the impact on single-thread performance, running
only 2% slower in that scenario, and
• achieves significant throughput improvements over the
superscalar when many threads are running: a 2.5 throughput gain
at 8 threads, achieving 5.4 IPC.
The fetch improvements result from two advantages of simultaneous
multithreading unavailable to conventional processors: the ability to
partition the fetch bandwidth over multiple threads, and the ability
to dynamically select for fetch those threads that are using processor
resources most efficiently.
Simultaneous multithreading achieves multiprocessor-type
speedups without multiprocessor-type hardware explosion. This
architecture achieves significant throughput gains over a superscalar
using the same cache sizes, fetch bandwidth, branch prediction
hardware, functional units, instruction queues, and TLBs. The SMT
processor is actually less sensitive to instruction queue and branch
prediction table sizes than the single-thread superscalar, even with
a multiprogrammed workload.
Acknowledgments
We would like to thank Tryggve Fossum for his support of this work
and for numerous suggestions. And we would like to thank Bert
Halstead and Rishiyur Nikhil for several valuable discussions, and
the referees for their helpful comments. We would also like to thank
John O’Donnell from Equator Technologies, Inc. for access to the
source for the Multiflow compiler.
References
[1] A. Agarwal, B.H. Lim, D. Kranz, and J. Kubiatowicz. APRIL:
a processor architecture for multiprocessing. In 17th Annual
International Symposium on Computer Architecture, pages
104–114, May 1990.
[2] R. Alverson, D. Callahan, D. Cummings, B. Koblenz,
A. Porterfield, and B. Smith. The Tera computer system.
In International Conference on Supercomputing, pages 1–6,
June 1990.
[3] C.J. Beckmann and C.D. Polychronopoulos. Microarchitecture
support for dynamic scheduling of acyclic task graphs. In
25th Annual International Symposium on Microarchitecture,
pages 140–148, December 1992.
[4] B. Calder and D. Grunwald. Fast and accurate instruction fetch
and branch prediction. In 21st Annual International Symposium
on Computer Architecture, pages 2–11, April 1994.
[5] T.M. Conte, K.N. Menezes, P.M. Mills, and B.A. Patel.
Optimization of instruction fetch mechanisms for high issue rates.
In 22nd Annual International Symposium on Computer
Architecture, pages 333–344, June 1995.
[6] G.E. Daddis, Jr. and H.C. Torng. The concurrent execution
of multiple instruction streams on superscalar processors. In
International Conference on Parallel Processing, pages I:76–83,
August 1991.
[7] K.M. Dixit. New CPU benchmark suites from SPEC. In
COMPCON, Spring 1992, pages 305–310, 1992.
[8] J. Edmondson and P. Rubinfield. An overview of the 21164
AXP microprocessor. In Hot Chips VI, pages 1–8, August
1994.
[9] M. Fillo, S.W. Keckler, W.J. Dally, N.P. Carter, A. Chang,
Y. Gurevich, and W.S. Lee. The M-Machine multicomputer. In
28th Annual International Symposium on Microarchitecture,
November 1995.
[10] R. Govindarajan, S.S. Nemawarkar, and P. LeNir. Design
and performance evaluation of a multithreaded architecture.
In First IEEE Symposium on High-Performance Computer
Architecture, pages 298–307, January 1995.
[11] M. Gulati and N. Bagherzadeh. Performance study of a
multithreaded superscalar microprocessor. In Second International
Symposium on High-Performance Computer Architecture,
pages 291–301, February 1996.
[12] B.K. Gunther. Superscalar performance in a multithreaded
microprocessor. PhD thesis, University of Tasmania,
December 1993.
[13] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki,
A. Nishimura, Y. Nakase, and T. Nishizawa. An elementary
processor architecture with simultaneous instruction issuing
from multiple threads. In 19th Annual International Symposium
on Computer Architecture, pages 136–145, May 1992.
[14] S.W. Keckler and W.J. Dally. Processor coupling: Integrating
compile time and runtime scheduling for parallelism. In 19th
Annual International Symposium on Computer Architecture,
pages 202–213, May 1992.
[15] J. Laudon, A. Gupta, and M. Horowitz. Interleaving: A
multithreading technique targeting multiprocessors and
workstations. In Sixth International Conference on Architectural
Support for Programming Languages and Operating Systems,
pages 308–318, October 1994.
[16] Y. Li and W. Chu. The effects of STEF in finely parallel
multithreaded processors. In First IEEE Symposium on
High-Performance Computer Architecture, pages 318–325, January
1995.
[17] P.G. Lowney, S.M. Freudenberger, T.J. Karzes,
W.D. Lichtenstein, R.P. Nix, J.S. O’Donnell, and J.C. Ruttenberg. The
Multiflow trace scheduling compiler. Journal of
Supercomputing, 7(1-2):51–142, May 1993.
[18] S. McFarling. Combining branch predictors. Technical Report
TN-36, DEC-WRL, June 1993.
[19] R.G. Prasadh and C.-L. Wu. A benchmark evaluation of a
multi-threaded RISC processor architecture. In International
Conference on Parallel Processing, pages I:84–91, August
1991.
[20] Microprocessor Report, October 24 1994.
[21] Microprocessor Report, November 14 1994.
[22] E.G. Sirer. Measuring limits of fine-grained parallelism.
Senior Independent Work, Princeton University, June 1993.
[23] B.J. Smith. Architecture and applications of the HEP
multiprocessor computer system. In SPIE Real Time Signal Processing
IV, pages 241–248, 1981.
[24] M.D. Smith, M. Johnson, and M.A. Horowitz. Limits on
multiple instruction issue. In Third International Conference
on Architectural Support for Programming Languages and
Operating Systems, pages 290–302, 1989.
[25] G.S. Sohi, S.E. Breach, and T.N. Vijaykumar. Multiscalar
processors. In 22nd Annual International Symposium on
Computer Architecture, pages 414–425, June 1995.
[26] G.S. Sohi and M. Franklin. High-bandwidth data memory
systems for superscalar processors. In Fourth International
Conference on Architectural Support for Programming
Languages and Operating Systems, pages 53–62, April 1991.
[27] D.M. Tullsen, S.J. Eggers, and H.M. Levy. Simultaneous
multithreading: Maximizing on-chip parallelism. In 22nd
Annual International Symposium on Computer Architecture,
pages 392–403, June 1995.
[28] W. Yamamoto and M. Nemirovsky. Increasing superscalar
performance through multistreaming. In Conference on Parallel
Architectures and Compilation Techniques, pages 49–58,
June 1995.
[29] W. Yamamoto, M.J. Serrano, A.R. Talcott, R.C. Wood, and
M. Nemirovsky. Performance estimation of multistreamed,
superscalar processors. In Twenty-Seventh Hawaii International
Conference on System Sciences, pages I:195–204, January
1994.
[30] T.-Y. Yeh and Y. Patt. Alternative implementations of
two-level adaptive branch prediction. In 19th Annual International
Symposium on Computer Architecture, pages 124–134, May
1992.