Evaluating a multithreaded superscalar microprocessor versus a multiprocessor chip by Sigmund, U. & Ungerer, Theo
EVALUATING A MULTITHREADED SUPERSCALAR
MICROPROCESSOR VERSUS A MULTIPROCESSOR CHIP
T UNGERER
Dept of Computer Design and Fault Tolerance D	 Karlsruhe Germany
Phone
   		 Fax   
ungererInformatikUniKarlsruhede
U SIGMUND
VIONA Development GmbH Karlstr  D Karlsruhe Germany
Phone
    Fax
   
ulivionade
This paper examines implementation techniques for future generations of
microprocessors. While the wide superscalar approach, which issues 8 and more
instructions per cycle from a single thread, fails to yield satisfying
performance, its combination with techniques that utilize more coarse-grained
parallelism is very promising. These techniques are multithreading and
multiprocessing. A multithreaded superscalar processor permits several threads
to issue instructions to the execution units of a wide superscalar processor in
a single cycle. Multiprocessing integrates two or more superscalar processors
on a single chip. Our results show that the 8-threaded 8-issue superscalar
processor reaches a performance of [...] executed instructions per cycle. Using
the same number of threads, the multiprocessor chip reaches a higher throughput
than the multithreaded superscalar approach. However, if chip costs are taken
into consideration, a 4-threaded 4-issue superscalar processor outperforms a
multiprocessor chip built from single-threaded processors by a factor of [...]
in performance/cost relation.
1 Introduction
Current microprocessors utilize instruction-level parallelism by a deep
processor pipeline and by the superscalar instruction issue technique. The DEC
Alpha 21164, PowerPC 604 and 620, MIPS R10000, Sun UltraSparc and HP PA-8000
issue up to four instructions per cycle from a single thread. VLSI technology
will allow future generations of microprocessors to exploit instruction-level
parallelism of up to 8 instructions per cycle or more. Possible techniques are
a wide superscalar approach (the IBM POWER2 processor is a 6-issue superscalar
processor), the VLIW approach, the SIMD approach within a processor as in the
HP PA-7100LC, and the CISC approach, where a single instruction is dynamically
split into its RISC particles as in the AMD K5 or the Intel PentiumPro.
However, the instruction-level parallelism found in a conventional instruction
stream is limited. Recent studies show the limits of processor utilization even
for today's superscalar microprocessors. Using the SPEC92 benchmark suite, the
PowerPC 620 shows an execution of [...] to [...] instructions per cycle [1],
and even an 8-issue Alpha processor will fail to sustain [...] instructions per
cycle [2].
The solution is the additional utilization of more coarse-grained parallelism.
The main approaches are the multiprocessor chip and the multithreaded
processor. The multiprocessor chip integrates two or more complete processors
on a single chip. Therefore every unit of a processor is duplicated and used
independently of its copies on the chip. For example, the Texas Instruments
TMS320C80 Multimedia Video Processor [3] integrates four digital signal
processors and a scalar RISC processor on a single chip.
In contrast the multithreaded processor stores multiple contexts in dierent
register sets on the chip The functional units are multiplexed between the
threads that are loaded in the register sets Depending on the specic mul
tithreaded processor design only a single instruction pipeline is used or a
single dispatch unit issues instructions from dierent instruction buers si
multaneously Because of the multiple register sets context switching is very
fast Multithreaded processors tolerate memory latencies by overlapping the
longlatency operations of one thread with the execution of other threads  in
contrast to the multiprocessor chip approach
While the multiprocessor chip is easier to implement, the use of multithreading
in addition to a wide issue bandwidth is a promising approach. Several
approaches to multithreaded processors exist in commercial and in research
machines:
- The cycle-by-cycle interleaving approach, exemplified by the Denelcor
  HEP [4] and the Tera processor [5], switches contexts each cycle. Because
  only a single instruction per context is allowed in the pipeline, the
  single-thread performance is extremely poor.
- The block-interleaving approach, exemplified by the MIT Sparcle
  processor [6], executes a single thread until it reaches a long-latency
  operation, such as a remote cache miss or a failed synchronization, at which
  point it switches to another context. The Rhamma processor [7] switches
  contexts whenever a load, store or synchronization operation is discovered.
- The simultaneous multithreading approach [2] combines a wide-issue
  superscalar instruction dispatch with the multiple-context approach by
  providing several register sets on the microprocessor and issuing
  instructions from several instruction queues simultaneously. Therefore the
  issue slots of a wide-issue processor can be filled by operations of several
  threads. Latencies occurring in the execution of single threads are bridged
  by issuing operations of the remaining threads loaded on the processor. In
  principle, the full issue bandwidth can be utilized.
While the simultaneous multithreading approach [2] surveys enhancements of
the Alpha 21164 processor, our multithreaded superscalar approach is based on
the PowerPC 604 [8]. Both approaches, however, are similar in their
instruction-issuing policy. We simulate the full instruction pipeline of the
PowerPC 604 and extend it to employ multithreading. However, we simplify the
instruction set (using an extended DLX [9] instruction set instead), use static
instead of dynamic branch prediction, and omit the floating-point unit. We
install the same base processor in our multiprocessor chip simulations to
guarantee a fair comparison with the multithreaded superscalar approach.
2 Evaluation Methodology
2.1 The Superscalar Base Processor
Our superscalar base processor (see Fig. 1) implements the six-stage
instruction pipeline of the PowerPC 604 processor: fetch, decode, dispatch,
execute, complete and write-back.
The processor uses various modern microarchitecture techniques, e.g. separate
code and data caches, a branch target address cache, static branch prediction,
in-order dispatch, independent execution units with reservation stations,
rename registers, out-of-order execution and in-order completion.
The fetch and decode units always work on a contiguous block of instructions.
These blocks may not always have the maximum size, as they are limited by the
cache (an instruction fetch cannot overlap cache lines) and by branches that
are predicted to be taken. As block size we choose the number of instructions
that can be issued simultaneously.
The dispatch unit is restricted by the maximum issue bandwidth, i.e. the
maximum number of instructions that can be issued simultaneously to the
execution units. The maximum issue bandwidth of a processor is mainly limited
by the restricted number of busses from the rename registers to the execution
units, and not by the instruction selection matrix. Following the issuing
policy of the PowerPC 604, instructions are always issued in program order to
the reservation stations of the execution units. The instructions are then
executed out of order by the execution units. The use of separate reservation
stations for the execution units simplifies the dispatch, because only the lack
of instructions, the lack of resources and the maximum bandwidth limit the
simultaneous instruction issue rate. Data dependencies are taken care of by the
rename registers and reservation stations. Instructions can be dispatched to
the reservation stations without checking for control or data dependencies. An
instruction will not be executed until all source operands are available.
Figure 1: Superscalar architecture (block diagram with instruction cache, fetch
unit, fetch buffer, decode unit, dispatch buffer, dispatch unit, reservation
stations, execution units, completion buffer, completion unit, branch target
address cache, data cache, rename registers, register set, instruction memory
and data memory; legend: register units, execution units, control pipeline)
All standard integer operations are executed in a single cycle by the simple
integer units. Only the multiply and divide instructions are executed in a
complex integer unit. The execution of multiply instructions is fully pipelined
and consumes a specified number of cycles. The integer divide is not pipelined;
its latency can also be specified in the simulator.
The branch unit executes one branch instruction per cycle. Branch prediction
starts in the fetch unit using a branch target address cache. A simple static
branch prediction technique is applied in the decode unit: each forward branch
is predicted as not taken, each backward branch as taken. Static branch
prediction simplifies processor design but reduces prediction accuracy.
Completion is controlled by the completion unit, which retires instructions in
program order at a maximum rate equal to the maximum issue bandwidth. When an
instruction is retired, its result is copied from the rename register to its
register in the register set. The rename register and the slot in the
completion buffer are then released.
The memory interface is dened as a standard DRAM interface with cong
urable burst sizes and delays to simulate advanced RAM types like SDRAM or
EDO All caches are highly congurable to test penalties due to cache thrashing
caused by multiple threads
The processor is designed scalable the number of execution units and the
size of buers are not limited by any architectural specication This allows
experimentation to nd an optimized conguration for an actual processor
depending on chip size and expected application load
2.2 Multithreaded Superscalar Base Processor
While the multiprocessor chip simply comprises two or more base processors of a
specific issue bandwidth, the multithreaded superscalar approach is more
complicated. In the multithreaded superscalar processor (see Fig. 2), the
buffers of the control pipeline (fetch buffer, dispatch buffer and completion
buffer) and the register set are duplicated according to the number of hosted
threads. Each thread has its own set of buffers and registers, thus running
logically independently of the other threads. The processor is designed to be
scalable with respect to the number of hosted threads. To the user of a
multithreaded superscalar processor, the machine behaves like a multiprocessor:
each thread executes in its own context and is not affected by other threads.
Since the fetch and decode units only work on a single thread per cycle, they
may also be duplicated to gain a higher throughput. Instructions are still
fetched and decoded in blocks of contiguous instructions. There is only a
single dispatch unit and a single completion unit. The dispatch unit
simultaneously selects instructions from all independent dispatch buffers, up
to its maximum issue bandwidth. The completion unit simultaneously retires any
number of instructions of any thread, up to a total maximum of retired
instructions per cycle.
Figure 2: Multithreaded control pipeline (block diagram with I-cache, BTAC,
fetch units, per-thread fetch buffers, decode units, per-thread dispatch
buffers, a dispatch unit, per-thread completion buffers, a completion unit,
register frames, rename registers and execution units)
The dispatch unit is not restricted with respect to the number of instructions
issued from each thread. In contrast to the multiprocessor chip, there is no
fixed allocation between threads and execution units in the multithreaded
superscalar processor.
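A minimal sketch of such a shared dispatch step follows. The text does not
state how the dispatch unit arbitrates between threads, so the round-robin
order used here is an assumption, as are the function name and the deque-based
buffers. Resource limits (free reservation stations, rename registers) are
ignored for brevity.

```python
from collections import deque


def dispatch(dispatch_buffers: list, issue_bandwidth: int) -> list:
    """Select up to issue_bandwidth instructions from all per-thread buffers.

    Instructions leave each buffer in program order; threads are visited
    round-robin (an assumed policy). No slot is reserved for any thread.
    """
    issued = []
    progress = True
    while len(issued) < issue_bandwidth and progress:
        progress = False
        for thread_id, buf in enumerate(dispatch_buffers):
            if buf and len(issued) < issue_bandwidth:
                issued.append((thread_id, buf.popleft()))
                progress = True
    return issued


buffers = [deque(["a0", "a1", "a2"]), deque(), deque(["c0"])]
slots = dispatch(buffers, issue_bandwidth=4)
```

With thread 1 stalled and thread 2 holding a single instruction, thread 0 fills
the remaining issue slots, which is exactly what a fixed per-thread allocation
would prevent.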
The rename registers the branch target address cache the data and the in
struction cache are shared by all active threads There is no xed allocation
of any of these resources to specic threads This allows for maximum perfor
mance with any number of active threads
Each thread executes in a separate register set. The contents of the registers
of a register set describe the state of a thread and form a so-called
activation frame. An activation frame is called active if the thread is
currently executed by the processor, i.e. the activation frame is held in a
register set. In addition to the active activation frames in the processor, we
provide an activation frame cache, which holds activation frames that are not
currently scheduled for execution. The activation frames in the activation
frame cache and in the register sets can be interchanged without a significant
penalty. By adding a cache of threads that are not currently running, the
machine can be virtualized to an arbitrary number of threads. When a thread
performs an instruction with long latency, such as an external synchronization,
the thread can be preempted by a ready thread in the cache.
The activation frame cache is also used to implement a variant of the register
window technique. A new activation frame is created for every subprogram
activation, thereby significantly reducing the number of memory accesses to the
data cache.
The execution units are extended by a thread control unit that is responsible
for the creation and deletion of threads and for synchronization and
communication between threads. Except for the load/store unit and the thread
control unit, each kind of execution unit may be arbitrarily duplicated.
Within a multithreaded processor, almost all latencies are covered by other
threads, so the penalty caused by static branch prediction should not affect
the average number of executed instructions per cycle.
2.3 Simulator and Application Workload
Starting with the superscalar base processor, we conducted a software
simulation [10] evaluating various configurations of the multithreaded
superscalar approach and of the multiprocessor chip models. All functional
units were simulated with correct cycle behaviour. We employed an
instruction-driven simulator, not a code tracer, so all execution-related
effects are simulated.
The simulator is either script-driven or interactive. All functional units can
be observed during interactive execution in separate windows, which makes it
possible to investigate all run-time effects in detail. In script-driven mode,
versatile scripts can be defined to build up more complex test runs. Various
simulation results are collected for all units of the processors and for each
thread. These reports can be evaluated automatically to create combined results
for several tests.
The simulation workload is generated by a configurable workload generator,
which creates random high-level programs and compiles them to our machine
language. The distribution of the machine instructions is similar to that of
code generated from high-level programs. This approach allows us to create
workloads for different types of programs without the need for a complete
compiler.

3 Performance Results
For the performance results presented in this section we choose a multi
threaded simulation work load that represents general purpose programs with
out oatingpoint instructions for a typical register window architecture
Instruction type    Average use
Integer             [...]
Complex integer     [...]
Load                [...]
Store               [...]
Branch              [...]
Thread control      [...]
With a single loadstore unit used in our approach the chosen work load has
a theoretical maximum throughput of about ve instructions per cycle the
frequency of load and store instructions sums up to 	 
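The bound follows from simple arithmetic: if every load/store unit accepts at
most one memory operation per cycle, the sustained instruction rate cannot
exceed the number of load/store units divided by the fraction of load and store
instructions in the workload. The sketch below illustrates this; the function
name is hypothetical, the one-operation-per-cycle assumption is ours, and the
fraction 0.2 is an illustrative value chosen to be consistent with the stated
bound of about five.

```python
def max_ipc(load_store_fraction: float, load_store_units: int = 1) -> float:
    """Upper bound on instructions per cycle imposed by the load/store units.

    Assumes each unit accepts at most one load or store per cycle.
    """
    if not 0.0 < load_store_fraction <= 1.0:
        raise ValueError("load_store_fraction must be in (0, 1]")
    return load_store_units / load_store_fraction


bound_single_unit = max_ipc(0.2)                    # one unit
bound_two_units = max_ipc(0.2, load_store_units=2)  # bound doubles
```

The same relation shows why adding issue slots or threads cannot help once the
single load/store unit is saturated: only a second unit or a workload with
fewer memory operations raises the bound.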
For the simulation results presented below, we used separate 8 KByte
[...]-way set-associative data and instruction caches with [...]-byte cache
lines, a cache fill burst rate of [...], a fully associative [...]-entry branch
target address cache, [...] general-purpose registers per thread, [...] rename
registers, a [...]-entry completion buffer, up to [...] simple integer units,
single complex integer, load/store, branch and thread control units, and a
[...]-entry reservation station for each execution unit. For the multithreaded
approach we used two fetch units and two decode units. The number of
simultaneously fetched instructions and the sizes of the fetch and dispatch
buffers are adjusted to the issue bandwidth. We vary the issue bandwidth and
the number of hosted threads, in each case from 1 up to 8, according to the
total number of execution units.
The simulation results in Fig. 3 show that the throughput of the
single-threaded 8-issue superscalar processor only reaches [...] executed
instructions per cycle. The four-issue approach is slightly better with [...],
due to instruction cache thrashing in the 8-issue case.
We also see in Fig. 3 that the range of linear gain, where the number of
executed instructions equals the issue bandwidth, ends at an issue bandwidth of
about four instructions per cycle. The throughput reaches a plateau at about
[...] instructions per cycle, where neither increasing the number of threads
nor the issue bandwidth significantly raises the number of executed
instructions. Moreover, the diagram on the right side of Fig. 3 shows only a
marginal gain in instruction throughput when we advance from a 4-issue to an
8-issue bandwidth. This marginal gain is nearly independent of the number of
threads in the multithreaded processor.
Figure 3: Average instruction throughput per processor (two diagrams, each
plotting instructions per cycle against issue bandwidth and number of threads)
Further simulations shown in  identify the single loadstore unit as a
principal bottleneck The dierence between the expected ve instructions
per cycle and the simulation result of  is due to bubbles in the loadstore
pipeline caused by data cache misses Our multithreaded superscalar approach
reaches the maximum throughput that is possible with a single loadstore unit
To compare the multithreaded approach with a multiprocessor solution, we show
in Fig. 4 the results normalized to the number of threads, i.e. the number of
instructions per cycle divided by the number of threads. This allows the
comparison of parallel systems built from basic multithreaded or
single-threaded processors. Fig. 4 also shows that a multiprocessor built from
single-threaded superscalar processors delivers the highest average instruction
throughput per thread (note that the throughput per processor is shown in
Fig. 3). Each step to more parallel threads delivers less average throughput
per thread. It looks as if the multithreaded approach falls behind the
multiprocessor chip approach. However, the costs of the different processor
configurations have not yet been taken into account. The multiprocessor
approach duplicates complete processors, whereas the multithreaded design only
duplicates parts of the processor.

Figure 4: Average instruction throughput per thread (instructions per cycle and
thread, plotted against issue bandwidth and number of threads)
To verify our results, we compare them to Tullsen's simulation in [2]. We see
that our simulator produces different results, especially for the 8-threaded
8-issue superscalar approach (see the table below).
NumberThreadsIssue Tullsens simulation  Our simulation


  

  	
 
	 
	

  
  
  
The deviation from the results of Tullsen's simulation follows from the high
number of execution units in Tullsen's approach and from the limited
exploitation of instruction-level parallelism in the Alpha processor used as
the base of Tullsen's simulation. Compared to our simulations, Tullsen's
approach favours the multithreaded superscalar approach over the multiprocessor
chip approach. For example, up to eight load/store units are used in Tullsen's
simulation, ignoring hardware costs and design problems. We do not believe that
it is cost-effective to implement 8 simultaneously working load/store units
within a multithreaded superscalar processor.
It is obvious that dierent processor congurations can only be compared if a
measurement for their costs eg in chip space is used Otherwise unrealistic
	
processors are compared with each other simply stating that more units result
in more performance
To get a rough measure of the costs of a processor configuration, we propose a
formula based on the PowerPC 604 floor plan. The formula expresses hardware
costs based on chip space usage per unit. The estimated costs per unit are:
Unit type                Estimated cost
Integer unit             [...]
Load/store unit          [...]
Fetch and decode unit    [...]
Branch unit              [...]
Caches                   [...]
Registers                Number of threads
Completion unit          Issue bandwidth
Dispatch unit            Number of threads x Issue bandwidth
The formula is only a rule of thumb. The cost of the dispatch unit is based on
the required interconnections between the dispatch unit, the register sets and
the execution units.
Figure 5: Average instruction throughput in relation to chip costs (performance
by cost, normalized to the scalar processor, plotted against issue bandwidth
and number of threads)
As each threads register set and dispatch queue has to be connected with all
execution units the required chip space is proportional to the product of the
number of hosted threads and the issue bandwidth of the processor

Fig  displays the instructions per cycle in relation to the hardware costs of
a specic processor conguration The solution with four threads and issue
bandwidth four shows the best performancecost relation However this ob
servation is application and design specic The advantage of multithreading
is highly dependent on the ratio of loadstore instructions to other instruc
tions in the workload Also the chip costs change with dierent architectural
decisions
4 Conclusion
This paper examined the multithreaded superscalar processor in comparison to
the multiprocessor chip approach, taking performance and hardware costs into
consideration. For our research study we used a simulator for multithreaded
processors based on the PowerPC 604.
While the singlethreaded 
issue superscalar processor only reaches a through
put of about  the 
threaded 
issue superscalar processor executes 
instructions per cycle the loadstore frequency in the work load sets the the
oretical maximum to  instruction per cycle Increasing the issue bandwidth
from  to 
 yields only a marginal gain in instruction throughput  a result that
is nearly independent of the number of threads in the multithreaded processor
The multiprocessor chip approach with 
 singlethreaded scalar processors
reaches 	 instructions per cycle Using the same number of threads the
multiprocessor chip reaches a higher throughput than the multithreaded su
perscalar approach refer to the previous paragraph However if we take
the chip costs into consideration a threaded issue superscalar processor
outperforms a multiprocessor chip built from singlethreaded processors by a
factor of 
 in performancecost relation
5 References
 TA Diep C Nelson JP Shen Performance Evaluation of the PowerPC
	 Microprocessor The nd Annual International Symposium on Computer
Architecture Santa Margherita Ligure June     
 D E Tullsen S J Eggers H M Levy Simultaneous Multithreading
Maximizing OnChip Parallelism The nd Annual International Symposium

on Computer Architecture Santa Margherita Ligure June    
	
 Texas Instruments TMS	C
	 Technical Brief Multimedia Video Proces
sor MVP Texas Instruments 
 B J Smith The Architecture of HEP In J S Kowalik Ed Parallel
MIMD Computation The HEP Supercomputer and Its Applications The
MIT Pr ess Cambridge 

 R Alverson et al The Tera Computer System th International Confer
ence on Supercomputing Amsterdam June  	  
 A Agarwal et al The MIT Alewife Machine Architecture and Perfor
mance The nd Annual International Symposium on Computer Architec
ture Santa Margherita Ligure June     
 W Gruenewald Th Ungerer Towards Extremely Fast Context Switch
ing in a Blockmultithreaded Processor nd Euromicro Conference Prague
Sept  

 S P Song M Denman J Chang The PowerPC 	 RISC Microprocessor
IEEE Micro Vol  No  Oct  
  
 J L Hennessy D A Patterson Computer Architecture a Quantitative
Approach San Mateo 
	 U Sigmund Design of a Multithreaded Superscalar Processor Master
Thesis University of Karlsruhe  in German
 U Sigmund Th Ungerer Identifying Bottlenecks in Multithreaded Su
perscalar Microprocessors To be published

