Identifying bottlenecks in a multithreaded superscalar processor by Sigmund, U. & Ungerer, Theo
Identifying Bottlenecks in a Multithreaded Superscalar Microprocessor
Ulrich Sigmund and Theo Ungerer
VIONA Development GmbH, Karlstr.  , D  Karlsruhe, Germany
University of Karlsruhe, Dept. of Computer Design and Fault Tolerance, D  Karlsruhe, Germany
Abstract This paper presents a multithreaded superscalar processor
that permits several threads to issue instructions to the execution units
of a wide superscalar processor in a single cycle Instructions can simul
taneously be issued from up to 
 threads with a total issue bandwidth
of 
 instructions per cycle Our results show that the 
threaded 
issue
processor reaches a throughput of  instructions per cycle
1 Introduction
Current microprocessors utilize instruction-level parallelism by a deep processor
pipeline and by the superscalar technique that issues up to four instructions
per cycle from a single thread. VLSI technology will allow future generations of
microprocessors to exploit instruction-level parallelism up to  instructions per
cycle or more. However, the instruction-level parallelism found in a conventional
instruction stream is limited.
The solution is the additional utilization of more coarse-grained parallelism.
The main approaches are the multiprocessor chip and the multithreaded
processor. The multiprocessor chip integrates two or more complete processors on a
single chip. Therefore, every unit of a processor is duplicated and used
independently of its copies on the chip. In contrast, the multithreaded processor stores
multiple contexts in different register sets on the chip. The functional units are
multiplexed between the threads that are loaded in the register sets.
Multithreaded processors tolerate memory latencies by overlapping the long-latency
operations of one thread with the execution of other threads, in contrast to
the multiprocessor chip approach. Simultaneous multithreading [1] combines a
wide-issue superscalar instruction dispatch with multithreading. Instructions are
simultaneously issued from several instruction queues. Therefore, the issue slots
of a wide-issue processor can be filled by operations of several threads.
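The slot-filling idea can be illustrated with a minimal Python sketch. It assumes a simple round-robin selection among per-thread instruction queues; the text does not specify the selection policy, and the names `issue_cycle`, `queues`, and `issue_bandwidth` are hypothetical.

```python
from collections import deque


def issue_cycle(queues, issue_bandwidth):
    """Fill up to issue_bandwidth issue slots for one cycle.

    queues: one deque of ready instructions per thread (hypothetical model).
    Threads are visited round-robin; this policy is an assumption of the
    sketch, not a description of the simulated processor.
    Returns the (thread_id, instruction) pairs issued in this cycle.
    """
    issued = []
    progress = True
    while len(issued) < issue_bandwidth and progress:
        progress = False
        for thread_id, queue in enumerate(queues):
            if len(issued) == issue_bandwidth:
                break
            if queue:
                issued.append((thread_id, queue.popleft()))
                progress = True
    return issued


# One thread alone leaves slots empty; a second thread fills some of them.
queues = [deque(["a1", "a2"]), deque(["b1"]), deque()]
print(issue_cycle(queues, issue_bandwidth=4))
# prints [(0, 'a1'), (1, 'b1'), (0, 'a2')]
```

With four slots and only three ready instructions, one slot stays unused; more threads with ready instructions would fill it.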
While the simultaneous multithreading approach surveys enhancements of
the Alpha  processor, our multithreaded superscalar approach is based on
a simplified PowerPC  processor [2]. Both approaches, however, are similar
in their instruction issuing policy. This paper focuses on the identification and
avoidance of bottlenecks in the multithreaded superscalar processor. Further
simulations shown in [3] install the same base processor in a multiprocessor
chip and compare it with the multithreaded superscalar approach.
2 The Multithreaded Superscalar Processor Model
Our multithreaded superscalar processor uses various modern microarchitecture
techniques, e.g., branch prediction, in-order dispatch, independent
execution units with reservation stations, rename registers, out-of-order
execution, and in-order completion. We apply the full instruction pipeline of the
PowerPC  and extend it to employ multithreading. However, we simplify the
instruction set (using an extended DLX [4] instruction set instead), use static
instead of dynamic branch prediction, and omit the floating-point unit. The
processor model is designed to be scalable: the number of parallel threads, the sizes
of internal buffers, register sets, and caches, and the number and type of execution
units are not limited by any architectural specification.
We conducted a software simulation evaluating various configurations of the
multithreaded superscalar processor. For the simulation results presented below,
we used separate  KByte  -way set-associative data and instruction caches with
 Byte cache lines, a cache fill burst rate of  , a fully associative  -entry
branch target address cache,  general purpose registers per thread,  rename
registers, a  -entry completion queue,  simple integer units, and single complex
integer, load/store, branch, and thread control units, each execution unit with a
 -entry reservation station. The number of simultaneously fetched instructions
and the sizes of fetch and dispatch buffers are adjusted to the issue bandwidth.
We vary the issue bandwidth and the number of hosted threads, in each case
from  up to  , according to the total number of execution units.
We choose a multithreaded simulation workload that represents general-purpose
programs without floating-point instructions for a typical register window
architecture. We assume  integer,  complex integer,  load,
 store,  branch, and  thread control instructions.
3 Performance Results
The simulation results in Fig. 1 (left) show that the single-threaded  -issue
superscalar processor throughput, measured in average instructions per cycle,
only reaches a performance of  executed instructions per cycle. The four-issue
approach is slightly better, with  , due to instruction cache thrashing in
the  -issue case.
Increasing the number of threads from which instructions are simultaneously
issued to the  issue slots also increases performance. The throughput reaches a
plateau at about  instructions per cycle, where neither increasing the number
of threads nor the issue bandwidth significantly raises the number of executed
instructions. When the issue bandwidth is kept small (  to  instructions per cycle)
and four to eight threads are regarded, we expect a full exploitation of the issue
bandwidth. As seen in Fig. 1 (left), however, the issue bandwidth is only utilized
by about  . Even a highly multithreaded processor seems unable to fully
exploit the issue bandwidth. A further analysis reveals the single fetch and
decode units as bottlenecks, leading to starvation of the dispatch unit.
Therefore, we apply two independent fetch and two decode units; the
simulation results are shown in Fig. 1 (right). The gradients of the graphs representing
the multithreaded approaches are much steeper, indicating an increased
throughput. The processor throughput in an  -threaded  -issue processor is about four
times higher than in the single-threaded  -issue case. However, the range of
linear gain, where the number of executed instructions equals the issue bandwidth,
ends at an issue bandwidth of about four instructions per cycle. The throughput
reaches a plateau at about  instructions per cycle, where neither increasing
the number of threads nor the issue bandwidth significantly raises the number
of executed instructions. Moreover, the diagram on the right side in Fig. 1 shows
a marginal gain in instruction throughput when we advance from a  - to an
 -issue bandwidth. This marginal gain is nearly independent of the number of
threads in the multithreaded processor.
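The bottleneck reasoning in this section can be sketched as a simple capacity model: sustained throughput cannot exceed the slowest pipeline stage, and issue utilization is the achieved throughput divided by the issue bandwidth. The Python sketch below is only an illustration. The function names and all stage rates are hypothetical and are not taken from the simulations.

```python
def limiting_stage(stage_rates):
    """Return (name, rate) of the stage with the lowest sustained rate.

    stage_rates maps a stage name to the average number of instructions
    per cycle that the stage can deliver (hypothetical values).
    """
    name = min(stage_rates, key=stage_rates.get)
    return name, stage_rates[name]


def issue_utilization(instructions_per_cycle, issue_bandwidth):
    """Fraction of the issue slots that are actually used."""
    return instructions_per_cycle / issue_bandwidth


# Hypothetical rates: a single fetch unit starves the later stages.
single_fetch = {"fetch": 2.5, "decode": 3.0, "issue": 8.0, "execute": 4.5}
print(limiting_stage(single_fetch))

# Doubling the fetch and decode capacity moves the limit to another stage.
dual_fetch = {"fetch": 5.0, "decode": 6.0, "issue": 8.0, "execute": 4.5}
print(limiting_stage(dual_fetch))
```

In this toy model, widening the front end stops paying off once another stage becomes the minimum, which corresponds to the plateau behaviour described above.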
[Figure: two plots (left and right) of instructions per cycle over issue bandwidth and number of threads]
Fig  Average instruction throughput per processor with one fetch and one decode
unit and with with two fetch and two decode units
Further simulations see Fig  showed that a performance increase is not
yielded when the number of slots in the completion queue is increased over the
 slots assumed in the simulations above Instruction execution is limited by
true data dependencies and by control dependencies that cannot be removed
Also four writeback ports and  rename registers seem appropriate
The loadstore unit and the memory subsystem remain as the main bottle
neck that may potentially be removed by a dierent con
guration The loadstore
unit is limited to the execution of a single instruction per cycle Duplication
of the loadstore unit de
nitely increases performance However two or more
loadstore units that access a single data cache are dicult to implement because
of consistency and thrashing problems Using an instruction mix with  load
and store instructions potentially allows a processor throughput of 
ve instead
of the measured  instructions per cycle with a single loadstore unit
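The throughput limit imposed by the load/store unit follows from simple arithmetic: if a fraction f of all instructions must pass through n units that each execute one instruction per cycle, the total throughput cannot exceed n / f. The Python sketch below illustrates this. The function names are hypothetical, and the share of 0.2 is inferred here from the stated bound of five instructions per cycle with a single unit; it is not read from the text.

```python
def unit_class_bound(unit_count, mix_fraction):
    """Upper bound on total instructions per cycle imposed by one unit class.

    unit_count: number of units of this class, each executing one
    instruction per cycle.
    mix_fraction: share of all instructions that need this unit class.
    """
    if not 0 < mix_fraction <= 1:
        raise ValueError("mix_fraction must be in (0, 1]")
    return unit_count / mix_fraction


def throughput_bound(issue_bandwidth, unit_count, mix_fraction):
    """Bound from both the issue bandwidth and one unit class."""
    return min(issue_bandwidth, unit_class_bound(unit_count, mix_fraction))


# Assumed load/store share of 0.2 (inferred, see above).
print(unit_class_bound(1, 0.2))   # single load/store unit
print(unit_class_bound(2, 0.2))   # duplicated load/store unit
```

Under this assumption a second load/store unit would lift the bound above an issue bandwidth of eight, so the issue bandwidth would then become the tighter limit in `throughput_bound`.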
[Figure: plots over completion queue size, number of writeback ports, and number of rename registers]
Fig  Sources of unused issue slots
To evaluate whether a different memory configuration might increase the
throughput, we simulated different cache sizes, cache line sizes (from  to  bytes),
cache schemes (direct mapped, set associative), workloads, and numbers of active
threads. The simulations show that our simulated processor is able to completely
hide all latencies caused by cache refills  by its multithreaded execution
model. The multithreaded superscalar processor reaches the maximum
throughput that is possible with a single load/store unit. Penalties caused by data cache
misses are responsible for the difference to the theoretical maximum throughput
of five instructions per cycle.
4 Conclusion
This paper surveyed bottlenecks in a multithreaded superscalar processor based
on the PowerPC  microarchitecture, taking various configurations into
consideration. Using an instruction mix with  load and store instructions, the
performance results show for an  -issue processor with four to eight threads that
two instruction fetch and two decode units, four integer units,  rename
registers, four register ports, and a completion queue with  slots are sufficient. The
single load/store unit proves to be the principal bottleneck because it cannot easily
be duplicated. The multithreaded superscalar processor ( -threaded,  -issue) is
able to completely hide latencies caused by  burst cache refills. It reaches
the maximum throughput of  instructions per cycle that is possible with a
single load/store unit.
References
1. Tullsen, D. E., Eggers, S. J., Levy, H. M.: Simultaneous Multithreading: Maximizing
   On-Chip Parallelism. The  nd Ann. Int. Symp. on Comp. Arch.,  
2. Song, S. P., Denman, M., Chang, J.: The PowerPC  RISC Microprocessor. IEEE
   Micro, Vol.  , No.  ,  
3. Sigmund, U., Ungerer, Th.: Evaluating a Multithreaded Superscalar Microprocessor
   vs. a Multiprocessor Chip.  th PASA Workshop, Juelich, World Sc. Publ.,  
4. Hennessy, J. L., Patterson, D. A.: Computer Architecture: a Quantitative Approach.
   San Mateo,  