A comparison of data prefetching on an access decoupled and superscalar machine by Jones, G.P. & Topham, N.P.
  
 
 
 
A comparison of data prefetching on an access decoupled and
superscalar machine
Citation for published version:
Jones, GP & Topham, NP 1997, A comparison of data prefetching on an access decoupled and superscalar
machine. in Microarchitecture, 1997. Proceedings., Thirtieth Annual IEEE/ACM International Symposium on.
pp. 65-70. DOI: 10.1109/MICRO.1997.645798
Digital Object Identifier (DOI):
10.1109/MICRO.1997.645798
Link:
Link to publication record in Edinburgh Research Explorer
Document Version:
Peer reviewed version
Published In:
Microarchitecture, 1997. Proceedings., Thirtieth Annual IEEE/ACM International Symposium on
A Comparison of Data Prefetching on an Access Decoupled and
Superscalar Machine
 
G P Jones N P Topham
Dept of Computer Science Dept of Computer Science
Edinburgh University Edinburgh University
Edinburgh Scotland UK Edinburgh Scotland UK
Abstract
In this paper we investigate the behavior of data prefetching on an access
decoupled machine and a superscalar machine. We assess if there are benefits
to using the decoupling paradigm, given that an out-of-order (ooo) superscalar
architecture could in principle prefetch to the same degree as an access
decoupled machine.

We have found that for large issue width the access decoupled machine can hide
memory latency more effectively than a single instruction window ooo
superscalar architecture. Our findings also demonstrate that an access
decoupled machine offers the benefit of reducing the complexity of window
issue logic.

1 Introduction
The future of high performance microprocessor design is to provide improved
performance by extracting higher degrees of instruction level parallelism. In
superscalar architectures parallelism is exploited by reordering instructions
within an instruction window and issuing multiple independent instructions per
cycle. However, as processor speeds increase and issue widths get larger, the
cost of a main memory access is becoming relatively more expensive. One
solution is to hide memory latency by data prefetching.

Data prefetching is a technique that hides memory latency by overlapping
access and data operations. Data prefetching can be implemented in hardware,
in software, or as a hybrid of both schemes. However, as memory latencies
become relatively more expensive, the number of independent overlapped
instructions required to hide the access times increases. Larger instruction
windows are therefore required to detect independent instructions that can
execute in parallel with memory access operations. The pressure to increase
window sizes is also driven by the goal of providing ever larger issue widths.

This research was supported by EPSRC grant K

However, large window and issue width sizes introduce greater complexity in
window issue logic. Palacharla et al. have shown that delays in the issue
logic vary quadratically with window and issue width size. Since delays in
issue logic will be critical to the processor clock, there is a need to
consider architectures that simplify issue window logic.

To solve the window complexity problem some architectures use separate
microclusters. Microclusters may share or have a dedicated instruction window,
but each has its own register file and function units. This simplifies window
logic by flagging instructions for execution on particular microclusters, and
reduces the size of the instruction window, but can limit the number of
instructions issued per cycle.

Access decoupling is a latency hiding technique that partitions a program,
statically or dynamically, into two separate instruction streams in order to
prefetch data aggressively. The instruction streams are loosely coupled. One
stream, executed on an address unit (AU), prefetches data for the second
stream, executed on a data unit (DU). Memory accesses can then be pipelined to
tolerate large memory latencies, provided the two streams can be decoupled
sufficiently.

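The partition into an address stream and a data stream can be illustrated with
a minimal sketch. The instruction encoding, the function names `partition` and
`run`, and the use of a queue between the two streams are assumptions made for
illustration only; they are not the partitioning scheme used in the
experiments.

```python
from collections import deque


def partition(program):
    """Split a toy instruction list into an AU stream and a DU stream."""
    au, du = [], []
    for op, operand in program:
        if op == "load":
            # The AU sends the address; the DU later receives the value.
            au.append(("send_addr", operand))
            du.append(("recv", operand))
        elif op == "addr":
            au.append((op, operand))      # address arithmetic stays on the AU
        else:
            du.append((op, operand))      # data arithmetic stays on the DU
    return au, du


def run(au, du, memory):
    """Run the whole AU stream first, then let the DU sum the received values."""
    queue = deque()
    for op, operand in au:
        if op == "send_addr":
            queue.append(memory[operand])  # value is buffered until the DU asks
    total = 0
    for op, operand in du:
        if op == "recv":
            total += queue.popleft()
    return total


if __name__ == "__main__":
    toy = [("addr", "i"), ("load", 0), ("add", None),
           ("addr", "i"), ("load", 1), ("add", None)]
    au_stream, du_stream = partition(toy)
    print(au_stream)
    print(du_stream)
    print(run(au_stream, du_stream, memory=[10, 32]))
```

In this toy the AU runs to completion before the DU starts, which is the
extreme case of the two streams being fully decoupled.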
In principle the same level of prefetching in an access decoupled machine
could be achieved with an out-of-order (ooo) superscalar architecture. The
question is then: why should designers consider using the decoupling paradigm?

Memory latencies are typically  cycles, whereas arithmetic function latencies
are  cycles (excluding divide and intrinsics). A system could easily tolerate
a small degree of ooo execution amongst arithmetic operations, provided loads
could slip by a large amount with respect to arithmetic operations. This
slippage between arithmetic and load operations is exactly what occurs in a
decoupled machine. In effect, we can implement a small instruction window for
arithmetic and access operations, provided the latter can slip by a large
amount with respect to the former.

In answer to our question, we believe that an access decoupled machine can be
viewed as a variant of a microcluster architecture with two separate
instruction windows. The asynchronously executing units, through code
partition and dynamic slippage, combine the benefits of reducing window logic
complexity with data prefetching.

In this paper we compare the relationship between window size and memory
latency for an access decoupled machine (DM) and a single window ooo
superscalar machine (SWSM). We also evaluate the size of window required by
the SWSM to achieve the same performance as the DM.

The thesis of this paper is developed in the following way. In section 2 we
outline the DM and SWSM. In section 4 we describe our simulation technique. In
section 3 we discuss the notion of the effective single window (ESW) to help
explain some of our findings. In section 5 we present the results of our work.
Finally, in section 6 we draw together our findings and suggest avenues for
future work in this area.

2 The Architectural Models
The access decoupled machine (DM) modelled in our experiments is shown in
Figure 1. The machine consists of two separate out-of-order (ooo) superscalar
processors, the address unit (AU) and the data unit (DU), responsible for
executing the access and data operations. Each unit has its own separate
instruction window, functional units and register files. The units can share
results by moving data between register files. The number of instructions
issued per cycle is determined by the issue width.

The decoupled memory lies between the two superscalar pipelines and the rest
of the memory system. The decoupled memory receives addresses from the AU and
sends them to the memory system. When a referenced value is returned, the
decoupled memory buffers the value until it is requested by the DU. Requests
from the decoupled memory take a single cycle. AU self loads are executed in a
similar way. Previously the decoupled memory has been implemented through the
use of queues.

The single window superscalar machine (SWSM) is shown in Figure 2. The
architecture is an ooo machine with a single instruction window for reordering
operations. In each cycle, independent operations which are ready to execute
are issued to the function units. Unlike the DM, the full issue width is
available for issuing instructions every cycle. This means that if the SWSM is
able to guarantee 100% utilisation of the full issue width it could outperform
the DM.

Figure 1: DM. Labelled components: an instruction window, issue width, and
function units + register files for each of the AU and DU; the decoupled
memory; a bypass; and the memory system.

Figure 2: SWSM. Labelled components: a single instruction window, issue width,
function units + register files, a prefetch buffer, and the memory system.

There are different types of hardware, software and hybrid schemes for data
prefetching. For the SWSM we use a hybrid scheme. Every memory operation
comprises two instructions, a prefetch and an access operation. The prefetch
instruction preloads data into the prefetch buffer ahead of the access
operation. Prefetch operations, unlike in software schemes, are allowed to
begin execution as soon as runtime resources allow. Using this scheme we gain
the benefits of exact address computation with dynamic execution. The prefetch
buffer is a fully associative buffer responsible for storing prefetched data.
Requests from the prefetch buffer take a single cycle.

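The pairing of each memory operation with a prefetch can be sketched as a
simple rewrite of an instruction list. The tuple encoding and the name
`expand_memory_ops` are assumptions for illustration; the sketch only shows
the two-instructions-per-memory-operation structure described above, not how
the hardware schedules the prefetch.

```python
def expand_memory_ops(program):
    """Emit a prefetch ahead of every load or store in a toy instruction list."""
    expanded = []
    for op, operand in program:
        if op in ("load", "store"):
            # The prefetch may issue as soon as its address is available;
            # the access operation later reads or writes the prefetch buffer.
            expanded.append(("prefetch", operand))
        expanded.append((op, operand))
    return expanded


if __name__ == "__main__":
    toy = [("load", "a"), ("add", None), ("store", "b")]
    for instruction in expand_memory_ops(toy):
        print(instruction)
```

Note that the rewrite increases the instruction count, so the single window
has to hold more entries for the same amount of work.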
The memory system consists of the main memory but may also be composed of
first or second level caches. We are not concerned with a detailed simulation
of the memory system; instead we model its execution by considering every
access to have a fixed cost. We refer to this fixed cost as the memory
differential (MD). The memory differential is the difference in time between a
register access and a memory system access. The purpose of all latency hiding
techniques is to eliminate any perceived memory differential.

3 The Effective Single Window (ESW)
An advantage of the DM is that the dynamic slippage between the windows of
instructions on the AU and DU means that the effective single window size can
be greater than the sum of the individual units' window sizes. Figure 3
illustrates the idea of the ESW. The diagram shows the streams for the AU, the
DU and a single instruction stream. In the single instruction stream the
instructions are shown in program order, with later instructions appearing
further down the page, and labelled with the units on which they execute in
the DM. The diagram shows that, due to the dynamic slippage between the units,
the AU is executing instructions further into the instruction stream than the
DU. The ESW is the minimum size of window required to buffer all instructions
from the oldest DU instruction to the youngest AU instruction.

Figure 3: Effective Single Window. The diagram shows the AU instruction
stream, the DU instruction stream and the single instruction stream, and marks
the AU and DU windows, the oldest DU instruction, the youngest AU instruction
and the equivalent single window.

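The ESW definition can be made concrete with a small calculation. In the
sketch below, `units` labels each instruction of the single program-order
stream with the unit it executes on, and `in_flight` is the set of stream
positions currently held in the AU and DU windows. Both names, and the example
data, are hypothetical; the sketch assumes the AU is running ahead of the DU.

```python
def effective_single_window(units, in_flight):
    """Window size needed to span oldest DU to youngest AU instruction."""
    au = [i for i in in_flight if units[i] == "AU"]
    du = [i for i in in_flight if units[i] == "DU"]
    if not au or not du:
        raise ValueError("need at least one in-flight instruction on each unit")
    # Positions are program-order indices, so a larger index is younger.
    return max(au) - min(du) + 1


if __name__ == "__main__":
    stream = ["DU", "DU", "AU", "DU", "AU", "DU", "AU", "AU"]
    # DU window holds positions 0 and 1; AU window holds positions 6 and 7.
    print(effective_single_window(stream, {0, 1, 6, 7}))  # 8
```

Here the two windows hold four instructions in total, yet a single window
would need eight slots to cover the same range of the instruction stream.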
4 Simulation Technique
In our experiments we simulated the execution of seven programs from the
PERFECT Club suite; for a full discussion of the simulation technique see  .
Load and store operations on the DM are executed as one instruction on each of
the units. On the SWSM, loads and stores generate a prefetch and an access
operation. Integer and address computations have a 1 cycle cost. Floating
point operations take  cycles to complete.

There is no speculative execution, but we assume loop closing branches have
been removed by optimisations like loop unrolling and branch prediction. Data
dependency analysis is perfect and false dependencies are removed by renaming.
The purpose of examining such an ideal case is to provide the best opportunity
for prefetching data, to have high instruction level parallelism (ILP) and to
place the greatest pressure on the latency hiding mechanism.

The issue widths used for the AU and DU were  and  respectively. These widths
were found to be an optimal configuration in  . An issue width of
instructions was used for the SWSM.

5 Experimental Results
In this section we present the major findings of the paper. For the purposes
of this paper we have selected three representative programs that exhibit the
range of observed behavior. The three selected programs were FLO52Q, MDG and
TRACK. Table 1 shows the latency hiding effectiveness of all seven programs
when the window size is unlimited and the memory differential is  cycles. The
latency hiding effectiveness (LHE) is defined as
$\mathrm{LHE} = T_{\mathrm{perfect}} / T_{\mathrm{actual}}$, where
$T_{\mathrm{actual}}$ is the execution time for the DM and
$T_{\mathrm{perfect}}$ is the execution time for a machine with perfect
latency hiding, in which each memory access perceives a single cycle latency.
It can be seen that there are three bands in which the programs are highly,
moderately and poorly effective at hiding latency, and that the three selected
programs fall one within each of the bands.

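The LHE ratio is straightforward to compute from two execution times. The
function name and the cycle counts in the example are made up for
illustration; they are not measurements from this paper.

```python
def latency_hiding_effectiveness(t_perfect, t_actual):
    """Return T_perfect / T_actual; 1.0 means all latency is hidden."""
    if t_actual <= 0:
        raise ValueError("t_actual must be positive")
    return t_perfect / t_actual


if __name__ == "__main__":
    # Hypothetical cycle counts: 1000 with perfect hiding, 1250 measured.
    print(latency_hiding_effectiveness(1000, 1250))
```

A value close to 1 means the machine perceives almost none of the memory
differential, while smaller values mean more of the execution time is spent
waiting on memory.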
Prog DM Window Size
 	    	
TRFD 	   	  
ADM 	     
FLOQ      
DYFESM  	    
QCD      
MDG      
TRACK      
Table 	 Latency Hiding Eectiveness for MD
cycles
An MD of  was chosen because it is comparable to the cost of a second level
cache miss (the Pentium Pro has a  cycle L2 miss latency), and it assumes a
weak memory system capable of capturing no locality. In practice, for a high
performance architecture the memory system will be able to reduce the average
access time by using first and second level caches.

Figures 4, 5 and 6 show the variation in speedup with window size for the
access decoupled and superscalar architectures when the memory differential is
0 and 60 cycles. When MD is 0 we see that for small window sizes the DM
performs better than the SWSM with the same window size. This is due to the DM
having two windows for reordering operations compared to one for the SWSM.
This means there are fewer resource conflicts for window slots and greater
scope for reordering operations. It will also be noticed that the graphs show
the law of diminishing returns for increasing window size: once window sizes
are above  instructions, doubling the size does not double the speedup. All
the programs reach a cutoff point for window sizes between  and  instructions,
beyond which the SWSM performs more effectively. This is due to the benefit of
the larger instruction issue width available to the SWSM. This benefit is only
realised once the instruction window is large enough to utilise the available
issue width.

In Figures 4, 5 and 6 we see that once MD reaches 60 cycles there is no cutoff
point at which the SWSM performs better than the DM. This result applies even
for very large windows of  instruction slots. The difference between the
performance of the two machines must be solely due to the more effective data
prefetching of the DM. Operations on the SWSM, which on the DM would have been
executed on the DU, are causing address computations to execute later,
reducing the pipelining of memory accesses and decreasing the effectiveness of
the data prefetching. The difference in performance between the two machines
is also dependent on the type of program. For FLO52Q, which is highly
parallel, the gap between the DM and SWSM is large. However, for TRACK, which
has little parallelism, there is little difference between the two
architectures.

We can state therefore that for all the programs we have simulated the DM is
more effective at hiding large memory latencies than the SWSM. The difference
in performance is dependent on the parallelism and decoupling in the program.
Programs that decouple well show the largest improvement in performance for
the DM.

Figures 7, 8 and 9 show, for a range of memory differentials, the ratio of the
SWSM and DM window sizes that yield equivalent performance. We will refer to
this ratio as the equivalent window ratio. The ratio was derived by projecting
from the DM graph to the SWSM graph in Figures 4, 5 and 6. The graphs show the
way in which the ratio varies as a function of the memory latency. It can be
seen that as latencies approach 60 cycles the ratio gets larger. This is due
solely to the more effective data prefetching of the access decoupled machine.
As the memory latency increases, the DU waits longer for data to arrive and
the slippage between the two units grows. This means conceptually that the
effective single window size (see Figure 3) for the DM gets larger. In order
for the SWSM to achieve equivalent performance it requires a correspondingly
larger window.

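One way to read the projection from the DM graph to the SWSM graph is as an
interpolation on the SWSM speedup curve. The sketch below assumes each curve
is a list of (window size, speedup) points with non-decreasing speedup and
uses linear interpolation between points; the function names and the data are
hypothetical and are not taken from the figures.

```python
def window_for_speedup(curve, target):
    """Window size at which a piecewise-linear speedup curve reaches target."""
    for (w0, s0), (w1, s1) in zip(curve, curve[1:]):
        if s0 <= target <= s1:
            if s1 == s0:
                return float(w0)
            return w0 + (target - s0) * (w1 - w0) / (s1 - s0)
    return None  # the curve never reaches the target speedup


def equivalent_window_ratio(dm_curve, swsm_curve, dm_window):
    """SWSM window size needed to match the DM, divided by the DM window size."""
    dm_speedup = dict(dm_curve)[dm_window]
    swsm_window = window_for_speedup(swsm_curve, dm_speedup)
    return None if swsm_window is None else swsm_window / dm_window


if __name__ == "__main__":
    # Made-up curves, for illustration only.
    dm = [(10, 4.0), (20, 8.0), (40, 12.0)]
    swsm = [(10, 2.0), (20, 4.0), (40, 8.0), (80, 12.0)]
    print(equivalent_window_ratio(dm, swsm, 20))  # 2.0
```

A result of `None` corresponds to the case where the SWSM curve never reaches
the DM speedup within the simulated window sizes.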
The graphs in Figures 7, 8 and 9 also show that as the DM window size is
increased the equivalent window ratio reduces. This is due to the SWSM
architecture being able to reorder operations to a similar degree as the DM,
and also to the benefits of the larger issue width.

Significantly, it can be observed that for a realistic DM window size of
instructions and a memory latency of  cycles, the required increase in window
size for equivalent SWSM performance is dependent on the program, but lies
between  and  . Experiments with the other benchmark programs have also been
found to fall within this range. Larger windows introduce extra hardware
complexity and longer window logic delays. We can state therefore that the DM
requires smaller instruction windows and hence simpler window logic.

Figure 4: FLO52Q (CIW=9, CL=99). Speedup versus window size for the ADM and
SWSM at md=0 and md=60.

It has previously been shown that delays vary quadratically with window size.

Having shown that the DM performs consistently better than the SWSM, we now
compare the latency hiding effectiveness of the DM against a perfect latency
hiding technique, one in which all the memory differential is hidden. Table 1
shows the measured LHE for different window sizes when the memory differential
is  cycles.

The results show that when window sizes are small, increasing the window size
causes a reduction in the LHE. This is due to the extra parallelism on the DU
placing greater pressure on the memory system. The AU window is not yet large
enough to allow the AU to pipeline accesses sufficiently to hide the latency.
However, there eventually comes a point when the larger window size allows
more operations to execute in parallel and the LHE starts to improve. For six
of the programs this point is between  and  instructions. This result suggests
that for realistic window sizes ( to  instructions), increasing the window
size will result in the latency hiding mechanism of the DM performing less
effectively. Table 1 also shows that even with large window sizes we do not
approach the LHE of a DM with unlimited resources.

Figure 5: MDG (CIW=9, CL=99). Speedup versus window size for the ADM and SWSM
at md=0 and md=60.

Figure 6: TRACK (CIW=9, CL=99). Speedup versus window size for the ADM and
SWSM at md=0 and md=60.

Our findings show that for realistic window sizes the DM can hide latencies
better than the SWSM, but that as the window size increases its effectiveness
at hiding latency deteriorates. This illustrates the tensions that exist
between having greater parallelism and the access decoupling mechanism. As the
window size gets larger, the instruction level parallelism increases and the
execution times fall. However, the extra parallelism places greater pressure
on the decoupling mechanism, resulting in a decrease in LHE. The result is
that more of the critical path time is now composed of the memory
differential. There comes a point, however, when the AU window is large enough
to compensate for the extra parallelism on the DU, and more address operations
can be pipelined to hide the latency.

Figure 7: FLO52Q. Equivalent window ratio versus access decoupled window size
for md=0, 10, 20, 30, 40, 50 and 60.

Figure 8: MDG. Equivalent window ratio versus access decoupled window size for
md=0, 10, 20, 30, 40, 50 and 60.

In the short to medium term, high performance architectures will have window
sizes in the range that shows a reduction in the LHE. In future work we will
investigate mechanisms to improve the latency hiding of the DM. One
possibility is a bypass mechanism which captures the temporal locality exposed
by decoupling.

6 Conclusion and Future Work
This paper has focused on two objectives in the design space of future
microprocessors: the need to hide large memory latencies and the need to
reduce the complexity of window issue logic. We have investigated the use of
data prefetching on an access decoupled machine and a single window ooo
superscalar architecture.

Figure 9: TRACK. Equivalent window ratio versus access decoupled window size
for md=0, 10, 20, 30, 40, 50 and 60.

In this paper we have examined the relationship between memory latency, window
size and speedup for the two architectures. In order to remove the impact of
other architectural issues we have assumed an idealistic environment. This
environment provides good conditions for data prefetching and high levels of
ILP, and places the greatest pressure on the latency hiding mechanism.

We have found that the DM is more effective at hiding memory latency than the
SWSM. For large memory differentials (60 cycles) we have found that even for
large window sizes of  instructions, the DM consistently performs better than
the SWSM. Our results have also shown that to achieve the same speedup as a DM
the SWSM needs a window size between  and  larger. The increase in window size
required to achieve equivalent performance on the SWSM was also found to
increase with larger latencies.

To explain some of our findings we have introduced the concept of the
effective single window. The ESW conceptually illustrates how the DM is able
to perform better than an architecture with twice the size of instruction
window.

Our results have also shown how the latency hiding effectiveness of the DM
decreases as the window size increases to  instructions. Though the speedup
did increase with larger window size, the DM was not found to be as effective
at hiding latency. However, when windows were greater than  instructions the
LHE was found to improve. This behavior illustrates the tensions that exist
between higher ILP and the access decoupling mechanism.

This paper has shown that access decoupling can combine the benefits of
latency hiding with simplifying the window logic complexity. We conclude
therefore that there is a need for further work in the use of access
decoupling. In future work we will examine the effects of code expansion on
the DM and SWSM. We will also compare the difference in performance between a
static and dynamic partition of the code on the DM.

References
  A Berrached LD Coraor and PT Hulina A De
coupled AccessExecute Architecture for Ecient Ac
cess of Structured Data In Proc of the th Hawai
Int Conf on System Sciences volume  pp 	
	
Jan 

  D Bhandarkar and J Ding Performance Character
isation of the Pentium Pro Processor In Proc of the
rd Int Symp on High Performance Computer Ar
chitecture Feb 
 
 D Callahan K Kennedy and A Portereld Soft
ware Prefetching In th Ann Symp on Parallel Lan
guages and Operating Systems pp 	 Apr 
 	 Tzicker Chiueh Sunder  A Programmable Hardware
Prefetch Architecture for Numerical Loops In Proc
Supercomputing 	 pp 		 Nov 	
  M Berry et al The Perfect Club Benchmarks Eect
ive Performance Evaluation of Supercomputers Tech
report  CSRD University of Illinois Urbana
Champaign Urbana Illinois May 
  JWC FU and JH Patel Data Prefetching
Strategies for Vector Cache Memories In Proceed
ings The Fifth Int Parallel Processing Symp pp
 Apr 
  GP Jones and NP Topham A Limitation Study
into Access Decoupling EuroPar	
 Springer Ver
lag Vol LNCS 
 pp  Aug 
  GP Jones and NP Topham The Eect of Restricted
Instruction Issue on an Access Decoupled Machine
ParCo	
 Germany Sept 
  GP Jones and NP Topham Simplifying Hardware
for Out of Order Execution using the Decoupling
Paradigm Tech report CSG
 Edinburgh Uni
versity Sept 
  MKFarrens and ARPleszkun Implementation of
the PIPE Processor IEEE Computer pp  Jan

  S Palacharla NP Jouppi and JE Smith
ComplexityEective Superscalar Processors In th
Ann Int Symp on Computer Architecture 
  Wm A Wulf An Evaluation of the WM Architecture
In Proc Int Symp on Computer Architecture May

