Best-Offset Hardware Prefetching by Michaud, Pierre
March 2016 
BOP: yet another data prefetcher 
 
•  Contribution: offset prefetcher with new mechanism for setting the 
prefetch offset dynamically 
- Improvement over Sandbox prefetcher (Pugsley et al., HPCA 2014) 
 
•  Good performance on the SPEC CPU benchmarks 
- tuned BOP won the 2015 Data Prefetching Championship 
 
•  Simple hardware 
Offset prefetching (L2 cache) 
•  On an L2 access to line X, prefetch line X+D into the L2
- D is the prefetch offset
•  Next-line prefetching: offset = 1
Offset prefetching with physical addresses 
•  On an L2 access to line X, prefetch line X+D into the L2 if X+D is in the same page as X
•  Offset prefetching works better with large pages (or with virtual addresses)
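The same-page rule can be illustrated with a minimal sketch. The 64B line size, the 4KB default page size, and the function name `prefetch_target` are assumptions made for illustration only.

```python
LINE_BYTES = 64                           # assumed cache line size
PAGE_BYTES = 4096                         # assumed page size (4KB)
LINES_PER_PAGE = PAGE_BYTES // LINE_BYTES


def prefetch_target(line_x, offset_d, lines_per_page=LINES_PER_PAGE):
    """Return the line to prefetch for an L2 access to line_x, or None
    when line_x + offset_d is not in the same page as line_x."""
    target = line_x + offset_d
    if target < 0:
        return None
    same_page = (target // lines_per_page) == (line_x // lines_per_page)
    return target if same_page else None


# With 4KB pages an offset of 100 lines always crosses a page boundary;
# with 2MB pages (32768 lines per page) it usually does not.
small_page = prefetch_target(0, 100)
large_page = prefetch_target(0, 100, lines_per_page=32768)
```

Here `small_page` is None and `large_page` is 100, which is one way to see why large offsets only become usable with large pages.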
Offset prefetching is not new 
•  Not mainstream either (at least in academia) 
- Ki & Knowles, "Adaptive data prefetching using cache information", ICS 
1997 
- Pugsley et al., "Sandbox prefetching: safe run-time evaluation of 
aggressive prefetchers", HPCA 2014 
- others?
 
•  Different from stream prefetching 
- does not try to detect streams 
•  Different from delta-correlation prefetching 
- delta-correlation predicts which line will be accessed next
Periodic strides 
•  Accesses from the core (P) to the DL1 cache: a, a+96, a+192, a+288, a+384...
•  Resulting line accesses from the DL1 to the L2 cache: X, X+1, X+3, ...
- non-constant periodic line stride (1,2,1,2,...)
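A short sketch shows how a constant byte stride of 96 turns into the periodic line stride (1,2,1,2,...). It assumes 64B lines and a line-aligned base address; the function names are illustrative.

```python
LINE_BYTES = 64  # assumed cache line size


def line_addresses(base, byte_stride, count, line_bytes=LINE_BYTES):
    """Line numbers touched by accesses base, base+stride, base+2*stride, ..."""
    return [(base + i * byte_stride) // line_bytes for i in range(count)]


def line_strides(lines):
    """Differences between consecutive line numbers."""
    return [b - a for a, b in zip(lines, lines[1:])]


lines = line_addresses(base=0, byte_stride=96, count=7)
strides = line_strides(lines)
# lines   -> [0, 1, 3, 4, 6, 7, 9]
# strides -> [1, 2, 1, 2, 1, 2]
```

A base address that is not line-aligned may shift where the pattern starts, but the strides over one period still sum to 3.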
Offset = sum of strides in a period 
•  or a multiple of that number (for timeliness)
•  No need for a complicated prefetcher here!
[Figure: lines X to X+7 versus time of access]
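As a rough check of this rule, the sketch below builds the (1,2) periodic stream and measures, for a given offset, how many prefetch targets are lines the stream later accesses. The helper names `stream` and `useful_fraction` are hypothetical, and timing is ignored here.

```python
def stream(start, stride_period, count):
    """Line numbers of a stream whose strides repeat stride_period."""
    lines, x = [], start
    for i in range(count):
        lines.append(x)
        x += stride_period[i % len(stride_period)]
    return lines


def useful_fraction(lines, offset):
    """Fraction of prefetch targets X+offset that the stream also accesses.
    Targets beyond the last observed line are not counted."""
    accessed = set(lines)
    last = max(lines)
    targets = [x + offset for x in lines if x + offset <= last]
    if not targets:
        return 0.0
    return sum(t in accessed for t in targets) / len(targets)


period = (1, 2)
lines = stream(0, period, 12)           # 0, 1, 3, 4, 6, 7, 9, 10, ...
offset = sum(period)                    # 3
all_useful = useful_fraction(lines, offset)       # every target is accessed
also_useful = useful_fraction(lines, 2 * offset)  # a multiple works too
half_useful = useful_fraction(lines, 2)           # offset 2 hits half the time
```

In this sketch, offsets 3 and 6 give a fraction of 1.0 while offset 2 gives 0.5.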
Interleaved streams 
•  Stream 1: accesses a, a+96, a+192, a+288, a+384...; L2 line accesses X, X+1, X+3, X+4, X+6...
- offset = multiple of 3
•  Stream 2: accesses b, b+128, b+256, b+384, b+512...; L2 line accesses Y, Y+2, Y+4, ...
- offset = multiple of 2
•  Prefetch both streams with offset = multiple of 6
[Figure: lines X to X+7 and Y to Y+7 versus time of access]
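The offset that suits both streams is a common multiple of the two period sums (3 and 2). A minimal sketch, with the illustrative name `common_offset`, computes the least common multiple:

```python
from math import gcd


def common_offset(period_sums):
    """Least common multiple of the per-stream period sums."""
    result = 1
    for s in period_sums:
        result = result * s // gcd(result, s)
    return result


both = common_offset([3, 2])   # 6
# 3 suits the first stream only: it is not a multiple of 2.
three_suits_second = (3 % 2 == 0)
```

Any multiple of `both` also suits the two streams; larger multiples trade coverage for timeliness.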
[Figure: benchmark libquantum, assuming large pages]
•  The best offset depends on the application 
- full-fledged offset prefetchers select the offset dynamically 
•  The best offset may be > 100 
- when not limited by 4KB page boundaries 
 
•  Prefetch timeliness is essential for performance 
- high prefetch coverage is not sufficient 
Dynamic offset selection 
•  Define a list of possible offsets 
- e.g., all numbers between -10 and +30 
- e.g., numbers between 1 and 255 with no prime factor greater than 5 
 
 
•  Define a mechanism for evaluating offsets 
 
•  Want simple hardware 
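The second example list (numbers between 1 and 255 with no prime factor greater than 5) can be generated with a short sketch. The function names are illustrative.

```python
def has_only_small_prime_factors(n, primes=(2, 3, 5)):
    """True if n has no prime factor other than those in primes."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1


def offset_list(lo=1, hi=255):
    """Candidate offsets in [lo, hi] with no prime factor greater than 5."""
    return [n for n in range(lo, hi + 1) if has_only_small_prime_factors(n)]


offsets = offset_list()
# offsets starts 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, ... (7 and 11 are skipped)
```

Such a list is dense among small offsets and sparse among large ones, so it stays short while still reaching large offsets.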
Sandbox Prefetcher (SBP) 
•  Pugsley et al., HPCA 2014 
•  Introduces Sandbox method 
- evaluate offset by recording fake-prefetch addresses in Sandbox 
- on L2 cache access, check Sandbox; if hit, increment score for offset
•  Multi-degree prefetcher: multiple prefetches per cache access
- all the offsets with high enough coverage are potential candidates   
- smaller offsets first 
•  Prefetch timeliness not considered, only coverage 
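The Sandbox idea can be sketched as follows. This is a simplification under stated assumptions: a plain Python set per candidate offset stands in for the Sandbox, all candidates are scored in parallel, and the class name `SandboxEvaluator` is hypothetical. The hardware organization used by Pugsley et al. is not modelled.

```python
class SandboxEvaluator:
    """Scores candidate offsets by recording fake-prefetch addresses."""

    def __init__(self, offsets):
        self.scores = {d: 0 for d in offsets}
        # One set of fake-prefetch addresses per offset (an assumption).
        self.sandboxes = {d: set() for d in offsets}

    def access(self, line_x):
        for d in self.scores:
            # A hit means a prefetch with offset d would have covered line_x.
            if line_x in self.sandboxes[d]:
                self.scores[d] += 1
            # Record the fake prefetch; nothing is actually fetched.
            self.sandboxes[d].add(line_x + d)


evaluator = SandboxEvaluator([1, 2, 3])
for line in [0, 1, 3, 4, 6, 7, 9, 10]:    # periodic stride (1,2,1,2,...)
    evaluator.access(line)
# evaluator.scores -> {1: 4, 2: 3, 3: 6}
```

The scores measure coverage only: a hit says nothing about whether the prefetched line would have arrived before the access.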
 
Best-Offset Prefetcher (BOP) 
 
•  Try to identify the single best offset 
•  Degree-one prefetch 
- one cache access, one prefetch request
•  New method for evaluating offsets 
•  Take into account both coverage and timeliness 
New method for evaluating offsets 
•  When a prefetch completes, store in a recent requests (RR) table the base 
address of the prefetch 
- prefetched line is X+D, base address is X 
•  To evaluate offset d, upon access to X, check if X-d is in RR table  
- if hit in RR table, increment score for offset d 
•  Evaluate all the offsets in the list, one by one 
•  When learning phase finished, pick offset with highest score, update 
prefetch offset D  
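The learning steps above can be sketched in a few lines. Several details are assumptions made for illustration rather than details from the slides: the RR table is an unbounded Python set, a learning phase ends after a fixed number of rounds (`rounds_per_phase`), ties go to the first offset in the list, and memory latency is a constant number of accesses in the hypothetical `simulate` helper.

```python
from collections import deque


class BestOffsetLearner:
    def __init__(self, offsets, rounds_per_phase=8, initial_offset=1):
        self.offsets = list(offsets)
        self.rounds_per_phase = rounds_per_phase  # hypothetical phase length
        self.D = initial_offset                   # current prefetch offset
        self.rr = set()                           # stand-in for the RR table
        self._reset_phase()

    def _reset_phase(self):
        self.scores = {d: 0 for d in self.offsets}
        self.index = 0
        self.rounds = 0

    def on_prefetch_complete(self, line_y, offset_used):
        # Prefetched line is X+D, so the base address is Y-D.
        self.rr.add(line_y - offset_used)

    def on_access(self, line_x):
        # Evaluate one offset per access, going through the list in order.
        d = self.offsets[self.index]
        if (line_x - d) in self.rr:
            self.scores[d] += 1
        self.index += 1
        if self.index == len(self.offsets):
            self.index = 0
            self.rounds += 1
            if self.rounds == self.rounds_per_phase:
                # End of learning phase: adopt the highest-scoring offset.
                self.D = max(self.offsets, key=lambda o: self.scores[o])
                self._reset_phase()
        return line_x + self.D                    # line to prefetch


def simulate(learner, lines, latency):
    """Feed accesses to the learner; prefetches complete `latency` accesses later."""
    in_flight = deque()  # (completion_time, prefetched_line, offset_used)
    for t, x in enumerate(lines):
        while in_flight and in_flight[0][0] <= t:
            _, y, used = in_flight.popleft()
            learner.on_prefetch_complete(y, used)
        target = learner.on_access(x)
        in_flight.append((t + latency, target, target - x))
    return learner.D


chosen = simulate(BestOffsetLearner(range(1, 9)), list(range(200)), latency=4)
```

On this sequential stream, offsets smaller than the latency never hit in the RR table because the corresponding prefetch has not completed yet, so `chosen` is 4. That is how the score reflects timeliness as well as coverage.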




[Figure: on an access to line X, line X+D is prefetched and X-d is looked up in the RR table (hit/miss) to evaluate offset d and pick the best; when prefetched line Y arrives, Y-D is written into the RR table]
•  One score per offset in the list
- in the paper, 52 offsets with 5-bit scores: 52 × 5 = 260 bits
•  RR table
- several possible implementations
- in the paper, direct-mapped, 256 entries, 12-bit tags: 256 × 12 = 3072 bits
•  3 adders
- e.g., 64B line, 2MB page: 15-bit adders
•  Misc. logic
- iterate on the list, increment scores, find highest score,...
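The storage figures follow from simple arithmetic, reproduced in the sketch below. The function names are illustrative; the adder width is taken as the number of bits needed to address a line within a page.

```python
def storage_bits(num_offsets=52, score_bits=5, rr_entries=256, tag_bits=12):
    """Bits of storage for the score table and the RR table."""
    score_storage = num_offsets * score_bits
    rr_storage = rr_entries * tag_bits
    return score_storage, rr_storage


def adder_width(page_bytes=2 * 1024 * 1024, line_bytes=64):
    """Bits needed to index a line within a page."""
    lines_per_page = page_bytes // line_bytes
    return (lines_per_page - 1).bit_length()


score_storage, rr_storage = storage_bits()   # 260 and 3072 bits
width = adder_width()                        # 15 bits for 2MB pages, 64B lines
```

With 4KB pages and 64B lines the same calculation gives 6 bits.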
BOP's main weakness 
•  Tradeoff between prefetch coverage and timeliness 
- small offsets give higher prefetch coverage 
- large offsets hide memory latency better 
•  BOP selects the offset yielding the most timely prefetches  
- most of the time, this is OK
•  Prefetch timeliness is essential for performance 
•  BOP is very effective on the SPEC CPU benchmarks 
- though not always optimal 
•  Simple hardware 
 
•  BOP is a degree-one prefetcher: one prefetch per L2 access
- multi-degree prefetching not the right solution for timeliness issues 
•  Maybe we don't need multi-degree prefetching at all 
- if we obtain 100% prefetch coverage with degree-2 prefetching, it 
means that we are doubling the memory traffic 
thanks for your attention 
