A New Technology of Multi-core Prefetching  by Juan, Fang & Hongbo, Zhang
Procedia Engineering 15 (2011) 3482 – 3486
1877-7058 © 2011 Published by Elsevier Ltd.
doi:10.1016/j.proeng.2011.08.652
Available online at www.sciencedirect.com
www.elsevier.com/locate/procedia
Advanced in Control Engineering and Information Science 
A New Technology of Multi-core Prefetching 
Fang Juan a,*, Zhang Hongbo a
a College of Computer Science, Beijing University of Technology, Beijing, China 
Abstract 
Memory access latency is a main bottleneck limiting further improvement of multi-core processor 
performance, and data prefetching is an effective technique for hiding it. This paper proposes a 
new hardware prefetching technique based on Future execution that integrates Runahead execution. 
Future execution uses one core of a CMP to prefetch data for a thread running on another core. Runahead 
execution is an out-of-order execution technique that allows a microprocessor to pre-process instructions 
during cache miss cycles instead of stalling. We name the prefetching technique Future-runahead 
execution. Experimental results show that the relative execution time of Future-runahead execution 
on SPEC2000 programs is reduced by 8% and the L2 cache hit ratio is increased by 9% compared with 
Future execution.
© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of [CEIS 2011] 
Keywords: CMP;  Prefetch;  Memory access latency;  IPC 
1. Related work 
There are three prefetching strategies: software-controlled, hardware-controlled and hybrid 
hardware/software-controlled [1]. Software prefetching and hybrid hardware/software prefetching 
require developers or compilers to insert instructions into the source code [2]. Hardware prefetching 
needs no intervention from the developer or the compiler, and it can take advantage of multiple cores. 
This paper therefore discusses a hardware prefetching technique based on multi-core processors: 
Future-runahead execution, an improvement to Future execution that integrates Runahead execution. 
The rest of this section introduces the two techniques in turn. 
* Corresponding author. Tel.: +86-10-67392984; fax: +86-10-67391742. 
E-mail address: fangjuan@bjut.edu.cn 
Open access under CC BY-NC-ND license.
1.1. Structure of Future execution 
Future execution (FE) is built on a conventional chip multiprocessor containing two cores (a regular 
core and an FE core), a shared on-chip L2 cache and a value predictor. Each core has a superscalar 
execution engine with a private L1 instruction cache and a private L1 data cache [3]. Figure 1 shows 
the FE architecture. 
Fig. 1. The FE architecture 
1.2.  Implementation of Future execution 
Future execution uses one core of a CMP to prefetch data for a thread running on another core. The 
original unmodified program executes on the first core, and the prefetching core simply executes a copy 
of all non-control instructions after they have executed on the primary core. As each instruction commits 
on the way to the second core, it updates the value predictor with its current result, and its output is 
replaced by a prediction of the likely output that the nth future instance of this instruction will 
produce. After that, the committed instruction is sent to the second core. Instructions are injected in 
the commit order of the first core to preserve the program semantics. Relying on the principle of 
locality, FE assumes that the same sequence of instructions will execute again in the future, so the 
second core essentially executes n “iterations” ahead of the non-speculative program running on the first 
core, using the predicted future values. The instructions speculatively executed on the second core issue 
load requests for data that the main program will probably reference in the future [3].
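
The prediction step described above can be illustrated with a minimal sketch. It assumes a stride two-delta predictor (Table 1 lists an ST2D predictor) and simplifies FE by predicting load addresses directly, rather than replacing every instruction result and letting the second core recompute the addresses. The confidence estimator is omitted, and all names (PredictorEntry, FutureValuePredictor, future_load_addresses) are hypothetical, not taken from [3].

```python
from dataclasses import dataclass


@dataclass
class PredictorEntry:
    """State kept per static instruction (field names are illustrative)."""
    last: int            # most recent committed result
    last_delta: int = 0  # most recently observed difference
    stride: int = 0      # confirmed stride used for predictions


class FutureValuePredictor:
    """Stride two-delta predictor that looks `distance` instances ahead."""

    def __init__(self, distance):
        self.distance = distance   # n in the text
        self.table = {}            # instruction address -> PredictorEntry

    def update_and_predict(self, pc, result):
        """Train on a committed result and return the predicted result of
        the n-th future instance of the same instruction."""
        entry = self.table.setdefault(pc, PredictorEntry(last=result))
        delta = result - entry.last
        # Two-delta rule: adopt a stride only after seeing it twice in a row.
        if delta == entry.last_delta:
            entry.stride = delta
        entry.last_delta = delta
        entry.last = result
        return result + self.distance * entry.stride


def future_load_addresses(committed, predictor):
    """Addresses the FE core would request for a stream of (pc, address)
    pairs committed by the regular core. Predictions equal to the current
    address are skipped, since they would bring in nothing new."""
    requests = []
    for pc, address in committed:
        predicted = predictor.update_and_predict(pc, address)
        if predicted != address:
            requests.append(predicted)
    return requests
```

With a distance of 4 and a load that walks memory in 64-byte steps starting at address 1000, the sketch issues no request for the first two commits (the stride is not yet confirmed) and then requests addresses 1384 and 1448, four iterations ahead of the regular core.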
1.3. Feature of Future execution 
FE assumes that most cache misses are caused by repeatedly executed loads with a relatively small 
number of dynamic instructions between consecutive executions of these loads, and that the sequence of 
executed instructions leading up to the loads tends to remain similar [3]. Hence, for each executed 
critical load, there is a high probability that the same load will be executed again soon. FE exploits 
this property: a value predictor predicts the future data that the regular core will access, and the FE 
core repeatedly executes the instructions committed by the regular core and writes the results to the L2 
cache as the prefetched data that the regular core will soon use. One challenge FE cannot resolve is the 
first execution of a new load instruction: the pipeline stalls on the cache miss, because FE cannot 
provide accurate prefetching for instructions that have not yet been executed on the regular core. 
1.4. Brief introduction of Runahead execution 
 Runahead execution was first proposed for in-order processors and further extended to perform 
prefetching for out-of-order architectures [4]. The runahead architecture “nullifies” and retires all memory 
operations that miss in the L2 Cache and remain unresolved at the time they get to the ROB head. It starts 
by taking a checkpoint of the architectural state. Then it retires the missing load and the processor enters 
runahead mode. In this mode the instructions proceed largely normally except for two major differences. 
First, the instructions that depend on the result of the load that was “nullified” do not execute but are 
nullified as well. They commit an invalid value and retire as soon as they reach the head of the ROB. 
Second, store instructions executed in runahead mode do not overwrite data in memory. When the 
original “nullified” memory operation completes, the processor rolls back to the checkpoint and resumes 
normal execution. All register values produced in runahead mode are discarded [5].
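
The runahead behaviour described above can be sketched as follows. This is a simplified model under stated assumptions, not the design in [4] or [5]: memory addresses are precomputed, only the validity of registers is tracked rather than their values, and the names Instr and run_ahead are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    op: str                     # "load", "store" or "alu"
    dest: Optional[str]         # destination register, None for stores
    srcs: tuple                 # source register names
    addr: Optional[int] = None  # memory address (precomputed for simplicity)


def run_ahead(window, regs, cached):
    """Pre-execute `window`, whose first instruction is a load that missed
    in the L2 cache. Returns the addresses prefetched in runahead mode;
    `regs` is restored to its checkpointed contents before returning."""
    checkpoint = dict(regs)        # checkpoint the architectural state
    invalid = {window[0].dest}     # result of the "nullified" load
    prefetches = []
    for ins in window[1:]:
        if any(src in invalid for src in ins.srcs):
            # Depends on a nullified result: nullify this one as well.
            if ins.dest is not None:
                invalid.add(ins.dest)
            continue
        if ins.op == "store":
            continue               # runahead stores never update memory
        if ins.op == "load" and ins.addr not in cached:
            prefetches.append(ins.addr)   # independent miss becomes a prefetch
            invalid.add(ins.dest)         # its data is not available yet
            continue
        invalid.discard(ins.dest)
        regs[ins.dest] = "speculative"    # placeholder for a runahead result
    # The original miss has returned: roll back and discard runahead results.
    regs.clear()
    regs.update(checkpoint)
    return prefetches
```

In this model, a load that does not depend on the missing load and itself misses in the cache is the only kind of instruction that produces a prefetch. Everything else either propagates the invalid state or is discarded at rollback.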
2. Future-Runahead execution 
As discussed in Section 1.3, FE cannot provide accurate prefetching for instructions that have not yet 
been executed on the regular core, and it may cause cache pollution for programs with random memory 
access patterns, so cache misses cannot be avoided. In this case, the regular core in the FE 
architecture can enter runahead mode. This not only avoids stalling the pipeline on the regular core's 
likely cache misses but also provides accurate data or instruction prefetching and eliminates possible 
cache misses when the pipeline recovers from the runahead cycles. We therefore propose combining 
Future execution with Runahead execution by introducing runahead execution into the regular core of FE. 
Figure 2 shows the FE-runahead architecture. 
Fig. 2. The FE-runahead architecture 
2.1. Implementation of FE-runahead  
FE-runahead has two running modes. (1) If an L2 cache miss occurs, the regular core enters runahead 
mode and checkpoints the architectural register file. It then gives the missing instruction, and the 
instructions that depend on it, a bogus value and removes them from the instruction window. Subsequent 
instructions are executed as long as they do not depend on the missing instruction. The judge logic 
shuts down the link between the two cores and stops instructions from being copied to the future 
execution core, so as to avoid cache pollution. When the missing cache block is ready, the regular core 
rolls back to the checkpoint and re-enters normal mode. During runahead cycles, the regular core only 
discovers likely cache misses and prefetches the corresponding data or instructions in advance. 
(2) If there is no cache miss on the regular core, FE-runahead behaves the same as FE: the original 
unmodified program executes on the first core, and the prefetching core simply executes a copy of all 
non-control instructions after they have executed on the primary core. The future execution core 
executes n “iterations” of these instructions using the future values predicted by the value predictor, 
and writes the results to the L2 cache as the values the regular core will probably reference in the 
future. 
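
The switching between the two modes can be summarized as a small state machine. The sketch below is only an illustration of the behaviour described in this section; the class and method names (JudgeLogic, on_l2_miss, on_miss_resolved, on_commit) are hypothetical and do not describe the actual hardware interface.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    RUNAHEAD = "runahead"


class JudgeLogic:
    """Gates the link that copies committed instructions to the FE core."""

    def __init__(self):
        self.mode = Mode.NORMAL
        self.forwarded = []     # instructions sent to the FE core

    @property
    def link_open(self):
        # The link is shut while the regular core is in runahead mode.
        return self.mode is Mode.NORMAL

    def on_l2_miss(self):
        """Regular core checkpoints and enters runahead mode."""
        self.mode = Mode.RUNAHEAD

    def on_miss_resolved(self):
        """Missing block is ready: roll back and resume normal mode."""
        self.mode = Mode.NORMAL

    def on_commit(self, instr, is_control=False):
        """Forward a committed non-control instruction if the link is open.
        Returns True when the instruction was sent to the FE core."""
        if self.link_open and not is_control:
            self.forwarded.append(instr)
            return True
        return False
```

Keeping the gate in one place reflects the purpose stated above: instructions retired with bogus values during runahead cycles never reach the value predictor or the FE core, so they cannot pollute the cache.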
2.2. Hardware support 
Compared with FE, we add a judge logic unit. The regular core reads the content of its architecture 
register (which records the state of the core at a running-mode transition) and decides whether to 
enter runahead mode. Table 1 lists the detailed parameters of the simulated processor. 
Table 1. Parameters of the simulated processor 
Component               Configuration 
Processor               2 cores, L2 cache, 4-issue, out-of-order 
Cache                   Private I-cache and D-cache for each core 
                        64KB IL1, 64KB DL1, 1MB L2 
                        2-way L1, 4-way L2 
Other hardware support  Future value predictor: 4K-entry ST2D, 3bc conf. estimator 
                        Judge logic unit: 3 int, 2 mem, 1 fp 
                        Hardware stream prefetcher: between L2 and main memory, 
                        16 streams, max. prefetch distance: 8 strides 
2.3. Experiment platform 
We use Simics to implement the FE and FE-runahead architectures and use GEMS (General 
Execution-driven Multiprocessor Simulator) to run test programs from the SPEC CINT2000 base benchmark 
suite (gzip, vpr, gcc, mcf, crafty, parser, bzip2, gap, etc.) to obtain the relative execution time of 
the two architectures; a lower value means higher performance. In addition, we use Multi2Sim with four 
mini benchmark programs to obtain the L2 cache hit ratio of the two architectures. 
2.4. Experiment result 
Figure 3 compares the relative execution time of FE and FE-runahead on the benchmark programs. 
Fig. 3. The relative execution time of FE-runahead and FE 
As Fig. 3 shows, the running time of FE-runahead is shorter than that of FE for most benchmark 
programs, except for mcf, parser and twolf (three compute-intensive programs). The reason is that for 
these programs the regular core executes many invalid runahead cycles: in runahead mode it merely 
executes some instructions more than once and does not provide useful prefetched data for normal mode 
after the runahead cycle completes, so the invalid runahead cycles add overhead to the regular core. 
Overall, the average relative execution time is reduced by 8%, which means that FE-runahead performs 
better than FE. 
Figure 4 compares the L2 cache hit ratio of FE and FE-runahead. 
Fig. 4. The L2 cache hit ratio of FE-runahead and FE 
Fig. 4 shows that for three of the four mini benchmark programs (sort, args and printf), FE-runahead 
achieves the higher L2 cache hit ratio, while FE achieves the higher hit ratio on the math program. 
This is consistent with the SPEC2000 results: running a compute-intensive program in runahead mode may 
add processor overhead and reduce performance. Overall, the L2 cache hit ratio of FE-runahead is about 
9% higher than that of FE. 
From the discussion above, we conclude that although FE obtains better results on compute-intensive 
programs, FE-runahead execution achieves higher performance and a higher L2 cache hit ratio in general. 
Acknowledgements 
This paper is supported by the Scientific Research Common Program of Beijing Municipal Commission 
of Education. 
References 
[1] Byna S, Chen Y, Sun X-H. Taxonomy of Data Prefetching for Multicore Processors. Journal of Computer Science and 
Technology, May 2009, 24(3): 405-417 
[2] Mutlu O, Stark J, Wilkerson C, Patt Y N. Runahead Execution: An Alternative to Very Large Instruction Windows for 
Out-of-order Processors. In Proc. the 9th International Symposium on High-Performance Computer Architecture. USA, 2003:3~7 
[3] Ganusov I, Burtscher M. Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors. In Proc. the 
14th Parallel Architectures and Compilation Techniques. St. Louis, USA, Sept. 2005:231~242 
[4] Mutlu O, Stark J, Wilkerson C, Patt Y N. Runahead Execution: An Alternative to Very Large Instruction Windows for 
Out-of-order Processors. In Proc. the 9th International Symposium on High-Performance Computer Architecture, San Jose, USA, 
Feb. 3-7, 2003, p.129. 
[5] Zhou H. Dual-core Execution: Building a Highly Scalable Single-thread Instruction Window. In Proc. the 14th Parallel 
Architectures and Compilation Techniques. St. Louis, USA, 2005:350~360 
