Replacement and placement policies for prefetched lines. by Sze, Siu Ching. & Chinese University of Hong Kong Graduate School. Division of Computer Science and Engineering.
REPLACEMENT AND PLACEMENT POLICIES FOR 
PREFETCHED LINES 
BY 
SZE SIU CHING 
A DISSERTATION 
SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS 
FOR THE DEGREE OF MASTER OF PHILOSOPHY 
DIVISION OF COMPUTER SCIENCE AND ENGINEERING 
THE CHINESE UNIVERSITY OF HONG KONG 
JUNE 1998 
Copyright © 1998 by Computer Science and Engineering, the Chinese 
University of Hong Kong 
All rights reserved. 
Acknowledgements 
I want to give my hearty thanks to my family. 
I also want to give my hearty thanks to Dr. Gilbert H. Young, Mr. S. C. Lau, 
Mr. S. Y. Yiu and especially to Ms. Priscilla K. Chan. 
It is clear there are many to whom I owe my thanks and acknowledgments. In 
order not to miss any one of them, I just want to say thanks. 
Thanks to those who have helped me, taught me, carried me and supported 
me during my studies. Thanks a lot. 
Abstract 
As a result of technological advances, there is a widening gap between the rate 
at which a processing unit can consume operands and the rate at which the memory 
system can supply them. The introduction of caches helps alleviate this problem, 
and the design of cache memory is critical to overall system performance. Due to 
the limited space on processors, on-chip caches are usually small. Therefore, 
the cache space should be used carefully and efficiently. Accurate prefetching 
and careful replacement of cache lines¹ are essential to improving performance. 
In order to further improve cache performance, different prefetching algorithms 
for caches have been proposed [BaC91] [KlL91] [Smi78a]. With prefetching, data 
can be available before their actual use. However, due to the large volume and 
the random behaviour of data usage, it is difficult to prefetch data accurately, 
and this results in cache pollution. 
Lau [Lau96] has proposed an accurate prefetching scheme, the Instruction 
Opcode and Addressing Mode Prefetching (IAP), which makes use of the future 
reference patterns embedded in certain instructions. Further to that study, it 
is also found that most data prefetched by the IAP scheme are likely to be 
referenced only once. Therefore, we propose to use a mixed replacement policy 
together with the IAP scheme to minimize the number of thrashing misses. Using 
both Least Recently Used (LRU) and Instant Zero (IZ) replacement algorithms 
with the IAP scheme outperforms using LRU only. 
¹ A line is a block of data in the context of the cache. Usually, line size is 
the same as block size. In our paper, block refers to data located in memory or 
lower-level caches, and line refers to the data in the level 1 cache. However, 
the two terms are interchangeable. 
Furthermore, in order to optimize the benefit of temporal locality and minimize 
the cache pollution problem, another hardware replacement design is presented in 
this thesis. We propose a priority pre-updating scheme, which updates the 
priorities of cache lines before their normal update point. Simulation 
experiments are done with this priority pre-updating scheme in a cache model 
with the prefetch-on-miss prefetching scheme. From the results, it is found that 
priority pre-updating helps minimize the number of thrashing misses, optimizes 
the benefit of temporal locality and reduces cache pollution. In order to obtain 
a promising cache performance improvement, we add a victim cache to hold those 
fresh prefetched lines that are displaced from the data cache. The experimental 
results show that using priority pre-updating with the victim cache can achieve 
up to 50% reduction in memory delay. 
Besides the research on replacement of cache lines, we propose another hardware 
design, which concerns the placement of IAP lines. The cache lines prefetched by 
the Instruction Opcode and Addressing Mode Prefetching possess a referenced-once 
property, i.e., most of them are referenced one and only one time before the 
program terminates. Owing to this special reference behavior, a prefetch cache, 
which is dedicated to lines prefetched by the Instruction Opcode and Addressing 
Mode Prefetching scheme, is implemented separately. The prefetch cache can 
reduce memory delay time by up to 99%. 
Contents 
1 Introduction 1 
1.1 Overlapping Computations with Memory Accesses 3 
1.2 Cache Line Replacement Policies 4 
1.3 The Rest of This Paper 4 
2 A Brief Review of IAP Scheme 6 
2.1 Embedded Hints for Next Data References 6 
2.2 Instruction Opcode and Addressing Mode Prefetching 8 
2.3 Chapter Summary 9 
3 Motivation 11 
3.1 Chapter Summary 14 
4 Related Work 15 
4.1 Existing Replacement Algorithms 16 
4.2 Placement Policies for Cache Lines 18 
4.3 Chapter Summary 20 
5 Replacement and Placement Policies of Prefetched Lines 21 
5.1 IZ Cache Line Replacement Policy in IAP scheme 22 
5.1.1 The Instant Zero Scheme 23 
5.2 Priority Pre-Updating and Victim Cache 27 
5.2.1 Priority Pre-Updating 27 
5.2.2 Priority Pre-Updating for Cache 28 
5.2.3 Victim Cache for Unreferenced Prefetch Lines 28 
5.3 Prefetch Cache for IAP Lines 31 
5.4 Chapter Summary 33 
6 Performance Evaluation 34 
6.1 Methodology and metrics 34 
6.1.1 Trace Driven Simulation 35 
6.1.2 Caching Models 36 
6.1.3 Simulation Models and Performance Metrics 39 
6.2 Simulation Results 43 
6.2.1 General Results 44 
6.3 Simulation Results of IZ Replacement Policy 49 
6.3.1 Analysis To IZ Cache Line Replacement Policy 50 
6.4 Simulation Results for Priority Pre-Updating with Victim Cache . 52 
6.4.1 PPUVC in Cache with IAP Scheme 52 
6.4.2 PPUVC in prefetch-on-miss Cache 54 
6.5 Prefetch Cache 57 
6.6 Chapter Summary 63 
7 Architecture Without LOAD-AND-STORE Instructions 64 
8 Conclusion 66 
A CPI Due to Cache Misses 68 
A.1 Varying Cache Size 68 
A.1.1 Instant Zero Replacement Policy 68 
A.1.2 Priority Pre-Updating with Victim Cache 70 
A.1.3 Prefetch Cache 73 
A.2 Varying Cache Line Size 75 
A.2.1 Instant Zero Replacement Policy 75 
A.2.2 Priority Pre-Updating with Victim Cache 77 
A.2.3 Prefetch Cache 80 
A.3 Varying Cache Set Associative 82 
A.3.1 Instant Zero Replacement Policy 82 
A.3.2 Priority Pre-Updating with Victim Cache 84 
A.3.3 Prefetch Cache 87 
B Simulation Results of IZ Replacement Policy 89 
B.1 Memory Delay Time Reduction 89 
B.1.1 Varying Cache Size 89 
B.1.2 Varying Cache Line Size 91 
B.1.3 Varying Cache Set Associative 93 
C Simulation Results of Priority Pre-Updating with Victim Cache 95 
C.1 PPUVC in IAP Scheme 95 
C.1.1 Memory Delay Time Reduction 95 
C.2 PPUVC in Cache with Prefetch-On-Miss Only 101 
C.2.1 Memory Delay Time Reduction 101 
D Simulation Results of Prefetch Cache 107 
D.1 Memory Delay Time Reduction 107 
D.1.1 Varying Cache Size 107 
D.1.2 Varying Cache Line Size 109 
D.1.3 Varying Cache Set Associative 111 
D.2 Results of the Three Replacement Policies 113 
D.2.1 Varying Cache Size 113 
D.2.2 Varying Cache Line Size 115 
D.2.3 Varying Cache Set Associative 117 
Bibliography 119 
List of Figures 
2.1 Operations of LOAD-UPDATE and STORE-UPDATE (a) using 
the index-displacement addressing mode and (b) using the index-
based registers addressing mode 7 
2.2 Control flow for IAP scheme 10 
3.1 Percentage of Prefetch-On-Miss lines that are not referenced in IAP 
scheme 14 
5.1 A theoretical representation of a set in a four-way set associative 
cache 24 
5.2 Before a reference to an IAP line 25 
5.3 After reference to an IAP line 25 
5.4 Control Flow of the IZ Replacement Policy 26 
5.5 Architectural model of IAP scheme with PPU 28 
5.6 Illustration of PPU 29 
5.7 Control Flow of PPUVC 30 
5.8 Cache Support in the IAP architecture 32 
5.9 Control Flow of the Prefetch Cache Scheme 33 
6.1 Trace-driven simulator using xtrace 35 
6.2 Memory Model of the simulator: (a) Interleaved memory (b) Timing of 
data access 38 
6.3 Comparison of number of prefetch-on-miss lines in IAP cache and 
prefetch-on-miss-only cache 53 
6.4 Percentage of prefetch-on-miss lines referenced in total number of 
prefetched lines 54 
6.5 The effect of Prefetch Cache size on cache performance 58 
6.6 An illustration of the performance difference between LRU and FIFO 61 
6.7 An instance of line activities in IZ prefetch cache 63 
A.1 MCPI by varying cache size in IZ scheme 69 
A.2 MCPI by varying cache size in PPUVC with IAP scheme 71 
A.3 MCPI by varying cache size in PPUVC with prefetch-on-miss scheme 72 
A.4 MCPI by varying cache size in prefetch cache scheme 74 
A.5 MCPI by varying cache line size in IZ scheme 76 
A.6 MCPI by varying cache line size in PPUVC with IAP scheme 78 
A.7 MCPI by varying cache line size in PPUVC with prefetch-on-miss 
scheme 79 
A.8 MCPI by varying cache line size in prefetch cache scheme 81 
A.9 MCPI by varying set associative in IZ scheme 83 
A.10 MCPI by varying set associative in PPUVC with IAP scheme 85 
A.11 MCPI by varying set associative in PPUVC with prefetch-on-miss 
scheme 86 
A.12 MCPI by varying set associative in prefetch cache scheme 88 
B.1 Results of the first group programs in IZ 89 
B.2 Results of the second group programs in IZ 90 
B.3 Results of the third group programs in IZ 90 
B.4 Results of the first group programs in IZ 91 
B.5 Results of the second group programs in IZ 91 
B.6 Results of the third group programs in IZ 92 
B.7 Results of the first group programs in IZ 93 
B.8 Results of the second group programs in IZ 93 
B.9 Results of the third group programs in IZ 94 
C.1 Results of the first group programs in PPUVC 95 
C.2 Results of the second group programs in PPUVC 96 
C.3 Results of the third group programs in PPUVC 96 
C.4 Varying line size 97 
C.5 Results of the second group programs in PPUVC 97 
C.6 Results of the third group programs in PPUVC 98 
C.7 Varying set associative 99 
C.8 Results of the second group programs in PPUVC 99 
C.9 Results of the third group programs in PPUVC 100 
C.10 Results of the first group programs in PPUVC 101 
C.11 Results of the second group programs in PPUVC 101 
C.12 Results of the third group programs in PPUVC 102 
C.13 Results of the first group programs in PPUVC 103 
C.14 Results of the second group programs in PPUVC 103 
C.15 Results of the third group programs in PPUVC 104 
C.16 Results of the first group programs in PPUVC 105 
C.17 Results of the second group programs in PPUVC 105 
C.18 Results of the third group programs in PPUVC 106 
D.1 Results of the first group programs 107 
D.2 Results of the second group programs 108 
D.3 Results of the third group programs 108 
D.4 Results of the fourth group programs 108 
D.5 Results of the first group programs 109 
D.6 Results of the second group programs 109 
D.7 Results of the third group programs 110 
D.8 Results of the fourth group programs 110 
D.9 Results of the first group programs 111 
D.10 Results of the second group programs 111 
D.11 Results of the third group programs 112 
D.12 Results of the fourth group programs 112 
D.13 Results of the first group programs 113 
D.14 Results of the second group programs 113 
D.15 Results of the third group programs 114 
D.16 Results of the fourth group programs 114 
D.17 Results of the first group programs 115 
D.18 Results of the second group programs 115 
D.19 Results of the third group programs 116 
D.20 Results of the fourth group programs 116 
D.21 Results of the first group programs 117 
D.22 Results of the second group programs 117 
D.23 Results of the third group programs 118 
D.24 Results of the fourth group programs 118 
List of Tables 
6.1 SPEC Benchmark Applications used 34 
6.2 Percentages of LOAD/STORE-UPDATEs in SPEC92 Benchmark 
Suite 39 




Chapter 1 
Introduction 
Cache memory is a special high-speed memory designed to supply the processor 
with the most frequently requested instructions and data. Instructions and data 
located in cache memory can be accessed many times faster than instructions and 
data located in main memory. The more instructions and data the processor can 
access directly from cache memory, the faster the computer runs as a whole. 
Memory caching is effective because most programs access the same data or 
instructions over and over. By keeping as much of this information as possible in 
cache memory (which is usually implemented with faster SRAM), the computer 
avoids accessing the slower main memory (which is usually implemented with 
slower DRAM). Some memory caches are built into the architecture of 
microprocessors. Such internal on-chip caches are often called Level 1 (L1) 
caches. Cache memory makes use of the principle of locality. Locality of 
reference states basically that even within very large programs with several 
megabytes of instructions, only small portions of the code generally get used at 
once. Programs tend to spend long periods of time working in one small area of 
the code, performing the same job many times with slightly different operands, 
and then move on to another area of code for another batch of routine jobs. This 
occurs because of loops, which are what programs use to do work many times in 
rapid succession. 
Generally, there are two kinds of locality: temporal locality and spatial 
locality. Temporal locality describes the likelihood that a recently referenced 
address will be referenced again soon, while spatial locality describes the 
likelihood that a close neighbor of a recently referenced address will be 
referenced soon. Conventional cache memories rely on a program's temporal and 
spatial localities to reduce the average memory access latency. 
The gap between main memory and processor clock speeds is growing at an 
alarming rate. As a result, system performance is increasingly dominated by 
the latency of servicing memory accesses, particularly those accesses which are 
not easily predicted by the temporal and spatial localities captured by 
conventional cache memory organizations [Smi82] [HeP95]. 
One obvious way to reduce the number of cache misses is to enlarge the cache 
as much as possible; however, this is often difficult to achieve in practice. 
There are two main reasons that limit the size of the Level 1 cache: [1] The 
performance gained is not enough to compensate for the cost of the cache, which 
typically uses fast but expensive static RAM chips. SRAM is approximately 4 
times faster than DRAM; however, SRAM chips cost more than six times as much 
as the DRAM chips normally used for main memory. Besides, the performance 
improvement is not linearly proportional to the size of the cache; that is, a 
512K-byte cache memory may not deliver twice the performance of a 256K-byte one. 
[2] The CPU chip is usually small while SRAM is comparatively large, so only 
limited space is available for the Level 1 cache while maintaining a reasonable 
processor chip size. 
Due to the large speed gap between the processor and main memory, it is 
obvious that the performance of the system will be largely determined by [1] 
how effectively the on-chip memory is able to manipulate operands and minimize 
the frequency of off-chip accesses, and [2] the rate at which the external 
memory system can supply operands. 
The main aim of cache memory is to reduce the CPU's idle waiting time. 
Improving the cache performance of programs is one way of increasing the 
system's throughput. The effectiveness of the on-chip cache in maintaining 
useful operands and minimizing the frequency of off-chip accesses is one of the 
main factors that determine the performance of the system. 
In order to reduce the disparity between processor speed and memory access 
time, many solutions have been proposed. Some have proposed adding features 
such as non-blocking fetches [Kro81], victim caches [Jou90], and sophisticated 
hardware prefetching [ChB92] to alleviate the access penalties for those 
references that have locality characteristics that are not captured by most 
conventional designs. 
1.1 Overlapping Computations with Memory Accesses 
Many solutions have been proposed to reduce memory accesses and/or hide 
memory latency. An important approach is cache prefetching [Smi78a] [Smi78b] 
[Smi82] [HeP95], that is, the action of bringing data into the cache before they 
are actually needed. Prefetching is similar to speculative loads in the sense 
that it is non-blocking and behaves like a hint without incurring semantic 
faults. The main difference between prefetching and speculative loads is that 
data are loaded into the caches rather than registers. 
Depending on how prefetch requests are determined and initiated, prefetching 
can be either hardware-controlled [BaC91] [FuP91] [FuP92] or software-directed 
[Por89] [KlL91] [MoL92]. The hardware approach detects accesses with regular 
patterns and issues prefetches at run time, whereas the software approach relies 
on the compiler to analyze programs and to insert prefetch instructions during 
compilation. 
However, because of the low accuracy of some prefetching algorithms, there is 
a risk that prefetched data are never used before they are displaced from the 
cache. This wastes memory space and bandwidth, and thus leads to poorer 
performance. The problem becomes worse when the prefetched data displace 
useful data in the cache. This phenomenon is called cache pollution. 
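The effect can be illustrated with a small simulation. The sketch below is not the simulator used in this thesis; it assumes a fully associative cache with LRU replacement and a simple prefetch-on-miss policy that fetches line n + 1 on a demand miss to line n. The function name and parameters are hypothetical.

```python
from collections import OrderedDict

def simulate_prefetch_on_miss(trace, num_lines, prefetch_next):
    """Fully associative LRU cache holding `num_lines` lines (assumed model).

    With `prefetch_next` set, every demand miss on line n also fetches
    line n + 1. Returns (demand_misses, prefetched_lines_evicted_unused).
    """
    cache = OrderedDict()   # line -> True while it is an unreferenced prefetch
    misses = wasted = 0

    def insert(line, is_prefetch):
        nonlocal wasted
        if len(cache) >= num_lines:
            _, unused = cache.popitem(last=False)   # evict the LRU line
            wasted += unused                        # count unused prefetches
        cache[line] = is_prefetch

    for line in trace:
        if line in cache:
            cache[line] = False          # a prefetched line is now referenced
            cache.move_to_end(line)      # mark as most recently used
            continue
        misses += 1
        insert(line, False)
        if prefetch_next and line + 1 not in cache:
            insert(line + 1, True)
    return misses, wasted

# Four non-adjacent lines that would fit in a four-line cache on their own.
trace = [0, 10, 20, 30] * 3
print(simulate_prefetch_on_miss(trace, 4, prefetch_next=False))
print(simulate_prefetch_on_miss(trace, 4, prefetch_next=True))
```

In this contrived trace the working set fits in the cache, so without prefetching only the cold misses remain. With prefetching, the unused prefetched lines keep displacing lines that are about to be reused, so every reference misses.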
1.2 Cache Line Replacement Policies 
Different replacement policies are employed to manage operands in the memory. 
Replacement takes place when a particular cache line or a set of cache lines is 
already full, and a line has to be evicted to make room for the new incoming 
line. No ideal replacement policy has yet been invented, and it is unlikely 
that one exists. Replacement in a cache differs from replacement in paged main 
memories because the cache replacement algorithm must be implemented entirely 
in hardware and must execute very quickly so as to keep up with the processor 
speed. 
Least Recently Used and First-In-First-Out are the two most commonly used 
replacement policies. Besides these two well-known algorithms, there are Random 
and Pseudo-Least Recently Used, which are used in some special systems. 
Not knowing whether a line will be accessed soon, conventional caches usually 
use the Least Recently Used strategy as the replacement scheme. However, if the 
displaced line is referenced by the processor again, a thrashing miss¹ will 
occur. The situation may become worse, since one thrashing miss can lead to 
another thrashing miss. A good cache line replacement policy should try to find 
the best candidate to be displaced and so help minimize these thrashing misses. 
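A minimal sketch of how thrashing misses cascade under LRU is given here. It models a single cache set only, and the function name and the trace are illustrative assumptions rather than part of the thesis's simulator.

```python
from collections import OrderedDict

def lru_thrashing_misses(trace, ways):
    """Simulate one LRU-managed cache set with `ways` lines.

    Returns (misses, thrashing_misses), where a thrashing miss is a miss
    on a line that was displaced from the set earlier.
    """
    lines = OrderedDict()      # resident lines, least recently used first
    displaced = set()          # lines that have been evicted at least once
    misses = thrashing = 0
    for line in trace:
        if line in lines:
            lines.move_to_end(line)        # hit: becomes most recently used
            continue
        misses += 1
        if line in displaced:
            thrashing += 1                 # the evicted line is needed again
        if len(lines) == ways:
            victim, _ = lines.popitem(last=False)   # evict the LRU line
            displaced.add(victim)
        lines[line] = None
    return misses, thrashing

# Five lines referenced cyclically through a four-way set.
print(lru_thrashing_misses([0, 1, 2, 3, 4] * 3, ways=4))
```

With five lines cycling through a four-way set, LRU always evicts the line that will be needed next, so after the cold misses every further reference is a thrashing miss. One extra way would remove all of them.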
1.3 The Rest of This Paper 
In this dissertation, we focus on techniques for better management of the 
prefetched lines in a cache with the IAP scheme and in a cache with a 
traditional prefetch scheme, prefetch-on-miss only. 
IAP is an accurate prefetching scheme that makes use of the information 
provided by the instruction opcodes and addressing modes for prediction. A brief 
review of the IAP scheme is given in Chapter 2. 
Chapter 3 will briefly describe the cache pollution problem brought by 
conventional prefetching schemes, the reference-once property of IAP lines and 
how IAP works. 
¹ A thrashing miss occurs when the line which was replaced must itself be 
reloaded. 
Chapter 4 will briefly describe previous research on prefetching, and stream 
buffers. 
The implementation of the Instant Zero (IZ) cache line replacement policy will 
be discussed in Chapter 5. The details of another replacement policy, Priority 
Pre-Updating (PPU), which is designed to be used with different types of 
prefetching algorithms, will be given in the same chapter. The effect of using 
Priority Pre-Updating with a victim cache (PPUVC) is also discussed in that 
chapter. 
In the same chapter, details on the placement policy for prefetched lines, the 
prefetch cache, will be given. 
Chapter 6 will present the simulation methodology, the performance metrics 
used and the results of the performance evaluation. The designs are evaluated by 
simulating some benchmark programs in a uniprocessor environment. The results 
show that each of IZ, PPU with victim cache and prefetch cache alone can obtain 
a significant improvement in system performance. 
Finally, Chapter 7 will give an insight into future directions, and we will 
conclude this paper in Chapter 8. 
Chapter 2 
A Brief Review of IAP Scheme 
2.1 Embedded Hints for Next Data References 
In the design of the latest processor architectures, instruction opcodes and the 
addressing modes of the architecture definition usually have built-in mechanisms 
to support the address calculation of future data references while the current 
datum is being referenced. It is also found that compound instructions are 
commonly used in RISC architectures to reduce the program execution path 
length. As can be found from program instrumentation and tracing, certain simple 
RISC instructions are executed in pairs. So it might be useful to define a 
single compound or extended opcode to execute the instruction pair, and this is 
particularly useful if the new instruction opcode does not affect the processor 
clock cycle. Several machines already have such opcodes, for example 
ADD-AND-BRANCH, COMPARE-AND-BRANCH and LOAD-WORD-AND-UPDATE in HP's Precision 
Architecture 1.1 [HP94]. The IBM POWER and PowerPC architectures have 
LOAD-UPDATE, LOAD-MULTIPLE, etc. [IBM89] [Mot92] [IBM94] [WeS94]. The total 
number of instructions defined in current RISC processors ranges from 150 to 
200, which is much larger than that of early RISC processors (about 50 to 70 
instructions). The reason is that the latest processors find these compound or 
extended opcodes to be very useful, and thus embed them into the instruction 
set. Among these compound instructions, it is found that the LOAD/STORE-UPDATE 
(or LOAD/STORE-MODIFY) instructions are very helpful for managing on-chip cache 
activities. 
Array or pointer references to a large set of data are one of the major types of 
data references in typical programs. Data are referenced one after another in 
succession, and the index-displacement and index-based register addressing modes 
are usually employed for this type of access. Because these accesses occur very 
frequently, many systems tend to use compound opcodes like LOAD-UPDATE and 
STORE-UPDATE for them. Besides loading or storing a datum, each of these 
instructions updates the content of the index register used in the address 
calculation of the current data reference. The operations of the LOAD-UPDATE and 
STORE-UPDATE instructions using either the index-displacement or the index-based 
register addressing mode are shown in Figure 2.1. 
LOAD Rr(Rx + Disp)              LOAD Rr(Rx + Ry) 
Equivalent to                   Equivalent to 
Eff. Addr. = (Rx) + Disp        Eff. Addr. = (Rx) + (Ry) 
Rr = (Eff. Addr.)               Rr = (Eff. Addr.) 
Rx = Eff. Addr.                 Rx = Eff. Addr. 
(a)                             (b) 
Figure 2.1: Operations of LOAD-UPDATE and STORE-UPDATE (a) using the 
index-displacement addressing mode and (b) using the index-based registers 
addressing mode 
The updating action of the LOAD-UPDATE or the STORE-UPDATE instruction prepares 
the content of register Rx, which is used in calculating the effective address 
of the next expected datum. That address is equal to the sum of the current data 
reference address Eff. Addr. (or the updated content of register Rx) and the 
displacement Disp (in the index-displacement addressing mode) or the register 
content Ry (in the index-based register addressing mode). Thus, accurate data 
cache prefetching can be carried out, and the address of the prefetched data is 
equal to (Eff. Addr. + Disp) or (Eff. Addr. + Ry). It should be noted that the 
values of Eff. Addr. and Disp (or Ry) are available to the cache prefetching 
unit during the execution of the LOAD Rr(Rx + Disp) or LOAD Rr(Rx + Ry) 
instruction. 
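The register updates of Figure 2.1 and the resulting prefetch address can be sketched as follows. This is an illustrative software model only, assuming registers and memory are held in dictionaries; the function name, register names and memory contents are hypothetical and do not come from [Lau96].

```python
def load_update(regs, memory, rt, rx, disp=None, ry=None):
    """Model LOAD-UPDATE as in Figure 2.1 and return the prefetch address.

    Exactly one of `disp` (index-displacement mode) or `ry`
    (index-based register mode) must be given.
    """
    if (disp is None) == (ry is None):
        raise ValueError("give exactly one of disp or ry")
    stride = disp if disp is not None else regs[ry]
    eff_addr = regs[rx] + stride          # Eff. Addr. = (Rx) + Disp or (Rx) + (Ry)
    regs[rt] = memory.get(eff_addr, 0)    # Rr = (Eff. Addr.)
    regs[rx] = eff_addr                   # Rx = Eff. Addr. (the update step)
    return eff_addr + stride              # address of the next expected datum

regs = {"r1": 0, "r2": 100, "r3": 8}      # hypothetical register file
memory = {108: 7, 116: 9}                 # hypothetical memory contents
print(load_update(regs, memory, "r1", "r2", disp=8))    # index-displacement
print(load_update(regs, memory, "r1", "r2", ry="r3"))   # index-based register
```

Each call returns the address that a prefetch unit could request while the load itself is still executing, since both Eff. Addr. and the stride are already known at that point.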
2.2 Instruction Opcode and Addressing Mode 
Prefetching 
By using these hints of data references provided by the instruction opcodes and 
addressing modes, the Instruction Opcode and Addressing Mode Prefetching (IAP) 
scheme, which provides accurate data prefetching for the on-chip cache, was 
proposed by Lau [Lau96]. In Lau's study, the IBM POWER architecture (or the 
PowerPC architecture) is used as an example to show how the IAP scheme should be 
designed and implemented. 
Figure 2.2 shows the control flow of the IAP scheme. For each instruction i that 
is decoded and executed, its opcode is checked first to determine whether it is 
a LOAD-UPDATE or STORE-UPDATE instruction. If so, the address of the next datum 
expected to be referenced in the near future is re-calculated, using the same 
addressing mode as i but with the updated contents of all registers used in the 
address calculation of i. Afterwards, this new address is sent to the cache 
prefetch unit for accurate data prefetching. Besides these basic ideas, two 
enhancements have been integrated into the IAP scheme: 
• Default Prefetching vs. Selective Prefetching 
When executing each LOAD/STORE instruction i, if i belongs to the 
LOAD/STORE-UPDATE instruction group, then the IAP scheme is used for data 
prefetching; otherwise prefetch-on-miss is used as the default prefetching 
scheme. 
• Cache Block Prefetching vs. Next Data Reference Prefetching 
For each data prefetch requested by the IAP scheme, if the target prefetched 
block j containing the candidate datum is not the same as the currently 
referenced line i, then a prefetch of block j is issued. If they are the 
same, then a prefetch request for block i + 1 is issued. 
The above IAP scheme together with the two enhancements forms the combined 
IAP scheme that we use in this paper. 
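The two enhancements can be summarized as a single decision function. The sketch below is an interpretation of the description above, not Lau's implementation: the line size is an assumed value, the function and parameter names are hypothetical, and prefetch-on-miss is taken to mean fetching the next sequential line on a demand miss.

```python
LINE_SIZE = 32  # bytes per cache line; an assumed value for illustration

def iap_prefetch_line(is_update_instr, eff_addr, stride, missed):
    """Return the line number to prefetch, or None if no prefetch is issued.

    is_update_instr: True for LOAD/STORE-UPDATE instructions.
    eff_addr, stride: current effective address and the Disp (or Ry) value.
    missed: whether the current reference missed in the cache.
    """
    current_line = eff_addr // LINE_SIZE
    if not is_update_instr:
        # Default prefetching: plain prefetch-on-miss of the next line.
        return current_line + 1 if missed else None
    # Selective prefetching: use the address of the next expected datum.
    target_line = (eff_addr + stride) // LINE_SIZE
    if target_line != current_line:
        return target_line            # next data reference prefetching
    return current_line + 1           # same line: prefetch block i + 1 instead

print(iap_prefetch_line(True, 100, 64, missed=False))   # stride leaves the line
print(iap_prefetch_line(True, 100, 4, missed=False))    # stride stays in the line
print(iap_prefetch_line(False, 100, 0, missed=True))    # ordinary load that missed
```

The block-level fallback avoids issuing a prefetch for a line that is already being referenced, which would otherwise consume prefetch bandwidth without bringing in new data.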
2.3 Chapter Summary 
In this chapter, a general design for hardware-controlled prefetching, which was 
proposed by Lau [Lau96], is introduced. By using information embedded in the 
instruction opcodes, Lau's design is able to single out the data references with 
constant strides from the pool of all data references and to find the 
corresponding stride values. With this valuable information, accurate 
prefetching can be accomplished and, consequently, the CPU stall time due to 
data cache misses can be reduced. Cache block prefetching is introduced to 
tackle the problem of limited memory bus bandwidth. In order to exploit spatial 
locality, the combined IAP scheme is equipped with default prefetching to issue 
prefetch requests for the non-LOAD/STORE-UPDATE instructions. 
[Figure 2.2: Control flow for IAP scheme. Flowchart steps: get next instruction; 
check for LOAD/STORE-UPDATE; check addressing mode; if the prefetch address is 
the same as the current line address, set the prefetch address to that of the 
preceding or following line according to the direction of update; send prefetch 
address to prefetch queue.] 




Chapter 3 
Motivation 
Over the last two decades, the CPU clock cycle time has been decreasing at a 
much faster rate than the main memory access time. The average number of 
cycles per instruction has also been decreasing dramatically. The effect is more 
obvious for RISC machines with higher clock speeds and data consumption rates. 
Unfortunately, the high bandwidth of a microprocessor is meaningless unless 
it is matched by a similarly powerful memory subsystem. Most current 
microprocessors rely on caches to reduce their effective memory access time. 
However, cache misses affect the overall performance of a system: if either the 
instruction or the operand required by an operation is not found in the 
cache(s), the actual performance declines as the number of cache misses grows. 
With current VLSI developments, several functional units, instruction and 
data caches, and some special hardware functional units can be included on the 
processor chip. Therefore, a first obvious method for reducing the average 
memory access time is to implement multi-level cache hierarchies [BaW89] with an 
on-chip first-level cache. However, under the usual caching mechanism, the 
processor will still be stalled on a first-level cache miss, and also on misses 
at any of the next levels of the memory hierarchy with an even larger penalty 
time, until the miss is resolved. Since a processor must stall on a cache miss, 
caches do not totally hide memory latency; instead, they eliminate many off-chip 
memory accesses. In order to make further progress towards the reduction of 
memory latency, memory accesses due to cache misses must proceed in parallel 
with processor execution. As a result, a number of different solutions have 
been proposed to allow computations to be overlapped with memory accesses. 
They basically provide efficient mechanisms to allow buffering and pipelining of 
memory references. 
Various data prefetching algorithms exist: some are hardware-assisted, some 
are software-directed and others are hybrid. The main fault of many of these 
algorithms is that they do not integrate replacement algorithms with prefetching 
methods. There is often a large penalty for prefetching into the cache because 
the wrong line is replaced. 
When incorporating prefetching algorithms in a processor, several things 
have to be taken into consideration. First, it is possible to prefetch data into 
the cache that will never be used by the processor. This not only pollutes the 
cache, but also increases memory traffic. Second, if the data are prefetched too 
early, they can become stale before they are referenced, which may also increase 
memory traffic. Therefore, designing a processor with prefetching requires a 
careful balance between performance gains and costs such as cache pollution and 
increased memory traffic. 
From a different viewpoint, conventional cache hardware does not know how 
likely it is that a line will be accessed soon. A blind strategy is usually 
used to choose the line to be replaced when a miss occurs, e.g. choosing the 
least recently used line. However, if the displaced line is referenced by the 
processor again, a thrashing miss will occur. The problem becomes more 
serious since one thrashing miss can lead to another thrashing miss. 
It is estimated that 50% of all misses are thrashing misses, and that most of 
these can be avoided. A good cache replacement policy will help minimize these 
thrashing misses. 
There exists an accurate prefetching scheme, the Instruction Opcode and 
Addressing Mode Prefetching (IAP), which was proposed by Lau [Lau96]. IAP, which 
makes use of the run-time information provided by the instruction opcodes and 
addressing modes, prefetches data accurately. However, it is found that data 
prefetched by the IAP scheme do not possess the temporal locality property: a 
large portion of the data prefetched by the IAP scheme¹ are likely to be 
referenced one and only one time² before the program terminates. In order to 
handle the replacement of the data lines in the cache, a new strategy, 
tentatively termed Instant Zero (IZ), is proposed. This new strategy aims at 
replacing the IAP lines intelligently, and will be explained in detail later. 
On-chip cache is usually small³, so the precious cache space should be used 
carefully. The cache pollution problem strongly affects system performance. One 
obvious solution is to evict useless data in case of conflict or capacity 
misses. However, determining which data are useless and which should be evicted 
is a difficult problem. A poorly designed prefetching algorithm aggravates the 
cache pollution problem, and wrong displacement of useful data degrades system 
performance. From Figure 3.1, we can see that in some benchmark programs, more 
than 90% of prefetch-on-miss lines are unreferenced. It is obvious that most 
prefetch-on-miss lines are useless, i.e., they are not referenced before the 
program terminates. It is beneficial to shorten the lifetime of those possibly 
erroneously prefetched lines in the cache to minimize cache pollution. We 
propose a Priority Pre-Updating (PPU) scheme to tackle the problem; PPU helps 
determine the data to be evicted and reduces cache pollution. 
As mentioned above, IAP lines are likely to be referenced one and only one 
time before the program terminates. Therefore, placing them in a separate cache 
space can localize their effects and minimize the cache pollution problems. As 
a result, we proposed to use an on-chip prefetch cache to hold all those data 
prefetched by IAP scheme. 
¹ Lines prefetched by the IAP scheme are tentatively called IAP lines in later sections.
² Termed the reference-once property.
³ Usually ranging from 4K bytes to 32K bytes. Though large on-chip caches are also found in
current architectures, they are not common.
[Bar charts of the percentage of POM lines not referenced for each benchmark
program, in three panels: (a) varying cache size, (b) varying line size,
(c) varying set associativity]
Figure 3.1: Percentage of Prefetch-On-Miss lines that are not referenced in IAP 
scheme 
3.1 Chapter Summary 
Cache pollution is a side-effect of data prefetching, and a poorly designed prefetching
algorithm aggravates the cache pollution problem. This problem has an adverse effect on
cache performance. A blind replacement strategy increases the thrashing misses,
reduces the utilization of the cache and indirectly causes cache pollution. The
techniques proposed, the Instant Zero replacement policy and Priority Pre-Updating with
a victim cache, aim to alleviate these problems and improve cache performance. The
cache lines prefetched by the IAP scheme have a reference-once property, and placing
them in a separate cache space, the prefetch cache, is able to localize their influence




In recent computer applications, matrix manipulations with highly regular and
sequential data references are common, and a lot of data is needed to perform
the computation. If the operands are not found in the cache, the actual
performance of the system declines because of the large number of cache misses.
In order to reduce the cache miss penalty, data should be prefetched into the
cache before their actual use. Prefetching techniques consist of both hardware
and software approaches. Existing cache prefetching schemes, either
hardware-driven [BaC91] [Smi78b] or software-assisted [Tha81] [Bre87] [Por89]
[GoG90] [CaK91] [ChM91] [KlL91] [MoG91] [MoL92], are not very effective in
reducing the processor idle time due to memory accesses.
Hardware prefetching typically uses dynamic stride detection to perform run-time
calculation of the prefetch addresses to be issued [ChB92] [FuP91] [FuP92]. The
drawbacks of hardware prefetching are the cost of the additional hardware and
the limited ability of the dynamic units to perform any prefetching other than
through arrays with linear strides.
The prefetching accuracy of traditional hardware-driven data prefetching schemes
is low (though they are relatively easy to implement), and thus they cannot achieve
significant improvement in data cache performance. Though we can find some accurate
Chapter 4- Related Work 
hardware-driven prefetching schemes for constant-stride array elements, they usually
require complicated add-on hardware such as a prediction table. As a
result, they are not suitable for a first-level on-chip cache, as
the space on the CPU chip is very limited. Chen and Baer [ChB94] evaluated
the effectiveness of lockup-free caches and hardware prefetching, and proposed a
hybrid scheme based on a combination of these approaches.
Software prefetching is more flexible than hardware prefetching, having the
advantage of compile-time knowledge, but pays the price of software overhead, both
in instructions issued and in code size [CaK91] [KlL91] [MoL92]. Software-assisted
cache prefetching schemes can also achieve high accuracy in prefetching array data
references with constant strides, but the runtime overhead introduced is a big
obstacle to their popularity. Furthermore, architectural and compiler support is
needed for software-assisted prefetching schemes, which also restricts the use
of software prefetching in current processors and computer systems. Besides
these, some promising approaches use hybrid hardware and software techniques,
issuing a limited number of instructions that provide hints to the prefetch hardware [Chi94].
When the cache or a particular set is full and information is requested by
the CPU from the lower level memory, some information in the cache must be
selected for replacement. This implies that a cache miss needs not only a fetch
but also a replacement. Cache replacement policies should be implemented entirely in
hardware and execute very quickly, so that they do not degrade system
performance. Replacement algorithms are mainly classified into usage-based
and non-usage-based. Section 4.1 gives a brief description of some known
replacement policies.
4.1 Existing Replacement Algorithms 
In brief, cache misses can occur for three reasons: [1] the requested data have never
been accessed before (compulsory miss), [2] the requested data have been accessed
before, but the size of the working data set exceeds the cache size (capacity miss),
or [3] the requested data had been in the cache but was displaced by an intervening
reference to another address (conflict miss). Information resident in the cache has
to be removed to bring in new information in the event of cache misses. The
replacement algorithm determines the information to be discarded. The algorithm
may be Least Recently Used (LRU), First In First Out (FIFO), Random, Pseudo-LRU,
etc.
A truly random strategy is unacceptable for production test reasons,
as it is difficult to run test vectors on a chip that does not have completely
deterministic behavior. Some relatively common replacement algorithms are as
follows:
1. Least Recently Used (LRU): A usage-based algorithm under which the line
that has not been accessed for the longest time is replaced, in the hope
of reducing the chance of throwing out information that will soon be needed
again. Its implementation requires every line to have extra bits to keep track
of the age of its contents, which makes the controller design more complicated.
2. First In First Out (FIFO): The line that has been in the cache the longest
is the one replaced, i.e., the line to be replaced is the oldest line in the
cache, the one which was loaded before all the others. A pointer into the line
space is maintained. On replacement, the line pointed to by the pointer is
ejected, and the pointer is incremented. The pointer is reset to zero when the
end of the line space is reached.
3. Clock (or Second-chance): A pointer into the line space is maintained. On
replacement, the used bit of the line pointed to by the pointer is checked.
If it is set, it is cleared and the pointer is incremented. This step
is repeated until a line with the used bit cleared is found, and that line is
ejected. The used bit is set on every access, and is cleared periodically.
This method can be used to approximate LRU, but the period of the
clearing needs to be carefully set. It will be difficult to find an ejectable
line if the period is too long. If the period is too short, locality will be lost
and thrashing will occur frequently.
4. Least Recently Modified: The LRU bits of a line are modified only on writes.
5. Not-Most-Recently-Used: The most recently used line is kept in the cache, and
one of the remaining lines is selected and replaced.
6. Least Frequently Used (LFU): The line to be replaced is the one used least
often of the lines currently in the cache.
7. Last In First Out (LIFO): The line to be replaced is the one most recently
loaded into the cache.
8. Optimal (OPT or MIN): The line to be replaced is the one that will not be
used for the longest period of time. This algorithm requires future knowledge
of the reference string, which is not usually available. Thus, this policy is
used only for comparison studies.
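The Clock (second-chance) policy in the list above can be illustrated with a short Python sketch. It is only a minimal model under stated assumptions: the class name ClockSet and its fields are hypothetical, the used bit is also set when a line is loaded, and the periodic clearing of used bits is left out, so bits are cleared only as the pointer sweeps past them.

```python
class ClockSet:
    """Clock (second-chance) replacement over a fixed number of lines."""

    def __init__(self, num_lines):
        self.tags = [None] * num_lines   # tag held by each line (None = empty)
        self.used = [False] * num_lines  # one used bit per line
        self.hand = 0                    # the pointer into the line space

    def access(self, tag):
        """Return True on a hit, False on a miss (the line is then loaded)."""
        if tag in self.tags:
            self.used[self.tags.index(tag)] = True  # used bit set on every access
            return True
        # Miss: clear used bits and advance until a line with a clear bit is found.
        while self.used[self.hand]:
            self.used[self.hand] = False
            self.hand = (self.hand + 1) % len(self.tags)
        self.tags[self.hand] = tag       # eject that line and load the new one
        self.used[self.hand] = True
        self.hand = (self.hand + 1) % len(self.tags)
        return False


cache = ClockSet(3)
hits = [cache.access(t) for t in ["A", "B", "C", "D", "B", "E"]]
print(hits, cache.tags)
```

In this trace the reference to B sets its used bit again, so the miss on E skips B and ejects C instead: B has received its second chance.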
Among all of the above algorithms, the usage-based LRU is the most commonly
used in current memory design. As mentioned above, an implementation of LRU
has to keep track of the age of every cache line, and thus requires every cache line
to have extra bits. Though this makes the controller design more complicated
and expensive, it works well in most architectures. FIFO and Random are
non-usage-based algorithms, which base the replacement decision on something
other than usage. It is shown that non-usage-based algorithms all yield
comparable performance [Smi82].
4.2 Placement Policies for Cache Lines 
Techniques for holding prefetched data in an intermediate space other than the
first-level cache have also been proposed. Jouppi [Jou90] proposed to use a stream
buffer to hold the prefetched data. Stream buffers prefetch cache lines starting
at a cache miss address. The prefetched data is placed in the buffer instead of
the cache. Stream buffers are useful in removing capacity and compulsory cache
misses, as well as some instruction cache conflict misses. However, the stream
buffer proposed is actually a simple FIFO queue, and thus only the oldest
element is visible to the processor at any time. Sometimes, however, a more
recently loaded line rather than the oldest one is needed. As a result, the
expected performance improvement in the data cache is slight or nil. Therefore,
a multi-way stream buffer, which consists of four stream buffers in parallel
with an LRU replacement policy among them, is proposed to overcome the limited
ability of the single stream buffer. When a miss occurs in the data cache that
does not hit in any stream buffer, the least recently hit stream buffer is
cleared and starts fetching at the miss address. However, the utilization of the
buffers is still low, as only the first entry in each buffer can be searched.
Jouppi [Jou90] has proposed another technique, miss caching, to reduce the
miss penalty of a cache miss. A miss cache is a small fully-associative cache
containing two to five cache lines of data. When a miss occurs, data is returned
not only to the normal (upper) cache, but also to the miss cache under it, where
it replaces the least recently used item. Each time the upper cache is probed, the
miss cache is probed as well. If a miss occurs in the upper cache but the address
hits in the miss cache, then the direct-mapped cache can be reloaded in the
next cycle from the miss cache. This replaces a long off-chip miss penalty with a
short one-cycle on-chip miss.
To make better use of the miss cache, victim caching is further proposed by
Jouppi [Jou90]. Victim caching uses a different replacement algorithm for the
small fully-associative cache. Instead of loading the requested data into the miss
cache on a miss, the fully-associative cache is loaded with the victim line from the
direct-mapped cache. With victim caching, no data line appears both
in the direct-mapped cache and in the victim cache. On a miss in the direct-mapped
cache that hits in the victim cache, the contents of the direct-mapped
cache line and the matching victim cache line are swapped.
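The swap behaviour of victim caching can be sketched in Python as follows. This is an illustrative model only, not Jouppi's implementation: the class name VictimCachedDM, the use of line addresses as integers, and the choice to drop the oldest victim entry when the victim cache overflows are all assumptions made for the example.

```python
from collections import OrderedDict


class VictimCachedDM:
    """Direct-mapped cache backed by a small fully-associative victim cache."""

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.dm = [None] * num_sets      # one line address per direct-mapped set
        self.victim = OrderedDict()      # victim cache, ordered oldest to newest
        self.victim_entries = victim_entries

    def _to_victim(self, line):
        if line is None:                 # nothing was displaced
            return
        self.victim[line] = True
        if len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)   # assumed: drop the oldest entry

    def access(self, line):
        """Return 'hit', 'victim-hit' or 'miss' for a line address."""
        idx = line % self.num_sets
        if self.dm[idx] == line:
            return "hit"
        displaced = self.dm[idx]
        self.dm[idx] = line
        if line in self.victim:
            del self.victim[line]        # swap: requested line moves up ...
            self._to_victim(displaced)   # ... and the displaced line moves down
            return "victim-hit"
        self._to_victim(displaced)       # load the victim line on an ordinary miss
        return "miss"


c = VictimCachedDM(num_sets=4, victim_entries=2)
results = [c.access(line) for line in (0, 4, 0, 4)]
print(results)
```

Lines 0 and 4 map to the same direct-mapped set, so after the two compulsory misses every further conflict is served from the victim cache, and at no point does a line sit in both structures.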
4.3 Chapter Summary 
In this chapter, a brief review of different prefetching algorithms is given. Existing
replacement and placement policies are also reviewed.
Chapter 5 
Replacement and Placement 
Policies of Prefetched Lines 
In order to minimize cache pollution and localize the influence of
prefetched lines, we have tried different approaches. The schemes focus on
controlling the life time of prefetched lines in the cache and on the placement
of IAP lines. Firstly, we propose a mixed replacement policy combining the
IZ and LRU replacement policies. This scheme helps to shorten the life time of
referenced and useless IAP lines while retaining the temporal locality of
demand-fetched lines. Secondly, the Priority Pre-Updating scheme (PPU) is proposed
to shorten the life time of possibly erroneously prefetched lines. We found that
PPU works well not only with the IAP scheme but also on caches employing different
prefetching algorithms. Thirdly, we use an on-chip prefetch cache to hold the
prefetched data in order to localize the influence of IAP lines. The following
sections give details of these three schemes.
5.1 IZ Cache Line Replacement Policy in IAP 
scheme 
Least Recently Used (LRU) is the most commonly used cache line replacement
policy in both traditional cache designs and the IAP scheme mentioned in previous
chapters. Though LRU is very popular in current cache designs, it still has
drawbacks. LRU cannot always replace the best line in the cache,
and replacing a wrong candidate line may cause a cache miss afterwards. For example,
consider the following code segment:
    for (count = 0; count < 3; count++)
        for (i = 0; i < 4; i++)
            a[i] += b[i];
Let blocks A, B, C and D contain the data b[0], b[1], b[2] and b[3] respectively. If
the cache is fully associative and holds three blocks, with LRU chosen as the
replacement policy, then the data blocks will be loaded into the cache in the
following sequence:
[Diagram: contents of the cache at three successive stages, labelled (1), (2) and (3)]
In (2), due to limited capacity, A is replaced by D as it is the least recently
used block. However, block A is needed immediately after D when the outer loop
is entered again (in (3)), and thus a cache miss follows. This situation continues
until the end of the loops. The same situation occurs when the FIFO policy is used.
Though no perfect cache line replacement scheme has been found so far, a good
replacement algorithm should try to reduce the probability of wrongly replacing
a useful cache line.
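The thrashing in this example can be checked with a small Python sketch. It is a simplified model under explicit assumptions: only the references to b[] are traced (blocks A to D), the cache is taken to be fully associative with room for three blocks, and the function name count_misses is hypothetical.

```python
from collections import OrderedDict


def count_misses(references, capacity, policy):
    """Count misses of a fully associative cache under 'LRU' or 'FIFO'."""
    cache = OrderedDict()                 # ordered from next victim to newest
    misses = 0
    for block in references:
        if block in cache:
            if policy == "LRU":
                cache.move_to_end(block)  # a hit refreshes recency under LRU only
            continue
        misses += 1
        if len(cache) == capacity:
            cache.popitem(last=False)     # evict the block at the front
        cache[block] = True
    return misses


# b[0..3] live in blocks A..D and the outer loop runs three times.
references = ["A", "B", "C", "D"] * 3
lru_misses = count_misses(references, capacity=3, policy="LRU")
fifo_misses = count_misses(references, capacity=3, policy="FIFO")
print(lru_misses, fifo_misses)
```

Under these assumptions every one of the twelve references misses for both LRU and FIFO, whereas a cache with room for four blocks would incur only the four compulsory misses.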
Although the IAP scheme can prefetch data accurately, the cache may not be
large enough to accommodate all lines brought in by demand fetch and prefetch
requests. As a result, conflict misses occur frequently when the working set of
the program is larger than the cache size, and the Instant Zero
replacement policy [SzY97] is proposed to handle the problem. The IZ scheme aims
at managing cache replacement more efficiently and reducing the thrashing
misses. It is found that the data in IAP lines are most likely to be referenced once
only. If these prefetched lines are placed in the cache and obey the LRU cache line
replacement policy, then once they have been referenced they are unlikely to be
referenced again in the near future, and they merely occupy precious cache space
without any contribution. That is the reason why the proposed IZ replacement
policy is used to handle these IAP lines: after the requested data of an IAP
line has been referenced, the line should be the best candidate to be discarded
when either capacity or conflict misses occur.
5.1.1 The Instant Zero Scheme 
In the IAP scheme, conflict misses are the main concern. If a line (say line i)
is referenced and is not found in the cache, then a miss occurs. The
idea of IZ is not to reduce the miss penalty caused by the reference to line i,
but to minimize the probability of cache misses in the future. We observe
that most cache lines brought in by demand fetch or prefetch-on-miss
possess a certain degree of locality, whereas lines prefetched by the IAP scheme
are likely to be referenced once only. As a result, the IAP lines that have been
referenced should be the best candidates to be discarded if conflicts on the cache
lines occur. Therefore, the replacement strategy in the IAP scheme is a mixed
strategy using both IZ and LRU.
The mixed replacement policy of LRU and IZ can be summarized as follows.
[1] Cache lines not brought in by the IAP scheme (either by demand fetch
or by default prefetching) still obey the LRU policy. [2] Lines prefetched
by the IAP scheme follow either LRU or IZ according to the following rule:
• If the prefetch address is not in the current data reference line,
then the prefetched line follows the LRU policy. On the other hand, if
the prefetch address is in the current data reference line, then, as mentioned
before, the block preceding it or following it is prefetched, and this
prefetched line obeys the IZ policy, in which the priority of the line is
set to 0 immediately after its reference.
To indicate whether a line is prefetched by the IAP scheme, one extra bit per
cache line is needed. A set in a four-way set associative cache will
look like Figure 5.1.
D1  H1  I1  S1  PTag1  Data1
D2  H2  I2  S2  PTag2  Data2
D3  H3  I3  S3  PTag3  Data3
D4  H4  I4  S4  PTag4  Data4
Field widths in bits: D = 1, H = 2, I = 1, S = 1
D     Dirty bit
H     Hot bits, which indicate the priority of the corresponding data line
I     IAP bit, set if the line is prefetched by the IAP scheme
S     Line status bit (valid bit)
PTag  Physical tag
Data  Cache data
Figure 5.1: A theoretical representation of a set in a four-way set associative cache 
The hot bits are used to indicate the priorities of the lines following the LRU
replacement policy. When a line in the cache is referenced or a new line is brought
into the cache, the hot bits of the lines are updated. In the former case, the
referenced line has its hot bits updated to the largest number within the
same set. Those lines whose hot bits value is larger than the original value
of the referenced line have their hot bits decreased by 1, while the others remain
unchanged. In the latter case, the line whose hot bits value is 0 is displaced, the
hot bits value of every other line is decreased by 1, and the new line
has its hot bits value set to the highest number within the set. Therefore, lines
with a lower priority are more likely to be kicked out, as they are the less recently
used ones. In case of conflict in the cache lines, the line with hot bits equal to 0
is displaced in order to bring in a new line.
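The hot-bit bookkeeping for LRU can be sketched in Python as follows. This is a minimal model for illustration: the function names are hypothetical, the set is assumed to be full, and the hot-bit values of a set are assumed to be a permutation of 0 to n-1 so that exactly one line holds the value 0.

```python
def lru_reference(hot, line):
    """Update hot-bit values in place after a hit on `line`."""
    old = hot[line]
    for i in range(len(hot)):
        if hot[i] > old:
            hot[i] -= 1            # lines that were more recent move down by one
    hot[line] = len(hot) - 1       # the referenced line gets the largest value


def lru_replace(hot):
    """Displace the line whose hot bits are 0 and return its index."""
    victim = hot.index(0)
    for i in range(len(hot)):
        if i != victim:
            hot[i] -= 1            # every surviving line ages by one
    hot[victim] = len(hot) - 1     # the new line takes the highest priority
    return victim


hot = [2, 3, 1, 0]                 # hot-bit values of lines 0..3 in one set
lru_reference(hot, 2)              # hit on line 2
after_hit = list(hot)
victim = lru_replace(hot)          # miss in the same set
print(after_hit, victim, hot)
```

After the hit on line 2 the values become [1, 2, 3, 0]; the following miss then displaces line 3 and leaves [0, 1, 2, 3], so each set always has exactly one candidate with hot bits equal to 0.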
Set i:
line  D  hot bits  I  S
0     0  10        1  1
1     0  11        0  1
2     0  01        1  1
3     0  00        0  1    (lowest priority line)
Figure 5.2: Before a reference to an IAP line 
Now, let us look at how IZ works within a set (say set i). Referring to
Figure 5.2, if there is a miss in set i, then line 3 will be displaced, since it is the
one with the lowest priority (the least recently used one). Suppose instead that
there is a reference to line 0 before such a miss occurs. As it is an IAP line
(with the IAP bit set), it is then considered useless and will be the most
likely one to be kicked out in future conflicts. Besides setting the priority
of line 0 to 0, the priorities of those lines lower than the original priority
of line 0 are incremented by 1. Those lines
with priorities higher than the original priority of line 0 remain unchanged. As
a result, the lines in the set with the priorities updated will look like Figure 5.3.
Therefore, if there is a miss now, then line 0 instead of line 3 is the first one to be
displaced.
Set i:
line  D  hot bits  I  S
0     0  00        1  1    (lowest priority line)
1     0  11        0  1
2     0  10        1  1
3     0  01        0  1
Figure 5.3: After reference to an IAP line 
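The transition from Figure 5.2 to Figure 5.3 can be reproduced with a short Python sketch of the mixed IZ/LRU update on a hit. It is a simplified illustration: the function name reference is hypothetical, every line with its IAP bit set is assumed to follow IZ, and the hot-bit and IAP-bit values are those read from Figure 5.2.

```python
def reference(hot, iap, line):
    """Update hot-bit priorities in place after a hit on `line`."""
    old = hot[line]
    if iap[line]:
        # Instant Zero: lines below the old priority move up by one,
        # and the referenced IAP line drops straight to priority 0.
        for i in range(len(hot)):
            if hot[i] < old:
                hot[i] += 1
        hot[line] = 0
    else:
        # Plain LRU: lines above the old priority move down by one,
        # and the referenced line goes to the highest priority.
        for i in range(len(hot)):
            if hot[i] > old:
                hot[i] -= 1
        hot[line] = len(hot) - 1


hot = [0b10, 0b11, 0b01, 0b00]     # lines 0..3 before the reference
iap = [True, False, True, False]   # IAP bits of lines 0..3
reference(hot, iap, 0)             # reference to IAP line 0
victim = hot.index(0)              # line displaced by the next miss in this set
print([format(h, "02b") for h in hot], victim)
```

The hot bits become 00, 11, 10 and 01 for lines 0 to 3, so the next miss in the set displaces line 0 rather than line 3.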
The control flow of the proposed scheme is shown in Figure 5.4. 
[Flowchart. Recoverable box labels: Get next instruction; LOAD/STORE-UPDATE?;
addressing mode test (one branch labelled OTHER); test against the current line;
Set prefetch address as that of the preceding or following line according to the
direction of update; Send prefetch address to prefetch queue; Update priority
using LRU strategy; Update priority using IZ strategy]
Figure 5.4: Control Flow of the IZ Replacement Policy 
If a line is prefetched by the IAP scheme, then the IAP bit in the corresponding
entry is set to 1. Such an IAP line is treated as a normal cache line until
it is referenced. When an IAP line is referenced, its priority is set
to 0 immediately after the reference, in accordance with the IZ replacement policy.
5.2 Priority Pre-Updating and Victim Cache 
In order to minimize the cache pollution and localize the influences brought by 
prefetched lines, we propose the Priority Pre-Updating scheme (PPU) to shorten 
the life time of possibly erroneously prefetched lines. The effect of adding a small 
fully-associative victim cache to hold the unreferenced prefetched lines ejected 
from cache is also discussed here. 
5.2.1 Priority Pre-Updating 
A PPU unit is added to keep track of all the prefetched lines according to their time
of prefetching. Whenever there is a reference to a prefetched line, the priorities of those
earlier prefetched lines which are unreferenced and precede the currently referenced
line in the PPU unit are decremented by a constant amount. Normally, the
priority of a line is updated only when there is a reference to a line within the
same set. However, under the PPU scheme, the priorities of lines in different
sets may also be changed.
In fact, the PPU aims at reducing the cache pollution problem as well as 
retaining the potential temporal locality of other lines. Figure 5.5 shows the 
overview of the architectural model of the cache under the PPU scheme. 
As shown in Figure 5.6, each entry in PPU maps to a prefetched line in the
cache. When there is a reference to a cache line, say line_03 (entry 3 in PPU),
since line_01 and line_10 precede line_03 in PPU, their priorities will be decreased by
1. The cache lines following line_03 in PPU will not be affected.
[Block diagram. Recoverable labels: execution unit, effective data address,
data cache, PPU unit, instruction register, prefetch unit]
Figure 5.5: Architectural model of IAP scheme with PPU 
5.2.2 Priority Pre-Updating for Cache 
The basic rules of PPU are: [1] all prefetch-on-miss lines are recorded in the PPU
unit; [2] when there is a reference and the requested line is found in the PPU unit,
the unreferenced lines preceding it have their priorities decremented by 1; and [3] the
corresponding entry in the PPU unit for the referenced line is deleted.
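The three rules can be sketched in Python as follows. This is only an illustrative model: the class name PPUUnit, the representation of a line as a (set, line) pair, the priority dictionary, and the clamping of priorities at 0 are all assumptions of the example rather than details given above.

```python
class PPUUnit:
    """Ordered record of unreferenced prefetched lines, oldest first."""

    def __init__(self):
        self.entries = []                  # (set index, line index) pairs

    def record(self, set_idx, line_idx):
        # Rule 1: every prefetch-on-miss line is recorded on arrival.
        self.entries.append((set_idx, line_idx))

    def reference(self, set_idx, line_idx, priority):
        """Apply rules 2 and 3; `priority` maps (set, line) to a priority value."""
        key = (set_idx, line_idx)
        if key not in self.entries:
            return                         # not an unreferenced prefetched line
        pos = self.entries.index(key)
        for older in self.entries[:pos]:
            # Rule 2: older, still unreferenced lines lose one priority level
            # (clamped at 0, which is an assumption of this sketch).
            priority[older] = max(0, priority[older] - 1)
        del self.entries[pos]              # Rule 3: delete the referenced entry


ppu = PPUUnit()
priority = {(0, 1): 2, (1, 0): 1, (0, 3): 3, (2, 0): 3}
for key in [(0, 1), (1, 0), (0, 3), (2, 0)]:   # prefetch order, oldest first
    ppu.record(*key)
ppu.reference(0, 3, priority)
print(priority, ppu.entries)
```

Here the reference to line (0, 3) lowers the priorities of the two older entries, which live in different sets, and leaves the younger entry (2, 0) untouched.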
5.2.3 Victim Cache for Unreferenced Prefetch Lines 
Owing to the long delay between references to successive data, the priority
pre-updating scheme alone cannot obtain significant improvement in cache performance.
Data which have a high potential to be referenced in the near future are sometimes
thrown out before their actual reference. In order to minimize the influence of this
situation, we use a small victim cache as a secondary buffer for
unreferenced prefetched lines: a small fully-associative victim cache
with four entries, each of the same size as a cache line, is integrated with the cache.
As experimental results show that victim caches with FIFO and LRU replacement
policies have similar performance, FIFO is chosen as the replacement policy in the victim
cache for ease of implementation in hardware.
[Diagram. The entries in the PPU unit, ordered by time of prefetch from oldest to
youngest, each point to a prefetched line in one of the cache sets (set 0 to set n)]
Figure 5.6: Illustration of PPU 
Each prefetched line discarded from the cache is first checked to see whether
it is unreferenced. If so, the line is placed in
the victim cache. When the victim cache is full, its oldest line is displaced.
When there is a data reference, the data cache is checked first.
If the data is not found in the data cache, the victim cache is then checked.
If the data is found there, one extra clock cycle is
spent to fetch the data from the victim cache.
The victim cache itself does not prefetch data but acts as a secondary
buffer to store prefetched data. Experimental results show that the cache
performance is significantly improved by using PPU with victim cache (PPUVC).
Figure 5.7 gives a summary of the control flow of the PPUVC.
[Flowchart. Recoverable box labels: Record this line in the PPU unit, by
specifying its set and line numbers; Simply follow LRU replacement policy;
Place D in data cache; Cache is full and a line is displaced; test whether the
displaced line is a prefetched line and is fresh (YES/NO); Place the line in
victim cache; Write back if the line is ...]
Figure 5.7: Control Flow of PPUVC 
When a data line arrives, it is first checked to see whether it is a prefetched line. If so,
a record is kept in the PPU unit. When there is a reference to any
cache line, the priorities of the lines are checked and updated. If a
conflict occurs in the cache and a line has to be replaced, the one with the lowest
priority is discarded. If the line to be discarded from the cache is an unreferenced
prefetched line, it is placed in the victim cache. On the other hand,
lines that have been referenced are not placed in the victim cache. These lines
are written back to lower level memory if they have been modified. Otherwise,
they are simply discarded from the cache.
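The handling of a displaced line under PPUVC can be sketched in Python as follows. It is a minimal model under stated assumptions: the class name PrefetchVictimCache, the method names and the use of plain addresses are hypothetical, and what happens to a line after it hits in the victim cache is not modelled.

```python
from collections import deque


class PrefetchVictimCache:
    """FIFO victim cache that accepts only unreferenced prefetched lines."""

    def __init__(self, entries=4):
        self.lines = deque(maxlen=entries)   # the oldest entry is dropped when full

    def on_eviction(self, addr, prefetched, referenced, dirty):
        """Decide what happens to a line displaced from the data cache."""
        if prefetched and not referenced:
            self.lines.append(addr)          # keep it: it may still be needed
            return "to-victim-cache"
        return "write-back" if dirty else "discard"

    def lookup(self, addr):
        """Checked only after a data cache miss; a hit costs one extra cycle."""
        return addr in self.lines


vc = PrefetchVictimCache(entries=4)
outcomes = [
    vc.on_eviction(0x100, prefetched=True, referenced=False, dirty=False),
    vc.on_eviction(0x140, prefetched=True, referenced=True, dirty=True),
    vc.on_eviction(0x180, prefetched=False, referenced=True, dirty=False),
]
for addr in (0x200, 0x240, 0x280, 0x2C0):
    vc.on_eviction(addr, prefetched=True, referenced=False, dirty=False)
print(outcomes, vc.lookup(0x100), vc.lookup(0x2C0))
```

A bounded deque gives the FIFO behaviour directly: once four newer unreferenced prefetched lines have arrived, the first one (0x100) has been pushed out.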
5.3 Prefetch Cache for IAP Lines 
IAP lines have quite different properties from demand-fetch lines and
prefetch-on-miss lines, since they are likely to be referenced only once during the entire
program execution time. In order to localize their effects and minimize their
influence on normal cache lines, a small on-chip prefetch cache is added [YoS98].
All IAP lines are placed in the prefetch cache instead of the on-chip data cache.
A prefetch cache has the same structure as a cache: it consists of a series of
entries, each with a tag, a valid bit, a dirty bit, a hot bit and a data line. The prefetch
cache functions independently of the data cache. When there is a reference,
both the data cache and the prefetch cache are checked in parallel for any potential
hits. Figure 5.8 gives the general picture of cache support in the IAP architecture.
Through the on-chip cache controller, the processor attempts to access the
data in the primary cache. If the data is there, then the processor retrieves it.
If a primary cache miss occurs, i.e., the data is not found in the primary cache,
the cache controller checks to see if the data is in the secondary cache. If the data
is found in the secondary cache, it is fetched into the primary cache. If the data is
not present in the secondary cache, it is fetched as a cache line from main memory
and is written into both the secondary cache and the primary cache before the
processor retrieves it. For the sake of simplicity, we assume there are no secondary
[Block diagram. Recoverable labels: CPU, cache controller, instruction cache,
data cache, prefetch cache, secondary cache, main memory]
Figure 5.8: Cache Support in the IAP architecture 
cache misses. In other words, the secondary cache is assumed to be infinitely large
and to hold all data. It is possible for the data to appear in different levels of memory.
The data is kept consistent through the use of a write-back methodology, in which
modified data is not written back to memory until the cache line is replaced.
Any data prefetched by the prefetch unit is placed in the prefetch cache instead
of the normal data cache if the prefetch is generated by the
IAP scheme. The data in the prefetch cache is also kept consistent through the
write-back methodology.
In addition to implementing LRU, the conventional cache replacement policy,
in the prefetch cache, we also try the IZ and FIFO replacement policies in
it. We have implemented only one non-usage-based algorithm, FIFO, since
non-usage-based algorithms are found to have similar performance, as mentioned by
Smith [Smi82].
Similar to the victim cache, the prefetch cache itself does not prefetch data
but only keeps prefetched data available for use. Figure 5.9 gives a summary of
the control flow of the prefetch cache.
Each data line arriving in the cache is checked to see whether it is a prefetched line.
If the data line is a demand-fetch line, it is placed in the normal data cache.
Otherwise, it is further checked to see if it was prefetched by the IAP scheme.
If so, it is placed in the prefetch cache, and all
[Flowchart. Recoverable box labels: Data line D arrived; Is D an IAP line?;
NO: Place D in normal data cache; YES: Place D in prefetch cache]
Figure 5.9: Control Flow of the Prefetch Cache Scheme 
other types of prefetched lines will placed in the normal data cache. 
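The placement decision described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Origin` and `place_line` and the string labels for the two caches are hypothetical and are not taken from the simulator.

```python
from enum import Enum, auto


class Origin(Enum):
    """How an arriving line was requested (hypothetical labels)."""
    DEMAND = auto()          # fetched because of a cache miss
    IAP_PREFETCH = auto()    # prefetched by the IAP scheme
    OTHER_PREFETCH = auto()  # any other prefetch, e.g. prefetch-on-miss


def place_line(origin: Origin) -> str:
    """Return the structure that receives an arriving line."""
    # Only IAP-prefetched lines go to the prefetch cache; demand-fetched
    # lines and all other prefetched lines go to the normal data cache.
    if origin is Origin.IAP_PREFETCH:
        return "prefetch_cache"
    return "data_cache"


for origin in Origin:
    print(origin.name, "->", place_line(origin))
```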
5.4 Chapter Summary 
In this chapter, two designs for hardware controlled replacement schemes and one
placement design are proposed. By carefully selecting the data line to be replaced
when a cache miss occurs, our replacement policies, IZ and PPUVC, are able
to reduce the cache pollution problem and retain the benefit of the localities
possessed by normal cache lines. With the reference-once property of IAP lines,
the placement design, prefetch cache, is able to localize the influence of IAP
lines. Again, it helps




6.1 Methodology and metrics 
In order to gain a deeper understanding of the proposed algorithms and to show
their potential, we evaluated our proposed architecture using detailed trace-driven
cache simulation. We used eight SPEC92¹ benchmarks to generate traces for our
study. They include three integer-intensive programs (compress, espresso and li)
and five floating-point-intensive programs (nasa7, spice2g6, su2cor, tomcatv and
wave5). Table 6.1 gives a brief description of the benchmarks used.
SPEC Benchmark Suite
Program    Language  Description
compress   C         Adaptive Lempel-Ziv compression
espresso   C         Boolean function minimization
li         C         Lisp interpreter solving the nine queens problem
nasa7      Fortran   Seven floating-point synthetic kernels
spice2g6   Fortran   Analog circuit simulator
su2cor     Fortran   Quantum physics mass computation
tomcatv    Fortran   Mesh generation program
wave5      Fortran   Maxwell's equation solver
Table 6.1: SPEC Benchmark Applications used
¹ SPEC is a trademark of the Standard Performance Evaluation Corporation
Chapter 6 Performance Evaluation 
6.1.1 Trace Driven Simulation 
Clearly, the type of jobs (i.e., the job mix or instruction mix) will be important 
to the cache simulation, since the cache performance can be highly data and code 
dependent. 
For example, given a cache size of 8K lines with an anticipated miss rate of
10% (1 miss in 10 memory references), we would require about 80K memory
references before each line in the cache could reasonably be expected to have
been replaced once. To obtain reasonable estimates of actual cache miss
rates, each cache line should be replaced a number of times (the accuracy of the
estimate depends on the number of such replacements). The net effect is that a
memory trace some factor larger is required, say another factor of 10, or about
800K references. That is, the trace length should be at least 100 times the size
of the cache. Therefore, we chose to collect 100 million instructions for each
benchmark program.
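The arithmetic behind this rule of thumb can be checked with a short sketch. The function name `min_trace_length` and its parameters are illustrative only; the miss rate is expressed as "one miss in N references" so that the calculation stays in integers.

```python
def min_trace_length(cache_lines: int, refs_per_miss: int,
                     replacements_per_line: int) -> int:
    """Estimate the number of memory references a trace should contain.

    cache_lines           -- number of lines in the cache
    refs_per_miss         -- one miss is expected every this many references
    replacements_per_line -- how often each line should be replaced on average
    """
    # One full turnover of the cache needs one miss per line.
    refs_per_turnover = cache_lines * refs_per_miss
    return refs_per_turnover * replacements_per_line


cache_lines = 8 * 1024                              # 8K lines
one_turnover = min_trace_length(cache_lines, 10, 1)
full_trace = min_trace_length(cache_lines, 10, 10)
print(one_turnover)                # about 80K references
print(full_trace)                  # about 800K references
print(full_trace // cache_lines)   # trace length relative to cache size
```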
With the help of the xtrace facilities, each of the chosen benchmark programs in
the SPEC92 suite was traced on the IBM RS/6000 workstation and 100 million
instructions for each benchmark were collected. The process of trace-driven
simulation is shown in Figure 6.1.




[Flow diagram: instrumented code -> xtrace facilities (Prog.xtrace) -> processor information (via a pipe) -> Prog.interface -> address and data trace -> Simulator (with configurations as a second input) -> simulation results]
Figure 6.1: Trace-driven simulator using xtrace
The benchmarks were first compiled on the RS/6000 workstation. The executable
code was then instrumented by the program instrument, which inserted
extra code into the executable in order to extract the processor information
during program execution. The instrumented code was then handled by
xtrace and executed on the RS/6000 machine. The extracted processor information
was passed to the program interface via an explicit pipe. The program
interface could be defined by the user to produce the desired trace format. The
following information was recorded for each instruction that was traced:
Inst_Address, Inst_Content, <Data_Ref_Address if any>, <No_of_Bytes_Ref if any>
The simulator is designed so that it can read in configuration descriptions
such as cache size, set associativity and line size; simulation objects such as
the CPU and the memory system are created based on these parameters. The
trace data were stored on hard disk, so that a number of simulators could be
run in parallel on different machines to speed up the simulations. The simulator
read in the trace records one by one. The content of each instruction was then
decoded, and the opcode, the addressing mode and the register(s) used
in the address calculation were extracted.
6.1.2 Caching Models 
The baseline cache and the proposed caches use a write-back, write-allocate
policy and an 8-entry prefetch queue. We assumed that the processor has an ideal
instruction cache that incurs no instruction cache misses. An elementary
architectural model, which consists of a perfectly pipelined processor and a 4-way
set-associative data cache with a line size of 32 bytes and a total size of 16K
bytes, is defined for the simulations, and the replacement algorithm is assumed to
be LRU (Least Recently Used). For comparison, each dimension of the cache (cache
size, line size and set associativity) is varied in turn (cache sizes range from
8K bytes to 32K bytes, line sizes from 16 bytes to 64 bytes, and set
associativities from 1 to 4) while the other two are kept constant.
The memory model of the second level cache in the simulations is assumed to
be interleaved; its design and timing characteristics are shown in Figure 6.2.
The memory is organized into a number of banks (or modules) so that it can handle
multiple words at one time rather than a single word. Each bank is one word wide,
the same width as the first level cache and the bus. A cache line usually consists
of a number of words (for example, a 32-byte line consists of eight 4-byte words).
Whenever a cache miss occurs in the first level cache and a fetch request is sent
to the second level cache, the banks work simultaneously: bank 0 starts
reading the first word in the block, bank 1 the second word, bank 2 the third
word, and so on. However, since there is only one memory bus between the first and
second level caches, the words must be transferred sequentially. As a
result, the time for a demand fetch request to transfer a cache line between the
first level cache and the second level cache can be summarized by the
equation $C_1 + C_2 \times (\mathit{blocksize} - 1)$, where $C_1$ is the delay time
for the first word to arrive after a cache miss (that is, startup_overhead +
transfer_time_for_a_word) and $C_2$ is a parameter that indicates the bus bandwidth
between the first level cache and the second level cache (that is, the transfer
time for a word). In our experiments, $C_1$ was assumed to be 6 and $C_2$ to be 1.
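As a worked example of the transfer-time equation, the sketch below evaluates it for a few line sizes. It assumes that blocksize is counted in 4-byte words and that the results are in processor cycles; the function name `demand_fetch_time` is illustrative.

```python
C1 = 6          # delay until the first word arrives
C2 = 1          # transfer time for each further word
WORD_BYTES = 4  # one bank, and the bus, is one 4-byte word wide


def demand_fetch_time(line_bytes: int, c1: int = C1, c2: int = C2) -> int:
    """Time to transfer one line: C1 + C2 * (blocksize - 1)."""
    blocksize_words = line_bytes // WORD_BYTES
    return c1 + c2 * (blocksize_words - 1)


for line_bytes in (16, 32, 64):
    print(line_bytes, "bytes:", demand_fetch_time(line_bytes))
```

For the 32-byte line of the elementary model (eight words), this gives 6 + 7 = 13.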
For a given cache line size, the time for a demand request (due to a first
level cache miss) to finish is assumed to be equal to the time for a prefetch (to
the second level cache) to finish. Prefetch requests reside in the prefetch
queue, and the request $R_p$ at the head of the queue is sent to the second
level cache if the queue is not empty and the bus to the second level cache is
free. If the address of the pending prefetch request immediately follows that of
the current request $R_c$ (demand fetch or prefetch) being processed in the second
level cache (i.e. address_of_$R_p$ = address_of_$R_c$ + 1), the interleaved memory,
that is, the second level cache, does not need to wait until the request $R_c$ is
completely finished. It can start to process $R_p$ as soon as some memory banks are
free, although the memory bus may still be transferring the data of $R_c$. In this
case, the startup overhead of $R_p$ can be hidden and the time for completing the
prefetch request is equal to $C_2 \times \mathit{blocksize}$.
[Diagram (a): CPU and L1 cache connected by a bus to the interleaved memory banks. Diagram (b): timing of two consecutive fetches i and i+1, each split into a request phase (req) and a transfer phase (xfer), with the intervals C1 and C2 marked]
Figure 6.2: Memory Model of the simulator: (a) Interleaved memory (b) Timing
of data access
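The effect of hiding the startup overhead for a sequential prefetch can be illustrated as follows. This is a sketch under the same assumptions as the text (C1 = 6, C2 = 1) plus the assumption that blocksize is counted in 4-byte words; the function name `prefetch_time` and the flag `follows_current` are hypothetical.

```python
def prefetch_time(line_bytes: int, follows_current: bool,
                  c1: int = 6, c2: int = 1, word_bytes: int = 4) -> int:
    """Time to complete a prefetch request in the interleaved memory model."""
    blocksize_words = line_bytes // word_bytes
    if follows_current:
        # The pending request addresses the line right after the one being
        # served, so its startup overhead overlaps with the current transfer.
        return c2 * blocksize_words
    # Otherwise the request pays the full demand-fetch time.
    return c1 + c2 * (blocksize_words - 1)


print(prefetch_time(32, follows_current=False))
print(prefetch_time(32, follows_current=True))
```

With these parameters the saving per sequential prefetch is C1 - C2 regardless of the line size.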
The second level cache can only handle one request at a time, whether it is
a demand fetch or a prefetch request. When a demand fetch miss occurs in the first
level cache, the cache tries to fetch the data from the second level cache.
However, the second level cache may be serving a prefetch request at that moment.
In case of such a conflict, priority is given to the demand fetch miss: the
prefetch request is aborted and the demand fetch request is started in the next
cycle.
For simplicity, the second level cache (memory) is assumed to be infinitely 
large. That is, there is no cache miss in the second level cache. 
Each instruction is assumed to execute in one cycle, no superscalar
architecture is simulated, and a cache access that hits is assumed to take one
cycle.
6.1.3 Simulation Models and Performance Metrics 
Table 6.2 shows the percentages of LOAD/STORE-UPDATE instructions in the
instruction mix of the SPEC92 benchmark programs. As the table shows, current
compiler technology makes full use of LOAD/STORE-UPDATE: depending on the
benchmark, from a few percent to over 95 percent of the LOAD/STORE instructions
belong to the LOAD/STORE-UPDATE category.
Percentage of Instructions Executed
Benchmark  Total  LOAD-   Total  STORE-  Total       Total LOAD/
           LOAD   UPDATE  STORE  UPDATE  LOAD/STORE  STORE-UPDATE
compress   21.5    0.0     9.2    0.3    30.7         0.3
espresso   22.0   11.1     3.9    1.3    25.9        12.4
li         25.4    0.8    15.2    2.3    40.6         3.1
nasa7      42.8   42.0     1.7    1.4    44.5        43.4
spice2g6   18.3    1.4     9.9    1.1    28.2         2.5
su2cor     26.4    8.5    14.1    6.1    40.5        14.6
tomcatv    29.6   18.3    11.1   10.1    40.7        28.4
wave5      26.5    1.4     9.7    2.2    36.2         3.6
Table 6.2: Percentages of LOAD/STORE-UPDATEs in SPEC92 Benchmark Suite
Numerous experiments on various caching models were simulated using the 
collected SPEC92 traces as the input. 
• The simulated cache size ranged from 8K bytes to 32K bytes. 
• The simulated line size ranged from 8 bytes to 64 bytes. 
• The simulated set associativities ranged from 1 to 4. 
• The time for a demand fetch request to transfer a cache line between the
first level cache and the second level cache memory was equal to
$C_1 + C_2 \times (\mathit{LineSize} - 1)$, where $C_1$ is the delay time for the
first word to arrive after a cache miss, and $C_2$ is a parameter that indicates
the bus bandwidth between the first level and the second level cache.
• The time for a demand fetch request to finish was assumed to be the same 
as that for a prefetch request to finish for a given cache line size. 
• The second level cache/memory was assumed to be infinitely large, and there 
was no cache miss in the second level cache. 
• When a memory request Ra tried to kill another request that was currently
being served, there was a delay of one cycle before the new request Ra could
start.
• The size of the prefetch queue was assumed to be eight entries. 
• Each instruction was assumed to be executed in one cycle and no superscalar 
architecture was simulated. 
• One cycle access time was needed for a cache hit. 
• Seven cache prefetch models were simulated:
1. Data cache without any prefetching.
2. Data cache with prefetch-on-miss.
3. Data cache with the combined IAP scheme, using the IAP scheme
for LOAD/STORE-UPDATE instructions and the default prefetch-on-miss
for non-LOAD/STORE-UPDATE instructions.
4. Data cache as in 3, but with a mixed replacement policy
consisting of LRU and IZ.
5. Data cache as in 2, but with priority pre-updating and a victim
cache to handle the prefetched lines. The victim cache is assumed to
have 4 entries, each the size of one line in the data cache.
6. Data cache as in 3, but with priority pre-updating and a victim
cache to handle the prefetched lines. The victim cache is assumed to
have 4 entries, each the size of one line in the data cache.
7. Data cache as in 3, but with a prefetch cache added to hold
the lines prefetched by the IAP scheme.
• All the enhancements mentioned in Section 2 were implemented in the
simulated IAP scheme. Only LOAD/STORE-UPDATE instructions using the
index-displacement addressing mode were handled by the simulated IAP scheme;
for LOAD/STORE-UPDATE instructions using the index-based register addressing
mode, the IAP scheme did not issue any prefetch request.
In the past, cache design frequently took a back seat to CPU
design: the cache subsystem was often designed to fit the constraints imposed
by the CPU implementation. The execution time of a program, however, fundamentally
depends on how well the two units work together to execute instructions, and it
is most effectively minimized when the realities of cache design
influence the CPU design and vice versa. Furthermore, caches have traditionally
been evaluated solely on the basis of hit (or miss) ratios, a metric that can often
be deceiving.
In order to reflect the actual performance of the proposed algorithms, two
main metrics were used to evaluate the different schemes.
The first performance metric is cycles per instruction due to memory (data
cache) misses. This performance parameter, MCPI, measures the average additional
processor stall time due to first level cache misses. It shows the degradation
of CPU performance due to data cache misses in terms of memory stall cycles per
instruction. It can be calculated from the execution CPI and the baseline CPI by
the equations:
$$\mathit{MCPI} = \mathit{CPI}_{execution} - \mathit{CPI}_{baseline}$$
$$\mathit{CPI}_{execution} = \frac{\text{total\_number\_of\_cycles\_executed}}{\text{total\_number\_of\_instructions}}$$
$$\mathit{CPI}_{baseline} = \frac{\text{total\_number\_of\_cycles\_executed\_for\_no\_cache\_miss}}{\text{total\_number\_of\_instructions}}$$
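The MCPI definition translates directly into a small function. The sketch below is illustrative only: the function name `mcpi` and the cycle counts in the example are made up for the purpose of the calculation and are not simulation results.

```python
def mcpi(total_cycles: int, cycles_without_misses: int,
         instructions: int) -> float:
    """MCPI = CPI_execution - CPI_baseline."""
    cpi_execution = total_cycles / instructions
    cpi_baseline = cycles_without_misses / instructions
    return cpi_execution - cpi_baseline


# Hypothetical run: 100 million instructions that would need 103.2 million
# cycles with no cache misses but take 150 million cycles in total.
value = mcpi(150_000_000, 103_200_000, 100_000_000)
print(round(value, 3))
```

Because both CPIs share the same instruction count, MCPI is simply the number of stall cycles caused by data cache misses divided by the number of instructions.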
This is a better measure than the cache hit (or miss) ratio,
as the penalty of a cache miss depends on the cache line size. Moreover, partial
cache hits (or misses), in which cache line i is referenced while block i is
still being prefetched, commonly occur. Partial cache hits arise from the
limited bandwidth between the first level cache and the second level cache (or
memory). When a partial hit occurs, the prefetch is allowed to finish and the
requested data are then sent to the CPU. Thus the penalty of a partial cache hit
is not constant; it ranges from 1 to (maximum cache miss penalty - 1), i.e.
$C_1 + C_2 \times (\mathit{Block\_Size} - 1) - 1$. In this situation, the cache
hit ratio can hardly reflect the actual cache performance, as part of the data
fetching time is overlapped with processor execution while the rest is visible
to the processor.
Since we assume that the processor can execute each instruction in one cycle and
that there is an ideal instruction cache in the system, one may intuitively
expect the baseline CPI to be exactly one. However, the memory bus between the
processor and the first level cache is only 32 bits (4 bytes) wide, so at most 4
bytes can be transferred between the processor and the first level cache in one
cycle. If the data to be loaded or stored by an instruction is longer than 4
bytes, the instruction can only finish after all the required data have been
transferred, and its execution time is longer than one cycle even when there is
no cache miss. For example, a LOAD/STORE-DOUBLEWORD instruction takes two cycles
even when the data needed is found in the cache. As a result, the baseline CPI
will probably be greater than one, and this effect is more significant in the
double precision floating point benchmarks such as nasa7 and tomcatv, in which
most of the data are double words 8 bytes long. Table 6.3 shows the baseline
CPIs found for the eight benchmark programs used in our simulations.
Benchmark baseline CPI 
compress 1.032 
espresso 1.011 






Table 6.3: Baseline CPIs of SPEC92 Benchmark Suite 
The second metric was based on the additional processor stall time due to first
level cache misses. The percentage of delay time reduction over no prefetch was
defined as
$$\%\mathit{DelayTimeReduction} = \frac{\mathit{MemoryDelayTime}_{NoPrefetchCache} - \mathit{MemoryDelayTime}_{PrefetchCache}}{\mathit{MemoryDelayTime}_{NoPrefetchCache}}$$
This metric shows how much of the memory stall time due to data cache misses is
removed relative to an elementary cache using no prefetch scheme.
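A minimal sketch of this metric follows. The function name `delay_time_reduction` and the stall-cycle counts are illustrative; the explicit scaling by 100, which turns the ratio into a percentage, is an assumption made here.

```python
def delay_time_reduction(no_prefetch_delay: float,
                         scheme_delay: float) -> float:
    """Percentage of memory delay time removed relative to no prefetching."""
    return 100.0 * (no_prefetch_delay - scheme_delay) / no_prefetch_delay


# Hypothetical stall-cycle counts, for illustration only.
print(delay_time_reduction(200.0, 100.0))  # half of the stall time removed
print(delay_time_reduction(200.0, 220.0))  # negative: the scheme is slower
```

A negative value indicates that a scheme increased the memory delay time compared with the cache that uses no prefetching.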
6.2 Simulation Results 
In the following sections, the experimental results are presented to show the ben-
efits of the mixed replacement scheme in IAP, the impact of PPUVC in cache 
performance improvement and also the effect of cache performance with the use 
of a small prefetch cache. The architecture with the elementary caching model 
using no prefetch scheme is compared with the same architecture augmented by 
each of these schemes. 
6.2.1 General Results 
• All the replacement and placement schemes that deal with IAP lines seemed
to have no effect on the benchmark compress. As Table 6.2 shows, only 0.3% of
the total instructions (less than 1% of LOAD/STORE instructions) are of the type
LOAD/STORE-UPDATE. As the prefetching actions of the IAP scheme are triggered by
LOAD/STORE-UPDATE instructions, only a few prefetch requests are generated by
the IAP scheme and their effects are negligible. Therefore, the IZ replacement
policy, PPUVC and the prefetch cache, which work upon IAP lines, had an
insignificant effect on the cache performance. Moreover, the combined IAP
scheme effectively reverted to a simple prefetch-on-miss scheme, and the two
schemes suffered a slight performance degradation with respect to the
no-prefetch cache, which is probably due to the lack of constant stride
references in the program (as reflected by the lack of LOAD/STORE-UPDATE
instructions).
• Prefetch-on-miss, the traditional hardware prefetching scheme, generally
showed some improvement over most of the caching models tested, except for the
benchmark compress. The combined IAP scheme showed performance improvement
over all caching models for almost all benchmarks used (except compress).
Varying Cache Size 
Figures A.1, A.2, A.3 and A.4 in Appendix A show the simulation results for
the eight benchmark programs with cache sizes varying from 8K to 32K bytes.
All experiments were done with caching models of 32-byte line size and 4-way
associativity.
As expected, the MCPI decreased as the cache size increased.
However, for some benchmarks, such as su2cor and tomcatv, the pure IAP schemes
showed little improvement when the cache size was small (8K bytes), but
exhibited substantial improvement when the cache was increased to 16K and 32K
bytes. For tomcatv, when the cache size was very small (8K bytes), the pure IAP
scheme actually degraded the performance instead of improving it (see
Figure A.1 (g)). This is probably due to the combination of a small cache and an
aggressive prefetching scheme. Even though the prefetching can be very
accurate, the accurately prefetched data displace each other from the
data cache before they have the chance to be used. As the cache size
increased from 8K bytes to 16K bytes, this cache conflict problem was reduced
and the IAP scheme started to show substantial cache performance improvement.
However, for the three proposed schemes (IZ, PPUVC and prefetch cache), the
performance improvement was generally more significant for small cache sizes in
su2cor, tomcatv and wave5 (Figures A.1, A.2, A.3 and A.4). When the cache size
increased, the performance improvement was comparable to that of the pure IAP
scheme.
The significant improvements for small cache sizes in the IZ and PPUVC schemes
are due to the careful selection of lines to be replaced. By using the specific
displacement criteria of these two schemes, lines that are likely to be useless
in the future are displaced in cases of conflict. When the cache size was small,
conflict misses occurred more frequently. If a replacement algorithm can
accurately predict which line should be discarded, then many future memory
accesses can be eliminated. The IZ scheme makes use of the reference-once
property of IAP lines in its replacement criteria, and it can accurately predict
which line should be discarded for most of the benchmark programs. PPUVC helps
to shorten the lifetime of possibly mispredicted lines while maintaining the
properties of locality.
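To make the idea of reference-once replacement concrete, the sketch below picks a victim within one cache set. It is only a rough approximation based on the description in this section: the `Line` fields, the function `choose_victim` and the fallback to LRU are assumptions, and the actual IZ policy as defined in Chapter 5 may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class Line:
    tag: int
    is_iap: bool      # line was prefetched by the IAP scheme
    referenced: bool  # line has been referenced since it was brought in
    last_use: int     # time of last reference (smaller means older)


def choose_victim(cache_set: list) -> Line:
    """Pick the line to replace in one cache set (hypothetical sketch)."""
    # Assumption: an IAP line that has already been referenced is unlikely
    # to be needed again (reference-once property), so it is evicted first.
    used_iap = [line for line in cache_set if line.is_iap and line.referenced]
    candidates = used_iap if used_iap else cache_set
    # Among the candidates, fall back to plain LRU.
    return min(candidates, key=lambda line: line.last_use)


example_set = [
    Line(tag=1, is_iap=False, referenced=True, last_use=3),
    Line(tag=2, is_iap=True, referenced=True, last_use=9),
    Line(tag=3, is_iap=True, referenced=False, last_use=5),
    Line(tag=4, is_iap=False, referenced=True, last_use=7),
]
print(choose_victim(example_set).tag)
```

In this example plain LRU would evict the line with tag 1, whereas the sketch evicts the already-referenced IAP line and keeps the normal lines, which may still have temporal locality.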
Varying Cache Line Size 
Figures A.5, A.6, A.7 and A.8 in Appendix A show the simulation results for the
eight benchmark programs using different prefetching schemes. The experiments
were done with caching models of 16K-byte cache size, 4-way associativity and
cache line sizes varying from 4 to 32 bytes.
The MCPI curves for the IAP schemes generally had U-like shapes. That is, the
MCPIs of the programs first declined from the small line sizes to the optimal
line size. Then the direction of the curves reversed and the MCPIs kept rising
beyond the optimal line size. These observations are commonly found in cache
simulations. As the line size increases, more data are fetched at one time and
the spatial locality between these data may be beneficial to processor execution.
Moreover, it is also more economical on average to fetch a larger line once than
to fetch a smaller line several times, because the time to fetch a line
from memory is equal to $C_1 + C_2 \times (\mathit{linesize} - 1)$. As the line
size increases from the smallest size to the optimal one, these effects are
dominant and the MCPI continues to drop in this range. However, as the line size
keeps increasing beyond that point, using a larger cache line size for sequential
prefetching becomes less effective. As the line size increases further, greater
portions of the lines contain data that will not be referenced in the near
future, and the lines are evicted without these data being touched. Moreover,
increasing the cache line size lengthens the time for fetching a line from the
memory. This means that the CPU must wait longer for the same amount of needed
data (for example, the CPU is stalled longer for a 4-byte datum in a 64-byte line
than for a 4-byte datum in a 32-byte line). At the same time, this also increases
the risk of prefetches being killed by demand fetches caused by real cache
misses. Finally, a larger line size reduces the total number of distinct lines
that can be held in the data cache and increases the conflicts between lines in
the cache, which may cause some useful data to be evicted before they are
referenced by the CPU. When these adverse effects of a larger line size outweigh
its benefits, increasing the line size means a higher miss rate, more processor
idle time and lower CPU performance. This explains why the MCPI curves rose
after passing the optimal line sizes.
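Both sides of this trade-off can be seen from the fetch-time formula alone. The sketch below assumes C1 = 6 and C2 = 1 as in the simulations and a line size counted in 4-byte words; the function names are illustrative.

```python
def line_fetch_time(line_bytes: int, c1: int = 6, c2: int = 1,
                    word_bytes: int = 4) -> int:
    """Time to fetch one line: C1 + C2 * (words_per_line - 1)."""
    words = line_bytes // word_bytes
    return c1 + c2 * (words - 1)


def time_to_fetch_region(region_bytes: int, line_bytes: int) -> int:
    """Time to bring in a contiguous region using separate line fetches."""
    lines_needed = region_bytes // line_bytes
    return lines_needed * line_fetch_time(line_bytes)


# Columns: line size, time for one line, time to fetch 64 contiguous bytes.
for line_bytes in (4, 8, 16, 32, 64):
    print(line_bytes, line_fetch_time(line_bytes),
          time_to_fetch_region(64, line_bytes))
```

The last column falls as the line size grows (fetching a large line once is cheaper than fetching several small lines), while the middle column rises (a miss that only needs one word waits longer for the whole line).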
For some programs (compress, espresso, spice2g6 and su2cor), the MCPI
curves showed that the caches worked better when the line size was small (4
bytes). This is probably because the data of consecutive references are far
apart and do not reside in the same line. As a result, only small portions
of the large lines fetched from the secondary memory are referenced in the
near future, and the locality captured by the large line size does not help much.
On the other hand, with a smaller line size, a cache of the same size can
contain more lines, which gives the IAP schemes more flexibility to do accurate
prefetching. In conclusion, a smaller cache line size is preferred in these
situations. This also agrees with what Lee [Lee87] found about smaller line sizes
for data caches.
However, for espresso (Figures A.5, A.6, A.7 and A.8 (b)), the rising MCPI
curves of the schemes that work on IAP lines turned around and began to drop when
the line size increased from 32 bytes to 64 bytes (similar phenomena were also
observed for the cache-only and prefetch-on-miss curves when the line size
increased from 16 bytes to 32 bytes). Although the explanation for this
phenomenon is not very clear, we suspect that it is related to data accesses with
large stride values of 32 to 64 bytes in the program. From 4 bytes to 32 bytes,
the number of lines that can be stored in the cache is halved each time
the line size is doubled. However, if the stride of the data accesses is large, a
small increase in the line size does not capture more useful data. Consequently,
increasing the cache line size within the range below 32 bytes only causes cache
pollution and results in poor cache performance. When the cache line size was
increased from 32 bytes to 64 bytes, sequential data prefetching using a large
line size started to have some effect and the cache performance improved.
It seems that the IAP lines in the program nasa7 had high temporal locality.
The PPU scheme could not obtain significant improvement in the
prefetch-on-miss-only cache, where only prefetch-on-miss lines are involved.
However, in the cache with the IAP scheme, PPU obtained a quite significant
decrease in MCPI, especially when the line size was small. The high temporal
locality of IAP lines in nasa7 is further confirmed by the performance
degradation of the IZ scheme compared with the combined IAP scheme
(Figure A.5 (d)). In the IZ scheme, an IAP line is evicted very soon after it has
been referenced, due to the underlying assumption of low temporal locality.
However, it seems that this assumption does not hold for nasa7.
Varying Cache Set Associativity
Figures A.9, A.11, A.10 and A.11 in Appendix A show the effect of increasing the
cache set associativity. As expected, going from a set associativity of 1 to 2
generally improved the performance (except for the benchmark su2cor). With a
one-way associative (direct-mapped) cache, every line can be placed at only one
position. If two sequences of data accesses, for example two arrays inside the
same loop, happen to map to the same sets, they will continually displace
each other's data lines in the cache, although the displaced line may contain
data that will be referenced in the near future. As a consequence, the miss rate
increases and the cache performance degrades. This accounts for the large
improvement from a one-way to a two-way set associative cache. The effect was
most obvious for the benchmark nasa7 (Figures A.9, A.11, A.10 and A.11 (d)). For
the PPUVC scheme in the prefetch-on-miss-only cache model, the performance
difference between direct-mapped and 2-way set associative caches was less
significant than for the other schemes, which involve IAP lines. In the
direct-mapped case, IZ effectively reverted to the combined IAP scheme. However,
for the prefetch cache and for PPUVC in the IAP scheme, the program nasa7 had a
lower MCPI in the direct-mapped configuration than in the others. The reason is
that IAP prefetches with high accuracy, and this aggressive prefetching brings
data into the cache before their actual references. From Table 6.2, we can see
that almost all (over 90%) of the data references in nasa7 belong to the
LOAD/STORE-UPDATE (constant stride) type and are mainly chains of array or
pointer references. With this large number of constant stride references, the
chance of conflicts induced by the address mapping is probably very high. Due to
the relatively high number of conflict misses in the direct-mapped cache, some of
these useful IAP lines were evicted before they were referenced. Clock cycles
were therefore wasted, as these lines had to be fetched into the cache again
later. Using a victim cache to hold these lines in the IAP scheme extended their
lifetime in the cache and avoided many potential memory accesses. A similar
performance improvement could be observed with the prefetch cache for nasa7, in
which the fully-associative prefetch cache provided a place to hold these IAP
lines before their references. In both cases, the IAP lines could stay somewhere
near the cache before their references, and fetching from these buffers was much
faster than fetching the data from lower level memory.
However, the cache performance was more or less the same for 2-way and 4-way
set associativity. Although increasing the associativity gives more flexibility
for cache line placement, for a fixed cache size each doubling of the
associativity also halves the number of sets in the cache, so the performance
becomes less dependent on the associativity in these situations.
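The relationship between associativity, the number of sets and conflicts can be illustrated with the usual set-index calculation. This is a generic sketch, not code from the simulator; the function names and the two example addresses are hypothetical.

```python
def num_sets(cache_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets in a set-associative cache of fixed total size."""
    return cache_bytes // (line_bytes * ways)


def set_index(address: int, cache_bytes: int, line_bytes: int,
              ways: int) -> int:
    """Set that a byte address maps to."""
    line_number = address // line_bytes
    return line_number % num_sets(cache_bytes, line_bytes, ways)


CACHE_BYTES = 16 * 1024
LINE_BYTES = 32

# Doubling the associativity halves the number of sets.
for ways in (1, 2, 4):
    print(ways, "way(s):", num_sets(CACHE_BYTES, LINE_BYTES, ways), "sets")

# Two hypothetical array base addresses exactly one cache size apart.
a, b = 0x10000, 0x14000
print(set_index(a, CACHE_BYTES, LINE_BYTES, 1),
      set_index(b, CACHE_BYTES, LINE_BYTES, 1))
```

In the direct-mapped configuration both addresses fall into the same set, which holds a single line, so accesses that alternate between the two arrays keep evicting each other. With two ways the addresses still share a set, but the set can hold both lines.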
6.3 Simulation Results of IZ Replacement Policy
From the figures in Appendix B, most of the experiments showed some improvement
when the IAP scheme was implemented together with the new replacement policy IZ.
More importantly, implementing IAP and IZ in a current architecture is not
difficult, as no changes to the processor architecture are necessary;
only some simple additional on-chip cache hardware is required.
• The effects of the IZ scheme can be classified into two groups.
The first group achieves significant performance improvement; the second
group shows slight improvement or nearly coincides with the original
prefetch-on-miss scheme.
For some of the benchmarks, such as spice2g6, espresso, li and nasa7, the
default prefetching scheme seemed to have no impact on the cache performance.
The curves for the IZ scheme and the default scheme almost overlapped with each
other. However, for su2cor, tomcatv and wave5, the IZ scheme helped to reduce
the memory stall time further. This can be explained as follows.
In the spice2g6, espresso, li and nasa7 programs, the reason may be either
(1) that the spatial and temporal localities are weak in these programs, or
(2) that most of the data references with strong temporal locality are made by
LOAD/STORE-UPDATE instructions. The former seems more likely, since
performance degradation would surely be observed if case (2) were true. As a
result, IZ brought no further significant improvement over the combined IAP
scheme in these programs. On the other hand, for su2cor, tomcatv
and wave5, a significant portion of the data references with strong locality
are made by non-LOAD/STORE-UPDATE instructions, and the large number of
prefetched lines generates conflicts and misses in the
cache. The IZ scheme helps to resolve these conflicts and improve the cache
performance.
• As Figures B.1 to B.9 show, the memory stall time reduction achieved by the
IZ scheme ranged from a few percent to over 90%, with an average of about 50%.
These figures show the potential of the IZ scheme. An improvement of this size
in cache performance over a wide range of benchmark programs (instead of some
small routines or kernels such as the Livermore Kernels) is substantial.
Furthermore, this performance improvement can be obtained by modifying only the
on-chip cache hardware; no change to the processor architecture (such as the
instruction set) is required.
6.3.1 Analysis of the IZ Cache Line Replacement Policy
With reference to the figures in Appendix B, some factors affecting the perfor-
mance of IZ could be observed. 
• Instruction Mix of Benchmark Programs 
When a benchmark program, such as compress, contains only a few
LOAD/STORE-UPDATE instructions, the IZ policy brought no improvement.
This is as expected, since the IZ policy mainly applies to the lines
prefetched by the IAP scheme. When there are only a few IAP lines, the effect
of IZ will naturally be small. On the other hand, for benchmark programs with
many LOAD/STORE-UPDATE instructions, IZ contributed a significant improvement
in cache performance when the cache was not large enough to hold all the
demand-fetched and prefetched lines. This is reflected in the simulation
results of su2cor and tomcatv.
As a result, the number of LOAD/STORE-UPDATE instructions affects the
performance of a cache with the IZ policy.
• Cache Size 
For the two benchmark programs su2cor and tomcatv, on which IZ obtained the
greatest improvement among the eight programs, the improvement in cache
performance was clearly larger for small cache sizes than for large ones. The
same behaviour could be observed in the benchmark program wave5. The reason
is that a larger cache has enough space to hold the lines fetched or
prefetched from main memory. Replacement of lines therefore occurs less
frequently, and the effect of IZ cannot be observed. A small cache, however,
cannot hold all the data needed, so replacement of lines occurs frequently.
As a result, IZ can improve the utilization of cache space by displacing the
referenced IAP lines instead of other lines that may possess strong spatial
or temporal locality.
Generally, the number of prefetched lines actually referenced increased for
the cache with the IZ scheme together with the combined IAP. This can be
explained by the fact that some prefetched lines may have to wait a long time
before they are actually referenced; when there is not enough space in the
cache, some of these lines are displaced by the LRU scheme before being
referenced. This wastes cycles, as these lines have a high probability of
being referenced later and may have to be loaded into the cache again. With
the IZ scheme, however, these lines can stay in the cache longer, because the
referenced IAP lines are displaced instead of other prefetched lines that
have a higher potential to be referenced in the future.
IZ showed its worth for small cache sizes, when the cache could not
accommodate all the lines that contain data to be referenced.
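As an illustration only, the effect described above can be sketched in a few
lines of Python. The sketch assumes that IZ differs from LRU in one step: when
a line prefetched by IAP is referenced, it drops to the lowest replacement
priority in its set instead of rising to the highest. This assumption follows
the description above of referenced IAP lines being displaced first; the class
CacheSet, its methods and the line tags are hypothetical names, not part of
the simulator used in this study.

```python
class CacheSet:
    """One cache set (hypothetical model).

    self.order[0] is the next line to be displaced; self.order[-1] has the
    highest replacement priority.
    """

    def __init__(self, ways, use_iz):
        self.ways = ways
        self.use_iz = use_iz
        self.order = []      # line tags, lowest priority first
        self.is_iap = {}     # tag -> True if the line was prefetched by IAP

    def fill(self, tag, iap=False):
        """Bring a line into the set; return the displaced tag, or None."""
        victim = None
        if len(self.order) == self.ways:
            victim = self.order.pop(0)
            del self.is_iap[victim]
        self.order.append(tag)
        self.is_iap[tag] = iap
        return victim

    def reference(self, tag):
        """Update priorities on a hit; return False on a miss."""
        if tag not in self.is_iap:
            return False
        self.order.remove(tag)
        if self.use_iz and self.is_iap[tag]:
            self.order.insert(0, tag)   # assumed IZ step: lowest priority
        else:
            self.order.append(tag)      # ordinary LRU step: highest priority
        return True


def displaced(use_iz):
    """Fill a 4-way set, reference two lines, then force one replacement."""
    s = CacheSet(ways=4, use_iz=use_iz)
    s.fill("A")
    s.fill("B")
    s.fill("P", iap=True)   # P stands for a line prefetched by IAP
    s.fill("C")
    s.reference("A")
    s.reference("P")
    return s.fill("D")


print(displaced(use_iz=False))  # B: plain LRU displaces the least recent line
print(displaced(use_iz=True))   # P: the referenced IAP line goes first
```

Under this assumption the normal line B survives the replacement, which is
the gain in cache utilization that the text attributes to IZ.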
6.4 Simulation Results for Priority Pre-Updating 
with Victim Cache 
For any data reference, the data cache and the victim cache were checked
sequentially. The data cache is checked first to see whether the requested
data is there. If the data is not found in the data cache, a miss occurs and
the cache controller is informed to search the victim cache. If the data is
found in the victim cache, the processor uses one more cycle to fetch the
data from the victim cache. Otherwise, lower-level memory has to be accessed.
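The lookup sequence above can be written as a small Python function. This is
a minimal sketch: the function name, the one-cycle hit time and the 20-cycle
miss penalty are illustrative assumptions, not values taken from the
simulations. Only the one extra cycle for a victim cache hit comes from the
text.

```python
def access_cycles(line_address, data_cache, victim_cache,
                  hit_cycles=1, miss_penalty=20):
    """Cycles for one data reference under sequential lookup.

    data_cache and victim_cache are sets of resident line addresses.
    hit_cycles and miss_penalty are assumed values for illustration.
    """
    if line_address in data_cache:
        return hit_cycles                  # found in the data cache
    if line_address in victim_cache:
        return hit_cycles + 1              # one more cycle for the victim cache
    return hit_cycles + miss_penalty       # served by lower-level memory


data_cache = {0x100, 0x120}
victim_cache = {0x140}
print(access_cycles(0x100, data_cache, victim_cache))  # 1
print(access_cycles(0x140, data_cache, victim_cache))  # 2
print(access_cycles(0x160, data_cache, victim_cache))  # 21
```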
Referring to the figures in Appendix C, the performance improvement in the 
eight benchmark programs can be classified into three categories: 
1. the performance improvement was large (over 20%), 
2. the performance improvement was slight to moderate (1% to 20%), 
3. the improvement was nil or caused a minor degradation. 
6.4.1 PPUVC in Cache with IAP Scheme 
The comparisons in this section are made against the combined IAP. The delay
time reduction figures quoted are based on a 16K bytes cache size, a 32 bytes
line size and 4-way set associativity.
The IAP scheme prefetches with high accuracy, so the number of mispredicted
lines is small. Due to this accuracy, cache misses are greatly reduced. As a
result, prefetch-on-miss lines are few; Figure 6.3 compares the number of
prefetch-on-miss lines in the cache with IAP and in the prefetch-on-miss-only
cache. Moreover, references to prefetch-on-miss lines are few in the cache
model with the IAP scheme. Figure 6.4 shows the number of prefetch-on-miss
lines actually referenced as a percentage of the total number of prefetched
lines. These explain the insignificant improvement of PPU in the IAP scheme.
[Figure 6.3 plot: horizontal bar charts of the number of POM lines for the
POM and IAP caches of each benchmark.]
(a) Varying cache size (b) Varying line size (c) Varying set associativity
Figure 6.3: Comparison of number of prefetch-on-miss lines in IAP cache and 
prefetch-on-miss-only cache 
The programs can be divided into three groups. The first group consists of
su2cor and tomcatv, which could achieve over 20% memory delay time reduction
for the default cache parameters.²
The second group consists of espresso, li and nasa7, each of which obtained a
few percent of performance improvement.
The third group consists of compress and wave5, for which a slight
performance degradation was observed. The classification is based on the
default cache parameters. Although wave5 obtained a fairly large performance
improvement when the cache size was small, it is placed in the third group
for the sake of consistency.
² 16K bytes cache size, 32 bytes line size and 4-way set-associative
[Figure 6.4 plot: bar charts of the percentage of POM lines in the total
number of prefetched lines (0% to 100%) for each benchmark.]
(a) Varying cache size (b) Varying line size (c) Varying set associativity
Figure 6.4: Percentage of prefetch-on-miss lines referenced in total number of 
prefetched lines 
In fact, the effect of PPUVC on IAP lines is not great. The reason is that
the IAP scheme prefetches very accurately, and thus mispredicted lines are
few. Moreover, PPUVC may degrade performance when the time between two
successive references to IAP lines is long. IAP lines that are waiting to be
referenced may then be judged useless and have their priorities pre-updated
by the mechanism, so they risk being discarded before they are actually
referenced. This causes more potential cache misses and increases the number
of memory accesses.
For the first group of programs, the performance improvement was actually
smaller than that of PPUVC in the prefetch-on-miss-only cache when the cache
size was small: PPUVC in IAP obtained only 28% memory delay time reduction,
while PPUVC in the prefetch-on-miss-only cache obtained 35%. The reason may
be the relatively small number of prefetch-on-miss lines in the IAP cache, so
that pre-updating of mispredicted prefetched lines has little effect.
6.4.2 PPUVC in prefetch-on-miss Cache 
From the simulation results, one can see that PPU combined with a victim
cache resulted in better cache performance on average. All programs except
spice2g6, whose two curves nearly coincide, achieved a further reduction in
memory delay compared with the prefetch-on-miss-only cache.
The program compress showed a slight performance degradation compared with
the no-prefetch cache; however, the degradation was smaller than with IZ. As
compress has low spatial locality and few constant-stride references, the
data prefetched by prefetch-on-miss are most likely useless. This wastes
clock cycles and pollutes the cache. Unlike IZ, which only works on IAP
lines, the PPU scheme shortens the lifetime of mispredicted lines, and thus a
better result was observed than with IZ.
The programs can be divided into three groups. The first group, which
consists of su2cor and tomcatv, could achieve up to 49% memory delay time
reduction, about 46% further reduction than prefetch-on-miss. For the default
parameters (16K bytes cache size, 32 bytes line size and 4-way set
associative), su2cor achieved 35.6% memory delay time reduction with the
PPUVC scheme, while prefetch-on-miss obtained only 2.4%. Tomcatv achieved
43.0% memory delay time reduction, while prefetch-on-miss obtained only
15.3%. That is, there were about 33% and 28% further reduction in memory
delay for su2cor and tomcatv respectively.
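The "further reduction" figures quoted above are consistent with a simple
difference in percentage points between the two schemes. The short check
below assumes that reading; the function name is hypothetical and the inputs
are the four percentages quoted in the text.

```python
def further_reduction(with_ppuvc, prefetch_on_miss):
    """Difference, in percentage points, between two delay time reductions."""
    return round(with_ppuvc - prefetch_on_miss, 1)


results = {
    "su2cor": further_reduction(35.6, 2.4),
    "tomcatv": further_reduction(43.0, 15.3),
}
print(results)  # {'su2cor': 33.2, 'tomcatv': 27.7}
```

Rounded to whole numbers these give the 33% and 28% stated above.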
The second group consists of espresso, li and wave5, which achieved 1% to 3%
memory delay time reduction. Note that when PPUVC was implemented in the IAP
scheme, wave5 showed a slight performance degradation (about 1%) for the
cache with default parameters.
The third group consists of compress, nasa7 and spice2g6. The curves of nasa7
and spice2g6 nearly coincide with that of prefetch-on-miss, whereas compress
showed a slight performance degradation compared with the no-prefetch cache.
The Effect of On-chip Cache Size on Victim Cache Performance 
When the cache size is small, few lines are available in the cache. As a
result, thrashing of cache lines occurs frequently, and many prefetched lines
are displaced from the cache before they are referenced. Memory cycles then
have to be spent fetching these lines back into the cache when they are
actually referenced. With the addition of priority pre-updating and a victim
cache, those prefetched lines that have not been referenced for a long time
after being fetched are discarded first, and the miss penalty for erroneously
displacing useful lines is reduced by the small victim cache. This is
supported by the fact that some benchmark programs obtained significant
performance improvement when the cache size was small, but less promising
results when the cache size became larger. The programs that illustrate this
best are tomcatv and wave5. Tomcatv achieved a 50% further reduction with an
8K bytes cache, compared with no improvement when the cache size increased to
32K bytes. Wave5 obtained a 27% further reduction in memory delay, compared
with no improvement for the 32K bytes cache. Figures C.10, C.11 and C.12 show
the memory delay time reduction of the victim cache for varying cache sizes.
The Effect of Line Size of On-chip Cache on Victim Cache Performance 
When the line size is large, the number of lines in the cache is reduced and
thrashing of lines occurs more frequently. With PPUVC, the penalty due to
discarding useful data is minimized. The victim cache can hold a few lines
that are likely to be useful in the future. When these lines are actually
needed, one cycle is needed to fetch them for the processor, compared with
the long miss penalty of fetching from lower-level memories. The program
su2cor obtained 47% further memory delay time reduction with a 64 bytes line
size, while there was no significant improvement for a 4 bytes line size.
With a 4 bytes line size, tomcatv also had no significant further improvement
compared with the prefetch-on-miss-only cache; as the line size grew,
however, the gap between the memory delay time reduction of the
prefetch-on-miss-only cache and that of the cache with PPUVC became greater,
and tomcatv had a 59% further reduction in memory delay for a 64 bytes line
size. The same situation was observed in wave5, which obtained a 14% further
memory delay time reduction. Figures C.13 and C.14 show the results of these
two benchmarks.
The Effect of Set Associativity of On-chip Cache on Victim Cache Performance
Compared with prefetch-on-miss, most programs obtained a larger performance
improvement with the victim cache when the degree of associativity was low.
When the associativity is low, say direct-mapped, thrashing of lines occurs
frequently. In a direct-mapped cache, many memory blocks map to the same
cache line under a mapping algorithm such as bit selection. This may result
in numerous blocks competing for the same cache line while many other cache
lines remain unused. Consequently, utilization of the cache is low, and many
data are discarded before being referenced. The victim cache helps alleviate
this problem. Figures C.16 to C.18 show the results of varying set
associativity. Programs such as espresso, li, nasa7, su2cor, tomcatv and
wave5 achieved a further memory delay reduction ranging from 8% to 46%, while
compress and spice2g6 performed similarly to the original scheme in the
direct-mapped cache.
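The conflict behaviour of bit selection can be illustrated with a short
Python sketch. It assumes the usual form of bit selection, in which the set
index is taken from the address bits just above the line offset; the function
name and the example addresses are made up for illustration, while the cache
and line sizes are the default parameters used in this chapter.

```python
def set_index(address, cache_bytes, line_bytes, ways):
    """Set selected by bit selection (assumed form: block number mod sets)."""
    num_sets = cache_bytes // (line_bytes * ways)
    return (address // line_bytes) % num_sets


CACHE_BYTES = 16 * 1024
LINE_BYTES = 32
addresses = [0x0000, 0x4000, 0x8000]   # blocks exactly one cache size apart

# Direct-mapped: all three blocks compete for the single line of set 0.
print([set_index(a, CACHE_BYTES, LINE_BYTES, ways=1) for a in addresses])

# 4-way set-associative: they still share set 0, but it has four lines.
print([set_index(a, CACHE_BYTES, LINE_BYTES, ways=4) for a in addresses])
```

Both calls print [0, 0, 0]. In the direct-mapped case each new block displaces
the previous one, which is the thrashing that a small victim cache can absorb.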
The results for the victim cache show that the addition of a small amount of
hardware can dramatically improve system performance. It is difficult, if not
impossible, to derive cache algorithms that optimize the performance of every
program and application. An algorithm that works well for most programs
seems more applicable and practical.
6.5 Prefetch Cache 
The default size of the prefetch cache was 1K bytes. Simulations were done on
cache models with prefetch cache sizes ranging from 256 bytes to 4K bytes,
and the results are shown in Figure 6.5. A prefetch cache of 1K bytes is a
reasonable choice, though a larger size seems to perform better for certain
benchmarks. However, a larger size means a larger delay. Different set
associativities were also simulated for the prefetch cache, ranging from
direct-mapped to fully associative.
For any data reference, the prefetch cache and the data cache were checked in
parallel. It is assumed that checking the prefetch cache incurs no extra
cycle delay. We chose a 1K bytes prefetch cache size and a 32 bytes line
size, with both 4-way set-associative and fully-associative organizations, as
the default parameters.
[Figure 6.5 plot: bar charts for a 16 Kbytes cache size, 32 bytes line size,
4-way set-associative data cache, with prefetch cache sizes of 256 bytes,
512 bytes, 1 Kbytes, 2 Kbytes and 4 Kbytes. Horizontal axis: prefetch cache
set associativity (1, 2, 4, fully associative). Panels: a. compress,
b. espresso, c. xlisp, d. nasa7, e. spice, f. su2cor, g. tomcatv, h. wave5.]
Figure 6.5: The effect of Prefetch Cache size on cache performance 
We examined the performance curves further by dividing the eight benchmarks
into four groups: [1] performed extremely well, [2] showed a moderate
performance improvement, [3] showed a slight performance improvement, and
[4] showed no reduction in data access penalty.
Prefetch Cache with FIFO Replacement Policy 
The following observations were made for the fully-associative prefetch
cache. All programs except nasa7 performed better than the basic IAP, and all
performed better than the prefetch-on-miss-only cache. The memory delay time
reduction was up to 99%, as found in tomcatv.
The first group is formed by su2cor and tomcatv, with memory delay time
reduction ranging from 51% to 99% in the fully-associative prefetch cache.
This was 44% to 55% further reduction over the basic IAP. Figure D.1 shows
the delay time reduction of these two programs.
The second group contains only wave5, which achieved 81% memory delay
reduction over no prefetch, a 21% further performance improvement over the
basic IAP. Figure D.2 shows the simulation results of wave5 in terms of its
delay time reduction.
The third group includes espresso, li and spice2g6. The memory delay time
reduction ranged from 33% to 83%, which was 1% to 4% further reduction. Refer
to Figure D.3 for the results of this group of programs.
The fourth group contains compress and nasa7, which obtained no performance
improvement; nasa7 showed a slight performance degradation compared with the
basic IAP. Figure D.4 shows the results.
Prefetch Cache with LRU Replacement Policy 
In the fully-associative prefetch cache, the programs espresso, su2cor,
tomcatv and wave5 showed a larger performance improvement. The performance
improvement of compress, li and spice2g6 was slight, and nasa7 showed a
slight performance degradation. In the fully-associative prefetch cache, all
benchmark programs showed the same pattern and a similar degree of
performance improvement as with the FIFO prefetch cache. Refer to Figures
D.13 to D.16 for the results of the four groups of programs in the LRU
prefetch cache.
(All percentages quoted are obtained using the cache parameters 16K bytes
cache size, 32 bytes line size and 4-way set associative.)
Referring to Table 6.2, only 0.3% of the instructions executed in compress
were LOAD/STORE-UPDATE instructions, so IAP lines were few in this benchmark
program. As a result, the influence and usage of the prefetch cache are
insignificant. A different situation was observed in nasa7, in which up to
43.4% of the instructions executed were LOAD/STORE-UPDATE instructions,
resulting in numerous IAP lines. The prefetch cache may not be large enough
to hold all the prefetched IAP lines, so some unreferenced IAP lines are
thrashed by newly prefetched IAP lines. These thrashing misses degraded the
performance of nasa7. As a result, nasa7 suffered a performance drop in the
prefetch cache with both the FIFO and the LRU replacement policy.
Prefetch Cache with IZ Replacement Policy 
Using the IZ replacement policy in the fully-associative prefetch cache
degraded performance. The degradation was, however, much smaller in the 4-way
prefetch cache. In some programs, IZ in the 4-way prefetch cache could even
achieve performance similar to that of FIFO.
The Effect of Set Associativity on Prefetch Cache Performance
Basically, the 4-way set-associative prefetch cache exhibited performance
similar to that of the fully-associative prefetch cache. However, for su2cor
and tomcatv, the performance improvement was smaller. Tomcatv could only
achieve a maximum of 70% and 81% memory delay time reduction with the LRU and
FIFO replacement policies respectively. The benchmark su2cor also had a
smaller performance improvement. The less promising improvement in the 4-way
set-associative prefetch cache may be because the IAP lines in tomcatv and
su2cor were mapped to a few particular sets, leaving other sets unused, so
that thrashing of lines occurs in the prefetch cache. Utilization of the
prefetch cache is thus lower than in the fully-associative prefetch cache, in
which thrashing misses are minimized or eliminated.
Comparison of the Three Simulated Replacement Policies 
In general, LRU is the better replacement policy for a data cache when its
cost is ignored [Smi82]. However, among all the replacement policies
simulated, FIFO seems to be the most suitable one for the prefetch cache.
Figure 6.6 illustrates why there is a performance difference between the LRU
and FIFO replacement policies.
[Figure 6.6 diagram: the displacement order of the lines in set i under LRU
and FIFO at times T0 to T3. (a) All lines are valid and unreferenced, ordered
from the "youngest" line to the "oldest" line. (b) A reference to a line.
(c) A new prefetched IAP line is brought into set i. (d) The line displaced
under each policy.]
Figure 6.6: An illustration of the performance difference between LRU and FIFO 
In Figure 6.6 (in which referenced lines are shaded), we consider a set i
which is full. All cache lines (line_i1, line_i2, line_i3, line_i4) in set i
are valid and unreferenced at T0. Note that the cache lines are drawn in the
order of their displacement (i.e., their priorities), not in their actual
placement in cache set i. That is, the line with the lowest priority is
placed at the bottom, and the one with the highest priority at the top.
Suppose there is a reference to line_i4, the line with the lowest priority,
at time T1 (Figure 6.6b). Then line_i4 becomes the highest-priority line in
the set under the LRU replacement policy, whereas its priority remains the
same under the FIFO replacement policy. At T2, a new prefetched line is
brought into set i, so a line has to be discarded to accommodate it. As
Figure 6.6d shows, line_i3 is discarded under the LRU replacement policy, but
line_i4 is displaced under the FIFO policy. As mentioned before, the IAP
scheme has very high accuracy, and thus line_i3 is very likely to be
referenced in the future. Moreover, since most IAP lines have the
referenced-once property, line_i4 is most probably useless in the future.
Thus discarding line_i4 instead of line_i3 may improve the system
performance. This illustrates why the LRU replacement policy performed worse
than the FIFO replacement policy.
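The example of Figure 6.6 can be replayed with a few lines of Python. This is
a minimal sketch of textbook LRU and FIFO victim selection for one full set,
not the simulator used in this study; the function name next_victim is
hypothetical, and the line names follow the text.

```python
def next_victim(policy, arrival_order, referenced):
    """Return the line displaced by the next fill of a full set.

    arrival_order lists the lines from oldest to youngest; referenced lists
    the lines referenced since then, in order of reference.
    """
    order = list(arrival_order)          # order[0] is displaced first
    if policy == "LRU":
        for line in referenced:          # a reference makes the line most recent
            order.remove(line)
            order.append(line)
    elif policy != "FIFO":               # FIFO ignores references entirely
        raise ValueError("unknown policy: " + policy)
    return order[0]


lines = ["line_i4", "line_i3", "line_i2", "line_i1"]   # oldest first, as at T0
print(next_victim("LRU", lines, ["line_i4"]))    # line_i3
print(next_victim("FIFO", lines, ["line_i4"]))   # line_i4
```

LRU discards the still unreferenced line_i3, while FIFO discards line_i4,
which has already been referenced once.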
Using the IZ replacement policy in the prefetch cache yielded the worst
performance improvement. When applied to a data cache that mixes IAP lines
and normal cache lines, the IZ replacement policy yielded good results.
However, it is not surprising that IZ performs poorly compared with the other
simulated replacement policies in a prefetch cache that contains only IAP
lines. IAP lines have the reference-once property, and thus once they are
referenced, they are considered useless and can be discarded. In the prefetch
cache all lines are IAP lines, and there exist cases in which some referenced
IAP lines remain in the prefetch cache for the entire execution time of the
program. As an example, consider the situation in Figure 6.7, in which the
oldest line line_i4 may remain in the prefetch cache for the entire execution
time of the program.
Moreover, FIFO is a good choice because it requires the simplest hardware
among the three replacement policies.
[Figure 6.7 diagram: the displacement order of the lines in set i (all lines
valid and unreferenced at first), followed by two references to lines and a
new prefetched IAP line being brought into set i; panels (b), (c) and (d).]
Figure 6.7: An instance of line activities in IZ prefetch cache 
Figures D.13 to D.16 show the results for the four groups of programs under
the simulated replacement policies in the 4-way set-associative and
fully-associative prefetch caches.
The results for the prefetch cache show that the addition of a small amount
of hardware can dramatically improve system performance.
6.6 Chapter Summary 
In this chapter, the performance of IZ, PPUVC and the prefetch cache is
evaluated using cycle-by-cycle simulations of eight SPEC92 benchmarks. For
comparison, the performance of a traditional hardware prefetching scheme,
prefetch-on-miss, is also included. In addition, to show the potential of the
proposed schemes, we include the performance of the combined IAP scheme.
Cache models with varying cache size, line size and associativity are
simulated. Except for a slight performance degradation for the benchmark
program compress, the results show that the three schemes are generally
effective in reducing the data access penalty in almost all the other
benchmark programs tested.
It is observed that IZ, PPUVC and prefetch cache outperform the combined 
IAP for most of the eight benchmark programs. Ranging from a few percent up 
to over 50% of further memory delay time reduction when comparing with the 





The IAP that has been proposed so far requires LOAD/STORE-UPDATE compound
instructions to be defined in the architecture. Machines of the POWER series,
such as the IBM/Motorola/Apple PowerPC and the IBM RS/6000, contain such
instructions. For machines without LOAD/STORE-UPDATE instructions, the IAP
scheme can still be extended easily.
First, some architectures have compound instructions that are functionally
equivalent to the LOAD/STORE-UPDATE instructions defined in the IBM PowerPC
or RS/6000. One example is the LOAD/STORE-MODIFY instruction in HP's
Precision Architecture (PA-RISC) 1.1. The IAP scheme can therefore be
extended to this type of machine without difficulty. Second, for machines
without such compound instructions, the IAP scheme can still be implemented
if an update-counter per register is available. The function of the
update-counter UC is to keep track of its corresponding register R(UC): if
the register is an index register used by some LOAD/STORE instructions with
index-displacement addressing mode in a loop, the value of the stride used by
these LOAD/STORE instructions is learnt during the first iteration of the
loop and stored in the update-counter UC. Consequently, very accurate data
prefetching comparable to the IAP scheme can be carried out
Chapter 1 Architecture Without LOAD-AND-STORE Instructions 
in the remaining iterations of the loop to improve the cache performance. As long 
as the IAP scheme can be implemented, the IZ scheme can also be used. 
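One way to picture the update-counter is the Python sketch below. The text
does not specify how the stride is captured, so the sketch makes an explicit
assumption: the counter sums the updates applied to the index register
between two successive LOAD/STORE uses, and that sum is stored as the stride
once the first iteration has completed. The class UpdateCounter, its methods
and the loop in simulate_loop are all hypothetical.

```python
class UpdateCounter:
    """Hypothetical update-counter UC attached to one index register."""

    def __init__(self):
        self.pending = 0      # updates to the register since its last use
        self.stride = None    # stride, unknown until one iteration completes
        self.used = False     # True once a LOAD/STORE has used the register

    def on_register_update(self, delta):
        """Called when an instruction adds delta to the index register."""
        self.pending += delta

    def on_load_store(self, address):
        """Called on a LOAD/STORE; return the address to prefetch, or None."""
        if self.used and self.stride is None and self.pending != 0:
            self.stride = self.pending     # learnt from the first iteration
        self.used = True
        self.pending = 0
        if self.stride is None:
            return None
        return address + self.stride


def simulate_loop(base, step, iterations):
    """Prefetch addresses issued for a loop that strides through memory."""
    uc = UpdateCounter()
    issued = []
    address = base
    for _ in range(iterations):
        issued.append(uc.on_load_store(address))
        uc.on_register_update(step)        # e.g. an add to the index register
        address += step
    return issued


print(simulate_loop(base=1000, step=8, iterations=4))
# [None, 1016, 1024, 1032]
```

No prefetch is issued in the first iteration; from the second iteration on,
each access prefetches the address one stride ahead.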
Moreover, to show the potential of Priority Pre-Updating scheme, it is possi-
ble to import pre-updating into architectures which employ different prefetching 




In this dissertation, we propose two cache line replacement policies, IZ and
PPUVC, and one placement policy, the prefetch cache. These three policies are
specially designed for prefetched lines.
The first replacement policy, called IZ, is to be implemented with the IAP
scheme and is used to improve data cache performance. From our simulation
study, we found that with this replacement policy built into the combined IAP
scheme, the processor idle time due to memory access can easily be reduced by
over 20%. In some programs, this replacement policy can even achieve a 99%
memory delay time reduction.
In fact, the IZ replacement policy with the IAP scheme has very good
potential to be adopted in current cache designs. The reasons are: [1] no
change in the architecture is required, [2] no new compiler optimization
technique is required, [3] the IZ scheme can work with different kinds of
replacement policies, and [4] IZ with IAP has high potential to improve
system performance. Since the controller design for the IZ replacement
strategy is similar to that for LRU, it is worth implementing because of its
low hardware cost.
The second replacement policy, Priority Pre-Updating, helps to determine
which data should be replaced on a cache miss by shortening the lifetime of
lines suspected of being erroneously prefetched. However, using PPU alone may
not yield a significant improvement. The program data set is usually
Appendix Conclusion 
much larger than the cache size, and thus thrashing of unreferenced useful
lines occurs frequently. Therefore, a small fully-associative victim cache is
added to hold those unreferenced prefetched lines that have been displaced
from the data cache due to capacity or conflict misses. By combining PPU and
the victim cache, it is possible to achieve up to 50% memory delay time
reduction in some of the SPEC92 benchmarks in the cache model with
prefetch-on-miss only. PPUVC can achieve up to 100% memory delay time
reduction in the cache model with the IAP scheme.
The victim cache consists of only four entries, each the same size as a cache
line, and is insignificant in size compared with the data cache. Therefore,
it is worth implementing in current architectures. A more important concern,
perhaps, is the extra hardware logic necessary to search the PPU and update
the cache. With the ever-increasing amount of hardware logic available on a
chip, the quantity involved is not really a serious problem.
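The storage claim can be checked with a short calculation. The sketch below
uses the default parameters quoted in Chapter 6 (32 bytes line size, 16K
bytes data cache) and counts data storage only; tag and control bits are
ignored, which is an assumption of this example.

```python
LINE_BYTES = 32                 # default line size used in the simulations
VICTIM_ENTRIES = 4              # victim cache entries, one cache line each
DATA_CACHE_BYTES = 16 * 1024    # default data cache size

victim_bytes = VICTIM_ENTRIES * LINE_BYTES
overhead = victim_bytes / DATA_CACHE_BYTES

print(f"{victim_bytes} bytes, {overhead:.2%} of the data cache")
# 128 bytes, 0.78% of the data cache
```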
The placement policy, prefetch cache, is used to hold prefetched IAP lines.
Under the assumption that both the data cache and the prefetch cache are checked on
a data reference, it is possible to achieve up to 90% memory delay time reduction.
Hardware costs are now low enough to permit extra hardware for a prefetch
cache; that is, the amount of cache available for data need not be reduced when a
prefetch cache is added. As a result, it is worth implementing a prefetch cache in
current architectures. This technique helps to reduce cache pollution and increases
system performance.
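The lookup path implied by this placement policy can be sketched as follows. This is a simplified model under stated assumptions, not the design evaluated in this work: the class and method names are hypothetical, the data cache is modelled as fully associative with LRU replacement for brevity, the prefetch cache uses FIFO replacement, and moving a line into the data cache on a prefetch-cache hit is an assumption of the example.

```python
from collections import OrderedDict


class PrefetchCacheModel:
    """Data cache plus a separate prefetch cache (hypothetical sketch)."""

    def __init__(self, data_lines=4, prefetch_lines=2):
        self.data_lines = data_lines
        self.prefetch_lines = prefetch_lines
        self.data = OrderedDict()      # oldest entry first (LRU order)
        self.prefetch = OrderedDict()  # oldest entry first (FIFO order)

    def _fill_data(self, tag):
        # Place a demanded line in the data cache, evicting the LRU line.
        if tag in self.data:
            self.data.move_to_end(tag)
            return
        if len(self.data) >= self.data_lines:
            self.data.popitem(last=False)
        self.data[tag] = None

    def prefetch_line(self, tag):
        # Prefetched lines never enter the data cache directly,
        # so a wrong prefetch cannot displace useful data.
        if tag in self.data or tag in self.prefetch:
            return
        if len(self.prefetch) >= self.prefetch_lines:
            self.prefetch.popitem(last=False)
        self.prefetch[tag] = None

    def access(self, tag):
        # Both caches are checked on every data reference.
        if tag in self.data:
            self.data.move_to_end(tag)
            return "data-hit"
        if tag in self.prefetch:
            del self.prefetch[tag]
            self._fill_data(tag)
            return "prefetch-hit"
        self._fill_data(tag)
        return "miss"
```

Because `prefetch_line` only ever touches the prefetch cache, an erroneous prefetch in this model costs a prefetch-cache entry but leaves the data cache contents intact, which is the pollution-avoidance effect described above.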
Whether or not the proposed mechanisms would be practical to implement in
hardware is not really addressed in our experimentation, but the indications are
that the difficulties would be minor. Had some of the more complicated
heuristics proven to be extremely useful, their introduction into a hardware
scheme could have proven challenging.
Appendix A 
CPI Due to Cache Misses 
A.1 Varying Cache Size 
A.1.1 Instant Zero Replacement Policy 
[Plots: eight panels (a. compress, b. espresso, c. li, d. nasa7, e. spice2g6, f. su2cor, g. tomcatv, h. wave5), each showing MCPI against cache size in Kbytes (8, 16, 32) for four schemes: IAP with IZ, combined IAP, prefetch-on-miss, and no prefetch.]
Figure A.1: MCPI by varying cache size in IZ scheme 
A.1.2 Priority Pre-Updating with Victim Cache 
[Plots: eight panels (a. compress, b. espresso, c. li, d. nasa7, e. spice2g6, f. su2cor, g. tomcatv, h. wave5), each showing MCPI against cache size in Kbytes (8, 16, 32) for four schemes: PPU and FIFO, combined IAP, prefetch-on-miss, and no prefetch.]
Figure A.2: MCPI by varying cache size in PPUVC with IAP scheme 
[Plots: eight panels (a. compress, b. espresso, c. li, d. nasa7, e. spice2g6, f. su2cor, g. tomcatv, h. wave5), each showing MCPI against cache size in Kbytes (8, 16, 32) for three schemes: PPU and FIFO, prefetch-on-miss, and no prefetch.]
Figure A.3: MCPI by varying cache size in PPUVC with prefetch-on-miss scheme 
A.1.3 Prefetch Cache 
[Plots: eight panels (a. compress, b. espresso, c. li, d. nasa7, e. spice2g6, f. su2cor, g. tomcatv, h. wave5), each showing MCPI against cache size in Kbytes (8, 16, 32) for four schemes: fully-associative prefetch cache with FIFO, combined IAP, prefetch-on-miss, and no prefetch.]
Figure A.4: MCPI by varying cache size in prefetch cache scheme 
A.2 Varying Cache Line Size 
A.2.1 Instant Zero Replacement Policy 
[Plots: eight panels (a. compress, b. espresso, c. li, d. nasa7, e. spice2g6, f. su2cor, g. tomcatv, h. wave5), each showing MCPI against line size in bytes (4, 8, 16, 32, 64) for four schemes: IAP with IZ, combined IAP, prefetch-on-miss, and no prefetch.]
Figure A.5: MCPI by varying cache line size in IZ scheme 
A.2.2 Priority Pre-Updating with Victim Cache 
[Plots: eight panels (a. compress, b. espresso, c. li, d. nasa7, e. spice2g6, f. su2cor, g. tomcatv, h. wave5), each showing MCPI against line size in bytes (4, 8, 16, 32, 64) for four schemes: PPU and FIFO, combined IAP, prefetch-on-miss, and no prefetch.]
Figure A.6: MCPI by varying cache line size in PPUVC with IAP scheme 
[Plots: eight panels (a. compress, b. espresso, c. li, d. nasa7, e. spice2g6, f. su2cor, g. tomcatv, h. wave5), each showing MCPI against line size in bytes (4, 8, 16, 32, 64) for three schemes: PPU and FIFO, prefetch-on-miss, and no prefetch.]
Figure A.7: MCPI by varying cache line size in PPUVC with prefetch-on-miss 
scheme 
A.2.3 Prefetch Cache 
[Plots: eight panels (a. compress, b. espresso, c. li, d. nasa7, e. spice2g6, f. su2cor, g. tomcatv, h. wave5), each showing MCPI against line size in bytes (4, 8, 16, 32, 64) for four schemes: fully-associative prefetch cache with FIFO, combined IAP, prefetch-on-miss, and no prefetch.]
Figure A.8: MCPI by varying cache line size in prefetch cache scheme 
A.3 Varying Cache Set Associative 
A.3.1 Instant Zero Replacement Policy 
[Plots: eight panels (a. compress, b. espresso, c. li, d. nasa7, e. spice2g6, f. su2cor, g. tomcatv, h. wave5), each showing MCPI against set associativity (1, 2, 4) for four schemes: IAP with IZ, combined IAP, prefetch-on-miss, and no prefetch.]
Figure A.9: MCPI by varying set associative in IZ scheme 
A.3.2 Priority Pre-Updating with Victim Cache 
[Plots: eight panels (a. compress, b. espresso, c. li, d. nasa7, e. spice2g6, f. su2cor, g. tomcatv, h. wave5), each showing MCPI against set associativity (1, 2, 4) for four schemes: PPU and FIFO, combined IAP, prefetch-on-miss, and no prefetch.]
Figure A.10: MCPI by varying set associative in PPUVC with IAP scheme 
[Plots: eight panels (a. compress, b. espresso, c. li, d. nasa7, e. spice2g6, f. su2cor, g. tomcatv, h. wave5), each showing MCPI against set associativity (1, 2, 4) for three schemes: PPU and FIFO, prefetch-on-miss, and no prefetch.]
Figure A.11: MCPI by varying set associative in PPUVC with prefetch-on-miss 
scheme 
A.3.3 Prefetch Cache 
[Plots: eight panels (a. compress, b. espresso, c. li, d. nasa7, e. spice2g6, f. su2cor, g. tomcatv, h. wave5), each showing MCPI against set associativity (1, 2, 4) for four schemes: fully-associative prefetch cache with FIFO, combined IAP, prefetch-on-miss, and no prefetch.]
Figure A.12: MCPI by varying set associative in prefetch cache scheme 
Appendix B 
Simulation Results of IZ 
Replacement Policy 
B.1 Memory Delay Time Reduction 
B.1.1 Varying Cache Size 
[Plots: three panels ((a) su2cor, (b) tomcatv, (c) wave5), each showing memory delay time reduction (%) against cache size in Kbytes (8, 16, 32) for three schemes: IAP with IZ, combined IAP, and prefetch-on-miss.]
Figure B.1: Results of the first group programs in IZ 
[Plots: three panels ((a) espresso, (b) li, (c) nasa7), each showing memory delay time reduction (%) against cache size in Kbytes (8, 16, 32) for three schemes: IAP with IZ, combined IAP, and prefetch-on-miss.]
Figure B.2: Results of the second group programs in IZ 
[Plots: two panels ((a) compress, (b) spice2g6), each showing memory delay time reduction (%) against cache size in Kbytes (8, 16, 32) for three schemes: IAP with IZ, combined IAP, and prefetch-on-miss.]
Figure B.3: Results of the third group programs in IZ 
B.1.2 Varying Cache Line Size 
[Plots: three panels ((a) su2cor, (b) tomcatv, (c) wave5), each showing memory delay time reduction (%) against line size in bytes (4, 8, 16, 32, 64) for three schemes: IAP with IZ, combined IAP, and prefetch-on-miss.]
Figure B.4: Results of the first group programs in IZ 
[Plots: three panels ((a) espresso, (b) li, (c) nasa7), each showing memory delay time reduction (%) against line size in bytes (4, 8, 16, 32, 64) for three schemes: IAP with IZ, combined IAP, and prefetch-on-miss.]
Figure B.5: Results of the second group programs in IZ 
[Plots: two panels ((a) compress, (b) spice2g6), each showing memory delay time reduction (%) against line size in bytes (4, 8, 16, 32, 64) for three schemes: IAP with IZ, combined IAP, and prefetch-on-miss.]
Figure B.6: Results of the third group programs in IZ 
B.1.3 Varying Cache Set Associative 
[Plots: three panels ((a) su2cor, (b) tomcatv, (c) wave5), each showing memory delay time reduction (%) against set associativity (1, 2, 4) for three schemes: IAP with IZ, combined IAP, and prefetch-on-miss.]
Figure B.7: Results of the first group programs in IZ 
[Plots: three panels ((a) espresso, (b) li, (c) nasa7), each showing memory delay time reduction (%) against set associativity (1, 2, 4) for three schemes: IAP with IZ, combined IAP, and prefetch-on-miss.]
Figure B.8: Results of the second group programs in IZ 
[Plots: two panels ((a) compress, (b) spice2g6), each showing memory delay time reduction (%) against set associativity (1, 2, 4) for three schemes: IAP with IZ, combined IAP, and prefetch-on-miss.]
Figure B.9: Results of the third group programs in IZ 
Appendix C 
Simulation Results of Priority 
Pre-Updating with Victim Cache 
C.1 PPUVC in IAP Scheme 
C.1.1 Memory Delay Time Reduction 
Varying Cache Size 
[Plots: two panels ((a) su2cor, (b) tomcatv), each showing memory delay time reduction (%) against cache size in Kbytes (8, 16, 32) for three schemes: PPU and FIFO, combined IAP, and prefetch-on-miss.]
Figure C.1: Results of the first group programs in PPUVC 
[Plots: three panels ((a) espresso, (b) li, (c) nasa7), each showing memory delay time reduction (%) against cache size in Kbytes (8, 16, 32) for three schemes: PPU and FIFO, combined IAP, and prefetch-on-miss.]
Figure C.2: Results of the second group programs in PPUVC 
[Plots: three panels ((a) compress, (b) spice2g6, (c) wave5), each showing memory delay time reduction (%) against cache size in Kbytes (8, 16, 32) for three schemes: PPU and FIFO, combined IAP, and prefetch-on-miss.]
Figure C.3: Results of the third group programs in PPUVC 
Varying Cache Line Size 
[Plots: two panels ((a) su2cor, (b) tomcatv), each showing memory delay time reduction (%) against line size in bytes (4, 8, 16, 32, 64) for three schemes: PPU and FIFO, combined IAP, and prefetch-on-miss.]
Figure C.4: Varying line size 
[Plots: three panels ((a) espresso, (b) li, (c) nasa7), each showing memory delay time reduction (%) against line size in bytes (4, 8, 16, 32, 64) for three schemes: PPU and FIFO, combined IAP, and prefetch-on-miss.]
Figure C.5: Results of the second group programs in PPUVC 
[Plots not reproduced. Panels: (a) compress, (b) spice2g6, (c) wave5. X-axis: line size in bytes. Series: PPU and FIFO; combined IAP; prefetch-on-miss.]
Figure C.6: Results of the third group programs in PPUVC
Varying Cache Set Associativity
[Plots not reproduced. Panels: (a) su2cor, (b) tomcatv. X-axis: set associativity. Series: PPU and FIFO; combined IAP; prefetch-on-miss.]
Figure C.7: Varying set associativity
[Plots not reproduced. Panels: (a) espresso, (b) li, (c) nasa7. X-axis: set associativity. Series: PPU and FIFO; combined IAP; prefetch-on-miss.]
Figure C.8: Results of the second group programs in PPUVC
[Plots not reproduced. Panels: (a) compress, (b) spice2g6, (c) wave5. X-axis: set associativity. Series: PPU and FIFO; combined IAP; prefetch-on-miss.]
Figure C.9: Results of the third group programs in PPUVC
C.2 PPUVC in Cache with Prefetch-On-Miss Only 
C.2.1 Memory Delay Time Reduction 
Varying Cache Size 
[Plots not reproduced. Panels: (a) su2cor, (b) tomcatv. X-axis: cache size in Kbytes. Series: PPU and FIFO; prefetch-on-miss.]
Figure C.10: Results of the first group programs in PPUVC
[Plots not reproduced. Panels: (a) espresso, (b) li, (c) wave5. X-axis: cache size in Kbytes. Series: PPU and FIFO; prefetch-on-miss.]
Figure C.11: Results of the second group programs in PPUVC
[Plots not reproduced. Panels: (a) compress, (b) nasa7, (c) spice2g6. X-axis: cache size in Kbytes. Series: PPU and FIFO; prefetch-on-miss.]
Figure C.12: Results of the third group programs in PPUVC
Varying Cache Line Size 
[Plots not reproduced. Panels: (a) su2cor, (b) tomcatv. X-axis: line size in bytes. Series: PPU and FIFO; prefetch-on-miss.]
Figure C.13: Results of the first group programs in PPUVC
[Plots not reproduced. Panels: (a) espresso, (b) li, (c) wave5. X-axis: line size in bytes. Series: PPU and FIFO; prefetch-on-miss.]
Figure C.14: Results of the second group programs in PPUVC
[Plots not reproduced. Panels: (a) compress, (b) nasa7, (c) spice2g6. X-axis: line size in bytes. Series: PPU and FIFO; prefetch-on-miss.]
Figure C.15: Results of the third group programs in PPUVC
Varying Cache Set Associativity
[Plots not reproduced. Panels: (a) su2cor, (b) tomcatv. X-axis: set associativity. Series: PPU and FIFO; prefetch-on-miss.]
Figure C.16: Results of the first group programs in PPUVC
[Plots not reproduced. Panels: (a) espresso, (b) li, (c) wave5. X-axis: set associativity. Series: PPU and FIFO; prefetch-on-miss.]
Figure C.17: Results of the second group programs in PPUVC
[Plots not reproduced. Panels: (a) compress, (b) nasa7, (c) spice2g6. X-axis: set associativity. Series: PPU and FIFO; prefetch-on-miss.]
Figure C.18: Results of the third group programs in PPUVC
Appendix D 
Simulation Results of Prefetch Cache
D.1 Memory Delay Time Reduction 
D.1.1 Varying Cache Size 
[Plots not reproduced. Panels: a. su2cor, b. tomcatv. X-axis: cache size in Kbytes. Series: Fully-assoc. with FIFO; combined IAP; prefetch-on-miss.]
Figure D.1: Results of the first group programs
[Plot not reproduced. Panel: a. wave5. X-axis: cache size in Kbytes. Series: Fully-assoc. with FIFO; combined IAP; prefetch-on-miss.]
Figure D.2: Results of the second group programs
[Plots not reproduced. Panels: a. espresso, b. li, c. spice2g6. X-axis: cache size in Kbytes. Series: Fully-assoc. with FIFO; combined IAP; prefetch-on-miss.]
Figure D.3: Results of the third group programs
[Plots not reproduced. Panels: a. compress, b. nasa7. X-axis: cache size in Kbytes. Series: Fully-assoc. with FIFO; combined IAP; prefetch-on-miss.]
Figure D.4: Results of the fourth group programs
D.1.2 Varying Cache Line Size 
[Plots not reproduced. Panels: a. su2cor, b. tomcatv. X-axis: line size in bytes. Series: Fully-assoc. with FIFO; combined IAP; prefetch-on-miss.]
Figure D.5: Results of the first group programs
[Plot not reproduced. Panel: a. wave5. X-axis: line size in bytes. Series: Fully-assoc. with FIFO; combined IAP; prefetch-on-miss.]
Figure D.6: Results of the second group programs
[Plots not reproduced. Panels: a. espresso, b. li, c. spice2g6. X-axis: line size in bytes. Series: Fully-assoc. with FIFO; combined IAP; prefetch-on-miss.]
Figure D.7: Results of the third group programs
[Plots not reproduced. Panels: a. compress, b. nasa7. X-axis: line size in bytes. Series: Fully-assoc. with FIFO; combined IAP; prefetch-on-miss.]
Figure D.8: Results of the fourth group programs
D.1.3 Varying Cache Set Associativity
[Plots not reproduced. Panels: a. su2cor, b. tomcatv. X-axis: set associativity. Series: Fully-assoc. with FIFO; combined IAP; prefetch-on-miss.]
Figure D.9: Results of the first group programs
[Plot not reproduced. Panel: a. wave5. X-axis: set associativity. Series: Fully-assoc. with FIFO; combined IAP; prefetch-on-miss.]
Figure D.10: Results of the second group programs
[Plots not reproduced. Panels: a. espresso, b. li, c. spice2g6. X-axis: set associativity. Series: Fully-assoc. with FIFO; combined IAP; prefetch-on-miss.]
Figure D.11: Results of the third group programs
[Plots not reproduced. Panels: a. compress, b. nasa7. X-axis: set associativity. Series: Fully-assoc. with FIFO; combined IAP; prefetch-on-miss.]
Figure D.12: Results of the fourth group programs
D.2 Results of the Three Replacement Policies 
D.2.1 Varying Cache Size 
[Plots not reproduced. Plot title: 1K bytes prefetch cache size, 32 bytes line size. Panels: (a) su2cor, (b) tomcatv. X-axis: cache size in Kbytes. Series: 4-way with LRU; 4-way with IZ; 4-way with FIFO; Fully-assoc. with LRU; Fully-assoc. with IZ; Fully-assoc. with FIFO.]
Figure D.13: Results of the first group programs
[Plot not reproduced. Plot title: 1K bytes prefetch cache size, 32 bytes line size. Panel: (a) wave5. X-axis: cache size in Kbytes. Series: 4-way with LRU; 4-way with IZ; 4-way with FIFO; Fully-assoc. with LRU; Fully-assoc. with IZ; Fully-assoc. with FIFO.]
Figure D.14: Results of the second group programs
[Plots not reproduced. Plot title: 1K bytes prefetch cache size, 32 bytes line size. Panels: (a) espresso, (b) li, (c) spice2g6. X-axis: cache size in Kbytes. Series: 4-way with LRU; 4-way with IZ; 4-way with FIFO; Fully-assoc. with LRU; Fully-assoc. with IZ; Fully-assoc. with FIFO.]
Figure D.15: Results of the third group programs
[Plots not reproduced. Plot title: 1K bytes prefetch cache size, 32 bytes line size. Panels: (a) compress, (b) nasa7. X-axis: cache size in Kbytes. Series: 4-way with LRU; 4-way with IZ; 4-way with FIFO; Fully-assoc. with LRU; Fully-assoc. with IZ; Fully-assoc. with FIFO.]
Figure D.16: Results of the fourth group programs
D.2.2 Varying Cache Line Size 
[Plots not reproduced. Plot title: 1K bytes prefetch cache size, 32 bytes line size. Panels: (a) su2cor, (b) tomcatv. X-axis: line size in bytes. Series: 4-way with LRU; 4-way with IZ; 4-way with FIFO; Fully-assoc. with LRU; Fully-assoc. with IZ; Fully-assoc. with FIFO.]
Figure D.17: Results of the first group programs
[Plot not reproduced. Plot title: 1K bytes prefetch cache size, 32 bytes line size. Panel: (a) wave5. X-axis: line size in bytes. Series: 4-way with LRU; 4-way with IZ; 4-way with FIFO; Fully-assoc. with LRU; Fully-assoc. with IZ; Fully-assoc. with FIFO.]
Figure D.18: Results of the second group programs
[Plots not reproduced. Plot title: 1K bytes prefetch cache size, 32 bytes line size. Panels: (a) espresso, (b) li, (c) spice2g6. X-axis: line size in bytes. Series: 4-way with LRU; 4-way with IZ; 4-way with FIFO; Fully-assoc. with LRU; Fully-assoc. with IZ; Fully-assoc. with FIFO.]
Figure D.19: Results of the third group programs
[Plots not reproduced. Plot title: 1K bytes prefetch cache size, 32 bytes line size. Panels: (a) compress, (b) nasa7. X-axis: line size in bytes. Series: 4-way with LRU; 4-way with IZ; 4-way with FIFO; Fully-assoc. with LRU; Fully-assoc. with IZ; Fully-assoc. with FIFO.]
Figure D.20: Results of the fourth group programs
D.2.3 Varying Cache Set Associativity
[Plots not reproduced. Plot title: 1K bytes prefetch cache size, 32 bytes line size. Panels: (a) su2cor, (b) tomcatv. X-axis: set associativity. Series: 4-way with LRU; 4-way with IZ; 4-way with FIFO; Fully-assoc. with LRU; Fully-assoc. with IZ; Fully-assoc. with FIFO.]
Figure D.21: Results of the first group programs
[Plot not reproduced. Plot title: 1K bytes prefetch cache size, 32 bytes line size. Panel: (a) wave5. X-axis: set associativity. Series: 4-way with LRU; 4-way with IZ; 4-way with FIFO; Fully-assoc. with LRU; Fully-assoc. with IZ; Fully-assoc. with FIFO.]
Figure D.22: Results of the second group programs
[Plots not reproduced. Plot title: 1K bytes prefetch cache size, 32 bytes line size. Panels: (a) espresso, (b) li, (c) spice2g6. X-axis: set associativity. Series: 4-way with LRU; 4-way with IZ; 4-way with FIFO; Fully-assoc. with LRU; Fully-assoc. with IZ; Fully-assoc. with FIFO.]
Figure D.23: Results of the third group programs
[Plots not reproduced. Plot title: 1K bytes prefetch cache size, 32 bytes line size. Panels: (a) compress, (b) nasa7. X-axis: set associativity. Series: 4-way with LRU; 4-way with IZ; 4-way with FIFO; Fully-assoc. with LRU; Fully-assoc. with IZ; Fully-assoc. with FIFO.]
Figure D.24: Results of the fourth group programs
Bibliography 
[BaW89] Baer, J.L., Wang, W.H., "Multi-level cache hierarchies: Organizations,
protocols and performance," Journal of Parallel and Distributed Computing,
Volume 6, Number 3, 1989, pp.451-476.
[BaC91] Baer, J.L., Chen, T.F., "An effective on-chip preloading scheme to reduce
data access penalty," Proceedings of the 1991 International Conference on
Supercomputing, 1991, pp.176-186.
[Bre87] Brent, G.A., "Using program structure to achieve prefetching for cache
memories," Ph.D Thesis, University of Illinois at Urbana-Champaign,
January 1987.
[CaK91] Callahan, D., Kennedy, K., Porterfield, A., "Software prefetching,"
Proceedings of the Fourth Symposium on Architectural Support for Programming
Languages and Operating Systems, April 1991, pp.40-52.
[ChB92] Chen, T.F., Baer, J.L., "Reducing memory latency via non-blocking and
prefetching caches," Proceedings of the Fifth International Conference on
Architectural Support for Programming Languages and Operating Systems,
Boston, MA, October 1992, pp.51-61.
[ChB94] Chen, T.F., Baer, J.L., "A performance study of software and hardware
data prefetching schemes," 21st Annual International Symposium on
Computer Architecture, 1994, pp.223-232.
[ChM91] Chen, W.Y., Mahlke, S.A., Chang, P.P., Hwu, W.W., "Data access
microarchitectures for superscalar processors with compiler-assisted data
prefetching," Proceedings of Microcomputing 24, 1991.
[Chi94] Chiueh, T.C., "Sunder: A programmable hardware prefetch architecture
for numerical loops," Proceedings of the 1994 ACM SIGMETRICS Conference
on Measurements and Modeling of Computer Systems, May 1994,
pp.128-137.
[FuP91] Fu, W.C., Patel, J.H., "Data prefetching in multiprocessor vector cache
memories," Proceedings of the 18th Annual Symposium on Computer
Architecture, May 1991, pp.54-63.
[FuP92] Fu, W.C., Patel, J.H., "Stride directed prefetching in scalar processors,"
Proceedings of the 25th International Symposium on Microarchitecture, 1992,
pp.102-110.
[GoG90] Gornish, E., Granston, E., Veidenbaum, A., "Compiler-directed data
prefetching in multiprocessor with memory hierarchies," Proceedings of the
1990 International Conference on SuperComputing, 1990, pp.354-368.
[HeP95] Hennessy, J., Patterson, D., Computer Architecture: A Quantitative
Approach, Morgan Kaufmann, 1995.
[HP94] Hewlett-Packard, Inc., PA-RISC 1.1 Architecture and Instruction Set
Reference Manual, HP Part Number 09740-90039, third Edition, February
1994.
[IBM89] IBM, AIX V3.2 for RISC Systems/6000: Assembler Language Reference,
SC23-2197-01, 1989.
[IBM94] IBM, The PowerPC Architecture, edited by May, C., Silha, E., Simpson,
R., Warren, H., Morgan Kaufmann, 1994.
[Jou90] Jouppi, N.P., "Improving direct-mapped cache performance by the
addition of a small fully-associative cache and prefetch buffers," Proceedings of the
18th Annual Symposium on Computer Architecture, May 1990, pp.364-373.
[KlL91] Klaiber, A.C., Levy, H.M., "An architecture for software-controlled
data prefetching," Proceedings of the 18th Annual International Symposium
on Computer Architecture, 1991, pp.43-53.
[Kro81] Kroft, D., "Lockup-free instruction fetch/prefetch cache organization,"
8th Annual International Symposium on Computer Architecture, IEEE
Computer Society Press, 1981, pp.81-87.
[Lau96] Lau, S.C., "Improving on-chip data cache performance using instruction
register information," Master Thesis, Department of Computer Science and
Engineering, the Chinese University of Hong Kong, June 1996.
[Lee87] Lee, R.L., "The effectiveness of caches and data prefetch buffers in
large-scale memory multiprocessors," Ph.D Thesis, Department of Computer
Science, University of Illinois at Urbana-Champaign, May 1987.
[MoG91] Mowry, T.C., Gupta, A., "Tolerating latency through software-controlled
prefetching in shared-memory multiprocessor," Journal of Parallel and
Distributed Computing, Volume 1, Number 2, June 1991, pp.87-106.
[MoL92] Mowry, T.C., Lam, M.S., Gupta, A., "Design and evaluation of a
compiler algorithm for prefetching," Proceedings of the Fifth International
Conference on Architectural Support for Programming Languages and Operating
Systems, Boston, MA, October 1992, pp.62-73.
[Mot92] Motorola Inc., PowerPC601 RISC Microprocessor User's Manual,
Publication Number MPC601UM/AD, 1992.
[Por89] Porterfield, A.K., "Software methods for improvement of cache
performance on supercomputer applications," Technical Report COMP TR 89-93,
Rice University, May 1989.
[Smi78a] Smith, A.J., "Sequentiality and prefetching in database systems," ACM
Transactions on Database Systems, Volume 3, Number 3, 1978, pp.223-247.
[Smi78b] Smith, A.J., "Sequential program prefetching in memory hierarchies,"
IEEE Computer, Volume 11, Number 12, December 1978, pp.7-21.
[Smi82] Smith, A.J., "Cache memories," ACM Computing Surveys, Volume 14,
Number 3, September 1982, pp.473-530.
[SzY97] Sze, S.C., Young, G.H., "Accurate data prefetching with intelligent
replacement policy," Proceedings of the International Conference on Imaging
Science, Systems, and Technology, Las Vegas, Nevada, USA, July 1997,
pp.74-77.
[Tha81] Thabit, K.D., "Cache management by the computer," Ph.D Thesis, Rice
University, November 1981.
[WeS94] Weiss, S., Smith, J.E., POWER and PowerPC, Morgan Kaufmann,
1994.
[YoS98] Young, G.H., Sze, S.C., Lau, S.C., "An effective placement policy in cache
for prefetched lines," to appear in ISORA'98, Kunming, China, August 1998.